| |
 |
|
| Science Forum Index » Compression Forum » LZ77 Idea |
|
Page 1 of 1 |
|
| Author |
Message |
| zeroone |
Posted: Wed Jun 29, 2005 9:19 am |
|
|
|
Guest
|
Hi everyone,
at the moment, I am working on an LZ77 (more specific: LZSS) compressor.
Usually a constant amount of bits are used for codewords, consisting of
offset and length of matches.
Since working bytewise is easier and often also faster, I chose 3 Byte
(24 bits) codewords to be the best choice for storing offset and length.
My idea (I don't know if someone has already used this):
Instead using always 3 byte codewords I also include 2 and even 1 byte
codewords. So if no 4 byte match is found (minimum match length when
using 3 byte codewords, to save at least 1 byte) the compressor looks
for 3 byte matches to be encoded using 2 bytes.
The first bit of the codewords is used as a flag to indicate, if the
codeword has 2 or 3 bytes length. Of course, matchoffset und matchlength
lengths suffer from less bits, but at least there is a chance to
compress also 3 byte matches. Up to here everything works fine
(theoretically, since I did not implement this algo by now).
What if also no 3 byte matches are found? Usually the compressor has to
output a literal now, BUT what if we search for 2 byte matches to be
encoded using 1 byte codewords. Of cource, again, offset and length is
very limited, but small compression is better that no compression at all ;)
My problem: I don't know how to indicate 3 different codeword lengths.
One flagbit can only handle two different states. Using a second flagbit
would hurt offset and length (especially when using 1 and 2 byte
codewords) very bad.
So does anyone have any idea how to solve this? Any help is appreciated.
Big thanks,
zeroone |
|
|
| Back to top |
|
|
|
| Jim Leonard |
Posted: Wed Jun 29, 2005 1:03 pm |
|
|
|
Guest
|
zeroone wrote:
[quote:443373d449]So does anyone have any idea how to solve this? Any help is appreciated.
[/quote:443373d449]
A few random thoughts spring to mind:
- Add a codeword so that the extra bit isn't "wasted". Does your
sliding window support sizes greater than 16777216? If so, consider a
32-bit code.
- Limit yourself to two codeword types (short and long) so that you
don't need an extra bit (my 8088 compressor does this, but also because
my sliding window is short enough to support it; yours may not).
- Use Golomb codes so that your codewords are variable (slows things
down as you no longer have byte-wise codes)
Hope this helps... |
|
|
| Back to top |
|
|
|
| zeroone |
Posted: Wed Jun 29, 2005 4:43 pm |
|
|
|
Guest
|
Jim Leonard wrote:
[quote:4e146001bc]zeroone wrote:
So does anyone have any idea how to solve this? Any help is appreciated.
A few random thoughts spring to mind:
- Add a codeword so that the extra bit isn't "wasted". Does your
sliding window support sizes greater than 16777216? If so, consider a
32-bit code.
- Limit yourself to two codeword types (short and long) so that you
don't need an extra bit (my 8088 compressor does this, but also because
my sliding window is short enough to support it; yours may not).
- Use Golomb codes so that your codewords are variable (slows things
down as you no longer have byte-wise codes)
Hope this helps...
[/quote:4e146001bc]
Thank you for your reply.
As I want to do everything byte-wise I won't use variable length codes.
What do you mean with 32-bit codes? 32 bits long codewords? Spending
more than 24 bits for codewords does not make sense in my compression,
since offset and length can perfectly be stored within only 3 bytes (of
course considering a max. 65k sliding window).
Using only two codeword types is possible without any problems.
But if possible I'll try to include all three types. Coding even 2 byte
matches would achieve great compression. That's the reason I am asking
for help, maybe there is a solution for this issue. If I am on the wrong
track I have to implement the two type version..
Hopefully anyone has some advise or just 'thoughts'  |
|
|
| Back to top |
|
|
|
| Jim Leonard |
Posted: Wed Jun 29, 2005 6:00 pm |
|
|
|
Guest
|
zeroone wrote:
[quote:bea185d0d8]What do you mean with 32-bit codes? 32 bits long codewords? Spending
more than 24 bits for codewords does not make sense in my compression,
since offset and length can perfectly be stored within only 3 bytes (of
course considering a max. 65k sliding window).
[/quote:bea185d0d8]
Exactly. If you increase the size of your window, you can expand to a
4th 32-bit code. |
|
|
| Back to top |
|
|
|
| Jacek Wawrzaszek |
Posted: Wed Jun 29, 2005 7:07 pm |
|
|
|
Guest
|
"zeroone" wrote:
[quote:f6ce08eb16]
As I want to do everything byte-wise I won't use variable length codes.
[/quote:f6ce08eb16]
You can use variable length code to encode type of codeword and still
do it byte-wise:
0 -> 7-bit codeword
10 -> 14-bit codeword
11 -> 22-bit codeword
[quote:f6ce08eb16]
Using only two codeword types is possible without any problems.
But if possible I'll try to include all three types. Coding even 2 byte
matches would achieve great compression. That's the reason I am asking
for help, maybe there is a solution for this issue. If I am on the wrong
track I have to implement the two type version..
[/quote:f6ce08eb16]
If you have many 2 byte matches, you might try to use all 7 bits of the
shortest codeword as offset and encode with it only 2 byte strings.
J. |
|
|
| Back to top |
|
|
|
| zeroone |
Posted: Thu Jun 30, 2005 3:32 am |
|
|
|
Guest
|
Jacek Wawrzaszek wrote:
[quote:b4bb9e9d70]"zeroone" wrote:
As I want to do everything byte-wise I won't use variable length codes.
You can use variable length code to encode type of codeword and still do
it byte-wise:
0 -> 7-bit codeword
10 -> 14-bit codeword
11 -> 22-bit codeword
[/quote:b4bb9e9d70]
That seems to be a great solution. Although I loose some bits all three
types can be used in only one function. Thank you!
Now I am wondering, if it would be better to split it up into two
distinct functions:
The first encodes big matches down to 3 bytes:
1 -> 23-bit codeword (offset+length) if the match is > 3 bytes
0 -> 15-bit codeword (*only*offset) if matches are exactly 3 bytes long.
Length is not needed any more, because this type encodes only 3 byte
matches.
The second function encodes only matches of 2 bytes, since all bigger
matches have been previously encoded:
8-bit codeword (*only* offset). Again length isn't needed, because it is
constant.
How does this idea sound? Would it achieve good compression? Are there
still any improvements?
Up to here I favor the first method (one big function), since the second
short-match function has to compensate the flagbytes that are stored in
each 8th byte. So there must be at least one 2-byte-match each 8th byte.
Big thanks,
zeroone |
|
|
| Back to top |
|
|
|
| cr88192 |
Posted: Mon Jul 04, 2005 11:25 pm |
|
|
|
Guest
|
"cr88192" <cr88192@NOSPAM.hotmail.com> wrote in message
news:c5043$42c556c7$ca801623$8012@saipan.com...
[quote:03264379ea]
"zeroone" <noone@no-spam.de> wrote in message
news:da3bcp$i9$02$1@news.t-online.com...
cr88192 wrote:
[/quote:03264379ea]
<snip>
[quote:03264379ea]
At the moment I favor either using 1 or 2 bits of the codewords
theirselves to support all three types, or using just 2 types (long and
short) and -again- using no additional flagbytes, but the codewords
theirselves to store the type (in this case their first bit).
yes, one either needs a smaller length or offset in this case.
3.12, likely too poor to be useful.
4.11, maybe ok. 5.10, maybe better, maybe worse.
24 bit case, one either has 7.16 or 8.15, so either a run limit of 128, or
a window limit of 32kB.
I really don't know what's better..
well, trial and error would probably be needed to determine this.
[/quote:03264379ea]
I finally got around to it. in case anyone cares, here is what I found from
testing:
2 flagbits hurts compression more than using a single bit in the run;
3.12 and 5.10 lead to worse compression, 4.11 gave the best results;
7.16 is better than 8.15 in my tests.
results are pretty close to those of my 'lzbss' algo, just without the whole
codespace mangling thing (allowing a simpler and possibly faster decoder).
rfc3261: lzbss 195kB, this 199kB, gzip 174kB.
calagary corpus (as a tarball): lzbss 1.21MB, this 1.23MB, gzip 1.03MB.
I guess this is probably reasonable for a bytewise lz77 variant.
so? I don't know...
encoding (for now at least):
window is "flat", offsets are relative to the current position,
offsets>65536-256 are not allowed (so that using a circular buffer for
decode is possible).
flagbits are packed 8 at a time into a byte going from low to high.
flagbits 0=run, 1=literal.
0: 16 bit run
MSB=0, low 7=high bits of relative offset
high 4=low bits of offset, low 4=length-2
1: 24 bit run
MSB=1, low 7=length
high 8 bits of offset
low 8 bits of offset
may add some sort of file-header or something (I just change algos too damn
often...).
or whatever... |
|
|
| Back to top |
|
|
|
|
|
All times are GMT - 5 Hours
The time now is Sat Nov 21, 2009 1:28 am
|
|