Main Page | Report this Page
 
   
Science Forum Index  »  Compression Forum  »  DEFLATE block size algorithms
Page 1 of 1    
Author Message
Guest
Posted: Tue Feb 26, 2008 3:38 am
What method is currently used in zlib and 7-Zip to determine the size
of each data block in the DEFLATE stream?

I ran a quick test on two gzipped tar files, one of which was
compressed with gzip -9 and one of which was compressed with 7za -
tgzip -mx=9. In both cases, the block size was not uniform throughout
the file. For 7-Zip, the size ranged from under 4K to nearly 16K. For
gzip, it ranged from under 6K to over 60K. In general, 7-Zip used
smaller blocks and seems to get better compression.

Block splitting could be especially important in PNG files.
Photographs and scans stored losslessly in PNG format often cannot
make good use of LZ77 compression, because patterns in real-world
images seldom repeat *precisely*. In many cases (such as the standard
"Lenna" and "Mandrill" test images), using only Huffman compression
actually gets a better ratio. Therefore, it becomes important to
optimize the Huffman tables through intelligent block division.

I have thought of one possible method that could be implemented,
though I haven't actually reduced it to code yet. The DEFLATE
algorithm would first find all LZ77 matches in the file, if desired.
(As noted above, PNG photos often do better skipping this step.) After
the optional LZ77 phase, Huffman compression would be done as follows.

First, the algorithm would create blocks of a small fixed size, maybe
256 bytes. For each block, it would try all 3 methods (stored, fixed
Huffman, custom Huffman) and record the best option, as well as the
compressed size in bits.

Secondly, the compressor would try to concatentate together each
consecutive set of two blocks. This concatentation would only be done
if the combined block was smaller, in bits, than the sum of its two
parts. Again, all three methods would be tried for each potential
block. It would try blocks 0+1, 1+2, 2+3, 3+4, and so on.

Then, the compressor would go back and do step 2 again on the result.
It would keep combining blocks until one of two things happened:
either the whole file was one big block (unlikely that this would be
optimal) or it went through step 2 and found no more efficient
combinations. At this point, it would be finished and the resulting
file could be written out.

Has anyone else already done this, or something similar? Are there any
flaws I'm overlooking as to why this wouldn't work or would not be
effective?
Einstein
Posted: Tue Feb 26, 2008 6:00 am
Guest
On Feb 26, 5:38 am, jdgrii8...@ngcsu.edu wrote:
Quote:
What method is currently used in zlib and 7-Zip to determine the size
of each data block in the DEFLATE stream?

I ran a quick test on two gzipped tar files, one of which was
compressed with gzip -9 and one of which was compressed with 7za -
tgzip -mx=9. In both cases, the block size was not uniform throughout
the file. For 7-Zip, the size ranged from under 4K to nearly 16K. For
gzip, it ranged from under 6K to over 60K. In general, 7-Zip used
smaller blocks and seems to get better compression.

Block splitting could be especially important in PNG files.
Photographs and scans stored losslessly in PNG format often cannot
make good use of LZ77 compression, because patterns in real-world
images seldom repeat *precisely*. In many cases (such as the standard
"Lenna" and "Mandrill" test images), using only Huffman compression
actually gets a better ratio. Therefore, it becomes important to
optimize the Huffman tables through intelligent block division.

I have thought of one possible method that could be implemented,
though I haven't actually reduced it to code yet. The DEFLATE
algorithm would first find all LZ77 matches in the file, if desired.
(As noted above, PNG photos often do better skipping this step.) After
the optional LZ77 phase, Huffman compression would be done as follows.

First, the algorithm would create blocks of a small fixed size, maybe
256 bytes. For each block, it would try all 3 methods (stored, fixed
Huffman, custom Huffman) and record the best option, as well as the
compressed size in bits.

Secondly, the compressor would try to concatentate together each
consecutive set of two blocks. This concatentation would only be done
if the combined block was smaller, in bits, than the sum of its two
parts. Again, all three methods would be tried for each potential
block. It would try blocks 0+1, 1+2, 2+3, 3+4, and so on.

Then, the compressor would go back and do step 2 again on the result.
It would keep combining blocks until one of two things happened:
either the whole file was one big block (unlikely that this would be
optimal) or it went through step 2 and found no more efficient
combinations. At this point, it would be finished and the resulting
file could be written out.

Has anyone else already done this, or something similar? Are there any
flaws I'm overlooking as to why this wouldn't work or would not be
effective?


Yes some blocks are more compressible at larger sizes, and some at
smaller. I have looked at a sequence that broke a file into say X
number of blocks, then it tried to run compression upon each

Those that compress go normally, those that do not compress get put
aside. Next sequential combining of two blocks happens, and another
attempt at compression. Ultimately I kept combining blocks until they
were all compressed, or the whole would not compress. Hmmm, this
actually is a good plan and memory, why I am confusing myself is out
of my thoughts now, instead I can, and will process smaller chunks of
the whole.
Mark Adler
Posted: Tue Feb 26, 2008 7:19 am
Guest
On Feb 26, 5:38 am, jdgrii8...@ngcsu.edu wrote:
Quote:
What method is currently used in zlib and 7-Zip to determine the size
of each data block in the DEFLATE stream?

zlib simply has a fixed number of symbols per block, defined by the
memLevel parameter. The default is 16K symbols. (A symbol is either
a literal or a length/distance pair.) The current version of zlib
does not attempt to adjust the block size as a function of the data.
Note that gzip is different, and did/does attempt to intelligently
vary the block size, but the algorithm used turned out to have little
benefit, and so zlib went to the fixed block size approach.

In general if the statistics of the data are relatively static, larger
blocks are better since the code description overhead is amortized
over more symbols. If the statistics of the data are more dynamic,
smaller blocks are better, since the codes can adapt more readily to
the changing nature of the data. Obviously there is a balance to be
struck there that is highly data dependent.

If the user happens to know that the data to compress is about to
change in character, then you can tell zlib to emit a block and start
a new one by flushing (Z_SYNC_FLUSH).

Your approach to combine small adjacent blocks into larger ones when
beneficial is a good one, but will be a little time consuming.
Especially if you're only using Huffman coding, in which case most of
the time will be spent fiddling with blocks. It would be interesting
to see the gain achievable and the cost in execution time for image
data.

Mark
Errol Smith
Posted: Thu Feb 28, 2008 11:58 pm
Guest
jdgrii8338@ngcsu.edu wrote:
Quote:
Has anyone else already done this, or something similar? Are there any
flaws I'm overlooking as to why this wouldn't work or would not be
effective?

I've always wanted to try a successive division system, where you start
with 1 block (the whole file), then 2 sections then 3 sections etc until
you start to lose compression ratio rather than gain it in each block.
At that point you start moving the endpoints by binary chop until you
get the optimal set of blocks.
I think.
Smile
 
Page 1 of 1       All times are GMT - 5 Hours
The time now is Tue Oct 07, 2008 3:01 am