counter-intuitive gzip compression ratio...

Author Message
Buxte Hude...
Posted: Mon Nov 02, 2009 10:28 pm
Guest
hello all,

I have an interesting case of counter-intuitive packing ratios with gzip.
I would be interested to know whether anyone else has observed this or
has an idea about its origin - or whether it is just a random
epiphenomenon.

I have this 4.8 GB hdd with a single 340 MB partition (for historical and
emulation reasons - don't ask). That 340 MB partition was filled up to
about 155 MB. For archival purposes, I copied the partition to a file
and also the entire hdd to another file.

Intending to compress both files, and knowing that lots of zeroes
compress better :-), before capturing the data I first mounted the
partition and filled all available space with a zero-filled file, which
I then deleted. I also zero-filled the unpartitioned space by creating a
partition in the available space, zeroing the device, and deleting the
partition.

I then dumped both the single partition and the entire disk, using dd,
to a 64-bit filesystem of course. So I got my 340 MB file, and my 4.8 GB
file, which contains the partition table, the contents of the 340 MB
file, and then 4.5 GB of zeroes. I then gzipped both files.

Surprise: the 4.8 GB file compresses down to 77 MB, whereas the 340 MB
file compresses to 125 MB. Also, lzma and xz get the 340 MB file down to
about 90 MB.

The compressed 4.8 GB image is correct: inflating it back and taking an
md5sum gives the same md5sum as the original hdd.
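
The round-trip check described above could be scripted roughly as follows. This is only a sketch: the function names and any file paths passed to them are hypothetical, and it streams the data in chunks so that a multi-gigabyte image does not have to fit in memory.

```python
import gzip
import hashlib


def md5_of_stream(fileobj, chunk_size=1 << 20):
    """Return the hex MD5 digest of everything readable from fileobj."""
    digest = hashlib.md5()
    # Read in fixed-size chunks until read() returns an empty bytes object.
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def gzip_matches_original(gz_path, original_path):
    """True if inflating gz_path yields the same MD5 as original_path."""
    with gzip.open(gz_path, "rb") as inflated, open(original_path, "rb") as original:
        return md5_of_stream(inflated) == md5_of_stream(original)
```

Matching digests only show that the compressed file inflates back to the image it was made from; they say nothing about why it is as small as it is.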

How is that possible? How can adding vast amounts of low-entropy crap
improve a compression ratio??
 
Tom St Denis...
Posted: Tue Nov 03, 2009 2:29 am
Guest
It's late and I might be misunderstanding: is the 340MB file from a
partition that was resized, zeroed, then resized back to 340? Or did
you

1. copy the 340MB out
2. then inside the 4.8G disk resize/zero/resize

Because if you copied it out first, maybe the slack space on the 340
isn't zeroed?

Tom
 
Mark Adler...
Posted: Tue Nov 03, 2009 6:15 am
Guest
On 2009-11-02 14:28:46 -0800, Buxte Hude <buxtehude at (no spam) buxtehude.no> said:
Quote:
So there I got my 340 MB file, and
my 4.8 GB file that has inside the partition table, the contents of the
340 MB file and then 4.5 GB of zeroes. I then gzipped both files.

Surprise: the 4.8 GB file compresses down to 77 MB, whereas the 340 MB
file compresses to 125 MB.
....
How is that possible?

It's not. gzip compression is relatively local, so it effectively
doesn't know anything about all those zeros far away. Adding the 4.5
GB of zeros should have added about 4 MB to the compressed file.
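
The "about 4 MB" figure can be sanity-checked with a small experiment. The sketch below assumes Python's zlib module (the same deflate algorithm gzip uses) at level 6; the 8 MiB sample size and the name zero_ratio are arbitrary choices for illustration, and the result is extrapolated linearly to 4.5 GB.

```python
import zlib


def zero_ratio(n_bytes, level=6):
    """Compression ratio deflate achieves on n_bytes of zeros."""
    compressed = zlib.compress(bytes(n_bytes), level)
    return n_bytes / len(compressed)


# Measure on a sample, then extrapolate to the 4.5 GB of zeros in the image.
ratio = zero_ratio(8 * 1024 * 1024)
estimated_mb = 4.5e9 / ratio / 1e6
print(f"ratio on zeros: about {ratio:.0f}:1")
print(f"estimated cost of 4.5 GB of zeros: about {estimated_mb:.1f} MB")
```

Deflate cannot do much better than roughly 1000:1 even on pure zeros, because each match covers at most 258 bytes and still costs a couple of bits, so the zeros can only add to the output size.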

Have you verified your assertion, that the 340 MB file is exactly
contained in the 4.8 GB file?
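
One way to check this would be a chunked byte-for-byte comparison of the partition image against the disk image at the partition's starting offset. The sketch below uses hypothetical names throughout, and the offset is an assumption that has to be read from the partition table; on older MBR disks the first partition often starts at sector 63, i.e. byte 63 * 512 = 32256.

```python
def image_contains_at(disk_path, part_path, offset, chunk_size=1 << 20):
    """True if the bytes of part_path appear in disk_path starting at offset."""
    with open(disk_path, "rb") as disk, open(part_path, "rb") as part:
        disk.seek(offset)
        while True:
            expected = part.read(chunk_size)
            if not expected:
                # Reached the end of the partition image without a mismatch.
                return True
            # A short read from the disk image also counts as a mismatch.
            if disk.read(len(expected)) != expected:
                return False
```

If this returns False for the correct offset, the two dumps do not hold the same partition data, which would fit the suspicion that the slack space was zeroed in one capture but not the other.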

Mark
 
Niels Fröhling...
Posted: Wed Nov 04, 2009 6:15 am
Guest
Buxte Hude wrote:

Quote:
How is that possible? How adding vast amounts of low-entropy crap can
improve a compression ratio??

gzip moves a context-window over your data. If you have a large enough
zero-area, the entire context will be filled with zeroes (in practice this is
almost like a context-flush). If your defragmenter had something to say on your
old disk, we may assume that you got some related contents clustered and
separated from other content by voids.
As such, the compressor won't try to apply learned statistics from cluster-1 to
cluster-2 and so on (because the statistics get flushed/cleared by the zero areas),
whereas the compacted partition-data causes the compressor to apply a much
denser and probably much less related statistical model.
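
The window effect itself is easy to observe in isolation. The sketch below assumes Python's zlib module and deflate's 32 KB match window; the block size and variable names are arbitrary. It compresses a random block followed by a copy of itself, once back to back and once with a zero gap wider than the window in between.

```python
import random
import zlib

WINDOW = 32 * 1024  # deflate's maximum match distance


def compressed_size(data):
    """Size in bytes of data after deflate at level 6."""
    return len(zlib.compress(data, 6))


# A 16 KB block of pseudo-random (incompressible) bytes, seeded for repeatability.
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(16 * 1024))

# Second copy sits inside the window: it is encoded as matches to the first.
near = compressed_size(block + block)
# A zero gap of two window lengths pushes the first copy out of reach.
far = compressed_size(block + bytes(2 * WINDOW) + block)

print(f"copy within window:        {near} bytes")
print(f"copy beyond a zero gap:    {far} bytes")
```

This only shows that a long zero run hides earlier data from the compressor; it does not show that hiding earlier data makes later data compress better.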

You can try to verify this by running a file/directory-based compressor
on the partition's content (like rar with the "solid" option enabled). It should
compress even better.

Ciao
Niels
 
jules Gilbert...
Posted: Sat Nov 14, 2009 3:13 am
Guest
gzip has some unusual properties -- well, unusual for a compressor.
For instance, it produces many more ones than zeros. And that's
just plain wrong.

--jg
 
Mark Adler...
Posted: Sat Nov 14, 2009 12:59 pm
Guest
On 2009-11-13 19:13:42 -0800, jules Gilbert <jules.stocks at (no spam) gmail.com> said:
Quote:
gzip has some unusual properties -- well, unusual for a compressor.
For instance, it produces many more ones than zeros.

An interesting assertion. However it's wrong. From a large gzip file I got:

140008610 ones out of 280097016 bits
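
A count like this could be reproduced with a few lines of Python. This is a sketch, not the tool used for the figure above; the function names are made up for illustration, and a lookup table keeps it usable on Python versions without int.bit_count().

```python
# Number of set bits for every possible byte value.
ONES_IN_BYTE = [bin(value).count("1") for value in range(256)]


def count_ones(data):
    """Number of 1 bits in a bytes-like object."""
    return sum(ONES_IN_BYTE[byte] for byte in data)


def count_ones_in_file(path, chunk_size=1 << 20):
    """Return (ones, total_bits) for the file at path, read in chunks."""
    ones = 0
    total_bits = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            ones += count_ones(chunk)
            total_bits += 8 * len(chunk)
    return ones, total_bits
```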

That one was the other way around, with more zeros than ones. But not
many more. Considering a random distribution, one standard deviation
from exactly one-half is about the square root of the expected number
of ones, which is ~12,000. So the number of ones is about 3.4 sigma
low from what is expected.
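
The arithmetic above can be checked directly. The sketch below uses the two counts quoted in this post and computes the deviation with the square-root-of-expected-ones estimate used above. It also computes, as an assumption about the intended model (independent fair bits), the binomial standard deviation sqrt(n)/2, which is smaller by a factor of sqrt(2) and so gives a larger sigma count.

```python
import math

total_bits = 280_097_016
ones = 140_008_610

expected = total_bits / 2      # expected number of ones for fair bits
deficit = expected - ones      # how far below one-half the count is

# Estimate used in the post: sigma ~ sqrt(expected number of ones).
sigma_sqrt_expected = math.sqrt(expected)
z_post = deficit / sigma_sqrt_expected

# Binomial model with p = 1/2: sigma = sqrt(n * p * (1 - p)) = sqrt(n) / 2.
sigma_binomial = math.sqrt(total_bits) / 2
z_binomial = deficit / sigma_binomial

print(f"deficit: {deficit:.0f} ones")
print(f"sqrt(expected) estimate: sigma ~ {sigma_sqrt_expected:.0f}, z ~ {z_post:.1f}")
print(f"binomial estimate:       sigma ~ {sigma_binomial:.0f}, z ~ {z_binomial:.1f}")
```

Either way the count is on the low side, and a gzip stream is not a sequence of independent fair bits (headers and Huffman code descriptors are structured), so neither figure should be over-interpreted.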

3.4 sigma does seem a little unlikely, so there might be a very slight
tendency to produce more ones than zeros, which might be the case in
the Huffman code descriptors at the start of each dynamic block.

Or it might just be chance.

Mark
 
Mark Adler...
Posted: Sat Nov 14, 2009 1:00 pm
Guest
On 2009-11-13 23:59:42 -0800, Mark Adler <madler at (no spam) alumni.caltech.edu> said:
Quote:
3.4 sigma does seem a little unlikely, so there might be a very slight
tendency to produce more ones than zeros,

Oops -- I meant more zeros than ones.

Mark
 
stan...
Posted: Sun Nov 15, 2009 3:24 am
Guest
jules Gilbert wrote:
Quote:
On Nov 2, 10:34 pm, Mark Adler <mad... at (no spam) alumni.caltech.edu> wrote:
On 2009-11-02 14:28:46 -0800, Buxte Hude <buxteh... at (no spam) buxtehude.no> said:

gzip has some unusual properties -- well, unusual for a compressor.
For instance, it produces many more ones than zeros. And that's
just plain wrong.

gzip is actual code I can examine and use. Where is yours? You would
have much more credibility if you weren't the perpetual RSN guy. You
are well aware that the reaction to you around here is at best
sceptical and runs to outright total disbelief. Your thoughts about
proven code we all use daily can't realistically be considered by
rational minds.
 
 
All times are GMT