counter-intuitive gzip compression ratio...

Buxte Hude...
Posted: Mon Nov 02, 2009 10:28 pm
hello all,
I have an interesting case of counter-intuitive packing ratios with gzip
that I would be interested in knowing if anyone else has observed or
would have an idea about the origin - or if it is just a random
epiphenomenon.
I have this 4.8 GB hdd with a single 340 MB partition (for historical and
emulation reasons - don't ask). That 340 MB partition was filled up to
about 155 MB. For archival purposes, I copied the partition to a file
and also the entire hdd to another file.
Intending to compress both files and knowing that lots of zeroes
compress better, before capturing the data, I had first mounted the
partition in order to fill all available space with a zero-filled file
that I then deleted, and also zero-filled the unpartitioned space by
creating a partition in the available space, zeroing the device, and
deleting the partition.
So I then dumped both the single partition and the entire disk, using
dd, to a 64-bit filesystem of course. So there I got my 340 MB file, and
my 4.8 GB file that has inside the partition table, the contents of the
340 MB file and then 4.5 GB of zeroes. I then gzipped both files.
Surprise: the 4.8 GB file compresses down to 77 MB, whereas the 340 MB
file compresses to 125 MB. Also lzma and xz get the 340 MB file down to
about 90 MB.
The 75 MB file is correct, as evidenced by inflating it back and taking
an md5sum and comparing to the md5sum of the original hdd.
How is that possible? How can adding vast amounts of low-entropy crap
improve a compression ratio??
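The claim that the disk image holds the partition image byte for byte can be checked directly rather than assumed. The following is a minimal sketch, not the poster's actual procedure: the function name `image_contains_at` and the file names are hypothetical, and the byte offset of the partition inside the disk image is assumed to be known (for example, read from the partition table).

```python
import os
import tempfile

CHUNK = 1 << 20  # compare 1 MiB at a time


def image_contains_at(disk_path, part_path, offset):
    """Return True if the bytes of part_path appear in disk_path at offset."""
    with open(disk_path, "rb") as disk, open(part_path, "rb") as part:
        disk.seek(offset)
        while True:
            a = part.read(CHUNK)
            if not a:
                return True  # reached the end of the partition image
            b = disk.read(len(a))
            if a != b:
                return False  # mismatch, or the disk image ended early


# Synthetic demo: a fake "disk" made of a 512-byte header, the payload,
# and some trailing zeros. Real images would be opened the same way.
with tempfile.TemporaryDirectory() as tmp:
    part_img = os.path.join(tmp, "part.img")
    disk_img = os.path.join(tmp, "disk.img")
    payload = os.urandom(CHUNK + 123)
    with open(part_img, "wb") as f:
        f.write(payload)
    with open(disk_img, "wb") as f:
        f.write(bytes(512) + payload + bytes(4096))
    found_at_offset = image_contains_at(disk_img, part_img, 512)
    found_at_zero = image_contains_at(disk_img, part_img, 0)

print(found_at_offset, found_at_zero)
```

If the check fails on the real images, the two dumps were not taken from the same on-disk state, which would make the size comparison meaningless.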

Tom St Denis...
Posted: Tue Nov 03, 2009 2:29 am
On Nov 2, 5:28 pm, Buxte Hude <buxteh... at (no spam) buxtehude.no> wrote:
Quote: hello all,
I have an interesting case of counter-intuitive packing ratios with gzip
that I would be interested in knowing if anyone else has observed or
would have an idea about the origin - or if it is just a random
epiphenomenon.
I have this 4.8 GB hdd with a single 340 MB partition (for historical and
emulation reasons - don't ask). That 340 MB partition was filled up to
about 155 MB. For archival purposes, I copied the partition to a file
and also the entire hdd to another file.
Intending to compress both files and knowing that lots of zeroes
compress better, before capturing the data, I had first mounted the
partition in order to fill all available space with a zero-filled file
that I then deleted, and also zero-filled the unpartitioned space by
creating a partition in the available space, zeroing the device, and
deleting the partition.
So I then dumped both the single partition and the entire disk, using
dd, to a 64-bit filesystem of course. So there I got my 340 MB file, and
my 4.8 GB file that has inside the partition table, the contents of the
340 MB file and then 4.5 GB of zeroes. I then gzipped both files.
Surprise: the 4.8 GB file compresses down to 77 MB, whereas the 340 MB
file compresses to 125 MB. Also lzma and xz get the 340 MB file down to
about 90 MB.
The 75 MB file is correct, as evidenced by inflating it back and taking
an md5sum and comparing to the md5sum of the original hdd.
How is that possible? How can adding vast amounts of low-entropy crap
improve a compression ratio??
It's late and I might misunderstand: is the 340MB file from a
partition that was resized, zeroed, then resized back to 340? Or did
you
1. copy the 340MB out
2. then inside the 4.8G disk resize/zero/resize
Because if you copied it out first, maybe the slack space on the 340
isn't zeroed?
Tom
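One way to answer the slack-space question is to measure how much of each image is actually zero-filled. This is a rough sketch under stated assumptions: `zero_block_fraction` is a hypothetical helper, and the 4096-byte block size is an arbitrary choice, not something taken from the thread. With about 155 MB used out of 340 MB, a properly zeroed partition image would be expected to show very roughly half of its blocks as all zeros.

```python
import os
import tempfile


def zero_block_fraction(path, block_size=4096):
    """Return the fraction of block_size blocks in the file that are all zeros."""
    zero = bytes(block_size)
    total = 0
    zeros = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            total += 1
            # The final block may be short, so compare against a slice.
            if block == zero[:len(block)]:
                zeros += 1
    return zeros / total if total else 0.0


# Synthetic demo: three zero blocks followed by one non-zero block.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.img")
    with open(demo_path, "wb") as f:
        f.write(bytes(3 * 4096) + b"\x01" * 4096)
    demo_fraction = zero_block_fraction(demo_path)

print(demo_fraction)  # 0.75
```

Running it on both the partition dump and the matching region of the disk dump would show whether one of them was captured before the zero fill.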

Mark Adler...
Posted: Tue Nov 03, 2009 6:15 am
On 2009-11-02 14:28:46 -0800, Buxte Hude <buxtehude at (no spam) buxtehude.no> said:
Quote: So there I got my 340 MB file, and
my 4.8 GB file that has inside the partition table, the contents of the
340 MB file and then 4.5 GB of zeroes. I then gzipped both files.
Surprise: the 4.8 GB file compresses down to 77 MB, whereas the 340 MB
file compresses to 125 MB.
....
How is that possible?
It's not. gzip compression is relatively local, so it effectively
doesn't know anything about all those zeros far away. Adding the 4.5
GB of zeros should have added about 4 MB to the compressed file.
Have you verified your assertion, that the 340 MB file is exactly
contained in the 4.8 GB file?
Mark
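The "about 4 MB" figure can be sanity-checked by compressing a long run of zeros and scaling up. The sketch below uses Python's `zlib` module, which wraps the same deflate algorithm as gzip but with a slightly different header and trailer, so the byte counts are approximate. The helper name `compressed_size_of_zeros`, the 16 MiB sample size, and the reading of "4.5 GB" as 4.5e9 bytes are assumptions made for illustration.

```python
import zlib


def compressed_size_of_zeros(n_bytes, level=6, chunk=1 << 20):
    """Deflate n_bytes of zeros in a streaming fashion; return the output size."""
    comp = zlib.compressobj(level)
    zeros = bytes(chunk)
    out = 0
    remaining = n_bytes
    while remaining > 0:
        piece = zeros[:min(chunk, remaining)]
        out += len(comp.compress(piece))
        remaining -= len(piece)
    out += len(comp.flush())
    return out


sample = 16 * 1024 * 1024                     # 16 MiB of zeros
size = compressed_size_of_zeros(sample)
ratio = sample / size                         # deflate tops out near 1000:1
estimate_mb = 4.5e9 / ratio / 1e6             # scale up to 4.5 GB of zeros

print(f"ratio ~ {ratio:.0f}:1, 4.5 GB of zeros ~ {estimate_mb:.1f} MB compressed")
```

Since deflate cannot do much better than roughly 1000:1 on any input, appended zeros can only add a few megabytes; they cannot shrink the part of the output that encodes the partition data.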

Niels Fröhling...
Posted: Wed Nov 04, 2009 6:15 am
Buxte Hude wrote:
Quote: How is that possible? How can adding vast amounts of low-entropy crap
improve a compression ratio??
gzip moves a context window over your data. If you have a large enough
zero area, the entire context will be filled with zeroes (in practice this is
almost like a context flush). If your defragmenter had something to say on your
old disk, we may assume that you got some related contents clustered and
separated from other content by voids.
As such, the compressor won't try to apply learned statistics from cluster-1 to
cluster-2 and so on (because the statistics get flushed/cleared by the zero areas),
while the compacted partition data causes the compressor to apply a much
denser and probably much less related statistical model.
You can try to verify this by running a file/directory-based compressor
on the partition's content (like rar with the "solid" option enabled). It should
compress even better.
Ciao
Niels
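The reach of the window can be demonstrated on synthetic data. Deflate can only refer back at most 32 KiB, so a repeated block is found when the copy lies within that distance and missed when it does not. This sketch neither confirms nor refutes the clustering explanation above; it only illustrates the locality. The helper name `repeated_block_size` and the block sizes are choices made for the example.

```python
import random
import zlib


def repeated_block_size(block_len, seed=0):
    """Compress a pseudo-random block followed by an exact copy of itself."""
    rng = random.Random(seed)
    block = bytes(rng.getrandbits(8) for _ in range(block_len))
    return len(zlib.compress(block + block, 9))


# Copy starts 16 KiB back: inside the 32 KiB window, so it is matched.
near = repeated_block_size(16 * 1024)
# Copy starts 64 KiB back: outside the window, so it is stored all over again.
far = repeated_block_size(64 * 1024)

print(f"16 KiB block twice: {2 * 16 * 1024} -> {near} bytes")
print(f"64 KiB block twice: {2 * 64 * 1024} -> {far} bytes")
```

The first case comes out at roughly half the input size, the second at roughly the full input size, even though both inputs are "one block, repeated".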

jules Gilbert...
Posted: Sat Nov 14, 2009 3:13 am
On Nov 2, 10:34 pm, Mark Adler <mad... at (no spam) alumni.caltech.edu> wrote:
Quote: On 2009-11-02 14:28:46 -0800, Buxte Hude <buxteh... at (no spam) buxtehude.no> said:
So there I got my 340 MB file, and
my 4.8 GB file that has inside the partition table, the contents of the
340 MB file and then 4.5 GB of zeroes. I then gzipped both files.
Surprise: the 4.8 GB file compresses down to 77 MB, whereas the 340 MB
file compresses to 125 MB.
...
How is that possible?
It's not. gzip compression is relatively local, so it effectively
doesn't know anything about all those zeros far away. Adding the 4.5
GB of zeros should have added about 4 MB to the compressed file.
Have you verified your assertion, that the 340 MB file is exactly
contained in the 4.8 GB file?
Mark
gzip has some unusual properties -- well, unusual for a compressor.
For instance, it produces many more ones than zeros. And that's
just plain wrong.
--jg

Mark Adler...
Posted: Sat Nov 14, 2009 12:59 pm
On 2009-11-13 19:13:42 -0800, jules Gilbert <jules.stocks at (no spam) gmail.com> said:
Quote: gzip has some unusual properties -- well, unusual for a compressor.
For instance, it produces many more ones than zeros.
An interesting assertion. However it's wrong. From a large gzip file I got:
140008610 ones out of 280097016 bits
That one was the other way around, with more zeros than ones. But not
many more. Considering a random distribution, one standard deviation
from exactly one-half is about the square root of the expected number
of ones, which is ~12,000. So the number of ones is about 3.4 sigma
low from what is expected.
3.4 sigma does seem a little unlikely, so there might be a very slight
tendency to produce more ones than zeros, which might be the case in
the Huffman code descriptors at the start of each dynamic block.
Or it might just be chance.
Mark
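The arithmetic can be reproduced from the two counts given in the post. One caveat, offered as a note rather than a correction of the conclusion: if each bit is modelled as an independent fair coin, the count of ones is binomial and its standard deviation is sqrt(n)/2, about 8,400 here, which is smaller than the ~12,000 obtained from the square root of the expected number of ones. Under that model the deficit is closer to 4.8 sigma than 3.4. The variable names below are made up for the example.

```python
import math

ones = 140_008_610   # count of one bits reported in the post
bits = 280_097_016   # total bits reported in the post

expected = bits / 2
deficit = expected - ones                 # how far below one-half the count is

sigma_post = math.sqrt(expected)          # figure used in the post (~12,000)
sigma_binomial = math.sqrt(bits) / 2      # sqrt(n * p * (1 - p)) with p = 1/2

z_post = deficit / sigma_post
z_binomial = deficit / sigma_binomial

print(f"deficit = {deficit:.0f} ones")
print(f"post's sigma     = {sigma_post:.0f}, z = {z_post:.2f}")
print(f"binomial sigma   = {sigma_binomial:.0f}, z = {z_binomial:.2f}")
```

Either way the imbalance is tiny in relative terms (a few hundredths of a percent), and bits in a real gzip stream are not independent coin flips, so neither z value should be over-interpreted.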

Mark Adler...
Posted: Sat Nov 14, 2009 1:00 pm
On 2009-11-13 23:59:42 -0800, Mark Adler <madler at (no spam) alumni.caltech.edu> said:
Quote: 3.4 sigma does seem a little unlikely, so there might be a very slight
tendency to produce more ones than zeros,
Oops -- I meant more zeros than ones.
Mark

stan...
Posted: Sun Nov 15, 2009 3:24 am
jules Gilbert wrote:
Quote: On Nov 2, 10:34 pm, Mark Adler <mad... at (no spam) alumni.caltech.edu> wrote:
On 2009-11-02 14:28:46 -0800, Buxte Hude <buxteh... at (no spam) buxtehude.no> said:
gzip has some unusual properties -- well, unusual for a compressor.
For instance, it produces many more ones than zeros. And that's
just plain wrong.
gzip is actual code I can examine and use. Where is yours? You would
have much more credibility if you weren't the perpetual RSN guy. You
are well aware that the reaction to you around here is at best
sceptical and runs to outright total disbelief. Your thoughts about
proven code we all use daily can't be realistically considered by
rational minds.