Main Page | Report this Page
Computers Forum Index  »  Computer Compression  »  A Revolutionary Breakthrough in Data Compression!...
Page 1 of 3    Goto page 1, 2, 3  Next

A Revolutionary Breakthrough in Data Compression!...

Author Message
sebastian...
Posted: Sat Sep 26, 2009 2:29 am
Guest
Not really. I just figured I'd make my first post on comp.compression
with a splash. =)

What I have got is a minor improvement of the classic Huffman
algorithm. So basically, the problem with the classic approach is that
you have to transmit the statistics along with the compressed data.
This is usually addressed by modifying the algorithm to arrange the
codes using the 'canonical' scheme, which in essense enables you to
send just the length of each code instead, resulting in exactly 256
bytes of overhead.

As it turns out, we can actually do better than that, and it doesn't
even require rearrangement of the codes: transmit the Huffman 'tree'
itself!

I'll describe the method with a bit of pseudo-code:

EMIT_TREE(NODE)
IF NODE IS LEAF
EMIT_BIT(LEAF TAG)
EMIT_BYTE(NODE DATA)
ELSE
EMIT_BIT(NONLEAF TAG)
EMIT_TREE(LEFT CHILD OF NODE)
EMIT_TREE(RIGHT CHILD OF NODE)

Reconstruction from the stream is simple and unambiguous. And since a
Huffman tree is guaranteed to contain 511 or less nodes, the upper
bound for the total size of the compression model is 511 + ( 256 * 8 )
== 2559 bits == 319.875 bytes. This is, of course, 63.875 bytes larger
than the worst (and only) case for canonical encoding, but then the
lower bound is a mere 2.375 bytes, and thus the 'average' (whatever
that means) case should be somewhere in the middle.

Besides the possible savings in overhead, the 'classic' Huffman
algorithm can be used, which is much more simple and efficient than
the 'canonical' method.

Cheers,

Sebastian Garth
 
Willem...
Posted: Sat Sep 26, 2009 7:18 am
Guest
sebastian wrote:
) Reconstruction from the stream is simple and unambiguous. And since a
) Huffman tree is guaranteed to contain 511 or less nodes, the upper
) bound for the total size of the compression model is 511 + ( 256 * 8 )
)== 2559 bits == 319.875 bytes. This is, of course, 63.875 bytes larger
) than the worst (and only) case for canonical encoding, but then the
) lower bound is a mere 2.375 bytes, and thus the 'average' (whatever
) that means) case should be somewhere in the middle.

How did you get that lower bound ?

And, more importantly, where did you get the ridiculous idea that you could
simply take the linear average of the lower and upper bounds as the average
for the distribution ?

) Besides the possible savings in overhead, the 'classic' Huffman
) algorithm can be used, which is much more simple and efficient than
) the 'canonical' method.

It's neither more simple nor more efficient. What it is, is more clear
to those who just learned the Huffman algorithm, because it directly
translates fropm the theory.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
biject...
Posted: Sat Sep 26, 2009 2:17 pm
Guest
Actually there are several ways to encode the table data. I have
played with many. In fact you can do the whole thing bijectively.
The thing most people don't realize is that of the several ways of
doing it. Some will be shorter if you have a few symbols others will
be shorter if you have many symbols. But its kind of a zero sum game.

After much experimentation I settled on just using a bijective vitter
style huffman adapative as the best approach for my uses when doing up
to 256 symbol compression.

You can do bijective static huffman with tables but its harder and
there are more special cases one has to worry about.

You know if you have an optimal solution if the following is meet
C(U(x) = x and U(C(x)= x
wherr C() and U() the compression and uncompression
functions and x is any file.

The solution is not unique even for huffman since it still depends on
how you optimally code the table data.
Since you can do it bijective to favor various numbers of symbols or
even various groups of symbols. But in case if a long file there
various means make up a very small portion of file. In the standard
way of doing tables they are not bijective so they waste space right
out of the gate.

David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link"
 
sebastian...
Posted: Sat Sep 26, 2009 3:20 pm
Guest
Quote:
Well, I have some bad news for you: In this kind of averaging, the average is usually very close to the upper bound.

Not necessarily. Testing it on text files, I've found the compression
model to be roughly 100 bytes or less. Your post, for example,
generated an 81.375 byte model, and the total compression output
(model + data) was 1220 bytes (or ~62% the original size of 1962
bytes). Binary data (such as executables), of course, do tend toward
the upper bound, though.

Quote:
In some extreme cases, where the input data is extremely skewed, you would lose a tiny bit of compression. You could try to work out some cases where a codelength of more than 32
bits is actually needed, for your enlightenment.


I'm not so sure. I would conceive that it could also turn out to be a
much-less-than-negligible loss of compression. At any rate, my method
is perfectly optimal (worst case, the output size (not including the
compression model, of course) is *exactly* the size of the input), and
the compression speed is acceptable enough in every case, so I really
don't see any compelling reason to compromise the compression rate at
all.
 
Harold Aptroot...
Posted: Sat Sep 26, 2009 3:52 pm
Guest
"sebastian" <sebastiangarth at (no spam) gmail.com> wrote in message
news:8b5cb235-d396-4243-9170-d200aeaf8a9e at (no spam) o35g2000vbi.googlegroups.com...
Quote:
Not really. I just figured I'd make my first post on comp.compression
with a splash. =)

What I have got is a minor improvement of the classic Huffman
algorithm. So basically, the problem with the classic approach is that
you have to transmit the statistics along with the compressed data.
This is usually addressed by modifying the algorithm to arrange the
codes using the 'canonical' scheme, which in essense enables you to
send just the length of each code instead,
resulting in exactly 256 bytes of overhead.

Actually that's not true. There are of course cases in which you may need
256 bytes, but usually it will be far less.
1) it's unusual to allow such huge codelengths - as it's very impractical
for decoding. You may only need 4 or 5 bits per symbol.
2) you could compress the lengths-data. and why wouldn't you? Deflate and
BZIP2 both contain nice examples of how to do that.

Quote:
As it turns out, we can actually do better than that, and it doesn't
even require rearrangement of the codes: transmit the Huffman 'tree'
itself!

I'll describe the method with a bit of pseudo-code:

EMIT_TREE(NODE)
IF NODE IS LEAF
EMIT_BIT(LEAF TAG)
EMIT_BYTE(NODE DATA)
ELSE
EMIT_BIT(NONLEAF TAG)
EMIT_TREE(LEFT CHILD OF NODE)
EMIT_TREE(RIGHT CHILD OF NODE)

Reconstruction from the stream is simple and unambiguous. And since a
Huffman tree is guaranteed to contain 511 or less nodes, the upper
bound for the total size of the compression model is 511 + ( 256 * 8 )
== 2559 bits == 319.875 bytes. This is, of course, 63.875 bytes larger
than the worst (and only) case for canonical encoding, but then the
lower bound is a mere 2.375 bytes, and thus the 'average' (whatever
that means) case should be somewhere in the middle.

Your lower bound seems a little odd to me, how did you get it?

Quote:
Besides the possible savings in overhead, the 'classic' Huffman
algorithm can be used, which is much more simple and efficient than
the 'canonical' method.

Cheers,

Sebastian Garth

Well it's simpler alright, but they're provably equally efficient
(codelengths are not changed)
It takes very little time to calculate the canonical codes..
Canonical codes also make decoding easier


Good luck
Harold
 
Willem...
Posted: Sat Sep 26, 2009 4:07 pm
Guest
sebastian wrote:
)>> In some extreme cases, where the input data is extremely skewed, you would lose a tiny bit of compression. You could try to work out some cases where a codelength of more than 32
) bits is actually needed, for your enlightenment.
)
) I'm not so sure.

Then try it.
Just build a tree where one leaf has more than 32 nodes, and then try to
find a minimally skewed symbol distribution that would generate such a
tree.

) I would conceive that it could also turn out to be a
) much-less-than-negligible loss of compression.

Instead of 'conceiving', you should actually try to work things out.

If one symbol requires 33 bits to encode, then you should be able to work
out for yourself that there then is also another symbol that takes 33 bits.
And one that takes 32, one that takes 31, etcetera.

Given that each step of the way, the probabilities roughly halve, the
two symbols requiring 33 bits have at most a 1-in-8-million chance of
appearing. In other words: it only happens when you're compressing 8
megabytes of data *in a single block*.


) At any rate, my method
) is perfectly optimal (worst case, the output size (not including the
) compression model, of course) is *exactly* the size of the input), and
) the compression speed is acceptable enough in every case, so I really
) don't see any compelling reason to compromise the compression rate at
) all.

You *should* include the compression model.
If a tradeoff means I can shave 100 bytes off the compression model,
and it costs me 10 bytes of compression size, then I *gained* 90 bytes.

You're compromising the compression ratio by insisting on storing redundant
information.
Canonical Huffman basically throws away unneeded information of the tree,
so it can always be stored in less space than the tree itself.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
sebastian...
Posted: Sat Sep 26, 2009 4:19 pm
Guest
Quote:
Ok, but then you would only need 2 bytes to specify all lengths in the naive way. 1 byte for L and one for R. So you still save nothing..For 2 symbols you really only need zero bits to specify their lengths since both lengths can only be 1, but that's a bit of a cheat.

You would still need some way to communicate the number of symbols
being used, of course, but that would only amount to 8 extra bits in
the best case scenario. Even so, my encoding works out to less in most
cases, I believe, since the canonical approach requires that you
either send all 256 bytes of length, or the actual symbols plus symbol
lengths.

Quote:
Not necessarily, and when you do it is typically not a lot. True it /can/ occur, but you'd need a truly ridiculous input distribution.

Well, to be fair, 'typically' is a pretty vague expression. When
dealing with binary data, especially, ridiculous distributions are
actually fairly common. On a side note, all of the 'canonical' Huffman
implementations I have found on the net have been broken (in that they
often enough produced corrupted output in decompressing), making it
difficult to compare my algorithm with the standard 'suboptimal'
approach. Anyone know of links to any stable/reliable implementations
that I can use as a benchmark? Because I'm really starting to sound
like I'm talking out of my ass here! =p

Quote:
Limiting the length of the symbols to lower than n-1 (where n is the size of the alphabet) only introduces the possiblity of suboptimality, which is a shame, but greatly simplifies the implementation of the decoder.

Hmm. From what I've read it was more like 32 bits for N == 256. Maybe
I misunderstood, though. And just to be sure, my implementation allows
the maximum symbol length, yet is as simple as can be (if naive: it
just traverses the nodes to output a symbol). I do realize that if you
go the 'suboptimal' route, you can effectively use lookup tables to
decode the compressed stream en masse, but as I've said before, I've
found that the naive approach is reasonably efficient (about 3 times
slower than my compressor, which does use lookup tables).

Quote:
Not necessarily, the "multilevel tabel approach" can be used for any symbol length, but longer symbols would require more levels of tables, decreasing the speed normally gained by table-decoding. Introducing the possiblity of suboptimality limits the number of tables required for decoding, and in typical cases does not actually introduce the suboptimality.

I actually read an explanation of that very approach (by the GNU
implementor), but as I understood it, allowing for the maximum code
length would require a prohibitive amount of memory for the tables,
correct?
 
Harold Aptroot...
Posted: Sat Sep 26, 2009 4:38 pm
Guest
"Harold Aptroot" <harold.aptroot at (no spam) gmail.com> wrote in message
news:9c238$4abe0077$5355b09d$9505 at (no spam) cache100.multikabel.net...
Quote:
"sebastian" <sebastiangarth at (no spam) gmail.com> wrote in message
news:8b5cb235-d396-4243-9170-d200aeaf8a9e at (no spam) o35g2000vbi.googlegroups.com...
Not really. I just figured I'd make my first post on comp.compression
with a splash. =)

What I have got is a minor improvement of the classic Huffman
algorithm. So basically, the problem with the classic approach is that
you have to transmit the statistics along with the compressed data.
This is usually addressed by modifying the algorithm to arrange the
codes using the 'canonical' scheme, which in essense enables you to
send just the length of each code instead,
resulting in exactly 256 bytes of overhead.

Actually that's not true. There are of course cases in which you may need
256 bytes, but usually it will be far less.
1) it's unusual to allow such huge codelengths - as it's very impractical
for decoding. You may only need 4 or 5 bits per symbol.

To clarify, I meant "4 or 5 bits per to store each LENGTH of a symbol"
Wasn't really awake yet :)

Quote:
2) you could compress the lengths-data. and why wouldn't you? Deflate and
BZIP2 both contain nice examples of how to do that.

greetings,
harold
 
sebastian...
Posted: Sat Sep 26, 2009 4:44 pm
Guest
Quote:
Instead of 'conceiving', you should actually try to work things out.

Let's set aside the argument of distribution probabilities, for a
moment. Say we have 100000 bytes of data, which is comprised of the
symbols 'A', 'B', and 'C', where 'A' and 'B' appear only once. What
would be the (suboptimal) canonical code for each symbol?

Quote:
You *should* include the compression model. If a tradeoff means I can shave 100 bytes off the compression model, and it costs me 10 bytes of compression size, then I *gained* 90 bytes. You're compromising the compression ratio by insisting on storing redundant information. Canonical Huffman basically throws away unneeded information of the tree, so it can always be stored in less space than the tree itself.

To be sure, though, even 320 bytes is not a lot of overhead, in the
grand scheme of things. I mean, just what I have typed so far in this
post is nearly 3 times that! So although technically correct, your
statement is effectively an overemphasis, all things considered.
 
Willem...
Posted: Sat Sep 26, 2009 6:19 pm
Guest
sebastian wrote:
)
)
)>> Instead of 'conceiving', you should actually try to work things out.
)
) Let's set aside the argument of distribution probabilities, for a
) moment. Say we have 100000 bytes of data, which is comprised of the
) symbols 'A', 'B', and 'C', where 'A' and 'B' appear only once. What
) would be the (suboptimal) canonical code for each symbol?

In this case, canonical encoding is optimal, so your question is
meaningless.

I'll leave it to you to find an example where canonical encoding
(with a limit of 32 bits to the length) is indeed suboptimal.

)>> You *should* include the compression model. ...
)
) To be sure, though, even 320 bytes is not a lot of overhead, in the
) grand scheme of things. I mean, just what I have typed so far in this
) post is nearly 3 times that! So although technically correct, your
) statement is effectively an overemphasis, all things considered.

Look, *you* are the one who is claiming that you found a way to reduce the
size of the compression model. You're shooting yourself in the foot here.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
Willem...
Posted: Sat Sep 26, 2009 6:24 pm
Guest
sebastian wrote:
)
)
)>> Ok, but then you would only need 2 bytes to specify all lengths in the naive way. 1 byte for L and one for R. So you still save nothing..For 2 symbols you really only need zero bits to specify their lengths since both lengths can only be 1, but that's a bit of a cheat.
)
) You would still need some way to communicate the number of symbols
) being used, of course,

Wrong.
You can tell from a set of given lengths alone when you're finished.

) Well, to be fair, 'typically' is a pretty vague expression. When
) dealing with binary data, especially, ridiculous distributions are
) actually fairly common.

Ridiculous, in this case, is a ratio of about one to eight billion.
(Apologies for the earlier post where I mistakenly claimed eight megabytes.)

) I actually read an explanation of that very approach (by the GNU
) implementor), but as I understood it, allowing for the maximum code
) length would require a prohibitive amount of memory for the tables,
) correct?

Here's an exercise for you:
Given 1 gigabyte of data, what is the longest possible huffman length ?


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
Harold Aptroot...
Posted: Sat Sep 26, 2009 7:25 pm
Guest
"sebastian" <sebastiangarth at (no spam) gmail.com> wrote in message
news:a088fa3e-0ae8-416a-b8f4-1fb8b134712a at (no spam) g6g2000vbr.googlegroups.com...
Quote:
How did you get that lower bound ?

Let's say that the input stream was comprised of only the symbols 'L'
and 'R'. Your tree then looks like this:

ROOT
/ \
'L' 'R'

So the overhead would be 1 bit for each node, and 8 bits for each
character, or 19 bits == 2.375 bytes total.

Ok, but then you would only need 2 bytes to specify all lengths in the naive
way
1 byte for L and one for R
So you still save nothing..

For 2 symbols you really only need zero bits to specify their lengths since
both lengths can only be 1, but that's a bit of a cheat.

Quote:
1) it's unusual to allow such huge codelengths - as it's very impractical
for decoding. You may only need 4 or 5 bits per symbol.

Yes, but then you'd be sacrificing a certain amount of compression.
Correct?

Not necessarily, and when you do it is typically not a lot. True it /can/
occur, but you'd need a truly ridiculous input distribution.
Limiting the length of the symbols to lower than n-1 (where n is the size of
the alphabet) only introduces the possiblity of suboptimality, which is a
shame, but greatly simplifies the implementation of the decoder.

Quote:
2) you could compress the lengths-data. and why wouldn't you? Deflate and
BZIP2 both contain nice examples of how to do that.

That's a good point. Any idea what this usually amounts to?

Actually they're very different, in BZIP2 they use delta encoding, in
deflate they use huffman coding (which sounds strange, but it really isn't,
since every time you recursively apply it you are left with less symbols to
store the length of)

Quote:
Canonical codes also make decoding easier

I heard a similar claim elsewhere, but it was my impression that this
was only true if you chose the suboptimal approach (limiting the code
lengths to 32 bits, say). Is that assumption incorrect?

Not necessarily, the "multilevel tabel approach" can be used for any symbol
length, but longer symbols would require more levels of tables, decreasing
the speed normally gained by table-decoding.
Introducing the possiblity of suboptimality limits the number of tables
required for decoding, and in typical cases does not actually introduce the
suboptimality.

greeting,
harold
 
Willem...
Posted: Sat Sep 26, 2009 8:58 pm
Guest
sebastian wrote:
)>> It wouldn't be suboptimal. It would be:
)>> C=0
)>> A=10
)>> B=11
)
) I see - so we would need more symbols before suboptimality would
) surface. Got it.

A *LOT* more.

) I don't see how it was meaningless. I was asking quite simply because
) I wasn't sure.

You don't seem to know much about Huffman encoding.
Why do you keep insisting that your method is better ?

) I didn't really intend to make this a 'classic' vs. 'canonical' issue,
) either,

What did you intend ?

) but if I had, I think the matter would be clear: The canonical
) method is certainly ingenious, and addresses the compression model
) problem quite well. Without it, Huffman encoding probably wouldn't be
) nearly as ubiquitous as it is today. Nonetheless, my method also
) addresses the issue quite well, and I might add, better than the
) canonical method,

Wrong. The canonical method addresses the issue better than your method.

) in that it is always optimal.

THe canonical method is also always optimal, unless somebody chooses to
limit the maximal symbol length. This limit is by no means central to the
method.

) That, and the fact
) that the original Huffman scheme is preserved,

Which is completely irrelevant to anything whatsoever.

) is reason enough for me
) to insist that my method is, in a sense, superior. No trade-offs,

There *is* no trade-off in canonical huffman encoding.

) no
) compromises, and no need to resort to secondary encoding schemes. That
) much I believe is undisputable.

You believe wrongly. It is not only disputable, it is provably wrong.

) Will it make a significant impact on
) the way things are currently done? Probably not. The canonical
) approach, though suboptimal, performs reasonably well in most cases,

The canonical method, in and of itself optimal, already performs better
than the classical method. Adding the length restriction *improves* the
performance, because the shrink in model size easily outweighs any possible
increase in compressed data size.

) and as such I don't think it likely that anyone's going to go to all
) the trouble to recode their current implementations, or break
) compatability with existing formats over it. But at least now there is
) another viable option available to us, whereas before there was none.

Get this through your head:

Your method is worse by any measure imaginable, with the possible
exception that it might be more easily understood by compression newbies.

I'm sorry to patronize you like this, but you seem to still be convinced
that your method is somehow 'superior' when, in fact, it isn't.

(Unless you're trolling of course.)


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
Harold Aptroot...
Posted: Sat Sep 26, 2009 9:15 pm
Guest
"sebastian" <sebastiangarth at (no spam) gmail.com> wrote in message
news:1e6f74e2-d57f-4fb3-85de-1d7096573efb at (no spam) 31g2000vbf.googlegroups.com...
Quote:

Instead of 'conceiving', you should actually try to work things out.

Let's set aside the argument of distribution probabilities, for a
moment. Say we have 100000 bytes of data, which is comprised of the
symbols 'A', 'B', and 'C', where 'A' and 'B' appear only once. What
would be the (suboptimal) canonical code for each symbol?

It wouldn't be suboptimal. It would be:
C=0
A=10
B=11
Making A and/or B longer can not make C any shorter
 
sebastian...
Posted: Sat Sep 26, 2009 10:28 pm
Guest
Quote:
Get this through your head:

Honestly, just because you are communicating with complete strangers
doesn't mean you should abandon common courtesy, Willem. It's one
thing to disagree with someone - and maybe even think they are
completely bonkers - but that sort of tone is just really uncalled
for. I really do welcome any and all thoughtful criticism (and to be
honest, I almost agree that many of your assertions may be true, if
only I had some solid proof), but I'm not going to sit here and be
insulted.

So please, show some decency, and try to communicate yourself a bit
less offensively. Fair enough?

Regards,

Sebastian Garth
 
 
Page 1 of 3    Goto page 1, 2, 3  Next
All times are GMT
The time now is Sat Nov 28, 2009 5:54 am