Main Page | Report this Page
Computers Forum Index  »  Computer Compression  »  A Revolutionary Breakthrough in Data Compression!...
Page 3 of 3    Goto page Previous  1, 2, 3

A Revolutionary Breakthrough in Data Compression!...

Author Message
sebastian...
Posted: Tue Sep 29, 2009 9:46 pm
Guest
Quote:
160 is for *all cases*
3 is the *best possible case*

Okay, but you did say "For canonical, both are smaller", and I assumed
by that you meant the lower and upper bounds. In any case, it appears
you were right, anyway, since the lower bound for canonical could
actually be one bit cheaper.

Quote:
Februari 1991, as per the very first hit on google.

It actually looks like he's directly transmitting the statistics
(frequency count of each character), though:

/* -- write the frequency array to the output file -- */
for (c = 0; c < 256; c++) {
if (ht[c].cnt > 0) {
fwrite(&ht[c].ch, sizeof(char), 1, fo);
fwrite(&ht[c].cnt, sizeof(BYTECOUNTER), 1, fo);
}
}

Even so, it does seem likely that someone would have thought of this
before, I had just never heard of it being done myself (all
implementations I've seen either transmit the statistics or canonical
lengths).

Quote:
You can tell when the last symbol was processed by checking what the bit-encoding of the current symbol is. If you reached all-ones, you're done.

Ah. Yes, that was one of the points I had missed before.

Quote:
I'm sure you can work out some other cases.

Okay, I see now. All less than classical.

Alright, so you were right, after all. I'm sorry it took so much
explaining to bring me around. And I apologize for being so stubborn.
I guess it's easy to forget that you guys have been around here for
quite a while, and more often than not know what you are talking
about.

I still feel my method is adequate for now (in that it probably
doesn't necessitate an immediate rewrite), but eventually I will be re-
implementing it to use canonical encoding, for the sake of
completeness, anyway.

Well, thanks again for all of your help. I'm sure I'll be back to
pester you guys when I get around to my range-coding compressor!

Cheers,

- Sebastian Garth
 
Mark Adler...
Posted: Tue Sep 29, 2009 11:41 pm
Guest
On Sep 28, 9:51 pm, sebastian <sebastianga... at (no spam) gmail.com> wrote:
Quote:
Of the transmitted compression model (in bits, say).

I don't know what the upper bound is, but what matters in practice is
the average. A quick look at a bunch of deflate data showed that the
average code descriptor size was around 95 bytes.

Quote:
So basically, if anyone's to prove to me that canonical is
really so much better,

I'm not following this thread, but if the question is: how much do you
gain in the description by forcing the code to be canonical, that's
not hard to calculate, at least in the worst case. What you end up
with is the number of ways to permute the code. For each code length
with at least one symbol of that length, we can permute the symbols
within the codes of that length. So if the number of symbols in each
length are a, b, c, etc., then the number of permutations are a!b!
c!...

Since (a+b)! > a!b! (for a and b greater than zero), the worst case is
a flat code with all of them the same length. So for 256 symbols as
the example (all of the codes are eight bits long), there are 256!
permutations. The log-base2 of 256! is 1684, or 210.5 bytes. So in
that worst case (which is an actual case when trying to Huffman code,
say, random data), you'll need at least 210.5 bytes more descriptor
for the code if you don't force it to be canonical or place some other
restriction on the code to reduce the number of possible permutations.

Mark
 
Phil Carmody...
Posted: Thu Oct 01, 2009 1:22 am
Guest
Willem <willem at (no spam) stack.nl> writes:
Quote:
sebastian wrote:
[SNIP - canonical huffman encoding vs. classical]
If you build a huffman tree for a set of symbols, then you'll notice that
at a lot of points along the way, you have a choice to make. A choice that
does not actually make any difference at all on the resulting size.
To be more specific: whenever you join two symbol sets, you have to assign
a 0-bit to one set and a 1-bit to the other.
All those choices have to be recorded in the final huffman model.

But what if you did *not* have to record those choices ?

That's where canonical encoding comes in. As it turns out, if you only
record the path length of each node, then you can reconstruct a tree, which
happens to be one of the thousands of possible trees that would come out
of the original method.

So, canonical encoding records less information than classical.
It's quite simple to see, then, that canonical encoding takes less space.

I'm a bit rusty on Huffman - is there not the posibility for there
to be multiple 'canonical' encodings? (All sharing the same property
of only requiring the lengths, but with different shared presumptions.)

Phil
--
Any true emperor never needs to wear clothes. -- Devany on r.a.s.f1
 
Niels Fröhling...
Posted: Thu Oct 01, 2009 5:14 am
Guest
Phil Carmody wrote:

Quote:
I'm a bit rusty on Huffman - is there not the posibility for there
to be multiple 'canonical' encodings? (All sharing the same property
of only requiring the lengths, but with different shared presumptions.)

Yes. Of course you have to take care that the implicit order of your symbols
(within their respective sets) is generated symmetrically in the encoder and
decoder, but if you would seed a pseudo-randomnumber generator identically you
could generate the symbol-order with that.

A slightly simplified explanation is the following (semi-adaptive huffman
considered, in adaptive huffman all of this goes through permanent transitions
and is irrelevant):

- huffman-codes are optimal only on power-of-two frequency fractions

- in case you have non-pot frequencies the constructed huffman-tree would be
identical to the one which is rounded down to pot-frequencies iteratively over
it's subtrees

- you have only a limited number of possible huffman trees per alphabet size
(applying symbol aliasing, the possible set is the same for all permutations,
independent of the frequencies):

symbols -> trees
1 -> 1
2 -> 1
3 -> 1
4 -> 2
5 -> ...

I haven't found the exact formula, but it should be straightforward
understandable. Maybe Mark knows the calculation.

(this is a rule of all prefix-codes, so the logic applies to rice-codes or all
elias as well, only there is the possible set is even smaller [as per
definition, otherwise they would be called huffman codes])

- for canonical huffman-trees (or what we maybe regularly understand as CHC) we
enforce a defined order on all symbols (within their respective sets) with equal
rounded down frequency fractions

- as such the definition of canonical huffman codes means - the elimination of
all possible permutations of symbol orderings (within their respective sets)
into a single possible ordered set, namely the deterministic creation of the
only valid symbol ordering out of any arbitrary arriving permutation (you may
suppose that the term 'canon'ical doesn't accidentally means: *"reduced to the
simplest and most significant form possible without loss of generality"*)

- the huffman tree could then be send as pot-frequency array instead of as exact
frequency array (as the order of equally _treated_ frequencies is now implicit)

- the shape of the possible huffman tree can also be send as a single index,
though then you have to create all possible huffman trees over your alphabet and
stop on arrival, so is not really practical in view of huffman coding suppose to
be fast

- the _shape_ of the specific huffman-tree over your alphabet is also uniquely
codeable by sending the depth-intervals of the huffman-tree (thus eliminating
the need for frequency information):

00
01
100
101
110
111

[0 - 1, 2 - 5]

which is the commonly applied approach, that is for several
implementation-related factors, you get a lot of side-products/information which
you can directly use for LUTs in your decoder

it is not the only way to send canonical huffman-trees!

- you always have to send the symbol-set, with or without a canonical approach

- you can omit the interval-boundaries altogether by just indicating the number
of symbols in the current incrementing interval-set:

[interval][number of symbols][codes (implicit)][symbols in set]
[0][0] [] []
[1][1] [0] [3]
[2][2] [10 & 11] [1 & 2]

- you can replace "symbols-in-set" by "index-in-set" to send minimal
~log2(all-symbols!) bits. And well you can go through all kind of
range-compression here all over the place to reduce this coding further, like
reducing the number of bits required to specify the sets population as the
possible number of symbols decreases ... a table containing maximum possible
intervals to-go after a specific tree-shape occured ... sending the number of
intervals too (and take advantage of that knowledge) ...

So the final words are: the "normal" canonical huffman-tree transmition is not
superior because you transmit your alphabet more efficiently, it's because you
almost completely get rid of frequency-magnitude transfer, you reduce the
tree-shape problem to one of symbol-reordering. Which means the approximate
upper bound of this kind of tree transmission is:

max(all-symbols) =
1 +
log2(all-symbols!) +
log2(all-symbols) * (all-symbols - 1)

1 -> symbol containing the alphabet size
all-symbols - 1 -> max number of intervals to send
log2(all-symbols) -> max bits for spec. size of each intervals symbol set
log2(all-symbols!) -> bits required to send all symbols

max(256) = 1 + 1684 + 2040 = 3728 bits = 466 bytes

The most common canonical huffman coder is the one in JPEG, not so difficult to
find ...

Ciao
Niels

Disclaimer: as always I apologise for stating incorrectness.
 
Willem...
Posted: Thu Oct 01, 2009 3:02 pm
Guest
Phil Carmody wrote:
) I'm a bit rusty on Huffman - is there not the posibility for there
) to be multiple 'canonical' encodings? (All sharing the same property
) of only requiring the lengths, but with different shared presumptions.)

Well... no, yes and yes.

No, because 'canonical' besically means that you choose one
of all possible ones and appoint that as TheOne(tm).

Yes, because you could conceivably disagree about which canonical
encoding is TheOne(tm).

And Yes, because in certain unlikely cases, it is possible that
one symbol set could be encoded into two different canonical encodings.
(For example, if you have the string 'AABBCD', you could get several
different sets of lengths. 1-2-3-3, 2-1-3-3, or even 2-2-2-2.)


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
 
Page 3 of 3    Goto page Previous  1, 2, 3
All times are GMT
The time now is Thu Dec 10, 2009 10:25 pm