 |
|
| Computers Forum Index » Computer Compression » Huffman & Entropy Question... |
|
Page 1 of 3 Goto page 1, 2, 3 Next |
|
| Author |
Message |
| JasonFX... |
Posted: Tue Oct 06, 2009 4:05 pm |
|
|
|
Guest
|
Hi,
If I have an information source x bits long with an entropy of n bits,
Is it possible to mathematically calculate the maximum potential
compression ratio that Huffman Coding could yield on the information
source with respect to its entropy?
Thanks
Jason |
|
|
| Back to top |
|
|
|
| Thomas Richter... |
Posted: Tue Oct 06, 2009 8:20 pm |
|
|
|
Guest
|
JasonFX schrieb:
Quote: Hi,
If I have an information source x bits long with an entropy of n bits,
Is it possible to mathematically calculate the maximum potential
compression ratio that Huffman Coding could yield on the information
source with respect to its entropy?
What you're saying is somewhat self-contradictory. Lemme explain:
"Entropy" is something that works with respect to a model, i.e. you have
(conditioned or unconditioned) probabilities for the symbols emitted
from a source. From this, one can compute the entropy. A limited size
message allows you only to estimate the entropy by using relative
frequencies.
However, leaving this aside: Let's suppose we have a random source R
with entropy E, then one can compute how much a *typical* string from
the random source can be compressed. Depending on whether your input
string is typical or not, it may or may not be compressed to this ratio.
Clearly, if you have a proper model, typical strings are the most likely
strings generated by the source, and thus you have *usually* the
compression ratio Huffman grants you, but in the unlikely event of an
atypical string you might have no compression at all, and rather an
expansion.
So long,
Thomas |
|
|
| Back to top |
|
|
|
| sebastian... |
Posted: Wed Oct 14, 2009 3:39 pm |
|
|
|
Guest
|
Quote: Clearly, if you have a proper model, typical strings are the most likely strings generated by the source, and thus you have *usually* the compression ratio Huffman grants you, but in the unlikely event of an atypical string you might have no compression at all, and rather an expansion.
Of course, expansion only occurs from to the overhead of transmitting
to model. In compressing the symbols alone, Huffman encoding
guarantees an output less than or equal to the input size. |
|
|
| Back to top |
|
|
|
| glen herrmannsfeldt... |
Posted: Wed Oct 14, 2009 5:38 pm |
|
|
|
Guest
|
Thomas Richter <thor at (no spam) math.tu-berlin.de> wrote:
< sebastian schrieb:
(snip)
<> Of course, expansion only occurs from to the overhead of transmitting
<> to model. In compressing the symbols alone, Huffman encoding
<> guarantees an output less than or equal to the input size.
< That wasn't my point, though. My point was rather: "NO, IT DOESN'T".
< Example: use the following Huffman code:
< A -> 1
< B -> 01
< C -> 000
< D -> 001
< Does this Huffman code ensure "compression"?
< Not at all. If I feed it by *any* message I like to, *most* of the
< messages will be expanded, so it clearly doesn't ensure it.
As far as I know, there are two ways to use Huffman coding.
One is to use a static table based on expected statisical
distribution for the data. Group 3 FAX does that. Morse code
does that.
The other way is to generate one based on the specific data set
to be compressed. In that case, and not including the bits needed
to code the table, it would seem that output should not be greater,
though could be equal. If all symbols have equal probability
then the result will be equal.
I like the way Ultrium (LTO) tapes handle compression. There is
a bit on each tape block indicating compressed or not. When
writing, the drive compares the size of the compressed block
to the source block, writes whichever is smaller and sets the
bit accordingly. If the bit is part of the block header, which
is necessary anyway, compressed data will never be bigger.
Otherwise, you could say it is one bit bigger. I have seen
DLT increase the size by about 10% with compression on.
-- glen |
|
|
| Back to top |
|
|
|
| Thomas Richter... |
Posted: Wed Oct 14, 2009 7:50 pm |
|
|
|
Guest
|
sebastian schrieb:
Quote: Clearly, if you have a proper model, typical strings are the most likely strings generated by the source, and thus you have *usually* the compression ratio Huffman grants you, but in the unlikely event of an atypical string you might have no compression at all, and rather an expansion.
Of course, expansion only occurs from to the overhead of transmitting
to model. In compressing the symbols alone, Huffman encoding
guarantees an output less than or equal to the input size.
That wasn't my point, though. My point was rather: "NO, IT DOESN'T".
Example: use the following Huffman code:
A -> 1
B -> 01
C -> 000
D -> 001
Does this Huffman code ensure "compression"?
Not at all. If I feed it by *any* message I like to, *most* of the
messages will be expanded, so it clearly doesn't ensure it.
The reason is that this works only well if we really have a random
source with p(A) = 1/2, p(B) = 1/4, p(C)=p(D)=1/8, at least
approximately. If the source derives from that, compression will be
lower, or even an expansion.
However, *even for such a source*, messages like
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
are *not* impossible. Just unlikely. So you do get, at best,
"compression on average" over a huge number of input messages. You do
not get a compression "for a message generated by the random source".
You only get that for "typical messages". The above message would be
rather a-typical, but not impossible. And hence, even for a random
source that *fits* the Huffman code, Huffman doesn't *ensure*
compression. It just makes *compression likely to happen*.
That was the point I was trying to make.
People here often mix the concepts of probabilities, relative
frequencies, models, random sources, specific sequences, random
sequences, etc. and hence create an immense noise level.
So long,
Thomas |
|
|
| Back to top |
|
|
|
| glen herrmannsfeldt... |
Posted: Thu Oct 15, 2009 8:38 am |
|
|
|
Guest
|
Thomas Richter <thor at (no spam) math.tu-berlin.de> wrote:
< glen herrmannsfeldt schrieb:
(snip, someone wrote)
<> < Does this Huffman code ensure "compression"?
<> < Not at all. If I feed it by *any* message I like to, *most* of the
<> < messages will be expanded, so it clearly doesn't ensure it.
<> As far as I know, there are two ways to use Huffman coding.
<> One is to use a static table based on expected statisical
<> distribution for the data. Group 3 FAX does that. Morse code
<> does that.
<> The other way is to generate one based on the specific data set
<> to be compressed. In that case, and not including the bits needed
<> to code the table, it would seem that output should not be greater,
<> though could be equal. If all symbols have equal probability
<> then the result will be equal.
< No, and no. (-: Yes, I'm picky, and I'm of course making some fun of it.
< But, to be precise:
< There is only "one type of Huffman coding", and that is based on
< probabilities. Where you get the probabilities from is another issue,
< and not part of the code.
Fortunately I said "use the huffman code", use being important.
< Thus, for "static coding", you take them for
< granted, hopefully matching the source, where the tables have been
< derived by observing typical sources. For "dynamic coding", you create
< tables by assuming the relative frequencies found in the source are
< equal to the probabilities of a random source creating this string.
< Note, however, that both are assumptions. The second assumption does
< indeed give you the shortest possible prefix code (so, yes!) for that
< specific string, but again, relative frequencies != probabilities in
< general. Probability is something of a model you cannot observe
< directly, relative frequency is just a matter of counting.
For some reason this reminds me of a current discussion in comp.dsp
on the meaning of causal. If you are compressing a whole file,
yes, just count and generate a frequency table. In a causal system,
I can only use the frequencies of previous characters to predict
the probability. That isn't required, though, for file compression.
I can read the whole file, generate the table of frequencies.
With the small assumption of a finite length file (true of all
current computer systems) I don't see any problem calling
those probabilities, but I will accept that you do.
(In the DSP case, the FFT is often used as part of a filter
algorithm, the result being non-causal, but often useful.)
< Back to the OP: There, we had the question what we can say about Huffman
< compression if we know something about the entropy of the source.
< Entropy, again, is a term computed from probabilities. Not from relative
< frequencies. Mixing the two is a common error, but that's not the same
< thing. Probabilities can be assumed and are part of the model. Relative
< frequencies are measured on an input string. If we only know the
< entropy, we still do not know anything about the string. Only how it
< would typically look like, but not how it does look like.
But if we do know the frequencies and want to generate a huffman
code from those frequencies...
I thought unix used to have a program to do huffman code file
compression but I don't see it now. It is fairly rare to have
a source with statistically independent characters, though.
-- glen |
|
|
| Back to top |
|
|
|
| Thomas Richter... |
Posted: Thu Oct 15, 2009 11:06 am |
|
|
|
Guest
|
glen herrmannsfeldt schrieb:
Quote: Thomas Richter <thor at (no spam) math.tu-berlin.de> wrote:
sebastian schrieb:
(snip)
Of course, expansion only occurs from to the overhead of transmitting
to model. In compressing the symbols alone, Huffman encoding
guarantees an output less than or equal to the input size.
That wasn't my point, though. My point was rather: "NO, IT DOESN'T".
Example: use the following Huffman code:
A -> 1
B -> 01
C -> 000
D -> 001
Does this Huffman code ensure "compression"?
Not at all. If I feed it by *any* message I like to, *most* of the
messages will be expanded, so it clearly doesn't ensure it.
As far as I know, there are two ways to use Huffman coding.
One is to use a static table based on expected statisical
distribution for the data. Group 3 FAX does that. Morse code
does that.
The other way is to generate one based on the specific data set
to be compressed. In that case, and not including the bits needed
to code the table, it would seem that output should not be greater,
though could be equal. If all symbols have equal probability
then the result will be equal.
No, and no. (-: Yes, I'm picky, and I'm of course making some fun of it.
But, to be precise:
There is only "one type of Huffman coding", and that is based on
probabilities. Where you get the probabilities from is another issue,
and not part of the code. Thus, for "static coding", you take them for
granted, hopefully matching the source, where the tables have been
derived by observing typical sources. For "dynamic coding", you create
tables by assuming the relative frequencies found in the source are
equal to the probabilities of a random source creating this string.
Note, however, that both are assumptions. The second assumption does
indeed give you the shortest possible prefix code (so, yes!) for that
specific string, but again, relative frequencies != probabilities in
general. Probability is something of a model you cannot observe
directly, relative frequency is just a matter of counting.
Back to the OP: There, we had the question what we can say about Huffman
compression if we know something about the entropy of the source.
Entropy, again, is a term computed from probabilities. Not from relative
frequencies. Mixing the two is a common error, but that's not the same
thing. Probabilities can be assumed and are part of the model. Relative
frequencies are measured on an input string. If we only know the
entropy, we still do not know anything about the string. Only how it
would typically look like, but not how it does look like.
So long,
Thomas |
|
|
| Back to top |
|
|
|
| Mark Adler... |
Posted: Thu Oct 15, 2009 6:29 pm |
|
|
|
Guest
|
On 2009-10-15 01:38:02 -0700, glen herrmannsfeldt <gah at (no spam) ugcs.caltech.edu> said:
Quote: I thought unix used to have a program to do huffman code file
compression but I don't see it now.
It was called "pack". It slowly disappeared after the introduction of
the generally more effective compress (LZW). gzip is able to unpack
compressed files created by pack, but does not compress to that format.
zlib provides the Z_HUFFMAN_ONLY compression strategy.
Mark |
|
|
| Back to top |
|
|
|
| glen herrmannsfeldt... |
Posted: Thu Oct 15, 2009 7:07 pm |
|
|
|
Guest
|
Mark Adler <madler at (no spam) alumni.caltech.edu> wrote:
< On 2009-10-15 01:38:02 -0700, glen herrmannsfeldt <gah at (no spam) ugcs.caltech.edu> said:
<> I thought unix used to have a program to do huffman code file
<> compression but I don't see it now.
< It was called "pack". It slowly disappeared after the introduction of
< the generally more effective compress (LZW). gzip is able to unpack
< compressed files created by pack, but does not compress to that format.
Yes. I thought unix utilities stayed around forever, but it
seems that pack is gone.
The man page for compress on my system says, near the end:
"Compression is generally much better than that achieved by
Huffman coding (as used in the historical command pack),
or adaptive Huffman coding (as used in the historical command
compact), and takes less time to compute."
That seems as close as we get to a description of pack and compact.
-- glen |
|
|
| Back to top |
|
|
|
| Niels Fröhling... |
Posted: Thu Oct 15, 2009 11:28 pm |
|
|
|
Guest
|
Thomas Richter wrote:
Quote: Back to the OP: There, we had the question what we can say about Huffman
compression if we know something about the entropy of the source.
Entropy, again, is a term computed from probabilities. Not from relative
frequencies. Mixing the two is a common error, but that's not the same
thing. Probabilities can be assumed and are part of the model. Relative
frequencies are measured on an input string. If we only know the
entropy, we still do not know anything about the string. Only how it
would typically look like, but not how it does look like.
Hey I have a question which sits in the back of my head, since you start to be
"so picky" :^)
Do you actually know a single example, where you do know the real-entropy^tm
of a source? You make the impression that real-entropy^tm is a deterministic /
calculateable term, and not an extrapolation of an observation.
I would agree with that theoretically the sources real-entropy^tm is distinct
to observed entropy, but both (in general) are respective a given model, and
because of that the real-entropy^tm is of limited use (you know your observed
entropy is never absolute correct not even in respect to a specific model, and
it's only x% efficient) given the fact that it is very difficult to get exactly.
Well actually I do can imagine a source with observation==real, that is when
you say something for example can only and only create "Hamlet", not a byte
more, no variation, nothing different.
Buuut then real-entropy^tm is identical to the entropy of the coded stream
(let's say LZ, 123k, the compressed file can't give you anything else than
"Hamlet").
This of course (for me) can't be true (it means claiming universal==specific,
contradiction), and is of doubtful use, because you don't want to create a
specific model for each finite message. I'm referring to programs here, you
don't want to program a specific model for a specific single message.
That's the point of my example, either you have infinite source and you try to
model fragments of it by approximating the source's real-entropy^tm though
piecewise observation (human-brain => "Hamlet", but also others, infinite
others), or you model context-free atomic universally independent finite sources
which makes no sense whatsoever.
In the end I suspect, expect that real-entropy^tm can't be quantified in
absolutes, and the best approximation would come from KC (then we get model
independent entropy, if you don't count KC as a model as well).
And simply waiting for the computer with infinite computational abilities, we
just accept our little poor custom-model dependent entropy as liable
approximation, which in most cases because of convenience is plain order-0
entropy (of a little piece of an infinite string: that are all the other "files"
you want to compress and haven't yet).
So, I hope I'm not completely out of my mind. And to ask the question again,
maybe lost in my discourse:
Do you actually know a single example, where you do know the real-entropy^tm
of a source?
Oh, I don't allow to count artificial constructed sources according an
entropy. My emphasis is more about an equivalent to "know real-Pi^tm", which was
observed, then approximated, then modelled, then absolutely determined, and now
the verify a circle in reality though the model Pi, instead of verifying Pi as
before. Taking the number "5" and saying "this constant is '5', exactly and
always", is not the sort of answer I thought of. :)
Ciao
Niels |
|
|
| Back to top |
|
|
|
| glen herrmannsfeldt... |
Posted: Fri Oct 16, 2009 10:39 am |
|
|
|
Guest
|
Thomas Richter <thor at (no spam) math.tu-berlin.de> wrote:
< Niels Fr?hling schrieb:
<>> Back to the OP: There, we had the question what we can say about
<>> Huffman compression if we know something about the entropy of the
<>> source. Entropy, again, is a term computed from probabilities. Not
<>> from relative frequencies. Mixing the two is a common error, but
<>> that's not the same thing. Probabilities can be assumed and are part
<>> of the model. Relative frequencies are measured on an input string. If
<>> we only know the entropy, we still do not know anything about the
<>> string. Only how it would typically look like, but not how it does
<>> look like.
Lets get away from the math and look at actual real compression
problems. If I have a file I want to compress, already stored
in disk, I can count the number of each character in that file.
I then don't care about 'typical' but just this one file.
Many compression algorithms allow for stream compression,
where output can be generated before the end of the input
string is known. That is certainly nice, but is not always
required. I was going to look at the man page for pack and compact,
but they don't seem to exist on the systems I use.
<> Hey I have a question which sits in the back of my head, since you
<> start to be "so picky" :^)
<> Do you actually know a single example, where you do know the
<> real-entropy^tm of a source? You make the impression that
<> real-entropy^tm is a deterministic / calculateable term, and not an
<> extrapolation of an observation.
< That's just backwards from what I'd say. Entropy and probabilities are
< nothing you "know by observation", i.e. it is not a measurable quantity.
< Probabilities and entropies are *assumed*. They are part of a model. You
< cannot compute them. You can assume them.
If the requirement is 'compress file X, as stored on disk' then yes
I do know the probabilities. The system is not causal. I know
the probability of a character before I see it, but that is fine.
For stream compression, say writing data to a tape while it is
being generated, then I don't know the frequencies in advance.
I might have one buffer full, but not the whole data stream.
In that case, the algorithm has to rely on the observations of
previous data, that is, it is causal.
Some people need one type of algorithm, others can use either.
-- glen |
|
|
| Back to top |
|
|
|
| Thomas Richter... |
Posted: Fri Oct 16, 2009 11:22 am |
|
|
|
Guest
|
Niels Fröhling schrieb:
Quote: Back to the OP: There, we had the question what we can say about
Huffman compression if we know something about the entropy of the
source. Entropy, again, is a term computed from probabilities. Not
from relative frequencies. Mixing the two is a common error, but
that's not the same thing. Probabilities can be assumed and are part
of the model. Relative frequencies are measured on an input string. If
we only know the entropy, we still do not know anything about the
string. Only how it would typically look like, but not how it does
look like.
Hey I have a question which sits in the back of my head, since you
start to be "so picky" :^)
Do you actually know a single example, where you do know the
real-entropy^tm of a source? You make the impression that
real-entropy^tm is a deterministic / calculateable term, and not an
extrapolation of an observation.
That's just backwards from what I'd say. Entropy and probabilities are
nothing you "know by observation", i.e. it is not a measurable quantity.
Probabilities and entropies are *assumed*. They are part of a model. You
cannot compute them. You can assume them.
Some background: The following "definition" of probability seems plausibe:
Let "A" be an event from a set of events X, and let #A the number of
cases in an experiment where the event A was observed. And let N the
number of experiments. Then:
p(A) := \lim_{N \rightarrow \infty} #A/N
Seems plausible? Certainly. Now proof that this limit exists. Just take
a dice, and let A be the event of the dice showing "6".
--
Result is: You *can't*. There is no mathematical proof that limit
converges, and it won't necessarily converge at all. One cannot show
that or proof that.
At this point, you need to take a step back and think about what you
actually *want*. You want a mathematical model, an axiomatic system,
that describes the properties of probabilities, and from *that* axioms,
derive properties of the system by deduction, then again on firm ground.
You always take for granted that the axioms are right, but you can never
show them to be right.
Take a good(!) math book about probability theory, open the first pages,
and read what you find there: Nothing about "limits", and "relative
frequencies". Instead, you typically find a set of axioms that defines a
system called a "probability space". In some books even completely
without a motivation. "That's it, take it or leave it." This probability
space encodes the properties you expect to find for probabilities, and
you never actually *take* any limit to infinity. You assume that the
system behaves like the axioms, but you never show it - you cannot show
it. In the same sense, you cannot "proof" (mathematically) that Newton's
axioms of classical mechanics are correct. Instead, axioms are the
starting point of a mathematical analysis of the problem: "Suppose our
system would behave like that, what could we say about it?".
Sounds strange if you think about it this way, but that's how
mathematics works.
Quote: I would agree with that theoretically the sources real-entropy^tm is
distinct to observed entropy, but both (in general) are respective a
given model, and because of that the real-entropy^tm is of limited use
(you know your observed entropy is never absolute correct not even in
respect to a specific model, and it's only x% efficient) given the fact
that it is very difficult to get exactly.
Can you show that this limit exists? First of all, there is no limit,
since the files you look at finite (finite hard disk, finite storage,
finite world), so there is no lim -> \infty. So the whole question is
mood. There is nothing like "the entropy of a file" you can measure. You
can measure something like relative frequencies, and from that assume(!)
a probability model (say, simply p(a) = relative frequency of a, though
this is not necessarily the best), and from that compute the entropy.
But there's always one step of "assumption" in here.
Quote: Well actually I do can imagine a source with observation==real, that is
when you say something for example can only and only create "Hamlet",
not a byte more, no variation, nothing different.
I can't. I can never take the limit to have a probability.
Quote: Buuut then real-entropy^tm is identical to the entropy of the coded
stream (let's say LZ, 123k, the compressed file can't give you anything
else than "Hamlet").
You "forget" that you already made an assumption here.
Quote: This of course (for me) can't be true (it means claiming
universal==specific, contradiction), and is of doubtful use, because you
don't want to create a specific model for each finite message. I'm
referring to programs here, you don't want to program a specific model
for a specific single message.
Exactly. Instead, you assume that your message is a "typical message"
generated by a larger, more generic class of models, and you assume(!)
that the relative frequencies in the file derive only little from the
postulated(!) probabilities of your model.
A typical math problem: Allow some "schizophrenia". Somehow, we know
what probabilities are. But let's do as if we forget about all this for
a moment, step back and let's only work with the axioms.
Quote: That's the point of my example, either you have infinite source and you
try to model fragments of it by approximating the source's
real-entropy^tm though piecewise observation (human-brain => "Hamlet",
but also others, infinite others), or you model context-free atomic
universally independent finite sources which makes no sense whatsoever.
Ha! There you go! Of course. But, you never have infinite sources to
begin with. Otherwise, no matter matter what its entropy is, it wouldn't
fit on my hard disk (not in the Shanon model, at least. For the
Kolmogorov model, there are infinite sources with finite
representations, but see above...).
The point is: Even though every model is incomplete, and ill-justified
(from a mathematical p.o.v.) by incomplete induction on the observation,
the resulting *abstract model* that states properties on infinite
sequences works typically (but not always) well enough to compress the
given sequence - or a larger set of sequences.
Quote: In the end I suspect, expect that real-entropy^tm can't be quantified
in absolutes, and the best approximation would come from KC (then we get
model independent entropy, if you don't count KC as a model as well).
Well, that's not "entropy", but a complexity measure, a different beast.
It comes from a different branch of mathematics, and I don't see why the
two should be equal (in general). They have some interesting properties
in common, of course.
Entropy is defined(!) on a probability space. Full stop. Nothing else.
No probability space, no entropy. Entropy is not a property of a finite
string. KC is a property of a pair of a universal Turing machine and a
finite string of symbols from an alphabet. Different input, different
things.
Quote: And simply waiting for the computer with infinite computational
abilities, we just accept our little poor custom-model dependent entropy
as liable approximation, which in most cases because of convenience is
plain order-0 entropy (of a little piece of an infinite string: that are
all the other "files" you want to compress and haven't yet).
Again, assumptions made, you just don't spell them out explicitly:
"Assume our string would be generated by a zero order probability
source, then its entropy would be...". This is "zero-order entropy". Of
course, nobody except mathematicians (brain-damaged people like me) say
things like that, and most people somehow accept this for granted,
probably without even thinking about it. However, it *does* cause some
confusion if you don't make your words very clear, and this unfortunate
gap is then filled by the con-artists we see here over and over again.
Quote: So, I hope I'm not completely out of my mind. And to ask the question
again, maybe lost in my discourse:
Do you actually know a single example, where you do know the
real-entropy^tm of a source?
YES! Those models I make up! And that's all I can do, make up models!
You cannot find them, or proof them. You *make them*.
Entropy is a man-made "mock-up" term, if you like - to put it this
extreme. (Of course, overdoing again to get the point clear!). Given
zero-order i.i.d probability source with p(A) = p(B) = 0.5, *of course*
I KNOW the entropy. What I *do not know* is whether my input is
generated from such a source. I can't. I never do. I assume. I form by
analysis on the basis that it would. If it doesn't, my conclusions are mood.
I assume that you mean the latter question, not the former (i.e. what do
you mean by "know"?). And of course I misunderstood you on purpose.
That's one of my hobbies (drives some of my students nuts, I know.).
Quote: Oh, I don't allow to count artificial constructed sources according an
entropy. My emphasis is more about an equivalent to "know real-Pi^tm",
which was observed, then approximated, then modelled, then absolutely
determined, and now the verify a circle in reality though the model Pi,
instead of verifying Pi as before. Taking the number "5" and saying
"this constant is '5', exactly and always", is not the sort of answer I
thought of.
Sure, I know what you mean. Probably not what you said to begin with,
but clearly, I understand perfectly. And you have my answer. No, I
don't. But that is what a model is: It isn't proven. It is made.
The fun part is: It nevertheless works, because my (or rather, those of
other smart people designing compressors) guesses are often "good enough
to make it work".
So long,
Thomas
PS: Yes, it seems strange indeed how mathematicians think. But again,
step back for a while - abstract models like probability are so in our
minds we no longer see that they are abstract. Here's an even simpler
example:
Ever seen a two?
I mean the number, not the digit. |
|
|
| Back to top |
|
|
|
| biject... |
Posted: Fri Oct 16, 2009 2:32 pm |
|
|
|
Guest
|
On Oct 16, 5:21 am, Thomas Richter <t... at (no spam) math.tu-berlin.de> wrote:
Quote: glen herrmannsfeldt schrieb:
.....
Lets get away from the math and look at actual real compression
problems. If I have a file I want to compress, already stored
in disk, I can count the number of each character in that file.
I then don't care about 'typical' but just this one file.
Sure enough, you do what all other people also do: You built up a
probability model from the observed characters.
Many compression algorithms allow for stream compression,
where output can be generated before the end of the input
string is known. That is certainly nice, but is not always
required. I was going to look at the man page for pack and compact,
but they don't seem to exist on the systems I use.
Such stream compressors are the back-end of many compression standards.
LZ and its variants are stream compressors. The MQ coder in JPEG 2000
and JBIG is a stream compressor.
Nevertheless, they all have a model that *guesses* probabilities.
If you have the whole file of concern there are no
"guesses" you can calculate the probabilities of each
symbol exactly.
However if you compress that file you may miss higher
order dependencies between the characters.
Example if file is long but the characters are only
"ABCD" if you assign two bits to each character that's
the best you can do based only on the individual
probabilities if each occurs the same number of times
in the file. However if occurs and a repeated string
of exactly ABCD then using the relative frequencies
a mistake if compression is your goal.
So when you write a compressor you have to assume
in advance if your only using the individual counts
of each character or higher order terms.
The problem with compression is the more special
cases you want to allow for the longer the final
compressed file will be since if you allow for more
cases the compressed data file has to communicate
that to the decompressor.
If you want the entropy of the data then sadly
you have to base that on the model you choose for
compression. You can fool yourself into thinking
you have a super compressor since you get close
to what you calculate as the entropy. But someone
else can come along and see relationships you mess
and get far better compression than what you think
the entropy is.
Take a given file. Look at the set of all files
made of a permutation of such a file. If you goal
is to write the best compressor based on this class
of files then you can do no better that writing
a compressor that just uses the probabilities of
each symbol. In this case an arithmetic would be
the best period if the counts not part of the file.
Note at points where it looks nice for huffman the
arithmetic exactly matches in this case where counts
not used as part of output.
When you have to include the table data then
the situations changes. It's cheaper to send info
on the tree with huffman that it is with arithmetic.
In this case there are times when the huffman
best and times when arithmetic best. However the
arithmetic to me appears more beautiful Since
for one thing to get an optimal compression you
don't need to use two passes through the file you
can compress on one pass. A proper adaptive
arithmetic is the same an equivalent a static
arithmetic with a table.
Of course you can compress with an adaptive
huffman but sadly they don't match with a static
huffman. The adaptive arithmetic would compress
all files in your special set the same length give
or take a byte. This could be called the entropy
of the files in the set. However if you use an
adaptive vitter huffman you seldom get the same
lenght for permutations of the file. Sometimes
longer sometimes shorter.
Of course I am only talking about bijective
file compressors in the above like arb255.zip
David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link" |
|
|
| Back to top |
|
|
|
| Thomas Richter... |
Posted: Fri Oct 16, 2009 3:21 pm |
|
|
|
Guest
|
glen herrmannsfeldt schrieb:
Quote: Back to the OP: There, we had the question what we can say about
Huffman compression if we know something about the entropy of the
source. Entropy, again, is a term computed from probabilities. Not
from relative frequencies. Mixing the two is a common error, but
that's not the same thing. Probabilities can be assumed and are part
of the model. Relative frequencies are measured on an input string. If
we only know the entropy, we still do not know anything about the
string. Only how it would typically look like, but not how it does
look like.
Lets get away from the math and look at actual real compression
problems. If I have a file I want to compress, already stored
in disk, I can count the number of each character in that file.
I then don't care about 'typical' but just this one file.
Sure enough, you do what all other people also do: You built up a
probability model from the observed characters.
Quote: Many compression algorithms allow for stream compression,
where output can be generated before the end of the input
string is known. That is certainly nice, but is not always
required. I was going to look at the man page for pack and compact,
but they don't seem to exist on the systems I use.
Such stream compressors are the back-end of many compression standards.
LZ and its variants are stream compressors. The MQ coder in JPEG 2000
and JBIG is a stream compressor.
Nevertheless, they all have a model that *guesses* probabilities.
Quote: That's just backwards from what I'd say. Entropy and probabilities are
nothing you "know by observation", i.e. it is not a measurable quantity.
Probabilities and entropies are *assumed*. They are part of a model. You
cannot compute them. You can assume them.
If the requirement is 'compress file X, as stored on disk' then yes
I do know the probabilities.
No, you don't. Again, you mix probability with relative frequency. A
probability is a measure on a probability space - you cannot "measure"
such a beast. You can assume - or rather choose - a model whose
probabilities are equal to the relative frequencies. But this is a
*choice*, not a *measurement*. It might not even be the best possible
choice for a model - it is rather a naive model that can be improved
considerably.
Quote: For stream compression, say writing data to a tape while it is
being generated, then I don't know the frequencies in advance.
There you go: "Frequency". Yes, exactly that. Not "Probability"!
Quote: I might have one buffer full, but not the whole data stream.
In that case, the algorithm has to rely on the observations of
previous data, that is, it is causal.
Some people need one type of algorithm, others can use either.
All true, but in neither case you "know" the probability from the file.
So lnog,
Thomas |
|
|
| Back to top |
|
|
|
| Thomas Richter... |
Posted: Fri Oct 16, 2009 8:15 pm |
|
|
|
Guest
|
biject schrieb:
Quote: If you have the whole file of concern there are no
"guesses" you can calculate the probabilities of each
symbol exactly.
NO! Damnit, no! You also confuse probabilities with relative frequencies.
You can make up a model that fits to those frequencies, but these *aren't*
probabilities.
So long,
Thomas |
|
|
| Back to top |
|
|
|
|
|
All times are GMT
The time now is Sun Nov 29, 2009 4:40 pm
|
|