Main Page | Report this Page
Computers Forum Index  »  Computer Compression  »  Entropy...
Page 1 of 5    Goto page 1, 2, 3, 4, 5  Next

Entropy...

Author Message
Mok-Kong Shen...
Posted: Tue Sep 08, 2009 7:33 pm
Guest
Hi,

I have a very dumb question which I hope is nevertheless not
entirely OT here.

I read in textbooks that the entropy per letter in English is
something like one to one and a half bits per letter. Suppose
I take a certain book and copy out the text from p. 100 - 200,
I'll get 1.0 - 1.5 times the number of letters there as entropy
in units of bits. On the other hand, the same text and hence the
information contained in it is entirely determined by a string
like "ISBN = ......, p. 100 - 200" which is very much shorter and
hence has very little entropy, using the same kind of estimate
of entropy per letter as previously.

How could this apparent paradox be explained away?

Thanks,

M. K. Shen
 
glen herrmannsfeldt...
Posted: Tue Sep 08, 2009 7:33 pm
Guest
Mok-Kong Shen <mok-kong.shen at (no spam) t-online.de> wrote:

< I have a very dumb question which I hope is nevertheless not
< entirely OT here.

< I read in textbooks that the entropy per letter in English is
< something like one to one and a half bits per letter. Suppose
< I take a certain book and copy out the text from p. 100 - 200,
< I'll get 1.0 - 1.5 times the number of letters there as entropy
< in units of bits. On the other hand, the same text and hence the
< information contained in it is entirely determined by a string
< like "ISBN = ......, p. 100 - 200" which is very much shorter and
< hence has very little entropy, using the same kind of estimate
< of entropy per letter as previously.

< How could this apparent paradox be explained away?

I believe this is related to the principle of information hiding.

http://en.wikipedia.org/wiki/Information_hiding

Note that in the second case, most of the important information,
the actual text, is not there. Trying to make measurements on
data that isn't actually there is not a good idea.

Here is another way to think about. Send both strings to someone
on Mars. (that is, a martian.) Consider the data as reconstructed
by the martian, with no access to a library.

-- glen
 
Mok-Kong Shen...
Posted: Tue Sep 08, 2009 9:26 pm
Guest
glen herrmannsfeldt wrote:
Quote:
Mok-Kong Shen <mok-kong.shen at (no spam) t-online.de> wrote:

I read in textbooks that the entropy per letter in English is
something like one to one and a half bits per letter. Suppose
I take a certain book and copy out the text from p. 100 - 200,
I'll get 1.0 - 1.5 times the number of letters there as entropy
in units of bits. On the other hand, the same text and hence the
information contained in it is entirely determined by a string
like "ISBN = ......, p. 100 - 200" which is very much shorter and
hence has very little entropy, using the same kind of estimate
of entropy per letter as previously.

How could this apparent paradox be explained away?

I believe this is related to the principle of information hiding.

http://en.wikipedia.org/wiki/Information_hiding

Note that in the second case, most of the important information,
the actual text, is not there. Trying to make measurements on
data that isn't actually there is not a good idea.

Here is another way to think about. Send both strings to someone
on Mars. (that is, a martian.) Consider the data as reconstructed
by the martian, with no access to a library.

However, what the textbooks say is destined for us, on the
earth, who do have access to libraries. (It takes some time
and trouble to get the book, but that's all. Further lots of
books could be accessed via the internet.) So that would mean
that the claimed 1.0 - 1.5 bits entropy per letter in English
is practically "nonsense" and useless for us, wouldn't it?

Thanks,

M. K. Shen
 
Mark Adler...
Posted: Wed Sep 09, 2009 9:45 am
Guest
On 2009-09-08 10:26:05 -0700, Mok-Kong Shen <mok-kong.shen at (no spam) t-online.de> said:
Quote:
So that would mean
that the claimed 1.0 - 1.5 bits entropy per letter in English
is practically "nonsense" and useless for us, wouldn't it?

No. The entropy is defined with respect to a *self-contained* message
that includes both the compressed data and the decompressor, with all
that's needed on the other end being a general purpose computer to
execute the decompressor and access to no other data. (See Kolmogorov
Complexity in wikipedia.) Then the 1.0 to 1.5 bits per letter
(actually 0.6 to 1.3 bits per letter from experiments with human
predictors) is a good guess of the best you could do for the number of
bits in that message.

If you consider making practical use of the identifier method, e.g. an
ISBN, then you need to go fetch the book from an online server or off
of a mass storage device. Then you're right back to the question of
how to compress that book to reduce the required bandwidth or disk
space. The ISBN won't help you then.

In fact, using the identifier method, we could simply assign a number
to every hard drive, CD, DVD, flash stick, etc. in the world, and with
a relatively small number of bits with additional offsets and lengths
refer to any piece of data in existence. Now *that's* what I would
call useless.

Mark
 
Thomas Richter...
Posted: Wed Sep 09, 2009 10:58 am
Guest
Mok-Kong Shen schrieb:
Quote:
Hi,

I have a very dumb question which I hope is nevertheless not
entirely OT here.

I read in textbooks that the entropy per letter in English is
something like one to one and a half bits per letter. Suppose
I take a certain book and copy out the text from p. 100 - 200,
I'll get 1.0 - 1.5 times the number of letters there as entropy
in units of bits. On the other hand, the same text and hence the
information contained in it is entirely determined by a string
like "ISBN = ......, p. 100 - 200" which is very much shorter and
hence has very little entropy, using the same kind of estimate
of entropy per letter as previously.

How could this apparent paradox be explained away?

The paradox is explained by carefully defining what "entropy" is, and from what you can measure entropy.
The problem you face is that you seem to believe one can compute *the* entropy of a dataset - you cannot.
Entropy is a concept defined on "random sources", an abstract mathematical model like a machine that
emits a number/character/word if you press a button of that machine, and the machine proceeds in some kind
of random way, i.e. it is mathematically described by a random process with or without internal memory,
uniform or not-uniform etc. From *that* you can compute the entropy.

What you actually mean if you claim "the entropy of an english book" is: "Consider a random source that
generates texts whose letter statistics approximates that of an english book (more details to be inserted
into the model), then the entropy of said source would be....".

IOW, it's the model you compute the entropy of, not the book.

To give an example, consider rolling a dice, and writing down each result twice:

1,1,4,4,1,1,2,2,6,6,5,5,2,2,3,3,4,4,....

Now, certainly, the relative frequency of each number here is close to 1/6th for the number of rolls going
to infinity, so one could conclude that *the* entropy of this sequence is -\sum_i 1/6 log_2 1/6, which is the
maximal entropy of a six-letter alphabet. Nevertheless, you can "compress" this sequence by an obvious way by removing
half of the digits and telling the decompressor to double each symbol.

What happened here is that you replaced your initial model (all digits from 1-6 are equally probable) by a better
model (all digits are equally probable, but each digit is repeated twice) and thus get a different result.

So long,
Thomas
 
glen herrmannsfeldt...
Posted: Wed Sep 09, 2009 1:59 pm
Guest
Thomas Richter <thor at (no spam) math.tu-berlin.de> wrote:
(snip)
< Again, the paradox is resolved by noting that you declare something as
< "entropy" that simply isn't that. Entropy is *not* a function of the
< message. Entropy is a function of the *source*.

(snip)
< Kolmogorov complexity is a different beast, and from a different domain.
< It is *not* the entropy, but rather the length of the shortest program
< that reproduces a given message. This concept makes only sense if you
< consider infinitely long messages (or rather, consider messages whose
< length tends to infinity), otherwise it is ill-defined. So you could
< neither measure *the complexity* (in fact, it can be shown that no
< program can do that, so the term remains abstract, quite unlike the
< entropy).

< The Kolmogorov complexity of the ISBN resp. the book, or the two
< speakers are indeed identical as you need to reproduce the book contents
< from the ISBN, and thus the "decompressor" algorithm needs to have the
< book ready somewhere in it; thus, it doesn't matter whether you compress
< the book in a smart way, or rather store its index. The decompressor is
< not allowed to use *any* external resources and must, by itself,
< reproduce the book contents letter by letter. The ISBN doesn't help much
< to do that, though it could. Again, remember that there is no
< "decompressor and data to decompress" in the Kolmogorov world. there is
< *only* the decompressor and nothing else. And its size matters.

I believe that there is no paradox, but the problem isn't simple,
at least for real problems. Should the "decompressor" also include
the description of the computer hardware it is expected to run on?
(For pseudo-code, the description of the pseudo-processor.)

Not so long ago, I was writing PDF generators based on the PDF
description of the PDF format. For archival data, someone will need
a non-compressed form of the format so that they can read the PDF
descriptions of the format. (Not mentioning the ability to read
the actual media that the data is stored on.)

-- glen
 
Mok-Kong Shen...
Posted: Wed Sep 09, 2009 5:16 pm
Guest
Mok-Kong Shen wrote:
[snip]
Quote:
How could this apparent paradox be explained away?

I suppose that my use of a book in the paradox was a major
point that led to critiques in two follow-ups (Adler and
Richter). Allow me to try something different:

In daily discourse, there are persons who speak tersely and to
the point, while others speak in lengthy and round-about ways,
though expressing exactly the same. Now we certainly could
compare the information content (hence entropy) of what two
persons, one of each kind, say, couldn't we? Both utterences
are common English, but of quite different lengths and we
have a paradox as before, I would think.

Thanks,

M. K. Shen
 
Thomas Richter...
Posted: Wed Sep 09, 2009 5:35 pm
Guest
Mok-Kong Shen schrieb:
Quote:
Mok-Kong Shen wrote:
[snip]
How could this apparent paradox be explained away?

I suppose that my use of a book in the paradox was a major
point that led to critiques in two follow-ups (Adler and
Richter). Allow me to try something different:

In daily discourse, there are persons who speak tersely and to
the point, while others speak in lengthy and round-about ways,
though expressing exactly the same. Now we certainly could
compare the information content (hence entropy) of what two
persons, one of each kind, say, couldn't we? Both utterences
are common English, but of quite different lengths and we
have a paradox as before, I would think.

Again, the paradox is resolved by noting that you declare something as
"entropy" that simply isn't that. Entropy is *not* a function of the
message. Entropy is a function of the *source*.

What you could do is assign probability models to the two speakers that
mimic their typical utterance, and *then* measure the entropy of the two
probability sources. Nevertheless, such a model would be highly
complicated, the entropy would be hard to compute, but in principle the
operation is well-defined then. But then, I no longer see a paradox. You
get two numbers.

Kolmogorov complexity is a different beast, and from a different domain.
It is *not* the entropy, but rather the length of the shortest program
that reproduces a given message. This concept makes only sense if you
consider infinitely long messages (or rather, consider messages whose
length tends to infinity), otherwise it is ill-defined. So you could
neither measure *the complexity* (in fact, it can be shown that no
program can do that, so the term remains abstract, quite unlike the
entropy).

The Kolmogorov complexity of the ISBN resp. the book, or the two
speakers are indeed identical as you need to reproduce the book contents
from the ISBN, and thus the "decompressor" algorithm needs to have the
book ready somewhere in it; thus, it doesn't matter whether you compress
the book in a smart way, or rather store its index. The decompressor is
not allowed to use *any* external resources and must, by itself,
reproduce the book contents letter by letter. The ISBN doesn't help much
to do that, though it could. Again, remember that there is no
"decompressor and data to decompress" in the Kolmogorov world. there is
*only* the decompressor and nothing else. And its size matters.


So long,
Thomas
 
Thomas Richter...
Posted: Wed Sep 09, 2009 7:05 pm
Guest
glen herrmannsfeldt schrieb:

Quote:
The Kolmogorov complexity of the ISBN resp. the book, or the two
speakers are indeed identical as you need to reproduce the book contents
from the ISBN, and thus the "decompressor" algorithm needs to have the
book ready somewhere in it; thus, it doesn't matter whether you compress
the book in a smart way, or rather store its index. The decompressor is
not allowed to use *any* external resources and must, by itself,
reproduce the book contents letter by letter. The ISBN doesn't help much
to do that, though it could. Again, remember that there is no
"decompressor and data to decompress" in the Kolmogorov world. there is
*only* the decompressor and nothing else. And its size matters.

I believe that there is no paradox, but the problem isn't simple,
at least for real problems. Should the "decompressor" also include
the description of the computer hardware it is expected to run on?

It may, but it doesn't matter. It does not matter because all it takes
is a Turing-complete machine, and all machines and pseudo- or real codes
are (unless they are brain-dead) Turing complete, and each Turing
machine can emulate any other by a suitable program. And then it doesn't
matter because any description of such a pseudo-code is only of finite
length, but the Kolmogorov-complexity is defined on infinitely long
messages only.

Of course, the whole concept is completely abstract - it is provably
impossible to evaluate the complexity by any algorithm.

So long,
Thomas
 
Mok-Kong Shen...
Posted: Thu Sep 10, 2009 1:06 pm
Guest
Phil Carmody wrote:

Quote:
The part of the paradox tht you seem to be missing is that there
are no known random numbers.

Sorry, I doubt that I captured what you mean. Could you please
elaborate a bit? (Note that the paradox is based on the use
of the "rule of thumb" that I alluded to.)

Thanks,

M. K. Shen
 
Mark Nelson...
Posted: Thu Sep 10, 2009 1:49 pm
Guest
On Sep 8, 10:33 am, Mok-Kong Shen <mok-kong.s... at (no spam) t-online.de> wrote:
Quote:

How could this apparent paradox be explained away?


I know you got a lot of detailed responses to this question, and not
all of them are in agreement with one another.

The most important thing to know about all this is that in our
context, there is no such thing as "absolute entropy."

The entropy of a message is always dependent on the model you are
using.

If the model we are using to compress the content of books is the
Library of Congress, an entire book can be compressed using a 10 digit
ISBN.

While this seems rather remarkable, and it does mean that the book has
only a few bits of entropy, it is also not a very useful model, as it
is can only be use to compress a very limited number of sequences.

Kolmogorov complexity is closely related, and when it is used on a
general purpose computer it may be seen as a way to try to approach
something that we might call "absolute entropy". But complexity theory
doesn't require that you use a general purpose computer to generate
sequences - you can build a computer that has the library of congress
as a database, and you're right back where you started.

So the short answer to your question is: there is no paradox, you have
simply stated the question incorrectly. Textbooks that say English has
an entropy of 1.5 bits per letter are not giving you the whole story.
What they should say is that English text has an entropy of 1.5 bits
per letter when coded using the brain of a native speaker as a model.

Kolmogorov complexity on an arbitrary, general-purpose machine would
satisfy most people's gut feeling for absolute entropy, but at the
same time this might not be all that useful.

- Mark
 
Denis...
Posted: Thu Sep 10, 2009 2:13 pm
Guest
Thomas Richter ha scritto:
Quote:
but the Kolmogorov-complexity is defined on infinitely long
messages only.


Why ? What do you mean ?


I have a long list of references in contraddiction with this claim.
 
Thomas Richter...
Posted: Thu Sep 10, 2009 2:30 pm
Guest
Mok-Kong Shen schrieb:
Quote:
Phil Carmody wrote:

The part of the paradox tht you seem to be missing is that there
are no known random numbers.

Sorry, I doubt that I captured what you mean. Could you please
elaborate a bit? (Note that the paradox is based on the use
of the "rule of thumb" that I alluded to.)

I don't know what Phil means, but I I would probably state something similar: Numbers
or sequences of numbers aren't random. Just the procedures that generated them, their
sources, are.

For example, is the following sequence random:

0,4,9,7,1,1,6,8,5,6,4,5,1,3

This question makes no sense. No specific sequence you can point at is random. Only
the way how the sequence was generated may or may not be random.

So long,
Thomas
 
Thomas Richter...
Posted: Thu Sep 10, 2009 5:25 pm
Guest
Denis schrieb:
Quote:
Thomas Richter ha scritto:
but the Kolmogorov-complexity is defined on infinitely long messages
only.


Why ? What do you mean ?

Because it makes no sense otherwise. I mean exactly that: Kolmogorov complexity
cannot be well-defined on finite messages. Otherwise, implementation details of
the Turing machine running the algorithm to create the message come into play.

Quote:
I have a long list of references in contraddiction with this claim.

Burn them.

So long,
Thomas
 
glen herrmannsfeldt...
Posted: Thu Sep 10, 2009 5:50 pm
Guest
Mok-Kong Shen <mok-kong.shen at (no spam) t-online.de> wrote:
< Thomas Richter wrote:
(snip)

<> For example, is the following sequence random:

<> 0,4,9,7,1,1,6,8,5,6,4,5,1,3

<> This question makes no sense. No specific sequence you can
<> point at is random. Only the way how the sequence was
<> generated may or may not be random.

< I have some difficulty of understanding here. How could one ever
< know the way how the sequence was generated may or may not be
< random without investigating the sequence obtained? Should one
< investigate the construction/mechanism of software/hardware instead?

Is the following sequence of head/tails from coin flips random?


H, H, H, H, H

It doesn't look very random, but randomly occurs about 3% of the time.
Any finite sequence (over a finite alphabet) can be generate
randomly. It is different for an infinite length sequence.

-- glen
 
 
Page 1 of 5    Goto page 1, 2, 3, 4, 5  Next
All times are GMT
The time now is Sun Nov 29, 2009 11:46 am