Main Page | Report this Page
Computers Forum Index  »  Computer Compression  »  Entropy...
Page 3 of 5    Goto page Previous  1, 2, 3, 4, 5  Next

Entropy...

Author Message
Mok-Kong Shen...
Posted: Fri Sep 11, 2009 1:22 pm
Guest
Thomas Richter wrote:

Quote:
If you understand information in the information theoretic sense, the
entropy of a source is the average amount of information you get from
the source per letter. Or to be precise, entropy is the expected value
of information. Information, however, is probably not what you believe
what it is. It is simply the logarithm of the symbol probability. That
is, the average number of yes-no questions you need to guess the
message. In the framework of this "riddle game", information (as in
entropy) is the information as you know, but only then. It is *not* the
information as you may believe it is.

I like to return to what I said previously regarding knowledge
and resources. I mean the information carried by a text string
(physically present in characters or bits) is inherently dependent
on a person's knowledge and resources. As a rather trivial example:
If I get a sentence in a language, say Greek, that I don't know, it
means for me virtually nothing. If, on the other hand, I had learnt
Greek very well, I get the full information out of it. If my
knowledge of Greek is at some intermediate level, I'll understand
part of it, the amount captured being dependent on whether I have
a dictionary (it matters whether the dictionary is big or small)
and perhaps also a book of grammar of Greek at my disposal. It
may finally also depend on how much time I could afford, in the
particular situation I am in, to "decipher" that sentence. So the
information content of a natural language sentence cannot, in my
humble opinion, be evaluated "simply" by applying a "rule of thumb"
(universal) estimate on the length in characters irrespective of
the reader's knowledge and resources. If one disregard this and
somehow get a numerical value and denote it as "entropy", then
I find it hard to see its (practical) usefulness any more than,
say, the length of the sentence.

Thanks,

M. K. Shen
 
Mok-Kong Shen...
Posted: Fri Sep 11, 2009 1:22 pm
Guest
Thomas Richter wrote:
Quote:
Mok-Kong Shen wrote:

Is the following sequence of head/tails from coin flips random?


H, H, H, H, H

It doesn't look very random, but randomly occurs about 3% of the time.
Any finite sequence (over a finite alphabet) can be generate
randomly. It is different for an infinite length sequence.

Question: how do you know that a given source is or is not random?

You don't. You simply *assume* that, and then design a model that
generates sequences that look like the ones you want to compress. And
from that model you compute the entropy.
[snip]


Question: How do I in practice design the model? Is there a standard
"recipe" for that or many different models could be designed (depending
perhaps on one's ingenuity)? If the latter, which one of the many
candidates should one choose? In particular, which model would you
propose for the sequence in the present case and why?

Thanks,

M. K. Shen
 
Thomas Richter...
Posted: Fri Sep 11, 2009 7:28 pm
Guest
Denis wrote:
Quote:
Thomas Richter ha scritto:

Well, we now enter the realm of "existence or not". Basically,
Kolmogorov complexity exists in the same sense dragons exist. As an
abstract concept. You cannot potentially measure it, or design an
algorithm for it, you can only use it as a mathematical abstractum to
prove theorems about it.



Kolmogorov complexity exists in the same sense dragons exist and this is
true for every mathematical concept.

I can effectively measure a Kolmogorov complexity of a finite bit string
in a real existing universal machine . Such measure can be wrong .

No, it is rather useless on finite strings. Why? Ok, here is why:

For any finite message A, there is a Turing machine T that generates A
by a single symbol on the (input) tape. Thus, the complexity of such a
message is the lowest possible there is - on T.

Reason? Simply build the message into the (finite state) Turing machine,
done.

The reason why that does not work for infinitely long messages is that
Turing machines are constrained to have only finite internal state. It
only starts to be interesting if the message size is not constrained.

Quote:
But if I put 2 balls into a box and then I put another ball into the
same box by mathematic concept I can say into the box there are 3 balls.
Are you sure ? Why are you sure there are 3 balls into the box and that
the measure of Kolmogrov complexity are wrong ?

I'm not saying it is wrong. I'm saying it is useless for practical
approaches.

Quote:
So the Kolmogorov complexity exist in the realm like every mathematical
abstract concept exist in the realm.

I haven't said anything else. It is useful to prove theorems (I'm a
mathematician, you know). It still doesn't make the concept useful on
finite messages, and it still doesn't make it a useful approach for
practical problems, in the same sense as dragons are a practical problem
for a biologist. That doesn't make them un-interesting, but just not
interesting for these people.

So long,
Thomas
 
Thomas Richter...
Posted: Fri Sep 11, 2009 7:34 pm
Guest
Mok-Kong Shen wrote:
Quote:
Thomas Richter wrote:

If you understand information in the information theoretic sense, the
entropy of a source is the average amount of information you get from
the source per letter. Or to be precise, entropy is the expected value
of information. Information, however, is probably not what you believe
what it is. It is simply the logarithm of the symbol probability. That
is, the average number of yes-no questions you need to guess the
message. In the framework of this "riddle game", information (as in
entropy) is the information as you know, but only then. It is *not*
the information as you may believe it is.

I like to return to what I said previously regarding knowledge
and resources. I mean the information carried by a text string
(physically present in characters or bits) is inherently dependent
on a person's knowledge and resources.

You are still on the wrong track. This is *not* the entropy we talk
about. For the entropy we talk about, we *do* know the alphabet the
message is made of, and we do know the probabilities and their
dependencies. It is *not* about interpreting the message, only getting
the symbols it is made of. The symbols could be Chinese, and all you get
is a table of all possible symbols in chinese, and their (conditioned,
unconditioned...) probabilities, i.e. a description of the source (a
chinese monkey, if you like).

Quote:
As a rather trivial example:
If I get a sentence in a language, say Greek, that I don't know, it
means for me virtually nothing. If, on the other hand, I had learnt
Greek very well, I get the full information out of it.

Wrong track. The setting is clear for entropy. You *do not know* the
message, you *do* know the probabilities. What you're talking about is
probably some abstract high-level concept of conditioned entropy (which
does exist, but is still something differently defined).

*Forget* your interpretation of "information" as "something I can
understand and assign a meaning to". This is not *meant* by Shannon
entropy. It's a damn pretty simple riddle on symbols from a finite
alphabet.

So long,
Thomas
 
Thomas Richter...
Posted: Fri Sep 11, 2009 7:49 pm
Guest
Mok-Kong Shen wrote:
Quote:
Thomas Richter wrote:
Mok-Kong Shen wrote:

Is the following sequence of head/tails from coin flips random?


H, H, H, H, H

It doesn't look very random, but randomly occurs about 3% of the time.
Any finite sequence (over a finite alphabet) can be generate
randomly. It is different for an infinite length sequence.

Question: how do you know that a given source is or is not random?

You don't. You simply *assume* that, and then design a model that
generates sequences that look like the ones you want to compress. And
from that model you compute the entropy.
[snip]

Question: How do I in practice design the model? Is there a standard
"recipe" for that or many different models could be designed (depending
perhaps on one's ingenuity)?

Now *that* is of course a pretty good question, and there is no simple
answer to that. It requires some ingenuity, experience, and insight into
what the data you want to compress *typically* would look like.

So here are examples:

For LZ-type of compressions, the model is that the message is made out
of collections of symbols (words) that are likely to re-appear in the
message over again. Thus, the LZ model basically assumes that certain
words already present will likely appear again later. This is a good
model for natural language because it is repetitive on the word (and
less on the symbol) level.

For image type of compressions, one popular model is that the image data
in the Fourier space is basically i.i.d., that is an uncorrelated set of
random variables with identical distribution (the JPEG model). More
advanced compression types (JPEG 2000) use the idea that the image data
is more or less the outcome of a Markov random field in the wavelet
domain (this is what the EBCOT type of compression models).

The actual back-end, namely the part of the algorithm that actually
finds the shortest possible representation of the data *after* the model
has been identified is not much different. Both LZ *and* JPEG use (or
may use) a Huffman coding - all the same at this stage of entropy
compression. JPEG 2000 uses arithmetic coding, so does JBIG.

The reason why JPEG doesn't work well for text or LZ doesn't work well
for images (even though GIF or PNG apply it to them) is simply that the
front-end model for the data is different.

And of course, you never know whether that model is the best possible
model; if it were, we could simply forget about any future compression
attempts. Compression research is about finding proper models for data
we want to compress. Not about the entropy coding part, that stuff is
solved *long ago*, by Huffman for example.

Quote:
If the latter, which one of the many
candidates should one choose? In particular, which model would you
propose for the sequence in the present case and why?

I would choose a candidate that performs well on the data I want to
compress. That is, I use test input that is *typical* to the type of
data I would like my compression algorithm to compress, and then design
(or evaluate) potential models for it.

ISBN is not a potentially good model for natural language. It is a good
model to compress all indexed books I have at the time the compressor
was designed, so it does something different. With this example, I
cannot "ISBN"-compress a source I would consider typical for a natural
language text compressor, as for example this posting, so it's not a
good model for natural language. It might be a good model for libraries
that want to keep books in a database.

But, anyhow, the entropy does of course depend on the model, because the
model says what the random source is, and thus there is no "the
entropy", but only "a entropy", namely the one that depends on the model
- where the model says what a symbol is and what the probabilities are.

For LZ, the "symbols" are pairs of letters and indicies (namely letters
that extend already already existing words in the data base, and indices
that address the database in one way or another). In the JPEG example,
the symbols are pairs of amplitudes and runlengths in the 8x8 block
zig-zag scan, etc.

And each time, the entropy is a different one.

So long,
Thomas
 
Thomas Richter...
Posted: Fri Sep 11, 2009 7:56 pm
Guest
Mok-Kong Shen wrote:
Quote:
Thomas Richter wrote:
Mok-Kong Shen wrote:
[snip]
Should one
investigate the construction/mechanism of software/hardware instead?

For Shannon entropy, one should. It depends on the *construction* of
the sequence, not the sequence.

What should one do, if the "consturction" is unknown or practically
unavailable in required details?

It hardly ever is. See my response above. Models are "made up", not
"found" in all practical implementations.

Quote:
As an example in natural language:
If I see a sentence on the internet, it could be written by a human,
but it could as well be the result of some sophisticated automatic
natural language processing system or maybe even the chance product
of the monkey of the British Museum? (BTW, I once saw paintings by
apes that were hard to distinguish from works from artists of certain
schools.)

'course you could. Would also make a pretty good model, but it wouldn't
fit well to all type of messages. Thus, if your input is written by a
monkey, then, well, it might not. Whether that is bad or good is a
matter whether you expect your system to compress monkey-scribble. (-;

What's important is that this is a different model.

Site-remark (and a known-joke to some of you):

The JPEG has an ongoing internal joke about the perfect image
compression algorithm called "Lena-compress". The joke works by knowing
that the "Lena" image is one of the traditional test-images that is
always used for testing image compression algorithms.

"Lena-compress" is the all-time ultimate image compression algorithm:

- If the image is lena, write a one.
- If the image is not lena, write a zero, and append the original.

This compression scheme works (i.e. is universal) and it would work
perfectly on the "traditional test set".

It is still not a good compressor - why?

Because the model "stinks". The model is simply "I expect lena", which
is wrong for the "typical input" (except if the typical input is the
traditional test set, of course). Typical input would be "a larger set
of natural images I want to look at".

Hope that made the point a bit clearer. The question is "what is my
typical data". The punch-line of the joke is "for the JPEG, the typical
data was so often Lena that we could almost turn that into a compression
algorithm". So it's a clever joke. Well, I hope... (-;

So long,
Thomas
 
Thomas Richter...
Posted: Fri Sep 11, 2009 8:01 pm
Guest
Mok-Kong Shen wrote:
Quote:
Thomas Richter wrote:
Mok-Kong Shen wrote:

Excuse me for not yet fully understanding you. Under the correct
definition of the entropy estimate you gave above and assuming that
the range of, say, 1.0 - 1.5 bits per character is ok, wouldn't I
still get the same sort of paradox, in that a short string is shown
to be equivalent (one being obtainable from the other) to a very
long string?

You mean, in the book example? Entropy doesn't measure whether
"information" (in the laymens' sense) is "equivalent". It doesn't care
for this. You still seem to confuse "entropy" (a strictly defined
term) with "the kind of stuff I learn when I would read the book",
which is *not* the type of information Shannon talks about. Shannon
talks only about random sources, and their information content per
symbol. Not about a high-level "content-type" like entropy.

There is, I think, at least one problem here when natural language
is concerned. How correct is the assumption that human utterance
is a random source?

The "correctness" of a model is measured by the entropy when applying
that model to typical input. The better the model, the lower the entropy.

Quote:
You wrote in the other post that one needs to
know the "construction". Would one need eventually details of the
brain or the anatomy of the vocal tract etc.?

For a pretty damn good model, one could. For most day purposes,
something simpler is almost always sufficient, and experience tells that
any improvement beyond such models is usually marginal.

Quote:
If your model is "I have a huge collection of books and I have a
random source that picks one of them at random", then you can define
the entropy of said source. Then it doesn't matter whether you
identify the books by title or ISBN (provided the former is unique).
This is a *different* problem from "I want to read the book and
understand what the author talks about", which entropy cannot answer.
It takes a human, and side-conditions (knowledge about the world, the
period the author was writing the book, your environment and feelings
how you associate the book contents with your personal life etc) to
define that - and that's a much harder problem.

OK, let me do as you said: I randomly pick the book and generate
randomly two numbers which happen to come out as 100 and 200. I get,
as before, two text strings, one short and one long. Do they have
the same entropy?

*Sigh* How should I know. A single example doesn't matter. You pick a
*huge* number of books, and measure the description length on average.
Then, you achieve an approximation of the entropy. The model that gets
the lower entropy is the better, of course.

If the books are known upfront, then knowing the ISBN is sufficient
(i.e. this is then a good model). However, if a new book comes out for
which you don't have an ISBN yet in your database, compression would
fail. However, a classical text compressor like LZ would not.

So long,
Thomas
 
Thomas Richter...
Posted: Fri Sep 11, 2009 8:06 pm
Guest
Mok-Kong Shen wrote:
Quote:
Thomas Richter wrote:
Mok-Kong Shen wrote:

Richter wrote that Kolmogorov complexity caanot be well-defined
on finite messages. That implies in my understanding that this
measure is only of high theoretical interest but unfortunately can
never be applied to practical strings that occur in texts or
discourse. In fact, I remember to have heard that there isn't
todate any "practical" computation of the value of Kolmogorov
complexity of real-life strings.

Even worse. It is not even so that it cannot be defined practically or
measured practically. One can even *prove* that there is no computer
algorithm that could possibly measure it. Thus, it is impossible to
define it in practical terms.

BTW, I now vaguely (not very sure) remember on the other hand to have
once read somewhere that Kolmogorov developed a mathematical concept
of complexity for sequences of finite lengths. Does that affect what
you said above or was it simply because of my faulty memory?

You can define the Kolmogorov complexity on finite sequences *given*
that you have a specific Turing machine. It is the size of the shortest
program *on this machine* that regenerates the message.

The problem is that this definition is always relative to the machine,
*unlike* - and this is the important part - you define it on infinite
strings. *Then* the machine doesn't matter anymore.

The problem with the finite-sized input - and why it makes little sense
there - is simply that for every finite sized message you can define a
Turing machine which generates this message simply by a one-step
program: Put the message into the state-apparatus of the Turing machine.
(Aka: The "Lena compress" version of Kolmogorov complexity. True,
working, but useless).

Independence from the machine on infinite messages:

This is simply because any Turing machine can emulate any other Turing
machine by a finite-sized program, and the contribution of a
finite-sized program to an infinite string doesn't matter.

So long,
Thomas
 
Mok-Kong Shen...
Posted: Fri Sep 11, 2009 9:53 pm
Guest
Thomas Richter wrote:

Quote:
You can define the Kolmogorov complexity on finite sequences *given*
that you have a specific Turing machine. It is the size of the shortest
program *on this machine* that regenerates the message.

The problem is that this definition is always relative to the machine,
*unlike* - and this is the important part - you define it on infinite
strings. *Then* the machine doesn't matter anymore.

The problem with the finite-sized input - and why it makes little sense
there - is simply that for every finite sized message you can define a
Turing machine which generates this message simply by a one-step
program: Put the message into the state-apparatus of the Turing machine.
(Aka: The "Lena compress" version of Kolmogorov complexity. True,
working, but useless).

Independence from the machine on infinite messages:

This is simply because any Turing machine can emulate any other Turing
machine by a finite-sized program, and the contribution of a
finite-sized program to an infinite string doesn't matter.

As layman I rather doubt that letting any kind of machine (real or
theoretical) to work on an infinite string could have a sense.
I mean by definition of "infinite", the work could never get to an
end (no matter what the processing speed is) and thus no result
would ever be available. What's wrong with my way of thinking here?

Thanks,

M. K. Shen
 
Mok-Kong Shen...
Posted: Fri Sep 11, 2009 10:52 pm
Guest
Thomas Richter wrote:

Quote:
You are still on the wrong track. This is *not* the entropy we talk
about. For the entropy we talk about, we *do* know the alphabet the
message is made of, and we do know the probabilities and their
dependencies. It is *not* about interpreting the message, only getting
the symbols it is made of. The symbols could be Chinese, and all you get
is a table of all possible symbols in chinese, and their (conditioned,
unconditioned...) probabilities, i.e. a description of the source (a
chinese monkey, if you like).

I have a layman's question here: In the character strings of
natural languages there are more or less strong correlations,
as the statistical data on digrams and n-grams show. If I
don't err, such correlations are not considered in Shannon's
formula. If this is ture, wouldn't the validity of the result
be a problem?

Thanks,

M. K. Shen
 
Thomas Richter...
Posted: Fri Sep 11, 2009 10:56 pm
Guest
Mok-Kong Shen wrote:
Quote:
Thomas Richter wrote:

You can define the Kolmogorov complexity on finite sequences *given*
that you have a specific Turing machine. It is the size of the
shortest program *on this machine* that regenerates the message.

The problem is that this definition is always relative to the machine,
*unlike* - and this is the important part - you define it on infinite
strings. *Then* the machine doesn't matter anymore.

The problem with the finite-sized input - and why it makes little
sense there - is simply that for every finite sized message you can
define a Turing machine which generates this message simply by a
one-step program: Put the message into the state-apparatus of the
Turing machine.
(Aka: The "Lena compress" version of Kolmogorov complexity. True,
working, but useless).

Independence from the machine on infinite messages:

This is simply because any Turing machine can emulate any other Turing
machine by a finite-sized program, and the contribution of a
finite-sized program to an infinite string doesn't matter.

As layman I rather doubt that letting any kind of machine (real or
theoretical) to work on an infinite string could have a sense.
I mean by definition of "infinite", the work could never get to an
end (no matter what the processing speed is) and thus no result
would ever be available.

You can still get the next output symbol, in the same sense that you can
get the next symbol from the digits of Pi. Pi can be represented as an
infinite decimal fraction, yet it makes sense to speak about Pi in
total. That is, it is at least "potentially infinite" in the sense that
you can always get the next symbol, and it never terminates, and the
length is not finite ad-hoc.

So long,
Thomas
 
Mok-Kong Shen...
Posted: Fri Sep 11, 2009 11:07 pm
Guest
Thomas Richter wrote:
Quote:
Mok-Kong Shen wrote:
Thomas Richter wrote:

You can define the Kolmogorov complexity on finite sequences *given*
that you have a specific Turing machine. It is the size of the
shortest program *on this machine* that regenerates the message.

The problem is that this definition is always relative to the
machine, *unlike* - and this is the important part - you define it on
infinite strings. *Then* the machine doesn't matter anymore.

The problem with the finite-sized input - and why it makes little
sense there - is simply that for every finite sized message you can
define a Turing machine which generates this message simply by a
one-step program: Put the message into the state-apparatus of the
Turing machine.
(Aka: The "Lena compress" version of Kolmogorov complexity. True,
working, but useless).

Independence from the machine on infinite messages:

This is simply because any Turing machine can emulate any other
Turing machine by a finite-sized program, and the contribution of a
finite-sized program to an infinite string doesn't matter.

As layman I rather doubt that letting any kind of machine (real or
theoretical) to work on an infinite string could have a sense.
I mean by definition of "infinite", the work could never get to an
end (no matter what the processing speed is) and thus no result
would ever be available.

You can still get the next output symbol, in the same sense that you can
get the next symbol from the digits of Pi. Pi can be represented as an
infinite decimal fraction, yet it makes sense to speak about Pi in
total. That is, it is at least "potentially infinite" in the sense that
you can always get the next symbol, and it never terminates, and the
length is not finite ad-hoc.

OK, but when do you take the result from the machine? At the
moment you take the result, only a finite number of symbols
have been processed. So you "always" get results from a finite
string, never from an infinite string, although you may choose
to have that finite string to be as long as you like, if you
spend correspondingly long time.

Thanks,

M. K. Shen
 
Thomas Richter...
Posted: Sat Sep 12, 2009 12:22 am
Guest
Mok-Kong Shen wrote:
Quote:
Thomas Richter wrote:

You are still on the wrong track. This is *not* the entropy we talk
about. For the entropy we talk about, we *do* know the alphabet the
message is made of, and we do know the probabilities and their
dependencies. It is *not* about interpreting the message, only getting
the symbols it is made of. The symbols could be Chinese, and all you
get is a table of all possible symbols in chinese, and their
(conditioned, unconditioned...) probabilities, i.e. a description of
the source (a chinese monkey, if you like).

I have a layman's question here: In the character strings of
natural languages there are more or less strong correlations,
as the statistical data on digrams and n-grams show. If I
don't err, such correlations are not considered in Shannon's
formula.

There is no "the formula". There is an n-order entropy definition, and
that definition would of course capture the correlation. Zero-order
entropy, of course, does not. Or, to put it differently, if you have a
source with correlation, increasing the order of the entropy measurement
will decrease the entropy. Or rather, if you have an nth-order model,
you should measure entropy as n-th order entropy, obviously.

So long,
Thomas
 
Thomas Richter...
Posted: Sat Sep 12, 2009 12:26 am
Guest
Mok-Kong Shen wrote:

Quote:
You can still get the next output symbol, in the same sense that you
can get the next symbol from the digits of Pi. Pi can be represented
as an infinite decimal fraction, yet it makes sense to speak about Pi
in total. That is, it is at least "potentially infinite" in the sense
that you can always get the next symbol, and it never terminates, and
the length is not finite ad-hoc.

OK, but when do you take the result from the machine?

Whenever you have enough digits you're satisfied with. Same as for Pi.
At some point, it is precise enough.

Quote:
At the
moment you take the result, only a finite number of symbols
have been processed. So you "always" get results from a finite
string, never from an infinite string, although you may choose
to have that finite string to be as long as you like, if you
spend correspondingly long time.

The difference is the "potentially infinite", or, very formally, the
relative position of the "quantors" in the statement.

Infinite string means: There is a Turing machine T such that for any
message of size N...

and the point is, N is not known and not specified in advance. You can
choose, without modifying the machine. You never get a final result, and
you never get a final result for Pi either. It just goes on.

So long,
Thomas
 
Mok-Kong Shen...
Posted: Sat Sep 12, 2009 1:23 am
Guest
Thomas Richter wrote:
Quote:
Mok-Kong Shen wrote:

I have a layman's question here: In the character strings of
natural languages there are more or less strong correlations,
as the statistical data on digrams and n-grams show. If I
don't err, such correlations are not considered in Shannon's
formula.

There is no "the formula". There is an n-order entropy definition, and
that definition would of course capture the correlation. Zero-order
entropy, of course, does not. Or, to put it differently, if you have a
source with correlation, increasing the order of the entropy measurement
will decrease the entropy. Or rather, if you have an nth-order model,
you should measure entropy as n-th order entropy, obviously.

Thank you for pointing out that to me. In the books I read only
the O-th order formula is given, together with estimate of entropy
for English, although English, like all natural languages,
certainly has rather strong correlations. Maybe the estimate was
actually computed using a higher order formula, I don't know.
But anyway, that's bad of the authors in my humble view.

Thanks,

M. K. Shen
 
 
Page 3 of 5    Goto page Previous  1, 2, 3, 4, 5  Next
All times are GMT
The time now is Wed Dec 02, 2009 11:17 pm