| |
 |
|
|
Science Forum Index » Compression Forum » Use of Tunstall codes for data compression...
Page 1 of 1
|
| Author |
Message |
| John Gilbert... |
Posted: Fri May 16, 2008 4:13 am |
|
|
|
Guest
|
Hi Folks,
Consider the source alphabet SIGMAin = {a, b} where P(a) = 0.75 and
the output alphabet SIGMAout = {0,1}.
If I build a 2bit Tunstall code for this setup I arrive at the
following codebook:
aaa -> 00
aab -> 01
ab -> 10
b -> 11
In Sayood's "Introduction to data compression" book on pg 67 he says
that the design of a code that has a fixed codeword length but a
variable number of symbols per codeword should satisfy the following
conditions:
(1) we should be able to parse a source output sequence into a
sequence of symbols that appear in the codebook
(2) ...
I don't see that Tunstall codes meet this first requirement. For
example, how do I parse the source sequence consisting of "a" or
"aaba"?
Comments or suggestions on what I've misunderstood would be
appreciated :)
Thanks,
John. |
|
|
| Back to top |
|
| biject... |
Posted: Fri May 16, 2008 6:30 am |
|
|
|
Guest
|
On May 16, 8:13 am, John Gilbert <gilbe... at (no spam) gmail.com> wrote:
Quote: Hi Folks,
Consider the source alphabet SIGMAin = {a, b} where P(a) = 0.75 and
the output alphabet SIGMAout = {0,1}.
If I build a 2bit Tunstall code for this setup I arrive at the
following codebook:
aaa -> 00
aab -> 01
ab -> 10
b -> 11
In Sayood's "Introduction to data compression" book on pg 67 he says
that the design of a code that has a fixed codeword length but a
variable number of symbols per codeword should satisfy the following
conditions:
(1) we should be able to parse a source output sequence into a
sequence of symbols that appear in the codebook
(2) ...
I don't see that Tunstall codes meet this first requirement. For
example, how do I parse the source sequence consisting of "a" or
"aaba"?
Comments or suggestions on what I've misunderstood would be
appreciated :)
Thanks,
John.
Actually its not a hard problem look at my bjective LZW code or
numerous discussions on going from bit streams to byte streams.
There is also the problem of going from on odd number of bits
like what is 010 to the tokens you have defined.
Yet the codings you have is all you really need besides a general
understanding of bijective coding methods. Here is the solution to
your simple problem the rest and the parts you did not ask are
all similar. Also not tis one of many bijective ways to do it. Once
you see one you should start to see others.
I can treat you a and b as a simple bijary string doing that and
following my general bijectie methods if the the string is all "b"
tokends add tail of "aaaa...."
anything else add tail of "baaa..."
thus
aaba
becomes
aababaaaa..... for ever
this becomes
aab ab aaa aaa aaa aaa for ever
01 10 00 00 00 00 at this point since I real want to go
to the space of all strings instead of even you needless
limited it to,
this becomes
the string
01
since you have made 01 the token for aab you may
wrongly think there is a conflict. but when I apply
the methods to aab
it goes to
aabbaaaaa.... which becomes
aab b aaa aaa aaa which becomes
01 11 00 00 00 ..
which is
011
life is strange this is not how those with the inability
to think of strings would do. I suggest you look
at amy lzw and arithmetic or huffman compressors
to get a feel for how it works.
The dumb nobijective solution typical of closed thinging
might tell you to be to fileds in front of compressed string
tell you how many pairs of data to use and the number
of original tokend in last set exanple of this piss poor
way yet weak minds should be able to follow
aaba
aab aaa
{2}[1]0100
the 2 tells ot 2 entries
the 1 tells to use only first symbol from last one
Yes its wasteful yes its pss poor way to do it. But
if you can't so bijective file compression where no
space is wasted what choice do you have.
Hope this helps if not hope it at least
stimulated your mind
David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link" |
|
|
| Back to top |
|
| biject... |
Posted: Fri May 16, 2008 7:33 am |
|
|
|
Guest
|
On May 16, 8:13 am, John Gilbert <gilbe... at (no spam) gmail.com> wrote:
Quote: Hi Folks,
Consider the source alphabet SIGMAin = {a, b} where P(a) = 0.75 and
the output alphabet SIGMAout = {0,1}.
If I build a 2bit Tunstall code for this setup I arrive at the
following codebook:
aaa -> 00
aab -> 01
ab -> 10
b -> 11
In Sayood's "Introduction to data compression" book on pg 67 he says
that the design of a code that has a fixed codeword length but a
variable number of symbols per codeword should satisfy the following
conditions:
(1) we should be able to parse a source output sequence into a
sequence of symbols that appear in the codebook
(2) ...
I don't see that Tunstall codes meet this first requirement. For
example, how do I parse the source sequence consisting of "a" or
"aaba"?
Comments or suggestions on what I've misunderstood would be
appreciated :)
Thanks,
John.
I'm not sure why you don't go arithmetic but if you can't would it
not be better for the sets your using to go to
aaa > 1
b > 01
ab > 001
aab > 000
since it would make for a shorter file in the cases where a random
at 75% or time and b random at 25% of time
David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptograhic
system is only as strong as its weakest link" |
|
|
| Back to top |
|
| Thomas Richter... |
Posted: Fri May 16, 2008 12:25 pm |
|
|
|
Guest
|
John Gilbert schrieb:
Quote: Hi Folks,
Consider the source alphabet SIGMAin = {a, b} where P(a) = 0.75 and
the output alphabet SIGMAout = {0,1}.
If I build a 2bit Tunstall code for this setup I arrive at the
following codebook:
aaa -> 00
aab -> 01
ab -> 10
b -> 11
In Sayood's "Introduction to data compression" book on pg 67 he says
that the design of a code that has a fixed codeword length but a
variable number of symbols per codeword should satisfy the following
conditions:
(1) we should be able to parse a source output sequence into a
sequence of symbols that appear in the codebook
(2) ...
I don't see that Tunstall codes meet this first requirement. For
example, how do I parse the source sequence consisting of "a" or
"aaba"?
That's a general problem that exists for many compression algorithms,
namely, "how do you indicate that the stream ended here". Algorithms are often designed
to work on infinite data streams, i.e. the problem is rarely considered crucial because
several standard solutions exist:
a) add an "EOF" symbol to your alphabet, with a very low probability.
This decreases the coding efficiency a bit, and the code will, of course, be different.
Then "a" is encoded as "a EOF EOF" (or something similar, using a different alphabet).
b) Your format already contains a wrapper that indicates how long the original file
was (this is often practical for other reasons, anyhow). Then, instead of "a", you
would encode "aaa" or "ab", and indicate in the header that the last two or the last
symbol are "bogus" and must be removed.
b) is pretty efficient as it enlarges the output stream only minimally, and potentially
the additional side information is required anyhow, so no overhead is caused by this.
(Practical applications: JPEG uses an "EOF" symbol, JPEG2000 a length-indicator, just
to give practical applications for both methods).
So long,
Thomas |
|
|
| Back to top |
|
| |
|
Page 1 of 1
All times are GMT - 5 Hours
The time now is Thu Jan 08, 2009 3:44 am
|
|