A request for samples to work with.
Einstein
Posted: Fri Feb 15, 2008 4:47 pm
Guest
I need 4 heavily randomized, high-entropy files, each at most 64 KB (I am
converting each one to 0's and 1's in a text file, so it will be 512 KB
then... which is the maximum the tool can handle).

These files should not be compressible any further by any existing
compression utility.

I have just successfully done one test, but I used a central portion
of the Calgary Corpus file since it was too large as a whole. I used
7zip to compress it, and admittedly obtained only a few bits of
compression in several different locations after accounting for a
command section and identification, and only if I quadrupled the
estimated file size to increase the 'number of bits saved'.

This is bad sampling and I admit it. Therefore I am asking for help
generating the admittedly necessary test files.

4 should give me a baseline to see if I am being successful or not.
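
For anyone wanting to reproduce the setup, here is a minimal sketch of the conversion described above: raw bytes written out as a text file of '0' and '1' characters, which is where the 8x growth from 64 KB to 512 KB comes from. The function names, the seeded pseudo-random stand-in sample, and the use of zlib in place of "any existing compression utility" are assumptions for illustration, not the poster's actual tool.

```python
import random
import zlib


def to_bit_text(data: bytes) -> str:
    """Write each byte as 8 characters of '0'/'1', most significant bit first."""
    return "".join(format(b, "08b") for b in data)


def from_bit_text(text: str) -> bytes:
    """Inverse of to_bit_text: pack every 8 characters back into one byte."""
    return bytes(int(text[i:i + 8], 2) for i in range(0, len(text), 8))


# Stand-in sample: 64 KiB from a seeded PRNG (an assumption, not a true random source).
rng = random.Random(2008)
sample = bytes(rng.getrandbits(8) for _ in range(64 * 1024))

bit_text = to_bit_text(sample)
packed = zlib.compress(sample, 9)

print(len(sample), len(bit_text))   # bytes in, characters out
print(len(packed) / len(sample))    # a ratio at or above 1.0 means no gain
```

Note that the 0/1 text file itself is trivially compressible back to one eighth of its size, so any savings should be measured against the original bytes rather than against the text form.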
Sachin Garg
Posted: Fri Feb 15, 2008 5:33 pm
Guest
On Feb 16, 7:47 am, Einstein <michae...@gmail.com> wrote:
Quote:
I need 4 heavily randomized, entropic files. Max size of 64kb (I am
converting it to 0's and 1's in a text file, so it will be 512kb
then... which is the maximum of the tool)

I don't think it is going to be worth the time and effort you plan to
put into it, but it's your choice :-)

Try this file:
http://marknelson.us/attachments/million-digit-challenge/AMillionRandomDigits.bin

More information on this here:
http://marknelson.us/2006/06/20/million-digit-challenge

And you can try this file:
http://www.geocities.com/patchnpuki/other/original.gz

More information on this here:
http://www.geocities.com/patchnpuki/other/compression.htm

You can create files of your required size from these. These are both
fairly random files and should not be easily compressible. I think
someone did manage to compress the million-digit file long ago by a
few bytes but I can't recall the details. So, the files are not
perfect but should be good enough to try out your ideas.

Best of luck,
Sachin Garg [India]
www.sachingarg.com | www.c10n.info
Einstein
Posted: Fri Feb 15, 2008 7:27 pm
Guest
The second file source just gives me "This page is not available."

The million one is too large, my software can't try it. Nice slam on me,
I have since learned a lil more, post more information.

So very kind of you.

I merely asked for sources, not for an attack.

And frankly, any file should be compressible with the simple idiot's
tale of a compressor that has X built into it: the file coincidentally is
X, therefore the value is 0 for true, compressed to 1 bit, wooo..... Yada
yada.

Anyhow, I am honestly trying, I am not in anyone's face, so why the in
my face? Hell, I published my last efforts, all of the different means
I used, k? Every formula was there, and this is just a legit question.
Guest
Posted: Fri Feb 15, 2008 10:17 pm
On Feb 16, 5:27 am, Einstein <michae...@gmail.com> wrote:

Quote:
The million one is too large, my software can't try it. Nice slam on me,
I have since learned a lil more, post more information.

He did not attack you, he just stated a simple truth. Trying to write
compressors for true random sequences is a waste of time, you can't do
it no matter how hard you try. That is just a fact whether you like
it or not.

Second he gave you a pointer to a sequence of 1,000,000 digits! You
said you need four 64K sequences. If you can't figure out how to
extract four 64K files from one 1Meg file you have more problems ahead
of you than you are ready for. And yes, you can consider that an
attack on your mental abilities. But really, if that is a problem for
you I don't see how you think you can handle something hard like RAD
compressors.

Sorry if I seem harsh, but you were attacking someone who really was
trying to help you. Read my early replies to some of the compression
con artists if you want to see what harsh really means. :)
Einstein
Posted: Sat Feb 16, 2008 2:53 am
Guest
Sorry, I was dismayed that my rl name was used on that report, and felt very
attacked.

I just don't want any tainted samples, I guess. I am very concerned to
keep this as legit as I can... improper sampling can be taken the
wrong way imo. By me, by my system, and by everyone.
Einstein
Posted: Sat Feb 16, 2008 12:12 pm
Guest
Here is a question:

If I ran every 16-bit string possibility in order, and hooked them up,
and then parsed it at 4 bits and ran it at 4 bits, does this stand in for
a random file, or would it seem too ordered? I have been using that for
initial work, but idk, it had an exceptionally high ratio. Effort #2 was
with 64 KB from the center of the Calgary Corpus, unzipped and then
compressed with 7zip (a higher compression ratio than WinZip could
manage). I saved like 10 bytes by the looks of it, but this could be just
a bad sample.

That's why I am scared to try to split the big one there, but I will, I
guess, into 8 parts like I might, if I make sure to gauge all results
together.... btw this is not an 'all random' compressor, I am aware that's
impossible. Part of the tool's purpose is to see if just a portion can be
compressed. By the looks of it, about 10% can be compressed, but results
vary with how much, and typically it is ineffectual to even try to attend
to the remainder. I literally just cost-efficiently whip out the 10% as I
can... but tests will see, so don't act like I am preaching it :P
Sachin Garg
Posted: Sat Feb 16, 2008 2:40 pm
Guest
On Feb 16, 10:27 am, Einstein <michae...@gmail.com> wrote:
Quote:
This page is not available.

on the second file source.

Go to the /compression page, then search for the text "original.gz" and
click on the link; you should then be able to download the file.
(Geocities doesn't allow direct linking to files, hence the trouble.)

Quote:
The million one is too large, my software can't try it.

It should be easy to write a program that can break these files into
pieces of whatever size you choose.

Quote:
Nice slam on me,

That wasn't my intention, my apologies if you took it personally.

Sachin Garg [India]
www.sachingarg.com | www.c10n.info
Guest
Posted: Sun Feb 17, 2008 6:47 am
On Feb 16, 12:53 pm, Einstein <michae...@gmail.com> wrote:
Quote:
sorry I was dismayed my rl name was used on that report, and felt very
attacked.

I must have missed that part, but if you feel you must hide behind a
fake name then you are the one with problems. If you really believe
in your own ideas then you should have the guts to use your name with
them. Notice, I always post with my real name? That is because I
write what I believe in - no lies - no fears.

Quote:
I just dont want any tainted samples I guess. I am very concerned to
keep this as legit as I can... improper sampling can be taken the
wrong way imo. By me, by my system, and by everyone.

How can a sequence of a million digits be tainted? That does not make
sense, or you don't have a clear idea of what random means. In any
sequence of random numbers, any subset of that sequence is also
random. As long as the start points of the four 64K sequences are chosen
so that the subsets don't overlap, the sequences are random.

Most people would just use the first 256K numbers broken down into even
subsets - that is all you need.
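
The split suggested above can be sketched in a few lines. This is only an illustration: the function names and the output file prefix are hypothetical, and it simply takes the first four non-overlapping 64 KiB pieces of whatever file it is given.

```python
from pathlib import Path

CHUNK = 64 * 1024  # 64 KiB per sample


def split_chunks(data: bytes, size: int = CHUNK, count: int = 4) -> list:
    """Return the first `count` non-overlapping chunks of `size` bytes."""
    if len(data) < size * count:
        raise ValueError("source is too small for the requested chunks")
    return [data[i * size:(i + 1) * size] for i in range(count)]


def write_samples(source: str, prefix: str = "sample") -> None:
    """Write the chunks as sample0.bin, sample1.bin, ... (hypothetical names)."""
    data = Path(source).read_bytes()
    for n, chunk in enumerate(split_chunks(data)):
        Path(f"{prefix}{n}.bin").write_bytes(chunk)
```

Once the file has been downloaded, `write_samples("AMillionRandomDigits.bin")` would produce the four pieces. A million decimal digits carry roughly 415 KB of information, so a source of that size is large enough for four 64 KiB chunks.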
Einstein
Posted: Sun Feb 17, 2008 11:36 am
Guest
Well, the numbers I am getting from small files are too unbelievable
atm... that's why I wanted a medium-sized (since that's the max of the
tool atm) 'uncompressible' file.

So far small files have indicated huge chances to compress, but I feel
they lack total depth. Individually they won't compress, except a few,
but if... words escape me.... hmmm, I will use the word Model.... if
the model persists in the ratios, it would indicate 2% compression on
anything, which would not be possible ofc. My goal is much more
modest.

I want to break a file into streams. Each stream should by nature have
some differentiation. I am studying the variances therein atm for
compression modeling. The hope is that the streams can be 'fluctuated
naturally' to which one will allow compression. I expect on average to
save a mere few bits with such an effort, with a chance to repeat
being high but not guaranteed. This would keep up with Shannon's law of
entropy of a binary line, I would just move the line closer to the
exact level...
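
One way to put numbers on "variance between streams" is to measure the order-0 Shannon entropy of each stream. The sketch below is only an illustration: the post does not say how the file is broken into streams, so the round-robin split is a hypothetical rule, and the function names are made up.

```python
import math
from collections import Counter


def entropy_bits_per_byte(stream: bytes) -> float:
    """Order-0 empirical Shannon entropy, H = -sum(p * log2(p)), in bits per byte."""
    if not stream:
        return 0.0
    total = len(stream)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(stream).values())


def round_robin_streams(data: bytes, k: int) -> list:
    """Split data into k streams by taking every k-th byte (hypothetical rule)."""
    return [data[i::k] for i in range(k)]


# Toy input: the byte values 0..255 repeated 16 times.
data = bytes(range(256)) * 16
for i, s in enumerate(round_robin_streams(data, 4)):
    print(i, len(s), round(entropy_bits_per_byte(s), 3))
```

A stream that measures well under 8 bits per byte is a candidate for compression. For a genuinely random source, every stream should measure close to 8 bits per byte however the split is made.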
Jim Leonard
Posted: Mon Feb 18, 2008 7:03 am
Guest
On Feb 16, 4:12 pm, Einstein <michae...@gmail.com> wrote:
Quote:
if I ran every 16 bit string possibility in order, and hooked them up,
and then parsed it at 4 bits and ran it at 4 bits, does this stand for
a random file, or would it seem to ordered?

It would be incredibly ordered. Please use a real random source. A
quick scan of Google search results points to http://www.fourmilab.ch/hotbits/
which will give you random bits taken from a decaying radioactive
source (Cæsium-137).

Do grab 64K random bits at a time/in sequence. Do *not* take 64 bits
and repeat them 1K times.
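
To see why both of those inputs count as "ordered", here is a small sketch using Python's zlib as a stand-in compressor. The counting file is run through a "next value = previous value + 1" model first (a delta transform, which is an assumption about what a sensible model would do, not something from the posts), and the repeated block uses an arbitrary 64-bit value.

```python
import struct
import zlib

# All 65536 16-bit values in ascending order, big-endian: 128 KiB in total.
counter = b"".join(struct.pack(">H", v) for v in range(1 << 16))

# Model "next value = previous value + 1": store successive differences mod 2**16.
values = struct.unpack(">65536H", counter)
deltas = [values[0]] + [(b - a) & 0xFFFF for a, b in zip(values, values[1:])]
delta_bytes = struct.pack(">65536H", *deltas)

# One fixed 64-bit block (arbitrary value) repeated 1K times: 8 KiB in total.
repeated = bytes.fromhex("0123456789abcdef") * 1024

print(len(counter), len(zlib.compress(delta_bytes, 9)))
print(len(repeated), len(zlib.compress(repeated, 9)))
```

In both cases the compressed size is a tiny fraction of the input, which is why neither file is a fair test of a compressor that claims to work on random data.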

Quote:
of it, about 10% can be compressed, but results vary with how much,
and typically it is ineffectual to even try to attend to the
remainder. I literally just cost efficiently whip out the 10% as I
can... but tests will see, so dont act like I am preaching it :P

With a true random source, you are wasting your time. Why not look
into other areas of compression for fun? I specialized in high-speed
decompression for the x86 architecture (16-bit); others have tried to
squeeze every last bit out of various corpora. There's a lot of room
for legitimate work if you have a (NON-RANDOM!) niche people haven't
explored yet.
Jim Leonard
Posted: Mon Feb 18, 2008 7:51 am
Guest
On Feb 17, 3:36 pm, Einstein <michae...@gmail.com> wrote:
Quote:
Well the numbers I am getting from small files are to unbelievable
atm... thats why I wanted a medium sized (since thats the max of the
tool atm) 'uncompressible' file.

Run your process on the following small file:

784951623

That's 9 random digits. Here are those digits as a single number in
binary, if you prefer to work only in binary:

101110110010010110100101000111

If you are able to compress the above to something smaller than the 30
bits it takes to represent them, and successfully decompress the
compressed data back into the source, congratulations, you are
mentally ill.

To satisfy the "RAD" people, there was indeed a context in generating
the above, but telling you wouldn't help at all because the
information needed to make sense of the context equals the size of the
output.
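
The numbers in the challenge are easy to check. The short sketch below converts the nine digits to binary and compares the length with the information content of nine arbitrary decimal digits; the variable names are illustrative only.

```python
import math

digits = "784951623"
n = int(digits)

# The nine digits read as one number, written in binary.
as_binary = format(n, "b")
print(as_binary, len(as_binary))

# Information content of 9 arbitrary decimal digits: log2(10**9) bits.
print(9 * math.log2(10))
```

The binary string matches the one posted above and is 30 bits long, while 9 * log2(10) is just under 30, so there is essentially no slack left in that representation. More generally, there are 2^30 possible 30-bit inputs but only 2^30 - 1 bit strings shorter than 30 bits, so no lossless scheme can shorten all of them.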
Einstein
Posted: Mon Feb 18, 2008 8:44 am
Guest
On Feb 18, 9:51 am, Jim Leonard <MobyGa...@gmail.com> wrote:
Quote:
Run your process on the following small file:

784951623

111011000001110000101100100111001010110010001100011011000100110011001100
is actual bits for the 9 characters.

1110
1100
0001
1100
0010
1100
1001
1100
1010
1100
1000
1100
0110
1100
0100
1100
1100
1100

Dictionary

1 = 1100
0 = all else is normal


0,1100,1,0,0001,1,0,0010,1,0,1001,1,0,1010,1,0,1000,1,0,0110,1,0,0100,1,1,1

Total size = 50 bits

Original size = 72 bits

ofc it would not be 'that easy', but still having nearly 1/3 of the
space to work with, oh jeez, you really tried.
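
For reference, the flag-bit scheme in this post can be sketched as follows. The function names are hypothetical; the bit order (ASCII codes written least significant bit first) is inferred from the 72-bit string above.

```python
def lsb_first_bits(text: str) -> str:
    """ASCII codes written least significant bit first, as in the post above."""
    return "".join(format(ord(c), "08b")[::-1] for c in text)


def flag_nibble_encode(bits: str, common: str = "1100") -> str:
    """Emit '1' for the common nibble, otherwise '0' followed by the nibble itself."""
    nibbles = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join("1" if n == common else "0" + n for n in nibbles)


def flag_nibble_decode(code: str, common: str = "1100") -> str:
    """Reverse the encoding: a '1' flag expands to the common nibble."""
    out, i = [], 0
    while i < len(code):
        if code[i] == "1":
            out.append(common)
            i += 1
        else:
            out.append(code[i + 1:i + 5])
            i += 5
    return "".join(out)


source_bits = lsb_first_bits("784951623")
encoded = flag_nibble_encode(source_bits)

# ASCII size, flag-coded size, and size of the plain binary number.
print(len(source_bits), len(encoded), len(format(784951623, "b")))
```

This reproduces the 72-bit and 50-bit counts. The saving comes entirely from the fixed high nibble that every ASCII digit shares, so the 50 bits are measured against the ASCII form and are still well above the 30-bit binary representation the challenge asked to beat.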
Einstein
Posted: Mon Feb 18, 2008 8:50 am
Guest
On Feb 18, 10:44 am, Einstein <michae...@gmail.com> wrote:
Quote:
0,1100,1,0,0001,1,0,0010,1,0,1001,1,0,1010,1,0,1000,1,0,0110,1,0,0100,1,1,1

Dammit, a typo... should start 0,1110
Matt Mahoney
Posted: Thu Feb 21, 2008 2:05 pm
Guest
On Feb 15, 10:33 pm, Sachin Garg <schn...@gmail.com> wrote:
Quote:
Try this file:
http://marknelson.us/attachments/million-digit-challenge/AMillionRand...

More information on this here:
http://marknelson.us/2006/06/20/million-digit-challenge

And you can try this file:
http://www.geocities.com/patchnpuki/other/original.gz

More information on this here:
http://www.geocities.com/patchnpuki/other/compression.htm

You can create files of your required size from these. These are both
fairly random files and should not be easily compressible. I think
someone did manage to compress the million-digit file long ago by a
few bytes but I can't recall the details. So, the files are not
perfect but should be good enough to try out your ideas.

If you write the million random digits as a table with 20000 rows and
50 columns, then add up the columns, then each of the 50 sums will be
an even number. I discovered this a couple of years ago.

The original data was not quite random due to biases in the hardware
generator, so they took the data (20000 punched cards with 50 digits
each) and added adjacent pairs of cards to generate the published
data. See http://www.rand.org/pubs/monograph_reports/MR1418/index.html

If you could guess one original card and you knew the order of the
published cards, then you could recover all the biased data by
subtracting adjacent pairs and compress it slightly. Unfortunately,
the cards were shuffled or mixed up first. The document does not
describe this, but I know they are not in the original order because
if they were, you could alternately add and subtract rows and they
would sum to 0 (mod 10). But they do not. Also, there is a weak
correlation between adjacent cards and also to the second card in
either direction. There should not be a correlation to the second
card unless a card was inserted between them. In any case the
correlation is too weak to compress the file by more than a few bits.

-- Matt Mahoney
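
The parity observation can be reproduced on toy data. The sketch below is an illustration only: it builds a synthetic deck from a seeded PRNG instead of the real RAND digits, and it assumes the last card is added to the first (wrap-around) so that 20000 cards yield 20000 published rows. The function names are made up.

```python
import random


def add_adjacent_cards(cards):
    """published[i] = cards[i] + cards[i+1] (mod 10), wrapping around at the end."""
    n = len(cards)
    return [[(a + b) % 10 for a, b in zip(cards[i], cards[(i + 1) % n])]
            for i in range(n)]


def column_sums(table):
    """Sum each of the columns over all rows."""
    return [sum(col) for col in zip(*table)]


def alternating_sums_mod10(table):
    """Add even-numbered rows, subtract odd-numbered rows, column by column."""
    return [sum(v if i % 2 == 0 else -v for i, v in enumerate(col)) % 10
            for col in zip(*table)]


# Toy stand-in for the original deck: 20000 cards of 50 digits (arbitrary seed).
rng = random.Random(1955)
cards = [[rng.randrange(10) for _ in range(50)] for _ in range(20000)]
published = add_adjacent_cards(cards)

print(all(s % 2 == 0 for s in column_sums(published)))
print(all(s == 0 for s in alternating_sums_mod10(published)))

# Shuffling the rows keeps the column sums even.
shuffled = published[:]
rng.shuffle(shuffled)
print(all(s % 2 == 0 for s in column_sums(shuffled)))
```

Under that wrap-around assumption, each original digit is counted twice in every column sum, which forces the sums to be even, and the alternating sum telescopes to 0 (mod 10) as long as the rows stay in their original order. Shuffling preserves the first property but in general destroys the second, which matches the reasoning in the post.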
Einstein
Posted: Thu Feb 21, 2008 6:32 pm
Guest
On Feb 21, 4:05 pm, Matt Mahoney <matmaho...@yahoo.com> wrote:
Quote:
If you write the million random digits as a table with 20000 rows and
50 columns, then add up the columns, then each of the 50 sums will be
an even number. I discovered this a couple of years ago.

Heh, that's an interesting factoid, Matt Mahoney.

I have taken a couple of days off due to rl issues, bout to tackle this
mofo again from another angle.
 
All times are GMT - 5 Hours