Main Page | Report this Page
 
   
Science Forum Index  »  Compression Forum  »  Magic Compression? (Why won't this work) (Code provided)
Page 1 of 1    
Author Message
Guest
Posted: Mon Feb 18, 2008 1:02 am
I entirely accept the premise that there is no way to compress all
random data, but I'd love some advice on why you can't avoid this
using a time-check and verify system.

I have an "algorithm" (read: simple shell script)
that I think would successfully substantially compress Mark Nelson
AMillionRandomDigits
[1] and successfully decode it.
I'd love to get some feedback. I've included my proof-of-concept code
at the bottom of this post.


It would seem that it is possible to compress the file, and return it
to it's original form, if we can successfully leverage the time-used
to create it as a variable.

On any given machine, we can re-create any file by starting from a
blank file, and then appending.. 00000, then modifying to 00001,
00010, expanding until we have tried every iteration until we have, by
brute-force, found our file.

While this certainly can create our file, the problem is that the
solution solves too much.. While it DOES create our file, it also
creates all other files. That's certainly not going to be very useful.

If we're clever, and willing to accept a very small chance of loss
[2], I contend that we can re-create this file successfully.

Begin by iterating through all possible numbers, out to the number of
digits in the original. Every time we iterate, compare the MD5 (or
SHA, or whatever hash you prefer) to the hash of the original file.
Continue to iterate until you have created every possible file given
the requisite number of digits.

"But wait", you say, "You will get multiple files back, since you are
brute-force cracking the hash"

You're right, but that isn't a problem. When we encode the file in the
first place, we should go through the entire process of creating each
file, and time how long it takes until we create the correct one. We
can save this as a *very rough* time format. (Unix EPOCH is probably
fine). This should be a rough time format to avoid wasting bits..
Seconds is ALMOST CERTAINLY close enough.

When decoding, we again generate every possible file given the number
of bits.. We compare each against the hash of the original, and get a
list of matches.. Then, we can compare the elapsed time of each of the
matches, to the time required to create the original.

This is among the slowest ways to compress a file, but it should
result in quite a savings. The chance of loss is present, but very
small.. On most PCs, it's extraordinarily unlikely that they will have
enough speed to generate multiple collisions in one second. In the
Very-Rare event that it does, and you generate the wrong file, it is
still possible to re-create the original by re-running the decoder,
and telling it to ignore the false result.



Granted, this method is insanely slow, and there is a SLIGHT chance of
corruption, but I'd love to hear more about why it wouldn't
successfully pass Mark's test.


My proof of concept shell script follows. The Encoder and Decoder
should take nearly identical amounts of time to encode/decode the
file.

The Decoder will generate all possible files given the file length,
and then compare the time it took to build them against the time it
took to build the original in the encoder.
It returns the closest match as the requested file. This is almost
always going to be correct, assuming that the PC is in a pristine
state when running each set of code.


kari:pi$ cat encode.sh
#!/bin/bash
if [ ${#} -eq 0 ]; then echo "usage: $0 filename"; exit; fi
echo "" > file
sleep 3;
digits=`cat $1 | wc -c`
digits=`expr $digits + 1`

count=`echo 0`
hash=`cat $1 | md5`

STARTTIME=`date +%s`

while [ `echo $count | wc -c ` -lt $digits ];
do
count=`expr $count + 1`
ELAPSED=`expr $STARTTIME - $(date +%s)`
ELAPSED=`echo $ELAPSED | awk ' { if($1>=0) { print $1} else {print
$1*-1 }}'`

if [ `echo $count | md5` = $hash ]; then echo $count:`echo
$ELAPSED`>> file; fi
if [ "$count" == `cat $1` ]; then echo $digits:$hash $ELAPSED; fi
done
------

kari:pi$ cat decode.sh
#!/bin/bash
if [ ${#} -ne 2 ]; then echo "usage: $0 hash expected_elapsed"; exit;
fi
EXPECTEDELAPSED=`echo $2`
echo "" > file

digits=`echo $1| awk -F: {'print $1'}`
digits=`expr $digits`
hash=`echo $1| awk -F: {'print $2'}`

count=`echo -1`
echo $digits

STARTTIME=`date +%s`
#find all the hash matches
while [ `echo $count | wc -c ` -lt $digits ];
do
count=`expr $count + 1`
# echo -ne $count | wc -c;
ELAPSED=`expr $STARTTIME - $(date +%s)`
ELAPSED=`echo $ELAPSED | awk ' { if($1>=0) { print $1} else {print
$1*-1 }}'`
if [ `echo $count | md5` = $hash ]; then echo $count:`echo
$ELAPSED`>> file; fi
done


#find the closest to the original
LOWESTTIME=`echo 999999999999999`

#find the closest to the original
for i in `cat file`;
do
CURRENT=`echo $i| awk -F: {'print $2'}`
DIFFERENCE=`expr $CURRENT - $EXPECTEDELAPSED`
DIFFERENCEA=`echo $DIFFERENCE | awk ' { if($1>=0) { print $1}
else {print $1*-1 }}'`
if [ "$DIFFERENCEA" -lt "$LOWESTTIME" ]
then
LOWESTTIME=`echo $DIFFERENCEA`
CLOSEST=`echo $i|awk -F: {'print $1'}`
fi
done;

echo $CLOSEST

--------




[1] http://marknelson.us/2006/06/20/million-digit-challenge/
[2] This is a very small chance. You're unlikely to have two hash
collisions in the same second, on most speed PCs.





f436af1486f68768f40319f00e0c1681
Guest
Posted: Mon Feb 18, 2008 1:47 am
I'll freely admit that the time required would make it unfeasable, but
the challenge wasn't to create a solution that works on a human time-
scale, the challenge was to create a solution -at all-.

You're point about recording the iterations is fair, and I had thought
I addressed it in my initial post. You're right, of course that if you
were to count every iteration, you're count number would exceed the
original.. That's why I suggest using the unix EPOC- We're trading a
(very very small) amount of reliability, for fixing the problem of
requiring an infinite number.

Essentially, if we only record how many seconds it took, rather than
how many iterations, we are nearly certain to achieve compression, as
long as we're processing multiple MD5s every second.

That's the reason we need the MD5 hash- We're only using the time to
narrow down the answer between multiple collisions.. Our "saved" time,
will almost always be closest to the correct collision, rather than
some other random series of bits which might hash to the same value.

Think of the Order of Operations this way:
Find all possible matches
Determine which is very very likely to be the original, by comparing
datestamps.


We don't need the full time value for that, only enough precision that
we're unlikely to get two collisions in one iteration.
Guest
Posted: Mon Feb 18, 2008 1:53 am
Again, to be clear, I agree with you that my system won't work, I'm
just trying to understand why Wink
If I properly understand you're response, you seem to be saying that
we'd require tremendously more precision that I think necessary..

For example- Let's say that I am creating 10 MD5 sums every seconds,
and making one timestamp. It seems very unlikely to me more than one
of these 10 would collide..

If I save all possible collisions, and see that there is One success
in my group of 10 (above), then, again since they're unlikely to
collide in the same group, it seems like we've been able to record
1/10th of the data, and reasonably reliably reconstruct it.
Einstein
Posted: Mon Feb 18, 2008 3:26 am
Guest
Actually this is rather smart thinking, but sadly I dont see it
working. Nice thinking outside the box however :)


500 number of bits in this case for the example (Rounded ofc!)

116,744,315,788,278,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000

Thats a huge value, and for only 500 bits, and at the worst case
outcome (Which has a value of 495 bits). The chances of logging math
results until obtaining seems extremely hardcore. Ok so lets say we
can do 100 a second on the best server. We run it for 1 minute (60
seconds)... the best we can save then is 9 bits. Ofc you can keep
running, and keep trying. It all matters how many per second and how
many seconds. Perhaps if you could run a million a second and run it
for a million seconds... and while thats a potential 40 bits of
savings, now we need to identify when it happened. Time then becomes
like binary again...
Thomas Richter
Posted: Mon Feb 18, 2008 7:31 am
Guest
partner50145703@vansoftcorp.com schrieb:
Quote:
I entirely accept the premise that there is no way to compress all
random data, but I'd love some advice on why you can't avoid this
using a time-check and verify system.

I have an "algorithm" (read: simple shell script)
that I think would successfully substantially compress Mark Nelson
AMillionRandomDigits
[1] and successfully decode it.
I'd love to get some feedback. I've included my proof-of-concept code
at the bottom of this post.


It would seem that it is possible to compress the file, and return it
to it's original form, if we can successfully leverage the time-used
to create it as a variable.

On any given machine, we can re-create any file by starting from a
blank file, and then appending.. 00000, then modifying to 00001,
00010, expanding until we have tried every iteration until we have, by
brute-force, found our file.

While this certainly can create our file, the problem is that the
solution solves too much.. While it DOES create our file, it also
creates all other files. That's certainly not going to be very useful.

If we're clever, and willing to accept a very small chance of loss
[2], I contend that we can re-create this file successfully.

Begin by iterating through all possible numbers, out to the number of
digits in the original. Every time we iterate, compare the MD5 (or
SHA, or whatever hash you prefer) to the hash of the original file.
Continue to iterate until you have created every possible file given
the requisite number of digits.

"But wait", you say, "You will get multiple files back, since you are
brute-force cracking the hash"

You're right, but that isn't a problem. When we encode the file in the
first place, we should go through the entire process of creating each
file, and time how long it takes until we create the correct one. We
can save this as a *very rough* time format. (Unix EPOCH is probably
fine). This should be a rough time format to avoid wasting bits..
Seconds is ALMOST CERTAINLY close enough.

When decoding, we again generate every possible file given the number
of bits.. We compare each against the hash of the original, and get a
list of matches.. Then, we can compare the elapsed time of each of the
matches, to the time required to create the original.

This is among the slowest ways to compress a file, but it should
result in quite a savings. The chance of loss is present, but very
small.. On most PCs, it's extraordinarily unlikely that they will have
enough speed to generate multiple collisions in one second. In the
Very-Rare event that it does, and you generate the wrong file, it is
still possible to re-create the original by re-running the decoder,
and telling it to ignore the false result.



Granted, this method is insanely slow, and there is a SLIGHT chance of
corruption, but I'd love to hear more about why it wouldn't
successfully pass Mark's test.

Very simple. Instead of marking the time, you can also think of your
script as counting the number of tries it takes until it creates the
original file back. If the time measurement is precise enough, this
should do it, as it does in your method, and both can be considered
equivalent.

Now, consider, how many tries do we need to encode, i.e. how large the
number has to be? Of course we have to encode as many digits as there
are files generating the same MD5-sum (that's approximately the same
precision required to encode the time).

The solution to your riddle is that even though your MD5-sum is small,
the number of files with the same MD5-sum is *huge*, and the precision
of your time stamp, or the size of the number required to encode the
number of runs is immense. If an MD5-sum contains 128 bits (I don't
know, anyhow), then there are 2^128 different MD5 sums. On the other
hand, if you want to encode a million byte file, then there are (8 bits
per digit) 2^(1000000 * Cool of them, and thus this requires your count
(or your time stamp) to be of the precision of 2^(1000000 * Cool / 2^128 =
2^(1000000 * 8 - 128), which is a *huge* number.(Consider, there are
approximately 10^80 ~= 2^300 particles in this universe!). In fact, it
is only 128 bits shorter than the original file, compensating exactly
for the 128 bits in the MD5-sum, so nothing's gained.

This also explains why your time stamp method is unpractical: Every
algorithm working this way would require *a lot* more time than the age
of the universe.

So long,
Thomas
LR
Posted: Mon Feb 18, 2008 10:29 am
Guest
Quote:
We're only using the time to
narrow down the answer between multiple collisions

There will be 2^(1000000 * 8 - 128) collisions like Thomas Richter said.

Lasse
Thomas Richter
Posted: Mon Feb 18, 2008 11:45 am
Guest
partner50145703@vansoftcorp.com schrieb:
Quote:
Again, to be clear, I agree with you that my system won't work, I'm
just trying to understand why Wink
If I properly understand you're response, you seem to be saying that
we'd require tremendously more precision that I think necessary..

For example- Let's say that I am creating 10 MD5 sums every seconds,
and making one timestamp. It seems very unlikely to me more than one
of these 10 would collide..

Might be, but then again the time you would have to wait until your output
sequence would arrive would be tremendous. Actually, at
most 2^(8*1000000-128)/10 seconds, probably the half on average. And you would have
to encode this time somehow. And this takes a decimal number of (8*1000000-128)*log(2)
places, or a binary number of 8*1000000-128 bits, which is again the number of bits
missing and not encoded in the MD5 sum.

Quote:
If I save all possible collisions, and see that there is One success
in my group of 10 (above), then, again since they're unlikely to
collide in the same group, it seems like we've been able to record
1/10th of the data, and reasonably reliably reconstruct it.

Actually, collisions are *very* likely. I computed how many there will be in total.

So long,
Thomas
Phil Carmody
Posted: Mon Feb 18, 2008 2:34 pm
Guest
partner50145703@vansoftcorp.com writes:
Quote:
I entirely accept the premise that there is no way to compress all
random data, but I'd love some advice on why you can't avoid this
using a time-check and verify system.

I have an "algorithm" (read: simple shell script)
that I think would successfully substantially compress Mark Nelson
AMillionRandomDigits
[1] and successfully decode it.
I'd love to get some feedback. I've included my proof-of-concept code
at the bottom of this post.


It would seem that it is possible to compress the file, and return it
to it's original form, if we can successfully leverage the time-used
to create it as a variable.

On any given machine, we can re-create any file by starting from a
blank file, and then appending.. 00000, then modifying to 00001,
00010, expanding until we have tried every iteration until we have, by
brute-force, found our file.

How do you know if you've found your file unless you also
have a copy of that file to hand. In which case, you've
compressed the file down to precisely its original size,
congratulations.

Quote:
While this certainly can create our file, the problem is that the
solution solves too much.. While it DOES create our file, it also
creates all other files. That's certainly not going to be very useful.

If we're clever, and willing to accept a very small chance of loss
[2], I contend that we can re-create this file successfully.

Begin by iterating through all possible numbers, out to the number of
digits in the original. Every time we iterate, compare the MD5 (or
SHA, or whatever hash you prefer) to the hash of the original file.
Continue to iterate until you have created every possible file given
the requisite number of digits.

"But wait", you say, "You will get multiple files back, since you are
brute-force cracking the hash"

You're right, but that isn't a problem. When we encode the file in the
first place, we should go through the entire process of creating each
file, and time how long it takes until we create the correct one. We
can save this as a *very rough* time format. (Unix EPOCH is probably
fine). This should be a rough time format to avoid wasting bits..
Seconds is ALMOST CERTAINLY close enough.

How many bits of information will this timing value require
on average when brute forcing files of say 10, 100, 1000,
10000 bits?

Quote:
When decoding, we again generate every possible file given the number
of bits.. We compare each against the hash of the original, and get a
list of matches.. Then, we can compare the elapsed time of each of the
matches, to the time required to create the original.

Do you think your argument works equally well with a CRC rather
than a cryptographic hash?

If so, then you know nothing about CRCs. If not, then highlight
the place in your argument where the two deviate.

Phil
--
Dear aunt, let's set so double the killer delete select all.
-- Microsoft voice recognition live demonstration
Guest
Posted: Mon Feb 18, 2008 4:41 pm
Quote:
How many bits of information will this timing value require
on average when brute forcing files of say 10, 100, 1000,
10000 bits?


This depends on the likelyhood that you want to receive your original
file correctly. As I've said, this entire system is based on the idea
of accepting a small chance of getting the wrong file, which is a
correctable error.
Guest
Posted: Mon Feb 18, 2008 4:47 pm
On Feb 18, 9:29 am, "LR" <sd...@afdd.ds> wrote:
Quote:
We're only using the time to
narrow down the answer between multiple collisions

There will be 2^(1000000 * 8 - 128) collisions like Thomas Richter said.

Lasse

Fair enough. Since there will be such a large number of collisions,
then you are forced to have the specificity to differentiate them,
which would take up the required space, and negate the method.

Thanks for helping me understand, I appreciate it. I thought it was a
neat idea, but you (plural) have shown me why it wouldn't work, and I
respect and thank you for your time.
Industrial One
Posted: Tue Mar 04, 2008 10:47 pm
Guest
On Mar 4, 11:49 pm, "MisterE" <vo...@sometwher.world> wrote:
Quote:
"But wait", you say, "You will get multiple files back, since you are
brute-force cracking the hash"

This concept isn't new.

You will find that the way the hashes overlap exactually cancel out anything
you can gain by triming down on the time value.

If you have a 8 byte file and a 4byte hash, you will find there are 256^4
combinations that make the same hash anyway, so your time needs to be 4 byte
specific anyway, so you can't compress.

"Exactually"
....
MisterE
Posted: Wed Mar 05, 2008 2:49 am
Guest
Quote:

"But wait", you say, "You will get multiple files back, since you are
brute-force cracking the hash"

This concept isn't new.

You will find that the way the hashes overlap exactually cancel out anything
you can gain by triming down on the time value.

If you have a 8 byte file and a 4byte hash, you will find there are 256^4
combinations that make the same hash anyway, so your time needs to be 4 byte
specific anyway, so you can't compress.
 
Page 1 of 1       All times are GMT - 5 Hours
The time now is Mon Sep 08, 2008 12:53 am