Science Forum Index  »  Cryptography Forum  »  issues with statistical test suite from http://csrc.nist.gov
Cristiano
Posted: Sat Jan 24, 2004 11:28 am
Guest
Mack wrote:
Quote:
On Thu, 22 Jan 2004 19:30:50 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Wed, 21 Jan 2004 21:44:53 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Tue, 20 Jan 2004 08:59:02 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Mon, 19 Jan 2004 23:18:19 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Mon, 19 Jan 2004 20:28:40 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Sat, 17 Jan 2004 17:09:26 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

Luke Kenneth Casson Leighton wrote:

the people at csrc.nist.gov inform me that they used
blum-blum-shub as the "baseline" for the lempel-ziv test
(i haven't asked them about the other tests) and that
they then took EMPIRICALLY OBSERVED values for the mean
and standard deviation of the information that generates
the p-values.

They have also used a SHA-1 based generator to get the mean and
the variance. They updated those values with:
mean = 69588.20190000
variance = 73.23726011
which are good enough.

if they did the same on one or two other tests, it's
possible that they either didn't take a large enough
pseudo-random sample from which to derive the empirical
mean and s.d., or that there is a problem with the
pseudo-random generator that they used.

either way, a skew of the p-values is, as you say,
introduced.

No skewed p-values introduced.

This is easily testable. The p-values are not uniformly
distributed. There is skew. I have posted examples in a
separate message.

I have seen the message. We can try with an example.
Suppose you have n=1e6.
Calculate the p-values for a few values of W:
W       p-value
69588   0.49058890723625
69589   0.537151142026133
69590   0.58320929236385
69591   0.628151659956359
69592   0.67141122813222

You can clearly see that the p-value 0.51 (for example)
cannot exist because W = 69588.361 cannot exist (W is an
integer number).
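
A minimal sketch that reproduces the table above. It assumes the p-value is the one-sided normal tail 0.5 * erfc((mean - W) / sqrt(2 * variance)), using the empirical mean and variance quoted earlier in the thread. That formula is an assumption, chosen because it matches the posted figures; the names lz_p_value, MEAN and VARIANCE are illustrative only.

```python
import math

MEAN = 69588.20190000    # empirical mean quoted in the thread (n = 1e6 bits)
VARIANCE = 73.23726011   # empirical variance quoted in the thread


def lz_p_value(w, mean=MEAN, variance=VARIANCE):
    """One-sided normal-tail p-value for an integer word count w (assumed formula)."""
    return 0.5 * math.erfc((mean - w) / math.sqrt(2.0 * variance))


# W is an integer, so only a discrete ladder of p-values can ever occur.
for w in range(69588, 69593):
    print(w, lz_p_value(w))

# Distance between two neighbouring attainable p-values near the centre.
print("gap:", lz_p_value(69589) - lz_p_value(69588))
```

Near the centre of the distribution, neighbouring attainable p-values are several hundredths apart, which is why values such as 0.51 never show up.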

The p-values are *not* skewed, they don't exist. You just need
to use the test properly.

If they don't exist then there is skew in the output.

No.
You can see that they are *not* skewed by calculating the skewness:
it is about 0 (very good).
Another way is to look at a plot of the sorted p-values: you'll see
that they are roughly evenly distributed (the only problem is that
they "jump" over the bins).

A skewed distribution is, for example, a bell-shaped curve that looks
like a chi-squared one with df>3; in other words, the missing
p-values are in a tail.


Perhaps a better description is bias. However you define it the
distribution is not even.

Yes, but not skewed as you and the troll insist on saying.


The p-values of the FFT were skewed. Recheck my post with
the data. On the 1e6 x 100 test of Lempel-Ziv the data were
also skewed.

In which post? There is no post with: "Skewness= ...".
I checked the LZ test also for the skewness and I found no skewed
p-values. I don't know in which other way I can say that.


post: m7hn005agur5djmlkmbmcqmaelha74u38m@4ax.com

It is an e-mail! O_o

no, newsgroup post id.

I'm not able to find it. Do you have the Google link?


Quote:
I didn't specifically include skewness values because the
program doesn't automatically provide them. But from the
resultant data the skewness is obvious in several instances.

Working with the uniform binned data.
The expected mean is 5.5.
The SD is
100 = 2.8868
1000 = 2.8737

In the 1e6x1000 LZ case it is least obvious.
The total number of values on the left is 522.
The Skewness is about -.0515. We could argue about
exactly how significant (ses=.07746) this is but since it is
consistent across multiple tests it is relevant. The slight
skewness is only a side effect. The real problem was
expecting the p-values to be uniform when they are not.

In the 1e6x100 LZ only one value exceeds the expected mean
on the right while three do on the left.
Skewness= -.5689 (ses=.2449) obviously skewed.

Just to give an example, in statistical process control a
distribution is usually considered good if |skewness| <= .5 (there is
also the kurtosis, but it is irrelevant to our conversation).
That is to say, .57 is not so big. If you have seen that value
only once, it is not a problem.

skewness should be less than 2*ses which will vary by sample size.
ses=sqrt(6/n).

I'm not a mathematician, so I don't know if this rule really applies. Could
you elaborate a bit, please?


Quote:
The problem is that it isn't an isolated incident although I believe
it is the largest such value. No skewness values in the opposite
direction were encountered at that sample size.

Direction?


Quote:
The FFT are also obviously skewed for 1e5x1000 and 1e4x1000.
1e5x100=.4701 (ses=.2449)
1e5x1000=.5163 (ses=.07746)
1e4x1000=.4598 (ses=.07746)

The rank test is also skewed very slightly (insignificantly?) for
1e4x1000.
1e4x1000=.0460

The FFT is definitely skewed.

I said several times that FFT test must be used around 1e6 bits; 1e5
bits is not around 1e6 bits!
Try to check 1e6 or 2e6 bits, you should see a smaller skewness.


The 1e6 values were acceptable. You stated that my original
claim that I had found skew was false.

I've never said that you are a liar; I always said that you're not using the
test in the proper way, so your results are inconsistent (I'm aware that my
English could cause some misunderstanding).


Quote:
In fact the original poster
stated that this was only a problem below 1e6.

No, the original troller stated: "stay well clear of using this test", while
he should stay well clear of using his brain.
Anyway, I'm not interested in this topic. My only interest is in having a useful
conversation about testing the generators.


Quote:
Also the FFT p-values are not skewed (usually I get skewness=0.1,
0.2).


Are you using the sample mean or expected mean? For the 1e5 FFT I
never got a skewness below .4. For 1000 samples .2 would definitely
be a significant skew (2*ses=.15492).

Sure, you use that test in a bad way; n must be around 1e6 bit, do
you remember?
Anyway, your question seems strange. You must use the sample mean,
not the expected one.

That is incorrect when you are examining a sample presumed to be
from a specific distribution. That would measure skew with respect
to the sample itself, not with respect to the expected distribution.

1e4 x 1000
------------------------------------------------------------------------------
  C1  C2  C3  C4  C5  C6  C7  C8  C9  C10  P-VALUE    PROPORTION  STATISTICAL TEST
------------------------------------------------------------------------------
   0   0   0   0   0   0   0   0   0 1000  0.000000 * 1.0000      Lempel-Ziv
igamc underflow error occurs for Lempel-Ziv

As an example use the LZ test with 1e4 where all samples went to
bucket ten since the case statement doesn't handle this case. If you
use the sample mean then it has a skew of zero. ie. the mean is ten
and all samples go to bucket ten. This is obviously not what we
want the test to show. We are looking for a measure of how well
this conforms to the expected distribution, in this case the mean
should be 5.5 and the sample is badly skewed.

This is the first time I have heard of skewness being used for goodness of fit!
To see "how well this conforms to the expected distribution" you must not
use the skewness; you must use a proper test (KS test, chi-square or my SL
test, if you like). Why not also use the kurtosis, the median and so
forth? That way you have everything but the stuff you need.


Quote:
I think you have already agreed that the KS test of the p-values
for these tests is not correct. Specifically it isn't the correct
test.

I totally agree with your last sentence: the KS test *must* not be
used with LZ.
But all the tests in the suite are good to check a prng (if they
are properly used).


No argument here.

The problem is that the test suite produces a "finalAnalysisReport"
that indicates failures where there are none.
This is entirely because it uses an incorrect test
when producing this report.

That problem seems common for several tests (including DH). For this
reason I take each single test and then I use them in a better way.

Diehard has only given occasional bad results, i.e. isolated p-values,
with good data. The major problem I have with diehard is that it
isn't sensitive enough with processed data from physical random
number generators.

Do you think it should be?


Quote:
Diehard doesn't give KS results except where
it is appropriate.

So does NIST test. But exactly, what do you mean?


Quote:
Unfortunately the output is pretty hard to read.
I usually open it with a text editor and search for results of
.000, .00, and .0.

And when you find them what do you do?


Quote:
I am also having to create my own test suite because nothing
else meets my current needs. sts seems like a good package but
it has its limitations.

Yes, all the tests have limitations. I think that if one uses a test in the proper
way, the test can be useful anyway. The "proper way" could also be to discard
a test! I did that with some tests in dh.

Cristiano
Cristiano
Posted: Sat Jan 24, 2004 1:47 pm
Guest
Mack wrote:
Quote:
post: m7hn005agur5djmlkmbmcqmaelha74u38m@4ax.com

It is an e-mail! O_o

no, newsgroup post id.

Ok, I found it.

Cristiano
Mack
Posted: Sat Jan 24, 2004 10:22 pm
Guest
On Sat, 24 Jan 2004 16:28:27 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

Quote:
Mack wrote:
On Thu, 22 Jan 2004 19:30:50 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Wed, 21 Jan 2004 21:44:53 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Tue, 20 Jan 2004 08:59:02 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Mon, 19 Jan 2004 23:18:19 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Mon, 19 Jan 2004 20:28:40 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Sat, 17 Jan 2004 17:09:26 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

Luke Kenneth Casson Leighton wrote:

the people at csrc.nist.gov inform me that they used
blum-blum-shub as the "baseline" for the lempel-ziv test
(i haven't asked them about the other tests) and that
they then took EMPIRICALLY OBSERVED values for the mean
and standard deviation of the information that generates
the p-values.

They have also used a SHA-1 based generator to get the mean and
the variance. They updated those values with:
mean = 69588.20190000
variance = 73.23726011
which are good enough.

if they did the same on one or two other tests, it's
possible that they either didn't take a large enough
pseudo-random sample from which to derive the empirical
mean and s.d., or that there is a problem with the
pseudo-random generator that they used.

either way, a skew of the p-values is, as you say,
introduced.

No skewed p-values introduced.

This is easily testable. The p-values are not uniformly
distributed. There is skew. I have posted examples in a
separate message.

I have seen the message. We can try with an example.
Suppose you have n=1e6.
Calculate the p-values for a few values of W:
W       p-value
69588   0.49058890723625
69589   0.537151142026133
69590   0.58320929236385
69591   0.628151659956359
69592   0.67141122813222

You can clearly see that the p-value 0.51 (for example)
cannot exist because W = 69588.361 cannot exist (W is an
integer number).

The p-values are *not* skewed, they don't exist. You just need
to use the test properly.

If they don't exist then there is skew in the output.

No.
You can see that they are *not* skewed by calculating the skewness:
it is about 0 (very good).
Another way is to look at a plot of the sorted p-values: you'll see
that they are roughly evenly distributed (the only problem is that
they "jump" over the bins).

A skewed distribution is, for example, a bell-shaped curve that looks
like a chi-squared one with df>3; in other words, the missing
p-values are in a tail.


Perhaps a better description is bias. However you define it the
distribution is not even.

Yes, but not skewed as you and the troll insist on saying.


The p-values of the FFT were skewed. Recheck my post with
the data. On the 1e6 x 100 test of Lempel-Ziv the data were
also skewed.

In which post? There is no post with: "Skewness= ...".
I checked the LZ test also for the skewness and I found no skewed
p-values. I don't know in which other way I can say that.


post: m7hn005agur5djmlkmbmcqmaelha74u38m@4ax.com

It is an e-mail! O_o

no, newsgroup post id.

I'm not able to find it. Do you have the Google link?

http://groups.google.com/groups?q=m7hn005agur5djmlkmbmcqmaelha74u38m%404ax.com&hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=gpku0091c13avua0vst24v1gu09ppf4fne%404ax.com&rnum=1

Quote:


I didn't specifically include skewness values because the
program doesn't automatically provide them. But from the
resultant data the skewness is obvious in several instances.

Working with the uniform binned data.
The expected mean is 5.5.
The SD is
100 = 2.8868
1000 = 2.8737

In the 1e6x1000 LZ case it is least obvious.
The total number of values on the left is 522.
The Skewness is about -.0515. We could argue about
exactly how significant (ses=.07746) this is but since it is
consistent across multiple tests it is relevant. The slight
skewness is only a side effect. The real problem was
expecting the p-values to be uniform when they are not.

In the 1e6x100 LZ only one value exceeds the expected mean
on the right while three do on the left.
Skewness= -.5689 (ses=.2449) obviously skewed.

Just to give an example, in statistical process control a
distribution is usually considered good if |skewness| <= .5 (there is
also the kurtosis, but it is irrelevant to our conversation).
That is to say, .57 is not so big. If you have seen that value
only once, it is not a problem.

skewness should be less than 2*ses which will vary by sample size.
ses=sqrt(6/n).

I'm not a mathematician, so I don't know if this rule really applies. Could
you elaborate a bit, please?


ses is the standard error of skewness. It is similar to standard
deviation. Although taking the values as having the same
meaning is probably a bad idea.
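
A small sketch of the 2*ses rule of thumb discussed here. The expression sqrt(6/n) is the usual large-sample approximation to the standard error of skewness, derived for normally distributed data; whether it carries over unchanged to binned uniform p-values is an assumption. The helper names ses and is_significant are made up for illustration, and the skewness values are the ones quoted in this thread.

```python
import math


def ses(n):
    """Approximate standard error of skewness for n samples: sqrt(6 / n)."""
    return math.sqrt(6.0 / n)


def is_significant(skewness, n, factor=2.0):
    """Rule of thumb from the thread: flag |skewness| larger than factor * ses(n)."""
    return abs(skewness) > factor * ses(n)


# (sample count, skewness) pairs taken from the figures quoted in the thread.
for n, skew in [(1000, -0.0515), (100, -0.5689), (1000, 0.5163)]:
    print(n, round(ses(n), 5), skew, is_significant(skew, n))
```

With these inputs the -.0515 value falls inside the 2*ses band, while -.5689 (100 samples) and .5163 (1000 samples) fall outside it.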

Quote:

The problem is that it isn't an isolated incident although I believe
it is the largest such value. No skewness values in the opposite
direction were encountered at that sample size.

Direction?

left = negative
right = positive
In this case the skew was to the left.

Quote:


The FFT are also obviously skewed for 1e5x1000 and 1e4x1000.
1e5x100=.4701 (ses=.2449)
1e5x1000=.5163 (ses=.07746)
1e4x1000=.4598 (ses=.07746)

The rank test is also skewed very slightly (insignificantly?) for
1e4x1000.
1e4x1000=.0460

The FFT is definitely skewed.

I said several times that FFT test must be used around 1e6 bits; 1e5
bits is not around 1e6 bits!
Try to check 1e6 or 2e6 bits, you should see a smaller skewness.


The 1e6 values were acceptable. You stated that my original
claim that I had found skew was false.

I've never said that you are a liar; I always said that you're not using the
test in the proper way, so your results are inconsistent (I'm aware that my
English could cause some misunderstanding).

I was using the test as instructed by the manual. The fact that the
test is not valid for a range of values indicates a problem
with the test for general use.

Quote:


In fact the original poster
stated that this was only a problem below 1e6.

No, the original troller stated: "stay well clear of using this test", while
he should stay well clear of using his brain.
Anyway, I'm not interested in this topic. My only interest is in having a useful
conversation about testing the generators.


Also the FFT p-values are not skewed (usually I get skewness=0.1,
0.2).


Are you using the sample mean or expected mean? For the 1e5 FFT I
never got a skewness below .4. For 1000 samples .2 would definitely
be a significant skew (2*ses=.15492).

Sure, you use that test in a bad way; n must be around 1e6 bit, do
you remember?
Anyway, your question seems strange. You must use the sample mean,
not the expected one.

That is incorrect when you are examining a sample presumed to be
from a specific distribution. That would measure skew with respect
to the sample itself, not with respect to the expected distribution.

1e4 x 1000
------------------------------------------------------------------------------
  C1  C2  C3  C4  C5  C6  C7  C8  C9  C10  P-VALUE    PROPORTION  STATISTICAL TEST
------------------------------------------------------------------------------
   0   0   0   0   0   0   0   0   0 1000  0.000000 * 1.0000      Lempel-Ziv
igamc underflow error occurs for Lempel-Ziv

As an example use the LZ test with 1e4 where all samples went to
bucket ten since the case statement doesn't handle this case. If you
use the sample mean then it has a skew of zero. ie. the mean is ten
and all samples go to bucket ten. This is obviously not what we
want the test to show. We are looking for a measure of how well
this conforms to the expected distribution, in this case the mean
should be 5.5 and the sample is badly skewed.

This is the first time I have heard of skewness being used for goodness of fit!
To see "how well this conforms to the expected distribution" you must not
use the skewness; you must use a proper test (KS test, chi-square or my SL
test, if you like). Why not also use the kurtosis, the median and so
forth? That way you have everything but the stuff you need.


Skewness is a parameter of "goodness of fit" as is kurtosis and
median. Generally they are used for goodness of fit to a normal
curve but they can be used for other distributions as well.
Skewness measures symmetry about a point. Kurtosis could be
used but the expected value would not be zero as for the normal
curve when applied to a uniform distribution, although this can be
easily calculated.

We have already agreed that KS is not the right test here. Chi-square
or SL are more appropriate.
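
As a rough illustration of the chi-square alternative, the sketch below computes Pearson's statistic for ten bin counts against equal expected counts and compares it with the 1% critical value for 9 degrees of freedom (21.666). The function name and the "flat" counts are hypothetical; the other row is the 1e4 x 1000 Lempel-Ziv result quoted in this thread.

```python
def chi_square_uniform(counts):
    """Pearson chi-square statistic of bin counts against equal expected counts."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# Critical value of the chi-square distribution, 9 degrees of freedom, 1% level.
CHI2_9DF_1PCT = 21.666

all_in_last_bin = [0] * 9 + [1000]   # the 1e4 x 1000 Lempel-Ziv row from the thread
flat = [100] * 10                    # hypothetical perfectly even counts

for name, counts in [("all in C10", all_in_last_bin), ("flat", flat)]:
    stat = chi_square_uniform(counts)
    print(name, stat, "reject" if stat > CHI2_9DF_1PCT else "accept")
```

A statistic this far above the critical value corresponds to a p-value that is effectively zero, which fits the underflow message in the report.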

Quote:

I think you have already agreed that the KS test of the p-values
for these tests is not correct. Specifically it isn't the correct
test.

I totally agree with your last sentence: the KS test *must* not be
used with LZ.
But all the tests in the suite are good to check a prng (if they
are properly used).


No argument here.

The problem is that the test suite produces a "finalAnalysisReport"
that indicates failures where there are none.
This is entirely because it uses an incorrect test
when producing this report.

That problem seems common for several tests (including DH). For this
reason I take each single test and then I use them in a better way.

Diehard has only given occasional bad results, i.e. isolated p-values,
with good data. The major problem I have with diehard is that it
isn't sensitive enough with processed data from physical random
number generators.

Do you think it should be?

Since it was not designed with this purpose in mind I wouldn't
expect it to be.

Quote:


Diehard doesn't give KS results except where
it is appropriate.

So does NIST test. But exactly, what do you mean?

The finalAnalysisReport returns KS test values
where these values are not appropriate.

Quote:


Unfortunately the output is pretty hard to read.
I usually open it with a text editor and search for results of
.000, .00, and .0.

And when you find them what do you do?

Repeat that specific test with more data to determine if it is
isolated or consistent. The newer version of diehard returns
a final KS value but also states that it is more of a general
guide than absolute result.

Quote:


I am also having to create my own test suite because nothing
else meets my current needs. sts seems like a good package but
it has its limitations.

Yes, all the tests have limitations. I think that if one uses a test in the proper
way, the test can be useful anyway. The "proper way" could also be to discard
a test! I did that with some tests in dh.

Cristiano


I have never found it necessary to discard a DH test. They may not
detect a problem where there is one, but they have never given
a strong indication of a problem where one didn't exist.

I am still a bit suspicious of the FFT and LZ tests since they do not
yet have a firm mathematical foundation. They seem like good tests
but they are still empirical. Of course we should be suspicious of
any single test; only by using a number of tests can we be confident that
we aren't getting false positives or negatives.

Leslie 'Mack' McBride
remove text between _ marks to respond via e-mail
Cristiano
Posted: Sun Jan 25, 2004 8:23 am
Guest
Mack wrote:
Quote:
On Sat, 24 Jan 2004 16:28:27 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Thu, 22 Jan 2004 19:30:50 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

skewness should be less than 2*ses which will vary by sample size.
ses=sqrt(6/n).

I'm not a mathematician, so I don't know if this rule really
applies. Could you elaborate a bit, please?


ses is the standard error of skewness. It is similar to standard
deviation. Although taking the values as having the same
meaning is probably a bad idea.

I think so.


Quote:
Also the FFT p-values are not skewed (usually I get skewness=0.1,
0.2).


Are you using the sample mean or expected mean? For the 1e5 FFT I
never got a skewness below .4. For 1000 samples .2 would definitely
be a significant skew (2*ses=.15492).

Sure, you use that test in a bad way; n must be around 1e6 bit, do
you remember?
Anyway, your question seems strange. You must use the sample mean,
not the expected one.

That is incorrect when you are examining a sample presumed to be
from a specific distribution. That would measure skew with respect
to the sample itself, not with respect to the expected distribution.

1e4 x 1000
------------------------------------------------------------------------------
  C1  C2  C3  C4  C5  C6  C7  C8  C9  C10  P-VALUE    PROPORTION  STATISTICAL TEST
------------------------------------------------------------------------------
   0   0   0   0   0   0   0   0   0 1000  0.000000 * 1.0000      Lempel-Ziv
igamc underflow error occurs for Lempel-Ziv

As an example use the LZ test with 1e4 where all samples went to
bucket ten since the case statement doesn't handle this case. If you
use the sample mean then it has a skew of zero, i.e. the mean is ten
and all samples go to bucket ten. This is obviously not what we
want the test to show. We are looking for a measure of how well
this conforms to the expected distribution, in this case the mean
should be 5.5 and the sample is badly skewed.

This is the first time I have heard of skewness being used for goodness of fit!
To see "how well this conforms to the expected distribution" you must not
use the skewness; you must use a proper test (KS test, chi-square or my SL
test, if you like). Why not also use the kurtosis, the median and so
forth? That way you have everything but the stuff you need.


Skewness is a parameter of "goodness of fit" as is kurtosis and
median. Generally they are used for goodness of fit to a normal
curve but they can be used for other distributions as well.

I disagree.
You can use those parameters to see how much *they* differ from the
expected ones, but you shouldn't use them to see how much a set of samples
differs from the expected distribution.
I mean that if you see skewness=.65 you are not able to say how much your
distribution differs from the expected one. Obviously you could say that
your distribution is not perfectly good, but by how much?
On the contrary, with a proper goodness-of-fit test you are able to
calculate a p-value to say how much your distribution differs from the
expected one.

For example, I use the FFT to transform 32 kbit and then I calculate the
mean and the skewness of the transformed values.
If I calculate those parameters for a good prng and for a bad one (like
an lcg), I see that they are very different.
Unfortunately I can't know whether a generator under test is good or bad if it
gives values in the range skewness_bad ... skewness_good, because I don't
have a significance level.
To calculate a significance level I could run a KS test on the
transformed values, but I don't know how to do it.


Quote:
Skewness measures symmetry about a point. Kurtosis could be
used but the expected value would not be zero as for the normal
curve when applied to a uniform distribution, although this can be
easily calculated.

Sure, it is 6/5 * (n^2+1) / (n^2-1) for a discrete uniform distribution.
And when you get, for example, 7/6, what do you say? Is it good? Is it bad?
And how much?
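
The closed form can be checked numerically. The sketch below computes the excess kurtosis of a discrete uniform distribution on 1..n exactly from its central moments and compares it with -(6/5) * (n^2 + 1) / (n^2 - 1). Note the minus sign: the expression quoted in the post gives the magnitude, while the excess kurtosis itself is negative because the uniform distribution is flatter than the normal. The function names are illustrative.

```python
from fractions import Fraction


def excess_kurtosis_discrete_uniform(n):
    """Exact excess kurtosis of the uniform distribution on 1..n, from its moments."""
    values = range(1, n + 1)
    mean = Fraction(sum(values), n)
    m2 = sum((v - mean) ** 2 for v in values) / n   # variance
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return m4 / m2 ** 2 - 3


def closed_form(n):
    """-(6/5) * (n^2 + 1) / (n^2 - 1), as an exact fraction."""
    return -Fraction(6, 5) * Fraction(n * n + 1, n * n - 1)


n = 10  # ten bins, as in the binned p-value reports
print(excess_kurtosis_discrete_uniform(n), closed_form(n), float(closed_form(n)))
```

For n = 10 both routes give -202/165, about -1.224.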

Here I have two doubts:
1) Surely we get some information from those parameters, but can the
information gotten be used in testing a rng (in an efficient way)?
2) You say: "Skewness measures symmetry about a point". I don't know how you
calculate "your" skewness. Do you calculate it using the absolute moments or
the central moments in some "strange" way?


Quote:
We have already agreed that KS is not the right test here. Chi-square
or SL are more appropriate.

Yes, or perhaps no (see next paragraph).


Quote:
Diehard doesn't give KS results except where
it is appropriate.

So does NIST test. But exactly, what do you mean?

The finalAnalysisReport returns KS test values
where these values are not appropriate.

Who says that? Have you made a new discovery?
If you calculate the KS test for *only* one sequence, then the KS is good
enough.
But if you calculate the KS of the KS values obtained from 100 or 1000 sequences,
then the overall p-value is useless because the 100 or 1000 p-values are too
binned.


Quote:
Unfortunately the output is pretty hard to read.
I usually open it with a text editor and search for results of
.000, .00, and .0.

And when you find them what do you do?

Repeat that specific test with more data to determine if it is
isolated or consistent.

With more data? Each test needs a fixed number of 32-bit numbers (some tests
require slight variations in the number of input numbers).

Anyway, when is a generator definitely good or bad?


Quote:
The newer version of diehard returns
a final KS value but also states that it is more of a general
guide than absolute result.

That final KS p-value seems really useless calculated that way, because it
is the p-value of heterogeneous p-values.
I find it much more useful to calculate the overall p-value for each test run
100 times; if I have 16 tests, I get 16 overall p-values and then I can
see where the problem is.


Quote:
I am also having to create my own test suite because nothing
else meets my current needs. sts seems like a good package but
it has its limitations.

Yes, all the tests have limitations. I think that if one uses a test in the proper
way, the test can be useful anyway. The "proper way" could also be to discard
a test! I did that with some tests in dh.

I have never found it necessary to discard a DH test. They may not
detect a problem where there is one, but they have never given
a strong indication of a problem where one didn't exist.

I don't know the status of the newer version of dh, but that test has had
many problems (for example, see my post from September 2003 about the
bad distribution of the overlap sum test).


Quote:
I am still a bit suspicious of the FFT and LZ tests since they do not
yet have a firm mathematical foundation. They seem like good tests
but they are still empirical. Of course we should be suspicious of
any single test; only by using a number of tests can we be confident that
we aren't getting false positives or negatives.

I usually test the test itself to avoid surprises.

Cristiano
Mack
Posted: Mon Jan 26, 2004 2:38 am
Guest
On Sun, 25 Jan 2004 13:23:29 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

Quote:
Mack wrote:
On Sat, 24 Jan 2004 16:28:27 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Thu, 22 Jan 2004 19:30:50 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

skewness should be less than 2*ses which will vary by sample size.
ses=sqrt(6/n).

I'm not a mathematician, so I don't know if this rule really
applies. Could you elaborate a bit, please?


ses is the standard error of skewness. It is similar to standard
deviation. Although taking the values as having the same
meaning is probably a bad idea.

I think so.


Also the FFT p-values are not skewed (usually I get skewness=0.1,
0.2).


Are you using the sample mean or expected mean? For the 1e5 FFT I
never got a skewness below .4. For 1000 samples .2 would definitely
be a significant skew (2*ses=.15492).

Sure, you use that test in a bad way; n must be around 1e6 bit, do
you remember?
Anyway, your question seems strange. You must use the sample mean,
not the expected one.

That is incorrect when you are examining a sample presumed to be
from a specific distribution. That would measure skew with respect
to the sample itself, not with respect to the expected distribution.

1e4 x 1000
------------------------------------------------------------------------------
  C1  C2  C3  C4  C5  C6  C7  C8  C9  C10  P-VALUE    PROPORTION  STATISTICAL TEST
------------------------------------------------------------------------------
   0   0   0   0   0   0   0   0   0 1000  0.000000 * 1.0000      Lempel-Ziv
igamc underflow error occurs for Lempel-Ziv

As an example use the LZ test with 1e4 where all samples went to
bucket ten since the case statement doesn't handle this case. If you
use the sample mean then it has a skew of zero, i.e. the mean is ten
and all samples go to bucket ten. This is obviously not what we
want the test to show. We are looking for a measure of how well
this conforms to the expected distribution, in this case the mean
should be 5.5 and the sample is badly skewed.

This is the first time I have heard of skewness being used for goodness of fit!
To see "how well this conforms to the expected distribution" you must not
use the skewness; you must use a proper test (KS test, chi-square or my SL
test, if you like). Why not also use the kurtosis, the median and so
forth? That way you have everything but the stuff you need.


Skewness is a parameter of "goodness of fit" as is kurtosis and
median. Generally they are used for goodness of fit to a normal
curve but they can be used for other distributions as well.

I disagree.
You can use those parameters to see how much *they* differ from the
expected ones, but you shouldn't use them to see how much a set of samples
differs from the expected distribution.
I mean that if you see skewness=.65 you are not able to say how much your
distribution differs from the expected one. Obviously you could say that
your distribution is not perfectly good, but by how much?
On the contrary, with a proper goodness-of-fit test you are able to
calculate a p-value to say how much your distribution differs from the
expected one.

In this specific case the p-values from a KS test were showing a value
less than .001, quite a bit different from uniform. The skewness is a
measure of how it isn't uniform. In a normal distribution it would be
a measure of how it differs from the normal.

Quote:

For example, I use the FFT to transform 32 kbit and then I calculate the
mean and the skewness of the transformed values.
If I calculate those parameters for a good prng and for a bad one (like
an lcg), I see that they are very different.
Unfortunately I can't know whether a generator under test is good or bad if it
gives values in the range skewness_bad ... skewness_good, because I don't
have a significance level.
To calculate a significance level I could run a KS test on the
transformed values, but I don't know how to do it.

The traditional skewness level is 2*ses. Anything outside of this is
definitely bad if it is recurring. Anything inside of 1/2*ses is
probably ok.

Quote:


Skewness measures symmetry about a point. Kurtosis could be
used but the expected value would not be zero as for the normal
curve when applied to a uniform distribution, although this can be
easily calculated.

Sure, it is 6/5 * (n^2+1) / (n^2-1) for a discrete uniform distribution.
And when you get, for example, 7/6, what do you say? Is it good? Is it bad?
And how much?
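
For reference, the closed form can be checked against the raw moments. This is a sketch that assumes the usual excess-kurtosis convention (normal curve = 0); under that convention the value for a discrete uniform distribution on 1..n is negative, with the magnitude given by the 6/5 expression above. Exact rational arithmetic is used so the comparison is not blurred by rounding.

```python
from fractions import Fraction


def excess_kurtosis_discrete_uniform(n):
    """Exact excess kurtosis of the uniform distribution on 1..n, from moments."""
    values = range(1, n + 1)
    mean = Fraction(sum(values), n)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3


def closed_form(n):
    """-(6/5) * (n^2 + 1) / (n^2 - 1)."""
    return Fraction(-6, 5) * Fraction(n * n + 1, n * n - 1)


for n in (2, 10, 100):
    assert excess_kurtosis_discrete_uniform(n) == closed_form(n)
    print(n, float(closed_form(n)))
```

As n grows the value tends to -6/5, the excess kurtosis of the continuous uniform distribution.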

Here I have two doubts:
1) Surely we get some information from those parameters, but can the
information gotten be used in testing a rng (in an efficient way)?
2) You say: "Skewness measures symmetry about a point". I don't know how you
calculate "your" skewness. Do you calculate it using the absolute moments or
the central moments in some "strange" way?

skewness=1/n*(sum from 1 to n((Yi-expected mean)^3))/expected
deviation^3

specifically:

Skewness is a measure of symmetry, or more precisely, the lack of
symmetry. A distribution, or data set, is symmetric if it looks the
same to the left and right of the center point

from: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm

and for Kurtosis:

Kurtosis is a measure of whether the data are peaked or flat relative
to a normal distribution. That is, data sets with high kurtosis tend
to have a distinct peak near the mean, decline rather rapidly, and
have heavy tails. Data sets with low kurtosis tend to have a flat top
near the mean rather than a sharp peak. A uniform distribution would
be the extreme case.

Quote:


We have already agreed that KS is not the right test here. Chi-square
or SL are more appropriate.

Yes, or perhaps no (see next paragraph).


Diehard doesn't give KS results except where
it is appropriate.

So does NIST test. But exactly, what do you mean?

The finalAnalysisReport returns KS test values
where these values are not appropriate.

Who says that? Have you made a new discovery?
If you calculate the KS test for *only* one sequence, then the KS is good
enough.

Why would you ever use a KS test on one p-value?

Quote:
But if you calculate the KS of the KS's gotten from 100 or 1000 sequences,
then the overall p-value is useless because the 100 or 1000 p-values are too
binned.


Unfortunately the output is pretty hard to read.
I usually open it with a text editor and search for results of
.000, .00, and .0.

And when you find them what do you do?

Repeat that specific test with more data to determine if it is
isolated or consistent.

With more data? Each test needs a fixed number of 32-bit numbers (some tests
require slight variations on the number of input numbers).

Anyway, when is a generator definitely good or bad?


A generator is bad when it fails a test that has a good mathematical
foundation in a spectacular manner. Or alternatively you can say that
a generator is bad when it fails some test (mathematical or empirical)
that other "good" generators pass. Of course a single number that is
.0000 or .9999 is not an indication of failure; it must do so
consistently, since a value like that happens randomly 1 time in
10000.

A generator is good when it meets the criteria for which it will be
used. A bad generator such as a simple congruential generator may
be a "good" generator if we are using it in a non-demanding
application.

Quote:

The newer version of diehard returns
a final KS value but also states that it is more of a general
guide than absolute result.

That final KS p-value seems really useless calculated that way, because it
is the p-value of heterogeneous p-values.
I find it much more useful to calculate the overall p-value for each test done
100 times; if I have 16 tests, I get 16 overall p-values and then I can
see where the problem is.


I am also having to create my own test suite because nothing
else meets my current needs. sts seems like a good package but
it has its limitations.

Yes, all the tests have limitations. I think if one uses a test in a
proper way the test can be useful anyway. The "proper way" could be also to
discard a test! I have done that with some tests in dh.

I have never found it necessary to discard a DH test. They may not
detect a problem where there is one but they have never given
a strong result of a problem where one didn't exist.

I don't know the status of the newer version of dh, but that test has had
many problems (for example you could see my post of September 2003 about the
bad distribution of the overlap sum test).


The newer version is still being listed as 0.2 beta. However that
error has been corrected. The big three tests Gorilla, GCD, and
Birthday Spacings test all seem to be functioning adequately.

Quote:

I am still a bit suspicious of the FFT and LZ tests since they do not
yet have a firm mathematical foundation. They seem like good tests
but they are still empirical. Of course we should be suspicious of
any single test; only by using a number of tests can we be certain that
we aren't getting false positives or negatives.

I'm used to testing the test to avoid some surprise.

Cristiano


Leslie 'Mack' McBride
remove text between _ marks to respond via e-mail
Cristiano
Posted: Mon Jan 26, 2004 2:58 pm
Guest
Mack wrote:
Quote:
On Sun, 25 Jan 2004 13:23:29 GMT, "Cristiano"
cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Sat, 24 Jan 2004 16:28:27 GMT, "Cristiano"
cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Thu, 22 Jan 2004 19:30:50 GMT, "Cristiano"
cristiano.pi@NSquipo.it> wrote:

Skewness measures symmetry about a point. Kurtosis could be
used but the expected value would not be zero as for the normal
curve when applied to a uniform distribution, although this can be
easily calculated.

Sure, it is 6/5 * (n^2+1) / (n^2-1) for a discrete uniform
distribution. And when you get, for example, 7/6, what do you say? Is
it good? Is it bad? And how much?

Here I have two doubts:
1) Surely we get some information from those parameters, but can the
information gotten be used in testing a rng (in an efficient way)?
2) You say: "Skewness measures symmetry about a point". I don't know
how you calculate "your" skewness. Do you calculate it using the
absolute moments or the central moments in some "strange" way?

skewness=1/n*(sum from 1 to n((Yi-expected mean)^3))/expected
deviation^3

specifically:

Skewness is a measure of symmetry, or more precisely, the lack of
symmetry. A distribution, or data set, is symmetric if it looks the
same to the left and right of the center point

from: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm

They wrote that Y is the mean, not the expected mean and that you should use
the standard deviation, not the expected deviation.
As I already said, you use a useless method in a bad way.


Quote:
Diehard doesn't give KS results except where
it is appropriate.

So does NIST test. But exactly, what do you mean?

The finalAnalysisReport returns KS test values
where these values are not appropriate.

Who says that? Have you made a new discovery?
If you calculate the KS test for *only* one sequence, then the KS is
good enough.

Why would you ever use a KS test on one p-value?

Uh!? I know that my English is bad, but not so bad! Do you understand what I
say?
You said that the KS test used for the final report is not appropriate.
I said that if you run the FFT test once, the related KS test is good enough.

One can use whatever he wants, but the KS test used for the
finalAnalysisReport is appropriate.

Quote:
But if you calculate the KS of the KS's gotten from 100 or 1000
sequences, then the overall p-value is useless because the 100 or
1000 p-values are too binned.


Unfortunately the output is pretty hard to read.
I usually open it with a text editor and search for results of
.000, .00, and .0.

And when you find them what do you do?

Repeat that specific test with more data to determine if it is
isolated or consistent.

With more data? Each test needs a fixed number of 32-bit numbers
(some tests require slight variations on the number of input
numbers).

Anyway, when is a generator definitely good or bad?


A generator is bad when it fails a test that has a good mathematical
foundation in a spectacular manner

Could you list some test that has a good mathematical foundation?


Quote:
Or alternatively you can say that
a generator is bad when it fails some test (mathematical or empirical)
that other "good" generators pass. Of course a single number that is
.0000 or .9999 is not an indication of failure; it must do so
consistently, since a value like that happens randomly 1 time in
10000.

You still haven't said when a generator is good or bad.


Quote:
A generator is good when it meets the criteria for which it will be
used.

If the criteria are shown by a test, then this is an incredibly big error!
What do you mean, exactly?


Quote:
A bad generator such as a simple congruential generator may
be a "good" generator if we are using it in a non-demanding
application.

It seems to me that you don't have an objective method to say: "this is
good" or "this is bad".
I think it is because of the way you use skewness mixed with KS, chi-square
and ayes.


Quote:
I am also having to create my own test suite because nothing
else meets my current needs. sts seems like a good package but
it has its limitations.

Yes, all the tests have limitations. I think if one uses a test in
a proper way the test can be useful anyway. The "proper way" could be
also to discard a test! I have done that with some tests in dh.

I have never found it necessary to discard a DH test. They may not
detect a problem where there is one but they have never given
a strong result of a problem where one didn't exist.

I don't know the status of the newer version of dh, but that test
has had many problems (for example you could see my post of
September 2003 about the bad distribution of the overlap sum test).


The newer version is still being listed as 0.2 beta. However that
error has been corrected. The big three tests Gorilla, GCD, and
Birthday Spacings test all seem to be functioning adequately.

Have you double checked that?

Cristiano
Mack
Posted: Wed Jan 28, 2004 6:11 am
Guest
On Mon, 26 Jan 2004 19:58:56 GMT, "Cristiano"
<cristiano.pi@NSquipo.it> wrote:

Quote:
Mack wrote:
On Sun, 25 Jan 2004 13:23:29 GMT, "Cristiano"
cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Sat, 24 Jan 2004 16:28:27 GMT, "Cristiano"
cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Thu, 22 Jan 2004 19:30:50 GMT, "Cristiano"
cristiano.pi@NSquipo.it> wrote:

Skewness measures symmetry about a point. Kurtosis could be
used but the expected value would not be zero as for the normal
curve when applied to a uniform distribution, although this can be
easily calculated.

Sure, it is 6/5 * (n^2+1) / (n^2-1) for a discrete uniform
distribution. And when you get, for example, 7/6, what do you say? Is
it good? Is it bad? And how much?

Here I have two doubts:
1) Surely we get some information from those parameters, but can the
information gotten be used in testing a rng (in an efficient way)?
2) You say: "Skewness measures symmetry about a point". I don't know
how you calculate "your" skewness. Do you calculate it using the
absolute moments or the central moments in some "strange" way?

skewness=1/n*(sum from 1 to n((Yi-expected mean)^3))/expected
deviation^3

specifically:

Skewness is a measure of symmetry, or more precisely, the lack of
symmetry. A distribution, or data set, is symmetric if it looks the
same to the left and right of the center point

from: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm

They wrote that Y is the mean, not the expected mean and that you should use
the standard deviation, not the expected deviation.

In any statistical calculation you can test either against the sample
or against the expected. In the case of that specific program there
is no way to enter an expected outcome. They do not say sample mean
or sample standard deviation either.
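
The two readings of the formula can be put side by side. This is a sketch, assuming p-values that should be uniform on (0, 1), so the expected mean is 0.5 and the expected deviation is sqrt(1/12); the sample variant divides by n when computing the deviation, which is a choice made here. The data are made up.

```python
import math


def skewness_about(values, center, deviation):
    """Third standardized moment about a chosen center and deviation."""
    n = len(values)
    return sum((v - center) ** 3 for v in values) / n / deviation ** 3


def skewness_expected(values, expected_mean=0.5, expected_sd=math.sqrt(1 / 12)):
    """Variant from the post: uses the moments of the hypothesized U(0,1)."""
    return skewness_about(values, expected_mean, expected_sd)


def skewness_sample(values):
    """Variant using the sample's own mean and deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return skewness_about(values, mean, sd)


# Made-up data: symmetric about 0.3, so it is shifted but not lopsided.
shifted = [0.1, 0.2, 0.3, 0.4, 0.5]
print(skewness_sample(shifted))    # close to zero: symmetric about its own mean
print(skewness_expected(shifted))  # negative: mass sits below the expected 0.5
```

The sample version only sees lopsidedness about the data's own center, while the expected-moment version also reacts to a shift away from the hypothesized center, which is the distinction being argued here.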

Quote:
As I already said, you use a useless method in a bad way.

If you don't like my methodology don't use it.
But it measures exactly what I was trying to measure
in an effective way. I did not use this test to determine
that the sample was not uniform. I used this test to try
and determine how it was not uniform.

The same type of statistic can be calculated
by using the mean deviation from the expected. However
it is more sensitive to the center and less sensitive to the
tails.

t=(mean-expected mean)/(standard deviation of the mean)
standard deviation of the mean=expected standard deviation/sqrt(N)

in the 1e6x100 LZ case this is
t=(3.63-5.5)/(2.8868*sqrt(1/100))=-.6478/.1=-6.48
The correct number of degrees of freedom is 100.

Again this uses a simplified version that is oriented toward the
expected variance of the mean instead of the estimated
variance of the mean. In this case I was simply too lazy to
calculate it but the result will be similar.

Again we can clearly see that the distribution is not
symmetrical about the expected mean. The means are
not equal.

in the 1e6x1000 case this is
t=(5.423-5.5)/(2.8737*sqrt(1/1000))=-.847
which is not significant but the prior case makes
us wonder if the measurement is correct.
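
The arithmetic for both cases can be reproduced with a few lines. This is only a sketch: the function name is made up, and the statistic is the mean shift divided by the expected deviation scaled by 1/sqrt(N), using the figures quoted above.

```python
import math


def mean_shift_statistic(sample_mean, expected_mean, expected_sd, n):
    """(sample mean - expected mean) divided by expected_sd / sqrt(n)."""
    return (sample_mean - expected_mean) / (expected_sd / math.sqrt(n))


# Figures quoted in the post: observed mean, expected mean, deviation, run count.
t_100 = mean_shift_statistic(3.63, 5.5, 2.8868, 100)
t_1000 = mean_shift_statistic(5.423, 5.5, 2.8737, 1000)
print(f"1e6 x 100:  t = {t_100:.2f}")
print(f"1e6 x 1000: t = {t_1000:.3f}")
```

This gives -6.48 and -0.847, matching the values in the post.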

I will leave the calculation of the FFT values to
someone who cares since we have already agreed that
the FFT does not perform as expected on values below
1e6.

Quote:


Diehard doesn't give KS results except where
it is appropriate.

So does NIST test. But exactly, what do you mean?

The finalAnalysisReport returns KS test values
where these values are not appropriate.

Who says that? Have you made a new discovery?
If you calculate the KS test for *only* one sequence, then the KS is
good enough.

Why would you ever use a KS test on one p-value?

Uh!? I know that my English is bad, but not so bad! Do you understand what I
say?
You said that the KS test used for the final report is not appropriate.
I said that if you run the FFT test once, the related KS test is good enough.


That would be running a KS test on one p-value. A KS test is not
appropriate on one p-value.

Quote:
One can use whatever he wants, but the KS test used for the
finalAnalysisReport is appropriate.


quoting from post: ppCPb.126628$VW.5097709@news3.tin.it

I said:
Quote:
I think you have already agreed that the KS test of the p-values
for these tests is not correct. Specifically it isn't the correct
test.

and you said:
Quote:
I totally agree with your last sentence: the KS test *must* not be used with
LZ.
But all the tests in the suite are good to check a prng (if they are
properly used).

therefore the finalAnalysisReport returns KS test values
where these values are not appropriate.

now back to your last post:
Quote:
But if you calculate the KS of the KS's gotten from 100 or 1000
sequences, then the overall p-value is useless because the 100 or
1000 p-values are too binned.
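
The binning effect can be illustrated with a small experiment. This sketch assumes a one-sample KS distance against the continuous uniform distribution and the asymptotic 1% critical value 1.628/sqrt(n); the ten-point grid of attainable p-values and the seed are made up for illustration.

```python
import math
import random


def ks_statistic_uniform(values):
    """One-sample Kolmogorov-Smirnov distance D against the continuous U(0,1)."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        d = max(d, i / n - x, x - (i - 1) / n)
    return d


rng = random.Random(1)             # hypothetical seed, for repeatability only
n = 1000
continuous = [rng.random() for _ in range(n)]
# Each of ten attainable p-values 0.1, 0.2, ..., 1.0 is equally likely.
binned = [(rng.randrange(10) + 1) / 10 for _ in range(n)]

critical_1pct = 1.628 / math.sqrt(n)   # asymptotic 1% critical value
print(f"critical D (1%): {critical_1pct:.4f}")
print(f"continuous: D = {ks_statistic_uniform(continuous):.4f}")
print(f"binned:     D = {ks_statistic_uniform(binned):.4f}")
```

Because no binned value can fall below 0.1, D is at least 0.1 however many p-values are collected, so the KS test rejects even though every bin is hit equally often.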


Unfortunately the output is pretty hard to read.
I usually open it with a text editor and search for results of
.000, .00, and .0.

And when you find them what do you do?

Repeat that specific test with more data to determine if it is
isolated or consistent.

With more data? Each test needs a fixed number of 32-bit numbers
(some tests require slight variations on the number of input
numbers).

Anyway, when is a generator definitely good or bad?


A generator is bad when it fails a test that has a good mathematical
foundation in a spectacular manner

Could you list some test that has a good mathematical foundation?

The Rank test, the Count the Ones test, Runs test, Craps test,
Poker test, Distribution test, Overlapping Pairs/Triples/Quadruples
test (the sparse form uses empirical data, only the raw form has a
completely mathematical foundation). The birthday spacings test
claims to have good mathematical foundation but some parameters
of the test are empirical. I haven't read the detailed discussion of
this test so I can't confirm the foundation.

Correct me if I am wrong on any of these or missed any that
don't use empirically derived data to test against.

Quote:


Or alternatively you can say that
a generator is bad when it fails some test (mathematical or empirical)
that other "good" generators pass. Of course a single number that is
.0000 or .9999 is not an indication of failure; it must do so
consistently, since a value like that happens randomly 1 time in
10000.

You still haven't said when a generator is good or bad.


That is a pretty good definition of bad. There isn't a really
satisfactory definition of good except that it isn't bad.
An ideal generator produces all sets of possible output
with equal probability and no correlation.

All PRNGs are of course correlated in some way. Given
the initial state of a PRNG you can predict the next value
with 100% probability.
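
A minimal sketch of that predictability, assuming a linear congruential generator whose output is its whole state. The multiplier and increment are one commonly published 32-bit choice, used here purely for illustration.

```python
class Lcg:
    """Linear congruential generator x -> (a*x + c) mod m."""

    def __init__(self, state, a=1664525, c=1013904223, m=2 ** 32):
        self.state, self.a, self.c, self.m = state, a, c, m

    def next(self):
        self.state = (self.a * self.state + self.c) % self.m
        return self.state


original = Lcg(state=12345)
outputs = [original.next() for _ in range(5)]

# An observer who learns the state after the third output can clone the generator.
clone = Lcg(state=outputs[2])
predicted = [clone.next() for _ in range(2)]
print(predicted == outputs[3:])
```

The clone reproduces the remaining outputs exactly, which is the sense in which every PRNG is correlated with its own state.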

Quote:

A generator is good when it meets the criteria for which it will be
used.

If the criteria are shown by a test, then this is an incredibly big error!
What do you mean, exactly?


The most common criteria are lack of correlation and even
distribution. They are easy to test in theory but hard in practice.
All PRNGs are by definition correlated in some way (ie. they have
state).

Congruential generators are correlated in one way while
lagged Fibonacci generators are correlated in different ways. Shift register
generators are also correlated in certain ways. All have provable
periods and even distribution.

Other generators such as AWC and MWC have radically different
types of correlation but are slightly biased. The bias is specific
to the generator but is 2 parts or more over the period. For an
extremely long period this is negligible.

KISS is an example of a combination generator that passes
almost any test you throw at it.

In the cryptographic arena an additional requirement is
the inability to determine the state from the output.
BBS and ARC4 are examples. BBS is based on the idea
that factoring is difficult. ARC4 uses shuffling and has been
shown to have some weaknesses in its output. I haven't kept
up with the status but it is still being used. BBS is probably
a lot closer to 'ideal' but is very slow.
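
For readers who have not seen BBS, the core step is a repeated squaring modulo a product of two primes that are both congruent to 3 mod 4. The sketch below uses toy primes chosen only so the numbers stay readable; it takes one low bit per step, and the function name is made up.

```python
def blum_blum_shub_bits(seed, p, q, count):
    """Return `count` bits: square modulo p*q and take the low bit each step."""
    if p % 4 != 3 or q % 4 != 3:
        raise ValueError("p and q must both be congruent to 3 mod 4")
    modulus = p * q
    state = seed % modulus
    bits = []
    for _ in range(count):
        state = state * state % modulus
        bits.append(state & 1)
    return bits


# Toy parameters only: real use needs large secret primes and a seed coprime to p*q.
print(blum_blum_shub_bits(seed=3, p=11, q=23, count=6))
```

With these toy values the states run 9, 81, 236, 36, 31, 202, so the bits are 1, 1, 0, 0, 1, 0. One modular squaring per output bit is where the slowness comes from.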

Quote:

A bad generator such as a simple congruential generator may
be a "good" generator if we are using it in a non-demanding
application.

It seems to me that you don't have an objective method to say: "this is
good" or "this is bad".
I think it is because of the way you use skewness mixed with KS, chi-square
and ayes.


I have several objective ways of saying "this is bad". For any
specific PRNG there is no single way of saying this is good.
Ideally we want something computationally infeasible to predict
but no one has ever proven that is possible. See the whole group
of P vs. NP threads for that argument.

Quote:

I am also having to create my own test suite because nothing
else meets my current needs. sts seems like a good package but
it has its limitations.

Yes, all the tests have limitations. I think if one uses a test in
a proper way the test can be useful anyway. The "proper way" could be
also to discard a test! I have done that with some tests in dh.

I have never found it necessary to discard a DH test. They may not
detect a problem where there is one but they have never given
a strong result of a problem where one didn't exist.

I don't know the status of the newer version of dh, but that test
has had many problems (for example you could see my post of
September 2003 about the bad distribution of the overlap sum test).


The newer version is still being listed as 0.2 beta. However that
error has been corrected. The big three tests Gorilla, GCD, and
Birthday Spacings test all seem to be functioning adequately.

Have you double checked that?

As well as I can. My "presumed good" generators pass the tests.

The GCD and gorilla tests rely on empirically
derived data. The birthday spacings has a more mathematical
basis. I haven't actually read the detailed discussion of the
test so I can't comment extensively on its development.

From: http://www.jstatsoft.org/v07/i03/tuftests.pdf

Note that we do not need the true distributions to develop a test
of randomness. All we need to do is compare the sample
distribution from a particular RNG with the standard provided by a
number of presumably good RNG's, 'presumably good' meaning that
they produce results so close to a single one that the single one may
be used as a standard. ...
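
The idea in that passage can be sketched as follows. The assumptions are mine: Python's built-in Mersenne Twister stands in for the "presumably good" RNG, the statistic is a simple run count over 1024 bits, and the generator under test is an alternating bit pattern. None of this is the procedure used by any particular test suite.

```python
import math
import random


def count_runs(bits):
    """Number of maximal blocks of equal consecutive bits."""
    return 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)


def empirical_standard(reference_rng, samples=200, length=1024):
    """Mean and deviation of the run count over sequences from a trusted source."""
    counts = []
    for _ in range(samples):
        bits = [reference_rng.getrandbits(1) for _ in range(length)]
        counts.append(count_runs(bits))
    mean = sum(counts) / samples
    sd = math.sqrt(sum((c - mean) ** 2 for c in counts) / (samples - 1))
    return mean, sd


# Assumption: Python's Mersenne Twister stands in for the "presumably good" RNG.
mean, sd = empirical_standard(random.Random(7))

alternating = [i % 2 for i in range(1024)]   # an obviously bad "generator"
z = (count_runs(alternating) - mean) / sd
print(f"empirical mean={mean:.1f} sd={sd:.1f} z(alternating)={z:.1f}")
```

The weakness raised in this thread applies directly: the standard is only as trustworthy as the reference generator and the number of samples behind it.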

Since some of the best minds in mathematics can't come up with
a better definition of a good RNG then I certainly don't expect to do
so. But I don't necessarily trust the empirically derived data.

Quote:

Cristiano


Leslie 'Mack' McBride
remove text between _ marks to respond via e-mail
Mack
Posted: Thu Jan 29, 2004 5:38 am
Guest
On Wed, 28 Jan 2004 11:11:33 GMT, Mack
<macckone@a_nospamjunk123_ol.com> wrote:

Quote:
On Mon, 26 Jan 2004 19:58:56 GMT, "Cristiano"
cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Sun, 25 Jan 2004 13:23:29 GMT, "Cristiano"
cristiano.pi@NSquipo.it> wrote:

Mack wrote:
On Sat, 24 Jan 2004 16:28:27 GMT, "Cristiano"
cristiano.pi@NSquipo.it> wrote:

[snip]

I don't know the status of the newer version of dh, but that test
has had many problems (for example you could see my post of
September 2003 about the bad distribution of the overlap sum test).


The newer version is still being listed as 0.2 beta. However that
error has been corrected. The big three tests Gorilla, GCD, and
Birthday Spacings test all seem to be functioning adequately.

Have you double checked that?


I checked on the web site: the currently available beta version does
not have the fix for the OSUM test. There is supposedly a version
where it works but it isn't on the web site. The older diehard
package has the same test that works properly and a newsgroup
message with the Fortran source for the correction was posted. This
was supposedly recoded sometime between October and now. I stand
corrected on that test; it does sometimes cause a false failure for a
good generator. I haven't had a "good" generator fail below the .5 %
mark but it still looks odd.

Quote:
As well as I can. My "presumed good" generators pass the tests.

The GCD and gorilla tests rely on empirically
derived data. The birthday spacings has a more mathematical
basis. I haven't actually read the detailed discussion of the
test so I can't comment extensively on its development.

From: http://www.jstatsoft.org/v07/i03/tuftests.pdf

Note that we do not need the true distributions to develop a test
of randomness. All we need to do is compare the sample
distribution from a particular RNG with the standard provided by a
number of presumably good RNG's, 'presumably good' meaning that
they produce results so close to a single one that the single one may
be used as a standard. ...

Since some of the best minds in mathematics can't come up with
a better definition of a good RNG then I certainly don't expect to do
so. But I don't necessarily trust the empirically derived data.


Cristiano


Leslie 'Mack' McBride
remove text between _ marks to respond via e-mail

 