Summary statistics of a long data stream
Page 1 of 1
Stephen (Guest)
Posted: Sun Mar 25, 2007 9:23 pm
I have an application where I have to monitor a stream of
single-precision reals read from a file. Apart from processing this
stream to calculate some values (which are not important here), I need
to calculate summary statistics for the stream itself: min, max, mean,
median, and the upper and lower quartiles. There are several
constraints on the problem. First, I can only read the data once; that
is, I have to read the data from beginning to end, process each value
as it is read, and then discard it. Second, the amount of data is
quite large, generally larger than the available memory, so I can't
save it for later processing. Finally, I don't know the size of the
data stream beforehand; I only know when it finishes. Data streams
can range from rather small (a few hundred values) to quite large (on
the order of 10^9 values). I don't know the statistical distribution
of the values beforehand, but nearby values tend to show some local
correlation (i.e. they are not necessarily IID).
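To make the single-pass constraint concrete, here is a minimal Python
sketch of an accumulator that sees each value once and keeps only a
few numbers in memory. The class name RunningStats and its fields are
made up for illustration. The incremental mean update, done in double
precision, is one common way to avoid building up a huge running sum;
it is an assumption here, not something taken from the original
application.

```python
import math


class RunningStats:
    """One-pass min, max, and mean; each value is seen once, then discarded."""

    def __init__(self):
        self.count = 0
        self.minimum = math.inf
        self.maximum = -math.inf
        self.mean = 0.0  # Python floats are double precision

    def update(self, x):
        """Fold one new value into the running statistics."""
        self.count += 1
        if x < self.minimum:
            self.minimum = x
        if x > self.maximum:
            self.maximum = x
        # Incremental mean: adjust the current estimate by a small
        # correction instead of accumulating one very large sum.
        self.mean += (x - self.mean) / self.count


stats = RunningStats()
for value in [1.0, 2.0, 3.0, 4.0]:  # stands in for the real data stream
    stats.update(value)
print(stats.count, stats.minimum, stats.maximum, stats.mean)
```

The stream length never has to be known in advance: the accumulator
is valid after every call to update().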
Calculating the min and max is trivial, but for truly large data
streams it's easy for roundoff error to creep into the calculation
of the mean and quartiles. I've handled this by just sampling the
data stream occasionally, and this seems to work OK, but I'd like to
know if anyone knows of a more sophisticated approach, especially for
reliable median and quartile calculation.
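One way to put the "occasional sampling" idea on firmer ground is
reservoir sampling, which keeps a fixed-size uniform random sample of
a stream of unknown length. The sketch below is only an illustration
under that assumption: the class name ReservoirQuantiles, the default
capacity, and the linear interpolation used for quantiles are all
choices made for this example, and the results are estimates, not
exact quartiles.

```python
import random


class ReservoirQuantiles:
    """Approximate quantiles from a fixed-size uniform sample of a stream."""

    def __init__(self, capacity=10000, seed=None):
        self.capacity = capacity   # memory budget (hypothetical default)
        self.sample = []
        self.seen = 0
        self.rng = random.Random(seed)

    def update(self, x):
        """Offer one value; it is kept with probability capacity / seen."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(x)
        else:
            j = self.rng.randrange(self.seen)  # uniform in [0, seen)
            if j < self.capacity:
                self.sample[j] = x

    def quantile(self, q):
        """Estimate the q-quantile (0 <= q <= 1) by linear interpolation."""
        if not self.sample:
            raise ValueError("no data seen")
        data = sorted(self.sample)
        pos = q * (len(data) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(data) - 1)
        return data[lo] + (pos - lo) * (data[hi] - data[lo])


res = ReservoirQuantiles(capacity=1000, seed=0)
for value in range(1, 102):  # short stream: the sample holds every value
    res.update(float(value))
print(res.quantile(0.25), res.quantile(0.5), res.quantile(0.75))
```

Because every value has the same chance of ending up in the sample,
local correlation between neighbouring values does not bias the
estimate. When the stream is shorter than the capacity the quantiles
come from the full data; for longer streams the accuracy depends on
the sample size rather than on the stream length.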
Any suggestions would be appreciated.
Stephen
All times are GMT - 5 Hours