Science Forum Index » Statistics - Math Forum » Can I consider a dataset moving average as...
| Gilberto Reis Filho... |
Posted: Tue May 06, 2008 5:19 am |
Hello, maybe some of you experts can help me out. It is a very simple question, and to my surprise my research so far has not answered it conclusively.
I have a finite dataset of around 1300 observations. I am sure that this data cannot be considered normally distributed.
If I take a moving average with n=30 over this dataset (observations 1 to 30, 2 to 31, 3 to 32 and so forth), can I consider the distribution of the averages approximately normal? My understanding of the central limit theorem says yes, but I would like confirmation from someone knowledgeable, since I am not a statistician.
Thanks for your replies.
| Russell... |
Posted: Tue May 06, 2008 6:28 am |
I'll give this a shot, although there are people here
more expert on the subject than I.
The CLT requires some limitations on the distribution
from which your 1300 observations are drawn (IIRC, that
it has finite mean and variance), but assuming that is
true, you still have to consider a couple of other
things. The first is whether the original observations
are independent of each other. The CLT also requires
that. You can test it by calculating the autocorrelation
of your data.
The second is that the values from a running mean will
not be independent of each other. If the original
observations satisfy the requirements of the CLT, you
could average 1 to 30, 31 to 60, etc., and the resulting
set of values should be reasonably close to normally
distributed. But my guess is that the values from a
running mean would actually be grouped more tightly
than a normal distribution, or maybe it would be a
normal distribution with a smaller variance than one
might expect based just on the variance of the original
data and the number of values. If I've misled you,
maybe someone will correct me or add some additional
information to help you.
Cheers,
Russell
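A minimal sketch of the two checks mentioned in the post above: the lag-1 autocorrelation of the raw data, and the dependence between successive running means compared with non-overlapping block means. The data are synthetic (independent exponential draws), not the poster's readings, and the function names are only illustrative.

```python
import random

def autocorr(xs, lag=1):
    """Sample autocorrelation of xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    denom = sum((x - mean) ** 2 for x in xs)
    num = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return num / denom

def running_means(xs, n):
    """Overlapping window means: 1..n, 2..n+1, and so on."""
    return [sum(xs[i:i + n]) / n for i in range(len(xs) - n + 1)]

def block_means(xs, n):
    """Non-overlapping window means: 1..n, n+1..2n, and so on."""
    return [sum(xs[i:i + n]) / n for i in range(0, len(xs) - n + 1, n)]

random.seed(0)
# Assumption: synthetic, independent, clearly non-normal data for illustration.
data = [random.expovariate(1.0) for _ in range(1300)]

r_raw = autocorr(data)                          # near 0 for independent data
r_running = autocorr(running_means(data, 30))   # high: neighbours share 29 of 30 values
r_block = autocorr(block_means(data, 30))       # few block means, so a noisy estimate
print(round(r_raw, 3), round(r_running, 3), round(r_block, 3))
```

On real readings, a lag-1 autocorrelation of the raw data that is clearly away from zero would suggest the independence requirement is not met.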
| z... |
Posted: Tue May 06, 2008 7:00 am |
Are the observations random? The CLT assumes you draw random samples; if
the order is random, that would fit. But if they're a time series or
something, not so much.
| Russell... |
Posted: Tue May 06, 2008 7:33 am |
Another thing: why do you care about getting
something that is normally distributed? In my
experience, running means are usually used to
filter out high-frequency "noise" (realizing
that one man's noise may be another's signal).
Cheers,
Russell
| Gilberto Reis Filho... |
Posted: Tue May 06, 2008 8:12 am |
Thank you all for the replies so far. I'll try to elaborate a little more and answer all your questions.
The dataset consists of my own blood glucose readings. They definitely form a time series and might not be truly independent. On average there are around 3 observations per day, but that varies: sometimes more, sometimes less.
I want to get something close to a normal distribution in order to calculate probabilities, for example the probability of a reading below 90 or above 180, with some confidence.
Another approach I intend to take after sorting this out is computing the means for 1 week, 2 weeks, 3 weeks and 4 weeks and drawing conclusions from them, such as 'there is an x% probability that the weekly mean is below 180'. The reason is that this is easier for other people to understand.
This 'week-based analysis' is proving very difficult for me because in this case n varies: there are weeks when I have only 6 observations and weeks when I have 26. I believe that in this case I cannot use the CLT, and I am a bit clueless as to how to proceed.
The dataset and some analysis I made are available here if anyone is interested (refer to sheets 'Glicemia' and 'Médias'):
http://spreadsheets.google.com/pub?key=paFaAniurNBdgSK9AYzgAMw
Thanks again for all the help.
Cheers.
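One way to picture the week-based analysis described above is to group readings by calendar week and keep the count alongside each mean, so that weeks with few observations stay visible. The sketch below does that and reports the observed share of weeks whose mean falls below a threshold. The readings are made up, the function names are hypothetical, and the share is a plain empirical proportion rather than a CLT-based probability.

```python
from collections import defaultdict
from datetime import date

def weekly_summary(readings):
    """Group (date, value) pairs by ISO week; return {(year, week): (n, mean)}."""
    groups = defaultdict(list)
    for day, value in readings:
        iso = day.isocalendar()
        groups[(iso[0], iso[1])].append(value)
    return {wk: (len(v), sum(v) / len(v)) for wk, v in sorted(groups.items())}

def fraction_below(summary, threshold):
    """Empirical share of weeks whose mean is below the threshold."""
    means = [mean for _, mean in summary.values()]
    return sum(m < threshold for m in means) / len(means)

# Made-up readings for illustration only; n differs from week to week.
readings = [
    (date(2008, 4, 21), 150), (date(2008, 4, 23), 170), (date(2008, 4, 25), 160),
    (date(2008, 4, 28), 190), (date(2008, 4, 30), 200),
    (date(2008, 5, 5), 120), (date(2008, 5, 6), 140),
    (date(2008, 5, 7), 130), (date(2008, 5, 9), 150),
]

summary = weekly_summary(readings)
for week, (n, mean) in summary.items():
    print(week, "n =", n, "mean =", round(mean, 1))
share = fraction_below(summary, 180)
print("share of weeks with mean below 180:", round(share, 3))
```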
| Russell... |
Posted: Tue May 06, 2008 8:43 am |
OK, one approach might be to fit to the data a
distribution that such data might be expected to follow
and calculate the probabilities from that. Another
approach could be to transform the data to an
approximately normal distribution. Google "Box-Cox
transform" for one approach to doing the latter.
Also, sometimes just taking the square root or log of
the data will make it approximately normal. Some
people (who I must admit are more highly trained
than I) argue against transforming the data, but it
is done (rightly or wrongly) in practice fairly
frequently.
Cheers,
Russell
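A rough sketch of trying a few transforms, as suggested above, and checking how symmetric the result is. Sample skewness is used here as a crude yardstick (about 0 for symmetric data); that choice, the synthetic lognormal data and the function names are assumptions for illustration, not part of the original advice. The Box-Cox family includes the square root (lambda = 0.5) and the log (lambda = 0) up to a shift and scale.

```python
import math
import random

def skewness(xs):
    """Sample skewness (third standardized moment); about 0 for symmetric data."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def box_cox(x, lam):
    """Box-Cox transform of a positive value; lam = 0 gives the natural log."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive data")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam

random.seed(1)
# Assumption: synthetic right-skewed positive data standing in for real readings.
data = [random.lognormvariate(5.0, 0.4) for _ in range(1300)]

results = {}
for name, lam in [("raw", 1.0), ("square root", 0.5), ("log", 0.0)]:
    results[name] = skewness([box_cox(x, lam) for x in data])
    print(f"{name:12s} skewness = {results[name]:+.2f}")
```

A histogram or normal probability plot of the transformed values is a more informative check than a single number.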
| Gilberto Reis Filho... |
Posted: Tue May 06, 2008 9:22 am |
Let me see if I follow you.
Fitting the data might not be so easy because I do not know whether it follows any known distribution. Besides, I am not sure I have the knowledge to do that...
Transforming may be an option. It seems easier. You mean take the original data (which we know is not normally distributed) and calculate the log or square root, which should give me something approximately normal, right? Then I assume I just need to reverse the transformation to get readable data again. Sounds interesting. Is there any way to compute the expected error in the estimate in this case?
Thanks.
| Gilberto Reis Filho... |
Posted: Tue May 06, 2008 10:09 am |
I found a so-called Fisher transform that, from my understanding, is supposed to transform non-normal data to normal. Is anyone aware whether this is a good route to 'normalize' a dataset?
| Russell... |
Posted: Tue May 06, 2008 10:10 am |
On May 6, 3:22 pm, Gilberto Reis Filho <bon... at (no spam) gmail.com> wrote:
Quote: Let me see if I follow you.
Fitting the data might not be so easy because I do not know whether it follows any known distribution. Besides, I am not sure I have the knowledge to do that...
True, it takes a bit more knowledge and the right
software to do that.
Quote: Transforming may be an option. It seems easier. You mean take the original data (which we know is not normally distributed) and calculate the log or square root, which should give me something approximately normal, right? Then I assume I just need to reverse the transformation to get readable data again. Sounds interesting. Is there any way to compute the expected error in the estimate in this case?
Thanks.
It depends on the data, but sometimes something simple
like a square root, or square, or log of each data
value will produce a distribution that is close
enough to normal for practical purposes. You have to
try a few things and see how they work. You treat the
resulting transformed distribution just as you would
if the data were normal in terms of calculating
probabilities, etc. For instance, if taking the log
gives you a distribution that looks normal, you can
use the usual probability tables to calculate the
probability of exceeding log(X) (i.e. in terms of the
transformed variable). And you're correct, you have
to do the inverse transform when you want to talk in
terms of the original quantity.
Cheers,
Russell
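The log example above can be sketched as follows, assuming the log of the data really does look normal. Python's `statistics.NormalDist` stands in for the probability tables, and `math.exp` is the inverse transform back to the original units. The data are synthetic and the helper names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def fit_log_normal(data):
    """Fit a normal distribution to log(data); assumes all values are positive."""
    logs = [math.log(x) for x in data]
    return NormalDist(mean(logs), stdev(logs))

def prob_above(data, threshold):
    """P(X > threshold), computed on the log scale."""
    return 1.0 - fit_log_normal(data).cdf(math.log(threshold))

def quantile(data, p):
    """p-th quantile in original units: normal quantile on the log scale, then exp."""
    return math.exp(fit_log_normal(data).inv_cdf(p))

random.seed(2)
# Assumption: synthetic positive, right-skewed values, not real glucose readings.
data = [random.lognormvariate(4.8, 0.3) for _ in range(1300)]

p_model = prob_above(data, 180)
p_empirical = sum(x > 180 for x in data) / len(data)
median_est = quantile(data, 0.5)
print(round(p_model, 3), round(p_empirical, 3), round(median_est, 1))
```

Comparing the model probability with the plain observed fraction, as in the last lines, is a quick sanity check on whether the transform is adequate.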
| ... |
Posted: Tue May 06, 2008 11:07 am |
Probably, although if the original data is sufficiently far from
normal the approximation won't be a good one.
Two caveats:
(1) This relies on the original observations not being too serially
correlated.
(2) The moving averages may be approximately normal, but they
certainly won't be independent of one another.
-Dick Startz
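To put a number on caveat (2), here is a short derivation under the idealized assumption that the original observations X_i are independent with common variance sigma^2 (which caveat (1) says may not hold for real data). M_t denotes the moving average over a window of n values starting at observation t.

```latex
M_t = \frac{1}{n}\sum_{i=t}^{t+n-1} X_i,
\qquad
\operatorname{Var}(M_t) = \frac{\sigma^2}{n}

% Two windows k steps apart share n - k observations when 0 <= k < n:
\operatorname{Cov}(M_t, M_{t+k}) = \frac{(n-k)\,\sigma^2}{n^2},
\qquad
\operatorname{Corr}(M_t, M_{t+k}) = 1 - \frac{k}{n}
\quad (0 \le k < n)

% For n = 30 and k = 1:
\operatorname{Corr}(M_t, M_{t+1}) = \frac{29}{30} \approx 0.967
```

Windows that are n or more steps apart share no observations, so under the same assumption their averages are uncorrelated.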
| Art Kendall... |
Posted: Tue May 06, 2008 2:27 pm |
This is strictly a lay opinion in terms of the content. I suggest you
Google glucose control, restricting your search to the .gov domain.
Do you have additional information on whether the readings are
preprandial or postprandial (before/after meals)? You definitely do not
want to mix those unless you are only worried about going hypoglycemic
(low).
You might do something similar to statistical process control (SPC)
charts, with the exception that the limits for "out of control" are
usually prespecified by your physician. In the _US_, something like 70
for a low limit for both pre- and postprandial. Preprandial is often
called high if it is over 120, and postprandial is called high if it is
over 140. Other places use a different scale of measurement, but the
display ideas should apply.
In process control the limits are usually set in terms of the SD or SE,
but that should be unnecessary with blood glucose, although the
magnitude of swings seems to be an important concept.
Some glucometers (e.g., from OneTouch) record readings and have PC
software set up to visualize the results. You might check whether their
webpage still has examples of their displays so you can do something
similar.
Some practitioners consider HbA1c readings the way to get a sort of
long-term view of how much one is hyperglycemic. I MAY BE WRONG, but I
don't think it measures the variability.
Art
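A small sketch of the control-chart idea with prespecified limits, using the "something like" figures quoted in the post above. The unit is assumed to be mg/dL (the post does not state it), the constant and function names are hypothetical, and the limits are placeholders for whatever a physician specifies.

```python
# Illustrative limits taken from the post; units assumed to be mg/dL.
LOW_LIMIT = 70
HIGH_LIMITS = {"pre": 120, "post": 140}   # preprandial / postprandial

def classify(value, timing):
    """Label one reading as 'low', 'high' or 'in range' for its meal timing."""
    if timing not in HIGH_LIMITS:
        raise ValueError("timing must be 'pre' or 'post'")
    if value < LOW_LIMIT:
        return "low"
    if value > HIGH_LIMITS[timing]:
        return "high"
    return "in range"

# Made-up readings: (value, timing).
readings = [(65, "pre"), (110, "pre"), (130, "pre"), (130, "post"), (155, "post")]
labels = [classify(value, timing) for value, timing in readings]
print(labels)
```

Keeping the pre- and postprandial readings in separate series, as the post recommends, amounts to plotting each timing against its own upper limit.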
All times are GMT - 5 Hours