Data Correlation Query
Raj
Posted: Wed Jan 10, 2007 7:08 am
Guest
Hi,
I am a postgraduate student doing a project in the area of soft
computing. The problem I am trying to solve is that of predicting
operational risk. I have a small dataset of 25 data points that includes
five input variables - System downtime, Number of employees, Data
Quality, Number of transactions, Number of losses - and one output
variable - the loss amount. Because I have a very small dataset, I went
through the following process to generate additional data points:
step 1: Select one variable at a time and fit various distributions
over it.
step 2: Based upon the goodness-of-fit tests, select the best
distribution for each variable separately.
step 3: Generate random numbers for each variable from the selected
distribution separately.
step 4: Tabulate the values.
My question is: how do we ensure that the correlation among the variables
that was there in the original sample data is present in the randomly
generated data as well? Because the random numbers were generated
separately for each variable, I could not find any of the correlation
among the variables that was present in the original sample data.
Please do give me some suggestions; I will be waiting for them.

regards,
Raj kiran
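
One possible way to approach the question above is rank reordering, in the spirit of the Iman-Conover method: keep the independently generated values from steps 1-3, but rearrange each column so that its ranks follow a set of correlated reference scores. The sketch below is only an illustration under stated assumptions. The exponential and log-normal marginals, the target correlation of 0.7, and all function names are hypothetical and not taken from the post; the induced Pearson correlation will generally differ somewhat from the target when the marginals are skewed.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank of each value (0 = smallest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order):
        out[i] = rank
    return out


def reorder_to_match(column, scores):
    """Rearrange `column` so that its ranks follow the ranks of `scores`."""
    sorted_col = sorted(column)
    return [sorted_col[r] for r in ranks(scores)]


def correlated_pair(x_draws, y_draws, rho, rng):
    """Pair up independent draws using correlated normal reference scores."""
    n = len(x_draws)
    z1 = [rng.gauss(0, 1) for _ in range(n)]
    z2 = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in z1]
    return reorder_to_match(x_draws, z1), reorder_to_match(y_draws, z2)


rng = random.Random(0)
n = 2000
# Steps 1-3 of the post: independent draws from (assumed) fitted marginals.
downtime = [rng.expovariate(1 / 4.0) for _ in range(n)]    # assumed exponential
losses = [rng.lognormvariate(1.0, 0.5) for _ in range(n)]  # assumed log-normal

before = pearson(downtime, losses)
x, y = correlated_pair(downtime, losses, rho=0.7, rng=rng)
after = pearson(x, y)
print(f"correlation before: {before:.2f}, after: {after:.2f}")
```

The reordering only permutes values, so each marginal distribution stays exactly as generated. Whether the resulting data are useful for anything is a separate question, which the replies in this thread take up.
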
Richard Ulrich
Posted: Thu Jan 11, 2007 12:05 am
Guest
On 10 Jan 2007 03:08:04 -0800, "Raj" <nrajkiran@gmail.com> wrote:

Quote:
Hi,
I am a postgraduate student doing a project in the area of soft
computing. The problem I am trying to solve is that of predicting
operational risk. I have a small dataset of 25 data points that includes
five input variables - System downtime, Number of employees, Data
Quality, Number of transactions, Number of losses - and one output
variable - the loss amount. Because I have a very small dataset, I went
through the following process to generate additional data points:

[snip, process of faking up new data]

Please say again,
*Why* are you doing this?

When I have heard of something like this before, it has been
a mistake. But I am willing to keep my mind open.

You are not going to gain any 'information' or improve any
statistical testing by pretending to increase the degrees of freedom.

Is there some published rationale for doing this?

--
Rich Ulrich, wpilib@pitt.edu
http://www.pitt.edu/~wpilib/index.html
Raj
Posted: Thu Jan 11, 2007 8:19 am
Guest
Dear Sir,
Thank you very much for the reply. Let me rephrase the question: I
would like to know whether there are any methods that can generate
additional data from the original data and still maintain the
correlation that was present among the different variables of the data.
I need more data so that the techniques I use in the prediction can be
trained better.
regards,
Raj kiran

Greg Heath
Posted: Thu Jan 11, 2007 9:29 am
Guest

WHY ARE YOU WASTING EVERYONE'S TIME?

Your question was answered on 12 Dec 2006 in sci.stat.math
AND earlier today in comp.ai.neural-nets!

Don't be so inconsiderate as to submit the same post,
separately, to different newsgroups.

Greg
John Uebersax
Posted: Fri Jan 12, 2007 4:51 pm
Guest
Richard Ulrich wrote:

Quote:
When I have heard of something like this before, it has been
a mistake. But I am willing to keep my mind open.

I could be wrong, but it seems like he just wants to generate a large
number of data points to test/tweak a prediction model. Then, I
guess, the idea is to re-apply the developed model back to the original
25 observations. Something like that. Doesn't sound all that
far-fetched.

John Uebersax PhD
Old Mac User
Posted: Mon Jan 15, 2007 8:25 pm
Guest
Raj... This is not a good idea. You cannot generate valid data from
actual data. You'll gain nothing by this process and if you ever have
to explain what you have done, the audience will laugh. OMU
John Uebersax
Posted: Tue Jan 16, 2007 3:34 am
Guest
Again, here's a possible reason why people might legitimately generate
data as Raj describes: suppose you have several different prediction
algorithms (regression, nearest-neighbor, etc.), and you want, for
future purposes, to see which method is best in your intended range of
applications. And suppose you have some pilot data.

It's not unusual to generate random data with a large N to test/compare
algorithms. Well, if one already has pilot data, then why not use that
information (means, covariances) so that the random data has a
realistic covariance structure?
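
A minimal sketch of that idea, assuming a multivariate normal model is acceptable for the purpose: estimate the mean vector and covariance matrix from the pilot data, then draw a large sample through a Cholesky factor of the covariance. The pilot rows below are simulated stand-ins (not real data), and the helper names are made up for illustration.

```python
import math
import random


def mean_vector(rows):
    """Column means of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1)."""
    n = len(rows)
    mu = mean_vector(rows)
    k = len(mu)
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
             for j in range(k)] for i in range(k)]


def cholesky(a):
    """Lower-triangular L with L * L^T == a (a must be positive definite)."""
    k = len(a)
    low = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(low[i][m] * low[j][m] for m in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_mvn(mu, cov, n, rng):
    """Draw n rows from a multivariate normal with mean mu, covariance cov."""
    low = cholesky(cov)
    k = len(mu)
    rows = []
    for _ in range(n):
        z = [rng.gauss(0, 1) for _ in range(k)]
        rows.append([mu[i] + sum(low[i][m] * z[m] for m in range(i + 1))
                     for i in range(k)])
    return rows


rng = random.Random(1)
# Stand-in for 25 pilot observations of three variables (simulated, not real).
pilot = []
for _ in range(25):
    a = rng.gauss(10, 2)
    b = 0.8 * a + rng.gauss(0, 1)
    c = rng.gauss(5, 1) - 0.5 * b
    pilot.append([a, b, c])

mu = mean_vector(pilot)
cov = covariance_matrix(pilot)
synthetic = sample_mvn(mu, cov, 5000, rng)
cov_synth = covariance_matrix(synthetic)  # should be close to cov
```

The large sample reproduces the pilot covariance, including whatever sampling error the 25 pilot rows carry, so it is suited to comparing algorithms and not to drawing conclusions about the real process.
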

Now, I can't say for sure this is what Raj wants to do--but it is an
explanation compatible with his post, and I think the principle of
"giving someone the benefit of the doubt" is a generally good one.

John Uebersax PhD


Old Mac User
Posted: Tue Jan 16, 2007 10:09 am
Guest
John U. :

In 1982 I "inherited" a group of five people during a corporate
reorganization. These people were another "statistics group" from a
different corporate division, located in a nearby building. They built
models, made predictions, etc. for a critical business area. None
were degreed statisticians. After they settled in their new quarters
down the hall from me I started looking at what they were doing. In
presentations to me they showed large amounts of data and claimed to be
swamped in managing all that data. I found this strange, because the
type of experimental data they were using was expensive, difficult to
acquire, and not all that hard to manage. They also spoke in strange
tongues. One thing they mentioned again and again was "building models
from derived data". Oooops!!! What's "derived data"? I had to wait
until their supervisor was out of town (he reported to me) to squeeze a
confession out of them. So just what is derived data? I'll bet you can
guess. It was just what the original poster to this thread is
proposing. That is, use existing "real" data to build models... make
predictions to "fill in the gaps"... declare those predictions to be
"derived" data... mingle the derived data with the real data... build
new models and make predictions for clients who are paying the bills
for this work. Yes, that's exactly what they were doing. The funny
thing is that a small empire had been built on this scheme. This is
dishonest, and should not have happened. On taking this to my
supervisor we agreed to have the mastermind of this racket (the one who
reported directly to me) and one of his cohorts fired and to dismantle
that group and scatter the remains around where they could do no more
harm.

That's not the only time I found a scam going on with data and
statistical methods. Today, with statistical software and eyecandy
PowerPoints, statistical scams abound. If you'll pay attention you can
see them at Google Groups every day.

This old engineer was taught early on "don't trust anything". Inspect,
ask questions, lay your hands on it, verify, but trust nothing. Sorry,
John, but I don't "think the principle of "giving someone the benefit
of the doubt" is a generally good one." Sometimes, when the stakes are
low and a good reputation has been established... yes, of course. But
not when meaningful work (or work that is intended to be meaningful) is
being funded and being paid for. It's stuff like this (this = what the
OP intends to do) that gives science a bad name.

This "don't trust anything" policy... which is common among experienced
engineers... saves money, time, and lives. I'm seldom surprised when I
poke around in "statistical wonders" that have been created by amateurs
only to find rot and mold. See some of Afonso's schemes and rants to
gain further appreciation for this. OMU




Richard Ulrich
Posted: Wed Jan 17, 2007 12:09 am
Guest
On 15 Jan 2007 23:34:11 -0800, "John Uebersax" <jsuebersax@gmail.com>
wrote:

Quote:
Again, here's a possible reason why people might legitimately generate
data as Raj describes: suppose you have several different prediction
algorithms (regression, nearest-neighbor, etc.), and you want, for
future purposes, to see which method is best in your intended range of
applications. And suppose you have some pilot data.

It's not unusual to generate random data with a large N to test/compare
algorithms. Well, if one already has pilot data, then why not use that
information (means, covariances) so that the random data has a
realistic covariance structure?

In a later post, OMU goes further than what I posted earlier, about
not seeing a justification for matching the data "randomly". I
appreciate his example, too, about the hazard.

Here is a further comment on randomization: You should do that
with *idealized* data, which is extreme or not-extreme in
possible ways; it does not match your sample. How does the
method work with log-normal data? normal data? uniform? etc.
If you match the sample *closely*, while expanding the N, you
are subject to all the distortion of having a small d.f., while
your test statistics lie to you about p-values.
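
The p-value problem can be illustrated with a small simulation. This is a sketch under assumptions of my own choosing: two truly uncorrelated normal variables, a real sample of 25, an "expanded" sample of 1000 simulated to match the observed sample correlation, and a Fisher z approximation for the test. None of these specifics come from the thread.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fisher_p_value(r, n):
    """Two-sided p-value for H0: rho = 0, via Fisher's z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def draw_pair(rho, n, rng):
    """n draws of two standard normal variables with correlation rho."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in xs]
    return xs, ys


rng = random.Random(2)
trials, n_real, n_fake = 400, 25, 1000
hits_real = hits_fake = 0
for _ in range(trials):
    # Real sample: the two variables are truly uncorrelated.
    xs, ys = draw_pair(0.0, n_real, rng)
    r_real = pearson(xs, ys)
    hits_real += fisher_p_value(r_real, n_real) < 0.05
    # "Expanded" data: simulated to match the sample correlation.
    fx, fy = draw_pair(r_real, n_fake, rng)
    hits_fake += fisher_p_value(pearson(fx, fy), n_fake) < 0.05

print(f"false positives, real n={n_real}:     {hits_real / trials:.2f}")
print(f"false positives, expanded n={n_fake}: {hits_fake / trials:.2f}")
```

The test on the real samples rejects at roughly the nominal rate, while the test on the expanded data rejects far more often, because the chance correlation in each small sample is treated as if it were established by 1000 observations.
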

Here is a further comment about fiddling with the data on hand:
There is such a thing as "perturbation" analysis, which (say) jiggles
each number in the data, by fractions, to see if the conclusions
are robust. The examples that I have seen did not seem to
generate new information for the analyst who could read diagnostic
information, but could demonstrate leverage, and so on, for the
lay-reader. That was a long time ago, and I suspect that perturbation
analysis (PA) has been replaced by boot-strapping, in the places where
PA may have helped to estimate variances.
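
For reference, a minimal sketch of boot-strapping used to estimate a variance: resample the observed rows with replacement and look at how much the statistic moves. The 25 data pairs here are simulated stand-ins, and the function names and the choice of 1000 resamples are assumptions for illustration only.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_se(xs, ys, stat, n_boot, rng):
    """Standard error of stat(xs, ys), resampling rows with replacement."""
    n = len(xs)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(stat([xs[i] for i in idx], [ys[i] for i in idx]))
    m = sum(values) / n_boot
    return math.sqrt(sum((v - m) ** 2 for v in values) / (n_boot - 1))


rng = random.Random(3)
# Stand-in for 25 observed pairs (simulated, not real data).
xs = [rng.gauss(0, 1) for _ in range(25)]
ys = [0.6 * x + 0.8 * rng.gauss(0, 1) for x in xs]

r_hat = pearson(xs, ys)
se = bootstrap_se(xs, ys, pearson, n_boot=1000, rng=rng)
print(f"r = {r_hat:.2f}, bootstrap standard error = {se:.2f}")
```

Note that the resamples are used only to gauge the uncertainty of an estimate from the original 25 rows; they are not pooled into a larger dataset.
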


[snip, rest]

--
Rich Ulrich, wpilib@pitt.edu
http://www.pitt.edu/~wpilib/index.html
 
All times are GMT - 5 Hours