Science Forum Index  »  Statistics - Math Forum  »  Validating hypothesis with data set (problematic...
Edward Jensen...
Posted: Tue Jul 08, 2008 5:21 pm
Guest
Hi,

I have read a bit from the archives of the newsgroup about simplifying a
regression model by stepwise elimination of explanatory variables. I reckon
that the common opinion is that this is a problematic practice that should
be avoided as much as possible.

I still don't quite understand the problem. The following quote is from
http://www.tufts.edu/~gdallal/simplify.htm
"With rare exception, a hypothesis cannot be validated in the dataset that
generated it"

What does this mean? Is the hypothesis, e.g., that only three of eight
explanatory variables are needed in the regression model, and that this
cannot be validated with the same data set that was used to build the model?
If that is the case, why not? Isn't that the same thing we do when we test
for the significance of a single or multiple coefficients? Is it the same
problem that underlies the idea that a null hypothesis cannot be accepted,
only rejected?

The reason I am reading about this is that I have a large data set (240
measurements with 300 variables), and I want to find the smallest subset
that gives the best classification into two groups. I have tried
implementing a heuristic search (genetic algorithm) for selecting subsets of
these variables, which are fed into a support vector machine for
classification. The program is able to reduce the dimensionality to 4 with
98.75% correct classification in a 10-fold cross-validation.
Interestingly, the variables selected are the same ones that were assumed to
be important based on a physical model of the observed system. Is this
fundamentally a bad approach? I guess it would fall into the class of "data
mining" or exploratory data analysis. I would like to try the same with
logistic regression instead of SVM, but can I trust the results due to the
problems with stepwise model simplification?

I would very much appreciate any comments on this matter, since I don't
quite understand the problem.

Thanks in advance.
RichUlrich...
Posted: Thu Jul 10, 2008 4:58 pm
Guest
Greg gave some pragmatic advice about how to try modeling
with too many variables. I will try to address some of the
other questions. His answers may have implied some of
these comments....

On Wed, 9 Jul 2008 00:21:18 +0200, "Edward Jensen"
<edward at (no spam) jensen.invalid> wrote:

Quote:
Hi,

I have read a bit from the archives of the newsgroup about simplifying a
regression model by stepwise elimination of explanatory variables. I reckon
that the common opinion is that this is a problematic practice that should
be avoided as much as possible.

I still don't quite understand the problem. The following quote is from
http://www.tufts.edu/~gdallal/simplify.htm
"With rare exception, a hypothesis cannot be validated in the dataset that
generated it"

A VERY large sample might allow some hypothesis to be generated
from a tiny sub-sample; with validation to follow. A HUGE
effect, with consistency and logic on its side, might be
considered self-validating. Those are the rare exceptions
that I see, right off.

Quote:

What does this mean? Is the hypothesis, e.g., that only three of eight
explanatory variables are needed in the regression model, and that this
cannot be validated with the same data set that was used to build the model?

Later on, you mention that you have 300 variables, with a smaller
number of them assumed to have direct physical relations to the
outcome. In my universe of clinical research, we often have something
similar -- 400 items collected. Those items do not comprise 400
hypotheses, though someone might want to explore them that way.

The 400 items might be reduced to 20 rating-scales, and those scales
or a few items are further reduced to sets. There are one or two
"variables" that represent the primary hypotheses that justify the
study. There might be 5 of the 20 scales that measure important,
secondary hypotheses that may have been suggested by the
literature or prior research. There are 15 scales whose analysis is
considered "frankly exploratory" -- Ideas may be suggested and
supported, but nothing can be "validated" in terms of their primary
relation to the outcome, or perhaps in explaining each other.

The main hypotheses are tested at (say) a 5% overall alpha level,
in order to validate them. The secondary hypotheses might receive
less stringent testing; there are a lot of circumstances that may
vary. None of the "exploratory hypotheses" will be considered
to be validated -- That probably holds, even if there is some item
that has *enormous* predictive power, compared to everything
else. (IF it is so good, why did no one notice it before?)


Quote:
If that is the case, why not? Isn't that the same thing we do when we test
for the significance of a single or multiple coefficients? Is it the same
problem that underlies the idea that a null hypothesis cannot be accepted,
only rejected?

You can correct your testing accurately for having 2 or 5, or maybe
15 tests. You are in much worse shape when you try to correct for
100 -- There is the pragmatic difficulty of achieving p-levels less
than .00001, etc., with a moderate-sized sample. Just as important,
computed p-levels become increasingly inaccurate as they become smaller,
whenever the assumptions are not fully met (even slightly).
- For human samples, "independence" is never fully met, which matters
more severely when the "significance" is achieved through the
agency of large Ns combined with tiny differences in means or
with tiny correlations.
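
To put rough numbers on how fast the per-test threshold shrinks, here is a
minimal sketch in Python. It assumes a plain Bonferroni or Sidak correction
and independent tests, which is a simplification; the function names are
made up for illustration.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test level that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


def sidak_threshold(alpha: float, n_tests: int) -> float:
    """Per-test level under the Sidak correction (exact if tests are independent)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


def familywise_error(per_test_alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive in n independent tests of true nulls."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


if __name__ == "__main__":
    # 300 matches the number of variables mentioned in the question.
    for m in (2, 5, 15, 100, 300):
        print(f"{m:>3} tests: Bonferroni {bonferroni_threshold(0.05, m):.6f}, "
              f"Sidak {sidak_threshold(0.05, m):.6f}, "
              f"uncorrected error {familywise_error(0.05, m):.3f}")
```

With 100 tests the Bonferroni level is 0.05/100 = 0.0005, and leaving the
tests uncorrected makes at least one false positive close to certain.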


Quote:

The reason I am reading about this is that I have a large data set (240
measurements with 300 variables), and I want to find the smallest subset
that gives the best classification into two groups. I have tried
implementing a heuristic search (genetic algorithm) for selecting subsets of
these variables, which are fed into a support vector machine for
classification. The program is able to reduce the dimensionality to 4 with
98.75% correct classification in a 10-fold cross-validation.

You are misclassifying 3 cases out of 240 (98.75% correct). Very good,
if there are 120 in each of two classes. Very mediocre, for the "best 4
of 300" variables, if the two outcomes divide into 220+20 and that is the
step-wise result (always guessing the larger class would already be
91.7% correct). Using a "10-fold cross-validation" would make it
a little more substantial, but that would be hard to judge. (You
might try a Monte-Carlo approach -- assign the class labels to cases at
random, keeping the same counts in each class, and see what success
rate the same procedure achieves; repeat a few hundred times.)

The statement that "3 errors" is a result of 10-fold cross-validation
leaves me some room for wondering -- Was this search performed
just *once*, or were there several trials, resulting eventually in this
final, best fit with 3 errors? -- If you are repeating the whole
10-fold cross-validation, then *that* repetition introduces
multiple testing into the design, at another level.
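
The Monte-Carlo check mentioned above can be sketched as a label-permutation
test. This is only a minimal sketch in Python, not the poster's program:
`score_fn` is a hypothetical stand-in for the whole pipeline (variable
search, classifier, 10-fold cross-validation), and the one-variable toy
scorer is an assumption that exists only to make the example run.

```python
import random
from typing import Callable, List, Sequence, Tuple


def permutation_p_value(labels: Sequence[int],
                        score_fn: Callable[[Sequence[int]], float],
                        n_rounds: int = 200,
                        seed: int = 0) -> Tuple[float, float]:
    """Return (observed score, share of shuffled runs scoring at least as well)."""
    rng = random.Random(seed)
    observed = score_fn(labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_rounds):
        # Shuffling keeps the class counts but breaks any link to the predictors.
        rng.shuffle(shuffled)
        if score_fn(shuffled) >= observed:
            at_least_as_good += 1
    # Add-one form so the estimated p-value is never exactly zero.
    return observed, (at_least_as_good + 1) / (n_rounds + 1)


def make_toy_score(x: List[float]) -> Callable[[Sequence[int]], float]:
    """Toy scorer: accuracy of a nearest-class-mean rule on a single variable."""
    def score(labels: Sequence[int]) -> float:
        g0 = [v for v, y in zip(x, labels) if y == 0]
        g1 = [v for v, y in zip(x, labels) if y == 1]
        m0, m1 = sum(g0) / len(g0), sum(g1) / len(g1)
        hits = sum((abs(v - m1) < abs(v - m0)) == (y == 1)
                   for v, y in zip(x, labels))
        return hits / len(labels)
    return score


if __name__ == "__main__":
    rng = random.Random(1)
    labels = [0] * 120 + [1] * 120
    # Simulated variable whose mean really does differ between the classes.
    x = [rng.gauss(2.0 * y, 1.0) for y in labels]
    observed, p = permutation_p_value(labels, make_toy_score(x))
    print(f"observed accuracy {observed:.3f}, permutation p-value {p:.4f}")
```

For the real problem, `score_fn` would have to rerun the complete search on
each shuffled labelling. Scoring only the final 4 variables would hide the
effect of having picked them from 300.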


Quote:
Interestingly, the variables selected are the same ones that were assumed to
be important based on a physical model of the observed system. Is this
fundamentally a bad approach? I guess it would fall into the class of "data
mining" or exploratory data analysis. I would like to try the same with
logistic regression instead of SVM, but can I trust the results due to the
problems with stepwise model simplification?

Model 1: Only the primary variables.
Model 2: Primary variables plus others with physical roles.
Model 3: Only the *rest*.
Model 4: Perhaps, (3) with (1).

Then, consider: How much is added? Are the relations logical, and are
predictions in the same direction as before? If the predictions are not
in the same direction, is there logic to *that*?

The only protection against the errors introduced by small effects
chosen by step-wise selection is cross-validation, the more
extensive, the better. There is one article in the literature that
suggests that for large numbers of predictors, one should start
with only *tiny* sub-samples for finding potential predictors.
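
Cross-validation only gives that protection if the variable selection is
repeated inside every fold. If the variables are picked once from all cases
and only the classifier is cross-validated, the held-out cases have already
influenced the choice. Below is a minimal sketch in Python on pure noise.
It assumes a nearest-centroid classifier and a crude mean-difference filter
in place of the SVM and genetic search, and all names are illustrative.

```python
import random
from typing import List, Sequence, Tuple


def top_features(X: List[List[float]], y: Sequence[int],
                 rows: Sequence[int], k: int) -> List[int]:
    """Indices of the k variables whose class means differ most on `rows`."""
    r0 = [i for i in rows if y[i] == 0]
    r1 = [i for i in rows if y[i] == 1]

    def gap(j: int) -> float:
        return abs(sum(X[i][j] for i in r1) / len(r1)
                   - sum(X[i][j] for i in r0) / len(r0))

    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def centroid_hits(X: List[List[float]], y: Sequence[int], train: Sequence[int],
                  test: Sequence[int], feats: Sequence[int]) -> int:
    """Fit class centroids on `train`; count correct predictions on `test`."""
    cents = {}
    for c in (0, 1):
        members = [i for i in train if y[i] == c]
        cents[c] = [sum(X[i][j] for i in members) / len(members) for j in feats]
    hits = 0
    for i in test:
        d = {c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, cents[c]))
             for c in (0, 1)}
        hits += (0 if d[0] <= d[1] else 1) == y[i]
    return hits


def compare_cv(n: int = 240, p: int = 300, k: int = 4,
               folds: int = 10, seed: int = 0) -> Tuple[float, float]:
    """Return (accuracy with selection outside CV, accuracy with selection inside)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]            # labels are unrelated to X
    order = list(range(n))
    rng.shuffle(order)
    global_feats = top_features(X, y, order, k)   # selection has seen every case
    outside = inside = 0
    for f in range(folds):
        test = order[f::folds]
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        outside += centroid_hits(X, y, train, test, global_feats)
        inside += centroid_hits(X, y, train, test, top_features(X, y, train, k))
    return outside / n, inside / n


if __name__ == "__main__":
    outside, inside = compare_cv()
    print(f"selection outside CV: {outside:.3f}   selection inside CV: {inside:.3f}")
```

Since the data carry no signal, the honest figure is chance (50%). The
inside-the-fold estimate should scatter around that, while selecting outside
the folds tends to report something better than chance for the "best 4 of
300" noise variables.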

--

Rich Ulrich
 