
ols regressions...

joseph Frank...
Posted: Wed Oct 21, 2009 10:38 am
Guest
I have a model, Y = a + b1X1 + b2X2 + b3X3, and none of the parameters is significant. A colleague told me to try removing outliers: the top and bottom 2.5% or 5% of the sample for X1, for instance, or for X2, or for the dependent variable. Does it make any sense to remove the outliers of the X variables?
 
Paul...
Posted: Wed Oct 21, 2009 10:55 am
Guest
On Oct 21, 4:38 pm, joseph Frank <josephFrank1... at (no spam) hotmail.com> wrote:
[quote]I have a model, Y = a + b1X1 + b2X2 + b3X3, and none of the parameters is significant. A colleague told me to try removing outliers: the top and bottom 2.5% or 5% of the sample for X1, for instance, or for X2, or for the dependent variable. Does it make any sense to remove the outliers of the X variables?
[/quote]
Is the analysis-of-variance F statistic (MSR/MSE) significant? If
so, I would first check for multicollinearity (look at the variance
inflation factors).

Eliminating outliers is warranted at times, but do you have outliers?
The top/bottom 2.5% of the observations are not necessarily outliers;
someone has to be up/down there. (If we define the top/bottom 2.5% of
any sample as outliers, and delete them, then recursively we trim the
sample down to an empty set -- which does speed up subsequent
computations.)
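
A rough illustration of that point, assuming NumPy and a synthetic, clean normal sample (the seed, sample size, and the helper name trim_tails are arbitrary choices, not anything from the original data): percentile trimming removes about 5% of the observations on every pass, whether or not anything is wrong with them.

```python
import numpy as np

def trim_tails(x, pct=2.5):
    """Drop observations below the pct-th or above the (100 - pct)-th percentile."""
    lo, hi = np.percentile(x, [pct, 100 - pct])
    return x[(x >= lo) & (x <= hi)]

rng = np.random.default_rng(0)      # arbitrary seed, synthetic data
sample = rng.normal(size=1000)      # no outliers by construction

sizes = [sample.size]
for _ in range(5):                  # trim repeatedly
    sample = trim_tails(sample)
    sizes.append(sample.size)

print(sizes)  # the sample shrinks on every pass
```
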
/Paul
 
joseph Frank...
Posted: Wed Oct 21, 2009 10:54 pm
Guest
I have no multicollinearity problem, as the correlation matrix shows low correlations between the variables. I agree that a threshold of 2.5% or 5% would eliminate non-outliers. But what troubled me about his comment is the suggestion to eliminate outliers in X. Is it normal to look at outliers of the independent variables?
 
Bruce Weaver...
Posted: Thu Oct 22, 2009 2:57 am
Guest
On Oct 22, 4:54 am, joseph Frank <josephFrank1... at (no spam) hotmail.com> wrote:
[quote]I have no multicollinearity problem, as the correlation matrix shows low correlations between the variables. I agree that a threshold of 2.5% or 5% would eliminate non-outliers. But what troubled me about his comment is the suggestion to eliminate outliers in X. Is it normal to look at outliers of the independent variables?
[/quote]
Here is a sample data set that Jerry Dallal posted a few years ago
during a discussion of multicollinearity.

X1 X2 X3 Y
18 88 106 13
72 45 117 43
36 63 99 50
75 26 101 77
22 83 105 23
99 71 170 68
69 53 122 6
6 49 55 51
86 99 185 37
85 64 149 10
87 7 94 32
93 32 125 69
44 88 132 4
34 34 68 13
84 28 112 18

Look at the correlation matrix for X1-X3. The values of r range from
-.307 to .657, values that will probably not raise any red flags. Then
regress Y on X1, X2 and X3. Notice that one of the 3 variables is
excluded from the equation, because tolerance = 0 (in every row,
X3 = X1 + X2).

On the flip-side, bivariate correlations can be quite high without
signaling any problematic multicollinearity (e.g., when polynomial
terms are included in the model).

So, you can have complete linear dependence despite the absence of any
alarming bivariate correlations, and you can have no problematic
multicollinearity when one or more of the correlations does look
high. Therefore, using the correlation matrix to diagnose
multicollinearity is not a good idea.
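
A minimal sketch of this check on the data set above, assuming NumPy (the variable names are illustrative). It prints the correlation matrix of X1-X3 and then compares the number of columns in the design matrix with its rank; a rank lower than the column count means an exact linear dependence that the pairwise correlations do not reveal.

```python
import numpy as np

# Data set from the post above (columns: X1, X2, X3, Y).
data = np.array([
    [18, 88, 106, 13], [72, 45, 117, 43], [36, 63, 99, 50],
    [75, 26, 101, 77], [22, 83, 105, 23], [99, 71, 170, 68],
    [69, 53, 122, 6],  [6, 49, 55, 51],   [86, 99, 185, 37],
    [85, 64, 149, 10], [87, 7, 94, 32],   [93, 32, 125, 69],
    [44, 88, 132, 4],  [34, 34, 68, 13],  [84, 28, 112, 18],
], dtype=float)
X = data[:, :3]

# 3x3 correlation matrix of X1..X3.
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 3))

# Design matrix with an intercept column: 4 columns in total.
design = np.column_stack([np.ones(len(X)), X])
rank = np.linalg.matrix_rank(design)
print("columns:", design.shape[1], "rank:", rank)

# The dependence is exact: X3 equals X1 + X2 in every row.
exact = np.array_equal(X[:, 0] + X[:, 1], X[:, 2])
print("X3 == X1 + X2:", exact)
```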

--
Bruce Weaver
bweaver at (no spam) lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/Home
"When all else fails, RTFM."
 
thebluecliffrecord...
Posted: Thu Oct 22, 2009 3:28 am
Guest
Dear All,

Excellent comments on outliers and collinearity by Mr. Frank, Mr.
Kendall, and Mr. Weaver.

[quote]I have a model, Y = a + b1X1 + b2X2 + b3X3, and none of the parameters is significant. A colleague told me to try removing outliers: the top and bottom 2.5% or 5% of the sample for X1, for instance, or for X2, or for the dependent variable. Does it make any sense to remove the outliers of the X variables?
[/quote]
First, outliers need to be analyzed in terms of
1) the predictors (i.e., outliers in X),
2) the responses (i.e., outliers in Y), "AND"
3) the "relationship" between X and Y.

I haven't seen a single analysis that does what your colleague suggests
(i.e., remove 2.5%), and this suggestion ignores some of the three cases
of outliers.

Second (in short), do several PCAs (principal component analyses):
1) PCA on X, and 2) PCA on X "AND" Y combined. Draw scatter
plots of the PC scores to identify outliers. In multivariate quality
control, two control charts are used: 1) control limits on the PCs
(the portion of the data accounted for by the PCs), and 2) control limits
on the portion of the data unaccounted for by the PCs. In addition, PCA
is the recommended way of identifying "collinearity".
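
One way to sketch the collinearity part of this, assuming NumPy and using X1-X3 from the sample data set posted earlier in this thread: PCA on standardized predictors amounts to an eigen-decomposition of their correlation matrix, and an eigenvalue near zero marks a (near-)exact linear dependence. The control-chart limits mentioned above are not implemented here; only the scores are computed.

```python
import numpy as np

# X1, X2, X3 from the sample data set posted earlier in this thread.
X = np.array([
    [18, 88, 106], [72, 45, 117], [36, 63, 99],  [75, 26, 101],
    [22, 83, 105], [99, 71, 170], [69, 53, 122], [6, 49, 55],
    [86, 99, 185], [85, 64, 149], [87, 7, 94],   [93, 32, 125],
    [44, 88, 132], [34, 34, 68],  [84, 28, 112],
], dtype=float)

# PCA on standardized predictors: eigen-decomposition of the correlation matrix.
corr = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order

print("eigenvalues:", np.round(eigvals, 4))
# The eigenvector of the smallest eigenvalue gives the weights of the
# dependence among the standardized variables.
print("weights of smallest component:", np.round(eigvecs[:, 0], 3))

# PC scores, e.g. for scatter plots when screening for outliers.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
scores = Z @ eigvecs
```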

Even before applying PCA, draw a matrix plot of X & Y and check.

Hope this helps.

Sincerely,

Sangdon Lee, Ph.D.,
GM Tech. Center
 
Paul...
Posted: Thu Oct 22, 2009 5:40 am
Guest
On Oct 22, 4:54 am, joseph Frank <josephFrank1... at (no spam) hotmail.com> wrote:
[quote]I have no multicollinearity problem, as the correlation matrix shows low correlations between the variables.
[/quote]
This has been dealt with by other posters, so no need to pile on.

[quote]I agree that a threshold of 2.5% or 5% would eliminate non-outliers. But what troubled me about his comment is the suggestion to eliminate outliers in X. Is it normal to look at outliers of the independent variables?
[/quote]
If they are in fact outliers, yes. Art mentioned that a high
proportion of alleged outliers end up being data entry errors, and I
concur. I read a comment once from an industry consultant that 60% or
so of the time spent on a typical statistical modeling project is
spent cleaning the data. If some of the X values are munged, no good
will come of keeping them in the sample as-is, so you'll want either
to fix them or to zap them.

There's also the possibility that your sample actually mixes
observations from more than one population, in which case a single
model may not fit the entire sample. For example, if I'm trying to
predict the frequency with which people eat out at low- to mid-price
restaurants, using disposable income as a predictor, no good will come
of having Bill and Melinda Gates in the sample, irrespective of
whether their observation is recorded correctly. So I need to winnow
my sample down to people with incomes in a certain range, and restrict
my model to applying to that domain.

For completeness, I'll also note that an outlier (whether it's a
recording error or an observation that really belongs to a different
population) need not be in the sample tails of either X or Y.
Consider a sample of income levels (X) and life insurance policies (Y)
(omitting the Gates this time). Typically people with higher income
will carry higher insurance, but I might have an observation in the
sample with a relatively low income (but not in the bottom 5%) and a
relatively high insurance coverage (but not in the top 5%). It's not
an outlier on either variable individually, but the combination is an
oddity.
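
A hedged sketch of that situation, assuming NumPy and synthetic standardized data (the seed, the 0.2 noise scale, and the planted case at (-1, 1) are all made up for illustration). The planted case sits inside the 5%-95% range of each variable, yet its squared Mahalanobis distance from the joint centroid is the largest in the sample. Mahalanobis distance is one common choice here, not the only one.

```python
import numpy as np

rng = np.random.default_rng(1)                    # arbitrary seed, synthetic data
n = 200
income = rng.normal(size=n)                       # standardized income (X)
insurance = income + 0.2 * rng.normal(size=n)     # coverage tracks income (Y)

# Append one odd case: lowish income, highish coverage.
income = np.append(income, -1.0)
insurance = np.append(insurance, 1.0)
odd = n                                           # index of the appended case

def inside_tails(v, i, pct=5):
    """True if v[i] lies strictly between the pct-th and (100 - pct)-th percentiles."""
    lo, hi = np.percentile(v, [pct, 100 - pct])
    return lo < v[i] < hi

# Squared Mahalanobis distance of every case from the joint centroid.
XY = np.column_stack([income, insurance])
centered = XY - XY.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(XY, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

print(inside_tails(income, odd), inside_tails(insurance, odd))
print("most unusual case:", int(np.argmax(d2)), "planted case:", odd)
```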

I tell my students that if they purge an "outlier" that is neither a
recording error nor an observation that happens to contain an
unusually large amount of noise, then they are discarding
information. The trick is to make sure the information is not
pertinent (I'm modeling dining behavior for middle class people, so I
don't really care how often Bill Gates eats at Denny's), rather than
simply not convenient (my model fits so much better if I discard the
observations it doesn't fit).

/Paul
 
Art Kendall...
Posted: Thu Oct 22, 2009 6:52 am
Guest
Low bivariate correlations among the X variables do not necessarily
mean that the squared multiple correlations of each X from the set of
other X variables are not high.
As Paul said, look at the VIFs.
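
A minimal sketch of that calculation, assuming NumPy; the function name vif and the synthetic data are illustrative only. Each X is regressed on the other Xs plus an intercept, and VIF = 1 / (1 - R^2), where R^2 is the squared multiple correlation from that regression.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_cases x n_predictors).

    VIF_j = 1 / (1 - R2_j), where R2_j is the squared multiple correlation
    from regressing column j on all other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(2)                   # arbitrary seed, synthetic data
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + 0.1 * rng.normal(size=200)        # nearly a linear combination

vif_two = vif(np.column_stack([x1, x2]))         # unrelated predictors
vif_three = vif(np.column_stack([x1, x2, x3]))   # near-dependence added
print(np.round(vif_two, 2))
print(np.round(vif_three, 1))
```

In this made-up example the VIFs for x1 and x2 alone stay near 1, and all three jump once x3 is added, because x3 is nearly x1 + x2.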

I am usually reluctant to remove suspected outliers unless the values
are beyond a plausible range for the construct being measured.
In my experience about 80% of alleged outliers are data entry errors.
*Double check your data.*

Do you have a sufficient number of cases?
What do the zero order correlations of each X with Y look like? Are any
significant? Is the source of your concern that the bivariate
regressions have good fit, but that the multiple regression has low Bs?

What do the squared multiple correlations of each X from the other Xs
look like?

*Try visualizing your data. *
Look at boxplots of all variables. If you can, go back to the source
again for the data points that show up as suspicious.
Look at each 2d and each 3d scatterplot.
In order to create a 4 dimensional scatterplot, leave 2 Xs alone and use
colors for quintiles of another X.
Then create the 3D scatterplot and rotate the image to look at it from
different angles, fit straight lines, then fit loess lines. Are any
points isolated? Do you see any bends?
What do the residual plots look like?

Do the visualizations give the impression that interactions of some X
variables with themselves (X**2, X**3, etc.) or interactions among Xs
(X1*X2, X1*X3, etc.) are meaningful and improve the fit?

IFF none of the previous helps, then use the boxplots to identify a
*few* of the extreme points that are well separated from the others.
Create flag variables for the edited points. Run the regression as before
and then see if entering the flag variables on a step improves the fit
in a meaningful and significant way. Then do a run trimming those
points in to the value of the next believable point and run your
regressions. Then go back and treat the cases with flags as missing.
If you get meaningful differences, you have a long write up to do
explaining why treating those points as outliers changed things. If
the additional runs do not work, just write up the original run and
mention the checks done.
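
The sequence of runs described above might be sketched as follows, assuming NumPy and a synthetic one-predictor data set with a single planted suspicious value (the data, seed, and helper name ols are illustrative). It only compares coefficients across the runs; judging whether the flag variable improves the fit significantly would need a proper test on top of this.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients of y on X (X must already include an intercept)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(3)              # arbitrary seed, synthetic data
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)
y[0] = 100.0                                # one suspicious value, planted on purpose

flag = np.zeros(n)
flag[0] = 1.0                               # flag variable for the suspicious case
ones = np.ones(n)

# Run 1: original regression, nothing edited.
b_original = ols(np.column_stack([ones, x]), y)

# Run 2: same regression with the flag variable entered as an extra predictor.
b_flagged = ols(np.column_stack([ones, x, flag]), y)

# Run 3: pull the flagged value in to the most extreme unflagged value.
y_trimmed = y.copy()
y_trimmed[0] = y[flag == 0].max()
b_trimmed = ols(np.column_stack([ones, x]), y_trimmed)

# Run 4: treat the flagged case as missing.
keep = flag == 0
b_dropped = ols(np.column_stack([ones[keep], x[keep]]), y[keep])

for name, b in [("original", b_original), ("flagged", b_flagged[:2]),
                ("trimmed", b_trimmed), ("dropped", b_dropped)]:
    print(f"{name:9s} intercept={b[0]:6.2f} slope={b[1]:5.2f}")
```

With a flag variable for a single case, the intercept and slope from run 2 coincide with those from run 4, because the flag absorbs that case's residual entirely.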

Art Kendall
Social Research Consultants


joseph Frank wrote:
[quote]I have no multicollinearity problem, as the correlation matrix shows low correlations between the variables. I agree that a threshold of 2.5% or 5% would eliminate non-outliers. But what troubled me about his comment is the suggestion to eliminate outliers in X. Is it normal to look at outliers of the independent variables?
[/quote]
 
joseph Frank...
Posted: Fri Oct 23, 2009 10:37 pm
Guest
Thanks all for your contributions. They made me realize that a lot of work should be done behind the scenes before reporting any regression.
1- Can anyone recommend a good textbook that explains in detail the checklist of things that need to be done before reporting a regression?
2- Also, can you recommend an easy-to-read textbook that discusses the use of principal component analysis to find multicollinearity?
 
 
All times are GMT - 5 Hours