
PLS modeling question...

Author Message
Edward Jensen...
Posted: Wed Sep 30, 2009 12:55 pm
Guest
Hi all.

I've been experimenting a bit with PLS modeling to better understand some
troubles I'm having with real-life data.

The problem I'm having is that while using my model to predict the training
data results in a quite acceptable fit, the prediction of a validation
sample (25% of the data set, randomly chosen) generates nearly uncorrelated
responses, not even close to the real values in the sample. What does this
usually indicate?

I've tried different numbers of components (to avoid overfitting) and also
leave-one-out cross-validation. The cross-validation procedure results in a
better fit than using a validation subset of more than one sample.

Right now I'm leaning toward the case that the explanatory variables do
not contain enough information to predict the response. I'm using 317
samples, 1 response variable and 1500 predictors, and an 11-component model
which accounts for > 95% of the variance in both predictors and response.
What puzzles me, however, is that if this is the case, then why does my
training sample yield such a good fit and my validation not?

Okay, so I tried a quick simulation. A predictor matrix of 254 x 1501
normally distributed random values with mean 0 and sd 20, and a normally
distributed random response variable with mean 0 and sd 30. The percentage
of variance explained by a PLS model with up to 4 components:
        1 comp   2 comps  3 comps  4 comps
X        0.5424    1.014    1.486    1.958
yTrain  87.4126   97.962   99.686   99.951

The prediction of the training data yields a nearly perfect fit as in the
previous case, but the prediction of 63 variables of the same distributions
as the training set yields nearly totally random responses, also as in the
previous case. I'm puzzled.

First of all: How can such a good fit be generated from random data?
Secondly: Why doesn't it generalize to a new but identically distributed
data set?

How would one diagnose this phenomenon?

All comments are very much appreciated.

Thanks in advance,
Edward

Cross-posted to sci.stat.math with follow-up to sci.stat.consult
 
Paige Miller...
Posted: Thu Oct 01, 2009 8:49 am
Guest
On Sep 30, 2:55 pm, "Edward Jensen" <edw... at (no spam) jensen.invalid> wrote:

[quote:e1ac7ef0bc]What puzzles me, however, is that if this is the case, then why does my
training sample yield such a good fit and my validation not?
[/quote:e1ac7ef0bc]
Possibilities:

1. You have one or more major outliers in your training sample which have
skewed the fit. Remove them, re-fit, and see if things get better.
2. Your 11-dimension model, which accounts for >95% of the variance in both
X and Y, is actually overfitted. Your statement "I'm leaning toward
the case that the explanatory variables do not contain enough
information to predict the response" makes little sense if you can
explain >95% of the variance of the Y variable.

[quote:e1ac7ef0bc]Okay, so I tried a quick simulation. A predictor matrix of 254 x 1501
normally distributed random values with mean 0 and sd 20, and a normally
distributed random response variable with mean 0 and sd 30. The percentage
of variance explained by a PLS model with up to 4 components:
        1 comp   2 comps  3 comps  4 comps
X        0.5424    1.014    1.486    1.958
yTrain  87.4126   97.962   99.686   99.951

The prediction of the training data yields a nearly perfect fit as in the
previous case, but the prediction of 63 variables of the same distributions
as the training set yields nearly totally random responses, also as in the
previous case. I'm puzzled.

First of all: How can such a good fit be generated from random data?
Secondly: Why doesn't it generalize to a new but identically distributed
data set?
[/quote:e1ac7ef0bc]
I don't really understand this particular random example (what
do you mean by prediction of 63 variables?). Nevertheless, when you
have 254 rows and 1501 random columns, it is quite likely that one (or
a combination of two or more) of those random columns will do a good
job of predicting the Y variable in the training data. Not a good example.
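For anyone who wants to reproduce the effect, here is a minimal sketch in plain Python. It is not the code used in the original post (that was never shown); it is a textbook PLS1 fit (NIPALS with deflation) on pure noise. The sizes (60 training rows, 40 test rows, 400 columns) are scaled-down assumptions chosen so it runs quickly while keeping roughly the same columns-to-rows ratio, and the function names are made up for illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_pls1(X, y, n_comp):
    """PLS1 by NIPALS with deflation. Returns means and per-component (w, load, q)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[v - m for v, m in zip(row, x_mean)] for row in X]  # centered X residual
    f = [v - y_mean for v in y]                               # centered y residual
    comps = []
    for _ in range(n_comp):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]  # X'y direction
        norm = dot(w, w) ** 0.5
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]                                  # scores
        tt = dot(t, t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = dot(f, t) / tt                                              # y loading
        E = [[e - ti * lj for e, lj in zip(row, load)] for row, ti in zip(E, t)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        comps.append((w, load, q))
    return x_mean, y_mean, comps

def predict_pls1(model, X):
    """Predict by repeating the same score/deflation steps on each new row."""
    x_mean, y_mean, comps = model
    out = []
    for row in X:
        e = [v - m for v, m in zip(row, x_mean)]
        yhat = y_mean
        for w, load, q in comps:
            t = dot(e, w)
            yhat += q * t
            e = [ei - t * lj for ei, lj in zip(e, load)]
        out.append(yhat)
    return out

def r2(y, yhat):
    m = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - m) ** 2 for a in y)
    return 1 - ss_res / ss_tot

random.seed(0)
n_train, n_test, p = 60, 40, 400  # assumed sizes, smaller than the 254 x 1501 case

def noise_data(n):
    # X and y are independent noise: there is nothing real to learn.
    X = [[random.gauss(0, 20) for _ in range(p)] for _ in range(n)]
    y = [random.gauss(0, 30) for _ in range(n)]
    return X, y

X_tr, y_tr = noise_data(n_train)
X_te, y_te = noise_data(n_test)

train_r2, test_r2 = [], []
for k in range(1, 5):
    model = fit_pls1(X_tr, y_tr, k)
    train_r2.append(r2(y_tr, predict_pls1(model, X_tr)))
    test_r2.append(r2(y_te, predict_pls1(model, X_te)))
    print(k, "comps: train R2 = %.3f, test R2 = %.3f" % (train_r2[-1], test_r2[-1]))
```

The training R2 should climb toward 1 as components are added while the test R2 stays near or below zero, which is the same pattern as in the table quoted above. With far more columns than rows, the first PLS weight vector is built directly from y, so the scores can track y in the training rows even though the columns carry no information about it.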

Better example: the 1501 random columns which make up your X matrix
are highly correlated with one another ... so you don't really have
1501 independent random columns. This is the situation where PLS
shines.
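A rough sketch of such a "better example" follows, again in plain Python with a textbook PLS1 (NIPALS) fit. Everything specific here is an assumption made for illustration: the columns of X are driven by 3 hidden factors plus a little noise, so they are highly correlated with one another, and y depends on those same factors. The sizes, noise levels and coefficients are arbitrary choices.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_pls1(X, y, n_comp):
    """PLS1 by NIPALS with deflation. Returns means and per-component (w, load, q)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[v - m for v, m in zip(row, x_mean)] for row in X]
    f = [v - y_mean for v in y]
    comps = []
    for _ in range(n_comp):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = dot(w, w) ** 0.5
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]
        tt = dot(t, t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = dot(f, t) / tt
        E = [[e - ti * lj for e, lj in zip(row, load)] for row, ti in zip(E, t)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        comps.append((w, load, q))
    return x_mean, y_mean, comps

def predict_pls1(model, X):
    x_mean, y_mean, comps = model
    out = []
    for row in X:
        e = [v - m for v, m in zip(row, x_mean)]
        yhat = y_mean
        for w, load, q in comps:
            t = dot(e, w)
            yhat += q * t
            e = [ei - t * lj for ei, lj in zip(e, load)]
        out.append(yhat)
    return out

def r2(y, yhat):
    m = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - m) ** 2 for a in y)
    return 1 - ss_res / ss_tot

random.seed(1)
n_train, n_test, p, k = 60, 40, 400, 3   # assumed sizes; k hidden factors
loadings = [[random.gauss(0, 1) for _ in range(p)] for _ in range(k)]
coef = [3.0, -2.0, 1.0]                   # assumed effect of each factor on y

def latent_data(n):
    X, y = [], []
    for _ in range(n):
        s = [random.gauss(0, 1) for _ in range(k)]  # hidden factor scores
        # Every column is a mix of the same k factors plus small noise.
        X.append([sum(s[a] * loadings[a][j] for a in range(k)) + random.gauss(0, 0.5)
                  for j in range(p)])
        y.append(dot(s, coef) + random.gauss(0, 0.5))
    return X, y

X_tr, y_tr = latent_data(n_train)
X_te, y_te = latent_data(n_test)

model = fit_pls1(X_tr, y_tr, k)
train_r2 = r2(y_tr, predict_pls1(model, X_tr))
test_r2 = r2(y_te, predict_pls1(model, X_te))
print("train R2 = %.3f, test R2 = %.3f" % (train_r2, test_r2))
```

Here the held-out R2 should stay close to the training R2, because the few components PLS extracts correspond to real shared structure rather than to chance alignments among independent columns.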

--
Paige Miller
paige\dot\miller \at\ kodak\dot\com
 
Paige Miller...
Posted: Fri Oct 02, 2009 4:50 am
Guest
On Oct 1, 8:52 pm, Rich Ulrich <rich.ulr... at (no spam) comcast.net> wrote:

[quote:4c4bf7d2ae]I haven't used it, but PLS has a good reputation for dealing
with "too many variables";
[/quote:4c4bf7d2ae]
I would make an amendment to your statement: PLS has a good reputation
for dealing with too many highly correlated variables.

I agree that this many iid normal variables is not a good test for
PLS, and I don't expect any other procedure to perform well either.

--
Paige Miller
paige\dot\miller \at\ kodak\dot\com
 
Rich Ulrich...
Posted: Fri Oct 02, 2009 2:10 pm
Guest
On Fri, 2 Oct 2009 07:50:24 -0700 (PDT), Paige Miller
<paige.miller at (no spam) kodak.com> wrote:

[quote:d70948ea94]On Oct 1, 8:52 pm, Rich Ulrich <rich.ulr... at (no spam) comcast.net> wrote:

I haven't used it, but PLS has a good reputation for dealing
with "too many variables";

I would make an amendment to your statement: PLS has a good reputation
for dealing with too many highly correlated variables.
[/quote:d70948ea94]
Good amendment.

[quote:d70948ea94]
I agree that this many iid normal variables is not a good test for
PLS, and I don't expect any other procedure to perform well either.
[/quote:d70948ea94]
- so he could get a more realistic simulation of how well PLS
should perform for his case if he simulates with correlations
present. Or (as I suggested), simply randomize the Outcome vector.
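The randomize-the-outcome check can be sketched as follows. This is only an illustration under stated assumptions: the data are pure noise (as in the simulation discussed in this thread, but with smaller, assumed sizes), the model is a one-component PLS fit, and the helper name one_comp_train_r2 is made up. For one component the training scores are t = Xc Xc' f (Xc and f being the centered predictors and response), so the training R2 can be computed from the Gram matrix without refitting anything else.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def one_comp_train_r2(K, y):
    """Training R2 of a one-component PLS fit, given K = Xc Xc' (centered Gram matrix)."""
    m = sum(y) / len(y)
    f = [v - m for v in y]
    t = [dot(row, f) for row in K]  # scores up to a scale factor, which R2 ignores
    return dot(f, t) ** 2 / (dot(t, t) * dot(f, f))

random.seed(2)
n, p = 60, 400  # assumed sizes
X = [[random.gauss(0, 20) for _ in range(p)] for _ in range(n)]
y = [random.gauss(0, 30) for _ in range(n)]

# Center the columns of X once; shuffling y does not change them.
col_mean = [sum(row[j] for row in X) / n for j in range(p)]
Xc = [[v - m for v, m in zip(row, col_mean)] for row in X]
K = [[dot(Xc[i], Xc[j]) for j in range(n)] for i in range(n)]

observed = one_comp_train_r2(K, y)

# Baseline: the same fit after breaking any link between X and y.
permuted = []
for _ in range(30):
    y_perm = y[:]
    random.shuffle(y_perm)
    permuted.append(one_comp_train_r2(K, y_perm))

frac_as_good = sum(r >= observed for r in permuted) / len(permuted)
print("observed train R2 = %.3f" % observed)
print("permuted train R2: min = %.3f, max = %.3f" % (min(permuted), max(permuted)))
print("share of permuted fits at least as good: %.2f" % frac_as_good)
```

If the observed fit sits inside the range of the permuted fits, the training R2 reflects the procedure and the dimensions of the data, not a real relationship. On real data one would replace X and y with the actual predictors and response and use the same number of components as the model in question.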

--
Rich Ulrich
 
 
All times are GMT - 5 Hours