Science Forum Index » Space - Consult Forum » Principal Component Analysis...
David...
Posted: Thu May 08, 2008 6:21 am
Dear list,
is it possible to use PCA on categorical data?
I have a set of 30 continuous and categorical variables and would like to
select a subset for modeling a response variable. I read that PCA
would help me do this data reduction, but all the examples I have
seen involve continuous data.
Thanks for your help,
D.

Ray Koopman...
Posted: Mon May 12, 2008 10:10 pm
On May 8, 9:21 am, David <david_art... at (no spam) hotmail.com> wrote:
Quote: Dear list,
is it possible to use PCA on categorical data?
I have a set of 30 continuous and categorical variables and would like to
select a subset for modeling a response variable. I read that PCA
would help me do this data reduction, but all the examples I have
seen involve continuous data.
Thanks for your help,
D.
Yes, there is such a thing as PCA for categorical (and mixed) data,
but I doubt that it would be much help to you. Principal component
regression, which is what you're talking about, is dicey enough with
continuous data, and making some of the variables categorical would
complicate matters substantially.

Paige Miller...
Posted: Wed May 14, 2008 2:25 am
On May 8, 12:21 pm, David <david_art... at (no spam) hotmail.com> wrote:
Quote: Dear list,
is it possible to use PCA on categorical data?
I have a set of 30 continuous and categorical variables and would like to
select a subset for modeling a response variable. I read that PCA
would help me do this data reduction, but all the examples I have
seen involve continuous data.
The data reduction done by Principal Components is not the data
reduction you refer to as "select a subset for modeling a response
variable".
Principal Components optimizes a specific objective function;
selecting a subset of your variables for modeling a response variable
has an implicit, and different, objective function to be optimized. So
using a procedure that optimizes one objective function (PCA) to
achieve optimization of a different objective function just doesn't
make sense. (Yes, I know sometimes people use PCA that way, but that's
simply not a good thing to do in most situations.)
If you have 30 continuous and categorical variables, and you want to
predict Y, I suggest you look into Partial Least Squares (PLS)
regression, which is much more suited to the case where you have many
correlated predictor variables. PLS can be easily modified to handle
categorical variables. It provides a model where all of your 30
predictor variables are used in prediction. You can find variables
with small loadings or small regression coefficients and eliminate
them from your model if you so choose; there is a robust debate among
statisticians and chemometricians about whether or not this is a good thing
to do.
--
Paige Miller
paige\dot\miller \at\ kodak\dot\com

...
Posted: Wed May 14, 2008 4:14 am
Paige Miller <paige.miller at (no spam) kodak.com> wrote:
Quote: If you have 30 continuous and categorical variables, and you want to
predict Y, I suggest you look into Partial Least Squares (PLS)
regression, which is much more suited to the case where you have many
correlated predictor variables. PLS can be easily modified to handle
categorical variables. It provides a model where all of your 30
predictor variables are used in prediction. You can find variables
with small loadings or small regression coefficients and eliminate
them from your model if you so choose; there is a robust debate among
statisticians and chemometricians about whether or not this is a good thing
to do.
Hi Paige,
I've seen you recommend PLS on a number of occasions, so I imagine you
might know what's happening. Some years ago, I had used a pre-release
version of Wynne Chin's PLSGraph, but I'm not sure he ever completed the
project. Do you know anything about it? Do you use or recommend any other
software for PLS?
Thanks,
Mike Babyak

Paige Miller...
Posted: Wed May 14, 2008 5:50 am
On May 14, 10:14 am, nau... at (no spam) nil.com wrote:
Quote: Hi Paige,
I've seen you recommend PLS on a number of occasions, so I imagine you
might know what's happening. Some years ago, I had used a pre-release
version of Wynne Chin's PLSGraph, but I'm not sure he ever completed the
project. Do you know anything about it? Do you use or recommend any other
software for PLS?
Thanks,
Mike Babyak
Mike,
I make no recommendations on which software to use; however, I know the
following packages do perform PLS: SAS, MATLAB (with add-on
toolboxes), Unscrambler (published by Camo), and SIMCA-P (published by
Umetrics). There are probably other software packages that perform
PLS.
I do however make a negative recommendation, as in: DON'T USE THIS
PACKAGE FOR PLS. The package I am referring to is JMP (which is very
good for many statistical analyses, but unusable for PLS). My reasons
for this negative recommendation were written here (and apparently
still apply to the newest version of JMP, which is version 7):
https://listserv.umd.edu/cgi-bin/wa?A2=ind0506&L=ICS-L&P=R1101&I=-3
--
Paige Miller
paige\dot\miller \at\ kodak\dot\com

Art Kendall...
Posted: Wed May 14, 2008 8:40 am
For data reduction, see the CATPCA (Categorical PCA) procedure in SPSS;
it deals with mixed continuous and categorical variables.
You may also be interested in CATREG - Categorical Regression.
Art Kendall
Social Research Consultants

John Uebersax...
Posted: Wed May 14, 2008 8:32 pm
Hi David,
Some suggestions:
1. If your categorical variables are ordered-categorical, then you
can calculate:
a. Pearson correlations between each pair of continuous variables.
b. Polyserial correlations between continuous and ordered-
categorical variables
c. Polychoric correlations between ordered-categorical variables
then place these in a single matrix and analyze that matrix by PCA. A
program like LISREL/Prelis will do all this for you more-or-less
automatically.
2. Although I agree with what others have posted, personally I prefer
the approach you originally suggested: to approach data-reduction and
the modeling of your response variable as two separate steps.
3. Since you just want to select a subset of non-redundant variables,
you have other options besides PCA. For example, you can use
hierarchical cluster analysis on the correlation matrix. That will
divide your variables into clusters. Then you can pick 'exemplars'
from each cluster and use those in your data model. This gives you
more flexibility, because you can use other measures of similarity/
redundancy among your variables besides correlation coefficients. For
example, if your categorical variables are non-ordered (i.e., purely
nominal variables), you can calculate the canonical correlation
between each pair of them. Then you can cluster analyze the matrix of
canonical correlation coefficients to divide the variables into
separate groups, and then select exemplars from each group.
Possibly you can include the canonical correlations in the overall
matrix as described in point 1 above -- I'm not sure, because they
might tend to run lower overall than Pearson correlations.
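
A minimal sketch of the clustering idea in point 3, assuming Python with NumPy (neither is used in this thread). It joins variables whose absolute correlation exceeds a cutoff, which gives the same groups as cutting a single-linkage hierarchical clustering at distance 1 - cutoff, and then picks as exemplar the variable with the highest mean absolute correlation to the rest of its group. The function name `cluster_exemplars`, the cutoff of 0.7 and the toy matrix are illustrative assumptions, not a recommendation.

```python
import numpy as np

def cluster_exemplars(corr, cutoff=0.7):
    """Group variables by |correlation| > cutoff and pick one exemplar per group.

    Joining every pair above the cutoff and taking connected components gives
    the same groups as single-linkage clustering cut at distance 1 - cutoff.
    """
    a = np.abs(np.asarray(corr, dtype=float))
    p = a.shape[0]
    parent = list(range(p))  # union-find forest over variable indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if a[i, j] > cutoff:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(p):
        groups.setdefault(find(i), []).append(i)

    result = []
    for members in groups.values():
        sub = a[np.ix_(members, members)]
        # Mean |r| with the other members (the diagonal of 1s is excluded).
        score = (sub.sum(axis=1) - 1.0) / max(len(members) - 1, 1)
        result.append((members, members[int(np.argmax(score))]))
    return sorted(result)


# Toy correlation matrix (made up): variables 0-2 form one block, 3-4 another.
corr = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 1.0, 0.85, 0.2, 0.1],
    [0.8, 0.85, 1.0, 0.1, 0.1],
    [0.1, 0.2, 0.1, 1.0, 0.75],
    [0.0, 0.1, 0.1, 0.75, 1.0],
])
clusters = cluster_exemplars(corr)
for members, exemplar in clusters:
    print("cluster", members, "-> exemplar", exemplar)
```

Any other pairwise similarity (polychoric, canonical, and so on) could be passed in place of the Pearson matrix. For average or complete linkage and a full dendrogram, SciPy's `scipy.cluster.hierarchy.linkage` and `fcluster` are the usual tools.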
Hope this helps.
John Uebersax PhD

David...
Posted: Thu May 15, 2008 1:36 am
Thank you all for your input. Here are some comments:
- Art, you suggest some PCA methods, but my initial worry about using
PCA is losing interpretability.
- Paige, you suggest PLS, but is PLS not effectively doing what PCA,
or Principal Component Regression, does? I have just read through it
quickly, and had a look at Faraway's "Practical Regression and ANOVA
using R", and it says "On the other hand, PLS is virtually useless for
explanation purposes". So how can I trace back my regressors after
doing PLS?
- John, you suggest calculating a correlation matrix for all pairwise
comparison of my variables and then performing hierarchical clustering
to select a representative of each of the groups. That sounds very
interesting. So if I have 30 variables, should I end up with a 30x30
correlation matrix that could be fed to a clustering algorithm? My
categorical variables are generally non-ordered, like "family history"
yes-no. What kind of correlation measurement could I use for non-
ordered categorical variables?
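
On the last question, one possible reading of the pairwise canonical correlation mentioned earlier in the thread is sketched below, assuming Python with NumPy. It takes the largest singular value of the standardized residuals of the two-way contingency table, the same decomposition used in correspondence analysis. The function name and the toy yes/no variables are made up for illustration.

```python
import numpy as np

def nominal_canonical_corr(x, y):
    """Largest canonical correlation between two nominal variables.

    Computed as the top singular value of the standardized residuals of the
    contingency table (the decomposition used in correspondence analysis).
    """
    x = np.asarray(x)
    y = np.asarray(y)
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    table = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(table, (xi, yi), 1.0)      # cross-tabulate the two variables
    p = table / table.sum()
    r = p.sum(axis=1)                    # row proportions
    c = p.sum(axis=0)                    # column proportions
    s = (p - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return float(np.linalg.svd(s, compute_uv=False)[0])


# Hypothetical yes/no variables for eight patients.
family_history = ["yes", "yes", "yes", "no", "no", "no", "yes", "no"]
smoker = ["yes", "yes", "no", "no", "no", "yes", "yes", "no"]

rho = nominal_canonical_corr(family_history, smoker)
phi = np.corrcoef(np.array(family_history) == "yes",
                  np.array(smoker) == "yes")[0, 1]
print("canonical correlation:", rho)
print("|Pearson r| on 0/1 codes:", abs(phi))
```

For two yes/no variables this value coincides with the absolute Pearson (phi) correlation of the 0/1 codes, so for purely binary variables an ordinary correlation matrix already carries the same information.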
Thanks for your useful comments
D.

Paige Miller...
Posted: Thu May 15, 2008 2:17 am
On May 15, 7:36 am, David <david_art... at (no spam) hotmail.com> wrote:
Quote: - Paige, you suggest PLS, but is PLS not doing effectively what PCA
does or Principal Component Regression? I have just read through it
quickly, and had a look at Faraway`s "Practical Regression and ANOVA
using R" and it says "On the other hand, PLS is virtually useless for
explanation purposes". So how can I trace back my regressors after
doing PLS?
When you have 30 input variables, and they are (highly) correlated,
the problem isn't the method -- in this case PLS -- the problem is
that you cannot in any way separate the distinct and independent
effects of each of the individual input variables. Logically, this
cannot be done. Thus, any model based upon 30 correlated predictors,
regardless of the estimation method, does not provide distinct and
independent effect estimates of the 30 predictors that can be used for
explanatory purposes. PLS does not do this, and any other method that
you choose will not do this either, because as I said, it is logically
impossible.
In this case, the model (regardless of the estimation method) may or
may not be a good predictive model. There are zillions of studies
published where PLS performs well predictively. There are also many
many examples where the PLS vectors are interpretable -- in other
words, you don't have a distinct and independent estimate of the
individual effects, but you do have a good interpretation of the
linear combination of effects that PLS provides. This is very powerful
when it works, and so to dismiss PLS and similar approaches as having
no explanatory power is completely incorrect. The explanatory power
comes from the linear combinations of independent variables that PLS
provides.
Next, PLS is not PCA regression. Let me repeat that. PLS is not PCA
regression. One more time. PLS is not PCA regression.
The PCA vectors are chosen without regard to the dependent variable.
They may or may not be correlated with the dependent variable. If they
are not correlated with the dependent variables, then this step of the
analysis is essentially worthless when you want to predict a dependent
variable(s). PLS, on the other hand, will find vectors (linear
combinations of the independent variables) that are correlated with
the response variable(s), if such vectors exist. That's what the PLS
algorithm does. So, by default, PLS is a superior regression method to
PCA Regression in the case where you want to model a dependent
variable(s).
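
A small sketch of this difference, assuming Python with NumPy and a deliberately contrived toy data set (all names and numbers are illustrative). The first PCA direction is the top right singular vector of the centered X and never looks at y. The first PLS direction is proportional to X'y, the unit vector whose score has maximal covariance with y.

```python
import numpy as np

# Hypothetical toy data: x1 has a large variance but is unrelated to y,
# x2 has a small variance and fully determines y.
x1 = np.array([10.0, -10.0, 10.0, -10.0, 10.0, -10.0, 10.0, -10.0])
x2 = np.array([1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
X = np.column_stack([x1, x2])
y = 2.0 * x2

Xc = X - X.mean(axis=0)   # center predictors
yc = y - y.mean()         # center response

# First PCA direction: top right singular vector of Xc (ignores y entirely).
w_pca = np.linalg.svd(Xc, full_matrices=False)[2][0]

# First PLS direction: proportional to Xc' yc, i.e. the unit vector whose
# score has maximal covariance with y.
w_pls = Xc.T @ yc
w_pls = w_pls / np.linalg.norm(w_pls)

def corr_with_y(w):
    """Correlation between the score Xc @ w and the centered response."""
    return float(np.corrcoef(Xc @ w, yc)[0, 1])

r_pca = corr_with_y(w_pca)
r_pls = corr_with_y(w_pls)
print("PCA direction", w_pca, "corr with y:", round(r_pca, 3))
print("PLS direction", w_pls, "corr with y:", round(r_pls, 3))
```

In this constructed case the first principal component follows the high-variance x1 and is uncorrelated with y, while the first PLS component follows x2. Real data are rarely this clean, but the mechanism is the same.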
Finally, with regard to comments made by other commenters, PLS
provides a data reduction method that is directly linked to your final
objective, that is, to model some dependent variable. The approaches
suggested by the other commenters (PCA, CATPCA, LISREL, and selecting
a subset of non-redundant variables) all amount to pre-processing the
data without regard to the dependent variable, and thus you risk doing
something in your pre-processing that eliminates information that is
helpful to your final predictive model. Why take that risk? PLS
minimizes that particular risk. PLS finds a lower dimensional
representation of your independent variables (akin to the idea of
preprocessing) that is correlated with your dependent variables, if
such a lower dimensional representation exists.
--
Paige Miller
paige\dot\miller \at\ kodak\dot\com

...
Posted: Thu May 15, 2008 3:46 am
Quote: Paige Miller wrote:
When you have 30 input variables, and they are (highly) correlated,
the problem isn't the method -- in this case PLS -- the problem is
that you cannot in any way separate the distinct and independent
effects of each of the individual input variables. Logically, this
cannot be done.
Paige wrote a very nice summary of the difficulty (or
impossibility) of interpreting regression coefficients when
predictors are correlated, and I am fully in agreement with his
recommendation of PLS.
I just want to emphasize one more time. If variables are highly
correlated, the distinct and independent effects of individual
predictors can NOT be estimated NO MATTER WHAT (even with zillions of
data, any sophisticated non-linear methods, etc).
Logically impossible and mathematically impossible as well. This is
related to collinearity or ill-conditioning.
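
To make the ill-conditioning concrete, here is a hedged sketch assuming Python with NumPy and simulated data (the seed, sample size and noise levels are arbitrary choices). Two nearly identical predictors are used to fit the same model on two independent noise draws. The variance inflation factor and the condition number quantify the problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # nearly a copy of x1
X = np.column_stack([x1, x2])

r = np.corrcoef(x1, x2)[0, 1]
vif = 1.0 / (1.0 - r**2)                 # variance inflation factor (two predictors)
cond = np.linalg.cond(X - X.mean(axis=0))

# Fit the same model y = x1 + x2 + noise on two independent noise draws.
fits = []
for _ in range(2):
    y = x1 + x2 + rng.normal(scale=0.5, size=n)
    A = np.column_stack([np.ones(n), X])  # intercept plus the two predictors
    beta = np.linalg.lstsq(A, y, rcond=None)[0]
    fits.append(beta[1:])

print("VIF:", vif, "condition number:", cond)
for b in fits:
    print("b1, b2 =", b, " b1 + b2 =", b.sum())
```

With predictors this collinear, the individual coefficients typically differ noticeably between the two fits, while their sum stays close to the true value of 2. Only the combined effect is well determined, which is the point made above.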
I found chapters 12 and 13 of “Data Analysis and Regression”
by F. Mosteller and J. W. Tukey (Woes of regression
coefficients) very enlightening for the interpretation of regression
coefficients.
Interpretation assumes cause-and-effect relationship. In my humble
opinion, statistics has not been successful in understanding the cause-
and-effect relationship, let alone, dynamic behavior. It has been
repeated so many times: Correlation is not causation. Well, this
is not the responsibility of statistics but the problem of
OBSERVATIONAL data. What I mean is that if data are from controlled
experiments in which the predictors are independent, samples are
balanced and randomly allocated, the interpretation and the effects
of individual predictors are very straightforward. However, the majority
of real-life data are observational; hence the difficulty of
interpretation.
By the way, as a rough summary:
- Multiple linear regression (also ALL the methods in the GENERALIZED
LINEAR REGRESSION) accounts for the maximum variance of Y.
- PCA (and thus PCR) accounts for the maximum variance of X.
- PLS accounts for the maximum covariance between X and Y.
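
Stated a little more formally, using the standard textbook formulation rather than anything from this thread: with X and y centered, w a weight vector, and only the first component considered, the three methods choose w as follows.

```latex
\begin{aligned}
\text{OLS:}\quad & \max_{w}\ \operatorname{corr}^2(Xw,\, y) \\
\text{PCA / PCR:}\quad & \max_{\|w\|=1}\ \operatorname{var}(Xw) \\
\text{PLS:}\quad & \max_{\|w\|=1}\ \operatorname{cov}^2(Xw,\, y)
  \;=\; \max_{\|w\|=1}\ \operatorname{var}(Xw)\,\operatorname{corr}^2(Xw,\, y)\,\operatorname{var}(y)
\end{aligned}
```

The factorization in the last line shows why PLS sits between the other two: it rewards directions that both carry variance in X and correlate with y.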
I found it helpful to just think about the intrinsic structure among
predictors before any methods are applied: whether some predictors are
causally related and thus combined or deleted.
Hope this helps.
Sangdon Lee, Ph.D.
GM Tech Center.

Paige Miller...
Posted: Thu May 15, 2008 4:20 am
On May 15, 8:59 am, "Gaj Vidmar" <gaj.vid... at (no spam) mf.uni-lj.si> wrote:
Quote: - But there is also a point that with "too many variables" [with regard to
the number of cases], in order to avoid capitalisation on chance (overfit,
lack of generalisability or however-you-call-it) while avoiding learning
stuff too advanced for "only semi-smart people" (regularised discriminant
analysis, shrinkage a la Prof. Harrell etc.), you can get valid results with
simple methods precisely and only by ignoring the outcome when reducing the
dimensionality first (with PCA, clustering [followed by selection of
"representatives" or producing a score on each "variable group"], CATPCA
[after wise discretisation of numeric variables], even FA, or some other
way), doing, of course, everything cum grano salis (i.e., with subject
matter knowledge, "feeling" for data etc.).
This is an extremely long sentence and quite honestly, it is so
convoluted in its structure, that I am afraid I may not understand
what you are saying.
However ... PLS has protection against overfitting. It is called
crossvalidation. Crossvalidation isn't perfect, but it does prevent
certain abuses of the data. The mere fact that PLS may use a large
number of variables as X and far fewer observations than variables
does not imply that overfitting has happened. In the original poster's
case, if he had 30 variables, and they are (highly) correlated, there
may not be 30 independent things happening in his X matrix. In fact,
PLS usually chooses a few dimensions to use for prediction, which most
definitely is not overfitting. As I said earlier, it has proven to be
a useful tool in many published examples.
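
A rough sketch of how cross-validation can pick the number of PLS dimensions, assuming Python with NumPy, a single response, and simulated data with two underlying latent factors. The functions `pls1_fit`, `pls1_predict` and `cv_press` are illustrative helpers written for this example, not the routine of any package named in this thread.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """PLS1 (single response) via the classic deflation algorithm."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)        # weight: direction of max covariance with y
        t = Xr @ w                       # scores
        tt = t @ t
        p = Xr.T @ t / tt                # X loadings
        c = yr @ t / tt                  # y loading
        Xr = Xr - np.outer(t, p)         # deflate X
        yr = yr - c * t                  # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # coefficients on the original X scale
    return coef, x_mean, y_mean

def pls1_predict(model, X):
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean

def cv_press(X, y, max_comp, n_folds=5):
    """Cross-validated prediction error sum of squares for 1..max_comp components."""
    folds = np.arange(len(y)) % n_folds      # simple interleaved folds
    press = np.zeros(max_comp)
    for k in range(n_folds):
        test = folds == k
        for a in range(1, max_comp + 1):
            model = pls1_fit(X[~test], y[~test], a)
            press[a - 1] += np.sum((y[test] - pls1_predict(model, X[test])) ** 2)
    return press


# Simulated data: six correlated predictors driven by two latent factors.
rng = np.random.default_rng(1)
n, p = 60, 6
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, p)) + 0.3 * rng.normal(size=(n, p))
y = latent[:, 0] - latent[:, 1] + 0.3 * rng.normal(size=n)

press = cv_press(X, y, max_comp=p)
best = int(np.argmin(press)) + 1
print("PRESS by number of components:", np.round(press, 2))
print("components chosen by cross-validation:", best)
```

The number of components with the smallest held-out error is the one kept. Using all p components reproduces ordinary least squares, so choosing fewer components this way is what limits overfitting.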
The part of the sentence that reads "you can get valid results with
simple methods precisely and only by ignoring the outcome when
reducing the dimensionality first" disturbs me greatly, if I am
understanding it properly. I will say again, pre-processing your data
without regard to the dependent variable is dangerous and may discard
information that can be used to predict. I strongly urge people to use
pre-processing techniques that do not ignore the dependent variable.
--
Paige Miller
paige\dot\miller \at\ kodak\dot\com

Gaj Vidmar...
Posted: Thu May 15, 2008 7:59 am
Let me start with the last issue, i.e., similarity/distance measure for
mixed data.
There's enough research on that (especially recently); I'll just mention
Gower's index (General Coefficient of Similarity, if I recall correctly) and
its extensions (to include ordinal data etc.).
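
For readers who have not met it, a minimal sketch of the basic Gower coefficient, assuming Python with NumPy. Numeric variables contribute one minus the range-scaled absolute difference, categorical variables contribute 1 for a match and 0 otherwise, and the contributions are averaged. This version uses equal weights, has no missing-value handling, and treats binary variables as symmetric. The function name and the three-patient data set are made up.

```python
import numpy as np

def gower_similarity(numeric, categorical):
    """Pairwise Gower similarity for rows with numeric and categorical columns.

    numeric:     (n, p_num) float array
    categorical: (n, p_cat) array of labels
    Each variable contributes a score in [0, 1]; the result is their plain average.
    """
    numeric = np.asarray(numeric, dtype=float)
    categorical = np.asarray(categorical)
    n = numeric.shape[0]
    total = np.zeros((n, n))

    ranges = numeric.max(axis=0) - numeric.min(axis=0)
    for j in range(numeric.shape[1]):
        col = numeric[:, j]
        if ranges[j] > 0:
            # Range-scaled closeness between every pair of rows.
            total += 1.0 - np.abs(col[:, None] - col[None, :]) / ranges[j]
        else:
            total += 1.0  # a constant column counts as fully similar

    for j in range(categorical.shape[1]):
        col = categorical[:, j]
        total += (col[:, None] == col[None, :]).astype(float)  # simple matching

    return total / (numeric.shape[1] + categorical.shape[1])


# Hypothetical data: age and a metabolite level, plus two yes/no variables.
numeric = np.array([[30.0, 1.0], [50.0, 2.0], [70.0, 3.0]])
categorical = np.array([["yes", "yes"], ["no", "yes"], ["yes", "no"]])

S = gower_similarity(numeric, categorical)
print(S)
```

One minus this matrix can be used as a distance between cases in a clustering routine.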
If you just have numerical and binary data, Pearson correlation will not be
such a terrible idea (please note that I'm writing from a strictly applied,
"help-the-client--considering-all-the-tradeoffs" perspective).
---
To add to another poster championing PLS, Minitab also does it.
---
On a general note, to summarise things simplistically:
- There is a point in the above-mentioned poster's concern about losing the
information/variables that are most relevant to predicting the outcome when
reducing the dimensionality of the predictors without regard to the outcome
(the whole critique of Principal Components Regression by people immensely
more qualified than me, including a whole book, is also related to that).
- But there is also a point that with "too many variables" [with regard to
the number of cases], in order to avoid capitalisation on chance (overfit,
lack of generalisability or however-you-call-it) while avoiding learning
stuff too advanced for "only semi-smart people" (regularised discriminant
analysis, shrinkage a la Prof. Harrell etc.), you can get valid results with
simple methods precisely and only by ignoring the outcome when reducing the
dimensionality first (with PCA, clustering [followed by selection of
"representatives" or producing a score on each "variable group"], CATPCA
[after wise discretisation of numeric variables], even FA, or some other
way), doing, of course, everything cum grano salis (i.e., with subject
matter knowledge, "feeling" for data etc.).
---
Yeah, as they say, it all depends.
Best regards,
Gaj Vidmar, PhD
Institute for Rehabilitation, Republic of Slovenia
& Univ. of Ljubljana, Fac. of Medicine, Inst. of Biomedical Informatics

Gaj Vidmar...
Posted: Thu May 15, 2008 10:17 am
To try something less "convoluted in structure" (although I cannot write a
sentence without parentheses :)
- read 'simple methods' as 'apart from PLS and even more advanced methods'
and the misunderstanding will be gone.
Let's say I counted multiple regression and clustering and PCA among the
simple methods because they can be at least roughly understood and
[self-]taught without matrix algebra and the like, to/by all kinds of social
and life sciences students and/or researchers etc.
Of course even an amateur like me knows that there's no overfit 'as such' in
PLS and that there's cross-validation -- I referred to 'apart from PLS and
even more advanced methods' as stressed above.
And "is dangerous and may discard information that can be used to predict"
does not contradict 'can give valid results'. Needn't, but can, and with
subject matter knowledge, care, feeling, proper EDA, refraining from
'cheating' (in the sense of, e.g., picking 'the best' predictors from the
huge pool, 'preferably' by some machine-learning method, and then not
reporting about the dropped ones), it more than likely will help one produce
something publishable, a reasonably non-crap thesis, a not completely
useless BI report or the like, rather than get one confused, commit blunders
and the like.
As said, it all depends on what are the data, what is the aim, what does the
analyst know and how much time/motivation he/she has to learn, what are the
time constraints, software preferences, budget, etc. etc. etc.
Best regards,
Gaj Vidmar
(just trying -- apparently unsuccessfully -- to convey to the poster some
down-to-earth thoughts about real-life data analysis)

Art Kendall...
Posted: Thu May 15, 2008 12:14 pm
I would not suggest using the PCA under Factor unless your categorical
variables are dichotomous.
CATPCA is in the Categories add-on to SPSS. It is distinct from the
other kinds of factor analysis in that it handles a mix of categorical
and continuous variables. If the categorical variables are dichotomous
CATPCA is identical to PCA. It can also tell you how variables work
at different levels of measurement. For example, it will compare using
categorical variables as pure nominal, ordered, and not too discrepant
from interval. (Many extent or Likert items turn out not too discrepant
from interval level.)
Almost any University has SPSS.
The full documentation, including algorithms, is on the CD that comes
with it, and I believe it is also on the SPSS site.
see
http://www.spss.com/categories/data_analysis.htm
for a brief description of the procedures available in Categories.
If you cannot navigate to the documentation that is online, email me and
I'll try to navigate to it and send the links.
If it turns out not to be online (which I doubt) and your email program
allows you to receive .pdf files, email me and I'll send them.
Art Kendall
Social Research Consultants
David wrote:
<snip>
Quote:
- Art, you suggest some PCA methods, but my initial worry about using
PCA is losing interpretability

David...
Posted: Thu May 15, 2008 11:21 pm
Hi all again, thanks for your input, it is greatly appreciated. You
may have guessed my statistics level by now. I am trying to get some
direction on where to look from here.
Following your discussion, as far as I understand, there seems to be
a general consensus that feature selection needs to be addressed
together with the dependent variable, so that information that might
be useful in predicting Y is not dropped. And PLS does this, unlike
PCA and "Correlation among predictors". But on the other hand, I
wonder why finding groups of correlated predictors and then choosing a
member of each group to relate to the response variable would lead
to losing information. If I had 5 groups of correlated variables
among my 30 predictors, I could use the 5 representatives in the
prediction with no fear of losing information, couldn't I?
One thing that strikes me is that you are all talking about "(highly)
correlated" predictors. Is this generally the case for any dataset?
Paige, you say that "When you have 30 input variables, and they are
(highly) correlated,
the problem isn't the method -- in this case PLS -- the problem is
that you cannot in any way separate the distinct and independent
effects of each of the individual input variables."
Do you always get highly correlated variables? I am working on
clinical data, having different "biological" and "social" variables
(such as presence of symptoms, smoker, concentration of
metabolites,...) to try to predict a clinical outcome (response to
treatment, bad prognosis,...). I can understand that if all 30
variables are correlated, no method will allow me to do a good
selection of variables. If this is always the case, then why bother
trying to model data?
Thanks Art for the info on PCA. We use SPSS and will look into the
information. I am also trying R.
Regards,
David

All times are GMT - 5 Hours