about PCA and variability??...
onyourmark...
Posted: Tue Jun 24, 2008 8:47 pm
Guest
Hi, I have read about PCA. There is always a reference to the fact that
the first principal component is the component with the greatest
variation. Also there is mention of the fact that we seek to rotate
the axes in the direction of maximum variability. And again there is
mention of the fact that much of the variability of the data can be
accounted for by a smaller set of variables.

I am missing something in these statements. I cannot see clearly the
importance of the variability of a variable. I think that this is
being considered in the context of predicting the value of another (a
dependent/response) variable.

I am trying to think about it like this. Suppose there is no variability
in a given variable, say X, i.e. it is constant. I guess we would say
that such a variable would be of no use in predicting the value
of a response variable. I mean, if X is always 5, then knowing what X is
gives us no insight into the value of our
response variable (say Y).

But what about the converse of this idea: if X is extremely variable (has a
lot of variability), then ... what? I cannot take the next step
here.

Any help would be appreciated!
Thank you.
cprice...
Posted: Wed Jun 25, 2008 5:19 am
Guest
From your original set of many X variables, PCA lets you create a new
set of variables. This new set will have just as many variables as
your original set, but the point here is that, with only a few of
them, you could still capture almost all of the variance of the
original X variables.


As an exaggerated example, the outcome might be that the 1st new
variable captures 98% of the variance of all of the original
variables. This means that this one new variable does a good job of
representing the entire set of original variables. That is why there
is so much focus on wanting your new variable to have a large
variance.

As another example, if instead your outcome was that the 1st new
variable only accounted for 4% of the variance of the original
variables, we could say it does not do a good job of representing the
set of original variables.
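
A minimal sketch of the idea above, assuming NumPy and made-up data (the
hidden "driver", the noise level and the variable names are all
hypothetical, not from this thread). Four X variables that mostly follow
one common driver are turned into four new variables, and the first of
them carries nearly all of the total variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations of 4 X variables that all follow
# one hidden driver plus a little independent noise.
n = 200
driver = rng.normal(size=n)
X = np.column_stack([driver + 0.1 * rng.normal(size=n) for _ in range(4)])

# PCA from the covariance matrix: eigenvectors are the new axes,
# eigenvalues are the variances of the new variables.
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # returned in ascending order
order = np.argsort(eigenvalues)[::-1]             # largest variance first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Scores: the data expressed in the new variables (one column per PC).
scores = (X - X.mean(axis=0)) @ eigenvectors

# Share of the total variance carried by each new variable.
explained = eigenvalues / eigenvalues.sum()
print("share of total variance per new variable:", np.round(explained, 3))
```

There are still as many new variables as original ones; the point is
only that the later ones carry very little of the variance here.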


Now, for a new PC variable that does capture a large percentage of the
original variance, does this mean this variable is good at predicting
some other response variable? The usual answer here is no: the new
variables are chosen only to explain the variance of some set of
variables, and do not take into account how any of the variables relate
to some response variable. In PCA, there are no response or predictor
variables.
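
To make that concrete, here is a small sketch under stated assumptions
(NumPy, simulated data, and a hypothetical response y that I have made
depend only on the low-variance X variable). The first component carries
almost all of the X variance yet is nearly unrelated to y, while the
second component is strongly related to it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical X variables: x1 varies a lot, x2 varies little,
# and they are generated independently of each other.
x1 = rng.normal(scale=10.0, size=n)
x2 = rng.normal(scale=1.0, size=n)
X = np.column_stack([x1, x2])

# Hypothetical response that depends only on the low-variance variable.
y = 3.0 * x2 + rng.normal(scale=0.5, size=n)

# PCA on X alone; y plays no part in choosing the components.
Xc = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
scores = Xc @ eigenvectors

explained = eigenvalues / eigenvalues.sum()
corr_with_y = [np.corrcoef(scores[:, k], y)[0, 1] for k in range(2)]

for k in range(2):
    print(f"PC{k + 1}: share of X variance = {explained[k]:.3f}, "
          f"correlation with y = {corr_with_y[k]:+.3f}")
```

The sign of each correlation is arbitrary, because an eigenvector can
point either way along its axis.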

However, I am aware of another opinion, which I like. If you are using
some set of predictor variables to model some response variable, then
right off the bat, the reason you are doing this in the first place is
because you think this set of predictor variables is reasonable to try
and predict your response variable. If you then find one or two new PC
variables that can represent your original X variables, then it is
hardly any more of a stretch to use them for prediction, than it
already was to use your original X variables.


This could be a moot point though, because other methods do exist that
will create new variables that do explicitly take into account the
covariance between a response variable and some other set of
variables, such as partial least squares. I think there have been some
good posts on this topic in this group already.


Also, at the risk of bringing up more confusion, if your original
variables are in different units, and have greatly varying variances
for each of them, you will want to do your PCA from the correlation
matrix of your original X variables, as opposed to the covariance
matrix of your original X variables. This should be as simple as
checking off some option on whatever software you are using. The
reasoning is that PCs from a covariance matrix will be heavily skewed
towards the original variables with large variances.
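
A rough sketch of that skew, assuming NumPy and three made-up,
independently generated variables (the names and scales below are
hypothetical). From the covariance matrix, the first PC is essentially
the variable with the largest variance; from the correlation matrix, no
single variable dominates:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# Hypothetical variables in different units with very different spreads.
height_mm = rng.normal(loc=1700.0, scale=100.0, size=n)   # variance ~ 10000
weight_kg = rng.normal(loc=70.0, scale=10.0, size=n)      # variance ~ 100
ph = rng.normal(loc=7.0, scale=0.5, size=n)               # variance ~ 0.25
X = np.column_stack([height_mm, weight_kg, ph])

def first_pc(matrix):
    """Return (loadings of PC1, share of variance of PC1) for a symmetric matrix."""
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    top = np.argmax(eigenvalues)
    return eigenvectors[:, top], eigenvalues[top] / eigenvalues.sum()

cov_loadings, cov_share = first_pc(np.cov(X, rowvar=False))
corr_loadings, corr_share = first_pc(np.corrcoef(X, rowvar=False))

print("covariance PCA : PC1 loadings", np.round(cov_loadings, 3),
      "share", round(float(cov_share), 3))
print("correlation PCA: PC1 loadings", np.round(corr_loadings, 3),
      "share", round(float(corr_share), 3))
```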

-CP



Art Kendall...
Posted: Wed Jun 25, 2008 7:15 am
Guest
A common use of the different kinds of factor analysis such as PCA is to
represent a set of X variables in terms of fewer artificial variables or
factors.

The factors are arranged in order of decreasing amount of the total X
variance accounted for. The first artificial variable from the factor
analysis "accounts for" the largest share of the variability of the X set.
Each successive variable accounts for less and less of the variance
of the set, until the cutoff where the interpreter decides that the
amount of the total X variability the factor accounts for is trivial.
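
One way to picture the ordering and the cutoff, as a sketch only: the
code below assumes NumPy, simulated data built from two hidden factors,
and a 90% cumulative-variance rule. The 90% figure is my assumption for
illustration; in practice the cutoff is the interpreter's judgment, as
described above.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300

# Hypothetical data: 6 observed X variables built from 2 hidden factors.
factors = rng.normal(size=(n, 2))
weights = rng.normal(size=(2, 6))
X = factors @ weights + 0.2 * rng.normal(size=(n, 6))

# Variances of the components, arranged from largest to smallest.
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
share = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(share)

# Assumed cutoff: keep the fewest components reaching 90% of total variance.
threshold = 0.90
k = int(np.searchsorted(cumulative, threshold) + 1)

print("share per component:", np.round(share, 3))
print("cumulative share   :", np.round(cumulative, 3))
print("components kept    :", k)
```

Since the data were built from two hidden factors, one would expect at
most two components to be kept, with the rest accounting for a trivial
share.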

How factors from X relate to another variable Y or set of Ys is a
completely different question.

What needs to be distinguished is
"accounting for variance WITHIN A SET of variables" from
"accounting for variance BETWEEN SETS of variables".

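As a sketch of the two kinds of "accounting for variance" distinguished
above, written in standard notation (the symbols are my own labels, not
from the posts): the first quantity involves only the X set, the second
involves a separate Y.

```latex
% Within a set: share of the total X variance carried by component k,
% where \lambda_1 \ge \dots \ge \lambda_p are the eigenvalues of the
% covariance (or correlation) matrix of the p X variables.
\text{share}_k = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j}

% Between sets: share of the variance of a response Y explained by a
% regression on X, where \hat{y}_i are fitted values and \bar{y} is the mean.
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
```

Nothing about Y appears in the first expression, which is why a large
share there says nothing by itself about the second.
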
Art Kendall
Social Research Consultants

Paige Miller...
Posted: Wed Jun 25, 2008 7:45 am
Guest
On Jun 25, 11:19 am, cprice <cpr... at (no spam) gmail.com> wrote:

Quote:
Also, at the risk of bringing up more confusion, if your original
variables are in different units, and have greatly varying variances
for each of them, you will want to do your PCA from the correlation
matrix of your original X variables, as opposed to the covariance
matrix of your original X variables.

I have to disagree with a portion of this. If the variables are in
different units, then I agree it makes sense to use the correlation
matrix. If the variables have greatly varying variances but are in the
same units, I don't believe that a correlation matrix is called for. In fact,
in spectroscopy, you can see greatly varying variances, but the units
are the same and most spectroscopy applications do not use correlation
matrices. Conversely, if you have different units (say X1 is pH on a
scale of 0 to 14 and X2 is RPM of a motor on a scale of say 2000 to
10000), you could have both variables have a variance of 2 in their
different scales, and you would want to use a correlation matrix ...
you can't say that a variance of 2 units of RPM is equivalent to a
variance of 2 units in pH which is what you would be saying by using a
covariance matrix.
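
A small sketch of the units problem, assuming NumPy and simulated pH and
motor-speed data (the numbers and the mild relationship between the two
are made up). Re-expressing the same speed in thousands of RPM changes
the covariance-based result, while the correlation-based result stays
the same:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Hypothetical process data: pH and motor speed, mildly related.
ph = rng.normal(loc=7.0, scale=1.0, size=n)
rpm = 6000.0 + 800.0 * (ph - 7.0) + rng.normal(scale=1000.0, size=n)

def pc1_share(X, use_correlation):
    """Share of total variance carried by the first principal component."""
    m = np.corrcoef(X, rowvar=False) if use_correlation else np.cov(X, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(m)
    return eigenvalues.max() / eigenvalues.sum()

X_rpm = np.column_stack([ph, rpm])              # speed in RPM
X_krpm = np.column_stack([ph, rpm / 1000.0])    # same speed in thousands of RPM

cov_rpm = pc1_share(X_rpm, use_correlation=False)
cov_krpm = pc1_share(X_krpm, use_correlation=False)
corr_rpm = pc1_share(X_rpm, use_correlation=True)
corr_krpm = pc1_share(X_krpm, use_correlation=True)

print(f"covariance PCA : PC1 share {cov_rpm:.4f} (RPM) vs {cov_krpm:.4f} (kRPM)")
print(f"correlation PCA: PC1 share {corr_rpm:.4f} (RPM) vs {corr_krpm:.4f} (kRPM)")
```

The data are identical in both cases; only the arbitrary choice of unit
differs, which is the reason for preferring the correlation matrix when
units differ.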

--
Paige Miller
paige\dot\miller \at\ kodak\dot\com
onyourmark...
Posted: Wed Jun 25, 2008 9:19 pm
Guest
Hi and thanks to all who have responded to my query. My question, I
suppose, with regard to cprice's post above, is why we are interested in
whether one of the original variables or one of the new derived
variables captures 98% of the variance of all of the original
variables or not. I mean, I guess this is a very basic question, but
why are we interested in the variation of the variables in the first
place?
I recall from regression that we talk about the explained versus
unexplained variation. This is, as I recall, what percent of the total
variation of the response variable is explained by the explanatory
variable(s). So this is in the context of predicting the outcome of
the response variable and the higher the explained variation the
better the model (less error in the prediction).
So I am assuming that in PCA, the reason we are interested in how much
of the overall variation is covered by a variable is something to do
with how well it will work for predicting some response variable
outcome. But this does not seem to be right because we are not talking
about how much of the variation in the explained variable (if there is
an explained variable) is explained by our variable or variables
(either the original variables or the derived components). Rather we
are just talking about how much of the overall variation in our entire
set of variables is accounted for by one variable or one component or
a set of variables or a set of components. But I don't understand why
we are interested in this overall variation in the first place?
THANKS
Paige Miller...
Posted: Thu Jun 26, 2008 2:14 am
Guest
Imagine you have 3 variables, and all the data lies on a plane in
three-dimensional space (except for a very small amount of random noise which
might push the data off the plane). This means that a 2 dimensional
representation of your data explains almost all of the variability,
and the orientation of that plane relative to the original three axes
is of interest. The analogy to n original variables is
straightforward. We are interested in variability because that's where
things happen in our data. The dimensions that show 0.01% of the total
variability are not interesting because virtually nothing is happening
there.

If we have n original variables, and we find that k<n PCA dimensions
explain almost all of the variability, then we can explain (and
perhaps understand) what is going on in our data better. We can
represent our data in fewer dimensions, which might give us better
pictures or insights into our data.
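
The plane example can be sketched as follows, assuming NumPy and a
made-up plane (the vectors u and v and the noise level are hypothetical).
Two components carry almost all of the variability, and the direction
with almost none is the normal to the plane, which gives the plane's
orientation relative to the original three axes:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500

# Hypothetical plane through the origin spanned by two direction vectors.
u = np.array([1.0, 0.0, 1.0])
v = np.array([0.0, 1.0, 1.0])
normal = np.cross(u, v)
normal /= np.linalg.norm(normal)

# Points on the plane plus a very small amount of noise off the plane.
coords = rng.normal(scale=3.0, size=(n, 2))
X = coords[:, [0]] * u + coords[:, [1]] * v + 0.01 * rng.normal(size=(n, 3))

# eigh returns eigenvalues in ascending order, so column 0 of the
# eigenvectors is the direction of least variance.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
share = eigenvalues[::-1] / eigenvalues.sum()   # largest first

smallest_direction = eigenvectors[:, 0]
alignment = abs(smallest_direction @ normal)    # 1.0 means same axis

print("share of variability per component:", np.round(share, 6))
print("alignment of least-variance direction with plane normal:",
      round(float(alignment), 6))
```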

--
Paige Miller
paige\dot\miller \at\ kodak\dot\com
...
Posted: Thu Jun 26, 2008 1:47 pm
Guest
onyourmark <william108 at (no spam) gmail.com> wrote:
Quote:

Hi and thanks to all you have responded to my query. My question is, I
suppose, say in regard to the above post, why are we interested in
whether one of the original variables or one of the new derived
variables might capture 98% of the variance of all of the original
variables or not. I mean, I guess this is a very basic question, but
why are we interested in the variation of the variables in the first
place.

Maybe you aren't interested in that. In which case, you probably shouldn't
do a PCA analysis. It's a tool for a job. If you have no need for that
job, you have no need for that tool.

Xho

onyourmark...
Posted: Thu Jun 26, 2008 8:13 pm
Guest
Hi and thanks to all again. I am interested in PCA. May I ask, (sorry
for being obtuse), when you say "that is where all the action occurs"
you are saying that all the variability is in those two dimensions and
that only 0.01 percent of all the variability lies in the z axis, but
aren't you concerned with predicting a fourth variable (a response
variable, say Y)? Otherwise I don't understand what you mean by that
is where the action occurs. I mean, I understand that most of the
variation occurs in those two dimensions but why is the variation
important?
I can see that, for example, if data is constant with respect to a
certain variable, say X1, so that for every case/individual X1 has the
same value, say X1=5, across all observations, then X1 will be useless
in predicting Y (or as it is sometimes said "variation in Y") because
X1 is 5 no matter what value Y is (if you tell me that for this
individual/observation/case X1 is 5, that is not going to help me to
predict Y at all). And by extension if X1 is not constant but has
almost no variation then it will be almost useless in predicting the
variation in Y.
So is this why we are interested in the variation of the variables?
Because they are input variables? Or is there some other reason?
Thanks again.
sigbert...
Posted: Thu Jun 26, 2008 9:51 pm
Guest
Hi,

Quote:
Hi and thanks to all again. I am interested in PCA. May I ask, (sorry
for being obtuse), when you say "that is where all the action occurs"
you are saying that all the variability is in those two dimensions and
that only 0.01 percent of all the variability lies in the z axis, but
aren't you concerned with predicting a fourth variable (a response
variable, say Y)? Otherwise I don't understand what you mean by that
is where the action occurs. I mean, I understand that most of the
variation occurs in those two dimensions but why is the variation
important?

You have to separate two things: the structure in the data that you are
interested in, and the variation of your data. It is clear that no
variation in a variable or component means no interesting structure.
However, interesting structure in your data does not necessarily mean
a high variation in a variable or component. But in a lot of data it
turns out to be exactly like this: interesting structure = high
variation in a component.
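
A sketch of the less common case, assuming NumPy and simulated data (the
two groups and all scales are made up): the interesting structure, a
split into two groups, sits in the component with the small variance,
while the high-variance component shows no group structure at all.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200

# Hypothetical two groups that differ only along the second variable,
# while the first variable has far more spread but no group structure.
group = np.repeat([0, 1], n // 2)
x1 = rng.normal(scale=10.0, size=n)
x2 = np.where(group == 0, -1.0, 1.0) + rng.normal(scale=0.2, size=n)
X = np.column_stack([x1, x2])

Xc = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigenvalues)[::-1]           # largest variance first
scores = Xc @ eigenvectors[:, order]
share = eigenvalues[order] / eigenvalues.sum()

def group_gap(column):
    """Distance between group means in units of the pooled within-group spread."""
    a, b = column[group == 0], column[group == 1]
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled

gaps = [group_gap(scores[:, k]) for k in range(2)]
for k in range(2):
    print(f"PC{k + 1}: share of variance = {share[k]:.3f}, "
          f"group separation = {gaps[k]:.2f}")
```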

Hope that helps,
Sigbert
Paige Miller...
Posted: Fri Jun 27, 2008 2:28 am
Guest
There is no Y variable in PCA. There is no concept of predicting a
dependent variable in PCA.

Let me ask you a question. Consider a data example where you have only
one variable; let's get creative and call that variable X.

Do you ever compute the variance of X?

If so, why? Because it might tell us something useful about X? Or do
you do that because the statistics textbooks tell you to do that?

Do you ever compute the mean of X? If so, why? Because it might tell
us something useful about X? Or do you do that because the statistics
textbooks tell you to do that?

--
Paige Miller
paige\dot\miller \at\ kodak\dot\com
Art Kendall...
Posted: Mon Jun 30, 2008 7:52 am
Guest
The various forms of factor analysis can be used for "data reduction",
sometimes for finding a few latent constructs underlying more numerous
very specific measures. In many fields the interpretability (construing
meaning) is important in deciding on the number of factors to retain.
The number of predictors (independent variables) makes a big difference
in how many cases (data rows) are needed to do correlations,
regressions, etc.

This is oversimplified. Suppose your X variables are a set of test
questions on spelling, a set on addition and subtraction, etc. The
various forms of factor analysis are often used to double-check items
that "go together", so that instead of having 20 variables, each of
which measures the spelling of particular words, you use a summarization of
them to represent general spelling achievement. You might then use the
measure of general spelling achievement as a predictor of job success.
On the one hand you are finding variables that "are pretty much
measuring the same thing"; on the other hand you are interested in
finding out whether a construct based on what is common to that set
is related to a separate construct.
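
A rough sketch of the spelling example, assuming NumPy and simulated
data. The hidden "ability", the 20 item scores and the job success
measure are all made up, and I use the first principal component of the
standardised items as the summary, which is only one of the possible
forms of summarization mentioned above.

```python
import numpy as np

rng = np.random.default_rng(7)
n_people, n_items = 300, 20

# Hypothetical data: one hidden "spelling ability" drives all 20 item scores,
# and also partly drives a made-up job success measure.
ability = rng.normal(size=n_people)
items = ability[:, None] + rng.normal(scale=1.0, size=(n_people, n_items))
job_success = 0.6 * ability + rng.normal(scale=0.8, size=n_people)

# First principal component of the standardised items as the summary score.
Z = (items - items.mean(axis=0)) / items.std(axis=0, ddof=1)
eigenvalues, eigenvectors = np.linalg.eigh(np.corrcoef(items, rowvar=False))
pc1 = eigenvectors[:, -1]          # eigh sorts ascending, so the last is largest
pc1_share = eigenvalues[-1] / eigenvalues.sum()
spelling_score = Z @ pc1

# Relate the single summary score to the separate construct.
r_summary = abs(np.corrcoef(spelling_score, job_success)[0, 1])
r_single_item = abs(np.corrcoef(items[:, 0], job_success)[0, 1])

print(f"PC1 share of item variance        : {pc1_share:.3f}")
print(f"|corr(summary score, job success)| : {r_summary:.3f}")
print(f"|corr(single item, job success)|   : {r_single_item:.3f}")
```

The first step looks only within the item set; the second step relates
the resulting summary to a separate variable.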

Another example of grouping sets of more particular measures of a
construct in order to create a stronger summative measure of that
construct is in attitude measurement. M. Lorr et al. took a set of
questions thought to measure liberalism-conservatism. They found out
that 3 factors could represent the common variance of several dozens of
specific questions.
Then in relating liberalism-conservatism to voting or candidate
preference etc, they did not have dozens of predictors, they could use
3 which represented the three underlying factors of general
liberalism-conservatism, egalitarianism, and favoring sexual freedom.
Much subsequent research has found that these 3 factors have different
relations to different kinds of social issues.

If a few factors can meaningfully summarize the variance of a larger set
of measures, a researcher can do her/his theorizing, modeling, and
analysis based on those more abstract constructs. Those constructs can
then be used to relate to other constructs.

Art Kendall
Social Research Consultants
