Correlation matrix with missing data
Guest
Posted: Thu Dec 07, 2006 12:11 pm
Hello. I have a large number of completed surveys, each containing
satisfaction questions answered on a scale of 1 to 10. I want to find
out how each question relates to every other question so that I can
group similar questions together. To do this I compute Pearson's
correlation coefficients between the questions and use their absolute
values. This works fine, but there is a small problem with missing
data. Here's an example: say there are 5 questions (q1-q5) and 6
respondents (a-f). Respondents a-c are male and d-f are female. Males
answer question 4 and females answer question 5, so these two
questions are never answered together. The satisfaction levels for
each question are:

     q1  q2  q3  q4  q5
a     4   8   3   8   -
b     7   6   6   6   -
c     3   4   3   5   -
d     2   5   1   -   5
e     6   3   6   -   2
f     8   6   7   -   6

Note that this is all the data I have to work with; I don't know
anything about the actual questions. Using pairwise deletion of
missing data I get the following correlation matrix:

         q1      q2      q3      q4      q5
q1    1.000   0.097   0.976   0.052   0.052
q2    0.097   1.000  -0.081   0.982   0.996
q3    0.976  -0.081   1.000  -0.189  -0.125
q4    0.052   0.982  -0.189   1.000     -
q5    0.052   0.996  -0.125     -     1.000
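
For anyone who wants to reproduce the matrix above, here is a minimal sketch of pairwise deletion in plain Python. The names `data`, `pearson_pairwise` and `corr` are made up for this illustration, and `None` is assumed to stand for a question the respondent was not asked. Each coefficient uses only the respondents who answered both questions, so q4 and q5, which share no respondents, come out undefined.

```python
from math import sqrt

# Toy data from the post; None marks a question the respondent was not asked.
data = {
    "q1": [4, 7, 3, 2, 6, 8],
    "q2": [8, 6, 4, 5, 3, 6],
    "q3": [3, 6, 3, 1, 6, 7],
    "q4": [8, 6, 5, None, None, None],
    "q5": [None, None, None, 5, 2, 6],
}

def pearson_pairwise(x, y):
    """Pearson r over respondents who answered both questions, else None."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 2:
        return None  # no overlapping respondents: r is undefined
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    return sxy / sqrt(sxx * syy)

names = list(data)
corr = {(i, j): pearson_pairwise(data[i], data[j]) for i in names for j in names}

# Print the matrix, showing "-" where no respondent answered both questions.
for i in names:
    row = ["   -  " if corr[i, j] is None else f"{corr[i, j]:6.3f}" for j in names]
    print(i, " ".join(row))
```

Note that the q4 and q5 columns each rest on only three respondents in this toy example, so those coefficients are very unstable.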

This suggests a strong relationship between questions 1 and 3,
questions 2 and 4, and questions 2 and 5. Let's write the correlation
between questions qa and qb as r(qa,qb). What I am wondering is what
can be said about r(q4,q5). It seems likely that these two questions
are similar in nature, since both correlate so well with question 2,
so I would probably want them grouped together. Would it make sense to
simply estimate r(q4,q5) by r(q4,q2)*r(q2,q5) = 0.982*0.996 = 0.978?
Or is there a better way?
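
One way to see what the product estimate does and does not tell you: if all three questions could be observed on one common population, the 3x3 correlation matrix of q4, q2 and q5 would have to be positive semidefinite. That restricts r(q4,q5) to the product r(q4,q2)*r(q2,q5) plus or minus sqrt((1 - r(q4,q2)^2) * (1 - r(q2,q5)^2)). The product itself is the value you get if q4 and q5 are unrelated once q2 is held fixed (zero partial correlation). The sketch below only evaluates that range; the function name `r_bounds` is made up here, and the common-population assumption is a strong one, since the two input correlations come from different groups of respondents.

```python
from math import sqrt

def r_bounds(r_ab, r_bc):
    """Range of r_ac that keeps the 3x3 correlation matrix positive semidefinite.

    Assumes (hypothetically) that r_ab and r_bc describe the same population.
    """
    centre = r_ab * r_bc                             # the product estimate
    slack = sqrt((1 - r_ab ** 2) * (1 - r_bc ** 2))  # half-width of the range
    return centre - slack, centre + slack

r42, r25 = 0.982, 0.996  # values taken from the matrix in the post
low, high = r_bounds(r42, r25)
print(f"product estimate: {r42 * r25:.3f}")
print(f"admissible range: {low:.3f} to {high:.3f}")
```

With the values from the post the range is roughly 0.961 to 0.995. That range treats the two input correlations as exact, which they are not when each rests on three respondents.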

Thanks,

Daniel
Ray Koopman
Posted: Fri Dec 08, 2006 3:55 am
Guest
Correlation is a property of paired values, but q4 and q5 are unpaired,
both in principle and in practice. I don't know what r(q4,q5) means.

Richard Ulrich
Posted: Sat Dec 09, 2006 1:12 am
Guest

It looks to me as if q4 and q5 were intended as sex-specific versions
of the same question, which conceivably could be rephrased as
something neutral like "satisfaction with spouse".

I would ask the person who owns the data whether the two sets of
answers were expected to be merged as if they had been one question.
To further justify merging them, you could ask whether there is
internal evidence that the two means are different, given the other
data. For instance, since the q2 correlation is so high: do the two
sets of data (M, F) show the same regression line, in both slope and
intercept?
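
As a rough illustration of that check, the sketch below fits an ordinary least-squares line within each group: q4 on q2 for the males and q5 on q2 for the females. Regressing on q2 is an assumption about which direction is meant, and `fit_line` is a name made up for this example. It uses the six toy respondents from the original post, so it only shows the mechanics; with three points per group no real comparison of slopes or intercepts is possible.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Males (a-c): q4 regressed on q2
slope_m, intercept_m = fit_line([8, 6, 4], [8, 6, 5])
# Females (d-f): q5 regressed on q2
slope_f, intercept_f = fit_line([5, 3, 6], [5, 2, 6])

print(f"males:   q4 = {intercept_m:.3f} + {slope_m:.3f} * q2")
print(f"females: q5 = {intercept_f:.3f} + {slope_f:.3f} * q2")
```

On the real survey, with many respondents per group, the same comparison could be made formally by pooling the two groups and testing a sex term and a sex-by-q2 interaction in one regression.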

--
Rich Ulrich, wpilib@pitt.edu
http://www.pitt.edu/~wpilib/index.html
 