Science Forum Index » Statistics - Math Forum » Correct measure of dispersion...
Page 1 of 1
|
| Author |
Message |
| mishery... |
Posted: Wed May 07, 2008 4:16 am |
|
|
|
Guest
|
I have some linguistic data for which I need to get a measure of
dispersion. The data relates to various aspects of the meanings of 517
words.
Each word has a different number of data points, the average is 12 and
the range 5-21.
The data points themselves are counts ranging from 1-130 (51 separate
values). Two thirds of the data points in the whole set = 1.
For each word I want a measure of the dispersion of its data. Can I
use a standard skew measure for these data? Or would variance-to-mean
ratio be better?
Thank you
- Mike |
|
|
| Back to top |
|
| Ray Koopman... |
Posted: Wed May 07, 2008 8:23 am |
|
|
|
Guest
|
On May 7, 7:16 am, mishery <mf... at (no spam) csl.psychol.cam.ac.uk> wrote:
Quote: I have some linguistic data for which I need to get a measure of
dispersion. The data relates to various aspects of the meanings of 517
words.
Each word has a different number of data points, the average is 12 and
the range 5-21.
The data points themselves are counts ranging from 1-130 (51 separate
values). Two thirds of the data points in the whole set = 1.
For each word I want a measure of the dispersion of its data. Can I
use a standard skew measure for these data? Or would variance-to-mean
ratio be better?
Thank you
- Mike
Simpson's reciprocal index of dispersion for unordered categorical
data is (sum n)^2/(sum n^2), where the n's are the counts in the
various categories. It gives a number between 1 and the number of
nonzero counts. |
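As a minimal sketch of the formula above (the function name and the sample counts are illustrative assumptions, not data from this thread), the index can be computed directly from a list of category counts:

```python
def simpson_reciprocal(counts):
    """Simpson's reciprocal index: (sum n)^2 / (sum n^2)."""
    counts = [n for n in counts if n > 0]  # zero counts do not contribute
    if not counts:
        raise ValueError("need at least one nonzero count")
    total = sum(counts)
    return total * total / sum(n * n for n in counts)


# Hypothetical counts: four equally common categories vs. one dominant category.
even = simpson_reciprocal([5, 5, 5, 5])
lopsided = simpson_reciprocal([97, 1, 1, 1])
print(even, lopsided)
```

The evenly spread case reaches the upper bound (the number of nonzero counts, here 4), while the lopsided case stays close to the lower bound of 1.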
| mishery... |
Posted: Thu May 08, 2008 12:21 am |
|
|
|
Guest
|
Thank you!
But I am not 100% sure that this will be right, I don't think I
described my data accurately.
The data are aspects of meanings of words given by subjects. Subjects
were asked to give features of words, so for the word "eagle" they
might give "flies", "has wings" and so on. The datum for each feature
of a word is the total number of words for which that feature was
given. If this number = 1, the feature was given only for that
particular word. So for "eagle" you might have...
Eagle
has_wings 25
has_beak 10
lives_in_mountains 1
symbol_of_US 1
....
....
That is, the feature "has_wings" was given for 24 other words whereas
"lives_in_mountains" was only given for this particular word.
Obviously some words will have more features given than others, and
there can be no zero values.
What we are interested in is the degree to which different words
elicited features that are highly shared or distinctive. So we need
some kind of measure of the dispersion of the feature counts. On
balance, how shared or distinctive are the features of each word?
Is Simpson's reciprocal index still the right measure?
Thank you |
| mishery... |
Posted: Thu May 08, 2008 12:32 am |
|
|
|
Guest
|
What I am worried about is this: if I have two words with ten features
each, and in one case the counts are all 1 (all features unique to that
word) and in the other they are all 10 (all relatively highly shared),
Simpson's reciprocal index will give the same value for both, won't it? |
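The concern can be checked numerically. The sketch below (function name assumed for illustration) applies (sum n)^2/(sum n^2) to the two cases described: ten features all with count 1, and ten features all with count 10.

```python
def simpson_reciprocal(counts):
    """Simpson's reciprocal index: (sum n)^2 / (sum n^2)."""
    total = sum(counts)
    return total * total / sum(n * n for n in counts)


all_unique = simpson_reciprocal([1] * 10)   # every feature given for this word only
all_shared = simpson_reciprocal([10] * 10)  # every feature given for ten words
print(all_unique, all_shared)
```

Both calls return 10.0: multiplying every count by the same constant leaves the index unchanged, so it reflects only how evenly the counts are spread, not how large they are.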
| Ray Koopman... |
Posted: Thu May 08, 2008 1:17 pm |
|
|
|
Guest
|
On May 8, 3:21 am, mishery <mf... at (no spam) csl.psychol.cam.ac.uk> wrote:
Quote: Thank you!
But I am not 100% sure that this will be right, I don't think I
described my data accurately.
The data are aspects of meanings of words given by subjects. Subjects
were asked to give features of words, so for the word "eagle" they
might give "flies", "has wings" and so on. The data for each word is
the total number of words for which this feature was given. If this
number = 1, then it was only given for that particular word. So for
"eagle" you might have...
Eagle
has_wings 25
has_beak 10
lives_in_mountains 1
symbol_of_US 1
...
...
That is, the feature "has_wings" was given for 24 other words whereas
"lives_in_mountains" was only given for this particular word.
Obviously some words will have more information given, more features
than others and there can be no zero values.
What we are interested in is the degree to which different words
elicited features that are highly shared or distinctive. So we need
some kind of measure of the dispersion of the count data feature
numbers. On balance, how shared or distinctive are the features of
each word?
Is Simpson's reciprocal index still the right measure?
Thank you
You're right, you don't want Simpson's reciprocal index. But how about
the mean of the reciprocals of the counts? If "has_wings" was elicited
by 25 words, one of which was "eagle", then you might say that "eagle"
owns 1/25 of "has_wings"; similarly, it owns 1/10 of "has_beak" and
all of "lives_in_mountains" and "symbol_of_US". Averaging those four
shares gives (1/25 + 1/10 + 1 + 1)/4 = .535, which would be a measure
of distinctiveness rather than dispersion.
I think that's a reasonable first approximation, but it omits the
notion of the centrality of the feature to the meaning of the stimulus
object. If you can manage to get centrality measures on a ratio scale,
then a better measure of distinctiveness would be (sum c/n)/(sum c),
where the c's are the centrality measures and the n's are the counts. |
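The mean-of-reciprocals suggestion can be sketched as follows, using only the four "eagle" features actually listed in the thread (the elided "..." rows are not guessed at; the function name is an assumption for illustration):

```python
# Counts for the four listed "eagle" features: number of words eliciting each one.
eagle = {
    "has_wings": 25,
    "has_beak": 10,
    "lives_in_mountains": 1,
    "symbol_of_US": 1,
}


def mean_reciprocal(counts):
    """Average share of each feature 'owned' by the word: mean of 1/n."""
    counts = list(counts)
    return sum(1 / n for n in counts) / len(counts)


eagle_score = mean_reciprocal(eagle.values())
unique_score = mean_reciprocal([1] * 10)   # ten features, all unique to the word
shared_score = mean_reciprocal([10] * 10)  # ten features, each shared by ten words
print(round(eagle_score, 3), unique_score, round(shared_score, 3))
```

Unlike Simpson's reciprocal index, this measure separates the two ten-feature cases raised earlier: all-unique features score 1, while features each shared by ten words score about 0.1.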
| Ray Koopman... |
Posted: Thu May 08, 2008 8:58 pm |
|
|
|
Guest
|
On May 8, 4:17 pm, Ray Koopman <koop... at (no spam) sfu.ca> wrote:
Quote: On May 8, 3:21 am, mishery <mf... at (no spam) csl.psychol.cam.ac.uk> wrote:
Thank you!
But I am not 100% sure that this will be right, I don't think I
described my data accurately.
The data are aspects of meanings of words given by subjects.
Subjects were asked to give features of words, so for the word
"eagle" they might give "flies", "has wings" and so on. The data
for each word is the total number of words for which this feature
was given. If this number = 1, then it was only given for that
particular word. So for "eagle" you might have...
Eagle
has_wings 25
has_beak 10
lives_in_mountains 1
symbol_of_US 1
...
...
That is, the feature "has_wings" was given for 24 other words
whereas "lives_in_mountains" was only given for this particular
word.
Obviously some words will have more information given, more
features than others and there can be no zero values.
What we are interested in is the degree to which different words
elicited features that are highly shared or distinctive. So we
need some kind of measure of the dispersion of the count data
feature numbers. On balance, how shared or distinctive are
the features of each word?
Is Simpson's reciprocal index still the right measure?
Thank you
You're right, you don't want Simpson's reciprocal index. But how
about the mean of the reciprocals of the counts? If "has_wings"
was elicited by 25 words, one of which was "eagle", then you might
say that "eagle" owns 1/25 of "has_wings"; similarly, it owns 1/10
of "has_beak" and all of "lives_in_mountains" and "symbol_of_US".
Averaging those shares gives .535, which would be a
measure of distinctiveness rather than dispersion.
I think that's a reasonable first approximation, but it omits
the notion of the centrality of the feature to the meaning of the
stimulus object. If you can manage to get centrality measures on
a ratio scale, then a better measure of distinctiveness would be
(sum c/n)/(sum c), where the c's are the centrality measures and
the n's are the counts.
It occurred to me that the proportion of subjects who responded
"has_wings" to "eagle" would be a measure of the centrality of
"has_wings" for "eagle", etc. Note that this makes no assumptions
regarding how many responses the subject gave to eagle. (A fancier
version might incorporate the position of the response in the
sequence of responses, with earlier responses presumably being
more central.) |
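A rough sketch of the weighted measure (sum c/n)/(sum c), taking c to be the proportion of subjects who gave each feature for "eagle". The subject counts and the panel size of 20 are hypothetical numbers invented for illustration; only the word counts n come from the thread.

```python
N_SUBJECTS = 20  # hypothetical number of subjects who saw "eagle"

# (feature, n = words eliciting the feature, hypothetical subjects giving it for "eagle")
eagle = [
    ("has_wings", 25, 18),
    ("has_beak", 10, 12),
    ("lives_in_mountains", 1, 4),
    ("symbol_of_US", 1, 2),
]


def weighted_distinctiveness(rows, n_subjects):
    """(sum c/n) / (sum c), with c = proportion of subjects giving the feature."""
    numerator = 0.0
    denominator = 0.0
    for _feature, n_words, n_responders in rows:
        c = n_responders / n_subjects
        numerator += c / n_words
        denominator += c
    return numerator / denominator


score = weighted_distinctiveness(eagle, N_SUBJECTS)
print(round(score, 3))
```

With these made-up proportions the score comes out near 0.22, lower than the unweighted .535, because the widely shared features are also the ones most subjects mention. If every feature had the same centrality, the formula would reduce to the plain mean of the reciprocals.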
| Ray Koopman... |
Posted: Fri May 09, 2008 7:49 am |
|
|
|
Guest
|
On May 8, 3:21 am, mishery <mf... at (no spam) csl.psychol.cam.ac.uk> wrote:
Quote: Thank you!
But I am not 100% sure that this will be right, I don't think I
described my data accurately.
The data are aspects of meanings of words given by subjects.
Subjects were asked to give features of words, so for the word
"eagle" they might give "flies", "has wings" and so on. The data
for each word is the total number of words for which this feature
was given. If this number = 1, then it was only given for that
particular word. So for "eagle" you might have...
Eagle
has_wings 25
has_beak 10
lives_in_mountains 1
symbol_of_US 1
...
...
That is, the feature "has_wings" was given for 24 other words
whereas "lives_in_mountains" was only given for this particular
word.
Obviously some words will have more information given, more
features than others and there can be no zero values.
What we are interested in is the degree to which different words
elicited features that are highly shared or distinctive. So we
need some kind of measure of the dispersion of the count data
feature numbers. On balance, how shared or distinctive are
the features of each word?
Is Simpson's reciprocal index still the right measure?
Thank you
My two posts yesterday addressed the specific question you asked.
Here is a different approach to what I perceive to be the general
problem.
Consider a matrix F in which each row i corresponds to a different
stimulus word, each column k corresponds to a different response
feature, and each cell f_ik contains the number of times that word i
elicited feature k. F will (probably) be mostly zeros.
Let f_i = sum_k f_ik, and let p_ik = f_ik / f_i. Then each row of P
contains the frequency distribution of features for word i, and d_i
= 1/(sum_k p_ik^2) = Simpson's reciprocal index for the dispersion
of the features of word i.
Let q_ik = sqrt(p_ik). Then R = QQ', where ' denotes a matrix
transpose, is the matrix of pairwise correlations of the words in
terms of their feature distributions. (The correlations r_ij are
related to the Hellinger distances among the words.)
Let C = R^1. Then 1/c_ii = 1 - the squared multiple correlation of
word i with all the other words = a measure of the distinctiveness
of word i.
You can also component-analyze R, to look for patterns among the
correlations. |
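The first steps of this construction (F, P, d_i, Q and R = QQ') might look like the sketch below. The 3-word by 4-feature matrix F is a hypothetical toy example, and plain Python lists are used so that nothing beyond the standard library is assumed.

```python
from math import sqrt

# Hypothetical count matrix F: rows are stimulus words, columns are features,
# and f_ik is the number of times word i elicited feature k.
F = [
    [4, 2, 0, 0],  # word 0
    [3, 3, 0, 0],  # word 1 (shares its features with word 0)
    [0, 0, 5, 1],  # word 2 (shares no features with the others)
]

# p_ik = f_ik / f_i: each row becomes a relative frequency distribution.
P = [[f / sum(row) for f in row] for row in F]

# d_i = 1 / sum_k p_ik^2: Simpson's reciprocal index for each word.
d = [1 / sum(p * p for p in row) for row in P]

# q_ik = sqrt(p_ik) and r_ij = sum_k q_ik * q_jk, i.e. R = QQ'.
Q = [[sqrt(p) for p in row] for row in P]
R = [[sum(a * b for a, b in zip(qi, qj)) for qj in Q] for qi in Q]

for row in R:
    print([round(r, 3) for r in row])
```

Each diagonal entry of R is 1 because the p_ik in a row sum to 1, and words with no features in common (here words 0 and 2) get r_ij = 0.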
| Ray Koopman... |
Posted: Mon May 12, 2008 6:33 am |
|
|
|
Guest
|
On May 12, 5:27 am, mishery <mf... at (no spam) csl.psychol.cam.ac.uk> wrote:
Quote: Consider a matrix F in which each row i corresponds to a different
stimulus word, each column k corresponds to a different response
feature, and each cell f_ik contains the number of times that word i
elicited feature k. F will (probably) be mostly zeros.
Let f_i = sum_k f_ik, and let p_ik = f_ik / f_i. Then each row of P
contains the frequency distribution of features for word i, and d_i
= 1/(sum_k p_ik^2) = Simpson's reciprocal index for the dispersion
of the features of word i.
Let q_ik = sqrt(p_ik). Then R = QQ', where ' denotes a matrix
transpose, is the matrix of pairwise correlations of the words in
terms of their feature distributions. (The correlations r_ij are
related to the Hellinger distances among the words.)
Quote (mishery): I was with you up to here. The bit below confused me.
Quote (Ray Koopman, earlier): Let C = R^1.
Reply: Sorry, that was a typo. It should have been C = R^-1,
the matrix inverse of R.
Quote (mishery): Not sure about this term. C = the correlational matrix R?
Quote (Ray Koopman, earlier): Then 1/c_ii = 1 - the squared multiple
correlation of word i with all the other words = a measure of the
distinctiveness of word i.
Quote (mishery): This seems like a good measure but I can't unpack
1/c_ii = 1 - the squared multiple correlation of word i with all the other words
Reply: 1/c_ii = the reciprocal of the i'th diagonal element
of the matrix inverse of R.
Quote (mishery): The inverse of cells of the correlation matrix R?
Thanks |
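To make the corrected definition concrete, here is a small sketch that inverts a correlation matrix and takes the reciprocals of the diagonal of the inverse. The 3x3 matrix R and the helper `invert` are hypothetical illustrations; the helper is an ordinary Gauss-Jordan elimination written out so the example has no dependencies.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [
        [float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Choose the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (or nearly so)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Hypothetical R: words 0 and 1 correlate 0.8; word 2 is unrelated to both.
R = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

C = invert(R)  # C = R^-1
distinctiveness = [1 / C[i][i] for i in range(len(R))]
print([round(x, 3) for x in distinctiveness])
```

Words 0 and 1 each get 1/c_ii = 1 - 0.8^2 = 0.36, and the unrelated word 2 gets 1. R can only be inverted if no word's row of Q is a linear combination of the other rows, which in particular requires at least as many features as words.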
| mishery... |
Posted: Tue May 13, 2008 5:13 am |
|
|
|
Guest
|
Quote:
Let q_ik = sqrt(p_ik). Then R = QQ', where ' denotes a matrix
transpose, is the matrix of pairwise correlations of the words in
terms of their feature distributions. (The correlations r_ij are
related to the Hellinger distances among the words.)
Actually I am a bit confused by this. I must have missed something
obvious.
R = the matrix of the q_ik values multiplied by its transpose?
I am not sure my matrix algebra is up to this. I can do a regression
using the matrix calculations but it is a bit robotic; e.g., I don't
really understand why inverting the X'X matrix in the regression
calculation has the effect it has. |
| Ray Koopman... |
Posted: Tue May 13, 2008 6:06 am |
|
|
|
Guest
|
On May 13, 8:13 am, mishery <mf... at (no spam) csl.psychol.cam.ac.uk> wrote:
Quote: Let q_ik = sqrt(p_ik). Then R = QQ', where ' denotes a matrix
transpose, is the matrix of pairwise correlations of the words in
terms of their feature distributions. (The correlations r_ij are
related to the Hellinger distances among the words.)
Actually I am a bit confused by this. I must have missed something
obvious.
R = the matrix of the q_ik values multiplied by its transpose?
Yes, r_ij = sum_k q_ik*q_jk.
Quote:
I am not sure my matrix algebra is up to this. I can do a regression
using the matrix calculations but it is a bit robotic; e.g., I don't
really understand why inverting the X'X matrix in the regression
calculation has the effect it has.
I agree, the interpretation of 1/c_ii as 1 - the squared multiple
correlation of word i with all the other words is not at all obvious. |
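One way to see where that interpretation comes from, sketched under the assumption that R is invertible: reorder the words so that word i comes first and split R into blocks. The symbols r (the correlations of word i with the other words), R_22 (the correlations among the other words) and rho_i (the multiple correlation) are introduced here for illustration and do not appear in the thread.

```latex
R = \begin{pmatrix} 1 & r^{\top} \\ r & R_{22} \end{pmatrix},
\qquad
c_{ii} = \left( 1 - r^{\top} R_{22}^{-1} r \right)^{-1}
\quad \text{(block-inverse formula)},
\qquad
\rho_i^{2} = r^{\top} R_{22}^{-1} r
\;\Longrightarrow\;
\frac{1}{c_{ii}} = 1 - \rho_i^{2}.
```

Here R_22^{-1} r is the vector of standardized regression weights for predicting word i from the other words, and r' R_22^{-1} r is the proportion of word i accounted for by that regression, so 1/c_ii is the proportion left over.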
All times are GMT - 5 Hours