 |
|
| Science Forum Index » Space - Consult Forum » Simple probability... |
| w.ccarleton... |
Posted: Mon Oct 12, 2009 6:52 am |
|
|
|
Guest
|
Hi All,
I have a simple (I'm sure it's simple for those comfortable with
probability, I mean) probability question that I'm having trouble
with. I have some frequency information about a sample of houses with
certain architectural features in them (features A and B). If 52% of
the houses in the sample have feature A, and 80% of the houses have
feature B, what is the probability that a house has both features? I
really don't want to go back to the original data and count these for
myself so I'm satisfied with a probability, but I can't figure out how
to combine these (although I recall doing just as poorly on questions
like this in high school!). Thanks in advance,
Chris |
|
| Ray Koopman... |
Posted: Mon Oct 12, 2009 9:15 am |
|
|
|
Guest
|
Somewhere between 32% and 52% of the houses have both A and B. Any
value within those limits is possible. Try plugging some numbers
into the 2 x 2 table. The only constraints are that they must not
be negative and must give the marginal totals that you specified.
        B   ~B   all              B   ~B   all
  A    32   20    52        A    52    0    52
 ~A    48    0    48       ~A    28   20    48
all    80   20   100      all    80   20   100 |
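The two tables are the extreme cases of a general rule: given only the margins, max(0, pA + pB - 1) <= pAB <= min(pA, pB). A minimal Python sketch of that rule, with a function name (`joint_bounds`) made up for illustration:

```python
def joint_bounds(p_a, p_b):
    """Smallest and largest possible P(A and B) given only the two margins."""
    lower = max(0.0, p_a + p_b - 1.0)  # overlap forced once the margins sum past 1
    upper = min(p_a, p_b)              # the joint cannot exceed either margin
    return lower, upper


low, high = joint_bounds(0.52, 0.80)
print(f"{low:.2f} <= P(A and B) <= {high:.2f}")
```

With the 52% and 80% from the question this gives the same 32% to 52% range as the tables.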
|
| John Uebersax... |
Posted: Tue Oct 13, 2009 7:44 am |
|
|
|
Guest
|
Hi Chris,
A first response might of course be, "why not just go back and count
the houses with both features" -- because, as Ray points out,
otherwise there is a wide range of possible values.
But suppose the list is too long, or one no longer has the original
data. What can be done then?
One possibility is to begin with an educated guess about how
associated the two features are. For example, can one answer this
question: given that a house has feature A, what is the probability
that it also has feature B?
This is a reasonable question, and it could be answered in terms of
(1) a point estimate, (2) an upper and lower limit, or (3) a
probability distribution. Alternative (3) takes us into applied
Bayesian statistics, which is probably too complex a subject to pursue
here, so let's look at (1) and (2).
Let:
P(A) = probability a house has feature A (we know this)
P(B) = probability a house has feature B (we know this)
P(B|A) = probability that a house has feature B given that it has
feature A (we guess at this)
P(A,B) = probability that a house has both features A and B (to infer)
From basic probability theory we know:
P(A,B) = P(A) P(B|A) [1]
Thus, knowing P(A), and guessing at P(B|A), one arrives at an estimate
of P(A,B).
So, for example, we know that P(A) is .52. Suppose one estimates that
90% of houses with feature A also have feature B. Then:
P(A,B) = .52 * .90 = .468.
The original guess at P(B|A) might be subjective, or it might be based
on other data.
For a range of estimates one simply supplies lower- and upper-bound
estimates of P(B|A) in equation [1]. This produces lower- and
upper-bound estimates of P(A,B).
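As a rough illustration of equation [1] used for a point estimate and a range, here is a small Python sketch. The 0.90 guess is the one from the example above; the low and high guesses (0.80 and 0.95) and all names are hypothetical. It also reports the smallest P(B|A) that the known margins allow, since P(A,B) can never fall below P(A) + P(B) - 1.

```python
def joint_from_conditional(p_a, p_b_given_a):
    """Equation [1]: P(A,B) = P(A) * P(B|A)."""
    return p_a * p_b_given_a


p_a, p_b = 0.52, 0.80

# The margins already restrict P(B|A): P(A,B) >= P(A) + P(B) - 1.
min_feasible = max(0.0, p_a + p_b - 1.0) / p_a

# Hypothetical low / point / high guesses at P(B|A); only 0.90 is from the thread.
guesses = {"low": 0.80, "point": 0.90, "high": 0.95}
estimates = {name: joint_from_conditional(p_a, g) for name, g in guesses.items()}

for name, value in estimates.items():
    print(f"{name}: P(A,B) = {value:.3f}")
print(f"P(B|A) cannot be below {min_feasible:.3f} given these margins")
```

Any guess at P(B|A) below that floor would contradict the known 80% for feature B.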
One may alternatively guess at P(A|B), the probability a house has
feature A given that it has feature B, and use this equation:
P(A,B) = P(B) P(A|B) [2]
A further option is to separately guess at both P(B|A) and P(A|B),
place these in equation [1] and equation [2], respectively, and take
as your estimate of P(A,B) the average of the two results.
Finally it should be mentioned that in the special case of statistical
independence, where the presence of one feature does not affect the
probability of the other feature's presence, the formula is just:
P(A,B) = P(A) P(B). [3]
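Because equations [1] and [2] describe the same quantity, separate guesses at P(B|A) and P(A|B) are tied to each other. A short worked version with the figures from the question (52% and 80%), offered as a sketch:

```latex
\begin{align*}
P(A)\,P(B \mid A) &= P(B)\,P(A \mid B)
  \quad\Longrightarrow\quad
  P(A \mid B) = \frac{0.52}{0.80}\,P(B \mid A) = 0.65\,P(B \mid A), \\
P(A,B) &= P(A)\,P(B) = 0.52 \times 0.80 = 0.416
  \quad\text{(independence, equation [3])}.
\end{align*}
```

So a guess of P(B|A) = .90 implies P(A|B) = .585. If two separate guesses do not satisfy this relation, averaging the results of [1] and [2] is a compromise between two inconsistent guesses.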
Hope this helps.
John Uebersax PhD
http://www.john-uebersax.com
|
| w.ccarleton... |
Posted: Wed Oct 14, 2009 3:34 am |
|
|
|
Guest
|
Thanks to both John and Ray for responding, I greatly appreciate your
time. The purpose behind this exercise is to combine two highly
correlated (highly correlated for archaeological data anyhow)
variables into a single variable for a Principal Components Analysis.
I know that the two features are correlated (R^2 > .7). Does my
knowledge about the correlation allow me to derive a single value, as
opposed to a range, to replace the two separate variables in the PCA?
I'm not entirely certain that I could go back to the original data in
any reasonable amount of time - trying to submit before December.
Thanks again,
Chris |
|
| Ray Koopman... |
Posted: Wed Oct 14, 2009 6:55 am |
|
|
|
Guest
|
For 2 x 2 tables, r = (pAB - pA*pB)/Sqrt[pA(1-pA)pB(1-pB)],
so pAB = pA*pB + r*Sqrt[pA(1-pA)pB(1-pB)].
For your data that gives pAB = .416 + .20*r,
and if r^2 = .7 then pAB = .58 |
|
| Ray Koopman... |
Posted: Wed Oct 14, 2009 2:23 pm |
|
|
|
Guest
|
On Oct 14, 9:55 am, Ray Koopman <koop... at (no spam) sfu.ca> wrote:
[quote:a5a950d9c3]For 2 x 2 tables, r = (pAB - pA*pB)/Sqrt[pA(1-pA)pB(1-pB)],
so pAB = pA*pB + r*Sqrt[pA(1-pA)pB(1-pB)].
For your data that gives pAB = .416 + .20*r,
and if r^2 = .7 then pAB = .58[/quote:a5a950d9c3]
... which is impossible, as I should have recognized. For your data,
.32 <= pAB <= .52, exactly. For 2 x 2 tables in general, the maximum
possible pAB is the smaller of pA or pB, and the corresponding r is
the maximum possible r. For your data that's .5204, or r^2 = .271.
Where did you get the .7 from? |
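The relation between r and pAB, and the limits on r that the margins impose, can be checked with a few lines of Python. This is only a sketch; the function names are invented for illustration.

```python
import math


def phi_scale(p_a, p_b):
    """Sqrt[pA(1-pA)pB(1-pB)], the denominator of r for a 2 x 2 table."""
    return math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))


def joint_from_r(p_a, p_b, r):
    """pAB = pA*pB + r*Sqrt[pA(1-pA)pB(1-pB)]."""
    return p_a * p_b + r * phi_scale(p_a, p_b)


def r_range(p_a, p_b):
    """Smallest and largest r that the margins allow."""
    scale = phi_scale(p_a, p_b)
    lowest_joint = max(0.0, p_a + p_b - 1.0)
    highest_joint = min(p_a, p_b)
    return ((lowest_joint - p_a * p_b) / scale,
            (highest_joint - p_a * p_b) / scale)


r_min, r_max = r_range(0.52, 0.80)
print(round(r_max, 4), round(r_max ** 2, 3))
print(round(joint_from_r(0.52, 0.80, math.sqrt(0.7)), 2))
```

The first line printed reproduces the maximum r and r^2 quoted above; the second reproduces the .58, which lies above the .52 ceiling and so cannot occur with these margins.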
|
| Rich Ulrich... |
Posted: Wed Oct 14, 2009 2:28 pm |
|
|
|
Guest
|
On Wed, 14 Oct 2009 06:34:35 -0700 (PDT), "w.ccarleton"
<w.ccarleton at (no spam) gmail.com> wrote:
[snip, delete previous]
[quote:8755daeccc]
Thanks to both John and Ray for responding, I greatly appreciate your
time. The purpose behind this exercise is to combine two highly
correlated (highly correlated for archaeological data anyhow)
variables into a single variable for a Principal Components Analysis.
I know that the two features are correlated (R^2 > .7). Does my
knowledge about the correlation allow me to derive a single value, as
opposed to a range, to replace the two separate variables in the PCA?
I'm not entirely certain that I could go back to the original data in
any reasonable amount of time - trying to submit before December.
Thanks again,
[/quote:8755daeccc]
They are correlated features, and one of the two variables
has unequal proportions. Therefore, the 4 cells of the 2x2
table, AxB, have different Ns. You might ignore the *content*
of the measures, and proceed to use those facts in order to
create a 3-level variable that probably will make some sense.
Take the 50-50 category as A and the 80-20 as B. Then,
simply: Take the smaller two-way category (B_2) as one
new category with 20% of the data, and score that "1";
split B_1 into scores "2" and "3", with "3" representing
the largest N of the table, about 45%, leaving 35% in
the middle. As Likert showed, 75 years ago, the exact
spacing of such intervals is not likely to matter much.
What matters, also, is that the categories should make
some sense. I think I could usually figure an interpretation
of what I got in that style, but it might not always be as
natural - given the *meaning* of the measures - as
something based on real-life interpretation. Does
"something", any of the four cells, represent an
"ideal form", some sort of epitome, of a conventional
paradigm? -- In that case, you would sort the cells
in the order of decreasing match-to-the-form, for the
three or four scores.
--
Rich Ulrich |
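One possible reading of the three-level score described above, sketched in Python. It assumes per-house records are available (they are not in the published percentages), and the flag `big_cell_has_a` is a made-up parameter: which of the two B cells is the larger one can only be known from the actual table.

```python
def three_level_score(has_a, has_b, big_cell_has_a=True):
    """Score 1 = lacks B; houses with B are split by A into 2 and 3.

    Score 3 goes to whichever B cell is assumed to be the largest in the table.
    """
    if not has_b:
        return 1
    return 3 if has_a == big_cell_has_a else 2


# Hypothetical houses as (has_a, has_b) pairs.
houses = [(True, True), (False, True), (True, False), (False, False)]
scores = [three_level_score(a, b) for a, b in houses]
print(scores)
```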
|
| w.ccarleton... |
Posted: Wed Oct 14, 2009 6:48 pm |
|
|
|
Guest
|
On Oct 14, 8:23 pm, Ray Koopman <koop... at (no spam) sfu.ca> wrote:
[quote:8ec5f04355]On Oct 14, 9:55 am, Ray Koopman <koop... at (no spam) sfu.ca> wrote:
> For 2 x 2 tables, r = (pAB - pA*pB)/Sqrt[pA(1-pA)pB(1-pB)],
> so pAB = pA*pB + r*Sqrt[pA(1-pA)pB(1-pB)].
> For your data that gives pAB = .416 + .20*r,
> and if r^2 = .7 then pAB = .58
... which is impossible, as I should have recognized. For your data,
.32 <= pAB <= .52, exactly. For 2 x 2 tables in general, the maximum
possible pAB is the smaller of pA or pB, and the corresponding r is
the maximum possible r. For your data that's .5204, or r^2 = .271.
Where did you get the .7 from?
[/quote:8ec5f04355]
The correlation is derived from comparing the percentage of houses
with feature A to the percentage of houses with feature B over 9
architectural levels at the site. So, I have two vectors, each of
length 9, that contain the % of houses with each feature in each
level. Apparently, when I apply a simple linear model to the two, an
R^2 value (using R and lm()) of ~0.7 is determined. I was hoping that
I could use the knowledge of the correlation between the two features
(which must occur in some houses together given the % values for each
in each level) to estimate the probability that both features occur
together. That value would be calculated for each of 9 levels and the
new vector would be used in the Principal Components instead of the
two separate (but highly correlated) vectors of A and B. I suppose,
technically, the correlation is just telling me that when the % of
houses with A increases, so does the % of houses with B and nothing
directly about whether the two co-occur in individual houses (only in
the sample overall), but when the %s are greater than .5 then they
must overlap. Any ideas? |
|
| Ray Koopman... |
Posted: Wed Oct 14, 2009 9:53 pm |
|
|
|
Guest
|
R^2 < 0 ?? However, be that as it may, I think you've realized that
there are two conceptually distinct correlations involved here. For
the sake of discussion, suppose that for each house you knew whether
it has or lacks feature A and whether it has or lacks feature B. Then
there are three different correlations that you could compute:
1. You could get r(A,B) within each level separately,
and then average those values somehow.
2. You could (as you did) correlate the average A at each level
with the average B at each level, correlating over the 9 levels.
3. You could get r(A,B) over all the houses, ignoring level.
The problem (which is sometimes called "Simpson's Paradox") is that
the first and second correlations are logically unrelated to one
another. There is simply no way that knowing one tells you anything
about the other. The third correlation can be expressed as a function
of the first two.
Loosely, with many corners rounded severely:
1 = average correlation,
2 = correlation of averages,
3 = 1 + 2.
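A toy illustration of how far apart the three correlations can be, as a Python sketch. The counts are hypothetical and were chosen so that A and B are exactly independent within every level while the level proportions rise together.

```python
import math


def pearson(xs, ys):
    """Ordinary Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def phi(n, a, b, ab):
    """r for two 0/1 features from counts: n houses, a with A, b with B, ab with both."""
    p_a, p_b, p_ab = a / n, b / n, ab / n
    return (p_ab - p_a * p_b) / math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))


# Hypothetical levels as (n, a, b, ab); ab = a*b/n, so r is zero within each level.
levels = [(100, 20, 40, 8), (100, 50, 60, 30), (100, 80, 80, 64)]

within = [phi(*level) for level in levels]                 # correlation 1
between = pearson([a / n for n, a, b, ab in levels],
                  [b / n for n, a, b, ab in levels])       # correlation 2
pooled = phi(*[sum(column) for column in zip(*levels)])    # correlation 3

print([round(w, 3) for w in within], round(between, 3), round(pooled, 3))
```

Here the correlation of the level averages is essentially perfect even though the features are unrelated inside every level, and the pooled correlation lands in between.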
If your goal is to make a blanket statement "x% of the houses have
both A and B", without regard to level, then you need the data that
would go into 3. If all you have is the data for 2 then you're stuck.
One way in which problems such as this are often handled is to assume
that the within-level correlations are zero, that all the relation
between A and B is due to differences between levels. Then an estimate
of the number of houses in level i that have both A and B would be
ei = ai*bi/ni, where
ai = the number of houses in level i that have feature A,
bi = the number of houses in level i that have feature B,
ni = the number of houses in level i.
Summing ei over i would give an estimate of the total number of houses
with both A and B, and dividing that by the total number of houses
would give an estimate of the proportion with both A and B. |
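The estimate described above, ei = ai*bi/ni summed over levels, written out as a short Python sketch. The counts are hypothetical, since the published data give only percentages.

```python
def estimate_both(levels):
    """levels: list of (n_i, a_i, b_i) counts per level.

    Returns the estimated number of houses with both features and the
    estimated overall proportion, assuming zero within-level correlation.
    """
    expected = [a * b / n for n, a, b in levels]  # e_i = a_i * b_i / n_i
    total_houses = sum(n for n, _, _ in levels)
    return sum(expected), sum(expected) / total_houses


levels = [(100, 20, 40), (100, 50, 60), (100, 80, 80)]  # hypothetical counts
count_both, prop_both = estimate_both(levels)
print(count_both, prop_both)
```

Note that the level sizes ni act as weights, so the overall proportion cannot be recovered from the percentages alone unless the levels are of known size.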
|
| w.ccarleton... |
Posted: Thu Oct 15, 2009 7:21 am |
|
|
|
Guest
|
On Oct 15, 3:53 am, Ray Koopman <koop... at (no spam) sfu.ca> wrote:
[quote:494dee7f41]R^2 < 0 ?? However, be that as it may, I think you've realized that
[/quote:494dee7f41]
The R^2 isn't negative. I was using the tilde '~' to express that it's
not exactly 0.7 (it's actually 0.7186...). Sorry for the confusion.
[quote:494dee7f41]One way in which problems such as this are often handled is to assume
that the within-level correlations are zero, that all the relation
between A and B is due to differences between levels. Then an estimate
of the number of houses in level i that have both A and B would be
ei = ai*bi/ni, where
ai = the number of houses in level i that have feature A,
bi = the number of houses in level i that have feature B,
ni = the number of houses in level i.
Summing ei over i would give an estimate of the total number of houses
with both A and B, and dividing that by the total number of houses
would give an estimate of the proportion with both A and B.
[/quote:494dee7f41]
I'm definitely starting to get the impression that there isn't going
to be a mathematically, or logically, sound way of combining these two
variables without going back to the original data (which may not turn
out to be feasible). I'm using published data that does not include
counts of features and buildings. The author I've taken the
architectural data from presented the information in % per level. I am
interested in changing systems over time so this is not really a
problem for most of the analysis. It became a problem while I was
trying to implement a PCA in order to narrow the range of variables.
In total, as I'm studying many systems other than architecture, there
are 38 archaeological variables and four paleoclimatological variables
in my study. If I'm understanding you correctly, I cannot use %s in
place of the 'number of houses in level i that have feature A' for
'ai' or the other variables. If I did use the % instead of the counts,
which I don't presently have access to, then your suggestion becomes
the same as %A*%B. In order for that to be valid I must assume that
the correlation between A and B in each building level is 0, as you've
said.
Here's the next important question: even if I had access to the counts
of buildings in a single level with the features, how could a
correlation (in the sense of a linear regression) be derived from two
numbers? Would not the 'correlation' be expressed then as a % of
buildings with both A and B since a linear regression between two data
points is going to be invalid? So, finally, the assumption that the
correlation between A and B for a given level is 0 is sound because
there's no reason to assume a priori that one exists, and a regression
wouldn't be revealing anyhow. If I can assume independence in each
level for A and B (even if they often co-occur) then I should be able
to use the formula you've suggested, or just multiply the %s, rather
than be concerned about conditional probabilities. Given an individual
level you would only know that the variables co-occur, and you have to
view the occurrence of both variables over time at the site to spot a
valid correlation. Does that then suggest that within a single level A
and B can be considered co-occurring, but independent, from which it
follows that I can ignore conditional probability? Of course, this
last assumption is with the caveat that I don't have some
archaeological reason to assume dependence.
This discussion has really helped and I greatly appreciate your time,
Chris |
|
| Ray Koopman... |
Posted: Sun Oct 18, 2009 8:50 pm |
|
|
|
Guest
|
I've been misreading conceptually, as well as perceptually (tilde vs
minus). So tell me if this is the context: You have a data matrix
with 38 columns (variables) and 9 rows (cases = levels). You want
to do a PCA, but there are far too many variables, so you're trying
to reduce the number of variables by combining some of them a priori.
This thread is about two particular variables that you want to
combine. Each variable is the proportion of houses at each level
that have a particular feature. One variable refers to feature A,
the other variable refers to feature B. You do not know how many
houses the proportions are based on -- all you have are the 9 pairs
of proportions: (pAi,pBi), i = 1..9. You asked how to estimate the
proportion of houses at each level that have both A and B, which
is your proposed combination variable.
If you are willing to assume that A and B are independent within
each level then the product pAi*pBi would be an estimate of the
proportion that have both A and B. The only problem would be the
tenability of the independence assumption.
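A sketch of the per-level product in Python, shown next to the range that each pair of proportions allows without any assumption. Only the first pair (.52, .80) comes from the thread; the other values are hypothetical stand-ins for the 9 published pairs.

```python
def independent_joint(p_a, p_b):
    """Per-level estimate pAi*pBi, valid only if A and B are independent within levels."""
    return [a * b for a, b in zip(p_a, p_b)]


# Proportions per level; everything after the first pair is made up.
p_a = [0.52, 0.40, 0.65]
p_b = [0.80, 0.70, 0.90]

both = independent_joint(p_a, p_b)
bounds = [(max(0.0, a + b - 1.0), min(a, b)) for a, b in zip(p_a, p_b)]

for estimate, (low, high) in zip(both, bounds):
    print(f"estimate {estimate:.3f}, possible range {low:.2f} to {high:.2f}")
```

The product always falls inside the possible range, but the width of that range shows how much rides on the independence assumption.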
But if you are willing to change the question a little then there is
another answer that is just as simple and requires no assumptions.
Do you really want to know what proportion _have_ both A and B?
Wouldn't it be equally informative to know instead what proportion
_lack_ both A and B; or, equivalently, what proportion have at least
one of A or B? Unless one of those options is clearly better than
the other, I would suggest averaging the two approaches, which leads
(up to a constant factor of 1/2) to simply pAi+pBi as the combination
measure. This would be equivalent
to giving each house a "feature count" score, and then using the
average feature count at each level in the PCA. It can also be easily
extended to situations where there are more than two features, and
(imho) is more in the "linear combination" spirit of PCA than the
proportion having or lacking all the features in the set would be.
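The equivalence with a "feature count" score can be checked on a toy level, as in this Python sketch. The per-house records are hypothetical; the point is that the average of "has both" and "has at least one" is half of pA + pB, and that pA + pB is the mean number of features per house.

```python
# Hypothetical per-house records for one level: (has_a, has_b) coded 0/1.
houses = [(1, 1), (1, 0), (0, 1), (1, 1), (0, 0)]

n = len(houses)
p_a = sum(a for a, _ in houses) / n
p_b = sum(b for _, b in houses) / n
p_both = sum(a and b for a, b in houses) / n     # has both A and B
p_either = sum(a or b for a, b in houses) / n    # has at least one of A or B
mean_feature_count = sum(a + b for a, b in houses) / n

print(p_a + p_b, mean_feature_count, (p_both + p_either) / 2)
```

Since pA + pB needs only the two published proportions, it can be computed for each level without knowing the overlap.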
[quote]
Here's the next important question: even if I had access to the
counts of buildings in a single level with the features, how could
a correlation (in the sense of a linear regression) be derived from
two numbers? Would not the 'correlation' be expressed then as a % of
buildings with both A and B since a linear regression between two data
points is going to be invalid?
[/quote]
There's nothing wrong with a linear regression when the predictor has
only two values. One definition of the regression of variable Y on
variable X is that it is E(Y|X), the conditional mean of Y given X.
If X is discrete, with only two values, then the function consists of
only two points and can always be fit by a straight line, regardless
of whether Y is discrete or continuous.
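To make this concrete, here is a small Python sketch with hypothetical 0/1 data, where x is "has feature A" and y is "has feature B". The least-squares line passes exactly through the two conditional means, so the intercept is the proportion with B among houses lacking A and the slope is the difference between the two proportions.

```python
def ols_line(xs, ys):
    """Least-squares intercept and slope for y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


# Hypothetical houses: x = has feature A, y = has feature B.
x = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

intercept, slope = ols_line(x, y)
mean_y_x0 = sum(yi for xi, yi in zip(x, y) if xi == 0) / x.count(0)
mean_y_x1 = sum(yi for xi, yi in zip(x, y) if xi == 1) / x.count(1)

print(intercept, slope, mean_y_x0, mean_y_x1)
```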
|
| Rich Ulrich... |
Posted: Mon Oct 19, 2009 4:43 pm |
|
|
|
Guest
|
On Sun, 18 Oct 2009 23:50:12 -0700 (PDT), Ray Koopman <koopman at (no spam) sfu.ca>
wrote:
[snip, preceding]
[quote]
But if you are willing to change the question a little then there is
another answer that is just as simple and requires no assumptions.
Do you really want to know what proportion _have_ both A and B?
Wouldn't it be equally informative to know instead what proportion
_lack_ both A and B; or, equivalently, what proportion have at least
one of A or B? Unless one of those options is clearly better than
the other, I would suggest averaging the two approaches, which leads
to simply pAi+pBi as the combination measure. This would be equivalent
to giving each house a "feature count" score, and then using the
average feature count at each level in the PCA. It can also be easily
extended to situations where there are more than two features, and
(imho) is more in the "linear combination" spirit of PCA than the
proportion having or lacking all the features in the set would be.
[/quote]
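As a rough illustration of the quoted suggestion, the sketch below uses a hypothetical 2 x 2 table for one level (the cell proportions are assumptions, chosen only so that the margins match the 52% and 80% from the original question). It checks that P(both) + P(at least one) equals pA + pB, and that this is also the mean "feature count" per house.

```python
# Hypothetical cell proportions for one level; they must sum to 1.
# Keys are (has_A, has_B) coded as 0/1.
cells = {
    (1, 1): 0.40,  # has A and B
    (1, 0): 0.12,  # has A only
    (0, 1): 0.40,  # has B only
    (0, 0): 0.08,  # has neither
}

# Marginal proportions of houses with A and with B.
p_a = sum(p for (a, b), p in cells.items() if a == 1)
p_b = sum(p for (a, b), p in cells.items() if b == 1)

# The two candidate measures from the quoted text.
p_both = cells[(1, 1)]
p_at_least_one = 1.0 - cells[(0, 0)]

# Score each house 0, 1 or 2 by how many features it has, then average.
mean_feature_count = sum((a + b) * p for (a, b), p in cells.items())

print(p_a + p_b, p_both + p_at_least_one, mean_feature_count)
```

The mean feature count runs from 0 to 2, so a value above 1 is an average number of features per house, not a percentage of houses above 100%. Dividing it by 2 gives the average of P(both) and P(at least one), which is the same variable up to a constant factor.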
That's closer to the spirit of my own post on the 14th -- coming
up with a score.
I suggested a couple of other versions of score, which may be
interesting since there are unequal proportions.
[snip, rest]
--
Rich Ulrich |
|
| w.ccarleton... |
Posted: Tue Oct 20, 2009 4:34 am |
|
|
|
Guest
|
Thanks again Rich and Ray,
When you say pAi + pBi is equivalent to a 'feature count score' I'm
not exactly following. I had some similar trouble understanding the
scoring system that Rich suggested. Could one or both of you try to
explain that to me again? If I understand it, then I would be able to
add the percentage of houses that contain A to the percentage of
houses that contain B at each level and use that as the new variable.
My confusion is coming from the issue of having then greater than 100%
of houses with both features - I guess I'm asking: why is it valid to
add percentages?
Chris |
|
| w.ccarleton... |
Posted: Tue Oct 20, 2009 12:00 pm |
|
|
|
Guest
|
On Oct 20, 5:37 pm, Rich Ulrich <rich.ulr... at (no spam) comcast.net> wrote:
[quote]On Tue, 20 Oct 2009 07:34:29 -0700 (PDT), "w.ccarleton"
<w.ccarle... at (no spam) gmail.com> wrote:
Thanks again Rich and Ray,
When you say pAi + pBi is equivalent to a 'feature count score' I'm
not exactly following. I had some similar trouble understanding the
scoring system that Rich suggested. Could one or both of you try to
explain that to me again? If I understand it, then I would be able to
add the percentage of houses that contain A to the percentage of
houses that contain B at each level and use that as the new variable.
My confusion is coming from the issue of having then greater than 100%
of houses with both features - I guess I'm asking: why is it valid to
add percentages?
Well, in the sense that you merely want an indicator variable
with a continuous scale, adding percentages gives you one.
If you want a meaningful number, you may (say) assume
independence, and estimate the "percentage with both"
or "percentage with neither", or whatever cell you deem
interesting -- so that it will be easiest to talk about.
I originally thought that you had the flexibility of scoring
each house, and that led me astray.
--
Rich Ulrich
[/quote]
Oh I see... okay so it says nothing meaningful in and of itself, but
it gives me a single variable so that I can satisfy the PCA
assumptions and carry on with my life. Do either of you happen to have
a reference that I could read in which this technique may have been
used? I think that I can rationally defend it, but my committee will
likely want to see that it has been used successfully elsewhere.
Thanks very much for your time (both Rich and Ray).
Chris |
|
| Rich Ulrich... |
Posted: Tue Oct 20, 2009 3:37 pm |
|
|
|
Guest
|
On Tue, 20 Oct 2009 07:34:29 -0700 (PDT), "w.ccarleton"
<w.ccarleton at (no spam) gmail.com> wrote:
[quote]
Thanks again Rich and Ray,
When you say pAi + pBi is equivalent to a 'feature count score' I'm
not exactly following. I had some similar trouble understanding the
scoring system that Rich suggested. Could one or both of you try to
explain that to me again? If I understand it, then I would be able to
add the percentage of houses that contain A to the percentage of
houses that contain B at each level and use that as the new variable.
My confusion is coming from the issue of having then greater than 100%
of houses with both features - I guess I'm asking: why is it valid to
add percentages?
[/quote]
Well, in the sense that you merely want an indicator variable
with a continuous scale, adding percentages gives you one.
If you want a meaningful number, you may (say) assume
independence, and estimate the "percentage with both"
or "percentage with neither", or whatever cell you deem
interesting -- so that it will be easiest to talk about.
I originally thought that you had the flexibility of scoring
each house, and that led me astray.
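For concreteness, a small sketch of the independence estimates, using the 52% and 80% figures from the original question (the function names are hypothetical, and independence is an assumption, not something the marginal percentages can confirm). It also computes the range of values that P(both) can take given only the margins, which matches the 32% to 52% range stated earlier in the thread.

```python
def both_bounds(p_a, p_b):
    """Smallest and largest possible P(A and B) given only the marginals."""
    return max(0.0, p_a + p_b - 1.0), min(p_a, p_b)


def under_independence(p_a, p_b):
    """Cell estimates if A and B are assumed to be independent."""
    return {
        "both": p_a * p_b,
        "neither": (1.0 - p_a) * (1.0 - p_b),
    }


p_a, p_b = 0.52, 0.80
low, high = both_bounds(p_a, p_b)
est = under_independence(p_a, p_b)

print(low, high)        # possible range for P(both)
print(est["both"])      # independence estimate of P(both)
print(est["neither"])   # independence estimate of P(neither)
```

With these margins the independence estimate of "both" is about 41.6%, which sits inside the possible range, and "neither" is about 9.6%.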
--
Rich Ulrich |
|
All times are GMT - 5 Hours