Overall R^2 and subgroup R^2
Pekka Jarvela
Posted: Wed Jan 10, 2007 10:30 am
Guest
I have a dataset consisting of three different classes, {A, B, C}, and a
linear model for the dataset with R^2 = 0.72. Plotting measured values
(y) against predicted values (yhat) for each class separately
and fitting a line to each gives:

A => R^2 = 0.398
B => R^2 = 0.466
C => R^2 = 0.401

QUESTION: Can I conclude that the model which is able to explain 72 %
of variation in the whole dataset {A, B, C}, is able to explain only
39.8 % of variation in class A, 46.6 % in class B and 40.1 % in class
C?

Cheers.

-PJ
Pekka Jarvela
Posted: Wed Jan 10, 2007 6:02 pm
Guest
Ray Koopman wrote:

Quote:

I guess I read it differently, as indicating that predicted values
were generated by fitting the model to the whole dataset, and that
the within-class R^2s were simply the squares of the bivariate
correlations of those predicted values with the observed values.

Yes, this is what I meant. Just to make sure:

Measured values in three different classes A, B and C

y_A = {y1_a, y2_a, ..., yn_a}
y_B = {y1_b, y2_b, ..., yn_b}
y_C = {y1_c, y2_c, ..., yn_c}

I fitted the model y = d0 + d1*x1 + d2*x2 to the whole data set
{A,B,C} and got R^2 = 0.72. Then I divided the yhats into the
corresponding classes:

yhat_A = {yhat1_a, yhat2_a, ..., yhatn_a}
yhat_B = {yhat1_b, yhat2_b, ..., yhatn_b}
yhat_C = {yhat1_c, yhat2_c, ..., yhatn_c}

Then I plotted y_A vs yhat_A, y_B vs yhat_B and y_C vs yhat_C and
calculated corr(y_A,yhat_A)^2 = 39.8 %, corr(y_B,yhat_B)^2 = 46.6 % and
corr(y_C,yhat_C)^2 = 40.1 %, which are also the R^2 values for the lines
fitted to the measured vs predicted points in class A, class B and
class C respectively.
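
The procedure described here (one pooled fit, then the squared correlation of y with yhat inside each class) can be sketched in Python with NumPy. The data are synthetic and purely illustrative, not the real dataset; the class shifts, coefficients and noise level are assumptions chosen so that much of the variation in y lies between classes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumed): three classes whose x1 means differ,
# so a large share of the variation in y is between-class variation.
n = 200  # cases per class
labels = np.repeat(["A", "B", "C"], n)
shift = np.repeat([0.0, 3.0, 6.0], n)
x1 = shift + rng.normal(size=3 * n)
x2 = rng.normal(size=3 * n)
y = 1.0 + 1.0 * x1 + 0.5 * x2 + rng.normal(scale=1.5, size=3 * n)

# Fit y = d0 + d1*x1 + d2*x2 once, on the pooled data {A, B, C}.
X = np.column_stack([np.ones_like(x1), x1, x2])
d, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ d


def squared_corr(a, b):
    """Squared Pearson correlation between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1] ** 2


# Overall R^2, then the within-class squared correlations of y with the
# pooled-model predictions (no refitting inside the classes).
r2_overall = squared_corr(y, yhat)
r2_within = {c: squared_corr(y[labels == c], yhat[labels == c]) for c in "ABC"}

print(f"overall: {r2_overall:.3f}")
for c, r2 in r2_within.items():
    print(f"class {c}: {r2:.3f}")
```

With data built this way the overall value comes out well above each within-class value, since the pooled fit is credited for tracking the differences between class means, which are absent once a single class is looked at on its own.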

- - -

Still, I guess I could have created a class variable which indicates to
which class each case belongs (1 <-> A, 2 <-> B, 3 <-> C) and put it as
a "fixed factor" in General Linear Model (GLM) Univariate analysis.
Then class dependency would have its own effect on the whole model, as
the intercept would get different values for different classes.

-PJ
Ray Koopman
Posted: Wed Jan 10, 2007 7:15 pm
Guest
Pekka Jarvela wrote:
Quote:
Still, I guess I could have created a class variable which indicates to
which class each case belongs (1 <-> A, 2 <-> B, 3 <-> C) and put it as
a "fixed factor" in General Linear Model (GLM) Univariate analysis.
Then class dependency would have its own effect on the whole model, as
the intercept would get different values for different classes.

-PJ

Coding group membership as (A = 1, B = 2, C = 3) would be appropriate
only if you expect mean[y_A]-mean[y_B] to equal mean[y_B]-mean[y_C].
In general you will need two group-membership variables, not just
one. There are many ways to code such 'dummy' variables; one would be
(A = {0,0}, B = {1,0}, C = {0,1}). Then d0 would be the intercept for
group A, and the regression coefficients for the two dummies would be
the differences between d0 and the intercepts for groups B and C,
respectively.
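
A minimal sketch of this dummy coding in Python with NumPy, on synthetic data. The group intercepts 1, 4 and 2 are hypothetical, chosen so that the B-A and C-B gaps differ; all variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100  # cases per group (assumed)
group = np.repeat(["A", "B", "C"], n)
x1 = rng.normal(size=3 * n)
x2 = rng.normal(size=3 * n)

# Hypothetical group intercepts: the B-A gap (+3) differs from the C-B gap (-2).
intercepts = np.repeat([1.0, 4.0, 2.0], n)
y = intercepts + 0.8 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=3 * n)


def fit(X, y):
    """Ordinary least squares; returns coefficients and residual sum of squares."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, float(resid @ resid)


ones = np.ones(3 * n)

# Two dummies: A = {0,0}, B = {1,0}, C = {0,1}.
is_B = (group == "B").astype(float)
is_C = (group == "C").astype(float)
coef_dummy, sse_dummy = fit(np.column_stack([ones, x1, x2, is_B, is_C]), y)
d0, d1, d2, diff_B, diff_C = coef_dummy

# Single numeric code A = 1, B = 2, C = 3: forces equally spaced intercepts.
code = np.repeat([1.0, 2.0, 3.0], n)
coef_code, sse_code = fit(np.column_stack([ones, x1, x2, code]), y)

print(f"intercept for A (d0):       {d0:.2f}")
print(f"B minus A (dummy coef):     {diff_B:.2f}")
print(f"C minus A (dummy coef):     {diff_C:.2f}")
print(f"SSE dummies vs single code: {sse_dummy:.1f} vs {sse_code:.1f}")
```

The single-code model is a special case of the two-dummy model, so its residual sum of squares can never be smaller; it matches only when the group intercepts happen to be equally spaced.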
David Jones
Posted: Thu Jan 11, 2007 8:13 am
Guest
Ray Koopman wrote:
Quote:
David Jones wrote:
Ray Koopman wrote:
Yes. R^2 for the whole dataset is larger than the within-class R^2s
because the whole dataset contains between-class variability that
the model can explain but that by definition is absent from the
within-class data.

Or no. The OP needs to be careful about "the model". The description
indicates that the model is being refitted within each class, which
is not (or may not be) quite the same as fitting the model to all the
data and then working out R^2 for each class separately for this
model ... which would give different (certainly no better) R^2
values than the separately-fitted sub-models.

David Jones

I guess I read it differently, as indicating that predicted values
were generated by fitting the model to the whole dataset, and that
the within-class R^2s were simply the squares of the bivariate
correlations of those predicted values with the observed values.

I was also reading between the lines, interpreting the question as
"why is the overall R^2 bigger than the within-class R^2s?".

I work in a field where we use "R^2" in a non-regression context and
where it is usually defined as 1 minus the ratio of the sums of
squares of errors for (i) the final modelled/predicted values and
(ii) a naive predictor consisting of the mean. Thus there is no
correlation involved, and it only agrees with a regression-based
definition if the "predicted values" are either constructed directly
by a multivariate regression on the data set on which the evaluation
is being made, or constructed indirectly from an intermediate
univariate regression using a raw predictor as the independent variable.
In our context we don't want to include in the final predictor the
sort of bias-correction and rescaling effects resulting from such an
intermediate regression, since it is the raw predictor that we want to
test, not the "corrected" one. Note that specification via the squared
errors maintains the interpretation of R^2 as the "proportion of
variance explained", rather than "squared correlation".
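
To make the two definitions concrete, here is a small sketch in Python with NumPy. The function names r2_sse and r2_corr, the synthetic data, and the "raw" predictor are all assumptions for illustration. The sketch shows the two measures agreeing for least-squares fitted values evaluated on the fitting data, and diverging for a predictor that is biased and mis-scaled.

```python
import numpy as np


def r2_sse(y, pred):
    """1 - SSE(pred) / SSE(mean of y): proportion of variance explained."""
    y = np.asarray(y, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)


def r2_corr(y, pred):
    """Squared Pearson correlation between observed and predicted values."""
    return np.corrcoef(y, pred)[0, 1] ** 2


rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 2.0 + 1.5 * x + rng.normal(size=300)

# Case 1: predictions from least squares fitted on the evaluation data.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

# Case 2: a hypothetical raw predictor with the right ordering of cases
# but the wrong level and scale (no bias correction or rescaling applied).
raw = 0.5 + 2.5 * x

print(f"fitted: r2_sse={r2_sse(y, fitted):.3f}  r2_corr={r2_corr(y, fitted):.3f}")
print(f"raw:    r2_sse={r2_sse(y, raw):.3f}  r2_corr={r2_corr(y, raw):.3f}")
```

The squared correlation is unchanged by any linear rescaling of the predictor, so it cannot penalise the raw predictor for its bias; the error-based version does, and can even go negative when the predictor does worse than the mean.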

In the present (OP's) context, there is the question of why the
squared-correlation R^2s are of interest for the sub-groups, since there
is an implicit "correction", for each sub-group, of the predictions
from the full model. An R^2 based on squared errors, as above, might
be more meaningful. A comparison of the two R^2s is essentially an
indication of whether it is worth including group-specific level and
multiplier effects if these aren't already in the model.
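
A rough sketch of that comparison in Python with NumPy, on synthetic data. The group offsets are hypothetical, and the pooled model deliberately omits them. For each group, the error-based R^2 of the pooled predictions is set beside the squared correlation, and beside the error-based R^2 obtained after a group-specific level (a) and multiplier (b) are fitted to the pooled predictions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 150  # cases per group (assumed)
group = np.repeat(["A", "B", "C"], n)
x = rng.normal(size=3 * n)
offset = np.repeat([0.0, 2.0, -1.0], n)  # hypothetical group level effects
y = offset + 1.0 * x + rng.normal(scale=0.8, size=3 * n)

# Pooled model without any group terms.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta


def r2_sse(y, pred):
    """1 - SSE(pred) / SSE(mean of y)."""
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)


results = {}
for g in "ABC":
    m = group == g
    yg, pg = y[m], yhat[m]
    # Group-specific recalibration y ~ a + b * yhat (level and multiplier).
    b, a = np.polyfit(pg, yg, 1)
    results[g] = {
        "r2_sse": r2_sse(yg, pg),
        "r2_corr": np.corrcoef(yg, pg)[0, 1] ** 2,
        "r2_sse_recal": r2_sse(yg, a + b * pg),
    }

for g, r in results.items():
    print(g, {k: round(float(v), 3) for k, v in r.items()})
```

Within each group the recalibrated error-based R^2 equals the squared correlation, so the gap between r2_sse and r2_corr measures what group-specific level and multiplier terms would gain.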
 