He doesn’t trust the fit . . . r=.999

I received the following question from an education researcher:

I was wondering if I could ask you a question about an HLM model I’m working on. The basic design is that we have 5 years of 8th grade student achievement data (standardized test scores; this is the dependent variable), 4th grade test scores, demographics (e.g., gender and ethnicity), and status with respect to special education or ELL, etc. In addition, we have some school-level (second-level) information such as school averages of the student variables, type of school (grade configuration), enrollment, and so on. In total there are thousands of students and many schools over the 5 years of information.

The model we’re using is quite parsimonious, with only 7 student-level effects and 4 school-level effects. What’s puzzling us is that the correlation between predicted and actual values is unrealistically high: r=0.999. We’re using the HPMIXED procedure in SAS, but that shouldn’t matter. Obviously we can get the correlation to go down by dropping variables, but in my view we shouldn’t have to do that. It looks like we’re overfitting, but I don’t see how. Is it important that the coefficient of variation for the dependent variable is about 10? That seems quite low to me, and, coupled with the pretty narrow range of possible values (between 1 and 4.5), it makes me wonder whether we are overfitting.

Anyway, I hope this question is clear. We’re uncomfortable with an unrealistically good fit and are wondering how to fix it.

My reply: I’m not quite sure how this correlation is being computed, but I wonder if what’s happening is that the predicted values include the estimates of the student effects, that is, the unexplained student-level variation. One way to get a sense of this would be to use your model to predict each year’s data from the previous year. I agree that r=.999 is pretty weird. I think you have to get a better sense of what these fitted values really mean.
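To make this concrete, here is a minimal R sketch using lme4 rather than SAS HPMIXED, with a simulated data frame (dat, with made-up columns score8, score4, year, and school) standing in for the real data; none of the variable names or numbers come from the original analysis. It shows how to compare predicted values that condition on the estimated group effects against predictions from the fixed effects alone, and how to do the year-ahead check suggested above.

    # Simulated stand-in for the real student-level data (entirely hypothetical).
    library(lme4)
    set.seed(123)
    dat <- data.frame(
      school = factor(rep(1:50, each = 40)),
      year   = rep(1:5, times = 400),
      score4 = rnorm(2000, mean = 3, sd = 0.5)
    )
    dat$score8 <- 2 + 0.3 * dat$score4 +
      rnorm(50, sd = 0.2)[dat$school] + rnorm(2000, sd = 0.3)

    # Toy multilevel model: 8th-grade score on 4th-grade score with a
    # school-level intercept (the real model has more predictors).
    fit <- lmer(score8 ~ score4 + (1 | school), data = dat)

    # fitted() conditions on the estimated school effects (the BLUPs) ...
    cor(dat$score8, fitted(fit))

    # ... while re.form = NA drops them and uses only the fixed-effects part.
    cor(dat$score8, predict(fit, re.form = NA))

    # Year-ahead check: fit on the earlier years, predict the last year.
    train <- subset(dat, year < 5)
    test  <- subset(dat, year == 5)
    fit2  <- lmer(score8 ~ score4 + (1 | school), data = train)
    cor(test$score8, predict(fit2, newdata = test, allow.new.levels = TRUE))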

1 thought on “He doesn’t trust the fit . . . r=.999”

  1. I think this is an example of a problem that has bothered me before, too. I'm tempted to call it a bug in R, but it's not really a bug, and it doesn't just come up in R; really it is a statistical convention that I think is bad.

    If one fits a model without an intercept term, R (and some other packages) does not compute R-squared relative to the total variance of y around its mean; instead, it computes it relative to the raw sum of squares, sum(y^2), where y is the vector of data (i.e., variation around zero rather than around the mean). This is almost never what I want, and I don't know why other people want it, but that's the way it is.

    So, instead of looking at the R-squared value reported by R, I have to calculate it myself.
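    A small simulated example of the difference (nothing here comes from the original analysis): with a no-intercept fit, the R-squared that summary() reports is computed against sum(y^2), while the usual definition measures fit relative to the variation around mean(y).

        # Simulated data: y has a large mean, so the two denominators differ a lot.
        set.seed(1)
        x <- rnorm(100, mean = 10)
        y <- 5 + 0.1 * x + rnorm(100)

        fit <- lm(y ~ x - 1)   # regression forced through the origin

        # R-squared as reported by summary.lm for a no-intercept model:
        # 1 - sum(residuals^2) / sum(y^2)
        summary(fit)$r.squared

        # R-squared computed relative to the variation around mean(y):
        1 - sum(resid(fit)^2) / sum((y - mean(y))^2)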
