Correlation, prediction, variation, etc.

Hamdan Azhar writes:

I [Azhar] write with a question about language in the context of statistics. Consider the three statements below.

a) Y is significantly associated (correlated) with X;

b) knowledge of X allows us to account for __% of the variance in Y;

c) Y can be predicted to a significant extent given knowledge of X.

To what extent are these statements equivalent? Much of the (non-statistical) scientific literature doesn’t seem to distinguish between these notions. Is this just about semantics — or are there meaningful differences here, particularly between b and c?

Consider a framework where X constitutes a predictor space of p variables (x1,…,xp). We wish to generate a linear combination of these variables to yield a score that optimally correlates with Y. Can we substitute the word “predicts” for “optimally correlates with” in this context?

One can argue that “correlating” or “accounting for variance” suggests that we are trying to maximize goodness-of-fit (i.e. R-squared) for this particular dataset. On the other hand, “prediction” implies that we engage in some form of cross-validation where we seek to minimize some measure of prediction error. Is this reading too much into the language? Is it alright to substitute “prediction” for “accounting for variance”? Or are these distinct concepts that we should be careful not to conflate?
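The contrast between maximizing fit on one dataset and predicting new cases can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: the data are simulated, the score is an ordinary least-squares fit, and helper names such as `fit_ols` and `cross_validated_r_squared` are hypothetical. It compares the in-sample R-squared of the fitted score with an R-squared computed from out-of-fold predictions.

```python
import numpy as np

def r_squared(y, y_hat):
    """Share of the variance in y accounted for by y_hat (1 - SSE/SST)."""
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - sse / sst

def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Linear score (intercept plus weighted sum of the predictors)."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def cross_validated_r_squared(X, y, n_folds=5, seed=0):
    """R-squared computed only from predictions for held-out cases."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    y_hat = np.empty(len(y))
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        coef = fit_ols(X[train], y[train])
        y_hat[held_out] = predict(coef, X[held_out])
    return r_squared(y, y_hat)

# Simulated data (an assumption for illustration): 60 cases, 20 predictors,
# and only the first predictor is actually related to y.
rng = np.random.default_rng(1)
n, p = 60, 20
X = rng.normal(size=(n, p))
y = 0.5 * X[:, 0] + rng.normal(size=n)

score = predict(fit_ols(X, y), X)  # linear score fitted to this dataset
in_sample_r2 = r_squared(y, score)
corr_squared = np.corrcoef(y, score)[0, 1] ** 2
cv_r2 = cross_validated_r_squared(X, y)

print(f"in-sample R^2:       {in_sample_r2:.3f}")
print(f"corr(y, score)^2:    {corr_squared:.3f}")
print(f"cross-validated R^2: {cv_r2:.3f}")
```

For a least-squares fit with an intercept, the in-sample R-squared equals the squared correlation between Y and the fitted score, so "accounting for variance" and "optimally correlating" describe the same in-sample quantity. The cross-validated figure cannot exceed it and, with many predictors relative to cases, can be far lower.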

My reply: If interpreted generally enough, these statements are equivalent. “Correlation” refers to a linear relation, whereas “association” is more general. Similarly, knowing X can give you information about Y without accounting for “variance” in the technical sense, but if you replace that term with “variation,” the statement might work. I don’t think you get anything useful out of worrying about these different expressions in general. I’d recommend focusing on specific examples.
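One specific example of association without linear correlation can be sketched as follows. The data are hypothetical and chosen for illustration: x is symmetric around zero and y is exactly x squared, so y is fully determined by x even though the Pearson correlation between them is essentially zero.

```python
import numpy as np

# Hypothetical data: x is symmetric around zero and y depends on x only
# through its square, so the relation is exact but not linear.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# Pearson correlation of y with x itself (a linear relation).
linear_corr = np.corrcoef(x, y)[0, 1]

# Correlation of y with a transformed version of x.
quadratic_corr = np.corrcoef(x ** 2, y)[0, 1]

print(f"corr(x, y)    = {linear_corr:.3f}")
print(f"corr(x**2, y) = {quadratic_corr:.3f}")
```

Here a straight-line fit on x accounts for essentially none of the variance in y, while knowledge of x still allows y to be recovered exactly, which is the sense in which “association” and “variation” are broader than “correlation” and “variance.”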

1 thought on “Correlation, prediction, variation, etc.”

  1. One specific example: I seem to remember some concern on this blog about Malcolm Gladwell's phrasing when he claims John Gottman can "predict" which couples will divorce, when he's really "finding predictors that optimally correlate with" divorce but not actually making predictions as such.
