“The method of multiple correlation”

Someone writes:

I was reading Harold Gulliksen’s /Theory of Mental Tests/ (1950), and on pp. 327-329 it describes a process for solving a set of equations of the form

y = b_1 x_1 + b_2 x_2 + … + b_n x_n

so as to minimize the least-squares error. Sounds like regression. But this procedure claims to account for the correlations among all the x variables. He calls it “the method of multiple correlation”.

Why don’t we use this procedure all the time, instead of standard regression, which assumes independence of the independent variables?

My reply: I haven’t ever heard of this before. But it sounds to me just like multiple regression (which does not assume independence of the x-variables). This confusion of terminology is one reason why I don’t like to use the term “independent variables.” I prefer to call them “predictors.”
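
As a concrete illustration (a minimal sketch, not from the original post, using numpy and simulated data with made-up names): the least-squares coefficients come from the normal equations (X'X)b = X'y, and X'X contains the cross-products of the predictors, so the correlation between the x's is accounted for automatically, with no independence assumption.

```python
# Ordinary multiple regression via the normal equations.
# Editorial sketch with simulated data; variable names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.5, size=n)    # deliberately correlated with x1
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])        # include an intercept

# Normal equations: (X'X) b = X'y.  X'X holds the cross-products of the
# predictors, so the x1-x2 correlation is built into the solution.
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)   # roughly [0, 1, 2]

# By contrast, one-predictor-at-a-time slopes ignore that correlation
# and land far from the joint coefficients.
for x in (x1, x2):
    xc, yc = x - x.mean(), y - y.mean()
    print(xc @ yc / (xc @ xc))
```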

13 thoughts on ““The method of multiple correlation””

  1. I agree, it sounds exactly like regression. I've read somewhere that what we now call multiple regression was called multiple correlation in the (distant) past. I wonder if it's because there was a focus on using regression for things like intelligence, and so people used standardized coefficients, rather than unstandardized coefficients.

  2. I think that correlation was much more used in the past due to Wright's method of path coefficients, where all variables are standardized and the diagram of variables is described with path coefficients for directed relationships and correlations for undirected relationships.

    For example, consider x_0 = x_1 + x_2 + x_3. Then var(x_0) = var(x_1) + var(x_2) + var(x_3) + 2cov(x_1,x_2) + 2cov(x_1,x_3) + 2cov(x_2,x_3). Dividing this last equation by var(x_0) and setting cov(x_i,x_j) = sd(x_i)sd(x_j)r_ij (sd stands for standard deviation and r for correlation coefficient), we get 1 = p^2_1 + p^2_2 + p^2_3 + 2p_1p_2r_12 + 2p_1p_3r_13 + 2p_2p_3r_23, where p_i is the path coefficient from x_i to x_0.

    Now say that x_3 is the error (residual). If we set r_13 and r_23 to zero (the usual assumption that the independent variables are uncorrelated with the error), we get multiple regression, where x_0 is the dependent variable, x_1 and x_2 are the independent variables, and x_3 is the error term. The decomposition then becomes 1 = p^2_1 + p^2_2 + p^2_3 + 2p_1p_2r_12, and R^2 (the squared multiple correlation) is equal to 1 – p^2_3. (A quick numerical check of this decomposition is sketched after the comments.)

  3. Agreed on the "independent variables" thing, particularly because students tend to abbreviate it "IV," which they then confuse with "instrumental variable(s)" a bit later. I always say "covariates," or (more awkwardly) "right-hand side variables."

    Likewise, I hate the term "dependent variable." "Response variable" or "left-hand side variable" seem better.

    Then again, there are a bunch of terms social scientists tend to use for statistical concepts and phenomena that get on my nerves. So begins my long, slow slide into curmudgeondom.

  4. There are _three_ different uses of the word "independent" that can come up in regression:

    "independent variables" as an alternate name for the explanatory variables, which go by many other names as well.
    linear independence
    statistical independence

    It can be VERY confusing. I have blogged about this in the past…. it's something that people should be cautioned about IMO :)

  5. I use 'independent variables' because I learned it in 7th grade, and it is a general purpose name.

    When I am doing control work, I think of them as inputs, and when I am doing variational work, I think of them as parameters.

    The semantics are ugly, and don't really add much, because we are concerned with the relation of one to the other, not what they themselves are [unless we are trying to build something, then we need to figure out what things we have access to].

  6. The book I mentioned earlier was 'Applied Econometrics, A Time Series Approach' by Kerry Patterson (Macmillan Press, 2000). On page 8 (which shows how far I got through the book) it is talking about some work and debates by/with Keynes, and says "…is known as 'regression analysis' often referred to in Keynes' [sic] time as 'multiple correlation analysis' when its use was in its infancy".

  7. "Outcome" is good (kinda like "response").

    "Predictor" has implications that I'm not always comfortable with, mainly because some people (read: students) then expect you actually to *do* prediction, which we don't always do. (I shy away from "explanatory variables" a bit, for a similar reason: sometimes it's just association, not explanation). "Covariates" just seems more general.

  8. I favor "outcome" and "covariates". "Outcome" avoids the sometimes troublesome dependent/independent implication and remains descriptive without reference to the particular arrangement of the equation. I've seen the left/right side of the equation reversed by students from right-to-left language backgrounds, which can create some confusion.

    I find "covariate" to be the most descriptive term, especially in a linear regression setting and in discussing omitted variable bias in that context.
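
Here is a quick numerical check (an editorial sketch, not from the comment, using numpy and simulated data with illustrative names) of the standardized decomposition described in comment 2: fit the regression, form the path coefficients p_i = b_i sd(x_i)/sd(x_0) plus p_3 = sd(residual)/sd(x_0), and confirm that 1 = p_1^2 + p_2^2 + p_3^2 + 2 p_1 p_2 r_12 and that R^2 = 1 − p_3^2.

```python
# Numerical check of the path-coefficient decomposition in comment 2.
# Editorial sketch with simulated data; names are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x1 = rng.normal(size=n)
x2 = 0.4 * x1 + rng.normal(scale=0.9, size=n)   # x1 and x2 are correlated
x0 = 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)   # the "dependent" variable

# Center everything so the regression can be run without an intercept
x1, x2, x0 = x1 - x1.mean(), x2 - x2.mean(), x0 - x0.mean()

X = np.column_stack([x1, x2])
b = np.linalg.lstsq(X, x0, rcond=None)[0]        # raw least-squares slopes
x3 = x0 - X @ b                                  # residual = the "error" term

# Path coefficients: standardized slopes, plus sd(x3)/sd(x0) for the error path
p1, p2 = b * X.std(axis=0) / x0.std()
p3 = x3.std() / x0.std()
r12 = np.corrcoef(x1, x2)[0, 1]

print(p1**2 + p2**2 + p3**2 + 2 * p1 * p2 * r12)   # = 1 (up to rounding)
print(1 - p3**2)                                   # = R^2 of the regression
```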
