Another salvo in the ongoing battle over standardizing regression coefficients

Sander Greenland doesn’t like the automatic rescaling of regression coefficients (for example, my pet idea of scaling continuous inputs to have a standard deviation of 0.5, to put them on a scale comparable to binary predictors) because he prefers interpretable units (years, meters, kilograms, whatever). Also he points out that data-based rescaling (such as I recommend) creates problems in comparing models fit to different datasets.

OK, fine. I see his points. But let’s go out into the real world, where people load data into the computer and fit models straight out of the box. (That’s “out of the box,” not “outside the box.”)

Here’s something I saw recently, coefficients (and standard errors) from a fitted regression model:

coefficient for “per-capita GDP”: -.079 (.170)
coefficient for “secondary school enrollment”: -.001 (.006)

Now you tell me that these have easy interpretations. Sure. I’d rather have seen these standardized. Then I’d be better able to interpret the results. Nobody’s stopping you from doing a more careful rescaling, a la Greenland, but that’s not the default we’re starting from.
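
For concreteness, here's a minimal sketch (in Python, with made-up column names) of the kind of automatic rescaling I have in mind: center each continuous predictor and divide by two standard deviations, and leave binary indicators alone.

```python
import pandas as pd

def rescale_by_two_sd(df: pd.DataFrame) -> pd.DataFrame:
    """Center each non-binary column and divide by 2 sd's, so its
    coefficient is roughly comparable to that of a 0/1 indicator."""
    out = df.copy()
    for col in out.columns:
        x = out[col]
        if x.nunique() <= 2:   # leave binary (and constant) columns as they are
            continue
        out[col] = (x - x.mean()) / (2 * x.std())
    return out

# e.g. rescaled = rescale_by_two_sd(data[["gdp_per_capita", "school_enrollment"]])
# (column names here are hypothetical)
```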

11 thoughts on “Another salvo in the ongoing battle over standardizing regression coefficients”

  1. Here's one paper: "The fallacy of employing standardized regression coefficients and correlations as measures of effect," by Greenland, Schlesselman, and Criqui (1986). My own paper on the topic is here.

  2. I'm surprised this is controversial. The only difference in the use of standardised and raw coefficients is in their interpretation. There will be cases when one or the other is better: are you interested in a change in one standard deviation, or in "per capita GDP"? That is a decision for the scientist, not the statistician.

    I'm running a QTL analysis at the moment, and I'm not going to standardise, because it is more useful to know the effect of substituting one allele for another: estimating the effect of substituting half an allele will just increase the confusion.

    On the other hand, I did transform the response variables to have a unit variance, because that way I can compare across traits more easily. I might transform back when presenting the results, though. I haven't decided yet – it depends on what is easier for the reader to understand. Why isn't this sort of pragmatism the only way?

  3. One of the advantages of standardizing coefficients is the ability to compare the effects across _predictors_. One of the disadvantages is losing the actual scale, so comparisons across _datasets_ are impaired. I wonder if a solution to this dilemma is to plot coefficients at the standardized scale, but label the axis at the original scale. Like so: example

    I tried adding axis labels on the right, but I think it gets too busy: example 2

  4. The problems in Andrew's GDP example are only buried, not answered, by SD-standardization. Those problems are (1) the regressor GDP and the outcome (note: not given!) need to be rescaled so that their units are contextually relevant (that means no rescaling of indicators), and (2) those units need to be given in every presentation (except for indicators, since their only sensible unit is 1-0=1). To put it mildly, presenting decontextualized unitless estimates for coefficients involving quantitative regressors or outcomes is an awful habit, and the example is just another illustration of why it's a bad one….

    Instead of trying to avoid context, embrace it. If GDP is rescaled to represent (say) 1000s of year-2000 dollars, and the outcome is neonatal deaths per 1000 live births, the GDP coefficient becomes the change in neonatal deaths per 1000 for every thousand-dollar increase in GDP. If the contextual rendering seems too hard to understand, you either did it wrong or you don't understand the topic well enough to be trusted with doing or interpreting statistical analyses on the topic, let alone trusted to fit regression models about it. But the usual statistical solution seems to be to divide by the standard deviation so we don't have to understand the context and can move on to the next publication instead, secure in the false belief that we dealt with the problems….

    Next up is the claim that SD-standardization is helpful for comparing coefficients across regressors. No it isn't, and for parallel reasons. It just glosses over the inherent noncomparability of qualitatively different quantities, which requires (horrors!) yet more contextual information to cope with well. Which is "more important" to determining gas mileage: car weight or engine displacement? That question is not well answered by SD-standardization, neither at the individual nor the population level. But I'm sure someone will argue that it is (hopefully beyond just saying it is), so I'll take a break and await that….

    In the meantime, see also Greenland, S., Maclure, M., Schlesselman, J. J., Poole, C. and Morgenstern, H. (1991). Standardized regression coefficients: A further critique and a review of alternatives. Epidemiology, 2, 387-392.
    P.S. I consider myself a pragmatist. In science at least, pragmatism doesn't have to mean bowing to convention when the latter is short on common sense.
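
As a rough illustration of the contextual rescaling Sander describes (the variable names, the dollar deflator, and the numbers below are invented for illustration), the point is to pick units that mean something in the problem rather than dividing by a standard deviation:

```python
import pandas as pd

# Hypothetical country-level data, invented purely for illustration.
data = pd.DataFrame({
    "gdp_per_capita_usd2000": [1200.0, 4800.0, 15000.0],
    "neonatal_deaths":        [3200,   1100,   400],
    "live_births":            [90000,  85000,  80000],
})

# Rescale to contextually meaningful units:
data["gdp_thousands"]   = data["gdp_per_capita_usd2000"] / 1000.0                  # 1000s of year-2000 dollars
data["deaths_per_1000"] = 1000.0 * data["neonatal_deaths"] / data["live_births"]   # deaths per 1000 live births

# A coefficient from regressing deaths_per_1000 on gdp_thousands then reads:
# "change in neonatal deaths per 1000 live births, per $1000 increase in GDP."
```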

  5. Sander,

    Thanks for the comments. In reply:

    1. I never rescale indicators. I rescale numerical predictors so they can be interpreted approximately on the same scale as indicators.

    2. I rescale by dividing by two standard deviations, not one.

    3. I'm not bowing to convention; I'm trying to set up a new convention. The existing default, like it or not, is to stick in regression predictors straight out of the box with no rescaling at all.

    4. Again, it's all about the defaults. I'm happy if people want to interpret each coefficient on its original scale, but that's not what I see. Again, go to the example in the blog entry above. I don't think many of us have any intuition about numbers of "secondary school enrollment," but I think we can understand changes from low to high.

    Just to reiterate: the competition here is not rescaling vs. your careful modeling; the competition is between rescaling and nothing.

  6. For what it's worth, I'm closer to Sander's camp on this one, although I've used (and recommended) sd-based rescaling myself on occasion. But it's way better — less error-prone, more interpretable by me, more interpretable by others — to use a rescaling that moves coefficients into ranges where they can be at least roughly compared against each other (as with sd-based rescaling) while leaving them interpretable. If "number of wolf visits to an area per year" is giving a coefficient that is way out of line with the others, I can use "number of wolf visits per day", or "hundreds of wolf visits per year"…or "extra wolf visits per 100 days," if I want to remove a baseline.

    Andrew, allow me to point out that this could be automated just like sd-based rescaling, at least to get you to the right order of magnitude, which is usually enough. You don't have to give up on an automated solution.
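
A rough sketch of the automated order-of-magnitude rescaling suggested in the comment above, assuming a simple nearest-power-of-ten rule (the rule and the names are illustrative, not the commenter's own code):

```python
import numpy as np

def power_of_ten_rescale(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Divide x by the power of ten nearest its standard deviation, so the
    rescaled variable has sd of order 1 but keeps interpretable units
    (e.g. 'hundreds of wolf visits per year' instead of sd units)."""
    scale = 10.0 ** round(float(np.log10(np.std(x))))
    return x / scale, scale

# e.g. visits_rescaled, scale = power_of_ten_rescale(visits_per_year)
# scale == 100.0 would mean the coefficient is "per hundred visits per year".
```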

  7. Andrew said the real world default is to stick in regressors out of the box. I guess that depends on which world we live in, and what we consider real.

    I'm not saying the world of epidemiology is real, but I often see people in that world rescaling regressors to meaningful increments on the original scales:

    the cardiovascular disease relative risk for a 3 cup/day increase in tea consumption (not 1 cup/day)

    the mortality rate difference for a 10-year increase in age (not 1 year)

    the ischemic stroke rate ratio for a 20 mg/dL increase in serum glucose (not 1 mg/dL)

    etc.

    But standard deviations as units of measurement? I am so glad I wandered into a research field where multiplying by sd(x), dividing by sd(y), or both is frowned upon. I guess that means I have to frown twice as hard upon dividing by 2sd(x)!

    Here's an example, from a meta-analysis of studies comparing the mean volume of the left hippocampus between PTSD patients and controls:

    Study A:
    mean difference = -113 mm^3
    sd(y) = 187 mm^3
    conventional effect size = -0.60
    Andrew's effect size = -0.30

    Study B:
    mean difference = -300 mm^3
    sd(y) = 494 mm^3
    conventional effect size = -0.61
    Andrew's effect size = -0.30

    Study C:
    mean difference = -300 mm^3
    sd(y)= 252 mm^3
    conventional effect size = -1.19
    Andrew's effect size = -0.60

    I would just have a very hard time using "real" to describe a world in which the estimated "effect size" is approximately equal in studies A and B and twice as big in study C. To me, the estimated effects look identical in B and C and less than half as big in A.

    If I were a PTSD patient (actually, I was one, for a while, a long time ago), and the left hemisphere of my hippocampus had been shrunk by 300 mm^3, and somebody told me that effect would be only half as big if I were in study B instead of study C, we'd be talking about who really needs to get whose head examined!
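
For reference, the arithmetic behind the numbers in the comment above: the conventional effect size divides the mean difference by sd(y), and the two-sd version simply halves it.

```python
# Mean differences and sd(y) as quoted above, in mm^3.
studies = {"A": (-113, 187), "B": (-300, 494), "C": (-300, 252)}

for name, (diff, sd_y) in studies.items():
    conventional = diff / sd_y        # divide by one sd of the outcome
    two_sd       = diff / (2 * sd_y)  # divide by two sd's instead
    print(name, round(conventional, 2), round(two_sd, 2))
# A -0.6 -0.3
# B -0.61 -0.3
# C -1.19 -0.6
```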

  8. Charlie,

    I think that people may be more sensitive to scaling in epidemiology than in social science. Most of the regressions I've seen, even by top researchers, tend to use predictors on their raw scale (e.g., GDP in dollars), for which it's hard to visualize a change of 1 unit. I'm glad to hear that things are better in epidemiology.

    In my Statistics in Medicine paper (linked to above), I give recent examples from one leading journal in medicine and one in economics where variables were not scaled in any way that makes the predictors particularly interpretable.

    Regarding scaling by one or two standard deviations: the point is that the 2sd scaling puts things on a scale comparable to 0/1 binary predictors (which, if p is anywhere near 0.5, have sd's of about 1/2). If you're used to binary predictors such as male/female (as we are in the social sciences), I find it helpful to scale the continuous predictors in a comparable way.

    An alternative (suggested by Ted Dunning here) is to code binary predictors as -1/1 and then divide continuous predictors by 1 sd. This is arguably a better choice; I chose 0/1 and 2sd because this seemed more consistent with what I'm used to seeing in applied research.

    Finally, I agree that standardized coefficients cannot be directly compared between studies when the range of the predictors varies from study to study. I do mention this point in our paper (see the second paragraph of Section 1.1) but it is certainly a point worth repeating.

    In general, I don't think a rescaled analysis is always the right way to end an analysis but it can be helpful when starting, as an automatic way to scale things. I'm often working with data where some variables are on a 1-5 scale, others are on a 1-7 scale, etc., and it's good to be able to do that quick rescaling to get things started. Certainly in your meta-analysis example, if you were going to do rescaling, it seems that you'd want a common rescaling for all three studies.
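
To make the binary-predictor arithmetic in this reply concrete, here is a small sketch with simulated data (names and numbers invented): a 0/1 indicator with p near 0.5 has sd = sqrt(p(1-p)), about 0.5, so a 0-to-1 change is roughly a two-sd change, and the two coding conventions line up as described.

```python
import numpy as np

rng = np.random.default_rng(0)

z = rng.binomial(1, 0.5, size=10_000)      # 0/1 indicator, p near 0.5
print(z.std())                             # ~0.5, i.e. sqrt(0.5 * 0.5)

x = rng.normal(50.0, 10.0, size=10_000)    # made-up continuous predictor

# Convention (a): keep the indicator as 0/1, divide x by two sd's.
x_a = (x - x.mean()) / (2 * x.std())

# Convention (b): recode the indicator as -1/+1, divide x by one sd.
# Coefficients under (b) come out half the size of those under (a),
# but the binary-vs-continuous comparison is preserved either way.
z_b = 2 * z - 1
x_b = (x - x.mean()) / x.std()
```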
