Richard Berk’s book on regression analysis

I just finished reading Dick Berk’s book, “Regression analysis: a constructive critique” (2004). It was a pleasure to read, and I’m glad to be able to refer to it in our forthcoming book. Berk’s book has a conversational format and talks about the various assumptions required for statistical and causal inference from regression models. I was disappointed that the book used fake data–Berk discussed a lot of interesting examples but then didn’t follow up with the details. For example, Section 2.1.1 brought up the Donohue and Levitt (2001) example of abortion and crime, and I was looking forward to Berk’s more detailed analysis of the problem–but he never returned to the example later in the book. I would have learned more about Berk’s perspective on regression and causal inference if he were to apply it in detail to some real-data examples. (Perhaps in the second edition?)

I also had some miscellaneous comments:

p. xiv: Berk writes, “We could all sit at our desks and perform hypothetical experiments in our heads all day, and science would not advance one iota.” This is true of most of us, I’m sure . . . but Einstein advanced science by his hypothetical experiments on relativity theory. So it is possible!

p.19 has a nice quote: “If the goal is to do more than describe the data on hand, information must be introduced that cannot be contained in the data themselves.”

Figure 3.3: This is described as a “bad fit,” but it’s just a case of high residual variance. The model fits fine, but sigma is large. Perhaps a distinction worth making. (More generally, I like that Berk focuses on the mean–the deterministic part of the regression model–rather than the errors. Most statistics texts seem to make the mistake of talking on and on about the distribution of the errors and the variance function, but it’s the deterministic part that’s generally most important.)
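
To illustrate the distinction, here’s a quick simulation (hypothetical numbers, not the data behind Figure 3.3): when the deterministic part of the model is right but sigma is large, least squares still recovers the line; the scatter is just wide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from a correctly specified linear model with a large error sd.
n, a, b, sigma = 200, 1.0, 2.0, 10.0          # sigma is deliberately big
x = rng.uniform(0, 10, size=n)
y = a + b * x + rng.normal(0, sigma, size=n)

# Ordinary least squares with design matrix [1, x].
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma_hat = resid.std(ddof=2)

print(beta_hat)   # close to (1, 2): the deterministic part is estimated fine
print(sigma_hat)  # close to 10: the residual variance is simply large
```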

I also like that in Section 3.6, Berk presents transformations as a way to get linearity (not normality or equal-variance, which are typically much less important). Again, an important practical point that is lost in many more mathematical books.
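
Here’s a small made-up example of the same point: if y is linear in log(x) rather than in x, transforming the predictor is what improves the fit, quite apart from any concerns about normality or equal variances.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data in which y is linear in log(x), not in x itself.
x = rng.uniform(1, 100, size=500)
y = 3 + 2 * np.log(x) + rng.normal(0, 0.5, size=500)

def r_squared(pred, obs):
    """Proportion of variance explained, just for a rough comparison."""
    return 1 - np.var(obs - pred) / np.var(obs)

for label, predictor in [("raw x", x), ("log x", np.log(x))]:
    X = np.column_stack([np.ones_like(predictor), predictor])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(label, r_squared(X @ beta, y))   # the log scale fits noticeably better
```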

Figure 4.5: I don’t really understand this picture: it’s a plot of (hypothetical?) data of students’ grade point averages vs. SAT scores, and the discussion says that “larger positive errors around the line tend to have larger SAT scores.” But I don’t understand what is meant by “errors” here or why the regression line is not drawn to go through the data.

p.68: Berk writes, “The null hypothesis is either true or false.” Actually, I’d go further: in any problem I’ve ever worked on, the null hypothesis is false. Mathematically, the null hypothesis in regression is that some beta equals 0, and in social and environmental science, the true beta (as would be seen by gathering data from a very large population) is never exactly zero. I do think that hypothesis testing can be useful–for example, it can tell you that you can be very sure that beta>0 or that beta<0, and whether the data are sufficient to estimate beta precisely--but we know ahead of time that beta != 0.

Sections 5.2.2 and 10.5.1: There's an extensive discussion here of "response schedules," but I don't quite understand what's being said. A full example with data would help. On p. 92, there's a discussion of estimating the effects of prison sentences, and Berk seems to be saying that this can't be done because you can't simultaneously manipulate the length of sentence and the age at which a prisoner is released. But I don't see why this is a problem: say, for example, that you have a bunch of 20-year-old prisoners, and through a randomized intervention, some are released at age 25 and some at age 40. You can look at a bivariate outcome: crimes committed at ages 25-40, and crimes committed at ages 40+. The treatment will have a clear effect on the first outcome (of course, there are cost-benefit issues as to whether the treatment is worth it), but you can certainly compare the crimes at age 40+ for the two groups (a toy version of this comparison is sketched below).

Chapter 5 has lots of discouraging examples. It would be good to see some success stories. (Parochially, I can point to this and this as particularly clean examples of causal inference from observational data, but lots more is out there.) I also think the discussion of models would be strengthened by some discussion of interactions (in the causal setting, that would correspond to treatments that are more effective for some groups than others). This is also an active research area (see here).
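
To be concrete about the prison-sentence example, here’s a toy simulation with made-up numbers, just to show that the age-40+ comparison is an ordinary randomized comparison:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical randomized experiment: 20-year-old prisoners released either
# at age 25 (group 0) or at age 40 (group 1).  All numbers are made up.
n = 1000
release_at_40 = rng.integers(0, 2, size=n)          # randomized assignment

# Simulated counts of crimes committed after age 40, with some treatment effect.
crimes_after_40 = rng.poisson(lam=np.where(release_at_40 == 1, 0.8, 1.2))

# Because assignment is randomized, a simple difference in means estimates
# the causal effect on the age-40+ outcome.
effect = (crimes_after_40[release_at_40 == 1].mean()
          - crimes_after_40[release_at_40 == 0].mean())
print(effect)   # close to 0.8 - 1.2 = -0.4 in this simulation
```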

p.99: In the discussion of matching, it would unify things to point out that matching followed by regression can be more effective than either alone (this was Don Rubin’s Ph.D. thesis, published in article form as Rubin, 1973).
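
Here’s a rough sketch of that point using simulated data (nothing here comes from the book or from Rubin’s paper): when treatment assignment depends on a covariate and the outcome is nonlinear in it, linear regression adjustment alone on the unbalanced sample is somewhat biased, but matching first and then running the same regression on the matched sample does better.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical observational data: treatment depends on x, and the outcome
# is nonlinear in x, so a linear adjustment on the full sample is somewhat biased.
n = 2000
x = rng.normal(0, 1, size=n)
treat = rng.binomial(1, 1 / (1 + np.exp(-1.5 * x)))       # confounded assignment
y = 2 * treat + np.exp(x) + rng.normal(0, 1, size=n)       # true effect = 2

def ols_treatment_coef(y, treat, x):
    """Coefficient on the treatment indicator from OLS of y on (1, treat, x)."""
    X = np.column_stack([np.ones_like(x), treat, x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Regression adjustment alone, on the whole sample.
print("regression only:      ", ols_treatment_coef(y, treat, x))

# Nearest-neighbor matching on x (with replacement), then the same regression
# on the matched sample; typically closer to the true effect of 2 here.
treated = np.where(treat == 1)[0]
controls = np.where(treat == 0)[0]
matches = np.array([controls[np.argmin(np.abs(x[controls] - x[i]))] for i in treated])
keep = np.concatenate([treated, matches])
print("matching + regression:", ols_treatment_coef(y[keep], treat[keep], x[keep]))
```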

In Chapter 8 there’s some discussion of stepwise regression, etc. It would also be helpful to discuss other methods of combining predictors, for example adding them up to create “scores.” Also, when mentioning categorical predictors, that’s a good place to put a pointer to multilevel models.
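
As a trivial illustration of the “scores” idea (invented data): standardize a set of related predictors and add them up, then use the single score as one predictor rather than letting stepwise selection pick among them.

```python
import numpy as np

rng = np.random.default_rng(4)

# Five hypothetical related measurements, combined into one score.
predictors = rng.normal(size=(500, 5))
z = (predictors - predictors.mean(axis=0)) / predictors.std(axis=0)
score = z.sum(axis=1)   # enters a regression as a single column
```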

I agree with Berk’s skepticism in Chapter 9 about traditional “regression diagnostics.” In my experience, outliers and nonnormality are not the key concerns, and what’s more important is to get a sense of what the model is doing and how it is being fit to the data.

In Chapter 10, he refers to multilevel modeling as “relatively recent.” Actually, it’s been around since the early 1950s in animal breeding and since the early 1970s in education. These are two fields where one encounters many small replicated datasets.

I also think Berk is too skeptical about multilevel models. I think he needs to apply equal skepticism to the alternative, which is to include categorical predictors and fit by least squares. This least-squares alternative has bad statistical properties, makes it difficult to fit varying slopes and include group-level predictors, becomes even messier when fitting logistic regressions to sparse data, and is in fact a special case of multilevel modeling where the group-level variance is set to infinity. So, yes, I agree that multilevel modeling does not solve the problem of causality (see here), but it can be pretty useful for model fitting.
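
To make that last point concrete, here’s a minimal partial-pooling sketch on simulated grouped data, with the variance components treated as known just to keep it short; the least-squares-with-indicators estimate is the raw group mean, which is exactly what the multilevel estimate becomes as the group-level variance goes to infinity.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical grouped data: 20 groups, many of them small.
n_groups = 20
true_alpha = rng.normal(0, 1, size=n_groups)                  # true group effects
group_n = rng.integers(2, 30, size=n_groups)
groups = np.repeat(np.arange(n_groups), group_n)
y = true_alpha[groups] + rng.normal(0, 2, size=groups.size)   # within-group sd = 2

ybar = np.array([y[groups == j].mean() for j in range(n_groups)])
grand_mean = y.mean()
sigma_y, sigma_alpha = 2.0, 1.0    # treated as known here for brevity

# No pooling (least squares with group indicators): the raw group means.
no_pooling = ybar

# Partial pooling: precision-weighted compromise between each group mean and
# the grand mean; letting sigma_alpha -> infinity recovers the raw means.
w = group_n / sigma_y**2
partial_pooling = (w * ybar + grand_mean / sigma_alpha**2) / (w + 1 / sigma_alpha**2)

print(np.abs(no_pooling - true_alpha).mean())       # typically the larger error
print(np.abs(partial_pooling - true_alpha).mean())  # typically the smaller error
```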

I have similar comments for Berk’s discussion of meta-analysis. Numerical and graphical combination of information can be helpful, and multilevel meta-analysis is a way of doing this and separating out the different sources of variation as they appear in the data.
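
As a small illustration (the study estimates below are invented), here’s the simplest version of that idea, a random-effects (DerSimonian-Laird) combination that separates within-study from between-study variation; a full multilevel meta-analysis would estimate the same pieces inside a regression model.

```python
import numpy as np

# Made-up study-level estimates and standard errors.
est = np.array([0.30, 0.10, 0.45, 0.20, -0.05])
se  = np.array([0.12, 0.20, 0.15, 0.10, 0.25])

w = 1 / se**2                                   # fixed-effect weights
fixed = np.sum(w * est) / np.sum(w)
Q = np.sum(w * (est - fixed)**2)                # heterogeneity statistic
k = len(est)
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

w_re = 1 / (se**2 + tau2)                       # random-effects weights
pooled = np.sum(w_re * est) / np.sum(w_re)
pooled_se = np.sqrt(1 / np.sum(w_re))
print(pooled, pooled_se, tau2)                  # combined estimate, its se, and the between-study variance
```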

Finally, there’s a quote on page 204 disparaging the method (which I like) of hypothesizing a model, and then when it is rejected by the data, of replacing or improving the model. I think the iteration of modeling/fitting/checking/re-modeling is extremely useful (here, I’m influenced by the writings of Jaynes and Box, as well as my own experiences). The quote says something about how if you “stick your neck out” to assume a model, then your head will be cut off. But I don’t think that’s quite right. I’ll make an assumption, knowing that it’s false, and be ready to replace or refine it as indicated by the data.

4 thoughts on “Richard Berk’s book on regression analysis”

  1. "Finally, there's a quote on page 204 disparaging the method (which I like) of hypothesizing a model, and then when it is rejected by the data, of replacing or improving the model."

    I have to agree. I'm in credit modeling, myself, and you have to make some starting assumptions about what the basic model looks like or you'll get buried by spurious correlations in the process of mining the data. :^)

  2. hi professor gelman,

    thanks for a very thought-provoking blog posting. your post prompts me to ask you something i've been wondering about ever since i began learning about NON-regression-based approaches to causal inference: namely, why do virtually all statistically-oriented political scientists think that regression-based/MLE methods are giving them the correct answers in observational settings? after all, we have long known (since at least the Rubin/Cochran papers of the 1970s) that regression is often (and quite possibly *generally*) unreliable in observational settings.

    Do we have a single example of a non-trivial observational dataset wherein we can show that regression analysis produces the result that would have been obtained in a randomized experiment? We have lots of examples that show regression fails this test (here i'm thinking of dehejia/wahba/lalonde, etc.). where is the definitive empirical success story? there should be many success stories, given the universality of the methodology – but i don't know of a single one.

    in your blog, you write:

    "(Parochially, I can point to this [link to gelman paper] and this [link to gelman paper] as particularly clean examples of causal inference from observational data, but lots more is out there.)"

    i do not doubt that your linked papers (which i have not read) are excellent and rigorous examples of applied regression analysis. but my question is, how is it that these papers validate a regression-based approach to causal inference? what do you know of the "correct answer" in these cases, aside from your regression-based estimates?

    best regards,

    alexis diamond

  3. Thanks for the interesting and thorough review.

    One of my teachers suggested that you split the data into thirds. One third to suggest the model, one third to fit the coefficients of the model, and a third to test.
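
    For concreteness, one way to do that split (purely illustrative; the proportions are arbitrary):

    ```python
    import numpy as np

    # Shuffle row indices and cut them into three parts:
    # one to suggest the model, one to fit its coefficients, one to test.
    rng = np.random.default_rng(6)
    n = 900
    idx = rng.permutation(n)
    suggest, fit, test = np.split(idx, [n // 3, 2 * n // 3])
    ```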

    Typos?
    hypthetical?
    disucssion?
    altermative?

  4. i just started reading this book, and quite like it. so pleased to see that you had reviewed it.

    cosma shalizi also recommends it. (not a proper cosma review, but an "algae" recommendation.)
    http://cscs.umich.edu/~crshalizi/weblog/algae-200

    i agree that i'd like to see more detailed analysis of the datasets which berk describes. but i think the fake data/hypothetical examples can have value in illustrating concepts, without becoming bogged down in the messiness of real data, which may obscure his point. that is my guess anyway for the use of fake data.
