Intractable is as intractable does: Bayesian progress during the Dark Ages

I just reviewed the second edition of Jeff Gill’s book for Amazon (5 stars). It’s a fun book: the tone is a sort of theoretically-minded empiricism that is hard for me to characterize exactly but that strikes me as a style of writing, and of thinking, that will resonate with the social science readership. It’s great to see this sort of book that really puts a lot of thought into the “why” as well as the “how” (which is what we statisticians tend to focus on).

I gave comments on an earlier draft, and I’ll put those comments below, but first I wanted to rant a bit about Bayesian methods before the Gibbs sampler came into play.

Some (reconstructed) history

Given that Gill does talk about history, I would’ve liked to have seen a bit more discussion of the applied Bayesian work in the “dark ages” between Laplace/Gauss in the early 1800s and the use of the Gibbs sampler and related algorithms in the late 1980s. In particular, there is the work of Henderson et al., who used these methods in animal breeding (and, for that matter, Fisher himself thought Bayesian methods were fine when they were used in actual multilevel settings where the “prior distribution” corresponded to an actual, observable distribution of entities rather than a mere subjective statement of uncertainty); Lindley and Smith; Dempster, Rubin, and their collaborators (who did sophisticated pre-Gibbs-sampler work, published in JASA and elsewhere, applying Bayesian methods to educational data); and, I’m sure, others. Also, in parallel, there was the theoretical work by Box, Tiao, Stein, Efron, Morris, and others on shrinkage estimation and robustness. These statisticians and scientists worked their butts off getting applied Bayesian methods to work _before_ the new computational methods were around and, in doing so, motivated the development of said methods and actually developed some of these methods themselves. Writing that these methods, “while superior in theoretical foundation, led to mathematical forms that were intractable,” is a bit unfair. Intractable is as intractable does, and the methods of Box, Rubin, Morris, etc. worked. The Gibbs sampler etc. took the methods to the next level (more people could use the methods with less training, and the experts could fit more sophisticated models), but Bayesian statistics was more than a theoretical construct back in 1987 or whenever.

Miscellaneous comments

OK, now here are various comments I made on the pre-publication draft. Changes have been made so some of the page numbers may be wrong, and maybe various things got fixed. Still, these comments might provoke further thoughts if you buy or teach out of the book.

Chapter 1: On page 6, it might be worth pointing out that in actual psych experiments, people’s probability functions are typically subadditive (P(A) + P(not-A) < 1) or superadditive. Probability theory is normative but not descriptive.

p.30: I'd prefer the more unified notation p() rather than f(), pi(), L() etc. In the Bayesian world, all these are simply probabilities.

Section 1.8: You might want to refer readers to Appendix C of Bayesian Data Analysis, where we have a bunch of Bugs code and R code for Gibbs and Metropolis. The appendix is self-contained (and on the web, actually) so it might be a good place for people to get a sense of the way to code these things up in R.

p.45: I'd just say "Bayesian intervals" rather than "Bayesian credible intervals". Who needs one more sloppy term (in this case, the vague "credible")?

p.52: Is this algebra really needed here? It looks like such a mess!

Section 2.4: We can also do Bayesian learning by finding out that a model does not fit the data. This doesn't get discussed much!

p.63: I'm a Bayesian and I think parameters are fixed by nature. But I don't know them, so I model them using random variables.

Chapter 3: I like that they give the t model a proud place. I'd also suggest the robit model, which is the generalization of probit with a latent t distribution.

p.80: When Sigma is unknown, you can do better than the Wishart. Try the scaled-inverse-Wishart (see Gelman and Hill for details).

Chapter 4: I don't have much to say here except that some plots of raw data and regression lines would help.

p.124: I'll say this only one more time: the big table here is unreadable. Just put the data on the web and give a link!

p.124, bottom: There are big problems with the so-called diffuse inverse-gamma prior; see Gelman, A. (2006), Prior distributions for variance parameters in hierarchical models, Bayesian Analysis 1, 515-533. Also see p.385 for another example of these priors that should be changed, I think.

p.127: This is ugly; you should use vector-matrix notation here.

Chapter 5: This is fine but would be much better if there were an explicit connection to hierarchical models, where the parameters of the "prior distribution" are fitted and estimated from data. Also of interest would be the examples in Chapter 1 of Bayesian Data Analysis where we come up with prior distributions (for football scores and record-linkage matches) using data analysis rather than vague "elicitation".

Section 5.5: I know that elicitation looks cool but I don't buy it, and I don't think the examples cited here are very realistic. I think this is part of the older, theoretical tradition of Bayesian analysis, and I recommend removing this section.

Chapters 6 and 7: I'm very disappointed that there isn't a discussion or example here of the approach of plotting the data and plotting replicated data under the model, and checking to see how they differ. The philosophical discussions and mathematics in the chapter are fine, but for social science, I think that serious model checking is important, and it's crucial in presenting Bayesian methods to make the connection between the modeling and the actual fit to data. Otherwise you're just holding your nose and diving deep into the model without an appropriate method for evaluating skeptical claims.

Section 7.2.3: Take a look at the decision theory chapter of Bayesian Data Analysis for a much more applied take on this (which, I argue, is far more relevant to social science). Again, do what you want, but take a look and think hard about this before just presenting the conventional take on it.

Section 7.6: This belongs in a computation chapter; it's out of place here in chapter 7.

Chapter 8: This is fine, but I'd use theta rather than X as your random variable. That's more consistent with what you're doing in Bayesian inference. Also, we're generally getting posterior simulations and intervals, not posterior means, so (8.1) is a bit misleading.

Figure 8.6: This is fine, but as a statistician I'd like to see the same creativity in plotting data as you use in plotting these distributions!

p.310: Way too much historical detail here about a particular technical method! Why should the social scientist (or anyone, really) care that "Zangwill's textbook provides a critical proof concerning conditions for monotonic convergence" etc.? One or two references would be enough!

p.386: A parameter estimate of "96323.24"? Perhaps something needs to be rescaled!!

p.393: This would help if you plotted two chains on top of each other.

p.451, etc.: These would be much clearer if you plotted 2 or 3 chains together.

Conclusion: This is a thoughtful and thought-provoking book, focusing more on priors, motivation, model evaluation, and computation, and less on the nuts-and-bolts of constructing and fitting models. As such, it fits in very well with existing books such as ours that focus more on the models.
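For readers who want a concrete picture of the replicated-data check mentioned in the comment on Chapters 6 and 7, here is a minimal sketch in Python. It is not taken from Gill's book or from Bayesian Data Analysis. Everything in it is an assumption made for illustration: the toy model (normal data with known standard deviation and a flat prior on the mean), the made-up data, the choice of test quantity (the largest absolute observation), and the function name `replicated_max_share`. It also reports a numerical summary, where the comment above would rather have you plot the data next to a few replications.

```python
import random
import statistics


def replicated_max_share(y, sigma, n_rep=2000, seed=0):
    """Share of replicated datasets whose max |y_rep| is at least max |y|.

    Assumed toy model: y_i ~ Normal(mu, sigma) with sigma known and a flat
    prior on mu, so that mu | y ~ Normal(mean(y), sigma / sqrt(n)).
    """
    rng = random.Random(seed)
    n = len(y)
    ybar = statistics.fmean(y)
    t_obs = max(abs(v) for v in y)  # test quantity for the observed data
    exceed = 0
    for _ in range(n_rep):
        # Draw mu from its posterior, then a full replicated dataset given mu.
        mu = rng.gauss(ybar, sigma / n ** 0.5)
        y_rep = [rng.gauss(mu, sigma) for _ in range(n)]
        if max(abs(v) for v in y_rep) >= t_obs:
            exceed += 1
    return exceed / n_rep


# Hypothetical data with one wild observation that the model cannot explain.
y_outlier = [0.1, -0.4, 0.3, 1.2, -0.8, 0.5, -0.2, 9.0]
print(replicated_max_share(y_outlier, sigma=1.0))
```

A share close to 0 says that replications from the fitted model almost never look as extreme as the data on this quantity, which is the kind of misfit the check is meant to expose.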

3 thoughts on “Intractable is as intractable does: Bayesian progress during the Dark Ages”

  1. Thanks Andy for the insightful comments and nice words. A couple of respectful responses.

    1. I did some research into the "dark ages" and it is really hard to find much material. In fact, what one can find is hard to make applicable. It's hard to argue, though, with the assertion that these folks had a hard time keeping the flame alive. Of course we could add Fisher to that list since fiducial inference is really a special case of Bayesian inference (this may make him roll over in his grave, pipe in hand).

    2. The list of issues above was from your reading of an earlier manuscript edition. Great comments. I implemented most of them and disagreed with only a few.

  2. Jeff,

    I guess what I'm saying is, during the Dark Ages, the monks didn't just keep the flame alive, they advanced the ball as well (to mix metaphors). One reason why the Gibbs sampler revolution happened so fast is that people were already fitting hierarchical models, just with difficulty. For example, my 1990 JASA paper with King fit a hierarchical mixture model using informative prior distributions. We used the Gibbs sampler (or, as we called it, the data augmentation algorithm of Tanner and Wong), but I already knew this was the model we wanted to fit, thanks to the work of Lindley, Dempster, Rubin, etc., showing the effectiveness of hierarchical models with small numbers of observations per group. (We actually had only one observation per group, which would seem to make the model impossible to fit, but we estimated variance parameters from an external analysis.)
