Bayesian statistics then and now

The following is a discussion of articles by Brad Efron and Rob Kass, to appear in the journal Statistical Science. I don’t really have permission to upload their articles, but I think (hope?) this discussion will be of general interest and will motivate some of you to read Efron’s and Kass’s articles when they come out. (And thanks to Jimmy and others for pointing out typos in my original version!)

It is always a pleasure to hear Brad Efron’s thoughts on the next century of statistics, especially considering the huge influence he’s had on the field’s present state and future directions, both in model-based and nonparametric inference.

Three meta-principles of statistics

Before going on, I’d like to state three meta-principles of statistics which I think are relevant to the current discussion.

First, the information principle, which is that the key to a good statistical method is not its underlying philosophy or mathematical reasoning, but rather what information the method allows us to use. Good methods make use of more information. This can come in different ways: in my own experience (following the lead of Efron and Morris, 1971, among others), hierarchical Bayes allows us to combine different data sources and weight them appropriately using partial pooling. Other statisticians find parametric Bayes too restrictive: in practice, parametric modeling typically comes down to conventional models such as the normal and gamma distributions, and the resulting inference does not take advantage of distributional information beyond the first two moments of the data. Such problems motivate more elaborate models, which raise new concerns about overfitting, and so on.
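To fix ideas, here is a minimal sketch of the precision-weighted partial pooling I have in mind, in the simplest normal-normal setting; the estimates, standard errors, and hyperparameters are all made up for illustration:

```python
# Minimal sketch of partial pooling in a normal-normal model.
# The group estimates, standard errors, and hyperparameters are invented.
import numpy as np

y = np.array([2.5, 7.0, -1.0, 4.0])     # noisy estimates from four data sources
sigma = np.array([1.0, 3.0, 6.0, 2.0])  # their standard errors
mu, tau = 3.0, 2.0                      # assumed population mean and sd

# Each source is weighted by its precision relative to the population precision:
w = (1 / sigma**2) / (1 / sigma**2 + 1 / tau**2)
pooled = w * y + (1 - w) * mu
print(pooled)  # precise sources stay close to their own data;
               # noisy sources are pulled toward the population mean
```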

As in many areas of mathematics, theory and practice leapfrog each other: as Efron notes, empirical Bayes methods have made great practical advances but “have yet to form into a coherent theory.” In the past few decades, however, with the work of Lindley and Smith (1972) and many others, empirical Bayes has been folded into hierarchical Bayes, which is part of a coherent theory that includes inference, model checking, and data collection (at least in my own view, as represented in chapters 6 and 7 of Gelman et al, 2003). Other times, theoretical and even computational advances lead to practical breakthroughs, as Efron illustrates in his discussion of the progress made in genetic analysis following the Benjamini and Hochberg paper on false discovery rates.

My second meta-principle of statistics is the methodological attribution problem, which is that the many useful contributions of a good statistical consultant, or collaborator, will often be attributed to the statistician’s methods or philosophy rather than to the artful efforts of the statistician himself or herself. Don Rubin has told me that scientists are fundamentally Bayesian (even if they don’t realize it), in that they interpret uncertainty intervals Bayesianly. Brad Efron has talked vividly about how his scientific collaborators find permutation tests and p-values to be the most convincing form of evidence. Judea Pearl assures me that graphical models describe how people really think about causality. And so on. I’m sure that all these accomplished researchers, and many more, are describing their experiences accurately. Rubin wielding a posterior distribution is a powerful thing, as is Efron with a permutation test or Pearl with a graphical model, and I believe that (a) all three can be helping people solve real scientific problems, and (b) it is natural for their collaborators to attribute some of these researchers’ creativity to their methods.

The result is that each of us tends to come away from a collaboration or consulting experience with the warm feeling that our methods really work, and that they represent how scientists really think. In stating this, I’m not trying to espouse some sort of empty pluralism–the claim that, for example, we’d be doing just as well if we were all using fuzzy sets, or correspondence analysis, or some other obscure statistical method. There’s certainly a reason that methodological advances are made, and this reason is typically that existing methods have their failings. Nonetheless, I think we all have to be careful about inferring too much from our collaborators’ and clients’ satisfaction with our methods.

My third meta-principle is that different applications demand different philosophies. This principle comes up for me in Efron’s discussion of hypothesis testing and the so-called false discovery rate, which I label as “so-called” for the following reason. In Efron’s formulation (which follows the classical multiple comparisons literature), a “false discovery” is a zero effect that is identified as nonzero, whereas, in my own work, I never study zero effects. The effects I study are sometimes small but it would be silly, for example, to suppose that the difference in voting patterns of men and women (after controlling for some other variables) could be exactly zero. My problems with the “false discovery” formulation are partly a matter of taste, I’m sure, but I believe they also arise from the difference between problems in genetics (in which some genes really have essentially zero effects on some traits, so that the classical hypothesis-testing model is plausible) and in social science and environmental health (where essentially everything is connected to everything else, and effect sizes follow a continuous distribution rather than a mix of large effects and near-exact zeroes).

To me, the false discovery rate is the latest flavor-of-the-month attempt to make the Bayesian omelette without breaking the eggs. As such, it can work fine when the implicit prior is reasonable, and it can be a great method, but I really don’t like it as an underlying principle, since it is formally based on a hypothesis-testing framework that, to me, is more trouble than it’s worth. In thinking about multiple comparisons in my own research, I prefer to discuss errors of Type S and Type M rather than Type 1 and Type 2 (Gelman and Tuerlinckx, 2000; Gelman and Weakliem, 2009; Gelman, Hill, and Yajima, 2009). My point here, though, is simply that any given statistical concept will make more sense in some settings than others.
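For readers who haven’t seen the Type S / Type M terminology: a Type S error is a statistically significant estimate with the wrong sign, and the Type M problem is the factor by which significant estimates exaggerate the true effect. Here is a rough simulation sketch with made-up numbers (a small true effect measured with a lot of noise), not taken from any of the cited papers:

```python
# Rough simulation of Type S (wrong sign) and Type M (magnitude) errors among
# estimates that reach statistical significance; the true effect size and the
# standard error are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
true_effect, se = 0.1, 1.0                  # small true effect, noisy estimate
est = rng.normal(true_effect, se, 100_000)  # repeated-sampling estimates
signif = np.abs(est) > 1.96 * se            # "statistically significant" results

type_s = np.mean(np.sign(est[signif]) != np.sign(true_effect))  # wrong sign
type_m = np.mean(np.abs(est[signif])) / abs(true_effect)        # exaggeration
print(f"Type S rate among significant results: {type_s:.2f}")
print(f"Type M (exaggeration) ratio: {type_m:.1f}")
# With this much noise, a large share of the significant results have the wrong
# sign, and the significant estimates greatly overstate the true effect.
```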

For another example of how different areas of application merit different sorts of statistical thinking, consider Rob Kass’s remark: “I tell my students in neurobiology that in claiming statistical significance I get nervous unless the p-value is much smaller than .01.” In political science, we’re typically not aiming for that level of uncertainty. (Just to get a sense of the scale of things, there have been barely 100 national elections in all of U.S. history, and political scientists studying the modern era typically start in 1946.)

Progress in parametric Bayesian inference

I also think that Efron is doing parametric Bayesian inference a disservice by focusing on a fun little baseball example that he and Morris worked on 35 years ago. If he would look at what’s being done now, he’d see all the good statistical practice that, in his section 10, he naively (I think) attributes to “frequentism.” Figure 1 illustrates with a grid of maps of public opinion by state, estimated from national survey data. Fitting this model took a lot of effort, which was made possible by working within a hierarchical regression framework–“a good set of work rules,” to use Efron’s expression. Similar models have been used recently to study opinion trends in other areas, such as gay rights, in which policy is made at the state level, so that we want to understand opinions by state as well (Lax and Phillips, 2009).

I also completely disagree with Efron’s claim that frequentism (whatever that is) is “fundamentally conservative.” One thing that “frequentism” absolutely encourages is for people to use horrible, noisy estimates out of a fear of “bias.” More generally, as discussed by Gelman and Jakulin (2007), Bayesian inference is conservative in that it goes with what is already known, unless the new data force a change. In contrast, unbiased estimates and other unregularized classical procedures are noisy and get jerked around by whatever data happen to come by–not really a conservative thing at all. To make this argument more formal, consider the multiple comparisons problem. Classical unbiased comparisons are noisy and must be adjusted to avoid overinterpretation; in contrast, hierarchical Bayes estimates of comparisons are conservative (when two parameters are pulled toward a common mean, their difference is pulled toward zero) and less likely to appear to be statistically significant (Gelman and Tuerlinckx, 2000).
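To make the shrinkage arithmetic concrete, here is a tiny numerical sketch (invented numbers, not taken from Gelman and Tuerlinckx):

```python
# Sketch of why hierarchical shrinkage is conservative for comparisons: when
# two noisy estimates are pulled toward a common mean, their difference is
# pulled toward zero. All numbers are invented for illustration.
import numpy as np

y = np.array([5.0, -1.0])  # two raw (unbiased) estimates
sigma = 3.0                # common standard error
mu, tau = 2.0, 2.0         # assumed group-level mean and sd

w = (1 / sigma**2) / (1 / sigma**2 + 1 / tau**2)
shrunk = w * y + (1 - w) * mu

print(y[0] - y[1], shrunk[0] - shrunk[1])  # raw difference 6.0 vs about 1.8
```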

Another way to understand this is to consider the “machine learning” problem of estimating the probability of an event on which we have very little direct data. The most conservative stance is to assign a probability of ½; the next most conservative approach might be to use some highly smoothed estimate based on averaging a large amount of data; and the unbiased estimate based on the local data is hardly conservative at all! Figure 1 illustrates our conservative estimate of public opinion on school vouchers. We prefer this to a noisy, implausible map of unbiased estimators.
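In code, the three stances might look something like this (a toy sketch; the local counts, the broader-data rate, and its weight are all invented):

```python
# Three estimates of an event probability from very little direct data
# (2 "yes" responses out of 3 local observations; all numbers invented):
# the maximally conservative 1/2, a heavily smoothed estimate that leans on a
# larger pool of data, and the unbiased estimate from the local data alone.
n_local, y_local = 3, 2
pool_rate, pool_weight = 0.45, 100   # assumed broader-data rate and its weight

conservative = 0.5
smoothed = (y_local + pool_weight * pool_rate) / (n_local + pool_weight)
unbiased = y_local / n_local

print(conservative, round(smoothed, 3), round(unbiased, 3))  # 0.5, 0.456, 0.667
```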

Of course, frequentism is a big tent and can be interpreted to include all sorts of estimates, up to and including whatever Bayesian thing I happen to be doing this week–to make any estimate “frequentist,” one just needs to do whatever combination of theory and simulation is necessary to get a sense of the method’s performance under repeated sampling. So maybe Efron and I are in agreement in practice, that any method is worth considering if it works, but it might take some work to see whether a method really does work.

Comments on Kass’s comments

Before writing this discussion, I also had the opportunity to read Rob Kass’s comments on Efron’s article.

I pretty much agree with Kass’s points, except for his claim that most of Bayes is essentially maximum likelihood estimation. Multilevel modeling is only approximately maximum likelihood if you follow Efron and Morris’s empirical Bayesian formulation in which you average over intermediate parameters and maximize over hyperparameters, as I gather Kass has in mind. But then this makes “maximum likelihood” a matter of judgment: what exactly is a hyperparameter? Things get tricky with mixture models and the like. I guess what I’m saying is that maximum likelihood, like many classical methods, works pretty well in practice only because practitioners interpret the methods flexibly and don’t do the really stupid versions (such as joint maximization of parameters and hyperparameters) that are allowed by the theory.
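Here is a toy sketch of the distinction in the simplest normal-normal setting, with invented data and the population mean held fixed for simplicity: the empirical Bayes recipe integrates out the group-level parameters and maximizes the marginal likelihood over the hyperparameter, whereas joint maximization of parameters and hyperparameters degenerates toward a zero group-level variance:

```python
# Toy contrast between empirical Bayes (integrate out the group parameters
# theta, maximize the marginal likelihood over the hyperparameter tau) and the
# degenerate joint maximization over theta and tau together.
# Data, standard errors, and the fixed population mean are all invented.
import numpy as np
from scipy.stats import norm

y = np.array([12.0, -5.0, 20.0, 3.0, -8.0, 15.0])  # group estimates
sigma = np.array([4.0, 5.0, 4.0, 6.0, 5.0, 4.0])   # their standard errors
mu = 6.0                                            # population mean, held fixed

def marginal_loglik(tau):
    # theta integrated out: y_j ~ N(mu, sigma_j^2 + tau^2)
    return norm.logpdf(y, mu, np.sqrt(sigma**2 + tau**2)).sum()

def joint_logdensity_profile(tau):
    # plug in the theta_j that maximize the joint density for this tau
    theta = (y / sigma**2 + mu / tau**2) / (1 / sigma**2 + 1 / tau**2)
    return (norm.logpdf(y, theta, sigma) + norm.logpdf(theta, mu, tau)).sum()

for tau in (0.01, 0.1, 1.0, 9.0, 30.0):
    print(tau, round(marginal_loglik(tau), 1),
          round(joint_logdensity_profile(tau), 1))
# The marginal likelihood peaks at a moderate tau (around 9 with these numbers),
# while the joint density just keeps growing as tau shrinks toward zero.
```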

Regarding the difficulties of combining evidence across species (in Kass’s discussion of the DuMouchel and Harris paper), one point here is that this works best when the parameters have a real-world meaning. This is a point that became clear to me in my work in toxicology (Gelman, Bois, and Jiang, 1996): when you have a model whose parameters have numerical interpretations (“mean,” “scale,” “curvature,” and so forth), it can be hard to get useful priors for them, but when the parameters have substantive interpretations (“blood flow,” “equilibrium concentration,” etc.), then this opens the door for real prior information. And, in a hierarchical context, “real prior information” doesn’t have to mean a specific, pre-assigned prior; rather, it can refer to a model in which the parameters have a group-level distribution. The more real-worldy the parameters are, the more likely this group-level distribution can be modeled accurately. And the smaller the group-level error, the more partial pooling you’ll get and the more effective your Bayesian inference is. To me, this is the real connection between scientific modeling and the mechanics of Bayesian smoothing, and Kass alludes to some of this in the final paragraph of his comment.

Hal Stern once said that the big divide in statistics is not between Bayesians and non-Bayesians but rather between modelers and non-modelers. And, indeed, in many of my Bayesian applications, the big benefit has come from the likelihood. But sometimes that is because we are careful in deciding what part of the model is “the likelihood.” Nowadays, this is starting to have real practical consequences even in Bayesian inference, with methods such as DIC, Bayes factors, and posterior predictive checks, all of whose definitions depend crucially on how the model is partitioned into likelihood, prior, and hyperprior distributions.

On one hand, I’m impressed by modern machine-learning methods that process huge datasets with minimal assumptions; on the other hand, I appreciate Kass’s concluding point that statistical methods are most powerful when they are connected to the particular substantive question being studied. I agree that statistical theory is far from settled, and I agree with Kass that developments in Bayesian modeling are a promising way to move forward.

References

Efron, B., and Morris, C. (1971). Limiting the risk of Bayes and empirical Bayes estimators–Part I: the Bayes case. Journal of the American Statistical Association 66, 807-815.

Gelman, A., Bois, F. Y., and Jiang, J. (1996). Physiological pharmacokinetic analysis using population modeling and informative prior distributions. Journal of the American Statistical Association 91, 1400-1412.

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003). Bayesian Data Analysis, second edition. London: Chapman and Hall.

Gelman, A., Hill, J., and Yajima, M. (2009). Why we (usually) don’t have to worry about multiple comparisons. Technical report, Department of Statistics, Columbia University.

Gelman, A., and Jakulin, A. (2007). Bayes: radical, liberal, or conservative? Statistica Sinica 17, 422-426.

Gelman, A., and Tuerlinckx, F. (2000). Type S error rates for classical and Bayesian single and multiple comparison procedures. Computational Statistics 15, 373-390.

Gelman, A., and Weakliem, D. (2009). Of beauty, sex, and power: statistical challenges in estimating small effects. American Scientist 97, 310-316.

Lax, J. R., and Phillips, J. H. (2009). Gay rights in the states: public opinion and policy responsiveness. American Political Science Review 103.

Lindley, D. V., and Smith, A. F. M. (1972). Bayes estimates for the linear model. Journal of the Royal Statistical Society B 34, 1-41.

8 thoughts on “Bayesian statistics then and now”

  1. Andrew, you wrote: "In Efron's formulation (which follows the classical multiple comparisons literature), a "false discovery" is a zero effect that is identified as nonzero, whereas, in my own work, I never study nonzero effects."

    Don't you mean to say that you never study zero effects?

  2. There are some very nice points here.

    While I think it is a fallacy of sorts, people find frequentist views conservative because when evidence is weak the null hypothesis is always accepted. Thus, while one more data point can jerk around the estimate, it’s not likely to jerk around (or reject) the null, so the status quo remains.

    I like the philosophy of frequentism, but it can be a bad decision-making tool. But then, decision making is always subjective, no?

  3. "And the smaller the group-level error, the more partial pooling you'll get and the more effective your Bayesian inference is. To me, this is the real connection between scientific modeling and the mechanics of Bayesian smoothing"

    does this not also

    "depend crucially on how the model is partitioned into likelihood, prior, and hyperprior distributions."

    [Pr(u) * Pr(u_i | u)] * {Pr(x_i | u_i)}

    versus

    [Pr(u)] * {Pr(u_i | u) * Pr(x_i | u_i)}

    with the prior in [ ] brackets and the likelihood in { } braces,

    aka [epistemic] versus {aleatory}

    Keith

  4. Efron's take on "proper use of indirect evidence … learning from experience of others" in the link given by freddy is revealing (or a bit disappointing) about whether the learning should be in the likelihood or in the specification of the prior.

    Efron's regression example (Fig 2) is simply all likelihood – under the usual linear regression assumptions, the same intercept and slope parameters (with exactly common values) appear in each individual observation's likelihood, and the combination is direct, by likelihood multiplication. (To me this is disappointing, but maybe that's because he considers this account Fisherian.)

    Historically, Daniel Bernoulli did this (modeling astronomical observations as draws from a probability model with the same parameter taking a common value) in his 1778 paper entitled "The most probable choice between several discrepant observations and the formation therefrom of the most likely induction." For the same parameter but with possibly different values, in 1839 Bienayme modeled the parameters themselves as random draws from a probability model that itself had the same parameter with a common value for each draw (suggested earlier by Poisson). This was later shown to be implied by the assumption of exchangeability.

    But it's all in the likelihood – the same parameter appearing repeatedly in different units of analysis in the likelihood specification provides a direct combination – no need for a prior (other than the Keynesian prior of the data model being correct with probability one).

    The prior does bring in something beyond the data and seems the less wrong vehicle for dealing with truly indirect evidence ….

    Maybe empirical Bayes / James-Stein is just the combination of exchangeability (if you can't tell the difference, treat it as random) and the assumption that, although all models are wrong, not-too-badly-wrong ones (a so-so implicit prior) will greatly improve your repeated performance.

    Keith
