Are we not Bayesians?

David reports,

Boris presented the TSCS paper at Midwest and was accused by Neal Beck of not being a real Bayesian. Beck was making the claim that “we’re not Bayesians” because we’re using uninformative priors. He seems to be under the assumption that Bayesians only use informative priors. Boris should have just directed him to your book and told him to read chapters 1 and 2! I know you’ve spoken to Beck before, but have you ever had such an exchange with him on this topic? He kept making the claim that if you use diffuse priors, all you’re doing is MLE. It may be true that for many simple analyses Bayesian inference and MLE produce similar results, but Bayesian inference can easily be extended to more complex problems (something that MLE may have a harder time doing).

What is Bayesian inference?

My reply: Bayesian inference is characterized by the use of the posterior distribution–the distribution of unknowns, conditional on knowns. Bayesian inference can be done with different sorts of models. In general, more complex models are better (see here, also with some interesting discussion), but a simpler model is less effort to set up and can be used as a starting point in a wide range of examples.
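
In symbols, this is just Bayes' rule (the general statement, not tied to any particular model):

$$ p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \propto p(y \mid \theta)\, p(\theta), $$

where $\theta$ stands for the unknowns (parameters, missing data, future observations) and $y$ for the knowns.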

Diffuse prior distributions

Diffuse prior distributions are a type of simplification. Other simplifications we commonly use are conventional data models such as the normal distribution, and conventional transformations such as logit or probit. Bayesian inference with these models is still Bayesian. In the model-checking stage of Bayesian data analysis (see Chapter 6 of our book), you can check the fit of the model and think about how to improve it.
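
To connect this to the MLE point in the question above: with a flat prior, $p(\theta) \propto 1$, the posterior is proportional to the likelihood,

$$ p(\theta \mid y) \propto p(y \mid \theta), $$

so the posterior mode coincides with the maximum likelihood estimate. The Bayesian output, however, is the entire posterior distribution, which is what gets carried forward into intervals, predictions, model checking, and hierarchical extensions.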

More technically, an improper prior distribution can be considered as “noninformative” if it is a stable limit of proper prior distributions (see Sections 2.2-2.4 of this paper).
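
A standard example of such a limit: for a normal mean $\theta$, the flat prior $p(\theta) \propto 1$ arises from the proper priors

$$ \theta \sim \mathrm{N}(0, A^2), \qquad A \to \infty, $$

and the corresponding posterior distributions converge to the posterior obtained directly from the flat prior. That stability under the limit is what justifies calling the flat prior "noninformative" in this setting.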

Hmmm . . . let me try to put this more aphoristically. Bayesian inference with the right model is better than Bayesian inference with a wrong model. “Improper” models (that is, models without a joint probability distribution for all knowns and unknowns in the model) cannot be right. But Bayesian inference with a wrong model is still Bayesian.

Update (19 Apr 05): Neal says he was misquoted. He also says he’ll reply soon.

12 thoughts on “Are we not Bayesians?”

  1. I'm glad that I'm only taking introductory econometrics. If I felt that I needed to understand it, that post might have been downright frightening. My professor did refer to Bayesian vs. Frequentist (?) something-or-other a day or two ago, but I was fortunate enough to miss the gist of it entirely. Anyway, the blog makes for interesting reading, even if it's generally far over my poor head.

  2. Andrew, your paper on priors seems apropos.

    You write, "Noninformative prior distributions are intended to allow Bayesian inference for parameters about which not much is known beyond the data included in the analysis at hand. Various justifications and interpretations of noninformative priors have been proposed over the years, including invariance

    (Jeffreys, 1961), maximum entropy (Jaynes, 1983), and agreement with classical estimators (Box

    and Tiao, 1973, Meng and Zaslavsky, 2002). In this paper, we follow the approach of Bernardo

    (1979) and consider so-called noninformative priors as “reference models” to be used as a standard

    of comparison or starting point in place of the proper, informative prior distributions that would be

    appropriate for a full Bayesian analysis (see also Kass and Wasserman, 1996)."

  3. Hmm… I really need to brush up my stats. Unusual blog. Very interesting. I think I might learn something here. Keep up the good work.

  4. AG (12.10.04): A lot has been written in statistics about "parsimony"–that is, the desire to explain phenomena using fewer parameters–but I've never seen any good general justification for parsimony. (I don't count "Occam's Razor," or "Ockham's Razor," or whatever, as a justification. You gotta do better than digging up a 700-year-old quote.)

    DF: General theories tend to be more useful than specific ones. More parameters make a model more accurate but less generalizable.

    I think expected value theory is a better theory than expected utility theory in an important way. EV lets me take any ordered pair of gambles (G1, G2) and say whether G1>G2 OR G2>G1 OR G1=G2. EU does not constrain the relation on G1 and G2 very much.

    EV is a bold theory. You give it a pair of gambles and it predicts what a decision maker will do. EU is loosey-goosey.

    When you go from EV to EU, you add one "parameter": the shape of the utility function. When you go from EU to cumulative prospect theory, you add four more parameters (the formulas at the end of this comment make the counting concrete).

    http://psych.fullerton.edu/mbirnbaum/calculators/

    Which is a better theory – EV with 0 parameters or CPT with 5 parameters? That's a toughie.

    AG: Maybe it's because I work in social science, but my feeling is: if you can approximate reality with just a few parameters, fine. If you can use more parameters to fold in more information, that's even better.

    DF: It's not because you are a social scientist, it's because you're an applied statistician. The sorts of problems you encounter are ones that have been recognized, formalized and quantified. There is often a great deal of data available and a model that enables you to combine lots of data is better than a model that doesn't let you do this.

    But fewer parameters means more useful in the sense of:

    a. easier to apply

    b. bolder predictions

    AG: But I don't kid myself that they're better than more complicated efforts!

    DF: Glad to hear it! Don't kid yourself that they are worse!
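
    To make the parameter counting in this exchange concrete (these are the standard textbook forms, not something quoted from the thread): for a gamble G that pays x_i with probability p_i,

    $$ \mathrm{EV}(G) = \sum_i p_i\, x_i, \qquad \mathrm{EU}(G) = \sum_i p_i\, u(x_i). $$

    Going from EV to EU frees up the shape of the utility function u; cumulative prospect theory then adds further free pieces (separate curvature for gains and losses, a loss-aversion coefficient, and probability-weighting parameters), which is where a count of roughly five parameters comes from.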

  5. Deb,

    A model with many parameters is generalizable if it is structured hierarchically. To speak generally, consider three models for data from a psychological experiment:

    1. simple model with no person-level effects

    2. non-hierarchical model with person-level effects

    3. hierarchical model with person-level effects that have their own distribution.

    Models 2 and 3 have more parameters than model 1. The appeal of models 2 and 3 is that they can be more realistic. Model 2 can't be directly generalized to new persons, but model 3 can. In the examples I've worked on, I've used models such as model 3, and these are in fact more generalizable than models of types 1 or 2. But the hierarchical part of the model is key (a small code sketch of the three structures appears at the end of this comment).

    These are the kind of models that Radford is talking about, I think.

    An example from my own work is in our 1996 paper, "Physiological pharmacokinetic analysis using population modeling and informative prior distributions" (also covered in Section 20.3 of our book).
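
    As a rough illustration of how the three structures above differ, here is a minimal sketch. The normal likelihoods, the flat priors on the top-level parameters, and all variable names are assumptions made for the example, not anything taken from the 1996 paper.

    ```python
    from scipy.stats import norm

    # y: numpy array of observations
    # person: integer index array (values 0..J-1) giving the subject for each observation
    # theta: numpy array of person-level effects (length J)
    # sigma: data-level standard deviation (treated as known here for simplicity)
    # Top-level parameters get flat priors, so these are unnormalized log posteriors.

    def log_post_model1(y, mu, sigma):
        # Model 1: one common mean, no person-level effects.
        return norm.logpdf(y, loc=mu, scale=sigma).sum()

    def log_post_model2(y, person, theta, sigma):
        # Model 2: a separate effect theta[j] for each person, with nothing tying
        # the effects together; it says nothing about a person not in the data.
        return norm.logpdf(y, loc=theta[person], scale=sigma).sum()

    def log_post_model3(y, person, theta, mu, tau, sigma):
        # Model 3: person effects theta[j] drawn from a population distribution
        # N(mu, tau); a new person's effect is just another draw from that
        # distribution, which is what makes the model generalizable to new persons.
        lp = norm.logpdf(theta, loc=mu, scale=tau).sum()            # population level
        lp += norm.logpdf(y, loc=theta[person], scale=sigma).sum()  # data level
        return lp
    ```

    The only point of the sketch is the structure: model 3 is model 2 plus one extra line, and that extra line is the hierarchical part.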

  6. Deb,

    It depends on the problem, I'm sure. For modeling the sex of babies, I prefer the one-parameter model, Pr(girl)=p, to the zero-parameter model, Pr(girl)=0.5. The one-parameter model plus some data give an estimate of p=0.488 or so.
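
    For the record, the calculation behind that estimate is the standard conjugate one (not spelled out above): with a uniform prior on p and y girls observed in n births,

    $$ p \mid y \sim \mathrm{Beta}(y+1,\; n-y+1), \qquad E(p \mid y) = \frac{y+1}{n+2}, $$

    which for the very large n available in birth records is essentially just the observed proportion of girls, hence the estimate of about 0.488.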

  7. I don't get it. I think your point is that the model p(girl)=.488 is more accurate than the model p(girl)=.5. But I have no idea what this fact has to do with our debate about the optimal number of parameters in a theory.

    I know you're skeptical of Occam, but what about Popper? The utility of a theory is positively correlated with its falsifiability. As falsifiability approaches zero, utility approaches zero.

    The more parameters in a model, the more difficult it is to falsify. It's easy to test EV. It's harder to test EU. It's almost impossible to test cumulative prospect theory.

    The St. Petersburg paradox is an example of an empirical finding that falsifies EV. The Allais and Ellsberg paradoxes falsify EU. That no paradox has yet falsified CPT may simply reflect how hard it is to come up with a counterexample to a theory that has five free parameters.

  8. Deb,

    You wrote, "Do you agree that a model with zero parameters is better than a model with one parameter?" I don't necessarily agree with this. I think the 1-parameter Pr(girl)=p model is better than the 0-parameter Pr(girl)=0.5 model.

    This was not a comment on prospect theory. Prospect theory is cool but my impression is that it's been falsified lots of times. But I don't think prospect theory is taken so seriously as a cognitive model. It's more of a convenient framework that captures a variety of attitudes that people have about probability and uncertainty.

    To get to your other question–and back to my own areas of expertise–in the examples I've worked on, I wouldn't say there is an "optimal number of parameters in a theory." I'd almost always like to have more parameters and be more realistic, and the limiting factor is my ability to actually fit such a model. In practice, I will definitely use models that I know are simplifications, but that's typically because my model-fitting methods are not sophisticated enough to capture the uncertainty involved in estimating large numbers of parameters.

    Every once in a while, the simplest model fits fine and I'm happy to stop there, but it rarely happens this way in my experience.

  9. Deb – There's a great quote by Peter Grunwald in his introductory chapter to "Advances in Minimum Description Length" (2005, p.16; MIT Press) that talks about parsimony.

    It is often claimed that Occam's razor is false — we often try to model real-world situations that are arbitrarily complex, so why should we favor simple models? In the words of Webb [1996], "What good are simple models of a complex world?" The short answer is: even if the true data-generating machinery is very complex, it may be a good strategy to prefer simple models for small sample sizes. Thus, MDL (and the corresponding form of Occam's razor) is a strategy for inferring models from data ("choose simple models at small sample sizes"), not a statement about how the world works ("simple models are more likely to be true") — indeed, a strategy cannot be true or false; it is "clever" or "stupid." And the strategy of preferring simpler models is clever even if the data-generating process is highly complex.

    I think that all this comes down to the question of what we are trying to achieve with statistics. If the goal is only to describe data accurately, then parsimony is irrelevant. If the goal is to describe accurately and concisely, or predict future events in the presence of noise, then parsimony becomes our guard against over-fitting. The earliest formal result that I know of demonstrating this is Akaike (1973), but there have been several variants since then. From an information theoretic point of view, people like Grunwald, Rissanen and Wallace have shown that parsimony is important in the compression of data, while folks like Dawid have talked a lot about predictions of future events (though I'm not as familiar with Dawid's work as I should be, so I might be misinterpreting).
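
    For reference, Akaike's criterion makes that guard against over-fitting explicit (a standard formula, added here only for context):

    $$ \mathrm{AIC} = -2 \log \hat{L} + 2k, $$

    where $\hat{L}$ is the maximized likelihood and k the number of fitted parameters; among candidate models one prefers the smallest AIC, so each additional parameter has to buy enough improvement in fit to pay for itself.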

  10. AG: Prospect theory is cool but my impression is that it's been falsified lots of times.

    DF: I'm not sure what you mean by "cool." It's very trendy and chic. It has not been falsified. Kahneman won the Nobel because people believe it is an ACCURATE descriptive model. Rabin won the John Bates Clark Medal and the MacArthur Genius award for seeming to show mathematically what K&T showed empirically – EU is a bad descriptive model and prospect theory is a good one.

    Michael Birnbaum claims to have falsified prospect theory, but the VAST majority of behavioral economists believe PT is descriptively accurate.

    [BTW: The Kahneman, Slovic & Tversky book you cite in your 98 Am. Stat. paper was published in 82, not 84 as you state. Also, KST82 has NOTHING to do with your pre-replication of Rabin's calibration theorem. KST82 deals with availability, representativeness, the gambler's fallacy, etc. Kahneman & Tversky, 2000 (Choices, values, frames) deals with prospect theory, loss aversion, etc.]

    AG: But I don't think prospect theory is taken so seriously as a cognitive model. It's more of a convenient framework that captures a variety of attitudes that people have about probability and uncertainty.

    DF: You are right that PT is not taken seriously as a cognitive model. But I'm not sure it captures the empirical regularities as well as it could – there's an S-shaped value function and an S-shaped probability function. But PT formalizes the two S-shaped functions in different ways. For those of us who are into parsimony, this is yet another red flag that there's something not quite kosher about PT.
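
    For concreteness, the usual parametric forms (the ones in Tversky and Kahneman's 1992 cumulative prospect theory paper, cited from memory rather than from anything in this thread) are a power value function and an inverse-S probability weighting function:

    $$ v(x) = \begin{cases} x^{\alpha}, & x \ge 0 \\ -\lambda\,(-x)^{\beta}, & x < 0 \end{cases} \qquad w(p) = \frac{p^{\gamma}}{\bigl(p^{\gamma} + (1-p)^{\gamma}\bigr)^{1/\gamma}}, $$

    with separate weighting exponents typically fitted for gains and losses. That is where a count of roughly five free parameters comes from, and it is also why the two S-shaped curves end up formalized in such different ways.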
