Recently in Bayesian Statistics Category

Constructing informative priors

| 2 Comments

Christiaan de Leeuw writes:

I write to you with a question about the construction of informative priors in Bayesian analysis. Since most Bayesians at the statistics department here are more of the 'Objective' Bayes persuasion, I wanted some outside opinions as well.

Guilherme Rocha writes:

Matt Stephenson writes:

Bill Browne sends in this interesting job possibility. Closing date for applications is 30 Oct 2009, so if you're interested, let him know right away!

Song Qian writes:

I am very pleased to see your comment on not analyzing data without context. Would you please elaborate on the reasons on your blog? I have been teaching an intro data analysis class to our professional masters students since 2005. One thing I have emphasized is the understanding of the underlying scientific problem before conducting any data analysis. This point is not always well-taken. Thanks.

My response: From a Bayesian point of view, it's pretty clear: no context = no prior information. It's really more than that, though, since the context structures the model itself, not just the numerical information that you use to regularize parameter estimates. For the climate change example, Bill Jefferys provides a good discussion here on what you can get from substantive knowledge.

Gustaf Granath writes:

I am an ecologist. I have been struggling with a problem for some time now and even asked some statisticians about this. It would be interesting for me (and maybe other people reading your blog) to hear your opinion. So far, I have not received a satisfying answer from anyone.

I am doing a meta-analysis (in ecology with normal dist. data) using two different approaches. My first approach is a frequentist mixed-model, assuming independence of each sample. The second approach is a hierarchical Bayesian model, modelling the dependence structure in the data set (e.g., multiple outcomes from each study). I want to investigate if my covariates are important, and since I have many candidate covariates, I need to do some kind of model selection. My question is then: is there a model selection tool that can be applied to both approaches?

Keith points me to this article by Gretchen Chapman and Jingjing Liu:

Previous research has demonstrated that Bayesian reasoning performance is improved if uncertainty information is presented as natural frequencies rather than single-event probabilities. A questionnaire study of 342 college students replicated this effect but also found that the performance-boosting benefits of the natural frequency presentation occurred primarily for participants who scored high in numeracy. This finding suggests that even comprehension and manipulation of natural frequencies requires a certain threshold of numeracy abilities, and that the beneficial effects of natural frequency presentation may not be as general as previously believed.

Sounds interesting. Unfortunately the article has no killer graph to make the point. In psychology, the killer graph often takes the form of a plot with two lines that cross, thus demonstrating the interaction of interest. Maybe Chapman and Liu could do this for their next article.

P.S. I gotta say, it would be pretty cool to be named "Jingjing." Sort of a Boutros Boutros or Mike Michaelson thing going on here.

Bob writes:

I've been meaning to follow up two comments you made in passing about priors:

1. You said you didn't like Dirichlet priors for multinomials because they didn't model covariance. What alternative do you suggest?

2. When I told you I was using the prior from the hierarchical binomial survival example from [page 128 of] your BDA book, you said you didn't like that prior any more. Why and what would you suggest as an alternative?

The book model reparameterized Beta(a,b) in terms of mean a/(a+b) which got a uniform prior, and scale a+b with a Pareto(1.5) prior [p(a+b) proportional to (a+b)**-2.5].

It works fairly well in practice, though it does lead to a fair number of large scale (a+b) samples.

I used your prior for baseball batting average estimation; the post includes the raw data (2006 AL position players) in tsv form, BUGS code, and the R calling harness.

I also use your prior for a hierarchical model of diagnostic test accuracy in epidemiology (or other data coding tasks).

I have longer versions of that paper with more analysis, simulations, data, alternative item-response type models, and pointers to all the code and data.

The basic epidemiology model keeps getting rediscovered. I'm still the only one who's drunk enough of your Kool-Aid to go the full Bayesian hierarchical model route.

My reply:

Harker Rhodes writes:

I'm looking for a chapter in someone's textbook titled "Bayesian analysis of case-control studies". Or a chapter with any title covering that subject.....

Here's his longer story:

Someone who knows that I hate the so-called Fisher exact test asks:

I was hoping you could point to a Bayesian counterpart or improvement to "Fisher's exact test" - for 2 x 2 categorical, contingency tables with possibly very small numbers (too small to do a chi-square.) I see that you had a blog post on it before (1) but there are several issues I'm unclear about:

(i) What would a full applied Bayesian analysis look like of this type of problem, in general? I have seen one beta-binomial like analysis but never in practical/applied examples. Any practical examples you may have for this, e.g. papers or code examples you've used in teaching, would be great.

(ii) What if we add the twist that the data from the two populations for our 2 x 2 test is paired? e.g. we have several male and several female patients, and the two conditions are drug / no drug. But, each male and female are paired as they are twins (which breaks the independence of the samples obviously.) How is this modeled from a Bayesian perspective?

(iii) Less important: when in practice is it ok to use Fisher's exact test if you're open to Bayesian analysis? 'Never' is a reasonable answer but I'd like to understand practical reasons why you think this. Finally, if all of our data counts are greater than 10, do you think it's legitimate to use a chi-square?

My reply:

(i) The basic analysis is pretty simple, it goes like this:

y1 ~ Binomial (n1, p1)

y2 ~ Binomial (n2, p2)

We need a prior distribution on (p1,p2), and we usually assume that n1,n2 provide no information about p1,p2. (This latter point depends on the design of the study, but I'm keeping it simple here.) What's a good prior distribution depends on the problem, but in many cases, a simple uniform distribution on (p1,p2) will be fine. Whatever your prior is, you then throw in the likelihood and you get posterior inference for (p1,p2). Draw 1000 simulations and then use these to get inference for p1-p2. That's it. With moderate or large sample sizes, this is basically equivalent to the standard t-test.

If you have many tables, you can set up a hierarchical model for the p's. We have an example near the end of chapter 5 of Bayesian Data Analysis.

(ii) With paired data, you can fit a logistic regression. Call the data y_ij, where i=1 or 2 and j is the index for the pairing. Then you can model Pr(y_ij=1) = invlogit (a_i + b_j), with a hierarchical model for the b_j's, something like b_j ~ N (mu_b, sigma_b^2), with weakly informative or flat prior distributions on mu_b, sigma_b.

(iii) The only case I could even imagine using the so-called Fisher exact test is if the data were collected so that the row and column margins were both pre-specified. The only example I can think of with this design is Fisher's tea-tasting experiment. In all cases I've seen, at most one margin is preset by design. Also, I'd never do a chi-squared test in this setting. See chapter 2 of ARM for an example where I thought a chi-squared test was OK.
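The basic analysis in (i) can be sketched in a few lines of Python. This is only an illustration under the simple assumptions stated there: independent uniform priors on p1 and p2, so each posterior is a Beta distribution. The function name and the counts in the example table are made up.

```python
import numpy as np

def posterior_difference(y1, n1, y2, n2, n_sims=1000, seed=0):
    """Simulate p1 - p2 under independent uniform priors on p1 and p2.

    A Uniform(0, 1) prior is Beta(1, 1), and with a binomial likelihood
    the posterior for each p is Beta(y + 1, n - y + 1).
    """
    rng = np.random.default_rng(seed)
    p1 = rng.beta(y1 + 1, n1 - y1 + 1, size=n_sims)
    p2 = rng.beta(y2 + 1, n2 - y2 + 1, size=n_sims)
    return p1 - p2

# Hypothetical 2 x 2 table: 7 successes out of 10 versus 2 out of 12.
diff = posterior_difference(y1=7, n1=10, y2=2, n2=12)

summary = {
    "median": float(np.median(diff)),
    "interval_95": tuple(float(q) for q in np.quantile(diff, [0.025, 0.975])),
    "prob_p1_greater": float(np.mean(diff > 0)),
}
print(summary)
```

With a different prior you would replace the two Beta draws with draws from whatever posterior that prior implies; the summary step stays the same.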

Here. That was fun. I never knew what the dude looks like before. Now I know that he has lighter skin and darker hair than I do. Or maybe it's just the lighting. The conversation was fun; I hope to have another chance to do this.

Last year I did a bloggingheads with Will Wilkinson.

Awhile ago I was invited by Keying Ye to contribute to a book of essays, Frontier of Statistical Decision Making and Bayesian Analysis, in honor of the great Jim Berger. Here's my chapter, which begins:

Jim Berger has made important contributions in many areas of Bayesian statistics, most notably on the topics of statistical decision theory and prior distributions. It is the latter subject which I shall discuss here. I will focus on the applied work of my collaborators and myself, not out of any claim for its special importance but because these are the examples with which I am most familiar. A discussion of the role of the prior distribution in several applied examples will perhaps be more interesting than the alternative of surveying the gradual progress of Bayesian inference in political science (or any other specific applied field).

I will go through four examples that illustrate different sorts of prior distributions as well as my own progress--in parallel with the rest of the statistical research community--in developing tools for including prior information in statistical analyses . . .

Hamdan Yousuf writes:

I was reading your Kanazawa letter to the editor and I was interested in your discussion of multiple comparisons. This might be an elementary issue but I don't quite understand when the issue of multiple comparisons arises, in general. To give an example from research I have been involved in, assume I am trying to fit a linear regression on a response variable (PR: placebo responsivity score, continuous, experimentally measured) and am assessing 100-200 potential predictors (mostly scores on psychological scales.) The predictors are highly multicollinear such that it is difficult to build a "model" using more than 1 of them, and the matter simplifies to picking the single predictor that optimally explains variance in my response variable. Note that my number of observations (subjects) is small, about 40.

Is this considered a situation with multiple comparisons? That is, I am simultaneously looking at p-values for correlation between my response and each potential predictor. In practice, a handful of the variables yield very good p-values (.001-.005), and these variables make sense scientifically. However, should I be using a correction for MCs, say Bonferroni, with p=.05/200=.00025, in which case nothing is significant? Or am I misinterpreting the idea of multiple comparisons to begin with?

My reply: No, I don't think you should be using classical multiple comparisons methods in your problem. See here and here for further discussion. For your example, maybe it would make sense to combine a bunch of your potential predictors into a single combined scale. I'm guessing that your real question is not, "Are any of these 200 potential predictors correlated with the outcome in the population," but rather "How good are these predictors?" I think you'd be better off with a multilevel model in which you handle the uncertainty using partial pooling.

Interactions and Bayesian Anova

| No Comments

Gregor Gorjanc writes:

In the "weakly informative priors" article, we propose a Cauchy (0, 2.5) default prior distribution for logistic regression coefficients, motivating it from applied concerns and also as a regularizer.

Recently, Gregor Gorjanc pointed me to an article by Jairo Fuquene, John Cook, and Luis Pericchi, also recommending Cauchy prior distributions but this time using a robustness argument. Their article is a bit more mathematical than ours, and with a different focus, more concerned with improvements in specific applications than in the construction of a generic default prior distribution. But we have similar messages, and in that sense our papers are complementary.

On many occasions it's handy to have a list of conjugate prior distributions. Several books have it, but if you're typing away on a beach somewhere, let me provide some links:

John Cook's summary of univariate conjugate prior relationships:

conjugate.png

John links to two other good sources: Wikipedia and Daniel Fink's "A Compendium of Conjugate Priors".

John Cook also has a clickable diagram of distribution relationships, a subset of a much larger one by Leemis and McQueston (click to enlarge):

univ16.png

(Material found via LingPipe's introduction to Bayesian statistics, thanks Bob.)
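To make concrete what such a list buys you, here is a small Python sketch of two standard conjugate updates, Beta-binomial and Gamma-Poisson. The class and function names are my own for illustration and do not come from any of the sources linked above; the Gamma distribution is parameterized by shape and rate.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Beta:
    a: float
    b: float

    @property
    def mean(self):
        return self.a / (self.a + self.b)

@dataclass(frozen=True)
class Gamma:
    shape: float
    rate: float

    @property
    def mean(self):
        return self.shape / self.rate

def update_binomial(prior, successes, trials):
    """Beta prior + binomial data -> Beta posterior."""
    return Beta(prior.a + successes, prior.b + trials - successes)

def update_poisson(prior, counts):
    """Gamma prior + Poisson counts -> Gamma posterior."""
    return Gamma(prior.shape + sum(counts), prior.rate + len(counts))

# Uniform prior on a proportion, then 7 successes in 10 trials.
beta_post = update_binomial(Beta(1, 1), successes=7, trials=10)
# Gamma(2, 1) prior on a rate, then three observed counts.
gamma_post = update_poisson(Gamma(2, 1), counts=[3, 5, 4])
print(beta_post, gamma_post)
```

The point of conjugacy is visible in the code: the update is just arithmetic on the prior's parameters, with no simulation or numerical integration required.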

Troels Ring writes:

You undoubtedly know this site and the idea behind it, presented also in the book by Spiegelhalter et al. 2004. A recent reason for wondering about this is a paper in the American Journal of Kidney Diseases 2009; 53: 208-217 claiming that protein restriction kills people with a hazard ratio 1.15 to 3.20. So to "believe" this, if I understand it, the prior would have to have weights above 1.9, which is strange since the anticipated effect would be beneficial. I have found few references to this method (a paper by Greenland mentions it briefly) and I'd like to hear your view of it.

My reply: I actually hadn't heard of this research before. It looks like it could be useful. I have no time to think more about this now, but my quick thought is that there's something a bit wacky about making decisions based on the endpoint of a 95% interval. It doesn't seem so Bayesian to me. On the other hand, I do this myself to some extent often enough, and in ARM we have a whole chapter on power calculations, so I don't quite follow a consistent line on this myself.

Ring adds:

Bill Harris writes:

I'm not a professional statistician, but I do use statistics in my work, and I'm increasingly attracted to Bayesian approaches.

Several colleagues have asked me to describe the difference between Bayesian analysis and classical statistics. I think I've not yet succeeded well, and so I was about to start a blog entry to clear that up. Then I decided to look around.

Mary Towner sends along this article by herself and Barney Luttbeg that discusses the Trivers-Willard hypothesis and its applications to humans.

I think that Towner and Luttbeg agree with David Weakliem and myself on the substance, but I disagree with them on the question of what models to fit. It's not so much a Bayesian or non-Bayesian question--we use both approaches in our article--but rather a question of whether to treat parameters as continuous or discrete. In their example on page 100, they consider models in which the probability of boy births is 0.50 and 0.53. I think it would make more sense to consider theta to be a continuous parameter with distribution centered on the historical value of 0.515. Neither of those hypothesized values seems very plausible to me. On the substance, though, I think we're all on the same page.

P.S. I was curious.

Bayesian jobs at SAS

| No Comments

Fang Chen writes:

I work at SAS on Bayesian software development . . . SAS has just reopened some positions and we are in the process of finding and attracting talented individuals who might be interested in making a career out of developing Bayesian software. What we essentially look for are people who are relatively well versed in Bayesian statistics, have had extensive hands-on experience in MCMC/Bayesian modeling (preferably using one of the low-level languages like C), and are interested in making relevant software.

If you know someone, a graduating student maybe, who fits this description, do you mind passing along the information? The job description/application can be found here, searching for job number 09001613.

I received this question in the mail:

Your Biometrics article, Multiple imputation for model checking: completed-data plots with missing and latent data, suggests diagnostics when the missing values of a dataset are filled in by multiple imputation. But suppose we have two equivalent files--File A with variable y left-censored at known threshold and File B with y fully observed. We draw multiple imputations of censored y in File A. (1) Can we validate our imputation model by setting y in File B as left-censored according to the inclusion indicator from A, performing multiple imputation of these "censored" data, and comparing imputed to observed values? (2) In particular, what diagnostic measure(s) would tell us whether the imputed and observed values fit closely enough to validate our imputation model?

My reply: I'm a little confused: if you already have File B, what do you need File A for? Do the two files have different data, or are you just using this to validate your imputation model? If the latter, then, yes, you can see whether the observations in File B are consistent with the predictive distributions obtained from your multiple imputations on File A. You wouldn't expect the imputations to be perfect, but you'd like the imputed 50% intervals to have approximate 50% coverage, you'd like the average values of the true data to equal the predictions from the imputations, on average, and conditional on any information in the observed data in File A. (But the imputations don't have to--and, in general, shouldn't--be correct on average, conditional on the hidden true values.)

You may also be interested in my 2004 article, Exploratory data analysis for complex models, which actually has an example on death-penalty sentencing, with censored data.
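The coverage and average-prediction checks described in my reply can be sketched as follows. This is a rough illustration on synthetic data, not an analysis of anyone's actual files: the function name is invented, the "truth" stands in for the fully observed y in File B, and the two sets of imputations (one calibrated, one overconfident) are simulated from normal distributions that I made up.

```python
import numpy as np

def check_imputations(imputations, truth, level=0.5):
    """Compare imputation draws with held-out true values.

    imputations: array of shape (n_draws, n_cases), one row per imputation.
    truth:       array of shape (n_cases,), the values that were hidden.
    Returns the coverage of the central `level` intervals and the mean
    difference between the truth and the average imputation.
    """
    tail = (1 - level) / 2
    lower = np.quantile(imputations, tail, axis=0)
    upper = np.quantile(imputations, 1 - tail, axis=0)
    coverage = float(np.mean((truth >= lower) & (truth <= upper)))
    bias = float(np.mean(truth - imputations.mean(axis=0)))
    return coverage, bias

# Synthetic stand-in for Files A and B (all numbers here are made up).
rng = np.random.default_rng(1)
n_cases, n_draws = 2000, 200
center = rng.normal(size=n_cases)                          # predictive mean per case
truth = rng.normal(center, 1.0)                            # hidden values, sd 1
good = rng.normal(center, 1.0, size=(n_draws, n_cases))    # calibrated imputations
narrow = rng.normal(center, 0.5, size=(n_draws, n_cases))  # overconfident imputations

good_coverage, good_bias = check_imputations(good, truth)
narrow_coverage, narrow_bias = check_imputations(narrow, truth)
print(good_coverage, narrow_coverage)
```

In this setup the calibrated imputations should give coverage near 50%, while the overconfident ones fall well short, which is the kind of discrepancy the check is meant to reveal.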

Hybrid Monte Carlo

| 4 Comments

Richard Morey writes:

On your blog a while back, you asked why more people aren't using Hybrid (Hamiltonian) Monte Carlo. I have tried it, and found that it works quite well for many applications, but not so well for others (parameters with bounded space, and parameters whose log-posterior has exponential functions in them, specifically). When I started using it, there wasn't much out there about it, precisely because it hasn't caught on. Well, to help remedy that a bit, I've created a CRAN package to do hybrid Monte Carlo sampling (HybridMC), and I thought this may be of interest to your readers. The back end is written in C, so it is quite fast. I've had good luck with it so far.

Cool. We should take a look at this.
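For readers who have not seen the method, here is a minimal sketch of hybrid (Hamiltonian) Monte Carlo in Python. It is not the HybridMC package's interface, which I have not described here; the function name, the tuning values, and the correlated bivariate normal target are all assumptions chosen for illustration. The sketch uses a unit mass matrix and assumes unconstrained parameters, so it does nothing about the bounded-parameter difficulty Morey mentions.

```python
import numpy as np

def hmc(log_prob, grad_log_prob, x0, n_samples=2000,
        step_size=0.2, n_leapfrog=10, seed=0):
    """Basic HMC with leapfrog integration and a Metropolis correction."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    n_accept = 0
    for i in range(n_samples):
        p0 = rng.standard_normal(x.size)       # fresh momentum each iteration
        x_new, p = x.copy(), p0.copy()
        # Leapfrog: half step in momentum, full steps in position, half step.
        p += 0.5 * step_size * grad_log_prob(x_new)
        for step in range(n_leapfrog):
            x_new += step_size * p
            if step < n_leapfrog - 1:
                p += step_size * grad_log_prob(x_new)
        p += 0.5 * step_size * grad_log_prob(x_new)
        # Accept or reject based on the change in total energy.
        h_old = -log_prob(x) + 0.5 * p0 @ p0
        h_new = -log_prob(x_new) + 0.5 * p @ p
        if np.log(rng.uniform()) < h_old - h_new:
            x = x_new
            n_accept += 1
        samples[i] = x
    return samples, n_accept / n_samples

# Illustrative target: bivariate normal with correlation 0.8.
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
precision = np.linalg.inv(cov)
log_prob = lambda x: -0.5 * x @ precision @ x
grad_log_prob = lambda x: -precision @ x

samples, accept_rate = hmc(log_prob, grad_log_prob, x0=[0.0, 0.0])
print(accept_rate, samples.mean(axis=0), np.cov(samples.T))
```

The step size and number of leapfrog steps have to be tuned to the problem; values that work for this toy target will not transfer automatically to a real posterior.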

Just in case you thought this blog was all fluffy political stuff . . . Kaisey Mandel writes:

Kevin Kelly on Ockham

| 19 Comments

Cosma Shalizi writes:

Kevin Kelly has an interesting take on Ockham's razor, which is basically that it helps you converge to the truth faster than methods which add unnecessary complexities let you do. I think his clearest paper about it is this, though sadly it looks like he removed the cartoons he had in the draft versions.

I took a look. Here's the abstract:

Explaining the connection, if any, between simplicity and truth is among the deepest problems facing the philosophy of science, statistics, and machine learning. Say that an efficient truth-finding method minimizes worst-case costs en route to converging to the true answer to a theory choice problem. Let the costs considered include the number of times a false answer is selected, the number of times opinion is reversed, and the times at which the reversals occur. It is demonstrated that (1) always choosing the simplest theory compatible with experience and (2) hanging onto it while it remains simplest is both necessary and sufficient for efficiency.

This is fine, but I don't see it applying in the sorts of problems I work on, in which "converging on the true answer" requires increasingly complicated models as more data arrive. To put it another way, I don't work on "theory choice problems," and I'm invariably selecting "false answers."

P.S. I'm not saying this to mock Kelly's paper; I can imagine this can be useful in some settings, just maybe not in problems such as mine where I would like my models to be more, not less, inclusive.

Nils Hjort, Chris Holmes, Peter Muller, and Stephen Walker have come out with a new book on Bayesian Nonparametrics. It's great stuff, makes me realize how ignorant I am of this important area of statistics. Here are the chapters:

0. An invitation to Bayesian nonparametrics (Hjort, Holmes, Muller, and Walker)

1. Bayesian nonparametric methods: motivation and ideas (Walker)

2. The Dirichlet process, related priors and posterior asymptotics (Subhashis Ghosal)

3. Models beyond the Dirichlet process (Antonio Lijoi and Igor Prunster)

4. Further models and applications (Hjort)

5. Hierarchical Bayesian nonparametric models with applications (Yee Whye Teh and Michael I. Jordan)

6. Computational issues arising in Bayesian nonparametric hierarchical models (Jim Griffin and Chris Holmes)

7. Nonparametric Bayes applications to biostatistics (David Dunson)

8. More nonparametric Bayesian models for biostatistics (Muller and Fernando Quintana)

I have a bunch of comments, mostly addressed at some offhand remarks about Bayesian analysis made in chapters 0 and 1. But first I'll talk a little bit about what's in the book.

David Spiegelhalter and Ken Rice wrote this excellent short article on Bayesian statistics. I think it's far superior to the Wikipedia articles on Bayes, most of which focus too much on discrete models for my taste.

Alan Bergland writes:

I am a graduate student studying evolutionary biology at Brown University. I am writing you with what I think is a simple question, but I cannot seem to find an answer I feel comfortable with.

I am trying to test a planned contrast using posterior distributions from a mixed model (the mixed model is calculated in lme4, and the simulations in arm). The model is fairly complicated, but at the end of the day, there are two fixed effect treatments with two levels each that I am interested in. Let's call these fixed effects "treatment A" (with levels A and a) and "treatment B" (with levels B and b). I am interested in the interaction between treatment A and treatment B, but have a specific hypothesis about the form of that interaction I would like to test. Specifically, I would like to test if ab is less than Ab & aB=AB.

As you and Jennifer Hill suggest in your Multilevel/Hierarchical models book (p. 20), I could test if ab

Once I can calculate the probability that Ab=AB, would it be reasonable to calculate the probability that (ab is less than Ab & aB=AB) as Pr(ab is less than Ab)*Pr(aB=AB)?

My reply:

1. Don't use arm's sim() function for lmer() objects. The current version is wrong; we're fixing it now, and the replacement should be available in about a month.

2. I don't recommend testing if aB=AB. At least in the sorts of problems I work on, no two comparisons are exactly equal. I think it makes more sense to estimate the relevant comparison, get the confidence interval, and make a graph. You could also do things like calculate the posterior probability (based on simulations) that ab < AB & |aB - AB|

Ryan Richt writes:

I wondered if you have a quick moment to dig up an old post of your own that I cannot find by searching. I read an entry where you discussed if there really was a difference between a prior of 1/2 meaning that we have no knowledge of a coin flip, or meaning we are exactly certain that its generative distribution is 1/2.

I'm only 24 and just got my masters last year, but I now have my own summer interns (whom of course I encourage to read ET Jaynes and see the Bayesian light) and one of them basically asked that question today.

My reply: The two original blog entries are here and here. Here's my published article. And here's a link discussing actual wrestlers and boxers. (Apparently the wrestler would win.)

The talks from the mini-conference are up on the website. The speakers:

Martin Lindquist (Dept of Statistics, Columbia)
Ed Vul (Dept of Brain and Cognitive Sciences, MIT)
Nikolaus Kriegeskorte (Laboratory of Brain and Cognition, NIH)
Tor Wager (Dept of Psychology, Columbia)
Andrew Gelman (Dept of Statistics, Columbia)
Daphna Shohamy (Dept of Psychology, Columbia)
Cosma Shalizi (Dept of Statistics, CMU)
Pat Shrout (Dept of Psychology, NYU)

The powerpoints are up, and also videos of our presentations. If you listen carefully, you can hear the raucous laughter in the background. . . .

It's been a dramatic month: A month ago, a coalition of some of the leading teams qualified for the $1 million grand prize for improving the accuracy of the movie-recommending model by more than 10%. But the competition would close only 30 days afterward, in case someone else was able to improve upon the result. This happened less than a day before the deadline, by The enormous Ensemble, composed of 23 previously separate teams and individuals. Of course, most of the progress towards the victory came through models making use of new significant patterns in the data, such as that of time.

The development of an ensemble from many separate teams was another accomplishment, and the GPT's inclusion rules provide some insight into the process: "shares" of the winnings were distributed based on how much a contribution was able to improve the result in terms of percentage points. Simon Owens describes what it was like to participate in The Ensemble.

Bayesian statistics always works with ensembles: the posterior is a weighted average of all models, the weight being based on the fit of each model times the prior quality of the model. There are some additional Bayesian elements that could be a part of future competitions, such as Bayesian scoring functions.
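As a toy illustration of that weighting, here is a short Python sketch with two hypothetical models for a sequence of coin flips: one that fixes the probability of heads at 0.5, and one that gives it a uniform prior. The data (14 heads in 20 flips), the equal prior weights, and the function names are all assumptions made up for this example. Each model's weight is its marginal likelihood times its prior weight, normalized to sum to one, and the ensemble prediction is the weighted average of the two models' predictions.

```python
import math

def log_beta(a, b):
    """Log of the Beta function, via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def posterior_model_weights(log_marginals, prior_weights):
    """Weights proportional to marginal likelihood times prior weight."""
    log_post = [lm + math.log(pw) for lm, pw in zip(log_marginals, prior_weights)]
    top = max(log_post)                      # subtract the max for stability
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical data: y heads in n flips (as one particular sequence).
y, n = 14, 20

log_marginals = [
    n * math.log(0.5),               # model 1: theta fixed at 0.5
    log_beta(y + 1, n - y + 1),      # model 2: theta ~ Uniform(0, 1)
]
predictions = [
    0.5,                             # model 1: Pr(next flip is heads)
    (y + 1) / (n + 2),               # model 2: posterior mean of theta
]

weights = posterior_model_weights(log_marginals, prior_weights=[0.5, 0.5])
ensemble = sum(w * p for w, p in zip(weights, predictions))
print(weights, ensemble)
```

Neither model is discarded: the ensemble prediction lands between the two individual predictions, pulled toward whichever model the data and prior favor.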

In the past I was asked to contrast Occam's razor with the Epicurean principle. Occam's razor is the Bayesian prior, or the yang principle: simpler models have greater a priori weight (because we tend to economize that which is useful). Occam's razor goes back to Aristotle, who wrote "For the more limited, if adequate, is always preferable," and "For if the consequences are the same, it is always better to assume the more limited antecedent" in his Physics. We mathematically express it as the prior.

The Epicurean principle is the yin, or mathematically expressed as the integral over the model space. Ensembles go back to Epicurus' letter to Herodotus: "When, therefore, we investigate the causes of [...] phenomena, [...] we must take into account the variety of ways in which analogous occurrences happen within our experience." Thus, Bayesian statistics combines the yin and the yang, balancing the pursuit of simplicity with the limitations of uncertainty.

[7/31/09: Added a link to Simon Owens' interview with The Ensemble.]

Hard sell for Bayes

| 11 Comments

Here's our estimate of public support for vouchers, broken down by religion/ethnicity, income, and state:

vouchermapsBAYES2000.png

(Click on image to see larger version.)

We're mapping estimates from a hierarchical Bayes model fit to data from the 2000 Annenberg survey (approximately 50,000 respondents).

In case you're wondering what Bayesian modeling did for us, here are the corresponding maps from the raw data (weighted to adjust for voter turnout, but that doesn't actually do that much anyway):

vouchermapsRAW2000.png

(Click on image to see larger version.)

OK, so Bayes gives you a lot. The costs?

Beta distribution explorer

| 1 Comment

Brendan O'Connor created a small applet that allows exploring the beta distribution interactively (just hit arrow keys on the keyboard):

beta_explorer.png

This is a good example of what interactive visualization can do - Andreas Buja was also showing some cool examples some time ago.

He also has source available (for Processing).

Richard Hahn writes:

In some talk slides you recently posted you have the following bullet point: "Need to go beyond exchangeability to shrink batches of parameters in a reasonable way." If you think other readers of the blog might find it interesting, I'd love to see you elaborate on this. While the whole talk is, of course, an elaboration, you do not elsewhere explicitly mention exchangeability. Isn't the point of de Finetti-style theorems that exchangeability is precisely the "reasonable" assumption that leads to parametric models with nice conditional independence properties? Such results entail that we're at liberty to make sophisticated, highly structured models based on conditional independence with the knowledge that a set of exchangeability judgments on observables lies back of them. Even very flexible, fancy DP-based Bayesian nonparametric models are based on notions of exchangeable random partitions. I'm probably just misreading you, but would be very interested in a clarification about what exactly you mean. If not, at root, exchangeability, then what else exactly is driving the batch shrinkage and how is it not ad hoc?

My quick reply: Consider a two-way data structure modeled as y_ij = a_i + b_j + c_ij, with no other information on the rows, the columns, or the individual cells. Then you have no choice but to model the a_i's and b_j's exchangeably. But the c_ij's can be modeled conditional on the a_i's and b_j's--that is, these latent parameters can be considered as group-level predictors. The model is still exchangeable on the i's and the j's, but not on the (ij)'s. This is sometimes called "partial exchangeability." More generally, one can consider three-way models, etc.

Daniel Egan sent me a link to an article, "Standardized or simple effect size: What should be reported?" by Thom Baguley, that recently appeared in the British Journal of Psychology. Here's the abstract:

It is regarded as best practice for psychologists to report effect size when disseminating quantitative research findings. Reporting of effect size in the psychological literature is patchy -- though this may be changing -- and when reported it is far from clear that appropriate effect size statistics are employed. This paper considers the practice of reporting point estimates of standardized effect size and explores factors such as reliability, range restriction and differences in design that distort standardized effect size unless suitable corrections are employed. For most purposes simple (unstandardized) effect size is more robust and versatile than standardized effect size. Guidelines for deciding what effect size metric to use and how to report it are outlined. Foremost among these are: (i) a preference for simple effect size over standardized effect size, and (ii) the use of confidence intervals to indicate a plausible range of values the effect might take. Deciding on the appropriate effect size statistic to report always requires careful thought and should be influenced by the goals of the researcher, the context of the research and the potential needs of readers.

Egan writes:

I run into the problem of reporting coefficients all the time, mostly in the context of presenting effects to non-statisticians. While my audiences are generally bright, the obvious question always asked is "which of these is the biggest effect?" The fact that a sex dummy has a large numerical point estimate relative to number-of-purchases is largely irrelevant - it's because sex's range is tiny compared to other covariates. But moreover, sex is irrelevant to "policy-making" - we can't change a person's sex! So what we're interested in is the viable range over which we could influence an independent variable, and the second-order likely effect upon the dependent. So two questions: 1. For pedagogical effect, is there any way of getting around these problems? How can we communicate the effects to non-statisticians easily (and think someone who has exactly 10 minutes to understand your whole report)? 2. Is there any easy way to infer the elasticity of the effect - i.e. how much can we change the dependent, by attempting to exogenously change one of the independents? While I know that I could design the experiment to do this, I work in far more observational data - and this "effect" size is really what matters the most.

My quick reply to Egan is to refer to my article with Iain Pardoe on average predictive comparisons, where we discuss some of these concerns.

I also have some thoughts on the Baguley article:

Among other things, while on sabbatical in Paris next year I'll be working with my longtime collaborator Frederic Bois, a toxicologist who uses hierarchical Bayes models extensively. We have a project in toxicology that necessarily also involves research in Bayesian computation.

And, there's a postdoctoral position available! Here are the details:

In the most recent round of our discussion, Judea Pearl wrote:

There is nothing in his theory of potential-outcome that forces one to "condition on all information" . . . Indiscriminate conditioning is a culturally-induced ritual that has survived, like the monarchy, only because it was erroneously supposed to do no harm.

I agree with the first part of Pearl's statement but not the second part (except to the extent that everything we do, from Bayesian data analysis to typing in English, is a "culturally induced ritual"). And I think I've spotted a key point of confusion.

To put it simply, Donald Rubin's approach to statistics has three parts:

1. The potential-outcomes model for causal inference: the so-called Neyman-Rubin model in which observed data are viewed as a sample from a hypothetical population that, in the simplest case of a binary treatment, includes y_i^1 and y_i^2 for each unit i.

2. Bayesian data analysis: the mode of statistical inference in which you set up a joint probability distribution for everything in your model, then condition on all observed information to get inferences, then evaluate the model by comparing predictive inferences to observed data and other information.

3. Questions of taste: the preference for models supplied from the outside rather than models inspired by data, a preference for models with relatively few parameters (for example, trends rather than splines), a general lack of interest in exploratory data analysis, a preference for writing models analytically rather than graphically, an interest in causal rather than descriptive estimands.

As that last list indicates, my own taste in statistical modeling differs in some ways from Rubin's. But what I want to focus on here is the distinction between item 1 (the potential outcomes notation) and item 2 (Bayesian data analysis).

The potential outcome notation and Bayesian data analysis are logically distinct concepts!

Items 1 and 2 above can occur together or separately. All four combinations (yes/yes, yes/no, no/yes, no/no) are possible:

- Rubin uses Bayesian inference to fit models in the potential outcome framework.

- Rosenbaum (and, in a different way, Greenland and Robins) use the potential outcome framework but estimate using non-Bayesian methods.

- Most of the time I use Bayesian methods but am not particularly thinking about causal questions.

- And, of course, there's lots of statistics and econometrics that's non-Bayesian and does not use potential outcomes.

Bayesian inference and conditioning

In Bayesian inference, you set up a model and then you condition on everything that's been observed. Pearl writes, "Indiscriminate conditioning is a culturally-induced ritual." Culturally-induced it may be, but it's just straight Bayes. I'm not saying that Pearl has to use Bayesian inference--lots of statisticians have done just fine without ever cracking open a prior distribution--but Bayes is certainly a well-recognized approach. As I think I wrote the other day, I use Bayesian inference not because I'm under the spell of a centuries-gone clergyman; I do it because I've seen it work, for me and for others.

Pearl's mistake here, I think, is to confuse "conditioning" with "including on the right-hand side of a regression equation." Conditioning depends on how the model is set up. For example, in their 1996 article, Angrist, Imbens, and Rubin showed how, under certain assumptions, conditioning on an intermediate outcome leads to an inference that is similar to an instrumental variables estimate. They don't suggest including an intermediate variable as a regression predictor or as a predictor in a propensity score matching routine, and they don't suggest including an instrument as a predictor in a propensity score model.

If a variable is "an intermediate outcome" or "an instrument," this is information that must be encoded in the model, perhaps using words or algebra (as in econometrics or in Rubin's notation) or perhaps using graphs (as in Pearl's notation). I agree with Steve Morgan in his comment that Rubin's notation and graphs can both be useful ways of formulating such models. To return to the discussion with Pearl: Rubin is using Bayesian inference and conditioning on all information, but "conditioning" is relative to a model and does not at all imply that all variables are put in as predictors in a regression.

Another example of Bayesian inference is the poststratification which I spoke of yesterday (see item 3 here). But, as I noted then, this really has nothing to do with causality; it's just manipulation of probability distributions in a useful way that allows us to include multiple sources of information.

P.S. We're lucky to be living now rather than 500 years ago, or we'd probably all be sitting around in a village arguing about obscure passages from the Bible.

To continue with our discussion (earlier entries 1, 2, and 3):

1. Pearl has mathematically proved the equivalence of Pearl's and Rubin's frameworks. At the same time, Pearl and Rubin recommend completely different approaches. For example, Rubin conditions on all information, whereas Pearl does not do so. In practice, the two approaches are much different. Accepting Pearl's mathematics (which I have no reason to doubt), this implies to me that Pearl's axioms do not quite apply to many of the settings that I'm interested in.

I think we've reached a stable point in this part of the discussion: we can all agree that Pearl's theorem is correct, and we can disagree as to whether its axioms and conditions apply to statistical modeling in the social and environmental sciences. I'd claim some authority on this latter point, given my extensive experience in this area--and of course, Rubin, Rosenbaum, etc., have further experience--but of course I have no problem with Pearl's methods being used on political science problems, and we can evaluate such applications one at a time.

2. Pearl and I have many interests in common, and we've each written two books that are relevant to this discussion. Unfortunately, I have not studied Pearl's books in detail and I doubt he's had the time to read my books in detail either. It takes a lot of work to understand someone else's framework, work that we don't necessarily want to do if we're already spending a lot of time and effort developing our own research programmes. It will probably be the job of future researchers to make the synthesis. (Yes, yes, I know that Pearl feels that he already has the synthesis, and that he's proved this to be the case, but Pearl's synthesis doesn't yet take me all the way to where I want to go, which is to do my applied work in social and environmental sciences.) I truly am open to the possibility that everything I do can be usefully folded into Pearl's framework someday.

That said, I think Pearl is on shaky ground when he tries to say that Don Rubin or Paul Rosenbaum is making a major mistake in causal inference. If Pearl's mathematics implies that Rubin and Rosenbaum are making a mistake, then my first step would be to apply the syllogism the other way and see whether Pearl's assumptions are appropriate for the problem at hand.

3. I've discussed a poststratification example. As I discussed yesterday (see the first item here), a standard idea, both in survey sampling and causal inference, is to perform estimates conditional on background variables, and then average over the population distribution of the background variables to estimate the population average. Mathematically, p(theta) = sum_x p(theta|x)p(x). Or, if x is discrete and takes on only two values, p(theta) = (N_1 p(theta|x=1) + N_2 p(theta|x=2)) / (N_1 + N_2).

This has nothing at all to do with causal inference: it's straight Bayes.
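As a minimal sketch of the two-cell formula above, reading it as a population-weighted average of cell-level estimates: the function name, the cell estimates, and the cell counts below are all hypothetical, chosen only to make the arithmetic concrete.

```python
import numpy as np

def poststratify(cell_estimates, cell_counts):
    """Average cell-level estimates using population cell counts as weights."""
    est = np.asarray(cell_estimates, dtype=float)
    counts = np.asarray(cell_counts, dtype=float)
    weights = counts / counts.sum()      # p(x) = N_x / (N_1 + N_2 + ...)
    return float(np.dot(weights, est))   # sum_x p(theta|x) p(x), as an estimate

# Hypothetical numbers: estimates of theta within x=1 and x=2,
# and the population sizes N_1 and N_2 of the two cells.
theta_given_x = [0.40, 0.60]
N = [300, 700]
theta_pop = poststratify(theta_given_x, N)
```

The same function works for any number of cells; with posterior simulations one would apply it to each draw of the cell-level estimates rather than to point estimates.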

Pearl thinks that if the separate components p(theta|x) are nonidentifiable, that you can't do this, and you should not include x in the analysis. He writes:

I [Pearl] would really like to see how a Bayesian method estimates the treatment effect in two subgroups where it is not identifiable, and then, by averaging the two results (with two huge posterior uncertainties) gets the correct average treatment effect, which is identifiable, hence has a narrow posterior uncertainty. . . . I have no doubt that it can be done by fine-tuned tweaking . . . But I am talking about doing it the honest way, as you described it: "the uncertainties in the two separate groups should cancel out when they're being combined to get the average treatment effect." If I recall my happy days as a Bayesian, the only operation allowed in combining uncertainties from two subgroups is taking a linear combination of the two, weighted by the (given) relative frequencies of the groups. But, I am willing to learn new methods.

I'm glad that Pearl is willing to learn new methods--so am I--but, no new methods are needed here! This is straightforward, simple Bayes. Rod Little has written a lot about these ideas. I wrote some papers on it in 1997 and 2004. Jeff Lax and Justin Phillips do it in their multilevel modeling and poststratification papers where, for the first time, they get good state-by-state estimates of public opinion on gay rights issues. No "fine-tuned tweaking" required. You just set up the model and it all works out. If the likelihood provides little to no information on theta|x but it does provide good information on the marginal distribution of theta, then this will work out fine.
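Here is a toy illustration of that last point, not taken from any of the papers mentioned. It assumes a deliberately simple linear-Gaussian setup: two subgroup effects with independent, very wide normal priors, and data that inform only their population-weighted average. The weights, prior sd, and data sd are all made-up numbers.

```python
import numpy as np

# Assumed setup (hypothetical numbers):
#   theta = (theta_1, theta_2), independent N(0, tau^2) priors
#   data inform only w . theta, with standard error s
w = np.array([0.5, 0.5])   # population shares of the two subgroups
tau = 10.0                 # prior sd: nearly flat on each subgroup effect
s = 0.1                    # standard error of the estimated average effect

# Conjugate normal update: posterior precision = prior precision + data precision
prior_precision = np.eye(2) / tau**2
data_precision = np.outer(w, w) / s**2
post_cov = np.linalg.inv(prior_precision + data_precision)

sd_subgroups = np.sqrt(np.diag(post_cov))     # posterior sd of theta_1, theta_2
sd_average = float(np.sqrt(w @ post_cov @ w)) # posterior sd of w . theta
corr = post_cov[0, 1] / (sd_subgroups[0] * sd_subgroups[1])
```

Under these assumptions each subgroup effect keeps a posterior sd of about 7, while the weighted average has a posterior sd of about 0.1. The two subgroup posteriors are almost perfectly negatively correlated, which is how the large uncertainties cancel in the linear combination without any tuning.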

In practice, of course, nobody is going to control for x if we have no information on it. Bayesian poststratification really becomes useful in that it can put together different sources of partial information, such as data with small sample sizes in some cells, along with census data on population cell totals.

Please, please don't say "the correct thing to do is to ignore the subgroup identity." If you want to ignore some information, that's fine--in the context of the models you are using, it might even make sense. But Jeff and Justin and the rest of us use this additional information all the time, and we get a lot out of it. What we're doing is not incorrect at all. It's Bayesian inference. We set up a joint probability model and then work from it. If you want to criticize the probability model, that's fine. If you want to criticize the entire Bayesian edifice, then you'll have to go up against mountains of applied successes.

As I wrote earlier, you don't have to be a Bayesian--I have a great respect for the work of Hastie, Tibshirani, Robins, Rosenbaum, and many others who are developing methods outside the Bayesian framework--but I think you're on thin ice if you want to try to claim that Bayesian analysis is "incorrect."

4. Jennifer and I and many others make the routine recommendation to exclude post-treatment variables from analysis. But, as both Pearl and Rubin have noted in different contexts, it can be a very good idea to include such variables--it's just not a good idea to include them as regression predictors. If the only thing you're allowed to do is regression (as in chapter 9 of ARM), then I think it's a good idea to exclude post-treatment predictors. If you're allowed more general models, then one can and should include them. I'm happy to have been corrected by both Pearl and Rubin on this one.

5. As I noted yesterday (see second-to-last item here), all statistical methods have holes. This is what motivates us to consider new conceptual frameworks as well as incremental improvements in the systems with which we are most familiar.

Summary . . . so far

I doubt this discussion is over yet, but I hope the above notes will settle some points. In particular:

- I accept (on authority of Pearl, Wasserman, etc.) that Pearl has proved the mathematical equivalence of his framework and Rubin's. This, along with Pearl's other claim that Rubin and Rosenbaum have made major blunders in applied causal inference (a claim that I doubt), leads me to believe that Pearl's axioms are in some way not appropriate to the sorts of problems that Rubin, Rosenbaum, and I work on: social and environmental problems that don't have clean mechanistic causation stories. Pearl believes his axioms do apply to these problems, but then again he doesn't have the extensive experience that Rosenbaum and Rubin have. So I think it's very reasonable to suppose that his axioms aren't quite appropriate here.

- Poststratification works just fine. It's straightforward Bayesian inference, nothing to do with causality at all.

- I have been sloppy when telling people not to include post-treatment variables. Both Rubin and Pearl, in their different ways, have been more precise about this.

- Much of this discussion is motivated by the fact that, in practice, none of these methods currently solves all our applied problems in the way that we would like. I'm still struggling with various problems in descriptive/predictive modeling, and causation is even harder!

- Along with this, taste--that is, working with methods we're familiar with--matters. Any of these methods is only as good as the models we put into them, and we typically are better modelers when we use languages with which we're more familiar. (But not always. Sometimes it helps to liberate oneself, try something new, and break out of the implicit constraints we've been working under.)

To follow up on yesterday's discussion, I wanted to go through a bunch of different issues involving graphical modeling and causal inference.

Contents:
- A practical issue: poststratification
- 3 kinds of graphs
- Minimal Pearl and Minimal Rubin
- Getting the most out of Minimal Pearl and Minimal Rubin
- Conceptual differences between Pearl's and Rubin's models
- Controlling for intermediate outcomes
- Statistical models are based on assumptions
- In defense of taste
- Argument from authority?
- How could these issues be resolved?
- Holes everywhere
- What I can contribute

Philip Dawid (a longtime Bayesian researcher who's done work on graphical models, decision theory, and predictive inference) saw our discussion on causality and sends in some interesting thoughts, which I'll post here and then very briefly comment on:

Having just read through this fascinating interchange, I [Dawid] confess to finding Shrier and Pearl's examples and arguments more convincing than Rubin's. At the risk of adding to the confusion, but also in hope of helping at least some others, let me briefly describe yet another way (related to Pearl's, but with significant differences) of formulating and thinking about the problem. For those who, like me, may be concerned about the need to consider the probabilistic behaviour of counterfactual variables, on the one hand, or deterministic relationships encoded graphically, on the other, this provides an observable-focused, fully stochastic, alternative. A full presentation of the essential ideas can be found in Chapters 9 (Confounding and Sufficient Covariates) and 10 (Reduction of Sufficient Covariate) of my online document "Principles of Statistical Causality".

Like Pearl, I like to think of "causal inference" as the task of inferring what would happen under a hypothetical intervention, say F_E = e, that sets the value of the exposure E at e, when the data available are collected, not under the target "interventional regime", but under some different "observational regime". We could code this regime as F_E = idle. We can think of the non-stochastic variable F_E as a parameter, indexing the joint distribution of all the variables in the problem, under the regime indicated by its value.

This Thursday at 7pm Jake Hofman and Suresh Velagapundi will present a session at New York R Statistical Programming Meetup at NYU - Silver Center (100 Washington Square East, Room 401). Here's the outline:

Background:
  • Conditional probability & Bayes' Rule
  • Treating parameters as random variables & putting distributions on them
  • Bayesian inference: from priors & likelihoods to posteriors
From Principles to Practice:
  • Simple plan; difficult to execute (normalization)
  • Resort to approximation methods (variational & MCMC)
  • Model selection / complexity control a la Bayes

Andrew Knight points me to this Kafkaesque report on Bayesian methods and evidence-based medicine. It's always good to see things like this out there.

My main disagreement with the report is on their framework in which there is a fixed data model and different choices of prior distribution. As we discuss in Section 2.8 of Bayesian Data Analysis, I much prefer the framework in which a single prior distribution (or "population distribution") is applied to many different data settings. I think that framing it my way makes the benefits of Bayesian inference much clearer.

I also don't like all the tables. But that's not really a Bayesian issue.

As I've discussed here on occasion, I like to standardize continuous regression inputs by dividing by two standard deviations. That way the rescaled variables each have sd of 1/2, which is approximately the same sd as any binary predictor, allowing the coefficients to be interpreted together.
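A minimal sketch of the rescaling, assuming the input is also centered at its mean. The helper name `rescale_2sd` and the simulated `age` and `female` variables are hypothetical, used only to show the resulting scales.

```python
import numpy as np

def rescale_2sd(x):
    """Center a continuous input and divide by two sample standard deviations."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (2.0 * x.std(ddof=1))

rng = np.random.default_rng(0)
age = rng.normal(45, 12, size=1000)   # hypothetical continuous input
female = np.tile([0, 1], 500)         # hypothetical balanced 0/1 predictor

z_age = rescale_2sd(age)
sd_z_age = z_age.std(ddof=1)          # 0.5 by construction
sd_female = female.std()              # 0.5 for a 50/50 binary variable
```

A binary predictor with proportion p has sd sqrt(p(1-p)), which equals 0.5 at p = 0.5 and stays close to it unless p is far from one half. That is what puts the rescaled continuous input and the untransformed 0/1 predictor on roughly comparable scales.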

Standardizing is often thought of as a stupid sort of low-rent statistical technique, beneath the attention of "real" statisticians and econometricians, but I actually like it, and I think this 2 sd thing is pretty cool.

As Aleks pointed out, however, standardizing based on the data is not strictly Bayesian, because the interpretation of the model parameters then depends on the sample data. As we discussed, a more fully Bayesian approach would be to think of the scale for standardization as an unknown parameter to itself be estimated from the data.

P.S. Recall that "inputs" are not the same as "predictors."

P.P.S. I scale by 2 sd to be consistent with 0/1 predictors. In retrospect, I wish I'd scaled by 1 sd and then coded binary predictors as -1 and 1 to be consistent. This would've been simpler overall. But I think it's too late now.

I've made this point before, but I just received an email on the topic and so I thought I'd point youall to section 3.3 of this article of mine from 2003 where I make the argument in detail.

This article--A Bayesian Formulation of Exploratory Data Analysis and Goodness-of-fit Testing--is one of my favorites. It also features:
- A potted history of Bayesian inference (section 2.1)
- The first published definition (I think) of U-values and P-values (section 2.3)
- A model-checking perspective on the problem of degenerate estimates for mixture models (section 3.1)
- Why this isn't all obvious (section 5)

The article is based on a presentation I gave a year earlier at a conference. It was supposed to appear in the proceedings volume, but it was late, and the conference organizer was so annoyed he refused to include it. So I published it in the International Statistical Review instead. A year later I published a related article, Exploratory Data Analysis for Complex Models, as a discussion paper in the Journal of Computational and Graphical Statistics. That second article is more coherent, but personally I prefer the International Statistical Review article because it covers so many little topics that don't fit into existing theories of inference. I think of these examples as analogous to the quantum anomalies that toppled classical physics around 1900. In this case, what I want to topple is classical Bayesian inference--by which I mean Bayesian theory that does not include model building and model checking.

Christian Robert, Nicolas Chopin, and Judith Rousseau wrote this article that will appear in Statistical Science with various discussions, including mine.

I hope those of you who are interested in the foundations of statistics will read this. Sometimes I feel like banging my head against a wall, in my frustration in trying to communicate with Bayesians who insist on framing problems in terms of the probability that theta=0 or other point hypotheses. I really feel that these people are trapped in a bad paradigm and, if they would just think things through based on first principles, they could make some progress. Anyway, here's what I wrote:

I actually own a copy of Harold Jeffreys's Theory of Probability but have only read small bits of it, most recently over a decade ago to confirm that, indeed, Jeffreys was not too proud to use a classical chi-squared p-value when he wanted to check the misfit of a model to data (Gelman, Meng, and Stern, 2006). I do, however, feel that it is important to understand where our probability models come from, and I welcome the opportunity to use the present article by Robert, Chopin, and Rousseau as a platform for further discussion of foundational issues.

In this brief discussion I will argue the following: (1) in thinking about prior distributions, we should go beyond Jeffreys's principles and move toward weakly informative priors; (2) it is natural for those of us who work in social and computational sciences to favor complex models, contra Jeffreys's preference for simplicity; and (3) a key generalization of Jeffreys's ideas is to explicitly include model checking in the process of data analysis.

What is Info-Gap Theory?

| 8 Comments

David Fox writes:

As a 'classically' trained statistician who works on 'real' problems (mainly environmental ones) I have come to appreciate the utility and benefits of working within a Bayesian framework. I would not classify myself as a 'convert' but prefer to have an array of statistical tools from which I can select the most appropriate one for the job at hand. As they say - if all you've got is a hammer, then the whole world's a nail! On the issue of choice of priors, I believe this is an absolute strength in the evaluation and setting of environmental regulatory limits. In situations characterized by high levels of data paucity but rich with expert knowledge (albeit diverse), why would you choose to ignore the latter?

However, I should get to the real purpose of this email. A rather fierce debate has been taking place among academics in our departments of Botany and Mathematics and Statistics about the use of a 'new' form of decision-making under extreme uncertainty. It is called Info-Gap (short for information gap) Theory and owes its existence to Prof. Yakov Ben-Haim at Technion in Israel (Ben-Haim 2006). Yakov is well known to the aforementioned academics - he visits here regularly and has done a remarkably good job at 'selling' his product - to the extent that some staff and students in our Botany department and The Australian Centre of Excellence in Risk Analysis (http://www.acera.unimelb.edu.au) have enthusiastically (and some would say, blindly) embraced this 'new' paradigm for decision-making under extreme uncertainty. I must plead mea culpa, having been swept up in the initial enthusiasm and published a couple of papers which use info-gap. However, I have a growing unease that IG is not 'new' but in fact a variant of existing methodologies. While not wishing to draw you into our local debate, I was wondering if you have ever heard of info-gap theory and if you have, do you have an opinion? Prof. Ben-Haim has recently launched his own web site (http://www.info-gap.com) presumably in response to the 'hi-jacking' of the Wikipedia entry (http://en.wikipedia.org/wiki/Info-gap_decision_theory) by IG's most strident local critic, Moshe Sniedovich. Sniedovich has also established a web site (http://info-gap.moshe-online.com/) and a quick look will demonstrate the ferocity of the debate.

Just today, the following paragraph in a paper I was reading [Hickey, G.L., Craig, P.S., and Hart, A. (2009) On the application of loss functions in determining assessment factors for ecological risk. Ecotoxicology and Environmental Safety, 72, 293-300] caught my attention:

"There do exist other forms of risk measurement. However, by a very well-known theorem of Wald (1950), any admissible decision rule is a Bayes rule with respect to some prior distribution (possibly an improper prior distribution), whereby admissibility is defined to mean that no other decision rule dominates it in terms of risk. It is therefore argued by many, for example, Bernardo and Smith (2000) that it is pointless to work in decision-theory outside the Bayesian framework".

This accords with my own gut feeling that IG Theory is in fact a Bayes Rule with a non-informative prior.

My reply: I had never heard about Dr. Ben-Haim or his methods before receiving this email. I checked out the links but couldn't really see the point in this approach. The mathematics looked complicated and appeared to be a distraction from the more important goals of modeling the decision problems directly.

For some of my thoughts on Bayesian decision analysis, see chapter 22 of Bayesian Data Analysis (second edition). Bayesian decision analysis is a lot more flexible than people realize, I think, especially when used in the context of hierarchical modeling. See here for a brief discussion of my idea of "institutional decision analysis" and here for an example of Bayesian decision analysis in action.

In my article on the boxer, the wrestler, and the coin flip, I discuss some fundamental difficulties with Bayesian robustness and similar approaches.

Finally, I don't know that I'd agree with the statement that it's "pointless" to work in non-Bayesian decision theory. For me, I've found the Bayesian approach to do the job, but I can imagine there are settings where other methods can be useful. I'm not, however, a fan of those 1950's-style alternatives such as "minimax regret" and all the rest. I offer no comment on Info-Gap since I didn't put in the effort to try to understand exactly what it is.

Mark Johnson writes:

Suppose we want to build a hierarchical model where the lowest level are multinomials and the next level up are Dirichlets (the conjugate prior to multinomials). What distributions should we use for the level above that? I'm not a statistician, but aren't Dirichlet distributions exponential models? If so, Dirichlets should have conjugate priors, which might be useful for the next level up in hierarchical models. I've never heard anyone talk about conjugate priors for Dirichlets, but perhaps I'm not listening to the right people. Do you have any other suggestions for priors for Dirichlets?

My reply: I'm not sure, but I agree that there should be something reasonable here. I've personally never had much success with Dirichlets. When modeling positive parameters that are constrained to sum to 1, I prefer to use a redundantly-parameterized normal distribution. For example, if theta_1 + theta_2 + theta_3 + theta_4 = 1, with all thetas constrained to be positive, I'll use the model,

theta_j = exp(phi_j)/(exp(phi_1)+exp(phi_2)+exp(phi_3)+exp(phi_4)), for j=1,2,3,4.

If you give each of the phi_j's a normal distribution, this is a more flexible model than the Dirichlet: it has 8 parameters (four means and four variances). Well, actually 7, because the means are only identified up to an arbitrary additive constant.
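A minimal sketch of this parameterization, assuming independent normal distributions on the phi_j's. The particular means and standard deviations are made up for illustration, and subtracting the maximum inside `softmax` is only a standard numerical-stability step, not part of the model.

```python
import numpy as np

def softmax(phi):
    """Map unconstrained phi to positive theta summing to 1 along the last axis."""
    phi = np.asarray(phi, dtype=float)
    e = np.exp(phi - phi.max(axis=-1, keepdims=True))  # stabilized exponentials
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
mu = np.array([0.0, 0.5, -0.5, 1.0])     # hypothetical means of phi_1..phi_4
sigma = np.array([0.3, 0.3, 1.0, 0.1])   # hypothetical sds of phi_1..phi_4

phi = rng.normal(mu, sigma, size=(5000, 4))  # 5000 draws of (phi_1, ..., phi_4)
theta = softmax(phi)                         # each row is positive and sums to 1

# Adding a constant to every phi_j leaves theta unchanged, which is why
# the means are identified only up to an additive constant.
theta_shifted = softmax(phi + 3.0)
```

Giving the components different sds, as here, lets some proportions vary much more than others, which is the extra flexibility relative to a Dirichlet.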

Frederic Bois and I used this distribution for a problem in toxicology, modeling blood flows within different compartments of the body--these were constrained to sum to total blood flow.

In this article, I used a fun stochastic approximation trick to compute reasonable values for the mean and variance parameters for the phi's.

P.S. Dana Kelly points to this article on the topic by Aitchison and Shen from 1980.

Adam Taylor writes:

One of the criticisms of Bayesian statistics seems to be that, as generally practiced, it relies on strong distributional assumptions. I'm wondering if it's possible to come up with a posterior distribution on the mean of a bunch of IID samples from an unknown distribution [I think he means "a posterior distribution on an unknown distribution given the mean of an independent sample from that distribution" -- AG], and to do it such that you don't have to make strong assumptions about what the unknown distribution is. I.e. I'm looking for some kind of nonparametric or semi-parametric Bayesian approach to this problem. Does something like this exist?

My reply: Yes, you can do this. See, for example, the articles, "On Bayesian analysis of mixtures with an unknown number of components," by Sylvia Richardson and Peter Green, Journal of the Royal Statistical Society B, 59, 731-792 (1997), and "Bayesian density regression," by David Dunson, N. S. Pillai, and J-H. Park, Journal of the Royal Statistical Society B, 69, 163-183.

The short answer is that it's not trivial to solve the problem in reasonable generality. There are classical methods such as kernel density estimation that are much simpler, but they have problems when sample size is small.

Beyond this, my intuition is that the way to proceed is to think hierarchically. You're almost never really just analyzing a sample from one distribution; realistically, you'll be applying your method repeatedly on related problems. This returns us to the connections between hierarchical modeling and the frequency evaluation of statistical procedures.

Tyler Brough writes:

I am currently a PhD student in Finance. I was explaining my research today to a very senior scholar at a well known eastern business school, who remarked that the econometric methods I am using are "way too complex, and that unless you just do OLS no one will believe it anyway." What are your thoughts about that comment? If I know that the data generating process violates all of the assumptions underlying the classical linear regression model then I have to use more complex methods do I not?

My reply: I agree with the senior scholar that it's important--even necessary--to do the simpler linear regression in addition to the more elaborate model. If a more elaborate model gives a different answer than the least-squares regression, this doesn't necessarily mean that the more elaborate model is wrong, or even too complex--but it does mean that you need to understand what's happening in the transition from the simple to the complex model.

Sometimes I prefer the simpler model and I think the more complex model is giving misleading results and implausible extrapolation.

Other times I like the complicated model, and I put in the effort to understand why it differs from the simple model. Ultimately you have to go to the data and to the underlying problem being studied.

Juned Siddique writes:

I have a question regarding a paragraph in your paper, "Prior distributions for variance parameters in hierarchical models."

In the paper, you write, "We view any noninformative or weakly-informative prior distribution as inherently provisional--after the model has been fit, one should look at the posterior distribution and see if it makes sense. If the posterior distribution does not make sense, this implies that additional prior knowledge is available that has not been included in the model, and that contradicts the assumptions of the prior distribution that has been used. It is then appropriate to go back and alter the prior distribution to be more consistent with this external knowledge."

I don't quite understand this passage, especially the part where you write, "this implies that additional prior knowledge is available that has not been included in the model," and was hoping to get more explanation.

My situation is that I am fitting a random-effects probit model and using posterior predictive checking to check the fit of the model. One way to get the model to fit the data well is to use an informative prior that I arrived at by iterating between posterior predictive checking and making my prior more informative. While changing one's model to make it fit the data better is standard in statistics, it seems like I should be changing the likelihood, not the prior. On the other hand, my "model" is my posterior distribution which also includes the prior.

My reply:

1. You should certainly feel free to change the likelihood as well as the prior. Both can have problems.

2. With hierarchical models, the whole likelihood/prior distinction becomes less clear. In your example, you have a probability model for the data (independent observations, I assume), a nonlinear probit model with various predictors, a normal model (probably) for your group-level coefficients, and some sort of hyperprior for what remains.

3. My point in the quoted passage is that in the phrase "the posterior distribution does not make sense," the "does not make sense" part is implicitly (or explicitly) a comparison to some external knowledge or expectation that you have. "Does not make sense" compared to what? This external knowledge represents additional prior information.
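To make point 2 concrete, the layers of such a model might be written as below. This is only a sketch: the notation ($y_i$, $x_i$, $j[i]$, $\beta$, $\alpha_j$, $\sigma_\alpha$) is introduced here for illustration and is not taken from the correspondent's actual model.

```latex
\begin{align*}
\Pr(y_i = 1) &= \Phi\!\left(x_i^\top \beta + \alpha_{j[i]}\right), & i &= 1,\dots,n && \text{(data model, probit link)}\\
\alpha_j &\sim \mathrm{N}(0, \sigma_\alpha^2), & j &= 1,\dots,J && \text{(group-level model)}\\
(\beta, \sigma_\alpha) &\sim p(\beta, \sigma_\alpha) &&&& \text{(hyperprior)}
\end{align*}
```

The middle line can be read either as a prior distribution for the $\alpha_j$ or as part of the likelihood for $(\beta, \sigma_\alpha)$, which is why the distinction blurs.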

In a comment on my entry on why I don't like so-called Bayesian hypothesis testing, Stephen Senn writes:

Bayesian hypothesis tests are the work of Harold Jeffreys who realised that you could not proceed using vague priors for parameters unless you have a means of choosing between simpler and more complex models. Also he was keen to find ways of proving that scientific laws are true. If you think you can do this and want to do it you need Bayesian hypothesis tests.

I, personally, don't like Jeffreys's approach. However, I think that it is a tribute to his genius that he realised that such a system had to be part and parcel of any attempt to be semi-objective in the use of Bayes. Unfortunately we now have many so-called Bayesians who think they can use uninformative priors without a system of deciding between simpler and more complex hypotheses. This is not possible.

My reply: When deciding between simpler and more complex hypotheses, I generally prefer the more complex hypothesis. When I choose the simpler hypothesis, I view this as a combination of labor-saving device and approximate Bayes, pooling a parameter estimate all the way to zero instead of merely pooling it most of the way. I certainly don't see Bayes factors having any relevance, given the oft-noted problem that Bayes factors can depend decisively on aspects of the prior distribution that have no influence on the posterior distribution under each of the individual models.
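The dependence of Bayes factors on the prior can be seen in a small numerical sketch. It assumes the simplest setting I can think of: a normal estimate `ybar` with known standard error `se`, a point null theta = 0, and an alternative theta ~ N(0, tau^2). The numbers and function names are hypothetical, chosen only for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def bayes_factor_null(ybar, se, tau):
    """Bayes factor for H0: theta = 0 against H1: theta ~ N(0, tau^2)."""
    marginal_h0 = normal_pdf(ybar, 0.0, se)
    # Under H1 the marginal distribution of ybar is N(0, tau^2 + se^2).
    marginal_h1 = normal_pdf(ybar, 0.0, math.sqrt(tau**2 + se**2))
    return marginal_h0 / marginal_h1

def posterior_under_h1(ybar, se, tau):
    """Posterior mean and sd of theta under H1 (normal-normal conjugacy)."""
    shrink = tau**2 / (tau**2 + se**2)
    return shrink * ybar, se * math.sqrt(shrink)

ybar, se = 0.25, 0.10  # hypothetical estimate and standard error
for tau in (10.0, 100.0, 1000.0):
    bf = bayes_factor_null(ybar, se, tau)
    mean, sd = posterior_under_h1(ybar, se, tau)
    print(f"tau={tau:7.1f}  BF(null)={bf:9.2f}  posterior mean={mean:.4f}  sd={sd:.4f}")
```

In this setup the posterior under the alternative barely moves as tau grows, while the Bayes factor in favor of the null grows roughly in proportion to tau.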

My California trip

| 6 Comments

Monday UC Irvine: Weakly Informative Priors

Tuesday Caltech: Red State, Blue State

Wednesday Google: Red State, Blue State

Wednesday Stanford: I'm not sure yet

Thursday Berkeley: Red State, Blue State

Friday Berkeley: Weakly Informative Priors

If you're at any of these places, feel free to come and ask your toughest questions!

It looks a little silly that it's the same two talks over and over, but of course the audiences will all be different. Maybe I'll vary them a bit, just to keep things interesting. Also I'm giving a few more lectures at Berkeley for some sort of training program at the education school, but I don't think these are open to general audiences.

If you want to see the slides, the current versions are here and here. (But I think I'll work a bit on the Red State, Blue State presentation.) And if you want to see slides for a bunch of other talks, go here.

I received the following email:

As a psychologist teaching and using Bayesian statistics, I've been pleased to see some of my colleagues endorsing Bayesian data analysis. But I've been very chagrined to see them champion Bayes factors for null-hypothesis testing, instead of parameter estimation. My question is simple: Are there any articles that head-on challenge the Bayes-factor approach to null-hypothesis testing, and instead favor parameter estimation?

Perhaps the most straight-forward example against Bayes factors for null hypotheses was given by Stone (1997), Statistics and Computing, 7, 263-264. He showed a simple case in which the BF prefers the null but the estimated posterior excludes the null value. I realize that the two approaches are asking different questions --- I've just never really been convinced that the answer provided by a null/alternative comparison really tells us anything we want to know, because no matter what it says, I always want to do the estimation anyway.

My reply: You won't be surprised that I agree with the above perspective. Here's my article with Rubin (from Sociological Methodology 1995) where we bang on Bayes factors for 8 straight pages. I really like this article.

Chris Masse pointed me to this blog by Panos Ipeirotis, who argues that some online prediction markets give probabilities that are too good to be true:

Here's Lindley. I suspect I'd agree with Lindley on just about any issue of statistical theory and practice. I've read some of Lindley's old articles and contributions to discussions and, even when he seemed like something of an extremist at the time, in retrospect he always seems to be correct. That said, I disagree with him on Taleb. I think the difference is that Lindley was evaluating The Black Swan based on its statistical content, whereas I liked the book because it was full of ideas and stories that sparked thoughts in my mind (and, I think, in the minds of many readers).

Also, I disagree with Lindley 100% about Karl Popper. Even though, again, I think Lindley and I are extremely close on issues of statistical practice and theory.

And here's Robert. I like his connection of "black swans" to "model shift." This fits in well to my three stages of Bayesian Data Analysis (model building, model fitting, model checking), with model checking being the all-important but often neglected ugly sister. (As I've discussed many times, you rarely see graphical model checks in a published paper, because either (a) the model didn't fit, in which case, at worst you'd be too embarrassed to admit it, or at best you'd fix the model and there'd be nothing to report, or (b) the model fits ok, in which case the model check is probably only worth a sentence or two.)

From a philosophical point of view, I think the most important point of confusion about Bayesian inference is the idea that it's about computing the probability that a model is true. In all the areas I've ever worked on, the model is never true. But what you can do is find out that certain important aspects of the data are highly unlikely to be captured by the fitted model, which can facilitate a "model shift" moment. This sort of falsification is why I believe Popper's philosophy of science to be a good fit to Bayesian data analysis.

Also, I agree with Christian's characterization of Black-Scholes etc. as not "an accurate representation of reality, but rather a gentleman's agreement between traders that served to agree on prices." The way I put it was that these graduate programs in "financial mathematics / financial engineering" served a useful function by screening for students who were mathematically able and willing to work hard. It's too bad they couldn't have been learning statistics instead, but, for better or worse, competence in statistics is easier to fake than competence in math.

Christian also has an interesting conclusion:

Encouraging a total mistrust of anything scientific or academic is not helping in solving issues, but most surely pushes people in the arms of charlatans with ready answers.

I wonder what Taleb would say about this. Possibly he'd reply that it's better for citizens to think critically than to be awed by their financial advisors.

AT writes:

I've got a count-based data set with a lot of zeroes present. I'm using zero-inflated modeling to capture the shape, and I want to test goodness-of-fit from both ends -- under- and overfitting. I've read your 1996 paper with XL and Hal Stern which recommends a "discrepancy measure" as being a good quantity to calculate with posterior predictive data. The main suggestion there was to use a chi-square statistic, but I'm sure this would be inappropriate in this case given that the zero cases would drive the entire statistic (and breaking the minimum-cell-size rule for the chi-square about 500 times in the process.) I suppose we could correct for this by doing the square-root trick to stabilize variance, but that still doesn't seem like it would resolve the problem with the zeroes. Any thoughts as to how to find a good discrepancy measure to check?

My generic response is that we always want the test summaries to relate to the substantive questions of interest. In this case, I don't have the context but I can make some quick suggestions, such as to create two test summaries: (a) the percentage of zeroes, and (b) some summary of the fit of the counts when they are not zero.

The so-called minimum cell size rule is irrelevant, since you can compute the reference distribution directly using simulation. And issues such as stabilizing variance are not particularly relevant either, except inasmuch as they allow your test to more accurately capture the aspects of the data that are important for you to fit with your model.
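Here is a minimal sketch of the two test summaries, with the reference distribution computed by simulation. Everything specific in it is an assumption made for illustration: the data are made up, the model being checked is a plain Poisson (deliberately wrong for zero-heavy counts) with a Gamma(a, b) prior on the rate, and the function names are hypothetical.

```python
import math
import random

def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for small rates."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def prop_zero(y):
    """Test summary (a): proportion of zero counts."""
    return sum(1 for v in y if v == 0) / len(y)

def mean_nonzero(y):
    """Test summary (b): mean of the counts that are not zero."""
    pos = [v for v in y if v > 0]
    return sum(pos) / len(pos) if pos else 0.0

def ppc_pvalue(y, stat, n_rep=500, a=0.5, b=0.01, seed=1):
    """Share of replicated datasets whose summary is at least the observed one."""
    rng = random.Random(seed)
    n, total = len(y), sum(y)
    observed = stat(y)
    exceed = 0
    for _ in range(n_rep):
        # Conjugate posterior for the Poisson rate: Gamma(a + sum(y), rate b + n).
        lam = rng.gammavariate(a + total, 1.0 / (b + n))
        y_rep = [poisson_draw(rng, lam) for _ in range(n)]
        if stat(y_rep) >= observed:
            exceed += 1
    return exceed / n_rep

# Made-up zero-heavy data: a structural zero with probability 0.6, else Poisson(3).
data_rng = random.Random(0)
y = [0 if data_rng.random() < 0.6 else poisson_draw(data_rng, 3.0) for _ in range(200)]

print("p-value, proportion of zeroes:", ppc_pvalue(y, prop_zero))
print("p-value, mean of nonzero counts:", ppc_pvalue(y, mean_nonzero))
```

No cell-size rule enters anywhere: each summary is simply compared to its own simulated reference distribution. With data like these, the plain Poisson model should put the observed proportion of zeroes far out in the tail.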

I recently reviewed a report that used posterior predictive checks (that is, taking the fitted model and using it to simulate replicated data, which are then compared to the observed dataset). One of the other reviewers wrote (in response to the report, not to me):

The model goodness-of-fit statistics that the authors present on this page are biased, and should be interpreted with at least some caution. They give an over-optimistic evaluation of the fit of the hierarchical Bayes model. This is because the data are used twice: once to fit the model, and once again to assess the fit of the model. In fact, the posterior p-values are not asymptotically uniform, as they should be.

I completely disagree! I've discussed this point before. But the attitude expressed in the above quote is held strongly enough, and commonly enough, that I'm willing to spend some time trying to clear things up.

Let's unpack things.

I had a discussion with Christian Robert about the mystical feelings that seem to be sometimes inspired by Bayesian statistics. Christian began by describing this article that was on the web about constructing Bayes' theorem for simple binomial outcomes with two possible causes as "indeed funny and entertaining (at least at the beginning) but, as a mathematician, I [Christian] do not see how these many pages build more intuition than looking at the mere definition of a conditional probability and at the inversion that is the essence of Bayes' theorem. The author agrees to some level about this . . . there is however a whole crowd on the blogs that seems to see more in Bayes's theorem than a mere probability inversion . . . a focus that actually confuses--to some extent--the theorem [two-line proof, no problem, Bayes' theorem being indeed tautological] with the construction of prior probabilities or densities [a forever-debatable issue]."

I replied that there are several different points of fascination about Bayes:

Christian Robert has some thoughts on my paper with Aleks, Yu-Sung, and Grazia on weakly informative priors for logistic regression. Christian writes:

I [Christian] would have liked to see a comparison of bayesglm with the generalised g-prior perspective we develop in Bayesian Core . . . starting with a g-like-prior on the parameters and using a non-informative prior on the factor g allows for both a natural data-based scaling and an accounting of the dependence between the covariates. This non-informative prior on g then amounts to a generalised t prior on the parameter, once g is integrated.

This sounds interesting. I agree that it makes sense to use a hierarchical model for the coefficients so that they are scaled relative to each other.

Regarding the pre-scaling that we do: I think something of this sort is necessary in order to be able to incorporate prior information. For example, if you are regressing earnings on height, it makes a difference if height is in inches, feet, meters, kilometers, etc. (Although any scale is ok if you take logs first.) I agree that the pre-scaling can be thought of as an approximation to a more formal hierarchical model of the scaling. Aleks and I discussed this when working on the bayesglm project, but it wasn't clear how to easily implement such scaling. It's possible that the t-family prior can be interpreted as some sort of mixture with a normal prior on the scaling.

In any case, maybe Aleks can try Christian's model on our corpus and see what happens. Christian links to his code, which would be a good place to start.

By Aleks, Grazia, Yu-Sung and myself. Here's the article, and here's the abstract:

We propose a new prior distribution for classical (nonhierarchical) logistic regression models, constructed by first scaling all nonbinary variables to have mean 0 and standard deviation 0.5, and then placing independent Student-t prior distributions on the coefficients. As a default choice, we recommend the Cauchy distribution with center 0 and scale 2.5, which in the simplest setting is a longer-tailed version of the distribution attained by assuming one-half additional success and one-half additional failure in a logistic regression. Cross-validation on a corpus of datasets shows the Cauchy class of prior distributions to outperform existing implementations of Gaussian and Laplace priors.

We recommend this prior distribution as a default choice for routine applied use. It has the advantage of always giving answers, even when there is complete separation in logistic regression (a common problem, even when the sample size is large and the number of predictors is small), and also automatically applying more shrinkage to higher-order interactions. This can be useful in routine data analysis as well as in automated procedures such as chained equations for missing-data imputation.

We implement a procedure to fit generalized linear models in R with the Student-t prior distribution by incorporating an approximate EM algorithm into the usual iteratively weighted least squares. We illustrate with several applications, including a series of logistic regressions predicting voting preferences, a small bioassay experiment, and an imputation model for a public health data set.

I love this stuff, and I'm interested in applying the concept of weakly informative prior distributions for many other models.
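As a rough sketch of what the rescaling and the Cauchy prior do under complete separation, consider the toy example below. It is not the bayesglm algorithm (which, per the abstract, uses an approximate EM step inside iteratively weighted least squares); it swaps in a plain grid search over a single slope with no intercept, and the data, grid, and function names are assumptions made for illustration.

```python
import math
import statistics

def rescale(x):
    """Center at 0 and scale to standard deviation 0.5 (divide by 2 sd)."""
    m, s = statistics.mean(x), statistics.stdev(x)
    return [(v - m) / (2.0 * s) for v in x]

def log_lik(beta, x, y):
    """Logistic log-likelihood with a slope only (no intercept, for brevity)."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = beta * xi if yi == 1 else -beta * xi
        # Numerically stable log(sigmoid(eta)).
        if eta >= 0:
            total += -math.log1p(math.exp(-eta))
        else:
            total += eta - math.log1p(math.exp(eta))
    return total

def log_cauchy_prior(beta, scale=2.5):
    """Log density of a Cauchy prior centered at 0."""
    return -math.log(math.pi * scale * (1.0 + (beta / scale) ** 2))

# Hypothetical, completely separated data: heights in inches and a binary outcome.
x_raw = [60, 62, 64, 68, 70, 72]
y = [0, 0, 0, 1, 1, 1]
x = rescale(x_raw)

grid = [i * 0.01 for i in range(5001)]  # candidate slopes from 0 to 50
mle_on_grid = max(grid, key=lambda b: log_lik(b, x, y))
map_on_grid = max(grid, key=lambda b: log_lik(b, x, y) + log_cauchy_prior(b))

print("likelihood maximized at:", mle_on_grid, "(the edge of the grid)")
print("posterior mode with Cauchy(0, 2.5) prior:", map_on_grid)
```

The likelihood keeps increasing to the edge of whatever grid is chosen, while the posterior mode stays finite. Because of the rescaling, the same answer comes out whether height is recorded in inches or in meters.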

Daniel Ho and Kevin Quinn write:

We amass a new, large-scale dataset of newspaper editorials that allows us to calculate fine-grained measures of the political positions of newspaper editorial pages. Collecting and classifying over 1500 editorials adopted by 25 major US newspapers on 495 Supreme Court cases from 1994 to 2004, we apply an item response theoretic approach to place newspaper editorial boards on a substantively meaningful--and long validated--scale of political preferences. We validate the measures, show how they can be used to shed light on the permeability of the wall between news and editorial desks, and argue that the general strategy we employ has great potential for more widespread use.

Here's their key graph, which aligns the estimated ideological positions of major newspapers with recent Supreme Court justices:

news.png

They used Bayesian ideal point estimation. Their main substantive conclusion:

We recently uploaded to CRAN the multiple imputation package "mi", which we have been developing.

The aim of package mi is to make multiple imputation transparent and easy to use. Hence there are a few characteristics that we believe are valuable.
1. Graphical diagnostics of imputation models and convergence of the imputation process.
2. Use of bayesglm to treat the issue of separation.
3. Imputation model specification is made similar to how you would fit a regression model in R.
4. It automatically detects some problematic characteristics in the given dataset and alerts the user.

Please give it a try if you have any dataset that has missingness.

Also we are still in the process of improving the package, thus your input is most welcome.

One caution: if you are using a big dataset with a large amount of missingness across many variables, it may take some time for the process to converge. We admit, it is not the fastest imputation package on the market.

However, once we get the basics down, speeding things up is not so difficult. So please bear with it for now.

There are directions in which we plan to expand, such as imputation of time-series cross-sectional data, hierarchical data, etc. But for now these features are not part of the package.

Happy Holidays!!

Regarding my article on the boxer, the wrestler, and the coin flip, Steve Hsu writes:

A world class wrestler would easily demolish a top boxer in a no holds barred fight. This has been verified in many experiments (Inoki-Ali doesn't count)!

Steve has more details in this blog entry from 2007:

Ultimate fighting has grown from obscurity to unbelievable levels of popularity. It will soon surpass boxing as the premier combative sport. And it will soon be widely recognized that the baddest man on the planet is not a boxer, but an ultimate fighter. . . .

Unarmed single combat -- mano a mano, as they say -- has a long history, and is a subject which fascinates most men, both young and old. As a boy, I can remember serious discussions with my friends concerning which style was most effective -- karate or kung fu, boxing or wrestling, etc. How would Muhammed Ali fare against an Olympic wrestler or Judo player? What about Bruce Lee versus a Navy Seal? Of course, these discussions were completely theoretical, akin to asking whether Superman could beat Galactus in arm wrestling. There was scarcely any data available on which to base a conclusion.

However, thanks to the recent proliferation of "No Rules" or "No Holds Barred" (NHB) fighting tournaments, both in the U.S. and abroad, we finally have some interesting answers to this ancient question.

There are two aspects of a presidential election that can be predicted: the national popular vote and the relative positions of the states. The national popular vote can be forecasted months ahead of time given the economy and other predictors, for example using Doug Hibbs's model:
hibbs6.png

(As I wrote a few months ago, "the incumbent party sometimes loses but they never have gotten really slaughtered. In periods of low economic growth, the incumbent party can lose, but a 53-47 margin would be typical; you wouldn't expect the challenger to get much more than that.")

The relative positions of the states don't actually change much from election to election:

2004_2008_actual.png

You can do slightly better by using polls. As Matthew Yglesias puts it, "the large number of public polls on something like a presidential election makes the outcomes quite easy to forecast based on crude measures. What's more, even absent polling, Presidential election outcomes seem to be pretty predictable based on nothing more than macroeconomic variables."

Actually, even the February polls turn out to be pretty good--when combined with previous election results--to pin down the relative positions of the states.

Bayesian combination of state polls and election forecasts

Here's the revised version of my article with Kari Lock in which we forecast the election using Hibbs for the national popular vote, and a weighted average of last election (corrected for incumbency) and the February polls to get the relative positions of the states.

Lots of fun stuff there, including this prediction (based on February Clinton-McCain and Obama-McCain polls) of which states Clinton or Obama were expected to win in November:

kari.png

Pure non-Bayesians

| 3 Comments

Back when I used to teach at Berkeley, I used to run into non-Bayesian hardliners--the kind of people who would say no to prior information even if it were wrapped nicely, placed on a warm plate, and served with a delicious pile of crisp fries. I don't run into such people much anymore but then recently Matthew Yglesias linked to my mention of economy-based election forecasts. I read the comments, one of which says:

The problem with this sort of analysis is that we're working with very few data points. We've only ever had 50 some odd Presidential elections. 12 in the television era. 2 in the internet era. It's very very hard to generalize anything from a small dataset, and even correlations don't mean much - who's to say that the macroeconomic correlations are any more meaningful than the Redskins game?

I agree with the bit about the small sample size--as I often tell people--95% intervals on the national election outcome don't mean much considering we wouldn't even try to apply a single model to 20 successive elections--but . . . "who's to say that the macroeconomic correlations are any more meaningful than the Redskins game?" ???

The answer is: Everyone can agree that macroeconomic correlations are more meaningful than the Redskins game. For one thing, voters when surveyed say that the economy is important to their voting. They don't say that the Redskins game is important. Also most people don't care about the Redskins. Etc etc. In statistics we call this prior information. Anyway, I'm not trying to pick on a blog commenter--not everyone's an expert in every field of endeavor--it was just funny to see such a pure non-Bayesian in the wild, so to speak.

Thanh Nguyen writes:

Could you tell me what is the difference between "uncertainty" and "ignorance" in this theory [of belief functions]? Some authors define "ignorance" as the "uncommitted belief" which is assigned to the whole frame of discernment, others define it as the difference between Plausibility and Belief (Pl() - Bel()). Some authors define value assigned by Belief function for elements as "uncertainty".

I don't know. All I know about belief functions is in my article about the boxer, the wrestler, and the coin flip, which is actually a writeup of something I did 20 years ago. So no new thoughts, unfortunately.

The Future of Data Analysis

| 4 Comments

Introduction A few days ago I was trying to explain the benefits of the Bayesian approach to a physicist who didn't care about the religion of truth and inference but primarily about solving a particular detection problem in particle physics. The probabilistic approach is rather standard and requires little persuasion, but the Bayesian aspect is a level further than the probabilistic approach. So what is the benefit of the Bayesian approach? This posting will attempt to provide several reasons, from the most obvious to the least.

Frequentist Probability Probability is easily justified as a very elegant way of dealing with uncertainty in cases and variables. But probability is not observed directly but instead inferred - as are the parameters in contrast to observable predictors and outcomes. Frequentists state that the probability should be measured through the gold standard of an infinite sequence of observations, and question the benefit of the Bayesian approach while criticizing the fact that inferring a parameter Bayesianly can yield worse accuracy than their favored method of "estimators" - and a bad prior can totally mess up inference. So why not use estimators if their asymptotic properties are good and the methodology often simpler than Bayes?

Overfitting Dividing the number of positive outcomes by the number of all outcomes to estimate the probability of the positive outcome is a very simple estimator: it's easy to have enough data to calculate this. But most interesting questions are not as simple: it is not interesting to calculate the probability of getting cancer, and the probability of getting cancer given smoking also requires removing the obvious effect of age. All these additional variables make a model more complicated, and the number of parameters greater. Without care and attention the model can start hallucinating properties that aren't there. The problem is shown in the following picture:

why-bayes.png

If your modeling problem is in the green area, you can happily use estimators or maximum likelihood. If you're entering the yellow area and want to retain some generalization power, you need some sort of regularization, epitomized by L1 and L2 regularization, AIC, feature selection or support vector machines. So why shouldn't we just regularize?

Priors Priors are how a Bayesian would perform regularization. After seeing a large number of regression problems from medical domains, we can safely assign a prior distribution to the size of a regression coefficient, as we have done in our paper. But then, what is the advantage over regularization? A prior is just a distribution of what the parameters should be over a particular category of problems! Isn't this a nice way to formulate regularization?
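The link between priors and regularization can be checked numerically in the simplest case. The sketch below assumes a regression through the origin with known noise standard deviation `sigma` and a N(0, tau^2) prior on the slope; the data and function names are hypothetical. Under these assumptions the posterior mode should coincide with the L2-penalized (ridge) estimate with penalty sigma^2 / tau^2.

```python
def ridge_slope(x, y, lam):
    """L2-regularized least-squares slope for a regression through the origin."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

def log_posterior(beta, x, y, sigma, tau):
    """Normal likelihood with known sigma plus a N(0, tau^2) prior (up to a constant)."""
    resid_ss = sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))
    return -resid_ss / (2.0 * sigma**2) - beta**2 / (2.0 * tau**2)

def maximize(f, lo=-10.0, hi=10.0, iters=200):
    """Ternary search for the maximum of a concave function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

# Hypothetical data and settings.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.9]
sigma, tau = 1.0, 0.5

ols = ridge_slope(x, y, 0.0)
ridge = ridge_slope(x, y, sigma**2 / tau**2)
mode = maximize(lambda b: log_posterior(b, x, y, sigma, tau))

print("least squares:", ols)
print("ridge with penalty sigma^2/tau^2:", ridge)
print("posterior mode under the normal prior:", mode)
```

The penalty is then no longer a free tuning constant: it is fixed by how large we expect coefficients to be across the category of problems the prior describes.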

Model Uncertainty The crux of Bayes is in using probability to represent the uncertainty about the Platonic - the model, its parameters, the probability. The Bayesian approach truly starts paying a dividend when there is uncertainty in models and parameters, when we have insufficient data to accurately fit the model. Even if an estimator could rather accurately match the predictions obtained by a posterior, the variance in the posterior allows us to understand when the model can't be fit. To the best of my knowledge, no other methodology can automatically detect such problems.

Another problem that Andrew identified is that there might be situations where the data doesn't match the model very well - and even though there might be lots of data and a relatively simple model - it just doesn't fit, and the posterior will be vague.

Language of Modeling WinBUGS is an example of a higher-level modeling language. Programming languages have been celebrated for improving programmers' productivity: they do not require the programmer to think in terms of individual statements such as SET or JMP but in terms of functions, procedures, and loops. Similarly, with Bayesian models we no longer have to think in terms of derivatives and fitting algorithms, but in terms of parameters having distributions and tied together in models. The Gibbs sampler is a general-purpose fitter and proto-compiler. Of course, it's not nearly as efficient as a hand-written optimizer, but in the future tools like the Hierarchical Bayes Compiler (HBC) will create custom fitters given a higher-level specification of the model.

Summary The primary value of the Bayesian paradigm is its formal elegance which allows automation of key problems: probability takes care of unpredictability in phenomena, priors help prevent overfitting by providing outside experience (AI practitioners would refer to it as background knowledge), the use of model uncertainty helps determine the reliability of predictions, and applied Bayesians are beginning to develop model compilers!

Future The theory and practice of data analysis is currently all mixed up among a number of overlapping disciplines: (applied/mathematical/geo/medical/...)statistics, machine learning, data mining, (econo/psycho/bio)metrics, bioinformatics. All of them pursue the same problems with different but qualitatively similar tools, lacking the scale to build tools that would help them get to the next level. It is important to disentangle them. The future of data analysis should lie on these four fronts:


  1. reliable compilers and samplers that will work with large databases, provide reliable sampling (see BUGS, HBC - empowered by the new generation of programming languages such as Haskell)
  2. internet databases intended to manage background knowledge and related data sets, where the same variable and the same phenomenon appear in multiple tables, allowing priors to be based on more than a single data set. Research should be presented as raw data in a standardized form, not as reports and aggregates that prevent others from building on top of the finished work. Too many people are working on the same problems but not sharing the data because of an unsolved issue of the rights of the collectors of data who can only gain credit for publications (see FreeBase, Machine Learning Repository, Trendrr, Swivel, OECD.Stat)
  3. visualization & modeling environments that make it easier to clean and transform data, experiment with models, to present insights, to reduce the amount of time needed to turn data into a model that can be communicated. (see R Project, Processing, Gapminder)
  4. interpretable modeling is important to bring formal models closer to human intuition. It is still not clear what is the importance of a predictor for the outcome - the regression coefficient is close, but yet often confusing. With more powerful modeling frameworks, it is going to be possible to focus on this - not being worried about what one can fit, but instead with model choice, model selection, model language, visual language.

What do you think? What links did we miss?

Drew Conway pointed me to this:

The article entitled, "Bayesian Analysis for Intelligence: Some Focus on the Middle East," was written by Nicholas Schweitzer . . . JIOX provides no information on the essay's origins, but . . . it appears to be a declassified CIA piece written sometime in the 1970's (note mentions of Presidents Asad and Sadat, and Prime Minister Rabin on page one). . . . Schweitzer concludes that in general the Bayesian technique was able to more quickly predict "non-events" (i.e., when no hostilities would occur among Middle Eastern nations) than analysts using only their expertise and intuitions. The research design included no baseline for comparison to an actual event; therefore, we are left wondering if the Bayesian technique described here would be able to predict when something will actually happen. Despite this obvious shortcoming, it is very encouraging to observe the level of sophistication being implemented by CIA analysts some thirty-odd years ago.

I actually participated a couple years ago in an (unclassified) meeting on Bayesian analysis for military intelligence, so I know that these ideas are still out there. My only comment, regarding the Bayesian issue per se, is that the key to good statistical methods is typically making use of relevant information; non-Bayesian methods can also be effective if they can be adapted to use the info that goes into a Bayesian procedure.

Christian Robert writes:

Subject: launch of the Foundation's 2009 postdoc campaign

The Fondation Sciences Mathématiques de Paris is offering fifteen postdoctoral positions in mathematics and theoretical computer science. The positions last one year, possibly renewable, and are to be filled starting 1 October 2009 in the research laboratories affiliated with the Foundation.

The call for applications for the postdoctoral program will be open from 31 October to 17 December 2008, on the Foundation's website and in English.

He says they pay well, too! And if you do statistics, maybe you can work with me next year...

Carlin and Louis, third edition

| 3 Comments

Brad Carlin and Tom Louis recently came out with a third edition of their book, originally called Bayes and Empirical Bayes Methods for Data Analysis with a plain green cover, now called Bayesian Methods for Data Analysis with a red cover with graphs on it. In title and appearance they are thus converging to our book. They even use the "Bugs code" and "R code" marginal notation that is in my book with Jennifer (see Carlin and Louis, page 178, for example).

What's fun, though, is how different their book is from ours. I highly recommend that anyone interested in Bayesian statistics buy their book as well as Bayesian Data Analysis. This review focuses on the features of Brad's and Tom's book that differ from ours.

Let's get conjugate

| 2 Comments

David Shor writes:

I'm working on a projection system on election night, and came across a case where I have a binomial distribution with an unknown number of trials.

Is there a good conjugate prior in such a situation?

My reply: There are some articles on this by Adrian Raftery in the late 1980s, you can find references in Bayesian Data Analysis, including a homework assignment in chapter 3, I believe.
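To give a rough sense of what conjugacy can look like in this problem, here is a minimal sketch of one special case. It assumes, purely for illustration, that the success probability theta is known and that the number of trials N gets a Poisson(mu) prior; this is not necessarily the setup in Raftery's articles. Under those assumptions N - y given y is again Poisson, with mean mu(1 - theta). The function names and the numbers (y = 30, theta = 0.4, mu = 80) are made up for the example.

```python
from math import exp, lgamma, log

def log_poisson(k, mu):
    """Log of the Poisson(mu) probability of k."""
    return k * log(mu) - mu - lgamma(k + 1)

def log_binom(y, n, theta):
    """Log of the Binomial(n, theta) probability of y successes."""
    return (lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
            + y * log(theta) + (n - y) * log(1 - theta))

def posterior_n(y, theta, mu, n_max):
    """Grid posterior for N given y, with theta known and N ~ Poisson(mu).

    The grid is truncated at n_max (an assumption; pick it well above
    any plausible N).
    """
    grid = range(y, n_max + 1)
    log_post = [log_poisson(n, mu) + log_binom(y, n, theta) for n in grid]
    top = max(log_post)                      # subtract the max for stability
    weights = [exp(lp - top) for lp in log_post]
    total = sum(weights)
    return {n: w / total for n, w in zip(grid, weights)}

# Hypothetical numbers: 30 successes observed, theta = 0.4, prior mean 80.
y, theta, mu = 30, 0.4, 80.0
post = posterior_n(y, theta, mu, n_max=400)
post_mean = sum(n * pr for n, pr in post.items())

# Closed form under these assumptions: N - y ~ Poisson(mu * (1 - theta)),
# so the posterior mean of N should be y + mu * (1 - theta).
closed_form_mean = y + mu * (1 - theta)
```

When theta is also unknown the posterior no longer has this simple form, which is presumably why the problem has its own literature.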

Greenspan said (on the topic of the present financial crisis):

"The whole intellectual edifice, however, collapsed in the summer of last year because the data inputted into the risk management models generally covered only the past two decades — a period of euphoria."

I hate BIC blah blah blah

| 3 Comments

It's all in chapter 6 of Bayesian Data Analysis. Anyway, Sam Gershman wrote to me:

ANOVA and the mixed-model muddle

| 2 Comments

Rick DeShon writes,

As I read through your discussion paper on the analysis of variance published in the Annals of Statistics in 2005, I became a bit confused about the connections between your notion of parameter batches and prior work on the topic of fixed and random effects. Specifically, I wonder how your approach connects to Nelder's "great mixed model muddle?"

My talks in Toronto

| 1 Comment

I recently became aware of two papers by David van Dyk on a new approach to Gibbs sampling using incompatible conditional distributions. This seems similar to the parameter expansion or redundant parameter idea developed by C. Liu, J. Liu, Meng, Rubin, van Dyk, and others, but perhaps a bit more generalizable and thus usable in routine problems.

Here's the theoretical paper (with Taeyoung Park).

And here's the more applied paper (which has a logistic regression example), with Hosung Kang.

This looks great, although I'm still not sure exactly how to apply this to our problems. Maybe we're getting closer, though...

So-called Bayesian methods

| 2 Comments

Seth points me to these papers:

John P. A. Ioannidis, Effect of Formal Statistical Significance on the Credibility of Observational Associations, Am. J. Epidemiol. 2008 168: 374-383.

Hormuzd A. Katki, Invited Commentary: Evidence-based Evaluation of p Values and Bayes Factors. Am. J. Epidemiol. 2008 168: 384-388.

John P. A. Ioannidis, The Author Responds to "Evaluating p Values and Bayes Factors", Am. J. Epidemiol. 2008 168: 389-390.

I do not, do not, do not have the energy now to comment on these. Let me just say that what is labeled in the above articles as "Bayesian" is not the only way to do Bayesian statistics. I refer you to Bayesian Data Analysis for exposition of what I consider the more reasonable Bayesian approach, which is based on modeling rather than hypothesis testing and never involves computing the posterior probability that the null hypothesis is true.

I can't stop people from doing these other things and I wouldn't even try. But I would like them to be aware of this other, more direct approach. This paper may also help.

Bayes, Bayesians

| 3 Comments

I can't remember who said this first, and I can't remember if I've already put this on the blog, but the following definition may be helpful:

Every statistician uses Bayesian inference when it is appropriate (that is, when there is a clear probability model for the sampling of parameters). A Bayesian statistician is someone who will use Bayesian inference for all problems, even when it is inappropriate.

I am a Bayesian statistician myself (for the usual reason that, even when inappropriate, Bayesian methods seem to work well).

(The above is perhaps inspired by the saying that any fool can convict a guilty man; what distinguishes a great prosecutor is the ability to convict an innocent man.)

Jeff pointed me to this paper by Brandon "not Larry" Bartels on using multilevel modeling for time series cross-sectional data. I agree with Bartels's recommendations, which are:

- Use a multilevel model to allow intercepts to vary by groups. This is more reliable than estimating intercepts by least squares or not allowing the intercepts to vary at all.
- Also allow slopes to vary. (Bartels doesn't emphasize this so strongly but I think this is important advice also.)
- Include as group-level predictors the group-level averages of important individual-level predictors. This will in many settings capture some of the otherwise unexplained group-level variation, as Joe Bafumi and I discuss.

Bartels also recommends representing individual-level predictors by their deviation from group averages. This is ok but I don't think it's necessary. It depends on the context. For example, if you have a predictor that is 1 if you're African American and 0 otherwise, I wouldn't want to subtract that from its state average. In that case you'd be better off including the individual predictor and state % African American as two predictors in the model. In other settings, Bartels's recommendation to center the predictor for each group makes more sense. Either way, this doesn't affect his main recommendation to fit a multilevel model that also includes the group averages of important individual-level predictors.
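The two ways of coding a predictor discussed above can be sketched in a few lines. This is only an illustration of how the columns are built, not a model fit; the function names and the toy data (two states, six people) are hypothetical.

```python
from collections import defaultdict

def group_means(x, groups):
    """Average of x within each group label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for xi, g in zip(x, groups):
        sums[g] += xi
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sums}

def group_level_predictors(x, groups):
    """Return (x_bar, x_centered) aligned with the individual rows.

    x_bar      : group average of x, repeated for every member of the group
    x_centered : x minus its group average (the centered coding)
    """
    means = group_means(x, groups)
    x_bar = [means[g] for g in groups]
    x_centered = [xi - mi for xi, mi in zip(x, x_bar)]
    return x_bar, x_centered

# Toy data: a 0/1 individual-level indicator and a state label per person.
x = [1, 0, 0, 1, 1, 1]
state = ["NY", "NY", "NY", "OH", "OH", "OH"]
x_bar, x_centered = group_level_predictors(x, state)
```

For the 0/1 indicator case, the model would use `x` and `x_bar` as two predictors; in settings where centering makes more sense, it would use `x_centered` and `x_bar`.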

Individual and group-level predictors

Finally, I recommend my 2006 Technometrics paper, "Multilevel (hierarchical) modeling: what it can and cannot do," which begins:

Multilevel (hierarchical) modeling is a generalization of linear and generalized linear modeling in which regression coefficients are themselves given a model, whose parameters are also estimated from data. We illustrate the strengths and limitations of multilevel modeling through an example of the prediction of home radon levels in U.S. counties. The multilevel model is highly effective for predictions at both levels of the model, but could easily be misinterpreted for causal inference.

In particular, see the discussion in Section 2.4 of my paper on the interpretation of a group-level predictor. You have to be careful about calling such coefficients "effects" or interpreting them causally.

My blog discussion with Eyal Shahar (see comments #3 and onward here) reminded me of a persistent challenge I face when talking with outsiders about Bayesian statistics.

Dipankar Bandyopadhyay writes:

I am currently running an autologistic regression model where I have some fixed effects and also spatial (autologistic) terms. Is there any recommendation from you on the appropriate choice of prior on the variance when I put a normal prior on the regression coefficients? I mean, do you recommend a folded-t, or a half-Cauchy, or a uniform over the traditional inverse gamma, and in such a case, where can I get the WinBUGS codes to put folded-t, or half-Cauchy/half-normal priors?
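For readers unfamiliar with the distributions named in the question, here is a minimal sketch of what half-Cauchy and half-normal draws for a scale parameter look like. This is plain Python, not WinBUGS code, and the function names and the scale value 2.5 are assumptions chosen for illustration.

```python
import random
from math import pi, tan

def half_cauchy_quantile(u, scale):
    """Inverse CDF of a half-Cauchy with the given scale, for 0 <= u < 1.

    The CDF is (2 / pi) * atan(x / scale), so the median equals the scale.
    """
    return scale * tan(pi * u / 2)

def draw_half_cauchy(scale, rng):
    """One half-Cauchy draw by inverting the CDF at a uniform variate."""
    return half_cauchy_quantile(rng.random(), scale)

def draw_half_normal(scale, rng):
    """One half-normal draw: the absolute value of a normal(0, scale) draw."""
    return abs(rng.gauss(0.0, scale))

rng = random.Random(1)  # seeded only so the example is repeatable
cauchy_draws = [draw_half_cauchy(2.5, rng) for _ in range(1000)]
normal_draws = [draw_half_normal(2.5, rng) for _ in range(1000)]
```

The half-Cauchy puts much more mass on very large values than the half-normal with the same scale, which is the practical difference between the two as priors on a standard deviation.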

Aleks points me to this paper by Jerry Friedman on non-Bayesian regularization methods. I'd also recommend our Bayesian approach (see this Annals of Applied Statistics paper). Once you're going to assume a probability model for the data (a likelihood), it's a pretty small step to include prior information as well. But read Friedman's paper in any case. He focuses more on computational issues than we do. There are really two parallel literatures.

Nick Firoozye writes:

Hierarchical models of variance

| 4 Comments

Marcus Brubaker writes:

I am currently working on a problem in computational biology using Bayesian inference and I've come to a question for which I hope you have an answer. In this problem there are a large number of noisy 2D images of a molecule, from which we wish to infer the 3D structure. Much of the modeling is straightforward but I have hit a roadblock. Specifically, the noise parameters for these images.

More Bayes rants

| 7 Comments

John Skilling wrote this response to my discussion and rejoinder on objections to Bayesian statistics.

John's claim is that Bayesian inference is not only a good idea but also is necessary. He justifies Bayesian inference using the logical reasoning of Richard Cox (1946), which I believe is equivalent to the von Neumann and Morgenstern (1948) derivation that those of us in the social sciences are familiar with.

I have no objection to the Cox/Neumann/Morgenstern argument (which is also associated with Keynes and others). However, all our models have flaws, so the big step that all of us have to make when recommending Bayesian methods is to believe that they work well even with the imperfect models that we use in practice. Models that in social science are surely far more imperfect than those used in physics (which may be one reason that physicist Skilling is so accepting of Bayes). Skilling cites Jaynes, whose writings have been a major influence in my own conception of Bayesian statistics. (The first scientific conference I ever attended, I think, was Skilling's Jaynes-inspired maximum entropy conference in 1988.) In particular, I got from Jaynes the idea that one should take any model, even a simple model, seriously enough to try to understand where it doesn't fit reality and how it should be improved.

In conclusion: I think Skilling's normative argument is powerful, but not definitive on its own, because we want statistical methods to work well even when models have serious flaws. That's why I take a more pluralistic approach. (But, as I explain in my rejoinder (linked above), I am not at all convinced by arguments about classical unbiasedness or coverage.)

To put it another way, the classical ideas of sufficiency, ancillarity, confidence coverage, hypothesis testing, etc etc: I'm happy to trash all these. But there is another set of principles out there, based on external validation (sometimes approximated by cross-validation), that seems valid to me, and does not necessarily rely on Bayes.

P.S. See also Larry Wasserman's further comments.

Learning structural forms

| 1 Comment

Josh Tenenbaum sent me a link to a paper, The discovery of structural form, by C. Kemp and himself. Also commentary by Keith Holyoak and some supporting information. Code and datasets are here.

For my own thoughts on this work, see here. Josh's talk at Columbia made me realize that all these years I'd been thinking of life as part of a "great chain of being" without realizing it.

SIAM Review asked me to review Jeff Gill's new book (Bayesian Methods: A Social and Behavioral Sciences Approach, second edition) but they said they'd like a general review essay that would be of interest to their readers, not a mere Siskel-and-Ebert on the book itself. Below is my first draft. Any comments would be appreciated.

The Battle of the Bayes

| 8 Comments

Larry Wasserman read my response to his comments on my article on Bayesian statistics and had some responses of his own. I'll post Larry's thoughts and then my response to his response etc. Here's Larry:

1. You correctly point out that if there are systematic errors, then confidence intervals will not have their advertised coverage. But this has nothing to do with the point under discussion: Bayes versus frequentist. All methods fail if there are systematic errors. That's important but it is beside the point I was making. Assume the model is correct and there is no systematic error. Then what I said is correct. The frequentist method will cover as advertised (by definition) and the Bayesian method, in general, will not. More importantly, and this is what I was really getting at, the hardcore subjectivist Bayesian will say that coverage is irrelevant. The people who think that coverage is completely irrelevant are being scientifically irresponsible in my opinion.

2. "I can dispose of the first two with a reference to Agresti and Coull (1998)." Frequentist methods have correct coverage or they aren't frequentist methods. That's the definition. And when I said "Bayesian methods don't" I meant, there is nothing in Bayesian inference that automatically guarantees, in general, correct coverage. I did not mean that there don't exist Bayesian methods with correct coverage. The fact that (y+1)/(n+2) has good frequentist properties is fine but in this case we're thinking of it as a frequentist estimator. The fact that it happens to be Bayes for some prior isn't what I mean by a Bayesian procedure.

3. You're right that scientists want it both ways: they want coverage AND they would like to interpret a confidence interval as a posterior probability. So what. I'd like to measure an electron's position and its velocity. But I can't. Physics tells us you can measure one or the other but not both. Tough luck for me. If you explain to a scientist that they can have the comfort of a Bayesian interpretation OR coverage but not both, I'll bet most would pick coverage.

4. Estimating the upper .01 quantile. You say there is no frequentist method for this. First of all, there is (as I discuss below). Second, here is a challenge. Find me one example where:

(i) there is no frequentist method to solve a problem

(ii) there is a Bayesian method

(iii) we can trust the Bayesian method.

I claim you cannot do this. Suppose you do find a Bayes procedure. If it has coverage then you have found a valid frequentist method so (i) is not true. If it does not have coverage then you have failed (iii).

But let's look closer at estimating theta = .01 quantile. We can always find the order statistics X_r and X_s so that:

P(X_r < theta < X_s) >= .95

This is an exact, nonparametric statement. So [X_r,X_s] is a valid 95 percent confidence interval. When you say there is no frequentist method, I suspect you mean that this interval is going to be very wide, unless the sample size n is very large. My reply is: great! There is very little information in the data about theta unless n is large. So the interval should be wide. This is a correct representation of the uncertainty. The Bayesian interval will be narrower but this reflects the prior not the data. Of course, if the prior is reliable that's fine. But many Bayesians would simply crank out an interval, see that it is narrower than the frequentist interval and declare victory. They're deluding themselves because they're sweeping the uncertainty under the carpet. I'd rather they use the frequentist interval so that they are aware of the difficulty of the problem.

The same applies to calibration problems etc where one gets huge intervals (sometimes even the whole real line). This is a virtue not a problem. The Bayesian answer hides the problem.
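The order-statistic interval in Larry's point 4 can be sketched as follows. The sketch assumes independent draws from a continuous distribution (no ties), so that the number of observations below the p-quantile theta is Binomial(n, p) and P(X_r < theta < X_s) is the probability that this count lies between r and s - 1. The function names are hypothetical, and picking the narrowest index range is just one possible rule for choosing r and s.

```python
from math import exp, lgamma, log

def binom_pmf(n, p, k):
    """Binomial(n, p) probability of k, computed on the log scale."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def coverage(n, p, r, s):
    """P(X_(r) < theta < X_(s)) for the p-quantile theta, with 1 <= r < s <= n."""
    return sum(binom_pmf(n, p, k) for k in range(r, s))

def order_stat_interval(n, p, level=0.95):
    """Find order-statistic indices (r, s) with coverage >= level.

    Returns (r, s, coverage) for the pair with the smallest s - r,
    or None if no pair of order statistics reaches the level.
    """
    pmf = [binom_pmf(n, p, k) for k in range(n + 1)]
    best = None
    for r in range(1, n):
        total = 0.0
        for s in range(r + 1, n + 1):
            total += pmf[s - 1]          # total is now P(r <= count <= s - 1)
            if total >= level:
                if best is None or s - r < best[1] - best[0]:
                    best = (r, s, total)
                break                    # larger s only widens the interval
    return best

median_interval = order_stat_interval(10, 0.5)    # median, n = 10
tail_interval = order_stat_interval(100, 0.01)    # .01 quantile, n = 100
```

With n = 100 and p = .01 the search finds no pair at all, because even the widest choice, from the smallest to the largest observation, has coverage of only about 0.63. At that sample size a 95 percent statement needs an endpoint that runs off to infinity, which is the "very wide unless n is very large" behavior described above.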

And here's my reply, which I'll divide into two parts: Points of Agreement, and Points of Disagreement.

Whassup with Bart?

| 8 Comments

I've seen Jennifer Hill and Ed George give great talks on Bayesian additive regression trees. It looked awesome. So why haven't these papers appeared anywhere? All I can find are preprints.

On April Fool's Day I posted my article, "Why I don't like Bayesian statistics." At the time, some commenters asked for my responses to the criticisms that I'd raised.

My original article will appear, in slightly altered form, in the journal Bayesian Analysis, with discussion and rejoinder. Here's the article, which begins as follows:

Bayesian inference is one of the more controversial approaches to statistics. The fundamental objections to Bayesian methods are twofold: on one hand, Bayesian methods are presented as an automatic inference engine, and this raises suspicion in anyone with applied experience. The second objection to Bayes comes from the opposite direction and addresses the subjective strand of Bayesian inference. This article presents a series of objections to Bayesian inference, written in the voice of a hypothetical anti-Bayesian statistician. The article is intended to elicit elaborations and extensions of these and other arguments from non-Bayesians and responses from Bayesians who might have different perspectives on these issues.

And here's the rejoinder, which begins:

In the main article I presented a series of objections to Bayesian inference, written in the voice of a hypothetical anti-Bayesian statistician. Here I respond to these objections along with some other comments made by four discussants.

You'll have to wait until the journal issue comes out to read the discussions, by Jose Bernardo, Joe Kadane, Larry Wasserman, and Stephen Senn. And thanks to Bayesian Analysis editor Brad Carlin for putting this all together.

Our own Kenny Shirley spoke at the Bayes meeting on this stuff. As is often the case in applied work, what's interesting here isn't so much the model--which is enough to get the job done--but how it fits into the larger policy goals (which in this case involve quantifying uncertainty, a natural fit for Bayesian methods).

Stanley Chin writes:

I had the usual stats training in grad school, and after some years as a practicing statistician and economist find myself increasingly approaching problems from a Bayesian perspective -- never more so than in a problem that was brought to me as an external consultant. My question is brief, the set up is a little long -- the question is in the subject line, can you recommend any reading in Bayesian approaches to quality control sampling?

Rey De Castro writes:

I have a longitudinal data set that needs imputation, but the problem doesn't seem to resemble a typical imputation situation. So I'm casting about for a reasonably defensible approach that I can implement without tremendous custom-programming effort. My question concerns Bayesian approaches to imputation.

The Situation: I have longitudinal data for each of a group of schoolchildren. Each observation in the series is a multilevel class indicator of several canonical locations (i.e., indoor-home, indoor-school, outdoors, commuting) where the child reported being present during a particular 15-minute interval. Essentially, it's a series giving each child's location over time at 15-minute intervals. There are ~100 children, and each child's series is very long: ~2000 observations.

Gal Elidan writes:

I am starting as a faculty member next year in the statistics department at the Hebrew University, Israel. As it may be of interest to both the computer science and statistics communities, I plan to give a course next year on Bayesian data analysis. My (still in its early stages) plan is to give a course based on your book along with some relevant topics/applications that have seen light in the computer science community in recent years (e.g. the Chinese restaurant process). I would greatly appreciate it if you could share with me any material that you have used in the past in teaching this course. Since I have little experience estimating work load, I could use help in knowing how many problems you assigned each time.

My reply:

David Dunson forwarded me this article for a book that is coming out on Nonparametric Bayes in Practice. I think David's work is great but I keep encountering it in separate research articles and never in a single place which explains when to use each sort of model. I'll have to read the article in detail, but it seems like a good start. I suggested to David that he write a book but he pointed out that nobody reads books. But do people read articles in handbooks? I don't know. I guess what's really needed is a convenient software implementation for all of it. In the meantime, this article seems like the place to go.

A more formal rant

| 8 Comments

I've written up my rant more seriously here. Here's the new abstract:

Bayesian inference is one of the more controversial approaches to statistics. The fundamental objections to Bayesian methods are twofold: on one hand, Bayesian methods are presented as an automatic inference engine, and this raises suspicion in anyone with applied experience. The second objection to Bayes comes from the opposite direction and addresses the subjective strand of Bayesian inference. This article presents a series of objections to Bayesian inference, written in the voice of a hypothetical anti-Bayesian statistician. The article is intended to elicit elaborations and extensions of these and other arguments from non-Bayesians and responses from Bayesians who might have different perspectives on these issues.

And here's how the article concludes:

In the decades since this work and Box and Tiao’s and Berger’s definitive books on Bayesian inference and decision theory, the debates have shifted from theory toward practice. But many of the fundamental disputes remain and are worth airing on occasion, to see the extent to which modern developments in Bayesian and non-Bayesian methods alike can inform the discussion.

In answer to many of the earlier commenters: yes, I have replies for the criticisms. But I didn't want to put them here because I worried that they would inhibit the flow of discussion that I'd like to see come from this article. I will post my replies at some point (at which time I'm sure they'll be a disappointment, after all the hype).

Interesting spam

| 3 Comments

I usually don't like spam, but this message I got the other day from Ed Tranham was pretty good:

Occam

| 8 Comments

Regarding my anti-Occam stance ("I don't count 'Occam's Razor,' or 'Ockham's Razor,' or whatever, as a justification. You gotta do better than digging up a 700-year-old quote."), David Gillman writes:

I was at your talk at MIT yesterday, and something bothered me until I realized just now that your reason for rejecting Occam's Razor was wrong, from a Bayesian point of view. A priori what's the probability that something somebody says will be remembered for 800 years? I figure it's machine learning people who want your models to be simple, but Occam's answer to that would be that you aren't a machine.

He also says,

If somebody quotes ancient wisdom and you disagree with them, Occam's Razor says don't blame the ancient wisdom, because the person is probably misappropriating it.

Good point.

My talk at MIT on Monday

| 7 Comments

I'm speaking Monday 14 April at 4:30 on weakly informative prior distributions and models with interactions. I'll try to make things accessible to a general audience of people who might not know much about statistics in general or Bayesian methods in particular.

Michail Fragkias writes,
