Recently in Bayesian Statistics Category

Kind of Bayesian


Astrophysicist Andrew Jaffe pointed me to this discussion of my philosophy of statistics (which is, in turn, my rational reconstruction of the statistical practice of Bayesians such as Rubin and Jaynes). Jaffe's summary is fair enough and I disagree on only a few points:

1. Jaffe writes:

Subjective probability, at least the way it is actually used by practicing scientists, is a sort of "as-if" subjectivity -- how would an agent reason if her beliefs were reflected in a certain set of probability distributions? This is why when I discuss probability I try to make the pedantic point that all probabilities are conditional, at least on some background prior information or context.

I agree, and my problem with the usual procedures used for Bayesian model comparison and Bayesian model averaging is not that these approaches are subjective but that the particular models being considered don't make sense. I'm thinking of the sorts of models that say the truth is either A or B or C. As discussed in chapter 6 of BDA, I prefer continuous model expansion to discrete model averaging.

Either way, we're doing Bayesian inference conditional on a model; I'd just rather do it on a model that I like. There is some relevant statistical analysis here, I think, about how these different sorts of models perform under different real-world situations.

2. Jaffe writes that I view my philosophy as "Popperian rather than Kuhnian." That's not quite right. In my paper with Shalizi, we speak of our philosophy as containing elements of Popper, Kuhn, and Lakatos. In particular, we can make a Kuhnian identification of Bayesian inference within a model as "normal science" and model checking and replacement as "scientific revolution." (From a Lakatosian perspective, I identify various responses to model checks as different forms of operations in a scientific research programme, ranging from exception-handling, through modification of the protective belt of auxiliary hypotheses, to full replacement of a model.)

3. Jaffe writes that I "make a rather strange leap: deciding amongst any discrete set of parameters falls into the category of model comparison." This reveals that I wasn't so clear in stating my position. I'm not saying that a Bayesian such as myself shouldn't or wouldn't apply Bayesian inference to a discrete-parameter model. What I was saying is that my philosophy isn't complete. Direct Bayesian inference is fine with some discrete-parameter models (for example, a dense discrete grid approximating a normal prior distribution) but not for others (for example, discrete models for variable selection, where any given coefficient is either "in" (that is, estimated by least squares with a flat prior) or "out" (set to be exactly zero)). My incoherence is that I don't really have a clear rule of when it's OK to do Bayesian model averaging and when it's not.

As noted in my recent article, I don't think this incoherence is fatal--all other statistical frameworks I know of have incoherence issues--but it's interesting.

Andy McKenzie writes:

In their March 9 "counterpoint" in nature biotech to the prospect that we should try to integrate more sources of data in clinical practice (see "point" arguing for this), Isaac Kohane and David Margulies claim that,

"Finally, how much better is our new knowledge than older knowledge? When is the incremental benefit of a genomic variant(s) or gene expression profile relative to a family history or classic histopathology insufficient and when does it add rather than subtract variance?"

Perhaps I am mistaken (thus this email), but it seems that this claim runs counter to the definition of conditional probability. That is, if you have a hierarchical model, and the family history / classical histopathology already suggests a parameter estimate with some variance, how could the new genomic info possibly increase the variance of that parameter estimate? Surely the question is how much variance the new genomic info reduces and whether it therefore justifies the cost. Perhaps the authors mean the word "relative" as "replacing," but then I don't see why they use the word "incremental." So color me confused, and I'd love it if you could help me clear this up.

My reply:

We consider this in chapter 2 of Bayesian Data Analysis, I think in a couple of the homework problems. The short answer is that, in expectation, the posterior variance decreases as you get more information, but, depending on the model, in particular cases the variance can increase. For some models such as the normal and binomial, the posterior variance can only decrease. But consider the t model with low degrees of freedom (which can be interpreted as a mixture of normals with common mean and different variances). If you observe an extreme value, that's evidence that the variance is high, and indeed your posterior variance can go up.
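Here's a little numerical check of that last point (a toy setup of my own, not from BDA: prior theta ~ N(0, 5^2), a single observation with unit scale, and the likelihood either normal or t with 4 degrees of freedom):

theta <- seq(-50, 50, by = 0.01)   # grid for numerical integration
prior <- dnorm(theta, 0, 5)

post_var <- function(y, lik) {
  post <- prior * lik(y, theta)
  post <- post / sum(post)
  m <- sum(theta * post)
  sum((theta - m)^2 * post)
}

lik_norm <- function(y, theta) dnorm(y - theta, 0, 1)
lik_t4   <- function(y, theta) dt(y - theta, df = 4)

for (y in c(2, 10, 30)) {
  cat("y =", y,
      " posterior var (normal):", round(post_var(y, lik_norm), 2),
      " posterior var (t_4):", round(post_var(y, lik_t4), 2), "\n")
}
# Under the normal likelihood the posterior variance is the same for every y;
# under the t_4 likelihood it grows as y moves into the tail, because an
# extreme observation is evidence that the variance is high.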

That said, the quote above might be addressing a different issue, that of overfitting. But I always say that overfitting is not a problem of too much information, it's a problem of a model that's not set up to handle a given level of information.

Paul Pudaite writes in response to my discussion with Bartels regarding effect sizes and measurement error models:

You [Gelman] wrote: "I actually think there will be some (non-Gaussian) models for which, as y gets larger, E(x|y) can actually go back toward zero."

I [Pudaite] encountered this phenomenon some time in the '90s. See this graph which shows the conditional expectation of X given Z, when Z = X + Y and the probability density functions of X and Y are, respectively, exp(-x^2) and 1/(y^2+1) (times appropriate constants). As the magnitude of Z increases, E[X|Z] shrinks to zero.

[Graph: E[X|Z] as a function of Z, showing E[X|Z] shrinking toward zero as the magnitude of Z increases.]

I wasn't sure it was worth the effort to try to publish a two paragraph paper.

I suspect that this is true whenever the tail of one distribution is 'sufficiently heavy' with respect to the tail of the other. Hmm, I suppose there might be enough substance in a paper that attempted to characterize this outcome for, say, unimodal symmetric distributions.

Maybe someone can do this? I think it's an important problem. Perhaps some relevance to the Jeffreys-Lindley paradox also.
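Pudaite's setup is easy to redo numerically. Here's a quick sketch (my code, not his): X has density proportional to exp(-x^2), Y is standard Cauchy, and E[X|Z=z] is a ratio of two one-dimensional integrals, approximated on a grid:

x <- seq(-10, 10, by = 0.001)
fx <- exp(-x^2)                      # unnormalized density of X
cond_mean <- function(z) {
  w <- fx * dcauchy(z - x)           # proportional to the density of X given Z = z
  sum(x * w) / sum(w)
}

z <- c(0.5, 1, 2, 5, 10, 50, 100)
round(sapply(z, cond_mean), 4)
# E[X | Z = z] rises for moderate z and then falls back toward zero as z grows:
# an extreme z is more plausibly a large draw from the heavy-tailed Y than
# evidence of a large X.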

Static sensitivity analysis


This is one of my favorite ideas. I used it in an application but have never formally studied it or written it up as a general method.

Sensitivity analysis is when you check how inferences change when you fit several different models or when you vary inputs within a model. Sensitivity analysis is often recommended but is typically difficult to do, what with the hassle of carrying around all these different estimates. In Bayesian inference, sensitivity analysis is associated with varying the prior distribution, which irritates me: why not consider sensitivity to the likelihood, as that's typically just as arbitrary as the prior while having a much larger effect on the inferences?

So we came up with static sensitivity analysis, which is a way to assess sensitivity to assumptions while fitting only one model. The idea is that Bayesian posterior simulation gives you a range of parameter values, and from these you can learn about sensitivity directly.

The published example comes from my paper with Frederic Bois and Don Maszle on the toxicokinetics of perchloroethylene (PERC). One of the products of the analysis was estimation of the percent of PERC metabolized at high and low doses. We fit a multilevel model to data from six experimental subjects, so we obtained inference for the percent metabolized at each dose for each person and the distribution of these percents over the general population.

Here's the static sensitivity analysis:

[Figure: static sensitivity analysis for subject A--posterior draws of percent metabolized at low and high dose, plotted against four model parameters.]

Each plot shows inferences for two quantities of interest--percent metabolized at each of the two doses--with the dots representing different draws from the fitted posterior distribution. (The percent metabolized is lower at high doses (an effect of saturation of the metabolic process in the liver), so in this case it's possible to "cheat" and display two summaries on each plot.) The four graphs show percent metabolized as a function of four different variables in the model. All these graphs represent inference for subject A, one of the six people in the experiment. (It would be possible to look at the other five subjects, but the set of graphs here gives the general idea.)

To understand the static sensitivity analysis, consider the upper-left graph. The simulations reveal some posterior uncertainty about the percent metabolized (it is estimated to be between about 10-40% at low dose and 0.5-2% at high dose) and also on the fat-blood partition coefficient displayed on the x-axis (it is estimated to be somewhere between 65 and 110). More to the point, the fat-blood partition coefficient influences the inference for metabolism at low dose but not at high dose. This result can be directly interpreted as sensitivity to the prior distribution for this parameter: if you shift the prior to the left or right, you will shift the inferences up or down for percent metabolized at low dose, but not at high dose.

Now look at the lower-left graph. The scaling coefficient strongly influences the percent metabolized at high dose but has essentially no effect on the low-dose rate.

Suppose that as a decision-maker you are primarily interested in the effects of low-dose exposure. Then you'll want to get good information about the fat-blood partition coefficient (if possible) but it's not so important to get more precise on the scaling coefficient. You can go similarly through the other graphs.
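For what it's worth, here's a generic sketch of how such plots can be made from any set of posterior simulations (my own illustration with fake data, not the PERC code): sims holds draws of the parameters, qoi holds draws of the quantity of interest.

static_sensitivity <- function(sims, qoi, qoi_label = "quantity of interest") {
  par(mfrow = c(2, 2))
  for (p in names(sims)) {
    plot(sims[[p]], qoi, pch = 20, cex = 0.5,
         xlab = p, ylab = qoi_label,
         main = paste("sensitivity to", p))
  }
}

# fake posterior draws, just to show the mechanics
sims <- data.frame(a = rnorm(1000), b = rnorm(1000),
                   c = rnorm(1000), d = rnorm(1000))
qoi  <- plogis(0.5 * sims$a + 0.1 * sims$b + rnorm(1000, 0, 0.2))
static_sensitivity(sims, qoi, "percent metabolized")
# A strong trend in a panel means the inference for the quantity of interest is
# sensitive to that parameter (and hence to its prior); a flat cloud means it is not.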

I think this has potential as a general method, but I've never studied it or written it up as such. It's a fun problem: it has applied importance but also links to a huge theoretical literature on sensitivity analysis.

A few days ago I discussed the evaluation of somewhat-plausible claims that are somewhat supported by theory and somewhat supported by statistical evidence. One point I raised was that an implausibly large estimate of effect size can be cause for concern:

Uri Simonsohn (the author of the recent rebuttal of the name-choice article by Pelham et al.) argued that the implied effects were too large to be believed (just as I was arguing above regarding the July 4th study), which makes more plausible his claims that the results arise from methodological artifacts.

That calculation is straight Bayes: the distribution of systematic errors has much longer tails than the distribution of random errors, so the larger the estimated effect, the more likely it is to be a mistake. This little theoretical result is a bit annoying, because it is the larger effects that are the most interesting!

Larry Bartels notes that my reasoning above is a bit incoherent:

I [Bartels] strongly agree with your bottom line that our main aim should be "understanding effect sizes on a real scale." However, your paradoxical conclusion ("the larger the estimated effect, the more likely it is to be a mistake") seems to distract attention from the effect size of primary interest--the magnitude of the "true" (causal) effect.

If the model you have in mind is b=c+d+e, where b is the estimated effect, c is the "true" (causal) effect, d is a "systematic error" (in your language), and e is a "random error," your point seems to be that your posterior belief regarding the magnitude of the "systematic error," E(d|b), is increasing in b. But the more important fact would seem to be that your posterior belief regarding the magnitude of the "true" (causal) effect, E(c|b), is also increasing in b (at least for plausible-seeming distributional assumptions).

Your prior uncertainty regarding the distributions of these various components will determine how much of the estimated effect you attribute to c and how much you attribute to d, and in the case of "wacky claims" you may indeed want to attribute most of it to d; nevertheless, it seems hard to see why a larger estimated effect should not increase your posterior estimate of the magnitude of the true causal effect, at least to some extent.

Conversely, your skeptical assessment of the flaws in the design of the July 4th study may very well lead you to believe that d>>0; but wouldn't that same skepticism have been warranted (though it might not have been elicited) even if the estimated effect had happened to look more plausible (say, half as large or one-tenth as large)?

Focusing on whether a surprising empirical result is "a mistake" (whatever that means) seems to concede too much to the simple-minded is-there-an-effect-or-isn't-there perspective, while obscuring your more fundamental interest in "understanding [true] effect sizes on a real scale."

Larry's got a point. I'll have to think about this in the context of an example. Maybe a more correct statement would be that, given reasonable models for x, d, and e, if the estimate gets implausibly large, the estimate for x does not increase proportionally. I actually think there will be some (non-Gaussian) models for which, as y gets larger, E(x|y) can actually go back toward zero. But this will depend on the distributional form.

I agree that "how likely is it to be a mistake" is the wrong way to look at things. For example, in the July 4th study, there are a lot of sources of variation, only some of which are controlled for in the analysis that was presented. No analysis is perfect, so the "mistake" framing is generally not so helpful.

I was pleasantly surprised to have my recreational reading about baseball in the New Yorker interrupted by a digression on statistics. Sam Fuld of the Tampa Bay Rays was the subject of a Ben McGrath profile in the 4 July 2011 issue of the New Yorker, in an article titled Super Sam. After quoting a minor-league trainer who described Fuld as "a bit of a geek" (who isn't these days?), McGrath gets into that lovely New Yorker detail:

One could have pointed out the more persuasive and telling examples, such as the fact that in 2005, after his first pro season with the Class-A Peoria Chiefs, Fuld applied for a fall internship with Stats, Inc., the research firm that supplies broadcasters with much of the data and analysis that you hear in sports telecasts.

After a description of what they had him doing--reviewing footage of games and cataloguing--he said:

"I thought, They have a stat for everything, but they don't have any stats regarding foul balls."

Nick Polson and James Scott write:

We generalize the half-Cauchy prior for a global scale parameter to the wider class of hypergeometric inverted-beta priors. We derive expressions for posterior moments and marginal densities when these priors are used for a top-level normal variance in a Bayesian hierarchical model. Finally, we prove a result that characterizes the frequentist risk of the Bayes estimators under all priors in the class. These arguments provide an alternative, classical justification for the use of the half-Cauchy prior in Bayesian hierarchical models, complementing the arguments in Gelman (2006).

This makes me happy, of course. It's great to be validated.

The only thing I didn't catch is how they set the scale parameter for the half-Cauchy prior. In my 2006 paper I frame it as a weakly informative prior and recommend that the scale be set based on actual prior knowledge. But Polson and Scott are talking about a default choice. I used to think that such a default would not really be possible but given our recent success with automatic priors for regularized point estimates, now I'm thinking that a reasonable default might be possible in the full Bayes case too.

P.S. I found the above article while looking on Polson's site for this excellent paper, which considers in a more theoretical way some of the themes that Jennifer, Masanao, and I are exploring in our research on hierarchical models and multiple comparisons.

Around these parts we see a continuing flow of unusual claims supported by some statistical evidence. The claims are varyingly plausible a priori. Some examples (I won't bother to supply the links; regular readers will remember these examples and newcomers can find them by searching):

- Obesity is contagious
- People's names affect where they live, what jobs they take, etc.
- Beautiful people are more likely to have girl babies
- More attractive instructors have higher teaching evaluations
- In a basketball game, it's better to be behind by a point at halftime than to be ahead by a point
- Praying for someone without their knowledge improves their recovery from heart attacks
- A variety of claims about ESP

How should we think about these claims? The usual approach is to evaluate the statistical evidence--in particular, to look for reasons that the claimed results are not really statistically significant. If nobody can shoot down a claim, it survives.

The other part of the story is the prior. The less plausible the claim, the more carefully I'm inclined to check the analysis.

But what does it mean, exactly, to check an analysis? The key step is to interpret the findings quantitatively: not just as significant/non-significant but as an effect size, and then to look at the implications of the estimated effect.

I'll explore this in the context of two examples, one from political science and one from psychology. An easy example is one in which the estimated effect is completely plausible (for example, the incumbency advantage in U.S. elections), or in which it is completely implausible (for example, a new and unreplicated claim of ESP).

Neither of the examples I consider here is easy: both of the claims are odd but plausible, and both are supported by data, theory, and reasonably sophisticated analysis.

The effect of rain on July 4th

My co-blogger John Sides linked to an article by Andreas Madestam and David Yanagizawa-Drott that reports that going to July 4th celebrations in childhood had the effect of making people more Republican. Madestam and Yanagizawa-Drott write:

Using daily precipitation data to proxy for exogenous variation in participation on Fourth of July as a child, we examine the role of the celebrations for people born in 1920-1990. We find that days without rain on Fourth of July in childhood have lifelong effects. In particular, they shift adult views and behavior in favor of the Republicans and increase later-life political participation. Our estimates are significant: one Fourth of July without rain before age 18 raises the likelihood of identifying as a Republican by 2 percent and voting for the Republican candidate by 4 percent. . . .

Here was John's reaction:

In sum, if you were born before 1970, and experienced sunny July 4th days between the ages of 7-14, and lived in a predominantly Republican county, you may be more Republican as a consequence.

When I [John] first read the abstract, I did not believe the findings at all. I doubted whether July 4th celebrations were all that influential. And the effects seem to occur too early in the life cycle: would an 8-year-old be affected politically? Doesn't the average 8-year-old care more about fireworks than patriotism?

But the paper does a lot of spadework and, ultimately, I was left thinking "Huh, maybe this is true." I'm still not certain, but it was worth a blog post.

My reaction is similar to John's but a bit more on the skeptical side.

Let's start with effect size. One July 4th without rain increases the probability of Republican vote by 4%. From their Figure 3, the number of rain-free July 4ths is between 6 and 12 for most respondents. So if we go from the low to the high end, we get an effect of 6*4%, or 24%.

[Note: See comment below from Winston Lim. If the effect is 24% (not 24 percentage points!) on the Republican vote and 0% on the Democratic vote, then the effect on the Republican share of the two-party vote, R/(D+R), is 1.24/2.24 - 1/2, or approximately 5%. So the estimate is much less extreme than I'd thought. The confusion arose because I am used to seeing results reported in terms of the percent of the two-party vote share, but these researchers used a different form of summary.]
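Spelling out the note's arithmetic (with a hypothetical 50/50 two-party baseline):

mult <- 1.24                    # Republican vote multiplied by 1.24, Democratic vote unchanged
r_share <- mult / (1 + mult)    # new Republican share of the two-party vote
r_share - 0.5                   # about 0.054, i.e., roughly 5 percentage points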

Does a childhood full of sunny July 4ths really make you 24 percentage points more likely to vote Republican? (The authors find no such effect when considering the weather on a few other days in July.) I could imagine an effect--but 24 percent of the vote? The number seems too high--especially considering the expected attenuation (noted in section 3.1 of the paper) because not everyone goes to a July 4th celebration and because the authors don't actually know the counties where the survey respondents lived as children. It's hard enough to believe an effect size of 24%; it's really hard to believe that 24% is an underestimate.

So what could've gone wrong? The most convincing part of the analysis was that they found no effect of rain on July 2, 3, 5, or 6. But this made me wonder about the other days of the year. I'd like to see them automate their analysis and loop it through all 365 days, then make a graph showing how the coefficient for July 4th fits in. (I'm not saying they should include all 365 in a single regression--that would be a mess. Rather, I'm suggesting the simpler option of 365 analyses, each for a single date.)

Otherwise there are various features in the analysis that could cause problems. The authors predict individual survey responses given the July 4th weather when the respondents were children, in the counties where they currently reside. Right away we can imagine all sorts of biases based on who moves and who stays put.

Setting aside these measurement issues, the big identification issue is that counties with more rain might be systematically different from counties with less rain. To the extent the weather can be considered a random treatment, the randomization is occurring across years within counties. The authors attempt to deal with this by including "county fixed effects"--that is, allowing the intercept to vary by county. That's OK, but their data span a 70-year period, and counties have changed a lot politically in 70 years. They also include linear time trends for states, which helps some more, but I'm still a little concerned about systematic differences not captured in these trends.

No study is perfect, and I'm not saying these are devastating criticisms. I'm just trying to work through my thoughts here.

The effects of names on life choices

For another example, consider the study by Brett Pelham, Matthew Mirenberg, and John Jones of the dentists named Dennis (and the related stories of people with names beginning with F getting low grades, baseball players with K names getting more strikeouts, etc.). I found these claims varyingly plausible: the business with the grades and the strikeouts sounded like a joke, but the claims about career choices etc seemed possible.

My first step in trying to understand these claims was to estimate an effect size: my crude estimate was that, if the research findings were correct, about 1% of people choose their careers based on their first names.

This seemed possible to me, but Uri Simonsohn (the author of the recent rebuttal of the name-choice article by Pelham et al.) argued that the implied effects were too large to be believed (just as I was arguing above regarding the July 4th study), which makes more plausible his claims that the results arise from methodological artifacts.

That calculation is straight Bayes: the distribution of systematic errors has much longer tails than the distribution of random errors, so the larger the estimated effect, the more likely it is to be a mistake. This little theoretical result is a bit annoying, because it is the larger effects that are the most interesting!

Simonsohn moved the discussion forward by calibrating the effect-size questions to other measurable quantities:

We need a benchmark to make a more informed judgment of whether the effect is small or large. For example, the Dennis/dentist effect should be much smaller than parent-dentist/child-dentist. I think this is almost certainly true but it is an easy hurdle. The J marries J effect should not be much larger than the effect of, say, conditioning on going to the same high school, or having sat next to each other in class for a whole semester.

I have no idea if that hurdle is passed. These are arbitrary thresholds for sure, but better I'd argue than both my "100% increase is too big", and your "pr(marry smith) up from 1% to 2% is ok."

Summary

No easy answers. But I think that understanding effect sizes on a real scale is a start.

Kent Osband writes:

Maximum likelihood gives the best fit to the training data but in general overfits, yielding overly noisy parameter estimates that don't perform so well when predicting new data. A popular solution to this overfitting problem takes advantage of the iterative nature of most maximum likelihood algorithms by stopping early. In general, an iterative optimization algorithm goes from a starting point to the maximum of some objective function. If the starting point has some good properties, then early stopping can work well, keeping some of the virtues of the starting point while respecting the data.

This trick can be performed the other way, too, starting with the data and then processing it to move it toward a model. That's how the iterative proportional fitting algorithm of Deming and Stephan (1940) works to fit multivariate categorical data to known margins.

In any case, the trick is to stop at the right point--not so soon that you're ignoring the data but not so late that you end up with something too noisy. Here's an example of what you might want:

[Figure: trajectory of estimates across iterations, shown relative to the true value.]

The trouble is, you don't actually know the true value so you can't directly use this sort of plot to make a stopping decision.

Everybody knows that early stopping of an iterative maximum likelihood estimate is approximately equivalent to maximum penalized likelihood estimation (that is, the posterior mode) under some prior distribution centered at the starting point. By stopping early, you're compromising between the prior and the data estimates.
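Here's a toy illustration of that equivalence (my own sketch, using the simplest possible example of estimating a normal mean; none of these numbers come from the article discussed below):

set.seed(1)
y <- rnorm(20, mean = 3, sd = 2)
sigma <- 2
n <- length(y)

# gradient ascent on the log-likelihood, started at the "prior" value 0
mu <- 0
step <- 0.02
path <- numeric(50)
for (t in 1:50) {
  mu <- mu + step * n * (mean(y) - mu) / sigma^2   # gradient of the log-likelihood
  path[t] <- mu
}

# posterior mode under mu ~ N(0, tau^2): the usual shrinkage of the sample mean
post_mode <- function(tau) mean(y) / (1 + sigma^2 / (n * tau^2))

# for each stopping point, the prior scale tau that gives the same estimate
shrink <- path[c(2, 5, 20, 50)] / mean(y)
implied_tau <- sqrt(sigma^2 / n * shrink / (1 - shrink))
round(data.frame(stop_at = c(2, 5, 20, 50),
                 estimate = path[c(2, 5, 20, 50)],
                 implied_tau = implied_tau), 2)
# In this linear-Gaussian toy case, each early stopping point is exactly the
# posterior mode under a N(0, tau^2) prior with the tau shown; stopping later
# corresponds to a weaker prior (larger tau).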

Early stopping is simple to implement, but I prefer thinking about the posterior mode because I can better understand an algorithm if I can interpret it as optimizing some objective function.

A key question, though, is where to stop. Different rules correspond to different solutions. For example, one appealing rule for maximum likelihood is to stop when the chi-squared discrepancy between data and fitted model is below some preset level such as its unconditional expected value. This would stop some of the more extreme varieties of overfitting. Such a procedure, however, is not the same as penalized maximum likelihood with a fixed prior, as it represents a different way of setting the tuning parameter.

I discussed this in my very first statistics paper, "Constrained maximum entropy methods in an image reconstruction problem" (follow the link above). The topic of early stopping came up in conversation not long ago and so I think this might be worth posting. The particular models and methods discussed in this article are not really of interest any more, but I think the general principles are still relevant.

The key ideas of the article appear in section 5 on page 433.

Non-statistical content

I read this article by Rivka Galchen on quantum computing. Much of the article was about an eccentric scientist in his fifties named David Deutsch. I'm sure the guy is brilliant but I wasn't particularly interested in his not particularly interesting life story (apparently he's thin and lives in Oxford). There was a brief description of quantum computing itself, which reminds me of the discussion we had a couple years ago under the heading, The laws of conditional probability are false (and the update here).

I don't have anything new to say here; I'd just never heard of quantum computing before and it seemed relevant to our discussion. The uncertainty inherent in quantum computing seems closely related to Jouni's idea of fully Bayesian computing, that uncertainty should be inherent in the computational structure rather than tacked on at the end.

P.S. No, I'm not working on July 4th! This post is two months old, we just have a long waiting list of blog entries.

A couple years ago Rod Little was invited to write an article for the diamond jubilee of the Calcutta Statistical Association Bulletin. His article was published with discussions from Danny Pfeffermann, J. N. K. Rao, Don Rubin, and myself.
Here it all is.

I'll paste my discussion below, but it's worth reading the others' perspectives too. Especially the part in Rod's rejoinder where he points out a mistake I made.

When predicting 0/1 data we can use logit (or probit or robit or some other robust model such as invlogit (0.01 + 0.98*X*beta)). Logit is simple enough and we can use bayesglm to regularize and avoid the problem of separation.

What if there are more than 2 categories? If they're ordered (1, 2, 3, etc), we can do ordered logit (and use bayespolr() to avoid separation). If the categories are unordered (vanilla, chocolate, strawberry), there are unordered multinomial logit and probit models out there.

But it's not so easy to fit these multinomial models in a multilevel setting (with coefficients that vary by group), especially if the computation is embedded in an iterative routine such as mi where you have real time constraints at each step.

So this got me wondering whether we could kluge it with logits. Here's the basic idea (in the ordered and unordered forms):

- If you have a variable that goes 1, 2, 3, etc., set up a series of logits: 1 vs. 2,3,...; 2 vs. 3,...; and so forth. Fit each one with bayesglm (or whatever). The usual ordered logit is a special case of this model in which the coefficients (except for the constant term) are the same for each model. (At least, I'm guessing that's what's happening; if not, there are some equivalently-dimensioned constraints.) The simple alternative has no constraints. Intermediate versions could link the models with soft constraints, some prior distribution on the coefficients. (This would need some hyperparameters but these could be estimated too if the whole model is being fit in some iterative context.)

- If you have a vanilla-chocolate-strawberry variable, do the same thing; just order the categories first, either in some reasonable way based on substantive information or else using some automatic rule such as putting the categories in decreasing order of frequency in the data. In any case, you'd first predict the probability of being in category 1, then the probability of being in 2 (given that you're not in 1), then the probability of 3 (given not 1 or 2), and so forth.

Depending on your control over the problem, you could choose how to model the variables. For example, in a political survey with some missing ethnicity responses, you might model that variable as ordered: white/other/hispanic/black. In other contexts you might go unordered.

I recognize that my patchwork of logits is a bit of a hack, but I like its flexibility, as well as the computational simplicity of building it out of simple logits. Maybe there's some literature on this (perhaps explaining why it's a bad idea)?? I'd appreciate any comments or suggestions.
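Here's a rough sketch of the idea with fake data, using bayesglm() from the arm package for each binary fit (my own toy code, not a tested implementation):

library(arm)

# For an outcome with ordered categories 1..K, fit K-1 logits:
# "1 vs. higher", then "2 vs. higher, given not 1", and so forth.
fit_patchwork <- function(y, X) {
  K <- max(y)
  fits <- vector("list", K - 1)
  for (k in 1:(K - 1)) {
    keep <- y >= k                                  # condition on not being in 1..k-1
    d <- data.frame(z = as.numeric(y[keep] == k),   # 1 if in category k, 0 if higher
                    X[keep, , drop = FALSE])
    fits[[k]] <- bayesglm(z ~ ., data = d, family = binomial(link = "logit"))
  }
  fits
}

# fake data with four ordered categories
n <- 500
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- as.numeric(cut(X$x1 + 0.5 * X$x2 + rlogis(n),
                    breaks = c(-Inf, -1, 0, 1, Inf)))
fits <- fit_patchwork(y, X)
lapply(fits, coef)
# The usual ordered logit would constrain the slopes to be equal across the K-1
# fits; here they are unconstrained, and soft constraints could be added by
# giving the coefficients a common prior.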

I've been writing a lot about my philosophy of Bayesian statistics and how it fits into Popper's ideas about falsification and Kuhn's ideas about scientific revolutions.

Here's my long, somewhat technical paper with Cosma Shalizi.
Here's our shorter overview for the volume on the philosophy of social science.
Here's my latest try (for an online symposium), focusing on the key issues.

I'm pretty happy with my approach--the familiar idea that Bayesian data analysis iterates the three steps of model building, inference, and model checking--but it does have some unresolved (maybe unresolvable) problems. Here are a couple mentioned in the third of the above links.

Consider a simple model with independent data y_1, y_2, ..., y_10 ~ N(θ,σ^2), with a prior distribution θ ~ N(0,10^2) and σ known and taking on some value of approximately 10. Inference about θ is straightforward, as is model checking, whether based on graphs or numerical summaries such as the sample variance and skewness.

But now suppose we consider θ as a random variable defined on the integers. Thus θ = 0 or 1 or 2 or 3 or ... or -1 or -2 or -3 or ..., with a discrete prior distribution formed by the discrete approximation to the N(0,10^2) distribution. In practice, with the sample size and parameters as defined above, the inferences are essentially unchanged from the continuous case, as we have defined θ on a suitably tight grid.
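A quick check of that last claim (my own toy simulation; the true mean of 7 and the seed are arbitrary):

set.seed(123)
sigma <- 10
n <- 10
y <- rnorm(n, mean = 7, sd = sigma)

# continuous conjugate posterior for theta under the N(0, 10^2) prior
V <- 1 / (1 / 10^2 + n / sigma^2)
m <- V * n * mean(y) / sigma^2

# discrete prior on the integers, proportional to the N(0, 10^2) density
theta <- -100:100
post <- dnorm(theta, 0, 10) * sapply(theta, function(t) prod(dnorm(y, t, sigma)))
post <- post / sum(post)

c(mean_continuous = m, mean_discrete = sum(theta * post))
c(sd_continuous = sqrt(V), sd_discrete = sqrt(sum((theta - sum(theta * post))^2 * post)))
# The posterior mean and sd agree to within a tiny fraction of a standard error,
# even though one model is continuous and the other is discrete.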

But from my philosophical position, the discrete model is completely different: I have already written that I do not like to choose or average over a discrete set of models. This is a silly example but it illustrates a hole in my philosophical foundations: when am I allowed to do normal Bayesian inference about a parameter θ in a model, and when do I consider θ to be indexing a class of models, in which case I consider posterior inference about θ to be an illegitimate bit of induction? I understand the distinction in extreme cases--they correspond to the difference between normal science and potential scientific revolutions--but the demarcation does not cleanly align with whether a model is discrete or continuous.

Another incoherence in Bayesian data analysis, as I practice it, arises after a model check. Judgment is required to decide what to do after learning that an aspect of data is not fitted well by the model--or, for that matter, in deciding what to do in the other case, when a test does not reject. In either case, we must think about the purposes of our modeling and our available resources for data collection and computation. I am deductively Bayesian when performing inference and checking within a model, but I must go outside this framework when making decisions about whether and how to alter my model.

In my defense, I see comparable incoherence in all other statistical philosophies:

- Subjective Bayesianism appears fully coherent but falls apart when you examine the assumption that your prior distribution can completely reflect prior knowledge. This can't be, even setting aside that actual prior distributions tend to be chosen from convenient parametric families. If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.

- Classical parametric statistics disallows probabilistic prior information but assumes the likelihood function to be precisely known, which can't make sense except in some very special cases. Robust analysis attempts to account for uncertainty about model specification but relies on additional assumptions such as independence.

- Classical nonparametric methods rely strongly on symmetry, translation invariance, independence, and other generally unrealistic assumptions.

My point here is not to say that my preferred methods are better than others but rather to couple my admission of philosophical incoherence with a reminder that there is no available coherent alternative.

To put it another way, how would an "AI" do Bayesian analysis (or statistical inference in general)? The straight-up subjective Bayes approach requires the AI to already have all possible models specified in its database with appropriate prior probabilities. That doesn't seem plausible to me. But my approach requires models to be generated on the fly (in response to earlier model checks and the appearance of new data). It's clear enough how an AI could perform inference on a specified graphical model (or on a mixture of such models); it's not so clear how an AI could do model checking. When I do the three steps of Bayesian data analysis, human input is needed to interpret graphs and decide on model improvements. But we can't ask the AI to delegate such tasks to a homunculus. Another way of saying this is that, if Bayesian data analysis is a form of applied statistics, an AI can't fully do statistics until it can do science. In the meantime, though, the computer is a hugely useful tool in each of the three stages of Bayesian data analysis, even if it can't put it all together yet. I'm hoping that by automating some of the steps required to evaluate and compare models, we can get a better sense of what outside knowledge we are adding at each step.

P.S. Cosma Shalizi writes:

If your graphical model does not have all possible edges, then there are conditional independencies which could be checked mechanically (and there are programs which do things like that). I know you don't like zeroes in the model, but it's at least a step in the right direction, no?

To which I reply:

Yup. Or at times it could be a step in the wrong direction, depending on the model. But if the program has the ability to check the model (or at least to pass the relevant graphs on to the homunculus), then I would think that working with conditional independence approximations could be a useful way to move forward.

That's part of my incoherence. On one hand, I hate discrete model averaging. On the other hand, I typically end up with one model, which is just a special case of an average of models.

For the analysis of binary data, various deterministic models have been proposed, which are generally simpler to fit and easier to understand than probabilistic models. We claim that corresponding to any deterministic model is an implicit stochastic model in which the deterministic model fits imperfectly, with errors occurring at random. In the context of binary data, we consider a model in which the probability of error depends on the model prediction. We show how to fit this model using a stochastic modification of deterministic optimization schemes.

The advantages of fitting the stochastic model explicitly (rather than implicitly, by simply fitting a deterministic model and accepting the occurrence of errors) include quantification of uncertainty in the deterministic model's parameter estimates, better estimation of the true model error rate, and the ability to check the fit of the model nontrivially. We illustrate this with a simple theoretical example of item response data and with empirical examples from archeology and the psychology of choice.

Here's the article (by Iwin Leenen, Iven Van Mechelen, Paul De Boeck, Jeroen Poblome, and myself). It didn't get a lot of attention when it came out, but I still think it's an excellent and widely applicable idea. Lots of people are running around out there fitting deterministic prediction models, and our paper describes a method for taking such models and interpreting them probabilistically. By turning a predictive model into a generative probabilistic model, we can check model fit and point toward potential improvements (which might be implementable in the original deterministic framework).

For a while I've been fitting most of my multilevel models using lmer/glmer, which gives point estimates of the group-level variance parameters (maximum marginal likelihood estimate for lmer and an approximation for glmer). I'm usually satisfied with this--sure, point estimation understates the uncertainty in model fitting, but that's typically the least of our worries.

Sometimes, though, lmer/glmer estimates group-level variances at 0 or estimates group-level correlation parameters at +/- 1. Typically, when this happens, it's not that we're so sure the variance is close to zero or that the correlation is close to 1 or -1; rather, the marginal likelihood does not provide a lot of information about these parameters of the group-level error distribution.

I don't want point estimates on the boundary. I don't want to say that the unexplained variance in some dimension is exactly zero.

One way to handle this problem is full Bayes: slap a prior on sigma, do your Gibbs and Metropolis, and summarize posterior inferences by simulation. That's fine but it's another bit of programming and computation. We wanted to be able to do it using the existing lmer/glmer (or, in Stata, gllamm) framework.

Yeojin Chung, Sophia Rabe-Hesketh, Jingchen Liu, Vince Dorie, and I came up with a solution. We compute a maximum penalized marginal likelihood estimate, the marginal posterior mode under a gamma (not inverse gamma; you know how I hate that) prior for the scale parameter. Our default choice is gamma (2, 1/A), where A is set to some large value such as 10 (rescaled relative to the regression predictors, as usual). The shape parameter of 2 gives a prior that is linear at 0: thus, the posterior mode can't be 0 under any circumstances--but it can be as close to 0 as is required based on the likelihood. Some basic algebra shows that if the marginal maximum likelihood estimate is at 0, our posterior mode estimate will be about 1 standard error away from 0, which seems about right. The rate parameter, 1/A, is set to be small so that the prior will only pull the estimate down if it is unexpectedly large, as might occur if the number of groups is small.
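As a quick look at what this default prior does near the boundary (my own check, using A = 10):

A <- 10
round(dgamma(c(0, 0.1, 0.5, 1, 2, 5), shape = 2, rate = 1 / A), 4)
# The density is exactly zero at sigma = 0 and rises roughly linearly nearby, so
# the posterior mode can be pulled close to zero by the likelihood but can never
# sit exactly on the boundary; with rate 1/A the prior density peaks way out at
# sigma = A, so it only pulls the estimate down when it is unexpectedly large.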

Our prior distribution is weakly informative in that it does something to the estimate but is generic and doesn't have as much information as we would have in just about any particular example. I think that in many applied examples it would make sense to use stronger prior information. But I think that, as a default choice, the weakly informative gamma (2, 1/A) prior makes more sense than the uniform distribution (corresponding to the nonregularized estimate).

In the multivariate case, we use a Wishart (not inverse-Wishart), which when we choose the right parameters keeps the variances away from 0 and the correlation matrix positive definite.

One interesting feature of these models is that the weakly-informative regularizing prior we use for a point estimate is not the same as the weakly-informative prior that we would use for full Bayes (where there is no need to keep the posterior mode away from the boundary).

Here's the abstract to our paper:

Variance parameters in mixed or multilevel models can be difficult to estimate, especially when the number of groups is small. We propose a maximum penalized likelihood approach which is equivalent to estimating variance parameters by their marginal posterior mode, given a weakly informative prior distribution. By choosing the prior from the gamma family with at least 1 degree of freedom, we ensure that the prior density is zero at the boundary and thus the marginal posterior mode of the group-level variance will be positive. The use of a weakly informative prior allows us to stabilize our estimates while remaining faithful to the data.

Let me again thank my collaborators, Yeojin Chung, Sophia Rabe-Hesketh, Jingchen Liu, Vince Dorie. We're now working on various generalizations, including multivariate distributions, generalized linear models, and hierarchical models for the variance parameters (going beyond the simple ideas in section 6 of my article from a few years ago).

Martyn Plummer replied to my recent blog on DIC with information that was important enough that I thought it deserved its own blog entry. Martyn wrote:

DIC has been around for 10 years now and despite being immensely popular with applied statisticians it has generated very little theoretical interest. In fact, the silence has been deafening. I [Martyn] hope my paper added some clarity.

As you say, DIC is (an approximation to) a theoretical out-of-sample predictive error. When I finished the paper I was a little embarrassed to see that I had almost perfectly reconstructed the justification of AIC as approximate cross-validation measure by Stone (1977), with a Bayesian spin of course.

But even this insight leaves a lot of choices open. You need to choose the right loss function and also which level of the model you want to replicate from. David Spiegelhalter and colleagues called this the "focus". In practice the focus is limited to the lowest level of the model. You generally can't calculate the log likelihood (the default penalty) for higher level parameters. But this narrow choice might not correspond to the interests of the analyst. For example, in disease mapping DIC answers the question "What model yields the disease map that best captures the features of the observed incidence data during this period?" But people are often asking more fundamental questions about their models, like "Is there spatial aggregation in disease X?" There is quite a big gap between these questions.

Regarding the slow convergence of DIC, you might want to try an alternative definition of the effective number of parameters pD that I came up with in 2002, in the discussion of Spiegelhalter et al. It is non-negative and coordinate free. It can be calculated from 2 or more parallel chains and so its sample variance can be estimated using standard MCMC diagnostics. I finally justified it in my 2008 paper and implemented it in JAGS. The steps are (or should be):

- Compile a model with at least 2 parallel chains
- Load the dic module
- Set a trace monitor for "pD".
- Output with the coda command

If you are only interested in the sample mean, not the variance, the dic.samples function from the rjags package will give you this in a nice R object wrapper.

I suppose we can implement this in stan too.
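For reference, here's roughly what that recipe looks like from R via rjags (a sketch only; the model file and data list are hypothetical placeholders):

library(rjags)

load.module("dic")                                              # provides the pD monitor
m <- jags.model("my_model.bug", data = my_data, n.chains = 2)   # need at least 2 chains
update(m, 1000)                                                 # burn-in

# mean deviance and the pD penalty (Plummer's definition):
dic.samples(m, n.iter = 5000, type = "pD")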

Aki Vehtari commented too, with a link to a recent article by Sumio Watanabe on something called the widely applicable information criterion. Watanabe's article begins:

In regular statistical models, the leave-one-out cross-validation is asymptotically equivalent to the Akaike information criterion. However, since many learning machines are singular statistical models, the asymptotic behavior of the cross-validation remains unknown. In previous studies, we established the singular learning theory and proposed a widely applicable information criterion, the expectation value of which is asymptotically equal to the average Bayes generalization loss. In the present paper, we theoretically compare the Bayes cross-validation loss and the widely applicable information criterion and prove two theorems. First, the Bayes cross-validation loss is asymptotically equivalent to the widely applicable information criterion as a random variable. Therefore, model selection and hyperparameter optimization using these two values are asymptotically equivalent. Second, the sum of the Bayes generalization error and the Bayes cross-validation error is asymptotically equal to 2λ/n, where λ is the real log canonical threshold and n is the number of training samples. Therefore the relation between the cross-validation error and the generalization error is determined by the algebraic geometrical structure of a learning machine. We also clarify that the deviance information criteria are different from the Bayes cross-validation and the widely applicable information criterion.

It's great to see progress in this area. After all these years of BIC variants, which I just hate, I like that researchers are moving back to the predictive-error framework. I think that Spiegelhalter, Best, Carlin, and Van der Linde made an important contribution with their DIC paper ten years ago. Even if DIC is not perfect and can be improved, they pushed the field in the right direction.

The deviance information criterion (or DIC) is an idea of Brad Carlin and others for comparing the fits of models estimated using Bayesian simulation (for more information, see this article by Angelika van der Linde).

I don't really ever know what to make of DIC. On one hand, it seems sensible, it handles uncertainty in inferences within each model, and it does not depend on aspects of the models that don't affect inferences within each model (unlike Bayes factors; see discussion here). On the other hand, I don't really have any idea what I would do with DIC in any real example. In our book we included an example of DIC--people use it and we don't have any great alternatives--but I had to be pretty careful that the example made sense. Unlike the usual setting where we use a method and that gives us insight into a problem, here we used our insight into the problem to make sure that in this particular case the method gave a reasonable answer.

One of my practical problems with DIC is that it seems to take a lot of simulations to calculate it precisely. Long after we've achieved good mixing of the chains and good inference for parameters of interest and we're ready to go on, it turns out that DIC is still unstable. In the example in our book we ran for a zillion iterations to make sure the DIC was ok.

DIC's slow convergence is not something I've seen written about anywhere--a student and I worked on a project a few years ago to quantify the DIC convergence issue but we never finished it--but, by a simple application of the folk theorem, it suggests that the measure might have some more fundamental problems.

I've long thought of DIC, like AIC, as being a theory-assisted estimate of out-of-sample predictive error. But I've always been stuck on the details, maybe because I've never really used either measure in any applied problem.

While writing this blog I came across an article by Martyn Plummer that gives a sense of the current thinking on DIC and its strengths and limitations. Plummer's paper begins:

The deviance information criterion (DIC) is widely used for Bayesian model comparison, despite the lack of a clear theoretical foundation. DIC is shown to be an approximation to a penalized loss function based on the deviance, with a penalty derived from a cross-validation argument. This approximation is valid only when the effective number of parameters in the model is much smaller than the number of independent observations. In disease mapping, a typical application of DIC, this assumption does not hold and DIC under-penalizes more complex models. Another deviance-based loss function, derived from the same decision-theoretic framework, is applied to mixture models, which have previously been considered an unsuitable application for DIC.

Again, I'm not trying to knock DIC; I'm just trying to express my current understanding of it.

P.S. More here.

In response to this article by Cosma Shalizi and myself on the philosophy of Bayesian statistics, David Hogg writes:

I [Hogg] agree--even in physics and astronomy--that the models are not "True" in the God-like sense of being absolute reality (that is, I am not a realist); and I have argued (a philosophically very naive paper, but hey, I was new to all this) that for pretty fundamental reasons we could never arrive at the True (with a capital "T") model of the Universe. The goal of inference is to find the "best" model, where "best" might have something to do with prediction, or explanation, or message length, or (horror!) our utility. Needless to say, most of my physics friends *are* realists, even in the face of "effective theories" as Newtonian mechanics is an effective theory of GR and GR is an effective theory of "quantum gravity" (this plays to your point, because if you think any theory is possibly an effective theory, how could you ever find Truth?). I also liked the ideas that the prior is really a testable regularization, and part of the model, and that model checking is our main work as scientists.

My only issue with the paper is around Section 4.3, where you say that you can't even use Bayes to average or compare the probabilities of models. I agree that you don't think any of your models are True, but if you decide that what the scientist is trying to do is explain or encode (as in the translation between inference and signal compression), then model averaging using Bayes *will* give the best possible result. That is, it seems to me like there *is* an interpretation of what a scientist *does* that makes Bayesian averaging a good idea. I guess you can say that you don't think that is what a scientist does, but that gets into technical assumptions about epistemology that I don't understand. I guess what I am asking is: Don't you use--as you are required by the rules of measure theory--Bayesian averaging, and isn't it useful? Same with updating. They are useful and correct. It is just that you are not *done* when you have done all that; you still have to do model checking and expanding and generalizing afterwards (but even this can still be understood in terms of finding the best possible effective theory or encoding for the data).

Yet another way of trying to explain my confusion is this: When you describe the convergence process in a model space that *doesn't* contain the truth, you say that all it tries to do is match the distribution of the data. But isn't that what science *is*? Matching the distribution of the data with a simpler model? So then Bayes is doing exactly what we want!

My reply:

Bayesian model averaging could work, and in some situations it does work, but it won't necessarily work. The problem arises with the models being averaged: the posterior probabilities of the individual models depend crucially on untestable aspects of the prior distributions. In particular, flat priors cause zero marginal likelihoods, and approximations to flat priors cause sensitivity (for example, if you use a N(0,A^2) prior with very large A, then the marginal posterior probability of your model will be proportional to 1/A, hence it matters a lot whether A is 100 or 1000 or 1 million, even though it won't matter at all for inference within a model).
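Here's a tiny numerical version of the 1/A point (my own example: a single observation y ~ N(theta, 1) with prior theta ~ N(0, A^2)):

y <- 1.5
for (A in c(100, 10000)) {
  marginal <- dnorm(y, 0, sqrt(1 + A^2))   # marginal likelihood p(y) under this prior
  post_sd  <- sqrt(1 / (1 + 1 / A^2))      # posterior sd of theta
  cat("A =", A, "  p(y) =", signif(marginal, 3),
      "  posterior sd =", round(post_sd, 4), "\n")
}
# The marginal likelihood drops by a factor of about 100 when A goes from 100 to
# 10000, while the within-model inference for theta is essentially unchanged; any
# model comparison or averaging based on p(y) inherits that arbitrariness.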

This is not to say that posterior model averaging is necessarily useless, merely that if you want to do it, I think you need to think seriously about the different pieces of the super-model that you're estimating. At this point I'd prefer continuous model expansion rather than discrete model averaging. We discuss this point in chapter 6 of BDA.

David's follow-up:

I agree very much with this point (the 1/A point, which is a huge issue in Bayes factors), and in part for this reason I have been using leave-one-out cross-validation to do model comparison (no free parameters, justifiable, makes sense to scientists and engineers, even skeptical ones, etc). I would also be interested in your opinion about leave-one-out cross-validation; my engineer/CS friends love it.

My reply:

Cross-validation is great and I've used it on occasion. I don't really understand it. This is not a criticism; I just want to think harder about it at some point. To me, cross-validation is tied into predictive model checking in that ideas such as "leave one out" are fundamentally related to data collection. Xval is like model checking in that the data come in through the sampling distribution, not just the likelihood.

Deviance as a difference


Peng Yu writes:

On page 180 of BDA2, deviance is defined as D(y,\theta) = -2 log p(y|\theta). However, according to GLM 2/e by McCullagh and Nelder, deviance is the difference of the log-likelihoods of the full model and the base model (times 2) (see the equation on the wiki webpage). The English word 'deviance' implies a difference from a standard (in this case, the base model). I'm wondering what the rationale is for your definition of deviance, which consists of only one term rather than two.

My reply:

Deviance is typically computed as a relative quantity; that is, people look at the difference in deviance between two models. So the two definitions are equivalent for the purpose of comparing models: the log-likelihood of the full (saturated) model is a constant that cancels when you take differences.
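Here's a quick numerical illustration with two nested Poisson regressions (my own simulated example):

set.seed(1)
x <- rnorm(100)
y <- rpois(100, exp(1 + 0.3 * x))

m1 <- glm(y ~ 1, family = poisson)
m2 <- glm(y ~ x, family = poisson)

deviance(m1) - deviance(m2)                                 # McCullagh-Nelder deviances
-2 * as.numeric(logLik(m1)) + 2 * as.numeric(logLik(m2))    # BDA-style deviances
# The two differences are identical: the saturated-model log-likelihood is the
# same constant in both deviances, so it cancels when models are compared.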

I'm a few weeks behind in my New Yorker reading and so just recently read this fascinating article by Ryan Lizza on the current administration's foreign policy. He gives some insights into the transformation of Obama from antiwar candidate to a president conducting three wars.

Speaking as a statistician, though, what grabbed my eye was a doctrine of journalist/professor/policymaker Samantha Power. Lizza writes:

In 2002, after graduating from Harvard Law School, she wrote "A Problem from Hell," which surveyed the grim history of six genocides committed in the twentieth century. Propounding a liberal-interventionist view, Power argued that "mass killing" on the scale of Rwanda or Bosnia must be prevented by other nations, including the United States. She wrote that America and its allies rarely have perfect information about when a regime is about to commit genocide; a President, therefore, must have "a bias toward belief" that massacres are imminent.

From a statistical perspective, this sounds completely wrong! If you want to argue that it's a good idea to intervene, even if you're not sure, or if you want to argue that it's wise to intervene, even if the act of intervention will forestall the evidence for genocide that would be the motivation for intervention, that's fine. It's a cost-benefit analysis and it's best to lay out the costs and benefits as clearly as possible (within the constraints established by military and diplomatic secrecy). But to try to shade the probabilities to get the decision you want . . . that doesn't seem like a good idea at all!

To be fair, the above quote predates the Iraq WMD fiasco, our most notorious recent example of a "bias toward belief" that influenced policy. Perhaps Power has changed her mind on the virtues of biasing one's belief.

P.S. Samantha Power has been non-statistical before.

P.P.S. Just in case anyone wants to pull the discussion in a more theoretical direction: No, Power's (and, for that matter, Cheney's) "bias toward belief" is not simply a Bayesian prior. My point here is that she's constructing a belief system (a prior) based not on a model of what's happening or even on a subjective probability but rather on what she needs to get the outcome she wants. That's not Bayes. In Bayes, the prior and the utility function are separate.

Follow the discussion (originated by Mike Jordan) at the Statistics Forum.

Jeff writes:

How far off is bglmer and can it handle ordered logit or multinom logit?

My reply:

bglmer is very close. No ordered logit, but I was just talking about it with Sophia today. My guess is that the easiest way to fit a hierarchical ordered logit or multinomial logit will be to use Stan. For right now I'd recommend using glmer/bglmer to fit the ordered logits in order (e.g., 1 vs. 2,3,4, then 2 vs. 3,4, then 3 vs. 4). Or maybe there's already a hierarchical multinomial logit in MCMCpack or somewhere?
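For concreteness, here is a rough sketch of that workaround (continuation-ratio-style binary fits; the data frame d, predictor x, grouping factor g, and four-category outcome y are all hypothetical), using glmer for now with the idea that bglmer would slot in the same way once available:

```r
library(lme4)

# Hypothetical data frame d with an ordered outcome y in 1:4, predictor x,
# and grouping factor g. Fit the ordered categories as a sequence of binary splits:
# 1 vs. 2,3,4, then 2 vs. 3,4, then 3 vs. 4.
fits <- lapply(1:3, function(k) {
  sub <- d[d$y >= k, ]                   # restrict to categories k, k+1, ..., 4
  sub$z <- as.numeric(sub$y > k)         # category k vs. anything higher
  glmer(z ~ x + (1 | g), data = sub, family = binomial)
})
lapply(fits, summary)
```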

Something on Applied Bayesian Statistics

April 27, 4:10-5 p.m., 1011 Evans Hall

I will deliver one of the following three talks:
1. Of beauty, sex, and power: Statistical challenges in estimating small effects
2. Why we (usually) don't worry about multiple comparisons
3. Parameterization and Bayesian modeling
Whoever shows up on time to the seminar gets to vote, and I'll give the talk that gets the most votes.

Galin Jones, Steve Brooks, Xiao-Li Meng and I edited a handbook of Markov Chain Monte Carlo that has just been published. My chapter (with Kenny Shirley) is here, and it begins like this:

Convergence of Markov chain simulations can be monitored by measuring the diffusion and mixing of multiple independently-simulated chains, but different levels of convergence are appropriate for different goals. When considering inference from stochastic simulation, we need to separate two tasks: (1) inference about parameters and functions of parameters based on broad characteristics of their distribution, and (2) more precise computation of expectations and other functions of probability distributions. For the first task, there is a natural limit to precision beyond which additional simulations add essentially nothing; for the second task, the appropriate precision must be decided from external considerations. We illustrate with an example from our current research, a hierarchical model of trends in opinions on the death penalty in U.S. states.

To read all the other chapters, you'll have to buy the book!

A student writes:

I have a question about an earlier recommendation of yours on the selection of the prior distribution for the precision hyperparameter of a normal distribution, and a reference for the recommendation. If I recall correctly, I have read that you suggested using Gamma(1.4, 0.4) instead of Gamma(0.01, 0.01) for the prior distribution of the precision hyperparameter of a normal distribution.

I would very much appreciate it if you would have the time to point me to this publication of yours. The reason is that I have used this prior distribution (Gamma(1.4, 0.4)) in a study which we are now revising for publication, and where a reviewer questions the choice of the distribution (claiming that it is too informative!).

I am well aware that in recent publications (Prior distributions for variance parameters in hierarchical models, Bayesian Analysis; Data Analysis Using Regression and Multilevel/Hierarchical Models) you suggest modeling the precision as pow(standard deviation, -2) and using either a uniform or a half-Cauchy distribution. However, since our model was fitted before I saw these publications, I would very much like to find your earlier recommendation (which works fine!).

My reply: I've never heard of a Gamma (1.4, 0.4) distribution. I have no idea where this came from! But I can believe that it might work well--it would depend on the application.

Gamma (1.4, 0.4)?? Perhaps this was created by matching some moments or quantiles???
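One quick way to see what such a prior implies (a sketch; I'm assuming the usual shape-rate parameterization, under which Gamma(1.4, 0.4) has mean 3.5) is to simulate precisions and look at the implied standard deviations:

```r
set.seed(42)
prec <- rgamma(1e5, shape = 1.4, rate = 0.4)   # precision = 1 / sd^2
sd_implied <- 1 / sqrt(prec)

mean(prec)                                 # about 3.5 (= 1.4 / 0.4)
quantile(sd_implied, c(0.025, 0.5, 0.975))
# Most of the prior mass puts the standard deviation roughly between 0.3 and 3,
# which is informative on some scales and weak on others--whether that's
# reasonable depends on the application, as noted above.
```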

Rob Kass's article on statistical pragmatism is scheduled to appear in Statistical Science along with some discussions. Here are my comments.

I agree with Rob Kass's point that we can and should make use of statistical methods developed under different philosophies, and I am happy to take the opportunity to elaborate on some of his arguments.

I'll discuss the following:
- Foundations of probability
- Confidence intervals and hypothesis tests
- Sampling
- Subjectivity and belief
- Different schools of statistics

Dean Eckles writes:

I remember reading on your blog that you were working on some tools to fit multilevel models that also include "fixed" effects -- such as continuous predictors -- that are also estimated with shrinkage (for example, an L1 or L2 penalty). Any new developments on this front?

I often find myself wanting to fit a multilevel model to some data, but also needing to include a number of "fixed" effects, mainly continuous variables. This makes me wary of overfitting to these predictors, so then I'd want to use some kind of shrinkage.

As far as I can tell, the main option for doing this now is to go fully Bayesian and use a Gibbs sampler. With MCMCglmm or BUGS/JAGS I could just specify a prior on the fixed effects that corresponds to a desired penalty. However, this is pretty slow, especially with a large data set and because I'd like to select the penalty parameter by cross-validation (which is where this isn't very Bayesian, I guess?).

My reply:

We allow informative priors in blmer/bglmer. Unfortunately blmer/bglmer aren't ready yet but they will be soon, I hope. They'll be in the "arm" package in R. We're also working on a bigger project of multilevel models for deep interactions of continuous predictors. But that won't be ready for awhile; we still have to figure out what we want to do there.

Of Beauty, Sex, and Power: Statistical Challenges in Estimating Small Effects. At the Institute of Policy Research, Thurs 7 Apr 2011, 3.30pm.

Regular blog readers know all about this topic. (Here are the slides.) But, rest assured, I don't just mock. I also offer constructive suggestions.

My last talk at Northwestern was fifteen years ago. Actually, I gave two lectures then, in the process of being turned down for a job, er, enjoying their chilly Midwestern hospitality.

P.S. I searched on the web and also found this announcement which gives the wrong title.

Marc Tanguay writes in with a specific question that has a very general answer. First, the question:

I [Tanguay] am currently running an MCMC for which I have three parameters that are restricted to a specific space. Two are bounded between 0 and 1, while the third is binary and updated by a beta-binomial. Since my priors are also bounded, I notice that, conditional on all the rest (which covers both data and other parameters), the density does not vary a lot within the space of the parameters. As a result, the acceptance rate is high, about 85%, and this despite the fact that the whole parameter space is explored. Since in your book the optimal acceptance rates prescribed are lower than 50% (in the case of multiple parameters), do you think I should worry about getting 85%? Or is this normal given the restrictions on the parameters?

First off: Yes, my guess is that you should be taking bigger jumps. 85% seems like too high an acceptance rate for Metropolis jumping.

More generally, though, my recommendation is to monitor expected squared jumped distance (ESJD), which is a cleaner measure than acceptance probability. Generally, the higher the ESJD, the happier you should be. See this paper with Cristian Pasarica.

The short story is that if you maximize ESJD, you're minimizing the first-order autocorrelation. And, within any parameterized family of jumping rules, if you minimize the first-order autocorrelation, I think you'll pretty much be minimizing all of your autocorrelations and maximizing your efficiency.

As Cristian and I discuss in the paper, you can use a simple Rao-Blackwellization to compute expected squared jumped distance, rather than simply average squared jumped distance. We develop some tricks based on differentiation and importance sampling to adaptively optimize the algorithm to maximize ESJD, but you can always try the crude approach of trying a few different jumping scales, calculating ESJD for each, and then picking the best to go forward.
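Here's a minimal sketch of that crude approach, using a toy random-walk Metropolis on a standard normal target (everything here is made up for illustration): run a short chain at each of several jumping scales, compute the average squared jumped distance and the acceptance rate, and keep the scale with the largest ESJD.

```r
set.seed(7)
log_target <- function(th) dnorm(th, log = TRUE)   # toy target: standard normal

run_chain <- function(scale, n_iter = 5000) {
  th <- 0
  draws <- numeric(n_iter)
  acc <- 0
  for (t in seq_len(n_iter)) {
    prop <- th + rnorm(1, sd = scale)              # random-walk proposal
    if (log(runif(1)) < log_target(prop) - log_target(th)) {
      th <- prop
      acc <- acc + 1
    }
    draws[t] <- th
  }
  c(scale = scale,
    accept = acc / n_iter,
    esjd = mean(diff(draws)^2))                    # average squared jumped distance
}

t(sapply(c(0.1, 0.5, 1, 2.4, 5, 10), run_chain))
# Pick the scale with the largest ESJD; very high acceptance rates (tiny jumps)
# and very low ones (big, mostly rejected jumps) both give small ESJD.
```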

Steve Ziliak points me to this article by the always-excellent Carl Bialik, slamming hypothesis tests. I only wish Carl had talked with me before so hastily posting, though! I would've argued with some of the things in the article. In particular, he writes:

Reese and Brad Carlin . . . suggest that Bayesian statistics are a better alternative, because they tackle the probability that the hypothesis is true head-on, and incorporate prior knowledge about the variables involved.

Brad Carlin does great work in theory, methods, and applications, and I like the bit about the prior knowledge (although I might prefer the more general phrase "additional information"), but I hate that quote!

My quick response is that the hypothesis of zero effect is almost never true! The problem with the significance testing framework--Bayesian or otherwise--is in the obsession with the possibility of an exact zero effect. The real concern is not with zero, it's with claiming a positive effect when the true effect is negative, or claiming a large effect when the true effect is small, or claiming a precise estimate of an effect when the true effect is highly variable, or . . . I've probably missed a few possibilities here but you get the idea.

In addition, none of Carl's correspondents mentioned the "statistical significance filter": the idea that, to make the cut of statistical significance, an estimate has to reach some threshold. As a result of this selection bias, statistically significant estimates tend to be overestimates--whether or not a Bayesian method is used, and whether or not there are any problems with fishing through the data.

Bayesian inference is great--I've written a few books on the topic--but, y'know, garbage in, garbage out. If you start with a model of exactly zero effects, that's what will pop out.

I completely agree with this quote from Susan Ellenberg, reported in the above article:

You have to make a lot of assumptions in order to do any statistical test, and all of those are questionable.

And being Bayesian doesn't get around that problem. Not at all.

P.S. Steve Stigler is quoted as saying, "I don't think in science we generally sanction the unequivocal acceptance of significance tests." Unfortunately, I have no idea what he means here, given the two completely opposite meanings of the word "sanction" (see the P.S. here.)

Radford writes:

The word "conservative" gets used many ways, for various political purposes, but I would take it's basic meaning to be someone who thinks there's a lot of wisdom in traditional ways of doing things, even if we don't understand exactly why those ways are good, so we should be reluctant to change unless we have a strong argument that some other way is better. This sounds very Bayesian, with a prior reducing the impact of new data.

I agree completely, and I think Radford will very much enjoy my article with Aleks Jakulin, "Bayes: radical, liberal, or conservative?" Radford's comment also fits with my increasing inclination to use informative prior distributions.

This is a chance for me to combine two of my interests--politics and statistics--and probably to irritate both halves of the readership of this blog. Anyway...

I recently wrote about the apparent correlation between Bayes/non-Bayes statistical ideology and liberal/conservative political ideology:

The Bayes/non-Bayes fissure had a bit of a political dimension--with anti-Bayesians being the old-line conservatives (for example, Ronald Fisher) and Bayesians having more of a left-wing flavor (for example, Dennis Lindley). Lots of counterexamples at an individual level, but my impression is that on average the old curmudgeonly, get-off-my-lawn types were (with some notable exceptions) more likely to be anti-Bayesian.

This was somewhat based on my experiences at Berkeley. Actually, some of the cranky anti-Bayesians were probably Democrats as well, but when they were being anti-Bayesian they seemed pretty conservative.

Recently I received an interesting item from Gerald Cliff, a professor of mathematics at the University of Alberta. Cliff wrote:

I took two graduate courses in Statistics at the University of Illinois, Urbana-Champaign in the early 1970s, taught by Jacob Wolfowitz. He was very conservative, and anti-Bayesian. I admit that my attitudes towards Bayesian statistics come from him. He said that if one has a population with a normal distribution and unknown mean which one is trying to estimate, it is foolish to assume that the mean is random; it is fixed, and currently unknown to the statistician, but one should not assume that it is a random variable.

Wolfowitz was in favor of the Vietnam War, which was still on at the time. He is the father of Paul Wolfowitz, active in the Bush administration.

To which I replied:

Very interesting. I never met Neyman while I was at Berkeley (he had passed away before I got there) but I've heard that he was very liberal politically (as was David Blackwell). Regarding the normal distribution comment below, I would say:

1. Bayesians consider parameters to be fixed but unknown. The prior distribution is a regularization tool that allows more stable estimates.

2. The biggest assumptions in probability models are typically not the prior distribution but in the data model. In this case, Wolfowitz was willing to assume a normal distribution with no question but then balked at using any knowledge about its mean. It seems odd to me, as a Bayesian, for one's knowledge to be divided so sharply: zero knowledge about the parameter, perfect certainty about the distributional family.

To return to the political dimension: From basic principles, I don't see any strong logical connection between Bayesianism and left-wing politics. In statistics, non-Bayesian ("classical") methods such as maximum likelihood are often taken to be conservative, as compared to the more assumption-laden Bayesian approach, but, as Aleks Jakulin and I have argued, the labeling of a statistical method as liberal or conservative depends crucially on what is considered your default.

As statisticians, we are generally trained to respect conservatism, which can sometimes be defined mathematically (for example, nominal 95% intervals that contain the true value more than 95% of the time) and sometimes with reference to tradition (for example, deferring to least-squares or maximum-likelihood estimates). Statisticians are typically worried about messing with data, which perhaps is one reason that the Current Index to Statistics lists 131 articles with "conservative" in the title or keywords and only 46 with the words "liberal" or "radical."

In that sense, given that, until recently, non-Bayesian approaches were the norm in statistics, it was the more radical group of statisticians (on average) who wanted to try something different. And I could see how a real hardline conservative such as Wolfowitz could see a continuity between anti-Bayesian skepticism and political conservatism, just as, on the other side of the political spectrum, a leftist such as Lindley could ally Bayesian thinking with support of socialism, a planned economy, and the like.

As noted above, I don't think these connections make much logical sense but I can see where they were coming from (with exceptions, of course, as noted regarding Neyman above).

Bayesian spam!

| 4 Comments

Cool! I know Bayes has reached the big time when I receive spam like this:

Bayesian networks are rapidly emerging as a new research paradigm . . . With this monthly newsletter, we'll keep you up to date . . . Financial Analytics Webinar . . . will exhibit at this year's INFORMS Analytics Conference in downtown Chicago. Please join us for our Bayesian networks technology workshop on April 10 . . . a powerful desktop application (Windows/Mac/Unix) for knowledge discovery, data mining, analytics, predictive modeling and simulation . . . the world's only comprehensive software package for learning, editing and analyzing Bayesian networks . . . If you no longer wish to receive these emails, please reply to this message with "Unsubscribe" in the subject line . . .

You know the saying, "It's not real unless it's on TV"? My saying is: It's not real until it's on spam.

Explaining that plot.

| 15 Comments

With some upgrades from a previous post.
[Image: PoliticEst.png]

And with a hopefully clear 40+ page draft paper (see page 16).
Drawing Inference - Literally and by Individual Contribution.pdf

Comments are welcome, though my responses may be delayed.
(Working on how to best render the graphs.)

K?
P.S. The plot was modified so that it might be better interpreted without reading any of the paper--though I would not suggest that; reading at least pages 1 to 17 is recommended.

I read this story by Adrian Chen on Gawker (yeah, yeah, so sue me):

Why That 'NASA Discovers Alien Life' Story Is Bullshit

Fox News has a super-exciting article today: "Exclusive: NASA Scientist claims Evidence of Alien Life on Meteorite." OMG, aliens exist! Except this NASA scientist has been claiming to have evidence of alien life on meteorites for years.

Chen continues with a quote from the Fox News item:

[NASA scientist Richard B. Hoover] gave FoxNews.com early access to the out-of-this-world research, published late Friday evening in the March edition of the Journal of Cosmology. In it, Hoover describes the latest findings in his study of an extremely rare class of meteorites, called CI1 carbonaceous chondrites -- only nine such meteorites are known to exist on Earth. . . .

The bad news is that Hoover reported this same sort of finding in various low-rent venues for several years. Replication, huh? Chen also helpfully points us to the website of the Journal of Cosmology, which boasts that it "now receives over 500,000 Hits per month, which makes the Journal of Cosmology among the top online science journals."

So where's the statistics?

OK, fine. So far, so boring. Maybe it's even a bit mean to make fun of this crank scientist guy. The guy is apparently an excellent instrumentation engineer who is subject to wishful thinking. Put this together with tabloid journalism and you can get as many silly headlines as you'd like.

The statistics connection is that Hoover has repeatedly made similar unverified claims in this area, which suggests that this current work is not to be trusted. When someone has a track record of mistakes (or, at best, claims not backed up by additional evidence), this is bad news.

In this case, we have prior information. But not the usual form of prior information on the parameter of interest, theta. Rather, we have prior information on the likelihood--that is, on the model that relates the data to the underlying phenomenon of interest. Any data comes with an implicit model of measurement error--the idea that what you're measuring isn't quite what you want to be measuring. Once we see a researcher or research group having a track record of strong but unsupported claims, this gives us some information about that measurement-error model.

In this particular example, I can't see a good way of quantifying this prior information or this measurement-error model (or any reason to do so). More generally, though, I think there's an important point here. People spend so much time arguing about their priors but often there are big problems with the likelihood. (Consider, for example, the recent ESP studies.)

Jonathan Livengood writes:

I have a couple of questions on your paper with Cosma Shalizi on "Philosophy and the practice of Bayesian statistics."

First, you distinguish between inductive approaches and hypothetico-deductive approaches to inference and locate statistical practice (at least, the practice of model building and checking) on the hypothetico-deductive side. Do you think that there are any interesting elements of statistical practice that are properly inductive? For example, suppose someone is playing around with a system that more or less resembles a toy model, like drawing balls from an urn or some such, and where the person has some well-defined priors. The person makes a number of draws from the urn and applies Bayes theorem to get a posterior. On your view, is that person making an induction? If so, how much space is there in statistical practice for genuine inductions like this?

Second, I agree with you that one ought to distinguish induction from other kinds of risky inference, but I'm not sure that I see a clear payoff from making the distinction. I'm worried because a lot of smart philosophers just don't distinguish "inductive" inferences from "risky" inferences. One reason (I think) is that they have in mind Hume's problem of induction. (Set aside whether Hume ever actually raised such a problem.) Famously, Popper claimed that falsificationism solves Hume's problem. In a compelling (I think) rejoinder, Wes Salmon points out that if you want to do anything with a scientific theory (or a statistical model), then you need to believe that it is going to make good predictions. But if that is right, then a model that survives attempts at falsification and then gets used to make predictions is still going to be open to a Humean attack. In that respect, then, hypothetico-deductivism isn't anti-inductivist after all. Rather, it's a variety of induction and suffers all the same difficulties as simple enumerative induction. So, I guess what I'd like to know is in what ways you think the philosophers are misled here. What is the value / motivation for distinguishing induction from hypothetico-deductive inference? Do you think there is any value to the distinction vis-a-vis Hume's problem? And what is your take on the dispute between Popper and Salmon?

I replied:

My short answer is that inductive inference of the balls-in-urns variety takes place within a model, and the deductive Popperian reasoning takes place when evaluating a model. Beyond this, I'm not so familiar with the philosophy literature. I think of "Popper" more as a totem than as an actual person or body of work. Finally, I recognize that my philosophy, like Popper's, does not say much about where models come from. Crudely speaking, I think of models as a language, with models created in the same way that we create sentences, by working with recursive structures. But I don't really have anything formal to say on the topic.

Livengood then wrote:

The part of Salmon's writing that I had in mind is his Foundations of Scientific Inference. See especially Section 3 on deductivism, starting on page 21.

Let me just press a little bit so that I am sure I'm understanding the proposal. When you say that inductive inference takes place within a model, are you claiming that an inductive inference is justified just to the extent that the model within which the induction takes place is justified (or approximately correct or some such -- I know you won't say "true" here ...)? If so, then under what conditions do you think a model is justified? That is, under what conditions do you think one is justified in making *predictions* on the basis of a model?

My reply:

No model will perform well for every kind of prediction. For any particular kind of prediction, we can use posterior predictive checks and related ideas such as cross-validation to see if the model performs well on these dimensions of interest. There will (almost) always be some assumptions required, some sense in which any prediction is conditional on something. Stepping back a bit, I'd say that scientists get experience with certain models, they work well for prediction until they don't. For an example from my own research, consider opinion polling. Those survey estimates you see in the newspapers are conditional on all sorts of assumptions. Different assumptions get checked at different times, often after some embarrassing failure.

John Salvatier writes:

What do you and your readers think are the trickiest models to fit? If I had an algorithm that I claimed could fit many models with little fuss, what kinds of models would really impress you? I am interested in testing different MCMC sampling methods to evaluate their performance and I want to stretch the bounds of their abilities.

I don't know what's the trickiest, but just about anything I work on in a serious way gives me some troubles. This reminds me that we should finish our Bayesian Benchmarks paper already.

Mike Grosskopf writes:

Christian points me to this amusing story by Jonah Lehrer about Mohan Srivastava (perhaps the same person as R. Mohan Srivastava, coauthor of a book called Applied Geostatistics), who discovered a flaw in a scratch-off game in which he could figure out which tickets were likely to win based on partial information visible on the ticket. It appears that scratch-off lotteries elsewhere have similar flaws in their design.

The obvious question is, why doesn't the lottery create the patterns on the tickets (including which "teaser" numbers to reveal) completely at random? It shouldn't be hard to design this so that zero information is supplied from the outside, in which case Srivastava's trick would be impossible.

So why not put down the numbers randomly? Lehrer quotes Srivastava as saying:

The tickets are clearly mass-produced, which means there must be some computer program that lays down the numbers. Of course, it would be really nice if the computer could just spit out random digits. But that's not possible, since the lottery corporation needs to control the number of winning tickets. The game can't be truly random. Instead, it has to generate the illusion of randomness while actually being carefully determined.

I'd phrase this slightly differently. We're talking about $3 payoffs here, so, no, the corporation does not need to control the number of winning tickets. What they do need to control is the probability of a win, but that can be done using a completely random algorithm.

From reading the article, I think the real reason the winning tickets could be predicted is that the lottery tickets were designed to be misleadingly appealing. Lehrer writes:

Instead of just scratching off the latex and immediately discovering a loser, players have to spend time matching up the revealed numbers with the boards. Ticket designers fill the cards with near-misses (two-in-a-row matchups instead of the necessary three) and players spend tantalizing seconds looking for their win. No wonder players get hooked.

"Ticket designers fill the cards with near-misses . . .": This doesn't sound like they're just slapping down random numbers. Instead, the system seems to be rigged in the fashion of old-time carnival games in order to manipulate one's intuition that the probability of near-misses should be informative about the underlying probability of hits. (See here for some general discussion of the use of precursors to estimate the probability of extremely rare events.)

In this sense, the story is slightly more interesting than "Lottery designers made a mistake." The mistake they made is directly connected to the manipulations they make in order to sucker people into spending more money.

P.S. Lehrer writes that Srivastava does consulting. This news story should get him all the business he needs for awhile!

This post is an (unpaid) advertisement for the following extremely useful resource:

  • Petersen, K. B. and M. S. Pedersen. 2008. The Matrix Cookbook. Technical Report, Technical University of Denmark.

It contains 70+ pages of useful relations and derivations involving matrices. What grabbed my eye was the computation of gradients for matrix operations ranging from eigenvalues and determinants to multivariate normal density functions. I had no idea the multivariate normal had such a clean gradient (see section 8).
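For reference, the sort of result I have in mind looks like this (standard identities of the kind collected in that section, for x ~ N(mu, Sigma); the Sigma derivative is written in the unconstrained form, before any symmetrization):

```latex
\log p(x \mid \mu, \Sigma)
  = -\tfrac{1}{2}\log\det(2\pi\Sigma)
    - \tfrac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu),
\qquad
\frac{\partial \log p}{\partial \mu} = \Sigma^{-1}(x-\mu),
\qquad
\frac{\partial \log p}{\partial \Sigma}
  = -\tfrac{1}{2}\left(\Sigma^{-1}
    - \Sigma^{-1}(x-\mu)(x-\mu)^\top \Sigma^{-1}\right).
```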

Mike McLaughlin writes:

Consider the Seeds example in vol. 1 of the BUGS examples. There, a binomial likelihood has a p parameter constructed, via logit, from two covariates. What I am wondering is: Would it be legitimate, in a binomial + logit problem like this, to allow binomial p[i] to be a function of the corresponding n[i] or would that amount to using the data in the prior? In other words, in the context of the Seeds example, is r[] the only data or is n[] data as well and therefore not permissible in a prior formulation?

I [McLaughlin] currently have a model with a common beta prior for all p[i] but would like to mitigate this commonality (a kind of James-Stein effect) when there are lots of observations for some i. But this seems to feed the data back into the prior. Does it really?

It also occurs to me [McLaughlin] that, perhaps, a binomial likelihood is not the one to use here (not flexible enough).

My reply:

Strictly speaking, "n" is data, and so what you want is a likelihood function p(y,n|theta), where theta represents all the parameters in the model. In a binomial-type example, it would make sense to factor the likelihood as p(y|n,theta)*p(n|theta). Or, to make this even clearer: p(y|n,theta_1)*p(n|theta_2), where theta_1 are the parameters of the binomial distribution (or whatever generalization you're using) and theta_2 are the parameters involving n. The vectors theta_1 and theta_2 can overlap. In any case, the next step is the prior distribution, p(theta_1,theta_2). Prior dependence between theta_1 and theta_2 induces a model of the form that you're talking about.

In practice, I think it can be reasonable to simplify a bit and write p(y|n,theta) and then use a prior of the form p(theta|n). We discuss this sort of thing in the first or second section of the regression chapter in BDA. Whether you treat n as data to be modeled or data to be conditioned on, either way you can put dependence with theta into the model.
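As a concrete sketch of the simpler route--conditioning on n and letting the prior depend on it--here is a toy log-posterior in the spirit of the Seeds setup. The particular functional form of the n-dependence (a prior scale that widens with log n[i], weakening the pooling for well-observed groups) is entirely made up for illustration:

```r
# Hypothetical data: successes r[i] out of n[i] trials in each group i.
# Parameters: logit-scale group effects a[1..I], common mean mu, scale tau.
# The likelihood conditions on n; the prior for a[i] is allowed to depend on n[i].
log_post <- function(a, mu, tau, r, n) {
  p <- plogis(a)                                              # group-level probabilities
  loglik <- sum(dbinom(r, size = n, prob = p, log = TRUE))    # p(y | n, theta)
  prior_sd <- tau * sqrt(1 + log(n))                          # p(theta | n): wider prior, less pooling, for large n
  logprior <- sum(dnorm(a, mean = mu, sd = prior_sd, log = TRUE)) +
    dnorm(mu, 0, 5, log = TRUE) + dexp(tau, 1, log = TRUE)
  loglik + logprior
}
```

The point of the sketch is only the factorization: the data enter through p(y | n, theta), and the dependence on n that McLaughlin wants sits in the prior p(theta | n) rather than being fed back into the likelihood.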

Andrew Gelman (Columbia University) and Eric Johnson (Columbia University) seek to hire a post-doctoral fellow to work on the application of the latest methods of multilevel data analysis, visualization and regression modeling to an important commercial problem: forecasting retail sales at the individual item level. These forecasts are used to make ordering, pricing, and promotion decisions, which can have significant economic impact on the retail chain: even modest improvements in the accuracy of predictions, across a large retailer's product line, can yield substantial margin improvements.

Activities would include model development, programming, and data analysis. This project is to be undertaken with, and largely funded by, a firm that provides forecasting technology and services to large retail chains, and which will provide access to a unique and rich set of proprietary data. The postdoc will be expected to spend some time working directly with this firm, but this is fundamentally a research position.

The ideal candidate will have a background in statistics, psychometrics, or economics and be interested in marketing or related topics. He or she should be able to work fluently in R and should already know about hierarchical models and Bayesian inference and computation.

The successful candidate will become part of the lively Applied Statistics Center community, which includes several postdocs (with varied backgrounds in statistics, computer science, and social science), Ph.D., M.A., and undergraduate students, and faculty at Columbia and elsewhere. We want people who love collaboration and have the imagination, drive, and technical skills to make a difference in our projects.

If you are interested in this position, please send a letter of application, a CV, some of your articles, and three letters of recommendation to the Applied Statistics Center coordinator, Caroline Peters, cp2530@columbia.edu. Review of applications will begin immediately.

Andrew Gelman (Columbia University) and Jennifer Hill (New York University) seek to hire a post-doctoral fellow to work on development of iterative imputation algorithms and diagnostics for missing-data imputation. Activities would include model development, programming, and data analysis. This project is funded by a grant from the Institute of Education Sciences. Other collaborators on the project include Jingchen Liu and Ben Goodrich. This is a two-year position that would start this summer (2011) or earlier if possible.

The ideal candidate will have a statistics or computer science background, will be interested in statistical modeling, serious programming, and applications. He or she should be able to work fluently in R and should already know about hierarchical models and Bayesian inference and computation. Experience or interest in designing GUIs would be a bonus, since we are attempting to improve our GUI for missing data imputation.

The successful candidate will become part of the lively Applied Statistics Center community, which includes several postdocs (with varied backgrounds in statistics, computer science, and social science), Ph.D., M.A., and undergraduate students, and faculty at Columbia and elsewhere. We want people who love collaboration and have the imagination, drive, and technical skills to make a difference in our projects.

If you are interested in this position, please send a letter of application, a CV, some of your articles, and three letters of recommendation to the Applied Statistics Center coordinator, Caroline Peters, cp2530@columbia.edu. Review of applications will begin immediately.

Bayes at the end

| 21 Comments

John Cook noticed something:

I [Cook] was looking at the preface of an old statistics book and read this:
The Bayesian techniques occur at the end of each chapter; therefore they can be omitted if time does not permit their inclusion.

This approach is typical. Many textbooks present frequentist statistics with a little Bayesian statistics at the end of each section or at the end of the book.

There are a couple ways to look at that. One is simply that Bayesian methods are optional. They must not be that important or they'd get more space. The author even recommends dropping them if pressed for time.

Another way to look at this is that Bayesian statistics must be simpler than frequentist statistics since the Bayesian approach to each task requires fewer pages.

My reaction:

Classical statistics is all about summarizing the data.

Bayesian statistics is data + prior information.

On those grounds alone, Bayes is more complicated, and it makes sense to do classical statistics first. Not necessarily p-values etc., but estimates, standard errors, and confidence intervals for sure.

Bill Harris writes:

I've read your paper and presentation showing why you don't usually worry about multiple comparisons. I see how that applies when you are comparing results across multiple settings (states, etc.).

Does the same principle hold when you are exploring data to find interesting relationships? For example, you have some data, and you're trying a series of models to see which gives you the most useful insight. Do you try your models on a subset of the data so you have another subset for confirmatory analysis later, or do you simply throw all the data against your models?

My reply: I'd like to estimate all the relationships at once and use a multilevel model to do partial pooling to handle the multiplicity issues. That said, in practice, in my applied work I'm always bouncing back and forth between different hypotheses and different datasets, and often I learn a lot when next year's data come in and I can modify my hypotheses. The trouble with the classical hypothesis-testing framework, at least for me, is that so-called statistical hypotheses are very precise things, whereas the sorts of hypotheses that arise in science and social science are vaguer and are not so amenable to "testing" in the classical sense.
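For readers who want to see what "estimate all the relationships at once" can look like, here is a toy sketch (simulated data, made-up variable names) using lmer to partially pool slopes across settings rather than testing each one separately:

```r
library(lme4)
set.seed(2011)

J <- 20                                   # number of settings (states, sites, ...)
n_per <- 30
d <- data.frame(setting = factor(rep(1:J, each = n_per)),
                x = rnorm(J * n_per))
true_slope <- rnorm(J, mean = 0.2, sd = 0.1)      # small, varying true effects
d$y <- true_slope[d$setting] * d$x + rnorm(nrow(d))

# Varying-intercept, varying-slope model: all settings estimated at once,
# with partial pooling handling the multiplicity.
fit <- lmer(y ~ x + (1 + x | setting), data = d)
coef(fit)$setting                         # partially pooled slopes, one per setting
```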

After reading all the comments here I remembered that I've actually written a paper on the generalized method of moments--including the bit about maximum likelihood being a special case. The basic idea is simple enough that it must have been rediscovered dozens of times by different people (sort of like the trapezoidal rule).

In our case, we were motivated to (independently) develop the (well-known, but not by me) generalized method of moments as a way of specifying an indirectly-parameterized prior distribution, rather than as a way of estimating parameters from direct data. But the math is the same.

Bayes in China update

| 14 Comments

Some clarification on the Bayes-in-China issue raised last week:

1. We heard that the Chinese publisher cited the following pages that might contain politically objectionable materials: 3, 5, 21, 73, 112, 201.

2. It appears that, as some commenters suggested, the objection was to some of the applications, not to the Bayesian methods.

3. Our book is not censored in China. In fact, as some commenters mentioned, it is possible to buy it there, and it is also available in university libraries there. The edition of the book which was canceled was intended to be a low-cost reprint of the book. The original book is still available. I used the phrase "Banned in China" as a joke and I apologize if it was misinterpreted.

4. I have no quarrel with the Chinese government or with any Chinese publishers. They can publish whatever books they would like. I found this episode amusing only because I do not think my book on regression and multilevel models has any strong political content. I suspect the publisher was being unnecessarily sensitive to potentially objectionable material, but this is their call. I thought this was an interesting story (which is why I posted the original email on the blog) but I did not, and do not, intend it as any sort of comment on the Chinese government, Chinese society, etc. China is a big country and this is one person at one publisher making one decision. That's all it is; it's not a statement about China in general.

I did not write the above out of any fear of legal action etc. I just think it's important to be fair and clear, and it is possible that some of what I wrote could have been misinterpreted in translation. If anyone has further questions on this, feel free to ask in the comments and I will clarify as best as I can.

Benedict Carey writes a follow-up article on ESP studies and Bayesian statistics. (See here for my previous thoughts on the topic.) Everything Carey writes is fine, and he even uses an example I recommended:

The statistical approach that has dominated the social sciences for almost a century is called significance testing. The idea is straightforward. A finding from any well-designed study -- say, a correlation between a personality trait and the risk of depression -- is considered "significant" if its probability of occurring by chance is less than 5 percent.

This arbitrary cutoff makes sense when the effect being studied is a large one -- for example, when measuring the so-called Stroop effect. This effect predicts that naming the color of a word is faster and more accurate when the word and color match ("red" in red letters) than when they do not ("red" in blue letters), and is very strong in almost everyone.

"But if the true effect of what you are measuring is small," said Andrew Gelman, a professor of statistics and political science at Columbia University, "then by necessity anything you discover is going to be an overestimate" of that effect.

The above description of classical hypothesis testing isn't bad. Strictly speaking, one would follow "is less than 5 percent" above with "if the null hypothesis of zero effect were actually true," but they have serious space limitations, and I doubt many readers would get much out of that elaboration, so I'm happy with what Carey put there.

One subtlety that he didn't quite catch was the way that researchers mix the Neyman-Pearson and Fisher approaches to inference. The 5% cutoff (associated with Neyman and Pearson) is indeed standard, and it is indeed subject to all the problems we know about, most simply that statistical significance occurs at least 5% of the time, so if you do a lot of experiments you're gonna have a lot of chances to find statistical significance. But p-values are also used as a measure of evidence: that's Fisher's approach and it leads to its own problems (as discussed in the news article as well).

The other problem, which is not so well known, comes up in my quote: when you're studying small effects and you use statistical significance as a filter and don't do any partial pooling, whatever you have that's left standing that survives the filtering process will overestimate the true effect. And classical corrections for "multiple comparisons" do not solve the problem: they merely create a more rigorous statistical significance filter, but anything that survives that filter will be even more of an overestimate.

If classical hypothesis testing is so horrible, how is it that it could be so popular? In particular, what was going on when a well-respected researcher like this ESP guy used inappropriate statistical methods?

My answer to Carey was to give a sort of sociological story, which went as follows.

Psychologists have experience studying large effects, the sort of study in which data from 24 participants is enough to estimate a main effect and 50 will be enough to estimate interactions of interest. I gave the example of the Stroop effect (they have a nice one of those on display right now at the Natural History Museum) as an example of a large effect where classical statistics will do just fine.

My point was, if you've gone your whole career studying large effects with methods that work, then it's natural to think you have great methods. You might not realize that your methods, which appear quite general, actually fall apart when applied to small effects. Such as ESP or human sex ratios.

The ESP dude was a victim of his own success: His past accomplishments studying large effects gave him an unwarranted feeling of confidence that his methods would work on small effects.

This sort of thing comes up a lot, and in my recent discussion of Efron's article, I list it as my second meta-principle of statistics, the "methodological attribution problem," which is that people think that methods that work in one sort of problem will work in others.

The other thing that Carey didn't have the space to include was that Bayes is not just about estimating the weight of evidence in favor of a hypothesis. The other key part of Bayesian inference--the more important part, I'd argue--is "shrinkage" or "partial pooling," in which estimates get pooled toward zero (or, more generally, toward their estimates based on external information).

Shrinkage is key, because if all you use is a statistical significance filter--or even a Bayes factor filter--when all is said and done, you'll still be left with overestimates. Whatever filter you use--whatever rule you use to decide whether something is worth publishing--I still want to see some modeling and shrinkage (or, at least, some retrospective power analysis) to handle the overestimation problem. This is something Martin and I discussed in our discussion of the "voodoo correlations" paper of Vul et al.

Should the paper have been published in a top psychology journal?

Real-life psychology researcher Tal Yarkoni adds some good thoughts but then he makes the ridiculous (to me) juxtaposition of the following two claims: (1) The ESP study didn't find anything real, there's no such thing as ESP, and the study suffered many methodological flaws, and (2) The journal was right to publish the paper.

If you start with (1), I don't see how you get to (2). I mean, sure, Yarkoni gives his reasons (basically, the claim that the ESP paper, while somewhat crappy, is no crappier than most papers that are published in top psychology journals), but I don't buy it. If the effect is there, why not have them demonstrate it for real? I mean, how hard would it be for the experimenters to gather more data, do some sifting, find out which subjects are good at ESP, etc.? There's no rush, right? No need to publish preliminary, barely-statistically-significant findings. I don't see what's wrong with the journal asking for better evidence. It's not like a study of the democratic or capitalistic peace, where you have a fixed amount of data and you have to learn what you can. In experimental psychology, once you have the experiment set up, it's practically free to gather more data.

P.S. One thing that saddens me is that, instead of using the sex-ratio example (which I think would've been perfect for this article), Carey uses the following completely fake example:

Consider the following experiment. Suppose there was reason to believe that a coin was slightly weighted toward heads. In a test, the coin comes up heads 527 times out of 1,000.

And then he goes on to write about coin flipping. But, as I showed in my article with Deb, there is no such thing as a coin weighted to have a probability p (different from 1/2) of heads.

OK, I know about fake examples. I'm writing an intro textbook, and I know that fake examples can be great. But not this one!

P.P.S. I'm also disappointed he didn't use the famous dead-fish example, where Bennett, Baird, Miller, and Wolford found statistically significant correlations in an MRI of a dead salmon. The correlations were not only statistically significant, they were large and newsworthy!

P.P.P.S. The Times does this weird thing with its articles where it puts auto-links on Duke University, Columbia University, and the University of Missouri. I find this a bit distracting and unprofessional.

John Talbott points me to this, which I briefly mocked a couple months ago. I largely agree with the critics of this research, but I want to reiterate my point from earlier that all the statistical sophistication in the world won't help you if you're studying a null effect. This is not to say that the actual effect is zero--who am I to say?--just that the comments about the high-quality statistics in the article don't say much to me.

There's lots of discussion of the lack of science underlying ESP claims. I can't offer anything useful on that account (not being a psychologist, I could imagine all sorts of stories about brain waves or whatever), but I would like to point out something that usually doesn't seem to get mentioned in these discussions, which is that lots of people want to believe in ESP. After all, it would be cool to read minds. (It wouldn't be so cool, maybe, if other people could read your mind and you couldn't read theirs, but I suspect most people don't think of it that way.) And ESP seems so plausible, in a wish-fulfilling sort of way. It really feels like if you concentrate really hard, you can read minds, or predict the future, or whatever. Heck, when I play squash I always feel that if I really really try hard, I should be able to win every point. The only thing that stops me from really believing this is that I realize that the same logic holds symmetrically for my opponent. But with ESP, absent a controlled study, it's easy to see evidence all around you supporting your wishful thinking. (See my quote in bold here.) Recall the experiments reported by Ellen Langer, that people would shake their dice more forcefully when trying to roll high numbers and would roll gently when going for low numbers.

When I was a little kid, it was pretty intuitive to believe that if I really tried, I could fly like Superman. There, of course, was abundant evidence--many crashes in the backyard--that it wouldn't work. For something as vague as ESP, that sort of simple test isn't there. And ESP researchers know this--they use good statistics--but it doesn't remove the element of wishful thinking. And, as David Weakliem and I have discussed, classical statistical methods that work reasonably well when studying moderate or large effects (see the work of Fisher, Snedecor, Cochran, etc.) fall apart in the presence of small effects.

I think it's naive when people implicitly assume that, because a study's claims are incorrect, its statistical methods must be weak. Generally, the smaller the effects you're studying, the better the statistics you need. ESP is a field of small effects, and so ESP researchers use high-quality statistics.

To put it another way: whatever methodological errors happen to be in the paper in question probably occur in lots of research papers in "legitimate" psychology research. The difference is that when you're studying a large, robust phenomenon, little statistical errors won't be so damaging as in a study of a fragile, possibly zero effect.

In some ways, there's an analogy to the difficulties of using surveys to estimate small proportions, in which case misclassification errors can loom large, as discussed here.

Now to criticize the critics: some so-called Bayesian analysis that I don't really like

I agree with the critics of the ESP paper that Bayesian analysis is a good way to combine the results of this not-so-exciting new finding that people in the study got 53% correct instead of the expected 50% correct, with the long history of research in this area.

But I wouldn't use the Bayesian methods that these critics recommend. In particular, I think it's ludicrous for Wagenmakers et al. to claim a prior probability of 10^-20 for ESP, and I also think that they're way off base when they start talking about "Bayesian t-tests" and point null hypotheses. I think a formulation based on measurement-error models would be far more useful. I'm very disturbed by purportedly Bayesian methods that start with meaningless priors which then yield posterior probabilities that, instead of being interpreted quantitatively, have to be converted to made-up categories such as "extreme evidence," "very strong evidence," "anecdotal evidence," and the like. This seems to me to import some of the most arbitrary aspects of classical statistics. Perhaps I should call this the "no true Bayesian" phenomenon.

And, if you know me at all (in a professional capacity), you'll know I hate statements like this:

Another advantage of the Bayesian test is that it is consistent: as the number of participants grows large, the probability of discovering the true hypothesis approaches 1.

The "true hypothesis," huh? I have to go to bed now (no, I'm not going to bed at 9am; I set this blog up to post entries automatically every morning). If you happen to run into an experiment of interest in which psychologists are "discovering a true hypothesis," (in the statistical sense of a precise model), feel free to wake me up and tell me. It'll be newsworthy, that's for sure.

Anyway, the ESP thing is pretty silly and so there are lots of ways of shooting it down. I'm only picking on Wagenmakers et al. because often we're full of uncertainty about more interesting problems. For example, new educational strategies and their effects on different sorts of students. For these sorts of problems, I don't think that models of null effects, verbal characterizations of Bayes factors, and reassurances about "discovering the true hypothesis" are going to cut it. These problems are important, and I think that, even when criticizing silly studies, we should think carefully about what we're doing and what our methods are actually purporting to do.

Type S error: When your estimate is the wrong sign, compared to the true value of the parameter

Type M error: When the magnitude of your estimate is far off, compared to the true value of the parameter

More here.
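Here's a small simulation sketch of both quantities (the numbers are arbitrary, and I follow the usual convention of evaluating them conditional on the estimate reaching statistical significance):

```r
set.seed(1)
true_effect <- 0.1
se <- 0.5                                  # noisy study: se much bigger than the effect
est <- rnorm(1e5, mean = true_effect, sd = se)   # replications of the estimate
signif <- abs(est) > 1.96 * se             # the statistical-significance filter

mean(sign(est[signif]) != sign(true_effect))   # Type S rate: wrong sign, given significance
mean(abs(est[signif])) / true_effect           # Type M: average exaggeration factor, given significance
```

With these made-up numbers, a nontrivial fraction of the statistically significant estimates come out with the wrong sign, and on average they overstate the magnitude of the true effect by a large factor.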

I've become increasingly uncomfortable with the term "confidence interval," for several reasons:

- The well-known difficulties in interpretation (officially the confidence statement can be interpreted only on average, but people typically implicitly give the Bayesian interpretation to each case),

- The ambiguity between confidence intervals and predictive intervals. (See the footnote in BDA where we discuss the difference between "inference" and "prediction" in the classical framework.)

- The awkwardness of explaining that confidence intervals are big in noisy situations where you have less confidence, and confidence intervals are small when you have more confidence.

So here's my proposal. Let's use the term "uncertainty interval" instead. The uncertainty interval tells you how much uncertainty you have. That works pretty well, I think.

P.S. As of this writing, "confidence interval" outGoogles "uncertainty interval" by the huge margin of 9.5 million to 54000. So we have a ways to go.

WWJD? U can find out!

| 1 Comment

Two positions open in the statistics group at the NYU education school. If you get the job, you get to work with Jennifer Hill!

One position is a postdoctoral fellowship, and the other is a visiting professorship. The latter position requires "the demonstrated ability to develop a nationally recognized research program," which seems like a lot to ask for a visiting professor. Do they expect the visiting prof to develop a nationally recognized research program and then leave it there at NYU after the visit is over?

In any case, Jennifer and her colleagues are doing excellent work, both applied and methodological, and this seems like a great opportunity.

Word count stats from the Google books database prove that Bayesianism is expanding faster than the universe.

[Image: Screen shot 2010-12-19 at 4.27.41 PM.png]

An n-gram is a tuple of n words.

Giorgio Corani writes:

Your work on weakly informative priors is close to some research I [Corani] did (together with Prof. Zaffalon) in recent years using so-called imprecise probabilities. The idea is to work with a set of priors (containing even very different priors), to update them via Bayes' rule, and then to compute a set of posteriors.

The set of priors is convex and the priors are Dirichlet (thus, conjugate to the likelihood); this allows one to compute the set of posteriors exactly and efficiently.

I [Corani] have used this approach for classification, extending naive Bayes and TAN to imprecise probabilities. Classifiers based on imprecise probabilities return more classes when they find that the most probable class is prior-dependent, i.e., if picking different priors in the convex set leads to identifying different classes as the most probable one. Instead of returning a single (unreliable) prior-dependent class, credal classifiers in this case preserve reliability by issuing a set-valued classification. In experiments, we have consistently found that Bayesian classifiers are unreliable on the instances which are classified in an indeterminate way (but reliably) by our classifiers.

This looks potentially interesting. It's not an approach I've ever thought about much (nor do I really have the time to think about it now, unfortunately), but I thought I'd post the link to some papers for those of you who might be interested. Imprecise priors have been proposed as a method for encoding weak prior information, so perhaps there is something important there.
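To give a flavor of the idea (my own minimal sketch, not Corani and Zaffalon's actual method): for a single binomial proportion, an imprecise-Dirichlet-style analysis keeps a whole set of Beta(s*t, s*(1-t)) priors, with a fixed prior strength s and t ranging over (0, 1), and reports the interval of posterior means rather than a single number.

```r
# Hypothetical data: k successes in n trials.
k <- 3; n <- 10
s <- 2                                    # prior "strength" (a choice; part of the sketch)
t <- seq(0.001, 0.999, length.out = 999)  # sweep over the set of priors Beta(s*t, s*(1-t))

post_mean <- (k + s * t) / (n + s)        # posterior mean under each prior in the set
range(post_mean)                          # interval of posterior means, roughly [k/(n+s), (k+s)/(n+s)]

# A credal classifier compares intervals like this across classes and returns a set
# of classes whenever the ranking of posterior probabilities is prior-dependent.
```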

My recent article with Xian and Judith. In English.

Interested readers can try to figure out which parts were written by each of the three authors (recognizing that each of us edited the whole thing).

Aleks points me to this research summary from Dan Goldstein. Good stuff. I've heard of a lot of this--I actually use some of it in my intro statistics course, when we show the students how they can express probability trees using frequencies--but it's good to see it all in one place.

Diabetes stops at the state line?

| 7 Comments

From Discover:

[Map from Discover: county-level diabetes prevalence, with visible discontinuities at state lines.]

Razib Khan asks:

But follow the gradient from El Paso to the Illinois-Missouri border. The differences are small across state lines, but the consistent differences along the borders really don't make sense. Are there state-level policies or regulations causing this? Or are there state-level differences in measurement? This weird pattern shows up in other CDC data I've seen.

It turns out that the CDC isn't providing data; they're providing a model. Frank Howland answered:

I suspect the answer has to do with the manner in which the county estimates are produced. I went to the original data source, the CDC, and then to the relevant FAQ.

There they say that the diabetes prevalence estimates come from the "CDC's Behavioral Risk Factor Surveillance System (BRFSS) and data from the U.S. Census Bureau's Population Estimates Program. The BRFSS is an ongoing, monthly, state-based telephone survey of the adult population. The survey provides state-specific information"

So the CDC then uses a complicated statistical procedure ("indirect model-dependent estimates" using Bayesian techniques and multilevel Poisson regression models) to go from state to county prevalence estimates. My hunch is that the state level averages thereby affect the county estimates. The FAQ in fact says "State is included as a county-level covariate."

I'd prefer to have real data, not a model. I'd do the model myself, thank you. Data itself is tricky enough, as J. Stamp said.
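Just to spell out how state-level pooling could produce exactly this artifact, here's a rough sketch in R of the flavor of model the FAQ seems to be describing (the data frame and variable names are hypothetical, and the CDC's actual procedure is surely more elaborate):

library(lme4)
# County counts of diagnosed diabetes, with an offset for adult population
# and varying intercepts for state ("state as a county-level covariate").
fit <- glmer(cases ~ offset(log(adults)) + (1 | state),
             family = poisson, data = counties)

The county estimates get partially pooled toward their state's level, which is one way you can get sharp jumps at state lines even if the underlying truth is smooth.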

Biostatistics via Pragmatic and Perceptive Bayes.

| No Comments

This conference touches nicely on many of the more biostatistics-related topics that have come up on this blog, from a pragmatic and perceptive Bayesian perspective.

Fourth Annual Bayesian Biostatistics Conference

Including the star of that recent Cochrane TV debate, who will be the keynote speaker.

See here: Subtle statistical issues to be debated on TV, and perhaps the last comment, which is my personal take on that debate.

Reruns are still available here: http://justin.tv/cochranetv/b/272278382

K?

Xuequn Hu writes:

I am an econ doctoral student trying to do some empirical work using Bayesian methods. Recently I read a paper (and its discussion) that pitches Bayesian methods against GMM (Generalized Method of Moments), which is quite popular among frequentist econometricians. I am wondering if you can, here or on your blog, give some insights about these two methods from the perspective of a Bayesian statistician. I know GMM does not conform to the likelihood principle, but Bayesians are often charged with strong distributional assumptions.

I can't actually help on this, since I don't know what GMM is. My guess is that, like other methods that don't explicitly use prior estimation, this method will work well if sufficient information is included as data. Which would imply a hierarchical structure.

Seth sent along an article (not by him) from the psychology literature and wrote:

This is a good example of your complaint about statistical significance. The authors want to say that the predictability of information determines how distracting something is, and they have two conditions that vary in predictability. One is significantly distracting, the other isn't. But the two conditions are not significantly different from each other, so the evidence that the conditions differ is weaker than p = 0.05.

I don't think the reviewers failed to notice this. They just thought it should be published anyway, is my guess.

To me, the interesting question is: where should the bar be? At p = 0.05? At p = 0.10? Something else? How can we figure out where to put the bar?

I replied:

My quick answer is that we have to get away from .05 and .10 and move to something that takes into account prior information. This could be Bayesian (of course) or could be done classically using power calculations, as discussed in this article.
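To spell out Seth's point with made-up numbers: one comparison can be "significant" and the other not, while the difference between them is nowhere near significant.

est1 <- 0.50; se1 <- 0.20   # condition 1: z = 2.5, p < 0.05
est2 <- 0.25; se2 <- 0.20   # condition 2: z = 1.25, not significant
z_diff <- (est1 - est2) / sqrt(se1^2 + se2^2)
2 * pnorm(-abs(z_diff))     # about 0.38: the comparison that actually matters

The difference between "significant" and "not significant" is not itself statistically significant.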

Scott Berry, Brad Carlin, Jack Lee, and Peter Muller recently came out with a book with the above title.

The book packs a lot into its 280 pages and is fun to read as well (even if they do use the word "modalities" in their first paragraph, and later on they use the phrase "DIC criterion," which upsets my tidy, logical mind). The book starts off fast on page 1 and never lets go.

Clinical trials are a big part of statistics and it's cool to see the topic taken seriously and being treated rigorously. (Here I'm not talking about empty mathematical rigor (or, should I say, "rigor"), so-called optimal designs and all that, but rather the rigor of applied statistics, mapping models to reality.)

Also I have a few technical suggestions.

1. The authors fit a lot of models in Bugs, which is fine, but they go overboard on the WinBUGS thing. There's WinBUGS, OpenBUGS, JAGS: they're all Bugs. The authors recommend running Bugs from R using the clunky BRugs interface rather than the smoother bugs() function, which has good defaults and conveniently returns graphical summaries and convergence diagnostics. The result is to get tangled in software complications and to distance the user from the statistical modeling.
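For comparison, here's the sort of call I have in mind (a sketch only; the data, inits, and model file are placeholders, not examples from the book):

library(R2WinBUGS)
sim <- bugs(data, inits, parameters.to.save = c("theta", "sigma"),
            model.file = "model.bug", n.chains = 3, n.iter = 2000)
plot(sim)    # graphical summary of inferences and convergence (R-hat)
print(sim)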

2. On page 61 they demonstrate an excellent graphical summary that reveals that, in a particular example, their posterior distribution is improper--or, strictly speaking, that the posterior depends strongly on the choice of an arbitrary truncation point in the prior distribution. But then they stick with the bad model! Huh? This doesn't seem like such a good idea.

3. They cover all of Bayesian inference in a couple chapters, which is fine--interested readers can learn the whole thing from the Carlin and Louis book--but in their haste they sometimes slip up. For example, from page 5:

Randomization minimizes the possibility of selection bias, and it tends to balance the treatment groups over covariates, both known and unknown. There are differences, however, in the Bayesian and frequentist views of randomization. In the latter, randomization serves as the basis for inference, whereas the basis for inference in the Bayesian approach is subjective probability, which does not require randomization.

I get their general drift but I don't agree completely. First, randomization is a basis for frequentist inference, but it's not fair to call it the basis. There's lots of frequentist inference for nonrandomized studies. Second, I agree that the basis for Bayesian inference is probability but I don't buy the "subjective" part (except to the extent that all science is subjective). Third, the above paragraph leaves out why a Bayesian would want to randomize. The basic reason is robustness, as we discuss in chapter 7 of BDA.

4. I was wondering what the authors would say about Sander Greenland's work on multiple-bias modeling. Greenland uses Bayesian methods and has thought a lot about bias and causal inference in practical medical settings. I looked up Greenland in the index and all I could find was one page, which referred to some of his more theoretical work:

Greenland, Lanes, and Jara (2008) explore the use of structural nested models and advocate what they call g-estimation, a form of test-based estimation adhering to the ITT principle and accommodating a semiparametric Cox partial likelihood.

Nothing on multiple-bias modeling. Also I didn't see any mention of this paper by John "no relation" Carlin and others. Finally, the above paragraph is a bit odd in that "test-based estimation" and "semiparametric Cox partial likelihood" are nowhere defined in the book (or, at least, I couldn't find them in the index). I mean, sure, the reader can google these things, but I'd really like to see these ideas presented in the context of the book.

5. The very last section covers subgroup analysis and then mentions multilevel models (the natural Bayesian approach to the problem) but then doesn't really follow through. They go into a long digression on decision analysis. That's fine, but I'd like to see a worked example of a multilevel model for subgroup analysis, instead of just the reference to Hodges et al. (2007).

In summary, I like this book and it left me wanting even more. I hope that everyone working on clinical trials reads it and that it has a large influence.

And, just to be clear, most of my criticisms above are of the form, "I like it and want more." In particular, my own books don't have anything to say on multiple-bias models, test-based estimation, semiparametric Cox partial likelihood, multilevel models for subgroup analysis, or various other topics I'm asking for elaboration on. As it stands, Berry, Carlin, Lee, and Muller have packed a lot into 280 pages.

John Salvatier pointed me to this blog on derivative-based MCMC algorithms (also sometimes called "hybrid" or "Hamiltonian" Monte Carlo) and automatic differentiation as the future of MCMC.

This all makes sense to me and is consistent both with my mathematical intuition from studying Metropolis algorithms and my experience with Matt using hybrid MCMC when fitting hierarchical spline models. In particular, I agree with Salvatier's point about the potential for computation of analytic derivatives of the log-density function. As long as we're mostly snapping together our models using analytically-simple pieces, the same part of the program that handles the computation of log-posterior densities should also be able to compute derivatives analytically.
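For readers who haven't seen these algorithms, here's the skeleton of a single Hamiltonian (hybrid) Monte Carlo update in R; it's a sketch only, with the log-posterior, its gradient, the step size, and the number of leapfrog steps all left as placeholders to be supplied:

hmc_step <- function(theta, log_post, grad_log_post, eps = 0.1, L = 20) {
  p <- rnorm(length(theta))                       # refresh the momentum
  current_H <- -log_post(theta) + sum(p^2) / 2
  th <- theta
  p <- p + (eps / 2) * grad_log_post(th)          # half step for momentum
  for (l in 1:L) {
    th <- th + eps * p                            # full step for position
    if (l < L) p <- p + eps * grad_log_post(th)   # full step for momentum
  }
  p <- p + (eps / 2) * grad_log_post(th)          # final half step
  proposed_H <- -log_post(th) + sum(p^2) / 2
  if (runif(1) < exp(current_H - proposed_H)) th else theta   # accept/reject
}

The gradient is the only ingredient beyond what a Metropolis algorithm needs, which is why automatic differentiation fits so naturally here.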

I've been a big fan of automatic derivative-based MCMC methods since I started hearing about them a couple years ago (I'm thinking of the DREAM project and of Mark Girolami's paper), and I too wonder why they haven't been used more. I guess we can try implementing them in our current project in which we're trying to fit models with deep interactions. I also suspect there are some underlying connections between derivative-based jumping rules and redundant parameterizations for hierarchical models.

It's funny. Salvatier is saying what I've been saying (not very convincingly) for a couple years. But, somehow, seeing it in somebody else's words makes it much more persuasive, and again I'm all excited about this stuff.

My only amendment to Salvatier's blog is that I wouldn't refer to these as "new" algorithms; they've been around for something like 25 years, I think.

Quote of the day

| 3 Comments

"A statistical model is usually taken to be summarized by a likelihood, or a likelihood and a prior distribution, but we go an extra step by noting that the parameters of a model are typically batched, and we take this batching as an essential part of the model."

I had the following email exchange with Jonathan DePeri, a reader of Bayesian Data Analysis.

I recently saw this article that Stephen Senn wrote a couple of years ago, criticizing Bayesian sensitivity analyses that relied on vague prior distributions. I'm moving more and more toward the idea that Bayesian analysis should include actual prior information, so I generally agree with his points. As I used to say when teaching Bayesian data analysis, a Bayesian model is modular, and different pieces can be swapped in and out as needed. So you might start with an extremely weak prior distribution, but if it makes a difference it's time to bite the bullet and include more information.

My only disagreement with Senn's paper is in its recommendation to try the so-called fixed-effects analysis. Beyond the difficulties with terminology (the expressions "fixed" and "random" effects are defined in different ways by different people in the literature; see here for a rant on the topic which made its way into some of my articles and books), there is the problem that, when a model gets complicated, some of the estimates that are called "fixed effects" get very noisy, especially in sparse data settings such as arise in logistic regression.

Bayes jumps the shark

| 5 Comments

John Goldin sends in this, from an interview with Alan Dershowitz:

Sam Seaver writes:

I'm a graduate student in computational biology, and I'm relatively new to advanced statistics, and am trying to teach myself how best to approach a problem I have.

My dataset is a small sparse matrix of 150 cases and 70 predictors; it is sparse as in many zeros, not many NAs. Each case is a nutrient that is fed into an in silico organism, and its response is whether or not it stimulates growth; each predictor is one of 70 different pathways that the nutrient may or may not belong to. Because the nutrients do not all belong to all of the pathways, there are many zeros in my matrix. My goal is to use the pathways themselves to predict whether or not a nutrient could stimulate growth, so I wanted to compute regression coefficients for each pathway, which I could then apply to other nutrients for other species.

There are quite a few singularities in the dataset (summary(glm) reports that 14 coefficients are not defined because of singularities), and I know which pathways (and some nutrients) I can remove because they are almost empty, but I would rather not, because these pathways may apply to other species. So I was wondering if there are complementary and/or alternative methods to logistic regression that would give me a coefficient of some kind for each pathway?

My reply:

If you have this kind of sparsity, I think you'll need to add some prior information or structure to your model. Our paper on bayesglm suggests a reasonable default prior, but it sounds to me like you'll have to go further.

To put it another way: give up the idea that you're estimating 70 distinct parameters. Instead, think of these coefficients as linked to each other in a complex web.

More generally, I don't think it ever makes sense to think of a problem with a lot of loose parameters. Hierarchical structure is key. One of our major research problems now is to set up general models for structured parameters, going beyond simple exchangeability.
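As a concrete starting point, here is the default prior from the bayesglm paper, written as a sketch (the data frame 'nutrients' and its column names are hypothetical):

library(arm)
fit <- bayesglm(growth ~ ., family = binomial, data = nutrients,
                prior.scale = 2.5, prior.df = 1)   # independent Cauchy(0, 2.5) priors
display(fit)

Unlike plain glm(), this gives finite, stabilized coefficients even for the nearly empty pathways, but as I said above, with 70 linked coefficients I'd want to go further and model them hierarchically.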

David Rohde writes:

Rob Kass writes:

Statistics has moved beyond the frequentist-Bayesian controversies of the past. Where does this leave our ability to interpret results? I [Kass] suggest that a philosophy compatible with statistical practice, labeled here statistical pragmatism, serves as a foundation for inference. Statistical pragmatism is inclusive and emphasizes the assumptions that connect statistical models with observed data. I argue that introductory courses often mis-characterize the process of statistical inference and I propose an alternative "big picture" depiction.

In my comments, I pretty much agree with everything Rob says, with a few points of elaboration:

Kass describes probability theory as anchored upon physical randomization (coin flips, die rolls and the like) but being useful more generally as a mathematical model. I completely agree but would also add another anchoring point: calibration. Calibration of probability assessments is an objective, not subjective process, although some subjectivity (or scientific judgment) is necessarily involved in the choice of events used in the calibration. In that way, Bayesian probability calibration is closely connected to frequentist probability statements, in that both are conditional on "reference sets" of comparable events . . .

In a modern Bayesian approach, confidence intervals and hypothesis testing are both important but are not isomorphic; they represent two different steps of inference. Confidence statements, or posterior intervals, are summaries of inference about parameters conditional on an assumed model. Hypothesis testing--or, more generally, model checking--is the process of comparing observed data to replications under the model if it were true. . . .

Kass discusses the role of sampling as a model for understanding statistical inference. But sampling is more than a metaphor; it is crucial in many aspects of statistics. . . .

The only two statements in Kass's article that I clearly disagree with are the following two claims: "the only solid foundation for Bayesianism is subjective," and "the most fundamental belief of any scientist is that the theoretical and real worlds are aligned." . . . Claims of the subjectivity of Bayesian inference have been much debated, and I am under no illusion that I can resolve them here. But I will repeat my point made at the outset of this discussion that Bayesian probability, like frequentist probability, is except in the simplest of examples a model-based activity that is mathematically anchored by physical randomization at one end and calibration to a reference set at the other. . . . a person who is really worried about subjective model-building might profitably spend more effort thinking about assumptions inherent in additive models, logistic regressions, proportional hazards models, and the like. Even the Wilcoxon test is based on assumptions . . .

Like Kass, I believe that philosophical debates can be a good thing, if they motivate us to think carefully about our unexamined assumptions. Perhaps even the existence of subfields that rarely communicate with each other has been a source of progress in allowing different strands of research to be developed in a pluralistic environment, in a way that might not have been so easily done if statistical communication had been dominated by any single intolerant group. . . .

Sanjay Kaul writes:

By statute ("the least burdensome" pathway), the approval standard for devices by the US FDA is lower than for drugs. Before a new drug can be marketed, the sponsor must show "substantial evidence of effectiveness" as based on two or more well-controlled clinical studies (which literally means 2 trials, each with a p value of <0.05, or 1 large trial with a robust p value <0.00125). In contrast, the sponsor of a new device, especially those that are designated as high-risk (Class III) device, need only demonstrate "substantial equivalence" to an FDA-approved device via the 510(k) exemption or a "reasonable assurance of safety and effectiveness", evaluated through a pre-market approval and typically based on a single study.

What does "reasonable assurance" or "substantial equivalence" imply to you as a Bayesian? These are obviously qualitative constructs, but if one were to quantify them, how would you go about addressing it?

I sent Deborah Mayo a link to my paper with Cosma Shalizi on the philosophy of statistics, and she sent me the link to this conference which unfortunately already occurred. (It's too bad, because I'd have liked to have been there.) I summarized my philosophy as follows:

I am highly sympathetic to the approach of Lakatos (or of Popper, if you consider Lakatos's "Popper_2" to be a reasonable simulation of the true Popperism), in that (a) I view statistical models as being built within theoretical structures, and (b) I see the checking and refutation of models to be a key part of scientific progress. A big problem I have with mainstream Bayesianism is its "inductivist" view that science can operate completely smoothly with posterior updates: the idea that new data causes us to increase the posterior probability of good models and decrease the posterior probability of bad models. I don't buy that: I see models as ever-changing entities that are flexible and can be patched and expanded, and I also feel strongly (based on my own experience and on my understanding of science) that some of our most important learning comes when we refute our models (in relevant ways). To put it another way: unlike many Bayesians, I believe that a model check--a hypothesis test--can be valuable, even when (or especially when) there is no alternative at hand.

I also think that my philosophical approach fits well with modern Bayesian data analysis, which is characterized not just by the calculation of posterior probabilities but by a three-step process: (1) model building, (2) inference conditional on an assumed model, (3) model checking, then returning back to step (1) as needed, either to expand the model or to ditch it and start anew.

I think that the association of Popperian falsification with classical statistical methods, and the association of inductive reasoning with Bayesian inference, is unfortunate, and I'd like to (a) convince the Popperians that Bayesian methods allow one to be a most effective Popperian, and (b) convince the Bayesians of the problems with formal inductive reasoning. (See the second column of page 177 here.)

Mayo and I then had an email exchange, which I'll repeat here. I'm hoping this will lead to clearer communications between philosophers and applied statisticians. (As Cosma and I discuss in our paper, philosophy is important for statisticians: it can influence how we use and interpret our methods.)

Mayo:

Here's my discussion of this article for the Journal of the Royal Statistical Society:

I will comment on this paper in my role as applied statistician and consumer of Bayesian computation. In the last few years, my colleagues and I have felt the need to fit models predicting survey responses given multiple discrete predictors, for example estimating voting given ethnicity and income within each of the fifty states, or estimating public opinion about gay marriage given age, sex, ethnicity, education, and state. We would like to be able to fit such models with ten or more predictors--for example, religion, religious attendance, marital status, and urban/rural/suburban residence in addition to the factors mentioned above.

There are (at least) three reasons for fitting a model with many predictive factors and potentially a huge number of interactions among them:

1. Deep interactions can be of substantive interest. For example, Gelman et al. (2009) discuss the importance of interactions between income, religion, religious attendance, and state in understanding how people vote.

2. Deep interactions can increase predictive power. For example, Gelman and Ghitza (2010) show how the relation between voter turnout and the combination of sex, ethnicity, education, and state has systematic patterns that would not be captured by main effects or even two-way interactions.

3. Deep interactions can help correct for sampling problems. Nonresponse rates in opinion polls continue to rise, and this puts a premium on post-sampling adjustments. We can adjust for known differences between sampling and population using poststratification, but to do so we need reasonable estimates of the average survey response within narrow slices of the population (Gelman, 2007).

Our key difficulty--familiar in applied statistics but not always so clear in discussions of statistical computation--is that, while we have an idea of the sort of model we would like to fit, we are unclear on the details. Thus, our computational task is not merely to fit a single model but to try out many different possibilities. My colleagues and I need computational tools that are:

(a) able to work with moderately large datasets (aggregations of surveys with total sample size in the hundreds of thousands);
(b) able to handle complicated models with tens of thousands of latent parameters;
(c) flexible enough to fit models that we haven't yet thought of;
(d) fast enough that we can fit model after model.

We all know by now that hierarchical Bayesian methods are a good way of estimating large numbers of parameters. I am excited about the article under discussion, and others like it, because the tools therein promise to satisfy conditions (a), (b), (c), (d) above.
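For concreteness, the flavor of model we want to fit looks something like the following sketch in lme4 notation (the variable names are schematic, and our real models have more factors, more interactions, and a fully Bayesian treatment, which is exactly why we need better computation):

library(lme4)
fit <- glmer(vote ~ z_income +
               (1 | state) + (1 | ethnicity) + (1 | age) +
               (1 | state:ethnicity) + (1 | state:age) + (1 | ethnicity:age),
             family = binomial, data = polls)

Poststratification then averages the fitted cell probabilities over census counts within each state.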

Yesterday at the sister blog, Nate Silver forecast that the Republicans have a two-thirds chance of regaining the House of Representatives in the upcoming election, with an expected gain of 45 House seats.

Last month, Bafumi, Erikson, and Wlezien released their forecast that gives the Republicans an 80% chance of takeover and an expected gain of 50 seats.

As all the above writers emphasize, these forecasts are full of uncertainty, so I treat the two predictions--a 45-seat swing or a 50-seat swing--as essentially identical at the national level.

And, as regular readers know, as far back as a year ago, the generic Congressional ballot (those questions of the form, "Which party do you plan to vote for in November?") was also pointing to big Republican gains.

As Bafumi et al. point out, early generic polls are strongly predictive of the election outcome, but they need to be interpreted carefully. The polls move in a generally predictable manner during the year leading up to an election, and so you want to fit a model to the polls when making a forecast, rather than just taking their numbers at face value.

Methods

Having read Nate's description of his methods and also the Bafumi, Erikson, and Wlezien paper, my impression is that the two forecasting procedures are very similar. Both of them use national-level information to predict the nationwide vote swing, then use district-level information to map that national swing onto a district level. Finally, both methods represent forecasting uncertainty as a probability distribution over the 435 district-level outcomes and then summarize that distribution using simulations.

I'll go through the steps in order.

1. Forecast national vote share for the two parties from a regression model using the generic ballot and other information including the president's party, his approval rating, and recent economic performance.

2. Map the estimated national swing to district-by-district swings using the previous election results in each district as a baseline, then correcting for incumbency and uncontested elections.

2a. Nate also looks at local polls and expert district-by-district forecasts from the Cook Report and CQ Politics and, where these polls and forecasts differ from the adjusted-uniform-swing model above, he compromises between the different sources of available information. He also throws in other district-level information, including data on campaign contributions.

3. Fit the model to previous years' data and use the errors in those retrospective fits to get an estimate of forecast uncertainty. Using simulation, propagate that uncertainty to get uncertain forecasts of elections at the district and national levels.

Step 2 is important and is indeed done by the best political-science forecasters. As Kari and I discuss, to a large extent, local and national swings can be modeled separately, and it is a common mistake for people to look at just one or the other.

The key difference between Nate's forecast and the others seems to be step 2a. Even if, as I expect, step 2a adds little to the accuracy of the national forecast, I think it's an excellent idea--after all, the elections are being held within districts. And, as Nate notes, when the local information differs dramatically from the nationally-forecast trend, often something interesting is going on. And these sorts of anomalies should be much more easily found by comparing to forecasts than by looking at polls in isolation.
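Stripped of the district-level refinements in step 2a, the simulation logic is something like this toy version in R (all numbers invented, not either team's actual model):

last_dem_share <- runif(435, 0.25, 0.75)     # stand-in for last election's district results
swing_mean <- -0.06; swing_se <- 0.03        # forecast national swing and its uncertainty (invented)
seats <- replicate(10000, {
  swing <- rnorm(1, swing_mean, swing_se)                    # national forecast error
  district <- last_dem_share + swing + rnorm(435, 0, 0.04)   # district-level error
  sum(district > 0.5)                                        # Democratic seats won
})
quantile(seats, c(0.05, 0.5, 0.95))          # simulation-based forecast interval for seats

The real models replace the invented inputs with regression forecasts and estimate the error terms from previous elections, but the propagate-by-simulation step is the same.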

Cyrus writes:

I [Cyrus] was teaching a class on multilevel modeling, and we were playing around with different methods to fit a random effects logit model with 2 random intercepts---one corresponding to "family" and another corresponding to "community" (labeled "mom" and "cluster" in the data, respectively). There are also a few regressors at the individual, family, and community levels. We were replicating in part some of the results from the following paper: Improved estimation procedures for multilevel models with binary response: a case-study, by G. Rodriguez and N. Goldman.

(I say "replicating in part" because we didn't include all the regressors that they use, only a subset.) We were looking at the performance of estimation via glmer in R's lme4 package, glmmPQL in R's MASS package, and Stata's xtmelogit. We wanted to study the performance of various estimation methods, including adaptive quadrature methods and penalized quasi-likelihood.

I was shocked to discover that glmer's default setting is adaptive Gaussian quadrature with 1 integration point (that is, a Laplace approximation). In addition, and even more shocking, glmer **does not allow** more than one integration point to be used for models with more than one random effect. With only one random effect, you can increase the number of integration points. But with multiple random effects (like what we had, which was 2), when you try to specify multiple integration points (with the nAGQ setting), you get an error saying that it can't do it. At first, I kinda shrugged, but then when I saw the difference that it made, alarm bells started going off.

I made a table showing results from estimates using glmer's Laplace approximation, then results from glmmPQL, which uses penalized quasi-likelihood, and then results from Stata's xtmelogit using first its own Laplace approximation and then a 7-point adaptive Gaussian quadrature. Note that in Stata, the 7-point quadrature fit is the default. This is a high-dimensional "brute force" method for fitting a complex likelihood.

Now, when we look at the results for the random effects standard deviation estimates, we see that the Laplace approximations in R (via glmer) and Stata (model 3 in the Table) are identical. In addition, taking the 7-pt quadrature estimate from xtmelogit to be the best guess of the bunch, the Laplace approximation-based estimates are HORRIBLY biased downward. That is, the estimated random effect variance is way too small. The PQL estimates are also biased downward, but not so badly. This is in line with what Rodriguez and Goldman found, using a 20-point quadrature and an MCMC fit as benchmarks (they produced nearly identical results).

The upshot is that glmer's default is REALLY BAD; and this default is the only option when you have more than one random effect. My suggestion then is to either use Stata, which has a very sensible 7-point quadrature default (although as a brute force method, it is slow) or fit the model with MCMC.

If I understand, you are using glmer? If so, have you (1) looked into this or (2) checked what glmer is giving you against xtmelogit or BUGS? I would trust the xtmelogit or BUGS answers more.
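(For readers who want to try the comparison themselves, it looks something like the following sketch in R; the data frame and variable names are schematic, loosely following the Rodriguez-Goldman setup.)

library(lme4); library(MASS)
f1 <- glmer(y ~ x + (1 | mom) + (1 | cluster),
            family = binomial, data = guat)         # Laplace approximation (nAGQ = 1)
f2 <- glmmPQL(y ~ x, random = ~ 1 | cluster/mom,
              family = binomial, data = guat)       # penalized quasi-likelihood
# Compare the estimated random-effect standard deviations from the two fits
# against Stata's xtmelogit with 7-point adaptive quadrature (or an MCMC fit).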

My reply:

1. I am working with Sophia Rabe-Hesketh, who's one of the people who programmed gllamm in Stata. Sophia has been telling me for a while that the method used by gllamm is better than the Laplace method that is currently used in glmer.

2. Just some terminology: I'd prefer to talk about different "approximations" rather than different "estimation methods." To me, there's just one estimation method here--hierarchical Bayes (or, as Sophia would say, marginal likelihood inference)--and these various approaches to point estimates of variance parameters are approximations to get around the difficulties of doing the integrals and computing the marginal likelihood.

3. I also prefer to use the term "varying intercepts" (and slopes) rather than "random effects," because the term "random effects" is not so clearly defined in the literature (as Jennifer and I discuss in our book). This doesn't change the substance of your comment but it makes it easier for me to follow.

4. In my project with Sophia (and with Jingchen Liu, also cc-ed here), we're working on several parallel tracks:
(a) Adding a prior distribution on the variance parameters to stabilize the point estimates and keep them away from the boundary of parameter space (which, in the context of scalar variance parameters, implies that we keep the estimates away from zero).
(b) Adding a few steps of the Metropolis algorithm to capture some of the uncertainty in the variance parameters and also fix some of the bias you're talking about. It's my intuition that a low number (e.g., 10) of Metropolis steps will clean up a lot of problems. Although, right now, that's just an intuition; we haven't tried it yet.
(c) Working in R (using glmer) and also separately in Stata (using gllamm).

5. Last but not least, it sounds like you've encountered an important principle of research: The way to really learn a subject is to teach it! That's why college professors (at least at a place like Columbia) know so much: we're always teaching new things.

David Shor writes:

How does Bayes do it?

| 2 Comments

I received the following message from a statistician working in industry:

I am studying your paper, A Weakly Informative Default Prior Distribution for Logistic and Other Regression Models. I am not clear why the Bayesian approaches with some priors can usually handle the issue of nonidentifiability or can get stable estimates of parameters in model fit, while the frequentist approaches cannot.

My reply:

1. The term "frequentist approach" is pretty general. "Frequentist" refers to an approach for evaluating inferences, not a method for creating estimates. In particular, any Bayes estimate can be viewed as a frequentist inference if you feel like evaluating its frequency properties. In logistic regression, maximum likelihood has some big problems that are solved with penalized likelihood--equivalently, Bayesian inference. A frequentist can feel free to consider the prior as a penalty function rather than a probability distribution of parameters.

2. The reason our approach works well is that we are adding information. In a logistic regression with separation, there is a lack of information in the likelihood, and the prior distribution helps out by ruling out unrealistic possibilities. (See the small example below.)

3. There are settings where our Bayesian method will mess up. For example, if the true logistic regression coefficient is -20, and you have a moderate sample size, our estimate will be much closer to zero (while the maximum likelihood estimate will be minus infinity, which for some purposes might be an acceptable estimate).
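Here is a small simulated example of the separation problem mentioned in point 2 (the numbers are invented):

library(arm)
x <- c(-2, -1.5, -1, -0.5, 0.5, 1, 1.5, 2)
y <- c( 0,    0,  0,    0,   1, 1,   1, 1)   # outcome perfectly separated by x
coef(glm(y ~ x, family = binomial))          # MLE blows up (warning about fitted probabilities of 0 or 1)
coef(bayesglm(y ~ x, family = binomial))     # default Cauchy prior gives a finite, stable estimate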

Probably I should write more about this sometime. Various questions along those lines arose during my recent talk at Cambridge.

Eric McGhee writes:

I sent a copy of my paper (coauthored with Cosma Shalizi) on Philosophy and the practice of Bayesian statistics in the social sciences to Richard Berk, who wrote:

I read your paper this morning. I think we are pretty much on the same page about all models being wrong. I like very much the way you handle this in the paper. Yes, Newton's work is wrong, but surely useful. I also like your twist on Bayesian methods. Makes good sense to me. Perhaps most important, your paper raises some difficult issues I have been trying to think more carefully about.

1. If the goal of a model is to be useful, surely we need to explore what "useful" means. At the very least, usefulness will depend on use. So a model that is useful for forecasting may or may not be useful for causal inference.

2. Usefulness will be a matter of degree. So for each use we will need one or more metrics to represent how useful the model is. In what looks at first to be a simple example, if the use is forecasting, forecasting accuracy by something like MSE may be a place to start. But that will depend on one's forecasting loss function, which might not be quadratic or even symmetric. This is a problem I have actually been working on and have some applications appearing. Other kinds of use imply a very different set of metrics --- what is a good usefulness metric for causal inference, for instance?

3. It seems to me that your Bayesian approach is one of several good ways (and not mutually exclusive ways) of doing data analysis. Taking a little liberty with what you say, you try a form of description and if it does not capture well what is in the data, you alter the description. But like use, it will be multidimensional and a matter of degree. There are these days so many interesting ways that statisticians have been thinking about description that I suspect it will be a while (if ever) before we have a compelling and systematic way to think about the process. And it goes to the heart of doing science.

4. I guess I am uneasy with your approach when it uses the same data to build and evaluate a model. I think we would agree that out-of-sample evaluation is required.

5. There are also some issues about statistical inference after models are revised and re-estimated using the same data. I have attached a recent paper written for criminologists, co-authored with Larry Brown and Linda Zhao, that appeared in Quantitative Criminology. It is frequentist in perspective. Larry and Ed George are working on a Bayesian version. Along with Andreas Buja and Larry Shepp, we are working on appropriate methods for post-model-selection inference, given that current practice is just plain wrong and often very misleading. Bottom line: what does one make of Bayesian output when the model involved has been tuned to the data?

My reply:

I agree with your points #1 and #2. We always talk about a model being "useful" but the concept is hard to quantify.

I also agree with #3. Bayes has worked well for me but I'm sure that other methods could work fine also.

Regarding point #4, the use of the same data to build and evaluate the model is not particularly Bayesian. I see what we do as an extension of non-Bayesian ideas such as chi^2 tests, residual plots, and exploratory data analysis--all of which, in different ways, are methods for assessing model fit using the data that were used to fit the model. In any case, I agree that out-of-sample checks are vital to true statistical understanding.

To put it another way: I think you're imagining that I'm proposing within-sample checks as an alternative to out-of-sample checking. But that's not what I'm saying. What I'm proposing is to do within-sample checks as an alternative to doing no checking at all, which unfortunately is the standard in much of the Bayesian world (abetted by the subjective-Bayes theory/ideology). When a model passes a within-sample check, it doesn't mean the model is correct. But in many many cases, I've learned a lot from seeing a model fail a within-sample check.

Regarding your very last point, there is some classic work on Bayesian inference accounting for estimation of the prior from data. This is the work of various people in the 1960s and 1970s on hierarchical Bayes, when it was realized that "empirical Bayes" or "estimating the prior from data" could be subsumed into a larger hierarchical framework. My guess is that such ideas could be generalized to a higher level of the modeling hierarchy.
