Recently in Multilevel Modeling Category

Nick Polson and James Scott write:

We generalize the half-Cauchy prior for a global scale parameter to the wider class of hypergeometric inverted-beta priors. We derive expressions for posterior moments and marginal densities when these priors are used for a top-level normal variance in a Bayesian hierarchical model. Finally, we prove a result that characterizes the frequentist risk of the Bayes estimators under all priors in the class. These arguments provide an alternative, classical justification for the use of the half-Cauchy prior in Bayesian hierarchical models, complementing the arguments in Gelman (2006).

This makes me happy, of course. It's great to be validated.

The only thing I didn't catch is how they set the scale parameter for the half-Cauchy prior. In my 2006 paper I frame it as a weakly informative prior and recommend that the scale be set based on actual prior knowledge. But Polson and Scott are talking about a default choice. I used to think that such a default would not really be possible, but given our recent success with automatic priors for regularized point estimates, now I'm thinking that a reasonable default might be possible in the full Bayes case too.

P.S. I found the above article while looking on Polson's site for this excellent paper, which considers in a more theoretical way some of the themes that Jennifer, Masanao, and I are exploring in our research on hierarchical models and multiple comparisons.

When predicting 0/1 data we can use logit (or probit or robit or some other robust model such as invlogit(0.01 + 0.98*X*beta)). Logit is simple enough and we can use bayesglm to regularize and avoid the problem of separation.
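That robust link is simple enough to write down concretely. Here is a minimal sketch (the 0.01/0.98 constants are from the formula above; the function names are mine, not from any package):

```python
import math

def invlogit(z):
    """Standard inverse-logit."""
    return 1.0 / (1.0 + math.exp(-z))

def robust_invlogit(z, eps=0.01):
    """Robust link: eps + (1 - 2*eps) * invlogit(z).
    With eps = 0.01 this is 0.01 + 0.98*invlogit(z), so predicted
    probabilities are bounded inside (0.01, 0.99) and a few gross
    outliers can't completely dominate the logistic likelihood."""
    return eps + (1.0 - 2.0 * eps) * invlogit(z)
```

The point of the bounding is that even an observation the linear predictor gets badly wrong contributes a log-likelihood term no worse than log(0.01), rather than diverging to minus infinity.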

What if there are more than 2 categories? If they're ordered (1, 2, 3, etc), we can do ordered logit (and use bayespolr() to avoid separation). If the categories are unordered (vanilla, chocolate, strawberry), there are unordered multinomial logit and probit models out there.

But it's not so easy to fit these multinomial models in a multilevel setting (with coefficients that vary by group), especially if the computation is embedded in an iterative routine such as mi where you have real time constraints at each step.

So this got me wondering whether we could kluge it with logits. Here's the basic idea (in the ordered and unordered forms):

- If you have a variable that goes 1, 2, 3, etc., set up a series of logits: 1 vs. 2,3,...; 2 vs. 3,...; and so forth. Fit each one with bayesglm (or whatever). The usual ordered logit is a special case of this model in which the coefficients (except for the constant term) are the same for each model. (At least, I'm guessing that's what's happening; if not, there are some equivalently-dimensioned constraints.) The simple alternative has no constraints. Intermediate versions could link the models with soft constraints, i.e., some prior distribution on the coefficients. (This would need some hyperparameters, but these could be estimated too if the whole model is being fit in some iterative context.)

- If you have a vanilla-chocolate-strawberry variable, do the same thing; just order the categories first, either in some reasonable way based on substantive information or else using some automatic rule such as putting the categories in decreasing order of frequency in the data. In any case, you'd first predict the probability of being in category 1, then the probability of being in 2 (given that you're not in 1), then the probability of 3 (given not 1 or 2), and so forth.

Depending on your control over the problem, you could choose how to model the variables. For example, in a political survey with some missing ethnicity responses, you might model that variable as ordered: white/other/hispanic/black. In other contexts you might go unordered.

I recognize that my patchwork of logits is a bit of a hack, but I like its flexibility, as well as the computational simplicity of building it out of simple logits. Maybe there's some literature on this (perhaps explaining why it's a bad idea)? I'd appreciate any comments or suggestions.
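One reassuring property of the patchwork: however the separate logits are fit, chaining the conditional probabilities always composes into a proper distribution over the categories. A minimal sketch (function names are mine; in practice each eta would be the linear predictor from a separate bayesglm fit):

```python
import math

def invlogit(z):
    return 1.0 / (1.0 + math.exp(-z))

def sequential_probs(etas):
    """Turn K-1 conditional linear predictors (category 1 vs. the rest,
    then 2 vs. the rest-after-1, and so on) into K category probabilities.
    Whatever the etas are, the result sums to 1."""
    probs, remaining = [], 1.0
    for eta in etas:
        p = invlogit(eta)          # P(this category | not an earlier one)
        probs.append(remaining * p)
        remaining *= 1.0 - p
    probs.append(remaining)        # probability of the last category
    return probs
```

For example, with three conditional logits all at zero (each conditional probability 1/2), the four categories get probabilities 0.5, 0.25, 0.125, 0.125.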

For awhile I've been fitting most of my multilevel models using lmer/glmer, which gives point estimates of the group-level variance parameters (maximum marginal likelihood estimate for lmer and an approximation for glmer). I'm usually satisfied with this--sure, point estimation understates the uncertainty in model fitting, but that's typically the least of our worries.

Sometimes, though, lmer/glmer estimates group-level variances at 0 or estimates group-level correlation parameters at +/- 1. Typically, when this happens, it's not that we're so sure the variance is close to zero or that the correlation is close to 1 or -1; rather, the marginal likelihood does not provide a lot of information about these parameters of the group-level error distribution.

I don't want point estimates on the boundary. I don't want to say that the unexplained variance in some dimension is exactly zero.

One way to handle this problem is full Bayes: slap a prior on sigma, do your Gibbs and Metropolis, and summarize posterior inferences by simulation. That's fine but it's another bit of programming and computation. We wanted to be able to do it using the existing lmer/glmer (or, in Stata, gllamm) framework.

Yeojin Chung, Sophia Rabe-Hesketh, Jingchen Liu, Vince Dorie, and I came up with a solution. We compute a maximum penalized marginal likelihood estimate, the marginal posterior mode under a gamma (not inverse gamma; you know how I hate that) prior for the scale parameter. Our default choice is gamma (2, 1/A), where A is set to some large value such as 10 (rescaled relative to the regression predictors, as usual). The shape parameter of 2 gives a prior that is linear at 0: thus, the posterior mode can't be 0 under any circumstances--but it can be as close to 0 as is required based on the likelihood. Some basic algebra shows that if the marginal maximum likelihood estimate is at 0, our posterior mode estimate will be about 1 standard error away from 0, which seems about right. The rate parameter, 1/A, is set to be small so that the prior will only pull the estimate down if it is unexpectedly large, as might occur if the number of groups is small.
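Here is a toy numerical illustration of the boundary-avoidance point (a one-level stand-in model of my own devising, not the authors' implementation): when the data are underdispersed, the marginal likelihood for the scale parameter peaks at zero, but adding the log of a gamma (2, 1/A) prior, whose density is linear at zero, moves the mode a positive distance away.

```python
import math

def log_marglik(tau, y, s=1.0):
    """Marginal log-likelihood of tau when y_j ~ N(0, tau^2 + s^2),
    a toy stand-in for the marginal likelihood of a group-level sd."""
    v = tau * tau + s * s
    return sum(-0.5 * math.log(2 * math.pi * v) - yj * yj / (2 * v) for yj in y)

def log_gamma_prior(tau, shape=2.0, A=10.0):
    """Log density (up to a constant) of gamma(shape=2, rate=1/A):
    proportional to tau * exp(-tau/A), hence linear and zero at tau = 0."""
    return (shape - 1.0) * math.log(tau) - tau / A

# Toy data whose spread is below the sampling sd s = 1, so the
# marginal likelihood is maximized at the boundary tau = 0.
y = [0.3, -0.2, 0.1, -0.4, 0.2]
grid = [i / 1000 for i in range(1, 5001)]
mmle = max(grid, key=lambda t: log_marglik(t, y))
pmode = max(grid, key=lambda t: log_marglik(t, y) + log_gamma_prior(t))
# mmle sits at the bottom of the grid (the boundary), while pmode
# is pulled a positive distance away from zero, as described above.
```

The large rate scale A means the prior barely touches the likelihood away from zero; its only strong effect is the log(tau) term that forbids a mode exactly at the boundary.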

Our prior distribution is weakly informative in that it does something to the estimate but is generic and doesn't have much information as we would have in just about any particular example. I think that in many applied examples it would make sense to use stronger prior information. But I think that, as a default choice, the weakly informative gamma (2, 1/A) prior makes more sense than the uniform distribution (corresponding to the nonregularized estimate).

In the multivariate case, we use a Wishart (not inverse-Wishart), which when we choose the right parameters keeps the variances away from 0 and the correlation matrix positive definite.

One interesting feature of these models is that the weakly-informative regularizing prior we use for a point estimate is not the same as the weakly-informative prior that we would use for full Bayes (where there is no need to keep the posterior mode away from the boundary).

Here's the abstract to our paper:

Variance parameters in mixed or multilevel models can be difficult to estimate, especially when the number of groups is small. We propose a maximum penalized likelihood approach which is equivalent to estimating variance parameters by their marginal posterior mode, given a weakly informative prior distribution. By choosing the prior from the gamma family with at least 1 degree of freedom, we ensure that the prior density is zero at the boundary and thus the marginal posterior mode of the group-level variance will be positive. The use of a weakly informative prior allows us to stabilize our estimates while remaining faithful to the data.

Let me again thank my collaborators, Yeojin Chung, Sophia Rabe-Hesketh, Jingchen Liu, Vince Dorie. We're now working on various generalizations, including multivariate distributions, generalized linear models, and hierarchical models for the variance parameters (going beyond the simple ideas in section 6 of my article from a few years ago).

There are a few things I want to do:

1. Understand a fitted model using tools such as average predictive comparisons, R-squared, and partial pooling factors. In defining these concepts, Iain and I came up with some clever tricks, including (but not limited to):

- Separating the inputs and averaging over all possible values of the input not being altered (for average predictive comparisons);

- Defining partial pooling without referring to a raw-data or maximum-likelihood or no-pooling estimate (these don't necessarily exist when you're fitting logistic regression with sparse data);

- Defining an R-squared for each level of a multilevel model.

The methods get pretty complicated, though, and they have some loose ends--in particular, for average predictive comparisons with continuous input variables.

So now we want to implement these in R and put them into arm along with bglmer etc.

2. Setting up coefplot so it works more generally (that is, so the graphics look nice for models with one predictor, two predictors, or twenty predictors). Also a bunch of expansions to coefplot:

- Defining coefplot for multilevel models

- Also displaying average predictive comparisons for nonlinear models

- Setting it up to automatically display several regressions in a large "table"

3. Automatic plots showing data and fitted regression lines/curves. With multiple inputs, you hold all the inputs but one to fixed values--it's sort of like an average predictive comparison, but graphical. We also have to handle interactions and multilevel models.

4. Generalizing R-squared and partial pooling factors for multivariate (varying-intercept, varying-slope) models.

5. Graphs showing what happens as you add a multilevel component to a model. This is something I've been thinking about for awhile, ever since doing the police stop and frisk model with Jeff Fagan and Alex Kiss. I wanted a graph that showed how the key estimates were changing when we went multilevel, and what in the data was making the change.

One reason these sorts of process explanations are important is . . . we give them all the time. We're always giving these data-and-model stories of why when we control for variable X, our estimate changes on variable Y. Or why our multilevel estimates are a compromise between something and something else. What I'd like to do is to formalize and automate these explanations. Just as we have formulated and automated much of inference and we have (to some extent) formulated and automated graphical model checking, I'd like to formulate and automate the steps of understanding a fitted model--which often requires understanding in the context of fitted alternatives that are close by in model space.

6. I have more ideas but this is enough for now.

When it rains it pours . . .

John Transue writes:

I saw a post on Andrew Sullivan's blog today about life expectancy in different US counties. With a bunch of the worst counties being in Mississippi, I thought that it might be another case of analysts getting extreme values from small counties.

However, the paper (see here) includes a pretty interesting methods section. This is from page 5, "Specifically, we used a mixed-effects Poisson regression with time, geospatial, and covariate components. Poisson regression fits count outcome variables, e.g., death counts, and is preferable to a logistic model because the latter is biased when an outcome is rare (occurring in less than 1% of observations)."

They have downloadable data. I believe that the data are predicted values from the model. A web appendix also gives 90% CIs for their estimates.

Do you think they solved the small county problem and that the worst counties really are where their spreadsheet suggests?

My reply:

I don't have a chance to look in detail but it sounds like they're on the right track. I like that they cross-validated; that's what we did to check we were ok with our county-level radon estimates.

Regarding your question about the small county problem: no matter what you do, all maps of parameter estimates are misleading. Even the best point estimates can't capture uncertainty. As noted above, cross-validation (at the level of the county, not of the individual observation) is a good way to keep checking.

Brendan Nyhan points me to this from Don Taylor:

Can national data be used to estimate state-level results? . . . A challenge is the fact that the sample size in many states is very small . . . Richard [Gonzales] used a regression approach to extrapolate this information to provide a state-level support for health reform:
To get around the challenge presented by small sample sizes, the model presented here combines the benefits of incorporating auxiliary demographic information about the states with the hierarchical modeling approach commonly used in small area estimation. The model is designed to "shrink" estimates toward the average level of support in the region when there are few observations available, while simultaneously adjusting for the demographics and political ideology in the state. This approach therefore takes fuller advantage of all information available in the data to estimate state-level public opinion.

This is a great idea, and it is already being used all over the place in political science. For example, here. Or here. Or here.

See here for an overview article, "How should we estimate public opinion in the states?" by Jeff Lax and Justin Phillips.

It's good to see practical ideas being developed independently in different fields. I know that methods developed by public health researchers have been useful in political science, and I hope that in turn they can take advantage of the progress we've made in multilevel regression and poststratification.

I received an email with the above title from Daniel Adkins, who elaborates:

I [Adkins] am having a tough time with a dataset including 40K obs and 8K subjects. Trying to estimate a 2 level logit with random intercept and age slope and about 13 fixed covariates. I have tried several R packages (lme4, lme4a, glmmPQL, MCMCglmm) and stata xtmelogit and gllamm to no avail. xtmelogit crashes from insufficient memory. The R packages yield false convergences. A simpler model w/ random intercept only gives stable estimates in lme4 with a very large number of quadrature points (nAGQ>220). When I try this (nAGQ=221) with the random age term, it doesn't make it through a single iteration in 72 hours (have tried both w/ and w/out RE correlation). I am using a power desktop that is top of the line compared to anything other than a cluster. Have tried start values for fixed effects in lme4 and that doesn't help (couldn't figure out how to specify RE starts). Do you have any advice? Should I move to BUGS (no experience, so that would be a little painful)? Try as SEM in Mplus? Is there a trick in lme4 or gllamm that might help? Cluster/more RAM? I am thinking of splitting the sample, modeling, then combining estimates via meta-analysis, but reviewers may balk at this.

Could you give any advice? I have done a lot of diagnostics and simpler models, so I know ~ what the parameter estimates are, but need the formal full models for a paper.

What with that 93% marginal tax rate [regular readers will know what I'm talking about], I might as well just answer this one for free . . .

My first question is what happened in gllamm? My general impression is that gllamm is more reliable than lme4 etc for nonlinear models. Another option is to try bglmer--the regularization in the priors should make the computation more stable. I don't know Mplus so can't say about that, but, no, I wouldn't recommend Bugs--my guess is that would be too slow.

Another option is some sort of intermediate approach, such as: Fit one of those simpler models that you like. In that simpler model, compute and save the "linear predictor" (X*beta). Then use this as a single predictor (maybe it can have a varying slope also) in a multilevel model. Or something like that. If you have a simpler model that you understand, then you should be able to blaze a trail from there to the model you finally want to fit.

Adkins sent an update:

I was finally able to get stable, sensible estimates in lme4 after 1) rescaling age (to avoid estimating a very small random effect variance) and 2) moving to the 64-bit version of R. To answer your question, my sense was that gllamm would get stuck in a flat, slightly bumpy portion of the likelihood function and couldn't get out to move toward the global optimum. Could have probably circumvented this with good starts but I figured things out in lme4 first.

Aaahhh--rescaling. That's advice I gave someone else the other day, oddly enough. I should've thought of that.

Zoltan Fazekas writes:

I am a 2nd year graduate student in political science at the University of Vienna. In my empirical research I often employ multilevel modeling, and recently I came across a situation that kept me wondering for quite a while. As I did not find much on this in the literature and considering the topics that you work on and blog about, I figured I will try to contact you.

A couple years ago we had an amazing all-star session at the Joint Statistical Meetings. The topic was new approaches to survey weighting (which is a mess, as I'm sure you've heard).

Xiao-Li Meng recommended shrinking weights by taking them to a fractional power (such as square root) instead of trimming the extremes.

Rod Little combined design-based and model-based survey inference.

Michael Elliott used mixture models for complex survey design.

And here's my introduction to the session.

Robert Birkelbach:

I am writing my Bachelor Thesis in which I want to assess the reading competencies of German elementary school children using the PIRLS 2006 data. My levels are classrooms and the individuals. However, my dependent variable is a multiply imputed (m=5) reading test. The problem I have is that I do not know whether I can just calculate 5 linear multilevel models and then average all the results (the coefficients, standard deviation, BIC, intraclass correlation, R2, t-statistics, p-values, etc.) or whether I need different formulas for integrating the results of the five models into one because it is a multilevel analysis. Do you think there's a better way of solving my problem? I would greatly appreciate it if you could help me with a problem regarding my analysis -- I am quite a newbie to multilevel modeling and especially to multiple imputation. Also: Is it okay to use frequentist models when the multiple imputation was done in a Bayesian way? Would the different philosophies of scientific testing contradict each other?

My reply:

I recommend doing 5 separate analyses, pushing them all the way thru to the end, then combining them using the combining-imputation rules given in the imputation chapter of our book. Everything should go fine.
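The combining rules in question (Rubin's rules for multiple imputation) are short enough to sketch generically. This is an illustration of the arithmetic, not code from the book:

```python
import math

def rubin_combine(estimates, std_errors):
    """Rubin's rules for pooling m completed-data analyses of one parameter.
    Returns the pooled estimate, pooled standard error, and approximate
    degrees of freedom for the reference t distribution."""
    m = len(estimates)
    qbar = sum(estimates) / m
    ubar = sum(se * se for se in std_errors) / m           # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    t = ubar + (1.0 + 1.0 / m) * b                         # total variance
    if b > 0:
        df = (m - 1) * (1.0 + ubar / ((1.0 + 1.0 / m) * b)) ** 2
    else:
        df = float("inf")  # no between-imputation variation at all
    return qbar, math.sqrt(t), df
```

The key point for the question above is that you pool the estimates and their variances this way, rather than averaging t-statistics, p-values, or fit statistics directly; the multilevel structure of each within-imputation analysis doesn't change the combining step.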

Jeff writes:

How far off is bglmer and can it handle ordered logit or multinom logit?

My reply:

bglmer is very close. No ordered logit but I was just talking about it with Sophia today. My guess is that the easiest way to fit a hierarchical ordered logit or multinom logit will be to use stan. For right now I'd recommend using glmer/bglmer to fit the ordered logits in order (e.g., 1 vs. 2,3,4, then 2 vs. 3,4, then 3 vs. 4). Or maybe there's already a hierarchical multinomial logit in mcmcpack or somewhere?

My talk at Stanford on Tuesday


Of Beauty, Sex, and Power: Statistical Challenges in Estimating Small Effects.

Tues 26 Apr, 12-1 in the Graham Stuart Lounge, 4th Floor, Encina West.

Dean Eckles writes:

I remember reading on your blog that you were working on some tools to fit multilevel models that also include "fixed" effects -- such as continuous predictors -- that are also estimated with shrinkage (for example, an L1 or L2 penalty). Any new developments on this front?

I often find myself wanting to fit a multilevel model to some data, but also needing to include a number of "fixed" effects, mainly continuous variables. This makes me wary of overfitting to these predictors, so then I'd want to use some kind of shrinkage.

As far as I can tell, the main options for doing this now is by going fully Bayesian and using a Gibbs sampler. With MCMCglmm or BUGS/JAGS I could just specify a prior on the fixed effects that corresponds to a desired penalty. However, this is pretty slow, especially with a large data set and because I'd like to select the penalty parameter by cross-validation (which is where this isn't very Bayesian I guess?).

My reply:

We allow informative priors in blmer/bglmer. Unfortunately blmer/bglmer aren't ready yet but they will be soon, I hope. They'll be in the "arm" package in R. We're also working on a bigger project of multilevel models for deep interactions of continuous predictors. But that won't be ready for awhile; we still have to figure out what we want to do there.

Sam Stroope writes:

I'm creating county-level averages based on individual-level respondents. My question is, how few respondents are reasonable to use when calculating the average by county? My end model will be a county-level (only) SEM model.

My reply: Any number of respondents should work. If you have very few respondents, you should just end up with large standard errors which will propagate through your analysis.

P.S. I must have deleted my original reply by accident so I reconstructed something above.

Gregory Eady writes:

I'm working on a paper examining the effect of superpower alliance on a binary DV (war). I hypothesize that the size of the effect is much higher during the Cold War than it is afterwards. I'm going to run a Chow test to check whether this effect differs significantly between 1960-1989 and 1990-2007 (Scott Long also has a method using predicted probabilities), but I'd also like to show the trend graphically, and thought that your "Secret Weapon" would be useful here. I wonder if there is anything I should be concerned about when doing this with a (rare-events) logistic regression. I was thinking to graph the coefficients in 5-year periods, moving a single year at a time (1960-64, 1961-65, 1962-66, and so on), reporting the coefficient in the graph for the middle year of each 5-year range.

My reply:

I don't know nuthin bout no Chow test but, sure, I'd think the secret weapon would work. If you're analyzing 5-year periods, it might be cleaner just to keep the periods disjoint. Set the boundaries of these periods in a reasonable way (if necessary using periods of unequal lengths so that your intervals don't straddle important potential change points). I suppose in this case you could do 1960-64, 65-69, ..., and this would break at 1989/90 so it would be fine. If you're really running into rare events, though, you might want 10-year periods rather than 5-year.
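The disjoint-periods suggestion is easy to mechanize. Here's a sketch (the function name is mine), forcing a break at 1989/90 by splitting the range first and then fitting one regression per period:

```python
def disjoint_periods(start, end, width):
    """Split the years [start, end] into consecutive non-overlapping
    periods of the given width (the last period may be shorter)."""
    periods = []
    lo = start
    while lo <= end:
        hi = min(lo + width - 1, end)
        periods.append((lo, hi))
        lo = hi + 1
    return periods

# Splitting at the 1989/90 boundary guarantees no interval straddles
# the hypothesized change point:
periods = disjoint_periods(1960, 1989, 5) + disjoint_periods(1990, 2007, 5)
# (1960-64), (1965-69), ..., (1985-89), (1990-94), ..., (2005-07)
```

With rare events, swapping width=5 for width=10 in the same call gives the coarser periods suggested above.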



Pete Gries writes:

I [Gries] am not sure if what you are suggesting by "doing data analysis in a patternless way" is a pitch for deductive over inductive approaches as a solution to the problem of reporting and publication bias. If so, I may somewhat disagree. A constant quest to prove or disprove theory in a deductive manner is one of the primary causes of both reporting and publication bias. I'm actually becoming a proponent of a remarkably non-existent species - "applied political science" - because there is so much animosity in our discipline to inductive empirical statistical work that seeks to answer real world empirical questions rather than contribute to parsimonious theory building. Anyone want to start a JAPS - Journal of Applied Political Science? Our discipline is in danger of irrelevance.

My reply: By "doing data analysis in a patternless way," I meant statistical methods such as least squares, maximum likelihood, etc., that estimate parameters independently without recognizing the constraints and relationships between them. If you estimate each study on its own, without reference to all the other work being done in the same field, then you're depriving yourself of a lot of information and inviting noisy estimates and, in particular, overestimates of small effects.

Andrew Gelman (Columbia University) and Jennifer Hill (New York University) seek to hire a post-doctoral fellow to work on development of iterative imputation algorithms and diagnostics for missing-data imputation. Activities would include model-development, programming, and data analysis. This project is funded by a grant from the Institute of Education Sciences. Other collaborators on the project include Jingchen Liu and Ben Goodrich. This is a two-year position that would start this summer (2011) or earlier if possible.

The ideal candidate will have a statistics or computer science background, will be interested in statistical modeling, serious programming, and applications. He or she should be able to work fluently in R and should already know about hierarchical models and Bayesian inference and computation. Experience or interest in designing GUIs would be a bonus, since we are attempting to improve our GUI for missing data imputation.

The successful candidate will become part of the lively Applied Statistics Center community, which includes several postdocs (with varied backgrounds in statistics, computer science, and social science), Ph.D., M.A., and undergraduate students, and faculty at Columbia and elsewhere. We want people who love collaboration and have the imagination, drive, and technical skills to make a difference in our projects.

If you are interested in this position, please send a letter of application, a CV, some of your articles, and three letters of recommendation to the Applied Statistics Center coordinator, Caroline Peters. Review of applications will begin immediately.

We need help picking out an automatic differentiation package for Hamiltonian Monte Carlo sampling from the posterior of a generalized linear model with deep interactions. Specifically, we need to compute gradients for log probability functions with thousands of parameters that involve matrix (determinants, eigenvalues, inverses), stats (distributions), and math (log gamma) functions. Any suggestions?

Bill Harris writes:

I've read your paper and presentation showing why you don't usually worry about multiple comparisons. I see how that applies when you are comparing results across multiple settings (states, etc.).

Does the same principle hold when you are exploring data to find interesting relationships? For example, you have some data, and you're trying a series of models to see which gives you the most useful insight. Do you try your models on a subset of the data so you have another subset for confirmatory analysis later, or do you simply throw all the data against your models?

My reply: I'd like to estimate all the relationships at once and use a multilevel model to do partial pooling to handle the multiplicity issues. That said, in practice, in my applied work I'm always bouncing back and forth between different hypotheses and different datasets, and often I learn a lot when next year's data come in and I can modify my hypotheses. The trouble with the classical hypothesis-testing framework, at least for me, is that so-called statistical hypotheses are very precise things, whereas the sorts of hypotheses that arise in science and social science are vaguer and are not so amenable to "testing" in the classical sense.

Bayes in China update


Some clarification on the Bayes-in-China issue raised last week:

1. We heard that the Chinese publisher cited the following pages that might contain politically objectionable materials: 3, 5, 21, 73, 112, 201.

2. It appears that, as some commenters suggested, the objection was to some of the applications, not to the Bayesian methods.

3. Our book is not censored in China. In fact, as some commenters mentioned, it is possible to buy it there, and it is also available in university libraries there. The edition of the book which was canceled was intended to be a low-cost reprint of the book. The original book is still available. I used the phrase "Banned in China" as a joke and I apologize if it was misinterpreted.

4. I have no quarrel with the Chinese government or with any Chinese publishers. They can publish whatever books they would like. I found this episode amusing only because I do not think my book on regression and multilevel models has any strong political content. I suspect the publisher was being unnecessarily sensitive to potentially objectionable material, but this is their call. I thought this was an interesting story (which is why I posted the original email on the blog) but I did not, and do not, intend it as any sort of comment on the Chinese government, Chinese society, etc. China is a big country and this is one person at one publisher making one decision. That's all it is; it's not a statement about China in general.

I did not write the above out of any fear of legal action etc. I just think it's important to be fair and clear, and it is possible that some of what I wrote could have been misinterpreted in translation. If anyone has further questions on this, feel free to ask in the comments and I will clarify as best as I can.

I received the following in email from our publisher:

I write with regards to the project to publish a China Edition of your book "Data Analysis Using Regression and Multilevel/Hierarchical Models" (ISBN-13: 9780521686891) for the mainland Chinese market. I regret to inform you that we have been notified by our partner in China, Posts & Telecommunications Press (PTP), that due to various politically sensitive materials in the text, the China Edition has not met with the approval of the publishing authorities in China, and as such PTP will not be able to proceed with the publication of this edition. We will therefore have to cancel plans for the China Edition of your book. Please accept my apologies for this unforeseen development. If you have any queries regarding this, do feel free to let me know.

Oooh, it makes me feel so . . . subversive. It reminds me how, in Sunday school, they told us that if we were ever visiting Russia, we should smuggle Bibles in our luggage because the people there weren't allowed to worship.

Xiao-Li Meng told me once that in China they didn't teach Bayesian statistics because the idea of a prior distribution was contrary to Communism (since the "prior" represented the overthrown traditions, I suppose).

And then there's this.

I think that the next printing of our book should have "Banned in China" slapped on the cover. That should be good for sales, right?

P.S. Update here.

A colleague recently sent me a copy of some articles on the estimation of treatment interactions (a topic that's interested me for awhile). One of the articles, which appeared in the Lancet in 2000, was called "Subgroup analysis and other (mis)uses of baseline data in clinical trials," by Susan F. Assmann, Stuart J. Pocock, Laura E. Enos, and Linda E. Kasten. . . .

Hey, wait a minute--I know Susan Assmann! Well, I sort of know her. When I was a freshman in college, I asked my adviser, who was an applied math prof, if I could do some research. He connected me to Susan, who was one of his Ph.D. students, and she gave me a tiny part of her thesis to work on.

The problem went as follows. You have a function f(x), for x going from 0 to infinity, that is defined as follows. Between 0 and 1, f(x)=x. Then, for x higher than 1, f'(x) = f(x) - f(x-1). The goal is to figure out what f(x) does. I think I'm getting this right here, but I might be getting confused on some of the details. The original form of the problem had some sort of probability interpretation, I think--something to do with a one-dimensional packing problem, maybe f(x) was the expected number of objects that would fit in an interval of size x, if the objects were drawn from a uniform distribution. Probably not that, but maybe something of that sort.

One of the fun things about attacking this sort of problem as a freshman is that I knew nothing about the literature on such problems or even what they were called (a differential-difference equation, although it can also be formulated as an integral equation). Nor was I set up to do any simulations on the computer. I just solved the problem from scratch. First I figured out the function in the range [1,2], [2,3], and so forth, then I made a graph (pencil on graph paper) and conjectured the asymptotic behavior of f. The next step was to prove my conjecture. It ate at me. I worked on the problem on and off for about eleven months, then one day I finally did it: I had carefully proved the behavior of my function! This accomplishment gave me a warm feeling for years after.
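Out of curiosity, the equation as recalled can be integrated numerically. A minimal Python sketch (assuming exactly the form stated above, which the text itself hedges on):

```python
import numpy as np

# Euler integration of the delay differential equation as recalled:
# f(x) = x on [0, 1]; f'(x) = f(x) - f(x - 1) for x > 1.
h = 0.001                      # step size; the unit delay spans K steps
K = int(round(1 / h))
x = np.linspace(0, 5, 5 * K + 1)
f = np.empty_like(x)
f[:K + 1] = x[:K + 1]          # initial segment: f(x) = x
for n in range(K, len(x) - 1):
    f[n + 1] = f[n] + h * (f[n] - f[n - K])   # Euler step with the lagged value

# Amusingly, f(x) = x satisfies this particular form exactly
# (f' = 1 = f(x) - f(x-1)), so the numeric solution stays linear;
# the nontrivial asymptotics in the story suggest the original
# problem had a different sign or form.
```

The same few lines adapt to any variant of the equation (for example, flipping the sign of the lagged term) by editing the Euler step.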

I never actually told Susan Assmann about this--I think that by then she had graduated, and I never found out whether she figured out the problem herself as part of her Ph.D. thesis or whether it was never really needed in the first place. And I can't remember if I told my adviser. (He was a funny guy: extremely friendly to everyone, including his freshman advisees, but one time we were in his office when he took a phone call. He was super-friendly during the call, then after the call was over he said, "What an asshole." After this I never knew whether to trust the guy. If he was that nice to some asshole on the phone, what did it mean that he was nice to us?) I switched advisers. The new adviser was much nicer--I knew him because I'd taken a class with him--but it didn't really matter since he was just another mathematician. I was lucky enough to stumble into statistics, but that's another story.

Anyway, it was funny to see that name--Susan Assmann! I did a quick web search and I'm pretty sure it is the same person. And her paper was cited 430 times--that's pretty impressive!

P.S. The actual paper by Assmann et al. is reasonable. It's a review of some statistical practice in medical research. They discuss the futility of subgroup analysis given that, compared to main effects, interactions are typically (a) smaller in magnitude and (b) estimated with larger standard errors. That's pretty much a recipe for disaster! (I made a similar argument in a 2001 article in Biostatistics, except that my article went in depth for one particular model and Assmann et al. were offering more general advice. And, unlike me, they had some data.) Ultimately I do think treatment interactions and subgroup analysis are important, but they should be estimated using multilevel models. If you try to estimate complex interactions using significance tests or classical interval estimation, you'll probably just be wasting your time, for reasons explained by Assmann et al.

Joscha Legewie points to this article by Lars Ronnegard, Xia Shen, and Moudud Alam, "hglm: A Package for Fitting Hierarchical Generalized Linear Models," which just appeared in the R journal. This new package has the advantage, compared to lmer(), of allowing non-normal distributions for the varying coefficients. On the downside, they seem to have reverted to the ugly lme-style syntax (for example, "fixed = y ~ week, random = ~ 1|ID" rather than "y ~ week + (1|ID)"). The old-style syntax has difficulties handling non-nested grouping factors. They also say they can estimate models with correlated random effects, but isn't that just the same as varying-intercept, varying-slope models, which lmer (or Stata alternatives such as gllamm) can already do? There's also a bunch of stuff on H-likelihood theory, which seems pretty pointless to me (although probably it won't do much harm either).

In any case, this package might be useful to some of you, hence this note.

WWJD? U can find out!


Two positions open in the statistics group at the NYU education school. If you get the job, you get to work with Jennifer Hill!

One position is a postdoctoral fellowship, and the other is a visiting professorship. The latter position requires "the demonstrated ability to develop a nationally recognized research program," which seems like a lot to ask for a visiting professor. Do they expect the visiting prof to develop a nationally recognized research program and then leave it there at NYU after the visit is over?

In any case, Jennifer and her colleagues are doing excellent work, both applied and methodological, and this seems like a great opportunity.

So-called fixed and random effects


Someone writes:

I am hoping you can give me some advice about when to use fixed and random effects model. I am currently working on a paper that examines the effect of . . . by comparing states . . .

It got reviewed . . . by three economists and all suggest that we run a fixed effects model. We ran a hierarchical model in the paper that allows the intercept and slope to vary before and after . . . My question is which is correct? We have run it both ways and really it makes no difference which model you run, the results are very similar. But for my own learning, I would really like to understand which to use under what circumstances. Is the fact that we use the whole population reason enough to just run a fixed effects model?

Perhaps you can suggest a good reference to this question of when to run a fixed vs. random effects model.

I'm not always sure what is meant by a "fixed effects model"; see my paper on Anova for discussion of the problems with this terminology:
Sometimes there is a concern about fitting multilevel models when there are correlations; see this paper for discussion of how to deal with this:

The short answer to your question is that, no, the fact that you use the whole population should not determine the model you fit. In particular, there is no reason for you to use a model with group-level variance equal to infinity. There is various literature with conflicting recommendations on the topic (see my Anova paper for references), but, as I discuss in that paper, a lot of these recommendations are less coherent than they might seem at first.

Karri Seppa writes:

My topic is regional variation in the cause-specific survival of breast cancer patients across the 21 hospital districts in Finland, this component being modeled by random effects. I am interested mainly in the district-specific effects, and with a hierarchical model I can get reasonable estimates also for sparsely populated districts.

Based on the recommendation given in the book by yourself and Dr. Hill (2007) I tend to think that the finite-population variance would be an appropriate measure to summarize the overall variation across the 21 districts. However, I feel it is somewhat incoherent first to assume a Normal distribution for the district effects, involving a "superpopulation" variance parameter, and then to compute the finite-population variance from the estimated district-specific parameters. I wonder whether the finite-population variance would be more appropriate in the context of a model with fixed district effects?

My reply:

Clustering and variance components


Raymond Lim writes:

Do you have any recommendations on clustering and binary models? My particular problem is I'm running a firm fixed effect logit and want to cluster by industry-year (every combination of industry-year). My control variable of interest is measured by industry-year and when I cluster by industry-year, the standard errors are 300x larger than when I don't cluster. Strangely, this problem only occurs when doing logit and not OLS (linear probability). Also, clustering just by field doesn't blow up the errors. My hunch is it has something to do with the non-nested structure of year, but I don't understand why this is only problematic under logit and not OLS.

My reply:

I'd recommend including four multilevel variance parameters, one for firm, one for industry, one for year, and one for industry-year. (In lmer, that's (1 | firm) + (1 | industry) + (1 | year) + (1 | industry.year)). No need to include (1 | firm.year) since in your data this is the error term. Try some varying slopes too.

If you have a lot of firms, you might first try the secret weapon, fitting a model separately for each year. Or if you have a lot of years, break the data up into 5-year periods and do the above analysis separately for each half-decade. Things change over time, and I'm always wary of models with long time periods (decades or more). I see this a lot in political science, where people naively think that they can just solve all their problems with so-called "state fixed effects," as if Vermont in 1952 is anything like Vermont in 2008.
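The "secret weapon" is easy to sketch with fake data. Everything below (the data-generating process, the drifting coefficient) is hypothetical, and the sketch is in Python rather than R:

```python
import numpy as np

rng = np.random.default_rng(0)
slopes = {}
for t, year in enumerate(range(2000, 2010)):
    # hypothetical data-generating process: the coefficient drifts over time
    n = 500
    x = rng.normal(size=n)
    y = 1.0 + (2.0 + 0.1 * t) * x + rng.normal(size=n)
    # fit the same regression separately to each year's data
    X = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    slopes[year] = coef[1]

# Plotting slopes[year] against year (with standard errors) displays the
# time-variation that a single pooled fit over all years would average away.
```

The payoff is the plot of per-year estimates: if the coefficient is stable, you see a flat band; if it drifts (as here), a single pooled "fixed effects" coefficient is an average of different things.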

My other recommendation is to build up your model from simple parts and try to identify exactly where your procedure is blowing up. We have a graph in ARM that makes this point. (Masanao and Yu-Sung know what graph I'm talking about.)

Quote of the day


"A statistical model is usually taken to be summarized by a likelihood, or a likelihood and a prior distribution, but we go an extra step by noting that the parameters of a model are typically batched, and we take this batching as an essential part of the model."

Multilevel quantile regression


Ryan Seals writes:

I'm an epidemiologist at Emory University, and I'm working on a project of release patterns in jails (basically trying to model how long individuals are in jail before they're released, for purposes of designing short-term health interventions, e.g., HIV testing, drug counseling, etc...). The question lends itself to quantile regression; we're interested in the # of days it takes for 50% and 75% of inmates to be released. But being a clustered/nested data structure, it also obviously lends itself to multilevel modeling, with the group-level being individual jails.

So: do you know of any work on multilevel quantile regression? My quick lit search didn't yield much, and I don't see any preprogrammed way to do it in SAS.

My reply:

To start with, I'm putting in the R keyword here, in the hope that some readers might be able to refer you to an R function that does what you want. Beyond this, I think it should be possible to program something in Bugs. In ARM we have an example of a multilevel ordered logit, which doesn't sound so different from what you're doing. I've never done a full quantile regression, but I imagine that you have to take some care in setting up the distributional form.

To start, you could fit some multilevel logistic regressions using different quantiles as cut-off points and plot your inferences to see generally what's going on.
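That last suggestion can be sketched as follows (in Python, with entirely made-up jail data; the shrinkage factor k is a crude stand-in for the amount of pooling that a real multilevel model would estimate):

```python
import numpy as np

rng = np.random.default_rng(1)
# made-up data: stay lengths in days for inmates in 20 jails of varying size
n_jails = 20
stays = [rng.exponential(scale=rng.uniform(5, 30), size=rng.integers(20, 400))
         for _ in range(n_jails)]
all_stays = np.concatenate(stays)

def shrunken_release_prob(cutoff, k=10.0):
    """P(released within `cutoff` days) per jail, shrunk toward the overall
    rate; k acts like a prior sample size (a multilevel model would estimate
    the amount of pooling instead of fixing it)."""
    p_bar = np.mean(all_stays <= cutoff)
    return np.array([(np.sum(s <= cutoff) + k * p_bar) / (len(s) + k)
                     for s in stays])

def days_until(quantile, cutoffs=np.arange(1, 120)):
    """Smallest cutoff at which each jail's release probability reaches
    `quantile` -- i.e., an estimate of that quantile of stay length."""
    probs = np.stack([shrunken_release_prob(c) for c in cutoffs])
    return np.array([cutoffs[np.argmax(probs[:, j] >= quantile)]
                     for j in range(n_jails)])

median_days = days_until(0.50)   # days until 50% of inmates are released
upper_days = days_until(0.75)    # days until 75% are released
```

Small jails get pulled toward the overall release curve, which is exactly the stabilization you'd want from a multilevel fit; the full model would also let you add inmate- and jail-level predictors.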

Lee Mobley writes:

I recently read what you posted on your blog How does statistical analysis differ when analyzing the entire population rather than a sample?

What you said in the blog accords with my training in econometrics. However I am concerned about a new wrinkle on this problem that derives from multilevel modeling.

Jason Roos sends along this article:

On election days many of us see a colorful map of the U.S. where each tiny county has a color on the continuum between red and blue. So far we have not used such data to improve the effectiveness of marketing models. In this study, we show that we should.

We demonstrate the usefulness of political data via an interesting application--the demand for movies. Using box-office data from 25 counties in the U.S. Midwest (21 quarters between 2000 and 2005) we show that by including political data one can improve out-of-sample predictions significantly. Specifically, we estimate the improvement in forecasts due to the addition of political data to be around $43 million per year for the entire U.S. theatrical market.

Furthermore, when it comes to movies we depart from previous work in another way. While previous studies have relied on pre-determined movie genres, we estimate perceived movie attributes in a latent space and formulate viewers' tastes as ideal points. Using perceived attributes improves the out-of-sample predictions even further (by around $93 million per year). Furthermore, the latent dimensions that we identify are not only effective in improving predictions, they are also quite insightful about the nature of movies.

My reaction:

I'm too busy to actually read this one, but it's a great title! But--hey!--whassup with all those tables? Let's start by replacing Tables 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10 by graphs. (See here for some tips.) Or are you gonna tell me that the reader really needs to know that the standard error for the "Turnout - Midterm" variable in Dimension 4 is "0.108"? I mean, why not just print out a core dump in hex and cut out the middleman?

In all seriousness, the paper looks interesting and I'm sure would hugely benefit by some plots of data and fitted models.

On a more specific note, I wonder if the authors can shed any light on the controversial question of Brokeback Mountain's popularity in Republican-voting areas in "red states" (search Kaus Brokeback Mountain for more than you could possibly want to read on this topic). I never saw the movie, nor have I followed the debates about its box office, but I recall there being some controversy on the topic.

Alban Zeber writes:

Our "arm" package in R requires Doug Bates's "lme4" package, which fits multilevel models.

lme4 is currently having some problems on the Mac. But installation on the Mac can be done; it just takes a bit of work.

I have two sets of instructions below.

Mandelbrot on taxonomy (from 1955; the first publication about fractals that I know of):


Searching for Mandelbrot on the blog led me to Akaike, who also recently passed away and also did interesting early work on self-similar stochastic processes.

For example, this wonderful opening of his 1962 paper, "On a limiting process which asymptotically produces f^{-2} spectral density":

In the recent papers in which the results of the spectral analyses of roughnesses of runways or roadways are reported, the power spectral densities of approximately the form f^{-2} (f: frequency) are often treated. This fact directed the present author to the investigation of the limiting process which will provide the f^{-2} form under fairly general assumptions. In this paper a very simple model is given which explains a way how the f^{-2} form is obtained asymptotically. Our fundamental model is that the stochastic process, which might be considered to represent the roughness of the runway, is obtained by alternative repetitions of roughening and smoothing. We can easily get the limiting form of the spectrum for this model. Further, by taking into account the physical meaning of roughening and smoothing we can formulate the conditions under which this general result assures that the f^{-2} form will eventually take place.
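Akaike's f^{-2} limit is easy to illustrate numerically: integrated white noise, the crudest possible "roughening," already has an approximately f^{-2} power spectrum at low frequencies. A minimal sketch (not Akaike's actual model):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2 ** 14
rough = np.cumsum(rng.normal(size=n))   # integrated white noise: "roughening"

spec = np.abs(np.fft.rfft(rough)) ** 2 / n   # periodogram
freq = np.fft.rfftfreq(n)

# regress log-power on log-frequency over the low-frequency range,
# where the f^{-2} behavior is expected to hold
lo = (freq > 0) & (freq < 0.1)
slope, intercept = np.polyfit(np.log(freq[lo]), np.log(spec[lo]), 1)
# the slope should come out near -2 for a 1/f^2 process
```

Akaike's point is subtler--he gets f^{-2} as the limit of alternating roughening and smoothing under fairly general conditions--but the random walk is the quickest way to see where the shape comes from.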

P.S. I've placed this in the Multilevel Modeling category because fractals are a form of multilevel model, although not always recognized as such: fractals are hi-tech physical science models, whereas multilevel modeling is associated with low-grade fields such as education and social science. The connection is clear, though, when you consider Mandelbrot's taxonomy model or the connection between Akaike's dynamic model of roughness and complex models of student, teacher, and classroom effects.

I met Mandelbrot once, about 20 years ago. Unfortunately, at the time I didn't recognize the general importance of multilevel models, so all I could really do in the conversation was to express my awe and appreciation of his work. If I could go back now, I'd have some more interesting things to ask.

I never met Akaike at all.

I recently saw this article that Stephen Senn wrote a couple of years ago, criticizing Bayesian sensitivity analyses that relied on vague prior distributions. I'm moving more and more toward the idea that Bayesian analysis should include actual prior information, so I generally agree with his points. As I used to say when teaching Bayesian data analysis, a Bayesian model is modular, and different pieces can be swapped in and out as needed. So you might start with an extremely weak prior distribution, but if it makes a difference it's time to bite the bullet and include more information.

My only disagreement with Senn's paper is in its recommendation to try the so-called fixed-effects analysis. Beyond the difficulties with terminology (the expressions "fixed" and "random" effects are defined in different ways by different people in the literature; see here for a rant on the topic which made its way into some of my articles and books), there is the problem that, when a model gets complicated, some of the estimates that are called "fixed effects" get very noisy, especially in sparse data settings such as arise in logistic regression.

I received the following question from an education researcher:

James O'Brien writes:

How would you explain, to a "classically-trained" hypothesis-tester, that "It's OK to fit a multilevel model even if some groups have only one observation each"?

I [O'Brien] think I understand the logic and the statistical principles at work in this, but I've having trouble being clear and persuasive. I also feel like I'm contending with some methodological conventional wisdom here.

My reply: I'm so used to this idea that I find it difficult to defend it in some sort of general conceptual way. So let me retreat to a more functional defense, which is that multilevel modeling gives good estimates, especially when the number of observations per group is small.

One way to see this in any particular example is through cross-validation. Another way is to consider the alternatives. If you try really hard you can come up with a "classical hypothesis testing" approach which will do as well as the multilevel model. It would just take a lot of work. I'd rather put that effort into statistical modeling and data visualization instead.

If you are in a situation where someone really doesn't want to do the multilevel model, you could perhaps ask your skeptical colleague what his or her goals are in your particular statistical modeling problem. Then you can go from there.

Here's my discussion of this article for the Journal of the Royal Statistical Society:

I will comment on this paper in my role as applied statistician and consumer of Bayesian computation. In the last few years, my colleagues and I have felt the need to fit predictive survey responses given multiple discrete predictors, for example estimating voting given ethnicity and income within each of the fifty states, or estimating public opinion about gay marriage given age, sex, ethnicity, education, and state. We would like to be able to fit such models with ten or more predictors--for example, religion, religious attendance, marital status, and urban/rural/suburban residence in addition to the factors mentioned above.

There are (at least) three reasons for fitting a model with many predictive factors and potentially a huge number of interactions among them:

1. Deep interactions can be of substantive interest. For example, Gelman et al. (2009) discuss the importance of interactions between income, religion, religious attendance, and state in understanding how people vote.

2. Deep interactions can increase predictive power. For example, Gelman and Ghitza (2010) show how the relation between voter turnout and the combination of sex, ethnicity, education, and state has systematic patterns that would not be captured by main effects or even two-way interactions.

3. Deep interactions can help correct for sampling problems. Nonresponse rates in opinion polls continue to rise, and this puts a premium on post-sampling adjustments. We can adjust for known differences between sampling and population using poststratification, but to do so we need reasonable estimates of the average survey response within narrow slices of the population (Gelman, 2007).
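The poststratification step in point 3 is itself simple arithmetic; the hard part, which the multilevel model handles, is getting stable estimates for the narrow cells. A toy sketch with made-up numbers:

```python
import numpy as np

# hypothetical cell estimates of a survey response (say, support for a
# policy), one per poststratification cell; in practice these come from
# the multilevel model, which stabilizes the sparse cells
cell_estimate = np.array([0.30, 0.45, 0.55, 0.70])

# known population counts for the same cells (e.g., from the census)
cell_count = np.array([1_000_000, 3_000_000, 2_500_000, 500_000])

# poststratified estimate: population-weighted average of cell estimates
estimate = np.sum(cell_count * cell_estimate) / np.sum(cell_count)
```

With these made-up numbers the poststratified estimate is 3.375/7, about 0.482, pulled toward the estimates of the most populous cells regardless of how many survey respondents happened to land in each cell.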

Our key difficulty--familiar in applied statistics but not always so clear in discussions of statistical computation--is that, while we have an idea of the sort of model we would like to fit, we are unclear on the details. Thus, our computational task is not merely to fit a single model but to try out many different possibilities. My colleagues and I need computational tools that are:

(a) able to work with moderately large datasets (aggregations of surveys with total sample size in the hundreds of thousands);
(b) able to handle complicated models with tens of thousands of latent parameters;
(c) flexible enough to fit models that we haven't yet thought of;
(d) fast enough that we can fit model after model.

We all know by now that hierarchical Bayesian methods are a good way of estimating large numbers of parameters. I am excited about the article under discussion, and others like it, because the tools therein promise to satisfy conditions (a), (b), (c), (d) above.

Yesterday at the sister blog, Nate Silver forecast that the Republicans have a two-thirds chance of regaining the House of Representatives in the upcoming election, with an expected gain of 45 House seats.

Last month, Bafumi, Erikson, and Wlezien released their forecast that gives the Republicans an 80% chance of takeover and an expected gain of 50 seats.

As all the above writers emphasize, these forecasts are full of uncertainty, so I treat the two predictions--a 45-seat swing or a 50-seat swing--as essentially identical at the national level.

And, as regular readers know, as far back as a year ago, the generic Congressional ballot (those questions of the form, "Which party do you plan to vote for in November?") was also pointing to big Republican gains.

As Bafumi et al. point out, early generic polls are strongly predictive of the election outcome, but they need to be interpreted carefully. The polls move in a generally predictable manner during the year leading up to an election, and so you want to fit a model to the polls when making a forecast, rather than just taking their numbers at face value.


Having read Nate's description of his methods and also the Bafumi, Erikson, and Wlezien paper, my impression is that the two forecasting procedures are very similar. Both of them use national-level information to predict the nationwide vote swing, then use district-level information to map that national swing onto a district level. Finally, both methods represent forecasting uncertainty as a probability distribution over the 435 district-level outcomes and then summarize that distribution using simulations.

I'll go through the steps in order.

1. Forecast national vote share for the two parties from a regression model using the generic ballot and other information including the president's party, his approval rating, and recent economic performance.

2. Map the estimated national swing to district-by-district swings using the previous election results in each district as a baseline, then correcting for incumbency and uncontested elections.

2a. Nate also looks at local polls and expert district-by-district forecasts from the Cook Report and CQ Politics and, where these polls and forecasts differ from the adjusted-uniform-swing model above, he compromises between the different sources of available information. He also throws in other district-level information including data on campaign contributions.

3. Fit the model to previous years' data and use the errors in those retrospective fits to get an estimate of forecast uncertainty. Using simulation, propagate that uncertainty to get uncertain forecasts of elections at the district and national levels.
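Steps 1-3 can be sketched as a simulation. All of the numbers below are hypothetical, chosen only to loosely match the forecasts quoted above:

```python
import numpy as np

rng = np.random.default_rng(3)
n_districts, n_sims = 435, 10_000

# hypothetical baselines: Democratic share of the two-party vote in each
# district in the previous election (step 2's starting point)
baseline = rng.normal(loc=0.54, scale=0.12, size=n_districts)

# step 1: national swing forecast, with its uncertainty
# (here: a 6-point expected swing to the Republicans)
national_swing = rng.normal(loc=-0.06, scale=0.02, size=n_sims)

# steps 2-3: map the national swing onto districts, add district-level
# error, and propagate both sources of uncertainty by simulation
district_error = rng.normal(scale=0.04, size=(n_sims, n_districts))
dem_share = baseline + national_swing[:, None] + district_error

dem_seats = np.sum(dem_share > 0.5, axis=1)   # Democratic seats per simulation
p_takeover = np.mean(dem_seats < 218)         # P(Republican majority)
```

The real forecasts additionally adjust for incumbency and uncontested seats and calibrate the error scales against past elections, but the simulate-and-count structure is the same.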

Step 2 is important and is indeed done by the best political-science forecasters. As Kari and I discuss, to a large extent, local and national swings can be modeled separately, and it is a common mistake for people to look at just one or the other.

The key difference between Nate's forecast and the others seems to be step 2a. Even if, as I expect, step 2a adds little to the accuracy of the national forecast, I think it's an excellent idea--after all, the elections are being held within districts. And, as Nate notes, when the local information differs dramatically from the nationally-forecast trend, often something interesting is going on. And these sorts of anomalies should be much more easily found by comparing to forecasts than by looking at polls in isolation.

Cyrus writes:

I [Cyrus] was teaching a class on multilevel modeling, and we were playing around with different methods to fit a random effects logit model with 2 random intercepts---one corresponding to "family" and another corresponding to "community" (labeled "mom" and "cluster" in the data, respectively). There are also a few regressors at the individual, family, and community level. We were replicating in part some of the results from the following paper: Improved estimation procedures for multilevel models with binary response: a case-study, by G Rodriguez, N Goldman.

(I say "replicating in part" because we didn't include all the regressors that they use, only a subset.) We were looking at the performance of estimation via glmer in R's lme4 package, glmmPQL in R's MASS package, and Stata's xtmelogit. We wanted to study the performance of various estimation methods, including adaptive quadrature methods and penalized quasi-likelihood.

I was shocked to discover that glmer's default setting is adaptive Gaussian quadrature with 1 integration point (that is, a Laplace approximation). In addition, and even more shocking, glmer **does not allow** more than one integration point to be used for models with more than one random effect. With only one random effect, you can increase the number of integration points. But with multiple random effects (like what we had, which was 2), when you try to specify multiple integration points (with the nAGQ setting), you get an error saying that it can't do it. At first, I kinda shrugged, but then when I saw the difference that it made, alarm bells started going off.

I made a table showing results from estimates using glmer's Laplace approximation, and then also results from the glmmPQL which uses penalized quasi-likelihood, and then results from Stata's xtmelogit using first its own Laplace approximation and then a 7-pt Gaussian adaptive quadrature. Note that in Stata, the 7-pt quadrature fit is the default. This is a high-dimensional "brute force" method for fitting a complex likelihood.

Now, when we look at the results for the random effects standard deviation estimates, we see that the Laplace approximations in R (via glmer) and Stata (model 3 in the Table) are identical. In addition, taking the 7-pt quadrature estimate from xtmelogit to be the best guess of the bunch, the Laplace approximation-based estimates are HORRIBLY biased downward. That is, the estimated random effect variance is way too small. The PQL estimates are also biased downward, but not so badly. This is in line with what Rodriguez and Goldman found, using a 20-point quadrature and an MCMC fit as benchmarks (they produced nearly identical results).

The upshot is that glmer's default is REALLY BAD; and this default is the only option when you have more than one random effect. My suggestion then is to either use Stata, which has a very sensible 7-point quadrature default (although as a brute force method, it is slow) or fit the model with MCMC.

If I understand, you are using glmer? If so, have you (1) looked into this or (2) checked what glmer is giving you against xtmelogit or BUGS? I would trust the xtmelogit or BUGS answers more.

My reply:

1. I am working with Sophia Rabe-Hesketh, who's one of the people who programmed gllamm in Stata. Sophia has been telling me for a while that the method used by gllamm is better than the Laplace method that is currently used in glmer.

2. Just some terminology: I'd prefer to talk about different "approximations" rather than different "estimation methods." To me, there's just one estimation method here--hierarchical Bayes (or, as Sophia would say, marginal likelihood inference)--and these various approaches to point estimates of variance parameters are approximations to get around the difficulties of doing the integrals and computing the marginal likelihood.

3. I also prefer to use the term "varying intercepts" (and slopes) rather than "random effects," because the term "random effects" is not so clearly defined in the literature (as Jennifer and I discuss in our book). This doesn't change the substance of your comment but it makes it easier for me to follow.

4. In my project with Sophia (and with Jingchen Liu, also cc-ed here), we're working on several parallel tracks:
(a) Adding a prior distribution on the variance parameters to stabilize the point estimates and keep them away from the boundary of parameter space (which, in the context of scalar variance parameters, implies that we keep the estimates away from zero).
(b) Adding a few steps of the Metropolis algorithm to capture some of the uncertainty in the variance parameters and also fix some of that bias you're talking about. It's my intuition that a low number (e.g., 10) of Metropolis steps will clean up a lot of problems. Although, right now, that's just an intuition; we haven't tried it yet.
(c) Working in R (using glmer) and also separately in Stata (using gllamm).

5. Last but not least, it sounds like you've encountered an important principle of research: The way to really learn a subject is to teach it! That's why college professors (at least at a place like Columbia) know so much: we're always teaching new things.
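The Laplace-vs-quadrature issue in the exchange above can be illustrated on a single cluster's contribution to a logistic random-intercept likelihood. The sketch below (in Python, with made-up values of y, n, and tau) uses the fact that adaptive Gauss-Hermite quadrature with one node is exactly the Laplace approximation, so adding nodes shows directly how much the one-node answer misses:

```python
import numpy as np

# One cluster's contribution to a logistic random-intercept likelihood:
#   I = integral of exp(l(u)) du, with
#   l(u) = y*u - n*log(1 + e^u) - u^2 / (2*tau^2)
# (y successes in n trials; random intercept u ~ N(0, tau^2);
# the numbers are made up)
y, n, tau = 1.0, 10.0, 2.0

def l(u):
    return y * u - n * np.log1p(np.exp(u)) - u ** 2 / (2 * tau ** 2)

# locate the mode by Newton's method
u = 0.0
for _ in range(50):
    s = 1 / (1 + np.exp(-u))
    grad = y - n * s - u / tau ** 2
    hess = -n * s * (1 - s) - 1 / tau ** 2
    u -= grad / hess
c = -hess   # curvature at the mode

def agq(k):
    """Adaptive Gauss-Hermite quadrature with k nodes, centered and scaled
    at the mode; k = 1 reduces exactly to the Laplace approximation."""
    t, w = np.polynomial.hermite.hermgauss(k)
    sigma = np.sqrt(2 / c)
    return sigma * np.sum(w * np.exp(l(u + sigma * t) + t ** 2))

truth = agq(51)   # many nodes, as a stand-in for the exact integral
err_laplace = abs(agq(1) - truth)
err_7pt = abs(agq(7) - truth)
# err_7pt comes out far smaller than err_laplace for this skewed integrand
```

The skewness here (few successes, wide prior on the intercept) is exactly the sparse-data situation in the Rodriguez and Goldman example, which is why the Laplace-based variance estimates were so far off.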

The $900 kindergarten teacher


Paul Bleicher writes:

This simply screams "post-hoc, multiple comparisons problem," though I haven't seen the paper.

A quote from the online news report:

The findings revealed that kindergarten matters--a lot. Students of kindergarten teachers with above-average experience earn $900 more in annual wages than students of teachers with less experience than average. Being in a class of 15 students instead of a class of 22 increased students' chances of attending college, especially for children who were disadvantaged . . . Children whose test scores improved to the 60th percentile were also less likely to become single parents, more likely to own a home by age 28, and more likely to save for retirement earlier in their work lives.

I haven't seen the paper either. $900 doesn't seem like so much to me, but I suppose it depends where you stand on the income ladder.

Regarding the multiple comparisons problem: this could be a great example for fitting a multilevel model. Seriously.

Dave Armstrong writes:

Subhadeep Mukhopadhyay writes:

I am convinced of the power of hierarchical modeling and the individual-parameter-pooling concept. I was wondering how multilevel modeling could influence the estimate of the grand mean (NOT the individual labels).

My reply: Multilevel modeling will affect the estimate of the grand mean in two ways:

1. If the group-level mean is correlated with group size, then the partial pooling will change the estimate of the grand mean (and, indeed, you might want to include group size or some similar variable as a group-level predictor).

2. In any case, the extra error term(s) in a multilevel model will typically affect the standard error of everything, including the estimate of the grand mean.
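Point 1 is easy to see in a small simulation (the groups and numbers below are made up): when larger groups have larger means, the complete-pooling and no-pooling grand means diverge, and a precision-weighted (partial-pooling) estimate typically lands in between:

```python
import numpy as np

rng = np.random.default_rng(4)
# made-up groups in which group size is correlated with the group mean
sizes = np.array([5, 10, 20, 50, 100, 200])
true_means = np.array([0.0, 0.4, 0.8, 1.2, 1.6, 2.0])
groups = [rng.normal(loc=m, size=n) for m, n in zip(true_means, sizes)]
gm = np.array([g.mean() for g in groups])   # observed group means

complete_pooling = np.mean(np.concatenate(groups))   # weights groups by size
no_pooling = gm.mean()                               # weights groups equally

# crude stand-in for the multilevel estimate: precision-weight each group
# mean, with between-group variance tau2 (a real fit would estimate tau2
# rather than plugging in the sample variance of the means)
tau2, sigma2 = gm.var(ddof=1), 1.0
w = 1 / (sigma2 / sizes + tau2)
partial_pooling = np.sum(w * gm) / np.sum(w)
# partial_pooling typically lands between the other two estimates
```

When tau2 is large the weights flatten out and the estimate approaches the unweighted mean of group means; when tau2 is small the weights approach the sample sizes and the estimate approaches the complete-pooling mean.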

David Shor writes:

References on predicting elections


Mike Axelrod writes:

I [Axelrod] am interested in building a model that predicts voting on the precinct level, using variables such as party registration, age, sex, income etc. Surely political scientists have worked on this problem.

I would be grateful for any reference you could provide in the way of articles and books.

My reply: Political scientists have worked on this problem, and it's easy enough to imagine hierarchical models of the sort discussed in my book with Jennifer. I can picture what I would do if asked to forecast at the precinct level, for example to model exit polls. (In fact, I was briefly hired by the exit poll consortium in 2000 to do this, but then after I told them about hierarchical Bayes, they un-hired me!) But I don't actually know of any literature on precinct-level forecasting. Perhaps one of you out there knows of some references?

Eric McGhee writes:

ARM solutions


People sometimes email asking if a solution set is available for the exercises in ARM. The answer, unfortunately, is no. Many years ago, I wrote up 50 solutions for BDA and it was a lot of work--really, it was like writing a small book in itself. The trouble is that, once I started writing them up, I wanted to do it right, to set a good example. That's a lot more effort than simply scrawling down some quick answers.

In discussing the ongoing Los Angeles Times series on teacher effectiveness, Alex Tabarrok and I both were impressed that the newspaper was reporting results on individual teachers, moving beyond the general research findings ("teachers matter," "KIPP really works, but it requires several extra hours in the school day," and so forth) that we usually see from value-added analyses in education. My first reaction was that the L.A. Times could get away with this because, unlike academic researchers, they can do whatever they want as long as they don't break the law. They don't have to answer to an Institutional Review Board.

(By referring to this study by its publication outlet rather than its authors, I'm violating my usual rule (see the last paragraph here). In this case, I think it's ok to refer to the "L.A. Times study" because what's notable is not the analysis (thorough as it may be) but how it is being reported.)

Here I'd like to highlight a few other things that came up in our blog discussion, and then I'll paste in a long and informative comment sent to me by David Huelsbeck.

But first some background.

Mister P gets married


Jeff, Justin, and I write:

Gay marriage is not going away as a highly emotional, contested issue. Proposition 8, the California ballot measure that bans same-sex marriage, has seen to that, as it winds its way through the federal courts. But perhaps the public has reached a turning point.

And check out the (mildly) dynamic graphics. The picture below is ok but for the full effect you have to click through and play the movie.


Matching at two levels


Steve Porter writes with a question about matching for inferences in a hierarchical data structure. I've never thought about this particular issue, but it seems potentially important.

Maybe one or more of you have some useful suggestions?

Porter writes:

George Leckie writes:

The Centre for Multilevel Modelling is seeking to appoint two social statisticians or social scientists with advanced quantitative skills for two ESRC-funded research projects:

1. Research Assistant in Social Statistics, 1 year from 1 October 2010

2. Research Assistant/Associate in Social Statistics, 34 months from 1 October 2010.

See paragraphs 13-15 of this article by Dan Balz.

That half-Cauchy prior


Xiaoyu Qian writes:

Daniel Corsi writes:

I was wondering if you could help me with some code to set up a posterior predictive check for an unordered multinomial multilevel model. In this case the outcome is categories of bmi (underweight, normal weight, and overweight) based on individuals from 360 different areas. What I would like to do is set up a replicated dataset to see how the number of overweight/underweight/normal weight individuals based on the model compares to the actual data, with some kind of a graphical summary. I am following along with chapter 24 of the ARM book but I want to verify that the replicated data account for the multilevel structure of the data of people within areas. I am attaching the code I used to run a simple model with only 2 predictors (area wealth and urban/rural designation).

My reply: The Bugs code is a bit much for me to look at--but I do recommend that you run it from R, which will give you more flexibility in preprocessing and postprocessing the data. Beyond this, there are different ways of doing replications. You can replicate new people in existing areas or new people in new areas. It depends on your application. In a cluster sample, you'll probably want new areas. In an exhaustive sample, there might not be any new areas and you'll want to keep your existing group-level parameters.
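To illustrate the two kinds of replication, here is a generic normal-outcome sketch in base R (this is not Daniel's Bugs model, and all parameter values are invented):

```r
set.seed(2)
J <- 360; n_per <- 10
sigma_a <- 0.5; sigma_y <- 1          # assumed area-level and individual-level sds
a <- rnorm(J, 0, sigma_a)             # pretend: one posterior draw of the area effects
area <- rep(1:J, each = n_per)

# Replication 1: new people in EXISTING areas -- keep the fitted area effects a[j]
y_rep_existing <- rnorm(J * n_per, mean = a[area], sd = sigma_y)

# Replication 2: new people in NEW areas -- redraw the area effects from their distribution
a_new <- rnorm(J, 0, sigma_a)
y_rep_new <- rnorm(J * n_per, mean = a_new[area], sd = sigma_y)
```

In a real analysis you would loop this over all posterior draws of (a, sigma_a, sigma_y) and compare summaries of y_rep to the observed data.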

John Kastellec, Jeff Lax, and Justin Phillips write:

Do senators respond to the preferences of their states' median voters or only to the preferences of their co-partisans? We [Kastellec et al.] study responsiveness using roll call votes on ten recent Supreme Court nominations. We develop a method for estimating state-level public opinion broken down by partisanship. We find that senators respond more powerfully to their partisan base when casting such roll call votes. Indeed, when their state median voter and party median voter disagree, senators strongly favor the latter. [emphasis added] This has significant implications for the study of legislative responsiveness, the role of public opinion in shaping the personnel of the nation's highest court, and the degree to which we should expect the Supreme Court to be counter-majoritarian. Our method can be applied elsewhere to estimate opinion by state and partisan group, or by many other typologies, so as to study other important questions of democratic responsiveness and performance.

Their article uses Mister P and features some beautiful graphs.

Using ranks as numbers


David Shor writes:

I'm dealing with a situation where I have two datasets, one that assigns each participant a discrete score out of five for a set of particular traits (Dog behavior characteristics by breed), and another from an independent source that ranks each breed by each characteristic. It's also possible to obtain the results of a survey, where experts were asked to rank 7 randomly picked breeds by characteristics.

I'm interested in obtaining estimates for each trait, and intuitively, it seems clear that the second and third datasets provide a lot of information. But it's unclear how to incorporate them to infer latent variables, since only sample ranks are observed. This seems like a common problem; do you have any suggestions?

My quick answer is that you can treat ranks as numbers (a point we make somewhere in Bayesian Data Analysis, I believe) and just fit an item-response model from there.

Val Johnson wrote an article on this in JASA a few years ago, "Bayesian analysis of rank data with application to primate intelligence experiments." He also did similar work calibrating college grades.
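Here's the ranks-as-numbers shortcut in a toy base-R form (all data invented): convert the ranks to normal scores so both sources live on a common scale before fitting anything fancier.

```r
set.seed(8)
n_breeds <- 30
trait <- rnorm(n_breeds)      # latent trait we'd like to recover

# Source 1: noisy 1-5 ratings of each breed
score <- pmin(5, pmax(1, round(trait + rnorm(n_breeds, 0, 0.5) + 3)))
# Source 2: a full ranking of the breeds (1 = best)
rk <- rank(-(trait + rnorm(n_breeds, 0, 0.5)))

# Treat the ranks as numbers via normal scores, putting them on a continuous scale:
rank_score <- qnorm((n_breeds + 0.5 - rk) / n_breeds)

# Both observed variables track the latent trait:
c(cor(score, trait), cor(rank_score, trait))
```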

Maggie Fox writes:

Brain scans may be able to predict what you will do better than you can yourself . . . They found a way to interpret "real time" brain images to show whether people who viewed messages about using sunscreen would actually use sunscreen during the following week.

The scans were more accurate than the volunteers were, Emily Falk and colleagues at the University of California Los Angeles reported in the Journal of Neuroscience. . . .

About half the volunteers had correctly predicted whether they would use sunscreen. The research team analyzed and re-analyzed the MRI scans to see if they could find any brain activity that would do better.

Activity in one area of the brain, a particular part of the medial prefrontal cortex, provided the best information.

"From this region of the brain, we can predict for about three-quarters of the people whether they will increase their use of sunscreen beyond what they say they will do," Lieberman said.

"It is the one region of the prefrontal cortex that we know is disproportionately larger in humans than in other primates," he added. "This region is associated with self-awareness, and seems to be critical for thinking about yourself and thinking about your preferences and values."

Hmm . . . they "analyzed and re-analyzed the scans to see if they could find any brain activity" that would predict better than 50%?! This doesn't sound so promising. But maybe the reporter messed up on the details . . .

I took advantage of my library subscription to take a look at the article, "Predicting Persuasion-Induced Behavior Change from the Brain," by Emily Falk, Elliot Berkman, Traci Mann, Brittany Harrison, and Matthew Lieberman. Here's what they say:

- "Regions of interest were constructed based on coordinates reported by Soon et al. (2008) in MPFC and precuneus, regions that also appeared in a study of persuasive messaging." OK, so they picked two regions of interest ahead of time. They didn't just search for "any brain activity." I'll take their word for it that they just looked at these two, that they didn't actually look at 50 regions and then say they reported just two.

- Their main result had a t-statistic of 2.3 (on 18 degrees of freedom, thus statistically significant at the 3% level) in one of the two regions they looked at, and a t-statistic of 1.5 (not statistically significant) in the other. A simple multiple-comparisons correction takes the p-value of 0.03 and bounces it up to an over-the-threshold 0.06, which I think would make the result unpublishable! On the other hand, a simple average gives a healthy t-statistic of (1.5+2.3)/sqrt(2) = 2.7, although that ignores any possible correlation between the two regions (they don't seem to supply that information in their article).

- They also do a cross-validation but this seems 100% pointless to me since they do the cross-validation on the region that already "won" on the full data analysis. For the cross-validation to mean anything at all, they'd have to use the separate winner on each of the cross-validatory fits.

- As an outcome, they use before-after change. They should really control for the "before" measurement as a regression predictor. That's a freebie. And, when you're operating at a 6% significance level, you should take any freebie that you can get! (It's possible that they tried adjusting for the "before" measurement and it didn't work, but I assume they didn't do that, since I didn't see any report of such an analysis in the article.)
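The arithmetic in the second bullet can be checked in a few lines of R:

```r
# Two-sided p-value for t = 2.3 on 18 degrees of freedom:
p1 <- 2 * pt(-2.3, df = 18)      # about 0.03

# Bonferroni correction for having looked at the better of two regions:
p_bonf <- 2 * p1                 # over the 0.05 threshold

# Simple average of the two t-statistics (ignoring any correlation between regions):
t_avg <- (1.5 + 2.3) / sqrt(2)   # about 2.7

c(p1, p_bonf, t_avg)
```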

The bottom line

I'm not saying that the reported findings are wrong, I'm just saying that they're not necessarily statistically significant in the usual way this term is used. I think that, in the future, such work would be improved by more strongly linking the statistical analysis to the psychological theories. Rather than simply picking two regions to look at, then taking the winner in a study of n=20 people, and going from there to the theories, perhaps they could more directly model what they're expecting to see.

The difference between . . .

Also, the difference between "significant'' and "not significant'' is not itself statistically significant. How is this relevant in the present study? They looked at two regions, MPFC and precuneus. Both showed positive correlations, one with a t-value of 2.3, one with a t-value of 1.5. The first of these is statistically significant (well, it is, if you ignore that it's the maximum of two values), the second is not. But the difference is not anything close to statistically significant, not at all! So why such a heavy emphasis on the winner and such a neglect of #2?

Here's the count from a simple document search:

MPFC: 20 instances (including 2 in the abstract)
precuneus: 8 instances (0 in the abstract)

P.S. The "picked just two regions" bit gives a sense of why I prefer Bayesian inference to classical hypothesis testing. The right thing, I think, is actually to look at all 50 regions (or 100, or however many regions there are) and do an analysis including all of them. Not simply picking the region that is most strongly correlated with the outcome and then doing a correction--that's not the most statistically efficient thing to do; you're just begging to be overwhelmed by noise--but rather using the prior information about regions in a subtler way than simply picking out 2 and ignoring the other 48. For example, you could have a region-level predictor which represents prior belief in the region's importance. Or you could group the regions into a few pre-chosen categories and then estimate a hierarchical model with each group of regions being its own batch with group-level mean and standard deviation estimated from data. The point is, you have information you want to use--prior knowledge from the literature--without it unduly restricting the possibilities for discovery in your data analysis.
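Here's a toy base-R version of that all-regions idea, with fake data and an assumed between-region sd: every raw estimate gets shrunk toward the common mean, rather than one region being picked and the rest discarded.

```r
set.seed(3)
K <- 50
theta <- c(0.5, rep(0, K - 1))   # pretend only region 1 has a real effect
se <- rep(0.2, K)                # standard error of each region's raw estimate
y <- rnorm(K, theta, se)         # raw per-region estimates

mu <- mean(y)                    # estimated common mean
tau <- 0.1                       # assumed between-region sd (could itself be estimated)

# Posterior mean for each region: precision-weighted compromise of y[k] and mu
shrunk <- (y / se^2 + mu / tau^2) / (1 / se^2 + 1 / tau^2)
```

With these numbers each estimate keeps only a fraction of its distance from the common mean, so noise-driven "winners" are pulled back without being thrown away.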

Near the end, they write:

In addition, we observed increased activity in regions involved in memory encoding, attention, visual imagery, motor execution and imitation, and affective experience with increased behavior change.

These were not pre-chosen regions, which is fine, but at this point I'd like to see the histogram of correlations for all the regions, along with a hierarchical model that allows appropriate shrinkage. Or even a simple comparison to the distribution of correlations one might expect to see by chance. By suggesting this, I'm not trying to imply that all the findings in this paper are due to chance; rather, I'm trying to use statistical methods to subtract out the chance variation as much as possible.

P.P.S. Just to say this one more time: I'm not at all trying to claim that the researchers are wrong. Even if they haven't proven anything in a convincing way, I'll take their word for it that their hypothesis makes scientific sense. And, as they point out, their data are definitely consistent with their hypotheses.

P.P.P.S. For those who haven't been following these issues, see here, here, here, and here.

Grazia Pittau, Roberto Zelli, and I came out with a paper investigating the role of economic variables in predicting regional disparities in reported life satisfaction of European Union citizens. We use multilevel modeling to explicitly account for the hierarchical nature of our data, respondents within regions and countries, and for understanding patterns of variation within and between regions. Here's what we found:

- Personal income matters more in poor regions than in rich regions, a pattern that still holds for regions within the same country.

- Being unemployed is negatively associated with life satisfaction even after controlling for income variation. Living in high-unemployment regions does not alleviate the unhappiness of being out of work.

- After controlling for individual characteristics and modeling interactions, regional differences in life satisfaction still remain.

Here's a quick graph; there's more in the article:

George Leckie points to this free online course from the Centre for Multilevel Modelling (approx 600 pages of materials covering theory and implementation in MLwiN and Stata).



Joe Fruehwald writes:

I'm working with linguistic data, specifically binomial hits and misses of a certain variable for certain words (specifically whether or not the "t" sound was pronounced at the end of words like "soft"). Word frequency follows a power law, with most words appearing just once, and with some words being hyperfrequent. I'm not interested in specific word effects, but I am interested in the effect of word frequency.

A logistic model fit is going to be heavily influenced by the effect of the hyperfrequent words which constitute only one type. To control for the item effect, I would fit a multilevel model with a random intercept by word, but like I said, most of the words appear only once.

Is there a principled approach to this problem?

My response: It's ok to fit a multilevel model even if most groups only have one observation each. You'll want to throw in some word-level predictors too. Think of the multilevel model not as a substitute for the usual thoughtful steps of statistical modeling but rather as a way to account for unmodeled error at the group level.
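Here's a base-R illustration (fake numbers, and a normal approximation rather than the logistic model) of why singleton groups are fine: the partial-pooling estimate for a word seen once is mostly the population-level prediction, while a hyperfrequent word keeps nearly all of its raw estimate.

```r
set.seed(4)
sigma_y <- 1; sigma_word <- 0.5
n_j <- c(1, 1, 1, 500)                  # three rare words, one hyperfrequent word
alpha <- rnorm(4, 0, sigma_word)        # true word effects
ybar <- alpha + rnorm(4, 0, sigma_y / sqrt(n_j))

# Shrinkage factor for each word, taking the population mean as 0:
shrink <- (n_j / sigma_y^2) / (n_j / sigma_y^2 + 1 / sigma_word^2)
alpha_hat <- shrink * ybar

round(shrink, 2)   # 0.20 0.20 0.20 0.99
```

The singletons contribute what little they can and are otherwise pulled to the model's prediction, which is exactly why word-level predictors (such as frequency) matter so much here.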

John Lawson writes:

I have been experimenting using Bayesian Methods to estimate variance components, and I have noticed that even when I use a noninformative prior, my estimates are never close to the method of moments or REML estimates. In every case I have tried, the sum of the Bayesian estimated variance components is always larger than the sum of the estimates obtained by method of moments or REML.

Postdoc #1. Hierarchical Modeling and Computation: We are fitting hierarchical regression models with deep interactions. We're working on new models with structured prior distributions, and this also requires advances in Bayesian computation. Applications include public opinion, climate reconstruction, and education research.

Postdoc #1 is funded by grants from the Department of Energy, Institute of Education Sciences, and National Science Foundation.

Postdoc #2. Hierarchical Modeling and Statistical Graphics: The goal of this research program is to investigate the application of the latest methods of multi-level data analysis, visualization and regression modeling to an important commercial problem: forecasting retail sales at the individual item level. These forecasts are used to make ordering, pricing and promotions decisions which can have significant economic impact to the retail chain such that even modest improvements in the accuracy of predictions, across a large retailer's product line, can yield substantial margin improvements.

Project #2 is to be undertaken with, and largely funded by, a firm which provides forecasting technology and services to large retail chains, and which will provide access to a unique and rich set of proprietary data. The postdoc will be expected to spend some time working directly with this firm, but it is fundamentally a research position.

Ideally, postdoc #1 will have a statistics or computer science background, will be interested in statistical modeling, serious programming, and applications. Ideally, postdoc #2 will have a background in statistics, psychometrics, or economics and be interested in marketing or related topics. Both postdocs should be able to work fluently in R and should already know about hierarchical models and Bayesian inference and computation.

Both projects will be at the Applied Statistics Center at Columbia University, also with connections outside Columbia. Collaborators on Postdoc #1 include Jingchen Liu, Jennifer Hill, Matt Schofield, Upmanu Lall, Chad Scherrer, Alan Edelman, and Sophia Rabe-Hesketh. Collaborators on Postdoc #2 include Eric Johnson.

If you're interested in either, please send a letter of application, a c.v., some of your articles, and three letters of recommendation to the Applied Statistics Center coordinator, Caroline Peters,

Patricia and I have cleaned up some of the R and Bugs code and collected the data for almost all the examples in ARM. See here for links to zip files with the code and data.

Vlad Kogan writes:

I've been using your book on regression and multilevel modeling and have a quick R question for you. Do you happen to know if there is any R package that can estimate a two-stage (instrumental variable) multilevel model?

My reply: I don't know. I'll post on blog and maybe there will be a response. You could also try the R help list.

Random matrices in the news


Mark Buchanan wrote a cover article for the New Scientist on random matrices, a heretofore obscure area of probability theory that his headline writer characterizes as "the deep law that shapes our reality."

It's interesting stuff, and he gets into some statistical applications at the end, so I'll give you my take on it.

But first, some background.

About two hundred years ago, the mathematician/physicist Laplace discovered what is now called the central limit theorem, which is that, under certain conditions, the average of a large number of small random variables has an approximate normal (bell-shaped) distribution. A bit over 100 years ago, social scientists such as Galton applied this theorem to all sorts of biological and social phenomena. The central limit theorem, in its generality, is also important in the information that it indirectly conveys when it fails.

For example, the distribution of the heights of adult men or women is nicely bell-shaped, but the distribution of the heights of all adults has a different, more spread-out distribution. This is because your height is the sum of many small factors and one large factor--your sex. The conditions of the theorem are that no single factor (or small number of factors) should be important on its own. For another example, it has long been observed that incomes do not follow a bell-shaped curve, even on the logarithmic scale. Nor do sizes of cities and many other social phenomena. These "power-law curves," which don't fit the central limit theorem, have motivated social scientists such as Herbert Simon to come up with processes more complicated than simple averaging (for example, models in which the rich get richer).
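A quick simulation makes the height example concrete (the means and sds below are rough figures in inches, supplied by me): each sex alone is bell-shaped, but the mixture is noticeably more spread out.

```r
set.seed(5)
men   <- rnorm(50000, mean = 69.1, sd = 2.9)   # rough figures for adult men
women <- rnorm(50000, mean = 63.7, sd = 2.7)   # rough figures for adult women
adults <- c(men, women)

# Each component is bell-shaped; the combined distribution is wider,
# because one large factor (sex) violates the central limit theorem's conditions:
c(sd(men), sd(women), sd(adults))
```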

The central limit theorem is an example of an attractor--a mathematical model that appears as a limit as sample size gets large. The key feature of an attractor is that it destroys information. Think of it as being like a funnel: all sorts of things can come in, but a single thing--the bell-shaped curve--comes out. (Or, for other models, such as that used to describe the distribution of incomes, the attractor might be a power-law distribution.) The beauty of an attractor is that, if you believe the model, it can be used to explain an observed pattern without needing to know the details of its components. Thus, for example, we can see that the heights of men or of women have bell-shaped distributions, without knowing the details of the many small genetic and environmental influences on height.

Now to random matrices.

Statistics is, I hope, on balance a force for good, helping us understand the world better. But I think a lot of damage has been created by statistical models and methods that inappropriately combine data that come from different places.

As a colleague of mine at Berkeley commented to me--I paraphrase, as this conversation occurred many years ago--"This meta-analysis stuff is fine, in theory, but really people are just doing it because they want to run the Gibbs sampler." Another one of my Berkeley colleagues wisely pointed out that the fifty states are not exchangeable, nor are they a sample from a larger population. So it is clearly inappropriate to apply a probability model for parameters to vary by state. From a statistical point of view, the only valid approaches are either to estimate a single model for all the states together or to estimate parameters for the states using unbiased inference (or some approximation that is asymptotically unbiased and efficient).

Unfortunately, recently I've been seeing more and more of this random effects modeling, or shrinkage, or whatever you want to call it, and as a statistician, I think it's time to lay down the law. I'm especially annoyed to see this sort of thing in political science and public opinion research. Pollsters work hard to get probability samples that can be summarized using rigorous unbiased inference, and it does nobody any good to pollute this with speculative, model-based inference. I'll leave the model building to the probabilists; when it comes to statistics, I prefer a bit more rigor. The true job of a statistician is not to say what might be true or what he wishes were true, but rather to express what the data have to say.

Here's a recent example of a hierarchical model that got some press. (No, it didn't make it through the rigor of a peer-reviewed journal; instead it made its way to the New York Times by way of a website that's run by a baseball statistician. A sad case of the decline of intellectual standards in America, but that's another story.) I'll repeat the graph because it's so seductive yet so misleading:

Alexis Le Nestour writes:

I am currently fitting a model explaining the probability of seeking treatment in Senegal. I have 4513 observations at the individual level nested in 505 households. Our sampling was a two-stage stratified procedure: our clusters are farming organizations, selected as a function of their size; we then randomly chose 5 households per farming organization and finally interviewed all the household members.

Hong Jiao writes:

I work in the area of educational measurement and statistics. I am currently using the MCMC method to estimate model parameters for item response theory (IRT) models. Based on your book, Gelman and Hill (2007), the scale indeterminacy issue can be solved by constraining the mean of item parameters or the mean of person parameters to be 0. Some of my colleagues stated that by using the priors for either the item or person parameters, we do not need to set any constraints for the scale indeterminacy when using MCMC for the IRT model estimation. They thought the use of the priors determines the scale. I am somewhat in disagreement with them but have nobody to confirm.

My response:

I recommend fitting the model with soft constraints (for example, by setting one of the prior means to 0 and imposing an inequality constraint somewhere to deal with the sign aliasing) and then summarizing inferences using relative values of the parameters. The key idea is to post-process to get finite-population inferences. We discuss this in one of the late chapters of our book and also in our Political Analysis article from 2005. (In particular, see Section 2 of that article.)

Constraining the prior mean is fine but it doesn't really go far enough, I think. Ultimately it depends on what your inferential goals are, though.
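Here is what that post-processing step looks like in base R, using fake "posterior draws" with an arbitrary location shift (everything below is invented for illustration): recenter each draw so the item parameters average to zero; relative comparisons are untouched.

```r
set.seed(6)
S <- 1000; I <- 10   # posterior draws, items
# Pretend these are MCMC draws of item difficulties with big draw-to-draw location drift:
b <- matrix(rnorm(S * I), S, I) + rnorm(S, 0, 3)

# Post-process: subtract each draw's item mean, so every draw is centered at 0
b_centered <- b - rowMeans(b)

# Relative quantities (e.g., item 1 minus item 2) are unaffected by the recentering:
d_raw      <- b[, 1] - b[, 2]
d_centered <- b_centered[, 1] - b_centered[, 2]
all.equal(d_raw, d_centered)
```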

Sifting and Sieving


Following our recent discussion of p-values, Anne commented:

We use p-values for something different: setting detection thresholds for pulsar searches. If you're looking at, say, a million independent Fourier frequencies, and you want to bring up an expected one for further study, you look for a power high enough that its p-value is less than one in a million. (Similarly if you're adding multiple harmonics, coherently or incoherently, though counting your "number of trials" becomes more difficult.) I don't know whether there's another tool that can really do the job. (The low computing cost is also important, since in fact those million Fourier frequencies are multiplied by ten thousand dispersion measure trials and five thousand beams.)

That said, we don't really use p-values: in practice, radio-frequency interference means we have no real grasp on the statistics of our problem. There are basically always many signals that are statistically significant but not real, so we rely on ad-hoc methods to try to manage the detection rates.

I don't know anything about astronomy--just for example, I can't remember which way the crescent moon curves in its different phases during the month--but I can offer some general statistical thoughts.

My sense is that p-values are not the best tool for this job. I recommend my paper with Jennifer and Masanao on multiple comparisons; you can also see my talks on the topic. (There's even a video version where you can hear people laughing at my jokes!) Our general advice is to model the underlying effects rather than thinking of them as a million completely unrelated outcomes.

The idea is to get away from the whole sterile p-value/Bayes-factor math games and move toward statistical modeling.

Another idea that's often effective is to select a subset of your million possibilities for screening and then analyze that subset more carefully. The work of Tian Zheng and Shaw-Hwa Lo on feature selection (see the Statistics category here) might be relevant for this purpose.
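To make Anne's threshold rule concrete, here's the calculation under an assumption I'm supplying (not taken from her comment): noise powers in a normalized periodogram are exponentially distributed.

```r
n_trials <- 1e6
alpha <- 1 / n_trials           # want less than one expected false alarm

# Threshold on the (exponentially distributed) noise power:
threshold <- qexp(1 - alpha)    # = -log(alpha) = log(1e6), about 13.8

# Expected number of noise crossings among a million independent frequencies:
n_trials * pexp(threshold, lower.tail = FALSE)   # about 1
```

Multiplying in the dispersion-measure trials and beams just scales n_trials (and hence the threshold) accordingly.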

Neil D writes:

I'm starting to work on a project which I think might benefit from some multilevel modeling, but I'm not sure. Essentially it is a multiple regression model with two explanatory variables. The intercept is expected to be close to zero.

Over the time the data have been collected there have been six changes to the relevant tax rules in my country, and what has been done so far is to fit a model where the regression coefficients are different in the seven tax regimes. I'm thinking that some partial pooling might be helpful, and might improve the estimate of the regression coefficients in the last tax regime, which really is of major interest.

I haven't done the analysis yet, but I'm assuming that such an analysis would be worthwhile and relatively straightforward assuming that the regression coefficients in the different tax regimes are bivariate normal. What worries me a little, particularly as the analysis is sure to be scrutinized and criticized by various interested parties, is the assumption of exchangeability, since as the different tax regimes were introduced there was an expectation that the regression coefficients would both go up in some cases, or both go down, or one would go up but the other would likely be unaffected. I'm not sure if it is possible to incorporate this information.

My reply: I'd start with the secret weapon.
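For reference, the secret weapon is just the same regression fit separately within each subset, with the coefficients then plotted side by side. A base-R sketch with made-up tax-regime data (note the no-intercept specification, since Neil expects an intercept near zero):

```r
set.seed(7)
regime <- rep(1:7, each = 40)
x1 <- rnorm(280); x2 <- rnorm(280)
beta1 <- seq(0.5, 1.1, by = 0.1)   # made-up slopes drifting across regimes
y <- beta1[regime] * x1 + 0.3 * x2 + rnorm(280, 0, 0.5)

# The secret weapon: the same regression, fit separately within each regime
fits <- lapply(1:7, function(j) {
  keep <- regime == j
  coef(summary(lm(y[keep] ~ 0 + x1[keep] + x2[keep])))
})
est <- sapply(fits, function(m) m["x1[keep]", "Estimate"])
se  <- sapply(fits, function(m) m["x1[keep]", "Std. Error"])

# Plot estimate +/- 1 se by regime; partial pooling would smooth these further
plot(1:7, est, ylim = range(est - se, est + se),
     xlab = "tax regime", ylab = "coefficient of x1")
segments(1:7, est - se, 1:7, est + se)
```

If the plotted coefficients vary smoothly, that's the visual motivation for the multilevel model that partially pools them.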

P.S. Whenever I put "multilevel" in the title of a blog entry, I get spam from multilevel marketing companies. Could you guys just cut it out, please?

Daniel Kramer writes:

In your book, Data Analysis Using Regression and Multilevel..., you state that in situations when little group-level variation is observed, multilevel models reduce to classical regression with no group indicators. Does this essentially mean that with zero group-level variation, predicted coefficients, deviance, and AIC would be the same as estimates obtained with classical regression? I ask because I have been asked by an editor to adopt a multimodel inference approach (Burnham and Anderson 2002) in my analysis. Typically a small set of candidate models are ranked using an information theoretic criterion and model averaging may be used to derive coefficient estimates or predictions. Thus, would it be appropriate to compare single-level and multilevel models derived from the same data set using AIC? I am skeptical since the deviances for the null models are different. Of course, there may be no reason to compare single-level and multilevel models if there is no cost (i.e. reduced model fit) in adopting a multilevel framework, as long as the case can be made that the data are hierarchical. The only cost you mention in your book is the added complexity.

My reply: If you're fitting multilevel models, perhaps better to use DIC than AIC. DIC has problems too (search this blog for DIC for discussion of this point), but with AIC you'll definitely run into trouble counting parameters. (BIC has other, more serious problems: it's not a measure of predictive error the way that AIC and DIC are.) More generally, I can see the appeal of presenting and averaging over several models, even if I've rarely ended up doing that myself. Indeed, I'd prefer to let everything of interest vary by group.

P.S. What's the deal with the 256-character limit on titles, anyway?

Tom Clark writes:

Doug Bates's lmer book



Nick Allum writes:

Jim Madden writes:

I have been developing interactive graphical software for visualizing hierarchical linear models of large sets of student performance data that I have been allowed access to by the Louisiana Department of Education. This is essentially pre/post test data, with student demographic information (birthday, sex, race, ses, disability status) and school associated with each record. (Actually, I can construct student trajectories for several years, but I have not tried using this capability yet.) My goal is to make the modeling more transparent to audiences that are not trained in statistics, and in particular, I am trying to design the graphics so that nuances and uncertainties are apparent to naive viewers.

Andrew Gelman, Jingchen Liu, and Sophia Rabe-Hesketh are looking for two fulltime postdocs, one to be based at the Department of Statistics, Columbia University, and the other at the Graduate School of Education, University of California, Berkeley.

The positions are funded by a grant entitled "Practical Tools for Multilevel/Hierarchical Modeling in Education" (Institute of Education Sciences, Department of Education). The project addresses modeling and computational issues that stand in the way of fitting larger, more realistic models of variation in social science data, in particular the problem that (restricted) maximum likelihood estimation often yields estimates on the boundary, such as zero variances or perfect correlations. The proposed solution is to specify weakly informative prior distributions - that is, prior distributions that will affect inferences only when the data provide little information about the parameters. Existing software for maximum likelihood estimation can then be modified to perform Bayes posterior modal estimation, and these ideas can also be used in fully Bayesian computation. In either case, the goal is to be able to fit and understand more complex, nuanced models. The postdocs will contribute to all aspects of this research and will implement the methods in C and R (Columbia postdoc) and Stata (Berkeley postdoc).

Both locations are exciting places to work with research groups comprising several faculty, postdocs, and students working on a wide range of interesting applied problems. The Columbia and Berkeley groups will communicate with each other on a regular basis, and there will be annual workshops with outside panels of experts.

Applicants should have a Ph.D. in Statistics or a related area and have strong programming skills, preferably with some experience in C/R/Stata. Experience with Bayesian methods and good knowledge of hierarchical Bayesian and/or "classical" multilevel modeling would be a great advantage.

The expected duration of the positions is 2 years with an approximate start date of September 1, 2010.

Please submit a statement of interest, not exceeding 5 pages, and your CV to either Andrew Gelman and Jingchen Liu (Department of Statistics, Columbia University) or Sophia Rabe-Hesketh (Graduate School of Education, 3659 Tolman Hall, Berkeley, CA 94720-1670), stating whether you would also be interested in the other position. Applications will be considered as they arrive but should be submitted no later than April 1st.

This is in addition to, but related to, our other postdoctoral positions. A single application will work for all of them.

Dean Eckles writes:

I have a hopefully interesting question about methods for analyzing varying coefficients as a way of describing similarity and variation between the groups.

In particular, we are studying individual differences in susceptibility to different influence strategies. We have quantitative outcomes (related to buying books), and influence strategy (7 levels, including a control) is a within-subjects factor with two implementations of each strategy (also within-subjects).

Yet more antblogging



James Waters writes:

Commenter RogerH pointed me to this article by Welton, Ades, Carlin, Altman, and Sterne on models for potentially biased evidence in meta-analysis using empirically based priors. The "Carlin" in the author list is my longtime collaborator John, so I really shouldn't have had to hear about this through a blog comment. Anyway, they write:

We present models for the combined analysis of evidence from randomized controlled trials categorized as being at either low or high risk of bias due to a flaw in their conduct. We formulate a bias model that incorporates between-study and between-meta-analysis heterogeneity in bias, and uncertainty in overall mean bias. We obtain algebraic expressions for the posterior distribution of the bias-adjusted treatment effect, which provide limiting values for the information that can be obtained from studies at high risk of bias. The parameters of the bias model can be estimated from collections of previously published meta-analyses. We explore alternative models for such data, and alternative methods for introducing prior information on the bias parameters into a new meta-analysis. Results from an illustrative example show that the bias-adjusted treatment effect estimates are sensitive to the way in which the meta-epidemiological data are modelled, but that using point estimates for bias parameters provides an adequate approximation to using a full joint prior distribution. A sensitivity analysis shows that the gain in precision from including studies at high risk of bias is likely to be low, however numerous or large their size, and that little is gained by incorporating such studies, unless the information from studies at low risk of bias is limited. We discuss approaches that might increase the value of including studies at high risk of bias, and the acceptability of the methods in the evaluation of health care interventions.

I really really like this idea. As Welton et al. discuss, their method represents two key conceptual advances:

1. In addition to downweighting questionable or possibly-biased studies, they also shift them to adjust in the direction of correcting for the bias.

2. Instead of merely deciding which studies to trust based on prior knowledge, literature review, and external considerations, they also use the data, through a meta-analysis, to estimate the amount of adjustment to do.
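Both ideas can be sketched in one weighted average: shift each high-risk estimate by an assumed mean bias, and inflate its variance by the between-study bias variance so it gets less weight. All the numbers below are invented; in the Welton et al. approach the bias parameters would come from meta-epidemiological collections, not be picked by hand.

```python
import numpy as np

# Illustrative meta-analysis: effect estimates y with standard errors se,
# flagged as being at low (0) or high (1) risk of bias. Numbers are made up.
y    = np.array([-0.60, -0.45, -0.90, -1.10])
se   = np.array([ 0.20,  0.25,  0.30,  0.35])
high = np.array([ 0,     0,     1,     1   ])

# Hypothetical bias-model parameters (in practice, estimated from data):
b_mean = -0.25   # mean bias of high-risk studies (they overstate the effect)
b_sd   = 0.20    # between-study sd of bias

# Shift high-risk estimates by the mean bias and inflate their variance by
# the bias variance: adjustment and downweighting in a single step.
y_adj = y - high * b_mean
var   = se**2 + high * b_sd**2
w     = 1 / var

pooled    = np.sum(w * y_adj) / np.sum(w)
pooled_se = np.sqrt(1 / np.sum(w))
print(pooled, pooled_se)
```

Setting b_mean and b_sd to zero recovers the usual fixed-effect pooled estimate, which makes the price of the adjustment easy to see.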

And, as a bonus, the article has excellent graphs. (It also has three ugly tables, with gratuitous precision such as "-0.781 (-1.002, -0.562)," but the graph-to-table ratio is much better than usual in this sort of statistical research paper, so I can't really complain.)

This work has some similarities to the corrections for nonsampling errors that we do in survey research. As such, I have one idea here. Would it be possible to take the partially-pooled estimates from any given analysis and re-express them as equivalent weights in a weighted average? (This is an idea I've discussed with John and is also featured in my "Survey weighting is a mess" paper.) I'm not saying there's anything so wonderful about weighted estimates, but it could help in understanding these methods to have a bridge to the past, as it were, and see how they compare in this way to other approaches.
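For the simple normal hierarchical model the bridge is easy to write down: with shrinkage factors lambda_j = tau^2/(tau^2 + sigma_j^2), each partially pooled estimate is itself a weighted average of all the raw group estimates, so the equivalent weights can be read off a matrix. A sketch with made-up numbers:

```python
import numpy as np

# Partial pooling in the normal model, re-expressed as weights on the raw
# group estimates. The data and tau are illustrative, not from any real analysis.
y     = np.array([2.0, 0.5, -1.0, 1.5])   # raw group estimates
sigma = np.array([0.8, 1.2, 0.6, 1.0])    # their standard errors
tau   = 1.0                               # assumed between-group sd

prec = 1 / (sigma**2 + tau**2)
mu_w = prec / prec.sum()            # weights giving the estimated grand mean
lam  = tau**2 / (tau**2 + sigma**2) # shrinkage toward the raw estimate

# theta_j = lam_j * y_j + (1 - lam_j) * mu, and mu = mu_w @ y, so each
# partially pooled estimate is a weighted average of all the y's:
W = np.diag(lam) + np.outer(1 - lam, mu_w)   # rows are equivalent weights
theta = W @ y

print(np.round(W, 3))
print(theta, W.sum(axis=1))   # each row of W sums to 1
```

The rows of W are exactly the "equivalent weights" in question: noisier groups put more of their weight on everyone else's data.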

Payment demanded for the meal

There's no free lunch, of course. What assumptions did Welton et al. put in to make this work? They write:

We base the parameters of our bias model on empirical evidence from collections of previously published meta-analyses, because single meta-analyses typically provide only limited information on the extent of bias . . . This, of course, entails the strong assumption that the mean bias in a new meta-analysis is exchangeable with the mean biases in the meta-analyses included in previous empirical (meta-epidemiological) studies. For example, the meta-analyses that were included in the study of Schulz et al. (1995) are mostly from maternity and child care studies, and we must doubt whether the mean bias in studies on drugs for schizophrenia (the Clozapine example meta-analysis) is exchangeable with the mean biases in this collection of meta-analyses.

Assumptions are good. I expect their assumptions are better than the default alternatives, and it's good to have the model laid out there for possible criticism and improvement.

P.S. The article focuses on medical examples but I think the methods would also be appropriate for experiments and observational studies in social science. A new way of thinking about the identification issues that we're talking about all the time.

Jimmy points me to this article, "Why most discovered true associations are inflated," by J. P. Ioannidis. As Jimmy pointed out, this is exactly what we call type M (for magnitude) errors. I completely agree with Ioannidis's point, which he seems to be making more systematically than David Weakliem and I did in our recent article on the topic.

My only suggestion beyond what Ioannidis wrote has to do with potential solutions to the problem. His ideas include: "being cautious about newly discovered effect sizes, considering some rational down-adjustment, using analytical methods that correct for the anticipated inflation, ignoring the magnitude of the effect (if not necessary), conducting large studies in the discovery phase, using strict protocols for analyses, pursuing complete and transparent reporting of all results, placing emphasis on replication, and being fair with interpretation of results."

These are all good ideas. Here are two more suggestions:

1. Retrospective power calculations. See page 312 of our article for the classical version or page 313 for the Bayesian version. I think these can be considered as implementations of Ioannidis's ideas of caution, adjustment, and correction.

2. Hierarchical modeling, which partially pools estimated effects and reduces Type M errors as well as handling many multiple comparisons issues. Fuller discussion here (or see here for the soon-to-go-viral video version).
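To make the Type M problem concrete, here is a quick simulation (with made-up numbers) of the exaggeration ratio: with a small true effect and a noisy study, the estimates that clear the significance bar overstate the effect several-fold on average.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small true effect measured with a large standard error, plus the
# convention of reporting only "statistically significant" results.
true_effect = 0.1
se = 0.25
est = rng.normal(true_effect, se, size=1_000_000)

significant = np.abs(est) > 1.96 * se
exaggeration = np.mean(np.abs(est[significant])) / true_effect

print(exaggeration)   # expected magnitude of significant estimates,
                      # as a multiple of the true effect
```

With these (illustrative) numbers the significance filter inflates the reported magnitude by roughly a factor of six, which is the Type M error in action.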

P.S. Here's the first mention of Type M errors that I know of. The problem is important enough, though, that I suspect there are articles on the topic going back to the 1950s or earlier in the psychometric literature.


About this Archive

This page is an archive of recent entries in the Multilevel Modeling category.
