Recently in Multilevel Modeling Category

Asa writes:

I took your class on multilevel models last year and have since found myself applying them in several different contexts. I am about to start a new project with a dataset in the tens of millions of observations. In my experience, multilevel modeling has been most important when the number of observations in at least one subgroup of interest is small. Getting started on this project, I have two questions:

1) Do multilevel models still have the potential to add much accuracy to predictions when n is very large in all subgroups of interest?

2) Do you find SAS, STATA, or R to be more efficient at handling multilevel/"mixed effects" models with such a large dataset (wont be needing any logit/poisson/glm models)?

My reply:

Regarding software, I'm not sure, but my guess is that Stata might be best with large datasets. Stata also has an active user community that can help with such questions.

For your second question, if n is large in all subgroups, then multilevel modeling is typically not needed. But if n is large in all subgroups, you can simply fit a separate model in each group. That is equivalent to a full-interaction model. At that point you might be interested in details within subgroups, and then you might want a multilevel model.

Asa then wrote:

Yes, a "full interaction" model was the alternative I was thinking of. And yes, I can imagine the results from that model raising further questions about whats going on within groups as well.

My previous guess was that SAS would be the most efficient for multilevel modeling with big data. But I just completely wrecked my (albeit early 2000's era) laptop looping proc mixed a bunch of times with a much smaller dataset.

I don't really know on the SAS vs. Stata issue. In general, I have warmer feelings toward Stata than SAS, but, on any particular problem, who knows? I'm pretty sure that R would choke on any of these problems.

On the other hand, if you end up breaking the problem into smaller pieces anyway, maybe the slowness of R wouldn't be so much of a problem. R does have the advantage of flexibility.

Interactions and Bayesian Anova

| No Comments

Gregor Gorjanc writes:

An Anova sort of question

| 1 Comment

Suresh Krishna writes:

Matt Fox writes:

I teach various Epidemiology courses in Boston and in South Africa and have been reading your blog for the past year or so and used several of your examples in class . . . I am curious to know why you are skeptical of structural models. Much of my training has been in how essential these models are and I rarely hear the other side of the debate.

I've never used structural models myself. They just seem to require so many assumptions that I don't know how to interpret them. (Of course the same could be said of Bayesian methods, but that doesn't seem to bother me at all.) One thing I like to say is that in observational settings I feel I can interpret at most one variable causally. The difficulty is that it's hard to control for things that happen after the variable that you're thinking of as the "treatment."

To put it another way, there's a research paradigm in which you fit a model--maybe a regression, maybe a structural equations model, maybe a multilevel model, whatever--and then you read off the coefficients, with each coefficient telling you something. You gather these together and those are your conclusions.

My paradigm is a bit different. I sometimes say that each causal inference requires its own analysis and maybe its own experiment. I find it difficult to causally interpret several different coefficients from the same model.

It's all about the salamanders

| No Comments

Jacob Felson writes:

I have a statistics question that may lend itself to multilevel modeling and incomplete data.

More on fitting multilevel models

| No Comments

Eric Schwartz writes:

Thanks for the blog post. I have three follow up questions:

Eric Schwartz writes a long question (complete with R code). My reply is at the end. Schwartz writes:

How do you recommend one tests whether fixed effects are different from zero in a generalized linear mixed model?

Douglas Bates argues that it is inappropriate to use p-values from t-statistics reported in lmer(). I've followed all of his and your subsequent postings on the topic that I have found, but still have some questions:

Ban Chuan Cheah writes: "This paper may be relevant to a recent entry on your blog." Here's the abstract:

Multilevel models are used to revisit Moulton's (1990) work on clustering. Moulton showed that when aggregate level data is combined with micro level data, the estimated standard errors from OLS estimates on the aggregate data are too small leading the analyst to reject the null hypothesis of no effect. Simulations using similar data suggest that even when corrected for clustering, the null hypothesis is over-rejected compared to the estimates obtained from multilevel models. The relationship between survey sampling and Moulton's correction is also explored. The parallel between these two areas is extended into multiway clustering. Simulations using a data set with students clustered within classrooms and classrooms within schools suggest that the over-rejection rate from multilevel models is smaller than those corrected for clustering. This is particularly true when the number of clusters (classrooms) is small. The results suggest that modeling the clustering of the data using a multilevel methods is a better approach than fixing the standard errors of the OLS estimate.

Big juicy datasets

| No Comments

From Ben Olding, the MIT cell phone data and a corresponding article. I can't remember what this was about, but when Ben described it to me a few months ago, I recall that it sounded cool.

I'm still waiting for someone to work with me to reanalyze Iyengar and Fishman's speed-dating data using hierarchical models.

Paul Cross writes:

In reading your book and papers on multilevel modeling I've noticed that you do not do much explicit modeling of spatial or temporal effects. I'm wondering if this is philosophically driven, perhaps because you prefer to get at underlying cause of the spatial correlations rather than just describing the spatial (or temporal) patterns. On a more technical level though, are there issues associated with including hierarchical effects in spatial models (e.g. Besag-York and Mollie convolution models)? Do spatial and non-spatial predictors compete with one another to predict the outcome resulting in biased estimates of both? Is this a simple issue of confounding, and if so how would one know that a non-spatial explanatory variable was confounding the spatial effects? Would it require a separate analysis of the spatial properties of all explanatory variables. As a disease ecologist, separating importance of covariates from the contagious process across time and space is a central problem.

My response: I think I'd be a better person if I fit spatial and time correlations. And similar models for other continuous predictors: for example, when modeling voting given income, maybe some sort of spline model instead of a sloppy combination of linear and categorical factors. I think the big reason I don't do it is that it takes a lot of work, and I'd rather put the effort into modeling interactions. Often I put in spatial patterns in a crude way, for example including regions as predictors when modeling U.S. states. But I do think that spatial modeling can be a good idea--don't take my laziness as an anti-endorsement.

P.S. I prefer the term "patterns" rather than "effects" in this context.

Jeff Lane writes:

I was just talking with Delia about two-stage regressions compared to multilevel analysis and we were looking at Two-Stage Regression and Multilevel Modeling: A Discussion of Several Papers for the Journal "Political Analysis" and the 2005 blog discussion, in which you posted the following response to someone struggling with choice of models:

I received the following question in the mail:

Progress

| No Comments

Going through the Profiles in Research published by the Journal of Educational and Behavioral Statistics, I was amused to see the following concluding paragraph in the interview with Lyle Jones:

Despite my [Jones's] strong preference for interval estimation, there are situations for which a test of significance still may be appropriate. One is multiple comparisons, such as comparisons between all pairs of states for average student achievement scale scores in NAEP [National Assessment of Educational Progress]. A related application is assessing the goodness of fit of a model to an array of values. In these cases, interval estimation is not easily employed and the careful application of significance tests may continue to serve about as well as any alternative.

No! Not at all! My paper with Jennifer and Masanao specifically shows how interval estimation (i.e., multilevel modeling) solves the NAEP comparisons problem just fine (setting aside the question of whether we should be interested in these state-level averages in the first place). It's good to knows that some progress has been made since 2003.

Donna Harrington writes:

I will be teaching a new multilevel models course in the fall and am currently reading your text, /Data Analysis Using Regression and Multilevel/Hierarchical Models/ as I prepare. I am enjoying the book and am considering adopting it for use in the course.

Would you be willing to share the syllabus you have used for your Applied Regression and Multilevel Models course? I am particularly interested in seeing how much of the book you use in a one semester course.

My reply:

I have to admit that, over the years, I've made my syllabuses less and less detailed as I've focused more and more on writing the books. For a multilevel modeling course, I suggested the following:

- chapters 3,4,5: linear and logistic regression
- chapter 7: basics of simulation
- chapter 9: basics of causal inference
- chapters 11-14: multilevel linear and logistic regression (up to and including varying-intercept, varying-slope models)
- chapter 18: all the theory that they'll need.

For a one-semester introductory course, my usual strategy for a one-semester course is to focus chapters 2-10: that is, cover everything except multilevel modeling. Linear regression, logistic, glm, computation, and causal inference. Then for the last part of the course, I can choose among some options, including: intro to multilevel models, sample size and power calculations, and missing data imputation.

P.S. To those of you who haven't had the opportunity to take a course from me: Don't worry about it. I'm better at writing than teaching. Maybe you're better off learning out of one of my books with somebody else actually teaching the class.

A political scientist writes:

Here's a question that occurred to me that others may also have. I imagine "Mister P" will become a popular technique to circumvent sample size limitations and create state-level data for various public opinion variables. Just wondering: are there any reasons why one wouldn't want to use such estimates as a state-level outcome variable? In particular, does the dependence between observations caused by borrowing strength in the multilevel model violate the independence assumptions of standard statistical models? Lax and Phillips use "Mister P" state-level estimates as a predictor, but I'm not sure if someone has used them as an outcome or whether it would be appropriate to do so
.

First off, I love that the email to me was headed, "mister p question." And I know Jeff will appreciate that too. We had many discussions about what to call the method.

To get back to the question at hand: yes, I think it should be ok to use estimates from Mister P as predictor or outcome variables in a subsequent analysis. In either case, it could be viewed as an approximation to a full model that incorporates your regression of interest, along with the Mr. P adjustments.

I imagine, though, that there are settings where you could get the wrong answer by using the Mr. P estimates as predictors or as outcomes. One way I could imagine things going wrong is through varying sample sizes. Estimates will get pooled more in the states with fewer respondents, and I could see this causing a problem. For a simple example, imagine a setting with a weak signal, lots of noise, and no state-level predictors. Then you'd "discover" that small states are all near the average, and large states are more variable.

Another way a problem could arise, perhaps, is if you have a state-level predictor that is not statistically significant but still induces a correlation. With the partial pooling, you'll see a stronger relation with the predictor in the Mr. P estimates than in the raw data, and if you pipe this through to a regression analysis, I could imagine you could see statistical significance when it's not really there.

I think there's an article to be written on this.

As part of our Red State, Blue State research, we developed statistical tools for estimating public opinion among subsets of the population. Recently Yu-Sung Su, Yair Ghitza, and I applied these methods to see where school vouchers are more or less popular.

We started with the 2000 National Annenberg Election Survey, which had responses from about 50,000 randomly-sampled Americans to the question: "Give tax credits or vouchers to help parents send their children to private schools--should the federal government do this or not?" 45% of those who expressed an opinion on this question said yes, but the percentage varied a lot by state, income level, and religious/ethnic group; These maps show our estimates:

vouchermaps2000A.png

(Click on image to see larger version.)

Vouchers are most popular among high-income white Catholics and Evangelicals and low-income Hispanics. In general, among white groups, the higher the income, the more popular are school vouchers. But among nonwhites, it goes the other way, with vouchers being popular in the lower income categories but then becoming less popular among the middle class.

You can also see that support for vouchers roughly matches Republican voting, but not completely. Vouchers are popular in the heavily Catholic Northeast and California, less so in many of the mostly Protestant states in the Southeast. We also see a regional pattern among African Americans, where vouchers are most popular outside the South.

We checked our results by fitting the same model to the Annenberg survey from 2004, and, much to our relief, we found similar patterns:

Jeff Lax and Justin Phillips posted this summary of attitudes on a bunch of gay rights questions:

gay.png

They did it all using multilevel regression and poststratification. And a ton of effort.

P.S. My only criticisms of the above graph are:

(a) I'd just put labels at 20%, 30%, 40%, etc. I think the labels at 25, 35, etc., are overkill and make the numbers harder to read. And the tick marks should be smaller.

(b) The use of color and the legend on the upper left are well done. But they should place the items in the legend in the same order as the averages in the graphs. Thus, it should be same-sex marriage, then 2nd parent acdoption, then civil unions, then health benefits, and so forth.

Fancy statistical analysis can indeed lead to better understanding.

Jeff Lax and Justin Phillips used the method of multilevel regression and poststratification ("Mister P"; see here and here) to estimate attitudes toward gay rights in the states. They put together a dataset using national opinion polls from 1994 through 2009 and analyzed several different opinion questions on gay rights.

Policy on gay rights in the U.S. is mostly set at the state level, and Lax and Phillips's main substantive finding is that state policies are strongly responsive to public opinion. However, in some areas, policies are lagging behind opinion somewhat.

A fascinating trend

Here I'll focus on the coolest thing Lax and Phillips found, which is a graph of state-by-state trends in public support for gay marriage. In the past fifteen years, gay marriage has increased in popularity in all fifty states. No news there, but what was a surprise to me is where the largest changes have occurred. The popularity of gay marriage has increased fastest in the states where gay rights were already relatively popular in the 1990s.

In 1995, support for gay marriage exceeded 30% in only six states: New York, Rhode Island, Connecticut, Massachusetts, California, and Vermont. In these states, support for gay marriage has increased by an average of almost 20 percentage points. In contrast, support has increased by less than 10 percentage points in the six states that in 1995 were most anti-gay-marriage--Utah, Oklahoma, Alabama, Mississippi, Arkansas, and Idaho.

Here's the picture showing all 50 states:

lax6.png

I was stunned when I saw this picture. I generally expect to see uniform swing, or maybe even some "regression to the mean," with the lowest values increasing the most and the highest values declining, relative to the average. But that's not what's happening at all. What's going on?

Some possible explanations:

- A "tipping point": As gay rights become more accepted in a state, more gay people come out of the closet. And once straight people realize how many of their friends and relatives are gay, they're more likely to be supportive of gay rights. Recall that the average American knows something like 700 people. So if 5% of your friends and acquaintances are gay, that's 35 people you know--if they come out and let you know they're gay. Even accounting for variation in social networks--some people know 100 gay people, others may only know 10--there's the real potential for increased awareness leading to increased acceptance.

Conversely, in states where gay rights are highly unpopular, gay people will be slower to reveal themselves, and thus the knowing-and-accepting process will go slower.

- The role of politics: As gay rights become more popular in "blue states" such as New York, Massachusetts, California, etc., it becomes more in the interest of liberal politicians to push the issue (consider Governor David Paterson's recent efforts in New York). Conversely, in states where gay marriage is highly unpopular, it's in the interest of social conservatives to bring the issue to the forefront of public discussion. So the general public is likely to get the liberal spin on gay rights in liberal states and the conservative spin in conservative states. Perhaps this could help explain the divergence.

Where do we go next in studying this?

- We can look at other issues, not just on gay rights, to see where this sort of divergence occurs, and where we see the more expected uniform swing or regression-to-the-mean patterns.

- For the gay rights questions, we can break up the analysis by demographic factors--in particular, religion and age--to see where opinions are changing the fastest.

- To study the "tipping point" model, we could look at survey data on "Do you know any gay people?" and "How many gay people do you know?" over time and by state.

- To study the role of politics, we could gather data on the involvement of state politicians and political groups on gay issues.

I'm sure there are lots of other good ideas we haven't thought of.

P.S. More here.

Banova

| 4 Comments

Mark Bucciarelli writes:

I'm interesting in applying the Bayesian ANOVA you described in your 2005 paper to some data I am analyzing.

Is your arm package for R the place to start? (I'm on Linux/Mac, so I'll have to build OpenBugs and maybe update the RBugs package.)

Or is there a more direct path?

I'm analyzing the impact of ad attributes on the variance in click rates; e.g., product category, time of day, graphical vs. text, etc, etc.

My reply: For most of the computations in that article, I actually used a Fortran program that I wrote (and ran from Splus). There's no way this could be recovered; it would be easier to start from scratch. For the last example (in Figures 6 and 7), I used Bugs; actually, I repeated this example in my book with Hill. For your example, I'd suggest Bugs if your sample size isn't too large. Or you could try Doug Bates's lmer() function which we use in our arm package. lmer doesn't currently express uncertainty in the variance parameters but it's a good start, certainly much better than trying to use R's aov() function or similar procedures in other packages.

Who wants school vouchers?

| 8 Comments

vouchermaps2000.png

(Right click on "View image" to see the whole thing.)

Among whites, vouchers appear to be more popular with the upper middle class and rich (with predictable religious variation: the strongest support is among Catholics, then born-again Protestants, then others).

Among blacks and hispanics, though, vouchers are more popular among the poor.

We'll have to check this on some other data.

Some details:

A couple days ago, I wrote, of Martin and Quinn's estimated positions of Supreme Court justices, that

I don't know whether to believe the numbers. Is the Anthony Kennedy of 2007 (ideology score 0.14) really so close to Hugo Black in 1970 (ideology score 0.06)? To look at it another way, according to these numbers, in 1973 (the year of Roe v. Wade), six of the justices are colored red and the median justice is listed at 0.67. In 2007, only five are red and the median is at 0.14. In fact, in 2005 the median is listed as -0.07, or slightly to the left of center. Is it really plausible that the court was more liberal in 2005 than in 1973? Maybe so, but something looks fishy to me here.

In reply, Andrew Martin wrote:

re: Black and Kennedy, I [Martin] tend to think of them as pretty similar. Both were moderates (although on somewhat different types of cases), one a moderate Dem the other a moderate Rep. So them being close is not implausible.

There were a couple of very liberal decisions in the 1972 term (when Roe was decided), including a Roe and a death penalty case. But even on civil liberties the court reached a conservative decision in Miller (the obscenity case). And there were some more conservative decisions in other areas of law. Today's court is surely more conservative on civil liberties issues (although there haven't really been many cases...), but may be a little to the left on some other issues (Hamdan). Today it all gets down to what Kennedy wants to do. If it were what Roberts would do the court would be far to the right.

That's an argument for plausibility, but the argument may be implausible. We tend to think about the Court in terms of the most politically salient cases, but the model treats Roe as equally to, say, a tax case. And, of course, the measures have huge limitations because they are just based on binary data, on all cases, with some reasonably strong model assumptions, etc.

The point about counting different domains of the law is interesting, along with the age-period-cohort sort of question of how you can even try to align left-right today with the corresponding positions in 1972.

Awhile ago I posted some maps based on the Pew pre-election polls to estimate how Obama and McCain did among different income groups, for all voters and for non-Hispanic whites alone. The next day the blogger and political activist Kos posted some criticisms. I disagree with one of Kos's suggestions--he wanted me to rely on exit polls, but I don't actually see them as more reliable than the Pew pre-election polls--but he pointed out some serious problems with my maps. I realized that some fixes were in order. Most importantly:

- My maps would be improved by replacing solid red and blue with continuous shading to distinguish between landslides and narrow margins.

- I needed a more flexible model that would allow the nonlinear pattern of voting and income to vary by state. (In the previous model, I fit a nonlinear pattern (by including a separate logistic regression coefficient for each of the five income categories) but allowed the states to vary only with intercepts and slopes. In the new model, we're letting all five coefficients vary by state.)

During the past couple of months, I've been working on this when I've had a spare hour or two, and now I think we have something reasonable to share. Here it is:

10graphs2008income.png

States colored deep red and deep blue indicate clear McCain and Obama wins; pink and light blue represent wins by narrower margins, with a continuous range of shades going to pure white for states estimated at exactly 50/50.

General comments

The maps are based on a model fit to four ethnic categories (non-Hispanic white, black, Hispanic, other), but I'm only displaying total and non-Hispanic whites. The others are interesting too but they're based on a lot less data: they're my (current) best estimates but are much more reliant on model extrapolation.

The estimates are entirely based on the Pew data--except that we use Census-based voter turnout estimates to reweight estimates in each state, and we shift each state's estimates to be consistent with the actual election outcome in the state. (For example, if our estimate says that Obama got 48% of the total vote in a state (adding up voters from all income and ethnicity categories), and he actually got 46%, then we'd pull down our estimates for each category so that the estimated total is 46%.)

Some particular changes

I'll talk about a couple of states where Kos pointed out issues with my original maps.

New Hampshire. John McCain won 45% of the two-party vote in New Hampshire, a state which is 93% non-Hispanic white, 1% black, 2% Hispanic, 2% Asian, and 2% other. Based on the Census survey, we estimate that non-Hispanic whites were 96% of New Hampshire's voters in 2008. If whites represented 96% of the voters, and if McCain received 20% of the votes of the other 4%, then his share of the white vote would be 46%--thus, as Kos pointed out, it's hard to believe that McCain won in four of the five income categories among whites in the state, as my original map had implied. The problem was in the way that I'd adjusted things to the national vote.

Michigan. As Kos points out, Michigan was closely divided among whites, and so there was something fishy about my original maps, which had Obama winning among whites in four of the five income categories. The new map does not have this problem.

Colorado. This state reveals some problems with the published exit poll data: according to CNN, McCain got 48% of the white vote in Colorado, but, when this was broken down by income, he got 45% of the vote of whites under $50,000 and 47% of the vote of whites over $50,000. This is a mathematical impossibility: using the exit poll numbers, McCain's percentage of the total white vote should then be (.19*45% + .62*47%)/(.19+.62) = 46.5%, not 48%. I don't know which of these--if either--is correct. I assume all of these numbers are from the corrected exit polls, adjusted to match up to the actual vote proportions in each state. Our estimate gives McCain 51% of the white vote in Colorado. I think this is possible too, and for that matter it's consistent with the exit poll estimate of 48%, which has a standard error of at least sqrt(.48*(1-.48)/(.81*1254))=.015, so the exit poll number is within two standard errors of our estimate.

Estimates and raw data

Here are graphs showing our estimates, along with the weighted average from the Pew surveys in each group.(including only those respondents who expressed a preference for Obama or McCain and also said they were "absolutely certain" they had registered to vote):

48states.png

You can see the partial pooling from the data to the model, with more pooling in small states such as Wyoming, Rhode Island, and Vermont, and less pooling in states such as California, Texas, and New York where sample sizes were larger. The graphs show estimated McCain vote share, so, unsurprisingly, the lines for whites are higher than the lines for all voters, with differences smaller in states such as Wyoming or Vermont where there are very few nonwhite voters.

Some technical details

Even after restricting to respondents who are certain they are registered, the pre-election polls don't do a great job matching the population of voters. To correctly weight to voters (rather than to the general adult population), we used the 2008 Current Population Survey post-election supplement, which has information on voter turnout. We'll write a technical article describing exactly what we did, but the short version is that the CPS numbers are generally considered to be much more reliable than exit polls or pre-election polls for estimating turnout rates among different groups within a state. What we actually did was to use a multilevel model to smooth the CPS numbers using the latest population totals from the American Community Survey.

Yair also came up with a cool color scheme. Instead of going from deep red to deep blue through purple, we divided up the color scheme as follows: for proportions between 0 and .5, we used different shades of blue (deep blue, getting progressively lighter, toward white), then going from .5 to 1, we used deeper and deeper reds, starting with white, through light pink, to red. (Don't worry, I'll post the R code.) This worked much, much better than the purple schemes I was playing with before. More visual resolution, and a key benefit is that it's immediately clear which states are above and below the 50% threshold. Finally, I did a little trick of my own and used a square-root transformation (more specifically, if the estimate vote proportion for McCain is x, I defined z = 2*(x-.5), and then wored with sign(z)*sqrt(z)) to spread out the resolution near 0.5 and compress it near 0 and 1.

One other thing. The Pew organization sent me their raw data and posted them on the web for anyone to use. The exit polls still refuse to report anything but summaries. I don't see this refusal as a sign of confidence on their part. Please also read my earlier note for further discussion of the Pew and exit polls.

All this work is joint with Yair Ghitza.

Peter Selb writes:

Commenter BorisG asked why I made my pretty maps using pre-election polls rather than exit polls. I responded in the comments but then I did a search and noticed that Kos had a longer discussion of this point on his blog. So I thought maybe it would help to discuss this all further here.

To start with, I appreciate the careful scrutiny. One of my pet peeves is people assuming a number or graph is correct, just because it has been asserted. BorisG and Kos and others are doing a useful service by subjecting my maps to criticism.

Several issues are involved:

- Data availability;
- Problems with pre-election polls;
- Problems with election polls;
- Differences between raw data and my best model-assisted estimates;
- Thresholding at 50%;
- Small sample sizes in some states.

I'll discuss each of these issues in turn.

How did white people vote?

| 16 Comments

I posted the maps at 538.

And here's what we did:

From Jessica, I saw a review by "Econjeff" of my review of Joshua Angrist and Jorn-Steffen Pischke's new book, "Mostly Harmless Econometrics: An Empiricist's Companion."

Econjeff pretty much agrees with what I wrote, but with one comment:

I [Econjeff] am a bit surprised by Gelman's call for more on hierarchical models; I think economists are right to treat these as a combination of useful pedagogical tool for education research design and an unnecessarily functional-form dependent way to get the standard errors right when then the unit of treatment differs from the units available in the data.

I think this is a common perception of multilevel (hierarchical) models among economists. Regular readers of this blog will not be surprised to hear that I disagree completely! The purpose of a multilevel model is not to "get the standard errors right" but rather to model structure in the data.

An analogy that might help here for economists is time series analysis. If you have data with time series structure and you ignore it, you can get over-optimistic standard errors. But that's not the main reason people do time series modeling. The main reason is that the time series structure is interesting and important in its own right. We are interested in individual and contextual effects and unexplained variation at the individual and group levels, just as we are interested in autocorrelation, periodicity, long-range dependence, and so forth.

See chapters 1 and 11 of ARM for more discussion of motivations for multilevel modeling.

Ed Vul, Christine Harris, Piotr Winkielman, and Harold Pashler wrote an article where:

1. They point out that correlations reported in FMRI medical imaging studies are commonly overstated because researchers tend to report only the highest correlations, or only those correlations that exceed some threshold.

2. They suggest that these statistical problems are leading researchers, and the general public, to overstate the connections between social behaviors and specific brain patterns.

After posting on this article, I received a bunch of comments and questions as well as some responses:

This article by Jabbi, Keysers, Singer, and Stephan argues that, because brain imaging resesarchers adjust their p-values and significance thresholds for multiple comparisons (the thousands of voxels in a brain image), their statistical methods don't have the problems that Vul et al. claimed.

This reply by Vul to the Jabbi et al. article. Here Vul argues that adjustment of significance levels does not stop the selected correlations themselves from being too high. I found Vul's argument here to be convincing. Multiple comparisons methods control the rate of false alarms in a setting where true effects are zero--but I don't see that to be relevant to the imaging setting, where differences are not in fact zero. Lots of things affect blood flow in the brain, and we would never expect the average scans of two different groups of people to be the same.

This article by Lieberman, Berkman, and Wager, who defend social neuroscience and argue the following:

1. They accept Vul et al.'s point 1 above (correlations are overstated) but present some evidence that the correlations aren't as overstated as Vul et al. might fear.

2. They disagree with the implied claim that the overstated correlations have distorted scientists' understanding of social neuroscience research.

3. They object to Vul et al's focusing on social neuroscience, given that the same statistical issues arise in all sorts of brain imaging studies.

4. They point out some specific areas where Vul et al. mischaracterized the data-analytic methods used in this field.

I think Lieberman et al. make some good points, but, as Vul et al. point out, researchers often do use correlations to summarize their results. And, even if said correlations survived a multiple-comparisons analysis, readers might interpret these at face value without understanding the selection issue. So all this shake-out is probably a good thing, especially where correlation estimates are being compared to each other.

My thoughts

First off, I haven't worked seriously in medical imaging for nearly 20 years and have only one published paper in the area, so my comments are mostly informed by my perspective on general statistical issues, as well as my own experience thinking about estimation of effect sizes in studies with low statistical power.

Regarding the singling-out of social neuroscience, I see the point of Lieberman et al. I was thinking that maybe one reason for this is that in social neuroscience it's perhaps more difficult to get external validation in the way that might be more possible in other areas of neuroscience where there is some measurement in the blood or whatever that can be taken. I'm not sure about this, just a conjecture.

It's hard for me to believe that the approach based on separate analyses of voxels and p-values, is really the best way to go. The null hypothesis of zero correlations isn't so interesting. What's really of interest is the pattern of where the differences are in the brain.

Related to this point is that, ultimately, when trying to understand differences in brain processing between different sorts of people (or between people doing different tasks), the maximum correlation among voxels is ultimately not what you're looking for. That is why researchers summarize using regions of interest (as in p.7 of the Lieberman et al. article). Vul et al. were correct to warn about overinterpretation of correlations that have been selected as the maximum: the naive reader can see such correlations (and accompanying scatterplots) to think that certain personality traits are more predictable from brain scans than they actually are.

I think the way forward will be to go beyond correlations and the horrible multiple-comparisons framework, which causes so much confusion. Vul et al. and Lieberman et al. both point out that classical multiple comparisons adjustments do not eliminate the systematic overstatement of correlations. A hierarchical Bayes approach (using some sort of mixture for the population of pixel differences, ideally modeled hierarchically with pixels grouped within regions of interest) would help here..

And now for some amateur psychologizing (unsupported by any statistical analysis, correlational or other)

I suspect that one of the motivations of Vul et al in writing their article was frustration at too-good-to-be-true numbers which they felt led to exaggerated claims of neuro super-science.

Conversely, I suspect one of the frustrations of Lieberman et al. is that they are doing a lot more than correlations and fishing expeditions--they're running experiments to test theories in psychology, they're trying to synthesize results from many different labs. And from that perspective it must be frustrating for them to see a criticism (featured in the popular press) that is so focused on correlation, which is really the least of their concerns.

It also seems that both sides were irritated by what they saw as giddy press coverage: on one side, claims of dramatic breakthroughs in understanding the biological basis of behavior and personality; on the other, claims of a dramatic Emperor-has-no-clothes debunking. As scientists, most of us welcome press coverage--after all, we think this work is important and we'd like others to know about it--but . . . fawning press coverage of something that we think is wrong--that's just annoying.

P.S. Wager is a friend--he teaches in the psychology department here--but I don't think my personal knowledge has hindered my evaluation here.

P.P.S. I ran the above by various people involved and they gave some helpful clarifications. But I've probably left in a couple of sloppy statements here and there.

Mark Blumenthal writes about some of the elaborate analyses done by Catalist and other political consulting firms that helped organize the Democrats' get-out-the-vote efforts in 2008. Presumably the Republicans will be doing this next time as well. It brings swing-voter targeting to the next level. It would also be interesting to do some multilevel modeling to put together their county- and individual-level analyses (and, soon, their precinct-level analyses). Also it's worth thinking about how national politics might change as these techniques become more widely used.

This whole exchange is interesting and is closely related to our current research on prior distributions and Bayesian inference for varying-intercept, varying-slope models.

It all started when Sean Zhang asked,

I am using your book to self-teach myself using R for multilevel modeling.

One question I have is that why lmer cannot provide var-cov matrix of estimates of random components. You mentioned in your book that lmer can only provide point estimates for variance component and that is one of the reasons to go Bayesian and use Bugs.

I am running a simple random intercept model using SAS glimmix and found that glimmix provides standard error for variance estimate of random intercept. I looked into glimmix document(see attachment, page 121, theta contains random effect parameters) and can imagine that SAS may use hessian or outer produce of gradient (see page for their likelihood function) to get them.

My question is then why lmer cannot do similar thing as SAS to report var-cov matrix of estimates of random components?

I replied,

AT writes:

A Facebook friend pointed me to this Gladwell piece discussing how you can('t) predict whether a teacher will be successful, but more importantly, on the range of advancement of a class depending on a teacher's ability.

The claim that a good teacher can make as much as a full year's difference in their students' advancement isn't too surprising to me [AT]. What I want to know is whether there are good studies around that look at the interaction effects between teachers and their students' backgrounds. In particular, I'd say that there are a number of teachers I've had who are very polarizing -- some make their students advance far, some stall them in their paths and make them give up the discipline.

My quick reply: From the work of Jonah Rockoff and others, I am convinced that teacher effects are real and they are large. And school effects are mostly the composition of teacher effects. I'm not sure about how large the interactions are (i.e., if some teachers do better with good students and others do better with poor students). Jennifer and I have talked about estimating such interactions (with a big multilevel model, of course) but I don't know what's up with that.

And, of course, this has no implications for university teaching, where as we know the sole qualification is to publish technical articles in obscure journals.

Fitting a model with constraints

| 6 Comments

Chris Chatham writes:

I am using multilevel logistic regression to model individuals' abilties to 'stop' a planned motor movement (my binary outcome), based on the delay between the beginning of the trial and the occurrence of the stop signal (my input with 4 different values). As an apriori assumption, I'd like to specify that the fitted model predict perfect 'stopping' when the stop signal is provided at 0 delay and no 'stopping' whatsoever when the signal is provided at each individual's maximum reaction time. While these particular delays were not tested, the assumptions are sound; my question is how I can ensure that the model fits these assumptions without including some arbitrary number of made-up observations in my dataset?

My reply: As a person who has difficulty in suppressing motor movements, I'm interested in your example. Getting to the statistics, I can't understand enough of what you're saying to give a direct answer, but more generically I'd say try to avoid any hard constraints such as zero intercepts or sharp cutoffs at maximum reaction time. I'd first fit the model straight with no such constraints. Then if the constraints are consistent with your inferences, you could consider setting certain parameters to zero (perhaps in this case you'd be setting main effects to zero while letting interactions vary).

This is cool stuff (by Jeff Lax and Justin Phillips).

bayesglm in Matlab?

| 3 Comments

I received the following email:

This formula is so, so important. It tells you that when you have two sources of variation, only the larger one matters (unless the variances are very close to each other). It comes up all the time in multilevel modeling.

ANOVA and the mixed-model muddle

| 2 Comments

Rick DeShon writes,

As I read through your discussion paper on the analysis of variance published in the Annals of Statistics in 2005, I became a bit confused about the connections between your notion of parameter batches and prior work on the topic of fixed and random effects. Specifically, I wonder how your approach connects to Nelder's "great mixed model muddle?"

Juan Morales writes:

I am currently fitting a multilevel model to data of fruit removal rates which I model using binomial distributions for the number of removed fruit out of total fruits available. I would like to estimate the proportion of variance explained and the amount of pooling at each level (tree, forest stand and so on). You show how to do such things for the radon example and mention that something similar could be done for generalized linear models using deviances. Has this been done somewhere?

My reply: R-squared for multilevel linear models is discussed in our book and in my paper with Pardoe. I think it would make sense to do this with logistic regression also (perhaps using the latent variable formulation with residual s.e. of 1.6, as Jennifer and I discuss in chapter 5). But I haven't done it yet. A good research paper, I think!

Rajeev sends a link to this paper on hierarchical modeling for evaluating multi-site interventions:

This article discusses the evaluation of programs implemented at multiple sites. Two frequently used methods are pooling the data or using fixed effects (an extreme version of which estimates separate models for each site). The former approach ignores site effects. The latter incorporates site effects but lacks a framework for predicting the impact of subsequent implementations of the program (e.g., would a new implementation resemble Riverside?). I present a hierarchical model that lies between these two extremes. Using data from the Greater Avenues for Independence demonstration, I demonstrate that the model captures much of the site-to-site variation of the treatment effects but has less uncertainty than estimating the treatment effect separately for each site. I also show that when predictive uncertainty is ignored, the treatment impact for the Riverside sites is significant, but when predictive uncertainty is considered, the impact for these sites is insignificant. Finally, I demonstrate that the model extrapolates site effects with reasonable accuracy when the site being predicted does not differ substantially from the sites already observed. For example, the San Diego treatment effects could have been predicted based on their site characteristics, but the Riverside effects are consistently underpredicted.

Seems like a good idea to me. Remember, interactions are important!

Hierarchical models of variance

| 4 Comments

Marcus Brubaker writes:

I am currently working on a problem in computational biology using Bayesian inference and I've come to a question for which I hope you have an answer. In this problem there are a large number of noisy 2D images of a molecule, from which we wish to infer the 3D structure. Much of the modeling is straightforward but I have hit a roadblock. Specifically, the noise parameters for these images.

Josh Tucker sent me this paper by Alexander Herzog and himself on attitudes toward the European Union in different European countries. Here's the abstract:

In this paper, we [Herzog and Tucker] document a hitherto unrecognized “micro-macro paradox” of EU accession in post-communist countries: on the micro-level, economic prosperity increases the likelihood of supporting EU membership; while on the macro-level, economic prosperity decreases aggregate levels of support for EU membership.

Non-nested latent class models

| 1 Comment

Gabriel Katz writes:

Serena Ng sends along this paper by Emanuel Moench, Simon Potter, and herself. Here's the abstract:

Below are two long questions about variance components and multilevel models.

Shinichi Nakagawa forwards these questions from Hoger Schielzeth:

Shinghi Nakagawa sent along this, by Josep Carrasco and Lluis Joer. I haven't looked at it but it might be useful.

John Kastellec sent me this attractive paper:

We [Kastellec et al.] study the relationship between state-level public opinion and the roll call votes of senators on Supreme Court nominees. Applying recent advances in multilevel modeling, we use national polls on nine recent Supreme Court nominees to produce state-of-the-art estimates of public support for the confirmation of each nominee in all 50 states. We show that greater public support strongly increases the probability that a senator will vote to approve a nominee, even after controlling for standard predictors of roll call voting. We also find that the impact of opinion varies with context: it has a greater effect on opposition party senators, on ideologically opposed senators, and for generally weak nominees. These results establish a systematic and powerful link between constituency opinion and voting on Supreme Court nominees.

Another triumph of the Lax/Phillips approach of linking policy to state-level opinion (see also here). Also another example of the synergy that's supposed to happen with an academic department, with Jeff, Justin, John, and myself each bringing unique contributions. I don't think any of this would've happened if we weren't all brought together with repeated interactions on the 7th floor.

Rey De Castro writes:

I have a longitudinal data set that needs imputation, but the problem doesn't seem to resemble a typical imputation situation. So I'm casting about for a reasonably defensible approach that I can implement without tremendous custom-programming effort. My question concerns Bayesian approaches to imputation.

The Situation: I have longitudinal data for each of a group of schoolchildren. Each observation in the series is a multilevel class indicator of several canonical locations (i.e., indoor-home, indoor-school, outdoors, commuting) where the child reported being present during a particular 15-minute interval. Essentially, it's a series giving each child's location over time at 15-minute intervals. There are ~100 children, and each child's series is very long: ~2000 observations.

John Kastellec writes:

Let's say you wanted to estimate a multilevel model with an interaction in the individual-level model, say:

Pr(y=1) = logit-1(B0 + B1X + B2Z + B3XZ)

and you wanted to allow the interaction effect to vary by group. Would the correct procedure be to allow all the coefficients to vary by group, then interpret the main and interactive effects as you would normally (i.e. for each group)?

Yup.

David Ross writes,

Chris Weiss writes with a question about propensity score matching with multilevel data:

James O'Brien writes,

Just wondering if you had any thoughts on testing a random intercepts model in multilevel logistic regression. In my present study, the results of a Wald test (which Twisk, for example, suggests as a kind of approximation for testing whether parameter variance is effectively zero), produce a p-observed of .06.

I've noticed in the MLWin manual as well, they argue that p<.1 for the Wald is evidence that slopes and intercepts (in that particular case) indeed vary. Are they perhaps willing to relax the cut-off because of the approximate nature of the test?

Am I perhaps missing something? Perhaps I need to go back to MLWin and do some bootstrapping.

My reply: I've been known to bootstrap (see my 1992 paper in the Journal of Cerebral Blood Flow and Metabolism, and my recent paper with Alessandra analyzing the storable votes data), but in this case I don't think you have to go to that level of effort.

The short answer is that these true variances are never zero (at least, not in social science applications); a non-significant test doesn't mean the variance is zero, it just means that the data are consistent with the zero variance model. So, if you want, you can keep the variance component in the model and ignore the test. Conversely, if the variance is low, it might not hurt you much to just exclude it from the model (i.e., to set it to zero). It depends what you're doing. If you're just including this variance component to help estimate some other parameter in the model, you can probably just get rid of it, but if you're specifically trying to compare particular coefficients here, or to make predictions for new groups, you better keep it in. In that case, you might want to do fully Bayes (something like the 8-schools model) so you're not conditioning on a particular imperfect estimate of that variance parameter.

This issue actually comes up a lot. Fortunately, it's when the variance parameter is least important (when it could in practice be replaced by zero) that it's trickiest to estimate. A rare bit of good news in statistical inference.

Robert Rohrschneider writes:

I [Rohrschneider] am trying to gain an understanding of the pitfalls of multi-level analyses in my work which typically requires that I merge country data with surveys of individuals, usually in Europe. I wonder whether you could reduce my confusion about one issue of multilevel modeling in political science. It appears that the simple two-stage regression approach to multilevel data structures (i.e., a variant of which you and Jennifer Hill describe on p. 240 in your 2007 book) is gaining in popularity in political science.

Jimmy sent along this link. I don't know anything about it, but it could be useful to people planning multilevel studies. I also recommend chapter 20 of our book.

And also see here for my other thoughts on power analysis.

A political scientist wrote in with a question that actually comes up a lot, having to do with hard-to-estimate group-level variance and correlation parameters in multilevel models. The short answer is, when these things are hard to estimate, it's often because there's not much information about them in the data, and sometimes you can get away with just setting these parameters to zero and making your life easier.

Anyway, here's his question:

Juan-Jose Gibaja-Martins writes,

In our recent research we have been working with data on per capita GDP, Human Development Index, etc in european countries and we are interested in performing a Principal Components Analysis (PCA). So far we have been weighting our data -based on the idea that we should allow a bigger weight to a bigger economy- My question is: does it make sense to weight our data -for instance with the population or the GDP of each country- or should we keep all the weights equal to 1? Is it advisable to use weighted PCA? If so, when?

Oddly enough, a related question came up in our quantitative political science research group a couple weeks ago: Rebecca and Matt were talking about a cross-national study that had something like 1000 respondents from each of several European countries, and the question was how to weight the results to account for the fact that some countries are larger than others. A simple data analysis would implicitly count all countries equally, which doesn't sound right.

My solution (as usual) is multilevel modeling plus poststratification: in short, you only need to worry about these weights to the extent that the estimand of interest varies from country to country. Rather than weighting the data, my recommendation is to use a multilevel model to get estimates for all the countries and then to poststratify to get continent-wide estimates, if these are desired.

Longhai Li did a really cool Ph.D. thesis (under the supervision of Radford Neal) on computing for models with deep interactions. The website containing all stuff about this software, including
the R packages, documentations and references, is here and here. Here's a quick description (from the website):

This R package is used in two situations. The first is to predict the next outcome based on the previous states of a discrete sequence. The second is to classify a discrete response based on a number of discrete covariates. In both situations, we use Bayesian logistic regression models that consider the high-order interactions. The time arising from using high-order interactions is reduced greatly by our compression technique that represents a group of original parameters as a single one in MCMC step. In this version, we use log-normal prior for the hyperparameters. When it is used for the second situation --- classification, we consider the full set of interaction patterns up to a specified order.

And here's the research paper (by Longhai and Radford). I wonder if they've achieved some of my goals in wanting weakly informative priors for models with interactions. That Cauchy thing rings a bell.

P.S. to Longhai: I don't recommend keeping your software in two places. Won't it be a pain to keep both sites up-to-date? Or maybe it's done automatically, I don't know.

Ken Lee writes:

Meta-meta-analysis?

| 1 Comment

Jacob Felson writes,

In a 2002 article in the Journal of Economic Perspectives, economists Samuel Bowles and Herbert Gintis attempt to decompose the correlation in income between parents and children into genetic and environmental components. In addition, the authors attempt to calculate the extent to which the intergenerational correlation in income is due to inherited IQ. But unlike most data analyses, they do not use any particular data. Instead, the authors collect estimates of effect sizes from a variety of meta-analyses and attempt to estimate ballpark estimates for parameters in a theoretical path analysis. I am wondering -- is this "meta-meta-analysis" appropriate? Can one conduct path analyses and evaluate relative magnitudes of correlations taken from a variety of studies?

My reply: I haven't read the paper, but I'm skeptical of path analysis in general, and the meta-meta-analysis setting doesn't help any!

Dave Judkins wrote:

Here's the paper (with Jennifer and Masanao), and here's the abstract:

The problem of multiple comparisons can disappear when viewed from a Bayesian perspective. We propose building multilevel models in the settings where multiple comparisons arise. These address the multiple comparisons problem and also yield more efficient estimates, especially in settings with low group-level variation, which is where multiple comparisons are a particular concern.

Multilevel models perform partial pooling (shifting estimates toward each other), whereas classical procedures typically keep the centers of intervals stationary, adjusting for multiple comparisons by making the intervals wider (or, equivalently, adjusting the p-values corresponding to intervals of fixed width). Multilevel estimates make comparisons more conservative, in the sense that intervals for comparisons are less likely to include zero; as a result, those comparisons that are made with confidence are more likely to be valid.

Check out Figures 4 and 6; it's pretty cool stuff. You really see the efficiency gain. We had several more examples but no room in the paper for all of them. Also, here's the presentation.

Meta-analysis question

| 4 Comments

Brant Inman writes:

David Afshartous writes,

Verbeke & Molenberghs (2000; Linear Mixed Models for Longitudinal Data; Section 23.2, p.392) discuss an analytic procedure for power calculations. Specifically, for the case of testing a general linear hypothesis, under the alternative hypothesis the test statistic distribution can be approximated by a noncentral F distribution (supported by simulation results of Helms 1992). And power calculations are thus obtained by incorporating the appropriate quantile from the null F distribution (as opposed to a null chi-squared distribution that doesn't account for variability introduced in estimating variance components).

What is your opinion of this approach versus your suggested simulation approach (p.437 of recent book)? Perhaps the analytic method above is not as appealing due to the dependence of the results on the method to estimate the denominator degrees of freedom in the null F distribution (albeit lower dependence for larger sample sizes)?

My reply: I have mixed feelings about power calculations in general. The topic of statistical power is hugely important (see here for my recent thoughts on the perils of underpowered studies), but I have real problems with the standard "NIH-style" power calculations. Briefly, the problem is that the calculations are set up as if there is a certain power goal, and you get the data necessary to reach it, but realistically it's often the other way around, with a sample size set from practical considerations and then a power calculation set up to get the answer you need. We have a whole chapter on power calculations in our book but I don't really know if I like the idea at all.

In answer to the specific question: I hate thinking about these F distributions--the last time I thought hard about them was for my 1992 paper with Rubin--so I prefer simulation despite its occasional awkwardness.

See here for a link to a useful article on power calcuations by Russ Lenth. (The link is from June, 2005; many of you probably weren't reading this blog back then...)

Bill DuMouchel wrote:

I recently came across your paper, "A default prior distribution for logistic and other regression models," where you suggest the student-t as a prior for the coefficients. My application involves drug safety data and very many predictors (hundreds or thousands of drugs might be associated with an adverse event in a database). Rather than a very weakly informative prior, I would prefer to select the t-distribution scale parameter (call it tau) to shrink the coefficients toward 0 (or toward some other value in a fancier model) as much as can be justified by the data. So I want to fit a simple hierarchical model where tau is estimated. Is there an easy modification of your algorithm to adjust tau at every iteration and to ensure convergence to the MLE of tau (or maximum posterior estimate if we add a prior for tau)? And do you know of any arguments for why regularization by cross-validation would really be any better than fitting tau by a hierarchical model, especially if the goal is parameter interpretation rather than pure prediction?

I replied:

We also have a hierarchical version that does what you want, except that the distribution for the coeffs is normal rather than t. (I couldn't figure out how to get the EM working for a hierarchical t model. The point is that the EM for the t model uses the formulation of a t as a mixture of normals, i.e., it's essentially already a hierarchical normal.)

We're still debugging the hierarchical version, hope to have something publicly available (as an R package) soon.

Regarding your qu about cross-validation, yes, I think a hierarchical model would be better The point of the cross-validation in our paper was to evaluate priors for unvarying parameters which would not be modeled hierarchically.

Bill then wrote:

I did have my heart set on a hierarchical model for t rather than normal, because I wanted to avoid over shrinking very large coefficients while still "tuning" the prior scale parameter to the data empirically. (Although my worry about over shrinking might be less urgent if I use prior information to create "batches" that can have their own centers of shrinkage, as in your in-progress hierarchical bayesglm program.)

Lee Edlefsen and I [Bill D.] are working on a drug adverse events dataset with about 3 million rows and three thousand predictors, using logistic regression and some extensions of LR, and with thousands of different response events to fit. Plus the potential non repeatability of MCMC results would be a real turnoff for the FDA regulators and pharma industry researchers.

An EM question

I have a question for Chuanhai or Xiao-Li or someone like that: is it possible to do EM with two levels of latent variables in the model? In the usual formulation, there are data y, latent parameters z, and hyperparameters theta, and EM gives you the maximum likelihood (or posterior mode) estimate of theta, conditional on y and averaging over z. This can commonly be done fairly easily because z commonly has (or can be approximated with) a simple distribution given y and theta. This scenario describes regression with fixed Student-t priors, or regression with normal priors with unknown mean and variance.

But what about regression with t priors with unknown center and scale? There are now two levels of latent variables. Can an EM, or approximate EM, be constructed here? As Bill and I discussed in our emails, Gibbs is great, and it's much easier to set up and program than EM, but it's harder to debug. There's something nice about a deterministic algorithm, especially if it's built with bells and whistles that go off when something goes wrong.

Notation for crossover designs

| No Comments

David Afshartous writes,

Panel regressions

| 1 Comment

Ismail writes,

Can you explain a little bit about Fixed Effects in panel regressions and when is it appropriate to use it- for example does it make sense to use it on this data set.

My reply (beyond that it's funny to see a reference to an economics paper by someone named "Dollar") is that it's a good idea to model panel data using three error terms, at the unit, time, and unit*time levels. You can (and should) also have predictors at all three levels, as appropriate. Also I'm not a fan of the overloaded term "fixed effects," but that's another story. (Search this blog for more on the topic.)

ANOVA question

| 1 Comment

Richard Gunton writes,

Shravan adds an index entry to our book:

In the index at the back, please add crossed varying intercepts and crossed random effects as an entry for the pilot data discussed on p. 289. That kind of structure is very common in psychology/psycholinguistics, as you probably know, and many people will be looking for crossed random factor specificiations in your book but won't find it unless they know to look under non-nested models in the index.

Shravan also writes:

Also, I really think the code needs to be cleaned up to make it usable generally. It's really a lot of work to try to figure out which parts are missing in each of the code files. I will give you a more structured example and some suggestions for improvement soon. You may lose a lot of impatient people if you don't focus on clean code for the next edition of the book (conversely, you will gain a wider readership if you get the code cleaned up).

OK, OK, we'll do it...

Multilevel time series analysis

| 2 Comments

Shang Ha writes,

Daniel Lakeland writes with a question about lmer() that is more generally a question about partially nested multilevel models:

Kaiser writes with a question that comes up a lot, on getting a good baseline comparison of variation:

Jeff pointed me to this interesting paper by David Primo, Matthew Jacobsmeier, and Jeffrey Milyo comparing multilevel models and clustered standard errors as tools for estimating regression models with two-level data.

Gamma-Poisson mixtures

| 4 Comments

Laura Hatfield writes,

Survey weighting is a mess

| No Comments

Dave Judkins writes, regarding my Struggles with Survey Weighting and Regression Modeling paper,

I am hoping you might be able to clarify a point in your approach. How does a variable like number of phone lines in the house get used in equation 5? (Given that N.pop and X.pop are not available.) Does your work in Section 3 apply only to X variables with known population distributions?

My reply:

Dan Schrage writes with a question about how to model group-level variation:

I [Dan] am trying to better understand the recommendation in your new book to always use random effects (pg. 246) in modeling. (I'm following your definition #5 here of fixed and random effects, as is standard in econometrics.) In econometrics, as I'm sure you know, the classical advice (dating from at least Mundlak (1978)) is this: If unobserved heterogeneity is correlated with regressors in your model, use fixed effects; otherwise, use random effects since they're more efficient. The idea is that random effects lumps this unobserved heterogeneity into the composite error term, and so if it the unobserved heterogeneity is correlated with the regressors, then the regressors are correlated with the error term, and this is bad news: estimators will be biased and inconsistent. So I'm trying to understand why your advice is essentially not to worry about this issue.

Is this just an argument about the bias/variance tradeoff, or is there something deeper here? Perhaps all of this just speaks to one of the common gaps between statistics and econometrics--econometricians tend to care a lot about asymptotic properties like consistency, and statisticians seem to care much less (correct me if I'm off here--I'm still trying to get a good sense of the divide). Econometricians are almost never willing to use an inconsistent estimator, no matter what the gain in efficiency.

My reply:

Ying Yuan writes,

Anova

| 1 Comment

Cari Kaufman writes,

I am writing a paper on using Gaussian processes for Bayesian functional ANOVA, and I'd like to draw some connections to your 2005 Annals paper. In my own work I've chosen to use a 1-1 reparameterization of the cell means, that is, to constrain the levels within each factor. But I am intrigued by your use of exchangeable levels for all factors, and I'm hoping you can take a few minutes to help me clarify your motivation for this decision. Since not all parameters are estimable under the unconstrained model, don't you encounter problems with mixing when the sums of the levels trade off with the grand mean? It seems in many situations it's advantageous to have an orthogonal design matrix, especially when the observed levels correspond to all possible levels in the population. Do you have any thoughts on this you can share?

I should say I found the paper very useful, especially your graphical representation of the variance components. I also like your distinction between the superpopulation and finite population variances, which helped me clarify what happens when generalizing to functional responses. Basically, we can share information across the domain to estimate the superpopulation variances by having a stationary Gaussian process prior, but the finite population variances can differ over the domain, which gives some nice insight into where
various sources of variability are important. (At the moment I'm working with climate modellers, who can really use maps of where various sources of variability show up in their output.)

My reply: I'm not quite sure what the question is, but I think you're pointing out the redundant parameterization issue, that if we specify all levels of a factor, and then have other crosscutting or nested factors (or even just a constant term), then the linear parameters are not all identifiable. I would deal with this issue by fitting the large, nonidentified model and then summarizing using the relevant finite-population summaries. We discuss this a bit in Sections 19.4-19.5 and Chapters 21-22 of our new book.

A couple notes on this:

1. Mixing of the Gibbs sampler can be slow on the original, redundant parameter space but fast on the transformed space, which is what we really care about. Also, things work better with proper priors. My new thing is weakly informative priors which don't include all your prior information but act to regularize your inferences and keep the algorithms in a reasonable space where they can converge faster. The orthoganality that you want can come in this lower-dimensional summary.

2. The redundant-parameter model is identified, if only weakly, as long as we use proper prior distributions on the variance parameters. In Bayesian Data Analysis and in my 2005 Anova paper, I was using flat prior distributions on these "sigma" parameters. But since then I've moved to proper priors, or, in the Anova context, hierarchical priors. See this paper for more information, including an example in Section 6 of the hierarchical model for the variance parameters.

Here's the abstract for my talk tomorrow for the 50th anniversary conference of the Harvard statistics department.

Some Open Problems in Hierarchical Models

Xue Li and Dick Campbell write,

We are doing research on racial and SES disparities in stage at breast cancer diagnosis which involves using tract-level poverty estimates to impute person-level SES variables. We are trying to do this using Bayesian methods.

Markus Loecher asks some questions that come up on occasion:

I am in the middle of going through your book on multilevel/hierarchical models. . . hope to apply them to some multi-scale spatial problems that I am working on. I am still struggling with some of the more subtle implications of the mixed effects model and the "partial pooling" approach which even when formulated more along frequentist terminology seems to have a distinct Bayesian flavor to it ? One particular source of confusion to me is Figure 12.1 While I understand the problems with estimates that rely on small sample sizes and also find the shrunken estimates in Fig. 12.1b appealing, I cannot overcome my feeling that this "dilemma" should be taken care of and expressed in the respective wider confidence intervals. Pooling across counties seems like quite a strong assumption on exchangeability ? A more mundane question is: how did you get the confidence intervals for sample sizes of 1 ?

My response:

1. Frequentist, Bayesian: they're just words. What matters is what you're doing. That's why we talk about partial pooling, 'cos that's what we're doing to the data. If you push me on it, though, yeah, the entire book is Bayesian.

2. You'd like the intervals in Figure 12.1b to be wider. Here's a way to think of it: suppose the data we saw were a random sample from a huge dataset, with thousands of measurements per county. Now suppose you wanted to use the no-pooling or the multilevel estimates as a way of making a prediction for the average of the thousands of measurements in each county. Finally suppose that you want to give each estimate a conf interval, so that in each case there's a 68% chance that the estimate contains the true value. Then you'd find (assuming the model is correct) that, indeed, the no-pooling estimates would require those really wide confidence intervals (as in Fig 12.1a), but the multilevel estimates would only need narrower intervals (as in Fig 12.1b).

3. "Exchangeability" refers to statistical methods (or, more generally, mathematical formulas) that treat the different groups (in this case, counties) symmetrically. The no-pooling, multilevel, and complete-pooling procedures are all exchangeable. So if you don't like exchangeability, there's no reason to pick on the multilevel model. More to the point, the multilevel model allows you to easily go beyond exchangeability by adding group-level predictors as illustrated later on in the chapter.

4. Even when sample size is 1 (or 0), you can get confidence intervals--the info comes from the group-level model.

Mike Larsen asks,

David Afshartous writes,

What is the difference between sim() and mcsamp(), if any? If there is a difference, what should I expect in terms of difference in results?

My reply:

Shane Murphy writes,

I am a graduate student in political science (interested in economics as well), and I was reading your recent blog posts about significance testing, and the problems common for economists doing statistics. Do you know of and recommend any books to students learning econometrics or statistics for social science? Also, just in case your answer is your own book, "Data Analysis Using Regression and Multilevel/Hierarchical Models," is this book an appropriate way to learn "econometrics" (which is just statistics for economists, right?)?

My reply: Yes, I do recommend my book with Jennifer Hill. I also think it's the right book to learn applied statistics for economics. However, within economics, "econometrics" usually means something more theoretical, I think. You could take a look at a book such as Wooldridge's, which presents the theory pretty clearly.

Question on hypothesis testing

| 3 Comments

Mike Frank writes,

Hi, I'm a graduate student at MIT in Brain and Cognitive Sciences. I'm an avid reader of your blog and user of your textbook and so I thought I would email you this question in the hopes you have thoughts on it. I'm in a strange position in my research in that I do a lot of Bayesian modeling of cognitive processes but then end up doing standard psychology experiments to test predictions of the models where I have to use simpler frequentist statistical methods (which are standard in psychology, hard to publish without them) to analyze those data.

Yu-Sung writes,

I told Phil Stark about this paper on Type S error rates and this paper on statistical challenges in estimating small effects, and he replied with these references on Type S errors and multiple comparisons:

Benjamini, Y. and Stark, P.B., 1996. Non-equivariant simultaneous confidence intervals less likely to contain zero, J. Am. Stat. Assoc., 91, 329-337.

Benjamini, Y., Y. Hochberg, and P.B. Stark, 1998. Confidence Intervals with more Power to determine the Sign: Two Ends constrain the Means, J. Amer. Stat. Assoc., 93, 309-317.

This is how researchers communicate: we just send links to our papers back and forth.

Seth Wayland writes,

In Chapter 14.1 of your new book, the example uses only predictors for which you have census data at the state level. In the postratification step, you just plug the values of those covariates into the model, and viola, you have an estimate for that poststratification cell! What about including further individual level predictors in the model to account for probability of selection such as household size and number of phones in the household, or even an individual-level predictor that might improve the model? How do you then calculate the estimate for each poststratification cell?

My response: yes, this is something we are struggling with.

The long answer is that we would treat the population distribution of all the predictors, Census and non-Census variables (those desirable individual-level predictors which are only observed in the sample and not in the population), as unknown. We'd give it all a big fat prior distribution and do Bayesian inference. This sounds like a lot but I think it's doable using regression models with interactions. We're working on this now, starting with simple models with just one non-Census variable. The closest we've come so far with is this paper with Cavan and Jonathan on poststratification without population inference (see blog entry here).

The short answer is that it should be possible to do a quick-and-dirty version of the above plan, estimating the joint distribution of Census and non-Census variables using point estimates for the distribution of non-Census variables given the Census variables, based on weighting using the survey data within each Census post-stratification cell. This is only an approximation because it ignores uncertainty (for example, if a particular cell includes 4 people in single-phone households and 3 people in multiple-phone households, the weighted totals become 4 and 1.5, so the quick-and-dirty approach would use the point estimate of 1/(4+1.5) as the proportion in single-phone households in that cell, ignoring the uncertainty arising from sampling variability).

I do think that this (the quick version, then the full version) is ultimately the way to go, since the poststratification strategy allows us to model the data and get small-area estimates, such as state-level opinions from national polls.

As is often the case, the challenge in statistics is to include all relevant information (from the Census as well as the survey, and maybe also from other surveys), and to do this while setting up a model that is structured enough to take advantage of all these data but not so structured that it overwhelms this information.

Jeff writes,

Can you suggest some (simple) material on the differences between ML MLM and Bayesian MLM? Since we are using LMER, and not winbugs etc., then Justin and I are NOT doing Bayesian MLM?

My reply: The only real difference is when the unexplained group-level variance is hard to estimate, which typically occurs when this variance is low and the number of groups is small (in which case, classical point estimates won't work well, and even a small amount of prior information will help; see Sections 5.1 and 5.2 of this paper). I think we discuss this issue in the new book, also see this from 1996, for example.

Multilevel modeling in Stata

| 1 Comment

Roberto Gutierrez sends along this presentation describing new multilevel modeling options in Stata 10. There's some interesting discussion of computation time. Also, when they introduce the models, I'd suggest looking at the 5 different ways of writing a multilevel model (as in Chapter 12 of our book).

"P" sends along a link to this paper by Sebastian Zollner and Jonathan Pritchard:

Genomewide association studies are now a widely used approach in the search for loci that affect complex traits. After detection of significant association, estimates of penetrance and allele-frequency parameters for the associated variant indicate the importance of that variant and facilitate the planning of replication studies. However, when these estimates are based on the original data used to detect the variant, the results are affected by an ascertainment bias known as the “winner’s curse.” The actual genetic effect is typically smaller than its estimate. This overestimation of the genetic effect may cause replication studies to fail because the necessary sample size is underestimated. Here, we present an approach that corrects for the ascertainment bias and generates an estimate of the frequency of a variant and its penetrance parameters. The method produces a point estimate and confidence region for the parameter estimates. We study the performance of this method using simulated data sets and show that it is possible to greatly reduce the bias in the parameter estimates, even when the original association study had low power.

This is the statistical phenomenon we were writing about here in the context of sex-ratio studies: when sample sizes are small, "statistically significant" estimates will tend to be much larger than the true parameters being estimated. Sollner and Pritchard have a statistical correction procedure, conditioning on the observation of a statistically significant result. I think a Bayesian approach would work better--this might be worth trying on Sollner and Pritchard's examples.

R-squared: useful or evil?

| 1 Comment

I had the following email exchange with Gary King.

How do you summarize logistic regressions and other nonlinear models? The coefficients are only interpretable on a transformed scale. One quick approach is to divide logistic regression coefficients by 4 to convert on to the probability scale--that works for probabilities near 1/2--and another approach is to compute changes with other predictors held at average values (as we did for Figure 8 in this paper). A more general strategy is to average over the distribution of the data--this will make more sense, especially with discrete predictors. Iain Pardoe and I wrote a paper on this which will appear in Sociological Methodology:

In a predictive model, what is the expected difference in the outcome associated with a unit difference in one of the inputs? In a linear regression model without interactions, this average predictive comparison is simply a regression coefficient (with associated uncertainty). In a model with nonlinearity or interactions, however, the average predictive comparison in general depends on the values of the predictors. We consider various definitions based on averages over a population distribution of the predictors, and we compute standard errors based on uncertainty in model parameters. We illustrate with a study of criminal justice data for urban counties in the United States. The outcome of interest measures whether a convicted felon received a prison sentence rather than a jail or non-custodial sentence, with predictors available at both individual and county levels. We fit three models: a hierarchical logistic regression with varying coefficients for the within-county intercepts as well as for each individual predictor; a hierarchical model with varying intercepts only; and a non-hierarchical model that ignores the multilevel nature of the data. The regression coefficients have different interpretations for the different models; in contrast, the models can be compared directly using predictive comparisons. Furthermore, predictive comparisons clarify the interplay between the individual and county predictors for the hierarchical models, as well as illustrating the relative size of varying county effects.

The next step is to program it in general in R.

Data Analysis Using etc. etc.

| 1 Comment

Timothy Hellwig wrote a nice review of our book in The Political Methodologist.

My only complaint is that he writes, "Instructors of first-year graduate methods courses should consider complementing their texts with material from Part I." I think we should be the primary text! In all seriousness, it would be helpful to know what he (and others) think our book should have so that it can become appropriate for a primary text. (We're preparing to put together a shorter version suitable for basic classes in regression.) I agree completely with Hellwig in preferring "learning by doing" to "learning by lecturing."

Imputing categorical regressors

| No Comments

Jonathan writes,

A grad student, Gabriel Katz (no relation), and I are working on some MCMC code with survey data to handle misreporting issues in voting data. (An old idea of mine that I presented an early draft of at Columbia years ago). Since we have coded up the models in Bugs/JAGS, we decided we might as well try to also handle some of the missing data in covariates. We are, however, a little trouble with the imputation of missing categorical covariates. We have been trying ordered probit-logit priors (so as to avoid using the easy way out using normal priors), but the problem is that it is quite hard to get BUGS/Jags to bracket-slice for certain categories of the ordered variables. The problem seems to be in choosing good priors for the thresholds and means of the categorical variable. We don't know of any principled way to this. We have tried several different values for the variance of the priors, but in our model we have 5 categorical variables with a total of 22 thresholds, so trial-and-error seems hopeless. Since this is really just a secondary problem for us, perhaps we should ignore the missingness problem as most researchers do. If you have any suggestions or pointers, it would be much appreciated.

Anna Reimondos writes,

Standard errors from lmer()

| No Comments

Jarrett Byrnes writes,

Shravan writes,

Two different people (Christoper Mann and Jeff Lax) pointed me to this graph in the Wall Street Journal that features a goofy regression line. My expertise on taxes and economic growth is zero, and the statistical problems with the regression line are apparent, so I don't really have anything to say here. Hey, if all roads go through Rome, it's only fair that all lines go through Norway.

But, to get serious for a minute . . . Setting aside the concerns with the regression line or with measurement issues in defining the variables being graphed, it's an interesting reminder of the duality between descriptive vs. causal inference and aggregate vs. individual-level analysis (or, as would be said in psychology, between-subject vs. within-subject analysis). I'm not criticizing the use of graphs such as these (or corresponding regression models) that use between-country comparisons to make implicit causal inferences about policies--it's just helpful to remember the assumptions needed to draw these conclusions.

Jonah Piovia-Scott writes,

Tom Moertel writes,

I wanted to let you know that I have packaged the CRAN "arm" package and its prerequisite R packages for Fedora Linux, making it easy to install using Fedora's integrated package-management system. Normally, installing "arm" is somewhat tricky because of the dependency upon on R2WinBUGS, which requires BRugs, which doesn't currently build on Linux. (I got around this problem by making R2WinBUGS not require BRugs, which is the best work-around I could come up with. The resulting functionality is therefore incomplete, but at least Linux users get the functionality that is possible on their platform rather than a failed installation.)

I don't know how many of your students or readers use Fedora Linux, but if any do, feel free to point them to my packages: http://blog.moertel.com/articles/2007/04/25/new-fedora-core-rpms-for-cran-packages

Just a warning: we've been updating "arm" occasionally, mostly improvements to bayesglm() and fixing the calls to lmer().

Michael Papenfus writes,

I have been using your ARM book to learn multilevel modeling and I have a question which I cannot seem to find answer to in your book. How can I assess the overall significance of a random effect (varying intercept)? Let say I estimate

M1 <- lmer(y ~ 1 + x (1|country)) and obtain
the results from your book which include

Error terrms:
Group Name Std.Dev.
county (Intercept) 0.33
Residual 0.76

Is there a way to test for the overall significance of the random effect (varying intercept)? I don't believe that the se.ranef is what I am looking for. I believe that some packages such as Stata xtmixed provide standard errors to the variance components.

My reply: the mcsamp() function does approximate posterior simulations for the multilevel model and will give you a sense of the uncertainty in the variance parameters. To test the statistical significance, though, you have to fit the model with and without the variance parameters and check some measure of fit. I haven't thought too much about this problem because I usually keep in whatever I can, discarding variance parameters only for convenience (in which case it's usually clear in practical terms that a variance component is small enough that it can be ignored). Ultimately, I think all the variance components exist at some level, although they can be small enough that they can be ignored. That said, you can compare models with and without variance components using predictive accuracy. See the table on the top of p.526 for an example.

Bayesian Anova

| 1 Comment

Song Qian sent me this paper, to appear in the journal Ecology, on ecological applications of multilevel analysis of variance. Here's the abstract:

A Bayesian representation of the analysis of variance by Gelman (2005) is introduced with ecological examples. These examples demonstrate typical situations we encounter in ecological studies. Compared to the conventional methods, the multilevel approach is more flexible in model formulation, easier to setup, and easier to present. Because the emphasis is on estimation, multilevel model results are more informative than the results from a significance test. The improved capacity is largely due to the changed computation methods. In our examples, we show that (1) the multilevel model is able to discern a treatment effect that is too weak for the conventional approach, (2) the graphical presentation associated with the multilevel method is more informative, and (3) the multilevel model can incorporate all sources of uncertainty to accurately describe the true relationship between the outcome and potential predictors.

I like the method (of course) and also the graphical displays. The next step is to move beyond exchangeable models, especially for interactions.

Recent Comments

  • Asa: Thanks everyone. I figured out a pretty solid solution to read more
  • Stuart Buck: Is it that medical schools are trying to screen out read more
  • Jacob: BTW, in no way I am putting down R. R read more
  • Jacob: Nick, Of course, my comment on MATLAB's popularity is based read more
  • Steven: http://www.cockeyed.com/science/gallon/liquid.html See for more info read more
  • Andrew Gelman: Jonathan: You are giving the conventional definition of risk aversion read more
  • Jonathan: As an economist who does his work with "the public," read more
  • BrendanH: I'll second the lme4/R recommendation, on the grounds that it read more
  • Chris Brew: In Linguistics Ohio State invites people to on-site recruiting visits read more
  • PalMD: Independent of the evidence, med schools and residencies require very read more
  • Eliot: I agree with Elizabeth and Jeremy -- Americans and Europeans read more
  • ssendam: One of the irritating things about economics is that it read more
  • wei: probabilities should be conditional upon all possible paragraphs to express read more
  • Jonathan: It's far from surprising that a guy who doesn't believe read more
  • Jeremy Miles: I'd never thought of it like that - but risk read more
  • Elizabeth: You could just describe the same phenomena by saying that read more
  • manuelg: Yes, I think I understand what you are saying. Converting read more
  • derek: How meaningful is it for an American to talk about read more
  • Andrew Gelman: Thorfinn: I don't actually think the Type 1 and Type read more
  • jonathan: My father gave medical boards, did some admissions, taught in read more