R vs. Stata, or, Different ways to estimate multilevel models

Cyrus writes:

I [Cyrus] was teaching a class on multilevel modeling, and we were playing around with different method to fit a random effects logit model with 2 random intercepts—one corresponding to “family” and another corresponding to “community” (labeled “mom” and “cluster” in the data, respectively). There are also a few regressors at the individual, family, and community level. We were replicating in part some of the results from the following paper: Improved estimation procedures for multilevel models with binary response: a case-study, by G Rodriguez, N Goldman.

(I say “replicating in part” because we didn’t include all the regressors that they use, only a subset.) We were looking at the performance of estimation via glmer in R’s lme4 package, glmmPQL in R’s MASS package, and Stata’s xtmelogit. We wanted to study the performance of various estimation methods, including adaptive quadrature methods and penalized quasi-likelihood.

I was shocked to discover that glmer’s default setting is adaptive Gaussian quadrature with 1 integration point (that is, a Laplace approximation). In addition, and even more shocking, glmer **does not allow** more than one integration point to be used for models with more than one random effect. With only one random effect, you can increase the number of integration points. But with multiple random effects (like what we had, which was 2), when you try to specify multiple integration points (with the nAGQ setting), you get an error saying that it can’t do it. At first, I kinda shrugged, but then when I saw the difference that it made, alarm bells started going off.

I made a table showing results from estimates using glmer’s Laplace approximation, and then also results from the glmmPQL which uses penalized quasi-liklihood, and then results from Stata’s xtmelogit using first its own Laplace approximation and then a 7-pt Gaussian adaptive quadrature. Note that in Stata, the 7-pt quadrature fit is the default. This is a high dimsenional “brute force” method for fitting a complex likelihood.

Now, when we look at the results for the random effects standard deviation estimates, we see that the Laplace approximations in R (via glmer) and Stata (model 3 in the Table) are identical. In addition, taking the 7-pt quadrature estimate from xtmelogit to be the best guess of the bunch, the Laplace approximation-based estimates are HORRIBLY biased downward. That is, the estimated random effect variance is way too small. The PQL estimates are also biased downward, but not so badly. This is in line with what Rodriguez and Goldman found, using a 20-point quadrature and an MCMC fit as benchmarks (they produced nearly identical results).

The upshot is that glmer’s default is REALLY BAD; and this default is the only option when you have more than one random effect. My suggestion then is to either use Stata, which has a very sensible 7-point quadrature default (although as a brute force method, it is slow) or fit the model with MCMC.

If I understand, you are using glmer? If so, have you (1) looked into this or (2) checked what glmer is giving you against xtmelogit or BUGS? I would trust the xtmelogit or BUGS answers more.

My reply:

1. I am working with Sophia Rabe-Hesketh, who’s one of the people who programmed gllam in Stata. Sophia has been telling me for awhile that the method used by gllam is better than the Laplace method that is currently used in glmer.

2. Just some terminology: I’d prefer to talk about different “approximations” rather than different “estimation methods.” To me, there’s just one estimation method here–hierarchical Bayes (or, as Sophia would say, marginal likelihood inference)–and these various approaches to point estimates of variance parameters are approximations to get around the difficulties of doing the integrals and computing the marginal likelihood.

3. I also prefer to use the term “varying intercepts” (and slopes) rather than “random effects,” because the term “random effects” is not so clearly defined in the literature (as Jennifer and I discuss in our book). This doesn’t change the substance of your comment but it makes it easier for me to follow.

4. In my project with Sophia (and with Jingchen Liu, also cc-ed here), we’re working on several parallel tracks:
(a) Adding a prior distribution on the variance parameters to stabilize the point estimates and keep them away from the boundary of parameter space (which, in the context of scalar variance parameters, implies that we keep the estimates away from zero).
(b) Adding a few steps of the Metropolis algorithm to capture some of the uncertainty in the variance parameters and also fix some of that bias you’re talking about. It’s my intuition that a low number (e.g., 10) Metropolis steps will clean up a lot of problems. Although, right now, that’s just an intuition, we haven’t tried it yet.
(c) Working in R (using glmer) and also separately in Stata (using gllam).

5. Last but not least, it sounds like you’ve encountered an important principle of research: The way to really learn a subject is to teach it! That’s why college professors(at least at a place like Columbia) know so much: we’re always teaching new things.

4 thoughts on “R vs. Stata, or, Different ways to estimate multilevel models

  1. If Cyrus wants to learn more about this behaviour, I suspect the best place to ask is the R-SIG-mixed-models mailing list. Doug Bates (the author of lme4) will be able to tell you if this just a temporary issue (due to finite programming time), or whether there are advantages to this behaviour in other situations. I believe Russell Millar raised a similar issue a couple of days ago.

  2. A more principled discussion would be:
    N. E. Breslow and D. G. Clayton, Approximate inference in generalized linear mixed models. J. Amer. Statist. Assoc., 2001, 88, pp. 9–25.
    and the papers which followed.

    A practical discussion of the number of points in Gaussian quadrature is: Lesaffre, E. & Spiessens, B. On the effect of the number of quadrature points in a logistic random-effects model: an example. J. Royal Stat. Soc. A, 2001, 50, pp.325-335.

    Probably better than running one(!) model with no clear benchmark…

  3. A question arises… would the folks who switched from using multiple F-tests with linguistic stimuli (items as error terms and subjects as error terms) have been better off sticking with that than use glmer?

  4. Some replies to the comments, since this is actually something that came up some time ago, and we've since looked into it a lot more:

    On Hadley's suggestion: we contacted Douglas Bates, and he was very generous in considering the problem. There was a rich email exchange with him and some other colleagues (that Andrew mentioned). Ultimately, I think we all agreed that yes, this was a problem that should be addressed ideally, it hadn't yet in lme4 because of finite programming time (nice phrase), and that while there is some controversy on appropriate approximation methods, right now the glmer defaults should only serve as a first cut, with perhaps a sampling based approximation (e.g. MCMC) or more complex approximations along the lines of what xtmelogit and gllamm offer as better choices for the final analysis.

    On Daniel's suggestions: correct, the original message that I sent to Andrew arose from a one off thing, but of course we all looked into it further and the behavior that I documented in that particular example was actually representative, rather than a fluke. A good reference on this is:

    Harry Joe. 2008. "Accuracy of Laplace approximation for discrete response mixed models." Computational Statistics & Data Analysis. 52(12):5066-5074.

    But there is another point that was meant to be emphasized (as indicated by Andrew's titling of the post). Statistical tools are built with the expectation that people who do not fully understand what they are doing will nonetheless use them. It's part of the division of labor among researchers. That being the case, default settings can be really important, and my take was that the default settings that the Stata people set were quite sensible, and that the defaults for glmer probably aren't what most people would want if they knew better. For applied researchers (not statisticians) I think this is a useful reminder to think twice about the tools you are using, especially when you are doing things that you do not fully understand.

Comments are closed.