Recently in Statistical computing Category


My reaction was, "It's cute how the bars move, but why is this the future?"

Aleks replied:

Integrated in the browser, works on any device, requires no software installation. Here are more examples, for maps.

R on the cloud


Just as scientists should never really have to think much about statistics, I feel that, in an ideal world, statisticians would never have to worry about computing. In the real world, though, we have to spend a lot of time building our own tools.

It would be great if we could routinely run R with speed and memory limitations being less of a concern. One suggestion that sometimes arises is to run things on "the cloud." So I was interested upon receiving this email from Niklas Frassa:

For awhile I've been fitting most of my multilevel models using lmer/glmer, which gives point estimates of the group-level variance parameters (maximum marginal likelihood estimate for lmer and an approximation for glmer). I'm usually satisfied with this--sure, point estimation understates the uncertainty in model fitting, but that's typically the least of our worries.

Sometimes, though, lmer/glmer estimates group-level variances at 0 or estimates group-level correlation parameters at +/- 1. Typically, when this happens, it's not that we're so sure the variance is close to zero or that the correlation is close to 1 or -1; rather, the marginal likelihood does not provide a lot of information about these parameters of the group-level error distribution.

I don't want point estimates on the boundary. I don't want to say that the unexplained variance in some dimension is exactly zero.

One way to handle this problem is full Bayes: slap a prior on sigma, do your Gibbs and Metropolis, and summarize posterior inferences by simulation. That's fine but it's another bit of programming and computation. We wanted to be able to do it using the existing lmer/glmer (or, in Stata, gllamm) framework.

Yeojin Chung, Sophia Rabe-Hesketh, Jingchen Liu, Vince Dorie, and I came up with a solution. We compute a maximum penalized marginal likelihood estimate, the marginal posterior mode under a gamma (not inverse gamma; you know how I hate that) prior for the scale parameter. Our default choice is gamma (2, 1/A), where A is set to some large value such as 10 (rescaled relative to the regression predictors, as usual). The shape parameter of 2 gives a prior that is linear at 0: thus, the posterior mode can't be 0 under any circumstances--but it can be as close to 0 as is required based on the likelihood. Some basic algebra shows that if the marginal maximum likelihood estimate is at 0, our posterior mode estimate will be about 1 standard error away from 0, which seems about right. The rate parameter, 1/A, is set to be small so that the prior will only pull the estimate down if it is unexpectedly large, as might occur if the number of groups is small.
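To see what the shape parameter of 2 does, here's a minimal toy sketch in R (this is not our implementation; the balanced one-way model, the simulated data, and A = 10 are all just for illustration), comparing the marginal likelihood and its penalized version for the group-level standard deviation:

set.seed(1)
J <- 8; n <- 5; sigma_y <- 1
alpha <- rnorm(J, 0, 0.1)                         # true group-level sd near zero
y <- rnorm(J * n, rep(alpha, each = n), sigma_y)
ybar <- tapply(y, rep(1:J, each = n), mean)

## log marginal likelihood of the group means (balanced case, mu profiled out)
log_marginal <- function(sigma_alpha) {
  v <- sigma_alpha^2 + sigma_y^2 / n
  sum(dnorm(ybar, mean = mean(ybar), sd = sqrt(v), log = TRUE))
}

## add the log gamma(shape = 2, rate = 1/A) prior density for sigma_alpha
A <- 10
log_penalized <- function(sigma_alpha)
  log_marginal(sigma_alpha) + dgamma(sigma_alpha, shape = 2, rate = 1/A, log = TRUE)

optimize(log_marginal,  c(0, 5), maximum = TRUE)$maximum  # can sit at (essentially) zero
optimize(log_penalized, c(0, 5), maximum = TRUE)$maximum  # pulled off the boundary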

Our prior distribution is weakly informative in that it does something to the estimate but is generic and doesn't have as much information as we would have in just about any particular example. I think that in many applied examples it would make sense to use stronger prior information. But I think that, as a default choice, the weakly informative gamma (2, 1/A) prior makes more sense than the uniform distribution (corresponding to the nonregularized estimate).

In the multivariate case, we use a Wishart (not inverse-Wishart), which when we choose the right parameters keeps the variances away from 0 and the correlation matrix positive definite.

One interesting feature of these models is that the weakly-informative regularizing prior we use for a point estimate is not the same as the weakly-informative prior that we would use for full Bayes (where there is no need to keep the posterior mode away from the boundary).

Here's the abstract to our paper:

Variance parameters in mixed or multilevel models can be difficult to estimate, especially when the number of groups is small. We propose a maximum penalized likelihood approach which is equivalent to estimating variance parameters by their marginal posterior mode, given a weakly informative prior distribution. By choosing the prior from the gamma family with at least 1 degree of freedom, we ensure that the prior density is zero at the boundary and thus the marginal posterior mode of the group-level variance will be positive. The use of a weakly informative prior allows us to stabilize our estimates while remaining faithful to the data.

Let me again thank my collaborators, Yeojin Chung, Sophia Rabe-Hesketh, Jingchen Liu, Vince Dorie. We're now working on various generalizations, including multivariate distributions, generalized linear models, and hierarchical models for the variance parameters (going beyond the simple ideas in section 6 of my article from a few years ago).

Martyn Plummer replied to my recent blog on DIC with information that was important enough that I thought it deserved its own blog entry. Martyn wrote:

DIC has been around for 10 years now and despite being immensely popular with applied statisticians it has generated very little theoretical interest. In fact, the silence has been deafening. I [Martyn] hope my paper added some clarity.

As you say, DIC is (an approximation to) a theoretical out-of-sample predictive error. When I finished the paper I was a little embarrassed to see that I had almost perfectly reconstructed the justification of AIC as an approximate cross-validation measure by Stone (1977), with a Bayesian spin of course.

But even this insight leaves a lot of choices open. You need to choose the right loss function and also which level of the model you want to replicate from. David Spiegelhalter and colleagues called this the "focus". In practice the focus is limited to the lowest level of the model. You generally can't calculate the log likelihood (the default penalty) for higher level parameters. But this narrow choice might not correspond to the interests of the analyst. For example, in disease mapping DIC answers the question "What model yields the disease map that best captures the features of the observed incidence data during this period?" But people are often asking more fundamental questions about their models, like "Is there spatial aggregation in disease X?" There is quite a big gap between these questions.

Regarding the slow convergence of DIC, you might want to try an alternative definition of the effective number of parameters pD that I came up with in 2002, in the discussion of Spiegelhalter et al. It is non-negative and coordinate free. It can be calculated from 2 or more parallel chains and so its sample variance can be estimated using standard MCMC diagnostics. I finally justified it in my 2008 paper and implemented it in JAGS. The steps are (or should be):

- Compile a model with at least 2 parallel chains
- Load the dic module
- Set a trace monitor for "pD".
- Output with the coda command

If you are only interested in the sample mean, not the variance, the dic.samples function from the rjags package will give you this in a nice R object wrapper.
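For the record, here's roughly what that looks like from R via rjags (the model file name and data list below are placeholders; the per-iteration trace of pD that Martyn describes comes from the command-line monitor, while dic.samples() returns the posterior means):

library(rjags)
load.module("dic")                               # the module that provides pD

m <- jags.model("model.bug", data = my_data, n.chains = 2)  # at least 2 parallel chains
update(m, 1000)                                  # burn-in

## posterior mean of the deviance and of pD, wrapped in an R object
dic.samples(m, n.iter = 5000, type = "pD")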

I suppose we can implement this in stan too.

Aki Vehtari commented too, with a link to a recent article by Sumio Watanabe on something called the widely applicable information criterion. Watanabe's article begins:

In regular statistical models, the leave-one-out cross-validation is asymptotically equivalent to the Akaike information criterion. However, since many learning machines are singular statistical models, the asymptotic behavior of the cross-validation remains unknown. In previous studies, we established the singular learning theory and proposed a widely applicable information criterion, the expectation value of which is asymptotically equal to the average Bayes generalization loss. In the present paper, we theoretically compare the Bayes cross-validation loss and the widely applicable information criterion and prove two theorems. First, the Bayes cross-validation loss is asymptotically equivalent to the widely applicable information criterion as a random variable. Therefore, model selection and hyperparameter optimization using these two values are asymptotically equivalent. Second, the sum of the Bayes generalization error and the Bayes cross-validation error is asymptotically equal to 2λ/n, where λ is the real log canonical threshold and n is the number of training samples. Therefore the relation between the cross-validation error and the generalization error is determined by the algebraic geometrical structure of a learning machine. We also clarify that the deviance information criteria are different from the Bayes cross-validation and the widely applicable information criterion.

It's great to see progress in this area. After all these years of BIC variants, which I just hate, I like that researchers are moving back to the predictive-error framework. I think that Spiegelhalter, Best, Carlin, and Van der Linde made an important contribution with their DIC paper ten years ago. Even if DIC is not perfect and can be improved, they pushed the field in the right direction.

There are a few things I want to do:

1. Understand a fitted model using tools such as average predictive comparisons, R-squared, and partial pooling factors. In defining these concepts, Iain and I came up with some clever tricks, including (but not limited to):

- Separating the inputs and averaging over all possible values of the input not being altered (for average predictive comparisons);

- Defining partial pooling without referring to a raw-data or maximum-likelihood or no-pooling estimate (these don't necessarily exist when you're fitting logistic regression with sparse data);

- Defining an R-squared for each level of a multilevel model.

The methods get pretty complicated, though, and they have some loose ends--in particular, for average predictive comparisons with continuous input variables.

So now we want to implement these in R and put them into arm along with bglmer etc.

2. Setting up coefplot so it works more generally (that is, so the graphics look nice for models with one predictor, two predictors, or twenty predictors). Also a bunch of expansions to coefplot:

- Defining coefplot for multilevel models

- Also displaying average predictive comparisons for nonlinear models

- Setting it up to automatically display several regressions in a large "table"

3. Automatic plots showing data and fitted regression lines/curves. With multiple inputs, you hold all the inputs but one to fixed values--it's sort of like an average predictive comparison, but graphical. We also have to handle interactions and multilevel models.

4. Generalizing R-squared and partial pooling factors for multivariate (varying-intercept, varying-slope) models.

5. Graphs showing what happens as you add a multilevel component to a model. This is something I've been thinking about for awhile, ever since doing the police stop and frisk model with Jeff Fagan and Alex Kiss. I wanted a graph that showed how the key estimates were changing when we went multilevel, and what in the data was making the change.

One reason to see why these sorts of process explanations are important is . . . we give them all the time. We're always giving these data-and-model stories of why, when we control for variable X, our estimate changes on variable Y. Or why our multilevel estimates are a compromise between something and something else. What I'd like to do is to formalize and automate these explanations. Just as we have formalized and automated much of inference, and we have (to some extent) formalized and automated graphical model checking, I'd like to formalize and automate the steps of understanding a fitted model--which often requires understanding in the context of fitted alternatives that are close by in model space.

6. I have more ideas but this is enough for now.

I received an email with the above title from Daniel Adkins, who elaborates:

I [Adkins] am having a tough time with a dataset including 40K obs and 8K subjects. Trying to estimate a 2 level logit with random intercept and age slope and about 13 fixed covariates. I have tried several R packages (lme4, lme4a, glmmPQL, MCMCglmm) and stata xtmelogit and gllamm to no avail. xtmelogit crashes from insufficient memory. The R packages yield false convergences. A simpler model w/ random intercept only gives stable estimates in lme4 with a very large number of quadrature points (nAGQ>220). When I try this (nAGQ=221) with the random age term, it doesn't make it through a single iteration in 72 hours (have tried both w/ and w/out RE correlation). I am using a power desktop that is top of the line compared to anything other than a cluster. Have tried start values for fixed effects in lme4 and that doesn't help (couldn't figure out how to specify RE starts). Do you have any advice? Should I move to BUGS (no experience, so that would be a little painful)? Try as SEM in Mplus? Is there a trick in lme4 or gllamm that might help? Cluster/more RAM? I am thinking of splitting the sample, modeling, then combining estimates via meta-analysis, but reviewers may balk at this.

Could you give any advice? I have done a lot of diagnostics and simpler models, so I know ~ what the parameter estimates are, but need the formal full models for a paper.

What with that 93% marginal tax rate [regular readers will know what I'm talking about], I might as well just answer this one for free . . .

My first question is what happened in gllamm? My general impression is that gllamm is more reliable than lme4 etc for nonlinear models. Another option is to try bglmer--the regularization in the priors should make the computation more stable. I don't know Mplus so can't say about that, but, no, I wouldn't recommend Bugs--my guess is that would be too slow.

Another option is some sort of intermediate approach, such as: Fit one of those simpler models that you like. In that simpler model, compute and save the "linear predictor" (X*beta). Then use this as a single predictor (maybe it can have a varying slope also) in a multilevel model. Or something like that. If you have a simpler model that you understand, then you should be able to blaze a trail from there to the model you finally want to fit.
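Here's a rough sketch of that two-stage idea with made-up variable names (just to show the mechanics, not a recommendation of these particular predictors):

library(lme4)

## step 1: the simpler model you trust
fit0 <- glm(y ~ x1 + x2 + x3, family = binomial, data = dat)
dat$eta0 <- predict(fit0, type = "link")         # the saved linear predictor X*beta

## step 2: use that linear predictor as a single predictor in the multilevel model,
## here with a varying intercept and a varying slope on it
fit1 <- glmer(y ~ eta0 + (1 + eta0 | group), family = binomial, data = dat)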

Adkins sent an update:

I was finally able to get stable, sensible estimates in lme4 after 1) rescaling age (to avoid estimating a very small random effect variance) and 2) moving to the 64-bit version of R. To answer your question, my sense was that gllamm would get stuck in a flat, slightly bumpy portion of the likelihood function and couldn't get out to move toward the global optimum. Could have probably circumvented this with good starts but I figured things out in lme4 first.

Aaahhh--rescaling. That's advice I gave someone else the other day, oddly enough. I should've thought of that.

After reading this from John Langford:

The Deep Learning problem remains interesting. How do you effectively learn complex nonlinearities capable of better performance than a basic linear predictor? An effective solution avoids feature engineering. Right now, this is almost entirely dealt with empirically, but theory could easily have a role to play in phrasing appropriate optimization algorithms, for example.

Jimmy asks:

Does this sound related to modeling the deep interactions you often talk about? (I [Jimmy] never understand the stuff on hunch, but thought that might be so?)

My reply: I don't understand that stuff on hunch so well either--he uses slightly different jargon than I do! That said, it looks interesting and important so I'm pointing you all to it.

Jon Goldhill points us to a new search engine, Zanran, which is for finding data and statistics. Goldhill writes:

It's useful when you're looking for a graph/table rather than a single number. For example, if you look for 'teenage births rates in the united states' in Zanran you'll see a series of graphs. If you check in Google, there's plenty of material - but you'd have to open everything up to see if it had any real numbers. (I hope you'll appreciate Zanran's preview capability as well - hovering over the icons gives a useful preview of the content.)

Whassup with glm()?

We're having problems with starting values in glm(). A very simple logistic regression with just an intercept and a very simple starting value (beta = 5) blows up.
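Here's a minimal version of what we're seeing (the data are just simulated coin flips):

set.seed(123)
y <- rbinom(100, 1, 0.5)
glm(y ~ 1, family = binomial)                # fine with the default starting values
glm(y ~ 1, family = binomial, start = 5)     # blows up when started at beta = 5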

Peter Huber's most famous work derives from his paper on robust statistics published nearly fifty years ago in which he introduced the concept of M-estimation (a generalization of maximum likelihood) to unify some ideas of Tukey and others for estimation procedures that were relatively insensitive to small departures from the assumed model.

Huber has in many ways been ahead of his time. While remaining connected to the theoretical ideas from the early part of his career, his interests have shifted to computational and graphical statistics. I never took Huber's class on data analysis--he left Harvard while I was still in graduate school--but fortunately I have an opportunity to learn his lessons now, as he has just released a book, "Data Analysis: What Can Be Learned from the Past 50 Years."

The book puts together a few articles published in the past 15 years, along with some new material. Many of the examples are decades old, which is appropriate given that Huber is reviewing fifty years of the development of his ideas. (I used to be impatient with statistics books that were full of dead examples but then I started to realize this was happening to me! The 8 schools experiments are almost 35 years old. The Electric Company is 40. The chicken brains are over 20 years old. The radon study is 15 years old, the data from the redistricting study are from the 1960s and 1970s, and so on. And of course even my more recent examples are getting older at the rate of one year per year and don't keep so well once they're out of the fridge. So at this point in my career I'd like to make a virtue of necessity and say that it's just fine to work with old examples that we really understand.)

OK. As noted, Huber is modern--a follower of Tukey--in his treatment of computing and graphics as central to the statistical enterprise. His ISP software is R-like (as we would say now; of course ISP came first), and the principle of interactivity was important. He also has worked on various graphical methods for data exploration and dimension reduction; although I have not used these programs myself, I view them as close in spirit to the graphical tools that we now use to explore our data in the context of our fitted models.

Right now, data analysis seems dominated by three approaches:
- Machine learning
- Bayes
- Graphical exploratory data analysis
with some overlap, of course.

Many other statistical approaches/methods exist (e.g., time series/spatial, generalized estimating equations, nonparametrics, even some old-fashioned extensions of Fisher, Neyman, and Pearson), but they seem more along the lines of closed approaches to "inference" rather than open-ended tools for "data analysis."

I like Huber's pluralistic perspective, which ranges from contamination models to object-oriented programming, from geophysics to data cleaning. His is not a book to turn to for specific advice; rather, I enjoyed reading his thoughts on a variety of statistical issues and reflecting upon the connections between Huber's strategies for data analysis and his better-known theoretical work.

Huber writes:

Too much emphasis is put on futile attempts to automate non-routine tasks, and not enough effort is spent on facilitating routine work.

I really like this quote and would take it a step further: If a statistical method can be routinized it can be used much more often and its limitations better understood.

Huber also writes:

The interpretation of the results of goodness-of-fit tests must rely on judgment of content rather than on P-values.

This perspective is commonplace today but, as Huber writes, "for a traditional mathematical statistician, the implied primacy of judgment over mathematical proof and over statistical significance clearly goes against the grain." The next question is where the judgment comes from. One answer is that an experienced statistician might work on a few hundred applied problems during his or her career, and that will impart some judgment. But what advice can we give to people without such a personal history? My approach has been to impart as much of the lessons I have learned into methods in my books, but Huber is surely right that any collection of specific instructions will miss something.

It is an occupational hazard of all scholars to have an incomplete perspective on work outside their own subfield. For example, section 5.2 of the book in question contains the following disturbing (to me) claim: "Bayesian statistics lacks a mechanism for assessing goodness-of-fit in absolute terms. . . . Within orthodox Bayesian statistics, we cannot even address the question whether a model Mi, under consideration at stage i of the investigation, is consonant with the data y."

Huh? Huh? Also please see chapter 6 of Bayesian Data Analysis and my article, "A Bayesian formulation of exploratory data analysis and goodness-of-fit testing," which appeared in the International Statistical Review in 2003. (Huber's chapter 5 was written in 2000 so too soon for my 2003 paper, but the first edition of our book and our paper on posterior predictive checks had already appeared several years before.)

Just to be clear: I'm not faulting Huber for not citing my work. The statistics literature is huge and ever-expanding. It's just unfortunate that such a basic misunderstanding--the idea that Bayesians can't check their models--persists.

I like what Huber writes about approximately specified models, and I think he'd be very comfortable with our formulation of Bayesian data analysis, from the very first page of our book, as comprising three steps: (1) Model building, (2) Inference, (3) Model checking. Step 3 is crucial to making steps 1 and 2 work. Statisticians have written a lot about the problems with inference in a world in which models are tested--and that's fine, such biases are a worthy topic of study--but consider the alternative, in which models were fit without ever being checked. This would be horrible indeed.

Here's a quote that is all too true (from section 5.7, following a long and interesting discussion of a decomposition of a time series in physics):

For some parts of the model (usually the less interesting ones) we may have an abundance of degrees of freedom, and a scarcity for the interesting parts.

This reminds me of a conversation I've had with Don Rubin in the context of several different examples. Like many (most?) statisticians, my tendency is to try to model the data. Don, in contrast, prefers to set up a model that matches what the scientists in the particular field of application are studying. He doesn't worry so much about fit to the data and doesn't do much graphing. For example, in the schizophrenics' reaction-time example (featured in the mixture-modeling chapter of Bayesian Data Analysis), we used the model Don recommended: a mixture of normal distributions with a fixed lag between them. Looking at the data and thinking about the phenomenon, a fixed lag didn't make sense to me, but Don emphasized that the psychology researchers were interested in an average difference and so it didn't make sense from his perspective to try to do any further modeling on these data. He said that if we wanted to model the variation of the lag, that would be fine, but it would make sense to gather more data rather than knocking ourselves out on this particular small data set. In a field such as international relations, this get-more-data approach might not work, but in experimental psychology it seems like a good idea. (And I have to admit that I have not at all kept up with whatever research has been done in eye-tracking and schizophrenia in the past twenty years.)

This all reminds me of another story from when I was teaching at Berkeley. Phil Price and I had two students working with us on hierarchical modeling to estimate the distributions of home radon in U.S. counties. One day, one of the students simply quit. Why? He said he just wasn't comfortable with the Bayesian approach. I was used to the old-style Berkeley environment so just accepted it: the kid had been indoctrinated and I didn't have the energy to try to unbrainwash him. But Phil was curious. Having just completed a cross-validation demonstrating how well Bayes was working in this example, Phil asked the student what he would do in our problem instead of a hierarchical model: if the student had a better idea, Phil said, we'd be happy to test it out. The student thought for a moment and said, well, I suppose Professor X (one of my colleagues from down the hall at the time) would say that the solution is to gather more data. At this point Phil blew up. Gather more data! We already have measurements from 80,000 houses! Could you tell us how many more measurements you think you'd need? The student had no answer to this but remained steadfast in his discomfort with the idea of performing statistical inference using conditional probability.

I think that student, and others like him, would benefit from reading Huber's book and realizing that even a deep theoretician saw the need for using a diversity of statistical methods.

Also relevant to those who worship the supposed purity of likelihoods, or permutation tests, or whatever, is this line from Huber's book:

A statistician rarely sees the raw data themselves--most large data collections in the sciences are being heavily preprocessed already in the collection stage, and the scientists not only tend to forget to mention it, but sometimes they also forget exactly what they had done.

We're often torn between modeling the raw raw data or modeling the processed data. The latter choice can throw away important information but has the advantage, not only of computational convenience but also, sometimes, conceptual simplicity: processed data are typically closer to the form of the scientific concepts being modeled. For example, an economist might prefer to analyze some sort of preprocessed price data rather than data on individual transactions. Sure, there's information in the transactions but, depending on the context of the analysis, this behavioral story might distract from the more immediate goals of the economist. Other times, though, the only way to solve a problem is to go back to the raw data, and Huber provides several such examples in his book.

I will conclude with a discussion of a couple of Huber's examples that overlap with my own applied research.

Radon. In section 3.8, Huber writes:

We found (through exploratory data analysis of a large environmental data set) that very high radon levels were tightly localized and occurred in houses sitting on the locations of old mine shafts. . . . The issue here is one of "data mining" in the sense of looking for a rare nugget, not one of looking, like a traditional statistician, "for a central tendency, a measure of variability, measures of pairwise association between a number of variables." Random samples would have been useless, too: either one would have missed the exceptional values altogether, or one would have thrown them out as outliers.

I'm not so sure. Our radon research was based on two random samples, one of which, as noted above, included 80,000 houses. I agree that if you have a nonrandom sample of a million houses, it's a good idea to use it for some exploratory analysis, so I'm not at all knocking what Huber has done, but I think he's a bit too quick to dismiss random samples as "useless." Also, I don't buy his claim that extreme values, if found, would've been discarded as outliers. The point about outliers is that you look at them, you don't just throw them out!

Aggregation. In chapter 6, Huber deplores that not enough attention is devoted to Simpson's paradox. But then he demonstrates the idea with two fake-data examples. If a problem is important, I think it should be important enough to appear in real data. I recommend our Red State Blue State article for starters.

Survey data. In section 7.2, Huber analyzes data from a small survey of the opinions of jurors. When I looked at the list of survey items, I immediately thought of how I would reverse some of the scales to put everything in the same direction (this is basic textbook advice). Huber ends up doing this too, but only after performing a singular value decomposition. That's fine but in general I'd recommend doing all the easy scalings first so the statistical method has a chance to discover something new. More generally, methods such as singular value decomposition and principal components analyses have their limitations--they can work fine for balanced data such as in this example but in more complicated problems I'd go with item-response or ideal-point models. In general I prefer approaches based on models rather than algorithms: when a model goes wrong I can look for the assumption that was violated, whereas when an algorithm spits out a result that doesn't make sense, I'm not always sure how to proceed. This may be a matter of taste or emphasis more than anything else; see my discussion on Tukey's philosophy.

The next example in Huber's book is the problem of reconstructing maps. I think he'd be interested in the work of Josh Tenenbaum and his collaborators on learning structured models such as maps and trees. Multidimensional scaling is fine--Huber gives a couple of references from 1970--but we can do a lot more now!

In conclusion, I found Huber's book to be enjoyable and thought provoking. It's good to have a sense of what a prominent theoretical statistician thinks about applied statistics.

Is that what she said?


Eric Booth cozies up to this article by Chloe Kiddon and Yuriy Brun (software here). I think they make their point in a gentle yet forceful manner.

Jeff writes:

How far off is bglmer and can it handle ordered logit or multinom logit?

My reply:

bglmer is very close. No ordered logit but I was just talking about it with Sophia today. My guess is that the easiest way to fit a hierarchical ordered logit or multinom logit will be to use stan. For right now I'd recommend using glmer/bglmer to fit the ordered logits in order (e.g., 1 vs. 2,3,4, then 2 vs. 3,4, then 3 vs. 4). Or maybe there's already a hierarchical multinomial logit in mcmcpack or somewhere?
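Here's roughly what that sequential trick looks like in lme4, with made-up predictor and grouping names (bglmer should slot in the same way once it's out):

library(lme4)

dat$ycat <- as.numeric(dat$y)   # ordered categories coded 1, 2, 3, 4

## 1 vs. 2,3,4
fit1 <- glmer(I(ycat >= 2) ~ x + (1 | group), family = binomial, data = dat)
## 2 vs. 3,4, among those in categories 2-4
fit2 <- glmer(I(ycat >= 3) ~ x + (1 | group), family = binomial,
              data = subset(dat, ycat >= 2))
## 3 vs. 4, among those in categories 3-4
fit3 <- glmer(I(ycat >= 4) ~ x + (1 | group), family = binomial,
              data = subset(dat, ycat >= 3))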

Galin Jones, Steve Brooks, Xiao-Li Meng and I edited a handbook of Markov Chain Monte Carlo that has just been published. My chapter (with Kenny Shirley) is here, and it begins like this:

Convergence of Markov chain simulations can be monitored by measuring the diffusion and mixing of multiple independently-simulated chains, but different levels of convergence are appropriate for different goals. When considering inference from stochastic simulation, we need to separate two tasks: (1) inference about parameters and functions of parameters based on broad characteristics of their distribution, and (2) more precise computation of expectations and other functions of probability distributions. For the first task, there is a natural limit to precision beyond which additional simulations add essentially nothing; for the second task, the appropriate precision must be decided from external considerations. We illustrate with an example from our current research, a hierarchical model of trends in opinions on the death penalty in U.S. states.

To read all the other chapters, you'll have to buy the book!

Marc Tanguay writes in with a specific question that has a very general answer. First, the question:

I [Tanguay] am currently running a MCMC for which I have 3 parameters that are restricted to a specific space. 2 are bounded between 0 and 1 while the third is binary and updated by a Beta-Binomial. Since my priors are also bounded, I notice that, conditional on all the rest (which covers both data and other parameters), the density was not varying a lot within the space of the parameters. As a result, the acceptance rate is high, about 85%, and this despite the fact that all of the parameter space is explored. Since in your book, the optimal acceptance rates prescribed are lower than 50% (in the case of multiple parameters), do you think I should worry about getting 85%? Or is this normal given the restrictions on the parameters?

First off: Yes, my guess is that you should be taking bigger jumps. 85% seems like too high an acceptance rate for Metropolis jumping.

More generally, though, my recommendation is to monitor expected squared jumped distance (ESJD), which is a cleaner measure than acceptance probability. Generally, the higher is ESJD, the happier you should be. See this paper with Cristian Pasarica.

The short story is that if you maximize ESJD, you're minimizing the first-order autocorrelation. And, within any parameterized family of jumping rules, if you minimize the first-order autocorrelation, I think you'll pretty much be minimizing all of your autocorrelations and maximizing your efficiency.

As Cristian and I discuss in the paper, you can use a simple Rao-Blackwellization to compute expected squared jumped distance, rather than simply average squared jumped distance. We develop some tricks based on differentiation and importance sampling to adaptively optimize the algorithm to maximize ESJD, but you can always try the crude approach of trying a few different jumping scales, calculating ESJD for each, and then picking the best to go forward.
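If you want to try the crude approach, here's a little sketch with a toy normal target (swap in your own log-density); it just compares average squared jumped distance across a few proposal scales:

metropolis_asjd <- function(scale, n_iter = 5000) {
  theta <- numeric(n_iter)
  for (t in 2:n_iter) {
    prop <- theta[t - 1] + rnorm(1, 0, scale)
    log_ratio <- dnorm(prop, log = TRUE) - dnorm(theta[t - 1], log = TRUE)
    theta[t] <- if (log(runif(1)) < log_ratio) prop else theta[t - 1]
  }
  mean(diff(theta)^2)                     # average squared jumped distance
}

set.seed(1)
scales <- c(0.5, 1, 2.4, 5, 10)
sapply(scales, metropolis_asjd)           # go with the scale that gives the largest value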

A common aphorism among artificial intelligence practitioners is that A.I. is whatever machines can't currently do.

Adam Gopnik, writing for the New Yorker, has a review called Get Smart in the most recent issue (4 April 2011). Ostensibly, the piece is a review of new books, one by Joshua Foer, Moonwalking with Einstein: The Art and Science of Remembering Everything, and one by Stephen Baker, Final Jeopardy: Man vs. Machine and the Quest to Know Everything (which would explain Baker's spate of Jeopardy!-related blog posts). But like many such pieces in highbrow magazines, the book reviews are just a cover for staking out a philosophical position. Gopnik does a typically New Yorker job in explaining the title of this blog post.

RStudio - new cross-platform IDE for R


The new R environment RStudio looks really great, especially for users new to R. In teaching, these are often people new to programming anything, much less statistical models. The R GUIs were different on each platform, with (sometimes modal) windows appearing and disappearing and no unified design. RStudio fixes that and has already found a happy home on my desktop.

Initial impressions

I've been using it for the past couple of days. For me, it replaces the niche that R.app held: looking at help; quickly doing something I don't want to pollute a project workspace with; sometimes data munging, merging, and transforming; and prototyping plots. RStudio is better than R.app at all of these things. For actual development and papers, though, I remain wedded to emacs+ess (good old C-x M-c M-Butterfly).

Favorite features in no particular order

  • plots seamlessly made in new graphics devices. This is huge--instead of one active plot window named something like quartz(1), the RStudio plot window holds a whole stack of them, and you can click through to previous ones that would be overwritten and 'lost' in R.app.
  • help viewer. Honestly I use this more than anything else in R.app, and the RStudio one is prettier (mostly by being not set in Times), and you can easily get contextual help from the source doc or console pane (hit tab for completions, then F1 on what you want).
  • workspace viewer with types and dimensions of objects. Another reason I sometimes used R.app instead of emacs. This one doesn't seem much different from the R.app one, but its integration into the environment is better than the floaty thing that R.app does.
  • ‘Import Dataset’ menu item and button in the workspace pane. For new R users, the answer to “How do I get data into this thing?” has always been “Use one of the textbook package’s included datasets until you learn to read.csv()”. This is a much better answer.
  • obviously, the cross-platform nature of RStudio took the greatest engineering effort. The coolest platform is actually that it will run on a server and you access it using a modern browser (i.e., no IE). (“While RStudio is compatible with Internet Explorer, other browsers provide an improved user experience through faster JavaScript performance and more consistent handling of browser events.” more).

It would be nice if…

  • indents worked like emacs. I think my code looks nice largely because of emacs+ess. The default indent of two spaces is nice (see the Google style guide), but the way emacs lines up newlines by default is pretty helpful in avoiding silly typing errors (omitted commas, unclosed parentheses).
  • you could edit data.frames, which I'll guess they are working on. It must be hard, since the R.app one and the X one that comes up in emacs are so abysmal (the R.app one is the least bad). RStudio currently says “Editing of matrix and data.frame objects is not currently supported in RStudio.” :-(

Overall, really great stuff!

Geen Tomko asks:

Can you recommend a good introductory book for statistical computation? Mostly, something that would help make it easier in collecting and analyzing data from student test scores.

I don't know. Usually, when people ask for a starter statistics book, my recommendation (beyond my own books) is The Statistical Sleuth. But that's not really a computation book. ARM isn't really a statistical computation book either. But the statistical computation books that I've seen don't seem so relevant for the analyses that Tomko is looking for. For example, the R book of Venables and Ripley focuses on nonparametric statistics, which is fine but seems a bit esoteric for these purposes.

Does anyone have any suggestions?

John Salvatier writes:

What do you and your readers think are the trickiest models to fit? If I had an algorithm that I claimed could fit many models with little fuss, what kinds of models would really impress you? I am interested in testing different MCMC sampling methods to evaluate their performance and I want to stretch the bounds of their abilities.

I don't know what's the trickiest, but just about anything I work on in a serious way gives me some troubles. This reminds me that we should finish our Bayesian Benchmarks paper already.

Software request


I was posting a couple of comments on Kaiser's blog, and I'm finding the CAPTCHAs there increasingly hard to read. I was thinking that some good software would be helpful.

Is there a browser attachment I could download that would automatically read CAPTCHAs and translate them into letters and numbers? This would be very helpful. I'd appreciate any tips in this direction.

This post is an (unpaid) advertisement for the following extremely useful resource:

  • Petersen, K. B. and M. S. Pedersen. 2008. The Matrix Cookbook. Technical Report, Technical University of Denmark.

It contains 70+ pages of useful relations and derivations involving matrices. What grabbed my eye was the computation of gradients for matrix operations ranging from eigenvalues and determinants to multivariate normal density functions. I had no idea the multivariate normal had such a clean gradient (see section 8).
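For instance, the gradient of the log density with respect to the mean is just Sigma^{-1} (x - mu); here's a quick numerical check (mvtnorm, which supplies dmvnorm(), is assumed to be installed):

library(mvtnorm)

set.seed(2)
k <- 3
mu <- rnorm(k)
Sigma <- crossprod(matrix(rnorm(k * k), k))     # a random positive-definite covariance
x <- rnorm(k)

grad_analytic <- solve(Sigma, x - mu)           # Sigma^{-1} (x - mu)

f <- function(m) dmvnorm(x, mean = m, sigma = Sigma, log = TRUE)
eps <- 1e-6
grad_numeric <- sapply(1:k, function(j) {
  e <- rep(0, k); e[j] <- eps
  (f(mu + e) - f(mu - e)) / (2 * eps)
})

cbind(grad_analytic, grad_numeric)              # should agree to several decimals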

Andrew Gelman (Columbia University) and Eric Johnson (Columbia University) seek to hire a post-doctoral fellow to work on the application of the latest methods of multilevel data analysis, visualization and regression modeling to an important commercial problem: forecasting retail sales at the individual item level. These forecasts are used to make ordering, pricing and promotions decisions which can have significant economic impact to the retail chain such that even modest improvements in the accuracy of predictions, across a large retailer's product line, can yield substantial margin improvements.

Activities focus on the development of iterative imputation algorithms and diagnostics for missing-data imputation. Activities would include model-development, programming, and data analysis. This project is to be undertaken with, and largely funded by, a firm which provides forecasting technology and services to large retail chains, and which will provide access to a unique and rich set of proprietary data. The postdoc will be expected to spend some time working directly with this firm, but this is fundamentally a research position.

The ideal candidate will have a background in statistics, psychometrics, or economics and be interested in marketing or related topics. He or she should be able to work fluently in R and should already know about hierarchical models and Bayesian inference and computation.

The successful candidate will become part of the lively Applied Statistics Center community, which includes several postdocs (with varied backgrounds in statistics, computer science, and social science), Ph.D., M.A., and undergraduate students, and faculty at Columbia and elsewhere. We want people who love collaboration and have the imagination, drive, and technical skills to make a difference in our projects.

If you are interested in this position, please send a letter of application, a CV, some of your articles, and three letters of recommendation to the Applied Statistics Center coordinator, Caroline Peters. Review of applications will begin immediately.

Andrew Gelman (Columbia University) and Jennifer Hill (New York University) seek to hire a post-doctoral fellow to work on development of iterative imputation algorithms and diagnostics for missing-data imputation. Activities would include model-development, programming, and data analysis. This project is funded by a grant from the Institute of Education Sciences. Other collaborators on the project include Jingchen Liu and Ben Goodrich. This is a two-year position that would start this summer (2011) or earlier if possible.

The ideal candidate will have a statistics or computer science background, will be interested in statistical modeling, serious programming, and applications. He or she should be able to work fluently in R and should already know about hierarchical models and Bayesian inference and computation. Experience or interest in designing GUIs would be a bonus, since we are attempting to improve our GUI for missing data imputation.

The successful candidate will become part of the lively Applied Statistics Center community, which includes several postdocs (with varied backgrounds in statistics, computer science, and social science), Ph.D., M.A., and undergraduate students, and faculty at Columbia and elsewhere. We want people who love collaboration and have the imagination, drive, and technical skills to make a difference in our projects.

If you are interested in this position, please send a letter of application, a CV, some of your articles, and three letters of recommendation to the Applied Statistics Center coordinator, Caroline Peters. Review of applications will begin immediately.

We need help picking out an automatic differentiation package for Hamiltonian Monte Carlo sampling from the posterior of a generalized linear model with deep interactions. Specifically, we need to compute gradients for log probability functions with thousands of parameters that involve matrix (determinants, eigenvalues, inverses), stats (distributions), and math (log gamma) functions. Any suggestions?



I received the following email:

Did you know that it looks like Microsoft is entering the modeling game? I mean, outside of Excel. I recently received an email at work from a MS research contractor looking for ppl that program in R, SAS, Matlab, Excel, and Mathematica. . . . So far I [the person who sent me this email] haven't seen anything about applying any actual models. Only stuff about assigning variables, deleting rows, merging tables, etc. I don't know how common knowledge this all is within the statistical community. I did a quick google search for the name of the programming language and didn't come up with anything.

That sounds cool. Working with anything from Microsoft sounds pretty horrible, but it would be useful to have another modeling language out there, just for checking our answers if nothing else.

Joscha Legewie points to this article by Lars Ronnegard, Xia Shen, and Moudud Alam, "hglm: A Package for Fitting Hierarchical Generalized Linear Models," which just appeared in The R Journal. This new package has the advantage, compared to lmer(), of allowing non-normal distributions for the varying coefficients. On the downside, they seem to have reverted to the ugly lme-style syntax (for example, "fixed = y ~ week, random = ~ 1|ID" rather than "y ~ week + (1|ID)"). The old-style syntax has difficulties handling non-nested grouping factors. They also say they can estimate models with correlated random effects, but isn't that just the same as varying-intercept, varying-slope models, which lmer (or Stata alternatives such as gllamm) can already do? There's also a bunch of stuff on H-likelihood theory, which seems pretty pointless to me (although probably it won't do much harm either).

In any case, this package might be useful to some of you, hence this note.

Automating my graphics advice


After seeing this graph:


I have the following message for Sharad:

Rotate the graph 90 degrees so you can see the words. Also you can ditch the lines. Then what you have is a dotplot, following the principles of Cleveland (1985). You can lay out a few on one page to see some interactions with demographics.

The real challenge here . . .

. . . is to automate this sort of advice. Or maybe we just need a really nice dotplot() function and enough examples, and people will start doing it?
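In the meantime, base R's dotchart() gets most of the way there; here's a tiny sketch with invented numbers:

d <- data.frame(category = c("Strongly agree", "Agree", "Neutral",
                             "Disagree", "Strongly disagree"),
                percent = c(34, 28, 15, 14, 9))

## horizontal dotplot: the labels are readable and no bars are needed
dotchart(d$percent, labels = d$category, xlab = "Percent of respondents", pch = 19)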


Often a lineplot is better. See here for a discussion of another Sharad example.

Data cleaning tool!


Hal Varian writes:

You might find this a useful tool for cleaning data.

I haven't tried it out yet, but data cleaning is a hugely important topic and so this could be a big deal.

John Salvatier pointed me to this blog on derivative based MCMC algorithms (also sometimes called "hybrid" or "Hamiltonian" Monte Carlo) and automatic differentiation as the future of MCMC.

This all makes sense to me and is consistent both with my mathematical intuition from studying Metropolis algorithms and my experience with Matt using hybrid MCMC when fitting hierarchical spline models. In particular, I agree with Salvatier's point about the potential for computation of analytic derivatives of the log-density function. As long as we're mostly snapping together our models using analytically-simple pieces, the same part of the program that handles the computation of log-posterior densities should also be able to compute derivatives analytically.
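To make the point concrete, here's a bare-bones sketch of one hybrid/Hamiltonian Monte Carlo transition. The target is just a standard normal; log_post() and grad_log_post() stand in for whatever the model-fitting program would supply:

log_post      <- function(theta) -0.5 * sum(theta^2)    # toy target: standard normal
grad_log_post <- function(theta) -theta                 # its analytic gradient

hmc_step <- function(theta, eps = 0.1, L = 20) {
  phi <- rnorm(length(theta))                           # draw a momentum vector
  theta_new <- theta
  phi_new <- phi + 0.5 * eps * grad_log_post(theta_new) # half step for momentum
  for (l in 1:L) {                                      # leapfrog steps
    theta_new <- theta_new + eps * phi_new
    phi_new <- phi_new + (if (l < L) eps else 0.5 * eps) * grad_log_post(theta_new)
  }
  log_accept <- log_post(theta_new) - log_post(theta) -
    0.5 * sum(phi_new^2) + 0.5 * sum(phi^2)
  if (log(runif(1)) < log_accept) theta_new else theta
}

theta <- rep(0, 5)
for (i in 1:1000) theta <- hmc_step(theta)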

I've been a big fan of automatic derivative-based MCMC methods since I started hearing about them a couple years ago (I'm thinking of the DREAM project and of Mark Girolami's paper), and I too wonder why they haven't been used more. I guess we can try implementing them in our current project in which we're trying to fit models with deep interactions. I also suspect there are some underlying connections between derivative-based jumping rules and redundant parameterizations for hierarchical models.

It's funny. Salvatier is saying what I've been saying (not very convincingly) for a couple years. But, somehow, seeing it in somebody else's words makes it much more persuasive, and again I'm all excited about this stuff.

My only amendment to Salvatier's blog is that I wouldn't refer to these as "new" algorithms; they've been around for something like 25 years, I think.

Chris Wiggins writes of an interesting-looking summer program that undergraduate or graduate students can apply to:

The hackNY Fellows program is an initiative to mentor the next generation of technology innovators in New York, focusing on tech startups. Last summer's class of fellows was paired with NYC startups which demonstrated they could provide a mentoring environment (a clear project, a person who could work with the Fellow, and sufficient stability to commit to 10 weeks of compensation for the Fellow). hackNY, with the support of the Kauffman foundation and the Internet Society of New York, provided shared housing in NYU dorms in Union Square, and organized a series of pedagogical lectures.

hackNY was founded by Hilary Mason, chief scientist at, Evan Korth, professor of CS at NYU, and Chris Wiggins, professor of applied mathematics at Columbia. Each of us has spent thousands of student-hours teaching and mentoring, and is committed to help build a strong community of technological innovators here in New York.

Applicants should be able to demonstrate proficiency in coding, in any language of their choosing, and a desire to find out more about NYC's rapidly-growing tech startup ecosystem. We strive to find excellent placements in a variety of fields, including front end (e.g., web development), back end (e.g., data science), and UIUX (user interface and user experience). The deadline to apply is 11:59 pm on Sunday December 5 (NYC time).

Students can apply here.

For information about last year's program, we encourage you to skim the list of activities from last year, which includes links to many of the blog entries, or to consult the press page.

for coverage in the media about the summer Fellows program as well as our once-per-semester hackathons.

Please don't hesitate to email info at with any questions, or to contact us individually at {hilary,evan,chris}

Multilevel quantile regression


Ryan Seals writes:

I'm an epidemiologist at Emory University, and I'm working on a project of release patterns in jails (basically trying to model how long individuals are in jail before they're release, for purposes of designing short-term health interventions, i.e. HIV testing, drug counseling, etc...). The question lends itself to quantile regression; we're interested in the # of days it takes for 50% and 75% of inmates to be released. But being a clustered/nested data structure, it also obviously lends itself to multilevel modeling, with the group-level being individual jails.

So: do you know of any work on multilevel quantile regression? My quick lit search didn't yield much, and I don't see any preprogrammed way to do it in SAS.

My reply:

To start with, I'm putting in the R keyword here, on the hope that some readers might be able to refer you to an R function that does what you want. Beyond this, I think it should be possible to program something in Bugs. In ARM we have an example of a multilevel ordered logit, which doesn't sound so different from what you're doing. I've never done a full quantile regression, but I imagine that you have to take some care in setting up the distributional form.

To start, you could fit some multilevel logistic regressions using different quantiles as cut-off points and plot your inferences to see generally what's going on.
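For instance, something like this in lme4 (the day cut-offs and variable names here are invented):

library(lme4)

for (d in c(7, 14, 30, 60)) {
  fit <- glmer(I(days_to_release <= d) ~ age + sex + (1 | jail),
               family = binomial, data = jail_data)
  print(fixef(fit))       # how the coefficients move as the cut-off changes
}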

par (mar=c(3,3,2,1), mgp=c(2,.7,0), tck=-.01)

Thank you.

Hadley Wickham sent me this, by Keith Baggerly and Kevin Coombes:

In this report we [Baggerly and Coombes] examine several related papers purporting to use microarray-based signatures of drug sensitivity derived from cell lines to predict patient response. Patients in clinical trials are currently being allocated to treatment arms on the basis of these results. However, we show in five case studies that the results incorporate several simple errors that may be putting patients at risk. One theme that emerges is that the most common errors are simple (e.g., row or column offsets); conversely, it is our experience that the most simple errors are common.

This is horrible! But, in a way, it's not surprising. I make big mistakes in my applied work all the time. I mean, all the time. Sometimes I scramble the order of the 50 states, or I'm plotting a pure noise variable, or whatever. But usually I don't drift too far from reality because I have a lot of cross-checks and I (or my close collaborators) are extremely familiar with the data and the problems we are studying.

Genetics, though, seems like more of a black box. And, as Baggerly and Coombes demonstrate in their fascinating paper, once you have a hypothesis, it doesn't seem so difficult to keep coming up with what seems like confirming evidence of one sort or another.

To continue the analogy, operating some of these methods seems like knitting a sweater inside a black box: it's a lot harder to notice your mistakes if you can't see what you're doing, and it can be difficult to tell by feel if you even have a functioning sweater when you're done with it all.

SAS and R


Xian sends along this link that might be of interest to some of you.

Our "arm" package in R requires Doug Bates's "lme4" which fits multilevel models.

lme4 is currently having some problems on the Mac. But installation on the Mac can be done; it just takes a bit of work.

I have two sets of instructions below.

After I spoke tonight at the NYC R meetup, John Myles White and Drew Conway told me about this competition they're administering for developing a recommendation system for R packages. They seem to have already done some work laying out the network of R packages--which packages refer to which others, and so forth.

I just hope they set up their system so that my own packages ("R2WinBUGS", "R2jags", "arm", and "mi") get recommended automatically. I really hate to think that there are people out there running regressions in R and not using display() and coefplot() to look at the output.

P.S. Ajay Shah asks what I mean by that last sentence. My quick answer is that it's good to be able to visualize the coefficients and the uncertainty about them. The default options of print(), summary(), and plot() in R don't do that:

- print() doesn't give enough information
- summary() gives everything to a zillion decimal places and gives useless things like p-values
- plot() gives a bunch of residual and diagnostic plots but no graphs of the fitted model and data.

I like display() because it gives the useful information that's in summary() but without the crap. I like coefplot() too, but it still needs a bit of work to be generally useful. And I'd also like to have a new function that automatically plots the data and fitted lines.
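Here's the sort of thing I mean, with simulated data (arm is on CRAN):

library(arm)

x1 <- rnorm(100); x2 <- rnorm(100)
y <- rbinom(100, 1, plogis(1 + 2 * x1 - x2))
fit <- glm(y ~ x1 + x2, family = binomial)

display(fit)     # concise estimates and standard errors, no clutter
coefplot(fit)    # the coefficients plotted with their uncertainty bars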

Here's my discussion of this article for the Journal of the Royal Statistical Society:

I will comment on this paper in my role as applied statistician and consumer of Bayesian computation. In the last few years, my colleagues and I have felt the need to fit models predicting survey responses given multiple discrete predictors, for example estimating voting given ethnicity and income within each of the fifty states, or estimating public opinion about gay marriage given age, sex, ethnicity, education, and state. We would like to be able to fit such models with ten or more predictors--for example, religion, religious attendance, marital status, and urban/rural/suburban residence in addition to the factors mentioned above.

There are (at least) three reasons for fitting a model with many predictive factors and potentially a huge number of interactions among them:

1. Deep interactions can be of substantive interest. For example, Gelman et al. (2009) discuss the importance of interactions between income, religion, religious attendance, and state in understanding how people vote.

2. Deep interactions can increase predictive power. For example, Gelman and Ghitza (2010) show how the relation between voter turnout and the combination of sex, ethnicity, education, and state has systematic patterns that would not be captured by main effects or even two-way interactions.

3. Deep interactions can help correct for sampling problems. Nonresponse rates in opinion polls continue to rise, and this puts a premium on post-sampling adjustments. We can adjust for known differences between sampling and population using poststratification, but to do so we need reasonable estimates of the average survey response within narrow slices of the population (Gelman, 2007).

Our key difficulty--familiar in applied statistics but not always so clear in discussions of statistical computation--is that, while we have an idea of the sort of model we would like to fit, we are unclear on the details. Thus, our computational task is not merely to fit a single model but to try out many different possibilities. My colleagues and I need computational tools that are:

(a) able to work with moderately large datasets (aggregations of surveys with total sample size in the hundreds of thousands);
(b) able to handle complicated models with tens of thousands of latent parameters;
(c) flexible enough to fit models that we haven't yet thought of;
(d) fast enough that we can fit model after model.

We all know by now that hierarchical Bayesian methods are a good way of estimating large numbers of parameters. I am excited about the article under discussion, and others like it, because the tools therein promise to satisfy conditions (a), (b), (c), (d) above.

Ross Ihaka to R: Drop Dead


Christian Robert posts these thoughts:

I [Ross Ihaka] have been worried for some time that R isn't going to provide the base that we're going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too. Some of these were inherited from S and some are peculiar to R.

One of the worst problems is scoping. Consider the following little gem.

f = function() {
    if (runif(1) > .5)
        x = 10
    x
}

The x being returned by this function is randomly local or global. There are other examples where variables alternate between local and non-local throughout the body of a function. No sensible language would allow this. It's ugly and it makes optimisation really difficult. This isn't the only problem; even weirder things happen because of interactions between scoping and lazy evaluation.

In light of this, I [Ihaka] have come to the conclusion that rather than "fixing" R, it would be much more productive to simply start over and build something better. I think the best you could hope for by fixing the efficiency problems in R would be to boost performance by a small multiple, or perhaps as much as an order of magnitude. This probably isn't enough to justify the effort (Luke Tierney has been working on R compilation for over a decade now). . . .

If we're smart about building the new system, it should be possible to make use of multi-cores and parallelism. Adding this to the mix might just make it possible to get a three order-of-magnitude performance boost with just a fraction of the memory that R uses.

I don't know what to think about this. Some of my own recent thoughts on R are here. Although I am developing some R packages, overall I think of myself as more of a user than a developer. I find those S4 data types to be an annoyance, and I'm not happy with the "bureaucratic" look of so many R functions. If R could be made 100 times faster, that would be cool. When writing the ARM book, I was careful to write code in what I considered a readable way, which in many instances involved looping rather than vectorization and the much-hated apply() function. (A particular difficulty arises when dealing with posterior simulations, where scalars become matrices, matrices become two-way arrays, and so forth.) In my programming, I've found myself using notational conventions to carry structure that should really live in the program itself, and I think this is a common problem in R. (Consider the various objects such as rownames, rows, row.names, and so on.) And anyone who's worked with R for awhile has had the frustration of having to take a dataset and shake it to wring out all the layers of structure that are put there by default. I'll read in some ascii data and then find myself going through different permutations of functions such as as.numeric(), as.vector(), as.character() to convert data from "levels" into numbers or strings.
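For the record, here's the usual incantation (a trivial example, but it trips people up all the time):

f <- factor(c("3.2", "1.5", "3.2"))   # the sort of thing read.table() hands you
as.numeric(f)                         # wrong: returns the underlying level codes, 2 1 2
as.numeric(as.character(f))           # right: returns 3.2 1.5 3.2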

Don't get me wrong. R is great. I love R. And I recognize that many of its problems arise from its generality. I think it's great that Ross Ihaka and others are working to make things even better.

Cyrus writes:

I [Cyrus] was teaching a class on multilevel modeling, and we were playing around with different methods to fit a random effects logit model with 2 random intercepts---one corresponding to "family" and another corresponding to "community" (labeled "mom" and "cluster" in the data, respectively). There are also a few regressors at the individual, family, and community level. We were replicating in part some of the results from the following paper: Improved estimation procedures for multilevel models with binary response: a case-study, by G. Rodriguez and N. Goldman.

(I say "replicating in part" because we didn't include all the regressors that they use, only a subset.) We were looking at the performance of estimation via glmer in R's lme4 package, glmmPQL in R's MASS package, and Stata's xtmelogit. We wanted to study the performance of various estimation methods, including adaptive quadrature methods and penalized quasi-likelihood.

I was shocked to discover that glmer's default setting is adaptive Gaussian quadrature with 1 integration point (that is, a Laplace approximation). In addition, and even more shocking, glmer **does not allow** more than one integration point to be used for models with more than one random effect. With only one random effect, you can increase the number of integration points. But with multiple random effects (like what we had, which was 2), when you try to specify multiple integration points (with the nAGQ setting), you get an error saying that it can't do it. At first, I kinda shrugged, but then when I saw the difference that it made, alarm bells started going off.
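[To make the setup concrete, here is a sketch of the calls being described; the data and variable names are placeholders, not the Rodriguez and Goldman data:]

library(lme4)
# default: Laplace approximation (one quadrature point)
m1 <- glmer(y ~ x1 + x2 + (1 | mom) + (1 | cluster), family = binomial, data = dat)
# more quadrature points: allowed with a single random effect...
m2 <- glmer(y ~ x1 + x2 + (1 | mom), family = binomial, data = dat, nAGQ = 7)
# ...but not with two or more random effects; this call throws an error:
# m3 <- glmer(y ~ x1 + x2 + (1 | mom) + (1 | cluster), family = binomial,
#             data = dat, nAGQ = 7)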

I made a table showing results from estimates using glmer's Laplace approximation, and then also results from the glmmPQL which uses penalized quasi-likelihood, and then results from Stata's xtmelogit using first its own Laplace approximation and then a 7-pt Gaussian adaptive quadrature. Note that in Stata, the 7-pt quadrature fit is the default. This is a high-dimensional "brute force" method for fitting a complex likelihood.

Now, when we look at the results for the random effects standard deviation estimates, we see that the Laplace approximations in R (via glmer) and Stata (model 3 in the Table) are identical. In addition, taking the 7-pt quadrature estimate from xtmelogit to be the best guess of the bunch, the Laplace approximation-based estimates are HORRIBLY biased downward. That is, the estimated random effect variance is way too small. The PQL estimates are also biased downward, but not so badly. This is in line with what Rodriguez and Goldman found, using a 20-point quadrature and an MCMC fit as benchmarks (they produced nearly identical results).

The upshot is that glmer's default is REALLY BAD; and this default is the only option when you have more than one random effect. My suggestion then is to either use Stata, which has a very sensible 7-point quadrature default (although as a brute force method, it is slow) or fit the model with MCMC.

If I understand, you are using glmer? If so, have you (1) looked into this or (2) checked what glmer is giving you against xtmelogit or BUGS? I would trust the xtmelogit or BUGS answers more.

My reply:

1. I am working with Sophia Rabe-Hesketh, who's one of the people who programmed gllamm in Stata. Sophia has been telling me for awhile that the method used by gllamm is better than the Laplace method that is currently used in glmer.

2. Just some terminology: I'd prefer to talk about different "approximations" rather than different "estimation methods." To me, there's just one estimation method here--hierarchical Bayes (or, as Sophia would say, marginal likelihood inference)--and these various approaches to point estimates of variance parameters are approximations to get around the difficulties of doing the integrals and computing the marginal likelihood.

3. I also prefer to use the term "varying intercepts" (and slopes) rather than "random effects," because the term "random effects" is not so clearly defined in the literature (as Jennifer and I discuss in our book). This doesn't change the substance of your comment but it makes it easier for me to follow.

4. In my project with Sophia (and with Jingchen Liu, also cc-ed here), we're working on several parallel tracks:
(a) Adding a prior distribution on the variance parameters to stabilize the point estimates and keep them away from the boundary of parameter space (which, in the context of scalar variance parameters, implies that we keep the estimates away from zero).
(b) Adding a few steps of the Metropolis algorithm to capture some of the uncertainty in the variance parameters and also fix some of that bias you're talking about. It's my intuition that a small number (e.g., 10) of Metropolis steps will clean up a lot of problems. Right now, though, that's just an intuition; we haven't tried it yet.
(c) Working in R (using glmer) and also separately in Stata (using gllamm).

5. Last but not least, it sounds like you've encountered an important principle of research: The way to really learn a subject is to teach it! That's why college professors (at least at a place like Columbia) know so much: we're always teaching new things.

The future of R


Some thoughts from Christian, including this bit:

We need to consider separately

1. R's brilliant library

2. R's not-so-brilliant language and/or interpreter.

I don't know that R's library is so brilliant as all that--if necessary, I don't think it would be hard to reprogram the important packages in a new language.

I would say, though, that the problems with R are not just in the technical details of the language. I think the culture of R has some problems too. As I've written before, R functions used to be lean and mean, and now they're full of exception-handling and calls to other packages. R functions are spaghetti-like messes of connections in which I keep expecting to run into syntax like "GOTO 120."

I learned about these problems a couple years ago when writing bayesglm(), which is a simple adaptation of glm(). But glm(), and its workhorse, glm.fit(), are a mess: They're about 10 lines of functioning code, plus about 20 lines of necessary front-end, plus a couple hundred lines of naming, exception-handling, repetitions of chunks of code, pseudo-structured-programming-through-naming-of-variables, and general buck-passing. I still don't know if my modifications are quite right--I did what was needed to the meat of the function but no way can I keep track of all the if-else possibilities.

If R is redone, I hope its functions return to the lean-and-mean aesthetic of the original S (but with better graphics defaults).

Frank Wood and Nick Bartlett write:

Deplump works the same as all probabilistic lossless compressors. A data stream is fed one observation at a time into a predictor, which, for every observation, emits both the observation itself and a prediction of what the next observation in the stream should be. An encoder takes this output and produces a compressed stream which can be piped over a network or to a file. A receiver then takes this stream and decompresses it by doing everything in reverse. In order to ensure that the decoder has the same information available to it that the encoder had when compressing the stream, the decoded data stream is both emitted and directed to another predictor. This second predictor's job is to produce exactly the same predictions as the initial predictor so that the decoder has the same information at every step of the process as the encoder did.

The difference between probabilistic lossless compressors is in the prediction engine, encoding and decoding being "solved" optimally already.

Deplump compression technology is built on a probabilistic discrete sequence predictor called the sequence memoizer. The sequence memoizer has been demonstrated to be a very good predictor for discrete sequences. The advantage deplump demonstrates in comparison to other general purpose lossless compressors is largely attributable to the better guesses made by the sequence memoizer.

Deplump is algorithmically quite closely related to infinite context prediction by partial matching (PPM) although it has differences and advantages which derive from being grounded on a coherent probabilistic predictive model.

I don't know anything about this, but it looks interesting to me. They also have a cool website where you can compress your own files. According to Frank, it gives you a compressed file that's quite a bit smaller than what you get from gzip.

Deducer update

| 1 Comment

A year ago we blogged about Ian Fellows's R GUI called Deducer (oops, my bad, I meant to link to this).

Fellows sends in this update:

Probability-processing hardware


Lyric Semiconductor posted:

For over 60 years, computers have been based on digital computing principles. Data is represented as bits (0s and 1s). Boolean logic gates perform operations on these bits. A processor steps through many of these operations serially in order to perform a function. However, today's most interesting problems are not at all suited to this approach.

Here at Lyric Semiconductor, we are redesigning information processing circuits from the ground up to natively process probabilities: from the gate circuits to the processor architecture to the programming language. As a result, many applications that today require a thousand conventional processors will soon run in just one Lyric processor, providing 1,000x efficiencies in cost, power, and size.

Om Malik has some more information, also relating to the team and the business.

The fundamental idea is that computing architectures work deterministically, even though the world is fundamentally stochastic. In a lot of statistical processing, especially in Bayesian statistics, we take the stochastic world, force it into determinism, simulate the stochastic world by computationally generating deterministic pseudo-random numbers, and simulate stochastic matching by deterministic likelihood computations. What Lyric could do is bypass this highly inefficient intermediate deterministic step. This way, we'll be able to fit bigger and better models much faster.

They're also working on a programming language: PSBL (Probability Synthesis for Bayesian Logic), but there are no details. Here is their patent for Analog Logic Automata, indicating applications for images (filtering, recognition, etc.).

[D+1: J.J. Hayes points to a US Patent indicating that one of the circuits optimizes the sum-product belief propagation algorithm. This type of algorithm is popular in machine learning for various recognition and denoising problems. One way to explain it to a statistician is that it does imputation with an underlying loglinear model.]

Arthur Breitman writes:

I had to forward this to you when I read about it...

My reply: Interesting; thanks. Things like this make me feel so computer-incompetent! The younger generation is passing me by...

Peter Goff wrote:

That half-Cauchy prior

| No Comments

Xiaoyu Qian writes:

MCMC in Python


John Salvatier forwards a note from Anand Patil that a paper on PyMC has appeared in the Journal of Statistical Software. We'll have to check this out.

Unhappy with improvement by a factor of 10^29


I have an optimization problem: I have a complicated physical model that predicts energy and thermal behavior of a building, given the values of a slew of parameters, such as insulation effectiveness, window transmissivity, etc. I'm trying to find the parameter set that best fits several weeks of thermal and energy use data from the real building that we modeled. (Of course I would rather explore parameter space and come up with probability distributions for the parameters, and maybe that will come later, but for now I'm just optimizing). To do the optimization, colleagues and I implemented a "particle swarm optimization" algorithm on a massively parallel machine. This involves giving each of about 120 "particles" an initial position in parameter space, then letting them move around, trying to move to better positions according to a specific algorithm. We gave each particle an initial position sampled from our prior distribution for each parameter. So far we've run about 140 iterations, and I just took a look at where the particles are now. They are indeed converging -- that is, they're coming to some agreement on what the best region of parameter space is. But the standard deviation for each parameter is still about 0.4 times what it was at the start. (For instance, we put in a very wide prior distribution for the thermal mass of the furnishings in the building's offices, and after running the optimization the distribution is about 0.4 times as wide as it was at the start).

I was, and still am, a bit disappointed by this, but: we have 74 parameters. Our particles were spread through a huge volume of parameter space, and now they're spread through a space that is about 0.4 times as big for each parameter. That means they've agreed on a volume of parameter space that is about 0.4^74 times smaller than it was before, or about a factor of 10^29 smaller. Maybe it's not so bad.
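A quick check of that arithmetic in R:

0.4^74            # about 3.5e-30
74 * log10(0.4)   # about -29.4, i.e., a shrinkage factor of roughly 10^29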

Daniel Corsi writes:

I was wondering if you could help me with some code to set up a posterior predictive check for an unordered multinomial multilevel model. In this case the outcome is categories of BMI (underweight, normal weight, and overweight) based on individuals from 360 different areas. What I would like to do is set up a replicated dataset to see how the number of overweight/underweight/normal weight individuals based on the model compares to the actual data, along with some kind of graphical summary. I am following along with chapter 24 of the arm book but I want to verify that the replicated data accounts for the multilevel structure of the data of people within areas. I am attaching the code I used to run a simple model with only 2 predictors (area wealth and urban/rural designation).

My reply: The Bugs code is a bit much for me to look at--but I do recommend that you run it from R, which will give you more flexibility in preprocessing and postprocessing the data. Beyond this, there are different ways of doing replications. You can replicate new people in existing areas or new people in new areas. It depends on your application. In a cluster sample, you'll probably want new areas. In an exhaustive sample, there might not be any new areas and you'll want to keep your existing group-level parameters.
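Here's a minimal sketch of the two replication schemes for a three-category outcome with varying intercepts by area. The parameter values below are placeholders standing in for posterior draws from the fitted model, and I've dropped the covariates; it's meant only to show the structure, not to reproduce the Bugs model:

J <- 360; n.per <- 50; n <- J * n.per
area <- rep(1:J, each = n.per)
sigma.area <- c(0.5, 0.7)                      # sd's of area effects for categories 2 and 3
alpha <- cbind(rnorm(J, 0, sigma.area[1]),     # area effects (category 1 is the reference)
               rnorm(J, 0, sigma.area[2]))
beta0 <- c(-1, -0.5)                           # intercepts for categories 2 and 3
eta <- cbind(0, beta0[1] + alpha[area, 1], beta0[2] + alpha[area, 2])
p <- exp(eta) / rowSums(exp(eta))
y.rep.existing <- apply(p, 1, function(pr) sample(1:3, 1, prob = pr))
# "new areas": redraw the area effects from their population distribution
alpha.new <- cbind(rnorm(J, 0, sigma.area[1]), rnorm(J, 0, sigma.area[2]))
eta.new <- cbind(0, beta0[1] + alpha.new[area, 1], beta0[2] + alpha.new[area, 2])
p.new <- exp(eta.new) / rowSums(exp(eta.new))
y.rep.new <- apply(p.new, 1, function(pr) sample(1:3, 1, prob = pr))
table(y.rep.existing); table(y.rep.new)        # compare to the observed category counts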

The recent discussion of pollsters reminded me of a story from a couple years ago that perhaps is still relevant . . .

I was looking up the governors' popularity numbers on the web, and came across this page from Rasmussen Reports which shows Sarah Palin as the 3rd-most-popular governor. But then I looked more carefully. Janet Napolitano of Arizona was viewed as Excellent by 28% of respondents, Good by 27%, Fair by 26%, and Poor by 27%. That adds up to 108%! What's going on?

I'd think they would have a computer program to pipe the survey results directly into the spreadsheet. But I guess not; someone must be typing in these numbers one at a time. Another possibility is that they are altering their numbers by hand, and someone made a mistake with the Napolitano numbers, adding a few percent in one place and forgetting to subtract elsewhere. Or maybe there's another explanation?

MCMC machine


David Shor writes:

My lab recently got some money to get a high-end machine. We're mostly going to do MCMC stuff, is there anything specialized that I should keep in mind, or would any computing platform do the job?

I dunno, any thoughts out there? I've heard that "the cloud" is becoming more popular.

Derek Sonderegger writes:

I have just finished my Ph.D. in statistics and am currently working in applied statistics (plant ecology) using Bayesian statistics. As the statistician in the group I only ever get the 'hard analysis' problems that don't readily fit into standard models. As I delve into the computational aspects of Bayesian analysis, I find myself increasingly frustrated with the current set of tools. I was delighted to see JAGS 2.0 just came out and spent yesterday happily playing with it.

My question is, where do you see the short-term future of Bayesian computing going and what can we do to steer it in a particular direction?

Both R and Stata


A student I'm working with writes:

I was planning on getting an applied stat text as a desk reference, and for that I'm assuming you'd recommend your own book. Also, being an economics student, I was initially planning on doing my analysis in STATA, but I noticed on your blog that you use R, and apparently so does the rest of the statistics profession. Would you rather I do my programming in R this summer, or does it not matter? It doesn't look too hard to learn, so just let me know what's most convenient for you.

My reply: Yes, I recommend my book with Jennifer Hill. Also the book by John Fox, An R and S-plus Companion to Applied Regression, is a good way to get into R. I recommend you use both Stata and R. If you're already familiar with Stata, then stick with it--it's a great system for working with big datasets. You can grab your data in Stata, do some basic manipulations, then save a smaller dataset to read into R (using R's read.dta() function). Once you want to make fun graphs, R is the way to go. It's good to have both systems at your disposal.
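To be concrete, the read.dta() step looks like this (the file name here is made up):

library(foreign)
heights <- read.dta("heights_small.dta")   # the smaller dataset saved out of Stata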

How best to learn R?


Alban Zeber writes:

I am wondering whether there is a reference (online or book) that you would recommend to someone who is interested in learning how to program in R.

Any thoughts?

P.S. If I had a name like that, my books would be named, "Bayesian Statistics from A to Z," "Teaching Statistics from A to Z," "Regression and Multilevel Modeling from A to Z," and so forth.

Postdoc #1. Hierarchical Modeling and Computation: We are fitting hierarchical regression models with deep interactions. We're working on new models with structured prior distributions, and this also requires advances in Bayesian computation. Applications include public opinion, climate reconstruction, and education research.

Postdoc #1 is funded by grants from the Department of Energy, Institute of Education Sciences, and National Science Foundation.

Postdoc #2. Hierarchical Modeling and Statistical Graphics: The goal of this research program is to investigate the application of the latest methods of multi-level data analysis, visualization and regression modeling to an important commercial problem: forecasting retail sales at the individual item level. These forecasts are used to make ordering, pricing and promotions decisions which can have significant economic impact to the retail chain such that even modest improvements in the accuracy of predictions, across a large retailer's product line, can yield substantial margin improvements.

Project #2 is to be undertaken with, and largely funded by, a firm which provides forecasting technology and services to large retail chains, and which will provide access to a unique and rich set of proprietary data. The postdoc will be expected to spend some time working directly with this firm, but it is fundamentally a research position.

Ideally, postdoc #1 will have a statistics or computer science background, will be interested in statistical modeling, serious programming, and applications. Ideally, postdoc #2 will have a background in statistics, psychometrics, or economics and be interested in marketing or related topics. Both postdocs should be able to work fluently in R and should already know about hierarchical models and Bayesian inference and computation.

Both projects will be at the Applied Statistics Center at Columbia University, also with connections outside Columbia. Collaborators on Postdoc #1 include Jingchen Liu, Jennifer Hill, Matt Schofield, Upmanu Lall, Chad Scherrer, Alan Edelman, and Sophia Rabe-Hesketh. Collaborators on Postdoc #2 include Eric Johnson.

If you're interested in either, please send a letter of application, a c.v., some of your articles, and three letters of recommendation to the Applied Statistics Center coordinator, Caroline Peters,

Rodney Sparapani writes:

My Windows buddies have been bugging me about BRUGS and how great it is. Now, running BRUGS on OS X may be possible. Check out this new amazing software by Amit Singh.

Personally, I'd go with R2jags, but I thought I'd pass this on in case others are interested.

Douglas Anderton informed us that, in a Linux system, you can't call OpenBugs from R using bugs() from the R2WinBUGS package. Instead, you should call Jags using jags() from the R2jags package.
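A minimal call looks something like this (the data, parameter, and model-file names are hypothetical):

library(R2jags)
fit <- jags(data = data.list, inits = NULL,
            parameters.to.save = c("beta", "sigma"),
            model.file = "model.bug", n.chains = 3, n.iter = 2000)
print(fit)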

P.S. Not the Rotter's Club guy.

Marty McKee at Wolfram Research appears to have a very very stupid colleague. McKee wrote to Christian Robert:

Your article, "Evidence and Evolution: A review", caught the attention of one of my colleagues, who thought that it could be developed into an interesting Demonstration to add to the Wolfram Demonstrations Project.

As Christian points out, adapting his book review into a computer demonstration would be quite a feat! I wonder what McKee's colleague could be thinking? I recommend that Wolfram fire McKee's colleague immediately: what an idiot!

P.S. I'm not actually sure that McKee was the author of this email; I'm guessing this was the case because this other very similar email was written under his name.

P.P.S. To head off the inevitable comments: Yes, yes, I know this is no big deal and I shouldn't get bent out of shape about it. But . . . Wolfram Research has contributed such great things to the world, that I hate to think of them wasting any money paying the salary of Marty McKee's colleague, somebody who's so stupid that he or she thinks that a book review can be developed into an interactive computer demonstration. I'd like to feel that I'm doing my part by alerting them so they can fire this incompetent colleague person.

P.P.P.S. Hey, this new Zombies category is really coming in handy!

Mark Girolami sends along this article, "Riemann Manifold Langevin and Hamiltonian Monte Carlo," by Ben Calderhead, Siu Chin, and himself:

This paper proposes Metropolis adjusted Langevin and Hamiltonian Monte Carlo sampling methods defined on the Riemann manifold to resolve the shortcomings of existing Monte Carlo algorithms when sampling from target densities that may be high dimensional and exhibit strong correlations. The methods provide fully automated adaptation mechanisms that circumvent the costly pilot runs required to tune proposal densities for Metropolis-Hastings or indeed Hamiltonian Monte Carlo and Metropolis Adjusted Langevin Algorithms. This allows for highly efficient sampling even in very high dimensions where different scalings may be required for the transient and stationary phases of the Markov chain. The proposed methodology exploits the Riemannian geometry of the parameter space of statistical models and thus automatically adapts to the local structure when simulating paths across this manifold providing highly efficient convergence and exploration of the target density. The performance of these Riemannian Manifold Monte Carlo methods is rigorously assessed by performing inference on logistic regression models, log-Gaussian Cox point processes, stochastic volatility models, and Bayesian estimation of dynamical systems described by nonlinear differential equations. Substantial improvements in the time normalised Effective Sample Size are reported when compared to alternative sampling approaches.

Cool! And they have Matlab code so you can go try it out yourself. If anybody out there knows more about this (I'm looking at you, Radford and Christian), please let us know. I care a lot about this right now because we're starting a big project on Bayesian computation for hierarchical regression models with deep interactions.

P.S. The tables are ugly but I forgive the authors because their graphs are so pretty; for example:


Some new articles


A few papers of mine have been recently accepted for publication. I plan to blog each of these individually but in the meantime here are some links:

Review of The Search for Certainty, by Krzysztof Burdzy. Bayesian Analysis. (Andrew Gelman)

Inference from simulations and monitoring convergence. In Handbook of Markov Chain Monte Carlo, ed. S. Brooks, A. Gelman, G. Jones, and X. L. Meng. CRC Press. (Andrew Gelman and Kenneth Shirley)

Public opinion on health care reform. The Forum. (Andrew Gelman, Daniel Lee, and Yair Ghitza)

A snapshot of the 2008 election. Statistics, Politics and Policy. (Andrew Gelman, Daniel Lee, and Yair Ghitza)

Bayesian combination of state polls and election forecasts. Political Analysis. (Kari Lock and Andrew Gelman)

Causality and statistical learning. American Journal of Sociology. (Andrew Gelman)

Can fractals be used to predict human history? Review of Bursts, by Albert-Laszlo Barabasi. Physics Today. (Andrew Gelman)

Segregation in social networks based on acquaintanceship and trust. American Journal of Sociology. (Thomas A. DiPrete, Andrew Gelman, Tyler McCormick, Julien Teitler, and Tian Zheng)

Here's the full list.

A centipede many times over


I recently came across this presentation by Charlie Geyer on diagnosing problems with Markov chain simulations. Charlie gives some good advice, and some advice that's not so relevant to my own experience, and misses some other ideas that I and others have come across.

To repeat a frequent theme of mine (see, for example, meta-principle #3 in this discussion), I suspect that many of the differences between Charlie's approaches and those more familiar to me stem from the different sorts of problems we work on. Charlie works on big, tangled, discrete models in genetics--the sorts of problems with multiple modes and where, not only can computations be slow to converge, they can easily get stuck in completely unreasonable places. To solve these problems, you have to write complicated computer programs and solve problems specific to your model and your application. In contrast, I work on smaller, continuous problems in social science where, if you find yourself really far from where you want to be in parameter space, you'll probably realize it. I develop relatively simple but general algorithms that are intended to work for a wide range of examples and applications.

Neither my approach nor Charlie's is "better"; we just do different things.

Here are three examples.

First, on page 3 of his slides, Charlie writes:

Ted Dunning points me to this article by J. Andres Christen and Colin Fox:

We [Christen and Fox] develop a new general purpose MCMC sampler for arbitrary continuous distributions that requires no tuning. . . .The t-walk maintains two independent points in the sample space, and all moves are based on proposals that are then accepted with a standard Metropolis-Hastings acceptance probability on the product space. . . . In a series of test problems across dimensions we find that the t-walk is only a small factor less efficient than optimally tuned algorithms, but significantly outperforms general random-walk M-H samplers that are not tuned for specific problems. . . . Several examples are presented showing good mixing and convergence characteristics, varying in dimensions from 1 to 200 and with radically different scale and correlation structure, using exactly the same sampler. The t-walk is available for R, Python, MatLab and C++ at

This looks pretty cool to me! I asked Christian Robert what he thought, and he referred me to this blog entry where he presents a bit of a skeptical, wait-and-see attitude:

The proposal of the authors is certainly interesting and widely applicable but to cover arbitrary distributions in arbitrary dimensions with no tuning and great performances sounds too much like marketing . . . there is no particular result in the paper showing an improvement in convergence time over more traditional samplers. . . . Since the authors developed a complete set of computer packages, including one in R, I figure people will start to test the method to check for possible improvement over the existing solutions. If the t-walk is indeed superior sui generis, we should hear more about it in the near future...

I have a lot of big models to fit, so I expect we'll be trying out many of these different methods during the next year or so.

P.S. Update here.

Aleks (I think) sent me this link to a paper by Jianing Shi, Wotao Yin, Stanley Osher, and Paul Sajda. They report impressive computational improvements based on "a novel hybrid algorithm based on combining two types of optimization iterations: one being very fast and memory friendly while the other being slower but more accurate." This looks awesome. And they even test their method on the UCI database, which is what Aleks used in the cross-validation studies for bayesglm.

Here are my questions:

1. Could their algorithm work with other regularizations? L1 corresponds to a folded-exponential (Laplace) family of prior distributions; I prefer the t model. We implemented the t in bayesglm using an adaptation of the standard weighted least squares algorithm. That's fine for our little examples, but for big problems, where speed is a concern, maybe this new method is much better. I imagine it could be adapted to our family of prior distributions, but I don't know enough about the method to be sure.

2. What about hierarchical models? That's the next thing to do. lmer/glmer are running pretty slow for us.

P.S. The funny thing is that Shi and Sajda are at Columbia (at the biomedical engineering department). But, to the best of my knowledge, we've never met. I don't even know where their office is located. According to their addresses in the article, they have a 10025 zipcode. But the engineering departments I've been to are all in 10027.

Tom Clark writes:

Aleks points me to this new article by the Stanford triplets:

We [Friedman, Hastie, and Tibshirani] develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include L1 (the lasso), L2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.

I wonder if these methods could be applied to our problems, things like this. Here we're fitting a hierarchical model--yes, we have normal priors but I don't think it's quite right to call it "ridge regression." The algorithm of Friedman, Hastie, and Tibshirani might still be adapted to these more complex models. (I'm assuming that their algorithm is already way faster than lmer or bayesglm.)
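If I have the right paper, this is the work behind the glmnet package; a minimal call, on simulated data, looks like:

library(glmnet)
x <- matrix(rnorm(1000 * 20), 1000, 20)
y <- rbinom(1000, 1, plogis(x %*% rnorm(20)))
fit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)   # alpha mixes lasso and ridge penalties
coef(fit, s = "lambda.min")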

P.S. Oooh, they have ugly tables. I'll know that Stanford Statistics has finally taken over the world when they fully switch from tables to graphs.

P.P.S. I'm glad to see these top guys publishing in the Journal of Statistical Software. We've been publishing there lately--sometimes it seems like just the right place--but we worried that nobody was reading it. It's good to see this journal as a place for new research.

I just graded the final exams for my first-semester graduate statistics course that I taught in the economics department at Sciences Po.

I posted the exam itself here last week; you might want to take a look at it and try some of it yourself before coming back here for the solutions.

And see here for my thoughts about this particular exam, this course, and final exams in general.

Now on to the exam solutions, which I will intersperse with the exam questions themselves:

What can search predict?


You've all heard about how you can predict all sorts of things, from movie grosses to flu trends, using search results. I earlier blogged about the research of Yahoo's Sharad Goel, Jake Hofman, Sebastien Lahaie, David Pennock, and Duncan Watts in this area. Since then, they've written a research article.

Here's a picture:


And here's their story:

We [Goel et al.] investigate the degree to which search behavior predicts the commercial success of cultural products, namely movies, video games, and songs. In contrast with previous work that has focused on realtime reporting of current trends, we emphasize that here our objective is to predict future activity, typically days to weeks in advance. Specifically, we use query volume to forecast opening weekend box-office revenue for feature films, first month sales of video games, and the rank of songs on the Billboard Hot 100 chart. In all cases that we consider, we find that search counts are indicative of future outcomes, but when compared with baseline models trained on publicly available data, the performance boost associated with search counts is generally modest--a pattern that, as we show, also applies to previous work on tracking flu trends.

The punchline:

We [Goel et al.] conclude that in the absence of other data sources, or where small improvements in predictive performance are material, search queries may provide a useful guide to the near future.

I like how they put this. My first reaction upon seeing the paper (having flipped through the graphs and not read the abstract in detail) was that it was somewhat of a debunking exercise: Search volume has been hyped as the greatest thing since sliced bread, but really it's no big whoop, it adds almost no information beyond a simple forecast. But then my thought was that, no, this is a big whoop, because, in an automatic computing environment, it could be a lot easier to gather/analyze search volume than to build those baseline models.

Sharad's paper is cool. My only suggestion is that, in addition to fitting the separate models and comparing, they do the comparison on a case-by-case basis. That is, what percentage of the individual cases are predicted better by model 1, model 2, or model 3, and what is the distribution of the difference in performance. I think they're losing something by only doing the comparisons in aggregate.
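Here's the sort of comparison I have in mind, as a sketch; err1 and err2 stand in for the absolute prediction errors of the search-based and baseline models on the same cases:

err1 <- abs(rnorm(100)); err2 <- abs(rnorm(100, 0, 1.2))   # stand-in numbers
mean(err1 < err2)     # share of cases where model 1 predicts better
hist(err1 - err2)     # distribution of the difference in performance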

It also might be good if they could set up some sort of dynamic tracker that could perform the analysis in this paper automatically, for thousands of outcomes. Then in a year or so they'd have tons and tons of data. That would take this from an interesting project to something really cool.

I remember many years ago being told that political ideologies fall not along a line but on a circle: if you go far enough to the extremes, left-wing communists and right-wing fascists end up looking pretty similar.

I was reminded of this idea when reading Christian Robert and George Casella's fun new book, "Introducing Monte Carlo Methods with R."

I do most of my work in statistical methodology and applied statistics, but sometimes I back up my methodology with theory or I have to develop computational tools for my applications. I tend to think of this sort of ordering:

Probability theory - Theoretical statistics - Statistical methodology - Applications - Computation

Seeing this book, in which two mathematical theorists write all about computation, makes me want to loop this line in a circle. I knew this already--my own single true published theorem is about computation, after all--but I tend to forget. In some way, I think that computation--more generally, numerical analysis--has taken some of the place in academic statistics that was formerly occupied by theorem-proving. I think it's great that many of our more mathematical-minded probabilists and statisticians can follow their theoretical physicist colleagues and work on computational methods. I suspect that applied researchers such as myself will get much more use out of theory as applied to computation, as compared to traditionally more prestigious work on asymptotic inference, uniform convergence, mapping the rejection regions of hypothesis tests, M-estimation, three-armed bandits, and the like.

Don't get me wrong--I'm not saying that computation is the only useful domain for statistical theory, or anything close to that. There are lots of new models to be built and lots of limits to be understood. Just, for example, consider the challenges of using sample data to estimate properties of a network. Lots of good stuff to do all around.

Anyway, back to the book by Robert and Casella. It's a fun book, partly because they resist the impulse to explain everything or to try to be comprehensive. As a result, reading the book requires the continual solution of little puzzles (as befits a book that introduces its chapters with quotations from detective novels). I'm not sure if this was intended, but it makes it a much more participatory experience, and I think for that reason it would also be an excellent book for a course on statistical computing.

Andrew Gelman, Jingchen Liu, and Sophia Rabe-Hesketh are looking for two fulltime postdocs, one to be based at the Department of Statistics, Columbia University, and the other at the Graduate School of Education, University of California, Berkeley.

The positions are funded by a grant entitled "Practical Tools for Multilevel/Hierarchical Modeling in Education" (Institute of Education Sciences, Department of Education). The project addresses modeling and computational issues that stand in the way of fitting larger, more realistic models of variation in social science data, in particular the problem that (restricted) maximum likelihood estimation often yields estimates on the boundary, such as zero variances or perfect correlations. The proposed solution is to specify weakly informative prior distributions - that is, prior distributions that will affect inferences only when the data provide little information about the parameters. Existing software for maximum likelihood estimation can then be modified to perform Bayes posterior modal estimation, and these ideas can also be used in fully Bayesian computation. In either case, the goal is to be able to fit and understand more complex, nuanced models. The postdocs will contribute to all aspects of this research and will implement the methods in C and R (Columbia postdoc) and Stata (Berkeley postdoc).

Both locations are exciting places to work with research groups comprising several faculty, postdocs, and students working on a wide range of interesting applied problems. The Columbia and Berkeley groups will communicate with each other on a regular basis, and there will be annual workshops with outside panels of experts.

Applicants should have a Ph.D. in Statistics or a related area and have strong programming skills, preferably with some experience in C/R/Stata. Experience with Bayesian methods and good knowledge of hierarchical Bayesian and/or "classical" multilevel modeling would be a great advantage.

The expected duration of the positions is 2 years with an approximate start date of September 1, 2010.

Please submit a statement of interest, not exceeding 5 pages, and your CV to either Andrew Gelman and Jingchen Liu ( or Sophia Rabe-Hesketh (Graduate School of Education, 3659 Tolman Hall, Berkeley, CA 94720-1670,, stating whether you would also be interested in the other position. Applications will be considered as they arrive but should be submitted no later than April 1st.

This is in addition to, but related to, our other postdoctoral positions. A single application will work for all of them.

MCMC model selection question


Robert Feyerharm writes in with a question, then I give my response. Youall can play along at home by reading the question first and then guessing what I'll say. . . .

I have a question regarding model selection via MCMC I'd like to run by you if you don't mind.

One of the problems I face in my work involves finding best-fitting logistic regression models for public health data sets typically containing 10-20 variables (= 2^10 to 2^20 possible models). I've discovered several techniques for selecting variables and estimating beta parameters in the literature, for example the reversible jump MCMC.

However, RJMCMC works by selecting a subset of candidate variables at each point. I'm curious, as an alternative to trans-dimensional jumping would it be feasible to use MCMC to simultaneously select variables and beta values from among all of the variables in the parameter space (not just a subset) using the regression model's AIC to determine whether to accept or reject the betas at each candidate point?

Using this approach, a variable would be dropped from the model if its beta parameter value settles sufficiently close to zero after N iterations (say, -.05 < βk < .05). There are a few issues with this approach: Since the AIC isn't a probability density, the Metropolis-Hastings algorithm could not be used here as far as I know. Also, AIC isn't a continuous function (it "jumps" to a lower/higher value when the number of model variables decreases/increases), and therefore a smoothing function is required in the vicinity of βk=0 to ensure that the MCMC algorithm properly converges. I've run a few simulations and this "backwards elimination" MCMC seems to work, although it converges to a solution very slowly.

Anyways, if you have time I would greatly appreciate any input you may have. Am I rehashing an idea that has already been considered and rejected by MCMC experts?

How to make colored circles in R?


In R, I can plot circles at specified sizes using the symbols() function, but for some reason it won't allow me to do it in color. For example, try this:

symbols (0, 0, circles=1, col="red")

It makes a black circle, just as if the "col" argument had never been specified. What's wrong?

P.S. I could just write my own function to draw circles, but that would be cheating. . . .
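P.P.S. One thing to try, judging from the symbols() help page: the circle's border and fill colors are set by the fg and bg arguments rather than by col. For example:

symbols (0, 0, circles=1, fg="red", bg="pink", inches=FALSE)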

While putting together a chapter on inference from simulations and monitoring convergence (for a forthcoming Handbook of Markov Chain Monte Carlo; more on that another day), I came across this cool article from 2003 by Jarkko Venna, Samuel Kaski, and Jaakko Peltonen, who show how tools from multivariate discriminant analysis can be used to make displays of MCMC convergence that are much more informative than what we're used to. There's also an updated article from 2009 by Venna with Jaakko Peltonen and Samuel Kaski.

After a brief introduction, Venna et al. set up the problem:

It is common practice to complement the convergence measures by visualizations of the MCMC chains. Visualizations are especially useful when analyzing the reasons for convergence problems. Convergence measures can only tell us that the simulations did not converge, not why. MCMC chains have traditionally been visualized in three ways. Each variable in the chain can be plotted as a separate time series, or alternatively the marginal distributions can be visualized as histograms. The third option is a scatter or contour plot of two parameters at a time, possibly showing the trajectory of the chain on the projection. The obvious problem with these visualizations is that they do not scale up to large models with lots of parameters. The number of displays would be large, and it would be hard to grasp the underlying high-dimensional relationships of the chains based on the component-wise displays.

Some new methods have been suggested. For three dimensional distributions advanced computer graphics methods can be used to visualize the shape of the distribution. Alternatively, if the outputs of the models can be visualized in an intuitive way, the chain can be visualized by animating the outputs of models corresponding to successive MCMC samples. These visualizations are, however, applicable only to special models.

This seems like an accurate summary to me. If visualizations for MCMC have changed much since 2003, the news certainly hasn't reached me. I'd only add a slight modification to point out that with high resolution and small multiples, we can plot dozens of trace plots on the screen at once, rather than the three or four that have become standard (because that's what Bugs does).
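For example, something like this (with simulated draws standing in for real MCMC output):

sims <- array(rnorm(1000 * 3 * 20), c(1000, 3, 20))   # iterations x chains x parameters
par(mfrow = c(4, 5), mar = c(2, 2, 1, 1))
for (j in 1:20) {
  plot(sims[, 1, j], type = "l", col = "gray40", xlab = "", ylab = "",
       main = paste("theta", j))
  for (m in 2:3) lines(sims[, m, j], col = m)
}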

In any case, it's a problem trying to see everything at once in a high-dimensional model. Venna et al. propose to use discriminant analysis on the multiple chains to identify directions in which there is poor mixing, and then display the simulations on this transformed scale.
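Here's a rough sketch of the idea using lda() from the MASS package, treating the chain label as the class to discriminate; the fake draws below would of course be replaced by the actual simulations:

library(MASS)
n.iter <- 500; n.chain <- 10; d <- 20
draws <- data.frame(chain = factor(rep(1:n.chain, each = n.iter)),
                    matrix(rnorm(n.iter * n.chain * d), n.iter * n.chain, d))
fit <- lda(chain ~ ., data = draws)
proj <- predict(fit)$x                  # discriminant coordinates
plot(proj[, 1], proj[, 2], col = as.numeric(draws$chain), pch = ".",
     xlab = "LD1", ylab = "LD2")        # chains that separate here are mixing poorly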

Here's an example, a two-dimensional linear discriminant analysis projection of 10 chains simulated from a hierarchical mixture model:


And here's another plot, this time showing the behavior of the chains near convergence, using discriminative component analysis:


The next step, once these patterns are identified, would be to go back to the original parameters in the models and try to understand what's happening inside the chains.

Venna et al. have come up with what seems like a great idea, and it looks like it could be implemented automatically in Bugs etc. The method is simple and natural enough that probably other people have done it too, but I've never seen it before.

P.S. I wish these people had sent me a copy of their paper years ago so I didn't have to wait so long to discover it.

Learn to program!


I often run into people who'd like to learn how to program, but don't know where to start. Over the past few years, there has been an emergence of interactive tutorial systems, where a student is walked through the basic examples and syntax.

  • Try Ruby! will teach you Ruby, a Python-like language that's extremely powerful when it comes to preprocessing data.
  • WeScheme will teach you Scheme, a Lisp-like language that makes writing interpreters for variations of Scheme very easy.
  • Lists by Andrew Plotkin is a computer game that requires you to be able to program in Lisp. Lisp is the second-oldest programming language (after Fortran), but Ruby and Python do most of what Lisp has traditionally been useful for.

Maybe there will be a similar tool for R someday!

Thanks to Edward for the pointers!

I recently came across some links showing readers how to make their own data analysis and graphics from scratch. This is great stuff--spreading power tools to the masses and all that.

From Nathan Yau: How to Make a US County Thematic Map Using Free Tools and How to Make an Interactive Area Graph with Flare. I don't actually think the interactive area graphs are so great--they work with the Baby Name Wizard but to me they don't do much in the example that Nathan shows--but, that doesn't really matter, what's cool here is that he's showing us all exactly how to do it. This stuff is gonna put us statistical graphics experts out of business^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^Ha great service.

And Chris Masse points me to these instructions from blogger Iowahawk on downloading and analyzing a historical climate dataset. Good stuff.

Hey, I don't think I ever posted a link to this. It's a discussion in the journal Statistics in Medicine of an article by David Lunn, David Spiegelhalter, Andrew Thomas, and Nicky Best. (Sorry but I can't find the Lunn et al. article online, or I'd link to it.) Anyway, here's my discussion. Once upon a time . . .

I first saw BUGS in a demonstration version at a conference in 1991, but I didn't take it seriously until over a decade later, when I found that some of my Ph.D. students in political science were using Bugs to fit their models. It turned out that Bugs's modeling language was ideal for students who wanted to fit complex models but didn't have a full grasp of the mathematics of likelihood functions, let alone Bayesian inference and integration. I also learned that the modular structure of BUGS was a great way for students, and researchers in general, to think more about modeling and less about deciding which conventional structure should be fit to data.

Since then, my enthusiasm for BUGS has waxed and waned, depending on what sorts of problems I was working on. For example, in our study of income and voting in U.S. states [1], my colleagues fit all our models in BUGS. Meanwhile we kept running into difficulty when we tried to expand our model in different ways, most notably when going from varying-intercept multilevel regressions, to varying-intercept, varying-slope regressions, to models with more than two varying coefficients per group. Around this time I discovered lmer [2], a function in R which fits multilevel linear and generalized linear models allowing for varying intercepts and slopes. The lmer function can have convergence problems and does not account for uncertainty in the variance parameters, but it is faster than Bugs and in many cases more reliable--so much so that Jennifer Hill and I retooled our book on multilevel models to foreground lmer and de-emphasize Bugs, using the latter more as a way of illustrating models than as a practical tool.
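In lmer notation, the progression I'm describing is, schematically (the variable names are made up):

library(lme4)
m1 <- lmer(y ~ x + (1 | state), data = d)               # varying intercepts
m2 <- lmer(y ~ x + (1 + x | state), data = d)           # varying intercepts and slopes
m3 <- lmer(y ~ x + z + (1 + x + z | state), data = d)   # more than two varying coefficients per group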

What does BUGS do best and what does it do worst?

Differential Evolution MCMC

| 1 Comment

John Salvatier writes:

I remember that you once mentioned an MCMC algorithm based on Differential Evolution, so I thought you might be interested in this paper, which introduces an algorithm based on Differential Evolution and claims to be useful even in high dimensional and multimodal problems.

Cool! Could this be implemented in Bugs, Jags, HBC, etc?
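From what I understand, the core differential-evolution proposal is simple enough to sketch in a few lines of R. Here's a toy version on a two-dimensional normal target (not the paper's code), in which each chain proposes a jump along the difference between two other randomly chosen chains:

log.post <- function(th) sum(dnorm(th, 0, 1, log = TRUE))   # toy target
d <- 2; n.chain <- 10; n.iter <- 2000
gamma <- 2.38 / sqrt(2 * d)
X <- matrix(rnorm(n.chain * d, 0, 3), n.chain, d)           # overdispersed starting points
keep <- array(NA, c(n.iter, n.chain, d))
for (t in 1:n.iter) {
  for (i in 1:n.chain) {
    rs <- sample(setdiff(1:n.chain, i), 2)
    prop <- X[i, ] + gamma * (X[rs[1], ] - X[rs[2], ]) + rnorm(d, 0, 1e-4)
    if (log(runif(1)) < log.post(prop) - log.post(X[i, ])) X[i, ] <- prop
  }
  keep[t, , ] <- X
}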

Equation search, part 2


Some further thoughts on the Eureqa program which implements the curve-fitting method of Michael Schmidt and Hod Lipson:

The program kept running indefinitely, so I stopped it in the morning, at which point I noticed that the output didn't quite make sense, and I went back and realized that I'd messed up when trying to delete some extra data in the file. So I re-ran with the actual data. The program functioned as before but moved much quicker to a set of nearly-perfect fits (R-squared = 99.9997%, and no, that's not a typo). Here's what the program came up with:


The model at the very bottom of the list is pretty pointless, but in general I like the idea of including "scaffolding" (those simple models that we construct on the way toward building something that fits better) so I can't really complain.

It's hard to fault the program for not finding y^2 = x1^2 + x2^2, given that it already had such a success with the models that it did find.
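For what it's worth, here's a one-line check of that functional form on the homework data (assuming the data points from part 1, below, are in a data frame d with columns y, x1, x2):

fit <- lm(y ~ I(sqrt(x1^2 + x2^2)) - 1, data = d)
coef(fit)        # essentially 1
sd(resid(fit))   # essentially 0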

Equation search, part 1


For some reason, Aleks doesn't blog anymore; he just sends me things to blog. I guess he's decided his time is more usefully spent doing other things. I, however, continue to enjoy blogging as an alternative to real work and am a sucker for all the links Aleks sends me (except for the videos, which I never watch).

Aleks's latest find is a program called Eureqa that implements a high-tech curve-fitting method of Michael Schmidt and Hod Lipson. (And, unlike Clarence Thomas, I mean "high-tech" in a good way.) Schmidt and Lipson describe their algorithm as "distilling free-form natural laws from experimental data," which seems a bit over the top, but the basic idea seems sound: Instead of simply running a linear regression, the program searches through a larger space of functional forms, building models like Tinker Toys by continually adding components until the fit stops improving:


I have some thoughts on the limitations of this approach (see below), but to get things started I wanted to try out an example where I suspected this new approach would work well where more traditional statistical methods would fail.

The example I chose was a homework assignment that I included in a couple of my books. Here's the description:

I give students the following forty data points and ask them to fit y as a function of x1 and x2.

y x1 x2
15.68 6.87 14.09
6.18 4.4 4.35
18.1 0.43 18.09
9.07 2.73 8.65
17.97 3.25 17.68
10.04 5.3 8.53
20.74 7.08 19.5
9.76 9.73 0.72
8.23 4.51 6.88
6.52 6.4 1.26
15.69 5.72 14.62
15.51 6.28 14.18
20.61 6.14 19.68
19.58 8.26 17.75
9.72 9.41 2.44
16.36 2.88 16.1
18.3 5.74 17.37
13.26 0.45 13.25
12.1 3.74 11.51
18.15 5.03 17.44
16.8 9.67 13.74
16.55 3.62 16.15
18.79 2.54 18.62
15.68 9.15 12.74
4.08 0.69 4.02
15.45 7.97 13.24
13.44 2.49 13.21
20.86 9.81 18.41
16.05 7.56 14.16
6 0.98 5.92
3.29 0.65 3.22
9.41 9 2.74
10.76 7.83 7.39
5.98 0.26 5.97
19.23 3.64 18.89
15.67 9.28 12.63
7.04 5.66 4.18
21.63 9.71 19.32
17.84 9.36 15.19
7.49 0.88 7.43

[If you want to play along, try to fit the data before going on.]

In the spirit of Christian Robert, I'd like to link to my own adaptive Metropolis paper (with Cristian Pasarica):

A good choice of the proposal distribution is crucial for the rapid convergence of the Metropolis algorithm. In this paper, given a family of parametric Markovian kernels, we develop an adaptive algorithm for selecting the best kernel that maximizes the expected squared jumped distance, an objective function that characterizes the Markov chain. We demonstrate the effectiveness of our method in several examples.
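For readers who want the objective function in symbols: for a chain \theta^{(t)}, the expected squared jumped distance is

\mathrm{ESJD} = E\left[ \, \| \theta^{(t+1)} - \theta^{(t)} \|^2 \, \right],

and for a stationary scalar chain this equals 2(1 - \rho_1)\,\mathrm{Var}(\theta), so maximizing it is the same as minimizing the lag-1 autocorrelation.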

The key idea is to use an importance-weighted calculation to home in on a jumping kernel that maximizes expected squared jumped distance (and thus minimizes the first-order autocorrelation). We have a bunch of examples to show how it works and how it outperforms the more traditional approach of tuning the acceptance rate:


Regarding the adaptivity issue, our tack is to recognize that the adaptation will be done in stages, along with convergence monitoring. We stop adapting once approximate convergence has been reached and consider the earlier iterations as burn-in. Given what is standard practice here anyway, I don't think we're really losing anything in efficiency by doing things this way.

Completely adaptive algorithms are cool too, but you can do a lot of useful adaptation in this semi-static way, adapting every 100 iterations or so and then stopping the adaptation when you've reached a stable point.
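For a concrete picture of that semi-static scheme, here is a minimal R sketch. It is not the importance-weighted optimization from the paper; it just runs a random-walk Metropolis on a made-up one-dimensional target in blocks of 100 iterations, tries a handful of candidate proposal scales during burn-in, keeps the scale with the largest observed expected squared jumped distance, and then stops adapting:

    # Made-up 1-D target; everything below is illustrative, not the paper's method.
    log_post <- function(theta) dnorm(theta, mean = 0, sd = 2, log = TRUE)

    metropolis_block <- function(theta0, scale, n_iter = 100) {
      theta <- numeric(n_iter + 1)
      theta[1] <- theta0
      for (t in 1:n_iter) {
        prop <- rnorm(1, theta[t], scale)
        if (log(runif(1)) < log_post(prop) - log_post(theta[t])) {
          theta[t + 1] <- prop         # accept the proposed jump
        } else {
          theta[t + 1] <- theta[t]     # reject: stay put
        }
      }
      theta
    }

    scales <- c(0.5, 1, 2, 4, 8)       # candidate jumping scales
    esjd <- numeric(length(scales))
    theta0 <- 0
    for (k in seq_along(scales)) {     # burn-in: one 100-iteration block per scale
      draws <- metropolis_block(theta0, scales[k])
      esjd[k] <- mean(diff(draws)^2)   # ESJD estimate; rejections count as zero jumps
      theta0 <- tail(draws, 1)
    }
    best <- scales[which.max(esjd)]    # freeze the kernel at the winning scale ...
    final <- metropolis_block(theta0, best, n_iter = 5000)   # ... and keep sampling

Because rejected proposals count as zero jumps, maximizing expected squared jumped distance automatically penalizes both timid proposals and proposals so wild that almost nothing is accepted, which is the trade-off that acceptance-rate tuning tries to manage indirectly.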

The article will appear in Statistica Sinica.

Asa writes:

I took your class on multilevel models last year and have since found myself applying them in several different contexts. I am about to start a new project with a dataset in the tens of millions of observations. In my experience, multilevel modeling has been most important when the number of observations in at least one subgroup of interest is small. Getting started on this project, I have two questions:

1) Do multilevel models still have the potential to add much accuracy to predictions when n is very large in all subgroups of interest?

2) Do you find SAS, Stata, or R to be more efficient at handling multilevel/"mixed effects" models with such a large dataset (I won't be needing any logit/poisson/glm models)?

My reply:

Regarding software, I'm not sure, but my guess is that Stata might be best with large datasets. Stata also has an active user community that can help with such questions.

As for your first question: if n is large in all subgroups, then multilevel modeling is typically not needed; you can simply fit a separate model in each group, which is equivalent to a full-interaction model. At that point you might be interested in details within subgroups, and then you might want a multilevel model after all.
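Here is a toy R sketch of that equivalence on simulated data (the setup and all variable names are invented for illustration): with n this large in every group there is little pooling, so separate per-group regressions, a single full-interaction model, and a multilevel model give essentially the same group-level estimates.

    # Simulated example; library(lme4) provides lmer.
    library(lme4)
    set.seed(1)
    G <- 20; n_per <- 1000
    dat <- data.frame(g = factor(rep(1:G, each = n_per)),
                      x = rnorm(G * n_per))
    a <- rnorm(G, 0, 1)                        # true group intercepts
    b <- rnorm(G, 2, 0.5)                      # true group slopes
    dat$y <- a[dat$g] + b[dat$g] * dat$x + rnorm(G * n_per)

    fit_sep  <- lapply(split(dat, dat$g),      # separate regression per group
                       function(di) coef(lm(y ~ x, data = di)))
    fit_full <- lm(y ~ 0 + g + g:x, data = dat)        # full-interaction model
    fit_mlm  <- lmer(y ~ x + (1 + x | g), data = dat)  # multilevel model
    coef(fit_mlm)$g                            # compare with fit_sep and fit_full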

Asa then wrote:

Yes, a "full interaction" model was the alternative I was thinking of. And yes, I can imagine the results from that model raising further questions about whats going on within groups as well.

My previous guess was that SAS would be the most efficient for multilevel modeling with big data. But I just completely wrecked my (albeit early-2000s-era) laptop by looping proc mixed a bunch of times on a much smaller dataset.

I don't really know about the SAS vs. Stata issue. In general, I have warmer feelings toward Stata than SAS, but, on any particular problem, who knows? I'm pretty sure that R would choke on any of these problems.

On the other hand, if you end up breaking the problem into smaller pieces anyway, maybe the slowness of R wouldn't be so much of a problem. R does have the advantage of flexibility.

Masanao sends in this.

(The oven that Masanao was referring to is described here.)

