Burn-in Man

Mike McLaughlin writes:

I was wondering about MCMC burn-in and whether the oft-cited emphasis on this in the literature might not be a bit overstated.

My thought was that the chain is Markovian. In a Metropolis (or Metropolis-Hastings) context, once you establish the scale of the proposal distribution(s), successful burn-in gets you only a starting location inside the posterior — nothing else is remembered, by definition! However, there is nothing really special about this particular starting point; it would have been just as valid had it been your initial guess, and the burn-in would then have been superfluous. Moreover, the sampling phase will eventually reach the far outskirts of the posterior, often a lot more extreme than the sampling starting location, yet it will still (collectively) describe the posterior correctly. This implies that *any* valid starting point is just as good as any other, burn-in or no burn-in.

The only circumstance that I can think of in which a burn-in would be essential is the case in which the prior support regions for the parameters are not all jointly valid (inside the joint posterior), if that is even possible given the min/max limits set for the priors. Am I missing something?

My response: What you’re missing is that any inference from a finite number of simulations is an approximation.

Consider an extreme example in which your sampler takes independent draws from a N(mu,sigma^2) distribution, but you pick a starting value of X. Counting that starting value as the first of your n simulations, the average will then have the value, in expectation, of (1/n)X + ((n-1)/n)mu (instead of the correct value of mu). If, for example, X=100 and n=100, you’re in trouble! But a burn-in of 1 will solve all your problems in this example. (And in this example, n=100 would work just fine for most purposes.) True, if you draw a few gazillion simulations, the initial value will be forgotten, but why run a few gazillion simulations if you don’t have to? That will just slow down your important work.
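To make this concrete, here’s a minimal sketch of that toy example in Python; the names (X0, n, mu, sigma) are just illustrative choices, not anything from a specific package:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 0.0, 1.0
X0, n = 100.0, 100          # bad starting value, number of "simulations"

# The starting value counts as the first of the n draws.
draws = np.concatenate([[X0], rng.normal(mu, sigma, n - 1)])

print(draws.mean())         # roughly (1/n)*X0 + ((n-1)/n)*mu = 1.0, far from mu = 0
print(draws[1:].mean())     # a burn-in of 1 discards X0 and recovers mu, about 0
```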

More generally, your starting values will persist for a while, basically as long as it takes for your chains to mix. If your starting values persist for a time T, they will pollute your inferences for a time of order T; by that point you could already have stopped the simulations, had you discarded some early steps instead.
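Here’s a hedged illustration of that persistence: a toy random-walk Metropolis sampler targeting N(0,1), deliberately started far away at 100. All names and tuning choices are illustrative assumptions, not a recommendation.

```python
import numpy as np

def metropolis(log_target, x0, step, n_iter, rng):
    """Random-walk Metropolis; returns the full chain including x0."""
    chain = np.empty(n_iter + 1)
    chain[0] = x0
    lp = log_target(x0)
    for t in range(n_iter):
        prop = chain[t] + rng.normal(0, step)
        lp_prop = log_target(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis accept/reject
            chain[t + 1], lp = prop, lp_prop
        else:
            chain[t + 1] = chain[t]
    return chain

rng = np.random.default_rng(0)
log_target = lambda x: -0.5 * x**2          # standard normal, up to a constant
chain = metropolis(log_target, x0=100.0, step=1.0, n_iter=5000, rng=rng)

print(chain.mean())        # polluted by the long walk in from x0 = 100
print(chain[1000:].mean()) # discarding early draws gives a much better estimate
```

The point is exactly the one above: the walk in from the bad start pollutes the average for a time of order T, and discarding those early steps lets you stop the simulation that much sooner.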

P.S. See here for a different perspective, from Charlie Geyer. For the reasons stated above, I don’t agree with what he writes, but you can read for yourself.

P.P.S. In my example above, you might say that it would be ok if you were to just start at the center of the distribution. One difficulty, though, is that you don’t know where the center of the distribution is before you’ve done your simulations. More realistically, we start from estimates +/- uncertainty as estimated from some simpler model that was easier to fit.

3 thoughts on “Burn-in Man”

  1. I think the original question is valid, and the way Mike is thinking about the problem is good.

    It seems to me that the real issue is that the stationary distribution of the MCMC chains is in fact the posterior, but convergence to this stationary distribution takes a certain amount of time. The MCMC will over-sample very unlikely regions if it starts in those unlikely regions (as it climbs uphill). Furthermore, it takes some time to achieve an appropriate scale for the proposal if you have a self-tuning M-H algorithm.

    On the other hand, if you could first calculate the maximum likelihood estimates and then perturb them by an appropriately scaled proposal, you could essentially get away with a burn-in of one sample (or a very small number of samples) and get a good approximation (see the sketch after these comments).

  2. Yes, in many examples something simpler will work fine. But the maximum likelihood estimate won't be such a good idea for hierarchical models, especially in settings such as logistic regression where some regularization is needed even for good point estimates. At the very least, I'd replace "maximum likelihood" by "easily computable estimate." Regularized MLE can be more computationally stable, as well as more statistically efficient, than pure MLE.

  3. You should always use some amount of burn-in when you can't assess the quality of your initialization. Unfortunately, to truly assess the quality of an initialization, you need to know more about the target than is generally available. In fact, if that much knowledge about the distribution is available, you're probably better off using it to build better transitions than worrying about whether to keep the first N samples.
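A rough sketch of the initialization idea in the first two comments: compute an easily computable point estimate first (here a numerically maximized log target, standing in for a regularized estimate), then start the chain from a perturbation of it on an appropriate scale, so only a token burn-in is needed. This reuses the toy metropolis function from the sketch above; everything here is an illustrative assumption.

```python
# Assumes the metropolis() and numpy setup from the earlier sketch.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
log_target = lambda x: -0.5 * x**2   # same toy standard-normal target

# Easily computable estimate: numerically maximize the log target.
mode = minimize_scalar(lambda x: -log_target(x)).x

# Start near the mode, perturbed on the proposal scale (step = 1 here).
x0 = mode + rng.normal(0, 1.0)
chain = metropolis(log_target, x0=x0, step=1.0, n_iter=5000, rng=rng)
print(chain.mean())   # no long walk-in, so little or no burn-in is needed
```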
