More on burn-in for iterative simulation

Following my discussion with Radford (see the comments of this recent entry), I had this brief back-and-forth with Bob O’Hara regarding adaptive updating and burn-in in Bugs.

Me: I want to automatically set all adaptive phases to the burn-in period.

Bob: Wouldn’t it make more sense to do this the other way round, and set the burn-in to the largest adaptive phase? I don’t know how Andrew decided what they should be, but at least some thought went into setting them.

The longest is 4000, which I haven’t found to be too large: my experience is that the models that are slow enough for this to be a problem tend to be ones which take some time to burn in anyway.

Me: 4 reasons:

1. In debugging, it’s very helpful to run for 10 iterations to check that everything works OK and is saved as it should be.

2. Sometimes it doesn’t take 4000 iterations to converge.

3. With huge datasets, I sometimes want to just run a few hundred iterations and get close; I don’t have time to wait for 10,000 or whatever.

4. If I actually need to run 50,000, why not adapt for 25,000? That will presumably be more efficient, no?

6 thoughts on “More on burn-in for iterative simulation”

  1. Hi Andrew,

    I have a question about burn-in that's slightly off-topic, as it isn't about adaptive updating. What's your take on Charles Geyer's position on burn-in (http://www.stat.umn.edu/~charlie/mcmc/burn.html)? I find it compelling, because of the way I learned about the Bayesian approach (more below). But I'd like to know what someone with more experience thinks about Geyer's contention.

    I was introduced to the Bayesian approach when I read Jaynes's book in 2002. His introductory example of Bayesian inference was developed to the point where he was discussing forward and backward inference for a two-state Markov chain, conditional on its state at one particular iteration. (This was in the context of sampling with replacement, where the replacement is not quite random, so that one draw influences the next.) Because of this example, I began my life as a Bayesian with the idea that convergence to the stationary distribution isn't a property of a Markov chain at any iteration in which its state is known, but rather is something that happens to our posterior distribution about distant states of the chain. Thus Geyer's viewpoint feels very natural to me.
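    Here's a minimal sketch of that idea in Python (the transition matrix is made up for illustration; these aren't Jaynes's numbers). Conditioning on the state at one iteration, the distribution over the state k steps ahead converges to the stationary distribution as k grows, even though the conditioned-on state itself is a point mass:

    ```python
    import numpy as np

    # Hypothetical two-state transition matrix, made up for illustration.
    # Row i gives P(next state | current state i).
    P = np.array([[0.9, 0.1],
                  [0.2, 0.8]])

    # Stationary distribution of a two-state chain, in closed form:
    # pi is proportional to (P[1, 0], P[0, 1]).
    pi = np.array([P[1, 0], P[0, 1]]) / (P[0, 1] + P[1, 0])

    # Condition on the chain being in state 0 at some known iteration.
    # Our distribution over the state k steps later is e_0 @ P^k; it
    # approaches pi as k grows, which is the sense in which "convergence"
    # is a property of our beliefs about distant states.
    dist = np.array([1.0, 0.0])
    for k in range(1, 21):
        dist = dist @ P
        if k in (1, 5, 20):
            print(f"k = {k:2d}: {dist.round(4)}   (stationary: {pi.round(4)})")
    ```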

  2. Corey,

    Burn-in is less of an issue in a two-state Markov chain. In the sorts of problems I work on, the parameter space is continuous or, if discrete, is large. It can be much easier to get mixing when I discard the first part of the simulated sequences, especially when using overdispersed starting points. I'm not sure what the best fraction to discard is; my intuition is that half is reasonable, but I've struggled to formalize that. I don't find the asymptotic arguments on Geyer's webpage very helpful, because he's focused on getting a point summary (e.g., the posterior mean), whereas I'm interested in summaries of the whole posterior distribution.
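    To make that routine concrete, here's a minimal sketch in Python, on a toy normal target rather than any real model: run a few chains from deliberately overdispersed starting points, discard the first half of each, and compute a simplified version of the potential scale reduction factor on what remains.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def metropolis(start, n_iter, scale=1.0):
        """Random-walk Metropolis on a standard normal target (toy example)."""
        x, out = start, np.empty(n_iter)
        for i in range(n_iter):
            prop = x + scale * rng.normal()
            # log acceptance ratio for the N(0, 1) target
            if np.log(rng.uniform()) < 0.5 * (x**2 - prop**2):
                x = prop
            out[i] = x
        return out

    def rhat(chains):
        """Simplified Gelman-Rubin potential scale reduction factor.
        chains: (m, n) array of m chains with n retained draws each."""
        n = chains.shape[1]
        B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
        W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
        return np.sqrt(((n - 1) / n * W + B / n) / W)

    # Four chains from overdispersed starting points.
    draws = np.array([metropolis(s, 2000) for s in (-10.0, -3.0, 3.0, 10.0)])
    kept = draws[:, 1000:]   # discard the first half as burn-in
    print("R-hat on the retained halves:", round(rhat(kept), 3))
    # Values near 1 suggest the retained halves have mixed; values much
    # above 1 mean discard more or run longer.
    ```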

  3. Hi Andrew,

    I may have given you the misimpression that Jaynes is discussing MCMC, which I didn't mean to do. He's just discussing Bayesian inference in the case of not-quite-random sequential Bernoulli trials that are modelled collectively as a Markov chain.

    All I mean to say is that Geyer's maxim, "any point you don't mind having in a sample is a good starting point", makes sense to me, because convergence isn't a property of any given known state of the Markov chain. Whether you sample from an overdispersed starting distribution or just start at the MAP point, convergence just isn't a concept that applies; after you pick a starting point, the starting distribution is just a delta function at that point.

    Checking mixing is something else again. I am mindful of the example in the WinBUGS manual showing that slow-mixing chains can look very much like they've converged even though they're just stuck near the starting point; I've used multiple chains with overdispersed starting points to investigate mixing in my models.

  4. Corey,

    I suppose it depends on the starting points. In the example on page 286 of our book (second edition), you'd certainly want to discard the first part of the simulations. In other cases, it may not be necessary. The point is that if we're going to have 100 or 1000 simulations or whatever, it doesn't make sense to include a point that's extremely unlikely, of the sort you'd only see once in a million tries. You wouldn't want this with any other simulation method; it's not just an issue with Markov chain simulation.

  5. It seems like you should be able to determine the amount of burn-in by looking at the results…at least for a parameter space that doesn't have too many dimensions. Suppose I do 2000 iterations, then look at the results (as a walk in parameter space) and see that ever since iteration 200 they've been wandering around in the same region. In this case, I'll just discard the first 200…or maybe 300. But if, after 2000 iterations, things still don't seem to have "settled down", then I'll run more…and maybe more after that.

    I guess my general point is, I don't see why (or whether) the decision of what to discard as a "burn-in phase" needs to be made a priori.
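    Here's a rough sketch in Python of what I mean, on a made-up toy chain: print the mean and sd of consecutive blocks of iterations; early blocks that sit far from the later ones are the burn-in to discard, and if the later blocks still disagree with each other, run longer before deciding.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    # Toy chain: an AR(1) series started far from its stationary mean of
    # zero, so the first stretch is pure transient. (The start of 50 and
    # phi = 0.9 are arbitrary choices for the illustration.)
    n, phi = 2000, 0.9
    x = np.empty(n)
    x[0] = 50.0
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()

    # The inspection, automated minimally: summaries of consecutive
    # blocks of 200 iterations, to see where the chain "settles down".
    for lo in range(0, n, 200):
        block = x[lo:lo + 200]
        print(f"iterations {lo:4d}-{lo + 199:4d}:  "
              f"mean {block.mean():7.2f}   sd {block.std(ddof=1):6.2f}")
    ```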

  6. Phil,

    One reason for setting the burn-in ahead of time is that we can do adaptive tuning of the Metropolis jumps during the burn-in phase. The other reason is to have a default setting. When I run Gibbs or Metropolis, I don't like to look at lots of plots; I just like to run it until it mixes and then stop. A burn-in of 50% seemed reasonable to me for this, but, as I've said, I don't have any great argument for it.
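    Here's a minimal sketch of that first reason in Python. It illustrates the general idea, not Bugs's actual adaptation scheme; the 0.44 acceptance target and the multiplicative update rule are just common defaults I'm assuming. The proposal scale is tuned only while i < burnin and then frozen, so the retained draws come from a fixed transition kernel:

    ```python
    import numpy as np

    def adaptive_metropolis(logp, x0, n_iter, burnin, target=0.44, seed=0):
        """Random-walk Metropolis whose proposal scale is tuned only during
        burn-in and then frozen. Freezing matters: draws taken while the
        kernel is still changing are not from a fixed Markov chain, which
        is one reason to line the adaptive phase up with the discarded
        burn-in. target=0.44 is a standard acceptance-rate goal for
        one-dimensional proposals."""
        rng = np.random.default_rng(seed)
        x, scale = x0, 1.0
        draws = np.empty(n_iter)
        for i in range(n_iter):
            prop = x + scale * rng.normal()
            accepted = np.log(rng.uniform()) < logp(prop) - logp(x)
            if accepted:
                x = prop
            if i < burnin:
                # Stochastic-approximation tuning: nudge the scale up after
                # acceptances and down after rejections, with a decaying
                # step size so the scale settles.
                step = 1.0 / np.sqrt(i + 1)
                scale *= np.exp(step * (float(accepted) - target))
            draws[i] = x
        return draws[burnin:], scale   # keep only the post-burn-in draws

    # Example: standard normal target, adapting for the first half of the
    # run (the "adapt for 25,000 out of 50,000" logic from the post).
    post, tuned = adaptive_metropolis(lambda x: -0.5 * x * x,
                                      x0=10.0, n_iter=10_000, burnin=5_000)
    print(f"tuned scale: {tuned:.2f}   "
          f"mean: {post.mean():.2f}   sd: {post.std():.2f}")
    ```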
