Possible confusion over convergence of iterative simulation

This should be of interest to people who run Gibbs samplers, Metropolis algorithms, whatever. (Including if you just run them in Bugs, even if you don’t program them yourself.)

Here’s the output from running a simple Gibbs sampler for the 8-schools model (3 chains, 1000 iterations per chain). Convergence looks good (R.hat is no more than 1.1 for any parameter).

         mean  sd  2.5% 25% 50%  75% 97.5% Rhat n.eff
theta[1] 10.9 8.3  -2.3 5.6 9.8 15.0    31  1.0   300
theta[2]  7.8 6.1  -4.8 4.0 8.0 11.7    20  1.0   320
theta[3]  6.3 7.4 -10.5 2.5 6.7 10.5    19  1.0   510
theta[4]  7.4 6.3  -4.9 3.6 7.5 11.2    20  1.0   760
theta[5]  5.3 6.3  -8.3 1.4 5.8  9.4    17  1.0   320
theta[6]  6.2 6.5  -7.7 2.4 6.6 10.3    18  1.0   450
theta[7] 10.4 6.5  -0.5 5.9 9.7 14.1    25  1.0   170
theta[8]  8.2 7.5  -5.9 3.9 7.9 12.1    25  1.0   350
mu        7.8 5.1  -2.3 4.3 7.7 11.0    18  1.0   180
tau       6.2 5.3   0.3 2.1 4.8  8.6    20  1.1    76

Just for laffs, I ran it again, identically except for different randomly-drawn initial conditions:

         mean  sd  2.5% 25%  50%  75% 97.5% Rhat n.eff
theta[1] 12.2 8.3  -1.2 6.9 11.1 15.9    33  1.0   260
theta[2]  8.3 6.2  -4.0 4.4  8.4 11.8    21  1.0   370
theta[3]  6.5 7.8 -11.9 2.3  7.5 11.2    22  1.0   490
theta[4]  8.1 6.7  -6.7 4.1  8.5 12.1    21  1.0   290
theta[5]  5.1 6.6  -9.8 0.9  5.9  9.9    16  1.0   240
theta[6]  6.4 6.8  -8.3 2.2  7.0 10.8    19  1.0   390
theta[7] 11.2 6.8  -0.5 7.0 10.7 15.0    27  1.0   220
theta[8]  8.9 8.2  -7.0 4.3  9.0 13.2    26  1.0   650
mu        8.4 5.2  -1.3 5.1  8.6 11.3    19  1.0   220
tau       7.1 5.9   0.6 2.9  5.7  9.6    22  1.1    60

Hey! What’s going on here? Consider theta[1]. R.hat is at 1.0, indicating that the 3 chains have mixed, but the two runs give different values for the posterior mean (10.9 or 12.2). (And it’s not just a problem with the mean; there’s a similar difference in the medians.) Does this mean that 3000 simulations are not enough? I answer no, for two reasons:

1. The estimates 10.9 and 12.2 look quite different, but they're really not so different at all, considering the posterior sd of 8.3. Just for example:

> pnorm (12.2+8.3, 10.9, 8.3) - pnorm (12.2-8.3, 10.9, 8.3)
[1] 0.677
> pnorm (12.2+2*8.3, 10.9, 8.3) - pnorm (12.2-2*8.3, 10.9, 8.3)
[1] 0.952

These probabilities are supposed to be .68 and .95, so the shift doesn't really change much. (Yes, I know the actual posterior distribution is not really normal; I'm just giving a sense of the overlap here.)
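(If you don't have R handy, the same overlap check can be sketched in Python; the `pnorm` here is a hand-rolled stand-in for R's normal CDF, and the normal approximation to the posterior is the same simplification as above.)

```python
from math import erf, sqrt

def pnorm(x, mean=0.0, sd=1.0):
    """Normal CDF, a stand-in for R's pnorm(x, mean, sd)."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))

# How much of a N(10.9, 8.3) distribution falls within +/- 1 sd and
# +/- 2 sd of the shifted estimate 12.2?
p1 = pnorm(12.2 + 8.3, 10.9, 8.3) - pnorm(12.2 - 8.3, 10.9, 8.3)
p2 = pnorm(12.2 + 2 * 8.3, 10.9, 8.3) - pnorm(12.2 - 2 * 8.3, 10.9, 8.3)
print(round(p1, 3), round(p2, 3))  # 0.677 0.952
```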

2. If you look at the output above, you’ll see the effective sample size is estimated to be about 300. Let’s do some quick simulations, again using the normal distribution:

> mean (rnorm (300, 10.9, 8.3))
[1] 10.1
> mean (rnorm (300, 10.9, 8.3))
[1] 12.4
> mean (rnorm (300, 10.9, 8.3))
[1] 11.3

As you can see, even completely independent random draws don't pin down the posterior mean very precisely when you have only 300 of them. This has nothing to do with the use of Gibbs (except to the extent that autocorrelation in the chains reduces the number of effective draws). It's not a problem with convergence.
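To make the point more systematically, here's a quick Python sketch (not part of the original R session) that computes the theoretical Monte Carlo standard error of a posterior-mean estimate based on 300 effective draws, then repeats the `rnorm` experiment above many times to show the spread of the resulting means:

```python
import random
import statistics

random.seed(1)

sd, n_eff = 8.3, 300

# Monte Carlo standard error of a mean estimated from ~300
# independent draws: sd / sqrt(n_eff)
mcse = sd / n_eff ** 0.5
print(round(mcse, 2))  # about 0.48

# Repeat the rnorm experiment many times and look at the spread of
# the resulting means of 300 draws from N(10.9, 8.3):
means = [statistics.fmean(random.gauss(10.9, sd) for _ in range(n_eff))
         for _ in range(1000)]
print(round(statistics.stdev(means), 2))  # close to the mcse above
```

With a standard error of roughly half a point per run, a run-to-run difference like 10.9 versus 12.2 is only about two standard errors of the difference, which is exactly the sort of noise you should expect.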

Do more simulations if you want. As for me, I'll remember that my goal, in this sort of problem, is inference about theta, not inference about the posterior mean (or median, or whatever) of theta.
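P.S. For anyone wondering what R.hat is actually measuring, here's a minimal Python sketch of the basic Gelman-Rubin diagnostic (without the split-chain and other refinements used in modern implementations): it compares the between-chain and within-chain variance, and is close to 1 when the chains have mixed.

```python
import random
import statistics

def rhat(chains):
    """Basic Gelman-Rubin potential scale reduction factor.
    chains: a list of equal-length lists of draws for one parameter."""
    m, n = len(chains), len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    grand_mean = statistics.fmean(chain_means)
    # Between-chain variance B and mean within-chain variance W:
    B = n / (m - 1) * sum((cm - grand_mean) ** 2 for cm in chain_means)
    W = statistics.fmean(statistics.variance(c) for c in chains)
    # Pooled estimate of the posterior variance, compared to W:
    var_plus = (n - 1) / n * W + B / n
    return (var_plus / W) ** 0.5

# Three chains drawn independently from the same distribution mix
# perfectly, so Rhat comes out very close to 1:
random.seed(0)
chains = [[random.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(3)]
print(round(rhat(chains), 2))
```

Note that, as the two runs above illustrate, R.hat near 1 only tells you the chains agree with each other; it says nothing about the Monte Carlo error of the resulting estimates, which is what n.eff is for.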