This should be of interest to people who run Gibbs samplers, Metropolis algorithms, whatever. (Including if you just run them in Bugs, even if you don’t program them yourself.)
Here’s the output from running a simple Gibbs sampler for the 8-schools model (3 chains, 1000 iterations per chain). Convergence looks good (R.hat is no more than 1.1 for any parameter).
         mean  sd  2.5%  25%  50%  75% 97.5% Rhat n.eff
theta[1] 10.9 8.3  -2.3  5.6  9.8 15.0    31  1.0   300
theta[2]  7.8 6.1  -4.8  4.0  8.0 11.7    20  1.0   320
theta[3]  6.3 7.4 -10.5  2.5  6.7 10.5    19  1.0   510
theta[4]  7.4 6.3  -4.9  3.6  7.5 11.2    20  1.0   760
theta[5]  5.3 6.3  -8.3  1.4  5.8  9.4    17  1.0   320
theta[6]  6.2 6.5  -7.7  2.4  6.6 10.3    18  1.0   450
theta[7] 10.4 6.5  -0.5  5.9  9.7 14.1    25  1.0   170
theta[8]  8.2 7.5  -5.9  3.9  7.9 12.1    25  1.0   350
mu        7.8 5.1  -2.3  4.3  7.7 11.0    18  1.0   180
tau       6.2 5.3   0.3  2.1  4.8  8.6    20  1.1    76
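For anyone who hasn't looked inside the Rhat column: it compares between-chain and within-chain variance, and values near 1 suggest the chains are sampling the same distribution. Here's a minimal sketch in Python with numpy (the classic non-split formula, as a rough illustration; this is not necessarily the exact variant Bugs reports):

```python
import numpy as np

def rhat(chains):
    """Potential scale reduction Rhat for one parameter.

    chains: array of shape (m, n) -- m chains, n iterations each.
    Classic (non-split) formula comparing between- and within-chain
    variance; values near 1 indicate the chains have mixed.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    chain_vars = chains.var(axis=1, ddof=1)
    B = n * chain_means.var(ddof=1)   # between-chain variance
    W = chain_vars.mean()             # within-chain variance
    var_plus = (n - 1) / n * W + B / n  # pooled variance estimate
    return np.sqrt(var_plus / W)

# Well-mixed case: 3 chains of 1000 draws from the same distribution
# (numbers loosely echoing theta[1] above); Rhat comes out near 1.
rng = np.random.default_rng(0)
good = rng.normal(10.9, 8.3, size=(3, 1000))
print(rhat(good))

# Chains stuck in different places: shift each chain by a constant
# and Rhat jumps well above 1.
bad = good + np.array([[-10.0], [0.0], [10.0]])
print(rhat(bad))
```

The point of the diagnostic is exactly the distinction the post goes on to make: Rhat near 1 says the chains agree with each other, not that you have enough draws to pin down any particular posterior summary.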
Just for laffs, I ran it again, identically except for different randomly-drawn initial conditions:
         mean  sd  2.5%  25%  50%  75% 97.5% Rhat n.eff
theta[1] 12.2 8.3  -1.2  6.9 11.1 15.9    33  1.0   260
theta[2]  8.3 6.2  -4.0  4.4  8.4 11.8    21  1.0   370
theta[3]  6.5 7.8 -11.9  2.3  7.5 11.2    22  1.0   490
theta[4]  8.1 6.7  -6.7  4.1  8.5 12.1    21  1.0   290
theta[5]  5.1 6.6  -9.8  0.9  5.9  9.9    16  1.0   240
theta[6]  6.4 6.8  -8.3  2.2  7.0 10.8    19  1.0   390
theta[7] 11.2 6.8  -0.5  7.0 10.7 15.0    27  1.0   220
theta[8]  8.9 8.2  -7.0  4.3  9.0 13.2    26  1.0   650
mu        8.4 5.2  -1.3  5.1  8.6 11.3    19  1.0   220
tau       7.1 5.9   0.6  2.9  5.7  9.6    22  1.1    60
Hey! What’s going on here? Consider theta[1]. Rhat is at 1.0, indicating that the 3 chains have mixed, but the two runs give different values for the posterior mean (10.9 vs. 12.2). (And it’s not just a problem with the mean; there’s a similar difference in the medians.) Does this mean that 3000 simulations are not enough? I answer no, for two reasons:
1. The estimates 10.9 and 12.2 look quite different, but really they’re not so different at all, considering the sd of 8.3. Just for example:
> pnorm (12.2+8.3, 10.9, 8.3) - pnorm (12.2-8.3, 10.9, 8.3)
[1] 0.677
> pnorm (12.2+2*8.3, 10.9, 8.3) - pnorm (12.2-2*8.3, 10.9, 8.3)
[1] 0.952
These probabilities are supposed to be .68 and .95, so the shift doesn’t really do much. (Yes, I know the actual posterior distribution is not really normal; I’m just giving a sense of the overlap here.)
2. If you look at the output above, you’ll see the effective sample size is estimated to be about 300. Let’s do some quick simulations, again using the normal distribution:
> mean (rnorm (300, 10.9, 8.3))
[1] 10.1
> mean (rnorm (300, 10.9, 8.3))
[1] 12.4
> mean (rnorm (300, 10.9, 8.3))
[1] 11.3
As you can see, even completely independent draws don’t get you the posterior mean so precisely when you only have 300 of them. This has nothing to do with the use of Gibbs (except to the extent that the number of effective draws has decreased). It’s not a problem with convergence.
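The same point can be put as a back-of-envelope number: the Monte Carlo standard error of a posterior-mean estimate based on roughly independent draws is about sd/sqrt(n.eff). A quick check in Python with numpy (rather than the R used above), plugging in the sd of 8.3 and n.eff of about 300 from the output:

```python
import numpy as np

# Monte Carlo standard error of a posterior-mean estimate is roughly
# sd / sqrt(n_eff). With sd = 8.3 and n_eff = 300 (theta[1] above):
sd, n_eff = 8.3, 300
mc_se = sd / np.sqrt(n_eff)
print(round(mc_se, 2))  # about 0.48

# Sanity check by simulation: the means of repeated batches of 300
# independent normal draws scatter with about that standard deviation.
rng = np.random.default_rng(1)
batch_means = rng.normal(10.9, sd, size=(10000, n_eff)).mean(axis=1)
print(round(batch_means.std(), 2))
```

So each run's estimate of the posterior mean carries Monte Carlo noise on the order of half a point, which is of the same order as the run-to-run differences in the tables, and tiny next to the posterior sd of 8.3.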
Do more simulations if you want. As for me, I’ll remember that my goal in this sort of problem is inference about theta, not inference about the posterior mean (or median, or whatever) of theta.