New issue of R News on Bayesian inference

Martyn Plummer edited the most recent issue of R News, and it’s focused on Bayesian computation. Jouni and Sam and I have 2 articles there (one on Bayesian software validation and one on Bayesian data analysis (as opposed to simple Bayesian inference) in R). But here I’ll give some quick comments on the other articles in the issue.

I like all the articles. The comments below might be picky, but I hold articles in a computational forum such as R News to a higher standard.

1. I’m a little disappointed in some of the graphical displays. For example, on page 4, there’s an unwieldy table with numbers presented to 4 or 5 decimal places (e.g., 67.0208 with an sd of 11.08133)! Figure 3 on page 6 has the same problems.

2. I’d also like to see some clearer guidance on practical issues. For example, for Bayesian inference it doesn’t make much sense to simulate 11,000 iterations, keep the last 10,000, and thin by 1 (also in the example on page 4). If you really needed 11,000 to converge (which I doubt you did, but it’s hard to tell based on simulating just one chain), then I think you’d better discard more than the first 1000. Conversely, if you have approximate convergence after 1000 iterations (i.e., that’s all you need to discard from the start of the chain), then I’d think 2000 would be enough, not 11,000. Obviously in this particular example, maybe 11,000 won’t bust your computational budget, but I’ve seen people running things overnight, which really reduces their ability to do flexible data analysis. If 100 iterations are enough, why do so many more? (There’s a rough sketch of the kind of shorter, multi-chain run I have in mind after point 5.)

3. On page 5, there’s a pretty picture, but it’s hard to identify which case is which. Given that the distributions are all basically Gaussian, I think they’d be more clearly displayed as intervals, which would free up one of the axes to display another variable (e.g., year when the justice was appointed) and would also allow each justice to be labeled directly.

I have a similar comment about Figure 1 on page 16: once again, with these densities practically Gaussian, it would be much more informative, I believe, to make a plot whose x-axis goes from 1 to 6, showing the posterior medians and intervals for each of the components of alpha. This would take up much less space (basically 1/6 the space of the current graph) and allow the relevant direct comparison between the alpha’s. In terms of data analysis, I just don’t see what’s gained by having all these curves.

4. The trace plot on page 9 is not maximally informative. It would help to display 2 or 3 parallel chains to get a better sense of mixing (again, see the sketch after point 5).

5. I’m disappointed that the example on page 18 uses the discredited dgamma(0.001,0.001) prior density.
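To make points 2, 4, and 5 concrete, here’s a rough sketch (not code from any of the articles) of the kind of setup I have in mind: a small hierarchical model with toy data, run through R2WinBUGS with 3 shortish parallel chains, and with a uniform prior on the group-level standard deviation instead of dgamma(0.001,0.001) on the precision. The data and names are made up, and it assumes WinBUGS is installed.

    library(R2WinBUGS)

    ## Toy data, purely for illustration
    J <- 8
    y <- rnorm(J, 0, 10)
    sigma.y <- rep(10, J)

    ## Write out a small hierarchical model
    cat("
    model {
      for (j in 1:J) {
        y[j] ~ dnorm(theta[j], tau.y[j])
        tau.y[j] <- pow(sigma.y[j], -2)
        theta[j] ~ dnorm(mu, tau.theta)
      }
      mu ~ dnorm(0, 1.0E-6)
      sigma.theta ~ dunif(0, 100)    # instead of tau.theta ~ dgamma(0.001, 0.001)
      tau.theta <- pow(sigma.theta, -2)
    }
    ", file = "model.txt")

    inits <- function()
      list(theta = rnorm(J), mu = rnorm(1), sigma.theta = runif(1, 0, 10))

    fit <- bugs(data = list(J = J, y = y, sigma.y = sigma.y),
                inits = inits,
                parameters.to.save = c("theta", "mu", "sigma.theta"),
                model.file = "model.txt",
                n.chains = 3,     # parallel chains, so mixing can be checked
                n.iter = 2000,    # start small; lengthen only if needed
                n.burnin = 1000,
                n.thin = 1)

    print(fit)   # shows R-hat and n.eff for each saved parameter

The point isn’t this particular model; it’s that three short chains plus the printed R-hat and n.eff give a much better sense of convergence and of the run length you actually need than one long chain does.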

OK, that’s all. Sorry for being such a curmudgeon. I really did like all the papers and I hope these comments will be useful.

7 thoughts on “New issue of R News on Bayesian inference”

  1. I had a problem with Figure 2 on page 5. Maybe it's my printer, but the blues of Stevens, Breyer and Ginsburg all look the same to me, as do the reds of Scalia, Thomas and Rehnquist. The ghostly blue used for the prior is surely a bad choice — it looks like bleed-through from the other side of the page.

    I think we should ask Kaiser of "Junk Charts" about this one. Was it necessary to use color in this figure? Is there a better way of displaying the information?

  2. "it doesn't make much sense to simulate 11,000 iterations, keep the last 10,000, and thin by 1 (also in the example on page 4). If you really needed 11,000 to converge (which I doubt you did, but it's hard to tell based on simulating just one chain), then I think you'd better discard more than the first 1000. Conversely, if you have approximate convergence after 1000 iterations (i.e., that's all you need to discard from the start of the chain), then I'd think 2000 would be enough, not 11,000."

    I think you have things backwards here. Indeed, if you are discarding more than 10% of your iterations, you are either inefficiently discarding too many, or you haven't run the chain long enough to have any great confidence that it's converged and that the estimates you have are accurate.

    One thing to keep in mind is that the time to reach approximate convergence is often about the same as the autocorrelation time (the factor by which non-independence reduces your effective sample size). This isn't guaranteed (for one thing, convergence time depends on the initial state), but it is often at least a rough guide. So if you have to discard 1000 iterations, your autocorrelation time may well be around 1000. If you run for only 2000 total iterations (keeping only 1000), your effective sample size for MCMC estimates is about 2. This isn't enough even for Bayesian inference (where you might say that the standard error in estimating a mean need only be, say, a tenth of the posterior standard deviation).

    Furthermore, if you think you need to discard 1000 iterations, but have only simulated 2000 total, you can't possibly have any strong reason to believe that the chain has actually converged in 2000 iterations. You might be getting an answer that is very far from correct. Suppose, for example, that the value of some function of state for the first 1000 iterations is around 10 plus or minus 1, but at iteration 1001 this changes to a value of 30, with the value staying at 30 plus or minus 1 for the next 1000 iterations. If you stop here, you have no evidence that the chain doesn't just move back and forth between a region where the function is around 10 and a region where it is about 30, with time between transitions exponentially distributed with a mean of about 1000. You'd need to simulate for maybe 10,000 iterations before you'd have any real confidence that the chain will spend almost all its time in the region where the value is around 30.
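    A quick way to put numbers on this, assuming the post-burn-in draws of some scalar function of state are in a vector called draws (a name made up for this sketch), is to look at the effective sample size and a rough autocorrelation time with the coda package:

        library(coda)

        ch  <- mcmc(draws)           # post-burn-in draws as an mcmc object
        ess <- effectiveSize(ch)     # effective sample size after autocorrelation
        act <- niter(ch) / ess       # rough autocorrelation time
        autocorr.plot(ch)            # how quickly the autocorrelations decay
        ess; act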

  3. Andrew,
    I think your criticism of the table (Figure 1) with all of the significant figures is not valid. This is an article about a specific set of software functions, and the table is a verbatim example of the output of one of those functions. This is clearly the right thing to show. You might argue that the software shouldn't generate the table in that particular way, but that's a totally different issue.
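    Either way, trimming the displayed precision would be a one-line change in R. A sketch, assuming the summary in question is a numeric matrix called tab (a made-up name):

        print(round(tab, 2))     # round to 2 decimal places for display
        print(tab, digits = 3)   # or cap the significant digits for this call
        options(digits = 4)      # or change the session-wide default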

  4. "5. I'm disappointed that the example on page 18 uses the discredited dgamma(0.001,0.001) prior density."

    Hey! That's Rats! You can't do anything to the model: poor Andrew would have a breakdown.

    More seriously, has anyone checked if it makes a difference what prior you use on Rats?

    Bob

  5. A quick run-down of two of the charts:
    – Andrew's Point 3 (Fig 1, p.16): if the axis and title labels are removed, it is next to impossible to see the difference between these curves. If indeed they are normal curves, a simple scatter plot of the means and sds would suffice. And something similar to Andrew's suggestion of side-by-side intervals/boxplots would work well too.
    – Andrew's Point 3 (Fig 2, p.5): I second the idea of using intervals. Perhaps even have one complete line for each Justice, and mark out the mean, +/- 1sd, and +/- 2sd points on each line. These Gaussian curves are very hard to understand because heights and widths must be interpreted without forgetting that the total area has to be 1. The colors are pointless and potentially misleading: one kind of expects the progression of color to mimic the progression of "liberal-ness" at some "speed" proportional to distance, but of course this is not possible. A simple annotation such as "-> more liberal" at one end of the axis would do the job.

  6. Phil,

    Yeah, that's exactly my point. R News is for exemplary software. The display with all the digits is a bad thing to show. If it's a verbatim output of the software, then the software should be changed.

    Radford,

    I simulate multiple sequences, so your example, where a sequence jumps from 10 to 30 and then stays there, wouldn't generally present a problem, since there's no reason that 3 independent sequences would all do that.

    Regarding your first point, if my effective sample size is 2, I'm not going to see good mixing. I actually calculate an effective sample size (it's in the 2nd edition of our book and appears in the output from R2WinBUGS when you print a fitted model to the console). I'd typically want the effective sample size to be at least 100, but I usually don't see that until R-hat is less than 1.1 anyway. That is, the time it takes to verify "convergence" (really, mixing) also leaves you with a reasonable effective sample size.

    John,

    My real problem with Figure 2 on page 5 is that it displays the inferences as density curves. They are basically Gaussian and as such can each be summarized by a mean and s.e. (i.e., a point and segments); these could then be plotted on the y-axis vs. year of appointment on the x-axis.
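    Here is roughly what I mean, assuming a bugs() fit with the nine ideal points saved under the name theta and a numeric vector year of appointment years in the same order (both names are made up for this sketch):

        ## Posterior means and sds for the ideal points, from the fit's summary
        summ <- fit$summary
        rows <- grep("^theta", rownames(summ))
        est  <- summ[rows, "mean"]
        se   <- summ[rows, "sd"]

        ## Point estimates with +/- 2 s.e. segments against year of appointment
        plot(year, est, pch = 20,
             ylim = range(est - 2 * se, est + 2 * se),
             xlab = "Year of appointment", ylab = "Estimated ideal point")
        segments(year, est - 2 * se, year, est + 2 * se)

        ## R-hat and effective sample size live in the same summary
        round(summ[rows, c("Rhat", "n.eff")], 2)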

  7. I tend to prefer "caterpillar" plots for this type of thing: rank order by posterior mean (or whatever Bayes estimate you're using), and then show the width of a 95% highest posterior density region. Sometimes I'll show the marginal densities, just to be pedantic, or when I am in "expository" mode. Sometimes you're also interested in bivariate slices through the 9-dimensional joint posterior density of the Justices' ideal points. Here are some examples of how I/we do it here at PSCL… http://jackman.stanford.edu/blog/?p=37
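    For what it's worth, a bare-bones version in R, assuming the draws for the nine ideal points are in a coda mcmc object called xi with one column per Justice (the name is made up for this sketch):

        library(coda)

        est <- colMeans(as.matrix(xi))        # posterior means
        hpd <- HPDinterval(xi, prob = 0.95)   # 95% highest posterior density regions
        ord <- order(est)                     # rank order by posterior mean

        plot(est[ord], seq_along(ord), pch = 20, xlim = range(hpd),
             xlab = "Ideal point", ylab = "", yaxt = "n")
        segments(hpd[ord, "lower"], seq_along(ord),
                 hpd[ord, "upper"], seq_along(ord))
        axis(2, at = seq_along(ord), labels = rownames(hpd)[ord], las = 1)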

Comments are closed.