Defining what is random in y^rep

David from Alicia Carriquiry’s class at Iowa State asks,

We have a question on Gelman’s “A Bayesian formulation of exploratory data analysis and goodness-of-fit testing.” He writes, “Gelman, Meng, and Stern (1996) made a case that model checking warrants a further generalization of the Bayesian paradigm, and we continue that argument here. The basic idea is to expand from p(y|theta)p(theta) to p(y|theta)p(theta)p(y^rep), where y^rep is a replicated data set of the same size and shape as the observed data.” (p. 5 of the PDF)

Here and elsewhere he makes it sound like the distribution of y^rep is something that must be specified in addition to the prior and the likelihood. But don’t the prior and the likelihood (plus the data at hand) force a unique choice for the distribution of y^rep? I thought this was how we defined the replication distribution in 544 (and it seems they do the same in the 1996 paper; I’ve only skimmed it so far). y^rep follows the posterior predictive distribution.

Why does y^rep provide an extra “degree of freedom”/force modelers to do more thinking?

My reply: the choice in defining y^rep is what to condition on. For example, in a study with n data points, should y^rep be of length n, or should n itself be modeled (e.g., by a Poisson distribution), so that you first simulate n^rep and then y^rep? Similar questions arise with hierarchical models: do you simulate new data from the same groups or from new groups?
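A minimal sketch of the sample-size choice, using a toy normal model with stand-in posterior draws (the specific model, the Poisson rate, and all the numbers here are illustrative assumptions, not anything from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: y_i ~ Normal(theta, 1), with posterior draws of theta
# already in hand (faked here for illustration).
n = 20
theta_draws = rng.normal(loc=5.0, scale=0.3, size=1000)  # stand-in posterior draws

# Option 1: condition on the observed sample size, so every y^rep has length n.
yrep_fixed_n = np.array([rng.normal(th, 1.0, size=n) for th in theta_draws])

# Option 2: treat n as random too, e.g. n ~ Poisson(lambda); first simulate
# n^rep, then y^rep of that length.  (Taking lambda = n is an assumption.)
yrep_random_n = [rng.normal(th, 1.0, size=rng.poisson(n)) for th in theta_draws]

# Same prior, same likelihood, same posterior for theta -- but two different
# replication distributions: option 1's replicates all have length n, while
# option 2's lengths vary from draw to draw.
lengths = {len(y) for y in yrep_random_n}
```

Any test statistic that depends on sample size (the maximum, say) will have a different reference distribution under the two options, which is the sense in which y^rep is an extra modeling choice.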

This topic interests me because the directed graph of the model determines the options for what can be conditioned on. For example, we could replicate y, or replicate both theta and y, but it doesn’t make sense to replicate theta conditional on the observed value of y. In Bayesian inference (or, as the CS people call it, “learning”), all that matters is the undirected graph. But in Bayesian data analysis, including model checking, the directed graph is also relevant. This comes up in Jouni’s thesis.
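The same point can be sketched for a simple hierarchical model, theta_j ~ N(mu, tau) and y_j ~ N(theta_j, sigma): replicating y alone gives new data for the same groups, while replicating theta and y together gives new groups, following the arrows of the directed graph. (All draws below are stand-ins for real posterior simulations; the model and numbers are illustrative assumptions.)

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical hierarchical model: theta_j ~ N(mu, tau), y_j ~ N(theta_j, sigma).
# Fake S posterior draws of (mu, tau, theta_1..theta_J) for illustration.
S, J, sigma = 1000, 8, 1.0
mu = rng.normal(0.0, 0.5, size=S)
tau = np.abs(rng.normal(1.0, 0.2, size=S))
theta = mu[:, None] + tau[:, None] * rng.normal(size=(S, J))  # stand-in draws

# Replicate y only: new data for the *same* groups, conditioning on theta.
yrep_same_groups = rng.normal(theta, sigma)

# Replicate theta and y: draw *new* groups from the population distribution,
# then data for them -- each step follows an arrow in the directed graph.
theta_rep = rng.normal(mu[:, None], tau[:, None], size=(S, J))
yrep_new_groups = rng.normal(theta_rep, sigma)

# There is no analogous "replicate theta given the observed y" step:
# the directed graph has no arrow from y back to theta.
```

Both replications use the same undirected graph; only the directed structure distinguishes them, which is why it matters for model checking even though it is irrelevant for posterior inference.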