Optimal Metropolis scaling in a Dirichlet process/Label Switching world

Michael Braun writes,

My particular problem involves a Dirichlet process mixture, but I think that the issue I describe also applies to finite mixtures where label switching can be a problem.

I am trying to construct a reasonably “close to optimal” adaptive scaling algorithm for a random walk Metropolis update, within a Gibbs sampler, for a model that includes a Dirichlet process mixture.

My likelihood is not conjugate to the base measure of the Dirichlet process. I am able to update the “class” assignments without much problem (I use Neal’s Algorithm 8; JCGS, 2000). The problem occurs when I want to do a subsequent posterior update of the support points of the underlying Dirichlet process. Since the likelihood and DP base measure are not conjugate, Metropolis is the way to go.
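For readers who haven't seen this setup, here is a minimal sketch of it: one sweep of Neal's Algorithm 8 over the class assignments, followed by random-walk Metropolis updates of the occupied support points. It is written in Python rather than Umacs, and the one-dimensional Gaussian kernel, Gaussian base measure G0, and all constants are illustrative assumptions, not Michael's actual model.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, tau, alpha, m = 1.0, 3.0, 1.0, 3   # kernel sd, G0 sd, DP concentration, # auxiliaries

def loglik(y_i, phi):
    # Gaussian kernel with known sd (an assumption for this sketch)
    return -0.5 * ((y_i - phi) / sigma) ** 2

def algorithm8_sweep(y, z, phi):
    """Resample each assignment z[i]; phi holds the currently occupied support points."""
    for i in range(len(y)):
        counts = np.bincount(z, minlength=len(phi))
        counts[z[i]] -= 1                          # remove observation i
        singleton = counts[z[i]] == 0
        aux = rng.normal(0.0, tau, size=m)         # auxiliary support points drawn from G0
        if singleton:
            aux[0] = phi[z[i]]                     # keep a singleton's phi as the first auxiliary
        cand = np.concatenate([phi, aux])
        logw = loglik(y[i], cand)
        logw[:len(phi)] += np.log(np.maximum(counts, 1e-300))   # occupied clusters: weight n_{-i,c}
        logw[len(phi):] += np.log(alpha / m)                    # auxiliaries: weight alpha/m
        if singleton:
            logw[z[i]] = -np.inf                   # the emptied slot is no longer a candidate
        w = np.exp(logw - logw.max()); w /= w.sum()
        c = rng.choice(len(cand), p=w)
        if c >= len(phi):                          # opened a new cluster
            phi = np.append(phi, cand[c])
            c = len(phi) - 1
        z[i] = c
        keep = np.bincount(z, minlength=len(phi)) > 0            # drop empty clusters
        phi, z = phi[keep], np.cumsum(keep)[z] - 1
    return z, phi

def metropolis_update(y, z, phi, step=0.5):
    """Non-conjugate case: random-walk Metropolis on each occupied support point."""
    for c in range(len(phi)):
        prop = phi[c] + step * rng.normal()
        cur = loglik(y[z == c], phi[c]).sum() - 0.5 * (phi[c] / tau) ** 2
        new = loglik(y[z == c], prop).sum() - 0.5 * (prop / tau) ** 2
        if np.log(rng.uniform()) < new - cur:
            phi[c] = prop
    return phi

# toy run: start everything in one cluster
y = np.concatenate([rng.normal(-2, 1, 20), rng.normal(2, 1, 20)])
z, phi = np.zeros(40, dtype=int), np.array([0.0])
for _ in range(200):
    z, phi = algorithm8_sweep(y, z, phi)
    phi = metropolis_update(y, z, phi)
```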

Most of the updating algorithms that I have seen use a history of draws to construct an “empirical” covariance matrix that gives the shape of the proposal distribution, and then some measure of historical acceptance probabilities or numerical efficiency to scale that proposal. But in the Dirichlet process or label-switching case, the history of these draws is meaningless, since the assignments bounce around among indices from iteration to iteration. How could I update the proposal for class 1 in this way, when *maybe* it was class 4 before, but maybe not (and we wouldn’t really know until the end of the MCMC run, after post-processing the draws)? The same issue arises in assessing convergence of independent chains in real time.
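For concreteness, here is a bare-bones sketch of the kind of history-based adaptation being described, in the spirit of Haario et al.'s adaptive Metropolis: the proposal shape is the empirical covariance of past draws, and the scale is nudged toward a target acceptance rate. The target density, adaptation schedule, and constants are illustrative assumptions; this is exactly the scheme that label switching undermines.

```python
import numpy as np

def adaptive_rw_metropolis(logpost, x0, n_iter=5000, target_accept=0.234, seed=0):
    """History-based adaptive random-walk Metropolis (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d = len(x)
    lp = logpost(x)
    draws = np.empty((n_iter, d))
    cov = np.eye(d)          # proposal shape, re-estimated from the history of draws
    log_scale = 0.0          # proposal scale, tuned from the acceptance rate
    for t in range(n_iter):
        prop = x + np.exp(log_scale) * rng.multivariate_normal(np.zeros(d), cov)
        lp_prop = logpost(prop)
        accept = np.log(rng.uniform()) < lp_prop - lp
        if accept:
            x, lp = prop, lp_prop
        draws[t] = x
        if t > 2 * d and t % 100 == 0:
            cov = np.cov(draws[:t].T) + 1e-6 * np.eye(d)   # empirical covariance of the history
        log_scale += (float(accept) - target_accept) / np.sqrt(t + 1)
    return draws
```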

I am running this model in Umacs (which I highly recommend), but I am getting some crazy proposal draws, probably for the reasons I describe here.

One crude method I am considering is just using a spherical normal proposal, with scale adjusted using acceptance probabilities across all parameters, but this seems like it would be highly inefficient. A good “history-independent” proposal density would be ideal.
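A sketch of that crude fallback, for comparison: a single spherical normal proposal whose scale is tuned only from the running acceptance rate, so nothing in the adaptation depends on which index a cluster happens to carry. The 0.234 target and the Robbins-Monro-style step size are illustrative assumptions.

```python
import numpy as np

def spherical_rw_step(x, lp, logpost, log_scale, rng, t, target_accept=0.234):
    """One label-agnostic update: isotropic jump, scale tuned from acceptance only."""
    prop = x + np.exp(log_scale) * rng.normal(size=len(x))
    lp_prop = logpost(prop)
    accept = np.log(rng.uniform()) < lp_prop - lp
    if accept:
        x, lp = prop, lp_prop
    # only the acceptance indicator feeds the adaptation, never the (label-dependent) draws
    log_scale += (float(accept) - target_accept) / np.sqrt(t)
    return x, lp, log_scale
```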

So I am hoping someone out there has experienced this problem and has a solution already (or might even want to collaborate on finding one!).

My reply:

1. I’d suggest using t_4 jumps (or some other long-tailed distribution) rather than Gaussian; see the first sketch below. This isn’t implemented in Umacs but I think it would be nearly trivial to alter Umacs to do this.

2. Regarding the label-switching problem, perhaps you can reorder the parameters after each draw to get a consistent labeling.
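A minimal sketch of both suggestions, in Python rather than Umacs. For point 1, a multivariate t_4 jump can be drawn as a shaped Gaussian step divided by sqrt(chi^2_4 / 4); the proposal remains symmetric, so the Metropolis acceptance ratio is unchanged. The scale matrix and dimension are illustrative assumptions.

```python
import numpy as np

def t4_jump(x, chol_scale, rng, df=4):
    """Heavy-tailed t_df proposal step (symmetric, so plain Metropolis applies)."""
    z = chol_scale @ rng.normal(size=len(x))   # shaped Gaussian step
    u = rng.chisquare(df) / df                 # chi^2_df / df mixing variable
    return x + z / np.sqrt(u)

rng = np.random.default_rng(0)
proposal = t4_jump(np.zeros(3), 0.5 * np.eye(3), rng)
```

For point 2, one simple ordering (an illustration, not a prescription) is to sort the clusters after each draw by some functional of their parameters, here the first coordinate of each support point, and relabel the assignments to match, so that "class 1" means the same thing from iteration to iteration.

```python
def relabel_by_order(phi, z):
    """phi: (k, d) array of support points; z: integer assignments in 0..k-1."""
    order = np.argsort(phi[:, 0])           # canonical ordering of the clusters
    rank = np.empty_like(order)
    rank[order] = np.arange(len(order))     # map old label -> new label
    return phi[order], rank[z]
```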

2 thoughts on "Optimal Metropolis scaling in a Dirichlet process/Label Switching world"

  1. There's an article at the Bayesian Analysis website (on the "forthcoming" page) that discusses related issues, although not the issue of optimal Metropolis scaling. It focuses on a partition approach instead of a labelled cluster approach; the authors assert that this avoids problems due to label-switching and changes in the dimensionality of the parameter space. This paper won't answer the query, but it may provide useful ideas and pointers into the literature.

  2. Theoretically, you can't, because the labels are exchangeable across samples under the DP. You're restricted to posterior estimates that marginalize over label assignments, such as the posterior probability that two outcomes are assigned the same label (sketched after this comment), or the predictive probability of one outcome given a multiset of other outcomes.

    In practice, re-ordering of the type Andrew suggests can be surprisingly effective. Griffiths and Steyvers' 2007 paper "Probabilistic topic models" (in a book, but findable online) does something like Andrew suggests in the latent Dirichlet allocation setting (a Dirichlet prior over mixtures of multinomials), using Gibbs sampling over label assignments with the priors marginalized out. They evaluate greedy alignment using the symmetrized KL divergence between the topic multinomials (also sketched below). You could also measure divergence between the posterior Dirichlets, as Blei et al. suggest (2006 BMC Bioinformatics paper).

    There've also been some papers on concurrent/parallel sampling, which should fall prey to problems arising from exchangeability in theory, but in practice do not (Newman et al. 2007, NIPS, "Distributed inference for latent Dirichlet allocation").
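Two small sketches of the ideas in this comment, under assumed inputs. First, the label-invariant summary mentioned above: the posterior probability that two observations share a label, estimated by averaging an indicator over saved assignment vectors (here an assumed (n_draws, n_obs) integer array z_draws).

```python
import numpy as np

def coclustering_matrix(z_draws):
    """Posterior probability that observations i and j are assigned the same label."""
    z = np.asarray(z_draws)                      # (n_draws, n_obs) integer labels
    same = z[:, :, None] == z[:, None, :]        # (n_draws, n, n) same-label indicator
    return same.mean(axis=0)
```

Second, a greedy alignment of the kind described: repeatedly pair the remaining topics from two runs (or draws) with the smallest symmetrized KL divergence between their word distributions. The (k, vocab) input arrays are assumptions; this is an illustration, not the cited authors' code.

```python
def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two probability vectors."""
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

def greedy_align(topics_a, topics_b):
    """Greedily match topics in run A to topics in run B by smallest symmetrized KL."""
    k = len(topics_a)
    cost = np.array([[sym_kl(a, b) for b in topics_b] for a in topics_a])
    mapping = {}
    for _ in range(k):
        i, j = np.unravel_index(np.argmin(cost), cost.shape)
        mapping[int(i)] = int(j)
        cost[i, :] = np.inf                 # row i and column j are now taken
        cost[:, j] = np.inf
    return mapping
```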
