Robust t-distribution priors for logistic regression coefficients

Bill DuMouchel wrote:

I recently came across your paper, “A default prior distribution for logistic and other regression models,” where you suggest the Student-t as a prior for the coefficients. My application involves drug safety data and very many predictors (hundreds or thousands of drugs might be associated with an adverse event in a database). Rather than a very weakly informative prior, I would prefer to select the t-distribution scale parameter (call it tau) to shrink the coefficients toward 0 (or toward some other value in a fancier model) as much as can be justified by the data. So I want to fit a simple hierarchical model where tau is estimated. Is there an easy modification of your algorithm to adjust tau at every iteration and to ensure convergence to the MLE of tau (or the maximum posterior estimate if we add a prior for tau)? And do you know of any arguments for why regularization by cross-validation would really be any better than fitting tau by a hierarchical model, especially if the goal is parameter interpretation rather than pure prediction?

I replied:

We also have a hierarchical version that does what you want, except that the distribution for the coeffs is normal rather than t. (I couldn’t figure out how to get the EM working for a hierarchical t model. The point is that the EM for the t model uses the formulation of a t as a mixture of normals, i.e., it’s essentially already a hierarchical normal.)

We’re still debugging the hierarchical version, hope to have something publicly available (as an R package) soon.

Regarding your question about cross-validation: yes, I think a hierarchical model would be better. The point of the cross-validation in our paper was to evaluate priors for unvarying parameters, which would not be modeled hierarchically.

Bill then wrote:

I did have my heart set on a hierarchical model with t rather than normal priors, because I wanted to avoid over-shrinking very large coefficients while still “tuning” the prior scale parameter to the data empirically. (Although my worry about over-shrinking might be less urgent if I use prior information to create “batches” that can have their own centers of shrinkage, as in your in-progress hierarchical bayesglm program.)

Lee Edlefsen and I [Bill D.] are working on a drug adverse-event dataset with about 3 million rows and three thousand predictors, using logistic regression and some extensions of it, and with thousands of different response events to fit. Plus, the potential non-repeatability of MCMC results would be a real turnoff for FDA regulators and pharma industry researchers.

An EM question

I have a question for Chuanhai or Xiao-Li or someone like that: is it possible to do EM with two levels of latent variables in the model? In the usual formulation, there are data y, latent parameters z, and hyperparameters theta, and EM gives you the maximum likelihood (or posterior mode) estimate of theta, conditional on y and averaging over z. This can typically be done fairly easily because z commonly has (or can be approximated by) a simple distribution given y and theta. This scenario describes regression with fixed Student-t priors, or regression with normal priors with unknown mean and variance.
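To make that standard scenario concrete, here's a minimal sketch (just an illustration, not the algorithm in the paper) of EM for logistic regression with fixed Student-t priors, using the usual representation of a t_nu(0, tau) prior as beta_j | sigma_j^2 ~ N(0, sigma_j^2) with sigma_j^2 drawn from a scaled inverse-chi-square. The E-step computes the expected precision of each coefficient; the M-step is then a ridge-type penalized logistic regression. The names (em_t_logistic, nu, tau) are illustrative, and y is assumed coded 0/1.

```python
import numpy as np
from scipy.optimize import minimize

def em_t_logistic(X, y, nu=1.0, tau=2.5, n_iter=30):
    """EM sketch: logistic regression with independent t_nu(0, tau) priors."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        # E-step: given beta, each sigma_j^2 is scaled inverse chi-square with
        # nu + 1 degrees of freedom, so the expected precision is closed form.
        w = (nu + 1.0) / (nu * tau**2 + beta**2)
        # M-step: maximize log-likelihood plus the expected normal log-prior,
        # i.e. a logistic regression with per-coefficient ridge penalties.
        def neg_log_post(b):
            eta = X @ b
            loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
            return -(loglik - 0.5 * np.sum(w * b**2))
        beta = minimize(neg_log_post, beta, method="BFGS").x
    return beta
```

The observed-data log posterior is nondecreasing across iterations, which is the kind of built-in check a deterministic algorithm gives you; with wide data you would supply gradients or reuse an IRLS solver rather than calling BFGS, but that keeps the sketch short.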

But what about regression with t priors with unknown center and scale? There are now two levels of latent variables. Can an EM, or approximate EM, be constructed here? As Bill and I discussed in our emails, Gibbs is great, and it’s much easier to set up and program than EM, but it’s harder to debug. There’s something nice about a deterministic algorithm, especially if it’s built with bells and whistles that go off when something goes wrong.
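For what it's worth, here is one deterministic route, sketched under assumptions rather than worked out carefully: treat beta, the center mu, and the scale tau as parameters to be jointly maximized (a posterior mode) and keep only the per-coefficient variances sigma_j^2 latent, so each cycle is an ECM-type update. This sidesteps, rather than answers, the two-level question, since beta is maximized over instead of averaged over. Names are illustrative, nu is taken as known, and the closed-form updates assume beta_j | sigma_j^2 ~ N(mu, sigma_j^2) with sigma_j^2 ~ scaled inverse-chi-square(nu, tau^2).

```python
import numpy as np
from scipy.optimize import minimize

def ecm_t_logistic(X, y, nu=1.0, n_iter=100):
    """ECM-style sketch: t_nu(mu, tau) prior on the coefficients,
    with mu and tau updated at every iteration."""
    n, p = X.shape
    beta, mu, tau2 = np.zeros(p), 0.0, 1.0
    for _ in range(n_iter):
        # E-step: expected precision of each sigma_j^2 given (beta, mu, tau2),
        # from the conditional scaled inverse chi-square with nu + 1 d.f.
        w = (nu + 1.0) / (nu * tau2 + (beta - mu) ** 2)
        # CM-step 1: beta maximizes log-likelihood plus the expected normal
        # log-prior (a ridge-penalized logistic regression, as before).
        def neg_log_post(b):
            eta = X @ b
            loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
            return -(loglik - 0.5 * np.sum(w * (b - mu) ** 2))
        beta = minimize(neg_log_post, beta, method="BFGS").x
        # CM-steps 2 and 3: closed-form updates for the prior center and scale.
        mu = np.sum(w * beta) / np.sum(w)
        tau2 = p / np.sum(w)
    return beta, mu, np.sqrt(tau2)
```

The usual caveat about joint modes in hierarchical models applies: if the coefficients bunch up tightly around mu, the tau update can collapse toward zero, which is one more reason the "average over beta" version asked about above would be preferable.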

6 thoughts on “Robust t-distribution priors for logistic regression coefficients”

  1. Regarding your EM question: I suspect yes, although I haven't worked through the precise details. While there are some great probabilistic interpretations of EM, there is a very simple numerical one that makes your question easy to answer. Specifically, EM can be viewed as a form of coordinate ascent on (a lower bound to) your log posterior. That is, you fix one half of your coordinate space and optimize the other, and then do it the other way around. It's greatly advantageous in cases where the exact optimum over the free block can be found when the other half of the coordinates is fixed.

    At any rate, so long as you are always increasing the posterior at every step, you will generally converge to something. There are some pathological cases for coordinate methods where you can progress arbitrarily slowly toward an optimum, but those are, as far as I'm aware, rare in practice.

  2. Aren't variational frameworks for Bayesian inference what is needed here?

    I think that Bishop has a good chapter on these, especially as they relate to EM. Michael Jordan does gobs of work on this as well.

  3. Ted,

    That sounds like (potentially) a great idea. I don't know anything about these methods. But if anyone does, and wants to implement this model in that way, just let me know!

  4. There is no inherent difficulty with using EM in a multi-level model with two or more levels of latent variables. The problem is that the E-step may become very complicated. So MCEM is sometimes used. But that can be hard to debug too! (Since it is Monte Carlo, it may not exactly increase the likelihood/posterior.) I also used the nested EM algorithm for some models like this years ago (van Dyk, D. A. (2000). The nested EM algorithm. Statistica Sinica, 10, 203-225).

  5. David,

    Can you do something in this example (a hierarchical t model for batches of regression coefficients)? I'm happy to assume the degrees of freedom parameter is known.

  6. It obviously depends on the two levels of latent variables, but you can, for instance, estimate a mixture of t distributions using (a) the latent variables acting as component indicators and (b) the latent variables corresponding to the chi-square decomposition of Student's t. This is done in Peel and McLachlan, Statistics and Computing, 2000, 10, 339-348.
