Scaled inverse-Wishart model for covariance matrices of group-level regression coefficients, as arising in a problem in marketing

Michael Braun writes:

For the last couple of months, I’ve been reflecting on your recent Bayesian Analysis article on prior distributions for variance parameters in hierarchical models. As a marketing researcher who uses Bayesian methods extensively (and a recent student of Eric Bradlow), I am interested in how your findings might extend to the multivariate case. I’m hoping you can help me understand some issues related to the following problem.

Suppose we have a standard linear hierarchical model, where the observed data are y[i,j] ~ N(b[i]'x[i,j], 1), where i indexes households, j indexes time, b[i] is a k-dimensional household-specific vector of coefficients, and x[i,j] is a vector of covariates for household i at time j. Think of y[i,j] as sales for household i in month j, with the x's being covariates such as price, advertising, and promotions.

We put a prior b[i] ~ MVN(mu, Sigma), and then put hyperpriors on mu and Sigma. Mu is, obviously, the mean of the distribution of coefficients across households, and Sigma, the covariance matrix, contains information about the degree of heterogeneity in responsiveness to the covariates across the population (e.g., sensitivity to price) and about the correlations (e.g., whether households that are price sensitive are also susceptible to advertising). In marketing, we are interested in posterior distributions not only on mu but on Sigma as well.
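To fix ideas, here is a minimal simulation sketch of this setup; the dimensions, parameter values, and covariate distribution are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_households, n_months, k = 100, 24, 3    # i, j, and coefficient dimension

mu = np.array([1.0, -0.5, 0.3])           # population means of the coefficients
Sigma = np.array([[0.50, 0.10, 0.00],     # heterogeneity across households
                  [0.10, 0.30, 0.05],
                  [0.00, 0.05, 0.20]])

# Household-level coefficients b[i] ~ MVN(mu, Sigma):
b = rng.multivariate_normal(mu, Sigma, size=n_households)     # (100, 3)

# Covariates (price, advertising, etc.) and sales y[i,j] ~ N(b[i]'x[i,j], 1):
x = rng.normal(size=(n_households, n_months, k))
y = np.einsum("ik,ijk->ij", b, x) + rng.normal(size=(n_households, n_months))
```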

OK, nothing interesting so far. But knowing what I know about putting gamma(0.001, 0.001) priors on precision parameters (and the amplified shrinkage that these priors can yield), I was wondering if the same thing might happen in the multivariate case. In particular, I thought that one might get better estimates by, instead of putting a Wishart prior on the precision matrix inv(Sigma), putting flat priors on elements of a decomposed covariance matrix, a la Barnard, McCulloch, and Meng.
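For comparison, here is a quick numerical look at that univariate prior; a minimal sketch (the cutoffs are arbitrary) showing where the gamma(0.001, 0.001) prior puts its mass on the precision scale:

```python
import numpy as np
from scipy import stats

# Gamma(0.001, 0.001) on a precision, in scipy's shape/scale parameterization:
prior = stats.gamma(a=0.001, scale=1 / 0.001)
draws = prior.rvs(size=100_000, random_state=0)

print(np.mean(draws < 1e-10))   # most prior draws are essentially zero
print(np.mean(draws > 1))       # only a sliver of mass at moderate precisions
```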

I started simulating data but could not generate a data set/prior combination in which the posterior distributions of the b's and mu's are shrunk toward zero. It then occurred to me that in order for the Wishart prior to be proper, the number of degrees of freedom (v) must be at least as large as the dimensionality of the precision matrix (k). So the marginal prior on each diagonal element of the precision matrix is gamma with shape v/2 (equivalently, a scaled chi-squared with v degrees of freedom). If k > 2, then v/2 > 1 and the prior has an interior mode. The marginal density is finite at zero, without the huge probability mass at very small values.
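That claim about the marginal is easy to check by simulation. A small sketch, with the dimension and degrees of freedom picked arbitrarily, comparing the diagonal of Wishart draws to a chi-squared with v degrees of freedom:

```python
import numpy as np
from scipy import stats

k, v = 3, 5                      # dimension and degrees of freedom, v >= k
S = np.eye(k)                    # identity scale matrix, for simplicity

W = stats.wishart(df=v, scale=S).rvs(size=50_000, random_state=0)
diag = W[:, 0, 0]                # one diagonal element of the precision matrix

# Its distribution should match S[0,0] times a chi-squared with v df:
print(diag.mean(), stats.chi2(v).mean())          # both close to v = 5
print(np.median(diag), stats.chi2(v).median())    # medians agree too
```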

My conclusion from this (and this is where I could use your insight) is that a Wishart prior on a precision matrix should work just fine in the multivariate case, whereas the gamma prior fails in the univariate case. As long as the scale matrix in the Wishart hyperprior on Sigma is sufficiently "large," one could still do quite well staying in this conjugate setting.

My response:

First, I think when you discuss a Wishart prior on a covariance matrix, you're actually referring to the inverse-Wishart (i.e., the Wishart model for the inverse of the covariance matrix). Getting to the specific question: I haven't looked into the details of the model to this extent, so I'm not sure about all the technical issues. But my impression is that, no, the inverse-Wishart is not such a good idea. The trouble is that the inverse-Wishart constrains the variance parameters: a single degrees-of-freedom parameter governs the uncertainty in all of them at once, and the prior ties the variances to the correlations.
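Here is one way to see the constraint numerically; a minimal sketch (the dimension, degrees of freedom, and 90% cutoff are arbitrary choices) showing that inverse-Wishart draws with large variances tend to come with more extreme correlations:

```python
import numpy as np
from scipy import stats

k, v = 3, 5
draws = stats.invwishart(df=v, scale=np.eye(k)).rvs(size=20_000, random_state=0)

variances = draws[:, 0, 0]
corrs = draws[:, 0, 1] / np.sqrt(draws[:, 0, 0] * draws[:, 1, 1])

# Draws with large variances tend to carry more extreme correlations:
big = variances > np.quantile(variances, 0.9)
print(np.abs(corrs[big]).mean(), np.abs(corrs[~big]).mean())
```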

Instead, I recommend a scaled inverse-Wishart model, as we discuss in Section 13.3 of our forthcoming book. The idea is to break up the covariance matrix into a diagonal matrix of scale parameters and an unscaled covariance matrix, which is given the inverse-Wishart distribution. This larger model is still conditionally conjugate on the larger space. I took this model from this paper by James O'Malley and Alan Zaslavsky.
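Here is a minimal sketch of the construction; the lognormal priors on the scale parameters are just one illustrative choice, not necessarily what O'Malley and Zaslavsky use:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k = 3
v = k + 1                          # degrees of freedom for the unscaled part

# Unscaled covariance matrix with an inverse-Wishart prior:
Q = stats.invwishart(df=v, scale=np.eye(k)).rvs(random_state=0)

# Scale parameters with their own (here, lognormal) priors:
xi = rng.lognormal(mean=0.0, sigma=1.0, size=k)

# One draw from the implied prior on the covariance matrix:
Sigma = np.diag(xi) @ Q @ np.diag(xi)

# The xi's stretch each standard deviation freely, while Q carries the
# correlation structure (the correlations of Sigma and Q are identical):
print(np.sqrt(np.diag(Sigma)), xi * np.sqrt(np.diag(Q)))
```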

3 thoughts on "Scaled inverse-Wishart model for covariance matrices of group-level regression coefficients, as arising in a problem in marketing"

  1. I heard somewhere that the Wishart does bad things to the eigenvectors of the precision matrix, but I can't remember the details.

    Perhaps not so useful here, but for longitudinal data, this paper has a neat idea:

    Smith, M., and Kohn, R. (2002). Parsimonious covariance matrix estimation for longitudinal data. Journal of the American Statistical Association 97, 1141-1153.

    Put Gaussian priors on the Cholesky decomposition of the covariance matrix. I've tried it on some singing crickets, and it seemed to work. (A rough sketch of the idea appears below.)

    Bob
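Here is a rough sketch of the Cholesky idea mentioned in the comment above; this is a generic version with Gaussian priors on the unconstrained elements, not necessarily Smith and Kohn's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3

# Gaussian draws for the k(k+1)/2 free elements of a lower-triangular factor:
z = rng.normal(size=k * (k + 1) // 2)
L = np.zeros((k, k))
L[np.tril_indices(k)] = z

# Exponentiate the diagonal so the factor, and hence the covariance matrix,
# stays positive-definite:
L[np.diag_indices(k)] = np.exp(np.diag(L))

Sigma = L @ L.T
print(np.linalg.eigvalsh(Sigma))    # eigenvalues are all positive
```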
