So many predictors, so little time

Ilya Eliashberg sent in a question regarding the blessing of dimensionality. I’ll give his question and my response.

Eliashberg:

I’ve been poring over your book (Bayesian Data Analysis), and there is one aspect that either I didn’t understand or that wasn’t fully addressed. In particular, do you have any thoughts on how to apply Bayesian logistic regression in the presence of very high levels of multicollinearity among almost all the variables?

My particular problem entails several hundred binary inputs, most of which are weakly correlated with a binary output and strongly correlated with each other. I’m finding that when I apply Bayesian logistic regression (with N(0, m) priors on the regression coefficients), out-of-sample performance initially improves with the first few inputs but then quickly drops off as I add more inputs into the regression (despite each input’s positive correlation with the target variable).

Would you have any suggestions on how this could be addressed in a Bayesian model (other than transforming the data)?

One additional piece of information I can add about the problem is that when I perform the regression on only a small hand-selected subset of the inputs, the best performance comes from the Bayesian logistic approach. When I want to use all the available data, the best classification is actually achieved using Partial Least Squares regression (using the top few factors). However, that seems like a suboptimal approach, since the binary inputs represent physical observations (is a particular characteristic observed or not?), so a factor transformation like PLS or PCR seems unnatural.

My response:

This is certainly a real concern, and it’s not something we said much about in our book. My first thought is to combine the predictors to make “scores,” so that instead of a few hundred predictors, you could just start with a few of these composite scores. You could then also throw the individual predictors into the model, but hierarchically, in batches with prior distributions centered at 0 and with variances estimated from the data. The idea would be, first, to include the big things that should go into the prediction and then to include the individual predictors to the extent that they are needed to fit the data.
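To make that concrete, here is a minimal sketch of the idea, assuming PyMC for the inference. The data X and y, the assignment of inputs to equal-sized batches, and the definition of the composite scores are all hypothetical placeholders; in a real problem the scores and batches would be chosen on substantive grounds.

```python
# Sketch: composite scores plus batched hierarchical coefficients,
# for Bayesian logistic regression with many correlated binary inputs.
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
n, p, n_batches = 500, 300, 10                         # hypothetical sizes
X = rng.integers(0, 2, size=(n, p)).astype(float)      # placeholder binary inputs
y = rng.integers(0, 2, size=n)                         # placeholder binary outcome
batch = np.repeat(np.arange(n_batches), p // n_batches)  # which batch each input is in

# Composite scores: here simply the within-batch means of the binary inputs;
# in practice these would be substantively chosen combinations.
scores = np.column_stack([X[:, batch == b].mean(axis=1) for b in range(n_batches)])

with pm.Model() as model:
    alpha = pm.Normal("alpha", 0, 2)
    # Coefficients on the few composite scores: the "big things" go in first.
    gamma = pm.Normal("gamma", 0, 2, shape=n_batches)
    # Individual predictors enter hierarchically: each batch shares a scale
    # estimated from the data, so a batch whose inputs add nothing beyond
    # its score gets shrunk toward zero instead of soaking up noise.
    sigma = pm.HalfNormal("sigma", 1, shape=n_batches)
    beta = pm.Normal("beta", 0, sigma[batch], shape=p)
    eta = alpha + pm.math.dot(scores, gamma) + pm.math.dot(X, beta)
    pm.Bernoulli("y_obs", logit_p=eta, observed=y)
    idata = pm.sample(1000, tune=1000, target_accept=0.9)
```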

I talk a little about such models in Section 5.2 of this paper, but I can’t say I’ve ever actually put in the effort to do it in a real example. It’s really a research problem, but one that could see some progress, I think, if focused on a particular problem of interest.

1 thought on “So many predictors, so little time”

  1. Wouldn't a good solution here be to put a very vague hyperprior on the variance of the regression coefficients? That way highly correlated inputs don't have excessive variance assigned to them in the beginning.
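In case it helps to make the comment concrete, here is a minimal sketch (again assuming PyMC) of putting a vague hyperprior on the common coefficient scale, rather than fixing it at m as in the questioner's N(0, m) prior; the half-Cauchy is just one conventional weakly informative choice.

```python
import pymc as pm

p = 300  # hypothetical number of inputs
with pm.Model():
    tau = pm.HalfCauchy("tau", beta=2.5)       # vague hyperprior on the common scale
    beta = pm.Normal("beta", 0, tau, shape=p)  # coefficients shrink jointly via tau
```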
