The biggest problem in model selection?

A student writes:

One of my favorite subjects is model selection. I have read some papers in this field and know that it is widely used in almost every sub-field of statistics. I have studied some basic and traditional criteria such as AIC, BIC, and Cp. The idea is to set a consistent optimal criterion, which is usually not easy when the dimensionality is high. But my question is: what is the biggest problem, and why is it so hard?

Also, I have heard that this field has some connections to nonparametric statistics and linear model theory, but as an undergraduate student I do not know the specific connections between them. I am working in a biostatistics laboratory; are there any related problems in this field?

My reply: In my opinion, the biggest difficulty is that AIC and its relatives are all approximations, not actual out-of-sample errors. The attempt to calculate out-of-sample errors leads to cross-validation, which has its own problems. Some sort of general theory and methodology for cross-validation would be good. I'm sure people are working on this, but I don't think we're there yet. Regarding your final question: sure, just about every statistical method has biological applications. In this case, you're comparing different models you might want to fit.
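
To make the cross-validation alternative concrete, here is a minimal sketch in Python (the simulated data, the two candidate designs, and the helper cv_mse are all invented for illustration): it estimates out-of-sample squared error for two nested regression models by K-fold cross-validation, which is the quantity that AIC only approximates.

```python
# A minimal sketch of K-fold cross-validation for comparing two regression
# models by estimated out-of-sample squared error. The data and the two
# candidate designs are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)  # data generated from a linear model

def cv_mse(X, y, k=5):
    """Average held-out mean squared error over k folds, least-squares fits."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((y[fold] - X[fold] @ beta) ** 2))
    return np.mean(errs)

X_linear = np.column_stack([np.ones(n), x])         # intercept + x
X_quad = np.column_stack([np.ones(n), x, x ** 2])   # adds a quadratic term

print("CV error, linear model:   ", cv_mse(X_linear, y))
print("CV error, quadratic model:", cv_mse(X_quad, y))
```

Even this toy version shows where the difficulties come in: the estimate depends on the number of folds and on the random fold assignment, which is part of why a general theory of cross-validation is hard to pin down.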

14 thoughts on "The biggest problem in model selection?"

  1. In my grad school days a professor once described model fitting as being just as much an art as a science, and in my experience that is a pretty apt description. The likelihood-based criteria tell you when you have a better-fitting model, but they do not tell you when you have the right model. Experience and background knowledge of the area of inquiry are essential.

  2. AIC can be interpreted as a pseudo-Bayesian prior probability of a hypothesis, where simpler hypotheses are a priori more likely than more complex ones. Using AIC is akin to maximum a posteriori (MAP) estimation, where you pick the hypothesis complexity with the highest posterior probability.

    BIC is more convoluted (or I never really properly understood the derivation), but the underlying idea is to fix the "problem" with AIC where large datasets would almost always favor relatively complex hypotheses. For that reason, the number of cases is employed to discount complexity.
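
    For reference, the standard definitions make the contrast concrete:

    $$\mathrm{AIC} = -2\log\hat{L} + 2k, \qquad \mathrm{BIC} = -2\log\hat{L} + k\log n,$$

    where $\hat{L}$ is the maximized likelihood, $k$ is the number of parameters, and $n$ is the number of cases; the $\log n$ factor in BIC is what makes the complexity discount grow with the size of the dataset.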

  3. "but I don't think we're there yet [cross-validation theory]"
    On the other hand, model selection is something that a statistical consultant can do in an hour or two and bill for, or that an instructor can teach "easily" ;-)
    But more seriously, model selection is still taught and widely used without proper disclosure of its limitations.
    The usual habit in statistics of positing "a" model and then producing the "correct" method perhaps detracts from being vague when it's called for…

    Keith
    p.s. we are all assuming the purpose here is prediction?

  4. When I attended some Stat classes at a SAS Institute about 5 years ago, they spent a significant amount of time on forward selection and backward elimination. These were mixed with some consideration of AIC and BIC for model selection. Not until the last year did I finally discover the limitations of these methods.

    These methods were not typically taught as "best" methods, but the instructors seemed to imply that they were pretty good. Nothing was mentioned about the limitations. Then the many people who took these courses (business analysts, consultants, researchers) went forth without that necessary piece of knowledge.
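
    For readers who have only seen these procedures through software menus, here is a minimal sketch of forward selection by AIC for a linear model (the simulated data and the helper aic_linear are invented for illustration; this is a picture of what the procedure does, not an endorsement of it):

    ```python
    # A minimal sketch of forward selection by AIC for a Gaussian linear model.
    # Columns are added greedily as long as they lower the AIC.
    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 150, 6
    X = rng.normal(size=(n, p))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)  # only columns 0 and 2 matter

    def aic_linear(X_sub, y):
        """Gaussian AIC (up to an additive constant) for a least-squares fit."""
        ones = np.ones((len(y), 1))
        Xd = np.column_stack([ones, X_sub]) if X_sub.size else ones
        beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
        rss = np.sum((y - Xd @ beta) ** 2)
        k = Xd.shape[1] + 1                 # coefficients plus error variance
        return len(y) * np.log(rss / len(y)) + 2 * k

    selected, remaining = [], list(range(p))
    current_aic = aic_linear(X[:, selected], y)
    while remaining:
        scores = {j: aic_linear(X[:, selected + [j]], y) for j in remaining}
        best = min(scores, key=scores.get)
        if scores[best] >= current_aic:
            break                           # no candidate lowers AIC; stop
        selected.append(best)
        remaining.remove(best)
        current_aic = scores[best]

    print("selected columns:", selected, "AIC:", round(current_aic, 2))
    ```

    The sketch also points at the limitation mentioned above: the selected set depends on the particular sample, and nothing in the output carries that selection uncertainty forward.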

  5. The "Akaike prior" that is necessary to give AIC a Bayesian interpretation is really weird. It has the same precision as the likelihood. So to establish it, you need to look at your data first. Also, why should a prior have the same precision as the likelihood? Usually we would like the likelihood to overwhelm the prior, which can't happen if they must both have the same precision. I think the attempt to give AIC a Bayesian justification is very forced.

    One of the major problems with model selection is that the error in model selection is not usually incorporated into the estimation process. So we end up with a selected model and use it for inference AS IF we knew this was the correct model a priori. Overcoming this problem leads to model averaging rather than selection.

  6. Simon,

    I disagree. AIC has a Bayesian interpretation as an estimate of the out-of-sample predictive error. "Bayesian interpretation" does not have to mean "posterior probability that a model is true." It can be the summary of a posterior inference conditional on a model; in this case, it's inference about predictive error.

    Regarding your last point, I prefer continuous model expansion to discrete model averaging. See chapter 6 of Bayesian Data Analysis for more discussion.

  7. I agree that AIC is an estimate of the out-of-sample predictive error. But this is not intrinsically Bayesian. Besides, proponents of AIC use it for model selection, so it is reasonable to ask how AIC fits into a Bayesian model selection/averaging perspective. For the reasons given in my first comment, I don't think this is possible without some contortions. (Don't get me started on using "Akaike weights" for model averaging!)

    Your continuous model expansion approach looks good. But you still have to make decisions as to which way to expand a model if that model doesn't fit. Does your model expansion incorporate the error associated with the choice of the expanded model family? In any case, a common situation (at least in biology) is variable selection for regression models. In that case, discrete model averaging or model selection makes sense (at least to me!). We can add or subtract variables from a model, but it makes no sense to add "half a variable", for example. But I guess you could use a weighting scheme. But then you would have to estimate the weights too. Shudder!

  8. Simon, the proponents of AIC have actually done averaging and found that model averaging helps; see Model Selection and Multimodel Inference (http://books.google.si/books?id=BQYR6js0CC8C&dq=%22Model+Selection+and+Multimodel+Inference%22&pg=PP1&ots=i84UpegfYD&sig=XUnyMqFqu0RbnL4QxeeKhvcBMSQ&hl=sl&sa=X&oi=book_result&resnum=4&ct=result). They refer to it as "Multi-Model Inference (MMI)".

    AIC taken as a prior isn't normalized, but in practice you renormalize it. If it's taken as a prior, the rationale for the prior is out-of-sample predictive error minimization. In my own experiments I've noticed that AIC-inspired priors have very good error performance.
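
    For concreteness, the renormalization described here is what Burnham and Anderson call "Akaike weights": with $\Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min}$, model $i$ receives weight

    $$w_i = \frac{\exp(-\Delta_i / 2)}{\sum_j \exp(-\Delta_j / 2)},$$

    and the weights are then treated as model probabilities for averaging.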

  9. Aleks, yes, of course. I have that book on my shelf. If you use AIC as a prior for model averaging (with all due renormalization), you are implicitly accepting the "Akaike prior" (derived by Akaike in the 1970s; for a recent proselytising reference, see Sociological Methods & Research 33(2), November 2004, 261-304). The question is, why should I choose this prior as a reference prior, especially given that it has the weird properties I mentioned above? Burnham and Anderson are happy with this, and even give it a snappy name: "savvy". (In my opinion, clever language should not be used to disguise problematic analysis methods.)

    If we are considering how we might view AIC-based model averaging/selection through a Bayesian paradigm, we have to remember that the prior on each model reflects our prior belief that that model is the "true" model (Yes, all models are wrong, etc. etc.). It is not clear to me why a prior which represents an estimate of the out-of-sample prediction error should also represent our prior belief that that model is true. Now it may turn out that AIC-based methods work pretty well, in the sense that the results are useful, but that should not be the only criterion: philosophical justification and consistency with current Bayesian theory should also be sought, particularly if we are to call ourselves Bayesian statisticians.

  10. Simon,

    Bayesian inference is conditional on a model. I would like to make my models better. I agree there is no formula for continuous model expansion; decisions of how to expand the model can be made in light of one's goals and understanding of the problem at hand.

    Regarding AIC etc.: I think that out-of-sample predictive error is a reasonable way to compare models. AIC and DIC can be given Bayesian interpretations as data-based estimates (approximate posterior means) of out-of-sample predictive error. I would not use AIC to compute the posterior probability that a model is true, for the simple reason that in almost no circumstances do I ever compute the posterior probability that a model is true. I rarely think this is a meaningful or useful question to ask.

    Regarding "philosophical justification and consistency with current Bayesian theory": Please take a look at the DIC literature (Spiegelhalter et al., etc) and at chapter 6 of Bayesian Data Analysis. It is perfectly consistent with current Bayesian theory to use Bayesian methods to estimate quantities of interest such as predictive error.

  11. Hi Andrew,

    I agree with everything you have said! I haven't read much of the DIC literature, but I'm interested in missing data problems, and there is a paper by Celeux et al. on DIC with missing data (2006, Bayesian Analysis 1(4)) with several commentaries. Some of the commentaries criticise the whole DIC approach as lacking a strong theoretical foundation. I guess the jury is still out on that. I like Brad Carlin's comment, "Model choice is to Bayesians what multiple comparisons is to frequentists: a really hard problem for which there exist several potential solutions, but no consensus choice."

    I agree that AIC should not be used to calculate the posterior probability that a model is true. But the proponents of AIC for model averaging are advocating exactly that!

    I have no difficulty accepting that Bayesian methods can be used to estimate useful quantities such as predictive error. And I think AIC can be used to compare models. But it should not be the only tool in the toolbox. And it may not be the best tool, either.

    Cheers, Simon.

  12. Simon, I was one of the discussants of Celeux et al. (2006), and I did criticise DIC for lacking a theoretical foundation. I came back to this topic in this paper, which I think does provide a solid foundation for penalized deviance statistics.

    There is good news and bad news in the paper. The good news is that DIC can be justified as an approximation to a penalized loss function. The penalty represents a rational price that must be paid for using the data twice (once for parameter estimation and once for assessing goodness of fit of the model). The bad news is that the DIC approximation only holds under asymptotic conditions that are easily broken in hierarchical models.

    The paper also explores a related criterion – the penalized expected deviance – which is more universally applicable, in the sense that it does not rely on the existence of a good point estimate of the parameters. I am still trying to implement this properly in JAGS.

  13. Hi, Martyn. Since we seem to be communicating here . . . let's find a time to talk about redundant parameterization for hierarchical models in JAGS!
