Model selection criteria and information theory

Carlos Rodriguez has a paper on a new model selection criterion he calls CIC, which he justifies using information theory. I confess to being confused by this sort of reasoning (it’s easier for me to think of models than of bit streams), but it looks potentially interesting.

I’m skeptical of some of the claims made for BIC (the so-called Bayesian information criterion). I’m more of a fan of the DIC (deviance information criterion) of Spiegelhalter et al., but in practice it can be unstable to compute.
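To make that instability concrete, here is a minimal sketch of the standard DIC computation from posterior draws, for a toy normal model with known scale. The data, draws, and names (y, mu_draws, sigma) are invented for illustration. Note that the penalty p_D depends on a plug-in point (the posterior mean), which is one source of the instability; a variance-based alternative, p_V, is shown alongside.

```python
import numpy as np
from scipy import stats

# Toy setup: data y from a normal model with known scale sigma.
# In practice mu_draws would come from MCMC; here they are faked.
rng = np.random.default_rng(0)
y = rng.normal(1.0, 2.0, size=50)
sigma = 2.0
mu_draws = rng.normal(y.mean(), sigma / np.sqrt(len(y)), size=4000)

def deviance(mu):
    """D(mu) = -2 * log-likelihood of y under N(mu, sigma^2)."""
    return -2.0 * stats.norm.logpdf(y, loc=mu, scale=sigma).sum()

D_draws = np.array([deviance(m) for m in mu_draws])
D_bar = D_draws.mean()                 # posterior mean deviance
D_at_mean = deviance(mu_draws.mean())  # deviance at the posterior mean
p_D = D_bar - D_at_mean                # effective number of parameters
DIC = D_bar + p_D                      # equivalently D_at_mean + 2 * p_D

# A variance-based alternative penalty (half the posterior variance
# of the deviance); it avoids the plug-in point but can be noisier:
p_V = D_draws.var(ddof=1) / 2.0
print(f"p_D = {p_D:.2f}, p_V = {p_V:.2f}, DIC = {DIC:.2f}")
```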

3 thoughts on “Model selection criteria and information theory”

  1. I worry that all these *ICs miss the point of much modelling, which is to get a reasonable summary of the data that can be used for inferential purposes rather than for predictive purposes (which is what *ICs seem to be intended for). I care less about optimising closeness to a non-existent true model than I do about getting a model that summarises a lot of the variation in the data with relatively few parameters. The trade-off between the two will depend in part on the purpose of the study, and on things like whether it is a designed experiment or observational data: things that cannot be measured by an *IC.

    It would be useful to have an *IC devised with a sliding penalty term, so that we can be explicit about the amount of complexity we want (rather than "cheating" by choosing whichever *IC gives the "best" result). A toy sketch of such a criterion follows this comment.

    No, I don't know how to devise one either, but someone might come up with a bright idea.

    Bob
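    Here is a minimal sketch of the sliding-penalty idea above. This is not an established criterion, just an illustration: treat the penalty multiplier as an explicit knob, with lam = 2 recovering AIC's penalty and lam = log(n) recovering BIC's. The model log-likelihoods and sample size below are invented.

    ```python
    import numpy as np

    def generalized_ic(log_lik, k, lam):
        """Hypothetical sliding-penalty criterion: -2 log L + lam * k.
        lam = 2 recovers AIC's penalty; lam = log(n) recovers BIC's."""
        return -2.0 * log_lik + lam * k

    # Invented (log-likelihood, #parameters) pairs for two fitted models.
    models = {"small": (-120.3, 3), "large": (-112.8, 9)}
    n = 100  # sample size, also invented

    # Sweep the penalty explicitly instead of letting one *IC decide.
    for lam in (1.0, 2.0, np.log(n), 2 * np.log(n)):
        scores = {name: generalized_ic(ll, k, lam)
                  for name, (ll, k) in models.items()}
        best = min(scores, key=scores.get)
        print(f"lam = {lam:5.2f}: prefer '{best}'")
    ```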

  2. The best way of understanding AIC/BIC/CIC is to exponentiate minus half of it and see that it decomposes into the likelihood times a "prior". By minimizing CIC, one is essentially seeking the MAP model. However, one could also use the same "prior" to integrate over models in predictive modelling (a small numeric sketch appears at the end of this comment).

    I refer to it as a "prior" because it depends on the amount of data. The actual motivation, though, is a frequentist analysis: attempting to maximize the likelihood of a model that is fitted from only part of the data.

    If you "evaluate" these AIC/BIC/CIC methods, the winning "prior" will clearly be the one closest to the given data. I'm not sure how much we've learned from that.

    Nonetheless, I appreciate the fact that Rodriguez is trying to provide a prior based on the model's geometric complexity. A similar goal is pursued with universal priors.

    As for DIC: it attempts to characterize the variance in the log-posterior probability and to penalize the likelihood with it, a kind of "derived prior". One wonders whether there is a way of assessing the posterior variance that wouldn't require as much sampling.
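    For concreteness, the exponentiation described above is essentially the familiar information-criterion-weights construction. A minimal sketch with made-up *IC values:

    ```python
    import numpy as np

    # Made-up *IC values for three candidate models.
    ics = np.array([210.4, 208.1, 212.9])

    # exp(-IC/2) factors each criterion into likelihood times the implicit
    # "prior" encoded by the penalty; normalizing gives model weights.
    deltas = ics - ics.min()   # shift by the minimum for numerical stability
    weights = np.exp(-0.5 * deltas)
    weights /= weights.sum()

    # Minimizing the IC is then just picking the "MAP" model...
    map_model = int(np.argmin(ics))

    # ...but the same weights can instead be used to integrate over models,
    # e.g. averaging per-model predictions preds[i]:
    #   y_hat = sum(w * p for w, p in zip(weights, preds))
    print("weights:", np.round(weights, 3), "| MAP model index:", map_model)
    ```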
