Wacky computer scientists

Aleks pointed us to an interesting 1997 article on the foundations of statistical inference by Walter Kirchherr, Ming Li, and Paul Vitanyi. It’s an entertaining article in which they discuss the strategy of putting a prior distribution on all possible models, with higher prior probabilities for models that can be described more concisely, thus linking Bayesian inference with Occam’s razor.
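
To get a feel for this strategy, here is a small illustrative sketch (the candidate models, description lengths, and likelihood values below are invented for the example, not taken from the article): each hypothesis H gets a prior proportional to 2^(-L(H)), where L(H) is the length in bits of its description, so the log-posterior becomes the log-likelihood minus a complexity penalty.

    import math

    # Hypothetical candidate models: (name, description length in bits,
    # maximized log-likelihood of the data). All numbers are invented.
    models = [
        ("constant mean",     10, -90.0),
        ("linear trend",      25, -50.0),
        ("10th-degree poly", 120, -48.0),
    ]

    def log_posterior_up_to_Z(desc_len_bits, log_lik):
        # log P(H|D) = log P(D|H) + log P(H) - log Z, with log P(H) = -L(H) * log(2).
        # Z is the same for every model, so it drops out of the comparison.
        return log_lik - desc_len_bits * math.log(2)

    scores = {name: log_posterior_up_to_Z(bits, ll) for name, bits, ll in models}
    print(scores)
    print("MAP model:", max(scores, key=scores.get))
    # The 10th-degree polynomial fits best, but its long description (low prior)
    # costs more than the fit gains; the linear trend comes out ahead.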

I’m not convinced, though. They’ve convinced me that their model has nice mathematical properties, but I don’t see why it should work for problems I’ve worked on, such as estimating radon levels or incumbency advantage or the probability of having a death sentence overturned or whatever.

Mark Hansen and Bin Yu have worked on applying this “minimum description length” idea to regression modeling, and I think it’s fair to say that these ideas are potentially very useful without being automatically correct or optimal in the sense that seems to be implied by Kirchherr et al. in the paper linked to above.

1 thought on “Wacky computer scientists”

  1. MDL can usually be understood as MAP in a Bayesian context. Let me try to write a quick non-rigorous overview.

    Consider the structure of the log-posterior obtained from Bayes’ rule:

    log P(H|D) = log P(D|H) + log P(H) - log Z

    Here Z is the evidence, the marginal probability of the data; it does not depend on H, so it acts only as a normalization constant and is usually ignored when comparing hypotheses.

    So, model selection with AIC or MDL can be seen as a special case of Bayesian maximum a posteriori (MAP) inference, with a particular choice of the prior, which corresponds to the "model description length". For the specific case of AIC:

    AIC (up to the conventional factor of -2) = log P(D|H) - k

    k ~= -log P(H); indeed, the more complex the parameter space, the larger the k. AIC corresponds to a uniform prior. Uniform priors are potentially very useful indeed! (A small numerical sketch of this penalized-likelihood reading follows below.)
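
    As a purely illustrative sketch of that penalized-likelihood view (the data, the Gaussian noise model, and the candidate polynomial degrees below are invented for the example), one can score each model by log P(D|H) - k, the maximized log-likelihood minus the number of fitted parameters:

        import numpy as np

        rng = np.random.default_rng(0)
        x = np.linspace(0, 1, 40)
        y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=x.size)  # data from a linear model

        def penalized_log_lik(degree):
            # Fit a polynomial of the given degree by least squares (the MLE under
            # Gaussian noise), compute the maximized Gaussian log-likelihood, and
            # subtract k = number of fitted parameters (coefficients + noise variance).
            coefs = np.polyfit(x, y, degree)
            resid = y - np.polyval(coefs, x)
            sigma2 = np.mean(resid ** 2)
            log_lik = -0.5 * x.size * (np.log(2 * np.pi * sigma2) + 1)
            k = degree + 2
            return log_lik - k

        for d in (0, 1, 5, 10):
            print(f"degree {d:2d}: log P(D|H) - k = {penalized_log_lik(d):.2f}")
        # The degree-1 model should typically come out on top: higher-degree fits
        # gain a little likelihood but pay more than that in the penalty.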
