Can you do Bayesian inference without strong assumptions about the functional form of the unknown distribution?

Adam Taylor writes:

One of the criticisms of Bayesian statistics seems to be that, as generally practiced, it relies on strong distributional assumptions. I’m wondering if it’s possible to come up with a posterior distribution on the mean of a bunch of IID samples from an unknown distribution [I think he means “a posterior distribution on an unknown distribution given the mean of an independent sample from that distribution” — AG], and to do it such that you don’t have to make strong assumptions about what the unknown distribution is. I.e. I’m looking for some kind of nonparametric or semi-parametric Bayesian approach to this problem. Does something like this exist?

My reply: Yes, you can do this. See, for example, "On Bayesian analysis of mixtures with an unknown number of components," by Sylvia Richardson and Peter Green, Journal of the Royal Statistical Society B, 59, 731-792 (1997), and "Bayesian density regression," by David Dunson, N. S. Pillai, and J.-H. Park, Journal of the Royal Statistical Society B, 69, 163-183 (2007).

The short answer it’s not trivial to solve the problem in reasonable generality. There are classical methods such as kernel density estimation that are much simpler, but they have problems when sample size is small.

Beyond this, my intuition is that the way to proceed is to think hierarchically. You’re almost never really just analyzing a sample from one distribution; realistically, you’ll be applying your method repeatedly on related problems. This returns us to the connections between hierarchical modeling and the frequency evaluation of statistical procedures.

9 thoughts on “Can you do Bayesian inference without strong assumptions about the functional form of the unknown distribution?”

  1. I'd recommend doing a Google Scholar search on "Dirichlet process mixture model". It's a probability measure on continuous distributions — it can be used as a prior for the unknown distribution. With Bayesian nonparametrics, you have to be careful about posterior consistency, but posterior consistency has been proved for vanilla density estimation with a DPMM.
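
    For a rough sense of what a DP mixture gives you in practice, here is a minimal sketch assuming scikit-learn's BayesianGaussianMixture, which fits a truncated variational approximation to a DP mixture of Gaussians (a point estimate of the density, not posterior samples); the toy data are made up:

        # Truncated variational approximation to a Dirichlet process
        # mixture of Gaussians, via scikit-learn.
        import numpy as np
        from sklearn.mixture import BayesianGaussianMixture

        rng = np.random.default_rng(0)
        # toy data from an "unknown" (here, bimodal) distribution
        x = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(1, 1.0, 350)])

        dpmm = BayesianGaussianMixture(
            n_components=20,                                   # truncation level
            weight_concentration_prior_type="dirichlet_process",
            weight_concentration_prior=1.0,                    # DP concentration parameter
            max_iter=500,
        ).fit(x.reshape(-1, 1))

        grid = np.linspace(x.min() - 1, x.max() + 1, 400).reshape(-1, 1)
        log_density = dpmm.score_samples(grid)                 # estimated log density on a grid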

  2. Another way to do this is to use a nonparametric prior on functions, like a Gaussian process. It takes quite a bit more work to turn a GP into a prior on probability densities, but it's something people have been working on for a while, and the result can be interpreted as a Bayesian kernel density model; a small sketch of the idea follows the references. Some relevant papers are:

    Peter Lenk, The Logistic Normal Distribution for Bayesian Nonparametric Predictive Densities, JASA 83:402, 1988.
    Peter Lenk, Towards a Practicable Bayesian Nonparametric Density Estimator, Biometrika 78:3, 1991.
    Surya Tokdar, Towards a Faster Implementation of Density Estimation with Logistic Gaussian Process Priors, JCGS 16:3, 2007.
    Plugging my recent conference paper with Iain Murray and David MacKay: The Gaussian Process Density Sampler, NIPS, 2009.
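
    A minimal sketch of the logistic Gaussian process idea behind these papers, assuming a squared-exponential covariance and a crude grid approximation: draw a latent function from a GP prior and take the density proportional to its exponential. This only illustrates the prior; the posterior computation is what the papers above work out.

        # One draw from a logistic Gaussian process prior on densities,
        # approximated on a grid over [0, 1].
        import numpy as np

        rng = np.random.default_rng(1)
        grid = np.linspace(0.0, 1.0, 200)
        dx = grid[1] - grid[0]

        # squared-exponential GP covariance (lengthscale and variance are made up)
        lengthscale, variance = 0.1, 4.0
        K = variance * np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / lengthscale) ** 2)
        K += 1e-8 * np.eye(len(grid))                         # jitter for numerical stability

        f = rng.multivariate_normal(np.zeros(len(grid)), K)   # f ~ GP(0, K) on the grid
        density = np.exp(f)
        density /= density.sum() * dx                         # normalize so it integrates to 1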

  3. DP and GP, as mentioned above. There are quite a lot of references, especially in the machine learning literature, on nonparametric Bayesian inference; Zoubin Ghahramani at UCL/Cambridge is one of the names there. You can also check out http://videolectures.net for many online video lectures on the topic (and many others).

  4. I'm wondering if what he wants is to make inference on the mean of an unknown population, rather than on the distribution of the unknown population. If the former, then the machinery of nonparametric Bayes is a bit unnecessary. Instead, you can use a normal model for your data, which really just amounts to assuming the sample mean is approximately normal (and that the sample mean and sample variance are independent). Even if your data are not normal, you get asymptotic consistency for your estimate of the population mean and variance (see White 1982?), and you don't have to specify prior distributions over spaces of probability distributions.

    In such an approach, we are relying more on the central limit theorem than the assumption that the unknown distribution is actually normal. If I recall correctly, this is used quite a bit in GCSR.
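
    A minimal sketch of that normal-model route, assuming the standard noninformative prior, under which the marginal posterior for the population mean is a t distribution centered at the sample mean; the exponential toy data are made up:

        # Posterior for the population mean under a normal model with the
        # usual noninformative prior: t_{n-1}(ybar, s/sqrt(n)).
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(2)
        y = rng.exponential(scale=2.0, size=60)        # clearly non-normal data

        n, ybar, s = len(y), y.mean(), y.std(ddof=1)
        post_mean = stats.t(df=n - 1, loc=ybar, scale=s / np.sqrt(n))
        interval_95 = post_mean.interval(0.95)         # central 95% posterior interval for the mean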

  5. Peter:

    Yes, but in that case much depends on the tails of the distribution. If the tails of the underlying distribution are short, the normal model is fine for inference about the population mean (and other models that fit the data better can perform much worse; see section 9.3 of BDA2); if the tails are long, not so much. Also, modeling the distribution better (even with crude bounds) can make a big difference. But I agree that nonparametric modeling would be overkill for this purpose.

  6. I would say it depends on a combination of the sample size and the tails of the distribution. Of course, the t-likelihood or interval will fail if you have 3 observations with which to estimate the mean of a very heavy-tailed distribution, but then again so would any other procedure.

    What is sometimes surprising is how quickly the CLT kicks in, making the t-likelihood a reasonable semiparametric procedure that works fine for long-tailed and nonnormal distributions, as long as the sample size is not too small. For example, what do you think the coverage probabilities of the t-interval are, based on 50 observations from the following three distributions?

    a) beta(1/2,1/2) (bathtub shaped)
    b) exponential
    c) lognormal(1,1)

    Based on some quick simulations, the coverage probabilities are 95%, 93%, and 90%. If the sample size is 100, the numbers are 95%, 94%, and 92%. Not bad, considering that the "model" for the data is very wrong (but the model for the sample mean is approximately right).
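
    Here is a minimal sketch of that kind of quick simulation, reading lognormal(1,1) as log-scale mean 1 and standard deviation 1; the coverage estimates it prints will vary a bit with the seed and number of replications:

        # Monte Carlo coverage of the nominal 95% t-interval for the mean.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(3)
        n, reps = 50, 10_000

        # sampler and true mean for each distribution
        cases = {
            "beta(1/2,1/2)":  (lambda size: rng.beta(0.5, 0.5, size), 0.5),
            "exponential":    (lambda size: rng.exponential(1.0, size), 1.0),
            "lognormal(1,1)": (lambda size: rng.lognormal(1.0, 1.0, size), np.exp(1.5)),
        }

        for name, (sampler, true_mean) in cases.items():
            covered = 0
            for _ in range(reps):
                y = sampler(n)
                half_width = stats.t.ppf(0.975, n - 1) * y.std(ddof=1) / np.sqrt(n)
                covered += abs(y.mean() - true_mean) <= half_width
            print(name, covered / reps)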

  7. Sorry for the confusion I've created with my sloppy wording, but Peter Hoff is correct: I was interested in making inference on the mean of an unknown population. (I do appreciate all the pointers on Bayesian density estimation. Thanks to all.)

    For making inference on the mean, what I'm hearing is that you can often get away with assuming the underlying distribution is normal, as long as it isn't too heavy-tailed. But is there some way to automatically guard against the possibility that it is heavy-tailed?

  8. Adam:

    It depends on the context. Typically, when people are interested in the sum, they are working with all-positive data, in which case it can make sense to take the log first. Peter's point is that the inference based on the normal distribution can be OK even if it's a bad model for the data.
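
    A minimal sketch of the take-the-log-first idea, with made-up lognormal data: the t-based interval is computed on the log scale, and exponentiating its endpoints gives an interval for the geometric mean (the median under a lognormal model), not for the arithmetic mean of the original data.

        # t-interval on the log scale for all-positive, long-tailed data.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(4)
        y = rng.lognormal(mean=1.0, sigma=1.0, size=50)   # positive, heavy right tail

        z = np.log(y)
        n, zbar, sz = len(z), z.mean(), z.std(ddof=1)
        log_interval = stats.t(df=n - 1, loc=zbar, scale=sz / np.sqrt(n)).interval(0.95)
        geo_mean_interval = tuple(np.exp(log_interval))   # interval for the geometric mean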

    Peter:

    Please take a look at Section 9.3 of Bayesian Data Analysis (second edition).
