Measuring model fit

Ahmed Shihab writes,

I have a quick question on clustering validation.

I am interested in the problem of measuring how well a given data set fits a proposed GMM [Gaussian mixture model]. As opposed to the notion of comparing models, this “validation” idea treats the GMM as a fixed reference: it already represents a specific mixture of distributions, an absolute, so we can check directly whether the data fit that representation or not.

In fuzzy clustering, such validity measures abound. But it struck me that in the probabilistic world of GMMs our only measure is the overall probability given by the GMM: the closer it is to one, the better. However, a value of, say, 0.69 can be misleading; when the clusters differ in population, the bigger cluster adds substantially to the overall probability score even if it fits badly, so the overall impression is that there is a good fit.

My response: I don’t have much experience with these models, but I recommend simulating replicated datasets from the fitted model and comparing them (visually, and using numerical summaries) to the observed data, as discussed in Chapter 6 of Bayesian Data Analysis.
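Here's a minimal sketch of that kind of replication check, using scikit-learn's GaussianMixture as a stand-in for whatever fitting procedure you actually use. The fake data, the number of components, and the choice of numerical summary (mean distance from each point to its nearest component mean) are all illustrative assumptions, not part of the original question; a fully Bayesian check would also average over posterior uncertainty in the mixture parameters rather than plugging in a single fit.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Fake "observed" data: two Gaussian clusters of unequal size (assumption).
y = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(800, 2)),
    rng.normal(loc=[6, 0], scale=1.0, size=(200, 2)),
])

# Fit the proposed model.
gmm = GaussianMixture(n_components=2, random_state=0).fit(y)

def test_stat(x, model):
    # One possible numerical summary: mean distance to the nearest component mean.
    d = np.linalg.norm(x[:, None, :] - model.means_[None, :, :], axis=2)
    return d.min(axis=1).mean()

t_obs = test_stat(y, gmm)

# Simulate replicated datasets from the fitted model and compute the same summary.
t_rep = np.array([test_stat(gmm.sample(len(y))[0], gmm) for _ in range(200)])

# If the observed summary sits far out in the tail of the replicated summaries,
# that is evidence of misfit (analogous to a posterior predictive p-value).
print("observed summary:", t_obs)
print("tail probability:", (t_rep >= t_obs).mean())
```

The same recipe works with any summary that captures the aspect of fit you care about, including ones computed separately within each cluster so that a large, badly fitting cluster can't hide behind the overall score.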

My other comment is that clusters typically represent choices rather than underlying truth. For an extreme example, consider a simulation of 10,000 points from a unit bivariate normal distribution. This can certainly be considered as a single cluster, but it can also be divided into 50 or 100 or 200 little clusters (e.g., via k-means or any other clustering algorithm). Depending on the purpose, any of these choices can be useful. But if you have a generative model, then you can check it by comparing replications to the data.
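A small illustration of that point, assuming scikit-learn's KMeans as the clustering algorithm; the within-cluster sum of squares (inertia) is just one way to see that the number of clusters here is a choice, not a discovered truth, since it keeps dropping as you ask for more clusters from the same single Gaussian.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
x = rng.standard_normal((10_000, 2))  # one unit bivariate normal "cluster"

for k in (1, 50, 100, 200):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
    # k-means happily partitions the single Gaussian into k pieces,
    # and the within-cluster sum of squares falls monotonically with k.
    print(k, round(km.inertia_, 1))
```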