Sample size and self-efficiency

Xiao-Li Meng is speaking this Friday 2pm in the biostatistics seminar (14th Floor, Room 240, Presbyterian Hospital Bldg, 622 West 168th Street). Here’s the abstract:

One of the most frequently asked questions in statistical practice, and indeed in general quantitative investigations, is “What is the size of the data?” A common wisdom underlying this question is that the larger the size, the more trustworthy are the results. Although this common wisdom serves well in many practical situations, sometimes it can be devastatingly deceptive. This talk will report two such situations: a historical epidemic study (McKendrick, 1926) and the most recent debate over the validity of multiple-imputation inference for handling incomplete data (Meng and Romero, 2003). McKendrick’s mysterious and ingenious analysis of an epidemic of cholera in an Indian village provides an excellent example of how an apparently large-sample study (e.g., n=223), under a naive but common approach, turned out to be a much smaller one (e.g., n<40) because of hidden data contamination. The debate on multiple imputation reveals the importance of the self-efficiency assumption (Meng, 1994) in the context of incomplete-data analysis. This assumption excludes estimation procedures that can produce more efficient results with less data than with more data. Such procedures may sound paradoxical, but they indeed exist even in common practice. For example, the least-squares regression estimator may not be self-efficient when the variances of the observations are not constant. The moral of this talk is that for the common wisdom "the larger the better" to be trusted, we not only need to assume that the data analyst knows what s/he is doing (i.e., an approximately correct analysis), but more importantly that s/he is performing an efficient, or at least self-efficient, analysis.
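
The least-squares example at the end of the abstract is easy to see in a little simulation. Here's a minimal sketch (my own toy example, not from the talk; the sample sizes and variances are arbitrary): estimate a mean by the unweighted sample mean (i.e., least squares) when half the observations are much noisier than the other half. Throwing in the noisy half makes the estimate worse, which is exactly the "more data, less efficiency" behavior that self-efficiency rules out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rep = 20_000
mu = 1.0                          # true mean to be estimated
n_clean, n_noisy = 50, 50         # arbitrary illustrative sample sizes
sd_clean, sd_noisy = 1.0, 10.0    # unequal variances, as in the abstract's example

est_small, est_large = [], []
for _ in range(n_rep):
    clean = rng.normal(mu, sd_clean, n_clean)
    noisy = rng.normal(mu, sd_noisy, n_noisy)
    est_small.append(clean.mean())                            # "less data": clean half only
    est_large.append(np.concatenate([clean, noisy]).mean())   # "more data": unweighted mean of everything

print("variance of estimate, n=50 (clean half only):", np.var(est_small))
print("variance of estimate, n=100 (all data):      ", np.var(est_large))
# The n=100 unweighted (least-squares) estimate has much higher variance than the
# n=50 estimate, so the procedure gives a worse answer with more data: not self-efficient.
```

An inverse-variance-weighted estimate would restore "more data is better" here, which is the sense in which the problem lies with the procedure rather than with the data.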

This reminds me of the blessing of dimensionality, in particular Scott de Marchi’s comments and my reply here. I’m also reminded of the time at Berkeley when I was teaching statistical consulting, and someone came in with an example with 21 cases and 16 predictors. The students in the class all thought this was a big joke, but I pointed out that if they had only 1 predictor, it wouldn’t seem so bad. And having more information should be better. But, as Xiao-Li points out (and I’m interested to hear more in his talk), it depends on what model you’re using.

I’m also reminded of some discussions about model choice. When considering the simpler or the more complicated model, I’m with Radford that the complicated model is better. But sometimes, in reality, the simple model actually fits better. Then the problem, I think, is with the prior distribution (or, equivalently, with estimation methods such as least squares that correspond to unrealistic and unbelievable prior distributions that do insufficient shrinkage).
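
To make that last point concrete, here's a rough sketch (again my own toy example, reusing the 21 cases and 16 predictors from the consulting story above, with a ridge penalty standing in for a proper prior and the penalty value chosen ad hoc): least squares, which corresponds to a flat prior on the coefficients, overfits badly here, while even a crude normal prior (ridge) predicts much better out of sample.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 21, 16                       # as in the consulting example above
beta = rng.normal(0, 0.5, p)        # true coefficients (arbitrary scale)

def fit(X, y, lam):
    # ridge / least squares: (X'X + lam*I)^(-1) X'y ; lam=0 is plain least squares
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

ols_err, ridge_err = [], []
for _ in range(2000):
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(0, 1, n)
    X_new = rng.normal(size=(1000, p))               # out-of-sample test points
    y_new = X_new @ beta + rng.normal(0, 1, 1000)
    for lam, errs in [(0.0, ols_err), (5.0, ridge_err)]:   # lam=5 is an ad hoc choice
        b = fit(X, y, lam)
        errs.append(np.mean((y_new - X_new @ b) ** 2))

print("mean squared prediction error, least squares (flat prior):", np.mean(ols_err))
print("mean squared prediction error, ridge (normal prior):      ", np.mean(ridge_err))
# With 21 cases and 16 predictors the unshrunk fit is barely identified and predicts
# poorly out of sample; even a roughly chosen amount of shrinkage helps a lot.
```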

1 thought on “Sample size and self-efficiency”

  1. But we are human (some of us more so than others) and less likely to make mistakes with simpler (more false) models, and that may make their use superior to complex (less false) models.

    An early reference would be Schroder, H. M., Driver, M. J., and Streufert, S. (1967), Human Information Processing: Individuals and Groups Functioning in Complex Social Situations. New York: Holt, Rinehart and Winston.

    Also vaguely drawing on arguments from C. S. Peirce on the continuity of models – that there is a model between every less and more complex model, and that the true model should be taken as the one an infinite community of inquirers would settle on – so eventually the (surviving) most complex model would be best.

    Yes, this is vague, with complexity undefined and "in the mind of the beholder," but I think Radford is thinking of less wrong models as being more involved than more wrong models

    (but you need some extra, data-set-based conviction that they are less wrong)

    Keith
