Death by survey

Emmanuela Gakidou and Gary King (Institute for Quantitative Social Science, Harvard) wrote a cool paper, “Death by survey: estimating adult mortality without selection bias,” in which they consider estimates of mortality based on “survey responses about the survival of siblings, parents, spouses, and others.” By explicitly modeling the missing-data process, they correct for selection biases: for example, dead persons with more siblings are more likely to be counted in a survey asking about the deaths of siblings. (And persons with no siblings won’t be counted at all.)

Comments on the Gakidou and King paper

This is a fun, interesting, and potentially important paper. I just had a few questions/comments. Mostly picky, but hey, anything to be helpful . . .

– What is the “DHS program”? They refer to it several times but I don’t know what it is.

– Figure 1 would be better, I think, as a 3×3 grid of small plots. Instead of trying to use symbols and colors to convey so many details on a single plot, use a grid so that the individual plots are less overloaded. Also, connect the points in each plot with lines (and get rid of the points). Then you can label the lines directly on the plot and avoid the need for a legend.

– Same for Figure 2. Also, for Figure 2, make the bottom boundary at 0 a “hard” boundary (no gap between 0 and the axis) since zero is a meaningful comparison. Also, I applaud the authors for using RMSE instead of MSE.

– Table 1 should be a graph. No doubt about it.

– Figure 3 is fascinating. I have a minor comment which is that I can’t figure out how the subplots are ordered. I’d like to see something like an ordering by average death rate, or GDP, or some interpretable quantity.

More importantly, I’m interested in the consistent pattern of these curves (of death rate vs. #siblings), which go up as the number of siblings increases from 1 to 4 or 5, then generally decrease as the number increases further. What’s going on here? Is it a “real” phenomenon, or is it some statistical artifact having to do with the sampling? I just didn’t quite know how to think about it.

– In the discussion of sampling weights in the conclusion, the authors should be aware that, in many settings, the more appropriate survey weights come from poststratification. To the extent that family size information is directly available in some of these countries, I suspect that poststratification could improve the Gakidou and King method even more.
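
To make the poststratification idea concrete, here is a minimal sketch in Python. It is generic poststratification, not the Gakidou and King method: the cells (number of siblings), the population shares, and the survey responses are all made-up numbers for illustration, and the function name is my own. The idea is to estimate the outcome within each family-size cell and then average the cells using their known population shares rather than their shares in the sample.

```python
# Hypothetical population shares by number of siblings (say, from a census).
# Made-up numbers for illustration only.
population_share = {1: 0.25, 2: 0.50, 3: 0.25}

# Hypothetical survey responses: (number of siblings, outcome) pairs,
# where outcome is 1.0 if the respondent reports a sibling death, else 0.0.
sample = [(1, 0.0), (2, 1.0), (2, 0.0), (3, 1.0), (3, 1.0), (3, 0.0), (3, 1.0)]


def poststratified_mean(sample, population_share):
    """Average the within-cell sample means, weighted by population shares.

    Assumes every cell in population_share has at least one respondent.
    """
    totals, counts = {}, {}
    for cell, y in sample:
        totals[cell] = totals.get(cell, 0.0) + y
        counts[cell] = counts.get(cell, 0) + 1
    return sum(
        share * totals[cell] / counts[cell]
        for cell, share in population_share.items()
    )


# Unweighted mean: large families are overrepresented in this sample.
raw_mean = sum(y for _, y in sample) / len(sample)
adjusted_mean = poststratified_mean(sample, population_share)
print(raw_mean, adjusted_mean)
```

In this toy sample the 3-sibling cell makes up 4 of 7 respondents but only a quarter of the assumed population, so the adjusted estimate is pulled away from the raw one.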

A related classroom demo

The Gakidou and King paper is really cool and reminds me of a (much simpler) classroom demo for teaching sampling methods. We ask each student to tell how many siblings are in his or her family (“How many brothers and sisters are in your family, including yourself?”). We write the results on the blackboard as a frequency table and a histogram, and then compute the mean, which is typically around 3.

But families, on average, typically have fewer than 3 kids (and this was also true twenty or so years ago when the current college students were being born). Why is the number for the class so high? Students give various suggestions such as, perhaps larger families are more likely to send children to college. But the real reason is that the probability a family is included in the sample is proportional to the number of kids in the family. Families with 0 kids won’t be included at all, and families with many kids are more likely to be sampled than families with just 1 kid.
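
Here is a minimal sketch of the arithmetic behind the demo, in Python. The family-size distribution is made up for illustration (it is not data from any class or census), and the function names are my own. If a fraction p_k of families have k kids, a randomly chosen kid comes from a size-k family with probability proportional to k × p_k, so the mean reported in class is E[k²]/E[k] rather than E[k].

```python
from fractions import Fraction

# Hypothetical shares of families with 0, 1, 2, 3, 4 kids.
# Made-up numbers for illustration only.
family_share = {
    0: Fraction(2, 10),
    1: Fraction(2, 10),
    2: Fraction(3, 10),
    3: Fraction(2, 10),
    4: Fraction(1, 10),
}


def mean_per_family(share):
    """Average number of kids when sampling families at random."""
    return sum(k * p for k, p in share.items())


def size_biased(share):
    """Distribution of family size when sampling kids at random.

    Each family is included with probability proportional to its number
    of kids, so families with 0 kids drop out entirely.
    """
    mu = mean_per_family(share)
    return {k: k * p / mu for k, p in share.items() if k > 0}


def mean_per_child(share):
    """Average family size as reported by a randomly chosen kid."""
    return sum(k * p for k, p in size_biased(share).items())


family_mean = mean_per_family(family_share)
class_mean = mean_per_child(family_share)
print(float(family_mean))           # 1.8
print(round(float(class_mean), 2))  # 2.67
```

With these made-up shares the average family has 1.8 kids but the average student reports about 2.67. The gap equals the variance of family size divided by its mean, so it disappears only if every family is the same size.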

This example is discussed in Section 5.1.6 of Teaching Statistics: A Bag of Tricks. Related examples in surveys include sampling by household or by individual.