Poststratification on variables that are not fully observed

Posted on September 11, 2007 7:36 AM by Andrew

Seth Wayland writes,

In Chapter 14.1 of your new book, the example uses only predictors for which you have census data at the state level. In the postratification step, you just plug the values of those covariates into the model, and viola, you have an estimate for that poststratification cell! What about including further individual level predictors in the model to account for probability of selection such as household size and number of phones in the household, or even an individual-level predictor that might improve the model? How do you then calculate the estimate for each poststratification cell?

My response: yes, this is something we are struggling with.

The long answer is that we would treat the population distribution of all the predictors, Census and non-Census variables (those desirable individual-level predictors which are only observed in the sample and not in the population), as unknown. We’d give it all a big fat prior distribution and do Bayesian inference. This sounds like a lot but I think it’s doable using regression models with interactions. We’re working on this now, starting with simple models with just one non-Census variable. The closest we’ve come so far with is this paper with Cavan and Jonathan on poststratification without population inference (see blog entry here).

The short answer is that it should be possible to do a quick-and-dirty version of the above plan, estimating the joint distribution of Census and non-Census variables using point estimates for the distribution of non-Census variables given the Census variables, based on weighting using the survey data within each Census post-stratification cell. This is only an approximation because it ignores uncertainty (for example, if a particular cell includes 4 people in single-phone households and 3 people in multiple-phone households, the weighted totals become 4 and 1.5, so the quick-and-dirty approach would use the point estimate of 1/(4+1.5) as the proportion in single-phone households in that cell, ignoring the uncertainty arising from sampling variability).

I do think that this (the quick version, then the full version) is ultimately the way to go, since the poststratification strategy allows us to model the data and get small-area estimates, such as state-level opinions from national polls.

As is often the case, the challenge in statistics is to include all relevant information (from the Census as well as the survey, and maybe also from other surveys), and to do this while setting up a model that is structured enough to take advantage of all these data but not so structured that it overwhelms this information.