Analysis of survey data: Design based models vs. hierarchical modeling?

Alban Zeber writes:

Suppose I have survey data from, say, 10 countries, where each country collected its data using a different sampling routine, so that each country has its own weights that can be used in the analyses. If I analyse the data of each country separately, then I can incorporate the survey design in the analyses, e.g., in Stata one can use svyset …..

But what happens when I want to do a pooled analysis of all the data from the 10 countries:

Presumably either

1. I analyse the data from each country separately (using multiple or logistic regression, …), accounting for the survey design, and then combine the estimates using a meta-analysis (fixed or random effects)

OR

2. Assume that the data from each country is a simple random sample from the population, combine the data from the 10 countries and then use multilevel or hierarchical models

My question is which of the methods is likely to give better estimates? Or is there a better way to do the analyses?

Is there a method one can use to get the best of both worlds, i.e., accounting for the survey design within each country’s data while also accounting, via multilevel/hierarchical models, for the fact that the data from each country is correlated? If such a method exists, is it implemented in any of the ‘common’ statistical programs?

My reply:

Of your two choices above, I recommend your first option, the meta-analysis. You can take the estimate and standard error from each of your separate surveys and then put them together, including survey and country-level predictors as well.
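As a rough illustration of that first option, here is a minimal sketch that pools per-country estimates and standard errors. The function names and the example numbers are hypothetical, and the DerSimonian-Laird estimator of the between-country variance is only one common choice, assumed here for illustration. The sketch also leaves out survey- and country-level predictors, which would require a meta-regression.

```python
import math

def pool_fixed(estimates, std_errors):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    weights = [1.0 / se**2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)

def pool_random(estimates, std_errors):
    """Random-effects pooling with the DerSimonian-Laird estimate of tau^2.

    Returns (pooled estimate, standard error, tau^2), where tau^2 is the
    estimated between-country variance of the true effects.
    """
    k = len(estimates)
    w = [1.0 / se**2 for se in std_errors]
    fixed, _ = pool_fixed(estimates, std_errors)
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(wi * (b - fixed) ** 2 for wi, b in zip(w, estimates))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Each country's weight now reflects sampling variance plus tau^2.
    w_star = [1.0 / (se**2 + tau2) for se in std_errors]
    total = sum(w_star)
    pooled = sum(wi * b for wi, b in zip(w_star, estimates)) / total
    return pooled, math.sqrt(1.0 / total), tau2

# Hypothetical design-based coefficients and standard errors from 4 countries.
betas = [0.42, 0.55, 0.31, 0.60]
ses = [0.10, 0.08, 0.12, 0.15]
print(pool_fixed(betas, ses))
print(pool_random(betas, ses))
```

The inputs are meant to be the design-based estimates from each country's own analysis, so the survey design is handled within country and the pooling step only has to deal with between-country variation.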

Method 2 doesn’t make sense. Your data are not simple random samples, or even close (if they were close, nobody would be doing weighting), and so you shouldn’t go around assuming they are.

But there is a way to analyze the raw data. What you need to do is find out where the survey weights came from. Then, do your regression analysis conditional on all the variables used in the weighting, and all will be fine. Use a multilevel model with appropriate interactions to model what you need to model.
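To make this concrete, one possible form of such a model is sketched below. It assumes, purely for illustration, a continuous outcome y, weights that depend only on age and sex, and a single country-level predictor u; here j[i] denotes the country of respondent i. All symbols are assumptions introduced for the sketch and are not taken from any particular survey.

```latex
\begin{align*}
y_i &= \alpha_{j[i]} + \beta_1\,\mathrm{age}_i + \beta_2\,\mathrm{sex}_i
       + \beta_3\,(\mathrm{age}_i \times \mathrm{sex}_i) + \epsilon_i,
    & \epsilon_i &\sim \mathrm{N}(0, \sigma_y^2), \\
\alpha_j &= \gamma_0 + \gamma_1 u_j + \eta_j,
    & \eta_j &\sim \mathrm{N}(0, \sigma_\alpha^2).
\end{align*}
```

For a binary outcome the first line would be replaced by a logistic regression with the same linear predictor, and the coefficients on the weighting variables could themselves be allowed to vary by country.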

Including the variables used in the weighting is as “design-based” as you need to be, and it’s the essence of Mister P.
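The poststratification half of Mister P (multilevel regression and poststratification) can be sketched as follows. This assumes the fitted model has already produced an estimate for every cell defined by the weighting variables, and that population counts for those cells are available from a census or similar source. The cell labels and all numbers are hypothetical.

```python
def poststratify(cell_estimates, cell_counts):
    """Population-weighted average of cell-level model estimates.

    cell_estimates: dict mapping a cell (e.g. (age_group, sex)) to the
                    model-based estimate for that cell.
    cell_counts:    dict mapping the same cells to population counts.
    """
    missing = set(cell_estimates) - set(cell_counts)
    if missing:
        raise ValueError(f"no population count for cells: {sorted(missing)}")
    total = sum(cell_counts[c] for c in cell_estimates)
    return sum(cell_estimates[c] * cell_counts[c] for c in cell_estimates) / total

# Hypothetical fitted values for one country, by age group and sex.
estimates = {
    ("young", "F"): 0.2,
    ("young", "M"): 0.4,
    ("old", "F"): 0.6,
    ("old", "M"): 0.8,
}
# Hypothetical population counts for the same cells.
counts = {
    ("young", "F"): 100,
    ("young", "M"): 100,
    ("old", "F"): 300,
    ("old", "M"): 500,
}
print(poststratify(estimates, counts))
```

Because the cells are weighted by their population counts rather than by their share of the sample, the result does not depend on how heavily each cell was sampled, which is the job the survey weights were doing.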

See my Struggles paper for more discussion of these issues.

5 thoughts on “Analysis of survey data: Design based models vs. hierarchical modeling?”

  1. The question also points out a perpetual terminology problem–that "fixed" effects means different things to different people. Two different fixed approaches that might be appropriate here:
    1) The usual inverse variance weighting, which assumes that the quantity being estimated in each survey is constant across countries. This is what a meta-analysis book would call the fixed effects analysis.
    2) Weighting each country by the size of the surveyed population, which would give you an estimate for the quantity of interest across the population of ten countries (e.g. mean obesity rate in OECD countries or total number of chicken houses in Western Europe). This approach amounts to treating each country as a stratum.

    Andrew do you have a thought on #1 vs. #2?

  2. You might want to spend some time trying to discern, at the country level or lower, which parameters in the likelihood are common (fixed) or common in distribution (random, i.e., anticipated estimands are exchangeable and can be represented as unobserved random parameters drawn from a common distribution) versus which should be allowed to vary arbitrarily (within-country variances?).

    Even the assumption that the countries are exchangeable is a tough one if they can be grouped into Asian, European, etc.

    Inverse variance is often a good approximation for fixed as well as random effects, even if the estimates are correlated, provided the unknown correlations can be properly allowed for (AKA Generalized Gauss-Markov).

    Perhaps one good piece of advice would be to "scaffold up" starting with inverse variance fixed effects and then get random/multi-level somehow.

    There is likely to be literature on this; Jon Rao (survey, small survey) and Robert Platt (correlations) come to mind, but simple analyses might give you a better sense of whether that's worth it.

    K?

  3. Andrew, with regard to doing the analyses on raw data you suggest doing regression analyses conditional on the variables used in the weighting. Suppose the weights in each country j are based on the probability of selection for unit i in country j and the variables age and gender. How would one proceed with doing regression analyses conditional on these variables?

  4. Alban:

    Include age and sex (and, if appropriate, their interaction) among the individual-level predictors in your model and include any relevant country-level predictors in the country-level model. The point is that the weight for unit i in country j will be a function of measured variables on unit i and country j. You can include these measured variables as regression inputs.

  5. Thanks Andrew.
    Effectively what you are suggesting is that we are 'weighting' by conditioning on the variables used in the weighting? Incidentally is this the approach you used in the example on pages 301-310 of your book "Data Analysis Using Regression and Multilevel/Hierarchical Models"?

    Another issue I would like to bring up is that the survey design differs from country to country. For example, some countries used stratified random sampling, while others used simple random sampling and others used cluster sampling. So potentially I would need a three-level model, with country at the top level (country-level), followed by the "primary sampling unit" (e.g., districts) for the countries that used cluster sampling as the next level (district-level), and finally the individual level. For countries where no cluster sampling was used, I would then use individuals as the clusters at district level (i.e., 1-person clusters). Or is it possible to do the analyses using a two-level model? What are your thoughts about the above?
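The two "fixed" weightings contrasted in the first comment answer different questions, which a small sketch makes visible. It assumes independent samples across countries, and the function names and all numbers are hypothetical.

```python
import math

def inverse_variance_pool(estimates, std_errors):
    """Estimate of a quantity assumed constant across countries."""
    weights = [1.0 / se**2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)

def population_weighted_pool(estimates, std_errors, populations):
    """Estimate for the combined population, treating each country as a stratum."""
    total_pop = sum(populations)
    shares = [n / total_pop for n in populations]
    pooled = sum(s * b for s, b in zip(shares, estimates))
    # Variance of a weighted sum of independent estimates.
    se = math.sqrt(sum((s * e) ** 2 for s, e in zip(shares, std_errors)))
    return pooled, se

# Hypothetical rates, standard errors and population sizes for two countries.
rates = [0.10, 0.30]
ses = [0.01, 0.02]
pops = [1_000_000, 3_000_000]
print(inverse_variance_pool(rates, ses))
print(population_weighted_pool(rates, ses, pops))
```

With these made-up inputs the inverse-variance estimate is pulled toward the more precisely measured country, while the population-weighted estimate is pulled toward the larger one.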
