Working with cluster data when you don’t have the cluster information

Amanda Geller writes,

I [Amanda] am using the NYC Housing and Vacancy Survey to look at the associations between disorder and crime. The city is divided into 55 neighborhoods, and in every wave, they survey about 18,000 households – about 250-300 per neighborhood. I’ve aggregated the microdata to the neighborhood level, so I have a panel of the 55 neighborhoods over 5 waves, and my predictors are basically all rates – rates of broken windows, public assistance receipt, etc. – predicting crime rates.

My problem is that the HVS, while stratified by neighborhood, is not a simple random sample within each neighborhood. Housing units are surveyed in clusters of 4, and unfortunately I don’t have cluster IDs and can’t get them from the Census Bureau. I’ve discussed this, and it sounds like, because the problem boils down to measurement error in my predictors, I don’t need to worry about bias. What I do need to worry about are the standard errors; I need to inflate them to account for the design effect.

So the question remains how to do this – whether I need to look at a sample of households to determine how similar the clusters are, how to measure the design effect, etc.

My reply: I think the best approach, if you can, is to gather some supplementary data to estimate the within-cluster correlations.
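
Here is a rough sketch of how such supplementary data might be used, not a worked-out solution for the HVS. It assumes you can get a small auxiliary sample in which cluster membership is known, with equal-sized clusters of 4. It estimates the within-cluster (intraclass) correlation with the standard one-way ANOVA estimator, plugs it into Kish's approximation deff = 1 + (m - 1) * rho, and inflates the simple-random-sampling standard error of a neighborhood rate by sqrt(deff). The function names and the example numbers are hypothetical, chosen only for illustration. How the inflated sampling error of each rate carries through to the regression standard errors is a separate step that this sketch does not address.

```python
import math


def anova_icc(clusters):
    """One-way ANOVA estimate of the intraclass correlation.

    `clusters` is a list of equal-sized lists of 0/1 (or numeric) outcomes,
    one inner list per cluster. Assumes at least two clusters and some
    variation in the data.
    """
    k = len(clusters)          # number of clusters
    m = len(clusters[0])       # units per cluster (4 in the HVS design)
    grand_mean = sum(sum(c) for c in clusters) / (k * m)
    cluster_means = [sum(c) / m for c in clusters]

    # Between-cluster and within-cluster mean squares.
    msb = m * sum((mu - grand_mean) ** 2 for mu in cluster_means) / (k - 1)
    msw = sum(
        (x - mu) ** 2 for c, mu in zip(clusters, cluster_means) for x in c
    ) / (k * (m - 1))

    return (msb - msw) / (msb + (m - 1) * msw)


def kish_design_effect(cluster_size, icc):
    """Kish approximation for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1) * icc


def inflated_se(p, n, deff):
    """Standard error of a proportion under simple random sampling,
    multiplied by sqrt(deff) to account for clustering."""
    srs_se = math.sqrt(p * (1 - p) / n)
    return srs_se * math.sqrt(deff)


if __name__ == "__main__":
    # Hypothetical auxiliary sample: 5 clusters of 4 units each,
    # 1 = unit has broken windows, 0 = it does not.
    aux = [
        [1, 1, 1, 0],
        [0, 0, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 1],
        [1, 1, 1, 1],
    ]
    rho = anova_icc(aux)
    deff = kish_design_effect(4, rho)
    # Hypothetical neighborhood: rate of 0.15 from 275 sampled households.
    print("icc:", rho, "deff:", deff, "se:", inflated_se(0.15, 275, deff))
```

With clusters of 4, the design effect under this approximation can range only up to 4 (perfectly correlated units), so even a crude estimate of the correlation bounds how much the standard errors need to grow.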