Can you trust a dataset where more than half the values are missing?

Rick Romell of the Milwaukee Journal Sentinel pointed me to the National Highway Traffic Safety Administration’s data on fatal crashes. Rick writes,

In 2006, for example, NHTSA classified 17,602 fatal crashes as being alcohol-related and 25,040 as not alcohol-related. In most of the crashes classified as alcohol-related, no actual blood-alcohol-concentration test of the driver was conducted. Instead, the crashes were determined to be alcohol-related based on multiple imputation. If I read NHTSA’s reports correctly, multiple imputation is used to determine BAC in about 60% of drivers in fatal crashes.

He goes on to ask, “Can actual numbers be accurately estimated when data are missing in 60% of the cases?” and provides this link to the imputation technique the agency now uses and this link to an NHTSA technical report on the transition to the currently-used technique.

My quick thought is that the imputation model isn’t specifically tailored to this problem and I’m sure it’s making some systematic mistakes, but I figure that the NHTSA people know what they’re doing, and if the imputed values made no sense, they would’ve done something about it. That said, it would be interesting to see some confidence-building exercises to give a sense that the imputations make sense. (Or maybe they did this already; I didn’t look at the report in detail.)
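For readers who haven't seen it in action, here is a minimal sketch of the general Rubin-style multiple-imputation recipe on simulated crash data: model BAC given observed covariates, draw several completed datasets, and combine the estimates. To be clear, this is not the NHTSA procedure; the covariates, the 0.08 cutoff, and the missing-completely-at-random mechanism are all assumptions made purely for illustration.

```python
# Minimal multiple-imputation sketch on simulated data -- NOT the NHTSA model.
# Covariates, the 0.08 threshold, and the missingness mechanism are all made up.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 10_000

# Simulated crash records with a few covariates that plausibly predict BAC.
night = rng.integers(0, 2, n)            # crash occurred at night
single_vehicle = rng.integers(0, 2, n)   # single-vehicle crash
age = rng.normal(40, 15, n)
true_bac = np.clip(rng.normal(0.02 + 0.06 * night + 0.05 * single_vehicle, 0.05, n), 0, None)

df = pd.DataFrame({"night": night, "single_vehicle": single_vehicle,
                   "age": age, "bac": true_bac})

# Delete ~60% of the BAC values (here completely at random, itself a strong assumption).
df.loc[rng.random(n) < 0.6, "bac"] = np.nan

# Draw M completed datasets; sample_posterior=True adds between-imputation noise,
# so the spread of the M estimates reflects imputation uncertainty.
M = 5
estimates = []
for m in range(M):
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    completed = imp.fit_transform(df)
    bac_hat = completed[:, df.columns.get_loc("bac")]
    estimates.append(np.mean(bac_hat >= 0.08))   # share of drivers at or above 0.08

print("true share >= 0.08:", np.mean(true_bac >= 0.08).round(3))
print("MI estimate:", np.round(np.mean(estimates), 3),
      "+/-", np.round(np.std(estimates), 3))
```

The confidence-building exercise I'd want to see is essentially the comparison in the last two lines, done on real data: some external or held-out measure of the truth set against the spread of the multiply-imputed estimates.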

6 thoughts on “Can you trust a dataset where more than half the values are missing?”

  1. Isn't any imputation strategy based on very strong assumptions, namely that the relationship between the variable of interest and the other variables is the same whether or not the variable is missing? Doesn't this make any dataset with lots of imputed values highly questionable?

  2. The NHTSA imputation model is very well tailored to make sense of the data reported in the Fatal Accident Reporting System (FARS). The estimate that 40-45 percent of all fatal crashes are alcohol-related is about as close to certainty as one can get with statistics. Of course, alcohol-related covers a multitude of sins, including drinking and walking, cycling, motorcycling, and vehicular suicide.

    Steven D. Levitt and Jack Porter estimate that drunk drivers kill about 3,000 to 5,000 more-or-less innocent bystanders each year, about half as many as sober folks with elevated testosterone levels (all males) kill.

    "How Dangerous Are Drinking Drivers?" by Steven D. Levitt and Jack Porter, The Journal of Political Economy, Vol. 109, No. 6 (Dec. 2001), pp. 1198-1237. An ungated version is available at http://pricetheory.uchicago.edu/levitt/Papers/Lev

  3. This isn't all that extreme.

    Most fraud models are built on data where 99% of the target tags are missing, and they still provide real benefit.

    The issue is that nobody examines any account that is not marked as delinquent (and many frauds are not marked that way until they fall out of the window of interest), and most charge-offs are attributed to bad debt without any serious examination.

    For relatively new areas of fraud modeling such as identity fraud or payment fraud, as much as 90% of the positive cases are not labeled as frauds.

    So I would say that the highway guys have it easy.

  4. Theoretically this is easy enough to check.

    Introduce random alcohol tests at a sample of fatal crashes and compute the average. See how it compares to the proportion in the imputed dataset. It should not be too expensive. (A toy version of this check is sketched below, after the comments.)

    I suppose there are legal reasons why this is not done.

    Although I like imputation, it engenders a false sense of security for management ("Why invest in data collection when we can impute it?"). Unfortunately, over time, data quality suffers, data collectors die, and the data department is left with three cadavers overlooking a Commodore 64 that spits out "data".

  5. The NHTSA imputation model works well because all states collect some information on all highway fatalities (FARS); some collect incomplete, not-so-accurate information; some incomplete, precise information; others complete, not-so-accurate information; and a few complete, precise information. For example, ten states have blood alcohol readings from all active participants (drivers, pedestrians, cyclists, motorcyclists, etc.) killed in motor vehicle accidents. Some collect this data only for certain classes of accidents. Others rely primarily on the investigating officer's report. And so forth.
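To make the check proposed in comment 4 concrete (and to illustrate the worry in comment 1), here is a toy simulation; the covariates, testing rates, and the simple logistic imputation model are all invented for illustration and stand in for, rather than reproduce, anything NHTSA actually does. When drivers who had been drinking are more likely to be tested, an imputation model that ignores this overstates the overall rate, while a small random audit subsample stays roughly unbiased.

```python
# Toy version of the audit check from comment 4 -- all numbers are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20_000

# Simulated crashes with two covariates and a true (unobserved) alcohol status.
night = rng.integers(0, 2, n)
single_vehicle = rng.integers(0, 2, n)
p_alcohol = 1 / (1 + np.exp(-(-1.5 + 1.2 * night + 0.8 * single_vehicle)))
alcohol = rng.random(n) < p_alcohol

# Testing is a mix of a small random audit plus non-random testing
# (drinking drivers are assumed more likely to be tested).
audit = rng.random(n) < 0.05
suspected = rng.random(n) < np.where(alcohol, 0.5, 0.3)
tested = audit | suspected

# Impute the untested cases from a model fit to the tested ones.
X = np.column_stack([night, single_vehicle])
model = LogisticRegression().fit(X[tested], alcohol[tested])
p_hat = model.predict_proba(X[~tested])[:, 1]
imputed = rng.random((~tested).sum()) < p_hat

overall_rate = (alcohol[tested].sum() + imputed.sum()) / n
print("true rate:           ", alcohol.mean().round(3))
print("imputed overall rate:", round(float(overall_rate), 3))
print("random-audit rate:   ", alcohol[audit].mean().round(3))
```

The random audit estimates the same quantity without leaning on the missingness assumption, so a persistent gap between the audit rate and the imputed rate is a direct warning that the imputation model is off.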
