Doug McNamara writes,
I am preparing for my first year as a graduate student at the University of Maryland in their Department of Measurement, Evaluation and Statistics. I’ve been reading your blog for a few months, and thought I would finally ask a question. So, here it is:
I have some data on the number of terrorist/insurgent troops in a country. For some of the cases, the data could not be directly measured; instead, experts on the country in question were surveyed. For these survey responses, the dataset provides a range of possible values for the number of troops, with the range usually representing the high and low estimates (rounded to the nearest thousand). For instance, experts have assigned a range of 10,000-15,000 for the number of UNITA troops in Angola in 1989.
So, the question is, how do I go about assigning an actual value to those situations where there is a range? Initially, I was thinking about simply using the midpoint of the high and low values, but I know nothing about the distribution of expert opinions. Alternatively, I could simply assign a random value within the range. A third option would be to run three tests: one where I use only the low values, one where I use the high values, and a third where I use the midpoint/random value approach.
I should mention I would like to assign a single value for the simple purpose of running a t-test to see if there is a difference in average number of troops when the group is foreign funded or not.
My reply: Considering this as a statistical problem, you could treat the actual number as missing data and then use a rounded-data likelihood (as in Exercise 3.5 of Bayesian Data Analysis). In your case, however, I’d probably just use the average (or the geometric mean) of the range. I wouldn’t take these ranges very seriously: in general, experts are notorious for giving estimates where the truth falls outside the range of their guesses. So I don’t see you getting anything special from looking at the high and low values as if they were actually upper and lower bounds.
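To make the "just use the average (or the geometric mean)" suggestion concrete, here is a minimal sketch in Python. The helper names and the `funded`/`unfunded` ranges are invented for illustration and are not from Doug's data. The loop reruns a Welch t statistic under each way of collapsing a range to one number, which is one way to do the sensitivity check he describes.

```python
import math
from statistics import mean, variance

def midpoint(low, high):
    """Arithmetic mean of the two endpoints."""
    return (low + high) / 2

def geometric_mean(low, high):
    """Geometric mean of the two endpoints (both must be positive)."""
    return math.sqrt(low * high)

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# The UNITA example from the question: 10,000-15,000 troops.
print(midpoint(10_000, 15_000))               # 12500.0
print(round(geometric_mean(10_000, 15_000)))  # 12247

# Hypothetical (low, high) ranges, invented for illustration only.
funded = [(10_000, 15_000), (4_000, 6_000), (20_000, 30_000)]
unfunded = [(2_000, 3_000), (5_000, 8_000), (1_000, 2_000)]

# Sensitivity check: the same statistic under each way of collapsing a range.
rules = [("low", min), ("high", max),
         ("midpoint", midpoint), ("geometric", geometric_mean)]
for name, collapse in rules:
    a = [collapse(lo, hi) for lo, hi in funded]
    b = [collapse(lo, hi) for lo, hi in unfunded]
    print(name, round(welch_t(a, b), 2))
```

If the conclusion changes depending on which rule is used, that is a sign the ranges are too wide for the comparison to say much.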
This book seems fairly relevant here.
Why not treat the range as a confidence interval from a normal distribution? Given the skepticism about expert estimates, I would use a fairly low confidence level, perhaps 50%.
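As a rough sketch of that idea, assuming the range is read as a central interval of a normal distribution centered at its midpoint (the function name `normal_from_range` and the `coverage` parameter are made up here), the implied standard deviation follows from the normal quantile:

```python
from statistics import NormalDist

def normal_from_range(low, high, coverage=0.5):
    """Normal distribution whose central `coverage` interval is [low, high]."""
    # z such that P(-z < Z < z) = coverage for a standard normal Z.
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    mu = (low + high) / 2
    sigma = (high - low) / (2 * z)
    return NormalDist(mu, sigma)

# The UNITA example: 10,000-15,000 read as a 50% interval.
d = normal_from_range(10_000, 15_000)
print(d.mean, round(d.stdev, 1))
print(round(d.cdf(15_000) - d.cdf(10_000), 3))  # 0.5
```

At 50% coverage the standard deviation comes out at roughly three quarters of the range width, so the distribution puts substantial mass well outside the experts' stated bounds.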
You could then combine the resulting normal distributions within each of the two groups and generate probability distributions for the fraction of the population involved in insurgency as a Gaussian mixture model.
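One hedged reading of "combine the normal distributions" is an equal-weight mixture of the per-country normals within a group. The sketch below works with troop counts rather than population fractions, and the component means and standard deviations are invented for illustration.

```python
from statistics import NormalDist

def mixture_mean_var(components):
    """Mean and variance of an equal-weight mixture of normal components."""
    n = len(components)
    m = sum(c.mean for c in components) / n
    # E[X^2] of the mixture is the average of each component's E[X^2].
    second_moment = sum(c.variance + c.mean ** 2 for c in components) / n
    return m, second_moment - m ** 2

def mixture_cdf(components, x):
    """CDF of the equal-weight mixture evaluated at x."""
    return sum(c.cdf(x) for c in components) / len(components)

# Hypothetical per-country distributions for one group (illustration only).
group = [NormalDist(12_500, 3_700), NormalDist(5_000, 1_500)]
m, v = mixture_mean_var(group)
print(m, round(v ** 0.5, 1))
print(round(mixture_cdf(group, 10_000), 3))
```

Comparing the two groups' mixtures then describes how the experts' stated beliefs differ, not necessarily how the true troop counts differ.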
The result wouldn't test whether there actually were more insurgents in foreign funded countries, but rather whether the experts thought there were…