Handling multiple versions of an outcome variable

Jay Ulfelder asks:

I have a question for you about what to do in a situation where you have two measures of your dependent variable and no prior reasons to strongly favor one over the other.

Here’s what brings this up: I’m working on a project with Michael Ross where we’re modeling transitions to and from democracy in countries worldwide since 1960 to estimate the effects of oil income on the likelihood of those events’ occurrence. We’ve got a TSCS data set, and we’re using a discrete-time event history design, splitting the sample by regime type at the start of each year and then using multilevel logistic regression models with parametric measures of time at risk and random intercepts at the country and region levels. (We’re also checking for the usefulness of random slopes for oil wealth at one or the other level and then including them if they improve a model’s goodness of fit.) All of this is being done in Stata with the gllamm module.

Our problem is that we have two plausible measures of those transition events. Unsurprisingly, the results we get from the two DVs differ, sometimes not by much but in a few cases to a non-trivial degree. The conventional solution to this problem seems to be to pick one version as the “preferred” measure and then report results from the other version in footnotes as a sensitivity analysis (invariably confirming the main results, of course; when’s the last time you saw a sensitivity analysis in a published paper that didn’t back up the “main” findings?). I just don’t like that solution, though, because it sweeps under the rug some uncertainty that’s arguably as informative as the results from either version alone. At the same time, it seems a little goofy just to toss both sets of results on the table and then shrug in cases where they diverge non-trivially.

Do you know of any elegant solutions to this problem? I recall seeing a paper last year that used Bayesian methods to average across estimates from different versions of a dependent variable, but I don’t think that paper used multilevel models and am assuming the math required is much more complicated (i.e., there isn’t a package that does this now).
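For concreteness, a stripped-down version of the kind of model Jay describes might look something like this in Stata (melogit is shown rather than gllamm purely for brevity, and all variable names here are made up):

    * discrete-time event-history logit, one row per country-year, fit to the
    * subsample at risk (say, autocracies at the start of the year); trans = 1 in
    * years with a transition, atrisk = years at risk, prior = count of prior events
    melogit trans oil_income atrisk prior || region: || country:

    * a country-level random slope for oil income would be: || country: oil_income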

My reply:

My quick suggestion would be to add the two measures and then use the sum as the outcome. If the measures are continuous there's no problem (although you'd want to prescale them so that they're roughly on a common scale before adding). If they're binary you can just fit an ordered logit to the sum.
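With two binary codings of the event, that suggestion amounts to something like the following (continuing the made-up names above, and again using the built-in me- commands rather than gllamm):

    * trans_a and trans_b are the two 0/1 codings of the transition event
    generate trans_sum = trans_a + trans_b   // takes the values 0, 1, or 2

    * multilevel ordered logit with the same random-intercept structure
    meologit trans_sum oil_income atrisk prior || region: || country: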

Jay liked my suggestion but added:

One hitch for our particular problem, though: because we’re estimating event history models, the alternate versions of the DV (which is binary) also come with alternate versions of a couple of the IVs: time at risk and counts of prior events. I can’t see how we could accommodate those differences in the framework you propose. Basically, we’ve got two alternate universes (or two alternate interpretations of the same universe), and the differences permeate both sides of the equation. Sometimes I really wish I worked in the natural sciences…

My suggestion would be to combine the predictors in some way as well.
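Continuing the sketch above, the simplest version of that would be to average the two codings of the event-history clocks (or run the model once with each version and compare):

    * each coding of the outcome carries its own time-at-risk and prior-event counts
    generate atrisk_avg = (atrisk_a + atrisk_b) / 2
    generate prior_avg = (prior_a + prior_b) / 2

    meologit trans_sum oil_income atrisk_avg prior_avg || region: || country: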

10 thoughts on “Handling multiple versions of an outcome variable”

  1. Adding is equivalent to treating the variables in question as indicators of a latent variable. It will work only if the inter-item correlation (or Cronbach's alpha) is reasonably high; if you aggregate variables that aren't correlated with each other (aren't behaving like a reliable composite measure of the unobserved variable), you'll get results biased toward 0. In this case, if the "alternative outcome variables" aren't correlated pretty highly with each other, the implication would be that the researchers don't have a really good grasp of how to identify the outcome they are trying to model.
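    In Stata that check takes a couple of lines (with trans_a and trans_b standing in for the two codings; the names are made up):

        correlate trans_a trans_b     // inter-item correlation of the two codings
        alpha trans_a trans_b         // Cronbach's alpha for the two-item "scale"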

  2. Just to play devil's advocate, have you seen this approach stand up to peer review in the past? I can easily see a reviewer dismissing the adding-of-variables approach as arbitrary.

  3. Gary King and co-authors might have a solution to your statistical question, including software:

    http://gking.harvard.edu/publications/multiple-ov

    However, as a referee I would be somewhat unhappy with this. If two different but "equally theoretically plausible" proxies produce different answers to the same question, I would like to read a vivid discussion of that rather than see them merged into one latent variable. You can of course merge them as well, but I would like to see the results for each of them separately. I would like to know where (i.e., for which country-years) the measures disagree and how much uncertainty this introduces into your conclusions.

  4. It stands up (indeed, is completely unremarkable) in settings in which everyone understands that the outcome variable is properly conceptualized as a latent variable & the aggregated variables as indicators or measures of it. You'll see this all the time, e.g., in social psychology studies in which the aggregated variables are Likert items relating to some attitude or perception & are combined into a composite scale. (Simply adding or averaging, moreover, is only one way to scale Likert items, and should be done after one normalizes the individual variables being added. Alternatively, one can use factor analysis.) But the principles behind it apply generally, and are fine for behavioral measures too. See Rushton, J.P., Brainerd, C.J. & Pressley, M. Behavioral development and construct validity: The principle of aggregation. Psych. Bulletin 94, 18-38 (1983). So if this isn't a context in which a reviewer would expect to see this kind of data reduction or scaling strategy, it might help to anticipate that & explain that this is exactly what one is doing & that it is equivalent to what is done all the time in settings in which people are used to thinking this way. But even if a reviewer finds all of this unremarkable, she'll want to see some conventional measure of reliability, or inter-item correlation, to confirm that the combined items really are plausibly treated as measures of some latent variable. So if a reviewer saw someone adding or averaging multiple variables (predictors or outcome variables) & not addressing reliability, I'd *expect* her to complain "hey, that's arbitrary … ad hoc" or whatever.
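    For instance, the normalize-then-average step described above might look like this in Stata (item1 and item2 are stand-ins for the Likert items):

        egen z1 = std(item1)                 // standardize each item
        egen z2 = std(item2)
        generate composite = (z1 + z2) / 2   // simple composite scale
        alpha item1 item2                    // reliability check on the raw items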

  5. A lot will depend on the way you explain this to reviewers. No need for jargon such as latent variables. You have two unreliable measures of the same thing. Adding (or averaging) them means that the random errors in the two measures are combined, and because they are random, some of them cancel out, producing a more reliable measure.
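    A quick simulation makes the point (numbers purely illustrative):

        clear
        set seed 2024
        set obs 1000
        generate truth = rnormal()           // the thing we wish we could measure
        generate m1 = truth + rnormal()      // two noisy measures of it
        generate m2 = truth + rnormal()
        generate avg = (m1 + m2) / 2
        correlate truth m1 m2 avg            // avg tracks truth better than m1 or m2 alone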

  6. I don't believe that differences between democracy measures are due to random errors. For instance, Polity and Przeworski et al. have conceptual disagreements about democracy that influence their operationalizations. It might be better to think about how the different estimates may reflect the different notions of democracy captured by these measures.

  7. How were the original variables measured? If at the interval level (e.g., some level of democracy), you may want to consider stepping away from the event-history models and using something like the multilevel model for change (Singer & Willett, 2003) in combination with the suggestion of using the sum of both variables. This solves the problem of the two versions having different sets of IVs, but it changes your analyses quite dramatically.

  8. Thanks, Andrew and commenters, for educating me. The (binary) measures of democracy are highly correlated, so I think we could safely operate within the framework Andrew and several others of you are proposing, and now I'm really curious to see the results from that analysis. That said, I have a feeling that likely reviewers for this work will respond as Antonio has, and with good reason. Now I'm stuck trying to figure out how to cram that discussion into a publication-length paper that also describes the two data sets and associated results. Sigh. I think Michael and I will have to powwow on this issue some more…

  9. Hi Jay,

    I wouldn't expect many reviewers to respond like me, since I've seen many absolutely careless cross-national studies published in what are considered top journals, but perhaps I'm digressing …

    The question of length is another big problem in political science… many of us manage to do what seems very difficult, which is to write a long paper with repetitive stories but not enough information in it… I think the solution is quite simple: we should follow what has been done in the natural sciences, which is to present the main results in the published piece, usually with key graphs instead of long and essentially meaningless tables, and put all the rest in online appendices, which can be as long as needed.

    Journals like The Lancet are very close to our substantive interests and already follow this format. In political science we still have to be more verbose than that, but the online appendix can still convey additional, relevant information, with details and sensitivity analyses.
