Regression and matching for causal inference: questions about Guido Imbens’s article

We would like to incorporate matching methods into a Bayesian regression framework for causal inference, with the ultimate goal of doing more effective inference using hierarchical modeling. The founding work here consists of papers by Cochran and Rubin from 1973, demonstrating that matching followed by regression outperforms either method alone, and by Rosenbaum and Rubin from 1984 on propensity scores.

Right now, our starting points are two recent review articles, one by Guido Imbens on the theory of regression and matching adjustments, and one by Liz Stuart on practical implementations of matching. So far, I’ve read Guido’s article and have a bunch of comments/questions. Much of this involves my own work (since that’s what I’m most familiar with), so I apologize in advance for that.

My comments on the Imbens paper

In Section II.B of his paper, Guido distinguishes between “sample average treatment effects” and “population average treatment effects.” This distinction comes up in other statistical contexts as well. For example, on page 13 of my paper on Anova, I distinguish between “finite-population” and “superpopulation” variances. Page 50 of my paper (in the rejoinder to discussion) discusses finite-population and superpopulation contrasts, which are essentially the same as Guido’s sample and population averages. As Guido and I both point out, the sample and population estimands have the same natural point estimates but different standard errors.
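
To spell out the distinction (in my notation, and in the standard Neyman randomization setup, so this is a generic summary rather than anything specific to Guido's paper):

\tau_{\rm sample} = \frac{1}{n} \sum_{i=1}^n \bigl( Y_i(1) - Y_i(0) \bigr),
\qquad
\tau_{\rm pop} = E \bigl[ Y(1) - Y(0) \bigr],

both estimated by the difference in means, \hat\tau = \bar Y_t - \bar Y_c, with

{\rm var}(\hat\tau \mid {\rm sample}) = \frac{S_t^2}{n_t} + \frac{S_c^2}{n_c} - \frac{S_\tau^2}{n},
\qquad
{\rm var}(\hat\tau) = \frac{S_t^2}{n_t} + \frac{S_c^2}{n_c},

where S_t^2 and S_c^2 are the variances of the two sets of potential outcomes and S_\tau^2 is the variance of the unit-level effects Y_i(1) - Y_i(0). Since S_\tau^2 is nonnegative and not identified from the data, the usual standard error is exactly right for the population estimand but conservative for the sample estimand.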

As Guido notes, an important motivation for studying all these matching and regression procedures is the potential for interactions between treatment and background variables (i.e., the treatment being more effective for some groups than others). He also notes that these background variables can include lagged outcomes. I wonder whether some progress could be made by modeling these interactions as variance components (as in this paper in a volume that also has a paper by Guido).
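
For concreteness, here is one way such a model might look (a sketch of my own, not a model from Guido's paper), with units i falling in cells j[i] defined by the discrete background variables:

y_i = \alpha_{j[i]} + \theta_{j[i]} T_i + \epsilon_i,
\qquad
\alpha_j \sim {\rm N}(\mu_\alpha, \sigma_\alpha^2),
\quad
\theta_j \sim {\rm N}(\theta_0, \sigma_\theta^2),
\quad
\epsilon_i \sim {\rm N}(0, \sigma_y^2).

Here \theta_0 is the average treatment effect and \sigma_\theta measures how much the effect varies across cells; setting \sigma_\theta = 0 collapses back to the constant-effect model, so the treatment interactions get partially pooled rather than estimated independently.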

On page 14, Guido makes an offhand comment that asymptotic bias of matching estimators is only an issue with continuous covariates, since “with discrete covariates the matching will be exact in large samples.” First, this is only true if there is full overlap between the treated and control populations. But more generally, my impression is that some of the biggest difficulties arise with discrete predictors. For example, if you have 20 different covariates, each with only 2 levels, that’s 2^20 = 1,048,576 possible covariate patterns! From a Bayesian standpoint, the challenge is to come up with reasonable models for this structure (which I don’t know how to do; see here). This issue comes up again later when Guido talks about sqrt(n) consistency.
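
To see how sparse things get, here is a quick simulation (my own illustration, with made-up covariates drawn independently at random, which is roughly the worst case for matching):

import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 20  # 1000 treated, 1000 controls, 20 binary covariates

treated = rng.integers(0, 2, size=(n, k))
controls = rng.integers(0, 2, size=(n, k))

# a treated unit has an exact match only if some control shares its
# entire covariate pattern
control_patterns = {tuple(row) for row in controls}
n_matched = sum(tuple(row) in control_patterns for row in treated)

print(f"possible covariate patterns: 2**{k} = {2**k:,}")
print(f"treated units with an exact match: {n_matched} of {n}")

With a million cells and only a thousand units per group, essentially no treated unit finds an exact match, so “exact in large samples” requires samples far larger than anything realistic in this setting.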

One other comment. Why use the notation W_i for the treatment? Wouldn’t T_i be more natural?

My questions about matching and regression

I liked Guido’s paper because it laid out the reasons for using the different methods. What I’d like is a fully Bayesian approach, since that allows flexibility in modeling the potential zillions of interactions. The basic theory of Bayesian inference for causal effects (including the sample and population treatment effects) is in Chapter 7 of our book, but the actual implementation is definitely a loose end.

I’m imagining a unification of matching and regression methods, following the Cochran and Rubin approach: (1) matching, (2) keeping the treated and control units but discarding the information on who was matched with whom, (3) regression including treatment interactions. I’m still confused about exactly how the propensity score fits in.
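
Here is a minimal sketch of that pipeline on made-up data, with the matching step implemented as 1-to-1 nearest-neighbor matching on an estimated propensity score (one choice among many, and exactly the kind of choice I'm unsure about):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
# confounded assignment: treatment probability depends on the covariates
t = rng.binomial(1, 1 / (1 + np.exp(-(-1 + 0.5 * x1 + 0.5 * x2))))
# outcome with a treatment-by-covariate interaction
y = 1 + 2 * t + x1 + 0.5 * x2 + t * x1 + rng.normal(size=n)
df = pd.DataFrame(dict(y=y, t=t, x1=x1, x2=x2))

# step 1: match each treated unit to the nearest control on the
# estimated propensity score, without replacement
ps_model = LogisticRegression().fit(df[["x1", "x2"]], df["t"])
df["ps"] = ps_model.predict_proba(df[["x1", "x2"]])[:, 1]
treated, pool = df[df.t == 1], df[df.t == 0].copy()
matched_controls = []
for _, unit in treated.iterrows():
    j = (pool.ps - unit.ps).abs().idxmin()
    matched_controls.append(j)
    pool = pool.drop(j)

# step 2: keep the matched sample, discarding who was paired with whom
matched = pd.concat([treated, df.loc[matched_controls]])

# step 3: regression with treatment interactions on the matched sample
print(smf.ols("y ~ t * (x1 + x2)", data=matched).fit().params)

In this version the propensity score enters only in step 1, to define the distance for matching; the regression in step 3 then cleans up residual imbalance and estimates the interactions. Whether the score should also appear in the regression itself is part of what I'm confused about.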

And, of course, I’m also confused because there are so many matching methods out there.