Recently in Causal Inference Category

In suggesting "a socially responsible method of announcing associations," AT points out that, as much as we try to be rigorous about causal inference, assumptions slip in through our language:

The trouble is, causal claims have an order to them (like "aliens cause cancer"), and so do most if not all human sentences ("I like ice cream"). It's all too tempting to read a non-directional association claim as if it were so -- my (least) favourite was a radio blowhard who said that in teens, cellphone use was linked with sexual activity, and without skipping a beat angrily proclaimed that giving kids a cell phone was tantamount to exposing them to STDs. . . . So here's a modest proposal: when possible, beat back the causal assumption by presenting an associational idea in the order least likely to be given a causal interpretation by a layperson or radio host.

Here's AT's example:

A random Google News headline reads: "Prolonged Use of Pacifier Linked to Speech Problems" and strongly implies a cause and effect relationship, despite the (weak) disclaimer from the quoted authors. Reverse that and you've got "Speech Problems linked to Prolonged Use of Pacifier" which is less insinuating.

It's an interesting idea, and it reminds me of something that really bugs me.

Placebos Have Side Effects Too

Aleks points me to this blog by Neuroskeptic, who reports on some recent research studying the placebo effect:

Following my comments on their article on U.S. military funding and conflict in Colombia, Oeindrila Dube and Suresh Naidu wrote:

Thanks for the comments on our paper. It seemed that you viewed the correlations in the analysis as an interesting descriptive exercise, but not interpretable as causal. We agree with you that the most interesting social science is often causal, and in this case in particular the causal claims are the main results. The paper's punchline is that military aid needs to be reconsidered when there is collusion between the army and non-state armed groups, and we couldn't make this claim if we thought the results were purely descriptive.

In the paper, we do a lot of sample splitting and parametric time controls to rule out the possibility that this is a spurious effect. For example, our results are robust to including a base-specific time trend, along with a base-specific post-2001 dummy.

Possibly the best evidence against a strict "conflict" time-series interpretation is that there is no effect (positive or negative) of US military aid on guerrilla attacks near Colombian military bases. In other words, it's not just an increase in conflict on all sides, but an increase in paramilitary attacks in particular.

The "differential time trend" that could drive our effect would have to be a) steeply nonlinear b) only applicable to paramilitaries in base municipalities, and c) would have to be fairly unique to the base municipalities, given the wide variety of alternate control groups we examine. So we think this is not a likely alternative explanation that can account for the effects.

To which I replied:

First off, I still would prefer associational language followed by causal speculation. But I can respect your different choice of emphasis. Now to get to details: my basic alternative model goes as follows:

- Conflict in Colombia increased during the early 2000s.

- U.S. military aid, in the U.S. and elsewhere, increased during that period also.

- Most of the paramilitary attacks (and, thus, most of the increase in paramilitary attacks) occurred near military bases.

Thus, I'm not so impressed by the "differential time trend" argument. It's unsurprising (but nonetheless worth noting, as you do) that there are fewer guerrilla attacks near military bases. But that doesn't mean that the paramilitary attacks wouldn't have increased in the absence of U.S. aid.

None of the above really contradicts your main political story, which is that the Colombian military is involved in paramilitary attacks, and that U.S. aid is an enabler for this sort of violence.

My story above is consistent with your causal story--more U.S. aid, more resources for the military, more paramilitary attacks. It's also consistent with a different causal story, which goes like this: more conflict, more paramilitary attacks, also more U.S. aid which actually serves to stop the situation from getting worse. The argument is, yes, the U.S. is giving weapons to the bad guys, but by doing so, it co-opts them and restrains their behavior.

OK, I'm not saying this latter argument is true, but I think your strongest argument against it is to say something like: "Sure, it's possible that things would be getting even worse in the absence of U.S. military aid. But given that, during the time that aid was higher, violence was also higher--and we're talking here about violence being done by the allies of the recipients of the aid--well, maybe aid isn't such a good idea." That is, you can put the burden of proof on the advocates of aid. Hey, it costs money and it's going to some unsavory characters. You shouldn't have to prove that aid is hurting; I think it would be more defensible, from a statistical/econometric point of view, to show the association and put the ball in their court.

P.S. Just to be clear: I don't have any strong feeling that you're wrong or any goal of "debunking" your paper. It's interesting and important work and I'm trying to understand it better.

And then they shot back with:

Regarding the stylistic point about associations and causal claims, we think this is perhaps discipline-specific, as the style in economics seems to be to make a causal claim and then rule out all the alternative causal stories as much as possible. I'm sure this is probably one of many idiosyncrasies that irks non-economists.

The substantive question is why paramilitary attacks (and paramilitary attacks specifically, rather than other measures of conflict), increase more in places near bases. The account we put forward is that this occurs because the Colombian military funnels a share of its resources to paramilitary groups. Thus, if US military aid translates into more resources for the military which are shared with paramilitary groups, the implication is that in the absence of increases in US military aid, paramilitary attacks would not have increased by as much as they did.

Now the alternative account you put forward is "more conflict, more paramilitary attacks, also more U.S. aid which actually serves to stop the situation from getting worse. The argument is, yes, the U.S. is giving weapons to the bad guys, but by doing so, it co-opts them and restrains their behavior."

It seems like you have two distinct things in mind, that overall conflict is a source of bias, and an associated conjecture that this omitted variable (overall conflict) upward biases our main coefficient since it is positively correlated with paramilitary attacks and positively correlated with the aid shock. First, we explicitly address and rule out potential omitted variables using a number of empirical specifications. But, even if there is an omitted variable correlated with U.S. military aid that differentially affects paramilitary attacks in base municipalities, it is not clear whether the direction of the bias would be positive. As an example, say a change in Colombian government leads the state to become more effective in fighting the guerilla insurgency, and the US rewards the state with more military aid, while paramilitary activity declines differentially in base regions, as this activity becomes less necessary with greater military effectiveness. In this case, the omitted variable (stronger Colombian state) is negatively correlated with paramilitary attacks and positively correlated with the aid shock, and this would lead us to underestimate the true effect of U.S. aid on paramilitary activity.

Moreover, we think we do a good job ruling out "conflict in general" at the national, state, or municipality level as a confounding variable. "Overall conflict" variation at the country level is absorbed by year fixed effects, and conflict at the department level is absorbed by the department x year fixed effects. At the municipal level, it is NOT the case that we observe increases in overall conflict, such as total number of clashes amongst all armed actors at the municipal level. (In our data, attacks are one-sided events carried out by a particular group. The fact that we see paramilitary attacks increase means we are specifically observing increases in events that involve only paramilitary groups - e.g., the paramilitaries attack a village or destroy some type of infrastructure.) Also, in every specification we find no effect on the guerrilla attacks, and we think you are not taking the non-effect sufficiently seriously in terms of countering the overall conflict account. The guerrilla non-effect actually provides very robust evidence that the U.S. military aid is not just correlated with any type of conflict, but rather with attacks by a particular group (which has no regional spillovers).

In addition, our base-specific linear trend and post-2001 dummy specification should convince you that our effect is not merely a post-2001 increase in conflict that manifests particularly as paramilitary attacks in base municipalities.

Your alternative account suggests that more aid to paramilitary organizations could actually result in less violence. While it is challenging to know what the counterfactual would have been in the absence of increased aid, Figure 2 shows that when aid rises sharply in 1999 there is a differential increase in aid in the base regions, and when aid decreases in 2001, there is a corresponding closing of differential decrease in the base regions. This seems inconsistent with the idea that lower aid translates into more paramilitary activity. Also, after 2002, when aid rises again, the differential increases yet another time. It is difficult to explain this pattern with the account you put forward, which would have to require additional coincidental reasons why paramilitary attacks should increase more in base regions precisely in 1999, then decline in 2001, and then rise again in 2002. This is possible, but seems unlikely.

We were thinking of some ideas that would be consistent with your alternative account, of why more aid to paramilitary organizations could actually lower violence. One story here could be deterrence - that stronger paramilitaries deter the guerillas resulting in fewer attacks by guerillas or fewer clashes between guerillas and paramilitaries. But, our results do not show a fall in guerilla attacks or clashes amongst the two groups; rather the coefficient on these other variables is close to 0 and they are statistically insignificant, which is inconsistent with the deterrence account.

Another reason could be dependence, that in the short run U.S. aid increases paramilitary violence, but it also induces paramilitary reliance on the Colombian military for supplies, which increases the sway the government has vis-à-vis this group, potentially leading to future demobilization. Thus in the long-run, U.S. military aid reduces paramilitary violence. While this process could take "long and variable lags" to manifest, it is important to note that we see a dramatic increase in paramilitary activity in 2005, despite a half-decade of huge U.S. military transfers to Colombia. Thus we do not see evidence of this dependence account in our data.

Find out on Thurs 1 Oct at 11:15 am in Kimmel 900 at NYU: Dr. Michael Foster from UNC will present the 4th Statistics in Society lecture, entitled: "Does Special Education Actually Work?" This talk will explore the efficacy of current special education policies while highlighting the role of new methods in causal inference in helping to answer the question. It is jointly sponsored by the Departments of Teaching and Learning and Applied Psychology, and by the Institute for Human Development and Social Change.

I'd definitely go to this if I were in town.

John reports on an article by Oeindrila Dube and Suresh Naidu, who ran some regressions on observational data and wrote:

This paper examines the effect of U.S. military aid on political violence and democracy in Colombia. We take advantage of the fact that U.S. military aid is channeled to Colombian army brigades operating out of military bases, and compare how changes in aid affect outcomes in municipalities with and without bases. Using detailed data on violence perpetuated by illegal armed groups, we …find that U.S. military aid leads to differential increases in attacks by paramilitaries . . .

It's an interesting analysis, but I wish they'd restrained themselves and replaced all their causal language with "is associated with" and the like.

From a statistical point of view, what Dube and Naidu are doing is estimating the effects of military aid in two ways: first, by comparing outcomes in years in which the U.S. spends more or less in military aid; second, by comparing outcomes in cities in Colombia with and without military bases.
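Putting the two comparisons together gives a difference-in-differences calculation. Here's a minimal sketch, assuming made-up attack counts (these are not numbers from the paper) and a hypothetical helper named `diff_in_diff`:

```python
from statistics import mean

# Hypothetical average paramilitary attacks per municipality (made-up numbers,
# not from Dube and Naidu's data), grouped by base status and aid period.
attacks = {
    ("base", "low_aid"): [1.0, 1.2, 0.8],
    ("base", "high_aid"): [2.0, 2.3, 1.7],
    ("no_base", "low_aid"): [0.5, 0.4, 0.6],
    ("no_base", "high_aid"): [0.8, 0.7, 0.9],
}

def diff_in_diff(data):
    """Change over time in base municipalities minus the change elsewhere."""
    change_base = mean(data[("base", "high_aid")]) - mean(data[("base", "low_aid")])
    change_no_base = mean(data[("no_base", "high_aid")]) - mean(data[("no_base", "low_aid")])
    return change_base - change_no_base

estimate = diff_in_diff(attacks)
print(round(estimate, 2))
```

The resulting number is a differential association. Reading it as the effect of aid requires the further assumption that, absent the change in aid, attacks in base and non-base municipalities would have followed the same trend.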

Matt Fox writes:

I teach various Epidemiology courses in Boston and in South Africa and have been reading your blog for the past year or so and used several of your examples in class . . . I am curious to know why you are skeptical of structural models. Much of my training has been in how essential these models are and I rarely hear the other side of the debate.

I've never used structural models myself. They just seem to require so many assumptions that I don't know how to interpret them. (Of course the same could be said of Bayesian methods, but that doesn't seem to bother me at all.) One thing I like to say is that in observational settings I feel I can interpret at most one variable causally. The difficulty is that it's hard to control for things that happen after the variable that you're thinking of as the "treatment."

To put it another way, there's a research paradigm in which you fit a model--maybe a regression, maybe a structural equations model, maybe a multilevel model, whatever--and then you read off the coefficients, with each coefficient telling you something. You gather these together and those are your conclusions.

My paradigm is a bit different. I sometimes say that each causal inference requires its own analysis and maybe its own experiment. I find it difficult to causally interpret several different coefficients from the same model.
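The difficulty with controlling for things that happen after the treatment can be seen in a toy example. This is a minimal sketch with an invented population: the variable names (`t`, `u`, `m`, `y`) and the way they are generated are assumptions chosen for illustration, with the true effect of the treatment on the outcome set to zero by construction.

```python
from statistics import mean

# Hypothetical population: treatment t is balanced across u (as if randomized),
# u is an unobserved trait, m is an intermediate outcome affected by both t
# and u, and y depends on u only. So the true effect of t on y is zero.
units = []
for t in (0, 1):
    for u in (0, 1):
        units.append({"t": t, "u": u, "m": max(t, u), "y": 2 * u})

def diff_in_means(rows):
    """Mean outcome among treated rows minus mean outcome among control rows."""
    treated = [r["y"] for r in rows if r["t"] == 1]
    control = [r["y"] for r in rows if r["t"] == 0]
    return mean(treated) - mean(control)

overall = diff_in_means(units)
within_m1 = diff_in_means([r for r in units if r["m"] == 1])
print(overall, within_m1)
```

The overall comparison recovers the zero effect, but the comparison restricted to units with `m == 1` does not: among those units, the controls all have `u == 1` while the treated are a mix, so adjusting for the post-treatment variable manufactures a difference.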

Things haven't changed much since the 8-schools experiment, apparently. (See also this article by Ben Hansen.) Howard Wainer once told me that SAT coaching is effective--it's about as effective as the equivalent number of hours in your math or English class at school.

Dumpin' the data in raw

Benjamin Kay writes:

I just finished the Stata Journal article you wrote. In it I found the following quote: "On the other hand, I think there is a big gap in practice when there is no discussion of how to set up the model, an implicit assumption that variables are just dumped raw into the regression."

I saw James Heckman (famous econometrician and labor economist) speak on Friday, and he mentioned that using test scores in many kinds of regressions is problematic, because the assignment of a score is somewhat arbitrary even if the order is not. He suggested that positive, monotonic transformations of scores contain the same information but lead to different standard errors if, in your words, the variable is just "dumped raw into the regression." It was somewhat of a throwaway remark, but considering it longer, I imagine he means that a given difference in test scores need not have a constant effect. The remedy he suggested was to recalibrate exam scores such that they have some objective meaning. For example, on a mechanics exam scored between one and a hundred, one can pass (65) only by successfully rebuilding the engine in the time allotted, but better scores indicate higher quality or faster speed. In this example one might change it to a binary variable for passing or not, an objective test of a set of competencies. However, doing that clearly throws away information.

Do you or the readers of the Statistical Modeling, Causal Inference, and Social Science blog have any advice here? The transformation of the variable is problematic and the critique of using it raw seems a serious one, but the act of narrowly mapping it onto a set of objective discrete skills seems to destroy lots of information. Percentile ranks on exams might be a substitute for the raw scores in many cases, but they introduce other problems, such as in comparisons between groups.

My reply: Heckman's suggestion sounds like it would be good in some cases but it wouldn't work for something like the SAT which is essentially a continuous measure. In other cases, such as estimated ideal point measures for congressmembers, it can make sense to break a single continuous ideal-point measure into two variables: political party (a binary variable: Dem or Rep) and the ideology score. This gives you the benefits of discretization without the loss of information.

In chapter 4 of ARM we give a bunch of examples of transformations, sometimes on single variables, sometimes combining variables, sometimes breaking up a variable into parts. A lot of information is coded in how you represent a regression function, and it's criminal to just take the data as they appear in the Stata file and just dump them in raw. But I have the horrible feeling that many people either feel that it's cheating to transform the variables, or that it doesn't really matter what you do to the variables, because regression (or matching, or difference-in-differences, or whatever) is a theorem-certified bit of magic.
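To see that it does matter what you do to the variables, here's a minimal sketch comparing a predictor dumped in raw with the same predictor on the log scale. The data are invented for illustration (a hypothetical `income` and `outcome`), and `r_squared` is a small helper written here, not a function from any package.

```python
import math

def r_squared(x, y):
    """R-squared from a simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical data: the outcome is roughly linear in log(income),
# not in income itself.
income = [10, 20, 40, 80, 160, 320]
outcome = [1.0, 2.1, 2.9, 4.2, 4.8, 6.1]

raw_fit = r_squared(income, outcome)
log_fit = r_squared([math.log(v) for v in income], outcome)
print(round(raw_fit, 3), round(log_fit, 3))
```

Same data, same ordering of units, noticeably different fit. The regression routine has no way of knowing which scale is the sensible one.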

Hal Varian pointed me to this article in The Economist:

Instrumental variables help to isolate causal relationships. But they can be taken too far

"Like elaborately plumed birds...we preen and strut and display our t-values." That was Edward Leamer's uncharitable description of his profession in 1983. Mr Leamer, an economist at the University of California in Los Angeles, was frustrated by empirical economists' emphasis on measures of correlation over underlying questions of cause and effect, such as whether people who spend more years in school go on to earn more in later life. Hardly anyone, he wrote gloomily, "takes anyone else's data analyses seriously". To make his point, Mr Leamer showed how different (but apparently reasonable) choices about which variables to include in an analysis of the effect of capital punishment on murder rates could lead to the conclusion that the death penalty led to more murders, fewer murders, or had no effect at all.

In the years since, economists have focused much more explicitly on improving the analysis of cause and effect, giving rise to what Guido Imbens of Harvard University calls "the causal literature". The techniques at the heart of this literature--in particular, the use of so-called "instrumental variables"--have yielded insights into everything from the link between abortion and crime to the economic return from education. But these methods are themselves now coming under attack.
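For readers who haven't seen an instrumental variables estimate, here's a minimal sketch of the simplest version, the ratio (Wald) estimate with a binary instrument. All numbers are made up, and the names `z`, `d`, `y`, and `mean_where` are hypothetical.

```python
from statistics import mean

# Hypothetical records (made-up numbers): z is a randomized encouragement
# (the instrument), d is whether the unit took the treatment, y is the outcome.
z = [1, 1, 1, 1, 0, 0, 0, 0]
d = [1, 1, 1, 0, 1, 0, 0, 0]
y = [12.0, 11.0, 13.0, 8.0, 12.0, 8.0, 9.0, 7.0]

def mean_where(values, level):
    """Mean of values among units whose instrument equals level."""
    return mean([v for v, zi in zip(values, z) if zi == level])

effect_of_z_on_y = mean_where(y, 1) - mean_where(y, 0)
effect_of_z_on_d = mean_where(d, 1) - mean_where(d, 0)
iv_estimate = effect_of_z_on_y / effect_of_z_on_d
print(iv_estimate)
```

The ratio has a causal interpretation only under further assumptions, in particular that the instrument affects the outcome only through the treatment. That assumption is where these methods tend to get attacked.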

You can't win for losing

Devin Pope writes:

I wanted to send you an updated version of Jonah Berger and my basketball paper that shows that teams that are losing at halftime win more often than expected.

This new version is much improved. It has 15x more data than the earlier version (thanks to blog readers) and analyzes both NBA and NCAA data.

Also, you will notice if you glance through the paper that it has benefited quite a bit from your earlier critiques. Our empirical approach is very similar to the suggestions that you made.

See here and here for my discussion of the earlier version of Berger and Pope's article.

Here's the key graph from the previous version:

Halfscore.jpg

And here's the update:

hoops.png

Much better--they got rid of that wacky fifth-degree polynomial that made the lines diverge in the graph from the previous version of the paper.

What do we see from the new graphs?

One of those funny things

I published an article in the Stata Journal even though I don't know how to use Stata.

Avi Feller and Chris Holmes sent me a new article on estimating varying treatment effects. Their article begins:

Randomized experiments have become increasingly important for political scientists and campaign professionals. With few exceptions, these experiments have addressed the overall causal effect of an intervention across the entire population, known as the average treatment effect (ATE). A much broader set of questions can often be addressed by allowing for heterogeneous treatment effects. We discuss methods for estimating such effects developed in other disciplines and introduce key concepts, especially the conditional average treatment effect (CATE), to the analysis of randomized experiments in political science. We expand on this literature by proposing an application of generalized additive models to estimate nonlinear heterogeneous treatment effects. We demonstrate the practical importance of these techniques by reanalyzing a major experimental study on voter mobilization and social pressure and a recent randomized experiment on voter registration and text messaging from the 2008 US election.
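The distinction between the ATE and the CATE can be made concrete with a toy calculation. This is a minimal sketch on invented data (a hypothetical mobilization experiment with an `age_group` covariate), using plain subgroup differences in means and not the generalized additive models the abstract proposes.

```python
from statistics import mean

# Hypothetical randomized experiment (made-up data): each record is
# (treated, age_group, voted).
data = [
    (1, "young", 1), (1, "young", 1), (1, "young", 0), (1, "young", 1),
    (0, "young", 0), (0, "young", 1), (0, "young", 0), (0, "young", 0),
    (1, "old", 1), (1, "old", 1), (1, "old", 0), (1, "old", 1),
    (0, "old", 1), (0, "old", 1), (0, "old", 0), (0, "old", 1),
]

def effect(rows):
    """Difference in mean outcome between treated and control rows."""
    treated = [y for t, _, y in rows if t == 1]
    control = [y for t, _, y in rows if t == 0]
    return mean(treated) - mean(control)

ate = effect(data)  # average treatment effect over everyone
cate = {g: effect([r for r in data if r[1] == g]) for g in ("young", "old")}
print(ate, cate)
```

In this invented example the overall effect is an average of a large effect in one group and no effect in the other, which is the kind of pattern an ATE alone would hide.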

This is a cool paper--they reanalyze data from some well-known experiments and find important interactions. I just have a few comments to add:

After six entries and 91 comments on the connections between Judea Pearl and Don Rubin's frameworks for causal inference, I thought it would be good to draw the discussion to a (temporary) close. I'll first present a summary from Pearl, then briefly give my thoughts.

Pearl writes:

John Sides links to this quote from Barney Frank:

Not for the first time, as a -- a -- an elected official, I envy economists. Economists have available to them, in an analytical approach, the counterfactual. Economists can explain that a given decision was the best one that could be made, because they can show what would have happened in the counterfactual situation. They can contrast what happened to what would have happened.

No one has ever gotten reelected where the bumper sticker said, "It would have been worse without me." You probably can get tenure with that. But you can't win office.

I have two thoughts on this. First, I think Frank is a bit too confident in economists' ability to "show what would have happened in the counterfactual situation." Maybe "estimate" or "guess" or "hypothesize" would be a bit more accurate than "show." Recall this notorious graph, which shows the unintentional counterfactual of some economic predictions:

stimulus-vs-unemployment-april.gif

Second, I don't know how Frank can say that about "no one has ever gotten reelected . . ." In Frank's district in Massachusetts, it would take a lot--a lot--for a Democrat to not get reelected.

What with all this discussion of causal inference, I thought I'd rerun a blog entry from a couple years ago about my personal trick for understanding instrumental variables:

A correspondent writes:

I've recently started skimming your blog (perhaps steered there by Brad DeLong or Mark Thoma) but despite having waded through such enduring classics as Feller Vol II, Henri Theil's "Econometrics", James Hamilton's "Time Series Analysis", and T.W. Anderson's "Multivariate Analysis", I'm finding some of the discussions such as Pearl/Rubin a bit impenetrable. I don't have a stats degree so I am thinking there is some chunk of the core curriculum on modeling and causality that I am missing. Is there a book (likely one of yours - e.g. Bayesian Data Analysis) that you would recommend to help fill in my background?

1. I recommend the new book, "Mostly Harmless Econometrics," by Angrist and Pischke (see my review here).

2. After that, I'd read the following chapters from my book with Jennifer:

Chapter 9: Causal inference using regression on the treatment variable

Chapter 10: Causal inference using more advanced models

Here are some pretty pictures, from the low-birth-weight example:

fig10.3.png

and from the Electric Company example:

fig23.1_small.png

3. Beyond this, you could read the books by Morgan and Winship and Pearl, but both of these are a bit more technical and less applied than the two books linked to above.

The commenters may have other suggestions.

Daniel Egan sent me a link to an article, "Standardized or simple effect size: What should be reported?" by Thom Baguley, that recently appeared in the British Journal of Psychology. Here's the abstract:

It is regarded as best practice for psychologists to report effect size when disseminating quantitative research findings. Reporting of effect size in the psychological literature is patchy -- though this may be changing -- and when reported it is far from clear that appropriate effect size statistics are employed. This paper considers the practice of reporting point estimates of standardized effect size and explores factors such as reliability, range restriction and differences in design that distort standardized effect size unless suitable corrections are employed. For most purposes simple (unstandardized) effect size is more robust and versatile than standardized effect size. Guidelines for deciding what effect size metric to use and how to report it are outlined. Foremost among these are: (i) a preference for simple effect size over standardized effect size, and (ii) the use of confidence intervals to indicate a plausible range of values the effect might take. Deciding on the appropriate effect size statistic to report always requires careful thought and should be influenced by the goals of the researcher, the context of the research and the potential needs of readers.

Egan writes:

I run into the problem of reporting coefficients all the time, mostly in the context of presenting effects to non-statisticians. While my audiences are generally bright, the obvious question always asked is "which of these is the biggest effect?" The fact that a sex dummy has a large numerical point estimate relative to number-of-purchases is largely irrelevant - it's because sex's range is tiny compared to other covariates. But moreover, sex is irrelevant to "policy-making" - we can't change a person's sex! So what we're interested in is the viable range over which we could influence an independent variable, and the second-order likely effect upon the dependent. So two questions:

1. For pedagogical effect, is there any way of getting around these problems? How can we communicate the effects to non-statisticians easily (and think someone who has exactly 10 minutes to understand your whole report)?

2. Is there any easy way to infer the elasticity of the effect - i.e. how much can we change the dependent, by attempting to exogenously change one of the independents? While I know that I could design the experiment to do this, I work far more with observational data - and this "effect" size is really what matters the most.

My quick reply to Egan is to refer to my article with Iain Pardoe on average predictive comparisons, where we discuss some of these concerns.
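As a cruder stand-in for that approach (this is not the method from the article), one can multiply each coefficient by a realistic change in its input, so that the comparison is on the scale of the outcome. A minimal sketch, where the coefficients and the "realistic change" values are hypothetical numbers chosen for illustration:

```python
# Hypothetical fitted linear model (made-up coefficients):
#   spend = b0 + b_sex * sex + b_purchases * purchases
coefs = {"sex": 5.0, "purchases": 0.4}

# Hypothetical "realistic change" in each input: 0 -> 1 for the dummy,
# and an assumed 25th -> 75th percentile shift of 20 for the count.
realistic_change = {"sex": 1.0, "purchases": 20.0}

# Predicted difference in the outcome for each realistic change.
predictive_difference = {k: coefs[k] * realistic_change[k] for k in coefs}

for name, diff in predictive_difference.items():
    print(name, round(diff, 2))
```

Here the input with the larger raw coefficient ends up with the smaller predicted difference, which is Egan's point about the sex dummy. The answer depends entirely on what counts as a realistic change, and that is a substantive judgment, not a statistical one.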

I also have some thoughts on the Baguley article:

In the most recent round of our recent discussion, Judea Pearl wrote:

There is nothing in his theory of potential-outcome that forces one to "condition on all information" . . . Indiscriminate conditioning is a culturally-induced ritual that has survived, like the monarchy, only because it was erroneously supposed to do no harm.

I agree with the first part of Pearl's statement but not the second part (except to the extent that everything we do, from Bayesian data analysis to typing in English, is a "culturally induced ritual"). And I think I've spotted a key point of confusion.

To put it simply, Donald Rubin's approach to statistics has three parts:

1. The potential-outcomes model for causal inference: the so-called Neyman-Rubin model in which observed data are viewed as a sample from a hypothetical population that, in the simplest case of a binary treatment, includes y_i^1 and y_i^2 for each unit i.

2. Bayesian data analysis: the mode of statistical inference in which you set up a joint probability distribution for everything in your model, then condition on all observed information to get inferences, then evaluate the model by comparing predictive inferences to observed data and other information.

3. Questions of taste: the preference for models supplied from the outside rather than models inspired by data, a preference for models with relatively few parameters (for example, trends rather than splines), a general lack of interest in exploratory data analysis, a preference for writing models analytically rather than graphically, an interest in causal rather than descriptive estimands.

As that last list indicates, my own taste in statistical modeling differs in some ways from Rubin's. But what I want to focus on here is the distinction between item 1 (the potential outcomes notation) and item 2 (Bayesian data analysis).
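For anyone who hasn't seen the potential-outcomes notation in action, here's a minimal sketch with an invented table of y_i^1 and y_i^2 for six units. The numbers and the assignment vector are hypothetical; in real data only one of the two potential outcomes is ever observed for each unit.

```python
from statistics import mean

# Hypothetical complete table of potential outcomes (y1, y2) for six units.
# This full table is never available in practice.
potential = [(3, 5), (4, 4), (2, 6), (5, 7), (1, 2), (3, 6)]
assignment = [1, 2, 1, 2, 2, 1]  # which treatment each unit actually received

# Unit-level effects and their average, computable only from the full table.
true_effects = [y2 - y1 for y1, y2 in potential]
true_average_effect = mean(true_effects)

# What the analyst sees: one outcome per unit, chosen by the assignment.
observed = [pair[a - 1] for pair, a in zip(potential, assignment)]
estimate = (mean([y for y, a in zip(observed, assignment) if a == 2])
            - mean([y for y, a in zip(observed, assignment) if a == 1]))
print(true_average_effect, round(estimate, 3))
```

Nothing in this bookkeeping is Bayesian or non-Bayesian. It only defines the estimand; how you estimate it from the observed column is a separate choice.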

The potential outcome notation and Bayesian data analysis are logically distinct concepts!

Items 1 and 2 above can occur together or separately. All four combinations (yes/yes, yes/no, no/yes, no/no) are possible:

- Rubin uses Bayesian inference to fit models in the potential outcome framework.

- Rosenbaum (and, in a different way, Greenland and Robins) use the potential outcome framework but estimate using non-Bayesian methods.

- Most of the time I use Bayesian methods but am not particularly thinking about causal questions.

- And, of course, there's lots of statistics and econometrics that's non-Bayesian and does not use potential outcomes.

Bayesian inference and conditioning

In Bayesian inference, you set up a model and then you condition on everything that's been observed. Pearl writes, "Indiscriminate conditioning is a culturally-induced ritual." Culturally-induced it may be, but it's just straight Bayes. I'm not saying that Pearl has to use Bayesian inference--lots of statisticians have done just fine without ever cracking open a prior distribution--but Bayes is certainly a well-recognized approach. As I think I wrote the other day, I use Bayesian inference not because I'm under the spell of a centuries-gone clergyman; I do it because I've seen it work, for me and for others.

Pearl's mistake here, I think, is to confuse "conditioning" with "including on the right-hand side of a regression equation." Conditioning depends on how the model is set up. For example, in their 1996 article, Angrist, Imbens, and Rubin showed how, under certain assumptions, conditioning on an intermediate outcome leads to an inference that is similar to an instrumental variables estimate. They don't suggest including an intermediate variable as a regression predictor or as a predictor in a propensity score matching routine, and they don't suggest including an instrument as a predictor in a propensity score model.

If a variable is "an intermediate outcome" or "an instrument," this is information that must be encoded in the model, perhaps using words or algebra (as in econometrics or in Rubin's notation) or perhaps using graphs (as in Pearl's notation). I agree with Steve Morgan in his comment that Rubin's notation and graphs can both be useful ways of formulating such models. To return to the discussion with Pearl: Rubin is using Bayesian inference and conditioning on all information, but "conditioning" is relative to a model and does not at all imply that all variables are put in as predictors in a regression.

Another example of Bayesian inference is the poststratification which I spoke of yesterday (see item 3 here). But, as I noted then, this really has nothing to do with causality; it's just manipulation of probability distributions in a useful way that allows us to include multiple sources of information.
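The arithmetic of poststratification is just a population-weighted average of cell estimates. Here's a minimal sketch of that final step, assuming made-up cell estimates and census counts. In practice the cell estimates would come from a fitted model; here they are simply typed in, and the name `poststratify` is hypothetical.

```python
# Hypothetical survey cell estimates and population counts (made-up numbers).
cells = {
    "young": {"estimate": 0.60, "population": 3000},
    "middle": {"estimate": 0.50, "population": 5000},
    "old": {"estimate": 0.40, "population": 2000},
}

def poststratify(cells):
    """Average the cell estimates, weighting each by its population count."""
    total = sum(c["population"] for c in cells.values())
    weighted = sum(c["estimate"] * c["population"] for c in cells.values())
    return weighted / total

population_estimate = poststratify(cells)
print(round(population_estimate, 3))
```

The two sources of information stay separate: the survey informs the cell estimates and the census informs the weights.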

P.S. We're lucky to be living now rather than 500 years ago, or we'd probably all be sitting around in a village arguing about obscure passages from the Bible.

To continue with our discussion (earlier entries 1, 2, and 3):

1. Pearl has mathematically proved the equivalence of Pearl's and Rubin's frameworks. At the same time, Pearl and Rubin recommend completely different approaches. For example, Rubin conditions on all information, whereas Pearl does not do so. In practice, the two approaches are much different. Accepting Pearl's mathematics (which I have no reason to doubt), this implies to me that Pearl's axioms do not quite apply to many of the settings that I'm interested in.

I think we've reached a stable point in this part of the discussion: we can all agree that Pearl's theorem is correct, and we can disagree as to whether its axioms and conditions apply to statistical modeling in the social and environmental sciences. I'd claim some authority on this latter point, given my extensive experience in this area--and of course, Rubin, Rosenbaum, etc., have further experience--but I have no problem with Pearl's methods being used on political science problems, and we can evaluate such applications one at a time.

2. Pearl and I have many interests in common, and we've each written two books that are relevant to this discussion. Unfortunately, I have not studied Pearl's books in detail and I doubt he's had the time to read my books in detail either. It takes a lot of work to understand someone else's framework, work that we don't necessarily want to do if we're already spending a lot of time and effort developing our own research programmes. It will probably be the job of future researchers to make the synthesis. (Yes, yes, I know that Pearl feels that he already has the synthesis, and that he's proved this to be the case, but Pearl's synthesis doesn't yet take me all the way to where I want to go, which is to do my applied work in social and environmental sciences.) I truly am open to the possibility that everything I do can be usefully folded into Pearl's framework someday.

That said, I think Pearl is on shaky ground when he tries to say that Don Rubin or Paul Rosenbaum is making a major mistake in causal inference. If Pearl's mathematics implies that Rubin and Rosenbaum are making a mistake, then my first step would be to apply the syllogism the other way and see whether Pearl's assumptions are appropriate for the problem at hand.

3. I've discussed a poststratification example. As I discussed yesterday (see the first item here), a standard idea, both in survey sampling and causal inference, is to perform estimates conditional on background variables, and then average over the population distribution of the background variables to estimate the population average. Mathematically, p(theta) = sum_x p(theta|x)p(x). Or, if x is discrete and takes on only two values, p(theta) = (N_1 p(theta|x=1) + N_2 p(theta|x=2)) / (N_1 + N_2).

This has nothing at all to do with causal inference: it's straight Bayes.
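As a minimal sketch of the averaging formula above, applied to point estimates within cells rather than to full distributions, the two-cell case can be written in a few lines of Python. The function name, the cell estimates, and the cell counts are all made up for illustration.

```python
def poststratify(cell_estimates, cell_counts):
    """Population average: sum over cells of estimate_x * N_x, divided by total N."""
    if len(cell_estimates) != len(cell_counts):
        raise ValueError("need one estimate per cell")
    total = sum(cell_counts)
    return sum(est * n for est, n in zip(cell_estimates, cell_counts)) / total


# Hypothetical two-cell example: estimates within x=1 and x=2,
# with population cell counts N_1 and N_2 (illustrative numbers only).
estimates = [0.20, 0.60]
counts = [300, 700]
population_estimate = poststratify(estimates, counts)
print(population_estimate)
```

The weights come from the population cell counts, not from the sample sizes within cells, which is the whole point of the adjustment.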

Pearl thinks that if the separate components p(theta|x) are nonidentifiable, you can't do this, and you should not include x in the analysis. He writes:

I [Pearl] would really like to see how a Bayesian method estimates the treatment effect in two subgroups where it is not identifiable, and then, by averaging the two results (with two huge posterior uncertainties) gets the correct average treatment effect, which is identifiable, hence has a narrow posterior uncertainty. . . . I have no doubt that it can be done by fine-tuned tweaking . . . But I am talking about doing it the honest way, as you described it: "the uncertainties in the two separate groups should cancel out when they're being combined to get the average treatment effect." If I recall my happy days as a Bayesian, the only operation allowed in combining uncertainties from two subgroups is taking a linear combination of the two, weighted by the (given) relative frequencies of the groups. But, I am willing to learn new methods.

I'm glad that Pearl is willing to learn new methods--so am I--but no new methods are needed here! This is straightforward, simple Bayes. Rod Little has written a lot about these ideas. I wrote some papers on it in 1997 and 2004. Jeff Lax and Justin Phillips do it in their multilevel modeling and poststratification papers where, for the first time, they get good state-by-state estimates of public opinion on gay rights issues. No "fine-tuned tweaking" required. You just set up the model and it all works out. If the likelihood provides little to no information on theta|x but it does provide good information on the marginal distribution of theta, then this will work out fine.

In practice, of course, nobody is going to control for x if we have no information on it. Bayesian poststratification really becomes useful in that it can put together different sources of partial information, such as data with small sample sizes in some cells, along with census data on population cell totals.

Please, please don't say "the correct thing to do is to ignore the subgroup identity." If you want to ignore some information, that's fine--in the context of the models you are using, it might even make sense. But Jeff and Justin and the rest of us use this additional information all the time, and we get a lot out of it. What we're doing is not incorrect at all. It's Bayesian inference. We set up a joint probability model and then work from it. If you want to criticize the probability model, that's fine. If you want to criticize the entire Bayesian edifice, then you'll have to go up against mountains of applied successes.

As I wrote earlier, you don't have to be a Bayesian--I have great respect for the work of Hastie, Tibshirani, Robins, Rosenbaum, and many others who are developing methods outside the Bayesian framework--but I think you're on thin ice if you want to try to claim that Bayesian analysis is "incorrect."

4. Jennifer and I and many others make the routine recommendation to exclude post-treatment variables from analysis. But, as both Pearl and Rubin have noted in different contexts, it can be a very good idea to include such variables--it's just not a good idea to include them as regression predictors. If the only thing you're allowed to do is regression (as in chapter 9 of ARM), then I think it's a good idea to exclude post-treatment predictors. If you're allowed more general models, then one can and should include them. I'm happy to have been corrected by both Pearl and Rubin on this one.

5. As I noted yesterday (see second-to-last item here), all statistical methods have holes. This is what motivates us to consider new conceptual frameworks as well as incremental improvements in the systems with which we are most familiar.

Summary . . . so far

I doubt this discussion is over yet, but I hope the above notes will settle some points. In particular:

- I accept (on authority of Pearl, Wasserman, etc.) that Pearl has proved the mathematical equivalence of his framework and Rubin's. This, along with Pearl's other claim that Rubin and Rosenbaum have made major blunders in applied causal inference (a claim that I doubt), leads me to believe that Pearl's axioms are in some way not appropriate to the sorts of problems that Rubin, Rosenbaum, and I work on: social and environmental problems that don't have clean mechanistic causation stories. Pearl believes his axioms do apply to these problems, but then again he doesn't have the extensive experience that Rosenbaum and Rubin have. So I think it's very reasonable to suppose that his axioms aren't quite appropriate here.

- Poststratification works just fine. It's straightforward Bayesian inference, nothing to do with causality at all.

- I have been sloppy when telling people not to include post-treatment variables. Both Rubin and Pearl, in their different ways, have been more precise about this.

- Much of this discussion is motivated by the fact that, in practice, none of these methods currently solves all our applied problems in the way that we would like. I'm still struggling with various problems in descriptive/predictive modeling, and causation is even harder!

- Along with this, taste--that is, working with methods we're familiar with--matters. Any of these methods is only as good as the models we put into them, and we typically are better modelers when we use languages with which we're more familiar. (But not always. Sometimes it helps to liberate oneself, try something new, and break out of the implicit constraints we've been working under.)

To follow up on yesterday's discussion, I wanted to go through a bunch of different issues involving graphical modeling and causal inference.

Contents:
- A practical issue: poststratification
- 3 kinds of graphs
- Minimal Pearl and Minimal Rubin
- Getting the most out of Minimal Pearl and Minimal Rubin
- Conceptual differences between Pearl's and Rubin's models
- Controlling for intermediate outcomes
- Statistical models are based on assumptions
- In defense of taste
- Argument from authority?
- How could these issues be resolved?
- Holes everywhere
- What I can contribute

Philip Dawid (a longtime Bayesian researcher who's done work on graphical models, decision theory, and predictive inference) saw our discussion on causality and sends in some interesting thoughts, which I'll post here and then very briefly comment on:

Having just read through this fascinating interchange, I [Dawid] confess to finding Shrier and Pearl's examples and arguments more convincing than Rubin's. At the risk of adding to the confusion, but also in hope of helping at least some others, let me briefly describe yet another way (related to Pearl's, but with significant differences) of formulating and thinking about the problem. For those who, like me, may be concerned about the need to consider the probabilistic behaviour of counterfactual variables, on the one hand, or deterministic relationships encoded graphically, on the other, this provides an observable-focused, fully stochastic, alternative. A full presentation of the essential ideas can be found in Chapters 9 (Confounding and Sufficient Covariates) and 10 (Reduction of Sufficient Covariate) of my online document "Principles of Statistical Causality".

Like Pearl, I like to think of "causal inference" as the task of inferring what would happen under a hypothetical intervention, say F_E = e, that sets the value of the exposure E at e, when the data available are collected, not under the target "interventional regime", but under some different "observational regime". We could code this regime as F_E = idle. We can think of the non-stochastic variable F_E as a parameter, indexing the joint distribution of all the variables in the problem, under the regime indicated by its value.

Greg Mankiw links to an article that illustrates the challenges of interpreting raw numbers causally. This would really be a great example for your introductory statistics or economics classes, because the article, by Robert Book, starts off by identifying a statistical error and then goes on to make a nearly identical error of its own! Fun stuff.

This is a pretty long one. It's an attempt to explore some of the differences between Judea Pearl's and Don Rubin's approaches to causal inference, and is motivated by a recent article by Pearl.

Pearl sent me a link to this piece of his, writing:

I [Pearl] would like to encourage a blog-discussion on the main points raised there. For example:

Whether graphical methods are in some way "less principled" than other methods of analysis.

Whether confounding bias can only decrease by conditioning on a new covariate.

Whether the M-bias, when it occurs, is merely a mathematical curiosity, unworthy of researchers' attention.

Whether Bayesianism instructs us to condition on all available measurements.

I've never been able to understand Pearl's notation: notions such as a "collider of an M-structure" remain completely opaque to me. I'm not saying this out of pride--I expect I'd be a better statistician if I understood these concepts--but rather to give a sense of where I'm coming from. I was a student of Rubin and have used his causal ideas for a while, starting with this article from 1990 on estimating the incumbency advantage in politics. I'm pleased to see these ideas gaining wider acceptance. In many areas (including studying incumbency, in fact), I think the most helpful feature of Rubin's potential-outcome framework is to get you, as a researcher, to think hard about what you are in fact trying to estimate. In much of the current discussion of identification strategies, regression discontinuities, differences in differences, and the like, I think there's too much focus on technique and not enough thought put into what the estimates are really telling you. That said, it makes sense that other theoretical perspectives such as Pearl's could be useful too.

To return to the article at hand: Pearl is clearly frustrated by what he views as Rubin's bobbing and weaving to avoid a direct settlement of their technical dispute. From the other direction, I think Rubin is puzzled by Pearl's approach and is not clear what the point of it all is.

I can't resolve the disagreements here, but maybe I can clarify some technical issues.

Controlling for pre-treatment and post-treatment variables

Much of Pearl's discussion turns upon notions of "bias," which in a Bayesian context is tricky to define. We certainly aren't talking about the classical-statistical "unbiasedness," in which E(theta.hat | theta) = theta for all theta, an idea that breaks down horribly in all sorts of situations (see page 248 of Bayesian Data Analysis). Statisticians are always trying to tell people, Don't do this, Don't do that, but the rules for saying this can be elusive. This is not just a problem for Pearl: my own work with Rubin suffers from similar problems. In chapter 7 of Bayesian Data Analysis (a chapter that is pretty much my translation of Rubin's ideas), we talk about how you can't do this and you can't do that. We avoid the term "bias," but then it can be a bit unclear what our principles are. For example, we recommend that your model should, if possible, include all variables that affect the treatment assignment. This is good advice, but really we could go further and just recommend that an appropriate analysis should include all variables that are potentially relevant, to avoid omitted-variable bias (or the Bayesian equivalent). Once you've considered a variable, it's hard to go back to the state of innocence in which that information was never present.

If I'm reading his article correctly, Pearl is making two statistical points, both in opposition to Rubin's principle that a Bayesian analysis (and, by implication, any statistical analysis) should condition on all available information:

1. When it comes to causal inference, Rubin says not to control for post-treatment variables (that is, intermediate outcomes), which seems to contradict Rubin's more general advice as a Bayesian to condition on everything.

2. Rubin (and his collaborators such as Paul Rosenbaum) state unequivocally that a model should control for all pre-treatment variables, even though including such variables, in Pearl's words, "may create spurious associations between treatment and outcome and this, in turns, may increase or decrease confounding bias."

Let me discuss each of these criticisms, as best as I can understand them. Regarding the first point, a Bayesian analysis can control for intermediate outcomes--that's ok--but then the causal effect of interest won't be summarized by a single parameter--a "beta"--from the model. In our book, Jennifer and I recommend not controlling for intermediate outcomes, and a few years ago I heard Don Rubin make a similar point in a public lecture (giving an example where the great R. A. Fisher made this mistake). Strictly speaking, though, you can control for anything; you just then should suitably postprocess your inferences to get back to your causal inferences of interest.

I don't fully understand Pearl's second critique, in which he says that it's not always a good idea to control for pre-treatment variables. My best reconstruction is that Pearl's thinking about a setting where you could estimate a causal effect in a messy observational setting in which there are some important unobserved confounders, and it could well happen that controlling for a particular pre-treatment variable happens to make the confounding worse. The idea, I think, is that if you have an analysis where various problems cancel each other out, then fixing one of these problems (by controlling for one potential confounder) could result in a net loss. I can believe this could happen in practice, but I'm wary of setting this up as a principle. I'd rather control for all the pre-treatment predictors that I can, and then make adjustments if necessary to attempt to account for remaining problems in the model. Perhaps Pearl's position and mine are not so far apart, however, if his approach of not controlling for a covariate could be seen as an approximation to a fuller model that controls for it while also adjusting for other, unobserved, confounders.
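For readers who want to see how adding a pre-treatment predictor could hurt, here is a small simulation sketch of the "M-structure" setup that Pearl's list of questions alludes to. It is one standard illustration and not necessarily the cancellation story reconstructed above. Everything in it is an assumption chosen for illustration: two unobserved causes u1 and u2, a treatment t driven only by u1, an outcome y driven only by u2 (so the true effect of t on y is zero), and a pre-treatment covariate z driven by both.

```python
import random


def slope_unadjusted(t, y):
    """OLS slope of y on t (with intercept)."""
    n = len(t)
    mt, my = sum(t) / n, sum(y) / n
    s_ty = sum((a - mt) * (b - my) for a, b in zip(t, y))
    s_tt = sum((a - mt) ** 2 for a in t)
    return s_ty / s_tt


def slope_adjusted(t, y, z):
    """OLS coefficient on t when y is regressed on t and z (with intercept)."""
    n = len(t)
    mt, my, mz = sum(t) / n, sum(y) / n, sum(z) / n
    tc = [a - mt for a in t]
    yc = [a - my for a in y]
    zc = [a - mz for a in z]
    s_tt = sum(a * a for a in tc)
    s_zz = sum(a * a for a in zc)
    s_tz = sum(a * b for a, b in zip(tc, zc))
    s_ty = sum(a * b for a, b in zip(tc, yc))
    s_zy = sum(a * b for a, b in zip(zc, yc))
    # Solve the 2x2 normal equations for the coefficient on t.
    return (s_ty * s_zz - s_tz * s_zy) / (s_tt * s_zz - s_tz ** 2)


rng = random.Random(0)
t, y, z = [], [], []
for _ in range(20000):
    u1, u2 = rng.gauss(0, 1), rng.gauss(0, 1)  # unobserved causes
    t.append(u1 + rng.gauss(0, 1))             # treatment depends on u1 only
    y.append(u2 + rng.gauss(0, 1))             # outcome depends on u2 only; true effect of t is 0
    z.append(u1 + u2 + rng.gauss(0, 1))        # pre-treatment covariate depends on both

b_unadjusted = slope_unadjusted(t, y)
b_adjusted = slope_adjusted(t, y, z)
print(round(b_unadjusted, 3), round(b_adjusted, 3))
```

Under these assumed equations the unadjusted slope should sit near zero, while the coefficient after adding z as a predictor should be clearly negative (about -0.2 in the large-sample limit). A fuller model that also represented u1 and u2 would not have this problem, which is in the spirit of the last sentence of the paragraph above.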

The sum of unidentifiable components can be identifiable

At other points, Pearl seems to be displaying a misunderstanding of Bayesian inference (at least, as I see it). For example, he writes:

For example, if we merely wish to predict whether a given person is a smoker, and we have data on the smoking behavior of seat-belt users and non-users, we should condition our prior probability P(smoking) on whether that person is a "seat-belt user" or not. Likewise, if we wish to predict the causal effect of smoking for a person known to use seat-belts, and we have separate data on how smoking affects seat-belt users and non-users, we should use the former in our prediction. . . . However, if our interest lies in the average causal effect over the entire population, then there is nothing in Bayesianism that compels us to do the analysis in each subpopulation separately and then average the results. The class-specific analysis may actually fail if the causal effect in each class is not identifiable.

I think this discussion misses the point in two ways.

First, at the technical level, yes you definitely can estimate the treatment effect in two separate groups and then average. Pearl is worried that the two separate estimates might not be identifiable--in Bayesian terms, that they will individually have large posterior uncertainties. But, if the study really is being done in a setting where the average treatment effect is identifiable, then the uncertainties in the two separate groups should cancel out when they're being combined to get the average treatment effect. If the uncertainties don't cancel, it sounds to me like there must be some additional ("prior") information that you need to add.

The second way that I disagree with Pearl's example is that I don't think it makes sense to estimate the smoking behavior separately for seat-belt users and non-users. This just seems like a weird thing to be doing. I guess I'd have to see more about the example to understand why someone would do this. I have a lot of confidence in Rubin, so if he actually did this, I expect he had a good reason. But I'd have to see the example first.
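The first, technical point above can be checked in a toy conjugate-Gaussian model. This is a sketch under stated assumptions, not the model from any of the papers discussed: two group effects with independent weak N(0, tau^2) priors, and n observations that each inform only the equally weighted average of the two effects, with known noise sd sigma. All numerical values are made up. In this model the posterior covariance does not depend on the observed values, so no data need to be simulated.

```python
import math

tau = 10.0      # prior sd for each group effect (weak prior), an assumed value
sigma = 1.0     # sd of each observation, an assumed value
n = 100         # number of observations, each informing only the average effect
w = (0.5, 0.5)  # population shares of the two groups

# Posterior precision = prior precision + (n / sigma^2) * w w^T (Gaussian conjugacy).
k = n / sigma ** 2
p11 = 1 / tau ** 2 + k * w[0] * w[0]
p22 = 1 / tau ** 2 + k * w[1] * w[1]
p12 = k * w[0] * w[1]

# Invert the 2x2 precision matrix to get the posterior covariance.
det = p11 * p22 - p12 ** 2
v11, v22, v12 = p22 / det, p11 / det, -p12 / det

sd_group1 = math.sqrt(v11)
sd_group2 = math.sqrt(v22)
# Posterior sd of the weighted average w1*theta1 + w2*theta2.
sd_average = math.sqrt(w[0] ** 2 * v11 + w[1] ** 2 * v22 + 2 * w[0] * w[1] * v12)
print(sd_group1, sd_group2, sd_average)
```

Each group effect keeps a posterior sd of roughly 7, barely tighter than the prior, while the average has a posterior sd of roughly 0.1. The cancellation comes from the strong negative posterior correlation between the two group effects, which a linear combination of the two marginal uncertainties alone would miss.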

Final thoughts

Hal Stern once told me the real division in statistics was not between the Bayesians and non-Bayesians, but between the modelers and the non-modelers. The distinction isn't completely clear--for example, where does the "Bell Labs school" of Cleveland, Hastie, Tibshirani, etc. fall?--but I like the idea of sharing a category with all the modelers over the years--even those who have not felt the need to use Bayesian methods.

Reading Pearl's article, however, reminded me of another distinction, this time between discrete models and continuous models. I have a taste for continuity and always like setting up my model with smooth parameters. I'm just about never interested in testing whether a parameter equals zero; instead, I'd rather infer about the parameter in a continuous space. To me, this makes particular sense in the sorts of social and environmental statistics problems where I work. For example, is there an interaction between income, religion, and state of residence in predicting one's attitude toward school vouchers? Yes. I knew this ahead of time. Nothing is zero, everything matters to some extent. As discussed in chapter 6 of Bayesian Data Analysis, I prefer continuous model expansion to discrete model averaging.

In contrast, Pearl, like many other Bayesians I've encountered, seems to prefer discrete models and procedures for finding conditional independence. In some settings, this can't matter much: if a source of variation is small, then maybe not much is lost by setting it to zero. But it changes one's focus, pointing Pearl toward goals such as "eliminating bias" and "covariate selection" rather than toward the goals of modeling the relations between variables. I think graphical models are a great idea, but given my own preferences toward continuity, I'm not a fan of the sorts of analyses that attempt to discover whether variables X and Y really have a link between them in the graph. My feeling is, if X and Y might have a link, then they do have a link. The link might be weak, and I'd be happy to use Bayesian multilevel modeling to estimate the strength of the link, partially pool it toward zero, and all the rest--but I don't get much out of statistical procedures that seek to estimate whether the link is there or not.

Finally, I'd like to steal something I wrote a couple years ago regarding disputes over statistical methodology:

Different statistical methods can be used successfully in applications--there are many roads to Rome--and so it is natural for anyone (myself included) to believe that our methods are particularly good for applications. For example, Adrian Raftery does excellent applied work using discrete model averaging, whereas I don't feel comfortable with that approach. Brad Efron has used bootstrapping to help astronomers solve their statistical problems. Etc etc. I don't think that Adrian's methods are particularly appropriate to sociology, or Brad's to astronomy--these are just powerful methods that can work in a variety of fields. Given that we each have successes, it's unsurprising that we can each feel strongly in the superiority of our own approaches. And I certainly don't feel that the approaches in Bayesian Data Analysis are the end of the story. In particular, nonparametric methods such as those of David Dunson, Ed George, and others seem to have a lot of advantages.

Similarly, Pearl has achieved a lot of success and so it would be silly for me to argue, or even to think, that he's doing everything all wrong. I think this expresses some of Pearl's frustration as well: Rubin's ideas have clearly been successful in applied work, so it would be awkward to argue that Rubin is actually doing the wrong thing in the problems he's worked on. It's more that any theoretical system has holes, and the expert practitioners in any system know how to work around these holes.

P.S. More here (and follow the links for still more).

I want to explore the distinction between self-experimentation and formal experimentation in the context of a recent discussion on Seth's blog.

The story begins with two people who found, via self-experimentation, how to make their acne go away:

A student . . . had gone on a camping trip and found that her acne went away. At first she thought it was the sunshine; but then, by self-experimentation, she discovered that the crucial change was that she had stopped using soap to wash her face.
A friend of Seth writes: "I started "washing" my face with water about a month ago, and [now] my face is acne free and soft as a pair of brand new UGG boots. [He had had acne for years.]"

In the comments section, someone writes:

While it would be nice to think that all we have to do to get rid of acne is stop using those expensive cleanser and just use water - this is just anecdotal evidence you present. It would require a large clinical trial to be conclusive.

Seth replies that informal experimentation is cheaper and faster than more formal clinical trials. Also, different things might work for different people, so whether or not a treatment has been evaluated in a large study, it might make sense to test it yourself--especially for something such as acne or weight loss that is not an urgent concern.

This got me thinking . . . what are the benefits (if any) of a formal controlled trial? In statistics, we usually frame these benefits by comparing to observational studies. The big risk in an observational study is that the treatment and control groups will differ in important ways (as in the famous hormone replacement therapy story). Is this worth the cost? Maybe. Sometimes.

A related issue is bias, a word which I am using in the conversational rather than the statistical sense. For example, how would you want to evaluate the risks and effectiveness of a new drug that was developed by a pharmaceutical company at the cost of millions of dollars? I'd be suspicious of an observational study: even if conducted by professionals, there just seem to be too many ways for things to be biased.

In Seth's acne example, there is no financial source of bias. And, as Seth points out, the test is free to apply on yourself. If I had a kid with acne, I'd give it a try and do an experiment--which means trying the soap and no-soap conditions on different days (or different weeks, or months) and measuring and recording acne levels. One thing I've gathered from Seth's work is that there are big benefits to be gained by doing self-experimentation with careful measurement and record keeping, rather than simply trying different things and trying to remember what works.

On the other hand, yeah, I'm skeptical about Seth's acne claims, and I think a larger study would be more likely to convince me. But I don't think it would have to be expensive. All Seth (or somebody) needs is to set up a protocol for deciding when to wash with soap or water and a protocol for measuring acne, then he could get a bunch of volunteers to flip coins and try it. This blog has a few thousand readers, and Seth's diet forum has thousands of participants, so it shouldn't be so hard to find people to do this. I'm not so interested in acne myself, but according to Seth (and others, I assume), "acne really matters," so maybe it's worth giving this a try.
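The coin-flip protocol described above could be sketched roughly as follows. This is only an illustration of the bookkeeping: the function names, the choice of one flip per period, and the scores are all hypothetical, and the scores are made up rather than real measurements.

```python
import random


def assign_conditions(n_periods, seed=None):
    """Flip a fair coin for each period: 'soap' or 'water'."""
    rng = random.Random(seed)
    return [rng.choice(["soap", "water"]) for _ in range(n_periods)]


def difference_in_means(records):
    """Mean acne score under 'water' minus mean under 'soap'.

    records: list of (condition, score) pairs.
    """
    by_condition = {"soap": [], "water": []}
    for condition, score in records:
        by_condition[condition].append(score)
    if not by_condition["soap"] or not by_condition["water"]:
        raise ValueError("need at least one period in each condition")
    mean_water = sum(by_condition["water"]) / len(by_condition["water"])
    mean_soap = sum(by_condition["soap"]) / len(by_condition["soap"])
    return mean_water - mean_soap


schedule = assign_conditions(8, seed=1)
# Made-up scores purely to show the bookkeeping; not real data.
example_records = [("soap", 6), ("water", 4), ("soap", 5), ("water", 3)]
effect = difference_in_means(example_records)
print(schedule, effect)
```

With many volunteers, each would generate a schedule and a set of records, and the per-person differences could then be pooled; how to pool them is a modeling choice left open here.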

Triple-blinding

Fred Bookstein writes:

Your blog comment about triple-blinding was a joke, but there IS a triple-blinding procedure in which the identity of the two groups is not revealed to the statistician on the project until the very end. At all times the data analyses proceed solely in reference to a comparison of some unspecified "group A" with a similarly unspecified "group B," and the identification of who were the intervened-upon and who were not is concealed from him or her until the computations are finished. (There are some other assumptions, e.g. absence of baseline differences, required for this to make sense; it applies mainly in contexts like randomized clinical trials.) You can't really purge the Discussion section of an article of the possibility of spin, but at least you can get the right scatters and tables into the dossier that they're spinning. The possibility was called to my attention a while ago by Michael Myslobodsky, a wise old man from my schizophrenia research world, who did not remotely intend it as a joke.

Interesting. My only experience along these lines is when I was working with a student doing matching for a public health study: There were something like 100 treated units and 1000 potential controls, and we wanted to select 300 of these as matched controls. The researchers were careful to give us only the background information and no outcomes.

Statistics = Job$

I just got this unsolicited email:

Greg Mankiw has a nice little discussion of the difficulty of evaluating the effects of interventions in the n=1 setting:

stimulus-vs-unemployment-april.gif

As Mankiw points out, the bad news about the unemployment rate is bad news with or without the recovery plan and thus--although it certainly seems to knock down the predictions shown in that graph--it does not provide much information on the causal effect of the fiscal stimulus. Especially given that the graph comes from a report released in early January, before anyone knew what would end up being included in the final version of the stimulus plan.

James Heckman recently posted this article, which is based on a paper from 1980. (This sort of thing happens; for example, I just published an article based on work from 1986.) Heckman's tongue-in-cheek article begins:

This paper uses data available from the National Opinion Research Center's (NORC) survey on religious attitudes and powerful statistical methods to evaluate the effect of prayer on the attitude of God toward human beings.

He sets up a model for the intensity of prayer, given its effectiveness. The key assumption is as follows:

Accept on faith that the conditional density of x [the intensity of prayer in the population] given y [God's attitude arrayed on a scale ranging from 0 to 1] is of the form g(x|y) = a(y) exp(xy).

That is, the higher y is, the more prayer we'd see, which makes sense. (Heckman labels the function a(y) as "unknown," but, unless I'm missing something, a(y) is a normalizing constant that can be calculated in closed form by integrating exp(xy) over x. Perhaps this mistake, if it is one, can be caught before the article appears in press.)

Given the reasonable enough model above, Heckman points out that you can differentiate the density of x and learn something about the distribution of y, the effectiveness of prayer.

What does it all mean?

Of course Heckman is joking, but it appears he might be making a more serious point when he comments:

Provided conditional density (1) is assumed, we do not need to observe a variable in order to compute its conditional expectation with respect to another variable whose density can be estimated. For example, one can extend current empirical work in a variety of areas of economics to estimate the effect of income on happiness or the effect of income inequality on democracy.

I don't think this is literally an issue. True, all four of the variables Heckman mentions--income, happiness, income inequality, and democracy--can only be measured with error, but certainly they can be (and are) measured when they are studied empirically.

But I got a little worried that maybe there's something more going on here, some reason I should be giving a little less credence to studies linking economics to psychology and political science. Is Heckman implying that those cross-disciplinary studies have, at bottom, no more foundation than his argument on the effectiveness of prayer?

So I went back to Heckman's article to try to find the flaw in the reasoning. (By "flaw," I don't mean that Heckman was making a mistake; rather, I'm speaking of the hidden logical flaw that makes the reasoning flow, just as in those mathematical arguments where you "prove" 1=0 by means of a series of algebraic expressions that include a division-by-zero.)

Rereading carefully, I found the flaw. I actually think this article would be a good one for a take-home exam in a theoretical statistics class. I'll give the answer below.

The other day I mentioned this article by Lionel Page that found a momentum effect in tennis matches; more specifically: "winning the first set has a significant and strong effect on the result of the second set. A player who wins a close first set tie break will, on average, win one game more in the second set."

tennis.png

I'd display these data with a heat map rather than with overplotted points, but you get the idea.

This looked reasonable to me, but Guy Molyeneux sent in some skeptical comments, which I'll give, followed by Page's response. Molyeneux writes:

Self-experimentation


Jimmy sent this along:

Still, Mr. Perry wondered whether caffeine would help him. When he retired from rowing last July, he decided to do a randomized, blinded, placebo-controlled experiment on himself.

Atlantic causal conference


Dylan Small writes:

We will be holding the next edition of the Atlantic Causal Conference on May 20-21 at Penn. Hope to see you at the conference in May.

It looks great! We actually organized the very first one of these conferences here at Columbia (see also here for a brief report), and I'm pleased to see it's going stronger than ever.

Whiteboard update


Jeronimo writes:

I have been using small whiteboards in my research methods class to have the students work in pairs and it has been a huge success.

I asked, "How large are the whiteboards? And why do you use these rather than simply having them work in their notebooks?" and he responded:

The whiteboards are about 8x11. I like the boards because it changes the dynamic of the class. It introduces the sense of doing something different and also they can erase everything and start all over again. And I guess we don't waste a lot of paper.

I'll try it for the next course I teach.

P.S. As Seth might say, how come I have no problem with anecdotal evidence in education--the area in which I actually work--but when it comes to medicine and public health I focus on potential selection biases, insist on randomized trials, etc.? In my defense, I'd point out that there has been some education research showing the benefits of working in pairs, peer instruction, and so forth--thus the "whiteboard for each pair of students" idea makes sense. But, then again, medical interventions typically make sense, whether or not they work (recall The Doctor's Dilemma).

Seth Roberts has had success with self-experimentation--among other things, he's written a successful diet book on how to lose weight by eating unflavored oil or sugar water--and on his blog he reports his latest self-experiments and their effects on him.  For example, recently he wrote about the beneficial effects of fermented food.

When Seth tries a new food, or a new lifestyle change, and finds positive effects, I'm always skeptical:  maybe he's hoping for such effects and then finding them.  But often they work for others.  For example, his correspondent Tucker Max writes:

I have been reading your posts about bacteria in food, so I decided to try it on my own. I HATE Roquefort and other stinky cheeses, and I am not about to eat fermented meat, so the best thing I could find in Whole Foods was Kombucha tea. It is basically normal tea, with bacteria cultures growing in it. Sounds weird I know, but it actually tastes pretty good. . . . [I'm giving all the details to give a sense of how weird this all sounds to an outsider. --AG]

Anyway, after a week of drinking two bottles a day, I have noticed these changes:

1.  My stool is...well, better. In every way. More regular, more solid . . . [ok, enough detail here]

2.  I have more energy. Aside from subjectively feeling it, I can see the difference in my workout logs, just in this past week I've gone up more weight on exercises than I normally do.

3.  I am feeling overall better. This could very well be placebo effect/confirmation bias as it is a very subjective measurement, but I just feel better. . . .


Sure, but maybe this could all be a confirmation bias.  The toilet stuff sounds objective, but who knows what else is happening when he's doing this?  And then of course there's selection bias, that Seth is hearing about the successes.

Just to be clear:  I'm not trying to criticize what Seth is doing, and I'm not trying to shoot it down.  I'm trying to strengthen it by suggesting ways of thinking about it.  As Seth says, criticism is easy, helping people is hard.

So here's my thought.  Maybe Seth could try a real placebo, as follows:  he could make up some goofy food or behavior change (something like . . . eating fermented food!  Or, I dunno, sleeping with the bed inclined at a 10 degree angle.  Or, I dunno--Seth would be better than I at coming up with something.  Of course, it should be something he tries himself first and finds no adverse effects from.)  He could then make up some fairly vague story about how it helped him, then post it on his blog and see what happens.  Would people respond with stories about how helpful it was?

The great Linus Pauling conspiracy

I'm reminded of the idea I heard once that Linus Pauling knew all along that megadoses of Vitamin C have no effect, and that he altruistically sacrificed his reputation as a scientist to trumpet Vitamin C's virtues, on the theory that it would reduce the suffering of millions via the placebo effect.

In response to my entry on whether propensity score analysis could fix the Harvard Nurses study, Joseph Delaney wrote:

I am unsure about how propensity scores give any advantage over a thoughtfully constructed regression model. . . . I'm not saying that better statistical models shouldn't be used but I worry about overstating the benefits of propensity score analysis. It's an extremely good technique, no question about it, and I've published on one of its variations. But I want to be very sure that we don't miss issues of study design and bias in the process.

I agree completely. But I'd focus on that "thoughtfully constructed" part of the regression model. As we've discussed, even some of the most thoughtful researchers don't talk much at all about construction of the model when they write regression textbooks.

So I think it might be too much to expect working statisticians--those that might be employed by a long-running public health study, for example--to necessarily be using "a thoughtfully constructed regression model." Maybe all we can hope is that they use standard methods and document them well.

From this perspective, propensity scores have the advantage that in their standard implementation they allow a researcher to include dozens of background variables, which is not generally done in classical regression. As I noted in my original entry, there are other methods out there that also can handle large numbers of inputs; it doesn't necessarily have to be propensity scores.

The real issue is whether a method can allow a competent user to include the relevant information. This was the point of the famous Dehejia and Wahba paper on adjustment for observational studies.

Delaney also writes:

Issues of self-selection seriously limit all observational epidemiology. The issue is serious enough that I often wonder if we should not use observational studies to estimate medication benefits (at all). It's just too misleading.

Sure, but we do have to make decisions in life, and what do you do in those settings where no randomized trial exists, or where you don't trust a generalization of the results to the general population? Almost always we need some assumptions or another.

A favorite example


Tim Wilson writes:

For a book I'm writing, I'm looking for good examples in which regression suggested that A caused B, whereas experimental studies showed that there was no causal relationship. Even better (at least for the sake of my example) would be if social policy changes were made based on the regression. Do you have a favorite example or two?

My reply:

Here's everybody's favorite example.

When writing my book with Jennifer, I learned to be super-careful in my use of causal language. For example, when describing a regression coefficient, instead of saying "the effect of x on y," I trained myself to say, "the average difference in y, comparing people who differed by one unit in x." Or, in a multiple regression, "the average difference in y, comparing people who differed by one unit in x while being identical in all other predictors."

At first it's a struggle to speak this way, but eventually, I have found, this constraint has improved my thinking.
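To make the "average difference" reading concrete, here is a minimal sketch with made-up data (the numbers below are hypothetical, not from any study). With a single binary predictor, the least-squares slope is exactly the difference in average y between the two groups, which is all the coefficient is entitled to claim.

```python
from statistics import mean

# Hypothetical toy data: x is a binary predictor, y is an outcome.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [3.0, 4.0, 5.0, 4.0, 6.0, 7.0, 5.0, 6.0]

def ols_slope(x, y):
    """Least-squares slope of y on x (simple regression with an intercept)."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

slope = ols_slope(x, y)

# The descriptive statement: average y among people with x = 1,
# minus average y among people with x = 0.
y1 = [yi for xi, yi in zip(x, y) if xi == 1]
y0 = [yi for xi, yi in zip(x, y) if xi == 0]
diff_in_means = mean(y1) - mean(y0)

print(slope, diff_in_means)  # both are 2.0 for this toy data
```

Nothing in the calculation knows or cares why the two groups differ, which is the point of the descriptive phrasing.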

Application to the studies that purport to show that "real-life voters must also have based their choice of candidate on looks"

Yesterday I discussed an article that claimed (misleadingly, in my opinion) that people decide how to vote based on candidates' physical appearance.

Let's try to describe the study using Jennifer's non-causal approach. OK, here goes:

Winning politicians are judged to be more attractive, on average, than losing politicians.

Or, if there is some controlling for background variables:

Comparing two political candidates, one who won and one who didn't, but who are the same age, sex, and ..., the winner was, on average, judged to be more attractive than the loser.

At first glance, this might not seem to give us anything beyond the usual summary. But I find its precision helpful. Once the results are expressed as a difference, it's clear that there's no direct relevance to the question of how people vote; rather, it's a statement about a way in which successful and unsuccessful politicians differ. Which, among other things, perhaps makes it clearer that there are a lot of ways this could happen.

More generally

The "comparisons" way of describing regressions has helped me in other ways. Iain Pardoe and I wrote an article on average predictive comparisons, in which we focused on the question of what it means to compare two people who differ on one input variable while being identical on all the others. Among other things, this helped clarify for me the distinction between inputs and predictors in a regression model. (For example, in a model with age, sex, and age*sex, there are four predictors--the three items just listed, along with the constant term--but only two inputs: age and sex. It's a challenge to try to compare two people who differ in age but are identical in sex and age*sex--but people do this sort of thing all the time when they look at regression coefficients.)
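The inputs-versus-predictors distinction can be sketched in a few lines. The coefficients below are hypothetical, picked only for illustration, and the helper names are mine. The sketch compares two people one year apart in age and lists which predictors actually change.

```python
NAMES = ["constant", "age", "sex", "age*sex"]

def predictors(age, sex):
    """Map the two inputs (age, sex) to the four predictors of the model."""
    return [1.0, float(age), float(sex), float(age * sex)]

def predict(coefs, age, sex):
    """Linear prediction: sum of coefficient times predictor."""
    return sum(b * p for b, p in zip(coefs, predictors(age, sex)))

# Hypothetical coefficients for constant, age, sex, age*sex.
COEFS = [10.0, 0.5, 2.0, -0.3]

def compare_one_year(sex, age=40):
    """Compare two people of the same sex who differ by one year of age."""
    lo, hi = predictors(age, sex), predictors(age + 1, sex)
    changed = [n for n, a, b in zip(NAMES, lo, hi) if a != b]
    diff = predict(COEFS, age + 1, sex) - predict(COEFS, age, sex)
    return changed, diff

for sex in (0, 1):
    changed, diff = compare_one_year(sex)
    print(f"sex={sex}: predictors that change = {changed}, "
          f"difference in prediction = {diff:.2f}")
```

When sex is 0, only the age predictor moves and the difference equals the age coefficient, 0.5. When sex is 1, age*sex moves too, and the difference is 0.5 - 0.3 = 0.2. Reading the age coefficient on its own as "the" comparison silently assumes sex = 0.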

The interpretation of coefficients as comparisons also helped clarify my thinking regarding the scaling of regression inputs. Now my default is to rescale continuous inputs to have standard deviation 1/2, which makes a comparison of one unit comparable to the difference between 0 and 1 for a binary variable. (Actually, I have to admit that I'm starting to wish that, for comparability with standard deviations in other examples, I'd set the default of rescaling to have a standard deviation of 1, and rescaled binary inputs to be +/-1. I don't know if I have it in me to shift everything in this way, though.)
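Here is a rough sketch of the rescaling, using made-up ages. The function name is hypothetical, and I use the population standard deviation for simplicity (a sample standard deviation would work just as well). It also shows why 1/2 is the natural target: a balanced 0/1 variable already has standard deviation 1/2.

```python
from statistics import mean, pstdev

def rescale_two_sd(values):
    """Center, then divide by two standard deviations, giving sd 1/2."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / (2 * s) for v in values]

ages = [23, 35, 41, 52, 60, 68]   # hypothetical continuous input
binary = [0, 1, 0, 1, 0, 1]       # a balanced binary input, left as is

scaled = rescale_two_sd(ages)

print(pstdev(scaled))  # 0.5, up to floating-point error
print(pstdev(binary))  # 0.5
```

After rescaling, a one-unit comparison in the continuous input spans two standard deviations, roughly from the low end to the high end of the data, much as going from 0 to 1 does for the binary input.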

My recommendation

When describing comparisons and regressions, try to avoid "effect" and other causal terms (except in clearly causal scenarios) and instead write or speak in descriptive terms. It might seem awkward at first, but give it a try for a week. In the amorphous world of applied statistics, it can be oddly satisfying to speak precisely.

I feel I have to respond to this item that people keep pointing me to:

John Antonakis and Olaf Dalgas presented photos of pairs of competing candidates in the 2002 French parliamentary elections to hundreds of Swiss undergrads, who had no idea who the politicians were. The students were asked to indicate which candidate in each pair was the most competent, and for about 70 per cent of the pairs, the candidate rated as looking most competent was the candidate who had actually won the election. The startling implication is that the real-life voters must also have based their choice of candidate on looks, at least in part. [emphasis added]

Nooooooooooooooooooooooooooooooooooo!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

This came up a couple of years ago, when, in response to a similar study, I wrote:

It's a funny result: at first it seems impressive--70% accuracy!--but then again it's not so impressive given that you can predict something on the order of 90% of races just based on incumbency and the partisan preferences of the voters in the states and districts [at least in the U.S.; I don't know about France]. If 90% of the races are essentially decided a year ahead of time, what does it mean to say that voters are choosing 70% correct based on the candidates' looks?

I can't be sure what's happening here, but one possibility is that the more serious candidates (the ones we know are going to win anyway) are more attractive. Maybe you have some goofy-looking people who decide to run in districts where they don't have a chance, whereas the politicians who really have a shot at being in congress take the time to get their hair cut, etc.

Anyway, the point of this note is just that some skepticism is in order. It's fun to find some scientific finding that seems to show the shallowness of voters, but watch out! I guess it pleases the cognitive scientists to think that something as important and seemingly complicated as voting is just some simple first-impression process. Just as, at the next level, it pleases biologists to think that something as important and seemingly complicated as psychology is just some simple selfish-gene thing.

And see here for a discussion of some research by Atkinson, Enos, and Hill on this topic.

Just one more thing

From the news article:

"These findings suggest that voters are not appropriately weighting performance-based information on political candidates when undertaking one of democracy's most important civic duties," the researchers said.

No, no, no. Unless you want to take a very weak interpretation of "suggest." Or, to put it another way, sure, I have no doubt that "voters are not appropriately weighting performance-based information on political candidates"--but I don't see the personal appearance study as relevant to, or even close to definitive on, this point.

I'm as cynical as the next guy, but this sort of thing is going a step too far, even for me.

We were discussing the Angrist and Pischke book with Paul Rosenbaum and I mentioned my struggle with instrumental variables: where do they come from, and doesn't it seem awkward when you see someone studying a causal question and looking around for an instrument?

And Paul said: No, it goes the other way. What Angrist and his colleagues do is to find the instrument first, and then they go from there. They might see something in the newspaper or hear something on the radio and think: Hey--there's a natural experiment--it could make a good instrument! And then they go from there.

This sounded funny at first, but I actually prefer this to the usual presentation of instrumental variables. The "find the IV first" approach is cleaner: in this story, all causation flows from the IV, which has various consequences. So if you have a few key researchers such as Angrist keeping their ears open, hearing of IV's, then you'll learn some things. This approach also fits in with my fail-safe method of understanding IV's when I get stuck with the usual interpretation.

Sometimes the "lead with the natural experiment" approach can lead to missteps, as illustrated by Angrist and Pischke's overinterpretation of David Lee's work on incumbency in elections. (See here for my summary of Lee's research along with a discussion of why he's estimating the "incumbent party advantage" rather than the advantage of individual incumbency.) But generally it seems like the way to go, much better than the standard approach of starting with a causal goal of interest and then looking around for an IV.
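The "all causation flows from the IV" story can be sketched with a small simulation. Everything here is an assumption for illustration: the data are simulated, the effect is constant, and the instrument affects the outcome only through the treatment. The estimate is the standard Wald ratio: the instrument's consequence for the outcome divided by its consequence for the treatment.

```python
import random
from statistics import mean

random.seed(0)
N = 50000
TRUE_EFFECT = 2.0  # hypothetical constant treatment effect

z, d, y = [], [], []
for _ in range(N):
    zi = 1 if random.random() < 0.5 else 0   # the "natural experiment"
    u = random.gauss(0, 1)                   # unobserved confounder
    di = 1 if zi + u > 0.5 else 0            # instrument nudges treatment
    yi = TRUE_EFFECT * di + 3.0 * u + random.gauss(0, 1)
    z.append(zi)
    d.append(di)
    y.append(yi)

def mean_where(values, flags, flag):
    """Average of values over the units whose flag equals the given value."""
    return mean([v for v, f in zip(values, flags) if f == flag])

# Naive comparison of treated and untreated units, confounded by u.
naive = mean_where(y, d, 1) - mean_where(y, d, 0)

# Consequences of the instrument: on the outcome and on the treatment.
effect_on_y = mean_where(y, z, 1) - mean_where(y, z, 0)
effect_on_d = mean_where(d, z, 1) - mean_where(d, z, 0)
wald = effect_on_y / effect_on_d

print(f"naive: {naive:.2f}  wald: {wald:.2f}  true: {TRUE_EFFECT}")
```

In this setup the naive comparison is far too large because u drives both treatment and outcome, while the ratio of the instrument's two consequences lands near the true value. The ratio is only as good as the assumption that the instrument has no other route to the outcome.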

In this spirit, let me again mention my own pet idea for a natural experiment:

The Flynn effect, and the related occasional re-norming of IQ scores, causes jumps in the number of people classified as mentally retarded (conventionally, an IQ of 70, which is two standard deviations below the mean if the mean is scaled at 100). When they rescale the tests, the proportion of people labeled "retarded" jumps up. Seems like a natural experiment that might be a good opportunity to study effects of classifying people in this way on the margin. If the renorming is done differently in different states or countries, this would provide more opportunity for identifying treatment effects.

I think it would be so cool if someone could take this idea and run with it.
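A back-of-the-envelope sketch of the size of the jump, under assumptions that are mine and not part of the proposal: scores are normally distributed with standard deviation 15, and the population mean has drifted up by a hypothetical 3 points since the test was last normed.

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ Normal(mu, sigma)."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

CUTOFF = 70.0
SD = 15.0     # assumption: the usual scaling, so 70 is two sd below 100
DRIFT = 3.0   # hypothetical upward drift in the mean under the old norms

# Just before re-norming: scores have drifted up, so fewer fall below 70.
before = normal_cdf(CUTOFF, 100.0 + DRIFT, SD)

# Just after re-norming: the mean is reset to 100.
after = normal_cdf(CUTOFF, 100.0, SD)

print(f"below cutoff before re-norming: {before:.4f}")
print(f"below cutoff after re-norming:  {after:.4f}")
```

Under these assumptions the share below the cutoff goes from about 1.4% to about 2.3% at the moment of re-norming, with no change in anyone's underlying ability. People near the threshold on either side of that date are the natural comparison groups.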

Mostly Harmless Econometrics


I just read the new book, "Mostly Harmless Econometrics: An Empiricist's Companion," by Joshua Angrist and Jorn-Steffen Pischke. It's an excellent book and, I think, well worth your $35. I recommend that all of you buy it.

I also have a few comments.

Chris Blattman writes,

Several aspiring graduate students have written me [Blattman] about becoming an impact evaluator. . . . I think the best advice is: don't get a PhD to do evaluations. The randomized evaluation is just one tool in the knowledge toolbox. . . . Yes, the randomized evaluation remains the "gold standard" for important (albeit narrow) questions. Social science, however, has a much bigger toolbox for a much broader (and often more interesting) realm of inquiry. . . .

I pretty much agree with Chris on the substance of his remarks, but I think he's missing something when he merges "impact evaluation" and "randomized evaluation" into a single concept. Policy analysis is a big area, and it certainly includes observational studies. We care about the impacts of all sorts of policies that can't be directly studied using experimentation.

P.S. In a different direction, it's interesting to me that policy evaluation is considered part of economics (a little bit) but not really part of political science--but maybe things are changing.

Nate Silver and Greg Mankiw have an interesting exchange about the use of exogenous instruments to estimate causal effects. Unfortunately, the subject is macroeconomics, a topic on which I know next to nothing beyond what I learned in Mr. Cutlip's econ class in 11th grade. But I think it is, in Greg's phrase, "a teachable moment" on the subject of causal inference.

Greg summarizes the exchange pretty well, although I think he's missing a key point.

Nate noticed a newspaper article where Greg related research by Christina and David Romer on the effects of "exogenous" tax cuts on the economy. Nate writes:

The type of tax cut that Romer and Romer think falls into this category is what they call an "exogenous" tax cut -- one designed not to counter business cycles, but rather a "spontaneous" tax cut under relatively healthy economic circumstances.

This is very much not the type of tax cut that we are contemplating right now. Instead, what is being contemplated is a countercyclical action in an unhealthy economy designed to return the economy to normal growth. Romer and Romer are not all that keen on this type of tax cut; in fact, they argue that such "countercyclical fiscal policy is not achieving its intended purpose" . . .

Greg replies:

Why did the Romers focus on exogenous policy changes? The reason is that these are the only changes that can be used to reliably identify the effects of tax policy. . . . The Romers focus on exogenous tax changes for the same reason doctors conduct randomized drug trials--not because they are interested in randomization as a prescriptive tool, but because randomization solves a statistical identification problem.

And now here are my thoughts, again with full recognition that I can really only comment on the statistical issues here, not the economics.

First, Greg is right that it is generally considered desirable or even optimal to estimate treatment effects using randomized experiments or exogenous implementations (but see here for an opposite view from James Heckman), even when the ultimate goal is to understand how the intervention works in the wild, so to speak.

But there is the potential for treatment interactions--that is, a treatment might be more effective in some conditions than in others. There's lots of evidence for treatment interactions in various settings, ranging from education to job training. And this is what Nate is talking about. Again, without attempting to comment on the economics, the treatment effect could vary enough that Nate could be right about the direct relevance of the Romers' study of exogenous tax changes.

To put it another way, Greg is talking about identifiability and Nate is talking about generalizability.
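The distinction can be sketched with a toy simulation. The effect sizes are hypothetical and say nothing about actual tax policy. The treatment is randomized within each condition, so each effect is cleanly identified, but the answer from the studied condition is the wrong answer for the target condition.

```python
import random
from statistics import mean

random.seed(1)

# Hypothetical treatment effects that differ by condition (an interaction).
EFFECT = {"studied": 1.0, "target": 0.2}

def randomized_estimate(condition, n=20000):
    """Difference in means from a randomized experiment in one condition."""
    treated, control = [], []
    for _ in range(n):
        is_treated = random.random() < 0.5
        outcome = random.gauss(0, 1) + (EFFECT[condition] if is_treated else 0.0)
        (treated if is_treated else control).append(outcome)
    return mean(treated) - mean(control)

est_studied = randomized_estimate("studied")
est_target = randomized_estimate("target")

print(f"estimate where the experiment was run: {est_studied:.2f}")
print(f"estimate where we want to apply it:    {est_target:.2f}")
```

Randomization gets the first number right, which is the identifiability point. Whether the first number tells you anything about the second is a question about the size of the interaction, which is the generalizability point, and randomization in the studied condition cannot answer it.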

Greg writes, "I usually don't respond to blogosphere commentary on my work because, after all, time is scarce." But since he's had time to respond once, perhaps he'll be able to respond again and clarify this issue. (I think my time is particularly non-scarce since I'm responding to blogosphere commentary on somebody else's work!) In any case, I like the idea of shifting the debate to a discussion of treatment interactions since then it might be more possible to resolve this on a technical level. Perhaps a teachable moment for me as well as for others.

Sergey Aksenov pointed me to this article by Deepak Hegde and David Mowery:

Bill Easterly is speaking in our seminar this Thursday--the title is "Free the Poor! Ending Global Poverty from the Bottom Up," and it will be in 711 IAB from 11-12:30 (it's open to all)--and I thought I'd prepare by reading his recent article in the New York Review of Books, a review of "The Bottom Billion: Why the Poorest Countries Are Failing and What Can Be Done About It," by Paul Collier. (I haven't read the Collier book so that puts me on an even footing with most of the others in the audience, I think.)

Easterly has some pretty strong criticism of Collier, setting him up with some quotes that set off alarms for me as a statistician. For example, Collier writes:

Aid is not very effective in inducing a turnaround in a failing state; you have to wait for a political opportunity. When it arises, pour in the technical assistance as quickly as possible to help implement reform. Then, after a few years, start pouring in the money for the government to spend.

and

Security in postconflict societies will normally require an external military presence for a long time. Both sending and recipient governments should expect this presence to last for around a decade, and must commit to it. Much less than a decade and domestic politicians are liable to play a waiting game rather than building the peace.... Much more than a decade and citizens are likely to get restive for foreign troops to leave the country.

These are the kind of precise recommendations that make me suspicious. I mean, sure, I know that real effects are nonlinear and even nonmonotonic--too much of anything is too much, and all that--but I'm just about always skeptical of the claim that this sort of "sweet spot" analysis can really be pulled out from data. (See here--search on "Shepherd"--for an example of a nonlinear finding I didn't believe, in that case on the deterrent effect of the death penalty.)

Easterly continues with some more general discussion of the role of statistics in making policy decisions. Here's Easterly:

Alas, as a social scientist using methods similar to Collier's in my research, I [Easterly] am painfully aware of the limitations of our science. When recommending an action on the basis of a statistical correlation, first of all, one must heed the well-known principle that correlation does not equal causation. . . . [Collier] fails to establish that the measures he recommends will lead to the desired outcomes. In fairness to Collier, it is very difficult to demonstrate causal effects with the kind of data we have available to us on civil wars and failing states. . . .

Of course, governments take many actions even when social scientists are unable to establish that such actions will cause certain desirable outcomes. Presumably they use some kind of political judgment that is not based on statistical analysis.

I'm not so sure about this. Bill James once said something along the lines of: The alternative to "good statistics" is not "no statistics," it's "bad statistics." (His example was something like somebody who criticized On Base Percentage because it counted a walk the same as a hit, but then said critic ended up relying on Batting Average, which has lots more serious problems.)

Similarly, if we don't have a fully credible causal inference, I'd still like to see some serious observational analysis.

An example of this is Page Fortna's work on the effectiveness of peacekeeping. I might be garbling the details here, but my recollection is that she compared outcomes in countries with civil war, comparing countries with and without international peacekeeping. The countries with peacekeeping did better. A potential objection arose, which was that perhaps peacekeepers chose the easy cases--maybe the really bad civil wars were so dangerous that peacekeepers didn't go to those places. So she controlled for how bad off the country was before the peacekeeping-or-no-peacekeeping decision was made, using some objective measures of badness--and she found that peacekeeping was actually more likely to be done in the tough cases, and after controlling for how bad off the country was, peacekeeping looked even more effective. Here's the graph (which I made from Page's data when she spoke in our seminar a few years ago):

peacekeeping.png

The red points are countries with peacekeeping, the black points are countries without, and the y-axis represents how long the countries have gone so far without a return of the conflict. In an example like this, the real research effort comes from putting together the dataset; that said, I think this graph is helpful, and it also illustrates the middle range between anecdotes of individual cases, on one hand, and ironclad causal inference on the other. I have no idea where Collier's research fits in here but I'd like to keep a place in policy analysis for this sort of thing. I'd hate to send the message that, if all we have is correlation, the observational statistics must be completely ignored.
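The logic of the adjustment can be sketched with simulated data. To be clear, these are not Page's data: the numbers, the variable names, and the linear model are all assumptions for illustration. In the simulation, peacekeeping is more likely in the tougher cases, so the raw comparison understates its benefit, and controlling for pre-existing severity moves the estimate up.

```python
import random
from statistics import mean

random.seed(2)
N = 20000
TRUE_EFFECT = 1.0  # hypothetical benefit of peacekeeping

severity, peacekeeping, outcome = [], [], []
for _ in range(N):
    s = random.gauss(0, 1)                          # how bad off beforehand
    p = 1.0 if s + random.gauss(0, 1) > 0 else 0.0  # tough cases get it more
    y = TRUE_EFFECT * p - 0.5 * s + random.gauss(0, 1)
    severity.append(s)
    peacekeeping.append(p)
    outcome.append(y)

def cov(a, b):
    """Population covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (z - mb) for x, z in zip(a, b)) / len(a)

# Raw comparison: average outcome with peacekeeping minus without.
with_pk = [y for y, p in zip(outcome, peacekeeping) if p == 1.0]
without_pk = [y for y, p in zip(outcome, peacekeeping) if p == 0.0]
raw = mean(with_pk) - mean(without_pk)

# Coefficient on peacekeeping in a least-squares regression of the
# outcome on peacekeeping and severity (two-predictor closed form).
s_pp, s_ss = cov(peacekeeping, peacekeeping), cov(severity, severity)
s_ps = cov(peacekeeping, severity)
s_py, s_sy = cov(peacekeeping, outcome), cov(severity, outcome)
adjusted = (s_py * s_ss - s_sy * s_ps) / (s_pp * s_ss - s_ps ** 2)

print(f"raw difference: {raw:.2f}   adjusted for severity: {adjusted:.2f}")
```

The raw difference is positive but well below the simulated effect, and the adjusted comparison recovers it, matching the pattern described above. Of course this works here only because severity is the sole source of selection and it is measured; with real data that is the assumption doing the work.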

P.S. Easterly also writes, "To a social scientist, the world is a big laboratory." I would alter this to "observatory." To me, a laboratory evokes images of test tubes and scientific experiments, whereas for me (and, I think, for most quantitative social scientists), the world is something that we gather data on and learn about rather than directly manipulate. Laboratory and field experiments are increasingly popular, sure, but I still don't see them as prototypical social science.

P.P.S. See here for some comments from Chris Blattman, who knows more about this stuff than I do.

Mohammed Mohammed points me to this article by John Nichols, which begins:

Observational studies comparing groups or populations to evaluate services or interventions usually require case-mix adjustment to account for imbalances between the groups being compared. Simulation studies have, however, shown that case-mix adjustment can make any bias worse. One reason this can happen is if the risk factors used in the adjustment are related to the risk in different ways in the groups or populations being compared, and ignoring this commits the ‘‘constant risk fallacy’’. Case-mix adjustment is particularly prone to this problem when the adjustment uses factors that are proxies for the real risk factors. Interactions between risk factors and groups should always be examined before case-mix adjustment in observational studies.

This is interesting, and it connects to my struggles with survey weighting. Survey weighting is similar to adjustment of control and treatment groups in an observational study. (The survey analogy is respondents and nonrespondents.) Nichols's article points out difficulties with adjustment if you ignore interactions, which is a problem we've found in survey adjustment as well. The solution is to include all interactions that are potentially important, but then a model becomes large, and we have to go beyond least squares and exchangeable models . . .

We discuss in chapters 9 and 10 of ARM the general problem of adjusting for differences between treatment and control groups, but we don't specifically focus on the importance of interactions.

Chris Weiss writes with a question about propensity score matching with multilevel data:

Abortion and crime


Leo Kahane, David Paton, and Rob Simmons have an interesting discussion in Vox EU on how to study effects of abortion on crime rates. They write:

The hypothesis that the legalisation of abortion contributed to a dramatic fall in crime rates in the United States, originally proposed by John Donohue and Steven Levitt in an article in 2001 and popularised by Levitt’s best selling book Freakonomics, has been the subject of close scrutiny by other academics. . . .

paton_fig1.JPG

For the US (as noted by D&L), crime starts to fall about 18 years after the legalisation of abortion, consistent with abortion being a causal influence. In contrast, property crime in England and Wales starts to fall about 23 years after the first full year of abortion (1969), too late to be consistent with a causal effect. Violent crime does not decrease at all over the period. . . .

paton_fig2.JPG

A natural way of distinguishing these explanations is to examine whether crime fell more (or increased less) amongst those young enough to have been affected by abortion legalisation compared to those born before the legislation. Figure 2 shows the pattern of conviction rates for those aged 10-15, 16-20 and 21 plus in England and Wales. The trends are not supportive of a link between abortion and crime. . . . Given all this, it seems highly unlikely that the legalisation of abortion can, as D&L hypothesised, explain the dramatic drop in crime observed in the US in the 1990s. However, we cannot necessarily conclude from this that abortion has no impact on crime. . . .

A potential explanation of this apparent conundrum arises from considering what actually happened to children who would not have been born had abortion been legal at the time of their conception. Some such children would have been brought up in adverse circumstances (either by the birth parent or by being taken into the care of the state) and may consequently have been at a higher risk of committing crime. On the other hand, other children conceived in similarly adverse circumstances would have been given up for adoption and then brought up in relatively stable and affluent circumstances. Put another way, prior to the legalisation of abortion, unwanted babies did not necessarily become unwanted children. . . .

paton_fig3.JPG

Figure 3 illustrates trends in rates of infant adoptions and children taken into state care in England and Wales between 1960 and 1980. The rate of children in care barely changes after abortion legalisation in 1968, whilst there appears to be a dramatic effect on adoptions. . . .

Kahane, Paton, and Simmons conclude with a speculation and a research suggestion:

It seems that the legalisation of abortion contributed to a structural shift in society from a situation in which it was normal to put “unwanted” infants up for adoption to one in which adoption was actively discouraged. Once this structural change occurred, it is plausible that, say, a subsequent marginal tightening of the abortion law will have two effects. Children conceived in adverse circumstances are (marginally) more likely to be born rather than aborted but are then more likely to be brought up in those same adverse consequences rather than being put up for adoption. If true, we will observe a negative link between abortion rates and subsequent crime rates.

Although speculative, this hypothesis is consistent with observed trends in adoptions and with the analyses of the abortion-crime link in both the US and the UK. The hypothesis suggests a natural way forward for researchers interested in the social impact of abortion. Rather than trying to identify a causal link from abortion to indirect outcomes such as crime which are only observed many years later, it may be more fruitful to try to tease out the size and direction of the impact of abortion on contemporaneous and direct indicators such as the rates of children taken into care.

I don't know anything about this and so can't comment on the substance, but I like the idea of trying to look at intermediate outcomes. This is an important general point in statistics; see, for example, this famous example.

P.S. The graphs are great, but I have a few suggestions . . .

1. Label the lines directly rather than use a legend. The lines on the first graph are particularly difficult to identify: three of the symbols are nearly identical at this resolution, and the labels on the legend are not in order of the lines.

2. I'm not thrilled with normalizing the series (as was done in the first two graphs). To start with, you lose the information of which countries are higher and which are lower; second, it creates a misleading picture of divergence. For the first graph, I think the way to go is to make two separate plots, one for violent crimes and one for property crimes.

3. Think a bit harder about the numbers on the axes. On the first graph, the y-axis numbers start at 100,000 even though they've been normalized at 100. That can't be right. Labeling the x-axis every two years doesn't help; do every 10 years instead. The second graph doesn't actually say what a "conviction rate" is. But I expect that it makes sense to send the y-axis all the way down to zero. Again, x-axis every 10 yrs would be fine (and also make it easier to compare to the scale of the top graph). Finally, the third graph y-axis should be 0, 10, 20 (or maybe 0, 5, 10, 15), and the y-axis is particularly hard to read.

But the main thing is point 1 above. This should be standard, I think. Given all the effort put into doing this research, why not make the graphs readable too? To me, having cryptic graphs is like writing a paper full of run-on sentences and non sequiturs.

Let me conclude by saying that the paper looks really interesting; I wouldn't have spent the time commenting on the graphs if I didn't think they were potentially saying something important.

Seth is skeptical of skepticism in evaluating scientific research. He starts by pointing out that it can be foolish to ignore data, just because they don't come from a randomized experiment. The "gold standard" of double-blind experimentation has become an official currency, and Seth is arguing for some bimetallism. To continue with this ridiculous analogy, a little bit of inflation is a good thing: some liquidity in scientific research is needed in order to keep the entire enterprise moving smoothly.

As Gresham has taught us, if observational studies are outlawed, then only outlaws will do observational studies.

I think Seth goes too far, though, and that brings up an interesting question.

Kaiser writes,

Boliang writes,

I was sorry to see Steven Levitt repeating the claim about driving a car being good for the environment. I wrote about this last week when it appeared in the other New York Times column of John Tierney, but perhaps it's worth repeating:

Causal inference workshop


Liz Stuart writes:

We are pleased to announce the next Mid-Atlantic Causal Inference Workshop, to be held at Johns Hopkins Bloomberg School of Public Health on Monday and Tuesday May 19-20, ending at noon on May 20.

Robin Hanson suggested here an experimental design in which patients, instead of randomly assigned to particular treatments, are randomly given restrictions (so that each patient would have only n-1 options to consider, with the one option removed at random). I asked some experts about this design and got the following responses.

Eric Bradlow wrote:

I think "exclusion", more generally, in Marketing has been done in the following ways:

[1] A fractional design -- each person only sees a subset of the choices, items, or attributes of a product (intentionally) on the part of the experimenter. Of course, this is commonly done to reduce complexity of the task while trading off the ability to estimate a full set of interactions. The challenge here, and I wrote a paper about this in JMR in 2006, is that people infer the values of the missing attributes and do not, despite instructions, ignore them. Don Rubin actually wrote an invited discussion on my piece. So, random exclusion on the part of the experimenter is done all of the time.

[2] A second way exclusion is sometimes done is prior to the choice or consumption task, you let the respondent remove "unacceptable" alternatives. There was a paper by Seenu Srinivasan of Stanford on this. In this manner, the respondent eliminates "dominated/would never choose alternatives". This is again done for the purposes of reducing task complexity.

[3] A third set of studies I have seen, and Eric Johnson can comment on the psychology of this much more than I can, is something that Dan Ariely (now of Duke, formerly of MIT) and colleagues have done, which seems closest to this post. In these sets of studies, alternatives are presented and then "start to shrink and/or vanish". What is interesting is that these alternatives that he does this to are not the preferred ones and it has a dramatic effect on people's preferences. I always found these studies fascinating.

[4] A fourth set of related work, of which Eric Johnson has great fame, is a "mouse-lab" like experiment where you allow people to search alternatives until they want to stop. This then becomes a sequential search problem; however, people exclude alternatives when they want to stop.

So, Andy, I agree with your posting that:

(a) Marketing researchers have done some of this.

(b) Depending on who is doing the excluding, one will have to model this as a two-step process, where the first step is a self-selection (observational study like likelihood piece, if one is going to be model-based).

The aforementioned Eric Johnson then wrote:

I think there are at least two important thoughts here:

(1) random inclusion for learning... Decision-making has changed the way we think about preferences: They are discovered (or constructed) not 'read' from a table (thus Eric B.'s point 3).

A related point is that a random option can discover a preference (gee, I never thought I liked ceviche....) so there may be value in adding random options to the respondent... The late Hillel Einhorn wrote about 'making mistakes to learn.'

(2) "New Wave" choice modeling often consists of generating the experimental design on the fly: Adaptive conjoint. By definition, these models use the results from one choice to eliminate a bunch of possible options and focus on those that have the most information. Olivier Toubia at Columbia Marketing is a master of this.

To elaborate on Eric B.'s points:

Consumer Behavior research shows that elimination is a major part of choice for consumers, probably determining much of the variance in what is chosen. Make choice easier, learning harder.

There is an interesting tradeoff for both the individual and larger publics here: You try an option you are likely not to like (treatment which may well not work). If you are surprised, then you (or subsequent patients) benefit for a long time. Since this is an intertemporal choice, people may not experiment enough.

Finally, Dan "Decision Science News" Goldstein added:

I've never seen a firm implement such a design in practice, neither when I worked in industry, nor when I judged "marketing effectiveness" competitions.

My own thoughts are, first, that there are a lot of interesting ideas in experimental design beyond the theory in the textbooks. It would be worth thinking systematically about this (someday). Second, I want to echo Eric Johnson's comment about preferences being constructed, not "read off a table" from some idealized utility function. Utility theory is beautiful but it distresses me that people think it fits reality in an even approximate way.

Robin Hanson writes,

To make sense of social complexity we would ideally want to add lots of randomization to people's real choices, and then collect lots of data on what happens to them. But this seems a lot to ask of people. For example, people who eat at a restaurant might be willing to tell you how they felt later after eating there, but they'd be reluctant to eat a random item from the menu even one percent of the time.

Would people be more willing to have a few of their options randomly excluded? For example, would people mind much if on a menu of one hundred items one of the items was randomly excluded each time - "sorry we are out of that today"? Data about choices under such reduced menus would still have a key randomization component.

This idea occurred to me while talking to a cancer doctor who thought he could get thousands of cancer patients to agree to release data on their progress, but who would be more reluctant to accept a random treatment. Once standard drugs have failed, there are about twenty alternative drugs a patient could try, which they usually pick based on the side effects etc. Patients probably wouldn't mind much having one of these options taken off the menu.

My thoughts:

I think I'd eat a random item 1% of the time as part of an experiment--after all, 1% of the time would correspond to three lunches per year.

To get to your main proposal: I think if you exclude one item, you'll get a study that is a mix of experiment and observational study, which could probably be analyzed in a way more robustly than purely observational data could be analyzed, but requiring more information than the analysis of a pure experiment.

This sounds like something that marketing researchers might have studied too.

P.S. See here for much more from the marketing researchers.

Lingzhou Michael Xue writes in with two questions:

"Instrumental variables" is an important technique in applied statistics and econometrics but it can get confusing. See here for our summary (in particular, you can take a look at chapter 10, but Chapter 9 would help too).

Now an example. Piero spoke in our seminar last Thursday on the effects of defamation laws on reporting of corruption in Mexico. In the basic analysis, he found that, in the states where defamation laws are more punitive, there is less reporting of corruption, which suggests a chilling effect of the laws. But there are the usual worries about correlation-is-not-causation, and so Piero did a more elaborate instrumental variables analysis using the severity of homicide penalties as an instrument.

We had a long discussion about this in the seminar. I originally felt that "severity of homicide penalties" was the wackiest instrument in the world, but Piero convinced me that it was reasonable as a proxy for some measure of general punitiveness of the justice system. I said that if it's viewed as a proxy in this way, I'd prefer to use a measurement-error model, but I can see the basic idea.

Still, though, there was something bothering me. So I decided to go back to basics and use my trick for understanding instrumental variables. It goes like this:

The trick: how to think about IV's without getting too confused

Suppose z is your instrument, T is your treatment, and y is your outcome. So the causal model is z -> T -> y. The trick is to think of (T,y) as a joint outcome and to think of the effect of z on each. For example, an increase of 1 in z is associated with an increase of 0.8 in T and an increase of 10 in y. The usual "instrumental variables" summary is to just say the estimated effect of T on y is 10/0.8=12.5, but I'd rather just keep it separate and report the effects on T and y separately.
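Here is a minimal sketch of the trick in Python, on made-up data (not Piero's). The simulation, its coefficients, the unobserved confounder u, and the helper names `slope` and `simulate` are all assumptions chosen for illustration: they are set so that the effect of z on T is 0.8 and the effect of z on y is 10, matching the numbers above. The code reports the two effects separately and only then forms the ratio.

```python
import random

def slope(x, y):
    """Least-squares slope from regressing y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate(n=20000, seed=1):
    """Hypothetical data following z -> T -> y, with an unobserved confounder u."""
    rng = random.Random(seed)
    z, T, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)
        ui = rng.gauss(0, 1)                          # affects both T and y
        Ti = 0.8 * zi + ui + rng.gauss(0, 1)          # effect of z on T is 0.8
        yi = 12.5 * Ti - 5.0 * ui + rng.gauss(0, 1)   # effect of T on y is 12.5
        z.append(zi)
        T.append(Ti)
        y.append(yi)
    return z, T, y

z, T, y = simulate()

# Treat (T, y) as a joint outcome and estimate the effect of z on each.
effect_on_T = slope(z, T)
effect_on_y = slope(z, y)

# The usual instrumental-variables summary is the ratio of the two.
iv_ratio = effect_on_y / effect_on_T

# For comparison: the simple regression of y on T, which the confounder distorts.
naive = slope(T, y)

print("effect of z on T:", round(effect_on_T, 2))
print("effect of z on y:", round(effect_on_y, 2))
print("ratio (IV estimate of T on y):", round(iv_ratio, 2))
print("simple regression of y on T:", round(naive, 2))
```

In this simulated setup the ratio works only because z was generated to affect y solely through T. With real data that is exactly the assumption in question, which is why looking at the two estimated effects on their own can be clarifying.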

In Piero's example, this translates into two statements: (a) States with higher penalties for murder had higher penalties for defamation, and (b) States with higher penalties for murder had less reporting of corruption.

Fine. But I don't see how this adds anything at all to my understanding of the defamation/corruption relationship, beyond what I learned from his simpler finding: States with higher penalties for defamation had less reporting of corruption.

In summary . . .

If there's any problem with the simple correlation, I see the same problems with the more elaborate analysis--the pair of correlations which is given the label "instrumental variables analysis." I'm not opposed to instrumental variables in general, but when I get stuck, I find it extremely helpful to go back and see what I've learned from separately thinking about the correlation of z with T, and the correlation of z with y. Since that's ultimately what instrumental variables analysis is doing.

Here's some material on causal inference from a regression perspective. It's from our recent book, and I hope you find it useful.

Chapter 9: Causal inference using regression on the treatment variable

Chapter 10: Causal inference using more advanced models

Chapter 23: Causal inference using multilevel models

Here are some pretty pictures, from the low-birth-weight example:

fig10.3.png

and from the Electric Company example:

fig23.1_small.png

Nick Firoozye writes,

I [Firoozye] wanted to point your attention to the following podcast by Ian Ayres on Supercrunchers, where he shows himself an enthusiastic (if perhaps a bit naïve) proponent of the statistical method. Entertaining, definitely. One thing though that I thought you might be interested in is Russ Roberts’ (the interviewer's) own skepticism over the econometric method, which I think probably warrants a response. It may be that Roberts’ own view is due to his now-Austrian economics slant (i.e., somewhat anti-formalist approach) or perhaps to the fact that mainstream econometrics is a frequentist pursuit and one might question the honesty of the results as a consequence.

I don't really have much to add here, except that the problem noted by Roberts (it's hard to know whether to believe a statistical study) is even more of a problem with non-statistical empirical studies (i.e., anecdotes). I think Roberts might be overstating the problem because he is focusing on issues where he already had a strong personal opinion even before seeing data analyses. (He mentions the examples of concealed handguns and anti-theft devices on cars.) But there are a lot of areas where we have only weak opinions which can indeed be swayed by data (see here for some examples). These cases are important in their own right and also can serve as benchmarks for the success of statistical analysis, so that we can trust good analyses more when they're applied to tougher problems. This is one way that applied statistics proceeds, by exemplary analyses of problems that might not be hugely important on their own terms but serve as useful templates. Consider, for example, the book by Snedecor and Cochran: it's full of examples on agricultural field trials. Sure, these are important, but these methods have been useful in so many other fields. This is a great example, actually: Snedecor and his colleagues worked on agricultural trials because they cared about the results--these were not "toy examples" or thought experiments--and the resulting methods endured.

Bob Erikson writes,

I was trolling the internet and came across your debate with Jens H. from Feb 15 07 on your blog about differences in differences.

You might find the attached document of interest. It is a once-influential currently-obscure article from half a century ago on this topic. The language is not contemporary. But note Campbell's example of 2 ways to analyze the substantive problem and two very different interpretations. Presumably Campbell is correct, using a difference of differences approach.

Bob's office is just across from mine in the political science department, but of course we communicate via blogs and emails. Anyway, I'll have to read the paper carefully. Also it will be interesting to see if they noticed that before-after correlations are higher for controls than for treated units.

Encouraged by the success of his self-experimentation to help his sleep, mood, and weight concerns, Seth Roberts has been experimenting with the effects of drinking flaxseed oil. Here's an example of his results:

sethgraph.jpg

Commenting on another recent one of Seth's self-experiments, I wrote,

Seth, Not to be a wet blanket or anything, but aren’t you worried that your findings might be due to expectation effects: you knew which oil you were taking when doing the tests, right?

Seth replied,

Andrew, no, I’m not worried that the results are due to expectations. If the results always conformed to my expectations, I’d be worried, but they haven’t — see my post about eggs. Moreover, this particular result confirms a result that was a surprise. In other words, I’ve gotten the same result when I was expecting it and when I wasn’t expecting it.

I'm still concerned, though. Seth is saying that it's not just an expectation effect because he wasn't always expecting the results. But I could see a bias arising from positive feedback, as follows: You try a new treatment and then see what happens after, with no expectations except that things might change. There is some noise to this measurement--just at random, it will be higher or lower than before. Having seen this, you adjust your expectations; this then affects your next measurement, etc.

I'm not saying this is definitely happening, but it could be.

To Seth: maybe you could get a partner in experimentation, someone who lives or works nearby, and he or she could give you a randomly assigned oil. That is, your partner would know which oil you're getting, but you wouldn't. In fact, you wouldn't even know if you were being given something new that day. (It wouldn't be hard to set up some complicated randomization scheme so that, for example, you would get one oil for several days, then another, etc.) You, of course, could provide the same service for your self-experimentation partner. This also has the virtue that you'll get twice as many measurements.
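A rough sketch of what such a randomization scheme might look like, in Python. Everything here is an assumption for illustration: the function name `blinded_schedule`, the oil labels, the block lengths of 3 to 7 days, and the 30-day horizon. The partner would run this and keep the output private.

```python
import random

def blinded_schedule(oils, n_days, min_block=3, max_block=7, seed=None):
    """Return a list with one oil label per day.

    Each oil is given for a random run of min_block..max_block days, and the
    next run always uses a different oil. Requires at least two oils.
    """
    if len(oils) < 2:
        raise ValueError("need at least two oils to randomize between")
    rng = random.Random(seed)
    schedule = []
    previous = None
    while len(schedule) < n_days:
        # Pick any oil other than the one just used, so runs do not merge.
        oil = rng.choice([o for o in oils if o != previous])
        block = rng.randint(min_block, max_block)
        schedule.extend([oil] * block)
        previous = oil
    return schedule[:n_days]   # the final run may be cut short

# Hypothetical labels; the person taking the oil never sees this list.
oils = ["oil A", "oil B", "oil C"]
schedule = blinded_schedule(oils, n_days=30, seed=2024)
for day, oil in enumerate(schedule, start=1):
    print(day, oil)
```

Random block lengths make it harder to guess when a switch has happened, which is the point of not knowing whether you were given something new that day.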

Racial bias in baseball umpiring?


Some economists from McGill University and UT-Austin just wrote a paper (Parsons et al., 2007) that purports to find racial bias in baseball umpiring, specifically ball/strike calls on pitches at which the batter does not swing. Here's the payoff quote:


The highest percentage of called strikes occurs when both umpire and pitcher are White, while the lowest percentage is when a White umpire is judging a Black pitcher. What is intriguing is that Black umpires judge Hispanic pitchers harshly, relative to how they are judged by White and Hispanic umpires; but Hispanic umpires treat Black pitchers nearly identically to the way Black umpires treat them. Minority umpires treat Asian pitchers far worse than they treat White pitchers.

(Personally, I'm not sure I agree that the apparent bias of black umpires against hispanics and vice versa is more (or less) intriguing than the apparent bias of whites against blacks. But Tom Lehrer's National Brotherhood Week comes to mind.)

Michael Sobel sent me this paper which will appear in the Journal of Educational and Behavioral Statistics. It's about mediation: a crucial issue in causal inference and a difficult issue to think about. The usual rhetorical options here are:

- Blithe acceptance of structural equation models (of the form, "we ran the analysis and found that A mediates the effects of X on Y")

- Blanket dismissal (of the form, "estimating mediation requires uncheckable assumptions, so we won't do it")

- Claims of technological wizardry (of the form, "with our new method you can estimate mediation from observational data")

For example, in our book, Jennifer and I illustrate that regression estimates of mediation make strong assumptions, and we vaguely suggest that something better might come along. We don't provide any solutions or even much guidance.

Michael has thought hard about these problems for a long time. (For example, see here and here, or for some laffs, here.) Michael's also notorious for pointing out that the phrase "causal effect" is redundant: all effects are causal. Anyway, I was interested to see what he has to say about mediation. Here's the abstract of the paper:

Nisha Gottfredson writes:

Wil Wilkinson points to an interesting article by Nicholas Eberstadt (and adds some comments of his own) on the topic of the high birth rates in the United States compared to Europe. Wilkinson attributes the difference to Americans' higher average rates of reported happiness and, regarding government policy, cites Shelly Lundberg and Robert Pollak to suggest that birth rates could be raised via policies leading to lower unemployment for young adults. I know there have been some studies of the relation between local economic conditions and birthrates, but I can't remember the findings. I seem to recall some interactions, with different patterns among different ethnic groups.

The business of unemployment and children is interesting, since from an abstract perspective I suppose that lower unemployment is a good thing, but so are lower birthrates (at least in the U.S., where the population is growing via immigration anyway). And of course if people are unemployed, presumably they have more time to take care of the children. Maybe "unemployment" isn't quite the right measure here.

To continue with the economic argument . . . Wilkinson writes, "I like the optimism explanation. It’s easy to see why folks would refrain from reproduction if they thought their kids had only a broiling, denuded planet full of wretched consumer-zombies living pointless lives in cookie-cutter McMansions and soulless big box strip malls to look forward to." This isn't quite right, I think: McMansions are good things to have--I think that pessimism is thinking you'll live in a bad neighborhood, not that you'll live in a McMansion.

What I really meant to say was . . .

Anyway, the real reason I brought this up was not to talk about happiness and birth rates (on which I'm no expert) but to discuss the challenges of the "why" sort of causal inference. It's a basic mode of science (and of social science): we see stylized fact X (in this case, higher birth rates in the U.S. than in Europe) and then try to make various comparisons to figure out the causes of X.

But Rubin has taught us to look for the effects of causes, not the causes of effects. A similar problem arose in our Department of Health study where we were trying to understand the different rates of rodent infestation comparing whites, blacks, and hispanics in NYC. Even after controlling for some available information such as the neighborhood, the quality of the building, the floor of the apartment, etc., there were more rodents in the apartments of ethnic minorities. We'd like to "explain"--understand--this pattern, but this sort of reasoning doesn't fit directly into the statistical framework of causal inference. One approach is to reframe things in terms of potential interventions (as I've done above with the birthrate example by imagining policies that lower unemployment). But that doesn't seem to completely get at Wilkinson's question about happiness.

Mediation


Rahul writes:

Jens Hainmueller has an interesting entry here about estimating the causal effects of the 2004 Madrid bombing on the subsequent Spanish elections, by comparing regular votes to absentee votes that were cast before the bombing. Jens cites a paper by Jose Montalvo that uses difference-in-difference estimation; that is, a comparison of this form:

[(avg of treated units at time 2) - (avg of controls at time 2)] - [(avg of treated units at time 1) - (avg of controls at time 1)]

I'm sure this is fine, but it's just a special case of lagged regression where the lag is restricted to have a coefficient of 1. In educational research, this is sometimes called the analysis of "gain scores." In any case, you're generally limiting your statistical efficiency and range of applicability by using differences rather than the more general regression formulation.
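To make the "special case" point concrete, here is a small Python sketch on simulated data (nothing to do with the Spanish election data). The data-generating numbers and the function names `diff_in_diff` and `lagged_regression` are assumptions for illustration. The simulation has a true treatment effect of 3, a lag coefficient of 0.5 rather than 1, and treated units that start lower at time 1.

```python
import random

def mean(v):
    return sum(v) / len(v)

def diff_in_diff(y1, y2, treated):
    """[(treated - control) at time 2] - [(treated - control) at time 1]."""
    t2 = mean([b for b, t in zip(y2, treated) if t])
    c2 = mean([b for b, t in zip(y2, treated) if not t])
    t1 = mean([a for a, t in zip(y1, treated) if t])
    c1 = mean([a for a, t in zip(y1, treated) if not t])
    return (t2 - c2) - (t1 - c1)

def lagged_regression(y1, y2, treated):
    """Least-squares fit of y2 = a + theta*treated + b*y1; returns (theta, b)."""
    t = [1.0 if flag else 0.0 for flag in treated]
    mt, m1, m2 = mean(t), mean(y1), mean(y2)
    tc = [v - mt for v in t]
    xc = [v - m1 for v in y1]
    yc = [v - m2 for v in y2]
    stt = sum(a * a for a in tc)
    sxx = sum(a * a for a in xc)
    stx = sum(a * b for a, b in zip(tc, xc))
    sty = sum(a * b for a, b in zip(tc, yc))
    sxy = sum(a * b for a, b in zip(xc, yc))
    det = stt * sxx - stx * stx          # solve the 2x2 normal equations
    theta = (sty * sxx - sxy * stx) / det
    b = (sxy * stt - sty * stx) / det
    return theta, b

# Hypothetical data: treated units start lower at time 1; true effect is 3.
rng = random.Random(7)
treated, y1, y2 = [], [], []
for i in range(5000):
    flag = (i % 2 == 0)
    before = rng.gauss(-1.0 if flag else 1.0, 1.0)
    after = 2.0 + 3.0 * flag + 0.5 * before + rng.gauss(0, 1)
    treated.append(flag)
    y1.append(before)
    y2.append(after)

did = diff_in_diff(y1, y2, treated)

# The same number, computed as a difference in mean gain scores.
gains = [b - a for a, b in zip(y1, y2)]
gain_diff = (mean([g for g, t in zip(gains, treated) if t])
             - mean([g for g, t in zip(gains, treated) if not t]))

theta, b = lagged_regression(y1, y2, treated)
print("difference in differences:", round(did, 2))
print("difference in mean gains: ", round(gain_diff, 2))
print("lagged regression: theta =", round(theta, 2), " b =", round(b, 2))
```

Under these assumptions the difference-in-differences estimate lands near 4 rather than 3, because forcing the lag coefficient to 1 ignores regression to the mean between groups that differ at time 1. The lagged regression estimates that coefficient instead of fixing it.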

I can see why people set up these difference models--if you have a model with error terms for individual units (in this case, precincts or whatever--I can't actually get the link to the Montalvo paper), then differencing makes the error terms drop out, seemingly giving a cleaner estimator. But once you realize that it's a special case of regression, and you start thinking of things like regression to the mean (not to mention varying treatment effects), you're led to the more general lagged regression.

Not that lagged regression solves all problems. It's just better than difference in differences.

P.S. Actually, I would expect there to be varying treatment effects in the Spanish election example.

A new causality blog


A group from the University of California, Los Angeles, including Judea Pearl, the popular author of books on Bayesian networks (sometimes referred to as belief networks or as graphical models, as they aren't Bayesian in the Bayesian statistics sense) and causality, has set up a new blog on causality. Their approach to causality is based on probability theory with random variables and operators. For a taste of it, see "Causality is undefinable" or "The meaning of counterfactuals".

While it takes the form of a blog, the system is more like a help line. The good stuff is often in the comments.

cover.gif

Our book is finally out! (Here's the Amazon link) I don't have much to say about the book here beyond what's on its webpage, which has some nice blurbs as well as links to the contents, index, teaching tips, data for the examples, errata, and software.

But I wanted to say a little about how the book came to be.

Michael Sobel is speaking Monday. Here's the abstract:

During the past 20 years, social scientists using observational studies have generated a large and inconclusive literature on neighborhood effects. Recent workers have argued that estimates of neighborhood effects based on randomized studies of housing mobility, such as the “Moving to Opportunity Demonstration” (MTO), are more credible. These estimates are based on the implicit assumption of no interference between units, that is, a subject’s value on the response depends only on the treatment to which that subject is assigned, not on the treatment assignments of other subjects. For the MTO studies, this assumption is not reasonable. Although little work has been done on the definition and estimation of treatment effects when interference is present, interference is common in studies of neighborhood effects and in many other social settings, for example, schools and networks, and when data from such studies are analyzed under the “no interference assumption”, very misleading inferences can result. Further, the consequences of interference, for example, spillovers, should often be of great substantive interest, though little attention has been paid to this. Using the MTO demonstration as a concrete context, this paper develops a framework for causal inference when interference is present and defines a number of causal estimands of interest. The properties of the usual estimators of treatment effects, which are unbiased and/or consistent in randomized studies without interference, are also characterized. When interference is present, the difference between a treatment group mean and a control group mean (unadjusted or adjusted for covariates) does not estimate an average treatment effect, but rather the difference between two effects defined on two distinct subpopulations. This result is of great importance, for a researcher who fails to recognize this could easily infer that a treatment is beneficial when it is universally harmful.

Here's the paper. (Scroll past the first page which is blank.) See here for more on Sobel and causal inference. The talk is Mon noon in the stat dept.

Michael Weiksner writes,

I [Weiksner] do research on deliberation, where the treatment itself is defined as the interaction with other people (who are inevitably also randomly assigned to the treatment group). Because all the treated individuals interact, I know that the safest course of action is to look only at group level effects. But that's highly unsatisfying, since you can't really shed any light on questions about individuals, like does deliberation create better citizens?

I have always been taught that the randomized experiment is the gold standard for causal inference, and I always thought this was a universal view. Not among all econometricians, apparently. In a recent paper in Sociological Methodology, James Heckman refers to "the myth that causality can only be determined by randomization, and that glorifies randomization as the 'gold standard' of causal inference."

It's an interesting article because he takes the opposite position from all the statisticians I've ever spoken with (Bayesian or non-Bayesian). Heckman is not particularly interested in randomized experiments and does not see them as any sort of baseline, but he very much likes structural models, which statisticians are typically wary of because of their strong and (from a statistical perspective) nearly untestable assumptions. I'm sure that some of this dispute reflects different questions that are being asked in different fields.

Heckman's article is a response to this article [link fixed--thanks Alex] by Michael Sobel, who argues that Heckman's methods are actually not so different from the methods commonly used in statistics. It's all a bit baffling to me because I actually thought that economists were big fans of randomized experiments nowadays.

P.S. As noted by an anonymous commenter, some controversy arose from this issue of Sociological Methodology, but I'm not going into detail here since said controversy is not very relevant to the scientific issues that arise in these papers, which is what I wanted to post on.

The posters for the second mid-Atlantic causal modeling conference have been listed (thanks to Dylan Small, who's organizing the conference). The titles all look pretty interesting, especially Egleston's and Small's on intermediate outcomes. Here they are:

I've become increasingly convinced of the importance of treatment interactions--that is, models (or analyses) in which a treatment effect is measurably different for different units. Here's a quick example (from my 1994 paper with Gary King):

redistrict.png

But there are lots more: see this talk.

Given all this, I was surprised to read Simon Jackman's blog describing David Freedman's talk at Stanford, where Freedman apparently said, "the default position should be to analyze experiments as experiments (i.e., simple comparison of means), rather than jamming in covariates and even worse, interactions between covariates and treatment status in regression type models."

Well, I agree with the first part--comparison of means is the best way to start, and that simple comparison is a crucial part of any analysis--for observational or experimental data. (Even for observational studies that need lots of adjustment, it's a good idea to compute the simple difference in averages, and then understand/explain how the adjustment changes things.) But why is it "even worse" to look at treatment interactions??? On the contrary, treatment interactions are often the most important part of a study!

I've already given one example--the picture above, where the most important effect of redistricting is to pull the partisan bias toward zero. It's not an additive effect at all. For another example that came up more recently, we found that the coefficient of income, in predicting vote, varies by state in interesting ways:

superplot_var_slopes_annen_2000.png

Now, I admit that these aren't experiments: the redistricting example is an observational study, and the income-and-voting example is a descriptive regression. But given the power of interactions in understanding patterns in a nonexperimental context, I don't see why anyone would want to abandon this tool when analyzing experiments. Simon refers to this as "fritzing around with covariate-adjustment via modeling" but in these examples, interactions are more important than the main effects.

Interactions are important

Dave Krantz has commented to me that it is standard in psychology research to be interested in interactions, typically 3-way interactions actually. The point is that, in psychology, the main effects are obvious; it's the interactions that tell you something.

To put it another way, the claim is that the simple difference in means is the best thing to do. This advice is appropriate for additive treatment effects. I'd rather not make the big fat assumption of additivity if I can avoid it; I'd rather look at interactions (to the extent possible given the data).
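To make the additivity point concrete, here is a minimal sketch in Python with made-up numbers. The data, the binary covariate `x`, and the function names are all hypothetical; nothing here comes from the redistricting or voting examples.

```python
from statistics import mean

# Hypothetical experiment: each record is (treated, x, y), where x is a
# binary pre-treatment covariate. The numbers are invented so that the
# treatment effect is +1 when x == 0 and +5 when x == 1.
data = [
    (0, 0, 10.0), (0, 0, 12.0), (1, 0, 11.0), (1, 0, 13.0),
    (0, 1, 20.0), (0, 1, 22.0), (1, 1, 25.0), (1, 1, 27.0),
]

def diff_in_means(records):
    """Simple comparison: mean(y | treated) - mean(y | control)."""
    treated = [y for t, _, y in records if t == 1]
    control = [y for t, _, y in records if t == 0]
    return mean(treated) - mean(control)

def effects_by_subgroup(records):
    """Difference in means within each level of x (a saturated interaction model)."""
    levels = sorted({x for _, x, _ in records})
    return {x: diff_in_means([r for r in records if r[1] == x]) for x in levels}

overall = diff_in_means(data)          # one number for the whole experiment
by_x = effects_by_subgroup(data)       # one number per subgroup
interaction = by_x[1] - by_x[0]        # how much the effect changes with x
```

The overall difference is a perfectly good summary of the average effect in this sample, but it says nothing about the fact that the effect in one subgroup is several times the effect in the other. That is the information the additivity assumption throws away.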

Different perspectives yield different statistical recommendations

I followed the link at Simon's blog and took a look at Freedman's papers. They were thought-provoking and fun to read, and one thing I noticed (in comparison to my papers and books) was: no scatterplots, and no plots of interactions! I'm pretty sure that it would've been hard for me to have realized the importance of interactions without making lots of graphs (which I've always done, even way back before I knew about interactions). In both the examples shown above, I wasn't looking for interactions--they were basically thrust upon me by the data. (Yes, I know that the Mississippi/Ohio/Connecticut plot doesn't show raw data, but we looked at lots of raw data plots along the way to making this graph of the fitted model.) If I hadn't actually looked at the data in these ways--if I had just looked at some regression coefficients or differences in means or algebraic expressions--I wouldn't have thought of modeling the interactions, which turned out to be crucial in both examples.

I know that all my data analyses could use a bit of improvement (if only I had the time to do it!). A first step for me is to try to model data and underlying processes as well as I can, and to go beyond the idea that there's a single "beta" or "treatment effect" that we're trying to estimate.

A while ago I had some comments about how, in the best works of alternative history, the alternative world is not "real," that in an underlying sense, our world is the real one. Just to update on this, I sent my thoughts to the great John Clute, who had the following response:

I think it's a neat formulation of at least something of what goes on in the best Alternative History novels, though I tend to think of it more as an involuntary (or elated, or knowing) insertion of a touch of Yin into the Yang. I think the best sf books do tend to wrestle with what I'd call (hey, why not) Minotaur-bearing labyrinth of the "real", and that the best of them tend to make use of that engagement. But a different focus of energy also operates: the enormously powerful urge of the good writer to realize the imagined thing. BRING THE JUBILEE loses some of its poignance and grasp if we think that the world in which it begins is somehow less real, in the imagined matrix of the tale, than the world in which it ends; though it is at the same time clear that the reader is in a "privileged" position as regards his understanding of the nature of the reality of the world "created".

In fact, I think the best angle of understanding of the issues you're addressing may be in reader theory: that the reader is in a particularly privileged, and exposed, and delicate position vis a vis the reality register of any alternative world story; and will be remarkably sensitive to any Yin within the Yang. Something like this is true of any reading experience of fiction (though I find your statement that "in a sense, all novels are alternate histories" true but maybe a bit masking of the readerly issues foregrounded here); but clearly, the stakes are much higher and more visible in the alternate/alternative world story.

As I commented in that blog entry, I think that this sort of analysis can be helpful in understanding the "potential outcomes" formulation of causal inference.

In comments here, Alexis writes,

Your post prompts me to ask you something i've been wondering about ever since i began learning about NON-regression-based approaches to causal inference: namely, why do virtually all statistically-oriented political scientists think that regression-based/MLE methods are giving them the correct answers in observational settings? after all, we have long known (since at least the Rubin/Cochran papers of the 1970s) that regression is often (and quite possibly *generally*) unreliable in observational settings.

Do we have a single example of a non-trivial observational dataset wherein we can show that regression analysis produces the result that would have been obtained in a randomized experiment? We have lots of examples that show regression fails this test (here i'm thinking of dehejia/wahba/lalonde, etc.) where is the definitive empirical success story? there should be many success stories, given the universality of the methodology-- but i don't know of a single one.

In your blog, you write:

"(Parochially, I can point to this [link to gelman paper] and this [link to gelman paper] as particularly clean examples of causal inference from observational data, but lots more is out there.)"

I do not doubt that your linked papers (which i have not read) are excellent and rigorous examples of applied regression analysis. but my question is, how is it that these papers validate a regression-based approach to causal inference? what do you know of the "correct answer" in these cases, aside from your regression-based estimates?

My response:

1. Matching and regression are different methods but ultimately are doing the same thing, which is to compare outcomes on cases that differ in the treatment variable while being as similar as possible on whatever pre-treatment covariates are around. Regression relies on (and takes advantage of) linearity in the response surface, whereas matching is (potentially) nonparametric.

2. Matching is particularly relevant when the treatment and control groups do not completely overlap--matching methods will exclude the points outside the overlap region, with the understanding that the causal inference applies only to this region of overlap. Rubin's thesis discussed why matching is improved if followed by regression.

3. In each of the examples I've worked on (most notably, estimating the effect of incumbency and the effect of redistricting), there was essentially complete overlap of treatment and control groups.

4. In their usual formulations, matching and regression both assume ignorability and thus ignore biases due to selection based on unmeasured variables.

5. In my particular examples, I don't have external validation; however, the results make sense when looked at from various directions (see, for example, our comparison of various estimates in our 1990 AJPS paper).

P.S. See some lively discussion in the comments.

A quick summary of last year's mini-conference on causal inference (held at Columbia) is here, and here's the schedule for last year's meeting.

Below is the schedule for this year's mid-Atlantic causal modeling conference, organized by Prof. Dylan Small at the University of Pennsylvania. I think Jennifer will be speaking on our work studying the NYC public schools, although I don't see that in the title.

Jose Pedro Gala sends in some questions about statistical methods for causal inference. I'll give his questions, then my responses. Jose writes:

Several studies have been performed in the last few years looking at the economic decisions of parents of sons, as compared to parents of daughters. For example, Tyler Cowen links to a report of a study by Andrew Oswald and Nattavudh Powdthavee that "provides evidence that daughters make people more left wing. Having sons, by contrast, makes them more right wing":

Professor Oswald and Dr Powdthavee drew their data from the British Household Panel Survey, which has monitored 10,000 adults in 5,500 households each year since 1991 and is regarded as an accurate tracker of social and economic change. Among parents with two children who voted for the Left (Labour or Lib Dem), the mean number of daughters was higher than the mean number of sons. The same applied to parents with three or four children. Of those parents with three sons and no daughters, 67 per cent voted Left. In households with three daughters and no sons, the figure was 77 per cent.

I've seen some other studies recently with similar findings--a few years ago, a couple of economists found that having daughters, as compared to sons, was associated with the probability of divorce, I think it was, and recently a study by Ebonya Washington found that for Congressmembers, those with daughters (as compared to sons) were more likely to have liberal voting records on women's issues.

Controlling for the number of children: an intermediate outcome

A common feature of all these studies is that they control for the total number of children. This can be seen in the quote above, for example: they compare different sorts of families with 2 kids, then make a separate comparison of different sorts of families with 3 kids.

At first sight, controlling for the total number of children seems reasonable. There is a difficulty, however, in that the total number of kids is an intermediate outcome, and controlling for it (whether by subsetting the data based on #kids or using #kids as a control variable in a regression model) can bias the estimate of the causal effect of having a son (or daughter).

To see this, suppose (hypothetically) that politically conservative parents are more likely to want sons, and if they have two daughters, they are (hypothetically) more likely to try for a third kid. In comparison, liberals are more likely to stop at two daughters. In this case, if you look at data on families with 2 daughters, the conservatives will be underrepresented, and the data could show a correlation of daughters with political liberalism--even if having the daughters has no effect at all!

A solution

A solution is to apply the standard conservative (in the statistical sense!) approach to causal inference, which is to regress on your treatment variable (sex of kid) but controlling only for things that happen before the kid is born. For example, one could compare parents whose first child is a girl to parents whose first child is a boy. One can also look at the second birth, comparing parents whose second child is a girl to those whose second child is a boy--controlling for the sex of the first child. And so on for third child, etc.
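Both the bias and the fix can be checked in a small simulation. This is only a sketch under invented assumptions: the stopping probabilities, the 50/50 split of liberals and conservatives, and the function names are all hypothetical, and the sex of a child is given no effect on politics at all.

```python
import random

def simulate_family(rng):
    """One hypothetical family; child sex never affects politics here."""
    liberal = rng.random() < 0.5
    kids = [rng.choice("GB"), rng.choice("GB")]   # everyone has two kids
    if kids == ["G", "G"]:
        # Assumed stopping rule: conservatives with two daughters usually
        # try for a third child, liberals usually stop.
        p_third = 0.2 if liberal else 0.8
    else:
        p_third = 0.3
    if rng.random() < p_third:
        kids.append(rng.choice("GB"))
    return liberal, kids

def share_liberal(families):
    return sum(lib for lib, _ in families) / len(families)

rng = random.Random(0)
families = [simulate_family(rng) for _ in range(100_000)]

# Controlling for the intermediate outcome: look only at 2-kid families.
two_kids = [f for f in families if len(f[1]) == 2]
biased_gap = (share_liberal([f for f in two_kids if f[1] == ["G", "G"]])
              - share_liberal([f for f in two_kids if f[1] != ["G", "G"]]))

# Controlling only for pre-treatment information: compare by sex of first child.
first_girl = [f for f in families if f[1][0] == "G"]
first_boy = [f for f in families if f[1][0] == "B"]
clean_gap = share_liberal(first_girl) - share_liberal(first_boy)
```

Under these assumptions `biased_gap` comes out large and positive (two-daughter families look much more liberal) even though daughters have no effect, while `clean_gap` is close to zero, which is the right answer for this simulated world.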

The modeling could get interesting here, since there is a sort of pyramid of coefficients (one for the first-kid model, two for the second-kid model (controlling for first kid), and so forth). It might be reasonable to expect coefficients to gradually decline (I assume the effect of the first kid would be the biggest), and one could estimate that with some sort of hierarchical model.

Summary

I'm not saying that all these researchers are wrong; merely that, by controlling for an intermediate outcome, they're subject to a potential bias. Also they could redo their analyses without much effort, I think, to fix the biases and address this concern. I hope they do so (and inform me of their results).

It's an interesting example because we all know not to control for intermediate outcomes, but the total # of kids somehow doesn't look like that, at first.

P.S.

See here for more discussion of the U.K. voting example.

I recently met Carlos Davidson, a prof at Cal State University. He studies amphibians, with a special interest in why frogs in California are disappearing. He said that he can "predict quite well whether a site will have frogs, based on the pesticide use upwind" and that he thinks that pesticides are a big part of the problem. But he also said that others in his field are far from convinced. What should it take to be convincing? Is there a "statistical" answer to questions like, which is more important: lab work, more field work, more analysis of existing field data (perhaps with more covariates included)?

Per Pettersson-Lidbom is presenting a paper (in the Political Economy Seminar) that claims that an increase in size of local-government legislatures decreases the size of local government. First I'll give his abstract, then my comments. The abstract:

This paper addresses the question of whether the size of the legislature matters for the size of government. Previous empirical studies have found a positive relationship between the number of legislators and government spending but those studies do not adequately address the concerns of endogeneity. In contrast, this paper uses variation in council size induced by statutory council size laws to estimate the causal effect of legislature size on government size. These laws create discontinuities in council size at certain known thresholds of an underlying continuous variable, which make it possible to generate “near experimental” causal estimates of the effect of council size on government size. In contrast to previous findings, I [Pettersson-Lidbom] find a negative relationship between council size and government size: on average, spending and revenues are decreased by roughly 0.5 percent for each additional council member.

It's cool how he uses a natural experiment based on the laws of Finland and Sweden. As he writes: "In Finland, the council size of local governments is determined solely by population size. For example, if a local government has a population between 4001 and 8,000, the council must consist of 27 members, but if its population is between 8,001 and 15,000 the council must have 35 members. Thus, the law creates a discontinuity in council size at the threshold of 8001 inhabitants." And regression-discontinuity analysis certainly seems appropriate here.

The actual result is surprising to me--not that I'm any expert on local government, it's just surprising to see a negative effect here--and so I'd like to see some presentation of the data. If the effect is as clear as is claimed, it should show up in some basic analyses--here I'm thinking of scatterplots and matching analyses. This is somewhat a matter of taste--as a statistician, I like graphs, but economists seem to prefer tables. But I just find it difficult to be convinced by results such as Tables 9-14.

To flip it around: this is a pretty clean dataset, right? You have a natural experiment and some points near the boundary. So a scatterplot, and a simple regression could be pretty convincing. Tables 2 and 3 are promising (well, I'd prefer graphs, but still...) but they only have data on "x", not on "y". As things stand, I really just have to take the results on trust. Not that I have any reason to disbelieve them, but I'd like to be a little more confident in the results--especially given that much of the paper discusses why these results differ from the rest of the literature on the topic.
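Here is roughly what I mean by a simple regression near the boundary, as a minimal sketch on synthetic data. Only the two council-size brackets come from the passage quoted above; the outcome model, the planted effect, the noise level, and the bandwidth are invented for illustration and are not the paper's data or method.

```python
import random

THRESHOLD = 8001  # population cutoff quoted above

def council_size(population):
    """Council size under the two brackets quoted above (others omitted)."""
    if 4001 <= population <= 8000:
        return 27
    if 8001 <= population <= 15000:
        return 35
    raise ValueError("bracket not covered in this sketch")

def value_at_cutoff(points):
    """Fit an OLS line to (population - THRESHOLD, y); return its value at the cutoff."""
    xs = [p - THRESHOLD for p, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar

# Synthetic towns: log spending drifts smoothly with population and shifts by
# PLANTED_EFFECT per extra council member. All of these numbers are made up.
PLANTED_EFFECT = -0.005
rng = random.Random(1)
towns = []
for _ in range(4000):
    pop = rng.randint(4001, 15000)
    log_spend = (10 + 1e-5 * pop + PLANTED_EFFECT * council_size(pop)
                 + rng.gauss(0, 0.01))
    towns.append((pop, log_spend))

BANDWIDTH = 2000  # arbitrary window around the cutoff
below = [(p, y) for p, y in towns if THRESHOLD - BANDWIDTH <= p < THRESHOLD]
above = [(p, y) for p, y in towns if THRESHOLD <= p <= THRESHOLD + BANDWIDTH]

# Jump in the fitted lines at the cutoff, then scaled per council member.
jump = value_at_cutoff(above) - value_at_cutoff(below)
effect_per_member = jump / (council_size(THRESHOLD) - council_size(THRESHOLD - 1))
```

Plotting `towns` with the two fitted lines overlaid would be the scatterplot version of the same comparison. With real data the choice of bandwidth matters, and I'd want to see how the estimate moves as it changes.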

I just came back from a talk by Jere Behrman on "What Determines Adult Cognitive Skills? Impacts of Pre-School, School-Years and Post-School Experiences in Guatemala." Here's the paper.

It was all interesting, but what confused me here, as in other talks of this type, was the interpretation of regressions controlling for several variables that are sequential in time. This particular example was a longitudinal study of about 1500 people, looking at adult cognitive outcomes and including, as predictors, measures of health at age 6, years of schooling, and work after school was over. It's tricky to interpret the coefficient of pre-school health in this regression as a "treatment effect" since it can affect the other predictors. People at the seminar were talking along the lines of "causal pathways" but this always confuses me too. A simple response is to follow the basic advice of not controlling for post-treatment outcomes, but doing such an analysis wouldn't address some of the questions the researchers were trying to study here.

So I'm left simply confused. I'm not trying to be critical of this paper, since I'm not really offering an alternative. But I'm not quite sure how to interpret all these regression coefficients. (Even setting aside the issues involving instrumental variables, which are used in this study also.) I'm just a little stuck here.

Causal inference is in demand

| No Comments

The following arrived in the email yesterday:

Zhiqiang Tan recently wrote two papers on the theory of causal inference: see here and here. Here are the abstracts:

On June 20, we had a miniconference on causal inference at the Columbia University Statistics Department. The conference consisted of six talks and lots of discussion. One topic of discussion was the use of propensity scores in causal inference, specifically, discarding data based on propensity scores. Discarding data (e.g., discarding all control units whose propensity scores are outside the range of the propensity scores in the treated group) can reduce or eliminate extrapolation, a potential cause of bias if the treated and control groups have different distributions of background covariates. However, it's sort of unappealing to throw out data, and can sometimes lead to treatment effect estimates for an ill-defined subset of the population. There was discussion on the extent to which modeling can be done using all available data without extrapolation. Other topics of discussion included bounds, intermediate outcomes, and treatment interactions. For more information, click here.
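The discarding rule mentioned in parentheses above can be written in a few lines. This is a sketch only: the function name and the example scores are hypothetical, and the propensity scores are assumed to have been estimated already by some model.

```python
def trim_controls(units):
    """Keep all treated units, and only those controls whose propensity score
    lies within the range of scores observed in the treated group.

    `units` is a list of (propensity_score, treated) pairs.
    """
    treated_scores = [ps for ps, treated in units if treated]
    lo, hi = min(treated_scores), max(treated_scores)
    return [(ps, treated) for ps, treated in units
            if treated or lo <= ps <= hi]

# Hypothetical scores: two controls fall outside the treated range [0.30, 0.80].
units = [(0.05, False), (0.30, True), (0.35, False), (0.55, True),
         (0.60, False), (0.80, True), (0.95, False)]
kept = trim_controls(units)
dropped = [u for u in units if u not in kept]
```

The two dropped controls are exactly the ones for which any comparison with treated units would be an extrapolation. The cost is the one noted above: the estimate now refers to whatever subpopulation survives the trimming.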

Zhiqiang Tan (Biostatistics, Johns Hopkins) writes, regarding my blog entry on regression and matching:

I wrote:


I'm imagining a unification of matching and regression methods, following the Cochran and Rubin approach: (1) matching, (2) keeping the treated and control units but discarding the information on who was matched with whom, (3) regression including treatment interactions. I'm still confused about exactly how the propensity score fits in.
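A toy version of the three steps, to fix ideas. This is a sketch under strong simplifications and is not the Cochran and Rubin procedure itself: it uses one covariate, nearest-neighbor matching with replacement, noise-free invented data, and hypothetical names throughout. The propensity score does not appear at all.

```python
def fit_line(points):
    """OLS for y = a + b*x on a list of (x, y) pairs; returns (a, b)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in points)
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

def y_control(x):
    """Made-up control response: linear for 6 <= x <= 14, curved outside."""
    return 1 + 0.5 * x + 0.05 * max(0, x - 14) ** 2 + 0.05 * max(0, 6 - x) ** 2

# Hypothetical data as (x, y) pairs. Controls span 0..20; treated units sit
# in 8..12 and have true effect 2 + 0.3*x.
controls = [(float(x), y_control(x)) for x in range(21)]
treated = [(x, y_control(x) + 2 + 0.3 * x) for x in (8.0, 9.5, 10.5, 12.0)]

# (1) Match each treated unit to its nearest control on x (with replacement).
matched = {min(controls, key=lambda c: abs(c[0] - tx)) for tx, _ in treated}

# (2) Keep the matched controls as a group; who matched whom is discarded.
# (3) Regress with a treatment interaction. With a single covariate,
#     y ~ t + x + t:x is the same as fitting a separate line in each arm.
a0, b0 = fit_line(sorted(matched))
a1, b1 = fit_line(treated)
main_effect, interaction = a1 - a0, b1 - b0

def effect_at(x):
    """Estimated treatment effect at covariate value x."""
    return main_effect + interaction * x

# For comparison: the control line fit to ALL controls, including the
# non-overlapping ones where the response curves.
a0_all, b0_all = fit_line(controls)
```

In this setup the matched regression recovers the planted effect, while the control line fit to all the data is pulled upward by the curvature outside the region of overlap, so an unmatched regression would understate the effect for the treated units.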

Zhiqiang writes:

In fact, I'm also working on "causal inference". As I understand, there is a fundamental gap between the idea of propensity score and the likelihood principle or Bayesian inference. The likelihood is factorized in terms of the outcome regression and the propensity score, so that any (parametric) likelihood or Bayesian inference would necessarily ignore the propensity score! One way to reconcile the two "ideas" is to look at the joint distribution of covariates and outcome, as in my paper "Efficient and Robust Causal Inference: A Distributional Approach".

As you can see, the idea is connected to the likelihood formulation for Monte Carlo integration. Here I worked on propensity score weighting as opposed to matching, and followed maximum likelihood/frequentist instead of Bayesian.
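The factorization Zhiqiang describes can be written out as follows. This is a sketch in generic notation (the symbols $x_i$, $t_i$, $y_i$, $\theta$, $\gamma$ are assumptions for illustration, not taken from his paper), assuming ignorable treatment assignment and separate parameters for the two models:

```latex
L(\theta, \gamma) \;=\; \prod_{i=1}^{n}
  \underbrace{e(x_i;\gamma)^{t_i}\,\bigl[1 - e(x_i;\gamma)\bigr]^{1-t_i}}_{\text{propensity score}}
  \;\underbrace{p(y_i \mid t_i, x_i;\theta)}_{\text{outcome regression}},
\qquad e(x;\gamma) = \Pr(t = 1 \mid x;\gamma).
```

If $\theta$ and $\gamma$ are independent in the prior, the posterior for $\theta$ depends on the data only through the second factor, which is the sense in which likelihood or Bayesian inference for the outcome model ignores the propensity score.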

My response: I agree that propensity score methods don't tie directly to likelihood or Bayesian inference. I think the appropriate link is through poststratification. But actually carrying out this modeling in a reasonable way is a challenge--an important research problem, I think.

My quick and lazy comments on Zhiqiang's paper: The tables should be graphs. Figure 1 could use a caption explaining what models 1-4 are, and what the two graphs are. The graphs in Figures 2 and 3 can be made smaller, and they should be rotated 90 degrees.

OK, now I have to read the paper for real.

I'm all confused!

| 4 Comments

Are our experiments too large or are they too small?

We would like to incorporate matching methods into a Bayesian regression framework for causal inference, with the ultimate goal of being able to do more effective inference using hierarchical modeling. The founding works here are papers by Cochran and Rubin in 1973, demonstrating that matching followed by regression outperforms either method alone, and papers by Rosenbaum and Rubin in 1984 on propensity scores.

Right now, our starting points are two recent review articles, one by Guido Imbens on the theory of regression and matching adjustments, and one by Liz Stuart on practical implementations of matching. So far, I've read Guido's article and have a bunch of comments/questions. Much of this involves my own work (since that's what I'm most familiar with), so I apologize in advance for that.

A few years ago I picked up the book Virtual History: Alternatives and Counterfactuals, edited by Niall Ferguson. It's a book of essays by historians on possible alternative courses of history (what if Charles I had avoided the English civil war, what if there had been no American Revolution, what if Irish home rule had been established in 1912, ...).

There have been and continue to be other books of this sort (for example, What If: Eminent Historians Imagine What Might Have Been, edited by Robert Cowley), but what makes the Ferguson book different is that he (and most of the other authors in his book) are fairly rigorous in only considering possible actions that the relevant historical personalities were actually considering. In the words of Ferguson's introduction: "We shall consider as plausible or probable only those alternatives which we can show on the basis of contemporary evidence that contemporaries actually considered."

I like this idea because it is a potentially rigorous extension of the now-standard "Rubin model" of causal inference.

Question about causal inference

| 5 Comments

Judea Pearl (Dept of Computer Science, UCLA) spoke here Tuesday on "Inference with cause and effect." I think I understood the method he was describing but it left me with some questions about what were the method's hidden assumptions. Perhaps someone familiar with this approach can help me out here.

I'll work with a specific example from one of my current research projects.

Following up on this and this and this, Dan Ho sent me the following discussion of the differences between his, Jasjeet Sekhon's, and Ben Hansen's matching programs:

Matching and matching

| 2 Comments

Contingency and alternative history

| 3 Comments

This might not seem like it has much connection to statistics, but bear with me . . .

Alternative history--imaginings of different versions of this world that could have occurred if various key events in the past had been different--is a popular category of science fiction. Alternative history stories come in a number of flavors but a common feature of the best of the novels in this subgenre is that the alternate world is not "real."

Let's consider the top three alternative history novels (top three not in sales but in critical reputation, or at least my judgment of literary quality): The Man in the High Castle, Pavane, and Bring the Jubilee. (warning: spoilers coming)

Causal inference and decision trees

| 2 Comments

Causal inference and decision analysis are two areas of statistics in which I've seen very little overlap: the work in causal inference is typically very "foundational" with continuing reassessment based on first principles, whereas decision analysis is more of a meat-and-potatoes Bayesian inference--slap down a probability model, stick in a utility function, and turn the crank. (With all this processing, this must be ground beef and mashed potatoes.)

Actually, though, causal inference and decision analysis are connected at a fundamental level. Both involve manipulation and potential outcomes. In causal inference, the "causal effect" (or, as Michael Sobel would say, the "effect") is the difference between what would happen under treatment A and what would happen under treatment B. The key to this definition is that either treatment could be applied to the experimental unit by some agent (the "experimenter").

In parallel, decision analysis concerns what would happen if decision A or decision B were chosen. When drawing decision trees, we let squares and circles represent decision and uncertainty nodes, respectively. To map on to causal inference, the squares would represent potential treatments and the circles would represent uncertainty in outcomes--or population variability.
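The squares-and-circles picture translates directly into a small recursive data structure. A minimal sketch follows; the node encoding, the function names, and every probability and utility are hypothetical and chosen only for illustration.

```python
def expected_utility(node):
    """Evaluate a tiny decision tree by backward induction.

    A node is one of:
      ("decision", {label: child, ...})  -- a square: choose the best branch
      ("chance", [(prob, child), ...])   -- a circle: average over outcomes
      a plain number                     -- a leaf utility
    """
    if isinstance(node, (int, float)):
        return node
    kind, branches = node
    if kind == "decision":
        return max(expected_utility(child) for child in branches.values())
    if kind == "chance":
        return sum(p * expected_utility(child) for p, child in branches)
    raise ValueError(f"unknown node type: {kind}")

def best_choice(decision_node):
    """Label of the branch with the highest expected utility."""
    _, branches = decision_node
    return max(branches, key=lambda label: expected_utility(branches[label]))

# Two potential treatments, each followed by uncertainty in the outcome.
tree = ("decision", {
    "A": ("chance", [(0.9, 10.0), (0.1, -50.0)]),
    "B": ("chance", [(0.5, 8.0), (0.5, 2.0)]),
})

choice = best_choice(tree)
value = expected_utility(tree)
```

In causal-inference terms, the two chance nodes are the distributions of potential outcomes under treatments A and B, and the difference between their expected values plays the role of an average causal effect.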

In practice, the two areas of research are not always so closely connected. For example, in our decision analysis for home radon, the key decision is whether to remediate your house for radon. The causal effect of this decision on reducing the probability of lung cancer death is assumed to follow a specified functional form as estimated from previous studies. For our decision analysis we don't worry too much about the details of where that estimate came from.

But in thinking about causal effects, the decision-making framework might be helpful in distinguishing among different possible potential-outcome frameworks.

Jasjeet Sekhon reports:

I recently released a new version of my Matching package for R. The new version has a function, called GenMatch(), which finds optimal balance using multivariate matching where a genetic search algorithm determines the weight each covariate is given. The function never consults the outcome and is able to find amazingly good balance in datasets where human researchers have failed to do so. I'm writing a paper on this algorithm right now.

The software, along with some examples, is here.

We also had a discussion of matching a few months ago on the blog.

Daniel Scharfstein (http://commprojects.jhsph.edu/faculty/bio.cfm?F=Daniel&L=Scharfstein) recently gave a very good talk at the Columbia Biostatistics Department. He presented an application of causal inference using principal stratification. The example was similar to something I've heard Don Rubin and others speak about before, but I realized I'd been missing something important about this particular example.

A well-publicized example of problems with observational studies is hormone replacement therapy and heart attack risks for postmenopausal women. In brief, the observational study gave misleading answers because the "treatment" and "control" groups differed systematically. Could the method of propensity scores have found (and solved) the problem?

The "law of parsimony"?

| 2 Comments

Speaking of parsimony, I came across the following quotation from Commentary magazine (page 80 in the December 2004 issue):

The law of parsimony tells us that when there are alternative explanations of events, the simplest one is likely to be correct.

Commentary is a serious magazine, and this quotation (which I disagree with!) makes me wonder whether this idea of a scientific "law" is common among serious literary and political critics.

Juan Robalino and Alex Pfaff have written a paper on estimating the factors that influence the decision of Costa Rican farmers to clear forest land.

This is an important question because, as they note in the article,

Rural areas of developing countries contain almost the entire stock of the world's tropical forest. The poverty levels in these areas and the world demands for forest conservation have generated discussions concerning the determinants of deforestation and the appropriate policies for conservation.

When neighbors have cleared their land of forest, a farmer is likely to clear his or her land also. However, as Robalino and Pfaff note, neighboring plots of land will have many potentially unobserved similarities, and so mere correlation between neighbors' decisions is not sufficient evidence of causation.

Robalino and Pfaff estimate the effect of neighbors' actions on individual deforestation decisions using a two-stage probit regression. In their model, they treat the slopes of the neighboring farmers' land as an instrumental variable. I don't fully understand instrumental variables, but this looks like an interesting example as well as being an important application.

Daniel Ho, Kosuke Imai, Gary King, and Liz Stuart recently wrote a paper on matching, followed by regression, as a tool for causal inference. They apply the methods developed by Don Rubin in 1970 and 1973 to some political science data, and make a strong argument, both theoretical and practical, for why this approach should be used more often in social science research.

I read the paper, and liked it a lot, but I had a few questions about how it "sold" the concept of matching. Kosuke, Don, and I had an email dialogue including the following exchange.

[The abstract of the paper claims that matching methods "offer the promise of causal inference with fewer assumptions" and give "considerably less model-dependent causal inferences"]

AG: Referring to matching as "nonparametric and non-model-based" might be misleading. It depends on how you define "model", I guess, but from a practical standpoint, information has to be used in the matching, and I'm not sure there's such a clear distinction between using a "model" as compared to an "algorithm" to do the matching.

DR: I think much of this stuff about "models" and non-models is unfortunate. Whatever procedure you use, you are ignoring certain aspects of the data (and so regarding them as irrelevant), and emphasizing aspects as important. For a trivial example, when you do something "robust" to estimate the "center" of a distn, you are typically making assumptions about the definition of "center" and the irrelevance of extreme observations to the estimation of it. Etc.

KI: I want to take this opportunity and ask you one quick question. When I talk about matching methods to political scientists who are so used to running a bunch of regressions, they often ask why matching is better than regression or why they should bother to do matching in combination with regressions. What would be the best way to answer this question? I usually tell them about the benefits of the potential outcome framework, available diagnostics, and flexibility (e.g., as opposed to linear regression) etc. But, I'm wondering what you would say to social scientists!

AG: Matching restricts the range of comparison you're doing. It allows you to make more robust inferences, but with a narrower range of applicability. See Figure 7.2 on page 227 of our book for a simple picture of what's going on, in an extreme case. Matching is just a particular tool that can be used to study a subset of the decision space. The phrase I would use for social scientists is, "knowing a lot about a little". The papers by Dehejia and Wahba discuss these issues in an applied context: http://www.columbia.edu/%7Erd247/papers/w6586.pdf and http://www.columbia.edu/%7Erd247/papers/matching.pdf

DR: Also look at the simple tables in Cochran and Rubin (1973) or Rubin (1973) or Rubin (1979) etc. They all show that regression by itself is terribly unreliable with minor nonlinearity that is difficult to detect, even with careful diagnostics. This message is over three decades old!

KI: I agree that matching gives more robust inferences. That's the main message of the paper that I presented and also of my JASA paper with David. The question my fellow social scientists ask is why matching is more robust than regressions (and hence why they should be doing matching rather than running regressions). One answer is that matching removes some observations and hence avoids extrapolation. But how about other kinds of matching that use all the observations (e.g., subclassification and full matching)? Are they more robust than regressions? What I usually tell them is that regressions often make stronger functional form assumptions than matching. With stratification, for example, you can fit separate regressions within each stratum and then aggregate the results (therefore it does not assume that the same model fits all the data). I realize that there is no simple answer to this kind of a vague, and perhaps ill-posed, question. But, these are the kind of questions that you get when you tell soc. sci. about matching methods!

AG: I think the key idea is avoiding extrapolation, as you say above. But I don't buy the claim that regression makes stronger functional form assumptions than matching. Regressions can (and should) include interactions. Regression-with-interaction, followed by poststratification, is a basic idea. It can be done with or without matching.

The social scientists whom I respect are more interested in the models than in the estimation procedures. For these people, I would focus on your models-with-interactions. If matching makes it easier to fit such models, that's a big selling point of matching to me.
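As a rough illustration of regression-with-interaction followed by poststratification, the sketch below fits group-specific treatment effects (a fully interacted model on a categorical covariate) and then reweights them by population shares that differ from the sample composition. The function name, the three groups, the simulated effects, and the population shares are hypothetical choices for the example.

```python
import numpy as np

def poststratified_effect(group, treat, y, pop_share):
    """Fit y on group intercepts plus group-by-treatment interactions,
    then weight the group-specific effects by population shares."""
    group = np.asarray(group)
    treat = np.asarray(treat, dtype=float)
    y = np.asarray(y, dtype=float)
    n_groups = len(pop_share)
    G = np.eye(n_groups)[group]              # one-hot group indicators
    X = np.hstack([G, G * treat[:, None]])   # intercepts, then interactions
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    effects = beta[n_groups:]                # one treatment effect per group
    return effects, float(np.dot(pop_share, effects))

# Simulated (hypothetical) data: the sample over-represents group 0.
rng = np.random.default_rng(1)
n = 6000
group = rng.choice(3, size=n, p=[0.6, 0.3, 0.1])
treat = rng.binomial(1, np.array([0.3, 0.5, 0.7])[group])
true_effects = np.array([1.0, 2.0, 3.0])
y = group + true_effects[group] * treat + rng.normal(scale=0.5, size=n)

# Assumed population shares: equal across the three groups.
effects, pop_effect = poststratified_effect(group, treat, y, [1/3, 1/3, 1/3])
print("group effects:", np.round(effects, 2))
print("poststratified effect:", round(pop_effect, 2))
```

Nothing in this sketch involves matching. Matching could be used beforehand to decide which units enter the regression, which is the point about the two being complements rather than substitutes.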

KI: One can think of it as a way to fit models with interactions in the multivariate setting by helping you create matches and subclasses. One thing I wanted to emphasize in my talk is that matching is not necessarily a substitute for regression, and that one can use matching methods to make regressions more robust.

Just a quick note on what we're doing with Shepherd's paper, and why...

There's a long-standing economics literature (beginning with Ehrlich in 1975) on the question of whether the death penalty deters murders. Since the death penalty is used in some states and not others, and has been used to differing extents over the years, it tends to be treated as a natural experiment.

However, capital punishment is not implemented at random -- be it the political climate, current crime rates, or inherent differences across states, there's something that drives some states to legalize the death penalty, and others not to. Furthermore, the "deterrent effect" of the death penalty varies by state and time, making it difficult to make a single causal claim of deterrence or the lack thereof.

Since there are such differences across states and time, the question of deterrence may lend itself to a multilevel model. Most modern papers look at data by state and year, and a multilevel model allows some flexibility in the appropriate level of aggregation.
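To give a sense of what a multilevel model does with state-by-year data, here is a deliberately simplified sketch of partial pooling: each state's mean is shrunk toward the overall mean, with more shrinkage for states that have fewer observations. It uses a method-of-moments variance estimate and the unweighted grand mean rather than a fitted multilevel model, and the data are simulated. None of the numbers relate to Shepherd's data or results.

```python
import numpy as np

def partial_pool(y, group):
    """Shrink each group mean toward the grand mean (method-of-moments sketch)."""
    y = np.asarray(y, dtype=float)
    labels, idx = np.unique(np.asarray(group), return_inverse=True)
    n_j = np.bincount(idx)
    ybar_j = np.bincount(idx, weights=y) / n_j
    # Pooled within-group variance.
    resid = y - ybar_j[idx]
    sigma2 = resid @ resid / (len(y) - len(labels))
    grand = y.mean()
    # Between-group variance: spread of group means minus average sampling
    # variance, floored at zero.
    tau2 = max(np.var(ybar_j, ddof=1) - np.mean(sigma2 / n_j), 0.0)
    # Shrinkage weight: near 1 for large groups, near 0 for small ones.
    w = tau2 / (tau2 + sigma2 / n_j)
    return labels, ybar_j, grand + w * (ybar_j - grand)

# Simulated (hypothetical) panel: 8 states observed for unequal numbers of years.
rng = np.random.default_rng(2)
n_years = np.array([3, 5, 8, 12, 20, 30, 4, 6])
state = np.repeat(np.arange(len(n_years)), n_years)
state_mean = rng.normal(loc=6.0, scale=1.0, size=len(n_years))
rate = state_mean[state] + rng.normal(scale=2.0, size=len(state))

labels, raw, pooled = partial_pool(rate, state)
grand = rate.mean()
for s, r, p in zip(labels, raw, pooled):
    print(f"state {s}: raw mean {r:.2f}, partially pooled {p:.2f}")
```

The pooled estimates sit between complete pooling (one national average) and no pooling (separate state averages), which is one way to read the "flexibility in the appropriate level of aggregation" mentioned above.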

Joanna Shepherd's recent paper (available on the Wiki) uses a series of equations to predict murder rates by county and year, as a function of deterrent (and other) measures, and predict the probabilities of arrest, death sentences, and executions, based on murder rates and a number of other factors.

She finds that in some states, executions "deter" crime, in others there is no effect, and in still others, executions "cause" crime. We plan to test the sensitivity of these findings to changes in model specifications.

We don't have her data yet; at the moment I'm using Stata to play with some state-level data to try and get a handle on her model. The county-by-year data might add more to a multilevel model than state-level data, since the death penalty is legalized by state, but deterrent effects may be felt more locally.

We're also planning to test some of the other model specifications -- the reliance on publicity, differences between high-execution and low-execution states, and so forth.

I'm meeting with Jeff on Friday, who has some ideas of what we should be testing, and will relay that conversation...
