# Recently in Causal Inference Category

## Macro causality

David Backus writes:

This is from my area of work, macroeconomics. The suggestion here is that the economy is growing slowly because consumers aren't spending money. But how do we know it's not the reverse: that consumers are spending less because the economy isn't doing well? As a teacher, I can tell you that it's almost impossible to get students to understand that the first statement isn't obviously true. What I'd call the demand-side story (more spending leads to more output) is everywhere, including this piece, from the usually reliable David Leonhardt.

This whole situation reminds me of the story of the village whose inhabitants support themselves by taking in each other's laundry. I guess we're rich enough in the U.S. that we can stay afloat for a few decades just buying things from each other?

Regarding the causal question, I'd like to move away from the idea of "Does A cause B or does B cause A" and toward a more intervention-based framework (Rubin's model for causal inference) in which we consider effects of potential actions. See here for a general discussion. Considering the example above, a focus on interventions clarifies some of the causal questions. For example, if you want to talk about the effect of consumers spending less, you have to consider what interventions you have in mind that would cause consumers to spend more. One such intervention is the famous helicopter drop but there are others, I assume. Conversely, if you want to talk about the poor economy affecting spending, you have to consider what interventions you have in mind to make the economy go better.

In that sense, instrumental variables are a fundamental way to think of just about all causal questions of this sort. You start with variables A and B (for example, consumer spending and economic growth). Instead of picturing A causing B or B causing A, you consider various treatments that can affect both A and B.

All my discussion is conceptual here. As I never tire of saying, my knowledge of macroeconomics hasn't developed since I took econ class in 11th grade.
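To make the intervention-based framing concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: a coin-flip intervention `z`, an unobserved factor `u` that moves both spending and growth, and the coefficients 1.0, 0.5, and 2.0. It has nothing to do with real macroeconomic data. The point is only that the effect of the intervention on A and its effect on B can each be summarized directly, while the raw slope of B on A mixes in the unobserved factor.

```python
import random
from statistics import mean

def simulate(n=20000, seed=1):
    """Hypothetical world: z is a randomized intervention, u is unobserved."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0        # intervention, assigned by coin flip
        u = rng.gauss(0, 1)                       # unobserved factor moving both A and B
        a = 1.0 * z + u + rng.gauss(0, 1)         # A: consumer spending
        b = 0.5 * a + 2.0 * u + rng.gauss(0, 1)   # B: growth (assumed effect of A on B: 0.5)
        rows.append((z, a, b))
    return rows

def diff_in_means(rows, col):
    """Average of column `col` among z=1 minus the average among z=0."""
    treated = [r[col] for r in rows if r[0] == 1]
    control = [r[col] for r in rows if r[0] == 0]
    return mean(treated) - mean(control)

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rows = simulate()
effect_on_a = diff_in_means(rows, 1)   # effect of the intervention on A
effect_on_b = diff_in_means(rows, 2)   # effect of the intervention on B
naive_slope = slope([r[1] for r in rows], [r[2] for r in rows])

print(f"effect of z on A: {effect_on_a:.2f}")
print(f"effect of z on B: {effect_on_b:.2f}")
print(f"naive slope of B on A: {naive_slope:.2f}")
```

In this made-up setup the two intervention effects should land near 1.0 and 0.5, while the naive slope is pulled well above 0.5 by the unobserved factor. Dividing the second effect by the first gives the usual instrumental-variables ratio, but the two numbers are informative on their own.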

## Matching and regression: two great tastes etc etc

Matthew Bogard writes:

Regarding the book Mostly Harmless Econometrics, you state:
A casual reader of the book might be left with the unfortunate impression that matching is a competitor to regression rather than a tool for making regression more effective.
But in fact isn't that what they are arguing, that, in a 'mostly harmless way' regression is in fact a matching estimator itself? "Our view is that regression can be motivated as a particular sort of weighted matching estimator, and therefore the differences between regression and matching estimates are unlikely to be of major empirical importance" (Chapter 3 p. 70) They seem to be distinguishing regression (without prior matching) from all other types of matching techniques, and therefore implying that regression can be a 'mostly harmless' substitute or competitor to matching. My previous understanding, before starting this book was as you say, that matching is a tool that makes regression more effective. I have not finished their book, and have been working at it for a while, but if they do not mean to propose OLS itself as a matching estimator, then I agree that they definitely need some clarification. I actually found your particular post searching for some article that discussed this more formally, as I found my interpretation (misinterpretation) difficult to accept. What say you?

I don't know what Angrist and Pischke actually do in their applied analysis. I'm sorry to report that many users of matching do seem to think of it as a pure substitute for regression: once they decide to use matching, they try to do it perfectly and they often don't realize they can use regression on the matched data to do even better. In my book with Jennifer, we try to clarify that the primary role of matching is to correct for lack of complete overlap between control and treatment groups.

But I think in their comment you quoted above, Angrist and Pischke are just giving a conceptual perspective rather than detailed methodological advice. They're saying that regression, like matching, is a way of comparing-like-with-like in estimating a comparison. This point seems commonplace from a statistical standpoint but may be news to some economists who might think that regression relies on the linear model being true.

Gary King and I discuss this general idea in our 1990 paper on estimating incumbency advantage. Basically, a regression model works if either of two assumptions is satisfied: if the linear model is true, or if the two groups are balanced so that you're getting an average treatment effect. More recently this idea (of there being two bases for an inference) has been given the name "double robustness"; in any case, it's a fundamental aspect of regression modeling, and I think that, by equating regression with matching, Angrist and Pischke are just trying to emphasize that these are just two different ways of ensuring balance in a comparison.
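The "either of two assumptions" point can be checked in a small simulation. This is a sketch under invented assumptions, not anything from the 1990 paper: one covariate `x`, a true treatment effect of 1.0, and an outcome that depends on `x` either linearly or through `x**2`. The regression always fits a straight line in `x`. The helper `treatment_coef` uses the standard partialling-out identity to get the coefficient on treatment without any linear algebra library.

```python
import random
from statistics import mean

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = mean(xs), mean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def residuals(xs, vs):
    """Residuals from a least-squares regression of vs on xs (with intercept)."""
    b, mx, mv = slope(xs, vs), mean(xs), mean(vs)
    return [v - mv - b * (x - mx) for x, v in zip(xs, vs)]

def treatment_coef(t, x, y):
    """Coefficient on t in the regression y ~ 1 + t + x (partialling out x)."""
    return slope(residuals(x, t), residuals(x, y))

def simulate(balanced, linear, n=20000, seed=2, tau=1.0):
    """Hypothetical data; tau is the assumed true treatment effect."""
    rng = random.Random(seed)
    t, x, y = [], [], []
    for _ in range(n):
        xi = rng.gauss(0, 1)
        # Balanced: coin flip. Unbalanced: high-x units are treated far more often.
        p = 0.5 if balanced else (0.8 if xi > 1 else 0.2)
        ti = 1.0 if rng.random() < p else 0.0
        fx = 2.0 * xi if linear else xi ** 2   # how the outcome really depends on x
        t.append(ti)
        x.append(xi)
        y.append(tau * ti + fx + rng.gauss(0, 1))
    return t, x, y

est_linear_unbalanced = treatment_coef(*simulate(balanced=False, linear=True))
est_curved_balanced = treatment_coef(*simulate(balanced=True, linear=False))
est_curved_unbalanced = treatment_coef(*simulate(balanced=False, linear=False))

print(f"linear model true, groups unbalanced: {est_linear_unbalanced:.2f}")
print(f"linear model false, groups balanced:  {est_curved_balanced:.2f}")
print(f"linear model false, groups unbalanced: {est_curved_unbalanced:.2f}")
```

With these settings the first two estimates should come out close to the assumed effect of 1.0, and only the third, where neither assumption holds, is clearly off.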

In many examples, neither regression nor matching works perfectly, which is why it can be better to do both (as Don Rubin discussed in his Ph.D. thesis in 1970 and subsequently in some published articles with his advisor, William Cochran).
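Here is a toy sketch of doing both, again with invented numbers: controls spread over a wide range of `x`, treated units confined to a narrow slice (so overlap is incomplete), a curved outcome, and an assumed true effect `TAU` of 1.0. Matching is plain nearest-neighbor on `x` with replacement. It is not meant to reproduce any particular published procedure.

```python
import bisect
import random
from statistics import mean

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = mean(xs), mean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def residuals(xs, vs):
    """Residuals from a least-squares regression of vs on xs (with intercept)."""
    b, mx, mv = slope(xs, vs), mean(xs), mean(vs)
    return [v - mv - b * (x - mx) for x, v in zip(xs, vs)]

def treatment_coef(t, x, y):
    """Coefficient on t in the regression y ~ 1 + t + x."""
    return slope(residuals(x, t), residuals(x, y))

rng = random.Random(4)
TAU = 1.0  # assumed true treatment effect (hypothetical)

def outcome(t, x):
    return TAU * t + x ** 2 + rng.gauss(0, 1)   # curved in x

# Controls cover [-3, 3]; treated units sit only in [0, 1]. Sorted by x for bisect.
controls = sorted((x, outcome(0, x)) for x in [rng.uniform(-3, 3) for _ in range(4000)])
treated = [(x, outcome(1, x)) for x in [rng.uniform(0, 1) for _ in range(1000)]]
control_x = [c[0] for c in controls]

def nearest_control(x):
    """Closest control on x (matching with replacement)."""
    i = bisect.bisect_left(control_x, x)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(controls)]
    return controls[min(candidates, key=lambda j: abs(control_x[j] - x))]

matched = [nearest_control(x) for x, _ in treated]

def stack(ctrl, trt):
    """Combine control and treated units into (t, x, y) lists."""
    t = [0.0] * len(ctrl) + [1.0] * len(trt)
    x = [c[0] for c in ctrl] + [r[0] for r in trt]
    y = [c[1] for c in ctrl] + [r[1] for r in trt]
    return t, x, y

regression_only = treatment_coef(*stack(controls, treated))
matching_only = mean([y for _, y in treated]) - mean([y for _, y in matched])
matching_then_regression = treatment_coef(*stack(matched, treated))

print(f"regression on all data:      {regression_only:.2f}")
print(f"matching, difference in means: {matching_only:.2f}")
print(f"regression on matched data:  {matching_then_regression:.2f}")
```

In this example the straight-line regression on all the data gets even the sign wrong, because most controls lie outside the range of the treated units. Close matches are plentiful here, so matching alone already does most of the work; the regression on the matched sample adjusts for whatever imbalance in `x` remains.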

## Descriptive statistics, causal inference, and story time

Dave Backus points me to this review by anthropologist Mike McGovern of two books by economist Paul Collier on the politics of economic development in Africa. My first reaction was that this was interesting but non-statistical so I'd have to either post it on the sister blog or wait until the 30 days of statistics was over. But then I looked more carefully and realized that this discussion is very relevant to applied statistics.

Here's McGovern's substantive critique:

Much of the fundamental intellectual work in Collier's analyses is, in fact, ethnographic. Because it is not done very self-consciously and takes place within a larger econometric rhetoric in which such forms of knowledge are dismissed as "subjective" or worse still biased by the political (read "leftist") agendas of the academics who create them, it is often ethnography of a low quality. . . .

Despite the adoption of a Naipaulian unsentimental-dispatches-from-the-trenches rhetoric, the story told in Collier's two books is in the end a morality tale. The tale is about those countries and individuals with the gumption to pull themselves up by their bootstraps or the courage to speak truth to power, and those power-drunk bottom billion elites, toadying sycophants, and soft-hearted academics too blinded by misplaced utopian dreams to recognize the real causes of economic stagnation and civil war. By insisting on the credo of "just the facts, ma'am," the books introduce many of their key analytical moves on the sly, or via anecdote. . . . This is one explanation of how he comes to the point of effectively arguing for an international regime that would chastise undemocratic leaders by inviting their armies to oust them--a proposal that overestimates the virtuousness of rich countries (and poor countries' armies) while it ignores many other potential sources of political change . . .

My [McGovern's] aim in this essay is not to demolish Collier's important work, nor to call into question development economics or the use of statistics. . . . But the rhetorical tics of Collier's books deserve some attention. . . . if his European and North American audiences are so deeply (and, it would seem, so easily) misled, why is he quick to presume that the "bottom billion" are rational actors? Mightn't they, too, be resistant to the good sense purveyed by economists and other demystifiers?

Now to the statistical modeling, causal inference, and social science. McGovern writes of Collier (and other quantitatively-minded researchers):

Portions of the two books draw on Collier's academic articles to show one or several intriguing correlations. Having run a series of regressions, he identifies counterintuitive findings . . . However, his analysis is typically a two-step process. First, he states the correlation, and then, he suggests an explanation of what the causal process might be. . . . Much of the intellectual heavy lifting in these books is in fact done at the level of implication or commonsense guessing.

This pattern (of which McGovern gives several convincing examples) is what statistician Kaiser Fung calls story time--that pivot from the quantitative finding to the speculative explanation. My favorite recent example remains the claim that "a raise won't make you work harder." As with McGovern's example, the "story time" hypothesis there may very well be true (under some circumstances) but the statistical evidence doesn't come close to proving the claim or even convincing me of its basic truth.

### The story of story time

But story time can't be avoided. On one hand, there are real questions to be answered and real decisions to be made in development economics (and elsewhere), and researchers and policymakers can't simply sit still and say they can't do anything because the data aren't fully persuasive. (Remember the first principle of decision analysis: Not making a decision is itself a decision.)

From the other direction, once you have an interesting quantitative finding, of course you want to understand it, and it makes sense to use all your storytelling skills here. The challenge is to go back and forth between the storytelling and the data. You find some interesting result (perhaps an observational data summary, perhaps an analysis of an experiment or natural experiment), this motivates a story, which in turn suggests some new hypotheses to be studied. Yu-Sung and I were just talking about this today in regard to our article on public opinion about school vouchers.

The question is: How do quantitative analysis and story time fit into the big picture? Mike McGovern writes that he wishes Paul Collier had been more modest in his causal claims, presenting his quantitative findings as "intriguing and counterintuitive correlations" and frankly recognizing that exploration of these correlations requires real-world understanding, not just the rhetoric of hard-headed empiricism.

I agree completely with McGovern--and I endeavor to follow this sort of modesty in presenting the implications of my own applied work--and I think it's a starting point for Collier and others. Once they recognize that, indeed, they are in story time, they can think harder about the empirical implications of their stories.

### The trap of "identifiability"

As Ole Rogeberg writes (following up on ideas of James Heckman and others), the search for clean identification strategies in social research can be a trap, in that it can result in precise but irrelevant findings tied to broad but unsupported claims. Rogeberg has a theoretical model explaining how economists can be so rigorous in parts of their analysis and so unrigorous in others. Rogeberg sounds very much like McGovern when he writes:

The puzzle that we try to explain is this frequent disconnect between high-quality, sophisticated work in some dimensions, and almost incompetently argued claims about the real world on the other.

### The virtue of description

Descriptive statistics is not just for losers. There is value in revealing patterns in observational data, correlations or predictions that were not known before. For example, political scientists were able to forecast presidential election outcomes using information available months ahead of time. This has implications about political campaigns--and no causal identification strategy was needed. Countries with United Nations peacekeeping take longer, on average, to revert to civil war, compared to similarly-situated countries without peacekeeping. A fact worth knowing, even before the storytelling starts. (Here's the link, which happens to also include another swipe at Paul Collier, this time from Bill Easterly.)

I'm not convinced by every correlation I see. For example, there was this claim that warming increases the risk of civil war in Africa. As I wrote at the time, I wanted to see the time series and the scatterplot. A key principle in applied statistics is that you should be able to connect between the raw data, your model, your methods, and your conclusions.

### The role of models

In a discussion of McGovern's article, Chris Blattman writes:

Economists often take their models too seriously, and too far. Unfortunately, no one else takes them seriously enough. In social science, models are like maps; they are useful precisely because they don't explain the world exactly as it is, in all its gory detail. Economic theory and statistical evidence doesn't try to fit every case, but rather find systematic tendencies. We go wrong to ignore these regularities, but we also go wrong to ignore the other forces at work--especially the ones not so easily modeled with the mathematical tools at hand.

I generally agree with what Chris writes, but here I think he's a bit off by taking statistical evidence and throwing it in the same category as economic theory and models. My take-away from McGovern is that the statistical evidence of Collier et al. is fine; the problem is with the economic models which are used to extrapolate from the evidence to the policy recommendations. I'm sure Chris is right that economic models can be useful in forming and testing statistical hypotheses, but I think the evidence can commonly be assessed on its own terms. (This is related to my trick of understanding instrumental variables by directly summarizing the effect of the instrument on the treatment and the outcome without taking the next step and dividing the coefficients.)

To put it another way: I would separate the conceptually simple statistical models that are crucial to understanding evidence in any complex-data setting, from the economics (or, more generally, social science) models that are needed to apply empirical correlations to real-world decisions.

## Experimental reasoning in social science

As a statistician, I was trained to think of randomized experimentation as representing the gold standard of knowledge in the social sciences, and, despite having seen occasional arguments to the contrary, I still hold that view, expressed pithily by Box, Hunter, and Hunter (1978) that "To find out what happens when you change something, it is necessary to change it."

At the same time, in my capacity as a social scientist, I've published many applied research papers, almost none of which have used experimental data.

In the present article, I'll address the following questions:

1. Why do I agree with the consensus characterization of randomized experimentation as a gold standard?

2. Given point 1 above, why does almost all my research use observational data?

In confronting these issues, we must consider some general issues in the strategy of social science research. We also take from the psychology methods literature a more nuanced perspective that considers several different aspects of research design and goes beyond the simple division into randomized experiments, observational studies, and formal theory.

Here's the full article, which is appearing in a volume, Field Experiments and Their Critics, edited by Dawn Teele.

It was fun to write a whole article on causal inference in social science without duplicating the article that I'd recently written for the American Journal of Sociology. But I think it came out pretty well. Actually, it contains the material for several blog entries had I chosen to present it that way. In any case, I think points 1 and 2 are central to any consideration of causal inference in applied statistics.

## Controversy over the Christakis-Fowler findings on the contagion of obesity

Nicholas Christakis and James Fowler are famous for finding that obesity is contagious. Their claims, which have been received with both respect and skepticism (perhaps we need a new word for this: "respecticism"?) are based on analysis of data from the Framingham heart study, a large longitudinal public-health study that happened to have some social network data (for the odd reason that each participant was asked to provide the name of a friend who could help the researchers locate them if they were to move away during the study period).

The short story is that if your close contact became obese, you were likely to become obese also. The long story is a debate about the reliability of this finding (that is, can it be explained by measurement error and sampling variability) and its causal implications.

This sort of study is in my wheelhouse, as it were, but I have never looked at the Christakis-Fowler work in detail. Thus, my previous and current comments are more along the lines of reporting, along with general statistical thoughts.

We last encountered Christakis-Fowler last April, when Dave Johns reported on some criticisms coming from economists Jason Fletcher and Ethan Cohen-Cole and mathematician Russell Lyons.

Lyons's paper was recently published under the title, The Spread of Evidence-Poor Medicine via Flawed Social-Network Analysis. Lyons has a pretty aggressive tone--he starts the abstract with the phrase "chronic widespread misuse of statistics" and it gets worse from there--and he's a bit rougher on Christakis and Fowler than I would be, but this shouldn't stop us from evaluating his statistical arguments. Here are my thoughts:

## An argument that can't possibly make sense

Tyler Cowen writes:

Texas has begun to enforce [a law regarding parallel parking] only recently . . . Up until now, of course, there has been strong net mobility into the state of Texas, so was the previous lack of enforcement so bad?

I care not at all about the direction in which people park their cars and I have no opinion on this law, but I have to raise an alarm at Cowen's argument here.

Let me strip it down to its basic form:

1. Until recently, state X had policy A.

2. Up until now, there has been strong net mobility into state X.

3. Therefore, the presumption is that policy A is ok.

In this particular case, I think we can safely assume that parallel parking regulations have had close to zero impact on the population flows into and out of Texas. More generally, I think logicians could poke some holes into the argument that 1 and 2 above imply 3. For one thing, you could apply this argument to any policy in any state that's had positive net migration.

Hair styling licensing in Florida, anyone?

P.S. I'm not trying to pick on Cowen here. Everybody makes mistakes. The most interesting logical errors are the ones that people make by accident, without reflection. So I thought it could be helpful to point this one out.

P.P.S. Commenters suggest Cowen was joking. In which case I applaud him for drawing attention to this common error in reasoning.

## Grouponomics, counterfactuals, and opportunity cost

I keep encountering the word "Groupon"--I think it's some sort of pets.com-style commercial endeavor where people can buy coupons? I don't really care, and I've avoided googling the word out of a general animosity toward our society's current glorification of get-rich-quick schemes. (As you can tell, I'm still bitter about that whole stock market thing.)

Anyway, even without knowing what Groupon actually is, I enjoyed this blog by Kaiser Fung in which he tries to work out some of its economic consequences. He connects the statistical notion of counterfactuals to the concept of opportunity cost from economics. The comments are interesting too.

## What Do We Learn from Narrow Randomized Studies?

Under the headline, "A Raise Won't Make You Work Harder," Ray Fisman writes:

To understand why it might be a bad idea to cut wages in recessions, it's useful to know how workers respond to changes in pay--both positive and negative changes. Discussion on the topic goes back at least as far as Henry Ford's "5 dollars a day," which he paid to assembly line workers in 1914. The policy was revolutionary at the time, as the wages were more than double what his competitors were paying. This wasn't charity. Higher-paid workers were efficient workers--Ford attracted the best mechanics to his plant, and the high pay ensured that employees worked hard throughout their eight-hour shifts, knowing that if their pace slackened, they'd be out of a job. Raising salaries to boost productivity became known as "efficiency wages."

So far, so good. Fisman then moves from history and theory to recent research:

How much gift exchange really matters to American bosses and workers remained largely a matter of speculation. But in recent years, researchers have taken these theories into workplaces to measure their effect on employee behavior.

In one of the first gift-exchange experiments involving "real" workers, students were employed in a six-hour library data-entry job, entering title, author, and other information from new books into a database. The pay was advertised as \$12 an hour for six hours. Half the students were actually paid this amount. The other half, having shown up expecting \$12 an hour, were informed that they'd be paid \$20 instead. All participants were told that this was a one-time job--otherwise, the higher-paid group might work harder in hopes of securing another overpaying library gig.

The experimenters checked in every 90 minutes to tabulate how many books had been logged. At the first check-in, the \$20-per-hour employees had completed more than 50 books apiece, while the \$12-an-hour employees barely managed 40 each. In the second 90-minute stretch, the no-gift group maintained their 40-book pace, while the gift group fell from more than 50 to 45. For the last half of the experiment, the "gifted" employees performed no better--40 books per 90-minute period--than the "ungifted" ones.

The punchline, according to Fisman:

The goodwill of high wages took less than three hours to evaporate completely--hardly a prescription for boosting long-term productivity.
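For scale, the numbers in the description above can be added up. This is only arithmetic on the rounded figures as quoted ("more than 50" is treated as 50 and "barely managed 40" as 40), so the results are approximate.

```python
HOURS = 6  # six-hour job, checked every 90 minutes (four periods)

# Approximate books logged per 90-minute period, as quoted
books_gift = [50, 45, 40, 40]      # group paid 20 dollars an hour
books_no_gift = [40, 40, 40, 40]   # group paid 12 dollars an hour

total_gift = sum(books_gift)
total_no_gift = sum(books_no_gift)

extra_output = total_gift / total_no_gift - 1   # relative gain in books logged
extra_pay = 20 / 12 - 1                         # relative increase in hourly pay

cost_per_book_gift = 20 * HOURS / total_gift
cost_per_book_no_gift = 12 * HOURS / total_no_gift

print(f"books logged: {total_gift} vs {total_no_gift}")
print(f"extra output: {extra_output:.1%}, extra pay: {extra_pay:.1%}")
print(f"cost per book: {cost_per_book_gift:.2f} vs {cost_per_book_no_gift:.2f}")
```

On these rounded figures the higher-paid group logged roughly 9% more books over the six hours for roughly 67% more pay.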

What I'm wondering is: How seriously should we take an experiment on one-shot student library jobs (or another study, in which short-term employees were rewarded "with a surprise gift of thermoses") when drawing general conclusions such as "Raises don't make employees work harder"?

What I'm worried about here isn't causal identification--I'm assuming these are clean experiments--but the generalizability to the outside world of serious employment.

Fisman writes:

All participants were told that this was a one-time job--otherwise, the higher-paid group might work harder in hopes of securing another overpaying library gig.

This seems like a direct conflict between the goals of internal and external validity, especially given that one of the key reasons to pay someone more is to motivate them to work harder to secure continuation of the job, and to give them less incentive to spend their time looking for something new.

I'm not saying that the study Fisman cited is useless, just that I'm surprised that he's so careful to consider internal validity issues yet seems to have no problem extending the result to the whole labor force.

These are just my worries. Ray Fisman is an excellent researcher here at the business school at Columbia--actually, I know him and we've talked about statistics a couple times--and I'm sure he's thought about these issues more than I have. So I'm not trying to debunk what he's saying, just to add a different perspective.

Perhaps Fisman's b-school background explains why his studies all seem to be coming from the perspective of the employer: it's the employer who decides what to do with wages (perhaps "presenting the cut as a temporary measure and by creating at least the illusion of a lower workload") and the employees who are the experimental subjects.

Fisman's conclusion:

If we can find other ways of overcoming the simmering resentment that naturally accompanies wage cuts, workers themselves will be better for it in the long run.

The "we" at the beginning of the sentence does not seem to be the same as the "workers" at the end of the sentence. I wonder if there is a problem with designing policies in this unidirectional fashion.

## Is the internet causing half the rapes in Norway? I wanna see the scatterplot.

Ryan King writes:

This involves causal inference, hierarchical setup, small effect sizes (in absolute terms), and will doubtless be heavily reported in the media.

The article is by Manudeep Bhuller, Tarjei Havnes, Edwin Leuven, and Magne Mogstad and begins as follows:

Does internet use trigger sex crime? We use unique Norwegian data on crime and internet adoption to shed light on this question. A public program with limited funding rolled out broadband access points in 2000-2008, and provides plausibly exogenous variation in internet use. Our instrumental variables and fixed effect estimates show that internet use is associated with a substantial increase in reported incidences of rape and other sex crimes. We present a theoretical framework that highlights three mechanisms for how internet use may affect reported sex crime, namely a reporting effect, a matching effect on potential offenders and victims, and a direct effect on crime propensity. Our results indicate that the direct effect is non-negligible and positive, plausibly as a result of increased consumption of pornography.

How big is the effect?

## D. Kahneman serves up a wacky counterfactual

I followed a link from Tyler Cowen to this bit by Daniel Kahneman:

Education is an important determinant of income -- one of the most important -- but it is less important than most people think. If everyone had the same education, the inequality of income would be reduced by less than 10%. When you focus on education you neglect the myriad other factors that determine income. The differences of income among people who have the same education are huge.

I think I know what he's saying--if you regress income on education and other factors, and then you take education out of the model, R-squared decreases by 10%. Or something like that. Not necessarily R-squared, maybe you fit the big model, then get predictions for everyone putting in the mean value for education and look at the sd of incomes or the Gini index or whatever. Or something else along those lines.
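One version of that calculation, sketched on simulated data. All numbers here are made up: years of education drawn from a normal distribution, log income equal to a constant plus 0.1 per year of education plus noise. The sketch fits the regression, then sets everyone's education to the mean and compares the spread of incomes before and after. This is regression arithmetic only; it says nothing about what would happen in a world where everyone really did have the same education.

```python
import random
from statistics import mean, pstdev

rng = random.Random(3)
n = 20000

# Hypothetical population
educ = [rng.gauss(14, 2.5) for _ in range(n)]               # years of education
log_income = [10 + 0.1 * e + rng.gauss(0, 0.6) for e in educ]

# Least-squares slope of log income on education
me, my = mean(educ), mean(log_income)
b = (sum((e - me) * (y - my) for e, y in zip(educ, log_income))
     / sum((e - me) ** 2 for e in educ))

# "Everyone has the same education," as the fitted model sees it:
# shift each person to the mean education, keep their residual.
same_educ = [y - b * (e - me) for e, y in zip(educ, log_income)]

sd_actual = pstdev(log_income)
sd_same = pstdev(same_educ)
reduction = 1 - sd_same / sd_actual          # drop in the sd of log income
r_squared = 1 - (sd_same / sd_actual) ** 2   # share of variance explained by education

print(f"sd of log income: {sd_actual:.3f} -> {sd_same:.3f}")
print(f"reduction in sd: {reduction:.1%}, R-squared: {r_squared:.1%}")
```

With these invented parameters the standard deviation falls by under 10% even though education explains a noticeably larger share of the variance, which shows how much the headline number depends on the measure of inequality chosen.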

My problem is with the counterfactual: "If everyone had the same education . . ." I have a couple problems with this one. First, if everyone had the same education, we'd have a much different world and I don't see why the regressions on which he's relying would still be valid. Second, is it even possible for everyone to have the same education? I majored in physics at MIT. I don't think it's possible for everyone to do this. Setting aside budgetary constraints, I don't think that most college-age kids could handle the MIT physics curriculum (nor do I think I could handle, for example, the courses at a top-ranked music or art college). I suppose you could imagine everyone having the same number of years of education, but that seems like a different thing entirely.

As noted, I think I see what Kahneman is getting at--income is determined by lots of other factors than education--but I'm a bit disappointed that he could be so casual with the causality. And without the causal punch, his statement doesn't seem so impressive to me. Everybody knows that education doesn't determine income, right? Bill Gates never completed college, and everybody knows the story of humanities graduates who can't find a job.

## Improvement of 5 MPG: how many more auto deaths?

This entry was posted by Phil Price.

A colleague is looking at data on car (and SUV and light truck) collisions and casualties. He's interested in causal relationships. For instance, suppose car manufacturers try to improve gas mileage without decreasing acceleration. The most likely way they will do that is to make cars lighter. But perhaps lighter cars are more dangerous; how many more people will die for each mpg increase in gas mileage?

There are a few different data sources, all of them seriously deficient from the standpoint of answering this question. Deaths are very well reported, so if someone dies in an auto accident you can find out what kind of car they were in, what other kinds of cars (if any) were involved in the accident, whether the person was a driver or passenger, and so on. But it's hard to normalize: OK, I know that N people who were passengers in a particular model of car died in car accidents last year, but I don't know how many passenger-miles that kind of car was driven, so how do I convert this to a risk? I can find out how many cars of that type were sold, and maybe even (through registration records) how many are still on the road, but I don't know the total number of miles. Some types of cars are driven much farther than others, on average.

Most states also have data on all accidents in which someone was injured badly enough to go to the hospital. This lets you look at things like: given that the car is in an accident, how likely is it that someone in the car will die? This sort of analysis makes heavy cars look good (for the passengers in those vehicles; not so good for passengers in other vehicles, which is also a phenomenon of interest!) but perhaps this is misleading: heavy cars are less maneuverable and have longer stopping distance, so perhaps they're more likely to be in an accident in the first place. Conceivably, a heavy car might be a lot more likely to be in an accident, but less likely to kill the driver if it's in one, compared to a lighter car that is better for avoiding accidents but more dangerous if it does get hit.
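The last point is just the product rule for probabilities. The figures below are hypothetical, chosen only so the arithmetic is easy to follow; they are not estimates of anything.

```python
# Hypothetical per-vehicle-year probabilities (illustration only)
cars = {
    "heavy": {"p_accident": 0.030, "p_death_given_accident": 0.004},
    "light": {"p_accident": 0.020, "p_death_given_accident": 0.006},
}

for name, c in cars.items():
    # P(death) = P(accident) * P(death | accident)
    c["p_death"] = c["p_accident"] * c["p_death_given_accident"]
    print(f"{name}: P(death | accident) = {c['p_death_given_accident']:.3f}, "
          f"P(death) = {c['p_death']:.6f}")
```

Here the light car looks 50% more dangerous once an accident has happened, yet the overall risk per vehicle-year is the same for both, because the heavy car is assumed to crash more often. Data that start from accidents cannot see the first factor.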

Confounding every question of interest is that different types of driver prefer different cars. Any car that is driven by a disproportionately large fraction of men in their late teens or early twenties is going to have horrible accident statistics, whereas any car that is selected largely by middle-aged women with young kids is going to look pretty good. If 20-year-old men drove Volvo station wagons, the Volvo station wagon would appear to be one of the most dangerous cars on the road, and if 40-year-old women with 5-year-old kids drove Ferraris, the Ferrari would seem to be one of the safest.
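A small worked example of this kind of confounding, using invented counts. It assumes we know miles driven by driver group and car type, which is exactly the information described above as unavailable, and it assumes car weight has no effect at all on the death rate within a driver group.

```python
# Hypothetical counts: (driver group, car type) -> (vehicle-miles in millions, deaths)
exposure = {
    ("young men", "light"): (800, 16),
    ("young men", "heavy"): (200, 4),
    ("middle-aged women", "light"): (200, 1),
    ("middle-aged women", "heavy"): (800, 4),
}

def rate(cells):
    """Deaths per million vehicle-miles over a list of (miles, deaths) cells."""
    miles = sum(m for m, _ in cells)
    deaths = sum(d for _, d in cells)
    return deaths / miles

def pooled_rate(car):
    """Death rate for a car type, ignoring who drives it."""
    return rate([v for (_, c), v in exposure.items() if c == car])

def stratum_rate(driver, car):
    """Death rate for a car type within one driver group."""
    return rate([exposure[(driver, car)]])

for car in ("light", "heavy"):
    print(f"{car}: pooled rate = {pooled_rate(car):.4f}")
for driver in ("young men", "middle-aged women"):
    for car in ("light", "heavy"):
        print(f"{driver}, {car}: rate = {stratum_rate(driver, car):.4f}")
```

Pooled over drivers, the light car's death rate is about twice the heavy car's (0.017 versus 0.008 per million miles). Within each driver group the two rates are identical. The whole pooled difference comes from who chooses which car.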

There are lots of other confounders, too. Big engines and heavy frames cost money to make, so inexpensive cars tend to be light and to have small engines, in addition to being physically small. They also tend to have less in the way of safety features (no side-curtain airbags, for example). If an inexpensive car has a poor safety record, is it because it's light, because it's small, or because it's lacking safety features? And yes, size matters, not just weight: a bigger car can have a bigger "crumple zone" and thus lower average acceleration if it hits a solid object, for example. If large, heavy cars really are safer than small, light cars, how much of the difference is due to size and how much is due to weight? Perhaps a large, light car would be the best, but building a large, light car would require special materials, like titanium or aluminum or carbon fiber, which might make it a lot more expensive...what, if anything, do we want to hold constant if we increase the fleet gas mileage? Cost? Size?

And of course the parameters I've listed above --- size, weight, safety features, and driver characteristics --- don't begin to cover all of the relevant factors.

So: is it possible to untangle the causal influence of various factors?

Most people who are involved in this research topic appear to rely on linear or logistic regression, controlling for various explanatory variables, and make various interpretations based on the regression coefficients, r-squared values, etc. Is this the best that can be done? And if so, how does one figure out the right set of explanatory variables?

This is a "causal inference" question, and according to the title of this blog, this blog should be just the place for this sort of thing. So, bring it on: where do I look to find the right way to answer this kind of question?

(And, by the way, what is the answer to the question I posed at the end of this causal inference discussion?)

## Bringing Causal Models Into the Mainstream

John Johnson writes at the Statistics Forum.

## "Are Wisconsin Public Employees Underpaid?"

Amy Cohen points me to this blog by Jim Manzi, who writes:

## Poverty, educational performance - and what can be done about it

Andrew has pointed to Jonathan Livengood's analysis of the correlation between poverty and PISA results, whereby schools with poorer students get poorer test results. I'd have written a comment, but then I couldn't have inserted a chart.

Andrew points out that a causal analysis is needed. This reminds me of an intervention that has been done before: take a child out of poverty, and bring him up in a better-off family. What's going to happen? There have been several studies examining correlations between adoptive and biological parents' IQ (assuming IQ is a test analogous to the math and verbal tests, and that parent IQ is analogous to the quality of instruction - but the point is in the analysis not in the metric). This is the result (from Adoption Strategies by Robin P Corley in Encyclopedia of Life Sciences):

So, while it did make a difference at an early age, with increasing age of the adopted child, the intelligence of adoptive parents might not be making any difference whatsoever in the long run. At the same time, the high IQ parents could have been raising their own child, and it would probably take the same amount of resources.

There are conscientious people who might not choose to have a child because they wouldn't be able to afford to provide to their own standard (their apartment is too small, for example, or they don't have enough security and stability while being a graduate student). On the other hand, people with less comprehension might neglect this and impose their child on society without the means to provide for him. Is it good for society to ask the first group to pay taxes, and reallocate the funds to the second group? I don't know, but it's a very important question.

I am no expert, especially not in psychology, education, sociology or biology. Moreover, there is a lot more than just IQ: ethics and constructive pro-social behavior are probably more important, and might be explained a lot better by nurture than nature.

I do know that I get anxious whenever a correlation analysis tries to look like a causal analysis. A frequent scenario introduces an outcome (test performance) with a highly correlated predictor (say poverty), and suggests that reducing poverty will improve the outcome. The problem is that poverty is correlated with a number of other predictors. A solution I have found is to understand that multiple predictors' information about the outcome overlaps - a tool I use is interaction analysis, whereby we make explicit that two predictors' information overlaps (in contrast to regression coefficients, which misleadingly separate the contributions of each predictor). But the real solution is a study of interventions, and the twin and adoptive studies with a longer time horizon are pretty rigorous. I'd be curious about similarly rigorous studies of educational interventions, or about the flaws in the twin and adoptive studies.

[Feb 7, 8:30am] An email points out a potential flaw in the correlation analysis:

The thing which these people systematically missed, was that we don't really care at all about the correlation between the adopted child's IQ and that of the adopted parent. The right measure of effect is to look at the difference in IQ level.

Example to drive home the point: Suppose the IQ of every adoptive parent is 120, while the IQ of the biological parents is Normal(100,15), as is that of the biological control siblings, but that of the adopted children is Normal(110,15). The correlation between adopted children and adoptive parents would be exactly zero (because the adoptive parents are all so similar), but clearly adoption would have had a massive effect. And, yes, adoptive parents, especially in these studies, are very different from the norm, and similar to each other: I don't know about the Colorado study, but in the famous Minnesota twins study, the mean IQ of the adoptive fathers was indeed 120, as compared to a state average of 105.
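The emailed example can be checked with a quick simulation. This is a minimal sketch using the numbers from the email, with one added assumption: adoptive parents are drawn from a tight Normal(120, 2) rather than being exactly 120, so that the correlation is defined. The helper name `pearson` and the sample size are illustrative choices.

```python
import random
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)
n = 50_000
# Assumption: parents cluster tightly around 120 instead of being exactly 120.
parent_iq = [rng.gauss(120, 2) for _ in range(n)]
# Adopted children: shifted up 10 points, but unrelated to which parent they got.
adopted_iq = [rng.gauss(110, 15) for _ in range(n)]
# Non-adopted biological siblings as the comparison group.
control_iq = [rng.gauss(100, 15) for _ in range(n)]

corr = pearson(parent_iq, adopted_iq)
effect = statistics.fmean(adopted_iq) - statistics.fmean(control_iq)
print(f"parent-child correlation: {corr:.3f}")
print(f"difference in mean IQ:    {effect:.1f}")
```

The correlation should land near zero while the difference in means sits near the 10 points built into the setup, which is the email's point: the correlation and the level difference answer different questions.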

The review paper you link to is, so far as I can tell, completely silent about these obvious-seeming points.

I would add that correlations are going to be especially misleading for causal inference in any situation where a variable is being regulated towards some goal level, because, if the regulation is successful, the regulated variable will barely vary with the things that actually influence it. It's like arguing that the temperature in my kitchen is causally irrelevant to the temperature in my freezer --- it's uncorrelated, but only because a lot of complicated machinery does a lot of work to keep it that way! With that thought in mind, read this.

Indeed, the model based on correlation doesn't capture the improvement in average IQ between what the adopted child would have had if brought up in an orphanage or by unwilling or incapable biological parents (as arguably all children put up for adoption are) and what the child attains when brought up in a well-functioning family (as probably all adoptive families are). And comments like these are precisely why we should discuss these topics systematically, so that better models can be developed and studied! As a European I am regularly surprised how politicized this topic seems to be in the US. It's an important question that needs more rigor.

Thanks for the emails and comments, they're the main reason why I still write these blog posts.

## An IV won't save your life if the line is tangled

Alex Tabarrok quotes Randall Morck and Bernard Yeung on difficulties with instrumental variables. This reminded me of some related things I've written.

In the official story the causal question comes first and then the clever researcher comes up with an IV. I suspect that often it's the other way around: you find a natural experiment and look at the consequences that flow from it. And maybe that's not such a bad thing. See section 4 of this article.

More generally, I think economists and political scientists are currently a bit overinvested in identification strategies. I agree with Heckman's point (as I understand it) that ultimately we should be building models that work for us rather than always thinking we can get causal inference on the cheap, as it were, by some trick or another. (This is a point I briefly discuss in a couple places here and also in my recent paper for the causality volume that Don Green etc are involved with.)

I recently had this discussion with someone else regarding regression discontinuity (the current flavor of the month; IV is soooo 90's), but I think the point holds more generally, that experiments and natural experiments are great when you have them, and they're great to aspire to and to focus one's thinking, but in practice these inferences are sometimes a bit of a stretch, and sometimes the appeal of an apparently clean identification strategy masks some serious difficulty mapping the identified parameter to underlying quantities of interest.

P.S. How I think about instrumental variables.

## Teaching evaluations, instructor effectiveness, the Journal of Political Economy, and the Holy Roman Empire

Joan Nix writes:

Your comments on this paper by Scott Carrell and James West would be most appreciated. I'm afraid the conclusions of this paper are too strong given the data set and other plausible explanations. But given where it is published, this paper is receiving and will continue to receive lots of attention. It will be used to draw deeper conclusions regarding effective teaching and experience.

Nix also links to this discussion by Jeff Ely.

I don't completely follow Ely's criticism, which seems to me to be too clever by half, but I agree with Nix that the findings in the research article don't seem to fit together very well. For example, Carrell and West estimate that the effects of instructors on performance in the follow-on class are as large as the effects on the class they're teaching. This seems hard to believe, and it seems central enough to their story that I don't know what to think about everything else in the paper.

My other thought about teaching evaluations is from my personal experience. When I feel I've taught well--that is, in semesters when it seems that students have really learned something--I tend to get good evaluations. When I don't think I've taught well, my evaluations aren't so good. And, even when I think my course has gone wonderfully, my evaluations are usually far from perfect. This has been helpful information for me.

That said, I'd prefer to have objective measures of my teaching effectiveness. Perhaps surprisingly, statisticians aren't so good about measurement and estimation when applied to their own teaching. (I think I've blogged on this on occasion.) The trouble is that measurement and evaluation take work! When we're giving advice to scientists, we're always yammering on about experimentation and measurement. But in our own professional lives, we pretty much throw all our statistical principles out the window.

P.S. What's this paper doing in the Journal of Political Economy? It has little or nothing to do with politics or economics!

P.P.S. I continue to be stunned by the way in which tables of numbers are presented in social science research papers with no thought of communication: for example, tables with interval estimates such as "(.0159, .0408)." (What were all those digits for? And what do these numbers have to do with anything at all?) If the words, sentences, and paragraphs of an article were put together in such a stylized, unthinking way, the article would be completely unreadable. Formal structures with almost no connection to communication or content . . . it would be like writing the entire research article in iambic pentameter with an a,b,c,b rhyme scheme, or somesuch. I'm not trying to pick on Carrell and West here--this sort of presentation is nearly universal in social science journals.

## Cars vs. trucks

Anupam Agrawal writes:

I am an Assistant Professor of Operations Management at the University of Illinois. . . . My main work is in the supply chain area, and empirical in nature. . . . I am working with a firm that has two separate divisions - one making cars and the other making trucks. Four years back, the firm made an interesting organizational change. They created a separate group of ~25 engineers in their car division (from within their quality and production engineers). This group was focused on improving supplier quality and reported to the car plant head. The truck division did not (and still does not) have such an independent "supplier improvement group". Other than this unit in the car division, the organizational arrangements in the two divisions mimic each other. There are many common suppliers to the car and truck divisions.

Data on quality of components coming from suppliers has been collected (for the last four years). The organizational change happened in January 2007.

My focus is to see whether organizational change (and a different organizational structure) drives improvements.

## Regression discontinuity designs: looking for the keys under the lamppost?

Jas sends along this paper (with Devin Caughey), entitled Regression-Discontinuity Designs and Popular Elections: Implications of Pro-Incumbent Bias in Close U.S. House Races, and writes:

The paper shows that regression discontinuity does not work for US House elections. Close House elections are anything but random. It isn't election recounts or something like that (we collect recount data to show that it isn't). We have collected much new data to try to hunt down what is going on (e.g., campaign finance data, CQ pre-election forecasts, corrections to many errors in the Lee dataset). The substantive implications are interesting. We also have a section that compares in detail the Gelman and King versus the Lee estimand and estimator.

## Quality control problems at the New York Times

I guess there's a reason they put this stuff in the Opinion section and not in the Science section, huh?

P.S. More here.

Lei Liu writes:

## Matching for preprocessing data for causal inference

Chris Blattman writes:

Matching is not an identification strategy or a solution to your endogeneity problem; it is a weighting scheme. Saying matching will reduce endogeneity bias is like saying that the best way to get thin is to weigh yourself in kilos. The statement makes no sense. It confuses technique with substance. . . . When you run a regression, you control for the X you can observe. When you match, you are simply matching based on those same X. . . .

I see what Chris is getting at--matching, like regression, won't help for the variables you're not controlling for--but I disagree with his characterization of matching as a weighting scheme. I see matching as a way to restrict your analysis to comparable cases. The statistical motivation: robustness. If you had a good enough model, you wouldn't need to match, you'd just fit the model to the data. But in common practice we often use simple regression models and so it can be helpful to do some matching first before regression. It's not so difficult to match on dozens of variables, but it's not so easy to include dozens of variables in your least squares regression. So in practice it's not always the case that "you are simply matching based on those same X." To put it another way: yes, you'll often need to worry about potential X variables that you don't have--but that shouldn't stop you from controlling for everything that you do have, and matching can be a helpful tool in that effort.
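Here is a rough sketch of the "match first, then regress" idea on fake data. Everything in it is an assumption made for illustration: a single covariate `x`, an outcome that is quadratic in `x`, a built-in effect of 5, controls spread over a wide range of `x`, treated units confined to a narrow band, and one-to-one nearest-neighbor matching with replacement. The simple regression is linear in `x`, so it is misspecified on purpose.

```python
import bisect
import random
import statistics

def residuals(y, x):
    """Residuals from a least-squares line of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return [(yi - my) - slope * (xi - mx) for xi, yi in zip(x, y)]

def treatment_coef(y, t, x):
    """Coefficient on t in the linear regression y ~ 1 + t + x (Frisch-Waugh)."""
    ry, rt = residuals(y, x), residuals(t, x)
    return sum(a * b for a, b in zip(ry, rt)) / sum(b * b for b in rt)

rng = random.Random(2)
TRUE_EFFECT = 5.0  # hypothetical treatment effect built into the fake data

def outcome(x, t):
    # Nonlinear in x, so a linear adjustment for x is misspecified.
    return x ** 2 + TRUE_EFFECT * t + rng.gauss(0, 1)

# Controls span a wide range of x; treated units sit in a narrow band.
controls = sorted((x, outcome(x, 0)) for x in (rng.uniform(0, 10) for _ in range(2000)))
treated = [(x, outcome(x, 1)) for x in (rng.uniform(4, 6) for _ in range(500))]
control_xs = [c[0] for c in controls]

def nearest_control(x):
    """Closest control on x (matching with replacement)."""
    i = bisect.bisect_left(control_xs, x)
    candidates = controls[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda c: abs(c[0] - x))

matched = [nearest_control(x) for x, _ in treated]

def fit(treated_units, control_units):
    """Run the same simple regression on a given set of treated and control units."""
    units = ([(x, y, 1.0) for x, y in treated_units]
             + [(x, y, 0.0) for x, y in control_units])
    xs, ys, ts = zip(*units)
    return treatment_coef(ys, ts, xs)

full_sample_estimate = fit(treated, controls)
matched_estimate = fit(treated, matched)
print(f"regression on all data:     {full_sample_estimate:.2f}")
print(f"regression on matched data: {matched_estimate:.2f}")
```

In this setup the regression on all the data is pulled far from 5 by controls that look nothing like the treated units, while the same regression on the matched subset should land close to 5. Note that matching here uses only the observed `x`, so it does nothing about unobserved confounders, which is the part of Chris's argument that still stands.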

Beyond this, I think it's useful to distinguish between two different problems: imbalance and lack of complete overlap. See chapter 10 of ARM for further discussion. Also some discussion here.

## Is instrumental variables analysis particularly susceptible to Type M errors?

Hendrik Juerges writes:

I am an applied econometrician. The reason I am writing is that I have been pondering a question for some time now and I am curious whether you have any views on it.

One problem the practitioner of instrumental variables estimation faces is large standard errors even with very large samples. Part of the problem is of course that one estimates a ratio. Anyhow, more often than not, I and many other researchers I know end up with large point estimates and standard errors when trying IV on a problem. Sometimes some of us are lucky and get a statistically significant result. Those estimates that make it beyond the 2 standard error threshold are often ridiculously large (one famous example in my line of research being Lleras-Muney's estimates of the 10% effect of one year of schooling on mortality). The standard defense here is that IV estimates the complier-specific causal effect (which is mathematically correct). But still, I find many of the IV results (including my own) simply incredible.

Now comes my question: Could it be that IV is particularly prone to "type M" errors? (I recently read your article on beauty, sex, and power). If yes, what can be done? Could Bayesian inference help?

I've never actually done any instrumental variables analysis, Bayesian or otherwise. But I do recall that Imbens and Rubin discuss Bayesian solutions in one of their articles, and I think they made the point that the inclusion of a little bit of prior information can help a lot.

In any case, I agree that if standard errors are large, then you'll be subject to Type M errors. That's basically an ironclad rule of statistics.
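To illustrate the rule, here is a small simulation sketch. All of the settings are assumptions chosen for illustration: a binary instrument, an unobserved confounder, a true effect of 0.1, a modest first stage, and the usual large-sample standard error for a single-instrument IV estimate. The function names `iv_estimate` and `simulate` are hypothetical.

```python
import random
import statistics

def iv_estimate(z, x, y):
    """Single-instrument IV (Wald) estimate of the effect of x on y, with its s.e."""
    n = len(z)
    mz, mx, my = statistics.fmean(z), statistics.fmean(x), statistics.fmean(y)
    szx = sum((zi - mz) * (xi - mx) for zi, xi in zip(z, x))
    szy = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
    szz = sum((zi - mz) ** 2 for zi in z)
    b = szy / szx  # ratio of the reduced form to the first stage
    resid_ss = sum(((yi - my) - b * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    s2 = resid_ss / (n - 2)
    se = (s2 * szz) ** 0.5 / abs(szx)
    return b, se

def simulate(true_effect=0.1, first_stage=0.5, n=400, reps=1000, seed=3):
    """Return the IV estimates that clear the 2-standard-error threshold."""
    rng = random.Random(seed)
    significant = []
    for _ in range(reps):
        z = [rng.randint(0, 1) for _ in range(n)]
        u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved confounder
        x = [first_stage * zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
        y = [true_effect * xi + ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]
        b, se = iv_estimate(z, x, y)
        if abs(b) > 2 * se:
            significant.append(b)
    return significant

true_effect = 0.1
significant = simulate(true_effect=true_effect)
exaggeration = statistics.fmean([abs(b) for b in significant]) / true_effect
print(f"{len(significant)} of 1000 estimates were 'significant'")
print(f"average exaggeration among them: {exaggeration:.1f}x")
```

With a standard error several times larger than the true effect, any estimate that clears the 2-standard-error bar has to be several times too big in magnitude, so the exaggeration ratio comes out well above 1 in this setup.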

My own way of understanding IV is to think of the instrument as having a joint effect on the intermediate and final outcomes. Often this can be clear enough, and you don't need to actually divide the coefficients.

And here are my more general thoughts on the difficulty of estimating ratios.

## Story time

This one belongs in the statistical lexicon. Kaiser Fung nails it:

In reading [news] articles, we must look out for the moment(s) when the reporters announce story time. Much of the article is great propaganda for the statistics lobby, describing an attempt to use observational data to address a practical question, sort of a Freakonomics-style application.

We have no problems when they say things like: "There is a substantial gap at year's end between students whose teachers were in the top 10% in effectiveness and the bottom 10%. The fortunate students ranked 17 percentile points higher in English and 25 points higher in math."

Or this: "On average, Smith's students slide under his instruction, losing 14 percentile points in math during the school year relative to their peers districtwide, The Times found. Overall, he ranked among the least effective of the district's elementary school teachers."

Midway through the article (right before the section called "Study in contrasts"), we arrive at these two paragraphs (Kaiser's italics):

On visits to the classrooms of more than 50 elementary school teachers in Los Angeles, Times reporters found that the most effective instructors differed widely in style and personality. Perhaps not surprisingly, they shared a tendency to be strict, maintain high standards and encourage critical thinking.

But the surest sign of a teacher's effectiveness was the engagement of his or her students -- something that often was obvious from the expressions on their faces.

At the very moment they tell readers that engaging students makes teachers more effective, they announce "Story time!" With barely a fuss, they move from an evidence-based analysis of test scores to a speculation on cause--effect. Their story is no more credible than anybody else's story, unless they also provide data to support such a causal link.

I have only two things to add:

1. As Jennifer frequently reminds me, we--researchers and also the general public--generally do care about causal inference. So I have a lot of sympathy for researchers and reporters who go beyond the descriptive content of their data and start speculating. The problem, as Kaiser notes, is when the line isn't drawn clearly, in the short term leading the reader astray and in the longer term, perhaps, discrediting social-scientific research more generally.

2. "Story time" doesn't just happen in the newspapers. We also see it in journal articles all the time. It's that all-too-quick moment when the authors pivot from the causal estimates they've proved, to their speculations, which, as Kaiser says, are "no more credible than anybody else's story." Maybe less credible, in fact, because researchers can fool themselves into thinking they've proved something when they haven't.

## Randomized experiments, non-randomized experiments, and observational studies

In the spirit of Dehejia and Wahba:

Can Nonrandomized Experiments Yield Accurate Answers? A Randomized Experiment Comparing Random and Nonrandom Assignments, by Shadish, Clark, and Steiner.

I just talk about causal inference. These people do it. The second link above is particularly interesting because it includes discussions by some causal inference heavyweights. WWJD and all that.

## Peer pressure, selection, and educational reform

Partly in response to my blog on the Harlem Children's Zone study, Mark Palko wrote this:

Talk of education reform always makes me [Palko] deeply nervous. Part of the anxiety comes from having spent a number of years behind the podium and having seen the disparity between the claims and the reality of previous reforms. The rest comes from being a statistician and knowing what things like convergence can do to data.

Convergent behavior violates the assumption of independent observations used in most simple analyses, but educational studies commonly, perhaps even routinely, ignore the complex ways that social norming can cause the nesting of student performance data.

In other words, educational research is often based on the idea that teenagers do not respond to peer pressure. . . .

and this:

## Why Development Economics Needs Theory?

Robert Neumann writes:

In the JEP 24(3), page 18, Daron Acemoglu states:

Why Development Economics Needs Theory

There is no general agreement on how much we should rely on economic theory in motivating empirical work and whether we should try to formulate and estimate "structural parameters." I (Acemoglu) argue that the answer is largely "yes" because otherwise econometric estimates would lack external validity, in which case they can neither inform us about whether a particular model or theory is a useful approximation to reality, nor would they be useful in providing us guidance on what the effects of similar shocks and policies would be in different circumstances or if implemented in different scales. I therefore define "structural parameters" as those that provide external validity and would thus be useful in testing theories or in policy analysis beyond the specific environment and sample from which they are derived. External validity becomes a particularly challenging task in the presence of general equilibrium and political economy considerations, and a major role of economic theory is in helping us overcome these problems or at the very least alerting us to their importance.

Leaving aside the equilibrium debate, what do you think of his remark that the external validity of estimates refers to an underlying model? Isn't it the other way around?

My reply: This reminds me a lot of Heckman's argument of why randomized experiments are not a gold standard. I see the point but, on the other hand, as Don Green and others have noted, observational studies have external validity problems too! Whether or not a model is motivated by economic theory, you'll have to make assumptions to generalize your inferences beyond the population under study.

When Acemoglu writes, "I therefore define 'structural parameters' as those that provide external validity," I take him to be making the point that Bois, Jiang, and I did in our toxicology article from 1996: When a parameter has a generalizable meaning (in our context, a parameter that is "physiological" rather than merely "phenomenological"), you can more usefully incorporate it in a hierarchical model. We used statistical language and Acemoglu is using econometric language but it's the same idea, I think, and a point worth making in as many languages as it takes.

I don't know that I completely agree with Acemoglu about "theory," however. Theory is great--and we had it in abundance in our toxicology analysis--but I'd think you could have generalizable parameters without formal theory, if you're careful enough to define what you're measuring.

## "Texting bans don't reduce crashes; effects are slight crash increases"

John Christie sends along this. As someone who owns neither a car nor a mobile phone, it's hard for me to relate to this one, but it's certainly a classic example for teaching causal inference.

## Paul Rosenbaum on those annoying pre-treatment variables that are sort-of instruments and sort-of covariates

Last year we discussed an important challenge in causal inference: The standard advice (given in many books, including ours) for causal inference is to control for relevant pre-treatment variables as much as possible. But, as Judea Pearl has pointed out, instruments (as in "instrumental variables") are pre-treatment variables that we would not want to "control for" in a matching or regression sense.

At first, this seems like a minor modification, with the new recommendation being to apply instrumental variables estimation using all pre-treatment instruments, and to control for all other pre-treatment variables. But that can't really work as general advice. What about weak instruments or covariates that have some instrumental aspects?

I asked Paul Rosenbaum for his thoughts on the matter, and he wrote the following:

In section 18.2 of Design of Observational Studies (DOS), I [Rosenbaum] discuss "seemingly innocuous confounding" defined to be a covariate that predicts a substantial fraction of the variation in treatment assignment but without obvious importance to the outcomes under study.

The word "seemingly" is important: it may not be innocuous, but only seem so. The example is drawn from a study (Silber, et al. 2009, Health Services Research 44: 444-463) of the timing of the discharge of premature babies from neonatal intensive care units (NICUs). Although all babies must reach a certain level of functional maturity before discharge, there is variation in discharge time beyond this, and we were interested in whether extra days in the NICU were of benefit to the babies who received them. (The extra days are very costly.) It is a long story, but one small part of the story concerns two "seemingly innocuous covariates," namely the day of the week on which a baby achieves functional maturity and the specific hospital in the Kaiser family of hospitals. A baby who achieves maturity on a Thursday goes home on Friday, but a baby who achieves maturity on Saturday goes home on Tuesday, more or less. It would, of course, be ideal if the date of discharge were determined by something totally irrelevant, but is it true that day-of-the-week is something totally irrelevant?

Should you adjust for the day of the week? A neonatologist argued that day of the week is not innocuous: a doc will keep a baby over the weekend if the doc is worried about the baby, but will discharge promptly if not worried, and the doc has information not in the medical record. Should you adjust for the day of the week? Much of the variation in discharge time varied between hospitals in the same chain of hospitals, although the patient populations were similar. Perhaps each hospital's NICU has its own culture. Should you adjust for the hospital?

The answer I suggest in section 18.2 of Design of Observational Studies is literally yes-and-no. We did analyses both ways, showing that the substantive conclusions were similar, so whether or not you think day-of-the-week and hospital are innocuous, you still conclude that extra days in the NICU are without benefit (see also Rosenbaum and Silber 2009, JASA, 104:501-511). Section 18.2 of DOS discusses two techniques, (i) an analytical adjustment for matched pairs that did not match for an observed covariate and (ii) tapered matching which does and does not match for the covariate. Detailed references and discussion are in DOS.

## Some things are just really hard to believe: more on choosing your facts.

Republicans are much more likely than Democrats to think that Barack Obama is a Muslim and was born in Kenya. But why? People choose to be Republicans or Democrats because they prefer the policy or ideology of one party or another, and it's not obvious that there should be any connection whatsoever between those factors and their judgment of a factual matter such as Obama's religion or country of birth.

In fact, people on opposite sides of many issues, such as gay marriage, immigration policy, global warming, and continued U.S. presence in Iraq, tend to disagree, often by a huge amount, on factual matters such as whether the children of gay couples have more psychological problems than the children of straight couples, what are the economic impacts of illegal immigration, what is the effect of doubling carbon dioxide in the atmosphere, and so on.

Of course, it makes sense that people with different judgment of the facts would have different views on policies: if you think carbon dioxide doesn't cause substantial global warming, you'll be on the opposite side of the global warming debate from someone who thinks it does. But often the causality runs the other way: instead of choosing a policy that matches the facts, people choose to believe the facts that back up their values-driven policies. The issue about Obama's birth country is an extreme example: it's clear that people did not first decide whether Obama was born in the U.S., and then decide whether to vote Republican or Democratic. They are choosing their fact based on their values, not the other way around. Perhaps it is helpful to think of people as having an inappropriate prior distribution that makes them more likely to believe things that are aligned with their desires.

## Matching at two levels

Steve Porter writes with a question about matching for inferences in a hierarchical data structure. I've never thought about this particular issue, but it seems potentially important.

Maybe one or more of you have some useful suggestions?

Porter writes:

## Futures contracts, Granger causality, and my preference for estimation to testing

There's a letter in the latest issue of The Economist (July 31st) signed by Sir Richard Branson (Virgin), Michael Masters (Masters Capital Management) and David Frenk (Better Markets) about an OECD report on speculation and the prices of commodities, which includes the following: "The report uses a Granger causality test to measure the relationship between the level of commodities futures contracts held by swap dealers, and the prices of those commodities. Granger tests, however, are of dubious applicability to extremely volatile variables like commodities prices."

The report says:

Granger causality is a standard statistical technique for determining whether one time series is useful in forecasting another. It is important to bear in mind that the term causality is used in a statistical sense, and not in a philosophical one of structural causation. More precisely a variable A is said to Granger cause B if knowing the time paths of B and A together improve the forecast of B based on its own time path, thus providing a measure of incremental predictability. In our case the time series of interest are market measures of returns, implied volatility, and realized volatility, or variable B. . . . Simply put, Granger's test asks the question: Can past values of trader positions be used to predict either market returns or volatility?

This seems clear enough, but the authors muddy the water later on by writing:

There is a positive contemporaneous association between changes in net positions held by index traders and price changes (returns) in the CBOT wheat market . . . this contemporaneous analysis cannot distinguish between the increase in index traders' positions and other correlated shifts in fundamentals: correlation does not imply causation. [Italics added by me.]

This seems to miss the point. Granger causality, as defined above, is a measure of correlation, or of partial correlation. It's just a correlation between things that are not happening at the same time. The distinction here is in what's being correlated. The phrase "correlation does not imply causation" does not belong here at all! (Unless I'm missing something, which is always possible.)
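To make the "partial correlation" reading concrete, here is a minimal sketch on simulated series. It is not the report's procedure: it uses a single lag, invented coefficients (0.5 and 0.4), and a hypothetical helper `lagged_partial_corr` that correlates B at time t with A at time t-1 after removing B's own previous value from both.

```python
import random
import statistics

def residuals(y, x):
    """Residuals from a least-squares line of y on x (with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return [(yi - my) - slope * (xi - mx) for xi, yi in zip(x, y)]

def lagged_partial_corr(cause, effect):
    """Correlation of effect[t] with cause[t-1], given effect[t-1]."""
    now, prev_effect, prev_cause = effect[1:], effect[:-1], cause[:-1]
    r_now = residuals(now, prev_effect)
    r_cause = residuals(prev_cause, prev_effect)
    num = sum(u * v for u, v in zip(r_now, r_cause))
    den = (sum(u * u for u in r_now) * sum(v * v for v in r_cause)) ** 0.5
    return num / den

# Fake data: A is pure noise; B depends on its own past and on yesterday's A.
rng = random.Random(4)
T = 2000
a = [rng.gauss(0, 1) for _ in range(T)]
b = [0.0]
for t in range(1, T):
    b.append(0.5 * b[t - 1] + 0.4 * a[t - 1] + rng.gauss(0, 1))

a_to_b = lagged_partial_corr(a, b)  # does past A help predict B?
b_to_a = lagged_partial_corr(b, a)  # does past B help predict A?
print(f"A -> B: {a_to_b:.2f}   B -> A: {b_to_a:.2f}")
```

For these made-up coefficients the A-to-B value should be near 0.4/sqrt(1.16), about 0.37, and the B-to-A value near zero. Both numbers are correlations between quantities measured at different times, nothing more, and reporting them as estimates says how much predictive information is there rather than just whether a test rejects.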

I have nothing to say on the particulars, as I have no particular expertise in this area. But in general, I'd prefer if researchers in this sort of problem were to try to estimate the effects of interest (for example, the amount of additional information present in some forecast) rather than setting up a series of hypothesis tests. The trouble with tests is that when they reject, it often tells us nothing more than that the sample size is large. And when they fail to reject, it often tells us nothing more than that the sample size is small. In neither case is the test anything like a direct response to the substantive question of interest.

## "To find out what happens when you change something, it is necessary to change it."

From the classic Box, Hunter, and Hunter book. The point of the saying is pretty clear, I think: There are things you learn from perturbing a system that you'll never find out from any amount of passive observation. This is not always true--sometimes "nature" does the experiment for you--but I think it represents an important insight.

I'm currently writing (yet another) review article on causal inference and am planning to use this quote.

P.S. I find it helpful to write these reviews for a similar reason that I like to blog on certain topics over and over, each time going a bit further (I hope) than the time before. Beyond the benefit of communicating my recommendations to new audiences, writing these sorts of reviews gives me an excuse to explore my thoughts in more rigor.

P.P.S. In the original version of this blog entry, I correctly attributed the quote to Box but I incorrectly remembered it as "No understanding without manipulation." Karl Broman (see comment below) gave me the correct reference.

## Reintegrating rebels into civilian life: Quasi-experimental evidence from Burundi

Michael Gilligan, Eric Mvukiyehe, and Cyrus Samii write:

We [Gilligan, Mvukiyehe, and Samii] use original survey data, collected in Burundi in the summer of 2007, to show that a World Bank ex-combatant reintegration program implemented after Burundi's civil war caused significant economic reintegration for its beneficiaries but that this economic reintegration did not translate into greater political and social reintegration.

Previous studies of reintegration programs have found them to be ineffective, but these studies have suffered from selection bias: only ex-combatants who self-selected into those programs were studied. We avoid such bias with a quasi-experimental research design made possible by an exogenous bureaucratic failure in the implementation of the program. One of the World Bank's implementing partners delayed implementation by almost a year due to an unforeseen contract dispute. As a result, roughly a third of ex-combatants had their program benefits withheld for reasons unrelated to their reintegration prospects. We conducted our survey during this period, constructing a control group from those unfortunate ex-combatants whose benefits were withheld.

We find that the program provided a significant income boost, resulting in a 20 to 35 percentage point reduction in poverty incidence among ex-combatants. We also find moderate improvement in ex-combatants' livelihood prospects.

However, these economic effects do not seem to have caused greater political integration. While we find a modest increase in the propensity to report that civilian life is preferable to combatant life, we find no evidence that the program contributed to more satisfaction with the peace process or a more positive disposition toward current government institutions.

Reintegration programs are central in current peace processes and considerable resources are devoted to them. Thus, our evidence has important policy implications. While we find strong evidence for the effectiveness in terms of economic reintegration, our results challenge theories stating that short-run economic conditions are a major determinant of one's disposition toward society and the state.

Social and political integration of ex-combatants likely requires much more than individually-targeted economic assistance.

This seems important for policy and I hope will get some attention. From a statistical perspective, they use a cool identification strategy: As noted in the abstract above, they take advantage of a bureaucratic failure. The paper uses matching to handle "incidental" imbalances, uses inverse propensity adjustment for "exposure heterogeneity", and graphs estimates in terms of population level effects (rather than in terms of individual level effects, which the current causal inference literature never takes to be identified).

## Hey! Here's a referee report for you!

I just wrote this, and I realized it might be useful more generally:

The article looks reasonable to me--but I just did a shallow read and didn't try to judge whether the conclusions are correct. My main comment is that if they're doing a Poisson regression, they should really be doing an overdispersed Poisson regression. I don't know if I've ever seen data in my life where the non-overdispersed Poisson is appropriate. Also, I'd like to see a before-after plot with dots for control cases and open circles for treatment cases and fitted regression lines drawn in. Whenever there's a regression I like to see this scatterplot. The scatterplot isn't a replacement for the regression, but at the very least it gives me intuition as to the scale of the estimated effect. Finally, all their numbers should be rounded appropriately.

Feel free to cut-and-paste this into your own referee reports (and to apply these recommendations in your own applied research).
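For the overdispersion point, here is a minimal sketch of the simplest possible check, under loud assumptions: the counts are simulated, there are no predictors (an intercept-only model), and the function names are hypothetical. Under a Poisson model the variance equals the mean, so a variance-to-mean ratio well above 1 is a warning sign. With predictors, the analogous check would compare the Pearson chi-square statistic to the residual degrees of freedom.

```python
import math
import random
import statistics


def poisson_draw(rng, lam):
    """Draw one Poisson count by Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def dispersion_ratio(counts):
    """Variance-to-mean ratio; about 1 if an intercept-only Poisson model fits."""
    return statistics.variance(counts) / statistics.mean(counts)


rng = random.Random(0)
n, mean_rate = 5000, 4.0

# Pure Poisson counts: every unit has the same underlying rate.
pure = [poisson_draw(rng, mean_rate) for _ in range(n)]

# Overdispersed counts: each unit's rate is drawn from a gamma distribution
# with mean 4 (shape 2, scale 2), so units differ in ways the model ignores.
mixed = [poisson_draw(rng, rng.gammavariate(2.0, 2.0)) for _ in range(n)]

pure_ratio = dispersion_ratio(pure)
mixed_ratio = dispersion_ratio(mixed)
print(round(pure_ratio, 2), round(mixed_ratio, 2))
```

In real data, unmodeled variation among units plays the role of the gamma-distributed rates here, which is why the plain Poisson is so rarely appropriate.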

## "Too much data"?

Chris Hane writes:

I am a scientist needing to model a treatment effect on a population of ~500 people. The dependent variable in the model is the difference in a person's pre-treatment 12 month total medical cost versus post-treatment cost. So there is large variation in costs, but not so much by using the difference between the pre and post treatment costs. The issue I'd like some advice on is that the treatment has already occurred so there is no possibility of creating a fully randomized control now. I do have a very large population of people to use as possible controls via propensity scoring or exact matching.

If I had a few thousand people to possibly match, then I would use standard techniques. However, I have a potential population of over a hundred thousand people. An exact match of the possible controls to age, gender and region of the country still leaves a population of 10,000 controls. Even if I use propensity scores to weight the 10,000 observations (understanding the problems that poses) I am concerned there are too many controls to see the effect of the treatment.

Would you suggest using narrower matching criteria to get the "best" matches, would weighting the observations be enough, or should I also consider creating many models by sampling from both treatment and control and averaging their results? If you could point me to some papers that tackle similar issues that would be great.

My reply: Others know more about this than me, but my quick reaction is . . . what's wrong with having 10,000 controls? I don't see why this would be a problem at all. In a regression analysis, having more controls shouldn't create any problems. But, sure, match on lots of variables. Don't just control for age, sex, and region; control for as many relevant pre-treatment variables as you can get.
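A quick way to see why extra controls don't hurt is to look at the standard error of a simple difference in means. This is only a sketch under stated assumptions: an unadjusted comparison, a hypothetical common standard deviation of 1 for the cost differences, and 500 treated people as in the question. A real analysis would also adjust for pre-treatment variables.

```python
import math


def se_difference(sd_treated, n_treated, sd_control, n_control):
    """Standard error of (treated mean - control mean) for independent groups."""
    return math.sqrt(sd_treated ** 2 / n_treated + sd_control ** 2 / n_control)


n_treated = 500
sd = 1.0  # hypothetical standard deviation, same in both groups

# The floor is set by the treated group alone: sd / sqrt(n_treated).
floor = sd / math.sqrt(n_treated)

results = {}
for n_control in (500, 2000, 10000, 100000):
    results[n_control] = se_difference(sd, n_treated, sd, n_control)
    print(n_control, round(results[n_control], 4))
print("floor", round(floor, 4))
```

Under these assumptions the standard error falls from about 0.063 with 500 controls to about 0.046 with 10,000, approaching the floor of about 0.045 set by the 500 treated people. More controls sharpen the comparison; they never blur it.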

## The bane of many causes

One of the newsflies buzzing around today is an article "Brain tumour risk in relation to mobile telephone use: results of the INTERPHONE international case-control study".

The results, shown in this pretty table below, appear to be inconclusive.

A limited amount of cellphone radiation is good for your brain, but not too much? It's unfortunate that the extremes are truncated. The commentary at Microwave News blames bias:

The problem with selection bias --also called participation bias-- became apparent after the brain tumor risks observed throughout the study were so low as to defy reason. If they reflect reality, they would indicate that cell phones confer immediate protection against tumors. All sides agree that this is extremely unlikely. Further analysis pointed to unanticipated differences between the cases (those with brain tumors) and the controls (the reference group).

The second problem concerns how accurately study participants could recall the amount of time and on which side of the head they used their phones. This is called recall bias.

Mobile phones are not the only cause for development and detection of brain tumors. There are lots of factors: age, profession, genetics - all of them affecting the development of tumors. It's too hard to match everyone, but it's a lot easier to study multiple effects at the same time.

We'd see, for example, that healthy younger people at lower risk of brain cancer tend to use mobile phones more, and that older people sick with cancer that might spread to the brain don't need mobile phones. The same could hold for alcohol consumption (social drinkers tend to be healthy and social, but drinking is an effect, not a cause) and other potential risk factors.

Here's a plot of the relative risk based on cumulative phone usage:

It seems that the top 10% of users have much higher risk. If the data weren't discretized into just 10 categories, there could be interesting information here, beyond the obvious one that you need to be old and wealthy enough to accumulate 1600 hours of mobile phone usage.

[Changed the title from "many effects" to "many causes" - thanks to a comment by Cyrus]

## Causal inference in economics

Aaron Edlin points me to this issue of the Journal of Economic Perspectives that focuses on statistical methods for causal inference in economics. (Michael Bishop's page provides some links.)

To quickly summarize my reactions to Angrist and Pischke's book: I pretty much agree with them that the potential-outcomes or natural-experiment approach is the most useful way to think about causality in economics and related fields. My main amendments to Angrist and Pischke would be to recognize that:

1. Modeling is important, especially modeling of interactions. It's unfortunate to see a debate between experimentalists and modelers. Some experimenters (not Angrist and Pischke) make the mistake of avoiding models: Once they have their experimental data, they check their brains at the door and do nothing but simple differences, not realizing how much more can be learned. Conversely, some modelers are unduly dismissive of experiments and formal observational studies, forgetting that (as discussed in Chapter 7 of Bayesian Data Analysis) a good design can make model-based inference more robust.

2. In the case of a "natural experiment" or "instrumental variable," inference flows forward from the instrument, not backwards from the causal question. Estimates based on instrumental variables, regression discontinuity, and the like are often presented with the researcher having a causal question and then finding an instrument or natural experiment to get identification. I think it's more helpful, though, to go forward from the intervention and look at all its effects. Your final IV estimate or whatever won't necessarily change, but I think my approach is a healthier way to get a grip on what you can actually learn from your study.

Now on to the articles:

## Two great tastes that taste great together

I've been using your book on regression and multilevel modeling and have a quick R question for you. Do you happen to know if there is any R package that can estimate a two-stage (instrumental variable) multi-level model?

My reply: I don't know. I'll post on blog and maybe there will be a response. You could also try the R help list.

## Should Mister P be allowed/encouraged to reside in counter-factual populations?

Let's say you are repeatedly going to receive unselected sets of well-done RCTs on various, say, medical treatments.

One reasonable assumption with all of these treatments is that they are monotonic - either helpful or harmful for all. The treatment effect will (as always) vary for subgroups in the population - these will not be explicitly identified in the studies - but each study very likely will enroll different percentages of the various patient subgroups. Because these are all randomized studies, the subgroups will be balanced in the treatment versus control arms - but each study will (as always) be estimating a different - but exchangeable - treatment effect (exchangeable due to the ignorance about the subgroup memberships of the enrolled patients).

That reasonable assumption - monotonicity - will be to some extent (as always) wrong, but given that it is a risk believed well worth taking - if the average effect in any population is positive (versus negative) the average effect in any other population will be positive (versus negative).

If we define a counter-factual population based on a mixture of the studies' unknown mixtures of subgroups - by inverse variance weighting of the studies' effect estimates, using their standard errors - we would get an estimate of the average effect for that counter-factual population that is minimum variance (and the assumptions rule out much, if any, bias in this).

Should we encourage (or discourage) such Mr P based estimates - just because they are for counter-factual rather than real populations?

K?
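The inverse-variance weighting described above can be sketched in a few lines. The three effect estimates and standard errors are made-up numbers, and the function name is hypothetical. The normalized weights are what implicitly define the "counter-factual population" as a mixture of the studies.

```python
def inverse_variance_pool(estimates, standard_errors):
    """Pool study estimates with weights proportional to 1 / se^2.

    Returns the pooled estimate, its standard error, and the normalized
    weights, which describe the implied mixture of studies.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total
    pooled_se = (1.0 / total) ** 0.5
    mixture = [w / total for w in weights]
    return pooled, pooled_se, mixture


# Hypothetical effect estimates and standard errors from three trials.
estimates = [0.30, 0.10, 0.25]
standard_errors = [0.10, 0.20, 0.05]

pooled, pooled_se, mixture = inverse_variance_pool(estimates, standard_errors)
print(round(pooled, 3), round(pooled_se, 3), [round(m, 3) for m in mixture])
```

Note that the mixture is determined by the studies' precisions, not by anyone's choice of a target population, which is the crux of the question being asked.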

## Modeling heterogeneous treatment effects

Don Green and Holger Kern write on one of my favorite topics, treatment interactions (see also here):

We [Green and Kern] present a methodology that largely automates the search for systematic treatment effect heterogeneity in large-scale experiments. We introduce a nonparametric estimator developed in statistical learning, Bayesian Additive Regression Trees (BART), to model treatment effects that vary as a function of covariates. BART has several advantages over commonly employed parametric modeling strategies, in particular its ability to automatically detect and model relevant treatment-covariate interactions in a flexible manner.

To increase the reliability and credibility of the resulting conditional treatment effect estimates, we suggest the use of a split sample design. The data are randomly divided into two equally-sized parts, with the first part used to explore treatment effect heterogeneity and the second part used to confirm the results. This approach permits a relatively unstructured data-driven exploration of treatment effect heterogeneity while avoiding charges of data dredging and mitigating multiple comparison problems. We illustrate the value of our approach by offering two empirical examples, a survey experiment on Americans' support for social welfare spending and a voter mobilization field experiment. In both applications, BART provides robust insights into the nature of systematic treatment effect heterogeneity.
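The split-sample idea can be sketched without BART. The example below is a stand-in under stated assumptions: simulated data, simple subgroup differences in means instead of a tree model, and hypothetical function names. It picks the most promising subgroup in the exploration half and then re-estimates that subgroup's effect in the confirmation half, which was never used in the search.

```python
import random
import statistics


def split_sample(unit_ids, seed):
    """Randomly divide unit ids into two equally-sized halves."""
    ids = list(unit_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


def subgroup_effects(rows):
    """Treated-minus-control difference in mean outcome within each subgroup."""
    effects = {}
    for g in sorted({r["group"] for r in rows}):
        treated = [r["y"] for r in rows if r["group"] == g and r["t"] == 1]
        control = [r["y"] for r in rows if r["group"] == g and r["t"] == 0]
        effects[g] = statistics.mean(treated) - statistics.mean(control)
    return effects


# Hypothetical experiment: 20 subgroups, randomized treatment, and a true
# effect of exactly 0.2 in every subgroup (no real heterogeneity).
rng = random.Random(42)
data = []
for i in range(4000):
    t = rng.randint(0, 1)
    data.append({"id": i, "group": i % 20, "t": t, "y": 0.2 * t + rng.gauss(0, 1)})

explore_ids, confirm_ids = split_sample([r["id"] for r in data], seed=7)
explore_set = set(explore_ids)
explore = [r for r in data if r["id"] in explore_set]
confirm = [r for r in data if r["id"] not in explore_set]

# Search freely in the first half, then check the winner in the second half.
explore_effects = subgroup_effects(explore)
best_group = max(explore_effects, key=explore_effects.get)
confirm_effects = subgroup_effects(confirm)
print(best_group, explore_effects[best_group], confirm_effects[best_group])
```

Because the winning subgroup was selected for having the largest estimate, its exploration-half estimate tends to overstate the effect; the confirmation-half estimate is not subject to that selection.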

I don't have the time to give comments right now, but it looks both important and useful. And it's great to see quantitatively-minded political scientists thinking seriously about statistical inference.

Pretty pictures, too (except for ugly Table 1, but, hey, nobody's perfect).

## Are High-Quality Schools Enough to Close the Achievement Gap? Evidence from a Bold Social Experiment in Harlem

This note on charter schools by Alex Tabarrok reminded me of my remarks on the relevant research paper by Dobbie and Fryer, remarks which I somehow never got around to posting here. So here are my (inconclusive) thoughts from a few months ago:

## Criticizing statistical methods for mediation analysis

Brendan Nyhan passes along an article by Don Green, Shang Ha, and John Bullock, entitled "Enough Already about 'Black Box' Experiments: Studying Mediation Is More Difficult than Most Scholars Suppose," which begins:

The question of how causal effects are transmitted is fascinating and inevitably arises whenever experiments are presented. Social scientists cannot be faulted for taking a lively interest in "mediation," the process by which causal influences are transmitted. However, social scientists frequently underestimate the difficulty of establishing causal pathways in a rigorous empirical manner. We argue that the statistical methods currently used to study mediation are flawed and that even sophisticated experimental designs cannot speak to questions of mediation without the aid of strong assumptions. The study of mediation is more demanding than most social scientists suppose and requires not one experimental study but rather an extensive program of experimental research.

That last sentence echoes a point that I like to make, which is that you generally need to do a new analysis for each causal question you're studying. I'm highly skeptical of the standard poli sci or econ approach which is to have the single master regression from which you can read off many different coefficients, each with its own causal interpretation.

## No comment

How come, when I posted a few entries last year on Pearl's and Rubin's frameworks for causal inference, I got about 100 comments, but when yesterday I posted my 12-page magnum opus on the topic, only three people commented?

My theory is that the Pearl/Rubin framing of the earlier discussion personalized the topic, and people get much more interested in a subject if it can be seen in terms of personalities.

Another hypothesis is that my recent review was so comprehensive and correct that people had nothing to say about it.

P.S. The present entry is an example of reverse causal inference, in the sense described in my review.

## Causality and Statistical Learning

[The following is a review essay invited by the American Journal of Sociology. Details and acknowledgments appear at the end.]

In social science we are sometimes in the position of studying descriptive questions (for example: In what places do working-class whites vote for Republicans? In what eras has social mobility been higher in the United States than in Europe? In what social settings are different sorts of people more likely to act strategically?). Answering descriptive questions is not easy and involves issues of data collection, data analysis, and measurement (how should one define concepts such as "working class whites," "social mobility," and "strategic"), but is uncontroversial from a statistical standpoint.

All becomes more difficult when we shift our focus from What to What-if and Why.

Consider two broad classes of inferential questions:

1. Forward causal inference. What might happen if we do X? What are the effects of smoking on health, the effects of schooling on knowledge, the effect of campaigns on election outcomes, and so forth?

2. Reverse causal inference. What causes Y? Why do more attractive people earn more money, why do many poor people vote for Republicans and rich people vote for Democrats, why did the economy collapse?

In forward reasoning, the potential treatments under study are chosen ahead of time, whereas, in reverse reasoning, the research goal is to find and assess the importance of the causes. The distinction between forward and reverse reasoning (also called "the effects of causes" and the "causes of effects") was made by Mill (1843). Forward causation is a pretty clearly-defined problem, and there is a consensus that it can be modeled using the counterfactual or potential-outcome notation associated with Neyman (1923) and Rubin (1974) and expressed using graphical models by Pearl (2009): the causal effect of a treatment T on an outcome Y for an individual person (say), is a comparison between the value of Y that would've been observed had the person followed the treatment, versus the value that would've been observed under the control; in many contexts, the treatment effect for person i is defined as the difference, Yi(T=1) - Yi(T=0). Many common techniques, such as differences in differences, linear regression, and instrumental variables, can be viewed as estimating average causal effects under this definition.
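A small simulation may help fix the notation. It is a sketch with invented numbers: both potential outcomes are generated for every unit (something never available in real data), treatment is assigned by coin flip, and only one outcome per unit is then "observed." The difference in observed means recovers the average of the individual differences Yi(T=1) - Yi(T=0), even though no individual difference is ever seen.

```python
import random
import statistics

rng = random.Random(3)
n = 10000

# Hypothetical potential outcomes: y0 under control, y1 under treatment.
# Individual effects vary around an average of 1.0.
y0 = [rng.gauss(0, 1) for _ in range(n)]
y1 = [base + 1.0 + 0.5 * rng.gauss(0, 1) for base in y0]
true_ate = statistics.mean([b - a for a, b in zip(y0, y1)])

# Randomized assignment reveals only one potential outcome per unit.
t = [rng.randint(0, 1) for _ in range(n)]
observed = [b if ti == 1 else a for a, b, ti in zip(y0, y1, t)]

treated_mean = statistics.mean([y for y, ti in zip(observed, t) if ti == 1])
control_mean = statistics.mean([y for y, ti in zip(observed, t) if ti == 0])
estimated_ate = treated_mean - control_mean

print(round(true_ate, 3), round(estimated_ate, 3))
```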

In the social sciences, where it is generally not possible to try more than one treatment on the same unit (and, even when this is possible, there is the possibility of contamination from past exposure and changes in the unit or the treatment over time), questions of forward causation are most directly studied using randomization or so-called natural experiments (see Angrist and Pischke, 2008, for discussion and many examples). In some settings, crossover designs can be used to estimate individual causal effects, if one accepts certain assumptions about treatment effects being bounded in time. Heckman (2006), pointing to the difficulty of generalizing from experimental to real-world settings, argues that randomization is not any sort of "gold standard" of causal inference, but this is a minority position: I believe that most social scientists and policy analysts would be thrilled to have randomized experiments for their forward-causal questions, even while recognizing that subject-matter models are needed to make useful inferences from any experimental or observational study.

Reverse causal inference is another story. As has long been realized, the effects of action X flow naturally forward in time, while the causes of outcome Y cannot be so clearly traced backward. Did the North Vietnamese win the American War because of the Tet Offensive, or because of American public opinion, or because of the skills of General Giap, or because of the political skills of Ho Chi Minh, or because of the conflicted motivations of Henry Kissinger, or because of Vietnam's rough terrain, or . . .? To ask such a question is to reveal the impossibility of answering it. On the other hand, questions such as "Why do whites do better than blacks in school?", while difficult, do not seem inherently unanswerable or meaningless.

We can have an idea of going backward in the causal chain, accounting for more and more factors until the difference under study disappears--that is, is "explained" by the causal predictors. Such an activity can be tricky--hence the motivation for statistical procedures for studying causal paths--and ultimately is often formulated in terms of forward causal questions: causal effects that add up to explaining the Why question that was ultimately asked. Reverse causal questions are often more interesting and motivate much, perhaps most, social science research; forward causal research is more limited and less generalizable but is more doable. So we all end up going back and forth on this.

We see three difficult problems in causal inference:

## Helping people fill out financial aid forms (at H&R Block!) increases the rate of college attendance

Eric Bettinger, Bridget Terry Long, Philip Oreopoulos, and Lisa Sanbonmatsu write:

Growing concerns about low awareness and take-up rates for government support programs like college financial aid have spurred calls to simplify the application process and enhance visibility.

Here's the study:

H&R Block tax professionals helped low- to moderate-income families complete the FAFSA, the federal application for financial aid. Families were then given an estimate of their eligibility for government aid as well as information about local postsecondary options. A second randomly-chosen group of individuals received only personalized aid eligibility information but did not receive help completing the FAFSA.

And the results:

Comparing the outcomes of participants in the treatment groups to a control group . . . individuals who received assistance with the FAFSA and information about aid were substantially more likely to submit the aid application, enroll in college the following fall, and receive more financial aid. . . . However, only providing aid eligibility information without also giving assistance with the form had no significant effect on FAFSA submission rates.

The treatment raised the proportion of applicants in this group who attended college from 27% (or, as they quaintly put it, "26.8%") to 35%. Pretty impressive. Overall, it appears to be a clean study. And they estimate interactions (that is, varying treatment effects), which is always, always, always a good idea.

Here are my recommendations for improving the article (and thus, I hope, increasing the influence of this study):

## Update on the coffee experiment

It's working, so far.

## A propensity for bias?

Teryn Mattox writes:

## Solutions to the final exam in my first-semester applied statistics class

I just graded the final exams for my first-semester graduate statistics course that I taught in the economics department at Sciences Po.

I posted the exam itself here last week; you might want to take a look at it and try some of it yourself before coming back here for the solutions.

And see here for my thoughts about this particular exam, this course, and final exams in general.

Now on to the exam solutions, which I will intersperse with the exam questions themselves:

## Question on propensity score matching

Ban Chuan Cheah writes:

I'm trying to learn propensity score matching and used your text as a guide (pg 208-209). After creating the propensity scores, the data is matched and after achieving covariate balance the treatment effect is estimated by running a regression on the treatment variable and some other covariates. The standard error of the treatment effect is also reported - in the book it is 10.2 (1.6).

## Europe vs. America: the grudge match

Tyler Cowen adds to the always-popular "Europe vs. America" debate. At stake is whether western European countries are going broke and should scale back their social spending (for example, here in Paris they have free public preschool starting at 3 years old, and it's better than the private version we were paying for in New York), or whether, conversely, the U.S. should ramp up spending on public goods, as Galbraith suggested back in the 1950s when he wrote The Affluent Society.

Much of the debate turns on statistics, oddly enough. I'm afraid I don't know enough about the topic to offer any statistical contributions to the discussion, but I wanted to bring up one thing which I remember people used to talk about but which doesn't seem to have come up in the current discussion (unless I've missed something, which is quite possible).

Here's my question. Shouldn't we be impressed by the performance of the U.S. economy, given that we've spent several zillion dollars more on the military than all the European countries combined, but our economy has continued to grow at roughly the same rate as Europe's? (Cowen does briefly mention "military spending" but only in a parenthetical, and I'm not sure what he was referring to.) From the other direction, I guess you could argue that in the U.S., military spending is a form of social spending--it's just that, instead of providing health care etc. for everyone, it's provided just for military families, and instead of the government supporting some modern-day equivalent of a buggy-whip factory, it's supporting some company that builds airplanes or submarines. Anyway, this just seemed somewhat relevant to the discussion.

P.S. OK, there's one place where I can offer a (very small) bit of statistical expertise.

## Using Bayesian meta-analysis to adjust for bias in experiments and observational studies

Commenter RogerH pointed me to this article by Welton, Ades, Carlin, Altman, and Sterne on models for potentially biased evidence in meta-analysis using empirically based priors. The "Carlin" in the author list is my longtime collaborator John, so I really shouldn't have had to hear about this through a blog comment. Anyway, they write:

We present models for the combined analysis of evidence from randomized controlled trials categorized as being at either low or high risk of bias due to a flaw in their conduct. We formulate a bias model that incorporates between-study and between-meta-analysis heterogeneity in bias, and uncertainty in overall mean bias. We obtain algebraic expressions for the posterior distribution of the bias-adjusted treatment effect, which provide limiting values for the information that can be obtained from studies at high risk of bias. The parameters of the bias model can be estimated from collections of previously published meta-analyses. We explore alternative models for such data, and alternative methods for introducing prior information on the bias parameters into a new meta-analysis. Results from an illustrative example show that the bias-adjusted treatment effect estimates are sensitive to the way in which the meta-epidemiological data are modelled, but that using point estimates for bias parameters provides an adequate approximation to using a full joint prior distribution. A sensitivity analysis shows that the gain in precision from including studies at high risk of bias is likely to be low, however numerous or large their size, and that little is gained by incorporating such studies, unless the information from studies at low risk of bias is limited. We discuss approaches that might increase the value of including studies at high risk of bias, and the acceptability of the methods in the evaluation of health care interventions.

I really really like this idea. As Welton et al. discuss, their method represents two key conceptual advances:

1. In addition to downweighting questionable or possibly-biased studies, they also shift them to adjust in the direction of correcting for the bias.

2. Instead of merely deciding which studies to trust based on prior knowledge, literature review, and external considerations, they also use the data, through a meta-analysis, to estimate the amount of adjustment to do.

And, as a bonus, the article has excellent graphs. (It also has three ugly tables, with gratuitous precision such as "-0.781 (-1.002, -0.562)," but the graph-to-table ratio is much better than usual in this sort of statistical research paper, so I can't really complain.)
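The two advances listed above (shift and downweight) can be illustrated with a deliberately stripped-down normal approximation. This is not Welton et al.'s model: it assumes a single high-risk study, a bias that is normal with known mean and standard deviation, and made-up numbers throughout; the function names are hypothetical.

```python
def bias_adjust(estimate, se, mean_bias, sd_bias):
    """Shift an estimate by the expected bias and inflate its uncertainty."""
    adjusted = estimate - mean_bias
    adjusted_se = (se ** 2 + sd_bias ** 2) ** 0.5
    return adjusted, adjusted_se


def pool(estimates, standard_errors):
    """Inverse-variance weighted average and its standard error."""
    weights = [1.0 / s ** 2 for s in standard_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    return pooled, (1.0 / total) ** 0.5


# Hypothetical inputs on a log odds ratio scale.
low_risk_est, low_risk_se = -0.40, 0.20    # trustworthy but imprecise
high_risk_est, high_risk_se = -0.80, 0.05  # precise but possibly biased
mean_bias, sd_bias = -0.30, 0.25           # assumed to come from earlier meta-analyses

adj_est, adj_se = bias_adjust(high_risk_est, high_risk_se, mean_bias, sd_bias)
pooled_est, pooled_se = pool([low_risk_est, adj_est], [low_risk_se, adj_se])

print(round(adj_est, 3), round(adj_se, 3))
print(round(pooled_est, 3), round(pooled_se, 3))
```

However small the high-risk study's own standard error, its adjusted standard error can never drop below `sd_bias`. In this simplified setting, that is the sense in which the information obtainable from studies at high risk of bias has a limiting value.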

This work has some similarities to the corrections for nonsampling errors that we do in survey research. As such, I have one idea here. Would it be possible to take the partially-pooled estimates from any given analysis and re-express them as equivalent weights in a weighted average? (This is an idea I've discussed with John and is also featured in my "Survey weighting is a mess" paper.) I'm not saying there's anything so wonderful about weighted estimates, but it could help in understanding these methods to have a bridge to the past, as it were, and see how they compare in this way to other approaches.

### Payment demanded for the meal

There's no free lunch, of course. What assumptions did Welton et al. put in to make this work? They write:

We base the parameters of our bias model on empirical evidence from collections of previously published meta-analyses, because single meta-analyses typically provide only limited information on the extent of bias . . . This, of course, entails the strong assumption that the mean bias in a new meta-analysis is exchangeable with the mean biases in the meta-analyses included in previous empirical (meta-epidemiological) studies. For example, the meta-analyses that were included in the study of Schulz et al. (1995) are mostly from maternity and child care studies, and we must doubt whether the mean bias in studies on drugs for schizophrenia (the Clozapine example meta-analysis) is exchangeable with the mean biases in this collection of meta-analyses.

Assumptions are good. I expect their assumptions are better than the default alternatives, and it's good to have the model laid out there for possible criticism and improvement.

P.S. The article focuses on medical examples but I think the methods would also be appropriate for experiments and observational studies in social science. A new way of thinking about the identification issues that we're talking about all the time.

## Involuntary exits and the incumbency advantage

Ben Highton writes:

One of my colleagues thinks he remembers an essay you wrote in response to the Cox/Katz argument about using "involuntary exits" from the House (due to death, etc.) as a means to get leverage on the incumbency advantage as distinct from strategic retirement in their gerrymandering book. Would you mind sending me a copy?

My reply: It's in our rejoinder to my article with Zaiying Huang, Estimating incumbency advantage and its variation, as an example of a before/after study (with discussion), JASA (2008). See page 450. Steve Ansolabehere assisted me in discussing this point.

P.S. There was a question about how this relates to David Lee's work on estimating incumbency advantage using discontinuities in the vote. My short answer is that Lee's work is interesting, but he's not measuring the effect of politicians' incumbency status. He's measuring the effect of being in the incumbent party, which in a country without strong candidate effects (India, perhaps, according to Leigh Linden) can make sense but doesn't correspond to what we think of as incumbency effects in the United States. Identification strategies are all well and good, but you have to look carefully at what you're actually identifying!

## Do not control for post-treatment variables?

David Afshartous writes:

Regarding why one should not control for post-treatment variables (p.189, Data Analysis Using Regression and Multilevel/Hierarchical Models), the argument is very clear as shown in Figure 9.13, i.e., we would be comparing units that are not comparable as can be seen by looking at potential outcomes z^0 and z^1 which can never both be observed. How would you respond to someone who says, "Well, what about a cross-over experiment, wouldn't it be okay for that case?" I suppose one could reply that in a cross-over we do not have z^0 and z^1 in a strict sense, since we observe the effect of T=0 and T=1 on z at different times rather than the counterfactual for an identical time point, etc. Would you add anything further?

My reply: it could be ok, it depends on the context. One point that Rubin has made repeatedly over the past few decades is that inference depends on a model. With a clean, completely randomized design, you don't need much model to get inferences. A crossover design is more complicated. If you make some assumptions about how the treatment at time 1 affects the outcome after time 2, then you can go from there.

To put it another way, the full Bayesian analysis always conditions on all information. Whether this looks like "controlling" for an x-variable, in a regression sense, depends on the model that you're using.

## Continuing puzzlement over "Why" questions

Tyler Cowen links to a blog by Paul Kedrosky that asks why winning times in the Boston marathon have been more variable, in recent years, than winning times in New York. This particular question isn't so interesting--when I saw the title of the post, my first thought was "the weather," and, in fact, that and "the wind" are the most common responses of the blog commenters--but it reminded me of a more general question that we discussed the other day, which is how to think about Why questions.

Many years ago, Don Rubin convinced me that it's a lot easier to think about "effects of causes" than "causes of effects." For example, why did my cat die? Because she ran into the street, because a car was going too fast, because the driver wasn't paying attention, because a bird distracted the cat, because the rain stopped so the cat went outside, etc. When you look at it this way, the question of "why" is pretty meaningless.

Similarly, if you ask a question such as, What caused World War 1, the best sort of answers can take the form of potential-outcomes analyses. I don't think it makes sense to expect any sort of true causal answer here.

But, now let's get back to the "volatility of the Boston marathon" problem. Unlike the question of "why did my cat die" or "why did World War 1 start," the question, "Why have the winning times in the Boston marathon been so variable" does seem answerable.

What happens if we try to apply some statistical principles here?

Principle #1: Compared to what? We can't try to answer "why" without knowing what we are comparing to. This principle seems to work in the marathon-times example. The only way to talk about the Boston times as being unexpectedly variable is to know what "expectedly variable" is. Or, conversely, the New York times are unexpectedly stable compared to what was happening in Boston those same years. Either way, the principle holds that we are comparing to some model or another.

Principle #2: Look at effects of causes, rather than causes of effects. This principle seems to break down in the marathon example, where it seems very natural to try to understand why an observed phenomenon is occurring.

What's going on? Perhaps we can understand in the context of another example, something that came up a couple years ago in some of my consulting work. The New York City Department of Health had a survey of rodent infestation, and they found that African Americans and Latinos were more likely than whites to have rodents in their apartments. This difference persisted (albeit at a lesser magnitude) after controlling for some individual and neighborhood-level predictors. Why does this gap remain? What other average differences are there among the dwellings of different ethnic groups?

OK, so now maybe we're getting somewhere. The question on deck now is, how do the "Boston vs. NY marathon" and "too many rodents" problems differ from the "dead cat" problem.

One difference is that we have data on lots of marathons and lots of rodents in apartments, but only one dead cat. But that doesn't quite work as a demarcation criterion (sorry, forgive me for working under the influence of Popper): even if there were only one running of each marathon, we could still quite reasonably answer questions such as, "Why was the winning time so much lower in NY than in Boston?" And, conversely, if we had lots of dead cats, we could start asking questions about attributable risks, but it still wouldn't quite make sense to ask why the cats are dying.

Another difference is that the marathon question and the rodent question are comparisons (NY vs. Boston and blacks/hispanics vs. whites), while the dead cat stands alone (or swings alone, I guess I should say). Maybe this is closer to the demarcation we're looking for, the idea being that a "cause" (in this sense) is something that takes you away from some default model. In these examples, it's a model of zero differences between groups, but more generally it could be any model that gives predictions for data.

In this model-checking sense, the search for a cause is motivated by an itch--a disagreement with a default model--which has to be scratched and scratched until the discomfort goes away, by constructing a model that fits the data. Said model can then be interpreted causally in a Rubin-like, "effects of causes," forward-thinking way.

Is this the resolution I'm seeking? I'm not sure. But I need to figure this out, because I'm planning on basing my new intro stat course (and book) on the idea of statistics as comparisons.

P.S. I remain completely uninterested in questions such as, What is the cause? Is it A or is it B? (For example, what caused the differences in marathon-time variations in Boston and New York--is it the temperature, the precipitation, the wind, or something else?) Of course if it can be any of these factors, it can be all of them. I remain firm in my belief that any statistical method that claims to distinguish between hypotheses in this way is really just using sampling variation as a way to draw artificial distinctions, fundamentally in a way no different from the notorious comparisons of statistical significance to non-significance.

This last point has nothing to do with causal inference and everything to do with my preference for continuous over discrete models in applications in which I've worked in social science, environmental science, and public health.

## Another long post on causal inference is coming!

So go read the stuff on the main page now before it scrolls off your screens.

## On a Class of Bias-Amplifying Covariates that Endanger Effect Estimates

This research note was triggered by a discussant on your blog, who called my attention to Wooldridge's paper, in response to my provocative question: "Has anyone seen a proof that adjusting for an intermediary would introduce bias?"

It led to some interesting observations which I am now glad to share with your bloggers.

As Pearl writes, it is standard advice to adjust for all pre-treatment variables in an experiment or observational study--and I think everyone would agree on the point--but exactly how to "adjust" is not always clear. For example, you wouldn't want to just throw in an instrumental variable as another regression predictor. And this then leads to tricky questions of what to do with variables that are sort-of instruments and sort-of covariates. I don't really know what to do in such situations, and maybe Pearl and Wooldridge are pointing toward a useful way forward.
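To see why throwing an instrument into the regression can hurt, here is a small simulation. It is a sketch under assumptions of my own choosing, not an example from Pearl's note: a linear model, an unobserved confounder `u`, and a variable `z` that affects the outcome only through the treatment `x`. All coefficients are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
true_effect = 1.0

u = rng.normal(size=n)                         # unobserved confounder
z = rng.normal(size=n)                         # instrument-like variable: affects x only
x = 2.0 * z + u + rng.normal(size=n)           # treatment
y = true_effect * x + u + rng.normal(size=n)   # outcome

def coef_on_first(y, *predictors):
    """Least-squares coefficient on the first predictor (intercept included)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

b_unadjusted = coef_on_first(y, x)     # regress y on x
b_adjusted = coef_on_first(y, x, z)    # regress y on x and z

print("bias without z:", b_unadjusted - true_effect)
print("bias with z:   ", b_adjusted - true_effect)
```

In this setup, conditioning on `z` removes variation in `x` that was free of confounding, so the confounded share of the remaining variation goes up and the bias in the coefficient on `x` grows rather than shrinks.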

I'm curious what Rosenbaum and Rubin (both of whom are cited in Pearl's article) have to say about this paper. And, of course, WWJD.

In suggesting "a socially responsible method of announcing associations," AT points out that, as much as we try to be rigorous about causal inference, assumptions slip in through our language:

The trouble is, causal claims have an order to them (like "aliens cause cancer"), and so do most if not all human sentences ("I like ice cream"). It's all too tempting to read a non-directional association claim as if it were so -- my (least) favourite was a radio blowhard who said that in teens, cellphone use was linked with sexual activity, and without skipping a beat angrily proclaimed that giving kids a cell phone was tantamount to exposing them to STDs. . . . So here's a modest proposal: when possible, beat back the causal assumption by presenting an associational idea in the order least likely to be given a causal interpretation by a layperson or radio host.

Here's AT's example:

A random Google News headline reads: "Prolonged Use of Pacifier Linked to Speech Problems" and strongly implies a cause and effect relationship, despite the (weak) disclaimer from the quoted authors. Reverse that and you've got "Speech Problems linked to Prolonged Use of Pacifier" which is less insinuating.

It's an interesting idea, and it reminds me of something that really bugs me.

## Placebos Have Side Effects Too

Aleks points me to this blog by Neuroskeptic, who reports on some recent research studying the placebo effect:

## The effects of U.S. military aid on political violence in Colombia: a back-and-forth regarding the strength of the causal evidence

Following my comments on their article on U.S. military funding and conflict in Colombia, Oeindrila Dube and Suresh Naidu wrote:

Thanks for the comments on our paper. It seemed that you viewed the correlations in the analysis as an interesting descriptive exercise, but not interpretable as causal. We agree with you that the most interesting social science is often causal, and in this case in particular the causal claims are the main results. The paper's punchline is that military aid needs to be reconsidered when there is collusion between the army and non-state armed groups, and we couldn't make this claim if we thought the results were purely descriptive.

In the paper, we do a lot of sample splitting and parametric time controls to rule out the possibility that this is a spurious effect. For example, our results are robust to including a base-specific time trend, along with a base-specific post-2001 dummy.

Possibly the best evidence against a strict "conflict" time-series interpretation is that there is no effect (positive or negative) of US military aid on guerrilla attacks near Colombian military bases. In other words, it's not just an increase in conflict on all sides, but an increase in paramilitary attacks in particular.

The "differential time trend" that could drive our effect would have to be a) steeply nonlinear b) only applicable to paramilitaries in base municipalities, and c) would have to be fairly unique to the base municipalities, given the wide variety of alternate control groups we examine. So we think this is not a likely alternative explanation that can account for the effects.

To which I replied:

First off, I still would prefer associational language followed by causal speculation. But I can respect your different choice of emphasis. Now to get to details: my basic alternative model goes as follows:

- Conflict in Colombia increased during the early 2000's.
- U.S. military aid, in the U.S. and elsewhere, increased during that period also.
- Most of the paramilitary attacks (and, thus, most of the increase in paramilitary attacks) occurred near military bases.

Thus, I'm not so impressed by the "differential time trend" argument. It's unsurprising (but nonetheless worth noting, as you do) that there are fewer guerilla attacks near military bases. But that doesn't mean that the paramilitary attacks wouldn't have increased in the absence of U.S. aid.

None of the above really contradicts your main political story, which is that the Colombian military is involved in paramilitary attacks, and that U.S. aid is an enabler for this sort of violence.

My story above is consistent with your causal story--more U.S. aid, more resources for the military, more paramilitary attacks. It's also consistent with a different causal story, which goes like this: more conflict, more paramilitary attacks, also more U.S. aid which actually serves to stop the situation from getting worse. The argument is, yes, the U.S. is giving weapons to the bad guys, but by doing so, it co-opts them and restrains their behavior.

OK, I'm not saying this latter argument is true, but I think your strongest argument against it is to say something like: "Sure, it's possible that things would be getting even worse in the absence of U.S. military aid. But given that, during the time that aid was higher, violence was also higher--and we're talking here about violence being done by the allies of the recipients of the aid--well, maybe aid isn't such a good idea." That is, you can put the burden of proof on the advocates of aid. Hey, it costs money and it's going to some unsavory characters. You shouldn't have to prove that aid is hurting; I think it would be more defensible, from a statistical/econometric point of view, to show the association and put the ball in their court.

P.S. Just to be clear: I don't have any strong feeling that you're wrong or any goal of "debunking" your paper. It's interesting and important work and I'm trying to understand it better.

And then they shot back with:

Regarding the stylistic point about associations and causal claims, we think this is perhaps discipline-specific, as the style in economics seems to be to make a causal claim and then rule out all the alternative causal stories as much as possible. I'm sure this is probably one of many idiosyncrasies that irk non-economists.

The substantive question is why paramilitary attacks (and paramilitary attacks specifically, rather than other measures of conflict), increase more in places near bases. The account we put forward is that this occurs because the Colombian military funnels a share of its resources to paramilitary groups. Thus, if US military aid translates into more resources for the military which are shared with paramilitary groups, the implication is that in the absence of increases in US military aid, paramilitary attacks would not have increased by as much as they did.

Now the alternative account you put forward is "more conflict, more paramilitary attacks, also more U.S. aid which actually serves to stop the situation from getting worse. The argument is, yes, the U.S. is giving weapons to the bad guys, but by doing so, it co-opts them and restrains their behavior."

It seems like you have two distinct things in mind, that overall conflict is a source of bias, and an associated conjecture that this omitted variable (overall conflict) upward biases our main coefficient since it is positively correlated with paramilitary attacks and positively correlated with the aid shock. First, we explicitly address and rule out potential omitted variables using a number of empirical specifications. But, even if there is an omitted variable correlated with U.S. military aid that differentially affects paramilitary attacks in base municipalities, it is not clear whether the direction of the bias would be positive. As an example, say a change in Colombian government leads the state to become more effective in fighting the guerilla insurgency, and the US rewards the state with more military aid, while paramilitary activity declines differentially in base regions, as this activity becomes less necessary with greater military effectiveness. In this case, the omitted variable (stronger Colombian state) is negatively correlated with paramilitary attacks and positively correlated with the aid shock, and this would lead us to underestimate the true effect of U.S. aid on paramilitary activity.

Moreover, we think we do a good job ruling out "conflict in general" at the national, state, or municipality level as a confounding variable. "Overall conflict" variation at the country level is absorbed by year fixed effects, and conflict at the department level is absorbed by the department x year fixed effects. At the municipal level, it is NOT the case that we observe increases in overall conflict, such as total number of clashes amongst all armed actors at the municipal level. (In our data, attacks are one-sided events carried out by a particular group. The fact that we see paramilitary attacks increase means we are specifically observing increases in events that involve only paramilitary groups, e.g., the paramilitaries attack a village or destroy some type of infrastructure.) Also, in every specification we find no effect on the guerrilla attacks, and we think you are not taking the non-effect sufficiently seriously in terms of countering the overall conflict account. The guerilla non-effect actually provides very robust evidence that the U.S. military aid is not just correlated with any type of conflict, but rather with attacks by a particular group (which has no regional spillovers).

In addition, our base-specific linear trend and post-2001 dummy specification should convince you that our effect is not merely a post-2001 increase in conflict that manifests particularly as paramilitary attacks in base municipalities.

Your alternative account suggests that more aid to paramilitary organizations could actually result in less violence. While it is challenging to know what the counterfactual would have been in the absence of increased aid, Figure 2 shows that when aid rises sharply in 1999 there is a differential increase in aid in the base regions, and when aid decreases in 2001, there is a corresponding closing of differential decrease in the base regions. This seems inconsistent with the idea that lower aid translates into more paramilitary activity. Also, after 2002, when aid rises again, the differential increases yet another time. It is difficult to explain this pattern with the account you put forward, which would have to require additional coincidental reasons why paramilitary attacks should increase more in base regions precisely in 1999, then decline in 2001, and then rise again in 2002. This is possible, but seems unlikely.

We were thinking of some ideas that would be consistent with your alternative account, of why more aid to paramilitary organizations could actually lower violence. One story here could be deterrence - that stronger paramilitaries deter the guerillas resulting in fewer attacks by guerillas or fewer clashes between guerillas and paramilitaries. But, our results do not show a fall in guerilla attacks or clashes amongst the two groups; rather the coefficient on these other variables is close to 0 and they are statistically insignificant, which is inconsistent with the deterrence account.

Another reason could be dependence, that in the short run U.S. aid increases paramilitary violence, but it also induces paramilitary reliance on the Colombian military for supplies, which increases the sway the government has vis-à-vis this group, potentially leading to future demobilization. Thus in the long-run, U.S. military aid reduces paramilitary violence. While this process could take "long and variable lags" to manifest, it is important to note that we see a dramatic increase in paramilitary activity in 2005, despite a half-decade of huge U.S. military transfers to Colombia. Thus we do not see evidence of this dependence account in our data.

## Does Special Education Actually Work?

Find out on Thurs 1 Oct at 11:15 am in Kimmel 900 at NYU: Dr. Michael Foster from UNC will present the 4th Statistics in Society lecture, entitled: "Does Special Education Actually Work?" This talk will explore the efficacy of current special education policies while highlighting the role of new methods in causal inference in helping to answer it. It is jointly sponsored by the Departments of Teaching and Learning and Applied Psychology, and by the Institute for Human Development and Social Change.

I'd definitely go to this if I were in town.

## Choices in how to write about regression results (example of an analysis of U.S. military aid in Colombia)

John reports on an article by Oeindrila Dube and Suresh Naidu, who ran some regressions on observational data and wrote:

This paper examines the effect of U.S. military aid on political violence and democracy in Colombia. We take advantage of the fact that U.S. military aid is channeled to Colombian army brigades operating out of military bases, and compare how changes in aid affect outcomes in municipalities with and without bases. Using detailed data on violence perpetrated by illegal armed groups, we find that U.S. military aid leads to differential increases in attacks by paramilitaries . . .

It's an interesting analysis, but I wish they'd restrained themselves and replaced all their causal language with "is associated with" and the like.

From a statistical point of view, what Dube and Naidu are doing is estimating the effects of military aid in two ways: first, by comparing outcomes in years in which the U.S. spends more or less in military aid; second, by comparing outcomes in cities in Colombia with and without military bases.
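The two comparisons combine into a difference-in-differences setup, which can be sketched on simulated data. Everything here is a made-up illustration, not the authors' data or specification: the panel sizes, the aid series, and the coefficient `beta_true` are assumptions. The estimate is the coefficient on the interaction of "has a base" with "aid that year," after removing municipality and year effects.

```python
import numpy as np

rng = np.random.default_rng(1)
n_mun, n_years = 1000, 15
beta_true = 0.5                          # assumed differential effect of aid near bases

has_base = rng.random(n_mun) < 0.2       # hypothetical base municipalities
aid = rng.normal(size=n_years)           # hypothetical yearly aid series
mun_effect = rng.normal(size=n_mun)      # time-invariant municipality differences
year_effect = rng.normal(size=n_years)   # nationwide shocks, e.g. overall conflict

interaction = np.outer(has_base.astype(float), aid)
attacks = (mun_effect[:, None] + year_effect[None, :]
           + beta_true * interaction
           + rng.normal(size=(n_mun, n_years)))

def two_way_demean(a):
    """Remove municipality (row) and year (column) means from a balanced panel."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

d = two_way_demean(interaction)
beta_hat = (d * two_way_demean(attacks)).sum() / (d**2).sum()
print("estimated differential effect:", round(beta_hat, 3))
```

The simulation builds in no differential trend between base and non-base municipalities, which is exactly the assumption under dispute; if such a trend happened to move with aid, the same regression would attribute it to aid.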

## Why am I skeptical of structural models?

Matt Fox writes:

I teach various Epidemiology courses in Boston and in South Africa and have been reading your blog for the past year or so and used several of your examples in class . . . I am curious to know why you are skeptical of structural models. Much of my training has been in how essential these models are and I rarely hear the other side of the debate.

I've never used structural models myself. They just seem to require so many assumptions that I don't know how to interpret them. (Of course the same could be said of Bayesian methods, but that doesn't seem to bother me at all.) One thing I like to say is that in observational settings I feel I can interpret at most one variable causally. The difficulty is that it's hard to control for things that happen after the variable that you're thinking of as the "treatment."

To put it another way, there's a research paradigm in which you fit a model--maybe a regression, maybe a structural equations model, maybe a multilevel model, whatever--and then you read off the coefficients, with each coefficient telling you something. You gather these together and those are your conclusions.

My paradigm is a bit different. I sometimes say that each causal inference requires its own analysis and maybe its own experiment. I find it difficult to causally interpret several different coefficients from the same model.

## SAT Coaching Found to Boost Scores -- Barely

Things haven't changed much since the 8-schools experiment, apparently. (See also this article by Ben Hansen.) Howard Wainer once told me that SAT coaching is effective--it's about as effective as the equivalent number of hours in your math or English class at school.

## Dumpin' the data in raw

Benjamin Kay writes:

I just finished the Stata Journal article you wrote. In it I found the following quote: "On the other hand, I think there is a big gap in practice when there is no discussion of how to set up the model, an implicit assumption that variables are just dumped raw into the regression."

I saw James Heckman (famous econometrician and labor economist) speak on Friday, and he mentioned that using test scores in many kinds of regressions is problematic, because the assignment of a score is somewhat arbitrary even if the ordering is not. He suggested that positive, monotonic transformations of scores contain the same information but lead to different standard errors if, in your words, one just "dumped" them into the regression. It was somewhat of a throwaway remark, but considering it longer, I imagine he means that a given difference in test scores need not have a constant effect. The remedy he suggested was to recalibrate exam scores such that they have some objective meaning. For example, on a mechanics exam scored between one and a hundred, one can pass (65) only if one successfully rebuilds the engine in the time allotted, but better scores indicate higher quality or faster speed. In this example one might change the score to a binary variable for passing or not, an objective test of a set of competencies. However, doing that clearly throws away information.

Do you or the readers of Statistical Modeling, Causal Inference, and Social Science blog have any advice here? The transformation of the variable is problematic and the critique of using it raw seems a serious one, but the act of narrowly mapping it onto a set of objective discrete skills seems to destroy lots of information. Percentile ranks on exams might be a substitute for the raw scores in many cases, but introduce other problems, as in comparisons between groups.

My reply: Heckman's suggestion sounds like it would be good in some cases but it wouldn't work for something like the SAT which is essentially a continuous measure. In other cases, such as estimated ideal point measures for congressmembers, it can make sense to break a single continuous ideal-point measure into two variables: political party (a binary variable: Dem or Rep) and the ideology score. This gives you the benefits of discretization without the loss of information.

In chapter 4 of ARM we give a bunch of examples of transformations, sometimes on single variables, sometimes combining variables, sometimes breaking up a variable into parts. A lot of information is coded in how you represent a regression function, and it's criminal to just take the data as they appear in the Stata file and just dump them in raw. But I have the horrible feeling that many people either feel that it's cheating to transform the variables, or that it doesn't really matter what you do to the variables, because regression (or matching, or difference-in-differences, or whatever) is a theorem-certified bit of magic.
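Here is a rough sketch of the ideal-point example on simulated data. The numbers are invented for illustration (they are not from ARM or from any real roll-call data): the outcome has a jump between parties plus a small within-party slope, and the two fits compare dumping in the raw score with splitting it into party and score.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
rep = rng.random(n) < 0.5                               # hypothetical party indicator
ideal = np.where(rep, 0.5, -0.5) + 0.2 * rng.normal(size=n)
# Assumed outcome: a party jump of 1.0 plus a within-party slope of 0.3.
y = 1.0 * rep + 0.3 * ideal + 0.5 * rng.normal(size=n)

def fit(y, *cols):
    """Least-squares fit with intercept; returns coefficients and residual sum of squares."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)

beta_raw, rss_raw = fit(y, ideal)                        # raw score dumped in
beta_split, rss_split = fit(y, rep.astype(float), ideal) # party + score

print("slope on raw score:    ", round(beta_raw[1], 2))
print("party jump, slope:     ", np.round(beta_split[1:], 2))
print("residual SS raw, split:", round(rss_raw, 1), round(rss_split, 1))
```

With the raw score alone, the single slope has to absorb the between-party jump; the split version keeps all of the continuous information and lets the jump and the within-party slope be estimated separately.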

## Econometrics reaches The Economist

Instrumental variables help to isolate causal relationships. But they can be taken too far

"Like elaborately plumed birds...we preen and strut and display our t-values." That was Edward Leamer's uncharitable description of his profession in 1983. Mr Leamer, an economist at the University of California in Los Angeles, was frustrated by empirical economists' emphasis on measures of correlation over underlying questions of cause and effect, such as whether people who spend more years in school go on to earn more in later life. Hardly anyone, he wrote gloomily, "takes anyone else's data analyses seriously". To make his point, Mr Leamer showed how different (but apparently reasonable) choices about which variables to include in an analysis of the effect of capital punishment on murder rates could lead to the conclusion that the death penalty led to more murders, fewer murders, or had no effect at all.

In the years since, economists have focused much more explicitly on improving the analysis of cause and effect, giving rise to what Guido Imbens of Harvard University calls "the causal literature". The techniques at the heart of this literature--in particular, the use of so-called "instrumental variables"--have yielded insights into everything from the link between abortion and crime to the economic return from education. But these methods are themselves now coming under attack.

## You can't win for losing

Devin Pope writes:

I wanted to send you an updated version of Jonah Berger and my basketball paper that shows that teams that are losing at halftime win more often than expected.

This new version is much improved. It has 15x more data than the earlier version (thanks to blog readers) and analyzes both NBA and NCAA data.

Also, you will notice if you glance through the paper that it has benefited quite a bit from your earlier critiques. Our empirical approach is very similar to the suggestions that you made.

See here and here for my discussion of the earlier version of Berger and Pope's article.

Here's the key graph from the previous version:

And here's the update:

Much better--they got rid of that wacky fifth-degree polynomial that made the lines diverge in the graph from the previous version of the paper.

What do we see from the new graphs?

## One of those funny things

I published an article in the Stata Journal even though I don't know how to use Stata.

## Estimating treatment effects that vary, with application to voter mobilization experiments in political science

Avi Feller and Chris Holmes sent me a new article on estimating varying treatment effects. Their article begins:

Randomized experiments have become increasingly important for political scientists and campaign professionals. With few exceptions, these experiments have addressed the overall causal effect of an intervention across the entire population, known as the average treatment effect (ATE). A much broader set of questions can often be addressed by allowing for heterogeneous treatment effects. We discuss methods for estimating such effects developed in other disciplines and introduce key concepts, especially the conditional average treatment effect (CATE), to the analysis of randomized experiments in political science. We expand on this literature by proposing an application of generalized additive models to estimate nonlinear heterogeneous treatment effects. We demonstrate the practical importance of these techniques by reanalyzing a major experimental study on voter mobilization and social pressure and a recent randomized experiment on voter registration and text messaging from the 2008 US election.
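The distinction between the ATE and the CATE is easy to see in a simulated experiment. This sketch uses crude binning on one covariate, which is a stand-in for, not a version of, the generalized additive models the authors propose; the covariate, effect sizes, and sample size are all assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
x = rng.uniform(0, 1, size=n)        # hypothetical pre-treatment covariate
t = rng.integers(0, 2, size=n)       # randomized treatment assignment
effect = 0.4 * x                     # assumed: treatment effect grows with x
y = 0.3 + 0.2 * x + effect * t + 0.5 * rng.normal(size=n)

def diff_in_means(y, t):
    """Treated-minus-control difference in mean outcomes."""
    return y[t == 1].mean() - y[t == 0].mean()

# ATE: one number for the whole population.
ate_hat = diff_in_means(y, t)

# CATE: the same comparison within quartile bins of x.
which_bin = np.digitize(x, [0.25, 0.5, 0.75])   # bin labels 0..3
cate_hat = np.array([diff_in_means(y[which_bin == k], t[which_bin == k])
                     for k in range(4)])

print("ATE estimate:", round(ate_hat, 3))
print("CATE by bin: ", np.round(cate_hat, 3))
```

The single ATE averages over effects that differ several-fold across bins; a smooth model of the effect as a function of `x` would replace the arbitrary bin edges used here.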

This is a cool paper--they reanalyze data from some well-known experiments and find important interactions. I just have a few comments to add:

## Pearl's and Gelman's final thoughts (for now) on causal inference

After six entries and 91 comments on the connections between Judea Pearl and Don Rubin's frameworks for causal inference, I thought it would be good to draw the discussion to a (temporary) close. I'll first present a summary from Pearl, then briefly give my thoughts.

Pearl writes:

## Congressional counterfactuals

John Sides links to this quote from Barney Frank:

Not for the first time, as a -- a -- an elected official, I envy economists. Economists have available to them, in an analytical approach, the counterfactual. Economists can explain that a given decision was the best one that could be made, because they can show what would have happened in the counterfactual situation. They can contrast what happened to what would have happened.

No one has ever gotten reelected where the bumper sticker said, "It would have been worse without me." You probably can get tenure with that. But you can't win office.

I have two thoughts on this. First, I think Frank is a bit too confident in economists' ability to "show what would have happened in the counterfactual situation." Maybe "estimate" or "guess" or "hypothesize" would be a bit more accurate than "show." Recall this notorious graph, which shows the unintentional counterfactual of some economic predictions:

Second, I don't know how Frank can say that about "no one has ever gotten reelected . . ." In Frank's district in Massachusetts, it would take a lot--a lot--for a Democrat to not get reelected.

## How to think about instrumental variables when you get confused

What with all this discussion of causal inference, I thought I'd rerun a blog entry from a couple years ago about my personal trick for understanding instrumental variables:

A correspondent writes:

I've recently started skimming your blog (perhaps steered there by Brad deLong or Mark Thoma) but despite having waded through such enduring classics as Feller Vol II, Henri Theil's "Econometrics", James Hamilton's "Time Series Analysis", and T.W. Anderson's "Multivariate Analysis", I'm finding some of the discussions such as Pearl/Rubin a bit impenetrable. I don't have a stats degree so I am thinking there is some chunk of the core curriculum on modeling and causality that I am missing. Is there a book (likely one of yours - e.g. Bayesian Data Analysis) that you would recommend to help fill in my background?

1. I recommend the new book, "Mostly Harmless Econometrics," by Angrist and Pischke (see my review here).

2. After that, I'd read the following chapters from my book with Jennifer:

Chapter 9: Causal inference using regression on the treatment variable

Chapter 10: Causal inference using more advanced models

Here are some pretty pictures, from the low-birth-weight example:

and from the Electric Company example:

3. Beyond this, you could read the books by Morgan and Winship and Pearl, but both of these are a bit more technical and less applied than the two books linked to above.

The commenters may have other suggestions.

## When to standardize regression inputs and when to leave them alone

Daniel Egan sent me a link to an article, "Standardized or simple effect size: What should be reported?" by Thom Baguley, that recently appeared in the British Journal of Psychology. Here's the abstract:

It is regarded as best practice for psychologists to report effect size when disseminating quantitative research findings. Reporting of effect size in the psychological literature is patchy -- though this may be changing -- and when reported it is far from clear that appropriate effect size statistics are employed. This paper considers the practice of reporting point estimates of standardized effect size and explores factors such as reliability, range restriction and differences in design that distort standardized effect size unless suitable corrections are employed. For most purposes simple (unstandardized) effect size is more robust and versatile than standardized effect size. Guidelines for deciding what effect size metric to use and how to report it are outlined. Foremost among these are: (i) a preference for simple effect size over standardized effect size, and (ii) the use of confidence intervals to indicate a plausible range of values the effect might take. Deciding on the appropriate effect size statistic to report always requires careful thought and should be influenced by the goals of the researcher, the context of the research and the potential needs of readers.

Egan writes:

I run into the problem of reporting coefficients all the time, mostly in the context of presenting effects to non-statisticians. While my audiences are generally bright, the obvious question always asked is "which of these is the biggest effect?" The fact that a sex dummy has a large numerical point estimate relative to number-of-purchases is largely irrelevant - it's because sex's range is tiny compared to other covariates. But moreover, sex is irrelevant to "policy-making" - we can't change a person's sex! So what we're interested in is the viable range over which we could influence an independent variable, and the second-order likely effect upon the dependent. So two questions: 1. For pedagogical effect, is there any way of getting around these problems? How can we communicate the effects to non-statisticians easily (and think someone who has exactly 10 minutes to understand your whole report)? 2. Is there any easy way to infer the elasticity of the effect - i.e. how much can we change the dependent, by attempting to exogenously change one of the independents? While I know that I could design the experiment to do this, I work in far more observational data - and this "effect" size is really what matters the most.

My quick reply to Egan is to refer to my article with Iain Pardoe on average predictive comparisons, where we discuss some of these concerns.

I also have some thoughts on the Baguley article:

## Rubinism: separating the causal model from the Bayesian data analysis

In the most recent round of our recent discussion, Judea Pearl wrote:

There is nothing in his theory of potential-outcome that forces one to "condition on all information" . . . Indiscriminate conditioning is a culturally-induced ritual that has survived, like the monarchy, only because it was erroneously supposed to do no harm.

I agree with the first part of Pearl's statement but not the second part (except to the extent that everything we do, from Bayesian data analysis to typing in English, is a "culturally induced ritual"). And I think I've spotted a key point of confusion.

To put it simply, Donald Rubin's approach to statistics has three parts:

1. The potential-outcomes model for causal inference: the so-called Neyman-Rubin model in which observed data are viewed as a sample from a hypothetical population that, in the simplest case of a binary treatment, includes y_i^1 and y_i^2 for each unit i.

2. Bayesian data analysis: the mode of statistical inference in which you set up a joint probability distribution for everything in your model, then condition on all observed information to get inferences, then evaluate the model by comparing predictive inferences to observed data and other information.

3. Questions of taste: the preference for models supplied from the outside rather than models inspired by data, a preference for models with relatively few parameters (for example, trends rather than splines), a general lack of interest in exploratory data analysis, a preference for writing models analytically rather than graphically, an interest in causal rather than descriptive estimands.

As that last list indicates, my own taste in statistical modeling differs in some ways from Rubin's. But what I want to focus on here is the distinction between item 1 (the potential outcomes notation) and item 2 (Bayesian data analysis).

The potential outcome notation and Bayesian data analysis are logically distinct concepts!

Items 1 and 2 above can occur together or separately. All four combinations (yes/yes, yes/no, no/yes, no/no) are possible:

- Rubin uses Bayesian inference to fit models in the potential outcome framework.

- Rosenbaum (and, in a different way, Greenland and Robins) use the potential outcome framework but estimate using non-Bayesian methods.

- Most of the time I use Bayesian methods but am not particularly thinking about causal questions.

- And, of course, there's lots of statistics and econometrics that's non-Bayesian and does not use potential outcomes.

Bayesian inference and conditioning

In Bayesian inference, you set up a model and then you condition on everything that's been observed. Pearl writes, "Indiscriminate conditioning is a culturally-induced ritual." Culturally-induced it may be, but it's just straight Bayes. I'm not saying that Pearl has to use Bayesian inference--lots of statisticians have done just fine without ever cracking open a prior distribution--but Bayes is certainly a well-recognized approach. As I think I wrote the other day, I use Bayesian inference not because I'm under the spell of a centuries-gone clergyman; I do it because I've seen it work, for me and for others.

Pearl's mistake here, I think, is to confuse "conditioning" with "including on the right-hand side of a regression equation." Conditioning depends on how the model is set up. For example, in their 1996 article, Angrist, Imbens, and Rubin showed how, under certain assumptions, conditioning on an intermediate outcome leads to an inference that is similar to an instrumental variables estimate. They don't suggest including an intermediate variable as a regression predictor or as a predictor in a propensity score matching routine, and they don't suggest including an instrument as a predictor in a propensity score model.
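For readers who want something concrete to hold onto, here is a minimal sketch of the classical instrumental variables (Wald) ratio, the kind of estimate that the Angrist, Imbens, and Rubin analysis is being compared to. Everything in it is an assumption made for illustration: the simulated data, the 60% compliance rate, the true effect of 2, and the helper name `wald_iv_estimate`. It is not their Bayesian procedure, just the simple ratio of intent-to-treat effects.

```python
import numpy as np

def wald_iv_estimate(z, d, y):
    """Ratio of the intent-to-treat effect on y to the intent-to-treat effect on d."""
    z = np.asarray(z, dtype=bool)
    d = np.asarray(d, dtype=float)
    y = np.asarray(y, dtype=float)
    itt_y = y[z].mean() - y[~z].mean()   # effect of the instrument on the outcome
    itt_d = d[z].mean() - d[~z].mean()   # effect of the instrument on treatment received
    return itt_y / itt_d

# Simulated data (all values hypothetical): one-sided noncompliance.
rng = np.random.default_rng(0)
n = 200_000
complier = rng.random(n) < 0.6           # 60% take the treatment if encouraged
z = rng.random(n) < 0.5                  # randomized encouragement (the instrument)
d = (z & complier).astype(float)         # treatment actually received
# Compliers have a higher baseline outcome, so "treated vs. untreated" is confounded.
y = 1.0 * complier + 2.0 * d + rng.normal(size=n)

iv_estimate = wald_iv_estimate(z, d, y)
naive_estimate = y[d == 1].mean() - y[d == 0].mean()
```

In this simulation the IV ratio lands near the true complier effect of 2, while the naive treated-versus-untreated comparison overstates it. Note that the instrument `z` never appears as a regression predictor; its role is encoded in how the estimate is constructed.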

If a variable is "an intermediate outcome" or "an instrument," this is information that must be encoded in the model, perhaps using words or algebra (as in econometrics or in Rubin's notation) or perhaps using graphs (as in Pearl's notation). I agree with Steve Morgan in his comment that Rubin's notation and graphs can both be useful ways of formulating such models. To return to the discussion with Pearl: Rubin is using Bayesian inference and conditioning on all information, but "conditioning" is relative to a model and does not at all imply that all variables are put in as predictors in a regression.

Another example of Bayesian inference is the poststratification which I spoke of yesterday (see item 3 here). But, as I noted then, this really has nothing to do with causality; it's just manipulation of probability distributions in a useful way that allows us to include multiple sources of information.

P.S. We're lucky to be living now rather than 500 years ago, or we'd probably all be sitting around in a village arguing about obscure passages from the Bible.

## More on Pearl/Rubin, this time focusing on a couple of points

To continue with our discussion (earlier entries 1, 2, and 3):

1. Pearl has mathematically proved the equivalence of Pearl's and Rubin's frameworks. At the same time, Pearl and Rubin recommend completely different approaches. For example, Rubin conditions on all information, whereas Pearl does not do so. In practice, the two approaches are much different. Accepting Pearl's mathematics (which I have no reason to doubt), this implies to me that Pearl's axioms do not quite apply to many of the settings that I'm interested in.

I think we've reached a stable point in this part of the discussion: we can all agree that Pearl's theorem is correct, and we can disagree as to whether its axioms and conditions apply to statistical modeling in the social and environmental sciences. I'd claim some authority on this latter point, given my extensive experience in this area--and of course, Rubin, Rosenbaum, etc., have further experience--but of course I have no problem with Pearl's methods being used on political science problems, and we can evaluate such applications one at a time.

2. Pearl and I have many interests in common, and we've each written two books that are relevant to this discussion. Unfortunately, I have not studied Pearl's books in detail and I doubt he's had the time to read my books in detail either. It takes a lot of work to understand someone else's framework, work that we don't necessarily want to do if we're already spending a lot of time and effort developing our own research programmes. It will probably be the job of future researchers to make the synthesis. (Yes, yes, I know that Pearl feels that he already has the synthesis, and that he's proved this to be the case, but Pearl's synthesis doesn't yet take me all the way to where I want to go, which is to do my applied work in social and environmental sciences.) I truly am open to the possibility that everything I do can be usefully folded into Pearl's framework someday.

That said, I think Pearl is on shaky ground when he tries to say that Don Rubin or Paul Rosenbaum is making a major mistake in causal inference. If Pearl's mathematics implies that Rubin and Rosenbaum are making a mistake, then my first step would be to apply the syllogism the other way and see whether Pearl's assumptions are appropriate for the problem at hand.

3. I've discussed a poststratification example. As I discussed yesterday (see the first item here), a standard idea, both in survey sampling and causal inference, is to perform estimates conditional on background variables, and then average over the population distribution of the background variables to estimate the population average. Mathematically, p(theta) = sum_x p(theta|x)p(x). Or, if x is discrete and takes on only two values, p(theta) = (N_1 p(theta|x=1) + N_2 p(theta|x=2)) / (N_1 + N_2).

This has nothing at all to do with causal inference: it's straight Bayes.
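The count weighting in the two-value formula can be sketched in a few lines. This is only an illustration under assumptions: the function name `poststratify` and the numbers are made up, and the population quantity is taken to be the count-weighted average of cell-level estimates (point estimates or posterior simulation draws).

```python
import numpy as np

def poststratify(cell_estimates, cell_counts):
    """Count-weighted average of cell-level estimates.

    cell_estimates: shape (n_cells,) for point estimates, or
        (n_draws, n_cells) for posterior simulation draws.
    cell_counts: population count N_x in each cell, shape (n_cells,).
    """
    estimates = np.asarray(cell_estimates, dtype=float)
    counts = np.asarray(cell_counts, dtype=float)
    weights = counts / counts.sum()      # p(x) = N_x / sum of N_x
    return estimates @ weights           # sum over cells of estimate_x * p(x)

# Two cells with made-up estimates and population counts.
population_estimate = poststratify([0.3, 0.6], [1000, 3000])
```

Passing a matrix of simulation draws instead of point estimates returns one population-level draw per row, so the uncertainty in the cell estimates carries through to the population estimate.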

Pearl thinks that if the separate components p(theta|x) are nonidentifiable, you can't do this, and you should not include x in the analysis. He writes:

I [Pearl] would really like to see how a Bayesian method estimates the treatment effect in two subgroups where it is not identifiable, and then, by averaging the two results (with two huge posterior uncertainties) gets the correct average treatment effect, which is identifiable, hence has a narrow posterior uncertainty. . . . I have no doubt that it can be done by fine-tuned tweaking . . . But I am talking about doing it the honest way, as you described it: "the uncertainties in the two separate groups should cancel out when they're being combined to get the average treatment effect." If I recall my happy days as a Bayesian, the only operation allowed in combining uncertainties from two subgroups is taking a linear combination of the two, weighted by the (given) relative frequencies of the groups. But, I am willing to learn new methods.

I'm glad that Pearl is willing to learn new methods--so am I--but, no new methods are needed here! This is straightforward, simple Bayes. Rod Little has written a lot about these ideas. I wrote some papers on it in 1997 and 2004. Jeff Lax and Justin Phillips do it in their multilevel modeling and poststratification papers where, for the first time, they get good state-by-state estimates of public opinion on gay rights issues. No "fine-tuned tweaking" required. You just set up the model and it all works out. If the likelihood provides little to no information on theta|x but it does provide good information on the marginal distribution of theta, then this will work out fine.

In practice, of course, nobody is going to control for x if we have no information on it. Bayesian poststratification really becomes useful in that it can put together different sources of partial information, such as data with small sample sizes in some cells, along with census data on population cell totals.

Please, please don't say "the correct thing to do is to ignore the subgroup identity." If you want to ignore some information, that's fine--in the context of the models you are using, it might even make sense. But Jeff and Justin and the rest of us use this additional information all the time, and we get a lot out of it. What we're doing is not incorrect at all. It's Bayesian inference. We set up a joint probability model and then work from it. If you want to criticize the probability model, that's fine. If you want to criticize the entire Bayesian edifice, then you'll have to go up against mountains of applied successes.

As I wrote earlier, you don't have to be a Bayesian--I have great respect for the work of Hastie, Tibshirani, Robins, Rosenbaum, and many others who are developing methods outside the Bayesian framework--but I think you're on thin ice if you want to try to claim that Bayesian analysis is "incorrect."

4. Jennifer and I and many others make the routine recommendation to exclude post-treatment variables from analysis. But, as both Pearl and Rubin have noted in different contexts, it can be a very good idea to include such variables--it's just not a good idea to include them as regression predictors. If the only thing you're allowed to do is regression (as in chapter 9 of ARM), then I think it's a good idea to exclude post-treatment predictors. If you're allowed more general models, then one can and should include them. I'm happy to have been corrected by both Pearl and Rubin on this one.

5. As I noted yesterday (see second-to-last item here), all statistical methods have holes. This is what motivates us to consider new conceptual frameworks as well as incremental improvements in the systems with which we are most familiar.

Summary . . . so far

I doubt this discussion is over yet, but I hope the above notes will settle some points. In particular:

- I accept (on authority of Pearl, Wasserman, etc.) that Pearl has proved the mathematical equivalence of his framework and Rubin's. This, along with Pearl's other claim that Rubin and Rosenbaum have made major blunders in applied causal inference (a claim that I doubt), leads me to believe that Pearl's axioms are in some way not appropriate to the sorts of problems that Rubin, Rosenbaum, and I work on: social and environmental problems that don't have clean mechanistic causation stories. Pearl believes his axioms do apply to these problems, but then again he doesn't have the extensive experience that Rosenbaum and Rubin have. So I think it's very reasonable to suppose that his axioms aren't quite appropriate here.

- Poststratification works just fine. It's straightforward Bayesian inference, nothing to do with causality at all.

- I have been sloppy when telling people not to include post-treatment variables. Both Rubin and Pearl, in their different ways, have been more precise about this.

- Much of this discussion is motivated by the fact that, in practice, none of these methods currently solves all our applied problems in the way that we would like. I'm still struggling with various problems in descriptive/predictive modeling, and causation is even harder!

- Along with this, taste--that is, working with methods we're familiar with--matters. Any of these methods is only as good as the models we put into them, and we typically are better modelers when we use languages with which we're more familiar. (But not always. Sometimes it helps to liberate oneself, try something new, and break out of the implicit constraints we've been working under.)

## More on Pearl's and Rubin's frameworks for causal inference

To follow up on yesterday's discussion, I wanted to go through a bunch of different issues involving graphical modeling and causal inference.

Contents:
- A practical issue: poststratification
- 3 kinds of graphs
- Minimal Pearl and Minimal Rubin
- Getting the most out of Minimal Pearl and Minimal Rubin
- Conceptual differences between Pearl's and Rubin's models
- Controlling for intermediate outcomes
- Statistical models are based on assumptions
- In defense of taste
- Argument from authority?
- How could these issues be resolved?
- Holes everywhere
- What I can contribute

## Philip Dawid's explication of Pearl's model, and two ways of thinking about nonrandom sampling

Philip Dawid (a longtime Bayesian researcher who's done work on graphical models, decision theory, and predictive inference) saw our discussion on causality and sends in some interesting thoughts, which I'll post here and then very briefly comment on:

Having just read through this fascinating interchange, I [Dawid] confess to finding Shrier and Pearl's examples and arguments more convincing than Rubin's. At the risk of adding to the confusion, but also in hope of helping at least some others, let me briefly describe yet another way (related to Pearl's, but with significant differences) of formulating and thinking about the problem. For those who, like me, may be concerned about the need to consider the probabilistic behaviour of counterfactual variables, on the one hand, or deterministic relationships encoded graphically, on the other, this provides an observable-focused, fully stochastic, alternative. A full presentation of the essential ideas can be found in Chapters 9 (Confounding and Sufficient Covariates) and 10 (Reduction of Sufficient Covariate) of my online document "Principles of Statistical Causality".

Like Pearl, I like to think of "causal inference" as the task of inferring what would happen under a hypothetical intervention, say F_E = e, that sets the value of the exposure E at e, when the data available are collected, not under the target "interventional regime", but under some different "observational regime". We could code this regime as F_E = idle. We can think of the non-stochastic variable F_E as a parameter, indexing the joint distribution of all the variables in the problem, under the regime indicated by its value.

## Does Medicare actually have higher administrative costs than private insurers?

Greg Mankiw links to an article that illustrates the challenges of interpreting raw numbers causally. This would really be a great example for your introductory statistics or economics classes, because the article, by Robert Book, starts off by identifying a statistical error and then goes on to make a nearly identical error of its own! Fun stuff.

## Resolving disputes between J. Pearl and D. Rubin on causal inference

This is a pretty long one. It's an attempt to explore some of the differences between Judea Pearl's and Don Rubin's approaches to causal inference, and is motivated by a recent article by Pearl.

Pearl sent me a link to this piece of his, writing:

I [Pearl] would like to encourage a blog-discussion on the main points raised there. For example:

Whether graphical methods are in some way "less principled" than other methods of analysis.

Whether confounding bias can only decrease by conditioning on a new covariate.

Whether the M-bias, when it occurs, is merely a mathematical curiosity, unworthy of researchers' attention.

Whether Bayesianism instructs us to condition on all available measurements.

I've never been able to understand Pearl's notation: notions such as a "collider of an M-structure" remain completely opaque to me. I'm not saying this out of pride--I expect I'd be a better statistician if I understood these concepts--but rather to give a sense of where I'm coming from. I was a student of Rubin and have used his causal ideas for a while, starting with this article from 1990 on estimating the incumbency advantage in politics. I'm pleased to see these ideas gaining wider acceptance. In many areas (including studying incumbency, in fact), I think the most helpful feature of Rubin's potential-outcome framework is to get you, as a researcher, to think hard about what you are in fact trying to estimate. In much of the current discussion of identification strategies, regression discontinuities, differences in differences, and the like, I think there's too much focus on technique and not enough thought put into what the estimates are really telling you. That said, it makes sense that other theoretical perspectives such as Pearl's could be useful too.

To return to the article at hand: Pearl is clearly frustrated by what he views as Rubin's bobbing and weaving to avoid a direct settlement of their technical dispute. From the other direction, I think Rubin is puzzled by Pearl's approach and is not clear what the point of it all is.

I can't resolve the disagreements here, but maybe I can clarify some technical issues.

Controlling for pre-treatment and post-treatment variables

Much of Pearl's discussion turns upon notions of "bias," which in a Bayesian context is tricky to define. We certainly aren't talking about the classical-statistical "unbiasedness," in which E(theta.hat | theta) = theta for all theta, an idea that breaks down horribly in all sorts of situations (see page 248 of Bayesian Data Analysis). Statisticians are always trying to tell people, Don't do this, Don't do that, but the rules for saying this can be elusive. This is not just a problem for Pearl: my own work with Rubin suffers from similar problems. In chapter 7 of Bayesian Data Analysis (a chapter that is pretty much my translation of Rubin's ideas), we talk about how you can't do this and you can't do that. We avoid the term "bias," but then it can be a bit unclear what our principles are. For example, we recommend that your model should, if possible, include all variables that affect the treatment assignment. This is good advice, but really we could go further and just recommend that an appropriate analysis should include all variables that are potentially relevant, to avoid omitted-variable bias (or the Bayesian equivalent). Once you've considered a variable, it's hard to go back to the state of innocence in which that information was never present.

If I'm reading his article correctly, Pearl is making two statistical points, both in opposition to Rubin's principle that a Bayesian analysis (and, by implication, any statistical analysis) should condition on all available information:

1. When it comes to causal inference, Rubin says not to control for post-treatment variables (that is, intermediate outcomes), which seems to contradict Rubin's more general advice as a Bayesian to condition on everything.

2. Rubin (and his collaborators such as Paul Rosenbaum) state unequivocally that a model should control for all pre-treatment variables, even though including such variables, in Pearl's words, "may create spurious associations between treatment and outcome and this, in turn, may increase or decrease confounding bias."

Let me discuss each of these criticisms, as best as I can understand them. Regarding the first point, a Bayesian analysis can control for intermediate outcomes--that's ok--but then the causal effect of interest won't be summarized by a single parameter--a "beta"--from the model. In our book, Jennifer and I recommend not controlling for intermediate outcomes, and a few years ago I heard Don Rubin make a similar point in a public lecture (giving an example where the great R. A. Fisher made this mistake). Strictly speaking, though, you can control for anything; you just then should suitably postprocess your inferences to get back to your causal inferences of interest.
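A small simulation can make the point about intermediate outcomes concrete. This is a sketch under made-up assumptions, not an analysis from the book: the treatment `t` is randomized, `m` is an intermediate outcome affected by the treatment and by an unobserved variable `u`, and all coefficients and the helper `ols_coefs` are hypothetical.

```python
import numpy as np

def ols_coefs(predictors, y):
    """Least-squares coefficients, intercept first, then one per predictor."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

rng = np.random.default_rng(1)
n = 100_000
t = rng.integers(0, 2, size=n).astype(float)   # randomized binary treatment
u = rng.normal(size=n)                         # unobserved, affects both m and y
m = t + u + rng.normal(size=n)                 # intermediate (post-treatment) outcome
y = 0.5 * t + m + u + rng.normal(size=n)       # total effect of t on y is 0.5 + 1.0 = 1.5

total_effect = ols_coefs([t], y)[1]            # regression on the treatment alone
coef_with_m = ols_coefs([t, m], y)[1]          # treatment coefficient after adding m
```

With these made-up coefficients, the regression on the treatment alone recovers the total effect of about 1.5, while adding `m` as a predictor drives the treatment coefficient to about 0. The fitted model with `m` is not wrong as a description of the data; the single "beta" on `t` just no longer summarizes the causal effect of interest, which is why the inferences would need postprocessing.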

I don't fully understand Pearl's second critique, in which he says that it's not always a good idea to control for pre-treatment variables. My best reconstruction is that Pearl is thinking about a setting where you could estimate a causal effect in a messy observational setting in which there are some important unobserved confounders, and it could well happen that controlling for a particular pre-treatment variable happens to make the confounding worse. The idea, I think, is that if you have an analysis where various problems cancel each other out, then fixing one of these problems (by controlling for one potential confounder) could result in a net loss. I can believe this could happen in practice, but I'm wary of setting this up as a principle. I'd rather control for all the pre-treatment predictors that I can, and then make adjustments if necessary to attempt to account for remaining problems in the model. Perhaps Pearl's position and mine are not so far apart, however, if his approach of not controlling for a covariate could be seen as an approximation to a fuller model that controls for it while also adjusting for other, unobserved, confounders.
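One setting in which controlling for a pre-treatment variable can hurt is the "M-structure" that Pearl's list of questions alludes to. The following is a sketch of that structure under assumptions chosen for illustration (linear effects, unit variances, a true effect of 1, hypothetical variable names); it is not taken from Pearl's article and may not be exactly the scenario he has in mind.

```python
import numpy as np

def ols_coefs(predictors, y):
    """Least-squares coefficients, intercept first, then one per predictor."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

rng = np.random.default_rng(2)
n = 200_000
u1 = rng.normal(size=n)                  # unobserved cause of treatment and of m
u2 = rng.normal(size=n)                  # unobserved cause of outcome and of m
x = u1 + rng.normal(size=n)              # treatment
m = u1 + u2 + rng.normal(size=n)         # pre-treatment covariate (the collider)
y = 1.0 * x + u2 + rng.normal(size=n)    # true effect of x on y is 1.0

unadjusted = ols_coefs([x], y)[1]        # y on x only
adjusted = ols_coefs([x, m], y)[1]       # y on x and m
```

Here `x` and `y` share no common cause, so the unadjusted regression is close to the true effect of 1. Conditioning on `m` links `u1` and `u2` and pulls the coefficient on `x` down to about 0.8. In a fuller model that also accounted for `u1` or `u2`, including `m` would do no harm, which is in line with the idea of controlling for what you can and then adjusting for what remains.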

The sum of unidentifiable components can be identifiable

At other points, Pearl seems to be displaying a misunderstanding of Bayesian inference (at least, as I see it). For example, he writes:

For example, if we merely wish to predict whether a given person is a smoker, and we have data on the smoking behavior of seat-belt users and non-users, we should condition our prior probability P(smoking) on whether that person is a "seat-belt user" or not. Likewise, if we wish to predict the causal effect of smoking for a person known to use seat-belts, and we have separate data on how smoking affects seat-belt users and non-users, we should use the former in our prediction. . . . However, if our interest lies in the average causal effect over the entire population, then there is nothing in Bayesianism that compels us to do the analysis in each subpopulation separately and then average the results. The class-specific analysis may actually fail if the causal effect in each class is not identifiable.

I think this discussion misses the point in two ways.

First, at the technical level, yes you definitely can estimate the treatment effect in two separate groups and then average. Pearl is worried that the two separate estimates might not be identifiable--in Bayesian terms, that they will individually have large posterior uncertainties. But, if the study really is being done in a setting where the average treatment effect is identifiable, then the uncertainties in the two separate groups should cancel out when they're being combined to get the average treatment effect. If the uncertainties don't cancel, it sounds to me like there must be some additional ("prior") information that you need to add.
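The cancellation can be seen in a toy Gaussian model. This is a sketch under strong assumptions (all hypothetical): two subgroup effects with weak independent normal priors, equal subgroup shares, and data that inform only the weighted average of the two effects. With normal priors and a normal likelihood, the posterior precision matrix is the sum of the prior and likelihood precisions.

```python
import numpy as np

# Hypothetical setup: two subgroup effects with weak independent N(0, 10^2) priors.
prior_sd = 10.0
weights = np.array([0.5, 0.5])    # population shares of the two subgroups
sigma, n = 1.0, 100               # n noisy measurements of the weighted average only

prior_precision = np.eye(2) / prior_sd**2
likelihood_precision = (n / sigma**2) * np.outer(weights, weights)
posterior_cov = np.linalg.inv(prior_precision + likelihood_precision)

sd_subgroups = np.sqrt(np.diag(posterior_cov))            # posterior sd of each effect
sd_average = np.sqrt(weights @ posterior_cov @ weights)   # posterior sd of the average
corr = posterior_cov[0, 1] / (sd_subgroups[0] * sd_subgroups[1])
```

In this setup each subgroup effect keeps a posterior standard deviation of roughly 7, barely narrower than the prior, while the weighted average has a posterior standard deviation of roughly 0.1. The two subgroup effects have a posterior correlation close to -1, and that strong negative correlation is what makes the uncertainties cancel in the linear combination.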

The second way that I disagree with Pearl's example is that I don't think it makes sense to estimate the smoking behavior separately for seat-belt users and non-users. This just seems like a weird thing to be doing. I guess I'd have to see more about the example to understand why someone would do this. I have a lot of confidence in Rubin, so if he actually did this, I expect he had a good reason. But I'd have to see the example first.

Final thoughts

Hal Stern once told me the real division in statistics was not between the Bayesians and non-Bayesians, but between the modelers and the non-modelers. The distinction isn't completely clear--for example, where does the "Bell Labs school" of Cleveland, Hastie, Tibshirani, etc. fall?--but I like the idea of sharing a category with all the modelers over the years--even those who have not felt the need to use Bayesian methods.

Reading Pearl's article, however, reminded me of another distinction, this time between discrete models and continuous models. I have a taste for continuity and always like setting up my model with smooth parameters. I'm just about never interested in testing whether a parameter equals zero; instead, I'd rather infer about the parameter in a continuous space. To me, this makes particular sense in the sorts of social and environmental statistics problems where I work. For example, is there an interaction between income, religion, and state of residence in predicting one's attitude toward school vouchers? Yes. I knew this ahead of time. Nothing is zero, everything matters to some extent. As discussed in chapter 6 of Bayesian Data Analysis, I prefer continuous model expansion to discrete model averaging.

In contrast, Pearl, like many other Bayesians I've encountered, seems to prefer discrete models and procedures for finding conditional independence. In some settings, this can't matter much: if a source of variation is small, then maybe not much is lost by setting it to zero. But it changes one's focus, pointing Pearl toward goals such as "eliminating bias" and "covariate selection" rather than toward the goals of modeling the relations between variables. I think graphical models are a great idea, but given my own preferences toward continuity, I'm not a fan of the sorts of analyses that attempt to discover whether variables X and Y really have a link between them in the graph. My feeling is, if X and Y might have a link, then they do have a link. The link might be weak, and I'd be happy to use Bayesian multilevel modeling to estimate the strength of the link, partially pool it toward zero, and all the rest--but I don't get much out of statistical procedures that seek to estimate whether the link is there or not.

Finally, I'd like to steal something I wrote a couple years ago regarding disputes over statistical methodology:

Different statistical methods can be used successfully in applications--there are many roads to Rome--and so it is natural for anyone (myself included) to believe that our methods are particularly good for applications. For example, Adrian Raftery does excellent applied work using discrete model averaging, whereas I don't feel comfortable with that approach. Brad Efron has used bootstrapping to help astronomers solve their statistical problems. Etc etc. I don't think that Adrian's methods are particularly appropriate to sociology, or Brad's to astronomy--these are just powerful methods that can work in a variety of fields. Given that we each have successes, it's unsurprising that we can each feel strongly in the superiority of our own approaches. And I certainly don't feel that the approaches in Bayesian Data Analysis are the end of the story. In particular, nonparametric methods such as those of David Dunson, Ed George, and others seem to have a lot of advantages.

Similarly, Pearl has achieved a lot of success and so it would be silly for me to argue, or even to think, that he's doing everything all wrong. I think this expresses some of Pearl's frustration as well: Rubin's ideas have clearly been successful in applied work, so it would be awkward to argue that Rubin is actually doing the wrong thing in the problems he's worked on. It's more that any theoretical system has holes, and the expert practitioners in any system know how to work around these holes.
