I received the following email:
July 2009 Archives
Avi Feller and Chris Holmes sent me a new article on estimating varying treatment effects. Their article begins:
Randomized experiments have become increasingly important for political scientists and campaign professionals. With few exceptions, these experiments have addressed the overall causal effect of an intervention across the entire population, known as the average treatment effect (ATE). A much broader set of questions can often be addressed by allowing for heterogeneous treatment effects. We discuss methods for estimating such effects developed in other disciplines and introduce key concepts, especially the conditional average treatment effect (CATE), to the analysis of randomized experiments in political science. We expand on this literature by proposing an application of generalized additive models to estimate nonlinear heterogeneous treatment effects. We demonstrate the practical importance of these techniques by reanalyzing a major experimental study on voter mobilization and social pressure and a recent randomized experiment on voter registration and text messaging from the 2008 US election.
This is a cool paper--they reanalyze data from some well-known experiments and find important interactions. I just have a few comments to add:
The American Statistical Association organizes a program in which young researchers can submit writing samples and get comments from statisticians who are more experienced writers. I agreed to participate in this program, as long as the authors were willing to have their articles and my comments posted here.
I'm going to start with my general advice after reading and commenting on the two articles sent to me. I think this advice should be of interest to nearly all the readers of this blog. Then I'll link to the articles and give some detailed comments.
General advice
Both the papers sent to me appear to have strong research results. Now that the research has been done, I'd recommend rewriting both articles from scratch, using the following template:
1. Start with the conclusions. Write a couple pages on what you've found and what you recommend. In writing these conclusions, you should also be writing some of the introduction, in that you'll need to give enough background so that general readers can understand what you're talking about and why they should care. But you want to start with the conclusions, because that will determine what sort of background information you'll need to give.
2. Now step back. What is the principal evidence for your conclusions? Make some graphs and pull out some key numbers that represent your research findings which back up your claims.
3. Back one more step, now. What are the methods and data you used to obtain your research findings?
4. Now go back and write the literature review and the introduction.
5. Moving forward one last time: go to your results and conclusions and give alternative explanations. Why might you be wrong? What are the limits of applicability of your findings? What future research would be appropriate to follow up on these loose ends?
6. Write the abstract. An easy way to start is to take the first sentence from each of the first five paragraphs of the article. This probably won't be quite right, but I bet it will be close to what you need.
7. Give the article to a friend, ask him or her to spend 15 minutes looking at it, then ask what they think your message was, and what evidence you have for it. Your friend should read the article as a potential consumer, not as a critic. You can find typos on your own time, but you need somebody else's eyes to get a sense of the message you're sending.
Bob Shapiro, author of two important books on public opinion (The Rational Public, 1992, with Benjamin Page, and Politicians Don't Pander, 2000, with Lawrence Jacobs) sent me this report he just wrote with Sara Arrow, comparing public opinion for Obama's health care initiative with opinion in 1993-94, when Bill Clinton's health plan crashed and burned. They write:
John Sides links to an (unintentionally, I assume) hilarious peer-reviewed article by C. K. Rowley, which begins:
Following up on our earlier discussion of the administrative costs of Medicare and private insurers, Robert Book sent me a report on Illusions of Cost Control in Public Health Care Plans, which is full of numbers and argues that "Medicare's administrative costs are a lower percentage of the total not because Medicare has cheaper administration, but because it has more expensive patients." I don't know enough to evaluate these arguments, but I like that he has a lot of numbers and graphs right out there, so that any disputes can be on specific points.
I do have one question, which probably reflects my ignorance of health-economics terminology more than anything else. Book writes, "Claims processing is the only category that is at all sensitive to the level of health care utilization." From my personal experience with the health care system, I associate "administrative costs" with the many levels of clerks and paper-pushers you have to deal with before you get to see a doctor or nurse. I'm not quite sure how "claims processing" is defined, but I see a lot of full-time employees (as well as, I assume, some higher-paid full-time employees in some back room) who aren't doing anything health-related; they're just minding the store. And this all seems pretty much proportional to health care utilization: I assume that if people are going to the doctor twice as often, or doing more complicated procedures, there are that many extra visits, that many extra forms to fill out, etc. I've been in hospital wards at night where there is no doctor to be seen, maybe no nurse, but three or four administrative employees appear to be continuously busy with something or another.
This is not intended as a criticism of Book's argument, just a thought that some of these seemingly neutral terms such as "administrative costs" can be confusing.
Nate Silver links to a Congressional Quarterly list of ratings for 2010 congressional races and concludes that, although these listings give a sense of which races are more likely to be competitive, the CQ chart doesn't really say much about the chance that there will be a "wave" election that would switch partisan control to the Republicans.
The same day, Matthew Yglesias links to a recent Congressional Quarterly report entitled, "2010 House Outlook: Democrats Look Secure" and concludes that, yes, the Democrats look secure to keep their House and Senate majorities.
What should we believe? For the purpose of campaign strategy, you need to look at the races in each district, but to get a sense of what's going to happen overall, I think the best approach is to look at the national vote. There's lots of variation, but, overall, swings occur nationally.
Here's a graph I made after the election, showing the average Democratic share of the two-party vote for the House of Representatives and for president for the past sixty years:

From this picture, it looks possible but unlikely that there will be a 6% swing toward the Republicans (which is what it would take for them to bring their average district vote from 44% to 50%). Historically speaking, a 6% swing is a lot. The biggest shifts in the past few decades appear to be 1946-48, 1956-58, and 1972-74 (in favor of the Democrats) and 1964-66 and 1992-94 (for the Republicans). I don't know if any of these would quite be enough to swing the House majority. A more likely outcome, if the Republicans indeed improve in next year's election, is for them to make some gains but still be in the minority.
The other factor helping the Democrats is incumbency, which helps lock in a congressional majority (as it did for the Republicans after 1994) by bumping up the vote shares of the new congressmembers elected in swing districts. In 2008, John Kastellec, Jamie Chandler, and I estimated that the Republicans would need something like 51% of the average district vote to have an even shot of winning a majority of House seats.
The counterintuitive style of economic analysis is typically set up to make one of two points:
1. Some seemingly stupid thing that people do actually is rational. (For example, see the notorious "rational addiction" model.) Of course, it's gotta be rational, right? Otherwise why would people do it?
2. Some seemingly reasonable thing that people do actually is irrational. I came across a recent example of this sort of argument in a discussion of the sunk cost fallacy by Dan Reeves on Sharad's blog. Of course people are irrational, right? After all, we're bundles of flesh, not calculating machines.
Both these sorts of points are reasonable (although, I have to admit, I'm pretty skeptical both on the "rational addiction" and the "sunk cost" stories).
But what really interests me is that both sorts of arguments above are, as we say in the social sciences, "normative"; that is, they are about what we should do (in the first case, we "should" be less bothered by certain behavior that seems irrational, we should be less inclined to regulate seemingly irrational or predatory behavior, etc; in the second case, we "should" change our behaviors so as not to violate some key theoretical axiom). And both sorts of arguments make sense. But they go in the opposite direction! And I can easily imagine just about any behavior analyzed in either of these two directions. Obviously, we can analyze addiction by discussing the inconsistency of the actions of an addict; similarly, we can rationalize the sunk-cost examples by postulating more complicated goals.
As I wrote last year:
I'm still disturbed by the lack of connection that is made between the fundamental principles of economics (under which $5,000 worth of expensive wine has the same value as $5,000 worth of Cheetos) and the sort of technocratic reasoning (the kind of thing that makes me, as a statistician, happy) where you try to assign a cost to each thing. Really this applies to economics, or "freakonomics," in general: For example, you can do some data analysis to see if sumo wrestlers are cheating, or you can just say that sumo wrestling supplies an entertainment niche and leave it to the wrestlers to figure out how to optimally collude. Either sort of analysis is ok, but I rarely see them juxtaposed--it's typically one or the other, and the conclusions seem to depend a lot on which mode of analysis is chosen.
P.S. I'm not trying to criticize economics, or economic analysis, in general. I do the stuff myself. (See, for example, this article of ours on cost-benefit tradeoffs in radon measurement and remediation.) I'm just pointing out what I see as a difficulty with some of the normative arguments out there.
Daniel Lakeland writes:
You may be astounded that people are still reporting 26% more probability to have daughters than sons, and then extrapolating this to decide that evolution is strongly favoring beautiful women... Or considering the degree of innumeracy in the population perhaps you wouldn't be astounded.... in any case... they are still reporting such things.
If anyone out there happens to know Jonathan Leake, the reporter who wrote this story for the (London) Sunday Times, perhaps you could send him a copy of our recent article in the American Scientist. Or, if he'd like more technical details, this article from the Journal of Theoretical Biology?
Thank you. I have nothing more to say at this time.
It's been a dramatic month: A month ago, a coalition of some of the leading teams qualified for the $1 million grand prize for improving the accuracy of the movie-recommending model by more than 10%. But the competition would close only 30 days afterward, in case someone else was able to improve upon the result. This happened less than a day before the deadline, courtesy of The Ensemble, an enormous team composed of 23 previously separate teams and individuals. Of course, most of the progress towards the victory came from models making use of significant new patterns in the data, such as that of time.
The development of an ensemble from many separate teams was another accomplishment, and the GPT's inclusion rules provide some insight into the process: "shares" of the winnings were distributed based on how much each contribution was able to improve the result, in terms of percentage points. Simon Owens describes what it was like to participate in The Ensemble.
Bayesian statistics always works with ensembles: the posterior is a weighted average of all models, the weight being based on the fit of each model times the prior quality of the model. There are some additional Bayesian elements that could be a part of future competitions, such as Bayesian scoring functions.
In the past I was asked to contrast Occam's razor with the Epicurean principle. Occam's razor is the Bayesian prior, or the yang principle: simpler models have greater a priori weight (because we tend to economize that what is useful). Occam's razor goes back to Aristotle, who wrote "For the more limited, if adequate, is always preferable," and "For if the consequences are the same, it is always better to assume the more limited antecedent" in his Physics. We mathematically express it as the prior.
The Epicurean principle is the yin, or mathematically expressed as the integral over the model space. Ensembles go back to Epicurus' letter to Herodotus: "When, therefore, we investigate the causes of [...] phenomena, [...] we must take into account the variety of ways in which analogous occurrences happen within our experience." Thus, Bayesian statistics combines the yin and the yang, balancing the pursuit of simplicity with the limitations of uncertainty.
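To make the weighting concrete, here is a minimal sketch of the averaging described above. It assumes that "fit" means each model's log marginal likelihood and that the model space is a small finite set, so the integral becomes a sum. The function names and all the numbers are made up for illustration; this is not how any of the competing teams actually blended their models.

```python
import math

def posterior_model_weights(log_marginal_likelihoods, prior_weights):
    """Weight of each model: proportional to prior weight times fit."""
    log_unnorm = [ll + math.log(p)
                  for ll, p in zip(log_marginal_likelihoods, prior_weights)]
    # Subtract the maximum before exponentiating, for numerical stability.
    top = max(log_unnorm)
    unnorm = [math.exp(v - top) for v in log_unnorm]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def model_average(predictions, weights):
    """Ensemble prediction: weighted average over the model space."""
    return sum(w * p for w, p in zip(weights, predictions))

# Hypothetical inputs for three candidate models (illustrative values only).
log_ml = [-105.0, -103.0, -102.5]  # fit: log marginal likelihood of each model
prior = [0.6, 0.3, 0.1]            # Occam-style prior favoring the simplest model
preds = [3.2, 3.6, 3.9]            # each model's predicted movie rating

weights = posterior_model_weights(log_ml, prior)
ensemble_prediction = model_average(preds, weights)
print(weights, ensemble_prediction)
```

The prior plays the Occam role (simpler models start with more weight) and the sum over models plays the Epicurean role (no model is thrown away, only downweighted).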
[7/31/09: Added a link to Simon Owens' interview with The Ensemble.]
I just read Charles Seife's excellent book, "Sun in a bottle: The strange history of fusion and the science of wishful thinking." One thing I found charming about the book was that it lumped crackpot cold fusion, nutty plans to use H-bombs to carve out artificial harbors in Alaska, and mainstream tokamaks into the same category: wildly-hyped but unsuccessful promises to change the world. The "wishful thinking" framing seems to fit all these stories pretty well, much better than the usual distinction between the good science of big-budget lasers and tokamaks and the bad science of cold fusion and the like. The physics explanations were good also.
There was only one part I really disagreed with. On page 220, Seife writes, "Science is little more than a method of tearing away notions that are not supported by cold, hard data." I disagree. Just for a few examples from physics, how about Einstein's papers on Brownian motion and the photoelectric effect? And what about lots of biology, chemistry, and solid state physics, figuring out the structures of crystals and semiconductors and protein folding and all that? Sure, all of this work involves some "tearing away" of earlier models, but much of it--often the most important part--is constructive, building a model--a story--that makes sense and backing it up with data.
I really like this post of Nate Silver's. Ideal-point models and other fancy statistical techniques are fine, but I'm a big fan of using the simple, directly-interpretable summary when it makes the point.
Mike Barnicle's already on the case. So now it's time for the classy upscale take on the story.
After six entries and 91 comments on the connections between Judea Pearl and Don Rubin's frameworks for causal inference, I thought it would be good to draw the discussion to a (temporary) close. I'll first present a summary from Pearl, then briefly give my thoughts.
Pearl writes:
It goes like this: there's something you want to estimate and you have some data. Maybe, to take my favorite recent example, you want to break down support for school vouchers by religion, ethnicity, income, and state (or maybe you'd like to break it down even further, but you have to start somewhere).
Or maybe you want to estimate the difference between how rich and poor people vote, by state, over several decades--but you're lazy and all you want to work with are the National Election Studies, which only have a couple thousand respondents, at most, in any year, and don't even cover all the states.
Or maybe you want to estimate the concentration of cat allergen in a bunch of dust samples, while simultaneously estimating the calibration curve needed to get numerical estimates, all in the presence of contamination that screws up your calibration.
Or maybe you want to identify the places in the United States where it's cost-effective to test your house for radon gas--and the data you have across the country are 80,000 noisy measurements, 5,000 accurate measurements, and some survey data and geological information.
Or maybe you want to understand how perchloroethylene is absorbed in the body--a process that is active at the time scale of minutes and also weeks--given only a couple dozen measurements on each of a few people.
Or maybe you want to get a picture of brain activity given indirect measurements from a big clanking physical device encircling a person's head.
Or maybe you want to estimate what might have happened in past elections had the Democrats or Republicans received 1% more, or 2% more, or 3% more, of the vote.
Or maybe . . . or maybe . . .
What all these examples have in common is some data--not enough, never enough!--and a vague sense arising in my mind of what the answer should look like. Not exactly what it would look like--for example, I did not in any way anticipate the now-notorious pattern of vouchers being more popular among rich white Catholics and evangelicals and among poor blacks and Hispanics (maybe I should've anticipated it; I'm not proud of the level of ignorance that I had that allowed this finding to surprise me, I'm just stating the facts)--but what it could look like. Or, maybe it would be more accurate to say, various things that wouldn't look right, if I were to see them.
And the challenge is to get from point A to point B. So, you throw model after model at the problem, method after method, alternating between quick-and-dirty methods that get you nowhere, and elaborate models that give uninterpretable, nonsensical results. Until finally you get close. Actually, what happens is that you suddenly solve the problem! Unexpectedly, you're done! And boy is the result exciting. And you do some checking, fit to a different dataset maybe, or make some graphs showing raw data and model estimates together, or look carefully at some of the numbers, and you realize you have a problem. And you stare at your code for a long long time and finally bite the bullet, suck it up and do some active debugging, fake-data simulation, and all the rest. You code your quick graphs as diagnostic plots and build them into your procedure. And you go back and do some more modeling, and you get closer, and you never quite return to the triumphant feeling you had earlier--because you know that, at some point, the revolution will come again and with new data or new insights you'll have to start over on this problem, but, for now, yes, yes, you can stop, you can step back and put in the time--hours, days!--to make pretty graphs, you can bask in the successful solution of a problem. You can send your graphs out there and let people take their best shot. You've done it.
But, not so deep inside you, that not-so-still and not-so-small voice reminds you of the compromises you've made, the data you've ignored, the things you just don't know if you believe. You want to do more, but that will require more computing, more modeling, more theory. Yes, more theory. More understanding of what these things called models do. Because, just like storybook characters take on a life of their own, just like Gollum wouldn't die and Frank Bascombe comes up with wisecracks all on his own, and Ramona Quimby won't stay down even if you try to make her, and so on and so on and so on, just like these characters, each with his or her internal logic, so any statistical model worth fitting also has its internal logic, mathematical properties latent in its form but, Turing-machine-like, impossible to anticipate before applying it to data--not just "real data" (how I hate that phrase), but data from live problems. And then comes Statistical Theory--the good kind, the kind that tells us what our models can and cannot do, when they can bend with the data and when they snap. (Did you know that doubly-integrated white noise can't really turn corners? I didn't, until I tried to fit such a model to data that went up, then down.) And you do your best with your Theory, and your simulations, and even your computing (yuck!). But you move on. And you hope that when it's time to come back to this problem, you'll have some better models at hand, things like splines and time series cross sectional models, and you'll have a programming and modeling environment where you can just write down latent factors and have them interact, and you'll be able to include three-way interactions, and four-way interactions, and . . . and . . . you hope that in ten years you'll be fitting the models that, ten years ago, you thought you'd be fitting in five years. And you take a rest. You write up what you found and you write up exactly what you did (not always so easy to do). 
And a new question comes along. You want a quick answer. You try putting together available data in a simple way. You try some weighting. But you don't believe your answer. You need more data. You need more model. You get to work.
That's how it feels, from the inside.
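For readers who haven't tried it, the fake-data simulation step mentioned above can be sketched in a few lines. This is only a toy version under stated assumptions: a simple linear regression stands in for whatever model you're really fitting, and the parameter values, sample size, and function names are all invented for the illustration. The idea is to simulate data from known parameters, run your fitting code, and see if you get the parameters back.

```python
import random

def simulate(a, b, sigma, n, rng):
    """Fake data from y = a + b*x + noise, with known (chosen) parameters."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [a + b * x + rng.gauss(0, sigma) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for one predictor (the code under test)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b_hat = sxy / sxx
    a_hat = ybar - b_hat * xbar
    return a_hat, b_hat

rng = random.Random(2009)       # fixed seed so the check is reproducible
true_a, true_b = 1.5, 0.7       # hypothetical "true" values, chosen by us
xs, ys = simulate(true_a, true_b, sigma=1.0, n=2000, rng=rng)
a_hat, b_hat = fit_line(xs, ys)
print("true:", (true_a, true_b), "estimated:", (a_hat, b_hat))
```

If the estimates land far from the values you put in, the problem is in your code or your model, not in your data, which is exactly what you want to find out before going back to the live problem.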
John Sides links to this quote from Barney Frank:
Not for the first time, as a -- a -- an elected official, I envy economists. Economists have available to them, in an analytical approach, the counterfactual. Economists can explain that a given decision was the best one that could be made, because they can show what would have happened in the counterfactual situation. They can contrast what happened to what would have happened. No one has ever gotten reelected where the bumper sticker said, "It would have been worse without me." You probably can get tenure with that. But you can't win office.
I have two thoughts on this. First, I think Frank is a bit too confident in economists' ability to "show what would have happened in the counterfactual situation." Maybe "estimate" or "guess" or "hypothesize" would be a bit more accurate than "show." Recall this notorious graph, which shows the unintentional counterfactual of some economic predictions:

Second, I don't know how Frank can say that about "no one has ever gotten reelected . . ." In Frank's district in Massachusetts, it would take a lot--a lot--for a Democrat to not get reelected.
Alex Tabarrok and Matthew Yglesias comment on "the marginal utility of money income." I'll have to write something longer about this some day, but for now let me just reiterate my current understanding that there is no such thing as a utility function. Rather than people arguing over the shape of the utility function, I hope they can move forward to thinking more directly about what people will do with their money.
From my earlier blog entry:
Original title of article: "Estimating turnout, vote intention, and issue attitudes in subsets of the population"
New title: "Who votes? How did they vote? And what were they thinking?"
I was getting my haircut today, and the TV in the barbershop was set to some kids' channel that was featuring a show about some weird form of basketball where the players can bounce on a trampoline on the way to dunking the ball into the basket. Sort of a cool idea, should definitely appeal to the targeted demographic of 10-year-old boys. It was set up as though it was what we might call a "real" professional sports league, with teams, won-lost records, upcoming games, announcers calling plays, and with players including some retired NBA stars. Not quite as over-the-top as professional wrestling, but that sort of thing.
Anyway, what puzzled me about all this was how little action there was on the screen. There were lots of interviews with players, video features, highlights of previous games, replays, and logos, but very little actual basketball.
Is this what 10-year-old boys want? I'm sure they've done lots of marketing surveys, so the answer is probably yes. But it left me extremely confused. Here you have a made-for-TV sport, the rules can be anything they want--I'd think they'd want there to be as much action as possible--passing, dunking, running, jumping and all the rest. While the ball was in play, the players were impressively athletic. But the ball was almost never in play. To me, it was much less exciting than any random basketball game you might see on ESPN. Again, they can make any rules they want--so why do they do it this way? I'd think kids would prefer to see live action rather than a series of disconnected highlights and replays. Perhaps someone could explain to me?
Freedom House is currently seeking individuals with demonstrated professional experience to work with civil society organizations in Egypt through the International Executive Volunteers (IEV) program for 3 months beginning in September 2009.
Volunteers must have a minimum of five years of relevant professional experience, the ability to commit to 3 months of service, and a resourceful, innovative personality. Previous overseas experience, particularly in Egypt and in the Middle East and North Africa, is preferred.
Statistician/Polling Specialist
A statistician/polling specialist has been requested to provide support in the preparation and analysis of survey methodology and questionnaire data. Tasks will include designing work plans, managing logistics, reporting results to targeted groups, and developing relationships with key constituencies. Additional knowledge or expertise is needed for volunteer management - recruiting, retaining, and training for key projects. Arabic language skills are preferred but not required.
Recently I was invited to write an article on the philosophy of Bayesian statistics. For a long time I've been unhappy with the discussions of philosophy offered by Bayesian statisticians and also with the perspectives on Bayesian statistics coming from philosophers. I'd been planning for about fifteen years to write an article on the topic but had never gotten around to it, so I welcomed this opportunity.
I thought it made sense to do some reading, and I thought I'd start with Lakatos, whom I think of as a sort of rationalized Popper (Lakatos actually attributes some of his own ideas to a hypothetical Popper_2). In retrospect, I think this was a good choice. I like a lot of what Lakatos had to say--even though he didn't write much about statistics, or Bayesian statistics, most of the ideas transfer over fine, I think.
But that's not the reason for this note.
I'm writing here to tell you what happened when I ordered the two volumes of Lakatos's collected writings, published by Cambridge under the titles, "The Methodology of Scientific Research Programmes" and "Mathematics, Science, and Epistemology," paperbacks selling on Amazon for about 50 bucks each. I eagerly awaited their arrival in my mailbox, but when they finally came, and I opened them . . . they were really hard to read! The type was blurry.
I guess they took the original book and did some sort of crappy photoimaging . . . Hey! This is Cambridge University Press we're talking about, reprinting a classic academic book and not even taking the trouble to do it right! What's with that??? I can see that it might be a pain to retype the original book, but can't they scan in the text and reset it? Or, maybe even simpler, take their photoimaged text and run it through some software to unblur it? The current version is a joke, and I was embarrassed to even have it in my office.
I returned the volumes to Amazon and ordered the books from the Columbia library. (That was a pain too, but that's another story. I doubt the readers of my blog need to hear about my problems with the Columbia library.) These original hardcovers are fine. Not the greatest print job in the world, actually--I find the font pretty hard to read--but much better than the blur-o-matic that Cambridge was charging $100 for. (Oddly enough, the printing in my paperback copy of Proofs and Refutations is fine.)
P.S. Yes, yes, I know this is unimportant compared to all the hunger and strife in the world, etc etc. But still . . . what ever happened to professionalism?
Tobias Verbeke writes:
I just noticed in your blog post you use Sharon Lohr's book on sampling design and analysis for your course. Some time ago I made an R package with the datasets and a vignette which reproduces part of the analyses with Thomas Lumley's survey package.
This could be useful.
We're having some problem with the blog, where we get comments but they don't show up on the blog. We're trying to figure out what's going on. In the meantime, feel free to post your comments; they'll show up soon, I hope.
I wasn't actually so thrilled with how the course went--I last taught it a few years ago--but I thought it might help to share some of my experiences.
1. I used the excellent book by Lohr. And students always like when you follow the book.
2. That said, whenever I deviated from the straight sampling stuff and talked about modeling (for example, forecasting or missing data imputation or just an overview of regression), they loved it. Our students are much more interested in modeling than in sampling.
3. You have to decide ahead of time how much you want them to do with real data on the computer, and how much you want to have them deriving formulas. Either is ok, you just need to figure that out.
4. Stata is the standard software for survey sampling. I use R because that's what I know.
5. Lohr's book, like all books on surveys, is strongest on design, and weakest on analysis of surveys collected by others (survey weights and the like).
6. I assigned the Groves et al book as a supplementary text. It's a great book, but it didn't work so well to teach out of. It's still probably a good idea to assign it, just so students have it as a reference.
Here's a syllabus, a schedule of homework assignments, and some notes.
A bunch of years ago, I published an article (using some of the material in my Ph.D. thesis) in the Journal of Cerebral Blood Flow and Metabolism. It's ranked as the #25 journal in neuroscience, and has a pretty crappy impact factor of 5.7.
By comparison, the impact factors of the top statistics journals a few years ago were:
JASA 1.6, JRSS 1.5, Ann Stat 1.3, Ann Prob 0.9, Biometrika 1.8, Biometrics 1.1, Stat Sci 2.0, Technometrics 1.3.
So now you know why statisticians don't like impact factors.
I want to talk about some similarities between writing and statistical graphics. Just about everybody knows something about writing, and I'd like to help transfer some of this expertise to thinking about statistical graphics.
The story begins with some ugly pie charts I noticed the other day. I was commenting on them and suddenly realized . . . the graphs weren't as bad as I thought they were! To be more precise, the graphs had a lot of failings, but the sum total of all these problems wasn't so bad.
Here are the actual charts:
As I wrote earlier, these graphs have lots of obviously-fixable problems, most notably that the wedges aren't labeled directly. Instead, the reader has to go back and forth, back and forth, between the chart and the legend. On the other hand, the information is conveyed unambiguously.
I'd like to make the analogy to sloppy writing--misspellings, grammatical errors, sentence fragments and run-ons, garden-path sentences, distracting cliches, and all the rest. (All these "errors" can be used to good effect. No rule is absolute. For sure, baby. Much of the time, though, I think these really are mistakes rather than intentional use for emphasis or clarity.)
Why is sloppy writing a bad thing? For example, what's wrong with using "it's" instead of "its," or messing up subject-verb agreement, or losing track of an adverb's pointer, in a setting where the meaning is clear? The problem is that it creates work for your readers, who often have to double back to figure out the meaning. If you're Ezra Pound writing a poem, maybe you want to have that effect, but I don't think it's the goal of most journalists, news bloggers, etc.
OK, back to the pie charts. They could be worse, but they require a lot of work to read. Arguably, this criticism could be thrown at any graph: for example, I love line plots, but if you've never seen a line plot before, you'll struggle with it. The difference is that you can learn to read line plots, but you'll never be able to quickly read the pie charts shown above: no matter what, you have to go back and forth between the pie, the legend, the pie, the legend, and so forth, to keep it all in your mind at once.
To push the analogy further, I'm recommending what might be called the George Orwell approach to statistical graphics: the goal is to be clear as a window pane. This isn't the only option, though. There's the Chris Ware style: graphs that are tiny and nearly impossible to read, but if you stare at them for a long time you realize they actually make sense. Or the Martin Amis style: flashy gimmicks that make the graph fun to read even if you don't care so much about the subject. Or the Veronica Geng style: playing it straight while going over the top at the same time. And so forth.
I think some of the confusion that has arisen from Ed Tufte's work is that people read his book and then want to go make cool graphs of their own. But cool like Amis, not cool like Orwell. We each have our own styles, and I'm not trying to tell you what to do, just to help you look at your own writing and graphics so you can think harder about what you want your style to be.
P.S. Yes, yes, I'm sure I have various usage, grammatical, and stylistic errors above. Give me a break, man! It's just a blog entry. More to the point . . . by now you should trust me enough to think, when you see something discordant, that maybe I've done it on purpose!
P.P.S. Another issue is cost or effort. It wasn't necessarily worth it for Tom Schaller to learn a bunch of new graphical tools just to make his blog entry slightly easier to read. In my discussion above, I'm ignoring the investment in time required to think in terms of graphics and to learn the relevant software.
Ben Hyde pointed me to this data-based dating site. I have no comments on how it works for dates, but they have a lot of fun maps, for example this:
Are some human lives worth more than others?


268,864 people have answered
And this:
If you knew for sure you would not get caught,
would you commit murder for any reason?


359,761 people have answered
This is great; I can't resist giving a couple more:
Going through the Profiles in Research published by the Journal of Educational and Behavioral Statistics, I was amused to see the following concluding paragraph in the interview with Lyle Jones:
Despite my [Jones's] strong preference for interval estimation, there are situations for which a test of significance still may be appropriate. One is multiple comparisons, such as comparisons between all pairs of states for average student achievement scale scores in NAEP [National Assessment of Educational Progress]. A related application is assessing the goodness of fit of a model to an array of values. In these cases, interval estimation is not easily employed and the careful application of significance tests may continue to serve about as well as any alternative.
No! Not at all! My paper with Jennifer and Masanao specifically shows how interval estimation (i.e., multilevel modeling) solves the NAEP comparisons problem just fine (setting aside the question of whether we should be interested in these state-level averages in the first place). It's good to know that some progress has been made since 2003.
Here's our estimate of public support for vouchers, broken down by religion/ethnicity, income, and state:
(Click on image to see larger version.)
We're mapping estimates from a hierarchical Bayes model fit to data from the 2000 Annenberg survey (approximately 50,000 respondents).
In case you're wondering what Bayesian modeling did for us, here are the corresponding maps from the raw data (weighted to adjust for voter turnout, but that doesn't actually do that much anyway):
(Click on image to see larger version.)
OK, so Bayes gives you a lot. The costs?
Brendan O'Connor created a small applet that allows exploring the beta distribution interactively (just hit arrow keys on the keyboard):
This is a good example of what interactive visualization can do - Andreas Buja was also showing some cool examples some time ago.
He also has source available (for Processing).
What with all this discussion of causal inference, I thought I'd rerun a blog entry from a couple years ago about my personal trick for understanding instrumental variables:
A correspondent writes:
I've recently started skimming your blog (perhaps steered there by Brad deLong or Mark Thoma) but despite having waded through such enduring classics as Feller Vol II, Henri Theil's "Econometrics", James Hamilton's "Time Series Analysis", and T.W. Anderson's "Multivariate Analysis", I'm finding some of the discussions such as Pearl/Rubin a bit impenetrable. I don't have a stats degree so I am thinking there is some chunk of the core curriculum on modeling and causality that I am missing. Is there a book (likely one of yours - e.g. Bayesian Data Analysis) that you would recommend to help fill in my background?
1. I recommend the new book, "Mostly Harmless Econometrics," by Angrist and Pischke (see my review here).
2. After that, I'd read the following chapters from my book with Jennifer:
Chapter 9: Causal inference using regression on the treatment variable
Chapter 10: Causal inference using more advanced models
Here are some pretty pictures, from the low-birth-weight example:

and from the Electric Company example:
3. Beyond this, you could read the books by Morgan and Winship and by Pearl, but both of these are a bit more technical and less applied than the two books linked to above.
The commenters may have other suggestions.
A student at another university writes in with some questions about Red State, Blue State:

To learn why I made this graph, see here.
Robin Hanson is skeptical of my response in the following exchange:
Hanson: What do the customers who are paying your salary get from you?
Gelman: They learn how to fit multilevel models.
Richard Hahn writes:
In some talk slides you recently posted you have the following bullet point: "Need to go beyond exchangeability to shrink batches of parameters in a reasonable way." If you think other readers of the blog might find it interesting, I'd love to see you elaborate on this. While the whole talk is, of course, an elaboration, you do not elsewhere explicitly mention exchangeability. Isn't the point of de Finetti-style theorems that exchangeability is precisely the "reasonable" assumption that leads to parametric models with nice conditional independence properties? Such results entail that we're at liberty to make sophisticated, highly structured models based on conditional independence with the knowledge that a set of exchangeability judgments on observables lies back of them. Even very flexible, fancy DP-based Bayesian nonparametric models are based on notions of exchangeable random partitions. I'm probably just misreading you, but would be very interested in a clarification about what exactly you mean. If not, at root, exchangeability, then what else exactly is driving the batch shrinkage and how is it not ad hoc?
My quick reply: Consider a two-way data structure modeled as y_ij = a_i + b_j + c_ij, with no other information on the rows, the columns, or the individual cells. Then you have no choice but to model the a_i's and b_j's exchangeably. But the c_ij's can be modeled conditional on the a_i's and b_j's--that is, these latent parameters can be considered as group-level predictors. The model is still exchangeable on the i's and the j's, but not on the (ij)'s. This is sometimes called "partial exchangeability." More generally, one can consider three-way models, etc.
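To make the two-way structure concrete, here is a minimal simulation sketch. The specific form of the interaction (c_ij drawn around gamma * a_i * b_j) is my own assumption for illustration, as are the function name and all the scale parameters; the point is only that the row and column effects are drawn exchangeably while the cell-level terms depend on them.

```python
import random

def simulate_two_way(n_rows, n_cols, gamma, sigma_a=1.0, sigma_b=1.0,
                     sigma_c=0.5, seed=0):
    """Simulate y_ij = a_i + b_j + c_ij with c_ij modeled given a_i and b_j.

    The mean of c_ij is gamma * a_i * b_j, a hypothetical choice made only
    to show latent row/column effects acting as group-level predictors.
    """
    rng = random.Random(seed)
    # Row and column effects: exchangeable draws, no other information used.
    a = [rng.gauss(0.0, sigma_a) for _ in range(n_rows)]
    b = [rng.gauss(0.0, sigma_b) for _ in range(n_cols)]
    # Cell effects: not exchangeable across (i, j), since each depends on a_i, b_j.
    c = [[rng.gauss(gamma * a[i] * b[j], sigma_c) for j in range(n_cols)]
         for i in range(n_rows)]
    y = [[a[i] + b[j] + c[i][j] for j in range(n_cols)]
         for i in range(n_rows)]
    return a, b, c, y

a, b, c, y = simulate_two_way(n_rows=3, n_cols=4, gamma=0.5)
```

Permuting the row labels (or the column labels) leaves the joint distribution unchanged, but swapping two arbitrary cells does not, which is the sense in which the model is exchangeable on the i's and j's but not on the (ij)'s.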
Daniel Egan sent me a link to an article, "Standardized or simple effect size: What should be reported?" by Thom Baguley, that recently appeared in the British Journal of Psychology. Here's the abstract:
It is regarded as best practice for psychologists to report effect size when disseminating quantitative research findings. Reporting of effect size in the psychological literature is patchy -- though this may be changing -- and when reported it is far from clear that appropriate effect size statistics are employed. This paper considers the practice of reporting point estimates of standardized effect size and explores factors such as reliability, range restriction and differences in design that distort standardized effect size unless suitable corrections are employed. For most purposes simple (unstandardized) effect size is more robust and versatile than standardized effect size. Guidelines for deciding what effect size metric to use and how to report it are outlined. Foremost among these are: (i) a preference for simple effect size over standardized effect size, and (ii) the use of confidence intervals to indicate a plausible range of values the effect might take. Deciding on the appropriate effect size statistic to report always requires careful thought and should be influenced by the goals of the researcher, the context of the research and the potential needs of readers.
Egan writes:
I run into the problem of reporting coefficients all the time, mostly in the context of presenting effects to non-statisticians. While my audiences are generally bright, the obvious question always asked is "which of these is the biggest effect?" The fact that a sex dummy has a large numerical point estimate relative to number-of-purchases is largely irrelevant - it's because sex's range is tiny compared to other covariates. But moreover, sex is irrelevant to "policy-making" - we can't change a person's sex! So what we're interested in is the viable range over which we could influence an independent variable, and the second-order likely effect upon the dependent. So two questions: 1. For pedagogical effect, is there any way of getting around these problems? How can we communicate the effects to non-statisticians easily (and think someone who has exactly 10 minutes to understand your whole report) 2. Is there any easy way to infer the elasticity of the effect - i.e. how much can we change the dependent, by attempting to exogenously change one of the independents? While I know that I could design the experiment to do this, I work in far more observational data - and this "effect" size is really what matters the most.
My quick reply to Egan is to refer to my article with Iain Pardoe on average predictive comparisons, where we discuss some of these concerns.
I also have some thoughts on the Baguley article:
Among other things, while on sabbatical in Paris next year I'll be working with my longtime collaborator Frederic Bois, a toxicologist who uses hierarchical Bayes models extensively. We have a project in toxicology that necessarily also involves research in Bayesian computation.
And, there's a postdoctoral position available! Here are the details:
In the most recent round of our recent discussion, Judea Pearl wrote:
There is nothing in his theory of potential-outcome that forces one to "condition on all information" . . . Indiscriminate conditioning is a culturally-induced ritual that has survived, like the monarchy, only because it was erroneously supposed to do no harm.
I agree with the first part of Pearl's statement but not the second part (except to the extent that everything we do, from Bayesian data analysis to typing in English, is a "culturally induced ritual"). And I think I've spotted a key point of confusion.
To put it simply, Donald Rubin's approach to statistics has three parts:
1. The potential-outcomes model for causal inference: the so-called Neyman-Rubin model, in which observed data are viewed as a sample from a hypothetical population that, in the simplest case of a binary treatment, includes y_i^1 and y_i^2 for each unit i.
2. Bayesian data analysis: the mode of statistical inference in which you set up a joint probability distribution for everything in your model, then condition on all observed information to get inferences, then evaluate the model by comparing predictive inferences to observed data and other information.
3. Questions of taste: the preference for models supplied from the outside rather than models inspired by data, a preference for models with relatively few parameters (for example, trends rather than splines), a general lack of interest in exploratory data analysis, a preference for writing models analytically rather than graphically, an interest in causal rather than descriptive estimands.
As that last list indicates, my own taste in statistical modeling differs in some ways from Rubin's. But what I want to focus on here is the distinction between item 1 (the potential outcomes notation) and item 2 (Bayesian data analysis).
The potential outcome notation and Bayesian data analysis are logically distinct concepts!
Items 1 and 2 above can occur together or separately. All four combinations (yes/yes, yes/no, no/yes, no/no) are possible:
- Rubin uses Bayesian inference to fit models in the potential outcome framework.
- Rosenbaum (and, in a different way, Greenland and Robins) use the potential outcome framework but estimate using non-Bayesian methods.
- Most of the time I use Bayesian methods but am not particularly thinking about causal questions.
- And, of course, there's lots of statistics and econometrics that's non-Bayesian and does not use potential outcomes.
Bayesian inference and conditioning
In Bayesian inference, you set up a model and then you condition on everything that's been observed. Pearl writes, "Indiscriminate conditioning is a culturally-induced ritual." Culturally-induced it may be, but it's just straight Bayes. I'm not saying that Pearl has to use Bayesian inference--lots of statisticians have done just fine without ever cracking open a prior distribution--but Bayes is certainly a well-recognized approach. As I think I wrote the other day, I use Bayesian inference not because I'm under the spell of a centuries-gone clergyman; I do it because I've seen it work, for me and for others.
Pearl's mistake here, I think, is to confuse "conditioning" with "including on the right-hand side of a regression equation." Conditioning depends on how the model is set up. For example, in their 1996 article, Angrist, Imbens, and Rubin showed how, under certain assumptions, conditioning on an intermediate outcome leads to an inference that is similar to an instrumental variables estimate. They don't suggest including an intermediate variable as a regression predictor or as a predictor in a propensity score matching routine, and they don't suggest including an instrument as a predictor in a propensity score model.
If a variable is "an intermediate outcome" or "an instrument," this is information that must be encoded in the model, perhaps using words or algebra (as in econometrics or in Rubin's notation) or perhaps using graphs (as in Pearl's notation). I agree with Steve Morgan in his comment that Rubin's notation and graphs can both be useful ways of formulating such models. To return to the discussion with Pearl: Rubin is using Bayesian inference and conditioning on all information, but "conditioning" is relative to a model and does not at all imply that all variables are put in as predictors in a regression.
Another example of Bayesian inference is the poststratification which I spoke of yesterday (see item 3 here). But, as I noted then, this really has nothing to do with causality; it's just manipulation of probability distributions in a useful way that allows us to include multiple sources of information.
P.S. We're lucky to be living now rather than 500 years ago, or we'd probably all be sitting around in a village arguing about obscure passages from the Bible.
A websearch turned up this link to our report on Jeff and Justin's research. It's great to see this stuff out there, but, really, "LGBTQI"? The way things are going, we'll be going through the whole alphabet soon! There's gotta be another way. Once you have "Q" in there, doesn't that pretty much cover all the contingencies?
To continue with our discussion (earlier entries 1, 2, and 3):
1. Pearl has mathematically proved the equivalence of Pearl's and Rubin's frameworks. At the same time, Pearl and Rubin recommend completely different approaches. For example, Rubin conditions on all information, whereas Pearl does not do so. In practice, the two approaches are much different. Accepting Pearl's mathematics (which I have no reason to doubt), this implies to me that Pearl's axioms do not quite apply to many of the settings that I'm interested in.
I think we've reached a stable point in this part of the discussion: we can all agree that Pearl's theorem is correct, and we can disagree as to whether its axioms and conditions apply to statistical modeling in the social and environmental sciences. I'd claim some authority on this latter point, given my extensive experience in this area--and of course, Rubin, Rosenbaum, etc., have further experience--but of course I have no problem with Pearl's methods being used on political science problems, and we can evaluate such applications one at a time.
2. Pearl and I have many interests in common, and we've each written two books that are relevant to this discussion. Unfortunately, I have not studied Pearl's books in detail and I doubt he's had the time to read my books in detail either. It takes a lot of work to understand someone else's framework, work that we don't necessarily want to do if we're already spending a lot of time and effort developing our own research programmes. It will probably be the job of future researchers to make the synthesis. (Yes, yes, I know that Pearl feels that he already has the synthesis, and that he's proved this to be the case, but Pearl's synthesis doesn't yet take me all the way to where I want to go, which is to do my applied work in social and environmental sciences.) I truly am open to the possibility that everything I do can be usefully folded into Pearl's framework someday.
That said, I think Pearl is on shaky ground when he tries to say that Don Rubin or Paul Rosenbaum is making a major mistake in causal inference. If Pearl's mathematics implies that Rubin and Rosenbaum are making a mistake, then my first step would be to apply the syllogism the other way and see whether Pearl's assumptions are appropriate for the problem at hand.
3. I've discussed a poststratification example. As I discussed yesterday (see the first item here), a standard idea, both in survey sampling and causal inference, is to perform estimates conditional on background variables, and then average over the population distribution of the background variables to estimate the population average. Mathematically, p(theta) = sum_x p(theta|x)p(x). Or, if x is discrete and takes on only two values, p(theta) = (N_1 p(theta|x=1) + N_2 p(theta|x=2)) / (N_1 + N_2).
This has nothing at all to do with causal inference: it's straight Bayes.
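Here is the poststratification formula as a few lines of Python, just to show how little is going on. This is a sketch with invented numbers: the function name, the cell estimates, and the population counts are all hypothetical. It works on point estimates; with posterior simulations you would apply the same weighted average to each draw.

```python
def poststratify(cell_estimates, cell_counts):
    """Population-weighted average of cell-level estimates.

    cell_estimates: dict mapping cell label x -> estimate of theta given x
    cell_counts:    dict mapping cell label x -> population count N_x
    """
    total = sum(cell_counts[x] for x in cell_estimates)
    weighted = sum(cell_estimates[x] * cell_counts[x] for x in cell_estimates)
    return weighted / total

# Hypothetical two-cell example (numbers made up for illustration).
estimates = {1: 0.40, 2: 0.70}
counts = {1: 3000, 2: 1000}
population_estimate = poststratify(estimates, counts)
```

With these made-up inputs the population estimate is (3000 * 0.40 + 1000 * 0.70) / 4000 = 0.475, pulled toward cell 1 because that cell holds three quarters of the population.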
Pearl thinks that if the separate components p(theta|x) are nonidentifiable, that you can't do this, and you should not include x in the analysis. He writes:
I [Pearl] would really like to see how a Bayesian method estimates the treatment effect in two subgroups where it is not identifiable, and then, by averaging the two results (with two huge posterior uncertainties) gets the correct average treatment effect, which is identifiable, hence has a narrow posterior uncertainty. . . . I have no doubt that it can be done by fine-tuned tweaking . . . But I am talking about doing it the honest way, as you described it: "the uncertainties in the two separate groups should cancel out when they're being combined to get the average treatment effect." If I recall my happy days as a Bayesian, the only operation allowed in combining uncertainties from two subgroups is taking a linear combination of the two, weighted by the (given) relative frequencies of the groups. But, I am willing to learn new methods.
I'm glad that Pearl is willing to learn new methods--so am I--but no new methods are needed here! This is straightforward, simple Bayes. Rod Little has written a lot about these ideas. I wrote some papers on it in 1997 and 2004. Jeff Lax and Justin Phillips do it in their multilevel modeling and poststratification papers where, for the first time, they get good state-by-state estimates of public opinion on gay rights issues. No "fine-tuned tweaking" required. You just set up the model and it all works out. If the likelihood provides little to no information on theta|x but it does provide good information on the marginal distribution of theta, then this will work out fine.
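A toy calculation may help show why the uncertainties cancel. The setup below is my own simplification, not taken from any of the papers mentioned: two subgroup parameters with independent N(0, tau^2) priors, and a single data summary y ~ N(w_1 theta_1 + w_2 theta_2, s^2), so the likelihood informs only the weighted average. The function name and the numbers are hypothetical. The posterior is bivariate normal, and its covariance matrix follows from the Sherman-Morrison formula.

```python
import math

def posterior_sds(w, tau, s):
    """Posterior standard deviations in a toy two-subgroup normal model.

    Prior: theta_1, theta_2 independent N(0, tau^2).
    Data:  y ~ N(w[0]*theta_1 + w[1]*theta_2, s^2), which identifies only
           the weighted average, not theta_1 or theta_2 separately.
    Returns (sd of theta_1, sd of theta_2, sd of the weighted average).
    """
    ww = w[0] ** 2 + w[1] ** 2
    denom = s ** 2 + tau ** 2 * ww
    # Posterior covariance: tau^2 * I - tau^4 * w w' / denom (Sherman-Morrison).
    var1 = tau ** 2 - tau ** 4 * w[0] ** 2 / denom
    var2 = tau ** 2 - tau ** 4 * w[1] ** 2 / denom
    cov12 = -tau ** 4 * w[0] * w[1] / denom  # negative: the two errors offset
    var_avg = (w[0] ** 2 * var1 + w[1] ** 2 * var2
               + 2 * w[0] * w[1] * cov12)
    return math.sqrt(var1), math.sqrt(var2), math.sqrt(var_avg)

# Equal-sized subgroups, vague prior (tau = 10), precise data on the average.
sd1, sd2, sd_avg = posterior_sds(w=(0.5, 0.5), tau=10.0, s=0.1)
```

In this example each subgroup parameter keeps a posterior standard deviation of about 7, while the weighted average has a posterior standard deviation of about 0.1. The strong negative posterior correlation between theta_1 and theta_2 is what makes the two large uncertainties cancel in the linear combination.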
In practice, of course, nobody is going to control for x if we have no information on it. Bayesian poststratification really becomes useful in that it can put together different sources of partial information, such as data with small sample sizes in some cells, along with census data on population cell totals.
Please, please don't say "the correct thing to do is to ignore the subgroup identity." If you want to ignore some information, that's fine--in the context of the models you are using, it might even make sense. But Jeff and Justin and the rest of us use this additional information all the time, and we get a lot out of it. What we're doing is not incorrect at all. It's Bayesian inference. We set up a joint probability model and then work from it. If you want to criticize the probability model, that's fine. If you want to criticize the entire Bayesian edifice, then you'll have to go up against mountains of applied successes.
As I wrote earlier, you don't have to be a Bayesian (or, I could say, you don't have to be a Bayesian)--I have a great respect for the work of Hastie, Tibshirani, Robins, Rosenbaum, and many others who are developing methods outside the Bayesian framework--but I think you're on thin ice if you want to try to claim that Bayesian analysis is "incorrect."
4. Jennifer and I and many others make the routine recommendation to exclude post-treatment variables from analysis. But, as both Pearl and Rubin have noted in different contexts, it can be a very good idea to include such variables--it's just not a good idea to include them as regression predictors. If the only thing you're allowed to do is regression (as in chapter 9 of ARM), then I think it's a good idea to exclude post-treatment predictors. If you're allowed more general models, then one can and should include them. I'm happy to have been corrected by both Pearl and Rubin on this one.
5. As I noted yesterday (see second-to-last item here), all statistical methods have holes. This is what motivates us to consider new conceptual frameworks as well as incremental improvements in the systems with which we are most familiar.
Summary . . . so far
I doubt this discussion is over yet, but I hope the above notes will settle some points. In particular:
- I accept (on authority of Pearl, Wasserman, etc.) that Pearl has proved the mathematical equivalence of his framework and Rubin's. This, along with Pearl's other claim that Rubin and Rosenbaum have made major blunders in applied causal inference (a claim that I doubt), leads me to believe that Pearl's axioms are in some way not appropriate to the sorts of problems that Rubin, Rosenbaum, and I work on: social and environmental problems that don't have clean mechanistic causation stories. Pearl believes his axioms do apply to these problems, but then again he doesn't have the extensive experience that Rosenbaum and Rubin have. So I think it's very reasonable to suppose that his axioms aren't quite appropriate here.
- Poststratification works just fine. It's straightforward Bayesian inference, nothing to do with causality at all.
- I have been sloppy when telling people not to include post-treatment variables. Both Rubin and Pearl, in their different ways, have been more precise about this.
- Much of this discussion is motivated by the fact that, in practice, none of these methods currently solves all our applied problems in the way that we would like. I'm still struggling with various problems in descriptive/predictive modeling, and causation is even harder!
- Along with this, taste--that is, working with methods we're familiar with--matters. Any of these methods is only as good as the models we put into them, and we typically are better modelers when we use languages with which we're more familiar. (But not always. Sometimes it helps to liberate oneself, try something new, and break out of the implicit constraints we've been working on.)
I visited AT&T Labs today--lots of fun, a great group of people, an interesting mix of statistics and machine learning. They showed me some cool visualizations that I'll display soon.
Anyway, while I was there, somebody asked me about voters with different educational levels. In discussing it, we realized we wanted to break this down by ethnicity and age. So I quickly prepared a grid of graphs for him.
On the train ride back, I spent a few minutes making the graphs prettier:

These are based on raw Pew data, reweighted to adjust for voter turnout by state, income, and ethnicity. No modeling of vote on age, education, and ethnicity. I think our future estimates based on the 9-way model will be better, but these are basically OK, I think. All but six of the dots in the graph are based on sample sizes greater than 30.
To follow up on yesterday's discussion, I wanted to go through a bunch of different issues involving graphical modeling and causal inference.
Contents:
- A practical issue: poststratification
- 3 kinds of graphs
- Minimal Pearl and Minimal Rubin
- Getting the most out of Minimal Pearl and Minimal Rubin
- Conceptual differences between Pearl's and Rubin's models
- Controlling for intermediate outcomes
- Statistical models are based on assumptions
- In defense of taste
- Argument from authority?
- How could these issues be resolved?
- Holes everywhere
- What I can contribute
Philip Dawid (a longtime Bayesian researcher who's done work on graphical models, decision theory, and predictive inference) saw our discussion on causality and sends in some interesting thoughts, which I'll post here and then very briefly comment on:
Having just read through this fascinating interchange, I [Dawid] confess to finding Shrier and Pearl's examples and arguments more convincing than Rubin's. At the risk of adding to the confusion, but also in hope of helping at least some others, let me briefly describe yet another way (related to Pearl's, but with significant differences) of formulating and thinking about the problem. For those who, like me, may be concerned about the need to consider the probabilistic behaviour of counterfactual variables, on the one hand, or deterministic relationships encoded graphically, on the other, this provides an observable-focused, fully stochastic, alternative. A full presentation of the essential ideas can be found in Chapters 9 (Confounding and Sufficient Covariates) and 10 (Reduction of Sufficient Covariate) of my online document "Principles of Statistical Causality".
Like Pearl, I like to think of "causal inference" as the task of inferring what would happen under a hypothetical intervention, say F_E = e, that sets the value of the exposure E at e, when the data available are collected, not under the target "interventional regime", but under some different "observational regime". We could code this regime as F_E = idle. We can think of the non-stochastic variable F_E as a parameter, indexing the joint distribution of all the variables in the problem, under the regime indicated by its value.
Aleks was nice enough to pass this on to us.
Stuart Buck writes:
You posted about this once on your blog, i.e., how many times observational studies have been refuted by clinical trials. Check out the following, especially Table 3.
This Thursday at 7pm Jake Hofman and Suresh Velagapundi will present a session at New York R Statistical Programming Meetup at NYU - Silver Center (100 Washington Square East, Room 401). Here's the outline:
Background:From Principles to Practice:
- Conditional probability & Bayes' Rule
- Treating parameters as random variables & putting distributions on them
- Bayesian inference: from priors & likelihoods to posteriors
- Simple plan; difficult to execute (normalization)
- Resort to approximation methods (variational & MCMC)
- Model selection / complexity control a la Bayes
Greg Mankiw links to an article that illustrates the challenges of interpreting raw numbers causally. This would really be a great example for your introductory statistics or economics classes, because the article, by Robert Book, starts off by identifying a statistical error and then goes on to make a nearly identical error of its own! Fun stuff.
This is a pretty long one. It's an attempt to explore some of the differences between Judea Pearl's and Don Rubin's approaches to causal inference, and is motivated by a recent article by Pearl.
Pearl sent me a link to this piece of his, writing:
I [Pearl] would like to encourage a blog-discussion on the main points raised there. For example:
Whether graphical methods are in some way "less principled" than other methods of analysis.
Whether confounding bias can only decrease by conditioning on a new covariate.
Whether the M-bias, when it occurs, is merely a mathematical curiosity, unworthy of researchers attention.
Whether Bayesianism instructs us to condition on all available measurements.
I've never been able to understand Pearl's notation: notions such as a "collider of an M-structure" remain completely opaque to me. I'm not saying this out of pride--I expect I'd be a better statistician if I understood these concepts--but rather to give a sense of where I'm coming from. I was a student of Rubin and have used his causal ideas for a while, starting with this article from 1990 on estimating the incumbency advantage in politics. I'm pleased to see these ideas gaining wider acceptance. In many areas (including studying incumbency, in fact), I think the most helpful feature of Rubin's potential-outcome framework is to get you, as a researcher, to think hard about what you are in fact trying to estimate. In much of the current discussion of identification strategies, regression discontinuities, differences in differences, and the like, I think there's too much focus on technique and not enough thought put into what the estimates are really telling you. That said, it makes sense that other theoretical perspectives such as Pearl's could be useful too.
To return to the article at hand: Pearl is clearly frustrated by what he views as Rubin's bobbing and weaving to avoid a direct settlement of their technical dispute. From the other direction, I think Rubin is puzzled by Pearl's approach and is not clear what the point of it all is.
I can't resolve the disagreements here, but maybe I can clarify some technical issues.
Controlling for pre-treatment and post-treatment variables
Much of Pearl's discussion turns upon notions of "bias," which in a Bayesian context is tricky to define. We certainly aren't talking about the classical-statistical "unbiasedness," in which E(theta.hat | theta) = theta for all theta, an idea that breaks down horribly in all sorts of situations (see page 248 of Bayesian Data Analysis). Statisticians are always trying to tell people, Don't do this, Don't do that, but the rules for saying this can be elusive. This is not just a problem for Pearl: my own work with Rubin suffers from similar problems. In chapter 7 of Bayesian Data Analysis (a chapter that is pretty much my translation of Rubin's ideas), we talk about how you can't do this and you can't do that. We avoid the term "bias," but then it can be a bit unclear what our principles are. For example, we recommend that your model should, if possible, include all variables that affect the treatment assignment. This is good advice, but really we could go further and just recommend that an appropriate analysis should include all variables that are potentially relevant, to avoid omitted-variable bias (or the Bayesian equivalent). Once you've considered a variable, it's hard to go back to the state of innocence in which that information was never present.
If I'm reading his article correctly, Pearl is making two statistical points, both in opposition to Rubin's principle that a Bayesian analysis (and, by implication, any statistical analysis) should condition on all available information:
1. When it comes to causal inference, Rubin says not to control for post-treatment variables (that is, intermediate outcomes), which seems to contradict Rubin's more general advice as a Bayesian to condition on everything.
2. Rubin (and his collaborators such as Paul Rosenbaum) state unequivocally that a model should control for all pre-treatment variables, even though including such variables, in Pearl's words, "may create spurious associations between treatment and outcome and this, in turns, may increase or decrease confounding bias."
Let me discuss each of these criticisms, as best as I can understand them. Regarding the first point, a Bayesian analysis can control for intermediate outcomes--that's ok--but then the causal effect of interest won't be summarized by a single parameter--a "beta"--from the model. In our book, Jennifer and I recommend not controlling for intermediate outcomes, and a few years ago I heard Don Rubin make a similar point in a public lecture (giving an example where the great R. A. Fisher made this mistake). Strictly speaking, though, you can control for anything; you just then should suitably postprocess your inferences to get back to your causal inferences of interest.
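To make the point about intermediate outcomes concrete, here is a minimal simulation sketch. The data-generating process and every coefficient in it are invented for illustration and do not come from Pearl's or Rubin's examples. It shows that once you control for an intermediate outcome M, the coefficient on the treatment T is no longer the total causal effect, but that the total effect can be recovered by postprocessing the fitted coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical process (all numbers are assumptions for illustration):
# treatment T -> intermediate outcome M -> outcome Y, plus a direct path T -> Y.
T = rng.integers(0, 2, size=n).astype(float)   # randomized binary treatment
M = 1.5 * T + rng.normal(size=n)               # intermediate outcome
Y = 1.0 * T + 2.0 * M + rng.normal(size=n)     # total effect = 1.0 + 2.0 * 1.5 = 4.0


def ols(y, *predictors):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Regression without the intermediate outcome: the coefficient on T
# estimates the total effect (about 4.0).
beta_total = ols(Y, T)[1]

# Regression controlling for M: the coefficient on T now estimates only
# the direct path (about 1.0), not the causal effect of interest.
_, beta_direct, beta_m = ols(Y, T, M)

# Postprocessing: combine with the effect of T on M to get back the total effect.
alpha_m = ols(M, T)[1]
recovered = beta_direct + beta_m * alpha_m

print(f"total effect, no control for M: {beta_total:.3f}")
print(f"coefficient on T given M:       {beta_direct:.3f}")
print(f"recovered total effect:         {recovered:.3f}")
```

In this linear setup the recovered value matches the uncontrolled estimate exactly (up to floating-point error), which is the sense in which you can control for anything so long as you postprocess afterward. In nonlinear or confounded settings the postprocessing step would be more involved.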
I don't fully understand Pearl's second critique, in which he says that it's not always a good idea to control for pre-treatment variables. My best reconstruction is that Pearl is thinking about a messy observational setting in which there are some important unobserved confounders, and it could well happen that controlling for a particular pre-treatment variable happens to make the confounding worse. The idea, I think, is that if you have an analysis where various problems cancel each other out, then fixing one of these problems (by controlling for one potential confounder) could result in a net loss. I can believe this could happen in practice, but I'm wary of setting this up as a principle. I'd rather control for all the pre-treatment predictors that I can, and then make adjustments if necessary to attempt to account for remaining problems in the model. Perhaps Pearl's position and mine are not so far apart, however, if his approach of not controlling for a covariate could be seen as an approximation to a fuller model that controls for it while also adjusting for other, unobserved, confounders.
The sum of unidentifiable components can be identifiable
At other points, Pearl seems to be displaying a misunderstanding of Bayesian inference (at least, as I see it). For example, he writes:
For example, if we merely wish to predict whether a given person is a smoker, and we have data on the smoking behavior of seat-belt users and non-users, we should condition our prior probability P(smoking) on whether that person is a "seat-belt user" or not. Likewise, if we wish to predict the causal effect of smoking for a person known to use seat-belts, and we have separate data on how smoking affects seat-belt users and non-users, we should use the former in our prediction. . . . However, if our interest lies in the average causal effect over the entire population, then there is nothing in Bayesianism that compels us to do the analysis in each subpopulation separately and then average the results. The class-specific analysis may actually fail if the causal effect in each class is not identifiable.
I think this discussion misses the point in two ways.
First, at the technical level, yes you definitely can estimate the treatment effect in two separate groups and then average. Pearl is worried that the two separate estimates might not be identifiable--in Bayesian terms, that they will individually have large posterior uncertainties. But, if the study really is being done in a setting where the average treatment effect is identifiable, then the uncertainties in the two separate groups should cancel out when they're being combined to get the average treatment effect. If the uncertainties don't cancel, it sounds to me like there must be some additional ("prior") information that you need to add.
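Here is a minimal sketch of how that cancellation can work, under assumptions chosen purely for illustration: two group-level effects with independent, weak normal priors, and data that inform only their equal-weighted average. The prior scale, the measurement error, and the observed value are all made up. With a normal prior and a normal likelihood the posterior is available in closed form.

```python
import numpy as np

# Assumed setup (not from Pearl's or Rubin's examples):
tau = 10.0      # prior sd for each group effect: theta_j ~ N(0, tau^2)
sigma = 0.1     # sd of the data-based estimate of the average effect
y_obs = 1.0     # observed estimate of (theta_1 + theta_2) / 2

a = np.array([0.5, 0.5])   # weights defining the average treatment effect

# Conjugate normal update: posterior precision = prior precision + a a^T / sigma^2.
prior_precision = np.eye(2) / tau**2
post_precision = prior_precision + np.outer(a, a) / sigma**2
post_cov = np.linalg.inv(post_precision)
post_mean = post_cov @ (a * y_obs / sigma**2)

sd_each = np.sqrt(np.diag(post_cov))                      # per-group uncertainty
corr = post_cov[0, 1] / (sd_each[0] * sd_each[1])         # posterior correlation
sd_avg = np.sqrt(a @ post_cov @ a)                        # uncertainty in the average

print("posterior sd of each group effect:", sd_each.round(2))
print("posterior correlation:", round(corr, 4))
print("posterior sd of the average effect:", round(sd_avg, 3))
```

Each group effect keeps a posterior standard deviation close to its prior scale, but the two are almost perfectly negatively correlated, so their average is pinned down to about the precision of the data. This is only one stylized case; with an improper prior the separate effects would have no proper posterior at all, which is where the additional prior information comes in.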
The second way that I disagree with Pearl's example is that I don't think it makes sense to estimate the smoking behavior separately for seat-belt users and non-users. This just seems like a weird thing to be doing. I guess I'd have to see more about the example to understand why someone would do this. I have a lot of confidence in Rubin, so if he actually did this, I expect he had a good reason. But I'd have to see the example first.
Final thoughts
Hal Stern once told me the real division in statistics was not between the Bayesians and non-Bayesians, but between the modelers and the non-modelers. The distinction isn't completely clear--for example, where does the "Bell Labs school" of Cleveland, Hastie, Tibshirani, etc. fall?--but I like the idea of sharing a category with all the modelers over the years--even those who have not felt the need to use Bayesian methods.
Reading Pearl's article, however, reminded me of another distinction, this time between discrete models and continuous models. I have a taste for continuity and always like setting up my model with smooth parameters. I'm just about never interested in testing whether a parameter equals zero; instead, I'd rather infer about the parameter in a continuous space. To me, this makes particular sense in the sorts of social and environmental statistics problems where I work. For example, is there an interaction between income, religion, and state of residence in predicting one's attitude toward school vouchers? Yes. I knew this ahead of time. Nothing is zero, everything matters to some extent. As discussed in chapter 6 of Bayesian Data Analysis, I prefer continuous model expansion to discrete model averaging.
In contrast, Pearl, like many other Bayesians I've encountered, seems to prefer discrete models and procedures for finding conditional independence. In some settings, this can't matter much: if a source of variation is small, then maybe not much is lost by setting it to zero. But it changes one's focus, pointing Pearl toward goals such as "eliminating bias" and "covariate selection" rather than toward the goals of modeling the relations between variables. I think graphical models are a great idea, but given my own preferences toward continuity, I'm not a fan of the sorts of analyses that attempt to discover whether variables X and Y really have a link between them in the graph. My feeling is, if X and Y might have a link, then they do have a link. The link might be weak, and I'd be happy to use Bayesian multilevel modeling to estimate the strength of the link, partially pool it toward zero, and all the rest--but I don't get much out of statistical procedures that seek to estimate whether the link is there or not.
Finally, I'd like to steal something I wrote a couple years ago regarding disputes over statistical methodology:
Different statistical methods can be used successfully in applications--there are many roads to Rome--and so it is natural for anyone (myself included) to believe that our methods are particularly good for applications. For example, Adrian Raftery does excellent applied work using discrete model averaging, whereas I don't feel comfortable with that approach. Brad Efron has used bootstrapping to help astronomers solve their statistical problems. Etc etc. I don't think that Adrian's methods are particularly appropriate to sociology, or Brad's to astronomy--these are just powerful methods that can work in a variety of fields. Given that we each have successes, it's unsurprising that we can each feel strongly in the superiority of our own approaches. And I certainly don't feel that the approaches in Bayesian Data Analysis are the end of the story. In particular, nonparametric methods such as those of David Dunson, Ed George, and others seem to have a lot of advantages.
Similarly, Pearl has achieved a lot of success and so it would be silly for me to argue, or even to think, that he's doing everything all wrong. I think this expresses some of Pearl's frustration as well: Rubin's ideas have clearly been successful in applied work, so it would be awkward to argue that Rubin is actually doing the wrong thing in the problems he's worked on. It's more that any theoretical system has holes, and the expert practitioners in any system know how to work around these holes.
P.S. More here (and follow the links for still more).
This note by Steve Hsu on the history of the Wranglers (winners of a mathematics competition held each year from 1753-1909 at Cambridge University) reminded me of my experience in the U.S. math olympiad training program in high school. At the time, it seemed clear that we were ordered by ability (with my position somewhere between 15th and 20th out of 24!). In retrospect, I think there are a lot of tricks to solving and writing up solutions to "Olympiad problems," and I didn't know a lot of these tricks.
It was the usual paradox of measurement: I was confusing reliability with validity, as they say in the psychometric literature.
Daljit Dhadwal writes:
On the Ask Metafilter site, someone asked the following: How does statistical analysis differ when analyzing the entire population rather than a sample? I need to do some statistical analysis on legal cases. I happen to have the entire population rather than a sample. I'm basically interested in the relationship between case outcomes and certain features (e.g., time, the appearance of certain words or phrases in the opinion, the presence or absence of certain issues). Should I do anything different than I would if I were using a sample? For example, is a p-value meaningful in this kind of case?
My reply:
This is a question that comes up a lot. For example, what if you're running a regression on the 50 states? These aren't a sample from a larger number of states; they're the whole population.
To get back to the question at hand, it might be that you're thinking of these cases as a sample from a larger population that includes future cases as well. Or, to put it another way, maybe you're interested in making predictions about future cases, in which case the relevant uncertainty comes from the year-to-year variation. That's what we did when estimating the seats-votes curve: we set up a hierarchical model with year-to-year variation estimated from a separate analysis. (Original model is here, later version is here.)
So, one way of framing the problem is to think of your "entire population" as a sample from a larger population, potentially including future cases. Another frame is to think of there being an underlying probability model. If you're trying to understand the factors that predict case outcomes, then the implicit full model includes unobserved factors (related to the notorious "error term") that contribute to the outcome. If you set up a model including a probability distribution for these unobserved outcomes, standard errors will emerge.
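As a rough illustration of the second framing, the sketch below fits a regression to a complete "population" of 50 units and computes the usual standard errors, which come from the assumed probability model for the error term and not from any sampling of units. The data are simulated and the coefficients are arbitrary; nothing here corresponds to real state-level or legal-case data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units = 50   # the entire "population", e.g. all 50 states

# Hypothetical predictor and outcome; the error term stands in for the
# unobserved factors that contribute to the outcome.
x = rng.normal(size=n_units)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n_units)

# Least-squares fit with an intercept.
X = np.column_stack([np.ones(n_units), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standard errors implied by the model y = X beta + error, error ~ N(0, sigma^2).
resid = y - X @ coef
sigma2_hat = resid @ resid / (n_units - X.shape[1])
cov = sigma2_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov))

print("coefficients:   ", coef.round(3))
print("standard errors:", se.round(3))
```

The standard errors here describe uncertainty about the coefficients given the unexplained variation, so they remain meaningful even though no unit was left out of the data.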
After finding the Howard Wainer interview, I looked up the entire series of Profiles in Research published by the Journal of Educational and Behavioral Statistics. I don't have much to say about most of these interviews: some of these people I'd never heard of, and I don't really have much research overlap with the others. Probably I have the most overlap with R. D. Bock, who's done a lot of work on multilevel modeling, but, for whatever reason, his stories didn't grab my interest.
But I was curious about the interview with Arthur Jensen. I've never met him--he gave a talk at the Berkeley statistics department once when I was there, but for some reason I wasn't able to attend the talk. But I've heard of him. As the interviewers (Daniel Robinson and Howard Wainer) state:
A correspondent read my recent note on the limited influence of the median voter and writes:
My understanding of median voter theorem is that each election has its own median voter, and that the median voter's influence is limited to the outcome of that election only. I don't understand, then, why the graph in your post is evidence that the median voter has little influence. It seems to me that there are two elections being considered in that graph, with two different median voters. The graph appears to consider "moderation" to be having a moderate voting record in Congress, but it seems to me that the median voter in Congress is likely quite different from the median voter in any particular Congressional district. The power of the median voter in Congress, it seems to me, is to affect the outcome of Congressional votes, not to improve his own chances for re-election, which are determined by his proximity to the median voter in his district. Thus, I'm not sure why we would expect moderation, as measured by the median Congressional voter, to translate into electoral success, which we would expect to be determined by the median district voter.
My reply:





