
This post is by Phil.

I love this post by Jialan Wang. Wang "downloaded quarterly accounting data for all firms in Compustat, the most widely-used dataset in corporate finance that contains data on over 20,000 firms from SEC filings" and looked at the statistical distribution of leading digits in various pieces of financial information. As expected, the distribution is very close to what is predicted by Benford's Law.

Very close, but not identical. But does that mean anything? Benford's "Law" isn't really a law, it's more of a rule or principle: it's certainly possible for the distribution of leading digits in financial data --- even a massive corpus of it --- to deviate from the rule without this indicating massive fraud or error. But, aha, Wang also looks at how the deviation from Benford's Law has changed with time, and looks at it by industry, and this is where things get really interesting and suggestive. I really can't summarize any better than Wang did, so click on the first link in this post and go read it. But come back here to comment!
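For readers who want to see the mechanics of such a check, here is a minimal sketch in Python. The data below are simulated as a stand-in (Wang's analysis uses Compustat accounting figures, which I'm not reproducing here); the comparison is simply observed leading-digit frequencies against Benford's log10(1 + 1/d).

```python
import numpy as np

def leading_digit(x):
    """First significant digit of a positive number."""
    return int(f"{x:e}"[0])

# Simulated stand-in for a column of financial figures: positive values
# spanning several orders of magnitude.
rng = np.random.default_rng(0)
values = np.exp(rng.normal(10, 3, size=100_000))

digits = np.array([leading_digit(v) for v in values])
observed = np.array([(digits == d).mean() for d in range(1, 10)])
benford = np.log10(1 + 1 / np.arange(1, 10))

for d, obs, expected in zip(range(1, 10), observed, benford):
    print(f"{d}: observed {obs:.3f}, Benford {expected:.3f}")
```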

The blog is now here. Time to update your feed (if it didn't happen automatically).

Kind of Bayesian

Astrophysicist Andrew Jaffe pointed me to this discussion of my philosophy of statistics (which is, in turn, my rational reconstruction of the statistical practice of Bayesians such as Rubin and Jaynes). Jaffe's summary is fair enough and I disagree on only a few points:

1. Jaffe writes:

Subjective probability, at least the way it is actually used by practicing scientists, is a sort of "as-if" subjectivity -- how would an agent reason if her beliefs were reflected in a certain set of probability distributions? This is why when I discuss probability I try to make the pedantic point that all probabilities are conditional, at least on some background prior information or context.

I agree, and my problem with the usual procedures used for Bayesian model comparison and Bayesian model averaging is not that these approaches are subjective but that the particular models being considered don't make sense. I'm thinking of the sorts of models that say the truth is either A or B or C. As discussed in chapter 6 of BDA, I prefer continuous model expansion to discrete model averaging.

Either way, we're doing Bayesian inference conditional on a model; I'd just rather do it on a model that I like. There is some relevant statistical analysis here, I think, about how these different sorts of models perform under different real-world situations.

2. Jaffe writes that I view my philosophy as "Popperian rather than Kuhnian." That's not quite right. In my paper with Shalizi, we speak of our philosophy as containing elements of Popper, Kuhn, and Lakatos. In particular, we can make a Kuhnian identification of Bayesian inference within a model as "normal science" and model checking and replacement as "scientific revolution." (From a Lakatosian perspective, I identify various responses to model checks as different forms of operations in a scientific research programme, ranging from exception-handling through modification of the protective belt of auxiliary hypotheses through full replacement of a model.)

3. Jaffe writes that I "make a rather strange leap: deciding amongst any discrete set of parameters falls into the category of model comparison." This reveals that I wasn't so clear in stating my position. I'm not saying that a Bayesian such as myself shouldn't or wouldn't apply Bayesian inference to a discrete-parameter model. What I was saying is that my philosophy isn't complete. Direct Bayesian inference is fine with some discrete-parameter models (for example, a dense discrete grid approximating a normal prior distribution) but not for others (for example, discrete models for variable selection, where any given coefficient is either "in" (that is, estimated by least squares with a flat prior) or "out" (set to be exactly zero)). My incoherence is that I don't really have a clear rule of when it's OK to do Bayesian model averaging and when it's not.

As noted in my recent article, I don't think this incoherence is fatal--all other statistical frameworks I know of have incoherence issues--but it's interesting.

Andy McKenzie writes:

In their March 9 "counterpoint" in nature biotech to the prospect that we should try to integrate more sources of data in clinical practice (see "point" arguing for this), Isaac Kohane and David Margulies claim that,

"Finally, how much better is our new knowledge than older knowledge? When is the incremental benefit of a genomic variant(s) or gene expression profile relative to a family history or classic histopathology insufficient and when does it add rather than subtract variance?"

Perhaps I am mistaken (thus this email), but it seems that this claim runs counter to the definition of conditional probability. That is, if you have a hierarchical model, and the family history / classical histopathology already suggests a parameter estimate with some variance, how could the new genomic info possibly increase the variance of that parameter estimate? Surely the question is how much variance the new genomic info reduces and whether it therefore justifies the cost. Perhaps the authors mean the word "relative" as "replacing," but then I don't see why they use the word "incremental." So color me confused, and I'd love it if you could help me clear this up.

My reply:

We consider this in chapter 2 of Bayesian Data Analysis, I think in a couple of the homework problems. The short answer is that, in expectation, the posterior variance decreases as you get more information, but, depending on the model, in particular cases the variance can increase. For some models, such as the normal and binomial, the posterior variance can only decrease. But consider the t model with low degrees of freedom (which can be interpreted as a mixture of normals with common mean and different variances). If you observe an extreme value, that's evidence that the variance is high, and indeed your posterior variance can go up.
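Here's a minimal grid-approximation sketch of that phenomenon. The prior, the t scale, and the observed values are all made up for illustration (this is not the BDA exercise itself): with a normal prior and a t likelihood with 1 degree of freedom, a moderate observation shrinks the posterior standard deviation below the prior's, while an observation that conflicts with the prior inflates it.

```python
import numpy as np
from scipy import stats

# Grid approximation: prior mu ~ N(0, 1); one observation y ~ t_1(mu, 0.5),
# i.e. a t likelihood with 1 degree of freedom (Cauchy).  All numbers are
# illustrative, not from any real analysis.
grid = np.linspace(-10, 10, 20001)
prior = stats.norm.pdf(grid, 0, 1)

def posterior_sd(y, df=1, scale=0.5):
    """Posterior sd of mu on the grid after observing y."""
    like = stats.t.pdf(y, df, loc=grid, scale=scale)
    post = prior * like
    post /= post.sum()
    mean = (grid * post).sum()
    return np.sqrt(((grid - mean) ** 2 * post).sum())

print("prior sd of mu:", 1.0)
print("posterior sd after y = 0.5:", posterior_sd(0.5))  # shrinks below the prior sd
print("posterior sd after y = 3.0:", posterior_sd(3.0))  # exceeds the prior sd
```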

That said, the quote above might be addressing a different issue, that of overfitting. But I always say that overfitting is not a problem of too much information, it's a problem of a model that's not set up to handle a given level of information.

I think I'm starting to resolve a puzzle that's been bugging me for a while.

Pop economists (or, at least, pop micro-economists) are often making one of two arguments:

1. People are rational and respond to incentives. Behavior that looks irrational is actually completely rational once you think like an economist.

2. People are irrational and they need economists, with their open minds, to show them how to be rational and efficient.

Argument 1 is associated with "why do they do that?" sorts of puzzles. Why do they charge so much for candy at the movie theater, why are airline ticket prices such a mess, why are people drug addicts, etc. The usual answer is that there's some rational reason for what seems like silly or self-destructive behavior.

Argument 2 is associated with "we can do better" claims such as why we should fire 80% of public-school teachers or Moneyball-style stories about how some clever entrepreneur has made a zillion dollars by exploiting some inefficiency in the market.

The trick is knowing whether you're gonna get 1 or 2 above. They're complete opposites!

Our story begins . . .

Here's a quote from Steven Levitt:

One of the easiest ways to differentiate an economist from almost anyone else in society is to test them with repugnant ideas. Because economists, either by birth or by training, have their mind open, or skewed in just such a way that instead of thinking about whether something is right or wrong, they think about it in terms of whether it's efficient, whether it makes sense. And many of the things that are most repugnant are the things which are indeed quite efficient, but for other reasons -- subtle reasons, sometimes, reasons that are hard for people to understand -- are completely and utterly unacceptable.

As statistician Mark Palko points out, Levitt is making an all-too-convenient assumption that people who disagree with him are disagreeing because of closed-mindedness. Here's Palko:

There are few thoughts more comforting than the idea that the people who disagree with you are overly emotional and are not thinking things through. We've all told ourselves something along these lines from time to time.

I could add a few more irrational reasons to disagree with Levitt: political disagreement (on issues ranging from abortion to pollution) and simple envy at Levitt's success. (It must make the haters even more irritated that Levitt is, by all accounts, amiable, humble, and a genuinely nice guy.) In any case, I'm a big fan of Freakonomics.

But my reaction to reading the above Levitt quote was to think of the puzzle described at the top of this entry. Isn't it interesting, I thought, that Levitt is identifying economists as rational and ordinary people as irrational. That's argument 2 above. In other settings, I think we'd hear him saying how everyone responds to incentives and that what seems like "efficiency" to do-gooding outsiders is actually not efficient at all. The two different arguments get pulled out as necessary.

The set of all sets that don't contain themselves

Which in turn reminds me of this self-negating quote from Levitt protégé Emily Oster:

anthropologists, sociologists, and public-health officials . . . believe that cultural differences--differences in how entire groups of people think and act--account for broader social and regional trends. AIDS became a disaster in Africa, the thinking goes, because Africans didn't know how to deal with it.

Economists like me [Oster] don't trust that argument. We assume everyone is fundamentally alike; we believe circumstances, not culture, drive people's decisions, including decisions about sex and disease.

I love this quote for its twisted logic. It's Russell's paradox all over again. Economists are different from everybody else, because . . . economists "assume everyone is fundamentally alike"! But if everyone is fundamentally alike, how is it that economists are different "from almost anyone else in society"? All we can say for sure is that it's "circumstances, not culture." It's certainly not "differences in how entire groups of people think and act"--er, unless these groups are economists, anthropologists, etc.

OK, fine. I wouldn't take these quotations too seriously; they're just based on interviews, not careful reflection. My impression is that these quotes come from a simple division of the world into good and bad things:

- Good: economists, rationality, efficiency, thinking the unthinkable, believing in "circumstances"

- Bad: anthropologists, sociologists, public-health officials, irrationality, being deterred by repugnant ideas, believing in "culture"

Good is entrepreneurs, bad is bureaucrats. At some point this breaks down. For example, if Levitt is hired by a city government to help reform its school system, is he a rational, taboo-busting entrepreneur (a good thing) or a culture-loving bureaucrat who thinks he knows better than everybody else (a bad thing)? As a logical structure, the division into Good and Bad has holes. But as emotionally-laden categories ("fuzzy sets," if you will), I think it works pretty well.

The solution to the puzzle

OK, now to return to the puzzle that got us started. How is it that economics-writers such as Levitt are so comfortable flipping back and forth between argument 1 (people are rational) and argument 2 (economists are rational, most people are not)?

The key, I believe, is that "rationality" is a good thing. We all like to associate with good things, right? Argument 1 has a populist feel (people are rational!) and argument 2 has an elitist feel (economists are special!). But both are ways of associating oneself with rationality. It's almost like the important thing is to be in the same room with rationality; it hardly matters whether you yourself are the exemplar of rationality, or whether you're celebrating the rationality of others.

Conclusion

I'm not saying that arguments based on rationality are necessarily wrong in particular cases. (I can't very well say that, given that I wrote an article on why it can be rational to vote.) I'm just trying to understand how pop-economics can so rapidly swing back and forth between opposing positions. And I think it's coming from the comforting presence of rationality and efficiency in both formulations. It's ok to distinguish economists from ordinary people (economists are rational and think the unthinkable, ordinary people don't) and it's also ok to distinguish economists from other social scientists (economists think ordinary people are rational, other social scientists believe in "culture"). You just have to be careful not to make both arguments in the same paragraph.

P.S. Statisticians are special because, deep in our bones, we know about uncertainty. Economists know about incentives, physicists know about reality, movers can fit big things in the elevator on the first try, evolutionary psychologists know how to get their names in the newspaper, lawyers know you should never never never talk to the cops, and statisticians know about uncertainty. Of that, I'm sure.

Paul Pudaite writes in response to my discussion with Bartels regarding effect sizes and measurement error models:

You [Gelman] wrote: "I actually think there will be some (non-Gaussian) models for which, as y gets larger, E(x|y) can actually go back toward zero."

I [Pudaite] encountered this phenomenon some time in the '90s. See this graph which shows the conditional expectation of X given Z, when Z = X + Y and the probability density functions of X and Y are, respectively, exp(-x^2) and 1/(y^2+1) (times appropriate constants). As the magnitude of Z increases, E[X|Z] shrinks to zero.

[Graph: E[X|Z] as a function of Z]

I wasn't sure it was worth the effort to try to publish a two paragraph paper.

I suspect that this is true whenever the tail of one distribution is 'sufficiently heavy' with respect to the tail of the other. Hmm, I suppose there might be enough substance in a paper that attempted to characterize this outcome for, say, unimodal symmetric distributions.

Maybe someone can do this? I think it's an important problem. Perhaps some relevance to the Jeffreys-Lindley paradox also.
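For anyone who wants to check Pudaite's claim numerically, here's a quick quadrature sketch (my code, not his): X has density proportional to exp(-x^2), Y is standard Cauchy, and we compute E[X|Z=z] on a grid.

```python
import numpy as np

# Z = X + Y, with X ~ density prop. to exp(-x^2) (normal, sd = 1/sqrt(2))
# and Y ~ density prop. to 1/(1 + y^2) (standard Cauchy).
x = np.linspace(-12, 12, 24001)
fx = np.exp(-x**2)                    # unnormalized density of X

def e_x_given_z(z):
    """E[X | Z = z] by quadrature on the grid."""
    fy = 1.0 / (1.0 + (z - x)**2)     # unnormalized Cauchy density at z - x
    w = fx * fy                       # proportional to the density of X given Z = z
    return (x * w).sum() / w.sum()

for z in [0, 1, 2, 5, 10, 20, 50]:
    print(z, round(e_x_given_z(z), 3))
# E[X|Z] increases with z at first, then falls back toward zero as z grows,
# matching the graph Pudaite describes.
```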

Macro causality

David Backus writes:

This is from my area of work, macroeconomics. The suggestion here is that the economy is growing slowly because consumers aren't spending money. But how do we know it's not the reverse: that consumers are spending less because the economy isn't doing well? As a teacher, I can tell you that it's almost impossible to get students to understand that the first statement isn't obviously true. What I'd call the demand-side story (more spending leads to more output) is everywhere, including this piece, from the usually reliable David Leonhardt.

This whole situation reminds me of the story of the village whose inhabitants support themselves by taking in each others' laundry. I guess we're rich enough in the U.S. that we can stay afloat for a few decades just buying things from each other?

Regarding the causal question, I'd like to move away from the idea of "Does A cause B or does B cause A" and toward a more intervention-based framework (Rubin's model for causal inference) in which we consider effects of potential actions. See here for a general discussion. Considering the example above, a focus on interventions clarifies some of the causal questions. For example, if you want to talk about the effect of consumers spending less, you have to consider what interventions you have in mind that would cause consumers to spend more. One such intervention is the famous helicopter drop but there are others, I assume. Conversely, if you want to talk about the poor economy affecting spending, you have to consider what interventions you have in mind to make the economy go better.

In that sense, instrumental variables are a fundamental way to think of just about all causal questions of this sort. You start with variables A and B (for example, consumer spending and economic growth). Instead of picturing A causing B or B causing A, you consider various treatments that can affect both A and B.

All my discussion is conceptual here. As I never tire of saying, my knowledge of macroeconomics hasn't developed since I took econ class in 11th grade.

6 links

The Browser asked me to recommend 6 articles for their readers. Here's what I came up with. I really wanted to link to this one but it wouldn't mean much to people who don't know New York. I also recommended this (if you'll forgive my reference to bowling), but I think it was too much of a primary source for their taste.

OK, the 30 days of statistics are over. I'll still be posting regularly on statistical topics, but now it will be mixed in with everything else, as before.

Here's what I put on the sister blogs in the past month:

1. How to write an entire report with fake data.

2. "Life getting shorter for women in hundreds of U.S. counties": I'd like to see a graph of relative change in death rates, with age on the x-axis.

3. "Not a choice" != "genetic".

4. Remember when I said I'd never again post on albedo? I was lying.

5. Update on Arrow's theorem. It's a Swiss thing, you wouldn't understand.

6. Dan Ariely can't read, but don't blame Johnson and Goldstein.

7. My co-blogger endorses college scholarships for bowling. Which reminds me that my friends and I did "intramural bowling" in high school to get out of going to gym class. Nobody paid us. We even had to rent the shoes!

8. The quest for http://www.freakonomics.com/2008/10/10/my-colleague-casey-mulligan-in-the-times-there-is-no-reason-to-panic/

9. For some reason, the commenters got all worked up about the dude with the two kids and completely ignored the lady who had to sell her summer home in the Hamptons.

10. The most outrageous parts of a story are the parts that don't even attract attention.

11. Do academic economists really not talk about economic factors when they talk about academic jobs in economics?

12. The fallacy of composition in brownstone Brooklyn.

13. No, the federal budget is not funded by taking money from poor people.

14. Leading recipient of U.S. foreign aid says that foreign aid is bad.

15. Jim Davis has some pretty controversial opinions.

16. Political scientist links to political scientist linking to political scientist claiming political science is irrelevant.

17. "Approximately one in 11.8 quadrillion." (I love that "approximately." The exact number is 11.8324589480035 quadrillion but they did us the favor of rounding.)

Static sensitivity analysis

This is one of my favorite ideas. I used it in an application but have never formally studied it or written it up as a general method.

Sensitivity analysis is when you check how inferences change when you fit several different models or vary inputs within a model. Sensitivity analysis is often recommended but is typically difficult to do, what with the hassle of carrying around all these different estimates. In Bayesian inference, sensitivity analysis is associated with varying the prior distribution, which irritates me: why not consider sensitivity to the likelihood, as that's typically just as arbitrary as the prior while having a much larger effect on the inferences?

So we came up with static sensitivity analysis, which is a way to assess sensitivity to assumptions while fitting only one model. The idea is that Bayesian posterior simulation gives you a range of parameter values, and from these you can learn about sensitivity directly.

The published example comes from my paper with Frederic Bois and Don Maszle on the toxicokinetics of perchloroethylene (PERC). One of the products of the analysis was estimation of the percent of PERC metabolized at high and low doses. We fit a multilevel model to data from six experimental subjects, so we obtained inference for the percent metabolized at each dose for each person and the distribution of these percents over the general population.

Here's the static sensitivity analysis:

[Figure: static sensitivity analysis for subject A--four scatterplots showing percent metabolized at low and high dose plotted against four model parameters]

Each plot shows inferences for two quantities of interest--percent metabolized at each of the two doses--with the dots representing different draws from the fitted posterior distribution. (The percent metabolized is lower at high doses (an effect of saturation of the metabolic process in the liver), so in this case it's possible to "cheat" and display two summaries on each plot.) The four graphs show percent metabolized as a function of four different variables in the model. All these graphs represent inference for subject A, one of the six people in the experiment. (It would be possible to look at the other five subjects, but the set of graphs here gives the general idea.)

To understand the static sensitivity analysis, consider the upper-left graph. The simulations reveal some posterior uncertainty about the percent metabolized (it is estimated to be between about 10% and 40% at low dose and between 0.5% and 2% at high dose) and also about the fat-blood partition coefficient displayed on the x-axis (it is estimated to be somewhere between 65 and 110). More to the point, the fat-blood partition coefficient influences the inference for metabolism at low dose but not at high dose. This result can be directly interpreted as sensitivity to the prior distribution for this parameter: if you shift the prior to the left or right, you will shift the inferences up or down for percent metabolized at low dose, but not at high dose.

Now look at the lower-left graph. The scaling coefficient strongly influences the percent metabolized at high dose but has essentially no effect on the low-dose rate.

Suppose that as a decision-maker you are primarily interested in the effects of low-dose exposure. Then you'll want to get good information about the fat-blood partition coefficient (if possible), but it's not so important to get more precision on the scaling coefficient. You can go similarly through the other graphs.
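Here's roughly what producing such a display looks like in code. The posterior draws and the quantity-of-interest formulas below are fabricated placeholders (the real ones would come from the fitted PERC model); the only point is the pattern: scatter each quantity of interest against each parameter's posterior draws and look for slopes.

```python
import numpy as np
import matplotlib.pyplot as plt

# Fabricated posterior draws standing in for two model parameters.
rng = np.random.default_rng(1)
n = 1000
draws = {
    "fat-blood partition coefficient": rng.normal(85, 10, n),
    "scaling coefficient": rng.normal(1.0, 0.2, n),
}
# Hypothetical quantities of interest, one value per posterior draw.
pct_low = 25 + 0.3 * (draws["fat-blood partition coefficient"] - 85) + rng.normal(0, 2, n)
pct_high = 1 + 0.5 * (draws["scaling coefficient"] - 1) + rng.normal(0, 0.1, n)

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
for ax, (name, param) in zip(axes, draws.items()):
    ax.scatter(param, pct_low, s=5, label="low dose")
    ax.scatter(param, pct_high, s=5, label="high dose")
    ax.set_xlabel(name)
    ax.set_ylabel("percent metabolized")
    ax.legend()
plt.tight_layout()
plt.show()
# A visible slope means the inference for that quantity is sensitive to that
# parameter (and hence to its prior); a flat cloud means it is not.
```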

I think this has potential as a general method, but I've never studied it or written it up as such. It's a fun problem: it has applied importance but also links to a huge theoretical literature on sensitivity analysis.

A few days ago I discussed the evaluation of somewhat-plausible claims that are somewhat supported by theory and somewhat supported by statistical evidence. One point I raised was that an implausibly large estimate of effect size can be cause for concern:

Uri Simonsohn (the author of the recent rebuttal of the name-choice article by Pelham et al.) argued that the implied effects were too large to be believed (just as I was arguing above regarding the July 4th study), which makes more plausible his claims that the results arise from methodological artifacts.

That calculation is straight Bayes: the distribution of systematic errors has much longer tails than the distribution of random errors, so the larger the estimated effect, the more likely it is to be a mistake. This little theoretical result is a bit annoying, because it is the larger effects that are the most interesting!

Larry Bartels notes that my reasoning above is a bit incoherent:

I [Bartels] strongly agree with your bottom line that our main aim should be "understanding effect sizes on a real scale." However, your paradoxical conclusion ("the larger the estimated effect, the more likely it is to be a mistake") seems to distract attention from the effect size of primary interest--the magnitude of the "true" (causal) effect.

If the model you have in mind is b=c+d+e, where b is the estimated effect, c is the "true" (causal) effect, d is a "systematic error" (in your language), and e is a "random error," your point seems to be that your posterior belief regarding the magnitude of the "systematic error," E(d|b), is increasing in b. But the more important fact would seem to be that your posterior belief regarding the magnitude of the "true" (causal) effect, E(c|b), is also increasing in b (at least for plausible-seeming distributional assumptions).

Your prior uncertainty regarding the distributions of these various components will determine how much of the estimated effect you attribute to c and how much you attribute to d, and in the case of "wacky claims" you may indeed want to attribute most of it to d; nevertheless, it seems hard to see why a larger estimated effect should not increase your posterior estimate of the magnitude of the true causal effect, at least to some extent.

Conversely, your skeptical assessment of the flaws in the design of the July 4th study may very well lead you to believe that d>>0; but wouldn't that same skepticism have been warranted (though it might not have been elicited) even if the estimated effect had happened to look more plausible (say, half as large or one-tenth as large)?

Focusing on whether a surprising empirical result is "a mistake" (whatever that means) seems to concede too much to the simple-minded is-there-an-effect-or-isn't-there perspective, while obscuring your more fundamental interest in "understanding [true] effect sizes on a real scale."

Larry's got a point. I'll have to think about this in the context of an example. Maybe a more correct statement would be that, given reasonable models for x, d, and e, if the estimate gets implausibly large, the estimate for x does not increase proportionally. I actually think there will be some (non-Gaussian) models for which, as y gets larger, E(x|y) can actually go back toward zero. But this will depend on the distributional form.

I agree that "how likely is it to be a mistake" is the wrong way to look at things. For example, in the July 4th study, there are a lot of sources of variation, only some of which are controlled for in the analysis that was presented. No analysis is perfect, so the "mistake" framing is generally not so helpful.

I was pleasantly surprised to have my recreational reading about baseball in the New Yorker interrupted by a digression on statistics. Sam Fuld of the Tampa Bay Rays was the subject of a Ben McGrath profile in the 4 July 2011 issue of the New Yorker, in an article titled "Super Sam." After quoting a minor-league trainer who described Fuld as "a bit of a geek" (who isn't these days?), McGrath gets into that lovely New Yorker detail:

One could have pointed out the more persuasive and telling examples, such as the fact that in 2005, after his first pro season, with the Class-A Peoria Chiefs, Fuld applied for a fall internship with Stats, Inc., the research firm that supplies broadcasters with much of the data and analysis that you hear in sports telecasts.

After a description of what they had him doing--reviewing footage of games and cataloguing--he said:

"I thought, They have a stat for everything, but they don't have any stats regarding foul balls."

I like lineplots

These particular lineplots are called parallel coordinate plots.

Nick Polson and James Scott write:

We generalize the half-Cauchy prior for a global scale parameter to the wider class of hypergeometric inverted-beta priors. We derive expressions for posterior moments and marginal densities when these priors are used for a top-level normal variance in a Bayesian hierarchical model. Finally, we prove a result that characterizes the frequentist risk of the Bayes estimators under all priors in the class. These arguments provide an alternative, classical justification for the use of the half-Cauchy prior in Bayesian hierarchical models, complementing the arguments in Gelman (2006).

This makes me happy, of course. It's great to be validated.

The only thing I didn't catch is how they set the scale parameter for the half-Cauchy prior. In my 2006 paper I frame it as a weakly informative prior and recommend that the scale be set based on actual prior knowledge. But Polson and Scott are talking about a default choice. I used to think that such a default would not really be possible but given our recent success with automatic priors for regularized point estimates, now I'm thinking that a reasonable default might be possible in the full Bayes case too.
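For concreteness, here's a tiny sketch of what a half-Cauchy prior on a group-level standard deviation looks like; the scale A = 5 below is an arbitrary placeholder standing in for a value chosen from prior knowledge of the data's scale, not a recommended default.

```python
from scipy import stats

# Half-Cauchy prior for a group-level sd tau, with user-chosen scale A.
A = 5.0
prior = stats.halfcauchy(scale=A)

print("P(tau < A)   =", round(prior.cdf(A), 3))      # half the mass lies below the scale
print("P(tau < 5*A) =", round(prior.cdf(5 * A), 3))  # the heavy right tail still allows large sds
```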

P.S. I found the above article while looking on Polson's site for this excellent paper, which considers in a more theoretical way some of the themes that Jennifer, Masanao, and I are exploring in our research on hierarchical models and multiple comparisons.

Vincent Yip writes:

I have read your paper [with Kobi Abayomi and Marc Levy] regarding the application of multiple imputation.

In order to diagnose my imputed data, I used Kolmogorov-Smirnov (K-S) tests to compare the distribution differences between the imputed and observed values of a single attribute as mentioned in your paper. My question is:

For example I have this attribute X with the following data: (NA = missing)

Original dataset: 1, NA, 3, 4, 1, 5, NA

Imputed dataset: 1, 2 , 3, 4, 1, 5, 6

a) in order to run the KS test, will I treat the observed data as 1, 3, 4,1, 5?

b) and for the imputed data, will I treat 1, 2, 3, 4, 1, 5, 6 as the imputed dataset for the K-S test? or just 2, 6?

c) if I use m=5, I will have 5 imputed data sets. How would I apply the K-S test to the 5 of them and compare to the single observed distribution? Do I combine the 5 imputed data sets into one by averaging the imputed values, so I get a single imputed data set to compare with the observed data? Or do I run the K-S test on all 5 and average the results (i.e., average the p-values)?

My reply:

I have to admit I have not thought about this in detail. I suppose it would make sense to compare the observed data (1,3,4,1,5) to the imputed (2,6). I would do the test separately for each imputation. I also haven't thought about what to do with the p-values. My intuition would be to average them but this again is not something I've thought much about. Also if the test does reject, this implies a difference between observed and imputed values. It does not show that the imputations are wrong, merely that under the model the data are not missing completely at random.

I'm sure there's a literature on combining hypothesis tests with multiple imputation. Usually I'm not particularly interested in testing--we just threw that Kolmogorov-Smirnov idea into our paper without thinking too hard about what we would do with it.
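Here's a minimal sketch of that procedure in code. Only the first imputed pair (2, 6) comes from Vincent's example; the other four imputations are invented just to fill out m = 5.

```python
import numpy as np
from scipy import stats

observed = np.array([1, 3, 4, 1, 5])
# m = 5 imputations of the two missing values; only the first pair is from
# the example above, the rest are made up.
imputations = [np.array([2.0, 6.0]), np.array([1.5, 5.5]), np.array([2.2, 4.8]),
               np.array([0.8, 6.1]), np.array([2.5, 5.0])]

# Run the K-S test separately for each imputation, then average the p-values.
pvals = [stats.ks_2samp(observed, imp).pvalue for imp in imputations]
print("per-imputation p-values:", np.round(pvals, 3))
print("average p-value:", round(float(np.mean(pvals)), 3))
```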

I've been talking a lot about how different graphical presentations serve different goals, and how we should avoid being so judgmental about graphs. Instead of saying that a particular data visualization is bad, we should think about what goal it serves.

That's all well and good, but sometimes a graph really is bad.

Let me draw an analogy to the popular media. Books and videogames serve different goals. I'm a reader and writer of books and have very little interest in videogames, but it would be silly for me to criticize a videogame on the grounds that it's a bad book (or, for that matter, to criticize a book because it doesn't yield a satisfying game-playing experience). But . . . there are bad books and there are bad videogames.

To restrict our scope to books for a moment: you could argue that Mickey Spillane's books are terrible or you could argue that, given that they sold tens of millions of copies, they must have had something going for them. I wouldn't want to characterize such a book as simply bad. But . . . think of all the crappy attempted Spillanes that were produced in the 1950s, all the ones that neither sold well nor had interesting content. Some of them must have been low-quality under any measure. Or we could talk about food. McDonald's is fine but somewhere there's a greasy spoon whose burgers are so barfable that even the locals don't go there. (I remember a place we went to in grad school once at 2 in the morning, a disgusting local restaurant that was open from about 11pm-5am every day and was always empty. The rumor was that its sole function was as a mob hangout. So, sure, it had a function, but food had nothing to do with it. Similarly, some of those attempted Spillanes probably had the intended function of selling books but, at that, they failed miserably.)

And now we return to statistical graphics, in particular this graph of trends in military spending:

[Figure: charts of trends in military spending from the Council on Foreign Relations report]

I learned about this beauty from Justin Logan and Charles Zakaib, who wrote:

We have found the charts in this Council on Foreign Relations report [by Neil Bouhan and Paul Swartz] confusing and would appreciate any thoughts you have on the authors' choice of graphical representation. The charts at the bottom of pages 4 and 5 are particularly vexing. It seems to us that the message could have been conveyed in a much clearer way.

I too was vexed. I've been blogging long enough that I think faithful readers could easily reel off ten things I don't like about the graph. I'm too lazy to list them here (but feel free to do so in the comments!); instead, I just want to make the point that, indeed, this graph could've been much better and, no, I don't see that its ugly clutteredness serves any clear non-statistical goals either. Just as a lot of writing is done by people without good command of the tools of the written language, so are many graphs made by people who can only clumsily handle the tools of graphics. The problem is made worse, I believe, because I don't think the creators of the graph thought hard about what their goals were.

That said, I applaud their decision to make the presentations graphically. I think the best way to read the report is to read the text and then glance at the graphs to see that they provide evidence for the points made in the text. And you'll have to decide for yourself whether it's actually bad news (as claimed on page 5 of the report) that "the United States' and its allies' share of world military spending . . . is projected to fall further, to 66 percent, by 2015." I mean, sure, it would be great if other countries didn't waste any money on expensive tanks, fighter jets, military retirement plans, etc., but can you really blame them for wanting to be like us?

P.S. No, the 30 days are not up, and yes, this entry does have statistical content. It's part of our ongoing exploration of criteria for understanding and evaluating statistical graphics and, by implication, statistical communication more generally.

Around these parts we see a continuing flow of unusual claims supported by some statistical evidence. The claims are varyingly plausible a priori. Some examples (I won't bother to supply the links; regular readers will remember these examples and newcomers can find them by searching):

- Obesity is contagious
- People's names affect where they live, what jobs they take, etc.
- Beautiful people are more likely to have girl babies
- More attractive instructors have higher teaching evaluations
- In a basketball game, it's better to be behind by a point at halftime than to be ahead by a point
- Praying for someone without their knowledge improves their recovery from heart attacks
- A variety of claims about ESP

How should we think about these claims? The usual approach is to evaluate the statistical evidence--in particular, to look for reasons that the claimed results are not really statistically significant. If nobody can shoot down a claim, it survives.

The other part of the story is the prior. The less plausible the claim, the more carefully I'm inclined to check the analysis.

But what does it mean, exactly, to check an analysis? The key step is to interpret the findings quantitatively: not just as significant/non-significant but as an effect size, and then to look at the implications of the estimated effect.

I'll explore this in the context of two examples, one from political science and one from psychology. An easy example is one in which the estimated effect is completely plausible (for example, the incumbency advantage in U.S. elections), or in which it is completely implausible (for example, a new and unreplicated claim of ESP).

Neither of the examples I consider here is easy: both of the claims are odd but plausible, and both are supported by data, theory, and reasonably sophisticated analysis.

The effect of rain on July 4th

My co-blogger John Sides linked to an article by Andreas Madestam and David Yanagizawa-Drott that reports that going to July 4th celebrations in childhood had the effect of making people more Republican. Madestam and Yanagizawa-Drott write:

Using daily precipitation data to proxy for exogenous variation in participation on Fourth of July as a child, we examine the role of the celebrations for people born in 1920-1990. We find that days without rain on Fourth of July in childhood have lifelong effects. In particular, they shift adult views and behavior in favor of the Republicans and increase later-life political participation. Our estimates are significant: one Fourth of July without rain before age 18 raises the likelihood of identifying as a Republican by 2 percent and voting for the Republican candidate by 4 percent. . . .

Here was John's reaction:

In sum, if you were born before 1970, and experienced sunny July 4th days between the ages of 7-14, and lived in a predominantly Republican county, you may be more Republican as a consequence.

When I [John] first read the abstract, I did not believe the findings at all. I doubted whether July 4th celebrations were all that influential. And the effects seem to occur too early in the life cycle: would an 8-year-old be affected politically? Doesn't the average 8-year-old care more about fireworks than patriotism?

But the paper does a lot of spadework and, ultimately, I was left thinking "Huh, maybe this is true." I'm still not certain, but it was worth a blog post.

My reaction is similar to John's but a bit more on the skeptical side.

Let's start with effect size. One July 4th without rain increases the probability of Republican vote by 4%. From their Figure 3, the number of rain-free July 4ths is between 6 and 12 for most respondents. So if we go from the low to the high end, we get an effect of 6*4%, or 24%.

[Note: See comment below from Winston Lim. If the effect is 24% (not 24 percentage points!) on the Republican vote and 0% on the Democratic vote, then the effect on the Republican share of the two-party vote, R/(D+R), is 1.24/2.24 - 1/2, or approximately 5 percentage points. So the estimate is much less extreme than I'd thought. The confusion arose because I am used to seeing results reported in terms of the percent of the two-party vote share, but these researchers used a different form of summary.]

Does a childhood full of sunny July 4ths really make you 24 percentage points more likely to vote Republican? (The authors find no such effect when considering the weather on a few other days in July.) I could imagine an effect--but 24 percent of the vote? The number seems too high--especially considering the expected attenuation (noted in section 3.1 of the paper) because not everyone goes to a July 4th celebration and because they don't actually know the counties where the survey respondents lived as children. It's hard enough to believe an effect size of 24%, but it's really hard to believe that 24% is an underestimate.

So what could've gone wrong? The most convincing part of the analysis was that they found no effect of rain on July 2, 3, 5, or 6. But this made me wonder about the other days of the year. I'd like to see them automate their analysis and loop it thru all 365 days, then make a graph showing how the coefficient for July 4th fits in. (I'm not saying they should include all 365 in a single regression--that would be a mess. Rather, I'm suggesting the simpler option of 365 analyses, each for a single date.)
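Here's a sketch of what that loop might look like. Everything below is simulated noise plus a small built-in July 4th effect, just to show the mechanics; in the real version each of the 365 regressions would use the authors' full specification (county fixed effects, state time trends, and so on), with rain on day d as the treatment variable.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_days = 5000, 365
# Simulated rain indicators for each day of the year and each respondent.
rain = rng.binomial(1, 0.3, size=(n_days, n)).astype(float)
july4 = 185 - 1  # 0-based index: July 4 is day 185 of a non-leap year
# Simulated outcome that depends (weakly) on July 4th rain only.
republican = 0.1 * rain[july4] + rng.normal(0, 1, n)

coefs = np.empty(n_days)
for d in range(n_days):
    X = np.column_stack([np.ones(n), rain[d]])     # intercept + rain on day d
    beta, *_ = np.linalg.lstsq(X, republican, rcond=None)
    coefs[d] = beta[1]

print("July 4th coefficient:", round(coefs[july4], 3))
print("middle 95% of the other days' coefficients:",
      np.round(np.percentile(np.delete(coefs, july4), [2.5, 97.5]), 3))
```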

Otherwise there are various features in the analysis that could cause problems. The authors predict individual survey responses given the July 4th weather when the respondents were children, in the counties where they currently reside. Right away we can imagine all sorts of biases based on who moves and who stays put.

Setting aside these measurement issues, the big identification issue is that counties with more rain might be systematically different than counties with less rain. To the extent the weather can be considered a random treatment, the randomization is occurring across years within counties. The authors attempt to deal with this by including "county fixed effects"--that is, allowing the intercept to vary by county. That's ok but their data span a 70 year period, and counties have changed a lot politically in 70 years. They also include linear time trends for states, which helps some more, but I'm still a little concerned about systematic differences not captured in these trends.

No study is perfect, and I'm not saying these are devastating criticisms. I'm just trying to work through my thoughts here.

The effects of names on life choices

For another example, consider the study by Brett Pelham, Matthew Mirenberg, and John Jones of the dentists named Dennis (and the related stories of people with names beginning with F getting low grades, baseball players with K names getting more strikeouts, etc.). I found these claims varyingly plausible: the business with the grades and the strikeouts sounded like a joke, but the claims about career choices etc seemed possible.

My first step in trying to understand these claims was to estimate an effect size: my crude estimate was that, if the research findings were correct, about 1% of people choose their career based on their first names.

This seemed possible to me, but Uri Simonsohn (the author of the recent rebuttal of the name-choice article by Pelham et al.) argued that the implied effects were too large to be believed (just as I was arguing above regarding the July 4th study), which makes more plausible his claims that the results arise from methodological artifacts.

That calculation is straight Bayes: the distribution of systematic errors has much longer tails than the distribution of random errors, so the larger the estimated effect, the more likely it is to be a mistake. This little theoretical result is a bit annoying, because it is the larger effects that are the most interesting!

Simonsohn moved the discussion forward by calibrating the effect-size questions to other measurable quantities:

We need a benchmark to make a more informed judgment if the effect is small or large. For example, the Dennis/dentist effect should be much smaller than parent-dentist/child-dentist. I think this is almost certainly true but it is an easy hurdle. The J marries J effect should not be much larger than the effect of, say, conditioning on going to the same high-school, having sat next to each other in class for a whole semester.

I have no idea if that hurdle is passed. These are arbitrary thresholds for sure, but better I'd argue than both my "100% increase is too big", and your "pr(marry smith) up from 1% to 2% is ok."

Summary

No easy answers. But I think that understanding effect sizes on a real scale is a start.

Here.

My reaction was, It's cute how the bars move but why is this the future?

Aleks replied:

Integrated in the browser, works on any device, requires no software installation. Here are more examples, for maps.

Matthew Bogard writes:

Regarding the book Mostly Harmless Econometrics, you state:
A casual reader of the book might be left with the unfortunate impression that matching is a competitor to regression rather than a tool for making regression more effective.
But in fact isn't that what they are arguing, that, in a 'mostly harmless way' regression is in fact a matching estimator itself? "Our view is that regression can be motivated as a particular sort of weighted matching estimator, and therefore the differences between regression and matching estimates are unlikely to be of major empirical importance" (Chapter 3 p. 70) They seem to be distinguishing regression (without prior matching) from all other types of matching techniques, and therefore implying that regression can be a 'mostly harmless' substitute or competitor to matching. My previous understanding, before starting this book was as you say, that matching is a tool that makes regression more effective. I have not finished their book, and have been working at it for a while, but if they do not mean to propose OLS itself as a matching estimator, then I agree that they definitely need some clarification. I actually found your particular post searching for some article that discussed this more formally, as I found my interpretation (misinterpretation) difficult to accept. What say you?

My reply:

I don't know what Angrist and Pischke actually do in their applied analysis. I'm sorry to report that many users of matching do seem to think of it as a pure substitute for regression: once they decide to use matching, they try to do it perfectly and they often don't realize they can use regression on the matched data to do even better. In my book with Jennifer, we try to clarify that the primary role of matching is to correct for lack of complete overlap between control and treatment groups.

But I think in their comment you quoted above, Angrist and Pischke are just giving a conceptual perspective rather than detailed methodological advice. They're saying that regression, like matching, is a way of comparing-like-with-like in estimating a comparison. This point seems commonplace from a statistical standpoint but may be news to some economists who might think that regression relies on the linear model being true.

Gary King and I discuss this general idea in our 1990 paper on estimating incumbency advantage. Basically, a regression model works if either of two assumptions is satisfied: if the linear model is true, or if the two groups are balanced so that you're getting an average treatment effect. More recently this idea (of there being two bases for an inference) has been given the name "double robustness"; in any case, it's a fundamental aspect of regression modeling, and I think that, by equating regression with matching, Angrist and Pischke are just trying to emphasize that these are just two different ways of ensuring balance in a comparison.

In many examples, neither regression nor matching works perfectly, which is why it can be better to do both (as Don Rubin discussed in his Ph.D. thesis in 1970 and subsequently in some published articles with his advisor, William Cochran).
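Here's a toy sketch of the "match, then regress" idea (my own illustration, not code from either book): nearest-neighbor matching on a single covariate restores overlap between the groups, and a regression on the matched sample then adjusts for whatever imbalance the matching leaves behind.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(0, 1, n)                               # confounder
treat = rng.binomial(1, 1 / (1 + np.exp(-1.5 * x)))   # assignment depends on x
y = 2.0 * treat + 1.0 * x + rng.normal(0, 1, n)       # true treatment effect = 2

# 1-to-1 nearest-neighbor matching (with replacement) of controls to treated units on x.
treated = np.where(treat == 1)[0]
controls = np.where(treat == 0)[0]
nearest = np.abs(x[controls][:, None] - x[treated][None, :]).argmin(axis=0)
matched_controls = controls[nearest]
idx = np.concatenate([treated, matched_controls])

# Difference in means on the matched sample, then regression adjustment on it.
diff = y[treated].mean() - y[matched_controls].mean()
X = np.column_stack([np.ones(len(idx)), treat[idx], x[idx]])
beta, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
print("matched difference in means:   ", round(diff, 3))
print("matching + regression estimate:", round(beta[1], 3))
```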

The quest for the holy graph

Eytan Adar writes:

I was just going through the latest draft of your paper with Anthony Unwin. I heard part of it at the talk you gave (remotely) here at UMich. I'm curious about your discussion of the Baby Name Voyager.

The tool in itself is simple, attractive, and useful. No argument from me there. It's an awesome demonstration of how subtle interactions can be very helpful (click and it zooms, type and it filters... falls perfectly into the Shneiderman visualization mantra). It satisfies a very common use case: finding appropriate names for children.

That said, I can't help feeling that what you are really excited about is the very static analysis on last letters (you spend most of your time on this). This analysis, incidentally, is not possible to infer from the interactive application (which doesn't support this type of filtering and pivoting). In a sense, the two visualizations don't have anything to do with each other (other than a shared context/dataset).

The real problem is that the first visualization does not seem to meet your goal for "graphics as part of a story." Or at least it wasn't obvious to me what the story was. The story you seem excited about is unrelated and it almost seems like you are patching a missing goal from the first visualization by a story told from the second.

The outcome: no interactive visualization that you like that satisfies all 6 goals (or at least not in some obvious ways that would serve as guidance for others trying to build infovis systems).

My reply:

Yes, I am most excited about that static analysis, and I realize that you can't make those three graphs directly from the Baby Name Voyager. But I still give the Voyager the credit, because I'm guessing that the way Laura Wattenberg found the pattern that is displayed in those three ugly graphs is by playing around with lots of time trends on the interactive graph. So, although the mapping from interactive to static and back is not perfect, I think they worked well together in this instance.
