Recently in Decision Theory Category

Nick Polson and James Scott write:

We generalize the half-Cauchy prior for a global scale parameter to the wider class of hypergeometric inverted-beta priors. We derive expressions for posterior moments and marginal densities when these priors are used for a top-level normal variance in a Bayesian hierarchical model. Finally, we prove a result that characterizes the frequentist risk of the Bayes estimators under all priors in the class. These arguments provide an alternative, classical justification for the use of the half-Cauchy prior in Bayesian hierarchical models, complementing the arguments in Gelman (2006).

This makes me happy, of course. It's great to be validated.

The only thing I didn't catch is how they set the scale parameter for the half-Cauchy prior. In my 2006 paper I frame it as a weakly informative prior and recommend that the scale be set based on actual prior knowledge. But Polson and Scott are talking about a default choice. I used to think that such a default would not really be possible, but given our recent success with automatic priors for regularized point estimates, now I'm thinking that a reasonable default might be possible in the full Bayes case too.

P.S. I found the above article while looking on Polson's site for this excellent paper, which considers in a more theoretical way some of the themes that Jennifer, Masanao, and I are exploring in our research on hierarchical models and multiple comparisons.

Around these parts we see a continuing flow of unusual claims supported by some statistical evidence. The claims are varyingly plausible a priori. Some examples (I won't bother to supply the links; regular readers will remember these examples and newcomers can find them by searching):

- Obesity is contagious
- People's names affect where they live, what jobs they take, etc.
- Beautiful people are more likely to have girl babies
- More attractive instructors have higher teaching evaluations
- In a basketball game, it's better to be behind by a point at halftime than to be ahead by a point
- Praying for someone without their knowledge improves their recovery from heart attacks
- A variety of claims about ESP

How should we think about these claims? The usual approach is to evaluate the statistical evidence--in particular, to look for reasons that the claimed results are not really statistically significant. If nobody can shoot down a claim, it survives.

The other part of the story is the prior. The less plausible the claim, the more carefully I'm inclined to check the analysis.

But what does it mean, exactly, to check an analysis? The key step is to interpret the findings quantitatively: not just as significant/non-significant but as an effect size, and then to look at the implications of the estimated effect.

I'll explore this in the context of two examples, one from political science and one from psychology. An easy example would be one in which the estimated effect is completely plausible (for example, the incumbency advantage in U.S. elections), or in which it is completely implausible (for example, a new and unreplicated claim of ESP).

Neither of the examples I consider here is easy: both of the claims are odd but plausible, and both are supported by data, theory, and reasonably sophisticated analysis.

The effect of rain on July 4th

My co-blogger John Sides linked to an article by Andreas Madestam and David Yanagizawa-Drott that reports that going to July 4th celebrations in childhood had the effect of making people more Republican. Madestam and Yanagizawa-Drott write:

Using daily precipitation data to proxy for exogenous variation in participation on Fourth of July as a child, we examine the role of the celebrations for people born in 1920-1990. We find that days without rain on Fourth of July in childhood have lifelong effects. In particular, they shift adult views and behavior in favor of the Republicans and increase later-life political participation. Our estimates are significant: one Fourth of July without rain before age 18 raises the likelihood of identifying as a Republican by 2 percent and voting for the Republican candidate by 4 percent. . . .

Here was John's reaction:

In sum, if you were born before 1970, and experienced sunny July 4th days between the ages of 7-14, and lived in a predominantly Republican county, you may be more Republican as a consequence.

When I [John] first read the abstract, I did not believe the findings at all. I doubted whether July 4th celebrations were all that influential. And the effects seem to occur too early in the life cycle: would an 8-year-old be affected politically? Doesn't the average 8-year-old care more about fireworks than patriotism?

But the paper does a lot of spadework and, ultimately, I was left thinking "Huh, maybe this is true." I'm still not certain, but it was worth a blog post.

My reaction is similar to John's but a bit more on the skeptical side.

Let's start with effect size. One July 4th without rain increases the probability of Republican vote by 4%. From their Figure 3, the number of rain-free July 4ths is between 6 and 12 for most respondents. So if we go from the low to the high end, we get an effect of 6*4%, or 24%.

[Note: See comment below from Winston Lim. If the effect is 24% (not 24 percentage points!) on the Republican vote and 0% on the Democratic vote, then the effect on the two-party Republican vote share R/(D+R) is 1.24/2.24 - 1/2, or approximately 6%. So the estimate is much less extreme than I'd thought. The confusion arose because I am used to seeing results reported in terms of the percent of the two-party vote share, but these researchers used a different form of summary.]
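As an arithmetic check on that note (1.24 here is the additive approximation to six 4% relative increases in the Republican vote):

```python
# Additive approximation to six 4% relative increases in Republican voting:
r_mult = 1 + 6 * 0.04        # 1.24

# If the Republican vote is scaled by 1.24 and the Democratic vote is
# unchanged, the shift in the two-party Republican vote share from a
# 50/50 baseline is:
share_shift = r_mult / (1 + r_mult) - 0.5
print(round(share_shift, 3))  # 0.054, i.e. roughly 6 percentage points
```

So a 24% relative effect on the Republican vote translates into only about a 5-to-6 percentage-point shift in two-party vote share.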

Does a childhood full of sunny July 4ths really make you 24 percentage points more likely to vote Republican? (The authors find no such effect for the weather on a few other days in July.) I could imagine an effect--but 24 percent of the vote? The number seems too high--especially considering the expected attenuation (noted in section 3.1 of the paper), both because not everyone goes to a July 4th celebration and because the authors don't actually know the counties where the survey respondents lived as children. It's hard enough to believe an effect size of 24%; it's really hard to believe 24% as an underestimate.

So what could've gone wrong? The most convincing part of the analysis was that they found no effect of rain on July 2, 3, 5, or 6. But this made me wonder about the other days of the year. I'd like to see them automate their analysis and loop it through all 365 days, then make a graph showing how the coefficient for July 4th fits in. (I'm not saying they should include all 365 in a single regression--that would be a mess. Rather, I'm suggesting the simpler option of 365 analyses, each for a single date.)
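The 365-regressions idea is easy to sketch. In this simulation (entirely hypothetical: invented data, a built-in effect only on day 185, and a plain least-squares slope standing in for their actual specification), the one-real-day coefficient is compared against the 364 placebo days:

```python
import random
import statistics

random.seed(1)

# Hypothetical data: a rain indicator for each of 365 candidate dates and
# each of n respondents, plus a "Republican vote" outcome. Only day 185
# (standing in for July 4th) has a built-in effect in this simulation.
n = 2000
rain = {d: [random.random() < 0.3 for _ in range(n)] for d in range(365)}
outcome = [0.5 - 0.1 * rain[185][i] + random.gauss(0, 0.2) for i in range(n)]

def slope(x, y):
    """OLS slope of y on a single regressor x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return num / sum((xi - mx) ** 2 for xi in x)

# One regression per day; the day-185 coefficient should stand out
# against the distribution of the 364 placebo coefficients.
coefs = {d: slope([float(r) for r in rain[d]], outcome) for d in range(365)}
print(min(coefs, key=coefs.get))  # the day with the most negative slope
```

The graph I'd want is a histogram of the 364 placebo coefficients with the July 4th coefficient marked on it.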

Otherwise there are various features in the analysis that could cause problems. The authors predict individual survey respondents' behavior given the July 4th weather when they were children, in the counties where they currently reside. Right away we can imagine all sorts of biases based on who moves and who stays put.

Setting aside these measurement issues, the big identification issue is that counties with more rain might be systematically different than counties with less rain. To the extent the weather can be considered a random treatment, the randomization is occurring across years within counties. The authors attempt to deal with this by including "county fixed effects"--that is, allowing the intercept to vary by county. That's ok but their data span a 70 year period, and counties have changed a lot politically in 70 years. They also include linear time trends for states, which helps some more, but I'm still a little concerned about systematic differences not captured in these trends.

No study is perfect, and I'm not saying these are devastating criticisms. I'm just trying to work through my thoughts here.

The effects of names on life choices

For another example, consider the study by Brett Pelham, Matthew Mirenberg, and John Jones of the dentists named Dennis (and the related stories of people with names beginning with F getting low grades, baseball players with K names getting more strikeouts, etc.). I found these claims varyingly plausible: the business with the grades and the strikeouts sounded like a joke, but the claims about career choices etc seemed possible.

My first step in trying to understand these claims was to estimate an effect size: my crude estimate was that, if the research findings were correct, about 1% of people choose their career based on their first names.

This seemed possible to me, but Uri Simonsohn (the author of the recent rebuttal of the name-choice article by Pelham et al.) argued that the implied effects were too large to be believed (just as I was arguing above regarding the July 4th study), which makes more plausible his claims that the results arise from methodological artifacts.

That calculation is straight Bayes: the distribution of systematic errors has much longer tails than the distribution of random errors, so the larger the estimated effect, the more likely it is to be a mistake. This little theoretical result is a bit annoying, because it is the larger effects that are the most interesting!
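That Bayesian logic can be made concrete with a toy calculation (all numbers here are my own invented illustration, not anyone's fitted model): suppose an estimate comes either from a short-tailed random-error process or from a long-tailed systematic-error process; the bigger the estimate, the higher the posterior probability that it's a mistake.

```python
import math

def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Toy model: with prior probability 0.5 an estimate reflects a real effect
# plus short-tailed random error (sd = 1); otherwise it is a systematic
# artifact drawn from a much wider distribution (sd = 10 stands in for
# the long tail).
def p_artifact(estimate, p_prior=0.5, sd_random=1.0, sd_systematic=10.0):
    like_random = normal_pdf(estimate, sd_random)
    like_artifact = normal_pdf(estimate, sd_systematic)
    return p_prior * like_artifact / (
        p_prior * like_artifact + (1 - p_prior) * like_random)

# The bigger the estimate, the more likely it came from the long tail:
for est in (1, 3, 5):
    print(est, round(p_artifact(est), 2))
```

With these made-up numbers, an estimate of 1 is probably real while an estimate of 5 is almost surely an artifact, which is exactly the annoyance: the biggest effects are the least trustworthy.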

Simonsohn moved the discussion forward by calibrating the effect-size questions to other measurable quantities:

We need a benchmark to make a more informed judgment if the effect is small or large. For example, the Dennis/dentist effect should be much smaller than parent-dentist/child-dentist. I think this is almost certainly true but it is an easy hurdle. The J marries J effect should not be much larger than the effect of, say, conditioning on going to the same high-school, having sat next to each other in class for a whole semester.

I have no idea if that hurdle is passed. These are arbitrary thresholds for sure, but better I'd argue than both my "100% increase is too big", and your "pr(marry smith) up from 1% to 2% is ok."


No easy answers. But I think that understanding effect sizes on a real scale is a start.

Martyn Plummer replied to my recent blog on DIC with information that was important enough that I thought it deserved its own blog entry. Martyn wrote:

DIC has been around for 10 years now and despite being immensely popular with applied statisticians it has generated very little theoretical interest. In fact, the silence has been deafening. I [Martyn] hope my paper added some clarity.

As you say, DIC is (an approximation to) a theoretical out-of-sample predictive error. When I finished the paper I was a little embarrassed to see that I had almost perfectly reconstructed the justification of AIC as an approximate cross-validation measure by Stone (1977), with a Bayesian spin of course.

But even this insight leaves a lot of choices open. You need to choose the right loss function and also which level of the model you want to replicate from. David Spiegelhalter and colleagues called this the "focus". In practice the focus is limited to the lowest level of the model. You generally can't calculate the log likelihood (the default penalty) for higher-level parameters. But this narrow choice might not correspond to the interests of the analyst. For example, in disease mapping DIC answers the question "What model yields the disease map that best captures the features of the observed incidence data during this period?" But people are often asking more fundamental questions about their models, like "Is there spatial aggregation in disease X?" There is quite a big gap between these questions.

Regarding the slow convergence of DIC, you might want to try an alternative definition of the effective number of parameters pD that I came up with in 2002, in the discussion of Spiegelhalter et al. It is non-negative and coordinate-free. It can be calculated from 2 or more parallel chains, so its sample variance can be estimated using standard MCMC diagnostics. I finally justified it in my 2008 paper and implemented it in JAGS. The steps are (or should be):

- Compile a model with at least 2 parallel chains
- Load the dic module
- Set a trace monitor for "pD".
- Output with the coda command

If you are only interested in the sample mean, not the variance, the dic.samples function from the rjags package will give you this in a nice R object wrapper.

I suppose we can implement this in Stan too.
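For readers without JAGS at hand, here is a minimal sketch of what DIC's ingredients look like when computed directly from posterior draws. This uses the original Spiegelhalter et al. definition pD = Dbar - D(theta-bar), not Martyn's variance-based alternative, and a toy normal-mean model of my own in place of real MCMC output:

```python
import math
import random
import statistics

random.seed(42)

# Toy setup: data y ~ Normal(mu, 1); with a flat prior the posterior for
# mu is Normal(ybar, 1/n), which we sample directly in place of MCMC.
y = [1.2, 0.8, 1.5, 0.9, 1.1]
n = len(y)
mu_draws = [random.gauss(statistics.mean(y), (1 / n) ** 0.5)
            for _ in range(5000)]

def deviance(mu):
    """-2 * log-likelihood of y under Normal(mu, 1)."""
    loglik = sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - mu) ** 2
                 for yi in y)
    return -2 * loglik

mean_dev = statistics.mean(deviance(m) for m in mu_draws)  # Dbar
dev_at_mean = deviance(statistics.mean(mu_draws))          # D(theta-bar)
p_d = mean_dev - dev_at_mean   # effective number of parameters (~1 here)
dic = dev_at_mean + 2 * p_d
print(round(p_d, 1))
```

With one free parameter and a flat prior, pD comes out near 1, as it should.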

Aki Vehtari commented too, with a link to a recent article by Sumio Watanabe on something called the widely applicable information criterion. Watanabe's article begins:

In regular statistical models, the leave-one-out cross-validation is asymptotically equivalent to the Akaike information criterion. However, since many learning machines are singular statistical models, the asymptotic behavior of the cross-validation remains unknown. In previous studies, we established the singular learning theory and proposed a widely applicable information criterion, the expectation value of which is asymptotically equal to the average Bayes generalization loss. In the present paper, we theoretically compare the Bayes cross-validation loss and the widely applicable information criterion and prove two theorems. First, the Bayes cross-validation loss is asymptotically equivalent to the widely applicable information criterion as a random variable. Therefore, model selection and hyperparameter optimization using these two values are asymptotically equivalent. Second, the sum of the Bayes generalization error and the Bayes cross-validation error is asymptotically equal to 2λ/n, where λ is the real log canonical threshold and n is the number of training samples. Therefore the relation between the cross-validation error and the generalization error is determined by the algebraic geometrical structure of a learning machine. We also clarify that the deviance information criteria are different from the Bayes cross-validation and the widely applicable information criterion.

It's great to see progress in this area. After all these years of BIC variants, which I just hate, I like that researchers are moving back to the predictive-error framework. I think that Spiegelhalter, Best, Carlin, and Van der Linde made an important contribution with their DIC paper ten years ago. Even if DIC is not perfect and can be improved, they pushed the field in the right direction.
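The criterion Watanabe describes can be computed directly from pointwise log-likelihood evaluations over posterior draws. A minimal sketch, using the variance form of the penalty and a toy normal-mean model of my own (so the numbers illustrate the recipe, not any real analysis):

```python
import math
import random
import statistics

random.seed(7)

# Toy data y ~ Normal(mu, 1); with a flat prior the posterior for mu is
# Normal(ybar, 1/n), which we sample directly in place of real MCMC draws.
y = [1.4, -0.9, 0.6, -1.2, 1.1]
n = len(y)
mu_draws = [random.gauss(statistics.mean(y), (1 / n) ** 0.5)
            for _ in range(4000)]

def loglik(yi, mu):
    return -0.5 * math.log(2 * math.pi) - 0.5 * (yi - mu) ** 2

# lppd: log pointwise predictive density. Average the likelihood (not the
# log-likelihood) over draws, then take logs, one term per data point.
lppd = sum(
    math.log(statistics.mean(math.exp(loglik(yi, m)) for m in mu_draws))
    for yi in y)

# p_waic: the sum over data points of the posterior variance of the
# log-likelihood; it plays the role of the effective number of parameters.
p_waic = sum(
    statistics.variance([loglik(yi, m) for m in mu_draws]) for yi in y)

waic = -2 * (lppd - p_waic)  # on the deviance scale
print(round(p_waic, 1))
```

Unlike DIC, nothing here requires a point estimate of the parameters, which is why WAIC behaves sensibly in the singular models Watanabe studies.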

According to a New York Times article, cognitive scientists Hugo Mercier and Dan Sperber have a new theory about rational argument: humans didn't develop it in order to learn about the world, we developed it in order to win arguments with other people. "It was a purely social phenomenon. It evolved to help us convince others and to be careful when others try to convince us."

Based on the NYT article, it seems that Mercier and Sperber are basically flipping around the traditional argument, which is that humans learned to reason about the world, albeit imperfectly, and learned to use language to convey that reasoning to others. These guys would suggest that it's the other way around: we learned to argue with others, and this has gradually led to the ability to actually make (and recognize) sound arguments, but only indirectly. The article quotes them: "At least in some cultural contexts, this results in a kind of arms race towards greater sophistication in the production and evaluation of arguments. When people are motivated to reason, they do a better job at accepting only sound arguments, which is quite generally to their advantage."

Of course I have no idea if any of this is true, or even how to test it. But it's definitely true that people are often convinced by wrong or even crazy arguments, and they (we) are subject to confirmation bias and availability bias and all sorts of other systematic biases. One thing that bothers me especially is that a lot of people are simply indifferent to facts and rationality when making decisions. Mercier and Sperber have at least made a decent attempt to explain why people are like this.

E. J. Wagenmakers writes:

Here's a link for you. The first sentences tell it all:
Climate warming since 1995 is now statistically significant, according to Phil Jones, the UK scientist targeted in the "ClimateGate" affair. Last year, he told BBC News that post-1995 warming was not significant--a statement still seen on blogs critical of the idea of man-made climate change. But another year of data has pushed the trend past the threshold usually used to assess whether trends are "real."

Now I [Wagenmakers] don't like p-values one bit, but even people who do like them must cringe when they read this. First, this apparently is a sequential design, so I'm not sure what sampling plan leads to these p-values. Secondly, comparing significance values suggests that the data have suddenly crossed some invisible line that divided nonsignificant from significant effects (as you pointed out in your paper with Hal Stern). Ugh!

I share Wagenmakers's reaction. There seems to be some confusion here between inferential thresholds and decision thresholds. Which reminds me how much I hate the old 1950s literature (both classical and Bayesian) on inference as decision, loss functions for estimators, and all the rest. I think the p-value serves a role in summarizing certain aspects of a model's fit to data, but I certainly don't think it makes sense as any kind of decision threshold (despite its nearly universal use as such to decide on the acceptance of research in scientific journals).

Howard Wainer writes in the Statistics Forum:

The Chinese scientific literature is rarely read or cited outside of China. But the authors of this work are usually knowledgeable of the non-Chinese literature -- at least the A-list journals. And so they too try to replicate the alpha finding. But do they? One would think that they would find the same diminished effect size, but they don't! Instead they replicate the original result, even larger. Here's one of the graphs:

How did this happen?

Full story here.

Much-honored playwright Tony Kushner was set to receive one more honor--a degree from John Jay College--but it was suddenly taken away from him on an 11-1 vote of the trustees of the City University of New York. This was the first rejection of an honorary degree nomination since 1961.

The news article focuses on one trustee, Jeffrey Wiesenfeld, an investment adviser and onetime political aide, who opposed Kushner's honorary degree, but to me the relevant point is that the committee as a whole voted 11-1 to ding him.

Kushner said, "I'm sickened that this is happening in New York City. Shocked, really." I can see why he's shocked, but perhaps it's not so surprising that it's happening in NYC. Recall the famous incident from 1940 in which Bertrand Russell was invited and then uninvited to teach at City College. The problem that time was Russell's views on free love (as they called it back then). There seems to be a long tradition of city college officials being willing to risk controversy to make a political point.

P.S. I was trying to imagine what these 11 trustees could've been thinking . . . my guess is it was some sort of group-dynamics thing. They started talking about it and convinced each other that the best thing to do would be to set Kushner's nomination aside. I bet if they'd had to decide separately most of them wouldn't have come to this conclusion. And I wouldn't be surprised if, five minutes after walking away from that meeting, most of those board members suddenly thought, Uh oh--we screwed up on this one! As cognitive psychologists have found, this is one of the problems with small-group deliberation: a group of people can be led to a decision which is not anywhere near the center of their positions considered separately.

A graduate student in public health writes:

I have been asked to do the statistical analysis for a medical unit that is delivering a pilot study of a program to [details redacted to prevent identification]. They are using a prospective, nonrandomized, cohort-controlled trial study design.

The investigator thinks they can recruit only a small number of treatment and control cases, maybe fewer than 30 in total. After I told the investigator that I cannot do anything statistically with a sample size that small, he responded that small sample sizes are common in this field, and he sent me an example of an analysis that someone had done on a similar study.

So he still wants me to come up with a statistical plan. Is it unethical for me to do anything other than descriptive statistics? I think he should just stick to qualitative research. But the study he mentions above has 40 subjects and apparently had enough power to detect some effects. This is a pilot study, after all, so the n does not have to be large. It's not randomized, though, so I would think it would need a larger n because of the weak design.

My reply:

My first, general, recommendation is that it always makes sense to talk with any person as if he is completely ethical. If he is ethical, this is a good idea, and if he is not, you don't want him to think you think badly of him. If you are worried about a serious ethical problem, you can ask about it by saying something like, "From the outside, this could look pretty bad. An outsider, seeing this plan, might think we are being dishonest etc. etc." That way you can express this view without it being personal. And maybe your colleague has a good answer, which he can tell you.

To get to your specific question, there is really no such thing as a minimum acceptable sample size. You can get statistical significance with n=5 if your signal is strong enough.
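To illustrate that point with invented numbers: here is a one-sample t-test done by hand, where n = 5 but the signal is overwhelming relative to the noise. The critical value 2.776 is the two-sided 5% threshold for a t distribution with 4 degrees of freedom:

```python
import math
import statistics

# Five measurements with a strong signal relative to their spread:
x = [9.8, 10.1, 10.3, 9.9, 10.2]
n = len(x)

mean = statistics.mean(x)
se = statistics.stdev(x) / math.sqrt(n)   # standard error of the mean
t = mean / se                             # test of H0: true mean = 0

T_CRIT = 2.776   # two-sided 5% critical value for t with n - 1 = 4 df
print(t > T_CRIT)  # True: statistically significant despite n = 5
```

The point is that sample size alone doesn't determine whether an analysis is informative; the signal-to-noise ratio does.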

Generally, though, the purpose of a pilot study is not to get statistical significance but rather to get experience with the intervention and the measurements. It's ok to do a pilot analysis, recognizing that it probably won't reach statistical significance. Also, regardless of sample size, qualitative analysis is appropriate and necessary in any pilot study.

Finally, of course they should not imply that they can collect a larger sample size than they can actually do.

John Sides followed up on a discussion of his earlier claim that political independents vote for president in a reasonable way based on economic performance. John's original post led to the amazing claim by New Republic writer Jonathan Chait that John wouldn't "even want to be friends with anybody who" voted in this manner.

I've been sensitive to discussions of rationality and voting ever since Aaron Edlin, Noah Kaplan, and I wrote our article on voting as a rational choice: why and how people vote to improve the well-being of others.

Models of rationality are controversial in politics, just as they are in other fields ranging from economics to criminology. On one side you have people trying to argue that all behavior is rational, from lottery playing to drug addiction to engaging in email with exiled Nigerian royalty. Probably the only behavior that no one has yet claimed is rational is blogging, but I bet that's coming too. From the other direction, lots of people point to strong evidence of subject-matter ignorance in fields ranging from demography to the federal budget to demonstrate that, even if voters think they're being rational, they can't be making reasoned decisions in any clear sense.

Here's what I want to add. In the usual debates, people argue about whether a behavior is rational or not. Or, at a more sophisticated level, people might dispute how rational or irrational a given action is. But I don't think this is the right way of thinking about it.

People have many overlapping reasons for anything they do. For a behavior to be "rational" does not mean that a person does it as the result of a reasoned argument but rather that some aspects of that behavior could be modeled as such. This comes up in section 5.2 of my article with Edlin and Kaplan: To model a behavior as rational does not compete with more traditional psychological explanations; it reinforces them.

For example, voter turnout is higher in elections that are anticipated to be close. This has a rational explanation--if an election is close, it's more likely that you will cast the deciding vote--and also a process explanation: if an election is close, candidates will campaign harder, more people will talk about the election, and a voter is more likely to want to be part of the big story. These two explanations work together; they don't compete: it's rational for you to vote, and it's also rational for the campaigns to try to get you to vote, making the race more interesting and increasing your motivation.

I don't anticipate that this note will resolve some of the debates about participation of independents in politics but I hope that this clarifies some of the concerns about the "rationality" label.

P.S. John is better at engaging journalists than I am. When Chait wrote something that I didn't like and then responded to my response, I grabbed onto a key point in his response and emphasized our agreement, thus ending the debate (such as it was), rather than emphasizing our remaining points of disagreement. John is better at keeping the discussion alive.

Details here.

Free $5 gift certificate!


I bought something online and got a gift certificate for $5 to use at The gift code is TP07zh4q5dc and it expires on 30 Apr. I don't need a T-shirt so I'll pass this on to you.

I assume it only works once. So the first person who follows up on this gets the discount. Enjoy!

The following is an essay into a topic I know next to nothing about.

As part of our endless discussion of Dilbert and Charlie Sheen, commenter Fraac linked to a blog by philosopher Edouard Machery, who tells a fascinating story:

How do we think about the intentional nature of actions? And how do people with an impaired mindreading capacity think about it?

Consider the following probes:

The Free-Cup Case

Joe was feeling quite dehydrated, so he stopped by the local smoothie shop to buy the largest sized drink available. Before ordering, the cashier told him that if he bought a Mega-Sized Smoothie he would get it in a special commemorative cup. Joe replied, 'I don't care about a commemorative cup, I just want the biggest smoothie you have.' Sure enough, Joe received the Mega-Sized Smoothie in a commemorative cup. Did Joe intentionally obtain the commemorative cup?

The Extra-Dollar Case

Joe was feeling quite dehydrated, so he stopped by the local smoothie shop to buy the largest sized drink available. Before ordering, the cashier told him that the Mega-Sized Smoothies were now one dollar more than they used to be. Joe replied, 'I don't care if I have to pay one dollar more, I just want the biggest smoothie you have.' Sure enough, Joe received the Mega-Sized Smoothie and paid one dollar more for it. Did Joe intentionally pay one dollar more?

You surely think that paying an extra dollar was intentional, while getting the commemorative cup was not. [Indeed, I do--AG.] So do most people (Machery, 2008).

But Tiziana Zalla and I [Machery] have found that if you had Asperger Syndrome, a mild form of autism, your judgments would be very different: You would judge that paying an extra-dollar was not intentional, just like getting the commemorative cup.

I'm not particularly interested in the Asperger's angle (except for the linguistic oddity that most people call it Asperger's but in the medical world it's called Asperger; compare, for example, the headline of the linked blog to its text), but I am fascinated by the above experiment. Even after reading the description, it seems to me perfectly natural to think of the free cup as unintentional and the extra dollar as intentional. But I also agree with the implicit point that, in a deeper sense, the choice to pay the extra dollar isn't really more intentional than the choice to take the cup. It just feels that way.

To engage in a bit of introspective reasoning (as is traditional in the "heuristics and biases" field), I'd say the free cup just happened, whereas in the second scenario Joe had to decide to pay the dollar.

But that's not really it. The passive/active division correctly demarcates the free cup and extra dollar examples, but Machery presents other examples where both scenarios are passive, or where both scenarios are active, and you can get perceived intentionality or lack of intentionality in either case. (Just as we learned from classical decision theory and the First Law of Robotics, to not decide is itself a decision.)

Machery's explanation (which I don't buy)

Scott "Dilbert" Adams has met Charlie Sheen and thinks he really is a superbeing. This perhaps relates to some well-known cognitive biases. I'm not sure what this one's called, but the idea is that Adams is probably overweighting his direct impressions: he saw Sheen-on-the-set, not Sheen-beating-his-wife. Also, everybody else hates Sheen, so Adams can distinguish himself by being tolerant, etc.

I'm not sure what this latter phenomenon is called, but I've noticed it before. When I come into a new situation and meet some person X, who everybody says is a jerk, and then person X happens to act in a civilized way that day, then there's a real temptation to say, Hey, X isn't so bad after all. It makes me feel so tolerant and above-it-all. Perhaps that's partly what's going on with Scott Adams here: he can view himself as the objective outsider who can be impressed by Sheen, not like all those silly emotional people who get hung up on the headlines. From here, though, it just makes Adams look silly, to be so impressed that Sheen didn't miss a line of dialogue, etc. The logical next step is the story of how he met John Edwards and was impressed at how statesmanlike he was.

Some thoughts on the implausibility of Paul Ryan's 2.8% unemployment forecast. Some general issues arise.

P.S. Yes, Democrats also have been known to promote optimistic forecasts!

No joke. See here (from Kaiser Fung). At the Statistics Forum.

I was recently rereading and enjoying Bill James's Historical Baseball Abstract (the second edition, from 2001).

But even the Master is not perfect. Here he is, in the context of the all-time 20th-greatest shortstop (in his reckoning):

Are athletes special people? In general, no, but occasionally, yes. Johnny Pesky at 75 was trim, youthful, optimistic, and practically exploding with energy. You rarely meet anybody like that who isn't an ex-athlete--and that makes athletes seem special. [italics in the original]

Hey, I've met 75-year-olds like that--and none of them are ex-athletes! That's probably because I don't know a lot of ex-athletes. But Bill James . . . he knows a lot of athletes. He went to the bathroom with Tim Raines once! The most I can say is that I saw Rickey Henderson steal a couple bases when he was playing against the Orioles once.

Cognitive psychologists talk about the base-rate fallacy, which is the mistake of estimating probabilities without accounting for underlying frequencies. Bill James knows a lot of ex-athletes, so it's no surprise that the youthful, optimistic 75-year-olds he meets are likely to be ex-athletes. The rest of us don't know many ex-athletes, so it's no surprise that most of the youthful, optimistic 75-year-olds we meet are not ex-athletes.
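The base-rate point fits in one line of Bayes' rule (the base rates and likelihoods below are invented purely for illustration): with the same likelihood of being lively given that you are or aren't an ex-athlete, the posterior probability that a lively 75-year-old is an ex-athlete depends heavily on how many ex-athletes you happen to know.

```python
def p_ex_athlete(base_rate, lively_given_athlete=0.30, lively_given_other=0.05):
    """P(ex-athlete | lively 75-year-old) by Bayes' rule; made-up likelihoods."""
    num = base_rate * lively_given_athlete
    return num / (num + (1 - base_rate) * lively_given_other)

# Bill James's world: half the 75-year-olds he knows are ex-athletes.
print(round(p_ex_athlete(0.50), 2))  # 0.86
# Most people's world: almost none are.
print(round(p_ex_athlete(0.01), 2))  # 0.06
```

Same evidence, wildly different conclusions, driven entirely by the base rate.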

The mistake James made in the above quote was to write "You" when he really meant "I." I'm not disputing his claim that athletes are disproportionately likely to become lively 75-year-olds; what I'm disagreeing with is his statement that almost all such people are ex-athletes.

Yeah, I know, I'm being picky. But the point is important, I think, because of the window it offers into the larger issue of people being trapped in their own environment (the "availability heuristic," in the jargon of cognitive psychology). Athletes loom large in Bill James's world--and I wouldn't want it any other way--and sometimes he forgets that the rest of us live in a different world.

A common aphorism among artificial intelligence practitioners is that A.I. is whatever machines can't currently do.

Adam Gopnik, writing for the New Yorker, has a review called Get Smart in the most recent issue (4 April 2011). Ostensibly, the piece is a review of new books, one by Joshua Foer, Moonwalking with Einstein: The Art and Science of Remembering Everything, and one by Stephen Baker, Final Jeopardy: Man vs. Machine and the Quest to Know Everything (which would explain Baker's spate of Jeopardy!-related blog posts). But like many such pieces in highbrow magazines, the book reviews are just a cover for staking out a philosophical position. Gopnik does a typically New Yorker job in explaining the title of this blog post.

Remember that bizarre episode in Freakonomics 2, where Levitt and Dubner went to the Batcave-like lair of a genius billionaire who told them that "the problem with solar panels is that they're black." I'm not the only one who wondered at the time: of all the issues to bring up about solar power, why that one?

Well, I think I've found the answer in this article by John Lanchester:

In 2004, Nathan Myhrvold, who had, five years earlier, at the advanced age of forty, retired from his job as Microsoft's chief technology officer, began to contribute to the culinary discussion board . . . At the time he grew interested in sous vide, there was no book in English on the subject, and he resolved to write one. . . . broadened it further to include information about the basic physics of heating processes, then to include the physics and chemistry of traditional cooking techniques, and then to include the science and practical application of the highly inventive new techniques that are used in advanced contemporary restaurant food--the sort of cooking that Myhrvold calls "modernist."

OK, fine. But what does this have to do with solar panels? Just wait:

Notwithstanding its title, "Modernist Cuisine" contains hundreds of pages of original, firsthand, surprising information about traditional cooking. Some of the physics is quite basic: it had never occurred to me that the reason many foods go from uncooked to burned at such speed is that light-colored foods reflect heat better than dark: "As browning reactions begin, the darkening surface rapidly soaks up more and more of the heat rays. The increase in temperature accelerates dramatically."

Aha! Now, I'm just guessing here, but my conjecture is that after studying this albedo effect in the kitchen, Myhrvold was primed to see it everywhere. Of course, maybe it went the other way: he was thinking about solar panels first and then applied his ideas to the kitchen. But, given that the experts seem to think the albedo effect is a red herring (so to speak) regarding solar panels, I wouldn't be surprised if Myhrvold just started talking about reflectivity because it was on his mind from the cooking project. My own research ideas often leak from one project to another, so I wouldn't be surprised if this happens to others too.

P.S. More here and here.

Mark Palko points to a news article by Michael Winerip on teacher assessment:

No one at the Lab Middle School for Collaborative Studies works harder than Stacey Isaacson, a seventh-grade English and social studies teacher. She is out the door of her Queens home by 6:15 a.m., takes the E train into Manhattan and is standing out front when the school doors are unlocked, at 7. Nights, she leaves her classroom at 5:30. . . .

Her principal, Megan Adams, has given her terrific reviews during the two and a half years Ms. Isaacson has been a teacher. . . . The Lab School has selective admissions, and Ms. Isaacson's students have excelled. Her first year teaching, 65 of 66 scored proficient on the state language arts test, meaning they got 3's or 4's; only one scored below grade level with a 2. More than two dozen students from her first two years teaching have gone on to . . . the city's most competitive high schools. . . .

You would think the Department of Education would want to replicate Ms. Isaacson . . . Instead, the department's accountability experts have developed a complex formula to calculate how much academic progress a teacher's students make in a year -- the teacher's value-added score -- and that formula indicates that Ms. Isaacson is one of the city's worst teachers.

According to the formula, Ms. Isaacson ranks in the 7th percentile among her teaching peers -- meaning 93 per cent are better. . . .

How could this happen to Ms. Isaacson? . . . Everyone who teaches math or English has received a teacher data report. On the surface the report seems straightforward. Ms. Isaacson's students had a prior proficiency score of 3.57. Her students were predicted to get a 3.69 -- based on the scores of comparable students around the city. Her students actually scored 3.63. So Ms. Isaacson's value added is 3.63-3.69.

Remember, the exam is on a 1-4 scale, and we were already told that 65 out of 66 students scored 3 or 4, so an average of 3.63 (or, for that matter, 3.69) is plausible. The 3.57 is "the average prior year proficiency rating of the students who contribute to a teacher's value added score." I assume that the "proficiency rating" is the same as the 1-4 test score but I can't be sure.

The predicted score is, according to Winerip, "based on 32 variables -- including whether a student was retained in grade before pretest year and whether a student is new to city in pretest or post-test year. . . . Ms. Isaacson's best guess about what the department is trying to tell her is: Even though 65 of her 66 students scored proficient on the state test, more of her 3s should have been 4s."

This makes sense to me. Winerip seems to be presenting this as some mysterious process, but it seems pretty clear to me. A "3" is a passing grade, but if you're teaching in a school with "selective admissions" and the particular mix of kids that this teacher has, the expectation is that most of your students will get "4"s.

We can work through the math (at least approximately). We don't know how this teacher's students did this year, so I'll use the data given above, from her first year. Suppose that x students in the class got 4's, 65-x got 3's, and one student got a 2. To get an average of 3.63, you need 4x + 3(65-x) + 2 = 3.63*66. That is, x = 3.63*66 - 2 - 3*65 = 42.58. This looks like x=43. Let's try it out: (4*43 + 3*22 + 2)/66 = 3.63 (or, to three decimal places, 3.636). This is close enough for me. To get 3.69 (more precisely, 3.697), you'd need 47 4's, 18 3's, and a 2. So the gap would be covered by four students (in a class of 66) moving up from a 3 to a 4. This gives a sense of the difference between a teacher in the 7th percentile and a teacher in the 50th.
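The back-of-the-envelope calculation above can be checked directly; this sketch uses the counts given in the article (66 students, one 2, the rest 3's and 4's):

```python
# Find the number of 4's (x) needed for a class of 66, with one 2 and the
# remainder 3's, to reach a given average score on the 1-4 scale.
def class_average(num_fours, class_size=66, num_twos=1):
    num_threes = class_size - num_fours - num_twos
    return (4 * num_fours + 3 * num_threes + 2 * num_twos) / class_size

x = round(3.63 * 66 - 2 - 3 * 65)      # solving 4x + 3(65 - x) + 2 = 3.63 * 66
print(x, round(class_average(x), 3))   # 43 fours gives an average of 3.636
print(round(class_average(47), 3))     # 47 fours gives an average of 3.697
```

So the distance between the actual and predicted averages is indeed about four students moving from a 3 to a 4.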

I wonder what this teacher's value-added scores were for the previous two years.

Hipmunk < Expedia, again


This time on a NY-Cincinnati roundtrip. Hipmunk could find the individual flights but could not put them together. In contrast, Expedia got it right the first time.

See here and here for background. If anybody reading this knows David Pogue, please let him know about this. A flashy interface is fine, but ultimately what I'm looking for is a flight at the right place and the right time.

Calibration in chess


Has anybody done this study yet? I'm curious about the results. Perhaps there's some chess-playing cognitive psychologist who'd like to collaborate on this?

Mike Cohen writes:

Bidding for the kickoff


Steven Brams and James Jorasch propose a system for reducing the advantage that comes from winning the coin flip in overtime:

Dispensing with a coin toss, the teams would bid on where the ball is kicked from by the kicking team. In the NFL, it's now the 30-yard line. Under Brams and Jorasch's rule, the kicking team would be the team that bids the lower number, because it is willing to put itself at a disadvantage by kicking from farther back. However, it would not kick from the number it bids, but from the average of the two bids.

To illustrate, assume team A bids to kick from the 38-yard line, while team B bids its 32-yard line. Team B would win the bidding and, therefore, be designated as the kick-off team. But B wouldn't kick from 32, but instead from the average of 38 and 32--its 35-yard line.

This is better for B by 3 yards than the 32-yard line that it proposed, because it's closer to the end zone it is kicking towards. It's also better for A by 3 yards to have B kick from the 35-yard line rather than from the 38-yard line A proposed if it were the kick-off team.

In other words, the 35-yard line is a win-win solution--both teams gain a 3-yard advantage over what they reported would make them indifferent between kicking and receiving. While bidding to determine the yard line from which a ball is kicked has been proposed before, the win-win feature of using the average of the bids--and recognizing that both teams benefit if the low bidder is the kicking team--has not. Teams seeking to merely get the ball first would be discouraged from bidding too high--for example, the 45-yard line--as this could result in a kick-off pinning them far back in their own territory.

"Metaphorically speaking, the bidding system levels the playing field," Brams and Jorasch maintain. "It also enhances the importance of the strategic choices that the teams make, rather than leaving to chance which team gets a boost in the overtime period."

This seems like a good idea. Also fun for the fans--another way to second-guess the coach.
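The Brams-Jorasch rule is simple enough to sketch in a few lines. As in their example, each bid is a yard line measured from the bidding team's own goal line, and the function name and interface here are just for illustration:

```python
# Sketch of the bidding rule described above: the lower bidder kicks off,
# but from the average of the two bids rather than from its own bid.
def kickoff(bid_a, bid_b):
    """Return (kicking_team, kickoff_yard_line)."""
    kicker = "A" if bid_a < bid_b else "B"
    return kicker, (bid_a + bid_b) / 2

print(kickoff(bid_a=38, bid_b=32))  # B kicks from its 35-yard line
```

Averaging is what produces the win-win feature: each team ends up 3 yards better off than the position it declared would make it indifferent.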

Homework and treatment levels


Interesting discussion here by Mark Palko on the difficulty of comparing charter schools to regular schools, even if the slots in the charter schools have been assigned by lottery. Beyond the direct importance of the topic, I found the discussion interesting because I always face a challenge in my own teaching to assign the right amount of homework, given that if I assign too much, students will simply rebel and not do it.

To get back to the school-choice issue . . . Mark discussed selection effects: if a charter school is popular, it can require parents to sign a contract agreeing that they will supervise their children's homework. Mark points out that there is a selection issue here, that the sort of parents who would sign that form are different from parents in general. But it seems to me there's one more twist: These charter schools are popular, right? So that would imply that there is some reservoir of parents who would like to sign the form but don't have the opportunity to do so in a regular school. So, even if the charter school is no more effective, conditional on the level of homework assigned, the spread of charter schools could increase the level of homework and thus be a good thing in general (assuming, of course, that you want your kid to do more homework). Or maybe I'm missing something here.

P.S. More here (from commenter ceolaf).

Andrew Gelman (Columbia University) and Eric Johnson (Columbia University) seek to hire a post-doctoral fellow to work on the application of the latest methods of multilevel data analysis, visualization, and regression modeling to an important commercial problem: forecasting retail sales at the individual item level. These forecasts are used to make ordering, pricing, and promotions decisions that can have significant economic impact on the retail chain: even modest improvements in the accuracy of predictions, across a large retailer's product line, can yield substantial margin improvements.

Activities focus on the development of iterative imputation algorithms and diagnostics for missing-data imputation. Activities would include model-development, programming, and data analysis. This project is to be undertaken with, and largely funded by, a firm which provides forecasting technology and services to large retail chains, and which will provide access to a unique and rich set of proprietary data. The postdoc will be expected to spend some time working directly with this firm, but this is fundamentally a research position.

The ideal candidate will have a background in statistics, psychometrics, or economics and be interested in marketing or related topics. He or she should be able to work fluently in R and should already know about hierarchical models and Bayesian inference and computation.

The successful candidate will become part of the lively Applied Statistics Center community, which includes several postdocs (with varied backgrounds in statistics, computer science, and social science), Ph.D., M.A., and undergraduate students, and faculty at Columbia and elsewhere. We want people who love collaboration and have the imagination, drive, and technical skills to make a difference in our projects.

If you are interested in this position, please send a letter of application, a CV, some of your articles, and three letters of recommendation to the Applied Statistics Center coordinator, Caroline Peters. Review of applications will begin immediately.

Tyler Cowen links a blog by Samuel Arbesman mocking people who are so lazy that they take the elevator from 1 to 2. This reminds me of my own annoyance about a guy who worked in my building and did not take the elevator. (For the full story, go here and search on "elevator.")

Seeing as the Freakonomics people were kind enough to link to my list of five recommended books, I'll return the favor and comment on a remark from Levitt, who said:

Thiel update


A year or so ago I discussed the reasoning of zillionaire financier Peter Thiel, who seems to believe his own hype and, worse, seems to be able to convince reporters of his infallibility as well. Apparently he "possesses a preternatural ability to spot patterns that others miss."

More recently, Felix Salmon commented on Thiel's financial misadventures:

Peter Thiel's hedge fund, Clarium Capital, ain't doing so well. Its assets under management are down 90% from their peak, and total returns from the high point are -65%. Thiel is smart, successful, rich, well-connected, and on top of all that his calls have actually been right . . . None of that, clearly, was enough for Clarium to make money on its trades: the fund was undone by volatility and weakness in risk management.

There are a few lessons to learn here.

Firstly, just because someone is a Silicon Valley gazillionaire, or any kind of successful entrepreneur for that matter, doesn't mean they should be trusted with other people's money.

Secondly, being smart is a great way of getting in to a lot of trouble as an investor. In order to make money in the markets, you need a weird combination of arrogance and insecurity. Arrogance on its own is fatal, but it's also endemic to people in Silicon Valley who are convinced that they're rich because they're smart, and that since they're still smart, they can and will therefore get richer still. . . .

Just to be clear, I'm not saying that Thiel losing money is evidence that he's some sort of dummy. (Recall my own unsuccess as an investor.) What I am saying is, don't believe the hype.

Chapter 1

On Sunday we were over on 125 St so I stopped by the Jamaican beef patties place but they were closed. Jesus Taco was next door so I went there instead. What a mistake! I don't know what Masanao and Yu-Sung could've been thinking. Anyway, then I had Jamaican beef patties on the brain so I went by Monday afternoon and asked for 9: 3 spicy beef, 3 mild beef (for the kids), and 3 chicken (not the jerk chicken; Bob got those the other day and they didn't impress me). I'm about to pay and then a bunch of people come in and start ordering. The woman behind the counter asks if I'm in a hurry, I ask why, she whispers, For the same price you can get a dozen. So I get two more spicy beef and a chicken. She whispers that I shouldn't tell anyone. I can't really figure out why I'm getting this special treatment. So I walk out of there with 12 patties. Total cost: $17.25. It's a good deal: they're small but not that small. Sure, I ate 6 of them, but I was hungry.

Chapter 2

A half hour later, I'm pulling keys out of my pocket to lock up my bike and a bunch of change falls out. (Remember--the patties cost $17.25, so I had three quarters in my pocket, plus whatever happened to be there already.) I see all three quarters plus a couple of pennies. The change is on the street, and, as I'm leaning down to pick it up, I notice there's a parked car, right in front of me, with its engine running. There's no way the driver can see me if I'm bending down behind the rear wheels. And if he backs up, I'm dead meat.

It suddenly comes to me--this is what they mean when they talk about "picking pennies in front of a steamroller." That's exactly what I was about to do!

After a brief moment of indecision, I bent down and picked up the quarters. I left the pennies where they were, though.

P.S. The last time I experienced an economics cliche in real time was a few weeks ago, when I spotted $5 in cash on the street.

Type S error: When your estimate is the wrong sign, compared to the true value of the parameter

Type M error: When the magnitude of your estimate is far off, compared to the true value of the parameter

More here.
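A quick simulation makes the definitions concrete. The setup here is invented for illustration: a small true effect of 0.1 estimated with standard error 1, keeping only the estimates that reach statistical significance:

```python
# Simulate noisy estimates of a small true effect, condition on statistical
# significance, and measure Type S (wrong sign) and Type M (magnitude) errors.
import random

random.seed(1)
true_effect, se, sims = 0.1, 1.0, 100_000
significant = [est for est in (random.gauss(true_effect, se) for _ in range(sims))
               if abs(est) > 1.96 * se]  # keep only "significant" estimates

type_s = sum(est < 0 for est in significant) / len(significant)  # wrong sign
type_m = sum(abs(est) for est in significant) / len(significant) / true_effect
print(f"Type S rate: {type_s:.2f}, average exaggeration: {type_m:.0f}x")
```

In this low-power setting, a sizable fraction of the significant estimates have the wrong sign, and on average the significant estimates overstate the true magnitude many times over.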

Costless false beliefs



From the Gallup Poll:

Four in 10 Americans, slightly fewer today than in years past, believe God created humans in their present form about 10,000 years ago.

They've been asking the question since 1982 and it's been pretty steady at 45%, so in some sense this is good news! (I'm saying this under the completely unsupported belief that it's better for people to believe truths than falsehoods.)

Gur Huberman asks what I think of this magazine article by Jonah Lehrer (see also here).

My reply is that it reminds me a bit of what I wrote here. Or see here for the quick powerpoint version: The short story is that if you screen for statistical significance when estimating small effects, you will necessarily overestimate the magnitudes of effects, sometimes by a huge amount. I know that Dave Krantz has thought about this issue for a while; it came up when Francis Tuerlinckx and I wrote our paper on Type S errors, ten years ago.

My current thinking is that most (almost all?) research studies of the sort described by Lehrer should be accompanied by retrospective power analyses, or informative Bayesian inferences. Either of these approaches--whether classical or Bayesian, the key is that they incorporate real prior information, just as is done in a classical prospective power analysis--would, I think, moderate the tendency to overestimate the magnitude of effects.

In answer to the question posed by the title of Lehrer's article, my answer is Yes, there is something wrong with the scientific method, if this method is defined as running experiments and doing data analysis in a patternless way and then reporting, as true, results that pass a statistical significance threshold.

And corrections for multiple comparisons will not solve the problem: such adjustments merely shift the threshold without resolving the problem of overestimation of small effects.

Sander Wagner writes:

I just read the post on ethical concerns in medical trials. As there seems to be a lot more pressure on private researchers, I thought it might be a nice little exercise to compare p-values from privately funded medical trials with those reported from publicly funded research, to see if confirmation pressure is higher in private research (i.e., p-values are closer to the cutoff levels for significance for the privately funded research). Do you think this is a decent idea or are you sceptical? Also, are you aware of any sources listing a large number of representative medical studies and their type of funding?

My reply:

This sounds like something worth studying. I don't know where to get data about this sort of thing, but now that it's been blogged, maybe someone will follow up.

Aleks points me to this research summary from Dan Goldstein. Good stuff. I've heard of a lot of this--I actually use some of it in my intro statistics course, when we show the students how they can express probability trees using frequencies--but it's good to see it all in one place.

Rational addiction


Ole Rogeberg sends in this:

and writes:

No idea if this is amusing to non-economists, but I tried my hand at the xtranormal-trend. It's an attempt to spoof the many standard "incantations" I've encountered over the years from economists who don't want to agree that rational addiction theory lacks justification for some of the claims it makes. More specifically, the claims that the theory can be used to conduct welfare analysis of alternative policies.

See here (scroll to Rational Addiction) and here for background.

Prison terms for financial fraud?


My econ dept colleague Joseph Stiglitz suggests that financial fraudsters be sent to prison. He points out that the usual penalty--million-dollar fines--just isn't enough for crimes whose rewards can be in the hundreds of millions of dollars.

That all makes sense, but why do the options have to be:

1. No punishment

2. A fine with little punishment or deterrent value

3. Prison.

What's the point of putting nonviolent criminals in prison? As I've said before, I'd prefer if the government just took all these convicted thieves' assets along with 95% of their salary for several years, made them do community service (sorting bottles and cans at the local dump, perhaps; a financier should be good at this sort of thing, no?), etc. If restriction of personal freedom is judged to be part of the sentence, they could be given some sort of electronic tag that would send a message to the police if they are ever more than 3 miles from home. And a curfew so they have to stay home between the hours of 7pm and 7am. Also take away internet access and require that they live in a 200-square-foot apartment in a grungy neighborhood. And so forth. But no need to bill the taxpayers for the cost of prison.

Stiglitz writes:

When you say the Pledge of Allegiance you say, with "justice for all." People aren't sure that we have justice for all. Somebody is caught for a minor drug offense, they are sent to prison for a very long time. And yet, these so-called white-collar crimes, which are not victimless, almost none of these guys, almost none of them, go to prison.

To me, though, this misses the point. Why send minor drug offenders to prison for a very long time? Instead, why not just equip them with some sort of recorder/transmitter that has to be always on. If they can do all their drug deals in silence, then, really, how much trouble are they going to be causing?

Readers with more background in criminology than I will be able to poke holes in my proposals, I'm sure.

P.S. to the impatient readers out there: Yeah, yeah, I have some statistics items on deck. They'll appear at the approximate rate of one a day.

Is parenting a form of addiction?


The last time we encountered Slate columnist Shankar Vedantam was when he puzzled over why slightly more than half of voters planned to vote for Republican candidates, given that polls show that Americans dislike the Republican Party even more than they dislike the Democrats. Vedantam attributed the new Republican majority to irrationality and "unconscious bias." But, actually, this voting behavior is perfectly consistent with there being some moderate voters who prefer divided government. The simple, direct explanation (which Vedantam mistakenly dismisses) actually works fine.

I was flipping through Slate today and noticed a new article by Vedantam headlined, "If parenthood sucks, why do we love it? Because we're addicted." I don't like this one either.

Paul Nee sends in this amusing item:

A recent story about academic plagiarism spurred me to some more general thoughts about the intellectual benefits of not giving a damn.

I'll briefly summarize the plagiarism story and then get to my larger point.

Copying big blocks of text from others' writings without attribution

Last month I linked to the story of Frank Fischer, an elderly professor of political science who was caught copying big blocks of text (with minor modifications) from others' writings without attribution.

Silly old chi-square!


Brian Mulford writes:

I [Mulford] ran across this blog post and found myself questioning the relevance of the test used.

I'd think Chi-Square would be inappropriate for trying to measure significance of choice in the manner presented here; irrespective of the cute hamster. Since this is a common test for marketers and website developers - I'd be interested in which techniques you might suggest?

For tests of this nature, I typically measure a variety of variables (image placement, size, type, page speed, "page feel" as expressed in a factor, etc) and use LOGIT, Cluster and possibly a simple Bayesian model to determine which variables were most significant (chosen). Pearson Chi-squared may be used to express relationships between variables and outcome but I've typically not used it to simply judge a 0/1 choice as statistically significant or not.

My reply:

I like the decision-theoretic way that the blogger (Jason Cohen, according to the webpage) starts:

If you wait too long between tests, you're wasting time. If you don't wait long enough for statistically conclusive results, you might think a variant is better and use that false assumption to create a new variant, and so forth, all on a wild goose chase! That's not just a waste of time, it also prevents you from doing the correct thing, which is to come up with completely new text to test against.

But I agree with Mulford that chi-square is not the way to go. I'd prefer a direct inference on the difference in proportions. Take that inference--the point estimate and its uncertainty, estimated using the usual (y+1)/(n+2) formulas--and then carry that uncertainty into your decision making. Balance costs and benefits, and all that.
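A minimal sketch of that direct inference, using the (y+1)/(n+2) estimate and its standard error; the conversion counts for the two variants are invented for illustration:

```python
# Estimate each proportion with (y + 1) / (n + 2), then carry the uncertainty
# into the comparison of the two variants.
import math

def estimate(y, n):
    p = (y + 1) / (n + 2)
    return p, math.sqrt(p * (1 - p) / (n + 2))

p1, se1 = estimate(y=30, n=200)   # variant A: 30 conversions out of 200
p2, se2 = estimate(y=18, n=190)   # variant B: 18 conversions out of 190
diff, se_diff = p1 - p2, math.sqrt(se1**2 + se2**2)
print(f"difference: {diff:.3f} +/- {se_diff:.3f}")
```

The point estimate and its uncertainty, not a yes/no significance verdict, are what you'd then feed into the cost-benefit decision.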

Moving forward, you're probably making lots and lots of this sort of comparison, so put it into a hierarchical model and you'll get inferences that are more reasonable and more precise.

But . . . who knows? Maybe Cohen's advice is a net plus. Ignoring the chi-square stuff, the key message I take away from the above-linked blog is that, with small samples, randomness can be huge. And that's an important lesson--really, one of the key concepts in statistics. Don't overreact to small samples. If the silly old chi-square test is your way of coming to this conclusion, that's not so bad.

After learning of a news article by Amy Harmon on problems with medical trials--sometimes people are stuck getting the placebo when they could really use the experimental treatment, and it can be a life-or-death difference, John Langford discusses some fifteen-year-old work on optimal design in machine learning and makes the following completely reasonable point:

With reasonable record keeping of existing outcomes for the standard treatments, there is no need to explicitly assign people to a control group with the standard treatment, as that approach is effectively explored with great certainty. Asserting otherwise would imply that the nature of effective treatments for cancer has changed between now and a year ago, which denies the value of any clinical trial. . . .

Done the right way, the clinical trial for a successful treatment would start with some initial small pool (equivalent to "phase 1" in the article) and then simply expanded the pool of participants over time as it proved superior to the existing treatment, until the pool is everyone. And as a bonus, you can even compete with policies on treatments rather than raw treatments (i.e. personalized medicine).

Langford then asks: if these ideas are so good, why aren't they done already? He conjectures:

Getting from here to there seems difficult. It's been 15 years since EXP3.P was first published, and the progress in clinical trial design seems glacial to us outsiders. Partly, I think this is a communication and education failure, but partly, it's also a failure of imagination within our own field. When we design algorithms, we often don't think about all the applications, where a little massaging of the design in obvious-to-us ways so as to suit these applications would go a long ways.

I agree with these sentiments, but . . . the sorts of ideas Langford is talking about have been around in statistics for a long, long time--much more than 15 years! I welcome the involvement of computer scientists in this area, but it's not simply that the CS people have a great idea and just need to communicate it or adapt it to the world of clinical trials. The clinical trials people already know about these ideas (not with the same terminology, but they're the same basic ideas) but, for various reasons, haven't widely adopted them.

P.S. The news article is by Amy Harmon, but Langford identifies it only as being from the New York Times. I don't think it is appropriate to omit the author's name. The publication is relevant, but it's the reporter who did the work. I certainly wouldn't like it if someone referred to one of my articles by writing, "The Journal of the American Statistical Association reported today that . . ."

I've written a lot on polls and elections ("a poll is a snapshot, not a forecast," etc., or see here for a more technical paper with Kari Lock) but had a few things to add in light of Sam Wang's recent efforts. As a biologist with a physics degree, Wang brings an outsider's perspective to political forecasting, which can be a good thing. (I'm a bit of an outsider to political science myself, as is my sometime collaborator Nate Silver, who's done a lot of good work in the past few years.)

But there are two places where Wang misses the point, I think.

Taleb + 3.5 years


I recently had the occasion to reread my review of The Black Swan, from April 2007.

It was fun reading my review (and also this pre-review; "nothing useful escapes from a blackbody," indeed). It was like a greatest hits of all my pet ideas that I've never published.

Looking back, I realize that Taleb really was right about a lot of things. Now that the financial crisis has happened, we tend to forget that the experts who Taleb bashes were not always reasonable at all. Here's what I wrote in my review, three and a half years ago:

On page 19, Taleb refers to the usual investment strategy (which I suppose I actually use myself) as "picking pennies in front of a steamroller." That's a cute phrase; did he come up with it? I'm also reminded of the famous Martingale betting system. Several years ago in a university library I came across a charming book by Maxim (of gun fame) where he went through chapter after chapter demolishing the Martingale system. (For those who don't know, the Martingale system is to bet $1, then if you lose, bet $2, then if you lose, bet $4, etc. You're then guaranteed to win exactly $1--or lose your entire fortune. A sort of lottery in reverse, but an eternally popular "system.")

Throughout, Taleb talks about forecasters who aren't so good at forecasting, picking pennies in front of steamrollers, etc. I imagine much of this can be explained by incentives. For example, those Long-Term Capital guys made tons of money, then when their system failed, I assume they didn't actually go broke. They have an incentive to ignore those black swans, since others will pick up the tab when they fail (sort of like FEMA pays for those beachfront houses in Florida). It reminds me of the saying that I heard once (referring to Donald Trump, I believe) that what matters is not your net worth (assets minus liabilities), but the absolute value of your net worth. Being in debt for $10 million and thus being "too big to fail" is (almost) equivalent to having $10 million in the bank.
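The Martingale system described in that quoted passage can be simulated in a few lines. This sketch assumes a fair coin and a $1,023 bankroll (enough for ten doublings), both invented for illustration:

```python
# Martingale betting: bet $1, double after each loss, stop when you win $1
# or can no longer cover the next bet. Fair coin, so expected profit is zero:
# lots of small wins, rare total wipeouts.
import random

def martingale_round(bankroll=1023, p_win=0.5, rng=random.random):
    bet, lost = 1, 0
    while bet <= bankroll - lost:
        if rng() < p_win:
            return 1            # one win recoups all losses plus $1
        lost += bet
        bet *= 2
    return -lost                # can no longer double the bet: ruin

random.seed(7)
results = [martingale_round() for _ in range(100_000)]
wins = sum(r == 1 for r in results) / len(results)
print(f"P(win $1): {wins:.3f}, average profit: {sum(results) / len(results):.3f}")
```

You win your dollar more than 99.9% of the time, which is exactly why the system stays eternally popular, and exactly why the rare $1,023 loss cancels it all out.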

So, yeah, "too big to fail" is not a new concept. But as late as 2007, it was still a bit of an underground theory. People such as Taleb screamed about it, but the authorities weren't listening.

And then there are parts of the review that make me really uncomfortable. As noted in the above quote, I was using the much-derided "picking pennies in front of a steamroller" investment strategy myself--and I knew it! Here's some more, again from 2007:

I'm only a statistician from 9 to 5

I try (and mostly succeed, I think) to have some unity in my professional life, developing theory that is relevant to my applied work. I have to admit, however, that after hours I'm like every other citizen. I trust my doctor and dentist completely, and I'll invest my money wherever the conventional wisdom tells me to (just like the people whom Taleb disparages on page 290 of his book).

Not long after, there was a stock market crash and I lost half my money. OK, maybe it was only 40%. Still, what was I thinking--I read Taleb's book and still didn't get the point!

Actually, there was a day in 2007 or 2008 when I had the plan to shift my money to a safer place. I recall going on the computer to access my investment account but I couldn't remember the password, was too busy to call and get it, and then forgot about it. A few weeks later the market crashed.

If only I'd followed through that day. Oooohhh, I'd be so smug right now. I'd be going around saying, yeah, I'm a statistician, I read Taleb's book and I thought it through, blah blah blah. All in all, it was probably better for me to just lose the money and maintain a healthy humility about my investment expertise.

But the part of the review that I really want everyone to read is this:

On page 16, Taleb asks "why those who favor allowing the elimination of a fetus in the mother's womb also oppose capital punishment" and "why those who accept abortion are supposed to be favorable to high taxation but against a strong military," etc. First off, let me chide Taleb for deterministic thinking. From the General Social Survey cumulative file, here's the crosstab of the responses to "Abortion if woman wants for any reason" and "Favor or oppose death penalty for murder":

40% supported abortion for any reason. Of these, 76% supported the death penalty.

60% did not support abortion under all conditions. Of these, 74% supported the death penalty.

This was the cumulative file, and I'm sure things have changed in recent years, and maybe I even made some mistake in the tabulation, but, in any case, the relation between views on these two issues is far from deterministic!

Finally, a lot of people bash Taleb, partly for his idiosyncratic writing style, but I have fond memories of both his books, for their own sake and because they inspired me to write down some of my pet ideas. Also, he deserves full credit for getting things right several years ago, back when the Larry Summerses of the world were still floating on air, buoyed by the heads-I-win, tails-you-lose system that kept the bubble inflated for so long.

Musical chairs in econ journals


Tyler Cowen links to a paper by Bruno Frey on the lack of space for articles in economics journals. Frey writes:

To further their careers, [academic economists] are required to publish in A-journals, but for the vast majority this is impossible because there are few slots open in such journals. Such academic competition may be useful to generate hard work; however, there may be serious negative consequences: the wrong output may be produced in an inefficient way, the wrong people may be selected, and losers may react in a harmful way.

According to Frey, the consensus is that there are only five top economics journals--and one of those five is Econometrica, which is so specialized that I'd say that, for most academic economists, there are only four top places they can publish. The difficulty is that demand for these slots outpaces supply: for example, in 2007 there were only 275 articles in all these journals combined (or 224 if you exclude Econometrica), while "a rough estimate is that there are around 10,000 academics actively aspiring to publish in A-journals."

I agree completely with Frey's assessment of the problem, and I've long said that statistics has a better system: there are a lot fewer academic statisticians than academic economists, and we have many more top journals we can publish in (all the probability and statistics journals, plus the econ journals, plus the poli sci journals, plus the psych journals, etc), so there's a lot less pressure.

I wonder if part of the problem with the econ journals is that economists enjoy competition. If there were not such a restricted space in top journals, they wouldn't have a good way to keep score.

Just by comparison, I've published in most of the top statistics journals, but my most cited articles have appeared in Statistical Science, Statistica Sinica, Journal of Computational and Graphical Statistics, and Bayesian Analysis. Not a single "top 5 journal" in the bunch.

But now let's take the perspective of a consumer of economics journals, rather than thinking about the producers of the articles. From my consumer's perspective, it's ok that the top five journals are largely an insider's club (with the occasional exceptional article from an outsider). These insiders have a lot to say, and it seems perfectly reasonable for them to have their own journal. The problem is not the exclusivity of the journals but rather the presumption that outsiders and new entrants should be judged based on their ability to conform to the standards of these journals. The tenured faculty at the top 5 econ depts are great, I'm sure--but does the world really need 10,000 other people trying to become just like them??? Again, based on my own experience, some of our most important work is the stuff that does not conform to conventional expectations.

P.S. I met Frey once. He said, "Gelman . . . you wrote the zombies paper!" So, you see, you don't need to publish in the AER for your papers to get noticed. Arxiv is enough. I don't know whether this would work with more serious research, though.

P.P.S. On an unrelated note, if you have to describe someone as "famous," he's not. (Unless you're using "famous" to distinguish two different people with the same name (for example, "Michael Jordan--not the famous one"), but it doesn't look like that's what's going on here.)

Tyler Cowen links to a blog by Greg Mankiw with further details on his argument that his anticipated 90% marginal tax rate will reduce his work level.

Having already given my thoughts on Mankiw's column, I merely have a few things to add/emphasize.

Greg Mankiw writes (link from Tyler Cowen):

Without any taxes, accepting that editor's assignment would have yielded my children an extra $10,000. With taxes, it yields only $1,000. In effect, once the entire tax system is taken into account, my family's marginal tax rate is about 90 percent. Is it any wonder that I [Mankiw] turn down most of the money-making opportunities I am offered?

By contrast, without the tax increases advocated by the Obama administration, the numbers would look quite different. I would face a lower income tax rate, a lower Medicare tax rate, and no deduction phaseout or estate tax. Taking that writing assignment would yield my kids about $2,000. I would have twice the incentive to keep working.

First, the good news

Obama's tax rates are much lower than Mankiw had anticipated! According to the above quote, his marginal tax rate is currently 80% but threatens to rise to 90%.

But, in October 2008, Mankiw calculated that Obama would tax his marginal dollar at 93%. What we're saying, then, is that Mankiw's marginal tax rate is currently thirteen percentage points lower than he'd anticipated two years ago. In fact, Mankiw's stated current marginal tax rate of 80% is three points lower than the tax rate he expected to pay under a McCain administration! And if the proposed new tax laws are introduced, Mankiw's marginal tax rate of 90% is still three percentage points lower than he'd anticipated, back during the 2008 election campaign. I assume that, for whatever reason, Obama did not follow through on all his tax-raising promises.

To frame the numbers more dramatically: According to Mankiw's calculations, he is currently keeping almost three times the proportion of his income that he was expecting to keep under the Obama administration (and 18% more than he was expecting to keep under a hypothetical McCain administration). If the new tax plans are put into effect, Mankiw will still keep 43% more of his money than he was expecting to keep, only two years ago. (For those following along at home, the calculations are (1-0.80)/(1-0.93)=2.9, (1-0.80)/(1-0.83)=1.18, and (1-0.90)/(1-0.93)=1.43.)
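For those following along in code rather than by hand, the three ratios in the parenthetical can be reproduced directly from the quoted marginal rates (the 83% McCain figure is the one implied by "three points lower"):

```python
# Marginal tax rates as quoted in the post.
anticipated_obama, anticipated_mccain = 0.93, 0.83
current, proposed = 0.80, 0.90

def keep(rate):
    """Fraction of a marginal dollar kept at a given marginal tax rate."""
    return 1 - rate

print(round(keep(current) / keep(anticipated_obama), 1))    # 2.9
print(round(keep(current) / keep(anticipated_mccain), 2))   # 1.18
print(round(keep(proposed) / keep(anticipated_obama), 2))   # 1.43
```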

Given that Mankiw currently gets to keep 20% of his money--rather than the measly 7% he was anticipating--it's no surprise that he's still working!

Now, the bad news

I don't think Mankiw has fully thought this through.

Someone who works in statistics in the pharmaceutical industry (but prefers to remain anonymous) sent me this update to our discussion on the differences between approvals of drugs and medical devices:

The 'substantial equivalence' threshold is very outdated. Basically the FDA has to follow federal law and the law is antiquated and leads to two extraordinarily different paths for device approval.

You could have a very simple but first-in-kind device with an easy-to-understand physiological mechanism of action (e.g., the FDA approved a simple tiny stent that would relieve pressure from a glaucoma patient's eye this summer). This device would require a standard (likely controlled) trial at the one-sided 0.025 level. Even after the trial it would likely go to a panel where outside experts (e.g., practicing and academic MDs and statisticians) hear evidence from the company and FDA and vote on its safety and efficacy. FDA would then rule, considering the panel's vote, on whether to approve this device.

On the other hand you could have a very complex device with uncertain physiological mechanism declared equivalent to a device approved before May 28, 1976 and it requires much less evidence. And you can have a device declared similar to a device that was similar to a device that was similar to a device on the market before 1976. So basically if there was one type I error in this chain, you now have a device that's equivalent to a non-efficacious device. For these no trial is required, no panel meeting is required. The regulatory burden is tens of millions of dollars less expensive and we also have substantially less scientific evidence.

But the complexity of the device has nothing to do with which path gets taken, only its similarity to a device that existed before 1976.

This was in the WSJ just this morning.

You can imagine there was nothing quite like the "NanoKnife" on the market in 1976. But it's obviously very worth a company's effort to get their new device declared substantially equivalent to an old one. Otherwise they have to spend the money for a trial and risk losing that trial. Why do research when you can just market!?

So this unfortunately isn't a scientific question -- we know what good science would lead us to do. It's a legal question and the scientists at FDA are merely following U.S. law which is fundamentally flawed and leads to two very different paths and scientific hurdles for device approval.

A question for psychometricians


Don Coffin writes:

A colleague of mine and I are doing a presentation for new faculty on a number of topics related to teaching. Our charge is to identify interesting issues and to find research-based information for them about how to approach things. So, what I wondered is, do you know of any published research dealing with the sort of issues about structuring a course and final exam in the ways you talk about in this blog post? Some poking around in the usual places hasn't turned anything up yet.

I don't really know the psychometrics literature but I imagine that some good stuff has been written on principles of test design. There are probably some good papers from back in the 1920s. Can anyone supply some references?

The winner's curse


If an estimate is statistically significant, it's probably an overestimate of the magnitude of your effect.

P.S. I think youall know what I mean here. But could someone rephrase it in a more pithy manner? I'd like to include it in our statistical lexicon.
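The point is easy to see in a quick simulation. A sketch with an illustrative true effect of 0.2 and standard error of 1 (both numbers are mine, not from the post):

```python
import random

random.seed(1)
true_effect, se = 0.2, 1.0

# Many replications of a noisy, unbiased estimate of a small true effect.
estimates = [random.gauss(true_effect, se) for _ in range(100_000)]

# Condition on statistical significance at the conventional 5% level.
significant = [est for est in estimates if abs(est) > 1.96 * se]

share = len(significant) / len(estimates)
avg_magnitude = sum(abs(est) for est in significant) / len(significant)
print(f"share reaching significance: {share:.2f}")
print(f"average |estimate| among significant results: {avg_magnitude:.2f}")
```

Every estimate that clears the significance bar has magnitude above 1.96, roughly ten times the true effect of 0.2, even though each individual estimate is unbiased.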

Dan Goldstein sends along this bit of research, distinguishing terms used in two different subfields of psychology. Dan writes:

Intuitive calls included not listing words that don't occur 3 or more times in both programs. I [Dan] did this because when I looked at the results, those cases tended to be proper names or arbitrary things like header or footer text. It also narrowed down the space of words to inspect, which means I could actually get the thing done in my copious free time.

I think the bar graphs are kinda ugly, maybe there's a better way to do it based on classifying the words according to content? Also the whole exercise would gain a new dimension by comparing several areas instead of just two. Maybe that's coming next.

Brendan Nyhan gives the story.

Here's Sarah Palin's statement introducing the now-notorious phrase:

The America I know and love is not one in which my parents or my baby with Down Syndrome will have to stand in front of Obama's "death panel" so his bureaucrats can decide, based on a subjective judgment of their "level of productivity in society," whether they are worthy of health care.

And now Brendan:

Palin's language suggests that a "death panel" would determine whether individual patients receive care based on their "level of productivity in society." This was -- and remains -- false. Denying coverage at a system level for specific treatments or drugs is not equivalent to "decid[ing], based on a subjective judgment of their 'level of productivity in society.'"

Seems like an open-and-shut case to me. The "bureaucrats" (I think Palin is referring to "government employees") are making decisions based on studies of the drug's effectiveness:

I can't escape it


I received the following email:

Ms. No.: ***

Title: ***

Corresponding Author: ***

All Authors: ***

Dear Dr. Gelman,

Because of your expertise, I would like to ask your assistance in determining whether the above-mentioned manuscript is appropriate for publication in ***. The abstract is pasted below. . . .

My reply:

I would rather not review this article. I suggest ***, ***, and *** as reviewers.

I think it would be difficult for me to review the manuscript fairly.

NSF crowdsourcing


I have no idea what this and this are, but Aleks passed these on, and maybe some of you will find them interesting.

There's a lot of free advice out there. As I wrote a couple years ago, it's usually presented as advice to individuals, but it's also interesting to consider the possible total effects if the advice is taken.

For example, Nassim Taleb has a webpage that includes a bunch of one-line bits of advice (scroll to item 132 on the linked page). Here's his final piece of advice:

If you dislike someone, leave him alone or eliminate him; don't attack him verbally.

I'm a big Taleb fan (search this blog to see), but this seems like classic negative-sum advice. I can see how it can be a good individual strategy to keep your mouth shut, bide your time, and then sandbag your enemies. But it can't be good if lots of people are doing this. Verbal attacks are great, as long as there's a chance to respond. I've been in environments where people follow Taleb's advice, saying nothing and occasionally trying to "eliminate" people, and it's not pretty. I much prefer for people to be open about their feelings. Or, if you want to keep your dislikes to yourself, fine, but don't go around eliminating people!

On the other hand, maybe I'm missing the point. Taleb attacks people verbally all the time, so maybe his advice is tongue in cheek, along the lines of "do as I say, not as I do."

As noted above, I think Taleb is great, but I'm really down on this sort of advice where people are advised to be more strategic, conniving, etc. In my experience, this does not lead to a pleasant equilibrium where everybody is reasonably savvy. Rather, it can lead to a spiral of mistrust and poor communication.

P.S. Taleb's other suggestions seem more promising.

I came across this blog by Jonathan Weinstein that illustrated, once again, some common confusion about ideas of utility and risk. Weinstein writes:

When economists talk about risk, we talk about uncertain monetary outcomes and an individual's "risk attitude" as represented by a utility function. The shape of the function determines how willing the individual is to accept risk. For instance, we ask students questions such as "How much would Bob pay to avoid a 10% chance of losing $10,000?" and this depends on Bob's utility function.

This is (a) completely wrong, and (b) known to be completely wrong. To be clear: what's wrong here is not that economists talk this way. What's wrong is the identification of risk aversion with a utility function for money. (See this paper from 1998 or a more formal argument from Yitzhak in a paper from 2000.)

It's frustrating. Everybody knows that it's wrong to associate a question such as "How much would Bob pay to avoid a 10% chance of losing $10,000?" with a utility function, yet people do it anyway. It's not Jonathan Weinstein's fault--he's just calling this the "textbook definition"--but I guess it is the fault of the people who write the textbooks.

P.S. Yes, yes, I know that I've posted on this before. It's just sooooooo frustrating that I'm compelled to write about it again. Unlike some formerly recurring topics on this blog, I don't associate this fallacy with any intellectual dishonesty. I think it's just an area of confusion. The appealing but wrong equation of risk aversion with nonlinear utility functions is a weed that's grown roots so deep that no amount of cutting and pulling will kill it.

P.P.S. To elaborate slightly: The equation of risk aversion with nonlinear utility is empirically wrong (people are much more risk averse for small sums than could possibly make sense under the utility model) and conceptually wrong (risk aversion is an attitude about process rather than outcome).
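That small-stakes point can be made concrete with a calibration exercise in the spirit of the 2000 paper linked above. The wealth level, the CRRA utility family, and the gamble sizes here are all my own illustrative choices, not anyone's published numbers:

```python
import math

def crra_utility(w, gamma):
    """Constant-relative-risk-aversion utility of wealth (log when gamma == 1)."""
    return math.log(w) if gamma == 1 else w ** (1 - gamma) / (1 - gamma)

def eu_gain(wealth, gain, loss, gamma):
    """Expected-utility change from a 50/50 gain/loss gamble at this wealth."""
    return (0.5 * crra_utility(wealth + gain, gamma)
            + 0.5 * crra_utility(wealth - loss, gamma)
            - crra_utility(wealth, gamma))

# Bisect for the curvature at which someone with $20,000 of wealth is exactly
# indifferent to a 50/50 gain-$110 / lose-$100 gamble.
lo, hi = 1e-6, 100.0
for _ in range(200):
    mid = (lo + hi) / 2
    if eu_gain(20_000, 110, 100, mid) > 0:
        lo = mid        # gamble still attractive: need more curvature
    else:
        hi = mid
gamma = (lo + hi) / 2
print(f"implied relative risk aversion: {gamma:.1f}")  # a huge number for such modest stakes

# The same curvature makes the agent turn down a 50/50 lose-$1,000 /
# gain-$1,000,000 gamble, which is absurd:
print(eu_gain(20_000, 1_000_000, 1_000, gamma) < 0)    # True
```

If turning down a nearly fair $100-scale gamble is to be explained by curvature in the utility of wealth, the curvature required is so extreme that it forces rejection of wildly favorable large gambles, which is the empirical impossibility referred to above.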

P.P.P.S. I'll have to write something more formal about this some time . . . in the meantime, let me echo the point made by many others that the whole idea of a "utility function for money" is fundamentally in conflict with the classical axiom of decision theory that preferences should depend only on outcomes, not on intermediate steps. Money's value is not in itself but rather in what it can do for you, and in the classical theory, utilities would be assigned to the ultimate outcomes. (But even if you accept the idea of a "utility of money" as some sort of convenient shorthand, you still can't associate it with attitudes about risky gambles, for the reasons discussed by Yitzhak and myself and which are utterly obvious if you ever try to teach the subject.)

P.P.P.P.S. Yes, I recognize the counterargument: that if this idea is really so bad and yet remains so popular, it must have some countervailing advantages. Maybe so. But I don't see it. It seems perfectly possible to believe in supply and demand, opportunity cost, incentives, externalities, marginal cost and benefits, and all the rest of the package--without building it upon the idea of a utility function that doesn't exist. To put it another way, the house stands up just fine without the foundations. To the extent that the foundations hold up at all, I suspect they're being supported by the house.

We're doing a new thing here at the Applied Statistics Center, throwing monthly Friday afternoon mini-conferences in the Playroom (inspired by our successful miniconference on statistical consulting a couple years ago).

This Friday (10 Sept), 1-5pm:

Come join us this Friday, September 10th for an engaging interdisciplinary discussion of risk perception at the individual and societal level, and the role it plays in current environmental, social, and health policy debates. All are welcome!

"Risk Perception in Environmental Decision-Making"

Elke Weber, Columbia Business School

"Cultural Cognition and the Problem of Science Communication"

Dan Kahan, Yale Law School

Discussants include:

Michael Gerrard, Columbia Law School

David Epstein, Department of Political Science, Columbia University

Andrew Gelman, Department of Statistics, Columbia University

From Banerjee and Duflo, "The Experimental Approach to Development Economics," Annual Review of Economics (2009):

One issue with the explicit acknowledgment of randomization as a fair way to allocate the program is that implementers may find that the easiest way to present it to the community is to say that an expansion of the program is planned for the control areas in the future (especially when such is indeed the case, as in phased-in design).

I can't quite figure out whether Banerjee and Duflo are saying that they would lie and tell people that an expansion is planned when it isn't, or whether they're deploring that other people do it.

I'm not bothered by a lot of the deception in experimental research--for example, I think the Milgram obedience experiment was just fine--but somehow the above deception bothers me. It just seems wrong to tell people that an expansion is planned if it's not.

P.S. Overall the article is pretty good. My only real problem with it is that when discussing data analysis, they pretty much ignore the statistical literature and just look at econometrics. In the long run, that's fine--any relevant developments in statistics should eventually make their way over to the econometrics literature. But for now I think it's a drawback in that it encourages a focus on theory and testing rather than modeling and scientific understanding.

Here are the titles of some of the cited papers:

Bootstrap tests for distributional treatment effects in instrumental variables models
Nonparametric tests for treatment effect heterogeneity
Testing the correlated random coefficient model
Asymptotics for statistical decision rules

Most of the paper, and most of the references, are applied rather than theoretical, so I'm not claiming that Banerjee and Duflo are ivory-tower theorists. Rather, I'm suggesting that their statistical methods might not be allowing them to get the most out of their data--and that they're looking in the wrong place when researching better methods. The problem, I think, is that they (like many economists) think of statistical methods not as a tool for learning but as a tool for rigor. So they gravitate toward math-heavy methods based on testing, asymptotics, and abstract theories, rather than toward complex modeling. The result is a disconnect between statistical methods and applied goals.

No radon lobby


Kaiser writes thoughtfully about the costs, benefits, and incentives for different policy recommendation options regarding a recent water crisis. Good stuff: it's solid "freakonomics"--and I mean this in a positive way: a mix of economic and statistical analysis, with assumptions stated clearly. Kaiser writes:

Using the framework from Chapter 4, we should think about the incentives facing the Mass. Water Resources Authority:

A false positive error (people asked to throw out water when water is clean) means people stop drinking tap water temporarily, perhaps switching to bottled water, and the officials claim victory when no one falls sick, and businesses that produce bottled water experience a jump in sales. It is also very difficult to prove a "false positive" when people have stopped drinking the water. So this type of error is easy to hide behind.

A false negative error (people told it's safe to drink water when water is polluted) becomes apparent when someone falls sick as a result of drinking the water -- notice that it would be impossible to know if such a person is affected by bacteria from the pond water or bacteria from the main water line but no matter, any sickness will be blamed on the pond water. We think the risk is low but if it happens, the false negative error creates a public relations nightmare.

I [Kaiser] think this goes a long way to explaining why government officials behave the way they do. This applies also to the FDA and CDC in terms of foodborne diseases (a subject of Chapter 2), and to the NTSB in terms of car recalls. They tend to be overly conservative. In the case of food or product recalls, being overly conservative leads to massive economic losses and waste as food or products are thrown out, almost all of them good.

This reminds me of my work with Phil in the mid-1990s on home radon. The Environmental Protection Agency had a recommendation that every homeowner in the country measure their radon levels, and that anyone with a measurement higher than 4 picoCuries per liter get their house remediated. We recommended a much more targeted strategy which we estimated could save the same number of lives at much less cost. But the EPA resisted our approach. One thing that was going on, we decided, was that there is no pro-radon lobby. Radon is a natural hazard, and so there's no radon manufacturer's association pushing to minimize its risks. If anything, polluters like to focus on radon as it takes the heat off them for other problems. And the EPA has every incentive to make a big deal out of it. So you get a one-sided political environment leading to recommendations that are too expensive. Similar things go on with other safety issues in the U.S.

Dodging the diplomats


The usually-reasonable-even-if-you-disagree-with-him Tyler Cowen writes:

Presumably diplomats either enjoy serving their country or they enjoy the ego rents of being a diplomat or both. It is a false feeling of power, borrowed power from one's country of origin rather than from one's personal achievements.

Huh? I'd hardly think this needs to be explained, but here goes:

My cobloggers sometimes write about "Politics Everywhere." Here's an example of a political writer taking something that's not particularly political and trying to twist it into a political context. Perhaps the title should be "political journalism everywhere".

Michael Kinsley writes:

Scientists have discovered a spinal fluid test that can predict with 100 percent accuracy whether people who already have memory loss are going to develop full-fledged Alzheimer's disease. They apparently don't know whether this test works for people with no memory problems yet, but reading between the lines of the report in the New York Times August 10, it sounds as if they believe it will. . . . This is truly the apple of knowledge: a test that can be given to physically and mentally healthy people in the prime of life, which can identify with perfect accuracy which ones are slowly going to lose their mental capabilities. If your first instinct is, "We should outlaw this test" or at least "we should forbid employers from discriminating on the basis of this test," congratulations--you're a liberal. People should be judged on the basis of their actual, current abilities, not on the basis of what their spinal fluid indicates about what may happen some day. Tests can be wrong. [Italics added by me.]

By the time Kinsley reached the end of this passage, he seems to have forgotten that he had already stipulated that the test is 100% accurate. Make up your mind, man!

Also, what's that bit about "congratulations, you're a liberal"? I think there are conservatives who believe that "people should be judged on the basis of their actual, current abilities." Don't forget that over 70% of Americans support laws prohibiting employment discrimination on the basis of sexual orientation. I don't think all these people are liberals. Lots of people of all persuasions believe people should be judged based on what they can do, not who they are.

We're always hearing about the problems caused by political polarization, and I think this is an example. Medical diagnostics are tough enough without trying to align them on a liberal-conservative scale.

P.S. Kaiser has more on that "100% accuracy" claim.

Kaggle forecasting update


Anthony Goldbloom writes:

The Elo rating system is now in 47th position (team Elo Benchmark on the leaderboard). Team Intuition submitted using Microsoft's TrueSkill rating system; Intuition is in 38th position.

And for the tourism forecasting competition, the best submission is doing better than the threshold for publication in the International Journal of Forecasting.



I'm just glad that universities don't sanction professors for publishing false theorems.

If the guy really is nailed by the feds for fraud, I hope they don't throw him in prison. In general, prison time seems like a brutal, expensive, and inefficient way to punish people. I'd prefer if the government just took 95% of his salary for several years, made him do community service (cleaning equipment at the local sewage treatment plant, perhaps; a lab scientist should be good at this sort of thing, no?), etc. If restriction of this dude's personal freedom is judged to be part of the sentence, he could be given some sort of electronic tag that would send a message to the police if he were ever more than 3 miles from his home. But no need to bill the taxpayers for the cost of keeping him in prison.

Psychologists talk about "folk psychology": ideas that make sense to us about how people think and behave, even if these ideas are not accurate descriptions of reality. And physicists talk about "folk physics" (for example, the idea that a thrown ball falls in a straight line and then suddenly drops, rather than following an approximate parabola).

There's also "folk statistics." Some of the ideas of folk statistics are so strong that even educated people--even well-known researchers--can make these mistakes.

One of the ideas of folk statistics that bothers me a lot is what might be called the "either/or fallacy": the idea that if there are two possible stories, the truth has to be one or the other.

I have often encountered the either/or fallacy in Bayesian statistics, for example the vast literature on "model selection" or "variable selection" or "model averaging" in which it is assumed that one of some pre-specified discrete set of models is the truth, and that this true model can be determined from the data. Or, more generally, that the goal is to estimate the posterior probability of each of these models. As discussed in chapter 6 of BDA, in the application areas I've worked on, such discrete formulations don't make sense to me. Rather than saying that model A or model B might be true, I'd rather say they can both be true. Which is not the same as assigning, say, .3 probability to model A and .7 probability to model B; rather, I'm talking about a continuous model expansion that would include A and B as special cases. That said, any model I fit will have its limitations, so I recognize that discrete model averaging might be useful in practice. But I don't have to like it.

Since I've been primed to see it, I notice the either/or fallacy all over the place. For example, as I discuss here, cognitive scientist Steven Sloman writes:

A good politician will know who is motivated by greed and who is motivated by larger principles in order to discern how to solicit each one's vote when it is needed.

I can well believe that people think in this way but I don't buy it! Just about everyone is motivated by greed and by larger principles! This sort of discrete thinking doesn't seem to me to be at all realistic about how people behave--although it might very well be a good model about how people characterize others!

Later in his book on causal reasoning, Sloman writes:

No matter how many times A and B occur together, mere co-occurrence cannot reveal whether A causes B, or B causes A, or something else causes both. [italics added]

Again, I am bothered by this sort of discrete thinking. I'm not trying to pick on Sloman here; I'm just demonstrating how the either/or fallacy is so entrenched in our ideas of folk statistics that it comes out in all sorts of settings.

Most recently, I noticed the fallacy in the humble precincts of our blog, when, in response to Phil's remark that having lots of kids puts a strain on the environment, commenter A. Zarkov wrote,

Believe or not, some people really like children and want a lot of them. They think of each child as a blessing, not a strain on the bio-sphere.

That's the either/or fallacy again! As I see it, each child is a blessing and a strain on the biosphere. There's no reason to think it's just one or the other.

I'll stop now. I think you get the point.

Note to semi-spammers


I just deleted another comment that seemed reasonable but was attached to an advertisement.

Here's a note to all of you advertisers out there: If you want to leave a comment on this site, please do so without the link to your website on search engine optimization or whatever. Or else it will get deleted. Which means you were wasting your time in writing the comment.

I want your comments and I don't want you to waste your time. So please just stop already with the links, and we'll both be happier.

P.S. Don't worry, you're still not as bad as the journal Nature (see the P.S. here).

Angry about the soda tax


My Columbia colleague Sheena Iyengar (most famous among ARM readers for her speed-dating experiment) writes an interesting column on potential reactions to a tax on sugary drinks. The idea is that people might be so annoyed at being told what to do that they might buy more of the stuff, at least in the short term.

On the other hand, given the famous subsidies involved in the production of high-fructose corn syrup, soda pop is probably a bit cheaper than it should be, so maybe it all balances out?

I agree with Sheena that there's something about loss of control that is particularly frustrating. One thing that bugs me when I buy a Coke is that I'm paying for the fees of Michael Jordan or whoever it is they have nowadays endorsing their product. I wish there were some way I could just pay for everything else but withhold the money that's going into those silly celebrity endorsements.

Fascinating interview by Kathryn Schulz of Google research director Peter Norvig. Lots of stuff about statistical design and analysis.

From a commenter on the web, 21 May 2010:

Tampa Bay: Playing .732 ball in the toughest division in baseball, wiped their feet on NY twice. If they sweep Houston, which seems pretty likely, they will be at .750, which I [the commenter] have never heard of.

At the time of that posting, the Rays were 30-11. Quick calculation: if a team is good enough to be expected to win 100 games, that is, Pr(win) = 100/162 = .617, then there's a 5% chance that they'll have won at least 30 of their first 41 games. That's a calculation based on simple probability theory of independent events, which isn't quite right here but will get you close and is a good way to train one's intuition, I think.
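That back-of-the-envelope figure can be checked with an exact binomial tail sum. Here's a sketch in Python; the exact answer may come out somewhat different from the rough normal-approximation number, and of course the independence assumption is the same simplification noted above.

```python
from math import comb

# Probability that a team with true win probability p = 100/162 wins
# at least 30 of its first 41 games, treating games as independent.
p = 100 / 162
n, k = 41, 30
prob = sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))
print(round(prob, 3))
```

The point of the exercise isn't the third decimal place; it's that a hot 41-game start is well within the range of what a merely very good team will do by chance.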

Having a .732 record after 41 games is not unheard-of. The Detroit Tigers won 35 of their first 40 games in 1984: that's .875. (I happen to remember that fast start, having been an Orioles fan at the time.)

Now on to the key ideas

The passage quoted above illustrates three statistical fallacies which I believe are common but are not often discussed:

1. Conditioning on the extrapolation. "If they sweep Houston . . ." The relevant data were that the Rays were .732, not .750.

2. Counting data twice: "Playing .732 . . . wiped their feet on NY twice." Beating the Yankees is part of how they got to .732 in the first place.

3. Remembered historical evidence: "at .750, which I have never heard of." There's no particular reason the commenter should've heard of the 1984 Tigers; my point here is that past data aren't always as you remember them.

P.S. I don't mean to pick on the above commenter, who I'm sure was just posting some idle thoughts. In some ways, though, perhaps these low-priority remarks are the best windows into our implicit thinking.

P.P.S. Yes, I realize this is out of date--the perils of lagged blog posting. But the general statistical principles are still valid.

Say again?


"I believe that probability theory is the right tool for solving such problems," says Andrew Gelman, professor of statistics at Columbia University in New York. But how often such tricky problems come up in real life, he cannot say. Which sounds almost reassuring.

OK, fine.

Interesting article by Sharon Begley and Mary Carmichael. They discuss how there is tons of federal support for basic research but that there's a big gap between research findings and medical applications--a gap that, according to them, arises not just from the inevitable problem that not all research hypotheses pan out, but because actual promising potential cures don't get researched because of the cost.

I have two thoughts on this. First, in my experience, research at any level requires a continuing forward momentum, a push from somebody to keep it going. I've worked on some great projects (some of which had Federal research funding) that ground to a halt because the original motivation died. I expect this is true with medical research also. One of the projects that I'm thinking of, which I've made almost no progress on for several years, I'm sure would make a useful contribution. I pretty much know it would work--it just takes work to make it work, and it's hard to do this without the motivation of it being connected to other projects.

My second thought is about economics. Begley and Carmichael discuss how various potential cures are not being developed because of the expense of animal and then human testing. I guess this is part of the expensive U.S. medical system, that simple experiments cost millions of dollars. But I'm also confused: if these drugs are really "worth it" and would save lots of lives, wouldn't it be worth it for the drug and medical device companies to expend the dollars to test them? There's some big-picture thing I'm not understanding here.

A very short story


A few years ago we went to a nearby fried chicken place that the Village Voice had raved about. While we were waiting to place our order, someone from the local Chinese takeout place came in with a delivery, which the employees of the chicken place proceeded to eat. This should've been our signal to leave. Instead, we bought some chicken. It was terrible.

Dan Kahan writes:

Here is a very interesting article from Science that reports result of experiment that looked at whether people bought a product (picture of themselves screaming or vomiting on roller coaster) or paid more for it when told "1/2 to charity." Answer was "buy more" but "pay lots less" than when alternative was fixed price w/ or w/o charity; and "buy more" & "pay more" if consumer could name own price & 1/2 went to charity than if none went to charity. Pretty interesting.

But . . .

David Blackwell


David Blackwell was already retired by the time I came to Berkeley, and probably our closest connection was that I taught the class in decision theory that he used to teach. I enjoyed that class a lot, partly because it took me out of my usual comfort zone of statistical inference and data analysis toward something more theoretical and mathematical. Blackwell was one of the legendary figures in the department at that time and was also one of the most tolerant of alternative approaches to statistics, perhaps because of a combination of a mathematical background, applied research in the war and after (which I learned about in this recent obituary), and personal experiences.

Blackwell may be best known in statistics for the Rao-Blackwell theorem. Rao, of course, is also famous for the Cramer-Rao lower bound. Both theorems relate to minimum-variance statistical estimators.

Here's a quote from Thomas (Jesus's dad) Ferguson in Blackwell's obituary:

He went from one area to another, and he'd write a fundamental paper in each. He would come into a field that had been well studied and find something really new that was remarkable. That was his forte.

And here's a quote from Peter Bickel, who in 1967 published an important paper on Bayesian inference:

He had this great talent for making things appear simple. He liked elegance and simplicity. That is the ultimate best thing in mathematics, if you have an insight that something seemingly complicated is really simple, but simple after the fact.

And here's Blackwell himself, from 1983:

Basically, I'm not interested in doing research and I never have been. I'm interested in understanding, which is quite a different thing. And often to understand something you have to work it out yourself because no one else has done it.

I'm surprised to hear Blackwell consider "research" and "understanding" to be different, as to me they seem to be closely related. One of the most interesting areas of statistical research today is on methods for understanding models as maps from data to predictions. As Blackwell and his collaborators demonstrated, even the understanding of simple statistical inferences is not a simple task.

P.S. According to the obituary, Blackwell was denied jobs at Princeton and the University of California because of racial discrimination, and so, a year after receiving his Ph.D., he "sent out applications to 104 black colleges on the assumption that no other schools would hire him." The bit about the 104 job applications surprised me. Nowadays I know that people send out hundreds of job applications, but I didn't know that this was done back in 1943. I somehow thought the academic world was more self-contained back then.

P.P.S. My Barnard College colleague Rajiv Sethi discusses Blackwell's research as seen by economists.

Note to "Cigarettes"


To the person who posted an apparently non-spam comment with a URL link to a "cheap cigarettes" website: In case you're wondering, no, your comment didn't get caught by the spam filter--I'm not sure why not, given that URL. I put it in the spam file manually. If you'd like to participate in blog discussion in the future, please refrain from including spam links. Thank you.

Also, it's "John Tukey," not "John Turkey."

Maggie Fox writes:

Brain scans may be able to predict what you will do better than you can yourself . . . They found a way to interpret "real time" brain images to show whether people who viewed messages about using sunscreen would actually use sunscreen during the following week.

The scans were more accurate than the volunteers were, Emily Falk and colleagues at the University of California Los Angeles reported in the Journal of Neuroscience. . . .

About half the volunteers had correctly predicted whether they would use sunscreen. The research team analyzed and re-analyzed the MRI scans to see if they could find any brain activity that would do better.

Activity in one area of the brain, a particular part of the medial prefrontal cortex, provided the best information.

"From this region of the brain, we can predict for about three-quarters of the people whether they will increase their use of sunscreen beyond what they say they will do," Lieberman said.

"It is the one region of the prefrontal cortex that we know is disproportionately larger in humans than in other primates," he added. "This region is associated with self-awareness, and seems to be critical for thinking about yourself and thinking about your preferences and values."

Hmm . . . they "analyzed and re-analyzed the scans to see if they could find any brain activity" that would predict better than 50%?! This doesn't sound so promising. But maybe the reporter messed up on the details . . .

I took advantage of my library subscription to take a look at the article, "Predicting Persuasion-Induced Behavior Change from the Brain," by Emily Falk, Elliot Berkman, Traci Mann, Brittany Harrison, and Matthew Lieberman. Here's what they say:

- "Regions of interest were constructed based on coordinates reported by Soon et al. (2008) in MPFC and precuneus, regions that also appeared in a study of persuasive messaging." OK, so they picked two regions of interest ahead of time. They didn't just search for "any brain activity." I'll take their word for it that they just looked at these two, that they didn't actually look at 50 regions and then say they reported just two.

- Their main result had a t-statistic of 2.3 (on 18 degrees of freedom, thus statistically significant at the 3% level) in one of the two regions they looked at, and a t-statistic of 1.5 (not statistically significant) in the other. A simple multiple-comparisons correction takes the p-value of 0.03 and bounces it up to an over-the-threshold 0.06, which I think would make the result unpublishable! On the other hand, a simple average gives a healthy t-statistic of (1.5+2.3)/sqrt(2) = 2.7, although that ignores any possible correlation between the two regions (they don't seem to supply that information in their article).

- They also do a cross-validation but this seems 100% pointless to me since they do the cross-validation on the region that already "won" on the full data analysis. For the cross-validation to mean anything at all, they'd have to use the separate winner on each of the cross-validatory fits.

- As an outcome, they use before-after change. They should really control for the "before" measurement as a regression predictor. That's a freebie. And, when you're operating at a 6% significance level, you should take any freebie that you can get! (It's possible that they tried adjusting for the "before" measurement and it didn't work, but I assume they didn't do that, since I didn't see any report of such an analysis in the article.)
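The multiple-comparisons arithmetic in the points above can be written out explicitly. This is just a sketch of the back-of-envelope reasoning, not the paper's own analysis, and the independence assumption behind the averaged t-statistic is mine, since the article doesn't report the correlation between the two regions.

```python
from math import sqrt

# Bonferroni correction for reporting the best of two examined regions.
p_best = 0.03                        # p-value of the winner, MPFC
p_bonferroni = min(1.0, 2 * p_best)  # two regions were looked at
print(p_bonferroni)                  # 0.06, just over the usual 0.05 threshold

# Alternatively, average the two t-statistics; this assumes the two
# regions' estimates are independent, which the paper doesn't report.
t_combined = (1.5 + 2.3) / sqrt(2)
print(round(t_combined, 2))          # 2.69
```

Which of the two calculations you prefer depends on whether you think of the second region as a failed replication to be penalized or as additional evidence to be pooled.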

The bottom line

I'm not saying that the reported findings are wrong, I'm just saying that they're not necessarily statistically significant in the usual way this term is used. I think that, in the future, such work would be improved by more strongly linking the statistical analysis to the psychological theories. Rather than simply picking two regions to look at, then taking the winner in a study of n=20 people, and going from there to the theories, perhaps they could more directly model what they're expecting to see.

The difference between . . .

Also, the difference between "significant" and "not significant" is not itself statistically significant. How is this relevant in the present study? They looked at two regions, MPFC and precuneus. Both showed positive correlations, one with a t-value of 2.3, one with a t-value of 1.5. The first of these is statistically significant (well, it is, if you ignore that it's the maximum of two values), the second is not. But the difference is not anything close to statistically significant, not at all! So why such a heavy emphasis on the winner and such a neglect of #2?

Here's the count from a simple document search:

MPFC: 20 instances (including 2 in the abstract)
precuneus: 8 instances (0 in the abstract)

P.S. The "picked just two regions" bit gives a sense of why I prefer Bayesian inference to classical hypothesis testing. The right thing, I think, is actually to look at all 50 regions (or 100, or however many regions there are) and do an analysis including all of them. Not simply picking the region that is most strongly correlated with the outcome and then doing a correction (that's not the most statistically efficient thing to do; you're just asking, begging to be overwhelmed by noise), but rather using the prior information about regions in a subtler way than simply picking out 2 and ignoring the other 48. For example, you could have a region-level predictor which represents prior belief in the region's importance. Or you could group the regions into a few pre-chosen categories and then estimate a hierarchical model with each group of regions being its own batch with group-level mean and standard deviation estimated from data. The point is, you have information you want to use--prior knowledge from the literature--without it unduly restricting the possibilities for discovery in your data analysis.
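The partial-pooling idea can be sketched with simulated data. Everything here (the 50 regions, the assumed common standard error, the simulated effects) is made up purely for illustration; it is not the paper's data or a full hierarchical model, just the simplest empirical-Bayes version of the shrinkage being described.

```python
import random
from statistics import mean, stdev

# Simulate noisy effect estimates for 50 hypothetical brain regions,
# then partially pool them toward the common mean instead of just
# reporting the raw winner.
random.seed(1)
se = 0.2  # assumed common standard error of each region's raw estimate
true_effects = [random.gauss(0, 0.1) for _ in range(50)]
estimates = [theta + random.gauss(0, se) for theta in true_effects]

# Empirical-Bayes normal-normal shrinkage: estimate the between-region
# variance tau^2 by method of moments, then shrink each raw estimate.
m = mean(estimates)
tau2 = max(stdev(estimates) ** 2 - se**2, 0.0)
shrink = tau2 / (tau2 + se**2)
pooled = [m + shrink * (y - m) for y in estimates]

# The shrunken top region is less extreme than the raw winner.
print(max(estimates), max(pooled))
```

The raw maximum overstates the top region's effect precisely because it was selected for being the maximum; the shrinkage estimate corrects for that automatically, without any explicit multiple-comparisons adjustment.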

Near the end, they write:

In addition, we observed increased activity in regions involved in memory encoding, attention, visual imagery, motor execution and imitation, and affective experience with increased behavior change.

These were not pre-chosen regions, which is fine, but at this point I'd like to see the histogram of correlations for all the regions, along with a hierarchical model that allows appropriate shrinkage. Or even a simple comparison to the distribution of correlations one might expect to see by chance. By suggesting this, I'm not trying to imply that all the findings in this paper are due to chance; rather, I'm trying to use statistical methods to subtract out the chance variation as much as possible.

P.P.S. Just to say this one more time: I'm not at all trying to claim that the researchers are wrong. Even if they haven't proven anything in a convincing way, I'll take their word for it that their hypothesis makes scientific sense. And, as they point out, their data are definitely consistent with their hypotheses.

P.P.P.S. For those who haven't been following these issues, see here, here, here, and here.

A few months ago, I blogged on John Gottman, a psychologist whose headline-grabbing research on marriages (he got himself featured in Blink with a claim that he could predict with 83 percent accuracy whether a couple would be divorced--after meeting with them for 15 minutes!) was recently debunked in a book by Laurie Abraham.

The question I raised was: how could someone who was evidently so intelligent and accomplished--Gottman, that is--get things so wrong? My brief conclusion was that once you have some success, I guess there's not much of a motivation to change your ways. Also, I could well believe that, for all its flaws, Gottman's work is better than much of the other research out there on marriages. There's still the question of how this stuff gets published in scientific journals. I haven't looked at Gottman's articles in detail and so don't really have thoughts on that one.

Anyway, I recently corresponded with a mathematician who had heard of Gottman's research and wrote that he was surprised by what Abraham had found:

Seeking balance


I'm trying to temporarily kick the blogging habit as I seem to be addicted. I'm currently on a binge and my plan is to schedule a bunch of already-written entries at one per weekday and not blog anything new for a while.

Yesterday I fell off the wagon and posted 4 items, but maybe now I can show some restraint.

P.S. In keeping with the spirit of this blog, I scheduled it to appear on 13 May, even though I wrote it on 15 Apr. Just about everything you've been reading on this blog for the past several weeks (and lots of forthcoming items) was written a month ago. The only exceptions are whatever my cobloggers have been posting and various items that were timely enough that I inserted them in the queue afterward.

P.P.S. I bumped it up to 22 Jun because, as of 14 Apr, I was continuing to write new entries. I hope to slow down soon!

P.P.P.S. (20 June) I was going to bump it up again--the horizon's now in mid-July--but I thought, enough is enough!

Right now I think that about half of my posts are topical, appearing within a couple days of posting--I often write them in the evening but I like to have them appear between 9 and 10am, eastern time--and half are on a longer delay.

As part of my continuing research project with Grazia and Roberto, I've been reading papers on happiness and life satisfaction research. I'll share with you my thoughts on some of the published work in this area.

Rajiv Sethi offers a fascinating discussion of the incentives involved in paying people zillions of dollars to lie, cheat, and steal. I'd been aware for a long time of the general problem of the system of one-way bets in which purported risk-takers can make huge fortunes with little personal risks, but Rajiv and his commenters go further by getting into the specifics of management at financial firms.

Alan and Felix.


About this Archive

This page is an archive of recent entries in the Decision Theory category.
