# Recently in Miscellaneous Science Category

## How do we evaluate a new and wacky claim?

Around these parts we see a continuing flow of unusual claims supported by some statistical evidence. The claims are varyingly plausible a priori. Some examples (I won't bother to supply the links; regular readers will remember these examples and newcomers can find them by searching):

- Obesity is contagious
- People's names affect where they live, what jobs they take, etc.
- Beautiful people are more likely to have girl babies
- More attractive instructors have higher teaching evaluations
- In a basketball game, it's better to be behind by a point at halftime than to be ahead by a point
- Praying for someone without their knowledge improves their recovery from heart attacks
- A variety of claims about ESP

How should we think about these claims? The usual approach is to evaluate the statistical evidence--in particular, to look for reasons that the claimed results are not really statistically significant. If nobody can shoot down a claim, it survives.

The other part of the story is the prior. The less plausible the claim, the more carefully I'm inclined to check the analysis.

But what does it mean, exactly, to check an analysis? The key step is to interpret the findings quantitatively: not just as significant or non-significant but as an effect size, and then to look at the implications of the estimated effect.

I'll explore this in the context of two examples, one from political science and one from psychology. An easy example is one in which the estimated effect is completely plausible (for example, the incumbency advantage in U.S. elections), or in which it is completely implausible (for example, a new and unreplicated claim of ESP).

Neither of the examples I consider here is easy: both of the claims are odd but plausible, and both are supported by data, theory, and reasonably sophisticated analysis.

### The effect of rain on July 4th

My co-blogger John Sides linked to an article by Andreas Madestam and David Yanagizawa-Drott that reports that going to July 4th celebrations in childhood had the effect of making people more Republican. Madestam and Yanagizawa-Drott write:

Using daily precipitation data to proxy for exogenous variation in participation on Fourth of July as a child, we examine the role of the celebrations for people born in 1920-1990. We find that days without rain on Fourth of July in childhood have lifelong effects. In particular, they shift adult views and behavior in favor of the Republicans and increase later-life political participation. Our estimates are significant: one Fourth of July without rain before age 18 raises the likelihood of identifying as a Republican by 2 percent and voting for the Republican candidate by 4 percent. . . .

Here was John's reaction:

In sum, if you were born before 1970, and experienced sunny July 4th days between the ages of 7-14, and lived in a predominantly Republican county, you may be more Republican as a consequence.

When I [John] first read the abstract, I did not believe the findings at all. I doubted whether July 4th celebrations were all that influential. And the effects seem to occur too early in the life cycle: would an 8-year-old be affected politically? Doesn't the average 8-year-old care more about fireworks than patriotism?

But the paper does a lot of spadework and, ultimately, I was left thinking "Huh, maybe this is true." I'm still not certain, but it was worth a blog post.

My reaction is similar to John's but a bit more on the skeptical side.

Let's start with effect size. One July 4th without rain increases the probability of Republican vote by 4%. From their Figure 3, the number of rain-free July 4ths is between 6 and 12 for most respondents. So if we go from the low to the high end, we get an effect of 6*4%, or 24%.

[Note: See comment below from Winston Lim. If the effect is 24% (not 24 percentage points!) on the Republican vote and 0% on the Democratic vote, then, starting from an even split, the effect on the Republican vote share R/(D+R) is 1.24/2.24 - 1/2, or approximately 5 percentage points. So the estimate is much less extreme than I'd thought. The confusion arose because I am used to seeing results reported in terms of the percent of the two-party vote share, but these researchers used a different form of summary.]
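The arithmetic in the note can be checked in a few lines. This is a minimal sketch, assuming an even two-party split as the baseline; the function name and the 50/50 starting point are my own illustrative choices, not anything from the paper.

```python
def two_party_share_shift(relative_increase, base_r=0.5, base_d=0.5):
    """Change in the Republican two-party share R/(D+R) when the Republican
    vote is scaled by (1 + relative_increase) and the Democratic vote is
    left unchanged. The 50/50 baseline is an assumption for illustration."""
    new_r = base_r * (1.0 + relative_increase)
    before = base_r / (base_r + base_d)
    after = new_r / (new_r + base_d)
    return after - before


# A 24% relative increase in the Republican vote, no change on the other side.
shift = two_party_share_shift(0.24)
print(f"shift in two-party share: {shift:.4f}")
```

The point of writing it out is that a relative change in one party's vote and a change in the two-party share are different summaries, and the second is much smaller than the first.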

Does a childhood full of sunny July 4ths really make you 24 percentage points more likely to vote Republican? (The authors find no such effect when considering the weather in a few other days in July.) I could imagine an effect--but 24 percent of the vote? The number seems too high--especially considering the expected attenuation (noted in section 3.1 of the paper) because not everyone goes to a July 4th celebration and because the authors don't actually know the counties where the survey respondents lived as children. It's hard enough to believe an effect size of 24%, but it's really hard to believe that 24% is an underestimate.

So what could've gone wrong? The most convincing part of the analysis was that they found no effect of rain on July 2, 3, 5, or 6. But this made me wonder about the other days of the year. I'd like to see them automate their analysis and loop it thru all 365 days, then make a graph showing how the coefficient for July 4th fits in. (I'm not saying they should include all 365 in a single regression--that would be a mess. Rather, I'm suggesting the simpler option of 365 analyses, each for a single date.)
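Here is roughly what I have in mind, as a sketch on simulated data rather than the authors' data. Everything here is assumed for illustration: the sample size, the 30% chance of rain, the size of the planted July 4th effect, and the simple one-predictor regression (no county effects or time trends).

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_days = 10_000, 365
july4 = 184  # zero-based day-of-year index of July 4 in a non-leap year

# Hypothetical data: rain[i, d] = 1 if person i saw rain on calendar day d.
rain = rng.binomial(1, 0.3, size=(n_people, n_days)).astype(float)

# Simulated outcome with a planted effect for July 4 only (an assumption).
y = 0.5 - 0.04 * rain[:, july4] + rng.normal(0.0, 0.5, size=n_people)


def slope(x, y):
    """OLS slope from regressing y on a single predictor x plus intercept."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))


# 365 separate analyses, one per calendar date.
coefs = np.array([slope(rain[:, d], y) for d in range(n_days)])

# Where does the July 4 coefficient sit among all 365, by absolute size?
rank = int((np.abs(coefs) >= abs(coefs[july4])).sum())
print(f"July 4 coefficient: {coefs[july4]:.3f}; rank by |coef|: {rank} of {n_days}")
```

In a real version you would plot all 365 coefficients against the date and see whether July 4th stands out from the spread of the others.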

Otherwise there are various features in the analysis that could cause problems. The authors predict individual survey respondents given the July 4th weather when they were children, in the counties where they currently reside. Right away we can imagine all sorts of biases based on who moves and who stays put.

Setting aside these measurement issues, the big identification issue is that counties with more rain might be systematically different than counties with less rain. To the extent the weather can be considered a random treatment, the randomization is occurring across years within counties. The authors attempt to deal with this by including "county fixed effects"--that is, allowing the intercept to vary by county. That's ok but their data span a 70 year period, and counties have changed a lot politically in 70 years. They also include linear time trends for states, which helps some more, but I'm still a little concerned about systematic differences not captured in these trends.

No study is perfect, and I'm not saying these are devastating criticisms. I'm just trying to work through my thoughts here.

### The effects of names on life choices

For another example, consider the study by Brett Pelham, Matthew Mirenberg, and John Jones of the dentists named Dennis (and the related stories of people with names beginning with F getting low grades, baseball players with K names getting more strikeouts, etc.). I found these claims varyingly plausible: the business with the grades and the strikeouts sounded like a joke, but the claims about career choices etc seemed possible.

My first step in trying to understand these claims was to estimate an effect size: my crude estimate was that, if the research findings were correct, about 1% of people choose their career based on their first names.

This seemed possible to me, but Uri Simonsohn (the author of the recent rebuttal of the name-choice article by Pelham et al.) argued that the implied effects were too large to be believed (just as I was arguing above regarding the July 4th study), which makes more plausible his claims that the results arise from methodological artifacts.

That calculation is straight Bayes: the distribution of systematic errors has much longer tails than the distribution of random errors, so the larger the estimated effect, the more likely it is to be a mistake. This little theoretical result is a bit annoying, because it is the larger effects that are the most interesting!
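To make the Bayes calculation concrete, here is a toy version. It is a sketch under assumptions I am making up for illustration: an estimate comes either from a real effect plus random error (normal) or from a systematic error (Cauchy, so much longer tails), with a 10% prior chance of the latter. None of these numbers comes from either study.

```python
import math


def normal_pdf(x, sd):
    """Density of a normal(0, sd) distribution at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def cauchy_pdf(x, scale):
    """Density of a Cauchy(0, scale) distribution at x (long tails)."""
    return 1.0 / (math.pi * scale * (1.0 + (x / scale) ** 2))


def prob_artifact(estimate, prior_artifact=0.1, sd_real=1.0, scale_artifact=1.0):
    """Posterior probability that an estimate of this size is an artifact,
    under the two-component mixture described above (all values assumed)."""
    a = prior_artifact * cauchy_pdf(estimate, scale_artifact)
    r = (1.0 - prior_artifact) * normal_pdf(estimate, sd_real)
    return a / (a + r)


for est in (0.5, 2.0, 4.0):
    print(f"estimate = {est}: P(artifact) = {prob_artifact(est):.3f}")
```

Under this setup, once the estimate is out in the tails the long-tailed component dominates, which is the annoying result: the big, interesting estimates are the ones most likely to be mistakes.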

Simonsohn moved the discussion forward by calibrating the effect-size questions to other measurable quantities:

We need a benchmark to make a more informed judgment if the effect is small or large. For example, the Dennis/dentist effect should be much smaller than parent-dentist/child-dentist. I think this is almost certainly true but it is an easy hurdle. The J marries J effect should not be much larger than the effect of, say, conditioning on going to the same high-school, having sat next to each other in class for a whole semester.

I have no idea if that hurdle is passed. These are arbitrary thresholds for sure, but better I'd argue than both my "100% increase is too big", and your "pr(marry smith) up from 1% to 2% is ok."

### Summary

No easy answers. But I think that understanding effect sizes on a real scale is a start.

## Combining survey data obtained using different modes of sampling

I'm involved (with Irv Garfinkel and others) in a planned survey of New York City residents. It's hard to reach people in the city--not everyone will answer their mail or phone, and you can't send an interviewer door-to-door in a locked apartment building. (I think it violates IRB to have a plan of pushing all the buzzers by the entrance and hoping someone will let you in.) So the plan is to use multiple modes, including phone, in person household, random street intercepts and mail.

The question then is how to combine these samples. My suggested approach is to divide the population into poststrata based on various factors (age, ethnicity, family type, housing type, etc), then to pool responses within each poststratum, then to run some regressions including poststrata and also indicators for mode, to understand how respondents from different modes differ, after controlling for the demographic/geographic adjustments.
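The pooling step can be sketched as follows. This is only the poststratified average, not the regressions with mode indicators, and the respondents, strata, and population shares are all hypothetical.

```python
from collections import defaultdict

# Hypothetical respondents: (poststratum, mode of contact, response).
sample = [
    ("young", "phone", 1), ("young", "street", 0), ("young", "street", 1),
    ("old", "phone", 0), ("old", "mail", 0), ("old", "mail", 1), ("old", "phone", 1),
]

# Assumed known population shares of each poststratum (e.g. from census data).
pop_share = {"young": 0.4, "old": 0.6}


def poststratified_mean(sample, pop_share):
    """Pool responses across modes within each poststratum, then weight the
    stratum means by population share. Strata with no respondents are
    silently skipped here; a real analysis would need to model them."""
    by_stratum = defaultdict(list)
    for stratum, _mode, y in sample:
        by_stratum[stratum].append(y)
    return sum(pop_share[s] * sum(ys) / len(ys) for s, ys in by_stratum.items())


estimate = poststratified_mean(sample, pop_share)
print(f"poststratified estimate: {estimate:.3f}")
```

With many poststrata the cells get small, which is where the regression modeling comes in.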

Maybe this has already been done and written up somewhere?

P.S. As you try to do this sort of thing more carefully you run up against the sorts of issues discussed in the Struggles paper. So this is definitely statistical research, not merely an easy application of existing methods.

P.P.S. Cyrus has some comments, which for convenience I'll repost here:

It's interesting to consider this problem by combining a "finite population" perspective with some ideas about "principal strata" from the causal inference literature. Suppose a finite population U from which we draw a sample of N units. We have two modes of contact, A and B. Suppose for the moment that each unit can be characterized by one of the following response types (these are the "principal strata"):

| Type | Mode A response | Mode B response |
|------|-----------------|-----------------|
| I    | 1               | 1               |
| II   | 1               | 0               |
| III  | 0               | 1               |
| IV   | 0               | 0               |

Then, there are two cases to consider, depending on whether mode of contact affects response:

### Mode of contact does not affect response

This might be a valid assumption if the questions of interest are not subject to social desirability biases, interviewer effects, etc. In this case, it is easy to define a target parameter as the average response in the population. You could proceed efficiently by first applying mode A to the sample, and then applying mode B to those who did not respond with mode A. At the end, you would have outcomes for types I, II, and III units, and you'd have an estimate of the rate of type IV units in the population. You could content yourself with an estimate for the average response on the type I, II, and III subpopulation. If you wanted to recover an estimate of the average response for the full population (including type IV's), you would effectively have to impute values for type IV respondents. This could be done by using auxiliary information either to genuinely impute or (in a manner that is pretty much equivalent) to determine which type I, II, or III units resemble the missing type IV units, and up-weight. In any case, if the response of interest has finite support, one could also compute "worst case" (Manski-type) bounds on the average response by imputing maximum and minimum values to type IV units.
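The worst-case bounds at the end of that paragraph are simple to compute. A minimal sketch, with a made-up binary response and a made-up count of type IV (never-responding) units; the function name is hypothetical.

```python
def manski_bounds(observed, n_missing, lo, hi):
    """Worst-case bounds on the full-sample mean when n_missing units never
    respond and the response is known to lie in [lo, hi]: impute lo to every
    missing unit for the lower bound and hi for the upper bound."""
    n = len(observed) + n_missing
    total = sum(observed)
    return (total + n_missing * lo) / n, (total + n_missing * hi) / n


# Hypothetical data: 8 respondents (types I-III) and 2 type IV units.
responses = [1, 0, 1, 1, 0, 1, 1, 0]
lower, upper = manski_bounds(responses, n_missing=2, lo=0, hi=1)
print(f"average response is between {lower:.2f} and {upper:.2f}")
```

The width of the interval is just the type IV rate times the range of the response, so the bounds are only useful when nonresponse is modest or the support is narrow.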

### Mode of contact affects response

This might be relevant if, for example, the modes of contact are phone call versus face-to-face interview, and outcomes being measured vary depending on whether the respondent feels more or less exposed in the interview situation. This possibility makes things a lot trickier. In this case, each unit is characterized by a response under mode A and another under mode B (that is, two potential outcomes). One immediately faces a quandary of defining the target parameter. Is it the average of responses under the two modes of contact? Maybe it is some "latent" response that is imperfectly revealed under the two modes of contact? If so, how can we characterize this "imperfection"? Furthermore, only for type I individuals will you be able to obtain information on both potential responses. Does it make sense to restrict ourselves to this subpopulation? If not, then we would again face the need for imputation. A design that applied both mode A and mode B to the complete sample would mechanically reveal the proportion of type I units in the population, and by implication would identify the proportion of type II, III, and IV units. For type II units we could use mode A responses to improve imputations for mode B responses, and vice versa for type III respondents. Type IV respondents' contributions to our estimate of the "average response" would be based purely on auxiliary information. Again, one could construct worst case bounds by imputing maximum and minimum response values for each of the missing response types.

One wrinkle that I ignored above was that the order of modes of contact may affect either response behavior or outcomes reported. This multiplies the number of potential response behaviors and the number of potential outcome responses given that the unit is interviewed. You could get some way past these issues by randomizing the order of mode of contact--e.g. A then B for one half, and B then A for the other half. But you would have to impose some more assumptions to make use of this random assignment. E.g., you'd have to assume that A-then-B always-responders are exchangeable with B-then-A always-responders in order to combine the information from the always-responders in each half-sample. Or, you could "shift the goal posts" by saying that all you are interested in is the average of responses from modes A and B under the A-then-B design.

Update:

The above analysis did not explore how other types of assumptions might help to identify the population average. Andy's proposal to use post-stratification and regressions relies (according to my understanding) on the assumption that potential outcomes are independent of mode of contact conditional on covariates. Formally, if the mode of contact is $M$ taking on values $A$ or $B$, the potential outcome under mode of contact $m$ is $y(m)$, $T$ is principal stratum, and $X$ is a covariate, then $\left[y(A),y(B)\right] \perp M | T, X$ implies that,

$E(y(m)|T,X) = E(y(m)|M=m, T,X) = E(y(m)|M \ne m, T,X)$.

As discussed above, the design that applies modes A and B to all units in the sample can determine principal stratum membership, and so these covariate- and principal-stratum specific imputations can be applied. Ordering effects will again complicate things, and so more assumptions would be needed. A worthwhile type of analysis would be to study evidence of mode-of-contact as well as ordering effects among the type I (always-responder) units.

Now, it may be that mode of contact affects response but each unit is contacted via only one of mode A or mode B. Then, a unit's principal stratum membership is not identifiable, nor is the proportion of types I through IV identifiable (we would end up with two mixtures of responding and non-responding types, with no way to parse out relative proportions of the different types). If some kind of response "monotonicity" held, then that would help a little. Response monotonicity would mean that either type II or type III responders didn't exist. Otherwise, we would have to impose more stringent assumptions. The common one would be that principal stratum membership is independent of potential responses conditional on covariates. This is a classic "ignorable non-response" assumption, and it suffers from having no testable implications.

## Error in an attribution of an error

When you say that somebody else screwed up, you have to be extra careful you're not getting things wrong yourself! A philosopher of science is quoted as having written, "it seems best to let this grubby affair rest in a footnote," but I think it's good for these things to be out in the open.

## How the ignorant idiots win, explained. Maybe.

According to a New York Times article, cognitive scientists Hugo Mercier and Dan Sperber have a new theory about rational argument: humans didn't develop it in order to learn about the world, we developed it in order to win arguments with other people. "It was a purely social phenomenon. It evolved to help us convince others and to be careful when others try to convince us."

Based on the NYT article, it seems that Mercier and Sperber are basically flipping around the traditional argument, which is that humans learned to reason about the world, albeit imperfectly, and learned to use language to convey that reasoning to others. These guys would suggest that it's the other way around: we learned to argue with others, and this has gradually led to the ability to actually make (and recognize) sound arguments, but only indirectly. The article quotes them: "At least in some cultural contexts, this results in a kind of arms race towards greater sophistication in the production and evaluation of arguments," they write. "When people are motivated to reason, they do a better job at accepting only sound arguments, which is quite generally to their advantage."

Of course I have no idea if any of this is true, or even how to test it. But it's definitely true that people are often convinced by wrong or even crazy arguments, and they (we) are subject to confirmation bias and availability bias and all sorts of other systematic biases. One thing that bothers me especially is that a lot of people are simply indifferent to facts and rationality when making decisions. Mercier and Sperber have at least made a decent attempt to explain why people are like this.

## The happiness gene: My bottom line (for now)

I had a couple of email exchanges with Jan-Emmanuel De Neve and James Fowler, two of the authors of the article on the gene that is associated with life satisfaction which we blogged the other day. (Bruno Frey, the third author of the article in question, is out of town according to his email.) Fowler also commented directly on the blog.

I won't go through all the details, but now I have a better sense of what's going on. (Thanks, Jan and James!) Here's my current understanding:

1. The original manuscript was divided into two parts: an article by De Neve alone published in the Journal of Human Genetics, and an article by De Neve, Fowler, Frey, and Nicholas Christakis submitted to Econometrica. The latter paper repeats the analysis from the Adolescent Health survey and also replicates with data from the Framingham heart study (hence Christakis's involvement).

The Framingham study measures a slightly different gene and uses a slightly different life-satisfaction question compared to the Adolescent Health survey, but De Neve et al. argue that they're close enough for the study to be considered a replication. I haven't tried to evaluate this particular claim but it seems plausible enough. They find an association with p-value of exactly 0.05. That was close! (For some reason they don't control for ethnicity in their Framingham analysis--maybe that would pull the p-value to 0.051 or something like that?)

2. Their gene is correlated with life satisfaction in their data and the correlation is statistically significant. The key to getting statistical significance is to treat life satisfaction as a continuous response rather than to pull out the highest category and call it a binary variable. I have no problem with their choice; in general I prefer to treat ordered survey responses as continuous rather than discarding information by combining categories.

3. But given their choice of a continuous measure, I think it would be better for the researchers to stick with it and present results as points on the 1-5 scale. From their main regression analysis on the Adolescent Health data, they estimate the effect of having two (compared to zero) "good" alleles as 0.12 (+/- 0.05) on a 1-5 scale. That's what I think they should report, rather than trying to use simulation to wrestle this into a claim about the probability of describing oneself as "very satisfied."

They claim that having the two alleles increases the probability of describing oneself as "very satisfied" by 17%. That's not 17 percentage points, it's 17%, thus increasing the probability from 37% to 1.17*37% = 43%. This isn't quite the 41% that's in the data but I suppose the extra 2% comes from the regression adjustment. Still, I don't see this as so helpful. I think they'd be better off simply describing the estimated improvement as 0.1 on a 1-5 scale. If you really really want to describe the result for a particular category, I prefer percentage points rather than percentages.

4. Another advantage of describing the result as 0.1 on a 1-5 scale is that it is more consistent with intuitive notions of 1% of variance explained. It's good they have this 1% in their article--I should present such R-squared summaries in my own work, to give a perspective on the sizes of the effects that I find.

5. I suspect the estimated effect of 0.1 is an overestimate. I say this for the usual reason, discussed often on this blog, that statistically significant findings, by their very nature, tend to be overestimates. I've sometimes called this the statistical significance filter, although "hurdle" might be a more appropriate term.

6. Along with the 17% number comes a claim that having one allele gives an 8% increase. 8% is half of 17% (subject to rounding) and, indeed, their estimate for the one-allele case comes from their fitted linear model. That's fine--but the data aren't really informative about the one-allele case! I mean, sure, the data are perfectly consistent with the linear model, but the nature of leverage is such that you really don't get a good estimate on the curvature of the dose-response function. (See my 2000 Biostatistics paper for a general review of this point.) The one-allele estimate is entirely model-based. It's fine, but I'd much prefer simply giving the two-allele estimate and then saying that the data are consistent with a linear model, rather than presenting the one-allele estimate as a separate number.

7. The news reports were indeed horribly exaggerated. No fault of the authors but still something to worry about. The Independent's article was titled, "Discovered: the genetic secret of a happy life," and the Telegraph's was not much better: "A "happiness gene" which has a strong influence on how satisfied people are with their lives, has been discovered." An effect of 0.1 on a 1-5 scale: an influence, sure, but a "strong" influence?

8. There was some confusion with conditional probabilities that made its way into the reports as well. From the Telegraph:

The results showed that a much higher proportion of those with the efficient (long-long) version of the gene were either very satisfied (35 per cent) or satisfied (34 per cent) with their life - compared to 19 per cent in both categories for those with the less efficient (short-short) form.

After looking at the articles carefully and having an email exchange with De Neve, I can assure you that the above quote is indeed wrong, which is really too bad because it was an attempted correction of an earlier mistake. The correct numbers are not 35, 34, 19, 19. Rather, they are 41, 46, 37, 44. A much less dramatic difference: changes of 4% and 2% rather than 16% and 15%. The Telegraph reporter was giving P(gene|happiness) rather than P(happiness|gene). What seems to have happened is that he misread Figure 2 in the Human Genetics paper. He then may have got stuck on the wrong track by expecting to see a difference of 17%.

9. The abstract for the Human Genetics paper reports a p-value of 0.01. But the baseline model (Model 1 in Table V of the Econometrica paper) reports a p-value of 0.02. The lower p-values are obtained by models that control for a big pile of intermediate outcomes.

10. In section 3 of the Econometrica paper, they compare identical to fraternal twins (from the Adolescent Health survey, it appears) and estimate that 33% of the variation in reported life satisfaction is explained by genes. As they say, this is roughly consistent with estimates of 50% or so from the literature. I bet their 33% has a big standard error, though: one clue is that the difference in correlations between identical and fraternal twins is barely statistically significant (at the 0.03 level, or, as they quaintly put it, 0.032). They also estimate 0% of the variation to be due to common environment, but again that 0% is gonna be a point estimate with a huge standard error.

I'm not saying that their twin analysis is wrong. To me the point of these estimates is to show that the Adolescent Health data are consistent with the literature on genes and happiness, thus supporting the decision to move on with the rest of their study. I don't take their point estimates of 33% and 0% seriously but it's good to know that the twin results go in the expected direction.

11. One thing that puzzles me is why De Neve et al. only studied one gene. I understand that this is the gene that they expected to relate to happiness and life satisfaction, but . . . given that it only explains 1% of the variation, there must be hundreds or thousands of genes involved. Why not look at lots and lots? At the very least, the distribution of estimates over a large sample of genes would give some sense of the variation that might be expected. I can't see the point of looking at just one gene, unless cost is a concern. Are other gene variants already recorded for the Adolescent Health and Framingham participants?

12. My struggles (and the news reporters' larger struggles) with the numbers in these articles make me feel, even more strongly than before, the need for a suite of statistical methods for building from simple comparisons to more complicated regressions. (In case you're reading this, Bob and Matt3, I'm talking about the network of models.)

As researchers, transparency should be our goal. This is sometimes hindered by scientific journals' policies of brevity. You can end up having to remove lots of the details that make a result understandable.

13. De Neve concludes the Human Genetics article as follows:

There is no single 'happiness gene.' Instead, there is likely to be a set of genes whose expression, in combination with environmental factors, influences subjective well-being.

I would go even further. Accepting their claim that between one-third and one-half of the variation in happiness and life satisfaction is determined by genes, and accepting their estimate that this one gene explains as much as 1% of the variation, and considering that this gene was their #1 candidate (or at least a top contender) for the "happiness gene" . . . my guess is that the set of genes that influence subjective well-being is a very large number indeed! The above disclaimer doesn't seem disclaimery-enough to me, in that it seems to leave open the possibility that this "set of genes" might be just three or four. Hundreds or thousands seems more like it.

I'm reminded of the recent analysis that found that the simple approach of predicting child's height using a regression model given parents' average height performs much better than a method based on combining 54 genes.

14. Again, I'm not trying to present this as any sort of debunking, merely trying to fit these claims in with the rest of my understanding. I think it's great when social scientists and public health researchers can work together on this sort of study. I'm sure that in a couple of decades we'll have a much better understanding of genes and subjective well-being, but you have to start somewhere. This is a clean study that can be the basis for future research.
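The mix-up in item 8 between P(gene|happiness) and P(happiness|gene) can be checked with Bayes' rule. This is a minimal sketch, assuming the rounded sample shares (20%, 45%, 35% with 0, 1, 2 alleles) and "very satisfied" rates (37%, 38%, 41%) quoted in my earlier post on this study; the variable names are my own.

```python
# P(number of alleles) in the sample, rounded as quoted in the earlier post.
share = {0: 0.20, 1: 0.45, 2: 0.35}

# P(very satisfied | number of alleles), also as quoted there.
very_sat = {0: 0.37, 1: 0.38, 2: 0.41}

# Marginal probability of answering "very satisfied".
p_very = sum(share[k] * very_sat[k] for k in share)

# Bayes' rule: P(number of alleles | very satisfied).
p_gene_given_very = {k: share[k] * very_sat[k] / p_very for k in share}

for k in sorted(share):
    print(f"{k} alleles: P(very satisfied | gene) = {very_sat[k]:.2f}, "
          f"P(gene | very satisfied) = {p_gene_given_very[k]:.2f}")
```

With these rounded inputs the inverted probabilities come out to roughly 19% for the 0-allele group and 37% for the 2-allele group, in the neighborhood of the Telegraph's 19 and 35. The big gap there mostly reflects how common each genotype is, not how happy its carriers are.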

Hmmm . . . .could I publish this as a letter in the Journal of Human Genetics? Probably not, unfortunately.

P.S. You could do this all yourself! This and my earlier blog on the happiness gene study required no special knowledge of subject matter or statistics. All I did was tenaciously follow the numbers and pull and pull until I could see where all the claims were coming from. A statistics student, or even a journalist with a few spare hours, could do just as well. (Why I had a few spare hours to do this is another question. The higher procrastination, I call it.) I probably could've done better with some prior knowledge--I know next to nothing about genetics and not much about happiness surveys either--but I could get pretty far just tracking down the statistics (and, as noted, without any goal of debunking or any need to make a grand statement).

P.P.S. See comments for further background from De Neve and Fowler!

## "Discovered: the genetic secret of a happy life"

I took the above headline from a news article in the (London) Independent by Jeremy Laurance reporting a study by Jan-Emmanuel De Neve, James Fowler, and Bruno Frey that reportedly just appeared in the Journal of Human Genetics.

One of the pleasures of blogging is that I can go beyond the usual journalistic approaches to such a story: (a) puffing it, (b) debunking it, (c) reporting it completely flatly. Even convex combinations of (a), (b), (c) do not allow what I'd like to do, which is to explore the claims and follow wherever my exploration takes me. (And one of the pleasures of building my own audience is that I don't need to endlessly explain background detail as was needed on a general-public site such as 538.)

OK, back to the genetic secret of a happy life. Or, in the words of the authors of the study, a gene that "explains less than one percent of the variation in life satisfaction."

"The genetic secret" or "less than one percent of the variation"?

Perhaps the secret of a happy life is in that one percent??

I can't find a link to the journal article, which appears, based on the listing on De Neve's webpage, to be single-authored, but I did find this Googledocs link to a technical report from January 2010 that seems to have all the content. Regular readers of this blog will be familiar with earlier interesting research of Fowler and Frey working separately; I had no idea that they have been collaborating.

De Neve et al. took responses to a question on life satisfaction from a survey that was linked to genetic samples. They looked at a gene called 5HTT which, according to their literature review, has been believed to be associated with happy feelings.

I haven't taken a biology class since 9th grade, so I'll give a simplified version of the genetics. You can have either 0, 1, or 2 alleles of the gene in question. Of the people in the sample, 20% have 0 alleles, 45% have 1 allele, and 35% have 2. The more alleles you have, the happier you'll be (on average): The percentage of respondents describing themselves as "very satisfied" with their lives is 37% for people with 0 alleles, 38% for those with one allele, and 41% for those with two alleles.

The key comparison here comes from the two extremes: 2 alleles vs. 0. People with 2 alleles are 4 percentage points (more precisely, 3.6 percentage points) more likely to report themselves as very satisfied with their lives. The standard error of this difference in proportions is sqrt(.41*(1-.41)/862+.37*(1-.37)/509) = 0.027, so the difference is not statistically significant at a conventional level.
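The standard-error calculation is easy to reproduce. A minimal sketch using the rounded proportions and the group sizes from the formula above; the function name is hypothetical.

```python
import math


def diff_in_proportions(p1, n1, p2, n2):
    """Difference between two independent sample proportions, its standard
    error, and the ratio of the two (a z-score)."""
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, se, diff / se


# 2-allele group: 41% of 862; 0-allele group: 37% of 509.
diff, se, z = diff_in_proportions(0.41, 862, 0.37, 509)
print(f"difference = {diff:.3f}, se = {se:.3f}, z = {z:.2f}")
```

Whether you use the rounded 4 points or the more precise 3.6, the difference is less than two standard errors from zero.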

But in their abstract, De Neve et al. reported the following:

Having one or two alleles . . . raises the average likelihood of being very satisfied with one's life by 8.5% and 17.3%, respectively.

How did they get from a non-significant difference of 4% (I can't bring myself to write "3.6%" given my aversion to fractional percentage points) to a statistically significant 17.3%?

A few numbers that I can't figure out at all!

Here's the summary from Stephen Adams, medical correspondent of the Daily Telegraph:

The researchers found that 69 per cent of people who had two copies of the gene said they were either satisfied (34) or very satisfied (35) with their life as a whole.

But among those who had no copy of the gene, the proportion who gave either of these answers was only 38 per cent (19 per cent 'very satisfied' and 19 per cent 'satisfied').

This leaves me even more confused! According to the table on page 21 of the De Neve et al. article, 46% of people who had two copies of the gene described themselves as satisfied and 41% described themselves as very satisfied. The corresponding percentages for those with no copies were 44% and 37%.

I suppose the most likely explanation is that Stephen Adams just made a mistake, but it's no ordinary confusion because his numbers are so specific. Then again, I could just be missing something big here. I'll email Fowler for clarification but I'll post this for now so you loyal blog readers can see error correction (of one sort or another) in real time.

Where did the 17% come from?

OK, so setting Stephen Adams aside, how can we get from a non-significant 4% to a significant 17%?

- My first try is to use the numerical life-satisfaction measure. Average satisfaction on a 1-5 scale is 4.09 for the 0-allele people in this sample and 4.25 for the 2-allele people, and the difference has a standard error of 0.05. Hey--a difference of 0.16 with a standard error of 0.05--that's statistically significant! So it doesn't seem just like a fluctuation in the data.

- The main analysis of De Neve et al., reported in their Table 1, appears to be a least-squares regression of well-being (on that 1-5 scale), using the number of alleles as a predictor and also throwing in some controls for ethnicity, sex, age, and some other variables. They include error terms for individuals and families but don't seem to report the relative sizes of the errors. In any case, the controls don't seem to do much. Their basic result (Model 1, not controlling for variables such as marital status which might be considered as intermediate outcomes of the gene) yields a coefficient estimate of 0.06.

They then write, "we summarize the results for 5HTT by simulating first differences from the coefficient covariance matrix of Model 1. Holding all else constant and changing the 5HTT gene of all subjects from zero to one long allele would increase the reporting of being very satisfied with one's life in this population by about 8.5%." Huh? I completely don't understand this. It looks to me that the analyses in Table 1 are regressions on the 1-5 scale. So how can they transfer these to claims about "the reporting of being very satisfied"? Also, if it's just least squares, why do they need to work with the covariance matrix? Why can't they just look at the coefficient itself?

- They report (in Table 5) that whites have higher life satisfaction responses than blacks but lower numbers of alleles, on average. So controlling for ethnicity should increase the coefficient. I still can't see it going all the way from 4% to 17%. But maybe this is just a poverty of my intuition.

- OK, I'm still confused and have no idea where the 17% could be coming from. All I can think of is that the difference between 0 alleles and 2 alleles corresponds to an average difference of 0.16 in happiness on that 1-5 scale. And 0.16 is practically 17%, so maybe when you control for things the number jumps around a bit. Perhaps the result of their "first difference" calculations was somehow to carry that 0.16 or 0.17 and attribute it to the "very satisfied" category?
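As a rough check on the 4.09 vs. 4.25 comparison in the first bullet above, here is a sketch of the standard error of a difference in means. It rests on two assumptions that are not stated next to the 0.05: a within-group sd of about 0.8 (the figure used in the variance calculation later in this post) and group sizes of 862 and 509 (taken from the difference-in-proportions calculation earlier). The function name is hypothetical.

```python
import math

def se_diff_means(sd1, n1, sd2, n2):
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(sd1**2 / n1 + sd2**2 / n2)

# Assumed inputs (see lead-in): sd of about 0.8 in each group,
# 862 and 509 respondents in the two groups being compared.
se_means = se_diff_means(0.8, 862, 0.8, 509)

diff_means = 4.25 - 4.09           # 0.16 on the 1-5 scale
z_means = diff_means / se_means    # ratio of the estimate to its standard error

print(f"se = {se_means:.3f}, z = {z_means:.1f}")
```

Under these assumptions the standard error is a bit under 0.05 and the difference is more than three standard errors from zero, consistent with the "statistically significant" reading of the 1-5 scale comparison.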

1% of variance explained

One more thing . . . that 1% quote. Remember? "the 5HTT gene explains less than one percent of the variation in life satisfaction." This is from page 14 of the De Neve, Fowler, and Frey article. 1%? How can we understand this?

Let's do a quick variance calculation:

- Mean and sd of life satisfaction responses (on the 1-5 scale) among people with 0 alleles: 4.09 and 0.8
- Mean and sd of life satisfaction responses (on the 1-5 scale) among people with 2 alleles: 4.25 and 0.8
- The difference is 0.16 so the explained variance is (0.16/2)^2 = 0.08^2
- Finally, R-squared is explained variance divided by total variance: (0.08/0.8)^2 = 0.01.

A difference of 0.16 on a 1-5 scale ain't nothing (it's approximately the same as the average difference in life satisfaction, comparing whites and blacks), especially given that most people are in the 4 and 5 categories. But it only represents 1% of the variance in the data. It's hard for me to hold these two facts in my head at the same time. The quick answer is that the denominator of the R-squared--the 0.8--contains lots of individual variation, including variation in the survey response. Still, 1% is such a small number. No surprise it didn't make it into the newspaper headline . . .

Here's another story of R-squared = 1%. Consider a 0/1 outcome with about half the people in each category. For example, half the people with some disease die in a year and half live. Now suppose there's a treatment that increases survival rate from 50% to 60%. The unexplained sd is 0.5 and the explained sd is 0.05, hence R-squared is, again, 0.01.
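Both R-squared = 1% stories follow the same back-of-the-envelope recipe, sketched below in Python. The function name is hypothetical, and the calculation assumes two equal-sized groups, so that the explained sd is half the difference between group means. That is a simplification for the gene example, where the groups are not equal in size.

```python
def r_squared_two_groups(diff, sd):
    """Approximate R-squared for a comparison of two equal-sized groups.

    The explained sd is taken as half the difference between the group means,
    and sd is the denominator used in the text (0.8 for the life-satisfaction
    scale, 0.5 for the 0/1 survival outcome).
    """
    explained_sd = diff / 2
    return (explained_sd / sd) ** 2

# Gene example: difference of 0.16 on the 1-5 scale, sd of 0.8.
r2_gene = r_squared_two_groups(0.16, 0.8)

# Treatment example: survival goes from 50% to 60%, sd of 0.5.
r2_treatment = r_squared_two_groups(0.10, 0.5)

print(f"gene: {r2_gene:.3f}, treatment: {r2_treatment:.3f}")
```

A 10-percentage-point improvement in survival and a 0.16-point shift in life satisfaction both land at about 0.01, which illustrates how a substantively meaningful difference can coexist with a tiny share of variance explained.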

Summary (for now):

I don't know where the 17% came from. I'll email James Fowler and see what he says. I'm also wondering about that Daily Telegraph article but it's usually not so easy to reach newspaper journalists so I'll let that one go for now.

P.S. According to his website, Fowler was named the most original thinker of the year by The McLaughlin Group. On the other hand, our sister blog won an award by the same organization that honored Peggy Noonan. So I'd call that a tie!

P.P.S. Their data come from the National Survey of Adolescent Health, which for some reason is officially called "Add Health." Shouldn't that be "Ad Health" or maybe "Ado Health"? I'm confused where the extra "d" is coming from.

P.P.P.S. De Neve et al. note that the survey did not actually ask about happiness, only about life satisfaction. We all know people who appear satisfied with their lives but don't seem so happy, but the presumption is that, in general, things associated with more life satisfaction are also associated with happiness. The authors also remark upon the limitations of using a sample of adolescents to study life satisfaction. Not their fault--as is appropriate, they use the data they have and then discuss the limitations of their analysis.

P.P.P.P.S. De Neve and Fowler have a related paper with a nice direct title, "The MAOA Gene Predicts Credit Card Debt." This one, also from Add Health, reports: "Having one or both MAOA alleles of the low efficiency type raises the average likelihood of having credit card debt by 14%." For some reason I was having difficulty downloading the pdf file (sorry, I have a Windows machine!) so I don't know how to interpret the 14%. I don't know if they've looked at credit card debt and life satisfaction together. Being in debt seems unsatisfying; on the other hand you could go in debt to buy things that give you satisfaction, so it's not clear to me what to expect here.

P.P.P.P.P.S. I'm glad Don Rubin didn't read the above-linked article. Footnote 9 would probably make him barf.

P.P.P.P.P.P.S. Just to be clear: The above is not intended to be a "debunking" of the research of De Neve, Fowler, and Frey. It's certainly plausible that this gene could be linked to reported life satisfaction (maybe, for example, it influences the way that people respond to survey questions). I'm just trying to figure out what's going on, and, as a statistician, it's natural for me to start with the numbers.

P.^7S. James Fowler explains some of the confusion in a long comment.

## Suspicious pattern of too-strong replications of medical research

Howard Wainer writes in the Statistics Forum:

The Chinese scientific literature is rarely read or cited outside of China. But the authors of this work are usually knowledgeable of the non-Chinese literature -- at least the A-list journals. And so they too try to replicate the alpha finding. But do they? One would think that they would find the same diminished effect size, but they don't! Instead they replicate the original result, even larger. Here's one of the graphs:

How did this happen?

Full story here.

## Statistics ethics question

A graduate student in public health writes:

I have been asked to do the statistical analysis for a medical unit that is delivering a pilot study of a program to [details redacted to prevent identification]. They are using a prospective, nonrandomized, cohort-controlled trial study design.

The investigator thinks they can recruit only a small number of treatment and control cases, maybe fewer than 30 in total. After I told the investigator that I cannot do anything statistically with a sample size that small, he responded that small sample sizes are common in this field, and he sent me an example of analysis that someone had done on a similar study.

So he still wants me to come up with a statistical plan. Is it unethical for me to do anything other than descriptive statistics? I think he should just stick to qualitative research. But the study he mentions above has 40 subjects and apparently had enough power to detect some effects. This is a pilot study after all so the n does not have to be large. It's not randomized though so I would think it would need a larger n because of the weak design.

My first, general, recommendation is that it always makes sense to talk with any person as if he is completely ethical. If he is ethical, this is a good idea, and if he is not, you don't want him to think you think badly of him. If you are worried about a serious ethical problem, you can ask about it by saying something like, "From the outside, this could look pretty bad. An outsider, seeing this plan, might think we are being dishonest etc. etc." That way you can express this view without it being personal. And maybe your colleague has a good answer, which he can tell you.

To get to your specific question, there is really no such thing as a minimum acceptable sample size. You can get statistical significance with n=5 if your signal is strong enough.

Generally, though, the purpose of a pilot study is not to get statistical significance but rather to get experience with the intervention and the measurements. It's ok to do a pilot analysis, recognizing that it probably won't reach statistical significance. Also, regardless of sample size, qualitative analysis is appropriate and necessary in any pilot study.

Finally, of course they should not imply that they can collect a larger sample size than they can actually do.

## Psychology researchers discuss ESP

Chris Masse writes:

I know you hate the topic, but during this debate (discussing both sides), there were issues raised that are of interest to your science.

Actually I just don't have the patience to watch videos. But I'll forward it on to the rest of you. I've already posted my thoughts on the matter here. ESP is certainly something that a lot of people want to be true.

## My NOAA story

I recently learned we have some readers at the National Oceanic and Atmospheric Administration so I thought I'd share an old story.

About 35 years ago my brother worked briefly as a clerk at NOAA in their D.C. (or maybe it was D.C.-area) office. His job was to enter the weather numbers that came in. He had a boss who was very orderly. At one point there was a hurricane that wiped out some weather station in the Caribbean, and his boss told him to put in the numbers anyway. My brother protested that they didn't have the data, to which his boss replied: "I know what the numbers are."

Nowadays we call this sort of thing "imputation" and we like it. But not in the raw data! I bet nowadays they have an NA code.

## A.I. is Whatever We Can't Yet Automate

A common aphorism among artificial intelligence practitioners is that A.I. is whatever machines can't currently do.

Adam Gopnik, writing for the New Yorker, has a review called Get Smart in the most recent issue (4 April 2011). Ostensibly, the piece is a review of new books, one by Joshua Foer, Moonwalking with Einstein: The Art and Science of Remembering Everything, and one by Stephen Baker, Final Jeopardy: Man vs. Machine and the Quest to Know Everything (which would explain Baker's spate of Jeopardy!-related blog posts). But like many such pieces in highbrow magazines, the book reviews are just a cover for staking out a philosophical position. Gopnik does a typically New Yorker job in explaining the title of this blog post.

## Is it plausible that 1% of people pick a career based on their first name?

In my discussion of the dentists-named-Dennis study, I referred to my back-of-the-envelope calculation that the effect (if it indeed exists) corresponds to an approximate 1% aggregate chance that you'll pick a profession based on your first name. Even if there are nearly twice as many dentist Dennises as would be expected from chance alone, the base rate is so low that a shift of 1% of all Dennises would be enough to do this. My point was that (a) even a small effect could show up when looking at low-frequency events such as the choice to pick a particular career or live in a particular city, and (b) any small effects will inherently be difficult to detect in any direct way.

Uri Simonsohn (the author of the recent rebuttal of the original name-choice article by Brett Pelham et al.) wrote:

## Weather visualization with WeatherSpark

WeatherSpark: prediction and observation quantiles, historic data, multiple predictors, zoomable, draggable, colorful, wonderful:

Via Jure Cuhalev.

## Call for book proposals

Rob Calver writes:

## The scalarization of America

Mark Palko writes:

You lose information when you go from a vector to a scalar.

But what about this trick, which they told me about in high school? Combine two dimensions into one by interleaving the decimals. For example, if a=.11111 and b=.22222, then (a,b) = .1212121212.
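The interleaving trick can be sketched in Python. This is an illustration only: the function names are hypothetical, and the digits are handled as strings of equal, finite length so that floating-point rounding does not get in the way.

```python
def interleave_decimals(a_digits, b_digits):
    """Interleave the decimal digits of two numbers in [0, 1).

    Each argument is the string of digits after the decimal point,
    e.g. "11111" for .11111. Both strings must have the same length.
    """
    if len(a_digits) != len(b_digits):
        raise ValueError("digit strings must have the same length")
    return "".join(x + y for x, y in zip(a_digits, b_digits))

def split_decimals(c_digits):
    """Undo the interleaving: even positions belong to a, odd positions to b."""
    return c_digits[0::2], c_digits[1::2]

combined = interleave_decimals("11111", "22222")
print("." + combined)

recovered = split_decimals(combined)
print(recovered)
```

For digit strings like these the single combined number can be split back into the original pair, which is the sense in which the trick appears to dodge the loss of information.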

## A new idea for a science core course based entirely on computer simulation

Columbia College has for many years had a Core Curriculum, in which students read classics such as Plato (in translation) etc. A few years ago they created a Science core course. There was always some confusion about this idea: On one hand, how much would college freshmen really learn about science by reading the classic writings of Galileo, Laplace, Darwin, Einstein, etc.? And they certainly wouldn't get much out of puzzling over the latest issues of Nature, Cell, and Physical Review Letters. On the other hand, what's the point of having them read Dawkins, Gould, or even Brian Greene? These sorts of popularizations give you a sense of modern science (even to the extent of conveying some of the debates in these fields), but reading them might not give the same intellectual engagement that you'd get from wrestling with the Bible or Shakespeare.

I have a different idea. What about structuring the entire course around computer programming and simulation? Start with a few weeks teaching the students some programming language that can do simulation and graphics. (R is a little clunky and Matlab is not open-source. Maybe Python?)

After the warm-up, students can program simulations each week:
- Physics: simulation of bouncing billiard balls, atomic decay, etc.
- Chemistry: simulation of chemical reactions, cool graphs of the concentrations of different chemicals over time as the reaction proceeds
- Biology: evolution and natural selection
And so forth.
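To give a sense of scale for one such weekly exercise, here is a minimal sketch of the atomic-decay item, written in Python. Everything in it (the function name, the number of atoms, the per-step decay probability, the fixed seed) is a hypothetical choice for illustration, not part of any actual syllabus.

```python
import random

def simulate_decay(n_atoms, decay_prob, n_steps, seed=0):
    """Simulate radioactive decay in discrete time steps.

    At every step each surviving atom decays independently with
    probability decay_prob. Returns the number of surviving atoms
    at step 0, 1, ..., n_steps.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    remaining = n_atoms
    history = [remaining]
    for _step in range(n_steps):
        decayed = sum(1 for _atom in range(remaining) if rng.random() < decay_prob)
        remaining -= decayed
        history.append(remaining)
    return history

history = simulate_decay(n_atoms=1000, decay_prob=0.1, n_steps=20)
for step, count in enumerate(history):
    print(step, count)
```

Students could then plot the counts against time and compare the simulated curve with an exponential.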

There could be lecture material connecting these simulations with relevant scientific models. This could be great!

## Brain Structure and the Big Five

Many years ago, a research psychologist whose judgment I greatly respect told me that the characterization of personality by the so-called Big Five traits (extraversion, etc.) was old-fashioned. So I'm always surprised to see that the Big Five keeps cropping up. I guess not everyone agrees that it's a bad idea.

For example, Hamdan Azhar wrote to me:

## Age and happiness: The pattern isn't as clear as you might think

A couple people pointed me to this recent news article which discusses "why, beyond middle age, people get happier as they get older." Here's the story:

When people start out on adult life, they are, on average, pretty cheerful. Things go downhill from youth to middle age until they reach a nadir commonly known as the mid-life crisis. So far, so familiar. The surprising part happens after that. Although as people move towards old age they lose things they treasure--vitality, mental sharpness and looks--they also gain what people spend their lives pursuing: happiness.

This curious finding has emerged from a new branch of economics that seeks a more satisfactory measure than money of human well-being. Conventional economics uses money as a proxy for utility--the dismal way in which the discipline talks about happiness. But some economists, unconvinced that there is a direct relationship between money and well-being, have decided to go to the nub of the matter and measure happiness itself. . . There are already a lot of data on the subject collected by, for instance, America's General Social Survey, Eurobarometer and Gallup. . . .

And here's the killer graph:

All I can say is . . . it ain't so simple. I learned this the hard way. After reading a bunch of articles on the U-shaped relation between age and happiness--including some research that used the General Social Survey--I downloaded the GSS data (you can do it yourself!) and prepared some data for my introductory statistics class. I made a little dataset with happiness, age, sex, marital status, income, and a couple other variables and ran some regressions and made some simple graphs. The idea was to start with the fascinating U-shaped pattern and then discuss what could be learned further using some basic statistical techniques of subsetting and regression.

But I got stuck--really stuck. Here was my first graph, a quick summary of average happiness level (on a 0, 1, 2 scale; in total, 12% of respondents rated their happiness at 0 (the lowest level), 56% gave themselves a 1, and 32% described themselves as having the highest level on this three-point scale). And below are the raw averages of happiness vs. age. (Note: the graph has changed. In my original posted graph, I plotted the percentage of respondents of each age who had happiness levels of 1 or 2; this corrected graph plots average happiness levels.)

Uh-oh. I did this by single years of age so it's noisy--even when using decades of GSS, the sample's not infinite--but there's nothing like the famous U-shaped pattern! Sure, if you stare hard enough, you can see a U between ages 35 and 70, but the behavior from 20-35 and from 70-90 looks all wrong. There's a big difference between the published graph, which has maxima at 20 and 85, and my graph from GSS, which has minima at 20 and 85.

There are a lot of ways these graphs could be reconciled. There could be cohort or period effects, perhaps I should be controlling for other variables, maybe I'm using a bad question, or maybe I simply miscoded the data. All of these are possibilities. I spent several hours staring at the GSS codebook and playing with the data in different ways and couldn't recover the U. Sometimes I could get happiness to go up with age, but then it was just a gradual rise from age 18, without the dip around age 45 or 50. There's a lot going on here and I very well may still be missing something important. [Note: I imagine that sort of cagey disclaimer is typical of statisticians: by our training we are so aware of uncertainty. Researchers in other fields don't seem to feel the same need to do this.]

Anyway, at some point in this analysis I was getting frustrated at my inability to find the U (I felt like the characters in that old movie they used to show on TV on New Year's Eve, all looking for "the big W") and beginning to panic that this beautiful example was too fragile to survive in the classroom.

So I called Grazia Pittau, an economist (!) with whom I'd collaborated on some earlier happiness research (in which I contributed multilevel modeling and some ideas about graphs but not much of substance regarding psychology or economics). Grazia confirmed to me that the U-shaped pattern is indeed fragile, that you have to work hard to find it, and often it shows up when people fit linear and quadratic terms, in which case everything looks like a parabola. (I'd tried regressions with age & age-squared, but it took a lot of finagling to get the coefficient for age-squared to have the "correct" sign.)

And then I encountered a paper by Paul Frijters and Tony Beatton which directly addressed my confusion. Frijters and Beatton write:

Whilst the majority of psychologists have concluded there is not much of a relationship at all, the economic literature has unearthed a possible U-shape relationship. In this paper we [Frijters and Beatton] replicate the U-shape for the German SocioEconomic Panel (GSOEP), and we investigate several possible explanations for it.

They conclude that the U is fragile and that it arises from a sample-selection bias. I refer you to the above link for further discussion.

In summary: I agree that happiness and life satisfaction are worth studying--of course they're worth studying--but, in the midst of looking for explanations for that U-shaped pattern, it might be worth looking more carefully to see what exactly is happening. At the very least, the pattern does not seem to be as clear as implied from some media reports. (Even a glance at the paper by Stone, Schwartz, Broderick, and Deaton, which is the source of the top graph above, reveals a bunch of graphs, only some of which are U-shaped.) All those explanations have to be contingent on the pattern actually existing in the population.

My goal is not to debunk but to push toward some broader thinking. People are always trying to explain what's behind a stylized fact, which is fine, but sometimes they're explaining things that aren't really happening, just like those theoretical physicists who, shortly after the Fleischmann-Pons experiment, came up with ingenious models of cold fusion. These theorists were brilliant but they were doomed because they were modeling a phenomenon which (most likely) doesn't exist.

A comment from a few days ago by Eric Rasmusen seems relevant, connecting this to general issues of confirmation bias. If you make enough graphs and you're looking for a U, you'll find it. I'm not denying the U is there, I'm just questioning the centrality of the U to the larger story of age, happiness, and life satisfaction. There appear to be many different age patterns and it's not clear to me that the U should be considered the paradigm.

P.S. I think this research (even if occasionally done by economists) is psychology, not economics. No big deal--it's just a matter of terminology--but I think journalists and other outsiders can be misled if they hear about this sort of thing and start searching in the economics literature rather than in the psychology literature. In general, I think economists will have more to say than psychologists about prices, and psychologists will have more insights about emotions and happiness. I'm sure that economists can make important contributions to the study of happiness, just as psychologists can make important contributions to the study of prices, but even a magazine called "The Economist" should know the difference.

## Unlogging

Catherine Bueker writes:

I [Bueker] am analyzing the effect of various contextual factors on the voter turnout of naturalized Latino citizens. I have included the natural log of the number of Spanish Language ads run in each state during the election cycle to predict voter turnout. I now want to calculate the predicted probabilities of turnout for those in states with 0 ads, 500 ads, 1000 ads, etc. The problem is that I do not know how to handle the beta coefficient of the LN(Spanish language ads). Is there some way to "unlog" the coefficient?

My reply: Calculate these probabilities for specific values of predictors, then graph the predictions of interest. Also, you can average over the other inputs in your model to get summaries. See this article with Pardoe for further discussion.
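The advice above amounts to pushing specific ad counts through the fitted model instead of transforming the coefficient. Here is a minimal sketch in Python under several loudly stated assumptions: the model is taken to be a logistic regression (the question does not say), the predictor is taken to be log(1 + ads) because log(0) is undefined for states with no ads, and the intercept and coefficient are made-up numbers, not estimates from Bueker's data.

```python
import math

def inv_logit(x):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical values for illustration only (not from any fitted model).
# The intercept stands in for all other predictors held at some baseline.
INTERCEPT = -0.5
BETA_LOG_ADS = 0.12

def predicted_turnout(ads, intercept=INTERCEPT, beta=BETA_LOG_ADS):
    """Predicted probability of turnout at a given raw number of ads.

    Assumes the predictor entered the model as log(1 + ads).
    """
    return inv_logit(intercept + beta * math.log1p(ads))

ad_counts = [0, 500, 1000, 2000]
predictions = {ads: predicted_turnout(ads) for ads in ad_counts}

for ads, prob in predictions.items():
    print(f"{ads:5d} ads -> predicted turnout {prob:.3f}")
```

The coefficient itself is never "unlogged"; the log is applied to each ad count of interest and the resulting predictions are what get graphed. Averaging such predictions over the observed values of the other inputs, rather than fixing them at a baseline, would give the kind of summary mentioned in the reply.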

## Science, ideology, and human origins

A link from Tyler Cowen led me to this long blog article by Razib Khan, discussing some recent genetic findings on human origins in the context of the past twenty-five years of research and popularization of science.

## Why a bonobo won't play poker with you

Sciencedaily has posted an article titled Apes Unwilling to Gamble When Odds Are Uncertain:

The apes readily distinguished between the different probabilities of winning: they gambled a lot when there was a 100 percent chance, less when there was a 50 percent chance, and only rarely when there was no chance. In some trials, however, the experimenter didn't remove a lid from the bowl, so the apes couldn't assess the likelihood of winning a banana. The odds from the covered bowl were identical to those from the risky option: a 50 percent chance of getting the much sought-after banana. But apes of both species were less likely to choose this ambiguous option.
Like humans, they showed "ambiguity aversion" -- preferring to gamble more when they knew the odds than when they didn't. Given some of the other differences between chimps and bonobos, Hare and Rosati had expected to find the bonobos to be more averse to ambiguity, but that didn't turn out to be the case.

Thanks to Stan Salthe for the link.

## Some ideas on communicating risks to the general public

Aleks points me to this research summary from Dan Goldstein. Good stuff. I've heard of a lot of this--I actually use some of it in my intro statistics course, when we show the students how they can express probability trees using frequencies--but it's good to see it all in one place.

## Whassup with phantom-limb treatment?

OK, here's something that is completely baffling me. I read this article by John Colapinto on the neuroscientist V. S. Ramachandran, who's famous for his innovative treatment for "phantom limb" pain:

His first subject was a young man who a decade earlier had crashed his motorcycle and torn from his spinal column the nerves supplying the left arm. After keeping the useless arm in a sling for a year, the man had the arm amputated above the elbow. Ever since, he had felt unremitting cramping in the phantom limb, as though it were immobilized in an awkward position. . . . Ramachandran positioned a twenty-inch-by-twenty-inch drugstore mirror . . . and told him to place his intact right arm on one side of the mirror and his stump on the other. He told the man to arrange the mirror so that the reflection created the illusion that his intact arm was the continuation of the amputated one. Then Ramachandran asked the man to move his right and left arms . . . "Oh, my God!" the man began to shout. . . . For the first time in ten years, the patient could feel his phantom limb "moving," and the cramping pain was instantly relieved. After the man had used the mirror therapy ten minutes a day for a month, his phantom limb shrank . . .

Ramachandran conducted the experiment on eight other amputees and published the results in Nature, in 1995. In all but one patient, phantom hands that had been balled into painful fists opened, and phantom arms that had stiffened into agonizing contortions straightened. . . .

So far, so good. But then the story continues:

Dr. Jack Tsao, a neurologist for the U.S. Navy . . . read Ramachandran's Nature paper on mirror therapy for phantom-limb pain. . . . Several years later, in 2004, Tsao began working at Walter Reed Military Hospital, where he saw hundreds of soldiers with amputations returning from Iraq and Afghanistan. Ninety percent of them had phantom-limb pain, and Tsao, noting that the painkillers routinely prescribed for the condition were ineffective, suggested mirror therapy. "We had a lot of skepticism from the people at the hospital, my colleagues as well as the amputee subjects themselves," Tsao said. But in a clinical trial of eighteen service members with lower-limb amputations . . . the six who used the mirror reported that their pain decreased [with no corresponding improvement in the control groups] . . . Tsao published his results in the New England Journal of Medicine, in 2007. "The people who really got completely pain-free remain so, two years later," said Tsao, who is currently conducting a study involving mirror therapy on upper-limb amputees at Walter Reed.

At first, this sounded perfectly reasonable: Bold new treatment is dismissed by skeptics but then is proved to be a winner in a clinical trial. But . . . wait a minute! I have some questions:

1. Ramachandran published his definitive paper in 1995 in a widely-circulated journal. Why did his mirror therapy not become the standard approach, especially given that "the painkillers routinely prescribed for the condition were ineffective"? Why were these ineffective painkillers "routinely prescribed" at all?

2. When Tsao finally got around to trying a therapy that had been published nine years before, why did they have "a lot of skepticism from the people at the hospital"?

3. If Tsao saw "hundreds of soldiers" with phantom-limb pain, why did he try the already-published mirror therapy on only 18 of them?

4. How come, in 2009, two years after his paper in the New England Journal of Medicine--and fourteen years after Ramachandran's original paper in Nature--even now, Tsao is "currently conducting a study involving mirror therapy"? Why isn't he doing mirror therapy on everybody?

Ok, maybe I have the answer to the last question: Maybe Tsao's current (as of 2009) study is of different variants of mirror therapy. That is, maybe he is doing it on everybody, just in different ways. That would make sense.

But I don't understand items 1,2,3 above at all. There must be some part of the story that I'm missing. Perhaps someone could explain?

P.S. More here.

## Reinventing the wheel, only more so.

Posted by Phil Price:

A blogger (can't find his name anywhere on his blog) points to an article in the medical literature in 1994 that is...well, it's shocking, is what it is. This is from the abstract:

In Tai's Model, the total area under a curve is computed by dividing the area under the curve between two designated values on the X-axis (abscissas) into small segments (rectangles and triangles) whose areas can be accurately calculated from their respective geometrical formulas. The total sum of these individual areas thus represents the total area under the curve. Validity of the model is established by comparing total areas obtained from this model to these same areas obtained from graphic method (less than +/- 0.4%). Other formulas widely applied by researchers under- or overestimated total area under a metabolic curve by a great margin

Yes, that's right, this guy has rediscovered the trapezoidal rule. You know, that thing most readers of this blog were taught back in 11th or 12th grade, and all med students were taught by freshman year in college.
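For reference, the trapezoidal rule fits in a few lines. The sketch below is in Python; the function name and the sample points are hypothetical and are not taken from the paper under discussion.

```python
def trapezoid_area(xs, ys):
    """Area under a curve through the points (xs[i], ys[i]) by the trapezoidal rule.

    Each pair of neighboring points defines a trapezoid whose area is
    width * (average of the two heights). xs must be in increasing order.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two points, with matching xs and ys")
    total = 0.0
    for i in range(1, len(xs)):
        width = xs[i] - xs[i - 1]
        total += width * (ys[i] + ys[i - 1]) / 2
    return total

# Illustrative points on y = x^2 over [0, 3]; the exact integral is 9.
xs = [0, 1, 2, 3]
ys = [x * x for x in xs]
area = trapezoid_area(xs, ys)
print(area)
```

With only four points on a convex curve the rule overshoots the exact area a little; adding more points shrinks the error.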

The blogger finds this amusing, but I find it mostly upsetting and sad. Which is sadder: (1) that this paper got past the referees, (2) that it has been cited dozens of times in the medical literature, including this year, or (3) that, if the abstract is to be believed, many medical researchers DON'T use an accurate method to calculate the area under a curve?

Things get reinvented all the time. I, too, have published results that I've later found were previously published by someone else. But I've never done it with something that is taught in high school calculus. And --- I'm practically spluttering with indignation --- if I wanted to calculate something like the area under a curve, I would at least first see if there is already a known way to do it! I wouldn't invent an obvious method, name it after myself, and send it to a journal, without it ever occurring to me that, gee, maybe someone else has thought about this already! Grrrrrr.

## Neumann update

Steve Hsu, who started off this discussion, had some comments on my speculations on the personality of John von Neumann and others. Steve writes:

I [Hsu] actually knew Feynman a bit when I was an undergrad, and found him to be very nice to students. Since then I have heard quite a few stories from people in theoretical physics which emphasize his nastier side, and I think in the end he was quite a complicated person like everyone else.

There are a couple of pseudo-biographies of vN, but none as high quality as, e.g., Gleick's book on Feynman or Hodges's book about Turing. (Gleick studied physics as an undergrad at Harvard, and Hodges is a PhD in mathematical physics -- pretty rare backgrounds for biographers!) For example, as mentioned on the comment thread to your post, Steve Heims wrote a book about both vN and Wiener (!), and Norman Macrae wrote a biography of vN. Both books are worth reading, but I think neither really does him justice. The breadth of vN's work is just too much for any one person to absorb, ranging from pure math to foundations of QM, to shock wave theory (important for nuclear weapons), to game theory, to computation.

I read the biography of Gell-Mann that came out several years ago, and it made me feel sad for the guy. In particular, I'm thinking about the bit where, after Feynman hit the bestseller list, Gell-Mann got a big book contract himself, but then he got completely blocked and couldn't figure out what to put in the book (which eventually became the unreadable but respectfully-reviewed The Quark and the Jaguar).

I'm still interested in the von Neumann paradox, but given what's been written in the comment thread so far, I'm at this point doubting that it will ever be resolved to my satisfaction. If only I could bring Ulam back to life and ask him a few questions, I'm sure he could explain. Ulam definitely seems like my kind of guy.

## One way that psychology research is different than medical research

Medical researchers care about main effects, psychologists care about interactions. In psychology, the main effects are typically obvious, and it's only the interactions that are worth studying.

Like Steve Hsu, I too would love to read a definitive biography of John von Neumann (or, as we'd say in the U.S., "John Neumann"). I've read little things about him in various places such as Stanislaw Ulam's classic autobiography, and two things I've repeatedly noticed are:

1. Neumann comes off as an obnoxious, self-satisfied jerk. He just seems like the kind of guy I wouldn't like in real life.

2. All these great men seem to really have loved the guy.

It's hard for me to reconcile the two impressions above. Of course, lots of people have a good side and a bad side, but what's striking here is that my impressions of Neumann's bad side come from the very stories that his friends use to demonstrate how lovable he was! So, yes, I'd like to see the biography--but only if it could resolve this paradox.

Also, I don't know how relevant this is, but Neumann shares one thing with the more-lovable Ulam and the less-lovable Mandelbrot: all had Jewish backgrounds but didn't seem to like to talk about it.

P.S. Just to calibrate, here are my impressions of some other famous twentieth-century physicists. In all cases this is based on my shallow reading, not from any firsthand or even secondhand contact:

Feynman: Another guy who seemed pretty unlikable. Phil and I use the term "Feynman story" for any anecdote that someone tells that is structured so that the teller comes off as a genius and everyone else in the story comes off as an idiot. Again, lots of people, from Ulam to Freeman Dyson on down, seemed to think Feynman was a great guy. But I think it's pretty clear that a lot of other people didn't think he was so great. So Feynman seems like a standard case of a guy who was nice to some people and a jerk to others.

Einstein: Everyone seems to describe him as pretty remote, perhaps outside the whole "nice guy / jerk" spectrum entirely.

Gell-Mann (or, as we'd say in the U.S., "Gelman"): Nobody seemed to like him so much. He doesn't actually come off as a bad guy in any way, just someone who, for whatever reason, isn't so lovable.

Fermi, Bohr, Bethe: In contrast, everyone seemed to love these guys.

Hawking: What can you say about a guy with this kind of disability?

Oppenheimer: A tragic figure etc etc. I don't think anyone called him likable.

Teller: Even less likable, apparently.

That's about it. (Sorry, I'm not very well-read when it comes to physics gossip. I don't know, for example, if any Nobel-Prize-winning physicists have tried to run down any of their colleagues in a parking lot.)

Paul Erdos is another one: He always seems to be described as charmingly eccentric, but from all the descriptions I've read, he sounds just horrible! Perhaps the key is to come into these interactions with appropriate expectations, then everything will be OK.

Maybe Michael Frayn would have some insight into this . . . not that I have any way of reaching him!

## Is parenting a form of addiction?

The last time we encountered Slate columnist Shankar Vedantam was when he puzzled over why slightly more than half of voters planned to vote for Republican candidates, given that polls show that Americans dislike the Republican Party even more than they dislike the Democrats. Vedantam attributed the new Republican majority to irrationality and "unconscious bias." But, actually, this voting behavior is perfectly consistent with there being some moderate voters who prefer divided government. The simple, direct explanation (which Vedantam mistakenly dismisses) actually works fine.

I was flipping through Slate today and noticed a new article by Vedantam headlined, "If parenthood sucks, why do we love it? Because we're addicted." I don't like this one either.

## Society for Industrial and Applied Mathematics startup-math meetup

Chris Wiggins sends along this.

It's a meetup at Davis Auditorium, CEPSR Bldg, Columbia University, on Wed 10 Nov (that's tomorrow! or maybe today! depending on when you're reading this), 6-8pm.

## "I was finding the test so irritating and boring that I just started to click through as fast as I could"

In this article, Oliver Sacks talks about his extreme difficulty in recognizing people (even close friends) and places (even extremely familiar locations such as his apartment and his office).

After reading this, I started to wonder if I have a very mild case of face-blindness. I'm very good at recognizing places, but I'm not good at faces. And I can't really visualize faces at all. Like Sacks and some of his correspondents, I often have to do it by cheating, by recognizing certain landmarks that I can remember, thus coding the face linguistically rather than visually. (On the other hand, when thinking about mathematics or statistics, I'm very visual, as readers of this blog can attest.)

## This is a link to a news article about a scientific paper

Somebody I know sent me a link to this news article by Martin Robbins describing a potential scientific breakthrough. I express some skepticism but in a vague enough way that, in the unlikely event that the research claim turns out to be correct, there's no paper trail showing that I was wrong. I have some comments on the graphs--the tables are horrible, no need to even discuss them!--and I'd prefer if the authors of the paper could display their data and model on a single graph. I realize that their results reached a standard level of statistical significance, but it's hard for me to interpret their claims until I see their estimates on some sort of direct real-world scale. In any case, though, I'm sure these researchers are working hard, and I wish them the best of luck in their future efforts to replicate their findings.

I'm sure they'll have no problem replicating, whether or not their claims are actually true. That's the way science works: Once you know what you're looking for, you'll find it!

## I can't escape it

Ms. No.: ***

Title: ***

Corresponding Author: ***

All Authors: ***

Dear Dr. Gelman,

Because of your expertise, I would like to ask your assistance in determining whether the above-mentioned manuscript is appropriate for publication in ***. The abstract is pasted below. . . .

I would rather not review this article. I suggest ***, ***, and *** as reviewers.

I think it would be difficult for me to review the manuscript fairly.

## Fighting Migraine with Multilevel Modeling

Hal Pashler writes:

Ed Vul and I are working on something that, although less exciting than the struggle against voodoo correlations in fMRI :-) might interest you and your readers. The background is this: we have been struck for a long time by how many people get frustrated and confused trying to figure out whether something they are doing/eating/etc is triggering something bad, whether it be migraine headaches, children's tantrums, arthritis pains, or whatever. It seems crazy to try to do such computations in one's head--and the psychological literature suggests people must be pretty bad at this kind of thing--but what's the alternative? We are trying to develop one alternative approach--starting with migraine as a pilot project.

We created a website that migraine sufferers can sign up for. The users select a list of factors that they think might be triggering their headaches (eg drinking red wine, eating stinky cheese, etc.--the website suggests a big list of candidates drawn from the migraine literature). Then, every day the user is queried about how much they were exposed to each of these potential triggers that day, as well as whether they had a headache. After some months, the site begins to analyze the user's data to try to figure out which of these triggers--if any--are actually causing headaches.

Our approach uses multilevel logistic regression as in Gelman and Hill, and/or Gelman and Little (1997), and we use parametric bootstrapping to obtain posterior predictive confidence intervals to provide practical advice (rather than just ascertain the significance of effects). At the start the population-level hyperparameters on individual betas start off uninformative (uniform), but as we get data from an adequate number of users (we're not there quite yet), we will be able to pool information across users to provide appropriate population-level priors on the regression coefficients for each possible trigger factor for each person. The approach is outlined in this FAQ item.

Looks cool to me.

## The China Study: fact or fallacy?

Alex Chernavsky writes:

I recently came across an interesting blog post, written by someone who is self-taught in statistics (not that there's anything wrong with that).

I have no particular expertise in statistics, but her analysis looks impressive to me. I'd be very interested to find out the opinion of a professional statistician. Do you have any interest in blogging about this subject?

My (disappointing, I'm sure) reply: This indeed looks interesting. I don't have the time/energy to look at it more right now, and it's too far from any areas of my expertise for me to give any kind of quick informed opinion. It would be good for this sort of discussion to appear in a nutrition journal where the real experts could get at it. I expect there are some strong statisticians who work in that field, although I don't really know for sure.

P.S. I suppose I really should try to learn more about this sort of thing, as it could well affect my life more than a lot of other subjects (from sports to sex ratios) that I've studied in more depth.

## Ratios where the numerator and denominator both change signs

A couple years ago, I used a question by Benjamin Kay as an excuse to write that it's usually a bad idea to study a ratio whose denominator has uncertain sign. As I wrote then:

Similar problems arise with marginal cost-benefit ratios, LD50 in logistic regression (see chapter 3 of Bayesian Data Analysis for an example), instrumental variables, and the Fieller-Creasy problem in theoretical statistics. . . . In general, the story is that the ratio completely changes in interpretation when the denominator changes sign.

More recently, Kay sent in a related question:

## Climate Change News

I. State of the Climate report
The National Oceanic and Atmospheric Administration recently released their "State of the Climate Report" for 2009. The report has chapters discussing global climate (temperatures, water vapor, cloudiness, alpine glaciers,...); oceans (ocean heat content, sea level, sea surface temperatures, etc.); the arctic (sea ice extent, permafrost, vegetation, and so on); Antarctica (weather observations, sea ice extent,...), and regional climates.

NOAA also provides a nice page that lets you display any of 11 relevant time-series datasets (land-surface air temperature, sea level, ocean heat content, September arctic sea-ice extent, sea-surface temperature, northern hemisphere snow cover, specific humidity, glacier mass balance, marine air temperature, tropospheric temperature, and stratospheric temperature). Each of the plots overlays data from several databases (not necessarily independent of each other), and you can select which ones to include or leave out.

News flash: the earth's atmosphere and oceans are warming rapidly.

By the way, note that one of the temperature series -- Stratospheric (high-altitude) temperature -- is declining rather than increasing. That's to be expected since the stratosphere is getting less heat from below than it used to: more of the heat coming from the earth is absorbed by the CO2 in the lower atmosphere.

II. 35th Anniversary of a major global warming prediction
Another recent news item is the "celebration" of the 35th anniversary of the very brief article, in the journal Science, "Climatic Change: Are We on the Brink of a Pronounced Global Warming?", by Wallace Broecker. When the paper was published (1975) the global mean temperature was only about 0.2 C higher than it had been in 1900, and the trend was downward rather than upward. Broecker correctly predicted that the downward trend would end soon, and that the ensuing warming would "by the year 2000 bring average global temperatures beyond the range experienced in the last 1000 years." He got that right, or at least, the highly uncertain temperature data from 1000 years ago are consistent with Broecker having gotten that right. If he was wrong, it was only by a decade or so.

III. Not really news, but since we're here...
Speaking of global temperatures 1000 years ago, one thing anthropogenic climate change skeptics like to point out is that wine was produced in England in the year 1000, and the Norse on Greenland were able to graze cattle and produce crops. True! It's also true that you can visit vineyards in England today, and if you're in Greenland, don't forget to try the local mutton or beef.

IV. No climate bill again this year
Meanwhile, Congress has dropped efforts to reduce greenhouse gas emissions.

## Information is good

Washington Post and Slate reporter Anne Applebaum wrote a dismissive column about Wikileaks, saying that they "offer nothing more than raw data."

Applebaum argues that "The notion that the Internet can replace traditional news-gathering has just been revealed to be a myth. . . . without more journalism, more investigation, more work, these documents just don't matter that much."

Fine. But don't undervalue the role of mere data! The usual story is that we don't get to see the raw data underlying newspaper stories. Wikileaks and other crowdsourced data can be extremely useful, whether or not they replace "traditional news-gathering."

## Why don't we have peer reviewing for oral presentations?

Panos Ipeirotis writes in his blog post:

Everyone who has attended a conference knows that the quality of the talks is very uneven. There are talks that are highly engaging, entertaining, and describe nicely the research challenges and solutions. And there are talks that are a waste of time. Either the presenter cannot present clearly, or the presented content is impossible to digest within the time frame of the presentation.

We already have reviewing for the written part. The program committee examines the quality of the written paper and vouches for its technical content. However, by looking at a paper it is impossible to know how nicely it can be presented. Perhaps the seemingly solid but boring paper can be a very entertaining presentation. Or an excellent paper may be written by a horrible presenter.

Why not have a second round of reviewing, where the authors of accepted papers submit their presentations (slides and a YouTube video) for presentation to the conference? The paper will be accepted and be included in the proceedings anyway, but having a paper does not mean that the author gets a slot for an oral presentation.

Under an oral presentation peer review, a committee looks at the presentation, votes on accept/reject and potentially provides feedback to the presenter. The best presentations get a slot on the conference program.

While I've enjoyed quiet time for meditation during boring talks, this is a very interesting idea - cost permitting. As the cost of producing a paper and a presentation to pass peer review goes into weeks, a lot of super-interesting early-stage research just moves off the radar.

## The Three Golden Rules for Successful Scientific Research

A famous computer scientist, Edsger W. Dijkstra, was writing short memos on a daily basis for most of his life. His memo archive contains a little over 1300 memos. I guess today he would be writing a blog, although his memos do tend to be slightly more profound than what I post.

Here are the rules (follow link for commentary), which I tried to summarize:

• Pursue quality and challenge, avoid routine. ("Raise your quality standards as high as you can live with, avoid wasting your time on routine problems, and always try to work as closely as possible at the boundary of your abilities. Do this, because it is the only way of discovering how that boundary should be moved forward.")

• When pursuing social relevance, never compromise on scientific soundness. ("We all like our work to be socially relevant and scientifically sound. If we can find a topic satisfying both desires, we are lucky; if the two targets are in conflict with each other, let the requirement of scientific soundness prevail.")

• Solve the problems nobody can solve better than you. ("Never tackle a problem of which you can be pretty sure that (now or in the near future) it will be tackled by others who are, in relation to that problem, at least as competent and well-equipped as you.")

[D+1: Changed "has been" into "was" - the majority of commenters decided Dijkstra is better treated as a dead person who was, rather than an immortal who "has been", is, and will be.]

## Burglars are local

This makes sense:

In the land of fiction, it's the criminal's modus operandi - his method of entry, his taste for certain jewellery and so forth - that can be used by detectives to identify his handiwork. The reality according to a new analysis of solved burglaries in the Northamptonshire region of England is that these aspects of criminal behaviour are on their own unreliable as identifying markers, most likely because they are dictated by circumstances rather than the criminal's taste and style. However, the geographical spread and timing of a burglar's crimes are distinctive, and could help with police investigations.

And, as a bonus, more Tourette's pride!

P.S. On yet another unrelated topic from the same blog, I wonder if the researchers in this study are aware that the difference between "significant" and "not significant" is not itself statistically significant.

StackOverflow has been a popular community where software developers help one another. Recently they raised some VC funding, and to make profits they are selling job postings and expanding the model to other areas. Metaoptimize LLC has started a similar website, using the open-source OSQA framework, for areas such as statistics and machine learning. Here's a description:

You and other data geeks can ask and answer questions on machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization.
Here you can ask and answer questions, comment and vote for the questions of others and their answers. Both questions and answers can be revised and improved. Questions can be tagged with the relevant keywords to simplify future access and organize the accumulated material.

If you work very hard on your questions and answers, you will receive badges like "Guru", "Student" or "Good answer". Just like a computer game! In return, well-meaning question answerers will be helping feed Google and numerous other companies with good information they will offer the public along with sponsored information that someone is paying for.

I'll join the party myself when they introduce the "Rent," "Mortgage Payment," "Medical Bill", and "Grocery" badges. Until then, I'll be spending time and money, and someone else will be saving time and earning money. For a real community, there has to be some basic fairness.

[9:15pm: Included Ryan Shaw's correction to my post, pointing out that MetaOptimize is based on OSQA and not on the StackOverflow platform.]
[D+1, 7:30am: Igor Carron points to an initiative that's actually based on the StackOverflow platform.]

## Scientists can read your mind . . . as long as they're allowed to look at more than one place in your brain and then make a prediction after seeing what you actually did

Maggie Fox writes:

Brain scans may be able to predict what you will do better than you can yourself . . . They found a way to interpret "real time" brain images to show whether people who viewed messages about using sunscreen would actually use sunscreen during the following week.

The scans were more accurate than the volunteers were, Emily Falk and colleagues at the University of California Los Angeles reported in the Journal of Neuroscience. . . .

About half the volunteers had correctly predicted whether they would use sunscreen. The research team analyzed and re-analyzed the MRI scans to see if they could find any brain activity that would do better.

Activity in one area of the brain, a particular part of the medial prefrontal cortex, provided the best information.

"From this region of the brain, we can predict for about three-quarters of the people whether they will increase their use of sunscreen beyond what they say they will do," Lieberman said.

"It is the one region of the prefrontal cortex that we know is disproportionately larger in humans than in other primates," he added. "This region is associated with self-awareness, and seems to be critical for thinking about yourself and thinking about your preferences and values."

Hmm . . . they "analyzed and re-analyzed the scans to see if they could find any brain activity" that would predict better than 50%?! This doesn't sound so promising. But maybe the reporter messed up on the details . . .

I took advantage of my library subscription to take a look at the article, "Predicting Persuasion-Induced Behavior Change from the Brain," by Emily Falk, Elliot Berkman, Traci Mann, Brittany Harrison, and Matthew Lieberman. Here's what they say:

- "Regions of interest were constructed based on coordinates reported by Soon et al. (2008) in MPFC and precuneus, regions that also appeared in a study of persuasive messaging." OK, so they picked two regions of interest ahead of time. They didn't just search for "any brain activity." I'll take their word for it that they just looked at these two, that they didn't actually look at 50 regions and then say they reported just two.

- Their main result had a t-statistic of 2.3 (on 18 degrees of freedom, thus statistically significant at the 3% level) in one of the two regions they looked at, and a t-statistic of 1.5 (not statistically significant) in the other. A simple multiple-comparisons correction takes the p-value of 0.03 and bounces it up to an over-the-threshold 0.06, which I think would make the result unpublishable! On the other hand, a simple average gives a healthy t-statistic of (1.5+2.3)/sqrt(2) = 2.7, although that ignores any possible correlation between the two regions (they don't seem to supply that information in their article).

- They also do a cross-validation but this seems 100% pointless to me since they do the cross-validation on the region that already "won" on the full data analysis. For the cross-validation to mean anything at all, they'd have to use the separate winner on each of the cross-validatory fits.

- As an outcome, they use before-after change. They should really control for the "before" measurement as a regression predictor. That's a freebie. And, when you're operating at a 6% significance level, you should take any freebie that you can get! (It's possible that they tried adjusting for the "before" measurement and it didn't work, but I assume they didn't do that, since I didn't see any report of such an analysis in the article.)
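The arithmetic in the second bullet is easy to check. Here is a rough sketch in Python using only the standard library. The helper names are my own, the two t-statistics are the ones quoted above, and the "simple average" treats them as independent, unit-variance statistics, which is an assumption rather than something reported in the article.

```python
import math


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value, integrating the density over [0, |t|] with Simpson's rule."""
    t = abs(t)
    h = t / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    half_area = total * h / 3  # P(0 < T < |t|)
    return 1 - 2 * half_area


DF = 18
T_MPFC, T_PRECUNEUS = 2.3, 1.5

p_mpfc = two_sided_p(T_MPFC, DF)
p_precuneus = two_sided_p(T_PRECUNEUS, DF)

# Bonferroni correction for having looked at two regions.
p_bonferroni = min(1.0, 2 * p_mpfc)

# Sum of the two statistics, rescaled by sqrt(2) under the independence assumption.
t_combined = (T_PRECUNEUS + T_MPFC) / math.sqrt(2)

print(round(p_mpfc, 3), round(p_precuneus, 3), round(p_bonferroni, 3), round(t_combined, 2))
```

If the two regions are positively correlated, the combined statistic would be smaller than this, which is why the missing correlation matters.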

The bottom line

I'm not saying that the reported findings are wrong, I'm just saying that they're not necessarily statistically significant in the usual way this term is used. I think that, in the future, such work would be improved by more strongly linking the statistical analysis to the psychological theories. Rather than simply picking two regions to look at, then taking the winner in a study of n=20 people, and going from there to the theories, perhaps they could more directly model what they're expecting to see.

The difference between . . .

Also, the difference between "significant" and "not significant" is not itself statistically significant. How is this relevant in the present study? They looked at two regions, MPFC and precuneus. Both showed positive correlations, one with a t-value of 2.3, one with a t-value of 1.5. The first of these is statistically significant (well, it is, if you ignore that it's the maximum of two values), the second is not. But the difference is not anything close to statistically significant, not at all! So why such a heavy emphasis on the winner and such a neglect of #2?

Here's the count from a simple document search:

MPFC: 20 instances (including 2 in the abstract)
precuneus: 8 instances (0 in the abstract)
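To put a rough number on "not anything close": a back-of-the-envelope sketch, assuming the two estimates are independent with equal standard errors and using a normal approximation (none of which is reported in the article), compares the two t-values directly.

```python
import math

T_MPFC, T_PRECUNEUS = 2.3, 1.5

# Difference of two independent unit-variance statistics has standard deviation sqrt(2).
z_diff = (T_MPFC - T_PRECUNEUS) / math.sqrt(2)

# Two-sided p-value under the normal approximation.
p_diff = math.erfc(abs(z_diff) / math.sqrt(2))

print(round(z_diff, 2), round(p_diff, 2))
```

A z-score of about 0.6 is well within what you'd expect from noise alone.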

P.S. The "picked just two regions" bit gives a sense of why I prefer Bayesian inference to classical hypothesis testing. The right thing, I think, is actually to look at all 50 regions (or 100, or however many regions there are) and do an analysis including all of them. Not simply picking the region that is most strongly correlated with the outcome and then doing a correction (that's not the most statistically efficient thing to do; you're just asking, begging, to be overwhelmed by noise) but rather using the prior information about regions in a subtler way than simply picking out 2 and ignoring the other 48. For example, you could have a region-level predictor which represents prior belief in the region's importance. Or you could group the regions into a few pre-chosen categories and then estimate a hierarchical model with each group of regions being its own batch with group-level mean and standard deviation estimated from data. The point is, you have information you want to use--prior knowledge from the literature--without it unduly restricting the possibilities for discovery in your data analysis.

Near the end, they write:

In addition, we observed increased activity in regions involved in memory encoding, attention, visual imagery, motor execution and imitation, and affective experience with increased behavior change.

These were not pre-chosen regions, which is fine, but at this point I'd like to see the histogram of correlations for all the regions, along with a hierarchical model that allows appropriate shrinkage. Or even a simple comparison to the distribution of correlations one might expect to see by chance. By suggesting this, I'm not trying to imply that all the findings in this paper are due to chance; rather, I'm trying to use statistical methods to subtract out the chance variation as much as possible.
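Here is a minimal sketch of the "simple comparison to chance" idea. It simulates pure-noise data for 20 subjects and a hypothetical 50 regions (the region count is my placeholder, not a number from the paper) and records the largest absolute correlation with the outcome in each simulated study.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def max_abs_null_correlation(n_subjects, n_regions, rng):
    """Largest |r| between a noise outcome and n_regions noise predictors."""
    outcome = [rng.gauss(0, 1) for _ in range(n_subjects)]
    best = 0.0
    for _ in range(n_regions):
        activity = [rng.gauss(0, 1) for _ in range(n_subjects)]
        best = max(best, abs(pearson_r(activity, outcome)))
    return best


rng = random.Random(1)
N_SUBJECTS, N_REGIONS, N_SIMS = 20, 50, 200

# Correlation that corresponds to the two-sided 5% critical t-value on 18 df.
T_CRIT = 2.101
R_CRIT = T_CRIT / math.sqrt(T_CRIT ** 2 + (N_SUBJECTS - 2))

maxima = [max_abs_null_correlation(N_SUBJECTS, N_REGIONS, rng) for _ in range(N_SIMS)]
mean_max = sum(maxima) / N_SIMS
share_significant = sum(m > R_CRIT for m in maxima) / N_SIMS

print(round(R_CRIT, 3), round(mean_max, 3), round(share_significant, 3))
```

With this many regions and no real signal at all, the best-looking region clears the usual significance threshold in most simulated studies, which is the baseline any observed histogram of correlations should be compared against.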

P.P.S. Just to say this one more time: I'm not at all trying to claim that the researchers are wrong. Even if they haven't proven anything in a convincing way, I'll take their word for it that their hypothesis makes scientific sense. And, as they point out, their data are definitely consistent with their hypotheses.

P.P.P.S. For those who haven't been following these issues, see here, here, here, and here.

## Imputing count data

I am analyzing an original survey of farmers in Uganda. I am hoping to use a battery of welfare proxy variables to create a single welfare index using PCA. I have quick question which I hope you can find time to address:

How do you recommend treating count data? (for example # of rooms, # of chickens, # of cows, # of radios)? In my dataset these variables are highly skewed with many responses at zero (which makes taking the natural log problematic). In the case of # of cows or chickens several obs have values in the hundreds.

My response: Here's what we do in our mi package in R. We split a variable into two parts: an indicator for whether it is positive, and the positive part. That is, y = u*v. Then u is binary and can be modeled using logistic regression, and v can be modeled on the log scale. At the end you can round to the nearest integer if you want to avoid fractional values.
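As a rough illustration of the y = u*v split (in Python, and not the mi package's actual code), here is a sketch with no predictors at all: the probability of a positive value stands in for the logistic regression, and a normal distribution on the log scale stands in for the model of the positive part. The function names and the chicken counts are invented for the example.

```python
import math
import random
import statistics


def fit_two_part(observed):
    """Fit the two pieces: P(y > 0), and mean/sd of log(y) among positive values."""
    positives = [y for y in observed if y > 0]
    p_pos = len(positives) / len(observed)
    logs = [math.log(y) for y in positives]
    return p_pos, statistics.mean(logs), statistics.stdev(logs)


def impute_count(p_pos, mu, sigma, rng):
    """Draw one imputed count as u * v, with v rounded to an integer."""
    u = 1 if rng.random() < p_pos else 0
    v = math.exp(rng.gauss(mu, sigma))
    # Rounding choice made here (an assumption): keep the positive part at least 1,
    # so that u = 1 always yields a positive count.
    return u * max(1, round(v))


# Hypothetical observed counts of chickens, skewed with many zeros.
chickens = [0, 0, 0, 2, 5, 0, 12, 0, 3, 150, 0, 8]
p_pos, mu, sigma = fit_two_part(chickens)

rng = random.Random(0)
draws = [impute_count(p_pos, mu, sigma, rng) for _ in range(1000)]
print(p_pos, round(mu, 2), round(sigma, 2), draws[:10])
```

In a real analysis both pieces would be regressions on the other survey variables; the intercept-only version just shows how zeros and the skewed positive values are handled separately.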

## Auto-Gladwell, or Can fractals be used to predict human history?

I just reviewed the book Bursts, by Albert-László Barabási, for Physics Today. But I had a lot more to say that couldn't fit into the magazine's 800-word limit. Here I'll reproduce what I sent to Physics Today, followed by my additional thoughts.

The back cover of Bursts promises "a revolutionary new theory showing how we can predict human behavior." I wasn't fully convinced on that score, but the book does offer a well-written and thought-provoking window into author Albert-László Barabási's research in power laws and network theory.

Power laws--the mathematical pattern that little things are common and large things are rare--have been observed in many different domains, including incomes (as noted by economist Vilfredo Pareto in the nineteenth century), word frequencies (as noted by linguist George Zipf), city sizes, earthquakes, and virtually anything else that can be measured. In the mid-twentieth century, the mathematician Benoit Mandelbrot devoted an influential career to the study of self-similarity, deriving power laws for phenomena ranging from taxonomies (the distribution of the lengths of index entries) to geographical measurements. (I was surprised to encounter neither Zipf, Mandelbrot, nor Herbert Simon in the present book, but perhaps an excessive discussion of sources would have impeded the book's narrative flow.)

Mandelbrot made a convincing case that nature is best described, not by triangles, squares, and circles, but by fractals--patterns that reveal increasing complexity when they are studied at finer levels. The shapes familiar to us from high-school geometry lose all interest when studied close up, whereas fractals--and real-life objects such as mountains, trees, and even galaxies--are full of structure at many different levels. (Recall the movie Powers of Ten, which I assume nearly all readers of this magazine have seen at least once in their lives.)

A similar distinction between regularity and fractality holds in the social world, with designed structures such as bus schedules having a smooth order, and actual distributions of bus waiting times (say) having a complex pattern of randomness.

Trained as a physicist, Albert-László Barabási has worked for several years on mathematical models for the emergence of power laws in complex systems such as the Internet. In his latest book, Barabási describes many aspects of power laws, including a computer simulation of busy responses that went like this:

a) I [Barabási] selected the highest-priority task and removed it from the list, mimicking the real habit I have when I execute a task.
b) I replaced the executed task with a new one, randomly assigning it a priority, mimicking the fact that I do not know the importance of the next task that lands on my list.

The resulting simulation reproduced the power-law distribution that he and others have observed to characterize the waiting time between responses to emails, web visits, and other data. But this is more than a cute model to explain a heretofore-mysterious stylized fact. As with Albert Einstein's theory of Brownian motion, such latent-variable models suggest new directions of research, in this case moving from a static analysis of waiting time distributions to a dynamic study of the decisions that underlie the stochastic process.
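The two-step rule quoted above is simple enough to sketch in a few lines of Python. This is only an illustration of the rule as described here, not Barabási's code: the list length, the uniform random priorities, the number of steps, and the name `simulate_priority_queue` are all assumptions made for the example.

```python
import random

def simulate_priority_queue(list_length=2, steps=10000, seed=0):
    """Return the waiting time (in steps) of each executed task."""
    rng = random.Random(seed)
    # Each task is stored as (priority, step at which it joined the list).
    tasks = [(rng.random(), 0) for _ in range(list_length)]
    waits = []
    for t in range(1, steps + 1):
        # (a) pick the highest-priority task and execute it
        i = max(range(list_length), key=lambda k: tasks[k][0])
        _, arrived = tasks[i]
        waits.append(t - arrived)
        # (b) replace it with a new task of random priority
        tasks[i] = (rng.random(), t)
    return waits

waits = simulate_priority_queue()
share_immediate = waits.count(1) / len(waits)
```

Most tasks are executed in the step right after they arrive, while a low-priority task can sit on the list for a very long time; tabulating `waits` is one way to look at how heavy the tail is under these assumptions.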

For an application of this idea, Barabási discusses the "Harry Potter" phenomenon, in which hospital admissions in Britain were found to drop dramatically upon the release of each installment of the cult favorite children's book. A similar pattern happened in relation to Boston's professional baseball team: emergency-room visits in the city dropped when the Red Sox had winning days.

In addition to this sort of detail, Barabási makes some larger points, some of which are persuasive to me and some of which are not. He distinguishes between traditional models of randomness--Poisson and Gaussian distributions--which are based on statistically independent events, and bursty processes, which arise from feedback processes that at times suppress and at other times amplify variation. (A familiar example, not discussed in the book, is the system of financial instruments which shifted risk around for years before eventually blowing up.)

Barabási characterizes bursty processes as predictable; at one point he discusses the burstiness of people's physical locations (we spend most of our time at home, school or work, or in between, but occasionally go on long trips). From here, he takes a leap--which I couldn't follow at all--to conjecture a more general order within human behavior and historical events, in his opinion calling into question Karl Popper's argument that human history is inherently unpredictable. The book also features a long excursion into Hungarian history, but the connection of this narrative to the scientific themes was unclear to me.

Despite my skepticism of the book's larger claims, I found many of the stories in Bursts to be interesting and (paradoxically) unpredictable, and it offers an inside view on some fascinating research. I particularly liked how Barabási takes his models seriously enough that, when they fail, he learns even more from their refutation. I suspect there are quite a few more bursts to come in this particular research programme.

And now for my further thoughts, first on the structure of the book, then on some of its specific claims.

Structure

The book Bursts falls into what might be called the auto-Gladwell genre of expositions of a researcher's own work, told through stories and personal anecdotes in a magazine-article style, but ultimately focused on the underlying big idea. Auto-Gladwell is huge nowadays; other examples in the exact same subfield as Barabási's include Six Degrees (a book written by Duncan Watts, who was my Columbia colleague at the time, but which I actually first encountered through a (serious) review in the Onion, of all places), Steven Strogatz's Sync, last year's Connected by Nicholas Christakis and James Fowler, as well as, of course, Barabási's own Linked, published in 2002. These guys have collaborated with each other in different combinations, forming their own social network.

In keeping with the Gladwellian imperative, Bursts jumps around from story to story, often dropping the reader right into the middle of a narrative with little sense of how it connects to the big story. This makes for an interesting reading experience, but ultimately I'd be happier to see each story presented separately and to its conclusion. About half the book tells the story of a peasant rebellion in sixteenth-century Hungary (along with associated political maneuvering), but it's broken up into a dozen chapters spread throughout the book, and I had to keep flipping back and forth to follow what was going on. (Also, as I noted above, I didn't really see the connection between the story and Barabási's scientific material.)

Similarly, Barabási begins with, concludes with, and occasionally mentions a friend of his, an artist with a Muslim-sounding name who keeps being hassled by U.S. customs officials. It's an interesting story but does not benefit from being presented in bits and pieces. In other places, interesting ideas come up and are never resolved in the book. For example, chapter 15 features the story of a unified field theory proposed in 1919 (in Barabási's words, "the two forces [gravity and electromagnetism] could be brought together if we assume that our world is not three- but five-dimensional"; apparently it anticipated string theory by several decades) and published on Albert Einstein's recommendation in 1921. This is used as an example to introduce an analysis of Einstein's correspondence, but it's not clear to me exactly how the story relates to the larger themes of the book.

Specifics

As noted in the review above, I thought Barabási's dynamic model of priority-setting was fascinating, and I would've liked to hear more details--ideally, something more specific than stories and statements that people are 90% predictable, but less terse than what's in the papers that he and his collaborators have published in journals such as Science. On one hand, I can hardly blame the author for trying to make his book accessible to general audiences; still, I kept wanting more detail, to fill in the gaps and understand exactly how the mathematical models and statistical analyses fit into the stories and the larger claims.

My impression was that the book was making two scientific claims:

1. Bursty phenomena can often be explained by dynamic models of priority-setting.

2. In some fundamental way, bursty processes can be thought of as predictable and not so random. In particular, human behavior and even human history can perhaps be much more predictable than we thought?

How convincing is Barabási on these two claims? On the first claim, somewhat. His model is compelling to me, at least in the examples he focuses on. I think the presentation would've been stronger had he discussed the variety of different mathematical models that researchers have developed to explain power laws. Is Barabási's model better or more plausible than the others? Or perhaps all these different models have important common features that could be emphasized? I'd like to know.

Of Barabási's second claim, I'm not convinced at all. It seems like a big leap to go from being able to predict people's locations (mostly they're at work from 9-5 and home at other times) to forecasting human history as one might forecast the weather.

Also, I think Barabási is slightly misstating Karl Popper's position from The Poverty of Historicism. Barabási quotes Popper as saying that social patterns don't have the regularity of natural sciences--we can't predict revolutions like we can predict eclipses because societies, unlike planets, don't move in regular orbits. And, indeed, we still have difficulty predicting natural phenomena such as earthquakes that do not occur on regular schedules.

But Popper was saying more than that. Popper had two other arguments against the predictability of social patterns. First, there is the feedback mechanism: if a prediction is made publicly, it can induce behavior that is intended to block or hasten the prediction. This sort of issue arises in economic policy nowadays. Second, social progress depends on the progress of science, and scientific progress is inherently unpredictable. (When the National Science Foundation gives our research group \$200,000 to work on a project, there's no guarantee of success. In fact, they don't give money to projects with guaranteed success; that sort of thing isn't really scientific research at all.)

I agree with Barabási that questions of the predictability of individual and social behavior are ultimately empirical. Persuasive as Popper's arguments may be (and as relevant as they may have been when combatting mid-twentieth century Communist ideology), it might still be that modern scientists will be able to do it. But I think it's only fair to present Popper's full argument.

Finally, a few small points.

- On page 142 there is a discussion of Albert Einstein's letters: "His sudden fame had drastic consequences for his correspondence. In 1919, he received 252 letters and wrote 239, his life still in its subcritical phase . . . By 1920 Einstein had moved into the supercritical regime, and he never recovered. The peak came in 1953, two years before his death, when he received 832 letters and responded to 476 of them." This can't be right. Einstein must have been receiving zillions of letters in 1953.

- On page 194, it says, "It is tempting to see life as a crusade against randomness, a yearning for a safe, ordered existence." This echoes the famous idea from Schroedinger's What is Life, of living things as entropy pumps, islands of low-entropy systems within a larger world governed by the second law of thermodynamics.

- On page 195, Barabási refers to Chaoming Song, "a bright postdoctoral research associate who joined my lab in the spring of 2008." I hope that all his postdocs are bright!

- On page 199, he writes, "when it comes to the predictability of our actions, to our surprise power laws are replaced by Gaussians." This confused me. The distribution of waiting times can't be Gaussian, right? It would help to have some detail on exactly what is being measured here. I understand that, for accessibility reasons, the book has no graphs, but still it would be good to have a bit more information here.

## The Two Cultures: still around

David Blackbourn writes in the London Review of Books about the German writer Hans Magnus Enzensberger:

But there are several preoccupations to which Enzensberger has returned. One is science and technology. Like left-wing intellectuals of an earlier period, but unlike most contemporary intellectuals of any political stamp, he follows scientific thinking and puts it to use in his work. There are already references to systems theory in his media writings of the 1960s, while essays from the 1980s onwards bear the traces of his reading in chaos theory.

For some inexplicable reason, catastrophe theory has been left off the list. Blackbourn continues:

One of these takes the topological figure of the 'baker's transformation' (a one-to-one mapping of a square onto itself) discussed by mathematicians such as Stephen Smale and applies the model to historical time as the starting point for a series of reflections on the idea of progress, the invention of tradition and the importance of anachronism.

Pseuds corner indeed. I can hardly blame a European intellectual who was born in 1929 for getting excited about systems theory, chaos theory, and the rest. Poets and novelists of all sorts have taken inspiration from scientific theories, and the point is not whether John Updike truly understood modern computer science or whether Philip K. Dick had any idea what the minimax strategy was really about--these were just ideas, hooks on which to hang their stories. All is fodder for the creative imagination.

I have less tolerance, however, for someone to write in the London Review of Books to describe this sort of thing as an indication that Enzensberger "follows scientific thinking and puts it to use in his work." Perhaps "riffs on scientific ideas" would be a better way of putting it.

P.S. See Sebastian's comment below. Maybe I was being too quick to judge.

## Lucia de Berk found not guilty

Maarten Buis writes:

A while ago you blogged on the case of Lucia de Berk:

Maybe you would like to know that her conviction has been overturned today.

## What should they teach in school?

Bill Mill links to this blog by Peter Gray arguing that we should stop teaching arithmetic in elementary schools. He cites a research study from 75 years ago!

L. P. Benezet (1935/1936). The teaching of Arithmetic: The Story of an Experiment. Originally published in Journal of the National Education Association in three parts. Vol. 24, #8, pp 241-244; Vol. 24, #9, p 301-303; & Vol. 25, #1, pp 7-8.

I imagine there's been some research done on this since then?? Not my area of expertise, but I'm curious.

P.S. You gotta read this anonymous comment that appeared on Gray's blog. I have no idea if the story is true, but based on my recollection of teachers from elementary school, I can definitely believe it!

## Those silly statistics on divorce predictions . . . where did it all go wrong?

A couple weeks ago I blogged on John Gottman, a psychologist whose headline-grabbing research on marriages (he got himself featured in Blink with a claim that he could predict with 83 percent accuracy whether a couple would be divorced--after meeting with them for 15 minutes!) was recently debunked in a book by Laurie Abraham. Discussion on the blog revealed that Laurie Abraham had tried to contact Gottman but he had not replied to the request for an interview.

After this, Seth wrote to me:

## No problem, we'll adjust the data to fit the model

"...it has become standard in climate science that data in contradiction to alarmism is inevitably 'corrected' to bring it closer to alarming models. None of us would argue that this data is perfect, and the corrections are often plausible. What is implausible is that the 'corrections' should always bring the data closer to models." - Richard Lindzen, MIT Professor of Meteorology

Background:

Back in 2002, researchers at NASA published a paper entitled "Evidence for large decadal variability in the tropical mean radiative energy budget" (Wielicki et al., Science, 295:841-844, 2002). The paper reported data from a satellite that measures solar radiation headed towards earth, and reflected and radiated energy headed away from earth, and thereby measures the difference in incident and outgoing energy. The data reported in the paper showed that outgoing energy climbed measurably in the late 1990s, in contradiction to the assumptions of predictions from climate models that assume positive or near-zero "climate feedback."

## Of eggs, Blinky, Siamese dandelions, and four-leaf clovers

As part of his continuing campaign etc., Jimmy points me to this and wrote that "it seemed a little creepy. I [Jimmy] was reminded of blinky, the 3-eyed fish from the simpsons."

Jimmy's one up on me. I remember the fish but didn't know it had a name.

P.S. to Jimmy: Don't you have a job? Who has the time to search the web for Siamese eggs??

## Lojack for Grandpa

No statistical content here but it's interesting. I remain baffled why they can't do more of this for people on probation, using cellular technology to enforce all sorts of movement restrictions.

## Test failures

Jimmy brings up the saying that the chi-squared test is nothing more than "a test of sample size" and asks:

Would you mind elaborating or giving an example? Hypothesis tests are dependent on sample size, but is the chi-squared test more so than other tests?

And setting aside the general problems of hypothesis testing, off the top of your head, what other tests would you consider useless or counterproductive? (For new and infrequent readers, Fisher's exact test.)

I like chi-squared tests, in their place. See chapter 2 of ARM for an example. Or my 1996 paper with Meng and Stern for some more in-depth discussion.

To answer your later question, I think that most "named" tests are pointless: Wilcoxon, McNemar, Fisher, etc. etc. These procedures might all have their place, but I think much harm is done by people taking their statistical problems and putting them into these restricted, conventional frameworks. In contrast, methods such as regression and Anova (not to mention elaborations such as multilevel models and glm) are much more open-ended and allow the user to incorporate more data and more subject-matter information into his or her analysis.
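To make the "test of sample size" saying concrete, here is a small Python sketch. The two tables are made up for illustration: they have identical proportions (52% versus 48%) and differ only in the number of observations. The function computes the ordinary Pearson chi-squared statistic for a contingency table by hand.

```python
def pearson_chi2(table):
    """Pearson chi-squared statistic for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

small = [[52, 48], [48, 52]]          # n = 200
large = [[5200, 4800], [4800, 5200]]  # same proportions, n = 20,000

chi2_small = pearson_chi2(small)  # about 0.32
chi2_large = pearson_chi2(large)  # about 32
```

With the proportions held fixed, the statistic grows in proportion to n, so the same 52/48 split is nowhere near the conventional 3.84 cutoff (one degree of freedom, 5% level) at n = 200 and far beyond it at n = 20,000. The size of the difference has not changed at all, which is why looking at the estimated effect is more informative than looking at the test alone.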

## Sifting and Sieving

Following our recent discussion of p-values, Anne commented:

We use p-values for something different: setting detection thresholds for pulsar searches. If you're looking at, say, a million independent Fourier frequencies, and you want to bring up an expected one for further study, you look for a power high enough that its p-value is less than one in a million. (Similarly if you're adding multiple harmonics, coherently or incoherently, though counting your "number of trials" becomes more difficult.) I don't know whether there's another tool that can really do the job. (The low computing cost is also important, since in fact those million Fourier frequencies are multiplied by ten thousand dispersion measure trials and five thousand beams.)

That said, we don't really use p-values: in practice, radio-frequency interference means we have no real grasp on the statistics of our problem. There are basically always many signals that are statistically significant but not real, so we rely on ad-hoc methods to try to manage the detection rates.
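As a rough illustration of the thresholding arithmetic Anne describes, here is a Python sketch. It rests on an assumption that is not in her comment: that under pure noise the normalized power at each frequency follows a unit exponential distribution, so the tail probability of a power x is exp(-x). As she notes, real data with interference do not behave this way, so this only shows how the threshold scales with the number of trials. The function name is hypothetical.

```python
import math

def power_threshold(n_trials, expected_false_alarms=1.0):
    """Power cutoff giving the requested expected number of false alarms.

    ASSUMPTION: noise-only power is unit-exponential, so p = exp(-power).
    """
    per_trial_p = expected_false_alarms / n_trials
    return -math.log(per_trial_p)

# One expected false alarm among a million independent frequencies.
single_search = power_threshold(1e6)

# The same, after multiplying by 10,000 dispersion trials and 5,000 beams.
full_search = power_threshold(1e6 * 1e4 * 5e3)
```

Under this assumption the cutoff grows only logarithmically in the number of trials: about 13.8 for a million frequencies and about 31.5 for the full search.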

I don't know anything about astronomy--just for example, I can't remember which way the crescent moon curves in its different phases during the month--but I can offer some general statistical thoughts.

My sense is that p-values are not the best tool for this job. I recommend my paper with Jennifer and Masanao on multiple comparisons; you can also see my talks on the topic. (There's even a video version where you can hear people laughing at my jokes!) Our general advice is to model the underlying effects rather than thinking of them as a million completely unrelated outcomes.

The idea is to get away from the whole sterile p-value/Bayes-factor math games and move toward statistical modeling.

Another idea that's often effective is to select a subset of your million possibilities for screening and then analyze that subset more carefully. The work of Tian Zheng and Shaw-Hwa Lo on feature selection (see the Statistics category here) might be relevant for this purpose.

## Clippin' it

The other day I was talking with someone and, out of nowhere, he mentioned that he'd lost 20 kilos using Seth's nose-clipping strategy. I asked to try on his nose clips, he passed them over to me, and I promptly broke them. (Not on purpose; I just didn't understand how to put them on.)

I'll say this for Seth: I might disagree with him on climate change, academic research, Karl Popper, and Holocaust denial--but his dieting methods really seem to work.

P.S. to Phil: Yes, I'll buy this guy a new set of noseclips.

P.P.S. Another friend recently had a story about losing a comparable amount of weight, but using a completely different diet. So I'm certainly not claiming that Seth's methods are the only game in town.

P.P.P.S. As discussed in the comments below, Seth's publisher should have a good motive to commission a controlled trial of his diet, no?

P.P.P.P.S. Seth points out that we agree on many other things, including the virtues of John Tukey, David Owen, Veronica Geng, Jane Jacobs, Nassim Taleb, and R. And to that I'll add the late great Spy magazine.

## Of psychiatrists and statisticians

Sanjay Srivastava writes:

Below are the names of some psychological disorders. For each one, choose one of the following:

A. This is under formal consideration to be included as a new disorder in the DSM-5.

B. Somebody out there has suggested that this should be a disorder, but it is not part of the current proposal.

C. I [Srivastava] made it up.

1. Factitious dietary disorder - producing, feigning, or exaggerating dietary restrictions to gain attention or manipulate others

2. Skin picking disorder - recurrent skin picking resulting in skin lesions

3. Olfactory reference syndrome - preoccupation with the belief that one emits a foul or offensive body odor, which is not perceived by others

4. Solastalgia - psychological or existential stress caused by environmental changes like global warming

5. Hypereudaimonia - recurrent happiness and success that interferes with interpersonal functioning

6. Premenstrual dysphoric disorder - disabling irritability before and during menstruation

7. Internet addiction disorder - compulsive overuse of computers that interferes with daily life

8. Sudden wealth syndrome - anxiety or panic following the sudden acquisition of large amounts of wealth

9. Kleine Levin syndrome - recurrent episodes of sleeping 11+ hours a day accompanied by feelings of unreality or confusion

10. Quotation syndrome - following brain injury, speech becomes limited to the recitation of quotes from movies, books, TV, etc.

11. Infracaninophilia - compulsively supporting individuals or teams perceived as likely to lose competitions

12. Acquired situational narcissism - narcissism that results from being a celebrity

In academic research, "sudden wealth syndrome" describes the feeling right after you've received a big grant, and you suddenly realize you have a lot of work to do. As a blogger, I can also relate to #7 above.

. . . and statisticians

It's easy to make fun of psychiatrists for this sort of thing--but if statisticians had a similar official manual (not a ridiculous scenario, given that the S in DSM stands for Statistical), it would be equally ridiculous, I'm sure.

Sometimes this comes up when I hear about what is covered in graduate education in statistics and biostatistics--a view of data analysis in which each different data structure gets its own obscurely named "test" (Wilcoxon, McNemar, etc.). The implication, I fear, is that the practicing statistician is like a psychiatrist, listening to the client, diagnosing his or her problems, and then prescribing the appropriate pill (or, perhaps, endless Gibbs sampling^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H talk therapy). I don't know if I have a better model for the training of thousands of statisticians, nor maybe do I have a full understanding of what statistical practice is like for people on the inferential assembly line, as it were. But I strongly feel that the testing approach--and, more generally, the approach of picking your method based on the data structure--is bad statistics. So I'm pretty sure I'd find much to mock in any DSM-of-statistics that might be created.

Another uncomfortable analogy between the two professions is that statistical tests, like psychiatric diagnoses, are trendy, despite their supposed firm foundation in mathematics and demonstrated practical success (just as psychiatry boasts a firm foundation in medicine along with millions of satisfied customers over the decades). Compounding the discomfort is that some of the oldest and most established statistical tests are often useless or even counterproductive. (Consider the chi-squared test, which when used well can be helpful--see chapter 2 of ARM for an example--but is also notorious as being nothing more than "a test of sample size" and has led many researchers to disastrously oversimplify their data structures in order to fit the crudest version of this analysis.)

Instead of a DSM, the statistical profession has various standard textbooks, from Snedecor and Cochran to . . . whatever. But our informal DSM, as defined by practice, word-of-mouth, and our graduate curricula, is nothing to be proud of.

## My final exam

I'm not particularly proud of this one, but I thought it might interest some of you in any case. It's the final exam for the course I taught this fall to the economics students at Sciences Po. Students were given two hours.

## Thoughts on journalists who want to write about science

First the scientific story, then the journalist, then my thoughts.

Part 1: The scientific story

From the Daily News:

Spanking makes kids perform better in school, helps them become more successful: study

The research, by Calvin College psychology professor Marjorie Gunnoe, found that kids smacked before age 6 grew up to be more successful . . . Gunnoe, who interviewed 2,600 people about being smacked, told the [London] Daily Mail: "The claims that are made for not spanking children fail to hold up. I think of spanking as a dangerous tool, but then there are times when there is a job big enough for a dangerous tool. You don't use it for all your jobs."

From the Daily Mail article:

Professor Gunnoe questioned 2,600 people about being smacked, of whom a quarter had never been physically chastised. The participants' answers then were compared with their behaviour, such as academic success, optimism about the future, antisocial behaviour, violence and bouts of depression. Teenagers in the survey who had been smacked only between the ages of two and six performed best on all the positive measures. Those who had been smacked between seven and 11 fared worse on negative behaviour but were more likely to be academically successful. Teenagers who were still smacked fared worst on all counts.

Part 2: The journalist

Po Bronson (whose life and career are eerily similar to the slightly older and slightly more famous Michael Lewis) writes about this study in Newsweek:

Unfortunately, there's been little study of [kids who haven't been spanked], because children who've never been spanked aren't easy to find. Most kids receive physical discipline at least once in their life. But times are changing, and parents today have numerous alternatives to spanking. The result is that kids are spanked less often overall, and kids who've never been spanked are becoming a bigger slice of the pie in long-term population studies.

One of those new population studies underway is called Portraits of American Life. It involves interviews of 2,600 people and their adolescent children every three years for the next 20 years. Dr. Marjorie Gunnoe is working with the first wave of data on the teens. It turns out that almost a quarter of these teens report they were never spanked.

So this is a perfect opportunity to answer a very simple question: are kids who've never been spanked any better off, long term?

Gunnoe's summary is blunt: "I didn't find that in my data." . . . those who'd been spanked just when they were young--ages 2 to 6--were doing a little better as teenagers than those who'd never been spanked. On almost every measure.

A separate group of teens had been spanked until they were in elementary school. Their last spanking had been between the ages of 7 and 11. These teens didn't turn out badly, either.

Compared with the never-spanked, they were slightly worse off on negative outcomes, but a little better off on the good outcomes. . . .

Gunnoe doesn't know what she'll find, but my thoughts jump immediately to the work of Dr. Sarah Schoppe-Sullivan, whom we wrote about in NurtureShock. Schoppe-Sullivan found that children of progressive dads were acting out more in school. This was likely because the fathers were inconsistent disciplinarians; they were emotionally uncertain about when and how to punish, and thus they were reinventing the wheel every time they had to reprimand their child. And there was more conflict in their marriage over how best to parent, and how to divide parenting responsibilities.

I [Bronson] admit to taking a leap here, but if the progressive parents are the ones who never spank (or at least there's a large overlap), then perhaps the consistency of discipline is more important than the form of discipline. In other words, spanking regularly isn't the problem; the problem is having no regular form of discipline at all.

I couldn't find a copy of Gunnoe's report on the web. Her local newspaper (the Grand Rapids Press) reports that she "presented her findings at a conference of the Society for Research in Child Development," but the link only goes to the conference website, not to any manuscript. Following the link for Marjorie Gunnoe takes me to this page at Calvin College, which describes itself as "the distinctively Christian, academically excellent liberal arts college that shapes minds for intentional participation in the renewal of all things."

Gunnoe is quoted in the Grand Rapids Press as saying:

"This in no way should be thought of as a green light for spanking . . . This is a red light for people who want to legally limit how parents choose to discipline their children. I don't promote spanking, but there's not the evidence to outlaw it."

I'm actually not sure why these results, if valid, should not be taken as a "green light" for spanking, but I guess Gunnoe's point is that parental behaviors are situational, and you might not want someone reading her article and then hitting his or her kid for no reason, just for its as-demonstrated-by-research benefits.

Unsurprisingly, there's lots of other research on the topic of corporal punishment. A commenter at my other blog found a related study of Gunnoe's, from 1997. It actually comes from an entire issue of the journal that's all about discipline, including several articles on spanking.

Another commenter linked to several reports of research, including this from University of New Hampshire professor Murray Straus:

(I don't know who is spanked exactly once, but maybe this is #times spanked per week, or something like that. I didn't search for the original source of the graph.)

I agree with the commenter that it would be interesting to see Gunnoe and Straus speaking on the same panel.

Part 3: My thoughts

I can't exactly say that Po Bronson did anything wrong in his writeup--he's knowledgeable in this area (more than I am, certainly) and has thought a lot about it. He's a journalist who's written a book on child-rearing, and this is a juicy topic, so I can't fault him for discussing Gunnoe's findings. And I certainly wouldn't suggest that this topic is off limits just because nothing has really been "proved" on the causal effects of corporal punishment. Research in this area is always going to be speculative.

Nonetheless, I'm a little bothered by Bronson's implicit acceptance of Gunnoe's results and his extrapolations from her more modest claims. I get a bit uncomfortable when a reporter starts to give explanations for why something is happening, when that "something" might not really be true at all. I don't see any easy solution here--Bronson is even careful enough to say, "I admit to taking a leap here." Still, I'm bothered by what may be a too-easy implicit acceptance of an unpublished research claim. Again, I'm not saying that blanket skepticism is a solution either, but still . . .

It's a tough situation to be in, to report on headline-grabbing claims when there's no research paper to back them up. (I assume that if Bronson had a copy of Gunnoe's research article, he could send it to various experts he knows to get their opinions.)

P.S. I altered the second-to-last paragraph above in light of Jason's comments.

## Circle, purple, empty, orange, silver, and prefix codes

A while ago, this blog had a discussion of short English words that have no rhymes. We've all heard of "purple" (which, in fact, rhymes with the esoteric but real word hirple) and "orange" in this context, but there are others. This seems a bit odd, which I guess is why some of these words are famous for having no rhyme. Naively, and maybe not so naively, one might expect that at least some new words would be created to take advantage of the implied gaps in the gamut of two-syllable words. Is there something that prevents new coinages from filling the gaps? Why do we have blogs and vegans and wikis and pixels and ipods, but not merkles and rilvers and gurples?

I have a hypothesis, which is more in the line of idle speculation. Perhaps some combinations are automatically disfavored because they interfere with rapid processing of the spoken language. I need to digress for just a moment to mention a fact that supposedly baffled early workers in speech interpretation technology: in spoken language, there are no pauses or gaps between words. If you say a typical sentence --- let's take the previous sentence for example --- and then play it back really slowly, or look at the sound waveform on the screen, you will find that there are no gaps between most of the words. It's not "I -- need -- to -- digress," it's "Ineed todigressforjusta momento..." Indeed, unless you make a special effort to enunciate clearly, you may well use the final "t" in "moment" as the "t" in "to": most people wouldn't say the t twice. But with all of these words strung together, how is it that our minds are able to separate and interpret them, and in fact to do this unconsciously most of the time, to the extent that we feel like we are hearing separate words?

My thought --- and, as I said, it is pure speculation --- is that perhaps there is an element of "prefix coding" in spoken language, or at least in spoken English (but presumably others too). "Prefix coding" is the assignment of a code such that no symbol in the code is the start (prefix) of another symbol in the code. Hmm, that sentence only means something if you already know what it means. Try this. Suppose I want to compose a language based on only two syllables, "ba" and "fee". Using a prefix code, it's possible to come up with a rule for words in this language, such that I can always tell where one word stops and the next begins, even with no gaps between words. ("Huffman coding" provides the most famous way of doing this.) For instance, suppose I have words bababa, babafee, feeba, bafee, and feefeefee. No matter how I string these together, it turns out there is only one possible breakdown into words: babafeefeebabafeefeefeefeefeebabababa can only be parsed one way, so there's no need for word breaks. In fact, as soon as you reach the end of one word, you know you have done so; no need to "go backwards" from later in the message, to try out alternative parses.
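To make the ba/fee example concrete, here is a minimal Python sketch of the left-to-right parse. The function names `is_prefix_free` and `decode` are made up for this illustration; the word list and the message are the ones from the paragraph above. It checks that no word is the start of another, then splits the message without ever backing up.

```python
def is_prefix_free(words):
    """Return True if no word is the start (prefix) of a different word."""
    return not any(a != b and b.startswith(a) for a in words for b in words)


def decode(message, words):
    """Split `message` into code words, scanning left to right with no backtracking.

    With a prefix-free word list, at most one word can match at each position,
    so the first match is the only possible one.
    """
    parsed = []
    pos = 0
    while pos < len(message):
        matches = [w for w in words if message.startswith(w, pos)]
        if len(matches) != 1:
            raise ValueError(f"cannot parse unambiguously at position {pos}")
        parsed.append(matches[0])
        pos += len(matches[0])
    return parsed


# Word list and message taken from the text above.
WORDS = ["bababa", "babafee", "feeba", "bafee", "feefeefee"]
MESSAGE = "babafeefeebabafeefeefeefeefeebabababa"

print(is_prefix_free(WORDS))
print(decode(MESSAGE, WORDS))
```

Run on the message above, the parse comes out as babafee, feeba, bafee, feefeefee, feeba, bababa. Adding a word such as "ba" to the list would break the prefix-free property, and `decode` would then complain as soon as it hit a position where two words match.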

English doesn't quite work like this. For example, the syllable string see-thuh-car-go-on-the-ship can be interpreted as "see the cargo on the ship" or "see the car go on the ship". But it took me several tries to come up with that example! To a remarkable degree, you don't need pauses between the words, especially if the sentence also has to make sense.

So, maybe words that rhyme with "circle" or "empty" are disfavored because they would interfere with a quasi-"prefix coding" character of the language? Suppose there were a word "turple" for example. It would start with a "tur" sound, which is one of the more common terminal sounds in English (center, mentor, enter, renter, rater, later...). A string of syllables that contains "blah-blah-en-tur-ple-blah" could be split in more than one place...maybe that's a problem. Of course, you'll say, "But there are other words that start with 'tur'; why don't those cause a problem, why just 'turple'?" But there aren't all that many other common "tur" words --- surprisingly few, actually --- turn, term, terminal. "Turple" would be the worst, when it comes to parsing, because its second syllable --- pul --- is a common starting syllable in rapidly spoken English (where many pl words, like please and plus and play, start with an approximation of the sound).

So...perhaps I'm proposing nonsense, or perhaps I'm saying something that has been known to linguists forever, but that's my proposal: some short words tend to evolve out of the language because they interfere with our spoken language interpretation.

## Something I just wrote in a referee report: Post your numbers now, not later

The following is the last paragraph in a (positive) referee report I just wrote. It's relevant for lots of other articles too, I think, so I'll repeat it here:

Just as a side note, I recommend that the authors post their estimates immediately; I imagine their numbers will be picked up right away and be used by other researchers. First, this is good for the authors, as others will cite their work; second, these numbers should help advance research in the field; and, third, people will take the estimates seriously enough that, if there are problems, they will be uncovered. It makes sense to start this process now, so if anything bad comes up, it can be fixed before the paper gets published!

I have to admit that I'm typically too lazy to post my estimates right away; usually it doesn't happen until someone sends me an email request and then I put together a dataset. But, after writing the above paragraph, maybe I'll start following my own advice.

## Funding research

Via Mendeley, a nice example of several overlapping histograms:

The x axis is overlabelled, but I don't want to nitpick.

Previous post on histogram visualization: The mythical Gaussian distribution and population differences

Update 12/21/09: JB links to an improved version of the histograms by Eric Drexler below. And Eric links to the data. Thanks!

## How do I form my attitudes about scientific questions?

The lively discussion on Phil's entries on global warming here and here prompted me to think about the sources of my own attitudes toward this and other scientific issues.

For the climate change question, I'm well situated to have an informed opinion: I have a degree in physics, two of my closest friends have studied the topic pretty carefully, and I've worked on a couple related research projects, one involving global climate models and one involving tree ring data.

In our climate modeling project we were trying to combine different temperature forecasts on a scale in which Africa was represented by about 600 grid boxes. No matter how we combined these precipitation models, we couldn't get any useful forecasts out of them. Also, I did some finite-element analysis many years ago as part of a research project on the superheating of silicon crystals (for more details of the project, you can go to my published research papers and scroll way, way, way down). We were doing analysis on a four-inch wafer, and even that was tricky, so I'm not surprised that you'll have serious problems trying to model the climate in this way. As for the tree-ring analysis, I'm learning more about this now--we're just at the beginning of a three-year NSF-funded project--but, so far, it seems like one of those statistical problems that's easy to state but hard to solve, involving a sort of multilevel modeling of splines that's never been done before. It's tricky stuff, and I can well believe that previous analyses will need to be seriously revised.

Notwithstanding my credentials in this area, I take my actual opinions on climate change directly from Phil: he's more qualified to have an opinion on this than I am--unlike me, he's remained in physics--and he's put some time into reading up and thinking about the issues. He's also a bit of an outsider, in that he doesn't do climate change research himself. And if I have any questions about what Phil says, I can run it by Upmanu--a water-resources expert--and see what he thinks.

What if you don't know any experts personally?

It helps to have experts who are personal friends. Steven Levitt has been criticized for not talking over some of his climate-change speculations with climate expert Raymond Pierrehumbert at the University of Chicago (who helpfully supplied a map showing how Levitt could get to his office), but I can almost sort-of understand why Levitt didn't do this. It's not so easy to understand what a subject-matter expert is saying--there really are language barriers, and if the expert is not a personal friend, communication can be difficult. It's not enough to simply be at the same university, and perhaps Levitt realized this.

## Climate skeptics, deniers, hawks, and True Believers

Lots of accusations are flying around in the climate change debate. People who believe in anthropogenic (human-caused) climate change are accused of practicing religion, not science. People who don't are called "deniers", which some of them think is an attempt to draw a moral link with Holocaust deniers. Al Gore referred to Sarah Palin as a "climate change denier," and Palin immediately responded that she believes the climate changes, she just doesn't think the level of greenhouse gases in the atmosphere has anything to do with it. What's the right word to use for people like her? And yes, we do need some terminology if we want to be able to discuss the climate change debate!

## ClimateGate: How do YOU choose what to believe?

Like a lot of scientists -- I'm a physicist -- I assumed the "Climategate" flap would cause a minor stir but would not prompt any doubt about the threat of global warming, at least among educated, intelligent people. The evidence for anthropogenic (that is, human-caused) global warming is strong, comes from many sources, and has been subject to much scientific scrutiny. Plenty of data are freely available. The basic principles can be understood by just about anyone, and first- and second-order calculations can be performed by any physics grad student. Given these facts, questioning the occurrence of anthropogenic global warming seems crazy. (Predicting the details is much, much more complicated.) And yet, I have seen discussions, articles, and blog posts from smart, educated people who seem to think that anthropogenic climate change is somehow called into question by the facts that (1) some scientists really, deeply believe that global warming skeptics are wrong in their analyses and should be shut out of the scientific discussion of global warming, and (2) one scientist may have fiddled with some of the numbers in making one of his plots. This is enough to make you skeptical of the whole scientific basis of global warming? Really?

## "Orange" ain't so special

Mark Liberman comes in with a data-heavy update (and I mean "data-heavy" in a good way, not as some sort of euphemism for "data-adipose") on my comments of the other day. I'm glad to see that he agrees with me that my impressedness with Laura Wattenberg's observation was justified.

## Yet more antblogging

James Waters writes:

## Equation search, part 2

Some further thoughts on the Eureqa program which implements the curve-fitting method of Michael Schmidt and Hod Lipson:

The program kept running indefinitely, so I stopped it in the morning, at which point I noticed that the output didn't quite make sense, and I went back and realized that I'd messed up when trying to delete some extra data in the file. So I re-ran with the actual data. The program functioned as before but moved much quicker to a set of nearly-perfect fits (R-squared = 99.9997%, and no, that's not a typo). Here's what the program came up with:

The model at the very bottom of the list is pretty pointless, but in general I like the idea of including "scaffolding" (those simple models that we construct on the way toward building something that fits better) so I can't really complain.

It's hard to fault the program for not finding y^2 = x1^2 + x2^2, given that it already had such a success with the models that it did find.

Steven Levitt writes:

My view is that the emails [extracted by a hacker from the climatic research unit at the University of East Anglia] aren't that damaging. Is it surprising that scientists would try to keep work that disagrees with their findings out of journals? When I told my father that I was sending my work saying car seats are not that effective to medical journals, he laughed and said they would never publish it because of the result, no matter how well done the analysis was. (As is so often the case, he was right, and I eventually published it in an economics journal.)

Within the field of economics, academics work behind the scenes constantly trying to undermine each other. I've seen economists do far worse things than pulling tricks in figures. When economists get mixed up in public policy, things get messier. So it is not at all surprising to me that climate scientists would behave the same way.

I have a couple of comments, not about the global-warming emails--I haven't looked into this at all--but regarding Levitt's comments about scientists and their behavior:

1. Scientists are people and, as such, are varied and flawed. I get particularly annoyed with scientists who ignore criticisms that they can't refute. The give and take of evidence and argument is key to scientific progress.

2. Levitt writes, about scientists who "try to keep work that disagrees with their findings out of journals." This is or is not ethical behavior, depending on how it's done. If I review a paper for a journal and find that it has serious errors or, more generally, that it adds nothing to the literature, then I should recommend rejection--even if the article claims to have findings that disagree with my own work. Sure, I should bend over backwards and all that, but at some point, crap is crap. If the journal editor doesn't trust my independent judgment, that's fine, he or she should get additional reviewers. On occasion I've served as an outside "tiebreaker" referee for journals on controversial articles outside of my subfield.

Anyway, my point is that "trying to keep work out of journals" is ok if done through the usual editorial process, not so ok if done by calling the journal editor from a pay phone at 3am or whatever.

I wonder if Levitt is bringing up this particular example because he served as a referee for a special issue of a journal that he later criticized. So he's particularly aware of issues of peer review.

3. I'm not quite sure how to interpret the overall flow of Levitt's remarks. On one hand, I can't disagree with the descriptive implications: Some scientists behave badly. I don't know enough about economics to verify his claim that academics in that field "constantly trying to undermine each other . . . do far worse things than pulling tricks in figures"--but I'll take Levitt's word for it.

But I'm disturbed by the possible normative implications of Levitt's statement. It's certainly not the case that everybody does it! I'm a scientist, and, no, I don't "pull tricks in figures" or anything like this. I don't know what percentage of scientists we're talking about here, but I don't think this is what the best scientists do. And I certainly don't think it's ok to do so.

What I'm saying is, I think Levitt is doing a big service by publicly recognizing that scientists sometimes--often?--engage in unethical behavior such as hiding data. But I'm unhappy with the sense of amused, world-weary tolerance that I get from reading his comment.

Anyway, I had a similar reaction a few years ago when reading a novel about scientific misconduct. The implication of the novel was that scientific lying and cheating wasn't so bad, these guys are under a lot of pressure and they do what they can, etc. etc.--but I didn't buy it. For the reasons given here, I think scientists who are brilliant are less likely to cheat.

4. Regarding Levitt's specific example--the article on car seats that was rejected by medical journals--I wonder if he's being too quick to assume that the journals were trying to keep his work out because it disagreed with previous findings.

As a scientist whose papers have been rejected by top journals in many different fields, I think I can offer a useful perspective here.

Much of what makes a paper acceptable is style. As a statistician, I've mastered the Journal of the American Statistical Association style and have published lots of papers there. But I've never successfully published a paper in political science or economics without having a collaborator in that field. There's just certain things that a journal expects to see. It may be comforting to think that a journal will not publish something "because of the result," but my impression is that most journals like a bit of controversy--as long as it is presented in their style. I'm not surprised that, with his training, Levitt had more success publishing his public health work in econ journals.

P.S. Just to repeat, I'm speaking in general terms about scientific misbehavior, things such as, in Levitt's words, "pulling tricks in figures" or "far worse things." I'm not making a claim that the scientists at the University of East Anglia were doing this, or were not doing this, or whatever. I don't think I have anything particularly useful to add on that; you can follow the links in Freakonomics to see more on that particular example.

## "Finding signal from noise": Dr. Bancel responds

The other day I commented on an article by Peter Bancel and Roger Nelson that reported evidence that "the coherent attention or emotional response of large populations" can affect the output of quantum-mechanical random number generators.

I was pretty dismissive of the article; in fact elsewhere I gave my post the title, "Some ESP-bashing red meat for you ScienceBlogs readers out there."

Dr. Bancel was pointed to my blog and felt I wasn't giving the full story. I'll give his comments and then at the end add some thoughts of my own. Bancel wrote:

## Finding signal from noise

A reporter contacted me to ask my impression of this article by Peter Bancel and Roger Nelson, which reports evidence that "the coherent attention or emotional response of large populations" can affect the output of quantum-mechanical random number generators.

I spent a few minutes looking at the article, and, well, it's about what you might expect. Very professionally done, close to zero connection between their data and whatever they actually think they're studying.

## Your chance to help some people make money (maybe) and improve research (maybe)

Hello, my name is Lauren Schmidt, and I recently graduated from the Brain & Cognitive Sciences graduate program at MIT, where I spent a lot of time doing online research using human subjects. I also spent a lot of time being frustrated with the limitations of various existing online research tools. So now I am co-founding a start-up, HeadLamp Research, with the goal of making online experimental design and data collection as fast, easy, powerful, and painless as can be. But we need your help to come up with an online research tool that is as useful as possible!

We have a short survey (5-10 min) on your research practices and needs, and we would really appreciate your input if you are interested in online data collection.

I imagine they're planning to make money off this start-up and so I think it would be only fair if they pay their survey participants. Perhaps they can give them a share of the profits, if any exist?

## Statistical computing job in Bristol, England

Bill Browne sends in this interesting job possibility. Closing date for applications is 30 Oct 2009, so if you're interested, let him know right away!

## Rorschach's on the loose

According to Josh Millet, the notorious Rorschach inkblots have been posted on the web, leading to much teeth-gnashing among psychologists, who worry that they can't use the test anymore now that civilians can get their hands on the images ahead of time.

For example, here's a hint for Card IV (see below): "The human or animal content seen in the card is almost invariably classified as male rather than female, and the qualities expressed by the subject may indicate attitudes toward men and authority."

So, if they show you this one on a pre-employment test, better play it safe and say that the big figure looks trustworthy and that you'd never, ever steal paperclips from it.

Oh, and when Card II comes up, maybe you should just play it safe and not mention blood at all.

More general concerns

I'm not particularly worried about the Rorschach test since it's pretty much a joke--you can read into it whatever you want--but, as Millet points out, similar issues would arise, for example, if someone stole a bunch of SAT questions and posted them. It would compromise the test's integrity. Millet points out that this problem could be solved if you were to release thousands and thousands of potential SAT questions: nobody could memorize all of these, it would be easier to just learn the material.

I've had the plan for many years to do this for introductory statistics classes: to have, say, 200 questions for the final exam, give out the questions to all the students, and explain ahead of time that the actual exam will be a stratified sample from the list. This would encourage students to study the material but not in a way that they could usefully "game the system." I haven't done this yet--it's a lot of work!--but I'm still planning to do so.
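Here is a rough sketch, in Python, of how such an exam might be drawn. Everything specific in it is a made-up assumption for illustration: the four topic names, the 50 questions per topic, the 5 questions drawn per topic, and the function name `draw_exam`. The idea is just that the published list is split into strata (topics) and a fixed number of questions is sampled without replacement from each.

```python
import random


def draw_exam(question_bank, per_topic, seed=None):
    """Draw a stratified sample: a fixed number of questions from each topic.

    question_bank: dict mapping topic name -> list of question ids
    per_topic:     dict mapping topic name -> how many questions to draw
    """
    rng = random.Random(seed)
    exam = []
    for topic in sorted(question_bank):
        # Sample without replacement within each stratum.
        exam.extend(rng.sample(question_bank[topic], per_topic[topic]))
    return exam


# Hypothetical bank: 200 published questions, 50 in each of 4 topics.
TOPICS = ["design", "inference", "probability", "regression"]
bank = {topic: [f"{topic}-{i:02d}" for i in range(1, 51)] for topic in TOPICS}

# Hypothetical exam: 5 questions per topic, 20 in all.
exam = draw_exam(bank, {topic: 5 for topic in TOPICS}, seed=1)
print(exam)
```

Stratifying by topic guarantees that every part of the course shows up on the exam, which a simple random sample of 20 from 200 would not.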

## The laws of conditional probability are false

This is all standard physics. Consider the two-slit experiment--a light beam, two slits, and a screen--with y being the place on the screen that lights up. For simplicity, think of the screen as one-dimensional. So y is a continuous random variable.

Consider four experiments:

1. Slit 1 is open, slit 2 is closed. Shine light through the slit and observe where the screen lights up. Or shoot photons through one at a time, it doesn't matter. Either way you get a distribution, which we can call p1(y).

2. Slit 1 is closed, slit 2 is open. Same thing. Now we get p2(y).

3. Both slits are open. Now we get p3(y).

4. Now run experiment 3 with detectors at the slits. You'll find out which slit each photon goes through. Call the slit x. So x is a discrete random variable taking on two possible values, 1 or 2. Assuming the experiment has been set up symmetrically, you'll find that Pr(x=1) = Pr(x=2) = 1/2.

You can also record y, thus you can get p4(y), and you can also observe the conditional distributions, p4(y|x=1) and p4(y|x=2). You'll find that p4(y|x=1) = p1(y) and p4(y|x=2) = p2(y). You'll also find that p4(y) = (1/2) p1(y) + (1/2) p2(y). So far, so good.

The problem is that p4 is not the same as p3. Heisenberg's uncertainty principle: putting detectors at the slits changes the distribution of the hits on the screen.

This violates the laws of conditional probability, in which you have random variables x and y, and in which p(x|y) is the distribution of x if you observe y, p(y|x) is the distribution of y if you observe x, and so forth.
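For readers who want to see numbers, here is a toy Python sketch of the four experiments. It is a cartoon, not a physical calculation: it assumes each slit contributes the same Gaussian envelope on the screen, with opposite phase gradients, and the constants `K` and `SIGMA` and the function names are invented for the illustration. With no detectors the amplitudes add before squaring (p3); with detectors the probabilities add (p4).

```python
import cmath
import math

K = 3.0      # hypothetical phase gradient across the screen
SIGMA = 2.0  # hypothetical width of each single-slit pattern

# Screen positions on a one-dimensional grid.
ys = [i * 0.05 for i in range(-200, 201)]


def amplitude(y, slit):
    """Toy complex amplitude at screen position y for light through one slit."""
    sign = 1 if slit == 1 else -1
    envelope = math.exp(-y ** 2 / (4 * SIGMA ** 2))
    return envelope * cmath.exp(1j * sign * K * y)


def normalize(weights):
    """Rescale nonnegative weights so they sum to 1 over the grid."""
    total = sum(weights)
    return [w / total for w in weights]


# Experiments 1 and 2: one slit open at a time.
p1 = normalize([abs(amplitude(y, 1)) ** 2 for y in ys])
p2 = normalize([abs(amplitude(y, 2)) ** 2 for y in ys])

# Experiment 3: both slits open, no detectors. Amplitudes add, then square.
p3 = normalize([abs(amplitude(y, 1) + amplitude(y, 2)) ** 2 for y in ys])

# Experiment 4: detectors at the slits. Probabilities add (mixture over x).
p4 = [0.5 * a + 0.5 * b for a, b in zip(p1, p2)]

max_gap = max(abs(a - b) for a, b in zip(p3, p4))
print("largest pointwise gap between p3 and p4:", max_gap)
```

In this toy setup p3 has interference fringes and p4 does not, so the mixture formula that works for experiment 4 fails for experiment 3.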

A dissenting argument (that doesn't convince me)

To complicate matters, Bill Jefferys writes:

As to the two slit experiment, it all depends on how you look at it. Leslie Ballentine wrote an article a number of years ago in The American Journal of Physics, in which he showed that conditional probability can indeed be used to analyze the two slit experiment. You just have to do it the right way.

I looked at the Ballentine article and I'm not convinced. Basically he's saying that the reasoning above isn't a correct application of probability theory because you should really be conditioning on all information, which in this case includes the fact that you measured or did not measure a slit. I don't buy this argument. If the probability distribution changes when you condition on a measurement, this doesn't really seem to be classical "Boltzmannian" probability to me.

In standard probability theory, the whole idea of conditioning is that you have a single joint distribution sitting out there--possibly there are parts that are unobserved or even unobservable (as in much of psychometrics)--but you can treat it as a fixed object that you can observe through conditioning (the six blind men and the elephant). Once you abandon the idea of a single joint distribution, I think you've moved beyond conditional probability as we usually know it.

And so I think I'm justified in pointing out that the laws of conditional probability are false. This is not a new point with me--I learned it in college, and obviously the ideas go back to the founders of quantum mechanics. But not everyone in statistics knows about this example, so I thought it would be useful to lay it out.

What I don't know is whether there are any practical uses to this idea in statistics, outside of quantum physics. For example, would it make sense to use "two-slit-type" models in psychometrics, to capture the idea that asking one question affects the response to others? I just don't know.

## A horror story involving the correction of a published scientific article

Lee Sigelman points to this article by physicist Rick Trebino describing his struggles to publish a correction in a peer-reviewed journal. It's pretty frustrating, and by the end of it--hell, by the first third of it--I share Trebino's frustration. It would be better, though, if he'd link to his comment and the original article that inspired it. Otherwise, how can we judge his story? Somehow, by the way that it's written, I'm inclined to side with Trebino, but maybe that's not fair--after all, I'm only hearing half of the story.

Anyway, reading Trebino's entertaining rant (and I mean "rant" in a good way, of course) reminded me of my own three stories on this topic. Rest assured, none of them are as horrible as Trebino's.

## Model checking in the presence of missing data

I received this question in the mail:

Your Biometrics article, Multiple imputation for model checking: completed-data plots with missing and latent data, suggests diagnostics when the missing values of a dataset are filled in by multiple imputation. But suppose we have two equivalent files--File A with variable y left-censored at known threshold and File B with y fully observed. We draw multiple imputations of censored y in File A. (1) Can we validate our imputation model by setting y in File B as left-censored according to the inclusion indicator from A, performing multiple imputation of these "censored" data, and comparing imputed to observed values? (2) In particular, what diagnostic measure(s) would tell us whether the imputed and observed values fit closely enough to validate our imputation model?

My reply: I'm a little confused: if you already have File B, what do you need File A for? Do the two files have different data, or are you just using this to validate your imputation model? If the latter, then, yes, you can see whether the observations in File B are consistent with the predictive distributions obtained from your multiple imputations on File A. You wouldn't expect the imputations to be perfect, but you'd like the imputed 50% intervals to have approximate 50% coverage, you'd like the average values of the true data to equal the predictions from the imputations, on average, and conditional on any information in the observed data in File A. (But the imputations don't have to--and, in general, shouldn't--be correct on average, conditional on the hidden true values.)
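As a rough sketch of the coverage check, here is some Python on simulated data. The data are entirely made up (and, to keep things short, the simulation ignores the censoring threshold): each unit has a "true" value standing in for File B and a set of imputed draws standing in for the multiple imputations from File A. The function name `interval_coverage` is an assumption for this example. It computes the central 50% interval of the draws for each unit and reports how often the true value lands inside.

```python
import random
import statistics


def interval_coverage(true_values, imputations):
    """Fraction of true values inside the central 50% interval of their imputed draws."""
    hits = 0
    for truth, draws in zip(true_values, imputations):
        # Quartiles of the imputed draws: the 50% interval runs from q1 to q3.
        q1, _, q3 = statistics.quantiles(draws, n=4)
        hits += q1 <= truth <= q3
    return hits / len(true_values)


# Simulated stand-in for the two files (not real data).
rng = random.Random(0)
n_units, n_draws = 2000, 200
means = [rng.gauss(0, 1) for _ in range(n_units)]

# "File B": the fully observed values.
true_values = [rng.gauss(m, 1) for m in means]

# "File A" imputations: here drawn from the correct predictive distribution,
# so coverage should come out close to 50%.
imputations = [[rng.gauss(m, 1) for _ in range(n_draws)] for m in means]

coverage = interval_coverage(true_values, imputations)
print("coverage of 50% intervals:", coverage)
```

If the imputation model were too confident (say, draws with half the true spread), the same check would report coverage well below 50%, which is the kind of discrepancy you'd be looking for.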

You may also be interested in my 2004 article, Exploratory data analysis for complex models, which actually has an example on death-penalty sentencing, with censored data.

## Truth in Data

David Blei is teaching this cool new course at Princeton in the fall. I'll give the description and then my thoughts.

Daniel Lakeland writes:

You may be astounded that people are still reporting 26% more probability to have daughters than sons, and then extrapolating this to decide that evolution is strongly favoring beautiful women... Or considering the degree of innumeracy in the population perhaps you wouldn't be astounded.... in any case... they are still reporting such things.

If anyone out there happens to know Jonathan Leake, the reporter who wrote this story for the (London) Sunday Times, perhaps you could send him a copy of our recent article in the American Scientist. Or, if he'd like more technical details, this article from the Journal of Theoretical Biology?

Thank you. I have nothing more to say at this time.

## The science of wishful thinking

I just read Charles Seife's excellent book, "Sun in a bottle: The strange history of fusion and the science of wishful thinking." One thing I found charming about the book was that it lumped crackpot cold fusion, nutty plans to use H-bombs to carve out artificial harbors in Alaska, and mainstream tokamaks into the same category: wildly-hyped but unsuccessful promises to change the world. The "wishful thinking" framing seems to fit all these stories pretty well, much better than the usual distinction between the good science of big-budget lasers and tokamaks and the bad science of cold fusion and the like. The physics explanations were good also.

There was only one part I really disagreed with. On page 220, Seife writes, "Science is little more than a method of tearing away notions that are not supported by cold, hard data." I disagree. Just for a few examples from physics, how about Einstein's papers on Brownian motion and the photoelectric effect? And what about lots of biology, chemistry, and solid state physics, figuring out the structures of crystals and semiconductors and protein folding and all that? Sure, all of this work involves some "tearing away" of earlier models, but much of it--often the most important part--is constructive, building a model--a story--that makes sense and backing it up with data.

## Arthur Jensen: "the possible indicators of g are of unlimited diversity . . ."

After finding the Howard Wainer interview, I looked up the entire series of Profiles in Research published by the Journal of Educational and Behavioral Statistics. I don't have much to say about most of these interviews: some of these people I'd never heard of, and I don't really have much research overlap with the others. Probably I have the most overlap with R. D. Bock, who's done a lot of work on multilevel modeling, but, for whatever reason, his stories didn't grab my interest.

But I was curious about the interview with Arthur Jensen. I've never met him--he gave a talk at the Berkeley statistics department once when I was there, but for some reason I wasn't able to attend the talk. But I've heard of him. As the interviewers (Daniel Robinson and Howard Wainer) state:
