Recently in Public Health Category

When it rains it pours . . .

John Transue writes:

I saw a post on Andrew Sullivan's blog today about life expectancy in different US counties. With a bunch of the worst counties being in Mississippi, I thought that it might be another case of analysts getting extreme values from small counties.

However, the paper (see here) includes a pretty interesting methods section. This is from page 5, "Specifically, we used a mixed-effects Poisson regression with time, geospatial, and covariate components. Poisson regression fits count outcome variables, e.g., death counts, and is preferable to a logistic model because the latter is biased when an outcome is rare (occurring in less than 1% of observations)."
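One way to see why a count model is natural here: when the event probability is small, the Poisson distribution closely approximates the binomial, so modeling death counts directly loses essentially nothing. A minimal sketch with hypothetical numbers (not the paper's data):

```python
import math

# For rare events (p small, n large), the Poisson pmf closely tracks
# the binomial pmf: one reason death counts are naturally modeled with
# Poisson regression.  Numbers below are hypothetical.
n, p = 10_000, 0.001          # county population, annual death rate
lam = n * p                   # implied Poisson rate

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def pois_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

for k in (5, 10, 15):
    b, q = binom_pmf(k, n, p), pois_pmf(k, lam)
    print(f"k={k:2d}  binomial={b:.5f}  poisson={q:.5f}")
```

The two columns agree to several decimal places, which is the sense in which nothing is lost by switching to the count model.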

They have downloadable data. I believe that the data are predicted values from the model. A web appendix also gives 90% CIs for their estimates.

Do you think they solved the small county problem and that the worst counties really are where their spreadsheet suggests?

My reply:

I don't have a chance to look in detail but it sounds like they're on the right track. I like that they cross-validated; that's what we did to check we were ok with our county-level radon estimates.

Regarding your question about the small county problem: no matter what you do, all maps of parameter estimates are misleading. Even the best point estimates can't capture uncertainty. As noted above, cross-validation (at the level of the county, not of the individual observation) is a good way to keep checking.
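To illustrate the kind of county-level checking involved, here is a small simulation (hypothetical numbers, nothing to do with the radon data) comparing raw county means against precision-weighted shrinkage estimates, scored at the level of the county:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 50 counties with true rates drawn from a common
# distribution, observed with noisy data and unequal sample sizes.
n_counties = 50
true_rates = rng.normal(1.0, 0.3, n_counties)
sizes = rng.integers(5, 50, n_counties)
data = [rng.normal(true_rates[j], 2.0, sizes[j]) for j in range(n_counties)]

# Raw estimate: the county mean.  Shrunken estimate: pull each county
# toward the grand mean, more strongly when its sample is small.
grand = np.mean(np.concatenate(data))
tau2, sigma2 = 0.09, 4.0              # assumed between/within variances
raw = np.array([y.mean() for y in data])
w = tau2 / (tau2 + sigma2 / sizes)
shrunk = w * raw + (1 - w) * grand

# Score the estimates at the county level (here against the simulated
# truth; with real data you would hold out counties instead).
rmse_raw = np.sqrt(np.mean((raw - true_rates) ** 2))
rmse_shrunk = np.sqrt(np.mean((shrunk - true_rates) ** 2))
print(f"raw RMSE {rmse_raw:.3f}, shrunken RMSE {rmse_shrunk:.3f}")
```

The shrunken estimates beat the raw means overall, which is exactly what county-level validation is designed to reveal.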

Brendan Nyhan points me to this from Don Taylor:

Can national data be used to estimate state-level results? . . . A challenge is the fact that the sample size in many states is very small . . . Richard [Gonzales] used a regression approach to extrapolate this information to provide state-level estimates of support for health reform:
To get around the challenge presented by small sample sizes, the model presented here combines the benefits of incorporating auxiliary demographic information about the states with the hierarchical modeling approach commonly used in small area estimation. The model is designed to "shrink" estimates toward the average level of support in the region when there are few observations available, while simultaneously adjusting for the demographics and political ideology in the state. This approach therefore takes fuller advantage of all information available in the data to estimate state-level public opinion.

This is a great idea, and it is already being used all over the place in political science. For example, here. Or here. Or here.

See here for an overview article, "How should we estimate public opinion in the states?" by Jeff Lax and Justin Phillips.

It's good to see practical ideas being developed independently in different fields. I know that methods developed by public health researchers have been useful in political science, and I hope that in turn they can take advantage of the progress we've made in multilevel regression and poststratification.
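The shrinkage approach described above can be sketched with toy numbers. The `n0` prior sample size and all state figures below are made up for illustration; a real small-area analysis would estimate the pooling weight hierarchically and adjust for demographics:

```python
# Toy illustration (hypothetical numbers): shrink each state's raw
# support estimate toward its region's average, shrinking more when
# the state sample is small.
states = {           # state: (n respondents, raw support, region)
    "WY": (12, 0.25, "West"),
    "CA": (900, 0.61, "West"),
    "VT": (20, 0.70, "Northeast"),
    "NY": (700, 0.58, "Northeast"),
}

# Region averages, weighted by sample size
region_tot = {}
for n, p, r in states.values():
    t = region_tot.setdefault(r, [0, 0.0])
    t[0] += n
    t[1] += n * p
region_avg = {r: s / n for r, (n, s) in region_tot.items()}

# Simple precision-weighted shrinkage; n0 is an assumed prior sample
# size controlling how hard small states are pulled toward the region.
n0 = 100
shrunk = {
    s: (n * p + n0 * region_avg[r]) / (n + n0)
    for s, (n, p, r) in states.items()
}
print(shrunk)
```

With these numbers, tiny Wyoming moves most of the way toward its regional average while California barely budges, which is the intended behavior.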

Since we're on the topic of nonreplicable research . . . see here (link from here) for a story of a survey that's so bad that the people who did it won't say how they did it.

I know too many cases where people screwed up in a survey when they were actually trying to get the right answer, for me to trust any report of a survey that doesn't say what they did.

I'm reminded of this survey which may well have been based on a sample of size 6 (again, the people who did it refused to release any description of methodology).

Christakis-Fowler update


After I posted on Russ Lyons's criticisms of Nicholas Christakis and James Fowler's work on social networks, several people emailed in with links to related articles. (Nobody wants to comment on the blog anymore; all I get is emails.)

Here they are:

Nicholas Christakis and James Fowler are famous for finding that obesity is contagious. Their claims, which have been received with both respect and skepticism (perhaps we need a new word for this: "respecticism"?), are based on analysis of data from the Framingham heart study, a large longitudinal public-health study that happened to have some social network data (for the odd reason that each participant was asked to provide the name of a friend who could help the researchers locate them if they were to move away during the study period).

The short story is that if your close contact became obese, you were likely to become obese also. The long story is a debate about the reliability of this finding (that is, can it be explained by measurement error and sampling variability) and its causal implications.

This sort of study is in my wheelhouse, as it were, but I have never looked at the Christakis-Fowler work in detail. Thus, my previous and current comments are more along the lines of reporting, along with general statistical thoughts.

We last encountered Christakis-Fowler last April, when Dave Johns reported on some criticisms coming from economists Jason Fletcher and Ethan Cohen-Cole and mathematician Russell Lyons.

Lyons's paper was recently published under the title, The Spread of Evidence-Poor Medicine via Flawed Social-Network Analysis. Lyons has a pretty aggressive tone--he starts the abstract with the phrase "chronic widespread misuse of statistics" and it gets worse from there--and he's a bit rougher on Christakis and Fowler than I would be, but this shouldn't stop us from evaluating his statistical arguments. Here are my thoughts:

Here is my discussion of a recent article by David Spiegelhalter, Christopher Sherlaw-Johnson, Martin Bardsley, Ian Blunt, Christopher Wood, and Olivia Grigg that is scheduled to appear in the Journal of the Royal Statistical Society:

I applaud the authors' use of a mix of statistical methods to attack an important real-world problem. Policymakers need results right away, and I admire the authors' ability and willingness to combine several different modeling and significance testing ideas for the purposes of rating and surveillance.

That said, I am uncomfortable with the statistical ideas here, for three reasons. First, I feel that the proposed methods, centered as they are around data manipulation and corrections for uncertainty, have serious defects compared to a more model-based approach. My problem with methods based on p-values and z-scores--however they happen to be adjusted--is that they draw discussion toward error rates, sequential analysis, and other technical statistical concepts. In contrast, a model-based approach draws discussion toward the model and, from there, the process being modeled. I understand the appeal of p-value adjustments--lots of quantitatively trained people know about p-values--but I'd much rather draw the statistics toward the data than the other way around. Once you have to bring out the funnel plot, that is to me a sign of (partial) failure: you're talking about properties of a statistical summary rather than about the underlying process that generates the observed data.

My second difficulty is closely related: to me, the mapping seems tenuous from statistical significance to the ultimate healthcare and financial goals. I'd prefer a more direct decision-theoretic approach that focuses on practical significance.

That said, the authors of the article under discussion are doing the work and I'm not. I'm sure they have good reasons for using what I consider to be inferior methods, and I believe that one of the points of this discussion is to give them a chance to give this explanation.

Finally, I am glad that these methods result in ratings rather than rankings. As has been discussed by Louis (1984), Lockwood et al. (2002), and others, two huge problems arise when constructing ranks from noisy data. First, with unbalanced data (for example, different sample sizes in different hospitals) there is no way to simultaneously get reasonable point estimates of parameters and their rankings. Second, ranks are notoriously noisy. Even with moderately large samples, estimated ranks are unstable and can be misleading, violating well-known principles of quality control by encouraging decision makers to chase noise rather than understanding and reducing variation (Deming, 2000). Thus, although I am unhappy with the components of the methods being used here, I like some aspects of the output.
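The instability of ranks is easy to demonstrate by simulation. Here a hypothetical set of 20 hospitals is measured twice, with noise comparable to the spread in true quality, and the two sets of estimated ranks barely agree:

```python
import numpy as np

rng = np.random.default_rng(1)

# 20 hospitals with modestly different true quality, each observed with
# sampling noise of similar magnitude to the spread in quality.
true = np.linspace(0.0, 1.0, 20)
noise_sd = 1.0

obs1 = true + rng.normal(0, noise_sd, 20)   # replication 1
obs2 = true + rng.normal(0, noise_sd, 20)   # replication 2

rank1 = obs1.argsort().argsort()
rank2 = obs2.argsort().argsort()
corr = np.corrcoef(rank1, rank2)[0, 1]
print(f"rank agreement across replications: {corr:.2f}")
```

The rank correlation across the two replications is far below 1: a decision maker comparing "league tables" from two years would mostly be chasing noise.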

I encountered this news article, "Chicago school bans some lunches brought from home":

At Little Village, most students must take the meals served in the cafeteria or go hungry or both. . . . students are not allowed to pack lunches from home. Unless they have a medical excuse, they must eat the food served in the cafeteria. . . . Such discussions over school lunches and healthy eating echo a larger national debate about the role government should play in individual food choices. "This is such a fundamental infringement on parental responsibility," said J. Justin Wilson, a senior researcher at the Washington-based Center for Consumer Freedom, which is partially funded by the food industry. . . . For many CPS parents, the idea of forbidding home-packed lunches would be unthinkable. . . .

If I had read this two years ago, I'd be at one with J. Justin Wilson and the outraged kids and parents. But last year we spent a sabbatical in Paris, where . . . kids aren't allowed to bring lunches to school. The kids who don't go home for lunch have to eat what's supplied by the lunch ladies in the cafeteria. And it's just fine. Actually, it was more than fine because we didn't have to prepare the kids' lunches every day. When school let out, the kids would run to the nearest boulangerie and get something sweet. So they didn't miss out on the junk food either.

I'm not saying the U.S. system or the French system is better, nor am I expressing an opinion on how they do things in Chicago. I just think it's funny how a rule which seems incredibly restrictive from one perspective is simply, for others, the way things are done. I'll try to remember this story next time I'm outraged at some intolerable violation of my rights.

P.S. If they'd had the no-lunches-from-home rule when I was a kid, I definitely would've snuck food into school. In high school the wait for lunchtime was interminable.

Ryan King writes:

This involves causal inference, hierarchical setup, small effect sizes (in absolute terms), and will doubtless be heavily reported in the media.

The article is by Manudeep Bhuller, Tarjei Havnes, Edwin Leuven, and Magne Mogstad and begins as follows:

Does internet use trigger sex crime? We use unique Norwegian data on crime and internet adoption to shed light on this question. A public program with limited funding rolled out broadband access points in 2000-2008, and provides plausibly exogenous variation in internet use. Our instrumental variables and fixed effect estimates show that internet use is associated with a substantial increase in reported incidences of rape and other sex crimes. We present a theoretical framework that highlights three mechanisms for how internet use may affect reported sex crime, namely a reporting effect, a matching effect on potential offenders and victims, and a direct effect on crime propensity. Our results indicate that the direct effect is non-negligible and positive, plausibly as a result of increased consumption of pornography.

How big is the effect?

I had a couple of email exchanges with Jan-Emmanuel De Neve and James Fowler, two of the authors of the article on the gene that is associated with life satisfaction which we blogged the other day. (Bruno Frey, the third author of the article in question, is out of town according to his email.) Fowler also commented directly on the blog.

I won't go through all the details, but now I have a better sense of what's going on. (Thanks, Jan and James!) Here's my current understanding:

1. The original manuscript was divided into two parts: an article by De Neve alone published in the Journal of Human Genetics, and an article by De Neve, Fowler, Frey, and Nicholas Christakis submitted to Econometrica. The latter paper repeats the analysis from the Adolescent Health survey and also replicates with data from the Framingham heart study (hence Christakis's involvement).

The Framingham study measures a slightly different gene and uses a slightly different life-satisfaction question compared to the Adolescent Health survey, but De Neve et al. argue that they're close enough for the study to be considered a replication. I haven't tried to evaluate this particular claim but it seems plausible enough. They find an association with a p-value of exactly 0.05. That was close! (For some reason they don't control for ethnicity in their Framingham analysis--maybe that would pull the p-value to 0.051 or something like that?)

2. Their gene is correlated with life satisfaction in their data and the correlation is statistically significant. The key to getting statistical significance is to treat life satisfaction as a continuous response rather than to pull out the highest category and call it a binary variable. I have no problem with their choice; in general I prefer to treat ordered survey responses as continuous rather than discarding information by combining categories.

3. But given their choice of a continuous measure, I think it would be better for the researchers to stick with it and present results as points on the 1-5 scale. From their main regression analysis on the Adolescent Health data, they estimate the effect of having two (compared to zero) "good" alleles as 0.12 (+/- 0.05) on a 1-5 scale. That's what I think they should report, rather than trying to use simulation to wrestle this into a claim about the probability of describing oneself as "very satisfied."

They claim that having the two alleles increases the probability of describing oneself as "very satisfied" by 17%. That's not 17 percentage points, it's 17%, thus increasing the probability from 41% to 1.17*41% = 48%. This isn't quite the 46% that's in the data but I suppose the extra 2% comes from the regression adjustment. Still, I don't see this as so helpful. I think they'd be better off simply describing the estimated improvement as 0.1 on a 1-5 scale. If you really really want to describe the result for a particular category, I prefer percentage points rather than percentages.
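The arithmetic here is worth spelling out, since percentages and percentage points are so easily confused:

```python
# Reproducing the arithmetic in the post: a 17% (relative) increase
# applied to a baseline probability of 41%.
baseline = 0.41
relative_increase = 0.17

new_prob = baseline * (1 + relative_increase)
print(round(new_prob, 2))          # 48%, as stated above

# The same change expressed in percentage points:
pp_change = new_prob - baseline
print(round(100 * pp_change, 1))   # about 7 points
```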

4. Another advantage of describing the result as 0.1 on a 1-5 scale is that it is more consistent with intuitive notions of 1% of variance explained. It's good they have this 1% in their article--I should present such R-squared summaries in my own work, to give a perspective on the sizes of the effects that I find.

5. I suspect the estimated effect of 0.1 is an overestimate. I say this for the usual reason, discussed often on this blog, that statistically significant findings, by their very nature, tend to be overestimates. I've sometimes called this the statistical significance filter, although "hurdle" might be a more appropriate term.
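A quick simulation shows the filter at work, taking the study's reported estimate (0.1) and standard error (0.05) as the assumed truth:

```python
import numpy as np

rng = np.random.default_rng(2)

# Statistical significance filter: if the true effect is 0.1 with
# standard error 0.05, then among replications that happen to reach
# significance, the average estimate exceeds the true effect.
true_effect, se = 0.10, 0.05
est = rng.normal(true_effect, se, 100_000)
significant = est[np.abs(est) > 1.96 * se]

print(np.mean(significant))   # noticeably larger than 0.10
```

Conditioning on clearing the significance hurdle inflates the average reported effect, which is why a just-significant published estimate is probably an overestimate.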

6. Along with the 17% number comes a claim that having one allele gives an 8% increase. 8% is half of 17% (subject to rounding) and, indeed, their estimate for the one-allele case comes from their fitted linear model. That's fine--but the data aren't really informative about the one-allele case! I mean, sure, the data are perfectly consistent with the linear model, but the nature of leverage is such that you really don't get a good estimate on the curvature of the dose-response function. (See my 2000 Biostatistics paper for a general review of this point.) The one-allele estimate is entirely model-based. It's fine, but I'd much prefer simply giving the two-allele estimate and then saying that the data are consistent with a linear model, rather than presenting the one-allele estimate as a separate number.

7. The news reports were indeed horribly exaggerated. No fault of the authors but still something to worry about. The Independent's article was titled, "Discovered: the genetic secret of a happy life," and the Telegraph's was not much better: "A 'happiness gene' which has a strong influence on how satisfied people are with their lives, has been discovered." An effect of 0.1 on a 1-5 scale: an influence, sure, but a "strong" influence?

8. There was some confusion with conditional probabilities that made its way into the reports as well. From the Telegraph:

The results showed that a much higher proportion of those with the efficient (long-long) version of the gene were either very satisfied (35 per cent) or satisfied (34 per cent) with their life - compared to 19 per cent in both categories for those with the less efficient (short-short) form.

After looking at the articles carefully and having an email exchange with De Neve, I can assure you that the above quote is indeed wrong, which is really too bad because it was an attempted correction of an earlier mistake. The correct numbers are not 35, 34, 19, 19. Rather, they are 41, 46, 37, 44. A much less dramatic difference: changes of 4% and 2% rather than 18% and 15%. The Telegraph reporter was giving P(gene|happiness) rather than P(happiness|gene). What seems to have happened is that he misread Figure 2 in the Human Genetics paper. He then may have got stuck on the wrong track by expecting to see a difference of 17%.
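The distinction is easy to check with Bayes' rule. The gene frequency below is hypothetical; the conditional probabilities are roughly the ones quoted above for the "very satisfied" category:

```python
# Reversed conditional probabilities are different quantities.
# p_gene is a hypothetical carrier frequency for illustration.
p_gene = 0.30
p_happy_given_gene = 0.41
p_happy_given_nogene = 0.37

# Marginal probability of being very satisfied
p_happy = (p_gene * p_happy_given_gene
           + (1 - p_gene) * p_happy_given_nogene)

# Bayes' rule gives the reversed conditional
p_gene_given_happy = p_gene * p_happy_given_gene / p_happy
print(round(p_gene_given_happy, 3))
```

P(gene|happiness) and P(happiness|gene) need not be anywhere near each other, which is exactly the trap the reporter fell into.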

9. The abstract for the Human Genetics paper reports a p-value of 0.01. But the baseline model (Model 1 in Table V of the Econometrica paper) reports a p-value of 0.02. The lower p-values are obtained by models that control for a big pile of intermediate outcomes.

10. In section 3 of the Econometrica paper, they compare identical to fraternal twins (from the Adolescent Health survey, it appears) and estimate that 33% of the variation in reported life satisfaction is explained by genes. As they say, this is roughly consistent with estimates of 50% or so from the literature. I bet their 33% has a big standard error, though: one clue is that the difference in correlations between identical and fraternal twins is barely statistically significant (at the 0.03 level, or, as they quaintly put it, 0.032). They also estimate 0% of the variation to be due to common environment, but again that 0% is gonna be a point estimate with a huge standard error.

I'm not saying that their twin analysis is wrong. To me the point of these estimates is to show that the Adolescent Health data are consistent with the literature on genes and happiness, thus supporting the decision to move on with the rest of their study. I don't take their point estimates of 33% and 0% seriously but it's good to know that the twin results go in the expected direction.
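For reference, the standard way to turn twin correlations into heritability estimates is Falconer's formula (I don't know that this is exactly the authors' method). The correlations below are hypothetical, chosen only to reproduce the 33% and 0% figures:

```python
# Falconer's formula: heritability is roughly twice the difference
# between identical-twin and fraternal-twin correlations; the shared-
# environment component is what's left of the identical-twin correlation.
r_mz = 0.33    # identical twins (hypothetical)
r_dz = 0.165   # fraternal twins (hypothetical)

h2 = 2 * (r_mz - r_dz)   # heritability estimate
c2 = r_mz - h2           # common-environment estimate
print(round(h2, 2), round(c2, 2))
```

Because these estimates are differences of two noisy correlations multiplied by two, their standard errors are large, which is the point made above about not taking the 33% and 0% too seriously.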

11. One thing that puzzles me is why De Neve et al. only studied one gene. I understand that this is the gene that they expected to relate to happiness and life satisfaction, but . . . given that it only explains 1% of the variation, there must be hundreds or thousands of genes involved. Why not look at lots and lots? At the very least, the distribution of estimates over a large sample of genes would give some sense of the variation that might be expected. I can't see the point of looking at just one gene, unless cost is a concern. Are other gene variants already recorded for the Adolescent Health and Framingham participants?

12. My struggles (and the news reporters' larger struggles) with the numbers in these articles makes me feel, even more strongly than before, the need for a suite of statistical methods for building from simple comparisons to more complicated regressions. (In case you're reading this, Bob and Matt3, I'm talking about the network of models.)

As researchers, transparency should be our goal. This is sometimes hindered by scientific journals' policies of brevity. You can end up having to remove lots of the details that make a result understandable.

13. De Neve concludes the Human Genetics article as follows:

There is no single "happiness gene." Instead, there is likely to be a set of genes whose expression, in combination with environmental factors, influences subjective well-being.

I would go even further. Accepting their claim that between one-third and one-half of the variation in happiness and life satisfaction is determined by genes, and accepting their estimate that this one gene explains as much as 1% of the variation, and considering that this gene was their #1 candidate (or at least a top contender) for the "happiness gene" . . . my guess is that the set of genes that influence subjective well-being is a very large number indeed! The above disclaimer doesn't seem disclaimery-enough to me, in that it seems to leave open the possibility that this "set of genes" might be just three or four. Hundreds or thousands seems more like it.

I'm reminded of the recent analysis that found that the simple approach of predicting child's height using a regression model given parents' average height performs much better than a method based on combining 54 genes.

14. Again, I'm not trying to present this as any sort of debunking, merely trying to fit these claims in with the rest of my understanding. I think it's great when social scientists and public health researchers can work together on this sort of study. I'm sure that in a couple of decades we'll have a much better understanding of genes and subjective well-being, but you have to start somewhere. This is a clean study that can be the basis for future research.

Hmmm . . . .could I publish this as a letter in the Journal of Human Genetics? Probably not, unfortunately.

P.S. You could do this all yourself! This and my earlier blog on the happiness gene study required no special knowledge of subject matter or statistics. All I did was tenaciously follow the numbers and pull and pull until I could see where all the claims were coming from. A statistics student, or even a journalist with a few spare hours, could do just as well. (Why I had a few spare hours to do this is another question. The higher procrastination, I call it.) I probably could've done better with some prior knowledge--I know next to nothing about genetics and not much about happiness surveys either--but I could get pretty far just tracking down the statistics (and, as noted, without any goal of debunking or any need to make a grand statement).

P.P.S. See comments for further background from De Neve and Fowler!

Howard Wainer writes in the Statistics Forum:

The Chinese scientific literature is rarely read or cited outside of China. But the authors of this work are usually knowledgeable about the non-Chinese literature--at least the A-list journals. And so they too try to replicate the alpha finding. But do they? One would think that they would find the same diminished effect size, but they don't! Instead they replicate the original result, even larger. Here's one of the graphs:

How did this happen?

Full story here.

A graduate student in public health writes:

I have been asked to do the statistical analysis for a medical unit that is delivering a pilot study of a program to [details redacted to prevent identification]. They are using a prospective, nonrandomized, cohort-controlled trial study design.

The investigator thinks they can recruit only a small number of treatment and control cases, maybe fewer than 30 in total. After I told the investigator that I cannot do anything statistically with a sample size that small, he responded that small sample sizes are common in this field, and he sent me an example of an analysis that someone had done on a similar study.

So he still wants me to come up with a statistical plan. Is it unethical for me to do anything other than descriptive statistics? I think he should just stick to qualitative research. But the study he sent has 40 subjects and apparently had enough power to detect some effects. This is a pilot study after all, so the n does not have to be large. It's not randomized, though, so I would think it would need a larger n because of the weak design.

My reply:

My first, general, recommendation is that it always makes sense to talk with any person as if he is completely ethical. If he is ethical, this is a good idea, and if he is not, you don't want him to think you think badly of him. If you are worried about a serious ethical problem, you can ask about it by saying something like, "From the outside, this could look pretty bad. An outsider, seeing this plan, might think we are being dishonest etc. etc." That way you can express this view without it being personal. And maybe your colleague has a good answer, which he can tell you.

To get to your specific question, there is really no such thing as a minimum acceptable sample size. You can get statistical significance with n=5 if your signal is strong enough.
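A two-line calculation makes the point: with a large enough signal-to-noise ratio, even n=5 clears the significance bar. Hypothetical data:

```python
import numpy as np

# Testing whether the mean differs from zero with only n=5 observations.
# The signal is huge relative to the noise, so the test is significant.
y = np.array([4.8, 5.1, 5.3, 4.9, 5.2])

t = y.mean() / (y.std(ddof=1) / np.sqrt(len(y)))
t_crit = 2.776          # two-sided 5% critical value, t with 4 df

print(f"t = {t:.1f}  (significant: {abs(t) > t_crit})")
```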

Generally, though, the purpose of a pilot study is not to get statistical significance but rather to get experience with the intervention and the measurements. It's ok to do a pilot analysis, recognizing that it probably won't reach statistical significance. Also, regardless of sample size, qualitative analysis is appropriate and necessary in any pilot study.

Finally, of course they should not imply that they can collect a larger sample size than they can actually do.

Data mining and allergies


With all this data floating around, there are some interesting analyses one can do. I came across "The Association of Tree Pollen Concentration Peaks and Allergy Medication Sales in New York City: 2003-2008" by Perry Sheffield, which correlates pollen counts with anti-allergy medicine sales--and indeed finds that medicine sales are highest two days after high pollen counts.
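The lagged-correlation idea can be sketched in a few lines. This uses simulated pollen and sales series (not the study's data), with sales built to respond two days after pollen:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated daily pollen counts and medicine sales; sales respond to
# pollen with a two-day delay plus noise.
days = 200
pollen = np.maximum(rng.normal(50, 20, days), 0)
sales = np.roll(pollen, 2) * 0.8 + rng.normal(0, 5, days)

def lag_corr(x, y, lag):
    # correlation of x[t] with y[t + lag]
    return np.corrcoef(x[: len(x) - lag], y[lag:])[0, 1]

# Scan candidate lags and pick the one with the highest correlation
best_lag = max(range(6), key=lambda k: lag_corr(pollen, sales, k))
print(best_lag)
```

Scanning the cross-correlation over candidate lags recovers the two-day delay, which is the kind of pattern the Sheffield analysis reports.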


Of course, it would be interesting to play with the data to see *what* tree is actually causing the sales to increase the most. Perhaps this would help the arborists decide what trees to plant. At the moment they seem to be following a rather sexist approach to tree planting:

Ogren says the city could solve the problem by planting only female trees, which don't produce pollen like male trees do.

City arborists shy away from females because many produce messy - or in the case of ginkgos, smelly - fruit that litters sidewalks.

In Ogren's opinion, that's a mistake. He says the females only produce fruit because they are pollinated by the males.

His theory: no males, no pollen, no fruit, no allergies.

This announcement might be of interest to some of you. The application deadline is in just a few days:

The National Center for Complementary and Alternative Medicine at the National Institutes of Health is seeking an additional experienced statistician to join our Office of Clinical and Regulatory Affairs team. Applications are being accepted through April 22, 2011 for the general announcement and April 21 for status (typically current federal employee) candidates. To apply to this announcement or for more information, click on the links provided below or the USAJobs link provided above and search for NIH-NCCAM-DE-11-448747 (external) or NIH-NCCAM-MP-11-448766 (internal).

You have to be a U.S. citizen for this one.

Remember that bizarre episode in Freakonomics 2, where Levitt and Dubner went to the Batcave-like lair of a genius billionaire who told them that "the problem with solar panels is that they're black"? I'm not the only one who wondered at the time: of all the issues to bring up about solar power, why that one?

Well, I think I've found the answer in this article by John Lanchester:

In 2004, Nathan Myhrvold, who had, five years earlier, at the advanced age of forty, retired from his job as Microsoft's chief technology officer, began to contribute to the culinary discussion board . . . At the time he grew interested in sous vide, there was no book in English on the subject, and he resolved to write one. . . . broadened it further to include information about the basic physics of heating processes, then to include the physics and chemistry of traditional cooking techniques, and then to include the science and practical application of the highly inventive new techniques that are used in advanced contemporary restaurant food--the sort of cooking that Myhrvold calls "modernist."

OK, fine. But what does this have to do with solar panels? Just wait:

Notwithstanding its title, "Modernist Cuisine" contains hundreds of pages of original, firsthand, surprising information about traditional cooking. Some of the physics is quite basic: it had never occurred to me that the reason many foods go from uncooked to burned at such speed is that light-colored foods reflect heat better than dark: "As browning reactions begin, the darkening surface rapidly soaks up more and more of the heat rays. The increase in temperature accelerates dramatically."

Aha! Now, I'm just guessing here, but my conjecture is that after studying this albedo effect in the kitchen, Myhrvold was primed to see it everywhere. Of course, maybe it went the other way: he was thinking about solar panels first and then applied his ideas to the kitchen. But, given that the experts seem to think the albedo effect is a red herring (so to speak) regarding solar panels, I wouldn't be surprised if Myhrvold just started talking about reflectivity because it was on his mind from the cooking project. My own research ideas often leak from one project to another, so I wouldn't be surprised if this happens to others too.

P.S. More here and here.

This came in the inbox today:

In the spirit of Gapminder, the Washington Post created an interactive scatterplot viewer that uses an alpha channel to distinguish overlapping fat dots, doing better than the sort-by-circle-size approach Gapminder uses:


Good news: the rate of fattening of the USA appears to be slowing down. Maybe because of high gas prices? But what's happening with Oceania?

Kaiser nails it. The offending article, by John Tierney, somehow ended up in the Science section rather than the Opinion section. As an opinion piece (or, for that matter, a blog), Tierney's article would be nothing special. But I agree with Kaiser that it doesn't work as a newspaper article. As Kaiser notes, this story involves a bunch of statistical and empirical claims that are not well resolved by P.R. and rhetoric.

This post is by Phil Price.

An Oregon legislator, Mitch Greenlick, has proposed to make it illegal in Oregon to carry a child under six years old on one's bike (including in a child seat) or in a bike trailer. The guy says, "We've just done a study showing that 30 percent of riders biking to work at least three days a week have some sort of crash that leads to an injury. . . . When that's going on out there, what happens when you have a four year old on the back of a bike?" The study is from Oregon Health Sciences University, at which the legislator is a professor.

Greenlick also says, "If it's true that it's unsafe, we have an obligation to protect people. If I thought a law would save one child's life, I would step in and do it. Wouldn't you?"

There are two statistical issues here. The first is in the category of "lies, damn lies, and statistics," and involves the statement about how many riders have injuries. As quoted on a blog, the author of the study in question says that, when it comes to what is characterized as an injury, "It could just be skinning your knee or spraining your ankle, but it couldn't just be a near miss." By this standard, lots of other things one might do with one's child -- such as playing with her, for instance -- might be even more likely to cause injury.

Substantial numbers of people have been taking their children on bikes for quite a while now, so although it may be impossible to get accurate numbers for the number of hours or miles ridden, there should be enough data on fatalities and severe injuries to get a semi-quantitative idea of how dangerous it is to take a child on a bike or in a bike trailer. And when I say "dangerous" I mean, you know, actually dangerous.

The second problem with Greenlick's approach is that it seems predicated on the idea that, in his words, "If I thought a law would save one child's life, I would step in and do it. Wouldn't you?" Well, no, and in fact that is just a ridiculous principle to apply. Any reasonable person should be in favor of saving children's lives, but not at all cost. We could make it illegal to allow children to climb trees, to eat peanuts, to cross the street without holding an adult's hand...perhaps they shouldn't be allowed to ride in cars. Where would it end?

Finally, a non-statistical note: another state rep has commented regarding this bill, saying that "this is the way the process often works: a legislator gets an idea, drafts a bill, introduces it, gets feedback, and then decides whether to try to proceed, perhaps with amendments, or whether to let it die." If true, this is a really wasteful and inefficient system. Better would be "a legislator gets an idea, does a little research to see if it makes sense, introduces it,..." Introducing it before seeing if it makes sense is probably a lot easier in the short run, but it means a lot of administrative hassle in introducing the bills, and it makes people waste time and effort trying to kill or modify ill-conceived bills.

Gur Huberman asks what I think of this magazine article by Jonah Lehrer (see also here).

My reply is that it reminds me a bit of what I wrote here. Or see here for the quick powerpoint version: The short story is that if you screen for statistical significance when estimating small effects, you will necessarily overestimate the magnitudes of effects, sometimes by a huge amount. I know that Dave Krantz has thought about this issue for a while; it came up when Francis Tuerlinckx and I wrote our paper on Type S errors, ten years ago.
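This overestimation under significance screening is easy to see in simulation. Here's a minimal sketch; the numbers (a true effect one-tenth the size of the standard error) are invented for illustration:

```python
import random

random.seed(42)

true_effect = 0.1   # small true effect, in standard-error units
se = 1.0            # standard error of each study's estimate
sims = 100_000

# Keep only the estimates that pass the p < 0.05 filter
significant = []
for _ in range(sims):
    est = random.gauss(true_effect, se)
    if abs(est) > 1.96 * se:
        significant.append(est)

mean_sig = sum(abs(e) for e in significant) / len(significant)
exaggeration = mean_sig / true_effect
wrong_sign = sum(e < 0 for e in significant) / len(significant)

print(f"{len(significant) / sims:.1%} of studies significant")
print(f"average |significant estimate| = {mean_sig:.2f}, "
      f"about {exaggeration:.0f}x the true effect")
print(f"{wrong_sign:.1%} of significant estimates have the wrong sign (Type S)")
```

With these settings, roughly 5% of studies reach significance, but a significant estimate averages more than twenty times the true effect, and a substantial fraction point in the wrong direction.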

My current thinking is that most (almost all?) research studies of the sort described by Lehrer should be accompanied by retrospective power analyses, or informative Bayesian inferences. Either of these approaches--whether classical or Bayesian, the key is that they incorporate real prior information, just as is done in a classical prospective power analysis--would, I think, moderate the tendency to overestimate the magnitude of effects.

In answer to the question posed by the title of Lehrer's article, my answer is Yes, there is something wrong with the scientific method, if this method is defined as running experiments and doing data analysis in a patternless way and then reporting, as true, results that pass a statistical significance threshold.

And corrections for multiple comparisons will not solve the problem: such adjustments merely shift the threshold without resolving the problem of overestimation of small effects.

I received the following press release from the Heritage Provider Network, "the largest limited Knox-Keene licensed managed care organization in California." I have no idea what this means, but I assume it's some sort of HMO.

In any case, this looks like it could be interesting:

Participants in the Health Prize challenge will be given a data set comprised of the de-identified medical records of 100,000 individuals who are members of HPN. The teams will then need to predict the hospitalization of a set percentage of those members who went to the hospital during the year following the start date, and do so with a defined accuracy rate. The winners will receive the $3 million prize. . . . the contest is designed to spur involvement by others involved in analytics, such as those involved in data mining and predictive modeling who may not currently be working in health care. "We believe that doing so will bring innovative thinking to health analytics and may allow us to solve at least part of the health care cost conundrum . . ."

I don't know enough about health policy to know if this makes sense. Ultimately, the goal is not to predict hospitalization, but to avoid it. But maybe if you can predict it well, it could be possible to design the system a bit better. The current system--in which the doctor's office is open about 40 hours a week, and otherwise you have to go to the emergency room--is a joke.

Sander Wagner writes:

I just read the post on ethical concerns in medical trials. As there seems to be a lot more pressure on private researchers, I thought it might be a nice little exercise to compare p-values from privately funded medical trials with those reported from publicly funded research, to see if confirmation pressure is higher in private research (i.e., p-values are closer to the cutoff levels for significance for the privately funded research). Do you think this is a decent idea or are you sceptical? Also, are you aware of any sources listing a large number of representative medical studies and their type of funding?

My reply:

This sounds like something worth studying. I don't know where to get data about this sort of thing, but now that it's been blogged, maybe someone will follow up.

Diabetes stops at the state line?


From Discover:


Razib Khan asks:

But follow the gradient from El Paso to the Illinois-Missouri border. The differences are small across state lines, but the consistent differences along the borders really don't make sense. Are there state-level policies or regulations causing this? Or are there state-level differences in measurement? This weird pattern shows up in other CDC data I've seen.

Turns out that the CDC isn't providing data; they're providing a model. Frank Howland answered:

I suspect the answer has to do with the manner in which the county estimates are produced. I went to the original data source, the CDC, and then to the relevant FAQ.

There they say that the diabetes prevalence estimates come from the "CDC's Behavioral Risk Factor Surveillance System (BRFSS) and data from the U.S. Census Bureau's Population Estimates Program. The BRFSS is an ongoing, monthly, state-based telephone survey of the adult population. The survey provides state-specific information"

So the CDC then uses a complicated statistical procedure ("indirect model-dependent estimates" using Bayesian techniques and multilevel Poisson regression models) to go from state to county prevalence estimates. My hunch is that the state level averages thereby affect the county estimates. The FAQ in fact says "State is included as a county-level covariate."

I'd prefer to have real data, not a model. I'd do the model myself, thank you. Data itself is tricky enough, as J. Stamp said.
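Howland's hunch--that shrinking noisy county estimates toward state-level quantities manufactures jumps at state lines--can be sketched with made-up numbers. This illustrates the mechanism only; it is not the CDC's actual model:

```python
# Twenty hypothetical counties along a west-east line. True prevalence
# varies smoothly, but the two states have different survey-based means.
true_prev = [0.08 + 0.001 * i for i in range(20)]   # smooth gradient
state_of = ['A'] * 10 + ['B'] * 10
state_mean = {'A': 0.07, 'B': 0.12}                 # state-level survey averages

# With small county samples, the model leans heavily on the state mean.
shrink = 0.8  # weight on the state mean (an assumed, illustrative value)

estimates = [shrink * state_mean[s] + (1 - shrink) * p
             for p, s in zip(true_prev, state_of)]

interior_step = estimates[5] - estimates[4]     # step between counties within state A
jump_at_border = estimates[10] - estimates[9]   # step across the A/B state line

print(f"step inside state A:     {interior_step:+.4f}")  # +0.0002
print(f"step at the A/B border:  {jump_at_border:+.4f}")  # +0.0402
```

Even though the underlying prevalence changes by the same 0.001 everywhere, the shrinkage toward different state means produces a discontinuity two hundred times larger at the border--exactly the kind of artifact Razib Khan noticed in the map.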

OK, here's something that is completely baffling me. I read this article by John Colapinto on the neuroscientist V. S. Ramachandran, who's famous for his innovative treatment for "phantom limb" pain:

His first subject was a young man who a decade earlier had crashed his motorcycle and torn from his spinal column the nerves supplying the left arm. After keeping the useless arm in a sling for a year, the man had the arm amputated above the elbow. Ever since, he had felt unremitting cramping in the phantom limb, as though it were immobilized in an awkward position. . . . Ramachandran positioned a twenty-inch-by-twenty-inch drugstore mirror . . . and told him to place his intact right arm on one side of the mirror and his stump on the other. He told the man to arrange the mirror so that the reflection created the illusion that his intact arm was the continuation of the amputated one. Then Ramachandran asked the man to move his right and left arms . . . "Oh, my God!" the man began to shout. . . . For the first time in ten years, the patient could feel his phantom limb "moving," and the cramping pain was instantly relieved. After the man had used the mirror therapy ten minutes a day for a month, his phantom limb shrank . . .

Ramachandran conducted the experiment on eight other amputees and published the results in Nature, in 1995. In all but one patient, phantom hands that had been balled into painful fists opened, and phantom arms that had stiffened into agonizing contortions straightened. . . .

So far, so good. But then the story continues:

Dr. Jack Tsao, a neurologist for the U.S. Navy . . . read Ramachandran's Nature paper on mirror therapy for phantom-limb pain. . . . Several years later, in 2004, Tsao began working at Walter Reed Military Hospital, where he saw hundreds of soldiers with amputations returning from Iraq and Afghanistan. Ninety percent of them had phantom-limb pain, and Tsao, noting that the painkillers routinely prescribed for the condition were ineffective, suggested mirror therapy. "We had a lot of skepticism from the people at the hospital, my colleagues as well as the amputee subjects themselves," Tsao said. But in a clinical trial of eighteen service members with lower-limb amputations . . . the six who used the mirror reported that their pain decreased [with no corresponding improvement in the control groups] . . . Tsao published his results in the New England Journal of Medicine, in 2007. "The people who really got completely pain-free remain so, two years later," said Tsao, who is currently conducting a study involving mirror therapy on upper-limb amputees at Walter Reed.

At first, this sounded perfectly reasonable: Bold new treatment is dismissed by skeptics but then is proved to be a winner in a clinical trial. But . . . wait a minute! I have some questions:

1. Ramachandran published his definitive paper in 1995 in a widely-circulated journal. Why did his mirror therapy not become the standard approach, especially given that "the painkillers routinely prescribed for the condition were ineffective"? Why were these ineffective painkillers "routinely prescribed" at all?

2. When Tsao finally got around to trying a therapy that had been published nine years before, why did they have "a lot of skepticism from the people at the hospital"?

3. If Tsao saw "hundreds of soldiers" with phantom-limb pain, why did he try the already-published mirror therapy on only 18 of them?

4. How come, in 2009, two years after his paper in the New England Journal of Medicine--and fourteen years after Ramachandran's original paper in Nature--even now, Tsao is "currently conducting a study involving mirror therapy"? Why isn't he doing mirror therapy on everybody?

Ok, maybe I have the answer to the last question: Maybe Tsao's current (as of 2009) study is of different variants of mirror therapy. That is, maybe he is doing it on everybody, just in different ways. That would make sense.

But I don't understand items 1,2,3 above at all. There must be some part of the story that I'm missing. Perhaps someone could explain?

P.S. More here.

Scott Berry, Brad Carlin, Jack Lee, and Peter Muller recently came out with a book with the above title.

The book packs a lot into its 280 pages and is fun to read as well (even if they do use the word "modalities" in their first paragraph, and later on they use the phrase "DIC criterion," which upsets my tidy, logical mind). The book starts off fast on page 1 and never lets go.

Clinical trials are a big part of statistics and it's cool to see the topic taken seriously and being treated rigorously. (Here I'm not talking about empty mathematical rigor (or, should I say, "rigor"), so-called optimal designs and all that, but rather the rigor of applied statistics, mapping models to reality.)

Also I have a few technical suggestions.

1. The authors fit a lot of models in Bugs, which is fine, but they go overboard on the WinBUGS thing. There's WinBUGS, OpenBUGS, JAGS: they're all Bugs. They recommend running Bugs from R using the clunky BRugs interface rather than the smoother bugs() function, which has good defaults and conveniently returns graphical summaries and convergence diagnostics. The result is to get tangled in software complications and to distance the user from statistical modeling.

2. On page 61 they demonstrate an excellent graphical summary that reveals that, in a particular example, their posterior distribution is improper--or, strictly speaking, that the posterior depends strongly on the choice of an arbitrary truncation point in the prior distribution. But then they stick with the bad model! Huh? This doesn't seem like such a good idea.

3. They cover all of Bayesian inference in a couple chapters, which is fine--interested readers can learn the whole thing from the Carlin and Louis book--but in their haste they sometimes slip up. For example, from page 5:

Randomization minimizes the possibility of selection bias, and it tends to balance the treatment groups over covariates, both known and unknown. There are differences, however, in the Bayesian and frequentist views of randomization. In the latter, randomization serves as the basis for inference, whereas the basis for inference in the Bayesian approach is subjective probability, which does not require randomization.

I get their general drift but I don't agree completely. First, randomization is a basis for frequentist inference, but it's not fair to call it the basis. There's lots of frequentist inference for nonrandomized studies. Second, I agree that the basis for Bayesian inference is probability but I don't buy the "subjective" part (except to the extent that all science is subjective). Third, the above paragraph leaves out why a Bayesian would want to randomize. The basic reason is robustness, as we discuss in chapter 7 of BDA.

4. I was wondering what the authors would say about Sander Greenland's work on multiple-bias modeling. Greenland uses Bayesian methods and has thought a lot about bias and causal inference in practical medical settings. I looked up Greenland in the index and all I could find was one page, which referred to some of his more theoretical work:

Greenland, Lanes, and Jara (2008) explore the use of structural nested models and advocate what they call g-estimation, a form of test-based estimation adhering to the ITT principle and accommodating a semiparametric Cox partial likelihood.

Nothing on multiple-bias modeling. Also I didn't see any mention of this paper by John "no relation" Carlin and others. Finally, the above paragraph is a bit odd in that "test-based estimation" and "semiparametric Cox partial likelihood" are nowhere defined in the book (or, at least, I couldn't find them in the index). I mean, sure, the reader can google these things, but I'd really like to see these ideas presented in the context of the book.

5. The very last section covers subgroup analysis and then mentions multilevel models (the natural Bayesian approach to the problem) but then doesn't really follow through. They go into a long digression on decision analysis. That's fine, but I'd like to see a worked example of a multilevel model for subgroup analysis, instead of just the reference to Hodges et al. (2007).

In summary, I like this book and it left me wanting even more. I hope that everyone working on clinical trials reads it and that it has a large influence.

And, just to be clear, most of my criticisms above are of the form, "I like it and want more." In particular, my own books don't have anything to say on multiple-bias models, test-based estimation, semiparametric Cox partial likelihood, multilevel models for subgroup analysis, or various other topics I'm asking for elaboration on. As it stands, Berry, Carlin, Lee, and Muller have packed a lot into 280 pages.

The deadline for this year's Earth Institute postdocs is 1 Dec, so it's time to apply right away! It's a highly competitive interdisciplinary program, and we've had some statisticians in the past.

We're particularly interested in statisticians who have research interests in development and public health. It's fine--not just fine, but ideal--if you are interested in statistical methods also.

Ethical concerns in medical trials


I just read this article on the treatment of medical volunteers, written by doctor and bioethicist Carl Elliott.

As a statistician who has done a small amount of consulting for pharmaceutical companies, I have a slightly different perspective. As a doctor, Elliott focuses on individual patients, whereas, as a statistician, I've been trained to focus on the goal of accurately estimating treatment effects.

I'll go through Elliott's article and give my reactions.

Paul Nee sends in this amusing item:

After learning of a news article by Amy Harmon on problems with medical trials--sometimes people are stuck getting the placebo when they could really use the experimental treatment, and it can be a life-or-death difference, John Langford discusses some fifteen-year-old work on optimal design in machine learning and makes the following completely reasonable point:

With reasonable record keeping of existing outcomes for the standard treatments, there is no need to explicitly assign people to a control group with the standard treatment, as that approach is effectively explored with great certainty. Asserting otherwise would imply that the nature of effective treatments for cancer has changed between now and a year ago, which denies the value of any clinical trial. . . .

Done the right way, the clinical trial for a successful treatment would start with some initial small pool (equivalent to "phase 1" in the article) and then simply expand the pool of participants over time as it proved superior to the existing treatment, until the pool is everyone. And as a bonus, you can even compete with policies on treatments rather than raw treatments (i.e. personalized medicine).

Langford then asks: if these ideas are so good, why aren't they done already? He conjectures:

Getting from here to there seems difficult. It's been 15 years since EXP3.P was first published, and the progress in clinical trial design seems glacial to us outsiders. Partly, I think this is a communication and education failure, but partly, it's also a failure of imagination within our own field. When we design algorithms, we often don't think about all the applications, where a little massaging of the design in obvious-to-us ways so as to suit these applications would go a long ways.

I agree with these sentiments, but . . . the sorts of ideas Langford is talking about have been around in statistics for a long, long time--much more than 15 years! I welcome the involvement of computer scientists in this area, but it's not simply that the CS people have a great idea and just need to communicate it or adapt it to the world of clinical trials. The clinical trials people already know about these ideas (not with the same terminology, but they're the same basic ideas) but, for various reasons, haven't widely adopted them.
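For readers unfamiliar with the bandit-style designs Langford has in mind, here is a toy sketch of one such idea, Thompson sampling, applied to a hypothetical two-arm trial. The success rates are invented, and real adaptive designs involve far more than this; the point is only that allocation drifts toward the better arm as evidence accumulates:

```python
import random

random.seed(7)

# Hypothetical two-arm trial: success probabilities unknown to the design
p_true = {'standard': 0.45, 'new': 0.60}
wins = {'standard': 0, 'new': 0}
losses = {'standard': 0, 'new': 0}

for patient in range(2000):
    # Thompson sampling: draw from each arm's Beta posterior,
    # assign the patient to the arm with the highest draw
    draw = {arm: random.betavariate(wins[arm] + 1, losses[arm] + 1)
            for arm in p_true}
    arm = max(draw, key=draw.get)
    if random.random() < p_true[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1

for arm in p_true:
    n = wins[arm] + losses[arm]
    print(f"{arm}: {n} patients, observed success rate {wins[arm] / max(n, 1):.2f}")
```

Rather than locking half the patients into the standard treatment for the duration, the design keeps a small amount of exploration going while routing most later patients to whichever arm is winning.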

P.S. The news article is by Amy Harmon, but Langford identifies it only as being from the New York Times. I don't think it's appropriate to omit the author's name. The publication is relevant, but it's the reporter who did the work. I certainly wouldn't like it if someone referred to one of my articles by writing, "The Journal of the American Statistical Association reported today that . . ."

Works almost as well, costs a lot less



The placebo effect in pharma


Bruce McCullough writes:

The Sept 2009 issue of Wired had a big article on the increase in the placebo effect, and why it's been getting bigger.

Kaiser Fung has a synopsis.

As if you don't have enough to do, I thought you might be interested in blogging on this.

My reply:

I thought Kaiser's discussion was good, especially this point:

Effect on treatment group = Effect of the drug + effect of belief in being treated

Effect on placebo group = Effect of belief in being treated

Thus, the difference between the two groups = effect of the drug, since the effect of belief in being treated affects both groups of patients.

Thus, as Kaiser puts it, if the treatment isn't doing better than placebo, it doesn't say that the placebo effect is big (let alone "too big") but that the treatment isn't showing any additional effect. It's "treatment + placebo" vs. placebo, not treatment vs. placebo.

That said, I'd prefer for Kaiser to make it clear that the additivity he's assuming is just that--an assumption. Like Kaiser, I don't know much about pharma in particular, but like Kaiser, I feel that the assumption of additivity is a reasonable starting point. I just think it would be clearer to frame this as a battle of assumptions (much as in Rubin's discussion of Lord's Paradox).
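To make the role of that assumption concrete, here's a toy calculation with invented numbers: under additivity the group difference recovers the drug effect exactly, but with an interaction it doesn't:

```python
drug_effect = 5.0
belief_effect = 3.0   # effect of belief in being treated (placebo effect)

# Additive model: the treatment-minus-placebo difference is the drug effect
treatment_group = drug_effect + belief_effect   # 8.0
placebo_group = belief_effect                   # 3.0
additive_diff = treatment_group - placebo_group
print(additive_diff)       # 5.0: exactly the drug effect

# Non-additive model: suppose belief partially substitutes for the drug
# (a hypothetical interaction, chosen just for illustration)
interaction = -1.5
nonadditive_diff = (drug_effect + belief_effect + interaction) - placebo_group
print(nonadditive_diff)    # 3.5: the comparison no longer isolates the drug effect
```

So the trial design subtracts out the placebo effect only to the extent that the drug and belief effects really do add; if they interact, the headline difference mixes the drug effect with the interaction term.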

I also agree with Kaiser that the scientific questions about placebos are interesting. As in much medical research, it's frustrating how the ground seems to keep shifting and how little seems to be known. Or, to put it another way, a lot is known--lots of studies have been done--but nothing seems to be known with much certainty. There are few pillars of knowledge to hold on to, even in a field such as placebos that has been studied for so many decades.

Also, as Kaiser points out, the waters can be muddied by the huge financial conflicts of interests involved in medical research.

This came in the spam the other day:

College Station, TX--August 16, 2010--Change and hope were central themes to the November 2008 U.S. presidential election. A new longitudinal study published in the September issue of Social Science Quarterly analyzes suicide rates at a state level from 1981-2005 and determines that presidential election outcomes directly influence suicide rates among voters.

In states where the majority of voters supported the national election winner, suicide rates decreased. However, counter-intuitively, suicide rates decreased even more dramatically in states where the majority of voters supported the election loser (4.6 percent lower for males and 5.3 percent lower for females). This article is the first in its field to focus on candidate and state-specific outcomes in relation to suicide rates. Prior research on this topic focused on whether the election process itself influenced suicide rates, and found that suicide rates fell during the election season.

Richard A. Dunn, Ph.D., lead author of the study, credits the power of social cohesion, "Sure, supporting the loser stinks, but if everyone around you supported the loser, it isn't as bad because you feel connected to those around you. In other words, it is more comforting to be a Democrat in Massachusetts or Rhode Island when George W. Bush was re-elected than to be the lonely Democrat in Idaho or Oklahoma."

Researchers have commonly thought that people who are less connected to other members of society are more likely to commit suicide. The authors of the study first became interested in this concept when studying the effect of job loss and unemployment on suicide risk, which theoretically causes people to feel less connected to society. The authors realized that while previous work had explored whether events that brought people together and reaffirmed their shared heritage such as elections, war, religious and secular holidays lowered suicide rates, researchers had generally ignored how the outcomes of these events could also influence suicide risk.

The study holds implications for public health researchers studying the determinants of suicide risk, sociologists studying the role of social cohesion and political scientists studying the rhetoric of political campaigns.

I want to laugh at this sort of thing . . . but, hey, I have an article (with Lane Kenworthy) scheduled to appear in Social Science Quarterly. I just hope that when they send out mass emails about it, they link to the article itself rather than, as above, generically to the journal.

More seriously, I don't want to mock these researchers at all. In most of my social science research, I'm a wimp, reporting descriptive results and usually making causal claims in a very cagey way. (There are rare exceptions, such as our estimates of the effect of incumbency and the effects of redistricting. But in these examples we had overwhelming data on our side. Usually, as in Red State, Blue State, I'm content to just report the data and limit my exposure to more general claims.) In contrast, the authors of the above article just go for it. As Jennifer says, causal inference is what people really want--and what they should want--and so my timidity in this regard should be no sort of model for social science researchers.

With regard to the substance of their findings, I don't buy it. The story seems too convoluted, and the analysis seems to have too many potential loopholes, for me to have any confidence at all in the claims presented in the article. Sure, they found an intriguing pattern in their data, but the paper does not look to me to be a thorough examination of the questions that they're studying.

P.S. to those who think I'm being too critical here:

Hey, this is just a blog and I'm talking about a peer-reviewed publication in a respectable journal. I'm not saying that you, the reader, should disbelieve Classen and Dunn's claims, just because I'm not convinced.

I'm a busy person (aren't we all) and don't have the time or inclination right now to go into the depths of the article and find out where their mistakes are (or, alternatively, to look at their article closely enough to be convinced by it). So you can take my criticisms as seriously as they deserve to be taken.

Given that I haven't put in the work, and Classen and Dunn have, I think it's perfectly reasonable for you to believe what they wrote. And it would be completely reasonable for them, if they happen to run across this blog, to respond with annoyance to my free-floating skepticism. I'm just calling this one as I see it, while recognizing that I have not put in the effort to look into it in detail. Those readers who are interested in the subject can feel free to study the matter further.

Hadley Wickham sent me this, by Keith Baggerly and Kevin Coombes:

In this report we [Baggerly and Coombes] examine several related papers purporting to use microarray-based signatures of drug sensitivity derived from cell lines to predict patient response. Patients in clinical trials are currently being allocated to treatment arms on the basis of these results. However, we show in five case studies that the results incorporate several simple errors that may be putting patients at risk. One theme that emerges is that the most common errors are simple (e.g., row or column offsets); conversely, it is our experience that the most simple errors are common.

This is horrible! But, in a way, it's not surprising. I make big mistakes in my applied work all the time. I mean, all the time. Sometimes I scramble the order of the 50 states, or I'm plotting a pure noise variable, or whatever. But usually I don't drift too far from reality because I have a lot of cross-checks and I (or my close collaborators) are extremely familiar with the data and the problems we are studying.

Genetics, though, seems like more of a black box. And, as Baggerly and Coombes demonstrate in their fascinating paper, once you have a hypothesis, it doesn't seem so difficult to keep coming up with what seems like confirming evidence of one sort or another.

To continue the analogy, operating some of these methods seems like knitting a sweater inside a black box: it's a lot harder to notice your mistakes if you can't see what you're doing, and it can be difficult to tell by feel if you even have a functioning sweater when you're done with it all.

Subtle statistical issues to be debated on TV.


There is a live debate that will be available this week for those who might be interested. The topic: Can early-stopped trials result in misleading results of systematic reviews?

Cameron McKenzie writes:

I ran into the attached paper [by Dave Marcotte and Sara Markowitz] on the social benefits of prescription of psychotropic drugs, relating a drop in crime rate to an increase in psychiatric drug prescriptions. It's not my area (which is psychophysics) but I do find this kind of thing interesting. Either people know much more than I think they do, or they are pretending to, and either is interesting. My feeling is that it doesn't pass the sniff test, but I wondered if you might (i) find the paper interesting and/or (ii) perhaps be interested in commenting on it on the blog. It seems to me that if we cumulated all econometric studies of crime rate we would be able to explain well over 100% of the variation therein, but perhaps my skepticism is unwarranted.

My reply:

I know what you mean. The story seems plausible but the statistical analysis seems like a stretch. I appreciate that the authors included scatterplots of their data, but the patterns they find are weak enough that it's hard to feel much confidence in their claim that "about 12 percent of the recent crime drop was due to expanded mental health treatment." The article reports that the percentage of people with mental illness getting treatment increased by 13 percentage points (from 20% to 33%) during the period under study. For this to have caused a 12 percent reduction in crime, you'd have to assume that nearly all the medicated people stopped committing crimes. (Or you'd have to assume that the potential criminals were more likely to be getting treated.) But maybe the exact numbers don't matter. The 1960s/1970s are over, and nowadays there is little controversy about the idea of using drugs and mental illness treatments as a method of social control. And putting criminals on Thorazine or whatever seems a lot more civilized than throwing them in prison. For example, if you put Tony Hayward or your local strangler on mind-numbing drugs and have them do community service with some sort of electronic tag to keep them out of trouble, they'd be making a much more useful contribution to society than if they're making license plates and spending their days working out in the prison yard.

P.S. It looks like I was confused on this myself. See Kevin Denny's comment below.

Someone who works in statistics in the pharmaceutical industry (but prefers to remain anonymous) sent me this update to our discussion on the differences between approvals of drugs and medical devices:

The 'substantial equivalence' threshold is very outdated. Basically the FDA has to follow federal law, and the law is antiquated and leads to two extraordinarily different paths for device approval.

You could have a very simple but first-in-kind device with an easy-to-understand physiological mechanism of action (e.g., the FDA approved a simple tiny stent that would relieve pressure from a glaucoma patient's eye this summer). This device would require a standard (likely controlled) trial at the one-sided 0.025 level. Even after the trial it would likely go to a panel where outside experts (e.g., practicing and academic MDs and statisticians) hear evidence from the company and FDA and vote on its safety and efficacy. FDA would then rule, considering the panel's vote, on whether to approve this device.

On the other hand you could have a very complex device with uncertain physiological mechanism declared equivalent to a device approved before May 28, 1976 and it requires much less evidence. And you can have a device declared similar to a device that was similar to a device that was similar to a device on the market before 1976. So basically if there was one type I error in this chain, you now have a device that's equivalent to a non-efficacious device. For these no trial is required, no panel meeting is required. The regulatory burden is tens of millions of dollars less expensive and we also have substantially less scientific evidence.

But the complexity of the device has nothing to do with which path gets taken--only its similarity to a device that existed before 1976.

This was in the WSJ just this morning.

You can imagine there was nothing quite like the "NanoKnife" on the market in 1976. But it's obviously very worth a company's effort to get their new device declared substantially equivalent to an old one. Otherwise they have to spend the money for a trial and risk losing that trial. Why do research when you can just market!?

So this unfortunately isn't a scientific question -- we know what good science would lead us to do. It's a legal question and the scientists at FDA are merely following U.S. law which is fundamentally flawed and leads to two very different paths and scientific hurdles for device approval.

Sanjay Kaul writes:

By statute ("the least burdensome" pathway), the approval standard for devices by the US FDA is lower than for drugs. Before a new drug can be marketed, the sponsor must show "substantial evidence of effectiveness" as based on two or more well-controlled clinical studies (which literally means 2 trials, each with a p value of <0.05, or 1 large trial with a robust p value <0.00125). In contrast, the sponsor of a new device, especially one designated as a high-risk (Class III) device, need only demonstrate "substantial equivalence" to an FDA-approved device via the 510(k) exemption or a "reasonable assurance of safety and effectiveness", evaluated through a pre-market approval and typically based on a single study.

What does "reasonable assurance" or "substantial equivalence" imply to you as a Bayesian? These are obviously qualitative constructs, but if one were to quantify them, how would you go about addressing it?
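The drug-side arithmetic in Kaul's note can be checked directly. A quick sketch (my calculation, not part of his note):

```python
# Under the null, a two-sided trial comes out "significant in the
# right direction" at p < 0.05 with probability 0.05 / 2 = 0.025.
alpha_one_sided = 0.05 / 2
two_trials = alpha_one_sided ** 2     # both independent trials falsely succeed
print(f"{two_trials:.6f}")            # 0.000625

# The single-trial "robust" threshold p < 0.00125 (two-sided) is
# 0.00125 / 2 = 0.000625 one-sided: the same overall error rate.
print(f"{0.00125 / 2:.6f}")           # 0.000625
```

So "two trials at p < 0.05" and "one trial at p < 0.00125" are calibrated to the same false-positive rate, which is presumably why the statute treats them as interchangeable.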

John Christie sends along this. As someone who owns neither a car nor a mobile phone, it's hard for me to relate to this one, but it's certainly a classic example for teaching causal inference.

Tapen Sinha writes:

Living in Mexico, I have been witness to many strange (and beautiful) things. Perhaps the strangest happened during the first outbreak of A(H1N1) in Mexico City. We had our university closed, and football (soccer) was played in empty stadiums (or should it be stadia?) because the government feared a spread of the virus. The Metro was operating and so were the private/public buses and taxis. Since the university was closed, we took the opportunity to collect data on facemask use in the public transport systems. It was a simple (but potentially deadly!) exercise in the first-hand statistical data collection that we teach our students (although I must admit that I did not dare send my research assistant to collect data - what if she contracted the virus?). I believe it was a unique experiment never to be repeated.

The paper appeared in the journal Health Policy. From the abstract:

At the height of the influenza epidemic in Mexico City in the spring of 2009, the federal government of Mexico recommended that passengers on public transport use facemasks to prevent contagion. The Mexico City government made the use of facemasks mandatory for bus and taxi drivers, but enforcement procedures differed for these two categories. Using an evidence-based approach, we collected data on the use of facemasks over a 2-week period. In the specific context of the Mexico City influenza outbreak, these data showed mask usage rates mimicked the course of the epidemic and a gender difference in compliance rates among metro passengers. Moreover, there was not a significant difference in compliance with mandatory and voluntary public health measures where the effect of the mandatory measures was diminished by insufficiently severe penalties.

Brendan Nyhan gives the story.

Here's Sarah Palin's statement introducing the now-notorious phrase:

The America I know and love is not one in which my parents or my baby with Down Syndrome will have to stand in front of Obama's "death panel" so his bureaucrats can decide, based on a subjective judgment of their "level of productivity in society," whether they are worthy of health care.

And now Brendan:

Palin's language suggests that a "death panel" would determine whether individual patients receive care based on their "level of productivity in society." This was -- and remains -- false. Denying coverage at a system level for specific treatments or drugs is not equivalent to "decid[ing], based on a subjective judgment of their 'level of productivity in society.'"

Seems like an open-and-shut case to me. The "bureaucrats" (I think Palin is referring to "government employees") are making decisions based on studies of the drug's effectiveness:

Hal Pashler writes:

Ed Vul and I are working on something that, although less exciting than the struggle against voodoo correlations in fMRI :-) might interest you and your readers. The background is this: we have been struck for a long time by how many people get frustrated and confused trying to figure out whether something they are doing/eating/etc is triggering something bad, whether it be migraine headaches, children's tantrums, arthritis pains, or whatever. It seems crazy to try to do such computations in one's head--and the psychological literature suggests people must be pretty bad at this kind of thing--but what's the alternative? We are trying to develop one alternative approach--starting with migraine as a pilot project.

We created a website that migraine sufferers can sign up for. The users select a list of factors that they think might be triggering their headaches (eg drinking red wine, eating stinky cheese, etc.--the website suggests a big list of candidates drawn from the migraine literature). Then, every day the user is queried about how much they were exposed to each of these potential triggers that day, as well as whether they had a headache. After some months, the site begins to analyze the user's data to try to figure out which of these triggers--if any--are actually causing headaches.

Our approach uses multilevel logistic regression as in Gelman and Hill, and Gelman and Little (1997), and we use parametric bootstrapping to obtain posterior predictive confidence intervals to provide practical advice (rather than just ascertain the significance of effects). The population-level hyperparameters on individual betas start off uninformative (uniform), but as we get data from an adequate number of users (we're not there quite yet), we will be able to pool information across users to provide appropriate population-level priors on the regression coefficients for each possible trigger factor for each person. The approach is outlined in this FAQ item.
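Their actual model is multilevel and pools across users; as a much simpler sketch of the parametric-bootstrap idea for one hypothetical user and one candidate trigger (all data and numbers below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single-user diary: 120 days of one candidate trigger
# (red wine) and daily headache outcomes. Simulated data only; this
# is not the site's model, just the parametric-bootstrap idea.
days = 120
wine = rng.random(days) < 0.3
headache = rng.random(days) < np.where(wine, 0.5, 0.2)

# Estimated headache risk on wine days vs. non-wine days
p1 = headache[wine].mean()
p0 = headache[~wine].mean()

# Parametric bootstrap: simulate new diaries from the fitted
# probabilities and re-estimate the risk difference each time.
n1, n0 = wine.sum(), (~wine).sum()
boot = rng.binomial(n1, p1, 2000) / n1 - rng.binomial(n0, p0, 2000) / n0
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"risk difference {p1 - p0:.2f}, 95% interval ({lo:.2f}, {hi:.2f})")
```

The interval, rather than a bare p-value, is what supports practical advice of the form "red wine appears to raise your headache risk by roughly X to Y percentage points."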

Looks cool to me.

Andrew Eppig writes:

I'm a physicist by training who is transitioning to the social sciences. I recently came across a reference in the Economist to a paper on IQ and parasites which I read as I have more than a passing interest in IQ research (having read much that you and others (e.g., Shalizi, Wicherts) have written). In this paper I note that the authors find a very high correlation between national IQ and parasite prevalence. The strength of the correlation (-0.76 to -0.82) surprised me, as I'm used to much weaker correlations in the social sciences. To me, it's a bit too high, suggesting that there are other factors at play or that one of the variables is merely a proxy for a large number of other variables. But I have no basis for this other than a gut feeling and a memory of a plot on Language Log about the distribution of correlation coefficients in social psychology.

So my question is this: Is a correlation in the range of (-0.82,-0.76) more likely to be a correlation between two variables with no deeper relationship or indicative of a missing set of underlying variables?
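For what it's worth, the "proxy for other variables" story is easy to demonstrate by simulation: two variables with no direct link, each loading strongly on a shared latent factor, can show a correlation of exactly this size. (A toy example of my own, not from the paper.)

```python
import numpy as np

# Two observed variables, no direct causal link, each loading 0.9
# (with opposite signs) on a shared latent factor such as
# "economic development"; their correlation is about -(0.9 * 0.9).
rng = np.random.default_rng(1)
n = 100_000

latent = rng.normal(size=n)
a = 0.9 * latent + np.sqrt(1 - 0.81) * rng.normal(size=n)
b = -0.9 * latent + np.sqrt(1 - 0.81) * rng.normal(size=n)

r = np.corrcoef(a, b)[0, 1]
print(round(r, 2))   # roughly -0.81
```

So a correlation in the -0.8 range is entirely consistent with both variables being proxies for an underlying factor, rather than evidence of a deep direct relationship.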

My reply:

Congressman Kevin Brady from Texas distributes this visualization of reformed health care in the US (click for a bigger picture):


Here's a PDF at Brady's page, and a local copy of it.

Complexity has its costs. Beyond the cost of writing it, learning it, following it, there's also the cost of checking it. John Walker has some funny examples of what's hidden in the almost 8000 pages of IRS code.

Text mining and applied statistics will solve all that, hopefully. Anyone interested in developing a pork detection system for the legislation? Or an analysis of how much entropy each congressman contributed to the legal code?

There are already spin detectors that help you detect whether the writer is a Democrat ("stimulus", "health care") or a Republican ("deficit spending", "ObamaCare").

D+0.1: Jared Lander points to versions by Rep. Boehner and Robert Palmer.

As part of his continuing plan to sap etc etc., Aleks pointed me to an article by Max Miller reporting on a recommendation from Jacob Appel:

Adding trace amounts of lithium to the drinking water could limit suicides. . . . Communities with higher than average amounts of lithium in their drinking water had significantly lower suicide rates than communities with lower levels. Regions of Texas with lower lithium concentrations had an average suicide rate of 14.2 per 100,000 people, whereas those areas with naturally higher lithium levels had a dramatically lower suicide rate of 8.7 per 100,000. The highest levels in Texas (150 micrograms of lithium per liter of water) are only a thousandth of the minimum pharmaceutical dose, and have no known deleterious effects.

I don't know anything about this and am offering no judgment on it; I'm just passing it on. The research studies are here and here. I am skeptical, though, about this part of the argument:

Interesting article by Sharon Begley and Mary Carmichael. They discuss how there is tons of federal support for basic research but that there's a big gap between research findings and medical applications--a gap that, according to them, arises not just from the inevitable problem that not all research hypotheses pan out, but because actual promising potential cures don't get researched because of the cost.

I have two thoughts on this. First, in my experience, research at any level requires a continuing forward momentum, a push from somebody to keep it going. I've worked on some great projects (some of which had Federal research funding) that ground to a halt because the original motivation died. I expect this is true with medical research also. One of the projects that I'm thinking of, which I've made almost no progress on for several years, I'm sure would make a useful contribution. I pretty much know it would work--it just takes work to make it work, and it's hard to do this without the motivation of it being connected to other projects.

My second thought is about economics. Begley and Carmichael discuss how various potential cures are not being developed because of the expense of animal and then human testing. I guess this is part of the expensive U.S. medical system, that simple experiments cost millions of dollars. But I'm also confused: if these drugs are really "worth it" and would save lots of lives, wouldn't it be worth it for the drug and medical device companies to expend the dollars to test them? There's some big-picture thing I'm not understanding here.

Earlier today, Nate criticized a U.S. military survey that asks troops the question, "Do you currently serve with a male or female Service member you believe to be homosexual." [emphasis added] As Nate points out, by asking this question in such a speculative way, "it would seem that you'll be picking up a tremendous number of false positives--soldiers who are believed to be gay, but aren't--and that these false positives will swamp any instances in which soldiers (in spite of DADT) are actually somewhat open about their same-sex attractions."

This is a general problem in survey research. In an article in Chance magazine in 1997, "The myth of millions of annual self-defense gun uses: a case study of survey overestimates of rare events" [see here for related references], David Hemenway uses the false-positive, false-negative reasoning to explain this bias in terms of probability theory. Misclassifications that induce seemingly minor biases in estimates of certain small probabilities can lead to large errors in estimated frequencies. Hemenway discusses this effect in the context of traditional medical risk problems and then argues that this bias has caused researchers to drastically overestimate the number of times that guns have been used for self defense. Direct extrapolations from surveys suggest 2.5 million self-defense gun uses per year in the United States, but Hemenway shows how response errors could be causing this estimate to be too high by a factor of 10.
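The arithmetic behind Hemenway's point is simple enough to sketch (the numbers below are illustrative, not his):

```python
# With a rare event, even small misclassification rates swamp the truth:
# almost everyone is a true negative, so even a 1% false-positive rate
# generates far more "yes" answers than the event itself does.
true_rate = 0.001   # 0.1% of respondents truly had the event
false_pos = 0.01    # 1% of non-events misreported as "yes"
false_neg = 0.10    # 10% of true events missed

reported = true_rate * (1 - false_neg) + (1 - true_rate) * false_pos
print(f"survey rate {reported:.4f}, overstating the truth "
      f"{reported / true_rate:.0f}-fold")
```

Note the asymmetry: the false positives come from the huge pool of non-events, so the survey estimate is off by an order of magnitude even though each individual response is quite likely to be accurate.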

Here are a couple more examples from Hemenway's 1997 article:

The National Rifle Association reports 3 million dues-paying members, or about 1.5% of American adults. In national random telephone surveys, however, 4-10% of respondents claim that they are dues-paying NRA members. Similarly, although Sports Illustrated reports that fewer than 3% of American households purchase the magazine, in national surveys 15% of respondents claim that they are current subscribers.

Gays are estimated to be about 3% of the general population (whether the percentage is higher or lower in the military, I have no idea), so you can see how it can be very difficult to interpret the results of "gaydar" questions.

P.S. This post really is about guns and gaydar, not so much about God, but to maintain consistency with the above title, I'll link to this note on the persistent overreporting of church attendance in national surveys.

Inequality and health


Several people asked me for my thoughts on Richard Wilkinson and Kate Pickett's book, "The Spirit Level: Why Greater Equality Makes Societies Stronger." I've outsourced my thinking on the topic to Lane Kenworthy.

Oil spill and corn production


See here.

Hank Aaron at the Brookings Institution, who knows a lot more about policy than I do, had some interesting comments on the recent New York Times article about problems with the Dartmouth health care atlas, which I discussed a few hours ago. Aaron writes that much of the criticism in that newspaper article was off-base, but that there are real difficulties in translating the Dartmouth results (finding little relation between spending and quality of care) to cost savings in the real world.

Aaron writes:

Reed Abelson and Gardiner Harris report in the New York Times that some serious statistical questions have been raised about the Dartmouth Atlas of Health Care, an influential project that reports huge differences in health care costs and practices in different places in the United States, suggesting large potential cost savings if more efficient practices are used. (A claim that is certainly plausible to me, given this notorious graph; see here for background.)

Here's an example of a claim from the Dartmouth Atlas (just picking something that happens to be featured on their webpage right now):

Medicare beneficiaries who move to some regions receive many more diagnostic tests and new diagnoses than those who move to other regions. This study, published in the New England Journal of Medicine, raises important questions about whether being given more diagnoses is beneficial to patients and may help to explain recent controversies about regional differences in spending.

Abelson and Harris raise several points that suggest the Dartmouth claims may be overstated because of insufficient statistical adjustment. Abelson and Harris's article is interesting, thoughtful, and detailed, but along the way it reveals a serious limitation of the usual practices of journalism, when applied to evaluating scientific claims.

In response to the post The bane of many causes in the context of mobile phone use and brain cancer, Robert Erikson wrote:

The true control here is the side of the head of the tumor: same side as phone use or opposite side. If that is the test, the data from the study are scary. Clearly tumors are more likely on the "same" side, at whatever astronomical p value you want to use. That cannot be explained away by misremembering, since an auxiliary study showed misremembering was not biased toward cell phone-tumor consistency.

A strong signal in the data pointed out by Prof. Erikson is that the tumors are overwhelmingly more likely to appear on the same side of the head as where the phone is held. I've converted the ratios into percentages, based on the assumption that the risk of tumors would a priori be equal for both sides of the head.
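That ratio-to-percentage conversion is a one-liner; a sketch with illustrative ratios (not the paper's actual values):

```python
# If same-side tumors are `ratio` times as common as opposite-side
# ones, and the two sides are a priori equally likely, the share of
# tumors on the phone side is ratio / (1 + ratio).
def same_side_pct(ratio):
    return 100 * ratio / (1 + ratio)

print(same_side_pct(1.0))  # 50.0: no lateral bias
print(same_side_pct(2.0))  # about 66.7: twice as many on the phone side
```

On this scale, 50% means no lateral effect at all, which is why percentages persistently above 50% across exposure groups read as a signal.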


There is a group of people with low-to-moderate exposure and high lateral bias, but the bias does increase quite smoothly with increasing exposure. It's never below 50%.

But even with something apparently simple like handedness, there are possible confounding factors. For example, left-handed and ambidextrous people have a lower risk of brain cancer, perhaps because they zap their brain with cell phones more evenly across both sides, reducing the risk that a single DNA strand will be zapped one too many times, but they also earn more. I've written about handling multiple potential causes at the same time a few years ago.

The authors also point out that people might be inclined to blame it all on the phones and to report phone use on the side where the tumor was identified. This could be resolved if the controls are led to think that they have a tumor too, or if, instead of asking how the phone is held, the interviewers made a call and observed the subject, or asked about a value-neutral attribute such as handedness. Still, even in papers that reject the influence of phones on brain tumors, it's always the case that more tumors are on the right side, just as we know that more people are right-handed than left-handed.

In the light of this investigation, I fully agree with Prof. Erikson that there is something going on.

The bane of many causes


One of the newsflies buzzing around today is an article "Brain tumour risk in relation to mobile telephone use: results of the INTERPHONE international case-control study".

The results, shown in this pretty table below, appear to be inconclusive.


A limited amount of cellphone radiation is good for your brain, but not too much? It's unfortunate that the extremes are truncated. The commentary at Microwave News blames bias:

The problem with selection bias --also called participation bias-- became apparent after the brain tumor risks observed throughout the study were so low as to defy reason. If they reflect reality, they would indicate that cell phones confer immediate protection against tumors. All sides agree that this is extremely unlikely. Further analysis pointed to unanticipated differences between the cases (those with brain tumors) and the controls (the reference group).

The second problem concerns how accurately study participants could recall the amount of time and on which side of the head they used their phones. This is called recall bias.

Mobile phones are not the only cause for development and detection of brain tumors. There are lots of factors: age, profession, genetics - all of them affecting the development of tumors. It's too hard to match everyone, but it's a lot easier to study multiple effects at the same time.

We'd see, for example, that healthy younger people at lower risk of brain cancer tend to use mobile phones more, and that older people sick with cancer that might spread to the brain don't need mobile phones. Similar could hold for alcohol consumption (social drinkers tend to be healthy and social, but drinking is an effect, not a cause) and other potential risk factors.

Here's a plot of the relative risk based on cumulative phone usage:


It seems that the top 10% of users have a much higher risk. If the data weren't discretized into just 10 categories, there could be interesting information here, beyond the obvious point that you need to be old and wealthy enough to accumulate 1600 hours of mobile phone usage.

[Changed the title from "many effects" to "many causes" - thanks to a comment by Cyrus]

Dan Lakeland asks:

When are statistical graphics potentially life threatening? When they're poorly designed, and used to make decisions on potentially life threatening topics, like medical decision making, engineering design, and the like. The American Academy of Pediatrics has dropped the ball on communicating to physicians about infant jaundice. Another message in this post is that bad decisions can compound each other.

It's an interesting story (follow the link above for the details); it would be great for a class in decision analysis or statistical communication. I have no idea how to get from A to B here, in the sense of persuading hospitals to do this sort of thing better. I'd guess the first step is to carefully lay out costs and benefits. When doctors and nurses take extra precautions for safety, it could be useful to lay out the ultimate goals and estimate the potential costs and benefits of different approaches.

The (U.S.) "President's Cancer Panel" has released its 2008-2009 annual report, which includes a cover letter that says "the true burden of environmentally induced cancer has been grossly underestimated." The report itself discusses exposures to various types of industrial chemicals, some of which are known carcinogens, in some detail, but gives nearly no data or analysis to suggest that these exposures are contributing to significant numbers of cancers. In fact, there is pretty good evidence that they are not.

U.S. male cancer mortality by year for various cancers

My article with Daniel and Yair has recently appeared in The Forum:

We use multilevel modeling to estimate support for health-care reform by age, income, and state. Opposition to reform is concentrated among higher-income voters and those over 65. Attitudes do not vary much by state. Unfortunately, our poll data only go to 2004, but we suspect that much can be learned from the relative positions of different demographic groups and different states, despite swings in national opinion. We speculate on the political implications of these findings.

The article features some pretty graphs that originally appeared on the blog.
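For readers curious about the mechanics, here is a minimal sketch of the partial-pooling idea behind such state-level estimates (illustrative numbers and an assumed between-state variance; not the paper's actual multilevel model):

```python
import numpy as np

# Shrink noisy state-level proportions toward the overall mean,
# with more shrinkage for smaller samples. All numbers made up.
n = np.array([15, 40, 120, 400, 900])               # respondents per state
support = np.array([0.20, 0.60, 0.48, 0.44, 0.46])  # raw state estimates

mu = np.average(support, weights=n)     # pooled national mean
sigma2 = support * (1 - support) / n    # sampling variance per state
tau2 = 0.005                            # assumed between-state variance

# Precision-weighted average of each state's data and the national mean
shrunk = (support / sigma2 + mu / tau2) / (1 / sigma2 + 1 / tau2)
for size, raw, post in zip(n, support, shrunk):
    print(f"n={size:4d}  raw={raw:.2f}  partially pooled={post:.2f}")
```

The small-sample states move substantially toward the national mean while the large-sample states barely move, which is the behavior the full model (with demographic predictors added) exploits.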

It's in a special issue on health care politics that has several interesting articles, among which I'd like to single out this one by Bob Shapiro and Lawrence Jacobs entitled, "Simulating Representation: Elite Mobilization and Political Power in Health Care Reform":

The public's core policy preferences have, for some time, favored expanding access to health insurance, regulating private insurers to ensure reliable coverage, and increasing certain taxes to pay for these programs. Yet the intensely divisive debate over reform generated several notable gaps between proposed policies and public opinion for two reasons.

First, Democratic policymakers and their supporters pushed for certain specific means for pursuing these broad policy goals--namely, mandates on individuals to obtain health insurance coverage and the imposition of an excise tax on high-end health insurance plans--that the public opposed. Second, core public support for reform flipped into majority opposition in reaction to carefully crafted messages aimed at frightening Americans and especially by partisan polarization that cued Republican voters into opposition while they unnerved independents.

The result, say Shapiro and Jacobs, "suggests a critical change in American democracy, originating in transformations at the elite level and involving, specifically, increased incentives to attempt to move the public in the direction of policy goals favored by elites and to rally their partisan base, rather than to respond to public wishes." They've written a fascinating and important paper.

Michael Spagat notifies me that his article criticizing the 2006 study of
Burnham, Lafta, Doocy and Roberts has just been published. The Burnham et al. paper (also called, to my irritation (see the last item here), "the Lancet survey") used a cluster sample to estimate the number of deaths in Iraq in the three years following the 2003 invasion. In his newly-published paper, Spagat writes:

[The Spagat article] presents some evidence suggesting ethical violations to the survey's respondents including endangerment, privacy breaches and violations in obtaining informed consent. Breaches of minimal disclosure standards examined include non-disclosure of the survey's questionnaire, data-entry form, data matching anonymised interviewer identifications with households and sample design. The paper also presents some evidence relating to data fabrication and falsification, which falls into nine broad categories. This evidence suggests that this survey cannot be considered a reliable or valid contribution towards knowledge about the extent of mortality in Iraq since 2003.

There's also this killer "editor's note":

The authors of the Lancet II Study were given the opportunity to reply to this article. No reply has been forthcoming.


Now on to the background:

More than six-and-a-half years have elapsed since the US-led invasion of Iraq in late March 2003. The human losses suffered by the Iraqi people during this period have been staggering. It is clear that there have been many tens of thousands of violent deaths in Iraq since the invasion. . . . The Iraq Family Health Survey Study Group (2008a), a recent survey published in the New England Journal of Medicine, estimated 151,000 violent deaths of Iraqi civilians and combatants from the beginning of the invasion until the middle of 2006. There have also been large numbers of serious injuries, kidnappings, displacements and other affronts to human security.

Burnham et al. (2006a), a widely cited household cluster survey, estimated that Iraq had suffered approximately 601,000 violent deaths, namely four times as many as the IFHS estimate, during almost precisely the same period as covered by the IFHS study. The L2 data are also discrepant from data provided by a range of other reliable sources, most of which are broadly consistent with one another. Nonetheless, there remains a widespread belief in some public and professional circles that the L2 estimate may be closer to reality than the IFHS estimate.

But Spagat says no; he suggests "the possibility of data fabrication and falsification." Also some contradictory descriptions of sampling methods, which are interesting enough that I will copy them here (it's from pages 11-12 of Spagat's article):

Some new articles


A few papers of mine have been recently accepted for publication. I plan to blog each of these individually but in the meantime here are some links:

Review of The Search for Certainty, by Krzysztof Burdzy. Bayesian Analysis. (Andrew Gelman)

Inference from simulations and monitoring convergence. In Handbook of Markov Chain Monte Carlo, ed. S. Brooks, A. Gelman, G. Jones, and X. L. Meng. CRC Press. (Andrew Gelman and Kenneth Shirley)

Public opinion on health care reform. The Forum. (Andrew Gelman, Daniel Lee, and Yair Ghitza)

A snapshot of the 2008 election. Statistics, Politics and Policy. (Andrew Gelman, Daniel Lee, and Yair Ghitza)

Bayesian combination of state polls and election forecasts. Political Analysis. (Kari Lock and Andrew Gelman)

Causality and statistical learning. American Journal of Sociology. (Andrew Gelman)

Can fractals be used to predict human history? Review of Bursts, by Albert-Laszlo Barabasi. Physics Today. (Andrew Gelman)

Segregation in social networks based on acquaintanceship and trust. American Journal of Sociology. (Thomas A. DiPrete, Andrew Gelman, Tyler McCormick, Julien Teitler, and Tian Zheng)

Here's the full list.

Saw a video link talk at a local hospital based research institute last Friday

Usual stuff about a randomized trial not being properly designed or analyzed - as if we have not heard about that before

But this time it was tens of millions of dollars and a health concern that likely directly affects over 10% of the readers of this blog - the males over 40 or 50 and those who might care about them

It was a very large PSA screening study, and

the design and analysis apparently failed to consider the _usual_ and expected lag in a screening effect here (perhaps worth counting the number of statisticians in the supplementary material given)

for a concrete example from colon cancer see here

And apparently a proper reanalysis was initially hampered by the well known "we would like to give you the data but you know" .... but eventually a reanalysis was able to recover enough of the data from published documents

but even with the proper analysis - the public health issue - does PSA screening do more good than harm (half of US males currently get PSA screening at some time?) - will likely remain largely uncertain or at least more uncertain than it needed to be

and it will happen again and again (seriously wasteful and harmful design and analysis)

and there will be a lot more needless deaths from either "screening being adopted" if it truly shouldn't have been or "screening was not more fully adopted, earlier" when it truly should have been (there can be very nasty downsides from ineffective screening programs, including increased mortality)

Dan Engber points me to an excellent pair of articles by Dave Johns, reporting on the research that's appeared in the last few years from Nicholas Christakis and James Fowler on social contagion--the finding that being fat is contagious, and so forth.

More precisely, Christakis and Fowler reanalyzed data from the Framingham heart study--a large longitudinal study that included medical records on thousands of people and, crucially, some information on friendships among the participants--and found that, when a person gained weight, his or her friends were likely to gain weight also. Apparently they have found similar patterns for sleep problems, drug use, depression, and divorce. And others have used the same sort of analysis to find contagion in acne, headaches, and height. Huh? No, I'm not kidding, but these last three were used in an attempt to debunk the Christakis and Fowler findings: if their method finds contagion in height, then maybe this isn't contagion at all, but just some sort of correlation. Maybe fat people just happen to know other fat people. Christakis and Fowler did address this objection in their research articles, but the current controversy is over whether their statistical adjustment did everything they said it did.

So this moves from a gee-whiz science-is-cool study to a more interesting is-it-right-or-is-it-b.s. debate.

The first two or three paragraphs of this post aren't going to sound like they have much to do with weight loss, but bear with me.

In October, I ran in a 3K (1.86-mile) "fun run" at my workplace, and was shocked to have to struggle to attain 8-minute miles. This is about a minute per mile slower than the last time I did the run, a few years ago, and that previous performance was itself much worse than a run a few years earlier. I no longer attempt to play competitive sports or to maintain a very high level of fitness, but this dismal performance convinced me that my modest level of exercise --- a 20- to 40-mile bike ride or a 4-mile jog each weekend, a couple of one-hour medium-intensity exercise sessions during the week, and an occasional unusual effort (such as a 100-mile bike ride) --- was not enough to keep my body at a level of fitness that I consider acceptable.

So after that run in October, I set some running goals: 200 meters in 31 seconds, 400 meters in 64 seconds, and a mile in 6 minutes. (These are not athlete goals, but they are decent middle-aged-guy-with-a-bad-knee goals, and I make no apology for them). Around the end of October, I started going to the track 5 or 6 days per week, for an hour per workout. I started with the 200m goal. I alternated high-intensity workouts with lower-intensity workouts. All workouts start with 20 minutes of warmup, gradually building in intensity: skips, side-skips, butt-kicks, a couple of active (non-stationary) stretching exercises, leg swings, high-knee running, backward shuffle, backward run, "karaokas" (a sort of sideways footwork drill), straight-leg bounds, and finally six or seven "accelerations", accelerating from stationary to high speed over a distance of about 30 meters. After the 20-minute warmup, I do the heart of the program, which takes about 30 minutes. (The final ten minutes, I do "core" work such as crunches, and some stretching). A high-intensity workout might include running up stadium sections (about 12 seconds at very close to maximum effort, followed by a 20- to 30-second break, then repeat, multiple times), or all-out sprints of 60, 100, or 120 meters...or a variety of other exercises at close to maximum effort. Every week or so, I would do an all-out 200m to gauge my progress. My time dropped by about a second per week, and within about 6 weeks I had run my sub-31 and shifted my workouts to focus on the 400m goal (which I am still between 1 and 2 seconds from attaining, almost three months later, but that's a different story).

So where does weight loss come in? I was shaving off pounds at about the same rate that I shaved off seconds in the 200m: I dropped from around 206-208 pounds at the end of October to under 200 in early December, and continued to lose weight more slowly after that, to my current weight of about 193-195. About twelve pounds of weight loss in as many weeks.

Update on gardens in school


Sebastian comments:

Take the claim that there were no claims of improvements in English or Math - that might be technically true (although there are studies that at least claim overall improvements in test scores). But I [Sebastian] hope everyone would agree that science is important?
Science achievement of third, fourth, and fifth grade elementary students was studied using a sample of 647 students from seven elementary schools in Temple, Texas. Students in the experimental group participated in school gardening activities as part of their science curriculum in addition to using traditional classroom-based methods. In contrast, students in the control group were taught science using traditional classroom-based methods only. Students in the experimental group scored significantly higher on the science achievement test compared to the students in the control group.

There are a bunch of others as far as I can tell - but contrary to what [Flanagan] seems to suggest, the empirical literature is actually quite small and mostly focused on nutritional benefits, the declared central goal of the school gardens.

So maybe the evidence on school gardens is more favorable than we thought. It makes sense that the literature would focus on nutritional benefits. But it also makes sense to look at academic outcomes to address the concern that the time being spent in the garden is being taken away from other pursuits. If Caitlin Flanagan sees this, perhaps she can comment.


Here's the story (which Kaiser forwarded to me). The English medical journal The Lancet (according to its publisher, "the world's leading independent general medical journal") published an article in 1998 in support of the much-derided fringe theory that MMR vaccination causes autism. From the BBC report:

The Lancet said it now accepted claims made by the researchers were "false".

It comes after Dr Andrew Wakefield, the lead researcher in the 1998 paper, was ruled last week to have broken research rules by the General Medical Council. . . . Dr Wakefield was in the pay of solicitors who were acting for parents who believed their children had been harmed by MMR. . . .

[The Lancet is now] accepting the research was fundamentally flawed because of a lack of ethical approval and the way the children's illnesses were presented.

The statement added: "We fully retract this paper from the published record." Last week, the GMC ruled that Dr Wakefield had shown a "callous disregard" for children and acted "dishonestly" while he carried out his research. It will decide later whether to strike him off the medical register.

The regulator only looked at how he acted during the research, not whether the findings were right or wrong - although they have been widely discredited by medical experts across the world in the years since publication.

They also write:

The publication caused vaccination rates to plummet, resulting in a rise in measles.

An interesting question, no? What's the causal effect of a single published article?

P.S. I love it how they refer to the vaccine as a "three-in-one jab." So English! They would never call it a "jab" in America. So much more evocative than "shot," in my opinion.

Following up on our recent discussion (see also here) about estimates of war deaths, Megan Price pointed me to this report, where she, Anita Gohdes, and Patrick Ball write:

Several media organizations including Reuters, Foreign Policy and New Scientist covered the January 21 release of the 2009 Human Security Report (HSR) entitled, "The Shrinking Cost of War." The main thesis of the HSR authors, Andrew Mack et al, is that "nationwide mortality rates actually fall during most wars" and that "today's wars rarely kill enough people to reverse the decline in peacetime mortality that has been underway in the developing world for more than 30 years." . . . We are deeply skeptical of the methods and data that the authors use to conclude that conflict-related deaths are decreasing. We are equally concerned about the implications of the authors' conclusions and recommendations with respect to the current academic discussion on how to count deaths in conflict situations. . . .

The central evidence that the authors provide for "The Shrinking Cost of War" is delivered as a series of graphs. There are two problems with the authors' reasoning.

Stephen Dubner reports on an observational study of bike helmet laws, a study by Christopher Carpenter and Mark Stehr that compares bicycling and accident rates among children in states that did and did not have helmet laws. In reading the data analysis, I'm reminded of the many discussions Bob Erikson and I have had about the importance, when fitting time-series cross-sectional models, of figuring out where your identification is coming from (this is an issue that's come up several times on this blog)--but I have no particular reason to doubt the estimates, which seem plausible enough. The analysis is clear enough, so I guess it would be easy enough to get the data, fit a hierarchical model, and, most importantly, make some graphs of what's happening before and after the laws, to see what's going on in the data.
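To make the identification point concrete, here's a toy difference-in-differences calculation of the kind such a comparison rests on. All the numbers here are invented for illustration; they are not Carpenter and Stehr's data or their actual method.

```python
# Toy difference-in-differences on invented state-year cycling rates:
# the "effect" is the change in law states minus the change in no-law
# states. All numbers are made up for illustration.

# (state, year, has_helmet_law_after_2000, cycling_rate)
data = [
    ("A", 1999, True, 40), ("A", 2001, True, 32),
    ("B", 1999, True, 38), ("B", 2001, True, 31),
    ("C", 1999, False, 41), ("C", 2001, False, 39),
    ("D", 1999, False, 37), ("D", 2001, False, 36),
]

def mean(xs):
    return sum(xs) / len(xs)

def did_estimate(rows):
    """Change in treated (law) states minus change in control states."""
    t_before = mean([r[3] for r in rows if r[2] and r[1] < 2000])
    t_after  = mean([r[3] for r in rows if r[2] and r[1] > 2000])
    c_before = mean([r[3] for r in rows if not r[2] and r[1] < 2000])
    c_after  = mean([r[3] for r in rows if not r[2] and r[1] > 2000])
    return (t_after - t_before) - (c_after - c_before)

print(did_estimate(data))  # -6.0: treated states fell 7.5, controls fell 1.5
```

The point of the before-and-after graphs is exactly to check the assumption buried in this subtraction: that the no-law states are a reasonable stand-in for what would have happened in the law states.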

Beyond this, I had one more comment, which is that I'm surprised that Dubner found it surprising that helmet laws seem to lead to a decrease in actual bike riding. My impression is that when helmet laws are proposed, this always comes up: the concern that if people are required to wear helmets, they'll just bike less. Hats off to Carpenter and Stehr for estimating this effect in this clever way, but it's certainly an idea that's been discussed before. In this context, I think it would be useful to think in terms of sociology-style models of default behaviors as well as economics-style models of incentives.

Ben Hyde and Aleks both sent me this:


The graph isn't as bad as all that, but, yes, a scatterplot would make a lot more sense than a parallel coordinate plot in this case. Also, I don't know how they picked which countries to include. In particular, I'm curious about Taiwan. We visited there once and were involved in a small accident. We were very impressed by the simplicity and efficiency of their health care system. France's system is great too, but everybody knows that.

Commenter Michael linked to a blog by somebody called The Last Psychiatrist, discussing the recent study by Rank and Hirschl estimating that half the kids in America in the 1970s were on food stamps at some point in their childhood. I've commented on some statistical aspects of that study, but The Last Psychiatrist makes some good points regarding how the numbers can and should be interpreted.

Scientists behaving badly


Steven Levitt writes:

My view is that the emails [extracted by a hacker from the climatic research unit at the University of East Anglia] aren't that damaging. Is it surprising that scientists would try to keep work that disagrees with their findings out of journals? When I told my father that I was sending my work saying car seats are not that effective to medical journals, he laughed and said they would never publish it because of the result, no matter how well done the analysis was. (As is so often the case, he was right, and I eventually published it in an economics journal.)

Within the field of economics, academics work behind the scenes constantly trying to undermine each other. I've seen economists do far worse things than pulling tricks in figures. When economists get mixed up in public policy, things get messier. So it is not at all surprising to me that climate scientists would behave the same way.

I have a couple of comments, not about the global-warming emails--I haven't looked into this at all--but regarding Levitt's comments about scientists and their behavior:

1. Scientists are people and, as such, are varied and flawed. I get particularly annoyed with scientists who ignore criticisms that they can't refute. The give and take of evidence and argument is key to scientific progress.

2. Levitt writes about scientists who "try to keep work that disagrees with their findings out of journals." This is or is not ethical behavior, depending on how it's done. If I review a paper for a journal and find that it has serious errors or, more generally, that it adds nothing to the literature, then I should recommend rejection--even if the article claims to have findings that disagree with my own work. Sure, I should bend over backwards and all that, but at some point, crap is crap. If the journal editor doesn't trust my independent judgment, that's fine, he or she should get additional reviewers. On occasion I've served as an outside "tiebreaker" referee for journals on controversial articles outside of my subfield.

Anyway, my point is that "trying to keep work out of journals" is ok if done through the usual editorial process, not so ok if done by calling the journal editor from a pay phone at 3am or whatever.

I wonder if Levitt is bringing up this particular example because he served as a referee for a special issue of a journal that he later criticized. So he's particularly aware of issues of peer review.

3. I'm not quite sure how to interpret the overall flow of Levitt's remarks. On one hand, I can't disagree with the descriptive implications: Some scientists behave badly. I don't know enough about economics to verify his claim that academics in that field "constantly trying to undermine each other . . . do far worse things than pulling tricks in figures"--but I'll take Levitt's word for it.

But I'm disturbed by the possible normative implications of Levitt's statement. It's certainly not the case that everybody does it! I'm a scientist, and, no, I don't "pull tricks in figures" or anything like this. I don't know what percentage of scientists we're talking about here, but I don't think this is what the best scientists do. And I certainly don't think it's ok to do so.

What I'm saying is, I think Levitt is doing a big service by publicly recognizing that scientists sometimes--often?--do unethical behavior such as hiding data. But I'm unhappy with the sense of amused, world-weary tolerance that I get from reading his comment.

Anyway, I had a similar reaction a few years ago when reading a novel about scientific misconduct. The implication of the novel was that scientific lying and cheating wasn't so bad, these guys are under a lot of pressure and they do what they can, etc. etc.--but I didn't buy it. For the reasons given here, I think scientists who are brilliant are less likely to cheat.

4. Regarding Levitt's specific example--the article on car seats that was rejected by medical journals--I wonder if he's being too quick to assume that the journals were trying to keep his work out because it disagreed with previous findings.

As a scientist whose papers have been rejected by top journals in many different fields, I think I can offer a useful perspective here.

Much of what makes a paper acceptable is style. As a statistician, I've mastered the Journal of the American Statistical Association style and have published lots of papers there. But I've never successfully published a paper in political science or economics without having a collaborator in that field. There's just certain things that a journal expects to see. It may be comforting to think that a journal will not publish something "because of the result," but my impression is that most journals like a bit of controversy--as long as it is presented in their style. I'm not surprised that, with his training, Levitt had more success publishing his public health work in econ journals.

P.S. Just to repeat, I'm speaking in general terms about scientific misbehavior, things such as, in Levitt's words, "pulling tricks in figures" or "far worse things." I'm not making a claim that the scientists at the University of East Anglia were doing this, or were not doing this, or whatever. I don't think I have anything particularly useful to add on that; you can follow the links in Freakonomics to see more on that particular example.

Harry Selker and Alastair Wood say yes.

P.S. The answer is no. The offending language is no longer in the bill (perhaps in response to Selker and Wood's article).

P.P.S. Somebody checked again, and the offending language is still there!

From Aaron Swartz, a link stating that famous sociologist Peter L. Berger was a big-time consultant for the Tobacco Institute:

Peter L. Berger is an academic social philosopher and sociologist who served as a consultant to the tobacco industry starting with the industry's original 1979 Social Costs/Social Values Project (SC/SV). According to a 1980 International Committee on Smoking Issues/Social Acceptability Working Party (International Committee on Smoking Issues/SAWP) progress report, Berger's primary assignment was "to demonstrate clearly that anti-smoking activists have a special agenda which serves their own purposes, but not necessarily the majority of nonsmokers."

Margarita Alegría, Glorisa Canino, Patrick Shrout, Meghan Woo, Naihua Duan, Doryliz Vila, Maria Torres, Chih-nan Chen, and Xiao-Li Meng write:

Although widely reported among Latino populations, contradictory evidence exists regarding the generalizability of the immigrant paradox, i.e., that foreign nativity protects against psychiatric disorders. The authors examined whether this paradox applies to all Latino groups by comparing estimates of lifetime psychiatric disorders among immigrant Latino subjects, U.S-born Latino subjects, and non-Latino white subjects.

The authors combined and examined data from the National Latino and Asian American Study and the National Comorbidity Survey Replication, two of the largest nationally representative samples of psychiatric information.

In the aggregate, risk of most psychiatric disorders was lower for Latino subjects than for non-Latino white subjects. Consistent with the immigrant paradox, U.S.-born Latino subjects reported higher rates for most psychiatric disorders than Latino immigrants. However, rates varied when data were stratified by nativity and disorder and adjusted for demographic and socioeconomic differences across groups. The immigrant paradox consistently held for Mexican subjects across mood, anxiety, and substance disorders, while it was only evident among Cuban and other Latino subjects for substance disorders. No differences were found in lifetime prevalence rates between migrant and U.S.-born Puerto Rican subjects.


Dan Lakeland writes:

Apropos your recent posting of the Churchill/Roosevelt poster, there has been a bit of a controversy over the effect of smoking bans in terms of heart attack rates. Recent bans in the UK have given researchers some plausible "experiments" to study the effect on a larger scale than the famous "Helena Montana" study. For example, this.

On the other hand, when looking for info about this to follow up your poster I found a variety of usually rather obviously biased articles such as this one. But that's no reason to ignore a point of view if it can be backed up by data. The second link at least attempts (poorly) to display some data which suggests that an existing downward trend could be responsible for the reductions, and if poorly done the statistical research could have missed this.

Have you looked at the statistical methodology of any smoking ban studies? It seems like an area ripe for Bayesian modeling, and could be a subject along the lines of the beauty-and-sex-ratio (more girls/more boys) research that you recently meta-analyzed.

My reply:

Yes, I imagine that some people have looked into this. I would guess that a smoking ban would reduce smoking and thus save lives, but of course it would be good to see some evidence.

Smoking behavior is a funny thing: It can be really hard to quit, and I've been told that the various anti-smoking programs out there really don't work. It's really hard to make a dent in smoking rates by working with smokers one at a time. On the other hand, rates of smoking vary a huge amount between countries and even between U.S. states:


And smoking bans might work too. Thus, smoking appears to be an individual behavior that is best altered through societal changes.

Placebos Have Side Effects Too



Aleks points me to this blog by Neuroskeptic, who reports on some recent research studying the placebo effect:

From David Madigan:

The Observational Medical Outcomes Partnership (OMOP) seeks new statistical and data mining methods for detecting drug safety issues through the OMOP Cup Methods Competition.

Bear with me. I've got a lot of graphs here (made jointly with Daniel Lee). Click on any of them to see the full-size versions.

I'll start with our main result. From the 2004 Annenberg surveys:

Providing health insurance for people who do not already have it--should the federal government spend more on it, the same as now, less, or no money at all?

The maps below show our estimated percentages of people responding "more" (rather than "the same," "less," or "none") to this question:


Increased government spending on health was particularly favored by people under 65 and those in the lower end of the income distribution. Older and higher-income people are much more likely to be in opposition. And, yes, there's some variation by state--you can see a band of states in the middle of the country showing opposition--but age and income explain a lot more.

Following up on some links, I came across this:


As a beneficiary of indoor smoking bans, I can't say that I agree with the sentiment, but the poster is pretty clever, and it got me thinking. Imagine Churchill on his regular dose of alcohol but without the moderating influence of the tobacco. Maybe it would've been a disaster. Seems like a joke, but maybe we'd all be blogging in German right now. I'd like to think, though, Churchill would've switched to chewing tobacco and all would be ok. A spittoon in the corner is a small price to pay for freedom.



Nice graph--especially good that they go back to 1980 (it would be better to go back even earlier but maybe it's not so easy to get the data). One could argue that the numbers would be better per-capita, but the patterns are clear enough that I don't think there's any need to get cute here.

My only criticism of the graph is . . . what's with all the fine detail on the y-axis? 0, 5 million, 10 million, 15 million: that would be enough. What do we gain by seeing 2.5, 7.5, 12.5, 17.5 on the graph? Nuthin. Really, though, this is a very minor comment. It's a great graph.
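Tick selection like this is mechanical enough to write down. Here's a small sketch, not tied to any particular plotting library, that generates round 5-million ticks and the coarser labels I have in mind (the function names are mine, purely illustrative):

```python
def round_ticks(max_value, step=5_000_000):
    """Tick positions at 0, 5M, 10M, ... extending just past max_value."""
    n = int(max_value // step) + 1
    return [i * step for i in range(n + 1)]

def label(tick):
    """Readable axis label: '0', '5 million', '10 million', ..."""
    return "0" if tick == 0 else f"{tick // 1_000_000} million"

ticks = round_ticks(17_000_000)
print([label(t) for t in ticks])
# ['0', '5 million', '10 million', '15 million', '20 million']
```

Four or five round ticks locate the curve just as well as the 2.5-million gridlines, with less clutter.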

Ole Rogeberg writes:

Saw your comments on rational addiction - thought you might like to know that some economists think the "theory" is pretty silly as well. It's worse than you think: They assume people smoke cigarettes, shoot up heroin etc. at increasing rates because they've planned out their future consumption paths and found that to be the optimal way to adjust their "addiction stocks" in the way maximizing discounted, lifetime utility. To quote Becker and Murphy's original article: "[I]n our model, both present and future behavior are part of a consistent, maximizing plan."

Yeah, right.

Here's Ole's article, "Taking Absurd Theories Seriously: Economics and the Case of Rational Addiction Theories," which begins:

Rational addiction theories illustrate how absurd choice theories in economics get taken seriously as possibly true explanations and tools for welfare analysis despite being poorly interpreted, empirically unfalsifiable, and based on wildly inaccurate assumptions selectively justified by ad-hoc stories. The lack of transparency introduced by poorly anchored mathematical models, the psychological persuasiveness of stories, and the way the profession neglects relevant issues are suggested as explanations for how what we perhaps should see as displays of technical skill and ingenuity are allowed to blur the lines between science and games.

I agree, and I'd also add that this problem isn't unique to economics. Political science and statistics also have lots of silly models that seem to have a life of their own.
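For readers who haven't seen the model Ole is describing, the Becker-Murphy setup can be sketched roughly as follows (my schematic notation, not their exact formulation):

```latex
\max_{\{c_t\}} \; \sum_{t=0}^{\infty} \beta^{t}\, u(c_t, S_t)
\quad \text{subject to} \quad S_{t+1} = (1-\delta)\, S_t + c_t ,
```

where \(c_t\) is consumption of the addictive good, \(S_t\) is the accumulated "addiction stock," \(\beta\) is the discount factor, and \(\delta\) is the rate at which the stock depreciates. The consumer is assumed to choose the entire path \(\{c_t\}\) in advance--the "consistent, maximizing plan" in the quote above.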

Chris Blattman reports on a study by Seema Jayachandran and Ilyana Kuziemko that makes the following argument:

Medical research indicates that breastfeeding suppresses post-natal fertility. We [Jayachandran and Kuziemko] model the implications for breastfeeding decisions and test the model's predictions using survey data from India. . . . mothers with no or few sons want to conceive again and thus limit their breastfeeding. . . . Because breastfeeding protects against water- and food-borne disease, our model also makes predictions regarding health outcomes. We find that child-mortality patterns mirror those of breastfeeding with respect to gender and its interactions with birth order and ideal family size. Our results suggest that the gender gap in breastfeeding explains 14 percent of excess female child mortality in India, or about 22,000 "missing girls" each year.

Interesting. I wonder what Monica Das Gupta would say about this study--she seems to be the expert in this area.


The only thing that really puzzles me about Jayachandran and Kuziemko's article is that, on one hand, they produce an estimate of 14%, but on the other, they write:

In contrast to conventional explanations, excess female mortality due to differential breastfeeding is largely an unintended consequence of parents' desire to have more sons rather than an explicit decision to allocate fewer resources to daughters.

But they just said their explanation only explains 14%. Doesn't that suggest that the other 86% arises from infanticide and other "explicit decisions"? The difference between "14%" and "largely" is so big that I think I must be missing something here. Perhaps someone can explain? Thanks.
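Here's the back-of-the-envelope arithmetic behind my puzzlement, using only the two numbers quoted above:

```python
# If 22,000 "missing girls" per year correspond to 14% of excess female
# child mortality in India (the paper's figures), the implied total is:
missing_from_breastfeeding = 22_000
share_explained = 0.14

implied_total = missing_from_breastfeeding / share_explained
print(round(implied_total))  # 157143, i.e. roughly 157,000 per year
```

So roughly 135,000 excess deaths per year would remain to be explained by other mechanisms, which is what makes "largely" hard for me to square with "14 percent."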

John reports on an article by Oeindrila Dube and Suresh Naidu, who ran some regressions on observational data and wrote:

This paper examines the effect of U.S. military aid on political violence and democracy in Colombia. We take advantage of the fact that U.S. military aid is channeled to Colombian army brigades operating out of military bases, and compare how changes in aid affect outcomes in municipalities with and without bases. Using detailed data on violence perpetrated by illegal armed groups, we find that U.S. military aid leads to differential increases in attacks by paramilitaries . . .

It's an interesting analysis, but I wish they'd restrained themselves and replaced all their causal language with "is associated with" and the like.

From a statistical point of view, what Dube and Naidu are doing is estimating the effects of military aid in two ways: first, by comparing outcomes in years in which the U.S. spends more or less in military aid; second, by comparing outcomes in cities in Colombia with and without military bases.

My friend Seth wrote:

A few months ago, because of this blog, I got a free heart scan from HeartScan in Walnut Creek. It's a multi-level X-ray of your heart and is scored to indicate your heart disease risk. . . . What's impressive about these scans is three-fold:

1. The derived scores are strongly correlated with risk of heart disease death. . . . Here is an example of the predictive power. . . .

2. You can improve the score. Via lifestyle changes.

3. The scans provided by HeartScan are low enough in radiation that they can be repeated every year, which is crucial if you want to measure improvement. In contrast, a higher-tech type of scan (64 slice) is so high in radiation that it can't be safely repeated. . . .

Heart scans, like the sort of self-experimentation I've done, are a way to wrest control of your health away from the medical establishment. No matter what your doctor says, no matter what anyone says, you can do whatever you want to try to improve your score. . . .

This looked pretty good. Heart attacks are the #1 killer, maybe I should be getting a heart scan. On the other hand, Seth's references are to a journal article from 2000 and a news article from Life Extension magazine, hardly a trustworthy source. So I didn't know what to think.

I contacted another friend who works in medical statistics, who wrote:

I don't know any of this literature but the fact that his source publication dates back to 2000 while the screening method has clearly not gained widespread traction is an indicator that the cost/benefit ratio is not very favorable (though it's no doubt very favorable to HeartScan who make money out of doing the scanning).

I found this more recent (though skimpy) review, "CT-Based Calcium Scoring to Screen for Coronary Artery Disease: Why Aren't We There Yet?" which casts doubt on the whole idea (and given that it's written by radiologists it has some credibility because they would normally be the first to promote a radiology-based screening technique). There were also some links to reviews of the potential dangers (carcinogenic) of repeated CT scans.

From this information, I wouldn't try to talk Seth out of getting heart scans, but I won't rush out to get one of my own.

NYC datasets


Abhishek Joshi of the Columbia Population Research Center (CPRC) writes:

CPRC is pleased to offer an easy way to locate New York City related databases. The New York City Dataset link includes data such as: NYC Community Health Survey, NYC Youth Risk Behavior Survey, NYC HANES, MapPLUTO and much much more. The website provides easy access for downloading data sets, code books, data dictionaries, online data extraction tools and other relevant documentation.

I did a quick check and it seems that you can access a lot of information here without needing a Columbia University password.
