Forensic bioinformatics, or, Don’t believe everything you read in the (scientific) papers

Hadley Wickham sent me this, by Keith Baggerly and Kevin Coombes:

In this report we [Baggerly and Coombes] examine several related papers purporting to use microarray-based signatures of drug sensitivity derived from cell lines to predict patient response. Patients in clinical trials are currently being allocated to treatment arms on the basis of these results. However, we show in five case studies that the results incorporate several simple errors that may be putting patients at risk. One theme that emerges is that the most common errors are simple (e.g., row or column offsets); conversely, it is our experience that the most simple errors are common.

This is horrible! But, in a way, it’s not surprising. I make big mistakes in my applied work all the time. I mean, all the time. Sometimes I scramble the order of the 50 states, or I’m plotting a pure noise variable, or whatever. But usually I don’t drift too far from reality because I have a lot of cross-checks and I (or my close collaborators) are extremely familiar with the data and the problems we are studying.

Genetics, though, seems like more of a black box. And, as Baggerly and Coombes demonstrate in their fascinating paper, once you have a hypothesis, it doesn’t seem so difficult to keep coming up with what seems like confirming evidence of one sort or another.

To continue the analogy, operating some of these methods seems like knitting a sweater inside a black box: it’s a lot harder to notice your mistakes if you can’t see what you’re doing, and it can be difficult to tell by feel if you even have a functioning sweater when you’re done with it all.

6 thoughts on “Forensic bioinformatics, or, Don’t believe everything you read in the (scientific) papers”

  1. It was interesting to read something Craig Venter said recently: he has given up on analysing DNA data (there is too much of it) and is instead now focusing on creating synthetic life forms.

  2. Baggerly makes mistakes too (we've shown as much). The critical difference between his mistakes and those he deplores is that he ensures all of his analyses can be reproduced, inspected, and, if necessary, updated or extended.

    Opaque science isn't science at all. That's his point and it's a very, very good one. Several of my labmates are using Sweave for their dissertations as a result.

  3. This is a problem not just in bioinformatics, but in anything that requires programming.

    Bioinformatics is really no different than anything else as far as that goes. Specifically, the methods aren't all black boxes. For microarrays, for instance, it's common to just use a first-order factor model, like dChip. In that setting, the coefficients are all interpretable and linkable back to the data (a minimal sketch of such a fit appears at the end of this comment).

    There are also all kinds of sanity checks that can be done on results as a whole, or on an individual prediction basis.

    In natural language processing, the canonical example is the failure to reproduce the parser from the description in Michael Collins's UPenn dissertation, much less from its precursor, an industry-standard 10-page ACL paper. As a result, Daniel Bikel undertook a reconstruction, which was itself published in Computational Linguistics. According to Bikel's paper, the "unpublished details" accounted for an 11% reduction in error. In other words, the published model wasn't the model whose results were being reported.

    I don't mean to pick on Mike (he's a great guy). The reason this is all so well known in our field is that his parser was so good everyone wanted to understand it.

    A less well known example is my own retraction of high-recall named-entity recognition results. The original results were based on buggy evaluation code and overstated the results by a factor of 10 or so (and they were low to begin with). Very embarrassing, but at least it led to a bug fix.
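
    To make the first-order factor model mentioned above concrete, here is a minimal sketch of a dChip-style rank-one fit by alternating least squares. It is illustrative only: the function name and toy data are invented, and real dChip additionally handles outlier probes, mismatch intensities, and standard errors. The point is that theta (per-array expression) and phi (per-probe affinity) are directly interpretable, and the residuals link straight back to the raw probe data.

        import numpy as np

        def fit_rank_one(Y, n_iter=100):
            """Alternating least-squares fit of y_ij ~ theta_i * phi_j.

            Y is an (arrays x probes) matrix of background-corrected
            intensities for one probe set; theta_i is the expression
            index for array i and phi_j is the affinity of probe j.
            """
            Y = np.asarray(Y, dtype=float)
            n_probes = Y.shape[1]
            phi = np.ones(n_probes)
            for _ in range(n_iter):
                theta = Y @ phi / (phi @ phi)        # LS update of theta given phi
                phi = Y.T @ theta / (theta @ theta)  # LS update of phi given theta
                # resolve the scale indeterminacy (roughly the Li-Wong convention:
                # constrain sum_j phi_j^2 = number of probes)
                s = np.sqrt(n_probes / (phi @ phi))
                phi, theta = phi * s, theta / s
            residuals = Y - np.outer(theta, phi)     # inspect for outlier probes/arrays
            return theta, phi, residuals

        # toy check: 5 arrays, 11 probes, known rank-one structure plus noise
        rng = np.random.default_rng(0)
        true_theta = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
        true_phi = rng.uniform(0.5, 1.5, size=11)
        Y = np.outer(true_theta, true_phi) + rng.normal(scale=0.1, size=(5, 11))
        theta, phi, resid = fit_rank_one(Y)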

  4. The potential for errors in a complex analysis or in complex computer code is a concern, but the Anil Potti business is much worse than that, and seems to include outright fabrication. For example, consider this news item in Nature, which says:

    Coombes and Baggerly also found that the genes and probe sets listed in the Nature Medicine paper and the patent applications included several that were not produced when they re-ran a computer analysis of the Duke group's data, using software that it had made public via the web. Coombes and Baggerly believe that these genes were added to the lists by hand.

    That is, the authors' lists included genes that were not actually on the microarrays being analyzed. And, if I recall correctly, these added genes were the ones discussed in order to justify the biological plausibility of the results.

    The real shame in this business is that Duke completely ignored or waved away the concerns that Coombes and Baggerly raised, and it was only a misstatement on Potti's biosketch that got him into trouble. See this NY Times article.

    Returning to the question of errors in analyses/software, a particularly important point to consider is, to what extent do we teach our statistics students how to properly design and test their software and manage their data, methods and analysis results? Lab scientists are taught (in many cases informally) to keep a lab notebook. Statistical scientists are left to figure this sort of thing out on their own. And computer simulation studies can be as complex as any laboratory experiment.

  5. Karl:

    I agree about the lab notebook thing. One difference about computer experiments is that they are reproducible. So keeping the "lab notebook" should be easier in statistical work.

    Regarding the negligence of Duke's administration: All I can say is that universities seem to act like a lot of other organizations in allowing favored individuals to break the rules. Or maybe a better way to say it is that they enforce the easy-to-enforce rules and slack off on enforcing the difficult-to-enforce rules. One thing that unethical researchers are often good at is muddying the waters so that it takes a lot of effort to stop them. It does seem kinda funny that the American Cancer Society took away the grant because of a false statement on a resume. It's a bit like jailing Al Capone on tax evasion, I guess.

  6. One has to be careful about microarray analysis and the notion of which genes are "on the chip". What's really on the chip is a bunch of short oligo probes defined by a sequence of bases.

    We don't actually have genomes fully mapped, especially for variation, and we don't know where exactly all the genes are. Thus the mapping of probes on a microarray chip to genes is based on constantly-evolving organism-specific gene models which define where on the genome the exons (translated regions) of each gene are and more recently, which splice variants have been observed.

    Researchers are constantly finding new genes and redefining the gene models in terms of boundaries and splice variants. Often the chips include probes that weren't mappable to existing gene models when the chips were designed, but were found in RNA-seq experiments and thus thought to originate from a transcribed region.

    It's very common to analyze the data from older chips using newer gene models by redoing the mapping. All the common microarray analysis platforms support pluggable probe-gene mappings. For instance, we just reanalyzed old Affymetrix probes with new C. elegans gene models to calculate correlations between microarray and RNA-seq data.
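
    To illustrate what "pluggable" means in practice, here is a toy sketch of summarizing probe-level intensities to gene level under a swappable probe-to-gene table and then correlating the result with RNA-seq estimates. The column names, identifiers, and numbers are invented for the example; they are not any platform's actual annotation format.

        import pandas as pd

        # probe-level microarray intensities (toy values, one sample)
        probes = pd.DataFrame({
            "probe_id":  ["p1", "p2", "p3", "p4", "p5", "p6"],
            "intensity": [120.0, 95.0, 310.0, 280.0, 40.0, 18.0],
        })

        # a pluggable probe -> gene mapping; swap in a table built from a
        # newer gene model to redo the analysis (p5 is left unmapped here)
        probe_to_gene = pd.DataFrame({
            "probe_id": ["p1", "p2", "p3", "p4", "p6"],
            "gene_id":  ["geneA", "geneA", "geneB", "geneB", "geneC"],
        })

        # gene-level RNA-seq estimates for the same sample (toy values)
        rnaseq = pd.Series({"geneA": 100.0, "geneB": 290.0, "geneC": 15.0})

        def summarize_by_gene(probe_data, mapping):
            """Average probe intensities within each gene under the given mapping."""
            merged = probe_data.merge(mapping, on="probe_id", how="inner")
            return merged.groupby("gene_id")["intensity"].mean()

        array_by_gene = summarize_by_gene(probes, probe_to_gene)
        common = array_by_gene.index.intersection(rnaseq.index)
        print(array_by_gene.loc[common].corr(rnaseq.loc[common]))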
