Irreproducible analysis

John Cook has an interesting story here. I agree with his concerns. It’s hard enough to reproduce my own analysis, let alone somebody else’s. This comes up sometimes when revising a paper or including an old analysis in a book: I just can’t put all the data back together again, or I have an old Fortran program linked to S-Plus that won’t run in R, or whatever, and so I have to patch something together from whatever graphs I have available.

Also, when consulting, I’ve sometimes had to reconstruct the other side’s analysis, and it can be tough to figure out exactly what they did.

4 thoughts on “Irreproducible analysis”

  1. Lately I've been encountering an even stranger problem: papers for which there's a detailed explanation of the analysis in the paper and for which they release their code, but the code doesn't match the details they give in the paper.

  2. I bought into the literate programming idea about four years ago and have been using Sweave with LaTeX to write reproducible analyses. I have also just started releasing the source code, data, and Sweave files along with published articles.

    Some expensive lessons I have learnt are:

    1. Except for trivial programs, it does not make sense to write Sweave code when you start out on the analysis. There should be an initial period of using plain .R files; once you have reached the final (hopefully elegant) version, i.e., the publication stage, that is when you should take some time to move it into Sweave.

    2. Never rely on .Rprofile files: if you have custom functions that are autoloaded there, you might forget to include them in the public version. Instead, put custom functions in .R files and load them explicitly but silently (set echo=FALSE in the Sweave code chunk).

    3. If you are going to assume a particular subdirectory structure in the current working directory, run some preliminary R code in the Sweave file to check that those directories exist. E.g., if you save figures to a dedicated directory, make sure that directory is created automatically at startup if it does not exist.

    4. Use caching for complex chunks, via the weaver package. But when using xYplot, turn caching off before the chunk that calls it, or the code will crash.

    5. Label your chunks.

    6. xtable cannot tell that an R output line containing, e.g., log(sigma^2) must be set in a math environment in the .tex file. In Sweave this has the disastrous consequence that the resulting .tex file does not compile. My kludgy solution is to search-and-replace in the .tex file after Sweaving it.

    7. It's a mistake to run the Sweave code chunk by chunk and hope that the final version works. Compile the whole file in one go at least once.

    8. Release the .R, .Rnw and .pdf files, not just a subset of these.

    9. When releasing the public version, remove all cached stuff.

    10. Don't lose the data or code. I know quite a few people who are unable to locate the data or code from a published paper. Just like that. Gone.
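    To illustrate points 2 and 3 above, a setup chunk near the top of the .Rnw file might look like this (the file and directory names here are just placeholders, not a prescription):

    ```r
    <<setup, echo=FALSE>>=
    # Load custom functions explicitly rather than via .Rprofile,
    # so the public version cannot silently depend on hidden code.
    source("helper-functions.R")  # hypothetical file name

    # Create the figure directory if it does not already exist.
    if (!file.exists("figures")) dir.create("figures")
    @
    ```

    Note the chunk label ("setup"), per point 5: labels make Sweave's error messages far easier to trace back to the offending chunk.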
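    On point 6, depending on your xtable version, the sanitize.text.function argument of print() may offer a less kludgy route than post-hoc search-and-replace. A sketch (the wrapping rule here is a guess that you would tailor to your own output):

    ```r
    library(xtable)

    d <- data.frame(parameter = "log(sigma^2)", estimate = 1.23)
    tab <- xtable(d)

    # Wrap anything containing a caret in math mode instead of letting
    # xtable escape it; everything else passes through unchanged.
    wrap_math <- function(x) ifelse(grepl("\\^", x), paste0("$", x, "$"), x)
    print(tab, sanitize.text.function = wrap_math)
    ```

    Be warned that bypassing the default sanitizer means other special characters also pass through unescaped, so this only helps if you control what appears in the table.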

    The biggest problem with using Sweave is that it takes extra time away from more urgent work; it is a considerable overhead on the science side of things. And one gets no credit for it: your Hirsch index will remain static no matter how sweet your Sweave files are.

    So maybe use it only after you get tenure. (I was foolish enough not to follow my own advice, though ;-).

    It would be so great if everyone released their data in a usable form with the published paper. For reasons I do not understand, even getting the data afterwards is a huge exercise. Once published, data, code and all documentation (if there was any) tend to crumble, or even disappear completely.

  3. The Dataverse Network Frank points to sounds very interesting; I will try to use it.

    Actually, I don't mind if there is no replication; but I do mind if there is no data accompanying the publication. So I would propose that the slogan be

    "No publication without data."

    meaning, without releasing data along with the publication ;)
