Splitting the data

Antonio Rangel writes:

I’m a neuroscientist at Caltech . . . I’m using the debate on the ESP paper, as I’m sure other labs around the world are, as an opportunity to discuss some basic statistical issues/ideas w/ my lab.

Request: Is there any chance you would be willing to share your thoughts about the difference between exploratory “data mining” studies and confirmatory studies? What I have in mind is that one could use a dataset to explore/discover novel hypotheses and then conduct another experiment to test those hypotheses rigorously. It seems that a good combination of both approaches could be the best of both worlds, since the former would lead to novel hypothesis discovery, and the latter to careful testing. . . it is a fundamental issue for neuroscience and psychology.

My reply:

I know that people talk about this sort of thing . . . but in any real setting, I think I’d want all my data right now to answer any questions I have. I like cross-validation and have used it with success, but I don’t think I could bring myself to keep the split so rigorous as you describe. Once I have the second dataset, I’d form new hypotheses, etc.

Every once in a while, the opportunity presents itself, though. We analyzed the 2000 and 2004 elections using the Annenberg polls. But when we were revising Red State Blue State to cover the 2008 election, the Annenberg data weren’t available, so we went with Pew Research polls instead. (The Pew people are great–they post raw data on their website.) In the meantime, the 2008 Annenberg data have been released, so now we can check our results, once we get mrp all set up to do this.
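For concreteness, here is a minimal sketch (mine, not from the post or from Rangel’s lab) of the split-sample workflow described above: set aside a confirmation half before looking at anything, screen freely on the exploratory half, then test only the pre-specified hypothesis on the held-out half. The data, the simple correlation screen, and all names are hypothetical stand-ins.

```python
# Minimal sketch of an explore-then-confirm split on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical dataset: one outcome and a handful of candidate predictors.
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 0.4 * X[:, 3] + rng.normal(size=n)   # only predictor 3 carries signal

# 1. Split once, up front, and set the confirmation half aside untouched.
idx = rng.permutation(n)
explore, confirm = idx[: n // 2], idx[n // 2:]

# 2. Exploration: screen all predictors, using the exploratory half only.
screen = [abs(stats.pearsonr(X[explore, j], y[explore])[0]) for j in range(p)]
best_j = int(np.argmax(screen))

# 3. Confirmation: test only the pre-specified predictor on the held-out half.
r_val, p_val = stats.pearsonr(X[confirm, best_j], y[confirm])
print(f"exploratory pick: predictor {best_j}")
print(f"confirmatory test on held-out half: r = {r_val:.2f}, p = {p_val:.3f}")
```

The discipline is entirely in step 1: the confirmation half is never consulted during exploration, which is exactly the part that, as the reply above notes, is hard to maintain in practice.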

5 thoughts on “Splitting the data”

  1. Could the debate be framed as ingenuity vs robustness?

    If we do a rigorous split of the data, reserving one part for exploration and one for confirmation, we get a (very) robust measure of statistical uncertainty. But it's not an efficient way to explore, and we might miss something important.
    If instead we use all the data for both jobs, with an unlimited remit to explore, the resulting measures of statistical uncertainty will be very sensitive, and unconvincing. But we've got a better chance of finding the interesting signal, which is good.

    Where can one find guidance on this – for work in "real settings", not written as theorems?

  2. I think this is one of those cases in which Kurt Lewin's ‘There is nothing so practical as a good theory’ fits like a glove.
    From practice using EFA and CFA I have learned that if you have a good theory (or at least some previous assumptions or at least "academic jurisprudence") you should try to keep confirming (or falsifying) your models.
    For exploring a new field… now that is a horse of a different colour.

  2. It depends on your discipline – or how easily you are likely to be (mis)led by data.

    Almost the opposite of Manola's comment – the more you know, the safer the hints picked up from data are.

    But as JG Gardin used to say – you can't rule out a hypothesis on the basis of how it was generated – it's about the economy of research – how to become less wrong, quicker.

    So do the CV stuff – but decide whether such formalities were meant for researchers like you (with apologies to Napoleon Bonaparte).

    K?
    p.s. And I think Bayesians should be aware of David Draper's 3CV stuff.

  4. It is so hard to maintain the split without letting what you know about the data pollute your thinking. When I have tried to do this, I invariably find that I am considering what I know about the testing dataset when fitting models in the exploratory dataset. I think it is just better to drop the pretense of separate datasets and to combine them. CV methods can be used afterwards no matter what, and that has some usefulness (a minimal k-fold sketch follows these comments).

  5. I was saved from embarrassment recently when my shiny new theory (strongly confirmed by the first dataset) disappeared in a puff of randomness when I tested additional datasets. I'm sad that my theory didn't pan out, but glad I found out before I published!
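Since a couple of the comments mention "the CV stuff," here is a minimal k-fold cross-validation sketch (mine, not from the thread). The linear model and synthetic data are hypothetical stand-ins; the point is only that every observation is held out exactly once, so the error estimate never comes from data the model was fit to.

```python
# Minimal k-fold cross-validation sketch on synthetic data, using only numpy.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: linear signal plus noise.
n = 120
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

def fit(x_tr, y_tr):
    """Least-squares slope and intercept on the training fold."""
    A = np.column_stack([x_tr, np.ones_like(x_tr)])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return coef

def predict(coef, x_te):
    return coef[0] * x_te + coef[1]

# k-fold CV: each fold is held out once; the model never sees its own test data.
k = 5
folds = np.array_split(rng.permutation(n), k)
mse = []
for held_out in folds:
    train = np.setdiff1d(np.arange(n), held_out)
    coef = fit(x[train], y[train])
    err = y[held_out] - predict(coef, x[held_out])
    mse.append(np.mean(err ** 2))

print(f"cross-validated MSE: {np.mean(mse):.2f} (+/- {np.std(mse):.2f})")
```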
