Kaggle: A platform for data prediction competitions

Anthony Goldbloom writes:

I’m writing because:

a) you may have some interest in our new project, Kaggle, a platform for data prediction competitions; and

b) I’d like to get your input.

First, the summary: Kaggle allows organizations to post their data and have it scrutinized by the world’s best statisticians. It will offer a robust rating system, so it will be easy to identify those with a proven track record. Organizations can choose either to follow the experts or to follow the consensus of the crowd, which (at least according to James Surowiecki) is likely to be more accurate than the vast majority of individual predictions. (It’ll be interesting to see who triumphs: the crowd or the forecasters with a track record.) The power of a pool of predictions was demonstrated by the Netflix Prize, a $1m data-prediction competition, which was won by a team of teams that combined 700 models.
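
As a quick illustration of that pooling effect, here is a hypothetical sketch in Python (all the numbers are made up; this shows the averaging phenomenon, not Kaggle’s actual mechanism):

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical setup: 50 forecasters each predict the same 200 quantities.
    # Each forecaster is unbiased but noisy; part of the error is shared.
    truth = rng.normal(size=200)
    shared_error = rng.normal(scale=0.5, size=200)   # error common to everyone
    predictions = np.array([
        truth + shared_error + rng.normal(scale=1.0, size=200)
        for _ in range(50)
    ])

    individual_rmse = np.sqrt(((predictions - truth) ** 2).mean(axis=1))
    consensus_rmse = np.sqrt(((predictions.mean(axis=0) - truth) ** 2).mean())

    print(f"median individual RMSE: {np.median(individual_rmse):.3f}")
    print(f"consensus (mean) RMSE:  {consensus_rmse:.3f}")
    # Averaging cancels the independent errors but not the shared one, so the
    # simple mean typically beats nearly every individual forecaster.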

Now, for my questions:
1. Can you think of any interesting problems that would be ripe for a data-prediction competition?
2. Can you blog about Kaggle? We’ve had some interest from the general tech community and the data mining community, but it’d be great to get statisticians involved.

For interest, we’re currently running a competition to forecast the voting in May’s Eurovision Song Contest. As you may know, Eurovision pits performers from all over Europe against each other, producing voting outcomes that are widely believed to be influenced by politics and alliances rather than performance quality. Contestants in Kaggle’s ‘Forecast Eurovision Voting’ competition will attempt to exploit these regularities to predict the voting matrix (who votes for whom) for the 2010 Eurovision Song Contest. The winner of the Kaggle contest will collect a US$1,000 cash prize, which we hope to recoup by laying a bet based on the competition’s consensus forecast.

My reply:

This looks like fun. From a statistical perspective, one thing that interests me is the extent to which different methods would be useful for different problems.

I remember several years ago talking with a professor of computer science who told me about some machine learning methods he was using, and I told him about the hierarchical interaction models I was playing with (and which I’m still struggling to fit, so many years later . . . but that’s another story, about the slowness of research in statistics compared to the rapid progress in computer science). Anyway, he told me about a problem he was working on, something to do with classification of proteins based on their molecular structure, and I told him about my problem: modeling voters based on geographic and demographic predictors. It turned out that his methods were useless for my problems and my methods were useless for his. Or, to be precise, his problem was so complex that I couldn’t easily figure out how to apply my ideas, and he felt that my problem was so small and noisy that his methods wouldn’t work for me.

At a deep level, I can’t believe this could really be so. I’d think that a fully fleshed-out machine learning method would work on my little survey analysis problems, and that a fully-operational hierarchical modeling approach would work on his huge-data problems. Or, to put it another way, there should be some larger structure that includes these different approaches as special cases. But, in the meantime, it would be interesting to see the extent to which different methods work better on different sorts of examples.
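
As a toy version of that comparison, here is a sketch assuming numpy and scikit-learn (the crude partial-pooling estimator below stands in for a hierarchical model, and the random forest for a generic machine-learning method; neither is anyone’s actual method):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(1)

    # A "small and noisy" survey-style problem: estimate the mean of each of
    # 20 groups from only 5 observations per group.
    n_groups, n_per_group, sigma = 20, 5, 2.0
    true_means = rng.normal(scale=1.0, size=n_groups)
    groups = np.repeat(np.arange(n_groups), n_per_group)
    y = true_means[groups] + rng.normal(scale=sigma, size=groups.size)

    # Crude partial pooling: shrink each group mean toward the grand mean by
    # the usual variance-ratio factor, tau^2 / (tau^2 + sigma^2 / n).
    group_means = np.array([y[groups == g].mean() for g in range(n_groups)])
    within = sigma**2 / n_per_group            # sampling variance of a group mean
    between = max(group_means.var() - within, 0.01)
    pooled = y.mean() + between / (between + within) * (group_means - y.mean())

    # A flexible machine-learning method on the same data, with the group id
    # as the only feature (roughly "no pooling" plus extra averaging noise).
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(groups.reshape(-1, 1), y)
    rf_est = rf.predict(np.arange(n_groups).reshape(-1, 1))

    for name, est in [("partial pooling", pooled),
                      ("raw group means", group_means),
                      ("random forest", rf_est)]:
        rmse = np.sqrt(((est - true_means) ** 2).mean())
        print(f"{name:16s} RMSE vs true means: {rmse:.3f}")

On a problem this small and noisy, the shrinkage estimator usually comes out ahead; presumably the ranking would look quite different on a huge structured problem like the protein one.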

P.S. Goldbloom adds that they’re now offering a “spotting fee” for interesting competition ideas.

10 thoughts on “Kaggle: A platform for data prediction competitions”

  1. It does look like fun. It's interesting to look at the M competitions in forecasting (Makridakis et al.).

    After 3 competitions, it's pretty clear that simpler models work best in an environment where you have little information outside the data itself.

    It's also clear that new methods have to show how well they do on these freely available data sets in order to be credible.

  2. What I'm struck by is how much interest there is in predicting trivialities such as the winner of the Eurovision pop song contest, while amazing data sources such as the National Longitudinal Study of Youth, now entering its fourth decade of tracking a nationally representative sample of people (and their children!) and giving us clues about how to predict people's fates in life, are of so little interest to non-specialists.

    There was one bestseller based on NLSY79, The Bell Curve, but, since 1994, just a lot of unease with these data sources.

  3. Now witness the power of this fully armed and operational hierarchical modelstation….

    Sorry, the Star Wars Emperor's voice just came into my head when I read your sentence…

  4. Zbicyclist: One thing we've learned from many applications, from language processing to the Canadian lynx series, is that you can often get a lot more bang for your modeling buck from latent-variable models than from epicyclic complexity added to regression models.

    Steve: Interesting point about these high-quality public datasets. But such data sources are hardly ignored: there must be something like a zillion NBER reports that use NLSY. Economists love these datasets. Even more, they love the huge surveys such as the CPS, which allow them to estimate small effects with weak instruments and still reach statistical significance.

    Dan: As long as you're not hearing zombie voices, you're doing fine.

  5. Wondering why there's no mention of scaffolding ideas here. I might start with simple nearest neighbors to get a sense of how difficult the prediction (naively) is, and then try a number of slightly different methods.

    Tibshirani and LeBlanc wrote a paper some years back using the analogy that an inverse-variance-weighted combination of unbiased estimates is always at least as good as any single one (which includes putting zero weight on all but one).

    There's also the borrowing-strength literature, which overlooks the possibility of borrowing weakness. I read something recently with a comment about borrowing negatively, but no definition of what was actually meant, nor any references.

    It's easy to simulate a situation with an unbiased but imprecise RCT and a biased but precise non-RCT and show that, if they agree closely, you would be better off moving the RCT interval away from the non-RCT interval than pooling the two, or even than just using the RCT interval. (A minimal version of the pooling trade-off is sketched at the end of this comment.)

    Is that what people mean by borrowing negatively?

    Does that happen in prediction contests, i.e., where you adjust your predictions "oppositely" to other predictors?

    K?
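
    A minimal version of the pooling trade-off in that simulation, for the record (my sketch; the bias values and standard errors are arbitrary):

        import numpy as np

        rng = np.random.default_rng(2)
        n_sim = 100_000
        theta = 0.0                                   # true effect

        # Unbiased but imprecise "RCT" vs. a precise "non-RCT" whose bias we vary.
        se_rct, se_obs = 1.0, 0.3
        w_obs = se_rct**2 / (se_rct**2 + se_obs**2)   # inverse-variance weight on obs

        for bias in [0.0, 0.5, 1.0, 2.0]:
            rct = rng.normal(theta, se_rct, n_sim)
            obs = rng.normal(theta + bias, se_obs, n_sim)
            pooled = (1 - w_obs) * rct + w_obs * obs
            rmse_rct = np.sqrt(((rct - theta) ** 2).mean())
            rmse_pool = np.sqrt(((pooled - theta) ** 2).mean())
            print(f"bias = {bias:.1f}: RCT-only RMSE = {rmse_rct:.2f}, "
                  f"pooled RMSE = {rmse_pool:.2f}")

        # With no bias, inverse-variance pooling wins easily; once the bias of
        # the precise study is large enough, pooling is worse than ignoring it,
        # which is one way to read the "borrowing weakness" worry above.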

  6. Re: Steve's post
    We're very interested in running a more meaningful competition using a data source such as the National Longitudinal Study of Youth.

    Possibly a competition to predict those who will have problems in later life? Such a model could then support early intervention. The only problem with this proposal is that the answers are already known, so competitors could retrofit their models.

    We could run a competition to predict certain elements of the next wave of data. Any thoughts anybody?

  7. What people are really engaged by is the unpredictable, so we have a lot of entertainments (singing contests, sports, Academy Awards, etc.) created to maximize unpredictability. If some genius statistician figured out how to predict who'd win, the organizers would just change the rules to make them more unpredictable.

  8. Americans, of course, cannot understand the importance of the Eurovision Song Contest. How many papers do American academics produce on baseball, American football, etc.? A fair few, actually, to an extent that puzzles Europeans [like me].
    The song contest actually has a certain cultural significance, and the voting patterns are interesting. The move to a telephone vote, away from a jury vote, has been important, as has the entry of the accession states. The presence of minorities [Turks in Germany, Catholics in Northern Ireland] can be quite influential. Like the NCAA basketball tournament or the so-called “World Series”, it is, indeed, ultimately trivial, but it may be a useful testbed for predicting more important events. Like soccer.

  9. The slowness of research in statistics versus computer science – I'd love to hear more thoughts on that. I'm a stats grad student and my wife is a CS grad student and she's often surprised by how slowly my research moves. It could just be my own laziness or stupidity, but I also get the feeling that statistics is just a slower field. I think it's because we're so careful and probe our models much more deeply, but that's a very convenient opinion.
