Data overload

Adam Kramer, a student in social psychology at the University of Oregon, writes,

I’m having a problem with “too much data.” I’m trying to construct a few HLMs on a pretty large data set: around a million people, with 1 to 10,000 observations each, and 2 to 70 dependent variables (depending on the analysis). My personal computer can’t really handle that much data, since most stats programs (specifically, R and SPSS) attempt to read all the data into memory before analyzing it.

So I was hoping that someone somewhere had written a script or program that would be able to process such data in a serial format: not read the data set into the computer’s memory, but instead systematically update the relevant coefficients while reading through a data file. That said, I’m not exactly sure that REML estimation would be possible with such a setup; it might require a pass through the entire data file for each iteration, but that’d still be better than nothing.
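
As a concrete illustration of the “update the coefficients while reading through the file” idea, here is a minimal one-pass sketch for plain least squares (REML for a multilevel model is more involved), accumulating the sufficient statistics X'X and X'y chunk by chunk. The file name and column names are made up:

    ## One pass over a large CSV, never holding more than one chunk in memory.
    chunk_size <- 100000
    con <- file("bigdata.csv", open = "r")              # hypothetical file
    header <- strsplit(readLines(con, n = 1), ",")[[1]]

    XtX <- matrix(0, 3, 3)    # for an intercept plus predictors x1, x2
    Xty <- rep(0, 3)

    repeat {
      chunk <- tryCatch(
        read.csv(con, header = FALSE, col.names = header, nrows = chunk_size),
        error = function(e) NULL)                       # an error here signals end of file
      if (is.null(chunk) || nrow(chunk) == 0) break
      X   <- cbind(1, chunk$x1, chunk$x2)               # design matrix for this chunk
      XtX <- XtX + crossprod(X)                         # running X'X
      Xty <- Xty + crossprod(X, chunk$y)                # running X'y
    }
    close(con)

    beta_hat <- solve(XtX, Xty)                         # least-squares estimates

(The biglm package implements this kind of bounded-memory fitting for linear, and via bigglm generalized linear, models, updating a fit with successive chunks of data.)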

First off, this reminds me of a line from one of my talks: you never have “too much (or too many) data,” any more than you have too much money. If you have too much money you buy fancier things, you move to a nicer neighborhood, eventually you start giving it away to worthy causes–you never have too much. Similarly, if you have an overflow of data, you can perform estimates for subgroups, you can estimate nonlinear effects, etc. That’s one reason you almost never see estimates that are 10 se’s away from zero: if you had that kind of data, you’d subdivide and learn more.

Getting back to your original question, I’d probably start with various simpler models (maybe you’ve already done this): first fit the regression on the raw data without varying intercepts or slopes for the individual people, then compute person-level residuals and do a second-level regression on these, then iterate if necessary. This can be seen as an implementation of the Gibbs sampler, or as a way to build up the elaborate model starting with simpler steps, or as a way of setting up a two-stage model. In practice, a lot depends on whether your predictors are at the level of the person or the observation.
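
As a rough sketch of that two-stage recipe (assuming the data, or a manageable sample of it, sits in a data frame dat with made-up columns y and x at the observation level, a person-level predictor z, and a person id):

    ## Stage 1: ordinary regression on the raw data, ignoring the grouping.
    fit1 <- lm(y ~ x, data = dat)
    dat$res <- residuals(fit1)

    ## Collapse to one row per person: the average residual, carrying along
    ## the person-level predictor z (constant within person).
    by_person <- aggregate(res ~ person + z, data = dat, FUN = mean)

    ## Stage 2: regress the person-level residuals on the person-level predictor.
    fit2 <- lm(res ~ z, data = by_person)

    ## To iterate, subtract the fitted person effects from y and refit stage 1.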

8 thoughts on “Data overload”

  1. A related question: When you have that much data, any simple model is rejected (significance, BIC, whatever …). Your answer seems to be "elaborate the model".

    How does one choose a simple model? My "solution" has been to choose a minimum meaningful effect size. Do you have any alternate suggestions?

  2. Adam needs a database. Read all 10,000 x 1,000,000 observations into the database, which will live on a nice and spacious hard drive, and then query out one variable at a time as it is needed.

    I hate to self-promote on other people's blogs, but I happen to have written a book on this sort of thing, Modeling with Data, available as a PDF at that link. I don't use R, but most of the methods discussed there are directly applicable via comparable database-to-R interface packages (see the sketch after this comment).
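
    A minimal sketch of that database-plus-chunked-queries workflow on the R side, assuming the DBI and RSQLite packages (the file, table, and column names here are made up):

        library(DBI)
        library(RSQLite)

        con <- dbConnect(RSQLite::SQLite(), "observations.sqlite")
        ## (Load the raw file into the table once, e.g. with repeated
        ##  dbWriteTable(con, "obs", chunk, append = TRUE) calls.)

        ## Pull out only the variables a given analysis needs, in pieces.
        res <- dbSendQuery(con, "SELECT person, y, x1 FROM obs")
        while (!dbHasCompleted(res)) {
          chunk <- dbFetch(res, n = 100000)   # one manageable piece at a time
          ## ... update running estimates with this chunk ...
        }
        dbClearResult(res)
        dbDisconnect(con)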

  3. You should try SAS.

    As an undergraduate, I had a massive database from a market research firm that would crash R, S-Plus, and SPSS whenever I tried to work with the data. SAS, however, had a different kind of batch processing that allowed us to work on our data without crashing our machines, seemingly by loading only the data required for processing.

  4. If you really want to do it by brute force, try the filehash package. The package helps you access the data without having to read it all into R's memory.
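
    A minimal sketch of the filehash approach, with made-up names: the data live in a file-backed database on disk, and only the requested piece is read into memory.

        library(filehash)

        dbCreate("bigdata_db")            # one-time: create the on-disk database
        db <- dbInit("bigdata_db")

        ## Store the data in named pieces, e.g. one person (or one variable)
        ## per key, so no single object has to hold the whole data set.
        dbInsert(db, "person_00001", one_persons_rows)   # one_persons_rows: hypothetical data frame

        ## Later, fetch only the piece an analysis needs.
        x <- dbFetch(db, "person_00001")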

  5. 1. As others have said, try SAS.
    2. A million people? 10,000 observations each? OK, which way do you want to sample it? The nice part about databases of this size (and I've got a few hanging around the office myself) is that you can take a very large set of data — 50,000 people, or 1,000 observations — and just trash the heck out of it. Run all possible t-tests. Put it in a stepwise model. Do what you want. Explore to your heart's content. Take advantage of chance. Overfit. Write a program to see which of several transformations gives you 1% more variance explained.
    3. Then crossvalidate, and learn humility (see the sketch after this comment).

    By the way, although I like to make jokes in blog comments, I'm quite serious about #2 and #3.
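
    A rough sketch of that sample / explore / crossvalidate workflow, with made-up object and column names (assuming the samples themselves fit in memory):

        set.seed(1)
        people <- unique(dat$person)
        explore_ids <- sample(people, 50000)                       # the "trash the heck out of it" set
        holdout_ids <- sample(setdiff(people, explore_ids), 50000) # kept untouched until the end

        explore <- dat[dat$person %in% explore_ids, ]
        holdout <- dat[dat$person %in% holdout_ids, ]

        ## Explore freely: overfit, try transformations, take advantage of chance.
        fit <- lm(y ~ poly(x, 5), data = explore)

        ## Then learn humility on the held-out people.
        pred <- predict(fit, newdata = holdout)
        cor(pred, holdout$y)^2    # out-of-sample R^2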

  6. Where did you get 10,000 observations per subject? That must be a joke.
    There are specially designed packages for large HLMs, though they are not as flexible as R, SAS, … Take a look at my response on the R-sig-mixed-models list for a list of some packages from the animal breeding area.

  7. This comment caught my eye: "The nice part about databases of this size (and I've got a few hanging around the office myself) is that you can take a very large set of data — 50,000 people, or 1,000 observations — and just trash the heck out of it. Run all possible t-tests."
    And yeah, too much data often just means lots of garbage.
