R book

John Kastellec points us to this book by Michael Crawley. It reminds me of what my old college professor wrote to me 12 years ago when I sent her a copy of my (then-)new book: “Thanks for sending this. A lot of people are writing books nowadays.” (I have no comments on Crawley’s book, having seen nothing but the link to the webpage.)

P.S. Somebody just said to me today, “Do you ever use R?” I said yes, and he said that he thought that applied people such as himself used R, but that the real statisticians used things like SAS. Grrrr…. I told him that SAS is said to have advantages with large datasets but isn’t so great at model-fitting and graphics.

12 thoughts on “R book”

  1. Yes, that's what I've heard about SAS too: it has advantages for large datasets. But what exactly are the advantages? And how large does a dataset have to be before they kick in? Nobody seems to know.

  2. I find that doing analysis in R gives me real problems when working with more than one million observations. I ended up having to switch from R to SAS for a linear mixed model (the kind of model sketched at the end of this comment) when working with a series of data sets around that size.

    I learned SAS first and I work with large databases almost exclusively. Many times I have found that doing the analysis in R was a major problem. However, the graphics and modeling in R are so much better that I do prefer it when there is a choice.
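
    For reference, a linear mixed model of this sort looks roughly like the following in R; the lme4 package is one common route, and the data frame d with response y, predictor x, and grouping factor subject are placeholder names, not anything from the analysis above.

      library(lme4)                        # nlme::lme is the older alternative
      fit <- lmer(y ~ x + (1 | subject),   # fixed effect of x, random intercept per subject
                  data = d)
      summary(fit)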

  3. I once read somewhere that R will not handle more than about 2 GB of data (or something in that range) unless one uses a relational database. I tried to set this configuration up once, since I do have such large datasets, but I kind of failed to get it working (mainly because it was not a life-or-death matter at the time); a rough sketch of that kind of setup appears at the end of this comment.

    Crawley's book seems like another manual-type thing; I would rather use the online books and tutorials available for free (although this is an unfair comment since I haven't read the Crawley book; maybe it *is* earth-shatteringly good).
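
    Coming back to the database point above: the setup usually suggested looks roughly like this (a minimal sketch using the DBI and RSQLite packages; the file, table, and column names are made up). The idea is that the data stay on disk and only the rows and columns you query ever come into R's memory.

      library(DBI)
      library(RSQLite)
      con <- dbConnect(SQLite(), dbname = "bigdata.sqlite")   # hypothetical on-disk database
      d <- dbGetQuery(con, "SELECT x, y FROM measurements WHERE year = 2006")  # pull only what is needed
      dbDisconnect(con)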

  4. I have used SAS in the past but switched to R because of its object-oriented nature, easier model fitting, better graphics, and above all the ability to work with objects themselves. In SAS there were times when it was very hard for me to extract something from a model fit.
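
    For contrast, pulling pieces out of a fitted model object in R is straightforward; a quick sketch with an ordinary lm() fit (d, y, x1, and x2 are placeholder names):

      fit <- lm(y ~ x1 + x2, data = d)
      coef(fit)                      # estimated coefficients
      summary(fit)$coefficients      # estimates, standard errors, t and p values
      residuals(fit)                 # residuals
      str(fit)                       # see everything stored inside the fit object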

  5. R requires that all data fit in memory, which is obviously a problem for very large datasets. SAS uses algorithms that operate on the data on disk, so the maximum feasible size is much larger.

    It would not be terribly difficult to implement algorithms in R that work in the same way as SAS's (as has been done for (generalised) linear models in the biglm package), but it requires some methodological development and programming time.
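
    A rough sketch of that chunked approach with biglm (the chunking step and the names d, y, x1, x2 are placeholders; in practice each chunk would typically come from a file or database read):

      library(biglm)
      chunks <- split(d, rep(1:10, length.out = nrow(d)))   # pretend the data arrive in 10 pieces
      fit <- biglm(y ~ x1 + x2, data = chunks[[1]])         # fit on the first chunk
      for (ch in chunks[-1]) {
        fit <- update(fit, moredata = ch)                   # fold in each remaining chunk
      }
      summary(fit)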

  6. About R and large data sets: it is a (large) problem.

    I was just working on the KDD 2007 Cup at
    http://www.kdd2007.com/kddcup.html
    and wanted to use R for my work.
    The dataset is the Netflix Prize dataset (about 2 GB, some 100 million rows of data).

    And my story goes a little like this:
    Setting up a MySQL database.
    Moving the data files into MySQL.
    Indexing the dataset (if you don't do that, each query takes 60-80 seconds!).
    Installing RMySQL (which didn't work, and I had to find files from an older version of the package in order to make it work).
    Finding that I should exploit the fact that I am looking at sparse matrices (installing the SparseM package).
    Discovering that my algorithm still needs too much RAM, so I then tried to work with the "raw" data type.
    Discovering that SparseM doesn't support "raw".

    In the end, what I did was give a simpler algorithm to one of the C programmers on the team, and he did the whole thing…
    (I only used R to check his results, which were wrong, and he had to fix the program several times until it worked right…)

    So, I had a bad time with R and large datasets.
    Any thoughts on the subject from anyone?
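
    For anyone trying the same route, the R side of the MySQL setup described above looks roughly like this (a sketch using RMySQL; the connection details, table, and column names here are placeholders, not the actual Netflix schema):

      library(RMySQL)
      con <- dbConnect(MySQL(), dbname = "netflix", user = "user", password = "password")
      dbSendQuery(con, "CREATE INDEX idx_movie ON ratings (movie_id)")  # without an index, each query took 60-80 seconds
      r <- dbGetQuery(con, "SELECT user_id, rating FROM ratings WHERE movie_id = 1")
      dbDisconnect(con)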

  7. SAS is very robust with large datasets, but it is a bit of a mess as a programming language; kind of a dinosaur (my former boss called it a "loopy" language). It actually calls data lines "cards". There are things that are quite easy in it and other things that are major puzzles. When I used SAS, my former boss and I would get together and pose SAS puzzles to one another: How do you reverse the elements in a dataset? How do you create indicator variables based on an integer-valued series? Some of these things are exceptionally difficult to do (R versions of both are sketched at the end of this comment).

    The other problem with SAS is that there is no one way to do anything; there are generally several. It is a huge language. Yes, SAS is done well, and the statistics produced are amazing, but some things are damn near impossible, and using it, you have no idea how any of the tests/fittings/etc. are done numerically. SAS, meanwhile, is generally better with cross-sectional models than time series. It is particularly good, though, with large datasets. If you have millions of records with hundreds of columns (as I did when I did mortgage research), SAS can handle it.

    R, S-Plus, and Matlab require more coding and generally do not work as naturally or robustly with large datasets (e.g., one may have to code the paging oneself in order to ensure that not all the data is in memory), but they give much more control to the user. I would generally categorize R, S-Plus, and Matlab as toolboxes for the experienced statistician, one who cares about methods and not just results.
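
    As it happens, both of the puzzles above are one-liners in R (d is a placeholder data frame and x a placeholder integer column):

      d_rev <- d[nrow(d):1, ]                          # reverse the rows of a dataset
      ind <- model.matrix(~ factor(x) - 1, data = d)   # indicator (dummy) variables from an integer-valued series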

  8. I prefer to use R with MUSASHI for large datasets rather than with an RDB like MySQL. It's much faster and more compact, and its ideology is "never use relational-database technology." The developer said, "An RDB does a good job of finding one record in a large dataset, but it is not suited for tasks like data mining, which require that the entire set of records be scanned and which have few opportunities to benefit from indexing. Thus it is actually more efficient to use a simple sequential file for that purpose."
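
    The same sequential-scan idea can also be approximated in plain R by processing a flat file in chunks instead of loading it all at once (a minimal sketch; the file name and chunk size are arbitrary):

      con <- file("ratings.tsv", open = "r")    # hypothetical flat file
      header <- readLines(con, n = 1)
      repeat {
        chunk <- readLines(con, n = 100000)     # next 100,000 records
        if (length(chunk) == 0) break
        # ...accumulate summaries for this chunk here...
      }
      close(con)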

  9. Hi tomohiko.

    Thanks for the reference. I looked at some of the MUSASHI documentation and must admit that, offhand, I couldn't quite get what it does or how to connect it to R. Could you recommend any links for a simple integration of the two?

    Thanks,

    Tal.

  10. As a statistical consultant for some 30 years, I have primarily used SAS, but frequently use R these days.

    SAS has many virtues.

    As several have noted, it is very efficient with handling large databases. I often deal with datasets of order 10^5 to 10^8 records, and 10^2 variables. I can handle datasets of this size on PC workstations, especially as transformation and sorting routines and the like are multi-threaded in SAS, and can make use of multiple processors.

    Another less admirable (but nonetheless real) benefit of SAS is its market penetration. It is relatively easy to find good SAS programmers, to run SAS programs on quite diverse platforms, and to translate data from other environments into SAS. Occasionally I have the same irritation with SAS that I have with Microsoft products: I'm forced to use them because using an obscure but superior product for some individual task will end up in the long run causing me delays, heartburn, and dollars.

    About ninety percent of the effort in my projects — even those with some sophisticated modeling at the back end — involves data preparation, cleaning, merging of datasets, etc. Invariably, character and date elements are important, and there are always substantial numbers of missing values. And errors in data prep are very, very costly.

    SAS is an older language. But for the above tasks it is syntactically relatively simple, and the constraints of the language are really strengths. In my experience, time-consuming syntax errors occur much less frequently in SAS, it is far harder to write overly compact "write once, read never" code, and customized I/O and most character and date handling are done more quickly and with fewer programmer errors in SAS than in R. And at the end of the day, you still have a dataset that drops right into the analysis routines.

    In an environment such as this, the most-used analytic tools for detecting patterns, errors, or anomalies are not sophisticated modeling routines but structured prints and multidimensional tables (with grouping, subgrouping, etc.). I find SAS's abilities in the latter areas superior to those in R. It's not that you can't do these things in R (a rough sketch of the R equivalents is at the end of this comment), but in my experience the average programmer can do them much more efficiently and with fewer errors in SAS.

    That said, SAS's big weaknesses in my environments are cost (especially given the long practice of unbundling routines into separate, expensive subpackages), integrated graphics (see the previous comment on unbundling and cost, although in my opinion quality and ease of use are issues as well), and slowness in incorporating newer analysis methods. In addition, any nontrivial simulation modeling or the like that involves complex program control structures, referencing into and out of large multidimensional arrays, optimization of a complex objective function, and/or a clear need for modular subroutines with independent environments is ghastly or impossible in basic SAS. (Some of this kind of work can be done in the SAS IML subpackage, but see the previous comments re cost and unbundling.)

    I use R more than SAS these days not only for the preceding reasons, but also because SAS won't run on my Mac! However, we won't be giving up our SAS licenses any time soon.
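
    For reference, the multidimensional-table side of this in R is usually handled with xtabs() and ftable(); a sketch, with made-up data frame and variable names:

      tab <- xtabs(~ region + product + year, data = d)         # counts by three grouping variables
      ftable(tab)                                               # flatten into a readable two-way layout
      aggregate(amount ~ region + year, data = d, FUN = sum)    # grouped numeric summaries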

  11. Tal,
    So sorry for the late reply.
    I haven't found any bridgeware like R2WinBUGS or RMySQL, or any information about integration. I usually write some code bridging R and MUSASHI via text files on a case-by-case basis.
