R in the news

See here. I pretty much agree with what they’re saying, except that I think R occupies a position as much as it serves a function. By this I mean that, if R didn’t exist, we’d be doing similar things using something else, whether it be Matlab, Mathematica, or some Python-based confection. When I think about what I actually do in R, it wouldn’t be so hard to do most of it from scratch. This is not to disparage R, just to say that it’s filled a niche.

And I certainly wouldn’t characterize R as “a supercharged version of Microsoft’s Excel spreadsheet software.” Or maybe I should say that I didn’t know that R had spreadsheet capabilities. One more thing to learn, I guess. Also another motivation for Jouni to finish Autograph.

And it’s good to hear that SAS is in trouble. I just hate SAS. It’s not just a matter of it having poor capabilities; SAS also makes people into worse statisticians, I think.

P.S. The reporter contacted me about this story a few weeks ago, but I don’t actually remember what I said (or even whether I was ever actually reached). Certainly nothing memorable enough to quote.

35 thoughts on “R in the news”

  1. I thought it was a good article — it's a real vindication of the use of R given its use in the major organizations mentioned, like Pfizer and Bank of America. (REvolution Computing was also contacted about this article and we pointed the author to several of our customers, but REvolution R didn't get a mention, sadly. Maybe next time.)

    SAS doesn't have a great reputation in the open source community to begin with, but that quote about not using "freeware" to build a jet really missed the mark. For one thing, I know several Boeing engineers here in Seattle who use R.

  2. I think that the main strength of SAS is not in its statistical procedures but in "data step" processing, especially when I'm working with extremely large data sets, e.g. census data. Unfortunately, once I've done all the data prep work in SAS there is very little incentive to move on to other software. For the past 10 years or more I've toyed with different software – Stata, R, SPSS – and I haven't come across anything that comes close to its capabilities in processing large amounts of data.

    For teaching or demo purposes where data sets are relatively small (less than 1MB) I can see where R has an advantage.

  3. I’m not aware that R has any spreadsheet capabilities (beyond the fix() command), but I too tend to think of it as a “supercharged version of Excel” in that both are engines for calculation, and both operate on vectors and arrays. I guess after all these years of working with Excel, it has become a reference point.

    But it wasn’t always so. I remember my reaction the first time I saw a spreadsheet: it was VisiCalc, running on an Apple II. I said, what good is that? I couldn’t see any reason why you’d want to see your arrays all written out on the screen like that. Wouldn't all those numbers be confusing? I had already done a fair bit of programming in Fortran and Pascal, and in my experience the answer usually involved the sum or average of an array’s columns.

    In any case, it was interesting to see R discussed in the New York Times. What’s next, a feature on WinBugs?

  4. “We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.”

    They should have asked her if SAS runs on Linux!

  5. The availability of large-dataset processing packages for R based on SQLite makes R perfectly suited to the big-dataset processing that SAS is famous for. The only difficulty may be importing variously formatted datasets into the SQLite database in the first place. If you have some skills with Perl, Python, sed, and/or awk, you can do most formatting tasks much more easily than in SAS.

    See the packages RSQLite, SQLiteDF, and biglm for examples.
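    As a sketch of how these pieces fit together – the table name, variable names, and toy data below are invented, and an in-memory database stands in for a real file-backed one:

```r
# Chunked regression with biglm: fetch rows from an SQLite table a
# piece at a time, so the full data set never sits in memory at once.
library(DBI)
library(RSQLite)
library(biglm)

# Toy stand-in for a big table; in practice this would be a file on disk.
con <- dbConnect(SQLite(), ":memory:")
set.seed(1)
dbWriteTable(con, "persons",
             data.frame(income = rnorm(5000, 50, 10),
                        age    = sample(18:80, 5000, replace = TRUE)))

res <- dbSendQuery(con, "SELECT income, age FROM persons")
fit <- NULL
repeat {
  chunk <- dbFetch(res, n = 1000)     # pull 1000 rows per chunk
  if (nrow(chunk) == 0) break
  fit <- if (is.null(fit)) {
    biglm(income ~ age, data = chunk) # start the fit on the first chunk
  } else {
    update(fit, chunk)                # fold each later chunk into the fit
  }
}
dbClearResult(res)
dbDisconnect(con)
coef(fit)
```

    Since biglm keeps only the running sufficient statistics, the memory footprint is set by the chunk size, not the table size.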

  6. >Sadly SAS is the necessary evil for large data sets.

    Not these days. SPSS has been capable of handling large data sets for several years. S-Plus has been capable of handling large data sets (out-of-core algorithms) since version 7.

  7. Hi Andrew–

    Can you explain how you think SAS makes people into worse statisticians, in a way that's different from other software?

    FTR, I think you can do anything in any sufficiently flexible environment, and both SAS and R fit the bill.

    The main difference IMO is support. If the developer of your favorite R package stops developing it, it might no longer be available to you. Or they might not be willing or able to fix errors. With SAS you pay for and get backwards compatibility and committed support. In addition, the documentation for R can be quite spotty. SAS's can be cumbersome, but it's almost always possible to find out what the software is doing, with a cite.

  8. R's weakness is indeed with very large data sets, but your limit is hardly 1MB of compressed data. 2GB of RAM is a low-end box these days. Even a memory hog like R isn't going to scratch that with a 1MB data set. Join the full ACS person data set (3M rows, a couple hundred columns) to the full household set (another 1M rows, another several score columns), though, and R will indeed start to choke on all but the most memory-rich machines.

    PSPP is a free version of SPSS that works on disk for those very large sets. It was not ready for prime time the last time I tested it, but it's getting closer. Anyone here use it recently?

  9. Ken,

    SAS spews out pages and pages of output for any analysis. The output isn't easy to post-process; as a result, people stare at the output and pick out numbers. R more easily allows graphical and other postprocessing of inferences.

  10. …Anne H. Milley, director of technology product marketing at SAS. She adds, “We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.”

    Interesting point. Mathematical formulas cannot be patented (freeware of sorts), which makes me pretty happy, especially when I'm about to board a plane ;)

  11. If SAS decides to drop support for a procedure, there's pretty much nothing you can do. This alone has hurt me badly enough in the past to steer me away from SAS.

    If an R package maintainer decides to stop, you can (if the license is right) take the source and carry on.

    Personally, I've found Perl better suited for handling messy source data (better than R or SAS). There's a huge library of tools for this, not unlike CRAN.

    For large data (as in many rows) a proper RDBMS is always worth the investment. Especially if you need simultaneous writes or refined access control.

    For wide data (as in many variables), R is usually most efficient. You don't need to use data frames.

  12. "I just hate SAS."

    Why such extreme hostility? It is a statistical language, albeit, in your opinion, a less capable one. Nevertheless, it is a tool that some statisticians use to draw inferences and quantify uncertainties.

    What's the purpose of saying something like this? Does it reflect a belief that the world would be a better place had SAS disappeared completely from the surface of the earth?

  13. A perhaps thoughtful quip about all new technologies (R) is that the benefits are exaggerated, the risks are downplayed and the ability of the old technology (SAS) to respond is underestimated …

    My disclosure here is that I prefer R and recently have been "forced" to use SAS again.

    The benefit of R to me was that you can modify and develop variations of procedures and explore their properties easily and quickly via simulation – AND also get all those neat new procedures from people like Tibshirani almost immediately.

    The risk of R to me is that it is easier to unknowingly fall off the beaten path and do "silly things" – it is (unsupported) academic software, and at times it even seems the developers take delight in users not getting it right.

    For SAS responding – the new ODS and macro stuff does allow one to control and manipulate the output and do some good graphics.

    But as users switch to R (I believe many will) they will be at high risk of approving drugs, investments, aircraft designs, etc. based on computational errors AND the more that can be done to help them minimize this risk the better for all of us.

    Keith

  14. F. Chen:

    1. What it means to "hate" a statistical package is different from what it means to hate a person, say. Of course, my antagonism toward SAS pales beside my hatred for spammers, robocallers, bike thieves, and the guy who designed that impossible-to-take-apart plastic packaging.

    2. Would the world be a better place had SAS disappeared completely? Maybe. Yes, I think so, for example, if all SAS users were to switch to Stata.

    3. The purpose of saying something like this is to express my unhappiness with SAS-style analyses, using a dramatic figure of speech that will capture people's attention.

  15. If I wanted to read 5 variables from a person-level file in the ACS data, for instance, and then merge them with the 7 variables from the household-level file, I would not be able to do this in R (at least my looking around the R forum has not uncovered anything that resembles this).

    Most likely I would need to do this external to R using Perl or Awk for instance. In this case, I would need to learn 2 languages. This data pre-processing step is the main thing that keeps me bound to SAS.

    I've been trying to get off SAS for the past couple of months now but I keep falling off the wagon whenever I encounter these pre-processing steps, which is pretty often.

  16. Bccheah: I also have big troubles with inputting and merging data in R. My impression is that Stata is much better. I'd like to think that there's something in R that could allow me to do it better. As it is, I often spend hours writing awkward programs in R to do simple data processing.

  17. Paul, I think that was PMR's point. SAS runs on Linux, which is free software. So, by the logic of the SAS marketing person, we should be concerned about any SAS analysis that was done on a Linux system.

  18. I also find SAS' handling of large datasets far superior. Of course, most enterprise-level stuff resides in relational databases. I'm using PostgreSQL and PL/R currently and am happy (though getting PL/R to work on RHEL was a bit of a pain). Actually, I'd say R fills the niche of filling a niche. Much like Matlab but without the price tag. Which reminds me of something a friend of mine said when leaving SAS for MathWorks: "The people at SAS would rather sit around and count their money than innovate."

  19. I agree with bccheah that SAS has great capabilities in manipulating data structures and cleaning data via the data step. The SAS environment enables the user to 'automate' routine data cleansing processes – very useful to prevent your job from becoming too repetitive. So I feel that SAS has a solid lead in this niche, and as a statistician working in industry, 90% of my time is spent preparing data for statistical analysis.

    I will look into R more (used it a little in Grad School with Prof Gelman) thanks to this article and the support from this community.

  20. I agree that SAS's strength is in data step processing. A lot of the time I am creating "study data sets" i.e. final version data sets for publication. I need to be able to keep a piece of code that is easy to read and will get me the same output if I run it again.

    If two years after the data was created a reviewer queries something I need to be able to easily dig back and see what was going on.

    R is a very compact, powerful language but it's hard to read when you go back and look. If you are really disciplined (i.e. have the time => have the money to pay for the time) then it's probably fine.

    In SAS I usually ODS the data I want to a HTML file and then post-process in Excel and/or graph in R. Sometimes the data may come from the stat procedure itself or from a proc print after the stat procedure.

  21. 1- I use Python, and whenever I need to I can call R, Matlab or SPSS from within my code, so I'm very happy with Python as my main language and a wrapper for other statistical software. I encourage everyone else to look into that as well.

    2- About R not being able to handle large datasets, I wouldn't worry that much about it. In free software those things get solved quickly. Now that R is getting more exposure, perhaps more developers will be willing to work on its engine.

    3- I'm wondering if some companies are willing to invest in R like IBM is doing with Linux. It would be a perfect idea for Pfizer to hire a number of programmers and statisticians and contribute a little back to this product.

    3.5- R's syntax sucks so badly, though; it reminds me of COBOL.

    4- And R is terrible in terms of user interface.

  22. Bccheah,

    I must say my experience is different. I made the jump from SAS to S/R in around 1995, and it took me about two years to unlearn SAS and to be efficient in S.

    The trick is to realize that in S/R you work with variables whereas in SAS you work with observations. So merging is done by

    for each variable X in data A
    add X to B so that some ID matches

    Essentially, you transform all your "merges" (SAS lingo) into "left outer joins" (RDBMS lingo) by "matching" (R lingo).

    for (i in names(A)) B[[i]] <- A[[i]][match(B$id, A$id)]  # assuming a shared id column
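    Spelled out on toy data – the id column and variable names here are invented:

```r
# MV's variable-by-variable matching, made concrete.
A <- data.frame(id = c(1, 2, 3), x = c(10, 20, 30))
B <- data.frame(id = c(3, 1, 1, 2))

# For each variable in A, add it to B, aligned so the ids match.
idx <- match(B$id, A$id)
for (i in setdiff(names(A), "id")) B[[i]] <- A[[i]][idx]

B$x   # 30 10 10 20 -- A's rows lined up against B's ids
```

    The same left outer join can be written in one line as merge(B, A, by = "id", all.x = TRUE), though merge() re-sorts the rows while match() preserves B's order.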

  23. The aesthetic superiority of R to SAS goes far beyond just graphing. I've written programs in each, and the intuitive flow of R coding and the visual look of the words on the terminal are an improvement.

  24. I think the point MV made above is key. R works with matrix operations and it takes time to unlearn "loop" thinking, which is endemic in SAS, awk, etc. That is also why R is the furthest thing from an Excel spreadsheet. Have those people actually used R before making such a comment?

    Also agree with Luke that the syntax of R is much more elegant (though in practice elegance is often counterproductive as it produces code that is difficult to understand and may hide failures of logic).

    When I picked up JMP, I found that I had to learn yet again their particular way of manipulating data. It's very visual, like Lego blocks; and quite powerful too.

  25. Okay, I'm in SAS marketing and I'm very familiar with Anne Milley whose quote made it into the Times article. Anne is a bright, thoughtful proponent of the software Andrew loves to hate. But for the record, neither SAS nor Anne hates R or open source. We run on Linux and we love Apache. For a little more info on SAS and R, take a look at a followup on Anne's blog at http://blogs.sas.com/sascom/

  26. I have further comments in a couple more recent blog entries, but one thing I wanted to say quickly is that I'm not a fan of the vectorize-everything strategy of R programming. I find loops easier to follow (perhaps a legacy of having taken a class in Fortran, thirty-some years ago). And in my book with Jennifer, we put in loops in some places where vectorized computing would be faster but, to my eyes, more opaque.

    One thing that I _don't_ like about R is that I find that naming conventions and programming style have to take the place of structured programming the way I'd prefer it.
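    For instance, here is the same calculation – row means of a matrix – written both ways; which is clearer is a matter of taste:

```r
# Row means of a matrix, computed with an explicit loop and then
# with R's vectorized rowMeans().
set.seed(1)
m <- matrix(rnorm(20), nrow = 4)

# Loop version: more verbose, but every step is spelled out.
means.loop <- numeric(nrow(m))
for (i in 1:nrow(m)) {
  means.loop[i] <- mean(m[i, ])
}

# Vectorized version: one line, with the iteration implicit.
means.vec <- rowMeans(m)

all.equal(means.loop, means.vec)   # same answer either way
```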

  27. As a Mac user I am pretty much used to this kind of discussion… Anne Milley's remark on freeware is extremely disqualifying, and I keep wondering how so many (apparently) professionals can stick to these prejudices, which might hold true on average but are complete nonsense in the case of R.

    The "real" benefit of SAS nowadays is to get an integrated data warehousing solution from data storage over reporting to analytical functionality.

    SAS programming is certainly just an ancient artefact compared to writing code in R, which is based on a sleek grammar.

    SAS should face it sooner or later: Java was declared "dead" by MS and others quite often, but was used so much in teaching that it finally prevailed. R will pretty much go a similar way.

    Jim, can you hear me?

  28. I made a career change to biostatistics about 4-5 years ago. In that time I completed a Masters of Biostatistics (including a Bayesian subject using Andrew Gelman's book); worked on various statistical projects with a government health department; worked in a medical research institute; and now work in a large government research agency. I've also learned and used SAS, Stata and R, among others.

    Although I have used SAS more than the others, I hate it. It was hard to learn, and I need to use the help documentation when doing anything slightly new or different, as each procedure has its own nuances. It is a dinosaur that has been progressively made to look modern, but can't hide its ancient roots. SAS is, however, almost indispensable when analysing and reporting on large institutional datasets due to its speed and data management capabilities.

    I equally like R and Stata, but for different reasons. Stata (using the command line) is easy to learn and reasonably consistent when it comes to commands. It is also quite intuitive. We used it in our course. Increasingly Stata can do any analysis SAS can do, including mixed models and (in Stata10) exact logistic and Poisson regression. The manuals are good without being excessive and online help is adequate for most purposes. The books available are much easier to read than those on SAS.

    I am a naturally cautious person and prefer to do my analysis at least two different ways to verify my results. When I have found discrepancies between R packages and Stata I have always found the R result to be wrong.

    Stata is an ideal stats package for those without a programming background and who are not mathematically adept, which includes many statisticians and epidemiologists.

    I started with R only 12 months ago. It is not easy to learn if you are not familiar with object-oriented programming and matrices. It is the most versatile and in some ways the most powerful of stats packages, but its "freeware" origin shows in the inconsistent quality of packages and the often poor documentation.

    One application I use it for is to analyse gene expression arrays and the like (using Bioconductor packages). There is almost no substitute for R in this application, with the exception of very expensive specialised packages like Partek and GeneSpring, and neither of these has the statistical capabilities of R. Expression arrays can have 30,000 or more variables – some now have 1.5 million – well beyond Stata's capabilities, and as far as I know SAS has shown no interest in this application.

    So I see R as occupying the cutting edge niche of statistical software development. For example, our organisation has developed an R package called GeneRaVE that uses Bayesian techniques to select a small subset of variables from a multitude to form an efficient classifier.

    R's power makes it ideal for academic and research statisticians, but care should be exercised when relying on the results it produces, even from the base package.

  29. I met R first in my undergraduate statistics courses and I liked it a lot. Now I am in a graduate economics program. For some strange reason, the academic economics community has not caught up with R's popularity yet. Most people have heard of it, some have used it, even fewer actually use it on a regular basis. It seems like economists who mostly need standard routines use something with canned routines such as SAS or Stata, while those who need to code their own routines use Matlab or Gauss. However, I am noticing that some of the new graduate students often ask professors if they will be allowed to use R instead of, say, Matlab for their homework.
