Ross Ihaka to R: Drop Dead

Christian Robert posts these thoughts:

I [Ross Ihaka] have been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too. Some of these were inherited from S and some are peculiar to R.

One of the worst problems is scoping. Consider the following little gem.

f = function() {
    if (runif(1) > .5)
        x = 10
    x
}

The x being returned by this function is randomly local or global. There are other examples where variables alternate between local and non-local throughout the body of a function. No sensible language would allow this. It’s ugly and it makes optimisation really difficult. This isn’t the only problem, even weirder things happen because of interactions between scoping and lazy evaluation.
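
A quick illustration of the behavior Ihaka describes (this session sketch is mine, not part of his note):

x = 99     # a global x exists
f()        # returns 10 about half the time, 99 the rest
rm(x)      # no global x any more
f()        # returns 10 about half the time; otherwise
           # errors with "object 'x' not found"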

In light of this, I [Ihaka] have come to the conclusion that rather than “fixing” R, it would be much more productive to simply start over and build something better. I think the best you could hope for by fixing the efficiency problems in R would be to boost performance by a small multiple, or perhaps as much as an order of magnitude. This probably isn’t enough to justify the effort (Luke Tierney has been working on R compilation for over a decade now). . . .

If we’re smart about building the new system, it should be possible to make use of multi-cores and parallelism. Adding this to the mix might just make it possible to get a three order-of-magnitude performance boost with just a fraction of the memory that R uses.

I don’t know what to think about this. Some of my own recent thoughts on R are here. Although I am developing some R packages, overall I think of myself as more of a user than a developer. I find those S4 data types to be an annoyance, and I’m not happy with the “bureaucratic” look of so many R functions. If R could be made 100 times faster, that would be cool. When writing ARM, I was careful to write code in what I considered a readable way, which in many instances involved looping rather than vectorization and the much-hated apply() function. (A particular difficulty arises when dealing with posterior simulations, where scalars become matrices, matrices become two-way arrays, and so forth.) In my programming, I’ve found myself using notational conventions to carry information that the structure of the program should carry, and I think this is a common problem in R. (Consider the various objects such as rownames, rows, row.names, etc.) And anyone who’s worked with R for a while has had the frustration of having to take a dataset and shake it to wring out all the layers of structure that are put there by default. I’ll read in some ASCII data and then go through different permutations of functions such as as.numeric(), as.vector(), and as.character() to convert data from “levels” into numbers or strings.
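
To spell out that last annoyance with a small example (mine, not from the original post): a numeric column that arrives as a factor has to go through as.character() first, or you silently get the internal level codes instead of the values.

x = factor(c("10", "20", "30"))  # numbers that were read in as a factor
as.numeric(x)                    # 1 2 3: the internal level codes
as.numeric(as.character(x))      # 10 20 30: the values that were intended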

Don’t get me wrong. R is great. I love R. And I recognize that many of its problems arise from its generality. I think it’s great that Ross Ihaka and others are working to make things even better.

30 thoughts on “Ross Ihaka to R: Drop Dead”

  1. Joel Spolsky calls rewriting software from scratch the "single worst strategic mistake that any software company can make":

    http://www.joelonsoftware.com/articles/fog0000000

    Basically: no matter how bad you think an existing body of code is (and every programmer on earth confronts new code and thinks immediately that he should throw it out and rewrite it), the existing code encapsulates thousands of hard-won bits of knowledge. Throwing it out and starting over, you lose all that accumulated knowledge.

  2. I was going to send the Spolsky link posted by Steve when I read Andrew's comments about how complicated glm is, and how it only has "10 lines of functioning code". That's exactly the sort of thinking that Spolsky is warning against.

    Ross is talking about change at a fundamental level, fixing design flaws that are impeding the development of the language but aiming for compatibility with R at a higher level (you need to read the original to see this, not the edited version above).

  3. Martyn:

    Again, I think R is great. But . . . when I wrote bayesglm(), I adapted the existing glm() and glm.fit() functions. Which were messy enough that, in retrospect, I wish I'd written bayesglm() from scratch. It was a tradeoff: By using the existing glm.fit(), I was able to take advantage of all the work that had gone into it. The disadvantage was that glm.fit() is enough of a mess that I'm not completely sure I've plugged into it correctly. In the old days when S functions really were just 10 lines, it was easier to modify them and maintain compatibility.

  4. I absolutely agree with Steve: starting from scratch might sound very tempting, but it would most probably not end up as "the same" product with "all problems" fixed – too much would be lost, or the effort might be too big.

    There are quite a few things (design choices) I would like to change in Mondrian, looking back on the code of the last 12 years, but I would never dare to start from scratch.

    Software developers should be more open to certain compatibility cuts (as Apple has done quite often, very successfully) which might "hurt" the users for a short time but pay off in the long run. (Did you ever wonder what the median() function does in SAS's proc sql?)

    There are certainly things that can hardly be changed, as they are deeply rooted in the software design … although we should demand more from commercial software!

  5. Andrew: It's not the generality that's a problem, it's the craziness as a programming language.

    The "bureaucracy" could be much better organized. I find the main problem is that R functions try to do everything themselves rather than having more processing on the outside and then simple error propagation on the inside.

  6. Steve & Martyn: I agree that very often, people reinvent the wheel when it's not necessary or throw the baby out with the bathwater if you prefer that metaphor.

    The problem is that you hit a local maximum. If we maintain backward compatibility, we're sunk, because R's such a lousy programming language qua programming language.

    It's often worth starting over again from scratch.

    I don't think Python started from Perl's code base, but they probably reused algorithms for regular expressions. I'm not sure where R started with respect to S. I don't think Java started from a C compiler, but they certainly used a lot of compiler know-how that was hard-won in the C world.

    Didn't that same Joel Spolsky develop bug tracking software from scratch when there were plenty of open-source alternatives?

  7. Cryptic lines often resolve bugs you don't realize are there, but they can also be fixing bugs that no longer exist. Old code assumes it's in the environment it once was, and at a certain point you'd spend as much time relearning the real bugs for a rewrite as you would spend dealing with useless code or with bugs caused by attempts to fix bugs that no longer exist. If the core assumptions of a software system are changing, it makes more sense to rewrite. One browser version to another? Probably not. But going from a dynamically typed, pass-by-value language to a statically typed, pass-by-reference language? That might be a significant enough change to invalidate the assumptions of swaths of code.

    Plus commercial software operates on a different time-scale than languages. Java 1.6 is 4 years old. C++ is scheduled for an update after 12 years. Plenty of time to do proper quality control on a new language. We'll see if the current batch is different, but up until now languages have had very real shelf lives, and learning the lessons from one language and applying them to the next has worked well.

  8. If you use read.table to read in a data set then you can specify as.is=TRUE to stop character data being converted to factors.
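
    For instance (a minimal sketch; the file name is hypothetical):

    dat = read.table("mydata.txt", header = TRUE, as.is = TRUE)
    # character columns now stay character; with the default
    # as.is = FALSE they would be converted to factors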

  9. I really don't think R is as bad a language as some people have claimed in various threads like this. At the same time, I think people proposing "starting from scratch" or "building a new system" need to clarify what they mean, since there are a wide variety of options…

    Some of Ross's (original) comments seem to indicate that they are experimenting with keeping R's syntax but modifying scoping, semantics, etc., to remove the warts: ending up with a New R that is compatible enough with R Classic that you could use scripts to convert from old to new. (Something like Python 3 versus Python 2, I imagine.)

    Others have proposed essentially resurrecting Lispstat in Clojure (i.e. Incanter). As much as I like Lisp, most people are turned off by it immediately, and it has never gained widespread acceptance. Building on the JVM won't make Lisp more popular.

    Others have proposed building on the basic scientific/statistics capabilities of Python (with Scipy, Numpy, Matplotlib, and a collection of other third-party tools, some of which may still not work with Python 3, which is what a future-oriented initiative should build upon). It seems like Scipy, et al., have been around for a while now and still haven't matched R's base capabilities, much less CRAN. (I could be wrong on this.)

    Others might like R as it is, but want to put LLVM or something underneath it, and evidently that's not quite gotten off the ground after many years of trying.

    Which option are we discussing? And what's the actual goal: to fix R, to build upon R, or to designate the successor to R?

  10. Numpy has already been ported to Python 3, and Scipy is, AFAIK, on the way (a much simpler port once Numpy is done). All other packages that are currently under development and/or being maintained should go forward with the Py 3 port. I don't think there is an issue there.

  11. People use R (over languages like python) because of the graphics and CRAN. The graphics do not need to be 'compatible' between versions because they're for human consumption, but any rewrite/rebuild/etc. which breaks CRAN is software suicide which would probably be followed by a slow exodus of users to python/ruby/perl/etc. and a fracturing of the community. I say it'd be a terrible move.

  12. Gabe: I've never used Sas but my impression is that it is not set up to do exploratory analysis of data or models. To put it another way, it spits out inferences without easy ways to do post-processing. I'm sure it's possible to do much of the relevant work by writing programs . . . but the advantage of an environment like R or Matlab is that you can do a lot of graphing and analysis pretty quickly and interactively.

  13. You have to pay for SAS.

    Does anyone else think "numpy" is a terrible name? I can't help but read it as "numpty" (an idiot, for US readers).

  14. Gabe: A single commercial base SAS license costs around $20,000, and maybe 5 times that if you want any of the interesting libraries. That's what's wrong with it. (Never mind that the language makes COBOL look modern.)

    But I think you bring up a good point… if R loses its standing as the de facto free stats language (by breaking CRAN, creating several competing versions of R, etc), then professors are going to replace it in their Intro courses with SAS/SPSS/Stata/Minitab, NOT another free language like Python, since they most likely know the former better than the latter.

  15. If I didn't have R, I'd probably use Matlab (I've heard that it's very similar to R) or I guess I'd bite the bullet and learn Python. Stata is kinda great, but it's such a different sort of thing. I could imagine using Stata for a lot of things but it's so different from what I'm used to, I don't know if I could live within it.

    From what I've seen of Sas, I wouldn't use it if they paid me to.

    And I can't see using Mathematica after seeing this sort of thing.

  16. R has a lot of problems.

    R is awesome.

    They're both true.

    Some of the problems include the lack of a way of enforcing local vs. global variables; very poor provision for making interactive plots; awkward handling of variables representing time; some ill-chosen defaults (one of many examples, which others have already mentioned on this thread, is frequently treating even numeric values as "categories" unless they are read with as.is = TRUE); the overloading of the par() function for defining graphical parameters; and several other problems.

    But try using another package for exploratory analysis, and you see why R is so great. I certainly can't speak for (or against) all other programs, but I have tried several, and they make me grateful for several features of R. A few great things:

    (1) R is a language. Try using Excel to do anything — click click click click click click to make a plot…oops, now I want to remake it but highlight certain cases, click click click click click — and the huge superiority of a scripted language is apparent in a hurry.

    (2) There's extremely compact syntax for extracting portions of matrices. You want to see all of the rows for which the third column exceeds 10 and the fifth column is equal to zero? Here it is: Mat[Mat[,3] > 10 & Mat[,5] == 0,]. It's really hard to see how you could design a language that makes that any easier. Combined with R's other features, this makes it very easy to, say, make a scatterplot of Y vs. X, with different symbols or colors for different cases. (See the sketch at the end of this comment.)

    (3) Although I wish R had a "use strict" type of option to enforce local variables when desired, almost everything else about defining an R function is great. I love the way of defining defaults in the function definition, I love the ability to pass extra arguments with "...", I love the ability to create a NULL data structure and then add named pieces to it (Output$x, Output$y, Output$otherstuff, etc.) with no hassle, and I love the ability to create functions so the arguments don't have to be in any particular order. (Also covered in the sketch below.)

    (4) Very easy exploratory graphics. Create a matrix of plots, plot data and overlay fitted curves…lots of stuff that is hard or at least somewhat hard in other systems is a breeze in R.

    It would be great to see some major changes to address some of R's shortcomings, which are indeed considerable…but I wouldn't want to lose any of R's strengths, which are even more important. I'm especially wary of letting stereotypical computer scientists (whose stereotype, like so many others, contains some truth) design a new statistics language: we might end up with something that is elegant, logical, unambiguous…and such a pain that nobody will actually use it. (Insert your favorite Lisp joke here.)

    R is the worst statistics language, except for all the others.
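
    A minimal sketch of the idioms from points (2) and (3), with made-up data and a made-up function name:

    set.seed(1)
    Mat = matrix(sample(0:20, 50, replace = TRUE), ncol = 5)
    Mat[Mat[,3] > 10 & Mat[,5] == 0, ]   # rows with col 3 > 10 and col 5 == 0

    summarize = function(x, center = mean, ...) {
        Output = NULL                    # start with nothing...
        Output$center = center(x, ...)   # ...and add named pieces freely
        Output$spread = sd(x, ...)
        Output
    }
    summarize(c(1, 2, 3, NA), center = median, na.rm = TRUE)  # named args in any order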

  17. What about using MATLAB? Its basic implementation is much faster than R (JIT compilation, multicore/CUDA support, implicit pass by reference), and the Embedded MATLAB Toolbox allows you to compile your code to C. I personally also much prefer the language syntax. On the downside, it's not open source and quite expensive, but there are some free clones out there that are mostly compatible (e.g., Octave and Scilab).

  18. True, R is not the best programming language (but which is?), and the fact that so many different people, many of whom are not hardcore computer scientists, contribute makes it a mess with many lines of very badly written code. So for those who want to get some serious programming done it is a big frustration. On the other hand, for those who want to produce working code for their complicated new statistical tool, which probably depends on other cutting-edge procedures, it's still the fastest way to get something done with reasonably low effort. In the end it just works, even if it's not perfect. Also, for end users it's great. It is easy to install on all platforms and easy to put to work on even complex statistical-analysis problems.
    The only thing I'd really wish worked better is the inclusion of code in foreign languages other than C or Fortran. I know it's already possible, but try giving someone on a Windows machine a package that depends on some Haskell or even Python stuff. But that again is probably more a Windows problem than an R one.

  19. Bite the bullet and learn Python? That might be a lot quicker than you think.

    If you had to choose one language for web applications, 2D/3D games, robotics, bioinformatics, machine learning, statistics, and data mining, it would be Python. Check out PyBrain on YouTube. Using Cython, f2py, or weave.blitz you can make it a lot faster without writing full-scale extension packages with the C API. You can even integrate R code and packages (almost) painlessly. You can especially interface nicely with web services, HTML, CSS, and JavaScript code. I would not want to do that in R.

    That said, I basically came to R from having used Python exclusively for years. Ad hoc data analysis, especially exploratory analysis, is even easier to do in R than in Python. The vector syntax sometimes is a pain in the posterior distribution, but usually it makes for quick and concise exploration.

    I think in the current world, for most people from academia or at least not in big highly profitable companies, SAS is a no-go. Not only is the license cost prohibitive if you only get one license, what will you do if you want to parallelize your computations? Buy licenses for like 100 processors? I don't think so …

    Sometimes people are afraid of switching computer languages. I'm not switching, I am adding. After learning the second or third one you get a lot quicker at learning them and normally you understand all of them rather better than before.

    So maybe we don't need a better R, but rather we can use Python to glue together R, native Python, Fortran/C and the web?

  20. @float

    This argument that R needs more computer scientists to make it a better language… or that computer scientists should have been the creators of R… is absolutely ludicrous!

    Many R contributors are just statisticians who enjoy writing software. You don't have to BE a computer scientist to write great successful code.

  21. Why not use Matlab? Well, I can't bring myself to use a language that does not have keyword arguments, which has wonky support for variable argument lists, whose key-value data structure has no syntactic support, and which insists on one function per file-of-the-same-name. Not to mention that the Matlab programs I've examined look like they were written by bitter APL programmers who were forced to use Fortran and their layout and variable naming reflect this.

    Python? Scipy is a month away from release 0.9, which will finally bring Python 3 compatibility, but this is the last release before 1.0, which will lock SciPy down, meaning it's the last chance for the developers to make big changes. So Python + SciPy won't be viable as an R replacement (i.e. encouraging lots of non-Pythonites to learn it) until sometime next year.

    As I've poked around (again), Python is impressive. But graphing is still fragmented (as if R had no base graphics and Lattice was only a recommended package), and Scipy doesn't seem to have R's print/summary/plot defaults for all objects. Also, I haven't looked deeply, but my guess is that Python's OO is much heavier-weight than R's S3.

    That said, it looks to me like Python is the best alternative — given that R must be entirely replaced — moving forward from sometime next year.

    (I've been looking into Clojure/Incanter and Scala, and I do like each, though I don't think Lisp will ever make it mainstream. I'm looking into how easy it would be to get Weka/Rapidminer, jblas, and other stat/learning java libraries working in Scala, and to see if a subset of Scala could be simple enough and reminiscent enough of R to be an option.)

    SAS? If you don't program anything, it's the gold standard, and costs its weight in gold as well. But programming is, as I understand it, kludged on as a 1970s-style afterthought. And even if you don't try to program, its syntax reflects a 1960s JCL, punched-card style that really has no place in this century.

    I also agree that anything other than a cleaned-up R (compatible enough so you're learning changes, not a new language) will fragment the R community: many will stay, many will go to the commercial packages that they already use, some will go to Python because of SciPy, some will go to Incanter (Lispstat lovers in particular), and some will go to Octave, which will be the worst-of-both-worlds option. The question is: is it possible to revamp R but keep it as R (à la Python 3), or must we go another way?

  22. I've used R, Stata, Matlab, and now Python for various statistical/computational projects… my feeling is that Python is the way to go. Computer scientists aren't the only ones who write good code, but they know, from hard-won experience, how to develop systems that can scale up in projects/developers/lines of code while still being fast. The commercial programming languages are even more hidebound and cumbersome than R.

  23. I went off and took a look at Python (again). Scipy doesn't (yet) work on Python 3, so I stuck with Python 2. You need to install Scipy and Numpy to get a reasonable mathematical/statistical baseline. Then matplotlib for graphics. I personally don't think matplotlib's graphs look as good, by default, as even R's base graphics, but Andrew will probably like that it has a default settings file that lets you set dozens of options. (Not sure that it includes the options that he would want to change, but…)

    I decided to investigate two other graphics packages: Hippodraw and Chaco. Both had different install mechanisms than matplotlib, and on my Mac neither worked (each for different reasons). And that reminded me how powerful CRAN is: one place to gather them all, one place to compile them, one place to pick and choose, and with one line install them. I've downloaded hundreds of R packages over the years and only had about 3 that required a source download (instead of the usual binary download).

    And I think a CRAN-like mechanism is critical for any tool that would succeed R. A programmer who is already using Python and doesn't mind rolling their own system is probably enthralled with Scipy and not having to use yet another language (R), but end users need more magic than that.

    Another area to address is the help system. R's help includes a TeX-like language so that help pages can be well-formatted and with proper mathematical notation. (And you can include R calculations in the output, too.) R's culture also encourages vignettes and other niceties for heavy-duty packages. You can search by command (?lm) and by agrep (??linear), etc. The one thing R doesn't have is categorical help (for example, all control structures: if, while, for, etc).

    Some people here have claimed that computer scientists weren't involved in R/S's design. I actually do not know the specifics, but have they actually read the R document? The discussion of issues like scope, lazy evaluation, environments, etc, is reasonably impressive and it's obvious that S wasn't thrown together by statisticians who knew nothing about programming.

  24. Well said, Wayne. R core should focus on improving R.

    Python is not radically better than R from the perspective of a statistics user. One would almost certainly have to fork python to get sensible defaults, for instance base-level NA support and typical floating-point behavior (1/0 = Inf rather than an exception). Then there's the problem that Python doesn't have formula support.
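
    To spell out the defaults being referred to (my examples, standard base R):

    1 / 0                            # Inf, not an exception
    mean(c(1, NA, 3))                # NA: missingness propagates by default
    mean(c(1, NA, 3), na.rm = TRUE)  # 2
    lm(mpg ~ wt, data = mtcars)      # the formula interface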

    R will be replaced, at some point, either by a new commercial package that offers something compelling or by a new open-source package that manages to get traction. I know several groups of people working on what they believe is the next generation of statistical or technical computing. One may catch on. But this will happen out of the blue, from some talented grad student or professor. Distracting the R core team should not be contemplated, in my opinion.

  25. Recently I had to work with OpenCV, the open-source machine vision library. It came with an out-of-the-box Python interface, which is one of the benefits of a general-purpose programming language that is used and accepted by many. An R interface exists but is not as well documented (and there is more than one Python interface).

    As a side note: My port of ARM and BDA code is going well. I just ported Dr. Gelman's sim() function to Python, in fact.

    http://ascratchpad.blogspot.com/2010/09/arm-35.ht

    Cheers,

  26. Maybe, just like Python moving from version 2 to 3, R could fix some niggling issues through a version update: one that may break compatibility, but where old code could easily be converted to compatible code. Just a thought.

  27. Just when I was getting good at R, they will replace it… Christian relieved my distress over apply functions. Related to apply are the numerous data-reformatting functions he talks about, like as.matrix.

    I am told by R pros to use apply functions instead of loops. First of all, it is hard to figure out which one to use when. Once you know whether you want tapply or sapply, the parameters have to be in the right format. Then they don't seem to work when the functions are longer, like a complicated panel plot. When I have to deliver something, I use loops. (A small side-by-side example is below.)
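
    For what it's worth, here is the same computation written both ways (my own small illustration):

    xs = list(a = rnorm(10), b = rnorm(20), c = rnorm(30))

    means = numeric(length(xs))          # explicit loop
    for (i in seq_along(xs))
        means[i] = mean(xs[[i]])

    means2 = sapply(xs, mean)            # apply-family equivalent, returns a named vector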
