They started me out on SPSS . . .

Yph writes,

They started me out on SPSS, then quickly moved me to Stata. Now I’m learning R. Do you think R is it? Will I have to learn a new programming language in the future? I get apprehensive about investing time learning new technology when the turnover rate of programming languages seems so high.

My reply:

I like R but it has some problems: it’s sort of a pain to make good graphs (in theory this can be fixed by writing better front-end functions, but I’ve done little of that), it chokes on large datasets, it can be slow, and some of its internal functions can be hard to follow. (In the old days of S, the internal functions were pretty clear: you could just type the function name and see what was going on. But in the new, strongly-typed world of S4 objects, it’s common for 90% of the inside of a function to be “paperwork”: handling defaults, exceptions, names, etc. We’ve been struggling with this in adapting “glm” to create “bayesglm” and “bayesglm.h”.) Also, the system for calling Fortran routines is more complicated than it used to be.
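The point about inspecting internals can be seen directly at the R console; here is a minimal sketch (the particular functions inspected are just illustrative, not the ones from the post):

```r
# Typing a function's name with no parentheses prints its source.
# A plain S3 method such as mean.default is short and readable:
print(mean.default)

# S4 methods are not visible this way; you have to dig through the
# methods package instead, e.g. (uncomment to try):
# library(methods)
# showMethods("show")
```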

But, yeah, I’d go with R. SPSS might be great, but the analyses I’ve seen using SPSS have been pretty ugly. Stata is better, so I wouldn’t throw away your license for it, but I think you can do more in R. Some people swear by Matlab; it’s probably better than R for a lot of things, but R’s canned statistical routines are pretty good, and I seem to recall that some of Matlab’s statistical routines are pretty sloppy.

25 thoughts on “They started me out on SPSS . . .”

  1. A nice comparison of various stats packages can be obtained from:
    http://www.ats.ucla.edu/stat/technicalreports/

    Bottom line:
    R is a programming language you can use to do statistics, Stata is a data analysis program.

    In particular, with the new Mata language in Stata, the programmability edge R had over Stata is now gone.

    That of course leaves the price…, but then your time spent learning to use R isn't free either.

  2. I was a computer scientist before I came to statistics, and I also started with Stata, but as soon as I found R, I never looked back. From a programmer's perspective, the syntax just makes sense. And it's much easier to extend.

    R code is often a mess because statisticians are rarely trained as programmers. (That's not meant to be a knock–it's just that people trained in software engineering tend to be better at it than those who aren't. I've long wanted to write a short guide to programming aimed at statisticians–there are lots of very simple things, like using better variable names, that would dramatically improve the readability of most statisticians' code that I've seen.) Stata code, on the other hand, is often a mess because I'm not convinced you can write clean code with its scripting language.

    There are certainly things I don't like about the R language–I'd much rather see a straightforward implementation of object orientation, as opposed to the halfway S4 implementation.

    One major caveat, though, and Andrew is right: R is *terrible* for large datasets. And the leaders of the R project generally seem to dismiss this, with the view that more memory and faster computers will solve this problem. However, I think the growth in the size of datasets (especially in genetics) and complexity of models is far outpacing the growth in computational power. The biostatisticians seem to be better than the rest at following their own path when it comes to R development, and they're the ones who really need the bigger data capacity, so maybe they'll be the ones to drive a change. But there will have to be a major change in the data architecture before R will be able to handle bigger datasets.

  3. That's not entirely true. R's actual strength is the fact that the people developing methodology are using the same tool that you're using to do data analysis. Implementing something from some paper you've read is a pain. Downloading a package from CRAN is not so bad.

    You're never really going to get that with Mata (or any of the other control languages around for that matter, I'm just picking on Mata). You might get some stuff, but the language is not really designed with actual extensibility of the system in mind. Unfortunately, it seems that commercial entities have this deep-seated need to create Yet Another Control Language (they're not really programming languages; in the case of Mata: goto? Really? In 2007? You sure about that?) with Yet Another Haphazard Syntax (I call it PHP disease). I dunno, to give their tech writers something to do or something.

    If they were smart they'd take a page from the videogame industry's playbook and embed Python or Lua. Then you might actually have something that could challenge R as a language for the development and dissemination of methodology.

  4. related to this… does anyone know of alternatives to (Win/Open)BUGS, either for R or Matlab, for Mac users? Are there no statisticians who use Macs? All the BUGS variants only run on Windows or Linux.

  5. The package should fit the user:

    a) If you are a programmer by training, then I definitely understand why you prefer the R syntax over the Stata syntax.

    b) If you want bleeding edge methodology, R has an edge for now, but these things tend to change often (that's what bleeding edge is all about).

    c) If you do actual data analysis, then 99.9% of it is not bleeding edge but boring data preparation and standard statistical methodology. Here Stata has an edge: the boring stuff is solidly programmed and exceptionally well documented (you pay for it, but if you have ever written documentation you can appreciate the number of man-hours in that metre or so of documentation).

    If you occasionally want to do something fancier that does not already exist in official Stata, there is often a user-written program available, as Stata has an active user community which, among other things, produces a large number of user-written programs, a large portion of which are distributed through SSC:

    http://ideas.repec.org/s/boc/bocode.html

    This is actively supported by StataCorp (as it should be; this is free added value for their product). The big advantage of programming in Stata is that you don't have to worry about the boring stuff: you paid someone else to do that for you.

  6. Thanks Byron Ellis, but JAGS is advertised only for Linux/Unix derivatives, not explicitly for Mac OS… I know Mac OS X is technically a Unix variant, but when I tried compiling JAGS, it failed.

  7. I often use Stata, since many of my projects involve large datasets. I definitely agree with maarten that Stata has better data management built in. Their documentation is also the best I've seen from any package (though R's isn't bad by any means).

    If I only needed canned functions, Stata would be just fine. But I can't think of a single project I've worked on in the past 3 years where I didn't need to do something that Stata won't do. And I won't touch Stata's scripting language with a 10-foot pole (Mata offered more functions but little improvement as a language). I don't think of myself as doing anything "fancy," but every dataset has quirks and I usually want modeling flexibility. And if I'm doing anything with MCMC, it's pretty easy to code something up in C and use it in R; right now it seems basically impossible to do Bayesian computation in Stata.

    If all you're doing is linear regression and don't need to worry about extensibility, Stata is probably a great solution. For me, though, it always feels very constraining, and even with large datasets, I usually do my data management in Stata then use R to run my models.

  8. Anon: Look in the JAGS manual on Martyn's website. There are instructions on how to compile it there written by, IIRC, Bill Northcott. It definitely compiles on OS X. The primary issue is that you need an Rmath build since JAGS uses R's standalone math library.

  9. There is good news that may encourage users to stick with R.

    There are packages now available for R to deal with large datasets: "R.huge" and "filehash." "filehash" is more general than "R.huge": basically, it dumps data onto your hard drive, so the limit on the size of the data is now the physical size of your hard drive.
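A minimal sketch of the filehash idea, assuming the CRAN package is installed (the database file is created in a temporary location here just for illustration):

```r
# filehash keeps objects on disk and pulls them into RAM on demand.
if (requireNamespace("filehash", quietly = TRUE)) {
  library(filehash)
  dbfile <- tempfile(fileext = ".db")  # any path would do
  dbCreate(dbfile)                     # create the on-disk database
  db <- dbInit(dbfile)
  dbInsert(db, "x", rnorm(1e5))        # written to disk, not held in RAM
  str(dbFetch(db, "x"))                # read back on demand
}
```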

  10. How come no mention of SAS?

    When doing boring data analysis with large datasets (basically contracting stuff for industry), I use SAS – it seems to be the industry standard. I think it also has some very nice stuff when doing mixed models.

    But for running simulations or performing any matrix calculations, forget about it (SAS does have PROC IML, but it's pretty slow). I use R.

  11. Thanks Yu-Sung Su for the pointer on filehash.

    From some looking into it, I could not find advantages of R.huge over filehash – do you know what either is better for?

    For the rest of you who are interested: there is a good explanation on the filehash package here:
    http://cran.r-project.org/src/contrib/Description

    And if you combine it with a sparse matrix storage system (see:
    http://cran.r-project.org/src/contrib/Description ),

    we are talking about a very immediate and powerful large-dataset solution.

    (Last time I dealt with such a dataset, I tried RMySQL with R – and in the end, gave up and went to some hard coding in C. But today, I believe I would do otherwise.)

  12. In terms of graphics, at least, you might try looking at my ggplot2 package, which is another alternative to base and lattice graphics. It uses the ideas of Wilkinson's grammar of graphics to provide a rich plot object that can be manipulated in a variety of ways.

    For example, if you have created a scatterplot:
    (p <- qplot(mpg, wt, data=mtcars))

    You can easily add a loess smoother to that plot:
    p + stat_smooth()

    Or facet (trellis) it by cylinders:
    p + facet_grid(. ~ cyl)

    There are over 500 examples at http://had.co.nz/ggplot2 if you are interested in finding out more.

  13. I use SAS and R. SAS is much better for large data sets, and for data management. It has great tech support, and a wonderful mailing list. But it's expensive, and, although the recently added ODS is a help (to be further helped in the next version), graphics are a big pain. Almost any graph can be created, but it isn't easy.

    R is, of course, free. I find it very un-intuitive, but it's completely extensible, and its out-of-the-box graphics are good, and easily modifiable in a way that makes more sense to me than SAS Graph's system. R has the latest techniques a lot more often than SAS. But R has no help line, and the R help list is full of sharks looking for blood.

  14. At least in my department (non-statistics) most of the faculty are moving from STATA to R because while STATA might be more intuitive to run the actual canned model, the code needed to manipulate the data beforehand is simply a clusterfuck.

  15. I've never used STATA but….

    -R is free
    -R has an incredible number of available packages, which are free…Matlab's packages are expensive.
    -R can handle large datasets when combined with some database system and 'RODBC' or using features like 'scan' or the above mentioned 'filehash' and 'R.huge'
    -If you can use Unix, then you can probably pare down your data to a usable level if you have to. You could also get a few more machines if your dataset is that large…there are some parallel-processing features as well, 'snow' for instance.
    -Matlab also breaks itself on very large datasets. As I recall it's limited to about 2 GB of virtual and physical memory before it starts to crash out…

    I've used it for some fairly large datasets and figured out that, if you have to, the best method is to process the data in chunks, which can be easy with some foresight and planning.

    Not sure that any language can really let you load something like a 10 GB data file into its memory and process away on it.
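The chunked approach mentioned above can be sketched in base R; the file and chunk size below are made up for illustration:

```r
# Process a big CSV in fixed-size chunks so it never sits fully in memory.
f <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10000), f, row.names = FALSE)

con <- file(f, open = "r")
invisible(readLines(con, n = 1))   # consume the header line
chunk_size <- 1500
total <- 0
repeat {
  chunk <- tryCatch(read.table(con, sep = ",", nrows = chunk_size),
                    error = function(e) NULL)   # NULL at end of file
  if (is.null(chunk)) break
  total <- total + sum(chunk$V1)                # update a running statistic
  if (nrow(chunk) < chunk_size) break
}
close(con)
total   # 50005000, the sum of 1:10000
```

The same pattern works for any statistic that can be accumulated incrementally (sums, counts, cross-tabulations); statistics that need all rows at once still require a different strategy.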

  16. SPSS has integrated Python as a programming language into its syntax. It means that you can run SPSS through the GUI, through its own syntax, through VBA, or through Python. The Python extension gives you 'real' programming control over SPSS, which you never previously had.

  17. Theo V: quite late to the thread, but 64-bit Stata does a fine job with very large datasets, as the dataset size is limited only by the size of the installed RAM. I have had no problems thus far with 8 GB files.

  18. How is SPSS for large datasets? Anyone have any opinions on this topic? Ever used JMP? I've heard bad things about large datasets in JMP.

  19. At least on my machine, SPSS/PASW is painfully slow with large datasets. By "large" I mean over 200,000 rows and 1,000 columns. Once I go over that limit even simple things like frequencies slow down dramatically. Simple Copy and Paste functions are also practically broken in SPSS/PASW for anything beyond a few thousand cases, which I find really annoying.

  20. James, above, is right. I've been trying to get SPSS/PASW v18 to cope with a data set with 12 variables across 3 million cases, and it's just useless. If I want to simply paste a full stop into, say, a million of those cases I have to leave my 32 core, 16 Gig PC on overnight, and into most of the next day. Ugh.

  21. I use SPSS for ANALYZING large datasets but do not recommend it for data management. So for example, if you want to cut a lot of rows or do a lot of data cleaning, I find it easier to import the data into Excel, do the messy work there, and then import back to SPSS. Working together, these two programs can do pretty much whatever you need.

Comments are closed.