Both R and Stata

A student I’m working with writes:

I was planning on getting an applied stat text as a desk reference, and for that I'm assuming you'd recommend your own book. Also, being an economics student, I was initially planning on doing my analysis in Stata, but I noticed on your blog that you use R, and apparently so does the rest of the statistics profession. Would you rather I do my programming in R this summer, or does it not matter? It doesn't look too hard to learn, so just let me know what's most convenient for you.

My reply: Yes, I recommend my book with Jennifer Hill. Also, the book by John Fox, An R and S-Plus Companion to Applied Regression, is a good way to get into R. I recommend you use both Stata and R. If you're already familiar with Stata, then stick with it; it's a great system for working with big datasets. You can grab your data in Stata, do some basic manipulations, then save a smaller dataset to read into R (using R's read.dta() function). Once you want to make fun graphs, R is the way to go. It's good to have both systems at your disposal.
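
For example, a minimal sketch of that handoff, assuming the trimmed file was saved from Stata as mydata.dta (the file and variable names here are made up):

    library(foreign)  # ships with R; read.dta() reads Stata .dta files

    # read the trimmed dataset saved from Stata
    mydata <- read.dta("mydata.dta")
    str(mydata)  # check that variable types and labels survived the transfer

    # from here, R's graphics take over
    plot(mydata$x, mydata$y)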

16 thoughts on "Both R and Stata"

  1. This is really sound advice. I tend to use both SAS and R, and the decision to use both has been a great one. R has some amazing packages written for it (I'm a big fan of BMA and MICE) that can't really be replicated in SAS. SAS does great data management and is really nice for large databases (hundreds of thousands of records or more).

    This is not to endorse just these two pieces of software; my post-doctoral department swore by Stata, and I have seen a lot of good work done with it.

    But the more flexibility you have with different programming tools, the better. Every package seems to have a different strength (or weakness, if you consider SAS graphics), so knowing more is purely a good thing, in my opinion.

  2. I agree with this completely. There seems to be much discussion about why software package X is better than Y, when the more important thing to keep in mind is to use the right tool for the job.

    I tend to use a combination of several software tools in my work and it has served me well so far. Not only does experience with multiple tools make one versatile, it can facilitate collaboration with others who may not be flexible in their software use.

    I also think that understanding several software tools can reduce the time wasted trying to get something to work in software X when Y does it better or with less effort.

  3. I would only take issue with the implied argument that R is somewhat inferior when working with large datasets. In such cases, you don't need to keep everything in memory; it's often much better to work directly with relational databases. R has *very* good packages that can communicate with those (e.g. RODBC, sqldf, etc.), as in the rough sketch below.
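
    For instance (the DSN "mydsn", the table, and the column names here are all hypothetical):

      library(RODBC)

      # connect to a database registered as an ODBC data source
      con <- odbcConnect("mydsn")

      # pull only the rows and columns the analysis actually needs
      dat <- sqlQuery(con, "SELECT id, income FROM survey WHERE year = 2008")
      odbcClose(con)

      summary(dat$income)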

  4. I want to second what Vincent said. R is very competent at handling large datasets through relational database interfaces (100,000 entries is small for a SQLite- or RODBC-based analysis).

    I've successfully processed the entire human genome (several tens of gigs of sequence data, IIRC) through RMySQL, for example.

    There are also packages for offline (not-in-memory) fitting of linear models on huge datasets (the CRAN package biglm, I think; see the sketch at the end of this comment).

    R used to be an "in memory only" sort of thing back in the day, but the interfaces to handle LARGE datasets have been around for 5 to 7 years and are very mature.

    Also, running R on a 64-bit Linux machine with, say, 16 gigs of RAM doesn't cost that much these days.
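
    A rough sketch of the chunked biglm idea (get_chunk() and n_chunks are hypothetical stand-ins for however you page rows in, e.g. successive queries through RMySQL):

      library(biglm)

      # fit on the first chunk, then fold in later chunks one at a time;
      # only the current chunk needs to be in memory
      fit <- biglm(y ~ x1 + x2, data = get_chunk(1))
      for (i in 2:n_chunks) {
        fit <- update(fit, get_chunk(i))
      }
      summary(fit)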

  5. Andrew wrote "Once you want to make fun graphs, R is the way to go."

    Andrew has found R good for graphics, and that's fine, but I don't gather that he has the same experience with Stata for graphics.

    Stata has a well-developed graphics system, many convenience commands and an interactive graph editor. There are some gaps, but its graphics are serious too.

  6. When it's academic research we're talking about, I'm a little uncomfortable with the use of expensive, proprietary software. If I write in R, anyone in the world who has the inclination can replicate my analysis for free. What's more, when there are excellent open-source alternatives, should we really be spending our university budgets on Stata licenses?

    I don't know how things are in other fields, but my colleagues in economics use Stata or Matlab because it's what they were taught in grad school, not because these packages are well-suited to their particular problems. Just this week, one of my colleagues was struggling to implement, from scratch in Matlab, an estimator for which there is an excellent R package.

  7. Hi Vincent:

    I really did not intend to imply that R was inferior for large data sets, merely that I have found SAS to work well in that environment. I have also seen GWAS analyses done in R and seen some amazing R programmers. I learned how to use large databases (credit bureau data for the US) in a SAS-only shop (15 years ago, when it was a good choice), so a SAS environment is where my best instincts for dealing with large databases lie, but that's not a reason to think it is the only good way to do it.

    SAS also has its own well-known limitations for large data sets (sorting in SAS is very inefficient, for example).

    For some reason, software issues are touchy. I'm going to teach a class on data analysis in a SAS-only department in the fall, and I'm planning to do the class in R just so the students have the experience of a second statistical language (and were I competent in Stata, I would give Stata examples too).

    But what I wanted to focus on was the virtues of flexibility, not to imply that a specific language has issues. I could only draw from my own experiences for examples of why I've used each language. I'm sure that if I ever lost access to one of them, I would get better at the other.

    Joseph

  8. I'm a big fan of having a complete one-click reproducible set of analyses. That is to say, there is a final script that manipulates and transforms the raw data and produces the formatted output for inclusion in the final report. I find doing everything in R can make this easier; a minimal sketch of what I mean by a master script is below.

    For more on what I mean, see the comments by Hadley Wickham and Josh Reich on Stack Overflow:
    http://stackoverflow.com/questions/1429907/workfl

    or my post on Sweave:
    http://jeromyanglim.blogspot.com/2010/02/getting-
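
    Concretely, the one-click setup can be as small as a master script like this (the file names are just placeholders):

      # run the whole analysis end to end with source("master.R")
      source("01-load-raw-data.R")   # read and clean the raw data
      source("02-transform.R")       # derived variables, reshaping
      source("03-models.R")          # fit the models
      source("04-figures-tables.R")  # formatted output for the report

      # or weave code and text together in one document:
      # Sweave("report.Rnw")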

  9. I think everyone would agree that it is crazy to pay when something at least good enough to do the job is free. Conversely, people who buy software are sensible to do that when it offers more than you can get for free, you can afford it, and the judgment is that it is good value for money. After all, it is a long time since it was common for hobbyists to build their own PCs, and even then they had to buy most if not all of their parts. Nowadays "we" tend to take for granted that our employers will supply computers for our work just as they provide heating, lighting, etc. And for at least many of us, the purchase of some proprietary software licenses remains a strong expectation of our employers, or even of ourselves.

    I respect the idea of free open source software mightily. But I don't understand why people who regard it as an absolute principle, and consistently never buy any software, don't seem to regard it as a bit inconsistent to pay for their hardware… In other words, I think we all draw the line somewhere.

    Disclosure: I am fairly well informed about Stata, and far less well informed about MATLAB (incidentally, correct capitalisation is that way round). What I have to say is arguably true of Stata, and I think most if not all also applies to MATLAB. SAS I won't comment on through even greater ignorance.

    1. Reproducibility through shared scripts is emphasised very highly in the Stata community.

    2. Although you have to pay for the core code, thousands of user-written packages are available for free over the internet.

    3. What you pay for with proprietary software usually includes very extensive documentation and the guarantee of _some_ support from the company. Of course, the R community shares much free documentation and a tradition of fast and free expert support on the internet (Stata too). Conversely, many R users now seem to regard substantial outlay on R books as part of their necessary expenses: the money now goes to Springer, Wiley, CRC Press, etc. rather than some evil software company.

    I don't think there is any hostility to R in the Stata community (less than the other way round). I guess that R has surprised even most of its own enthusiasts: who predicted a decade ago it would be this big in 2010? Very big statistical programs have faded away (where is BMDP now?), so nothing is forever. I think the next ten years are going to be the hardest for all the major statistical programs, and not just because of current economic austerity.

  10. @Jeromy Anglim

    I too am a fan of having a complete one-click reproducible set of analyses.

    But I wonder how you solve the problem with large data sets, as several of the other comments mention?

  11. A "me too" to Nick Cox: I live in an open-source bubble most of the time, but Stata is the one program I regularly (get my department to) pay money for.

    Oh, and "me too" also on multi-package flexibility, and on reproducible analyses.

  12. I would add that it is good to have knowledge of the most popular programs, because in the future you could be forced to use any of them when you go to the job market. I had to switch from SPSS to R, to Matlab, to Stata.

  13. Joseph,

    I was actually referring to the original post, and not your comment. I don't have any experience with SAS, so I find your comments interesting. Thanks for sharing!

    Vincent

  14. I don't need the large data set functionality for the kind of work I do, so I use R: because it's free, because I can take it easily from job to job with me, and because the only alternatives my department would/do pay for are Matlab and SPSS. (Some people also use Java and Python routines.)

    Matlab is great for signal processing. In my earlier work, I would sometimes resort to SPSS because it was hard to figure out the R equivalents for some standard statistical analyses in psychology, but then Quick-R and Jonathan Baron's notes came along.

    Disclaimer: I'm not a psychologist, but a computational linguist and phonetician.

  15. I've also been wondering what the best way to use computing packages for analysis is. I know some Stata and R, and I use Stata for manipulating large datasets and try to get the data into R as soon as the datasets are small enough.

    I'm guessing that this isn't optimal, though. It seems that I am using Stata just for manipulating data, which I would guess a database app would be better for (MySQL, SQLite, PostgreSQL?). I'm not as fluent in SQL, but I'm wondering if it is worth the investment to learn to use it well, given that these play nicely with R.

    I've hobbled along using Stata and R, but I find that I can sometimes trip myself up when I read files back and forth, on trickier fields like dates and such.

    Thoughts from those who are using DBs with R to handle big datasets?
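
    For concreteness, the sort of SQLite workflow I'm imagining (a rough sketch; the file, table, and column names are all made up):

      library(RSQLite)

      # one-time load of the big file into an on-disk database
      con <- dbConnect(SQLite(), dbname = "big.db")
      dbWriteTable(con, "panel", read.csv("big-file.csv"))

      # in later sessions, pull only the slice the analysis needs
      slice <- dbGetQuery(con, "SELECT id, wage, year FROM panel WHERE year >= 2000")
      dbDisconnect(con)

    (SQLite stores dates as text or numbers, so fields like dates would still need care on the way in and out.)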
