Databases and R

Alex Reed writes,

Your examples on your website deal with a pretty massive amount of data. I imagine you aren’t using flat files to build those out. Frankly, I am having a little trouble finding a suitable go-between for databases and R. Can you recommend a database system, and with that system, an R library for accessing those databases? If you’re not using a single system, could you elaborate on when you choose each?

I don’t actually understand the question but maybe this is something that would interest some of you. . . . For our book, we did all our data processing in R and sometimes in Stata. Some of our datasets are large but I don’t think they’re huge. On occasion when we do work with huge datasets, we do have difficulty in R and we sometimes do some preprocessing first.

15 thoughts on “Databases and R”

  1. I like to use the RODBC package to link Access databases to R. It can run SQL statements as well as fetch specific tables.
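
    For instance, something along these lines (the file and table names here are just placeholders):

    <pre>
    library(RODBC)
    # open a connection to an Access file (requires the Access ODBC driver)
    con <- odbcConnectAccess("my_database.mdb")
    # pull a whole table into a data frame
    tab <- sqlFetch(con, "SomeTable")
    # or run an arbitrary SQL query
    res <- sqlQuery(con, "SELECT id, value FROM SomeTable WHERE value > 10")
    odbcClose(con)
    </pre>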

  2. For this sort of thing, I'd recommend using MySQL or PostgreSQL to store the data if it is relational or needs a lot of screening/preprocessing. Then you can use a scripting language (Perl/Python/Ruby/etc.) to extract data with queries and to perform further transformations on it. Once you have what you need, you can have the script write out flat files to be read by R.

  3. There are a variety of R packages for interfacing with RDBMSs: one for MySQL, two for PostgreSQL, one for Oracle, as well as a general ODBC package. At work we use MySQL, so I tend to use the RMySQL package and the DBI standard database interface. My desktop machine has enough RAM to work with realized data frames of hundreds of thousands of rows, so all preprocessing tends to be done in R.

    If you don't need to share your data in a read/write manner, you might also look into RSQLite and SQLiteDF, which provide a local RDBMS and a data-frame-like interface, respectively.
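
    The basic DBI pattern with RMySQL looks roughly like this (the database, user, and table names are invented for illustration):

    <pre>
    library(RMySQL)  # loads the DBI interface as well
    drv <- dbDriver("MySQL")
    con <- dbConnect(drv, dbname = "mydb", user = "me",
                     password = "secret", host = "localhost")
    # pull the query result into an ordinary data frame
    dat <- dbGetQuery(con, "SELECT * FROM results WHERE year >= 2000")
    dbDisconnect(con)
    dbUnloadDriver(drv)
    </pre>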

  4. I've had a lot of luck working with SQLite databases and with MySQL; there are standard R/CRAN packages for both. The interfaces are based on a standardized database interface for R. There's also an SQLite-based data frame interface.

    RODBC also exists, but I've had more experience with the two listed above.

  5. I think he's asking how he can use databases (as opposed to delimited text files) with R. If that's what he's asking, MySQL, which is free, can be accessed from R using the RMySQL package on CRAN.

  6. My own experience is that when reading text files into R, the speed increases a lot if you define the classes of the columns beforehand (via the "colClasses" argument of the "read.table()" function).
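
    Something like this, say, for a hypothetical file with an integer, a numeric, and a character column:

    <pre>
    dat <- read.table("big_file.txt", header = TRUE,
                      colClasses = c("integer", "numeric", "character"))
    </pre>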

    However, if you need a database system, R can work well with MySQL and there's a library (RMySQL) that helps a lot. Here is a very simple function to load data into R with that library:

    <pre>
    library(RMySQL)  # also loads DBI

    db <- function(data.frame.output, database, query) {
      mgr <- dbDriver("MySQL")
      con <- dbConnect(mgr, dbname = database)
      # put the query result into a data frame in the global environment
      assign(data.frame.output, dbGetQuery(con, query), envir = .GlobalEnv)
      dbDisconnect(con)
      dbUnloadDriver(mgr)
    }
    </pre>

    The R Data Import / Export manual has a chapter devoted to relational databases:
    http://cran.r-project.org/doc/manuals/R-data.html

    HTH

  7. I use the Rdbi and RdbiPgSQL packages to pull data out of a PostgreSQL database and into R.

    SQL should be a standard tool for anyone working with large datasets. Keeping data in a proper database makes it easier to ask new questions about it. Not that this can't be done in R alone; SQL just makes it more straightforward to organize and relate data.

  8. I would strongly recommend SQLite, with the companion R package RSQLite.
    The system works out of the box (no separate DBMS installation and configuration required) and can handle *huge* datasets.
    Among its advantages, an SQLite database is contained in a single portable file, so you can easily copy data from one computer to another.

    SQLite, as a DBMS, isn't good at concurrent/multiuser db access, but if your main concern with handling a dataset is just its size, I assume this isn't at all an issue for you.
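
    A minimal sketch of the workflow (the file name, the table name, and the data frame my.data.frame are invented):

    <pre>
    library(RSQLite)
    drv <- dbDriver("SQLite")
    con <- dbConnect(drv, dbname = "mydata.sqlite")  # one portable file
    # load an existing data frame into the database
    dbWriteTable(con, "survey", my.data.frame)
    # query pieces of it back as needed
    dat2004 <- dbGetQuery(con, "SELECT * FROM survey WHERE year = 2004")
    dbDisconnect(con)
    </pre>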

  9. If you are using R:

    The DBI R package interfaces well with PostgreSQL. The RMySQL package is based on DBI and also works well with MySQL. MySQL is probably more commonly used, but if you're working with enterprise-scale data or spatial data you will probably want PostgreSQL. The DBI package also works with commercial databases like Oracle. If you must use Windows, the RODBC package is the appropriate interface.

  10. Also, if you are dealing with truly massive datasets, SQLiteDF stores and interacts with data frames that are actually in SQLite tables. It also works with Lumley's biglm. Most of the time, the datasets I'm working with do fit in memory, though. (MCMC samples are another story…)
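
    The same idea, sketched with RSQLite and biglm's chunked update rather than SQLiteDF itself (the table and column names are made up):

    <pre>
    library(RSQLite)
    library(biglm)
    con <- dbConnect(dbDriver("SQLite"), dbname = "mydata.sqlite")
    res <- dbSendQuery(con, "SELECT y, x1, x2 FROM bigtable")
    # fit on the first chunk, then update the fit with the remaining chunks
    chunk <- fetch(res, n = 100000)
    fit <- biglm(y ~ x1 + x2, data = chunk)
    while (nrow(chunk <- fetch(res, n = 100000)) > 0) {
      fit <- update(fit, chunk)
    }
    dbClearResult(res)
    dbDisconnect(con)
    summary(fit)
    </pre>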
