Cloud computing

Richard Morey asks:

I was wondering if you or your blog readers have any experience using cloud computing to do simulations or analyses. The idea of packaging a simulation and having many copies of it running on a cloud (like, say, Amazon’s EC2) is appealing. And using it for storage would be nice too.

I’d like to get the opinion of statisticians who might have used this approach before spending time figuring it out.

I have no idea. Any suggestions are welcome.

8 thoughts on “Cloud computing”

  1. I have used EC2 for statistical computing. It's a little close to the metal, so you have to deal with a lot of network setup, virtual consoles, PuTTY, and learning a bit of Linux before you can really get started. I'd suggest installing the two Firefox plug-ins, Elasticfox and S3 Organizer, to automate some of the command-line work. There are tons of images to choose from, Windows included, but Windows boots a lot slower than the Linux and Solaris images and costs more. I'd suggest one of the Ubuntu images you can find at http://alestic.com/. They have desktop versions that can be accessed through a remote-desktop client (like NX for Windows). I'm skipping over some steps, but hopefully this points you down the right path.

    Also, Amazon just released ultra-high-memory instances with 68.4 GB of memory and 1690 GB of storage, which is obviously pretty sick.

  2. I've looked into the idea of running simulations in parallel on computing clusters quite a bit, and my current thinking is that it is not a good idea. The problem occurs when clusters are "distributed memory," meaning that each node is a separate computer that can access only its own memory. So everything (your application, code, packages, and data) needs to be present on each node of the cluster, and you need an efficient way to put the results back together on your own computer. Most computing clusters have software to help you manage jobs across nodes, but I've found distributed computing to be inefficient and a lot of work for an individual researcher to manage. This, of course, would not be the case for massive database projects, with a dedicated IT staff, where the goal is to have excess capacity available to meet surges in demand. In those cases, cloud computing makes lots of sense.

    Shared-memory computing is different. This is when you send each iteration of an algorithm to a different processor in a single machine. All of the cores are accessing the same memory, so you know that each core can reach your code and your data. OpenMP works on this model, and I've found it to be very useful. With OpenMP, you don't need to explicitly tell the system which data goes where; the compiler does that automatically. I work a lot in C, and I can parallelize a for loop with exactly one additional line of code and run it across all 8 cores on my computer (a minimal sketch is at the end of this comment). Most of the newer versions of BLAS (the software that does linear-algebra computation; R and Matlab will call a BLAS to multiply matrices) use OpenMP to parallelize matrix operations (the Intel MKL is an example). And I believe Stata parallelizes operations this way as well.

    Now that processor manufacturers like Intel are putting more and more cores on a chip, I suspect that the long-term advantages of shared-memory parallelization will dominate those of distributed-memory models like cloud computing. Without knowing the details of your particular application, I think that for parallelizing repeated simulations, shared memory is a better model.
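
    For concreteness, here is a sketch of what that one extra line looks like (run_simulation is just a placeholder for whatever your iteration actually computes):

        #include <stdio.h>

        /* Placeholder for one iteration of a simulation; swap in
           the real computation here. */
        double run_simulation(int seed) {
            double x = 0.0;
            for (int i = 0; i < 1000000; i++)
                x += (double)((seed + i) % 7);
            return x;
        }

        int main(void) {
            enum { N = 1000 };
            double results[N];

            /* The single extra line: iterations are handed out to
               whichever cores are free, all sharing one memory. */
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                results[i] = run_simulation(i);

            printf("first result: %g\n", results[0]);
            return 0;
        }

    Compile with gcc -fopenmp sim.c; leave off -fopenmp and the pragma is simply ignored, so the identical code runs serially, which is handy for debugging.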

  3. We use Condor for running simulations. Condor turns every desktop computer in the department into a node in a cluster with no additional hardware costs. When a PC is in active use, Condor stays out of the way. But when a computer has been idle for a while, say on evenings and weekends, Condor gives the PC the next job in the queue.
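
    To give a flavor (a minimal sketch; the binary name sim and the file names are just placeholders), a Condor submit file queues many copies of one job:

        # sim.submit -- queue 100 runs of ./sim on idle machines
        universe   = vanilla
        executable = sim
        arguments  = $(Process)
        output     = sim.$(Process).out
        error      = sim.$(Process).err
        log        = sim.log
        queue 100

    Running condor_submit sim.submit puts the 100 jobs in the queue, and Condor farms them out to whichever desktops are idle.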

  4. We have used quite a bit of distributed computing for solving machine learning problems, and I can highly recommend it, if that's what it takes to scale your problem. Always remember that there is a learning curve, and that distributed computing involves more implementation work and harder debugging. A few comments for the previous posters:

    Chaz Littlejohn: EC2 is not the only option (and I would argue it is the least attractive one) for doing distributed computing. Look at Amazon Elastic MapReduce, which is a much higher-level system that you can drive from Java, Python, or any other language that produces an executable. Elastic MapReduce frees you from all the pain of setting up virtual machines, and data input/output is really easy and integrates nicely with Amazon S3. We are running a blocked Gibbs sampler for the infinite Hidden Markov Model on Elastic MapReduce. Hopefully there will be a blog post about this on my blog soon!

    Michael Braun: packages like Hadoop are extremely easy to use, and you don't have to worry about distributing binaries etc. Another system that we are using, from Microsoft, called DryadLINQ, is even easier. I think managing the distribution is not an issue anymore; distributed debugging is a much harder problem, though. OpenMP is an interesting model, but again it is much closer to the metal. Although some packages like the Intel MKL *might* distribute your data automatically, be aware that if you are programming OpenMP yourself, you will have to do all of this yourself. In our research group, we found that the overhead of managing an OpenMP program (distributing code, etc.) is much higher than in a mature map-reduce environment (Hadoop, DryadLINQ).

    If anyone wants more info or some tips and tricks, feel free to get in touch.

  5. Jurgen, I do most of my programming in C called from R. Is using MapReduce possible with this setup? Perhaps I could move to R with gsl.

    Actually, what would be nice is if we could install some kind of client on all my (nonquant) coworkers' computers so I could steal some of their cycles. They're only using Word and SPSS, right? They wouldn't miss them :)

  6. Richard, the Hadoop framework can be used with any language via Hadoop Streaming (http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#Hadoop+Streaming). See http://developer.amazonwebservices.com/connect/thread.jspa?threadID=32112&tstart=0 for a discussion of using R with Hadoop Streaming. There are also a growing number of R packages that aim to fill this space, e.g. the HadoopStreaming package (http://cran.r-project.org/web/packages/HadoopStreaming/index.html).
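
    As a minimal sketch of the contract (the classic word-count mapper; nothing here is specific to R or to any particular Hadoop release), a streaming mapper is just an executable that reads lines on stdin and writes tab-separated key/value pairs on stdout:

        #include <stdio.h>
        #include <string.h>

        /* Hadoop Streaming mapper: emit "word<TAB>1" for every
           whitespace-separated token on stdin. Hadoop sorts the
           pairs by key before they reach the reducer. */
        int main(void) {
            char line[4096];
            while (fgets(line, sizeof line, stdin)) {
                for (char *tok = strtok(line, " \t\r\n"); tok;
                     tok = strtok(NULL, " \t\r\n"))
                    printf("%s\t1\n", tok);
            }
            return 0;
        }

    Any program obeying that stdin/stdout contract (compiled C, or an R script run with Rscript) can be passed as the -mapper or -reducer argument of the streaming jar.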

  7. Hi Richard,

    Yes, Hadoop does support this style of working. As Avram said, if you can make an executable or script, Hadoop can use it.

    As for stealing your coworkers' cycles: something like Condor can do that for you, but I don't know how nicely Condor and Hadoop play together. To be honest, Amazon's Elastic MapReduce is so cheap that it might not be worth going through the process of installing Condor on all your machines …
