MCMC machine

David Shor writes:

My lab recently got some money to get a high-end machine. We’re mostly going to do MCMC stuff, is there anything specialized that I should keep in mind, or would any computing platform do the job?

I dunno, any thoughts out there? I’ve heard that “the cloud” is becoming more popular.

6 thoughts on “MCMC machine”

  1. You'll have to quantify what you mean by "high end", and what you're going to run. Is it a lot of little jobs, or really big memory-intensive jobs? Is your software multi-threaded? Is it easily decomposable using something like map-reduce? Are lots of people going to be using this high-end machine at the same time? Do you need remote logins? Who's going to maintain the hardware? Do you have enough cooling? Can you stand the noise of the fans?

    I just had the pleasure of using a 32-core, 128GB-memory monster Windows machine at a consulting gig — they can be had for about US$30K, about the same price as 32 dual-core, 4GB-memory machines, which you could set up as your own cloud if you have the network capability.

    If you have lots of little jobs, only occasionally need to run big ones, have problems that are easy to decompose, don't want to manage the machines yourself, and are willing to endure some pain getting things up and running, then cloud computing like Amazon's EC2 can be very attractive in terms of price for performance.

  2. In my experience with MCMC on larger-scale systems (64-core machines up to 4000+-core clusters), parallelization is key. Instead of using the scale to run a few chains for a massive number of iterations, think of it this way: run to burn-in from varied starting points, and take a small number of samples after that. If you can parallelize like crazy (a massive number of simultaneous processes, on the order of 1e2–1e3), you can get great sampling in very little time. Also, watch the memory architecture like a hawk: MCMC can play very unfun games bouncing data between L2, L3, RAM, and disk (think continuous writes for fault tolerance, plus the data structures touched within each iteration). Storage architecture is another issue you'll need to confront. The only major difference between standard best practice and what you need for MCMC is the increased importance of scratch space; consider datacenter-grade SSDs for the scratch space, if possible. I'm just starting to experiment with EC2, but data transfer for terabyte-scale datasets may be a bit rough. Good luck, and I would love to hear how it goes.

  3. Here's where I've had a lot of success in parallelizing MCMC code. Suppose I have N conditionally-independent unit level parameters (N people in the dataset), and K population-level parameters (population prior means or whatnot), in a Gibbs sampler. If the dataset is very large, then N is very large, but K is probably small. Of course I want to be efficient with the number of Gibbs iterations, but what I *really* need to do is have each of these individual Gibbs sweeps run quickly. So even though I cannot run the Gibbs iterations themselves in parallel, I CAN (usually) run the N unit-level updates within each Gibbs iteration in parallel.

    If each of the unit-level updates is quick, then the key is to use a system that exploits shared-memory parallelization. This is typically what happens when you farm out jobs to multiple cores on the same machine. I write my MCMC code in C, and I have had a lot of success using OpenMP to run the N unit-level updates in parallel after adding only 2 extra lines of code. If I allocate 7 of the 8 cores on my Mac Pro to the job, I usually get about a 6.8X speedup in that parallel section of the code for each MCMC iteration. I suspect that there are R packages that do something similar to OpenMP (multicore? snow?).

    For MCMC runs like I just described, the important thing is that it work on a shared-memory model. If you run jobs like this on a distributed cluster, there is a lot of overhead to move data among the nodes of the cluster. That just doesn't happen as much if all of the cores are on the same machine, sharing the same memory. If each of the N updates takes a long time, or you want to run multiple long chains in parallel, the overhead may be dwarfed by the savings from running the jobs in parallel. But if not, using a distributed cluster (like a cloud) will take a lot longer than not parallelizing at all, and certainly longer than a shared-memory solution. In any event, I've just found that this has been an effective strategy for me when conditional independence lets me decompose a single Gibbs iteration into lots of separate parallelizable updates.

    There's also been talk about using GPUs, and this is more of a question for those who know more about this than I do. Doesn't a single GPU core have a smaller instruction set than a CPU? I can see how very simple tasks (like a straightforward matrix multiplication) could be parallelized easily on a GPU, but more complicated steps (say, some kind of grid-based or importance-sampling simulation) might not be able to take advantage of the GPU structure. What I'm thinking is that running a parallel BLAS that calls the GPU, while still parallelizing each Gibbs update across CPU cores, would put the right tool on each task.

  4. I don't do MCMC calculations (though I had a friend who did on our system), but I do use a cluster for embarrassingly parallel computations. I'd like to point out the issue of overhead. Our cluster wasn't very expensive to buy and install originally, but we don't have someone whose job it is to manage it. As a result, the network is a disaster (write too much to the RAID array and everyone's home directories become unusably slow) and the software is in chaos (different nodes have broken or different versions of key software). So keep maintenance and overhead in mind when planning what to buy – in particular, a single multicore monster may be much easier to manage than a cluster. On the other hand, having a cluster does mean that when some piece of hardware dies, you can just sacrifice that node and it becomes spare parts for all the others.

    As for GPUs, they can offer tremendous performance, but you will have to write all the software yourself, and not all algorithms are well-suited to them. For the right problem, though, they're great.
