Recently in Statistical computing Category

Asa writes:

I took your class on multilevel models last year and have since found myself applying them in several different contexts. I am about to start a new project with a dataset in the tens of millions of observations. In my experience, multilevel modeling has been most important when the number of observations in at least one subgroup of interest is small. Getting started on this project, I have two questions:

1) Do multilevel models still have the potential to add much accuracy to predictions when n is very large in all subgroups of interest?

2) Do you find SAS, STATA, or R to be more efficient at handling multilevel/"mixed effects" models with such a large dataset (wont be needing any logit/poisson/glm models)?

My reply:

Regarding software, I'm not sure, but my guess is that Stata might be best with large datasets. Stata also has an active user community that can help with such questions.

For your second question, if n is large in all subgroups, then multilevel modeling is typically not needed. But if n is large in all subgroups, you can simply fit a separate model in each group. That is equivalent to a full-interaction model. At that point you might be interested in details within subgroups, and then you might want a multilevel model.

Asa then wrote:

Yes, a "full interaction" model was the alternative I was thinking of. And yes, I can imagine the results from that model raising further questions about whats going on within groups as well.

My previous guess was that SAS would be the most efficient for multilevel modeling with big data. But I just completely wrecked my (albeit early 2000's era) laptop looping proc mixed a bunch of times with a much smaller dataset.

I don't really know on the SAS vs. Stata issue. In general, I have warmer feelings toward Stata than SAS, but, on any particular problem, who knows? I'm pretty sure that R would choke on any of these problems.

On the other hand, if you end up breaking the problem into smaller pieces anyway, maybe the slowness of R wouldn't be so much of a problem. R does have the advantage of flexibility.

Masanao sends in this.

(The oven that Masanao was referring to is described here.)

Cloud computing

| 8 Comments

Richard Morey asks:

I was wondering if you or your blog readers have any experience using cloud computing to do simulations or analyses. The idea of packaging a simulation and having many copies of it running on a cloud (like, say, Amazon's EC2) is appealing. And using it for storage would be nice too.

I'd like to get the opinion of statisticians who might have used this approach before spending time figuring it out.

I have no idea. Any suggestions are welcome.

Don't comment code

| 15 Comments

I'd heard this before, but good advice is typically worth repeating. For those of you who program in R, I'd also recommend that scripts be written so it can run from scratch in an empty R environment. Many many times I've found my R scripts and environments to be palimpsests whose meanings are difficult to unravel. (The official recommendation, I guess, is to put everything in R packages, but I've never actually learned how to do this.)

Bing sucks

| 11 Comments

When I search Gelman in Google, I'm right up there at #2. With Bing, I'm not even on the front page. Heck, I'm not even on the second page! Or the third, or the fourth, or the fifth, or the sixth, . . . OK, enough already! I know, I know, I shouldn't be searching myself anyway, but I had a legitimate reason . . . I had to find my talk on the web today from someone else's computer.

P.S. OK, I take it all back about Bing. I searched my name on Yahoo, and again my homepage did not appear in the first seven pages of search listings. So, really, I shouldn't be blaming Bing, I should be thanking Google for being so nice to me.

Thanks, Google!

Steve Farmer writes:

I am working on a hierarchical logistic regression model is SAS using the GLIMMIX function. I really liked your average predictive comparisons Figure 21.7 from your book, "Data Analysis Using Regression and Multilevel Hierarchical Models" (2007, p 473). I wondered if you could direct me to the easiest way to accomplish this analysis short of doing it manually. Is there a command or macro you could direct me to in SAS or STATA?

My reply: Some version of this may actually be available in Stata, although my guess is that it would be based on a single central point rather than averaging over all the data as we do in our book (and, before that, in my article with Pardoe). I've been planning to put it in an R package sometime but have never gotten around to it.

I received the following email:

I was hoping if you could take a moment to counsel me on a problem that I'm having trying to calculate correct confidence intervals (I'm actually using a bootstrap method to simulate 95%CIs). . . . [What follows is a one-page description of where the data came from and the method that was used.]

My reply:

Without following all the details, let me make a quick suggestion which is that you try simulating your entire procedure on a fake dataset in which you know the "true" answer. You can then run your procedure and see if it works there. This won't prove anything but it will be a way of catching big problems, and it should also be helpful as a convincer to others.

If you want to carry this idea further, try to "break" your method by coming up with fake data that causes your procedure to give bad answers. This sort of simulation-and-exploration can be the first step in a deeper understanding of your method.

And then I got another, unrelated email from somebody else:

I am working on a mixed treatment comparison of treatments for non-small cell lung cancer. I am doing the analysis in two parts in order to estimate treatment effects (i.e. log hazard ratios) and absolute effects (by projecting the log hazard ratios onto a baseline treatment scale parameter; the baseline treatment times to event are assumed to arise from a Weibull distribution. . . . .[What follows is a one-page description of the model, which was somewhat complicated by constraints on some of the variance parameters] . . . I can get my analysis to run with constraints imposed on the treatment specific prior distributions for PFS and OS, and on the population log hazard ratios for PFS and OS. However, my proble is that the constraint does not appear to be doing anything and the results are similar to what I obtain without imposing the constraint. This is not what I expect . . .

My reply:

Sometimes the data are strong enough that essentially no information is supplied by external constraints. You can, to some extent, check how important this is for your problem by simulating some fake data from a setting similar to yours and then seeing whether your method comes close to reproducing the known truth. You can look at point estimates and also the coverage of posterior intervals.

R Flashmob

| 2 Comments

Dan Goldstein, Michael E. Driscoll, and JD Long write:

We're organizing an "R Flashmob" to get students and researchers asking and answering R questions on the very useful questions site "stackoverflow.com" during one fun-filled hour.

We were wondering if you might help out in promoting this event by blogging something about it and/or emailing it to students or researchers who would benefit by having a friendly, efficient resource for asking and answering R questions. It would be great if you could!

MICE, by Stef van Buuren and others, is an R package with many similarities to our "mi". Actually, MICE came first, and it's recently been updated. Should be available as an R download very soon. It would be interesting to see how the programs differ. We should probably look more carefully into cases where they give different results.

Everybody says OmniGraphSketcher is great, but I can't use it because I don't have a Mac. Is there anything that people recommend for Windows? I'm trying to do a lot of remote meetings with people, so this seems like it could be a useful tool.

Big juicy datasets

| No Comments

From Ben Olding, the MIT cell phone data and a corresponding article. I can't remember what this was about, but when Ben described it to me a few months ago, I recall that it sounded cool.

I'm still waiting for someone to work with me to reanalyze Iyengar and Fishman's speed-dating data using hierarchical models.

Lee Wilkinson sends along this cool paper demonstrating a simple but effective automatic classifier:

Linf is a classi er that was designed to address the curse of dimensionality and polynomial complexity by using projection, binning, and covering in a sequential framework. For class-labeled points in high-dimensional space, Linf employs computationally-efficient methods to construct 2D projections and sets of rectangular regions on those projections that contain points from only one class. Linf organizes these sets of projections and regions into a decision list for scoring new data points.

Linf is not a hybrid or modi cation of existing classifi ers; it employs a new covering algorithm. The accuracy of Linf on widely-used benchmark datasets is comparable to the accuracy of competitive classi fiers and, in some important cases, exceeds the accuracy of competitors. Its computational complexity is sub-linear in number of instances and number of variables and quadratic in number of classes.

I also like the article's delightfully understated conclusion. After a page of bullet points on the virtues of their method, the authors write:

Given these distinctive features and its fundamental di fferences from other classi fiers, Linf is a candidate for inclusion in portfolios of classi fiers.

If only all of us could be so modest.

P.S. Table 1 and Figure 5 shouldn't be in alphabetical order, and I think Figure 6 would work better as a parallel coordinates plot. These are pretty minor comments, but Lee is an authority on statistical graphics so I hold him to a high standard.

Hybrid Monte Carlo

| 4 Comments

Richard Morey writes:

On your blog a while back, you asked why more people aren't using Hybrid (Hamiltonian) Monte Carlo. I have tried it, and found that it works quite well for many applications, but not so well for others (parameters with bounded space, and parameters with whose log-posterior has exponential functions in them, specifically). When I started using it, there wasn't much out there about it, precisely because it hasn't caught on. Well, to help remedy that a bit, I've created a CRAN package to do hybrid Monte Carlo sampling (HybridMC), and I thought this may be of interest to your readers. The back end is written in C, so it is quite fast. I've had good luck with it so far.

Cool. We should take a look at this.

R Web Services

| No Comments

Ed Sanchez writes:

My company, Cumulo Software, has developed a very powerful technology that allows you to turn any R program into a web service in minutes. There is no network programming - you only need to parse simple command line arguments inside R, and then return values via 'cat'. We have many samples in R, and we are adding more every day.

Our product is called SAASi, and it has an easy to use web interface to define web services. All web services created with SAASi have strict access controls that you specify. We also provide detailed usage statistics that you can use to monetize your web services, among other things.

We are hoping to attract R experts that want to bring innovative R technologies online.

I haven't had a chance to look at this, but I thought it might interest some of you.

Upgrading R

| 8 Comments

Every couple of months there's a new version of R. Now it's R 2.9.1. I better download it, since some packages I use might depend on the latest version.

Can somebody out there in R-land please put an "update" button in the R console? Or, better yet, have R check occasionally for updates and then allow me to install with one click, in a way that will transfer all my downloaded packages automatically.

Thanks.

P.S. Yes, yes, I know. R is free, and if I really want this done, I can do it myself. But I'm doing other things for R! And others would be much better able than I to set up the automatic install as described above.

P.P.S. If youall are making changes to R for me, I also suggest replacing the current display of lm and glm fits with the output from display() in the arm package.

Ian Fellows writes:

Being as you are an R user at the intersection of the social sciences and statistics, I thought some recent work I've done might be of interest to you. SPSS has long dominated the teaching and practice of statistics in the social sciences (at least among non-statisticians). I've created a new menu driven data analysis graphical user interface aimed at replacing SPSS (or at least that's the long term lofty goal). It has just been released under GPL-2 on CRAN. Feel free to check out some screen shots in the online wiki manual (not yet complete).

I don't know SPSS, but just yesterday someone told me that people can run R from SPSS and get a convenient menu system, so if this freeware would have the same capacity, that would be great. Here's the description:

It's been a dramatic month: A month ago, a coalition of some of the leading teams qualifies for the $1 million grand prize for improving the accuracy of the movie-recommending model by more than 10%. But, they would close the competition 30 days afterward, in case someone else is able to improve upon the result. This happened less than a day before the deadline, by The enormous Ensemble, composed of 23 previously separate teams and individuals. Of course, most of the progress towards the victory was through the models making use of new significant patterns in the data, such as that of time.

The development of an ensemble from many separate teams was another accomplishment, and the GPT's inclusion rules provide some insight into the process: "shares" of the winnings were distributed based on how much was a contribution able to improve the result in terms of percentage points. Simon Owens describes what it was like to participate in The Ensemble.

Bayesian statistics always works with ensembles: the posterior is a weighted average of all models, the weight being based on the fit of each model times the prior quality of the model. There are some additional Bayesian elements that could be a part of future competitions, such as Bayesian scoring functions.

In the past I was asked to contrast Occam's razor with the Epicurean principle. Occam's razor is the Bayesian prior, or the the yang principle: simpler models have greater a priori weight (because we tend to economize that what is useful). Occam's razor goes back to Aristotle, who wrote "For the more limited, if adequate, is always preferable," and "For if the consequences are the same, it is always better to assume the more limited antecedent" in his Physics. We mathematically express it as the prior.

Epicurean principle is the yin, or mathematically expressed as the integral over the model space. Ensembles go back to Epicurus' letter to Herodotus: "When, therefore, we investigate the causes of [...] phenomena, [...] we must take into account the variety of ways in which analogous occurrences happen within our experience." Thus, Bayesian statistics combines the yin and the yang, balancing the pursuit of simplicity with the limitations of uncertainty.

[7/31/09: Added a link to Simon Owens' interview with The Ensemble.]

Our article (by Yu-Sung, Jennifer, Masanao, and myself, and based also on work with Kobi, Grazia, and Peter Messeri) will be appearing in the Journal of Statistical Software, in a special issue on missing-data imputation. Here's the abstract:

Our mi package in R has several features that allow the user to get inside the imputation process and evaluate the reasonableness of the resulting models and imputations. These features include: flexible choice of predictors, models, and transformations for chained imputation models; binned residual plots for checking the fit of the conditional distributions used for imputation; and plots for comparing the distributions of observed and imputed data in one and two dimensions. In addition, we use Bayesian models and weakly informative prior distributions to construct more stable estimates of imputation models. Our goal is to have a demonstration package that (a) avoids many of the practical problems that arise with existing multivariate imputation programs, and (b) demonstrates state-of-the-art diagnostics that can be applied more generally and can be incorporated into the software of others.

We've made lots of improvements since listing the package last year (here). There's still a lot more work to do, in many different directions (including multilevel models, nonignorable models, the self-cleaning oven, and making the program run faster in sorts of ways), and we keep improving it. But it's good to have something out there.

To actually get the R package, just open your R window, click on Packages, Install packages, and grab mi.

Google Fusion Tables

| 4 Comments

Google just launched a pre-alpha "Fusion Tables". The visualization capability is okay, the interface is not fully stable, but the cool thing is the ability to merge two tables, something I've spent a lot of time doing manually in the past, or with ad-hoc scripts.

Here's an example where I merge their GDP table with a disease table. I need to pick the "WHO Regions/Country" in the right column, so that both tables get aligned:

fusion-tables.png

Afterwards, I can do a scatter plot of GDP rank (X) with child mortality/1000 (Y):

gdp-child-mortality.png

So, high GDP makes child mortality less likely, but not always, and it's not a correlation.

Even if Fusion tables is pre-alpha, the table fusion capability makes it immediately useful. The collaboration features look cool, but it will take some time to get them to work right. Then we'll have proper horizontal collaboration.

The Dice-O-Matic

| 1 Comment

Aleks points me to this cool-looking automatic dice thrower. I like it!

Predicting the spread of the flu

| 1 Comment

The NY Times has had an article today on Sunday about predicting the spread of swine flu using a computer program with data on air traffic, commuter traffic, and the movement of dollar bills. I don't know a lot about epidemiology, so I will leave it to others to comment on the intricacies, but I appreciate the idea of this sort of model, especially when they discussed adding in other human behavior factors that change the playing field (like closing schools and people becoming more germ-conscious).

I particularly liked the last paragraph in the article:
"...one thing remains true: 'People have a very weird perception of large numbers,' [Dirk Brockman, an engineering professor at Northwestern] said. 'If you have 2,000 cases of flu in a country of 300 million, most people think they're going to be one of the 2,000, not one of the 299,998,000.'"

When demonstrating his Alice in Wonderland example, Brad Paley showed how the words in the center of the display were located by grabbing a word with his mouse, clicking to show its connections with places in the text, and then moving the word, showing the lines of the connections stretching, then letting go to show the word bounce back. The image of the word connected to the places using rubber bands was clear.

What I want to know is, can somebody rig a robot arm to do this so I could feel the pull? Imagine a robot arm that can be moved within a 30cm x 40cm box. You could use this to feel the springiness of the connections in Brad's diagram; the idea is that you'd say a word (for example, "Alice"); the pointer of the robot arm would move to the position of the word in the display, and then you could--with effort--move the robot arm away from this place. When you let go or relax your grip, the pulling of the virtual rubber bands would return the arm back to its original place, and you could feel the strength of the pull.

The arm could also be used to feel a curve (for example, a nonlinear regression or spline, or a mathematical function such the logarithm or the normal distribution curve), as follows: the arm would start at one end of the curve and the user could grip it and move it along, with the motion physically constrained so that the arm would trace the curve.

In displaying several curves--for example, level curves indicating indifference curves - the arm could start on one of the curves and be programmed to stay on that curve, unless the user pushes hard, in which case there would be resistance during which the arm moves between curves. It would then lock into the next curve, which the user could again trace until he or she pushes hard enough to get the arm unstuck again.

More generally, the robot arm could be used for exploring three-dimensional functions such as physical potentials, likelihood functions, and probability densities. From any point in the two-dimensional box, a "gravitational force" would pull the arm toward a local minimum (or, for a likelihood or probability density, the maximum) of the function. Then with moderate effort the user could move the arm around and, by feeling the resistance, get a sense of gradients, minima, and constraints.

(for example, a nonlinear regression or spline, or a mathematical function such the logarithm or the normal distribution curve), as follows: the arm would start at one end of the curve and the user could grip it and move it along, with the motion physically constrained so that the arm would trace the curve.

In displaying several curves--for example, level curves indicating indifference curves - the arm could start on one of the curves and be programmed to stay on that curve, unless the user pushes hard, in which case there would be resistance during which the arm moves between curves. It would then lock into the next curve, which the user could again trace until he or she pushes hard enough to get the arm unstuck again.

More generally, the robot arm could be used for exploring three-dimensional functions such as physical potentials, likelihood functions, and probability densities. From any point in the two-dimensional box, a "gravitational force" would pull the arm toward a local minimum (or, for a likelihood or probability density, the maximum) of the function. Then with moderate effort the user could move the arm around and, by feeling the resistance, get a sense of gradients, minima, and constraints.

An example of how I'd like to use the robot arm together with a visual graph of data and model fit

I was originally thinking of this as a statistical tool for blind people, but I'd actually like to have one of these myself, for example to understand the sensitivity of a model fit to changes in parameter values. I'm thinking of twp graphs next to each other: a graph of parameter space and a graph of data with fitted curves. The robot arm would be pointed to the posterior mode or maximum likelihood estimate in parameter space. As I moved the robot arm around, I'd feel resistance--it would be difficult to move far in parameter space without feeling the increase in -log(density)--and at the same time the curve would be moving in the fitted curve + data plot. The muscular resistance information on one graph and the visual information on the other graph would together give me a sense of what aspects of the data are determining the model fit.

Here's an example of what it might look like:

2graphs.png

I'd also like to be able to use the robot arm to pull on the fitted curve and feel the resistance as I move it away from the data.

P.S. Here's the R code for the above graphs:

Aurélien Madouasse writes:

Bugs sucks

| 11 Comments

I programmed a Bugs model (see below) and got this horrible and useless error message: Bugs opened a "Trap: incompatible copy" window, and there was nothing in the log window but this:


display(log)
check(C:\DOCUME~1\ANDREW~1\LOCALS~1\Temp\RtmpMCdEbN/pewpoststrat2.bug.txt)

No syntax error, no nothing. My first thought, of course, was that there was something wrong with the default directory, but when I copied the "schools" example over, it worked fine. So I went through the laborious process of debugging, stripping down the code until I could get it to run and then gradually adding back lines until it crashed. I still couldn't figure out what was wrong so I brought Matt over and he figured it out at a glance.

Here was the original code:

I have a problem that comes up a lot in programming, and I'm sure the experts can help me out. It has to do with giving names to objects that I'm working with. Sometimes the naming is natural; for example, if I have a variable for sex in a survey data, I can define "male" as 1 for men and 0 for women and then continue with names such as "male00" for data in the 2000 survey, "male04" for the 2004 survey, and so forth, or else use a consistent set of names and define things within dataframes. In either case, this is no problem.

What is more of a hassle is coming up with names for temporary varaibles. For example, should I use "i" for every looping index, with "j" for the nested loops, "k" for the next level, and so forth? This can get confusing when referring to array indexes (sometimes you end up needing things like a[j,i]), or should I go for mnemonics, such as "i" to index survey respondents, "s" for states, "e" for ethnic groups, "r" for religious denominations, etc? Sometimes I've tried to reserve "j" for the lowest level of poststratification cel, but that isn't always convenient.

At other times I've tried more descriptive names, for example n.state for the number of states (50 or 51, depending on whether D.C. is included) and i.state for state indexes, thus giving loops such as "for (i.state in 1:n.state)" or "for (i.state in states)," where "states" has been pre-defined as the vector of state numbers. This approach is currently my favorite--I tried to stick to something like it in my book with Jennifer--but can also create its own difficulties, as I have to remember the names of all the indexing variables.

Beyond looping indexes, there are all sorts of temporary variables I'm creating all the time; for example, after fitting a multilevel model, pulling out coefficient estimates: "fix <- fixef (M2000.all)." There's always a tension in naming these temporary quantities: on one hand, I don't want meaningless names such as "temp" floating around; on the other, every new variable name is another thing to remember, which is annoying if it's only being used in two lnes of the program.

I guess the real solution is to ruthlessly compartmentalize: in R, this essentially means that you turn every paragraph of code into a function and get rid of all globally-defined variables. I haven't always had the discipline to do this, but maybe I should try.

GPU Supercomputers

| 12 Comments
An example of a computer cluster

Image via Wikipedia

Computer games have been the driver behind large advances in 3D graphics hardware over the past decade. It turned out that the hardware developed for rotating, projecting and rendering many triangles can also be used for other purposes, and this is the notion of "general-purpose computing on graphics processing units" or just GPGPU.

The research and development community's center is gpgpu.org. There are several environments for development of software on such hardware, CUDA, brook, OpenCL, libsh, CTM.

I have looked through their list of CUDA example applications, but couldn't find any statistical applications. Some related ones in machine learning and Markov chains claim 50-fold speedups over conventional PC architectures, without the complexity of running a whole cluster of computers. Now, computation of likelihood and MCMC are inherently extremely paralelizable, and such hardware could make it easier to fit sophisticated models. This would be a good topic for a computationally minded PhD thesis.

Jose Aleman points me to this conference on 18-19 June at Fordham University:

This conference is about applications of the R software and Graphics system to important policy and research problems, not about R per se. It provides an excellent opportunity to bring together researchers from various disciplines using R in their reproducible research work. We hope to provide practical help to students and researchers alike.

It says here that I'm an invited speaker. I don't actually remember being asked to do so, but if I did, then I guess I'll be there! Fordham is conveniently located near the zoo so perhaps I can somehow combine this with a family trip.

P.S. I checked more carefully and the conference is actually at the Manhattan branch of Fordham (at Lincoln Center). So no zoo trip, unfortunately!

$88 (or $110 list)

| 9 Comments

Why I don't (usually) publish with Wiley. I want to get it, though.

I posted the pretty maps at 538. (I'd post them here, but Jeff was complaining that I was crossposting too much.) But one thing that people do like here is R code, so here's some:

> M1 <- glmer (rvote ~ z.inc2*z.state.income.full + (1|inc2) + (1 + z.inc2 | stnum), family=binomial(link="logit")) > display (M1) glmer(formula = rvote ~ z.inc2 * z.state.income.full + (1 | inc2) + (1 + z.inc2 | stnum), family = binomial(link = "logit")) coef.est coef.se (Intercept) -0.06 0.05 z.inc2 0.52 0.07 z.state.income.full -0.49 0.07 z.inc2:z.state.income.full -0.27 0.11

Error terms:
Groups Name Std.Dev. Corr
stnum (Intercept) 0.20
z.inc2 0.30 0.62
inc2 (Intercept) 0.06
Residual 1.00
---
number of obs: 20510, groups: stnum, 49; inc2, 5
AIC = 27668.9, DIC = 27652.9
deviance = 27652.9

I used the estimates from this model to extract McCain's estimated two-party vote share for each income category within each state. I then took a weighted average for each state (weighting by the number of respondents in each income category) and did one final adjustment by shifting the estimates for each category so that the average came out to the same as the actual vote outcome within each state.

I won't show you this R code because it's too damn ugly; I'm embarrassed.

Yair showed me this. It's simply amazing. Click on the link RIGHT AWAY and be awed. Great examples and all the code you'll ever need.

Our new R package: R2jags

| 2 Comments

I have got emails occasionally from JAGS users, asking about our new R package: R2jags. Basically, R2jags runs JAGS via R and makes postanalysis easier to be done in R. Taking advantage of the functions provided by JAGS, rjags and R2WinBUGS, R2jags allows users to run BUGS in the same way as they would do it in R2WinBUGS. Nonetheless, R2jags has some powerful features that facilitate the BUGS model fitting:

  1. If your model does not converge, update it. If you used to be a R2WinBUGS user, you must feel frustrated that your model does not converge. The best thing you can do is to use your current states of parameters as the starting values for the next MCMC. On the other hand, in R2jags, you just type in the R console:

    fit <- update(fit) 
    

    This will update the model.
  2. If you have to shutdown your machine but the model is still not converged.... In R2jags, you can go ahead and save the R workspace and shutdown your machine. When you are ready to run the model again, load the workspace in R and type:

    recompile(fit)
    fit.upd <- update(fit)
    

    This will recompile the model, which means you can update the model again!!

If you want to explore more features of the R2jags, just type ?jags in the R console. The example code contains all the functions in the R2jags.

Installing JAGS and rjags can be tricky in the Mac OS or in the Linux system. I have a blog entry here that shows how this can be done. If you are not a Window user, this post might help you.


I got an email from the people at Revolution computing reporting:

REvolution Computing has now made a public version of its commercial grade REvolution R program available for download from its website. REvolution R is REvolution Computing's distribution of the popular R statistical software, optimized for use in commercial environments. A key feature includes the use of powerful optimized libraries capable of boosting performance by a factor of 5 or 10 for commonly used operations.

Here it is.

There's also a payware version, which "features advanced functionality, including ParallelR, which speeds deployment across both multiprocessor workstations and clusters to enable the same codes to be used for prototyping and production. REvolution R Enterprise is functional with 64-bit platforms and Linux enterprise platforms and provides for telephone support and response guarantees."

I don't know anything about this but at least in theory it sounds like a good idea! If anyone has any comments, feel free to share them.

Multicolor text in R

| 1 Comment

Hey, I've wanted to do this for awhile! Example code here.

I gotta say, I find the expression() function incredibly difficult to use. Examples are key.

It's all R all the time around here, as Chris Blattman asks:

How can busy economists and political scientists learn R quickly? Is there a good guide? A set of handy ready-made programs? I learned MATLAB and STATA inside out in econ grad school, but now that I'm in a poli sci dept, and because the technology has progressed, I feel like I should learn R. I tried to teach myself a year ago (with no aids) and it was not pretty, even though I am usually pretty good at these things. So I abandoned it. But the article suggests R is easy to use. Did I miss the magic instruction book? If you have time to post on this, it would be a real boon to me and others I am sure.

My suggestion is to start with John Fox's book, An R and S-Plus Companion to Applied Regression and follow up with my book with Jennifer Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, which is full of R code. Sure, lots of the code is messed up, but we're planning to put together a clean version soon. . . .

If anyone has other suggestions for Chris, feel free to say something in the comments.

Equal time for SAS

| 12 Comments

In a comment here, Steve Polili of SAS writes, "neither SAS nor Anne [Milley] hates R or open source. We run on Linux and we love Apache. For a little more info on SAS and R, take a look at a followup on Anne's blog [entry]," where Anne writes:

First, SAS and I applaud the innovative contributions and passion of the R community, and users who apply R to solve problems. In a very real sense, we are grateful for R, as it provides a freely available venue for bleeding-edge and experimental data analysis methods, and underscores the increasing importance of advanced analytical and graphical methods in this age of massive data volumes.

Over the past three decades, SAS has made and continues to make many noteworthy contributions to advanced data computing. As a trusted supplier to a large and diverse set of organizations, SAS provides analysis software that has been refined over years of customer application and feedback. SAS software is fully supported and used daily in countless ways. SAS is also scalable to very large data sets (multi-threaded, grid-enabled, etc.). These issues remain very important to the organizations we serve.

This seems reasonable to me. There's certainly room in the world for SAS, R, Stata, and SPSS, as well as Fortran, C, Python, and even Excel. I find it useful to have different software for different purposes. I'd like R to have better data input and merging tools, but until I become aware of such tools, I'll continue to use Stata (indirectly) to do some of this.

So, in that case, how could I go around saying, "I just hate SAS"?

Bob pointed me to this article by Ashlee Vance following up on the recent newspaper article. Bob writes that "Pfizer's bought into the FUD (fear, uncertainty and doubt) argument that marketers employ to discourage the use of open-source or other free software." From the Vance article:

Pfizer was a prominent R user mentioned in the story. The company relies on R for its nonclinical drug studies and has shied away from using the technology for clinical research that will ultimately be presented to regulators. For such work, Pfizer instead turns to software from SAS Institute, which brings in more than $2 billion a year in revenue from data analytics software and services.

Were Pfizer to use R in clinical studies, it would run the risk of seeing its research questioned or even rejected by regulators doubting the veracity of results based on what they view as an unknown quantity.

"It's very hard to displace the industry standard in those types of cases," said Max Kuhn, associate director of nonclinical statistics at Pfizer.

I'm actually working with Neal Thomas and other people at Pfizer on an improved and more trustworthy OpenBugs implementation that they can use for their research. It's actually worth it for Pfizer to put resources into an open-source project: open-source can mean more beta-testing and more reliability.

At a technical level, Sam Cook and I are working with them on implementing unit testing (the so-called "self-cleaning oven"; see item 5 here) for Bayesian modeling, following our earlier work in this area.

Finally, Vance concludes with a discussion of the size of R's user community. I imagine this is tricky to define--for example, do you count students?

R in the news

| 35 Comments

See here. I pretty much agree with what they're saying, except that I think R occupies a position as much as it serves a function. By this I mean that, if R didn't exist, we'd be doing similar things using something else, whether it be Matlab, Mathematica, or some Python-based confection. When I think about what I actually do in R, it wouldn't actually be so hard to do most of it from scratch. This is not to disparage R, just to say that it's filled a niche.

And I certainly wouldn't characterize R as "a supercharged version of Microsoft's Excel spreadsheet software." Or maybe I should say that I didn't know that R had spreadsheet capabilities. One more thing to learn, I guess. Also another motivation for Jouni to finish Autograph.

And it's good to hear that SAS is in trouble. I just hate SAS. It's not just a matter of it having poor capabilities; SAS also makes people into worse statisticians, I think.

P.S. The reporter contacted me about this story a few weeks ago, but I don't actually remember what I said (or even whether I was ever actually reached). Certainly nothing memorable enough to quote.

We recently uploaded on to CRAN multiple imputation package "mi" which we have been developing.

The aim of package mi is to make multiple imputation transparent and easy to use for the user. Hence there are few characteristics that we believe are valuable.
1. Graphical diagnostics of imputation models and convergence of the imputation process.
2. Use of bayesglm to treat the issue of separation.
3. Imputation model specification is made similar to how you would fit a regression model in R.
4. It automatically detects some problematic characteristics in the given dataset and alerts the user.

Please give it a try if you have any dataset that has missingness.

Also we are still in the process of improving the package, thus your input is most welcome.

One caution is if you are using big dataset with large number of missingness across many variables, it may take some time for process to converge. We admit, it is not the fastest imputation package on the market.

However, once we can get the basics down, speeding things up is not so difficult. So please bare with it for now.

There are future directions we plan to expand such as imputation of time-series cross-sectional data, hierarchical data, etc. But for now these features are not part of the package.

Happy Holidays!!

Inference from simulations

| 4 Comments

We talk a lot about inference from Markov chain simulation (Gibbs sampler and Metropolis algorithm), convergence, etc. But inference from simple random samples can be nonintuitive as well.

Consider this example of inference from direct simulation. The left column shows five replications of 95% intervals for a hypothetical parameter theta that has a unit normal distribution, each based on 100 independent simulation draws. Right column: five replications of the same inference, each based on 1000 draws. For both columns, the correct answer is [-1.96, 1.96].

Inferences based on Inferences based on 100 random draws 1000 random draws [-1.79, 1.69] [-1.83, 1.97] [-1.80, 1.85] [-2.01, 2.04] [-1.64, 2.15] [-2.10, 2.13] [-2.08, 2.38] [-1.97, 1.95] [-1.68, 2.10] [-2.10, 1.97]

From one perspective, these estimates are pretty bad: even with 1000 simulations, either bound can easily be off by more than 0.1, and the entire interval width can easily be off by 10%. On the other hand, for the goal of inference about theta, even the far-off estimates above aren't so bad: the interval [-2.08, 2.38] has 97% probability coverage, and [-1.79, 1.69] has 92% coverage.

So, in this sense, my intuitions were wrong and wrong again: first, I thought 100 or 1000 independent draws were pretty good, so I was surprised that these 95% intervals were so bad. But, then, I realized that my corrected intuition was itself flawed: actually, nominal 95% intervals of [-2.08, 2.38] or [-1.79, 1.69] aren't so bad at all.

This is an example I came up with for my chapter that's going into our forthcoming Handbook of Markov Chain Monte Carlo (edited by Galin Jones, Xiao-Li Meng, Steve Books, and myself).

Computational Finance with R

| 3 Comments

Jan Vecer is co-organizing a conference here at Columbia on 4 Dec on computational finance with R. Registration information is at the link.

bayesglm in Matlab?

| 3 Comments

I received the following email:

I recently became aware of two papers by David van Dyk on a new approach to Gibbs sampling using incompatible conditional distributions. This seems similar to the parameter expansion or redundant parameter idea developed by C. Liu, J. Liu, Meng, Rubin, van Dyk, and others, but perhaps a bit more generalizable and thus usable in routine problems.

Here's the theoretical paper (with Taeyoung Park).

And here's the more applied paper (which has a logistic regression example), with Hosung Kang.

This looks great, although I'm still not sure exactly how to apply this to our problems. Maybe we're getting closer, though...

The following is my discussion of the article, "Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations" by H. Rue, S. Martino and N. Chopin, for the Journal of the Royal Statistical Society:

Statisticians often discuss the virtues of simple models and procedures for extracting a simple signal from messy noise. But in my own applied research I constantly find myself in the opposite situation: fitting models that are simpler than I would like—models that clearly miss important features of the data and, more importantly, important features of the underlying system I am modeling—because of computational limitations.

Bob Carpenter writes the following regarding Extreme Programming, focusing specifically on some of my struggles in statistical computing:

Well, it's a bit extreme. I think you'll find better overall advice in Hunt and Thomas's Pragmatic Programmer without all the silver bullet rhetoric. I wouldn't bother with the Agile development stuff, but that's the currently trendy descendant of this whole line of thinking.

Having said that, there are several good take-away messages from the Extreme Programming (XP) process, but I don't believe, as its proponents do, that you need to do everything their way.

I love pair programming -- it's not only a great way to learn for a novice/expert or expert/expert pair, it's a great way to keep quality high by keeping each other honest and it's a great way to catch bugs early on. But if you follow the XP advice to the letter, that's the only way you'd program, which is impractical in most groups.

You should start using unit testing, which I believed your group referred to as a "self-cleaning oven" (though don't keep the tests in the same file as the code).

Research coding is both good and bad for XP. The specifications tend to move around even more than in the usual XP project, because you don't even know if what you're trying to do is possible before you start much of the time.

And you definitely need version control, which Masanao and Yu-Sung set up for you through RForge.

With some R and BUGS under my belt, I really wish it was easier to stick to the don't repeat yourself (DRY) principles. All of the R code I've seen could use much more modularity.

Finally, you should get into refactoring; it's really what you're trying to do with BayesGLM, though it may be easier to refactor GLM first if you're going to start from scratch.

Bayesian computation in Java?

| 5 Comments

John Payne writes:

I am writing a Java program to do ecosystem modeling and we wish to use Bayesian MCMC methods for parameter estimation. I am interested in finding flexible, customizable Bayesian MCMC code that can be called in Java. I looked at documentation for JAGS, BUGS, and also Gregory Warnes's Hydra program (which is in Java). I have been unable to get a reply from Dr. Warnes but it seems Hydra is no longer being supported. As far as I can tell, BUGS is written in Component Pascal, which I am ignorant about. I have never tried JAGS; would you have any advice as to which avenue would be the most fruitful to pursue?

My reply: Right now, I think Jags is probably the best way to go. But others might have other suggestions here.

Burn-in Man

| 3 Comments

Mike McLaughlin writes:

I was wondering about MCMC burn-in and whether the oft-cited emphasis on this in the literature might not be a bit overstated.

My thought was that the chain is Markovian. In a Metropolis (or Metropolis-Hastings) context, once you establish the scale of the proposal distribution(s), successful burn-in gets you only a starting location inside the posterior -- nothing else is remembered, by definition! However, there is nothing really special about this particular starting point; it would have been just as valid had it been your initial guess and the burn-in would then have been superfluous. Moreover, the sampling phase will eventually reach the far outskirts of the posterior, often a lot more extreme than the sampling starting location, yet it will still (collectively) describe the posterior correctly. This implies that *any* valid starting point is just as good as any other, burn-in or no burn-in.

The only circumstance that I can think of in which a burn-in would be essential is in the case in which prior support regions for the parameters are not all jointly valid (inside the joint posterior), if that is even possible given the min/max limits set for the priors. Am I missing something?

My response: What you're missing is that any inference from a finite number of simulations is an approximation.

Consider an extreme example in which your sample takes independent draws from a N(mu,sigma^2) distribution, but you pick a starting value of X. The average of n simulations will then have the value, in expectation, of (1/n)X+ ((n-1)/n)mu (instead of the correct value of mu). If, for example, X=100 and n=100, you're in trouble! But a burn-in of 1 will solve all your problems in this example. (And in this example, n=100 would work just fine for most purposes.) True, if you draw a few gazillion simulations, the initial values will be forgotten, but why run a few zillion simulations if you don't have to? That will just slow down your important work.

More generally, your starting values will persist for awhile, basically as long as it takes for your chains to mix. If your starting values persist for a time T, then these will pollute your inferences for some time of order T, by which time you can already have stopped the simulations if you'd discarded some early steps.

P.S. See here for a different perspective, from Charlie Geyer. For the reasons stated above, I don't agree with what he writes, but you can read for yourself.

P.P.S. In my example above, you might say that it would be ok if you were to just start at the center of the distribution. One difficulty, though, is that you don't know where the center of the distribution is before you've done your simulations. More realistically, we start from estimates +/- uncertainty as estimated from some simpler model that was easier to fit.

How many simulations do you need?

| 1 Comment

Ernest Sergenti writes with a question regarding how to determine or quantify how many simulation iterations are "enough" to calculate quantities of interest, after burn-in has been achieved:

Radford Neal's blog

| 4 Comments

Radford's a leading researcher in statistical computing. He started a new blog. Radford writes:

Many of my technical posts will point out flaws in research, methods, and tools that are commonly used. Such negative comments are essential to the scientific enterprise, being the social counterpart of the crucial role of self-criticism in individual research. I find that one of the main challenges in supervising PhD students is getting them to constantly ask whether their results, or the results of others, might be wrong. The aim in research is to discard bad ideas quickly, and with minimal effort. For this a blog is much more efficient than formal publication.

I hope he posts some positive things too. I mean, our blog here could be all Kanazawa all the time, and that would be fun for awhile, but generally it's much more of a challenge to make a positive contribution than a negative contribution, so I hope Radford applies his blogging talents, as he applies his research talents, to this area.

P.S. Here's my favorite Radford Neal quote. It's from his Ph.D. thesis:

Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.

Extreme Programming

| 13 Comments

Is this stuff useful? (Link from John Cook.) I pretty much think that any idea in programming can also apply to statistics (not to mention the programming that goes into statistics).

Ben Hinchliffe writes,

In the paper "Tools for Bayesian Data Analysis in R", you mention the need for a flexible computing format that allows for the manipulation and summarization of simulations of a Bayesian probability model. I work for a company (Blue Reference, Inc.) that has developed a software environment that enables the creation of dynamic, interactive documents for reproducible research and teaching using Microsoft Office and R. After two-and-a-half years of development work, we are now ready to release Inference for R and focus on its implementation in reproducible research and dynamic document teaching practices.

For an introduction to Inference for R, visit our website at www.InferenceForR.com and view the 2-minute overview screencast.

For a sense of the scope of our Inference project, selectively view the collection of screencasts, postings, documents and screenshots at www.inference.us. For a hands-on assessment of the capabilities of Inference, download a copy of our release candidate at www.InferenceForR.com.

I don't know anything about these people but I thought it might interest some of you.

Alan Lenarcic sent along this. He writes, "The amount of strange Windows settings you have to set is a little daunting."

R is too strongly typed

| 6 Comments

I fit a multilevel model in R and called it M2, then innocently put together some coefficients in order to make a prediction:

a.hat <- fixef(M2)["(Intercept)"] + fixef(M2)["u.full"]*u + ranef(M2)$county

Then I tried

a.hat[26]

and got the following response:

Error in `[.data.frame`(a.hat, 26) : undefined columns selected

OK, OK, I had to go back and change the original line to:

a.hat <- fixef(M2)["(Intercept)"] + fixef(M2)["u.full"]*u + unlist (ranef(M2)$county)

This is a pain, because sometimes it's "as.vector," sometimes it's "as.numeric," sometimes it's something else. It's so hard sometimes to just access the data. R is so strongly typed now that I have to waste a lot of time simply extracting things from objects that I already have sitting in memory.

Sandra McBride writes:

My current model (in Winbugs) runs very slowly on my very slow laptop, so I am getting a new desktop. Here are some questions in the hopes of speeding up my model:

Is Winbugs or Openbugs multithreaded? (Then I'd buy a quad core rather than a duo core)

When using from within R, is Openbugs faster than Winbugs in general?

My reply: I don't know, but I've heard that JAGS is the fastest. I'm not sure if R2WinBUGS is set up to run JAGS, but if not, it could be done (right, Yu-Sung)? Also, I would think that in a parallelized implementation of R, it would be straightforward for R2WinBUGS to run 4 chains and send one to each of 4 processors; however, I don't know that this has actually been implemented.

In any case, I'm hoping that not too far in the future we'll have some a version of Bugs that is much faster, at least for the sorts of hierarchical regression models that Bugs currently chokes on.

Vector autoregression

| 2 Comments

Yefin Dain writes:

Here's the title/abstract for my talk at the R conference in August:

Many statistical methods of all sorts have tuning parameters. How can default settings for such parameters be chosen in a general-purpose computing environment such as R? We consider the example of prior distributions for logistic regression.

Logistic regression is an important statistical method in its own right and also is commonly used as a tool for classification and imputation. The standard implementation of logistic regression in R, glm(), uses maximum likelihood and breaks down under separation, a problem that occurs often enough in practice to be a serious concern. Bayesian methods can be used to regularize (stabilize) the estimates, but then the user must choose a prior distribution. We illustrate a new idea, the "weakly informative prior," and implement it in bayesglm(), a slight alteration of the existing R function. We also perform a cross-validation to compare the performance of different prior distributions using a corpus of datasets.

The title is "Bayesian generalized linear models and an appropriate default prior," and it's based on this paper with Aleks, Grazia, and Yu-Sung.

There are some changes in arm. Most of them are related to the big changes lme4 has made. If your arm version number is > arm_1.1.8, then you might want to read the followings (or if you are a reader of Data Analysis Using Regression and Multilevel/Hierarchical Models, you should read this):

This is another example of why defaults matter a lot.

I got an email of Evan Cooch forward by Matt, saying that there exists a trick to speed up R matrix caculation. He found that if we replace the default Rblas.dll in R with the proper one. It can boost R's speed in doing matrix caculation.

The file is here (This file only works under Windows). For Mac and Linux users, see here.

I just read an article on a faster Gibbs/slice sampler algorithm and it sparked the following thoughts:

1. Faster convergence is not just about speed, it's also about having confidence that your results are correct. The worst thing about not having a good algorithm is never knowing whether I've really fit the model I'm trying to fit. The next worst thing is that then I don't have the confidence to go to the next, more complicated, model I'd like to try.

2. The researchers run things for 10,000 iterations. In statistical simulation research, 10,000 is a small number of iterations. But in practice, 10,000 can take a long time. If the algorithm is as efficient as claimed, wouldn't 1000 be enough? This might sound silly but in big problems--or in routine analyses--you don't necessarily want to wait for 10,000.

In response to the comments on his forthcoming book, Paul Murrell writes:

After I remarked here on the notorious rudeness of one of the frequent posters to the R listserv, several commenters agreed with me (for example, Richard Morey wrote, "I've found that the R-help forums are legendary for the rude poster(s)"), but Nick Cox writes,

problems of clashing styles and expectations are generic to all technical lists that I've ever heard of that are not selective about membership. . . . The problem is a political question, not a technical one. The question is what to do when people, for whatever reason, do not follow the standards laid down for proper behaviour in a group, a discussion list, and one that they willingly join. . . . The solution being complained about is that some people -- usually "senior" people on a list with recognisable expertise -- are very firm in reminding posters of poor questions about the need to be much more precise about what their difficulty is, to read the documentation, etc. As this advice is very much part of the guidelines that people are asked to follow, it seems disingenuous, if not hypocritical, to complain when those people are trying their level best to maintain the standards of the list, exactly as advertised.

Cox's comments are interesting--and they suggest that when I and others think the R posters are particularly rude, we just don't have much experience with large listservs--but I actually want to take this in a different direction.

In my previous entry, I wrote, "I think it's ok for you to just post your questions and ignore that one [rude] person if he replies to you." Personally, I find rudeness from strangers unpleasant, even on a listserv, but I recognize that's just the way things are, and that's why I advise people to post to the list anyway.

Why is rudeness so upsetting, and how does it relate to altruism?

The more interesting question, perhaps, is why does this sort of rudeness hurt so much? Even though, as a logical matter, we should be able to ignore this rudeness, it actually hurts enough to dissuade people from posting.

The other aspect of listserv rudeness that intrigues me is that posting answers to a listserv is basically an act of altruism. People don't get paid to do it, they don't get academic credit, and in fact if they're rude they don't even get a lot of respect for it (at least, not as much respect as they might deserve, given their contributions). So, it's not enough of an answer to describe rude posters as assholes--if they were real assholes, they wouldn't be posting at all, right? They're sort of like those legendary caring-but-firm teachers who put a huge amount of effort into helping each individual student, but show it in this crusty, sarcastic, tough-guy fashion. I guess I can see, following Cox's argument above, that rude posters are serving the greater good by hurting people's feelings. I just think it's impressive that people would be so altruistic as to do this.

P.S. Posting on a blog is similar but less altruistic. For example, yes, I answer people's questions here but I also get a chance to promote my own work, which I don't see being done so much on the R listserv.

P.P.S. Given all the comments below, I fear I wasn't being clear enough in my own views here. I am being serious above, not sarcastic. Although certain listserv posters can be abrasive and even rude, I really do feel they are altruistic in providing all this free help on the list. Stylistic issues aside, I think that people who give help in this way are performing a valuable service, and I appreciate this, both for myself (on the occasions that I use the list) and on behalf of students and others whom I refer to the list.

Debugging

| 18 Comments

Reading Paul Murrell's new book made me think some more about debugging. Jennifer and I discuss debugging in Chapter 19 of our book. I'll give Paul's recommended steps for debugging and then my comments. Paul's advice:

Paul Murell's new book begins as follows:

The basic premise of this book is that scientists are required to perform many tasks with data other than statistical analyses. A lot of time and effort is usually invested in getting data ready for analysis: collecting the data, storing the data, transforming and subsetting the data, and transferring the data between different operating systems and applications.

Many scientists acquire data management skills in an ad hoc manner, as problems arise in practice. In most cases, skills are self-taught or passed down, guild-like, from master to apprentice. This book aims to provide a more structured and more complete introduction to the skills required for managing data.

This seems like a great idea, although it makes me think the title should say "data management" rather than "data technologies." Also, I hope that he clarifies that "data" does not simply mean "raw data." We often spend a lot of time working with structured data, in which the structuring (multilevel structures, missing data patterns, time series and spatial structure, etc.) is an important part of the data that is often obscured in traditional computer representations of data objects. As we've recently discussed, even something as simple as a variable constrained to lie in the range [0,1] is not usually stored as such on the computer.

I like the book a lot and think every statistician should have a copy. I have some comments on Section 2.3.4, on Debugging Code, which I'll place in a separate blog entry. For now, here are some little comments on various parts of the book:

The Applied Statistics Center at Columbia University invites applications for a post-doctoral fellowship focusing on missing data and Bayesian inference. Eligible candidates have received a PhD in Statistics or related field. We will also consider candidates who have received appropriate quantitative training while earning a PhD in the social, behavioral or health/medical sciences. The successful candidate will be responsible for helping to build an innovative multiple imputation program including development of new models and diagnostics. Accordingly some expertise in Bayesian statistics is necessary. Also, strong programming experience in R is a must and facility with C++ is strongly encouraged. The position will also involve working with social science and health research practitioners, so the ability to perform interdisciplinary work is necessary.

The principal investigators on this project are Andrew Gelman, Jennifer Hill, and Peter Messeri, and we're also working with Angela Aidala, Jane Waldfogel, Irv Garfinkel, and other researchers in social science and public health.

The Applied Statistics Center is an exciting research environment at Columbia involving many students, faculty, and postdocs in dozens of methodological and applied research projects.

The fellowship is for one year but may be extended by mutual agreement and contingent on funding considerations. Salary is commensurate with degree and experience.

Please submit the following materials electronically to Juli Simon Thomas: your letter of application, curriculum vitae, writing sample, programming sample, three letters of recommendation, and a two- to three-page research statement describing your research interests.

Backround

Some of our previous papers in this area include the following:

  • [2008] Diagnostics for multivariate imputations. {\em Applied Statistics}, to appear. (Kobi Abayomi, Andrew Gelman, and Marc Levy)

  • [2005] Multiple imputation for model checking: completed-data plots with missing and latent data. {\em Biometrics}. (Andrew Gelman, Iven Van Mechelen, Geert Verbecke, Daniel F. Heitjan, and Michel Meulders)

  • [2001] Using conditional distributions for missing-data imputation. Discussion of ``Conditionally specified distributions,'' by Arnold et al. {\em Statistical Science} {\bf 3}, 268--269. (Andrew Gelman and T. E. Raghunathan)

  • [1998] Not asked and not answered: multiple imputation for multiple surveys (with discussion). {\em Journal of the American Statistical Association} {\bf 93}, 846--874. (Andrew Gelman, Gary King, and Chuanhai Liu)

    Links have been fixed (thanks for pointing out, Jeremiah).

  • I mentioned this in class all the time this semester so I thought I should share it with the rest of you. The folk theorem is this: When you have computational problems, often there's a problem with your model. I think this could be phrased more pithily--I'm not so good in the pithiness department--but in any case it's been true in my experience.

    Also relevant to the discussion is this paper from 2004 on parameterization and Bayesian modeling, which makes a related point:

    Progress in statistical computation often leads to advances in statistical modeling. For example, it is surprisingly common that an existing model is reparameterized, solely for computational purposes, but then this new conŽ guration motivates a new family of models that is useful in applied statistics. One reason why this phenomenon may not have been noticed in statistics is that reparameterizations do not change the likelihood. In a Bayesian framework, however, a transformation of parameters typically suggests a new family of prior distributions.

    I was thinking more about axes that extend beyond the possible range of the data, and I realized that it's not simply an issue of software defaults but something more important, and interesting, which is the way in which graphics objects are stored on the computer.

    R (and its predecessor, S) is designed to be an environment for data analysis, and its graphics functions are focused on plotting data points. If you're just plotting a bunch of points, with no other information, then it makes sense to extend the axes beyond the extremes of the data, so that all the points are visible. But then, if you want, you can specify limits to the graphing range (for example, in R, xlim=c(0,1), ylim=c(0,1)). The defaults for these limits are the range of the data.

    What R doesn't allow, though, are logical limits: the idea that the space of the underlying distribution is constrained. Some variables have no constraints, others are restricted to be nonnegative, others fall between 0 and 1, others are integers, and so forth. R (and, as far as I know, other graphics packages) just treats data as lists of numbers. You also see this problem with discrete variables; for example when R is making a histogram of a variable that takes on the values 1, 2, 3, 4, 5, it doesn't know to set up the bins at the correct places, instead setting up bins from 0 to 1, 1 to 2, 2 to 3, etc., making it nearly impossible to read sometimes.

    What I think would be better is for every data object to have a "type" attached: the type could be integer, nonnegative integer, positive integer, continuous, nonnegative continuous, binary, discrete with bounded range, discrete with specified labels, unordered discrete, continuous between 0 and 1, etc. If the type is not specified (i.e., NULL), it could default to unconstrained continuous (thus reproducing what's in R already). Graphics functions could then be free to use the type; for example, if a variable is constrained, one of the plotting options (perhaps the default, perhaps not) would be to have the constraints specify the plotting range.

    Lots of other benefits would flow from this, I think, and that's why we're doing this in "mi" and "autograph". But the basic idea is not limited to any particular application; it's a larger point that data are not just a bunch of numbers; they come with structure.

    Longhai Li did a really cool Ph.D. thesis (under the supervision of Radford Neal) on computing for models with deep interactions. The website containing all stuff about this software, including
    the R packages, documentations and references, is here and here. Here's a quick description (from the website):

    This R package is used in two situations. The first is to predict the next outcome based on the previous states of a discrete sequence. The second is to classify a discrete response based on a number of discrete covariates. In both situations, we use Bayesian logistic regression models that consider the high-order interactions. The time arising from using high-order interactions is reduced greatly by our compression technique that represents a group of original parameters as a single one in MCMC step. In this version, we use log-normal prior for the hyperparameters. When it is used for the second situation --- classification, we consider the full set of interaction patterns up to a specified order.

    And here's the research paper (by Longhai and Radford). I wonder if they've achieved some of my goals in wanting weakly informative priors for models with interactions. That Cauchy thing rings a bell.

    P.S. to Longhai: I don't recommend keeping your software in two places. Won't it be a pain to keep both sites up-to-date? Or maybe it's done automatically, I don't know.

    Databases and R

    | 15 Comments

    Alex Reed writes,

    What's out there? I have a few desires:

    1. A speech-oriented statistics package--a front-end to something like Stata or R with voice commands and spoken output. For example:

    User: Regress income on height and sex.
    Computer: [repeats, to make sure no misunderstanding] Regress income on height and sex.
    User: Yes
    Computer: There is no "income" variable
    U: What variables do we have?
    C: height, sex, weight, occupation, earnings, age---
    U: [interrupts] Set y to earnings
    C: Set y to earnings
    U: Yes
    C: Regression of income on height and sex. The intercept is 3.4 with a standard error of 1.2. The slope for height is . . .
    U: Add the interaction of height and sex
    C: Add the interaction of height and sex
    U: Yes
    C: Regression of income on height, sex, and height times sex. The intercept is . . .

    It would be good to have lots of functions here, but I imagine we could start with regressions and simple statistics and then see what else is useful.

    2. A statistical graphics program that uses touch and sound to convey information. For a scatterplot or two-dimensional intensity graph could be conveyed with a setup where as you move a mouse (or a pen, or your hand) over a pad, the computer makes louder sounds where there are more data. I'm thinking of something that sounds like rain, with individual drops for single data points and various sounds of heavy rain or rushing water where there are lots of data.

    I'm sure lots more could be done here, for example using some combinations of pitch, timing, chirps, etc., to convey different patterns in data.

    Does anyone know what's out there? A quick web search yields this for SPSS and this, which claims to let you hear images, and this screen reader. But what I think we should really be doing is creating some software that is so cool that sighted people will want to use it too.

    Starbucks/Walmart update

    | 28 Comments

    Alex F. commented here about problems with our Starbucks and Walmart data. Elizabeth Kaplan, who collected the data for me, replied:

    Yeah Walmart was a bit of a pain to find the locations for as you can not search just by state on their website, like for Starbucks. In order to find the locations I relied on the yellow page results. Even though I looked through to eliminate double postings for walmarts with the same address, after I looked into it again tonight, it appears the yellow pages dramatically over represented the number of walmarts per state. I have attached the correct data. All of these numbers come from this website (http://www.walmartfacts.com/StateByState/) which I was unable to locate before.

    As far as the data for starbucks that should be correct as I got it straight from their website. The one thing is that they don't list all affiliate stores (that is stores not own and operated by the company). There is no reliable source of data on affiliate stores by state, and obviously the yellow pages are not a good source. So the data I sent to you just includes Starbucks owned and operated stores.

    Also for population I used the 2006 Census Bureau estimates.

    This sort of thing happens all the time to me, so I certainly don't think Elizabeth should feel too bad about this. I'm just glad that Alex noticed and pointed out the problem. Anyway, here are the corrected maps:

    consumer1a.png

    consumer2a.png

    and scatterplot:

    consumer3a.png

    And also, following Seth's suggestion, the scatterplot on the log scale:


    consumer5a.png

    And, following Kaiser's suggestion, a reparameterization showing people per store (rather than stores per million people):


    consumer4a.png

    WinEdt tip

    | No Comments

    Masanao writes,

    In WinEdt, if you just highlight the lines you want to indent and press tab it automatically indents the block for you. If the space is set to 4 as it is by default, you can go to options->preference->tab and set the number to 2.

    Hey, thanks!

    Michael Braun writes,

    My particular problem involves a Dirichlet process mixture, but I think that the issue I describe also applies to finite mixtures where label switching can be a problem.

    I am trying to construct a reasonably "close to optimal" adaptive scaling algorithm for a random walk Metropolis update, within a Gibbs sampler, for a model that includes a Dirichlet process mixture.

    Bill DuMouchel wrote:

    I recently came across your paper, "A default prior distribution for logistic and other regression models," where you suggest the student-t as a prior for the coefficients. My application involves drug safety data and very many predictors (hundreds or thousands of drugs might be associated with an adverse event in a database). Rather than a very weakly informative prior, I would prefer to select the t-distribution scale parameter (call it tau) to shrink the coefficients toward 0 (or toward some other value in a fancier model) as much as can be justified by the data. So I want to fit a simple hierarchical model where tau is estimated. Is there an easy modification of your algorithm to adjust tau at every iteration and to ensure convergence to the MLE of tau (or maximum posterior estimate if we add a prior for tau)? And do you know of any arguments for why regularization by cross-validation would really be any better than fitting tau by a hierarchical model, especially if the goal is parameter interpretation rather than pure prediction?

    I replied:

    We also have a hierarchical version that does what you want, except that the distribution for the coeffs is normal rather than t. (I couldn't figure out how to get the EM working for a hierarchical t model. The point is that the EM for the t model uses the formulation of a t as a mixture of normals, i.e., it's essentially already a hierarchical normal.)

    We're still debugging the hierarchical version, hope to have something publicly available (as an R package) soon.

    Regarding your qu about cross-validation, yes, I think a hierarchical model would be better The point of the cross-validation in our paper was to evaluate priors for unvarying parameters which would not be modeled hierarchically.

    Bill then wrote:

    I did have my heart set on a hierarchical model for t rather than normal, because I wanted to avoid over shrinking very large coefficients while still "tuning" the prior scale parameter to the data empirically. (Although my worry about over shrinking might be less urgent if I use prior information to create "batches" that can have their own centers of shrinkage, as in your in-progress hierarchical bayesglm program.)

    Lee Edlefsen and I [Bill D.] are working on a drug adverse events dataset with about 3 million rows and three thousand predictors, using logistic regression and some extensions of LR, and with thousands of different response events to fit. Plus the potential non repeatability of MCMC results would be a real turnoff for the FDA regulators and pharma industry researchers.

    An EM question

    I have a question for Chuanhai or Xiao-Li or someone like that: is it possible to do EM with two levels of latent variables in the model? In the usual formulation, there are data y, latent parameters z, and hyperparameters theta, and EM gives you the maximum likelihood (or posterior mode) estimate of theta, conditional on y and averaging over z. This can commonly be done fairly easily because z commonly has (or can be approximated with) a simple distribution given y and theta. This scenario describes regression with fixed Student-t priors, or regression with normal priors with unknown mean and variance.

    But what about regression with t priors with unknown center and scale? There are now two levels of latent variables. Can an EM, or approximate EM, be constructed here? As Bill and I discussed in our emails, Gibbs is great, and it's much easier to set up and program than EM, but it's harder to debug. There's something nice about a deterministic algorithm, especially if it's built with bells and whistles that go off when something goes wrong.

    I had various course titles floating around: my course at Columbia this spring is officially called Applied Statistics, and I had promised people that it would cover Bayesian statistics. At Harvard they asked me to teach Statistical Computing, but I wanted to focus on applied Bayesian methods. So I'm putting it all together in the title given above.

    If you're interested in taking the class, let me know if you have any questions or just show up to the first few lectures; it's Wed Fri 9:00-10:30 at Columbia (if you're in New York), or Mon 11:30-2:30 (if you're in Boston).

    Motivation:

    Statistical computing is to statistics as statistics is to science: necessary but a distraction from the main event. I hate computing, yet I do it all the time. For those of us in this position, it makes sense to spend a bit more time thinking harder about how to compute efficiently. Learning statistical computation is an investment in becoming a more effective practitioner and researcher.

    Overview:

    We will cover topics in Bayesian computation, statistical graphics, and software validation, as well as special topics that interest the class.

    There will be some homework (writing programs and making graphs in R) and a final project to be done in pairs.

    The (tentative) syllabus is below.

    Version numbers

    | 1 Comment

    As regular readers know, I'm a big fan of Doug Bates's lmer() function, which comes with his lme package in R. Anyway, I had an exchange with another lmer() user which included the following funny bit:

    Using the version from CRAN (lme 0.99875-9) everything works; however using lme4_0.999375-0 it does not! I was running lme4_0.999375-0...

    I'm thinking that maybe it's time to go to version 1.0 already!

    I wanted to calculate numerical derivatives and I found the numDeriv package in R and loaded it. It seems like a pretty serious effort and seems to work fine. I just have one problem . . .

    This should be of interest to people who run Gibbs samplers, Metropolis algorithms, whatever. (Including if you just run them in Bugs, even if you don't program them yourself.)

    Why does R indent 4 spaces with its functions? Indenting 2 is much easier to read. I find functions in R packages to be really hard to read because they space things out so far. Compare this (from R2WinBUGS):

    varpostvar <- max(0, (((n - 1)^2) * varW + (1 + 1/m)^2 * varB + 2 * (n - 1) * (1 + 1/m) * covWB)/n^2)
    to the line from the original function before it got put into a package:
    varpostvar <- (((n-1)^2)*varW + (1+1/m)^2*varB + 2*(n-1)*(1+1/m)*covWB)/n^2
    Here's another. From R2WInBUGS:
    covWB <- (n/m) * (var(s2, xdot^2) - 2 * muhat * var(s2, xdot))
    From the original function:
    covWB <- (n/m)*(cov(s2,xdot^2) - 2*muhat*cov(s2,xdot))
    Is all that spacing really necessary??? It gets kinda ridiculous when single-line statements get broken into two lines.

    Which reminds me . . .

    When I write if() statements or loops, I always put in the brackets {}, even if there's only one line inside the condition. It just makes the function easier to follow, also helps avoid errors if I change the function later. Why is the convention in R packages to not include the {}? I can see this being an issue inside nested loops where speed is a concern, but I see this everywhere.

    Aleks points to two papers that might be relevant for our statistical computing efforts: state-of-the-art optimization for laplace-prior logreg and bayesian regularization vs leave-one-out.

    Just as a reminder, here's our project that relates to the above topics:

    log-.png

    Advice on debugging your code

    | No Comments

    I received the following email:

    Hello, Dr. Gelman,

    I am working on person-fit index for reading comprehension test using PPMC. I am pretty new for Bayesian, WinBUGS, and PPMC. . . .Could you take a look at my WinBUGS code for me? I calculated likelihood for posterior and predictive posterior and PPP-value . . .

    Good question. I don't have time to look at people's code but I responded that you can check things by simulating fake data from the model and then checking that the estimated parameters are close to the true values; see Section 8.3 of my recent book with Hill. I'm usually too lazy to do this fake-data checking, but when I get around to it, it often is helpful.

    Sam and Don and I wrote a paper with a systematic approach to fake-data checking. Often, though, just simulating one fake dataset and fitting the model is enough to reveal problems.

    Exploratory data analysis course

    | No Comments

    Hadley Wickham writes to Jouni and me:

    I have just been reading your Stat Computing paper on your rv package. It's a nice paper, but you seem to be unaware of the very related ideas coming out of functional programming community - google for "probability monads" to get started. The notation and target uses are quite different, but the ideas are interesting, and there are some hints on how to compile statements to be more computationally efficient.

    I also wonder if you have thought about applying these techniques to classical statistical procedures.

    MCMC starting values

    | No Comments

    Jun Xiang writes,

    I am using your R2WinBUGS to estimate some hierarchical logistic regression and have a question about how to set initial values. First, when I use some uninformative initial values the model has some convergence problem (Rhat indicates non-convergence after 10,000 iterations). Next, I use MLE as the initial values and the model has better convergence results. I want to know if the latter way is right since each chain in this case gets the same initial values.

    My reply: Yes, you can have problems if your starting values are too wacky. Basically, the usual noninformative prior distributions contain parameter values that are so extreme (e.g., theta=10^4) that there can be convergence issues. In principle it would be best to use more reasonable prior distributions, but in practice it often works fine if you pick starting values that are less extreme. This is discussed a bit in the Gelman and Hill book in the discussion of using Bugs. In addition, it makes sense to parameterize reasonably (for example, scaling predictors) so that you won't get coefficients such as 10^4 or 10^-4. We discuss the scaling issue in the book also.

    Make R beep

    | 5 Comments

    Yu Sung says: Just type alarm() or cat("\a").

    Many of you buy and rank books, movies on the web, you click on links, bookmark them, blog about them. By doing this, you are leaving traces behind. The traces are of great help to those who will find themselves in the same situation as you. Personalization technology tries to help you navigate the choices using the actions of people who were there before you, and with the the implicit (clicks or purchases you've made) or explicit (preferences you've expressed) knowledge about yourself.

    Greg Linden's blog is an excellent source of insightful posts on personalization technology. A while ago he posted a link to a collection of material from KDD about the Netflix Prize: a challenge where one has to predict how much you will like a particular movie based on your history of movies you've seen and based on others' ratings of movies they've seen.

    What's notable is that some of the current competition leaders have written extensive papers about their approach. BellKor's approach is quite simple and combines nearest-neighbor ideas with a more global factor model. On the other hand, Gravity employs a diverse collection of tools, including matrix factorization, neural networks, nearest neighbor models and clustering. The Gravity team provides an interesting picture of their factor model for movie Constantine:

    constantine.png

    They assume adjacent factors to be correlated, they infer this matrix purely from the ratings data, and they named some of the factors manually in the end. Compare their annotations with the hand-designed list of genres (Action / Drama / Fantasy / Horror / Thriller) or keywords (Androgyny / Twin / Black Cat / Christian Horror / Electrocution / ...) assigned to the movie by human editors. Many of these keywords might rarely be relevant for determining whether the movie is worth seeing or not.

    In recent weeks, the two currently leading groups (judging from the leaderboard) were formed by fusing the AT&T Labs group BellKor with another researcher into KorBell, and the Hungarian BME group with the Princeton into Gravity-Dinosaurs) have consolidated their efforts: as it is becoming clear that several models are better than just one. I am not sure either group used background knowledge that can be obtained by crawling the web or using other databases. At the moment, both groups are at around 8.4% improvement over baseline. They need 10% improvement over baseline to win the $1e6 grand prize, and a year ago the improvement was 4.5%.

    This is just another piece of support for the Epicurus' principle of multiple explanations ("Keep all hypotheses that are consistent with the facts!", often used by proponents of algorithmic probability as a summary of a longer statement from his letter to Herodotus, "If then we think that an event could happen in one or other particular way out of several, we shall be as tranquil when we recognize that it actually comes about in more ways than one as if we knew that it happens in this particular way.") leads to superior practical results. This corresponds to the Bayesian practice of integrating over all possible parameter settings (but assigning them weights in proportion to their predictive power - likelihood - and in proportion to our a priori trust in them - prior) instead of picking just the best one. To the famous statement "All models are wrong, but some are useful." we should add "but use them all!"

    [Correction 10/12:] Chris Volinsky kindly corrected my statement that BellKor combined their predictions with others groups (although other groups offered this). They did not do that, and they argue against this approach. They did, however, combine researchers Bell, Koren with researcher Volinsky into the KorBell approach.

    David Afshartous writes,

    What is the difference between sim() and mcsamp(), if any? If there is a difference, what should I expect in terms of difference in results?

    My reply:

    I was corresponding with a friend about R being great but having some problems (see here). My friend wrote,

    I have this hypothesis, though, that ugly/improper/hodgepodge languages will always bury pretty/proper/curated languages. Ugly languages get things done and attract those who are looking to get things done. These adoptees modify the language to do new things and so the ugly languages quickly evolve to handle cutting-edge problems with little, albeit inefficient, code. We've seen this happen with PHP, which is very inelegant, but is the most useful and widely-used language for web development. Unless you are a professional programmer (working on a team, maintaining code, concerned with execution speed) uglier languages are better for almost every project.

    I think it has something to do with the way human languages evolve. When a langauge accepts bottom-up adaptations (from the users) it will handle new topics and new problems more efficiently than when it need to wait for top-down approval of such adaptations. For programming languages, this means that the bottom-up adapted language will be less thought out, but in a greater state of readiness for new problems.

    I agree that, for my purposes, R is usually better than the alternative. I use R almost exclusively. But I don't necessarily agree with the categorization "ugly/improper/hodgepodge." In particular, some of the difficulties with R arise because of efforts to make it "pretty"--I'm thinking particularly of the S4 structure but also of the pain-in-the-butt exception handling and other "paperwork" that take up 90% of the space in functions such as glm().

    P.S. I removed the discussions of English and other human languages since they were a distraction from the main point which was about R.

    Running Bugs under Linux

    | 1 Comment

    I don't know anything about this but it could be useful, I think. From the Bugs mailing list, David Stivers writes:

    Data overload

    | 8 Comments

    Adam Kramer, a student in social psychology at the University of Oregon, writes,

    Yu-Sung writes,

    Style guide for R code

    | 8 Comments

    A colleague writes,

    Now that I work for a large software company, my computer code is heavily scrutinized: there are style guides, rules for indenting, conventions for variable naming, etc. I've come around to it--it really does make my code a lot better. blah blah blah. But my company doesn't really have formal rules for R because hardly any engineers use it. I'm working on writing said rules, and it seems incomplete not to include something about writing efficient code: avoiding nested loops, etc. Do you know of any good references on how to write good R code?

    My reply:

    1. I'm curious what the R experts say. My own stylistic preferences differ slightly from what appears to be the default in R packages: in particular, I like to indent 2 characters, but R seems to indent a lot more, which to my taste makes the code hard to read in a text editor. (My own stylistic preferences can be deduced from the examples of R code in my books.)

    2. I know that there is general advice to avoid global variables--it's better to pass information in function arguments. When writing Umacs we found this to be awkward and so we used global variables instead, but since then somebody explained to me how to do it all using local variables (without requiring a huge effort in passing lists of variables). Unfortunately I can't remember now how I was going to do it. Maybe the idea was to put all the arguments in a list.

    3. I personally like to fill up arrays with NA's when I set them up, so that if something goes wrong, I'll get lots of NA's in the result, and I can track back where the problem is.

    4. I think R has a debugger but I've never used it. I probably should.

    5. As you know, I've been moving toward the idea of simulating fake data for every problem as a test of the algorithm and code. I call this the self-cleaning oven principle: a good package should contain the means of its own testing. We haven't yet done this with "arm" but we should.

    6. I agree about avoiding nested loops--when it causes programs to be slow. On the other hand, sometimes a matrix implementation can just be mysterious, and I find it helpful to spell things out with loops. (Again, we discuss this in our book--we even have a footnote or two explaining why we have some loops.)

    7. I like to follow the general principle that lines of code should (almost) never be repeated. I'm always seeing students write scripts with cut and pasted code, and I'm always telling them to use a function and a loop instead.

    8. A silly little thing: with if () statements, I recommend always using braces (curly brackets), even if the conditional command is just one line. If you or someone else wants to modify a function, it's much easier to do so if the braces are already there.

    9. R functions are getting uglier and uglier. I'd say the typical R function is 90% "paperwork" (exception handling, passing of names, etc) and only 10% "meat" (to mix analogies). I attribute some of this to the S4 system of objects with sockets etc. For one thing, it's typically no longer possible to see what a function does by typing its name. I don't know what to say here, except to recommend not drinking the Kool-Aid: maybe you can try to keep your functions clean rather than putting all the effort into the paperwork. (Unfortunately, we didn't really follow this advice with bayesglm: we made the mistake of adapting the existing glm function.)

    10. Scalability is a big issue in R. Ideally any new function would be accompanied by a statement explaining how it scales as the inputs increase in size.

    11. When summarizing the results of your output, I recommend working with "display()" (from the arm package) rather than "summary()". The summary() function always seems to give a lot of crap, and we've tried to be cleaner and more focused with display(). One option is to set up functions for both so that users can typically use display(), with some extra information in summary().

    12. It's a good idea to graph inferences. Graphs aren't just for raw data.

    This seems like too much advice; maybe some of the above rules are unnecessary or can be written more generally.

    In any case, if you're writing guidelines, I recommend giving examples of the recommended approach and also the bad approach for each rule.

    Perhaps others have suggestions too (or comments on my ideas)? Once you've written your guidelines, I hope you can publish them, with discussion, in a statistics journal so all can see. There may already be some R style guide that you can adapt and react to.

    Dragomir Radev is visiting from the University of Michigan and teaching this cool course, which I highly recommend to our computationally-minded statistics students:

    Multilevel modeling in Stata

    | 1 Comment

    Roberto Gutierrez sends along this presentation describing new multilevel modeling options in Stata 10. There's some interesting discussion of computation time. Also, when they introduce the models, I'd suggest looking at the 5 different ways of writing a multilevel model (as in Chapter 12 of our book).

    Yph writes,

    They started me out on SPSS, then quickly moved me to Stata. Now I'm learning R. Do you think R is it? Will I have to learn a new programming language in the future? I get apprehensive about investing time learning new technology when the turnover rate of programming language seems so high.

    My reply:

    Generic advice for Bugs bugs

    | 1 Comment

    Peter Park writes,

    Confidence building

    | 1 Comment

    Confidence-building is an under-researched area in statistics. Some pieces of confidence-building:

    Tom Moertel writes,

    I wanted to let you know that I have packaged the CRAN "arm" package and its prerequisite R packages for Fedora Linux, making it easy to install using Fedora's integrated package-management system. Normally, installing "arm" is somewhat tricky because of the dependency upon on R2WinBUGS, which requires BRugs, which doesn't currently build on Linux. (I got around this problem by making R2WinBUGS not require BRugs, which is the best work-around I could come up with. The resulting functionality is therefore incomplete, but at least Linux users get the functionality that is possible on their platform rather than a failed installation.)

    I don't know how many of your students or readers use Fedora Linux, but if any do, feel free to point them to my packages: http://blog.moertel.com/articles/2007/04/25/new-fedora-core-rpms-for-cran-packages

    Just a warning: we've been updating "arm" occasionally, mostly improvements to bayesglm() and fixing the calls to lmer().

    Umacs at last

    | 2 Comments

    Jouni reports:

    Dear all,

    I'm sending this because you've been interested in my R packages 'rv' (simulation-based random variable class) and/or 'Umacs' (Universal Markov chain sampler).

    The latest version of rv and the first "public beta" of Umacs are now available at CRAN (cran.r-project.org)

    Please feel free to try them out!

    http://cran.r-project.org/src/contrib/Descriptions/rv.html
    http://cran.r-project.org/src/contrib/Descriptions/Umacs.html

    The "rv" package is based on Jouni's thesis on fully Bayesian computing. The idea is to be able to manipulate random variable objects without having to get lost in the subscripting of simulations. (See here.)

    Umacs is an automated way to program Bayesian computations (Metropolis algorithm and Gibbs sampler). It is roughly based on the function-based approach of the R programs in Appendix C of Bayesian Data Analysis.

    Bill Harris writes,

    R book

    | 12 Comments

    John Kastellec points us to this book by Michael Crawley. It reminds me of what my old college professor wrote to me 12 years ago when I sent her a copy of my (then-)new book: "Thanks for sending this. A lot of people are writing books nowadays." (I have no comments on Crawley's book, having seen nothing but the link to the webpage.)

    P.S. Somebody just said to me today, "Do you ever use R?" I said yes, and he said that he thought that applied people such as himself used R, but that the real statisticians used things like SAS. Grrrr.... I told him that SAS is said to have advantages with large datasets but isn't so great at model-fitting and graphics.

    Recent Comments

    • Asa: Thanks everyone. I figured out a pretty solid solution to read more
    • Stuart Buck: Is it that medical schools are trying to screen out read more
    • Jacob: BTW, in no way I am putting down R. R read more
    • Jacob: Nick, Of course, my comment on MATLAB's popularity is based read more
    • Steven: http://www.cockeyed.com/science/gallon/liquid.html See for more info read more
    • Andrew Gelman: Jonathan: You are giving the conventional definition of risk aversion read more
    • Jonathan: As an economist who does his work with "the public," read more
    • BrendanH: I'll second the lme4/R recommendation, on the grounds that it read more
    • Chris Brew: In Linguistics Ohio State invites people to on-site recruiting visits read more
    • PalMD: Independent of the evidence, med schools and residencies require very read more
    • Eliot: I agree with Elizabeth and Jeremy -- Americans and Europeans read more
    • ssendam: One of the irritating things about economics is that it read more
    • wei: probabilities should be conditional upon all possible paragraphs to express read more
    • Jonathan: It's far from surprising that a guy who doesn't believe read more
    • Jeremy Miles: I'd never thought of it like that - but risk read more
    • Elizabeth: You could just describe the same phenomena by saying that read more
    • manuelg: Yes, I think I understand what you are saying. Converting read more
    • derek: How meaningful is it for an American to talk about read more
    • Andrew Gelman: Thorfinn: I don't actually think the Type 1 and Type read more
    • jonathan: My father gave medical boards, did some admissions, taught in read more