Manipulating and summarizing posterior simulations using random variable objects

Hadley Wickham writes to Jouni and me:

I have just been reading your Stat Computing paper on your rv package. It's a nice paper, but you seem to be unaware of the very related ideas coming out of the functional programming community; google "probability monads" to get started. The notation and target uses are quite different, but the ideas are interesting, and there are some hints on how to compile statements to be more computationally efficient.

I also wonder if you have thought about applying these techniques to classical statistical procedures.

It seems like the following would be pedagogically useful:

x <- 1:10
y <- x + rvnorm(mean = 0, var = 1)
model <- lm(y ~ x)
coef(model)     # to see the distributions of the estimates
predict(model)
resid(model)    # etc.

and then you could explore what happens to those distributions when assumptions are violated:

y <- x + rvpois(lambda = 1)
coef(lm(y ~ x))

I could also imagine this being useful in a theory class, where you could use these computationally derived distributions to check/compare with those derived analytically. (A minor note: in R, apply is no more efficient than a for loop.)
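Hadley's lm call is a wish-list item, since lm is not rv-aware; but one can get close today by fitting the model to each simulated data set by hand. Here is a rough sketch using rv's sims() and rvsims() (the setup mirrors Hadley's example; the loop over the simulation matrix is a workaround, not package functionality):

library(rv)
setnsims(2500)    # number of simulations to carry around

x <- 1:10
y <- x + rvnorm(n = 10, mean = 0, sd = 1)    # rv vector of length 10

# fit lm to each row of the simulation matrix and collect the
# coefficients back into a random vector of length 2
coefs <- rvsims(t(apply(sims(y), 1, function(ysim) coef(lm(ysim ~ x)))))
coefs    # approximate distributions of the intercept and slope estimates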

Jouni responds,

Thanks for the reference to probability monads. I am wondering whether these ideas can be used in R programming at all, unless R has this logic built in.

Yes, I've thought of making lm and t.test rv-aware, and pbeta, pnorm, etc. as well, so that, for example, I could easily compute the operating characteristics of decision rules based on beta, normal, and other posterior probabilities.
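To give a flavor of what Jouni means, here is a rough sketch along those lines using the current rv package. The model, the flat prior, and the 0.95 cutoff are made up for illustration, and pnorm is applied to the simulations by hand via sims() and rvsims(), since pnorm is not yet rv-aware:

library(rv)
setnsims(4000)

# made-up sampling model: ybar ~ N(theta, 0.2^2), with true theta = 0.5
ybar <- rvnorm(mean = 0.5, sd = 0.2)

# with a flat prior on theta, the posterior probability that theta > 0
# is pnorm(ybar / 0.2); apply pnorm to the simulations by hand
post.prob <- rvsims(pnorm(sims(ybar) / 0.2))

# operating characteristic: how often the rule "declare theta > 0 when
# the posterior probability exceeds 0.95" fires at this value of theta
Pr(post.prob > 0.95)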

and Hadley concludes with,

Yes, the monad ideas would be difficult to implement in R as it doesn’t have most of the infrastructure that you would need. In terms of making more R functionality rv-aware, I wonder if you could bootstrap (pardon the pun) off of the bootstrapping packages in R. There must be existing code for bootstrapping linear models (etc.) and reporting summaries of the coefficients. It seems like it would be productive to separate out the generation of the random sample (via bootstrapping or rv or other method) from the fitting and summary of results.
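The separation Hadley describes is easy to sketch. In the code below, simulate_fit and boot_draw are hypothetical names, not functions from any package, and the bootstrap is the simplest nonparametric one:

# separate the generation of random data sets (draw) from the model
# fit (fit) and the summary of interest (summarize)
simulate_fit <- function(draw, fit, summarize, n = 1000) {
  t(replicate(n, summarize(fit(draw()))))
}

# example: bootstrap the coefficients of a linear model
d <- data.frame(x = 1:10)
d$y <- d$x + rnorm(10)
boot_draw <- function() d[sample(nrow(d), replace = TRUE), ]
coefs <- simulate_fit(boot_draw, function(dat) lm(y ~ x, data = dat), coef)

# summarize the bootstrap distribution directly...
apply(coefs, 2, quantile, probs = c(0.025, 0.5, 0.975))

# ...or hand the same draws to rv as a random vector
library(rv)
rvsims(coefs)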

I have nothing to add, for now. I look forward to the day when we can manipulate random variables on the computer as easily as we now can work with vectors.

P.S. Hadley writes: You might also be interested in the paper "Generative Code Specialisation for High-Performance Monte-Carlo Simulations." I am not aware of any similar work in statistics.