Yingnian’s ideas on teaching probabilistic simulation

Yingnian writes,

I accumulated quite a number of intuitions about Monte Carlo in particular, and statistics in general, for teaching purpose. I think they can be made obvious to elementary school children.

MCMC can be visualized by a population of say 1 million people immigrating from state to state (e.g., 51 states on US territory) where transition prob is like the fraction of people in California who will move to Texas the next day. So p^{(t)} is a movie of population distribution over time. If all 1 million people start from California, the population will spread over all 51 states, and stabilize at a distribution which is the stationary distribution.

Metropolis-Hastings is just a treaty between the visa offices of any two countries x and y. It can be any c(x, y) = c(y, x), and the fraction of people who get the visa is c(x, y)/pi(x)T(x, y), since pi(x)T(x, y) is the number of people who go to apply for visa. To choose c(x, y) = min(pi(x)T(x, y), pi(y)T(y, x)) is just to maximize the immigration flow. Here the detailed balance means there is a balance between any two states, like China and US. There are more Chinese applying visa to US than vice versa, so we can let all US people go to China, but only accept a fraction of Chinese people to go to US (I just use this example for the benefit of intuition, without any disrespect for US or Chinese people). This is the acceptance probability.

Gibbs sampling can also be easily visualized. Consider a two-dimensional density f(x, y). It is a surface. Consider a billion points (x, y, z) uniformly distributed under this surface, i.e., z < f(x, y). Then the (x, y) coordinates (say, altitude and latitude) of all these 1 bullion points have a distribution f(x, y). Gibbs sampling consists of two steps. One is [X|Y]. This is nothing but randomly relocate each point (x, y, z) to another location that is in the slice with the same latitude. If f(x, y) is the surface of a cake, you can see the slice by cutting the cake at y. Clearly, this random re-shuffling does not change the uniform distribution of the points. The other step [Y|X] is to randomly relocate these points within the slice of the same altitudes. This won’t change the uniform distribution of the points either. So f(x, y) is the stationary distribution. The rate of convergence is about diversity of where people come from. Let T(x, y) be the transition matrix, then the diversity matrix B(y, x) is just the backward transition, i.e., Prob(X_t=x|X_t+1 = y) in stationarity. A fast MCMC should have all their states like US, immigration countries. The multi-modality or bottleneck does not encourage the diversity. Like in ancient times, different continents are local modes, and the rate of converge is slow. ------- About rejection sampling: We can think about a density curve g(x), or any curve cg(x) for a constant c. We can look at the region under this curve. If we randomly sample a point in this region, then the x-coordinate (the street address) follows g(x). Suppose cg(x) envelopes f(x). Then f(x) divides the region under cg(x) into downtown (area below f(x)) and uptown (area above f(x)). So the rejection sampling is just to randomly sample a point, and only accept it if it is in downtown. This amounts to randomly sampling a point in downtown. So the distribution of the x-coordinate is f(x).