Anova: why it is more important than ever

Kaiser writes with a question that comes up a lot, on getting a good baseline comparison of variation:

I [Kaiser] have been thinking about changing how I estimate variance in analyzing randomized experiments and want to ask if you know of any reference for what I’m hoping to do. Imagine that the data consists of two groups of customers receiving either treatment A or B. Randomization is done daily, the experiment is run for 8 weeks, and the two groups are compared on some metric (a proportion).

The typical way to analyze this is to run a z-test for comparison of proportions. This means that the underlying variation is modeled using a normal distribution (approx. to binomial).
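For concreteness, here is a minimal sketch of that standard analysis in Python; the function name and the counts are purely illustrative, not Kaiser’s actual data:

```python
import numpy as np
from scipy import stats

def two_prop_ztest(x_a, n_a, x_b, n_b):
    # Pooled proportion under the null of no difference between A and B.
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (x_a / n_a - x_b / n_b) / se
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided, normal approx. to binomial
    return z, p_value

# Hypothetical totals over the 8 weeks:
z, p = two_prop_ztest(x_a=5200, n_a=100_000, x_b=5350, n_b=100_000)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```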

The problem I encounter is due to the large sample size. The typical method becomes useless when n is large because the standard error becomes too small even though the underlying variability is still there. To illustrate this point, if I break up the data into daily cohorts and then look at the variability of the metric over time, it is clear that the underlying variability is quite high. (I think the trouble is that increasing n does not remove variability due to human behavior, etc., and it is a mistake to think that large n can approach the “population”.)
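A small simulation can show the mismatch being described here: with tens of thousands of observations the pooled binomial standard error is tiny, while the spread of the metric across daily cohorts stays large. The rates and cohort sizes below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2024)

# 8 weeks of daily cohorts; the daily "true" rate drifts from day to day.
n_days, n_per_day = 56, 5_000
daily_rate = np.clip(0.05 + rng.normal(0, 0.01, n_days), 0.001, 0.999)
successes = rng.binomial(n_per_day, daily_rate)

daily_p = successes / n_per_day
pooled_p = successes.sum() / (n_days * n_per_day)

# Binomial s.e. of the pooled proportion: shrinks like 1/sqrt(total n).
se_pooled = np.sqrt(pooled_p * (1 - pooled_p) / (n_days * n_per_day))
# Spread of the daily proportions: does not shrink as n grows.
sd_daily = daily_p.std(ddof=1)

print(f"binomial s.e. of pooled proportion: {se_pooled:.5f}")
print(f"s.d. across daily cohorts:          {sd_daily:.5f}")
```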

So what I want to do is to, say, take the 8 weekly cohorts, compute standard errors for each week, and take their weighted average as an estimate of the underlying variability, then apply the z-test.
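One way to make this concrete (reading the proposal as using week-level replication to set the error term for the overall A-vs-B comparison) is sketched below; the weekly counts are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical weekly successes and trials for A and B over 8 weeks.
x_a = np.array([640, 655, 610, 702, 588, 660, 645, 700])
n_a = np.full(8, 12_500)
x_b = np.array([668, 630, 690, 655, 640, 710, 622, 685])
n_b = np.full(8, 12_500)

# Treatment effect estimated within each week.
weekly_diff = x_a / n_a - x_b / n_b

# Use the week-to-week spread of these differences as the error term,
# instead of the binomial s.e. from the pooled counts.
est = weekly_diff.mean()
se = weekly_diff.std(ddof=1) / np.sqrt(len(weekly_diff))

z = est / se
p_value = 2 * stats.norm.sf(abs(z))  # with only 8 weeks, a t reference (7 df) is more cautious
print(f"A - B = {est:.4f}, s.e. = {se:.4f}, z = {z:.2f}, p = {p_value:.3f}")
```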

In my mind, this works like ANOVA. One way to estimate the variability is to treat all 8 weeks as one big pool. A different way is to treat each week as a random sample from the population and average the sample variances. If there is underlying week-to-week variability, then the second estimate will be higher than the first estimate.
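Here is a rough sketch of that comparison for one arm, reading the first estimate as the pooled binomial variance of the overall proportion and the second as the variance implied by the week-to-week spread of the weekly proportions; the counts are again made up:

```python
import numpy as np

# Hypothetical weekly successes/trials for one arm (the same idea applies to the A-B difference).
x = np.array([640, 655, 610, 702, 588, 660, 645, 700])
n = np.full(8, 12_500)
p_week = x / n
p_pool = x.sum() / n.sum()

# Estimate 1: pool the 8 weeks and use the binomial variance of the overall proportion.
var_pooled = p_pool * (1 - p_pool) / n.sum()

# Estimate 2: treat the weekly proportions as replicates and use their empirical
# spread (the "between-week" variance), scaled to the overall mean.
var_between = p_week.var(ddof=1) / len(p_week)

print(f"variance of overall proportion, pooled binomial:          {var_pooled:.3e}")
print(f"variance of overall proportion, from week-to-week spread: {var_between:.3e}")
# If there is real week-to-week variation, the second number is larger; that
# excess is what an Anova decomposition attributes to weeks.
```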

Each week can indeed be considered a “random” sample because only customers who fall into a particular status are eligible for the treatments, and each new day a new batch of customers reaches that status, so there are no issues with overlaps, etc.

My reply: Yes, you’re basically right. It’s like Anova or a multilevel model: you want to compare to a relevant baseline level. That’s the classic problem of Anova in complicated designs: picking the right variance comparison (or, as we say in Anova jargon, the right “error mean square”). The only difference is that, in classical Anova, the error mean square is chosen based on the randomization (as in the notorious split-plot design, in which the within-plot mean square is used for within-plot comparisons, and the between-plot mean square is used for between-plot comparisons; that’s something I learned in Rubin’s class that I bet most stat Ph.D. students don’t learn anymore), and in the modern Bayesian formulation, the comparison is based on the structure of the data, which may correspond to the randomization but doesn’t have to.
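As a rough illustration of the multilevel version (simulated data, and a linear model on daily A-minus-B differences rather than a full hierarchical logistic regression), one could let a week-level variance component enter the standard error of the overall effect:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)

# Simulate 8 weeks x 7 days of daily A-B differences in the proportion,
# with a common treatment effect plus week-level and day-level noise.
weeks = np.repeat(np.arange(8), 7)
true_effect = -0.002
week_effect = rng.normal(0, 0.003, 8)[weeks]
diff = true_effect + week_effect + rng.normal(0, 0.002, len(weeks))

df = pd.DataFrame({"diff": diff, "week": weeks})

# Random-intercept model: the s.e. of the overall effect now reflects the
# week-level variance component, not just the day-level noise.
fit = smf.mixedlm("diff ~ 1", df, groups=df["week"]).fit()
print(fit.summary())
```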

Getting back to Kaiser’s example: we don’t really cover this in our book. We do cover Anova, but not really this particular issue. The closest is Figure 2.3, in which we have a separate s.e. for each year, but the year-to-year variability is so large that these s.e.’s are not particularly relevant in judging the importance of longer-term trends.