Rejoinder to discussion of "Analysis of variance: why it is more important than ever," for the Annals of Statistics Andrew Gelman 13 July 2004 Anova is more important than ever because we are fitting models with many parameters, and these parameters can often usefully be structured into batches. The essence of "Anova" (as we see it) is to compare the importance of the batches and to provide a framework for efficient estimation of the individual parameters and related summaries such as comparisons and contrasts. Classical Anova is associated with many things, including linear models, F-tests of nested and non-nested interactions, decompositions of sums of squares, and hypothesis tests. Our paper focuses on generalizing the assessment, with uncertainty bounds, of the importance of each row of the "Anova table." This falls in the general category of data reduction or model summary, and presupposes an existing model (most simply, a linear regression) and an existing batching of batching of coefficients (or more generally "effects," as noted by McCullagh) into named batches. We thank the discussants for pointing out that more work needs to be done to generalize these ideas beyond classical regression settings with exchangeable batches of parameters. In this rejoinder, we review the essentials of our approach and then address some specific issues raised in the discussions. General comments McCullagh states that we regard "analysis of variance as an algorithm or procedure with a well-defined sequence of computational steps to be performed in fixed sequence." We appreciate this comment, especially in light of Tjur's complaint that it is not clear what our statistical model is. We would like to split the difference and say they are both right: our procedure is indeed performed in a fixed sequence, and the first step is to take a statistical model that must be specified from the outside. A statistical model is usually taken to be summarized by a likelihood, or a likelihood and a prior distribution, but we go an extra step by noting that the parameters of a model are typically batched, and we take this batching as an essential part of the model. If a model is already set up in a fully Bayesian form, our Anova step is merely to summarize each batch's standard deviation (whether superpopulation or finite-population; this depends on the substantive context, as discussed by Zaslavsky and in our Section 3.5). If only a likelihood is specified, along with a batching of parameters, we recommend fitting a multilevel model with a variance parameter for each batch, to be estimated from data. Yes, this is an automatic step, and yes, this can be inappropriate in particular cases, but we think it is a big step forward from the current situation in which the analyst must supply redundant information to avoid making inappropriate variance comparisons. Tjur also recognizes our goal of making split-plot and other analyses more understandable to students. More generally, we want to set up a framework where non-students can get the correct (classical) answer too (and avoid difficulties such as illustrated in Figure 1)! Our procedure gives an appropriate answer in a wide range of classical problems, and we find the summary in terms of within-batch standard deviations to be more relevant than the usual Anova table of sums of squares, mean squares, and F tests. None of the discussants dispute either of these points, but they all would like to go beyond classical linear models with balanced designs. We provide a more general example in Section 7.2 of our paper (an unbalanced logistic regression) and discuss other generalizations in Section 8.3, but we accept the point that choices remain when implementing Anova ideas in nonexchangeable models. The model comes first The discussants raised several important points that we agree with and regret not emphasizing enough in the paper. First, all the discussants, but especially Tjur and McCullagh, emphasize that the model comes first, and the model should ideally be motivated by substantive concerns, not by mathematical convenience and not by the structure of the data or the design of data collection. As noted above, our conception of Anova is a way of structuring inferences given that a model has already been fit and that its parameters are already structured into batches. As McCullagh points out, such batches should not necessarily be modeled exchangeably; we defend our paper's focus on exchangeable batches as they are an extremely important special case and starting point (we assume that the coauthor of an influential book on generalized linear models will appreciate the importance of deep understanding of a limited class of models) but note in Section 8.3 that more can be done. Anova is not just for linear models Our paper emphasized simple models in order to respond to the unfortunate attitude among many statisticians and econometricians that Anova is just a special case of linear regression. Sections 2 and 3 of our paper demonstrate that Anova can only be thought of this way if "linear regression" is interpreted to include multilevel models. But Anova applies in much more general settings. The vote-preference example of Section 7.2 is closer to our usual practice, which is to use Anova ideas to structure and summarize hierarchical models that have dozens of parameters. In this example, we didn't fit a multilevel model because of any philosophical predilections or because we had any particular interest in finite populations, superpopulations, or variance parameters. Rather, we sought to capture many different patterns in the data (in particular, state effects to allow separate state estimates, and demographic effects to allow poststratification adjustment for survey nonresponse). The multilevel model allows more accurate inferences---the usual partial pooling or Bayesian rationale (see Park, Gelman, and Bafumi, 2004)---and Anova is a convenient conceptual framework for understanding and comparing the multiplicity of inferences that result. Compare Figures 6 and 7 to the usual tables of regression coefficients (in this case, with over 50 or 100 parameters) to see the practical advantages of our approach. Various complications arose naturally in the model fitting stage. For example, state effects started out as exchangeable and then we put in region indicators as state-level predictors. We are currently working on extending these models to time series of opinion polls as classified by states and demographics. So, even in the example of our paper, the modeling is not as automatic as our paper unfortunately made it to appear. What was automatic was the decision to estimate variance parameters for all batches of parameters and to summarize using the estimated standard deviations. A small contribution, but one that moves us from a tangle of over seventy logistic regression parameters (with potential identifiability problems if parameters are estimated using maximum likelihood or least squares) to a compact and informative display that is a starting point to more focused inferential questions and model improvements. As the discussants emphasize, in a variety of important application areas, we can and should go beyond linear models or even generalized linear models, to include nonlinear, nonadditive, and nonexchangeable structures. We have found the method of structuring parameters into batches to be useful in many different sorts of models, including nonlinear differential equations in toxicology, where population variability can be expressed in terms of a distribution of person-level parameters (e.g., Gelman, Bois, and Jiang, 1996) and Boolean latent-data models in psychometrics, which have batches of parameters indexed by individuals, situations, and psychiatric symptoms (e.g., Meulders et al., 2001). We cite our own work here to emphasize that we certainly were not trying to suggest that the analysis of variance be restricted to linear models. With modern Bayesian computation, a lot more is possible, as Hox and Hoijtink point out (and as they have demonstrated in their own applied work). We recommend that practitioners consider Anova ideas in summarizing their inferences in these multilevel settings. Anova as a supplement to inferences about quantities of interest In a discussion of the example of our Section 2.2.2 (to which we shall return below), Hox and Hoijtink point out that in any specific application, an applied researcher will typically be interested in particular treatment effects, or comparisons of treatment effects, rather than in variance components. We agree and thank these discussants for emphasizing this point. As with classical Anova, our goal in summarizing variance components is to understand the model as a whole--how important is each source of variation in explaining the data?--as a prelude or accompaniment to more focused inferences. Anova may be "more important than ever" but it is intended to add perspective to, not to take the place of, inference for quantities of substantive interest. To put it another way: if you are already fitting a statistical model, its parameters can probably be grouped into batches, and it is probably interesting to compare the magnitude of the variation of the parameters in each batch. Recent statistical research has revealed many sorts of useful densely-parameterized models, including hierarchical regressions, splines, wavelets, mixture models, image models, etc. However, it can be tricky to understand such models or compare them when they are fit to different datasets. A long list of parameter estimates and standard errors will not necessarily be helpful, partly for simple reasons of graphical display, and partly because an ensemble of point estimates will not capture the variance of an ensemble of parameters (Louis, 1984). The two examples provided by Zaslavsky illustrate ways in which inferences for variance components can be relevant in applied settings. Tjur asks about partial confounding and other unbalanced designs. We would simply handle these using Bayesian inference. For example, Section 7.2 gives an example of an unbalanced design. Our paper discussed classical estimates for balanced designs, to connect to classical Anova and provide fast calculations for problems like the internet example, but more generally one can always use full Bayesian computations, as pointed out by Hox and Hoijtink. Estimation and hypothesis testing As Zaslavsky notes, our treatment of Anova focuse on estimation of variance components (and, implicitly, of individual coefficients and contrasts via shrinkage estimation), rather than on hypothesis testing. In the application areas in which we have worked, interested has lain in questions of the form, "How important are the effects of factor X?", rather than "Does factor X have an effect?"; see Figure A for an example. (We acknowledge McCullagh's point that our focus is the product of our experiences in environmental, behavioral, and social sciences; in other fields such as genetics, hypotheses of zero effects are arguably more relevant research questions.) [Figure A goes here] In settings where hypothesis testing is desired, we agree with Zaslavsky that posterior predictive checking is the best approach, since it allows a hypothesis about any subset of parameters to be tested while accounting for uncertainty in the estimation of the other parameters in the model. Posterior predictive checking can also be applied to the multilevel model as a whole to test assumptions such as additivity, linearity, and normality. Finite-population and superpopulation summaries Zaslavsky points out that, in settings where one is interested in generalizing or predicting for new groups, superpopulation summaries are most relevant. We emphasized finite-population summaries in our paper so as to provide more continuity with classical Anova. For example, with only five treatment levels, nonzero superpopulation variances are inherently difficult to estimate, a problem that is somewhat ducked by the usual classical analysis which focuses on testing hypotheses of zero variance. A related issue arises in hierarchical regression models, where the concept of a "contrast" in Anova plays the role of a finite-population regression coefficient, while the coefficient in the group-level regression has a superpopulation interpretation. (For instance, in the latin square example shown in Figure 3, suppose we are interested in the linear contrast of treatments A,B,C,D,E. The finite-population contrast is $-2\cdot\beta_1+(-1)\cdot\beta_2+0\cdot\beta_3+ 1\cdot\beta_4+ 2\cdot\beta_5$, whereas the superpopulation contrast is the appropriately-scaled coefficient of (-2,1,0,1,2) included as a treatment-level predictor in the multilevel model. A key technical contribution of our paper is to disentangle modeling and inferential summaries. A single multilevel model can yield inference for finite-population and superpopulation inferences. For example, in the example of Section 2.2.2, the structure of the problem implies a model with treatment and machine effects, as noted by Hox and Hoijtink. These authors state that their preferred procedure is "not what [we] had in mind," but they do not fully state what their model is. The key question is: what is the population distribution for the four treatment effect parameters? Our recommendation is to fit a normal distribution with mean and standard deviation as hyperparameters estimated from the data. This is the "superpopulation" standard deviation in the terminology of our Section 3.5; fitting the model would also give inferences for the individual treatment effects and their standard deviation. Hox and Hoijtink question whether the variance of the treatment effects is "an interesting number to estimate"; as we note above in our discussion, we agree with this point but find the general comparison of all the variance parameters to be a useful overall summary (as illustrated by the Anova graphs in our paper) without being a replacement for the estimation of individual treatment effects. To continue with Hox and Hoijtink's discussion of our example: we are not sure what analysis they are suggesting in place of our recommended hierarchical model. One possibility is least-squares estimation for the treatment effect parameters, which would correspond to our hierarchical model with a variance parameter pre-set to infinity. This seems to us to be inferior to the more general Bayesian approach of treating this variance as a hyperparameter and estimating it from data, and it would also seem to contradict Hox and Hoijtink's opposition to noninformative prior distributions later in their discussion. Another possibility would be a full Bayesian approach with a more informative hyperprior distribution than the uniform distribution that we use. We agree that in the context of any specific problem, a better prior distribution (or, for that matter, a better likelihood) should be available, but we find the normal model useful as a default or starting point. Fixed and random effects We suspect that statisticians are generally unaware of the many conflicting definitions of the terms "fixed" and "random"--in fact, a reviewer of an earlier version of this paper criticized the multiple definitions in Section 6 as "straw men," which is why we went to the trouble of getting references for each. We are glad that Zaslavsky liked our discussion of fixed and random effects and that McCullagh recognized the "linguistic quagmire." Hox and Hoijtink would like to define a fixed effect as "a varying effect with components that do not come from the same distribution." This distinction may be important, but we are not hopeful that they will be successful in establishing a new meaning to an already overloaded expression that has at least five other existing interpretations in the statistical literature! Is the phrase "fixed effect" so important that it is worth fighting over this patch of linguistic ground? We use the terms "constant" and "varying" effects because they are unambiguous statements about parameters in a model, and we have the need to communicate with researchers in a wide range of substantive fields. If Hox and Hoijtink find it useful to label sets of effects that are batched but do not come from a common distribution, we recommend they use an unambiguous phrase such as "differently-distributed effects" that communicates the concept directly. Tjur states that our "basic idea seems to be to let all effects enter formally as random effects." We are disappointed to see that he seems to have skipped over Section 6 of our paper! The term "random effect" has no clear (let alone "formal" definition), so we certainly do not consider it to be any part of our basic idea! On the contrary, our basic idea is to recognize that the parameters in a model are not simply a long undifferentiated vector but can be usefully grouped into batches, which in fact are already specified in the classical Anova table. Summary: why is Anova important now? First, as noted above, if you're already fitting a complicated model, your inferences can be better understood using the structure of that model. We have presented a method for doing so in the context of batches of exchangeable parameters, and we anticipate future developments in other classes of models such as discussed by McCullagh. Second, if you have a complicated data structure and are trying to set up a model, it can help to use multilevel modeling--not just a simple units-within-groups structure but a more general approach with crossed factors where appropriate. This is the way that researchers in psychology use Anova, but they are often ill-served by the classical framework of mean squares and F-tests. We hope that our estimation-oriented approach will allow the powerful tools of Bayesian modeling to be used for the applied goals of inference about large numbers of structured parameters. Additional references Gelman, A., Bois, F. Y., and Jiang, J. (1996). Physiological pharmacokinetic analysis using population modeling and informative prior distributions. Journal of the American Statistical Association 91, 1400-1412. Gelman, A., and Huang, Z. (2004). Estimating incumbency advantage and its variation, as an example of a before/after study. Under revision for Journal of the American Statistical Association. Louis, T. A. (1984). Estimating a population of parameter values using Bayes and empirical Bayes methods. Journal of the American Statistical Association 78, 393-398. Meulders, M., De Boeck, P., Van Mechelen, I., Gelman, A., and Maris, E. (2001). Bayesian inference with probability matrix decomposition models). Journal of Educational and Behavioral Statistics 26, 153-179. Park, D. K., Gelman, A., and Bafumi, J. (2004). Bayesian multilevel estimation with poststratification: state-level estimates from national polls. Political Analysis, to appear. Additional figure (Figure A, attached as a postscript file. This figure should just be shrunk to fit the page so that it can be viewed without turning the page 90 degrees.) Caption: Estimates +/- standard errors for an average treatment effect and three variance parameters for a hierarchical model of elections for the U.S. House of Representatives, fit separately to pairs of successive elections from the past century. The graphs illustrate how we are interested in estimating the magnitude of each source of variation, not simply testing whether effects or variance components equal zero. From Gelman and Huang (2004).