Mechanistic Understanding of Models for Educational Assessments

Andrew Gelman, Department of Statistics, Columbia University

I comment on this article (Mislevy, Steinberg, and Almond, 2002) as a statistician with no particular expertise in educational measurement. The authors make a strong case for their general inferential framework (a complicated structure of latent variables, with multiple parameters for each student, estimated using Bayesian updating). Interestingly, they do not make this case by claiming that their student and task models fit the data better, or that they are more parsimonious, than other available IRT models. Rather, they justify their modeling strategy by showing how it allows a better understanding of the processes being studied.

Phenomenological and mechanistic models

The sparseness of real-world data limits the extent to which complex models can be estimated from data alone (for example, in a standardized test, there is a limit to the number of questions that can be asked of each student). In contrast, when a statistical model is tied to reality, it can have much more complexity, with a structure set by substantive understanding. Consider an example from our own research in pharmacology (Gelman, Bois, and Jiang, 1996): researchers commonly try to learn about internal bodily processes given only information about the time series of concentrations of certain compounds in blood and exhaled air. If set up "phenomenologically," with parameters representing, for example, coefficients in a model of exponential decay, such a model is ill-posed and cannot be estimated well, in the sense that the data cannot reliably be fit in a way that allows accurate predictions for new experimental conditions. Instead, the problem is solved by setting up a mechanistic model in which parameters represent, for example, volumes and concentrations of the compounds in bodily organs, so that they have natural biological constraints and can be generalized accurately across human and even animal populations. In pharmacokinetic jargon, such mechanistic models are called "physiologically-based."

The connection between mechanistic modeling and identifiability of parameters varies from problem to problem but becomes clear in specific cases. For example, in the pharmacokinetic experiment that we studied, several individuals were exposed to a toxin for a few hours; this exposure was then ended, and the concentration of the toxin in their blood and exhaled air was measured several times over the period of a week. The basic pattern of the data--concentrations that declined gradually, with an asymptote at zero--could be fit by a phenomenological model under which the data follow a mixture of declining exponential functions:

y(t) = A exp(-at) + B exp(-bt) + C exp(-ct) + ... + error.

For our data, four terms would supply a good fit, but then it would be difficult to estimate all eight parameters simultaneously. This would present a problem, since the ultimate use of the model is to make inferences about what would happen to the compound under different exposure conditions. In contrast, the mechanistic model has even more parameters than the phenomenological model, but its parameters have a direct physiological interpretation, which helps in two ways. First, it is possible to get reasonable prior distributions for these parameters based on studies of other persons and other compounds; second, inference about the physiological parameters can more believably generalize to other exposure conditions.
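To make the identifiability problem concrete, here is a minimal sketch of fitting the mixture-of-exponentials model by nonlinear least squares. The measurement times, data-generating values, and starting values are hypothetical stand-ins, not the data from our study; the point is only that even a two-term fit is delicate, and adding terms quickly makes the estimated covariance matrix nearly singular.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Hypothetical measurement times (hours) over one week, and simulated
# concentrations that decline gradually toward an asymptote at zero.
t = np.array([1.0, 2, 4, 8, 16, 32, 64, 120, 168])
y = 10 * np.exp(-0.5 * t) + 2 * np.exp(-0.02 * t) + rng.normal(0, 0.05, t.size)

def mix_exp(t, A, a, B, b):
    # Phenomenological model: two declining exponential terms.
    return A * np.exp(-a * t) + B * np.exp(-b * t)

est, cov = curve_fit(mix_exp, t, y, p0=[5.0, 1.0, 1.0, 0.01], maxfev=10000)
print("estimates:", est)
print("approximate standard errors:", np.sqrt(np.diag(cov)))
# With four terms (eight parameters), cov becomes close to singular: the
# data cannot pin down all the amplitudes and decay rates at once, so
# predictions under new exposure conditions are unreliable.
```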
In general, phenomenological models are characterized by relatively few parameters that, ideally, can be estimated directly from data and easily applied to similar situations. Mechanistic models are more complicated but, ideally, are based on more fundamental characteristics that, once understood, can apply in a wider range of new settings. For example, in educational assessments, it is more "phenomenological" to assign a single "ability" parameter to each student, or even to have several unspecified ability parameters as in a factor analysis. It would be more "mechanistic" for these ability parameters to be identified with specific abilities (for example, different skills relating to reading) that combine, possibly nonlinearly. In either case, the fundamental statistical task is to generalize to new situations: other students, more difficult tests, other topics, and so forth.

Inference and model checking

The article by Mislevy et al. describes how mechanistic models can be used to learn more from educational assessments than might have been thought possible, especially compared with what can be done with statistical approaches such as the Rasch and other IRT models. Educational tests are certainly designed to tell us detailed information, so the task should not be impossible. However, as the authors point out, actually fitting their model can be difficult. In addition, the proposers of a complex model perhaps have a duty to demonstrate that it actually fits available data. This can be done straightforwardly for their Bayesian models by simulating replicated data sets and comparing them, visually and quantitatively, to existing data (a sketch of such a check appears at the end of this comment). One of the strengths of the fully probabilistic modeling approach proposed by the authors is that it allows for targeted checking of the fit to existing and future data. An awkward issue here is that few such checks appear in the scientific literature: when researchers find flaws in a model, they will (or should) want to fix it. Thus, although model formulation and criticism are iterative processes, the "criticism" steps tend to be invisible (Box, 1980; Gelman, Meng, and Stern, 1996).

The computer implementation described by Mislevy et al. is also potentially a methodological advance, if it allows the statistical model to be conceptualized as an ongoing project, to be checked and improved in the hands of practitioners. Researchers in the field of educational testing could perhaps go one step further and try to assess their models by designing student assessments that will push the boundaries of their models and reveal their weak points. This is in turn related to a common goal of testing--to find and extend the boundaries of the students' knowledge. Perhaps the general principle of testing students can apply to testing our own understanding of how students understand and learn.
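As promised above, here is a minimal sketch of a posterior predictive check in the spirit of Gelman, Meng, and Stern (1996): simulate replicated data sets under posterior draws of the parameters and compare a test statistic computed on them with the same statistic computed on the observed data. The Rasch-style setup, the posterior draws, and the "observed" responses below are all hypothetical stand-ins; in practice the draws would come from a fitted model.

```python
import numpy as np

rng = np.random.default_rng(1)
n_students, n_items, n_draws = 200, 20, 500

# Stand-ins for posterior draws of student abilities (theta) and item
# difficulties (beta); in practice these would come from a fitted model.
theta = rng.normal(0, 1, (n_draws, n_students))
beta = rng.normal(0, 1, (n_draws, n_items))

def prob(theta_s, beta_s):
    # Rasch model: Pr(correct) = inverse-logit(ability - difficulty).
    return 1 / (1 + np.exp(-(theta_s[:, None] - beta_s[None, :])))

# "Observed" responses, simulated here purely for illustration.
y_obs = rng.binomial(1, prob(theta[0], beta[0]))

# Test statistic: the spread of students' total scores.
T_obs = y_obs.sum(axis=1).std()

# Replicate data under each posterior draw and recompute the statistic.
T_rep = np.array([rng.binomial(1, prob(theta[s], beta[s])).sum(axis=1).std()
                  for s in range(n_draws)])

# A posterior predictive p-value near 0 or 1 flags misfit in this aspect
# of the data; values in between indicate no evidence of this flaw.
print("posterior predictive p-value:", np.mean(T_rep >= T_obs))
```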
References

Box, G. E. P. (1980). Sampling and Bayes inference in scientific modelling and robustness. Journal of the Royal Statistical Society A 143, 383-430.

Gelman, A., Bois, F. Y., and Jiang, J. (1996). Physiological pharmacokinetic analysis using population modeling and informative prior distributions. Journal of the American Statistical Association 91, 1400-1412.

Gelman, A., Meng, X. L., and Stern, H. S. (1996). Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Statistica Sinica 6, 733-807.

Mislevy, R., Steinberg, L., and Almond, R. (2002). On the structure of educational assessments (with discussion). Measurement: Interdisciplinary Research and Perspectives.