Mechanistic Understanding of Models for Educational Assessments
Andrew Gelman, Department of Statistics, Columbia University
I comment on this article (Mislevy, Steinberg, and Almond, 2002) as a
statistician with no particular expertise in educational measurement.
The authors make a strong case for their general inferential framework
(a complicated structure of latent variables, with multiple parameters
for each student, estimated using Bayesian updating). Interestingly,
they do not make this case by claiming that their student and task
models fit the data better or that they are more parsimonious than
other available IRT models. Rather, they justify their modeling
strategy by showing how it allows a better understanding of the
processes being studied.
Phenomenological and mechanistic models
The sparseness of real-world data limits the extent to which complex
models can be estimated from data alone (for example, in a
standardized test, there is a limit to the number of questions that
can be asked of each student). In contrast, when a statistical model
is tied to reality, it can have much more complexity, with a structure
set by substantive understanding.
For an example from our own research in pharmacology (Gelman, Bois,
and Jiang, 1996): researchers commonly try to learn about internal
bodily processes given only information about the time series of
concentrations of certain compounds in blood and exhaled air. If set
up "phenomenologically" with parameters representing, for example,
coefficients in a model of exponential decay, such a model is
ill-posed and cannot be estimated well, in the sense that the data
cannot reliably be fit in a way that allows accurate predictions for
new experimental conditions. Instead, the problem is solved by
setting up a mechanistic model in which parameters represent volumes
and concentrations of the compounds in bodily organs, for example, so
that they have natural biological constraints and can be generalized
accurately across human and even animal populations. In
pharmacokinetic jargon, such mechanistic models are called
"physiologically-based."
The connection between mechanistic modeling and identifiability of
parameters varies from problem to problem but becomes clear in
specific cases. For example, in the pharmacokinetic experiment that
we studied, several individuals were exposed to a toxin for a few
hours, and then this exposure was ended and the concentration of the
toxin in their blood and exhaled air was measured several times over
the period of a week. The basic pattern of the data--concentrations
that declined gradually, with an asymptote at zero--could be fit by a
phenomenological model under which the data follow a mixture of
declining exponential functions:
y(t) = A exp(-at) + B exp(-bt) + C exp(-ct) + ... + error.
For our data, four terms would supply a good fit, but it would then be
difficult to estimate all eight parameters simultaneously. This would
present a problem since the ultimate use of the model is to make
inferences about what would happen to the compound under different
exposure conditions.
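The ill-posedness of this phenomenological fit can be seen numerically. Below is a sketch in which the decay rates, amplitudes, and measurement times are all invented for illustration (they are not the study's data): the Jacobian of the four-term exponential mixture with respect to its eight parameters has nearly collinear columns, and its large condition number is the numerical face of the estimation problem described above.

```python
import numpy as np

# Illustrative values only: rates, amplitudes, and times are invented.
# For y(t) = sum_k A_k exp(-a_k t), the least-squares Jacobian with
# respect to (A_1, a_1, ..., A_4, a_4) has nearly collinear columns,
# so small changes in the data produce large changes in the estimates.

t = np.linspace(0.25, 7.0, 20)            # measurement times (days)
rates = np.array([2.0, 1.0, 0.5, 0.25])   # hypothetical decay rates a_k
amps = np.ones(4)                         # hypothetical amplitudes A_k

cols = []
for A, a in zip(amps, rates):
    cols.append(np.exp(-a * t))           # sensitivity to A_k
    cols.append(-A * t * np.exp(-a * t))  # sensitivity to a_k
J = np.column_stack(cols)                 # 20 observations x 8 parameters

cond = np.linalg.cond(J)
print(f"condition number of the Jacobian: {cond:.2e}")
```

A huge condition number means that many quite different parameter vectors fit the observed concentrations almost equally well, which is exactly why predictions for new exposure conditions become unreliable.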
In contrast, the mechanistic model has even more parameters than the
phenomenological model, but its parameters have a direct physiological
interpretation, which helps in two ways. First, it is possible to get
reasonable prior distributions for these parameters based on studies
of other persons and other compounds; and second, inference about the
physiological parameters can more believably generalize to other
exposure conditions.
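As a toy contrast, here is a minimal mechanistic sketch in the same spirit. All volumes, flows, and rates are invented, and a real physiologically-based model has many more compartments; the point is only that each parameter here (a blood volume, a fat volume, a blood flow, a partition coefficient) has a physical meaning and natural constraints.

```python
import numpy as np

# Hypothetical two-compartment mass balance; all parameter values are
# invented for illustration.  Unlike the exponential-decay coefficients,
# each parameter is a physical quantity with a natural range.

V_blood, V_fat = 5.0, 10.0   # compartment volumes (liters, hypothetical)
Q = 0.5                      # blood flow to fat (liters/hour)
P = 50.0                     # fat/blood partition coefficient
k_elim = 0.3                 # elimination rate from blood (1/hour)

def simulate(c_blood0, hours, dt=0.01):
    """Euler integration of the two-compartment mass balance."""
    c_blood, c_fat = c_blood0, 0.0
    for _ in range(int(hours / dt)):
        # Net transfer into fat is driven by the blood/fat gradient.
        flux = Q * (c_blood - c_fat / P)
        c_blood += dt * (-flux - k_elim * V_blood * c_blood) / V_blood
        c_fat += dt * flux / V_fat
    return c_blood, c_fat

# Concentrations one week after the end of exposure.
c_blood_week, c_fat_week = simulate(c_blood0=1.0, hours=24 * 7)
print(c_blood_week, c_fat_week)
```

Because the parameters are physiological, prior information from other persons and other compounds can constrain them, which is the first of the two advantages noted above.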
In general, phenomenological models are characterized by relatively
few parameters that, ideally, can be estimated directly from data and
easily applied to similar situations. Mechanistic models are more
complicated but, ideally, are based on more fundamental
characteristics that, once understood, can apply in a wider range of
new settings. For example, in educational assessments, it is more
"phenomenological" to assign a single "ability" parameter to each
student, or even to have several unspecified ability parameters as in
a factor analysis. It would be more "mechanistic" for these ability
parameters to be identified with specific abilities (for example,
different skills relating to reading) that combine, possibly
nonlinearly. In either case, the fundamental statistical task is to
generalize to new situations: other students, more difficult tests,
other topics, and so forth.
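To make the contrast concrete, here is one hypothetical way skill-specific parameters can combine nonlinearly: a conjunctive ("noisy-AND") item model in the spirit of the DINA model from the cognitive-diagnosis literature. The mastery patterns, skill requirements, and slip and guessing rates below are all invented, and this is not the model of Mislevy et al.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented sizes and parameters, for illustration only.
n_students, n_skills, n_items = 4, 3, 5

# Latent skill mastery (1 = mastered) and a Q-matrix recording which
# skills each item requires.
mastery = rng.integers(0, 2, size=(n_students, n_skills))
q_matrix = rng.integers(0, 2, size=(n_items, n_skills))

slip, guess = 0.1, 0.2    # made-up slip and guessing rates

# eta[i, j] = True when student i has every skill item j requires:
# the skills combine conjunctively (an AND), not additively.
eta = np.all(mastery[:, None, :] >= q_matrix[None, :, :], axis=2)

# Only a "slip" fails a qualified student; an unqualified student
# can still "guess" correctly.
p_correct = np.where(eta, 1.0 - slip, guess)
print(p_correct)
```

The nonlinearity matters for generalization: under a conjunctive model, raising one skill helps only students who already hold the item's other required skills, a prediction a single-ability model cannot express.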
Inference and model checking
The article by Mislevy et al. describes how mechanistic models can be
used to learn more from educational assessments than might have been
thought possible, especially compared with what can be done with
standard statistical approaches such as Rasch and other IRT models.
Educational tests are certainly designed to tell us detailed
information, so the task should not be impossible. However, as the
authors point out, actually fitting their model can be difficult.
In addition, the proposers of a complex model perhaps have a duty to
demonstrate that it actually fits available data. This can be done
straightforwardly for their Bayesian models by simulating replicated
data sets and comparing them, visually and quantitatively, to existing
data. One of the strengths of the fully-probabilistic modeling
approach proposed by the authors is that it allows for targeted
checking of the fit to existing and future data. An awkward issue
here is that few such checks appear in the scientific literature: when
researchers find flaws in a model, they will (or should) want to fix
it. Thus, although model formulation and criticism are iterative
processes, the "criticism" steps tend to be invisible (Box, 1980;
Gelman, Meng, and Stern, 1996). The computer implementation described
by Mislevy et al. is also potentially a methodological advance, if it
allows the statistical model to be conceptualized as an ongoing
project, to be checked and improved in the hands of practitioners.
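As a concrete illustration of the kind of check meant here, the following sketch performs a posterior predictive check on a toy normal model, in the spirit of Gelman, Meng, and Stern (1996). The data and the posterior draws are simulated stand-ins, not the authors' assessment model: replicated data sets are simulated from posterior draws and a discrepancy statistic is compared to its observed value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy normal model with stand-in data; everything here is simulated
# for illustration.  The discrepancy T(y) = max(y) is one arbitrary
# choice; a targeted check would pick T to probe a suspected misfit.
y_obs = rng.normal(loc=0.0, scale=1.0, size=50)

# Stand-in "posterior draws" of (mu, sigma), approximated from the
# usual normal-model posterior under a noninformative prior.
n_draws = 1000
s = y_obs.std(ddof=1)
mu_draws = y_obs.mean() + rng.normal(0.0, s / np.sqrt(50), n_draws)
sigma_draws = s * np.sqrt(49 / rng.chisquare(df=49, size=n_draws))

# Simulate a replicated data set per draw and record the discrepancy.
t_obs = y_obs.max()
t_rep = np.array([
    rng.normal(mu, sigma, size=50).max()
    for mu, sigma in zip(mu_draws, sigma_draws)
])

# Posterior predictive p-value: values near 0 or 1 flag misfit.
p_value = (t_rep >= t_obs).mean()
print(f"posterior predictive p-value for T = max: {p_value:.2f}")
```

The same recipe applies unchanged to a complex latent-variable model: simulate replicated assessments from the fitted model and compare them, visually and through chosen discrepancies, to the students' actual responses.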
Researchers in the field of educational testing could perhaps go one
step further and try to assess their models by designing student
assessments that will push the boundaries of their models and reveal
their weak points. This is in turn related to a common goal of
testing--to find and extend the boundaries of the students' knowledge.
Perhaps the general principle of testing students can apply to testing
our own understanding of how students understand and learn.
References
Box, G. E. P. (1980). Sampling and Bayes inference in scientific
modelling and robustness. Journal of the Royal Statistical Society A
143, 383-430.
Gelman, A., Bois, F. Y., and Jiang, J. (1996). Physiological
pharmacokinetic analysis using population modeling and informative
prior distributions. Journal of the American Statistical Association
91, 1400-1412.
Gelman, A., Meng, X. L., and Stern, H. S. (1996). Posterior
predictive assessment of model fitness via realized discrepancies
(with discussion). Statistica Sinica 6, 733-807.
Mislevy, R., Steinberg, L., and Almond, R. (2002). On the structure
of educational assessments (with discussion). Measurement:
Interdisciplinary Research and Perspectives.