This is Jessica. Human subjects experiments are starting to appear in machine learning venues. For example, ICML, one of the big ML conferences, accepted a few papers on quantifying prediction uncertainty with user studies.
Overall, I’m glad to see this. Theoretically rigorous uncertainty approaches that provide calibration or coverage guarantees may not be worth the effort (e.g., in terms of the model retraining or holding out of additional data that they require) if providing that information doesn’t impact human decisions, or leads to under- or overreliance. We know from decades of research that people can seem to resist uncertainty information or use it inappropriately.
But it’s also important to acknowledge that human decision-making under uncertainty can be tricky to study. Details matter. In my own experimental work I’ve had multiple moments where I realized something new about how people were making decisions that I hadn’t necessarily considered a few studies earlier. So I have some advice on this topic.
First, some points specific to studying uncertainty quantification:
- Make sure that the question you are asking about the value of uncertainty is truly about human behavior, and therefore requires an experiment. Consider that many questions about the value of uncertainty information can be worked out without running an experiment.
- Related to #1, be clear on what it would mean for a decision-maker to use the uncertainty information in the ideal case. If you’re not sure, figure that out before you run the experiment.
- Be careful that how you describe things to people accurately reflects what the uncertainty means. For example, don’t make false promises about uncertainty for individual instances, such as by implying that we know that there is a 95% chance the true value is in this specific interval. It might seem like a minor detail, but if your participants have unrealistic expectations about what they are seeing, it’s hard to interpret your results. You also risk misleading readers of your paper who don’t catch the distinction.
- Don’t assume the intended calibration of the uncertainty method applies to the instances that you actually present to people in your experiment. For example, just because you calibrated your method to have 95% coverage on average doesn’t mean that this is what is achieved on the instances you actually present people with (a small sketch of this kind of check appears below). So acknowledge the potential difference in analyzing the results. Same goes for accuracy and other aggregate performance measures you might estimate on held-out data. Unless it’s part of your research question, try to avoid telling people one thing but showing them something quite different.
- Think carefully about how you elicit probabilistic information from people, like their confidence in a judgment. Don’t assume they will be able to tell you exactly what they believe just because you asked them, even if that’s an assumption lots of prior studies have made.
- Keep in mind that there’s something ironic about papers about the importance of uncertainty quantification that ultimately present all of their results in a dichotomous, significant vs not-significant style.
Regarding that last point, I could see some people saying it’s too picky, and arguing that there is a long history of using NHST to analyze experiment results. But I think it’s fair to ask whether one wants to be part of the problem or the solution. If you care about good uncertainty expression in the artificial world of your experiment, shouldn’t you also care about it when you present the results of your research?
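On the point about calibration not automatically carrying over to the instances you show people, here’s a minimal sketch of the kind of check I have in mind, assuming you have the intervals you actually presented and the realized values for those same instances. The numbers below are made up purely to illustrate the computation.

```python
import numpy as np

# Made-up toy values purely to illustrate the check; substitute the intervals
# you actually showed participants and the realized outcomes for those instances.
lower  = np.array([2.1, 0.4, 5.0, 1.2, 3.3])   # interval lower bounds shown
upper  = np.array([4.8, 2.9, 7.6, 3.0, 6.1])   # interval upper bounds shown
y_true = np.array([3.0, 3.1, 6.2, 2.4, 4.0])   # what actually happened

covered = (y_true >= lower) & (y_true <= upper)
print(f"Empirical coverage on presented instances: {covered.mean():.2f} (nominal: 0.95)")
```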
The above list was specific to experiments on expressions of uncertainty, but I also have some more general advice:
- Related to #6 on not falling back on dichotomous reasoning, be careful with how you decide sample size. “Large sample size” is relative. If you are testing 10 different conditions, even if you have repeated measures, having what seems like a large number of people total doesn’t necessarily mean you have high power to detect the differences you care about. Fake data simulation will give you a much better sense of how well you’ll be able to distinguish effects you care about than online power calculators will (see the sketch after this list).
- Attempt to account for the design of the experiment when you model your results, not just the factor(s) you have research questions about. If you’re worried that the model might not fit when you include everything, you can preregister a model selection process rather than a single fixed specification. Something I’m seeing a fair amount of is repeated-trials designs where the modeling doesn’t account for individual-specific effects (e.g., through random intercepts and possibly random slopes) or for trial number. In my experience, these things often have non-trivial effects, such that leaving them out can change the other effects you’re estimating in your model. This paper by Yarkoni has a nice demonstration of how ignoring individual heterogeneity leads to overconfident interpretations.
- Pre-registration is a property of a data collection and analysis process. Deviations from part of what was planned are not uncommon, so don’t treat preregistration as a label that applies at the level of a whole project. You don’t really need the term preregistration in the abstract and the introduction and the conclusion. Just discuss it when you get to the methods, and note any deviations when you talk about results.
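To make the sample size and modeling points concrete, here’s a minimal sketch of a fake-data simulation for power, assuming a hypothetical two-condition design with repeated trials. The effect size, variance components, and sample sizes are placeholders you’d replace with values you consider plausible. It fits a mixed-effects model with a participant-level random intercept and trial number as a covariate, the kind of specification the previous point is asking for.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

def simulate_once(n_participants=40, n_trials=20, effect=0.15,
                  sd_participant=0.5, trial_slope=0.02, sd_noise=1.0):
    # Simulate one fake dataset: participants assigned to one of two conditions,
    # with individual heterogeneity and a drift over trials (learning/fatigue).
    rows = []
    for p in range(n_participants):
        intercept_p = rng.normal(0, sd_participant)
        condition = p % 2
        for t in range(n_trials):
            y = (intercept_p + effect * condition + trial_slope * t
                 + rng.normal(0, sd_noise))
            rows.append(dict(participant=p, trial=t, condition=condition, y=y))
    return pd.DataFrame(rows)

def estimate_power(n_sims=200, alpha=0.05, **kwargs):
    # Fraction of simulated experiments in which the condition effect is detected,
    # using a random-intercept model that also adjusts for trial number.
    hits = 0
    for _ in range(n_sims):
        df = simulate_once(**kwargs)
        fit = smf.mixedlm("y ~ condition + trial", df, groups=df["participant"]).fit()
        hits += fit.pvalues["condition"] < alpha
    return hits / n_sims

print(estimate_power(effect=0.15))  # power for the smallest effect you care about
```

Counting “significant” results is itself dichotomous, of course; the same simulations also let you look at the distribution of estimates and interval widths you can expect, which is often more informative.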
I sent this list to Alex Kale, who added a few more:
- Be clear with yourself, your participants, and your reader, about how the provided uncertainty information is relevant to the task.
- Use proper scoring rules in situations where you want the task to have a clear objective and performance to have a clear meaning (see the sketch after this list).
- Try to account for the role of prior knowledge and risk attitudes. Do not ignore these things and then pretend that results will generalize.
- Do not pretend that “high stakes” decision scenarios can be emulated in the typical controlled experiment.
- Do not confuse the effectiveness of presentation format with the effectiveness of providing uncertainty information in general. Do not treat poorly designed or implemented uncertainty communication as representative of what people can do with a more careful presentation.
- ^ especially when the uncertainty provided is non-informative for the task at hand, as in many XAI studies. This means your experiment is ill-conceived.
- Do not treat “understanding” as something that is easily quantified with measures like simulability alone. Do not collapse the complex cognitive processes that underlie understanding to vacuous concepts like “intuition.”
- Study how people learn to use uncertainty information, not just one-shot performance.
- Study deployments with actual users and do user-centered design or co-design if you want to develop tools that will have much hope of being adopted outside the lab. The road from theory to practice is messy. Embrace the mess!
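On the proper scoring rules point above, here’s a minimal sketch using the Brier score, one common proper scoring rule for binary outcomes; the variable names and numbers are placeholders for your own elicited data.

```python
import numpy as np

def brier_score(prob_forecasts, outcomes):
    # Mean squared difference between stated probabilities and 0/1 outcomes.
    # Lower is better; the expected score is minimized by reporting true beliefs,
    # which is what makes the rule "proper."
    prob_forecasts = np.asarray(prob_forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return np.mean((prob_forecasts - outcomes) ** 2)

# Toy example: a participant's elicited probabilities and what actually happened.
print(brier_score([0.9, 0.4, 0.7, 0.2], [1, 0, 1, 0]))
```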
Finally, a personal pet-peeve: stop implying that hardly anyone has ever studied the effect of presenting uncertainty in model predictions. Decision-making under uncertainty is a cross-cutting topic that has been discussed for many years in a number of fields. Predictive uncertainty is not a new concept.
Some of this appears with further discussion in our papers on why we need decision theory to evaluate human decision-making, how to evaluate reliance on model predictions, how to benchmark the value of information in studies such as those on information displays, and why evaluating uncertainty expressions is hard.