This is Jessica. I don’t usually post multiple times a week, but it turns out I have more to say on the topic of machine learning and human problems.
“Alignment” is the term used in the AI and ML communities to refer to the goal of aligning machine learning models with human values and preferences, so as to avoid risks ranging from the mundane to the catastrophic. It’s the topic of papers, workshops, talks, funding calls, etc.
There has been criticism of the nebulousness of what alignment is supposed to actually represent. Some of the critique of the ML conception of alignment comes from HCI research, the very interdisciplinary field that studies how people interact with technology and how to design human-computer interfaces. This pushback predates the “alignment” buzzword, actually. I remember watching many in the HCI community bristle when, in 2018, Michael Jordan wrote a blog post calling for the creation of a new “human-centric engineering discipline.” Seeing human-centered concerns get called out as a new frontier in AI/ML circles was enough to motivate some of the better-resourced HCI researchers to create centers on Human-Centered AI or install themselves as team leaders in big tech companies, ensuring they wouldn’t be overlooked. Others have worked to make AI-related applications a bigger part of HCI research. Many are left to stew about wheels being reinvented, trying to be patient and issuing the occasional plea for everyone to recognize the overlapping goals.
My take is that HCI can help quite a bit with alignment, but that what it can offer is not what much of the ML research community wants or perceives itself to need. It’s kind of like what a consulting statistician can offer to a data analysis versus what they are perceived to offer by those who recruit them. The real value of adding the statistician is often their role in helping you rethink your objective from the ground up. It’s not necessarily that they’re going to give you exactly the best tools to address some narrower problem you’ve convinced yourself needs to be solved. For example, you’re convinced that if you just find the right causal inference technique you can deconfound a big messy dataset you’ve amassed and learn exactly how to improve some outcome X, but the pesky statistician comes in and spoils it by telling you, “No, if X is ultimately the goal, you’re going to need a different data collection procedure altogether.”
In the case of aligning ML, there are certainly human-oriented questions that arise within the current paradigm for aligning models. For example, questions of eliciting specific information from humans become important for deploying generative models. Reinforcement learning from human feedback (RLHF) is a standard method for fine-tuning a large pretrained model like GPT-4: some group of annotators is recruited, often with no special experience required, and asked to select their preferred model output in a series of forced-choice tasks, usually given loosely defined criteria like “most helpful” or “least harmful”. Behavioral models for aggregating preferences across people, like Bradley-Terry-Luce, are then used to learn a utility function from these choices. Human-oriented concerns include how to design the forced-choice task and interface, how much information can reasonably be obtained from a single person, and how to crowdsource this efficiently. Beyond the common need to collect human annotations, other examples where human concerns arise in the current ML paradigm include questions like how to represent fairness ideals or how to evaluate post-hoc explanation techniques.
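To make the preference-aggregation step concrete, here is a minimal sketch (mine, not from any particular RLHF codebase) of the Bradley-Terry-style objective commonly used to fit a reward model from pairwise annotator choices: the probability that an annotator prefers one output over another is modeled as a logistic function of the difference in learned rewards, and the reward model is trained by minimizing the negative log-likelihood of the observed choices. The scalar rewards below stand in for the outputs of a hypothetical reward model.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(rewards_chosen, rewards_rejected):
    # Bradley-Terry model: P(chosen preferred over rejected)
    #   = sigmoid(r_chosen - r_rejected).
    # The loss is the negative log-likelihood of the annotators' choices,
    # so training pushes preferred outputs toward higher learned reward.
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

# Toy example: scalar rewards a (hypothetical) reward model assigned
# to the annotator-preferred and rejected completions for three prompts.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.9, -0.5])
print(bradley_terry_loss(r_chosen, r_rejected))  # smaller when chosen outscores rejected
```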
Could an HCI researcher be helpful for these questions? Sure, though I suspect that the most relevant work for some of these elicitation problems is likely to be found elsewhere, in fields like psychophysics or decision science. Could the ML researcher figure this kind of stuff out without the HCI researcher? Probably. In many cases it may be more efficient for them to do it themselves, since HCI is a very large and interdisciplinary field. So I’m not surprised that ML researchers are often doing these things themselves, nor do I blame them.
On the other hand, I think the HCI pushback to ML alignment is valid when you consider the broader goal of creating predictive models that are well-aligned with human goals and values. If there’s a secret sauce that your average HCI researcher can bring, it’s the mindset of user-centered design, which makes serious attempts to understand the needs of the people being designed for. HCI research also demonstrates what it looks like to hold the conviction that human values are not monolithic, contributing knowledge on a variety of methods for getting at what different groups want from technology. Taken to heart, I expect this kind of perspective to suggest rethinking pretty much everything about human-facing ML models from the ground up.
Unfortunately though, I don’t really see much incentive for the average ML researcher interested in alignment to invest in the HCI way of doing things. Interdisciplinary collaborations tend to be hard, and this one seems likely to be particularly slow and messy. Meanwhile AI/ML research is moving at a faster pace than ever.
I also tend to believe that when someone is peering into a field and believes they can bring some new perspective or methods that will be transformative, there’s an onus on that person to invest enough in understanding the field they hope to change to demonstrate the value they want to bring. You can’t really expect people to listen if you haven’t taken the time to understand their concerns well enough to show that you could provide concrete suggestions. If the HCI researcher wants alignment to be done better or differently, maybe it’s time they temporarily reinvent themselves as ML researchers. Figuring out how to publish HCI-oriented papers at ML venues may not be easy, but it’s a step toward real impact.
I don’t mean this last part to sound dismissive, or like I’m trying to defend ML alignment. I think it’s just how things work. There have been multiple times in my career when I’ve looked at some other field and thought, I bet I could improve that. It’s how I’m feeling right now, actually. And every time I’ve been in this position, it has seemed clear to me that the only way to have that impact is to invest enough time in the new field to internalize how the people in it think about it.