GIST: Gibbs self-tuning for HMC

I’m pleased as Punch to announce our new paper with Nawaf Bou-Rabee and Milo Marsden, GIST: Gibbs self-tuning for locally adaptive Hamiltonian Monte Carlo, now up on arXiv.

We followed the mathematicians’ alphabetical author-ordering convention.

The basic idea

The basic idea is so simple, I’m surprised it’s not more popular. The general GIST sampler couples the HMC algorithm tuning parameters (step size, number of steps, mass matrix) with the position (model parameters) and momentum. In each iteration of the Markov chain, we resample momentum using a Gibbs step (as usual in HMC), then we sample the tuning parameters in a second Gibbs step, conditioning on the current position and resampled momentum. The proposal is generated by the leapfrog algorithm using the sampled tuning parameters. The accept/reject step then uses the ratio of the joint densities, which now include the tuning parameters. Of course, we need to specify a conditional distribution of tuning parameters to make this concrete.
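
To fix ideas, here’s a minimal sketch of a single GIST iteration in Python. The helpers sample_tuning, tuning_log_prob, leapfrog, and log_joint are placeholders for whatever conditional tuning distribution, integrator, and target density you plug in (this is just an illustration of the framework, not code from our repository), and the momentum refresh assumes a unit mass matrix.

import numpy as np

def gist_step(theta, rng, sample_tuning, tuning_log_prob, leapfrog, log_joint):
    # Gibbs step 1: refresh the momentum (unit mass matrix assumed here).
    rho = rng.normal(size=theta.shape)
    # Gibbs step 2: draw tuning parameters alpha ~ p(alpha | theta, rho).
    alpha = sample_tuning(theta, rho, rng)
    # Deterministic proposal: leapfrog with the sampled tuning, then flip momentum.
    theta_star, rho_star = leapfrog(theta, rho, alpha)
    rho_star = -rho_star
    # Accept/reject with the joint density, which includes the tuning parameters;
    # a reverse conditional of zero (-inf on the log scale) forces a rejection.
    log_accept = (
        log_joint(theta_star, rho_star) + tuning_log_prob(alpha, theta_star, rho_star)
        - log_joint(theta, rho) - tuning_log_prob(alpha, theta, rho)
    )
    if np.log(rng.uniform()) < log_accept:
        return theta_star
    return theta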

The same idea could be applied to other Metropolis samplers.

Some prior art

We were inspired by a combination of NUTS and some negative results from Nawaf regarding acceptance bounds in delayed rejection generalized HMC (more on that later).

Radford Neal used the coupling idea for the Metropolis acceptance probability in his recent paper Non-reversibly updating a uniform [0,1] value for Metropolis accept/reject decisions. This is for generalized HMC, which is one-step HMC with partial momentum refresh. The partial refresh means the momentum flips in HMC matter and will typically thwart directed movement. Neal’s approach groups the acceptances together so that generalized HMC can make directed movement like HMC itself. Although this easily fits within the GIST framework, the proposal for acceptance probability doesn’t depend on the current position or momentum.

We (Chirag Modi, Gilad Turok and I) are working on a follow-up to my recent paper with Chirag and Alex Barnett on delayed rejection HMC. The goal there was to properly sample multiscale and varying scale densities (i.e., “stiff” Hamiltonians). In the follow-up, we’re using generalized HMC with delayed rejection rather than Radford’s coupling idea, and show that it can also generate directed exploration. The paper’s drafted and should be out soon. Spoiler alert! DR-G-HMC works better than DR-HMC in its ability to adjust to local changes in scale, and it’s competitive with NUTS in some cases where there are varying scales, but not in most problems.

Algorithms that randomize the number of steps (or stepsize), like those in Changye Wu and Christian Robert’s Faster Hamiltonian Monte Carlo by Learning Leapfrog Scale, can also be seen as instances of GIST. From this view, they couple the number of steps with the state and sample that number from a fixed uniform distribution each iteration. This doesn’t condition on the current position. The general framework immediately suggests biasing those draws toward longer jumps, for example by sampling uniformly from the second half of the trajectory.

Sam Livingstone was visiting the Flatiron Institute for the last couple of weeks, and when I ran the basic idea by him, he suggested I have a look at the paper by Chris Sherlock, Szymon Urbas, and Matthew Ludkin, The apogee to apogee path sampler. This is a really cool idea that uses U-turns in potential energy (negative log density), which is univariate, rather than in position space. They evolve the Hamiltonian forward and backward in time for a given number of U-turns in log density, and then sample from among the points on the path (like NUTS does). One issue is that a regular normal distribution on momentum can have trouble sampling through the range of log densities (see, e.g., Sam et al.’s paper on kinetic energy choice in HMC). Because we were going to be discussing his paper in our reading group, I wrote to Chris Sherlock and he told me they have been working on the apogee-to-apogee idea since before NUTS was released! It’s a very similar idea, with similar forward/backward in time balancing. The other idea it shares with NUTS is a biasing toward longer jumps—this is done in a really clever way that we can borrow for GIST. Milo and Nawaf figured out how NUTS and the apogee-to-apogee sampler can be fit into the GIST framework, which simplifies their correctness proofs. Once you’ve defined a proper conditional distribution for GIST, you’re done. The apogee-to-apogee paper is also nice in evaluating a randomized stepsize version of HMC they call “blurred” HMC (and which, of course, fits into the general GIST framework the same way the Wu and Robert sampler does).

If you know of other prior art or examples, we’d love to hear about them.

Our concrete alternative to NUTS

The alternative to NUTS we discuss in the GIST paper proposes a number of steps by iterating the leapfrog algorithm until a U-turn, then randomly sampling along the trajectory (like the improvement to the original NUTS that Michael Betancourt introduced or the apogee-to-apogee sampler). We then use a crude biasing mechanism (compared to the apogee-to-apogee sampler or NUTS) toward longer paths. That’s it, really. If you look at what that means for evaluating the acceptance probability, we have to run a trajectory backwards in time from the sampled point until a U-turn—you don’t really get away from that forward-and-backward thing in any of these samplers.
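
Here’s a stripped-down sketch of one iteration of that scheme in Python. The helpers uturn_steps, leapfrog, and log_joint are placeholders, and the lower bound L // 2 is just one way to bias toward the second half of the trajectory, not necessarily the exact biasing we use in the paper.

import numpy as np

def uturn_gist_step(theta, rho, rng, uturn_steps, leapfrog, log_joint):
    L = uturn_steps(theta, rho)                 # steps to one past the U-turn point
    LB = L // 2                                 # crude bias toward longer jumps
    N = rng.integers(LB, L)                     # uniform over {LB, ..., L - 1}
    theta_star, rho_star = leapfrog(theta, rho, N)
    rho_star = -rho_star                        # flip momentum so the move is reversible
    L_star = uturn_steps(theta_star, rho_star)  # the backward-in-time U-turn check
    LB_star = L_star // 2
    if not (LB_star <= N < L_star):             # N must be proposable from the other end
        return theta, rho                       # reject
    log_accept = (
        log_joint(theta_star, rho_star) - np.log(L_star - LB_star)
        - log_joint(theta, rho) + np.log(L - LB)
    )
    if np.log(rng.uniform()) < log_accept:
        return theta_star, rho_star
    return theta, rho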

We evaluate mean-square jump distance, rejections, and errors on parameter estimates and squared parameter estimates. It’s a little behind NUTS’s performance, but mostly in the same ballpark. In most cases, the variance among NUTS runs was greater than the difference between the mean NUTS and new algorithm run times. The evals demonstrate why it’s necessary to look at both parameter and squared parameter estimates. As draws become more anti-correlated, which happens when maximizing expected squared jump distance, estimates for parameters improve, but error goes way up on estimates of squared parameters. I provide an example in this Stan forum post, which was inspired by a visit with Wu and Robert to discuss their randomized steps algorithm. Also, before we submit to a journal, I need to scale all the root mean-square-error calculations to Z scores.
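
Concretely, one way to do that scaling is to report each error in units of the corresponding posterior standard deviation. Here’s a sketch (the names are hypothetical; the reference means and standard deviations would come from very long runs or analytic values):

import numpy as np

def z_scaled_error(draws, ref_mean, ref_sd):
    # Error of the posterior-mean estimate for each parameter, expressed in
    # units of the reference posterior standard deviation (a Z score).
    return (draws.mean(axis=0) - ref_mean) / ref_sd

# Apply to the parameters and to their squares, matching the evals above:
# z_theta = z_scaled_error(draws, ref_mean, ref_sd)
# z_theta_sq = z_scaled_error(draws**2, ref_mean_sq, ref_sd_sq)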

What we’re excited about here is that it’s going to be easy to couple step size adaptation. We might even be able to adapt the mass matrix this way and get a cheap approximation to Riemannian HMC.

Try it out

The evaluation code is up under an MIT License on my GitHub repo adaptive-hmc. I’m about to go add some more user-facing doc on how to run the evaluations. I’m really a product coder at heart, so I always find it challenging to hit the right level of doc/clarity/robustness/readability with research code. To get log densities and gradients from Stan models to develop our algorithm, we used BridgeStan.

I’m always happy to get suggestions on improving my code, so feel free to open issues or send me email.

Thank you!

This project would’ve been much harder if I hadn’t gotten feedback on the basic idea and code from Stan developer and stats prof Edward Roualdes. We also wouldn’t have been able to do this easily if Edward hadn’t developed BridgeStan. This kind of code is so prone to off-by-one, numerator-vs-denominator, negative-vs-positive, and log-vs-exponent mistakes that it makes my head spin. It doesn’t help that I’ve moved from R to Python, where the upper bounds are exclusive! Edward found a couple of dingers in my original code. Thanks also to Chirag for helping me understand the general framework and for his help with the code. Both Edward and Chirag are working on better alternatives to our simple alternative to NUTS, which will be showing up in the same adaptive HMC repo—just keep in mind this is all research code!

What’s next?

Hopefully we’ll be rolling out a bunch of effective GIST samplers soon. Or even better, maybe you will…

P.S. In-person math and computing

The turnaround time on this paper from conception to arXiv is about the fastest I’ve ever been involved with (outside of simple theoretical notes I used to dash out). I think the speed is from two things: (1) the idea is easy, and (2) Milo, Nawaf and I spent four days together at Rutgers about a month ago working pretty much full time on this paper (with some time-outs for teaching and talks!). We started with a basic idea, then worked out all the theory, developed the alternative to NUTS, and tried some variations over those four days. It’s very intense working like that, but it can be super productive. We even triple-coded on the big screen as we developed the algorithm and evaluated alternatives. Then we divided the work of writing the paper cleanly among us—as always, modularity is the key to scaling.

For that price he could’ve had 54 Jamaican beef patties or 1/216 of a conference featuring Gray Davis, Grover Norquist, and a rabbi

It’s the eternal question . . . what do you want, if given these three options:

(a) 54 Jamaican beef patties.

(b) 1/216 of a conference featuring some mixture of active and washed-up business executives, academics, politicians, and hangers-on.

(c) A soggy burger, sad-looking fries, and a quart of airport whisky.

The ideal would be to put it all together: 54 Jamaican beef patties at the airport, waiting for your flight to the conference to meet Grover Norquist’s rabbi. Who probably has a lot to say about the ills of modern consumerism.

I’d pay extra for airport celery if that’s what it took, but there is no airport celery so I bring it from home.

P.S. The above story is funny. Here’s some stuff that makes me mad.

Postdoc Opportunity at the HEDCO Institute for Evidence-Based Educational Practice in the College of Education at the University of Oregon

Emily Tanner-Smith writes:

Remote/Hybrid Postdoc Opportunity—join us as a Post-Doctoral Scholar at the HEDCO Institute for Evidence-Based Educational Practice in the College of Education at the University of Oregon!

The HEDCO Institute specializes in the conduct of evidence syntheses that meet the immediate decision-making demands of local, state, and national school leaders. Our work is carried out by a team of faculty and staff who work collaboratively with affiliated faculty at the UO and an external advisory board. The HEDCO Institute also provides research and outreach training and experience to students from across the COE and the UO.

We are looking for a new Post-Doctoral Scholar to join our team and contribute to our work aiming to close the gap between educational research and practice. The postdoc will work with Dr. Sean Grant on creating, maintaining, and disseminating living systematic reviews on school-based mental health prevention. Principal responsibilities include organizing and analyzing evidence synthesis data, collaborating with members from the larger institute team, participating in project meetings and conference calls, and working closely with Dr. Grant and other team members to achieve institute goals. Examples of these responsibilities include:
– Implementing protocols for evidence synthesis research data collection, data management, data analysis, and data presentation.
– Collecting and archiving evidence synthesis research data, ensuring integrity of data collection and archival procedures to ensure reproducibility and reuse.
– Analyzing data, interpreting results, and disseminating evidence to researchers and decision-makers (e.g., authoring/co-authoring technical reports, peer-reviewed journal articles, and policy briefs; delivering conference presentations and webinars).
– Assisting Dr. Grant with guidance for and oversight of undergraduate and graduate students at the institute.

We are seeking a highly motivated individual with a Ph.D. in relevant scientific field (including education, psychology, prevention science, or quantitative methodology) by start of position. Competitive candidates will have experience participating in evidence synthesis research projects (such as authoring or co-authoring a published systematic review), training in quantitative methods (particularly meta-analysis and data science), and proficiency with statistical analysis software (particularly R, RStudio, and Shiny) and evidence synthesis software (particularly DistillerSR).

This position is full-time for 1 year, with multi-year appointments possible contingent upon receipt of ongoing funding. This position is housed in the Eugene COE building, though remote/hybrid working options are also available for the entirety of the position (several current team members work remotely). The desired start date is August 2024 and expected salary range is $60,000 – $69,000. The position will have a formal mentor plan and involve professional development opportunities throughout the appointment to improve evidence synthesis and knowledge mobilization skills.

The University of Oregon is an equal opportunity, affirmative action institution committed to cultural diversity and compliance with the ADA. All qualified individuals are encouraged to apply! Applications will be reviewed on a rolling basis. For full consideration, please apply to our open pool by May 24, 2024. Please contact [email protected] if you have any questions about this opportunity.

I don’t post every job ad that’s sent to me, but this one seemed particularly relevant, as it has to do with evidence synthesis in policy analysis. The announcement doesn’t mention Stan, but I can only assume that experience with Bayesian modeling and Stan would be highly relevant to the job.

What is your superpower?

After writing this post, I was thinking that my superpower as a researcher is my willingness to admit I’m wrong, which gives me many opportunities to learn and do better (see for example here or here). My other superpower is my capacity to be upset, which has often led me to think deeper about statistical questions (for example here).

That’s all fine, but then it struck me that, whenever people talk about their “superpower,” they always seem to talk about qualities that just about anyone could have.

For example, “My superpower is my ability to listen to people,” or “My superpower is that I always show up on time.” Or the classic “Sleep is your superpower.”

A quick google yields, “The superpower question invites you to single out a quality that has made it possible for you to achieve, and to give an example of a goal that you were able to reach as a result. Our first tip is to choose a simple but strong and effective superpower, for example: Endurance, strength or resilience.”

And “My superpower is the fact I am STRONG, DETERMINED, AND RESILIENT.”

And this: “Your superpower is your contribution—the role that you’re put on this Earth to fill. It’s what you do better than anyone else and tapping into it will not only help your team, but you’ll find your work more satisfying, too.” Which sounds different, but then it continues with these examples: Empathy, Systems Thinking, Creative Thinking, Grit, and Decisiveness.

I’m reminded of that Ben Stiller movie where he played a superhero whose superpower was that he could get really annoyed. Kinda like a Ben Stiller character, actually!

How it could be

OK, superheroes aren’t real. So it’s not like people can say their superpower is flying, or invisibility, or breathing underwater, or being super-elastic, etc.

But . . . lots of people do have special talents. So you could imagine people saying that their superpower is that they have a really good memory, or they’re really good at learning languages, or that they’re really flexible, or some other property which, if not superhuman or even unique, is at least unusual and special. Instead you get things like “grit” or “sleep.”

And, as noted above, even in my own thinking, I was saying that my superpower is the commonplace ability to admit I’m wrong, or the characteristic of being easily upset. I could’ve said that my superpower is my mathematical talent or my ability to rapidly spin out ideas onto the page—but I didn’t!

I don’t know what this all means, but it seems like a funny thing that “superpower” is so often used to refer to commonplace habits that just about anyone could develop. I mean, sure, it fits with the whole growth-mindset thing: If I say that my superpower is that I can admit I’m wrong or that I work really hard, then anyone can emulate that. If I say that my superpower is that math comes easy to me, well, that’s not something you can do much with, if you don’t happen to have that superpower yourself.

So, yeah, I kind of get it. Still it seems off that, without even thinking about it, we use the term “superpower” for these habits and traits that are valuable but are pretty much the opposite of superpowers.

Storytelling and Scientific Understanding (my talks with Thomas Basbøll at Johns Hopkins this Friday)

Fri 26 Apr, 10am in Shriver Hall Boardroom and 5pm in Gilman Hall 50 (see also here):

Storytelling and Scientific Understanding

Andrew Gelman and Thomas Basbøll

Storytelling is central to science, not just as a tool for broadcasting scientific findings to the outside world, but also as a way that we as scientists understand and evaluate theories. We argue that, for this purpose, a story should be anomalous and immutable; that is, it should be surprising, representing some aspect of reality that is not well explained by existing models of the world, and have details that stand up to scrutiny.

We consider how this idea illuminates some famous stories in social science involving soldiers in the Alps, Chinese boatmen, and trench warfare, and we show how it helps answer literary puzzles such as why Dickens had all those coincidences, why authors are often so surprised by what their characters come up with, and why the best alternative history stories have the feature that, in these stories, our “real world” ends up as the deeper truth. We also discuss connections to chatbots and human reasoning, stylized facts and puzzles in science, and the millionth digit of pi.

At the center of our framework is a paradox: learning from anomalies seems to contradict usual principles of science and statistics where we seek representative or unbiased samples. We resolve this paradox by placing learning-within-stories into a hypothetico-deductive (Popperian) framework, in which storytelling is a form of exploration of the implications of a hypothesis. This has direct implications for our work as a statistician and a writing coach.

Basbøll and I have corresponded and written a couple papers together, but we’ve never met before this!

I posted on these talks a few months ago; reposting now because it’s coming up soon.

Decorative statistics and historical records

Sean Manning points to this remark from Matthew “not the musician” White:

I [White] am sometimes embarrassed by where I have been forced to find my statistics … Often, the only place to find numbers is in a newspaper article, almanac, chronicle or encyclopedia which needs to summarize major events into a few short sentences or into one scary number, and occasionally I get the feeling that some writers use numbers as pure rhetorical flourishes. To them, “over a million” does not mean “>10^6”; it’s just synonymous with “a lot”.

White was sooooo close to picking up on the concept of decorative statistics.

Now here’s a tour de force for ya

In social science, we’ll study some topic, then move on to the next thing. For example, Yotam and I did this project on social penumbras and political attitudes, we designed a study, collected data, analyzed the data, wrote it up, eventually it was published—the whole thing took years! and we were very happy with the results—and then we moved on. The idea is that other people will pick up the string. There were lots of little concerns, issues of measurement, causal identification, generalization, etc., and we discussed these in our paper, again hoping that these will be useful leads to further researchers.

And that’s how it often goes. Sometimes we return to old problems (for example, we wrote a paper on incumbency advantage in 1990 and followed up 18 years later), and we’re still working on R-hat (over 30 years after I first came up with the idea), but, even there, we’re typically not working with continuous focus.

The opposite approach in science is to drill down obsessively on a single phenomenon, to really pin it down. I think this is what historians do when they immerse themselves in some archive for a decade and then emerge to write the definitive book on the topic.

Here’s an example, not from history but from cognitive psychology, by Andrew Meyer and Shane Frederick:

This paper presents 59 new studies (N=72,310) which focus primarily on the “bat and ball problem.” It documents our attempts to understand the determinants of the erroneous intuition, our exploration of ways to stimulate reflection, and our discovery that the erroneous intuition often survives whatever further reflection can be induced. Our investigation helps inform conceptions of dual process models, as “system 1” processes often appear to override or corrupt “system 2” processes. Many choose to uphold their intuition, even when directly confronted with simple arithmetic that contradicts it – especially if the intuition is approximately correct.

The paper contains the charming ASCII graphic reproduced above (for example, page 8 here). I love ASCII graphics! Regarding the paper, Frederick writes:

One thing I’m proud of is summarizing 59 studies in just 9 pages. Another thing I like, and you’ll probably like, is that when sample sizes get large enough (and we have some pretty large ones), psychology starts to look like physics.

What really impresses me about the paper is not the sample size but the obsessiveness of the project. And I mean that in a good way.

Analogy between (a) model checking in Bayesian statistics, and (b) the self-correcting nature of science.

This came up in a discussion thread a few years ago. In response to some thoughts from Danielle Navarro about the importance of model checking, I wrote:

This makes me think of an analogy between the following two things:

– Model checking in Bayesian statistics, and

– The self-correcting nature of science.

The story of model checking in Bayesian statistics is that the fact that Bayesian inference can give ridiculous answers is a good thing, in that, when we see the ridiculous answer, this signals to us that there’s a problem with the model, and we can go fix it. This is the idea that we would rather have our methods fail loudly than fail quietly. But this all only works if, when we see a ridiculous result, we confront the anomaly. It doesn’t work if we just accept the ridiculous conclusion without questioning it, and it doesn’t work if we shunt the ridiculous conclusion aside and refuse to consider its implications.

Similarly with the self-correcting nature of science. Science makes predictions which can be falsified. Scientists make public statements, many (most?) of which will eventually be proved wrong. These failures motivate re-examination of assumptions. That’s the self-correcting nature of science. But it only works if individual scientists do this (notice anomalies and explore them) and it only works if the social structure of science allows it. Science doesn’t self-correct if scientists continue to stand by refuted claims, and it doesn’t work if they attack or ignore criticism.

In short, science is self-correcting, but only if “science”—that is, the people and the institutions of science—do that correction.

Similarly, statistical methods are checkable, but only if the users of these methods actually check them, and only if the developers of these methods develop methods for users to perform these checks. Which is where I come in, as a methodologist.

As Thomas Bayes famously said, with great power comes great responsibility.

The data are on a 1-5 scale, the mean is 4.61, and the standard deviation is 1.64 . . . What’s so wrong about that??

James Heathers reports on the article, “Contagion or restitution? When bad apples can motivate ethical behavior,” by Gino, Gu, and Zhong (2009):

There is some sentiment data reported in Experiment 3, which seems to be reported in whole units.

They also indicated how guilty they would feel about the behavior of the person who took all the money along with some unrelated emotional measures (1 = not at all, 5 = very much)… participants in the in-group selfish condition felt more guilty (M = 4.61, SD = 1.64) about the person’s selfish behavior than the participants in the out-group selfish condition (M = 3.26, SD = 1.54), t(80) = 3.82, p < .001.

If you have a 1 to 5 scale, it isn’t possible to have M = 4.61, SD = 1.64.

Huh? Really? Yeah!

Let’s work it out. If your measurements are on a 1-5 scale, the way to maximize their standard deviation for any given mean is to put the data all at 1 and 5. If the mean is 4.61, that would imply that (4.61 – 1)/(5 – 1) = 0.9025 of the data take on the value 5, and 1 – 0.9025 = 0.0975 take on the value 1. (Just to check, 0.0975*1 + 0.9025*5 = 4.61.)

For this extreme dataset, the standard deviation is sqrt(0.0975*(1 – 4.61)^2 + 0.9025*(5 – 4.61)^2) = 1.19. So, yeah, there’s no way to get a standard deviation of 1.64 from these data. Just not possible!

Just to make sure, we can check our calculation via simulation:

n <- 1e6
y <- sample(c(1,5), n, replace=TRUE, prob=c(0.0975, 0.9025))
print(c(mean(y), sd(y)))

Here's what we get:

[1] 4.610172 1.186317

Check.

OK, let's try one more thing. Maybe n is so small that there's some kinda 1/sqrt(n-1) thing in the denominator driving the result? I don't think so. The trouble is that, to get a mean of 4.61, you need enough data (in his post, Heathers guesses "n=41 (as 189/41 = 4.6098)") that the difference between 1/sqrt(n) and 1/sqrt(n-1) wouldn't be enough to take you from 1.19 all the way up to 1.64 or even close. Also, it's kinda implausible that all the observations would be 1's and 5's anyway.

So what happened?

It's always easier to figure out what didn't happen than to figure out what did happen.

Here are some speculations.

One possibility is a typo, but Heathers doubts that because other calculations in that paper are consistent with the above-reported impossible numbers.

A related possibility is that this was a typo that was then propagated into the rest of the paper. For example, the mean was actually 3.61, it was typed in the paper as 4.61, and then this typed-in number was used in later calculations. This would be bad workflow---you want all the computations to be done in a single script---but people use bad workflow all the time. I use bad workflow myself sometimes and end up with wrong numbers or wrongly-labeled graphs.

Another possibility is that the mean and standard deviation were calculated from two different datasets. That might sound kind of weird, but it can happen all the time, due to sloppiness or because of goofs in data processing. For example, you read in the data, calculate the mean and standard deviation for each variable, then perform some data-exclusion rule, perhaps removing data with incomplete responses to some of the questions, then you do further statistical analysis, recalculating the mean and standard deviation, among other things---but then when you pull together your numbers, you take the mean from some place and the standard deviation from the other place.

Yet another possibility is that someone involved in the data analysis or writeup was cheating in order to get a statistically-significant and thus publishable result, for example changing 3.61 to 4.61 to get a big fat difference but not touching the standard deviation. This would be a great way to cheat, because if you get caught, you can just say that you made a typo!

In any case, it's a fun little statistics example. And it's worth checking your data, even if you have no suspicion of cheating. I've often had incoherent data in problems I've worked on. Lots of things can go wrong in data processing and analysis, and we have to check things in all sorts of ways.

Infovis, infographics, and data visualization: My thoughts 12 years later

I came across this post from 2011, “Infovis, infographics, and data visualization: Where I’m coming from, and where I’d like to go,” and it seemed to make sense to reassess where we are now, 12 years later.

From 2011:

I majored in physics in college and I worked in a couple of research labs during the summer. Physicists graph everything. I did most of my plotting on graph paper–this continued through my second year of grad school–and became expert at putting points at 1/5, 2/5, 3/5, and 4/5 between the x and y grid lines.

In grad school in statistics, I continued my physics habits and graphed everything I could. I did notice, though, that the faculty and the other students were not making a lot of graphs. I discovered and absorbed the principles of Cleveland’s The Elements of Graphing Data.

In grad school and beyond, I continued to use graphs in my research. But I noticed a disconnect in how statisticians thought about graphics. There seemed to be three perspectives:

1. The proponents of exploratory data analysis liked to graph raw data and never think about models. I used their tools but was uncomfortable with the gap between the graphs and the models, between exploration and analysis.

2. From the other direction, mainstream statisticians–Bayesian and otherwise–did a lot of math and fit a lot of models (or, as my ascetic Berkeley colleagues would say, applied a lot of procedures to data) but rarely made a graph. They never seemed to care much about the fit of their models to data.

3. Finally, textbooks and software manuals featured various conventional graphs such as stem-and-leaf plots, residual plots, scatterplot matrices, and q-q plots, all of which seemed appealing in the abstract but never did much for me in the particular applications I was working on.

In my article with Meng and Stern, and in Bayesian Data Analysis, and then in my articles from 2003 and 2004, I have attempted to bring these statistical perspectives together by framing exploratory graphics as model checking: a statistical graph can reveal the unexpected, and “the unexpected” is defined relative to “the expected”–that is, a model. This fits into my larger philosophy that puts model checking at the center of the statistical enterprise.

Meanwhile, my graphs have been slowly improving. I realized awhile ago that I didn’t need tables of numbers at all. And here and there I’ve learned of other ideas, for example Howard Wainer’s practice of giving every graph a title.

I continued with some scattered thoughts about graphics and communication:

A statistical graph does not stand alone. It needs some words to go along with it to explain it. . . . I realized that our plots, graphically strong though they were, did not stand on their own. . . . This experience has led me to want to put more effort into explaining every graph, not merely what the points and lines are indicating (although that is important and can be hard to figure out in many published graphs) but also what is the message the graph is sending.

Most graphs are nonlinear and don’t have a natural ordering. A graph is not a linear story or a movie you watch from beginning to end; rather, it’s a cluttered house which you can enter from any room. The perspective you pick up if you start from the upstairs bathroom is much different than what you get by going through the living room–or, in graphical terms, you can look at clusters of points and lines, you can look at outliers, you can make lots of different comparisons. That’s fine but if a graph is part of a scientific or journalistic argument it can help to guide the reader a bit–just as is done automatically in the structuring of words in an article. . . .

While all this was happening, I also was learning more about decision analysis. In particular, Dave Krantz convinced me that the central unit of decision analysis is not the utility function or even the decision tree but rather the goal.

Applying this idea to the present discussion: what is the goal of a graph? There can be several, and there’s no reason to suppose that the graph that is best for achieving one of these goals will be optimal, or even good, for another. . . .

I’m a statistician who loves graphs and uses them all the time, I’m continually working on improving my graphical presentation of data and of inferences, but I’m probably stuck (without realizing it) in a bit of a rut of dotplots and lineplots. I’m aware of an infographics community . . .

Here’s an example of where I’m coming from: a blog post entitled, “Is the internet causing half the rapes in Norway? I wanna see the scatterplot.” To me, visualization is not an adornment or a way of promoting social science. Visualization is a central tool in social science research. (I’m not saying visualization is strictly necessary–I’m sure you can do a lot of good work with no visual sense at all–but I think it’s a powerful approach, and I worry about people who believe social science claims that they can’t visualize. I worry about researchers who believe their own claims without understanding them well enough to visualize the relation of these claims to the data from which they are derived.)

The rest of my post from 2011 discusses my struggles in communicating with the information visualization community–these are people who produce graphs for communication with general audiences, which motivates different goals and tools than those used by statisticians to communicate as part of the research process. Antony Unwin and I wrote a paper about these differences which was ultimately published with discussion in 2013 (and here is our rejoinder to the discussions).

Looking at all this a decade later, I’m not so interested in non-statistical information visualization anymore. I don’t mean this in a disparaging way! I think infofiz is great. Sometimes the very aspects of an infographic that make it difficult to read and deficient from a purely statistical perspective are a benefit for communication in that they can push the reader into thinking in new ways; here’s an example we discussed from a few years ago.

I continue to favor what we call the click-through solution: Start with the infographic, click to get more focused statistical graphics, click again to get the data and sources. But, in any case, the whole stat graphics vs. infographics thing has gone away, I guess because it’s clear that they can coexist; I don’t really see them as competing.

Whassup now?

Perhaps surprisingly, my graphical practices have remained essentially unchanged since 2011. I say “perhaps surprisingly,” because other aspects of my statistical workflow have changed a lot during this period. My lack of graphical progress is probably a bad thing!

A big reason for my stasis in this regard, I think, is that I’ve worked on relatively few large applied projects during the past fifteen years.

From 2004 through 2008, my collaborators and I were working every day on Red State Blue State. We produced hundreds of graphs and the equivalent of something like 10 or 20 research articles. In addition to our statistical goals of understanding our data and how they related to public opinion and voting, we knew from the start that we wanted to communicate both to political scientists and to the general public, so we were on the lookout for new ways to display our data and inferences. Indeed, we had the idea for the superplot before we ever made the actual graph.

Since 2008, I’ve done lots of small applied analyses for books and various research projects, but no big project requiring a rethinking of how to make graphs. The closest thing would be Stan, and here we have made some new displays–at least, new to me–but that work was done by collaborators such as Jonah Gabry, who did ShinyStan, and this hasn’t directly affected the sorts of graphs that I make.

I continue to think about graphs in new ways (for example, causal quartets and the ladder of abstraction), but, as can be seen in those new papers, the looks of my graphs haven’t really changed since 2011.

“Close but no cigar” unit tests and bias in MCMC

I’m coding up a new adaptive sampler in Python, which is super exciting (the basic methodology is due to Nawaf Bou-Rabee and Tore Kleppe). Luckily for me, another great colleague, Edward Roualdes, has been keeping me on the straight and narrow by suggesting stronger tests and pointing out actual bugs in the repository (we’ll make it public when we put the arXiv paper up—hopefully by the end of the month).

There are a huge number of potential fencepost (off by one), log-vs-exponential, positive-vs-negative, numerator-vs-denominator, and related errors to make in this kind of thing. For example, here’s a snippet of the transition code.

L = self.uturn(theta, rho)
LB = self.lower_step_bound(L)
N = self._rng.integers(LB, L)
theta_star, rho_star = self.leapfrog(theta, rho, N)
rho_star = -rho_star
Lstar = self.uturn(theta_star, rho_star)
LBstar = self.lower_step_bound(Lstar)
if not(LBstar <= N and N < Lstar):
    ... reject ...

Looks easy, right? Not quite. The uturn function returns the number of steps to get to a point that is one step past the U-turn point. That is, if I take L steps from (theta, rho), I wind up closer to where I started than if I take L - 1 steps. The rng.integers function samples uniformly, but it’s Python, so it excludes the upper bound and samples from {LB, LB + 1, ..., L - 1}. That’s correct, because I want to choose a number of steps that is at least the lower bound and less than the point past which you’ve made a U-turn. Let’s just say I got this wrong the first time around.

Because it’s MCMC and I want a simple proof of correctness, I have to make sure the chain’s reversible. So I see how many steps to get one past a U-turn coming back (after momentum flip), which is Lstar. Now I have to grab its lower bound, and make sure that I take a number of steps between the lower bound (inclusive) and upper bound (exclusive). Yup, had this wrong at one point. But the off-by-one error shows up in a position that is relatively rare given how I was sampling.

For more fun, we have to compute the acceptance probability. In theory, it’s just p(theta_star, rho_star, N) / p(theta, rho, N) in this algorithm, which looks as follows on the log scale.

log_accept = (
    self.log_joint(theta_star, rho_star) - np.log(Lstar - LBstar)
    - (log_joint_theta_rho - np.log(L - LB))
)

That’s because p(N | theta_star, rho_star) = 1 / (Lstar - LBstar) given the uniform sampling with Lstar excluded and LBstar included. But then I replaced the uniform distribution with a binomial, and made the following mistake.

log_accept = (
  self.log_joint(theta_star, rho_star) - self.length_log_prob(N, Lstar)
  - (log_joint_theta_rho - self.length_log_prob(N, L))
)

I only had the negation in -np.log(L - LB) because it was equivalent to np.log(1 / (L - LB)) with a subtraction instead of a division. Luckily Edward caught this one in the code review. I should’ve just coded the log density and added it rather than subtracted it. Now you’d think this would lead to an immediate and glaring bug in the results because MCMC is a delicate algorithm. In this case, the issue is that (N - L) and (N - Lstar) are identically distributed and only range over values of roughly 5 to 7, so the sign error makes only a minor difference in a stochastic acceptance probability that’s already high. How hard was this to detect? With 100K iterations, everything looked fine. With 1M iterations, the error in the parameter estimates continued to follow the expected 1 / sqrt(iterations) trend, but the error in the estimates of the squared parameters flattened out at a residual level after about 100K iterations. That is, it required 1M iterations and an evaluation of the means of squared parameters to detect this bug.

I then introduced a similar error when I went to a binomial number-of-steps selection. I was using sp.stats.binom.logpmf(N, L, self._success_prob) when I should have been using sp.stats.binom.logpmf(N, L - 1, self._success_prob). As an aside, I like SciPy’s clear naming here vs. R’s dbinom(..., log = TRUE). What I don’t like about Python is that the discrete uniform doesn’t include its endpoint. The binomial, of course, includes its endpoint as a possible outcome, so these two versions need to be coded off by 1. Naturally, I missed the L - 1. This only introduced a bug because I didn’t do the matching adjustment in testing whether things were reversible. That’s if not(1 <= N and N < Lstar) to match the Lstar - 1 in the logpmf() call. If I had run it all the way to L, then I would’ve needed N <= Lstar. This is another subtle difference that only shows up after more than 100K iterations.
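
To spell out the endpoint mismatch in one place, here’s a small self-contained sketch; the numbers are made up and success_prob stands in for self._success_prob.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
L, LB, success_prob = 10, 5, 0.7    # made-up values, just to show the supports

# Discrete uniform: Generator.integers excludes the upper bound, so this
# samples from {LB, ..., L - 1} and the log probability is -log(L - LB).
N_unif = rng.integers(LB, L)
log_q_unif = -np.log(L - LB)

# Binomial: the binomial includes its endpoint, so to keep the same
# support {0, ..., L - 1} the number of trials has to be L - 1.
N_binom = rng.binomial(L - 1, success_prob)
log_q_binom = stats.binom.logpmf(N_binom, L - 1, success_prob)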

We introduced a similar problem into Stan in 2016 when we revised NUTS to do multinomial sampling rather than slice sampling. It was an off-by-one error on trajectory length. All of our unit tests of roughly 10K iterations passed. A user spotted the bug by fitting a 2D correlated normal with known correlation for 1M iterations as a test and realizing estimates were off by 0.01 when they should've had smaller error. We reported this on the blog back when it happened, culminating in the post Michael found the bug in Stan's new sampler.

I was already skeptical of empirical results in papers and this is making me even more skeptical!

P.S. In case you don't know the English idiom "close but no cigar", here's the dictionary definition from Cambridge (not Oxford!).

Do research articles have to be so one-sided?

It’s standard practice in research articles as well as editorials in scholarly journals to present just one side of an issue. That’s how it’s done! A typical research article looks like this:

“We found X. Yes, we really found X. Here are some alternative explanations for our findings that don’t work. So, yeah, it’s really X, it can’t reasonably be anything else. Also, here’s why all the thickheaded previous researchers didn’t already find X. They were wrong, though, we’re right. It’s X. Indeed, it had to be X all along. X is the only possibility that makes sense. But it’s a discovery, it’s absolutely new. As was said of the music of Beethoven, each note is prospectively unexpected but retrospectively absolutely right. In conclusion: X.”

There also are methods articles, which go like this:

“Method X works. Here’s a real problem where method X works better than anything else out there. Other methods are less accurate or more expensive than X, or both. There are good theoretical reasons why X is better. It might even be optimal under some not-too-unreasonable conditions. Also, here’s why nobody tried X before. They missed it! X is, in retrospect, obviously the right thing to do. Also, though, X is super-clever: it had to be discovered. Here are some more examples where X wins. In conclusion: X.”

Or the template for a review article:

“Here’s a super-important problem which has been studied in many different ways. The way we have studied it is the best. In this article, we also discuss some other approaches which are worse. Our approach looks even better in this contrast. In short, our correct approach both flows naturally from and is a bold departure from everything that came before.”

OK, sometimes we try to do better. We give tentative conclusions, we accept uncertainty, we compare our approach to others on a level playing field, we write a review that doesn’t center on our own work. It happens. But, unless you’re Bob Carpenter, such an even-handed approach doesn’t come naturally, and, as always with this kind of adjustment, there’s always the concern of going too far (“bending over backward”) in the other direction. Recall my criticism of the popular but I think bogus concept of “steelmanning.”

So, yes, we should try to be more balanced, especially when presenting our own results. But the incentives don’t go in that direction, especially when your contributions are out there fighting with lots of ideas that other people are promoting unreservedly. Realistically, often the best we can do is to include Limitations sections in otherwise-positive papers.

One might think that a New England Journal of Medicine editorial could do better, but editorials have the same problem as review articles, which is that the authors will still have an agenda.

Dale Lehman writes in, discussing such an example:

A recent article in the New England Journal of Medicine caught my interest. The authors – a Harvard economist and a McKinsey consultant (properly disclosed their ties) – provide a variety of ways that AI can contribute to health care delivery. I can hardly argue with the potential benefits, and some areas of application are certainly ripe for improvements from AI. However, the review article seems unduly one-sided. Almost all of the impediments to application that they discuss lay the “blame” on health care providers and organizations. No mention is made about the potential errors made by AI algorithms applied in health care. This I found particularly striking since they repeatedly appeal to AI use in business (generally) as a comparison to the relatively slow adoption of AI in health care. When I think of business applications, a common error might be a product recommendation or promotion that was not relevant to a consumer. The costs of such a mistake are generally small – wasted resources, unhappy customers, etc. A mistake made by an AI recommendation system in medicine strikes me as quite a bit more serious (lost customers is not the same thing as lost patients).

To that point, the article cites several AI applications to prediction of sepsis (references 24-27). That is a particular area of application where several AI sepsis-detection algorithms have been developed, tested, and reported on. But the references strike me as cherry-picked. A recent controversy has concerned the Epic model (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8218233/?report=classic) where the company reported results were much better than the attempted replication. Also, there was a major international challenge (PhysioNet: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6964870/) where data was provided from 3 hospital systems, 2 of which provided the training data for the competition and the remaining system was used as the test data. Notably, the algorithms performed much better on the systems for which the training data was provided than for the test data.

My question really concerns the role of the NEJM here. Presumably this article was peer reviewed – or at least reviewed by the editors. Shouldn’t the NEJM be demanding more balanced and comprehensive review articles? It isn’t that the authors of this article say anything that is wrong, but it seems deficient in its coverage of the issues. It would not have been hard to acknowledge that these algorithms may not be ready for use (admittedly, they may outperform existing human models, but that is an area on which there is research and it should be noted in the article). Nor would it be difficult to point out that algorithmic errors and biases in health care may be a more serious matter than in other sectors of the economy.

Interesting. I’m guessing that the authors of the article were coming from the opposite direction, with a feeling that there’s too much conservatism regarding health-care innovation and they wanted to push back against that. (Full disclosure: I’m currently working with a cardiologist to evaluate a machine-learning approach for ECG diagnosis.)

In any case, yes, this is part of a general problem. One thing I like about blogging, as opposed to scholarly writing or journalism, is that in a blog post there’s no expectation or demand or requirement that we come to a strong conclusion. We can let our uncertainty hang out, without some need to try to make “the best possible case” for some point. We may be expected to entertain, but that’s not so horrible!

N=43, “a statistically significant 226% improvement,” . . . what could possibly go wrong??

Enjoy.

They looked at at least 12 cognitive outcomes, one of which had p = 0.02, but other differences “were just shy of statistical significance.” Also:

The degree of change in the brain measure was not significantly correlated with the degree of change in the behavioral measure (p > 0.05) but this may be due to the reduced power in this analysis which necessarily only included the smaller subset of individuals who completed neuropsychological assessments during in-person visits.

This is one of the researcher degrees of freedom we see all the time: an analysis with p > 0.05 can be labeled as “marginally statistically significant” or even published straight-up as a main result (“P < 0.10”), it can get some sort of honorable mention (“this may be due to the reduced power”), or it can be declared to be a null effect.

The “this may be due to the reduced power” thing is confused, for two reasons. First, of course it’s due to the reduced power! Set n to 1,000,000,000 and all your comparisons will be statistically significant! Second, the whole point of having these measures of sampling and measurement error is to reveal the uncertainty in an estimate’s magnitude and sign. It’s flat-out wrong to take a point estimate and just suppose that it would persist under a larger sample size.

People are trained in bad statistical methods, so they use bad statistical methods, it happens every day. In this one, I’m just bothered that this “226% improvement” thing didn’t set off any alarms. To the extent that these experimental results might be useful, the authors should be publishing the raw data rather than trying to fish out statistically significant comparisons. They also include a couple of impressive-looking graphs which wouldn’t look so impressive if they were to graph all the averages in the data rather than just those that randomly exceeded a significance threshold.

Did they publish the raw data? No! Here’s the Data availability statement:

The datasets presented in this article are not readily available because due to reasonable privacy and security concerns, the underlying data are not easily redistributable to researchers other than those engaged in the current project’s Institutional Review Board-approved research. The corresponding author may be contacted for an IRB-approved collaboration. Requests to access the datasets should be directed to …

It seems like it would be pretty trivial to remove names and any other identifying information and then release the raw data. This is a study on “whether older adults retain or improve their cognitive ability over a six-month period after daily olfactory enrichment at night.” What’s someone gonna do, track down participants based on their “daily exposure to essential oil scents”?

One problem here is that Institutional Review Boards are set up with a default no-approval stance. I think it should be the opposite: no IRB approval unless you commit ahead of time to posting your raw data. (Not that my collaborators and I usually post our raw data either. Posting raw data can be difficult. That’s one reason I think it should required, because otherwise it’s not likely to be done.)

No, it’s not “statistically implausible” when results differ between studies, or between different groups within a study.

James “not the cancer cure guy” Watson writes:

This letter by Thorlund et al. published in the New England Journal of Medicine is rather amusing. It’s unclear to me what their point is, other than the fact that they find the published results for the new COVID drug molnupiravir “statistically implausible.”

Background: The pharma company Merck got very promising results for molnupiravir at their interim analysis (~50% reduction in hospitalisation/death) but less promising results at their final analysis (30% reduction). Thorlund et al. were surprised that the data for the two study periods (before and after interim analysis) provided very different point estimates for benefit (goes the other way in the second period). They were also surprised to see inconsistent results when comparing across the different countries included in the study (non-overlapping confidence intervals).

They clearly had never read the subgroup analysis from the ISIS-2 trial: the authors convincingly showed that aspirin reduced vascular deaths in patients of all astrological birth signs except Gemini and Libra; see Figure 5 in this Lancet paper from 1988.

He’s not kidding—that Lancet paper really does talk about astrological signs. What the hell??

Regarding the letter in the New England Journal of Medicine, I guess the point is that different studies, and different groups within a study, have different patients and are conducted at different times and under different conditions, so it makes sense that they can have different outcomes, more different than would be expected to arise from pure chance when comparing two samples from an identical distribution. People often don’t seem to realize this, leading them to characterize differences from chance as “statistically implausible” etc. rather than as just representing underlying differences across patients, scenarios, and times.

As the authors of the original study put it in their response letter in the journal:

Given the shifts in prevailing SARS-CoV-2 variants, changes in outpatient management, and inclusion of trial sites from countries with unique Covid-19 disease burdens, the trial was not necessarily conducted under uniform conditions. The differences in the results between the interim and final analyses might be statistically improbable under ideal circumstances, but they reflect the fact that several key factors could not remain constant despite a consistent trial design.

Indeed.

Simulation to understand two kinds of measurement error in regression

This is all super-simple; still, it might be useful. In class today a student asked for some intuition as to why, when you’re regressing y on x, measurement error on x biases the coefficient estimate but measurement error on y does not.

I gave the following quick explanation:
– You’re already starting with the model, y_i = a + bx_i + e_i. If you add measurement error to y, call it y*_i = y_i + eta_i, and then you regress y* on x, you can write y*_i = a + bx_i + e_i + eta_i, and as long as eta is independent of e, you can just combine them into a single error term.
– When you have measurement error in x, two things happen to attenuate b—that is, to pull the regression coefficient toward zero. First, if you spread out x while keeping y unchanged, this will reduce the slope of y on x. Second, when you add noise to x you’re changing the ordering of the data, which will reduce the strength of the relationship.

But that’s all words (and some math). It’s simpler and clearer to do a live simulation, which I did right then and there in class!

Here’s the R code:

# simulation for measurement error
library("arm")
set.seed(123)
n <- 1000
x <- runif(n, 0, 10)
a <- 0.2
b <- 0.3
sigma <- 0.5
y <- rnorm(n, a + b*x, sigma)
fake <- data.frame(x,y)

fit_1 <- lm(y ~ x, data=fake)
display(fit_1)

sigma_y <- 1
fake$y_star <- rnorm(n, fake$y, sigma_y)
sigma_x <- 4
fake$x_star <- rnorm(n, fake$x, sigma_x)

fit_2 <- lm(y_star ~ x, data=fake)
display(fit_2)

fit_3 <- lm(y ~ x_star, data=fake)
display(fit_3)

fit_4 <- lm(y_star ~ x_star, data=fake)
display(fit_4)

x_range <- range(fake$x, fake$x_star)
y_range <- range(fake$y, fake$y_star)

par(mfrow=c(2,2), mar=c(3,3,2,1), mgp=c(1.5,.5,0), tck=-.01)
plot(fake$x, fake$y, xlim=x_range, ylim=y_range, bty="l", pch=20, cex=.5, main="No measurement error")
abline(coef(fit_1), col="red")
plot(fake$x, fake$y_star, xlim=x_range, ylim=y_range, bty="l", pch=20, cex=.5, main="Measurement error on y")
abline(coef(fit_2), col="red")
plot(fake$x_star, fake$y, xlim=x_range, ylim=y_range, bty="l", pch=20, cex=.5, main="Measurement error on x")
abline(coef(fit_3), col="red")
plot(fake$x_star, fake$y_star, xlim=x_range, ylim=y_range, bty="l", pch=20, cex=.5, main="Measurement error on x and y")
abline(coef(fit_4), col="red")

The resulting plot is at the top of this post.

I like this simulation for three reasons:

1. You can look at the graph and see how the slope changes with measurement error in x but not in y.

2. This exercise shows the benefits of clear graphics, including little things like making the dots small, adding the regression lines in red, labeling the individual plots, and using a common axis range for all four graphs.

3. It was fast! I did it live in class, and this is an example of how students, or anyone, can answer this sort of statistical question directly, with a lot more confidence and understanding than would come from a textbook and some formulas.

P.S. As Eric Loken and I discuss in this 2017 article, everything gets more complicated if you condition on "statistical significance."

P.P.S. Yes, I know my R code is ugly. Think of this as an inspiration: even if, like me, you’re a sloppy coder, you can still code up these examples for teaching and learning.

Intelligence is whatever machines cannot (yet) do

I had dinner a few nights ago with Andrew’s former postdoc Aleks Jakulin, who left the green fields of academia for entrepreneurship ages ago. Aleks was telling me he was impressed by the new LLMs, but then asserted that they’re clearly not intelligent. This reminded me of the old saw in AI that “AI is whatever a machine can’t do.”

In the end, the definition of “intelligent” is a matter of semantics. Semantics is defined by conventional usage, not by fiat (the exception seems to be an astronomical organization trying to change the definition of “planet” to make it more astronomically precise). We do this all the time. If you think about what “water” means, it’s incredibly vague. In the simplest case, how many minerals can it contain before we call it “mud” rather than “water”? Does it even have to be made of H2O if we can find a clear liquid on an alternative earth that will nourish us in the same way (this is a common example in philosophy from Hilary Putnam, I believe)? When the word “water” was first introduced into English, let’s just say that our understanding of chemistry was less developed than it is now. The word “intelligent” is no different. We’ve been using the term since before computers, and now we have to rethink what it means. By convention, we could decide as a group of language users to define “intelligent” however we want. Usually such decisions are guided by pragmatic considerations (or at least I’d like to think so—this is the standard position of pragmatist philosophers of language, like Richard Rorty). For instance, we could decide to exclude GPT because (a) it’s not embodied in the same way as a person, (b) it doesn’t have long-term memory, (c) it runs on silicon rather than cells, etc.

It would be convenient for benchmarking if we could fix a definition of “intelligence” to work with. What we do instead is just keep moving the bar on what counts as “intelligent.” I doubt people 50 years ago (1974) would have said you can play chess without being intelligent. But as soon as Deep Blue beat the human chess champion, everyone changed their tune and the chorus became “chess is just a game” and “it’s finite” and “it has well defined rules, unlike real life.” Then when IBM’s Watson trounced the world champion at Jeopardy!, a language-based game, it was dismissed as a parlor trick. Obviously because a machine can play Jeopardy!, the reasoning went, it doesn’t require intelligence.

Here’s the first hit on Google I found searching for something like [what machines can’t do]: an article by Toews in a popular magazine, not the scientific literature. It’s the usual piece in the genre of “ML is amazing, but it’s not intelligent because it can’t do X”.

Let’s go over Toews’s list of AI’s failures circa 2021 (these are direct quotes).

  1. Use “common sense.” A man went to a restaurant. He ordered a steak. He left a big tip. If asked what the man ate in this scenario, a human would have no problem giving the correct answer—a steak. Yet today’s most advanced artificial intelligence struggles with prompts like this.
     
  2. Learn continuously and adapt on the fly. Today, the typical AI development process is divided into two distinct phases: training and deployment.
     
  3. Understand cause and effect. Today’s machine learning is at its core a correlative tool. It excels at identifying subtle patterns and associations in data. But when it comes to understanding the causal mechanisms—the real-world dynamics—that underlie those patterns, today’s AI is at a loss.
     
  4. Reason ethically… In 2016, Microsoft debuted an AI personality on Twitter named Tay. The idea was for Tay to engage in online conversations with Twitter users as a fun, interactive demonstration of Microsoft’s NLP technology. It did not go well. Within hours, Internet trolls had gotten Tay to tweet a wide range of offensive messages: for instance, “Hitler was right” and “I hate feminists and they should all die and burn in hell.”

(1) ChatGPT-4 gets these common-sense problems mostly right. But it’s not logic. The man may have ordered a steak, gotten it, sent it back, ordered the fish instead, and still left a big tip. This is a problem with a lot of the questions posed to GPT about whether X follows from Y. It’s not a sound inference, just the most likely thing to happen, or as we used to say, the “default.” Older AIs were typically designed around sound inference and weren’t so much trying to emulate human imprecision (having said that, my grad school admissions essay was about default logics, and my postdoc was funded by a grant on them, back in the 1980s!).

(2) You can do in-context learning with ChatGPT, but it doesn’t retain anything long term without retraining/fine-tuning. It will certainly adapt to its task/listener on the fly throughout a conversation (arguably the current systems like ChatGPT adapt to their interlocutor too much—it’s what they were trained to do via reinforcement learning). Long-term memory is perhaps the biggest technical challenge to overcome, and it’s been interesting to see people going back to LSTM/recurrent NN ideas (transformers, the neural net architecture underlying ChatGPT, were introduced in a paper titled “Attention is all you need”, which used a long but finite memory).

(3) ChatGPT-4 is pretty bad at causal inference. But it’s probably above the bar set by Toews’s complaints. It’ll get simple “causal inference” right the same way people do. In general, humans are pretty bad at causal inference. We are way too prone to jump to causal conclusions based on insufficient evidence. Do we classify baseball announcers as not intelligent when they talk about how a player struggles with high-pressure situations after N = 10 plate appearances in the playoffs? We’re also pretty bad at reasoning about things that go against our preconceptions. Do we think Fisher was not intelligent because he argued that smoking didn’t cause cancer? Do we think all the anthropogenic global warming deniers are not intelligent? Maybe they’re right and it’s just a coincidence that temps have gone up coinciding with industrialization and carbon emissions. Seems like a highly suspicious coincidence, but causation is really hard when you can’t do randomized controlled trials (and even then it’s not so easy because of all the possible mediation).

(4) How you call this one depends on whether you think the front-line fine-tuning of ChatGPT made a reasonably helpful/harmless/truthful bot or not and whether the “ethics” it was trained with are yours. You can certainly jailbreak even ChatGPT-4 to send it spiraling into hate land or fantasy land. You can jailbreak some of my family in the same way, but I wouldn’t go so far as to say they weren’t intelligent. You can find lots of folks who think ChatGPT is too “woke”. This is a running theme on the GPT subreddit. It’s also a running theme among anti-woke billionaires, as reflected in the UK’s Daily Telegraph article title, “ChatGPT may be the next big thing, but it’s a biased woke robot.”

I’ve heard a lot of people say their dog is more intelligent than ChatGPT. I suppose they would argue for a version of intelligence that doesn’t require (1) or (4) and is very tolerant of poor performance in (2) and (3).

Evidence, desire, support

I keep worrying, as with a loose tooth, about news media elites who are going for the UFOs-as-space-aliens theory. This one falls halfway between election denial (too upsetting for me to want to think about too often) and belief in ghosts (too weird to take seriously).

I was also thinking about the movie JFK, which I saw when it came out in 1991. As a reader of the newspapers, I knew that the narrative pushed in the movie was iffy, to say the least; still, I watched the movie intently—I wanted to believe, in the same way that in the 1970s I wanted to believe those claims that dolphins are smarter than people, or the way millions of people wanted to believe in the Bermuda Triangle or ancient astronauts or Noah’s Ark or other fringe ideas that were big in that decade. None of those particular ideas appealed to me.

Anyway, this all got me thinking about what it takes for someone to believe in something. My current thinking is that belief requires some mixture of the following three things:
1. Evidence
2. Desire
3. Support

To go through these briefly:

1. I’m using the term “evidence” in a general sense to include things you directly observe and also convincing arguments of some sort or another. Evidence can be ambiguous and, much to people’s confusion, it doesn’t always point in the same direction. The unusual trajectory of Oswald’s bullet is a form of evidence, even though not as strong as has been claimed by conspiracy theories. The notorious psychology paper from 2011 is evidence for ESP. It’s weak evidence, really no evidence at all for anything beyond the low standards of academic psychology at the time, but it played the role of evidence for people who were interested in or open to believing.

2. By “desire,” I mean a desire to believe in the proposition at hand. There can be complicated reasons for this desire. Why did I have some desire in 1991 to believe the fake JFK story, even though I knew ahead of time it was suspect? Maybe because it helped make sense of the world? Maybe because, if I could believe the story, I could go with the flow of the movie and feel some righteous anger? I don’t really know. Why do some media insiders seem to have the desire to believe that UFOs are space aliens? Maybe because space aliens are cool, maybe because, if the theory is true, then these writers are in on the ground floor of something big, maybe because the theory is a poke in the eye at official experts, maybe all sorts of things.

3. “Support” refers to whatever social environment you’re in. 30% of Americans believe in ghosts, and belief in ghosts seems to be generally socially acceptable—I’ve heard people from all walks of life express the belief—but there are some places where it’s not taken seriously, such as in the physics department. The position of ghost-belief within the news media is complicated, typically walking a fine line to avoid expressing belief or disbelief. For example, a quick search of *ghosts npr* led to this from the radio reporter:

I’m pretty sure I don’t believe in ghosts. Now, I say pretty sure because I want to leave the possibility open. There have definitely been times when I felt the presence of my parents who’ve both died, like when one of their favorite songs comes on when I’m walking the aisles of the grocery store, or when the wind chime that my mom gave me sings a song even though there’s no breeze. But straight-up ghosts, like seeing spirits, is that real? Can that happen?

This is kind of typical. It’s a news story that’s pro-ghosts, reports a purported ghost sighting with no pushback, but there’s that kinda disclaimer too. It’s similar to reporting on religion. Different religions contradict each other, and so if you want to report in a way that’s respectful of religion, you have to place yourself in a no-belief-yet-no-criticism mode: if you have a story about religion X, you can’t push back (“Did you really see the Lord smite that goat in your backyard that day?”) because that could offend adherents of that religion, but you can’t fully go with it, as that could offend adherents of every other religion.

I won’t say that all three of evidence, desire, and support are required for belief, just that they can all contribute. We can see this with some edge cases. That psychologist who published the terrible paper on ESP: he had a strong desire to believe, a strong enough desire to motivate an entire research program on his part. There was also a little bit of institutional support for the belief. Not a lot—ESP is a fringe take that would be, at best, mocked by most academic psychologists, it’s a belief that has much lower standing now than it did fifty years ago—but some. Anyway, the strong desire was enough, along with the terrible-but-nonzero evidence and the small-but-nonzero support. Another example would be Arthur Conan Doyle believing those ridiculous faked fairy photos: spiritualism was big in society at the time, so he had strong social support as well as strong desire to believe. In other cases, evidence is king, but without the institutional support it can be difficult for people to be convinced. Think of all those “they all laughed, but . . .” stories of scientific successes under adversity: continental drift and all the rest.

As we discussed in an earlier post, the “support” thing seems like a big change regarding the elite media and UFOs-as-space-aliens. The evidence for space aliens, such as it is—blurry photographs, eyewitness testimony, suspiciously missing government records, and all the rest—has been with us for half a century. The desire to believe has been out there too for a long time. What’s new is the support: some true believers managed to insert the space aliens thing into the major news media in a way that gives permission to wanna-believers to lean into the story.

I don’t have anything more to say on this right now, just trying to make sense of it all. This all has obvious relevance to political conspiracy theories, where authority figures can validate an idea, which then gives permission for other wanna-believers to push it.

Delayed retraction sampling

Colby Vorland writes:

In case it is of interest, a paper we reported 3 years, 4 months ago was just retracted:

Retracted: Effect of Moderate-Intensity Aerobic Exercise on Hepatic Fat Content and Visceral Lipids in Hepatic Patients with Diabesity: A Single-Blinded Randomised Controlled Trial
https://www.hindawi.com/journals/ecam/2023/9829387/

Over this time, I was sent draft retraction notices on two occasions by Hindawi’s research integrity team that were then reneged for reasons that were not clear. The research integrity team stopped responding to me, but after I involved COPE, they eventually got it done. Happy to give more details. Our full team who helped with this one was Colby Vorland, Greyson Foote, Stephanie Dickinson, Evan Mayo-Wilson, David Allison, and Andrew Brown.

As stated in the retraction notice, here are the issues:

(i) There is no mention of the clinical trial registration number, NCT03774511 (retrospectively registered in December 2018), or that this was part of a larger study. Overall, there were three arms: a control, a high-intensity exercise group (HII) and a moderate-intensity exercise group (MIC), but only the control and MIC were reported in [1].

(ii) There is no indication that references 35 and 36 [4, 5] cited in the article draw on data from the same study participants and these references are incorrectly presented as separate studies supporting the findings of the article, which may have misled readers.

(iii) The authors have stated that recruitment and randomization occurred during August-December 2017, the HII and control arms were conducted during January-August 2018, and the MIC arm was run during August-December 2018, which is a non-standard study design and was not reported in any of the articles.

(iv) The data presented in Figure 1 and Tables 1 and 2 are identical to data presented in Abdelbasset et al. [5]. With respect to Figure 1 the study has been presented without the additional study arm shown in Abdelbasset et al. [5].

(v) The data in Table 2 is identical to that shown as the MIC study arm in Abdelbasset et al. [5]. However, the p values have been presented to three decimal places whereas in Abdelbasset et al. [5] they are presented to two decimal places [5]. The data also shows inconsistent rounding. There is a particular concern where 0.046 has been rounded down to 0.04 (and hence appears statistically significant) rather than rounding up, as has occurred with other values. In addition, several items shown as in Abdelbasset et al. [5] are shown as values less than 0.01 (i.e., <0.01, 0.004 and 0.002).

(vi) There are concerns with the accuracy of the statistical tests reported in the article, because the comparisons are of within-group differences rather than using valid between-group tests such as ANOVA. Many of the p-values reported in the article could not be replicated by Vorland et al. [3], and in particular they found no significant differences between treatment groups for BMI, IHTG, visceral adipose fat, total cholesterol, and triglycerides. This was confirmed by the authors’ reanalysis, apart from triglycerides for which there was a significant difference between treatment groups according to the authors’ reanalysis.

(vii) The age ranges are slightly inconsistent between the articles, despite the studies collectively reporting on the same participants: 45–60 in [1, 4] and 40–60 in [5]. The authors state that 40–60 years reflects the inclusion criteria for the study, whereas the actual age range of the included participants was 45–60 years.

(viii) Although this was a single clinical trial, different ethical approval numbers are given in each article: PT/2017/00-019 [1], PT/2017/00-018 [4], and P.TREC/012/002146 [5].

Also this from the published retraction:

The authors do not agree to the retraction and the notice.

I appreciate the effort by Vorland et al. I’ve done this sort of thing too on occasion, and other times I’ve asked a journal to publish a letter of correction but they’ve refused. Unfortunately, retraction and correction are not scalable. Literally zillions of scientific papers are published a year, and only a handful get retracted or corrected.

How large is that treatment effect, really? (my talk at NYU economics department Thurs 18 Apr 2024, 12:30pm)

19 W 4th Street, Room 517:

How large is that treatment effect, really?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

“Unbiased estimates” aren’t really unbiased, for a bunch of reasons, including aggregation, selection, extrapolation, and variation over time. Econometric methods typically focus on causal identification, with the goal of estimating “the” effect. But we typically care about individual effects (not “Does the treatment work?” but “Where and when does it work?” and “Where and when does it hurt?”). Estimating individual effects is relevant not only for individuals but also for generalizing to the population. For example, how do you generalize from an A/B test performed on a sample right now to possible effects on a different population in the future? Thinking about variation and generalization can change how we design and analyze experiments and observational studies. We demonstrate with examples in social science and public health.