Statistical Modeling, Causal Inference, and Social Science

For that price he could’ve had 54 Jamaican beef patties or 1/216 of a conference featuring Gray Davis, Grover Norquist, and a rabbi

Andrew — Wed, 24 Apr 2024 13:41:09 +0000

It’s the eternal question . . . what do you want, if given these three options:

(a) 54 Jamaican beef patties.

(b) 1/216 of a conference featuring some mixture of active and washed-up business executives, academics, politicians, and hangers-on.

The ideal would be to put it all together: 54 Jamaican beef patties at the airport, waiting for your flight to the conference to meet Grover Norquist’s rabbi. Who probably has a lot to say about the ills of modern consumerism.

I’d pay extra for airport celery if that’s what it took, but there is no airport celery so I bring it from home.

P.S. The above story is funny. Here’s some stuff that makes me mad.

Postdoc Opportunity at the HEDCO Institute for Evidence-Based Educational Practice in the College of Education at the University of Oregon

Andrew — Tue, 23 Apr 2024 23:18:42 +0000

Emily Tanner-Smith writes:

Remote/Hybrid Postdoc Opportunity—join us as a Post-Doctoral Scholar at the HEDCO Institute for Evidence-Based Educational Practice in the College of Education at the University of Oregon!

The HEDCO Institute specializes in the conduct of evidence syntheses that meet the immediate decision-making demands of local, state, and national school leaders. Our work is carried out by a team of faculty and staff who work collaboratively with affiliated faculty at the UO and an external advisory board. The HEDCO Institute also provides research and outreach training and experience to students from across the COE and the UO.

We are looking for a new Post-Doctoral Scholar to join our team and contribute to our work aiming to close the gap between educational research and practice. The postdoc will work with Dr. Sean Grant on creating, maintaining, and disseminating living systematic reviews on school-based mental health prevention. Principal responsibilities include organizing and analyzing evidence synthesis data, collaborating with members from the larger institute team, participating in project meetings and conference calls, and working closely with Dr. Grant and other team members to achieve institute goals. Examples of these responsibilities include:
– Implementing protocols for evidence synthesis research data collection, data management, data analysis, and data presentation.
– Collecting and archiving evidence synthesis research data, ensuring integrity of data collection and archival procedures to ensure reproducibility and reuse.
– Analyzing data, interpreting results, and disseminating evidence to researchers and decision-makers (e.g., authoring/co-authoring technical reports, peer-reviewed journal articles, and policy briefs; delivering conference presentations and webinars).
– Assisting Dr. Grant with guidance for and oversight of undergraduate and graduate students at the institute.

We are seeking a highly motivated individual with a Ph.D. in relevant scientific field (including education, psychology, prevention science, or quantitative methodology) by start of position. Competitive candidates will have experience participating in evidence synthesis research projects (such as authoring or co-authoring a published systematic review), training in quantitative methods (particularly meta-analysis and data science), and proficiency with statistical analysis software (particularly R, RStudio, and Shiny) and evidence synthesis software (particularly DistillerSR).

This position is full-time for 1 year, with multi-year appointments possible contingent upon receipt of ongoing funding. This position is housed in the Eugene COE building, though remote/hybrid working options are also available for the entirety of the position (several current team members work remotely). The desired start date is August 2024 and expected salary range is $60,000 – $69,000. The position will have a formal mentor plan and involve professional development opportunities throughout the appointment to improve evidence synthesis and knowledge mobilization skills.

The University of Oregon is an equal opportunity, affirmative action institution committed to cultural diversity and compliance with the ADA. All qualified individuals are encouraged to apply! Applications will be reviewed on a rolling basis. For full consideration, please apply to our open pool by May 24, 2024. Please contact hedcoinstitute@uoregon.edu if you have any questions about this opportunity.

I don’t post every job ad that’s sent to me, but this one seemed particularly relevant, as it has to do with evidence synthesis in policy analysis. The announcement doesn’t mention Stan, but I can only assume that experience with Bayesian modeling and Stan would be highly relevant to the job.

6 ways to follow this blog

Andrew — Tue, 23 Apr 2024 21:53:15 +0000

RSS.

Also, our old posts are spooling at StatRetro every three hours starting with our very first post from 2004.

The blog’s in all these places because people told me they were having difficulty staying informed about the new posts. So, lots of places for you to find these.

What is your superpower?

Andrew — Tue, 23 Apr 2024 13:21:04 +0000

After writing this post, I was thinking that my superpower as a researcher is my willingness to admit I’m wrong, which gives me many opportunities to learn and do better (see for example here or here). My other superpower is my capacity to be upset, which has often led me to think deeper about statistical questions (for example here).

That’s all fine, but then it struck me that, whenever people talk about their “superpower,” they always seem to talk about qualities that just about anyone could have.

For example, “My superpower is my ability to listen to people,” or “My superpower is that I always show up on time.” Or the classic “Sleep is your superpower.”

A quick google yields, “The superpower question invites you to single out a quality that has made it possible for you to achieve, and to give an example of a goal that you were able to reach as a result. Our first tip is to choose a simple but strong and effective superpower, for example: Endurance, strength or resilience.”

And “My superpower is the fact I am STRONG, DETERMINED, AND RESILIENT.”

And this: “Your superpower is your contribution—the role that you’re put on this Earth to fill. It’s what you do better than anyone else and tapping into it will not only help your team, but you’ll find your work more satisfying, too.” Which sounds different, but then it continues with these examples: Empathy, Systems Thinking, Creative Thinking, Grit, and Decisiveness.

I’m reminded of that Ben Stiller movie where he played a superhero whose superpower was that he could get really annoyed. Kinda like a Ben Stiller character, actually!

How it could be?

OK, superheroes aren’t real. So it’s not like people can say their superpower is flying, or invisibility, or breathing underwater, or being super-elastic, etc.

But . . . lots of people do have special talents. So you could imagine people saying that their superpower is that they have a really good memory, or they’re really good at learning languages, or that they’re really flexible, or some other property which, if not superhuman or even unique, is at least unusual and special. Instead you get things like “grit” or “sleep.”

And, as noted above, even in my own thinking, I was saying that my superpower is the commonplace ability to admit I’m wrong, or the characteristic of being easily upset. I could’ve said that my superpower is my mathematical talent or my ability to rapidly spin out ideas onto the page—but I didn’t!

I don’t know what this all means, but it seems like a funny thing that “superpower” is so often used to refer to commonplace habits that just about anyone could develop. I mean, sure, it fits with the whole growth-mindset thing: If I say that my superpower is that I can admit I’m wrong or that I work really hard, then anyone can emulate that. If I say that my superpower is that math comes easy to me, well, that’s not something you can do much with, if you don’t happen to have that superpower yourself.

So, yeah, I kind of get it. Still it seems off that, without even thinking about it, we use the term “superpower” for these habits and traits that are valuable but are pretty much the opposite of superpowers.

Storytelling and Scientific Understanding (my talks with Thomas Basbøll at Johns Hopkins this Friday)

Andrew — Mon, 22 Apr 2024 20:48:50 +0000

Fri 26 Apr, 10am in Shriver Hall Boardroom and 5pm in Hodson Hall 213 (see also here):

Storytelling and Scientific Understanding

Andrew Gelman and Thomas Basbøll

Storytelling is central to science, not just as a tool for broadcasting scientific findings to the outside world, but also as a way that we as scientists understand and evaluate theories. We argue that, for this purpose, a story should be anomalous and immutable; that is, it should be surprising, representing some aspect of reality that is not well explained by existing models of the world, and have details that stand up to scrutiny.

We consider how this idea illuminates some famous stories in social science involving soldiers in the Alps, Chinese boatmen, and trench warfare, and we show how it helps answer literary puzzles such as why Dickens had all those coincidences, why authors are often so surprised by what their characters come up with, and why the best alternative history stories have the feature that, in these stories, our “real world” ends up as the deeper truth. We also discuss connections to chatbots and human reasoning, stylized facts and puzzles in science, and the millionth digit of pi.

At the center of our framework is a paradox: learning from anomalies seems to contradict usual principles of science and statistics where we seek representative or unbiased samples. We resolve this paradox by placing learning-within-stories into a hypothetico-deductive (Popperian) framework, in which storytelling is a form of exploration of the implications of a hypothesis. This has direct implications for our work as a statistician and a writing coach.

Basbøll and I have corresponded and written a couple papers together, but we’ve never met before this!

I posted on these talks a few months ago; reposting now because it’s coming up soon.

Decorative statistics and historical records

Andrew — Mon, 22 Apr 2024 13:42:49 +0000

Sean Manning points to this remark from Matthew “not the musician” White:

I [White] am sometimes embarrassed by where I have been forced to find my statistics … Often, the only place to find numbers is in a newspaper article, almanac, chronicle or encyclopedia which needs to summarize major events into a few short sentences or into one scary number, and occasionally I get the feeling that some writers use numbers as pure rhetorical flourishes. To them, “over a million” does not mean “>10⁶”; it’s just synonymous with “a lot”.

White was sooooo close to picking up on the concept of decorative statistics.

Now here’s a tour de force for ya

Andrew — Sun, 21 Apr 2024 13:50:49 +0000

In social science, we’ll study some topic, then move on to the next thing. For example, Yotam and I did this project on social penumbras and political attitudes, we designed a study, collected data, analyzed the data, wrote it up, eventually it was published—the whole thing took years! and we were very happy with the results—and then we moved on. The idea is that other people will pick up the string. There were lots of little concerns, issues of measurement, causal identification, generalization, etc., and we discussed these in our paper, again hoping that these will be useful leads to further researchers.

And that’s how it often goes. Sometimes we return to old problems (for example, we wrote a paper on incumbency advantage in 1990 and followed up 18 years later, and we’re still working on R-hat, over 30 years after I first came up with the idea), but, even there, we’re typically not working with continuous focus.

The opposite approach in science is to drill down obsessively on a single phenomenon, to really pin it down. I think this is what historians do when they immerse themselves in some archive for a decade and then emerge to write the definitive book on the topic.

Here’s an example, not from history but from cognitive psychology, by Andrew Meyer and Shane Frederick:

This paper presents 59 new studies (N=72,310) which focus primarily on the “bat and ball problem.” It documents our attempts to understand the determinants of the erroneous intuition, our exploration of ways to stimulate reflection, and our discovery that the erroneous intuition often survives whatever further reflection can be induced. Our investigation helps inform conceptions of dual process models, as “system 1” processes often appear to override or corrupt “system 2” processes. Many choose to uphold their intuition, even when directly confronted with simple arithmetic that contradicts it – especially if the intuition is approximately correct.

The paper contains the charming Ascii graphic reproduced above (for example, page 8 here). I love Ascii graphics! Regarding the paper, Frederick writes:

One thing I’m proud of is summarizing 59 studies in just 9 pages. Another thing I like, and you’ll probably like, is that when sample sizes get large enough (and we have some pretty large ones), psychology starts to look like physics.

What really impresses me about the paper is not the sample size but the obsessiveness of the project. And I mean that in a good way.

Analogy between (a) model checking in Bayesian statistics, and (b) the self-correcting nature of science.

Andrew — Sat, 20 Apr 2024 13:31:59 +0000

This came up in a discussion thread a few years ago. In response to some thoughts from Danielle Navarro about the importance of model checking, I wrote:

This makes me think of an analogy between the following two things:

– Model checking in Bayesian statistics, and

– The self-correcting nature of science.

The story of model checking in Bayesian statistics is that the fact that Bayesian inference can give ridiculous answers is a good thing, in that, when we see the ridiculous answer, this signals to us that there’s a problem with the model, and we can go fix it. This is the idea that we would rather have our methods fail loudly than fail quietly. But this all only works if, when we see a ridiculous result, we confront the anomaly. It doesn’t work if we just accept the ridiculous conclusion without questioning it, and it doesn’t work if we shunt the ridiculous conclusion aside and refuse to consider its implications.

Similarly with the self-correcting nature of science. Science makes predictions which can be falsified. Scientists make public statements, many (most?) of which will eventually be proved wrong. These failures motivate re-examination of assumptions. That’s the self-correcting nature of science. But it only works if individual scientists do this (notice anomalies and explore them) and it only works if the social structure of science allows it. Science doesn’t self-correct if scientists continue to stand by refuted claims, and it doesn’t work if they attack or ignore criticism.

In short, science is self-correcting, but only if “science”—that is, the people and the institutions of science—do that correction.

Similarly, statistical methods are checkable, but only if the users of these methods actually check them, and only if the developers of these methods develop methods for users to perform these checks. Which is where I come in, as a methodologist.

As Thomas Bayes famously said, with great power comes great responsibility.

The data are on a 1-5 scale, the mean is 4.61, and the standard deviation is 1.64 . . . What’s so wrong about that??

Andrew — Fri, 19 Apr 2024 13:33:14 +0000

James Heathers reports on the article, “Contagion or restitution? When bad apples can motivate ethical behavior,” by Gino, Gu, and Zhong (2009):

There is some sentiment data reported in Experiment 3, which seems to be reported in whole units.

They also indicated how guilty they would feel about the behavior of the person who took all the money along with some unrelated emotional measures (1 = not at all, 5 = very much)… participants in the in-group selfish condition felt more guilty (M = 4.61, SD = 1.64) about the person’s selfish behavior than the participants in the out-group selfish condition (M = 3.26, SD = 1.54), t(80) = 3.82, p < .001.

If you have a 1 to 5 scale, it isn’t possible to have M = 4.61, SD = 1.64.

Huh? Really? Yeah!

Let’s work it out. If your measurements are on a 1-5 scale, the way to maximize their standard deviation for any given mean is to put the data all at 1 and 5. If the mean is 4.61, that would imply that (4.61 – 1)/(5 – 1) = 0.9025 of the data take on the value 5, and 1 – 0.9025 = 0.0975 take on the value 1. (Just to check, 0.0975*1 + 0.9025*5 = 4.61.)

For this extreme dataset, the standard deviation is sqrt(0.0975*(1 – 4.61)^2 + 0.9025*(5 – 4.61)^2) = 1.19. So, yeah, there’s no way to get a standard deviation of 1.64 from these data. Just not possible!

Just to make sure, we can check our calculation via simulation:

n <- 1e6
y <- sample(c(1,5), n, replace=TRUE, prob=c(0.0975, 0.9025))
print(c(mean(y), sd(y)))

Here's what we get:

[1] 4.610172 1.186317

Check.

OK, let's try one more thing. Maybe n is so small that there's some kinda 1/sqrt(n-1) thing in the denominator driving the result? I don't think so. The trouble is that, to get a mean of 4.61, you need enough data (in his post, Heathers guesses "n=41 (as 189/41 = 4.6098)") that the difference between 1/sqrt(n) and 1/sqrt(n-1) wouldn't be enough to take you from 1.19 all the way up to 1.64 or even close. Also, it's kinda implausible that all the observations would be 1's and 5's anyway.
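The same bound can be written in closed form: for data confined to [lo, hi] with a given mean, the population variance can't exceed (mean - lo)*(hi - mean), which is exactly what the all-1's-and-5's dataset attains. Here's a minimal Python sketch of that check, with the n/(n-1) correction thrown in for Heathers's guessed n=41. The function name max_sd is made up for illustration.

```python
import math

def max_sd(mean, lo, hi, n=None):
    """Largest possible SD for data confined to [lo, hi] with the given mean.

    With n=None this is the population (divide-by-n) bound; with n given,
    the variance is rescaled by n/(n-1) to bound the usual sample SD.
    """
    var = (mean - lo) * (hi - mean)  # attained by putting all data at lo and hi
    if n is not None:
        var *= n / (n - 1)
    return math.sqrt(var)

pop_bound = max_sd(4.61, 1, 5)           # about 1.19
sample_bound = max_sd(4.61, 1, 5, n=41)  # about 1.20, still nowhere near 1.64
print(round(pop_bound, 2), round(sample_bound, 2))
```

Either way the reported SD of 1.64 is out of reach, and the n-1 correction moves the bound only in the second decimal place.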

So what happened?

It's always easier to figure out what didn't happen than to figure out what did happen.

Here are some speculations.

One possibility is a typo, but Heathers doubts that because other calculations in that paper are consistent with the above-reported impossible numbers.

A related possibility is that this was a typo that was then propagated into the rest of the paper. For example, the mean was 3.61, it was typed in the paper as 4.61, and then this typed-in number was used in later calculations. This would be bad workflow---you want all the computations to be done in a single script---but people use bad workflow all the time. I use bad workflow myself sometimes and end up with wrong numbers or wrongly-labeled graphs.

Another possibility is that the mean and standard deviation were calculated from two different datasets. That might sound kind of weird, but it can happen all the time, due to sloppiness or because of goofs in data processing. For example, you read in the data, calculate the mean and standard deviation for each variable, then perform some data-exclusion rule, perhaps removing data with incomplete responses to some of the questions, then you do further statistical analysis, recalculating the mean and standard deviation, among other things---but then when you pull together your numbers, you take the mean from some place and the standard deviation from the other place.

Yet another possibility is that someone involved in the data analysis or writeup was cheating in order to get a statistically-significant and thus publishable result, for example changing 3.61 to 4.61 to get a big fat difference but not touching the standard deviation. This would be a great way to cheat, because if you get caught, you can just say that you made a typo!

In any case, it's a fun little statistics example. And it's worth checking your data, even if you have no suspicion of cheating. I've often had incoherent data in problems I've worked on. Lots of things can go wrong in data processing and analysis, and we have to check things in all sorts of ways.

Infovis, infographics, and data visualization: My thoughts 12 years later

Andrew — Thu, 18 Apr 2024 13:16:13 +0000

I came across this post from 2011, “Infovis, infographics, and data visualization: Where I’m coming from, and where I’d like to go,” and it seemed to make sense to reassess where we are now, 12 years later.

From 2011:

I majored in physics in college and I worked in a couple of research labs during the summer. Physicists graph everything. I did most of my plotting on graph paper–this continued through my second year of grad school–and became expert at putting points at 1/5, 2/5, 3/5, and 4/5 between the x and y grid lines.

In grad school in statistics, I continued my physics habits and graphed everything I could. I did notice, though, that the faculty and the other students were not making a lot of graphs. I discovered and absorbed the principles of Cleveland’s The Elements of Graphing Data.

In grad school and beyond, I continued to use graphs in my research. But I noticed a disconnect in how statisticians thought about graphics. There seemed to be three perspectives:

1. The proponents of exploratory data analysis liked to graph raw data and never think about models. I used their tools but was uncomfortable with the gap between the graphs and the models, between exploration and analysis.

2. From the other direction, mainstream statisticians–Bayesian and otherwise–did a lot of math and fit a lot of models (or, as my ascetic Berkeley colleagues would say, applied a lot of procedures to data) but rarely made a graph. They never seemed to care much about the fit of their models to data.

3. Finally, textbooks and software manuals featured various conventional graphs such as stem-and-leaf plots, residual plots, scatterplot matrices, and q-q plots, all of which seemed appealing in the abstract but never did much for me in the particular applications I was working on.

In my article with Meng and Stern, and in Bayesian Data Analysis, and then in my articles from 2003 and 2004, I have attempted to bring these statistical perspectives together by framing exploratory graphics as model checking: a statistical graph can reveal the unexpected, and “the unexpected” is defined relative to “the expected”–that is, a model. This fits into my larger philosophy that puts model checking at the center of the statistical enterprise.

Meanwhile, my graphs have been slowly improving. I realized awhile ago that I didn’t need tables of numbers at all. And here and there I’ve learned of other ideas, for example Howard Wainer’s practice of giving every graph a title.

I continued with some scattered thoughts about graphics and communication:

A statistical graph does not stand alone. It needs some words to go along with it to explain it. . . . I realized that our plots, graphically strong though they were, did not stand on their own. . . . This experience has led me to want to put more effort into explaining every graph, not merely what the points and lines are indicating (although that is important and can be hard to figure out in many published graphs) but also what is the message the graph is sending.

Most graphs are nonlinear and don’t have a natural ordering. A graph is not a linear story or a movie you watch from beginning to end; rather, it’s a cluttered house which you can enter from any room. The perspective you pick up if you start from the upstairs bathroom is much different than what you get by going through the living room–or, in graphical terms, you can look at clusters of points and lines, you can look at outliers, you can make lots of different comparisons. That’s fine but if a graph is part of a scientific or journalistic argument it can help to guide the reader a bit–just as is done automatically in the structuring of words in an article. . . .

While all this was happening, I also was learning more about decision analysis. In particular, Dave Krantz convinced me that the central unit of decision analysis is not the utility function or even the decision tree but rather the goal.

Applying this idea to the present discussion: what is the goal of a graph? There can be several, and there’s no reason to suppose that the graph that is best for achieving one of these goals will be optimal, or even good, for another. . . .

I’m a statistician who loves graphs and uses them all the time, I’m continually working on improving my graphical presentation of data and of inferences, but I’m probably stuck (without realizing it) in a bit of a rut of dotplots and lineplots. I’m aware of an infographics community . . .

Here’s an example of where I’m coming from: a blog post entitled, “Is the internet causing half the rapes in Norway? I wanna see the scatterplot.” To me, visualization is not an adornment or a way of promoting social science. Visualization is a central tool in social science research. (I’m not saying visualization is strictly necessary–I’m sure you can do a lot of good work with no visual sense at all–but I think it’s a powerful approach, and I worry about people who believe social science claims that they can’t visualize. I worry about researchers who believe their own claims without understanding them well enough to visualize the relation of these claims to the data from which they are derived.)

The rest of my post from 2011 discusses my struggles in communicating with the information visualization community–these are people who produce graphs for communication with general audiences, which motivates different goals and tools than those used by statisticians to communicate as part of the research process. Antony Unwin and I wrote a paper about these differences which was ultimately published with discussion in 2013 (and here is our rejoinder to the discussions).

Looking at all this a decade later, I’m not so interested in non-statistical information visualization anymore. I don’t mean this in a disparaging way! I think infoviz is great. Sometimes the very aspects of an infographic that make it difficult to read and deficient from a purely statistical perspective are a benefit for communication in that they can push the reader into thinking in new ways; here’s an example we discussed from a few years ago.

I continue to favor what we call the click-through solution: Start with the infographic, click to get more focused statistical graphics, click again to get the data and sources. But, in any case, the whole stat graphics vs. infographics thing has gone away, I guess because it’s clear that they can coexist; I don’t really see them as competing.

Whassup now?

Perhaps surprisingly, my graphical practices have remained essentially unchanged since 2011. I say “perhaps surprisingly,” because other aspects of my statistical workflow have changed a lot during this period. My lack of graphical progress is probably a bad thing!

A big reason for my stasis in this regard, I think, is that I’ve worked on relatively few large applied projects during the past fifteen years.

From 2004 through 2008, my collaborators and I were working every day on Red State Blue State. We produced hundreds of graphs and the equivalent of something like 10 or 20 research articles. In addition to our statistical goals of understanding our data and how they related to public opinion and voting, we knew from the start that we wanted to communicate both to political scientists and to the general public, so we were on the lookout for new ways to display our data and inferences. Indeed, we had the idea for the superplot before we ever made the actual graph.

Since 2008, I’ve done lots of small applied analyses for books and various research projects, but no big project requiring a rethinking of how to make graphs. The closest thing would be Stan, and here we have made some new displays–at least, new to me–but that work was done by collaborators such as Jonah Gabry, who did ShinyStan, and this hasn’t directly affected the sorts of graphs that I make.

I continue to think about graphs in new ways (for example, causal quartets and the ladder of abstraction), but, as can be seen in those new papers, the looks of my graphs haven’t really changed since 2011.

“Close but no cigar” unit tests and bias in MCMC

Bob Carpenter — Wed, 17 Apr 2024 19:00:37 +0000

I’m coding up a new adaptive sampler in Python, which is super exciting (the basic methodology is due to Nawaf Bou-Rabee and Tore Kleppe). Luckily for me, another great colleague, Edward Roualdes, has been keeping me on the straight and narrow by suggesting stronger tests and pointing out actual bugs in the repository (we’ll open access it when we put the arXiv paper up—hopefully by the end of the month).

There are a huge number of potential fencepost (off by one), log-vs-exponential, positive-vs-negative, numerator-vs-denominator, and related errors to make in this kind of thing. For example, here’s a snippet of the transition code.

L = self.uturn(theta, rho)
LB = self.lower_step_bound(L)
N = self._rng.integers(LB, L)
theta_star, rho_star = self.leapfrog(theta, rho, N)
rho_star = -rho_star
Lstar = self.uturn(theta_star, rho_star)
LBstar = self.lower_step_bound(Lstar)
if not(LBstar <= N and N < Lstar):
    ... reject ...

Looks easy, right? Not quite. The uturn function returns the number of steps to get to a point that is one step past the U-turn point. That is, if I take L steps from (theta, rho), I wind up closer to where I started than if I take L - 1 steps. The rng.integers function samples uniformly, but it’s Python, so it excludes the upper bound and samples from {LB, LB + 1, ..., L - 1}. That’s correct, because I want to choose a number of steps greater than 1 and less than the point past which you’ve made a U-turn. Let’s just say I got this wrong the first time around.
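The half-open convention is easy to check empirically. The sketch below uses the standard library's random.randrange as a stand-in for rng.integers (both exclude the upper bound by default); the values of LB and L are made up, and in_step_range is a hypothetical helper mirroring the reversibility test in the snippet above.

```python
import random

LB, L = 5, 11  # hypothetical lower bound and one-past-U-turn step count

rng = random.Random(1234)
draws = [rng.randrange(LB, L) for _ in range(10_000)]

# The observed support is the half-open range {LB, ..., L - 1}; L never appears.
support = sorted(set(draws))
print(support)

def in_step_range(N, LB, L):
    """Reversibility check: lower bound inclusive, upper bound exclusive."""
    return LB <= N < L
```

NumPy's Generator.integers also accepts endpoint=True to include the upper bound, in which case the matching check would be LB <= N <= L.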

Because it’s MCMC and I want a simple proof of correctness, I have to make sure the chain’s reversible. So I see how many steps to get one past a U-turn coming back (after momentum flip), which is Lstar. Now I have to grab its lower bound, and make sure that I take a number of steps between the lower bound (inclusive) and upper bound (exclusive). Yup, had this wrong at one point. But the off-by-one error shows up in a position that is relatively rare given how I was sampling.

For more fun, we have to compute the acceptance probability. In theory, it’s just p(theta_star, rho_star, N) / p(theta, rho, N) in this algorithm, which looks as follows on the log scale.

log_accept = (
    self.log_joint(theta_star, rho_star) - np.log(Lstar - LBstar)
    - (log_joint_theta_rho - np.log(L - LB))
)

That’s because p(N | theta_star, rho_star) = 1 / (Lstar - LBstar) given the uniform sampling with Lstar excluded and LBstar included. But then I replaced the uniform distribution with a binomial, and made the following mistake.

log_accept = (
  self.log_joint(theta_star, rho_star) - self.length_log_prob(N, Lstar)
  - (log_joint_theta_rho - self.length_log_prob(N, L))
)

I only had the negation in -np.log(L - LB) because it was equivalent to np.log(1 / (L - LB)) with a subtraction instead of a division. Luckily Edward caught this one in the code review. I should’ve just coded the log density and added it rather than subtracted it. Now you’d think this would lead to an immediate and glaring bug in the results because MCMC is a delicate algorithm. In this case, the issue is that (N - L) and (N - Lstar) are identically distributed and only range over values of roughly 5 to 7. That’s a minor difference in a stochastic acceptance probability that’s already high. How hard was this to detect? With 100K iterations, everything looked fine. With 1M iterations, the estimates of parameters continued to follow a 1 / sqrt(iterations) trend in error, but the estimates of the squared parameters leveled off (asymptoted) at a residual error, and did so only after 100K iterations. That is, it required 1M iterations and an evaluation of the means of squared parameters to detect this bug.
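To spell out the sign issue: for the uniform case the log probability of N is -log(L - LB), so subtracting np.log(L - LB) is the same as adding a log density. Here's a minimal sketch of the "code the log density and add it" version, checked against the original uniform expression. The function names and all the numbers are hypothetical, not the repository's code.

```python
import math

def uniform_length_log_prob(N, LB, L):
    """log p(N) for N uniform on {LB, ..., L - 1}; -inf outside that range."""
    if not (LB <= N < L):
        return -math.inf
    return -math.log(L - LB)

def log_accept(log_joint_star, log_joint_cur, length_lp_star, length_lp_cur):
    """Log acceptance ratio with the length log probabilities ADDED."""
    return (log_joint_star + length_lp_star) - (log_joint_cur + length_lp_cur)

# Illustrative values only.
L, LB, Lstar, LBstar, N = 12, 6, 10, 5, 7
lj_cur, lj_star = -3.2, -3.0

lp_cur = uniform_length_log_prob(N, LB, L)
lp_star = uniform_length_log_prob(N, LBstar, Lstar)

correct = log_accept(lj_star, lj_cur, lp_star, lp_cur)
# The original uniform expression, written with explicit -log(width) terms.
original = (lj_star - math.log(Lstar - LBstar)) - (lj_cur - math.log(L - LB))
# The sign-flipped version: subtracting a log probability instead of adding it.
buggy = (lj_star - lp_star) - (lj_cur - lp_cur)

print(correct, original, buggy)
```

The first two values agree; the third differs whenever the two length log probabilities differ, which is the bug described above.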

I then introduced a similar error when I went to a binomial number of steps selection. I was using sp.stats.binom.logpmf(N, L, self._success_prob) when I should have been using sp.stats.binom.logpmf(N, L - 1, self._success_prob). As an aside, I like SciPy’s clear naming here vs. R’s dbinom(log.p = True, ...). What I don’t like about Python is that the discrete uniform doesn’t include its endpoint. Of course, the binomial includes its endpoint as an option, so these two versions need to be coded off by 1. Of course, I missed the L - 1. This only introduced a bug because I didn’t do the matching adjustment in testing whether things were reversible. That’s if not(1 <= N and N < Lstar) to match the Lstar - 1 in the logpmf() call. If I ran it all the way to L, then I would've needed N <= Lstar. This is another subtle difference that only shows up after more than 100K iterations.
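The endpoint conventions can be checked with a small stand-in. This sketch writes the binomial log pmf by hand with the standard library rather than calling SciPy, and the names binom_logpmf and is_valid_step_count are hypothetical; the value of L and the success probability are arbitrary.

```python
import math

def binom_logpmf(k, n, p):
    """Log pmf of Binomial(n, p) at k; the support is 0, ..., n inclusive."""
    if k < 0 or k > n:
        return -math.inf
    return (math.log(math.comb(n, k))
            + k * math.log(p) + (n - k) * math.log1p(-p))

def is_valid_step_count(N, L):
    """Range check that pairs with a Binomial(L - 1, p) length: 1 <= N < L."""
    return 1 <= N and N < L

L, p = 8, 0.6  # arbitrary illustration values

# With n = L, the endpoint N = L has positive probability ...
print(binom_logpmf(L, L, p))
# ... but with n = L - 1 it has none, matching the exclusive check N < L.
print(binom_logpmf(L, L - 1, p))
print(is_valid_step_count(L - 1, L), is_valid_step_count(L, L))
```

The point of the sketch is only that the density and the range check have to agree on whether L itself is allowed; mixing the n = L density with the N < L check (or the reverse) is the kind of mismatch described above.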

We introduced a similar problem into Stan in 2016 when we revised NUTS to do multinomial sampling rather than slice sampling. It was an off-by-one error on trajectory length. All of our unit tests of roughly 10K iterations passed. A user spotted the bug by fitting a 2D correlated normal with known correlation for 1M iterations as a test and realizing estimates were off by 0.01 when they should've had smaller error. We reported this on the blog back when it happened, culminating in the post Michael found the bug in Stan's new sampler.
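A back-of-the-envelope calculation shows why 10K iterations could not catch an error of 0.01. This sketch assumes independent draws and a posterior standard deviation of 1, neither of which comes from the Stan tests; real MCMC draws are correlated, so the effective sample size would be smaller and detection harder still.

```python
import math

def mcse(sd, n_eff):
    """Monte Carlo standard error of a mean from n_eff independent draws."""
    return sd / math.sqrt(n_eff)

bias = 0.01  # size of the error the user noticed
sd = 1.0     # assumed posterior standard deviation (illustrative only)

for n in (10_000, 100_000, 1_000_000):
    se = mcse(sd, n)
    print(f"{n:>9,d} draws: MCSE = {se:.4f}, bias / MCSE = {bias / se:.1f}")
```

Under these assumptions the bias is about one standard error at 10K draws, which is indistinguishable from noise, and about ten standard errors at 1M draws.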

I was already skeptical of empirical results in papers and this is making me even more skeptical!

P.S. In case you don't know the English idiom "close but no cigar", here's the dictionary definition from Cambridge (not Oxford!).

Do research articles have to be so one-sided?

Andrew — Wed, 17 Apr 2024 13:22:22 +0000

It’s standard practice in research articles as well as editorials in scholarly journals to present just one side of an issue. That’s how it’s done! A typical research article looks like this:

“We found X. Yes, we really found X. Here are some alternative explanations for our findings that don’t work. So, yeah, it’s really X, it can’t reasonably be anything else. Also, here’s why all the thickheaded previous researchers didn’t already find X. They were wrong, though, we’re right. It’s X. Indeed, it had to be X all along. X is the only possibility that makes sense. But it’s a discovery, it’s absolutely new. As was said of the music of Beethoven, each note is prospectively unexpected but retrospectively absolutely right. In conclusion: X.”

There also are methods articles, which go like this:

“Method X works. Here’s a real problem where method X works better than anything else out there. Other methods are less accurate or more expensive than X, or both. There are good theoretical reasons why X is better. It might even be optimal under some not-too-unreasonable conditions. Also, here’s why nobody tried X before. They missed it! X is, in retrospect, obviously the right thing to do. Also, though, X is super-clever: it had to be discovered. Here are some more examples where X wins. In conclusion: X.”

Or the template for a review article:

“Here’s a super-important problem which has been studied in many different ways. The way we have studied it is the best. In this article, we also discuss some other approaches which are worse. Our approach looks even better in this contrast. In short, our correct approach both flows naturally from and is a bold departure from everything that came before.”

OK, sometimes we try to do better. We give tentative conclusions, we accept uncertainty, we compare our approach to others on a level playing field, we write a review that doesn’t center on our own work. It happens. But, unless you’re Bob Carpenter, such an even-handed approach doesn’t come naturally, and, as always with this kind of adjustment, there’s always the concern of going too far (“bending over backward”) in the other direction. Recall my criticism of the popular but I think bogus concept of “steelmanning.”

So, yes, we should try to be more balanced, especially when presenting our own results. But the incentives don’t go in that direction, especially when your contributions are out there fighting with lots of ideas that other people are promoting unreservedly. Realistically, often the best we can do is to include Limitations sections in otherwise-positive papers.

One might think that a New England Journal of Medicine editorial could do better, but editorials have the same problem as review articles, which is that the authors will still have an agenda.

Dale Lehman writes in, discussing such an example:

A recent article in the New England Journal of Medicine caught my interest. The authors – a Harvard economist and a McKinsey consultant (properly disclosed their ties) – provide a variety of ways that AI can contribute to health care delivery. I can hardly argue with the potential benefits, and some areas of application are certainly ripe for improvements from AI. However, the review article seems unduly one-sided. Almost all of the impediments to application that they discuss lay the “blame” on health care providers and organizations. No mention is made about the potential errors made by AI algorithms applied in health care. This I found particularly striking since they repeatedly appeal to AI use in business (generally) as a comparison to the relatively slow adoption of AI in health care. When I think of business applications, a common error might be a product recommendation or promotion that was not relevant to a consumer. The costs of such a mistake are generally small – wasted resources, unhappy customers, etc. A mistake made by an AI recommendation system in medicine strikes me as quite a bit more serious (lost customers is not the same thing as lost patients).

To that point, the article cites several AI applications to prediction of sepsis (references 24-27). That is a particular area of application where several AI sepsis-detection algorithms have been developed, tested, and reported on. But the references strike me as cherry-picked. A recent controversy has concerned the Epic model (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8218233/?report=classic) where the company reported results were much better than the attempted replication. Also, there was a major international challenge (PhysioNet: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6964870/) where data was provided from 3 hospital systems, 2 of which provided the training data for the competition and the remaining system was used as the test data. Notably, the algorithms performed much better on the systems for which the training data was provided than for the test data.

My question really concerns the role of the NEJM here. Presumably this article was peer reviewed – or at least reviewed by the editors. Shouldn’t the NEJM be demanding more balanced and comprehensive review articles? It isn’t that the authors of this article say anything that is wrong, but it seems deficient in its coverage of the issues. It would not have been hard to acknowledge that these algorithms may not be ready for use (admittedly, they may outperform existing human models, but that is an area on which there is research and it should be noted in the article). Nor would it be difficult to point out that algorithmic errors and biases in health care may be a more serious matter than in other sectors of the economy.

Interesting. I’m guessing that the authors of the article were coming from the opposite direction, with a feeling that there’s too much conservatism regarding health-care innovation and they wanted to push back against that. (Full disclosure: I’m currently working with a cardiologist to evaluate a machine-learning approach for ECG diagnosis.)

In any case, yes, this is part of a general problem. One thing I like about blogging, as opposed to scholarly writing or journalism, is that in a blog post there’s no expectation or demand or requirement that we come to a strong conclusion. We can let our uncertainty hang out, without some need to try to make “the best possible case” for some point. We may be expected to entertain, but that’s not so horrible!

N=43, “a statistically significant 226% improvement,” . . . what could possibly go wrong??

Andrew — Tue, 16 Apr 2024 13:50:38 +0000

Enjoy.

They looked at at least 12 cognitive outcomes, one of which had p = 0.02, but other differences “were just shy of statistical significance.” Also:

The degree of change in the brain measure was not significantly correlated with the degree of change in the behavioral measure (p > 0.05) but this may be due to the reduced power in this analysis which necessarily only included the smaller subset of individuals who completed neuropsychological assessments during in-person visits.

This is one of the researcher degrees of freedom we see all the time: an analysis with p > 0.05 can be labeled as “marginally statistically significant” or even published straight-up as a main result (“P < 0.10”), it can get some sort of honorable mention (“this may be due to the reduced power”), or it can be declared to be a null effect.
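For a sense of how unsurprising a single p = 0.02 among 12 outcomes is, here is the arithmetic as a short Python sketch. It assumes the 12 tests are independent and that there is no true effect on any of them; the outcomes in the paper are surely correlated, so treat this as a rough guide only.

```python
def prob_at_least_one(alpha, k):
    """P(at least one of k independent null p-values is <= alpha)."""
    return 1 - (1 - alpha) ** k

p_any = prob_at_least_one(0.02, 12)
print(f"{p_any:.3f}")
```

Under those assumptions the chance of seeing at least one p-value that small is roughly one in five, with no effect anywhere.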

The “this may be due to the reduced power” thing is confused, for two reasons. First, of course it’s due to the reduced power! Set n to 1,000,000,000 and all your comparisons will be statistically significant! Second, the whole point of having these measures of sampling and measurement error is to reveal the uncertainty in an estimate’s magnitude and sign. It’s flat-out wrong to take a point estimate and just suppose that it would persist under a larger sample size.

People are trained in bad statistical methods, so they use bad statistical methods, it happens every day. In this one, I’m just bothered that this “226% improvement” thing didn’t set off any alarms. To the extent that these experimental results might be useful, the authors should be publishing the raw data rather than trying to fish out statistically significant comparisons. They also include a couple of impressive-looking graphs which wouldn’t look so impressive if they were to graph all the averages in the data rather than just those that randomly exceeded a significance threshold.

Did they publish the raw data? No! Here’s the Data availability statement:

The datasets presented in this article are not readily available because due to reasonable privacy and security concerns, the underlying data are not easily redistributable to researchers other than those engaged in the current project’s Institutional Review Board-approved research. The corresponding author may be contacted for an IRB-approved collaboration. Requests to access the datasets should be directed to …

It seems like it would be pretty trivial to remove names and any other identifying information and then release the raw data. This is a study on “whether older adults retain or improve their cognitive ability over a six-month period after daily olfactory enrichment at night.” What’s someone gonna do, track down participants based on their “daily exposure to essential oil scents”?

One problem here is that Institutional Review Boards are set up with a default no-approval stance. I think it should be the opposite: no IRB approval unless you commit ahead of time to posting your raw data. (Not that my collaborators and I usually post our raw data either. Posting raw data can be difficult. That’s one reason I think it should be required, because otherwise it’s not likely to be done.)

No, it’s not “statistically implausible” when results differ between studies, or between different groups within a study.

Andrew — Mon, 15 Apr 2024 13:31:04 +0000

James “not the cancer cure guy” Watson writes:

This letter by Thorlund et al. published in the New England Journal of Medicine is rather amusing. It’s unclear to me what their point is, other than the fact that they find the published results for the new COVID drug molnupiravir “statistically implausible.”

Background: The pharma company Merck got very promising results for molnupiravir at their interim analysis (~50% reduction in hospitalisation/death) but less promising results at their final analysis (30% reduction). Thorlund et al. were surprised that the data for the two study periods (before and after interim analysis) provided very different point estimates for benefit (goes the other way in the second period). They were also surprised to see inconsistent results when comparing across the different countries included in the study (non-overlapping confidence intervals).

They clearly had never read the subgroup analysis from the ISIS-2 trial: the authors convincingly showed that aspirin reduced vascular deaths in patients of all astrological birth signs except Gemini and Libra, see Figure 5 in this Lancet paper from 1988.

He’s not kidding—that Lancet paper really does talk about astrological signs. What the hell??

Regarding the letter in the New England Journal of Medicine, I guess the point is that different studies, and different groups within a study, have different patients and are conducted at different times and under different conditions, so it makes sense that they can have different outcomes, more different than would be expected to arise from pure chance when comparing two samples from an identical distribution. People often don’t seem to realize this, leading them to characterize differences from chance as “statistically implausible” etc. rather than just representing underlying differences across patients, scenarios, and times.

As the authors of the original study put it in their response letter in the journal:

Given the shifts in prevailing SARS-CoV-2 variants, changes in outpatient management, and inclusion of trial sites from countries with unique Covid-19 disease burdens, the trial was not necessarily conducted under uniform conditions. The differences in the results between the interim and final analyses might be statistically improbable under ideal circumstances, but they reflect the fact that several key factors could not remain constant despite a consistent trial design.

Indeed.

Simulation to understand two kinds of measurement error in regression

Andrew — Sun, 14 Apr 2024 13:07:16 +0000

This is all super-simple; still, it might be useful. In class today a student asked for some intuition as to why, when you’re regressing y on x, measurement error on x biases the coefficient estimate but measurement error on y does not.

I gave the following quick explanation:
– You’re already starting with the model, y_i = a + bx_i + e_i. If you add measurement error to y, call it y*_i = y_i + eta_i, and then you regress y* on x, you can write y*_i = a + bx_i + e_i + eta_i, and as long as eta is independent of e, you can just combine them into a single error term.
– When you have measurement error in x, two things happen to attenuate b, that is, to pull the regression coefficient toward zero. First, if you spread out x but keep y unchanged, this will reduce the slope of y on x. Second, when you add noise to x you’re changing the ordering of the data, which will reduce the strength of the relationship.

But that’s all words (and some math). It’s simpler and clearer to do a live simulation, which I did right then and there in class!

Here’s the R code:

# simulation for measurement error
library("arm")
set.seed(123)
n <- 1000
x <- runif(n, 0, 10)
a <- 0.2
b <- 0.3
sigma <- 0.5
y <- rnorm(n, a + b*x, sigma)
fake <- data.frame(x,y)

fit_1 <- lm(y ~ x, data=fake)
display(fit_1)

sigma_y <- 1
fake$y_star <- rnorm(n, fake$y, sigma_y)
sigma_x <- 4
fake$x_star <- rnorm(n, fake$x, sigma_x)

fit_2 <- lm(y_star ~ x, data=fake)
display(fit_2)

fit_3 <- lm(y ~ x_star, data=fake)
display(fit_3)

fit_4 <- lm(y_star ~ x_star, data=fake)
display(fit_4)

x_range <- range(fake$x, fake$x_star)
y_range <- range(fake$y, fake$y_star)

# top margin widened from 1 to 2 lines so the panel titles have room
par(mfrow=c(2,2), mar=c(3,3,2,1), mgp=c(1.5,.5,0), tck=-.01)
# main= goes in plot(); abline() does not draw titles and would only warn
plot(fake$x, fake$y, xlim=x_range, ylim=y_range, bty="l", pch=20, cex=.5, main="No measurement error")
abline(coef(fit_1), col="red")
plot(fake$x, fake$y_star, xlim=x_range, ylim=y_range, bty="l", pch=20, cex=.5, main="Measurement error on y")
abline(coef(fit_2), col="red")
plot(fake$x_star, fake$y, xlim=x_range, ylim=y_range, bty="l", pch=20, cex=.5, main="Measurement error on x")
abline(coef(fit_3), col="red")
plot(fake$x_star, fake$y_star, xlim=x_range, ylim=y_range, bty="l", pch=20, cex=.5, main="Measurement error on x and y")
abline(coef(fit_4), col="red")

The resulting plot is at the top of this post.

I like this simulation for three reasons:

1. You can look at the graph and see how the slope changes with measurement error in x but not in y.

2. This exercise shows the benefits of clear graphics, including little things like making the dots small, adding the regression lines in red, labeling the individual plots, and using a common axis range for all four graphs.

3. It was fast! I did it live in class, and this is an example of how students, or anyone, can answer this sort of statistical question directly, with a lot more confidence and understanding than would come from a textbook and some formulas.
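The same check can be done numerically. What follows is a Python sketch rather than the R code above, using only the standard library; it keeps the same a, b, sigma, sigma_y, and sigma_x values but uses a larger n to cut simulation noise, and ols_slope is a hypothetical helper. The comparison value b * var(x) / (var(x) + sigma_x^2) is the standard attenuation result for classical measurement error; it is not derived in this post.

```python
import random

def ols_slope(x, y):
    """Least-squares slope from regressing y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

rng = random.Random(123)
n = 100_000  # larger than the n = 1000 used in class, to reduce noise
a, b, sigma = 0.2, 0.3, 0.5
sigma_y, sigma_x = 1.0, 4.0

x = [rng.uniform(0, 10) for _ in range(n)]
y = [rng.gauss(a + b * xi, sigma) for xi in x]
y_star = [rng.gauss(yi, sigma_y) for yi in y]  # measurement error on y
x_star = [rng.gauss(xi, sigma_x) for xi in x]  # measurement error on x

var_x = 10 ** 2 / 12  # variance of a uniform(0, 10) variable
expected_attenuated = b * var_x / (var_x + sigma_x ** 2)

slope_clean = ols_slope(x, y)
slope_y_err = ols_slope(x, y_star)
slope_x_err = ols_slope(x_star, y)

print(f"no error:         {slope_clean:.3f}")
print(f"error on y:       {slope_y_err:.3f}")
print(f"error on x:       {slope_x_err:.3f}")
print(f"attenuated value: {expected_attenuated:.3f}")
```

With these settings the slope with noisy y should stay near b = 0.3, while the slope with noisy x should land near the attenuated value, which works out to roughly a third of b.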

P.S. As Eric Loken and I discuss in this 2017 article, everything gets more complicated if you condition on "statistical significance."

P.P.S. Yes, I know my R code is ugly. Think of this as an inspiration: even if, like me, you’re a sloppy coder, you can still code up these examples for teaching and learning.

Intelligence is whatever machines cannot (yet) do

Bob Carpenter — Sat, 13 Apr 2024 19:00:13 +0000

I had dinner a few nights ago with Andrew’s former postdoc Aleks Jakulin, who left the green fields of academia for entrepreneurship ages ago. Aleks was telling me he was impressed by the new LLMs, but then asserted that they’re clearly not intelligent. This reminded me of the old saw in AI that “AI is whatever a machine can’t do.”

In the end, the definition of “intelligent” is a matter of semantics. Semantics is defined by conventional usage, not by fiat (the exception seems to be an astronomical organization trying to change the definition of “planet” to make it more astronomically precise). We do this all the time. If you think about what “water” means, it’s incredibly vague. In the simplest case, how many minerals can it contain before we call it “mud” rather than “water”? Does it even have to be made of H2O if we can find a clear liquid on an alternative earth that will nourish us in the same way (this is a common example in philosophy from Hilary Putnam, I believe)? When the word “water” was first introduced into English, let’s just say that our understanding of chemistry was less developed than it is now. The word “intelligent” is no different. We’ve been using the term since before computers, and now we have to rethink what it means. By convention, we could decide as a group of language users to define “intelligent” however we want. Usually such decisions are guided by pragmatic considerations (or at least I’d like to think so; this is the standard position of pragmatist philosophers of language, like Richard Rorty). For instance, we could decide to exclude GPT because (a) it’s not embodied in the same way as a person, (b) it doesn’t have long-term memory, (c) it runs on silicon rather than cells, etc.

It would be convenient for benchmarking if we could fix a definition of “intelligence” to work with. What we do instead is just keep moving the bar on what counts as “intelligent.” I doubt people 50 years ago (1974) would have said you can play chess without being intelligent. But as soon as Deep Blue beat the human chess champion, everyone changed their tune and the chorus became “chess is just a game” and “it’s finite” and “it has well-defined rules, unlike real life.” Then when IBM’s Watson trounced the world champion at Jeopardy!, a language-based game, it was dismissed as a parlor trick. Obviously because a machine can play Jeopardy!, the reasoning went, it doesn’t require intelligence.

Here’s the first hit on Google I found searching for something like [what machines can’t do]. This one’s in a popular magazine, not the scientific literature. It’s the usual piece in the genre of “ML is amazing, but it’s not intelligent because it can’t do X”.

Rob Toews. 2021. What artificial intelligence still can’t do. Forbes.

Let’s go over Toews’s list of AI’s failures circa 2021 (these are direct quotes).

Use “common sense.” A man went to a restaurant. He ordered a steak. He left a big tip. If asked what the man ate in this scenario, a human would have no problem giving the correct answer—a steak. Yet today’s most advanced artificial intelligence struggles with prompts like this.
Learn continuously and adapt on the fly. Today, the typical AI development process is divided into two distinct phases: training and deployment.
Understand cause and effect. Today’s machine learning is at its core a correlative tool. It excels at identifying subtle patterns and associations in data. But when it comes to understanding the causal mechanisms—the real-world dynamics—that underlie those patterns, today’s AI is at a loss.
Reason ethically…In 2016, Microsoft debuted an AI personality on Twitter named Tay. The idea was for Tay to engage in online conversations with Twitter users as a fun, interactive demonstration of Microsoft’s NLP technology. It did not go well. Within hours, Internet trolls had gotten Tay to tweet a wide range of offensive messages: for instance, “Hitler was right” and “I hate feminists and they should all die and burn in hell.”

(1) ChatGPT-4 gets these common-sense problems mostly right. But it’s not logic. The man may have ordered a steak, gotten it, sent it back, ordered the fish instead, and still left a big tip. This is a problem with a lot of the questions posed to GPT about whether X follows from Y. It’s not a sound inference, just the most likely thing to happen, or as we used to say, the “default.” Older AIs were typically designed around sound inference and weren’t so much trying to emulate human imprecision (having said that, my grad school admissions essay was about and my postdoc was funded by a grant on default logics back in the 1980s!).

(2) You can do in-context learning with ChatGPT, but it doesn’t retain anything long term without retraining/fine tuning. It will certainly adapt to its task/listener on the fly throughout a conversation (arguably the current systems like ChatGPT adapt to their interlocutor too much; it’s what they were trained to do via reinforcement learning). Long-term memory is perhaps the biggest technical challenge to overcome, and it’s been interesting to see people going back to LSTM/recurrent NN ideas (transformers, the neural net architecture underlying ChatGPT, were introduced in a paper titled “Attention is all you need”, which used long but finite memory).

(3) ChatGPT 4 is pretty bad at causal inference. But it’s probably above the bar for Toews’s complaints. It’ll get simple “causal inference” right the same way people do. In general, humans are pretty bad at causal inference. We are way too prone to jump to causal conclusions based on insufficient evidence. Do we classify baseball announcers as not intelligent when they talk about how a player struggles with high pressure situations after N = 10 plate appearances in the playoffs? We’re also pretty bad at reasoning about things that go against our preconceptions. Do we think Fisher was not intelligent because he argued that smoking didn’t cause cancer? Do we think all the anthropogenic global warming deniers are not intelligent? Maybe they’re right and it’s just a coincidence that temps have gone up coinciding with industrialization and carbon emissions. Seems like a highly suspicious coincidence, but causation is really hard when you can’t do randomized controlled trials (and even then it’s not so easy because of all the possible mediation).

(4) How you call this one depends on whether you think the front-line fine-tuning of ChatGPT made a reasonably helpful/harmless/truthful bot or not and whether the “ethics” it was trained with are yours. You can certainly jailbreak even ChatGPT-4 to send it spiraling into hate land or fantasy land. You can jailbreak some of my family in the same way, but I wouldn’t go so far as to say they weren’t intelligent. You can find lots of folks who think ChatGPT is too “woke”. This is a running theme on the GPT subreddit. It’s also a running theme among anti-woke billionaires, as reflected in the UK’s Daily Telegraph article title, “ChatGPT may be the next big thing, but it’s a biased woke robot.”

I’ve heard a lot of people say their dog is more intelligent than ChatGPT. I suppose they would argue for a version of intelligence that doesn’t require (1) or (4) and is very tolerant of poor performance in (2) and (3).

Evidence, desire, support

Andrew — Sat, 13 Apr 2024 13:50:05 +0000

I keep worrying, as with a loose tooth, about news media elites who are going for the UFOs-as-space-aliens theory. This one falls halfway between election denial (too upsetting for me to want to think about too often) and belief in ghosts (too weird to take seriously).

I was also thinking about the movie JFK, which I saw when it came out in 1991. As a reader of the newspapers, I knew that the narrative pushed in the movie was iffy, to say the least; still, I watched the movie intently—I wanted to believe. In the same way that in the 1970s I wanted to believe those claims that dolphins are smarter than people, or that millions of people wanted to believe in the Bermuda Triangle or ancient astronauts or Noah’s Ark or other fringe ideas that were big in that decade. None of those particular ideas appealed to me.

Anyway, this all got me thinking about what it takes for someone to believe in something. My current thinking is that belief requires some mixture of the following three things:
1. Evidence
2. Desire
3. Support

To go through these briefly:

1. I’m using the term “evidence” in a general sense to include things you directly observe and also convincing arguments of some sort or another. Evidence can be ambiguous and, much to people’s confusion, it doesn’t always point in the same direction. The unusual trajectory of Oswald’s bullet is a form of evidence, even though not as strong as has been claimed by conspiracy theories. The notorious psychology paper from 2011 is evidence for ESP. It’s weak evidence, really no evidence at all for anything beyond the low standards of academic psychology at the time, but it played the role of evidence for people who were interested in or open to believing.

2. By “desire,” I mean a desire to believe in the proposition at hand. There can be complicated reasons for this desire. Why did I have some desire in 1991 to believe the fake JFK story, even though I knew ahead of time it was suspect? Maybe because it helped make sense of the world? Maybe because, if I could believe the story, I could go with the flow of the movie and feel some righteous anger? I don’t really know. Why do some media insiders seem to have the desire to believe that UFOs are space aliens? Maybe because space aliens are cool, maybe because, if the theory is true, then these writers are in on the ground floor of something big, maybe because the theory is a poke in the eye at official experts, maybe all sorts of things.

3. “Support” refers to whatever social environment you’re in. 30% of Americans believe in ghosts, and belief in ghosts seems to be generally socially acceptable—I’ve heard people from all walks of life express the belief—but there are some places where it’s not taken seriously, such as in the physics department. The position of ghost-belief within the news media is complicated, typically walking a fine line to avoid expressing belief or disbelief. For example, a quick search of *ghosts npr* led to this from the radio reporter:

I’m pretty sure I don’t believe in ghosts. Now, I say pretty sure because I want to leave the possibility open. There have definitely been times when I felt the presence of my parents who’ve both died, like when one of their favorite songs comes on when I’m walking the aisles of the grocery store, or when the wind chime that my mom gave me sings a song even though there’s no breeze. But straight-up ghosts, like seeing spirits, is that real? Can that happen?

This is kind of typical. It’s a news story that’s pro-ghosts, reports a purported ghost sighting with no pushback, but there’s that kinda disclaimer too. It’s similar to reporting on religion. Different religions contradict each other, and so if you want to report in a way that’s respectful of religion, you have to place yourself in a no-belief-yet-no-criticism mode: if you have a story about religion X, you can’t push back (“Did you really see the Lord smite that goat in your backyard that day?”) because that could offend adherents of that religion, but you can’t fully go with it, as that could offend adherents of every other religion.

I won’t say that all three of evidence, desire, and support are required for belief, just that they can all contribute. We can see this with some edge cases. That psychologist who published the terrible paper on ESP: he had a strong desire to believe, a strong enough desire to motivate an entire research program on his part. There was also a little bit of institutional support for the belief. Not a lot—ESP is a fringe take that would be, at best, mocked by most academic psychologists, it’s a belief that has much lower standing now than it did fifty years ago—but some. Anyway, the strong desire was enough, along with the terrible-but-nonzero evidence and the small-but-nonzero support. Another example would be Arthur Conan Doyle believing those ridiculous faked fairy photos: spiritualism was big in society at the time, so he had strong social support as well as strong desire to believe. In other cases, evidence is king, but without the institutional support it can be difficult for people to be convinced. Think of all those “they all laughed, but . . .” stories of scientific successes under adversity: continental drift and all the rest.

As we discussed in an earlier post, the “support” thing seems like a big change regarding the elite media and UFOs-as-space-aliens. The evidence for space aliens, such as it is—blurry photographs, eyewitness testimony, suspiciously missing government records, and all the rest—has been with us for half a century. The desire to believe has been out there too for a long time. What’s new is the support: some true believers managed to insert the space aliens thing into the major news media in a way that gives permission to wanna-believers to lean into the story.

I don’t have anything more to say on this right now, just trying to make sense of it all. This all has obvious relevance to political conspiracy theories, where authority figures can validate an idea, which then gives permission for other wanna-believers to push it.

Delayed retraction sampling

Andrew — Fri, 12 Apr 2024 13:38:43 +0000

Colby Vorland writes:

In case it is of interest, a paper we reported 3 years, 4 months ago was just retracted:

Retracted: Effect of Moderate-Intensity Aerobic Exercise on Hepatic Fat Content and Visceral Lipids in Hepatic Patients with Diabesity: A Single-Blinded Randomised Controlled Trial
https://www.hindawi.com/journals/ecam/2023/9829387/

Over this time, I was sent draft retraction notices on two occasions by Hindawi’s research integrity team that were then reneged for reasons that were not clear. The research integrity team stopped responding to me, but after I involved COPE, they eventually got it done. Happy to give more details. Our full team who helped with this one was Colby Vorland, Greyson Foote, Stephanie Dickinson, Evan Mayo-Wilson, David Allison, and Andrew Brown.

As stated in the retraction notice, here are the issues:

(i) There is no mention of the clinical trial registration number, NCT03774511 (retrospectively registered in December 2018), or that this was part of a larger study. Overall, there were three arms: a control, a high-intensity exercise group (HII) and a moderate-intensity exercise group (MIC), but only the control and MIC were reported in [1].

(ii) There is no indication that references 35 and 36 [4, 5] cited in the article draw on data from the same study participants and these references are incorrectly presented as separate studies supporting the findings of the article, which may have misled readers.

(iii) The authors have stated that recruitment and randomization occurred during August-December 2017, the HII and control arms were conducted during January-August 2018, and the MIC arm was run during August-December 2018, which is a non-standard study design and was not reported in any of the articles.

(iv) The data presented in Figure 1 and Tables 1 and 2 are identical to data presented in Abdelbasset et al. [5]. With respect to Figure 1 the study has been presented without the additional study arm shown in Abdelbasset et al. [5].

(v) The data in Table 2 is identical to that shown as the MIC study arm in Abdelbasset et al. [5]. However, the p values have been presented to three decimal places whereas in Abdelbasset et al. [5] they are presented to two decimal places [5]. The data also shows inconsistent rounding. There is a particular concern where 0.046 has been rounded down to 0.04 (and hence appears statistically significant) rather than rounding up, as has occurred with other values. In addition, several items shown as in Abdelbasset et al. [5] are shown as values less than 0.01 (i.e., <0.01, 0.004 and 0.002).

(vi) There are concerns with the accuracy of the statistical tests reported in the article, because the comparisons are of within-group differences rather than using valid between-group tests such as ANOVA. Many of the p-values reported in the article could not be replicated by Vorland et al. [3], and in particular they found no significant differences between treatment groups for BMI, IHTG, visceral adipose fat, total cholesterol, and triglycerides. This was confirmed by the authors’ reanalysis, apart from triglycerides for which there was a significant difference between treatment groups according to the authors’ reanalysis.

(vii) The age ranges are slightly inconsistent between the articles, despite the studies collectively reporting on the same participants: 45–60 in [1, 4] and 40–60 in [5]. The authors state that 40–60 years reflects the inclusion criteria for the study, whereas the actual age range of the included participants was 45–60 years.

(viii) Although this was a single clinical trial, different ethical approval numbers are given in each article: PT/2017/00-019 [1], PT/2017/00-018 [4], and P.TREC/012/002146 [5].

Also this from the published retraction:

The authors do not agree to the retraction and the notice.

I appreciate the effort by Vorland et al. I’ve done this sort of thing too on occasion, and other times I’ve asked a journal to publish a letter of correction but they’ve refused. Unfortunately, retraction and correction are not scalable. Literally zillions of scientific papers are published a year, and only a handful get retracted or corrected.
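The rounding concern in item (v) of the notice is the kind of thing that can be checked mechanically. Here is a minimal sketch, not the procedure Vorland et al. or Hindawi used: the function names `round_p` and `rounding_changes_side` are made up for illustration, and the only p-values used are ones quoted in the notice. It flags a p-value when truncating it and rounding it conventionally land on different sides of the 0.05 line at two decimal places.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP


def round_p(p: str, places: int = 2, mode: str = ROUND_HALF_UP) -> Decimal:
    """Round a p-value given as a string, using exact decimal arithmetic."""
    quantum = Decimal(1).scaleb(-places)  # places=2 gives a quantum of 0.01
    return Decimal(p).quantize(quantum, rounding=mode)


def rounding_changes_side(p: str, alpha: str = "0.05", places: int = 2) -> bool:
    """True if truncation reports a value below alpha but half-up rounding does not."""
    threshold = Decimal(alpha)
    truncated = round_p(p, places, ROUND_DOWN)
    rounded = round_p(p, places, ROUND_HALF_UP)
    return truncated < threshold <= rounded


if __name__ == "__main__":
    for p in ["0.046", "0.004", "0.002"]:
        print(p, round_p(p, mode=ROUND_DOWN), round_p(p), rounding_changes_side(p))
```

Passing p-values as strings and using `decimal` avoids binary floating-point surprises at exactly the digits that matter here.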

How large is that treatment effect, really? (my talk at NYU economics department Thurs 18 Apr 2024, 12:30pm)

Andrew — Thu, 11 Apr 2024 13:33:55 +0000

19 W 4th Street, Room 517:

How large is that treatment effect, really?

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

“Unbiased estimates” aren’t really unbiased, for a bunch of reasons, including aggregation, selection, extrapolation, and variation over time. Econometricians typically focus on causal identification, with the goal of estimating “the” effect. But we typically care about individual effects (not “Does the treatment work?” but “Where and when does it work?” and “Where and when does it hurt?”). Estimating individual effects is relevant not only for individuals but also for generalizing to the population. For example, how do you generalize from an A/B test performed on a sample right now to possible effects on a different population in the future? Thinking about variation and generalization can change how we design and analyze experiments and observational studies. We demonstrate with examples in social science and public health.
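To make the A/B-test question concrete, here is a toy sketch that is not taken from the talk. The two groups, their effect sizes, and the population shares are all invented for illustration, and `average_effect` is a hypothetical helper. It shows how the same group-level effects average out to quite different numbers once the mix of people changes.

```python
def average_effect(effects: dict[str, float], shares: dict[str, float]) -> float:
    """Weight group-level treatment effects by each group's share of a population."""
    total = sum(shares[group] for group in effects)
    return sum(effects[group] * shares[group] / total for group in effects)


# Hypothetical group-level effects: the treatment helps one group, slightly hurts the other.
effects = {"young": 0.10, "old": -0.02}

# Hypothetical composition of the experimental sample and of the future target population.
sample_shares = {"young": 0.8, "old": 0.2}
target_shares = {"young": 0.3, "old": 0.7}

if __name__ == "__main__":
    print("average effect in the sample:", average_effect(effects, sample_shares))
    print("average effect in the target:", average_effect(effects, target_shares))
```

In this sketch the sample average is an accurate summary of the sample, but it is several times larger than the average effect in the target population. The group-level effects are also assumed to carry over unchanged, which is itself a strong assumption.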

“He had acquired his belief not by honestly earning it in patient investigation, but by stifling his doubts. And although in the end he may have felt so sure about it that he could not think otherwise, yet inasmuch as he had knowingly and willingly worked himself into that frame of mind, he must be held responsible for it.”

Andrew — Wed, 10 Apr 2024 13:39:27 +0000

Ron Bloom points us to this wonderful article, “The Ethics of Belief,” by the mathematician William Clifford, also known for Clifford algebras. The article is related to some things I’ve written about evidence vs. truth (see here and here) but much more beautifully put. Here’s how it begins:

A shipowner was about to send to sea an emigrant-ship. He knew that she was old, and not overwell built at the first; that she had seen many seas and climes, and often had needed repairs. Doubts had been suggested to him that possibly she was not seaworthy. These doubts preyed upon his mind, and made him unhappy; he thought that perhaps he ought to have her thoroughly overhauled and refitted, even though this should put him to great expense. Before the ship sailed, however, he succeeded in overcoming these melancholy reflections. He said to himself that she had gone safely through so many voyages and weathered so many storms that it was idle to suppose she would not come safely home from this trip also. He would put his trust in Providence, which could hardly fail to protect all these unhappy families that were leaving their fatherland to seek for better times elsewhere. He would dismiss from his mind all ungenerous suspicions about the honesty of builders and contractors. In such ways he acquired a sincere and comfortable conviction that his vessel was thoroughly safe and seaworthy; he watched her departure with a light heart, and benevolent wishes for the success of the exiles in their strange new home that was to be; and he got his insurance-money when she went down in mid-ocean and told no tales.

What shall we say of him? Surely this, that he was verily guilty of the death of those men. It is admitted that he did sincerely believe in the soundness of his ship; but the sincerity of his conviction can in no wise help him, because he had no right to believe on such evidence as was before him. He had acquired his belief not by honestly earning it in patient investigation, but by stifling his doubts. And although in the end he may have felt so sure about it that he could not think otherwise, yet inasmuch as he had knowingly and willingly worked himself into that frame of mind, he must be held responsible for it.

Clifford’s article is from 1877!

Bloom writes:

One can go over this in two passes. One pass may be read as “moral philosophy.”

But the second pass helps one think a bit about how one ought to make precise the concept of ‘relevance’ in “relevant evidence.”

Specifically (this is remarkably deficient in the Bayesian corpus I find) I would argue that when we say “all probabilities are relative to evidence” and write the symbolic form straightaway P(A|E) we are cheating. We have not faced the fact — I think — that not every “E” has any bearing (“relevance”) one way or another on A and that it is *inadmissible* to combine the symbols because it is so easy to write ’em down. Perhaps one evades the problem by saying, well what do you *think* is the case. Perhaps you might say, “I think that E is irrelevant if P(A|E) = P(A|~E).” But that begs the question: it says in effect that *both* E and ~E can be regarded as “evidence” for A. I argue that easily leads to nonsense. To regard any utterance or claim as “evidence” for any other utterance or claim leads to absurdities. Here for instance:

A = “Water ice of sufficient quantity to maintain a lunar base will be found in the spectral analysis of the plume of the crashed lunar polar orbiter.”

E = If there are martians living on the Moon of Jupiter, Europa, then they celebrate their Martian Christmas by eating Martian toast with Martian jam.

Is E evidence for A? is ~E evidence for A? Is any far-fetched hypothetical evidence for any other hypothetical whatsoever?

Just to provide some “evidence” that I am not being entirely facetious about the Lunar orbiter; I attach also a link to now much superannuated item concerning that very intricate “experiment” — I believe in the end there was some spectral evidence turned up consistent with something like a teaspoon’s worth of water-ice per 25 square Km.
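A side note on the irrelevance condition in Bloom’s message, added here as a worked identity rather than anything from his email: assuming 0 < P(E) < 1, the law of total probability shows that P(A|E) = P(A|~E) is the same as saying that A and E are independent.

```latex
\begin{aligned}
P(A) &= P(A \mid E)\,P(E) + P(A \mid \neg E)\,\bigl(1 - P(E)\bigr) \\
     &= P(A \mid E)\,\bigl[P(E) + 1 - P(E)\bigr]
        \qquad \text{if } P(A \mid E) = P(A \mid \neg E) \\
     &= P(A \mid E).
\end{aligned}
```

So under that condition, learning E (or ~E) leaves the probability of A where it started, which is one way to formalize “no bearing one way or another.” It does not settle Bloom’s further question of whether such an E should be called evidence at all.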

P.S. Just to make the connection super-clear, I’d say that Clifford’s characterization, “He had acquired his belief not by honestly earning it in patient investigation, but by stifling his doubts. And although in the end he may have felt so sure about it that he could not think otherwise, yet inasmuch as he had knowingly and willingly worked himself into that frame of mind, he must be held responsible for it,” is an excellent description of those Harvard professors who notoriously endorsed the statement, “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.” Also a good match to those Columbia administrators who signed off on those U.S. News numbers. In neither case did a ship go down; it’s the same philosophical principle but lower stakes. Just millions of dollars involved, no lives lost.

As Isaac Asimov put it, “A robot may not injure a human being or, through inaction, allow a human being to come to harm.” Sometimes that inaction is pretty damn active, when a shipowner or a scientific researcher or a university administrator puts in some extra effort to avoid looking at some pretty clear criticisms.

Here’s something you should do when beginning a project, and in the middle of a project, and at the end of the project: Clearly specify your goals, and also specify what’s not in your goal set.

Andrew — Tue, 09 Apr 2024 13:11:24 +0000

Here’s something from Witold’s slides on baggr, an R package (built on Stan) that does hierarchical modeling for meta-analysis:

Overall goals:

1. Implement all basic meta-analysis models and tools
2. Focus on accessibility, model criticism and comparison
3. Help people avoid basic mistakes
4. Keep the framework flexible and extend to more models

(Probably) not our goal:

5. Build a package for people who already build their models in Stan

I really like this practice of specifying goals. This is so basic that it seems like we should always be doing it—but so often we don’t! Also I like the bit where he specifies something that’s not in his goals.

Again, this all seems so natural when we see it, but it’s something we don’t usually do. We should.

People have needed rituals to turn data into truth for many years. Why would we be surprised if many people now need procedural reforms to work?

Jessica Hullman — Mon, 08 Apr 2024 17:56:52 +0000

This is Jessica. How to weigh metascience or statistical reform proposals has been on my mind more than usual lately as a result of looking into and blogging about the Protzko et al. paper on rigor-enhancing practices. Seems it’s also been on Andrew’s mind.

Like Andrew, I have been feeling “bothered by the focus on procedural/statistical ‘rigor-enhancing practices’ of ‘confirmatory tests, large sample sizes, preregistration, and methodological transparency’” because I suspect researchers are taking it to heart that these steps will be enough to help them produce highly informative experiments.

Yesterday I started thinking about it via an analogy to writing. I heard once that if you’re trying to help someone become a better writer, you should point out no more than three classes of things they’re doing wrong at one time, because too much new information can be self-defeating.

Imagine you’re a writing consultant and people bring you their writing and you advise them on how to make it better. Keeping in mind the need to not overwhelm, initially maybe you focus on the simple things that won’t make them a great writer, but are easy to teach. “You’re doing comma splices, your transitions between paragraphs suck, you’re doing citation wrong.” You talk up how important these things are to get right to get them motivated, and you bite your tongue when it comes to all the other stuff they need help with to avoid discouraging them.

Say the person goes away and fixes the three things, then comes back to you with some new version of what they’ve written. What do you do now? Naturally, you give them three more things. Over time as this process repeats, you eventually get to the most nuanced stuff that is harder to get right but ultimately more important to their success as a writer.

But this approach presupposes that your audience either intrinsically cares enough about improving to keep coming back, or that they have some outside reason they must keep coming back, like maybe they are a Ph.D. student and their advisor is forcing their hand. What if you can never be sure when someone walks in the door that they will come back a second time after you give your advice? In fact, what if the people who really need your help with their writing are bad writers because they fixated on the superficial advice they got in middle school or high school that boils good writing down to a formula, and considered themselves done? And now they’re looking for the next quick fix, so they can go back to focusing on whatever they are actually interested in and treating writing as a necessary evil?

Probably they will latch onto the next three heuristics you give them and consider themselves done. So if we suspect the people we are advising will be looking for easy answers, it seems unlikely that we are going to get them there using the approach above where we give them three simple things and we talk up the power of these things to make them good writers. Yet some would say this is what mainstream open science is doing, by giving people simple procedural reforms (just preregister, just use big samples, etc) and talking up how they help eliminate replication problems.

I like writing as an analogy for doing experimental social science because both are a kind of wicked problem where there are many possible solutions, and the criteria for selecting between them are nuanced. There are simple procedural things that are easier to point out, like the comma splices or lacking transitions between paragraphs in writing, or not having a big enough sample or invalidating your test by choosing it post hoc in experimental science. But avoiding mistakes at this level is not going to make you a good writer, just like enacting simple procedural heuristics is not going to make you a good experimentalist or modeler. For that you need to adopt an orientation that acknowledges the inherent difficulty of the task and prompts you to take a more holistic approach.

Figuring out how to encourage that is obviously not easy. But one reason that starting with the simple procedural stuff (or broadly applicable stuff, as Nosek implies the “rigor-enhancing practices” are) seems insufficient to me is that I don’t necessarily think there’s a clear pathway from the simple formulaic stuff to the deeper stuff, like the connection between your theory and what you are measuring and how you are measuring it and how you specify and select among competing models. I actually think it makes more sense to go the opposite way, from why inference from experimental data is necessarily very hard as a result of model misspecification, effect heterogeneity, measurement error, etc., to the ingredients that have to be in place for us to even have a chance, like sufficient sample size and valid confirmatory tests. The problem is that one can understand the concepts of preregistration or sufficient sample size while still having a relatively simple mental model of effects as real or fake and questionable research practices as the main source of issues.

In my own experience, the more I’ve thought about statistical inference from experiments over the years, the more seriously I take heterogeneity and underspecification/misspecification, to the point that I’ve largely given up doing experimental work. This is an extreme outcome of course, but I think we should expect that the more one recognizes how hard the job really is, the less likely one is to firehose the literature in one’s field with a bunch of careless dead-in-the-water style studies. As work by Berna Devezer and colleagues has pointed out, open science proposals are often subject to the same kinds of problems such as overconfident claims and reliance on heuristics that contributed to the replication crisis in the first place. This solution-ism (a mindset I’m all too familiar with as a computer scientist) can be counterproductive.

Hey, some good news for a change! (Child psychology and Bayes)

Andrew — Mon, 08 Apr 2024 13:57:47 +0000

Erling Rognli writes:

I just wanted to bring your attention to a positive stats story, in case you’d want to feature it on the blog. A major journal in my field (the Journal of Child Psychology and Psychiatry) has over time taken a strong stance for using Bayesian methods, publishing an editorial in 2016 advocating switching to Bayesian methods:

Editorial: Bayesian benefits for child psychology and psychiatry researchers – Oldehinkel – 2016 – Journal of Child Psychology and Psychiatry.

And recently following up with inviting myself and some colleagues to write a brief introduction to Bayesian methods (where we of course recommend Stan):

Editorial perspective: Bayesian statistical methods are useful for researchers in child and adolescent mental health – Rognli – Journal of Child Psychology and Psychiatry.

I think this consistent editorial support really matters for getting risk-averse researchers to start using new methods, so I think the editors of the JCPP deserve recognition for contributing to improving statistical practice in this field.

No reason to think that Bayes and Stan will, by themselves, transform child psychology, but I think it’s a step in the right direction. As Rubin used to say, one advantage of Bayes is that the work you do to set up the model represents a bridge between experiment, data, and scientific understanding. It’s getting you to think about the right questions.

Evilicious 3: Face the Music

Andrew — Sun, 07 Apr 2024 13:34:03 +0000

A correspondent forwards me this promotional material that appeared in his inbox:

Hello hello,

I am happy to announce that my new book MISBELIEF is out today!

Do you have a friend or family member who changed in some dramatic ways in the last few years? So, much so that you no longer understand them? Over the last few years I have been on a journey into some of the darkest corners of the internet, trying to make sense of what was happening around us. In MISBELIEF, which is out this week, I weave my personal story and research in an effort to shed light on this complex and important process. Misbeliefs are not just about other people, they are also about our own beliefs. This book will allow you to question your own beliefs, and how well you really know what you think you know. More generally, and important for society, this book is also about trust, the trust crisis we are in, and what it means for us as we go into the next election season.

You can learn more about the book here and here. If you decide you want to read the book, this page includes an option to order a copy for yourself + a copy to be donated to an educator. The book can also be found at Amazon, BAM, or Barnes & Noble.

And if you like the book, please leave a review on Amazon – it will help other people find the book.

I hope you enjoy the book.

Irrationally yours,

Dan

“Misbeliefs are not just about other people, they are also about our own beliefs.” Indeed.

I wonder if this new book includes the shredder story.

P.S. The book has blurbs from Yuval Harari, Arianna Huffington, and Michael Shermer (the professional skeptic who assures us that he has a haunted radio). This thing where celebrities stick together . . . it’s nuts!

P.P.S. The good news is that there’s already some new material for the eventual sequel. And it’s “preregistered”! What could possibly go wrong?

What is the prevalence of bad social science?

Andrew — Sat, 06 Apr 2024 13:21:51 +0000

Someone pointed me to this post from Jonatan Pallesen:

Frequently, when I [Pallesen] look into a discussed scientific paper, I find out that it is astonishingly bad.

• I looked into Claudine Gay’s 2001 paper to check a specific thing, and I find out that research approach of the paper makes no sense. (https://x.com/jonatanpallesen/status/1740812627163463842)

• I looked into the famous study about how blind auditions increased the number of women in orchestras, and found that the only significant finding is in the opposite direction. (https://x.com/jonatanpallesen/status/1737194396951474216)

• The work of Lisa Cook was being discussed because of her nomination to the fed. @AnechoicMedia_ made a comment pointing out a potential flaw in her most famous study. And indeed, the flaw was immediately obvious and fully disqualifying. (https://x.com/jonatanpallesen/status/1738146566198722922)

• The study showing judges being very affected by hunger? Also useless. (https://x.com/jonatanpallesen/status/1737965798151389225)

These studies do not have minor or subtle flaws. They have flaws that are simple and immediately obvious. I think that anyone, without any expertise in the topics, can read the linked tweets and agree that yes, these are obvious flaws.

I’m not sure what to conclude from this, or what should be done. But it is rather surprising to me to keep finding this.

My quick answer is, at some point you should stop being surprised! Disappointed, maybe, just not surprised.

A key point is that these are not just any papers, they’re papers that have been under discussion for some reason other than their potential problems. Pallesen, or any of us, doesn’t have to go through Psychological Science and PNAS every week looking for the latest outrage. He can just sit in one place, passively consume the news, and encounter a stream of prominent published research papers that have clear and fatal flaws.

Regular readers of this blog will recall dozens more examples of high-profile disasters: the beauty-and-sex-ratio paper, the ESP paper and its even more ridiculous purported replications, the papers on ovulation and clothing and ovulation and voting, himmicanes, air rage, ages ending in 9, the pizzagate oeuvre, the gremlins paper (that was the one that approached the platonic ideal of more corrections than data points), the ridiculously biased estimate of the effects of early-childhood intervention, the air pollution in China paper and all the other regression discontinuity disasters, much of the nudge literature, the voodoo study, the “out of Africa” paper, etc. As we discussed in the context of that last example, all the way back in 2013 (!), the problem is closely related to these papers appearing in top journals:

The authors have an interesting idea and want to explore it. But exploration won’t get you published in the American Economic Review etc. Instead of the explore-and-study paradigm, researchers go with assert-and-defend. They make a very strong claim and keep banging on it, defending their claim with a bunch of analyses to demonstrate its robustness. . . . High-profile social science research aims for proof, not for understanding—and that’s a problem. The incentives favor bold thinking and innovative analysis, and that part is great. But the incentives also favor silly causal claims. . . .

So, to return to the question in the title of this post, how often is this happening? It’s hard for me to say. On one hand, ridiculous claims get more attention; we don’t spend much time talking about boring research of the “Participants reported being hungrier when they walked into the café (mean = 7.38, SD = 2.20) than when they walked out [mean = 1.53, SD = 2.70, F(1, 75) = 107.68, P < 0.001]” variety. On the other hand, do we really think that high-profile papers in top journals are that much worse than the mass of published research?

I expect that some enterprising research team has done some study, taking a random sample of articles published in some journals and then looking at each paper in detail to evaluate its quality. Without that, we can only guess, and I don’t have it in me to hazard a percentage. I’ll just say that it happens a lot—enough so that I don’t think it makes sense to trust social-science studies by default.
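For what it’s worth, the arithmetic of such an audit is simple once the hard part (reading each sampled paper and judging it) is done. Here is a minimal sketch using the standard Wilson score interval for a proportion. The counts in the example (30 papers judged fatally flawed out of a random sample of 100) are invented for illustration and are not an estimate of anything.

```python
import math


def wilson_interval(flawed: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for the proportion flawed / n."""
    if n <= 0 or not 0 <= flawed <= n:
        raise ValueError("need n > 0 and 0 <= flawed <= n")
    p_hat = flawed / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical audit: 30 of 100 randomly sampled papers judged to have fatal flaws.
    low, high = wilson_interval(30, 100)
    print(f"estimated prevalence: 0.30, interval roughly ({low:.2f}, {high:.2f})")
```

The interval only captures sampling variability. It says nothing about disagreement between reviewers over what counts as a fatal flaw, which is probably the larger source of uncertainty.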

My correspondent also pointed me to a recent article in Harvard’s student newspaper, “I Vote on Plagiarism Cases at Harvard College. Gay’s Getting off Easy,” by “An Undergraduate Member of the Harvard College Honor Council,” who writes:

Let’s compare the treatment of Harvard undergraduates suspected of plagiarism with that of their president. . . . A plurality of the Honor Council’s investigations concern plagiarism. . . . when students omit quotation marks and citations, as President Gay did, the sanction is usually one term of probation — a permanent mark on a student’s record. A student on probation is no longer considered in good standing, disqualifying them from opportunities like fellowships and study-abroad programs. Good standing is also required to receive a degree.

What is striking about the allegations of plagiarism against President Gay is that the improprieties are routine and pervasive. She is accused of plagiarism in her dissertation and at least two of her 11 journal articles. . . .

In my experience, when a student is found responsible for multiple separate Honor Code violations, they are generally required to withdraw — i.e., suspended — from the College for two semesters. . . . We have even voted to suspend seniors just about to graduate. . . .

There is one standard for me and my peers and another, much lower standard for our University’s president.

This echoes what Jonathan Bailey has written here and here at his blog Plagiarism Today:

Schools routinely hold their students to a higher and stricter standard when it comes to plagiarism than they handle their faculty and staff. . . .

To give an easy example. In October 2021, W. Franklin Evans, who was then the president of West Liberty University, was caught repeatedly plagiarizing in speeches he was giving as President. Importantly, it wasn’t past research that was in dispute, it was the work he was doing as president.

However, though the board did vote unanimously to discipline him, they also voted against termination and did not clarify what discipline he was receiving.

He was eventually let go as president, but only after his contract expired two years later. It’s difficult to believe that a student at the school, if faced with a similar pattern of plagiarism in their coursework, would be given that same chance. . . .

The issue also isn’t limited to higher education. In February 2020, Katy Independent School District superintendent Lance Hindt was accused of plagiarism in his dissertation. Though he eventually resigned, the district initially threw their full support behind Hindt. This included a rally for Hindt that was attended by many of the teachers in the district.

Even after he left, he was given two years of salary and had $25,000 set aside for him if he wanted to file a defamation lawsuit.

There are lots and lots of examples of prominent faculty committing scholarly misconduct and nobody seems to care—or, at least, not enough to do anything about it. In my earlier post on the topic, I mentioned the Harvard and Yale law professors, the USC medical school professor, the Princeton history professor, the George Mason statistics professor, and the Rutgers history professor, none of whom got fired. And I’d completely forgotten about the former president of the American Psychological Association and editor of Perspectives on Psychological Science who misrepresented work he had published and later was forced to retract—but his employer, Cornell University, didn’t seem to care. And the University of California professor who misrepresented data and seems to have suffered no professional consequences. And the Stanford professor who gets hyped by his university while promoting miracle cures and bad studies. And the dean of engineering at the University of Nevada. Not to mention all the university administrators and football coaches who misappropriate funds and then are quietly allowed to leave on golden parachutes.

Another problem is that we rely on the news media to keep these institutions accountable. We have lots of experience with universities (and other organizations) responding to problems by denial; the typical strategy appears to be to lie low and hope the furor will go away, which typically happens in the absence of lots of stories in the major news media. But . . . the news media have their own problems: little problems like NPR consistently hyping junk science and big problems like Fox pushing baseless political conspiracy theories. And if you consider podcasts and Ted talks to be part of “the media,” which I think they are—I guess as part of the entertainment media rather than the news media, but the dividing line is not sharp—then, yeah, a huge chunk of the media is not just susceptible to being fooled by bad science and indulgent of academic misconduct, it actually relies on bad science and academic misconduct to get the wow! stories that bring the clicks.

To return to the main thread of this post: by sanctioning students for scholarly misconduct but letting its faculty and administrators off the hook, Harvard is, unfortunately, following standard practice. The main difference, I guess, is that “president of Harvard” is more prominent than “Princeton history professor” or “Harvard professor of constitutional law” or “president of West Liberty University” or “president of the American Psychological Association” or “UCLA medical school professor” or all the others. The story of the Harvard president stays in the news, while those others all receded from view, allowing the administrators at those institutions to follow the usual plan of minimizing the problem, saying very little, and riding out the storm.

Hey, we just got sidetracked into a discussion of plagiarism. This post was supposed to be about bad research. What can we say about that?

Bad research is different than plagiarism. Obviously, students don’t get kicked out for doing bad research, using wrong statistical methods, losing their data, making claims that defy logic and common sense, claiming to modify a paper shredder that has never existed, etc etc etc. That’s the kind of behavior that, if your final paper also has formatting problems, will get you slammed with a B grade and that’s about it.

When faculty are found to have done bad research, the usual reaction is not to give them a B or to do the administrative equivalent—lowering their salary, perhaps?, or removing them from certain research responsibilities, maybe making them ineligible to apply for grants?—but rather to pretend that nothing happened. The idea is that, once an article has been published, you draw a line under it and move onward. It’s considered in bad taste—Javert-like, even!—to go back and find flaws in papers that are already resting comfortably in someone’s C.V. As Pallesen notes, so often when we do go back and look at those old papers, we find serious flaws. Which brings us to the question in the title of this post.

P.S. The paper by Claudine Gay discussed by Pallesen is here; it was published in 2001. For more on the related technical questions involving the use of ecological regression, I recommend this 2002 article by Michael Herron and Kenneth Shotts (link from Pallesen) and my own article with David Park, Steve Ansolabehere, Phil Price, and Lorraine Minnite, “Models, assumptions, and model checking in ecological regressions,” from 2001.

“AI” as shorthand for turning off our brains. (This is not an anti-AI post; it’s a discussion of how we think about AI.)

Andrew — Fri, 05 Apr 2024 13:38:07 +0000

Before going on, let me emphasize that, yes, modern AI is absolutely amazing—self-driving cars, machines that can play ping-pong, chessbots, computer programs that write sonnets, the whole deal! Call it machine intelligence or whatever, it’s amazing.

What I’m getting at in this post is the way in which attitudes toward AI fit into existing practices in science and other aspects of life.

This came up recently in comments:

“AI” does not just refer to a particular set of algorithms or computer programs but also to the attitude in which an algorithm or computer program is idealized to the extent that people think it’s ok for them to rely on it and not engage their brains.

Some examples of “AI” in that sense of the term:
– When people put a car on self-driving mode and then disengage from the wheel.
– When people send out a memo produced by a chatbot without reading and understanding it first.
– When researchers use regression discontinuity analysis or some other identification strategy and don’t check that their numbers make any sense at all.
– When journal editors see outrageous claims backed by “p less than 0.05” and then just push the Publish button.

“AI” is all around us, if you just know where to look!

One thing that interests me here is how current expectations of AI in some ways match and in some ways go beyond past conceptions in science fiction. The chatbot, for example, is pretty similar to all those talking robots, and I guess you could imagine a kid in such a story asking his robot to do his homework for him. Maybe the difference is that the robot is thought to have some sort of autonomy, along with which comes some idiosyncratic fallibility (if only that the robot is too “logical” to always see clearly to the solution of a problem), whereas an AI is considered more of an institutional product with some sort of reliability, in the same sense that every bottle of Coca-Cola is the same. Maybe that’s the connection to naive trust in standardized statistical methods.

This also relates to the idea that humans used to be thought of as the rational animal but now are viewed as irrational computers. In the past, our rationality was considered to be what separates us from the beasts, either individually or through collective action, as in Locke and Hobbes. If the comparison point is animals, then our rationality is a real plus! Nowadays, though, it seems almost the opposite: if the comparison point is a computer, then what makes us special is not our rationality but our emotions.

There is no golden path to discovery. One of my problems with all the focus on p-hacking, preregistration, harking, etc. is that I fear that it is giving the impression that all will be fine if researchers just avoid “questionable research practices.” And that ain’t the case.

Andrew — Thu, 04 Apr 2024 20:28:21 +0000

Following our recent post on the latest Dishonestygate scandal, we got into a discussion of the challenges of simulating fake data and performing a pre-analysis before conducting an experiment.

You can see it all in the comments to that post—but not everybody reads the comments, so I wanted to repeat our discussion here. Especially the last line, which I’ve used as the title of this post.

Raphael pointed out that it can take some work to create a realistic simulation of fake data:

Do you mean to create a dummy dataset and then run the preregistered analysis? I like the idea, and I do it myself, but I don’t see how this would help me see if the endeavour is doomed from the start? I remember your post on the beauty-and-sex ratio, which proved that the sample size was far too small to find an effect of such small magnitude (or was it in the Type S/Type M paper?). I can see how this would work in an experimental setting – simulate a bunch of data sets, do your analysis, compare it to the true effect of the data generation process. But how do I apply this to observational data, especially with a large number of variables (number of interactions scales in O(p²))?

I elaborated:

Yes, that’s what I’m suggesting: create a dummy dataset and then run the preregistered analysis. Not the preregistered analysis that was used for this particular study, as that plan is so flawed that the authors themselves don’t seem to have followed it, but a reasonable plan. And that’s kind of the point: if your pre-analysis plan isn’t just a bunch of words but also some actual computation, then you might see the problems.

In answer to your second question, you say, “I can see how this would work in an experimental setting,” and we’re talking about an experiment here, so, yes, it would’ve been better to have simulated data and performed an analysis on the simulated data. This would require the effort of hypothesizing effect sizes, but that’s a bit of effort that should always be done when planning a study.
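To make the experimental case concrete, here is a minimal sketch of that kind of fake-data exercise. Everything in it is an assumption for illustration: the function name simulate_design, the two-group design, normally distributed outcomes, a z-test that treats the noise level as known, and the particular effect size and sample size. It is not the analysis from any particular study; the point is only that once you hypothesize an effect size, a few lines of simulation show how often the planned comparison would reach significance and how misleading the significant estimates would be.

```python
import math
import random
import statistics


def simulate_design(true_effect, sd, n_per_group, n_sims=2000, seed=1):
    """Simulate a two-group experiment many times under an assumed effect size.

    Returns the share of simulations reaching |z| > 1.96 (power) and, among
    those, the share with the wrong sign (Type S) and the average factor by
    which the estimate overstates the true effect (Type M, or exaggeration).
    """
    rng = random.Random(seed)
    # Simplifying assumption: sd is treated as known, so this is a z-test.
    se = sd * math.sqrt(2 / n_per_group)
    significant = []
    for _ in range(n_sims):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, sd) for _ in range(n_per_group)]
        estimate = statistics.fmean(treated) - statistics.fmean(control)
        if abs(estimate) > 1.96 * se:
            significant.append(estimate)
    power = len(significant) / n_sims
    if not significant:
        return {"power": power, "type_s": None, "exaggeration": None}
    wrong_sign = [e for e in significant if e * true_effect < 0]
    type_s = len(wrong_sign) / len(significant)
    exaggeration = statistics.fmean([abs(e) for e in significant]) / abs(true_effect)
    return {"power": power, "type_s": type_s, "exaggeration": exaggeration}


# Hypothetical planning numbers: a small effect (0.1 sd) with 50 people per group.
print(simulate_design(true_effect=0.1, sd=1.0, n_per_group=50))
```

With a small hypothesized effect and a modest sample, any estimate that clears the significance threshold has to be several times larger than the assumed true effect, which is the kind of problem you would rather learn about before collecting data.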

For an observational study, you can still simulate data; it just takes more work! One approach I’ve used, if I’m planning to fit data predicting some variable y from a bunch of predictors x, is to get the values of x from some pre-existing dataset, for example an old survey, and then just do the simulation part for y given x.
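Here is a rough sketch of that approach, under loudly stated assumptions: the list OLD_SURVEY_AGE is a made-up stand-in for predictor values you would really pull from an existing dataset, the linear model and its coefficients are hypothesized for planning purposes, and the helper names ols_slope and simulate_slopes are hypothetical. The x values stay fixed; only y given x is simulated, and the planned regression is rerun on each fake dataset.

```python
import random
import statistics

# Hypothetical stand-in for predictor values taken from a pre-existing dataset
# (say, respondent ages in an old survey). In practice, load the real x here.
OLD_SURVEY_AGE = [23, 31, 35, 38, 42, 47, 51, 55, 60, 64, 68, 72]


def ols_slope(x, y):
    """Least-squares slope for a regression of y on a single predictor x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def simulate_slopes(x, intercept, slope, sigma, n_sims=1000, seed=1):
    """Keep x fixed, simulate y given x under assumed parameters, refit each time."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_sims):
        y = [intercept + slope * xi + rng.gauss(0.0, sigma) for xi in x]
        estimates.append(ols_slope(x, y))
    return estimates


# Assumed (hypothesized) parameter values for the planning exercise.
estimates = simulate_slopes(OLD_SURVEY_AGE, intercept=2.0, slope=0.05, sigma=3.0)
print("mean of estimated slopes:", round(statistics.fmean(estimates), 3))
print("sd of estimated slopes:  ", round(statistics.stdev(estimates), 3))
```

If the spread of the estimated slopes is as large as the slope you assumed, the planned study is unlikely to tell you much, and you have learned that without collecting any new data.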

Raphael replied:

Maybe not the silver bullet I had hoped for, but now I believe I understand what you mean.

To which I responded:

There is no silver bullet; there is no golden path to discovery. One of my problems with all the focus on p-hacking, preregistration, harking, etc. is that I fear that it is giving the impression that all will be fine if researchers just avoid “questionable research practices.” And that ain’t the case.

Again, this is not a diss on preregistration. Preregistration does one thing; it’s not intended to fix bad aspects of the culture of science such as the idea that you can gather a pile of data, grab some results, declare victory, and go on the TED talk circuit based only on the very slender bit of evidence that you seem to have been able to reject that the data came from a specific random number generator. That line of reasoning, where rejection of straw-man null hypothesis A is taken as evidence in favor of preferred alternative B, is wrong—but it’s not preregistration’s fault that people think that way!

P-hacking can be bad (but the problem here, in my view, is not in performing multiple analyses but rather in reporting only one of them rather than analyzing them all together); various questionable research practices are, well, questionable; and preregistration can help with that, either directly (by motivating researchers to follow a clear plan) or indirectly (by allowing outsiders to see problems in post-publication review, as here).

I am, however, bothered by the focus on procedural/statistical “rigor-enhancing practices” of “confirmatory tests, large sample sizes, preregistration, and methodological transparency.” Again, the problem is if researchers mistakenly think that following such advice will place them back on that nonexistent golden path to discovery.

So, again, I recommend making assumptions, simulating fake data, and analyzing these data as a way of constructing a pre-analysis plan, before collecting any data. That won’t put you on the golden path to discovery either!

All I can offer you here is blood, toil, tears and sweat, along with the possibility that a careful process of assumptions/simulation/pre-analysis will allow you to avoid disasters such as this ahead of time, thus avoiding the consequences of: (a) fooling yourself into thinking you’ve made a discovery, (b) wasting the time and effort of participants, coauthors, reviewers, and postpublication reviewers (that’s me!), and (c) filling the literature with junk that will later be collected in a GIGO meta-analysis and promoted by the usual array of science celebrities, podcasters, and NPR reporters.

Aaaaand . . . the time you’ve saved from all of that could be repurposed into designing more careful experiments with clearer connections between theory and measurement. Not a glide along the golden path to a discovery; more of a hacking through the jungle of reality to obtain some occasional glimpses of the sky.

It’s Ariely time! They had a preregistration but they didn’t follow it.

Andrew — Thu, 04 Apr 2024 13:31:49 +0000

I have a story for you about a success of preregistration. Not quite the sort of success that you might be expecting—not a scientific success—but a kind of success nonetheless.

It goes like this. An experiment was conducted. It was preregistered. The results section was written up in a way that reads as if the experiment worked as planned. But if you go back and forth between the results section and the preregistration plan, you realize that the purportedly successful results did not follow the preregistration plan. They’re just the usual story of fishing and forking paths and p-hacking. The preregistration plan was too vague to be useful, also the authors didn’t even bother to follow it—or, if they did follow it, they didn’t bother to write up the results of the preregistered analysis.

As I’ve said many times before, there’s no reason that preregistration should stop researchers from doing further analyses once they see their data. The problem in this case is that the published analysis was not well justified either from a statistical or a theoretical perspective, nor was it in the preregistration. Its only value appears to be as a way for the authors to spin a story around a collection of noisy p-values.

On the minus side, the paper was published, and nowhere in the paper does it say that the statistical evidence they offer from their study does not come from the preregistration. In the abstract, their study is described as “pre-registered,” which isn’t a lie—there’s a preregistration plan right there on the website—but it’s misleading, given that the preregistration does not line up with what’s in the paper.

On the plus side, outside readers such as ourselves can see the paper and the preregistrations and draw our own conclusions. It’s easier to see the problems with p-hacking and forking paths when the analysis choices are clearly not in the preregistration plan.

The paper

The Journal of Experimental Social Psychology recently published an article, “How pledges reduce dishonesty: The role of involvement and identification,” by Eyal Peer, Nina Mazar, Yuval Feldman, and Dan Ariely.

I had no idea that Ariely is still publishing papers on dishonesty! It says that data from this particular paper came from online experiments. Nothing involving insurance records or paper shredders or soup bowls or 80-pound rocks . . . It seems likely that, in this case, the experiments actually happened and that the datasets came from real people and have not been altered.

And the studies are preregistered, with the preregistration plans all available on the paper’s website.

I was curious about that. The paper had 4 studies. I just looked at the first one, which already took some effort on my part. The rest of you can feel free to look at Studies 2, 3, and 4.

The results section and the preregistration

From the published paper:

The first study examined the effects of four different honesty pledges that did or did not include a request for identification and asked for either low or high involvement in making the pledge (fully-crossed design), and compared them to two conditions without any pledge (Control and Self-Report).

There were six conditions: one control (with no possibility to cheat), a baseline treatment (possibility and motivation to cheat and no honesty pledge), and four different treatments with honesty pledges.

This is what they reported for their primary outcome:

And this is how they summarize in their discussion section:

Interesting, huh?

Now let’s look at the relevant section of the preregistration:

Compare that to what was done in the paper:

– They did the Anova, but that was not relevant to the claims in the paper. The Anova included the control condition, and nobody’s surprised that when you give people the opportunity and motivation to cheat, that some people will cheat. That was not the point of the paper. It’s fine to do the Anova; it’s just more of a manipulation check than anything else.

– There’s something in the preregistration about a “cheating gap” score, which I did not see in the paper. But if we define A to be the average outcome under the control, B to be the average outcome under the baseline treatment, and C, D, E, F to be the average under the other four treatments, then I think the preregistration is saying they’ll define the cheating gap as B-A, and then compare this to C-A, D-A, E-A, and F-A. This is mathematically the same as looking at C-B, D-B, E-B, and F-B, which is what they do in the paper.

– The article jumps back and forth between different statistical summaries: “three of the four pledge conditions showed a decrease in self-reports . . . the difference was only significant for the Copy + ID condition.” It’s not clear what to make of it. They’re using statistical significance as evidence in some way, but the preregistration plan does not make it clear what comparisons would be done, how many comparisons would be made, or how they would be summarized.

– The preregistration plan says, “We will replicate the ANOVAs with linear regressions with the Control condition or Self-Report conditions as baseline.” I didn’t see any linear regressions in the results for this experiment in the published paper.

– The preregistration plan says, “We will also examine differences in the distribution of the percent of problems reported as solved between conditions using Kolmogorov–Smirnov tests. If we find significant differences, we will also examine how the distributions differ, specifically focusing on the differences in the percent of “brazen” lies, which are defined as the percent of participants who cheated to a maximal, or close to a maximal, degree (i.e., reported more than 80% of problems solved). The differences on this measure will be tested using chi-square tests.” I didn’t see any of this in the paper either! Maybe this is fine, because doing all these tests doesn’t seem like a good analysis plan to me.

How do we think of all the analyses stated in the preregistration plan that were not in the paper? Since these analyses were preregistered, I can only assume the authors performed them. Maybe the results were not impressive and so they weren’t included. I don’t know; I didn’t see any discussion of this in the paper.

– The preregistration plan says, “Lastly, we will explore interactions effects between the condition and demographic variables such as age and gender using ANOVA and/or regressions.” They didn’t report any of that either! Also there’s the weird “and/or” in the preregistration, which gives the researchers some additional degrees of freedom.
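On the “cheating gap” item in the list above, the equivalence is just that (C-A) - (B-A) = C-B, since the control average A cancels. A tiny numeric check, using made-up condition averages that are not taken from the paper and hypothetical function names:

```python
import math

# Made-up condition averages, purely for illustration (NOT numbers from the paper).
# A = control, B = baseline treatment, C..F = the four pledge treatments.
means = {"A": 0.30, "B": 0.55, "C": 0.48, "D": 0.50, "E": 0.57, "F": 0.41}


def gap_differences(means):
    """Each pledge condition's gap over control, minus the baseline gap (B - A)."""
    baseline_gap = means["B"] - means["A"]
    return {k: (means[k] - means["A"]) - baseline_gap for k in "CDEF"}


def direct_differences(means):
    """Each pledge condition compared directly with the baseline treatment."""
    return {k: means[k] - means["B"] for k in "CDEF"}


via_gap = gap_differences(means)
direct = direct_differences(means)
for k in "CDEF":
    # The two columns agree up to floating-point rounding.
    print(k, round(via_gap[k], 3), round(direct[k], 3))
```

So the two descriptions give the same comparisons; what the preregistration leaves open is how many of those comparisons would be made and how they would be summarized.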

Not a moral failure

I continue to emphasize that scientific problems do not necessarily correspond to moral problems. You can be a moral person and still do bad science (honesty and transparency are not enuf); to put it another way, if I say that you make a scientific error or are sloppy in your science, I’m not saying you’re a bad person.

For me to say someone’s a bad person just because they wrote a paper and didn’t follow their preregistration plan . . . that would be ridiculous! Over 99% of my published papers have no preregistration plans; and, those that do have such plans, I’m pretty sure we didn’t exactly follow them in our published papers. That’s fine. The reason I do preregistration is not to protect my p-values; it’s just part of a larger process of hypothesizing about possible outcomes and simulating data and analysis as a prelude to measurement and data collection.

I think what happened in the “How pledges reduce dishonesty” paper is that the preregistration was both too vague and too specific. Too vague in that it did not include simulation and analysis of fake data, nor did it include quantitative hypotheses about effects and the distributions of outcomes, nor did it include anything close to what the authors ended up actually doing to support the claims in their paper. Too specific in that it included a bunch of analyses that the authors then didn’t think were worth reporting.

But, remember, science is hard. Statistics is hard. Even what might seem like simple statistics is hard. One thing I like about doing simulation-based design and analysis before collecting any data is that it forces me to make some of the hard choices early. So, yeah, it’s hard, and it’s no moral criticism of the authors of the above-discussed paper that they botched this. We’re all still learning. At the same time, yeah, I don’t think their study offers any serious evidence for the claims being made in that paper; it looks like noise mining to me. Not a moral failing; still, bad science in there being no good links between theory, effect sizes, data collection, and measurement, which, as is often the case, leads to super-noisy results that can be interpreted in all sorts of ways to fit just about any theory.

Possible positive outcomes for preregistration

I think preregistration is great; again, it’s a floor, not a ceiling, on the data processing and analyses that can be done.

Here are some possible benefits of preregistration:

1. Preregistration is a vehicle for getting you to think harder about your study. The need to simulate data and create a fake world forces you to make hard choices and consider what sorts of data you might expect to see.

2. Preregistration with fake-data simulation can make you decide to redesign a study, or to not do it at all, if it seems that it will be too noisy to be useful.

3. If you already have a great plan for a study, preregistration can allow the subsequent analysis to be bulletproof. No need to worry about concerns of p-hacking if your data coding and analysis decisions are preregistered—and this also holds for analyses that are not based on p-values or significance tests.

4. A preregistered replication can build confidence in a previous exploratory finding.

5. Conversely, a preregistered study can yield a null result, for example if it is designed to have a high statistical power but then does not yield statistically significant preregistered results. Failure is not always as exciting or informative as success—recall the expression “big if true”—but it ain’t nothing.

6. Similarly, a preregistered replication can yield a null result. Again, this can be a disappointment but still a step in scientific learning.

7. Once the data appears, and the preregistered analysis is done, if it’s unsuccessful, this can lead the authors to change their thinking and to write a paper explaining that they were wrong, or maybe just to publish a short note saying that the preregistered experiment did not go as expected.

8. If a preregistered analysis fails, but the authors still try to claim success using questionable post-hoc analysis, the journal reviewers can compare the manuscript to the preregistration, point out the problem, and require that the article be rewritten to admit the failure. Or, if the authors refuse to do that, the journal can reject the article as written.

9. Preregistration can be useful in post-publication review to build confidence in a published paper by reassuring readers who might have been concerned about p-hacking and forking paths. Readers can compare the published paper to the preregistration and see that it’s all ok.

10. Or, if the paper doesn’t follow the preregistration plan, readers can see this too. Again, it’s not a bad thing at all for the paper to go beyond the preregistration plan. That’s part of good science, to learn new things from the data. The bad thing is when a non-preregistered analysis is presented as if it were the preregistered analysis. And the good thing is that the reader can read the documents and see that this happened. As we did here.

In the case of this recent dishonesty paper, preregistration did not give benefit 1, nor did it give benefit 2, nor did it give benefits 3, 4, 5, 6, 7, 8, or 9. But it did give benefit 10. Benefit 10 is unfortunately the least of all the positive outcomes of preregistration. But it ain’t nothing. So here we are. Thanks to preregistration, we now know that we don’t need to take seriously the claims made in the published paper, “How pledges reduce dishonesty: The role of involvement and identification.”

For example, you should feel free to accept that the authors offer no evidence for their claim that “effective pledges could allow policymakers to reduce monitoring and enforcement resources currently allocated for lengthy and costly checks and inspections (that also increase the time citizens and businesses must wait for responses) and instead focus their attention on more effective post-hoc audits. What is more, pledges could serve as market equalizers, allowing better competition between small businesses, who normally cannot afford long waiting times for permits and licenses, and larger businesses who can.”

Huh??? That would not follow from their experiments, even if the results had all gone as planned.

There’s also this funny bit at the end of the paper:

I just don’t know whether to believe this. Did they sign an honesty pledge?

Overkill?

OK, it’s 2024, and maybe this all feels like shooting a rabbit with a cannon. A paper by Dan Ariely on the topic of dishonesty, published in an Elsevier journal, purporting to provide “guidance to managers and policymakers” based on the results of an online math-puzzle game? Whaddya expect? This is who-cares research at best, in a subfield that is notorious for unreplicable research.

What happened was I got sucked in. I came across this paper, and my first reaction was surprise that Ariely was still collaborating with people working on this topic. I would’ve thought that the crashing-and-burning of his earlier work on dishonesty would’ve made him radioactive as a collaborator, at least in this subfield.

I took a quick look and saw that the studies were preregistered. Then I wanted to see exactly what that meant . . . and here we are.

Once I did the work, it made sense to write the post, as this is an example of something I’ve seen before: a disconnect between the preregistration and the analyses in the paper, and a lack of engagement in the paper with all the things in the preregistration that did not go as planned.

Again, this post should not be taken as any sort of opposition to preregistration, which in this case led to positive outcome #10 on the above list. The 10th-best outcome, but better than nothing, which is what we would’ve had in the absence of preregistration.

Baby steps.

Supporting Bayesian modelling workflows with iterative filtering for multiverse analysis

Aki Vehtari — Wed, 03 Apr 2024 14:59:43 +0000

There is a new paper in arXiv: “Supporting Bayesian modelling workflows with iterative filtering for multiverse analysis” by Anna Elisabeth Riha, Nikolas Siccha, Antti Oulasvirta, and Aki Vehtari.

Anna writes

An essential component of Bayesian workflows is the iteration within and across models with the goal of validating and improving the models. Workflows make the required and optional steps in model development explicit, but also require the modeller to entertain different candidate models and keep track of the dynamic set of considered models.

By acknowledging the existence of multiple candidate models (universes) for any data analysis task, multiverse analysis provides an approach for transparent and parallel investigation of various models (a multiverse) that makes considered models and their underlying modelling choices explicit and accessible. While this is great news for the task of tracking considered models and their implied conclusions, more exploration can introduce more work for the modeller since not all considered models will be suitable for the problem at hand. With more models, more time needs to be spent with evaluation and comparison to decide which models are the more promising candidates for a given modelling task and context.

To make joint evaluation easier and reduce the amount of models in a meaningful way, we propose to filter out models with largely inferior predictive abilities and check computation and reliability of obtained estimates and, if needed, adjust models or computation in a loop of changing and checking. Ultimately, we evaluate predictive abilities again to ensure a filtered set of models that contains only the models that are sufficiently able to provide accurate predictions. Just like we filter out coffee grains in a coffee filter, our suggested approach sets out to remove largely inferior candidates from an initial multiverse and leaves us with a consumable brew of filtered models that is easier to evaluate and usable for further analyses. Our suggested approach can reduce a given set of candidate models towards smaller sets of models of higher quality, given that our filtering criteria reflect characteristics of the models that we care about.

“Bayesian Workflow: Some Progress and Open Questions” and “Causal Inference as Generalization”: my two upcoming talks at CMU

Andrew — Wed, 03 Apr 2024 13:13:17 +0000

I’ll be speaking twice at Carnegie Mellon soon.

CMU statistics seminar, Fri 5 Apr 2024, 2:15pm, in Doherty Hall A302:

Bayesian Workflow: Some Progress and Open Questions

The workflow of applied Bayesian statistics includes not just inference but also model building, model checking, confidence-building using fake data, troubleshooting problems with computation, model understanding, and model comparison. We would like to codify these steps in the realistic scenario in which researchers are fitting many models for a given problem. We discuss various issues including prior distributions, data models, and computation, in the context of ideas such as the Fail Fast Principle and the Folk Theorem of Statistical Computing. We also consider some examples of Bayesian models that give bad answers and see if we can develop a workflow that catches such problems. For background, see here: http://www.stat.columbia.edu/~gelman/research/unpublished/Bayesian_Workflow_article.pdf

CMU computer science seminar, Tues 9 Apr, 10:30am, in Gates Hillman Building 8102:

Causal Inference as Generalization

In causal inference, we generalize from sample to population, from treatment to control group, and from observed measurements to underlying constructs of interest. The challenge is that models for varying effects can be difficult to estimate from available data. For example, there is a conflict between two tenets of evidence-based medicine: (1) reliance on statistically significant estimates from controlled trials and (2) decision making for individual patients. There’s no way to get to step 2 without going beyond step 1. We discuss limitations of existing approaches to causal generalization and how it might be possible to do better using Bayesian multilevel models. For background, see here: http://www.stat.columbia.edu/~gelman/research/published/KennedyGelman_manuscript.pdf and here: http://www.stat.columbia.edu/~gelman/research/published/causalreview4.pdf and here: http://www.stat.columbia.edu/~gelman/research/unpublished/causal_quartets.pdf

In between the two talks is a solar eclipse. I hope there’s good weather in Cleveland on Monday at 3:15pm.

Bad parenting in the news, also, yeah, lots of kids don’t believe in Santa Claus

Andrew — Tue, 02 Apr 2024 13:06:25 +0000

A recent issue of the New Yorker had two striking stories of bad parenting.

Margaret Talbot reported on a child/adolescent-care center in Austria from the 1970s that was run by former Nazis who were basically torturing the kids. This happened for decades. The focus of the story was a girl whose foster parents had abused her before sending her to this place. The creepiest thing about all of this was how normal it all seemed. Not normal to me, but normal to that society: abusive parents, abusive orphanage, abusive doctors, all of which fit into an authoritarian society. Better parenting would’ve helped, but it seems that all of these people were trapped in a horrible system, supported by an entrenched network of religious, social, and political influences.

In that same issue of the magazine, Sheelah Kolhatkar wrote about the parents of crypto-fraudster Sam Bankman-Fried. This one was sad in a different way. I imagine that most parents don’t want their children to grow up to be criminals, but such things happen. The part of the story that seemed particularly sad to me was how the parents involved themselves in their son’s crimes. They didn’t just passively accept it—which would be bad enough, but, sure, sometimes kids just won’t listen and they need to learn their lessons on their own—; they very directly got involved, indeed profited from the criminal activity. What kind of message is that to send to your child? In some ways this is similar to the Austrian situation, in that the adults involved were so convinced in their moral righteousness. Anyway, it’s gotta be heartbreaking to realize that, not only did you not stop your child’s slide into crime, you actually participated in it.

Around the same time, the London Review of Books ran an article which motivated me to write them this letter:

Dear editors,

In his article in the 2 Nov 2023 issue, John Lanchester writes that financial fraudster Sam Bankman-Fried “grew up aware that his mind worked differently from most people’s. Even as a child he thought that the whole idea of Santa Claus was ridiculous.” I don’t know what things are like in England, but here in the United States it’s pretty common for kids to know that Santa Claus is a fictional character.

More generally, I see a problem with the idealization of rich people. It’s not enough to say that Bankman-Fried was well-connected, good at math, and had a lack of scruple that can be helpful in many aspects of life. He also has to be described as being special, so much that a completely normal disbelief in the reality of Santa Claus is taken as a sign of how exceptional he is.

Another example is Bankman-Fried’s willingness to gamble his fortune in the hope of even greater riches, which Lanchester attributes to the philosophy of effective altruism, rather than characterizing it as simple greed.

Yours

Andrew Gelman
New York

They’ve published my letters before (here and here), but not this time. I just hope that in the future they don’t take childhood disbelief in Santa Claus as a signal of specialness, or attribute a rich person’s desire for even more money to some sort of unusual philosophy.

“A passionate group of scientists determined to revolutionize the traditional publishing model in academia”

Andrew — Tue, 02 Apr 2024 00:10:35 +0000

Jonathan Heppner saw our post, Refuted papers continue to be cited more than their failed replications: Can a new search engine be built that will fix this problem?, and writes:

I [Heppner] am a philosopher of psychology working with the ResearchHub Foundation. Our mission is to decentralize scientific publishing, and a crucial aspect of our approach is aligning academic incentives with the actual quality of research.

Currently, I am spearheading a campaign to identify qualified scientific fraud investigators who can provide content on our platform.

At ResearchHub, we are a passionate group of scientists determined to revolutionize the traditional publishing model in academia. We offer financial incentives for various contributions, including posting research, writing peer reviews, and submitting literature reviews, all geared towards accelerating scientific research while improving the quality of rigor.

Transparency is a key for us, and we directly attach peer reviews to published articles. This approach allows readers to easily assess how a paper is received by peers, providing a clearer picture of its validity.

I took a look, and this seems to be associated with something called Research Coin? I told Heppner that I’m concerned about these Coin things as I’ve heard they are very wasteful of energy.

He replied:

I understand that you are wary of the coins, since that is a pretty common attitude held by people in general. However, the blockchain community has heard the energy criticism and the chain on which our coin exists, Ethereum, has switched to a more energy-efficient method (Proof of Stake). It now uses only ~0.0026 TWh/yr.

I followed the link, and, if this is for real, then, yeah, Bitcoin really does seem pretty horrible. But I know nothing about any of this (except for scientific fraud; that’s a topic that I wish I knew less about!); I’m just sharing it here in case it interests some of you.

Paper cited by Stanford medical school professor retracted—but even without considering the reasons for retraction, this paper was so bad that it should never have been cited.

Andrew — Mon, 01 Apr 2024 13:20:22 +0000

Last year we discussed a paper sent to us by Matt Bogard. The paper was called, “Impact of cold exposure on life satisfaction and physical composition of soldiers,” it appeared in the British Medical Journal, and Bogard was highly suspicious of it. As he put it at the time:

I don’t have full access to this article to know the full details and can’t seem to access the data link but with n = 49 split into treatment and control groups for these outcomes (also making gender subgroup comparisons) this seems to scream, That which does not kill my statistical significance only makes it stronger.

I took a look and I agreed that the article was absolutely terrible. I guess it was better than most of the stuff published in the International Supply Chain Technology Journal, but that’s not saying much; indeed all it’s saying is that the paper strung together some coherent sentences.

Despite the paper being clearly very bad, it had been promoted by a professor at Stanford Medical School who has a side gig advertising this sort of thing:

What’s up with that?? Stanford’s supposed to be a serious university, no? I hate to see Stanford Medical School mixed up in this sort of thing.

News!

That article has been retracted:

The reason for the retraction is oddly specific. I think it would be enough for them to have just said they’re retracting the paper because it’s no good. As Gideon Meyerowitz-Katz put it:

To sum up – this is a completely worthless study that has no value whatsoever scientifically. It is quite surprising that it got published in its current form, and even more surprising that anyone would try to use it as evidence.

I agree that the study is worthless, even without the specific concerns that caused it to be retracted.

On the other hand, I’m not at all surprised that it got published, nor am I surprised that anyone would try to use it as evidence. Crap studies are published all the time, and they’re used as evidence all the time too.

By the way, if you are curious and want to take a look at the original paper:

You still gotta pay 50 bucks.

I wonder if that Stanford dude is going to announce the retraction of the evidence he touted for his claim that “deliberate cold exposure is great training for the mind.” I doubt it—if the quality of the evidence were important, he wouldn’t have cited the study in the first place—but who knows, I guess anything’s possible.

P.S. At this point, some people are gonna complain at how critical we all are. Why can’t we just let these people alone? I’ll give my usual answer, which is that (a) junk science is a waste of valuable resources and attention, and (b) bad science drives out the good. Somewhere there is a researcher who does good and careful work but was not hired at Stanford medical school because of not being flashy enough. Just like there are Ph.D. students in psychology whose work does not get published in Psychological Science because it can’t compete with clickbait crap like the lucky golf ball study.

As Paul Alper says, one should always beat a dead horse because the horse is never really dead.

“Randomization in such studies is arguably a negative, in practice, in that it gives apparently ironclad causal identification (not really, given the ultimate goal of generalization), which just gives researchers and outsiders a greater level of overconfidence in the claims.”

Andrew — Sun, 31 Mar 2024 13:32:40 +0000

Dean Eckles sent me an email with subject line, “Another Perry Preschool paper . . .” and this link to a recent research paper that reports, “We find statistically significant effects of the program on a number of different outcomes of interest.” We’ve discussed Perry Preschool before (see also here), so I was coming into this with some skepticism. It turns out that this new paper is focused more on methods than on the application. It begins:

This paper considers the problem of making inferences about the effects of a program on multiple outcomes when the assignment of treatment status is imperfectly randomized. By imperfect randomization we mean that treatment status is reassigned after an initial randomization on the basis of characteristics that may be observed or unobserved by the analyst. We develop a partial identification approach to this problem that makes use of information limiting the extent to which randomization is imperfect to show that it is still possible to make nontrivial inferences about the effects of the program in such settings. We consider a family of null hypotheses in which each null hypothesis specifies that the program has no effect on one of many outcomes of interest. Under weak assumptions, we construct a procedure for testing this family of null hypotheses in a way that controls the familywise error rate–the probability of even one false rejection–in finite samples. We develop our methodology in the context of a reanalysis of the HighScope Perry Preschool program. We find statistically significant effects of the program on a number of different outcomes of interest, including outcomes related to criminal activity for males and females, even after accounting for imperfections in the randomization and the multiplicity of null hypotheses.

I replied: “a family of null hypotheses in which each null hypothesis specifies that the program has no effect on one of many outcomes of interest . . .”: What the hell? This makes no sense at all.

Dean responded:

Yeah, I guess it is a complicated way of saying there’s a null hypothesis for each outcome…

To me this just really highlights the value of getting the design right in the first place — and basically always including well-defined randomization.

To which I replied: Randomization is fine, but I think much less important than measurement as a design factor. The big problem with all these preschool studies is noisy data. These selection-on-statistical-significance methods then combine with forking paths to yield crappy estimates. I’d say “useless estimates,” but I guess the estimates are useful in the horrible sense that they allow the promoters of these interventions to get attention and funding.

Randomization in such studies is arguably a negative, in practice, in that it gives apparently ironclad causal identification (not really, given the ultimate goal of generalization), which just gives researchers and outsiders a greater level of overconfidence in the claims. They’re following the “strongest link” reasoning, which is an obvious logical fallacy but that doesn’t stop it from driving so much policy research.

Dean:

Yes, definitely agreed that measurement is a key part of the design, as is choosing a sample size that has any hope of detecting reasonable effect sizes.

Me: Agreed. One reason I de-emphasize the importance of sample size is that researchers often seem to think that sample size is a quick fix. It goes like this: Researchers do a study and find a result that’s 1.5 se’s from 0. So then they think that if they just increase sample size by a factor of (2/1.5)^2, they’ll get statistical significance. Or if their sample size was higher by a factor of (2.8/1.5)^2, that they’d have 80% power. And then they do the study, they manage to find that statistically significant result, and they (a) declare victory on their substantive claim and (b) think that their sample size is retrospectively justified, in the same way that a baseball manager’s decision to leave in the starting pitcher is justified if the outcome is a W.

So, the existence of the “just increase the sample size” option can be an enabler for bad statistics and bad science. My favorite example along those lines is the beauty-and-sex-ratio study, which used a seemingly-reasonable sample size of 3000 but would realistically need something like a million people or more to have any reasonable chance of detecting any underlying signal.
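The naive arithmetic described above can be written out in a few lines. This is only a sketch of the flawed reasoning, not a recommended power analysis: it assumes the standard error shrinks like 1/sqrt(n) and, crucially, that the noisy point estimate stays exactly where it was. The function name is made up for illustration.

```python
def naive_sample_size_factor(target_z: float, observed_z: float) -> float:
    """Factor by which n would have to grow for an UNCHANGED point estimate
    to sit target_z standard errors from zero, assuming se ~ 1/sqrt(n).
    The 'unchanged estimate' assumption is exactly what makes this naive."""
    return (target_z / observed_z) ** 2


observed_z = 1.5  # the result is 1.5 standard errors from zero

# "Just enough" for statistical significance at roughly the 5% level
factor_significance = naive_sample_size_factor(2.0, observed_z)

# The usual 80%-power rule of thumb: true effect 2.8 standard errors from zero
factor_power_80 = naive_sample_size_factor(2.8, observed_z)

print(f"factor for z = 2.0: {factor_significance:.2f}")  # about 1.78
print(f"factor for z = 2.8: {factor_power_80:.2f}")      # about 3.48
```

The factors look reassuringly small, which is the trap: the calculation treats the original estimate as if it were the true effect.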

Dean:

Yes, that’s a good warning. Of course, for those who know better, obviously the original, extra noisy point estimate is not really helpful at all for power analysis. I think trying to use pilots to somehow get an initial estimate of an effect (rather than learn other things) is a common trap.

Yeah, and rules of thumb about what a “big sample” is can lead you astray. I’ve run some experiments where if you’d run the same experiment with 100k people, it would have been hopeless. A couple slides from my recent course on this point are attached… involving the same “failed” experiment as the second part of my recent post https://statmodeling.stat.columbia.edu/2023/10/16/getting-the-first-stage-wrong/

P.S. Regarding the title of this post, I’m not saying that randomization is always bad or usually bad or that it’s bad on net. What I’m saying is that it can be bad, and it can be bad in important situations, the kinds of settings where people want to find an effect.

To put it another way, suppose we start by looking at a study with randomization. Would it be better without random assignment, just letting participants in the study pick their treatments or having them assigned in some other way that would be subject to unmeasurable biases? No, of course not. Randomization doesn’t make a study worse. What it can do is give researchers and consumers of research an inappropriately warm and cozy feeling, leading them to not look at serious problems of interpretation of the results of the study, for example, extracting large and unreproducible results from small noisy samples and then using inappropriately applied statistical models to label such findings as “statistically significant.”

“Andrew, you are skeptical of pretty much all causal claims. But wait, causality rules the world around us, right? Plenty have to be true.”

Andrew — Sat, 30 Mar 2024 13:27:48 +0000

Awhile ago, Kevin Lewis pointed me to this article that was featured in the Wall Street Journal. Lewis’s reaction was, “I’m not sure how robust this is with just some generic survey controls. I’d like to see more of an exogenous assignment.” I replied, “Nothing wrong with sharing such observational patterns. They’re interesting. I don’t believe any of the causal claims, but that’s ok, description is fine,” to which Lewis responded, “Sure, but the authors are definitely selling the causal claim.” I replied, “Whoever wrote that looks like they had the ability to get good grades in college. That’s about it.”

At this point, Alex Tabarrok, who’d been cc-ed on all this, jumped in to say, quite reasonably, “Andrew, you are skeptical of pretty much all causal claims. But wait, causality rules the world around us, right? Plenty have to be true.”

I replied to Alex as follows:

There are lots of causal claims that I believe! For this one, there are two things going on. First, do I think the claim is true? Maybe, maybe not, I have no idea. I certainly wouldn’t stake my reputation on a statement that the claim is false. Second, how relevant do I think this sort of data and analysis are to this claim? My answer: a bit relevant but not very. When I think about the causal claims that I believe, my belief is usually not coming from some observational study.

Regarding, “Plenty have to be true.” Yup, and that includes plenty of statements that are the opposite of what’s claimed to be true. For example, a few years ago a researcher preregistered a claim that exposure to poor people would cause middle-class people to have more positive views regarding economic redistribution policies. The researcher then did a study, found the opposite result (not statistically significant, but whatever), and then published the results and claimed that exposure to poor people would reduce middle-class people’s support for redistribution. So what do I believe? I believe that for most people, an encounter (staged or otherwise) with a person on the street would have essentially no effects on their policy views. For some people in some settings, though, the encounter could have an effect. Sometimes it could be positive, sometimes negative. In a large enough study it would be possible to find an average effect. The point is that plenty of things have to be true, but estimating average causal effects won’t necessarily find any of these things. And this does not even get into the difficulty with the study linked by Kevin, where the data are observational.

Or, for another example, sure, I believe that early childhood intervention can be effective in some cases. That doesn’t give me any obligation to believe the strong claims that have been made on its behalf using flawed data analysis.

To put it another way: the authors of all these studies should feel free to publish their claims. I just think lots of these studies are pretty random. Randomness can be helpful. Supposedly Philip K. Dick used randomization (the I Ching) to write some of his books. In this case, the randomization was a way to jog his imagination. Similarly, it could be that random social science studies are useful in that they give people an excuse to think about real problems, even if the studies themselves are not telling us what the researchers claim.

Finally, I think there’s a problem in social science that researchers are pressured to make strong causal claims that are not supported by their data. It’s a selection bias. Researchers who just make descriptive claims are less likely to get published in top journals, get newspaper op-eds, etc. This is just some causal speculation of my own: if the authors of this recent study had been more clear (to themselves and to others) that their conclusions are descriptive, not causal, none of us would’ve heard about the study in the first place.

Summary

There’s a division of labor in metascience as well as in science. I lean toward skepticism, to the extent that there must be cases where I don’t get around to thinking seriously about new ideas or results that are actually important. Alex leans toward openness, to the extent that there must be cases where he goes through the effort of working out the implications of results that aren’t real. It’s probably a good thing that the science media includes both of us. We play different roles in the system of communication.

Every time Tyler Cowen says, “Median voter theorem still underrated! Hail Anthony Downs!”, I’m gonna point him to this paper . . .

Andrew — Fri, 29 Mar 2024 13:55:32 +0000

Here’s Cowen’s post, and here’s our paper:

Moderation in the pursuit of moderation is no vice: the clear but limited advantages to being a moderate for Congressional elections

Andrew Gelman Jonathan N. Katz

September 18, 2007

It is sometimes believed that it is politically risky for a congressmember to go against his or her party. On the other hand, Downs’s familiar theory of electoral competition holds that political moderation is a vote-getter. We analyze recent Congressional elections and find that moderation is typically worth less about 2% of the vote. This suggests there is a motivation to be moderate, but not to the exclusion of other political concerns, especially in non-marginal districts. . . .

Banning the use of common sense in data analysis increases cases of research failure: evidence from Sweden

Andrew — Thu, 28 Mar 2024 13:12:06 +0000

Olle Folke writes:

I wanted to highlight a paper by an author who has previously been featured on your blog when he was one of the co-authors of a paper on the effect of strip clubs on sex crimes in New York. This paper looks at the effect of criminalizing the buying of sex in Sweden and finds a 40-60% increase. However, the paper is equally problematic as the one on strip clubs. In what I view as his two main specifications he uses the timing of the ban to estimate the effect. However, while there is no variation across regions he uses regional data to estimate the effect, which of course does not make any sense. Not surprisingly there is no adjustment for the dependence of the error term across observations.

What makes this analysis particularly weird is that there actually is no shift in the outcome if we use national data (see figure below). So basically the results must have been manufactured. As the author has not posted any replication files it is not possible to figure out what he has done to achieve the huge increase.

I think that his response to this critique is that he has three alternative estimation methods. However, these are not very convincing and my suspicion is that none of those results would hold up to scrutiny either. Also, I find the use of alternative methods both strange and problematic. First, it suggests that neither method is convincing in itself. However, doing four additional problematic analyses does not make the first one better. Also, it gives authors an out when they are criticized, as it involves a lot of labor to work through each analysis (especially when there is no replication data).

I took a look at the linked paper, and . . . yeah, I’m skeptical. The article begins:

This paper leverages the timing of a ban on the purchase of sex to assess its impact on rape offenses. Relying on Swedish high-frequency data from 1997 to 2014, I find that the ban increases the number of rapes by around 44–62%.

But the above graph, supplied by Folke, does not show any apparent effect at all. The linked paper has a similar graph using monthly data that also shows nothing special going on at 1999:

This one’s a bit harder to read because of the two axes, the log scale, and the shorter time frame, but the numbers seem similar. In the time period under study, the red curve is around 5.0 on the log scale per month, 12*exp(5) = 1781, and the annual curve is around 2000, so that seems to line up.
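As a quick check of that back-of-the-envelope comparison: going from a monthly value on the natural-log scale to an annual count means exponentiating first and then multiplying by 12. A minimal sketch, assuming the red curve’s value of about 5.0 is a natural log of the monthly count (the variable names are mine):

```python
import math

monthly_log_count = 5.0                      # red curve, read off the graph by eye
monthly_count = math.exp(monthly_log_count)  # back-transform: about 148 per month
annual_count = 12 * monthly_count            # about 1781 per year

print(round(monthly_count, 1), round(annual_count))
```

That lands in the same ballpark as the roughly 2000 per year read off the annual graph.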

So, not much going on in the aggregate. But then the paper says:

Several pieces of evidence find that rape more than doubled after the introduction of the ban. First, Table 1 finds that the average before the ban is around 6 rapes per region and month, while after the introduction is roughly 12. Second, Table 2 presents the results of the naive analysis of regressing rape on a binary variable taking value 0 before the ban and 1 after, controlling for year, month, and region fixed effects. Results show that the post ban period is associated with an increase of around 100% of cases of rape in logs and 125% of cases of rape in the inverse hyperbolic sine transformation (IHS, hereafter). Third, a simple descriptive exercise –plotting rape normalized before the ban around zero by removing pre-treatment fixed effects– encounters that rape boosted around 110% during the sample period (Fig. 4).

OK, the averages don’t really tell us anything much at all: they’re looking at data from 1997-2014, the policy change happened in 1999, in the midst of a slow increase, and most of the change happened after 2004, as is clearly shown in Folke’s graph. So Table 1 and Table 2 are pretty much irrelevant.

But what about Figure 4:

This looks pretty compelling, no?

I dunno. The first thing is that the claim of “more than doubling” relies very strongly on the data after 2004. log(2) = 0.69, and if you look at that graph, the points only reach 0.69 around 2007, so the inference is leaning very heavily on the model by which the treatment causes a steady annual increase, rather than a short-term change in level at the time of the treatment. The other issue is the data before 1999, which in this graph are flat but in the two graphs shown earlier in this post showed an increasing trend. That makes a big difference in Figure 4! Replace that flat line pre-1999 with a positively-sloped line, and the story looks much different. Indeed, that line is soooo flat and right on zero, that I wonder if this is an artifact of the statistical fitting procedure (“Pre-treatment fixed effects are removed from the data to normalize the number of rapes around zero before the ban.”). I’m not really sure. The point is that something went wrong.
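For readers who want to go back and forth between log points and percentage changes when reading a graph like Figure 4, here is a small sketch. The two helper functions are hypothetical names of my own; the only inputs are numbers already quoted in this post (a doubling, and the 44–62% range from the paper’s abstract).

```python
import math


def pct_increase_to_log_points(pct: float) -> float:
    """Convert a percentage increase (100 = doubling) to natural-log points."""
    return math.log(1 + pct / 100)


def log_points_to_pct_increase(log_points: float) -> float:
    """Convert a difference in natural logs back to a percentage increase."""
    return (math.exp(log_points) - 1) * 100


doubling = pct_increase_to_log_points(100)     # about 0.69
abstract_low = pct_increase_to_log_points(44)  # about 0.36
abstract_high = pct_increase_to_log_points(62) # about 0.48

print(f"doubling: {doubling:.2f} log points")
print(f"44% to 62%: {abstract_low:.2f} to {abstract_high:.2f} log points")
```

So a “more than doubling” needs the points to clear roughly 0.69 on the log scale, which is noticeably higher than the log-point equivalents of the 44–62% range in the abstract.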

They next show their regression discontinuity model, which fits a change in level rather than slope:

There’s something else strange going on here: if they’re really fitting fixed effects for years, how can they possibly estimate a change over time? This is not making a lot of sense.
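To see why year fixed effects and a before/after indicator don’t mix, here is a toy sketch in plain Python. It assumes, for illustration only, monthly data for 1997 through 2014 and a ban indicator that switches on at the start of 1999 and stays on, so that it is constant within each calendar year. This is not the paper’s code or data.

```python
# Toy design: one observation per month, 1997 through 2014.
all_years = list(range(1997, 2015))
obs_year = [y for y in all_years for _ in range(12)]

# Post-ban indicator (assumed coding: 0 before 1999, 1 from 1999 onward).
post_ban = [1 if y >= 1999 else 0 for y in obs_year]

# One dummy column per year, i.e. the year fixed effects.
year_dummy = {y: [1 if oy == y else 0 for oy in obs_year] for y in all_years}

# Summing the dummies for 1999..2014 reproduces the post-ban column exactly.
rebuilt = [
    sum(year_dummy[y][i] for y in all_years if y >= 1999)
    for i in range(len(obs_year))
]
is_collinear = rebuilt == post_ban

print("post-ban indicator is a sum of year dummies:", is_collinear)
```

Because the indicator is an exact linear combination of the year dummies, a regression that includes both has no information left with which to estimate a separate post-ban coefficient.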

I’m not going to go through all of this paper in detail. I just did the above quick checks in order to get a rough sense of what was going on, and to make sure I didn’t see anything immediately wrong with Folke’s basic analysis.

Folke continued:

The paper is even stranger than I had expected. I have gotten part of the regression code and he is estimating models that would not get any estimates on the treatment if there were no coding error (treatment is constant within years but he includes year fixed effects). Also, when I do the RD analysis he claims he is doing I get the figure below in which there clearly is not a jump of 0.6 log points…

What the hell????

This one goes into the regression discontinuity hall of fame.

The next day, Folke followed up:

It took some digging and coding to figure out how the author was able to find such a large effect. We [Joop Adema, Olle Folke, and Johanna Rickne] have now written up a draft of a comment where we show that it is all based on a specification error and he ends up estimating something entirely different than he claims to be.

The big picture, or, how can this sort of error be avoided or its consequences mitigated

Look, everybody makes mistakes. Statistical models are hard to fit and interpret, data can be a mess, and social science theories are vague enough that if you’re not careful you can explain just about anything.

Still, it looks like this paper was an absolute disaster and a bit of an embarrassment for the Journal of Population Economics, which published it.

Should the problems have been noticed earlier? I’d argue yes.

The problems with the regression discontinuity model—OK, we’re not gonna expect the author, reviewers, or editors of a paper to look too carefully at that—it’s a big ugly equation, after all—and we can’t expect author, reviewers, or editors to check the code—that’s a lot of work, right? Equations that don’t make sense, that’s just the cost of doing business.

The clear problem is the pattern in the aggregate data, the national time series that shows no jump in 1999.

I’m not saying that, just cos there’s no jump in 1999, that the policy had no effect. I’m just saying that the lack of jump in 1999 is right there for everyone to see. At the very least, if you’re gonna claim you found an effect, you’re under the scientific obligation to explain how you found that effect given the lack of pattern in the aggregate data. Such things can happen—you can have an effect that happens to be canceled out in the data by some other pattern at the same time—but then you should explain it, give that trail of breadcrumbs.

So, I’m not saying the author, reviewers, and editors of that paper should’ve seen all or even most of the problems with this paper. What I am saying is that they should’ve engaged with the contradiction between their claims and what was shown by the simple time series. To have not done this is a form of “scientism,” a kind of mystical belief in the output of a black box, a “believe the stats, not your lying eyes” kind of attitude.

Also, as Folke points out, the author of this paper has a track record of extracting dramatic findings using questionable data analysis.

I have no reason to think that the author is doing things wrong on purpose. Statistics is hard! The author’s key mistakes in these two papers have been:

1. Following a workflow in which contrary indications were ignored or set aside rather than directly addressed.

2. A lack of openness to the possibility that the work could be fatally flawed.

3. Various technical errors, including insufficient concern about data quality, a misunderstanding of regression discontinuity checks, and an inappropriate faith in robustness checks.

In this case, Adema, Folke, and Rickne did a lot of work to track down what went wrong in that published analysis. A lot of work for an obscure paper in a minor journal. But the result is a useful general lesson, which is why I’m sharing the story here.

The feel-good open science story versus the preregistration (who do you think wins?)

Jessica Hullman — Wed, 27 Mar 2024 17:18:14 +0000

This is Jessica. Null results are hard to take. This may seem especially true when you preregistered your analysis, since technically you’re on the hook to own up to your bad expectations or study design! How embarrassing. No wonder some authors can’t seem to give up hope that the original hypotheses were true, even as they admit that the analysis they preregistered didn’t produce the expected effects. Other authors take an alternative route, one that deviates more dramatically from the stated goals of preregistration: they bury aspects of that pesky original plan and instead proceed under the guise that they preregistered whatever post-hoc analyses allowed them to spin a good story. I’ve been seeing this a lot lately.

On that note, I want to follow up on the previous blog discussion on the 2023 Nature Human Behavior article “High replicability of newly discovered social-behavioural findings is achievable” by Protzko, Krosnick, Nelson (of Data Colada), Nosek (of OSF), Axt, Berent, Buttrick, DeBell, Ebersole, Lundmark, MacInnis, O’Donnell, Perfecto, Pustejovsky, Roeder, Walleczek, and Schooler. It’s been about four months since Bak-Coleman and Devezer posted a critique that raised a number of questions about the validity of the claims the paper makes.

This was a study that asked four labs to identify (through pilot studies) four effects for possible replication. The same lab then did a larger (n=1500) preregistered confirmation study for each of their four effects, documenting their process and sharing it with three other labs, who attempted to replicate it. The originating lab also attempted a self-replication for each effect.

The paper presents analyses of the estimated effects and replicability across these studies as evidence that four rigor-enhancing practices used in the post-pilot studies–confirmatory tests, large sample sizes, preregistration, and methodological transparency–lead to high replicability of social psychology findings. The observed replicability is said to be higher than expected based on observed effect sizes and power estimates, and notably higher than prior estimates of replicability in the psych literature. All tests and analyses are described as preregistered, and, according to the abstract, the high replication rate they observe “justifies confidence of rigour-enhancing methods to increase the replicability of new discoveries.”

On the surface it appears to be an all-around win for open science. The paper has already been cited over fifty times. From a quick glance, many of these citing papers refer to it as if it provides evidence of a causal effect that open practices lead to high replicability.

But one of the questions raised by Bak-Coleman and Devezer about the published version was about their claim that all of the confirmatory analyses they present were preregistered. There was no such preregistration in sight if you checked the provided OSF link. I remarked back in November that even in the best case scenario where the missing preregistration was found, it was still depressing and ironic that a paper whose message is about the value of preregistration could make claims about its own preregistration that it couldn’t back up at publication time.

Around that time, Nosek said on social media that the authors were looking for the preregistration for the main results. Shortly after, Nature Human Behavior added a warning label indicating an investigation of the work:

And if I’m trying to tell the truth, it’s all bad

It’s been some months, and the published version hasn’t changed (beyond the added warning), nor do the authors appear to have made any subsequent attempts to respond to the critiques. Given the “open science works” message of the paper, the high profile author list, and the positive attention it’s received, it’s worth discussing here in slightly more detail how some of these claims seem to have come about.

The original linked project repository has been updated with historical files since the Bak-Coleman and Devezer critique. By clicking through the various versions of the analysis plan, analysis scripts, and versions of the manuscript, we can basically watch the narrative about the work (and what was preregistered) change over time.

The first analysis plan is dated October 2018 by OSF, and outlines a set of analyses of a decline effect, where effects decrease after an initial study, that differ substantially from the story presented in the published paper. This document first describes a data collection process for each of the confirmation studies and replications in two halves that splits the collection of observations into two parts, with 750 observations collected first, and the other 750 second. Each confirmation study and replication study are also assigned to either a) analyze the first half-sample and then the second half-sample or b) analyze the second half-sample and then the first half-sample.

There were three planned tests:

1. Whether the effects statistically significantly increase or decrease depending on whether they belonged to the first or the second half-sample of 750.
2. Whether the effect size of the originating lab’s self-replication study is statistically larger or smaller than that of the originating lab’s confirmation study.
3. Whether effects statistically significantly decrease or increase across all four waves of data collection (all 16 studies with all 5 confirmations and replications).

If you haven’t already guessed it, the goal of all this is to evaluate whether a supernatural-like effect resulted in a decreased effect size in whatever wave was analyzed second. It appears all this is motivated by hypotheses that some of the authors (okay, maybe just Schooler) felt were within the realm of possibility. There is no mention of comparing replicability in the original analysis plan nor in the preregistered analysis code uploaded last December in a dump of historical files by James Pustejovsky, who appears to have played the role of a consulting statistician. This is despite the blanket claim that all analyses in the main text were preregistered and further description of these analyses in the paper’s supplement as confirmatory.

The original intent did not go unnoticed by one of the reviewers (Tal Yarkoni) for Nature Human Behavior, who remarks:

The only hint I can find as to what’s going on here comes from the following sentence in the supplementary methods: “If observer effects cause the decline effect, then whichever 750 was analyzed first should yield larger effect sizes than the 750 that was analyzed second”. This would seem to imply that the actual motivation for the blinding was to test for some apparently supernatural effect of human observation on the results of their analyses. On its face, this would seem to constitute a blatant violation of the laws of physics, so I am honestly not sure what more to say about this.

It’s also clear that the results were analyzed in 2019. The first public presentation of results from the individual confirmation studies and replications can be traced to a talk Schooler gave at the Metascience 2019 conference in September, where he presents the results as evidence of an incline effect. The definition of replicability he uses is not the one used in the paper.

Cause if you’re looking for the proof, it’s all there

There are lots of clues in the available files that suggest the main message about rigor-enhancing practices emerged as the decline effects analysis above failed to show the hypothesized effect. For example, there’s a comment on an early version of the manuscript (March 2020) where the multi-level meta-analysis model used to analyze heterogeneity across replications is suggested by James. This is suggested after data collection has been done and initial results analyzed, but the analysis is presented as confirmatory in the paper with p-values and discussion of significance. As further evidence that it wasn’t preplanned, in a historical file added more recently by James, it is described as exploratory. It shows up later in the main analysis code with some additional deviations, no longer designated as exploratory. By the next version of the manuscript, it has been labeled a confirmatory analysis, as it is in the final published version.

This is pretty clear evidence that the paper is not accurately portraying itself.

Similarly, various definitions of replicability show up in earlier versions of the manuscript: the rate at which the replication is significant, the rate at which the replication effect size falls within the confirmatory study CI, and the rate at which replications produce significant results for significant confirmatory studies. Those which produce higher rates of replicability relative to statistical power are retained and those with lower rates are either moved to the supplement, dismissed, or not explored further because they produced low values. For example, defining replicability using overlapping confidence intervals was moved to the supplement and not discussed in the main text, with the earliest version of the manuscript (Deciphering the Decline Effect P6_JEP.docx) justifying its dismissal because it “produced the ‘worst’ replicability rates” and “performs poorly when original studies and replications are pre-registered.” Statistical power is also recalculated across revisions to align with the new narrative.
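
To see how much the choice of definition can matter, here's a small sketch that computes the three rates listed above on the same set of study pairs. The four pairs are made up, the 1.96 cutoff assumes a two-sided 5% test with a normal approximation, and `replicability_rates` is a hypothetical helper, not anything from the paper's code.

```python
Z_CRIT = 1.96  # two-sided 5% cutoff under a normal approximation

def is_significant(estimate, se):
    return abs(estimate / se) > Z_CRIT

def replicability_rates(pairs):
    """Each pair is (conf_est, conf_se, rep_est, rep_se)."""
    n = len(pairs)
    # Definition 1: the replication is statistically significant.
    rep_sig = sum(is_significant(r, rse) for _, _, r, rse in pairs) / n
    # Definition 2: the replication estimate falls inside the
    # confirmatory study's 95% confidence interval.
    in_ci = sum(
        c - Z_CRIT * cse <= r <= c + Z_CRIT * cse for c, cse, r, _ in pairs
    ) / n
    # Definition 3: among significant confirmatory studies, the
    # replication is also significant.
    sig_conf = [p for p in pairs if is_significant(p[0], p[1])]
    if sig_conf:
        conditional = sum(
            is_significant(r, rse) for _, _, r, rse in sig_conf
        ) / len(sig_conf)
    else:
        conditional = float("nan")
    return {
        "replication_significant": rep_sig,
        "within_confirmatory_ci": in_ci,
        "significant_given_significant": conditional,
    }

# Hypothetical (confirmatory, replication) estimates with standard errors.
pairs = [
    (0.30, 0.05, 0.21, 0.05),
    (0.10, 0.05, 0.11, 0.05),
    (0.05, 0.05, 0.04, 0.05),
    (0.20, 0.05, 0.08, 0.05),
]
rates = replicability_rates(pairs)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

The same four pairs give three different "replicability rates," which is why it matters whether the definition was fixed before or after seeing the data.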

In a revision letter submitted prior to publication (Decline Effect Appeal Letterfinal.docx), the authors tell the reviewers they’re burying the supernatural motivation for the study:

Reviewer 1’s fourth point raised a number of issues that were confusing in our description of the study and analyses, including the distinction between a confirmation study and a self-replication, the purpose and use of splitting samples of 1500 into two subsamples of 750, the blinding procedures, and the references to the decline effect. We revised the main text and SOM to address these concerns and improve clarity. The short answer to the purpose of many of these features was to design the study a priori to address exotic possibilities for the decline effect that are at the fringes of scientific discourse.

There’s more, like a file where they appeared to try a whole bunch of different models in 2020 after the earliest provided draft of the paper, got some varying results, and never disclosed it in the published version or supplement (at least I didn’t see any mention). But I’ll stop there for now.

C’mon baby I’m gonna tell the truth and nothing but the truth

It seems clear that the dishonesty here was in service of telling a compelling story about something. I’ve seen things like this transpire plenty of times: the goal of getting published leads to attempts to find a good story in whatever results you got. Combined with the appearance of rigor and a good reputation, a researcher can be rewarded for work that on closer inspection involves so much post-hoc interpretation that the preregistration seems mostly irrelevant. It’s not surprising that the story here ends up being one that we would expect some of the authors to have faith in a priori.

Could it be that the authors were pressured by reviewers or editors to change their story? I see no evidence of that. In fact, the same reviewer who noted the disparity between the original analysis plan and the published results encouraged the authors to tell the real story:

I won’t go so far as to say that there can be no utility whatsoever in subjecting such a hypothesis to scientific test, but at the very least if this is indeed what the authors are doing, I think they should be clear about that in the main text, otherwise readers are likely to misunderstand what the blinding manipulation is supposed to accomplish, and are at risk of drawing incorrect conclusions

You can’t handle the truth, you can’t handle it

It’s funny to me how little attention the warning label or the multiple points raised by Bak-Coleman and Devezer (which I’m simply concretizing here) have drawn, given the zeal with which some members of the open science crowd strive to expose questionable practices in other work. My guess is this is because of the feel-good message of the paper and the reputation of the authors. The lack of attention seems selective, which is part of why I’m bringing up some details here. It bugs me (though it doesn’t surprise me) to think that whether questionable practices get called out depends on who exactly is in the author list.

What do I care? Why should you?

On some level, the findings the paper presents – that if you use large studies and attempt to eliminate QRPs, you can get a high rate of statistical significance – are very unsurprising. So why care if the analyses weren’t exactly decided in advance? Can’t we just call it sloppy labeling and move on?

I care because if deception is occurring openly in papers published in a respected journal for behavioral research by authors who are perceived as champions of rigor, then we still have a very long way to go. Interpreting this paper as a win for open science, as if it cleanly estimated the causal effect of rigor-enhancing practices, is not, in my view, a win for open science. The authors’ lack of concern for labeling exploratory analysis as confirmatory, their attempt to spin the null findings from the intended study into a result about effects on replicability even though the definition they use is unconventional and appears to have been chosen because it led to a higher value, and the seemingly selective summary of prior replication rates from the literature should be acknowledged as the paper accumulates citations. At this point months have passed and there have not been any amendments to the paper, nor admission by the authors that the published manuscript makes false claims about the preregistration status. Why not just own up to it?

It’s frustrating because my own methodological stance has been positively impacted by some of these authors. I value what the authors call rigor-enhancing practices. In our experimental work, my students and I routinely use preregistration, we do design calculations via simulations to choose sample sizes, we attempt to be transparent about how we arrive at conclusions. I want to believe that these practices do work, and that the open science movement is dedicated to honesty and transparency. But if papers like the Nature Human Behavior article are what people have in mind when they laud open science researchers for their attempts to rigorously evaluate their proposals, then we have problems.

There are many lessons to be drawn here. When someone says all the analyses are preregistered, don’t just accept them at their word, regardless of their reputation. Another lesson that I think Andrew previously highlighted is that researchers sometimes form alliances with others that may have different views for the sake of impact but this can lead to compromised standards. Big collaborative papers where you can’t be sure what your co-authors are up to should make all of us nervous. Dishonesty is not worth the citations.

Writing inspiration from J.I.D. and Mereba.

Bayesian inference with informative priors is not inherently “subjective”

Andrew — Wed, 27 Mar 2024 13:22:36 +0000

The quick way of saying this is that using a mathematical model informed by background information to set a prior distribution for logistic regression is no more “subjective” than deciding to run a logistic regression in the first place.

Here’s a longer version:

Every once in a while you get people saying that Bayesian statistics is subjective bla bla bla, so every once in a while it’s worth reminding people of my 2017 article with Christian Hennig, Beyond subjective and objective in statistics. Lots of good discussion there too. Here’s our abstract:

Decisions in statistical data analysis are often justified, criticized or avoided by using concepts of objectivity and subjectivity. We argue that the words ‘objective’ and ‘subjective’ in statistics discourse are used in a mostly unhelpful way, and we propose to replace each of them with broader collections of attributes, with objectivity replaced by transparency, consensus, impartiality and correspondence to observable reality, and subjectivity replaced by awareness of multiple perspectives and context dependence. Together with stability, these make up a collection of virtues that we think is helpful in discussions of statistical foundations and practice.

The advantage of these reformulations is that the replacement terms do not oppose each other and that they give more specific guidance about what statistical science strives to achieve. Instead of debating over whether a given statistical method is subjective or objective (or normatively debating the relative merits of subjectivity and objectivity in statistical practice), we can recognize desirable attributes such as transparency and acknowledgement of multiple perspectives as complementary goals. We demonstrate the implications of our proposal with recent applied examples from pharmacology, election polling and socio-economic stratification. The aim of the paper is to push users and developers of statistical methods towards more effective use of diverse sources of information and more open acknowledgement of assumptions and goals.

Philip K. Dick’s character names

Andrew — Tue, 26 Mar 2024 13:05:43 +0000

The other day I was thinking of some of the wonderful names that Philip K. Dick gave to his characters:
Joe Chip
Glen Runciter
Bob Arctor
Palmer Eldritch
Perky Pat

And, of course, Horselover Fat.

My personal favorite names from these stories are Ragle Gumm from Time out of Joint, and Addison Doug, the main character in an obscure spaceship/time-travel story from 1974.

I feel like it shows a deep confidence to give your characters this sort of name. As names, they’re off, but at the same time they’re just right in context. “Addison Doug,” indeed.

Some authors are good at titles, some are good at last lines, some are good at names. So many books, even great books, have character names that are boring or too cute or just fine, but no more than just fine. To come up with these distinctive names is a high-risk ploy that, when it works, adds something special to the whole story.

The contrapositive of “Politics and the English Language.” One reason writing is hard:

Andrew — Mon, 25 Mar 2024 13:50:41 +0000

In his classic essay, “Politics and the English Language,” the political journalist George Orwell drew a connection between cloudy writing and cloudy content.

The basic idea was: if you don’t know what you’re saying, or if you’re trying to say something you don’t really want to say, then one strategy is to write unclearly. Conversely, consistently cloudy writing can be an indication that the writer ultimately doesn’t want to be understood.

In Orwell’s words:

[The English language] becomes ugly and inaccurate because our thoughts are foolish, but the slovenliness of our language makes it easier for us to have foolish thoughts.

He continues:

In our time, political speech and writing are largely the defence of the indefensible. Things like the continuance of British rule in India, the Russian purges and deportations, the dropping of the atom bombs on Japan, can indeed be defended, but only by arguments which are too brutal for most people to face, and which do not square with the professed aims of the political parties. Thus political language has to consist largely of euphemism, question-begging and sheer cloudy vagueness.

A few years ago I posted on this topic, drawing an analogy to cloudy writing in science. To be sure, much of the bad writing in science comes from researchers who have never learned to write clearly. Writing is hard!

But it’s not just that. A key problem with a lot of the bad science that we see featured in PNAS, Ted, NPR, Gladwell, Freakonomics, etc., is that the authors are trying to use statistical analysis and storytelling to do something they can’t do with their science, which is to draw near-certain conclusions from noisy data that can’t support strong conclusions. This leads to tortured constructions such as this from a medical journal:

The pair‐wise results (using paired‐samples t‐test as well as in the mixed model regression adjusted for age, gender and baseline BMI‐SDS) showed significant decrease in BMI‐SDS in the parents–child group both after 3 and 24 months, which indicate that this group of children improved their BMI status (were less overweight/obese) and that this intervention was indeed effective.

However, as we wrote in the results and the discussion, the between group differences in the change in BMI‐SDS were not significant, indicating that there was no difference in change in our outcome in either of the interventions. We discussed, in length, the lack of between‐group difference in the discussion section. We assume that the main reason for the non‐significant difference in the change in BMI‐SDS between the intervention groups (parents–child and parents only) as compared to the control group can be explained by the fact that the control group had also a marginal positive effect on BMI‐SDS . . .

Obv not as bad as political journalists in the 1930s defending Stalin’s purges or whatever; the point is that the author is in the awkward position of trying to use the ambiguities of language to say something while not quite saying it. Which leads to unclear and barely readable writing, not just by accident.

The writing and the statistics have to be cloudy, because if they were clear, the emptiness of the conclusions would be apparent.

The problem

Orwell’s statement, when transposed to writing a technical paper, is that if you attempt to cover the gaps in your reasoning with words, this will typically yield bad writing. Indeed, if you’re covering the gaps in your reasoning with words, you’ll either have bad writing or dishonest writing, or both. In some important way, it’s a good thing that this sort of writing is so hard to follow; otherwise it could be really misleading.

Now let’s flip it around.

Often you will find yourself trying to write an article, and it will be very difficult to write it clearly. You’ll go around and around, and whatever you do, your written output will feel like the worst of both worlds: a jargon-filled mess, while at the same time being sloppy and imprecise. Try to make it more readable and it becomes even sloppier and harder to follow at a technical level; try to make it accurate and precise, and it reads like a complicated, uninterpretable set of directions.

You’re stuck. You’re in a bad place. And any direction you take makes the writing worse in some important way.

What’s going on?

It could be this: You’re trying to write something you don’t fully understand, you’re trying to bridge a gap between what you want to say and what is actually justified by your data and analysis . . . and the result is “Orwellian,” in the sense that you’re desperately using words to try to paper over this yawning chasm in your reasoning.

The solution

One way out of this trap is to follow what we could call Orwell’s Contrapositive.

It goes like this: Step back. Pause in whatever writing you’re doing. Pull out a new sheet of paper (or an empty document on the computer) and write, as directly as you can, in two columns. Column 1 is what you want to be able to say (the method is effective, the treatment saves lives, whatever); Column 2 is what is supported by your evidence (the method works better than a particular alternative in a particular setting, fewer people died in the treatment than the control group after adjusting this and that, whatever).

At that point, do the work to pull Column 2 to Column 1, or make concessions to reality to shift Column 1 toward Column 2. Do what it takes to get them to line up.

At this point, you’ve left the bad zone in which you’re trying to say more than you can honestly say. And the writing should then go much smoother.

That’s the contrapositive: if bad writing is a sign of someone trying to say the indefensible, then you can make your writing better by not trying to say the indefensible, either by expanding what is legitimately defensible or restricting what you’re trying to say.

Remember the folk theorem of statistical computing: When you have computational problems, often there’s a problem with your model. Orwell’s Contrapositive is a sort of literary analogy to that.

One reason writing is hard

To put it another way: One reason writing is hard is that we use writing to cover the gaps in our reasoning. This is not always a bad thing! On the way to the destination of covering these gaps is the important step of revealing these gaps. We write to understand. Writing has an internal logic that can protect us from (some) errors and gaps—if we let it, by reacting to the warning sign that the writing is unclear.

Hey! Here’s a study where all the preregistered analyses yielded null results but it was presented in PNAS as being wholly positive.

Andrew — Sun, 24 Mar 2024 13:43:41 +0000

Ryan Briggs writes:

In case you haven’t seen this, PNAS (who else) has a new study out entitled “Unconditional cash transfers reduce homelessness.” This is the significance statement:

A core cause of homelessness is a lack of money, yet few services provide immediate cash assistance as a solution. We provided a one-time unconditional CAD$7,500 cash transfer to individuals experiencing homelessness, which reduced homelessness and generated net societal savings over 1 y. Two additional studies revealed public mistrust in homeless individuals’ ability to manage money and the benefit of counter-stereotypical or utilitarian messaging in garnering policy support for cash transfers. This research adds to growing global evidence on cash transfers’ benefits for marginalized populations and strategies to increase policy support. Although not a panacea, cash transfers may hasten housing stability with existing social supports. Together, this research offers a new tool to reduce homelessness to improve homelessness reduction policies.

Based on that, I was surprised to read the pre-registration documents and supplemental information and learn that literally none of the outcomes that the researchers pre-registered were significant. Even the variable that they chose to focus on (days homeless) was essentially the same in the 12 month follow up (0.18 vs 0.17) and, just eyeballing Table S3, it seems the differences were rarely large and not ever significant in any single follow up period.

This is now generating news coverage about how cash transfers work to reduce homelessness (e.g., here and here).

I guess in a sense pre-registration worked because we can see that they did not expect this and had to explore to find it, but what good does that do if the press just reports it all credulously?

I have mixed feelings on this one. On one hand, I don’t like the whole statistical-significance-thresholding thing: if the study found positive results, this could be worth reporting, even if the results are within the margin of error. This within-the-margin-of-error bit should just be mentioned in the news articles. On the other hand, if the researchers are rummaging around through their results looking for something big to report, then, yeah, these results will be massively biased upward.
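
Here's a quick simulation sketch of that upward bias. Everything in it is an assumption for illustration: 35 outcomes whose true effects are all exactly zero, independent estimates with standard error 1, and a researcher who reports whichever estimate comes out largest. The helper `largest_of_null_estimates` is hypothetical.

```python
import random
import statistics

def largest_of_null_estimates(n_outcomes, n_sims, seed=0):
    """Average of the largest estimate when every true effect is zero.

    Each simulated study draws n_outcomes independent estimates from a
    standard normal (true effect 0, standard error 1) and keeps the max.
    """
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sims):
        estimates = [rng.gauss(0.0, 1.0) for _ in range(n_outcomes)]
        maxima.append(max(estimates))
    return statistics.mean(maxima)

avg_max = largest_of_null_estimates(n_outcomes=35, n_sims=2000)
print(f"average largest estimate, in standard errors: {avg_max:.2f}")
```

Even with nothing going on, the best-looking outcome lands a couple of standard errors above zero on average, which is the sense in which results found by rummaging are biased upward.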

So, from that perspective, maybe a good headline would not be, “Homeless people were given lump sums of cash. Their spending defied stereotypes” or “B.C. researchers studied how homeless people spent a $7,500 handout. Here’s what they found,” but rather something like, “Preliminary results from a small study suggest . . .”

But then we could step back and ask, How did this study get the press in the first place? I’m guessing PNAS is the reason. So let’s head to the PNAS paper. From the abstract:

Exploratory analyses showed that over 1 y, cash recipients spent fewer days homeless, increased savings and spending with no increase in temptation goods spending, and generated societal net savings of $777 per recipient via reduced time in shelters.

I guess that “exploratory analysis” is code for non-preregistered or non-statistically-significant. Either way, I think it’s irresponsible and statistically incorrect—although, regrettably, absolutely standard practice—to report this “$777” without any regularization or partial pooling toward zero. It’s a biased estimate, and the bias could be huge.
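
For a sense of what partial pooling toward zero could look like, here's a minimal sketch using a normal prior centered at zero. The standard error and prior scale are invented (the discussion above doesn't give them), and `shrink_toward_zero` is a hypothetical helper, so the output illustrates the mechanics and is not a corrected estimate for this study.

```python
import math

def shrink_toward_zero(estimate, se, prior_sd):
    """Posterior mean and sd for a normal estimate with a N(0, prior_sd^2) prior.

    The raw estimate is multiplied by prior_sd^2 / (prior_sd^2 + se^2),
    so noisier estimates (larger se) are pulled more strongly toward zero.
    """
    weight = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    post_mean = weight * estimate
    post_sd = math.sqrt(weight) * se
    return post_mean, post_sd

# Raw estimate from the abstract; se and prior_sd are made-up values.
post_mean, post_sd = shrink_toward_zero(estimate=777.0, se=600.0, prior_sd=300.0)
print(f"shrunken estimate: {post_mean:.1f} (posterior sd {post_sd:.1f})")
```

With these assumed inputs the estimate shrinks to a fifth of its raw value. The real answer depends entirely on the actual uncertainty and on how large an effect is plausible a priori.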

Figure 1 of the paper looks very impressive! This figure displays 35 outcomes, almost all of which go in a positive direction (fewer days homeless, more days in stable housing, higher value of savings . . ., all the way down to lower substance use severity, lower cost of all service use, and cost of shelter use). The very few negative outcomes were tiny compared to their uncertainty. If you look at Figure 1, the evidence looks overwhelming.

But Figure 1 does not seem like such a great summary of the data displayed elsewhere in the paper. Looking at Table 3, the good stuff all seems to be happening in the 1-month and 3-month followups without much happening after 1 year.

Here’s what the authors wrote:

The preregistered analyses yielded null effects in cognitive and well-being outcomes, which could be due to the low statistical power from the small participant number in each condition or the possibility that any effect on cognition and well-being may take more than 1 mo to show up.

I agree that these null findings should be mentioned right up there in the abstract. They should also include the possibility that the treatment really has no consistent effect on these outcomes. It’s kinda lame to give all these alibis and never even consider that maybe there’s nothing going on.

What about the housing effects going away after a year? The authors write:

First, the cost of living is extremely high in Vancouver, and the majority of the cash was spent within the first 3 mo for most recipients. Second, while the cash provided immediate benefits, control participants eventually “caught up” over time.

On the other hand, here’s what they said about a different result:

By combining the two cash and two noncash conditions to increase statistical power, exploratory analyses showed that cash recipients showed higher positive affect at 1 mo and higher executive function at 3 mo. Based on debriefing, participants expressed that while they were initially happy with the cash transfer, moving out of homelessness into stable housing took substantial efforts and hard work in the first few months, which could explain the delayed effect on cognitive function.

They’ve successfully convinced me that they have the ability to explain any possible result they might find.

The thing that bothers me most about the paper is that the authors don’t seem to have wrestled with the ways in which their results seem to refute their theoretical framework. Their choice of what to preregister suggests that they were expecting to find large effects on cognitive and subjective well-being outcomes and then maybe, if they were lucky, they’d find some positive results on financial and housing outcomes. I guess their theory was that the money would give people a better take on life, which could then lead to material benefits. Actually, though, they found no benefits on the cognitive and subjective outcomes—when I say “no benefits,” I mean, yeah, really nothing, not just nothing statistically significant—but the money did seem to help people pay the rent for the first few months. That’s fine—there are worse things than giving low-income people some money to pay the rent!—; it’s just a different story than what they’d started with. It’s less of a psychology story and more of an economics story. In any case, yeah, further study is required. I just think that they could get the most from their existing study if they thought more about what went wrong with their theory.

Hey—let’s collect all the stupid things that researchers say in order to deflect legitimate criticism

Andrew — Sat, 23 Mar 2024 13:53:20 +0000

When rereading this post the other day, I noticed the post that came immediately before.

I followed the link and came across the delightful story of a researcher who, after one of his papers was criticized, replied, “We can keep debating this after 11 years, but I’m sure we all have much more pressing things to do (grants? papers? family time? attacking 11-year-old papers by former classmates? guitar practice?)” One of the critics responded with appropriate disdain, writing:

This comment exemplifies the proclivity of some authors to view publication as the encasement of work in a casket, buried deeply so as to never be opened again lest the skeletons inside it escape. But is it really beneficial to science that much of the published literature has become . . . a vast graveyard of undead theories?

I agree. To put it another way: Yes, ha ha ha, let’s spend our time on guitar practice rather than exhuming 11-year-old published articles. Fine—I’ll accept that, as long as you also accept that we should not be citing 11-year-old articles.

As is so often the case, the authors of published work are happy to get unthinking positive publicity and citations, but when anything negative comes in, they pull up the drawbridge.

From the perspective of the ladder of responses to criticism, the above behavior isn’t so bad: they’re not suing their critics or using surrogates to attack their critics or labeling anybody as suicide bombers or East German secret police, they’re just trying to laugh it off. From a scientific perspective, though, it’s still pretty bad to act as if there’s something wrong with discussing the flaws of a paper that’s still being cited, just cos it’s a decade old.

Putting together a list

Anyway, this made me think of a fun project, which is to list all the different ways that researchers try to avoid addressing legitimate criticism of their published work.

Here are a few responses we’ve seen. I won’t bother finding the links right now, but if we put together a good list, I can go back and provide references for all of them.

1. The corrections do not affect the main results of the paper. (Always a popular claim, even if the corrections actually do affect the main results of the paper.)

2. The criticism should be dismissed because the critics are obsessive/Stasi/terrorists, etc. (Recall the Javert paradox.)

3. The critics are jealous losers sniping at their betters. Or, if that doesn’t work, the critics are picking on unfortunate young researchers. (I don’t think it does any favors to researchers of any age to exempt their work from criticism.)

4. The criticism is illegitimate if it does not go through the peer-review process. (A hard claim to swallow given how the peer-review process is rigged against criticism of published papers.)

5. Criticism should be a discreet exchange between author and critic, with no public criticism. (But the people who claim to hold that attitude seem to have no problem when their work is cited or praised in a public way.)

The most common response to criticism seems to be to just ignore it entirely and hope it goes away. Unfortunately, that strategy often seems to work very well!

Jonathan Bailey vs. Stephen Wolfram

Andrew — Fri, 22 Mar 2024 13:41:56 +0000

Key quote:

While there are definitely environments where using a ghostwriter is acceptable, academic publishing typically isn’t one of them.

The reason is simple: Using a ghostwriter on an academic paper entails having an author do significant work on the paper without receiving credit or having their work disclosed. This is broadly seen as a breach of authorship and an act of research misconduct unto itself.

Why are all these school cheating scandals happening?

Andrew — Thu, 21 Mar 2024 13:14:11 +0000

Paul Alper writes:

While the national scene is all about woke, book banning and the like, apparently Columbia University is still dealing with the long-standing conundrum, the best method to teach kids how to read.

He’s referring to this news article, “Amid Reading Wars, Columbia Will Close a Star Professor’s Shop,” which begins:

Lucy Calkins ran a beloved — and criticized — center at Teachers College for four decades. It is being dissolved. . . .

Her curriculum had teachers conduct “mini-lessons” on reading strategies, but also gave students plenty of time for silent reading and freedom to choose their own books. Supporters say those methods empower children, but critics say they waste precious classroom minutes, and allow students to wallow in texts that are too easy.

Some of the practices she once favored, such as prompting children to guess at words using the first letter and context clues, like illustrations, have been discredited.

Over the past three years, several prominent school districts — including New York City, the nation’s largest — dropped her program, though it remains in wide use. . . .

Critics of her ideas, including some cognitive scientists and instructional experts, said her curriculum bypassed decades of settled research, often referred to as the science of reading. That body of research suggests that direct, carefully sequenced instruction in phonics, vocabulary building and comprehension is more effective for young readers than Dr. Calkins’s looser approach.

Alper writes:

This article did not at all mention anything about language specifics. I bring this up because my granddaughters are in a Minneapolis Spanish immersion primary school. Because Spanish is almost 100 per cent phonetic, and English is terrible in this regard, they spell and read better in Spanish than they do in English. The mechanics of learning to read back in my day, was simple and devoid of theory or disagreement. You kept at it until you got it right. The it was English only because no accommodation was made for special needs, immigrants or for the outside world in general.

I know some people at Teachers College but I’ve never encountered Prof. Calkins, nor have I ever looked at the literature on language teaching. So I got nothin’ on this one.

But I did reply that the above story isn’t half as bad as this one from a few years back, which I titled, “What’s the stupidest thing the NYC Department of Education and Columbia University Teachers College did in the past decade?” It involved someone who was found to be a liar, a cheat, and a thief, and then, with that all known, was hired to two jobs as school principal! And then a Teachers College professor said, “We felt that on balance, her recommendations were so glowing from everyone we talked to in the D.O.E. that it was something that we just were able to live with.” This came out in the news after the principal in question was found to have “forged answers on students’ state English exams in April because the students had not finished the tests.” Quelle surprise, no? A liar/cheat/thief gets a new job doing the same thing and then does more lying and cheating (maybe no stealing that time, though).

Alper responded:

You wrote that in 2015 which is about the same time as this story which made Fani Willis RICO famous:

Her most prominent case was her prosecution of the Atlanta Public Schools cheating scandal. Willis, an assistant district attorney at the time, served as lead prosecutor in the 2014 to 2015 trial of twelve educators accused of correcting answers entered by students to inflate the scores of state administered standardized tests.

SAT and all the others did not exist in my 1950 NYC school days, but I believe we did have the so-called Regents Exams and they are still around. It never crossed my mind that the scoring of those exams was not on the up and up. Was I being naive? Was there more honesty and/or less messing around back then and it was just not financially worth it?

Here’s my response:

1. This particular form of cheating sounds no more easy or difficult now than in the past.

2. In the past (i.e., somewhere between 1950 and 2015), tests were important for students but not so much for schools. So, yeah, students may have been motivated to cheat, but teachers and school administrators did not have any motivation, either to help students cheat, or to massively cheat on their own. Nowadays, tests can be high stakes for the school administrators, and so, for some of them, cheating is worth the risk.

“Whistleblowers always get punished”

Andrew — Wed, 20 Mar 2024 13:45:55 +0000

In one of our comment threads about how scholars and journalists should be thanking, not smearing, people who ask for replications, Allan Stam writes:

The corollary to all this, and closely related to Javert’s paradox, is the social law: Whistleblowers always get punished.

The Javert paradox, as regular readers will recall, goes like this: Suppose you find a problem with published work. If you just point it out once or twice, the authors of the work are likely to do nothing. But if you really pursue the problem, then you look like a Javert, that is, like an obsessive, a “hater,” someone who needs to “get a life.” It’s complicated, because some critics really do obsess over unimportant details.

On the other hand, details that are unimportant in themselves can be important as indicating bigger problems. For example, the Nudgelords hyped some junk science. In one way, that’s no big deal: everybody makes mistakes. But their lack of interest in their mistakes and their willingness to memory-hole these errors suggests a deeper problem, in that their workflow is lacking that important feedback loop that can allow them to identify places where their model for the world has failed. A lack of interest in confronting the failure of one’s model: that’s something that bothered me with so many Bayesians back in the early 1990s, motivating much of my work on posterior predictive checking, and it bothers me today.

The point is, sometimes to find the problems you have to look at the details in detail, which takes the sort of extra effort that can make you look obsessive—heck, maybe it is obsessive. But, so what? And, sure, sometimes a critic will be obsessive and also just be mistaken, and that’s annoying, but there’s little we can do except to try our best to respond to those mistaken criticisms when they arise.

Now back to Stam’s point.

I pretty much agree with what he’s saying: whistleblowing just about always seems to be a bad career move. The clarification I’d like to make is that the “punishment” received by a whistleblower is not necessarily the result of anyone directly trying to punish anyone.

Here’s how it goes. Scholar A does something wrong—maybe it’s flat-out cheating, maybe it’s just bad work that the scholar doesn’t want anyone to re-examine, which, OK, that attitude is a form of cheating too (Clarke’s law!). Scholar B points out the problem.

At this point, no “whistleblowing” has happened. “Whistleblowing” occurs following two more steps: (1) Scholar A, instead of behaving properly by acknowledging and considering the criticism, evades it or flat-out lies about it; (2) Scholar B, instead of just letting this be the end, keeps on about it. I guess that even this is not necessarily whistleblowing. Also the whistleblower has to be on the inside.

OK, so at this point it’s a negative-sum game. Scholar A can get the reputation of someone who does bad work and refuses to learn from mistakes. Scholar B can get the reputation of not being a team player. The more this goes on, the more that both scholars are hurt. Even if the final consensus is close to Scholar B’s position, so that Scholar B has “won” the intellectual and social argument, it’s still likely to be a net loss in that Scholar B gets some reputation as a difficult person. Conversely, even if Scholar A “wins” in the sense of there being a consensus judgment that the criticism was misguided, there can still be a vague cloud that hangs over Scholar A’s head.

Part of this whole net-loss thing arises because most academics get no negative coverage at all. In politics, any success brings some negative coverage, and getting into a fight can be worth it, by helping you stand out from the crowd. In academia, you want to be known for positive contributions. At least, in science academia. Humanities and some of social science seem different: there, I guess it’s more common for scholars to make their names through controversy.

Anyway, here’s my point. A scientific dispute involving claims of unethical behavior can easily end up hurting both sides. Even if nobody’s trying to punish a whistleblower, there are negative social consequences, and in that sense I think Stam is correct.

“I was left with an overwhelming feeling that the World Values Survey is simply a vehicle for telling stories about values . . .”

Andrew — Tue, 19 Mar 2024 13:22:05 +0000

Dale Lehman writes:

My guess is that you are familiar with the World Values Survey – I was not until I saw it described in the Economist this week (August 12, 2023). It has probably been used in the careers of many academics and is a monumental effort to collect survey data about values from across the world over a long period of time (the latest wave of the survey includes around 130,000 respondents from at least 90 countries). With the caveat that I have no experience with this data and have not read anything about its methodological development, I am struck by what seems like a shoddy ill-conceived research effort. To begin with a minor thing that appeared in the Economist story, I’ve attached a screenshot of part of what appears in the print magazine (the online version interactively builds up this view so the print version is more complete but provides less context). I have an issue with the visualization – some might call it a quibble, I’d call it a major problem – and I can’t tell if the blame lies with The Economist or the WVS, but it is what first alerted me to this data. The change in values over time is shown by the line segments ending in a circle marker for the latest survey wave. Why didn’t they use arrows rather than a circle at one end? I think this is inexcusable – arrows invoke pre-attentive visual processing whereas the line segment/circles force me to constantly reassess the picture to understand how things are changing. In other words, the visual presented doesn’t work – arrows would be immensely better. I don’t believe that is just sloppiness – I think it reveals something more fundamental, and that is what really concerns me about the WVS.

Moving on in the graph, I am immediately struck by the dimensions of the graph. The methodology is described in detail on the WVS website (https://www.worldvaluessurvey.org/WVSContents.jsp) and I haven’t reviewed it in detail. But I have a number of issues about these measurements. Among these:

– The survival-self expression dimension strikes me as unintuitive. Since the questions involved (such as the importance of religion vs the importance of environmental protection) are linked to wealth, and much of the WVS research concerns changes in values as wealth changes, why not measure wealth directly? My preference would be for the more unambiguous (relatively speaking) measures like GDP than these derived measures that seem vague to me.

– I have similar issues with the other dimension: traditional vs secular-rational. Neither of these seem intuitive to me and the underlying questions don’t improve things. There are questions about “national pride” and self descriptions of whether or not someone feels “very happy.” I find it very difficult to see how these map cleanly into the dimension they are being used for.

– Since these surveys are done across many countries and over time, I think the meaning of the words may change. For example, asking whether “people are trustworthy” requires the idea of “trust” to mean the same thing in different places and different periods of time. I see no evidence of this and can imagine that there might be differences in how people interpret phrases like that. In general, it seems to me that the wording of these survey questions was not carefully thought out or tested (though perhaps I just am not familiar with their development).

– I am disturbed by the use of single points to represent entire countries. Indeed, there is considerable discussion of how heterogeneous countries are, but the graphs use average measures to represent entire countries. As with many things, the average may be less interesting than the variability. This concern is accentuated by the aggregation of these countries into groups such as “Protestant Europe” and “Orthodox Europe.” I don’t find these groups particularly intuitive either.

– I’m unconvinced that the two dimensional picture of values is the best way to analyze values. Are these two dimensions the most important? Why two? Perhaps the changes over time simply reveal how valid the dimensions are rather than any intrinsic changes in values people hold.

There is more, but I’ll stress again that I have no background with this data. I can say it was difficult for me to even read the Economist article since almost every statement struck me as troublesome regarding what was being measured and how it relates to the fundamental methodology of this two dimensional view of values. I also can’t tell how much of my concern lies with the Economist article or the WVS itself. But I was left with an overwhelming feeling that the WVS is simply a vehicle for telling stories about values and how they differ between countries or groups of people and how these change over time. Those stories are naturally interesting, but I don’t see that the methodology and data support any particular story over any other. It seems like a perfect mechanism for academic career development, but little else.

My reply: I’m not sure! I’ve never worked with the World Values Survey myself. Maybe some readers can share their thoughts?

Inspiring story from a chemistry classroom

Andrew — Mon, 18 Mar 2024 13:07:12 +0000

From former chemistry teacher HildaRuth Beaumont:

I was reminded of my days as a newly qualified teacher at a Leicestershire comprehensive school in the 1970s, when I was given a group of reluctant pupils with the instruction to ‘keep them occupied’. After a couple of false starts we agreed that they might enjoy making simple glass ornaments. I knew a little about glass blowing so I was able to teach them how to combine coloured and transparent glass to make animal figures and Christmas tree decorations. Then one of them made a small bottle complete with stopper. Her classmate said she should buy some perfume, pour some of it into the bottle and give it to her mum as a Mother’s Day gift. ‘We could actually make the perfume too,’ I said. With some dried lavender, rose petals, and orange and lemon peel, we applied solvent extraction and steam distillation to good effect and everyone was able to produce small bottles of perfume for their mothers.

What a wonderful story. We didn’t do anything like this in our high school chemistry classes! Chemistry 1 was taught by an idiot who couldn’t understand the book he was teaching out of. Chemistry 2 was taught with a single-minded goal of teaching us how to solve the problems on the Advanced Placement exam. We did well on the exam and learned essentially zero chemistry. On the plus side, this allowed me to place out of the chemistry requirement in college. On the minus side . . . maybe it would’ve been good for me to learn some chemistry in college. I don’t remember doing any labs in Chemistry 2 at all!

Preregistration is a floor, not a ceiling.

Andrew — Sun, 17 Mar 2024 20:37:03 +0000

This comes up from time to time, for example someone sent me an email expressing a concern that preregistration stifles innovation: if Fleming had preregistered his study, he never would’ve noticed the penicillin mold, etc.

My response is that preregistration is a floor, not a ceiling. Preregistration is a list of things you plan to do, that’s all. Preregistration does not stop you from doing more. If Fleming had followed a pre-analysis protocol, that would’ve been fine: there would have been nothing stopping him from continuing to look at his bacterial cultures.

As I wrote in comments to my 2022 post, “What’s the difference between Derek Jeter and preregistration?” (which I just added to the lexicon), you don’t preregister “the” exact model specification; you preregister “an” exact model specification, and you’re always free to fit other models once you’ve seen the data.

It can be really valuable to preregister, to formulate hypotheses and simulate fake data before gathering any real data. To do this requires assumptions—it takes work!—and I think it’s work that’s well spent. And then, when the data arrive, do everything you’d planned to do, along with whatever else you want to do.
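To make the fake-data step concrete, here is a minimal sketch in Python of what simulating before collecting data could look like. Everything specific in it is an assumption made for illustration and not something from a particular study: the two-group design, the sample size, the assumed effect size and standard deviation, and the function names `simulate_study` and `summarize_design` are all hypothetical. The planned analysis here is just a difference in means with its standard error.

```python
import math
import random
import statistics


def simulate_study(n_per_group, true_effect, sd, rng):
    """Simulate one two-group study and run the planned analysis.

    Assumptions (hypothetical, for illustration): outcomes are normal,
    the control mean is 0, and both groups share the same sd.
    """
    control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
    treated = [rng.gauss(true_effect, sd) for _ in range(n_per_group)]
    # Planned analysis: difference in means and its standard error.
    diff = statistics.mean(treated) - statistics.mean(control)
    se = math.sqrt(
        statistics.variance(treated) / n_per_group
        + statistics.variance(control) / n_per_group
    )
    return diff, se


def summarize_design(n_per_group, true_effect, sd, n_sims=2000, seed=1):
    """Repeat the fake study many times and summarize what the design gives."""
    rng = random.Random(seed)
    results = [
        simulate_study(n_per_group, true_effect, sd, rng) for _ in range(n_sims)
    ]
    estimates = [d for d, _ in results]
    return {
        "mean_estimate": statistics.mean(estimates),
        "sd_of_estimates": statistics.stdev(estimates),
        "mean_se": statistics.mean(s for _, s in results),
        # Share of simulated studies whose estimate is more than 2 se from zero.
        "frac_beyond_2se": sum(abs(d) > 2 * s for d, s in results) / n_sims,
    }


if __name__ == "__main__":
    # Assumed design values; change them to match your own hypotheses.
    print(summarize_design(n_per_group=50, true_effect=0.5, sd=1.0))
```

The point of a sketch like this is that writing it forces you to state the assumptions (effect size, variation, sample size) before any real data arrive. Once the data are in, the same planned analysis can be run on them, along with whatever else you want to do.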

Planning ahead should not get in the way of creativity. It should enhance creativity because you can focus your data-analytic efforts on new ideas rather than having to first figure out what defensible default thing you’re supposed to do.

Aaaand, pixels are free, so here’s that 2022 post in full:

What’s the difference between Derek Jeter and preregistration?

There are probably lots of clever answers to this one, but I’ll go with: One of them was hyped in the media as a clean-cut fresh face that would restore fan confidence in a tired, scandal-plagued entertainment cartel—and the other is a retired baseball player.

Let me put it another way. Derek Jeter had three salient attributes:

1. He was an excellent baseball player, rated by one source at the time of his retirement as the 58th best position player of all time.

2. He was famously overrated.

3. He was a symbol of integrity.

The challenge is to hold 1 and 2 together in your mind.

I was thinking about this after Palko pointed me to a recent article by Rose McDermott that begins:

Pre-registration has become an increasingly popular proposal to address concerns regarding questionable research practices. Yet preregistration does not necessarily solve these problems. It also causes additional problems, including raising costs for more junior and less resourced scholars. In addition, pre-registration restricts creativity and diminishes the broader scientific enterprise. In this way, pre-registration neither solves the problems it is intended to address, nor does it come without costs. Pre-registration is neither necessary nor sufficient for producing novel or ethical work. In short, pre-registration represents a form of virtue signaling that is more performative than actual.

I think this is like saying, “Derek Jeter is no Cal Ripken, he’s overrated, gets too much credit for being in the right place at the right time, he made the Yankees worse, his fans don’t understand how the game of baseball really works, and it was a bad idea to promote him as the ethical savior of the sport.”

Here’s what I think of preregistration: It’s a great idea. It’s also not the solution to problems of science. I have found preregistration to be useful in my own work. I’ve seen lots of great work that is not preregistered.

I disagree with the claim in the above-linked paper that “Under the guidelines of preregistration, scholars are expected to know what they will find before they run the study; if they get findings they do not expect, they cannot publish them because the study will not be considered legitimate if it was not preregistered.” I disagree with that statement in part for the straight-up empirical reason that it’s false; there are counterexamples; indeed a couple years ago we discussed a political science study that was preregistered and yielded unexpected findings which were published and were considered legitimate by the journal and the political science profession.

More generally, I think of preregistration as a floor, not a ceiling. The preregistered data collection and analysis is what you need to do. In addition, you can do whatever else you want.

Preregistration remains overrated if you think it’s gonna fix science. Preregistration facilitates the conditions for better science, but if you preregister a bad design, it’s still a bad design. Suppose you could go back in time and preregister the collected work of the beauty-and-sex-ratio guy, the ESP guy, and the Cornell Food and Brand Lab guy, and then do all those studies. The result wouldn’t be a spate of scientific discoveries; it would just be a bunch of inconclusive results, pretty much no different than the inconclusive results we actually got from that crowd but with the improvement that the inconclusiveness would have been more apparent. As we’ve discussed before, the benefits of procedural reforms such as preregistration are indirect—making it harder for scientists to fool themselves and others with bad designs—but not direct. Are these indirect benefits greater than the costs? I don’t know; maybe McDermott is correct that they’re not. I guess it depends on the context.

I think preregistration can be valuable, and I say that while recognizing that it’s been overrated and inappropriately sold as a miracle cure for scientific corruption. As I wrote a few years ago:

In the long term, I believe we as social scientists need to move beyond the paradigm in which a single study can establish a definitive result. In addition to the procedural innovations [of preregistration and mock reports], I think we have to more seriously consider the integration of new studies with the existing literature, going beyond the simple (and wrong) dichotomy in which statistically significant findings are considered as true and nonsignificant results are taken to be zero. But registration of studies seems like a useful step in any case.

Derek Jeter was overrated. He was at times a drag on the Yankees’ performance. He was still an excellent player and overall was very much a net positive.

“On the uses and abuses of regression models: a call for reform of statistical practice and teaching”: We’d appreciate your comments . . .

Andrew — Sun, 17 Mar 2024 13:51:35 +0000

John Carlin writes:

I wanted to draw your attention to a paper that I’ve just published as a preprint: On the uses and abuses of regression models: a call for reform of statistical practice and teaching (pending publication I hope in a biostat journal). You and I have discussed how to teach regression on a few occasions over the years, but I think with the help of my brilliant colleague Margarita Moreno-Betancur I have finally figured out where the main problems lie – and why a radical rethink is needed. Here is the abstract:

When students and users of statistical methods first learn about regression analysis there is an emphasis on the technical details of models and estimation methods that invariably runs ahead of the purposes for which these models might be used. More broadly, statistics is widely understood to provide a body of techniques for “modelling data”, underpinned by what we describe as the “true model myth”, according to which the task of the statistician/data analyst is to build a model that closely approximates the true data generating process. By way of our own historical examples and a brief review of mainstream clinical research journals, we describe how this perspective leads to a range of problems in the application of regression methods, including misguided “adjustment” for covariates, misinterpretation of regression coefficients and the widespread fitting of regression models without a clear purpose. We then outline an alternative approach to the teaching and application of regression methods, which begins by focussing on clear definition of the substantive research question within one of three distinct types: descriptive, predictive, or causal. The simple univariable regression model may be introduced as a tool for description, while the development and application of multivariable regression models should proceed differently according to the type of question. Regression methods will no doubt remain central to statistical practice as they provide a powerful tool for representing variation in a response or outcome variable as a function of “input” variables, but their conceptualisation and usage should follow from the purpose at hand.

The paper is aimed at the biostat community, but I think the same issues apply very broadly at least across the non-physical sciences.

Interesting. I think this advice is roughly consistent with what Aki, Jennifer, and I say and do in our books Regression and Other Stories and Active Statistics.

More specifically, my take on teaching regression is similar to what Carlin and Moreno say, with the main difference being that I find that students have a lot of difficulty understanding plain old mathematical models. I spend a lot of time teaching the meaning of y = a + bx, how to graph it, etc. I feel that most regression textbooks focus too much on the error term and not enough on the deterministic part of the model. Also, I like what we say on the first page of Regression and Other Stories, about the three tasks of statistics being generalizing from sample to population, generalizing from control to treatment group, and generalizing from observed data to underlying constructs of interest. I think models are necessary for all three of these steps, so I do think that understanding models is important, and I’m not happy with minimalist treatments of regression that describe it as a way of estimating conditional expectations.
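As a rough illustration of separating the deterministic part of the model from the error term, here is a small Python sketch. It is not taken from Regression and Other Stories or from the Carlin and Moreno-Betancur paper; the function names (`line`, `fake_data`, `least_squares`) and the numbers a = 2, b = 0.5, sigma = 1 are made up for the example. The first function is just y = a + bx; the second adds noise around that line; the third recovers a and b by ordinary least squares.

```python
import random
import statistics


def line(a, b, x):
    """Deterministic part of the model: y = a + b*x."""
    return a + b * x


def fake_data(a, b, sigma, n, seed=0):
    """Draw x uniformly on [0, 10] and add normal errors around the line.

    The x range and the normal error distribution are assumptions
    chosen only for this illustration.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [line(a, b, x) + rng.gauss(0, sigma) for x in xs]
    return xs, ys


def least_squares(xs, ys):
    """Ordinary least squares estimates of intercept and slope."""
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat


if __name__ == "__main__":
    a, b = 2.0, 0.5
    # a is the value of the line at x = 0.
    print("y at x = 0:", line(a, b, 0))
    # b is the difference in y between points that differ by 1 in x.
    print("y(4) - y(3):", line(a, b, 4) - line(a, b, 3))
    xs, ys = fake_data(a, b, sigma=1.0, n=500)
    print("estimated (a, b):", least_squares(xs, ys))
```

Reading the first two printed lines before looking at any fitted coefficients is one way to practice interpreting a and b directly from the line, which is the part students seem to find hard.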

The first of these tasks is sampling inference, the second is causal inference, and the third refers to measurement. Statistics books (including my own) spend lots of time on sampling and causal inference, not so much on measurement. But measurement is important! For an example, see here.

If any of you have reactions to Carlin and Moreno’s paper, or if you have reactions to my reactions, please share them in comments, as I’m sure they’d appreciate it.