More on possibly rigor-enhancing practices in quantitative psychology research

In a paper entitled "Causal claims about scientific rigor require rigorous causal evidence," Joseph Bak-Coleman and Berna Devezer write:

Protzko et al. (2023) claim that “High replicability of newly discovered social-behavioral findings is achievable.” They argue that the 86% rate of replication observed in their replication studies is due to “rigor-enhancing practices” such as confirmatory tests, large sample sizes, preregistration and methodological transparency. These findings promise hope as concerns over low rates of replication have plagued the social sciences for more than a decade. Unfortunately, the observational design of the study does not support its key causal claim. Instead, inference relies on a post hoc comparison of a tenuous metric of replicability to past research that relied on incommensurable metrics and sampling frames.

The article they’re referring to is by a team of psychologists (John Protzko, Jon Krosnick, et al.) reporting “an investigation by four coordinated laboratories of the prospective replicability of 16 novel experimental findings using rigor-enhancing practices: confirmatory tests, large sample sizes, preregistration, and methodological transparency. . . .”

When I heard about that paper, I teed off on their proposed list of rigor-enhancing practices.

I’ve got no problem with large sample sizes, preregistration, and methodological transparency. And confirmatory tests can be fine too, as long as they’re not misinterpreted and not used for decision making.

My biggest concern is that the authors or readers of that article will think that these are the best rigor-enhancing practices in science (or social science, or psychology, or social psychology, etc.), or the first rigor-enhancing practices that researchers should reach for, or the most important rigor-enhancing practices, or anything like that.

Instead, I gave my top 5 rigor-enhancing practices, in approximately decreasing order of importance:

1. Make it clear what you’re actually doing. Describe manipulations, exposures, and measurements fully and clearly.

2. Increase your effect size, e.g., do a more effective treatment.

3. Focus your study on the people and scenarios where effects are likely to be largest.

4. Improve your outcome measurement.

5. Improve pre-treatment measurements.

The suggestions of “confirmatory tests, large sample sizes, preregistration, and methodological transparency” are all fine, but I think all are less important than the 5 steps listed above. You can read the linked post to see my reasoning; also there’s Pam Davis-Kean’s summary, “Know what the hell you are doing with your research.” You might say that goes without saying, but it doesn’t, even in some papers published in top journals such as Psychological Science and PNAS!
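
To make the contrast concrete, here is a minimal simulation sketch of a two-arm experiment that compares what happens to power when you double the sample size versus doubling the effect size or cutting the measurement noise. The numbers (a true effect of 0.2, measurement noise twice as large as the outcome's own variation, 100 people per arm) are invented purely for illustration; they are not taken from Protzko et al. or from any real study.

```python
# Rough power simulation for a two-arm experiment, assuming a continuous
# outcome measured with error.  All numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

def power(effect, outcome_sd, noise_sd, n_per_arm, n_sims=5000):
    """Share of simulated experiments whose two-sample z-test rejects at the 5% level."""
    hits = 0
    for _ in range(n_sims):
        # true outcome plus measurement error, in each arm
        treat = rng.normal(effect, outcome_sd, n_per_arm) + rng.normal(0, noise_sd, n_per_arm)
        ctrl  = rng.normal(0.0,    outcome_sd, n_per_arm) + rng.normal(0, noise_sd, n_per_arm)
        se = np.sqrt(treat.var(ddof=1) / n_per_arm + ctrl.var(ddof=1) / n_per_arm)
        hits += abs(treat.mean() - ctrl.mean()) / se > 1.96
    return hits / n_sims

base = dict(effect=0.2, outcome_sd=0.5, noise_sd=1.0, n_per_arm=100)
print("baseline:                   ", power(**base))
print("double the sample size:     ", power(**{**base, "n_per_arm": 200}))
print("double the effect size:     ", power(**{**base, "effect": 0.4}))
print("halve the measurement noise:", power(**{**base, "noise_sd": 0.5}))
```

With these made-up numbers, where the measurement is noisier than the thing being measured, doubling the effect size or halving the measurement noise raises power more than doubling the sample size does; that is the pattern behind items 2 and 4 above, and item 5 works similarly by reducing unexplained variation through adjustment for good pre-treatment measures.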

You can also read a response to my post from Brian Nosek, a leader in the replication movement and one of the coauthors of the article being discussed.

In their new article, Bak-Coleman and Devezer take a different tack from mine, in that they're focused on the challenges of measuring the replicability of empirical claims in psychology, whereas I was more interested in the design of future studies. To a large extent, I find the whole replicability thing important to the extent that it gives researchers and users of research less trust in generic statistics-backed claims; I'd guess that actual effects typically vary so much based on context that new general findings are mostly not to be trusted. So I'd say that Protzko et al., Nosek, Bak-Coleman and Devezer, and I are coming from four different directions. (Yes, I recognize that Nosek is one of the authors of the Protzko et al. paper; still, in his blog comment he seemed to have a slightly different perspective.) The article by Bak-Coleman and Devezer seems very relevant to any attempt to understand the empirical claims of Protzko et al.
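
As a small illustration of why measuring replicability is tricky, here is a sketch that simulates an original study and an exact replication of a real effect and then counts "successful replications" under two commonly used criteria. The effect size, standard error, and criteria below are invented for the example; they are not the metrics used by Protzko et al. or by the earlier replication projects.

```python
# Rough sketch: the same simulated studies give different "replication rates"
# depending on the criterion.  All numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)

true_effect = 0.3
se = 0.12            # same standard error for original and replication
n_pairs = 50_000

orig = rng.normal(true_effect, se, n_pairs)   # original estimates
rep  = rng.normal(true_effect, se, n_pairs)   # exact-replication estimates

# Criterion A: replication significant at p < .05 with the same sign as the original.
crit_a = (np.abs(rep) > 1.96 * se) & (np.sign(rep) == np.sign(orig))

# Criterion B: replication estimate falls inside the original 95% interval.
crit_b = np.abs(rep - orig) < 1.96 * se

print("rate under 'significant, same sign':", crit_a.mean())
print("rate under 'inside original 95% CI':", crit_b.mean())
```

With these made-up numbers the two criteria disagree by more than ten percentage points on the very same simulated studies, which is one way to see why comparing a new replication rate to older rates computed with different metrics and sampling frames is shaky.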

6 thoughts on "More on possibly rigor-enhancing practices in quantitative psychology research"

  1. Reviewers will rightly worry about generalizability…except they will do so to the point that (2) and (3) are seen as goosing the effect. All great suggestions, just need buy-in from those in power.

  2. Ehm, I don’t understand this at all. I browsed through the original preprint to try and understand what they exactly did, and how this relates to their findings and conclusions.

    I was immediately worried about the comparison between replicability of older findings and these “new discoveries” as also mentioned by Bak-Coleman and Devezer in their preprint.

    But I don’t even get the following:

    The "rigor enhancing practices" seem to me to come into play at the second stage, where pilot studies and exploratory research conducted independently in each laboratory were replicated by other labs (?).

    If this is correct, I don't understand why and how anything can be said concerning "rigor enhancing practices" and replicability, except that they can state that they used "rigor enhancing practices" when performing all the replications. The "rigor enhancing practices" were not necessarily used when the "new discoveries" were discovered (?); they were only used when replicating the "new discoveries" (?)

    • If I am not mistaken, I think I remember that certain "rigor enhancing practices" (preregistration, larger sample sizes, methodological transparency, confirmatory tests, and being able to ask the original authors) were part of performing the replications in the Reproducibility Project, which found what some may call low rates of replicability.

      If this is correct, I wonder what the authors of this new project and paper in question would state about the role of “rigor enhancing practices” and replicability concerning those original studies and/or replications.

      I am still confused about this all…

  3. Andrew, your recommendations are too general in my opinion and can lead to people misunderstanding what you are saying. This has happened before: what many psycholinguists took away from Gelman and Hill was that one should not log transform positive-only values. That misunderstanding happened because of the lack of clarity in the book about what the topic under discussion was (tl;dr: it wasn’t null hypothesis significance testing; it wasn’t even hypothesis testing of any kind at all).

    In particular, this can be bad advice to give as a one-size-fits-all recommendation:

    > 2. Increase your effect size, e.g., do a more effective treatment.

    > 3. Focus your study on the people and scenarios where effects are likely to be largest.

    I can see that there will be situations where you can increase the effect size without changing the question you are investigating, but there are many, many situations where you are studying inherently small effect sizes, and if you increase the effect size, you also change the question being answered.

    Point 3 is also potentially misleading. For example, if I am studying the effect of working memory on reading time or something like that in an adult native-speaker population, I can find sub-populations (like older people, or non-native speakers) in which the effect will be the largest. But that would again answer a different research question.

    I like the other recommendations in this list, as they apply across the board.

    I think it’s just hard to come up with recommendations that work for every situation. It’s better to recommend to people to equip themselves to understand what they are doing (and to understand the limitations of what they are doing). And then point them to how to achieve this.

    • Shravan:

      Interesting points. In response:

      – My "top 5 recommendations" are not always easy to do, and with any such recommendations you have to think about tradeoffs. That's true with any set of recommendations. For example, the linked paper recommended "confirmatory tests, large sample sizes, preregistration and methodological transparency," and all of those too can involve tradeoffs if used blindly, for example if researchers do a confirmatory test instead of, rather than in addition to, exploratory data analysis; or if they find the budget to attain a large sample size by gathering cheaper measurements on each individual person; or if they use preregistration as a ceiling, rather than a floor, on their statistical analysis; or if they create methodological transparency by avoiding the use of any methods that are difficult to explain. That said, I agree with the general point that, when making recommendations, everyone, including me, should make such tradeoffs clear to potential users, so as not to imply that following these recommendations will necessarily be easy or costless.

      – I agree with your point that increasing effect size can change the nature of the treatment. This is particularly clear in a medical study where you can’t just double the number of pills people take. Again, it would be better for me to say something like, “Increase effect size to the extent feasible, while recognizing that this can change what is being studied.” One thing I will say is that researchers all the time will shift the goal and conclusions of a study to line up better with what has been measured. The study starts with the goal of studying vague hypothesis A, and then it moves to more precise hypothesis B, with B depending on the details of the treatments and measurements. So some of this is inevitable even without any effort to increase the effect size. What this implies for practice is, when considering how to improve the design of a study, don’t restrict yourself by starting with B. Instead, go back to A and consider possible ways of designing a treatment with a large effect within that general umbrella.

      – Regarding my advice to focus on people and scenarios where the effect will be largest: I was assuming that this would be a subset of the population of interest. So I’m recommending to focus on subgroups where you will expect large effects. Not to switch to entirely different groups. And, yes, at some point you might well want to crack the nut and study groups with small effects; that’s just not where I recommend starting. Again, I should make that more clear in my recommendation, as that’s another tradeoff issue.

      – Regarding the final paragraph of your comment: I agree that any general recommendations will have problems. I wrote the above post in reaction to an earlier published paper that referred to purported “rigor-enhancing practices” and where it seemed that those were just taken for granted.

      I've been annoyed for several years about recommendations to preregister, increase sample size, do hypothesis tests right, etc., not because I think any of these are, on their own, bad recommendations, but because of the potential implication that (a) these will directly fix science and that (b) these are the first steps to take. I disagree with implication (a) for the reason I've discussed in this space many times: preregistration etc. can make it harder (but not impossible!) for researchers to fool themselves and others with noise mining, but it won't turn a bad study into a good study; if a study has problems with measurement and is estimating a small or highly variable effect, then preregistration and elimination of so-called "questionable research practices" won't fix that problem. I disagree with implication (b) because I think there are other steps that can more directly improve a study. Hence the above post.
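
      Here is a minimal sketch of that last point, in the spirit of the type M (exaggeration) calculations I've written about before. It assumes a small true effect estimated with a noisy design, fully preregistered, one confirmatory test, no forking paths; the specific numbers (true effect 0.1, standard error 0.25) are made up for illustration.

      ```python
      # Rough sketch: a preregistered but noisy study still exaggerates.
      # The true effect and standard error are made-up numbers.
      import numpy as np

      rng = np.random.default_rng(2)

      true_effect = 0.1      # small true effect
      se = 0.25              # standard error implied by noisy measurement / modest n
      estimates = rng.normal(true_effect, se, 100_000)   # one confirmatory estimate per study
      significant = np.abs(estimates) > 1.96 * se

      print("share of studies reaching p < .05:     ", significant.mean())
      print("mean |estimate| among the significant: ", np.abs(estimates[significant]).mean())
      print("exaggeration ratio (type M):           ",
            np.abs(estimates[significant]).mean() / true_effect)
      print("share of significant with wrong sign (type S):",
            (estimates[significant] < 0).mean())
      ```

      With these made-up numbers, few studies reach significance, and the ones that do report estimates several times larger than the true effect, sometimes with the wrong sign. The preregistration was followed perfectly; it just can't rescue the design.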

      • Hi Andrew,

        That’s a good way to frame it.

        My lab has been doing pre-registrations for several years now, and most of the time what I learned from the pre-registration was that we didn't really adequately think about what we would do once we have the data. My lab and I are getting better at this now, but it took many attempts to do a pre-registration that actually made sense once the data were in. That said, it's still better to do a pre-registration than not, if only for the experimenter's own sake (as a sanity pre-check). However, it turns out that, as you imply, pre-registration doesn't solve any of the real problems. One major problem seems to be the inherent variability of behavioral responses; there is no one truth to find, but we act as if we are seeking out that one truth, and then we find ourselves stuck with an uncertain conclusion. It's striking to me when I read papers that report definitive conclusions; either I don't know how to do experiments or other people are just amazingly lucky. Even with the most patient data collection and the best methods I can muster, we rarely replicate anything, where replication is defined as obtaining clear evidence for a claimed effect.

        So, based on the experimental work we have done in my lab over the last 10 or so years, I am agreeing with you that the original paper you commented on seems to be living in a fantasy world: just do these x things and you will do replicable work. 86% replicability rate (even if the meaning of this term is defined clearly)? Fuggedaboudid. There are a few inherently robust effects that will replicate 100%, but these all constitute low-hanging fruit that has long been picked.
