What is the point, if any, of retrospective power calculations?

In his aforementioned chapter, Stephen Senn writes:

“In order to interpret a trial it is necessary to know its power”: This is a rather silly point of view that nevertheless continues to attract adherents. A power calculation is used for planning trials and is effectively superseded once the data are in. . . . An analogy may be made. In determining to cross the Atlantic it is important to consider what size of boat it is prudent to employ. If one sets sail from Plymouth and several days later sees the Statue of Liberty and the Empire State Building, the fact that the boat employed was rather small is scarcely relevant to deciding whether the Atlantic was crossed.

I used to think this too, but after writing my paper with David Weakliem, I’ve changed my stance on the relevance of retrospective power calculations. In that article, Weakliem and I discussed the problem of Type M (magnitude) errors, where the true effect is small but it is estimated to be large. One problem with underpowered studies is that, when they do turn up statistically significant results, they tend to be huge compared to the true effect sizes.
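
As a rough illustration (the numbers here are made up, not taken from the paper), here is a minimal simulation sketch in Python of a study whose true effect is 0.2 standard errors, tested two-sided at the 0.05 level:

    import numpy as np

    rng = np.random.default_rng(0)

    true_effect = 0.2      # true effect in standard-error units (assumed for illustration)
    z_crit = 1.96          # two-sided critical value at alpha = 0.05
    n_sims = 100_000

    # Each simulated study yields an estimate ~ Normal(true_effect, 1),
    # i.e. the standard error is 1 by construction.
    estimates = rng.normal(true_effect, 1.0, size=n_sims)
    significant = np.abs(estimates) > z_crit

    print(f"power: {significant.mean():.3f}")
    print(f"mean |significant estimate| / true effect: "
          f"{np.abs(estimates[significant]).mean() / true_effect:.1f}")

    # With these numbers the power is only about 5 percent, and the estimates that
    # happen to reach significance overstate the true effect by roughly a factor of ten.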

On the other hand, large studies can be a huge waste of effort, so I don’t really know what I would recommend for medical research.

8 thoughts on “What is the point, if any, of retrospective power calculations?”

  1. "… I discussed the problem of Type M (magnitude) errors, where the true effect is small but it is estimated to be large…."

    If the true effect is smaller than the estimated effect, and therefore it turns out to be significant, then that's a Type I error. You can reduce the chance of that happening by requiring a smaller alpha; but retrospective power does not inform you further — it is just a monotone decreasing function of the P value. It certainly doesn't give you information about whether or not you over-estimated the effect. — Russ L.

  2. Russ,

    Yes, I certainly agree that you (and others) have pointed out the issue of Type M errors. The challenge is incorporating it into a formal framework. I don't think the framework of Type 1 and Type 2 errors is helpful.

  3. "One problem with underpowered studies is that, when they do turn up statistically significant results, they tend to be huge compared to the true effect sizes."

    I don't think I get this. Is the problem that with a small sample size you tend not to get a good estimate of variance? So if you find an effect it can be because the variance was estimated to be smaller than it actually is?

  4. Well, I definitely agree that it is *some* kind of error to look only at the asterisks and ignore the actual numbers. That seems to be a bad habit shared by a lot of people. If a formalization of that would help, then I'm all for it. But I'm still not convinced that power, which is inherently prospective, is the way to go. Maybe a Bayesian approach, conditioning in the other direction, would be less strained. At JSM, Ralph O'Brien talked about some related concepts, including something like a posterior probability of the null hypothesis given that it was rejected. I wasn't completely sold on some of the details, but there were some attractive ideas.

  5. The need to look at power to interpret the meaning of a "significant" result is obvious if you consider the extreme of a trial that provides no information at all – for instance, one in which the p-value is just generated at random from the uniform distribution between 0 and 1. Assuming a significance threshold of 0.05, it's clear that the power is 0.05 for any true value of the parameter. If you happen to get a "significant" result with such a trial (i.e., p-value less than 0.05), you actually have no evidence at all that the null hypothesis is false.

    This situation is approximated by trials with very small sample sizes, or very large variances in measurements.

    One can also see the need for a power calculation (retrospective, if necessary) just from considering the traditional justification for a hypothesis test – if the p-value is small, either the null hypothesis is false, or an unlikely event happened. For us to prefer the "null hypothesis false" option, the unlikely event needs to be substantially less unlikely when the null hypothesis is false. But that's not the case if the power is low. (See the numerical sketch after the comments.)

  6. I'll say it again. Retrospective power is a monotone decreasing function of the P value, such that retro power is about 50% when P = alpha. If a result is nonsignificant, retro power will always be low; if it is significant, retro power will always be high. In the situation you describe above, if you happen to get a "significant" result in spite of there being no effect, the retrospective power will be high. It adds no new information to an analysis. (The sketch after the comments shows this relationship numerically.)

    For more discussion, see Hoenig and Heisey, "The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis," The American Statistician, February 2001.

  7. Maybe we mean different things by "retrospective power analysis". Perhaps you mean a power analysis in which you find the power for an alternative that matches the estimate of the parameter from the actual data. I mean a power analysis that happens to be done after the data were gathered, using whatever effect size would be thought reasonable a priori.

  8. Russ: In my article with Weakliem linked above, we do the Bayesian approach, which I agree is the best way to go. But I also thought it useful to describe the problem in terms of power, since this is how many statisticians think.

    Radford: I agree with your comment. The extreme example is a good way to see the limitations of statistical significance as an inferential summary.
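
Two small numerical sketches (in Python) may help make the discussion above concrete. First, the point in comments 4 and 5: if you assume, purely for illustration, that half of the null hypotheses being tested are true, Bayes' rule gives the probability that the null is true given a "significant" result as a function of power. The 50/50 prior and the particular power values below are assumptions, not something taken from the comments.

    # P(null true | "significant" result) as a function of power,
    # assuming (purely for illustration) that half of all tested nulls are true.
    alpha = 0.05
    prior_null = 0.5

    for power in [0.05, 0.20, 0.50, 0.80]:
        p_sig_given_null = alpha     # chance of significance if the null is true
        p_sig_given_alt = power      # chance of significance if the alternative is true
        p_null_given_sig = (prior_null * p_sig_given_null) / (
            prior_null * p_sig_given_null + (1 - prior_null) * p_sig_given_alt
        )
        print(f"power = {power:.2f}: P(null true | significant) = {p_null_given_sig:.2f}")

    # At power = 0.05 (the no-information trial in comment 5), P(null | significant) = 0.50,
    # so the "significant" result carries no evidence at all; at power = 0.80 it is about 0.06.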
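
Second, the relationship described in comments 1 and 6: for a two-sided z-test (assumed here just for simplicity), retrospective power computed at the estimated effect is a monotone transformation of the p-value, and comes out at roughly one half when p equals alpha.

    from scipy.stats import norm

    alpha = 0.05
    z_crit = norm.ppf(1 - alpha / 2)   # two-sided critical value, about 1.96

    def observed_power(p):
        """Retrospective power of a two-sided z-test, evaluated at the observed effect."""
        z_obs = norm.ppf(1 - p / 2)    # |z| statistic implied by the two-sided p-value
        return norm.cdf(z_obs - z_crit) + norm.cdf(-z_obs - z_crit)

    for p in [0.20, 0.05, 0.01, 0.001]:
        print(f"p = {p:>5}: observed power = {observed_power(p):.2f}")

    # p = 0.05 gives observed power of almost exactly 0.50; smaller p-values give higher
    # observed power, so the calculation adds nothing beyond the p-value itself.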
