The work-until-it's-significant strategy

Ian Stevenson writes:

I'm a computational neuroscience grad student at Northwestern with Konrad Kording. I have a quick question for you if you have time... We do a lot of psychophysics experiments and have been thinking about a potential problem with methodology. Most psychophysics labs tend to run experiments until the result is significant (as opposed to fixing the number of subjects in advance like a clinical trial). We did a few simulations of that kind of work-until-it's-significant strategy and found that in some cases the reported effect size could be biased by ~40% and the reported p-values could be off by a factor of ~3.

Do you happen to know if this is a well known effect? I found a lot of work on publication bias, but this seems like it has more to do with how the experiments are run. We were thinking of writing up something short for the psychophysics community.
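For readers who want to see the effect for themselves, here is a minimal sketch of this kind of simulation. It is not the correspondent's code. It assumes a one-sample z-test with known unit variance, a significance check after every added subject once `min_n` subjects are in, and a hard cap at `max_n`; the function names and all parameter values are illustrative assumptions.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def run_until_significant(true_effect, rng, min_n=10, max_n=50, alpha=0.05):
    """Add one subject at a time; stop as soon as p < alpha or at max_n.

    Returns (significant, estimated_effect, n_used).
    """
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(true_effect, 1.0)
        if n >= min_n:
            mean = total / n
            # With unit variance, z = mean * sqrt(n).
            if two_sided_p(mean * math.sqrt(n)) < alpha:
                return True, mean, n
    return False, total / max_n, max_n

def summarize(true_effect, n_sims=20000, seed=1):
    """Return (fraction declared significant, mean estimate among those)."""
    rng = random.Random(seed)
    results = [run_until_significant(true_effect, rng) for _ in range(n_sims)]
    hits = [est for sig, est, _ in results if sig]
    rate = len(hits) / n_sims
    mean_sig_est = sum(hits) / len(hits) if hits else float("nan")
    return rate, mean_sig_est

if __name__ == "__main__":
    rate_null, _ = summarize(true_effect=0.0)
    print(f"null true: fraction declared significant = {rate_null:.3f} (nominal 0.05)")
    rate_alt, mean_est = summarize(true_effect=0.3)
    print(f"true effect 0.3: fraction significant = {rate_alt:.3f}, "
          f"mean significant estimate = {mean_est:.3f}")
```

Under the null, the fraction declared significant should come out well above the nominal 5%, and with a real effect the estimates from the "significant" runs should overshoot the true value. The exact size of both distortions depends on the assumed `min_n`, `max_n`, and effect size.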

Just the other day, I was talking about psychophysics with a colleague who was getting hassled because he was averaging subjective ratings, and some ignoramus thought there was some rule against treating ordinal data as numbers. I told my colleague that he should take confidence in the fact that researchers in psychometrics and psychophysics have been doing this for many decades.

Anyway, back to the original question: yes, I think this sort of thing is well known. It's not such an issue in my work because I don't do much with p-values, and, when I do, my p-values are defined based on the actual data collection strategy that was used (or our best approximation to it), so this problem wouldn't arise, at least not in its most serious form.

See here for more on multiple comparisons and p-values in neuroscience.

7 Comments


This is quite well known in statistics - I encountered the concept as an undergraduate.

You can of course deliberately have a sampling scheme that has a stopping rule that's related to significance... AS LONG AS you take the impact of the sampling scheme on your inference into account.

What you can't do (at least not and still imagine your analysis means anything) is use a stopping rule from ONE kind of sampling and apply the inferential procedures of a different kind of sampling.

It has been shown that you're guaranteed to get significance eventually, even if the null is true. Of course, Bayesian inference does not have this problem.

It is quite illegitimate to do the kind of analysis that the correspondent describes. Doing this is called "sampling to a foregone conclusion."

See

Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (Second Ed.). New York: Springer-Verlag. Section 7.7.

Note in particular Example 20, p. 507: Here, focus on the case where theta=0 (standard normal distribution). One sequentially draws samples from an N(0,1) distribution, and computes the running sample mean (x_1 + ... + x_j)/j, where j is the number of samples so far. We keep sampling and j gets larger and larger. The Law of the Iterated Logarithm guarantees (I believe almost surely) that for some j the absolute value of the sample mean will exceed k/sqrt(j) for any arbitrary fixed k. By adjusting k we can choose in advance any desired p-value, and by sampling long enough we can guarantee that we'll eventually get a p-value smaller than this. Then we will (illegitimately) reject the null hypothesis theta=0, even though it is in fact true.
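The setup in that example can be sketched numerically. The following is an illustration under assumptions, not code from Berger's book: it draws from N(0,1), stops the first time the absolute sample mean exceeds k/sqrt(j), and reports how often that happens within a given number of draws. The names `first_crossing` and `crossing_fraction`, the choice k = 1.96, and the horizons are all hypothetical.

```python
import math
import random

def first_crossing(k, max_draws, rng):
    """Draw from N(0,1) until |sample mean| > k / sqrt(j).

    Returns the first such j, or None if the boundary is not crossed
    within max_draws draws.
    """
    total = 0.0
    for j in range(1, max_draws + 1):
        total += rng.gauss(0.0, 1.0)
        # |total / j| > k / sqrt(j) is equivalent to |total| > k * sqrt(j).
        if abs(total) > k * math.sqrt(j):
            return j
    return None

def crossing_fraction(k, max_draws, n_runs=1000, seed=0):
    """Fraction of null runs that cross the boundary within max_draws draws."""
    rng = random.Random(seed)
    crossed = sum(first_crossing(k, max_draws, rng) is not None
                  for _ in range(n_runs))
    return crossed / n_runs

if __name__ == "__main__":
    k = 1.96  # corresponds to a nominal two-sided p-value of about 0.05
    for horizon in (1, 10, 100, 1000):
        print(f"max draws {horizon:5d}: fraction crossing = "
              f"{crossing_fraction(k, horizon):.3f}")
```

With a single draw the crossing fraction should sit near the nominal level, and it should climb as the horizon grows, which is the finite-sample face of the "eventually" in the argument above. The approach to 1 is slow, so a simulation of this size only shows the trend.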

A short, nonmathematical discussion of some of these issues can be found in

Berger, J. O., & Berry, D. A. (1988). Statistical analysis and the illusion of objectivity. American Scientist, 76, 159-165.

The stopping rule is absolutely critical in any frequentist analysis. Using a stopping rule that is incompatible with the experimental protocol is an invitation to disaster.

Other useful papers include:

Berger, J. O., & Delampady, M. (1987). Testing precise hypotheses. Statistical Science, 2, 317-352.

Berger, J. O., & Sellke, T. (1987). Testing of a point null hypothesis: The irreconcilability of significance levels and evidence. Journal of the American Statistical Association, 82, 112-122.

Yes, it's well known, and an unfortunate side effect of the use of p-values as evidence. On the size of the bias thus induced, see Sellke, Bayarri and Berger http://www.stat.duke.edu/~berger/papers/99-13.html

Richard, Bill: Certainly, Bayes Factors avoid the problem of sampling to a foregone conclusion, and p-values do not. But this doesn't mean that all frequentist analysis suffers the same fate, or that all Bayesian inference avoids it.

As mentioned above, the stopping rule should match the analysis. Within the frequentist framework, there is nothing wrong with sampling until significance or a fixed sample size is reached, as long as the results are properly adjusted (e.g., Ghosh, Mukhopadhyay, & Sen, 1997, Sequential Estimation). Sometimes using these types of designs can be a huge time saver.
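One simple way to make such an adjustment concrete is to calibrate the significance cutoff to the stopping rule by simulation. The sketch below is an assumption-laden illustration, not the method of the cited book: it assumes a z-test with known unit variance, a fixed schedule of interim looks, and a single constant cutoff on |z| at every look (a Pocock-style boundary), chosen so that the overall error rate under the null is about `alpha`. All names and the look schedule are hypothetical.

```python
import math
import random

def max_abs_z(looks, rng, effect=0.0):
    """Largest |z| seen over the planned interim looks in one simulated experiment."""
    total, n, worst = 0.0, 0, 0.0
    for look in looks:
        while n < look:
            total += rng.gauss(effect, 1.0)
            n += 1
        # With unit variance, z = total / sqrt(n).
        worst = max(worst, abs(total) / math.sqrt(n))
    return worst

def calibrate_threshold(looks, alpha=0.05, n_sims=20000, seed=0):
    """Monte Carlo estimate of the constant |z| cutoff with overall error alpha."""
    rng = random.Random(seed)
    maxima = sorted(max_abs_z(looks, rng) for _ in range(n_sims))
    return maxima[int(math.ceil((1 - alpha) * n_sims)) - 1]

def error_rate(looks, cutoff, n_sims=20000, seed=1):
    """Fraction of null experiments that cross the cutoff at any look."""
    rng = random.Random(seed)
    return sum(max_abs_z(looks, rng) > cutoff for _ in range(n_sims)) / n_sims

if __name__ == "__main__":
    looks = (10, 20, 30, 40, 50)  # assumed schedule: test after every 10 subjects
    cutoff = calibrate_threshold(looks)
    print(f"calibrated cutoff on |z|: {cutoff:.2f}")
    print(f"error rate using 1.96 at every look: {error_rate(looks, 1.96):.3f}")
    print(f"error rate using calibrated cutoff:  {error_rate(looks, cutoff):.3f}")
```

The calibrated cutoff should come out noticeably stricter than 1.96, and using it at every look should bring the overall false-positive rate back near the nominal level. The price is a somewhat larger expected sample size when the effect is small, which is the trade-off sequential designs make in exchange for stopping early when the effect is large.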

"thought there was some rule against treating ordinal data as numbers....".

For comparing two groups, it doesn't make much difference, but try using ordinal data in regressions and then implementing the results. Problems quickly become apparent. If you're dealing with them as responses you get out of range predictions, and if you're using them as predictors, you get weird results. Additivity goes out the window when you shift populations (for those of us that do global work...).

It isn't covered in typical statistics courses, but there is a whole area of study called "representational measurement theory". It is easier to just analyze the numbers, though.

Recent Comments

  • bill r: "thought there was some rule against treating ordinal data as read more
  • Jared Smith: As mentioned above the stopping rule should match the analysis. read more
  • Freddy: Richard, Bill: Certainly, Bayes Factors avoid the problem of sampling read more
  • Jonathan: Yes, it's well known, and an unfortunate side effect of read more
  • Bill Jefferys: It is quite illegitimate to do the kind of analysis read more
  • Richard D. Morey: It has been shown that you're guaranteed to get significance read more
  • GB: This is quite well known in statistics - I read more