Is significance testing all bad?

Dan Goldstein quotes J. Scott Armstrong:

About two years ago, I [Armstrong] was a reasonable person who argued that tests of statistical significance were useful in some limited situations. After completing research for “Significance tests harm progress in forecasting” in the International Journal of Forecasting, 23 (2007), 321-327, I have concluded that tests of statistical significance should never be used.

Here’s a link to Armstrong’s paper, and here’s a link to his rejoinder to the discussion.

My thoughts:

It has been rare that I’ve found significance tests to be useful, but when I have, it has been as a way to get a sense of the ways in which a model does not fit the data and to suggest where the model can be improved; see Chapter 6 of Bayesian Data Analysis.
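For concreteness, here is a minimal sketch of the kind of model check I have in mind (a toy example with made-up data, not code from the book): fit a normal model to skewed data, simulate replicated datasets from the posterior, and use the tail probability of a test statistic as a rough measure of misfit.

```python
# Toy posterior predictive check: a p-value-like tail probability used to
# see where a normal model fails to fit skewed data. Made-up data; this is
# only a sketch of the idea.
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=100)   # the data are actually skewed
n = len(y)

# Posterior draws for (mu, sigma^2) under a normal model with a flat prior:
# sigma^2 | y ~ Inv-chi^2(n-1, s^2),  mu | sigma^2, y ~ N(ybar, sigma^2/n)
ybar, s2 = y.mean(), y.var(ddof=1)
n_sims = 2000
sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_sims)
mu = rng.normal(ybar, np.sqrt(sigma2 / n))

# Replicate datasets and compare a test statistic (here, the sample maximum)
T_obs = y.max()
T_rep = np.array([rng.normal(m, np.sqrt(v), size=n).max()
                  for m, v in zip(mu, sigma2)])
p_value = (T_rep >= T_obs).mean()
print(f"posterior predictive p-value for max(y): {p_value:.3f}")
```

A small tail probability here flags that the normal model cannot reproduce the long right tail of the data, which is the useful, model-improving role I have in mind for such tests.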

For a specific example in which I found significance tests useful, see Section 2.6 of our new book. I emailed Armstrong and am interested to see whether he agrees that significance testing was appropriate in that case. I suppose I agree that, ultimately, confidence intervals and effect-size estimates would be appropriate even in this example, but the significance testing was relatively simple and clear, so I was happy with it.

I was also reminded that the difference between “significant” and “not significant” is not itself statistically significant.
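To illustrate with made-up numbers (assuming two independent, approximately normal estimates): one estimate can clear the conventional two-standard-error threshold while another does not, yet the difference between them is nowhere near significant.

```python
# Made-up numbers illustrating that "significant" vs. "not significant"
# does not imply a significant difference between the two estimates.
from math import sqrt

est_a, se_a = 25.0, 10.0           # z = 2.5 -> conventionally "significant"
est_b, se_b = 10.0, 10.0           # z = 1.0 -> "not significant"

diff = est_a - est_b               # 15.0
se_diff = sqrt(se_a**2 + se_b**2)  # about 14.1, assuming independence
z_diff = diff / se_diff            # about 1.06 -> not close to significant

print(f"z_a = {est_a / se_a:.2f}, z_b = {est_b / se_b:.2f}, "
      f"z_diff = {z_diff:.2f}")
```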

1 thought on “Is significance testing all bad?”

  1. I think a big problem with null hypothesis significance tests (NHSTs) is that they are over-used and over-emphasised.

    The idea that they are "unnecessary even when properly done and properly interpreted" sounds controversial unless you believe that virtually any analytic tool could be substituted by an alternative in any specific application. For example, virtually any significance test could be replaced by graphs or descriptive statistics (and the graph could be just as informative or misleading as the NHST).

    I'm increasingly fond of CIs because they are much more flexible and informative than NHSTs. Nevertheless, it can be difficult to construct an appropriate CI in some situations (e.g., repeated measures designs), and CIs can also be misinterpreted. Standardized effect size can be useful but is often problematic because it conflates the magnitude of an effect with its variability (typically using a very poor or inappropriate estimate of that variability) and can be more misleading than an NHST (a small numeric sketch of this point appears after this comment).

    My own view is that we need to be more flexible about the tools for analyzing and reporting data. We also have to accept that the choice of CIs, NHSTs, graphs, tables, descriptive statistics, information criteria etc. is in part a rhetorical one. I'm also not convinced that meta-analysis doesn't use NHSTs as nearly every meta-analysis implicitly or explicitly rejects the hypothesis that the aggregate effect size is zero (typically using a CI).

    (I was also slightly puzzled by the idea that NHSTs take up space in publications; most alternatives would take up equivalent space, or much more in the case of graphical methods.)

    I can't muster any enthusiasm for the idea that merely getting rid of NHSTs would, in itself, be at all useful. Switching to effect sizes has not (as yet) enhanced the quality of much published work, mostly because inappropriate effect size statistics are used and/or because the authors have no idea what the effect size statistic they use means for their study.
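A small numeric sketch of the commenter's point about standardized effect sizes (made-up numbers, using Cohen's d with a simple pooled standard deviation): the same raw mean difference looks large or small depending entirely on the variability estimate.

```python
# Made-up numbers: the same 2.0-unit difference in means gives very different
# standardized effect sizes depending on the variability estimate, which is
# the conflation described in the comment above.
from math import sqrt

def cohens_d(mean_diff, sd1, sd2):
    """Cohen's d using a simple pooled SD (assumes equal group sizes)."""
    pooled_sd = sqrt((sd1**2 + sd2**2) / 2)
    return mean_diff / pooled_sd

print(cohens_d(2.0, sd1=2.0, sd2=2.0))   # 1.0  -> conventionally "large"
print(cohens_d(2.0, sd1=8.0, sd2=8.0))   # 0.25 -> conventionally "small"
```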
