A psychology journal (Basic and Applied Social Psychology) has recently caused a bit of a stir by banning p-values from its published articles. For what it’s worth, here are a few views on the journal’s new policy, and on the use of p-values and confidence intervals in empirical research.
On the misuse and problems caused by p-values
First off, I certainly have sympathy with what I presume is the journal editor’s concern, which is that p-values can be misused and have a harmful effect on the conduct of research and the pursuit of improved knowledge. As has been discussed at length over many decades, if one uses p-values as the sole criterion for deciding which results are important, one ends up with many problems. Small studies with low power may wrongly conclude, on the basis of a non-significant p-value, that no effect/association is truly present. Conversely, very large studies will often find statistically significant results which are not scientifically significant, the latter judgement of course requiring subject-specific contextual knowledge. These and other issues which follow from focusing solely on p-values led medical journals (e.g. BMJ) to encourage or mandate that authors place much greater emphasis on the magnitude of estimated effects and on confidence intervals. Moreover, if journals only publish papers with statistically significant results, a large proportion of these are likely to be false positives, and the estimated effects may be biased upwards. This is all eminently reasonable and true.
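To illustrate that last point, here is a minimal simulation sketch (my own illustration, with assumed numbers, not anything from the journal or editorial): when a small true effect is studied with low power, the subset of studies that happen to reach significance will, on average, overestimate the effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2015)

# Assumed illustrative setup: a small true mean difference, studied with
# small samples so that power is low.
true_effect = 0.2     # true standardised mean difference
n_per_arm = 25        # small study
n_studies = 10000

est_effects = []
pvalues = []
for _ in range(n_studies):
    treat = rng.normal(true_effect, 1.0, n_per_arm)
    control = rng.normal(0.0, 1.0, n_per_arm)
    t, p = stats.ttest_ind(treat, control)
    est_effects.append(treat.mean() - control.mean())
    pvalues.append(p)

est_effects = np.array(est_effects)
sig = np.array(pvalues) < 0.05

print("Power (proportion significant):  ", sig.mean())
print("Mean estimate, all studies:      ", est_effects.mean())
print("Mean estimate, significant only: ", est_effects[sig].mean())
```

Under these assumed numbers, the average estimate across all studies is close to the true value, but the average among the statistically significant studies alone is substantially inflated, which is the essence of the concern about only publishing significant results.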
The journal’s ban
The Basic and Applied Social Psychology journal has banned the “null hypothesis significance testing procedure” in articles that it publishes. In the recently published editorial, a number of quite bold claims are made in support of the ban.
Null hypothesis tests are invalid
The first is that “the null hypothesis significance testing procedure (NHSTP) is invalid”. I take this to mean that the procedure is poor when viewed as a method for deciding when observed statistically significant effects are ‘true’. On this there is John Ioannidis’ now famous paper, and more recently, a nice article by David Colquhoun, looking at the false discovery rate and the conditional bias of effect estimates when one uses p-values alone to declare when an effect is present. This demonstrates that when the prevalence of true effects is at a given (arguably realistic) level, the proportion of statistically significant results which correspond to real effects is disappointingly low. But obviously, within the statistical framework in which significance tests are defined, and under the assumed conditions, they are valid in the sense in which they are intended to be: if one tests a null hypothesis which is really true, the chance of obtaining a p-value less than 0.05 is 0.05.
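The false discovery rate argument can be illustrated with a simple back-of-the-envelope calculation. The prevalence, power and significance level below are assumed illustrative values broadly in the spirit of Colquhoun’s example, not figures taken from any particular study.

```python
# Illustrative false discovery rate calculation with assumed inputs.
prevalence = 0.10   # proportion of tested hypotheses where a real effect exists
power = 0.80        # probability a real effect gives a significant result
alpha = 0.05        # significance threshold

true_positives = prevalence * power          # 0.08
false_positives = (1 - prevalence) * alpha   # 0.045

fdr = false_positives / (true_positives + false_positives)
print(f"False discovery rate: {fdr:.0%}")    # roughly 36%
```

So even with decent power, if real effects are relatively rare among the hypotheses being tested, a disappointingly large share of ‘significant’ findings are false positives.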
What about confidence intervals?
In their recently published editorial, the editors explain that confidence intervals are also problematic because “a 95% confidence interval does not indicate that the parameter of interest has a 95% probability of being within the interval”. This is of course correct. However, if one repeatedly performs studies and considers the results of each (i.e. imagine we do not suffer from publication bias), then, assuming the statistical models being used are correctly specified, 95% of the confidence intervals will cover their respective true population parameters. The editors then explain that “confidence intervals do not provide a strong case for concluding that the population parameter of interest is likely to be within the stated interval”. Since frequentist confidence intervals are not Bayesian credible intervals, as the editors explain, one should not interpret a confidence interval as if it were a credible interval. However, my personal view is that the 95% frequentist interval is still an extremely useful tool. It gives a measure of the statistical uncertainty in the parameter estimate, and gives a plausible range for the population value. Of course the interval may be too narrow because it ignores various uncertainties which are not allowed for in the model (biases due to measurement error, missing data, etc.), but I would much rather see an estimate accompanied by a confidence interval than without one.
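The repeated-sampling interpretation is easy to check by simulation. The sketch below (with assumed values for the true mean, standard deviation and sample size) constructs the usual 95% t-interval for a mean many times and records how often it covers the true value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Assumed setup: repeatedly sample from a normal population with a known
# true mean, and check how often the standard 95% t-interval covers it.
true_mean, sd, n, n_studies = 5.0, 2.0, 30, 10000

covered = 0
for _ in range(n_studies):
    sample = rng.normal(true_mean, sd, n)
    half_width = stats.t.ppf(0.975, df=n - 1) * sample.std(ddof=1) / np.sqrt(n)
    lower, upper = sample.mean() - half_width, sample.mean() + half_width
    covered += (lower <= true_mean <= upper)

print("Empirical coverage:", covered / n_studies)  # close to 0.95
```

The empirical coverage comes out close to 95%, as the frequentist construction promises, even though no single interval can be said to contain the parameter with 95% probability.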
Are any inferential statistical procedures required?
The editors’ final question asks whether any inferential statistical procedures are required in articles submitted to the journal. Their answer is: “No, because the state of the art remains uncertain”. This seems a bizarre answer to me. Estimates of quantities of interest from empirical studies are subject to various uncertainties. The logic seems to be that because no one can agree on the optimal inferential approach, one should abandon quantitative measures of uncertainty, such as those given by p-values and confidence intervals. This seems a very strange position to adopt.
The editors encourage authors to present “strong descriptive statistics, including effect sizes”, and “larger sample sizes”. Both are certainly a good idea. But the reader of such an article, absent any confidence intervals or p-values, would, I expect, struggle to infer how strong the evidence is for the effects (whatever they might be) which are observed in the experiment or study that has been conducted. This will particularly be the case for analyses which do anything more complicated than comparing means or reporting correlations. For example, if someone has used a linear mixed model to analyse some repeated measures data, and only presents the parameter estimates, I for one would struggle to judge how precise those estimates are from descriptive statistics of the sample size and the variability of the data alone.
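As a concrete (and entirely hypothetical) illustration, here is a minimal sketch using simulated repeated measures data and the statsmodels MixedLM implementation. Alongside the point estimate of the time effect, the fitted model provides a 95% confidence interval, which is exactly the information about precision that a table of estimates and descriptive statistics alone would not convey. The data, model and parameter values are all assumptions for illustration, not anyone’s actual analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Hypothetical repeated-measures data: 30 subjects, 4 time points each,
# with a subject-specific random intercept and a modest time effect.
n_subjects, n_times = 30, 4
subject = np.repeat(np.arange(n_subjects), n_times)
time = np.tile(np.arange(n_times), n_subjects)
subject_effect = rng.normal(0, 1.0, n_subjects)[subject]
y = 2.0 + 0.3 * time + subject_effect + rng.normal(0, 1.0, len(subject))
data = pd.DataFrame({"y": y, "time": time, "subject": subject})

# Random-intercept linear mixed model; conf_int() gives 95% intervals,
# conveying the precision that a point estimate alone cannot.
fit = smf.mixedlm("y ~ time", data, groups=data["subject"]).fit()
print("Estimated time effect:", fit.params["time"])
print("95% confidence interval:")
print(fit.conf_int().loc["time"])
```

A reader given only the point estimate of the time effect, plus sample sizes and standard deviations of the raw data, would have a hard time reconstructing anything like that interval for themselves.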
My view
I would certainly support requiring authors not to confine their attention to statistical significance (p-values) in their analyses and reports, which is perhaps a particularly prevalent issue in psychology journals. Further, wider appreciation, by both journals and researchers, of the problems induced when only analyses which give statistically significant results get published is certainly needed, and the papers by Ioannidis and Colquhoun are well worth a read on this.
But getting rid of p-values and confidence intervals doesn’t seem to me to be the right solution. Instead, as medical journals have been encouraging for the last 30 years, we should see more use of confidence intervals and critical consideration of the scientific significance of estimated effects. Further, we must be keenly aware of the problems of publication bias and multiple testing, and try to move to a position where the research which is published represents an increasingly large proportion of the research being conducted. In the clinical trials arena this aim is hopefully increasingly being achieved through the use of trial registers.
Also of interest here is an opinion piece published yesterday from a number of members of the Royal Statistical Society on the journal’s editorial.