Statistical Misunderstandings

(July 21st, 2016) Victor Spoormaker of the Bavarian Academy of Sciences and Humanities' Young Scholars Programme analyses the recent replication crisis in neuroscience and finds lessons for other life sciences. 

The reproducibility of neuroscientific research findings is so low that some researchers are talking about a replication crisis in neuroscience. The true extent of the problem is hard to gauge, but the available figures, ranging from preclinical research in medicine to cognitive research in human subjects, are dire: the share of experimental findings that could be replicated varies from 11% to 50%, depending on the discipline. This is not far from the original estimate by the Stanford professor Ioannidis that most published research findings are false. Yet in scientific practice it is business as usual, and a p-value of 0.05 still receives the same importance as ever, even though p-values just under 0.05 are the best predictor of a failure to replicate. This has to stop: scientists cannot outsource statistics and should stop believing in the magical border of 0.05.

Statisticians have long pointed to the limitations of our current way of significance testing (also referred to as null-hypothesis significance testing), and the seminal paper ‘Why most published research findings are false’ by John Ioannidis in 2005 certainly placed the topic of replicability and false positive results back into the mainstream of medicine and neuroscience (1). Empirical evidence that systematically examines reproducibility rates is now slowly increasing, and initial replication efforts do not bode well. In 2011, researchers from the pharmaceutical company Bayer published their internal efforts to replicate preclinical experimental findings in medicine, covering oncology, cardiovascular disease and women’s health (2). They reported that they could reproduce the findings of 21% of all experiments, with an additional 11% in which part of the results could be reproduced. A similar analysis by the pharmaceutical company Amgen of preclinical experiments in oncology observed an even lower reproducibility rate of 11% (3).

There are no empirical data yet for preclinical experiments in neuroscience, but one analysis strongly suggests such results may be similar (or worse). Key to reporting a solid effect is that one’s sample size is large enough to detect an expected effect of a given size; the probability of doing so is referred to as statistical power. If an expected effect is very large, small groups suffice to have a reasonable chance of finding the true population effect in one’s sample, and for smaller effects larger samples are required. A reasonable chance is typically considered 80% (but could as well be 70% or 90%), which means that if you set out to find an effect of a given size with your planned sample size, and the effect truly exists, you will have an 80% chance of detecting it. If you have smaller samples, this chance drops. In an analysis of frequently used experiments in preclinical neuroscience, for which the effect size was estimated through meta-analyses, it turned out that the median power was just 18 to 31% (for sex differences on the water and radial maze, respectively) (4). This is concerning as it shows that sample sizes are much too low, which at first sight mostly makes us concerned about not detecting true effects (false negatives).
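The relation between effect size, sample size and power can be made concrete with a few lines of code. The following is a minimal sketch, not the method used in (4): it assumes a two-sided comparison of two equally sized groups and uses a normal approximation instead of the exact t distribution. The function name `approx_power` and the example effect size of 0.5 standard deviations are illustrative choices.

```python
from statistics import NormalDist


def approx_power(effect_size_d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Assumption: normal approximation with equal group sizes;
    effect_size_d is the standardized mean difference.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic if the effect is real
    shift = effect_size_d * (n_per_group / 2) ** 0.5
    # Probability that the statistic lands beyond either critical value
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


if __name__ == "__main__":
    for n in (10, 20, 64, 100):
        print(f"n per group = {n:3d}  power = {approx_power(0.5, n):.2f}")
```

Under these assumptions, a medium effect needs roughly 64 subjects per group to reach about 80% power, whereas with 10 subjects per group the chance of detecting the same effect is only around one in five.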

Yet then comes the issue of bias (1). One example is having too much (and undisclosed) flexibility in the analyses. This means that you analyse data for other purposes and/or in other ways than initially planned until you find a significant effect somewhere. Another example is positive publication bias in the field. In neuroscience, journals certainly prefer to publish positive findings: around 85% of published findings are positive and null findings are the exception (5). If such biases are present, and the field consists of many small, underpowered studies instead of fewer adequately powered studies, this results in a relatively high incidence of published false positives. And even if small studies detect a true effect, they might overestimate the effect size. There is always random variability from sample to sample, particularly in small samples (6): some samples will show an effect that lies below the true effect (and are less likely to yield significant p-values and get published) and others will show an effect above the true effect (and are more likely to yield significant p-values and get published). This means that the effect size estimates from the meta-analyses are likely inflated and the true effects could very well be smaller, indicating that the statistical power may actually be lower than the already low observed power (4).
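This inflation of effect sizes in small, selectively published studies can be illustrated with a small simulation. The sketch below is an illustration under simplifying assumptions, not a reanalysis of any cited data: the true effect (0.3 standard deviations), the group size (15), the known standard deviation of 1 and the rule that only positive significant results are "published" are all hypothetical choices.

```python
import random
from statistics import NormalDist, mean


def simulate_published_effects(true_d=0.3, n_per_group=15,
                               n_studies=5000, alpha=0.05, seed=1):
    """Simulate many small studies and keep the significant ones.

    Returns (mean effect over all studies,
             mean effect over significant studies,
             share of studies that were significant).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the group difference, assuming a known sd of 1
    se = (2 / n_per_group) ** 0.5
    all_effects, significant = [], []
    for _ in range(n_studies):
        treated = [rng.gauss(true_d, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        observed = mean(treated) - mean(control)
        all_effects.append(observed)
        # Hypothetical publication filter: positive and significant only
        if observed / se > z_crit:
            significant.append(observed)
    return mean(all_effects), mean(significant), len(significant) / n_studies


if __name__ == "__main__":
    avg_all, avg_sig, share = simulate_published_effects()
    print(f"mean effect, all studies:         {avg_all:.2f}")
    print(f"mean effect, significant studies: {avg_sig:.2f}")
    print(f"share significant:                {share:.2f}")
```

With these settings a study only reaches significance if its observed effect is more than twice the true effect, so the average "published" effect is necessarily a substantial overestimate, while the average over all studies stays close to the true value.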

The two reports by Bayer and Amgen estimated the size of the problem but did not provide data or transparency on why most experiments could not be replicated. More insight into this issue came from a recently published large-scale replication effort (7) by the Open Science Collaboration, in which 100 psychological studies published in 2008 (both cognitive and social psychological studies) were replicated by multiple teams of researchers (270 authors in total). The replication teams contacted the authors of the original studies to obtain original materials, published the study protocol before conducting the study and ensured that the replication experiments had high statistical power (average overall power of 92%). Replication was assessed in multiple ways, for instance whether significance was obtained again, whether the 95% confidence interval of the replication result included the original effect, or whether the teams subjectively reported having replicated the experimental findings (yes/no).

The first main finding of this replication effort was that the effect sizes in the replication studies were, in general, about half the size of those in the original studies (Fig 1). The percentage of positive effects with significant results was 97% in the original studies and 36% in the replication studies. Around half of the original effects fell within the 95% confidence intervals of the replication results; 39% of studies were subjectively reported to have been replicated. Numbers were better for cognitive psychological than for social psychological studies: around 50% versus 25%, respectively (using the replication criterion of a significant p-value). Interestingly, the p-values of the nonreplicated studies varied widely and the distribution of nonreplicated effect sizes was almost zero-centered.

This large-scale replication effort also assessed many characteristics of the original and replicated work, such as the importance of the result (e.g., citations), the surprisingness of the result, the experience and expertise of the team, and the effect size and p-value of the result, among others. No single variable could explain replication success, and neither the perceived importance of the effect nor the expertise of the original or replication teams appeared to have any effect on replication rates. Instead, the p-value of the original effect, as well as its size, was predictive of replication success. P-values just under 0.05 did not replicate well: p-values between 0.04 and 0.05 had a meagre score of 18% (2 out of 11 findings were significant again), and p-values between 0.02 and 0.04 had a score of 26% (6 of 23). By contrast, studies with a p-value <.001 had a replication success (again defined as having a significant p-value in the replication) in 20 out of 32 studies: 63%. This provides some initial empirical data for statisticians’ claims that we should not blindly trust a p-value of 0.05. (Of course we should not depend on any arbitrary p-value, and we had better forget about null-hypothesis significance testing altogether and instead use Bayesian statistics, but before the whole field is there, we might want to start with being a bit stricter.)

What do these numbers tell us about neuroscientific research in human subjects, e.g. with neuroimaging methods? One thing to keep in mind is that cognitive neuroscience extends such cognitive scientific studies as mentioned above by adding neuroimaging techniques such as functional magnetic resonance imaging (fMRI). Replication rates can be expected to be similar, with two factors that may pull replication rates down: lower sample sizes and common misunderstandings about multiple test correction with fMRI. We can be brief about lower sample sizes: the same paper as mentioned before about preclinical neuroscientific studies also examined neuroimaging research with human subjects and reported a median power of a mere 8% (4). This is absurdly low, probably due to the costs and time-investments associated with even a minor fMRI study. Worse is that it is too often not appreciated that with a regular fMRI analysis you can easily be testing 200,000 voxels. This does not mean that you need to correct for 200,000 test decisions, e.g. by dividing your alpha by 200,000, as these voxels are correlated with each other. FMRI data points are typically smoothed to get rid of outliers and strengthen ‘true’ activation, and in this procedure a voxel value is replaced by the mean value of all voxels around it (roughly a sphere with a given radius). But also after smoothing, one can still assume that there are independent elements in the brain data, also referred to as resolution elements (resels), which can be estimated from the average smoothness in the data. These can be used to correct for the multiple test issue, and other procedures have been proposed and evaluated as well (8).
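The scale of the multiple test problem is easy to underestimate, so a small numerical sketch may help. It deliberately assumes independent tests, which, as noted above, is not true for neighbouring voxels; the resel count of 500 is a purely hypothetical number, and dividing alpha by it is only a crude stand-in for the procedures reviewed in (8).

```python
def familywise_error_rate(n_tests, alpha):
    """Chance of at least one false positive among n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def expected_false_positives(n_tests, alpha):
    """Expected number of false positives if no true effect exists."""
    return n_tests * alpha


if __name__ == "__main__":
    n_voxels = 200_000   # number of voxels mentioned in the text
    n_resels = 500       # hypothetical number of independent elements

    # Uncorrected voxel-wise threshold of p < 0.001
    print("expected false-positive voxels:",
          expected_false_positives(n_voxels, 0.001))

    # Familywise error rate grows quickly with the number of tests
    for n in (1, 14, 100):
        print(n, "tests -> FWER", round(familywise_error_rate(n, 0.05), 3))

    # Per-test thresholds that keep the familywise rate near 0.05
    print("alpha per voxel (too strict):", 0.05 / n_voxels)
    print("alpha per resel (crude):     ", 0.05 / n_resels)
```

Even under these simplified assumptions, an uncorrected threshold of p < 0.001 leaves around 200 voxels expected to pass by chance alone, and with as few as 14 independent tests at 0.05 the chance of at least one false positive already exceeds 50%.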

However, only about 60% of all published fMRI work uses some form of multiple test correction (9), with almost a third of these studies failing to specify how. But even if a multiple test correction is used correctly, one is back to the 0.05 level in the best case, a level that might still be too lenient anyway. So in fMRI research current statistical flaws are amplified, and often not well understood. Add to this that you can analyse activity, functional connectivity (seeds, ICA, whole brain), effective connectivity and other novel measures, and that for one test on the group level you can have multiple post-hoc contrasts (to pick the one where you find something), and the problem becomes even larger. It may not be obvious that uncorrected thresholds can result in multiple large clusters popping up randomly, as in any Gaussian random field (Fig 2), and p-values per voxel can look impressive if you forget about the sheer number of tests. Maybe 1-2 years of basic, inferential and fMRI-statistics might be a good start for groups doing their first fMRI study, although the temptation to lower the thresholds to have a nice image for a great story is sometimes also not resisted by more experienced groups. FMRI has received a lot of criticism lately from neuroscientists and even journalists unfamiliar with the technique, but there is nothing intrinsically wrong with the technique itself and relatively decent statistical correction procedures exist. They are just not used; the main problem of fMRI is at the level of users who too often do not appreciate that they are looking at statistical maps and not activity maps.

FMRI studies are not the only studies that are hampered by statistical error, but small statistical misunderstandings cause disproportionately large effects in fMRI studies. At the same time, multiple test correction gets attention and is openly debated in the neuroimaging field; it is possible that in other disciplines where multiple tests are performed (e.g. preclinical work that tests for between-group differences with multiple behavioral experiments) this issue is not yet visible enough. Other methodological and statistical flaws are also common in studies that employ simple or traditional techniques. For instance, a recent analysis showed that half of all intervention studies published in high-impact journals failed to apply the correct statistical test, which in some cases led to serious misrepresentations of the data (10). It’s safe to say that, in biology and medicine, methodological and statistical training needs much more attention at the predoctoral level, starting with classical descriptive and inferential statistics and introducing some Bayesian statistics. But learning about research methods and statistics, in addition to new techniques and working hypotheses in the field, simply takes time and cannot always be fitted into a research master’s degree or the first year(s) of a PhD. Insufficient statistical training, e.g. following a two-day workshop in some statistical program that allows you to click through and generate hundreds of bivariate correlations without understanding the multiple test problem, could actually exacerbate the false positive problem more than having no statistical training at all (and therefore consulting a statistician).

Yet outsourcing the problem to statisticians, as sometimes required by grant applications in which one has to specify which statisticians helped with a power analysis, may not be the best solution. After all, statisticians are typically not working in the lab, they can be consulted or not, they can be listened to or not, and they can surely not overrule a PI when results are nice but probably false. Alternatively, statistical consulting could occur during a journal’s review process, but most journals are unlikely to pay for services that 1) one of the reviewers regularly happens to perform for free and 2) would cause them to publish more solid and less exciting results. Most original research papers in neuroscience employ some form of statistics, making an understanding of statistics critical for reading the literature, assessing effects and getting your own effects straight. It may therefore be more helpful to see statistics as a core academic skill: not just a soft skill such as grant writing, management or presenting, but as hard a skill for life scientists as programming is for computer scientists.

How do we increase the signal compared to the noise in published work? Of course, by better training and more understanding of statistics, by performing more highly powered studies, by not citing doubtful work, by letting PhD students compute a t-test by hand for once, by publishing confidence or credible intervals, etc. But it might take years, if not decades, before the majority has left the current convenient way of doing science with the current incentives. Little does it matter that you are generating noise when you are actually rewarded for it. Yet one recent analysis showed that it may not be that hard to change the current biased-for-false-positives system. We just have to be stricter.

Valen Johnson from Texas A&M University developed a way to compare classical p-value testing (also referred to as the frequentist approach, since the focus is on the probability of a given test result) with Bayes factors, and noted that a p-value of around 0.05 corresponds to a Bayes factor between 3 and 5 (11), which is typically considered weak evidence to support one hypothesis over another. He argued that this may be the core reason for irreproducible results, and that the problem can easily be solved by using stricter statistical thresholds, such as p<0.005 or p<0.001. These thresholds provide much more convincing evidence for a given hypothesis over another (and would automatically require researchers to use larger sample sizes). Two years after he published his calculation, the large-scale replication project supported his analysis with the initial empirical data mentioned above: replication rates ranged from 18% for a p-value just under 0.05 to 63% for a p-value under 0.001 (7).
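To get a feeling for this correspondence, the sketch below converts a p-value threshold into a Bayes factor. It rests on an assumption that should be checked against (11): that for a one-sided z-test the evidence threshold relates to the critical z-value as Bayes factor = exp(z²/2). Other tests and two-sided thresholds give somewhat different numbers, and the function name is illustrative.

```python
from math import exp
from statistics import NormalDist


def bayes_factor_for_p(p_one_sided):
    """Bayes factor matched to a one-sided z-test threshold.

    Assumption: the relation BF = exp(z^2 / 2), where z is the
    critical value belonging to the one-sided p-value threshold.
    """
    z = NormalDist().inv_cdf(1 - p_one_sided)
    return exp(z * z / 2)


if __name__ == "__main__":
    for p in (0.05, 0.005, 0.001):
        print(f"p < {p:<6} ~ Bayes factor {bayes_factor_for_p(p):.1f}")
```

Under this assumption, p<0.05 lands inside the range of 3 to 5 mentioned above, while p<0.005 and p<0.001 correspond to Bayes factors above 25 and above 100, respectively, which illustrates why the stricter thresholds count as much stronger evidence.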

Or in the words of Valen Johnson: "Finally, it is important to note that this high rate of nonreproducibility is not the result of scientific misconduct, publication bias, file drawer biases, or flawed statistical designs; it is simply the consequence of using evidence thresholds that do not represent sufficiently strong evidence in favor of hypothesized effects." Time to stop believing in a magical p-value of 0.05. And time to stop trusting research that does.

Victor Spoormaker

Picture: Hudson

1. Ioannidis JPA. Why Most Published Research Findings Are False. PLoS Med. 2005;2(8):e124.
2. Prinz F, Schlange T, Asadullah K. Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov. 2011;10(9):712-.
3. Begley CG, Ellis LM. Drug development: Raise standards for preclinical cancer research. Nature. 2012;483(7391):531-3.
4. Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci. 2013;14(5):365-76.
5. Fanelli D. "Positive" Results Increase Down the Hierarchy of the Sciences. PLoS ONE. 2010;5(4):e10068.
6. Halsey LG, Curran-Everett D, Vowler SL, Drummond GB. The fickle P value generates irreproducible results. Nat Meth. 2015;12(3):179-85.
7. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251).
8. Nichols T, Hayasaka S. Controlling the familywise error rate in functional neuroimaging: a comparative review. Statistical Methods in Medical Research. 2003;12(5):419-46.
9. Carp J. The secret lives of experiments: Methods reporting in the fMRI literature. NeuroImage. 2012;63(1):289-300.
10. Nieuwenhuis S, Forstmann BU, Wagenmakers E-J. Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci. 2011;14(9):1105-7.
11. Johnson VE. Revised standards for statistical evidence. Proceedings of the National Academy of Sciences. 2013;110(48):19313-7.


This essay first appeared in German in Laborjournal 7-8/2016.

Last Changes: 08.23.2016