# Bad Statistics or Bad Practice?

#### by Steven D. Buckingham *Labtimes* 04/2016

Software packages used to analyse fMRI data are very sensitive to the appropriate setting of statistical parameters.

Bashing of functional magnetic resonance imaging (fMRI) is experiencing something of a surge. First, there was the Dead Salmon scandal, in which a dead Atlantic salmon was asked to perform an emotional valence test and areas of brain activation were found (http://prefrontal.org/files/posters/Bennett-Salmon-2009.pdf). Then in 2003, Nichols and Hayasaka took some published fMRI data, and found that most of them had not made some basic statistical corrections for multiple comparisons (Statistical Methods in Medical Research 12, 419-44). When they went ahead and did the corrections on the original data for themselves, only eight of the 11 datasets had any significant volume elements (voxels) left at all. And don’t forget the reverse inferencing issue, where it was shown that a goodly portion of fMRI papers committed the major logical fallacy of effectively “begging the question” in their analysis.

Okay, you may say, so there are some off-papers out there, so what? One bad apple doesn’t spoil the bunch. Caveat Investigator.

But it could well be that the problem goes deeper than that. A recent paper by Anders Eklund and colleagues at Linköping University, Sweden, has cast doubt on some of the most popular analysis packages used across the fMRI community (PNAS, 113, 28, 7900-05). Their analysis suggests that in certain cases, these packages can produce a false positive error rate of around 50%.

To understand the problem, we have to look for a moment at the way fMRI is done. Typically, a subject is placed in a scanner and a resting state measurement of brain activity is made. Then the subject performs a task and the levels of activity across the brain, or perhaps in a predetermined “region of interest”, are measured, based on the blood-oxygen-level-dependent (BOLD) signal. Once you have got this signal, you have to do some heavy-duty statistics to sort out the signal from the noise. There are several ways of doing this. At the crudest level, you can set a threshold, above which you will accept a signal as having some meaning. More sophisticated analyses involve looking for areas of the brain whose activity is statistically different from resting background levels or statistically correlated with some parameter of the task. This is usually done using a software package, some of the most popular of which are SPM (“Statistical Parametric Mapping”), FSL (which includes the FLAME1 routine) and AFNI (which includes 3dttest and 3dMEMA). These packages require a certain amount of statistical competence but the authors have gone to great pains to put in routines that take care of the most important procedures, such as correcting for multiple testing. The authors of these packages are not stupid; they know their stats and are committed to sound analysis.
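To see why correcting for multiple testing matters so much, here is a minimal sketch in Python. It assumes, purely for illustration, that every voxel is tested independently at the same threshold; real voxels are spatially correlated, which is why the packages rely on more elaborate corrections. The function names and the voxel counts are made up for this example.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the familywise error rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    # Hypothetical numbers of voxels tested in one analysis.
    for n_voxels in (1, 10, 100, 100_000):
        uncorrected = familywise_error_rate(0.05, n_voxels)
        per_voxel = bonferroni_threshold(0.05, n_voxels)
        print(f"{n_voxels:>7} tests: uncorrected error rate {uncorrected:.3f}, "
              f"Bonferroni per-test threshold {per_voxel:.2e}")
```

Under these assumptions, a five per cent threshold applied to a hundred voxels already makes at least one false positive almost certain, which is the hole that the salmon swam through.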

But we are, I hope, empiricists at heart. So Eklund put these packages to the experimental test. The results are not comforting. He took advantage of the large sets of publicly available imaging data, and divided up the control subjects randomly as if they were the experimental groups. Then he simply did standard statistical tests on the resting activity in these controls, to see how many false positives the programmes threw up. Of course, these are controls, so there should be no difference between them and we should expect a false positive error rate of about five per cent.
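The logic of this null experiment can be sketched in a few lines of Python. This is a toy version under stated assumptions, not Eklund's actual pipeline: each "subject" is reduced to a single number drawn from a standard normal distribution, the group sizes and the number of repeats are arbitrary, and the critical value 2.024 is the two-sided five per cent cut-off of Student's t with 38 degrees of freedom, so it only fits the default of 40 subjects.

```python
import random
import statistics


def two_sample_t(a, b):
    """Student's t statistic for two independent samples (pooled variance)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


def null_false_positive_rate(n_experiments=2000, n_subjects=40, t_crit=2.024, seed=0):
    """Split pure-noise 'controls' into two fake groups and count significant results."""
    rng = random.Random(seed)
    half = n_subjects // 2
    hits = 0
    for _ in range(n_experiments):
        # Every subject is a control: there is no real group difference.
        controls = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        rng.shuffle(controls)  # random assignment to the two pretend groups
        if abs(two_sample_t(controls[:half], controls[half:])) > t_crit:
            hits += 1
    return hits / n_experiments


if __name__ == "__main__":
    print(f"False positive rate under the null: {null_false_positive_rate():.3f}")
```

When the assumptions of the test hold, as they do by construction here, the rate should land close to the nominal five per cent. Eklund's point is that real fMRI data run through cluster-wise inference did not behave this way.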

Medical imaging specialist Anders Eklund obtained shockingly high false positive error rates during analysis of fMRI data with common software packages. Photo: Linköping University

He told the programmes to use either cluster-wise inference (a sensitive method of finding faint signals spread over groups of neighbouring voxels) or voxel-wise inference (analysing point by point). Shockingly, at a cluster-forming threshold of p=0.01 (the default setting for FSL), all the tools except FLAME1 threw between 15 and 50% false positives. Admittedly, this is a liberal threshold and, as you would expect, tightening it to 0.001 (the default setting for SPM) reduced the false positives. But they still persisted above 10% in about half of the “experiments”. The exception was FLAME1, which had about 10-20% false positive rates using a cluster-forming threshold of 0.01, but when you set it at 0.001 the error rate was well below 10%, suggesting there is a danger of false negatives. In other words, analyses that should throw up a spurious finding only five per cent of the time did so in 10 to 50% of the comparisons.

What is the problem here? Eklund had a look at the assumptions behind the packages’ analyses. Just like the more familiar parametric tests we use in the lab every day, the fMRI packages assume that the underlying data and the noise follow, at least approximately, simple distributions that can be described by a small set of parameters. When he looked at the data, he found that the test statistics (the z or t values) didn’t actually deviate much from the expected null distribution, so no problem there. The exception was FLAME1, whose test values had a much lower variance than the theoretical distribution, which might explain why it was both insensitive and robust against false positives.

The root of the problem, it seems, lies in the assumptions made about the way brain signals correlate over space. Random field theory makes it easier to use parametric statistics and there are many good reasons for wanting to do this. However, it assumes that the spatial correlation between signals follows a squared exponential (that is, a Gaussian-shaped function of distance) and is constant over the brain. When you look at these correlations directly, it turns out that brain regions tend to be correlated in different ways, which means that some regions have a natural predisposition to form statistical clusters.
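Why spatial correlation matters for cluster statistics can be illustrated with a toy simulation. The sketch below is a one-dimensional caricature, not random field theory and not Eklund's analysis: it smooths pure noise along a line of "voxels" with a Gaussian kernel, thresholds it at z > 2.33 (roughly p = 0.01, one-sided) and records the largest run of supra-threshold voxels. All names and parameter values are assumptions chosen for illustration.

```python
import math
import random


def gaussian_kernel(sigma):
    """Gaussian smoothing weights, scaled so that smoothed white noise keeps unit variance."""
    radius = int(4 * sigma) + 1
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-radius, radius + 1)]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


def smooth(noise, kernel):
    """Circular convolution of a 1-D noise map with the kernel."""
    radius = len(kernel) // 2
    n = len(noise)
    return [sum(kernel[j] * noise[(i + j - radius) % n] for j in range(len(kernel)))
            for i in range(n)]


def largest_cluster(values, threshold):
    """Length of the longest run of consecutive values above the threshold."""
    best = run = 0
    for v in values:
        run = run + 1 if v > threshold else 0
        best = max(best, run)
    return best


def mean_largest_cluster(sigma, n_voxels=2000, n_maps=20, threshold=2.33, seed=1):
    """Average size of the largest chance cluster in pure-noise maps of a given smoothness."""
    rng = random.Random(seed)
    kernel = gaussian_kernel(sigma) if sigma > 0 else [1.0]  # sigma 0 means no smoothing
    total = 0
    for _ in range(n_maps):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_voxels)]
        total += largest_cluster(smooth(noise, kernel), threshold)
    return total / n_maps


if __name__ == "__main__":
    for sigma in (0.0, 1.5, 3.0):
        print(f"sigma = {sigma}: mean largest chance cluster = {mean_largest_cluster(sigma):.1f} voxels")
```

The smoother the noise, the larger the clusters that appear by chance alone. A cluster-size cut-off calibrated for one level of smoothness will therefore be too lenient wherever the data are actually more strongly correlated.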

Neuroimaging authority Karl Friston (placed third in the recent Lab Times ranking “Basic Neuroscience”) developed SPM. He re-analysed Eklund’s datasets with more appropriate settings for smoothing and cluster detection.

This looks bad, so Lab Times contacted Karl Friston at University College London, the creator of SPM, and Jean Daunizeau at the Brain and Spine Institute, Paris, a major contributor. “Eklund tested random field theory with incorrect calibration,” says Daunizeau. “Their analysis used insufficient smoothing and the cluster-forming threshold was too low.”

Indeed, Friston issued a rebuttal paper (http://arxiv.org/pdf/1606.08199.pdf) that re-analysed the same dataset as Eklund’s but using more appropriate values for smoothing and cluster detection. All the problems went away and the expected five per cent false positives were obtained. The important point here is that these parameters were not chosen to get the result the authors wanted, but were guided in a principled way by an understanding of the underlying theory (random field theory) with its assumptions and limitations.

There is an important lesson to be learned here: fMRI analysis packages require considerable care in their use, and they are to be used with an understanding of the underlying physics and statistics. Friston makes the point pacifically: “We did not consider Eklund to be critical of SPM. Rather their contribution is to highlight the failings of inference based on random field theory when distributional assumptions are violated (e.g., violations of the good lattice assumption or using inappropriately low cluster forming thresholds). If people follow good practice, SPM provides (approximately) valid inference.”

Last Changed: 30.08.2016