25 Aug 2016

Task-based functional MRI has opened a window to the mind, allowing researchers to observe patterns of brain activity as people process information. In the Alzheimer’s field, the technology has helped researchers identify early functional deficits due to mild cognitive impairment or genetic risk factors. Because of this, some clinical trials for neurodegenerative disease are evaluating task fMRI as an outcome measure. A paper published online June 28 in the Proceedings of the National Academy of Sciences, however, called into question the standard analytical software used to evaluate these scans and, by extension, the validity of many past fMRI studies. How big is the problem?

Researchers led by Anders Eklund at Linköping University, Sweden, tested the three most common software packages against data from real brain scans, and found that under certain conditions, the software had as much as a 70 percent chance of producing at least one false positive result. The results demonstrate the importance of validating statistical methods against real data, the authors warn.

Sharing of fMRI data from large studies made this analysis possible, and the authors called for researchers to routinely open access to their raw data. In recent years, there has been a trend toward greater openness in research, which some have suggested could improve reproducibility (see Jul 2015 news; Mar 2016 news; Jul 2016 news).

In their PNAS paper, Eklund and colleagues noted that some 40,000 fMRI papers published over the last 20 years could potentially be affected. Journalists picked this up, with many stories implying the software flaw invalidated all of that literature (see stories from Science Alert, Wired, and Motherboard). However, in a July 6 blog post, second author Thomas Nichols at the University of Warwick, U.K., revised the figure downward, estimating that about 3,500 papers had used the questionable analytic methods and could contain at least some false positives. At the authors’ request, PNAS published an erratum on August 16 amending the paper’s wording by removing references to the exact number of papers affected. Eklund declined to speak with Alzforum.

Imaging researchers Alzforum spoke with agreed that while not a disaster for the field, the problem is a real one and needs to be fixed. “This is an important and positive study for the field … [It] is going to force the neuroimaging community to use rigorous methods for reporting results,” Prashanthi Vemuri at the Mayo Clinic in Rochester, Minnesota, wrote to Alzforum (see full comment below).

Robert Cox at the National Institute of Mental Health, Bethesda, Maryland, who is the principal developer for Analysis of Functional NeuroImages (AFNI), one of the software packages in question, told Alzforum he is working on a correction for the issue. “I don’t think [the finding] calls into question all of fMRI, but it does call into question weak results,” he noted.

All in the Method.

The same set of data produced a 20-40 percent error rate when run with lenient parameters for cluster size (top), 10-20 percent with stricter parameters (middle), and 5 percent or less using voxel-based inference (bottom). The black bar indicates the desired target of 5 percent. *[Courtesy of Eklund et al., PNAS.]*

Mining Data to Find Cracks
Eklund and colleagues had previously reported excessive (i.e., more than 5 percent) false positives in task fMRI analyses of single participants when using the most popular software package, Statistical Parametric Mapping (SPM) (see Eklund et al., 2012). To see if this problem occurred in larger data sets and other software packages, the authors made use of publicly available resting-state fMRI data from the 1000 Functional Connectomes Project, which includes data from participants in the United States, China, and Finland. Because the test data was all resting-state, it contained no task-related brain activations. Therefore, any positive result, where activity in different brain regions appeared to vary in unison over the time course of the scan, would by definition be false.

The authors ran hundreds of thousands of analyses on this data, using several task fMRI software tools from the SPM, FMRIB Software Library (FSL), and AFNI packages. All three packages consist of open-source algorithms developed by various academic or government institutions in the United States and United Kingdom. These analyses looked for correlations in brain activity between regions. At the common significance threshold of 0.05, false positives should occur 5 percent of the time. However, because fMRI analyses compare so many data points, 5 percent could represent a huge number of false positives. Large genome-wide association studies (GWAS) face the same problem, since they make thousands of comparisons that would lead to hundreds of false positives at the 5 percent significance level. To avoid this, GWAS use Bonferroni correction and similar statistical methods to set more stringent standards. The fMRI software tools are likewise meant to correct for multiple comparisons, such that the overall chance of at least one false positive remains only 5 percent.
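To make the arithmetic concrete, here is a minimal Python sketch of how the chance of at least one false positive grows with the number of tests, and how a Bonferroni-style threshold pulls it back down. It assumes the tests are independent, which is not true of neighboring voxels in real fMRI data, and the function names are illustrative, not taken from SPM, FSL, or AFNI.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    alpha = 0.05
    for n_tests in (1, 10, 100, 1000):
        uncorrected = familywise_error_rate(alpha, n_tests)
        per_test = bonferroni_threshold(alpha, n_tests)
        corrected = familywise_error_rate(per_test, n_tests)
        print(f"{n_tests:5d} tests: uncorrected {uncorrected:.3f}, "
              f"Bonferroni-corrected {corrected:.3f}")
```

Under the independence assumption, the uncorrected rate approaches certainty as the number of tests grows, while the corrected rate stays just under the 5 percent target.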

Many of the programs did not meet this standard. Their success at correcting the data varied greatly depending on the particular software tool and the parameters used, the authors found. Programs that corrected the data by comparing individual voxels worked well, producing less than 5 percent false positives. However, those that used clusters of voxels and a low threshold for cluster size routinely gave odds of 30 percent or more for at least one false positive, spiking up to 70 percent in some analyses (see image above). In other words, many fMRI findings from past studies could be spurious.
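The quantity at stake, the share of null analyses that yield at least one positive, can be estimated by simulation. The toy sketch below is not the authors' pipeline: it substitutes independent synthetic Gaussian noise for resting-state scans and uses simple per-voxel thresholds rather than cluster inference, and all names and numbers are hypothetical.

```python
import random
from statistics import NormalDist


def empirical_fwer(n_analyses: int, n_voxels: int,
                   alpha_per_voxel: float, seed: int = 0) -> float:
    """Fraction of pure-noise analyses with at least one 'significant' voxel."""
    rng = random.Random(seed)
    # Two-sided critical z value for the chosen per-voxel threshold.
    z_crit = NormalDist().inv_cdf(1.0 - alpha_per_voxel / 2.0)
    analyses_with_hit = 0
    for _ in range(n_analyses):
        # Every voxel is noise, so any value beyond z_crit is a false positive.
        if any(abs(rng.gauss(0.0, 1.0)) > z_crit for _ in range(n_voxels)):
            analyses_with_hit += 1
    return analyses_with_hit / n_analyses


if __name__ == "__main__":
    n_analyses, n_voxels = 2000, 50
    uncorrected = empirical_fwer(n_analyses, n_voxels, 0.05)
    corrected = empirical_fwer(n_analyses, n_voxels, 0.05 / n_voxels)
    print(f"uncorrected: {uncorrected:.3f}")
    print(f"Bonferroni:  {corrected:.3f}")
```

A correction method passes this kind of check when the estimated rate lands near the nominal 5 percent. The paper's point is that this must be verified on real data, where noise is spatially correlated in ways that synthetic noise like this does not capture.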

Why were these results so far off? The authors noted that the software tools make two assumptions about the data that are inaccurate. One is that the data should follow a standard curve, with voxels near each other highly correlated, and those farther apart less so. In reality, distant voxels correlate by chance more often than would be expected, producing a phenomenon called a “long tail” in the fMRI data. This was first reported nearly a decade ago, and it may be a property of the imaging technology itself, as it occurs even when scanning oxygen fluctuations in a volume of water (see Discover blog). The second erroneous assumption is that the noise, or random correlations, in the data remains constant across brain regions. Instead, noise varies by region.

Cox told Alzforum that, based on his own analysis of the data since the Eklund paper came out, making the first assumption translates into a large effect on false positives, and the second one a moderate effect. Mark Jenkinson at the University of Oxford, U.K., who is the principal developer of the FSL software, noted that researchers have long known the assumptions did not perfectly fit the data. However, no one realized how large the errors were until Eklund’s analysis became public, Jenkinson wrote to Alzforum. In the past, software was usually validated against computer-generated data. “This way of using large data sets of resting-state fMRI will help all of us in software development to devise even better tests for our software,” he wrote (see full comment below).

How Reliable Are Past Findings?
Even so, this validation exercise does not mean every paper that used cluster-based inference with small cluster size produced an unacceptable level of false positives, researchers stressed. They agreed that strong results, with p values far under 0.05, would be unaffected. Likewise, any study that has been independently repeated—and many have been—can be trusted, Jenkinson noted. Without a detailed analysis of each potentially affected paper, no one knows exactly how many studies the findings call into question, researchers said.

And How About Dementia Studies?
In the neurodegeneration field, researchers mostly use resting-state fMRI, which is not affected by the software flaw. Resting-state data maps brain connectivity, and has been used to identify network alterations in people who carry risk genes for AD or frontotemporal dementia (see Dec 2010 news; Dec 2013 news; Jun 2015 news). Resting-state fMRI can also measure the effects of amyloid on connectivity (see Feb 2011 conference news). All of these studies remain valid, researchers said.

While less common, some AD and FTD studies have employed task fMRI to look for changes in brain activation as disease advances (see Feb 2006 conference news). Several papers report activity differences in risk gene carriers (see Dec 2011 news; Jul 2015 news; Oct 2015 news). Other studies see subtle signs of functional impairment at early disease stages (see Nov 2011 conference news; Oct 2013 news; Sep 2014 news). The technique has been used to detect drug effects and is being considered for FTD trials (see Mar 2015 news; Apr 2016 conference news).

It is unclear how many of these studies might contain false positive results, researchers said. One problem is that many papers do not spell out exactly which statistical method was used. Of the studies mentioned above that did specify the method, about half used cluster-based analysis and half voxel-based analysis. The choice may depend on the particular imaging group. For one, Eric Reiman at Banner Alzheimer’s Institute, Phoenix, wrote to Alzforum that his group does not use cluster-based inference. Mayo’s Vemuri noted that, in AD studies, differences between patients and controls are pronounced; therefore, even if cluster-based inference is used, the software flaw may have little effect because results are typically highly significant. However, studies of mild cognitive impairment see subtler differences, and those might be more likely to contain spurious results, she predicted.

Researchers also pointed out that many other sources of error exist for task fMRI studies. For example, results can vary based on the mood of the participant and whether he or she has consumed caffeine. For this reason, some AD researchers steer clear of task fMRI. Few studies have examined the reliability of these data, but one such effort by Reisa Sperling and colleagues at Brigham and Women’s Hospital, Boston, reported good reproducibility in a small cohort with mild cognitive impairment (see May 2011 news).

Fix in the Works
What is the solution for the software errors? Eklund and colleagues suggest using a different cluster-based model to correct for false positives. Instead of the “parametric” method described above, which assumes a bell-shaped curve, they recommend the “permutation” method. This approach makes no assumptions about the shape of the data curve, but instead accomplishes its aims by brute force, repeating tests 1,000 times or more to find the false-positive rate. In the authors’ analysis, this method performed well, with error rates below 5 percent (see image above).
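As a rough illustration of the permutation idea, the Python sketch below runs a one-sample group test by randomly flipping the sign of each subject's data and recording the largest t statistic across voxels on each pass. This is a simplified voxel-wise max-statistic variant, not the cluster-based permutation procedure the authors evaluated and not code from any of the three packages; the data are synthetic, and every name and parameter is an assumption made for the example.

```python
import random
from math import sqrt


def t_stat(values):
    """One-sample t statistic against a mean of zero."""
    n = len(values)
    m = sum(values) / n
    var = sum((x - m) ** 2 for x in values) / (n - 1)
    return m / sqrt(var / n)


def simulate_group(n_subjects=20, n_voxels=30, effect=2.0, seed=1):
    """Synthetic data: pure noise everywhere, plus a true effect at voxel 0."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_voxels)]
            for _ in range(n_subjects)]
    for subject in data:
        subject[0] += effect
    return data


def max_t_permutation_test(data, n_perm=1000, alpha=0.05, seed=0):
    """Return (observed t per voxel, familywise-corrected t threshold)."""
    rng = random.Random(seed)
    n_voxels = len(data[0])
    observed = [t_stat([subj[v] for subj in data]) for v in range(n_voxels)]
    max_null = []
    for _ in range(n_perm):
        # Under the null, each subject's sign is arbitrary, so flip at random.
        signs = [rng.choice((-1.0, 1.0)) for _ in data]
        flipped = [[s * x for x in subj] for s, subj in zip(signs, data)]
        max_null.append(max(t_stat([subj[v] for subj in flipped])
                            for v in range(n_voxels)))
    max_null.sort()
    # Empirical (1 - alpha) quantile of the null distribution of the maximum.
    k = min(n_perm - 1, int(round((1.0 - alpha) * n_perm)))
    return observed, max_null[k]


if __name__ == "__main__":
    observed, threshold = max_t_permutation_test(simulate_group())
    significant = [v for v, t in enumerate(observed) if t > threshold]
    print(f"threshold t = {threshold:.2f}; significant voxels: {significant}")
```

The threshold comes from the shuffled data themselves rather than from an assumed bell-shaped curve, which is why this family of methods needs so many repeated passes.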

Cox said he has already implemented this in the latest AFNI release, as of July 18. The AFNI t-test tool now employs a permutation algorithm. Likewise, Jenkinson is making a permutation test the default for the next FSL release. Cox noted one disadvantage to the permutation method, however—because it has to run so many tests, it is not practical for more complex statistical functions, which can take a long time. For these, he is developing a new mathematical model that takes into account the long tail of the bell-shaped curve. He has a preliminary version of the tool that he hopes to release to the public soon, he told Alzforum.

It is unclear if SPM will follow suit with changes. Guillaume Flandin and Karl Friston at the Wellcome Trust Centre for Neuroimaging, University College London, who developed this software, wrote a June 28 online response to Eklund and colleagues’ findings. In it, they defended the use of parametric correction methods as long as stringent parameters are used, and suggested that Eklund’s analysis was flawed. They declined to speak with Alzforum.

Looking at the bigger picture, researchers agreed that the Eklund paper points to the need to validate statistical tests against real data. Cox suggested that future studies also include task fMRI data to test for false negatives, i.e., instances where real results are overlooked. When parameters are made more stringent to avoid false positives, the false negative rate rises. Ideally, researchers want to find a balance between the two.
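The trade-off can be put in numbers with a small Python sketch. It assumes a simple one-sided z-test and a hypothetical true effect expressed in standard-error units; real fMRI analyses are more involved, so this is only an illustration of the direction of the effect.

```python
from statistics import NormalDist


def false_negative_rate(alpha: float, effect_in_se: float) -> float:
    """Chance of missing a real effect in a one-sided z-test at level alpha.

    effect_in_se is the true effect divided by its standard error
    (a hypothetical value chosen for illustration).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha)
    return z.cdf(z_crit - effect_in_se)


if __name__ == "__main__":
    effect = 2.5  # assumed true effect, in standard-error units
    for alpha in (0.05, 0.01, 0.001):
        miss = false_negative_rate(alpha, effect)
        print(f"alpha = {alpha:<5}: false-negative rate = {miss:.2f}")
```

For a fixed effect size, each tightening of the threshold raises the share of real effects that go undetected.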

Commenters stressed the importance of data sharing. Cox pointed out that in addition to raw data, statisticians should also make public the scripts they used to analyze it. Because Eklund and colleagues did so, Cox was able to repeat their analyses and use the information to correct his own software.

Practical barriers to change exist, however. Many of the groups writing open-source software are small and have limited resources. AFNI’s development team consists of just five people, including Cox. Reiman noted that sharing data takes time and money. “The major challenge is how to incentivize and pay for folks to upload clean data, and how to maximize the value and productivity of the effort. A noble goal, but the devil is in the details,” Reiman wrote to Alzforum.—Madolyn Bowman Rogers

Comments

- Prashanthi Vemuri
  Mayo Clinic and Foundation
- Posted: 25 Aug 2016
This is an important and positive study for the field of neuroimaging. Though there are some flaws in the analyses, as pointed out by the SPM folks, the important points that this study brings forward need to be examined closely.

What does the study say?
There are two components of any image acquired—the noise and the signal (or the findings). The noise in the fMRI images is high and is due to several factors, such as physiological fluctuations, scanner noise, and measurement error. In any neuroimaging experiment, the detection of the signal (i.e., changes to the brain due to the disease process) from all the noise is defined as “findings” of the study. The study by Eklund et al. suggests that most of the studies may not have used appropriately rigorous methods for filtering out the noise. However, it is also important to note that the authors corrected their original statement and said that only about one-tenth of the studies, not all 40,000 as estimated in the paper, may have been conducted with faulty corrections. Some methods available are indeed lenient in filtering out noise, which may have led to faulty corrections.

What does it mean for the neuroimaging field?
This paper is going to force the neuroimaging community to use rigorous methods for reporting results. The quality of peer-reviewed neuroimaging publications will improve significantly, because reviewers are aware of these obvious flaws in methodologies. An important point to note is that over the years both the fMRI acquisition and processing methods have improved immensely and are aimed at minimizing the noise in fMRI signal. This will further propel interest in development of better fMRI acquisition, processing, and analysis methods.

What does this mean for AD fMRI studies?
Of the two subject groups we study, AD and MCI, AD studies are less likely to be affected by this controversy because the decrease in fMRI signal due to significant neurodegeneration in AD, compared to normal, is much larger than the noise, so noise is less likely to be detected as “findings.” However, in MCI subjects, fMRI studies have found both increases in signal (i.e., compensation) and decreases in signal compared to controls, which may have been partly due to differences in the lenient analysis methods.

- Mark Jenkinson
  University of Oxford
- Posted: 25 Aug 2016
A number of aspects of this recent paper have been somewhat sensationalized in various different ways. The first of these is the estimated number of papers that are truly affected, and this is something that the original authors wished to change, and they have now been able to submit an erratum.

A better estimate of the number of papers potentially affected is contained in a blog post by one of the authors, Tom Nichols. From this you can see that the original number of 40,000 is a vast overestimate and is no longer supported by the original authors.

Another issue that has not been clearly reported is that strong results (p-values a lot less than 0.05) would remain unaffected by this issue. Furthermore, any result that had been independently replicated on separate data can still be trusted. Such replication is crucial, not only because of the statistical issue raised in Eklund et al., but also because of other known issues, such as reporting bias of results just under p=0.05 versus ones just over 0.05, and the lack of corrections applied to analyses that are repeated with different parameter settings. Thankfully a great many studies and results over the last 15 years have in fact been replicated, and so the impact on the field is more minimal than has been reported and talked about.

Somewhat more problematically, it seems that there is still some section of the neuroimaging community publishing without applying any form of multiple comparison correction at all. Results from the Eklund et al. paper clearly show how bad this can be with respect to false positives, and the re-estimation of the number of papers where this happens is more worrying (about 13,000 from the above blog). Failure to apply any multiple comparison correction is well known to be problematic, and all of the major software packages insist on users applying multiple comparison correction to obtain valid statistics. It is possible that the majority of these papers are older, from the early days when reviewers may have been less well informed, but it still seems that this is, unfortunately, happening now as well. This is something that hopefully the Eklund et al. paper can help to eradicate.

As for software, we have always been aware that the corrections based on random-field theory are approximate, but this helps to put into context the degree of that approximation using modern, null data for testing. Software packages have all been tested in various ways in the past, but often with simulated/synthetic data in order to measure false-positive rates, with real data used to test sensitivity. This testing paradigm, of using large data sets of resting-state fMRI, will help all of us in software development to form even better tests for our software. In addition, it shows the benefits of permutation-based analysis, which is available in the major software packages such as FSL. Such permutation tests have been the default, or only, option in FSL for the analyses of other imaging modalities such as structural and diffusion data, and so we are planning to also make permutation tests the default for group-level analyses of task fMRI in our upcoming release.

References

Paper Citations

Correction for Eklund et al., Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. Proc Natl Acad Sci U S A. 2016 Aug 16;113(33):E4929. Epub 2016 Aug 8 PubMed.
Eklund A, Andersson M, Josephson C, Johannesson M, Knutsson H. Does parametric fMRI analysis with SPM yield valid results? An empirical study of 1484 rest datasets. Neuroimage. 2012 Jul 2;61(3):565-78. Epub 2012 Apr 10 PubMed.

Software Flaw Casts Doubt on Past Task fMRI Studies
