Task-based functional MRI has opened a window to the mind, allowing researchers to observe patterns of brain activity as people process information. In the Alzheimer’s field, the technology has helped researchers identify early functional deficits due to mild cognitive impairment or genetic risk factors. Because of this, some clinical trials for neurodegenerative disease are evaluating task fMRI as an outcome measure. A paper published online June 28 in the Proceedings of the National Academy of Sciences, however, called into question the standard analytical software used to evaluate these scans and, by extension, the validity of many past fMRI studies. How big is the problem?
Researchers led by Anders Eklund at Linköping University, Sweden, tested the three most common software packages against data from real brain scans, and found that under certain conditions, the software had as much as a 70 percent chance of producing at least one false positive result. The results demonstrate the importance of validating statistical methods against real data, the authors warn.
Sharing of fMRI data from large studies made this analysis possible, and the authors called for researchers to routinely open access to their raw data. In recent years, there has been a trend toward greater openness in research, which some have suggested could improve reproducibility (see Jul 2015 news; Mar 2016 news; Jul 2016 news).
In their PNAS paper, Eklund and colleagues noted that some 40,000 fMRI papers published over the last 20 years could potentially be affected. Journalists picked this up, with many stories implying the software flaw invalidated all of that literature (see stories from Science Alert, Wired, and Motherboard). However, in a July 6 blog post, second author Thomas Nichols at the University of Warwick, U.K., refined the estimate, calculating that about 3,500 papers had used the questionable analytic methods and could contain at least some false positives. At the authors’ request, PNAS published an erratum on August 16 amending the paper’s wording by removing references to the exact number of papers affected. Eklund declined to speak with Alzforum.
Imaging researchers Alzforum spoke with agreed that while not a disaster for the field, the problem is a real one and needs to be fixed. “This is an important and positive study for the field … [It] is going to force the neuroimaging community to use rigorous methods for reporting results,” Prashanthi Vemuri at the Mayo Clinic in Rochester, Minnesota, wrote to Alzforum (see full comment below).
Robert Cox at the National Institute of Mental Health, Bethesda, Maryland, who is the principal developer for Analysis of Functional NeuroImages (AFNI), one of the software packages in question, told Alzforum he is working on a correction for the issue. “I don’t think [the finding] calls into question all of fMRI, but it does call into question weak results,” he noted.
Mining Data to Find Cracks
Eklund and colleagues had previously reported excessive (i.e., more than 5 percent) false positives in task fMRI analyses of single participants when using the most popular software package, Statistical Parametric Mapping (SPM) (see Eklund et al., 2012). To see if this problem occurred in larger data sets and other software packages, the authors made use of publicly available resting-state fMRI data from the 1000 Functional Connectomes Project, which includes data from participants in the United States, China, and Finland. Because the test data was all resting-state, it contained no task-related brain activations. Therefore, any positive result, where activity in different brain regions appeared to vary in unison over the time course of the scan, would by definition be false.
The authors ran hundreds of thousands of analyses on this data, using several task fMRI software tools from the SPM, FMRIB Software Library (FSL), and AFNI packages. All three packages consist of open-source algorithms developed by academic or government institutions in the United States and United Kingdom. These analyses looked for correlations in brain activity between regions. At the common significance threshold of 0.05, false positives should occur 5 percent of the time. However, because fMRI analyses compare so many data points, 5 percent could represent a huge number of false positives. Large genome-wide association studies (GWAS) face the same problem, since they make thousands of comparisons that would lead to hundreds of false positives at the 5 percent significance level. To avoid this, GWAS use Bonferroni correction and similar statistical methods to set more stringent thresholds. The fMRI software tools likewise correct for multiple comparisons, such that the overall chance of at least one false positive should remain only 5 percent.
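To see why correction matters, consider a minimal simulation (an illustration, not any of the packages discussed in the article): with thousands of simultaneous tests and no true effects anywhere, an uncorrected 0.05 threshold virtually guarantees at least one false positive, while a Bonferroni correction keeps the family-wise chance near 5 percent. All numbers here (10,000 tests, 200 repeats) are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tests = 10_000        # e.g., voxels (fMRI) or SNPs (GWAS) tested at once
alpha = 0.05
n_experiments = 200     # repeat the whole analysis to estimate the error rate

# Simulate pure-null data: no true effects, so p-values are uniform on [0, 1]
# and every "significant" result is by definition a false positive.
pvals = rng.uniform(size=(n_experiments, n_tests))

# Uncorrected: fraction of experiments with at least one false positive.
fwe_uncorrected = np.mean((pvals < alpha).any(axis=1))

# Bonferroni: test each comparison at alpha / n_tests instead.
fwe_bonferroni = np.mean((pvals < alpha / n_tests).any(axis=1))

print(fwe_uncorrected)  # ~1.0: essentially every experiment has a false hit
print(fwe_bonferroni)   # ~0.05: family-wise error held near 5 percent
```

The same logic underlies the fMRI tools' corrections; Eklund's finding was that their particular correction, unlike Bonferroni here, relied on assumptions that real brain data violate.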
Many of the programs did not meet this standard. Their success at correcting the data varied greatly depending on the particular software tool and the parameters used, the authors found. Programs that corrected the data by comparing individual voxels worked well, producing less than 5 percent false positives. However, those that used clusters of voxels and a low threshold for cluster size routinely gave odds of 30 percent or more for at least one false positive, spiking up to 70 percent in some analyses (see image above). In other words, many fMRI findings from past studies could be spurious.
Why were these results so far off? The authors noted that the software tools make two inaccurate assumptions about the data. One is that spatial correlations in the data follow a bell-shaped (Gaussian) curve, with voxels near each other highly correlated, and those farther apart much less so. In reality, distant voxels correlate by chance more often than this model predicts, producing a “long tail” in the spatial autocorrelation of the fMRI data. This was first reported nearly a decade ago, and it may be a property of the imaging technology itself, as it occurs even when scanning oxygen fluctuations in a volume of water (see Discover blog). The second erroneous assumption is that the noise, or random correlations, in the data remains constant across brain regions. Instead, noise varies by region.
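The gap between the two curve shapes can be sketched numerically. Below, a hypothetical long-tailed autocorrelation (a Gaussian plus a small, slowly decaying exponential term; the 0.9/0.1 mixture and decay lengths are illustrative assumptions, not fitted values from the paper) retains noticeable correlation at distances where the assumed Gaussian model says it should be essentially zero.

```python
import numpy as np

d = np.linspace(0, 20, 201)               # distance between voxels, in mm
fwhm = 6.0                                 # a typical assumed smoothness
sigma = fwhm / (2 * np.sqrt(2 * np.log(2)))

# Bell-shaped (Gaussian) autocorrelation assumed by the parametric methods.
acf_gauss = np.exp(-d**2 / (2 * sigma**2))

# Hypothetical long-tailed alternative: mostly Gaussian, plus a small
# slowly decaying exponential component (illustrative parameters).
acf_long = 0.9 * acf_gauss + 0.1 * np.exp(-d / 10.0)

# At 20 mm the Gaussian model predicts essentially zero correlation,
# while the long-tailed curve still shows roughly 1 percent correlation.
print(acf_gauss[-1], acf_long[-1])
```

Because cluster-based inference asks how likely large contiguous blobs are under the assumed correlation structure, even this small residual long-range correlation makes big chance clusters far more common than the Gaussian model predicts.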
Cox told Alzforum that, based on his own analysis of the data since the Eklund paper came out, the first assumption accounts for a large share of the excess false positives, and the second for a moderate share. Mark Jenkinson at the University of Oxford, U.K., who is the principal developer of the FSL software, noted that researchers have long known the assumptions did not perfectly fit the data. However, no one realized how large the errors were until Eklund’s analysis became public, Jenkinson wrote to Alzforum. In the past, software was usually validated against computer-generated data. “This way of using large data sets of resting-state fMRI will help all of us in software development to devise even better tests for our software,” he wrote (see full comment below).
How Reliable Are Past Findings?
Even so, this validation exercise does not mean every paper that used cluster-based inference with small cluster size produced an unacceptable level of false positives, researchers stressed. They agreed that strong results, with p values far under 0.05, would be unaffected. Likewise, any study that has been independently repeated—and many have been—can be trusted, Jenkinson noted. Without a detailed analysis of each potentially affected paper, no one knows exactly how many studies the findings call into question, researchers said.
And How About Dementia Studies?
In the neurodegeneration field, researchers mostly use resting-state fMRI, which is not affected by the software flaw. Resting-state data maps brain connectivity, and has been used to identify network alterations in people who carry risk genes for AD or frontotemporal dementia (see Dec 2010 news; Dec 2013 news; Jun 2015 news). Resting-state fMRI can also measure the effects of amyloid on connectivity (see Feb 2011 conference news). All of these studies remain valid, researchers said.
While less common, some AD and FTD studies have employed task fMRI to look for changes in brain activation as disease advances (see Feb 2006 conference news). Several papers report activity differences in risk gene carriers (see Dec 2011 news; Jul 2015 news; Oct 2015 news). Other studies see subtle signs of functional impairment at early disease stages (see Nov 2011 conference news; Oct 2013 news; Sep 2014 news). The technique has been used to detect drug effects and is being considered for FTD trials (see Mar 2015 news; Apr 2016 conference news).
It is unclear how many of these studies might contain false positive results, researchers said. One problem is that many papers do not spell out exactly which statistical method was used. Of the studies mentioned above that did specify the method, about half used cluster-based and half voxel-based analysis. The choice may depend on the particular imaging group. For one, Eric Reiman at Banner Alzheimer’s Institute, Phoenix, wrote to Alzforum that his group does not use cluster-based inference. Mayo’s Vemuri noted that, in AD studies, differences between patients and controls are pronounced; therefore, even if cluster-based inference is used, the software flaw may have little effect because results are typically highly significant. However, studies of mild cognitive impairment see subtler differences, and those might be more likely to contain spurious results, she predicted.
Researchers also pointed out that many other sources of error exist for task fMRI studies. For example, results can vary based on the mood of the participant and whether he or she has consumed caffeine. For this reason, some AD researchers steer clear of task fMRI. Few studies have examined the reliability of these data, but one such effort by Reisa Sperling and colleagues at Brigham and Women’s Hospital, Boston, reported good reproducibility in a small cohort with mild cognitive impairment (see May 2011 news).
Fix in the Works
What is the solution for the software errors? Eklund and colleagues suggest using a different cluster-based model to correct for false positives. Instead of the “parametric” method described above, which assumes a bell-shaped curve, they recommend the “permutation” method. This approach makes no assumptions about the shape of the data’s distribution; instead it works by brute force, randomly shuffling the data 1,000 times or more to build an empirical null distribution against which real results are judged. In the authors’ analysis, this method performed well, with error rates below 5 percent (see image above).
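A minimal sketch of the idea, not the actual implementation in FSL, SPM, or AFNI: in a two-group comparison, group labels are exchangeable under the null, so shuffling them many times and recording the maximum statistic across all voxels yields an empirical null that controls family-wise error with no distributional assumptions. The function name, array shapes, and toy data below are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def max_stat_permutation(data, labels, n_perm=1000):
    """Hypothetical sketch: FWE-corrected p-values via max-statistic permutation.

    data:   (n_subjects, n_voxels) array of per-subject measurements
    labels: (n_subjects,) boolean group membership (e.g., patient vs. control)
    """
    def group_diff(lab):
        return data[lab].mean(axis=0) - data[~lab].mean(axis=0)

    observed = group_diff(labels)

    # Under the null, labels are exchangeable: shuffle them and record the
    # maximum absolute statistic across all voxels on each permutation.
    max_null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(labels)
        max_null[i] = np.abs(group_diff(perm)).max()

    # Comparing each voxel to the distribution of permutation *maxima*
    # is what controls the family-wise error rate across all voxels.
    return (max_null[None, :] >= np.abs(observed)[:, None]).mean(axis=1)

# Pure-null example: two random groups, 50 "voxels" of noise.
data = rng.standard_normal((40, 50))
labels = np.zeros(40, dtype=bool)
labels[:20] = True
p = max_stat_permutation(data, labels, n_perm=500)
print((p < 0.05).sum())  # typically 0 voxels survive, as they should
```

The brute-force cost Cox describes is visible here: every permutation reruns the full analysis, which is cheap for a t-test-style contrast but expensive for complex models.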
Cox said he has already implemented this in the latest AFNI release, as of July 18. The AFNI t-test tool now employs a permutation algorithm. Likewise, Jenkinson is making a permutation test the default for the next FSL release. Cox noted one disadvantage to the permutation method, however—because it has to run so many tests, it is not practical for more complex statistical functions, which can take a long time. For these, he is developing a new mathematical model that takes into account the long tail of the bell-shaped curve. He has a preliminary version of the tool that he hopes to release to the public soon, he told Alzforum.
It is unclear if SPM will follow suit with changes. Guillaume Flandin and Karl Friston at the Wellcome Trust Centre for Neuroimaging, University College London, who developed this software, wrote a June 28 online response to Eklund and colleagues’ findings. In it, they defended the use of parametric correction methods as long as stringent parameters are used, and suggested that Eklund’s analysis was flawed. They declined to speak with Alzforum.
Looking at the bigger picture, researchers agreed that the Eklund paper points to the need to validate statistical tests against real data. Cox suggested that future studies also include task fMRI data to test for false negatives, i.e., instances where real results are overlooked. When parameters are made more stringent to avoid false positives, the false negative rate rises. Ideally, researchers want to find a balance between the two.
Commenters stressed the importance of data sharing. Cox pointed out that in addition to raw data, statisticians should also make public the scripts they used to analyze it. Because Eklund and colleagues did so, Cox was able to repeat their analyses and use the information to correct his own software.
Practical barriers to change exist, however. Many of the groups writing open-source software are small and have limited resources. AFNI’s development team consists of just five people, including Cox. Reiman noted that sharing data takes time and money. “The major challenge is how to incentivize and pay for folks to upload clean data, and how to maximize the value and productivity of the effort. A noble goal, but the devil is in the details,” Reiman wrote to Alzforum.—Madolyn Bowman Rogers
- New Journal Guidelines Aim to Boost Transparency in Research
- Mobile Phone App for Parkinson’s Patients Tests New Model for Data Sharing
- CAP Articulates Plan for Sharing Data from Trials of Preclinical Alzheimer’s Disease
- A Foreshadowing? ApoE4 Disrupts Brain Connectivity in Absence of Aβ
- Risk Genes Influence Brain Connectivity in Preclinical FTD
- Synaptic Genes Determine Brain Connectivity
- Miami: Multimodal Imaging, New Way to Test Amyloid Hypothesis
- Translational Biomarkers in Alzheimer Disease Research, Part 4
- Neuroimaging Offers a CLU to AD Risk Factor’s Functional Effects
- Familial Alzheimer’s Gene Alters Children’s Brains
- Young ApoE4 Carriers Wander Off the ‘Grid’ — Early Predictor of Alzheimer’s?
- DC: Do Measurable Changes in Brain Function Herald Dementia?
- Functional MRI Detects Brain Abnormalities in Former Football Players
- Overcompensation—It Could Work for the Brain
- More Evidence That Epilepsy Drug Calms Neurons and Boosts Memory
- WANTED: Biomarkers for Drug Trials in Frontotemporal Dementia
- Little by Little—Standardizing, Validating Those Biomarkers
- Correction for Eklund et al., Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. Proc Natl Acad Sci U S A. 2016 Aug 16;113(33):E4929. Epub 2016 Aug 8 PubMed.
- Eklund A, Andersson M, Josephson C, Johannesson M, Knutsson H. Does parametric fMRI analysis with SPM yield valid results? An empirical study of 1484 rest datasets. Neuroimage. 2012 Jul 2;61(3):565-78. Epub 2012 Apr 10 PubMed.
- ApoE Disrupts Brain Networks, Helps Microglia Clear Aβ
- ApoE4 Linked to Default Network Differences in Young Adults
- Does Amyloid Disturb the Slow Waves of Slumber—and Memory?
- Cocoa Flavanols Give Memory a Boost
- Do Earliest Cognitive Deficits in Alzheimer's Appear in the Entorhinal Cortex?
- BOLD New Look—Aβ Linked to Default Network Dysfunction
- Research Brief: Hippocampal Hyperactivity Tied to Early MCI Atrophy
- Seeing With the Mind’s Eye—Not So Easy for Seniors
- Eklund A, Nichols TE, Knutsson H. Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. Proc Natl Acad Sci U S A. 2016 Jul 12;113(28):7900-5. Epub 2016 Jun 28 PubMed.