. Reproducible brain-wide association studies require thousands of individuals. Nature. 2022 Mar 16; PubMed.


Comments

  1. This is an interesting paper. One overall goal of our field is to determine how we can use brain MRI, together with other information about the participants, to learn which brain regions are involved in specific functions, and how those regions are affected by disease and by treatments. Many of the findings reported over the years have not been replicated. This very careful study demonstrates that the more data are available, the more robust the findings. It confirms the suspicion many investigators have had that small studies can yield chance findings, or findings limited to a population subset, which do not stand the test of time. More large-scale studies, and data-sharing that facilitates amalgamation of data from separate studies, are therefore to be encouraged.

    An important issue, not directly addressed by this paper, is the “generalizability” of such findings. The vast majority of brain-imaging studies come from projects whose participants volunteer, and such studies tend to over-select from populations with the interest and time to take part in research, that is, people of higher socioeconomic status. To obtain truly generalizable findings, we need more large-scale brain-imaging studies of epidemiologically sampled populations such as the Health and Retirement Study, the Framingham Study, or the Mayo Clinic Study of Aging.

    View all comments by Michael Weiner
  2. Marek et al. are to be commended on their timely reminder of the large sample sizes often required to obtain reliable and reproducible findings in brain imaging studies. With this in mind, we formed the ENIGMA Consortium over a decade ago, and a recent special issue of Human Brain Mapping (Thompson et al., 2022) showcases more than 30 papers pooling brain-imaging data from across the world, in diverse populations from over 45 countries, often yielding the largest neuroimaging studies of over 30 major brain diseases and conditions. What we learned regarding reproducibility may offer some relief from the reproducibility crisis, and the concerns expressed by Marek et al. ENIGMA has benefited from the power of global collaboration to put the neuroimaging of disease on a firmer footing. It offers a solution for those collecting smaller cohort studies to team up with others worldwide to identify how robust their findings are, but also to discover crucial modulating factors that influence disease risk and progression in the brain. Some examples of what we learned by pooling tens of thousands of brain scans worldwide are below.

    1. There are robust brain correlates of major brain diseases. By pooling brain disease data from more than 45 countries and more than 30 disorders, ENIGMA published the largest neuroimaging studies of Parkinson’s disease, epilepsy, schizophrenia, bipolar disorder, major depressive disorder, and PTSD (reviewed in Thompson et al., 2020). The general message is that the characteristic signatures of brain disease are highly robust and reproducible worldwide. Independent consortia in Japan are finding almost exactly the same patterns of disease effects on the brain, and even the same rankings for brain biomarkers from MRI and DTI (Koshiyama et al., 2022). To first order, this agreement is remarkable and has brought robustness and rigor to fields where small effects are the rule, and where modulators of disease are important to identify.
    2. Do we always need thousands of subjects to find anything reliable? No. ENIGMA’s studies allowed us to go back and see which of the findings in the largest studies of each disease would have been found if groups had just published on their own, and what factors affect the discoverability and consistency of disease biomarkers worldwide. In neurological disorders such as Parkinson’s disease, epilepsy, and ataxia, extremely robust effects are found, replicating and extending effects found in smaller studies of a few hundred people, but often revealing stage-specific patterns and unsuspected subtypes. Similarly, ENIGMA’s studies show huge effects on the brain for many rare genetic variants—cohorts were able to discover these effects in a few dozen cases, and larger studies then verified them. In the psychiatric neuroimaging field, ENIGMA’s largest neuroimaging studies to date of anorexia, schizophrenia, bipolar disorder, and PTSD all show robust effects on brain MRI, DTI, and resting state fMRI. Pooling tens of thousands of MRIs across sites and countries helped to both discover and verify patterns that were once hotly debated or controversial. By pooling data from hundreds of labs, we found that the patterns of structural brain asymmetry are consistent worldwide (Kong et al., 2022), but are subtly altered in schizophrenia (Schijven et al., 2022) and autism (Postema et al., 2019)—supporting a hypothesis that had been controversial since the 1970s. Other hypotheses that have recently been resolved with very large datasets include the presence of subtle brain differences in schizotypy (Kirschner et al., 2021), a mysterious protective pattern seen in unaffected siblings of people with bipolar disorder (de Zwarte et al., 2022), and the association of robust brain differences in hippocampal volume with epigenetic variations such as methylation (Jia et al., 2021).
    3. Modulators of disease and ancestral diversity. Sometimes disease effects do differ in important ways across populations—depending on prevailing risk factors and a person’s ancestry, effects found in one population may not always be found in another. In ENIGMA’s studies of OCD, large-scale data pooling (Piras et al., 2021) identified medication, duration of illness, and age of onset as largely accounting for the differences in reported brain effects across cohorts worldwide—in a field that had once been regarded as rife with inconsistencies. In ADHD, a remarkable study pooling data from over 150 cohorts worldwide and comparing effects across OCD, ASD, and ADHD (Boedhoe et al., 2020) revealed a subtle pattern of effects on the brain that shifts with age.

    As noted by Marek et al., special design considerations are required, including very large samples (thousands or tens of thousands), when very high-dimensional datasets are screened for associations. To avoid the risk of spurious associations when large sets of biomarkers are screened, a split sample—or an independent replication sample—is often used. Before anyone complains about the cost, great success has been achieved by pooling existing datasets across the world, after careful coordination and harmonization. This process is eased in ENIGMA by using standardized workflows that process datasets consistently, avoiding the methodological heterogeneity that often leads to discrepancies, known and unknown, in the published literature.

    When thousands or even millions of biomarkers are screened for associations at the same time, the risk of false positive associations increases; as Smith and Nichols (2018) note, this case is widely recognized in brain mapping, and there are already many widely used methods to mitigate it. Methods to meta-analyze brain maps and aggregate evidence across multiple brain mapping studies have been highly successful (BrainMap, Neurosynth, SDM, and meta-TBM). In genetics, the same multiple testing issues have led to the coordinated analysis of large samples to detect associations between genomic variation and key brain biomarkers. In our own GWAS studies, we screened over a million genomic markers to identify common genetic variants associated with brain measures derived from MRI (Grasby et al., 2020), and even with the rates of brain atrophy over time (Brouwer et al., 2022), offering key new genomic targets for drug discovery. These genetic risk effects are individually small but highly robust—much like the genetic risk loci for Alzheimer’s disease beyond APOE, which required thousands of subjects to discover, but much smaller samples to verify (when the penalty of multiple testing is reduced in a follow-up, independent replication phase). One great success of Alzheimer’s research and degenerative disease research in general is the quest to identify disease-related markers in the genome, and in brain images. Such efforts have often disentangled the heterogeneity of disease, and discovered disease subtypes that can be dealt with in a more personalized and targeted way.
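    The multiple-testing arithmetic behind this point can be sketched in a short pure-Python simulation. The marker count and group size below are hypothetical, chosen only for illustration; every "biomarker" is pure noise, so any hit is, by construction, a false positive:

    ```python
    import random
    from statistics import NormalDist, mean, stdev

    random.seed(0)

    N_MARKERS = 10_000  # hypothetical number of biomarkers screened
    N_PER_GROUP = 50    # hypothetical subjects per group

    def p_two_sample(a, b):
        """Normal-approximation two-sample z-test (illustrative only)."""
        se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
        z = (mean(a) - mean(b)) / se
        return 2 * (1 - NormalDist().cdf(abs(z)))

    # Null screen: no marker is truly associated with the phenotype.
    false_positives = sum(
        p_two_sample([random.gauss(0, 1) for _ in range(N_PER_GROUP)],
                     [random.gauss(0, 1) for _ in range(N_PER_GROUP)]) < 0.05
        for _ in range(N_MARKERS)
    )

    print(f"{false_positives} of {N_MARKERS} null markers pass p < 0.05 uncorrected")
    # A Bonferroni threshold (0.05 / N_MARKERS) removes nearly all of these,
    # while a replication phase testing only the few surviving candidates pays
    # a far smaller penalty, which is why replication samples can be smaller.
    ```

    With these settings, roughly five percent of the null markers come out "significant" uncorrected, which is exactly the spurious-association risk described above, and why a reduced multiple-testing penalty makes independent replication so much cheaper.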

    We propose the following remedies for the need for large samples in neuroimaging research. First of all, large samples are not always required: They are needed only when the effects are small, when a large number of associations are tested at the same time, or when key modulators of the disease effect are suspected. All these situations will typically lead to different findings across cohorts when cohorts are small (under 100 subjects). If the effects are small, or multiple tests are performed, or modulators are important (e.g., sex differences, or ancestry effects), then joining together as consortia will offer the power to discover and verify effects, and to find key modulators. We have shown this repeatedly, in over 100 papers by ENIGMA over the last decade.

    The neurology and psychiatry fields have been immensely successful in this regard, with vast international efforts in genomics (ADSP, ADGC, PGC), and neuroimaging (ENIGMA). We thank Marek et al. for their call to action for better reproducibility in brain imaging, and we encourage teams worldwide to unite to discover and verify factors that resist the devastating diseases that affect us all today.

    References:

    . The Enhancing NeuroImaging Genetics through Meta-Analysis Consortium: 10 Years of Global Collaborations in Human Brain Mapping. Hum Brain Mapp. 2022 Jan;43(1):15-22. Epub 2021 Oct 6 PubMed.

    . ENIGMA and global neuroscience: A decade of large-scale studies of the brain in health and disease across more than 40 countries. Transl Psychiatry. 2020 Mar 20;10(1):100. PubMed.

    . Neuroimaging studies within Cognitive Genetics Collaborative Research Organization aiming to replicate and extend works of ENIGMA. Hum Brain Mapp. 2022 Jan;43(1):182-193. Epub 2020 Jun 5 PubMed.

    . Mapping brain asymmetry in health and disease through the ENIGMA consortium. Hum Brain Mapp. 2022 Jan;43(1):167-181. Epub 2020 May 18 PubMed.

    . Large-scale analysis of structural brain asymmetries in schizophrenia via the ENIGMA consortium. medRxiv, March 11, 2022. medRxiv

    . Author Correction: Altered structural brain asymmetry in autism spectrum disorder in a study of 54 datasets. Nat Commun. 2021 Dec 8;12(1):7260. PubMed.

    . Cortical and subcortical neuroanatomical signatures of schizotypy in 3004 individuals assessed in a worldwide ENIGMA study. Mol Psychiatry. 2021 Oct 27; PubMed.

    . Intelligence, educational attainment, and brain structure in those at familial high-risk for schizophrenia or bipolar disorder. Hum Brain Mapp. 2022 Jan;43(1):414-430. Epub 2020 Oct 7 PubMed.

    . Epigenome-wide meta-analysis of blood DNA methylation and its association with subcortical volumes: findings from the ENIGMA Epigenetics Working Group. Mol Psychiatry. 2021 Aug;26(8):3884-3895. Epub 2019 Dec 6 PubMed.

    . White matter microstructure and its relation to clinical features of obsessive-compulsive disorder: findings from the ENIGMA OCD Working Group. Transl Psychiatry. 2021 Mar 17;11(1):173. PubMed.

    . Subcortical Brain Volume, Regional Cortical Thickness, and Cortical Surface Area Across Disorders: Findings From the ENIGMA ADHD, ASD, and OCD Working Groups. Am J Psychiatry. 2020 Sep 1;177(9):834-843. Epub 2020 Jun 16 PubMed.

    . Statistical Challenges in "Big Data" Human Neuroimaging. Neuron. 2018 Jan 17;97(2):263-268. PubMed.

    . The genetic architecture of the human cerebral cortex. Science. 2020 Mar 20;367(6484) PubMed.

    . Age-dependent genetic variants associated with longitudinal changes in brain structure across the lifespan. bioRxiv, November 8, 2021 bioRxiv

    View all comments by Paul Thompson
  3. This study neatly and systematically evaluates the boundaries of effect sizes that can be expected for varying sample sizes in brain-wide association studies (BWAS) with cognitive or mental health measures, by making use of three large, publicly available datasets (ABCD, HCP, and UKB) of mostly normal individuals.

    They find that when sample sizes are small, effect sizes are probably exaggerated relative to the real effect, since effect sizes decreased when replicated in larger samples. Importantly, applying a stringent p-value threshold will not solve this problem, but will instead bias results toward inflated effects. This indicates that when sample sizes are small, all results should be reported, and researchers should not fixate on p-values alone.
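    This inflation of "significant" effects in small samples (the so-called winner's curse) can be reproduced in a minimal simulation. The true correlation, sample size, and number of runs below are illustrative values I have assumed, not parameters taken from the paper:

    ```python
    import math
    import random
    from statistics import NormalDist, mean

    random.seed(1)

    TRUE_R = 0.1   # hypothetical small but real brain-behavior correlation
    N = 50         # hypothetical small-study sample size
    RUNS = 2000

    def sample_r(true_r, n):
        """Draw n (x, y) pairs with population correlation true_r; return Pearson's r."""
        xs = [random.gauss(0, 1) for _ in range(n)]
        ys = [true_r * x + math.sqrt(1 - true_r ** 2) * random.gauss(0, 1) for x in xs]
        mx, my = mean(xs), mean(ys)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        syy = sum((y - my) ** 2 for y in ys)
        return sxy / math.sqrt(sxx * syy)

    significant = []
    for _ in range(RUNS):
        r_hat = sample_r(TRUE_R, N)
        z = math.atanh(r_hat) * math.sqrt(N - 3)       # Fisher z-test
        p = 2 * (1 - NormalDist().cdf(abs(z)))
        if p < 0.05:
            significant.append(abs(r_hat))

    print(f"true r = {TRUE_R}; mean |r| among significant runs = {mean(significant):.2f}")
    ```

    Only runs whose observed |r| clears the significance bar (about 0.28 at n = 50) are "published" in this sketch, so the significant studies report roughly triple the true effect, exactly the bias that a stricter p-threshold worsens rather than cures.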

    The authors further compared effect sizes for small subsets of the data with effects observed in the total large dataset. Striking from those analyses is that the false-negative rate remained very high even for very large samples of more than 2,000 individuals. This puts us between Scylla and Charybdis: Small but robust effect sizes may help us understand brain-phenotype relationships at the group level, but may not apply to single participants; whereas high effect sizes in small samples may exaggerate potential clinical meaningfulness.

    In general, effect sizes in this study were small, which may reflect that investigating brain-phenotype relationships in normal individuals is a hard thing to do. It can be questioned whether performance on different cognitive tests would map onto the same brain areas, an assumption implied here by the use of a composite score. Individuals show complex differences in brain-phenotype relationships, which can change with older age, and it is conceivable that small effect sizes echo this inter-individual heterogeneity. Here the ground truth was assumed to be the effect observed in the largest group, but in fact it is unknown.

    Care should be taken in generalizing these findings to research on Alzheimer’s disease, where effects might be more pronounced and may not require large sample sizes. Hippocampal atrophy can be seen with the naked eye and has been associated with memory decline in many studies. It would be interesting if this exercise could be repeated using a known, possibly artificial, ground truth to find ideal sample sizes for a given effect size.

    View all comments by Betty Tijms
  4. This is a very interesting article, a long-needed study, and a great reference for the future. It demonstrates that the laws of statistics relating sample size and effect size are universal and indifferent to the type of data. Thus, BWAS suffer from the same problems as GWAS, and the remedies are likely similar: more data, and replication in out-of-sample cohorts.

    We knew that small samples are more likely, due to sampling bias, to produce extreme effect sizes, i.e., significant results, which then typically fail to replicate. What surprised me, however, was that our common strategy to limit false positives in the first place, namely applying a more rigorous statistical threshold, makes things even worse!

    For studies of AD, the takeaway is twofold. Firstly, we can assume that the effects of dementia on the brain are stronger than those of the cognitive phenotype examined here; therefore, the sample-size requirements may be in the hundreds instead of the thousands. Indeed, past studies have reliably mapped hallmark features of AD such as regional atrophy. However, subtle, or secondary, cognitive effects of the disease that vary more between patients may indeed require thousands of participants to map reliably, in order to better understand the disease’s heterogeneity. Secondly, the fundamental takeaway is that our study designs should include replication cohorts, besides aiming to maximize the expected effect sizes by focusing on longitudinal designs and the most promising imaging modalities.
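    The "hundreds versus thousands" intuition can be made concrete with the standard Fisher-z power approximation for detecting a Pearson correlation. This is a generic textbook formula, not a calculation from Marek et al., and the effect sizes below are illustrative:

    ```python
    from math import atanh, ceil
    from statistics import NormalDist

    def n_required(r, alpha=0.05, power=0.80):
        # Approximate sample size needed to detect a Pearson correlation r
        # in a two-sided test, via the Fisher z transformation:
        #   n ~= ((z_{1-alpha/2} + z_{power}) / atanh(r))^2 + 3
        nd = NormalDist()
        z_alpha = nd.inv_cdf(1 - alpha / 2)
        z_beta = nd.inv_cdf(power)
        return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)

    for r in (0.5, 0.3, 0.1, 0.05):
        print(f"r = {r}: n ~ {n_required(r)}")
    ```

    Under this approximation, a strong effect (r = 0.5) needs only about 30 participants at 80 percent power, r = 0.3 needs under a hundred, while r = 0.1 needs roughly 800 and r = 0.05 over 3,000, consistent with larger disease effects being detectable in far smaller samples than subtle cognitive associations.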

    View all comments by Andre Altmann
  5. This important paper highlights both the benefits and challenges of open science. The benefits are clear: Large, widely available samples provide statistical power and opportunities for replication. The challenges arise mostly from running many exploratory correlations across multiple metrics, both within and between the brain and behavior. The many aspects to measure, and the many ways of measuring them, make it difficult for the field to rely on a handful of large studies, each making its own choices about what to measure. As an alternative, researchers can aggregate smaller samples through meta-analyses, preserving the variety of brain and behavior phenotypes.

    As a side note, I argue against imposing committee-driven harmonization when studying brain and behavior, because the complexity of both requires a wide variety of approaches. Different forms of meta-analysis, such as the Activation Likelihood Estimation method and the ENIGMA and CHARGE consortia, have been very helpful in this regard.

    It is noteworthy that statistical power, although important, does not by itself replace the need to ask the right question. Researchers must thoughtfully use large datasets to test specific hypotheses grounded in biology (brain) and psychology (behavior), taking advantage of existing knowledge gathered at multiple levels in various species, including humans, over the past 100+ years.

    Conceptually, brain-wide association studies are quite different from genome-wide association studies. The GWAS approach has a solid foundation: The genome is easily assessed in a highly standardized manner (e.g., SNP arrays), and inter-individual variation among just four possible nucleotides across the genome is correlated with various phenotypes. BWAS, on the other hand, can be seen as a correlation between a series of open-ended constructs on both sides (e.g., various metrics of “functional connectivity” against various, often latent, metrics of cognition or mental health). Furthermore, unlike the reasonably noise-free genomic variation captured by genotyping, phenotyping of the brain and behavior is noisy.

    View all comments by Tomas Paus
  6. This study is important because it shows that the concept of BWAS for complex phenotypes, such as psychopathology and cognitive screening test composites, may be flawed in principle. In my view, increasing funding to obtain GWAS-like MRI sample sizes for hypothesis-free research on complex phenotypes in unstratified populations does not seem to be the best investment of resources. An important value of large cohorts is the ability to stratify by specific genetic, clinical, socio-demographic, or other phenotypes for hypothesis-driven brain-imaging studies. This is, in my view, the predominant approach in the Alzheimer’s brain-imaging field.

    In fact, the authors have been quite balanced in distinguishing hypothesis-driven brain-imaging research in stratified samples from the problems associated with BWAS discovery studies. We should make sure that this message does not get lost while digesting the important lessons from this study.

    View all comments by Emrah Düzel
  7. This is a timely and very important study. It confirms several of our previous findings from searching for associations between brain structure and behavior in healthy populations, and further shows that the concerns we previously raised apply similarly to associations between brain connectivity and behavior. In particular, effect sizes are inflated in small samples (Kharabian Masouleh et al., 2019), and examining the same question in two different samples can lead to opposite conclusions, e.g., the sign of the associations can be flipped (Genon et al., 2017). Altogether, these works highlight important challenges in neuroimaging research, especially when aiming to relate brain to behavior in healthy populations.

    First, neuroimaging data, as indirect measurements of brain structure or brain connectivity, are particularly noisy. We can reasonably wonder to what extent the interindividual variability that we measure in brain structure is valid in healthy populations. Similar concerns apply to psychometric data in healthy populations. For instance, we may wonder whether slight differences between two participants in the score on a list-learning task, or in personality scores, truly reflect differences in cognitive abilities or stable differences in behavior, respectively, allowing inferences about their neural correlates.

    Data and research in clinical populations, such as patients with Alzheimer’s disease, are affected by these concerns to a lesser extent. Indeed, we have previously shown that associations between brain structure and behavior replicate better in these populations (Kharabian Masouleh et al., 2019). This is likely because current methods in the field capture atrophy well, as well as the behavioral deficits in these populations. This leads to greater effect sizes than those observed for associations in healthy populations, hence yielding more robust findings. Thus, improving the validity of neuroimaging markers of brain structure and function would be an important avenue for progress in understanding brain-behavior associations.

    Second, sampling variability is a crucial issue in neuroimaging research, as in any research field. The present study demonstrates well that extremely large cohorts are needed to identify robust brain-behavior associations. With cohorts of thousands of participants, a strict cross-validation scheme can be implemented to evaluate the generalizability of the findings. However, even with a within-cohort cross-validation setting, when the data have been acquired at a single site using a unique research protocol, the question of generalizability to a completely independent dataset remains open.

    Researchers have to remain aware of the characteristics of their samples and, thus, of potential bias. For instance, most datasets in our field are dominated by one type of population, and a model derived from these samples may not generalize to minorities in the general population (Li et al., 2022). So, there is still a lot of work to be done to derive brain-behavior models that will be useful and accurate for single individuals.

    References:

    . Searching for behavior relating to grey matter volume in a-priori defined right dorsal premotor regions: Lessons learned. Neuroimage. 2017 Aug 15;157:144-156. Epub 2017 May 25 PubMed.

    . Empirical examination of the replicability of associations between brain structure and psychological variables. Elife. 2019 Mar 13;8 PubMed.

    . Cross-ethnicity/race generalization failure of behavioral prediction from resting-state functional connectivity. Sci Adv. 2022 Mar 18;8(11):eabj1812. Epub 2022 Mar 16 PubMed.

    View all comments by Sarah Genon
  8. The work by Marek and colleagues is an outstanding demonstration of an issue that has been discussed for the better part of a decade in the fields of psychology and cognitive neuroscience. Often termed the "replication crisis," there is growing recognition that many behavioral and neuroimaging studies do not replicate (Open Science Collaboration, 2015; Camerer et al., 2018; Elliott et al., 2020). This lack of replication is often tied to small sample sizes leading to inadequate power (Button et al., 2013), and the ultimate result is poor science. Given how widespread and systematic such issues are across scientific specializations, it is unlikely that studies of neurodegenerative diseases such as Alzheimer's are free of such concerns. In fact, the prevalence of case studies or case series in more clinically oriented fields may exacerbate such issues.

    There are a number of factors, though, that are in our favor. The first is that the overall power of a given study depends not simply on sample size, but also on the size of the effect being studied. Many disease effects are orders of magnitude larger than the subtle behavioral associations investigated by Marek and colleagues. For example, atrophy in individuals with an AD diagnosis, or abnormal amyloid values in APOE E4 carriers, are robust, strong effects that can be detected in a handful of individuals. Statistical concerns become more prominent when analyses focus on individual differences, which are likely to be subtle, or on increasingly complex statistical models (e.g., three-way interactions).

    Another strength of our field is that the multiple large cohorts being studied in the U.S., Europe, and Asia can inherently lead to a high degree of replication and convergence. While there is often emphasis placed on being the first to show a finding, as a field we should appreciate the utility that replication and reaching a consensus bring. There is also a growing push to make data from these large cohorts more accessible so that studies do have increasingly large Ns. Still, the most important thing we can do is to ask reasonable, well-thought-out scientific questions. Larger and larger samples are not helpful if the base science being pursued is inherently flawed or ill planned.

    References:

    . PSYCHOLOGY. Estimating the reproducibility of psychological science. Science. 2015 Aug 28;349(6251):aac4716. PubMed.

    . Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat Hum Behav. 2018 Sep;2(9):637-644. Epub 2018 Aug 27 PubMed.

    . What Is the Test-Retest Reliability of Common Task-Functional MRI Measures? New Empirical Evidence and a Meta-Analysis. Psychol Sci. 2020 Jul;31(7):792-806. Epub 2020 Jun 3 PubMed.

    . Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci. 2013 May;14(5):365-76. Epub 2013 Apr 10 PubMed.

    View all comments by Brian Gordon


This paper appears in the following:

News

  1. Like GWAS, BWAS Need Thousands of Participants