How’s this for a challenge? In a haystack of 250,000 genetic variants lurking in the genomes of more than 40,000 people, find the ones that associate with Alzheimer’s disease either alone or in combination. While you’re at it, see if you can glean biological mechanisms that lead to disease. Sounds like a job for AI, and according to a study published in Nature Communications on July 22, it sure is.

  • Scientists trained three machine-learning algorithms to find variants that distinguish AD cases from controls among more than 40,000 European AD Biobank samples.
  • ML found as many genome-wide significant variants as traditional GWAS, plus novel loci.
  • The AD risk variants were enriched for microglial and astrocytic genes, not synaptic ones.

Researchers led by Valentina Escott-Price of Cardiff University, U.K.; Cornelia van Duijn of the University of Oxford, U.K.; and Kristel Van Steen of KU Leuven, Belgium, trained three machine-learning algorithms on the genotypes of AD cases and controls in the European AD Biobank. Left to their own devices to sniff out the complex genetic scent of AD, the algorithms were then able to distinguish AD cases from controls, and to identify the genetic variants that allowed them to do so. Compared to traditional GWAS methods applied to this same batch of samples, ML picked out the same disease-associated variants, and then some. The study identified six previously unknown disease loci, and highlighted different biological roads that might lead to AD.

With this AI trial run under their belt, the scientists hope to incorporate more detailed clinical and biomarker data into AI algorithms, to zero in on parallel biological mechanisms that converge on AD and other dementias, Escott-Price told Alzforum.

The new study applied these AI techniques to the genomes of 41,686 participants in the European AD Biobank consortium. This may seem like a lot, but it pales in comparison to the largest AD GWAS conducted so far, which topped one million samples (Sep 2021 news). While beefing up the sample size of GWAS over the years has unearthed more AD risk variants, that size comes at a cost, Escott-Price explained. To reach these numbers, scientists use meta-analysis techniques to integrate summary statistics across cohorts, and also count people with a family history of dementia as “AD-by-proxy” cases (Mar 2019 news; Feb 2021 news). What’s more, cases and controls are diagnosed differently across datasets, and often the controls aren’t clinically confirmed as being cognitively normal. These factors add noise and uncertainty to the results, Escott-Price said. Disease risk variants identified in these mega-GWAS can be tallied to generate a polygenic risk score. However, these PRS fail to account for interactions among different disease variants in calculating risk.
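To make that additive assumption concrete, here is a minimal Python sketch of a PRS as a weighted tally of risk alleles. The SNP names and weights are hypothetical placeholders, not values from the study, and the function illustrates the general idea rather than the authors' pipeline.

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, int], weights: Dict[str, float]) -> float:
    """Additive PRS: sum of (per-allele weight x risk-allele count) over SNPs."""
    return sum(weight * dosages.get(snp, 0) for snp, weight in weights.items())


# Hypothetical per-allele weights (for example, log odds ratios from a GWAS).
weights = {"snp_a": 0.30, "snp_b": 0.10}

only_a = polygenic_risk_score({"snp_a": 1, "snp_b": 0}, weights)
only_b = polygenic_risk_score({"snp_a": 0, "snp_b": 1}, weights)
both = polygenic_risk_score({"snp_a": 1, "snp_b": 1}, weights)

# The score for carrying both alleles is just the sum of the two
# single-allele scores, so any extra risk that arises only from the
# combination is invisible to this kind of score.
print(only_a, only_b, both)
```

Because each variant contributes independently, a pair of variants that matters only in combination would add nothing beyond its two separate weights.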

For these and other reasons, co-first authors Matthew Bracher-Smith, Federico Melograna, Brittany Ulm, and colleagues decided to apply machine-learning methods to investigate the genetic basis of AD in a smaller, more defined dataset in which both cases and controls had been clinically confirmed, and for which they had access to the raw genotyping data, rather than just the summary statistics that are often used in giant meta-analyses. The EADB consortium fits the bill.

The scientists used data from 70 percent of the samples to train three established ML models to tell the difference between cases and controls based on genotypes at some 250,000 SNP positions across the genome. The three models are gradient-boosting machines (GBM), neural networks (NN), and model-based multifactor dimensionality reduction (MB-MDR). GBM uses a tree-based approach to iteratively select polymorphisms that best distinguish cases from controls, while filtering out those that don’t. NNs work by passing input data—in this case, genetic variants—through a series of information layers, such as causal genes and biological pathways, which are used to separate cases from controls. Finally, MB-MDR considers interactions among different variants—i.e., epistasis—in influencing AD risk. In parallel, the scientists also applied traditional GWAS methods to this limited EADB dataset; they used the resulting genome-wide significant variants to calculate PRS.
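Why look at variants in combination? The toy Python sketch below, loosely in the spirit of multifactor dimensionality reduction, compares the case rate for each genotype at one SNP with the case rate for each two-SNP combination. It is not the MB-MDR algorithm, which adds statistical modeling and testing on top. The data are invented so that disease status depends only on the pair, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def case_rate_by_cell(samples, snps):
    """Return the fraction of cases within each genotype combination of `snps`."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [cases, total]
    for genotype, is_case in samples:
        cell = tuple(genotype[snp] for snp in snps)
        counts[cell][0] += is_case
        counts[cell][1] += 1
    return {cell: cases / total for cell, (cases, total) in counts.items()}


# Invented toy data: a person is a case only when carrying exactly one of
# the two variants (a pure interaction with no single-SNP effect).
samples = []
for a, b in product([0, 1], repeat=2):
    for _ in range(25):
        samples.append(({"snp_a": a, "snp_b": b}, int(a != b)))

single = case_rate_by_cell(samples, ["snp_a"])
pair = case_rate_by_cell(samples, ["snp_a", "snp_b"])

print(single)  # carriers and non-carriers of snp_a look identical
print(pair)    # the two-SNP cells separate cases from controls
```

In this contrived setup, a one-SNP-at-a-time scan sees no signal at either variant, while the joint genotype table separates cases from controls completely. Real data are far noisier, which is why interaction methods need formal testing.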

When put to the test with a separate batch of samples, all of the AI models, as well as traditional PRS, correctly distinguished cases from controls with an accuracy of 66 to 69 percent. This was true when the researchers split the training and testing cohorts in different ways. That none of the ML models substantially outperformed traditional PRS in distinguishing cases from controls is an indicator that the scientists need to enhance the way the genotyping data are coded, Escott-Price told Alzforum. As of now, the coding system is too simplistic to detect epistatic effects.
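The train-then-test logic can be sketched in a few lines of Python. This is a generic illustration that assumes a simple random 70/30 split; the function names, the seed, and the toy inputs are hypothetical, and the study's actual splitting scheme may differ.

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(items: Sequence, train_fraction: float = 0.7,
                     seed: int = 0) -> Tuple[List, List]:
    """Shuffle reproducibly, then cut into training and held-out test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy example: ten sample IDs, 70 percent for training, the rest held out.
sample_ids = list(range(10))
train_ids, test_ids = split_train_test(sample_ids, seed=1)

# Hypothetical predictions for four held-out people (1 = case, 0 = control).
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(len(train_ids), len(test_ids), score)
```

Changing `seed` produces a different partition of the same samples, which is one simple way to check that a result does not hinge on a particular split.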

Next, the researchers looked under the hood to see which variants each model relied upon to distinguish AD cases from controls. The models prioritized different sets of variants, with GBM finding the most. In all, the machine-learning algorithms identified 18 known AD risk loci that prior, larger GWAS had found to be genome-wide significant. They also uncovered 34 putative loci, including six that had been tied to AD, though below the threshold of genome-wide significance, in a large GWAS (Jansen et al., 2019).

Those six are ARHGAP25, COG7, LINC00924/LOC105369212, LY6H, SOD1, and ZNF597. A few details on some of them: ARHGAP25 is expressed in macrophages, where it modulates phagocytosis (Schlam et al., 2015). LY6H helps maintain the integrity of nicotinic acetylcholine receptors in the brain, and an Aβ-driven reduction in LY6H has been implicated in cholinergic dysfunction in AD (Wu et al., 2021). COG7 modulates glycosylation in the Golgi, a process that reportedly skews the trafficking and processing of tau and APP (Wu et al., 2004; Haukedal and Freude, 2021). Known as a familial ALS gene, SOD1 defends against oxidative damage, and its expression is skewed in AD.

All three ML models tied SNPs within the ApoE gene to AD. However, unlike traditional GWAS, which tie many SNPs in the ApoE region to AD, both the GBM and NN models zeroed in on the two specific SNPs that determine ApoE genotype, a finding Escott-Price thinks is exciting.

While the ML models did not prioritize every variant previously found in massive GWAS, they did root out every variant pegged as genome-wide significant in a GWAS performed on the smaller EADB dataset. In all, these ML methods detected nearly a quarter of the genes that previous GWAS meta-analyses had found using 20 times the sample size.

Finally, the scientists looked for biological trends among the 52 loci the ML models had singled out. The variants leaned toward microglial and astrocytic genes, and away from synaptic ones. Top biological pathways the AD risk variants are part of included regulation of APP and Aβ formation, lipid homeostasis and cholesterol assembly, and endocytosis.—Jessica Shugart

Comments

  1. In this study, Bracher-Smith et al. use machine learning to investigate Alzheimer’s disease genetics. By analyzing genome-wide data from 41,686 individuals, the authors demonstrate the use of machine learning methods to robustly predict AD genetic risk. Our previous study in 2023 was the first to demonstrate the feasibility of applying machine-learning-based methods in prediction of AD genetic risk (Zhou et al., 2023). The current study takes a further step by leveraging a large European AD cohort to achieve a more robust risk prediction model for AD. More importantly, the authors used machine-learning-based variant prioritization methods to successfully identify well-established AD risk loci (e.g., APOE and BIN1) and discover novel AD risk loci (e.g., ARHGAP25 and LY6H) that exhibit interesting genetic associations with AD-related pathways, which were replicated in a larger AD GWAS (Jansen et al., 2019). This suggests that a key advantage of using machine learning in GWAS is enhancement of statistical power without requiring larger sample sizes, which addresses a critical limitation of conventional GWAS approaches.

    Notably, the study showcases the potential of machine learning to handle complex genetic architectures, such as linkage disequilibrium and epistatic effects, by simultaneously analyzing many genetic variants. A remarkable example is that the models correctly ranked APOE2/APOE4 variants (i.e., rs7412 and rs429358) as top predictors in the APOE locus, outperforming traditional GWAS methods, which often fail to prioritize these two coding variants as top hits according to ranked P-values. This underscores the potential of machine learning to refine causal variant identification in regions exhibiting linkage disequilibrium.

    Moreover, the authors thoughtfully outline critical next steps, such as establishing standardized significance thresholds for machine-learning-based genetics studies to improve reproducibility. They also explain the promise of federated learning approaches to enable large-scale collaborative analyses while maintaining data privacy. These insights lay the foundation for further methodological advancements in the field.

    In summary, this study bridges the gap between classical genetic epidemiology and modern computational approaches, offering a powerful statistical tool for AD genetics studies and paving the way for further advances using machine learning.

    References:

    Deep learning-based polygenic risk analysis for Alzheimer's disease prediction. Commun Med (Lond). 2023 Apr 6;3(1):49.

    Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer's disease risk. Nat Genet. 2019 Mar;51(3):404-413. Epub 2019 Jan 7.

References

News Citations

  1. From a Million Samples, GWAS Squeezes Out Seven New Alzheimer's Spots
  2. Paper Alerts: Massive GWAS Studies Published
  3. Massive GWAS Meta-Analysis Digs Up Trove of Alzheimer’s Genes

Paper Citations

  1. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer's disease risk. Nat Genet. 2019 Mar;51(3):404-413. Epub 2019 Jan 7.
  2. Phosphoinositide 3-kinase enables phagocytosis of large particles by terminating actin assembly through Rac/Cdc42 GTPase-activating proteins. Nat Commun. 2015 Oct 14;6:8623.
  3. Unbalanced Regulation of α7 nAChRs by Ly6h and NACHO Contributes to Neurotoxicity in Alzheimer's Disease. J Neurosci. 2021 Oct 13;41(41):8461-8474. Epub 2021 Aug 26.
  4. Mutation of the COG complex subunit gene COG7 causes a lethal congenital disorder. Nat Med. 2004 May;10(5):518-23. Epub 2004 Apr 25.
  5. Implications of Glycosylation in Alzheimer's Disease. Front Neurosci. 2020;14:625348. Epub 2021 Jan 13.

Primary Papers

  1. Machine learning in Alzheimer's disease genetics. Nat Commun. 2025 Jul 22;16(1):6726.