ResearchHub | Open Science Community

Overestimated Polygenic Prediction due to Overlapping Subjects in Genetic Datasets

David Park et al. | Jan 22, 2022

ABSTRACT Recently, polygenic risk scores (PRS) have gained significant attention in studies of complex genetic diseases and traits. PRS is often derived from summary statistics, from which the independence between discovery and replication sets cannot be monitored. Prior studies, in which the independence is strictly observed, report a relatively low gain from PRS in predictive models of binary traits. We hypothesize that the independence assumption may be compromised when using summary statistics, and suspect an overestimation bias in the predictive accuracy. To demonstrate the overestimation bias in the replication dataset, prediction performances of PRS models are compared when overlapping subjects are either present or removed. We consider the task of Alzheimer’s disease (AD) prediction across genetics datasets, including the International Genomics of Alzheimer’s Project (IGAP), AD Sequencing Project (ADSP), and Accelerating Medicine Partnership - Alzheimer’s Disease (AMP-AD). PRS is computed from either sequencing studies for ADSP and AMP-AD (denoted as rPRS) or the summary statistics for IGAP (sPRS). Two variables with high heritability in UK Biobank, hypertension and height, are used to derive an exemplary scale effect of PRS. Based on the scale effect, the expected performance of sPRS is computed for AD prediction. Using ADSP as a discovery set for rPRS on AMP-AD, ΔAUC and ΔR² (performance gains in AUC and R² by PRS) are 0.069 and 0.11, respectively. These drop to 0.0017 and 0.0041, respectively, once overlapping subjects are removed from AMP-AD. sPRS is derived from IGAP, which records ΔAUC and ΔR² of 0.051±0.013 and 0.063±0.015 for ADSP and 0.060 and 0.086 for AMP-AD, respectively. On UK Biobank, rPRS performances for hypertension, assuming a similar size of discovery and replication sets, are 0.0036±0.0027 (ΔAUC) and 0.0032±0.0028 (ΔR²). For height, ΔR² is 0.029±0.0037.
Considering the high heritability of hypertension and height in UK Biobank, we conclude that sPRS results from AD databases are inflated. Higher performances relative to the size of the discovery set have been observed in PRS studies of several diseases. PRS performances for binary traits, such as AD and hypertension, turned out unexpectedly low. This may, along with differences in linkage disequilibrium, explain the high variability of PRS performances in cross-nation or cross-ethnicity applications, i.e., when there are no overlapping subjects. Hence, for sPRS, potential duplications should be carefully considered within the same ethnic group.
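The comparison described in the abstract (performance gain from PRS with and without overlapping subjects) can be illustrated with a minimal Python sketch. Everything here is hypothetical and for illustration only: the record layout, the subject IDs, the toy scores, and the helper names `auc`, `delta_auc`, and `drop_overlap` are assumptions, not the authors' pipeline or data. AUC is computed with the standard pairwise (Mann-Whitney) formulation.

```python
def auc(labels, scores):
    """ROC AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def delta_auc(rows):
    """Gain in AUC from adding PRS to a baseline model (hypothetical field names)."""
    y = [r["y"] for r in rows]
    return auc(y, [r["with_prs"] for r in rows]) - auc(y, [r["base"] for r in rows])


def drop_overlap(rows, discovery_ids):
    """Keep only replication subjects that do not appear in the discovery set."""
    return [r for r in rows if r["id"] not in discovery_ids]


# Toy replication set: s1..s4 also appear in the (hypothetical) discovery set,
# so their PRS-based scores look much better than they would out of sample.
replication = [
    {"id": "s1", "y": 1, "base": 0.60, "with_prs": 0.90},
    {"id": "s2", "y": 0, "base": 0.50, "with_prs": 0.10},
    {"id": "s3", "y": 1, "base": 0.40, "with_prs": 0.80},
    {"id": "s4", "y": 0, "base": 0.70, "with_prs": 0.20},
    {"id": "s5", "y": 1, "base": 0.55, "with_prs": 0.60},
    {"id": "s6", "y": 0, "base": 0.45, "with_prs": 0.50},
    {"id": "s7", "y": 1, "base": 0.30, "with_prs": 0.35},
    {"id": "s8", "y": 0, "base": 0.65, "with_prs": 0.70},
]
discovery_ids = {"s1", "s2", "s3", "s4"}

gain_with_overlap = delta_auc(replication)
gain_without_overlap = delta_auc(drop_overlap(replication, discovery_ids))
print(gain_with_overlap, gain_without_overlap)
```

In this toy data the apparent gain comes entirely from the shared subjects. The values are constructed to make that point and carry no information about the magnitudes reported in the paper.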

Genetics

Molecular Biology

Electronic Health Records Based Prediction of Future Incidence of Alzheimer’s Disease Using Machine Learning

Ji Park et al. | May 2, 2019

Abstract
Background: Accurate prediction of future incidence of Alzheimer’s disease (AD) may facilitate intervention strategies to delay disease onset. Existing AD risk prediction models require collection of biospecimens (genetic, CSF, or blood samples), cognitive testing, or brain imaging. Conversely, electronic health records (EHR) provide an opportunity to build a completely automated risk prediction model based on individuals’ history of health and healthcare. We tested machine learning models to predict future incidence of AD using administrative EHR in individuals aged 65 or older.
Methods: We obtained de-identified EHR from Korean elders aged above 65 years (N=40,736), collected between 2002 and 2010 in the Korean National Health Insurance Service database system. Consisting of the Participant Insurance Eligibility database, Healthcare Utilization database, and Health Screening database, our EHR contain 4,894 unique clinical features including ICD-10 codes, medication codes, laboratory values, history of personal and family illness, and socio-demographics. Our event of interest was new incidence of AD defined from the EHR based on both AD codes and prescription of anti-dementia medication. Two definitions were considered: a more stringent one requiring a diagnosis and dementia medication, resulting in n=614 cases (“definite AD”), and a more liberal one requiring only diagnostic codes (n=2,026; “probable AD”). We trained and validated a random forest, support vector machine, and logistic regression to predict incident AD in 1, 2, 3, and 4 subsequent years using the EHR available since 2002. The length of the EHR used in the models ranged from 1,571 to 2,239 days. Model training, validation, and testing were done using iterative (5 times), nested, stratified 5-fold cross-validation.
Results: Average duration of EHR was 1,936 days in AD and 2,694 days in controls. For predicting future incidence of AD using the “definite AD” outcome, the machine learning models showed the best performance in 1-year prediction with AUC of 0.781; in 2-year, 0.739; in 3-year, 0.686; in 4-year, 0.662. Using the “probable AD” outcome, the machine learning models showed the best performance in 1-year prediction with AUC of 0.730; in 2-year, 0.645; in 3-year, 0.575; in 4-year, 0.602. Important clinical features selected in logistic regression included hemoglobin level (b=-0.902), age (b=0.689), urine protein level (b=0.303), prescription of Lodopin (antipsychotic drug) (b=0.303), and prescription of Nicametate Citrate (vasodilator) (b=-0.297).
Conclusion: This study demonstrates that EHR can detect risk for incident AD. This approach could enable risk-specific stratification of elders for better targeted clinical trials.
Key Points. Question: Can machine learning be used to predict future incidence of Alzheimer’s disease using electronic health records? Findings: We developed and validated supervised machine learning models using the EHR data from 40,736 South Korean elders (aged above 65 years). Our model showed acceptable accuracy in predicting up to four-year subsequent incidence of AD. Meaning: This study shows the potential utility of administrative EHR data in predicting risk for AD using data-driven machine learning to support physicians at the point of care.
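The "iterative (5 times), nested, stratified 5-fold cross-validation" mentioned in the Methods can be sketched in plain Python as below. This is only an outline of how such index splits could be generated, under stated assumptions: the function names `stratified_folds` and `nested_cv`, the toy labels, and the use of the inner folds for validation (for example, hyperparameter selection) are hypothetical and not taken from the paper.

```python
import random


def stratified_folds(indices, labels, k, rng):
    """Split `indices` into k folds, keeping class proportions roughly equal."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    for members in by_class.values():
        rng.shuffle(members)
        # Deal each class round-robin so every fold gets a similar share.
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds


def nested_cv(labels, k=5, repeats=5, seed=0):
    """Yield (repeat, inner_train, inner_val, outer_test) lists of indices."""
    rng = random.Random(seed)
    everyone = list(range(len(labels)))
    for r in range(repeats):
        # Outer loop: each fold is held out once as the test set.
        for test in stratified_folds(everyone, labels, k, rng):
            held_out = set(test)
            rest = [i for i in everyone if i not in held_out]
            # Inner loop: the remaining data is split again for validation.
            for val in stratified_folds(rest, labels, k, rng):
                val_set = set(val)
                train = [i for i in rest if i not in val_set]
                yield r, train, val, test


# Toy labels: 10 cases and 40 controls (made-up numbers).
labels = [1] * 10 + [0] * 40
splits = list(nested_cv(labels))
print(len(splits))
```

The point of the nesting is that the outer test indices never appear in any inner training or validation set, so model selection cannot see the data used for the final performance estimate.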

Artificial Intelligence

Pharmacology

Machine Learning Prediction of Incidence of Alzheimer’s Disease Using Large-Scale Administrative Health Data

Ji Park et al. | May 2, 2019

A nationwide population-based cohort provides a new opportunity to build a completely automated risk prediction model based on individuals’ history of health and healthcare, beyond existing risk prediction models. We tested whether machine learning models can predict future incidence of Alzheimer’s disease (AD) using large-scale administrative health data. From the Korean National Health Insurance Service database between 2002 and 2010, we obtained de-identified health data in elders above 65 years (N=40,736) containing 4,894 unique clinical features including ICD-10 codes, medication codes, laboratory values, history of personal and family illness, and socio-demographics. To define incident AD, two operational definitions were considered: “definite AD” with diagnostic codes and dementia medication (n=614) and “probable AD” with only diagnosis (n=2,026). We trained and validated a random forest, support vector machine, and logistic regression to predict incident AD in 1, 2, 3, and 4 subsequent years. For predicting future incidence of AD in balanced samples (bootstrapping), the machine learning models showed reasonable performance in 1-year prediction with AUC of 0.775 and 0.759, based on “definite AD” and “probable AD” outcomes, respectively; in 2-year, 0.730 and 0.693; in 3-year, 0.677 and 0.644; in 4-year, 0.725 and 0.683. The results were similar when the entire (unbalanced) samples were used. Important clinical features selected in logistic regression included hemoglobin level, age, and urine protein level. This study may shed light on the utility of data-driven machine learning models based on large-scale administrative health data in AD risk prediction, which may enable better selection of individuals at risk for AD in clinical trials or early detection in clinical settings.
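The abstract mentions evaluating on "balanced samples (bootstrapping)" without giving details. One common way to build such a sample is to draw, with replacement, equal numbers of cases and controls. The sketch below shows that idea only; the function name `balanced_bootstrap`, the toy class sizes, and the choice to match the smaller class size are assumptions, not the authors' documented procedure.

```python
import random


def balanced_bootstrap(labels, rng):
    """Return indices of a bootstrap sample with equal numbers of cases and controls."""
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(cases), len(controls))  # assumed: match the smaller class
    # Sampling with replacement within each class keeps the classes balanced.
    return rng.choices(cases, k=n) + rng.choices(controls, k=n)


# Toy cohort: 6 cases among 100 subjects (made-up numbers).
labels = [1] * 6 + [0] * 94
sample = balanced_bootstrap(labels, random.Random(0))
print(len(sample), sum(labels[i] for i in sample))
```

Repeating the draw with different seeds gives a distribution of performance estimates, which can then be compared with results on the full, unbalanced sample as the abstract describes.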

Artificial Intelligence

Internal Medicine

Diagnosis and Prognosis Using Machine Learning Trained on Brain Morphometry and White Matter Connectomes

Yun Wang et al. | Sep 4, 2018

Accurate, reliable prediction of risk for Alzheimer’s disease (AD) is essential for early, disease-modifying therapeutics. Multimodal MRI, such as structural and diffusion MRI, is likely to contain complementary information on neurodegenerative processes in AD. Here we tested the utility of commonly available multimodal MRI (T1-weighted structure and diffusion MRI), combined with high-throughput brain phenotyping (morphometry and connectomics) and machine learning, as a diagnostic tool for AD. We used, firstly, a clinical cohort at a dementia clinic (study 1: Ilsan Dementia Cohort; N=211; 110 AD, 64 mild cognitive impairment [MCI], and 37 subjective memory complaints [SMC]) to test and validate the diagnostic models; and, secondly, Alzheimer’s Disease Neuroimaging Initiative (ADNI)-2 (study 2) to test the generalizability of the approach and the prognostic models with longitudinal follow-up data. Our machine learning models trained on the morphometric and connectome estimates (number of features=34,646) showed optimal classification accuracy (AD/SMC: 97% accuracy; MCI/SMC: 83% accuracy; AD/MCI: 97% accuracy) with iterative nested cross-validation in a single-site study, outperforming the benchmark model (FLAIR-based white matter hyperintensity volumes). In a generalizability study using ADNI-2, the combined connectome and morphometry model showed similar or superior accuracies (AD/HC: 96%; MCI/HC: 70%; AD/MCI: 75% accuracy) compared with the CSF biomarker model (t-tau, p-tau, amyloid β, and ratios). We also predicted MCI to AD progression with 69% accuracy, compared with the 70% accuracy using the CSF biomarker model. The optimal classification accuracy in a single-site dataset and the reproduced results in a multi-site dataset show the feasibility of high-throughput imaging analysis of multimodal MRI and data-driven machine learning for predictive modeling in AD.

Artificial Intelligence

Biochemistry

Diagnosis and Prognosis of Alzheimer’s Disease Using Brain Morphometry and White Matter Connectomes

Yun Wang et al. | Sep 4, 2018

Accurate, reliable prediction of risk for Alzheimer’s disease (AD) is essential for early, disease-modifying therapeutics. Multimodal MRI, such as structural and diffusion MRI, is likely to contain complementary information on neurodegenerative processes in AD. Here we tested the utility of multimodal MRI (T1-weighted structure and diffusion MRI), combined with high-throughput brain phenotyping (morphometry and structural connectomics) and machine learning, as a diagnostic tool for AD. We used, firstly, a clinical cohort at a dementia clinic (National Health Insurance Service-Ilsan Hospital [NHIS-IH]; N=211; 110 AD, 64 mild cognitive impairment [MCI], and 37 cognitively normal with subjective memory complaints [SMC]) to test the diagnostic models; and, secondly, Alzheimer’s Disease Neuroimaging Initiative (ADNI)-2 to test the generalizability. Our machine learning models trained on the morphometric and connectome estimates (number of features=34,646) showed optimal classification accuracy (AD/SMC: 97% accuracy; MCI/SMC: 83% accuracy; AD/MCI: 97% accuracy) in the NHIS-IH cohort, outperforming a benchmark model (FLAIR-based white matter hyperintensity volumes). In ADNI-2 data, the combined connectome and morphometry model showed similar or superior accuracies (AD/HC: 96%; MCI/HC: 70%; AD/MCI: 75% accuracy) compared with the CSF biomarker model (t-tau, p-tau, amyloid β, and ratios). In predicting MCI to AD progression in a smaller cohort of ADNI-2 (n=60), the morphometry model showed similar performance, with 69% accuracy compared with 70% accuracy for the CSF biomarker model. Our comparison of classifiers trained on structural MRI, diffusion MRI, FLAIR, and CSF biomarkers shows the promising utility of white matter structural connectomes in classifying AD and MCI, in addition to the widely used structural MRI-based morphometry, when combined with machine learning.

Biochemistry

Internal Medicine

Growth-regulated Hsp70 phosphorylation regulates stress responses and prion maintenance

Chung-Hsuan Kao et al. | Sep 5, 2019

Maintenance of protein homeostasis in eukaryotes during normal growth and stress conditions requires the functions of Hsp70 chaperones and associated co-chaperones. Here we investigate an evolutionarily conserved serine phosphorylation that occurs at the site of communication between the nucleotide-binding and substrate-binding domains of Hsp70. Ser151 phosphorylation in yeast Hsp70 (Ssa1) is promoted by cyclin-dependent kinase (Cdk1) during normal growth and dramatically affects heat shock responses, a function conserved with HSC70 S153 phosphorylation in human cells. Phospho-mimic forms of Ssa1 (S151D) also fail to relocalize in response to starvation conditions, do not associate in vivo with the Hsp40 co-chaperones Ydj1 and Sis1, and do not catalyze refolding of denatured proteins in vitro in cooperation with Ydj1 and Hsp104. However, S151 phosphorylation strongly promotes survival of heavy metal exposure and reduces Sup35-dependent [PSI+] prion activity, consistent with proposed roles for Ssa1 and Hsp104 in generating self-nucleating seeds of misfolded proteins. Taken together, these results suggest that Cdk1 downregulates Hsp70 function during periods of active growth, reducing propagation of aggregated proteins despite potential costs to overall chaperone efficiency.

Biochemistry

Molecular Biology

Diagnosis and Prognosis Using Machine Learning Trained on Brain Morphometry and White Matter Connectomes

Yun Wang et al. | Jan 30, 2018

Accurate, reliable prediction of risk for Alzheimer's disease (AD) is essential for early, disease-modifying therapeutics. Multimodal MRI, such as structural and diffusion MRI, may contain multi-dimensional information on neurodegenerative processes in AD. Here we tested the utility of structural MRI and diffusion MRI as imaging markers of AD using high-throughput brain phenotyping, including morphometry and white-matter structural connectome (whole-brain tractography), and machine learning analytics for classification. We used a retrospective cohort collected at a dementia clinic (Ilsan Dementia Cohort; N=211; 110 AD, 64 mild cognitive impairment [MCI], and 37 subjective memory complaints [SMC]). Multimodal MRI was collected (T1, T2-FLAIR, and diffusion MRI) and was used for morphometry, structural connectome, and white matter hyperintensity (WMH) segmentation. Our machine learning model trained on the large-scale brain phenotypes (n=34,646) classified AD, MCI, and SMC with unprecedented accuracy (AD/SMC: 97% accuracy; MCI/SMC: 83% accuracy; AD/MCI: 98% accuracy) with strict iterative nested ten-fold cross-validation. Model comparison revealed that the white-matter structural connectome was the primary contributor compared with conventional volumetric features (e.g., WMH or hippocampal volume). This study indicates promising utility of multimodal MRI, particularly the structural connectome, combined with high-throughput brain phenotyping and machine learning analytics to extract salient features enabling accurate diagnostic prediction.

Artificial Intelligence

Biochemistry
