ResearchHub | Open Science Community

Computationally efficient whole-genome regression for quantitative and binary traits

Joelle Mbatchou et al., May 20, 2021

Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine-learning method called REGENIE for fitting a whole-genome regression model for quantitative and binary phenotypes that is substantially faster than alternatives in multi-trait analyses while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes and requires only local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives, which must load genome-wide matrices into memory. This results in substantial savings in compute time and memory usage. We introduce a fast, approximate Firth logistic regression test for unbalanced case-control phenotypes. The method is ideally suited to take advantage of distributed computing frameworks. We demonstrate the accuracy and computational benefits of this approach using the UK Biobank dataset with up to 407,746 individuals.
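The point about loading only local segments of the genotype matrix can be illustrated with a minimal sketch. It assumes a two-level ridge scheme: ridge regression is fit within each block of variants, and the resulting per-block predictors are then combined by a second ridge fit. The function names, block size, shrinkage values, and simulated data are all hypothetical, and the sketch omits the cross-validation a real analysis would need; it is not the REGENIE implementation.

```python
import numpy as np


def ridge_predict(G, y, lam):
    """In-sample ridge prediction G @ beta, with beta = (G'G + lam*I)^-1 G'y."""
    p = G.shape[1]
    beta = np.linalg.solve(G.T @ G + lam * np.eye(p), G.T @ y)
    return G @ beta


def iter_blocks(G_full, block_size):
    """Yield consecutive column blocks; stands in for reading blocks from disk."""
    for start in range(0, G_full.shape[1], block_size):
        yield G_full[:, start:start + block_size]


def blockwise_predictors(blocks, y, lambdas):
    """Build one predictor column per (block, lambda) pair.

    Only one genotype block is held in memory at a time.
    """
    cols = []
    for G in blocks:
        G = G - G.mean(axis=0)  # center each variant within the block
        for lam in lambdas:
            cols.append(ridge_predict(G, y, lam))
    return np.column_stack(cols)


# Simulated data (hypothetical): 200 samples, 50 variants, one causal variant.
rng = np.random.default_rng(0)
n, m = 200, 50
G_full = rng.binomial(2, 0.3, size=(n, m)).astype(float)
y = 0.5 * G_full[:, 3] + rng.normal(size=n)
y = y - y.mean()

# Level 0: 5 blocks of 10 variants, 2 shrinkage values each -> 10 predictors.
W = blockwise_predictors(iter_blocks(G_full, 10), y, lambdas=[1.0, 10.0])

# Level 1: combine the block predictors with a second ridge fit.
pred = ridge_predict(W, y, lam=1.0)
print(W.shape, pred.shape)
```

The memory saving comes from the generator: each block is processed and reduced to a few predictor columns before the next block is read, so the full genotype matrix never has to be resident at once.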

Genetics

Artificial Intelligence

10

Paper


Genetic variants associated with idiopathic pulmonary fibrosis susceptibility and mortality: a genome-wide association study

Imre Noth et al., Apr 17, 2013

Background: Idiopathic pulmonary fibrosis (IPF) is a devastating disease that probably involves several genetic loci. Several rare genetic variants and one common single nucleotide polymorphism (SNP) of MUC5B have been associated with the disease. Our aim was to identify additional common variants associated with susceptibility and ultimately mortality in IPF.

Methods: First, we did a three-stage genome-wide association study (GWAS): stage one was a discovery GWAS; and stages two and three were independent case-control studies. DNA samples from European-American patients with IPF meeting standard criteria were obtained from several US centres for each stage. Data for European-American control individuals for stage one were gathered from the database of genotypes and phenotypes; additional control individuals were recruited at the University of Pittsburgh to increase the number. For controls in stages two and three, we gathered data for additional sex-matched European-American control individuals who had been recruited in another study. DNA samples from patients and from control individuals were genotyped to identify SNPs associated with IPF. SNPs identified in stage one were carried forward to stage two, and those that achieved genome-wide significance (p<5 × 10⁻⁸) in a meta-analysis were carried forward to stage three. Three case series with follow-up data were selected from stages one and two of the GWAS using samples with follow-up data. Mortality analyses were done in these case series to assess the SNPs associated with IPF that had achieved genome-wide significance in the meta-analysis of stages one and two. Finally, we obtained gene-expression profiling data for lungs of patients with IPF from the Lung Genomics Research Consortium and analysed correlation with SNP genotypes.

Findings: In stage one of the GWAS (542 patients with IPF, 542 control individuals matched one-by-one to cases by genetic ancestry estimates), we identified 20 loci. Six SNPs reached genome-wide significance in stage two (544 patients, 687 control individuals): three TOLLIP SNPs (rs111521887, rs5743894, rs5743890) and one MUC5B SNP (rs35705950) at 11p15.5; one MDGA2 SNP (rs7144383) at 14q21.3; and one SPPL2C SNP (rs17690703) at 17q21.31. Stage three (324 patients, 702 control individuals) confirmed the associations for all these SNPs, except for rs7144383. Linkage disequilibrium between the MUC5B SNP (rs35705950) and TOLLIP SNPs (rs111521887 [r²=0·07], rs5743894 [r²=0·16], and rs5743890 [r²=0·01]) was low. 683 patients from the GWAS were included in the mortality analysis. Individuals who developed IPF despite having the protective TOLLIP minor allele of rs5743890 carried an increased mortality risk (meta-analysis with fixed-effect model: hazard ratio 1·72 [95% CI 1·24–2·38]; p=0·0012). TOLLIP expression was decreased by 20% in individuals carrying the minor allele of rs5743890 (p=0·097), 40% in those with the minor allele of rs111521887 (p=3·0 × 10⁻⁴), and 50% in those with the minor allele of rs5743894 (p=2·93 × 10⁻⁵) compared with homozygous carriers of common alleles for these SNPs.

Interpretation: Novel variants in TOLLIP and SPPL2C are associated with IPF susceptibility. One novel variant of TOLLIP, rs5743890, is also associated with mortality. These associations and the reduced expression of TOLLIP in patients with IPF who carry TOLLIP SNPs emphasise the importance of this gene in the disease.

Funding: National Institutes of Health; National Heart, Lung, and Blood Institute; Pulmonary Fibrosis Foundation; Coalition for Pulmonary Fibrosis; and Instituto de Salud Carlos III.
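As a rough consistency check on the reported mortality result, the standard error of the log hazard ratio can be recovered from the 95% confidence interval and converted to a two-sided p-value. The sketch below assumes a normal approximation on the log scale, which is an assumption and not something the abstract states. The helper names are hypothetical, and `fixed_effect_pool` only shows the usual inverse-variance weighting behind a fixed-effect model; the per-cohort estimates are not given in the abstract, so it is not applied to any data here.

```python
import math


def se_from_ci(lower, upper, z_crit=1.959964):
    """Standard error of log(HR) implied by a 95% CI, assuming normality on the log scale."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z_crit)


def two_sided_p(estimate, se):
    """Two-sided p-value for a normal (Wald) test of estimate / se."""
    z = estimate / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def fixed_effect_pool(log_hrs, ses):
    """Inverse-variance weighted pooled log(HR) and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, log_hrs)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# Values taken from the abstract: HR 1.72, 95% CI 1.24 to 2.38.
log_hr = math.log(1.72)
se = se_from_ci(1.24, 2.38)
p_value = two_sided_p(log_hr, se)
print(f"log(HR)={log_hr:.3f}, SE={se:.3f}, p={p_value:.4f}")
```

Under these assumptions the p-value comes out at roughly 0·001, in line with the reported p=0·0012 given that the hazard ratio and interval are rounded to two decimals.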

Genetics

Epidemiology

0

Paper


Computationally efficient whole genome regression for quantitative and binary traits

Joelle Mbatchou et al., Jun 20, 2020

Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine-learning method called REGENIE for fitting a whole-genome regression model that is orders of magnitude faster than alternatives while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes and requires only local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives, which must load genome-wide matrices into memory. This results in substantial savings in compute time and memory usage. The method is applicable to both quantitative and binary phenotypes, including rare-variant analysis of binary traits with unbalanced case-control ratios, for which we introduce a fast, approximate Firth logistic regression test. The method is ideally suited to take advantage of distributed computing frameworks. We demonstrate the accuracy and computational benefits of this approach compared to several existing methods using quantitative and binary traits from the UK Biobank dataset with up to 407,746 individuals.

Genetics

Artificial Intelligence

129

Paper


A framework for evaluating edited cell libraries created by massively parallel genome engineering

Simon Cawley et al., Sep 23, 2021

Genome engineering methodologies are transforming biological research and discovery. Approaches based on CRISPR technology have been broadly adopted and there is growing interest in the generation of massively parallel edited cell libraries. Comparing the libraries generated by these varying approaches is challenging and researchers lack a common framework for defining and assessing the characteristics of these libraries. Here we describe a framework for evaluating massively parallel libraries of edited genomes based on established methods for sampling complex populations. We define specific attributes and metrics that are informative for describing a complex cell library and provide examples for estimating these values. We also connect this analysis to generic phenotyping approaches, using either pooled (typically via a selection assay) or isolate (often referred to as screening) phenotyping approaches. We approach this from the context of creating massively parallel, precisely edited libraries with one edit per cell, though the approach holds for other types of modifications, including libraries containing multiple edits per cell (combinatorial editing). This framework is a critical component for evaluating and comparing new technologies as well as understanding how a massively parallel edited cell library will perform in a given phenotyping approach.

Genetics

Artificial Intelligence

18

Paper
