ResearchHub | Open Science Community

The mutational constraint spectrum quantified from variation in 141,456 humans

Konrad Karczewski et al., May 27, 2020

Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes that are crucial for the function of an organism will be depleted of such variants in natural populations, whereas non-essential genes will tolerate their accumulation. However, predicted loss-of-function variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes [1]. Here we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence predicted loss-of-function variants in this cohort after filtering for artefacts caused by sequencing and annotation errors. Using an improved model of human mutation rates, we classify human protein-coding genes along a spectrum that represents tolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve the power of gene discovery for both common and rare diseases.

Genetics

Molecular Biology


Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index

Elizabeth Speliotes et al., Oct 10, 2010

Genetics

Endocrinology


Hundreds of variants clustered in genomic loci and biological pathways affect human height

Hana Allen et al., Sep 29, 2010

Genetics

Biology


Efficient Bayesian mixed-model analysis increases association power in large cohorts

Po-Ru Loh et al., Feb 2, 2015

Alkes Price, Po-Ru Loh and colleagues report the BOLT-LMM method for mixed-model association. They apply their method to 9 quantitative traits in 23,294 samples and demonstrate that it provides improvements in computational efficiency as well as gains in power that increase with the size of the cohort, making it useful for the analysis of large cohorts. Linear mixed models are a powerful statistical tool for identifying genetic associations and avoiding confounding. However, existing methods are computationally intractable in large cohorts and may not optimize power. All existing methods require time cost O(MN²) (where N is the number of samples and M is the number of SNPs) and implicitly assume an infinitesimal genetic architecture in which effect sizes are normally distributed, which can limit power. Here we present a far more efficient mixed-model association method, BOLT-LMM, which requires only a small number of O(MN) time iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes. We applied BOLT-LMM to 9 quantitative traits in 23,294 samples from the Women's Genome Health Study (WGHS) and observed significant increases in power, consistent with simulations. Theory and simulations show that the boost in power increases with cohort size, making BOLT-LMM appealing for genome-wide association studies in large cohorts.

Genetics

Biology


Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program

Daniel Taliun et al., Feb 10, 2021

The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes) [1]. In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.

Genetics

Molecular Biology


Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores

Bjarni Vilhjálmsson et al., Oct 1, 2015

Polygenic risk scores have shown great promise in predicting complex disease risk and will become more accurate as training sample sizes increase. The standard approach for calculating risk scores involves linkage disequilibrium (LD)-based marker pruning and applying a p value threshold to association statistics, but this discards information and can reduce predictive accuracy. We introduce LDpred, a method that infers the posterior mean effect size of each marker by using a prior on effect sizes and LD information from an external reference panel. Theory and simulations show that LDpred outperforms the approach of pruning followed by thresholding, particularly at large sample sizes. Accordingly, predicted R² increased from 20.1% to 25.3% in a large schizophrenia dataset and from 9.8% to 12.0% in a large multiple sclerosis dataset. A similar relative improvement in accuracy was observed for three additional large disease datasets and for non-European schizophrenia samples. The advantage of LDpred over existing methods will grow as sample sizes increase.
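For readers unfamiliar with the baseline mentioned in this abstract, the following is a minimal sketch of pruning followed by thresholding, not the LDpred method and not the authors' code. It assumes a greedy, p-value-informed form of pruning (often called clumping), which is one common variant. The function names, the toy effect sizes, the pairwise LD matrix and both thresholds are hypothetical values chosen for illustration only.

```python
from typing import List, Sequence


def prune_and_threshold(
    pvalues: Sequence[float],
    ld_r2: Sequence[Sequence[float]],
    p_threshold: float,
    r2_threshold: float,
) -> List[int]:
    """Return indices of markers kept by greedy pruning plus thresholding.

    Markers with p-value above p_threshold are dropped. The rest are
    visited from most to least significant, and a marker is kept only
    if its LD (r^2) with every already-kept marker is below r2_threshold.
    """
    candidates = [i for i, p in enumerate(pvalues) if p <= p_threshold]
    candidates.sort(key=lambda i: pvalues[i])
    kept: List[int] = []
    for i in candidates:
        if all(ld_r2[i][j] < r2_threshold for j in kept):
            kept.append(i)
    return kept


def risk_score(
    genotypes: Sequence[int], betas: Sequence[float], kept: Sequence[int]
) -> float:
    """Sum of allele count times marginal effect estimate over kept markers."""
    return sum(betas[i] * genotypes[i] for i in kept)


# Toy data (hypothetical): markers 0 and 1 are in strong LD with each other.
betas = [0.5, 0.4, -0.25, 0.1]
pvalues = [1e-6, 1e-4, 0.03, 0.6]
ld_r2 = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
kept = prune_and_threshold(pvalues, ld_r2, p_threshold=0.05, r2_threshold=0.2)
score = risk_score([2, 1, 1, 0], betas, kept)
```

In this toy case marker 1 is discarded because it is in LD with the more significant marker 0, and marker 3 is discarded by the p-value threshold. That loss of information from discarded markers is what the abstract says LDpred is designed to avoid.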

Genetics

Artificial Intelligence


Understanding multicellular function and disease with human tissue-specific networks

Casey Greene et al., Apr 27, 2015

Olga Troyanskaya and colleagues present genome-wide functional interaction networks for 144 human tissues and cell types. They identify important disease-gene associations by combining data from GWAS and tissue-specific networks. They also developed a webserver, GIANT, that includes multi-gene query capability, network visualization and analysis tools. Tissue and cell-type identity lie at the core of human physiology and disease. Understanding the genetic underpinnings of complex tissues and individual cell lineages is crucial for developing improved diagnostics and therapeutics. We present genome-wide functional interaction networks for 144 human tissues and cell types developed using a data-driven Bayesian methodology that integrates thousands of diverse experiments spanning tissue and disease states. Tissue-specific networks predict lineage-specific responses to perturbation, identify the changing functional roles of genes across tissues and illuminate relationships among diseases. We introduce NetWAS, which combines genes with nominally significant genome-wide association study (GWAS) P values and tissue-specific networks to identify disease-gene associations more accurately than GWAS alone. Our webserver, GIANT, provides an interface to human tissue networks through multi-gene queries, network visualization, analysis tools including NetWAS and downloadable networks. GIANT enables systematic exploration of the landscape of interacting genes that shape specialized cellular functions across more than a hundred human tissues and cell types.

Genetics

Molecular Biology


A structural variation reference for medical and population genetics

Ryan Collins et al., May 27, 2020

Structural variants (SVs) rearrange large segments of DNA [1] and can have profound consequences in evolution and human disease [2,3]. As national biobanks, disease-association studies, and clinical genetic testing have grown increasingly reliant on genome sequencing, population references such as the Genome Aggregation Database (gnomAD) [4] have become integral in the interpretation of single-nucleotide variants (SNVs) [5]. However, there are no reference maps of SVs from high-coverage genome sequencing comparable to those for SNVs. Here we present a reference of sequence-resolved SVs constructed from 14,891 genomes across diverse global populations (54% non-European) in gnomAD. We discovered a rich and complex landscape of 433,371 SVs, from which we estimate that SVs are responsible for 25–29% of all rare protein-truncating events per genome. We found strong correlations between natural selection against damaging SNVs and rare SVs that disrupt or duplicate protein-coding sequence, which suggests that genes that are highly intolerant to loss-of-function are also sensitive to increased dosage [6]. We also uncovered modest selection against noncoding SVs in cis-regulatory elements, although selection against protein-truncating SVs was stronger than all noncoding effects. Finally, we identified very large (over one megabase), rare SVs in 3.9% of samples, and estimate that 0.13% of individuals may carry an SV that meets the existing criteria for clinically important incidental findings [7]. This SV resource is freely distributed via the gnomAD browser [8] and will have broad utility in population genetics, disease-association studies, and diagnostic screening.

Genetics

Cancer Research


Multiple loci associated with indices of renal function and chronic kidney disease

Anna Köttgen et al., May 10, 2009

Chronic kidney disease (CKD) has a heritable component and is an important global public health problem because of its high prevalence and morbidity. We conducted genome-wide association studies (GWAS) to identify susceptibility loci for glomerular filtration rate, estimated by serum creatinine (eGFRcrea) and cystatin C (eGFRcys), and CKD (eGFRcrea < 60 ml/min/1.73 m²) in European-ancestry participants of four population-based cohorts (ARIC, CHS, FHS, RS; n = 19,877; 2,388 CKD cases), and tested for replication in 21,466 participants (1,932 CKD cases). We identified significant SNP associations (P < 5 × 10⁻⁸) with CKD at the UMOD locus, with eGFRcrea at UMOD, SHROOM3 and GATM-SPATA5L1, and with eGFRcys at CST and STC1. UMOD encodes the most common protein in human urine, Tamm-Horsfall protein, and rare mutations in UMOD cause mendelian forms of kidney disease. Our findings provide new insights into CKD pathogenesis and underscore the importance of common genetic variants influencing renal function and disease.

Genetics

Molecular Biology


Rare and low-frequency coding variants alter human adult height

Eirini Marouli et al., Jan 31, 2017

Height is a highly heritable, classic polygenic trait with approximately 700 common associated variants identified through genome-wide association studies so far. Here, we report 83 height-associated coding variants with lower minor-allele frequencies (in the range of 0.1–4.8%) and effects of up to 2 centimetres per allele (such as those in IHH, STC2, AR and CRISPLD2), greater than ten times the average effect of common variants. In functional follow-up studies, rare height-increasing alleles of STC2 (giving an increase of 1–2 centimetres per allele) compromised proteolytic inhibition of PAPP-A and increased cleavage of IGFBP-4 in vitro, resulting in higher bioavailability of insulin-like growth factors. These 83 height-associated variants overlap genes that are mutated in monogenic growth disorders and highlight new biological candidates (such as ADAMTS3, IL11RA and NOX4) and pathways (such as proteoglycan and glycosaminoglycan synthesis) involved in growth. Our results demonstrate that sufficiently large sample sizes can uncover rare and low-frequency variants of moderate-to-large effect associated with polygenic human phenotypes, and that these variants implicate relevant genes and pathways. Data from over 700,000 individuals reveal the identity of 83 sequence variants that affect human height, implicating new candidate genes and pathways as being involved in growth. As a highly heritable polygenic trait, human height has provided a model for the genetic analysis of complex traits. So far about 700 common genetic variants have been linked to height through genome-wide association studies, but the role of low-frequency and rare variants has not been systematically explored. Guillaume Lettre, Joel Hirschhorn and colleagues in the GIANT Consortium now report their analysis of coding regions in the genomes of 711,418 individuals. They identify 120 loci newly associated with height, including 32 rare and 51 low-frequency coding variants. They highlight 83 candidate genes with low-frequency height-associated variants and implicate biological pathways with known roles in growth disorders as well as new candidates. Their analyses provide insights into the genomic architecture of human height.

Genetics

Molecular Biology
