ResearchHub | Open Science Community

Computational Pan-Genomics: Status, Promises and Challenges

Tobias Marschall et al.Mar 12, 2016

Abstract Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens , the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic datasets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics , a new sub-area of research in computational biology. In this paper, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies, and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.

Genetics

Artificial Intelligence

0

Paper

Save

A high-quality reference panel reveals the complexity and distribution of structural genome changes in a human population

Jayne Hehir‐Kwa et al.Jan 18, 2016

Structural variation (SV) represents a major source of differences between individual human genomes and has been linked to disease phenotypes. However, the majority of studies provide neither a global view of the full spectrum of these variants nor integrate them into reference panels of genetic variation. Here, we analyse whole genome sequencing data of 769 individuals from 250 Dutch families, and provide a haplotype-resolved map of 1.9 million genome variants across 9 different variant classes, including novel forms of complex indels, and retrotransposition-mediated insertions of mobile elements and processed RNAs. A large proportion are previously under reported variants sized between 21 and 100bp. We detect 4 megabases of novel sequence, encoding 11 new transcripts. Finally, we show 191 known, trait-associated SNPs to be in strong linkage disequilibrium with SVs and demonstrate that our panel facilitates accurate imputation of SVs in unrelated individuals. Our findings are essential for genome-wide association studies.

Genetics

Molecular Biology

0

Paper

Save

Enhancing sensitivity and controlling false discovery rate in somatic indel discovery

Johannes Köster et al.Aug 21, 2019

As witnessed by various population-scale cancer genome sequencing projects, accurate discovery of somatic variants has become of central importance in modern cancer research. However, count statistics on somatic insertions and deletions (indels) discovered so far point out that large amounts of discoveries must have been missed. The reason is that the combination of uncertainties relating to, for example, gap and alignment ambiguities, twilight zone indels, cancer heterogeneity, sample purity, sampling and strand bias are hard to accurately quantify. Here, a unifying statistical model is provided whose dependency structures enable to accurately quantify all inherent uncertainties in short time. As major consequence, false discovery rate (FDR) in somatic indel discovery can now be controlled at utmost accuracy. As demonstrated on simulated and real data, this enables to dramatically increase the amount of true discoveries while safely suppressing the FDR. Specifically supported by workflow design, our approach can be integrated as a post-processing step in large-scale projects.

Genetics

Molecular Biology

0

Paper

Save

Enhancing Sensitivity And Controlling False Discovery Rate In Somatic Indel Discovery Using A Latent Variable Model

Louis Dijkstra et al.Mar 29, 2017

Cancer is a genetic disorder in the first place. Therefore, next-generation sequencing (NGS) based discovery of somatically acquired genetic variants has gained widespread attention. Computational prediction of somatic variants, however, is affected by a variety of confounding factors. In addition to the uncertainties that one commonly encounters also in germline variation prediction, such as misplaced and/or inaccurate read alignments, cancer heterogeneity and impure samples significantly add to the issues. Overall, this hampers state-of-the-art indel discovery tools to discover somatic indels at operable performance rates, although they perform excellently when calling germline indels. While affecting all size ranges, both common and cancer-specific problems interfere in particularly unfavorable ways in the prediction of somatic midsize (30-150 bp) insertions and deletions. Here, we present a latent variable model that can take the major confounding factors and uncertainties into a unifying account. Using this modeling framework, we first demonstrate how to efficiently compute the probability for a (putative) indel to be somatic, thereby resolving a principled computational runtime bottleneck in Bayesian uncertainty quantification. Second, we show how to reliably estimate the allele frequencies for a given list of indels. Third, we also present an intuitive and effective way to control the false discovery rate, an issue in genetic variant discovery that has been found notoriously hard to deal with. As a tool that implements all methodology developed, we present PROSIC (PROcessing Somatic Indel Calls). PROSIC achieves significant improvements in particular in terms of recall when applied to deletion call sheets, as provided by prevalent state-of-the-art tools, in comparison to their integrated somatic indel calling routines.

Genetics

Artificial Intelligence

0

Paper

Genetics

Artificial Intelligence

0

Save