ResearchHub | Open Science Community

Samplot: A Platform for Structural Variant Visual Validation and Automated Filtering

Jonathan Belyeu et al. Sep 25, 2020

Abstract: Visual validation is an essential step to minimize false positive predictions resulting from structural variant (SV) detection. We present Samplot, a tool for quickly creating images that display the read depth and sequence alignments necessary to adjudicate purported SVs across multiple samples and sequencing technologies, including short, long, and phased reads. These simple images can be rapidly reviewed to curate large SV call sets. Samplot is easily applicable to many biological problems such as prioritization of potentially causal variants in disease studies, family-based analysis of inherited variation, or de novo SV review. Samplot also includes a trained machine learning package that dramatically decreases the number of false positives without human review. Samplot is available via the conda package manager or at https://github.com/ryanlayer/samplot. Contact: Ryan Layer, Ph.D., Assistant Professor, University of Colorado Boulder, ryan.layer@colorado.edu.

Genetics

Artificial Intelligence

2

Paper


De novo structural mutation rates and gamete-of-origin biases revealed through genome sequencing of 2,396 families

Jonathan Belyeu et al. Oct 8, 2020

Abstract: Each human genome includes de novo mutations that arose during gametogenesis. While these germline mutations represent a fundamental source of new genetic diversity, they can also create deleterious alleles that impact fitness. The germline mutation rate for single nucleotide variants and factors that significantly influence this rate, such as parental age, are now well established. However, far less is known about the frequency, distribution, and features that impact de novo structural mutations. We report a large, family-based study of germline mutations, excluding aneuploidy, that affect genome structure among 572 genomes from 33 families in a multigenerational CEPH-Utah cohort and 2,363 cases of non-familial autism spectrum disorder (ASD), 1,938 unaffected siblings, and both parents (9,599 genomes in total). We find that de novo structural mutations detected by alignment-based, short-read WGS occurred at an overall rate of at least 0.160 events per genome in unaffected individuals, and that this rate was significantly higher (0.206 per genome) in ASD cases. In both probands and unaffected samples, nearly 73% of de novo structural mutations arose in paternal gametes, and we predict most de novo structural mutations to be caused by mutational mechanisms that do not require sequence homology. After multiple testing correction, we did not observe a statistically significant correlation between parental age and the rate of de novo structural variation in offspring. These results highlight that a spectrum of mutational mechanisms contributes to germline structural mutations, and that these mechanisms likely have markedly different rates and selective pressures than those leading to point mutations.

Genetics

Plant Science

93

Paper


Go Get Data (GGD): simple, reproducible access to scientific data

Michael Cormier et al. Sep 11, 2020

Abstract: Genomics research is complicated by the inherent difficulty of collecting, transforming, and integrating the numerous datasets and annotations germane to one’s research. Furthermore, these data exist in disparate sources, and are stored in numerous, often abused formats from multiple genome builds. Since these complexities waste time, inhibit reproducibility, and curtail research creativity, we developed Go Get Data (GGD; https://gogetdata.github.io/) as a fast, reproducible approach to installing standardized data recipes.

Philosophy

Molecular Biology

0

Paper


Unfazed: parent-of-origin detection for large and small de novo variants

Jonathan Belyeu et al. Feb 3, 2021

Abstract. Summary: Unfazed is a command-line tool to determine the parental gamete of origin for de novo mutations from paired-end Illumina DNA sequencing reads. Unfazed uses variant information for a sequenced trio to identify the parental gamete of origin by linking phase-informative inherited variants to de novo mutations using read-based phasing. It achieves a high success rate by chaining reads into haplotype groups, thus increasing the search space for informative sites. Unfazed provides a simple command-line interface and scales well to large inputs, determining parent-of-origin for nearly 30,000 de novo variants in under 60 hours. Availability: Unfazed is available at https://github.com/ibelyeu/unfazed.
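
The idea of linking a phase-informative inherited variant to a de novo mutation through shared reads can be illustrated with a small sketch. This is not Unfazed's code or interface: the function names, the tuple-based genotype and read representations, and the simple majority vote are all assumptions made for illustration, and the read-chaining step mentioned in the abstract is omitted.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Genotype = Tuple[str, str]  # hypothetical representation: the two alleles at one site


def allele_origins(child: Genotype, father: Genotype,
                   mother: Genotype) -> Optional[Dict[str, str]]:
    """Map each child allele at a heterozygous site to the parent it came from.

    Returns None when the site is not phase-informative (child homozygous,
    both transmissions possible, or genotypes Mendelian-inconsistent).
    """
    a, b = child
    if a == b:
        return None
    a_paternal = a in father and b in mother  # a from father, b from mother
    b_paternal = b in father and a in mother  # b from father, a from mother
    if a_paternal == b_paternal:
        return None  # ambiguous or inconsistent
    if a_paternal:
        return {a: "paternal", b: "maternal"}
    return {b: "paternal", a: "maternal"}


def parent_of_origin(reads: List[Tuple[str, str]], dnm_allele: str,
                     origins: Dict[str, str]) -> Optional[str]:
    """Vote on the haplotype carrying the de novo allele.

    Each read is (allele seen at the de novo site, allele seen at the
    informative site); both sites must be covered by the same read or pair.
    """
    votes: Counter = Counter()
    for dnm_obs, site_obs in reads:
        if site_obs not in origins:
            continue  # sequencing error or uninformative allele
        origin = origins[site_obs]
        if dnm_obs != dnm_allele:
            # A read without the de novo allele lies on the other haplotype.
            origin = "maternal" if origin == "paternal" else "paternal"
        votes[origin] += 1
    if not votes:
        return None
    (top, n), *rest = votes.most_common()
    if rest and rest[0][1] == n:
        return None  # tie: no call
    return top


# Toy trio: child A/G, father A/A, mother G/G, so A is paternal.
origins = allele_origins(("A", "G"), ("A", "A"), ("G", "G"))
reads = [("T", "A"), ("T", "A"), ("C", "G")]  # "T" is the de novo allele
result = parent_of_origin(reads, "T", origins)
```

In this toy case every read supports the paternal haplotype. A real tool would also need to handle base qualities, structural variants, and sites too far apart to share a read.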

Genetics

Molecular Biology

30

Paper


Sawfish: Improving long-read structural variant discovery and genotyping with local haplotype modeling

Christopher Saunders et al. Aug 20, 2024

Motivation: Structural variants (SVs) play an important role in evolutionary and functional genomics but are challenging to characterize. High-accuracy, long-read sequencing can substantially improve SV characterization when coupled with effective calling methods. While state-of-the-art long-read SV callers are highly accurate, further improvements are achievable by systematically modeling local haplotypes during SV discovery and genotyping. Results: We describe sawfish, an SV caller for mapped high-quality long reads incorporating systematic SV haplotype modeling to improve accuracy and resolution. Assessment against the draft Genome in a Bottle (GIAB) SV benchmark from the T2T-HG002-Q100 diploid assembly shows that sawfish has the highest accuracy among state-of-the-art long-read SV callers across every tested SV size group. Additionally, sawfish maintains the highest accuracy at every tested depth level from 10- to 32-fold coverage, such that other callers required at least 30-fold coverage to match sawfish accuracy at 15-fold coverage. Sawfish also shows the highest accuracy in the GIAB challenging medically relevant genes benchmark, demonstrating improvements in both comprehensive and medically relevant contexts. When joint-genotyping 10 samples from CEPH-1463, sawfish has over 9000 more pedigree-concordant calls than other state-of-the-art SV callers, with the highest proportion of concordant SVs (78%) as well. Sawfish's quality model can be used to select for an even higher proportion of concordant SVs (86%), while still calling over 5000 more pedigree-concordant SVs than other callers. These results demonstrate that sawfish improves on the state-of-the-art for long-read SV calling accuracy across both individual and joint-sample analyses. Availability: Sawfish is released as a pre-compiled Linux binary and user guide on GitHub: https://github.com/PacificBiosciences/sawfish.

Genetics

Molecular Biology

0

Paper


Sawfish: Improving long-read structural variant discovery and genotyping with local haplotype modeling

Christopher Saunders et al. Apr 9, 2025

Abstract. Motivation: Structural variants (SVs) play an important role in evolutionary and functional genomics but are challenging to characterize. High-accuracy, long-read sequencing can substantially improve SV characterization when coupled with effective calling methods. While state-of-the-art long-read SV callers are highly accurate, further improvements are achievable by systematically modeling local haplotypes during SV discovery and genotyping. Results: We describe sawfish, an SV caller for mapped high-quality long reads incorporating systematic SV haplotype modeling to improve accuracy and resolution. Assessment against the draft Genome in a Bottle (GIAB) SV benchmark from the T2T-HG002-Q100 diploid assembly shows that sawfish has the highest accuracy among state-of-the-art long-read SV callers across every tested SV size group. Additionally, sawfish maintains the highest accuracy at every tested depth level from 10- to 32-fold coverage, such that other callers required at least 30-fold coverage to match sawfish accuracy at 15-fold coverage. Sawfish also shows the highest accuracy in the GIAB challenging medically relevant genes benchmark, demonstrating improvements in both comprehensive and medically relevant contexts. When joint-genotyping 7 samples from CEPH-1463, sawfish has over 9000 more pedigree-concordant calls than other state-of-the-art SV callers, with the highest proportion of concordant SVs (81%). Sawfish’s quality model enables selection for an even higher proportion of concordant SVs (88%), while still calling nearly 5000 more pedigree-concordant SVs than other callers. These results demonstrate that sawfish improves on the state-of-the-art for long-read SV calling accuracy across both individual and joint-sample analyses. Availability: Sawfish source code, pre-compiled Linux binaries, and documentation are released on GitHub: https://github.com/PacificBiosciences/sawfish.
Supplementary information: Supplementary data are available at Bioinformatics online.

Genetics

Aquatic Science

0

Paper


SV-plaudit: A cloud-based framework for manually curating thousands of structural variants

Jonathan Belyeu et al. Feb 14, 2018

SV-plaudit is a framework for rapidly curating structural variant (SV) predictions. For each SV, we generate an image that visualizes the coverage and alignment signals from a set of samples. Images are uploaded to our cloud framework where users assess the quality of each image using a client-side web application. Reports can then be generated as a tab-delimited file or annotated VCF. As a proof of principle, nine researchers collaborated for one hour to evaluate 1,350 SVs each. We anticipate that SV-plaudit will become a standard step in variant calling pipelines and the crowd-sourced curation of other biological results. Code is available at https://github.com/jbelyeu/SV-plaudit. A demonstration video is available at https://www.youtube.com/watch?v=ono8kHMKxDs

Genetics

Artificial Intelligence

0

Paper


XPRESSyourself: Enhancing, Standardizing, and Automating Ribosome Profiling Computational Analyses Yields Improved Insight into Data

Jordan Berg et al. Jul 16, 2019

Ribosome profiling, an application of nucleic acid sequencing for monitoring ribosome activity, has revolutionized our understanding of protein translation dynamics. This technique has been available for a decade, yet the current state and standardization of publicly available computational tools for these data is bleak. We introduce XPRESSyourself, an analytical toolkit that eliminates barriers and bottlenecks associated with this specialized data type by filling gaps in the computational toolset for both experts and non-experts of ribosome profiling. XPRESSyourself automates and standardizes analysis procedures, decreasing time-to-discovery and increasing reproducibility. This toolkit acts as a reference implementation of current best practices in ribosome profiling analysis. We demonstrate this toolkit’s performance on publicly available ribosome profiling data by rapidly identifying hypothetical mechanisms related to neurodegenerative phenotypes and neuroprotective mechanisms of the small-molecule ISRIB during acute cellular stress. XPRESSyourself brings robust, rapid analysis of ribosome-profiling data to a broad and ever-expanding audience and will lead to more reproducible and accessible measurements of translation regulation. 
XPRESSyourself software is perpetually open-source under the GPL-3.0 license and is hosted at , where users can access additional documentation and report software issues.
Abbreviations: AWS, Amazon Web Services; BAM, Binary Sequence Alignment Map; BED, Browser Extensible Data; cDNA, complementary DNA; CDS, coding sequence of gene; ChIP-seq, chromatin immunoprecipitation sequencing; CPU, central processing unit; dbGaP, Database of Genotypes and Phenotypes; DNA, deoxyribonucleic acid; FDR, false discovery rate; FPKM, fragments per kilobase of transcript per million; GEO, Gene Expression Omnibus; GTF, General Transfer Format; IGV, Integrative Genomics Viewer; ISR, integrated stress response; ISRIB, ISR inhibitor; mRNA, messenger RNA; nt, nucleotide; PCA, principal component analysis; PCR, polymerase chain reaction; RAM, random access memory; RNA, ribonucleic acid; RNA-Seq, RNA sequencing; RPKM, reads per kilobase of transcript per million; RPM, reads per million; rRNA, ribosomal RNA; TCGA, The Cancer Genome Atlas; TE, translation efficiency; TPM, transcripts per million; UMI, unique molecular identifier; UTR, untranslated region; VCF, Variant Call Format

Biochemistry

Molecular Biology

0

Paper


Identification of allele-specific KIV-2 repeats and impact on Lp(a) measurements for cardiovascular disease risk

Sairam Behera et al. Apr 27, 2023

The abundance of Lp(a) protein, which is directly impacted by the copy number (CN) of KIV-2, a 5.5 kbp sub-region, holds significant implications for the risk of cardiovascular disease (CVD). KIV-2 is highly polymorphic in the population, and accurate analysis is challenging. In this study, we present the DRAGEN KIV-2 CN caller, which utilizes short reads. Data across 166 WGS samples show that the caller has high accuracy compared to optical mapping and can further phase ~50% of the samples. We compared KIV-2 CN to 24 previously postulated KIV-2-relevant SNVs, revealing that many are ineffective predictors of KIV-2 copy number. Population studies, including USA-based cohorts, showed distinct KIV-2 CN distributions for European-, African-, and Hispanic-American populations and further underscored the limitations of SNV predictors. We demonstrate that the CN estimates correlate significantly with the available Lp(a) protein levels and that phasing is highly important.

Genetics

Molecular Biology

1

Paper
