ResearchHub | Open Science Community

ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter

Shaun Jackman et al.Feb 23, 2017

The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps toward elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species, between or within individuals critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically, and coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50-bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its redesign, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We benchmarked ABySS 2.0 human genome assembly using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual. Our assembly yielded a NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using <35 GB of RAM. This is a modest memory requirement by today's standards and is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics’ Chromium data to further improve the scaffold NG50 (NGA50) of this assembly to 42 (15) Mbp.

Genetics

Molecular Biology

1

Paper

Save

ntHits: de novo repeat identification of genomics data using a streaming approach

Hamid Mohamadi et al.Nov 4, 2020

Abstract Motivation Repeat elements such as satellites, transposons, high number of gene copies, and segmental duplications are abundant in eukaryotic genomes. They often induce many local alignments, complicating sequence assembly and comparisons between genomes and analysis of large-scale duplications and rearrangements. Hence, identification and classification of repeats is a fundamental step in many genomics applications and their downstream analysis tools. Results In this work, we present an efficient streaming algorithm and software tool, ntHits, for de novo repeat identification based on the statistical analysis of the k -mer content profile of large-scale DNA sequencing data. In the proposed algorithm, we first obtain the k -mer coverage histograms of input datasets using the ntCard algorithm, an efficient streaming algorithm for estimating the k -mer coverage histograms. From the obtained k -mer coverage histogram, the repetitive k -mers would present a long tail to the distribution of k -mer coverage profile. Experimental results show that ntHits can efficiently and accurately identify the repeat content in large-scale DNA sequencing data. For example, ntHits accurately identifies the repeat k -mers in the white spruce sequencing data set with 96× sequencing coverage in about 12 hours and using less than 150GB of memory, while using the exact methods for reporting the repeated k -mers takes several days and terabytes of memory and disk space. Availability ntHits is written in C++ and is released under the MIT License. It is freely available at https://github.com/bcgsc/ntHits . Contact hmohamadi@bcgsc.ca

Genetics

Artificial Intelligence

1

Paper

Save

Overlapping long sequence reads: Current innovations and challenges in developing sensitive, specific and scalable algorithms

Justin Chu et al.Oct 17, 2016

Identifying overlaps between error-prone long reads, specifically those from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PB), is essential for certain downstream applications, including error correction and de novo assembly. Though akin to the read-to-reference alignment problem, read-to-read overlap detection is a distinct problem that can benefit from specialized algorithms that perform efficiently and robustly on high error rate long reads. Here, we review the current state-of-the-art read-to-read overlap tools for error-prone long reads, including BLASR, DALIGNER, MHAP, GraphMap, and Minimap. These specialized bioinformatics tools differ not just in their algorithmic designs and methodology, but also in their robustness of performance on a variety of datasets, time and memory efficiency, and scalability. We highlight the algorithmic features of these tools, as well as their potential issues and biases when utilizing any particular method. We benchmarked these tools, tracking their resource needs and computational performance, and assessed the specificity and precision of each. The concepts surveyed may apply to future sequencing technologies, as scalability is becoming more relevant with increased sequencing throughput.

Genetics

Artificial Intelligence

0

Paper

Genetics

Artificial Intelligence

0

Save

0

Improving on hash-based probabilistic sequence classification using multiple spaced seeds and multi-index Bloom filters

Justin Chu et al.Oct 5, 2018

Alignment-free classification of sequences against collections of sequences has enabled high-throughput processing of sequencing data in many bioinformatics analysis pipelines. Originally hash-table based, much work has been done to improve and reduce the memory requirement of indexing of k-mer sequences with probabilistic indexing strategies. These efforts have led to lower memory highly efficient indexes, but often lack sensitivity in the face of sequencing errors or polymorphism because they are k-mer based. To address this, we designed a new memory efficient data structure that can tolerate mismatches using multiple spaced seeds, called a multi-index Bloom Filter. Implemented as part of BioBloom Tools, we demonstrate our algorithm in two applications, read binning for targeted assembly and taxonomic read assignment. Our tool shows a higher sensitivity and specificity for read-binning than BWA MEM at an order of magnitude less time. For taxonomic classification, we show higher sensitivity than CLARK-S at an order of magnitude less time while using half the memory.

Genetics

Artificial Intelligence

0

Paper

Genetics

Artificial Intelligence

0

Save

0

ntEdit: scalable genome sequence polishing

Warren Rm et al.Mar 26, 2019

In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes. We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled E. coli and C. elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20X), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14s and <3m, on average, on E. coli and C. elegans, respectively. We performed similar benchmarks on a sub-20X coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and whole genome. ntEdit scaled linearly, executing in 30-40m on those sequences. We show how ntEdit ran in <2h20m to improve upon long and linked read human genome assemblies of NA12878, using high coverage (54X) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo haploid assemblies of the 20 Gbp interior and white spruce genomes in<4 and<5h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024. Availability: https://github.com/bcgsc/ntedit

Genetics

Molecular Biology

0

Paper

Save

ABySS 2.0: Resource-Efficient Assembly of Large Genomes using a Bloom Filter

Shaun Jackman et al.Aug 7, 2016

The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps towards elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species, between or within individuals critically depends on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically, and coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50 bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its re-design, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We present assembly benchmarks of human Genome in a Bottle 250 bp Illumina paired-end and 6 kbp mate-pair libraries from a single individual, yielding a NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using less than 35 GB of RAM, a modest memory requirement by today’s standard that is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics’ Chromium data to further improve the scaffold contiguity of this assembly to 42 (15) Mbp.

Genetics

Molecular Biology

0

Paper

Save

Tigmint: Correcting Assembly Errors Using Linked Reads From Large Molecules

Shaun Jackman et al.Apr 20, 2018

Genome sequencing yields the sequence of many short snippets of DNA (reads) from a genome. Genome assembly attempts to reconstruct the original genome from which these reads were derived. This task is difficult due to gaps and errors in the sequencing data, repetitive sequence in the underlying genome, and heterozygosity, and assembly errors are common. These misassemblies may be identified by comparing the sequencing data to the assembly, and by looking for discrepancies between the two. Once identified, these misassemblies may be corrected, improving the quality of the assembly. Although tools exist to identify and correct misassemblies using Illumina pair-end and mate-pair sequencing, no such tool yet exists that makes use of the long distance information of the large molecules provided by linked reads, such as those offered by the 10x Genomics Chromium platform. We have developed the tool Tigmint for this purpose. To demonstrate the effectiveness of Tigmint, we corrected assemblies of a human genome using short reads assembled with ABySS 2.0 and other assemblers. Tigmint reduced the number of misassemblies identified by QUAST in the ABySS assembly by 216 (27%). While scaffolding with ARCS alone more than doubled the scaffold NGA50 of the assembly from 3 to 8 Mbp, the combination of Tigmint and ARCS improved the scaffold NGA50 of the assembly over five-fold to 16.4 Mbp. This notable improvement in contiguity highlights the utility of assembly correction in refining assemblies. We demonstrate its usefulness in correcting the assemblies of multiple tools, as well as in using Chromium reads to correct and scaffold assemblies of long single-molecule sequencing. The source code of Tigmint is available for download from https://github.com/bcgsc/tigmint, and is distributed under the GNU GPL v3.0 license.

Genetics

Ecology

0

Paper

Save

RNA-Bloom provides lightweight reference-free transcriptome assembly for single cells

Ka Nip et al.Jul 14, 2019

We present RNA-Bloom, a de novo RNA-seq assembly algorithm that leverages the rich information content in single-cell transcriptome sequencing (scRNA-seq) data to reconstruct cell-specific isoforms. We benchmark RNA-Bloom's performance against leading bulk RNA-seq assembly approaches, and illustrate its utility in detecting cell-specific gene fusion events using sequencing data from HiSeq-4000 and BGISEQ-500 platforms. We expect RNA-Bloom to boost the utility of scRNA-seq data, expanding what is informatically accessible now.

Genetics

Ecology

0

Paper

Genetics

Ecology

0

Save