ResearchHub | Open Science Community

The Combinatorial BLAS: design, implementation, and applications

Aydın Buluç et al., May 19, 2011

This paper presents a scalable high-performance software library to be used for graph analysis and data mining. Large combinatorial graphs appear in many applications of high-performance computing, including computational biology, informatics, analytics, web search, dynamical systems, and sparse matrix methods. Graph computations are difficult to parallelize using traditional approaches due to their irregular nature and low operational intensity. Many graph computations, however, contain sufficient coarse-grained parallelism for thousands of processors, which can be uncovered by using the right primitives. We describe the parallel Combinatorial BLAS, which consists of a small but powerful set of linear algebra primitives specifically targeting graph and data mining applications. We provide an extensible library interface and some guiding principles for future development. The library is evaluated using two important graph algorithms, in terms of both performance and ease-of-use. The scalability and raw performance of the example applications, using the Combinatorial BLAS, are unprecedented on distributed memory clusters.
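
To make the idea of linear algebra primitives for graph computations concrete, the following is a minimal single-process sketch of breadth-first search written as repeated sparse matrix-vector products over a Boolean semiring. It is not the Combinatorial BLAS interface: the function name `bfs_levels`, the adjacency-list storage, and the toy graph are all assumptions made for illustration.

```python
def bfs_levels(adj, source):
    """Breadth-first search as repeated sparse matrix-vector products.

    adj maps each vertex to the list of its out-neighbours, i.e. it stores
    the nonzero pattern of the adjacency matrix row by row (an assumed,
    illustrative layout, not a library data structure).
    """
    levels = {source: 0}
    frontier = {source}  # sparse vector, stored as the set of its nonzero indices
    depth = 0
    while frontier:
        depth += 1
        # One "multiply": OR together the neighbour lists selected by the frontier.
        reached = set()
        for u in frontier:
            reached.update(adj.get(u, ()))
        # Mask out vertices that already have a level assigned.
        frontier = {v for v in reached if v not in levels}
        for v in frontier:
            levels[v] = depth
    return levels


graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
levels = bfs_levels(graph, 0)
print(levels)
```

Each loop iteration plays the role of one matrix-vector product followed by a mask, which is the kind of coarse-grained operation a distributed library can parallelize.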

Theoretical Computer Science

Software

Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks

Aydın Buluç et al., Aug 11, 2009

This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and Aᵀx to be computed efficiently in parallel, where A is an n×n sparse matrix with nnz ≥ n nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√n lg n) span (critical-path length), yielding a parallelism of Θ(nnz/(√n lg n)), which is amply high for virtually any large matrix. The storage requirement for CSB is the same as that for the more-standard compressed-sparse-rows (CSR) format, for which computing Ax in parallel is easy but Aᵀx is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and Aᵀx run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with processors until limited by off-chip memory bandwidth.
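
The asymmetry between the two products in CSR can be seen in a short serial sketch. This is plain CSR, not the CSB format the paper introduces, and the function names and the 3×3 example matrix are assumptions chosen for illustration. In the forward product each output entry is computed from one row only, so rows could be handed to different workers; in the transposed product every row scatters into shared output entries, which is what would require synchronization in a parallel version.

```python
def csr_matvec(n_rows, row_ptr, col_idx, vals, x):
    """y = A x: each y[i] depends on row i only, so rows are independent."""
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def csr_matvec_transpose(n_cols, row_ptr, col_idx, vals, x):
    """y = A^T x from the same CSR arrays: rows scatter into shared entries of y."""
    y = [0.0] * n_cols
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[col_idx[k]] += vals[k] * x[i]
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

ax = csr_matvec(3, row_ptr, col_idx, vals, x)
atx = csr_matvec_transpose(3, row_ptr, col_idx, vals, x)
print(ax, atx)
```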

Computer Networks And Communications

Hardware And Architecture

Critical Assessment of Metagenome Interpretation - the second round of challenges

Fernando Meyer et al., Jul 12, 2021

Evaluating metagenomic software is key for optimizing metagenome interpretation and is the focus of the community-driven initiative for the Critical Assessment of Metagenome Interpretation (CAMI). In its second challenge, CAMI engaged the community to assess their methods on realistic and complex metagenomic datasets with long and short reads, created from ∼1,700 novel and known microbial genomes, as well as ∼600 novel plasmids and viruses. Altogether 5,002 results by 76 program versions were analyzed, representing a 22x increase in results. Substantial improvements were seen in metagenome assembly, some due to using long-read data. The presence of related strains was still challenging for assembly and genome binning, as was assembly quality for the latter. Taxon profilers demonstrated a marked maturation, with taxon profilers and binners excelling at higher bacterial taxonomic ranks, but underperforming for viruses and archaea. Assessment of clinical pathogen detection techniques revealed a need to improve reproducibility. Analysis of program runtimes and memory usage identified highly efficient programs, including some top performers with other metrics. The CAMI II results identify current challenges, but also guide researchers in selecting methods for specific analyses.

Genetics

Ecology

Profiles of expressed mutations in single cells reveal subclonal expansion patterns and therapeutic impact of intratumor heterogeneity

Farid Mehrabadi et al., Mar 28, 2021

Advances in single-cell RNA sequencing (scRNAseq) technologies uncovered an unexpected complexity in tumors, underlining the relevance of intratumor heterogeneity to cancer progression and therapeutic resistance. Heterogeneity in the mutational composition of cancer cells is a result of distinct (sub)clonal expansions, each with a distinct metastatic potential and resistance to specific treatments. Unfortunately, due to their low read coverage per cell, scRNAseq datasets are too sparse and noisy to be used for detecting expressed mutations in single cells. Additionally, the numbers of cells and mutations present in typical scRNAseq datasets are too large for available computational tools to, e.g., infer distinct subclones, lineages or trajectories in a tumor. Finally, there are no principled methods to assess distinct subclones inferred through single-cell sequencing data and the genomic alterations that seed and potentially cause them. Here we present Trisicell, a computational toolkit for scalable mutational intratumor heterogeneity inference and assessment from scRNAseq as well as single-cell genome or exome sequencing data. Trisicell allows reliable identification of distinct clonal lineages of a tumor, offering the ability to focus on the most important subclones and the genomic alterations that are associated with tumor proliferation. We comprehensively assessed Trisicell on a melanoma model by comparing distinct lineages and subclones it identifies on scRNAseq data, to those inferred using matching bulk whole exome (bWES) and transcriptome (bWTS) sequencing data from clonal sublines derived from single cells. Our results demonstrate that distinct lineages and subclones of a tumor can be reliably inferred and evaluated based on mutation calls from scRNAseq data through the use of Trisicell. Additionally, they reveal a strong correlation between aggressiveness and mutational composition, both across the inferred subclones, and among human melanomas. We also applied Trisicell to infer and evaluate distinct subclonal expansion patterns of the same mouse melanoma model after treatment with immune checkpoint blockade (ICB). After integratively analyzing our cell-specific mutation calls with their expression profiles, we observed that each subclone with a distinct set of novel somatic mutations is strongly associated with a specific developmental status. Moreover, each subclone had developed a unique ICB-resistance mechanism. These results demonstrate that Trisicell can robustly utilize scRNAseq data to delineate intratumor heterogeneity and help understand biological mechanisms underlying tumor progression and resistance to therapy.

Genetics

Molecular Biology

Distributed-Memory Randomized Algorithms for Sparse Tensor CP Decomposition

Vivek Bharadwaj et al., Jun 4, 2024

Candecomp/PARAFAC (CP) decomposition, a generalization of the matrix singular value decomposition to higher-dimensional tensors, is a popular tool for analyzing multidimensional sparse data. On tensors with billions of nonzero entries, computing a CP decomposition is a computationally intensive task. We propose the first distributed-memory implementations of two randomized CP decomposition algorithms, CP-ARLS-LEV and STS-CP, that offer nearly an order-of-magnitude speedup at high decomposition ranks over well-tuned non-randomized decomposition packages. Both algorithms rely on leverage score sampling and enjoy strong theoretical guarantees, each with varying time and accuracy tradeoffs. We tailor the communication schedule for our random sampling algorithms, eliminating expensive reduction collectives and forcing communication costs to scale with the random sample count. Finally, we optimize the local storage format for our methods, switching between analogues of compressed sparse column and compressed sparse row formats. Experiments show that our methods are fast and scalable, producing 11x speedup over SPLATT by decomposing the billion-scale Reddit tensor on 512 CPU cores in under two minutes.

Organic Chemistry

Radiology, Nuclear Medicine And Imaging

RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs

Benjamin Brock et al., May 30, 2024

Sparse matrix multiplication is an important kernel for large-scale graph processing and other data-intensive applications. In this paper, we implement various asynchronous, RDMA-based sparse times dense (SpMM) and sparse times sparse (SpGEMM) algorithms, evaluating their performance running in a distributed memory setting on GPUs. Our RDMA-based implementations use the NVSHMEM communication library for direct, asynchronous one-sided communication between GPUs. We compare our asynchronous implementations to state-of-the-art bulk synchronous GPU libraries as well as a CUDA-Aware MPI implementation of the SUMMA algorithm. We find that asynchronous RDMA-based implementations are able to offer favorable performance compared to bulk synchronous implementations, while also allowing for the straightforward implementation of novel work stealing algorithms.

Computer Networks And Communications

Hardware And Architecture

GPU accelerated partial order multiple sequence alignment for long reads self-correction

Francesco Peverelli et al., Feb 15, 2020

As third-generation sequencing technologies become more reliable and widely used to solve several genome-related problems, self-correction of long reads is becoming the preferred method to reduce the error rate of Pacific Biosciences and Oxford Nanopore long reads, which is now around 10-12%. Several of these self-correction methods rely on some form of Multiple Sequence Alignment (MSA) to obtain a consensus sequence for the original reads. In particular, error-correction tools such as RACON and CONSENT use Partial Order (PO) graph alignment to accomplish this task. PO graph alignment, which is computationally more expensive than optimal global pairwise alignment between two sequences, needs to be performed several times for each read during the error correction process. GPUs have proven very effective in accelerating several compute-intensive tasks in different scientific fields. We harnessed the power of these architectures to accelerate the error correction process of existing self-correction tools, to improve the efficiency of this step of genome analysis. In this paper, we introduce a GPU-accelerated version of the PO alignment presented in the POA v2 software library, implemented on an NVIDIA Tesla V100 GPU. We obtain up to 6.5x speedup compared to 64 CPU threads run on two 2.3 GHz 16-core Intel Xeon Processors E5-2698 v3. In our implementation we focused on the alignment of smaller sequences, as the CONSENT segmentation strategy based on k-mer chaining provides an optimal opportunity to exploit the parallel-processing power of GPUs. To demonstrate this, we have integrated our kernel in the CONSENT software. This accelerated version of CONSENT provides a speedup for the whole error correction step that ranges from 1.95x to 8.5x depending on the input reads.

Biochemistry

Molecular Biology

Sparsity-Aware Communication for Distributed Graph Neural Network Training

Ujjaini Mukhopadhyay et al., Aug 8, 2024

Graph Neural Networks (GNNs) are a computationally efficient method to learn embeddings and classifications on graph data. However, GNN training has low computational intensity, making communication costs the bottleneck for scalability. Sparse-matrix dense-matrix multiplication (SpMM) is the core computational operation in full-graph training of GNNs. Previous work parallelizing this operation focused on sparsity-oblivious algorithms, where matrix elements are communicated regardless of the sparsity pattern. This leads to a predictable communication pattern that can be overlapped with computation and enables the use of collective communication operations at the expense of wasting significant bandwidth by communicating unnecessary data.
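
The contrast between the two communication styles can be illustrated with a small counting sketch. The abstract as shown here does not spell out the paper's algorithm, so this only illustrates which data a sparsity-oblivious scheme sends unnecessarily. The 1D row partitioning, the helper name `rows_needed`, and the toy nonzero pattern are all assumptions.

```python
def rows_needed(local_nonzeros, owner_of_row, my_rank):
    """Return, per remote rank, the dense-matrix rows this rank actually needs.

    In an SpMM Y = A H, a nonzero A[i, j] reads row j of the dense matrix H,
    so the column indices of the local nonzeros name the rows to fetch.
    """
    needed = {}
    for _row, col in local_nonzeros:
        owner = owner_of_row(col)
        if owner != my_rank:
            needed.setdefault(owner, set()).add(col)
    return needed


n, ranks = 8, 4        # 8 rows of H split evenly over 4 ranks (assumed layout)
block = n // ranks


def owner_of_row(r):
    return r // block


# Nonzeros (row, col) of the sparse rows owned by rank 0 (rows 0 and 1).
local_nonzeros = [(0, 0), (0, 5), (1, 1), (1, 5), (1, 6)]

needed = rows_needed(local_nonzeros, owner_of_row, my_rank=0)
aware_volume = sum(len(rows) for rows in needed.values())
oblivious_volume = n - block  # every remote row of H is received
print(needed, aware_volume, oblivious_volume)
```

In this toy case rank 0 needs only two remote rows of the dense matrix, whereas a sparsity-oblivious exchange would deliver all six.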

Artificial Intelligence

Theoretical Computer Science

GenomeFace: a deep learning-based metagenome binner trained on 43,000 microbial genomes

Richard Lettich et al., Feb 8, 2024

Metagenomic binning, the process of grouping DNA sequences into taxonomic units, is critical for understanding the functions, interactions, and evolutionary dynamics of microbial communities. We propose a deep learning approach to binning using two neural networks, one based on composition and another on environmental abundance, dynamically weighting the contribution of each based on characteristics of the input data. Trained on over 43,000 prokaryotic genomes, our network for composition-based binning is inspired by metric learning techniques used for facial recognition. Using a task-specific, multi-GPU accelerated algorithm to cluster the embeddings produced by our network, our binner leverages marker genes observed to be universally present in nearly all taxa to grade and select optimal clusters of sequences from a hierarchy of candidates. We evaluate our approach on four simulated datasets with known ground truth. Our linear-time integration of marker genes recovers more near-complete genomes than state-of-the-art but computationally infeasible solutions using them, while being over an order of magnitude faster. Finally, we demonstrate the scalability and acuity of our approach by testing it on three of the largest metagenome assemblies ever performed. Compared to other binners, we produced 47%-183% more near-complete genomes. From these datasets, we find the genomes of over 3000 new candidate species which have never been previously cataloged, representing a potential 4% expansion of the known bacterial tree of life.

Genetics

Artificial Intelligence

BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper

Giulia Guidi et al., Nov 7, 2018

Recent advances in long-read sequencing enable the characterization of genome structure and its intra- and inter-species variation at a resolution that was previously impossible. Detecting overlaps between reads is integral to many long-read genomics pipelines, such as de novo genome assembly. While longer reads simplify genome assembly and improve the contiguity of the reconstruction, current long-read technologies come with high error rates. We present Berkeley Long-Read to Long-Read Aligner and Overlapper (BELLA), a novel algorithm for computing overlaps and alignments via sparse matrix-matrix multiplication that balances the goals of recall and precision, performing well on both. We present a probabilistic model that demonstrates the feasibility of using short k-mers for detecting candidate overlaps. We then introduce a notion of reliable k-mers based on our probabilistic model. Combining reliable k-mers with our binning mechanism eliminates both the k-mer set explosion that would otherwise occur with highly erroneous reads and the spurious overlaps from k-mers originating in repetitive regions. Finally, we present a new method based on Chernoff bounds for separating true overlaps from false positives using a combination of alignment techniques and probabilistic modeling. Our methodologies aim at maximizing the balance between precision and recall. On both real and synthetic data, BELLA performs amongst the best in terms of F1 score, showing performance stability which is often missing for competitor software. BELLA’s F1 score is consistently within 1.7% of the top entry. Notably, we show improved de novo assembly results on synthetic data when coupling BELLA with the Miniasm assembler.
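
One way to see how overlap detection can be cast as sparse matrix-matrix multiplication is to build a reads-by-k-mers occurrence matrix A and look at the nonzeros of A·Aᵀ, whose entry (i, j) counts the k-mers that reads i and j share. The sketch below assumes that formulation and is not BELLA's implementation: it omits reliable k-mer selection, binning, alignment, and the Chernoff-bound filter, and the function names and toy reads are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kmers(read, k):
    """Return the set of distinct k-mers in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def shared_kmer_counts(reads, k):
    """Compute the strictly upper-triangular nonzeros of A * A^T.

    A is the reads-by-k-mers occurrence matrix, stored column-wise here:
    each k-mer maps to the list of reads that contain it, in increasing order.
    """
    columns = defaultdict(list)
    for read_id, read in enumerate(reads):
        for kmer in kmers(read, k):
            columns[kmer].append(read_id)

    counts = defaultdict(int)
    for read_ids in columns.values():
        # Each column adds 1 to every pair of reads that share this k-mer.
        for i, j in combinations(read_ids, 2):
            counts[(i, j)] += 1
    return dict(counts)


reads = ["ACGTAC", "GTACGG", "TTTTTT"]
candidates = shared_kmer_counts(reads, k=3)
print(candidates)
```

Pairs that never share a k-mer do not appear in the result at all, which is why the sparse product avoids comparing every read against every other read.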

Artificial Intelligence

Biochemistry
