ResearchHub | Open Science Community

Andrey Prjibelski

Author with expertise in RNA Sequencing Data Analysis

Achievements

Cited Author

Open Access Advocate

Key Stats

Publications:

(63% Open Access)

Cited by:

26,217

Reputation

Biology

< 1%

Chemistry

< 1%

Economics

< 1%

Publications

SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing

Anton Bankevich et al., Apr 16, 2012

The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V−SC assembler (specialized for single-cell data) and on popular assemblers Velvet and SoapDeNovo (for multicell data). SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies. SPAdes is available online (http://bioinf.spbau.ru/spades). It is distributed as open source software.

Using SPAdes De Novo Assembler

Andrey Prjibelski et al., Jun 1, 2020

SPAdes (St. Petersburg genome Assembler) was originally developed for de novo assembly of genome sequencing data produced for cultivated microbial isolates and for single-cell genomic DNA sequencing. With time, the functionality of SPAdes was extended to enable assembly of IonTorrent data, as well as hybrid assembly from short and long reads (PacBio and Oxford Nanopore). In this article we present protocols for five different assembly pipelines that comprise the SPAdes package and that are used for assembly of metagenomes and transcriptomes as well as assembly of putative plasmids and biosynthetic gene clusters from whole-genome sequencing and metagenomic datasets. In addition, we present guidelines for understanding results with use cases for each pipeline, and several additional support protocols that help in using SPAdes properly. © 2020 Wiley Periodicals LLC.

Basic Protocol 1: Assembling isolate bacterial datasets
Basic Protocol 2: Assembling metagenomic datasets
Basic Protocol 3: Assembling sets of putative plasmids
Basic Protocol 4: Assembling transcriptomes
Basic Protocol 5: Assembling putative biosynthetic gene clusters
Support Protocol 1: Installing SPAdes
Support Protocol 2: Providing input via command line
Support Protocol 3: Providing input data via YAML format
Support Protocol 4: Restarting previous run
Support Protocol 5: Determining strand-specificity of RNA-seq data
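As a rough illustration of how the five pipelines and the restart protocol map onto command-line invocations, the sketch below assembles argument lists for `spades.py`. The helper names (`build_spades_command`, `build_restart_command`) and the file names are hypothetical, and the flags are the ones documented in the SPAdes manual as best recalled here; they should be checked against `spades.py --help` for the installed version. The code only builds the argument lists and does not run SPAdes.

```python
# Hypothetical helpers: they only build argument lists, they do not run SPAdes.
# Flag names are taken from the SPAdes manual as recalled; verify with `spades.py --help`.

PIPELINE_FLAGS = {
    "isolate": "--isolate",   # Basic Protocol 1: isolate bacterial datasets
    "meta": "--meta",         # Basic Protocol 2: metagenomic datasets
    "plasmid": "--plasmid",   # Basic Protocol 3: putative plasmids
    "rna": "--rna",           # Basic Protocol 4: transcriptomes
    "bio": "--bio",           # Basic Protocol 5: biosynthetic gene clusters
}


def build_spades_command(pipeline, left_reads, right_reads, output_dir):
    """Return an argv list for one paired-end library and the chosen pipeline."""
    if pipeline not in PIPELINE_FLAGS:
        raise ValueError(f"unknown pipeline: {pipeline!r}")
    return [
        "spades.py", PIPELINE_FLAGS[pipeline],
        "-1", left_reads,     # forward reads
        "-2", right_reads,    # reverse reads
        "-o", output_dir,     # output directory
    ]


def build_restart_command(output_dir):
    """Return an argv list that resumes an interrupted run in output_dir."""
    return ["spades.py", "--continue", "-o", output_dir]


meta_cmd = build_spades_command("meta", "left.fastq", "right.fastq", "meta_out")
restart_cmd = build_restart_command("meta_out")
print(" ".join(meta_cmd))
print(" ".join(restart_cmd))
```

The resulting lists can be passed to `subprocess.run` once SPAdes is installed; keeping them as lists avoids shell-quoting problems with file paths.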

Artificial Intelligence

Software

1,538

Assembling Single-Cell Genomes and Mini-Metagenomes From Chimeric MDA Products

Sergey Nurk et al., Oct 1, 2013

Recent advances in single-cell genomics provide an alternative to largely gene-centric metagenomics studies, enabling whole-genome sequencing of uncultivated bacteria. However, single-cell assembly projects are challenging due to (i) the highly nonuniform read coverage and (ii) a greatly elevated number of chimeric reads and read pairs. While recently developed single-cell assemblers have addressed the former challenge, methods for assembling highly chimeric reads remain poorly explored. We present algorithms for identifying chimeric edges and resolving complex bulges in de Bruijn graphs, which significantly improve single-cell assemblies. We further describe applications of the single-cell assembler SPAdes to a new approach for capturing and sequencing “microbial dark matter” that forms small pools of randomly selected single cells (called a mini-metagenome) and further sequences all genomes from the mini-metagenome at once. On single-cell bacterial datasets, SPAdes improves on the recently developed E+V-SC and IDBA-UD assemblers specifically designed for single-cell sequencing. For standard (cultivated monostrain) datasets, SPAdes also improves on A5, ABySS, CLC, EULER-SR, Ray, SOAPdenovo, and Velvet. Thus, recently developed single-cell assemblers not only enable single-cell sequencing, but also improve on conventional assemblers on their own turf. SPAdes is available for free online download under a GPLv2 license.
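For readers unfamiliar with the data structure, here is a minimal sketch of a de Bruijn graph with a bulge in it. It is a toy, not the algorithm from the paper: the reads, the function names, and the fixed coverage cutoff are all assumptions made for illustration. A fixed cutoff is exactly the naive approach that the abstract's point about nonuniform coverage argues against for single-cell data.

```python
from collections import Counter


def build_de_bruijn_edges(reads, k):
    """Count k-mers; each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


def low_coverage_edges(edges, min_coverage):
    """Return edges seen fewer than min_coverage times (naive fixed cutoff)."""
    return sorted(edge for edge, count in edges.items() if count < min_coverage)


# Toy data: three copies of one read plus a single read with one substitution.
# The two paths leave node ACG and rejoin at node ACT, which forms a bulge.
reads = ["ACGTACT", "ACGTACT", "ACGTACT", "ACGGACT"]
edges = build_de_bruijn_edges(reads, k=4)
suspects = low_coverage_edges(edges, min_coverage=2)

for (src, dst), count in sorted(edges.items()):
    print(f"{src} -> {dst}  coverage={count}")
print("low-coverage edges:", suspects)
```

In this toy graph the four low-coverage edges form the alternative branch of the bulge. With real single-cell data, coverage can vary by orders of magnitude along a genome, so a global threshold like `min_coverage` would discard genuine low-coverage regions, which is why more careful graph-based methods are needed.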

Versatile genome assembly evaluation with QUAST-LG

Alla Mikheenko et al., Apr 12, 2018

The emergence of high-throughput sequencing technologies revolutionized genomics in the early 2000s. The next revolution came with the era of long-read sequencing. These technological advances, along with novel computational approaches, became the next step towards automatic pipelines capable of assembling nearly complete mammalian-size genomes. In this manuscript, we demonstrate the performance of state-of-the-art genome assembly software on six eukaryotic datasets sequenced using different technologies. To evaluate the results, we developed QUAST-LG, a tool that compares large genomic de novo assemblies against reference sequences and computes relevant quality metrics. Since genomes generally cannot be reconstructed completely due to complex repeat patterns and low coverage regions, we introduce the concept of an upper bound assembly for a given genome and set of reads, and compute theoretical limits on assembly correctness and completeness. Using QUAST-LG, we show how close the assemblies are to the theoretical optimum, and how far this optimum is from the finished reference. Availability: http://cab.spbu.ru/software/quast-lg. Supplementary data are available at Bioinformatics online.

Genetics

Artificial Intelligence

907

rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data

Elena Bushmanova et al., Sep 1, 2019

Background: The possibility of generating large RNA-sequencing datasets has led to the development of various reference-based and de novo transcriptome assemblers, each with its own strengths and limitations. While reference-based tools are widely used in transcriptomic studies, their application is limited to organisms with finished and well-annotated genomes. De novo transcriptome reconstruction from short reads remains an open and challenging problem, complicated by varying expression levels across genes, alternative splicing, and paralogous genes. Results: Herein we describe the novel transcriptome assembler rnaSPAdes, which has been developed on top of the SPAdes genome assembler and explores computational parallels between assembly of transcriptomes and single-cell genomes. We also present quality assessment reports for rnaSPAdes assemblies, compare it with modern transcriptome assembly tools using several evaluation approaches on various RNA-sequencing datasets, and briefly highlight strong and weak points of different assemblers. Conclusions: Based on the comparison between different assembly methods, we conclude that no single assembler leads on every quality metric and every dataset. However, rnaSPAdes typically outperforms other assemblers in the number of assembled genes and isoforms, an important property, while on average achieving higher accuracy statistics than its closest competitors.

Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

Francisco Pardo-Palacios et al., Jun 7, 2024

The Long-read RNA-Seq Genome Annotation Assessment Project Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. Using different protocols and sequencing platforms, the consortium generated over 427 million long-read sequences from complementary DNA and direct RNA datasets, encompassing human, mouse and manatee species. Developers utilized these data to address challenges in transcript isoform detection, quantification and de novo transcript detection. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. Incorporating additional orthogonal data and replicate samples is advised when aiming to detect rare and novel transcripts or using reference-free approaches. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.

Cell-type, single-cell, and spatial signatures of brain-region specific splicing in postnatal development

Anoushka Joglekar et al., Aug 27, 2020

Alternative RNA splicing varies across brain regions, but the single-cell resolution of such regional variation is unknown. Here we present the first single-cell investigation of differential isoform expression (DIE) between brain regions, by performing single-cell long-read transcriptome sequencing in the mouse hippocampus and prefrontal cortex in 45 cell types at postnatal day 7 (www.isoformAtlas.com). Using isoform tests for brain-region specific DIE, which outperform exon-based tests, we detect hundreds of brain-region specific DIE events traceable to specific cell types. Many DIE events correspond to functionally distinct protein isoforms, some with just a 6-nucleotide exon variant. In most instances, one cell type is responsible for brain-region specific DIE. Cell types indigenous to only one anatomic structure display distinctive DIE; for example, the choroid plexus epithelium manifests unique transcription start sites. However, for some genes, multiple cell types are responsible for DIE in bulk data, indicating that regional identity can, although less frequently, override cell-type specificity. We validated our findings with spatial transcriptomics and long-read sequencing, yielding the first spatially resolved splicing map in the postnatal mouse brain (www.isoformAtlas.com). Our methods are highly generalizable. They provide a robust means of quantifying isoform expression with cell-type and spatial resolution, and reveal how the brain integrates molecular and cellular complexity to serve function.

Genetics

Molecular Biology

148

Single-nuclei isoform RNA sequencing reveals combination patterns of transcript elements across human brain cell types

Simon Hardwick et al., Dec 30, 2021

Single-nuclei RNA-Seq is being widely employed to investigate cell types, especially of human brain and other frozen samples. In contrast to single-cell approaches, however, the majority of single-nuclei RNA counts originate from partially processed RNA leading to intronic cDNAs, thus hindering the investigation of complete isoforms. Here, using microfluidics, PCR-based artifact removal, target enrichment, and long-read sequencing, we developed single-nuclei isoform RNA-sequencing (‘SnISOr-Seq’), and applied it to the analysis of human adult frontal cortex samples. We found that exons associated with autism exhibit coordinated and more cell-type specific inclusion than exons associated with schizophrenia or ALS. We discovered two distinct modes of combination patterns: first, those distinguishing cell types in the human brain. These are enriched in combinations of TSS-exon, exon-polyA site, and distant (non-adjacent) exon pairs. Second, those with all isoform combinations found within one neural cell type, which are enriched in adjacent exon pairs. Furthermore, adjacent exon pairs are predominantly mutually associated, while distant pairs are frequently mutually exclusive. Finally, we observed that human-specific exons are as tightly coordinated as conserved exons, pointing to an efficient evolutionary mechanism underpinning coordination. SnISOr-Seq opens the door to single-nuclei long-read isoform analysis in the human brain, and in any frozen, archived or hard-to-dissociate sample.

Genetics

Molecular Biology

rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data

Elena Bushmanova et al., Sep 18, 2018

The ability to generate large RNA-seq datasets has led to the development of various reference-based and de novo transcriptome assemblers, each with its own strengths and limitations. While reference-based tools are widely used in transcriptomic studies, their application is limited to model organisms with finished and annotated genomes. De novo transcriptome reconstruction from short reads remains an open and challenging problem, complicated by varying expression levels across genes, alternative splicing, and paralogous genes. In this paper we describe a novel transcriptome assembler called rnaSPAdes, which is developed on top of the SPAdes genome assembler and explores surprising computational parallels between assembly of transcriptomes and single-cell genomes. We also present quality assessment reports for rnaSPAdes assemblies, compare it with modern transcriptome assembly tools using several evaluation approaches on various RNA-Seq datasets, and briefly highlight strong and weak points of different assemblers.

Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

Francisco Pardo-Palacios et al., Jul 27, 2023

The Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. The consortium generated over 427 million long-read sequences from cDNA and direct RNA datasets, encompassing human, mouse, and manatee species, using different protocols and sequencing platforms. These data were utilized by developers to address challenges in transcript isoform detection and quantification, as well as de novo transcript isoform identification. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. When aiming to detect rare and novel transcripts or when using reference-free approaches, incorporating additional orthogonal data and replicate samples is advised. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.
