ResearchHub | Open Science Community

recount3: summaries and queries for large-scale RNA-seq expression and splicing

Christopher Wilks et al., May 23, 2021

ABSTRACT We present recount3, a resource consisting of over 750,000 publicly available human and mouse RNA sequencing (RNA-seq) samples uniformly processed by our new Monorail analysis pipeline. To facilitate access to the data, we provide the recount3 and snapcount R/Bioconductor packages as well as complementary web resources. Using these tools, data can be downloaded as study-level summaries or queried for specific exon-exon junctions, genes, samples, or other features. Monorail can be used to process local and/or private data, allowing results to be directly compared to any study in recount3. Taken together, our tools help biologists maximize the utility of publicly available RNA-seq data, especially to improve their understanding of newly collected data. recount3 is available from http://rna.recount.bio .

Genetics

Molecular Biology

Megadepth: efficient coverage quantification for BigWigs and BAMs

Christopher Wilks et al., Dec 18, 2020

Abstract Motivation: A common way to summarize sequencing datasets is to quantify data lying within genes or other genomic intervals. This can be slow and can require different tools for different input file types. Results: Megadepth is a fast tool for quantifying alignments and coverage for BigWig and BAM/CRAM input files, using substantially less memory than the next-fastest competitor. Megadepth can summarize coverage within all disjoint intervals of the Gencode V35 gene annotation for more than 19,000 GTExV8 BigWig files in approximately one hour using 32 threads. Megadepth is available both as a command-line tool and as an R/Bioconductor package providing much faster quantification compared to the rtracklayer package. Availability: https://github.com/ChristopherWilks/megadepth, https://bioconductor.org/packages/megadepth. Contact: chris.wilks@jhu.edu
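To make "summarizing coverage within intervals" concrete, here is a minimal Python sketch of the underlying computation. It is an illustration only and not Megadepth's implementation or interface: the function name, the in-memory coverage list, and the half-open interval convention are all assumptions made for this example.

```python
from itertools import accumulate


def interval_coverage_sums(coverage, intervals):
    """Sum per-base coverage over half-open intervals [start, end).

    Illustrative helper (hypothetical name), not part of Megadepth.
    """
    # prefix[i] holds the total coverage of bases 0..i-1, so each interval
    # sum is the difference of two prefix entries.
    prefix = [0] + list(accumulate(coverage))
    return [prefix[end] - prefix[start] for start, end in intervals]


# Made-up per-base coverage for a 10-base contig and three disjoint intervals.
coverage = [0, 2, 2, 3, 5, 5, 1, 0, 0, 4]
intervals = [(1, 4), (4, 7), (7, 10)]
sums = interval_coverage_sums(coverage, intervals)
print(sums)  # [7, 11, 4]
```

The prefix-sum layout means each interval costs two lookups regardless of its length, which is one simple way to keep many-interval queries cheap once coverage is in memory.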

Genetics

Artificial Intelligence

Pan-genomic Matching Statistics for Targeted Nanopore Sequencing

Omar Ahmed et al., Mar 23, 2021

Abstract Nanopore sequencing is an increasingly powerful tool for genomics. Recently, computational advances have allowed nanopores to sequence in a targeted fashion; as the sequencer emits data, software can analyze the data in real time and signal the sequencer to eject “non-target” DNA molecules. We present a novel method called SPUMONI, which enables rapid and accurate targeted sequencing with the help of efficient pangenome indexes. SPUMONI uses a compressed index to rapidly generate exact or approximate matching statistics (half-maximal exact matches) in a streaming fashion. When used to target a specific strain in a mock community, SPUMONI has accuracy similar to that of minimap2 when both are run against an index containing many strains per species. However, SPUMONI is 12 times faster than minimap2. SPUMONI’s index and peak memory footprint are also 15 and 4 times smaller than minimap2’s, respectively. These improvements become even more pronounced with even larger reference databases; SPUMONI’s index size scales sublinearly with the number of reference genomes included. This could enable accurate targeted sequencing even in the case where the targeted strains have not necessarily been sequenced or assembled previously. SPUMONI is open source software available from https://github.com/oma219/spumoni .
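For readers unfamiliar with matching statistics, the following Python sketch computes them by brute force. It assumes the standard definition (for each position i of a read, the length of the longest prefix of the read's suffix at i that occurs somewhere in the reference). It is not SPUMONI's algorithm, which the abstract describes as streaming over a compressed index; the function name is hypothetical.

```python
def matching_statistics(pattern, text):
    """Naive matching statistics of `pattern` against `text`.

    ms[i] is the length of the longest prefix of pattern[i:] that occurs
    as a substring of text. Brute force for illustration only; this is
    not how SPUMONI computes them.
    """
    ms = []
    for i in range(len(pattern)):
        length = 0
        # Extend the match one character at a time while it still occurs.
        while (i + length < len(pattern)
               and pattern[i:i + length + 1] in text):
            length += 1
        ms.append(length)
    return ms


# Toy reference and read.
print(matching_statistics("TTAGAT", "GATTACA"))  # [3, 2, 1, 3, 2, 1]
```

Long matching-statistic lengths suggest the read resembles something in the reference, while runs of short lengths suggest it does not; a targeted-sequencing tool can base its keep-or-eject decision on that kind of signal.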

Genetics

Artificial Intelligence

Cell-specific regulation of gene expression using splicing-dependent frameshifting

Jonathan Ling et al., Mar 2, 2022

Abstract Precise and reliable cell-specific gene delivery remains technically challenging. Here we report a splicing-based approach for controlling gene expression whereby separate translational reading frames are coupled to the inclusion or exclusion of cell-specific alternative exons. Candidate exons are identified by analyzing thousands of publicly available RNA sequencing datasets and filtering by cell specificity, sequence conservation, and local intron length. This method, which we denote splicing-linked expression design (SLED), can be combined in a Boolean manner with existing techniques such as minipromoters and viral capsids. SLED vectors can leverage the strong expression of constitutive promoters, without sacrificing precision, by decoupling the tradeoff between promoter strength and selectivity. We generated SLED vectors to selectively target all neurons, photoreceptors, or excitatory neurons, and demonstrated that specificity was retained in vivo when delivered using AAVs. We further demonstrated the utility of SLED by creating what would otherwise be unobtainable research tools, specifically a GluA2 flip/flop reporter and a dual excitatory/inhibitory neuronal calcium indicator. Finally, we show the translational potential of SLED by rescuing photoreceptor degeneration in Prph2 rds/rds mice and by developing an oncolytic vector that can selectively induce apoptosis in SF3B1 mutant cancer cells. The flexibility of SLED technology enables new avenues for basic and translational research.

Genetics

Molecular Biology

Scaling read aligners to hundreds of threads on general-purpose processors

Ben Langmead et al., Oct 24, 2017

General-purpose processors can now contain many dozens of processor cores and support hundreds of simultaneous threads of execution. To make best use of these threads, genomics software must contend with new and subtle computer architecture issues. We discuss some of these and propose methods for improving thread scaling in tools that analyze each read independently, such as read aligners. We implement these methods in new versions of Bowtie, Bowtie 2 and HISAT. We greatly improve thread scaling in many scenarios, including on the recent Intel Xeon Phi architecture. We also highlight how bottlenecks are exacerbated by variable-record-length file formats like FASTQ and suggest changes that enable superior scaling.

Artificial Intelligence

Molecular Biology

Polyester: simulating RNA-seq datasets with differential transcript expression

Alyssa Frazee et al., Jun 6, 2014

Motivation: Statistical methods development for differential expression analysis of RNA sequencing (RNA-seq) requires software tools to assess accuracy and error rate control. Since true differential expression status is often unknown in experimental datasets, artificially-constructed datasets must be utilized, either by generating costly spike-in experiments or by simulating RNA-seq data. Results: Polyester is an R package designed to simulate RNA-seq data, beginning with an experimental design and ending with collections of RNA-seq reads. Its main advantage is the ability to simulate reads indicating isoform-level differential expression across biological replicates for a variety of experimental designs. Data generated by Polyester is a reasonable approximation to real RNA-seq data and standard differential expression workflows can recover differential expression set in the simulation by the user. Availability and Implementation: Polyester is freely available from Bioconductor (http://bioconductor.org/).

Genetics

Molecular Biology

Human splicing diversity across the Sequence Read Archive

Abhinav Nellore et al., Jan 29, 2016

We aligned 21,504 publicly available Illumina-sequenced human RNA-seq samples from the Sequence Read Archive (SRA) to the human genome and compared detected exon-exon junctions with junctions in several recent gene annotations. 56,865 junctions (18.6%) found in at least 1,000 samples were not annotated, and their expression associated with tissue type. Newer samples contributed few novel well-supported junctions, with 96.1% of junctions detected in at least 20 reads across samples present in samples before 2013. Junction data is compiled into a resource called intropolis available at http://intropolis.rail.bio. We discuss an application of this resource to cancer involving a recently validated isoform of the ALK gene.

Genetics

Molecular Biology

Prefix-Free Parsing for Building Big BWTs

Christina Boucher et al., Nov 19, 2018

High-throughput sequencing technologies have led to explosive growth of genomic databases, one of which will soon reach hundreds of terabytes. For many applications we want to build and store indexes of these databases, but constructing such indexes is a challenge. Fortunately, many of these genomic databases are highly repetitive, a characteristic that can be exploited to ease the computation of the Burrows-Wheeler Transform (BWT), which underlies many popular indexes. In this paper, we introduce a preprocessing algorithm, referred to as prefix-free parsing, that takes a text T as input and in one pass generates a dictionary D and a parse P of T with the property that the BWT of T can be constructed from D and P using workspace proportional to their total size and O(|T|) time. Our experiments show that D and P are significantly smaller than T in practice, and thus can fit in a reasonable internal memory even when T is very large. In particular, we show that with prefix-free parsing we can build a 131-megabyte run-length compressed FM-index (restricted to support only counting and not locating) for 1000 copies of human chromosome 19 in 2 hours using 21 gigabytes of memory, suggesting that we can build a 6.73 gigabyte index for 1000 complete human-genome haplotypes in approximately 102 hours using about 1 terabyte of memory.
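The one-pass split of T into a dictionary D and a parse P can be sketched in Python as follows. The abstract does not spell out the parsing rule, so this sketch assumes a common formulation: a window of w characters is slid over the text, a phrase ends wherever a hash of the window is 0 modulo p, and consecutive phrases overlap by w characters. The window size, modulus, hash function, sentinel characters, and function names are all illustrative assumptions, and the BWT construction from D and P is not shown.

```python
def prefix_free_parse(text, w=2, p=3):
    """Split `text` into overlapping phrases; return (dictionary, parse).

    Simplified sketch under assumed rules, not the paper's implementation.
    Sentinels \x01 and \x02 are assumed absent from `text`.
    """
    padded = "\x01" + text + "\x02" * w

    def is_trigger(window):
        # The final all-sentinel window always closes the last phrase.
        if window == "\x02" * w:
            return True
        h = 0
        for ch in window:
            h = (h * 256 + ord(ch)) % 1_000_003
        return h % p == 0

    phrases, start = [], 0
    for i in range(1, len(padded) - w + 1):
        if is_trigger(padded[i:i + w]):
            phrases.append(padded[start:i + w])  # phrase ends with trigger
            start = i                            # next phrase starts with it
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    return dictionary, [rank[phrase] for phrase in phrases]


def rebuild(dictionary, parse, w=2):
    """Invert the parse: drop each later phrase's w-character overlap."""
    phrases = [dictionary[r] for r in parse]
    padded = phrases[0] + "".join(ph[w:] for ph in phrases[1:])
    return padded[1:len(padded) - w]  # strip the sentinels


text = "GATTACAGATTACAGATTACA"
D, P = prefix_free_parse(text)
print(len(D), "distinct phrases;", len(P), "phrases in the parse")
print(rebuild(D, P) == text)  # True
```

On repetitive input the same phrases recur, so D stays small while P records only their order, which is the property the abstract relies on.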

Philosophy

Artificial Intelligence

Analyzing whole genome bisulfite sequencing data from highly divergent genotypes

Phillip Wulfridge et al., Sep 22, 2016

In the study of DNA methylation, genetic variation between species, strains, or individuals can result in CpG sites that are exclusive to a subset of samples, and insertions and deletions can rearrange the spatial distribution of CpGs. How to account for this variation in an analysis of the interplay between sequence variation and DNA methylation is not well understood, especially when the number of CpG differences between samples is large. Here we use whole-genome bisulfite sequencing data on two highly divergent inbred mouse strains to study this problem. We find that while the large number of strain-specific CpGs necessitates considerations regarding the reference genomes used during alignment, properties such as CpG density are surprisingly conserved across the genome. We introduce a method for including strain-specific CpGs in differential analysis, and show that accounting for strain-specific CpGs increases the power to find differentially methylated regions between the strains. Our method uses smoothing to impute methylation levels at strain-specific sites, thereby allowing strain-specific CpGs to contribute to the analysis, and also allowing us to account for differences in the spatial occurrences of CpGs. Our results have implications for analysis of genetic variation and DNA methylation using bisulfite-converted DNA.

Genetics

Molecular Biology

Efficient Construction of a Complete Index for Pan-Genomics Read Alignment

Alan Kuhnle et al., Nov 19, 2018

While short read aligners, which predominantly use the FM-index, are able to easily index one or a few human genomes, they do not scale well to indexing databases containing thousands of genomes. To understand why, it helps to examine the main components of the FM-index in more detail, which are: first, a rank data structure over the Burrows-Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA) containing pointers to starting positions of occurrences of a given pattern; second, a sample of the SA that, when used with the rank data structure, allows us to access the SA. The rank data structure can be kept small even for large genomic databases by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that Gagie et al. (SODA 2018) have defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018 we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building Gagie et al.'s SA sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample, and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method for indexing partial and whole human genomes, and show that it improves over Bowtie with respect to both memory and time.
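The two FM-index components described above can be illustrated with a small, uncompressed Python sketch. It stores a full suffix array and plain rank tables, so it shows only how rank queries over the BWT narrow the SA interval for a pattern. It does not implement run-length compression, SA sampling, or the construction method of the paper, and all names are hypothetical.

```python
def build_fm_index(text):
    """Build a toy FM-index: full SA, BWT, C array, and rank tables."""
    text += "$"  # terminator, assumed absent from text and smallest
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is "$" when i == 0
    alphabet = sorted(set(text))
    # C[c]: number of characters in text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i] (the rank query).
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ


def locate(pattern, index):
    """Return sorted start positions of `pattern` via backward search."""
    sa, C, occ = index
    lo, hi = 0, len(sa)  # current SA interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_fm_index("GATTACA")
print(locate("A", index))   # [1, 4, 6]
print(locate("TA", index))  # [3]
print(locate("GG", index))  # []
```

Here the final `sa[lo:hi]` lookup reads a complete suffix array; the SA sample discussed in the abstract exists precisely so that those positions can be recovered without storing every entry.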

Philosophy

Artificial Intelligence
