ResearchHub | Open Science Community

SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification

Manuel Tardáguila et al., Feb 9, 2018

High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in well-annotated mammalian species. The advances in sequencing technology have created a need for studies and tools that can characterize these novel variants. Here, we present SQANTI, an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline using 47 unique descriptors. We apply SQANTI to a neuronal mouse transcriptome using Pacific Biosciences (PacBio) long reads and illustrate how the tool is effective in characterizing and describing the composition of the full-length transcriptome. We perform extensive evaluation of ToFU PacBio transcripts by PCR to reveal that a substantial number of the novel transcripts are technical artifacts of the sequencing approach and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, resulting more frequently in novel ORFs than novel UTRs, and are enriched in both general metabolic and neural-specific functions. We show that these new transcripts have a major impact on the correct quantification of transcript levels by state-of-the-art short-read-based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases, we find that alternative isoforms are elusive to proteogenomics detection. SQANTI allows the user to maximize the analytical outcome of long-read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes.

Genetics

Molecular Biology

Paper

Genetic associations at regulatory phenotypes improve fine-mapping of causal variants for twelve immune-mediated diseases

Kousik Kundu et al., Jan 15, 2020

The identification of causal genetic variants for common diseases improves understanding of disease biology. Here we use data from the BLUEPRINT project to identify regulatory quantitative trait loci (QTL) for three primary human immune cell types and use these to fine-map putative causal variants for twelve immune-mediated diseases. We identify 340 unique, non-major histocompatibility complex (MHC) disease loci that colocalise with high (>98%) posterior probability with regulatory QTLs, and apply Bayesian frameworks to fine-map associations at each locus. We show that fine-mapping applied to regulatory QTLs yields smaller credible set sizes and higher posterior probabilities for candidate causal variants compared to disease summary statistics. We also describe a systematic under-representation of insertion/deletion (INDEL) polymorphisms in credible sets derived from publicly available disease meta-analyses when compared to QTLs based on genome-sequencing data. Overall, our findings suggest that fine-mapping applied to disease-colocalising regulatory QTLs can enhance the discovery of putative causal disease variants and provide insights into the underlying causal genes and molecular mechanisms.

Genetics

Immunology

Paper

tappAS: a comprehensive computational framework for the analysis of the functional impact of differential splicing

Lorena Fuente et al., Jul 3, 2019

Traditionally, the functional analysis of gene expression data has used pathway and network enrichment algorithms. These methods are usually gene-centric rather than transcript-centric and hence fall short of unravelling functional roles associated with post-transcriptional regulatory mechanisms such as Alternative Splicing (AS) and Alternative PolyAdenylation (APA), jointly referred to here as Alternative Transcript Processing (AltTP). Moreover, short-read RNA-seq has serious limitations in resolving full-length transcripts, further complicating the study of isoform expression. Recent advances in long-read sequencing open exciting opportunities for studying isoform biology and function. However, there are no established bioinformatics methods for the functional analysis of isoform-resolved transcriptomics data to fully leverage these technological advances. Here we present a novel framework for Functional Iso-Transcriptomics analysis (FIT). This framework uses a rich isoform-level annotation database of functional domains, motifs and sites (both coding and non-coding) and introduces novel analysis methods to interrogate different aspects of the functional relevance of isoform complexity. The Functional Diversity Analysis (FDA) evaluates the variability in the inclusion/exclusion of functional domains across annotated transcripts of the same gene. Parameters can be set to evaluate whether AltTP partially or fully disrupts functional elements. FDA is a measure of the potential of a multiple-isoform transcriptome to have a functional impact. By combining these functional labels with expression data, the Differential Analysis Module evaluates the relative contribution of transcriptional (i.e. gene level) and post-transcriptional (i.e. transcript/protein levels) regulation to the biology of the system. Measures of isoform relevance such as Minor Isoform Filtering, Isoform Switching Events and Total Isoform Usage Change contribute to restricting analysis to biologically meaningful changes. Finally, novel methods for Differential Feature Inclusion (DFI), Co-Feature Inclusion, and the combination of UTR-lengthening with Alternative Polyadenylation analyses carefully dissect the contextual regulation of functional elements resulting from differential isoform usage. These methods are implemented in the software tappAS, a user-friendly Java application that brings FIT to the hands of non-expert bioinformaticians, supporting several model and non-model species. tappAS complements statistical analyses with powerful browsing tools and highly informative gene/transcript/CDS graphs.
We applied tappAS to the analysis of two mouse cell types, Neural Precursor Cells (NPCs) and Oligodendrocyte Precursor Cells (OPCs), whose transcriptomes were defined by PacBio and quantified by Illumina. Using FDA we confirmed the high potential of AltTP regulation in our system, in which 90% of multi-isoform genes presented variation in functional features at the transcript or protein level. The Differential Analysis module revealed a high interplay between transcriptional and AltTP regulation in neural development, mainly controlled by differential expression, but where AltTP acts as the main driver of important neural development biological mechanisms such as vesicle trafficking, signal transduction and RNA processing. The DFI analysis revealed that, globally, AltTP increased the availability of functional features in differentiated neural cells. DFI also showed that AltTP is a mechanism for altering gene function by changing cellular localization and binding properties of proteins, via the differential inclusion of NLS, transmembrane domains or DNA binding motifs, for example. Some of these findings were experimentally validated by others and us.
In summary, we propose a novel framework for the functional analysis of transcriptomes at isoform resolution. We anticipate the tappAS tool will be an important resource for the adoption of the Functional Iso-Transcriptomics analysis by the functional genomics community.

Genetics

Molecular Biology

Paper

SQANTI: extensive characterization of long read transcript sequences for quality control in full-length transcriptome identification and quantification

Manuel Tardáguila et al., Mar 18, 2017

High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in very well-annotated organisms such as mice and humans. Nonetheless, there is a need for studies and tools that characterize these novel isoforms. Here we present SQANTI, an automated pipeline for the classification of long-read transcripts that computes 47 descriptors that can be used to assess the quality of the data and of the preprocessing pipelines. We applied SQANTI to a neuronal mouse transcriptome using PacBio long reads and illustrate how the tool is effective in readily describing the composition of and characterizing the full-length transcriptome. We perform extensive evaluation of ToFU PacBio transcripts by PCR to reveal that a substantial number of the novel transcripts are technical artifacts of the sequencing approach, and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, result more frequently in novel ORFs than novel UTRs, and are enriched in both general metabolic and neural-specific functions. We show that these new transcripts have a major impact on the correct quantification of transcript levels by state-of-the-art short-read-based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases, we find that alternative isoforms are elusive to proteogenomics detection and are variable in protein changes with respect to the principal isoform of their genes. SQANTI allows the user to maximize the analytical outcome of long-read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes. SQANTI is available at https://bitbucket.org/ConesaLab/sqanti.

Genetics

Molecular Biology

Paper

Heterozygous gene truncation delineates the human haploinsufficient genome

István Bartha et al., Oct 23, 2014

Sequencing projects have identified large numbers of rare stop-gain and frameshift variants in the human genome. As most of these are observed in the heterozygous state, they test a gene's tolerance to haploinsufficiency and dominant loss of function. We analyzed the distribution of truncating variants across 16,260 protein-coding autosomal genes in 11,546 individuals. We observed 39,893 truncating variants affecting 12,062 genes, which significantly differed from an expectation of 12,916 genes under a model of neutral de novo mutation (p<1E-4). Extrapolating this to increasing numbers of sequenced individuals, we estimate that 10.8% of human genes do not tolerate heterozygous truncating variants. An additional 10 to 15% of truncated genes may be rescued by incomplete penetrance or compensatory mutations, or because the truncating variants are of limited functional impact. The study of protein-truncating variants delineates the essential genome and, more generally, identifies rare heterozygous variants as an unexplored source of diversity of phenotypic traits and diseases.

Genetics

Molecular Biology

Paper

Variant-to-function dissection of rare non-coding GWAS loci with high impact on blood traits

Manuel Tardáguila et al., Aug 5, 2024

Two decades of Genome-Wide Association Studies (GWAS) have yielded hundreds of thousands of robust genetic associations to human complex traits and diseases. Nevertheless, the dissection of the functional consequences of variants lags behind, especially for non-coding variants (RNVs). Here we have characterised a set of rare, non-coding variants with large effects on haematological traits by integrating (i) a massively parallel reporter assay with (ii) a CRISPR/Cas9 screen and (iii) in vivo gene expression and transcript relative abundance analysis of whole blood and immune cells. After extensive manual curation we identify 22 RNVs with robust mechanistic hypotheses and perform an in-depth characterization of one of them, demonstrating its impact on megakaryopoiesis through regulation of the CUX1 transcriptional cascade. With this work we advance the understanding of the translational value of GWAS findings for variants implicated in blood and immunity.

Genetics

Molecular Biology

Paper

Transcriptome-wide association study in UK Biobank Europeans identifies associations with blood cell traits

Bryce Rowland et al., Aug 5, 2021

Previous genome-wide association studies (GWAS) of hematological traits have identified over 10,000 distinct trait-specific risk loci, but the underlying causal mechanisms at these loci remain incompletely characterized. We performed a transcriptome-wide association study (TWAS) of 29 hematological traits in 399,835 UK Biobank (UKB) participants of European ancestry using gene expression prediction models trained from whole blood RNA-seq data in 922 individuals. We discovered 557 TWAS signals associated with hematological traits distinct from previously discovered GWAS variants, including 10 completely novel gene-trait pairs corresponding to 9 unique genes. Among the 557 associations, 301 were available for replication in a cohort of 141,286 participants of European ancestry from the Million Veteran Program (MVP). Of these 301 associations, 199 replicated at a nominal threshold (α = 0.05) and 108 replicated at a strict Bonferroni-adjusted threshold (α = 0.05/301). Using our TWAS results, we systematically assigned 4,261 out of 16,900 previously identified hematological trait GWAS variants to putative target genes. Compared to coloc, our TWAS results show reduced specificity and increased sensitivity to assign variants to target genes.

Genetics

Molecular Biology

Paper

Variation in PU.1 binding and chromatin looping at neutrophil enhancers influences autoimmune disease susceptibility

Stephen Watt et al., Apr 29, 2019

Neutrophils play fundamental roles in the innate inflammatory response, shape adaptive immunity, and have been identified as a potentially causal cell type underpinning genetic associations with immune system traits and diseases. The majority of these variants are non-coding and the underlying mechanisms are not fully understood. Here, we profiled the binding of one of the principal myeloid transcriptional regulators, PU.1, in primary neutrophils across nearly a hundred volunteers, and elucidated the coordinated genetic effects of PU.1 binding variation, local chromatin state, promoter-enhancer interactions and gene expression. We show that PU.1 binding and the associated chain of molecular changes underlie genetically driven differences in cell count and autoimmune disease susceptibility. Our results advance the interpretation of genetic loci associated with neutrophil biology and immune disease.

Genetics

Immunology

Paper

Misexpression of inactive genes in whole blood is associated with nearby rare structural variants

Thomas Vanderstichele et al., Jan 1, 2023

Gene misexpression is the aberrant transcription of a gene in a context where it is usually inactive. Despite its known pathological consequences in specific rare diseases, we have a limited understanding of its wider prevalence and mechanisms in humans. To address this, we analyzed gene misexpression in 4,568 whole blood bulk RNA sequencing samples from INTERVAL study blood donors. We found that while individual misexpression events occur rarely, in aggregate they were found in almost all samples and over half of inactive genes. Using 2,821 paired whole genome and RNA sequencing samples, we identified that misexpression events are enriched in cis for rare structural variants (SVs). We established putative mechanisms through which a subset of SVs leads to gene misexpression, including transcriptional readthrough, transcript fusions and gene inversion. Overall, we develop misexpression as a novel type of transcriptomic outlier analysis and extend our understanding of the variety of mechanisms by which genetic variants can influence gene expression.

Genetics

Paleontology

Paper