ResearchHub | Open Science Community

Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions

Jiayang Chen et al.Aug 7, 2022

Abstract Non-coding RNA structure and function are essential to understanding various biological processes, such as cell signaling, gene expression, and post-transcriptional regulations. These are all among the core problems in the RNA field. With the rapid growth of sequencing technology, we have accumulated a massive amount of unannotated RNA sequences. On the other hand, expensive experimental observatory results in only limited numbers of annotated data and 3D structures. Hence, it is still challenging to design computational methods for predicting their structures and functions. The lack of annotated data and systematic study causes inferior performance. To resolve the issue, we propose a novel RNA foundation model (RNA-FM) to take advantage of all the 23 million non-coding RNA sequences through self-supervised learning. Within this approach, we discover that the pre-trained RNA-FM could infer sequential and evolutionary information of non-coding RNAs without using any labels. Furthermore, we demonstrate RNA-FM’s effectiveness by applying it to the downstream secondary/3D structure prediction, SARS-CoV-2 genome structure and evolution prediction, protein-RNA binding preference modeling, and gene expression regulation modeling. The comprehensive experiments show that the proposed method improves the RNA structural and functional modelling results significantly and consistently. Despite only being trained with unlabelled data, RNA-FM can serve as the foundational model for the field.

Genetics

Molecular Biology

4

Paper

Save

AcrNET: Predicting Anti-CRISPR with Deep Learning

Yunxiang Li et al.Apr 2, 2022

ABSTRACT As an important group of proteins discovered in phages, anti-CRISPR inhibits the activity of the immune system of bacteria ( i.e ., CRISPR-Cas), showing great potential for gene editing and phage therapy. However, the prediction and discovery of anti-CRISPR are challenging for its high variability and fast evolution. Existing biological studies often depend on known CRISPR and anti-CRISPR pairs, which may not be practical considering the huge number of pairs in reality. Computational methods usually struggle with prediction performance. To tackle these issues, we propose a novel deep neural net work for a nti- CR ISPR analysis ( AcrNET ), which achieves impressive performance. On both the cross-fold and cross-dataset validation, our method outperforms the previous state-of-the-art methods significantly. Impressively, AcrNET improves the prediction performance by at least 15% regarding the F1 score for the cross-dataset test. Moreover, AcrNET is the first computational method to predict the detailed anti-CRISPR classes, which may help illustrate the anti-CRISPR mechanism. Taking advantage of a Transformer protein language model pre-trained on 250 million protein sequences, AcrNET overcomes the data scarcity problem. Extensive experiments and analysis suggest that Transformer model feature, evolutionary feature, and local structure feature complement each other, which indicates the critical properties of anti-CRISPR proteins. Combined with AlphaFold prediction, further motif analysis and docking experiments demonstrate that AcrNET captures the evolutionarily conserved pattern and the interaction between anti-CRISPR and the target implicitly. With the impressive prediction capability, AcrNET can serve as a valuable tool for anti-CRISPR study and new anti-CRISPR discovery, with a free webserver at https://proj.cse.cuhk.edu.hk/aihlab/AcrNET/ .

Genetics

Artificial Intelligence

19

Paper

Save

Deep autoencoder for interpretable tissue-adaptive deconvolution and cell-type-specific gene analysis

Yanshuo Chen et al.Oct 27, 2021

Abstract Single-cell RNA-sequencing has become a powerful tool to study biologically significant characteristics at explicitly high resolution. However, its application on emerging data is currently limited by its intrinsic techniques. Here, we introduce Tissue-AdaPtive autoEncoder (TAPE), a deep learning method connecting bulk RNA-seq and single-cell RNA-seq to achieve precise deconvolution in a short time. By constructing an interpretable decoder and training under a unique scheme, TAPE can predict cell-type fractions and cell-type-specific gene expression tissue-adaptively. Compared with popular methods on several datasets, TAPE has a better overall performance and comparable accuracy at cell type level. Additionally, it is more robust among different cell types, faster, and sensitive to provide biologically meaningful predictions. Moreover, through the analysis of clinical data, TAPE shows its ability to predict cell-type-specific gene expression profiles with biological significance. We believe that TAPE will enable and accelerate the precise analysis of high-throughput clinical data in a wide range.

Genetics

Artificial Intelligence

9

Paper

Save

Contrastive Cycle Adversarial Autoencoders for Single-cell Multi-omics Alignment and Integration

Xuesong Wang et al.Dec 13, 2021

ABSTRACT We have entered the multi-omics era, and we can measure cells from different aspects. When dealing with such multi-omics data, the first step is to determine the correspondence among different omics. In other words, we should match data from different spaces corresponding to the same object. This problem is particularly challenging in the single-cell multi-omics scenario because such data are very sparse with extremely high dimensions. Secondly, matched single-cell multi-omics data are rare and hard to collect. Furthermore, due to the limitations of the experimental environment, the data are usually highly noisy. To promote the single-cell multi-omics research, we overcome the above challenges, proposing a novel framework to align and integrate single-cell RNA-seq data and single-cell ATAC-seq data. Our approach can efficiently map the above data with high sparsity and noise from different spaces to a low-dimensional manifold in a unified space, making the downstream alignment and integration straightforward. Compared with the other state-of-the-art methods, our method performs better on both simulated and real single-cell data. On the real data, the performance improvement on accuracy over the previous methods is up to 55.7% regarding scRNA-seq and scATAC-seq data integration. Downstream trajectory inference analysis shows that our tool can transfer the labels from scRNA-seq to scATAC-seq with very high accuracy, which indicates our method’s effectiveness.

Artificial Intelligence

Biophysics

1

Paper

Artificial Intelligence

1

0

Save

6

conST: an interpretable multi-modal contrastive learning framework for spatial transcriptomics

Yongshuo Zong et al.Jan 17, 2022

Abstract Motivation Spatially resolved transcriptomics (SRT) shows its impressive power in yielding biological insights into neuroscience, disease study, and even plant biology. However, current methods do not sufficiently explore the expressiveness of the multi-modal SRT data, leaving a large room for improvement of performance. Moreover, the current deep learning based methods lack interpretability due to the “black box” nature, impeding its further applications in the areas that require explanation. Results We propose conST, a powerful and flexible SRT data analysis framework utilizing contrastive learning techniques. conST can learn low-dimensional embeddings by effectively integrating multi-modal SRT data, i . e . gene expression, spatial information, and morphology (if applicable). The learned embeddings can be then used for various downstream tasks, including clustering, trajectory and pseudotime inference, cell-to-cell interaction, etc . Extensive experiments in various datasets have been conducted to demonstrate the effectiveness and robustness of the proposed conST, achieving up to 10% improvement in clustering ARI in the commonly used benchmark dataset. We also show that the learned embedding can be used in complicated scenarios, such as predicting cancer progression by analyzing the tumour microenvironment and cell-to-cell interaction (CCI) of breast cancer. Our framework is interpretable in that it is able to find the correlated spots that support the clustering, which matches the CCI interaction pairs as well, providing more confidence to clinicians when making clinical decisions.

Artificial Intelligence

Biochemistry

6

Paper

Artificial Intelligence

Biochemistry

0

Save

3

OmicVerse: A single pipeline for exploring the entire transcriptome universe

Zehua Zeng et al.Jun 8, 2023

Abstract Single-cell sequencing is frequently marred by “interruptions” due to limitations in sequencing throughput, yet bulk RNA-seq may harbor these ostensibly “interrupted” cells. In response, we introduce the single cell trajectory blending from Bulk RNA-seq (BulkTrajBlend) algorithm, a component of the OmicVerse suite that leverages a Beta-Variational AutoEncoder for data deconvolution and graph neural networks for the discovery of overlapping community. This approach proficiently interpolates and restores the continuity of “interrupted” cells within single-cell RNA sequencing dataset. Furthermore, OmicVerse provides an extensive toolkit for bulk and single cell RNA-seq analysis, offering uniform access to diverse methodologies, streamlining computational processes, fostering exquisite data visualization, and facilitating the extraction of novel biological insights to advance scientific research.

Genetics

History

3

Paper

Genetics

History

0

Save