ResearchHub | Open Science Community

scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI

Haotian Cui et al.May 1, 2023

Abstract Generative pre-trained models have achieved remarkable success in various domains such as natural language processing and computer vision. Specifically, the combination of large-scale diverse datasets and pre-trained transformers has emerged as a promising approach for developing foundation models. Drawing parallels between linguistic constructs and cellular biology — where texts comprise words, similarly, cells are defined by genes — our study probes the applicability of foundation models to advance cellular biology and genetics research. Utilizing the burgeoning single-cell sequencing data, we have pioneered the construction of a foundation model for single-cell biology, scGPT, which is based on generative pre-trained transformer across a repository of over 33 million cells. Our findings illustrate that scGPT, a generative pre-trained transformer, effectively distills critical biological insights concerning genes and cells. Through the further adaptation of transfer learning, scGPT can be optimized to achieve superior performance across diverse downstream applications. This includes tasks such as cell-type annotation, multi-batch integration, multi-omic integration, genetic perturbation prediction, and gene network inference. The scGPT codebase is publicly available at https://github.com/bowang-lab/scGPT .

Artificial Intelligence

Biophysics

9

Paper

Artificial Intelligence

Save

DeepVelo: Deep Learning extends RNA velocity to multi-lineage systems with cell-specific kinetics

Haotian Cui et al.Apr 5, 2022

1 Abstract The introduction of RNA velocity in single-cell studies has opened new ways of examining cell differentiation and tissue development. Existing RNA velocity estimation methods rely on strong assumptions of predefined dynamics and cell-agnostic constant transcriptional kinetic rates, which are often violated in complex and heterogeneous single-cell RNA sequencing (scRNA-seq) data. To overcome these limitations, we propose DeepVelo, a novel method that estimates the cell-specific dynamics of splicing kinetics using Graph Convolution Networks (GCNs). DeepVelo generalizes RNA velocity to cell populations containing time-dependent kinetics and multiple lineages, which are common in developmental and pathological systems. We applied DeepVelo to disentangle multifaceted kinetics in the processes of dentate gyrus neurogenesis, pancreatic endocrinogenesis, and hindbrain development. The method infers time-varying cellular rates of transcription, splicing and degradation, recovers each cell’s stage in the underlying differentiation process, and detects functionally relevant driver genes regulating these processes. DeepVelo relaxes the constraints of previous techniques, facilitates the study of more complex differentiation and lineage decision events in heterogeneous scRNA-seq data, and is more computationally efficient than previous techniques.

Genetics

Molecular Biology

1

Paper

Save

The differential impacts of dataset imbalance in single-cell data integration

Hassaan Maan et al.Oct 8, 2022

Abstract Single-cell transcriptomic data measured across distinct samples has led to a surge in computational methods for data integration. Few studies have explicitly examined the common case of cell-type imbalance between datasets to be integrated, and none have characterized its impact on downstream analyses. To address this gap, we developed the Iniquitate pipeline for assessing the stability of single-cell RNA sequencing (scRNA-seq) integration results after perturbing the degree of imbalance between datasets. Through benchmarking 5 state-of-the-art scRNA-seq integration techniques in 1600 perturbed integration scenarios for a multi-sample peripheral blood mononuclear cell (PBMC) dataset, our results indicate that sample imbalance has significant impacts on downstream analyses and the biological interpretation of integration results. We observed significant variation in clustering, cell-type classification, marker gene-based annotation, and query-to-reference mapping in imbalanced settings. Two key factors were found to lead to quantitation differences after scRNA-seq integration - the cell-type imbalance within and between samples ( relative cell-type support ) and the relatedness of cell-types across samples ( minimum cell-type center distance ). To account for evaluation gaps in imbalanced contexts, we developed novel clustering metrics robust to sample imbalance, including the balanced Adjusted Rand Index (bARI) and balanced Adjusted Mutual Information (bAMI). Our analysis quantifies biologically-relevant effects of dataset imbalance in integration scenarios and introduces guidelines and novel metrics for integration of disparate datasets. The Iniquitate pipeline and balanced clustering metrics are available at https://github.com/hsmaan/Iniquitate and https://github.com/hsmaan/balanced-clustering , respectively.

Artificial Intelligence

Molecular Biology

45

Paper

Artificial Intelligence

5

0

Save

45

Multiscale interactome analysis coupled with off-target drug predictions reveals drug repurposing candidates for human coronavirus disease

Michael Sugiyama et al.Apr 13, 2021

Abstract The COVID-19 pandemic has led to an urgent need for the identification of new antiviral drug therapies that can be rapidly deployed to treat patients with this disease. COVID-19 is caused by infection with the human coronavirus SARS-CoV-2. We developed a computational approach to identify new antiviral drug targets and repurpose clinically-relevant drug compounds for the treatment of COVID-19. Our approach is based on graph convolutional networks (GCN) and involves multiscale host-virus interactome analysis coupled to off-target drug predictions. Cellbased experimental assessment reveals several clinically-relevant repurposing drug candidates predicted by the in silico analyses to have antiviral activity against human coronavirus infection. In particular, we identify the MET inhibitor capmatinib as having potent and broad antiviral activity against several coronaviruses in a MET-independent manner, as well as novel roles for host cell proteins such as IRAK1/4 in supporting human coronavirus infection, which can inform further drug discovery studies.

Genetics

Ecology

45

Paper

Save

scFormer: A Universal Representation Learning Approach for Single-Cell Data Using Transformers

Haotian Cui et al.Nov 22, 2022

A bstract Single-cell sequencing has emerged as a promising technique to decode cellular heterogeneity and analyze gene functions. With the high throughput of modern techniques and resulting large-scale sequencing data, deep learning has been used extensively to learn representations of individual cells for downstream tasks. However, most existing methods rely on fully connected networks and are unable to model complex relationships between both cell and gene representations. We hereby propose scFormer, a novel transformer-based deep learning framework to jointly optimize cell and gene embeddings for single-cell biology in an unsupervised manner. By drawing parallels between natural language processing and genomics, scFormer applies self-attention to learn salient gene and cell embeddings through masked gene modelling. scFormer provides a unified framework to readily address a variety of downstream tasks such as data integration, analysis of gene function, and perturbation response prediction. Extensive experiments using scFormer show state-of-the-art performance on seven datasets across the relevant tasks. The scFormer model implementation is available at https://github.com/bowang-lab/scFormer .

Artificial Intelligence

Biophysics

29

Paper

Artificial Intelligence

4

0

Save

37

Experimental and natural evidence of SARS-CoV-2 infection-induced activation of type I interferon responses

Arinjay Banerjee et al.Jun 18, 2020

SUMMARY Type I interferons (IFNs) are our first line of defence against a virus. Protein over-expression studies have suggested the ability of SARS-CoV-2 proteins to block IFN responses. Emerging data also suggest that timing and extent of IFN production is associated with manifestation of COVID-19 severity. In spite of progress in understanding how SARS-CoV-2 activates antiviral responses, mechanistic studies into wildtype SARS-CoV-2-mediated induction and inhibition of human type I IFN responses are lacking. Here we demonstrate that SARS-CoV-2 infection induces a mild type I IFN response in vitro and in moderate cases of COVID-19. In vitro stimulation of type I IFN expression and signaling in human airway epithelial cells is associated with activation of canonical transcriptions factors, and SARS-CoV-2 is unable to inhibit exogenous induction of these responses. Our data demonstrate that SARS-CoV-2 is not adept in blocking type I IFN responses and provide support for ongoing IFN clinical trials.

Genetics

Immunology

37

Paper

Genetics

Immunology

0

Save