ResearchHub | Open Science Community

scaDA: A Novel Statistical Method for Differential Analysis of Single-Cell Chromatin Accessibility Sequencing Data

Fengdi Zhao et al., Jan 24, 2024

Abstract

Single-cell ATAC-seq (scATAC-seq) data has been widely used to investigate chromatin accessibility at the single-cell level. One important application of scATAC-seq data analysis is differential chromatin accessibility (DA) analysis. However, the data characteristics of scATAC-seq, such as excessive zeros and large variability of chromatin accessibility across cells, pose a unique challenge for DA analysis. Existing statistical methods focus on detecting the mean difference of chromatin-accessible regions while overlooking the distribution difference. Motivated by real-data exploration showing that distribution differences exist among cell types, we introduce a novel composite statistical test named “scaDA”, which is based on a zero-inflated negative binomial (ZINB) model, for performing differential distribution analysis of chromatin accessibility by jointly testing abundance, prevalence and dispersion. Benefiting from both dispersion shrinkage and iterative refinement of mean and prevalence parameter estimates, scaDA outperforms both ZINB-based likelihood ratio tests and published methods, achieving the highest power and best FDR control in a comprehensive simulation study. In addition to demonstrating the highest power in three real sc-multiome data analyses, scaDA successfully identifies differentially accessible regions in microglia from sc-multiome data for an Alzheimer's disease (AD) study; these regions are most enriched in GO terms related to neurogenesis, the clinical phenotype of AD, and SNPs identified in AD-associated GWAS.

Author summary

Understanding the cis-regulatory elements that control the fundamental gene regulatory process is important to basic biology. scATAC-seq data offers an unprecedented opportunity to investigate chromatin accessibility at the single-cell level and to explore cell heterogeneity, revealing the dynamic changes of cis-regulatory elements among different cell types. To understand the dynamic change of gene regulation using scATAC-seq data, differential chromatin accessibility (DA) analysis, one of the most fundamental analyses for scATAC-seq data, enables the identification of differentially accessible regions between cell types or between multiple conditions. DA analysis has many applications, such as identifying cell type-specific chromatin-accessible regions to reveal the cell type-specific gene regulatory program, assessing disease-associated changes in chromatin accessibility to detect potential biomarkers, and linking differentially accessible regions to differentially expressed genes to build a comprehensive gene regulatory map. This paper proposes a novel statistical method named “scaDA” to improve the detection of differentially accessible regions by performing differential distribution analysis. We believe scaDA will benefit the single-cell genomics research community.
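The zero-inflated negative binomial model that the abstract builds on can be illustrated with a minimal sketch. This is not the scaDA implementation: the function names, the mean/dispersion parameterization (`mu`, `phi`), and the use of `pi0` for the extra-zero probability are assumptions chosen for illustration. How scaDA defines "prevalence" in terms of the zero-inflation parameter is not stated in the abstract.

```python
import math

def nb_log_pmf(k, mu, phi):
    """Log-probability of count k under a negative binomial with mean mu > 0
    and dispersion phi > 0 (size r = 1/phi). Parameterization is an assumption."""
    r = 1.0 / phi
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu))
            + k * math.log(mu / (r + mu)))

def zinb_pmf(k, mu, phi, pi0):
    """ZINB probability of count k: a point mass at zero with weight pi0,
    mixed with a negative binomial with weight (1 - pi0)."""
    zero_mass = pi0 if k == 0 else 0.0
    return zero_mass + (1.0 - pi0) * math.exp(nb_log_pmf(k, mu, phi))

def zinb_log_likelihood(counts, mu, phi, pi0):
    """Log-likelihood of per-cell counts for one region under a shared ZINB."""
    return sum(math.log(zinb_pmf(k, mu, phi, pi0)) for k in counts)
```

A composite test of the kind described would compare such likelihoods between a model with shared parameters and a model with group-specific mean, zero-inflation and dispersion parameters. The parameter estimation, dispersion shrinkage and iterative refinement steps mentioned in the abstract are not shown here.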

Genetics

Molecular Biology

0

Paper


Deep5hmC: Predicting genome-wide 5-Hydroxymethylcytosine landscape via a multimodal deep learning model

Xin Ma et al., Mar 6, 2024

5-hydroxymethylcytosine (5hmC), a critical epigenetic mark with a significant role in regulating tissue-specific gene expression, is essential for understanding the dynamic functions of the human genome. Using tissue-specific 5hmC sequencing data, we introduce Deep5hmC, a multimodal deep learning framework that integrates both the DNA sequence and the histone modification information to predict genome-wide 5hmC modification. The multimodal design of Deep5hmC demonstrates remarkable improvement in predicting both qualitative and quantitative 5hmC modification compared to unimodal versions of Deep5hmC and state-of-the-art machine learning methods. This improvement is demonstrated through benchmarking on a comprehensive set of 5hmC sequencing data collected at four time points during forebrain organoid development and across 17 human tissues. Notably, Deep5hmC showcases its practical utility by accurately predicting gene expression and identifying differentially hydroxymethylated regions in a case-control study of Alzheimer's disease.

Genetics

Artificial Intelligence

0

Paper


DeepPHiC: Predicting promoter-centered chromatin interactions using a novel deep learning approach

Aman Agarwal et al., May 25, 2022

Abstract

Motivation: Promoter-centered chromatin interactions, which include promoter-enhancer and promoter-promoter interactions, are important for deciphering gene regulation and disease mechanisms. The development of next-generation sequencing technologies such as promoter capture Hi-C (pcHi-C) has led to the discovery of promoter-centered chromatin interactions. However, pcHi-C experiments are expensive and thus may be unavailable for tissues or cell types of interest. In addition, these experiments may be underpowered due to insufficient sequencing depth or various artifacts, which limits the number of interactions detected.

Results: To overcome these challenges, we develop a supervised multi-modal deep learning model, which utilizes a comprehensive set of features including genomic sequence, epigenetic signal and anchor distance to predict tissue/cell type-specific genome-wide promoter-enhancer and promoter-promoter interactions. We further extend the deep learning model to multi-task learning and transfer learning frameworks. We demonstrate that the proposed approach outperforms state-of-the-art deep learning methods and is robust to the inclusion of anchor distance as a feature. In addition, we find that the proposed approach can achieve comparable prediction performance using biologically relevant tissues/cell types compared to using all tissues/cell types, especially for predicting promoter-enhancer interactions.

Availability: https://github.com/lichen-lab/DeepPHiC

Genetics

Artificial Intelligence

22

Paper


DeepPerVar: a multimodal deep learning framework for functional interpretation of genetic variants in personal genome

Ye Wang et al., Apr 11, 2022

Abstract

Motivation: Understanding the functional consequence of genetic variants, especially the noncoding ones, is important but particularly challenging. Genome-wide association studies or quantitative trait locus analyses may be subject to limited statistical power and linkage disequilibrium, and are thus less suited to pinpointing the causal variants. Moreover, most existing machine learning approaches, which exploit functional annotations to interpret and prioritize putative causal variants, cannot accommodate the heterogeneity of personal genetic variations and traits in a population study targeting a specific disease.

Results: By leveraging paired whole genome sequencing data and epigenetic functional assays in a population study, we propose a multi-modal deep learning framework to predict genome-wide quantitative epigenetic signals by considering both personal genetic variations and traits. The proposed approach can further evaluate the functional consequence of noncoding variants on an individual level by quantifying the allelic difference of predicted epigenetic signals. By applying the approach to the ROSMAP cohort studying Alzheimer’s disease (AD), we demonstrate that the proposed approach can accurately predict quantitative epigenetic signals genome-wide and in key genomic regions of AD causal genes, learn canonical motifs reported to regulate gene expression of AD causal genes, improve the partitioning heritability analysis, and prioritize putative causal variants in a GWAS risk locus. Finally, we release the proposed deep learning model as a stand-alone Python toolkit and a web server.

Availability: https://github.com/lichen-lab/DeepPerVar

Genetics

Molecular Biology

1

Paper


Exploiting deep transfer learning for the prediction of functional noncoding variants using genomic sequence

Li Chen et al., Mar 21, 2022

Abstract

Motivation: Though genome-wide association studies have identified tens of thousands of variants associated with complex traits, most of which fall within noncoding regions, these variants may not be the causal ones. The development of high-throughput functional assays has led to the discovery of experimentally validated noncoding functional variants. However, these validated variants are rare due to technical difficulty and financial cost. The small sample size of validated variants makes it less reliable to develop a supervised machine learning model for genome-wide prediction of noncoding causal variants.

Results: We exploit a deep transfer learning model, which is based on a convolutional neural network, to improve the prediction of functional noncoding variants. To address the challenge of small sample size, the transfer learning model leverages both large-scale generic functional noncoding variants to improve the learning of low-level features and context-specific functional noncoding variants to learn high-level features for the context-specific prediction task. By evaluating the deep transfer learning model on three MPRA datasets and 16 GWAS datasets, we demonstrate that the proposed model outperforms deep learning models without pretraining or retraining. In addition, the deep transfer learning model outperforms 18 existing computational methods on both MPRA and GWAS datasets.

Availability: https://github.com/lichen-lab/TLVar

Supplementary information: Supplementary data are available at Bioinformatics online.

Contact: chen61@iu.edu

Genetics

Artificial Intelligence

5

Paper


WEVar: a novel statistical learning framework for predicting noncoding regulatory variants

Ye Wang et al., Nov 18, 2020

Abstract

Understanding the functional consequence of noncoding variants is of great interest. Though genome-wide association studies (GWAS) or quantitative trait locus (QTL) analyses have identified variants associated with traits or molecular phenotypes, most of them are located in noncoding regions, making the identification of causal variants a particular challenge. Existing computational approaches developed for prioritizing noncoding variants produce inconsistent and even conflicting results. To address these challenges, we propose a novel statistical learning framework, which directly integrates the precomputed functional scores from representative scoring methods. It maximizes the usage of the integrated methods by automatically learning the relative contribution of each method and produces an ensemble score as the final prediction. The framework consists of two modes. The first, “context-free” mode is trained using curated causal regulatory variants from a wide range of contexts and is applicable to predicting noncoding variants of unknown and diverse context. The second, “context-dependent” mode further improves the prediction when the training and testing variants are from the same context. By evaluating the framework via both simulation and empirical studies, we demonstrate that it outperforms the integrated scoring methods and that the ensemble score successfully prioritizes experimentally validated regulatory variants in multiple risk loci.
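The idea of learning each scoring method's relative contribution and producing an ensemble score can be sketched as follows. This is an illustrative stand-in, not the WEVar method: plain logistic regression fitted by gradient descent is an assumption (the abstract does not specify the learner), and the function names and toy scores are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_ensemble_weights(scores, labels, lr=0.1, epochs=2000):
    """Fit one weight per scoring method (plus an intercept) by batch
    gradient descent on the logistic loss.
    scores: list of rows, one precomputed score per method per variant.
    labels: 1 for a known regulatory variant, 0 for a control."""
    n, n_methods = len(scores), len(scores[0])
    w, b = [0.0] * n_methods, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_methods, 0.0
        for x, y in zip(scores, labels):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) - y
            for j in range(n_methods):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def ensemble_score(x, w, b):
    """Combine one variant's per-method scores into a single score in (0, 1)."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Hypothetical toy data: method 1 separates the classes, method 2 does not.
toy_scores = [[0.9, 0.4], [0.8, 0.6], [0.7, 0.5],
              [0.2, 0.5], [0.1, 0.6], [0.3, 0.4]]
toy_labels = [1, 1, 1, 0, 0, 0]
weights, bias = fit_ensemble_weights(toy_scores, toy_labels)
```

With these toy inputs the fitted weight on the first method is positive, so a variant scored highly by that method receives a higher ensemble score than one scored low, all else equal.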

Genetics

Artificial Intelligence

5

Paper


TIVAN-indel: A computational framework for annotating and predicting noncoding regulatory small insertion and deletion

Aman Agarwal et al., Sep 30, 2022

Abstract

Motivation: Small insertions and deletions (sindels) in the human genome have important implications for human disease. One important mechanism by which noncoding sindels affect human diseases and phenotypes is the regulation of gene expression. Nevertheless, current sequencing technology may lack the statistical power and resolution to pinpoint causal sindels due to low minor allele frequency or small effect size. As an alternative solution, a supervised machine learning method can identify otherwise missed causal sindels by predicting the regulatory potential of sindels directly. However, computational methods for annotating and predicting regulatory sindels, especially in noncoding regions, are underdeveloped.

Results: By leveraging recognized sindels in cis-expression quantitative trait loci (cis-eQTLs) across 44 tissues and cell types in GTEx, and a compilation of both generic functional annotations and tissue/cell type-specific multi-omics features generated by a sequence-based deep learning model, we developed TIVAN-indel, an XGBoost-based supervised framework for scoring noncoding sindels based on their potential to regulate nearby gene expression. We demonstrate that TIVAN-indel achieves the best prediction performance in both cross-validation within-tissue prediction and independent cross-tissue evaluation. As an independent evaluation, we train TIVAN-indel on “Whole Blood” tissue in GTEx data and test the model using 15 immune cell types from an independent study, DICE. Lastly, we perform an enrichment analysis for both recognized and predicted sindels in key regulatory regions such as chromatin interactions, open chromatin and histone modification sites, and find biologically meaningful enrichment patterns.

Availability and implementation: https://github.com/lichen-lab/TIVAN-indel

Contact: li.chen1@ufl.edu

Genetics

Molecular Biology

9

Paper


MPRAVarDB: an online database and web server for exploring regulatory effects of genetic variants

Nizomov Javlon et al., Apr 3, 2024

Abstract

Summary: Massively parallel reporter assay (MPRA) is an important technology for evaluating the impact of genetic variants on gene regulation. Here, we present MPRAVarDB, an online database and web server for exploring the regulatory effects of genetic variants. MPRAVarDB harbors 18 MPRA experiments designed to assess the regulatory effects of genetic variants associated with GWAS loci, eQTLs and various genomic features, resulting in a total of 242,818 variants tested across more than 30 cell lines and 30 human diseases or traits. MPRAVarDB supports querying MPRA variants by genomic region, disease and cell line, or by any combination of these query terms. Notably, MPRAVarDB offers a suite of pretrained machine learning models tailored to specific diseases and cell lines, facilitating the genome-wide prediction of regulatory variants. MPRAVarDB is user-friendly, and users need only a few clicks to receive query and prediction results.

Availability: https://mpravardb.rc.ufl.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Genetics

Molecular Biology

0

Paper


Spatial Metabolome Lipidome and Glycome from a Single brain Section

Harrison Clarke et al., Jul 25, 2023

Abstract

Metabolites, lipids, and glycans are fundamental biomolecules involved in complex biological systems. They are metabolically channeled through a myriad of pathways and molecular processes that define the physiology and pathology of an organism. Here, we present a blueprint for the simultaneous analysis of spatial metabolome, lipidome, and glycome from a single tissue section using mass spectrometry imaging. Complementing an original experimental protocol, our workflow includes a computational framework called Spatial Augmented Multiomics Interface (Sami) that offers multiomics integration, high-dimensionality clustering, spatial anatomical mapping with matched multiomics features, and metabolic pathway enrichment to provide unprecedented insights into the spatial distribution and interaction of these biomolecules in mammalian tissue biology.

Genetics

Biochemistry

19

Paper


MetaVision3D: Automated Framework for the Generation of Spatial Metabolome Atlas in 3D

Xin Ma et al., Nov 28, 2023

Abstract

High-resolution spatial imaging is transforming our understanding of foundational biology. Spatial metabolomics is an emerging field that enables the dissection of the complex metabolic landscape and heterogeneity from a thin tissue section. Currently, spatial metabolism highlights the remarkable complexity in two-dimensional space and is poised to be extended into the three-dimensional world of biology. Here, we introduce MetaVision3D, a novel pipeline driven by computer vision techniques for the transformation of serial 2D MALDI mass spectrometry imaging sections into a high-resolution 3D spatial metabolome. Our framework employs advanced algorithms for image registration, normalization, and interpolation to enable the integration of serial 2D tissue sections, thereby generating a comprehensive 3D model of unique diverse metabolites across host tissues at mesoscale. As a proof of principle, MetaVision3D was utilized to generate the mouse brain 3D metabolome atlas (available at https://metavision3d.rc.ufl.edu/) as an interactive online database and web server to further advance brain metabolism and related research.

Artificial Intelligence

Biophysics

0

Paper
