ResearchHub | Open Science Community

Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings

Nathan LeRoy et al., Jul 2, 2024

Data from the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) are now widely available. One major computational challenge is dealing with high dimensionality and inherent sparsity, which is typically addressed by producing lower-dimensional representations of single cells for downstream clustering tasks. Current approaches produce such individual cell embeddings directly through a one-step learning process. Here, we propose an alternative approach by building embedding models pre-trained on reference data. We argue that this provides a more flexible analysis workflow that also has computational performance advantages through transfer learning. We implemented our approach in scEmbed, an unsupervised machine-learning framework that learns low-dimensional embeddings of genomic regulatory regions to represent and analyze scATAC-seq data. scEmbed performs well in terms of clustering ability and has the key advantage of learning patterns of region co-occurrence that can be transferred to other, unseen datasets. Moreover, models pre-trained on reference data can be exploited to build fast and accurate cell-type annotation systems without the need for other data modalities. scEmbed is implemented in Python, is open source, and is available at https://github.com/databio/geniml. Pre-trained models from this work are available for public use on Hugging Face: https://huggingface.co/databio.

Artificial Intelligence

Molecular Biology

Methods for evaluating unsupervised vector representations of genomic regions

Guangtao Zheng et al., Jul 2, 2024

Representation learning models have become a mainstay of modern genomics. These models are trained to yield vector representations, or embeddings, of various biological entities, such as cells, genes, individuals, or genomic regions. Recent applications of unsupervised embedding approaches have been shown to learn relationships among genomic regions that define functional elements in a genome. Unsupervised representation learning of genomic regions is free of the supervision from curated metadata and can condense rich biological knowledge from publicly available data to region embeddings. However, there exists no method for evaluating the quality of these embeddings in the absence of metadata, making it difficult to assess the reliability of analyses based on the embeddings, and to tune model training to yield optimal results. To bridge this gap, we propose four evaluation metrics: the cluster tendency score (CTS), the reconstruction score (RCS), the genome distance scaling score (GDSS), and the neighborhood preserving score (NPS). The CTS and RCS statistically quantify how well region embeddings can be clustered and how well the embeddings preserve information in training data. The GDSS and NPS exploit the biological tendency of regions close in genomic space to have similar biological functions; they measure how much such information is captured by individual region embeddings in a set. We demonstrate the utility of these statistical and biological scores for evaluating unsupervised genomic region embeddings and provide guidelines for learning reliable embeddings.
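As a rough illustration of the neighborhood idea behind the NPS, the sketch below compares each region's k nearest neighbors along the genome with its k nearest neighbors in embedding space and averages the overlap. It is a simplified stand-in, not the paper's definition of the score: the function name `neighborhood_overlap`, the use of region midpoints on a single chromosome, Euclidean distance, and the toy data are all assumptions made for illustration.

```python
import math


def neighborhood_overlap(midpoints, embeddings, k):
    """Mean fraction of shared k-nearest neighbors between genome and embedding space.

    midpoints:  region midpoints on ONE chromosome (hypothetical input layout).
    embeddings: one vector per region, in the same order as `midpoints`.
    """
    n = len(midpoints)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # k closest regions by genomic distance
        genome_nn = set(
            sorted(others, key=lambda j: abs(midpoints[i] - midpoints[j]))[:k]
        )
        # k closest regions by Euclidean distance between embeddings
        embed_nn = set(
            sorted(others, key=lambda j: math.dist(embeddings[i], embeddings[j]))[:k]
        )
        total += len(genome_nn & embed_nn) / k
    return total / n


# Toy data: two groups of three neighboring regions.
midpoints = [100, 200, 300, 9000, 9100, 9200]
# Embeddings that mirror the genomic layout.
preserved = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
# Embeddings that mix the two groups.
scrambled = [(0.0,), (5.0,), (0.1,), (5.1,), (0.2,), (5.2,)]

preserved_score = neighborhood_overlap(midpoints, preserved, k=2)
scrambled_score = neighborhood_overlap(midpoints, scrambled, k=2)
print(preserved_score, scrambled_score)
```

In this toy setup, embeddings that mirror the genomic layout score higher than embeddings that mix the two groups, which is the kind of signal a neighborhood-based score is meant to pick up.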

Genetics

Artificial Intelligence

Methods for evaluating unsupervised vector representations of genomic regions

Guangtao Zheng et al., Aug 29, 2023

Background: Representation learning models have become a mainstay of modern genomics. These models are trained to yield vector representations, or embeddings, of various biological entities, such as cells, genes, individuals, or genomic regions. Recent applications of unsupervised embedding approaches have been shown to learn relationships among genomic regions that define functional elements in a genome. Unsupervised representation learning of genomic regions is free of the supervision from curated metadata and can condense rich biological knowledge from publicly available data to region embeddings. However, there exists no method for evaluating the quality of these embeddings in the absence of metadata, making it difficult to assess the reliability of analyses based on the embeddings, and to tune model training to yield optimal results. Methods: To bridge this gap, we propose four evaluation metrics: the cluster tendency test (CTT), the reconstruction test (RCT), the genome distance scaling test (GDST), and the neighborhood preserving test (NPT). The CTT and RCT are statistical methods that evaluate how well region embeddings can be clustered and how much the embeddings can preserve the information contained in training data. The GDST and NPT exploit the biological tendency of regions close in genomic space to have similar biological functions; they measure how much such information is captured by individual region embeddings and a set of region embeddings. Results: We demonstrate the utility of these statistical and biological tests for evaluating unsupervised genomic region embeddings and provide guidelines for learning reliable embeddings. Availability: Code is available at https://github.com/databio/geniml.

Genetics

Artificial Intelligence

Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings

Nathan LeRoy et al., Aug 3, 2023

Motivation: Data from the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) is now widely available. One major computational challenge is dealing with high dimensionality and inherent sparsity, which is typically addressed by producing lower-dimensional representations of single cells for downstream clustering tasks. Current approaches produce such individual cell embeddings directly through a one-step learning process. Here, we propose an alternative approach by building embedding models pre-trained on reference data. We argue that this provides a more flexible analysis workflow that also has computational performance advantages through transfer learning. Results: We implemented our approach in scEmbed, an unsupervised machine learning framework that learns low-dimensional embeddings of genomic regulatory regions to represent and analyze scATAC-seq data. scEmbed is competitive with alternative scATAC embedding approaches in terms of clustering ability and has the advantage of learning patterns of region co-occurrence that can be transferred to other, unseen datasets. Moreover, pre-trained models on reference data can be exploited to build fast and accurate cell-type annotation systems without the need for other data modalities. scEmbed is implemented in Python and it is available to download from GitHub. We also make our pre-trained models available on huggingface for public use. Availability: scEmbed is open source and available at https://github.com/databio/geniml. Pre-trained models from this work can be obtained on huggingface: https://huggingface.co/databio.

Artificial Intelligence

Molecular Biology

Methods for constructing and evaluating consensus genomic interval sets

Julia Rymuza et al., Aug 24, 2024

Abstract: The amount of genomic region data continues to increase. Integrating across diverse genomic region sets requires consensus regions, which enable comparing regions across experiments, but also by necessity lose precision in region definitions. We require methods to assess this loss of precision and build optimal consensus region sets. Here, we introduce the concept of flexible intervals and propose three novel methods for building consensus region sets, or universes: a coverage cutoff method, a likelihood method, and a Hidden Markov Model. We then propose three novel measures for evaluating how well a proposed universe fits a collection of region sets: a base-level overlap score, a region boundary distance score, and a likelihood score. We apply our methods and evaluation approaches to several collections of region sets and show how these methods can be used to evaluate fit of universes and build optimal universes. We describe scenarios where the common approach of merging regions to create consensus leads to undesirable outcomes and provide principled alternatives that provide interoperability of interval data while minimizing loss of resolution.
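To make the coverage cutoff idea concrete, here is a minimal sketch under one plausible reading of it: a base belongs to the universe when at least `cutoff` of the input region sets cover it. The function name `coverage_cutoff_universe`, the single-chromosome half-open `(start, end)` input layout, and the toy intervals are assumptions for illustration and are not taken from the geniml code.

```python
from collections import defaultdict


def coverage_cutoff_universe(region_sets, cutoff):
    """Return merged intervals covered by at least `cutoff` region sets.

    region_sets: list of lists of half-open (start, end) intervals on ONE
    chromosome. Intervals within a single set are assumed not to overlap,
    so each set adds at most 1 to the coverage of any base.
    """
    # Sweep-line events: +1 where an interval starts, -1 where it ends.
    events = defaultdict(int)
    for regions in region_sets:
        for start, end in regions:
            events[start] += 1
            events[end] -= 1

    universe = []
    coverage = 0
    current_start = None
    for pos in sorted(events):
        coverage += events[pos]
        if coverage >= cutoff and current_start is None:
            current_start = pos  # coverage rose to the cutoff: open a region
        elif coverage < cutoff and current_start is not None:
            universe.append((current_start, pos))  # coverage dropped: close it
            current_start = None
    return universe


# Three hypothetical region sets with staggered peaks.
region_sets = [[(100, 200)], [(150, 250)], [(180, 300)]]
union_like = coverage_cutoff_universe(region_sets, cutoff=1)
majority = coverage_cutoff_universe(region_sets, cutoff=2)
strict = coverage_cutoff_universe(region_sets, cutoff=3)
print(union_like, majority, strict)
```

With a cutoff of 1 the result is the plain merge of all intervals, which is the "common approach" the abstract cautions about; raising the cutoff trims the universe to bases supported by more sets.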

Molecular Biology

Biology

Pivotal role of biallelic frequency analysis in identifying copy number alterations using genome-wide methods in tumors with a high level of aneuploidy

Julia Rymuza et al., Mar 16, 2024

Abstract: Chromosome number abnormalities are one of the hallmarks of cancer. DNA copy number alterations (CNAs) are studied using various genome-wide methods. In our study, we investigated CNAs in human pituitary tumors using three platforms: CytoSNP-850K microarrays, low-pass whole-genome sequencing (LPWGS, average 7x coverage), and the Infinium Methylation EPIC array. For each sample, virtual karyotypes based on each dataset were generated using open-source software packages. Concordant CNA profiles were found for most tumors. Surprisingly, substantial discrepancies between results from SNP arrays and LPWGS/EPIC arrays were identified in 20% of tumors, for which the true karyotype had to be determined. B-allele frequency data from SNP arrays were crucial for adjusting the normal ploidy level, as ultimately verified with FISH. The more CNAs a tumor carried, the more pronounced the discrepancy between virtual karyotypes. When CNAs covered more than half of the genome, methods based solely on signal intensity or read-count coverage set the normal (diploid) copy number level incorrectly. To conclude, CNA analysis of highly aneuploid tumors with methods such as LPWGS and methylation arrays is prone to bias from an improperly set normal ploidy level. Because these methods are commonly used, we aim to make the scientific community aware of this underestimated methodological problem.
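The baseline problem described above can be illustrated with a small numerical sketch. It assumes, for illustration only, that an intensity-only method takes the genome-wide median log2 signal as the diploid level. The functions `median_centered_log2` and `expected_baf` and the ten-segment toy tumor are hypothetical and do not come from the study's pipeline.

```python
import math
from statistics import median


def median_centered_log2(copy_numbers):
    """Log2 signal per segment after centering on the genome-wide median.

    Assumption: the method has no allele information and treats the median
    signal as the normal (diploid) level.
    """
    raw = [math.log2(cn / 2) for cn in copy_numbers]
    baseline = median(raw)
    return [value - baseline for value in raw]


def expected_baf(b_copies, total_copies):
    """Expected B-allele frequency at a heterozygous SNP in a pure tumor sample."""
    return b_copies / total_copies


# Hypothetical tumor: 6 of 10 equal-sized segments carry 3 copies, 4 carry 2.
segments = [3] * 6 + [2] * 4
centered = median_centered_log2(segments)

# Gained segments land on the baseline and look "normal";
# truly diploid segments fall below it and look like losses.
print(centered[0], centered[-1])

# Allele frequencies separate the two states: 0.5 at two copies (1 of 2),
# about 0.33 (or 0.67) at three copies (1 or 2 of 3).
print(expected_baf(1, 2), expected_baf(1, 3))
```

Because more than half of the toy genome is gained, the median sits on the gained segments, so signal intensity alone cannot tell "mostly gained" from "mostly normal with some losses". The allele frequencies differ between the two scenarios, which is why they can resolve the ambiguity.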

Genetics

Cell Biology

Methods for constructing and evaluating consensus genomic interval sets

Julia Rymuza et al., Aug 5, 2023

Motivation: The amount of genomic region data continues to increase. Integrating across diverse genomic region sets requires consensus regions, which enable comparing regions across experiments, but also by necessity lose precision in region definitions. We require methods to assess this loss of precision and build optimal consensus region sets. Results: We introduce the concept of flexible intervals and propose 3 novel methods for building consensus region sets, or universes: a coverage cutoff method, a likelihood method, and a Hidden Markov Model. We then propose 3 novel measures for evaluating how well a proposed universe fits a collection of region sets: a base-level overlap score, a region boundary score, and a likelihood score. We apply our methods and evaluation approaches to several collections of region sets and show how these methods can be used to evaluate fit of universes and build optimal universes. We describe scenarios where the common approach of merging regions to create consensus leads to undesirable outcomes and provide principled alternatives that provide interoperability of interval data while minimizing loss of resolution. Availability: https://github.com/databio/geniml.

Genetics

Molecular Biology
