ResearchHub | Open Science Community

Artificial Neural Networks for classification of single cell gene expression

Jiahui Zhong et al. Jul 30, 2021

Background: Single-cell transcriptome (SCT) sequencing has become a high-throughput technology in which gene expression can be measured concurrently from large numbers of cells. The results of gene expression studies are highly reproducible when strict protocols and standard operating procedures (SOP) are followed. However, differences in sample processing conditions result in significant changes in gene expression profiles, making direct comparison of different studies difficult. Unsupervised machine learning (ML) uses clustering algorithms combined with semi-automated cell labeling and manual annotation of individual cells. These workflows do not scale well, and a workflow used on a specific dataset will not perform well with other studies. Supervised ML classification shows superior classification accuracy and generalization properties compared with unsupervised ML methods. We describe a supervised ML method that deploys artificial neural networks (ANN) for 5-class classification of healthy peripheral blood mononuclear cells (PBMC) from multiple diverse studies.

Results: We used 58 data sets to train the ANN incrementally over ten cycles of training and testing. The sample processing involved four protocols: separation of PBMC, separation of PBMC + enrichment (by negative selection), separation of PBMC + FACS, and separation of PBMC + MACS. The training data set included between 85 and 110 thousand cells, and the test set had approximately 13 thousand cells. Training and testing were done with various combinations of data sets from four principal data sources. The overall 5-class classification accuracy on independent data sets reached 94%. Classification accuracy for B cells, monocytes, and T cells exceeded 95%. Classification accuracy for natural killer (NK) cells was 75% because of the similarity between NK cells and T cell subsets. Classification accuracy for dendritic cells (DC) was low due to very low numbers of DC in the training sets.

Conclusions: The incremental learning ANN model can accurately classify the main types of PBMC. Including more DC and resolving ambiguities between T cell and NK cell gene expression profiles will enable high-accuracy supervised ML classification of PBMC. We assembled a reference data set for healthy PBMC and demonstrated a proof-of-concept for a supervised ANN method in classification of previously unseen SCT data. The classification shows high accuracy that is consistent across different studies and sample processing methods.
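The abstract reports both an overall accuracy and separate per-class accuracies, which can differ widely when one class is rare or easily confused with another. The following is a minimal sketch of how the two figures relate, assuming per-class accuracy means the fraction of cells of a given true type that were labeled correctly. The function name and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def accuracy_report(true_labels, predicted_labels):
    """Return (overall accuracy, per-class accuracy) for paired label lists.

    Per-class accuracy here is the share of cells of each TRUE type that
    received the correct label (an assumption about the reported metric).
    """
    total = Counter(true_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    overall = sum(correct.values()) / len(true_labels)
    per_class = {cell_type: correct[cell_type] / n for cell_type, n in total.items()}
    return overall, per_class


# Toy example (made-up labels): the single NK cell is mistaken for a T cell.
true = ["T", "T", "T", "NK", "B"]
pred = ["T", "T", "T", "T", "B"]
overall, per_class = accuracy_report(true, pred)
```

In this toy case the overall accuracy stays high while the NK class scores zero, which illustrates why a rare or ambiguous class can have low accuracy without lowering the overall figure much.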

Genetics

Artificial Intelligence

Prediction of therapy outcomes of CLL using gene expression intensity, clustering, and ANN classification of single cell transcriptomes

Minjie Lyu et al. Aug 9, 2021

Background: Single cell transcriptomics is a new technology that enables us to measure the expression levels of genes from an individual cell. The expression information reflects the activity of that individual cell, which can be used to indicate the cell type. Chronic lymphocytic leukemia (CLL) is a malignancy of B cells, one of the peripheral blood mononuclear cell (PBMC) subtypes. We applied five analytical tools to the study of single cell gene expression over the course of CLL therapy. These tools included the analysis of gene expression distributions (median, interquartile range, and percentage above the quality control (QC) threshold); hierarchical clustering applied to all cells within individual single cell data sets; and an artificial neural network (ANN) for classification of healthy PBMC subtypes. These tools were applied to the analysis of CLL data representing states before and during the therapy.

Results: We identified patterns in gene expression that distinguished two patients who had complete remission (complete response), a patient who had a relapse, and a patient who had partial remission within three years of Ibrutinib therapy. Patients with complete remission showed a rapid decline of median gene expression counts, with total gene counts below the QC threshold for healthy cells (670 counts) in 80% or more of the cells. These patients also showed the emergence of healthy-like PBMC cluster maps within 120 days of therapy and distinct changes in predicted proportions of PBMC cell types.

Conclusions: The combination of basic statistical analysis, hierarchical clustering, and supervised machine learning identified gene expression patterns that distinguish among four CLL patients treated with Ibrutinib who experienced complete remission, partial remission, or relapse. These preliminary results suggest that new bioinformatics tools for single cell transcriptomics, including ANN comparison to healthy PBMC, offer promise in prognostics of CLL.
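The distribution statistics named in the abstract (median, interquartile range, and the share of cells relative to the 670-count QC threshold) can be sketched as follows. This is an illustrative sketch only: the function name, the input format (one total gene count per cell), and the quartile method are assumptions, and the paper may define these quantities differently.

```python
from statistics import median, quantiles

QC_THRESHOLD = 670  # counts; the healthy-cell QC threshold quoted in the abstract


def summarize_cells(total_counts, threshold=QC_THRESHOLD):
    """Summarize per-cell total gene counts for one single cell data set.

    total_counts: one number per cell (hypothetical input format).
    """
    # Quartiles via the "inclusive" method; other conventions give slightly
    # different values on small samples.
    q1, _, q3 = quantiles(total_counts, n=4, method="inclusive")
    n_above = sum(1 for count in total_counts if count > threshold)
    pct_above = 100.0 * n_above / len(total_counts)
    return {
        "median": median(total_counts),
        "iqr": q3 - q1,
        "pct_above_qc": pct_above,
        "pct_at_or_below_qc": 100.0 - pct_above,
    }


# Toy example with made-up counts for five cells.
summary = summarize_cells([100, 200, 300, 400, 1000])
```

Computing the same summary for each sampling time point would give the kind of trend the abstract describes, such as a falling median and a growing share of cells below the threshold.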

Genetics

Immunology


Correlation-based feature selection of single cell transcriptomics data from multiple sources

Nenad Mitić et al. Jan 6, 2025

When applying data mining or machine learning techniques to large and diverse datasets, it is often necessary to construct descriptive and predictive models. Descriptive models are used to discover relationships between the attributes of the data, while predictive models identify the characteristics of the data that will be collected in the future. Bioinformatics data is high-dimensional, making it practically impossible to apply the majority of "classical" algorithms for classification and clustering. Even if the algorithms are useful, training with large multidimensional data significantly increases processing time. Algorithms specialized for high-dimensional data often cannot process large data sets with several thousand dimensions (features). Dimension reduction methods (such as PCA) do not provide satisfactory results and also obscure the meaning of the original attributes in the data. For the constructed models to be usable, they must be scalable, as the amount of bioinformatics data is increasing rapidly. Furthermore, the significance of individual data features can differ from source to source. This paper describes an attribute selection method for efficient classification of high-dimensional (30,698 dimensions) transcriptomics data collected from different sources. The proposed method was tested with 22 classification algorithms. The classification results for the selected attribute sets are comparable to the results for the complete attribute set.
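The abstract does not spell out the selection procedure, so the following is only a generic sketch of one common correlation-based filter, not the authors' method. It ranks each feature by the absolute Pearson correlation between its values and a numeric class label and keeps the top `k`. The function names, the binary label encoding, and the toy data are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(rows, labels, k):
    """Return the indices of the k features most correlated with the labels.

    rows: list of samples, each a list of feature values.
    labels: one numeric label per sample (binary here, for illustration).
    """
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    # Highest score first; ties are broken by the lower feature index.
    scores.sort(key=lambda item: (-item[0], item[1]))
    return sorted(j for _, j in scores[:k])


# Toy data: feature 0 tracks the label, feature 1 is constant, feature 2 is unrelated.
rows = [[1, 5, 0], [2, 5, 1], [3, 5, 0], [4, 5, 1]]
labels = [0, 0, 1, 1]
selected = select_features(rows, labels, k=1)
```

Unlike PCA, a filter of this kind keeps the original attributes, so the selected features remain interpretable as individual genes.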

Philosophy

Artificial Intelligence
