ResearchHub | Open Science Community

Using control genes to correct for unwanted variation in microarray data

Johann Gagnon-Bartsch et al., Nov 17, 2011

Microarray expression studies suffer from batch effects and other unwanted variation. Many methods have been proposed to adjust microarray data to mitigate these problems. Several of these methods rely on factor analysis to infer the unwanted variation from the data. A central problem with this approach is the difficulty of distinguishing the unwanted variation from the biological variation that is of interest to the researcher. We present a new method, intended for use in differential expression studies, that attempts to overcome this problem by restricting the factor analysis to negative control genes. Negative control genes are genes known a priori not to be differentially expressed with respect to the biological factor of interest. Variation in the expression levels of these genes can therefore be assumed to be unwanted variation. We name this method “Remove Unwanted Variation, 2-step” (RUV-2). We discuss various techniques for assessing the performance of an adjustment method and compare the performance of RUV-2 with that of other commonly used adjustment methods such as ComBat and Surrogate Variable Analysis (SVA). We present several example studies, each concerning genes differentially expressed with respect to gender in the brain, and find that RUV-2 performs as well as or better than other methods. Finally, we discuss the possibility of adapting RUV-2 for use in studies not concerned with differential expression and conclude that there may be promise but substantial challenges remain.
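The two steps named in the abstract (factor analysis restricted to negative control genes, then a regression that includes the estimated factor) can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single unwanted factor, no intercept, one factor of interest, and noise-free toy data, and the function names `leading_left_singular_vector` and `ruv2` are hypothetical.

```python
# Minimal two-step sketch in the spirit of RUV-2, pure Python.
# Assumptions (not from the paper): one unwanted factor (k = 1), no intercept,
# one factor of interest x, and illustrative helper names.

def leading_left_singular_vector(rows, iters=50):
    """Power iteration on A A^T, where A is given as a list of rows."""
    n_rows, n_cols = len(rows), len(rows[0])
    v = [1.0] * n_rows
    for _ in range(iters):
        # t = A^T v, then v = A t, then renormalise.
        t = [sum(rows[i][j] * v[i] for i in range(n_rows)) for j in range(n_cols)]
        v = [sum(r * c for r, c in zip(row, t)) for row in rows]
        norm = sum(c * c for c in v) ** 0.5
        v = [c / norm for c in v]
    return v

def ruv2(Y, x, control_idx):
    """Y: samples x genes, x: factor of interest per sample.
    Returns the per-gene coefficient on x, adjusted for the estimated factor."""
    # Step 1: factor analysis using only the negative control genes.
    Yc = [[row[j] for j in control_idx] for row in Y]
    w = leading_left_singular_vector(Yc)
    # Step 2: regress every gene on (x, w) via the 2x2 normal equations.
    sxx = sum(a * a for a in x)
    sww = sum(b * b for b in w)
    sxw = sum(a * b for a, b in zip(x, w))
    det = sxx * sww - sxw ** 2
    betas = []
    for j in range(len(Y[0])):
        y = [row[j] for row in Y]
        sxy = sum(a * c for a, c in zip(x, y))
        swy = sum(b * c for b, c in zip(w, y))
        betas.append((sww * sxy - sxw * swy) / det)
    return betas

# Toy data: genes 0-2 are negative controls (true effect 0), genes 3-4 are not.
x = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
w_true = [1.0, 2.0, 0.5, 1.5, 3.0, 0.2]      # unwanted factor, unknown in practice
beta = [0.0, 0.0, 0.0, 2.0, -1.0]            # true effects of x
alpha = [1.0, -0.5, 2.0, 1.5, 0.7]           # gene loadings on the unwanted factor
Y = [[x[i] * beta[j] + w_true[i] * alpha[j] for j in range(5)] for i in range(6)]

betas = ruv2(Y, x, control_idx=[0, 1, 2])
# Unadjusted regression on x alone, for comparison.
naive = [sum(a * row[j] for a, row in zip(x, Y)) / sum(a * a for a in x)
         for j in range(5)]
print("adjusted:", betas)
print("naive:   ", naive)
```

On this toy data the unadjusted estimates attribute part of the unwanted factor to x, while the adjusted estimates recover the simulated effects. Real data are noisy, and the number of factors has to be chosen.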

Genetics

Philosophy


Signatures of tumour immunity distinguish Asian and non-Asian gastric adenocarcinomas

Suling Lin et al., Nov 10, 2014


dtangle: accurate and fast cell-type deconvolution

Gregory Hunt et al., Mar 27, 2018

Motivation: Understanding cell type composition is important to understanding many biological processes. Furthermore, in gene expression studies cell type composition can confound differential expression analysis (DEA). To aid understanding cell type composition, methods of estimating (deconvolving) cell type proportions from gene expression data have been developed. Results: We propose dtangle, a new cell-type deconvolution method. dtangle works on a range of DNA microarray and bulk RNA-seq platforms. It estimates cell-type proportions using publicly available, often cross-platform, reference data. To comprehensively evaluate dtangle, we assemble ten benchmark data sets. Here, dtangle is competitive with published deconvolution methods, is robust to selection of tuning parameters and is quicker than other methods. As a case study, we investigate the human immune response to Lyme disease. dtangle’s estimates reveal a temporal trend consistent with previous findings and are important covariates for DEA across disease status. Availability: dtangle is on CRAN (cran.r-project.org/package=dtangle) or GitHub (dtangle.github.io). Contact: gjhunt@umich.edu

Genetics

Molecular Biology


Systematic Replication Enables Normalization of High-throughput Imaging Assays

Gregory Hunt et al., Apr 28, 2022

Motivation: High-throughput fluorescent microscopy is a popular class of techniques for studying tissues and cells through automated imaging and feature extraction of hundreds to thousands of samples. Like other high-throughput assays, these approaches can suffer from unwanted noise and technical artifacts that obscure the biological signal. In this work we consider how an experimental design incorporating multiple levels of replication enables removal of technical artifacts from such image-based platforms. Results: We develop a general approach to remove technical artifacts from high-throughput image data that leverages an experimental design with multiple levels of replication. To illustrate the methods we consider microenvironment microarrays (MEMAs), a high-throughput platform designed to study cellular responses to microenvironmental perturbations. In application on MEMAs, our approach removes unwanted spatial artifacts and thereby enhances the biological signal. This approach has broad applicability to diverse biological assays. Availability: Raw data is on Synapse (syn2862345), analysis code is on GitHub (gjhunt/mema norm), and a Docker image is available on Docker Hub (gjhunt/memanorm).

Artificial Intelligence

Biophysics


Removing unwanted variation from large-scale cancer RNA-sequencing data

Ramyar Molania et al., Nov 3, 2021

The accurate identification and effective removal of unwanted variation are essential to derive meaningful biological results from RNA-seq data, especially when the data come from large and complex studies. We have used The Cancer Genome Atlas (TCGA) RNA-seq data to show that library size, batch effects, and tumor purity are major sources of unwanted variation across all TCGA RNA-seq datasets and that existing gold standard approaches to normalization fail to remove this unwanted variation. Additionally, we illustrate how different sources of unwanted variation can compromise downstream analyses, including gene co-expression, association between gene expression and survival outcomes, and cancer subtype identification. Here, we propose the use of a novel strategy, pseudo-replicates of pseudo-samples (PRPS), to deploy the Removing Unwanted Variation III (RUV-III) method to remove different sources of unwanted variation from large and complex gene expression studies. Our approach requires at least one roughly known biologically homogeneous subclass of samples shared across sources of unwanted variation. To create PRPS, we first need to identify the sources of unwanted variation, which we will call batches in the data. Then the gene expression measurements of biologically homogeneous sets of samples are averaged within batches, and the results are called pseudo-samples. Pseudo-samples with the same biology and different batches are then defined to be pseudo-replicates and used in RUV-III as replicates. The variation between pseudo-samples of a set of pseudo-replicates is mainly unwanted variation. We illustrate the value of our approach by comparing it to the TCGA normalizations on several TCGA RNA-seq datasets. RUV-III with PRPS can be used for any large genomics project involving multiple labs, technicians, or platforms.
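The construction of pseudo-samples and pseudo-replicate sets described in this abstract can be sketched as below. This covers only the averaging and grouping steps, not RUV-III itself. The data layout, the function names `pseudo_samples` and `pseudo_replicate_sets`, and the labels in the toy data are all hypothetical.

```python
from collections import defaultdict

def pseudo_samples(expr, biology, batch):
    """Average expression within each (biology, batch) group.

    expr:    dict sample_id -> list of per-gene values
    biology: dict sample_id -> label of a biologically homogeneous subclass
    batch:   dict sample_id -> batch label (source of unwanted variation)
    """
    groups = defaultdict(list)
    for sample_id, values in expr.items():
        groups[(biology[sample_id], batch[sample_id])].append(values)
    # zip(*rows) iterates gene by gene across the samples of a group.
    return {key: [sum(col) / len(col) for col in zip(*rows)]
            for key, rows in groups.items()}

def pseudo_replicate_sets(ps):
    """Pseudo-samples with the same biology but different batches form one set."""
    sets = defaultdict(list)
    for bio, bat in ps:
        sets[bio].append((bio, bat))
    # A set is only useful if it spans at least two batches.
    return {bio: sorted(keys) for bio, keys in sets.items() if len(keys) > 1}

# Toy data with two genes; all labels are made up for illustration.
expr = {"s1": [1.0, 3.0], "s2": [3.0, 5.0], "s3": [10.0, 20.0], "s4": [7.0, 7.0]}
biology = {"s1": "subtypeA", "s2": "subtypeA", "s3": "subtypeA", "s4": "subtypeB"}
batch = {"s1": "plate1", "s2": "plate1", "s3": "plate2", "s4": "plate1"}

ps = pseudo_samples(expr, biology, batch)
sets = pseudo_replicate_sets(ps)
print(ps)
print(sets)
```

In this toy example only `subtypeA` appears in more than one batch, so it yields the single pseudo-replicate set. Differences between its two pseudo-samples would be treated as mainly unwanted variation.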

Artificial Intelligence

Molecular Biology


A new normalization for the Nanostring nCounter gene expression assay

Ramyar Molania et al., Jul 23, 2018

The Nanostring nCounter gene expression assay uses molecular barcodes and single molecule imaging to detect and count hundreds of unique transcripts in a single reaction. These counts need to be normalized to adjust for the amount of sample, variations in assay efficiency, and other factors. Most users adopt the normalization approach described in the nSolver analysis software, which involves background correction based on the observed values of negative control probes, a within-sample normalization using the observed values of positive control probes, and normalization across samples using reference (housekeeping) genes. Here we present a new normalization method, Removing Unwanted Variation-III (RUV-III), which makes vital use of technical replicates and suitable control genes. We also propose an approach using pseudo-replicates when technical replicates are not available. The effectiveness of RUV-III is illustrated on four different data sets. We also offer suggestions on the design and analysis of studies involving this technology.
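The three-step conventional pipeline mentioned in this abstract (background correction, positive-control scaling, housekeeping scaling) can be sketched roughly as follows. This is not the nSolver implementation and not RUV-III. The abstract does not state which summary statistics are used, so the mean for background, the geometric mean for scaling, and the floor of 1.0 are assumptions, as are the probe names and the function `normalize`.

```python
from math import exp, log

def geometric_mean(values):
    return exp(sum(log(v) for v in values) / len(values))

def normalize(samples, neg, pos, hk):
    """samples: dict sample -> dict probe -> raw count.
    neg, pos, hk: lists of negative control, positive control and
    housekeeping probe names (hypothetical names in the example below)."""
    # Step 1: background correction using the negative control probes.
    # Assumption: subtract the mean negative-control count, floor at 1.0.
    current = {}
    for s, counts in samples.items():
        background = sum(counts[p] for p in neg) / len(neg)
        current[s] = {p: max(c - background, 1.0) for p, c in counts.items()}
    # Steps 2 and 3: scale each sample so that the geometric mean of the
    # positive controls, then of the housekeeping genes, matches the average
    # of that geometric mean across samples.
    for probes in (pos, hk):
        gm = {s: geometric_mean([c[p] for p in probes]) for s, c in current.items()}
        target = sum(gm.values()) / len(gm)
        current = {s: {p: v * target / gm[s] for p, v in c.items()}
                   for s, c in current.items()}
    return current

# Toy data: sample B has exactly twice the counts of sample A, which mimics
# a difference in the amount of input material.
a = {"NEG_1": 0.0, "NEG_2": 0.0, "POS_1": 100.0, "POS_2": 400.0,
     "HK_1": 50.0, "HK_2": 200.0, "GENE_1": 30.0}
b = {probe: 2.0 * count for probe, count in a.items()}
normalized = normalize({"A": a, "B": b},
                       neg=["NEG_1", "NEG_2"],
                       pos=["POS_1", "POS_2"],
                       hk=["HK_1", "HK_2"])
print(normalized["A"]["GENE_1"], normalized["B"]["GENE_1"])
```

In the toy data the two samples differ only by a global scale factor, so the normalized value of `GENE_1` agrees between them up to floating-point error. Variation that is not a per-sample scale factor is not removed by this kind of scaling.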

Genetics

Artificial Intelligence


Transformation and Integration of Microenvironment Microarray Data Improves Discovery of Latent Effects

Gregory Hunt et al., May 5, 2019

The immediate physical and biochemical surroundings of a cell, the cellular microenvironment, are an important component of many fundamental cell- and tissue-level processes and are implicated in many diseases and dysfunctions. Thus understanding the interaction of cells with their microenvironment can both further basic research and aid the discovery of therapeutic agents. To study perturbations of cellular microenvironments, a novel image-based cell-profiling technology called the microenvironment microarray (MEMA) has recently been employed. In this paper we explore the effect of preprocessing transformations for MEMA data on the discovery of biological and technical latent effects. We find that Gaussianizing the data and carefully removing outliers can enhance discovery of important biological effects. In particular, these transformations help reveal a relationship between cell morphological features and the extracellular matrix protein THBS1 in MCF10A breast tissue. More broadly, MEMAs are part of a recent and widespread adoption of image-based cell-profiling technologies in the quantification of phenotypic differences among cell populations (Caicedo et al., 2017). Thus we anticipate that the advantages of the proposed preprocessing transformations will likely also be realized in the analysis of data from other highly multiplexed technologies like Cyclic Immunofluorescence. All code and supplementary analyses for this paper are available at gjhunt.github.io/rr.

Artificial Intelligence

Biochemistry


The Role of Scale in the Estimation of Cell-type Proportions

Gregory Hunt et al., Nov 29, 2019

Complex tissues are composed of a large number of different types of cells, each involved in a multitude of biological processes. Consequently, an important component of understanding such processes is understanding the cell-type composition of the tissues. Estimating cell type composition using high-throughput gene expression data is known as cell-type deconvolution. In this paper, we first summarize the extensive deconvolution literature by identifying a common regression-like approach to deconvolution. We call this approach the Unified Deconvolution-as-Regression (UDAR) framework. While methods that fall under this framework all use a similar model, they fit using data on different scales. Two popular scales for gene expression data are logarithmic and linear. Unfortunately, each of these scales has problems in the UDAR framework. Using log-scale gene expressions implies a biologically implausible model, and using linear-scale gene expressions leads to statistically inefficient estimators. To overcome these problems, we propose a new approach for cell-type deconvolution that works on a hybrid of the two scales. This new approach is biologically plausible and improves statistical efficiency. We compare the hybrid approach to other methods on simulations as well as a collection of eleven real benchmark datasets. Here, we find the hybrid approach to be accurate and robust.
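A toy example may help show why the choice of scale matters for regression-like deconvolution. The sketch below is not the UDAR framework or the hybrid method from the paper. It assumes two cell types with known reference profiles, a noise-free mixture, and a hypothetical helper `estimate_proportion` that fits one mixing proportion by least squares.

```python
from math import log

def estimate_proportion(mixture, ref_a, ref_b):
    """Least-squares p for the model: mixture ~ p * ref_a + (1 - p) * ref_b.
    The estimate is clipped to the interval [0, 1]."""
    num = sum((y - b) * (a - b) for y, a, b in zip(mixture, ref_a, ref_b))
    den = sum((a - b) ** 2 for a, b in zip(ref_a, ref_b))
    return min(1.0, max(0.0, num / den))

# Linear-scale reference profiles for two cell types over three genes.
ref_a = [100.0, 10.0, 50.0]
ref_b = [20.0, 80.0, 50.0]
# A physical mixture of 30% type A and 70% type B adds up on the linear scale:
# 0.3 * 100 + 0.7 * 20 = 44, 0.3 * 10 + 0.7 * 80 = 59, 0.3 * 50 + 0.7 * 50 = 50.
mixture = [44.0, 59.0, 50.0]

p_linear = estimate_proportion(mixture, ref_a, ref_b)

# Fitting the same model after log-transforming everything assumes that
# log-expressions mix additively, which the mixture above does not satisfy.
p_log = estimate_proportion([log(v) for v in mixture],
                            [log(v) for v in ref_a],
                            [log(v) for v in ref_b])
print(p_linear, p_log)
```

On this noise-free mixture the linear-scale fit recovers the true proportion, while the log-scale fit does not, because the mixture is additive in expression rather than in log-expression. The sketch says nothing about statistical efficiency under noise, which is the linear-scale drawback the abstract refers to.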

Molecular Biology

Machine Learning


scMerge: Integration of multiple single-cell transcriptomics datasets leveraging stable expression and pseudo-replication

Yingxin Lin et al., Aug 16, 2018

Concerted examination of multiple collections of single cell RNA-Seq (scRNA-Seq) data promises further biological insights that cannot be uncovered with individual datasets. However, such integrative analyses are challenging and require sophisticated methodologies. To enable effective interrogation of multiple scRNA-Seq datasets, we have developed a novel algorithm, named scMerge, that removes unwanted variation by combining stably expressed genes and utilizing pseudo-replicates across datasets. Analysis of large collections of publicly available datasets demonstrates that scMerge performs well in multiple scenarios and enhances biological discovery, including inferring cell developmental trajectories.

Genetics

Molecular Biology


Comprehensive evaluation of human brain gene expression deconvolution methods

Gavin Sutton et al., Jun 1, 2020

Gene expression measurements, similarly to DNA methylation and proteomic measurements, are influenced by the cellular composition of the sample analysed. Deconvolution of bulk transcriptome data aims to estimate the cellular composition of a sample from its gene expression data, which in turn can be used to correct for composition differences across samples. Although a multitude of deconvolution methods have been developed, it is unclear whether their performance is consistent across tissues with different complexities of cellular composition. For example, the human brain is unique in its transcriptomic diversity, and in the complexity of its cellularity, yet a comprehensive assessment of the accuracy of transcriptome deconvolution methods on human brain data is currently lacking. Here we carry out the first comprehensive comparative evaluation of the accuracy of deconvolution methods for human brain transcriptome data, and assess the tissue-specificity of our key observations by comparison with transcriptome data from human pancreas. We evaluate 22 transcriptome deconvolution approaches, covering all main classes: 3 partial deconvolution methods, each applied with 6 different categories of cell-type signature data, 2 enrichment methods and 2 complete deconvolution methods. We test the accuracy of cell type estimates using in silico mixtures of single-cell RNA-seq data, mixtures of neuronal and glial RNA, as well as nearly 2,000 human brain samples. Our results bring several important insights into the performance of transcriptome deconvolution: (a) We find that cell-type signature data has a stronger impact on brain deconvolution accuracy than the choice of method. In contrast, cell-type signature only mildly influences deconvolution of pancreas transcriptome data, highlighting the importance of tissue-specific benchmarking. (b) We demonstrate that biological factors influencing brain cell-type signature data (e.g. brain region, in vitro cell culturing) have stronger effects on the deconvolution outcome than technical factors (e.g. RNA sequencing platform). (c) We find that partial deconvolution methods outperform complete deconvolution methods on human brain data. (d) We demonstrate that the impact of cellular composition differences on differential expression analyses is tissue-specific, and more pronounced for brain than for pancreas. To facilitate wider implementation of correction for cellular composition, we develop a novel brain cell-type signature, MultiBrain, which integrates single-cell, immuno-panned, and single-nucleus datasets. We demonstrate that it achieves improved deconvolution accuracy over existing reference signatures. Deconvolution of transcriptome data from autism cases and controls using MultiBrain identified cell-type composition changes replicable across studies, and highlighted novel genes dysregulated in autism.

Genetics

Molecular Biology

