ResearchHub | Open Science Community

Exploration, normalization, and summaries of high density oligonucleotide array probe level data

Rafael Irizarry et al., Apr 1, 2003

In this paper we report exploratory analyses of high‐density oligonucleotide array data from the Affymetrix GeneChip® system with the objective of improving upon currently used measures of gene expression. Our analyses make use of three data sets: a small experimental study consisting of five MGU74A mouse GeneChip® arrays; part of the data from an extensive spike‐in study conducted by Gene Logic and Wyeth's Genetics Institute involving 95 HG‐U95A human GeneChip® arrays; and part of a dilution study conducted by Gene Logic involving 75 HG‐U95A GeneChip® arrays. We display some familiar features of the perfect match and mismatch probe (PM and MM) values of these data, and examine the variance–mean relationship with probe‐level data from probes believed to be defective, and so delivering noise only. We explain why we need to normalize the arrays to one another using probe level intensities. We then examine the behavior of the PM and MM using spike‐in data and assess three commonly used summary measures: Affymetrix's (i) average difference (AvDiff) and (ii) MAS 5.0 signal, and (iii) the Li and Wong multiplicative model‐based expression index (MBEI). The exploratory data analyses of the probe level data motivate a new summary measure that is a robust multi‐array average (RMA) of background‐adjusted, normalized, and log‐transformed PM values. We evaluate the four expression summary measures using the dilution study data, assessing their behavior in terms of bias, variance and (for MBEI and RMA) model fit. Finally, we evaluate the algorithms in terms of their ability to detect known levels of differential expression using the spike‐in data. We conclude that there is no obvious downside to using RMA and attaching a standard error (SE) to this quantity using a linear model which removes probe‐specific affinities.
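The summarization step described above (an additive model on log-scale PM values with a probe-specific affinity and a per-array expression level) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes median polish as the robust fitting procedure, which the abstract does not name, and it assumes the input has already been background-adjusted, normalized, and log2-transformed. The function name `median_polish_summary` is hypothetical.

```python
import numpy as np

def median_polish_summary(log_pm, n_iter=10, tol=1e-6):
    """Summarize one probe set with Tukey's median polish.

    log_pm: 2-D array, rows = probes, columns = arrays. Assumed to be
    already background-adjusted, normalized, and log2-transformed.
    Returns one expression estimate per array (overall + column effect).
    """
    resid = np.array(log_pm, dtype=float)
    overall = 0.0
    row = np.zeros(resid.shape[0])   # probe-specific affinities
    col = np.zeros(resid.shape[1])   # per-array expression effects
    for _ in range(n_iter):
        # Sweep out row (probe) medians.
        rmed = np.median(resid, axis=1)
        resid -= rmed[:, None]
        row += rmed
        delta = np.median(col)
        col -= delta
        overall += delta
        # Sweep out column (array) medians.
        cmed = np.median(resid, axis=0)
        resid -= cmed[None, :]
        col += cmed
        delta = np.median(row)
        row -= delta
        overall += delta
        # Stop once a full sweep no longer changes anything.
        if np.abs(rmed).sum() + np.abs(cmed).sum() < tol:
            break
    return overall + col
```

Because medians are used in both sweeps, a single aberrant probe value on one array has little influence on the per-array estimates, which is the sense in which the average is robust.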

Genetics

Molecular Biology

A comparison of normalization methods for high density oligonucleotide array data based on variance and bias

Benjamin Bolstad et al., Jan 21, 2003

Motivation: When running experiments that involve multiple high density oligonucleotide arrays, it is important to remove sources of variation between arrays of non-biological origin. Normalization is a process for reducing this variation. It is common to see non-linear relations between arrays and the standard normalization provided by Affymetrix does not perform well in these situations.

Results: We present three methods of performing normalization at the probe intensity level. These methods are called complete data methods because they make use of data from all arrays in an experiment to form the normalizing relation. These algorithms are compared to two methods that make use of a baseline array: a one number scaling based algorithm and a method that uses a non-linear normalizing relation by comparing the variability and bias of an expression measure. Two publicly available datasets are used to carry out the comparisons. The simplest and quickest complete data method is found to perform favorably.

Availability: Software implementing all three of the complete data normalization methods is available as part of the R package Affy, which is a part of the Bioconductor project http://www.bioconductor.org.

Contact: bolstad@stat.berkeley.edu

Supplementary information: Additional figures may be found at http://www.stat.berkeley.edu/~bolstad/normalize/index.html
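To illustrate what a complete data method at the probe intensity level can look like, here is a minimal sketch of quantile normalization. The abstract does not name its methods, so treating quantile normalization as the "simplest and quickest" one is an assumption made here for illustration. The function name `quantile_normalize` is hypothetical, and ties are broken by position rather than averaged.

```python
import numpy as np

def quantile_normalize(x):
    """Force every array (column) to share one empirical distribution.

    x: 2-D array, rows = probes, columns = arrays.
    Ties within a column are broken by row order (a simplification).
    """
    x = np.asarray(x, dtype=float)
    # Row indices that would sort each column independently.
    order = np.argsort(x, axis=0, kind="stable")
    sorted_x = np.take_along_axis(x, order, axis=0)
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = sorted_x.mean(axis=1)
    out = np.empty_like(x)
    # Put the k-th reference value where each column's k-th smallest value was.
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out
```

The normalizing relation uses every array at once, with no baseline array chosen, which is what distinguishes it from the scaling and baseline-based alternatives mentioned in the abstract.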

Artificial Intelligence

Biochemistry

Integrated Genomic Analysis Identifies Clinically Relevant Subtypes of Glioblastoma Characterized by Abnormalities in PDGFRA, IDH1, EGFR, and NF1

Roel Verhaak et al., Jan 1, 2010

The Cancer Genome Atlas Network recently cataloged recurrent genomic abnormalities in glioblastoma multiforme (GBM). We describe a robust gene expression-based molecular classification of GBM into Proneural, Neural, Classical, and Mesenchymal subtypes and integrate multidimensional genomic data to establish patterns of somatic mutations and DNA copy number. Aberrations and gene expression of EGFR, NF1, and PDGFRA/IDH1 each define the Classical, Mesenchymal, and Proneural subtypes, respectively. Gene signatures of normal brain cell types show a strong relationship between subtypes and different neural lineages. Additionally, response to aggressive therapy differs by subtype, with the greatest benefit in the Classical subtype and no benefit in the Proneural subtype. We provide a framework that unifies transcriptomic and genomic dimensions for GBM molecular stratification with important implications for future studies.

Genetics

Molecular Biology

Summaries of Affymetrix GeneChip probe level data

Rafael Irizarry et al., Feb 11, 2003

High density oligonucleotide array technology is widely used in many areas of biomedical research for quantitative and highly parallel measurements of gene expression. Affymetrix GeneChip arrays are the most popular. In this technology each gene is typically represented by a set of 11–20 pairs of probes. In order to obtain expression measures it is necessary to summarize the probe level data. Using two extensive spike‐in studies and a dilution study, we developed a set of tools for assessing the effectiveness of expression measures. We found that the performance of the current version of the default expression measure provided by Affymetrix Microarray Suite can be significantly improved by the use of probe level summaries derived from empirically motivated statistical models. In particular, improvements in the ability to detect differentially expressed genes are demonstrated.

Genetics

History

A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes

Richard Neve et al., Dec 1, 2006

Recent studies suggest that thousands of genes may contribute to breast cancer pathophysiologies when deregulated by genomic or epigenomic events. Here, we describe a model "system" to appraise the functional contributions of these genes to breast cancer subsets. In general, the recurrent genomic and transcriptional characteristics of 51 breast cancer cell lines mirror those of 145 primary breast tumors, although some significant differences are documented. The cell lines that comprise the system also exhibit the substantial genomic, transcriptional, and biological heterogeneity found in primary tumors. We show, using Trastuzumab (Herceptin) monotherapy as an example, that the system can be used to identify molecular features that predict or indicate response to targeted therapies or other physiological perturbations.

Genetics

Oncology

Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data

Sandrine Dudoit et al., Mar 1, 2002

A reliable and precise classification of tumors is essential for successful diagnosis and treatment of cancer. cDNA microarrays and high-density oligonucleotide chips are novel biotechnologies increasingly used in cancer research. By allowing the monitoring of expression levels in cells for thousands of genes simultaneously, microarray experiments may lead to a more complete understanding of the molecular variations among tumors and hence to a finer and more informative classification. The ability to successfully distinguish between tumor classes (already known or yet to be discovered) using gene expression data is an important aspect of this novel approach to cancer classification. This article compares the performance of different discrimination methods for the classification of tumors based on gene expression data. The methods include nearest-neighbor classifiers, linear discriminant analysis, and classification trees. Recent machine learning approaches, such as bagging and boosting, are also considered. The discrimination methods are applied to datasets from three recently published cancer gene expression studies.

Genetics

Artificial Intelligence

Normalization of cDNA microarray data

Gordon Smyth et al., Oct 31, 2003

Normalization means to adjust microarray data for effects which arise from variation in the technology rather than from biological differences between the RNA samples or between the printed probes. This paper describes normalization methods based on the fact that dye balance typically varies with spot intensity and with spatial position on the array. Print-tip loess normalization provides a well-tested general purpose normalization method which has given good results on a wide range of arrays. The method may be refined by using quality weights for individual spots. The method is best combined with diagnostic plots of the data which display the spatial and intensity trends. When diagnostic plots show that biases still remain in the data after normalization, further normalization steps such as plate-order normalization or scale-normalization between the arrays may be undertaken. Composite normalization may be used when control spots are available which are known to be not differentially expressed. Variations on loess normalization include global loess normalization and two-dimensional normalization. Detailed commands are given to implement the normalization techniques using freely available software.
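The idea that dye balance varies with spot intensity can be made concrete with a short sketch. It computes the log-ratio M and the average log-intensity A for each spot (this M/A notation is assumed here, not taken from the abstract) and subtracts a fitted intensity-dependent trend. A low-order polynomial fit is used as a simple stand-in for the loess curve the paper describes, and no print-tip grouping or spot weighting is attempted. The function name `intensity_trend_normalize` is hypothetical.

```python
import numpy as np

def intensity_trend_normalize(red, green, degree=2):
    """Remove an intensity-dependent dye bias from two-colour log-ratios.

    red, green: positive 1-D arrays of spot intensities for one array.
    Returns (normalized M, A). A polynomial in A replaces loess here
    purely to keep the sketch dependency-free.
    """
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    m = np.log2(red) - np.log2(green)           # log-ratio (dye balance)
    a = 0.5 * (np.log2(red) + np.log2(green))   # average log-intensity
    coeffs = np.polyfit(a, m, degree)           # trend of M as a function of A
    trend = np.polyval(coeffs, a)
    return m - trend, a
```

Plotting M against A before and after the adjustment is one way to produce the kind of diagnostic plot the abstract recommends.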

Artificial Intelligence

Molecular Biology

Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data

Manhong Dai et al., Nov 27, 2005

Genome-wide expression profiling is a powerful tool for implicating novel gene ensembles in cellular mechanisms of health and disease. The most popular platform for genome-wide expression profiling is the Affymetrix GeneChip. However, its selection of probes relied on earlier genome and transcriptome annotation which is significantly different from current knowledge. The resultant informatics problems have a profound impact on analysis and interpretation of the data. Here, we address these critical issues and offer a solution. We identified several classes of problems at the individual probe level in the existing annotation, under the assumption that current genome and transcriptome databases are more accurate than those used for GeneChip design. We then reorganized probes on more than a dozen popular GeneChips into gene-, transcript- and exon-specific probe sets in light of up-to-date genome, cDNA/EST clustering and single nucleotide polymorphism information. Comparing analysis results between the original and the redefined probe sets reveals ∼30–50% discrepancy in the genes previously identified as differentially expressed, regardless of analysis method. Our results demonstrate that the original Affymetrix probe set definitions are inaccurate, and many conclusions derived from past GeneChip analyses may be significantly flawed. It will be beneficial to re-analyze existing GeneChip data with updated probe set definitions.

Genetics

Molecular Biology

Normalization of RNA-seq data using factor analysis of control genes or samples

Davide Risso et al., Aug 21, 2014

Remove unwanted variation (RUV) is a new statistical method for RNA-seq data normalization that uses control genes or samples to improve differential expression analysis. Normalization of RNA-sequencing (RNA-seq) data has proven essential to ensure accurate inference of expression levels. Here, we show that usual normalization approaches mostly account for sequencing depth and fail to correct for library preparation and other more complex unwanted technical effects. We evaluate the performance of the External RNA Control Consortium (ERCC) spike-in controls and investigate the possibility of using them directly for normalization. We show that the spike-ins are not reliable enough to be used in standard global-scaling or regression-based normalization procedures. We propose a normalization strategy, called remove unwanted variation (RUV), that adjusts for nuisance technical effects by performing factor analysis on suitable sets of control genes (e.g., ERCC spike-ins) or samples (e.g., replicate libraries). Our approach leads to more accurate estimates of expression fold-changes and tests of differential expression compared to state-of-the-art normalization methods. In particular, RUV promises to be valuable for large collaborative projects involving multiple laboratories, technicians, and/or sequencing platforms.
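The core idea of estimating nuisance factors from control genes and then adjusting all genes can be sketched as follows. This is a simplified illustration on log-scale expression values using a singular value decomposition and ordinary least squares; it is not the authors' procedure, which the abstract does not specify in detail. The function name `ruv_sketch`, its parameters, and the choice of `k` are hypothetical.

```python
import numpy as np

def ruv_sketch(log_expr, control_idx, k=1):
    """Estimate k unwanted factors from control genes and regress them out.

    log_expr: 2-D array, rows = samples, columns = genes (log scale).
    control_idx: column indices of genes assumed to carry no biological
    signal (for example spike-ins); this assumption drives the method.
    Returns (adjusted log expression, estimated factors W).
    """
    y = np.asarray(log_expr, dtype=float)
    centered = y - y.mean(axis=0)
    # Variation among control genes is taken to be purely technical.
    u, s, _ = np.linalg.svd(centered[:, control_idx], full_matrices=False)
    w = u[:, :k] * s[:k]                       # samples x k nuisance factors
    # Per-gene loadings on the nuisance factors, by least squares.
    alpha, *_ = np.linalg.lstsq(w, centered, rcond=None)
    return y - w @ alpha, w
```

In a differential expression analysis the estimated factors would more commonly be included as covariates in the model rather than subtracted first, since subtraction can also remove biological signal that happens to correlate with them.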

Genetics

Artificial Intelligence

On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9

Jerzy Splawa-Neyman et al., Nov 1, 1990

In the portion of the paper translated here, Neyman introduces a model for the analysis of field experiments conducted for the purpose of comparing a number of crop varieties, which makes use of a double-indexed array of unknown potential yields, one index corresponding to varieties and the other to plots. The yield corresponding to only one variety will be observed on any given plot, but through an urn model embodying sampling without replacement from this doubly indexed array, Neyman obtains a formula for the variance of the difference between the averages of the observed yields of two varieties. This variance involves the variance over all plots of the potential yields and the correlation coefficient $r$ between the potential yields of the two varieties on the same plot. Since it is impossible to estimate $r$ directly, Neyman advises taking $r = 1$, observing that in practice this may lead to using too large an estimated standard deviation, when comparing two variety means.
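The role of $r$ can be seen in a modern restatement of the variance formula. The notation below is an assumption for illustration and is not taken from the translated paper: $N$ is the number of plots, $n_1$ and $n_2$ are the numbers of plots assigned at random to the two varieties, $S_1^2$ and $S_2^2$ are the variances over all plots of the potential yields of each variety (with divisor $N-1$), and $\bar{y}_1$, $\bar{y}_2$ are the observed variety means.

$$
\operatorname{Var}(\bar{y}_1 - \bar{y}_2) = \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} - \frac{S_1^2 + S_2^2 - 2 r S_1 S_2}{N}
$$

The subtracted term is the variance of the plot-level differences in potential yield divided by $N$. It cannot be estimated because the two potential yields are never observed on the same plot. Setting $r = 1$ reduces it to $(S_1 - S_2)^2 / N$, its smallest possible value for given $S_1$ and $S_2$, so the resulting variance is an upper bound on the true one. When $S_1 = S_2 = S$ the term vanishes and the bound becomes $S^2 (1/n_1 + 1/n_2)$, which is why the advice may lead to too large an estimated standard deviation.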

Accounting

Management Science And Operations Research
