Classification of Microarray Data

Classification of Microarray Data - Recent Statistical Approaches. Geoff McLachlan and Liat Ben-Tovim Jones, Department of Mathematics & Institute for Molecular Bioscience, University of Queensland. Tutorial for the APBC, January 2005.

Institute for Molecular Bioscience, University of Queensland

Outline of Tutorial • Introduction to Microarray Technology • Detecting Differentially Expressed Genes in Known Classes of Tissue Samples • Cluster Analysis: Clustering Genes and Clustering Tissues • Supervised Classification of Tissue Samples • Linking Microarray Data with Survival Analysis

“Large-scale gene expression studies are not a passing fashion, but are instead one aspect of new work of biological experimentation, one involving large-scale, high throughput assays.” Speed et al., 2002, Statistical Analysis of Gene Expression Microarray Data, Chapman and Hall/CRC

Growth of microarray and microarray methodology literature listed in PubMed from 1995 to 2003. The category ‘all microarray papers’ includes those found by searching PubMed for microarray* OR ‘gene expression profiling’. The category ‘statistical microarray papers’ includes those found by searching PubMed for ‘statistical method*’ OR ‘statistical techniq*’ OR ‘statistical approach*’ AND microarray* OR ‘gene expression profiling’.

A microarray is a new technology which allows the measurement of the expression levels of thousands of genes simultaneously. It builds on two developments: • (1) sequencing of the genome (human, mouse, and others) • (2) improvement in technology to generate high-density arrays on chips (glass slides or nylon membrane). The entire genome of an organism can be probed at a single point in time.

(1) mRNA Levels Indirectly Measure Gene Activity • Every cell contains the same DNA. • Cells differ in the DNA (gene) which is active at any one time. • Genes code for proteins through the intermediary of mRNA. • The activity of a gene (expression) can be determined by the presence of its complementary mRNA.

Target and Probe DNA • Probe DNA: known • Sample (target): unknown

(2) Microarrays Indirectly Measure Levels of mRNA • mRNA is extracted from the cell • mRNA is reverse transcribed to cDNA (mRNA itself is unstable) • cDNA is labeled with fluorescent dye (the TARGET) • The sample is hybridized to known DNA sequences on the array, covering tens of thousands of genes (the PROBE) • If present, complementary target binds to probe DNA (complementary base pairing) • Target bound to probe DNA fluoresces

Spotted cDNA Microarray Compare the gene expression levels for two cell populations on a single microarray.

Microarray Image • Red: high expression in target labelled with cyanine 5 (Cy5) dye • Green: high expression in target labelled with cyanine 3 (Cy3) dye • Yellow: similar expression in both target samples

Assumptions: • (1) Gene expression to mRNA: cellular mRNA levels directly reflect gene expression. • (2) mRNA to fluorescence intensity: the intensity of bound target is a measure of the abundance of the mRNA in the sample.

Experimental Error • Sample contamination • Poor quality/insufficient mRNA • Reverse transcription bias • Fluorescent labeling bias • Hybridization bias • Cross-linking of DNA (double strands) • Poor probe design (cross-hybridization) • Defective chips (scratches, degradation) • Background from non-specific hybridization

The Microarray Technologies
Spotted Microarray: • cDNAs, clones, or short and long oligonucleotides deposited onto glass slides • Each gene (or EST) represented by its purified PCR product • Simultaneous analysis of two samples (treated vs untreated cells) provides internal control • Gives relative gene expressions
Affymetrix GeneChip: • Short oligonucleotides synthesized in situ onto glass wafers • Each gene represented multiply, using 16-20 (preferably non-overlapping) 25-mers • Each oligonucleotide has a single-base mismatch partner for internal control of hybridization specificity • Gives absolute gene expressions
Each with its own advantages and disadvantages.

Pros and Cons of the Technologies
Spotted Microarray: • Flexible and cheaper • Allows study of genes not yet sequenced (spotted ESTs can be used to discover new genes and their functions) • Variability in spot quality from slide to slide • Provides information only on relative gene expressions between cells or tissue samples
Affymetrix GeneChip: • More expensive yet less flexible • Good for whole genome expression analysis where the genome of that organism has been sequenced • High quality with little variability between slides • Gives a measure of absolute expression of genes

Aims of a Microarray Experiment • Observe changes in a gene in response to external stimuli (cell samples exposed to hormones, drugs, toxins) • Compare gene expressions between different tissue types (tumour vs normal cell samples) • To gain understanding of the function of unknown genes and of the disease process at the molecular level • Ultimately to use as tools in clinical medicine for diagnosis, prognosis and therapeutic management.

Importance of Experimental Design • Good DNA microarray experiments should have clear objectives. • They should not be performed as “aimless data mining in search of unanticipated patterns that will provide answers to unasked questions” (Richard Simon, BioTechniques 34:S16-S21, 2003).

Replicates • Technical replicates: arrays that have been hybridized to the same biological source (using the same treatment, protocols, etc.) • Biological replicates: arrays that have been hybridized to different biological sources, but with the same preparation, treatments, etc.

Extracting Data from the Microarray • Cleaning: image processing, filtering, missing value estimation • Normalization: remove sources of systematic variation

Gene Expressions from Measured Intensities • Spotted Microarray: $\log_2(\text{Intensity Cy5} / \text{Intensity Cy3})$ • Affymetrix: $(\text{Perfect Match Intensity} - \text{Mismatch Intensity})$
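A minimal Python sketch of the two expression measures. The function names and toy intensities are illustrative only, and averaging PM - MM over the probe pairs of a gene is an assumption, not something stated on the slide.

```python
import numpy as np

def spotted_log_ratio(cy5, cy3):
    """Per-spot log2(Cy5 / Cy3); positive means higher signal in the Cy5 sample."""
    cy5 = np.asarray(cy5, dtype=float)
    cy3 = np.asarray(cy3, dtype=float)
    return np.log2(cy5 / cy3)

def affymetrix_pm_minus_mm(pm, mm):
    """Per-probe-pair difference PM - MM for one gene."""
    return np.asarray(pm, dtype=float) - np.asarray(mm, dtype=float)

# Toy intensities (hypothetical values).
ratios = spotted_log_ratio([2000, 500, 1000], [500, 500, 4000])
diffs = affymetrix_pm_minus_mm([900, 1100, 1000], [300, 500, 400])
# Assumption: summarise the gene by the mean difference over its probe pairs.
gene_level = diffs.mean()
```

A log ratio of 0 corresponds to a yellow spot (similar expression in both samples); positive and negative values correspond to red and green spots respectively.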

Data Transformation Rocke and Durbin (2001), Munson (2001), Durbin et al. (2002), and Huber et al. (2002)
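The transformation itself is not shown in this transcript. As a hedged sketch, assuming the slide refers to the generalized-logarithm family of variance-stabilizing transformations associated with these papers, a simplified form is log(y + sqrt(y^2 + c)). The constant c and the function name are illustrative; in practice c is estimated from the data.

```python
import numpy as np

def glog(y, c=1.0):
    """Simplified generalized log: log(y + sqrt(y^2 + c)).

    Behaves like log(2y) for large y but stays finite at zero and
    for small negative (background-corrected) intensities.
    """
    y = np.asarray(y, dtype=float)
    return np.log(y + np.sqrt(y ** 2 + c))

values = glog([-3.0, 0.0, 10.0, 1e6])
```

With c = 1 this coincides with the inverse hyperbolic sine, `np.arcsinh`.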

Representation of Data from M Microarray Experiments • Assume we have extracted gene expression values from intensities. • Gene expressions can be shown as heat maps. [Figure: matrix with rows Gene 1, Gene 2, ..., Gene N and columns Sample 1, Sample 2, ..., Sample M; a column is an expression signature, a row is an expression profile.]

Microarrays present new problems for statistics because the data is very high dimensional with very little replication.

Gene Expression Data represented as N x M Matrix • N rows correspond to the N genes; each row is an expression profile. • M columns correspond to the M samples (microarray experiments); each column is an expression signature. [Figure: matrix with rows Gene 1, Gene 2, ..., Gene N and columns Sample 1, Sample 2, ..., Sample M.]

Microarray Data Notation Represent the N x M matrix A in two ways. • Classifying Tissues on Gene Expressions: $A = (y_1, \ldots, y_M)$, where the feature vector $y_j$ contains the expression levels on the N genes in the jth experiment (j = 1, ..., M); $y_j$ is the expression signature. • Classifying Genes on the Tissues: $A^T = (y_1, \ldots, y_N)$, where the feature vector $y_j$ contains the expression levels on the M tissues on the jth gene (j = 1, ..., N); $y_j$ is the expression profile.
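The two views of A can be sketched with a small NumPy array. The sizes and random values are toy assumptions; real data have N far larger than M.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 3                        # toy sizes: 5 genes, 3 tissue samples
A = rng.normal(size=(N, M))        # rows = genes, columns = samples

# Classifying tissues: the feature vectors are the columns of A.
signature_sample_2 = A[:, 1]       # expression signature of sample 2 (length N)

# Classifying genes: the feature vectors are the columns of A^T, i.e. rows of A.
profile_gene_4 = A[3, :]           # expression profile of gene 4 (length M)

# Many libraries expect one observation per row.
tissue_observations = A.T          # M rows, each a signature of length N
gene_observations = A              # N rows, each a profile of length M
```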

In the N x M matrix A: N = No. of genes ($10^3$-$10^4$), M = No. of tissues ($10$-$10^2$). • Classification of Tissues on Gene Expressions: standard statistical methodology is appropriate when M >> N, BUT here N >> M. • Classification of Genes on the Basis of the Tissues: falls in the standard framework, BUT not all the genes are independently distributed.

Mehta et al (Nature Genetics, Sept. 2004): “The field of expression data analysis is particularly active with novel analysis strategies and tools being published weekly”, and the value of many of these methods is questionable. Some results produced by using these methods are so anomalous that a breed of ‘forensic’ statisticians (Ambroise and McLachlan, 2002; Baggerly et al., 2003) who doggedly detect and correct other HDB (high-dimensional biology) investigators’ prominent mistakes, has been created.

[Figure: N x M gene expression matrix with rows Gene 1, Gene 2, ..., Gene N and columns Sample 1, Sample 2, ..., Sample M.]

[Figure: N x M gene expression matrix (rows Gene 1, ..., Gene N; columns Sample 1, ..., Sample M) with the samples divided into Class 1 and Class 2.]

Fold Change is the Simplest Method Calculate the log ratio between the two classes and consider all genes that differ by more than an arbitrary cutoff value to be differentially expressed. A two-fold difference is often chosen. Fold change is not a statistical test.
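A minimal sketch of a fold-change screen. It assumes the expression values are already on the log2 scale and that each class is summarised by its mean, so a two-fold difference is a cutoff of 1. The function name, labels and toy data are illustrative.

```python
import numpy as np

def fold_change_calls(expr, labels, cutoff=1.0):
    """expr: N x M array of log2 expressions; labels: length-M array of 1s and 2s.

    Returns the per-gene log ratio (class 1 minus class 2) and a boolean
    flag for genes whose absolute log ratio exceeds the cutoff.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    log_ratio = expr[:, labels == 1].mean(axis=1) - expr[:, labels == 2].mean(axis=1)
    return log_ratio, np.abs(log_ratio) > cutoff

expr = [[8.0, 8.0, 6.0, 6.0],    # higher in class 1
        [5.0, 5.5, 5.0, 5.0],    # nearly unchanged
        [3.0, 3.0, 4.5, 4.5]]    # higher in class 2
log_ratio, flagged = fold_change_calls(expr, [1, 1, 2, 2])
```

The screen ignores the variability of each gene within a class, which is why it is not a statistical test.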

Multiplicity Problem • Perform a test for each gene to determine the statistical significance of differential expression for that gene. • Problem: when many hypotheses are tested, the probability of at least one type I error (false positive) increases sharply with the number of hypotheses. • This is further complicated by gene co-regulation and the resulting correlation between the test statistics.

Example: Suppose we measure the expression of 10,000 genes in a microarray experiment. If none of the 10,000 genes were differentially expressed, then we would expect: • testing each gene at level 0.05, 500 false positives; • testing each gene at level 0.05/10,000, 0.05 false positives.
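The arithmetic can be checked in a few lines. The FWER function additionally assumes the tests are independent, a simplification that does not hold exactly for co-regulated genes. Function names are illustrative.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha

def fwer_if_independent(n_tests, alpha):
    """P(at least one false positive), assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

n = 10_000
per_test = expected_false_positives(n, 0.05)         # about 500
corrected = expected_false_positives(n, 0.05 / n)    # about 0.05
fwer_raw = fwer_if_independent(n, 0.05)              # essentially 1
fwer_corrected = fwer_if_independent(n, 0.05 / n)    # just under 0.05
```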

Methods for dealing with the Multiplicity Problem • The Bonferroni Method controls the family-wise error rate (FWER), the probability that at least one false positive error will be made. This method is very conservative, as it tries to make it unlikely that even one false rejection is made. • The False Discovery Rate (FDR) emphasizes the proportion of false positives among the identified differentially expressed genes.

Test of a Single Hypothesis The M tissue samples are classified with respect to g classes on the basis of the N gene expressions. Assume that there are $n_i$ tissue samples from each class $C_i$ ($i = 1, \ldots, g$), where $M = n_1 + \cdots + n_g$. Take a simple case where g = 2. The aim is to detect whether some of the thousands of genes have different expression levels in class $C_1$ than in class $C_2$.

Test of a Single Hypothesis (cont.) For gene j (j = 1, …, N), let $H_j$ denote the null hypothesis of no association between its expression level and membership of the class. • $H_j = 0$: the null hypothesis for the jth gene holds. • $H_j = 1$: the null hypothesis for the jth gene does not hold.

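Two-Sample t-Statistic Student’s t-statistic for gene j: $$T_j = \frac{\bar{y}_{1j} - \bar{y}_{2j}}{\sqrt{s_{1j}^2/n_1 + s_{2j}^2/n_2}}$$ where $\bar{y}_{ij}$ and $s_{ij}^2$ are the sample mean and sample variance of the expression levels of gene j over the $n_i$ tissue samples in class $C_i$ (i = 1, 2).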

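Two-Sample t-Statistic Pooled form of the Student’s t-statistic, assuming a common variance in the two classes: $$T_j = \frac{\bar{y}_{1j} - \bar{y}_{2j}}{s_j \sqrt{1/n_1 + 1/n_2}}$$ where $\bar{y}_{ij}$ and $s_{ij}^2$ are the sample mean and variance of gene j in class $C_i$, and $s_j^2 = \{(n_1 - 1) s_{1j}^2 + (n_2 - 1) s_{2j}^2\} / (n_1 + n_2 - 2)$ is the pooled within-class sample variance.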

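Two-Sample t-Statistic Modified t-statistic of Tusher et al. (2001): $$T_j = \frac{\bar{y}_{1j} - \bar{y}_{2j}}{s_j \sqrt{1/n_1 + 1/n_2} + a_0}$$ where $\bar{y}_{ij}$ is the sample mean of gene j in class $C_i$, $s_j$ is the pooled within-class standard deviation, and $a_0$ is a small positive constant added to the denominator so that genes with very small variance do not give inflated statistics.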
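A hedged NumPy sketch of the three statistics, computed for every gene at once. The unpooled function divides the difference in class means by sqrt(s1^2/n1 + s2^2/n2); the pooled function uses the pooled standard deviation times sqrt(1/n1 + 1/n2), with an optional constant `a0` added to the denominator for the modified form. Function names, the label coding (1 and 2) and the toy data are assumptions for illustration.

```python
import numpy as np

def _split(expr, labels):
    """Split an N x M expression matrix into its class-1 and class-2 columns."""
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    return expr[:, labels == 1], expr[:, labels == 2]

def unpooled_t(expr, labels):
    """Two-sample t-statistic with separate class variances, one value per gene."""
    y1, y2 = _split(expr, labels)
    n1, n2 = y1.shape[1], y2.shape[1]
    diff = y1.mean(axis=1) - y2.mean(axis=1)
    se = np.sqrt(y1.var(axis=1, ddof=1) / n1 + y2.var(axis=1, ddof=1) / n2)
    return diff / se

def pooled_t(expr, labels, a0=0.0):
    """Pooled-variance t-statistic; a0 > 0 gives the modified form."""
    y1, y2 = _split(expr, labels)
    n1, n2 = y1.shape[1], y2.shape[1]
    diff = y1.mean(axis=1) - y2.mean(axis=1)
    pooled_var = ((n1 - 1) * y1.var(axis=1, ddof=1)
                  + (n2 - 1) * y2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    return diff / (np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2)) + a0)

expr = [[1.0, 2.0, 3.0, 2.0, 4.0, 6.0]]   # one gene, three samples per class
labels = [1, 1, 1, 2, 2, 2]
t_unpooled = unpooled_t(expr, labels)
t_pooled = pooled_t(expr, labels)
t_modified = pooled_t(expr, labels, a0=1.0)
```

With equal class sizes the unpooled and pooled statistics coincide, and adding `a0` shrinks the statistic towards zero.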

Multiple Hypothesis Testing Consider measures of error for multiple hypotheses. Focus on the rate of false positives with respect to the number of rejected hypotheses, $N_r$.

Possible Outcomes for N Hypothesis Tests

Possible Outcomes for N Hypothesis Tests FWER is the probability of getting one or more false positives out of all the hypotheses tested: $\mathrm{FWER} = \Pr(\text{number of false positives} \geq 1)$.

Bonferroni method for controlling the FWER Consider N hypothesis tests: $H_{0j}$ versus $H_{1j}$, j = 1, …, N, and let $P_1, \ldots, P_N$ denote the N P-values for these tests. The Bonferroni Method: given P-values $P_1, \ldots, P_N$, reject null hypothesis $H_{0j}$ if $P_j < \alpha / N$.
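The rejection rule can be sketched directly. The function name and the toy P-values are illustrative.

```python
import numpy as np

def bonferroni_reject(p_values, alpha=0.05):
    """Reject the jth null hypothesis when P_j < alpha / N."""
    p = np.asarray(p_values, dtype=float)
    return p < alpha / p.size

p_values = [0.001, 0.012, 0.03, 0.2]        # hypothetical P-values, N = 4
rejected = bonferroni_reject(p_values)      # per-test threshold is 0.05 / 4 = 0.0125
```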

False Discovery Rate (FDR) The FDR is essentially the expectation of the proportion of false positives among the identified differentially expressed genes.
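The slide does not name a procedure for controlling the FDR. As an illustration only, the sketch below uses the widely used Benjamini-Hochberg step-up rule: sort the P-values, find the largest rank k with P_(k) <= q k / N, and reject the k smallest. The function name and target level q are assumptions.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Boolean rejection mask from the Benjamini-Hochberg step-up rule."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)                          # indices of P-values, smallest first
    thresholds = q * np.arange(1, n + 1) / n       # q * k / N for rank k = 1..N
    below = p[order] <= thresholds
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest rank meeting its threshold
        reject[order[:k + 1]] = True               # reject that one and all smaller P-values
    return reject

bh_rejected = benjamini_hochberg([0.2, 0.001, 0.03, 0.012])
```

On the P-values 0.001, 0.012, 0.03 and 0.2, this rule rejects three hypotheses at q = 0.05, whereas the Bonferroni threshold 0.05/4 would reject only two, which illustrates why FDR control is less conservative.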
