Design & Analysis of Microarray Studies for Diagnostic & Prognostic Classification


Presentation Transcript


  1. Design & Analysis of Microarray Studies for Diagnostic & Prognostic Classification Richard Simon, D.Sc. Chief, Biometric Research Branch National Cancer Institute http://linus.nci.nih.gov/brb

  2. http://linus.nci.nih.gov/brb • http://linus.nci.nih.gov/brb • Powerpoint presentations • Reprints & Technical Reports • BRB-ArrayTools software • BRB-ArrayTools Data Archive • Sample Size Planning for Targeted Clinical Trials

  3. Simon R, Korn E, McShane L, Radmacher M, Wright G, Zhao Y. Design and analysis of DNA microarray investigations, Springer-Verlag, 2003. Radmacher MD, McShane LM, Simon R. A paradigm for class prediction using gene expression profiles. Journal of Computational Biology 9:505-511, 2002. Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the analysis of DNA microarray data. Journal of the National Cancer Institute 95:14-18, 2003. Dobbin K, Simon R. Comparison of microarray designs for class comparison and class discovery, Bioinformatics 18:1462-69, 2002; 19:803-810, 2003; 21:2430-37, 2005; 21:2803-4, 2005. Dobbin K and Simon R. Sample size determination in microarray experiments for class comparison and prognostic classification. Biostatistics 6:27-38, 2005. Dobbin K, Shih J, Simon R. Questions and answers on design of dual-label microarrays for identifying differentially expressed genes. Journal of the National Cancer Institute 95:1362-69, 2003. Wright G, Simon R. A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics 19:2448-55, 2003. Korn EL, Troendle JF, McShane LM, Simon R. Controlling the number of false discoveries. Journal of Statistical Planning and Inference 124:379-398, 2004. Molinaro A, Simon R, Pfeiffer R. Prediction error estimation: A comparison of resampling methods. Bioinformatics 21:3301-7, 2005.

  4. Simon R. Using DNA microarrays for diagnostic and prognostic prediction. Expert Review of Molecular Diagnostics, 3(5) 587-595, 2003. Simon R. Diagnostic and prognostic prediction using gene expression profiles in high dimensional microarray data. British Journal of Cancer 89:1599-1604, 2003. Simon R and Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials. Clinical Cancer Research 10:6759-63, 2004. Maitournam A and Simon R. On the efficiency of targeted clinical trials. Statistics in Medicine 24:329-339, 2005. Simon R. When is a genomic classifier ready for prime time? Nature Clinical Practice – Oncology 1:4-5, 2004. Simon R. An agenda for Clinical Trials: clinical trials in the genomic era. Clinical Trials 1:468-470, 2004. Simon R. Development and Validation of Therapeutically Relevant Multi-gene Biomarker Classifiers. Journal of the National Cancer Institute 97:866-867, 2005. Simon R. A roadmap for developing and validating therapeutically relevant genomic classifiers. Journal of Clinical Oncology (In Press). Freidlin B and Simon R. Adaptive signature design. Clinical Cancer Research (In Press). Simon R. Validation of pharmacogenomic biomarker classifiers for treatment selection. Disease Markers (In Press). Simon R. Guidelines for the design of clinical studies for development and validation of therapeutically relevant biomarkers and biomarker classification systems. In Biomarkers in Breast Cancer, Hayes DF and Gasparini G, Humana Press (In Press).

  5. Myth • That microarray investigations should be unstructured data-mining adventures without clear objectives

  6. Good microarray studies have clear objectives, but not generally gene-specific mechanistic hypotheses • Design and analysis methods should be tailored to study objectives

  7. Good Microarray Studies Have Clear Objectives • Class Comparison • Find genes whose expression differs among predetermined classes • Class Prediction • Prediction of predetermined class (phenotype) using information from gene expression profile • Class Discovery • Discover clusters of specimens having similar expression profiles • Discover clusters of genes having similar expression profiles

  8. Class Comparison and Class Prediction • Not clustering problems • Global similarity measures generally used for clustering arrays may not distinguish classes • Clustering doesn't control multiplicity or distinguish the data used for classifier development from the data used for classifier evaluation • Supervised methods • Require multiple biological samples from each class

  9. Levels of Replication • Technical replicates • RNA sample divided into multiple aliquots and re-arrayed • Biological replicates • Multiple subjects • Replication of the tissue culture experiment

  10. Biological conclusions generally require independent biological replicates. The power of statistical methods for microarray data depends on the number of biological replicates. • Technical replicates are useful insurance to ensure that at least one good quality array of each specimen will be obtained.

  11. Class Prediction • Predict which tumors will respond to a particular treatment • Predict which patients will relapse after a particular treatment

  12. Class prediction methods usually have gene selection as a component • The criteria for gene selection for class prediction and for class comparison are different • For class comparison false discovery rate is important • For class prediction, predictive accuracy is important

  13. Clarity of Objectives is Important • Patient selection • Many microarray studies developing classifiers are not “therapeutically relevant” • Analysis methods • Many microarray studies use cluster analysis inappropriately or misleadingly

  14. Microarray Platforms for Developing Predictive Classifiers • Single label arrays • Affymetrix GeneChips • Dual label arrays using common reference design • Dye swaps are unnecessary

  15. Common Reference Design

             Array 1   Array 2   Array 3   Array 4
      RED      A1        A2        B1        B2
      GREEN    R         R         R         R

      Ai = ith specimen from class A
      Bi = ith specimen from class B
      R = aliquot from reference pool

  16. The reference generally serves to control variation in the size of corresponding spots on different arrays and variation in sample distribution over the slide. • The reference provides a relative measure of expression for a given gene in a given sample that is less variable than an absolute measure. • The reference is not the object of comparison. • The relative measure of expression will be compared among biologically independent samples from different classes.
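The role of the reference channel can be sketched in Python (an illustration, not from the presentation; all intensity values are invented):

```python
# Illustrative sketch: relative expression via a common reference.
# Each entry is (sample intensity, reference intensity) for one gene
# on one array; the numbers are hypothetical.
import math

arrays = {
    "A1": (800.0, 400.0),   # class A specimens
    "A2": (900.0, 450.0),
    "B1": (200.0, 400.0),   # class B specimens
    "B2": (220.0, 440.0),
}

# The analysis unit is the log-ratio sample/reference, not the raw signal;
# dividing by the reference cancels spot-size and hybridization effects
# shared by the two channels on the same array.
log_ratios = {k: math.log2(s / r) for k, (s, r) in arrays.items()}

for name, lr in sorted(log_ratios.items()):
    print(f"{name}: log2 ratio = {lr:.2f}")
# The log-ratios are then compared between classes A and B;
# the reference itself is never the object of comparison.
```

Here the raw signals differ between arrays, but the log-ratios are stable within each class, which is exactly why the relative measure is the one compared across biologically independent samples.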

  17. Myth • For two color microarrays, each sample of interest should be labeled once with Cy3 and once with Cy5 in dye-swap pairs of arrays.

  18. Dye Bias • Average differences among dyes in label concentration, labeling efficiency, photon emission efficiency and photon detection are corrected by normalization procedures • Gene specific dye bias may not be corrected by normalization

  19. Dye-swap technical replicates of the same two RNA samples are rarely necessary. • Using a common reference design, dye-swap arrays are not necessary for valid comparisons of classes, since specimens labeled with different dyes are never compared. • For two-label direct-comparison designs comparing two classes, it is more efficient to balance the dye-class assignments across independent biological specimens than to do dye-swap technical replicates

  20. Balanced Block Design

             Array 1   Array 2   Array 3   Array 4
      RED      A1        B2        A3        B4
      GREEN    B1        A2        B3        A4

      Ai = ith specimen from class A
      Bi = ith specimen from class B

  21. Detailed comparisons of the effectiveness of designs: • Dobbin K, Simon R. Comparison of microarray designs for class comparison and class discovery. Bioinformatics 18:1462-9, 2002 • Dobbin K, Shih J, Simon R. Statistical design of reverse dye microarrays. Bioinformatics 19:803-10, 2003 • Dobbin K, Simon R. Questions and answers on the design of dual-label microarrays for identifying differentially expressed genes, JNCI 95:1362-1369, 2003

  22. Common reference designs are very effective for many microarray studies. They are robust, permit comparisons among separate experiments, and permit many types of comparisons and analyses to be performed. • For simple two-class comparison problems, balanced block designs require many fewer arrays than common reference designs. • Their efficiency advantage decreases with more than two classes. • They are more difficult to apply to complicated class comparison problems. • They are not appropriate for class discovery or class prediction. • Loop designs are less robust, are dominated by either common reference designs or balanced block designs, and are not suitable for class prediction or class discovery.

  23. What We Will Not Discuss Today • Image analysis • Normalization • Clustering methods • Class comparison • SAM and other methods of gene finding • FDR (false discovery rate) and methods for controlling the number of false positive genes

  24. Class Prediction Model • Given a sample with an expression profile vector x of log-ratios or log signals and unknown class. • Predict which class the sample belongs to • The class prediction model is a function f which maps from the set of vectors x to the set of class labels {1,2} (if there are two classes). • f generally utilizes only some of the components of x (i.e. only some of the genes) • Specifying the model f involves specifying some parameters (e.g. regression coefficients) by fitting the model to the data (learning the data).
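The model f described on this slide can be sketched in Python (a minimal illustration with hypothetical genes, weights, and threshold; not the presentation's own code):

```python
# Minimal sketch of a class prediction model f: expression vector x -> {1, 2}.
# f is a linear rule that uses only some components of x (some genes);
# the weights stand in for parameters fit to training data.
selected_genes = [0, 3]       # indices of the genes used by the model
weights = [1.5, -2.0]         # hypothetical fitted parameters
threshold = 0.0

def f(x):
    """Predict class 1 or 2 from the expression profile vector x."""
    score = sum(w * x[g] for w, g in zip(weights, selected_genes))
    return 1 if score > threshold else 2

x_new = [2.0, 0.1, -0.5, 0.4]     # log-ratios for a sample of unknown class
print(f(x_new))
```

The point of the sketch is the mapping itself: fitting chooses `selected_genes`, `weights`, and `threshold` from training data, and prediction is just evaluating f on a new profile.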

  25. Problems With Many Diagnostic/Prognostic Marker Studies • Are not reproducible • Retrospective non-focused analysis • Multiplicity problems • Inter-laboratory assay variation • Have no impact • Not therapeutically relevant questions • Not therapeutically relevant group of patients • Black box predictors

  26. Components of Class Prediction • Feature (gene) selection • Which genes will be included in the model • Select model type • E.g. Diagonal linear discriminant analysis, Nearest-Neighbor, … • Fitting parameters (regression coefficients) for model • Selecting value of tuning parameters

  27. Do Not Confuse Statistical Methods Appropriate for Class Comparison with Those Appropriate for Class Prediction • Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy. • Demonstrating goodness of fit of a model to the data used to develop it is not a demonstration of predictive accuracy. • Statisticians are used to inference, not prediction • Most statistical methods were not developed for p>>n prediction problems

  28. Feature Selection • Genes that are differentially expressed among the classes at a significance level α (e.g. 0.01) • The α level is selected only to control the number of genes in the model

  29. t-test Comparisons of Gene Expression • x_j ~ N(μ_j1, σ_j²) for class 1 • x_j ~ N(μ_j2, σ_j²) for class 2 • H_0j: μ_j1 = μ_j2
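Under this model, each gene gets a two-sample t statistic with a pooled variance estimate. A small Python sketch (invented data, no p-value/α machinery shown):

```python
# Hedged sketch: per-gene two-sample t statistic with a pooled
# (equal-variance) estimate, matching x_j ~ N(mu_jk, sigma_j^2).
import math

def pooled_t(x1, x2):
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((v - m1) ** 2 for v in x1)
    ss2 = sum((v - m2) ** 2 for v in x2)
    s2 = (ss1 + ss2) / (n1 + n2 - 2)      # pooled within-class variance
    return (m1 - m2) / math.sqrt(s2 * (1 / n1 + 1 / n2))

# expression of one gene in class 1 and class 2 samples (hypothetical)
class1 = [2.1, 2.3, 1.9, 2.2]
class2 = [0.9, 1.1, 1.0, 0.8]
t = pooled_t(class1, class2)
print(round(t, 2))    # large |t| -> gene is a candidate feature
```

A full analysis would convert t to a p-value and keep genes significant at the chosen α; the statistic itself is the building block.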

  30. Estimation of Within-Class Variance • Estimate separately for each gene • Limited degrees of freedom • Gene list dominated by genes with small fold changes and small variances • Assume all genes have same variance • Poor assumption • Random (hierarchical) variance model • Wright G.W. and Simon R. Bioinformatics 19:2448-2455, 2003 • Inverse gamma distribution of residual variances • Results in exact F (or t) distribution of test statistics with increased degrees of freedom for error variance • For any normal linear model
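The idea of borrowing strength across genes can be sketched as variance shrinkage (an illustration only, in the spirit of the random variance model; the weights and prior variance below are hypothetical, whereas the actual model fits an inverse-gamma prior to all genes):

```python
# Illustrative sketch: shrink a gene's small-sample variance estimate
# toward a variance typical of all genes, so that genes with tiny
# variances do not dominate the gene list via inflated t statistics.

def shrunken_variance(s2_gene, s2_prior, df_gene, df_prior):
    # weighted combination of gene-specific and prior variance,
    # weighted by their (hypothetical) degrees of freedom
    return (df_gene * s2_gene + df_prior * s2_prior) / (df_gene + df_prior)

s2_gene = 0.02    # tiny within-class variance estimated from few samples
s2_prior = 0.25   # variance typical of all genes (acts as the prior)
s2_tilde = shrunken_variance(s2_gene, s2_prior, df_gene=4, df_prior=10)
print(round(s2_tilde, 3))   # pulled toward the prior
```

The t statistic computed with the shrunken variance has extra effective degrees of freedom, which is the practical benefit the slide describes.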

  31. Feature Selection • Small subset of genes which together give most accurate predictions • Combinatorial optimization algorithms • Genetic algorithms • Little evidence that complex feature selection is useful in microarray problems • Failure to compare to simpler methods • Some published complex methods for selecting combinations of features do not appear to have been properly evaluated

  32. Linear Classifiers for Two Classes

  33. Linear Classifiers for Two Classes • Fisher linear discriminant analysis • Requires estimating correlations among all genes selected for model • y = vector of class labels • Diagonal linear discriminant analysis (DLDA) assumes features are uncorrelated • Naïve Bayes classifier • Compound covariate predictor (Radmacher) and Golub’s method are similar to DLDA in that they can be viewed as weighted voting of univariate classifiers

  34. Linear Classifiers for Two Classes • Compound covariate predictor: uses the t statistic t_i as the weight for gene i, instead of the DLDA weight (difference of class means divided by the within-class variance)
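The compound covariate predictor's weighted-vote idea can be sketched in Python (all t statistics, class means, and expression values below are invented for illustration):

```python
# Hedged sketch of the compound covariate predictor: a weighted vote of
# univariate classifiers where gene i gets weight t_i (its t statistic).
t_stats = [4.2, -3.8, 3.1]       # t_i for the selected genes (hypothetical)
means_1 = [2.0, 0.5, 1.8]        # class-1 means of the selected genes
means_2 = [0.5, 2.0, 0.6]        # class-2 means of the selected genes

def compound_covariate(x):
    return sum(t * xi for t, xi in zip(t_stats, x))

# decision threshold: midpoint of the compound covariate evaluated
# at the two class mean profiles
threshold = (compound_covariate(means_1) + compound_covariate(means_2)) / 2

def predict(x):
    return 1 if compound_covariate(x) > threshold else 2

print(predict([2.1, 0.4, 1.9]))   # profile near the class-1 means
print(predict([0.4, 2.1, 0.5]))   # profile near the class-2 means
```

Because each gene contributes t_i * x_i, strongly differentially expressed genes vote with larger weight, which is the sense in which this resembles DLDA.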

  35. Linear Classifiers for Two Classes • Support vector machines with an inner product kernel are linear classifiers with weights determined to separate the classes with a hyperplane that minimizes the length of the weight vector

  36. Support Vector Machine

  37. Perceptrons • Perceptrons are neural networks with no hidden layer and linear transfer functions between input and output • Number of input nodes equals the number of genes selected • Number of output nodes equals the number of classes minus 1 • Inputs may instead be major principal components of the genes, or of the informative genes • Perceptrons are linear classifiers

  38. Naïve Bayes Classifier • Expression profiles for class j assumed normal with mean vector m_j and diagonal covariance matrix D • Likelihood of expression profile vector x is l(x; m_j, D) • Posterior probability of class j for a case with expression profile vector x is proportional to π_j l(x; m_j, D)
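The slide's classifier can be sketched in Python for two genes and two classes (parameters are invented; a real model estimates m_j, D, and π_j from training data):

```python
# Sketch of the naive Bayes classifier: class-conditional normals with
# class means m_j and a shared diagonal covariance D.
import math

means = {1: [2.0, 0.5], 2: [0.5, 2.0]}   # m_j, per-gene class means
variances = [0.4, 0.4]                    # diagonal of D
priors = {1: 0.5, 2: 0.5}                 # pi_j

def log_likelihood(x, m):
    # log of l(x; m_j, D) for a diagonal-covariance normal
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - mi) ** 2 / (2 * v)
               for xi, mi, v in zip(x, m, variances))

def posterior(x):
    # posterior P(class j | x) proportional to pi_j * l(x; m_j, D)
    scores = {j: priors[j] * math.exp(log_likelihood(x, means[j]))
              for j in (1, 2)}
    z = sum(scores.values())
    return {j: s / z for j, s in scores.items()}

post = posterior([1.9, 0.6])
print(max(post, key=post.get))   # class with the highest posterior
```

The diagonal D is what makes the method "naive": genes are treated as independent given the class, so no gene-gene correlations need to be estimated.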

  39. Compound Covariate Bayes Classifier • Compound covariate y = Σ_i t_i x_i • Sum over the genes selected as differentially expressed • x_i = expression level of the ith selected gene for the case whose class is to be predicted • t_i = t statistic for testing differential expression of the ith gene • Proceed as for the naïve Bayes classifier but using the single compound covariate as the predictive variable • GW Wright et al., PNAS 2005.

  40. When p>>n The Linear Model is Too Complex • It is always possible to find a set of features and a weight vector for which the classification error on the training set is zero. • Why consider more complex models?

  41. Myth • Complex classification algorithms such as neural networks perform better than simpler methods for class prediction.

  42. Artificial intelligence sells to journal reviewers and peers who cannot distinguish hype from substance when it comes to microarray data analysis. • Comparative studies have shown that simpler methods work as well or better for microarray problems because they avoid overfitting the data.

  43. Other Simple Methods • Nearest neighbor classification • Nearest k-neighbors • Nearest centroid classification • Shrunken centroid classification

  44. Nearest Neighbor Classifier • To classify a sample in the validation set as being in outcome class 1 or outcome class 2, determine which sample in the training set its gene expression profile is most similar to • The similarity measure is based on genes selected as being univariately differentially expressed between the classes • Correlation similarity or Euclidean distance generally used • Classify the sample as being in the same class as its nearest neighbor in the training set
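The nearest-neighbor rule can be sketched in Python with Euclidean distance over the selected genes (training profiles and labels are invented):

```python
# Sketch of nearest-neighbor classification: label a new sample with the
# class of the most similar training profile.
import math

# training set: (expression profile over selected genes, class label)
training = [
    ([2.0, 0.4], 1),
    ([1.8, 0.6], 1),
    ([0.5, 2.1], 2),
    ([0.4, 1.9], 2),
]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(x):
    # classify x with the label of its closest training profile
    _, label = min(training, key=lambda item: euclidean(x, item[0]))
    return label

print(nearest_neighbor([1.9, 0.5]))   # closest to the class-1 profiles
```

Swapping `euclidean` for a correlation-based similarity, or taking a majority vote over the k closest profiles, gives the nearest k-neighbors variant mentioned on the previous slide.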

  45. Other Methods • Neural networks • Top-scoring pairs • CART • Random Forest • Genetic algorithm based classification

  46. Apparent Dimension Reduction Based Methods • Principal component regression • Supervised principal component regression • Partial least squares • Stepwise logistic regression

  47. When There Are More Than 2 Classes • Nearest neighbor type methods • Decision tree of binary classifiers

  48. Decision Tree of Binary Classifiers • Partition the set of classes {1,2,…,K} into two disjoint subsets S1 and S2 • Develop a binary classifier for distinguishing the composite classes S1 and S2 • Compute the cross-validated classification error for distinguishing S1 and S2 • Repeat the above steps for all possible partitions in order to find the partition S1 and S2 for which the cross-validated classification error is minimized • If S1 and S2 are not singleton sets, then repeat all of the above steps separately for the classes in S1 and S2 to optimally partition each of them
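The partition-search step above can be sketched in Python. The cross-validated error function here is a stub (a real version would train and cross-validate a binary classifier for each partition):

```python
# Sketch: enumerate all partitions of the class set into two disjoint,
# nonempty subsets S1 and S2, then pick the partition with the smallest
# (here: stubbed) cross-validated error.
from itertools import combinations

def partitions(classes):
    classes = sorted(classes)
    anchor, rest = classes[0], classes[1:]
    # fixing the first class in S1 avoids counting each split twice
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            s1 = {anchor, *combo}
            s2 = set(classes) - s1
            if s2:
                yield s1, s2

def cv_error(s1, s2):
    # placeholder for the cross-validated error of the S1-vs-S2
    # classifier; here we just pretend balanced splits classify best
    return abs(len(s1) - len(s2))

best = min(partitions({1, 2, 3, 4}), key=lambda p: cv_error(*p))
print(best)
```

After choosing the best top-level split, the same search is applied recursively inside any non-singleton subset, yielding the full decision tree of binary classifiers.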
