Computer Aid Discovery Course: Molecular Classification of Cancer

Computer Aid Discovery Course: Molecular Classification of Cancer Chris TK Man, Ph.D. Texas Children’s Cancer Center Baylor College of Medicine Feb 11, 2009

Outline • Introduction • Differences between class comparison, class discovery, and class prediction/classification • Methods used in class discovery • Methods used in classification • Examples from the literature

Part 1: Introduction

What is molecular classification in cancer research? • Use of molecular profiles (e.g. DNA, RNA, or proteins) to classify, diagnose, or predict different types or subtypes of cancer • histology subtypes • prognostic subtypes • Chemotherapy response • Metastasis • Survival • Recurrence • types of similar cancers

The Golub study • Published in 1999, Science: 286:531 • Classified acute leukemias from lymphoid precursors (ALL) or myeloid precursors (AML) • Cited 2806 times

Successful example in breast cancer I • MammaPrint, developed by Agendia / Netherlands Cancer Institute and the Antoni van Leeuwenhoek Hospital in Amsterdam • A gene expression profiling test based on a 70-gene signature that predicts the risk of metastasis in breast cancer patients • Superior to current standards for determining the recurrence risk for breast cancer, such as the NIH criteria • Validated in more than 1,000 patients and backed by peer-reviewed medical research

Successful example in breast cancer II • Oncotype DX, developed by Genomic Health • A clinically validated, laboratory-based 21-gene assay (RT-PCR) that predicts the likelihood of breast cancer recurrence in women with newly diagnosed, early stage invasive breast cancer based on a Recurrence Score • Assesses the benefit from chemotherapy • Uses formalin-fixed, paraffin-embedded tumor tissue

Part 2: Study objectives • Class comparison • Class discovery • Class prediction/Classification

Class comparison • Determine whether gene expression profiles differ among samples selected from predefined classes • Identify which genes are differentially expressed among the classes • Understand the biology of the disease and the underlying processes or pathways • Requires control of false discoveries due to multiple testing, for example with the Bonferroni correction • Examples: cancers with • Different stages • Primary site • Genetic mutations • Therapy response • Before and after an intervention • Classes are predefined independently of the expression profiles • Methods: t-test and Wilcoxon’s test
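To make the multiple-testing point concrete, here is a minimal sketch of a Bonferroni adjustment over per-gene p-values. The function name `bonferroni`, the `alpha` default, and the example p-values are illustrative assumptions, not part of the original slides.

```python
def bonferroni(pvalues, alpha=0.05):
    """Bonferroni correction: multiply each p-value by the number of tests.

    Returns the adjusted p-values (capped at 1.0) and a list of booleans
    marking which genes remain significant at level alpha.
    """
    m = len(pvalues)
    adjusted = [min(1.0, p * m) for p in pvalues]
    significant = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, significant


# Hypothetical p-values from four per-gene t-tests.
raw = [0.001, 0.02, 0.04, 0.3]
adjusted, significant = bonferroni(raw)
print(adjusted)
print(significant)
```

With thousands of genes on an array, the multiplier m becomes large, which is why only very small raw p-values survive this correction.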

Class Discovery • Also called cluster analysis, unsupervised learning, or unsupervised pattern recognition • The classes are unknown a priori and need to be discovered from the data • Involves estimating the number of classes (or clusters) and assigning objects to these classes • Goal: Identify novel subtypes of specimens within a population • Assumption: clinically and morphologically similar specimens may be distinguishable at the molecular level • Example: • Identify subclasses of tumors that are biologically homogeneous and whose expression profiles reflect either different cells of origin or disease pathogenesis, e.g. subtypes of B-cell lymphoma • Uncover biological features of the disease that may be clinically or therapeutically useful • Methods: hierarchical and K-means clustering

Class Prediction/Classification • Also called supervised learning • The classes are predefined and the task is to understand the basis for the classification from a set of labeled objects (learning or training set) • To build a classifier that will then be used to predict the class of future unlabeled observations • Similar to class comparison except that the emphasis is on developing a statistical model that can predict class label of a specimen • Important for diagnostic classification, prognostic prediction, and treatment selection • Methods: linear discriminant analysis, weighted voting, nearest neighbors

Part 3: Class Discovery

Hierarchical Clustering • An agglomerative method to join similar genes or cases into groups based on a distance metric • The process is iterated until all groups are connected in a hierarchical tree

Hierarchical method (figure: dendrogram built step by step from genes G1 to G10) • G2 is most similar to G8 • G6 is most similar to {G2, G8} • G1 is most similar to G5 • {G1, G5} is most similar to {G6, {G2, G8}} • Repeat joining until all the samples are clustered

Commonly used distance metrics • Euclidean distance • 1 − correlation
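The two distance metrics behave differently on expression profiles, and a short sketch may help. The code below assumes Pearson correlation for the "1 − correlation" distance (the slide does not say which correlation is meant), and the profiles `g1` and `g2` are made-up values.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation coefficient (assumed here; the slide only says 'correlation')."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def correlation_distance(x, y):
    """1 - correlation: 0 for perfectly correlated profiles, 2 for anti-correlated ones."""
    return 1.0 - pearson(x, y)


g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [10.0, 20.0, 30.0, 40.0]  # same shape as g1, much larger magnitude
print(euclidean(g1, g2))             # large, because the magnitudes differ
print(correlation_distance(g1, g2))  # near 0, because the shapes are identical
```

Euclidean distance is sensitive to absolute expression level, while the correlation-based distance groups genes by the shape of their profile.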

Agglomerative linkage method • Rules or metrics to determine which elements should be linked • Single linkage • Average linkage • Complete linkage

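Single Linkage • Calculates the minimum distance between members of one cluster and members of another cluster • $D_{AB} = \min_{i \in A,\, j \in B} d_{ij}$, where $i \in A$ and $j \in B$, and $d_{ij}$ is the distance between members $i$ and $j$

Average Linkage • Calculates the average distance between all members of one cluster and all members of another cluster • $D_{AB} = \frac{1}{n_A n_B} \sum_{i \in A} \sum_{j \in B} d_{ij}$, where $n_A$ and $n_B$ are the numbers of members in clusters A and B

Complete Linkage • Calculates the maximum distance between members of one cluster and members of another cluster • $D_{AB} = \max_{i \in A,\, j \in B} d_{ij}$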
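The three linkage rules differ only in how they summarize the pairwise distances between two clusters. The following is a minimal sketch of the linkage rules plugged into a naive agglomerative loop; the function names, the use of Euclidean distance, and the toy 1-D points are assumptions made for illustration, and the loop is written for clarity rather than speed.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def linkage_distance(A, B, method="average"):
    """Distance between clusters A and B (lists of points) under a linkage rule."""
    pairwise = [euclidean(a, b) for a in A for b in B]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage method: " + method)


def agglomerate(points, method="average"):
    """Repeatedly join the two closest clusters until one cluster remains.

    Returns the list of merges as (members_a, members_b, distance),
    where members are indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(
                    [points[i] for i in clusters[a]],
                    [points[i] for i in clusters[b]],
                    method,
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (5.0, 0.0)]
for method in ("single", "average", "complete"):
    print(method, linkage_distance(A, B, method))

points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
for step in agglomerate(points, method="single"):
    print(step)
```

The recorded merge order and merge distances are the information a dendrogram displays.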

Differences in linkage methods • Single linkage creates extended clusters with individual genes • Average linkage produces clusters of similar variance • Complete linkage creates clusters of similar size and variability

Class Problem 1: How many clusters of samples? (Figure: candidate partitions into two clusters (I, II), four clusters (I to IV), and three clusters (I to III).)

K-Means Clustering • Specify cluster number • Step 1: Randomly assign genes to clusters • Step 2: Calculate mean expression profile of each cluster (Figure: genes G1 to G12 randomly assigned to clusters C1, C2, and C3.)

K-means clustering • Step 3: Move genes among clusters to minimize the mean distance between genes and clusters • Repeat steps 2 and 3 until no genes can be shuffled (Figure: genes reassigned among clusters C1, C2, and C3.)
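The three steps can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: the function names, the squared Euclidean distance, the handling of empty clusters, and the toy profiles are all assumptions.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(profiles, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    n = len(profiles)
    # Step 1: randomly assign genes to clusters (each cluster starts non-empty).
    labels = [i % k for i in range(n)]
    rng.shuffle(labels)
    centroids = []
    for _ in range(max_iter):
        # Step 2: mean expression profile of each cluster.
        centroids = []
        for c in range(k):
            members = [profiles[i] for i in range(n) if labels[i] == c]
            if not members:
                # Assumption: an empty cluster is reseeded with a random profile.
                members = [rng.choice(profiles)]
            dim = len(members[0])
            centroids.append(
                [sum(m[d] for m in members) / len(members) for d in range(dim)]
            )
        # Step 3: move each gene to the cluster with the nearest mean profile.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in profiles
        ]
        if new_labels == labels:  # no gene moved: converged
            break
        labels = new_labels
    return labels, centroids


profiles = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(profiles, k=2)
print(labels)
print(centroids)
```

Because the result depends on the random starting assignment, running the function with several `seed` values and comparing the solutions is one simple way to act on the "use multiple runs" advice.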

K-means • Pros: • Fast algorithm, can cluster thousands of objects • Little difficulty with missing data • Cons: • Different solutions for different starting values; use multiple runs • Sensitive to outliers • An appropriate number of clusters is often unknown

Part 4: Class Prediction/ Classification

Classification • The object to be predicted assumes one of K predefined classes {1, …, K} • Associated with each object are: a response or dependent variable (class label), and a set of measurements that form the feature vector (genes) • The task is to classify an object into one of the K classes on the basis of an observed measurement X=x

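Steps in Classifier Building (flowchart) • Specimens are split into a Training Set and a Test Set • Training Set: Feature Selection, then Model Fitting, with CV, to produce the Classifier • Test Set: the Classifier is evaluated for Accuracy, Sensitivity, and Specificity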

Define the goal of a classifier • The goal of the classifier should be biologically or clinically relevant and motivated • An example from cancer treatment: personalized medicine • Most cancer treatments benefit only a minority of patients • Predicting which patients are likely to benefit from the treatment would prevent unnecessary toxicity and inconvenience • Overtreatment also results in major expense for individuals and society • Provide an alternative therapy to the non-responders

Feature Selection

Feature selection • Most of the features are uninformative • Including a large number of irrelevant features could degrade classification performance • A small set of features is more useful for downstream applications and analysis • Feature selection can be performed: • Explicitly, prior to building the classifier (filter method) • Implicitly, as an inherent part of the classifier building procedure (wrapper method), e.g. CART

Feature selection methods • t- or F-statistics • Signal-to-noise statistics • Nonparametric Wilcoxon statistics • Correlation • Fold change • Univariate classification rate • And many others……

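Welch t-statistic • Does not assume equal variance • $t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_A^2 / n_A + s_B^2 / n_B}}$, where $\bar{x}_A$ and $\bar{x}_B$ denote the sample average intensities in groups A and B, $s_A^2$ and $s_B^2$ denote the sample variances for each group, and $n_A$ and $n_B$ denote the group sizes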
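As a quick illustration, the Welch t-statistic for one gene can be computed with the standard library alone. The function name `welch_t` and the intensity values are assumptions for this sketch.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-statistic: difference in group means over the unpooled standard error."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Hypothetical log intensities of one gene in groups A and B.
group_a = [5.1, 4.9, 5.3, 5.0]
group_b = [3.9, 4.4, 3.6, 4.1, 4.0]
print(welch_t(group_a, group_b))
```

`statistics.variance` is the sample variance (divisor n − 1), which matches the usual definition of $s^2$ in this statistic.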

Max T • Determines family-wise error rate-adjusted p-values using Welch t-statistics • Algorithm: • Permute class labels and compute the Welch t-statistic for each gene • Record the maximum Welch t-statistic over all genes for each of 10,000 permutations • Compare the distribution of the maximum t-statistics with the observed value of the statistic for each gene • Estimate the adjusted p-value for each gene as the proportion of the permutation-based maximum t-statistics that are greater than the observed value
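The permutation procedure can be sketched as follows. This is a simplified single-step version under stated assumptions: it uses the absolute value of the Welch t-statistic, counts permutation maxima that are greater than or equal to the observed value, and uses 1,000 permutations on a tiny made-up dataset to keep the run short. The function names and data are illustrative only.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def max_t_pvalues(data, labels, n_perm=1000, seed=0):
    """data: one list of expression values per gene; labels: 0/1 class per sample."""
    rng = random.Random(seed)

    def abs_t_per_gene(lbls):
        stats = []
        for gene in data:
            a = [v for v, l in zip(gene, lbls) if l == 0]
            b = [v for v, l in zip(gene, lbls) if l == 1]
            stats.append(abs(welch_t(a, b)))
        return stats

    observed = abs_t_per_gene(labels)
    permuted = list(labels)
    max_null = []
    for _ in range(n_perm):
        rng.shuffle(permuted)                          # permute class labels
        max_null.append(max(abs_t_per_gene(permuted)))  # max over all genes
    # Adjusted p-value: share of permutation maxima at least as large as observed.
    return [sum(m >= t for m in max_null) / n_perm for t in observed]


data = [
    [1.0, 1.2, 0.9, 1.1, 5.0, 5.3, 4.8, 5.1],   # clearly differential gene
    [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.05],  # noise
    [3.0, 3.4, 2.9, 3.3, 3.1, 3.5, 2.8, 3.2],   # noise
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(max_t_pvalues(data, labels))
```

Because each gene is compared against the maximum over all genes, a gene only receives a small adjusted p-value if it stands out from the whole array, which is what controls the family-wise error rate.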

Template matching • Algorithm: • Define a template or profile of gene expression • Identify genes which match the template using correlation • Simple and flexible • Can be used with any number of groups and templates, such as finding specific biological expression profiles in multigroup microarray datasets

Area Under the ROC • ROC analysis: • Proportion of true positives vs. proportion of false positives (sensitivity vs. 1 − specificity) at each possible decision threshold value • Used in two-class problems • Calculate the area under the ROC curve (AUC)
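A small sketch shows how the ROC curve and its area arise from sweeping a threshold over a gene's expression values. The function names, the trapezoidal integration, and the example scores are assumptions made for illustration.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    labels: 1 for the positive class, 0 for the negative class.
    A sample is called positive when its score is >= the threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Hypothetical expression values of one gene and the true classes.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(auc(roc_points(scores, labels)))  # 8/9, about 0.889
```

An AUC of 0.5 corresponds to a gene with no discriminating ability, and values near 1 (or near 0) indicate a gene that separates the two classes well.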

Rank Product • Assume that a gene in an experiment with $n$ genes in $k$ replicates has a probability of being ranked first in every replicate of $1/n^k$ if the lists are random • Calculate the combined probability as a rank product: $RP_g = \prod_{i=1}^{k} \frac{r_{g,i}}{n}$, where $r_{g,i}$ is the position of gene $g$ in the list of genes in the $i$-th replicate
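The rank product is simple to compute once each replicate has been ranked. The sketch below assumes that each replicate is a list of fold changes, that rank 1 is the most up-regulated gene, and that the product of (rank / n) over replicates is the statistic; the function name and data are hypothetical.

```python
def rank_products(replicates):
    """replicates: k lists, each holding one fold change per gene (n genes).

    Returns, for each gene, the product over replicates of rank / n,
    where rank 1 is the largest fold change in that replicate.
    """
    n = len(replicates[0])
    rp = [1.0] * n
    for rep in replicates:
        order = sorted(range(n), key=lambda g: rep[g], reverse=True)
        for rank, g in enumerate(order, start=1):
            rp[g] *= rank / n
    return rp


# Hypothetical log fold changes for 4 genes in 3 replicates.
replicates = [
    [2.5, 0.1, -1.0, 0.3],
    [3.0, -0.2, -0.8, 0.5],
    [1.9, 0.4, -1.2, 0.0],
]
rp = rank_products(replicates)
print(rp)  # gene 0 is ranked first in every replicate: (1/4)**3
```

Genes with the smallest rank products are the most consistently highly ranked across replicates.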

SAM • Addresses a problem with the t-statistic: genes with a small fold change may be statistically significant because of a small variance • Adds a small “fudge factor”, s0, to the denominator of the test statistic • The factor is chosen to minimize the coefficient of variation of the test statistic d(i)
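The effect of the fudge factor can be seen on two made-up genes. This sketch assumes the usual form of the SAM statistic, d(i) = (mean difference) / (s(i) + s0) with s(i) the pooled standard error, and takes s0 as a user-supplied value rather than estimating it by minimizing the coefficient of variation; the function name and data are illustrative.

```python
import math
from statistics import mean


def sam_d(a, b, s0):
    """SAM-style statistic: mean difference over (pooled standard error + s0)."""
    n1, n2 = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (ma - mb) / (s + s0)


# Gene with a tiny fold change but a tiny variance.
small_a, small_b = [1.00, 1.01, 0.99], [1.05, 1.06, 1.04]
# Gene with a large fold change and a moderate variance.
large_a, large_b = [1.0, 1.5, 0.5], [3.0, 3.6, 2.4]

for s0 in (0.0, 0.1):
    print(s0, sam_d(small_a, small_b, s0), sam_d(large_a, large_b, s0))
```

With s0 = 0 the small-fold-change gene has the larger absolute statistic; with s0 = 0.1 the ordering reverses, which is the behavior the fudge factor is meant to produce.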

Classification algorithms

Problems of microarray classification • Number of candidate predictors, p >> number of cases, n • Algorithms that work well to uncover structure and provide accurate predictors when n >> p often work poorly when p >> n • Overfitting problem if the same dataset is used for training and testing

Classification algorithms • Discriminant analysis • Nearest neighbors • Decision trees • Compound covariate predictor • Neural networks • Support vector machines • And many more….

Comparison of classification methods • No single method is likely to be optimal in all situations • Performance depends on: • Biological classification under investigation • Genetic disparity among classes • Within-class heterogeneity • Size of training set, etc

A comparative study • Dudoit et al (2002) compared standard and diagonal discriminant analysis, Golub’s weighted vote method, classification trees, and nearest neighbors • Applied to three datasets: • Adult lymphoid malignancies separated into two or three classes (Alizadeh et al, 2000) • Acute lymphocytic and myelogenous leukemia (Golub et al, 1999) • 60 human tumor cell lines into 8 classes based on site of origin (Ross et al, 2000) • Diagonal discriminant analysis and nearest neighbors performed the best, suggesting methods that ignored gene correlations and interactions performed better than more complex models

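Diagonal linear discriminant analysis • Assumptions: • Gene covariances are assumed to be zero • Variances of the two classes are the same • The new sample $x$ is assigned to class 2 if $\sum_j \frac{(x_j - \bar{x}_{2j})^2}{s_j^2} < \sum_j \frac{(x_j - \bar{x}_{1j})^2}{s_j^2}$, where $\bar{x}_{1j}$ and $\bar{x}_{2j}$ are the mean expression of gene $j$ in classes 1 and 2 and $s_j^2$ is the pooled variance of gene $j$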
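Under the two assumptions above, DLDA reduces to comparing variance-scaled squared distances to the two class means. The sketch below uses the standard form of that rule; the function names and the two-gene training data are hypothetical.

```python
from statistics import mean


def fit_dlda(X1, X2):
    """X1, X2: training samples of class 1 and class 2 (one list of gene values each).

    Returns per-gene class means and the pooled per-gene variance.
    """
    p = len(X1[0])
    n1, n2 = len(X1), len(X2)
    m1 = [mean([s[j] for s in X1]) for j in range(p)]
    m2 = [mean([s[j] for s in X2]) for j in range(p)]
    var = []
    for j in range(p):
        ss = sum((s[j] - m1[j]) ** 2 for s in X1) + sum((s[j] - m2[j]) ** 2 for s in X2)
        var.append(ss / (n1 + n2 - 2))  # same variance assumed in both classes
    return m1, m2, var


def predict_dlda(x, m1, m2, var):
    """Assign x to the class whose mean is closer in variance-scaled distance."""
    d1 = sum((xj - a) ** 2 / v for xj, a, v in zip(x, m1, var))
    d2 = sum((xj - b) ** 2 / v for xj, b, v in zip(x, m2, var))
    return 2 if d2 < d1 else 1


X1 = [[1.0, 5.0], [1.2, 5.5], [0.8, 4.5]]  # class 1 training samples
X2 = [[3.0, 5.1], [3.3, 4.6], [2.7, 5.6]]  # class 2 training samples
m1, m2, var = fit_dlda(X1, X2)
print(predict_dlda([2.9, 5.0], m1, m2, var))  # closer to class 2
print(predict_dlda([1.1, 5.0], m1, m2, var))  # closer to class 1
```

Ignoring covariances means only p means per class and p variances are estimated, which is part of why such simple rules can hold up when p >> n.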

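Golub’s weighted gene voting scheme • Proposed in the first microarray-based classification paper in 1999 • A variant of DLDA • The correlation of gene $j$ with the class label is defined by $P_j = \frac{\bar{x}_{1j} - \bar{x}_{2j}}{s_{1j} + s_{2j}}$, where $\bar{x}_{kj}$ and $s_{kj}$ are the mean and standard deviation of gene $j$ in class $k$ • Each gene casts a “weighted vote” for class prediction • The sum of all votes, V, determines the class of the sample (V > 0: class 1)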
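A compact sketch of the voting scheme follows. It assumes the commonly cited form in which gene j votes $P_j \,(x_j - b_j)$ with $b_j$ the midpoint of the two class means, and it omits the prediction-strength threshold used in the original study; the function names and training data are hypothetical.

```python
from statistics import mean, stdev


def fit_weighted_voting(X1, X2):
    """Per-gene weight P_j = (mean1 - mean2) / (sd1 + sd2) and midpoint b_j."""
    p = len(X1[0])
    weights, midpoints = [], []
    for j in range(p):
        a = [s[j] for s in X1]
        b = [s[j] for s in X2]
        weights.append((mean(a) - mean(b)) / (stdev(a) + stdev(b)))
        midpoints.append((mean(a) + mean(b)) / 2)
    return weights, midpoints


def predict_weighted_voting(x, weights, midpoints):
    """Sum the weighted votes; a positive total V means class 1."""
    V = sum(w * (xj - b) for w, xj, b in zip(weights, x, midpoints))
    return (1 if V > 0 else 2), V


X1 = [[1.0, 5.0], [1.2, 5.5], [0.8, 4.5]]  # class 1 training samples
X2 = [[3.0, 5.1], [3.3, 4.6], [2.7, 5.6]]  # class 2 training samples
weights, midpoints = fit_weighted_voting(X1, X2)
print(weights)
print(predict_weighted_voting([2.9, 5.0], weights, midpoints))
print(predict_weighted_voting([1.1, 5.0], weights, midpoints))
```

The first gene separates the classes well and receives a large weight, while the second gene is nearly uninformative and contributes almost nothing to the vote.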

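Compound covariate predictor • Compound covariate for specimen $j$: $C_j = \sum_i t_i x_{ij}$ • $t_i$ is the t-statistic with respect to gene $i$ • $x_{ij}$ is the log expression in specimen $j$ for gene $i$ • Classification threshold: $C_t = \frac{\bar{C}_1 + \bar{C}_2}{2}$, where $\bar{C}_1$ and $\bar{C}_2$ are the mean values of the compound covariate for specimens of class 1 and class 2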
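The compound covariate predictor can be sketched directly from its definition: a t-statistic-weighted sum of log expression values, compared against the midpoint of the two class means. The sketch assumes a pooled-variance two-sample t-statistic and uses all genes rather than a selected subset; function names and data are hypothetical.

```python
import math
from statistics import mean


def t_stat(a, b):
    """Two-sample t-statistic with pooled variance (an assumption of this sketch)."""
    n1, n2 = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    pooled_var = ss / (n1 + n2 - 2)
    return (ma - mb) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def compound_covariate(x, t):
    """Sum over genes of t_i * x_i for one specimen."""
    return sum(ti * xi for ti, xi in zip(t, x))


def fit_ccp(X1, X2):
    p = len(X1[0])
    t = [t_stat([s[i] for s in X1], [s[i] for s in X2]) for i in range(p)]
    c1 = mean([compound_covariate(s, t) for s in X1])
    c2 = mean([compound_covariate(s, t) for s in X2])
    return t, (c1 + c2) / 2


def predict_ccp(x, t, threshold):
    # With t_i defined as class 1 minus class 2, the class 1 mean of the
    # compound covariate is never below the class 2 mean, so values above
    # the threshold are assigned to class 1.
    return 1 if compound_covariate(x, t) > threshold else 2


X1 = [[1.0, 5.0], [1.2, 5.5], [0.8, 4.5]]  # class 1 training samples
X2 = [[3.0, 5.1], [3.3, 4.6], [2.7, 5.6]]  # class 2 training samples
t, threshold = fit_ccp(X1, X2)
print(t, threshold)
print(predict_ccp([2.9, 5.0], t, threshold))
print(predict_ccp([1.1, 5.0], t, threshold))
```

In practice the sum would run only over genes that pass a feature-selection step, which this toy example skips.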

Nearest Neighbors (figure: test case v among training points of classes A, B, and C) • Training set contains 3 classes: A, B, and C • v is a test case • Find v’s 3 nearest neighbors using a distance measure such as Euclidean distance • Class C is most frequently represented among v’s nearest neighbors, so v is assigned to class C

Nearest Neighbors • Simple, and captures nonlinearities in the true boundary between classes when the number of specimens is sufficient • Number of neighbors k can have a large impact on the performance of the classifier • Can be selected by LOOCV • Votes can be weighted according to class prior probabilities • Neighbors can be assigned weights that are inversely proportional to their distance from the test case • Heavy computing time and storage requirement
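The majority vote and the LOOCV-based choice of k can be sketched as below. This is an unweighted version (no class priors or distance weights); the function names and the two-dimensional training points are made up for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, v, k=3):
    """Assign v to the class most frequent among its k nearest training cases."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], v))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loocv_accuracy(X, y, k):
    """Leave-one-out cross-validation: predict each case from all the others."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) == y[i]:
            hits += 1
    return hits / len(X)


X = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 6), (6, 5), (9, 1), (9, 2), (8, 1)]
y = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

print(knn_predict(X, y, (8.5, 1.5), k=3))  # neighbors are all class C
for k in (1, 3, 5):
    print(k, loocv_accuracy(X, y, k))
```

Every prediction scans the whole training set, which illustrates the computing time and storage cost noted in the last bullet.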

Classification Tree • Node 1 (Class 1: 10, Class 2: 10): split on Gene A (> 2 vs. <= 2) into Node 2 and Node 3 • Node 2 (Class 1: 6, Class 2: 9): split on Gene B (> 1.7 vs. <= 1.7) into Node 4 and Node 5 • Node 3 (Class 1: 4, Class 2: 1): Prediction: 1 • Node 4 (Class 1: 0, Class 2: 4): Prediction: 2 • Node 5 (Class 1: 6, Class 2: 5): split on Gene C (> -0.3 vs. <= -0.3) into Node 6 and Node 7 • Node 6 (Class 1: 1, Class 2: 5): Prediction: 2 • Node 7 (Class 1: 5, Class 2: 0): Prediction: 1 • Accuracy = 18/20 = 90%

Use of clustering methods for classification • Avoids the overfitting problem because class information plays no role in deriving the predictors • But results in poor performance of the predictor • Only a subset of genes can distinguish the classes, and their influence may be lost in a cluster analysis
