Systems Approaches to Disease Stratification Nathan Price Introduction to Systems Biology Short Course August 20, 2012
Goals and Motivation • Currently, most diagnoses are based on symptoms and visual features (pathology, histology) • However, many diseases that appear deceptively similar are in fact distinct entities from the molecular perspective • Drive towards personalized medicine
Outline • Molecular signature classifiers: main issues • Signal to noise • Small sample size issues • Error estimation techniques • Phenotypes and sample heterogeneity • Example study • Advanced topics • Network-based classification • Importance of broad disease context
Molecular signature classifiers Overall strategy
Molecular signatures for diagnosis • The goals of molecular classification of tumors: • Identify subpopulations of cancer • Inform choice of therapy • Generally, a set of microarray experiments is used with • ~100 patient samples • ~10^4 transcripts (genes) • This very small number of samples relative to the number of transcripts is a key issue • Feature selection & model selection • Small sample size issues dominate • Error estimation techniques • Also, the microarray platform used can have a significant effect on results
Randomness • Expression values have randomness arising from both biological and experimental variability. • Design, performance evaluation, and application of classifiers must take this randomness into account.
Three critical issues arise… • Given a set of variables, how does one design a classifier from the sample data that provides good classification over the general population? • How does one estimate the error of a designed classifier when data is limited? • Given a large set of potential variables, such as the large number of expression levels provided by each microarray, how does one select a set of variables as the input vector to the classifier?
Small sample issues • Our task is to predict future events • Thus, we must avoid overfitting • It is easy (if the model is complicated enough) to fit the data we have • Simplicity of the model is vital when data are sparse and the number of possible relationships is large • This is exactly the case in virtually all microarray studies, including ours • In the clinic • In the end, we want a test that can easily be implemented and that actually benefits patients
Error estimation and variable selection • An error estimator may be unbiased but have a large variance, and therefore often be low. • This can produce a large number of gene sets and classifiers with low error estimates. • For a small sample, one can end up with thousands of gene sets for which the error estimate from the sample data is near zero!
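The point about near-zero error estimates can be illustrated with a small simulation. This is only a sketch under assumed settings (200 noise "genes", 10 samples, and a two-gene comparison rule chosen for convenience); none of these numbers come from the slides. With 10 samples, a rule that is no better than a coin flip makes at most one apparent error, in one direction or the other, with probability 22/1024 (about 2%), so a search over thousands of pairs is expected to turn up many "good-looking" classifiers even when no gene carries any signal.

```python
import random
from itertools import combinations

def apparent_error(x, y, labels):
    """Resubstitution error of the rule 'class 1 if x > y, else class 0',
    taking the better of the rule and its mirror image."""
    wrong = sum((a > b) != bool(lab) for a, b, lab in zip(x, y, labels))
    return min(wrong, len(labels) - wrong) / len(labels)

def count_lucky_pairs(n_genes=200, n_samples=10, threshold=0.1, seed=0):
    """Count gene pairs whose apparent error is at or below the threshold."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_samples)]
    # Pure noise: no gene carries any information about the label.
    genes = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)]
             for _ in range(n_genes)]
    return sum(
        apparent_error(genes[i], genes[j], labels) <= threshold
        for i, j in combinations(range(n_genes), 2)
    )

n_pairs = 200 * 199 // 2
lucky = count_lucky_pairs()
print(f"{lucky} of {n_pairs} noise pairs have apparent error <= 10%")
```

Every pair counted here would have a true error of 50% on new data, which is why the error estimate from the sample alone cannot be trusted after a large search.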
Overfitting • Complex decision boundary may be unsupported by the data relative to the feature-label distribution. • Relative to the sample data, a classifier may have small error; but relative to the feature-label distribution, the error may be severe! • Classification rule should not cut up the space in a manner too complex for the amount of sample data available.
Overfitting: example of KNN rule [Figure: kNN rule with k = 3, shown for N = 30 and N = 90; test sample marked]
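A minimal sketch of the same effect in code, assuming a toy distribution of two overlapping 2-D Gaussian classes (an assumption for illustration, not the data behind the figure). With k = 1 every training point is its own nearest neighbor, so the error on the sample data is exactly zero, while the error on fresh draws from the same distribution is not.

```python
import random

def knn_predict(train, labels, query, k):
    """Majority vote among the k training points nearest to query."""
    order = sorted(
        range(len(train)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], query)),
    )
    votes = sum(labels[i] for i in order[:k])
    return 1 if 2 * votes > k else 0

def error_rate(train, labels, points, truth, k):
    """Fraction of points misclassified by the kNN rule built on train."""
    wrong = sum(knn_predict(train, labels, p, k) != t
                for p, t in zip(points, truth))
    return wrong / len(points)

rng = random.Random(0)

def draw(n):
    # Assumed toy distribution: class y is centered at (y, y), unit variance.
    ys = [rng.randint(0, 1) for _ in range(n)]
    xs = [(rng.gauss(y, 1.0), rng.gauss(y, 1.0)) for y in ys]
    return xs, ys

train_x, train_y = draw(30)      # N = 30 training samples
test_x, test_y = draw(1000)      # stand-in for the general population

for k in (1, 3, 9):
    sample_err = error_rate(train_x, train_y, train_x, train_y, k)
    future_err = error_rate(train_x, train_y, test_x, test_y, k)
    print(f"k={k}: error on sample data {sample_err:.2f}, "
          f"error on new data {future_err:.2f}")
```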
Example: How to identify appropriate models (regression… but the issues are the same) [Figure: noisy data; the task is to learn f from the data]
Cross-validation • Simple: just choose the classifier with the best cross-validation error • But… (there is always a but) • we are training on even less data, so the classifier design is worse • if sample size is small, test set is small and error estimator has high variance • so we may be fooling ourselves into thinking we have a good classifier…
[Figure: three candidate fits with mean square errors of 0.96, 2.12 and 3.33; one fit is marked "best"]
Estimating Error on Future Cases • Methodology • Best case: have an independent test set • Resampling techniques • Use cross validation to estimate accuracy on future cases • Feature selection and model selection must be within the loop to avoid overly optimistic estimates • Resampling: the data set is shuffled repeatedly into training and test sets, with NO information passing between them • Average performance on the test set provides an estimate of behavior on future cases • Can be MUCH different from behavior on the training set
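The requirement that feature selection sit inside the resampling loop can be checked on data with no signal at all. The sketch below makes several assumptions that are not in the slides: 20 samples, 1000 pure-noise features, selection of the 10 features with the largest class-mean difference, and a nearest-centroid classifier evaluated by leave-one-out cross validation. Selecting features once on the full data lets information from the held-out sample leak into the classifier.

```python
import random

def top_features(X, y, k):
    """Indices of the k features with the largest absolute class-mean difference."""
    def score(j):
        a = [row[j] for row, lab in zip(X, y) if lab == 1]
        b = [row[j] for row, lab in zip(X, y) if lab == 0]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def centroid_predict(X, y, feats, query):
    """Nearest-centroid prediction restricted to the selected features."""
    dist = {}
    for lab in (0, 1):
        rows = [row for row, l in zip(X, y) if l == lab]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in feats]
        dist[lab] = sum((query[j] - c) ** 2 for j, c in zip(feats, centroid))
    return 0 if dist[0] <= dist[1] else 1

def loocv_accuracy(X, y, k, select_inside):
    """Leave-one-out accuracy, with selection inside or outside the loop."""
    # Selecting on all samples uses the label of the sample tested later.
    fixed = None if select_inside else top_features(X, y, k)
    hits = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        feats = top_features(X_tr, y_tr, k) if select_inside else fixed
        hits += centroid_predict(X_tr, y_tr, feats, X[i]) == y[i]
    return hits / len(X)

rng = random.Random(1)
y = [i % 2 for i in range(20)]
X = [[rng.gauss(0, 1) for _ in range(1000)] for _ in y]  # no real signal

acc_outside = loocv_accuracy(X, y, k=10, select_inside=False)
acc_inside = loocv_accuracy(X, y, k=10, select_inside=True)
print(f"selection outside loop: {acc_outside:.2f}, inside loop: {acc_inside:.2f}")
```

Since the labels are unrelated to the features, any accuracy well above 50% is an artifact of the procedure, and the estimate with selection outside the loop is the one expected to be inflated.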
Classification methods • k-nearest neighbor • Support vector machine (SVM) • Linear, quadratic • Perceptrons, neural networks • Decision trees • k-Top Scoring Pairs • Many others
Molecular signature classifiers Example Study
Diagnosing similar cancers with different treatments [Figure: GIST patient vs. LMS patient] • Challenge in medicine: diagnosis, treatment, prevention of disease suffer from lack of knowledge • Gastrointestinal Stromal Tumor (GIST) and Leiomyosarcoma (LMS) • morphologically similar, hard to distinguish using current methods • different treatments, correct diagnosis is critical • studying genome-wide patterns of expression aids clinical diagnosis • Goal: Identify molecular signature that will accurately differentiate these two cancers
Geman, D., et al. Stat. Appl. Genet. Mol. Biol., 3, Article 19, 2004; Tan et al., Bioinformatics, 21:3896-904, 2005 Relative Expression Reversal Classifiers • Find a classification rule as follows: • IF gene A > gene B THEN class 1, ELSE class 2 • The classifier is chosen by finding the most accurate and robust rule of this type from all possible pairs in the dataset • If needed, a set of classifiers of the above form can be used, with final classification resulting from a majority vote (k-TSP)
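The rule above can be sketched in a few lines. This is a simplified illustration, not the published implementation: it scores each gene pair by the difference between the two classes in how often gene A exceeds gene B, keeps the single best pair, and omits the tie-breaking and the k-pair majority vote. The function names and the toy data are invented for the example.

```python
from itertools import combinations

def train_tsp(X, y):
    """Return (i, j, cls_if_greater) for the pair with the largest reversal score.

    X: list of samples, each a list of expression values; y: labels 1 or 2.
    """
    best = None
    for i, j in combinations(range(len(X[0])), 2):
        freq = {}
        for cls in (1, 2):
            rows = [r for r, lab in zip(X, y) if lab == cls]
            # How often gene i exceeds gene j within this class.
            freq[cls] = sum(r[i] > r[j] for r in rows) / len(rows)
        score = abs(freq[1] - freq[2])
        if best is None or score > best[0]:
            best = (score, i, j, 1 if freq[1] > freq[2] else 2)
    return best[1:]

def predict_tsp(rule, sample):
    """IF gene i > gene j THEN cls_if_greater, ELSE the other class."""
    i, j, cls_if_greater = rule
    other = 2 if cls_if_greater == 1 else 1
    return cls_if_greater if sample[i] > sample[j] else other

# Toy data (invented): gene 0 > gene 2 in class 1, reversed in class 2.
X = [[9, 5, 2], [8, 1, 3], [7, 6, 1], [2, 4, 9], [1, 7, 8], [3, 2, 6]]
y = [1, 1, 1, 2, 2, 2]
rule = train_tsp(X, y)
print(rule)                               # (0, 2, 1)
print([predict_tsp(rule, s) for s in X])  # [1, 1, 1, 2, 2, 2]
```

The fitted classifier is just three numbers (two gene indices and a class), which is what makes it easy to carry over to a two-gene assay.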
Rationale for k-TSP • Based on concept of relative expression reversals • Advantages • Does not require data normalization • Does not require population-wide cutoffs or weighting functions • Reported accuracies in the literature are comparable to SVMs, PAM, and other state-of-the-art classification methods • Results in classifiers that are easy to implement • Designed to avoid overfitting • n = number of genes, m = number of samples • For the example I will show, this equation yields: • 10^9 << 10^20
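The normalization point in the list above can be checked directly. The sketch assumes that array-to-array differences act as a strictly increasing transformation applied to all genes within a sample (a per-sample scale factor, a log transform); the values and scale factors are invented. Under that assumption the comparison "gene A > gene B" within a sample cannot change. Gene-specific distortions are not covered by this argument.

```python
import math

def reversal_calls(samples, i, j):
    """For each sample, is gene i expressed above gene j?"""
    return [s[i] > s[j] for s in samples]

# Invented raw intensities for two genes in three samples.
raw = [[120.0, 30.0], [15.0, 60.0], [900.0, 450.0]]

# Hypothetical per-sample scale factors standing in for array-to-array
# intensity differences, followed by a log transform.
scaled = [[v * f for v in s] for s, f in zip(raw, [0.5, 3.0, 10.0])]
logged = [[math.log2(v) for v in s] for s in scaled]

print(reversal_calls(raw, 0, 1))     # [True, False, True]
print(reversal_calls(scaled, 0, 1))  # [True, False, True]
print(reversal_calls(logged, 0, 1))  # [True, False, True]
```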
Price, N.D. et al, PNAS 104:3414-9 (2007) Diagnostic Marker Pair [Figure: scatter plot of OBSCN expression against C9orf65 expression, both axes on a log scale from 10^1 to 10^5; samples marked by clinicopathological diagnosis (X = GIST, O = LMS); regions labeled "Classified as GIST" and "Classified as LMS"] • Accuracy on data = 99% • Predicted accuracy on future data (LOOCV) = 98%
Price, N.D. et al, PNAS 104:3414-9 (2007) RT-PCR Classification Results [Figure: LMS and GIST samples] • 100% accuracy • 19 independent samples • 20 samples from microarray study • including previously indeterminate case
Price, N.D. et al, PNAS 104:3414-9 (2007) Comparative biomarker accuracies [Figure: C-kit gene expression compared with the 2-gene relative expression classifier; X = GIST, O = LMS]
Price, N.D. et al, PNAS 104:3414-9 (2007) Kit Protein Staining of GIST-LMS [Figure: blue arrows = GIST, red arrows = LMS] • Top row: GIST positive staining • Bottom row: GIST negative staining • Accuracy as a classifier ~87%
A few general lessons • Choosing markers based on relative expression reversals of gene pairs has proven to be very robust, with high predictive accuracy in the sets we have tested so far • Simple and independent of normalization • Ultimately easy to implement as a clinical test • All that’s needed is RT-PCR on two genes • Advantages of this approach may be even more applicable to proteins in the blood • Each decision rule requires measuring the relative concentration of 2 proteins
Chuang, Lee, Liu, Lee, Ideker, Molecular Systems Biology 3:40 Network-based classification • Can modify feature selection methods based on networks • Can improve performance (not always) • Generally improves biological insight by integrating heterogeneous data • Shown to improve prediction of breast cancer metastasis (complex phenotype)
Rationale: Differential Rank Analysis (DIRAC) Price, N.D. et al, PNAS, 2007 • Networks or pathways inform best targets for therapies • Cancer is a multi-genic disease • Analyze high-throughput data to identify aspects of the genome-scale network that are most affected • Initial version uses a priori defined gene sets • BioCarta, KEGG, GO, etc. • Differential rank conservation (DIRAC) for studying • Expression rank conservation for pathways within a phenotype • Pathways that discriminate well between phenotypes Eddy, J.A. et al, PLoS Computational Biology (2010)
Differential Rank Conservation [Figure: rankings of genes g1 to g8 within each sample. Across pathways in a phenotype: a tightly regulated pathway keeps the same gene ranking in every sample (highest conservation), while a weakly regulated pathway shows shuffled rankings (lowest conservation). Across phenotypes for a pathway: the pathway ranking is shuffled between GIST and LMS]
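The idea of rank conservation can be sketched as follows. This is a reading of the approach in Eddy et al. (2010) rather than the authors' code, and it should be checked against the paper: each sample is reduced to the pairwise orderings of the genes in a pathway, the majority ordering within a phenotype serves as a template, and conservation is the average agreement of the samples with that template. The function names and the toy profiles are invented.

```python
from itertools import combinations

def ordering_vector(sample, genes):
    """For each gene pair (a, b) in the pathway: is expr[a] < expr[b]?"""
    return [sample[a] < sample[b] for a, b in combinations(genes, 2)]

def rank_template(samples, genes):
    """Majority ordering of each gene pair across one phenotype's samples."""
    vectors = [ordering_vector(s, genes) for s in samples]
    return [sum(col) > len(vectors) / 2 for col in zip(*vectors)]

def matching_score(sample, genes, template):
    """Fraction of a sample's pairwise orderings that agree with the template."""
    vec = ordering_vector(sample, genes)
    return sum(v == t for v, t in zip(vec, template)) / len(template)

def rank_conservation(samples, genes):
    """Mean matching score of a phenotype's samples against its own template."""
    template = rank_template(samples, genes)
    return sum(matching_score(s, genes, template) for s in samples) / len(samples)

genes = ["g1", "g2", "g3", "g4"]
# Invented toy profiles: the same ranking in every sample...
tight = [{"g1": 5, "g2": 3, "g3": 9, "g4": 1}] * 4
# ...versus a different ranking in every sample.
loose = [
    {"g1": 5, "g2": 3, "g3": 9, "g4": 1},
    {"g1": 1, "g2": 9, "g3": 3, "g4": 5},
    {"g1": 9, "g2": 1, "g3": 5, "g4": 3},
    {"g1": 3, "g2": 5, "g3": 1, "g4": 9},
]
print("tightly regulated:", rank_conservation(tight, genes))
print("weakly regulated: ", rank_conservation(loose, genes))
```

A new sample could be assigned to the phenotype whose template it matches better, which is one way such a measure becomes a classifier. Because only orderings within a sample are used, the score does not depend on genes outside the pathway.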
Visualizing global network rank conservation • Average rank conservation across all 248 networks: 0.903
Global regulation of networks across phenotypes [Figure: networks ordered from highest to lowest rank conservation] • Tighter network regulation: normal prostate • Looser network regulation: primary prostate cancer • Loosest network regulation: metastatic prostate cancer Eddy et al, PLoS Computational Biology, (2010)
DIRAC classification is comparable to other methods Cross validation accuracies in prostate cancer
Eddy et al, PLoS Computational Biology, (2010) Differential Rank Conservation (DIRAC): Key Features • Independent of data normalization • Independent of genes/proteins outside of network • Can show massive/complete perturbations • Unlike Fisher’s exact test (e.g. GO enrichment) • Measures the “shuffling” of the network in terms of the hierarchy of expression of the components • Distinct from enrichment or GSEA • Provides a mathematically distinct classifier that yields a measurement of predictive accuracy on test data • Stronger than p-value for determining signal • Code for the method can be found at our website: http://price.systemsbiology.net
Global Analysis of Human Disease Importance of broad context to disease diagnosis
Why global disease analyses are essential • Organ-specificity: separating signal from noise • Hierarchy of classification • Context-independent classifiers • Based on organ-specific markers • Context-dependent classifiers • Based on excellent markers once organ-specificity defined • Provide context for how disease classifiers should be defined • Provide broad perspective into how separable diseases are and if disease diagnosis categories seem appropriate
GLOBAL ANALYSIS OF DISEASE-PERTURBED TRANSCRIPTOMES IN THE HUMAN BRAIN Example case study