Machine Learning techniques for biomarker discovery in proteomic pattern data

Elena Marchiori, Department of Computer Science, Vrije Universiteit Amsterdam

Overview • Proteomic pattern data • How to use the data • Approaches • Methodology • Case study • Conclusion

SELDI-TOF MS: Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry • Method for profiling a population of proteins in a sample according to the size and net charge of individual proteins. • The readout is a spectrum of peaks. The position of a protein in the spectrum corresponds to its “time of flight”, because light proteins fly faster than heavy ones. • Procedure: 1. Place serum on a protein-binding plate 2. Insert the plate in a vacuum chamber 3. Irradiate the plate with a laser 4. This “launches” the proteins / peptides 5. Measure the “time of flight” (TOF) of the ions, which corresponds to the molecular weights of the proteins

Example (spectrum plot: abundance vs. time of flight) • Heavier peptides move slower, so time of flight corresponds to weight • Weight corresponds to peptides • The spectrum measures the relative abundance of detected proteins in serum
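The link between flight time and weight can be written out under the usual idealization of a linear TOF analyzer. This is an assumption, since the slides do not give a formula: an ion of mass m and charge ze is accelerated through a potential U and then drifts over a field-free length L. The symbols U and L are introduced here for illustration.

```latex
z e U = \tfrac{1}{2} m v^{2},
\qquad
t = \frac{L}{v} = L \sqrt{\frac{m}{2 z e U}}
\quad\Longrightarrow\quad
\frac{m}{z} = \frac{2 e U}{L^{2}}\, t^{2}
```

Under this idealization the mass-to-charge ratio grows with the square of the flight time, which is why heavier peptides arrive later.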

How to use the data? • Diagnostic tool: • design a classifier for discriminating healthy from disease samples • Biomarkers identification: • Feature selection (FS): select features (peptides / proteins) that best discriminate the two classes (potential biomarkers)

Classification / FS • diagnostic tool => classifier • train a classifier that separates the two classes of diseased and healthy examples • biomarkers => feature subset selection • for a given type of classifier (e.g. KNN, SVM) find a small set of features that optimizes the performance of the classifier when restricted to the selected features • for a given clustering algorithm find a small set of features that maximizes the coherence of class labels of examples in the clusters (Petricoin et al, The Lancet 2002)

Approaches: Commercial • Proteome Quest (Correlogic): GA+clustering, no pre-selection (Petricoin et al., The Lancet 2002) • Propeak (3Z Informatics): separability analysis + bootstrap • Biomarker AMplification Filter BAMF (Eclipse Diagnostics): ?

Approaches: Non-commercial • Pre-processing + ranking + kNN (Zhu et al., PNAS 2003) • Pre-selection + boosted decision trees (Qu et al., Clin. Chem. 2002) • Filter FS + classifier (Liu et al., Genome Informatics 2002) • GA + SVM (Jong et al., EvoBIO 2004) • Many others: any ML method for classification/FS (see, e.g., special issue on FS, JMLR 2003)

SVM-based methods • Linear Support Vector Machine

GA-SVM • Training set T = T_1 ∪ T_2. • A genetic algorithm evolves a number of populations. Each population consists of sets of features of a given size. The fitness of an individual of the population is based on the performance of an SVM. The SVM is trained on T_1 using only the features of the individual. The fitness is the SVM error over T_2. • At each generation new individuals are created and inserted into the population by selecting fit parents, which are mutated and recombined. • Individuals may migrate to neighbor populations.
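The GA-SVM loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the SVM is a small linear soft-margin SVM trained by batch subgradient descent, all populations use the same feature-set size, parents are the two fittest individuals, the child replaces the worst individual, and migration copies each population's best individual to its neighbor in a ring. All function names, parameter values, and the synthetic data are hypothetical.

```python
import random

def train_linear_svm(X, y, features, C=1.0, epochs=100, lr=0.05):
    """Batch subgradient descent on 0.5*||w||^2 + (C/n) * sum of hinge losses.
    Labels must be -1 or +1. Returns (weights aligned with features, bias)."""
    w, b, n = [0.0] * len(features), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = list(w), 0.0                      # gradient of the regularizer
        for x, t in zip(X, y):
            if t * (sum(wj * x[f] for wj, f in zip(w, features)) + b) < 1:
                for j, f in enumerate(features):   # hinge loss is active
                    gw[j] -= C * t * x[f] / n
                gb -= C * t / n
        w = [wj - lr * g for wj, g in zip(w, gw)]
        b -= lr * gb
    return w, b

def fitness(ind, T1, T2):
    """Train on T1 with the individual's features; return the error on T2."""
    (X1, y1), (X2, y2) = T1, T2
    w, b = train_linear_svm(X1, y1, ind)
    wrong = 0
    for x, t in zip(X2, y2):
        score = sum(wj * x[f] for wj, f in zip(w, ind)) + b
        wrong += (1 if score >= 0 else -1) != t
    return wrong / len(X2)

def make_child(a, b, n_features, rng):
    """Recombine two parents, then mutate one feature of the child."""
    size = len(a)
    child = rng.sample(sorted(set(a) | set(b)), size)
    unused = [f for f in range(n_features) if f not in child]
    child[rng.randrange(size)] = rng.choice(unused)
    return tuple(sorted(child))

def ga_svm(T1, T2, n_features, size=3, n_islands=2, pop_size=6,
           generations=5, seed=0):
    rng = random.Random(seed)
    islands = [[tuple(sorted(rng.sample(range(n_features), size)))
                for _ in range(pop_size)] for _ in range(n_islands)]
    cache = {}
    def fit(ind):                                  # memoized SVM error on T2
        if ind not in cache:
            cache[ind] = fitness(ind, T1, T2)
        return cache[ind]
    for _ in range(generations):
        for pop in islands:
            pop.sort(key=fit)                      # lowest error first
            pop[-1] = make_child(pop[0], pop[1], n_features, rng)
        bests = [min(pop, key=fit) for pop in islands]
        for i, pop in enumerate(islands):          # ring migration
            pop.sort(key=fit)
            pop[-1] = bests[i - 1]
    best = min((ind for pop in islands for ind in pop), key=fit)
    return best, fit(best)

# Synthetic data (not from the talk): only features 0 and 1 carry signal.
data_rng = random.Random(0)
def make(n, n_features=15):
    y = [1 if i % 2 else -1 for i in range(n)]
    X = [[data_rng.gauss(1.5 * t if f < 2 else 0.0, 1.0)
          for f in range(n_features)] for t in y]
    return X, y

T1, T2 = make(40), make(30)
best, err = ga_svm(T1, T2, n_features=15)
print("selected features:", best, "error on T_2:", err)
```

Because the fitness is measured on T_2, the error of the returned feature set is optimistic; an independent test set is still needed to assess it.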

Ensemble SVM-RFE • SVM-RFE(cutoff, training set T = T_1 ∪ T_2): • Train a linear soft-margin SVM (C, class label penalties) on T_1 • Order features using the weights of the resulting classifier • Eliminate features with weight smaller than the cutoff • Repeat the process with T_1 restricted to the remaining features • This algorithm generates a chain of feature sets F_1 ⊃ F_2 ⊃ … ⊃ F_k. • SVM-RFE selects from {F_1, …, F_k} the set F* that minimizes the error over T_2 of the classifier restricted to the feature set, plus a term penalizing large feature sets. • We proposed a variant of this FS algorithm that uses ensembles of results of SVM-RFE over different cutoff values.
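A single SVM-RFE run, as described above, can be sketched as follows. Several details are assumptions made for illustration: the SVM is trained by plain batch subgradient descent without class label penalties, the cutoff is read as a fraction of the largest absolute weight (the slide does not state its scale), and the size penalty is a linear term with a hypothetical coefficient. The ensemble variant is not sketched, because the slide does not say how results for different cutoffs are combined.

```python
import random

def train_linear_svm(X, y, features, C=1.0, epochs=200, lr=0.01):
    """Batch subgradient descent on 0.5*||w||^2 + (C/n) * sum of hinge losses.
    Labels must be -1 or +1. Returns (weights aligned with features, bias)."""
    w, b, n = [0.0] * len(features), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = list(w), 0.0                      # gradient of the regularizer
        for x, t in zip(X, y):
            if t * (sum(wj * x[f] for wj, f in zip(w, features)) + b) < 1:
                for j, f in enumerate(features):   # hinge loss is active
                    gw[j] -= C * t * x[f] / n
                gb -= C * t / n
        w = [wj - lr * g for wj, g in zip(w, gw)]
        b -= lr * gb
    return w, b

def error_rate(X, y, features, w, b):
    """Fraction of misclassified examples using only the given features."""
    wrong = 0
    for x, t in zip(X, y):
        score = sum(wj * x[f] for wj, f in zip(w, features)) + b
        wrong += (1 if score >= 0 else -1) != t
    return wrong / len(X)

def svm_rfe(X1, y1, X2, y2, cutoff=0.2, size_penalty=0.001):
    """Return (F*, chain): the selected feature set and the chain F_1, ..., F_k."""
    features = list(range(len(X1[0])))
    scored = []
    while features:
        w, b = train_linear_svm(X1, y1, features)          # train on T_1
        cost = (error_rate(X2, y2, features, w, b)          # error on T_2
                + size_penalty * len(features))             # penalize large sets
        scored.append((cost, features))
        top = max(abs(wj) for wj in w)
        kept = [f for f, wj in zip(features, w) if abs(wj) >= cutoff * top]
        if len(kept) == len(features):                      # nothing eliminated
            break
        features = kept
    best = min(scored, key=lambda item: item[0])[1]
    return best, [f for _, f in scored]

# Synthetic data (not from the talk): only features 0 and 1 carry signal.
data_rng = random.Random(0)
def make(n, n_features=20):
    y = [1 if i % 2 else -1 for i in range(n)]
    X = [[data_rng.gauss(1.5 * t if f < 2 else 0.0, 1.0)
          for f in range(n_features)] for t in y]
    return X, y

X1, y1 = make(60)
X2, y2 = make(40)
best, chain = svm_rfe(X1, y1, X2, y2)
print("chain sizes:", [len(f) for f in chain], "selected:", best)
```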

Methodology • Cross Validation • split data randomly in train and test set • apply the classification/FS method to the training set • use the test set only to assess the performance of the method • repeat the process a number of times to analyze bias induced by the data splitting
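The protocol above can be sketched as a repeated random split in which feature selection sees only the training rows. This is an illustration under stated assumptions: a mean-difference filter and a nearest-centroid rule stand in for whatever FS method and classifier are being assessed, and the data are synthetic. All function names and parameter values are hypothetical.

```python
import random
from statistics import mean

def split_indices(n, test_fraction, rng):
    """Randomly split range(n) into train and test index lists."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = max(1, int(round(n * test_fraction)))
    return idx[n_test:], idx[:n_test]

def class_means(X, y, idx, features):
    """Per-class mean of each selected feature, computed on rows idx only."""
    means = {}
    for label in (0, 1):
        rows = [X[i] for i in idx if y[i] == label]
        means[label] = [mean(r[f] for r in rows) for f in features]
    return means

def select_features(X, y, train_idx, k):
    """Filter FS on training rows only: top-k by |difference in class means|."""
    all_f = list(range(len(X[0])))
    m = class_means(X, y, train_idx, all_f)
    scores = [abs(m[1][f] - m[0][f]) for f in all_f]
    return sorted(all_f, key=lambda f: -scores[f])[:k]

def predict(x, means, features):
    """Nearest-centroid rule restricted to the selected features."""
    def dist(label):
        return sum((x[f] - c) ** 2 for f, c in zip(features, means[label]))
    return 0 if dist(0) <= dist(1) else 1

def repeated_holdout(X, y, n_repeats=20, test_fraction=0.3, k=5, seed=0):
    """Repeat: split, select features on train, fit on train, score on test."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        train_idx, test_idx = split_indices(len(X), test_fraction, rng)
        features = select_features(X, y, train_idx, k)  # test rows never used
        means = class_means(X, y, train_idx, features)
        wrong = sum(predict(X[i], means, features) != y[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return errors

# Synthetic data (not from the talk): 80 samples, 50 features, signal in 0 and 1.
data_rng = random.Random(1)
y = [i % 2 for i in range(80)]
X = [[data_rng.gauss(2.0 * label if f < 2 else 0.0, 1.0) for f in range(50)]
     for label in y]
errors = repeated_holdout(X, y)
print(f"test error: mean {mean(errors):.3f}, "
      f"min {min(errors):.3f}, max {max(errors):.3f}")
```

The spread between the minimum and maximum error over the repeats gives a rough view of how much the result depends on the particular split.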

About Methodology • Examples of recent papers that do NOT use a correct methodology: • Qu et al. (Clin. Chem. 2002): perform feature pre-selection before applying CV • Villanueva et al. (Anal. Chem. 2004): use the entire dataset for feature ranking • Petricoin et al. (The Lancet 2002): consider a single data split into train/test sets • Papers addressing methodology pitfalls: • Simon et al., J. Natl. Cancer Inst. 2003 • Ambroise and McLachlan, PNAS 2002

Case Study: Data • Used in the Petricoin et al. papers • Commercial analysis software (Proteome Quest): http://www.correlogic.com/ • Data sets: http://ncifdaproteomics.com/ppatterns.php • Ovarian data set: • 162 Positive (Cancer), 92 Negative (Healthy) • 15154 Variables (Peptides / Proteins) • Prostate data set: • 69 Positive, 253 Negative • 15154 Variables • number of variables >> number of examples

Preliminary analysis • Few visible differences in means between healthy/cancer groups • But many very low p-values (in particular for the ovarian data, which makes it the easier problem) • [Figure panels for prostate data and ovarian data: difference in means; histogram of p-values]

The Methods • Diagnostic tool: • Support Vector Machine with linear and polynomial kernel • Biomarkers Detection and Diagnostic: • Feature subset selection, using Genetic Algorithms and Support Vector Machine

Diagnostics: Results • Support Vector Machine (SVM) on all features • Linear and quadratic kernel • Evaluation measures: • Error: (fp + fn) / total • Sensitivity: tp / (tp + fn) • Specificity: tn / (fp + tn) • Positive Predictive Value: tp / (tp + fp) • Results seem consistent with the preliminary analysis: ovarian is easier than prostate
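The evaluation measures above translate directly into code. The sketch below is illustrative: the function name is hypothetical, the counts are made up and are not results from the talk, and zero denominators are not handled.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute the evaluation measures from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "error": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "ppv": tp / (tp + fp),
    }

# Hypothetical counts, for illustration only.
metrics = diagnostic_metrics(tp=40, fp=10, tn=90, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With unbalanced classes, as in the prostate data set, error alone can look good for a classifier that favors the majority class, which is why sensitivity and specificity are reported separately.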

Biomarker Detection: Results • [Figure panels: linear SVM, prostate data set; quadratic SVM, prostate data set] • Bigger error than SVM on all features (+/- 0.06)

Results of Experiments • Results of experiments with GA-SVM indicate that there is variability due to both the data splitting and the algorithm. • Different sets of features are obtained at each run; however, a group of about 50 features occurs more often than the others across all runs.

Results of Experiments • Ensemble-RFE-SVM achieves perfect classification on the ovarian dataset, while on the prostate dataset it achieves sensitivity 0.97 (0.04) and specificity 0.89 (0.06). • Ensemble-RFE-SVM outperforms both GA-SVM and the commercial software of Petricoin et al. However, it finds larger feature sets. • The features provided on the website of Petricoin et al. yield poor performance when an SVM is used, showing that performance depends on the type of classifier used.

Diagnostic tool Design • Effective FS algorithms, like ensemble SVM-RFE, have to be enhanced with a user-friendly interface and visualization features in order to become operative in research laboratories and hospitals. • The resulting tools can be used by biologists and pathologists for analyzing their data without need of direct support from CS people.

Conclusion • Many machine learning techniques can be used for the analysis of proteomic pattern data. SVM-based approaches are effective. • Computational analysis of proteomic pattern data has to use a correct methodology that considers biases induced by the selection and classification algorithms and by the data splitting. • Collaboration: • Connie Jimenez • Gus Smit • Kees Jong • Aad van der Vaart
