
Uses of SVM and other Algorithms in Analyzing Microarray Gene Expression Data - Especially on Cancer type classification
2003/12/17 Dept. of BioSystem Sangwoo Kim


Presentation Transcript


  1. Uses of SVM and other Algorithms in Analyzing Microarray Gene Expression Data - Especially on Cancer type classification 2003/12/17 Dept. of BioSystem Sangwoo Kim CS774 Topics in AI - BISL Sangwoo Kim

  2. Presentation Outline
     • Introduction
     • Microarray Gene Expression Data
     • Cancer type diagnosis problem
     • Classification Methods
     • Real Experiments and their results
     • Conclusion
     • Further Works
     • References

  3. Introduction A - microarray
     • Several thousand DNA samples are fixed on the array (the location of each gene is known)
     • Two different RNA samples are prepared, each labeled with a different fluorescent dye (e.g. red and green)
     • Measure the ratio of the two dyes at each spot (red > yellow > green)
     • m experiments with n genes form an n x m matrix; take the logarithm of each ratio

  4. Introduction A - microarray
     • n rows, m columns -> n genes, m experiments
     • Row vector: the expression levels of one gene across all experiments
     • Column vector: the expression levels of many genes under one condition
     • Clustering problem: which genes are significantly related to a certain phenomenon?
     • Classification problem: which state is a given experiment (sample) in?
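The matrix layout described on slides 3 and 4 can be sketched in a few lines of Python. This is only an illustration: the ratio values are made up, the helper names `gene_profile` and `experiment_profile` are hypothetical, and base 2 is an assumption (the slides only say "a logarithm").

```python
import math

# Hypothetical red/green intensity ratios: rows are genes, columns are experiments.
ratios = [
    [2.0, 4.0, 1.0],   # gene 0
    [0.5, 0.25, 1.0],  # gene 1
]

# Log-transform so that 2-fold up and 2-fold down are symmetric around 0.
# Base 2 is an assumption; the slides do not name the base.
log_matrix = [[math.log2(r) for r in row] for row in ratios]

def gene_profile(matrix, gene):
    """Row vector: one gene's expression level across all experiments."""
    return matrix[gene]

def experiment_profile(matrix, experiment):
    """Column vector: all genes' expression levels in one experiment."""
    return [row[experiment] for row in matrix]
```

Clustering works on the row vectors (genes), while classification of samples works on the column vectors (experiments).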

  5. Introduction B - Cancer type classification
     • Cancer: malignant tumor with fast growth, invasion, anaplasia, metastasis
     • Cancer results from a series of genetic changes in which genes become hyperactive or inactive through mutations such as SNPs, translocations, inappropriate mitosis, etc.
     • Cancer is therefore closely tied to genes
     • Finding out which genes are involved is very important for choosing a treatment

  6. Introduction B - Cancer type classification
     • Previous (current) diagnosis
       • tumor's morphology
       • histochemistry
       • immunophenotyping
       • cytogenetic analysis
     • Most microarray classification studies deal with cancer typing
     • Current methods are fairly accurate, but we want to make use of the newly established microarray technology
     • And this is just important…

  7. Input size reduction - Neighborhood analysis
     Reducing the number of genes (removing genes that might be noisy)
     • "Neighborhood analysis"
     • Define an "idealized expression pattern"
     • Check for an unusually high density of genes near this idealized pattern -> "there are many more genes correlated with the pattern than expected by chance"
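A minimal sketch of the idea on slide 7, under stated assumptions: the idealized pattern is taken to be the 0/1 class label vector, "nearby" is taken to mean a Pearson correlation above a threshold, and "expected by chance" is estimated by shuffling the labels. The function names, the threshold of 0.8, and the number of permutations are illustrative choices, not values from the slides.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

def count_neighbors(genes, pattern, threshold):
    """Number of gene profiles that correlate with the pattern at or above threshold."""
    return sum(1 for g in genes if pearson(g, pattern) >= threshold)

def neighborhood_analysis(genes, labels, threshold=0.8, n_perm=200, seed=0):
    """Return (observed neighbor count, fraction of label shuffles doing at least as well)."""
    rng = random.Random(seed)
    observed = count_neighbors(genes, labels, threshold)
    shuffled = list(labels)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # a random pattern with the same class sizes
        if count_neighbors(genes, shuffled, threshold) >= observed:
            at_least += 1
    return observed, at_least / n_perm

# Toy data: 3 genes x 6 samples; the first three samples are class 0, the rest class 1.
genes = [[1, 1, 1, 5, 5, 5], [2, 1, 2, 6, 5, 6], [3, 1, 4, 1, 5, 9]]
labels = [0, 0, 0, 1, 1, 1]
observed, p_chance = neighborhood_analysis(genes, labels)
```

A small `p_chance` indicates that the real class labels attract more correlated genes than random labelings do, which is the condition the slide quotes.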

  8. Classification method A - Weighted voting
     • u_AML and u_ALL denote the mean expression levels of a gene in the AML and ALL reference samples, respectively
     • Each gene i casts a weighted vote w_i v_i
     • w_i: a weighting factor that reflects how well the gene is correlated with the class distinction
     • v_i = | x_i - (u_AML + u_ALL) / 2 |, where x_i is the expression level of gene i in the new sample
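The voting scheme can be sketched as follows. The slide does not give a formula for w_i; the sketch assumes the signal-to-noise form (u_AML - u_ALL) / (s_AML + s_ALL) used by Golub et al., and it uses the sign of w_i (x_i - b_i) to decide which class receives the vote while |x_i - b_i| sets its size. The function names are hypothetical.

```python
import statistics

def fit_weighted_voting(samples, labels):
    """Return one (weight, boundary) pair per gene from labeled reference samples.

    samples: list of expression vectors; labels: "AML" or "ALL" for each sample.
    """
    aml = [s for s, y in zip(samples, labels) if y == "AML"]
    all_ = [s for s, y in zip(samples, labels) if y == "ALL"]
    params = []
    for i in range(len(samples[0])):
        a = [s[i] for s in aml]
        b = [s[i] for s in all_]
        mu_a, mu_b = statistics.mean(a), statistics.mean(b)
        spread = statistics.stdev(a) + statistics.stdev(b)
        # Assumed signal-to-noise weight; positive when the gene is higher in AML.
        w = (mu_a - mu_b) / spread if spread else 0.0
        params.append((w, (mu_a + mu_b) / 2))
    return params

def vote(params, x):
    """Sum the weighted votes of all genes for a new sample x; returns (V_AML, V_ALL)."""
    v_aml = v_all = 0.0
    for (w, b), xi in zip(params, x):
        signed = w * (xi - b)  # positive favours AML, negative favours ALL
        if signed > 0:
            v_aml += signed
        else:
            v_all += -signed
    return v_aml, v_all

# Toy reference set: 2 genes, 2 samples per class.
reference = [[10, 1], [12, 2], [2, 9], [4, 8]]
classes = ["AML", "AML", "ALL", "ALL"]
params = fit_weighted_voting(reference, classes)
```

Calling `vote(params, [11, 1])` gives all of its weight to AML, since both genes sit on the AML side of their boundaries.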

  9. Classification method B - Support Vector Machine
     SVM for microarray analysis
     • SVMs avoid overfitting by choosing one specific hyperplane (the maximum-margin one) among the many that can separate the data in the feature space
     • This matters because there are usually relatively few samples compared with the number of input features
     • The decision boundary has a sparse representation: it depends only on the support vectors
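For reference, the "specific hyperplane" is usually written as the solution of the hard-margin problem below. This is the standard textbook formulation rather than something taken from the slides; it assumes linearly separable data with labels y_i in {-1, +1}, and the symbols w, b, x_i, alpha_i and K are introduced here for illustration.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad\text{subject to}\quad
y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1,\dots,m

f(x) = \operatorname{sign}\Bigl(\sum_{i=1}^{m} \alpha_i\, y_i\, K(x_i, x) + b\Bigr),
\qquad \alpha_i \ne 0 \ \text{only for support vectors}
```

The second line shows where the sparseness comes from: only training samples with nonzero alpha_i contribute to the decision function.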

  10. Classification method B - Support Vector Machine …

  11. Classification method B - Support Vector Machine

  12. Classification method B - Support Vector Machine
      Kernel Functions
      • Gaussian kernel
      • Polynomial kernel
      • Linear kernel
      • Other kernel functions are rarely used
      Multi-class classification
      • OVA (One Versus All) constructs n classifiers, one per class, and picks the class with the biggest score
      • AP (All Pairs) constructs n(n-1)/2 classifiers, one per pair of classes, and sums the scores in each row (one row per class)
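The three kernels and the two multi-class schemes can be sketched in plain Python. The parameter names (`degree`, `c`, `sigma`) and their defaults are illustrative assumptions, and the sketch counts one-versus-all as one classifier per class.

```python
import math

def linear_kernel(x, y):
    """Plain dot product."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, c=1.0):
    """(x . y + c) ** degree; degree and c are free parameters."""
    return (linear_kernel(x, y) + c) ** degree

def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2)); sigma is the kernel width."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def n_classifiers(n_classes):
    """Number of binary classifiers each multi-class scheme has to train."""
    return {"OVA": n_classes, "AP": n_classes * (n_classes - 1) // 2}

def ova_predict(scores):
    """scores maps each class to the output of its 'class vs rest' classifier."""
    return max(scores, key=scores.get)
```

For the 14 tumor types mentioned later in the talk, `n_classifiers(14)` gives 14 classifiers for OVA and 91 for AP, which shows how quickly the all-pairs scheme grows.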

  13. Classification method C - Other machine learning tools
      Decision Tree
      • C4.5 and MOC1: construct a decision tree top-down with a greedy algorithm
      Parzen Windows
      • With a radial basis function
      Fisher's linear discriminant
      • Maximizes the signal-to-interference ratio (between-class separation relative to within-class scatter)
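As one concrete example of these baselines, a Parzen window classifier with a radial basis function can be sketched as below. The details are assumptions made for illustration (Gaussian window, a single width `sigma`, class chosen by the larger average kernel response) and may differ from the variant used in the experiments.

```python
import math

def parzen_classify(train, labels, x, sigma=1.0):
    """Assign x to the class whose training points give the largest mean RBF response."""
    totals, counts = {}, {}
    for t, y in zip(train, labels):
        sq_dist = sum((a - b) ** 2 for a, b in zip(t, x))
        totals[y] = totals.get(y, 0.0) + math.exp(-sq_dist / (2 * sigma ** 2))
        counts[y] = counts.get(y, 0) + 1
    # The mean kernel response per class is a density estimate up to a constant factor.
    density = {y: totals[y] / counts[y] for y in totals}
    return max(density, key=density.get)

# Toy data: two well-separated groups in two dimensions.
train = [[0, 0], [0, 1], [5, 5], [6, 5]]
labels = ["neg", "neg", "pos", "pos"]
```

Unlike an SVM, this classifier keeps every training point in its decision rule, so it has no sparse representation.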

  14. Real classification experiments A - Using weighted voting
      • Golub et al. (1999). Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science
      • Classifying ALL (acute lymphoblastic leukemia) versus AML (acute myeloid leukemia)
      • The distinction is hard to make but critical for successful treatment
      • Class prediction and class discovery
      • Samples: 38 bone marrow samples (27 ALL, 11 AML), each profiled for 6817 human genes

  15. Real classification experiments A - Using weighted voting
      • Prediction strength: PS = (V_win - V_lose) / (V_win + V_lose)
      • When PS < 0.3: "uncertain"
      • The predictor assigned 36 of 38 samples as either AML or ALL in cross-validation, all of them correctly
      • The predictor assigned 29 of 34 additional samples as either AML or ALL, all of them correctly
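The prediction strength rule is simple enough to write out directly. The function names are hypothetical; the formula and the 0.3 threshold are the ones on the slide.

```python
def prediction_strength(v_win, v_lose):
    """PS = (V_win - V_lose) / (V_win + V_lose); 0.0 when no votes were cast."""
    total = v_win + v_lose
    return (v_win - v_lose) / total if total else 0.0

def call_class(v_aml, v_all, threshold=0.3):
    """Return (label, PS); the label is "uncertain" when PS is below the threshold."""
    v_win, v_lose = max(v_aml, v_all), min(v_aml, v_all)
    ps = prediction_strength(v_win, v_lose)
    if ps < threshold:
        return "uncertain", ps
    return ("AML" if v_aml > v_all else "ALL"), ps
```

For example, vote totals of 9 and 1 give PS = 0.8 and a confident call, while totals of 6 and 4 give PS = 0.2 and the sample is left unassigned.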

  16. Real classification experiments B - Using SVM
      • Michael P.S. Brown (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines, PNAS
      • Classifying a set of genes into six categories using OVA
      • 2467 genes from the budding yeast S. cerevisiae, 79 experiments [Eisen et al., 1998]
      • Training and test set: MIPS Yeast Genome Database (MYGD)
        • Tricarboxylic-acid pathway
        • Respiration chain complexes
        • Cytoplasmic ribosomal proteins
        • Proteasome
        • Histones
        • Helix-turn-helix (used as a control)

  17. Real classification experiments B - Using SVM
      Experiment and Result 1
      • The cost function is FP + 2FN (FP: false positives, FN: false negatives)
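The cost function counts a missed class member twice as heavily as a false alarm. A minimal sketch, with a hypothetical function name and labels encoded as 1 (in the class) and 0 (not in the class):

```python
def classification_cost(y_true, y_pred):
    """Cost = FP + 2 * FN for binary labels (1 = member of the class, 0 = not)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return fp + 2 * fn

# Two false positives and one false negative.
example_cost = classification_cost([1, 1, 0, 0, 0], [1, 0, 1, 1, 0])
```

With this weighting, a classifier that labels every gene negative is not free: it pays 2 for each true member it misses.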

  18. Real classification experiments B - Using SVM
      Experiment and Result 2
      • SVM performed better than the other machine learning algorithms tested, including Parzen windows, Fisher's linear discriminant, and decision trees
      • The Gaussian kernel performed better than any polynomial kernel
      • They found some erroneous annotations in the MYGD classification, such as the false positive YAL003W

  19. Real classification experiments B - Using SVM
      • Terrence S. Furey (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data, Bioinformatics
      • Supervised learning with a support vector machine
      • 97,802 cDNAs measured in ovarian cancer tissue, normal ovarian tissue, and other normal tissue (some of them may not be related)
      • 31 tissue samples (31 experiments)
      • Binary classification: is a tissue cancerous or not?

  20. Real classification experiments B - Using SVM
      Result
      • Also found wrongly annotated samples such as N039 and HWBC3
      • A simple linear kernel works well
      • SVM did not outperform other machine learning algorithms such as the multilayer perceptron
      • Possibly because there were too few examples

  21. Real classification experiments B - Using SVM
      • Sayan Mukherjee (2003), Classifying microarray data using support vector machines, Technical Report AI Memo
      • Supervised learning with a support vector machine
      • Same dataset as Golub's (27 ALL, 11 AML)
      • 7,129 genes for each sample
      • Binary classification of ALL versus AML
      • Also discusses multi-class classification

  22. Real classification experiments B - Using SVM
      Experiment and Result 1
      • Same dataset as Golub's
      • Binary classification of ALL or AML, 38 training samples, 35 testing samples
      • A linear SVM classified 34 of 35 test samples correctly
      • Polynomial and Gaussian kernels did not increase the accuracy of the classifier

  23. Real classification experiments B - Using SVM
      Experiment and Result 2
      • Comparison of the SVM method with weighted voting and kNN on seven cancer classification problems
      [Table: test set results]

  24. Real classification experiments B - Using SVM
      Experiment and Result 3
      • Multi-class classification
      • 218 tumor samples (various, 14 types) + 90 normal samples, 16,063 genes
      • OVA gave the best result

  25. Conclusions
      • Support vector machines are useful for analyzing microarray gene expression data
      • A relatively small number of samples combined with a large number of features tends to cause overfitting, which SVMs can avoid
      • As far as the research so far shows, the choice of kernel function does not affect SVM performance much. The kernel function is expected to matter more as the number of samples increases
      • This kind of study can also be useful for checking and correcting the labels of the original data set

  26. Further works
      • Test other publicly available microarray data myself
      • Test SVM on real data (we may be able to obtain microarray data on stomach cancer and liver cancer, which are not studied much outside Korea)
      • Study other research that uses other kinds of SVM

  27. References
      • Furey et al. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16(10) 906-914
      • Brown et al. (1999) Support vector machine classification of microarray gene expression data. UCSC-CRL-99-09
      • Brown et al. (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. PNAS 97(1) 262-267 (paper version of the above)
      • Golub et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439) 531-537
      • Lee et al. (2003) Classification of multiple cancer types by multicategory support vector machines using gene expression data. Bioinformatics 19(9) 1132-1139
      • Mukherjee et al. (2003) Classifying microarray data using support vector machines. In "A Practical Approach to Microarray Data Analysis", D. P. Berrar, W. Dubitzky and M. Granzow (eds.), Chapter 9, 166-185
      • Class material from CS774 Machine Learning: Theory and Practice, KAIST, Jahwan Kim. Chapters 1 to 5
      • http://www.mathlove.org (Mathlove incorporated association homepage, questions and answers board)
