This project analyzes decision tree learning and its application in classifying gene expression patterns in leukemia patients. The analysis is performed using WEKA, a popular machine learning tool. The project aims to compare the performance of different decision tree algorithms and identify factors that influence their effectiveness.
Artificial Intelligence Project #3: Analysis of Decision Tree Learning Using WEKA May 23, 2006
Introduction • Decision tree learning is a method for approximating discrete-valued target functions • The learned function is represented by a decision tree • A decision tree can also be re-represented as a set of if-then rules to improve human readability
Decision Tree Representation (1/2) • Decision trees classify instances by sorting them down the tree from the root to some leaf node • Node • Specifies a test of some attribute • Branch • Corresponds to one of the possible values for this attribute
Decision Tree Representation (2/2) • Each path corresponds to a conjunction of attribute tests • Example: the instance (Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Strong) follows the path (Outlook=Sunny ∧ Humidity=High), so it is classified as No • Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances • (Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak) • Tree: Outlook at the root • Sunny: test Humidity (High: No, Normal: Yes) • Overcast: Yes • Rain: test Wind (Strong: No, Weak: Yes) • What is the merit of tree representation?
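As an illustration of the two views on this slide, the following minimal sketch encodes the example tree as nested dictionaries, classifies the example instance, and lists the paths that end in Yes (the disjunction of conjunctions). The names `TREE`, `classify`, and `paths_to` are hypothetical helpers introduced here, not part of WEKA or the original slides.

```python
# Example tree from the slide as nested dicts: {attribute: {value: subtree_or_label}}.
# The encoding and the helper names are assumptions made for this illustration.
TREE = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}


def classify(tree, instance):
    """Sort an instance down the tree from the root to a leaf label."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[instance[attribute]]
    return tree


def paths_to(tree, label, prefix=()):
    """Return every root-to-leaf path (a conjunction of tests) ending in `label`."""
    if not isinstance(tree, dict):
        return [prefix] if tree == label else []
    attribute, branches = next(iter(tree.items()))
    paths = []
    for value, subtree in branches.items():
        paths += paths_to(subtree, label, prefix + ((attribute, value),))
    return paths


instance = {"Outlook": "Sunny", "Temperature": "Hot",
            "Humidity": "High", "Wind": "Strong"}
print(classify(TREE, instance))

# Each printed line is one conjunction; together they form the disjunction for Yes.
for path in paths_to(TREE, "Yes"):
    print(" AND ".join(f"{a}={v}" for a, v in path))
```

Note that Temperature never appears in the tree, so it is ignored during classification; only the attributes tested along the path matter.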
Appropriate Problems for Decision Tree Learning • Instances are represented by attribute-value pairs • The target function has discrete output values • Disjunctive descriptions may be required • The training data may contain errors • Both errors in classification of the training examples and errors in the attribute values • The training data may contain missing attribute values • Suitable for classification
Study • Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells, MH Cheok et al., Nature Genetics 35, 2003. • Workflow: 60 leukemia patients → bone marrow samples → Affymetrix GeneChip arrays → gene expression data
Gene Expression Data • # of data examples • 120 (60: before treatment, 60: after treatment) • # of genes measured • 12600 (Affymetrix HG-U95A array) • Task • Classification between “before treatment” and “after treatment” based on gene expression pattern
Affymetrix GeneChip Arrays • Use short oligos to detect gene expression level. • Each gene is probed by a set of short oligos. • Each gene expression level is summarized by • Signal: numerical value describing the abundance of mRNA • A/P call: denotes the statistical significance of signal
Preprocessing • Remove the genes having more than 60 ‘A’ calls • # of genes: 12600 → 3190 • Discretization of gene expression level • Criterion: median gene expression value of each sample • 0 (low) and 1 (high)
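The two preprocessing steps could be sketched as follows. This is only an illustration under stated assumptions: the data are held as a genes-by-samples list of signal values plus a matching list of A/P calls, the median is taken per sample over the genes that survive the filter, and values equal to the median are mapped to 0. The slides do not specify the data layout or the tie rule, and the function names are hypothetical.

```python
from statistics import median


def filter_absent(signals, calls, max_absent=60):
    """Keep genes (rows) that have at most `max_absent` 'A' calls.

    Returns the kept signal rows and their original row indices.
    """
    keep = [i for i, row in enumerate(calls) if row.count("A") <= max_absent]
    return [signals[i] for i in keep], keep


def discretize_by_sample_median(signals):
    """Binarize each value against the median of its sample (column).

    Assumption: strictly above the sample median -> 1 (high), otherwise 0 (low).
    """
    n_samples = len(signals[0])
    medians = [median(row[j] for row in signals) for j in range(n_samples)]
    return [[1 if row[j] > medians[j] else 0 for j in range(n_samples)]
            for row in signals]


# Toy data: 4 genes x 3 samples, with a threshold of 1 instead of 60.
toy_signals = [[10, 200, 30], [50, 60, 70], [5, 8, 2], [90, 20, 40]]
toy_calls = [["P", "P", "A"], ["P", "P", "P"], ["A", "A", "A"], ["P", "A", "P"]]

kept_signals, kept_index = filter_absent(toy_signals, toy_calls, max_absent=1)
binary = discretize_by_sample_median(kept_signals)
print(kept_index)
print(binary)
```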
Gene Filtering • Using mutual information • Estimated probabilities were used. • # of genes: 3190 → 1000 • Final dataset • # of attributes: 1001 (one for the class) • Class: 0 (after treatment), 1 (before treatment) • # of data examples: 120
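A minimal sketch of mutual-information filtering is given below. It assumes that "estimated probabilities" means relative frequencies counted from the data, that each gene is scored against the class label, and that the highest-scoring genes are kept; the slides do not spell out these details, and `mutual_information` and `top_k_genes` are hypothetical names.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """I(X;Y) in bits, with probabilities estimated as relative frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * log2(count * n / (px[x] * py[y]))
    return mi


def top_k_genes(genes, labels, k):
    """Return the indices of the k genes with the highest mutual information."""
    scores = [mutual_information(gene, labels) for gene in genes]
    ranked = sorted(range(len(genes)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy data: 3 binary genes measured on 4 examples.
labels = [0, 0, 1, 1]
genes = [
    [0, 1, 0, 1],  # independent of the class
    [0, 0, 1, 1],  # identical to the class
    [0, 0, 0, 1],  # partially informative
]
print(top_k_genes(genes, labels, k=2))
```

For a binary attribute and a binary class, this score equals the information gain that ID3 computes at a split, so the filter favors genes that would make good root tests.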
Final Dataset • 1000 genes (attributes) × 120 data examples
Materials for the Project • Given • Preprocessed microarray data file: data2.txt • Downloadable • WEKA (http://www.cs.waikato.ac.nz/ml/weka/)
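WEKA reads datasets in its ARFF format, so the binary matrix may need to be converted before loading. The sketch below writes already-loaded rows as ARFF text with nominal {0,1} attributes. The layout of data2.txt is not described in the slides, so parsing it is deliberately left out; the attribute names `gene_1`, `gene_2`, ..., the relation name, and the function `to_arff` are assumptions made for illustration.

```python
def to_arff(rows, labels, relation="leukemia"):
    """Render binary examples and class labels as ARFF text.

    rows:   one list of 0/1 attribute values per example
    labels: one 0/1 class value per example
    Attribute names (gene_1, ...) are placeholders, not the real gene names.
    """
    n_attrs = len(rows[0])
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute gene_{j + 1} {{0,1}}" for j in range(n_attrs)]
    lines += ["@attribute class {0,1}", "", "@data"]
    for row, label in zip(rows, labels):
        lines.append(",".join(str(v) for v in list(row) + [label]))
    return "\n".join(lines) + "\n"


# Toy example: 2 examples with 2 attributes each.
arff_text = to_arff([[0, 1], [1, 1]], [1, 0])
print(arff_text)
```

Declaring the attributes as nominal rather than numeric matters here, because WEKA's Id3 implementation only handles nominal attributes.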
Submission • Due date: June 15 (Thu.), 12:00 (noon) • Report: hard copy (301-419) & e-mail • Use ID3, J48, and another decision tree algorithm with learning parameters • Show the experimental results of each algorithm. Except for ID3, try to find better performance by changing the learning parameters. • Analyze what makes the difference between the selected algorithms. • E-mail: jwha@bi.snu.ac.kr