Decision Trees for Hierarchical Multilabel Classification: A Case Study in Functional Genomics
Hendrik Blockeel1, Leander Schietgat1, Jan Struyf1,2, Saso Dzeroski3, Amanda Clare4
1 Katholieke Universiteit Leuven
2 University of Wisconsin, Madison
3 Jozef Stefan Institute, Ljubljana
4 University of Wales, Aberystwyth
Overview
• The task: hierarchical multilabel classification (HMC)
  • Applied to functional genomics
• Decision trees for HMC
  • Multiple prediction with decision trees
  • HMC decision trees
• Experiments
  • How does HMC tree learning compare to learning multiple standard trees?
• Conclusions
Classification settings
• Normally, in classification, we assign one class label ci from a set C = {c1, …, ck} to each example
• In multilabel classification, we have to assign a subset S ⊆ C to each example
  • i.e., one example can belong to multiple classes
  • Some applications:
    • Text classification: assign subjects (newsgroups) to texts
    • Functional genomics: assign functions to genes
• In hierarchical multilabel classification (HMC), the classes form a hierarchy (C, ≤)
  • The partial order ≤ expresses "is a superclass of"
Hierarchical multilabel classification
• Hierarchy constraint:
  • ci ≤ cj ⇒ coverage(cj) ⊆ coverage(ci)
  • Elements of a class must be elements of its superclasses
  • Should hold for the given data as well as for the predictions
• Straightforward way to learn an HMC model:
  • Learn k binary classifiers, one for each class
  • Disadvantages:
    1. difficult to guarantee the hierarchy constraint
    2. skewed class distributions (few positives, many negatives)
    3. relatively slow
    4. no simple interpretable model
• Alternative: learn one classifier that predicts a vector of classes
  • Quite natural for, e.g., neural networks
  • We will do this with (interpretable) decision trees
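Not from the slides: a minimal sketch of how the hierarchy constraint can be enforced on a predicted label set, by adding every superclass of each predicted class. The parent map, class names and the helper functions (ancestors, close_upward) are illustrative assumptions, not part of the original work.

# Sketch: enforcing ci ≤ cj ⇒ coverage(cj) ⊆ coverage(ci) on a prediction
# by closing the predicted label set upward in the hierarchy.

# Hypothetical example hierarchy: parent[c] is the direct superclass of c
# (None means c is a top-level class).
parent = {"1": None, "1/1": "1", "1/1/3": "1/1", "2": None, "2/5": "2"}

def ancestors(cls):
    """All superclasses of cls, following the parent links."""
    out = []
    p = parent[cls]
    while p is not None:
        out.append(p)
        p = parent[p]
    return out

def close_upward(predicted):
    """Smallest superset of `predicted` that satisfies the hierarchy constraint."""
    closed = set(predicted)
    for c in predicted:
        closed.update(ancestors(c))
    return closed

if __name__ == "__main__":
    print(close_upward({"1/1/3", "2/5"}))  # {'1', '1/1', '1/1/3', '2', '2/5'}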
Goal of this work
• There has been work on extending decision tree learning to the HMC case
  • Multiple prediction trees: Blockeel et al., ICML 1998; Clare and King, ECML 2001; …
  • HMC trees: Blockeel et al., 2002; Clare, 2003; Struyf et al., 2005
• HMC trees were evaluated in functional genomics, with good results (proof of concept)
• But: no comparison with learning multiple single-classification trees has been made
  • Size of trees, predictive accuracy, runtimes, …
  • Previous work focused on the knowledge discovery aspect
• We compare both approaches for functional genomics
Functional genomics
• Task: given a data set with descriptions of genes and the functions they have, learn a model that can predict for a new gene which functions it performs
• A gene can have multiple functions (out of 250 possible functions, in our case)
• Could be done with decision trees, with all the advantages that brings (fast, interpretable)… But:
  • Decision trees predict only one class, not a set of classes
  • Should we learn a separate tree for each function?
  • 250 functions = 250 trees: not so fast and interpretable anymore!
[Table: each gene (G1, G2, G3, …) is described by attributes A1 … An and marked with an 'x' for each of the 250 functions it performs]
Multiple prediction trees
• A multiple prediction tree (MPT) makes multiple predictions at once
  [Figure: a single tree over the gene table above, whose leaves each predict a set of functions, e.g. {4, 12, 105, 250}, {1, 5}, {1, 5, 24, 35}, …]
• Basic idea (Blockeel, De Raedt, Ramon, 1998):
  • A decision tree learner prefers tests that yield much information on the "class" attribute (measured using information gain (C4.5) or variance reduction (CART))
  • An MPT learner prefers tests that reduce variance for all target variables together
  • Variance = mean squared distance of the class vectors to their mean vector, in k-dimensional space (see the sketch below)
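A small illustrative sketch of that variance criterion (not Clus code): the variance of a node is the mean squared Euclidean distance of its 0/1 class vectors to their mean vector. The function name mpt_variance and the example data are assumptions for illustration.

# Sketch of the MPT variance criterion: mean squared distance of the
# class vectors in a node to their mean vector, in k-dimensional space.
import numpy as np

def mpt_variance(class_vectors):
    """class_vectors: (n_examples, k) array of 0/1 class memberships."""
    V = np.asarray(class_vectors, dtype=float)
    mean = V.mean(axis=0)
    return float(np.mean(np.sum((V - mean) ** 2, axis=1)))

if __name__ == "__main__":
    # Three genes annotated with 4 classes:
    print(mpt_variance([[1, 0, 1, 0],
                        [1, 0, 1, 1],
                        [0, 1, 0, 0]]))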
The algorithm
Procedure MPTree(T) returns tree
  (t*, h*, P*) = (none, ∞, ∅)
  for each possible test t
    P = partition induced by t on T
    h = Σ Tk∈P |Tk| / |T| · Var(Tk)
    if (h < h*) and acceptable(t, P)
      (t*, h*, P*) = (t, h, P)
  if t* ≠ none
    for each Tk ∈ P*
      treek = MPTree(Tk)
    return node(t*, ∪k {treek})
  else
    return leaf(v)
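A minimal runnable sketch of the procedure above, under several assumptions not stated on the slide: numeric attributes only, binary tests of the form "attribute <= threshold", acceptable(t, P) approximated by a minimum number of examples per branch (MIN_EXAMPLES), and leaves predicting the mean class vector. The function names mp_tree and predict are illustrative; the actual Clus implementation differs in many details.

# Sketch of MPTree for numeric attributes and binary tests; not Clus code.
import numpy as np

MIN_EXAMPLES = 2  # assumed acceptability/stopping parameter

def variance(Y):
    """Mean squared distance of the target vectors to their mean vector."""
    return float(np.mean(np.sum((Y - Y.mean(axis=0)) ** 2, axis=1)))

def mp_tree(X, Y):
    """X: (n, d) attributes, Y: (n, k) 0/1 class vectors. Returns a nested dict."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    n = len(Y)
    best = None  # (h, attribute index, threshold, boolean mask of the "yes" branch)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:         # candidate tests on attribute j
            left = X[:, j] <= thr
            if left.sum() < MIN_EXAMPLES or (~left).sum() < MIN_EXAMPLES:
                continue                            # not "acceptable"
            # h = sum over the partition of |Tk|/|T| * Var(Tk)
            h = (left.sum() / n) * variance(Y[left]) + \
                ((~left).sum() / n) * variance(Y[~left])
            if best is None or h < best[0]:
                best = (h, j, thr, left)
    if best is None or best[0] >= variance(Y):      # no useful test: make a leaf
        return {"leaf": Y.mean(axis=0)}             # leaf predicts the mean class vector
    h, j, thr, left = best
    return {"test": (j, thr),
            "yes": mp_tree(X[left], Y[left]),
            "no": mp_tree(X[~left], Y[~left])}

def predict(tree, x):
    """Return the vector of class proportions stored in the leaf x ends up in."""
    while "leaf" not in tree:
        j, thr = tree["test"]
        tree = tree["yes"] if x[j] <= thr else tree["no"]
    return tree["leaf"]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((20, 3))
    Y = (X[:, [0, 0, 1]] > 0.5).astype(float)       # targets correlated with attributes
    tree = mp_tree(X, Y)
    print(predict(tree, X[0]))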
HMC tree learning
• A special case of MPT learning
  • The class vector contains all classes in the hierarchy
• Main characteristics:
  • Errors higher up in the hierarchy are more important
    • Use a weighted Euclidean distance (higher weight for higher classes)
  • Need to ensure the hierarchy constraint
    • Normally, a leaf predicts ci iff the proportion of ci examples in the leaf is above some threshold ti (often 0.5)
    • We will let ti vary (see further)
    • To ensure compliance with the hierarchy constraint: ci ≤ cj ⇒ ti ≤ tj
    • Automatically fulfilled if all ti are equal
Example
[Figure: a class hierarchy with top-level classes c1, c2, c3 (weight 1) and second-level classes c4, c5, c6, c7 (weight 0.5), shown for three examples x1, x2, x3]
• Class vectors (ordered c1 … c7):
  • x1: {c1, c3, c5} = [1,0,1,0,1,0,0]
  • x2: {c1, c3, c7} = [1,0,1,0,0,0,1]
  • x3: {c1, c2, c5} = [1,1,0,0,1,0,0]
• Weighted squared distances (differences in c1–c3 count for 1, differences in c4–c7 for 0.25):
  • d2(x1, x2) = 0.25 + 0.25 = 0.5 (x1 and x2 differ only in c5 and c7)
  • d2(x1, x3) = 1 + 1 = 2 (x1 and x3 differ in c2 and c3)
  • So x1 is more similar to x2 than to x3
• The DT learner tries to create leaves with "similar" examples, i.e., leaves that are relatively pure w.r.t. class sets
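A small sketch reproducing the numbers in this example. It assumes the depth weight multiplies the difference before squaring (so a weight of 0.5 contributes 0.25 per differing second-level class); the exact form of the weighting is an assumption consistent with the slide, and the names w and d2 are illustrative.

# Sketch of the weighted Euclidean distance used to compare class vectors.
# Assumed form: d2(x, y) = sum_i (w_i * (x_i - y_i))^2, with w = 1 for
# top-level classes c1-c3 and w = 0.5 for second-level classes c4-c7.
import numpy as np

w = np.array([1, 1, 1, 0.5, 0.5, 0.5, 0.5])

def d2(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum((w * (x - y)) ** 2))

x1 = [1, 0, 1, 0, 1, 0, 0]   # {c1, c3, c5}
x2 = [1, 0, 1, 0, 0, 0, 1]   # {c1, c3, c7}
x3 = [1, 1, 0, 0, 1, 0, 0]   # {c1, c2, c5}

print(d2(x1, x2))  # 0.5: x1 and x2 differ in c5 and c7, both weight 0.5
print(d2(x1, x3))  # 2.0: x1 and x3 differ in c2 and c3, both weight 1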
Evaluating HMC trees
• Original work by Clare et al.:
  • Derive rules with high "accuracy" and "coverage" from the tree
  • The quality of individual rules was assessed
  • No simple overall criterion to assess the quality of the tree
• In this work: precision-recall curves
  • Precision = P(pos | predicted pos)
  • Recall = P(predicted pos | pos)
  • The precision and recall of a tree depend on the thresholds ti used
  • By varying the threshold ti from 1 to 0, a precision-recall curve emerges
• For 250 classes:
  • Precision = P(X | predicted X) [with X any of the 250 classes]
  • Recall = P(predicted X | X)
  • This gives a PR curve that is a kind of "average" of the individual PR curves for each class
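A sketch of how such an "average" precision-recall curve can be traced by sweeping a single threshold t from 1 to 0 over the predicted class proportions, pooling counts over all classes; using one threshold for all classes also automatically satisfies the hierarchy constraint, as noted earlier. The pooling scheme and the name pr_curve are assumptions for illustration.

# Sketch: pooled precision-recall curve over all classes, obtained by
# sweeping one threshold t over the predicted class proportions.
# P: (n_examples, n_classes) predicted proportions, Y: 0/1 true labels.
import numpy as np

def pr_curve(P, Y, thresholds):
    P, Y = np.asarray(P), np.asarray(Y)
    points = []
    for t in thresholds:
        pred = P >= t                             # predict class i iff proportion >= t
        tp = np.sum(pred & (Y == 1))
        precision = tp / max(pred.sum(), 1)       # P(X | predicted X), pooled over classes
        recall = tp / max((Y == 1).sum(), 1)      # P(predicted X | X), pooled over classes
        points.append((recall, precision))
    return points

if __name__ == "__main__":
    Y = np.array([[1, 0, 1], [0, 1, 1]])
    P = np.array([[0.9, 0.2, 0.6], [0.1, 0.7, 0.4]])
    for r, p in pr_curve(P, Y, np.linspace(1.0, 0.0, 5)):
        print(f"recall={r:.2f} precision={p:.2f}")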
The Clus system
• Created by Jan Struyf
• Propositional DT learner, implemented in Java
• Implements ideas from
  • C4.5 (Quinlan, '93)
  • CART (Breiman et al., '84)
  • predictive clustering trees (Blockeel et al., '98)
  • includes multiple prediction trees and hierarchical multilabel classification trees
• Reads data in ARFF format (Weka)
• We used two versions for our experiments:
  • Clus-HMC: the HMC version as explained
  • Clus-SC: single-classification version, roughly comparable to CART
The datasets
• 12 datasets from functional genomics
• Each with a different description of the genes:
  • Sequence statistics (1)
  • Phenotype (2)
  • Predicted secondary structure (3)
  • Homology (4)
  • Micro-array data (5-12)
• Each with the same class hierarchy
  • 250 classes distributed over 4 levels
• Number of examples: 1592 to 3932
• Number of attributes: 52 to 47034
Our expectations…
• How does HMC tree learning compare to the "straightforward" approach of learning 250 trees?
• We expect:
  • Faster learning: learning 1 HMCT is slower than learning 1 SPT (single prediction tree), but faster than learning 250 SPTs
  • Much faster prediction: using 1 HMCT for prediction is as fast as using 1 SPT, and hence 250 times faster than using 250 SPTs
  • Larger trees: the HMCT is larger than the average tree for 1 class, but smaller than the set of 250 trees
  • Less accurate: the HMCT is less accurate than the set of 250 SPTs (but hopefully not much less accurate)
• So how much faster / simpler / less accurate are our HMC trees?
The results
• The HMCT is on average less complex than a single SPT
  • The HMCT has 24 nodes, the SPTs on average 33 nodes
  • … but you'd need 250 of the latter to do the same job
• The HMCT is on average slightly more accurate than a single SPT
  • Measured using "average precision-recall curves" (see graphs)
  • Surprising, as each SPT is tuned for one specific prediction task
• Expectations w.r.t. efficiency are confirmed
  • Learning: minimum speedup factor 4.5x, maximum 65x, average 37x
  • Prediction: >250 times faster (since the tree is not larger)
• Faster to learn, much faster to apply
Precision-recall curves
• Precision: proportion of predictions that is correct, P(X | predicted X)
• Recall: proportion of class memberships correctly identified, P(predicted X | X)
An example rule
• High interpretability: IF-THEN rules extracted from the HMCT are quite simple
IF   Nitrogen_Depletion_8_h <= -2.74
AND  Nitrogen_Depletion_2_h > -1.94
AND  1point5_mM_diamide_5_min > -0.03
AND  1M_sorbitol___45_min_ > -0.36
AND  37C_to_25C_shock___60_min > 1.28
THEN 40, 40/3, 5, 5/1
• For class 40/3: recall = 0.15; precision = 0.97
  (the rule covers 15% of all class 40/3 cases, and 97% of the cases fulfilling these conditions are indeed 40/3)
The effect of merging…
[Figure: 250 separate trees, each optimized for one class c1, c2, …, c250, merged into a single tree optimized for c1, c2, …, c250 together]
• Smaller than the average individual tree
• More accurate than the average individual tree
Any explanation for these results?
• Seems too good to be true… how is it possible?
• Answer: the classes are not independent
  • Different trees for different classes actually share structure
    • This explains some of the complexity reduction achieved by the HMC tree, but not all!
  • One class carries information on other classes
    • This increases the signal-to-noise ratio
    • Provides better guidance when learning the tree (explaining the good accuracy)
    • Avoids overfitting (explaining the further reduction in tree size)
  • This was confirmed empirically
Overfitting
• To check our "overfitting" hypothesis:
  • Compared the area under the PR curve on the training set (Atr) and the test set (Ate)
  • For SPC: Atr - Ate = 0.219
  • For HMCT: Atr - Ate = 0.024
  • (to verify, we tried Weka's M5' too: 0.387)
• So the HMCT clearly overfits much less
Conclusions
• Surprising discovery: a single tree can be found that
  • predicts 250 different functions with, on average, equal or better accuracy than special-purpose trees for each function
  • is not more complex than a single special-purpose tree (hence, 250 times simpler than the whole set)
  • is (much) more efficient to learn and to apply
• The reason is to be found in the dependencies between the gene functions
  • They provide better guidance when learning the tree
  • They help to avoid overfitting
• Multiple prediction / HMC trees have a lot of potential and should be used more often!
Ongoing work
• More extensive experimentation
• Predicting classes in a lattice instead of a tree-shaped hierarchy