Classification with Multiple Decision Trees

Presentation Transcript


  1. Classification with Multiple Decision Trees CV-2003 Eran Shimonovitz Ayelet Akselrod-Ballin

  2. Plan • Basic framework: query selection, impurity, stopping criteria, … • Intermediate summary • Combining multiple trees • Y. Amit & D. Geman's approach • Randomization, Bagging, Boosting • Applications

  3. Introduction A general classifier uses measurements made on an object to assign that object to a category.

  4. Some popular classification methods • Nearest Neighbor Rule • Bayesian Decision Theory • Fisher Linear Discriminant • SVM • Neural Network

  5. Formulation • x: a measurement vector (x1, x2, …, xd) ∈ X, pre-computed for each data point • C = {1, …, J}: the set of J classes; the true class of x is denoted Y(x) • L = {(x1, y1), …, (xN, yN)}: the learning sample • Data patterns can be ordered (numerical, real-valued) or categorical (nominal lists of attributes) • A classification rule is a function defined on X such that for every x, Ŷ(x) is equal to one of 1, …, J.

  6. The goal is to construct a classifier Ŷ such that the misclassification rate P(Ŷ(X) ≠ Y(X)) is as small as possible.

  7. Basic framework CART, classification and regression trees (Breiman and colleagues, 1984): trees are constructed by repeated splits of subsets of X into descendant subsets. [Figure: root node, sub-tree, and leaves.]

  8. Split number: binary vs. multi-valued. Every tree can be represented using only binary decisions (Duda, Hart & Stork, 2001).

  9. Query selection & impurity • P(ωj): more precisely, the fraction of patterns at node T in category ωj • Impurity: Φ is a nonnegative function defined on the set of all J-tuples (pω1, …, pωJ) with Σ_j pωj = 1, satisfying the properties on the next slide. • Example class distributions at a node: (1/6, 1/6, 1/6, 1/6, 1/6, 1/6), (1/3, 1/3, 1/3, 0, 0, 0), (0, 0, 0, 1/3, 1/3, 1/3).

  10. Impurity properties • When all categories are equally represented, Φ reaches its maximum. • If all patterns that reach the node bear the same category, Φ = 0. • Φ is a symmetric function of pω1, …, pωJ. Given Φ, define the impurity measure i(T) at any node T as i(T) = Φ(P(ω1|T), …, P(ωJ|T)).

  11. Entropy impurity [Figure: a parent node with class fractions (8/16, 8/16) is split by the query X1 < 0.6; one descendant receives 10/16 of the points with fractions (7/10, 3/10) and i(T) = 0.88, the other receives 6/16 with fractions (1/6, 5/6) and i(T) = 0.65.]

  12. Entropy impurity [Figure: the resulting tree; internal nodes test X1 < 0.6, X2 < 0.32, X2 < 0.61, X1 < 0.35, and X1 < 0.69, and the leaves are labeled w1 or w2.]

  13. Other impurity functions • Variance impurity (two-class case): i(T) = P(ω1)·P(ω2) • Gini impurity: i(T) = Σ_{i≠j} P(ωi)P(ωj) = 1 − Σ_j P(ωj)² • Misclassification impurity: i(T) = 1 − max_j P(ωj) In practice, the choice of impurity measure has little effect on the overall performance; a sketch of these measures in code follows.
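
These impurity measures are straightforward to compute. Below is a minimal NumPy sketch (the function names are illustrative, not from the slides), evaluated on the class fractions used in the entropy-impurity example above.

    import numpy as np

    def entropy_impurity(p):
        """i(T) = -sum_j p_j * log2(p_j), with 0*log(0) taken as 0."""
        p = np.asarray(p, dtype=float)
        nz = p[p > 0]
        return -np.sum(nz * np.log2(nz))

    def gini_impurity(p):
        """i(T) = 1 - sum_j p_j^2."""
        p = np.asarray(p, dtype=float)
        return 1.0 - np.sum(p ** 2)

    def misclassification_impurity(p):
        """i(T) = 1 - max_j p_j."""
        return 1.0 - np.max(np.asarray(p, dtype=float))

    def variance_impurity(p):
        """Two-class case only: i(T) = p_1 * p_2."""
        p = np.asarray(p, dtype=float)
        return p[0] * p[1]

    # Class fractions from the entropy-impurity example above:
    print(entropy_impurity([8/16, 8/16]))   # 1.0
    print(entropy_impurity([7/10, 3/10]))   # about 0.88
    print(entropy_impurity([1/6, 5/6]))     # about 0.65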

  14. Goodness of split (conditional entropy) • Defined as the decrease in impurity, Δi(s, T) = i(T) − PL·i(TL) − PR·i(TR), where the split s sends a proportion PL of the points at node T to the descendant TL and a proportion PR = 1 − PL to the descendant TR. • Select the split that maximizes Δi(s, T). • This is a greedy method: each split is only a local optimization.
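
To make the split score concrete, here is a minimal NumPy sketch of Δi(s, T) with entropy impurity, plus a tiny greedy selection over a handful of candidate (feature, threshold) queries; the helper names and the toy data are illustrative, not from the slides.

    import numpy as np

    def entropy(p):
        """Entropy impurity of a vector of class fractions."""
        p = np.asarray(p, dtype=float)
        nz = p[p > 0]
        return -np.sum(nz * np.log2(nz))

    def class_fractions(labels, n_classes):
        """Fraction of the points in each category."""
        counts = np.bincount(labels, minlength=n_classes)
        return counts / max(counts.sum(), 1)

    def impurity_decrease(labels, go_left, n_classes):
        """Delta i(s, T) = i(T) - P_L * i(T_L) - P_R * i(T_R)."""
        p_l = go_left.mean()
        return (entropy(class_fractions(labels, n_classes))
                - p_l * entropy(class_fractions(labels[go_left], n_classes))
                - (1.0 - p_l) * entropy(class_fractions(labels[~go_left], n_classes)))

    # Greedy query selection: score a few candidate splits and keep the best.
    rng = np.random.default_rng(0)
    X = rng.random((16, 2))
    y = (X[:, 0] > 0.6).astype(int)                 # toy labels
    candidates = [(0, 0.35), (0, 0.6), (1, 0.5)]    # (feature, threshold) pairs
    best = max(candidates,
               key=lambda ft: impurity_decrease(y, X[:, ft[0]] < ft[1], n_classes=2))
    print(best)   # (0, 0.6) recovers the rule that generated the toy labels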

  15. Entropy formulation The vector of predictors is assumed binary. For each predictor f, calculate the conditional entropy of the class given Xf, H(Y | Xf), and keep the predictor with the smallest value (equivalently, the largest entropy decrease).
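
A minimal sketch of this criterion, assuming binary predictors and integer class labels (the toy data and names are mine); minimizing H(Y | Xf) is equivalent to maximizing the entropy-impurity decrease of the previous sketch.

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        nz = p[p > 0]
        return -np.sum(nz * np.log2(nz))

    def conditional_entropy(y, xf, n_classes):
        """H(class | Xf) for a single binary predictor Xf, estimated from counts."""
        h = 0.0
        for v in (0, 1):
            mask = (xf == v)
            if mask.any():
                fracs = np.bincount(y[mask], minlength=n_classes) / mask.sum()
                h += mask.mean() * entropy(fracs)
        return h

    # Keep the predictor with the lowest conditional entropy.
    rng = np.random.default_rng(1)
    X = rng.integers(0, 2, size=(200, 10))                     # 10 binary predictors
    y = X[:, 3] ^ rng.choice([0, 1], size=200, p=[0.9, 0.1])   # noisy copy of X[:, 3]
    best_f = min(range(X.shape[1]),
                 key=lambda f: conditional_entropy(y, X[:, f], n_classes=2))
    print(best_f)   # very likely 3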

  16. Stopping criteria Trade-off: growing the tree fully until minimum impurity leads to overfitting, while stopping the splitting too early leaves the error insufficiently low. • The best candidate split at a node reduces the impurity by less than a threshold. • A lower bound on the number or percentage of points at a node. • Validation and cross-validation. • Statistical significance of the impurity reduction.
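
Several of these criteria can be folded into one check that is consulted before each split. A minimal sketch with hypothetical threshold names (validation-based and significance-based criteria are omitted because they need held-out data or a test statistic):

    def should_stop(n_points, best_decrease, depth,
                    min_points=10, min_decrease=1e-3, max_depth=12):
        """Return True if any stopping criterion fires: too few points at the
        node, the best candidate split reduces impurity by less than a
        threshold, or the maximum depth has been reached."""
        return (n_points < min_points
                or best_decrease < min_decrease
                or depth >= max_depth)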

  17. Recognizing overfitting [Plot: accuracy (0.5 to 0.9) versus tree size (number of nodes, 0 to 80), with one curve for accuracy on training data and one for accuracy on test data.]

  18. Assignment of leaf labels • When leaf nodes have positive impurity, each leaf is labeled by the category that has the most points there (majority vote).

  19. Recursive partitioning scheme If the stopping criterion is met, label the node with its most common category. Otherwise: select the attribute A that maximizes the impurity reduction [computed from P(j|N) and i(N)]; for each possible value of A add a new branch; below each new branch, grow a sub-tree by applying the scheme recursively. A sketch of the procedure follows.
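
Here is a minimal sketch of the recursive scheme for binary splits on numeric features, using entropy impurity and majority-vote leaf labels; the dictionary representation, function names, and thresholds are illustrative, not taken from any particular implementation.

    import numpy as np

    def entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def best_split(X, y):
        """Return (feature, threshold, impurity decrease) of the best binary split."""
        best = (None, None, 0.0)
        for f in range(X.shape[1]):
            for t in np.unique(X[:, f]):
                left = X[:, f] < t
                if left.all() or not left.any():
                    continue
                dec = (entropy(y) - left.mean() * entropy(y[left])
                       - (~left).mean() * entropy(y[~left]))
                if dec > best[2]:
                    best = (f, t, dec)
        return best

    def grow_tree(X, y, depth=0, min_points=5, max_depth=8):
        f, t, dec = best_split(X, y)
        if f is None or len(y) < min_points or depth >= max_depth:
            return {"leaf": True, "label": np.bincount(y).argmax()}  # majority vote
        left = X[:, f] < t
        return {"leaf": False, "feature": f, "threshold": t,
                "left": grow_tree(X[left], y[left], depth + 1, min_points, max_depth),
                "right": grow_tree(X[~left], y[~left], depth + 1, min_points, max_depth)}

    def predict_one(tree, x):
        while not tree["leaf"]:
            tree = tree["left"] if x[tree["feature"]] < tree["threshold"] else tree["right"]
        return tree["label"]

    rng = np.random.default_rng(0)
    X = rng.random((200, 2))
    y = (X[:, 0] + X[:, 1] > 1).astype(int)
    tree = grow_tree(X, y)
    print(predict_one(tree, np.array([0.9, 0.8])))   # most likely 1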

  20. [Figure: the tree grown for a two-class example; internal nodes test X2 < 0.83, X1 < 0.27, X1 < 0.89, X2 < 0.34, X2 < 0.56, X1 < 0.09, and X1 < 0.56, and the leaves are labeled w1 or w2.]

  21. Preprocessing with PCA [Figure: a single oblique split, −0.8·X1 + 0.6·X2 < 0.3, separates w1 from w2.]
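
Because standard CART splits are axis-parallel, rotating the coordinate system (for example with PCA) can let one threshold on a derived coordinate play the role of an oblique split such as −0.8·X1 + 0.6·X2 < 0.3. A minimal scikit-learn sketch of this preprocessing step on synthetic data (not the data from the slides):

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                               n_redundant=0, random_state=0)

    raw_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    pca_tree = make_pipeline(PCA(n_components=2),
                             DecisionTreeClassifier(max_depth=3, random_state=0))

    print(cross_val_score(raw_tree, X, y, cv=5).mean())   # axis-parallel splits only
    print(cross_val_score(pca_tree, X, y, cv=5).mean())   # splits in rotated coordinates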

  22. Popular tree algorithms • ID3, the third "interactive dichotomizer" (Quinlan, 1983) • C4.5, a descendant of ID3 (Quinlan, 1993) • C5.0

  23. Pros & cons Pros: interpretability and good insight into the data structure; rapid classification; natural handling of multiple classes; modest space complexity; refinement without reconstructing the whole tree; natural incorporation of prior expert knowledge; … Cons: instability, i.e. sensitivity to the training points, a result of the greedy construction; training time; over-training sensitivity; difficult to understand if the tree is large; …

  24. Combining multiple classification trees Main problem: instability. Small changes in the training set cause large changes in the classifier. Solution: grow multiple trees instead of just one and then combine the information. The aggregation produces a significant improvement in accuracy.

  25. Protocols for generating multiple classifiers • Randomization of the queries considered at each node • Boosting: sequential reweighting of the training points (e.g. AdaBoost) • Bagging: bootstrap aggregation A comparison sketch of these protocols follows.
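
All three protocols have off-the-shelf realizations. Below is a minimal scikit-learn sketch that compares them to a single tree on synthetic data; note that RandomForestClassifier stands in for "randomization" here and differs in detail from Amit & Geman's randomized queries.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                                  RandomForestClassifier)
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    models = {
        "single tree": DecisionTreeClassifier(random_state=0),
        "randomization (random forest)": RandomForestClassifier(n_estimators=100,
                                                                random_state=0),
        "bagging": BaggingClassifier(n_estimators=100, random_state=0),
        "boosting (AdaBoost)": AdaBoostClassifier(n_estimators=100, random_state=0),
    }
    for name, model in models.items():
        print(name, cross_val_score(model, X, y, cv=5).mean())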

  26. Y. Amit & D. Geman's approach "Shape quantization and recognition with randomized trees", Neural Computation, 1997. Shape recognition based on shape features and tree classifiers. The goal: select the informative shape features and build the tree classifiers.

  27. Randomization At each node: • choose a random sample of predictors from the whole candidate collection; • estimate the optimal predictor using a random sample of data points. The sizes of these two random samples are parameters of the method.
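
A minimal sketch of this node-level randomization, with plain binary features standing in for the paper's arrangement queries and illustrative parameter names: at each node a random subset of queries and a random subsample of the node's points are drawn, and the best sampled query (lowest conditional entropy) is kept.

    import numpy as np

    def entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def randomized_query(X, y, rng, n_queries=20, n_points=200):
        """Pick the best binary query among a random subset, estimated on a
        random subsample of the data points at this node."""
        queries = rng.choice(X.shape[1], size=min(n_queries, X.shape[1]), replace=False)
        sample = rng.choice(len(y), size=min(n_points, len(y)), replace=False)
        Xs, ys = X[sample], y[sample]

        def cond_entropy(f):
            h = 0.0
            for v in (0, 1):
                mask = Xs[:, f] == v
                if mask.any():
                    h += mask.mean() * entropy(ys[mask])
            return h

        return min(queries, key=cond_entropy)

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(1000, 500))   # 500 binary candidate queries
    y = X[:, 42]                               # toy labels tied to query 42
    # Prints the best of the 20 sampled queries (42 only if it was sampled).
    print(randomized_query(X, y, rng))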

  28. Multiple classification trees • Different trees correspond to different aspects of the shapes: they characterize the data from "different points of view". • The trees are statistically weakly dependent, thanks to the randomization.

  29. Aggregation After producing N trees T1, …, TN, classify a test point by maximizing the average terminal distribution: Ŷ(x) = arg max_c (1/N) Σ_n μ_{t_n(x)}(c), where t_n(x) is the terminal node reached by x in tree Tn and the terminal distribution is estimated from training counts, μ_t(c) = |L_t(c)| / Σ_{c'} |L_t(c')|, with L_t(c) the set of training points of class c at node t.
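
A minimal sketch of this aggregation rule using scikit-learn trees, whose predict_proba returns exactly the class fractions at the terminal node a point falls into; the per-split feature randomization via max_features only loosely imitates the paper's scheme.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # Grow N weakly dependent trees: random feature subsets at each split.
    trees = [DecisionTreeClassifier(max_features="sqrt", random_state=n).fit(X, y)
             for n in range(25)]

    def aggregate_predict(trees, X_test):
        """argmax_c of the average terminal distribution over all trees."""
        avg = np.mean([t.predict_proba(X_test) for t in trees], axis=0)
        return avg.argmax(axis=1)

    print((aggregate_predict(trees, X) == y).mean())   # training accuracy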

  30. [Figure: a test point is dropped down each of the trees T1, T2, …, Tn, and its class ω is obtained by aggregating the terminal distributions.]

  31. Data Classification examples: • handwritten digits • LaTeX symbols • binary images of 2D shapes • All images are registered to a fixed 32×32 grid. • Considerable within-class variation.

  32. Handwritten digits: NIST (National Institute of Standards and Technology) database of 223,000 binary images of isolated digits written by more than 2,000 writers; 100,000 images are used for training and 50,000 for testing.

  33. LaTeX symbols: 32 samples per class for all 293 classes, generated by synthetic deformations.

  34. Shape features • Each query corresponds to a spatial arrangement of local codes ("tags"). • Tags are coarse descriptions (5-bit codes) of the local topography of the intensity surface in the neighborhood of a pixel. • The discriminating power comes from the relative angles and distances between tags.

  35. Tags • 4×4 sub-images are randomly extracted and recursively partitioned based on individual pixel values. • There is a tag type for each node of the resulting tree. • If 5 questions are asked, this gives 2 + 4 + 8 + 16 + 32 = 62 tags.

  36. Tags (cont.) • Tag 16 is a depth-4 tag. The corresponding 4 questions in the sub-image are indicated by a mask in which 0 = background, 1 = object, and n = "not asked". • These neighborhoods are loosely described as "background to the lower left, object to the upper right".

  37. Spatial arrangement of local features • The arrangement A is a labeled hyper-graph: vertex labels correspond to tag types, and edge labels to relations (directional and distance constraints). • The query asks whether such an arrangement exists anywhere in the image.

  38. Example of node splitting A minimal extension of an arrangement A is either the addition of one relation between existing tags, or the addition of exactly one tag together with one relation binding the new tag to an existing one.

  39. The trees are grown by the scheme described …

  40. Importance of multiple randomized trees [Figure: the arrangements (graphs) found at a terminal node of five different trees.]

  41. Experiment: NIST • Stopping: nodes are split as long as at least m points remain in the second-largest class. • Q: the number of random queries examined per node. • A random sample of 200 training points is used per node. • 25-100 trees are produced. • The trees reach depth 10 on average.

  42. Results • The best error rate achieved with a single tree is 5%. • The average classification rate per tree is about 91%. • By aggregating the trees, the classification rate climbs above 99%, i.e. state-of-the-art error rates. [Table: error rates as a function of the number of trees (#T) and the rejection rate.]

  43. Conclusions • Stability & accuracy: combining multiple trees leads to a drastic decrease in error rates relative to the best individual tree. • Efficiency: fast training and testing. • The tree outputs lend themselves to visual interpretation. • Few parameters, and insensitivity to the parameter settings.

  44. Conclusions (cont.) • The approach is not model based and does not involve advanced geometry or extraction of boundary information. • Missing aspect: features from more than one resolution. • The most successful handwritten character recognition was reported by LeCun et al., 1998 (99.3%), using a multi-layer feed-forward network operating on raw pixel intensities.

  45. Voting tree learning algorithms A family of protocols for producing and aggregating multiple classifiers that • improve predictive accuracy, • are intended for unstable procedures, and • manipulate the training data in order to generate different classifiers. Methods: bagging, boosting.

  46. Bagging The name is derived from "bootstrap aggregation". A "bootstrap" data set is created by randomly selecting points from the training set, with replacement. Bootstrap estimation: the selection process is repeated independently, and the resulting data sets are treated as independent. (Bagging Predictors, Leo Breiman, 1996)
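
Drawing a bootstrap replicate of the learning sample is a one-liner: sample N indices uniformly with replacement. A minimal sketch (X and y stand for any training arrays):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 150
    idx = rng.integers(0, N, size=N)     # N indices drawn with replacement
    # X_boot, y_boot = X[idx], y[idx]    # the bootstrap data set L_B
    # On average, roughly 63% of the original points appear at least once:
    print(len(np.unique(idx)) / N)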

  47. Bagging algorithm • Select a bootstrap sample LB from L. • Grow a decision tree from LB. • Repeat the two steps above; estimate the class of each xn by plurality vote over the resulting trees. • The misclassification error is the number of cases in which the estimated class differs from the true class.
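
A minimal sketch of that loop using scikit-learn trees as the base classifier: grow one tree per bootstrap replicate, classify by plurality vote, and report the fraction of test points whose voted class differs from the true class.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=600, n_features=20, random_state=0)
    X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

    rng = np.random.default_rng(0)
    trees = []
    for b in range(50):
        idx = rng.integers(0, len(y_train), size=len(y_train))   # bootstrap sample L_B
        trees.append(DecisionTreeClassifier(random_state=b).fit(X_train[idx], y_train[idx]))

    # Plurality vote over the trees.
    votes = np.stack([t.predict(X_test) for t in trees])          # shape (B, n_test)
    voted = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
    print((voted != y_test).mean())                               # misclassification rate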

  48. Bagging: databases Data sets are taken from the UCI Machine Learning Repository.

  49. Bagging: results Error rates are the averages over 100 iterations.

  50. C4.5 vs. bagged C4.5 [Plot: comparison of error rates on data sets from the UCI Machine Learning Repository; from "Boosting the Margin", Robert E. Schapire, Yoav Freund, Peter Bartlett & Wee Sun Lee, 1998.]
