Non-Metric Methods

Presentation Transcript


  1. Non-Metric Methods Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking and Multimedia, National Taiwan University

  2. Non-Metric Descriptions • Nominal data • Discrete • Without a natural notion of similarity or even an ordering • Property d-tuple • A list of attributes • e.g., { red, shiny, sweet, small } • i.e., color = red, texture = shiny, taste = sweet, size = small

  3. Non-Metric Descriptions • Strings of nominal attributes • e.g., base sequences in DNA segments, “AGCTTCAGATTCCA” • Might themselves be the output of other component classifiers • e.g., a Chinese character recognizer and a neural network for classifying component brush strokes

  4. Non-Metric Methods • Learn categories from non-metric data • Represent structures in strings • Toward discrete problems addressed by • Rule-based pattern recognition methods • Syntactic pattern recognition methods

  5. Decision Trees

  6. Benefits of Decision Trees • Interpretability • Rapid classification • Through a sequence of simple queries • Natural way to incorporate prior knowledge from human experts

  7. Interpretability • Conjunctions and disjunctions • For any particular test pattern • e.g., properties: { taste, color, shape, size } • x = { sweet, yellow, thin, medium } • (color = yellow) AND (shape = thin) • For category description • e.g., Apple = (green AND medium) OR (red AND medium) • Rule reduction • e.g., Apple = (medium AND NOT yellow)
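
As a minimal illustration (not from the slides), the category description and its reduced rule above can be read as boolean tests; this sketch assumes color takes only the values mentioned in the example (green, red, yellow):

```python
# Sketch only: the slide's apple rule and its reduced form as boolean tests.
# Assumes color takes only the values mentioned in the example (green, red, yellow).

def is_apple(color, size):
    # Apple = (green AND medium) OR (red AND medium)
    return (color == "green" and size == "medium") or \
           (color == "red" and size == "medium")

def is_apple_reduced(color, size):
    # Apple = (medium AND NOT yellow); equivalent only under the assumption above
    return size == "medium" and color != "yellow"

# The two rules agree on every combination of the assumed attribute values.
assert all(is_apple(c, s) == is_apple_reduced(c, s)
           for c in ("green", "red", "yellow")
           for s in ("small", "medium", "large"))
```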

  8. Tree Construction • Given • Set D of labeled training data • Set of properties for discriminating patterns • Goal • Organize the tests into a tree

  9. Tree Construction • Split samples progressively into smaller subsets • Pure subset • All samples have the same category label • Could terminate that portion of the tree • Subset with mixture of labels • Decide either to stop or select another property and grow the tree further
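
A minimal sketch (not from the slides) of this recursive split-until-pure procedure, assuming binary equality queries on nominal attributes and using Gini impurity (introduced on a later slide) to select each query; grow_tree, Node, and the exhaustive query search are illustrative names and choices:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: Optional[str] = None      # set only at leaf nodes
    attribute: Optional[int] = None  # index of the property tested at this node
    value: Optional[str] = None      # patterns with x[attribute] == value go left
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(X, y, attributes):
    # Pure subset (or nothing left to split on): declare a leaf.
    if len(set(y)) == 1 or not attributes:
        return Node(label=Counter(y).most_common(1)[0][0])

    # Greedily choose the binary query "x[a] == v?" with the largest impurity drop.
    best, best_drop = None, 0.0
    for a in attributes:
        for v in {x[a] for x in X}:
            left = [i for i, x in enumerate(X) if x[a] == v]
            right = [i for i, x in enumerate(X) if x[a] != v]
            if not left or not right:
                continue
            drop = gini(y) - (len(left) * gini([y[i] for i in left]) +
                              len(right) * gini([y[i] for i in right])) / len(y)
            if drop > best_drop:
                best, best_drop = (a, v, left, right), drop

    if best is None:  # no query reduces impurity: stop and declare a leaf
        return Node(label=Counter(y).most_common(1)[0][0])

    a, v, left, right = best
    return Node(attribute=a, value=v,
                left=grow_tree([X[i] for i in left], [y[i] for i in left], attributes),
                right=grow_tree([X[i] for i in right], [y[i] for i in right], attributes))

# e.g., with fruit-style nominal attributes (color, size):
X = [("red", "medium"), ("yellow", "thin"), ("green", "medium")]
y = ["apple", "banana", "apple"]
tree = grow_tree(X, y, attributes=[0, 1])
```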

  10. CART • Classification and regression trees • A general framework for decision trees • General questions in CART • Number of decision outcomes at a node • Property tested at a node • When to declare a leaf • When and how to prune • How to assign a category at an impure leaf node • Handling of missing data

  11. Branching Factor and Binary Decisions • Branching factor (branching ratio) B • Number of links descending from a node • Binary decisions • Any decision can be represented using only binary decisions • e.g., a B = 3 query on color → color = green? color = yellow? • Universal expressive power

  12. Binary Trees

  13. Geometrical Interpretation for Trees for Numerical Data

  14. Fundamental Principle • Prefer decisions leading to a simple, compact tree with few nodes • A version of Occam’s razor • Seek a property query T at each node N • Make the data reaching the immediate descendent nodes as pure as possible • i.e., achieve lowest impurity • Impurity i(N) • Zero if all patterns bear the same label • Large if the categories are equally represented

  15. Entropy Impurity (Information Impurity) • Most popular measure of impurity
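
The slide's equation image did not survive the transcript; the standard entropy impurity is

  i(N) = -\sum_j P(\omega_j) \log_2 P(\omega_j),

where P(\omega_j) is the fraction of training patterns at node N that belong to category \omega_j.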

  16. Variance Impurity for Two-Category Case • Particularly useful in the two-category case
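
The missing equation is the two-category variance impurity

  i(N) = P(\omega_1)\,P(\omega_2),

which is zero when all patterns at N share one label and maximal when the two categories are equally represented.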

  17. Gini Impurity • Generalization of variance impurity • Applicable to two or more categories • Expected error rate at node N
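
The missing equation is the Gini impurity

  i(N) = \sum_{i \ne j} P(\omega_i)\,P(\omega_j) = 1 - \sum_j P^2(\omega_j)

(the first sum runs over ordered pairs i \ne j), i.e., the expected error rate at N if the category label were chosen at random according to the class distribution at N.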

  18. Misclassification Impurity • Minimum probability that a training pattern would be misclassified at N • Most strongly peaked at equal probabilities
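
The missing equation is the misclassification impurity

  i(N) = 1 - \max_j P(\omega_j),

the minimum error rate obtained by labeling every pattern at N with the majority category.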

  19. Impurity for Two-Category Case (figure: the impurity measures, adjusted in scale and offset for comparison)

  20. Heuristic to Choose Query • If entropy impurity is used, the impurity reduction corresponds to an information gain • The reduction in entropy impurity due to a binary split cannot exceed 1 bit
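
The split-selection quantity referred to here (its equation is missing from the transcript) is the drop in impurity for a binary split,

  \Delta i(N) = i(N) - P_L\, i(N_L) - (1 - P_L)\, i(N_R),

where N_L and N_R are the left and right descendent nodes and P_L is the fraction of the patterns at N that the query sends to N_L; the best query maximizes \Delta i(N).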

  21. Finding Extrema • Nominal attributes • Perform extensive or exhaustive search over all possible subsets of the training set • Real-valued attributes • Use gradient descent algorithms to find a splitting hyperplane • As a one-dimensional optimization problem for binary trees
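
For a real-valued attribute in a binary tree, the one-dimensional optimization can also be done by an exhaustive scan over candidate thresholds rather than gradient descent; a minimal sketch using Gini impurity (function names are illustrative, not from the slides):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    # Sort by attribute value; every midpoint between consecutive distinct
    # values is a candidate threshold for the query "x <= t?".
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    ys = [labels[i] for i in order]
    n, parent = len(xs), gini(labels)
    best_t, best_drop = None, 0.0
    for k in range(1, n):
        if xs[k] == xs[k - 1]:
            continue                       # identical values cannot be separated
        t = 0.5 * (xs[k - 1] + xs[k])      # midpoint (cf. the tie-breaking slide)
        drop = parent - (k * gini(ys[:k]) + (n - k) * gini(ys[k:])) / n
        if drop > best_drop:
            best_t, best_drop = t, drop
    return best_t, best_drop

# e.g.
print(best_threshold([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"]))  # (2.5, 0.5)
```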

  22. Tie Breaking • Nominal data • Choose randomly • Real-valued data • Assume a split lying in x_l < x_s < x_u • Choose either the middle point or the weighted average x_s = (1 - P) x_l + P x_u • P is the probability a pattern goes to the “left” under the decision • Computational simplicity may be a determining factor

  23. Greedy Method • Get a local optimum at each node • No assurance that successive locally optimal decisions lead to the global optimum • No guarantee that we will have the smallest tree • With reasonable impurity measures and learning methods • We often continue to split further to get the lowest possible impurity at the leaves

  24. Favoring Gini Impurity over Misclassification Impurity • Example: 90 patterns in ω1 and 10 in ω2 • Misclassification impurity: 0.1 • Suppose no split guarantees a ω2 majority in either of the two descendent nodes • Misclassification impurity remains at 0.1 for all splits • An attractive split: 70 ω1, 0 ω2 to the right and 20 ω1, 10 ω2 to the left • Gini impurity shows that this is a good split
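
A worked check of the numbers in this example (a sketch; the counts come from the slide, the helper names are illustrative):

```python
# Parent node: 90 patterns in w1, 10 in w2; the candidate split sends
# 70 w1 / 0 w2 to the right child and 20 w1 / 10 w2 to the left child.

def misclassification(counts):
    # 1 - max_j P(w_j): error rate of labeling the node with its majority class
    n = sum(counts)
    return 1.0 - max(counts) / n

def gini(counts):
    # 1 - sum_j P(w_j)^2
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent, left, right = [90, 10], [20, 10], [70, 0]
n = sum(parent)

for name, imp in [("misclassification", misclassification), ("Gini", gini)]:
    before = imp(parent)
    after = sum(imp(child) * sum(child) / n for child in (left, right))
    print(f"{name}: before={before:.3f}, after={after:.3f}, drop={before - after:.3f}")

# misclassification: before=0.100, after=0.100, drop=0.000  (no apparent gain)
# Gini:              before=0.180, after=0.133, drop=0.047  (split looks good)
```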

  25. Twoing Criterion • For multiclass binary tree creation • Find “supercategories” C1 and C2 • C1 = {ωi1, ωi2, …, ωik}, C2 = C − C1 • Compute Δi(s, C1) as though it corresponded to a standard two-class problem • Find the s*(C1) that maximizes the change, and then the supercategory C1* that maximizes Δi(s*(C1), C1)

  26. Practical Considerations • Choice of impurity function rarely affects the final classifier and its accuracy • Stopping criterion and pruning methods are more important in determining final accuracy

  27. Multiway Splits

  28. Importance of Stopping Criteria • Fully grown trees are typically overfit • Extreme case: each leaf corresponds to a single training point • The full tree is merely a look-up table • Does not generalize well in noisy problems • Early stopping • The error on the training data is not sufficiently low • Performance may suffer

  29. Stopping by Checking Validation Error • Using a subset of the data (e.g., 90%) for training and the remaining (10%) as a validation set • Continue splitting until the error on the validation data is minimized

  30. Stopping by Setting a Threshold • Stop if max_s Δi(s) < β • Benefits • Uses all the training data • Leaves can lie at different levels • Fundamental drawback • Difficult to determine the threshold • An alternative simple method • Stop when a node represents fewer than some threshold number of points, or a fixed percentage of the total training set

  31. Stopping by Checking a Global Criterion • Stop when a global criterion reaches its minimum • e.g., minimum description length • Criterion: the complexity of the tree plus the uncertainty of the data given the tree

  32. Stopping Using Statistical Tests

  33. Horizon Effect • Determination of optimal split at a node is not influenced by decisions at its descendent nodes • A stopping condition may be met too early for overall optimal recognition accuracy • Biases toward trees in which the greatest impurity is near the root node

  34. Pruning • Grow the tree fully first • All pairs of neighboring leaf nodes are considered for elimination • If the elimination yields a satisfactory (small) increase in impurity, the common antecedent node is declared a leaf • Also called merging or joining

  35. Rule Pruning • Each leaf has an associated rule • Some of the rules can be simplified if a series of decisions is redundant • Can improve generalization and interpretability • Allows us to distinguish between contexts in which the node is used

  36.–39. Example 1: A Simple Tree (figure slides)

  40. Computation Complexity • Training • Root node • Sorting: O(dn log n) • Entropy computation: O(n)+(n-1)O(d) • Total: O(dn log n) • Level 1 node • Average case: O(dn log (n/2)) • Total number of levels: O(log n) • Total average complexity: O(dn (log n)2) • Recall and classification • O(log n)

  41.–42. Feature Choice (figure slides)

  43. Multivariate Decision Trees

  44. Multivariate Decision Trees Using General Linear Decisions

  45. Priors and Costs • Priors • Weight samples to correct for the prior frequencies • Costs • Cost matrix λij • Incorporate costs into the impurity measure (e.g., as below)
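
The slide's example equation is missing; one standard way to fold the cost matrix into the impurity is a cost-weighted Gini form,

  i(N) = \sum_{i,j} \lambda_{ij}\, P(\omega_i)\, P(\omega_j),

so that confusions with higher cost \lambda_{ij} contribute more to the impurity at node N.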

  46. Training and Classification with Deficient Patterns • Training • Proceed as usual • Calculate impurities at a node using only the attribute information present • Classification • Use traditional (“primary”) decision whenever possible • Use surrogate splits when test pattern is missing some features • Or use virtual values

  47.–48. Example 2: Surrogate Splits and Missing Attributes (figure slides)

  49. Algorithm ID3 • Interactive dichotomizer • For use with nominal (unordered) inputs only • Real-valued variables are handled by bins • Gain ratio impurity is used • Continues until all nodes are pure or there are no more variables • Pruning can be incorporated

  50. Algorithm C4.5 • Successor and refinement of ID3 • Real-valued variables are treated as in CART • Gain ratio impurity is used • Use pruning based on statistical significance of splits
