
Temple University – CIS Dept. CIS664– Knowledge Discovery and Data Mining



  1. Temple University – CIS Dept. CIS664 – Knowledge Discovery and Data Mining V. Megalooikonomou Introduction to Decision Trees (based on notes by Jiawei Han and Micheline Kamber and on notes by Christos Faloutsos)

  2. More details… Decision Trees • Decision trees • Problem • Approach • Classification through trees • Building phase - splitting policies • Pruning phase (to avoid over-fitting)

  3. Decision Trees • Problem: Classification – i.e., given a training set (N tuples, with M attributes, plus a label attribute), find rules to predict the label for newcomers. Pictorially:

  4. Decision trees ??

  5. Decision trees • Issues: • missing values • noise • ‘rare’ events

  6. Decision trees • types of attributes • numerical (= continuous) - e.g.: ‘salary’ • ordinal (= integer) - e.g.: ‘# of children’ • nominal (= categorical) - e.g.: ‘car-type’

  7. Decision trees • Pictorially, we have: [scatter plot: num. attr#1 (e.g., ‘age’) on the x-axis, num. attr#2 (e.g., chol-level) on the y-axis, points labeled + and -]

  8. Decision trees • and we want to label ‘?’ [same scatter plot, now with an unlabeled point ‘?’]

  9. Decision trees • so we build a decision tree: [same scatter plot, partitioned by split lines at age = 50 and chol-level = 40]

  10. Decision trees • so we build a decision tree: [tree: root tests age < 50; on Y, test chol < 40 (Y → +, N → -); on N, subtree elided (...)]
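The tree on this slide reads directly as nested conditionals. A minimal Python sketch using the slide's example thresholds; the age ≥ 50 subtree is elided on the slide ("..."), so it is left undefined here too:

```python
def classify(age, chol):
    """Apply the slide's example tree: test age < 50, then chol < 40."""
    if age < 50:
        if chol < 40:
            return "+"
        return "-"
    # the slide elides the age >= 50 subtree ("...")
    raise NotImplementedError("subtree not shown on the slide")
```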

  11. Decision trees • Typically, two steps: • tree building • tree pruning (for over-training/over-fitting)

  12. Tree building • How? [same scatter plot: num. attr#1 (e.g., ‘age’) vs num. attr#2 (e.g., chol-level)]

  13. Tree building • How? • A: Partition, recursively - pseudocode:

      Partition(dataset S):
          if all points in S have the same label then return
          evaluate splits along each attribute A
          pick the best split, to divide S into S1 and S2
          Partition(S1); Partition(S2)

  • Q1: how to introduce splits along attribute Ai • Q2: how to evaluate a split?
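The pseudocode above can be fleshed out into a runnable sketch. This version brute-forces binary splits on numeric attributes and scores them by misclassification count (entropy or gini would slot in the same way); the data layout (a list of tuples plus a parallel label list) is an assumption, not from the slides:

```python
def misclass(labels):
    """Misclassification count if we predict the majority label of the subset."""
    if not labels:
        return 0
    return len(labels) - max(labels.count(c) for c in set(labels))

def partition(points, labels):
    """Sketch of the slide's Partition pseudocode (brute-force binary splits)."""
    if len(set(labels)) == 1:            # all points in S have the same label
        return {"leaf": labels[0]}
    best = None
    for a in range(len(points[0])):      # evaluate splits along each attribute A
        for thr in sorted({p[a] for p in points}):
            left = [i for i, p in enumerate(points) if p[a] < thr]
            right = [i for i, p in enumerate(points) if p[a] >= thr]
            if not left or not right:    # split must divide S into S1 and S2
                continue
            cost = (misclass([labels[i] for i in left])
                    + misclass([labels[i] for i in right]))
            if best is None or cost < best[0]:
                best = (cost, a, thr, left, right)
    if best is None:                     # identical points, mixed labels
        return {"leaf": max(set(labels), key=labels.count)}
    _, a, thr, left, right = best
    return {"attr": a, "thr": thr,
            "lt": partition([points[i] for i in left], [labels[i] for i in left]),
            "ge": partition([points[i] for i in right], [labels[i] for i in right])}
```

Each recursive call sees a strictly smaller, non-empty subset, so the recursion terminates; duplicate points with conflicting labels fall back to a majority-vote leaf.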

  14. Tree building • Q1: how to introduce splits along attribute Ai • A1: • for numerical attributes: • binary split, or • multiple split • for categorical attributes: • compute all subsets (expensive!), or • use a greedy approach
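For the categorical case, "compute all subsets (expensive!)" means enumerating every two-way partition of the attribute's values. A small sketch (the car-type values in the test are hypothetical); for k distinct values it yields 2^k - 2 ordered splits, each unordered partition appearing twice:

```python
from itertools import combinations

def subset_splits(values):
    """All binary splits of a categorical attribute (exponential in #values)."""
    vals = sorted(set(values))
    splits = []
    for r in range(1, len(vals)):        # size of the left subset
        for left in combinations(vals, r):
            splits.append((set(left), set(vals) - set(left)))
    return splits
```

Already at three distinct values this gives six candidate splits, which is why real systems fall back to the greedy approach the slide mentions.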

  15. Tree building • Q2: how to evaluate a split? • A: by how close to uniform each subset is - i.e., we need a measure of uniformity:

  16. Tree building • entropy: H(p+, p-), where p+ is the relative frequency of class + in S [plot: H(p+, p-) over p+ ∈ [0, 1]; 0 at p+ = 0 and p+ = 1, maximal (1) at p+ = 0.5] • Any other measure?

  17. Tree building • ‘gini’ index: 1 - p+^2 - p-^2 • entropy: H(p+, p-), p+: relative frequency of class + in S [plot: both measures over p+ ∈ [0, 1]; both 0 at the pure ends, maximal at p+ = 0.5]

  18. Tree building Intuition: • entropy: #bits to encode the class label • gini: classification error, if we randomly guess ‘+’ with prob. p+
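Both measures from the previous slides are one-liners; a minimal sketch, with p+ given as a fraction:

```python
from math import log2

def entropy(p_plus):
    """H(p+, p-) in bits: 0 for a pure subset, maximal (1 bit) at p+ = 0.5."""
    h = 0.0
    for p in (p_plus, 1 - p_plus):
        if p > 0:                        # 0 * log2(0) is taken as 0
            h -= p * log2(p)
    return h

def gini(p_plus):
    """Gini index 1 - p+^2 - p-^2: also 0 at the pure ends, maximal (0.5) at 0.5."""
    return 1 - p_plus ** 2 - (1 - p_plus) ** 2
```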

  19. Tree building • Thus, we choose the split that reduces entropy/classification-error the most. E.g.: [scatter plot: num. attr#1 (e.g., ‘age’) vs num. attr#2 (e.g., chol-level), with a candidate split drawn]

  20. Tree building • Before the split: we need (n+ + n-) * H(p+, p-) = (7 + 6) * H(7/13, 6/13) bits in total to encode all the class labels • After the split we need: 0 bits for the first half (it is pure) and (2 + 6) * H(2/8, 6/8) bits for the second half
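The slide's arithmetic can be checked directly; a self-contained sketch:

```python
from math import log2

def H(p):
    """Binary entropy H(p, 1 - p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

bits_before = (7 + 6) * H(7 / 13)      # all 13 labels encoded together
bits_after = 0 + (2 + 6) * H(2 / 8)    # the pure half costs 0 bits
```

This works out to about 12.94 bits before the split versus about 6.49 after, so the split saves roughly 6.5 bits of description length.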

  21. Tree pruning • What for? [scatter plot: num. attr#1 (e.g., ‘age’) vs num. attr#2 (e.g., chol-level), partitioned by many small splits (...)]

  22. Tree pruning • Q: How to do it? [same over-partitioned scatter plot as the previous slide]

  23. Tree pruning • Q: How to do it? • A1: use a ‘training’ and a ‘testing’ set - prune nodes if pruning them improves classification on the ‘testing’ set. (Drawbacks?)

  24. Tree pruning • Q: How to do it? • A1: use a ‘training’ and a ‘testing’ set - prune nodes if pruning them improves classification on the ‘testing’ set. (Drawbacks?) • A2: or, rely on MDL (= Minimum Description Length) - in detail:

  25. Tree pruning • envision the problem as compression • and try to minimize the # bits to compress (a) the class labels AND (b) the representation of the decision tree

  26. (MDL) • a brilliant idea - e.g.: the best degree-n polynomial to compress these points is • the one that minimizes (sum of errors + n)

  27. Major Issues in Data Mining • Issues relating to the diversity of data types • Handling relational as well as complex types of data • Mining information from heterogeneous databases and global information systems (WWW) • Issues related to applications and social impacts • Application of discovered knowledge • Domain-specific data mining tools • Intelligent query answering • Process control and decision making • Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem • Protection of data security, integrity, and privacy

  28. Summary • Data mining: discovering interesting patterns from large amounts of data • A natural evolution of database technology, in great demand, with wide applications • A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation • Mining can be performed in a variety of information repositories • Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. • Classification of data mining systems • Major issues in data mining
