Explore the concept of decision trees through practical examples in data analysis, classification, and pattern recognition. Understand terminology, technical issues, and the iterative process of growing and evaluating decision trees.
Decision Tree • Developed by: Dr Eddie Ip • Modified by: Dr Arif Ansari
Outline • Example • Concept
Decision Tree: Example 1 • Life insurance: should the preferred rate be given? • Identify the low-risk group and give it the preferred rate • Criteria: smoking? overweight?
Decision Tree: Example 2 • Database of loan applications • Variable of interest = loan approved / not approved (binary) • Predictors = age, gender, income group, home ownership, … • Similar application in direct mail: to whom should I send mail?
Decision Tree: Concept • The classification of the customers already in the database is known • Use this historical data (“learn”) to guide future decisions (classify new customers)
Steps • Build a model by “learning” from past data (learning/training set) • Tune model by using data not seen by model (testing set) • Evaluate accuracy of decision tree model by yet another new data set (evaluation set) • Use tree to classify new customers
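The three data sets named in the steps above can be sketched as a simple random partition of the historical records. This is only a minimal illustration: the function name `three_way_split`, the 60/20/20 proportions, and the fixed seed are assumptions, not something the slides specify.

```python
import random

def three_way_split(records, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle records and cut them into training, testing, and evaluation sets.

    The fractions are illustrative assumptions; whatever is left after the
    training and testing sets becomes the evaluation set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]
    return train, test, evaluation

# Hypothetical database of 100 record IDs.
records = list(range(100))
train, test, evaluation = three_way_split(records)
print(len(train), len(test), len(evaluation))
```

The tree is grown on `train`, tuned (stopped or pruned) with `test`, and its accuracy is reported on `evaluation`, which took no part in building or tuning.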
Decision Tree: Concept • Pattern recognition tool • Used in: • recognizing handwriting • recognizing chemicals • recognizing ships at sea
Decision Tree: Example • Vermont Country Store – student presentation
Decision Tree: terminology • Variable of interest: response / target (Y) • Other variables: predictors (X) • Loan example: Y = loan approved / not approved; X = age, gender, income group, home ownership, … • Training set = records from the database
Decision Tree: terminology • node (root & leaf) • child (left & right)
Decision Tree: Concept • tree creates a set of bins into which records are tossed • start with root node (all records) • get best split so as to produce 2 homogeneous groups
Decision Tree: Concept • Keep splitting down each branch until a tree is formed • Stopping criteria: a statistical test, or grow the full tree and then prune it
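Decision Tree: Technical issues • Example: loan application • Need a measure of homogeneity / diversity • e.g. Gini index, p(1-p), where p is the proportion of records in a node that belong to one class of a binary target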
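As a rough sketch of how a diversity measure drives the choice of split, the code below computes the Gini index of a node and tries every threshold on one numeric predictor. The income values, the labels, and the helper names (`gini`, `split_gini`, `best_threshold`) are hypothetical. The Gini index is written here in its common multi-class form, 1 minus the sum of squared class proportions, which for two classes equals 2p(1-p) and so is proportional to the p(1-p) quoted above.

```python
from collections import Counter

def gini(labels):
    """Gini diversity index of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted average Gini index of the two child nodes."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

def best_threshold(values, labels):
    """Try each midpoint between adjacent distinct values.

    Returns (threshold, weighted Gini) for the split with the lowest
    weighted Gini, or None if the predictor has only one distinct value.
    """
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = split_gini(left, right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical loan records: income (in thousands) and approval outcome.
income = [20, 25, 30, 45, 60, 80]
approved = ["no", "no", "no", "yes", "yes", "yes"]

print("Gini of root node:", gini(approved))
print("Best split (threshold, weighted Gini):", best_threshold(income, approved))
```

A weighted Gini of zero means both child nodes are pure; in real data the best available split usually only lowers the index rather than eliminating it.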
Decision Tree: concept • Grow tree = continue splitting until no further split reduces diversity • Two philosophies: apply a stopping rule, or grow the full tree and then prune • A testing set may be required to decide when to stop growing or how far to prune • In the final decision tree, each terminal node is assigned a class
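The growing procedure described above can be sketched as a short recursive function. This is an illustrative toy under stated assumptions, not the algorithm of any particular product: it uses the Gini index as the diversity measure, handles numeric predictors only, stops when no split lowers the weighted Gini, does no pruning, and gives each terminal node the majority class. The records (age, income in thousands) and all function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini diversity index: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (weighted Gini, predictor index, threshold) of the best split, or None."""
    best = None
    for j in range(len(rows[0])):
        distinct = sorted({row[j] for row in rows})
        for lo, hi in zip(distinct, distinct[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def grow(rows, labels):
    """Keep splitting until no split reduces diversity; leaves get the majority class."""
    split = best_split(rows, labels)
    if split is None or split[0] >= gini(labels):
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": grow([r for r, _ in left], [y for _, y in left]),
        "right": grow([r for r, _ in right], [y for _, y in right]),
    }

def classify(tree, row):
    """Drop a record down the tree and return the class of the leaf it lands in."""
    while "leaf" not in tree:
        go_left = row[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["leaf"]

# Hypothetical training set: (age, income in thousands) -> loan approved?
rows = [(25, 20), (40, 25), (35, 30), (50, 45), (30, 60), (45, 80), (22, 70), (24, 55)]
labels = ["no", "no", "no", "yes", "yes", "yes", "no", "no"]

tree = grow(rows, labels)
print(classify(tree, (48, 50)))  # classify a new, unseen applicant
```

Because this sketch grows until nothing more can be gained on the training set, it follows the "grow the full tree" philosophy; a pruning step guided by the testing set would normally come next.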
Decision Tree: concept • NOTE: a terminal (leaf) node is not necessarily pure • Misclassification (error) rate = % of records incorrectly classified by the tree
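The misclassification rate can be worked through on a small made-up case. The two leaves and their record counts below are hypothetical; they only show how impure leaves turn into errors once each leaf is assigned a single class.

```python
def misclassification_rate(actual, predicted):
    """Fraction of records whose predicted class differs from the actual class."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    errors = sum(a != p for a, p in zip(actual, predicted))
    return errors / len(actual)

# Hypothetical leaf A: assigned "approved", holds 8 approved and 2 not approved.
leaf_a_actual = ["approved"] * 8 + ["not approved"] * 2
leaf_a_predicted = ["approved"] * 10

# Hypothetical leaf B: assigned "not approved", holds 1 approved and 9 not approved.
leaf_b_actual = ["approved"] * 1 + ["not approved"] * 9
leaf_b_predicted = ["not approved"] * 10

actual = leaf_a_actual + leaf_b_actual
predicted = leaf_a_predicted + leaf_b_predicted

rate = misclassification_rate(actual, predicted)
print(f"{rate:.0%} of records misclassified")
```

Here 3 of the 20 records sit in a leaf whose class does not match their own, so they are the tree's errors.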
Decision Tree: concept • Misclassification rate (sample) • Bigger tree ⇒ lower misclassification rate on the training set • Big tree ⇒ overfit = “getting too close to the data” • The misclassification rate on the training set is overoptimistic • More objective: the misclassification rate on the evaluation set
Decision Tree: products • CART = Classification and Regression Trees (statistics) • C4.5 / C5.0 (machine learning) • CHAID (statistics)
Decision Tree: summary • Method for classification & prediction • Iterative splitting • Training / testing to obtain an optimal tree • Concept of overfit