
Data Mining Decision Trees


Presentation Transcript


  1. Data Mining: Decision Trees. Last updated 8/21/19.

  2. Example of a Decision Tree. [Figure: training data with attributes Refund (categorical), MarSt (categorical), TaxInc (continuous), and the class label, alongside the learned model.] Splitting attributes: Refund = Yes → NO; Refund = No → test MarSt; MarSt = Married → NO; MarSt = Single or Divorced → test TaxInc; TaxInc < 80K → NO; TaxInc > 80K → YES.

  3. Another Decision Tree Example. [Figure: the same training data fit by a different tree: MarSt = Married → NO; MarSt = Single or Divorced → test Refund; Refund = Yes → NO; Refund = No → test TaxInc; TaxInc < 80K → NO; TaxInc > 80K → YES.] More than one tree may perfectly fit the data.

  4. Decision Tree Classification Task. [Figure: the general classification workflow: a learning algorithm induces a decision tree from the training set, and the induced model is then applied to the test set.]

  5. Apply Model to Test Data. Start from the root of the tree. [Figure: a test record is matched against the tree from slide 2, beginning at the Refund node.]

  6. Apply Model to Test Data. [Figure: the Refund test is evaluated and the record follows the matching branch.]

  7. Apply Model to Test Data. [Figure: the record moves one node further down the tree.]

  8. Apply Model to Test Data. [Figure: the record reaches the MarSt node.]

  9. Apply Model to Test Data. [Figure: the MarSt test is evaluated and the record follows the matching branch.]

  10. Apply Model to Test Data. [Figure: the record reaches the leaf under MarSt = Married.] Assign Cheat to "No".

  11. Decision Tree Terminology

  12. Decision Tree Induction • Many Algorithms: • Hunt’s Algorithm (one of the earliest) • CART • ID3, C4.5 • John Ross Quinlan is a computer science researcher in data mining and decision theory. He has contributed extensively to the development of decision tree algorithms, including inventing the canonical ID3 and C4.5 algorithms. 

  13. Decision Tree Classifier. [Figure: a scatter plot of insects with Antenna Length on the vertical axis and Abdomen Length on the horizontal axis (both 1 to 10), partitioned by the tree.] Abdomen Length > 7.1? yes → Katydid; no → Antenna Length > 6.0? yes → Katydid; no → Grasshopper.

  14. Decision trees predate computers. [Figure: a classical insect identification key using tests such as "Antennae shorter than body?", "3 tarsi?", and "Foretibia has ears?" to separate Grasshopper, Cricket, Katydid, and Camel Cricket.]

  15. Definition • A decision tree is a classifier in the form of a tree structure • Decision node: specifies a test on a single attribute • Leaf node: indicates the value of the target attribute • Arc/edge: one branch of a split on an attribute • Path: a conjunction of tests that leads to the final decision • Decision trees classify instances or examples by starting at the root of the tree and moving down it until a leaf node is reached.

  16. Decision Tree Classification • Decision tree generation consists of two phases • Tree construction • At start, all the training examples are at the root • Partition examples recursively based on selected attributes • This can also be called supervised segmentation • This emphasizes that we are segmenting the instance space • Tree pruning • Identify and remove branches that reflect noise or outliers

  17. Decision Tree Representation • Each internal node tests an attribute • Each branch corresponds to an attribute value • Each leaf node assigns a classification [Figure: the classic "play tennis" tree. Outlook = sunny → test Humidity (high → no, normal → yes); Outlook = overcast → yes; Outlook = rain → test Wind (strong → no, weak → yes).]

  18. How do we Construct a Decision Tree? • Basic algorithm (a greedy algorithm) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all the training examples are at the root • Examples are partitioned recursively based on selected attributes. • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., info. gain) • Why do we call this a greedy algorithm? • Because it makes locally optimal decisions (at each node).
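The slides give no code, but a minimal Python sketch of this greedy, top-down, divide-and-conquer procedure might look like the following. The dataset format (a list of (feature-dict, label) pairs), the entropy helper, and the tiny example dataset are all illustrative assumptions, not the slides' own material.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def build_tree(examples, attributes):
    """examples: list of (feature_dict, label) pairs; attributes: names still available."""
    labels = [label for _, label in examples]
    # Stop when the node is pure or no attributes remain; predict the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    def weighted_child_entropy(attr):
        # Entropy after splitting on attr: weighted average over the child subsets.
        groups = {}
        for feats, label in examples:
            groups.setdefault(feats[attr], []).append(label)
        n = len(examples)
        return sum(len(g) / n * entropy(g) for g in groups.values())

    # Greedy (locally optimal) choice: the attribute with the lowest weighted entropy.
    best = min(attributes, key=weighted_child_entropy)

    # Divide and conquer: recurse on each branch with the chosen attribute removed.
    branches = {}
    for value in {feats[best] for feats, _ in examples}:
        subset = [(f, l) for f, l in examples if f[best] == value]
        branches[value] = build_tree(subset, [a for a in attributes if a != best])
    return (best, branches)

# Hypothetical mini-dataset loosely patterned on slide 2, just to show the call.
data = [({"Refund": "Yes", "MarSt": "Single"}, "No"),
        ({"Refund": "No",  "MarSt": "Married"}, "No"),
        ({"Refund": "No",  "MarSt": "Single"}, "Yes")]
print(build_tree(data, ["Refund", "MarSt"]))
```

The greedy step is the call to min(...): the attribute is chosen only by how pure it makes the immediate children, with no look-ahead, which is exactly why the algorithm is called greedy.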

  19. When Do we Stop Partitioning? • All samples for a node belong to same class • No remaining attributes • majority voting used to assign class • No samples left

  20. How to Pick Locally Optimal Split • Hunt’s algorithm: recursively partition training records into successively purer subsets. • How to measure purity/impurity? • Entropy and associated information gain • Gini • Classification error rate • Never used in practice but good for understanding and simple exercises

  21. How to Determine Best Split Before Splitting: 10 records of class 0, 10 records of class 1 Which test condition is the best? Why is student id a bad feature to use?

  22. How to Determine Best Split • Greedy approach: nodes with a homogeneous class distribution are preferred • Need a measure of node impurity [Figure: a non-homogeneous node with a high degree of impurity vs. a homogeneous node with a low degree of impurity.]

  23. Information Theory • Think of playing "20 questions": I am thinking of an integer between 1 and 1,000. What is the first question you would ask, and why? • Entropy measures how much more information you need before you can identify the integer. • Initially there are 1,000 possible values, which we assume are equally likely. • What is the maximum number of questions you need to ask? (With questions that halve the remaining range each time, log2(1000) ≈ 9.97, so at most 10.)

  24. Entropy • Entropy (disorder, impurity) of a set of examples S, relative to a binary classification, is: Entropy(S) = – p1 log2(p1) – p0 log2(p0), where p1 is the fraction of positive examples in S and p0 is the fraction of negatives. • If all examples are in one category, entropy is zero (we define 0 log(0) = 0) • If examples are equally mixed (p1 = p0 = 0.5), entropy is a maximum of 1. • For multi-class problems with c categories, entropy generalizes to: Entropy(S) = – Σ(i=1..c) pi log2(pi).
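A small Python helper along these lines (an illustrative sketch, taking a list of class counts and treating 0 log2(0) as 0) reproduces the values discussed on the surrounding slides:

```python
from math import log2

def entropy(counts):
    """Entropy of a node, given its class counts (empty classes contribute 0)."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([3, 3]))  # equally mixed binary node -> 1.0
print(entropy([6, 0]))  # pure node -> 0.0
print(entropy([1, 5]))  # -> ~0.65, matching the worked example on a later slide
```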

  25. Entropy for Binary Classification • The entropy is 0 if the outcome is certain. • The entropy is maximum (1) if we have no knowledge of the system, i.e., any outcome is equally likely. [Figure: entropy of a 2-class problem plotted against the proportion of examples in one of the two classes.]

  26. Information Gain in Decision Tree Induction • Information gain is the expected reduction in entropy caused by partitioning the examples according to an attribute. • Assume that using attribute A, the current set S is partitioned into child sets S1, ..., Sk (one per outcome of the test on A). • The information gained by branching on A is: Gain(S, A) = Entropy(S) – Σi (|Si| / |S|) Entropy(Si). • Note that each child's entropy is weighted by the fraction of the total examples that fall into that child; the same weighting applies to Gini and error rate.
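A sketch of the same computation in Python, assuming the parent and child nodes are described by lists of class counts. The entropy helper is repeated so the snippet stands alone; the (10, 10) parent echoes slide 21, while the two candidate splits are purely hypothetical.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Expected reduction in entropy from splitting the parent into the given children.

    Each child's entropy is weighted by the fraction of the parent's examples it holds.
    """
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Hypothetical splits of a (10, 10) parent:
print(information_gain([10, 10], [[9, 1], [1, 9]]))  # ~0.53: a fairly pure split
print(information_gain([10, 10], [[5, 5], [5, 5]]))  # 0.0: the split tells us nothing
```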

  27. Examples for Computing Entropy • NOTE: p(j | t) is computed as the relative frequency of class j at node t • P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Entropy = – 0 log2(0) – 1 log2(1) = – 0 – 0 = 0 • P(C1) = 1/6, P(C2) = 5/6: Entropy = – (1/6) log2(1/6) – (5/6) log2(5/6) = 0.65 • P(C1) = 2/6, P(C2) = 4/6: Entropy = – (2/6) log2(2/6) – (4/6) log2(4/6) = 0.92 • P(C1) = 3/6 = 1/2, P(C2) = 3/6 = 1/2: Entropy = – (1/2) log2(1/2) – (1/2) log2(1/2) = 1/2 + 1/2 = 1

  28. How to Calculate log2(x) • Many calculators only have buttons for log10(x) and loge(x) ("log" typically means log10) • You can calculate the log for any base b as follows: logb(x) = logk(x) / logk(b) • Thus log2(x) = log10(x) / log10(2) • Since log10(2) ≈ 0.301, just calculate the log base 10 and divide by 0.301 to get log base 2. • You can use this for HW if needed
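For completeness, the same change-of-base trick in Python (where a built-in log2 is also available):

```python
import math

x = 6.0
print(math.log10(x) / math.log10(2))  # change of base: log2(6) ~ 2.585
print(math.log2(x))                   # the built-in log2 agrees
```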

  29. Splitting Based on INFO... • Information Gain: parent node p is split into k partitions; ni is the number of records in partition i: GAINsplit = Entropy(p) – Σ(i=1..k) (ni / n) Entropy(i) • Uses a weighted average of the child nodes, where each weight is based on the number of examples in that child • The attribute split that yields the lowest weighted entropy (i.e., the highest gain) is best • See the example on page 130 of the textbook • Used in the ID3 and C4.5 decision tree learners • WEKA's J48 is a Java version of C4.5 • Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.

  30. How to Split on Continuous Attributes? • For continuous attributes: • Partition the continuous values of attribute A into a discrete set of intervals, or • Create a new Boolean attribute Ac that is true when A < c, for some threshold c • One method is to try all possible splits. How do we choose c?

  31. Let us try splitting on Hair Length. Parent: Entropy(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911. Split on Hair Length <= 5: one child has Entropy(1F, 3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113, the other has Entropy(3F, 2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710. Gain(Hair Length <= 5) = 0.9911 – (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911

  32. Let us try splitting on Weight. Parent: Entropy(4F, 5M) = 0.9911, as before. Split on Weight <= 160: one child has Entropy(0F, 4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0, the other has Entropy(4F, 1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219. Gain(Weight <= 160) = 0.9911 – (5/9 * 0.7219 + 4/9 * 0) = 0.5900

  33. Let us try splitting on Age. Parent: Entropy(4F, 5M) = 0.9911, as before. Split on Age <= 40: one child has Entropy(3F, 3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1, the other has Entropy(1F, 2M) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183. Gain(Age <= 40) = 0.9911 – (6/9 * 1 + 3/9 * 0.9183) = 0.0183
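The three gains above can be reproduced with a few lines of Python (class counts copied from the slides and written here as [females, males]; the entropy helper is the same illustrative one used earlier):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def gain(parent, children):
    # Parent entropy minus the weighted average of the children's entropies.
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

parent = [4, 5]  # 4 females, 5 males
print(gain(parent, [[1, 3], [3, 2]]))  # Hair Length <= 5  -> ~0.0911
print(gain(parent, [[0, 4], [4, 1]]))  # Weight <= 160     -> ~0.5900
print(gain(parent, [[3, 3], [1, 2]]))  # Age <= 40         -> ~0.0183
```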

  34. Of the three features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as male), the under-160 people are not perfectly classified, so we simply recurse on that subset! This time we find that we can split the remaining examples on Hair Length (<= 2), and we are done.

  35. We don’t need to keep the data around, just the test conditions. [Figure: the final tree. Weight <= 160? no → Male; yes → Hair Length <= 2? yes → Male; no → Female.] How would new people be classified by this tree?

  36. It is trivial to convert decision trees to rules. Rules to classify males/females: If Weight is greater than 160, classify as Male; Else if Hair Length is less than or equal to 2, classify as Male; Else classify as Female. Note: we could avoid the use of "else if" by spelling out all test conditions on the path from the root to the corresponding leaf.
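Written as code, the same rules might look like the sketch below (the attribute names weight and hair_length are just illustrative labels for the two tests):

```python
def classify(weight, hair_length):
    # Rules read straight off the tree, root to leaf.
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"

print(classify(weight=180, hair_length=10))  # Male (right branch of the root)
print(classify(weight=130, hair_length=1))   # Male
print(classify(weight=130, hair_length=8))   # Female
```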

  37. Once we have learned the decision tree, we don’t even need a computer! This decision tree is attached to a medical machine, and is designed to help nurses make decisions about what type of doctor to call. [Figure caption: decision tree for a typical shared-care setting, applying the system for the diagnosis of prostatic obstructions.]

  38. The worked examples we have seen were performed on small datasets, and with small datasets there is a great danger of overfitting the data. When you have few data points, there are many possible splitting rules that perfectly classify the training data but will not generalize to future datasets. [Figure: a one-node tree, "Wears green?" → yes: Male, no: Female.] For example, the rule "Wears green?" perfectly classifies the data; so does "Mother's name is Jacqueline?"; so does "Has blue shoes?"…

  39. GINI is Another Measure of Impurity • Gini for a given node t with classes j: GINI(t) = 1 – Σj [p(j | t)]² • NOTE: p(j | t) is again computed as the relative frequency of class j at node t • Compute the best split by finding the partition that yields the lowest GINI, where we again take the weighted average of the children's GINI • For a 2-class problem, the worst GINI = 0.5 and the best GINI = 0.0
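An illustrative Python version of the Gini computation from class counts, mirroring the earlier entropy helper:

```python
def gini(counts):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([3, 3]))  # worst case for two classes -> 0.5
print(gini([6, 0]))  # pure node -> 0.0
print(gini([1, 5]))  # -> ~0.278
```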

  40. Splitting Criteria Based on Classification Error • Classification error at a node t: Error(t) = 1 – maxj p(j | t) • Measures the misclassification error made by a node • Maximum (1 – 1/nc, where nc is the number of classes) when records are equally distributed among all classes, implying the least interesting information; this is 1/2 for 2-class problems • Minimum (0.0) when all records belong to one class, implying the most interesting information

  41. Examples for Computing Error • Equivalently, predict the majority class and determine the fraction of errors • P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Error = 1 – max(0, 1) = 1 – 1 = 0 • P(C1) = 1/6, P(C2) = 5/6: Error = 1 – max(1/6, 5/6) = 1 – 5/6 = 1/6 • P(C1) = 2/6, P(C2) = 4/6: Error = 1 – max(2/6, 4/6) = 1 – 4/6 = 1/3
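The same three examples, checked in Python (classification error is simply one minus the majority-class fraction):

```python
def classification_error(counts):
    """Error rate at a node if it predicts its majority class."""
    total = sum(counts)
    return 1.0 - max(counts) / total

print(classification_error([0, 6]))  # 0.0
print(classification_error([1, 5]))  # ~0.167 (= 1/6)
print(classification_error([2, 4]))  # ~0.333 (= 1/3)
```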

  42. Complete Example Using Error Rate • The initial sample has 3 C1 and 15 C2 • Based on one 3-way split you get the three child nodes shown [in the original figure; each child holds 6 examples, and their C1/C2 counts work out to 0/6, 1/5, and 2/4] • What is the decrease in error rate? • What is the error rate initially? • What is it afterwards? • As usual you need to take the weighted average (but there is a shortcut)

  43. Error Rate Example Continued • Error rate before the split: 3/18 • Error rate after: • Shortcut: number of errors = 0 + 1 + 2 = 3, out of 18 examples, so the error rate is 3/18 • Weighted average method: 6/18 x 0 + 6/18 x 1/6 + 6/18 x 2/6, which simplifies to 1/18 + 2/18 = 3/18 • So the error rate does not decrease at all, even though the split clearly produces purer children; this foreshadows the discussion on slide 45.
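The weighted-average arithmetic can be verified with exact fractions (the per-child class counts are the ones inferred above, not given explicitly in the transcript):

```python
from fractions import Fraction

children = [(0, 6), (1, 5), (2, 4)]  # (C1, C2) counts in each child node
n = sum(a + b for a, b in children)  # 18 examples in total

# Weighted average of each child's error rate when it predicts its majority class.
after = sum(Fraction(a + b, n) * Fraction(min(a, b), a + b) for a, b in children)
before = Fraction(3, 18)             # 3 C1 among 18 examples at the parent

print(before, after)                 # 1/6 1/6 -> no reduction in error rate
```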

  44. Comparison Among Splitting Criteria • For a 2-class problem: [Figure: entropy, Gini, and misclassification error plotted against the proportion p of examples in one class; all three are 0 at p = 0 and p = 1 and reach their maximum at p = 0.5.]

  45. Discussion • Error rate is often the metric used to evaluate a classifier (but not always) • So it seems reasonable to use error rate to determine the best split; that is, why not just use a splitting metric that matches the ultimate evaluation metric? • But this is a poor choice! • The reason is related to the fact that decision trees use a greedy strategy, so we need a splitting metric that leads to globally better results; entropy and Gini reward a split that makes the children purer even when the majority class, and hence the error rate, does not change (as in the example above). • The other metrics empirically outperform error rate, although there is no proof that they always will.

  46. How to Specify Test Condition? • Depends on attribute types • Nominal • Ordinal • Continuous • Depends on number of ways to split • 2-way split • Multi-way split

  47. Splitting Based on Nominal Attributes • Multi-way split: use as many partitions as there are distinct values (e.g., CarType → Family, Sports, Luxury) • Binary split: divides the values into two subsets; need to find the optimal partitioning (e.g., {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports})
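As a small illustrative sketch (not from the slides), the candidate binary partitions of a nominal attribute can be enumerated directly; for k distinct values there are 2^(k-1) - 1 of them:

```python
from itertools import combinations

values = ["Family", "Sports", "Luxury"]  # the CarType values from the slide
rest = values[1:]

# Fix the first value on the left side so each two-subset partition is listed once;
# for k = 3 values this enumerates all 2**(3-1) - 1 = 3 candidate binary splits.
for r in range(0, len(rest)):
    for extra in combinations(rest, r):
        left = {values[0], *extra}
        right = set(values) - left
        print(left, "vs.", right)
```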

  48. Splitting Based on Ordinal Attributes • Multi-way split: use as many partitions as there are distinct values (e.g., Size → Small, Medium, Large) • Binary split: divides the values into two subsets that respect the order; need to find the optimal partitioning (e.g., {Small, Medium} vs. {Large}, or {Small} vs. {Medium, Large}) • What about the split {Small, Large} vs. {Medium}? (It violates the ordering.)

  49. Splitting Based on Continuous Attributes • Different ways of handling: • Discretization to form an ordinal categorical attribute • Static: discretize once at the beginning • Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering • Binary decision: (A < v) or (A ≥ v) • Consider all possible splits and find the best cut • Can be more compute intensive
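A minimal sketch of the binary-decision approach in Python: evaluate each candidate cut (midpoints between consecutive distinct sorted values, one common convention assumed here) and keep the one with the lowest weighted entropy. The toy data at the end is purely illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return (threshold v, weighted entropy) of the best split A < v vs. A >= v."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between identical values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2        # midpoint candidate
        left = [l for a, l in pairs if a < v]
        right = [l for a, l in pairs if a >= v]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if w < best[1]:
            best = (v, w)
    return best

# Toy data: the class changes once the value passes 5, so the best cut lands at 4.5.
print(best_cut([1, 2, 3, 6, 7, 8], ["no", "no", "no", "yes", "yes", "yes"]))  # (4.5, 0.0)
```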
