Decision Trees and an Introduction to Classification

Decision Trees and an Introduction to Classification

Classification: Definition • Given a collection of records (training set ) • Each record contains a set of attributes, one of the attributes is the class. • Find a model for class attribute as a function of the values of other attributes. • Goal: previously unseen records should be assigned a class as accurately as possible • Decision trees evaluated on test set

Illustrating Classification Task

Classifier Evaluation • Model Performance is evaluated on a test set • The test set has the class labels filled in, just like the training set • The test set and training set normally created by splitting the original data • This reduces the amount of data for training, which is bad, but this is the best way to get a good measure of the classifier’s performance • Assumption is that future data will adhere to the distribution in the test set, which is not always true

Examples of Classification Task • Predicting tumor cells as benign or malignant • Classifying credit card transactions as legitimate or fraudulent • Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil • Categorizing news stories as finance, weather, entertainment, sports, etc

Introduction to Decision Trees • Decision Trees • Powerful/popular for classification & prediction • It is similar to a flowchart • Represents sets of non-overlapping rules • Rules can be expressed in English • IF Age <=43 & Sex = Male & Credit Card Insurance = NoTHEN Life Insurance Promotion = No • Useful to explore data to gain insight into relationships of a large number of features (input variables) to a target (output) variable • You use mental decision trees often! • Game: “20 Questions”

Introduction to Decision Trees • A structure that can be used to divide up a large collection of records into successively smaller sets of records by applying a sequence of simple tests • A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to a particular target variable • Decision trees partition the “space” of all possible examples and assigns a class label to each partition

Classification Techniques • Decision Tree based Methods • Rule-based Methods • Memory based reasoning • Neural Networks • Naïve Bayes and Bayesian Belief Networks • Support Vector Machines

categorical categorical continuous class Example of a Decision Tree Splitting Attributes Refund Yes No NO MarSt Married Single, Divorced TaxInc NO < 80K > 80K YES NO Model: Decision Tree Training Data

NO Another Example of Decision Tree categorical categorical continuous class Single, Divorced MarSt Married NO Refund No Yes TaxInc < 80K > 80K YES NO There could be more than one tree that fits the same data!

Decision Tree Classification Task Decision Tree

Refund Yes No NO MarSt Married Single, Divorced TaxInc NO < 80K > 80K YES NO Apply Model to Test Data Test Data Start from the root of tree.

Refund Yes No NO MarSt Married Single, Divorced TaxInc NO < 80K > 80K YES NO Apply Model to Test Data Test Data

Apply Model to Test Data Test Data Refund Yes No NO MarSt Married Single, Divorced TaxInc NO < 80K > 80K YES NO

Apply Model to Test Data Test Data Refund Yes No NO MarSt Assign Cheat to “No” Married Single, Divorced TaxInc NO < 80K > 80K YES NO

Decision Tree Classification Task Decision Tree

Decision Tree Induction • Many Algorithms: • Hunt’s Algorithm (one of the earliest) • CART (Classification and Regression Trees) • ID3, C4.5 • We have easy access to several: • C4.5 (freely available and I have a copy on storm) • C5.0 (commercial version of C4.5 on storm) • WEKA free suite has J48 a java reimplementation of C4.5

How Would you Design a DT Algorithm? • Can you define the algorithm recursively? • First, do you know what recursion is? • Yes, at each invocation you only need to build one more level of the tree • What decisions to you need to make at each call? • What feature f to split on • How to partition the examples based on f • for numerical features, what number to split on • for categorical features, what values with each split • When to stop • How might you make these decisions?

General Structure of Hunt’s Algorithm • Let Dt be the set of training records that reach a node t • General Procedure: • If Dt contains records that only belong to the same class yt, then t is a leaf node labeled as yt • If Dt is an empty set, then t is a leaf node labeled by the default class, yd • If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset. Dt ?

Refund Refund Yes No Yes No Don’t Cheat Marital Status Don’t Cheat Marital Status Single, Divorced Refund Married Married Single, Divorced Yes No Don’t Cheat Taxable Income Cheat Don’t Cheat Don’t Cheat Don’t Cheat < 80K >= 80K Don’t Cheat Cheat Hunt’s Algorithm Don’t Cheat

Tree Induction • Greedy strategy • Split the records based on an attribute test that optimizes certain criterion • Greedy because not globally optimal • In context of decision trees, no extra look ahead • What are the costs and benefits of looking ahead? • Issues • Determine how to split the records • How to select the attribute to split? How would you do it? • How to determine the best split? How would you do it? • Determine when to stop splitting • Why not go until you can’t go any further?

How to Specify Test Condition? • Depends on attribute types • Nominal/Categorical (limited number of predefined values) • Ordinal (like nominal but there is an ordering) • Continuous (ordering and many values: numbers) • Depends on number of ways to split • 2-way split • Multi-way split

CarType Family Luxury Sports CarType CarType {Sports, Luxury} {Family, Luxury} {Family} {Sports} Splitting Based on Nominal Attributes • Multi-way split: Use as many partitions as distinct values. • Binary split: Divides values into two subsets. Need to find optimal partitioning. OR

Size Small Large Medium Size Size Size {Small, Medium} {Small, Large} {Medium, Large} {Medium} {Large} {Small} Splitting Based on Ordinal Attributes • Multi-way split: Use as many partitions as distinct values. • Binary split: Divides values into two subsets. Need to find optimal partitioning. • What about this split? OR

Splitting Based on Continuous Attributes • Different ways of handling • Discretization to form ordinal categorical attribute • Static – discretize once at the beginning • Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering. • Binary Decision: (A < v) or (A  v) • consider all possible splits and finds the best cut • can be more compute intensive

Splitting Based on Continuous Attributes

Tree Induction • Greedy strategy. • Split the records based on an attribute test that optimizes certain criterion. • Issues • Determine how to split the records • How to specify the attribute test condition? • How to determine the best split? • Determine when to stop splitting

How to Determine the Best Split Before Splitting: 10 records of class 0, 10 records of class 1 Which test condition is the best? Why is student id a bad feature to use?

How to Determine the Best Split • Greedy approach: • Nodes with homogeneous class distribution are preferred • Need a measure of node impurity: Non-homogeneous, High degree of impurity Homogeneous, Low degree of impurity

Measures of Node Impurity • Gini Index • Entropy • Misclassification error

M0 M2 M3 M4 M1 M12 M34 How to Find the Best Split Before Splitting: A? B? Yes No Yes No Node N1 Node N2 Node N3 Node N4 Gain = M0 – M12 vs M0 – M34

Measure of Impurity: GINI (at node t) • Gini Index for a given node t with classes j NOTE: the conditional probability p( j | t) is computed as the relative frequency of class j at node t • Example: Two classes C1 & C2 and node t has 5 C1 and 5 C2 examples. Compute Gini(t) • 1 – [p(C1|t) + p(C2|t)] = 1 – [(5/10)2 + [(5/10)2 ] • 1 – [¼ + ¼] = ½. • Do you think this Gini value indicates a good split or bad split? Is it an extreme value?

More on Gini • Worst Gini corresponds to probabilities of 1/nc, where nc is the number of classes. • For 2-class problems the worst Gini will be ½ • How do we get the best Gini? Come up with an example for node t with 10 examples for classes C1 and C2 • 10 C1 and 0 C2 • Now what is the Gini? • 1 – [(10/10)2 + (0/10)2 = 1 – [1 + 0] = 0 • So 0 is the best Gini • So for 2-class problems: • Gini varies from 0 (best) to ½ (worst).

Some More Examples • Below we see the Gini values for 4 nodes with different distributions. They are ordered from best to worst. See next slide for details • Note that thus far we are only computing GINI for one node. We need to compute it for a split and then compute the change in Gini from the parent node.

Examples for computing GINI P(C1) = 0/6 = 0 P(C2) = 6/6 = 1 Gini = 1 – P(C1)2 – P(C2)2 = 1 – 0 – 1 = 0 P(C1) = 1/6 P(C2) = 5/6 Gini = 1 – (1/6)2 – (5/6)2 = 0.278 P(C1) = 2/6 P(C2) = 4/6 Gini = 1 – (2/6)2 – (4/6)2 = 0.444

Splitting Based on GINI • When a node p is split into k partitions (children), the quality of split is computed as, where, ni = number of records at child i, n = number of records at node p. • Should make sense- weighted average where weight is proportional to number of examples at the node • If two nodes and one node has 5 examples and the other has 10 examples, the first node has weight 5/15 and the second node has weight 10/15

Computing GINI for Binary Feature • Splits into two partitions • Weighting gives more importance to larger partitions • Are we better off stopping at node B or splitting? • Compare Gini before and after. Do it! B? Yes No Node N1 Node N2

Computing GINI for Binary Feature • Splits into two partitions • Weighting gives more importance to larger partitions • Are we better off stopping at node B or splitting? • Compare impurity before and after B? Yes No Node N1 Node N2 Gini(N1) = 1 – (5/7)2 – (2/7)2= 0.41 Gini(N2) = 1 – (1/5)2 – (4/5)2= 0.32 Gini (Children) = 7/12 x 0.41 + 5/12 x 0.32= 0.375 .375

Categorical Attributes: Computing Gini Index • For each distinct value, gather counts for each class in the dataset • Use the count matrix to make decisions • Exercise: Compute the Gini’s Multi-way split Two-way split (find best partition of values)

Continuous Attributes: Computing Gini Index • Use Binary Decisions based on one value • Several Choices for the splitting value • Number of possible splitting values = Number of distinct values • Each splitting value has a count matrix associated with it • Class counts in each of the partitions, A < v and A  v • Simple method to choose best v • For each v, scan the database to gather count matrix and compute its Gini index • Computationally Inefficient! Repetition of work.

Sorted Values Split Positions Continuous Attributes: Computing Gini Index... • For efficient computation: for each attribute, • Sort the attribute on values • Linearly scan these values, each time updating the count matrix and computing gini index • Choose the split position that has the least gini index

Alternative Splitting Criteria based on INFO • Entropy at a given node t: (NOTE: p( j | t) is the relative frequency of class j at node t). • Measures homogeneity of a node. • Maximum (log2nc) when records are equally distributed among all classes implying least information • Minimum (0.0) when all records belong to one class, implying most information • Entropy based computations are similar to the GINI index computations

Examples for computing Entropy P(C1) = 0/6 = 0 P(C2) = 6/6 = 1 Entropy = – 0 log20– 1 log21 = – 0 – 0 = 0 P(C1) = 1/6 P(C2) = 5/6 Entropy = – (1/6) log2 (1/6)– (5/6) log2 (5/6) = 0.65 P(C1) = 2/6 P(C2) = 4/6 Entropy = – (2/6) log2 (2/6)– (4/6) log2 (4/6) = 0.92 P(C1) = 3/6=1/2 P(C2) = 3/6 = 1/2 Entropy = – (1/2) log2(1/2)– (1/2) log2(1/2) = -(1/2)(-1) – (1/2)(-1) = ½ + ½ = 1

How to Calculate log2x • Many calculators only have a button for log10x and logex (note log typically means log10) • You can calculate the log for any base b as follows: • logb(x) = logk(x) / logk(b) • Thus log2(x) = log10(x) / log10(2) • Since log10(2) = .301, just calculate the log base 10 and divide by .301 to get log base 2. • You can use this for HW if needed

Splitting Based on INFO... • Information Gain: Parent Node, p is split into k partitions; ni is number of records in partition i • Measures Reduction in Entropy achieved because of the split. Choose the split that achieves most reduction (maximizes GAIN) • Used in ID3 and C4.5 • Disadvantage: Tends to prefer splits that result in large number of partitions, each being small but pure.

Splitting Criteria based on Classification Error • Classification error at a node t : • Measures misclassification error made by a node. • Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information • Minimum (0.0) when all records belong to one class, implying most interesting information

Examples for Computing Error P(C1) = 0/6 = 0 P(C2) = 6/6 = 1 Error = 1 – max (0, 1) = 1 – 1 = 0 P(C1) = 1/6 P(C2) = 5/6 Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6 P(C1) = 2/6 P(C2) = 4/6 Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3

Comparison among Splitting Criteria For a 2-class problem:

Decision Trees and an Introduction to Classification