
CS 4700: Foundations of Artificial Intelligence

CS 4700: Foundations of Artificial Intelligence. Prof. Carla P. Gomes (gomes@cs.cornell.edu). Module: Extensions of Decision Trees (Reading: Chapter 18). Extensions of the Decision Tree Learning Algorithm (briefly): noisy data and overfitting, cross validation.



Presentation Transcript


  1. CS 4700: Foundations of Artificial Intelligence Prof. Carla P. Gomes gomes@cs.cornell.edu Module: Extensions of Decision Trees (Reading: Chapter 18)

  2. Extensions of the Decision Tree Learning Algorithm (Briefly) • Noisy data and overfitting • Cross validation • Missing data (see Exercise 18.12) • Using gain ratios (see Exercise 18.13) • Real-valued data • Generation of rules

  3. Noisy Data and Overfitting

  4. Noisy data • Many kinds of "noise" can occur in the examples: • Two examples have the same attribute/value pairs but different classifications → report the majority classification for the examples corresponding to the node (deterministic hypothesis), or report the estimated probabilities of each classification using the relative frequencies (if considering stochastic hypotheses). • Some values of attributes are incorrect because of errors in the data acquisition process or the preprocessing phase. • The classification is wrong (e.g., + instead of −) because of some error.
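A minimal sketch of these two options at a single noisy leaf (plain Python; the example labels are hypothetical and not from the slides):

```python
from collections import Counter

# Hypothetical labels of the examples reaching one leaf: identical
# attribute/value pairs, but conflicting classifications (noise).
labels_at_leaf = ["+", "+", "+", "-"]
counts = Counter(labels_at_leaf)

# Deterministic hypothesis: report the majority classification.
majority_class, _ = counts.most_common(1)[0]
print("majority classification:", majority_class)            # "+"

# Stochastic hypothesis: report relative frequencies instead.
n = len(labels_at_leaf)
probabilities = {label: c / n for label, c in counts.items()}
print("estimated probabilities:", probabilities)              # {'+': 0.75, '-': 0.25}
```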

  5. Overfitting • Consider the problem of trying to predict the roll of a die. The experiment data include: (1) Day of the week; (2) Month; (3) Color of the die; • … • As long as no two examples have identical descriptions, DTL will find an exact hypothesis, even one that uses irrelevant attributes. Some attributes are irrelevant to the decision-making process (e.g., the color of a die is irrelevant to its outcome), but they are used to differentiate examples → overfitting. Overfitting means fitting the training set "too well" → performance on the test set degrades.

  6. Overfitting • If the hypothesis space has many dimensions because of a large number of attributes, we may find meaningless regularity in the data that is irrelevant to the true, important, distinguishing features. • Fix by pruning lower nodes in the decision tree: for example, if the Gain of the best attribute at a node is below a threshold, stop and make this node a leaf rather than generating child nodes. • Overfitting is a key problem in learning; a mathematical treatment of overfitting is beyond this course. • What follows is a flavor of "decision tree pruning".

  7. Overfitting • Let's consider D, the entire distribution of data, and T, the training set. • Hypothesis h ∈ H overfits D if • ∃ h' ≠ h, h' ∈ H, such that • errorT(h) < errorT(h') but • errorD(h) > errorD(h')
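To make the definition concrete, here is a minimal sketch in the spirit of the die example (assuming scikit-learn and NumPy, which the slides do not use; the data are synthetic and the attributes are irrelevant by construction). The unrestricted tree fits the training set far better than the depth-limited one, yet does no better on held-out data, which is exactly the overfitting pattern defined above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: the label ("roll") is pure noise, and all attributes
# (day, month, color, ...) are irrelevant to it by construction.
X = rng.integers(0, 7, size=(500, 5))
y = rng.integers(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An unrestricted tree fits the training set (almost) perfectly ...
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("deep tree    train/test accuracy:",
      deep.score(X_train, y_train), deep.score(X_test, y_test))

# ... while a pruned (depth-limited) tree fits it less well, but neither
# does better than chance on the test set: the deep tree has overfit.
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
print("shallow tree train/test accuracy:",
      shallow.score(X_train, y_train), shallow.score(X_test, y_test))
```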

  8. Dealing with Overfitting • Decision tree pruning • prevent splitting on features that are not clearly relevant. • Testing of relevance of features: • statistical tests • cross-validation • Grow full tree, then post-prune • rule post-pruning

  9. Pruning Decision Trees • Idea: prevent splitting on attributes that are not relevant. • How do we decide that an attribute is irrelevant? • If an attribute is not really relevant, the resulting subsets would have roughly the same proportion of each class (pos/neg) as the original set • → Gain would be close to zero. • How large should the Gain be to be considered significant? • → Statistical significance test:

  10. Statistical Significance Test: Pruning Decision Trees • How large should the Gain be to be considered significant? → statistical significance test: • Null hypothesis (H0): the attribute we are considering to split on is irrelevant, i.e., Gain is zero for a sufficiently large sample. • We can measure the deviation (D) by comparing the actual numbers of positive and negative examples in each of the v new subsets (pi and ni) with the expected numbers p̂i and n̂i, assuming true irrelevance of the attribute, therefore using the original set proportions (p positive and n negative examples) applied to the new subsets: • p̂i = p · (pi + ni) / (p + n),  n̂i = n · (pi + ni) / (p + n)

  11. χ² Test • D ~ χ² distribution with (v − 1) degrees of freedom • v = size of the domain of attribute A • pi – actual number of positive examples in subset i • ni – actual number of negative examples in subset i • p̂i = p · (pi + ni) / (p + n) – expected number of positive examples, using the original set proportion p of positive examples (therefore assuming H0, i.e., an irrelevant attribute) • n̂i = n · (pi + ni) / (p + n) – expected number of negative examples, using the original set proportion n of negative examples (therefore assuming H0, i.e., an irrelevant attribute) • D = Σi [ (pi − p̂i)² / p̂i + (ni − n̂i)² / n̂i ]

  12. χ² Test: Summary • H0: A is irrelevant • D = Σ (obs − exp)² / exp ~ χ² distribution with (v − 1) degrees of freedom • α – probability of making a mistake (i.e., rejecting H0 when it is true) • Tα(v−1) – threshold value (i.e., the theoretical value for the χ² distribution with (v − 1) degrees of freedom, with probability α in the tail) • D < Tα(v−1) → do not reject H0 (therefore do not split on A, since we do not reject that it is irrelevant) • D > Tα(v−1) → reject H0 (therefore split on A, since we reject that it is irrelevant)

  13. χ² Table
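In practice the threshold can be computed rather than read off a printed table. A minimal sketch of the test above, assuming SciPy (not used in the lecture) and hypothetical counts:

```python
from scipy.stats import chi2

def chi_square_deviation(subsets, p, n):
    """Deviation D for a candidate split, in the slide's notation.

    subsets: list of (p_i, n_i) counts of positive/negative examples in
             each of the v subsets induced by attribute A.
    p, n:    counts of positive/negative examples in the parent node.
    """
    D = 0.0
    for p_i, n_i in subsets:
        # Expected counts under H0 (A irrelevant): parent proportions
        # applied to the size of the subset.
        p_hat = p * (p_i + n_i) / (p + n)
        n_hat = n * (p_i + n_i) / (p + n)
        D += (p_i - p_hat) ** 2 / p_hat + (n_i - n_hat) ** 2 / n_hat
    return D

# Hypothetical split of 12 examples (p = 6, n = 6) into v = 3 subsets.
subsets = [(4, 1), (1, 4), (1, 1)]
D = chi_square_deviation(subsets, p=6, n=6)

v = len(subsets)
alpha = 0.05
threshold = chi2.ppf(1 - alpha, df=v - 1)   # T_alpha with (v - 1) degrees of freedom

print(f"D = {D:.2f}, threshold = {threshold:.2f}")
print("split on A" if D > threshold else "do not split on A (prune)")
```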

  14. Cross Validation • Another technique to reduce overfitting; • it can be applied to any learning algorithm. • Key idea: estimate how well each hypothesis will predict unseen data. • Set aside some fraction of the training set and use it to test the prediction performance of a hypothesis induced from the remaining data. Note: in order to avoid peeking, a separate test set should still be used. • K-fold cross-validation – run k experiments, each time setting aside 1/k of the data to test on, and average the results. Popular values for k: 5 and 10. • Leave-one-out cross-validation – the extreme case k = n.
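A minimal sketch of 10-fold cross-validation, assuming scikit-learn and its bundled breast-cancer dataset purely as a stand-in (neither is part of the lecture):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 10-fold cross-validation: each fold is held out in turn for testing,
# the tree is trained on the remaining 9/10, and the scores are averaged.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores.round(3))
print("mean accuracy:", round(scores.mean(), 3))
```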

  15. Validation Set • Sometimes an additional validation set is used, independent from the training and testing sets. The validation set can be used, for example, to watch out for overfitting, or to choose the best input parameters for a classifier model. • In that case the data is split in three parts: training, validation, and test. • 1) Use the training data in a cross-validation scheme like 10-fold or 2/3 - 1/3 to simply estimate the average quality (e.g., error rate or accuracy) of a classifier.

  16. Validation Set • 2) Leave an additional subset of data, the validation set, to adjust additional parameters or the structure of the model (such as the number of layers or neurons in a neural network, or the number of nodes in a decision tree). • For example: you can use the validation set to decide when to stop growing the decision tree, i.e., to test for overfitting. • Repeat this for many parameter/structure choices and select the choice with the best quality on the validation set. • 3) Finally take this best choice of parameters + model from step 2 and use it to estimate the quality on the test data.

  17. Validation Set • So: • Training set to compute the model, • Validation set to choose the best parameters of this model (in case there are "additional" parameters that cannot be computed based on training alone), • Test set as the final "judge" to get an estimate of the quality on new data that was used neither to train the model nor to determine its parameters, structure, or complexity.

  18. How to pick the best tree? • Measure performance over training data • Measure performance over a separate validation data set • MDL (minimal description length): minimize size(tree) + size(misclassifications(tree))

  19. Selecting Algorithm Parameters • Optimal choice of algorithm parameters depends on the learning task: • (e.g., number of nodes in decision trees, nodes in a neural network, k in k-nearest neighbor, etc.) • [Diagram: data D is drawn randomly from the real-world process and split randomly into training data Dtrain (50%), test data Dtest (30%), and validation data Dval (20%); Learner 1 … Learner p each produce a hypothesis h1 … hp from Dtrain, and hfinal = argmin over hi of Errval(hi).]
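A minimal sketch of this selection scheme, assuming scikit-learn and a stand-in dataset; the split percentages follow the slide, and the candidate "learners" are depth-limited decision trees chosen only for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Split D randomly into Dtrain (50%), Dtest (30%), Dval (20%).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.5, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, train_size=0.6, random_state=0)

# Each candidate parameter setting plays the role of one "learner" h_i.
candidates = {}
for depth in [1, 2, 3, 5, 8, None]:
    candidates[depth] = DecisionTreeClassifier(
        max_depth=depth, random_state=0).fit(X_train, y_train)

# h_final = argmin over validation error (equivalently, argmax validation accuracy).
best_depth = max(candidates, key=lambda d: candidates[d].score(X_val, y_val))
h_final = candidates[best_depth]

print("chosen max_depth:", best_depth)
print("test accuracy of h_final:", round(h_final.score(X_test, y_test), 3))
```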

  20. Extensions of Decision Tree Learning (Only a flavor, briefly)

  21. Multivalued Attributes: Using Gain Ratios • The notion of information gain favors attributes that have a large number of values. • If we have an attribute A that has a distinct value for each example, then each subset of examples would be a singleton • → Entropy(S,A), the average entropy of the subsets induced by attribute A, is 0, thus Gain(S,A) is maximal. Nevertheless the attribute could be bad for generalization, or even irrelevant. • Example: the name of the restaurant does not generalize well!

  22. Using Gain Ratios • To compensate for this, Quinlan suggests using a gain ratio instead of the gain: GainRatio(S,A) = Gain(S,A) / SplitInfo(A,S) • SplitInfo(A,S) (the information content, or intrinsic information, of an attribute) • → the information due to the split of S on the basis of the value of categorical attribute A: SplitInfo(A,S) = I(|S1|/|S|, |S2|/|S|, …, |Sm|/|S|) • where {S1, S2, …, Sm} is the partition of S induced by the values of A
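A minimal sketch of both quantities in plain Python (the helper names entropy and gain_ratio are mine, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy I(p1, ..., pk) of a class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(attribute_values, labels):
    """GainRatio(S, A) = Gain(S, A) / SplitInfo(A, S)."""
    n = len(labels)
    subsets = {}                      # partition of S induced by the values of A
    for v, y in zip(attribute_values, labels):
        subsets.setdefault(v, []).append(y)

    # Gain(S, A): parent entropy minus the average entropy of the subsets,
    # weighted by subset size.
    gain = entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())

    # SplitInfo(A, S) = I(|S1|/|S|, ..., |Sm|/|S|): the entropy of the split
    # itself; it is large for many-valued attributes, which penalizes them.
    split_info = -sum((len(s) / n) * math.log2(len(s) / n) for s in subsets.values())
    return gain / split_info if split_info > 0 else 0.0

# An attribute with a distinct value per example (like the restaurant name)
# has maximal Gain but also maximal SplitInfo, so its GainRatio stays modest.
labels = ["+", "-", "+", "-"]
print(gain_ratio(["a", "b", "c", "d"], labels))   # 0.5  (identifier-like attribute)
print(gain_ratio(["x", "y", "x", "y"], labels))   # 1.0  (genuinely informative attribute)
```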

  23. Real-valued data • Select a set of thresholds defining intervals; • each interval becomes a discrete value of the attribute. • We can use simple heuristics • always divide into quartiles • We can use domain knowledge • divide age into infant (0-2), toddler (3-5), and school-aged (5-8) • or treat this as another learning problem • try a range of ways to discretize the continuous variable and see which yields "better results" w.r.t. some metric • (e.g., information gain, gain ratio, etc.).
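A minimal sketch of the two simplest options above, assuming NumPy and hypothetical age values:

```python
import numpy as np

ages = np.array([0.5, 1, 2, 3, 4, 5, 6, 7, 8])   # hypothetical real-valued attribute

# Heuristic 1: divide into quartiles of the observed distribution.
quartile_edges = np.quantile(ages, [0.25, 0.5, 0.75])
quartile_bins = np.digitize(ages, quartile_edges)       # discrete values 0..3

# Heuristic 2: thresholds from domain knowledge
# (infant 0-2, toddler 3-5, school-aged 5-8, as on the slide).
domain_bins = np.digitize(ages, [2, 5], right=True)     # 0=infant, 1=toddler, 2=school-aged

print(quartile_bins)
print(domain_bins)
```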

  24. Converting Trees to Rules • Every decision tree corresponds to a set of rules: • IF (Patrons = None) THEN WillWait = No • IF (Patrons = Full) & (Hungry = No) & (Type = French) THEN WillWait = Yes • ...

  25. Decision Trees to Rules • It is easy to derive a rule set from a decision tree: write a rule for each path in the decision tree from the root to a leaf. • In that rule, the left-hand side is easily built from the labels of the nodes and the labels of the arcs. • The resulting rule set can be simplified: • Let LHS be the left-hand side of a rule. • Let LHS' be obtained from LHS by eliminating some conditions. • We can certainly replace LHS by LHS' in this rule if the subsets of the training set that satisfy LHS and LHS', respectively, are equal.
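As a quick library-based illustration (scikit-learn, with the iris dataset standing in for the restaurant data; none of this is from the lecture), export_text prints every root-to-leaf path of a fitted tree, which is exactly the rule set described above:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small tree; each root-to-leaf path of the printout is one rule whose
# left-hand side is the conjunction of the tests along that path.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=load_iris().feature_names))
```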

  26. Fighting Overfitting: Using Rule Post-Pruning

  27. C4.5 • This is the strategy used by the most successful commercial decision tree learning method → C4.5. • C4.5 is an extension of ID3 that accounts for unavailable values, continuous attribute value ranges, pruning of decision trees, rule derivation, and so on. • Reference: C4.5: Programs for Machine Learning, J. Ross Quinlan, The Morgan Kaufmann Series in Machine Learning (series editor: Pat Langley).

  28. Summary: When to use Decision Trees • Instances presented as attribute-value pairs • Method of approximating discrete-valued functions • Target function has discrete values: classification problems • Disjunctive descriptions required (propositional logic) • Robust to noisy data: • training data may contain • errors • missing attribute values • Typical bias: prefer smaller trees (Ockham's razor) • Widely used, practical, and the results are easy to interpret

  29. Summary of Decision Tree Learning • Inducing decision trees is one of the most widely used learning methods in practice • Can outperform human experts in many problems • Strengths include: • fast • simple to implement • can convert the result to a set of easily interpretable rules • empirically valid in many commercial products • handles noisy data • Weaknesses include: • "univariate" splits/partitioning using only one attribute at a time, which limits the types of possible trees • large decision trees may be hard to understand • requires fixed-length feature vectors • non-incremental (i.e., a batch method)
