
Decision Tree Problems






Presentation Transcript


  1. Decision Tree Problems CSE-391: Artificial Intelligence, University of Pennsylvania. Matt Huenerfauth, April 2005

  2. Homework 7 • Perform some entropy and information gain calculations. • We’ll also do some information gain ratio calculations in class today. You don’t need to do these on the midterm, but you should understand generally how it’s calculated and you should know when we should use this metric. • Using the C4.5 decision tree learning software. • You’ll learn trees to do word sense disambiguation. • Read chapter 18.1 – 18.3.

  3. Looking at some data

  4. Calculate Entropy • For many of the tree-building calculations we do today, we’ll need to know the entropy of a data set. • Entropy is the degree to which a dataset is mixed up. That is, how mixed the classifications (+/-) in the set still are. • For example, a set that is still split 50/50 between + and - will have an entropy of 1.0. • A set that’s all + or all - will have an entropy of 0.0.

  5. Entropy Calculations: I() If we have a set with k different values in it, we can calculate the entropy as follows: I(S) = -Σ(i=1..k) P(valuei)*log2 P(valuei), where P(valuei) is the probability of getting the ith value when randomly selecting one from the set. So, for the set R = {a,a,a,b,b,b,b,b} (three a-values, five b-values): I(R) = (-3/8)*log2(3/8) + (-5/8)*log2(5/8) = 0.9544
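
As a concrete illustration of this formula, here is a minimal Python sketch; the `entropy` helper and its list-of-counts argument are choices made for this example, not code from the course:

```python
from math import log2

def entropy(counts):
    """I() of a set, given how many times each distinct value occurs."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# R = {a,a,a,b,b,b,b,b}: three a-values, five b-values.
print(round(entropy([3, 5]), 4))  # 0.9544
```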

  6. Looking at some data

  7. Entropy for our data set • 16 instances: 9 positive, 7 negative. • I(data) = (-9/16)*log2(9/16) + (-7/16)*log2(7/16) = 0.9887 • This makes sense – it’s almost a 50/50 split; so, the entropy should be close to 1.
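
Using the entropy helper sketched above (repeated here so the snippet stands alone), the value can be checked directly:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 7]), 4))  # 0.9887 -- the 16-instance data set
print(round(entropy([8, 8]), 4))  # 1.0 -- a perfect 50/50 split, for comparison
```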

  8. How do we use this? • The computer needs a way to decide how to build a decision tree. • First decision: which attribute should it ‘branch on’ at the root? • Recursively: which attribute should it ‘branch on’ at each subsequent node? • Guideline: always branch on the attribute that will divide the data into subsets with as low an entropy as possible (that are as unmixed +/- as possible).

  9. Information Gain Metric: G() • When we select an attribute to use as our branching criterion at the root, we’ve effectively split our data into two sets: the set that goes down the left branch, and the set that goes down the right. • If we know the entropy before we started, and we then calculate the entropy of each of these resulting subsets, we can calculate the information gain.

  10. Information Gain Metric: G() • Why is reducing entropy a good idea? • Eventually we’d like our tree to distinguish data items into groups that are fine-grained enough that we can label them as being either + or - • In other words, we’d like to separate our data in such a way that each group is as ‘unmixed’ in terms of +/- classifications as possible. • So, the ideal attribute to branch at the root would be the one that can separate the data into an entirely + group and an entirely – one.

  11. Visualizing Information Gain [Diagram: branching on Size] Entropy of the whole set = 0.9887 (16 examples). Entropy down the ‘Small’ branch = 0.8113 (from 8 examples). Entropy down the ‘Large’ branch = 0.9544 (from 8 examples).

  12. Visualizing Information Gain The data set that goes down each branch of the tree has its own entropy value. For each possible attribute, we can calculate its expected entropy: the entropy we would expect to be left with if we branch on that attribute. You add the entropies of the two children, weighted by the proportion of examples from the parent node that ended up at each child. [Diagram: the Size split; 16 examples with entropy 0.9887 at the parent.] 8 examples with ‘small’: entropy of the left child is I(size=small) = 0.8113. 8 examples with ‘large’: entropy of the right child is I(size=large) = 0.9544. So the expected entropy for this split is I(size) = (8/16)*0.8113 + (8/16)*0.9544 = 0.8829.

  13. G(attrib) = I(parent) – I(attrib) We want to calculate the information gain (or entropy reduction). This is the reduction in ‘uncertainty’ when choosing our first branch as ‘size’. We will represent information gain as “G.” Entropy of all data at the parent node = I(parent) = 0.9887. Child’s expected entropy for the ‘size’ split = I(size) = 0.8829. G(size) = I(parent) – I(size) = 0.9887 – 0.8829 = 0.1058. So, we have gained 0.1058 bits of information about the dataset by choosing ‘size’ as the first branch of our decision tree.
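
To make the arithmetic concrete, here is a minimal Python sketch of the same gain calculation; the variable names and the (+, -) tuple representation are assumptions made for this example, not code from the course:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

parent = (9, 7)                  # (+, -) counts before the split
small, large = (6, 2), (3, 5)    # (+, -) counts down each 'size' branch
n = sum(parent)

# Expected entropy of the children, weighted by how many examples reach each.
i_size = sum(sum(child) / n * entropy(child) for child in (small, large))
g_size = entropy(parent) - i_size
print(round(i_size, 4), round(g_size, 4))  # 0.8829 0.1058
```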

  14. Using Information Gain • For each of the attributes we’re thinking about branching on, and for all of the data that will reach this node (which is all of the data when at the root), do the following: • Calculate the Information Gain if we were to split the current data on this attribute. • In the end, select the attribute with the greatest Information Gain to split on. • Create two subsets of the data (one for each branch of the tree), and recurse on each branch.
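
That selection step could be sketched as below, using only the per-value class counts that appear on the ‘Calculations’ slide later in this transcript; the dictionary layout and the function names are invented for this illustration:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# (+, -) class counts for each value of each candidate attribute, from the slides.
splits = {
    "size":  {"small": (6, 2), "large": (3, 5)},
    "color": {"green": (1, 2), "yellow": (8, 5)},
    "shape": {"regular": (6, 6), "irregular": (3, 1)},
}
parent = (9, 7)

def info_gain(children, parent):
    n = sum(parent)
    expected = sum(sum(c) / n * entropy(c) for c in children.values())
    return entropy(parent) - expected

for attrib in splits:
    print(attrib, round(info_gain(splits[attrib], parent), 4))
print("branch on:", max(splits, key=lambda a: info_gain(splits[a], parent)))  # size
```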

  15. Showing the calculations • For color, size, shape. • Select the one with the greatest info gain value as the attribute we’ll branch on at the root. • Now imagine what our data set will look like on each side of the branch. • We would then recurse on each of these data sets to select how to branch below.

  16. Our Data Table

  17. Sequence of Calculations • Calculate I(parent). This is the entropy of the data set before the split. Since we’re at the root, this is simply the entropy of all the data. I(all_data) = (-9/16)*log2(9/16) + (-7/16)*log2(7/16) • Next, calculate I() for the subset of the data where color=green and for the subset of the data where color=yellow. I(color=green) = (-1/3)*log2(1/3) + (-2/3)*log2(2/3) I(color=yellow) = (-8/13)*log2(8/13) + (-5/13)*log2(5/13) • Now calculate the expected entropy for ‘color.’ I(color) = (3/16)*I(color=green) + (13/16)*I(color=yellow) • Finally, the information gain for ‘color.’ G(color) = I(parent) – I(color)
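
The same four steps, written out numerically in a small Python sketch (the variable names are mine; the fractions are the ones shown on the slide):

```python
from math import log2

I_parent = -(9/16)*log2(9/16) - (7/16)*log2(7/16)   # 0.9887
I_green  = -(1/3)*log2(1/3) - (2/3)*log2(2/3)       # 0.9183
I_yellow = -(8/13)*log2(8/13) - (5/13)*log2(5/13)   # 0.9612
I_color  = (3/16)*I_green + (13/16)*I_yellow        # 0.9532
G_color  = I_parent - I_color
print(round(G_color, 4))                            # 0.0355
```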

  18. Calculations • I(all_data) = .9887 • I(size) = .8829, G(size) = .1058 size=small, +6,-2; I(size=small) = .8113 size=large, +3,-5; I(size=large) = .9544 • I(color) = .9532, G(color) = .0355 color=green, +1,-2; I(color=green) = .9183 color=yellow, +8,-5; I(color=yellow) = .9612 • I(shape) = .9528, G(shape) = .0359 shape=regular, +6,-6; I(shape=regular) = 1.0 shape=irregular, +3,-1; I(shape=irregular) = .8113

  19. Visualizing the Recursive Step Now that we have split on a particular feature, we delete that feature from the set considered at the next layer. Since this effectively gives us a ‘new’, smaller dataset with one less feature at each of these child nodes, we simply apply the same entropy calculation procedures recursively for each child. [Diagram: the Size split into its ‘Small’ and ‘Large’ subsets.]
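
A rough sketch of that recursion in Python. The row-as-dictionary representation, the "class" key, and the toy rows at the bottom are all invented for this illustration; this is not the course’s berry data set and not the C4.5 implementation:

```python
from math import log2
from collections import Counter

def entropy_of(rows):
    counts = Counter(r["class"] for r in rows).values()
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def info_gain(rows, attrib):
    n = len(rows)
    expected = 0.0
    for v in set(r[attrib] for r in rows):
        subset = [r for r in rows if r[attrib] == v]
        expected += (len(subset) / n) * entropy_of(subset)
    return entropy_of(rows) - expected

def build_tree(rows, attribs):
    classes = set(r["class"] for r in rows)
    if len(classes) == 1 or not attribs:           # pure subset, or no features left
        return Counter(r["class"] for r in rows).most_common(1)[0][0]
    best = max(attribs, key=lambda a: info_gain(rows, a))
    remaining = [a for a in attribs if a != best]  # delete the used feature
    children = {}
    for v in set(r[best] for r in rows):
        children[v] = build_tree([r for r in rows if r[best] == v], remaining)
    return {"split on": best, "children": children}

# Toy rows (NOT the course data), just to show the shape of the recursion.
toy = [
    {"size": "small", "color": "yellow", "class": "+"},
    {"size": "small", "color": "green",  "class": "+"},
    {"size": "large", "color": "yellow", "class": "-"},
    {"size": "large", "color": "green",  "class": "-"},
]
print(build_tree(toy, ["size", "color"]))
```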

  20. Calculations • Entropy of this whole subset (the ‘small’ branch: +6, -2): 0.8113 • I(color) = .7375, G(color) = .0738 color=yellow, +5,-1; I(color=yellow) = 0.65 color=green, +1,-1; I(color=green) = 1.0 • I(shape) = .6887, G(shape) = .1226 shape=regular, +4,-2; I(shape=regular) = .9183 shape=irregular, +2,-0; I(shape=irregular) = 0
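
Those subset numbers can be checked with the same kind of arithmetic (a stand-alone sketch; the entropy helper is the same one sketched earlier):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Within the (+6, -2) subset: color=yellow covers (+5, -1), color=green covers (+1, -1).
i_color = (6/8) * entropy((5, 1)) + (2/8) * entropy((1, 1))
print(round(i_color, 4), round(entropy((6, 2)) - i_color, 4))  # 0.7375 0.0738
```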

  21. Binary Data • Sometimes most of our attributes are binary values or have a low number of possible values. (Like the berry example.) • In this case, the information gain metric is appropriate for selecting which attribute to use to branch at each node. • When we have some attributes with very many values, then there is another metric which is better to use.

  22. Information Gain Ratio: GR() • The information gain metric has a bias toward branching on attributes that have very many possible values. • To combat this bias, we use a different branching-attribute selection metric, called the “Information Gain Ratio,” written GR(), e.g. GR(size).

  23. Formula for Info Gain Ratio • The information gain ratio divides the gain by the entropy of the attribute’s own distribution of values: GR(attrib) = G(attrib) / ( -Σv P(v)*log2 P(v) ) • P(v) is the proportion of the values of this attribute that are equal to v. • Note: we’re not counting +/- in this case. We’re counting the values in the ‘attribute’ column. • Let’s use the information gain ratio metric to select the best attribute to branch on.
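
A small Python sketch of both pieces; the function names `split_info` and `gain_ratio` are invented here, and the gain value plugged in comes from the earlier slides (this is not the C4.5 implementation):

```python
from math import log2

def split_info(value_counts):
    """-sum of P(v)*log2 P(v), counting values in the attribute column (not +/-)."""
    n = sum(value_counts)
    return -sum((c / n) * log2(c / n) for c in value_counts if c > 0)

def gain_ratio(gain, value_counts):
    return gain / split_info(value_counts)

# 'size' takes the value 'small' 8 times and 'large' 8 times among the 16 instances.
print(round(gain_ratio(0.1058, [8, 8]), 4))  # 0.1058, since split_info([8, 8]) = 1.0
```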

  24. Calculation of GR() • GR(size) = G(size) / Sum(…) • GR(size) = .1058, G(size) = .1058 8 occurrences of size=small; 8 occurrences of size=large. Sum(…) = (-8/16)*log2(8/16) + (-8/16)*log2(8/16) = 1 • GR(color) = .0510, G(color) = .0355 3 occurrences of color=green; 13 of color=yellow. Sum(…) = (-3/16)*log2(3/16) + (-13/16)*log2(13/16) = .6962 • GR(shape) = .0442, G(shape) = .0359 12 occurrences of shape=regular; 4 of shape=irregular. Sum(…) = (-12/16)*log2(12/16) + (-4/16)*log2(4/16) = .8113
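
As a quick check of one of the denominators above (a stand-alone two-line sketch):

```python
from math import log2

# Split information for 'color': 3 'green' and 13 'yellow' values out of 16.
si_color = -(3/16)*log2(3/16) - (13/16)*log2(13/16)
print(round(si_color, 4), round(0.0355 / si_color, 4))  # 0.6962 0.051
```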

  25. Selecting the root • Same as before, but now instead of selecting the attribute with the highest information gain, we select the one with the highest information gain ratio. • We will use this attribute to branch at the root.

  26. Data Subsets / Recursive Step • Same as before. • After we select an attribute for the root, we partition the data set into subsets and remove that attribute from consideration for the subsets below its node. • Now, we recurse: we calculate what each of our subsets will be down each branch. • We recursively calculate the info gain ratio for all the attributes on each of these data subsets in order to select how the tree will branch below the root.

  27. Recursively • Entropy of this whole subset (the ‘small’ branch: +6, -2): 0.8113 • G(color) = .0738, GR(color) = .0909 color=yellow, +5,-1; I(color=yellow) = 0.65 color=green, +1,-1; I(color=green) = 1.0 • G(shape) = .1226, GR(shape) = .1511 shape=regular, +4,-2; I(shape=regular) = .9183 shape=irregular, +2,-0; I(shape=irregular) = 0
