
Data Mining CSCI 307, Spring 2019 Lecture 15

This lecture discusses the properties required of a purity measure when constructing decision trees, including a zero measure for pure nodes and a maximal measure for maximal impurity.



Presentation Transcript


  1. Data Mining CSCI 307, Spring 2019, Lecture 15: Constructing Trees

  2. Wishlist for a Purity Measure
     Properties we require from a purity measure:
     • When a node is pure, the measure should be zero
     • When impurity is maximal (i.e. all classes are equally likely), the measure should be maximal
     • The measure should obey the multistage property (i.e. decisions can be made in several stages):
       measure([2,3,4]) = measure([2,7]) + (7/9) x measure([3,4])
       Make the first decision, then decide on the second case; the decision is made in two stages.
     Entropy is the only function that satisfies all three properties.
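A minimal Python sketch (not from the slides) of the entropy measure, with a numeric check of the multistage property above; the helper name entropy is my own:

    from math import log2

    def entropy(counts):
        """Entropy in bits of a list of class counts, e.g. [2, 3, 4]."""
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c > 0)

    # Multistage property: measure([2,3,4]) = measure([2,7]) + (7/9) x measure([3,4])
    lhs = entropy([2, 3, 4])
    rhs = entropy([2, 7]) + (7 / 9) * entropy([3, 4])
    print(round(lhs, 3), round(rhs, 3))   # both ~1.53 bits, so the two-stage computation agrees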

  3. Example: attribute Outlook
     Outlook    Yes  No
     Sunny       2    3
     Overcast    4    0
     Rainy       3    2
     (Tree stumps shown for Outlook = Sunny, Outlook = Overcast, and Outlook = Rainy.)

  4. Example: attribute Outlook
     Outlook = Sunny:    info([2,3]) = 0.971 bits
     Outlook = Overcast: info([4,0]) = 0 bits
     Outlook = Rainy:    info([3,2]) = 0.971 bits
     Expected information for the attribute:
     info([2,3],[4,0],[3,2]) = 5/14 x 0.971 + 4/14 x 0 + 5/14 x 0.971 = 0.693 bits

  5. Computing Information Gain
     Information gain = information before splitting - information after splitting
     We've calculated the information BEFORE splitting, info([9,5]) = 0.940 bits, and the information AFTER the split for the Outlook attribute, 0.693 bits. So we can calculate the information gain for Outlook:
     gain(Outlook) = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits
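A quick numeric check (a sketch, not the slides' code) of the weighted average and the gain, reusing the entropy() helper defined above; the helper name info is my own:

    def info(branches):
        """Weighted average entropy (in bits) over a list of branch class-count lists."""
        total = sum(sum(b) for b in branches)
        return sum(sum(b) / total * entropy(b) for b in branches)

    before = entropy([9, 5])                  # 0.940 bits for the full weather data
    after = info([[2, 3], [4, 0], [3, 2]])    # 0.693 bits after splitting on Outlook
    print(round(before - after, 3))           # gain(Outlook) ~ 0.247 bits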

  6. attribute: Temperature
     Temperature  Yes  No
     Hot           2    2
     Mild          4    2
     Cool          3    1
     Temperature = Hot:  info([2,2]) = entropy(2/4, 2/4) = -2/4 log2(2/4) - 2/4 log2(2/4) = 1 bit
     Temperature = Mild: info([4,2]) = entropy(4/6, 2/6) = -2/3 log2(2/3) - 1/3 log2(1/3) = 0.918 bits
     Temperature = Cool: info([3,1]) = entropy(3/4, 1/4) = -3/4 log2(3/4) - 1/4 log2(1/4) = 0.811 bits

  7. attribute: Temperature
     Temperature = Hot:  info([2,2]) = 1 bit
     Temperature = Mild: info([4,2]) = 0.918 bits
     Temperature = Cool: info([3,1]) = 0.811 bits
     Expected information for the attribute: the average information value, weighted by the number of instances that go down each branch.
     info([2,2],[4,2],[3,1]) = 4/14 x 1 + 6/14 x 0.918 + 4/14 x 0.811 = 0.911 bits
     gain(Temperature) = info([9,5]) - info([2,2],[4,2],[3,1]) = 0.940 - 0.911 = 0.029 bits

  8. attribute: Humidity
     Humidity  Yes  No
     High       3    4
     Normal     6    1
     Humidity = High:   info([3,4]) = entropy(3/7, 4/7) = -3/7 log2(3/7) - 4/7 log2(4/7) = 0.985 bits
     Humidity = Normal: info([6,1]) = entropy(6/7, 1/7) = -6/7 log2(6/7) - 1/7 log2(1/7) = 0.592 bits

  9. attribute: Humidity
     Humidity = High:   info([3,4]) = 0.985 bits
     Humidity = Normal: info([6,1]) = 0.592 bits
     Expected information for the attribute: the average information value, weighted by the number of instances that go down each branch.
     info([3,4],[6,1]) = 7/14 x 0.985 + 7/14 x 0.592 = 0.788 bits
     gain(Humidity) = info([9,5]) - info([3,4],[6,1]) = 0.940 - 0.788 = 0.152 bits

  10. attribute: Windy
     Windy  Yes  No
     False   6    2
     True    3    3
     Windy = False: info([6,2]) = entropy(6/8, 2/8) = -6/8 log2(6/8) - 2/8 log2(2/8) = 0.811 bits
     Windy = True:  info([3,3]) = entropy(3/6, 3/6) = -3/6 log2(3/6) - 3/6 log2(3/6) = 1 bit

  11. attribute: Windy
     Windy = False: info([6,2]) = 0.811 bits
     Windy = True:  info([3,3]) = 1 bit
     Expected information for the attribute: the average information value, weighted by the number of instances that go down each branch.
     info([6,2],[3,3]) = 8/14 x 0.811 + 6/14 x 1 = 0.892 bits
     gain(Windy) = info([9,5]) - info([6,2],[3,3]) = 0.940 - 0.892 = 0.048 bits

  12. Which Attribute to Select as Root?
     For all the attributes from the weather data:
     gain(Outlook) = 0.247 bits
     gain(Temperature) = 0.029 bits
     gain(Humidity) = 0.152 bits
     gain(Windy) = 0.048 bits
     Outlook has the largest gain, so Outlook is the way to go ... it's the root.
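A compact sketch tying the four calculations together, reusing entropy() and info() from above; the weather_counts layout is my own assumption about how to hold the per-value class counts:

    # Class counts [yes, no] per attribute value, taken from the tables above.
    weather_counts = {
        "Outlook":     [[2, 3], [4, 0], [3, 2]],   # sunny, overcast, rainy
        "Temperature": [[2, 2], [4, 2], [3, 1]],   # hot, mild, cool
        "Humidity":    [[3, 4], [6, 1]],           # high, normal
        "Windy":       [[6, 2], [3, 3]],           # false, true
    }

    gains = {a: entropy([9, 5]) - info(b) for a, b in weather_counts.items()}
    print({a: round(g, 3) for a, g in gains.items()})  # 0.247, 0.029, 0.152, 0.048
    print(max(gains, key=gains.get))                   # 'Outlook' is chosen as the root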

  13. Continuing to Split
     Now, determine the gain for EACH of Outlook's branches: sunny, overcast, and rainy.
     For the sunny branch we know at this point the entropy is 0.971; it is our "before" split information as we calculate our gain from here.
     The rainy branch entropy is also 0.971; use it as our "before" split information as we calculate our gain from here on down.
     Splitting stops when we can't split any further; that is the case with the value overcast. We don't need to consider Outlook further.

  14. Continuing the Split at Sunny
     Now we must determine the gain for EACH of Outlook's branches. For the sunny branch we know the "before" split entropy is 0.971.

  15. Find Subroot for Sunny
     humidity = high:   info([0,3]) = entropy(0, 1) = 0 bits
     humidity = normal: info([2,0]) = entropy(1, 0) = 0 bits
     info([0,3],[2,0]) = 3/5 x 0 + 2/5 x 0 = 0 bits
     gain(Humidity) = info([2,3]) - info([0,3],[2,0]) = 0.971 - 0 = 0.971 bits

  16. Find Subroot for Sunny (continued)
     windy = false: info([1,2]) = entropy(1/3, 2/3) = -1/3 log2(1/3) - 2/3 log2(2/3) = 0.918 bits
     windy = true:  info([1,1]) = entropy(1/2, 1/2) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1 bit
     info([1,2],[1,1]) = 3/5 x 0.918 + 2/5 x 1 = 0.951 bits
     gain(Windy) = info([2,3]) - info([1,2],[1,1]) = 0.971 - 0.951 = 0.020 bits

  17. Find Subroot for Sunny (continued)
     temperature = hot:  info([0,2]) = entropy(0, 1) = -0 log2(0) - 1 log2(1) = 0 bits
     temperature = mild: info([1,1]) = entropy(1/2, 1/2) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1 bit
     temperature = cool: info([1,0]) = entropy(1, 0) = -1 log2(1) - 0 log2(0) = 0 bits
     info([0,2],[1,1],[1,0]) = 2/5 x 0 + 2/5 x 1 + 1/5 x 0 = 0.4 bits
     gain(Temperature) = info([2,3]) - info([0,2],[1,1],[1,0]) = 0.971 - 0.4 = 0.571 bits

  18. Finish the Split at Sunny
     gain(Humidity) = 0.971 bits
     gain(Temperature) = 0.571 bits
     gain(Windy) = 0.020 bits
     Humidity has the largest gain, so it becomes the split below Outlook = sunny; both of its branches are pure.
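The same helpers reproduce these numbers on the five sunny instances (a sketch reusing entropy() and info() from above):

    sunny_before = entropy([2, 3])                                   # 0.971 bits
    print(round(sunny_before - info([[0, 3], [2, 0]]), 3))           # Humidity:    0.971
    print(round(sunny_before - info([[0, 2], [1, 1], [1, 0]]), 3))   # Temperature: 0.571
    print(round(sunny_before - info([[1, 2], [1, 1]]), 3))           # Windy:      ~0.020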

  19. Possible Splits at Rainy
     No need to actually do the calculations, because splitting on windy gives pure nodes.

  20. Final Decision Tree
     Note: not all leaves need to be pure; sometimes identical instances have different classes.
     Splitting stops when data can't be split any further.
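For completeness, a minimal recursive sketch of the whole splitting procedure (not the slides' code; it assumes instances are dicts such as {'Outlook': 'sunny', ..., 'Play': 'yes'} and reuses entropy() and info() from above):

    from collections import Counter

    def class_counts(rows, target="Play"):
        """Class counts for a list of instance dicts, e.g. [9, 5]."""
        return list(Counter(r[target] for r in rows).values())

    def build_tree(rows, attributes, target="Play"):
        counts = Counter(r[target] for r in rows)
        if len(counts) == 1 or not attributes:      # pure node, or nothing left to split on
            return counts.most_common(1)[0][0]      # leaf labelled with the (majority) class
        def gain(attr):
            values = {r[attr] for r in rows}
            branches = [[r for r in rows if r[attr] == v] for v in values]
            return entropy(class_counts(rows, target)) - info([class_counts(b, target) for b in branches])
        best = max(attributes, key=gain)            # attribute with the largest gain
        rest = [a for a in attributes if a != best]
        return {best: {v: build_tree([r for r in rows if r[best] == v], rest, target)
                       for v in {r[best] for r in rows}}}

Called on the 14 weather instances with the four attribute names, this should reproduce the tree on this slide as a nested dict.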

  21. Highly-branching Attributes
     • Problematic: attributes with a large number of values (extreme case: ID code)
     • Subsets are more likely to be pure if there is a large number of values
     • Information gain is biased towards choosing attributes with a large number of values
     • This may result in overfitting (selection of an attribute that is non-optimal for prediction)
     • Another problem: fragmentation

  22. Tree Stump for ID code Attribute
     This seems like a bad idea for a split.
     Entropy of the split (one single-instance branch per ID code, each of them pure):
     info(ID code) = info([0,1]) + info([0,1]) + info([1,0]) + ... + info([1,0]) + info([0,1]) = 0 bits
     Information gain is maximal for ID code (namely 0.940 bits, i.e. the before-split information).

  23. Gain Ratio
     • Gain ratio: a modification of the information gain that reduces its bias
     • Gain ratio takes the number and size of branches into account when choosing an attribute
     • It corrects the information gain by taking the intrinsic information of a split into account
     • Intrinsic information: entropy of the distribution of instances into branches (i.e. how much information do we need to tell which branch an instance belongs to)

  24. Computing the Gain Ratio
     Example: intrinsic information for ID code:
     info([1,1,...,1]) = 14 x (-1/14 x log2(1/14)) = 3.807 bits
     The value of an attribute decreases as its intrinsic information gets larger.
     Definition of gain ratio: gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)
     Example: gain_ratio(ID code) = 0.940 bits / 3.807 bits = 0.246
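A numeric check of the ID code example (a sketch reusing entropy() from above):

    intrinsic = entropy([1] * 14)            # split info: 14 branches of one instance each, ~3.807 bits
    gain_id_code = entropy([9, 5]) - 0       # every branch is pure, so the after-split information is 0
    print(round(gain_id_code / intrinsic, 3))  # ~0.247 (0.246 with the slide's rounded figures)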
