Data Mining CSCI 307, Spring 2019 Lecture 14

This lecture discusses how to classify new data with numeric attributes using the Naive Bayes algorithm. It covers calculating probabilities with mean and standard deviation, using the normal probability distribution, and constructing decision trees.


Presentation Transcript


  1. Data Mining, CSCI 307, Spring 2019, Lecture 14: Naive Bayes

  2. We Want to Classify a New Day: Outlook=Sunny, Temperature=66, Humidity=90, Windy=True, Play=? • But what do we do when the data is numeric? • We cannot just "count" up the occurrences of a particular numeric value to calculate probabilities. • We calculate the mean (the average) and the standard deviation (a measure of how spread out the data is: a low std dev means the data points are close to the mean, a high std dev means they are spread out) and then...

  3. Numeric Attributes • Usual assumption: attributes have a normal or Gaussian probability distribution (given the class) • The probability density function for the normal distribution is defined by two parameters: the sample mean μ and the standard deviation σ • The density function is f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²))
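
A minimal sketch of this density function in Python (an illustrative addition, not from the slides):

    import math

    def normal_pdf(x, mu, sigma):
        """Gaussian density f(x) for sample mean mu and standard deviation sigma."""
        coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
        exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
        return coeff * math.exp(exponent)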

  4. Statistics for the Weather Data  http://easycalculation.com/statistics/normal-pdf.php

     Outlook             Temperature           Humidity            Windy              Play
               Yes  No        Yes     No           Yes    No               Yes  No     Yes    No
     Sunny      2    3         83     85            86    85      False     6    2      9      5
     Overcast   4    0         70     80            96    90      True      3    3
     Rainy      3    2         68     65            80    70
                               64     72            65    95
                               69     71            70    91
                               75                    80
                               75                    70
                               72                    90
                               81                    75
     Sunny     2/9  3/5   μ =   73   74.6     μ =  79.1  86.2     False   6/9  2/5    9/14   5/14
     Overcast  4/9  0/5   σ =  6.2    7.9     σ =  10.2   9.7     True    3/9  3/5
     Rainy     3/9  2/5

     f(temperature=66 | yes) =          f(humidity=90 | yes) =
     f(temperature=66 | no)  =          f(humidity=90 | no)  =
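
As a sketch, the per-class means and standard deviations in the table can be reproduced with Python's statistics module, assuming the sample (n−1) standard deviation:

    import statistics

    # Attribute values split by class, taken from the weather data table above.
    temperature = {"yes": [83, 70, 68, 64, 69, 75, 75, 72, 81],
                   "no":  [85, 80, 65, 72, 71]}
    humidity    = {"yes": [86, 96, 80, 65, 70, 80, 70, 90, 75],
                   "no":  [85, 90, 70, 95, 91]}

    for name, per_class in [("temperature", temperature), ("humidity", humidity)]:
        for cls, xs in per_class.items():
            mu = statistics.mean(xs)        # sample mean
            sigma = statistics.stdev(xs)    # sample standard deviation (n-1 denominator)
            print(f"{name} | {cls}: mean = {mu:.1f}, std dev = {sigma:.1f}")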

  5. PDF Calculator http://easycalculation.com/statistics/normal-pdf.php With μ = 73 and σ = 6.2, what is the value of the PDF?
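
For example, evaluating that density at the new day's temperature of 66 (the same computation the online calculator performs); scipy is used here only for convenience:

    from scipy.stats import norm

    # f(temperature=66 | yes): mu=73 and sigma=6.2 come from the "yes" column of the table
    print(norm.pdf(66, loc=73, scale=6.2))   # roughly 0.034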

  6. Classifying a New Day: Outlook=Sunny, Temperature=66, Humidity=90, Windy=True, Play=? • Likelihood of "yes" = ? • Likelihood of "no" = ? • P("yes") = ? • P("no") = ? • Note: missing values during training are not included in the calculation of the mean and standard deviation
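
One way to carry out the calculation the slide asks for, sketched in Python using the fractions and normal densities from the table on slide 4; the last lines normalize the two likelihoods into P("yes") and P("no"):

    from scipy.stats import norm

    # Counts and statistics taken from the weather data table (slide 4).
    prior         = {"yes": 9/14, "no": 5/14}
    outlook_sunny = {"yes": 2/9,  "no": 3/5}
    windy_true    = {"yes": 3/9,  "no": 3/5}
    temp_stats    = {"yes": (73, 6.2),   "no": (74.6, 7.9)}   # (mean, std dev)
    humid_stats   = {"yes": (79.1, 10.2), "no": (86.2, 9.7)}

    likelihood = {}
    for cls in ("yes", "no"):
        mu_t, sd_t = temp_stats[cls]
        mu_h, sd_h = humid_stats[cls]
        likelihood[cls] = (outlook_sunny[cls]
                           * norm.pdf(66, mu_t, sd_t)   # temperature = 66
                           * norm.pdf(90, mu_h, sd_h)   # humidity = 90
                           * windy_true[cls]
                           * prior[cls])

    total = sum(likelihood.values())
    for cls in ("yes", "no"):
        print(cls, likelihood[cls], likelihood[cls] / total)   # normalized probability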

  7. Naive Bayes: Discussion • Naive Bayes works surprisingly well (even if the independence assumption is clearly violated) • Why? Because classification doesn't require accurate probability estimates as long as the maximum probability is assigned to the correct class • However: adding too many redundant attributes will cause problems (e.g. identical attributes) • Note also: many numeric attributes are not normally distributed; if the distribution of an attribute is known, it can be used, otherwise --> kernel density estimators
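
Where the normal assumption does not fit, a kernel density estimate can stand in for f(x); a minimal sketch using scipy's gaussian_kde (an illustrative example, not from the slides):

    from scipy.stats import gaussian_kde

    # Temperature values for class "yes" (from the weather data table).
    temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]

    kde = gaussian_kde(temps_yes)    # Gaussian kernels, bandwidth chosen automatically
    print(kde.evaluate([66])[0])     # estimated density f(temperature=66 | yes)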

  8. Constructing Decision Trees • Strategy: top down, in a recursive divide-and-conquer fashion • First: select an attribute for the root node and create a branch for each possible attribute value • Then: split the instances into subsets, one for each branch extending from the node • Finally: repeat recursively for each branch, using only the instances that reach that branch • Stop if all instances have the same class
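
A compact sketch of this top-down, divide-and-conquer procedure for nominal attributes; the attribute-selection step anticipates the information gain criterion defined on the following slides:

    import math
    from collections import Counter

    def entropy(labels):
        """entropy(p1, ..., pn) over the class distribution of a list of labels."""
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def build_tree(instances, labels, attributes):
        # Stop if all instances have the same class (or no attributes remain):
        # return the majority class as a leaf.
        if len(set(labels)) == 1 or not attributes:
            return Counter(labels).most_common(1)[0][0]

        # First: select the attribute with the greatest information gain at this node.
        def gain(attr):
            remainder = 0.0
            for value in set(inst[attr] for inst in instances):
                subset = [lab for inst, lab in zip(instances, labels) if inst[attr] == value]
                remainder += len(subset) / len(labels) * entropy(subset)
            return entropy(labels) - remainder

        best = max(attributes, key=gain)

        # Then: split the instances into subsets, one per branch, and recurse.
        tree = {best: {}}
        for value in set(inst[best] for inst in instances):
            idx = [i for i, inst in enumerate(instances) if inst[best] == value]
            tree[best][value] = build_tree([instances[i] for i in idx],
                                           [labels[i] for i in idx],
                                           [a for a in attributes if a != best])
        return tree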

  9. Which Attribute to Select?

  10. Criterion for Attribute Selection Which is the best attribute? • Want to get the smallest tree • Heuristic: choose the attribute that produces the “purest” nodes • Popular impurity criterion: information gain • Information gain increases with the average purity of the subsets • Strategy: choose attribute that gives greatest information gain

  11. Information Theory: Measure Information in Bits • Information gain = the amount of information gained by knowing the value of the attribute = (entropy of the distribution before the split) − (entropy of the distribution after it) • Claude Shannon, American mathematician and scientist (1916–2001), came up with the whole idea of information theory and of quantifying entropy, which measures information in bits. He could ride a unicycle and juggle clubs at the same time -- when he was in his 80s. That's pretty impressive. He was living in Massachusetts when he died of Alzheimer's disease.

  12. Computing Information: Measure Information in Bits • Given a probability distribution, the information required to predict an event is the distribution's entropy • Entropy gives the information required in bits (it can involve fractions of a bit) • Formula for computing the entropy: entropy(p1, p2, ..., pn) = −p1 log2 p1 − p2 log2 p2 − ... − pn log2 pn • Example: Weather Data. What do we know before we split? There are 9 yes and 5 no outcomes. Calculate the information: info([9, 5]) = entropy(9/14, 5/14) = ?
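
A quick sketch of this calculation; the helper mirrors the entropy formula above:

    import math

    def entropy(*probs):
        """entropy(p1, ..., pn) = -p1*log2(p1) - ... - pn*log2(pn)"""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Before the split: 9 "yes" and 5 "no" outcomes, i.e. info([9, 5])
    print(entropy(9/14, 5/14))   # roughly 0.940 bits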
