Data Mining and Knowledge Acquisition - Chapter 5 III - BIS 541 2011-2012 Spring

Chapter 7. Classification and Prediction • Bayesian Classification • Memory-Based Reasoning • Collaborative Filtering • Classification accuracy

Bayesian Classification: Why? • Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems • Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct, and prior knowledge can be combined with observed data • Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities • Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured

Bayes' Theorem: Basics • Let X be a data sample whose class label is unknown • Let H be the hypothesis that X belongs to class C • For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X • P(H): prior probability of hypothesis H (i.e., the initial probability before we observe any data; it reflects background knowledge) • P(X): probability that the sample data is observed • P(X|H): probability of observing the sample X, given that the hypothesis holds

Bayes' Theorem • Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem: P(H|X) = P(X|H)·P(H) / P(X) • Informally, this can be written as posterior = likelihood x prior / evidence • MAP (maximum a posteriori) hypothesis: the H that maximizes P(H|X), equivalently the H that maximizes P(X|H)·P(H), since P(X) does not depend on H • Practical difficulty: requires initial knowledge of many probabilities and has significant computational cost

Naïve Bayes Classifier • A simplifying assumption: attributes are conditionally independent given the class • The probability of observing, say, two attribute values x1 and x2 together, given that the class is C, is the product of the probabilities of each value taken separately, given the same class: P([x1,x2]|C) = P(x1|C) * P(x2|C) • No dependence relation between attributes • Greatly reduces the computation cost: only the class distribution needs to be counted • Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci)*P(Ci)

Example • H: X is an apple • P(H): prior probability that X is an apple • X: observed data, round and red • P(H|X): posterior probability that X is an apple given that we observe that it is red and round • P(X|H): probability that X is red and round given that it is an apple (the likelihood) • P(X): prior probability that X is red and round

Applying Bayes' Theorem • P(H|X) = P(H,X)/P(X) by the definition of conditional probability • Similarly, P(X|H) = P(H,X)/P(H), so P(H,X) = P(X|H)P(H) • Hence P(H|X) = P(X|H)P(H)/P(X), which is Bayes' theorem • P(H|X) can therefore be calculated from P(X|H), P(H), and P(X)

Bayesian classification • The classification problem may be formalized using a posteriori probabilities: • P(Ci|X) = probability that the sample tuple X=<x1,…,xk> is of class Ci, where there are m classes Ci, i = 1 to m • E.g. P(class=N | outlook=sunny,windy=true,…) • Idea: assign to sample X the class label Ci such that P(Ci|X) is maximal • That is, P(Ci|X) > P(Cj|X) for all 1<=j<=m, j≠i

Estimating a posteriori probabilities • Bayes' theorem: P(Ci|X) = P(X|Ci)·P(Ci) / P(X) • P(X) is constant for all classes • P(Ci) = relative frequency of class Ci samples • The Ci for which P(Ci|X) is maximum is the Ci for which P(X|Ci)·P(Ci) is maximum • Problem: computing P(X|Ci) directly is infeasible!

Naïve Bayesian Classification • Naïve assumption: attribute independence, P(x1,…,xk|Ci) = P(x1|Ci)·…·P(xk|Ci) • If the i-th attribute is categorical: P(xi|Ci) is estimated as the relative frequency of samples having value xi as the i-th attribute in class Ci, = sik/si • If the i-th attribute is continuous: P(xi|Ci) is estimated through a Gaussian density function • Computationally easy in both cases
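The continuous case can be sketched as follows. This is a minimal illustration, not the course's implementation: the helper names `fit_gaussian` and `gaussian_density` and the list of ages are hypothetical, and the sample (n-1) variance is one common choice among several.

```python
import math

def fit_gaussian(values):
    """Sample mean and (n-1) standard deviation of one attribute within one class."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance)

def gaussian_density(x, mean, std):
    """Value of the normal density N(mean, std^2) at x."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * std)
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * std ** 2))

# Hypothetical ages of the training samples in class Ci (illustrative only).
ages_in_class = [25, 32, 38, 41, 29]
mean, std = fit_gaussian(ages_in_class)

# Density value used in place of P(age=35 | Ci) in the naive Bayes product.
likelihood = gaussian_density(35, mean, std)
```

The result is a density value rather than a probability, which is acceptable here because only the comparison between classes matters.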

Training dataset • Classes: C1: buys_computer = ‘yes’, C2: buys_computer = ‘no’ • Data sample X = (age<=30, income=medium, student=yes, credit_rating=fair)

Given the new customer X = (age<=30, income=medium, student=yes, credit_rating=fair), what is the probability of buying a computer? • Compute P(buys_computer=yes|X) and P(buys_computer=no|X) • Decision: list both as probabilities, or choose the class with the maximum conditional probability

Compute • P(buys_computer=yes|X) = P(X|yes)*P(yes)/P(X) • P(buys_computer=no|X) = P(X|no)*P(no)/P(X) • P(X) is the same in both expressions, so drop it • Decision: the maximum of • P(X|yes)*P(yes) • P(X|no)*P(no)

Naïve Bayesian Classifier: Example • Compute P(X|Ci)*P(Ci) for each class • P(X|C=yes)*P(yes) = P(age=“<=30” | buys_computer=“yes”) * P(income=“medium” | buys_computer=“yes”) * P(student=“yes” | buys_computer=“yes”) * P(credit_rating=“fair” | buys_computer=“yes”) * P(C=yes)

P(X|C=no)*P(no) = P(age=“<=30” | buys_computer=“no”) * P(income=“medium” | buys_computer=“no”) * P(student=“yes” | buys_computer=“no”) * P(credit_rating=“fair” | buys_computer=“no”) * P(C=no)

Naïve Bayesian Classifier: Example • P(age=“<=30” | buys_computer=“yes”) = 2/9 = 0.222 • P(income=“medium” | buys_computer=“yes”) = 4/9 = 0.444 • P(student=“yes” | buys_computer=“yes”) = 6/9 = 0.667 • P(credit_rating=“fair” | buys_computer=“yes”) = 6/9 = 0.667 • P(age=“<=30” | buys_computer=“no”) = 3/5 = 0.6 • P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4 • P(student=“yes” | buys_computer=“no”) = 1/5 = 0.2 • P(credit_rating=“fair” | buys_computer=“no”) = 2/5 = 0.4 • P(buys_computer=“yes”) = 9/14 = 0.643 • P(buys_computer=“no”) = 5/14 = 0.357

P(X|buys_computer=“yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 • P(X|buys_computer=“yes”) * P(buys_computer=“yes”) = 0.044 * 0.643 = 0.028 • P(X|buys_computer=“no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019 • P(X|buys_computer=“no”) * P(buys_computer=“no”) = 0.019 * 0.357 = 0.007 • Since 0.028 > 0.007, X belongs to class “buys_computer=yes”

Class probabilities • P(yes|X) = P(X|yes)*P(yes)/P(X) • P(no|X) = P(X|no)*P(no)/P(X) • What is P(X)? • P(X) = P(X|yes)*P(yes) + P(X|no)*P(no) • = 0.028 + 0.007 • = 0.035 • So • P(yes|X) = 0.028/0.035 = 0.80 • P(no|X) = 0.007/0.035 = 0.20 • Hence • P(yes|X) + P(no|X) = 1
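The buys_computer calculation can be rechecked with a short script. This is a sketch that takes the conditional probabilities and priors from the example as given and uses exact fractions to avoid rounding drift; the names `cond`, `prior`, and `score` are illustrative.

```python
from fractions import Fraction as F

# P(attribute value | class) for X = (age<=30, income=medium, student=yes, credit_rating=fair),
# copied from the counts in the example.
cond = {
    "yes": [F(2, 9), F(4, 9), F(6, 9), F(6, 9)],
    "no":  [F(3, 5), F(2, 5), F(1, 5), F(2, 5)],
}
prior = {"yes": F(9, 14), "no": F(5, 14)}

def score(label):
    """Unnormalized posterior P(X|C) * P(C) under the naive independence assumption."""
    result = prior[label]
    for p in cond[label]:
        result *= p
    return result

scores = {label: score(label) for label in prior}
evidence = sum(scores.values())  # P(X), by the law of total probability
posterior = {label: s / evidence for label, s in scores.items()}
prediction = max(posterior, key=posterior.get)

for label in prior:
    print(label, float(scores[label]), float(posterior[label]))
print("predicted class:", prediction)
```

Dividing by the evidence does not change which class wins; it only rescales the two scores so that they sum to 1.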

Naïve Bayesian Classifier: Comments • Advantages: • Easy to implement • Good results obtained in most cases • Disadvantages: • Assumes class-conditional independence, which causes a loss of accuracy when the assumption does not hold • In practice, dependencies exist among variables • E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.) • Dependencies among these cannot be modeled by a Naïve Bayesian classifier • How to deal with these dependencies? • Bayesian belief networks

Bayesian Networks • A Bayesian belief network allows a subset of the variables to be conditionally independent • A graphical model of causal relationships • Represents dependency among the variables • Gives a specification of the joint probability distribution • Nodes: random variables • Links: dependency • [Figure: directed graph over the nodes X, Y, Z, P] • X and Y are the parents of Z, and Y is the parent of P • No direct dependency (no link) between Z and P • Has no loops or cycles
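To make "a specification of the joint probability distribution" concrete: in a belief network the joint distribution is the product of each node's probability given its parents. Applied to the four-node graph described on this slide, and assuming X and Y have no parents of their own, this gives the factorization below (written with \Pr to avoid a clash with the node named P).

```latex
\Pr(x_1, \dots, x_n) = \prod_{i=1}^{n} \Pr\bigl(x_i \mid \mathrm{Parents}(x_i)\bigr)
\qquad\Longrightarrow\qquad
\Pr(X, Y, Z, P) = \Pr(X)\,\Pr(Y)\,\Pr(Z \mid X, Y)\,\Pr(P \mid Y)
```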

Bayesian Belief Network: An Example • [Figure: belief network over the nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea] • The conditional probability table (CPT) for the variable LungCancer (LC) shows the conditional probability for each possible combination of the values of its parents, FamilyHistory (FH) and Smoker (S): • (FH, S): LC 0.8, ~LC 0.2 • (FH, ~S): LC 0.5, ~LC 0.5 • (~FH, S): LC 0.7, ~LC 0.3 • (~FH, ~S): LC 0.1, ~LC 0.9

Learning Bayesian Networks • Several cases • Given both the network structure and all variables observable: learn only the CPTs • Network structure known, some hidden variables: method of gradient descent, analogous to neural network learning • Network structure unknown, all variables observable: search through the model space to reconstruct graph topology • Unknown structure, all hidden variables: no good algorithms known for this purpose • D. Heckerman, Bayesian networks for data mining

Other Classification Methods • k-nearest neighbor classifier • case-based reasoning • Genetic algorithm • Rough set approach • Fuzzy set approaches

Instance-Based Methods • Instance-based learning: • Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified • Typical approaches • k-nearest neighbor approach • Instances represented as points in a Euclidean space. • Locally weighted regression • Constructs local approximation • Case-based reasoning • Uses symbolic representations and knowledge-based inference

Nearest Neighbor Approaches • Based on the concept of similarity • Memory-Based Reasoning (MBR): results are based on analogous situations in the past • Collaborative Filtering: results use preferences in addition to analogous situations from the past

Memory-Based Reasoning (MBR) • Our ability to reason from experience depends on our ability to recognize appropriate examples from the past… • Traffic patterns/routes • Movies • Food • We identify similar example(s) and apply what we know or have learned to the current situation • These similar examples in MBR are referred to as neighbors

MBR Applications • Fraud detection • Customer response prediction • Medical treatments • Classifying responses – MBR can process free-text responses and assign codes

MBR Strengths • Ability to use data “as is” – utilizes both a distance function and a combination function between data records to help determine how “neighborly” they are • Ability to adapt – adding new data makes it possible for MBR to learn new things • Good results without lengthy training

MBR Example – Rents in Tuxedo, NY • Find the nearest neighbors based on descriptive variables: population and median home prices (not geography in this example) • The rent-range midpoints of the 2 nearest neighbors are $1,000 and $1,250, so the Tuxedo rent should be their average, $1,125; a 2nd method yields a rent of $977 • The actual midpoint rent in Tuxedo turns out to be $1,250 by one method and $907 by another

MBR Challenges • Choosing appropriate historical data for use in training • Choosing the most efficient way to represent the training data • Choosing the distance function, combination function, and the number of neighbors

Distance Function • For numerical variables: • Absolute value of the difference: |A-B| • Ex: d(27,51) = |27-51| = 24 • Square of the difference: (A-B)^2 • Ex: d(27,51) = (27-51)^2 = 24^2 = 576 • Normalized absolute value: |A-B| / maximum difference • Ex: d(27,51) = |27-51|/|27-52| = 24/25 = 0.96 • Standardized absolute value: • |A-B| / standard deviation • For categorical variables (similar to clustering): • Ex: gender • d(male,male)=0, d(female,female)=0 • d(male,female)=1, d(female,male)=1
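The per-variable distance functions listed on this slide can be sketched in a few lines. The function names are illustrative, and the maximum difference (25, from |27-52| in the slide's example) and the standard deviation are assumed to be supplied by the caller.

```python
def abs_distance(a, b):
    """Absolute difference |a - b|."""
    return abs(a - b)

def squared_distance(a, b):
    """Squared difference (a - b)^2."""
    return (a - b) ** 2

def normalized_distance(a, b, max_difference):
    """|a - b| scaled by the largest difference in the data, giving a value in [0, 1]."""
    return abs(a - b) / max_difference

def standardized_distance(a, b, std_dev):
    """|a - b| expressed in units of the variable's standard deviation."""
    return abs(a - b) / std_dev

def categorical_distance(a, b):
    """0 when the two categories match, 1 otherwise."""
    return 0 if a == b else 1

print(abs_distance(27, 51), squared_distance(27, 51), normalized_distance(27, 51, 25))
print(categorical_distance("male", "female"), categorical_distance("male", "male"))
```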

Combining distances between variables • Manhattan (summation) • Ex: dsum(A,B) = dgender(A,B) + dsalary(A,B) + dage(A,B) • Normalized summation • Ex: dsum(A,B) / max dsum • Euclidean • deuc(A,B) = • sqrt(dgender(A,B)^2 + dsalary(A,B)^2 + dage(A,B)^2)
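A minimal sketch of the three ways of combining per-variable distances. It assumes each per-variable distance has already been computed and scaled to [0, 1]; the example values for gender, salary, and age are hypothetical.

```python
import math

def manhattan(distances):
    """Sum of the per-variable distances."""
    return sum(distances)

def normalized_sum(distances, max_sum):
    """Manhattan distance divided by the largest possible (or observed) sum."""
    return sum(distances) / max_sum

def euclidean(distances):
    """Square root of the sum of squared per-variable distances."""
    return math.sqrt(sum(d ** 2 for d in distances))

# Hypothetical distances between records A and B: gender, salary, age.
d_ab = [1, 0.25, 0.96]

print(manhattan(d_ab))
print(normalized_sum(d_ab, max_sum=3))  # three variables, each at most 1
print(euclidean(d_ab))
```

The choice matters when one variable dominates: squaring in the Euclidean form gives large single-variable differences more weight than the plain sum does.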

The Combination Function • For categorical target variables • Voting: majority rule • Weighted voting • Weights inversely proportional to the distance • For numerical target variables • Take the average • Weighted average • Weights inversely proportional to the distance
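For a categorical target, the two voting schemes can be sketched as below. The neighbor list is hypothetical, and the weight 1/distance is one simple way to make weights inversely proportional to distance (it assumes every distance is strictly positive).

```python
from collections import defaultdict

def majority_vote(neighbors):
    """neighbors: list of (label, distance) pairs; each neighbor gets one vote."""
    totals = defaultdict(int)
    for label, _ in neighbors:
        totals[label] += 1
    return max(totals, key=totals.get)

def weighted_vote(neighbors):
    """Each neighbor's vote counts 1/distance, so closer neighbors count more."""
    totals = defaultdict(float)
    for label, distance in neighbors:
        totals[label] += 1.0 / distance
    return max(totals, key=totals.get)

# Hypothetical neighbors: one close "yes" and two more distant "no" records.
neighbors = [("yes", 1.0), ("no", 2.0), ("no", 4.0)]

print(majority_vote(neighbors))  # two votes against one
print(weighted_vote(neighbors))  # weights: yes = 1.0, no = 0.5 + 0.25
```

The two schemes disagree on this input, which shows why the choice of combination function is listed among the MBR design decisions.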

Collaborative Filtering • Lots of human examples of this: • Best teachers • Best courses • Best restaurants (ambiance, service, food, price) • Recommend a dentist, mechanic, PC repair, blank CDs/DVDs, wines, B&Bs, etc… • CF is a variant of MBR particularly well suited to personalized recommendations

Collaborative Filtering • Starts with a history of people’s personal preferences • Uses a distance function – people who like the same things are “close” • Uses “votes” which are weighted by distances, so close neighbor votes count more • Basically, judgments of a peer group are important

Collaborative Filtering • Knowing that lots of people liked something is not sufficient… • Who liked it is also important • A friend whose past recommendations were good (or bad) • A high-profile person whose opinion seems to carry influence • Collaborative Filtering automates this everyday word-of-mouth activity

Preparing Recommendations for Collaborative Filtering • Building customer profile – ask new customer to rate selection of things • Comparing this new profile to other customers using some measure of similarity • Using some combination of the ratings from similar customers to predict what the new customer would select for items he/she has NOT yet rated

Collaborative Filtering Example • What rating would Nathaniel give to Planet of the Apes? • Simon, distance 2, rated it -1 • Amelia, distance 4, rated it -4 • Using a weighted average with weights inversely proportional to distance (1/2 = 0.5 for Simon, 1/4 = 0.25 for Amelia), it is predicted that he would rate it -2 • = (0.5*-1 + 0.25*-4) / (0.5 + 0.25) = -1.5 / 0.75 = -2 • Nathaniel can of course enter his own rating after seeing the movie, which could be close to or far from the prediction
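The prediction for Nathaniel can be reproduced with a distance-weighted average. This is a sketch: the function name `predict_rating` is illustrative, and the weight 1/distance follows the slide's arithmetic.

```python
def predict_rating(neighbor_ratings):
    """neighbor_ratings: list of (rating, distance) pairs with distance > 0.

    Returns the average rating weighted by 1/distance.
    """
    weights = [1.0 / distance for _, distance in neighbor_ratings]
    weighted_sum = sum(w * rating for w, (rating, _) in zip(weights, neighbor_ratings))
    return weighted_sum / sum(weights)

# Simon (distance 2, rating -1) and Amelia (distance 4, rating -4), from the slide.
neighbors = [(-1, 2), (-4, 4)]
prediction = predict_rating(neighbors)
print(prediction)
```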

Holdout estimation • What to do if the amount of data is limited? • The holdout method reserves a certain amount for testing and uses the remainder for training • Usually: one third for testing, the rest for training • Problem: the samples might not be representative • Example: class might be missing in the test data • Advanced version uses stratification • Ensures that each class is represented with approximately equal proportions in both subsets
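A stratified holdout split can be sketched with the standard library alone. The function name, the fixed seed, and the 9/6 class counts are assumptions for illustration; the one-third test fraction follows the slide.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=1 / 3, seed=0):
    """Return (train_indices, test_indices), sampling each class separately
    so both subsets keep roughly the original class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Hypothetical label column: 9 "yes" and 6 "no" instances.
labels = ["yes"] * 9 + ["no"] * 6
train_idx, test_idx = stratified_holdout(labels)
print(len(train_idx), len(test_idx))
```

Because each class is split on its own, no class can be missing from the test set unless it is too small to contribute a single test instance.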

Repeated holdout method • The holdout estimate can be made more reliable by repeating the process with different subsamples • In each iteration, a certain proportion is randomly selected for training (possibly with stratification) • The error rates on the different iterations are averaged to yield an overall error rate • This is called the repeated holdout method • Still not optimal: the different test sets overlap • Can we prevent overlapping?

Cross-validation • Cross-validation avoids overlapping test sets • First step: split data into k subsets of equal size • Second step: use each subset in turn for testing, the remainder for training • Called k-fold cross-validation • Often the subsets are stratified before the cross-validation is performed • The error estimates are averaged to yield an overall error estimate
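The two steps of k-fold cross-validation can be sketched as follows. The function names are illustrative, `error_on_fold` is a hypothetical caller-supplied function that trains on one index set and returns the error rate on the other, and the data is assumed to have been shuffled (or stratified) beforehand.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds over n instances.

    Fold sizes differ by at most one, and every instance is tested exactly once.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def cross_validated_error(n, k, error_on_fold):
    """Average the per-fold error rates returned by error_on_fold(train, test)."""
    errors = [error_on_fold(train, test) for train, test in k_fold_indices(n, k)]
    return sum(errors) / len(errors)

for train, test in k_fold_indices(10, 3):
    print(test)
```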

More on cross-validation • Standard method for evaluation: stratified ten-fold cross-validation • Why ten? • Extensive experiments have shown that this is the best choice to get an accurate estimate • There is also some theoretical evidence for this • Stratification reduces the estimate’s variance • Even better: repeated stratified cross-validation • E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)

Leave-One-Out cross-validation • Leave-One-Out: a particular form of cross-validation: • Set the number of folds to the number of training instances • I.e., for n training instances, build the classifier n times • Makes best use of the data • Involves no random subsampling • Very computationally expensive • (exception: nearest-neighbor classifiers)

Leave-One-Out-CV and stratification • Disadvantage of Leave-One-Out-CV: stratification is not possible • It guarantees a non-stratified sample because there is only one instance in the test set! • Extreme example: random dataset split equally into two classes • Best inducer predicts the majority class • 50% accuracy on fresh data • Holding out one instance makes its class the minority in the remaining training data, so the held-out instance is always misclassified • Leave-One-Out-CV estimate is 100% error!
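The extreme example can be checked directly. This sketch uses a hypothetical dataset of 50 instances per class and a classifier that always predicts the majority class of its training set.

```python
from collections import Counter

def majority_class(labels):
    """Most frequent label in the training set."""
    return Counter(labels).most_common(1)[0][0]

def leave_one_out_error(labels):
    """Error rate of the majority-class predictor under leave-one-out CV."""
    errors = 0
    for i, held_out in enumerate(labels):
        training = labels[:i] + labels[i + 1:]
        if majority_class(training) != held_out:
            errors += 1
    return errors / len(labels)

# Two equally sized classes; removing one instance leaves its class at 49 versus 50.
labels = ["a"] * 50 + ["b"] * 50
error = leave_one_out_error(labels)
print(error)
```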

The bootstrap • CV uses sampling without replacement • The same instance, once selected, cannot be selected again for a particular training/test set • The bootstrap uses sampling with replacement to form the training set • Sample a dataset of n instances n times with replacement to form a new dataset of n instances • Use this data as the training set • Use the instances from the original dataset that don’t occur in the new training set for testing
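A minimal sketch of forming one bootstrap training/test split over instance indices. The function name and the fixed seed are assumptions made for reproducibility of the illustration.

```python
import random

def bootstrap_split(n, seed=0):
    """Draw n indices with replacement for training; test on the indices never drawn."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]  # duplicates are expected
    drawn = set(train)
    test = [i for i in range(n) if i not in drawn]
    return train, test

train_idx, test_idx = bootstrap_split(1000)
print(len(train_idx), len(set(train_idx)), len(test_idx))
```

For large n, the chance that a given instance is never drawn is (1 - 1/n)^n, which approaches e^-1, about 0.368, so roughly a third of the original instances typically end up in the test set.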
