Bayesian Classification Dr. Navneet Goyal BITS, Pilani
Bayesian Classification • What are Bayesian Classifiers? • Statistical Classifiers • Predict class membership probabilities • Based on Bayes Theorem • Naïve Bayesian Classifier • Computationally Simple • Performance comparable with decision tree (DT) and neural network (NN) classifiers
Bayesian Classification • Probabilistic learning: Calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
Bayes Theorem • Let X be a data sample whose class label is unknown • Let H be the hypothesis that X belongs to a class C • For classification we want to determine P(H|X) • P(H|X) is the probability that H holds given the observed data sample X • P(H|X) is the posterior probability
Bayes Theorem Example: Sample space: All Fruits X is "round" and "red" H = hypothesis that X is an Apple P(H|X) is our confidence that X is an apple given that X is "round" and "red" • P(H) is the Prior Probability of H, i.e., the probability that any given data sample is an apple, regardless of how it looks • P(H|X) is based on more information • Note that P(H) is independent of X
Bayes Theorem Example: Sample space: All Fruits • P(X|H)? • It is the probability that X is round and red given that we know X is an apple • Here P(X) is the prior probability = P(a data sample from our set of fruits is red and round)
Estimating Probabilities • P(X), P(H), and P(X|H) may be estimated from the given data • Bayes Theorem: P(H|X) = P(X|H) P(H) / P(X) • Use of Bayes Theorem in the Naïve Bayesian Classifier!!
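The fruit example above can be worked through numerically. The probabilities below are hypothetical, chosen only to illustrate how the three estimated quantities combine; a minimal sketch:

```python
# Hypothetical fruit-basket numbers (not from any real data set):
# P(H)   = P(apple)                  -- prior
# P(X|H) = P(round and red | apple)  -- likelihood
# P(X)   = P(round and red)          -- evidence
p_h = 0.30          # assume 30% of fruits are apples
p_x_given_h = 0.90  # assume 90% of apples are round and red
p_x = 0.45          # assume 45% of all fruits are round and red

# Bayes Theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # → 0.6
```

Seeing that X is round and red raises our confidence that it is an apple from the prior 0.30 to the posterior 0.60.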
Naïve Bayesian Classification • Also called Simple BC • Why Naïve/Simple?? • Class Conditional Independence • The effect of an attribute's value on a given class is independent of the values of the other attributes • This assumption simplifies computations
Naïve Bayesian Classification Steps Involved • Each data sample is of the form X = (x1, x2, …, xn), where xi is the value of X for attribute Ai • Suppose there are m classes Ci, i = 1, …, m. X ∈ Ci iff P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i, i.e. the BC assigns X to the class Ci having the highest posterior probability conditioned on X
Naïve Bayesian Classification The class Ci for which P(Ci|X) is maximized is called the maximum a posteriori hypothesis. From Bayes Theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X) • P(X) is constant for all classes, so only P(X|Ci) P(Ci) need be maximized • If class prior probabilities are not known, assume all classes equally likely • Otherwise estimate P(Ci) = si/s, where si is the number of training samples in class Ci and s is the total number of training samples Problem: computing P(X|Ci) directly is infeasible! (find out how you would find it and why it is infeasible)
Naïve Bayesian Classification • Naïve assumption: attribute independence ⇒ P(X|Ci) = P(x1,…,xn|Ci) = ∏k P(xk|Ci) • To classify an unknown sample X, evaluate P(X|Ci) P(Ci) for each class Ci. Sample X is assigned to the class Ci iff P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i
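The decision rule above can be sketched directly: multiply the class prior by the per-attribute conditionals and keep the class with the largest product. The data structures (`priors`, `cond_probs`) and the toy two-class, one-attribute example are assumptions made for this sketch, not part of the slides:

```python
import math

def naive_bayes_classify(x, priors, cond_probs):
    """Return the class Ci maximizing P(Ci) * prod_k P(x_k | Ci).

    priors:     {class: P(Ci)}
    cond_probs: {class: list, one dict {value: P(x_k=value | Ci)}
                 per attribute k} (hypothetical layout for this sketch)
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = prior
        for k, value in enumerate(x):
            score *= cond_probs[c][k][value]  # naive independence step
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy example: one attribute with values "a"/"b", two classes
priors = {"Y": 0.6, "N": 0.4}
cond_probs = {"Y": [{"a": 0.8, "b": 0.2}],
              "N": [{"a": 0.3, "b": 0.7}]}
print(naive_bayes_classify(("a",), priors, cond_probs))  # → Y
print(naive_bayes_classify(("b",), priors, cond_probs))  # → N
```

Note that, because P(X) is dropped, the scores are unnormalized; only their order matters for classification.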
Naïve Bayesian Classification EXAMPLE
Naïve Bayesian Classification EXAMPLE X = (age<=30, income=MEDIUM, student=Y, credit_rating=FAIR, buys_comp=???) We need to maximize P(X|Ci) P(Ci) for i = 1, 2. P(Ci) is estimated from the training sample: P(buys_comp=Y) = 9/14 = 0.643 P(buys_comp=N) = 5/14 = 0.357 How to calculate P(X|Ci) P(Ci) for i = 1, 2? P(X|Ci) = P(x1, x2, x3, x4|Ci) = ∏k P(xk|Ci)
Naïve Bayesian Classification EXAMPLE P(age<=30 | buys_comp=Y)=2/9=0.222 P(age<=30 | buys_comp=N)=3/5=0.600 P(income=medium | buys_comp=Y)=4/9=0.444 P(income=medium | buys_comp=N)=2/5=0.400 P(student=Y | buys_comp=Y)=6/9=0.667 P(student=Y | buys_comp=N)=1/5=0.200 P(credit_rating=FAIR | buys_comp=Y)=6/9=0.667 P(credit_rating=FAIR | buys_comp=N)=2/5=0.400
Naïve Bayesian Classification EXAMPLE P(X | buys_comp=Y)=0.222*0.444*0.667*0.667=0.044 P(X | buys_comp=N)=0.600*0.400*0.200*0.400=0.019 P(X | buys_comp=Y)P(buys_comp=Y) = 0.044*0.643=0.028 P(X | buys_comp=N)P(buys_comp=N) = 0.019*0.357=0.007 Since 0.028 > 0.007, CONCLUSION: X buys a computer (buys_comp=Y)
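The whole worked example can be reproduced in a few lines, using exactly the counts read off the slides:

```python
# Class priors from the 14-tuple training set on the slides
p_y = 9 / 14  # P(buys_comp=Y)
p_n = 5 / 14  # P(buys_comp=N)

# P(x_k | Ci) for age<=30, income=medium, student=Y, credit_rating=FAIR
likelihood_y = (2/9) * (4/9) * (6/9) * (6/9)   # ≈ 0.044
likelihood_n = (3/5) * (2/5) * (1/5) * (2/5)   # ≈ 0.019

score_y = likelihood_y * p_y   # ≈ 0.028
score_n = likelihood_n * p_n   # ≈ 0.007

print("buys computer" if score_y > score_n else "does not buy")  # → buys computer
```

Working with the exact fractions (rather than the rounded three-decimal figures on the slide) gives the same decision.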
Naïve Bayes Classifier: Issues • Probability values can be ZERO! • Recall what you observed in WEKA! • What if Ak is continuous valued? • Recall what you observed in WEKA! If there are no tuples in the training set corresponding to students for the class buys_comp=N, then P(student=Y | buys_comp=N) = 0 Implications? Solution?
Naïve Bayes Classifier: Issues • Laplacian Correction or Laplace Estimator • Philosophy – we assume the training data set is so large that adding one to each count we need makes only a negligible difference in the estimated probability value • Example: D has 1000 tuples in class buys_comp=Y income=low – 0 tuples income=medium – 990 tuples income=high – 10 tuples Without Laplacian correction the probabilities are 0, 0.990, and 0.010 With Laplacian correction: 1/1003 = 0.001, 991/1003 = 0.988, and 11/1003 = 0.011 respectively
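The correction on this slide can be checked directly: add 1 to each count and add the number of distinct values (here 3) to the denominator. A minimal sketch using the slide's counts:

```python
# Counts of income within class buys_comp=Y (from the slide's example)
counts = {"low": 0, "medium": 990, "high": 10}
total = sum(counts.values())  # 1000

# Without correction: P(income=low | Y) = 0 wipes out the whole product
raw = {v: c / total for v, c in counts.items()}

# Laplacian correction: +1 to each count, +|distinct values| to the total
k = len(counts)  # 3 distinct income values
smoothed = {v: (c + 1) / (total + k) for v, c in counts.items()}
print(smoothed)  # {'low': 1/1003, 'medium': 991/1003, 'high': 11/1003}
```

The smoothed estimates are nonzero, still sum to 1, and differ only negligibly from the raw ones for the well-populated values.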
Naïve Bayes Classifier: Issues • Continuous attributes need more work than categorical attributes! • A continuous attribute is typically assumed to have a Gaussian distribution with mean μ and standard deviation σ • Do it yourself! And cross-check with WEKA!
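Under the Gaussian assumption, P(xk|Ci) for a continuous attribute is read off the normal density with the class's sample mean and standard deviation. A minimal sketch (the μ = 38, σ = 12 figures for age are invented for illustration, not taken from the slides' data set):

```python
import math

def gaussian_density(x, mu, sigma):
    """Normal density used as P(x | Ci) for a continuous attribute,
    with mu and sigma estimated from the class's training tuples."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# e.g. age within class buys_comp=Y, with hypothetical mu=38, sigma=12
print(gaussian_density(30, 38, 12))
```

This value is a density, not a probability, but it plugs into the product ∏k P(xk|Ci) in the same way, since only the relative scores across classes matter.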
Naïve Bayes (Summary) • Robust to isolated noise points • Handles missing values by ignoring the instance during probability estimation • Robust to irrelevant attributes • The independence assumption may not hold for some attributes • Use other techniques, such as Bayesian Belief Networks (BBN), in that case
Probability Calculations No. of attributes = 4 Distinct values per attribute = 3, 3, 3, 3 No. of classes = 2 Total no. of probability calculations in NBC = 4*3*2 = 24 What if conditional independence were not assumed? O(k^p) joint value combinations for p k-valued attributes, multiplied by m classes
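The two counts above can be checked with a few lines of arithmetic for this slide's data set (4 three-valued attributes, 2 classes):

```python
n_attrs, values_per_attr, n_classes = 4, 3, 2

# Naive Bayes: one estimate P(x_k=v | Ci) per (attribute, value, class)
nbc = n_attrs * values_per_attr * n_classes
print(nbc)  # → 24

# Without conditional independence: one estimate P(x1,...,x4 | Ci) per
# joint value combination and class, i.e. k^p * m
full_joint = (values_per_attr ** n_attrs) * n_classes
print(full_joint)  # → 162
```

Even on this tiny schema the full joint needs 162 estimates versus 24, and the gap grows exponentially in the number of attributes, which is exactly why the naive assumption is made.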