Computational Intelligence in Biomedical and Health Care Informatics, HCA 590 (Topics in Health Sciences). Rohit Kate. Naïve Bayes Model.

Presentation Transcript


  1. Computational Intelligence in Biomedical and Health Care Informatics, HCA 590 (Topics in Health Sciences). Rohit Kate. Naïve Bayes Model

  2. Reading Paper: Constructing naïve Bayesian classifiers for veterinary medicine: A case study in the clinical diagnosis of classical swine fever, P. L. Geenen, L. C. van der Gaag, W. L. A. Loeffen, A. R. W. Elbers, Research in Veterinary Science 91 (2011) 64-70 (skip 2.3.2)

  3. Estimating Joint Probability Distributions is Not Easy :-( • Suppose prostate cancer control (2 states) depends on: • Tumor Dose (5 states) • PSA (6 states) • Gleason (4 states) • T Stage (3 states) • Lymph Node Dose (2 states) • Local Tumor Control (2 states) • Lymph Node Control (2 states) • Total entries in the joint probability table: 5*6*4*3*2*2*2*2 = 5,760 • Not easy to estimate from clinical trials!
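
A minimal Python sketch (not from the slides) that recomputes the size of the joint probability table from the state counts listed above:

```python
# Count the entries a full joint probability table would need for the
# prostate cancer example above (state counts taken from the slide).
from math import prod

state_counts = {
    "Prostate cancer control": 2,
    "Tumor Dose": 5,
    "PSA": 6,
    "Gleason": 4,
    "T Stage": 3,
    "Lymph Node Dose": 2,
    "Local Tumor Control": 2,
    "Lymph Node Control": 2,
}

print(prod(state_counts.values()))  # 5760 entries: far too many to estimate from clinical trials
```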

  4. Estimating Joint Probability Distributions • Simplifying assumptions are made about the joint probability distribution to reduce the number of parameters to estimate • Letting the random variables be nodes of a graph, there are two major types of simplification, represented as: • Directed probabilistic graphical models • Simplest: Naïve Bayes model • General: Bayesian networks • Undirected probabilistic graphical models • Simplest: Maximum entropy model • General: Markov networks

  5. Directed Graphical Models • Also known as Bayesian networks or Bayes nets • Simplification assumption: Some random variables are conditionally independent of others given values for some other random variables • Simplest directed graphical model: Naïve Bayesian or Naïve Bayes • Naïve Bayes assumption: The features (inputs) are conditionally independent given the category (output) • Very simplistic assumption, hence called “naïve”

  6. Probabilistic Classification and Naïve Bayes • Suppose X1,...,Xn are inputs and Y is an output • We want to predict Y given all inputs, i.e. P(Y|X1,...,Xn), for example, P(Label|Shape,Color) • Naïve Bayes Assumption: The features are conditionally independent given the category: P(X1,...,Xn|Y) = P(X1|Y)*P(X2|Y)*...*P(Xn|Y) • How do we estimate P(Y|X1,X2,...,Xn) from this? • Bayes’ theorem lets us calculate P(B|A) in terms of P(A|B)

  7. Recall Bayes’ Theorem: P(B|A) = P(A|B)*P(B) / P(A) • Simple proof from the definition of conditional probability: P(B|A) = P(A,B)/P(A) (def. cond. prob.) and P(A|B) = P(A,B)/P(B) (def. cond. prob.), so P(A,B) = P(A|B)*P(B); substituting into the first equation gives P(B|A) = P(A|B)*P(B)/P(A). QED

  8. Naïve Bayes Model • From Bayes’ Theorem: P(Y|X1,...,Xn) = P(X1,...,Xn|Y)*P(Y) / P(X1,...,Xn) • Computing the marginal in the denominator (by the definition of conditional probability): P(X1,...,Xn) = Σy P(X1,...,Xn|Y=y)*P(Y=y), so P(Y|X1,...,Xn) = P(X1,...,Xn|Y)*P(Y) / Σy P(X1,...,Xn|Y=y)*P(Y=y)

  9. Naïve Bayes Model • Applying the naïve Bayes assumption to the previous expression: P(Y|X1,...,Xn) = P(Y)*P(X1|Y)*...*P(Xn|Y) / Σy P(Y=y)*P(X1|Y=y)*...*P(Xn|Y=y)

  10. Naïve Bayes Model • We only need to estimate P(Y) and P(Xi|Y) for all i • Assuming Y and all the Xi are binary, only 2n+1 parameters instead of 2^(n+1) - 1 parameters: a dramatic reduction (for n=9, a reduction from 1023 to 19!) • Directed graphical model representation: [Figure: Y (with P(Y)) is the parent of X1, X2, X3, ..., Xn, each annotated with P(Xi|Y); illustrated with Lightning as Y and Rain, Thunder as features]
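
A small sketch, assuming binary Y and binary features X1..Xn, that contrasts the two parameter counts quoted above:

```python
# Parameter counts for n binary features and a binary class Y.
def full_joint_parameters(n):
    # The full joint table over (Y, X1..Xn) has 2^(n+1) entries;
    # one is determined by the others because the entries sum to 1.
    return 2 ** (n + 1) - 1

def naive_bayes_parameters(n):
    # One parameter for P(Y), plus P(Xi|Y=0) and P(Xi|Y=1) for each feature.
    return 2 * n + 1

for n in (3, 9, 20):
    print(n, full_joint_parameters(n), naive_bayes_parameters(n))
# n=9 gives 1023 versus 19, the reduction quoted on the slide
```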

  11. Naïve Bayes Example • [Figure: Y (with P(Y)) is the parent of Size, Color, and Shape, with conditional tables P(Size|Y), P(Color|Y), P(Shape|Y)]

  12. Naïve Bayes Example P(Label|Size,Color,Shape) Test Instance: <medium, red, circle> What is the probability of it being positive?

  13. Naïve Bayes Example (relevant part of the probability table) Test Instance: <medium, red, circle>
  P(positive|medium,red,circle) = P(medium,red,circle|positive)*P(positive) / P(medium,red,circle) = P(positive)*P(medium|positive)*P(red|positive)*P(circle|positive) / P(medium,red,circle)
  P(medium,red,circle) = P(medium,red,circle|positive)*P(positive) + P(medium,red,circle|negative)*P(negative) = P(positive)*P(medium|positive)*P(red|positive)*P(circle|positive) + P(negative)*P(medium|negative)*P(red|negative)*P(circle|negative) = 0.5*0.1*0.9*0.9 + 0.5*0.2*0.3*0.3 = 0.0405 + 0.009 = 0.0495
  P(positive|medium,red,circle) = 0.5*0.1*0.9*0.9 / 0.0495 = 0.0405/0.0495 = 0.8182
  P(negative|medium,red,circle) = 0.5*0.2*0.3*0.3 / 0.0495 = 0.009/0.0495 = 0.1818
  Hence the test example is more likely to be positive.
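
The same computation as a short Python sketch; the priors and conditional probabilities below are the ones implied by the slide's arithmetic (P(positive)=P(negative)=0.5, P(medium|positive)=0.1, P(red|positive)=P(circle|positive)=0.9, P(medium|negative)=0.2, P(red|negative)=P(circle|negative)=0.3):

```python
# Recompute the posterior for <medium, red, circle> under the naive Bayes assumption.
from math import prod

priors = {"positive": 0.5, "negative": 0.5}
cond = {
    "positive": {"medium": 0.1, "red": 0.9, "circle": 0.9},
    "negative": {"medium": 0.2, "red": 0.3, "circle": 0.3},
}
features = ["medium", "red", "circle"]

# Unnormalized scores P(Y) * prod_i P(Xi|Y)
scores = {y: priors[y] * prod(cond[y][f] for f in features) for y in priors}
evidence = sum(scores.values())                    # P(medium, red, circle) = 0.0495
posteriors = {y: s / evidence for y, s in scores.items()}
print(posteriors)                                  # positive ~ 0.818, negative ~ 0.182
```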

  14. Naïve Bayes Example • [Figure: Prostate cancer (PC), with its prior or pre-test probability, is the parent of two test results, High PSA (HPSA) and Abnormal Free PSA (FPSA); each test is annotated with its sensitivity and specificity] • Independence assumption: Given presence or absence of prostate cancer, the results of the two tests are independent. • This is all the information needed to compute, for example, P(PC=Present|HPSA=Positive,FPSA=Positive), called the posterior or post-test probability.
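
A sketch of the post-test calculation this slide sets up; the prior, sensitivities, and specificities below are hypothetical placeholders, since the figure's actual numbers are not preserved in this transcript:

```python
# Post-test probability of prostate cancer given two positive test results,
# under the naive Bayes independence assumption. All numbers are made up.
prior = 0.10                               # pre-test probability P(PC = Present)
sens = {"HPSA": 0.80, "FPSA": 0.70}        # P(test positive | PC present)
spec = {"HPSA": 0.75, "FPSA": 0.85}        # P(test negative | PC absent)

p_present = prior * sens["HPSA"] * sens["FPSA"]
p_absent = (1 - prior) * (1 - spec["HPSA"]) * (1 - spec["FPSA"])

post_test = p_present / (p_present + p_absent)     # posterior / post-test probability
print(round(post_test, 3))                         # ~0.62 with these placeholder numbers
```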

  15. Estimating P(Y) and P(X|Y) • Normally, probabilities are estimated from observed frequencies in the training data. • If the data contain n examples and nk examples belong to category yk, then P(Y=yk) = nk / n • If nijk of these nk examples have the jth value xij for feature Xi, then P(Xi=xij | Y=yk) = nijk / nk • Although intuitive, why is dividing this way the best estimate of the probability? • These are called maximum likelihood estimates: they maximize the probability of observing the given examples
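
A minimal sketch of these frequency-count (maximum likelihood) estimates on a tiny, made-up training set:

```python
# Maximum likelihood estimates of P(Y) and P(Xi|Y) from raw counts.
from collections import Counter, defaultdict

# Hypothetical training examples: (size, color, shape) -> label
examples = [
    (("small", "red", "circle"), "positive"),
    (("large", "red", "circle"), "positive"),
    (("small", "red", "triangle"), "negative"),
    (("large", "blue", "square"), "negative"),
]

n = len(examples)
class_counts = Counter(label for _, label in examples)      # n_k
feature_counts = defaultdict(Counter)                       # n_ijk
for features, label in examples:
    for i, value in enumerate(features):
        feature_counts[label][(i, value)] += 1

p_y = {y: n_k / n for y, n_k in class_counts.items()}                     # P(Y = y_k) = n_k / n
p_x_given_y = {y: {fv: c / class_counts[y] for fv, c in counts.items()}   # P(Xi = x_ij | Y = y_k) = n_ijk / n_k
               for y, counts in feature_counts.items()}
print(p_y)
print(p_x_given_y["positive"])
```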

  16. Maximum Likelihood Estimate • P(A teenager will drink and drive) = (# of teenagers who drink and drive) / (# of teenagers) • Suppose that out of a sample of 5 teenagers, one drinks and drives: P(A teenager will drink and drive) = 1/5 • This estimate can be proven to be the maximum likelihood estimate (MLE) because it maximizes the probability of generating the sampled data • Any other probability value gives a lower probability to the sampled data

  17. Maximum Likelihood Estimates • If P(TDD) = 1/5 then P(not TDD) = 4/5 • Probability of the sampled data, assuming each teenager is independent: P(Data) = (1/5)*(4/5)^4 = 0.08192 • P(Data) is lower for any other value of P(TDD): • For P(TDD) = 1/6: P(Data) = (1/6)*(5/6)^4 = 0.0804 • For P(TDD) = 1/4: P(Data) = (1/4)*(3/4)^4 = 0.0791 • This can also be shown mathematically • Hence simple frequency counts are not only intuitive but also theoretically the best probability estimates
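
A short sketch that reproduces this comparison by evaluating the likelihood of the sample at different candidate values of P(TDD):

```python
# Likelihood of observing 1 drinker-and-driver out of 5 independent teenagers,
# as a function of the assumed probability p; it is maximized at p = 1/5.
def likelihood(p, successes=1, total=5):
    return (p ** successes) * ((1 - p) ** (total - successes))

for p in (1/6, 1/5, 1/4):
    print(round(p, 3), round(likelihood(p), 5))
# 0.167 0.08038 / 0.2 0.08192 / 0.25 0.0791 -- the frequency count 1/5 wins
```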

  18. Probability Estimation Example • Test Instance: <medium, red, circle> • P(positive | medium, red, circle) = 0.5 * 0.0 * 1.0 * 1.0 / P(medium, red, circle) = 0? • P(negative | medium, red, circle) = 0.5 * 0.0 * 0.5 * 0.5 / P(medium, red, circle) = 0? • Because "medium" was never observed with either class in the small training sample, both estimates become zero; this motivates smoothing (next slide)

  19. Smoothing • To account for estimation from small samples, probability estimates are adjusted or smoothed. • Laplace smoothing using an m-estimate assumes that each feature is given a prior probability, p, that is assumed to have been previously observed in a “virtual” sample of size m. • For binary features, p is simply assumed to be 0.5.

  20. Laplace Smoothing Example • Assume the training set contains 10 positive examples: 4 small, 0 medium, 6 large • Estimate parameters as follows (with m=1, p=1/3): P(small | positive) = (4 + 1/3) / (10 + 1) = 0.394, P(medium | positive) = (0 + 1/3) / (10 + 1) = 0.03, P(large | positive) = (6 + 1/3) / (10 + 1) = 0.576
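
The same m-estimate computation as a sketch; note that the unobserved "medium" value no longer gets a zero probability:

```python
# Laplace-smoothed (m-estimate) probabilities for the Size feature,
# reproducing the numbers on the slide (m = 1, p = 1/3 for three possible values).
counts = {"small": 4, "medium": 0, "large": 6}    # counts among the 10 positive examples
n_positive = sum(counts.values())
m, p = 1, 1 / 3                                   # virtual sample size and prior

smoothed = {value: (c + m * p) / (n_positive + m) for value, c in counts.items()}
print({k: round(v, 3) for k, v in smoothed.items()})
# {'small': 0.394, 'medium': 0.03, 'large': 0.576}
```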

  21. Reading: An Application of Naïve Bayes in Medicine Paper: Constructing naïve Bayesian classifiers for veterinary medicine: A case study in the clinical diagnosis of classical swine fever, P. L. Geenen, L. C. van der Gaag, W. L. A. Loeffen, A. R. W. Elbers, Research in Veterinary Science 91 (2011) 64-70 (skip 2.3.2) • Why this paper? It demonstrates a real-world application of the naïve Bayes model in medicine, and it is recent and easy to understand • Applies the naïve Bayes model to predict classical swine fever in herds • Uses 32 binary features (symptoms shown by one or more pigs in a herd) • Probabilities are estimated using data collected from another study • The naïve Bayes model performs as well as manually constructed diagnostic rules for predicting classical swine fever

  22. Comments on Naïve Bayes • Tends to work well in practice despite the naïve assumption of conditional independence. • Although it does not produce accurate probability estimates when its independence assumptions are violated, it may still pick the correct maximum-probability class in many cases. • It is a machine learning method and is also available in Weka.
