Naive Bayes for Document Classification: Illustrative Example
Document Classification • Given a document, find its class (e.g. headlines, sports, economics, fashion…) • We assume the document is a "bag-of-words": d ~ {t1, t2, t3, …, tnd}, where nd is the number of tokens in d • Using Naive Bayes with a multinomial distribution, we pick the most probable class: cmap = argmaxc P(c) · P(t1|c) · P(t2|c) · … · P(tnd|c)
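A bag-of-words representation keeps only term counts and discards word order. A minimal sketch in Python (the tokenizer and the example document are illustrative, not from the slides):

    from collections import Counter

    def bag_of_words(text):
        # Lowercase and split on whitespace; a real tokenizer would
        # also strip punctuation and possibly stem.
        return Counter(text.lower().split())

    print(bag_of_words("Chinese Beijing Chinese"))
    # Counter({'chinese': 2, 'beijing': 1})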
Binomial Distribution • n independent trials (each a Bernoulli trial), each of which results in success with probability p • The binomial distribution gives the probability of each possible split of the n trials between the two categories. • e.g. You flip a coin 10 times with PHeads = 0.6. What is the probability of getting 8 H, 2 T? • P(k) = C(n, k) · p^k · (1 − p)^(n − k) • with k being the number of successes (or, to see the similarity with the multinomial, consider the first class being selected k times and the second n − k times)
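To check the coin example, the formula can be evaluated directly with Python's standard library (math.comb is available from Python 3.8 on):

    from math import comb

    n, k, p = 10, 8, 0.6
    # P(k) = C(10, 8) * 0.6^8 * 0.4^2
    prob = comb(n, k) * p**k * (1 - p)**(n - k)
    print(f"P(8 H, 2 T) = {prob:.4f}")  # ≈ 0.1209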
Multinomial Distribution • Generalization of the binomial distribution • n independent trials, each of which results in one of k outcomes • The multinomial distribution gives the probability of any particular combination of counts across the k categories. • e.g. You have balls of three colours in a bin (3 balls of each colour => pR = pG = pB = 1/3), from which you draw n = 9 balls with replacement. What is the probability of getting 8 Red, 1 Green, 0 Blue? • P(x1, x2, x3) = n! / (x1! · x2! · x3!) · p1^x1 · p2^x2 · p3^x3
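Plugging the ball-drawing example into the formula gives roughly 0.000457; a small sketch (the helper function is mine):

    from math import factorial

    def multinomial_pmf(counts, probs):
        # n! / (x1! * ... * xk!) * p1^x1 * ... * pk^xk
        n = sum(counts)
        coef = factorial(n)
        prob = 1.0
        for x, p in zip(counts, probs):
            coef //= factorial(x)
            prob *= p**x
        return coef * prob

    # 8 Red, 1 Green, 0 Blue out of n = 9 draws, pR = pG = pB = 1/3
    print(multinomial_pmf([8, 1, 0], [1/3, 1/3, 1/3]))  # ≈ 0.000457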
Naive Bayes w/ Multinomial Model (from McCallum and Nigam, 1998): each document is modeled as a sequence of independent draws from the class's term distribution, so term frequencies matter. Advanced
Naive Bayes w/ Multivariate Bernoulli Model (from McCallum and Nigam, 1998): each document is modeled as a binary vector over the vocabulary, recording only the presence or absence of each term. Advanced
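In practice the two event models differ in what they extract from a document; a sketch of the two feature representations (vocabulary and variable names are illustrative):

    from collections import Counter

    VOCAB = ["beijing", "chinese", "japan", "macao", "tokyo", "shanghai"]
    doc = "Chinese Beijing Chinese".lower().split()

    # Multinomial model: term frequencies
    multinomial_features = Counter(doc)

    # Multivariate Bernoulli model: presence/absence over the whole vocabulary
    bernoulli_features = {t: int(t in doc) for t in VOCAB}

    print(multinomial_features)  # Counter({'chinese': 2, 'beijing': 1})
    print(bernoulli_features)    # {'beijing': 1, 'chinese': 1, 'japan': 0, ...}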
Smoothing For each term t, we need to estimate P(t|c). The maximum-likelihood estimate is P(t|c) = Tct / Σt′ Tct′, where Tct is the count of term t in all documents of class c and the sum in the denominator runs over all terms t′ in the vocabulary
Smoothing Because the estimate above will be 0 if a term does not appear with a class in the training data, we need smoothing. Laplace Smoothing adds one to each count: P(t|c) = (Tct + 1) / (Σt′ Tct′ + |V|), where |V| is the number of terms in the vocabulary
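A minimal sketch of the smoothed estimator (function and variable names are mine, not from the slides):

    def laplace_estimate(term, class_counts, vocab):
        # class_counts: dict mapping term -> Tct for one class c
        # P(t|c) = (Tct + 1) / (sum over t' of Tct' + |V|)
        total = sum(class_counts.values())
        return (class_counts.get(term, 0) + 1) / (total + len(vocab))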
Example • Two topic classes: "China", "not China" • V = {Beijing, Chinese, Japan, Macao, Tokyo, Shanghai} • N = 4 training documents
Classification / Probability Estimation: to classify a test document d, compute P(c) · ∏ P(ti|c) for each class using the Laplace-smoothed estimates above, and assign d to the class with the higher score.
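The slide's worked numbers did not survive extraction; the sketch below reruns the whole pipeline on the classic China / not-China training set from Manning, Raghavan and Schütze's IR textbook, which this example appears to follow (the training and test documents are an assumption, not recovered from the slides):

    from collections import Counter, defaultdict

    # Assumed training set (standard textbook example)
    train = [
        ("Chinese Beijing Chinese", "China"),
        ("Chinese Chinese Shanghai", "China"),
        ("Chinese Macao", "China"),
        ("Tokyo Japan Chinese", "not China"),
    ]
    test = "Chinese Chinese Chinese Tokyo Japan"

    vocab = {t for doc, _ in train for t in doc.split()}
    counts = defaultdict(Counter)   # class -> term counts Tct
    n_docs = Counter()              # class -> number of documents
    for doc, cls in train:
        counts[cls].update(doc.split())
        n_docs[cls] += 1

    def score(cls):
        # P(c) * product of P(ti|c), with Laplace smoothing
        prior = n_docs[cls] / len(train)
        total = sum(counts[cls].values())
        s = prior
        for t in test.split():
            s *= (counts[cls][t] + 1) / (total + len(vocab))
        return s

    for cls in n_docs:
        print(cls, score(cls))
    # "China" wins: ≈ 0.0003 vs ≈ 0.0001 for "not China"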
Summary: Miscellaneous • Naïve Bayes is linear in the time it takes to scan the data. • When we have many terms, the product of probabilities will cause a floating-point underflow; therefore we sum logs of probabilities instead of multiplying the probabilities themselves. • For a large training set, the vocabulary is large, and it is better to select only a subset of terms; this is called feature selection. • However, accuracy is not badly affected by irrelevant attributes if the training data is large.
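A sketch of the log-space trick mentioned above:

    from math import log

    def log_score(prior, term_probs):
        # log(P(c) * prod P(ti|c)) = log P(c) + sum of log P(ti|c);
        # sums of logs stay in a representable range even for
        # documents with thousands of terms.
        return log(prior) + sum(log(p) for p in term_probs)

    # 1000 terms each with probability 0.01: the raw product
    # underflows to 0.0, but the log score is fine.
    print(0.5 * 0.01**1000)               # 0.0 (underflow)
    print(log_score(0.5, [0.01] * 1000))  # ≈ -4605.9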