Bayesian Learning Application to Text Classification. Example: spam filtering


Presentation Transcript


1. KI2 - 3: Bayesian Learning Application to Text Classification. Example: spam filtering. Marius Bulacu & prof. dr. Lambert Schomaker, Kunstmatige Intelligentie / RuG

2. Founders of Probability Theory. Pierre Fermat (1601-1665, France) and Blaise Pascal (1623-1662, France). They laid the foundations of probability theory in a correspondence about a dice game.

3. Prior, Joint and Conditional Probabilities
P(A) = prior probability of A
P(B) = prior probability of B
P(A, B) = joint probability of A and B
P(A|B) = conditional (posterior) probability of A given B
P(B|A) = conditional (posterior) probability of B given A

4. Probability Rules
Product rule: P(A, B) = P(A|B) P(B), or equivalently P(A, B) = P(B|A) P(A)
Sum rule: P(A) = Σ_B P(A, B) = Σ_B P(A|B) P(B)
If A is conditionalized on B, then the total probability of A is the sum of its joint probabilities over all values of B.
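To make the two rules concrete, here is a minimal Python sketch; the joint table is an invented example, not from the slides. It recovers both marginals by the sum rule and checks the product rule on every cell:

```python
# A minimal sketch (invented numbers): the sum and product rules on a
# small joint distribution over two binary variables A and B.
P_AB = {(0, 0): 0.3, (0, 1): 0.2,
        (1, 0): 0.1, (1, 1): 0.4}   # joint P(A, B); entries sum to 1

# Sum rule: P(A) = sum over B of P(A, B), and likewise for P(B)
P_A = {a: sum(p for (ai, b), p in P_AB.items() if ai == a) for a in (0, 1)}
P_B = {b: sum(p for (a, bi), p in P_AB.items() if bi == b) for b in (0, 1)}

# Product rule: P(A, B) = P(A|B) P(B)
for (a, b), p in P_AB.items():
    p_a_given_b = p / P_B[b]          # conditional obtained from the joint
    assert abs(p_a_given_b * P_B[b] - p) < 1e-12

print(P_A)   # {0: 0.5, 1: 0.5}
print(P_B)   # {0: 0.4, 1: 0.6}
```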

5. Statistical Independence
Two random variables A and B are independent iff:
• P(A, B) = P(A) P(B)
• P(A|B) = P(A)
• P(B|A) = P(B)
Knowing the value of one variable does not yield any information about the value of the other.
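Applying this test to the invented joint table from the sketch above shows that those two variables are dependent, since the joint does not factorize into the product of the marginals:

```python
# Independence check on the same invented joint table as above.
P_AB = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
P_A = {0: 0.5, 1: 0.5}   # marginals obtained by the sum rule
P_B = {0: 0.4, 1: 0.6}

for (a, b), p in P_AB.items():
    print((a, b), p, "vs", P_A[a] * P_B[b])   # e.g. (0, 0): 0.3 vs 0.2
# P(A, B) != P(A) P(B) in every cell, so knowing B does tell us about A.
```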

6. Statistical Dependence - Bayes. Thomas Bayes (1702-1761, England). His "An Essay towards solving a Problem in the Doctrine of Chances" was published in the Philosophical Transactions of the Royal Society of London in 1764.

7. Bayes Theorem
P(A|B) = P(A, B) / P(B) and P(B|A) = P(A, B) / P(A)
⇒ P(A, B) = P(A|B) P(B) = P(B|A) P(A)
⇒ P(A|B) = P(B|A) P(A) / P(B)

8. Bayes Theorem - Causality
P(A|B) = P(B|A) P(A) / P(B)
Diagnostic: P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)
Pattern Recognition: P(Class|Feature) = P(Feature|Class) P(Class) / P(Feature)

9. Bayes Formula and Classification
P(Class|Data) = P(Data|Class) P(Class) / P(Data)
• P(Data|Class): conditional likelihood of the data given the class
• P(Class): prior probability of the class, before seeing anything
• P(Class|Data): posterior probability of the class, after seeing the data
• P(Data): unconditional probability of the data

10. Medical example
p(+disease) = 0.002
p(+test | +disease) = 0.97
p(+test | -disease) = 0.04
p(+test) = p(+test | +disease) · p(+disease) + p(+test | -disease) · p(-disease) = 0.97 · 0.002 + 0.04 · 0.998 = 0.00194 + 0.03992 = 0.04186
p(+disease | +test) = p(+test | +disease) · p(+disease) / p(+test) = 0.00194 / 0.04186 ≈ 0.046
p(-disease | +test) = p(+test | -disease) · p(-disease) / p(+test) = 0.03992 / 0.04186 ≈ 0.954
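The slide's arithmetic can be reproduced directly in Python (all numbers taken from the slide):

```python
# The slide's medical example, so the arithmetic can be checked directly.
p_disease = 0.002                # prior: p(+disease)
p_test_given_disease = 0.97      # sensitivity: p(+test | +disease)
p_test_given_healthy = 0.04      # false-positive rate: p(+test | -disease)

# Unconditional probability of a positive test (sum rule)
p_test = (p_test_given_disease * p_disease
          + p_test_given_healthy * (1 - p_disease))

# Bayes theorem: posterior probability of disease given a positive test
p_disease_given_test = p_test_given_disease * p_disease / p_test

print(f"p(+test)            = {p_test:.5f}")                      # 0.04186
print(f"p(+disease | +test) = {p_disease_given_test:.3f}")        # 0.046
print(f"p(-disease | +test) = {1 - p_disease_given_test:.3f}")    # 0.954
```

Despite the accurate test, a positive result still means only about a 5% chance of disease, because the prior p(+disease) is so small.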

11. MAP Classification
• To minimize the probability of misclassification, assign a new input x to the class with the Maximum A Posteriori probability, i.e. assign x to class C1 if: p(C1|x) > p(C2|x) ⇔ p(x|C1) p(C1) > p(x|C2) p(C2)
• The decision boundary must therefore be placed where the two posterior probability distributions cross each other.
[Figure: the two posteriors p(C1|x) ∝ p(x|C1) p(C1) and p(C2|x) ∝ p(x|C2) p(C2) plotted over x, with the decision threshold at their crossing point.]
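A minimal sketch of the MAP rule for two classes, assuming Gaussian class-conditional densities; the priors, means, and standard deviations are invented for illustration and do not come from the slides:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Illustrative parameters (assumptions for this sketch)
prior = {"C1": 0.6, "C2": 0.4}
mu    = {"C1": 0.0, "C2": 3.0}
sigma = {"C1": 1.0, "C2": 1.0}

def map_classify(x):
    # Assign x to the class maximizing p(x|C) p(C); the normalizer p(x)
    # is the same for both classes, so it can be dropped.
    scores = {c: gauss_pdf(x, mu[c], sigma[c]) * prior[c] for c in prior}
    return max(scores, key=scores.get)

print(map_classify(0.5))   # C1: left of the decision boundary (~1.64 here)
print(map_classify(2.8))   # C2: right of the decision boundary
```

With equal priors the same rule reduces to the maximum-likelihood classification of the next slide.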

12. Maximum Likelihood Classification
• When the prior class distributions are not known, or for equal (non-informative) priors: p(x|C1) p(C1) > p(x|C2) p(C2) reduces to p(x|C1) > p(x|C2)
• Therefore assign the input x to the class with the maximum likelihood of having generated it.

13. Continuous Features
Two methods for dealing with continuous-valued features:
• Binning: divide the range of continuous values into a discrete number of bins, then apply the discrete methodology (see the sketch below).
• Mixture of Gaussians: make an assumption regarding the functional form of the PDF (a linear combination of Gaussians) and estimate the corresponding parameters (means and standard deviations).
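A minimal binning sketch, with invented temperature readings, range, and bin count: a continuous value is mapped to a discrete bin index, and per-class bin frequencies then estimate P(bin|class) exactly as in the discrete case:

```python
# Binning a continuous feature (invented data, not from the slides).

def to_bin(x, lo, hi, n_bins):
    """Map x in [lo, hi] to a bin index in 0 .. n_bins-1."""
    i = int((x - lo) / (hi - lo) * n_bins)
    return min(max(i, 0), n_bins - 1)      # clamp values at the edges

# Illustrative temperature readings (degrees C) with class labels
samples = [(2.0, "cold"), (4.5, "cold"), (18.0, "warm"), (21.5, "warm")]

counts = {}
for x, c in samples:
    b = to_bin(x, lo=0.0, hi=30.0, n_bins=6)    # six 5-degree-wide bins
    counts[(c, b)] = counts.get((c, b), 0) + 1

# P(bin | class) = count(class, bin) / count(class)
class_totals = {}
for (c, b), n in counts.items():
    class_totals[c] = class_totals.get(c, 0) + n
for (c, b), n in sorted(counts.items()):
    print(f"P(bin {b} | {c}) = {n / class_totals[c]:.2f}")
```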

14. Accumulation of Evidence
p(C|X,Y) ∝ p(X,Y,C) = p(C) p(X,Y|C) = p(C) p(X|C) p(Y|C,X)
… and with a further observation Z: p(C|X,Y,Z) ∝ p(C) p(X|C) p(Y|C,X) p(Z|C,X,Y)
(at each step, the latest posterior plays the role of the new prior)
• Bayesian inference allows for integrating prior knowledge about the world (beliefs expressed in terms of probabilities) with new incoming data.
• Different forms of data (possibly incommensurable) can be fused towards the final decision using the "common currency" of probability.
• As new data arrives, the latest posterior becomes the new prior for interpreting the next input.
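A small sketch of this sequential updating, with invented likelihood numbers: after each observation the normalized posterior becomes the prior for the next step:

```python
# Sequential evidence accumulation (invented numbers, not from the slides).

def update(prior, likelihoods):
    """One Bayes step: posterior(c) is proportional to prior(c) * p(obs | c)."""
    unnorm = {c: prior[c] * likelihoods[c] for c in prior}
    z = sum(unnorm.values())                 # p(observation), the normalizer
    return {c: p / z for c, p in unnorm.items()}

belief = {"spam": 0.5, "ham": 0.5}           # non-informative initial prior

# Likelihood of each successive piece of evidence under the two classes
evidence_stream = [
    {"spam": 0.8, "ham": 0.3},               # e.g. first word observed
    {"spam": 0.6, "ham": 0.4},               # e.g. second word observed
]

for likelihoods in evidence_stream:
    belief = update(belief, likelihoods)     # latest posterior -> new prior
    print(belief)                            # belief in spam grows: 0.73, 0.80
```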

15. Example: temperature classification
Classes: Cold, Normal, Warm, Hot.
[Figure: the class-conditional likelihoods P(x|C), P(x|N), P(x|W), P(x|H) plotted over the temperature x, together with the unconditional P(x).]

16. Bayes: probability "blow up"
[Figure: the corresponding posteriors P(C|x), P(N|x), P(W|x), P(H|x) for the classes Cold, Normal, Warm, Hot; Bayes' rule rescales the likelihoods into posteriors.]

17. Even with an irregular PDF shape in the likelihood P(x|C), the Bayesian output P(C|x) = P(x|C) P(C) / P(x) has a nice plateau.
[Figure: irregular likelihood as input, plateau-shaped posterior as output.]

18. Puzzle
• If Bayes is optimal and can also be used for continuous data, why did it become popular so late, i.e. much later than neural networks?

19. Why Bayes has become popular so late…
• Note: the example was 1-dimensional.
• A PDF (histogram) with 100 bins for one dimension will cost 10,000 bins for two dimensions, etc.
• N_cells = N_bins^n_dims
[Figure: a 1-dimensional PDF P(x) over x.]
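The slide's formula is easy to evaluate; this snippet just prints N_bins ** n_dims for a 100-bin-per-dimension histogram to show the exponential blow-up:

```python
# Cell count of a binned PDF: N_cells = N_bins ** n_dims (slide 19).
n_bins = 100
for n_dims in (1, 2, 3, 4):
    print(f"{n_dims} dim(s): {n_bins ** n_dims:,} cells")
# 1 dim(s): 100 cells
# 2 dim(s): 10,000 cells
# 3 dim(s): 1,000,000 cells
# 4 dim(s): 100,000,000 cells
```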

20. Why Bayes has become popular so late…
• N_cells = N_bins^n_dims
• Yes… but you could use n-dimensional theoretical distributions (Gauss, Weibull, etc.) instead of empirically measured PDFs…

21. Why Bayes has become popular so late…
• … even using theoretical distributions instead of empirically measured PDFs …
• the dimensionality is still a problem:
• about 20 samples are needed to estimate a 1-dim. Gaussian PDF
• about 400 samples are needed to estimate a 2-dim. Gaussian, etc.
• massive amounts of labeled data are needed to estimate probabilities reliably!

22. Labeled (ground-truthed) data
Example: client evaluation in insurance. Each row is one client, described by seven feature values and a class label:

0.1   0.54  0.53  0.874  8.455  0.001  –0.111  risk
0.2   0.59  0.01  0.974  8.40   0.002  –0.315  risk
0.11  0.4   0.3   0.432  7.455  0.013  –0.222  safe
0.2   0.64  0.13  0.774  8.123  0.001  –0.415  risk
0.1   0.17  0.59  0.813  9.451  0.021  –0.319  risk
0.8   0.43  0.55  0.874  8.852  0.011  –0.227  safe
0.1   0.78  0.63  0.870  8.115  0.002  –0.254  risk
...

23. Success of speech recognition
• massive amounts of data
• increased computing power
• cheap computer memory
→ allowed for the use of Bayes in hidden Markov models for speech recognition
• similarly (but slower): application of Bayes in script recognition

24. Global structure:
• year
• title
• date
• date and number of entry ("Rappt")
• redundant lines between paragraphs
• jargon words: "Notificatie", "Besluit fiat"
• imprint with page number
→ XML model

  25. Local probabilistic structure: P(“Novb 16 is a date” | “sticks out to the left” & is left of “Rappt ”) ?

26. Naive Bayes: Conditional Independence
• Naive Bayes assumes the attributes (features) are independent given the class: p(X,Y|C) = p(X|C) p(Y|C), or in general p(x1, …, xn|C) = ∏_i p(xi|C)
• It often works surprisingly well in practice despite its manifest simplicity.

27. Accumulation of Evidence – Independence
Under the "naive" assumption that X and Y are independent given the class, the chain of slide 14 simplifies to: p(C|X,Y) ∝ p(C) p(X|C) p(Y|C)

28. The Naive Bayes Classifier
Assume that each sample x to be classified is described by the attributes a1, a2, …, an.
• The most probable (MAP) classification for x is: c_MAP = argmax_c P(c | a1, …, an) = argmax_c P(a1, …, an | c) P(c)
• Naive Bayes independence assumption: P(a1, …, an | c) = ∏_i P(ai | c)
• Therefore: c_NB = argmax_c P(c) ∏_i P(ai | c)
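A minimal sketch of this decision rule with invented toy probabilities. The product is computed in log space, a standard implementation detail (not mentioned on the slides) that avoids numerical underflow when many attribute probabilities are multiplied:

```python
import math

# Naive Bayes decision rule on invented toy numbers.
priors = {"spam": 0.4, "ham": 0.6}

# P(word | class) for a tiny assumed vocabulary
likelihood = {
    "spam": {"free": 0.05,  "money": 0.04,  "meeting": 0.001},
    "ham":  {"free": 0.005, "money": 0.002, "meeting": 0.03},
}

def classify(words):
    scores = {}
    for c in priors:
        log_p = math.log(priors[c])            # log P(c)
        for w in words:
            log_p += math.log(likelihood[c][w])  # + sum of log P(ai | c)
        scores[c] = log_p
    return max(scores, key=scores.get)         # argmax over classes

print(classify(["free", "money"]))   # spam
print(classify(["meeting"]))         # ham
```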

29. Learning to Classify Text
• Representation: each electronic document is represented by the set of words it contains, under the independence assumptions that the order of words does not matter and that co-occurrences of words do not matter, i.e. each document is represented as a "bag of words".
• Learning: estimate from the training set of documents the prior class probability P(ci) and the conditional likelihood P(wj|ci) of a word wj given the document class ci.
• Classification: maximum a posteriori (MAP).

30. Learning to Classify e-mail
• Is this e-mail spam? e-mail → {spam, ham}
• Each word represents an attribute characterizing the e-mail.
• Estimate the class priors p(spam) and p(ham) from the training data, as well as the class-conditional likelihoods for all encountered words.
• For a new e-mail, assuming naive Bayes conditional independence, compute the MAP hypothesis (see the end-to-end sketch below).
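A small end-to-end sketch on an invented four-message corpus: estimate the priors and per-word likelihoods by counting, then classify new text with the MAP rule. Laplace (add-one) smoothing is included so that a word unseen in one class does not zero out the whole product; the slides do not mention smoothing, so that part is an assumption of the sketch:

```python
import math
from collections import Counter

# Invented toy corpus of labeled e-mails (not from the slides).
training = [
    ("claim your free credit card now", "spam"),
    ("unsecured platinum card limited offer", "spam"),
    ("workshop paper deadline extended", "ham"),
    ("call for papers document analysis", "ham"),
]

class_counts = Counter(label for _, label in training)
word_counts = {c: Counter() for c in class_counts}
for text, label in training:
    word_counts[label].update(text.split())     # bag of words per class

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    words = [w for w in text.split() if w in vocab]   # ignore unknown words
    scores = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        log_p = math.log(class_counts[c] / len(training))   # log prior P(c)
        for w in words:
            # Laplace-smoothed likelihood P(w | c)
            log_p += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = log_p
    return max(scores, key=scores.get)

print(classify("free platinum card offer"))      # spam
print(classify("paper deadline for workshop"))   # ham
```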

31. Spam filtering. Example of regular mail:
From acd@essex.ac.uk Mon Nov 10 19:23:44 2003
Return-Path: <alan@essex.ac.uk>
Received: from serlinux15.essex.ac.uk (serlinux15.essex.ac.uk [155.245.48.17]) by tcw2.ppsw.rug.nl (8.12.8/8.12.8) with ESMTP id hAAIecHC008727; Mon, 10 Nov 2003 19:40:38 +0100
Apologies for multiple postings.
> 2nd C a l l f o r P a p e r s
> DAS 2004
> Sixth IAPR International Workshop on Document Analysis Systems
> September 8-10, 2004
> Florence, Italy
> http://www.dsi.unifi.it/DAS04
> Note: There are two main additions with respect to the previous CFP:
> 1) DAS&DL data are now available on the workshop web site
> 2) Proceedings will be published by Springer Verlag in LNCS series

32. Spam filtering. Example of spam:
From: "Easy Qualify" <mbulacu@netaccessproviders.net>
To: bulacu@hotmail.com
Subject: Claim your Unsecured Platinum Card - 75OO dollar limit
Date: Tue, 28 Oct 2003 17:12:07 -0400
==================================================
mbulacu - Tuesday, Oct 28, 2003
==================================================
Congratulations, you have been selected for an Unsecured Platinum Credit Card / $7500 starting credit limit. This offer is valid even if you've had past credit problems or even no credit history. Now you can receive a $7,500 unsecured Platinum Credit Card that can help build your credit. And to help get your card to you sooner, we have been authorized to waive any employment or credit verification.

33. Conclusions
• Effective: about 90% correct classification
• Could be applied to any text classification problem
• Still needs to be polished

34. Summary
• Bayesian inference allows for integrating prior knowledge about the world (beliefs expressed in terms of probabilities) with new incoming data.
• The inductive bias of Naive Bayes is that the attributes are independent given the class.
• Although this assumption is often violated, Naive Bayes provides a very efficient and widely used tool (e.g. text classification and spam filtering).
• It is applicable to discrete or continuous data.
