Bayesian Learning Application to Text Classification. Example: spam filtering


Presentation Transcript


1. KI2 - 3: Bayesian Learning Application to Text Classification. Example: spam filtering. Marius Bulacu & prof. dr. Lambert Schomaker, Kunstmatige Intelligentie / RuG

2. Founders of Probability Theory. Pierre Fermat (1601-1665, France) and Blaise Pascal (1623-1662, France). They laid the foundations of probability theory in a correspondence about a dice game.

3. Prior, Joint and Conditional Probabilities
P(A) = prior probability of A
P(B) = prior probability of B
P(A, B) = joint probability of A and B
P(A|B) = conditional (posterior) probability of A given B
P(B|A) = conditional (posterior) probability of B given A

4. Probability Rules
Product rule: P(A, B) = P(A|B) P(B), or equivalently P(A, B) = P(B|A) P(A)
Sum rule: P(A) = Σ_B P(A, B) = Σ_B P(A|B) P(B)
If A is conditionalized on B, then the total probability of A is the sum of its joint probabilities over all values of B.
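To make the two rules concrete, here is a minimal Python sketch; the joint table is an invented example, not from the slides. It recovers both marginals by the sum rule and checks the product rule on every cell:

```python
# A minimal sketch (invented numbers): the sum and product rules on a
# small joint distribution over two binary variables A and B.
P_AB = {(0, 0): 0.3, (0, 1): 0.2,
        (1, 0): 0.1, (1, 1): 0.4}   # joint P(A, B); entries sum to 1

# Sum rule: P(A) = sum over B of P(A, B), and likewise for P(B)
P_A = {a: sum(p for (ai, b), p in P_AB.items() if ai == a) for a in (0, 1)}
P_B = {b: sum(p for (a, bi), p in P_AB.items() if bi == b) for b in (0, 1)}

# Product rule: P(A, B) = P(A|B) P(B)
for (a, b), p in P_AB.items():
    p_a_given_b = p / P_B[b]          # conditional obtained from the joint
    assert abs(p_a_given_b * P_B[b] - p) < 1e-12

print(P_A)   # {0: 0.5, 1: 0.5}
print(P_B)   # {0: 0.4, 1: 0.6}
```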

5. Statistical Independence
Two random variables A and B are independent iff:
• P(A, B) = P(A) P(B)
• P(A|B) = P(A)
• P(B|A) = P(B)
Knowing the value of one variable does not yield any information about the value of the other.
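Applying this test to the invented joint table from the sketch above shows that those two variables are dependent, since the joint does not factorize into the product of the marginals:

```python
# Independence check on the same invented joint table as above.
P_AB = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
P_A = {0: 0.5, 1: 0.5}   # marginals obtained by the sum rule
P_B = {0: 0.4, 1: 0.6}

for (a, b), p in P_AB.items():
    print((a, b), p, "vs", P_A[a] * P_B[b])   # e.g. (0, 0): 0.3 vs 0.2
# P(A, B) != P(A) P(B) in every cell, so knowing B does tell us about A.
```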

6. Statistical Dependence - Bayes. Thomas Bayes (1702-1761, England). His "An Essay towards solving a Problem in the Doctrine of Chances" was published in the Philosophical Transactions of the Royal Society of London in 1764.

7. Bayes Theorem
P(A|B) = P(A, B) / P(B) and P(B|A) = P(A, B) / P(A)
⇒ P(A, B) = P(A|B) P(B) = P(B|A) P(A)
⇒ P(A|B) = P(B|A) P(A) / P(B)

8. Bayes Theorem - Causality
P(A|B) = P(B|A) P(A) / P(B)
Diagnostic: P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)
Pattern Recognition: P(Class|Feature) = P(Feature|Class) P(Class) / P(Feature)

9. Bayes Formula and Classification
P(Class|Data) = P(Data|Class) P(Class) / P(Data)
• P(Data|Class): conditional likelihood of the data given the class
• P(Class): prior probability of the class, before seeing anything
• P(Class|Data): posterior probability of the class, after seeing the data
• P(Data): unconditional probability of the data

10. Medical example
p(+disease) = 0.002
p(+test | +disease) = 0.97
p(+test | -disease) = 0.04
p(+test) = p(+test | +disease) · p(+disease) + p(+test | -disease) · p(-disease) = 0.97 · 0.002 + 0.04 · 0.998 = 0.00194 + 0.03992 = 0.04186
p(+disease | +test) = p(+test | +disease) · p(+disease) / p(+test) = 0.00194 / 0.04186 ≈ 0.046
p(-disease | +test) = p(+test | -disease) · p(-disease) / p(+test) = 0.03992 / 0.04186 ≈ 0.954
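The slide's arithmetic can be reproduced directly in Python (all numbers taken from the slide):

```python
# The slide's medical example, so the arithmetic can be checked directly.
p_disease = 0.002                # prior: p(+disease)
p_test_given_disease = 0.97      # sensitivity: p(+test | +disease)
p_test_given_healthy = 0.04      # false-positive rate: p(+test | -disease)

# Unconditional probability of a positive test (sum rule)
p_test = (p_test_given_disease * p_disease
          + p_test_given_healthy * (1 - p_disease))

# Bayes theorem: posterior probability of disease given a positive test
p_disease_given_test = p_test_given_disease * p_disease / p_test

print(f"p(+test)            = {p_test:.5f}")                      # 0.04186
print(f"p(+disease | +test) = {p_disease_given_test:.3f}")        # 0.046
print(f"p(-disease | +test) = {1 - p_disease_given_test:.3f}")    # 0.954
```

Despite the accurate test, a positive result still means only about a 5% chance of disease, because the prior p(+disease) is so small.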

11. MAP Classification
• To minimize the probability of misclassification, assign a new input x to the class with the Maximum A Posteriori probability, i.e. assign x to class C1 if: p(C1|x) > p(C2|x) ⇔ p(x|C1) p(C1) > p(x|C2) p(C2)
• The decision boundary must therefore be placed where the two posterior probability distributions cross each other.
[Figure: the two posteriors p(C1|x) ∝ p(x|C1) p(C1) and p(C2|x) ∝ p(x|C2) p(C2) plotted over x, with the decision threshold at their crossing point.]
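A minimal sketch of the MAP rule for two classes, assuming Gaussian class-conditional densities; the priors, means, and standard deviations are invented for illustration and do not come from the slides:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Illustrative parameters (assumptions for this sketch)
prior = {"C1": 0.6, "C2": 0.4}
mu    = {"C1": 0.0, "C2": 3.0}
sigma = {"C1": 1.0, "C2": 1.0}

def map_classify(x):
    # Assign x to the class maximizing p(x|C) p(C); the normalizer p(x)
    # is the same for both classes, so it can be dropped.
    scores = {c: gauss_pdf(x, mu[c], sigma[c]) * prior[c] for c in prior}
    return max(scores, key=scores.get)

print(map_classify(0.5))   # C1: left of the decision boundary (~1.64 here)
print(map_classify(2.8))   # C2: right of the decision boundary
```

With equal priors the same rule reduces to the maximum-likelihood classification of the next slide.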

12. Maximum Likelihood Classification
• When the prior class distributions are not known, or for equal (non-informative) priors: p(x|C1) p(C1) > p(x|C2) p(C2) reduces to p(x|C1) > p(x|C2)
• Therefore assign the input x to the class with the maximum likelihood of having generated it.

13. Continuous Features
Two methods for dealing with continuous-valued features:
• Binning: divide the range of continuous values into a discrete number of bins, then apply the discrete methodology (see the sketch below).
• Mixture of Gaussians: make an assumption regarding the functional form of the PDF (a linear combination of Gaussians) and estimate the corresponding parameters (means and standard deviations).
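A minimal binning sketch, with invented temperature readings, range, and bin count: a continuous value is mapped to a discrete bin index, and per-class bin frequencies then estimate P(bin|class) exactly as in the discrete case:

```python
# Binning a continuous feature (invented data, not from the slides).

def to_bin(x, lo, hi, n_bins):
    """Map x in [lo, hi] to a bin index in 0 .. n_bins-1."""
    i = int((x - lo) / (hi - lo) * n_bins)
    return min(max(i, 0), n_bins - 1)      # clamp values at the edges

# Illustrative temperature readings (degrees C) with class labels
samples = [(2.0, "cold"), (4.5, "cold"), (18.0, "warm"), (21.5, "warm")]

counts = {}
for x, c in samples:
    b = to_bin(x, lo=0.0, hi=30.0, n_bins=6)    # six 5-degree-wide bins
    counts[(c, b)] = counts.get((c, b), 0) + 1

# P(bin | class) = count(class, bin) / count(class)
class_totals = {}
for (c, b), n in counts.items():
    class_totals[c] = class_totals.get(c, 0) + n
for (c, b), n in sorted(counts.items()):
    print(f"P(bin {b} | {c}) = {n / class_totals[c]:.2f}")
```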

14. Accumulation of Evidence
p(C|X,Y) ∝ p(X,Y,C) = p(C) p(X,Y|C) = p(C) p(X|C) p(Y|C,X)
… and with a further observation Z: p(C|X,Y,Z) ∝ p(C) p(X|C) p(Y|C,X) p(Z|C,X,Y)
(at each step, the latest posterior plays the role of the new prior)
• Bayesian inference allows for integrating prior knowledge about the world (beliefs expressed in terms of probabilities) with new incoming data.
• Different forms of data (possibly incommensurable) can be fused towards the final decision using the "common currency" of probability.
• As new data arrives, the latest posterior becomes the new prior for interpreting the next input.
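A small sketch of this sequential updating, with invented likelihood numbers: after each observation the normalized posterior becomes the prior for the next step:

```python
# Sequential evidence accumulation (invented numbers, not from the slides).

def update(prior, likelihoods):
    """One Bayes step: posterior(c) is proportional to prior(c) * p(obs | c)."""
    unnorm = {c: prior[c] * likelihoods[c] for c in prior}
    z = sum(unnorm.values())                 # p(observation), the normalizer
    return {c: p / z for c, p in unnorm.items()}

belief = {"spam": 0.5, "ham": 0.5}           # non-informative initial prior

# Likelihood of each successive piece of evidence under the two classes
evidence_stream = [
    {"spam": 0.8, "ham": 0.3},               # e.g. first word observed
    {"spam": 0.6, "ham": 0.4},               # e.g. second word observed
]

for likelihoods in evidence_stream:
    belief = update(belief, likelihoods)     # latest posterior -> new prior
    print(belief)                            # belief in spam grows: 0.73, 0.80
```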

15. Example: temperature classification
Classes: Cold, Normal, Warm, Hot.
[Figure: the class-conditional likelihoods P(x|C), P(x|N), P(x|W), P(x|H) plotted over the temperature x, together with the unconditional P(x).]

16. Bayes: probability "blow up"
[Figure: the corresponding posteriors P(C|x), P(N|x), P(W|x), P(H|x) for the classes Cold, Normal, Warm, Hot; Bayes' rule rescales the likelihoods into posteriors.]

17. Even with an irregular PDF shape in the likelihood P(x|C), the Bayesian output P(C|x) = P(x|C) P(C) / P(x) has a nice plateau.
[Figure: irregular likelihood as input, plateau-shaped posterior as output.]

18. Puzzle
• If Bayes is optimal and can also be used for continuous data, why did it become popular so late, i.e. much later than neural networks?

19. Why Bayes has become popular so late…
• Note: the example was 1-dimensional.
• A PDF (histogram) with 100 bins for one dimension will cost 10,000 bins for two dimensions, etc.
• N_cells = N_bins^n_dims
[Figure: a 1-dimensional PDF P(x) over x.]
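The slide's formula is easy to evaluate; this snippet just prints N_bins ** n_dims for a 100-bin-per-dimension histogram to show the exponential blow-up:

```python
# Cell count of a binned PDF: N_cells = N_bins ** n_dims (slide 19).
n_bins = 100
for n_dims in (1, 2, 3, 4):
    print(f"{n_dims} dim(s): {n_bins ** n_dims:,} cells")
# 1 dim(s): 100 cells
# 2 dim(s): 10,000 cells
# 3 dim(s): 1,000,000 cells
# 4 dim(s): 100,000,000 cells
```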

20. Why Bayes has become popular so late…
• N_cells = N_bins^n_dims
• Yes… but you could use n-dimensional theoretical distributions (Gauss, Weibull, etc.) instead of empirically measured PDFs…

21. Why Bayes has become popular so late…
• … even using theoretical distributions instead of empirically measured PDFs …
• the dimensionality is still a problem:
• about 20 samples are needed to estimate a 1-dim. Gaussian PDF
• about 400 samples are needed to estimate a 2-dim. Gaussian, etc.
• massive amounts of labeled data are needed to estimate probabilities reliably!

22. Labeled (ground-truthed) data
Example: client evaluation in insurance. Each row is one client, described by seven feature values and a class label:

0.1   0.54  0.53  0.874  8.455  0.001  –0.111  risk
0.2   0.59  0.01  0.974  8.40   0.002  –0.315  risk
0.11  0.4   0.3   0.432  7.455  0.013  –0.222  safe
0.2   0.64  0.13  0.774  8.123  0.001  –0.415  risk
0.1   0.17  0.59  0.813  9.451  0.021  –0.319  risk
0.8   0.43  0.55  0.874  8.852  0.011  –0.227  safe
0.1   0.78  0.63  0.870  8.115  0.002  –0.254  risk
...

23. Success of speech recognition
• massive amounts of data
• increased computing power
• cheap computer memory
→ allowed for the use of Bayes in hidden Markov models for speech recognition
• similarly (but slower): application of Bayes in script recognition

24. Global structure:
• year
• title
• date
• date and number of entry ("Rappt")
• redundant lines between paragraphs
• jargon words: "Notificatie", "Besluit fiat"
• imprint with page number
→ XML model

  25. Local probabilistic structure: P(“Novb 16 is a date” | “sticks out to the left” & is left of “Rappt ”) ?

26. Naive Bayes: Conditional Independence
• Naive Bayes assumes the attributes (features) are independent given the class: p(X,Y|C) = p(X|C) p(Y|C), or in general p(x1, …, xn|C) = ∏_i p(xi|C)
• It often works surprisingly well in practice despite its manifest simplicity.

27. Accumulation of Evidence – Independence
Under the "naive" assumption that X and Y are independent given the class, the chain of slide 14 simplifies to: p(C|X,Y) ∝ p(C) p(X|C) p(Y|C)

28. The Naive Bayes Classifier
Assume that each sample x to be classified is described by the attributes a1, a2, …, an.
• The most probable (MAP) classification for x is: c_MAP = argmax_c P(c | a1, …, an) = argmax_c P(a1, …, an | c) P(c)
• Naive Bayes independence assumption: P(a1, …, an | c) = ∏_i P(ai | c)
• Therefore: c_NB = argmax_c P(c) ∏_i P(ai | c)
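A minimal sketch of this decision rule with invented toy probabilities. The product is computed in log space, a standard implementation detail (not mentioned on the slides) that avoids numerical underflow when many attribute probabilities are multiplied:

```python
import math

# Naive Bayes decision rule on invented toy numbers.
priors = {"spam": 0.4, "ham": 0.6}

# P(word | class) for a tiny assumed vocabulary
likelihood = {
    "spam": {"free": 0.05,  "money": 0.04,  "meeting": 0.001},
    "ham":  {"free": 0.005, "money": 0.002, "meeting": 0.03},
}

def classify(words):
    scores = {}
    for c in priors:
        log_p = math.log(priors[c])            # log P(c)
        for w in words:
            log_p += math.log(likelihood[c][w])  # + sum of log P(ai | c)
        scores[c] = log_p
    return max(scores, key=scores.get)         # argmax over classes

print(classify(["free", "money"]))   # spam
print(classify(["meeting"]))         # ham
```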

29. Learning to Classify Text
• Representation: each electronic document is represented by the set of words it contains, under the independence assumptions that the order of words does not matter and that co-occurrences of words do not matter, i.e. each document is represented as a "bag of words".
• Learning: estimate from the training set of documents the prior class probability P(ci) and the conditional likelihood P(wj|ci) of a word wj given the document class ci.
• Classification: maximum a posteriori (MAP).

30. Learning to Classify e-mail
• Is this e-mail spam? e-mail → {spam, ham}
• Each word represents an attribute characterizing the e-mail.
• Estimate the class priors p(spam) and p(ham) from the training data, as well as the class-conditional likelihoods for all encountered words.
• For a new e-mail, assuming naive Bayes conditional independence, compute the MAP hypothesis (see the end-to-end sketch below).
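A small end-to-end sketch on an invented four-message corpus: estimate the priors and per-word likelihoods by counting, then classify new text with the MAP rule. Laplace (add-one) smoothing is included so that a word unseen in one class does not zero out the whole product; the slides do not mention smoothing, so that part is an assumption of the sketch:

```python
import math
from collections import Counter

# Invented toy corpus of labeled e-mails (not from the slides).
training = [
    ("claim your free credit card now", "spam"),
    ("unsecured platinum card limited offer", "spam"),
    ("workshop paper deadline extended", "ham"),
    ("call for papers document analysis", "ham"),
]

class_counts = Counter(label for _, label in training)
word_counts = {c: Counter() for c in class_counts}
for text, label in training:
    word_counts[label].update(text.split())     # bag of words per class

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    words = [w for w in text.split() if w in vocab]   # ignore unknown words
    scores = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        log_p = math.log(class_counts[c] / len(training))   # log prior P(c)
        for w in words:
            # Laplace-smoothed likelihood P(w | c)
            log_p += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = log_p
    return max(scores, key=scores.get)

print(classify("free platinum card offer"))      # spam
print(classify("paper deadline for workshop"))   # ham
```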

31. Spam filtering. Example of regular mail:
From acd@essex.ac.uk Mon Nov 10 19:23:44 2003
Return-Path: <alan@essex.ac.uk>
Received: from serlinux15.essex.ac.uk (serlinux15.essex.ac.uk [155.245.48.17]) by tcw2.ppsw.rug.nl (8.12.8/8.12.8) with ESMTP id hAAIecHC008727; Mon, 10 Nov 2003 19:40:38 +0100
Apologies for multiple postings.
> 2nd C a l l f o r P a p e r s
> DAS 2004
> Sixth IAPR International Workshop on Document Analysis Systems
> September 8-10, 2004
> Florence, Italy
> http://www.dsi.unifi.it/DAS04
> Note: There are two main additions with respect to the previous CFP:
> 1) DAS&DL data are now available on the workshop web site
> 2) Proceedings will be published by Springer Verlag in LNCS series

32. Spam filtering. Example of spam:
From: "Easy Qualify" <mbulacu@netaccessproviders.net>
To: bulacu@hotmail.com
Subject: Claim your Unsecured Platinum Card - 75OO dollar limit
Date: Tue, 28 Oct 2003 17:12:07 -0400
==================================================
mbulacu - Tuesday, Oct 28, 2003
==================================================
Congratulations, you have been selected for an Unsecured Platinum Credit Card / $7500 starting credit limit. This offer is valid even if you've had past credit problems or even no credit history. Now you can receive a $7,500 unsecured Platinum Credit Card that can help build your credit. And to help get your card to you sooner, we have been authorized to waive any employment or credit verification.

33. Conclusions
• Effective: about 90% correct classification
• Could be applied to any text classification problem
• Still needs to be polished

34. Summary
• Bayesian inference allows for integrating prior knowledge about the world (beliefs expressed in terms of probabilities) with new incoming data.
• The inductive bias of Naive Bayes is that the attributes are independent given the class.
• Although this assumption is often violated, Naive Bayes provides a very efficient and widely used tool (e.g. text classification and spam filtering).
• It is applicable to discrete or continuous data.
