180 likes | 292 Vues
This overview discusses methods of text classification, focusing on Naïve Bayesian classifiers and decision trees. Text classification involves assigning documents to predefined categories based on given features. We explore the principles behind generative and discriminative models, touching on techniques like maximum likelihood estimation and Laplace smoothing. The lecture emphasizes practical examples, including calculating probabilities using Bayes' theorem for drug testing scenarios. Additionally, we compare performance metrics and feature selection techniques using χ² test and mutual information.
E N D
SET(5) Prof. Dragomir R. Radev radev@umich.edu
SET Fall 2013 … 9. Text classification Naïve Bayesian classifiers Decision trees …
Introduction • Text classification: assigning documents to predefined categories: topics, languages, users • A given set of classes C • Given x, determine its class in C • Hierarchical vs. flat • Overlapping (soft) vs non-overlapping (hard)
Introduction • Ideas: manual classification using rules (e.g., Columbia AND University EducationColumbia AND “South Carolina” Geography • Popular techniques: generative (knn, Naïve Bayes) vs. discriminative (SVM, regression) • Generative: model joint prob. p(x,y) and use Bayesian prediction to compute p(y|x) • Discriminative: model p(y|x) directly.
Bayes formula Full probability
Example (performance enhancing drug) • Drug(D) with values y/n • Test(T) with values +/- • P(D=y) = 0.001 • P(T=+|D=y)=0.8 • P(T=+|D=n)=0.01 • Given: athlete tests positive • P(D=y|T=+)= P(T=+|D=y)P(D=y) / (P(T=+|D=y)P(D=y)+P(T=+|D=n)P(D=n)=(0.8x0.001)/(0.8x0.001+0.01x0.999)=0.074
Naïve Bayesian classifiers • Naïve Bayesian classifier • Assuming statistical independence • Features = words (or phrases) typically
Example • p(well)=0.9, p(cold)=0.05, p(allergy)=0.05 • p(sneeze|well)=0.1 • p(sneeze|cold)=0.9 • p(sneeze|allergy)=0.9 • p(cough|well)=0.1 • p(cough|cold)=0.8 • p(cough|allergy)=0.7 • p(fever|well)=0.01 • p(fever|cold)=0.7 • p(fever|allergy)=0.4 Example from Ray Mooney
Example (cont’d) • Features: sneeze, cough, no fever • P(well|e)=(.9) * (.1)(.1)(.99) / p(e)=0.0089/p(e) • P(cold|e)=(.05) * (.9)(.8)(.3) / p(e)=0.01/p(e) • P(allergy|e)=(.05) * (.9)(.7)(.6) / p(e)=0.019/p(e) • P(e) = 0.0089+0.01+0.019=0.379 • P(well|e)=.23 • P(cold|e)=.26 • P(allergy|e)=.50 Example from Ray Mooney
Issues with NB • Where do we get the values – use maximum likelihood estimation (Ni/N) • Same for the conditionals – these are based on a multinomial generator and the MLE estimator is (Tji/STji) • Smoothing is needed – why? • Laplace smoothing ((Tji+1)/S(Tji+1)) • Implementation: how to avoid floating point underflow
Spam recognition Return-Path: <ig_esq@rediffmail.com> X-Sieve: CMU Sieve 2.2 From: "Ibrahim Galadima" <ig_esq@rediffmail.com> Reply-To: galadima_esq@netpiper.com To: webmaster@aclweb.org Date: Tue, 14 Jan 2003 21:06:26 -0800 Subject: Gooday DEAR SIR FUNDS FOR INVESTMENTS THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A
SpamAssassin • http://spamassassin.apache.org/ • http://spamassassin.apache.org/tests_3_1_x.html
Feature selection: The 2 test • For a term t: • C=class, it = feature • Testing for independence:P(C=0,It=0) should be equal to P(C=0) P(It=0) • P(C=0) = (k00+k01)/n • P(C=1) = 1-P(C=0) = (k10+k11)/n • P(It=0) = (k00+K10)/n • P(It=1) = 1-P(It=0) = (k01+k11)/n
Feature selection: The 2 test • High values of 2 indicate lower belief in independence. • In practice, compute 2 for all words and pick the top k among them.
Feature selection: mutual information • No document length scaling is needed • Documents are assumed to be generated according to the multinomial model • Measures amount of information: if the distribution is the same as the background distribution, then MI=0 • X = word; Y = class
Well-known datasets • 20 newsgroups • http://qwone.com/~jason/20Newsgroups/ • Reuters-21578 • http://www.daviddlewis.com/resources/testcollections/reuters21578/ • Cats: grain, acquisitions, corn, crude, wheat, trade… • WebKB • http://www-2.cs.cmu.edu/~webkb/ • course, student, faculty, staff, project, dept, other • NB performance (2000) • P=26,43,18,6,13,2,94 • R=83,75,77,9,73,100,35
Evaluation of text classification • Microaveraging – average over classes • Macroaveraging – uses pooled table
Readings • MRS18 • MRS17, MRS19