
CS 277: Data Mining Text Classification



Presentation Transcript


  1. CS 277: Data Mining Text Classification. Padhraic Smyth, Department of Computer Science, University of California, Irvine. Some slides adapted from Introduction to Information Retrieval, by Manning, Raghavan, and Schutze, Cambridge University Press, 2009, with additional slides from Ray Mooney.

  2. Progress Report • New deadline • In class, Thursday February 18th (not Tuesday) • Outline • 3 pages maximum • Suggested content • Brief restatement of the problem you are addressing (no need to repeat everything in your original proposal), e.g., ½ a page • Summary of progress so far • Background papers read • Data preprocessing, exploratory data analysis • Algorithms/Software reviewed, implemented, tested • Initial results (if any) • Challenges and difficulties encountered • Brief outline of plans between now and end of quarter • Use diagrams, figures, tables, where possible • Write clearly: check what you write

  3. Currently in the News… • Communications of the ACM, Feb 2010 • Fun short article, “Where the Data is” • http://cacm.acm.org/magazines/2010/2/69384-where-the-data-is/fulltext • Technical article on Fast Dimension Reduction • Introduction: http://cacm.acm.org/magazines/2010/2/69382-strange-effects-in-high-dimension/fulltext • Article: http://cacm.acm.org/magazines/2010/2/69359-faster-dimension-reduction/fulltext • New York Times, Feb 8, 2010 • Interesting study of which articles are emailed most often: http://www.nytimes.com/2010/02/09/science/09tier.html?ref=technology

  4. Conferences related to Text Mining • ACM SIGIR Conference • WWW Conference • ACM SIGKDD Conference Worth looking at papers in recent years for many different ideas related to text mining

  5. Further Reading on Text Classification • General references on text and language modeling • Introduction to Information Retrieval, C. Manning, P. Raghavan, and H. Schutze, Cambridge University Press, 2008 • Foundations of Statistical Natural Language Processing, C. Manning and H. Schutze, MIT Press, 1999 • Speech and Language Processing: An Introduction to Natural Language Processing, Dan Jurafsky and James Martin, Prentice Hall, 2000 • The Text Mining Handbook: Advanced Approaches to Analyzing Unstructured Data, Feldman and Sanger, Cambridge University Press, 2007 • Web-related text mining in general • S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003 • See chapter 5 for discussion of text classification • SVMs for text classification • T. Joachims, Learning to Classify Text using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002

  6. Text Classification • Text classification has many applications • Spam email detection • Automated tagging of streams of news articles, e.g., Google News • Online advertising: what is this Web page about? • Data Representation • “Bag of words” most commonly used: either counts or binary • Can also use “phrases” (e.g., bigrams) for commonly occurring combinations of words • Classification Methods • Naïve Bayes widely used (e.g., for spam email) • Fast and reasonably accurate • Support vector machines (SVMs) • Typically the most accurate method in research studies • But more complex computationally • Logistic Regression (regularized) • Not as widely used, but can be competitive with SVMs (e.g., Zhang and Oles, 2002)

  7. Ch. 13 Types of Labels/Categories/Classes • Assigning labels to documents or web-pages • Labels are most often topics such as Yahoo-categories • "finance," "sports," "news>world>asia>business" • Labels may be genres • "editorials" "movie-reviews" "news” • Labels may be opinion on a person/product • “like”, “hate”, “neutral” • Labels may be domain-specific • "interesting-to-me" : "not-interesting-to-me” • “contains adult language” : “doesn’t” • language identification: English, French, Chinese, … • search vertical: about Linux versus not • “link spam” : “not link spam”

  8. Common Data Sets used for Evaluation • Reuters • 10700 labeled documents • 10% documents with multiple class labels • Yahoo! Science Hierarchy • 95 disjoint classes with 13,598 pages • 20 Newsgroups data • 18800 labeled USENET postings • 20 leaf classes, 5 root level classes • WebKB • 8300 documents in 7 categories such as “faculty”, “course”, “student”. • Industry • 6449 home pages of companies partitioned into 71 classes

  9. Practical Issues • Tokenization • Convert document to word counts = “bag of words” • word token = “any nonempty sequence of characters” • for HTML (etc.) need to remove formatting • Canonical forms, Stopwords, Stemming • Remove capitalization • Stopwords: • remove very frequent words (a, the, and, …) – can use a standard list • Can also remove very rare words, e.g., words that only occur in k or fewer documents, e.g., k = 5 • Stemming (next slide) • Data representation • e.g., sparse 3-column format for bag of words: <docid termid count> (see the sketch below) • can use inverted indices, etc.
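
A minimal sketch of this pipeline (tokenization, stopword and rare-word removal, and the sparse 3-column <docid termid count> representation). The function names, the tiny stopword list, and the rarity threshold are illustrative assumptions, not a reference implementation:

```python
import re
from collections import Counter

STOPWORDS = {"a", "the", "and", "of", "to", "in"}   # illustrative subset of a standard stopword list

def tokenize(text):
    # lowercase, strip HTML-style tags, keep nonempty alphanumeric tokens, drop stopwords
    text = re.sub(r"<[^>]+>", " ", text.lower())
    return [t for t in re.findall(r"[a-z0-9]+", text) if t not in STOPWORDS]

def bag_of_words(docs, rare_k=5):
    # docs: list of raw strings
    # returns a term -> termid map plus sparse triples <docid, termid, count>,
    # dropping terms that occur in rare_k or fewer documents
    doc_counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter(t for counts in doc_counts for t in counts)
    vocab = {t: i for i, t in enumerate(sorted(t for t, df in doc_freq.items() if df > rare_k))}
    triples = [(d, vocab[t], c)
               for d, counts in enumerate(doc_counts)
               for t, c in counts.items() if t in vocab]
    return vocab, triples
```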

  10. Toy example of a document-term matrix

  11. Stemming • Want to reduce all morphological variants of a word to a single index term • e.g. a document containing words like fish and fisher may not be retrieved by a query containing fishing (no fishing explicitly contained in the document) • Stemming - reduce words to their root form • e.g. fish – becomes a new index term • Porter stemming algorithm (1980) • relies on a preconstructed suffix list with associated rules • e.g. if suffix=IZATION and prefix contains at least one vowel followed by a consonant, replace with suffix=IZE • BINARIZATION => BINARIZE • Not always desirable: e.g., {university, universal} -> univers (in Porter’s) • WordNet: dictionary-based approach • May be more useful for information retrieval/search than classification
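
A toy sketch of the single IZATION -> IZE rule described above. A real Porter stemmer applies a full cascade of suffix rules with measure conditions; this fragment only illustrates the rule format:

```python
import re

def toy_stem(word):
    # apply one Porter-style rule: if the suffix is "ization" and the remaining
    # stem contains a vowel followed by a consonant, replace the suffix with "ize"
    m = re.match(r"(.*[aeiou][^aeiou].*)ization$", word.lower())
    return m.group(1) + "ize" if m else word.lower()

print(toy_stem("BINARIZATION"))  # -> binarize
print(toy_stem("fishing"))       # -> fishing (this rule does not apply)
```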

  12. Confusion Matrix and Accuracy • Accuracy = fraction of examples classified correctly = 280/400 = 70% for the confusion matrix shown on the slide
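
In code, accuracy is the trace of the confusion matrix divided by its total count; the matrix below uses hypothetical counts chosen only to reproduce the 280/400 figure:

```python
import numpy as np

# hypothetical 2-class confusion matrix (rows = true class, cols = predicted class)
conf = np.array([[190,  30],
                 [ 90,  90]])          # 280 correct out of 400

accuracy = np.trace(conf) / conf.sum()
print(accuracy)                        # 0.70
```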

  13. Precision • Precision for class k = fraction of documents predicted as class k that actually belong to class k = 90/100 = 90% for class 2 in the example on the slide

  14. Recall • Recall for class k = fraction of documents that belong to class k that are predicted as class k = 90/180 = 50% for class 2 in the example on the slide
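
A small sketch computing per-class precision and recall from a confusion matrix. The counts are hypothetical, chosen so that class 2 (index 1) gets precision 90/100 = 90% and recall 90/180 = 50%, matching the two slides above:

```python
import numpy as np

# hypothetical confusion matrix (rows = true class, cols = predicted class)
conf = np.array([[210,  10],
                 [ 90,  90]])

def precision(conf, k):
    # fraction of documents predicted as class k that truly belong to class k
    return conf[k, k] / conf[:, k].sum()

def recall(conf, k):
    # fraction of documents truly in class k that were predicted as class k
    return conf[k, k] / conf[k, :].sum()

print(precision(conf, 1), recall(conf, 1))   # 0.9 0.5
```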

  15. Precision-Recall Curves • Rank our test documents by predicted probability of belonging to class k • Sort the ranked list • For t = 1, …, N (N = total number of test documents) • Select the top t documents on the list • Classify documents above the cutoff as class k, documents below as not class k • Compute a precision-recall pair • This generates a set of precision-recall values we can plot • Perfect performance: precision = 1, recall = 1 • Similar in concept to an ROC curve
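
A sketch of this thresholding procedure, assuming we already have a per-document score (e.g., the predicted probability of class k) and a binary label for class k; the function name and toy inputs are hypothetical:

```python
import numpy as np

def precision_recall_points(scores, labels):
    # scores: predicted probability of class k for each test document
    # labels: 1 if the document truly belongs to class k, else 0
    # returns one (precision, recall) pair per cutoff t = 1..N
    order = np.argsort(-np.asarray(scores))      # highest-scoring documents first
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                       # true positives among the top t
    t = np.arange(1, len(labels) + 1)
    return tp / t, tp / labels.sum()             # precision(t), recall(t)

# toy usage with hypothetical scores and labels
p, r = precision_recall_points([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
print(list(zip(r, p)))
```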

  16. Precision-recall curves for category “Crude” (Dumais, 1998), comparing Linear SVM, Decision Tree, Naïve Bayes, and Rocchio. [Figure: precision (y-axis) vs. recall (x-axis)]

  17. F-measure • F-measure = weighted harmonic mean of precision and recall (at some fixed operating threshold for the classifier) F1 = 1/ ( 0.5/P + 0.5/R ) = 2PR / (P + R) Useful as a simple single summary measure of performance Sometimes “breakeven” F1 used, i.e., measured when P = R
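
The corresponding one-liner, reusing the hypothetical precision/recall values (0.9, 0.5) from the earlier example:

```python
def f1(p, r):
    # harmonic mean of precision and recall
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

print(f1(0.9, 0.5))   # ~0.64
```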

  18. Sec.13.6 F-Measures on Reuters Data Set

  19. Probabilistic “Generative” Classifiers • Model p( x | ck ) for each class and perform classification via Bayes’ rule: c = arg max { p( ck | x ) } = arg max { p( x | ck ) p( ck ) } • How to model p( x | ck )? • p( x | ck ) = probability of a “bag of words” x given a class ck • Two commonly used approaches (for text): • Naïve Bayes: treat each term xj as being conditionally independent, given ck • Multinomial: model a document with N words as N tosses of a p-sided die • Other models are possible but less common • E.g., model word order by using a Markov chain for p( x | ck )

  20. Naïve Bayes Classifier for Text • Naïve Bayes classifier = conditional independence model • Assumes the terms are conditionally independent given the class: p( x | ck ) = ∏j p( xj | ck ) • Note that we model each term xj as a discrete random variable • Binary terms (Bernoulli): p( x | ck ) = ∏j p( xj | ck ), with xj = 1 if term j appears in the document and 0 otherwise • Non-binary terms (counts) (far less common): p( x | ck ) = ∏j p( xj = nj | ck ), where nj is the count of term j; can use a parametric model (e.g., Poisson) or a non-parametric model (e.g., histogram) for the p( xj = nj | ck ) distributions
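
A minimal sketch of the binary (Bernoulli) model above, assuming dense NumPy 0/1 arrays and Laplace smoothing; the smoothing constant alpha and the function names are assumptions, not part of the slide:

```python
import numpy as np

def train_bernoulli_nb(X, y, alpha=1.0):
    # X: (n_docs, n_terms) binary NumPy array; y: integer class labels 0..K-1
    # returns log P(x_j = 1 | c), log P(x_j = 0 | c), and log priors
    K = int(y.max()) + 1
    log_p1, log_p0, log_prior = [], [], []
    for c in range(K):
        Xc = X[y == c]
        p1 = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)   # P(term present | class c)
        log_p1.append(np.log(p1))
        log_p0.append(np.log(1.0 - p1))
        log_prior.append(np.log(len(Xc) / len(X)))
    return np.array(log_p1), np.array(log_p0), np.array(log_prior)

def predict_bernoulli_nb(x, log_p1, log_p0, log_prior):
    # log p(c | x) up to a constant: every vocabulary term contributes either
    # log P(x_j = 1 | c) (term present) or log P(x_j = 0 | c) (term absent)
    scores = log_prior + (x * log_p1 + (1 - x) * log_p0).sum(axis=1)
    return int(scores.argmax()), scores
```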

  21. Simple Example: Naïve Bayes (binary) • Vocabulary = {complexity, bound, cluster, classifier} • Two classes: class 1 = algorithms, class 2 = data mining • Naïve Bayes model: • Probability that a word appears in a document, given the document class • Assumes that each word appears independently of the others

  22. Spam Email Classification • Naïve Bayes widely used in spam filtering • See Paul Graham’s A Plan for Spam (on the Web) • Naive Bayes-like classifier with unusual parameter estimation • Widely used in spam filters • Naive Bayes works well when appropriately used • Very easy to update the model: just update counts • But also many other aspects • black lists, HTML features, etc.

  23. Multinomial Classifier for Text • Multinomial classification model • Assume that the data are generated by a p-sided die (multinomial model): p( x | ck ) ∝ ∏j p( xj | ck )^nj, where the product is over all terms in the vocabulary and nj = number of times term j occurs in the document • Here we have a single random variable for each class, and the p( xj = i | ck ) probabilities sum to 1 over i (i.e., a multinomial model) • Note that high correlation cannot be modeled well compared to naïve Bayes • e.g., two words that should almost always appear in a document for a given class • Probabilities typically only defined and evaluated for non-zero counts • But “zero counts” could also be modeled if desired • This would be equivalent to a naïve Bayes model with a geometric distribution on counts
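
A matching sketch for the multinomial model, again assuming a NumPy count matrix and Laplace smoothing as an illustrative estimator:

```python
import numpy as np

def train_multinomial(X, y, alpha=1.0):
    # X: (n_docs, n_terms) count matrix; y: integer class labels 0..K-1
    # returns log P(term j | c) (one p-sided die per class) and log priors
    K, V = int(y.max()) + 1, X.shape[1]
    log_theta, log_prior = np.zeros((K, V)), np.zeros(K)
    for c in range(K):
        counts = X[y == c].sum(axis=0)                       # total count of term j in class c
        log_theta[c] = np.log((counts + alpha) / (counts.sum() + alpha * V))
        log_prior[c] = np.log((y == c).mean())
    return log_theta, log_prior

def predict_multinomial(x, log_theta, log_prior):
    # log p(c | x) up to a constant: sum of n_j * log P(term j | c) over the
    # terms that occur in the document, plus the log prior
    scores = log_prior + log_theta @ x
    return int(scores.argmax()), scores
```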

  24. Simple Example: Multinomial • Vocabulary = {complexity, bound, cluster, classifier} • Two classes: class 1 = algorithms, class 2 = data mining • Multinomial model: • Probability of next word in a document, given the document class • Assumes that words are “memoryless”

  25. Highest Probability Terms in Multinomial Distributions

  26. Prediction • Naïve Bayes • log [ p( x | c ) p( c ) ] = log [ ∏j p( xj | c ) ] + log p( c ) = Σj log p( xj | c ) + log p( c ) (sum over all words in the vocabulary)

  27. Prediction • Naïve Bayes • log [ p( x | c ) p( c ) ] = log [ ∏j p( xj | c ) ] + log p( c ) = Σj log p( xj | c ) + log p( c ) (sum over all words in the vocabulary) • Multinomial • log [ p( x | c ) p( c ) ] = log [ ∏j p( xj | c )^nj ] + log p( c ) = Σj nj log p( xj | c ) + log p( c ) (sum over all words that occur in the document)

  28. Classification Example • Document 1 (bag of words) • Document 2 (bag of words) • Compute log p(class |document) for both naïve Bayes and multinomial model, for each document • Can assume p(class 1) = p(class 2)

  29. Classification Example (for document 1) Naïve Bayes Multinomial

  30. Classification Example (for document 1) • Naïve Bayes: Σj log p( xj | c ) + log p( c ) • Class A: score = -0.11 – 2.3 = -2.41 • Class DM: score = -1.6 – 0.22 = -1.08 • Multinomial: Σj nj log p( xj | c ) + log p( c ) • Class A: score = -4 * 0.69 – 3.0 ≈ -5.8 • Class DM: score = -4 * 2.3 – 0.6 ≈ -8.9

  31. Comparing Naïve Bayes and Multinomial Models • McCallum and Nigam (1998) found that the multinomial model outperformed naïve Bayes (with binary features) in text classification experiments • Note on names used in the literature: • “Bernoulli” (or “multivariate Bernoulli”) is sometimes used for the binary version of the naïve Bayes model • the multinomial model is also referred to as the “unigram” model • the multinomial model is also sometimes (confusingly) referred to as naïve Bayes

  32. WebKB Data Set • Classify webpages from CS departments into: • student, faculty, course, project • Train on ~5,000 hand-labeled web pages • Cornell, Washington, U. Texas, Wisconsin • Crawl and classify a new site (CMU) • Results with NB

  33. Comparing Naïve Bayes and Multinomial on Web KB Data

  34. Sec. 15.2.4 Classic Reuters-21578 Data Set • Most (over)used data set • 21578 documents • 9603 training, 3299 test articles (ModApte/Lewis split) • 118 categories • An article can be in more than one category • Learn 118 binary category distinctions • Average document: about 90 types, 200 tokens • Average number of classes assigned • 1.24 for docs with at least one category • Only about 10 out of 118 categories are large • Common categories (#train, #test): • Earn (2877, 1087) • Acquisitions (1650, 179) • Money-fx (538, 179) • Grain (433, 149) • Crude (389, 189) • Trade (369, 119) • Interest (347, 131) • Ship (197, 89) • Wheat (212, 71) • Corn (182, 56)

  35. Sec. 15.2.4 Example of Reuters document <REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798"> <DATE> 2-MAR-1987 16:51:43.42</DATE> <TOPICS><D>livestock</D><D>hog</D></TOPICS> <TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE> <DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter &#3;</BODY></TEXT></REUTERS>

  36. Comparing Multinomial and Naïve Bayes on Reuters Data (from McCallum and Nigam, 1998)

  37. Comparing Multinomial and Naïve Bayes on Reuters Data (from McCallum and Nigam, 1998)

  38. Comparing Naïve Bayes and Multinomial(slide from Chris Manning, Stanford) Results from classifying 13,589 Yahoo! Web pages in Science subtree of hierarchy into 95 different topics

  39. Feature Selection • Performance of text classification algorithms can be optimized by selecting only a subset of the discriminative terms • Particularly important for Naïve Bayes (see previous slides) • Multinomial classifier is less sensitive • Greedy search • Start from empty set or full set and add/delete one at a time • Heuristics for adding/deleting • Information gain (mutual information of term with class) • Chi-square • Other ideas Methods tend not to be particularly sensitive to the specific heuristic used for feature selection, but some form of feature selection often improves performance

  40. Sec.13.5.1 Feature selection via Mutual Information • In the training set, choose the k words which best discriminate (give the most information about) the categories • The mutual information between a word w and a class c is: I(w, c) = Σ_{ew ∈ {0,1}} Σ_{ec ∈ {0,1}} p( ew, ec ) log [ p( ew, ec ) / ( p( ew ) p( ec ) ) ], where ew indicates whether word w appears in a document and ec indicates whether the document belongs to class c
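
A sketch of this computation, assuming a binary NumPy document-term matrix; ew indicates whether the term occurs in a document and the mask indicates whether the document is in class c, following the formula above:

```python
import numpy as np

def mutual_information(X, y, c):
    # X: (n_docs, n_terms) binary 0/1 matrix; y: class labels
    # returns one MI score per term (larger = more discriminative for class c)
    n = len(y)
    in_c = (y == c)
    mi = np.zeros(X.shape[1])
    for ew in (0, 1):                                # term absent / present
        for mask in (in_c, ~in_c):                   # document in class c / not in class c
            p_joint = ((X == ew) & mask[:, None]).sum(axis=0) / n
            p_w = (X == ew).mean(axis=0)
            p_c = mask.mean()
            with np.errstate(divide="ignore", invalid="ignore"):
                mi += np.nan_to_num(p_joint * np.log(p_joint / (p_w * p_c)))  # treat 0 log 0 as 0
    return mi

# keep the k most informative terms for class c (k chosen heuristically or by cross-validation)
# top_terms = np.argsort(-mutual_information(X, y, c))[:k]
```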

  41. Sec.13.5.1 Example of Feature Selection • For each category we build a list of k most discriminating terms • Use the union of these terms in our classifier • K selected heuristically, or via cross-validation • For example (on 20 Newsgroups), using mutual information • sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, … • rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, … • Greedy: does not account for correlations between terms

  42. Example of Feature Selection (from Chakrabarti, Chapter 5) • 9600 documents from the US Patent database • 20,000 raw features (terms)

  43. Comments on Generative Models for Text (Comments applicable to both Naïve Bayes and Multinomial classifiers) • Simple and fast => popular in practice • e.g., linear in p, n, M for both training and prediction • Training = “smoothed” frequency counts, e.g., Laplace-smoothed estimates such as p( xj = 1 | ck ) = (number of class-ck documents containing term j + 1) / (number of class-ck documents + 2) • easy to use in situations where the classifier needs to be updated regularly (e.g., for spam email) • Numerical issues • Typically work with log p( ck | x ), etc., to avoid numerical underflow • Useful trick for the naïve Bayes classifier: • when computing Σj log p( xj | ck ) for sparse data, it may be much faster to • precompute Σj log p( xj = 0 | ck ) once per class • and then, for each term that occurs in the document, subtract off log p( xj = 0 | ck ) and add log p( xj = 1 | ck ) • Note: both models are “wrong”, but for classification they are often sufficient
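
One way to implement the sparse-data trick described above (a sketch reusing the log-probability arrays from the earlier Bernoulli naïve Bayes sketch; the function name is hypothetical):

```python
import numpy as np

def make_sparse_scorer(log_p1, log_p0, log_prior):
    # precompute the "all terms absent" score once per class, then adjust only
    # for the (few) terms that actually occur in a given document
    base = log_prior + log_p0.sum(axis=1)       # log p(c) + sum_j log p(x_j = 0 | c)
    delta = log_p1 - log_p0                     # correction applied when term j is present
    def score(present_term_ids):                # indices of terms occurring in the document
        return base + delta[:, present_term_ids].sum(axis=1)
    return score
```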

  44. Beyond Independence • Naïve Bayes and multinomial models both assume conditional independence of words given the class • Alternative approaches try to account for higher-order dependencies • Bayesian networks: • p( x | c ) = ∏i p( xi | parents(xi), c ) • Equivalent to a directed graph where edges represent direct dependencies • Various algorithms that search for a good network structure • Useful for improving the quality of the distribution model • … however, this does not always translate into better classification • Maximum entropy models • p( x | c ) = (1/Z) ∏subsets f( subset(x) | c ) • Equivalent to an undirected graph model • Estimation is equivalent to a maximum entropy assumption • Feature selection is crucial (which f terms to include) • can provide high-accuracy classification • … however, tends to be computationally complex to fit (estimating Z is difficult)

  45. Dumais et al. 1998: Reuters - Accuracy

  46. Sec. 15.2.4 Precision-recall curves for category “Ship” (Dumais, 1998), comparing Linear SVM, Decision Tree, Naïve Bayes, and Rocchio. [Figure: precision (y-axis) vs. recall (x-axis)]

  47. Sec. 15.2.4 Micro- vs. Macro-Averaging • If we have more than one class, how do we combine multiple performance measures into one quantity? • Macroaveraging: compute performance for each class, then average • Microaveraging: collect decisions for all classes, compute a single contingency table, evaluate • Note that macroaveraging ignores how likely each class is and gives performance on each class equal weight, while microaveraging weights classes according to how likely they are

  48. Sec. 15.2.4 Micro- vs. Macro-Averaging: Example • [Contingency tables for Class 1, Class 2, and the pooled micro-average table shown on the slide] • Macroaveraged precision: (0.5 + 0.9)/2 = 0.7 • Microaveraged precision: 100/120 = 0.83 • Microaveraged score is dominated by the score on common classes
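
The same arithmetic in code, using hypothetical per-class true-positive/false-positive counts chosen to be consistent with the numbers above:

```python
# hypothetical (TP, FP) counts per class, giving precisions 0.5 and 0.9
counts = {"class 1": (10, 10), "class 2": (90, 10)}

macro = sum(tp / (tp + fp) for tp, fp in counts.values()) / len(counts)
micro_tp = sum(tp for tp, _ in counts.values())
micro_fp = sum(fp for _, fp in counts.values())
micro = micro_tp / (micro_tp + micro_fp)

print(macro, micro)   # 0.7 and ~0.83
```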

  49. Sec. 15.2.4 Results on Reuters Data

  50. Sec. 15.2.4 SVM vs. Other Methods (Yang and Liu, 1999)
