Text Classification

Text Classification Elnaz Delpisheh Introduction to Computational Linguistics York University Department of Computer Science and Engineering August 24, 2014

Outline • Definition and applications • Representing texts • Pre-processing the text • Text classification methods • Naïve Bayes • Voted Perceptron • Support Vector Machines • Decision Trees • K-nearest neighbor • Rocchio’s algorithm • Neural Networks • Performance evaluation • Subjective text classification


Text Classification-Definition • Text classification is the assignment of text documents to one or more predefined categories based on their content. • The classifier: • Input: a set of m hand-labeled documents (x1,y1),...,(xm,ym) • Output: a learned classifier f: x → y • [Figure: text documents are fed to the classifier, which assigns each one to Class A, Class B, or Class C]

Text Classification-Applications • Classify news stories as World, US, Business, SciTech, Sports, Entertainment, Health, Other. • Classify business names by industry. • Classify student essays as A,B,C,D, or F. • Classify email as Spam, Other. • Classify email to tech staff as Mac, Windows, ..., Other. • Classify pdf files as ResearchPaper, Other • Classify documents as WrittenByReagan, GhostWritten • Classify movie reviews as Favorable,Unfavorable,Neutral. • Classify technical papers as Interesting, Uninteresting. • Classify jokes as Funny, NotFunny. • Classify web sites of companies by Standard Industrial Classification (SIC) code.

Text Classification-Example • Best-studied benchmark: Reuters-21578 newswire stories • 9603 train, 3299 test documents, 80-100 words each, 93 classes • ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS • BUENOS AIRES, Feb 26 • Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets: • Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0). • Maize Mar 48.0, total 48.0 (nil). • Sorghum nil (nil) • Oilseed export registrations were: • Sunflowerseed total 15.0 (7.9) • Soybean May 20.0, total 20.0 (nil) • The board also detailed export registrations for subproducts, as follows.... Categories: grain, wheat (of 93 binary choices)

Text Classification-Representing Texts • f(x)=y, where x is the document: • ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS • BUENOS AIRES, Feb 26 • Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets: • Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0). • Maize Mar 48.0, total 48.0 (nil). • Sorghum nil (nil) • Oilseed export registrations were: • Sunflowerseed total 15.0 (7.9) • Soybean May 20.0, total 20.0 (nil) • The board also detailed export registrations for subproducts, as follows.... • What is the best representation for the document x being classified? • What is the simplest useful one?

Pre-processing the Text • Removing stop words • Punctuation, prepositions, pronouns, etc. • Stemming • Walk, walker, walked, walking. • Indexing • Dimensionality reduction
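The first two pre-processing steps can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical one, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's, which behaves differently on many words.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "of", "and", "to", "in", "for", "a", "their", "those", "were", "as"}

# Suffixes tried in order by the naive stemmer (an assumption, not Porter's rules).
SUFFIXES = ("ing", "ed", "er", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop punctuation and stop words, then stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Walk, walker, walked, walking."))
```

With this naive rule set all four example words collapse to the same token, which is the effect stemming is meant to have on the vocabulary.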

f( )=y • ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS • BUENOS AIRES, Feb 26 • Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets: • Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0). • Maize Mar 48.0, total 48.0 (nil). • Sorghum nil (nil) • Oilseed export registrations were: • Sunflowerseed total 15.0 (7.9) • Soybean May 20.0, total 20.0 (nil) • The board also detailed export registrations for subproducts, as follows.... Representing text: a list of words f( )=y (argentine, 1986, 1987, grain, oilseed, registrations, buenos, aires, feb, 26, argentine, grain, board, figures, show, crop, registrations, of, grains, oilseeds, and, their, products, to, february, 11, in, …

Pre-processing the Text-Indexing • Using vector space model

Indexing (Cont.)

Indexing (Cont.) • tfc-weighting • It considers the normalized length of documents (M). • ltc-weighting • It considers the logarithm of the word frequency to reduce the effect of large differences in frequencies. • Entropy weighting
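The weighting formulas themselves are not preserved on these slides. A commonly cited form of the two schemes is given here as an assumption, with f_ik the frequency of word i in document k, N the number of documents, n_i the number of documents containing word i, and M the number of words:

```latex
a_{ik}^{\mathrm{tfc}} =
  \frac{f_{ik}\,\log\!\left(N/n_i\right)}
       {\sqrt{\sum_{j=1}^{M}\bigl[f_{jk}\,\log\!\left(N/n_j\right)\bigr]^{2}}}
\qquad
a_{ik}^{\mathrm{ltc}} =
  \frac{\log\!\left(f_{ik}+1\right)\log\!\left(N/n_i\right)}
       {\sqrt{\sum_{j=1}^{M}\bigl[\log\!\left(f_{jk}+1\right)\log\!\left(N/n_j\right)\bigr]^{2}}}
```

In both cases the denominator is the length normalization, and ltc differs from tfc only in replacing the raw frequency with its logarithm.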

Indexing-Word Frequency Weighting • ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS • BUENOS AIRES, Feb 26 • Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets: • Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0). • Maize Mar 48.0, total 48.0 (nil). • Sorghum nil (nil) • Oilseed export registrations were: • Sunflowerseed total 15.0 (7.9) • Soybean May 20.0, total 20.0 (nil) • The board also detailed export registrations for subproducts, as follows.... • [Table: word / frequency counts for the story] • If the order of words doesn’t matter, x can be a vector of word frequencies. • “Bag of words”: a long sparse vector x=(…,fi,…) where fi is the frequency of the i-th word in the vocabulary • Categories: grain, wheat
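A bag of words can be sketched with a counter that stores only the words that occur, which is what makes the vector sparse. The tokenizer below (lowercase, split on non-alphanumerics) is an assumption made for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a document to a sparse word -> frequency vector, ignoring word order."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

headline = ("ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS "
            "Argentine grain board figures show crop registrations of grains")
x = bag_of_words(headline)
print(x["grain"], x["registrations"], x["wheat"])
```

Words outside the document, such as "wheat" here, have frequency zero without taking up any storage.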

Pre-processing the Text-Dimensionality Reduction • Feature selection: It attempts to remove non-informative words. • Document frequency thresholding • Information gain • Latent Semantic Indexing • Singular Value Decomposition
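Document frequency thresholding is the simplest of these methods: words that occur in too few documents are dropped. A minimal sketch, where the function name and the `min_df` cutoff are illustrative assumptions:

```python
from collections import Counter

def document_frequency_filter(docs, min_df=2):
    """docs: list of token lists. Keep words found in at least min_df documents."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word once per document
    return {word for word, n in df.items() if n >= min_df}

docs = [["grain", "wheat", "wheat"], ["grain", "maize"], ["stocks", "rose"]]
print(document_frequency_filter(docs))
```

Note that "wheat" is removed even though it occurs twice, because both occurrences are in the same document.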

Text Classification-Methods • Naïve Bayes • Voted Perceptron • Support Vector Machines • Decision Trees • K-nearest neighbor • Rocchio’s algorithm • Neural Networks

Text Classification-Naïve Bayes • Represent document x as a list of words w1,w2,… • For each class y, build a probabilistic model Pr(X|Y=y) of “documents” Pr(X={argentine,grain...}|Y=wheat) = .... Pr(X={stocks,rose,in,heavy,...}|Y=nonWheat) = .... • To classify, find the y which was most likely to generate x, i.e., which gives x the best score according to Pr(x|y) • f(x) = argmax_y Pr(x|y)*Pr(y)

Text Classification-Naïve Bayes • How to estimate Pr(X|Y)? • Simplest useful process to generate a bag of words: • pick word 1 according to Pr(W|Y) • repeat for word 2, 3, .... • each word is generated independently of the others (which is clearly not true) but means Pr(w1,…,wn|Y=y) = ∏i Pr(wi|Y=y) • How to estimate Pr(W|Y)?

Text Classification-Naïve Bayes • How to estimate Pr(X|Y)? • Estimate Pr(w|y) by looking at the data: Pr(W=w|Y=y) = (count(W=w and Y=y) + mp) / (count(Y=y) + m) • Terms: • This Pr(W|Y) is a multinomial distribution • This use of m and p is a Dirichlet prior for the multinomial

Text Classification-Naïve Bayes • How to estimate Pr(X|Y)? • For instance: m=1, p=0.5 • This Pr(W|Y) is a multinomial distribution • This use of m and p is a Dirichlet prior for the multinomial

Text Classification-Naïve Bayes • Putting this together: • for each document xi with label yi • for each word wij in xi • count[wij][yi]++ • count[yi]++ • count++ • to classify a new x=w1...wn, pick the y with the top score • Key point: we only need counts for words that actually appear in document x
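The counting scheme above can be sketched in Python. The smoothed estimate used here, (count[w][y] + m·p) / (count[y] + m), is an assumption about the form of the m, p prior, and the score is computed in log space to avoid numerical underflow on long documents. The class name and toy data are illustrative.

```python
import math
from collections import defaultdict

class NaiveBayes:
    def __init__(self, m=1.0, p=0.5):
        self.m, self.p = m, p                 # smoothing parameters (assumed form)
        self.word_class = defaultdict(int)    # count[w][y]
        self.class_tokens = defaultdict(int)  # count[y]: word tokens seen with label y
        self.total = 0                        # count: all word tokens

    def train(self, docs):
        """docs: iterable of (list of words, label)."""
        for words, y in docs:
            for w in words:
                self.word_class[(w, y)] += 1
                self.class_tokens[y] += 1
                self.total += 1

    def score(self, words, y):
        """log Pr(y) + sum of log Pr(w|y), touching only words in the document."""
        s = math.log(self.class_tokens[y] / self.total)
        for w in words:
            num = self.word_class.get((w, y), 0) + self.m * self.p
            s += math.log(num / (self.class_tokens[y] + self.m))
        return s

    def classify(self, words):
        return max(self.class_tokens, key=lambda y: self.score(words, y))

nb = NaiveBayes()
nb.train([(["argentine", "grain", "wheat"], "wheat"),
          (["stocks", "rose", "in", "heavy", "trading"], "nonWheat")])
print(nb.classify(["grain", "wheat"]))
```

Only the counts for words in the document being classified are looked up, which matches the key point on the slide.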

Naïve Bayes for SPAM filtering

Naïve Bayes-Summary • Pros: • Very fast and easy-to-implement • Well-understood formally & experimentally • Cons: • Seldom gives the very best performance • “Probabilities” Pr(y|x) are not accurate • e.g., Pr(y|x) decreases with length of x • Probabilities tend to be close to zero or one

The Curse of Dimensionality • How can you learn with so many features? • For efficiency (time & memory), use sparse vectors. • Use simple classifiers (linear or loglinear) • Rely on wide margins.

Margin-based Learning • [Figure: positive (+) and negative (-) examples separated by a wide margin] • The number of features matters, but not if the margin is sufficiently wide and examples are sufficiently close to the origin (!!)

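Text Classification-Voted Perceptron • Text documents: X=<x1,x2,…,xK> • K = number of features • Two classes = {yes, no} • W = <w1,w2,…,wK> • Objective: • Learn a weight vector W and a threshold θ • If Σi wi xi > θ • Yes • Otherwise • No • Training keeps a sequence of weight vectors W1, W2, …, each with a count cj of the examples it classified correctly. For each training example (x, y), with y = +1 for yes and -1 for no, if the answer of the current vector Wj is • Correct: • cj++ • Incorrect: there is a mistake, so a correction is made: • Wj+1 = Wj + y·x • j = j+1 • cj = 1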
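A minimal sketch of voted perceptron training and prediction, following the usual formulation: every intermediate weight vector is kept together with the number of examples it survived, and those counts are used as votes. Assumptions made here: labels are +1/-1, the threshold is fixed at 0 (a constant feature can emulate a learned threshold), and the function names are illustrative.

```python
def sign(value):
    return 1 if value > 0 else -1

def dot(u, x):
    return sum(ui * xi for ui, xi in zip(u, x))

def train_voted_perceptron(examples, epochs=10):
    """examples: list of (x, y), x a list of floats, y in {+1, -1}.
    Returns a list of (weight vector, survival count) pairs."""
    v = [0.0] * len(examples[0][0])
    c = 0
    history = []
    for _ in range(epochs):
        for x, y in examples:
            if sign(dot(v, x)) == y:
                c += 1                      # correct: current vector survives
            else:
                history.append((v, c))      # mistake: retire the current vector
                v = [vi + y * xi for vi, xi in zip(v, x)]
                c = 1
    history.append((v, c))
    return history

def predict(history, x):
    """Each stored vector votes with weight equal to its survival count."""
    return sign(sum(c * sign(dot(v, x)) for v, c in history))

examples = [([2.0, 1.0], 1), ([1.0, 2.0], 1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
history = train_voted_perceptron(examples)
print([predict(history, x) for x, _ in examples])
```

On this linearly separable toy set a single correction is enough, so the final vector collects nearly all of the votes.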

Voted Perceptron-Error Correction

Lessons of the Voted Perceptron • The voted perceptron shows that you can make few mistakes while learning incrementally as you pass over the data, provided that the examples x are small (norm bounded by R) and some u exists that separates them with a large margin. • Why not look for this line directly? • Support vector machines: • find u to maximize the margin.

Text Classification-Support Vector Machines • Facts about support vector machines: • the “support vectors” are the xi’s that touch the margin. • the classifier can be written as f(x) = sign(Σi αi yi (xi · x) + b) • where the xi’s are the support vectors, the yi’s are their labels, and the αi’s and b are learned coefficients. • support vector machines often give very good results on topical text classification. • [Figure: positive (+) and negative (-) examples separated by a maximum-margin line]

Support Vector Machine Results

Text Classification-Decision Trees • Objective of decision tree learning • Learn a decision tree from a set of training data • The decision tree can be used to classify new examples • Decision tree learning algorithms • ID3 (Quinlan, 1986) • C4.5 (Quinlan, 1993) • CART (Breiman, Friedman, et al., 1984) • etc.

Decision Trees-CART • Creating the tree • Splitting the set of training vectors • The best splitter has the purest children • Diversity measure is entropy: Entropy(S) = -Σi=1..k P(Ci) log2 P(Ci) • where S: set of examples; k: number of classes; P(Ci): the proportion of examples in S that belong to Ci. • To find the best splitter, each component of the document vector is considered in turn. • This process is repeated until no sets can be partitioned any further. • Each leaf is now assigned a class.
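The splitting step can be sketched as follows, assuming threshold splits of the form "component i <= t" and scoring each candidate by the size-weighted entropy of its two children. The function names and toy vectors are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i P(Ci) * log2 P(Ci) over the classes present in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(vectors, labels):
    """Consider each component in turn; return (index, threshold, child entropy)."""
    best = None
    n = len(labels)
    for i in range(len(vectors[0])):
        for t in sorted({v[i] for v in vectors}):
            left = [y for v, y in zip(vectors, labels) if v[i] <= t]
            right = [y for v, y in zip(vectors, labels) if v[i] > t]
            if not left or not right:
                continue  # not a real partition
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            if best is None or score < best[2]:
                best = (i, t, score)
    return best

# Component 0 might be the count of "wheat", component 1 the count of "total".
vectors = [[2, 0], [3, 1], [0, 1], [0, 0]]
labels = ["wheat", "wheat", "other", "other"]
print(best_split(vectors, labels))
```

Applying `best_split` recursively to each child until no set can be partitioned further yields the full tree.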

Decision Trees-Example

TF-IDF Representation • The results above use a particular way to represent documents: bag of words with TF-IDF weighting • “Bag of words”: a long sparse vector x=(…,fi,…) where fi is the “weight” of the i-th word in the vocabulary • for a word w that appears in DF(w) documents out of N in a collection, and appears TF(w) times in the document being represented, use a weight that increases with TF(w) and decreases with DF(w) • also normalize all vector lengths (||x||) to 1
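A sketch of TF-IDF weighting with length normalization. The exact weight is not preserved on the slide; the form log(TF(w)+1) · log(N/DF(w)) used below is one common variant and should be read as an assumption.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse, unit-length dict per document."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # DF(w): number of documents containing w
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)    # TF(w): occurrences of w in this document
        # Assumed weight: log(TF + 1) * log(N / DF)
        w = {t: math.log(f + 1) * math.log(n / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(v * v for v in w.values()))
        vectors.append({t: v / norm for t, v in w.items()} if norm > 0 else w)
    return vectors

docs = [["grain", "wheat", "wheat"], ["grain", "maize"], ["stocks", "rose"]]
vecs = tfidf_vectors(docs)
print(vecs[0])
```

In the first document "wheat" outweighs "grain" because it is both more frequent in the document and rarer in the collection.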

TF-IDF Representation • TF-IDF representation is an old trick from the information retrieval community, and often improves performance of other algorithms: • K-nearest neighbor algorithm • Rocchio’s algorithm

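Text Classification-K-nearest Neighbor • The nearest neighbor algorithm • Objective • To classify a new object, find the object in the training set that is most similar. Then assign the category of this nearest neighbor. • Determine the largest similarity between the new object x and any element of the training set X: sim_max(x) = max over x' in X of sim(x', x) • Collect the subset of X that has the highest similarity with x: A = {x' in X : sim(x', x) = sim_max(x)} • With K > 1, the K most similar training objects vote on the category.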
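A minimal sketch of K-nearest neighbor classification over sparse vectors. Cosine similarity and majority voting are assumptions; the slide does not fix the similarity measure.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(train, x, k=1):
    """train: list of (sparse vector, label). Majority vote among the k most similar."""
    ranked = sorted(train, key=lambda ex: cosine(ex[0], x), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [({"grain": 1.0, "wheat": 2.0}, "wheat"),
         ({"stocks": 1.0, "rose": 1.0}, "other"),
         ({"wheat": 1.0, "maize": 1.0}, "wheat")]
print(knn_classify(train, {"wheat": 1.0}))
```

There is no training step: all the work happens at classification time, which is why the cost grows with the size of the training set.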

Text Classification-Rocchio’s Algorithm • classify using distance to centroid of documents from each class • Assign test document to the class with maximum similarity
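The centroid-based rule stated on this slide can be sketched as below. This covers only what the slide describes (one centroid per class, maximum similarity wins); the negative-example weighting of the full Rocchio formula is not shown. Cosine similarity and the function names are assumptions.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def class_centroids(train):
    """Average the sparse vectors of each class."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vec, label in train:
        counts[label] += 1
        for t, v in vec.items():
            sums[label][t] += v
    return {label: {t: v / counts[label] for t, v in s.items()}
            for label, s in sums.items()}

def rocchio_classify(centroids, x):
    """Assign x to the class whose centroid is most similar."""
    return max(centroids, key=lambda label: cosine(centroids[label], x))

train = [({"grain": 1.0, "wheat": 2.0}, "wheat"),
         ({"stocks": 1.0, "rose": 1.0}, "other"),
         ({"wheat": 1.0, "maize": 1.0}, "wheat")]
centroids = class_centroids(train)
print(rocchio_classify(centroids, {"wheat": 1.0}))
```

Compared with K-nearest neighbor, classification needs only one similarity computation per class rather than one per training document.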

Support Vector Machine Results

Text Classification-Neural Networks • A neural network text classifier is a collection of interconnected neurons that incrementally learn from their environment (the training set) to categorize documents. • A neuron • takes inputs X1, X2, …, Xn with weights W1, W2, …, Wn • computes U = ∑XiWi • outputs F(u) • [Figure: a single neuron, and a network mapping the training set to outputs y1, y2, y3]
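The single neuron described above can be sketched in a few lines. The slide does not say which activation F is used; the logistic sigmoid below is an assumption.

```python
import math

def sigmoid(u):
    """Assumed activation function F(u); the slide does not specify one."""
    return 1.0 / (1.0 + math.exp(-u))

def neuron(inputs, weights, activation=sigmoid):
    """Compute u = sum_i Xi * Wi, then return F(u)."""
    u = sum(x * w for x, w in zip(inputs, weights))
    return activation(u)

print(neuron([1.0, 2.0], [0.5, -0.25]))
```

In a text classifier the inputs Xi would typically be the word weights of a document, and the weights Wi are what training adjusts.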

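Text Classification-Neural Networks • Forward and backward propagation • [Figure: a network with inputs IM1…IMn, hidden units H1…Hm, and outputs OS1(r), OS2(r), OS3(r), connected by weights uji and vj; forward propagation runs from inputs to outputs, backward propagation from outputs back to inputs]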

Performance Evaluation • Performance of a classification algorithm can be evaluated in the following aspects • Predictive performance • How accurate is the learned model in prediction? • Complexity of the learned model • Run time • Time to build the model • Time to classify examples using the model • Here we focus on the predictive performance
