## Text Classification


Elnaz Delpisheh
Introduction to Computational Linguistics
York University, Department of Computer Science and Engineering
January 5, 2020

**Outline**

- Definition and applications
- Representing texts
- Pre-processing the text
- Text classification methods
  - Naïve Bayes
  - Voted Perceptron
  - Support Vector Machines
  - Decision Trees
  - K-nearest neighbor
  - Rocchio's algorithm
  - Neural Networks
- Performance evaluation
- Subjective text classification

**Text Classification: Definition**

- Text classification is the assignment of text documents to one or more predefined categories based on their content.
- The classifier:
  - Input: a set of m hand-labeled documents (x1, y1), …, (xm, ym)
  - Output: a learned classifier f: x → y

*(Diagram: text documents are fed into the classifier, which assigns each one to Class A, Class B, or Class C.)*

**Text Classification: Applications**

- Classify news stories as World, US, Business, SciTech, Sports, Entertainment, Health, Other.
- Classify business names by industry.
- Classify student essays as A, B, C, D, or F.
- Classify email as Spam, Other.
- Classify email to tech staff as Mac, Windows, …, Other.
- Classify PDF files as ResearchPaper, Other.
- Classify documents as WrittenByReagan, GhostWritten.
- Classify movie reviews as Favorable, Unfavorable, Neutral.
- Classify technical papers as Interesting, Uninteresting.
- Classify jokes as Funny, NotFunny.
- Classify web sites of companies by Standard Industrial Classification (SIC) code.

**Text Classification: Example**

- Best-studied benchmark: Reuters-21578 newswire stories.
- 9,603 training and 3,299 test documents, 80-100 words each, 93 classes.

> ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS
> BUENOS AIRES, Feb 26
> Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:
> Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0).
> Maize Mar 48.0, total 48.0 (nil).
> Sorghum nil (nil)
> Oilseed export registrations were:
> Sunflowerseed total 15.0 (7.9)
> Soybean May 20.0, total 20.0 (nil)
> The board also detailed export registrations for subproducts, as follows....

Categories: grain, wheat (of 93 binary choices)
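The slides leave the implementation open; as a minimal sketch of the f: x → y interface, here is a tiny scikit-learn pipeline. The toy documents, labels, and test sentence are invented for illustration, and the choice of library and of a bag-of-words Naïve Bayes model is an assumption, not something the slides prescribe.

```python
# Learn a classifier f: document -> class from hand-labeled examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hand-labeled training documents (x_i, y_i) -- invented toy data.
train_docs = ["grain exports rose sharply", "stocks fell in heavy trading"]
train_labels = ["wheat", "nonWheat"]

# f maps raw text to a bag-of-words vector, then to a class label.
f = make_pipeline(CountVectorizer(), MultinomialNB())
f.fit(train_docs, train_labels)

print(f.predict(["grain registrations for export"]))  # -> ['wheat']
```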
**Text Classification: Representing Texts**

f(⟨the Reuters story above⟩) = y

- What is the best representation for the document x being classified?
- What is the simplest useful one?

**Pre-processing the Text**

- Removing stop words
  - Punctuation, prepositions, pronouns, etc.
- Stemming
  - walk, walker, walked, walking → walk
- Indexing
- Dimensionality reduction

**Representing Text: A List of Words**

f((argentine, 1986, 1987, grain, oilseed, registrations, buenos, aires, feb, 26, argentine, grain, board, figures, show, crop, registrations, of, grains, oilseeds, and, their, products, to, february, 11, in, …)) = y

**Pre-processing the Text: Indexing**

- Use the vector space model.
- tfc-weighting
  - Takes the normalized length of documents (M) into account.
- ltc-weighting
  - Takes the logarithm of the word frequency to reduce the effect of large differences in frequencies.
- Entropy weighting

**Indexing: Word Frequency Weighting**

- If the order of words doesn't matter, x can be a vector of word frequencies.
- "Bag of words": a long sparse vector x = (f1, …, fi, …), where fi is the frequency of the i-th word in the vocabulary.
- For the Reuters story above: counts for argentine, grain, registrations, oilseed, …; categories: grain, wheat.
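As a rough illustration of these pre-processing and bag-of-words steps, the sketch below strips punctuation, removes stop words from a hand-picked list, and crudely chops common suffixes as a stand-in for a real stemmer (e.g., Porter's). The stop-word list and the suffix rule are illustrative assumptions, not the slides' exact procedure.

```python
# Stop-word removal, naive stemming, and a bag-of-words frequency vector.
import re
from collections import Counter

STOP_WORDS = {"of", "and", "their", "to", "in", "the", "for", "those"}

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())             # drop punctuation/digits
    tokens = [t for t in tokens if t not in STOP_WORDS]      # remove stop words
    return [re.sub(r"(ing|ed|er|s)$", "", t) for t in tokens]  # walks/walked/walking -> walk

doc = "Argentine grain board figures show crop registrations of grains, oilseeds"
bag = Counter(preprocess(doc))  # sparse bag-of-words vector: word -> frequency
print(bag.most_common(3))       # e.g. [('grain', 2), ('argentine', 1), ...]
```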
**Pre-processing the Text: Dimensionality Reduction**

- Feature selection: attempts to remove non-informative words.
  - Document frequency thresholding
  - Information gain
- Latent Semantic Indexing
  - Singular Value Decomposition

**Text Classification: Methods**

- Naïve Bayes
- Voted Perceptron
- Support Vector Machines
- Decision Trees
- K-nearest neighbor
- Rocchio's algorithm
- Neural Networks

**Text Classification: Naïve Bayes**

- Represent document x as a list of words w1, w2, …
- For each class y, build a probabilistic model Pr(X | Y = y) of "documents":
  - Pr(X = {argentine, grain, ...} | Y = wheat) = ....
  - Pr(X = {stocks, rose, in, heavy, ...} | Y = nonWheat) = ....
- To classify, find the y that was most likely to generate x, i.e., the y that gives x the best score according to Pr(x | y):
  - f(x) = argmax_y Pr(x | y) · Pr(y)

**Naïve Bayes: Estimating Pr(X | Y)**

- Simplest useful process to generate a bag of words:
  - Pick word 1 according to Pr(W | Y).
  - Repeat for word 2, 3, ....
- Each word is generated independently of the others (which is clearly not true), but this means:
  - Pr(x = w1 … wn | y) = ∏i Pr(W = wi | Y = y)
- How to estimate Pr(W | Y)? Estimate Pr(w | y) by looking at the data, smoothed with an m-estimate, for instance m = 1, p = 0.5:
  - Pr(W = w | Y = y) = (count(w, y) + m·p) / (count(y) + m)
- Terms:
  - This Pr(W | Y) is a multinomial distribution.
  - This use of m and p is a Dirichlet prior for the multinomial.

**Naïve Bayes: Putting It Together**

- For each document xi with label yi:
  - For each word wij in xi: count[wij][yi]++, count[yi]++, count++
- To classify a new x = w1 … wn, pick the y with the top score:
  - f(x) = argmax_y Pr(y) · ∏i Pr(wi | y)
- Key point: we only need counts for the words that actually appear in document x.
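Putting the counting scheme and the m-estimate together, here is a minimal sketch of the pseudocode above in Python. The variable names and the two toy training documents are invented, and scores are computed in log space to avoid floating-point underflow.

```python
# Count-based Naive Bayes with m-estimate smoothing (m=1, p=0.5).
import math
from collections import defaultdict

word_class = defaultdict(int)   # count[w][y]: occurrences of word w in class y
class_words = defaultdict(int)  # count[y]: total words seen in class y
class_docs = defaultdict(int)   # documents per class, for the prior Pr(y)
total_docs = 0

def train(labeled_docs):
    global total_docs
    for words, y in labeled_docs:
        class_docs[y] += 1
        total_docs += 1
        for w in words:
            word_class[w, y] += 1
            class_words[y] += 1

def classify(words, m=1.0, p=0.5):
    # argmax_y log Pr(y) + sum_i log Pr(w_i | y); only words in x are needed.
    def score(y):
        s = math.log(class_docs[y] / total_docs)
        for w in words:
            s += math.log((word_class[w, y] + m * p) / (class_words[y] + m))
        return s
    return max(class_docs, key=score)

train([("grain exports rose".split(), "wheat"),
       ("stocks fell in heavy trading".split(), "nonWheat")])
print(classify("grain rose today".split()))  # -> 'wheat'
```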
**Naïve Bayes: Summary**

- Pros:
  - Very fast and easy to implement.
  - Well understood formally and experimentally.
- Cons:
  - Seldom gives the very best performance.
  - The "probabilities" Pr(y | x) are not accurate:
    - e.g., Pr(y | x) decreases with the length of x.
    - Probabilities tend to be close to zero or one.

**The Curse of Dimensionality**

- How can you learn with so many features?
  - For efficiency (time and memory), use sparse vectors.
  - Use simple classifiers (linear or loglinear).
  - Rely on wide margins.

**Margin-based Learning**

*(Diagram: positive and negative examples separated by a wide margin.)*

The number of features matters: but not if the margin is sufficiently wide and the examples are sufficiently close to the origin.

**Text Classification: Voted Perceptron**

- Text documents: x = ⟨x1, x2, …, xk⟩, where k is the number of features.
- Two classes: {yes, no}.
- Weight vector: w = ⟨w1, w2, …, wk⟩.
- Objective: learn a weight vector w and a threshold θ:
  - If x · w > θ, answer yes; otherwise, answer no.
- The algorithm keeps a sequence of weight vectors w(1), w(2), … with survival counts c(1), c(2), …. For each training example (xi, yi), if the answer is:
  - Correct: c(k)++ (the current weight vector survives another example).
  - Incorrect: a mistake has been made and a correction is applied:
    - w(k+1) = w(k) + yi · xi
    - c(k+1) = 1
    - k = k + 1
- The final classifier votes among all the intermediate weight vectors, each weighted by its survival count.

**Lessons of the Voted Perceptron**

- The Voted Perceptron shows that you can make few mistakes while learning incrementally as you pass over the data, provided the examples x are small (bounded by R) and some u exists that has a large margin.
- Why not look for that separator directly?
- Support vector machines: find u to maximize the margin.

**Text Classification: Support Vector Machines**

- Facts about support vector machines:
  - The "support vectors" are the xi's that touch the margin.
  - The classifier can be written as f(x) = sign(∑i αi yi (xi · x) + b), where the xi's are the support vectors.
  - Support vector machines often give very good results on topical text classification.

*(Diagram: positive and negative examples with the maximum-margin separator; the support vectors lie on the margin.)*

**Text Classification: Decision Trees**

- Objective of decision tree learning:
  - Learn a decision tree from a set of training data.
  - The decision tree can then be used to classify new examples.
- Decision tree learning algorithms:
  - ID3 (Quinlan, 1986)
  - C4.5 (Quinlan, 1993)
  - CART (Breiman, Friedman, et al., 1983)
  - etc.

**Decision Trees: CART**

- Creating the tree:
  - Split the set of training vectors; the best splitter has the purest children.
  - The diversity measure is entropy:
    - Entropy(S) = −∑i P(Ci) · log2 P(Ci), where S is the set of examples, k is the number of classes, and P(Ci) is the proportion of examples in S that belong to class Ci.
  - To find the best splitter, each component of the document vector is considered in turn.
  - This process is repeated until no set can be partitioned any further.
  - Each leaf is then assigned a class.
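A small sketch of the CART splitting step just described: it computes the entropy of a label set and picks the single vector component whose binary split yields the largest drop in entropy. The threshold-at-zero split rule and the toy data are illustrative assumptions, not the slides' procedure.

```python
# Entropy-based choice of the best single-feature split (CART-style).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(X, y):
    """X: list of feature vectors, y: class labels; split on feature > 0."""
    base, n = entropy(y), len(y)
    best = None
    for j in range(len(X[0])):                      # each component in turn
        left = [yi for xi, yi in zip(X, y) if xi[j] <= 0]
        right = [yi for xi, yi in zip(X, y) if xi[j] > 0]
        if not left or not right:
            continue                                # not a real split
        gain = base - (len(left) / n) * entropy(left) \
                    - (len(right) / n) * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, j)
    return best                                     # (entropy reduction, feature)

X = [[0, 1], [0, 2], [3, 0], [4, 0]]                # toy word-count vectors
print(best_split(X, ["nonWheat", "nonWheat", "wheat", "wheat"]))  # (1.0, 0)
```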
**TF-IDF Representation**

- The results above use a particular way to represent documents: bag of words with TF-IDF weighting.
- "Bag of words": a long sparse vector x = (f1, …, fi, …), where fi is the "weight" of the i-th word in the vocabulary.
- For a word w that appears in DF(w) documents out of N in the collection, and appears TF(w) times in the document being represented, use the weight:
  - fi = TF(w) · log(N / DF(w))
- Also normalize all vector lengths to 1 (||x|| = 1).
- TF-IDF representation is an old trick from the information retrieval community, and it often improves the performance of other algorithms:
  - the K-nearest neighbor algorithm
  - Rocchio's algorithm

**Text Classification: K-nearest Neighbor**

- The nearest neighbor algorithm:
  - Objective: to classify a new object, find the object in the training set that is most similar, then assign the category of this nearest neighbor.
- Determine the largest similarity with any element in the training set:
  - simmax(x) = max over xi in X of sim(x, xi)
- Collect the subset of X that has the highest similarity with x, and assign its category.

**Text Classification: Rocchio's Algorithm**

- Classify using the distance to the centroid of the documents from each class:
  - centroid(y) = (1 / |Dy|) · ∑ over xi in Dy of xi
- Assign the test document to the class whose centroid has maximum similarity.

**Text Classification: Neural Networks**

- A neural network text classifier is a collection of interconnected neurons that incrementally learns from its environment (the training set) to categorize documents.

*(Diagram: a single neuron with inputs X1 … Xn, weights W1 … Wn, activation u = ∑ Xi·Wi, and output F(u).)*

- Training alternates forward propagation, which computes the outputs from the inputs through the hidden units, with backward propagation, which passes the output error back through the network to adjust the weights.

*(Diagram: a feed-forward network with inputs IM1 … IMn, hidden units H1 … Hm, weights uji and vj, and outputs OS1(r), OS2(r), OS3(r); forward propagation runs input-to-output, backward propagation output-to-input.)*

**Performance Evaluation**

- The performance of a classification algorithm can be evaluated along several dimensions:
  - Predictive performance: how accurate is the learned model in prediction?
  - Complexity of the learned model.
  - Run time:
    - Time to build the model.
    - Time to classify examples using the model.
- Here we focus on predictive performance.
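To make "predictive performance" concrete, here is a minimal sketch of the usual per-class metrics. The slides name only predictive performance in general, so the specific choice of accuracy, precision, recall, and F1 is an assumption, and the gold/predicted label lists are invented.

```python
# Accuracy plus per-class precision, recall, and F1 for one target class.
def evaluate(gold, pred, target="wheat"):
    tp = sum(g == p == target for g, p in zip(gold, pred))            # true positives
    fp = sum(g != target and p == target for g, p in zip(gold, pred))  # false positives
    fn = sum(g == target and p != target for g, p in zip(gold, pred))  # false negatives
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

gold = ["wheat", "wheat", "nonWheat", "nonWheat"]  # invented reference labels
pred = ["wheat", "nonWheat", "nonWheat", "wheat"]  # invented classifier output
print(evaluate(gold, pred))  # (0.5, 0.5, 0.5, 0.5)
```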