
Partially Supervised Classification of Text Documents


Presentation Transcript


  1. Partially Supervised Classification of Text Documents Authors: Bing Liu, Wee Sun Lee, Philip S. Yu, Xiaoli Li Presented by: Swetha Nandyala CIS 525: Neural Computation

  2. Overview • Introduction • Theoretical Foundation • Background Methodology • NB-C • EM-Algorithm • Proposed Strategy • Evaluation Measures & Experiments • Conclusion CIS 525: Neural Computation

  3. Text Categorization • "… the activity of labeling natural language texts with thematic categories from a pre-defined set" [Sebastiani, 2002] • Text categorization is the task of automatically assigning to a text document d from a given domain D a category label c selected from a predefined set of category labels C • [Diagram: documents from D are fed into the categorization system, which assigns each one a label from C = {c1, c2, …, ck}] CIS 525: Neural Computation

  4. Text Categorization (contd.) • Standard supervised learning problem • Bottleneck: a very large number of labeled training documents is needed to build an accurate classifier • Goal: identify a particular class of documents from a set of mixed, unlabeled documents • Standard classification methods are inapplicable • Partially supervised classification is used instead CIS 525: Neural Computation

  5. Theoretical Foundations • AIM: show that PSC is a constrained optimization problem • Fixed distribution D over X × Y, where Y = {0, 1} • X, Y: the sets of possible documents and classes • Two sets of documents: • P, labeled positive, of size n1, drawn from X according to D_X|Y=1 • U, unlabeled, of size n2, drawn independently from X according to D_X • GOAL: find the positive documents in U CIS 525: Neural Computation

  6. Theoretical Foundations • Learning algorithm: selects a function f ∈ F, F: X → {0, 1} (a class of functions), to classify unlabeled documents • Probability of error: Pr[f(X) ≠ Y] is the sum of the "false positive" and "false negative" cases • Rewritten as Pr[f(X) ≠ Y] = Pr[f(X) = 1 ∧ Y = 0] + Pr[f(X) = 0 ∧ Y = 1] • After transforming: Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1] CIS 525: Neural Computation
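The "after transforming" step compresses a short manipulation of the false-positive term. A minimal derivation, using only the quantities already defined on the slide (written in LaTeX for readability), is:

\begin{align*}
\Pr[f(X)=1 \wedge Y=0] &= \Pr[f(X)=1] - \Pr[f(X)=1 \wedge Y=1] \\
&= \Pr[f(X)=1] - \bigl(\Pr[Y=1] - \Pr[f(X)=0 \wedge Y=1]\bigr),
\end{align*}

so that

\begin{align*}
\Pr[f(X) \neq Y] &= \Pr[f(X)=1] - \Pr[Y=1] + 2\,\Pr[f(X)=0 \wedge Y=1] \\
&= \Pr[f(X)=1] - \Pr[Y=1] + 2\,\Pr[f(X)=0 \mid Y=1]\,\Pr[Y=1].
\end{align*}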

  7. Theoretical Foundations (contd.) Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1] • Note that Pr[Y = 1] is constant • Approximation: keeping Pr[f(X) = 0 | Y = 1] small, error ≈ Pr[f(X) = 1] − Pr[Y = 1] = Pr[f(X) = 1] − const • i.e. minimizing Pr[f(X) = 1] ⇔ minimizing the error ⇔ minimizing Pr_U[f(X) = 1] while keeping Pr_P[f(X) = 1] ≥ r • This is nothing but a constrained optimization problem ⇒ learning is possible CIS 525: Neural Computation

  8. Naïve Bayesian Text Classification • D: the set of training documents • C = {c1, c2, ..., c|C|}: the predefined classes; here only c1 and c2 • For each di ∈ D, the posterior probabilities Pr[cj | di] are calculated • In the NB model, the class with the highest Pr[cj | di] is assigned to the document CIS 525: Neural Computation
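As a concrete illustration of the NB-C described above, here is a minimal sketch using scikit-learn's CountVectorizer and MultinomialNB; the library choice and the toy documents are assumptions for illustration, not part of the presentation. predict_proba returns the posteriors Pr[cj | di] and predict picks the class with the highest posterior:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (hypothetical): c1 = the positive class, c2 = everything else.
train_docs = ["great tutorial on neural networks",
              "deep learning for text classification",
              "recipe for chocolate cake",
              "football match results and scores"]
train_labels = ["c1", "c1", "c2", "c2"]

# Bag-of-words counts for each document.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Multinomial NB estimates Pr[cj] and Pr[word | cj] from the counts.
nb = MultinomialNB()
nb.fit(X_train, train_labels)

# Posteriors Pr[cj | di] for a new document; the highest one wins.
X_test = vectorizer.transform(["text categorization with naive bayes"])
print(nb.classes_)               # ['c1' 'c2']
print(nb.predict_proba(X_test))  # Pr[cj | di]
print(nb.predict(X_test))        # argmax over the posteriors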

  9. The EM-Algorithm • Iterative algorithm for maximum likelihood estimation in problems with incomplete data • Two-step method: • Expectation step: fills in the missing data • Maximization step: estimates the parameters after the missing data has been filled in CIS 525: Neural Computation

  10. Proposed Strategy • Step 1: Re-initialization • Iterative EM (I-EM): apply the EM algorithm over P and U • Identify a set of reliable negative documents in the unlabeled set by introducing spies • Step 2: Building and selecting a classifier • Spy EM (S-EM): build a set of classifiers iteratively • Select a good classifier from the set of classifiers constructed above CIS 525: Neural Computation

  11. Iterative EM with NB-C • Assign every document in P(ositive) the class label c1 and every document in U(nlabeled) the class label c2 • Pr[c1 | di] = 1 and Pr[c2 | di] = 0 for each di in P • Pr[c2 | dj] = 1 and Pr[c1 | dj] = 0 for each dj in U • After this initial labeling, an NB-C is built and used to classify the documents in U • Revise the posterior probabilities of the documents in U • After revising, a new NB-C is built from the updated posteriors • The iterative process continues until EM converges • Setback: strongly biased towards positive documents CIS 525: Neural Computation
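A rough sketch of the I-EM loop described on this slide, reusing the CountVectorizer/MultinomialNB combination from the earlier example. The hard relabeling of U in each round and the stopping test are my simplifications; the paper works with soft posterior probabilities for the unlabeled documents:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def iterative_em(positive_docs, unlabeled_docs, max_iter=20):
    # I-EM sketch: P is fixed to class c1 (= 1), U starts as class c2 (= 0);
    # an NB classifier is rebuilt and the labels of U revised until nothing changes.
    docs = positive_docs + unlabeled_docs
    X = CountVectorizer().fit_transform(docs)
    n_p = len(positive_docs)

    y = np.array([1] * n_p + [0] * len(unlabeled_docs))
    nb = MultinomialNB()

    for _ in range(max_iter):
        nb.fit(X, y)                               # rebuild NB-C from current labels
        post = nb.predict_proba(X[n_p:])           # Pr[c | dj] for every dj in U
        pos_col = list(nb.classes_).index(1)
        new_y = np.concatenate([np.ones(n_p, dtype=int),                 # P stays c1
                                (post[:, pos_col] >= 0.5).astype(int)])  # relabel U
        if np.array_equal(new_y, y):               # labels stable -> EM has converged
            break
        y = new_y
    return nb, post[:, pos_col]                    # final classifier + Pr[c1 | dj] for U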

  12. Step 1: Re-Initialization • Sample a certain % of the positive examples, call them S, and put them into the unlabeled set to act as "spies" • The I-EM algorithm is run as before, but the U(nlabeled) set now contains the spy documents • After EM completes, the probabilistic labels of the spies are used to decide which documents are most likely negative (LN) • A threshold t is used for the decision: • if Pr[c1 | dj] < t: dj is denoted L(ikely) N(egative) • if Pr[c1 | dj] ≥ t: dj remains in U(nlabeled) CIS 525: Neural Computation
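A hedged sketch of Step 1, built on the iterative_em function sketched above. The 15% spy rate is an illustrative assumption, and since the slides do not spell out how t is chosen from the spies' probabilities, the minimum spy posterior is used here as a simple stand-in:

import random

def spy_reinitialize(positive_docs, unlabeled_docs, spy_frac=0.15, seed=0):
    # Step 1: move a random sample of P into U as "spies", run I-EM, and use
    # the spies' posteriors to split U into likely-negative (LN) and the rest.
    rng = random.Random(seed)
    spies = rng.sample(positive_docs, max(1, int(spy_frac * len(positive_docs))))
    p_minus_s = [d for d in positive_docs if d not in spies]
    u_plus_s = unlabeled_docs + spies

    _, post_u = iterative_em(p_minus_s, u_plus_s)   # Pr[c1 | dj] for dj in U plus S

    # Threshold t taken from the spies' posteriors (assumed: minimum spy score).
    n_u = len(unlabeled_docs)
    spy_post = post_u[n_u:]
    t = spy_post.min()

    ln = [d for d, p in zip(unlabeled_docs, post_u[:n_u]) if p < t]          # likely negative
    remaining_u = [d for d, p in zip(unlabeled_docs, post_u[:n_u]) if p >= t]
    return ln, remaining_u, spies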

  13. Step-1 Effect • [Diagram: composition of the sets before and after Step 1. Before: P(ositive) and U(nlabeled), where U = P ∪ N, there is no clue which documents are positive or negative, and spies from P have been added to U. After: P(ositive), a remaining U(nlabeled) set containing some spies, and LN (likely negative).] • With the help of the spies, most positives in U stay in the unlabeled set, while most negatives end up in LN; the purity of LN is higher than that of U CIS 525: Neural Computation

  14. Step 2: S-EM • Apply EM over P, LN and U • The algorithm proceeds as follows: • put all spies S back into P (where they were before) • di ∈ P: assigned to c1 (i.e. Pr[c1 | di] = 1); fixed throughout the iterations • dj ∈ LN: assigned to c2 (i.e. Pr[c2 | dj] = 1); allowed to change during EM • dk ∈ U: initially assigned no label (it gets one after EM(1)) • run EM using P, LN and U until it converges • the final classifier is produced when EM stops CIS 525: Neural Computation
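A short sketch of how Step 2 could chain the two earlier sketches. The hard-label treatment is again my simplification: the slide fixes P to c1 and leaves the remaining U documents unlabeled until the first EM iteration, whereas this sketch starts them at c2 alongside LN:

def spy_em(positive_docs, unlabeled_docs):
    # Step 1: spy-based re-initialization splits U into LN and a remaining U.
    ln, remaining_u, _ = spy_reinitialize(positive_docs, unlabeled_docs)

    # Step 2: all spies are back in P (the full positive set is used again),
    # P is fixed to c1, and EM is run over P, LN and the remaining U.
    # Simplification: the remaining U documents start as c2 here as well.
    nb, post = iterative_em(positive_docs, ln + remaining_u)
    return nb, post    # final classifier and Pr[c1 | d] for LN + remaining U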

  15. Selecting a Classifier Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1] • S-EM generates a set of classifiers, but the classification is not necessarily improving • Remedy: stop iterating EM at some point • Estimate the change of the probability of error between iterations i and i+1: Δi = Pr[fi+1(X) ≠ Y] − Pr[fi(X) ≠ Y] • If Δi > 0 for the first time, then the i-th classifier produced is the final classifier CIS 525: Neural Computation
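A rough sketch of how this stopping rule might be implemented, given the list of classifiers produced by the EM iterations, document-term matrices for U and P, and an assumed estimate pr_y1 of Pr[Y = 1] (the presentation does not say how that estimate is obtained). Pr[f(X) = 1] is estimated on U and Pr[f(X) = 0 | Y = 1] on P:

import numpy as np

def estimated_error(classifier, X_unlabeled, X_positive, pr_y1):
    # Estimate the slide's error expression for one classifier f:
    #   Pr[f(X) != Y] = Pr[f(X)=1] - Pr[Y=1] + 2 Pr[f(X)=0 | Y=1] Pr[Y=1]
    pr_f1 = np.mean(classifier.predict(X_unlabeled) == 1)          # estimated on U
    pr_f0_given_y1 = np.mean(classifier.predict(X_positive) == 0)  # estimated on P
    return pr_f1 - pr_y1 + 2 * pr_f0_given_y1 * pr_y1

def select_classifier(classifiers, X_unlabeled, X_positive, pr_y1):
    # Return the classifier built just before the first estimated increase in error.
    errors = [estimated_error(f, X_unlabeled, X_positive, pr_y1) for f in classifiers]
    for i in range(len(errors) - 1):
        if errors[i + 1] - errors[i] > 0:    # delta_i > 0 for the first time
            return classifiers[i]
    return classifiers[-1]                   # error never went up: keep the last one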

  16. Evaluation Measures • Accuracy (of a classifier): A = m / (m + i), where m and i are the numbers of correct and incorrect decisions, respectively • F-score: F = 2pr / (p + r) is a classification performance measure, where recall r = a / (a + c) and precision p = a / (a + b) • The F-value reflects the combined (harmonic-mean) effect of precision and recall CIS 525: Neural Computation
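In code, the two measures are one-liners. The mapping a = true positives, b = false positives, c = false negatives follows the standard contingency table; the slide does not define a, b and c explicitly:

def accuracy(m, i):
    # A = m / (m + i): m correct decisions, i incorrect decisions.
    return m / (m + i)

def f_score(a, b, c):
    # p = a / (a + b): precision, r = a / (a + c): recall, F = 2pr / (p + r).
    p = a / (a + b)
    r = a / (a + c)
    return 2 * p * r / (p + r)

# Example: 40 true positives, 10 false positives, 20 false negatives.
print(f_score(40, 10, 20))   # ~0.727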

  17. Experiments • 30 datasets created from 2 large document corpora • Objective: recover the positive documents placed into the mixed sets • For each experiment: • the full positive set is divided into two subsets, P and R • P: the positive set used by the algorithm, containing a% of the full positive set • R: the set of remaining documents, of which b% were put into U (not all of R is put into U) CIS 525: Neural Computation

  18. Experiments (contd…) • Techniques used: • NB-C: applied directly to P (c1) and U (c2) to build a classifier that then classifies the data in U • I-EM: applies EM to P and U until it converges (no spies yet); the final classifier is applied to U to identify its positives • S-EM: spies are used to re-initialize; I-EM then builds the final classifier; the threshold t is used CIS 525: Neural Computation

  19. Experiments (contd…) • S-EM outperforms NB and I-EM in F dramatically • S-EM outperforms NB and I-EM in A as well • comment: datasets skewed, so A is not a reliable measure of classifier’s performance CIS 525: Neural Computation

  20. Experiments (contd…) • The results show the great effect of re-initialization with spies: • S-EM outperforms I-EMbest • Re-initialization is not, however, the only factor of improvement: • S-EM outperforms S-EM4 • Conclusion: both Step 1 (re-initializing) and Step 2 (selecting the best model) are needed! CIS 525: Neural Computation

  21. Conclusion • Gives an overview of the theory on learning with positive and unlabeled examples • Describes a two-step strategy for learning which produces extremely accurate classifiers • Partially supervised classification is most helpful when initial model is insufficiently trained CIS 525: Neural Computation

  22. Questions? CIS 525: Neural Computation
