
Partially Supervised Classification of Text Documents


Presentation Transcript


  1. Partially Supervised Classification of Text Documents Authors: Bing Liu, Wee Sun Lee, Philip S. Yu, Xiaoli Li Presented by: Swetha Nandyala CIS 525: Neural Computation

  2. Overview • Introduction • Theoretical Foundation • Background Methodology • NB-C • EM-Algorithm • Proposed Strategy • Evaluation Measures & Experiments • Conclusion CIS 525: Neural Computation

  3. Text Categorization • "… the activity of labeling natural language texts with thematic categories from a pre-defined set" [Sebastiani, 2002] • Text categorization is the task of automatically assigning to a text document d from a given domain D a category label c selected from a predefined set of category labels C • [Diagram: documents from D are fed into the categorization system, which assigns each one a label from C = {c1, c2, …, ck}] CIS 525: Neural Computation

  4. Text Categorization (contd.) • Standard supervised learning problem • Bottleneck: a very large number of labeled training documents is needed to build an accurate classifier • Goal: identify a particular class of documents from a set of mixed, unlabeled documents • Standard classification methods are inapplicable • Partially supervised classification is used instead CIS 525: Neural Computation

  5. Theoretical Foundations • AIM: show that PSC is a constrained optimization problem • Fixed distribution D over X × Y, where Y = {0, 1} • X, Y: the sets of possible documents and classes • Two sets of documents: • P, labeled positive, of size n1, drawn from X according to D_X|Y=1 • U, unlabeled, of size n2, drawn independently from X according to D_X • GOAL: find the positive documents in U CIS 525: Neural Computation

  6. Theoretical Foundations • Learning algorithm: selects a function f ∈ F, F: X → {0, 1} (a class of functions), to classify unlabeled documents • Probability of error: Pr[f(X) ≠ Y] is the sum of the "false positive" and "false negative" cases • Rewritten as Pr[f(X) ≠ Y] = Pr[f(X) = 1 ∧ Y = 0] + Pr[f(X) = 0 ∧ Y = 1] • After transforming: Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1] CIS 525: Neural Computation
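The "after transforming" step compresses a short manipulation of the false-positive term. A minimal derivation, using only the quantities already defined on the slide (written in LaTeX for readability), is:

\begin{align*}
\Pr[f(X)=1 \wedge Y=0] &= \Pr[f(X)=1] - \Pr[f(X)=1 \wedge Y=1] \\
&= \Pr[f(X)=1] - \bigl(\Pr[Y=1] - \Pr[f(X)=0 \wedge Y=1]\bigr),
\end{align*}

so that

\begin{align*}
\Pr[f(X) \neq Y] &= \Pr[f(X)=1] - \Pr[Y=1] + 2\,\Pr[f(X)=0 \wedge Y=1] \\
&= \Pr[f(X)=1] - \Pr[Y=1] + 2\,\Pr[f(X)=0 \mid Y=1]\,\Pr[Y=1].
\end{align*}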

  7. Theoretical Foundations (contd.) Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1] • Note that Pr[Y = 1] is constant • Approximation: keeping Pr[f(X) = 0 | Y = 1] small, error ≈ Pr[f(X) = 1] − Pr[Y = 1] = Pr[f(X) = 1] − const • i.e. minimizing Pr[f(X) = 1] ⇔ minimizing the error ⇔ minimizing Pr_U[f(X) = 1] while keeping Pr_P[f(X) = 1] ≥ r • This is nothing but a constrained optimization problem ⇒ learning is possible CIS 525: Neural Computation

  8. Naïve Bayesian Text Classification • D: the set of training documents • C = {c1, c2, ..., c|C|}: the predefined classes; here only c1 and c2 • For each di ∈ D, the posterior probabilities Pr[cj | di] are calculated • In the NB model, the class with the highest Pr[cj | di] is assigned to the document CIS 525: Neural Computation
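As a concrete illustration of the NB-C described above, here is a minimal sketch using scikit-learn's CountVectorizer and MultinomialNB; the library choice and the toy documents are assumptions for illustration, not part of the presentation. predict_proba returns the posteriors Pr[cj | di] and predict picks the class with the highest posterior:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (hypothetical): c1 = the positive class, c2 = everything else.
train_docs = ["great tutorial on neural networks",
              "deep learning for text classification",
              "recipe for chocolate cake",
              "football match results and scores"]
train_labels = ["c1", "c1", "c2", "c2"]

# Bag-of-words counts for each document.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Multinomial NB estimates Pr[cj] and Pr[word | cj] from the counts.
nb = MultinomialNB()
nb.fit(X_train, train_labels)

# Posteriors Pr[cj | di] for a new document; the highest one wins.
X_test = vectorizer.transform(["text categorization with naive bayes"])
print(nb.classes_)               # ['c1' 'c2']
print(nb.predict_proba(X_test))  # Pr[cj | di]
print(nb.predict(X_test))        # argmax over the posteriors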

  9. The EM-Algorithm • Iterative algorithm for maximum likelihood estimation in problems with incomplete data • Two-step method: • Expectation step: fills in the missing data • Maximization step: estimates the parameters after the missing data has been filled in CIS 525: Neural Computation

  10. Proposed Strategy • Step 1: Re-initialization • Iterative EM (I-EM): apply the EM algorithm over P and U • Identify a set of reliable negative documents in the unlabeled set by introducing spies • Step 2: Building and selecting a classifier • Spy EM (S-EM): build a set of classifiers iteratively • Select a good classifier from the set of classifiers constructed above CIS 525: Neural Computation

  11. Iterative EM with NB-C • Assign every document in P(ositive) the class label c1 and every document in U(nlabeled) the class label c2 • Pr[c1 | di] = 1 and Pr[c2 | di] = 0 for each di in P • Pr[c2 | dj] = 1 and Pr[c1 | dj] = 0 for each dj in U • After this initial labeling, an NB-C is built and used to classify the documents in U • Revise the posterior probabilities of the documents in U • After revising, a new NB-C is built from the updated posteriors • The iterative process continues until EM converges • Setback: strongly biased towards positive documents CIS 525: Neural Computation
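A rough sketch of the I-EM loop described on this slide, reusing the CountVectorizer/MultinomialNB combination from the earlier example. The hard relabeling of U in each round and the stopping test are my simplifications; the paper works with soft posterior probabilities for the unlabeled documents:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def iterative_em(positive_docs, unlabeled_docs, max_iter=20):
    # I-EM sketch: P is fixed to class c1 (= 1), U starts as class c2 (= 0);
    # an NB classifier is rebuilt and the labels of U revised until nothing changes.
    docs = positive_docs + unlabeled_docs
    X = CountVectorizer().fit_transform(docs)
    n_p = len(positive_docs)

    y = np.array([1] * n_p + [0] * len(unlabeled_docs))
    nb = MultinomialNB()

    for _ in range(max_iter):
        nb.fit(X, y)                               # rebuild NB-C from current labels
        post = nb.predict_proba(X[n_p:])           # Pr[c | dj] for every dj in U
        pos_col = list(nb.classes_).index(1)
        new_y = np.concatenate([np.ones(n_p, dtype=int),                 # P stays c1
                                (post[:, pos_col] >= 0.5).astype(int)])  # relabel U
        if np.array_equal(new_y, y):               # labels stable -> EM has converged
            break
        y = new_y
    return nb, post[:, pos_col]                    # final classifier + Pr[c1 | dj] for U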

  12. Step 1: Re-Initialization • Sample a certain % of the positive examples, call them S, and put them into the unlabeled set to act as "spies" • The I-EM algorithm is run as before, but the U(nlabeled) set now contains the spy documents • After EM completes, the probabilistic labels of the spies are used to decide which documents are most likely negative (LN) • A threshold t is used for the decision: • if Pr[c1 | dj] < t: dj is denoted L(ikely) N(egative) • if Pr[c1 | dj] ≥ t: dj remains in U(nlabeled) CIS 525: Neural Computation
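A hedged sketch of Step 1, built on the iterative_em function sketched above. The 15% spy rate is an illustrative assumption, and since the slides do not spell out how t is chosen from the spies' probabilities, the minimum spy posterior is used here as a simple stand-in:

import random

def spy_reinitialize(positive_docs, unlabeled_docs, spy_frac=0.15, seed=0):
    # Step 1: move a random sample of P into U as "spies", run I-EM, and use
    # the spies' posteriors to split U into likely-negative (LN) and the rest.
    rng = random.Random(seed)
    spies = rng.sample(positive_docs, max(1, int(spy_frac * len(positive_docs))))
    p_minus_s = [d for d in positive_docs if d not in spies]
    u_plus_s = unlabeled_docs + spies

    _, post_u = iterative_em(p_minus_s, u_plus_s)   # Pr[c1 | dj] for dj in U plus S

    # Threshold t taken from the spies' posteriors (assumed: minimum spy score).
    n_u = len(unlabeled_docs)
    spy_post = post_u[n_u:]
    t = spy_post.min()

    ln = [d for d, p in zip(unlabeled_docs, post_u[:n_u]) if p < t]          # likely negative
    remaining_u = [d for d, p in zip(unlabeled_docs, post_u[:n_u]) if p >= t]
    return ln, remaining_u, spies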

  13. Step-1 Effect • [Diagram: composition of the sets before and after Step 1. Before: P(ositive) and U(nlabeled), where U = P ∪ N, there is no clue which documents are positive or negative, and spies from P have been added to U. After: P(ositive), a remaining U(nlabeled) set containing some spies, and LN (likely negative).] • With the help of the spies, most positives in U stay in the unlabeled set, while most negatives end up in LN; the purity of LN is higher than that of U CIS 525: Neural Computation

  14. Step 2: S-EM • Apply EM over P, LN and U • The algorithm proceeds as follows: • put all spies S back into P (where they were before) • di ∈ P: assigned to c1 (i.e. Pr[c1 | di] = 1); fixed throughout the iterations • dj ∈ LN: assigned to c2 (i.e. Pr[c2 | dj] = 1); allowed to change during EM • dk ∈ U: initially assigned no label (it gets one after EM(1)) • run EM using P, LN and U until it converges • the final classifier is produced when EM stops CIS 525: Neural Computation
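A short sketch of how Step 2 could chain the two earlier sketches. The hard-label treatment is again my simplification: the slide fixes P to c1 and leaves the remaining U documents unlabeled until the first EM iteration, whereas this sketch starts them at c2 alongside LN:

def spy_em(positive_docs, unlabeled_docs):
    # Step 1: spy-based re-initialization splits U into LN and a remaining U.
    ln, remaining_u, _ = spy_reinitialize(positive_docs, unlabeled_docs)

    # Step 2: all spies are back in P (the full positive set is used again),
    # P is fixed to c1, and EM is run over P, LN and the remaining U.
    # Simplification: the remaining U documents start as c2 here as well.
    nb, post = iterative_em(positive_docs, ln + remaining_u)
    return nb, post    # final classifier and Pr[c1 | d] for LN + remaining U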

  15. Selecting a Classifier Pr[f(X) ≠ Y] = Pr[f(X) = 1] − Pr[Y = 1] + 2 Pr[f(X) = 0 | Y = 1] Pr[Y = 1] • S-EM generates a set of classifiers, but the classification is not necessarily improving • Remedy: stop iterating EM at some point • Estimate the change of the probability of error between iterations i and i+1: Δi = Pr[fi+1(X) ≠ Y] − Pr[fi(X) ≠ Y] • If Δi > 0 for the first time, then the i-th classifier produced is the final classifier CIS 525: Neural Computation
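A rough sketch of how this stopping rule might be implemented, given the list of classifiers produced by the EM iterations, document-term matrices for U and P, and an assumed estimate pr_y1 of Pr[Y = 1] (the presentation does not say how that estimate is obtained). Pr[f(X) = 1] is estimated on U and Pr[f(X) = 0 | Y = 1] on P:

import numpy as np

def estimated_error(classifier, X_unlabeled, X_positive, pr_y1):
    # Estimate the slide's error expression for one classifier f:
    #   Pr[f(X) != Y] = Pr[f(X)=1] - Pr[Y=1] + 2 Pr[f(X)=0 | Y=1] Pr[Y=1]
    pr_f1 = np.mean(classifier.predict(X_unlabeled) == 1)          # estimated on U
    pr_f0_given_y1 = np.mean(classifier.predict(X_positive) == 0)  # estimated on P
    return pr_f1 - pr_y1 + 2 * pr_f0_given_y1 * pr_y1

def select_classifier(classifiers, X_unlabeled, X_positive, pr_y1):
    # Return the classifier built just before the first estimated increase in error.
    errors = [estimated_error(f, X_unlabeled, X_positive, pr_y1) for f in classifiers]
    for i in range(len(errors) - 1):
        if errors[i + 1] - errors[i] > 0:    # delta_i > 0 for the first time
            return classifiers[i]
    return classifiers[-1]                   # error never went up: keep the last one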

  16. Evaluation Measures • Accuracy (of a classifier): A = m / (m + i), where m and i are the numbers of correct and incorrect decisions, respectively • F-score: F = 2pr / (p + r) is a classification performance measure, where recall r = a / (a + c) and precision p = a / (a + b) • The F-value reflects the combined (harmonic-mean) effect of precision and recall CIS 525: Neural Computation
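In code, the two measures are one-liners. The mapping a = true positives, b = false positives, c = false negatives follows the standard contingency table; the slide does not define a, b and c explicitly:

def accuracy(m, i):
    # A = m / (m + i): m correct decisions, i incorrect decisions.
    return m / (m + i)

def f_score(a, b, c):
    # p = a / (a + b): precision, r = a / (a + c): recall, F = 2pr / (p + r).
    p = a / (a + b)
    r = a / (a + c)
    return 2 * p * r / (p + r)

# Example: 40 true positives, 10 false positives, 20 false negatives.
print(f_score(40, 10, 20))   # ~0.727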

  17. Experiments • 30 datasets created from 2 large document corpora • Objective: recover the positive documents placed into the mixed sets • For each experiment: • the full positive set is divided into two subsets, P and R • P: the positive set used by the algorithm, containing a% of the full positive set • R: the set of remaining documents, of which b% were put into U (not all of R is put into U) CIS 525: Neural Computation

  18. Experiments (contd…) • Techniques used: • NB-C: applied directly to P (c1) and U (c2) to build a classifier that then classifies the data in U • I-EM: applies EM to P and U until it converges (no spies yet); the final classifier is applied to U to identify its positives • S-EM: spies are used to re-initialize; I-EM then builds the final classifier; the threshold t is used CIS 525: Neural Computation

  19. Experiments (contd…) • S-EM outperforms NB and I-EM in F dramatically • S-EM outperforms NB and I-EM in A as well • comment: datasets skewed, so A is not a reliable measure of classifier’s performance CIS 525: Neural Computation

  20. Experiments (contd…) • The results show the great effect of re-initialization with spies: • S-EM outperforms I-EMbest • Re-initialization is not, however, the only factor of improvement: • S-EM outperforms S-EM4 • Conclusion: both Step 1 (re-initializing) and Step 2 (selecting the best model) are needed! CIS 525: Neural Computation

  21. Conclusion • Gives an overview of the theory on learning with positive and unlabeled examples • Describes a two-step strategy for learning which produces extremely accurate classifiers • Partially supervised classification is most helpful when initial model is insufficiently trained CIS 525: Neural Computation

  22. Questions? CIS 525: Neural Computation
