
Partially Supervised Classification of Text Documents

This paper presents a practical technique for partially supervised classification of text documents using the naive Bayes classifier and the Expectation-Maximization algorithm.


Presentation Transcript


  1. Partially Supervised Classification of Text Documents, by Bing Liu, Philip Yu, and Xiaoli Li. Presented by: Rick Knowles, 7 April 2005

  2. Agenda • Problem Statement • Related Work • Theoretical Foundations • Proposed Technique • Evaluation • Conclusions

  3. Problem Statement: Common Approach • Text categorization: automated assignment of text documents to pre-defined classes • Common approach: supervised learning • Manually label a set of documents with pre-defined classes • Use a learning algorithm to build a classifier • [Figure: documents labeled + (positive) and _ (negative)]

  4. Problem Statement: Common Approach (cont.) • Problem: labeling the large number of training documents needed to build the classifier is a bottleneck • Nigam et al. have shown that using a large dose of unlabeled data can help • [Figure: labeled documents (+ and _) mixed with unlabeled documents (.)]

  5. A different approach: Partially supervised classification • Two-class problem: positive and unlabeled • Key feature is that there is no labeled negative document • Can be posed as a constrained optimization problem • A function that correctly classifies all positive docs and minimizes the number of mixed docs classified as positive will have an expected error rate of no more than ε • Example: finding matching (i.e., positive) documents in a large collection such as the Web • Matching documents are positive • All others are negative • [Figure: positive documents (+) scattered among unlabeled documents (.)]

  6. Related Work • Text Classification techniques • Naïve Bayesian • K-nearest neighbor • Support vector machines • Each requires labeled data for all classes • Problem similar to traditional information retrieval • Rank orders documents according to their similarities to the query document • Does not perform document classification

  7. Theoretical Foundations • Some discussion regarding the theoretical foundations, focused primarily on • Minimization of the probability of error • Expected recall and precision of functions • $\Pr[f(X) \neq Y] = \Pr[f(X)=1] - \Pr[Y=1] + 2\Pr[f(X)=0 \mid Y=1]\Pr[Y=1]$ (1) • Painful, painful… but it did show that, with high probability, accurate classifiers can be built when sufficient documents in P (the positive document set) and M (the unlabeled set) are available.
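The error expression (1) on this slide can be checked with a short derivation. This is a sketch, assuming binary labels $Y \in \{0,1\}$ with 1 meaning positive and a classifier $f$ whose outputs are also in $\{0,1\}$:

```latex
\begin{align*}
\Pr[f(X) \neq Y] &= \Pr[f(X)=1, Y=0] + \Pr[f(X)=0, Y=1] \\
\Pr[f(X)=1, Y=0] &= \Pr[f(X)=1] - \Pr[f(X)=1, Y=1] \\
                 &= \Pr[f(X)=1] - \bigl(\Pr[Y=1] - \Pr[f(X)=0, Y=1]\bigr) \\
\Pr[f(X) \neq Y] &= \Pr[f(X)=1] - \Pr[Y=1] + 2\Pr[f(X)=0, Y=1] \\
                 &= \Pr[f(X)=1] - \Pr[Y=1] + 2\Pr[f(X)=0 \mid Y=1]\Pr[Y=1]
\end{align*}
```

Read this way, Pr[Y=1] is a constant, so if Pr[f(X)=0 | Y=1] is kept small (high recall on the positive set), reducing the error amounts to reducing Pr[f(X)=1], the fraction of documents classified as positive. That appears to be the link to the constrained optimization view on slide 5.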

  8. Theoretical Foundations (cont.) • Two serious practical drawbacks to the theoretical method • Constrained optimization problem may not be easy to solve for the function class in which we are interested • Not easy to choose a desired recall level that will give a good classifier using the function class we are using

  9. Proposed Technique • Theory be darned! • Paper introduces a practical technique based on the naïve Bayes classifier and the Expectation-Maximization (EM) algorithm • After introducing a general technique, the authors offer an enhancement using spies

  10. Proposed Technique: Terms • D is the set of training documents • V = < w1, w2, …, w|V| > is the set of all words considered for classification • wdi,k is the word in position k in document di • N(wt, di) is the number of times wt occurs in di • C = {c1, c2} is the set of predefined classes • P is the set of positive documents • M is the set of unlabeled (mixed) documents • S is the set of spy documents • For a labeled document, the posterior probability Pr[cj | di] ∈ {0,1} depends on the class label of the document

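  11. Proposed Technique: naïve Bayesian classifier (NB-C)

$$\Pr[c_j] = \frac{\sum_{i=1}^{|D|} \Pr[c_j \mid d_i]}{|D|} \tag{2}$$

$$\Pr[w_t \mid c_j] = \frac{1 + \sum_{i=1}^{|D|} \Pr[c_j \mid d_i]\, N(w_t, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} \Pr[c_j \mid d_i]\, N(w_s, d_i)} \tag{3}$$

and, assuming the words are independent given the class,

$$\Pr[c_j \mid d_i] = \frac{\Pr[c_j] \prod_{k=1}^{|d_i|} \Pr[w_{d_i,k} \mid c_j]}{\sum_{r=1}^{|C|} \Pr[c_r] \prod_{k=1}^{|d_i|} \Pr[w_{d_i,k} \mid c_r]} \tag{4}$$

The class with the highest Pr[cj|di] is assigned as the class of the doc.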
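Equations (2) to (4) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `train_nb` and `posterior`, the toy documents, and the choice to skip words outside the vocabulary are all assumptions made here. The product in (4) is computed in log space to avoid underflow.

```python
import math
from collections import Counter

def train_nb(docs, post, vocab):
    """Estimate Pr[cj] (eq. 2) and Pr[wt|cj] (eq. 3).

    docs:  list of token lists
    post:  post[i][j] = Pr[cj|di] (0/1 for labeled docs, fractional under EM)
    vocab: list of words considered for classification
    """
    counts = [Counter(d) for d in docs]  # N(wt, di)
    priors, cond = [], []
    for j in range(len(post[0])):
        priors.append(sum(p[j] for p in post) / len(docs))            # eq. (2)
        mass = {w: sum(p[j] * c[w] for p, c in zip(post, counts)) for w in vocab}
        denom = len(vocab) + sum(mass.values())
        cond.append({w: (1.0 + mass[w]) / denom for w in vocab})      # eq. (3)
    return priors, cond

def posterior(doc, priors, cond):
    """Compute Pr[cj|di] for every class (eq. 4), in log space."""
    logs = []
    for prior, pw in zip(priors, cond):
        lp = math.log(prior) if prior > 0 else float("-inf")
        lp += sum(math.log(pw[w]) for w in doc if w in pw)  # assumption: skip unknown words
        logs.append(lp)
    top = max(logs)
    unnorm = [math.exp(lp - top) for lp in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Toy data (hypothetical): two documents per class, hard labels.
docs = [["ball", "game"], ["ball", "team"], ["vote", "law"], ["law", "court"]]
post = [[1, 0], [1, 0], [0, 1], [0, 1]]
vocab = sorted({w for d in docs for w in d})
priors, cond = train_nb(docs, post, vocab)
probs = posterior(["ball", "game"], priors, cond)
```

Because `post` holds probabilities rather than hard labels, the same `train_nb` can be reused unchanged once EM starts assigning fractional class memberships.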

  12. Proposed Technique: EM algorithm • Popular class of iterative algorithms for maximum likelihood estimation in problems with incomplete data. • Two steps • Expectation: fills in the missing data • Maximization: parameters are estimated • Rinse and repeat • Using an NB-C, (4) corresponds to the E step (it fills in the missing class labels), and (2) and (3) correspond to the M step (they re-estimate the parameters) • Probability of a class now takes a value in [0,1] instead of {0,1}

  13. Proposed Technique:EM algorithm (cont.) • All positive documents have the class value c1 • Need to determine class value of each doc in mixed set. • EM can help assign a probabilistic class label to each document dj in the mixed set • Pr[c1|dj] and Pr[c2|dj] • After a number of iterations, all the probabilities will converge

  14. Proposed Technique: Step 1 - Reinitialization (I-EM) • Reinitialization • Build an initial NB-C using the document sets M and P • For each document dj in P, Pr[c1|dj] = 1 and Pr[c2|dj] = 0 • For each document dj in M, Pr[c1|dj] = 0 and Pr[c2|dj] = 1 • Loop while classifier parameters change • For each document dj ∈ M • Compute Pr[c1|dj] using the current NB-C • Pr[c2|dj] = 1 - Pr[c1|dj] • Update Pr[wt|c1] and Pr[c1] given the probabilistically assigned class for dj (Pr[c1|dj]) and P (a new NB-C is being built in the process) • Works well on easy datasets • Problem is that our initialization is strongly biased towards positive documents
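The I-EM loop on this slide can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the names `train_nb`, `pr_c1_given_doc`, and `i_em` are made up here, convergence is tested on the change in Pr[c1|dj] as a stand-in for "classifier parameters change", and `max_iter` and `tol` are arbitrary defaults.

```python
import math
from collections import Counter

def train_nb(counts, pr_c1, vocab):
    """M step: class priors and smoothed word probabilities from soft labels."""
    model = []
    for j in (0, 1):
        weight = [p if j == 0 else 1.0 - p for p in pr_c1]   # Pr[cj|di]
        prior = sum(weight) / len(counts)
        mass = {w: sum(wt * c[w] for wt, c in zip(weight, counts)) for w in vocab}
        denom = len(vocab) + sum(mass.values())
        model.append((prior, {w: (1.0 + mass[w]) / denom for w in vocab}))
    return model

def pr_c1_given_doc(count, model):
    """E step: Pr[c1|dj] under the current classifier, in log space."""
    logs = []
    for prior, cond in model:
        lp = math.log(prior) if prior > 0 else float("-inf")
        lp += sum(n * math.log(cond[w]) for w, n in count.items() if w in cond)
        logs.append(lp)
    top = max(logs)
    e = [math.exp(lp - top) for lp in logs]
    return e[0] / (e[0] + e[1])

def i_em(P, M, max_iter=50, tol=1e-6):
    """P, M: lists of token lists. Returns (model, Pr[c1|dj] for each dj in M)."""
    docs = P + M
    vocab = sorted({w for d in docs for w in d})
    counts = [Counter(d) for d in docs]
    pr_m = [0.0] * len(M)                      # every dj in M starts as c2
    model = train_nb(counts, [1.0] * len(P) + pr_m, vocab)
    for _ in range(max_iter):
        new_pr_m = [pr_c1_given_doc(c, model) for c in counts[len(P):]]
        change = max((abs(a - b) for a, b in zip(new_pr_m, pr_m)), default=0.0)
        pr_m = new_pr_m
        # Documents in P keep Pr[c1|dj] = 1; only M is relabeled.
        model = train_nb(counts, [1.0] * len(P) + pr_m, vocab)
        if change < tol:
            break
    return model, pr_m

# Toy data (hypothetical): one hidden positive document inside M.
P = [["ball", "game"], ["ball", "team"], ["game", "team"]]
M = [["ball", "game", "team"], ["vote", "law"], ["law", "court"], ["court", "vote"]]
model, pr_m = i_em(P, M)
```

On this toy input the first document in M, which shares its vocabulary with P, ends up with a high Pr[c1|dj], while the other three stay close to c2.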

  15. Proposed Technique: Step 1 - Spies • Problem is that our initialization is strongly biased towards positive documents • Need to identify some very likely negative documents from the mixed set • We do this by taking “spy” documents from the positive set P and putting them in the mixed set M • (10% was used) • A threshold t is set, and those documents with a probabilistic label less than t are identified as negative • 15% was the threshold used • [Figure: spies from the positive set (c1) are placed in the mixed set (c2); the mixed set is then split into likely negative and unlabeled documents]

  16. Proposed Technique: Step 1 - Spies (cont.) • N (most likely negative docs) = U (unlabeled docs) = ∅ • S (spies) = sample(P, s%) • MS = M ∪ S • P = P - S • Assign every document di in P the class c1 • Assign every document dj in MS the class c2 • Run I-EM(MS, P) • Classify each document dj in MS • Determine the probability threshold t using S • For each document dj in M: if its probability Pr[c1|dj] < t then N = N ∪ {dj}, else U = U ∪ {dj}
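The bookkeeping in the spy step can be sketched as below. The slides do not spell out how t is derived from the spies, so this sketch assumes t is the spy probability below which roughly a `noise` fraction (15% here) of spies fall; that reading, along with the function names `plant_spies`, `spy_threshold`, and `split_mixed`, is an assumption. The probabilities are taken as inputs, as they would come from running I-EM on MS and P.

```python
import random

def plant_spies(P, s=0.10, seed=0):
    """Move a random fraction s of P into the spy set S.

    Returns (P without the spies, S). At least one spy is always chosen.
    """
    rng = random.Random(seed)
    k = max(1, round(s * len(P)))
    spy_idx = set(rng.sample(range(len(P)), k))
    S = [d for i, d in enumerate(P) if i in spy_idx]
    rest = [d for i, d in enumerate(P) if i not in spy_idx]
    return rest, S

def spy_threshold(spy_probs, noise=0.15):
    """Assumed rule: pick t so about a `noise` fraction of spies score below it."""
    ranked = sorted(spy_probs)
    return ranked[int(noise * len(ranked))]

def split_mixed(M, m_probs, t):
    """Documents in M with Pr[c1|dj] < t go to N (likely negative), the rest to U."""
    N = [d for d, p in zip(M, m_probs) if p < t]
    U = [d for d, p in zip(M, m_probs) if p >= t]
    return N, U

# Hypothetical example: 20 positive document ids and made-up probabilities.
rest, S = plant_spies(list(range(20)), s=0.10)
t = spy_threshold([0.9, 0.8, 0.95, 0.2, 0.7, 0.85, 0.99, 0.6, 0.75, 0.9])
N, U = split_mixed(["a", "b", "c"], [0.1, 0.6, 0.9], t)
```

Choosing t from a low quantile of the spy scores, rather than their minimum, keeps a few outlying spies from dragging the threshold down so far that almost nothing in M is marked as negative.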

  17. Proposed Technique:Step 2 - Building the final classifier • Using P, N and U as developed in the previous step • Put all the spy documents S back in P • Assign Pr[c1 | di] =1 for all documents in P • Assign Pr[c2 | di] =1 for all documents in N. This will change with each iteration of EM • Each doc dk in U is not assigned a label initially. At the end of the first iteration, it will have a probabilistic label Pr[c1 | dk] • Run EM using the document sets P, N and U until it converges • When EM stops, the final classifier has been produced. • This two step technique is called S-EM (Spy EM)

  18. Proposed Technique: Selecting a classifier • The local maximum that EM converges to may not give a final classifier that cleanly separates the positive and negative documents • Likely if there are many local clusters • If so, from the set of classifiers developed over each iteration, select the one with the least estimated probability of error • Refer to (1): $\Pr[f(X) \neq Y] = \Pr[f(X)=1] - \Pr[Y=1] + 2\Pr[f(X)=0 \mid Y=1]\Pr[Y=1]$

  19. Evaluation: Measurements • Breakeven Point • 0 = p - r, where p is precision and r is recall • Only evaluates sorting order of class probabilities of documents • Not appropriate • F score • F = 2pr / (p+r) • Measures performance on a particular class • Reflects average effect of both precision and recall • Only when both p and r are large will F be large • Accuracy
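The F score and accuracy on this slide can be computed as in the sketch below. The function name `evaluate`, the label encoding (1 for positive, 0 for negative), and the example labels are assumptions for illustration only.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (precision, recall, F score, accuracy) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F = 2pr / (p + r): large only when both p and r are large.
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return precision, recall, f_score, accuracy

# Hypothetical labels: 3 positives among 8 documents.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
p, r, f, acc = evaluate(y_true, y_pred)
```

With a small positive class, accuracy can stay high even when many positives are missed, which is one reason to report F on the positive class alongside it.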

  20. Evaluation: Results • 2 large document corpora • 20NG • Removed Usenet headers and subject lines • WebKB • HTML tags removed • 8 iterations

  21. Evaluation: Results (cont.) • Also varied the % of positive documents both in P (%a) and in M (%b)

  22. Conclusions • This paper studied the problem of classification with only partial information: one class and a set of mixed documents • Technique • Naïve Bayes classifier • Expectation-Maximization algorithm • Reinitialized using the positive documents and the most likely negative documents to compensate for bias • Use an estimate of classification error to select a good classifier • Extremely accurate results
