
Text Classification from Labeled and Unlabeled Documents using EM

This text classification study explores the use of the Expectation-Maximization (EM) algorithm to classify documents when both labeled and unlabeled data are available. Naïve Bayes learning is combined with EM to improve parameter estimates. The study also discusses challenges, motivations, and extensions of the EM algorithm for text classification. Experiments are conducted to evaluate the effectiveness of EM with unlabeled data.



Presentation Transcript


  1. Eleni Foteinopoulou s0969664 Efthymios Kouloumpis s0928744 Text Classification from Labeled and Unlabeled Documents using EM [ Kamal Nigam, Andrew McCallum, Sebastian Thrun, Tom Mitchell, 1999 ]

  2. Text Classification - Labeled and Unlabeled Documents Overview • Introduction • Motivation • Naïve Bayes Learning • Combination of NB and EM • EM Extensions • Experiments • Summary

  3. Text Classification - Labeled and Unlabeled Documents Text Classification: the Bag-of-Words representation
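
To make the bag-of-words idea concrete, here is a minimal sketch (the function and variable names are illustrative, not from the paper): a document is reduced to unordered word counts, so word order is discarded entirely.

```python
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """Map a document to unordered word counts; word order is discarded."""
    return Counter(document.lower().split())

a = bag_of_words("the team won the match")
b = bag_of_words("the match the team won")
print(a == b)   # True: only the counts remain, not the order
print(a)        # Counter({'the': 2, 'team': 1, 'won': 1, 'match': 1})
```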

  4. Text Classification - Labeled and Unlabeled Documents Need for an intermediate approach • Unsupervised and Supervised learning • Unsupervised learning • collection of documents without any labels • easy to collect, free, inexpensive, large pool • Supervised learning • each object tagged with a class • laborious job, time-consuming process • Semi-supervised learning • Real life applications

  5. Text Classification - Labeled and Unlabeled Documents Challenges • How to reduce the number of labeled examples? • Can unlabeled examples increase the classification accuracy? Any ideas...? Semi-Supervised Learning

  6. Text Classification - Labeled and Unlabeled Documents Motivation • Document collection D • A subset D^l ⊂ D (with |D^l| ≪ |D|) has known labels • Goal: to label the rest of the collection, D^u = D \ D^l. • Approach • Train a supervised learner using D^l, the labeled subset. → NB • Apply the trained learner on the remaining documents D^u. → EM • Idea • Harness information from the unlabeled subset.

  7. Text Classification - Labeled and Unlabeled Documents The Generative Model • Probabilistic generative model • Every document is generated according to a probability distribution • Assumptions • Mixture model • One-to-one correspondence between mixture components and classes • Document length distribution is independent of the class
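
Under these assumptions the generative model is a mixture over classes: a document d_i is produced by first selecting a class c_j with prior probability P(c_j | θ), then generating the document from that class's word distribution:

```latex
P(d_i \mid \theta) \;=\; \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j;\, \theta)
```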

  8. Text Classification - Labeled and Unlabeled Documents Naïve Bayes Learning • Assign each document to a particular mixture component. • The parameters of an individual mixture component form a multinomial distribution over words • Estimate model parameters → maximum a posteriori estimation
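
A minimal sketch of the estimation step, assuming a word-count matrix; Laplace (add-one) smoothing plays the role of the Dirichlet prior that makes this a MAP rather than a maximum-likelihood estimate. Names like X, y, and nb_map_estimates are illustrative:

```python
import numpy as np

def nb_map_estimates(X, y, n_classes, smooth=1.0):
    """MAP (Laplace-smoothed) estimates for multinomial Naive Bayes.

    X: (n_docs, n_words) word-count matrix; y: class index per document.
    """
    n_docs, n_words = X.shape
    resp = np.eye(n_classes)[y]          # one-hot class indicators
    class_count = resp.sum(axis=0)       # documents per class
    word_count = resp.T @ X              # word counts per class
    log_prior = np.log((class_count + smooth) / (n_docs + smooth * n_classes))
    log_word = np.log((word_count + smooth) /
                      (word_count.sum(axis=1, keepdims=True) + smooth * n_words))
    return log_prior, log_word

def nb_predict(X, log_prior, log_word):
    """Assign each document to the class maximizing the log posterior."""
    return np.argmax(log_prior + X @ log_word.T, axis=1)
```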

  9. Text Classification - Labeled and Unlabeled Documents Naïve Bayes Learning • Maximum a posteriori estimate of the model parameters given a small set of labeled data → high variance • How to improve parameter estimates? • Incorporate unlabeled documents

  10. Text Classification - Labeled and Unlabeled Documents EM Algorithm • Iterative algorithm for parameter estimation (maximum a posteriori) • Incomplete data → missing labels • Estimate parameters θ from the labeled subset • Iterate: • E step: calculate probabilistic labels for the unlabeled documents using the current parameter estimate θ. • M step: maximize the complete likelihood → new maximum a posteriori estimate of θ using the current probabilistic labels. • Continue until convergence → θ at a local maximum.
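
A sketch of the combined NB+EM loop under the assumptions above (fixed iteration count instead of a convergence test; names are illustrative). The M step re-derives the smoothed MAP estimates from soft class assignments, with labeled documents kept at their fixed one-hot labels:

```python
import numpy as np

def em_naive_bayes(X_l, y_l, X_u, n_classes, n_iters=30, smooth=1.0):
    """NB + EM: initialise theta on labeled docs, then iterate E and M steps.

    X_l, y_l: labeled word counts and class indices; X_u: unlabeled counts.
    """
    n_words = X_l.shape[1]
    resp_l = np.eye(n_classes)[y_l]      # labeled docs keep one-hot labels

    def m_step(X, resp):
        # MAP re-estimate of theta from (possibly soft) class assignments
        prior = (resp.sum(axis=0) + smooth) / (resp.sum() + smooth * n_classes)
        wc = resp.T @ X                  # expected word counts per class
        word = (wc + smooth) / (wc.sum(axis=1, keepdims=True) + smooth * n_words)
        return np.log(prior), np.log(word)

    # Initial parameter estimate from the labeled subset alone
    log_prior, log_word = m_step(X_l, resp_l)
    for _ in range(n_iters):
        # E step: probabilistic labels P(c|d; theta) for the unlabeled docs
        log_post = log_prior + X_u @ log_word.T
        log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M step: new MAP estimate from labeled + probabilistically labeled docs
        log_prior, log_word = m_step(np.vstack([X_l, X_u]),
                                     np.vstack([resp_l, post]))
    return log_prior, log_word
```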

  11. Text Classification - Labeled and Unlabeled Documents EM: Issues • Generative model vs. real-world text data • Mixture model: the one-to-one correspondence between mixture components and classes is violated by real text • The same parametric model is assumed in classification → the violation carries over • Word conditional independence (the NB assumption) → extreme class probability estimates

  12. Text Classification - Labeled and Unlabeled Documents EM: Extensions • Real world data? • Weighting factor • Multiple mixture components

  13. Text Classification - Labeled and Unlabeled Documents EM: Reducing belief in unlabeled data • Problems due to unlabeled data • Noise in the term distribution of documents in D^u • Mistakes in the E-step • Solution • Attenuate the contribution from documents in D^u • Add a damping factor α ∈ [0,1] in the E step for the contribution from D^u
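
One way to implement the damping, continuing the em_naive_bayes sketch above (the paper calls this weighting factor λ; α here follows the slide): scale the unlabeled documents' probabilistic labels before the M step, so each unlabeled document contributes only a fraction α of a labeled one.

```python
# Inside the EM loop, replace the M step with a damped version:
alpha = 0.1   # illustrative value in [0, 1]; tuned on held-out data in practice
log_prior, log_word = m_step(np.vstack([X_l, X_u]),
                             np.vstack([resp_l, alpha * post]))
```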

  14. Text Classification - Labeled and Unlabeled Documents EM: Modeling labels using many mixture components • Previous extension → reduces the effect of the mixture-model assumption • Goal: relax the assumption of a one-to-one correspondence between mixture components and class labels. • Introduce a many-to-one mapping from components to labels → component membership becomes a missing value for every document in D • E.g., for the two-class case "football" vs. "not football": documents not about "football" are actually about a variety of other things.

  15. Text Classification - Labeled and Unlabeled Documents EM: Modeling labels using many mixture components • Lower accuracy with one mixture component per label → the data are not naturally modeled • Higher accuracy with more mixture components per label → captures word dependencies within a label • Overfitting and poor performance with too many mixture components
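
With a many-to-one mapping, a document's posterior for a class label sums over all mixture components assigned to that label (notation adapted; class(m_k) denotes the label that component m_k maps to):

```latex
P(c_j \mid d_i;\, \theta) \;=\; \sum_{k\,:\,\mathrm{class}(m_k) = c_j} P(m_k \mid d_i;\, \theta)
```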

  16. Text Classification - Labeled and Unlabeled Documents Experiments • Unlabeled Data & EM (Newsgroup articles)

  17. Text Classification - Labeled and Unlabeled Documents Experiments • Unlabeled Data & EM (web pages)

  18. Text Classification - Labeled and Unlabeled Documents Experiments • Varying the weights on Unlabeled Data

  19. Text Classification - Labeled and Unlabeled Documents Summary - Conclusions • Labels are expensive • Unlabeled data supplement scarce labeled data • reduces classification error by up to 30% • Data are inconsistent with the generative model assumptions • Extensions of EM: • Weighting unlabeled data prevents a decrease in accuracy • Many-to-one mapping of mixture components to labels • Future Work • An incremental learning algorithm that uses the unlabeled data of the test phase
