
Text Classification from Labeled and Unlabeled Documents using EM

This text classification study explores the use of the Expectation-Maximization (EM) algorithm to classify documents when both labeled and unlabeled data are available. Naïve Bayes learning is combined with EM to improve parameter estimates. The study also discusses challenges, motivations, and extensions of the EM algorithm for text classification. Experiments are conducted to evaluate the effectiveness of EM with unlabeled data.



Presentation Transcript


  1. Eleni Foteinopoulou s0969664 Efthymios Kouloumpis s0928744 Text Classification from Labeled and Unlabeled Documents using EM [ Kamal Nigam, Andrew McCallum, Sebastian Thrun, Tom Mitchell, 1999 ]

  2. Text Classification - Labeled and Unlabeled Documents Overview • Introduction • Motivation • Naïve Bayes Learning • Combination of NB and EM • EM Extensions • Experiments • Summary

  3. Text Classification - Labeled and Unlabeled Documents Text Classification: the Bag-of-Words representation
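
To make the bag-of-words idea concrete, here is a minimal sketch (the function and variable names are illustrative, not from the paper): a document is reduced to unordered word counts, so word order is discarded entirely.

```python
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """Map a document to unordered word counts; word order is discarded."""
    return Counter(document.lower().split())

a = bag_of_words("the team won the match")
b = bag_of_words("the match the team won")
print(a == b)   # True: only the counts remain, not the order
print(a)        # Counter({'the': 2, 'team': 1, 'won': 1, 'match': 1})
```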

  4. Text Classification - Labeled and Unlabeled Documents Need for an intermediate approach • Unsupervised and Supervised learning • Unsupervised learning • collection of documents without any labels • easy to collect, free, inexpensive, large pool • Supervised learning • each object tagged with a class • laborious job, time-consuming process • Semi-supervised learning • Real life applications

  5. Text Classification - Labeled and Unlabeled Documents Challenges • How to reduce the number of labeled examples? • Can unlabeled examples increase the classification accuracy? Any ideas...? Semi-Supervised Learning

  6. Text Classification - Labeled and Unlabeled Documents Motivation • Document collection D • A subset D^l ⊂ D (with |D^l| ≪ |D|) has known labels • Goal: to label the rest of the collection, D^u = D \ D^l. • Approach • Train a supervised learner using D^l, the labeled subset. → NB • Apply the trained learner on the remaining documents D^u. → EM • Idea • Harness information from the unlabeled subset.

  7. Text Classification - Labeled and Unlabeled Documents The Generative Model • Probabilistic generative model • Every document is generated according to a probability distribution • Assumptions • Mixture model • One-to-one correspondence between mixture components and classes • Document length distribution is independent of the class
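
Under these assumptions the generative model is a mixture over classes: a document d_i is produced by first selecting a class c_j with prior probability P(c_j | θ), then generating the document from that class's word distribution:

```latex
P(d_i \mid \theta) \;=\; \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j;\, \theta)
```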

  8. Text Classification - Labeled and Unlabeled Documents Naïve Bayes Learning • Assign each document to a particular mixture component. • The parameters of an individual mixture component form a multinomial distribution over words • Estimate model parameters → maximum a posteriori estimation
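
A minimal sketch of the estimation step, assuming a word-count matrix; Laplace (add-one) smoothing plays the role of the Dirichlet prior that makes this a MAP rather than a maximum-likelihood estimate. Names like X, y, and nb_map_estimates are illustrative:

```python
import numpy as np

def nb_map_estimates(X, y, n_classes, smooth=1.0):
    """MAP (Laplace-smoothed) estimates for multinomial Naive Bayes.

    X: (n_docs, n_words) word-count matrix; y: class index per document.
    """
    n_docs, n_words = X.shape
    resp = np.eye(n_classes)[y]          # one-hot class indicators
    class_count = resp.sum(axis=0)       # documents per class
    word_count = resp.T @ X              # word counts per class
    log_prior = np.log((class_count + smooth) / (n_docs + smooth * n_classes))
    log_word = np.log((word_count + smooth) /
                      (word_count.sum(axis=1, keepdims=True) + smooth * n_words))
    return log_prior, log_word

def nb_predict(X, log_prior, log_word):
    """Assign each document to the class maximizing the log posterior."""
    return np.argmax(log_prior + X @ log_word.T, axis=1)
```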

  9. Text Classification - Labeled and Unlabeled Documents Naïve Bayes Learning • Maximum a posteriori estimate of the model parameters given a small set of labeled data → high variance • How to improve parameter estimates? • Incorporate unlabeled documents

  10. Text Classification - Labeled and Unlabeled Documents EM Algorithm • Iterative algorithm for parameter estimation (maximum a posteriori) • Incomplete data → missing labels • Estimate parameters θ from the labeled subset • Iterate: • E step: calculate probabilistic labels for the unlabeled documents using the current parameter estimate θ. • M step: maximize the complete likelihood → new maximum a posteriori estimate of θ using the current probabilistic labels. • Continue until convergence → θ at a local maximum.
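
A sketch of the combined NB+EM loop under the assumptions above (fixed iteration count instead of a convergence test; names are illustrative). The M step re-derives the smoothed MAP estimates from soft class assignments, with labeled documents kept at their fixed one-hot labels:

```python
import numpy as np

def em_naive_bayes(X_l, y_l, X_u, n_classes, n_iters=30, smooth=1.0):
    """NB + EM: initialise theta on labeled docs, then iterate E and M steps.

    X_l, y_l: labeled word counts and class indices; X_u: unlabeled counts.
    """
    n_words = X_l.shape[1]
    resp_l = np.eye(n_classes)[y_l]      # labeled docs keep one-hot labels

    def m_step(X, resp):
        # MAP re-estimate of theta from (possibly soft) class assignments
        prior = (resp.sum(axis=0) + smooth) / (resp.sum() + smooth * n_classes)
        wc = resp.T @ X                  # expected word counts per class
        word = (wc + smooth) / (wc.sum(axis=1, keepdims=True) + smooth * n_words)
        return np.log(prior), np.log(word)

    # Initial parameter estimate from the labeled subset alone
    log_prior, log_word = m_step(X_l, resp_l)
    for _ in range(n_iters):
        # E step: probabilistic labels P(c|d; theta) for the unlabeled docs
        log_post = log_prior + X_u @ log_word.T
        log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M step: new MAP estimate from labeled + probabilistically labeled docs
        log_prior, log_word = m_step(np.vstack([X_l, X_u]),
                                     np.vstack([resp_l, post]))
    return log_prior, log_word
```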

  11. Text Classification - Labeled and Unlabeled Documents EM: Issues • Generative model vs. real-world text data • Mixture model: the one-to-one correspondence between mixture components and classes is violated by real text • The same parametric model is assumed in classification → the violation carries over • Word conditional independence (the NB assumption) → extreme class probability estimates

  12. Text Classification - Labeled and Unlabeled Documents EM: Extensions • Real world data? • Weighting factor • Multiple mixture components

  13. Text Classification - Labeled and Unlabeled Documents EM: Reducing belief in unlabeled data • Problems due to unlabeled data • Noise in the term distribution of documents in D^u • Mistakes in the E-step • Solution • Attenuate the contribution from documents in D^u • Add a damping factor α ∈ [0,1] in the E step for the contribution from D^u
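
One way to implement the damping, continuing the em_naive_bayes sketch above (the paper calls this weighting factor λ; α here follows the slide): scale the unlabeled documents' probabilistic labels before the M step, so each unlabeled document contributes only a fraction α of a labeled one.

```python
# Inside the EM loop, replace the M step with a damped version:
alpha = 0.1   # illustrative value in [0, 1]; tuned on held-out data in practice
log_prior, log_word = m_step(np.vstack([X_l, X_u]),
                             np.vstack([resp_l, alpha * post]))
```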

  14. Text Classification - Labeled and Unlabeled Documents EM: Modeling labels using many mixture components • Previous extension → reduces the effect of the mixture-model assumption • Goal: relax the assumption of a one-to-one correspondence between mixture components and class labels. • Introduce a many-to-one mapping from components to labels → component membership becomes a missing value for every document in D • E.g., for the two-class case "football" vs. "not football": documents not about "football" are actually about a variety of other things.

  15. Text Classification - Labeled and Unlabeled Documents EM: Modeling labels using many mixture components • Lower accuracy with one mixture component per label → the data are not naturally modeled • Higher accuracy with more mixture components per label → captures word dependencies within a label • Overfitting and poor performance with too many mixture components
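
With a many-to-one mapping, a document's posterior for a class label sums over all mixture components assigned to that label (notation adapted; class(m_k) denotes the label that component m_k maps to):

```latex
P(c_j \mid d_i;\, \theta) \;=\; \sum_{k\,:\,\mathrm{class}(m_k) = c_j} P(m_k \mid d_i;\, \theta)
```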

  16. Text Classification - Labeled and Unlabeled Documents Experiments • Unlabeled Data & EM (Newsgroup articles)

  17. Text Classification - Labeled and Unlabeled Documents Experiments • Unlabeled Data & EM (web pages)

  18. Text Classification - Labeled and Unlabeled Documents Experiments • Varying the weights on Unlabeled Data

  19. Text Classification - Labeled and Unlabeled Documents Summary - Conclusions • Labels are expensive • Unlabeled data supplement scarce labeled data • reduces classification error by up to 30% • Data are inconsistent with the generative model assumptions • Extensions of EM: • Weighting unlabeled data prevents a decrease in accuracy • Many-to-one mapping of mixture components to labels • Future Work • An incremental learning algorithm that uses the unlabeled data of the test phase
