
Combining labeled and unlabeled data for text categorization with a large number of categories


Presentation Transcript


  1. Combining labeled and unlabeled data for text categorization with a large number of categories Rayid Ghani KDD Lab Project

  2. Supervised Learning with Labeled Data Labeled data is required in large quantities and can be very expensive to collect.

  3. Why use Unlabeled data? • Very cheap in the case of text • Web pages • Newsgroups • Email messages • May not be as useful as labeled data, but is available in enormous quantities

  4. Goal • Make learning easier and more efficient by reducing the amount of labeled data required for text classification with a large number of categories

  5. • ECOC is very accurate and efficient for text categorization with a large number of classes • Co-Training is useful for combining labeled and unlabeled data with a small number of classes

  6. Related research with unlabeled data • Using EM in a generative model (Nigam et al. 1999) • Transductive SVMs (Joachims 1999) • Co-Training type algorithms (Blum & Mitchell 1998, Collins & Singer 1999, Nigam & Ghani 2000)

  7. What is ECOC? • Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995) • Use a learner to learn the binary problems

  8. Training and testing ECOC: [Figure: training assigns each class (A, B, C, D) a binary codeword over the binary functions f1 to f5 and learns one binary classifier per function; testing computes the predicted codeword for a new example X and assigns the class whose codeword is nearest.]
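To make the ECOC decomposition concrete, here is a minimal Python sketch. It assumes scikit-learn's MultinomialNB as the binary learner and a random code matrix; both choices are illustrative, not the exact setup from the slides.

```python
# Minimal ECOC sketch (illustrative): decompose a multiclass problem into
# binary problems via a random code matrix, train one binary learner per
# column, and decode a test point by Hamming distance to the class codewords.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def train_ecoc(X, y, classes, n_bits=5, seed=0):
    rng = np.random.default_rng(seed)
    code = rng.integers(0, 2, size=(len(classes), n_bits))  # one codeword per class
    learners = []
    for b in range(n_bits):
        # Relabel each training example with the b-th bit of its class codeword.
        bits = np.array([code[classes.index(c), b] for c in y])
        learners.append(MultinomialNB().fit(X, bits))
    return code, learners

def predict_ecoc(X, code, learners, classes):
    # Predict one bit per learner, then choose the class with the nearest codeword.
    bits = np.column_stack([clf.predict(X) for clf in learners])
    hamming = (bits[:, None, :] != code[None, :, :]).sum(axis=2)
    return [classes[i] for i in hamming.argmin(axis=1)]
```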

  9. The Co-training algorithm [Blum & Mitchell 1998] • Loop (while unlabeled documents remain): • Build classifiers A and B • Use Naïve Bayes • Classify unlabeled documents with A & B • Use Naïve Bayes • Add most confident A predictions and most confident B predictions as labeled training examples

  10. The Co-training Algorithm [Blum & Mitchell, 1998]: [Diagram: learn naïve Bayes on view A and naïve Bayes on view B from the labeled data; each classifier estimates labels for the unlabeled documents, its most confident predictions are selected, and those documents are added to the labeled data.]
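A rough Python sketch of this loop is below. It assumes dense document-term count matrices for the two views, MultinomialNB as the base learner, and a fixed number of confident examples added per view per round; these are assumptions for illustration, not details taken from the slides.

```python
# Sketch of the co-training loop: two naive Bayes classifiers, one per feature
# view (A and B), each adds its most confident predictions on unlabeled data
# to the shared labeled pool. Assumes dense count matrices for both views.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(XA_l, XB_l, y_l, XA_u, XB_u, rounds=10, per_round=5):
    XA_l, XB_l, y_l = XA_l.copy(), XB_l.copy(), list(y_l)
    unlabeled = list(range(XA_u.shape[0]))
    for _ in range(rounds):
        if not unlabeled:
            break
        clf_a = MultinomialNB().fit(XA_l, y_l)
        clf_b = MultinomialNB().fit(XB_l, y_l)
        for clf, X_u in ((clf_a, XA_u), (clf_b, XB_u)):
            if not unlabeled:
                break
            probs = clf.predict_proba(X_u[unlabeled])
            conf = probs.max(axis=1)
            # Take this view's most confident predictions on the unlabeled pool.
            best = np.argsort(conf)[-per_round:]
            picked = [unlabeled[i] for i in best]
            labels = [clf.classes_[j] for j in probs[best].argmax(axis=1)]
            # Add them to the labeled pool for *both* views.
            XA_l = np.vstack([XA_l, XA_u[picked]])
            XB_l = np.vstack([XB_l, XB_u[picked]])
            y_l.extend(labels)
            unlabeled = [i for i in unlabeled if i not in picked]
    return MultinomialNB().fit(XA_l, y_l), MultinomialNB().fit(XB_l, y_l)
```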

  11. One intuition behind co-training • A and B are redundant views • The A features are independent of the B features • Co-training is like learning with random classification noise • The most confident A predictions look like random examples in the B view • The misclassification error for A is small

  12. ECOC + CoTraining = ECoTrain • ECOC decomposes multiclass problems into binary problems • Co-Training works great with binary problems • ECOC + Co-Train = Learn each binary problem in ECOC with Co-Training
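Combining the two sketches above gives a rough picture of ECoTrain: each ECOC bit becomes a binary problem that is learned with co-training over the two views. The co_train() helper and the random code matrix are the illustrative pieces from the earlier sketches, not the authors' implementation.

```python
# Sketch of ECoTrain: each ECOC column defines a binary problem, and each
# binary problem is learned with co-training over the two feature views.
# Reuses the illustrative co_train() sketched above.
import numpy as np

def train_ecotrain(XA_l, XB_l, y_l, XA_u, XB_u, classes, n_bits=5, seed=0):
    rng = np.random.default_rng(seed)
    code = rng.integers(0, 2, size=(len(classes), n_bits))  # class codewords
    column_classifiers = []
    for b in range(n_bits):
        # Binary relabeling of the labeled data for this ECOC bit.
        bits = [code[classes.index(c), b] for c in y_l]
        # Learn the binary problem from labeled + unlabeled data via co-training.
        clf_a, clf_b = co_train(XA_l, XB_l, bits, XA_u, XB_u)
        column_classifiers.append((clf_a, clf_b))
    # Decoding proceeds as in the ECOC sketch: predict one bit per column
    # (e.g. combining the two views) and pick the class with the nearest codeword.
    return code, column_classifiers
```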

  13. [Figure: an example multiclass problem with categories SPORTS, SCIENCE, ARTS, POLITICS, HEALTH, LAW]

  14. What happens with sparse data?

  15. Datasets • Hoovers-255 • Collection of 4285 corporate websites • Each company is classified into one of 255 categories • Baseline 2% • Jobs-65 (from WhizBang) • Job Postings (Two feature sets – Title, Description) • 65 categories • Baseline 11%

  16. Results

      Dataset      NB (10%)  NB (100%)  ECOC (10%)  ECOC (100%)  EM (10%)  Co-Training (10%)  ECOC + Co-Training (10%)
      Jobs-65      50.1      68.2       59.3        71.2         58.2      54.1               64.5
      Hoovers-255  15.2      32.0       24.8        36.5          9.1      10.2               27.6

      (NB = Naïve Bayes; the NB and ECOC columns use no unlabeled data; percentages are the fraction of labeled data used.)

  17. Results

  18. What Next? • Use an improved version of co-training (gradient descent) • Less prone to random fluctuations • Uses all unlabeled data at every iteration • Use Co-EM (Nigam & Ghani 2000) - a hybrid of EM and Co-Training

  19. Summary

  20. The Co-training setting: [Figure: each person is described by two kinds of text, their own web page (Tom Mitchell: "Fredkin Professor of AI…", Avrim Blum: "My research interests are…", Johnny: "I like horsies!") and text on pages that refer to them ("…My advisor…", "…Professor Blum…", "…My grandson…"); Classifier A is trained on one view and Classifier B on the other.]

  21. Learning from Labeled and Unlabeled Data: Using Feature Splits • Co-training [Blum & Mitchell 98] • Meta-bootstrapping [Riloff & Jones 99] • coBoost [Collins & Singer 99] • Unsupervised WSD [Yarowsky 95] • Consider this the co-training setting

  22. Learning from Labeled and Unlabeled Data: Extending supervised learning • MaxEnt Discrimination [Jaakkola et al. 99] • Expectation Maximization [Nigam et al. 98] • Transductive SVMs [Joachims 99]

  23. Using Unlabeled Data with EM: [Diagram: a naïve Bayes classifier estimates labels for the unlabeled documents, then all documents are used to build a new naïve Bayes classifier; the process repeats.]
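A minimal sketch of this EM loop is shown below, assuming sparse document-term count matrices and using MultinomialNB sample weights to carry the probabilistic labels; the iteration count and the weighting scheme are assumptions made for illustration.

```python
# Sketch of semi-supervised EM with naive Bayes: initialize from labeled data,
# then alternate between estimating class probabilities for unlabeled documents
# (E-step) and refitting naive Bayes on all documents with those probabilities
# as fractional labels via per-class sample weights (M-step).
import numpy as np
from scipy.sparse import vstack
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_l, y_l, X_u, n_iter=10):
    clf = MultinomialNB().fit(X_l, y_l)
    classes = clf.classes_
    for _ in range(n_iter):
        # E-step: probabilistic labels for the unlabeled documents.
        probs = clf.predict_proba(X_u)
        # M-step: refit on all documents; each unlabeled document contributes
        # to every class, weighted by its estimated class probability.
        X_all = vstack([X_l] + [X_u] * len(classes))
        y_all = np.concatenate([y_l] + [np.full(X_u.shape[0], c) for c in classes])
        w_all = np.concatenate([np.ones(X_l.shape[0])] +
                               [probs[:, k] for k in range(len(classes))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf
```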

  24. Co-training vs. EM: which differences matter?

      Co-training:  uses the feature split     incremental labeling   hard labels
      EM:           ignores the feature split  iterative labeling     probabilistic labels

  25. Hybrids of Co-training and EM: [Diagram: the hybrids differ along two dimensions, whether the feature split is used (separate naïve Bayes on A and on B vs. a single naïve Bayes on A & B) and how unlabeled documents are labeled (label all and learn from all vs. add only the best).]

  26. Learning from Unlabeled Data using Feature Splits • coBoost [Collins & Singer 99] • Meta-bootstrapping [Riloff & Jones 99] • Unsupervised WSD [Yarowsky 95] • Co-training [Blum & Mitchell 98]

  27. Intuition behind Co-training • A and B are redundant views • The A features are independent of the B features • Co-training is like learning with random classification noise • The most confident A predictions look like random examples in the B view • The misclassification error for A is small

  28. Using Unlabeled Data with EM [Nigam, McCallum, Thrun & Mitchell, 1998]: [Diagram: initially a naïve Bayes classifier is learned from the labeled data only; it estimates labels for the unlabeled documents, then all documents are used to rebuild the naïve Bayes classifier, and the process repeats.]

  29. Co-EM: [Diagram: Use the feature split? Yes and label a few documents per round: co-training. Yes and label all documents: co-EM. No: EM. Co-EM initializes with the labeled data, then alternates between naïve Bayes on view A and naïve Bayes on view B, each built with all the data and used to estimate labels.]
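A rough sketch of the co-EM idea follows, under the same illustrative assumptions as before (dense count matrices per view, MultinomialNB as the learner). Hard labels are used here for brevity; per the comparison above, co-EM labels all unlabeled documents each iteration, like EM.

```python
# Sketch of co-EM (hybrid of co-training and EM): uses the feature split like
# co-training, but labels *all* unlabeled documents every iteration like EM.
# The classifier on view A labels the unlabeled data used to train view B, and
# vice versa. Hard labels here for brevity; iteration count is an assumption.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_em(XA_l, XB_l, y_l, XA_u, XB_u, n_iter=10):
    # Initialize the view-A classifier from the labeled data only.
    clf_a = MultinomialNB().fit(XA_l, y_l)
    for _ in range(n_iter):
        # A labels all unlabeled documents; B learns from labeled + A-labeled data.
        y_u = clf_a.predict(XA_u)
        clf_b = MultinomialNB().fit(np.vstack([XB_l, XB_u]),
                                    np.concatenate([y_l, y_u]))
        # B labels all unlabeled documents; A learns from labeled + B-labeled data.
        y_u = clf_b.predict(XB_u)
        clf_a = MultinomialNB().fit(np.vstack([XA_l, XA_u]),
                                    np.concatenate([y_l, y_u]))
    return clf_a, clf_b
```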
