ACTIVE LEARNING FOR TEXT CLASSIFICATION

Ankit Bhutani, Y9094

AUTOMATIC TEXT CLASSIFICATION: A FEW HOURS ONLY
MANUAL TEXT CLASSIFICATION: TAKES YEARS

ORGANIZING LARGE VOLUMES OF TEXT
• A massive volume of online text is available.
• Organizing it into categories enables efficient search.
• Text classification is used in many applications, such as data mining, automatic query answering, learning user interests and making suggestions.
• Learning approaches: unsupervised, supervised and semi-supervised.

Terms Used
• Multinomial Naïve Bayes:
  • Documents are represented in bag-of-words format.
  • Words are assumed to be independent of each other given the class.
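The two bullets above can be illustrated with a minimal sketch of a multinomial Naïve Bayes classifier. This is not the code used in the presentation; the function names `train_mnb` and `predict_mnb`, the Laplace smoothing constant `alpha`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on tokenized documents (bag of words).

    alpha is an assumed Laplace smoothing constant.
    """
    vocab = sorted({w for d in docs for w in d})
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)  # word order is ignored: bag of words
    log_prior = {c: math.log(doc_counts[c] / len(docs)) for c in classes}
    log_lik = {}
    for c in classes:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total)
                      for w in vocab}
    return log_prior, log_lik

def predict_mnb(model, doc):
    """Return the most probable class; unseen words are skipped."""
    log_prior, log_lik = model
    scores = {}
    for c in log_prior:
        # Independence assumption: per-word log-probabilities simply add up.
        scores[c] = log_prior[c] + sum(log_lik[c][w] for w in doc
                                       if w in log_lik[c])
    return max(scores, key=scores.get)

# Hypothetical toy data, for illustration only.
docs = [["ball", "goal"], ["goal", "team"], ["vote", "party"], ["party", "law"]]
labels = ["sport", "sport", "politics", "politics"]
model = train_mnb(docs, labels)
prediction = predict_mnb(model, ["goal", "ball"])
```

Working in log space avoids numerical underflow when documents contain many words.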

Terms Used
• Semi-Supervised Learning:
  • Makes use of labeled as well as unlabeled data to learn the parameters of the model.
• Expectation Maximization:
  • Class of iterative algorithms for maximum likelihood estimation in problems with incomplete data.
  • Each iteration alternates between two steps, moving between the parameters of the model and the document labels:
    • Provide soft labels to documents based on the estimated model parameters.
    • Re-estimate the model parameters based on the soft labels.
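The alternation described above can be sketched as follows for a multinomial Naïve Bayes model. This is a minimal illustration under stated assumptions, not the presenter's implementation: the names `e_step`, `m_step` and `em_mnb`, the smoothing constant `alpha`, the fixed iteration count and the toy data are all hypothetical.

```python
import math

def m_step(docs, resp, classes, vocab, alpha=1.0):
    """Re-estimate class priors and word probabilities from soft labels.

    resp[i][c] is the (possibly fractional) weight of document i in class c.
    """
    prior, word_prob = {}, {}
    for c in classes:
        prior[c] = (sum(r[c] for r in resp) + alpha) / (len(docs) + alpha * len(classes))
        counts = {w: alpha for w in vocab}  # smoothed fractional word counts
        for doc, r in zip(docs, resp):
            for w in doc:
                counts[w] += r[c]
        total = sum(counts.values())
        word_prob[c] = {w: n / total for w, n in counts.items()}
    return prior, word_prob

def e_step(doc, prior, word_prob):
    """Soft label: posterior P(class | doc) under the current parameters."""
    log_post = {c: math.log(prior[c]) + sum(math.log(word_prob[c][w]) for w in doc)
                for c in prior}
    top = max(log_post.values())  # subtract the max for numerical stability
    unnorm = {c: math.exp(v - top) for c, v in log_post.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

def em_mnb(labeled, unlabeled, classes, iterations=10):
    """Run a fixed number of EM iterations (an assumed stopping rule)."""
    docs = [d for d, _ in labeled] + list(unlabeled)
    vocab = sorted({w for d in docs for w in d})
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    # Initialise the parameters from the labeled documents only.
    prior, word_prob = m_step([d for d, _ in labeled], hard, classes, vocab)
    for _ in range(iterations):
        soft = [e_step(d, prior, word_prob) for d in unlabeled]       # E-step
        prior, word_prob = m_step(docs, hard + soft, classes, vocab)  # M-step
    return prior, word_prob

# Hypothetical toy data: "team" and "party" occur only in unlabeled documents.
labeled = [(["goal", "ball"], "sport"), (["vote", "law"], "politics")]
unlabeled = [["goal", "team"], ["team", "ball"], ["vote", "party"], ["party", "law"]]
prior, word_prob = em_mnb(labeled, unlabeled, ["sport", "politics"])
team_posterior = e_step(["team"], prior, word_prob)
```

In this toy run the word "team" never appears in a labeled document, so any evidence about its class comes from the soft labels assigned to the unlabeled documents.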

Terms Used
• Active Learning:
  • Form of supervised machine learning.
  • The learning algorithm is able to interactively query the user.
  • Each query has an associated cost.
  • The algorithm requests the label for the document that maximizes the gain in information about the model parameters.
But how should it choose which document to request a label for?
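The query loop described above can be sketched generically. This is an assumed pool-based formulation, not the presenter's code: `train`, `score` and `oracle` are hypothetical callables supplied by the caller, and `budget` stands in for the query cost by capping the number of label requests.

```python
def active_learning_loop(labeled, pool, train, score, oracle, budget):
    """Pool-based active learning with a fixed number of label requests.

    train(labeled)    -> model             (hypothetical training callable)
    score(model, doc) -> informativeness   (higher means more worth querying)
    oracle(doc)       -> label             (stands in for the human annotator)
    """
    labeled, pool = list(labeled), list(pool)
    model = train(labeled)
    for _ in range(budget):
        if not pool:
            break
        # Pick the unlabeled document the current model finds most informative.
        best = max(range(len(pool)), key=lambda i: score(model, pool[i]))
        doc = pool.pop(best)
        labeled.append((doc, oracle(doc)))  # one paid query
        model = train(labeled)              # retrain with the new label
    return model, labeled

# Toy stand-ins that only exercise the control flow.
model, labeled = active_learning_loop(
    labeled=[],
    pool=[["a"], ["a", "b", "c"], ["a", "b"]],
    train=lambda data: len(data),        # the "model" is just the data size
    score=lambda model, doc: len(doc),   # longer documents score higher
    oracle=lambda doc: "x",
    budget=2,
)
```

The choice of `score` is what distinguishes one active learning strategy from another.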

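Terms Used
• Query by Committee:
  • Divide the training set into 4 to 5 subsets.
  • Each subset trains one committee member, which gives class probability estimates for a document.
  • Disagreement on a document is measured by the average KL divergence between all pairs of members; the document with maximum disagreement is queried.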
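The disagreement measure above can be sketched as follows, assuming each committee member has already produced a class probability distribution for every candidate document. The function names, the smoothing constant `eps` and the example numbers are illustrative assumptions; a committee needs at least two members for the average to be defined.

```python
import math
from itertools import permutations

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def committee_disagreement(member_posteriors):
    """Average KL divergence over all ordered pairs of committee members.

    KL divergence is asymmetric, so both directions of each pair are included.
    """
    pairs = list(permutations(member_posteriors, 2))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)

def select_query(pool_posteriors):
    """Index of the document on which the committee disagrees most.

    pool_posteriors[i] holds each member's class distribution for document i.
    """
    scores = [committee_disagreement(m) for m in pool_posteriors]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical estimates from a three-member committee over two classes.
pool_posteriors = [
    [[0.9, 0.1], [0.9, 0.1], [0.8, 0.2]],  # members largely agree
    [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]],  # members disagree strongly
]
query_index = select_query(pool_posteriors)
```

Here the second document would be sent to the user for labeling, since the members' estimates for it diverge most.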

Terms Used
• Semi-Supervised Frequency Estimate (SFE):
  • Slight variation on basic EM: uses a different parameter re-estimation formula.

NOTABLE WORK: Semi-Supervised Learning
• Nigam et al., 1998-99:
  • MNB + EM
  • 100 labeled + 2500 unlabeled documents
  • 80-85% accuracy
• Nigam & McCallum, 2000:
  • MNB + EM + Active Learning
  • Total of 1000 documents
  • Label requests: 50; accuracy: ~90%

NOTABLE WORK: Semi-Supervised Learning
• LYRL, 2004:
  • Compared various semi-supervised learning techniques
  • Introduced the Reuters Corpus as a new benchmark
• Su, Shirabad and Matwin, 2011:
  • MNB + SFE

My work
• MNB + SFE + Active Learning
• Data set: Reuters Corpus from LYRL 2004, containing around 8 lakh (800,000) documents
• Experiments on 10,000 documents, starting with:
  • 50 labeled documents + 100 label requests
  • 100 labeled documents + 50 label requests

Results so far
