ACTIVE LEARNING FOR TEXT CLASSIFICATION • Ankit Bhutani, Y9094
AUTOMATIC TEXT CLASSIFICATION: A FEW HOURS ONLY • MANUAL TEXT CLASSIFICATION: TAKES YEARS
ORGANIZING LARGE VOLUMES OF TEXT • A massive volume of online text is available. • Organizing it into categories enables efficient search. • Text classification is used in many applications, such as data mining, automatic query answering, learning user interests, and making suggestions. • Learning approaches: unsupervised, supervised, and semi-supervised.
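Terms Used • Multinomial Naïve Bayes (MNB) : • Documents are represented in bag-of-words format (word counts, ignoring word order). • Independence assumption: words are treated as conditionally independent given the class.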
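As a rough illustration of the two points above, here is a minimal sketch of a multinomial Naïve Bayes classifier over token lists. The function names `train_mnb` and `predict_mnb`, the Laplace smoothing constant `alpha`, and the toy documents are assumptions made for illustration; they do not come from the slides.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and log word likelihoods from labeled documents.

    docs   : list of token lists (bag of words; order is ignored)
    labels : list of class names, one per document
    alpha  : Laplace smoothing constant (an assumption of this sketch)
    """
    vocab = sorted({w for doc in docs for w in doc})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)

    n_docs = len(docs)
    log_prior = {c: math.log(class_counts[c] / n_docs) for c in class_counts}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_lik


def predict_mnb(model, doc):
    """Return the most probable class; words outside the vocabulary are skipped."""
    log_prior, log_lik = model
    scores = {}
    for c in log_prior:
        # Independence assumption: per-word log likelihoods simply add up.
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in doc if w in log_lik[c]
        )
    return max(scores, key=scores.get)


# Toy data, purely illustrative.
toy_docs = [
    ["ball", "goal", "team"],
    ["vote", "election", "party"],
    ["goal", "match"],
    ["party", "vote"],
]
toy_labels = ["sport", "politics", "sport", "politics"]
toy_model = train_mnb(toy_docs, toy_labels)
```

Calling `predict_mnb(toy_model, ["goal", "team"])` on this toy data picks the class whose smoothed word probabilities best explain the two tokens.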
Terms Used • Semi-Supervised Learning : • Makes use of labeled as well as unlabeled data to learn the parameters of the model. • Expectation Maximization (EM) : • Class of iterative algorithms for maximum likelihood estimation in problems with incomplete data. • E-step (model parameters → document labels): provide soft labels to documents based on the estimated model parameters. • M-step (document labels → model parameters): re-estimate the model parameters based on the soft labels.
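The alternation between soft labels and parameter re-estimation can be sketched as follows for a multinomial Naïve Bayes model. This is an illustrative sketch under stated assumptions, not the exact procedure from the cited papers: the names `e_step`, `m_step`, and `em_mnb`, the smoothing constant `alpha`, the fixed iteration count, and the choice to keep the labeled documents' labels fixed are all assumptions.

```python
import math


def m_step(docs, resp, classes, vocab, alpha=1.0):
    """Re-estimate class priors and word probabilities from (soft) labels.

    resp[i][c] is the weight with which document i belongs to class c.
    """
    prior, word_prob = {}, {}
    for c in classes:
        weight = sum(r[c] for r in resp)
        prior[c] = (weight + alpha) / (len(docs) + alpha * len(classes))
        counts = {w: alpha for w in vocab}  # Laplace smoothing
        for doc, r in zip(docs, resp):
            for w in doc:
                counts[w] += r[c]
        total = sum(counts.values())
        word_prob[c] = {w: counts[w] / total for w in vocab}
    return prior, word_prob


def e_step(doc, prior, word_prob):
    """Soft label: posterior P(class | doc) under the current parameters."""
    log_scores = {
        c: math.log(prior[c])
        + sum(math.log(word_prob[c][w]) for w in doc if w in word_prob[c])
        for c in prior
    }
    top = max(log_scores.values())  # subtract the max for numerical stability
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


def em_mnb(labeled, unlabeled, classes, iterations=10):
    """labeled: list of (tokens, class); unlabeled: list of token lists."""
    labeled_docs = [d for d, _ in labeled]
    all_docs = labeled_docs + unlabeled
    vocab = sorted({w for d in all_docs for w in d})
    # Labeled documents keep hard (one-hot) labels throughout.
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    # Initialise the parameters from the labeled documents only.
    prior, word_prob = m_step(labeled_docs, hard, classes, vocab)
    for _ in range(iterations):
        soft = [e_step(d, prior, word_prob) for d in unlabeled]       # E-step
        prior, word_prob = m_step(all_docs, hard + soft, classes, vocab)  # M-step
    return prior, word_prob
```

With a handful of labeled documents and a larger unlabeled pool, the returned `prior` and `word_prob` can be passed to `e_step` to obtain a class posterior for a new document.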
Terms Used • Active Learning : • A form of supervised machine learning in which the learning algorithm can interactively query the user for labels. • Each query has an associated cost. • The algorithm requests the label of the document that maximizes the gain in information about the model parameters. • But how to choose which document to request a label for?
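Terms Used • Query by Committee : • Divide the training set into 4–5 subsets. • Each subset trains one committee member, which gives class probability estimates for every unlabeled document. • The document with maximum disagreement is queried, where disagreement is measured by the average KL divergence between all pairs of members.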
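The selection rule can be sketched as below. The sketch assumes that each committee member has already produced a class probability distribution for every document in the unlabeled pool, and it averages the KL divergence over all ordered pairs of members; the function names, the `eps` guard against zero probabilities, and the toy numbers are illustrative assumptions.

```python
import math
from itertools import permutations


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def disagreement(member_probs):
    """Average KL divergence over all ordered pairs of committee members."""
    pairs = list(permutations(member_probs, 2))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


def select_query(pool_probs):
    """Return the index of the pool document with maximum disagreement.

    pool_probs[d] is the list of class distributions, one per committee
    member, for unlabeled document d.
    """
    return max(range(len(pool_probs)), key=lambda d: disagreement(pool_probs[d]))


# Toy pool of two documents scored by a three-member committee.
toy_pool = [
    [[0.9, 0.1], [0.9, 0.1], [0.8, 0.2]],  # members largely agree
    [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]],  # members strongly disagree
]
```

Here `select_query(toy_pool)` picks the second document, because its committee predictions diverge the most; that document's label would be the next one requested.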
Terms Used • Semi-Supervised Frequency Estimate (SFE) : • A slight variation on basic EM that uses a different parameter re-estimation formula.
NOTABLE WORK: Semi-Supervised Learning • Nigam et al., 1998–99 : • MNB + EM • 100 labeled + 2500 unlabeled documents • 80–85% accuracy • Nigam & McCallum, 2000 : • MNB + EM + Active Learning • Total 1000 documents • Label requests: 50, accuracy: ~90%
NOTABLE WORK: Semi-Supervised Learning • LYRL, 2004 : • Compared various semi-supervised learning techniques • Introduced the Reuters Corpus as a new benchmark • Su, Shirabad and Matwin, 2011 : • MNB + SFE
My Work • MNB + SFE + Active Learning • Data set: Reuters Corpus from LYRL 2004, containing around 8 lakh (800,000) documents • Experiments on 10,000 documents, starting with : • 50 labeled documents + 100 label requests • 100 labeled documents + 50 label requests