ACTIVE LEARNING FOR TEXT CLASSIFICATION

ACTIVE LEARNING FOR TEXT CLASSIFICATION. Ankit Bhutani Y9094. AUTOMATIC TEXT CLASSIFICATION. A FEW HOURS ONLY. MANUAL TEXT CLASSIFICATION. TAKES YEARS. ORGANIZING LARGE VOLUMES OF TEXT. Massive volume of online text available. Organisation into categories to enable efficient search.


Presentation Transcript


  1. ACTIVE LEARNING FOR TEXT CLASSIFICATION Ankit Bhutani Y9094

  2. AUTOMATIC TEXT CLASSIFICATION: A FEW HOURS ONLY • MANUAL TEXT CLASSIFICATION: TAKES YEARS

  3. ORGANIZING LARGE VOLUMES OF TEXT • Massive volume of online text available. • Organisation into categories enables efficient search. • Finds use in many applications such as data mining, automatic query answering, learning user interests, making suggestions, etc. • Learning approaches: unsupervised, supervised and semi-supervised.

  4. Terms Used • Multinomial Naïve Bayes (MNB): • Documents represented in bag-of-words format • Independence assumption: words are treated as conditionally independent given the class
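To make the two bullets on this slide concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier over bag-of-words documents. The function names (`train_mnb`, `predict_mnb`), the add-one smoothing parameter `alpha`, and the toy documents are illustrative assumptions, not taken from the presentation.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Fit MNB on tokenized docs. alpha is an assumed smoothing constant."""
    vocab = sorted({w for tokens in docs for w in tokens})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, y in zip(docs, labels):
        word_counts[y].update(tokens)  # bag of words: only counts matter
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_lik


def predict_mnb(model, tokens):
    """Return the class with the highest posterior log-score."""
    log_prior, log_lik = model
    scores = {}
    for c in log_prior:
        # Independence assumption: per-word log-probabilities simply add up.
        # Words outside the training vocabulary are skipped.
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in tokens if w in log_lik[c]
        )
    return max(scores, key=scores.get)


# Toy data (hypothetical, for illustration only).
docs = [
    ["goal", "match", "team"],
    ["team", "win", "goal"],
    ["stock", "market", "price"],
    ["market", "trade", "stock"],
]
labels = ["sport", "sport", "finance", "finance"]
model = train_mnb(docs, labels)
prediction = predict_mnb(model, ["goal", "team", "price"])
print(prediction)
```

Working in log space avoids numerical underflow when documents are long, which is the usual reason to sum log-probabilities rather than multiply probabilities.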

  5. Terms Used • Semi-Supervised Learning: • Makes use of labeled as well as unlabeled data to learn the parameters of the model. • Expectation Maximization (EM): • Class of iterative algorithms for maximum likelihood estimation in problems with incomplete data • E-step (model parameters → document labels): provide soft labels to documents based on the estimated model parameters • M-step (document labels → model parameters): re-estimate the model parameters based on the soft labels
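The loop described on this slide can be sketched as follows, combining EM with a multinomial Naïve Bayes model. This is a sketch under stated assumptions: the helper names (`m_step`, `e_step`, `em_mnb`), the smoothing constant `alpha`, the fixed iteration count, and the toy data are all hypothetical, and labeled documents keep their hard labels throughout.

```python
import math
from collections import defaultdict


def m_step(docs, soft_labels, classes, vocab, alpha=1.0):
    """Re-estimate class priors and word probabilities from soft labels."""
    prior, word_prob = {}, {}
    for c in classes:
        weight = sum(r[c] for r in soft_labels)
        prior[c] = (weight + alpha) / (len(docs) + alpha * len(classes))
        counts = defaultdict(float)
        for tokens, r in zip(docs, soft_labels):
            for w in tokens:
                counts[w] += r[c]  # fractional count weighted by soft label
        total = sum(counts.values()) + alpha * len(vocab)
        word_prob[c] = {w: (counts[w] + alpha) / total for w in vocab}
    return prior, word_prob


def e_step(tokens, prior, word_prob, classes):
    """Soft label: posterior P(class | document) under current parameters."""
    log_scores = {
        c: math.log(prior[c]) + sum(math.log(word_prob[c][w]) for w in tokens)
        for c in classes
    }
    top = max(log_scores.values())  # subtract the max for numerical stability
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


def em_mnb(labeled, unlabeled, classes, iterations=10):
    docs = [d for d, _ in labeled] + unlabeled
    vocab = sorted({w for d in docs for w in d})
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    # Initialise the parameters from the labeled documents only.
    prior, word_prob = m_step([d for d, _ in labeled], hard, classes, vocab)
    for _ in range(iterations):
        soft = [e_step(d, prior, word_prob, classes) for d in unlabeled]
        prior, word_prob = m_step(docs, hard + soft, classes, vocab)
    return prior, word_prob


# Toy data (hypothetical): two labeled and four unlabeled documents.
classes = ["sport", "finance"]
labeled = [(["goal", "team"], "sport"), (["stock", "market"], "finance")]
unlabeled = [
    ["goal", "match"],
    ["team", "match", "win"],
    ["market", "price"],
    ["stock", "price", "trade"],
]
prior, word_prob = em_mnb(labeled, unlabeled, classes)
posterior = e_step(["match", "win"], prior, word_prob, classes)
print(posterior)
```

Note that "match" and "win" never occur in a labeled document; the model can only associate them with a class through the unlabeled documents in which they co-occur with labeled vocabulary.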

  6. Terms Used • Active Learning: • Form of supervised machine learning • Learning algorithm is able to interactively query the user for labels • Each query has an associated cost. • Algorithm requests the label for the document that maximizes the gain in information about the model parameters • But how to choose which document to request a label for?

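  7. Terms Used • Query by Committee: • Divide the training set into 4–5 subsets. • Each subset trains one committee member, which gives class-probability estimates for every unlabeled document. • The document to query is the one with maximum disagreement, measured as the maximum average KL divergence between all pairs of members.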
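The disagreement measure on this slide can be sketched as below. The slide does not give the exact formula, so this assumes the average is taken over all ordered pairs of committee members (KL divergence is asymmetric); the function names and the probability values are hypothetical.

```python
import math
from itertools import permutations


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def committee_disagreement(member_probs):
    """Average KL divergence over all ordered pairs of committee members."""
    pairs = list(permutations(member_probs, 2))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


def select_query(pool_probs):
    """Return the index of the most disputed document and all scores."""
    scores = [committee_disagreement(members) for members in pool_probs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# pool_probs[d][m] is member m's class distribution for unlabeled document d.
# Values are made up: four members, two classes, three documents.
pool_probs = [
    [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1]],      # full agreement
    [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]],      # strong disagreement
    [[0.6, 0.4], [0.5, 0.5], [0.55, 0.45], [0.6, 0.4]],    # mild disagreement
]
query_index, scores = select_query(pool_probs)
print(query_index, scores)
```

In a full system each member's distributions would come from a classifier trained on one of the 4–5 subsets, and the selected document would be sent to the user for labeling.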

  8. Terms Used • Semi-Supervised Frequency Estimate (SFE): • Slight variation on basic EM: uses a different formula for re-estimating the parameters.

  9. NOTABLE WORK: Semi-Supervised Learning • Nigam et al., 1998-99: • MNB + EM • 100 labeled + 2500 unlabeled documents • 80–85% accuracy • Nigam & McCallum, 2000: • MNB + EM + Active Learning • Total 1000 documents • Label requests: 50, Accuracy: ~90%

  10. NOTABLE WORK: Semi-Supervised Learning • LYRL, 2004: • Compared various semi-supervised learning techniques • Introduced the Reuters Corpus as a new benchmark • Su, Shirabad and Matwin, 2011: • MNB + SFE

  11. My Work • MNB + SFE + Active Learning • Dataset: Reuters Corpus from LYRL 2004, containing around 800,000 (8 lakh) documents • Experiments on 10,000 documents starting with: • 50 labeled documents + 100 label requests • 100 labeled documents + 50 label requests

  12. Results so far
