
Importance of Semantic Representation: Dataless Classification


Presentation Transcript


  1. Importance of Semantic Representation: Dataless Classification Ming-Wei Chang Lev Ratinov Dan Roth Vivek Srikumar University of Illinois, Urbana-Champaign

  2. Text Categorization Classify the following sentence: Syd Millar was the chairman of the International Rugby Board in 2003. Pick a label: Class1 vs. Class2 • Traditionally, we need annotated data to train a classifier

  3. Text Categorization • Humans don’t seem to need labeled data Syd Millar was the chairman of the International Rugby Board in 2003. Pick a label: Sports vs. Finance • Label names carry a lot of information!

  4. Text Categorization • Do we really always need labeled data?

  5. Contributions • We can often go quite far without annotated data • … if we “know” the meaning of text • This works for text categorization • … and is consistent across different domains

  6. Outline • Semantic Representation • On-the-fly Classification • Datasets • Exploiting unlabeled data • Robustness to different domains

  7. Outline • Semantic Representation • On-the-fly Classification • Datasets • Exploiting unlabeled data • Robustness to different domains

  8. Semantic Representation • One common representation is the Bag of Words representation • All text is a vector in the space of words.
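
A minimal Python sketch of the bag-of-words view, assuming a plain whitespace tokenizer and raw term counts (the talk does not fix a particular weighting, so TF-IDF or similar could be substituted):

    from collections import Counter

    def bag_of_words(text):
        # Sparse vector in the space of words: word -> count.
        return Counter(text.lower().split())

    doc = "Syd Millar was the chairman of the International Rugby Board in 2003."
    print(bag_of_words(doc))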

  9. Semantic Representation • Explicit Semantic Analysis • [Gabrilovich & Markovitch, 2006, 2007] • Text is a vector in the space of concepts • Concepts are defined by Wikipedia articles

  10. Explicit Semantic Analysis: Example • “Monetary Policy” → ESA representation (Wikipedia article titles): International Monetary Fund, Monetary policy, Economic and Monetary Union, Hong Kong Monetary Authority, Monetarism, Central bank • “Apple IPod” → ESA representation (Wikipedia article titles): IPod mini, IPod photo, IPod nano, Apple Computer, IPod shuffle, ITunes
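
A minimal sketch of the ESA mapping, assuming a precomputed index from words to weighted Wikipedia concepts; in real ESA these weights come from TF-IDF scores over a Wikipedia dump, and the tiny hand-made index below is purely illustrative:

    from collections import defaultdict

    # Toy, hand-made index: word -> {Wikipedia article title: weight}.
    concept_weights = {
        "monetary": {"Monetary policy": 2.1, "International Monetary Fund": 1.7, "Central bank": 1.2},
        "policy":   {"Monetary policy": 1.5, "Monetarism": 0.9},
        "ipod":     {"IPod nano": 2.3, "IPod mini": 2.0, "ITunes": 1.1, "Apple Computer": 0.8},
    }

    def esa_vector(text):
        # A text becomes a weighted vector over Wikipedia concepts.
        vec = defaultdict(float)
        for word in text.lower().split():
            for concept, weight in concept_weights.get(word, {}).items():
                vec[concept] += weight
        return dict(vec)

    print(esa_vector("Monetary Policy"))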

  11. Semantic Representation • Two semantic representations • Bag of words • ESA

  12. Outline • Semantic Representation • On-the-fly Classification • Datasets • Exploiting unlabeled data • Robustness to different domains

  13. Traditional Text Categorization • [Diagram: a labeled corpus with labels Sports and Finance is mapped into the semantic space and used to train a classifier]

  14. Dataless Classification • [Diagram: the labeled corpus is taken away, leaving only the labels Sports and Finance] • What can we do using just the labels?

  15. But labels are text too!

  16. Dataless Classification • [Diagram: a new unlabeled document and the labels Sports and Finance are all mapped into the semantic space]

  17. What is Dataless Classification? • Humans don’t need training for classification • Annotated training data not always needed • Look for the meaning of words

  18. What is Dataless Classification? • Humans don’t need training for classification • Annotated training data not always needed • Look for the meaning of words

  19. On-the-fly Classification • [Diagram: a new unlabeled document and the labels Sports and Finance are all mapped into the semantic space]

  20. On-the-fly Classification • No training data needed • We know the meaning of label names • Pick the label that is closest in meaning to the document • Nearest neighbors
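
A sketch of this nearest-neighbour rule, assuming a represent function (bag of words or ESA, as sketched above) that returns a sparse dict vector; cosine similarity is used here as a natural choice of closeness, though the slides do not pin down the exact measure:

    import math

    def cosine(u, v):
        # Cosine similarity between two sparse dict vectors.
        dot = sum(w * v.get(k, 0.0) for k, w in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
        return dot / norm if norm else 0.0

    def on_the_fly_classify(document, label_names, represent):
        # Pick the label whose representation is closest in meaning to the document.
        doc_vec = represent(document)
        return max(label_names, key=lambda label: cosine(doc_vec, represent(label)))

    # e.g. on_the_fly_classify("Syd Millar chaired the International Rugby Board.",
    #                          ["Sports", "Finance"], esa_vector)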

  21. On-the-fly Classification • [Diagram: a new unlabeled document and a new pair of labels, Hockey and Baseball, mapped into the semantic space]

  22. On-the-fly Classification • No need to even know the labels beforehand • Compare with traditional classification: annotated training data needed for each label

  23. Outline • Semantic Representation • On-the-fly Classification • Datasets • Exploiting unlabeled data • Robustness to different domains

  24. Dataset 1: Twenty Newsgroups • Posts to newsgroups • Newsgroups have descriptive names: sci.electronics = Science Electronics; rec.motorcycles = Motorcycles

  25. Dataset 2: Yahoo Answers • Posts to Yahoo! Answers • Posts categorized into a two-level hierarchy • 20 top-level categories • 280 categories in total at the second level • Examples: Arts and Humanities → Theater Acting; Sports → Rugby League

  26. Experiments • 20 Newsgroups: 10 binary problems (from [Raina et al., ’06]), e.g. Religion vs. Politics.guns; Motorcycles vs. MS Windows • Yahoo! Answers: 20 binary problems, e.g. Health → Diet fitness vs. Health → Allergies; Consumer Electronics → DVRs vs. Pets → Rodents

  27. Results: On-the-fly classification • [Chart comparing a Naïve Bayes classifier (uses annotated data, ignores the labels) against the nearest-neighbour on-the-fly classifier (uses the labels, no annotated data)]

  28. Outline • Semantic Representation • On-the-fly Classification • Datasets • Exploiting unlabeled data • Robustness to different domains

  29. Using Unlabeled Data • Knowing the data collection helps • We can learn specific biases of the dataset • Potential for semi-supervised learning

  30. Bootstrapping • Each label name is a “labeled” document • One “example” in word or concept space • Train initial classifier • Same as the on-the-fly classifier • Loop: • Classify all documents with current classifier • Retrain classifier with highly confident predictions
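
A hedged sketch of this bootstrapping loop, using scikit-learn's Naive Bayes as a stand-in classifier and assuming a hypothetical vectorize function that maps a list of texts to a dense, non-negative feature matrix; the classifier, confidence threshold, and number of rounds here are illustrative, not the paper's exact choices:

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def bootstrap(label_names, unlabeled_docs, vectorize, rounds=5, threshold=0.9):
        # Each label name is treated as one "labeled" document.
        X_seed = vectorize(label_names)
        y_seed = np.arange(len(label_names))
        X_pool = vectorize(unlabeled_docs)

        clf = MultinomialNB().fit(X_seed, y_seed)        # initial classifier from the labels alone
        for _ in range(rounds):
            probs = clf.predict_proba(X_pool)            # classify all documents
            confident = probs.max(axis=1) >= threshold   # keep highly confident predictions
            X_train = np.vstack([X_seed, X_pool[confident]])
            y_train = np.concatenate([y_seed, probs[confident].argmax(axis=1)])
            clf = MultinomialNB().fit(X_train, y_train)  # retrain on seed + confident self-labels
        return clf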

  31. Co-training • Words and concepts are two independent “views” • Each view is a teacher for the other [Blum & Mitchell ‘98]

  32. Co-training • Train initial classifiers in word space and concept space • Loop • Classify documents with current classifiers • Retrain with highly confident predictions of both classifiers
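
A sketch of the co-training loop under the same assumptions as the bootstrapping sketch, with one vectorizer per view (words and concepts); each view is retrained on the other view's confident predictions, following the usual Blum & Mitchell convention, though the slide's "both classifiers" phrasing may correspond to a slightly different selection rule:

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def co_train(label_names, docs, vectorize_words, vectorize_concepts, rounds=5, threshold=0.9):
        views = [vectorize_words, vectorize_concepts]
        X_seed = [v(label_names) for v in views]   # label names are the only seed "examples"
        X_pool = [v(docs) for v in views]
        y_seed = np.arange(len(label_names))

        # Initial classifiers in word space and concept space.
        clfs = [MultinomialNB().fit(X, y_seed) for X in X_seed]
        for _ in range(rounds):
            probs = [clf.predict_proba(X) for clf, X in zip(clfs, X_pool)]
            new_clfs = []
            for i in (0, 1):
                j = 1 - i                              # the other view acts as the teacher
                confident = probs[j].max(axis=1) >= threshold
                X_train = np.vstack([X_seed[i], X_pool[i][confident]])
                y_train = np.concatenate([y_seed, probs[j][confident].argmax(axis=1)])
                new_clfs.append(MultinomialNB().fit(X_train, y_train))
            clfs = new_clfs
        return clfs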

  33. Using unlabeled data • Three approaches • Bootstrapping with labels using Bag of Words • Bootstrapping with labels using ESA • Co-training

  34. More Results • [Chart: co-training using just the label names, with no annotated data, does as well as supervised learning with 100 labeled examples]

  35. Outline • Semantic Representation • On-the-fly Classification • Datasets • Exploiting unlabeled data • Robustness to different domains

  36. Domain Adaptation • Classifiers trained on one domain and tested on another • Performance usually decreases across domains

  37. But the label names are the same • Label names don’t depend on the domain • Label names are robust across domains • On-the-fly classifiers are domain independent

  38. Example: Baseball vs. Hockey

  39. Conclusion • Sometimes, label names tell us more about a class than annotated examples do • The standard learning practice of treating labels as unique identifiers loses information • The right semantic representation helps • What is the right one?
