
Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge

Evgeniy Gabrilovich and Shaul Markovitch, American Association for Artificial Intelligence (AAAI), 2006. Prepared by Qi Li.


Presentation Transcript


  1. Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge
     • Evgeniy Gabrilovich and Shaul Markovitch
     • American Association for Artificial Intelligence, 2006
     • Prepared by Qi Li

  2. Paper Structure
     • Introduction
     • Feature Generation with Wikipedia
        • Wikipedia as a knowledge repository
        • Feature construction
        • Feature generator design
        • Using the link structure
     • Empirical Evaluation
        • Implementation details
        • Experimental methodology
        • The effect of feature generation
        • Classifying short documents
     • Conclusions and Future Work

  3. Introduction
     • Text categorization
        • Automatic assignment of category labels to natural-language documents
        • Documents are represented as bags of words (BOW)
        • Features are derived from words
        • Categorization is based on these features
     • Limitation of BOW: it relies only on individual word occurrences seen in the training set
        • Document: "Wal-Mart supply chain goes real time"
        • Background knowledge the classifier lacks: Wal-Mart manages its stock with RFID technology
     • BOW is effective for medium-difficulty categorization but performs poorly on small categories or short documents
     • Proposal: use an encyclopedia to endow the machine with the breadth of knowledge available to humans
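
The BOW limitation can be illustrated with a minimal sketch. The whitespace tokenizer and the `bag_of_words` helper are assumptions made for illustration, not part of the paper. Under this representation the two Wal-Mart sentences from the slide share a single token, so nothing connects "supply chain" to "RFID".

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count the tokens."""
    return Counter(text.lower().split())

doc_a = bag_of_words("Wal-Mart supply chain goes real time")
doc_b = bag_of_words("Wal-Mart manages its stock with RFID technology")

# Tokens that occur in both bags: the only lexical evidence BOW can use.
shared = set(doc_a) & set(doc_b)
print(shared)
```

Any link between the two sentences beyond the company name has to come from outside knowledge, which is the gap the encyclopedia-based features are meant to fill.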

  4. Auxiliary text classifier
     • Matches documents with the most relevant Wikipedia articles
     • Conventional bag of words + new features
     • Example of the idea:
        • “Bernanke takes charge”
        • BEN BERNANKE, FEDERAL RESERVE, CHAIRMAN OF THE FEDERAL RESERVE, ALAN GREENSPAN, MONETARISM, …
     • Using Wikipedia
        • Use text similarity algorithms to automatically identify encyclopedia articles relevant to each document
        • Leverage the knowledge gained from these articles

  5. Feature Generation with Wikipedia
     • Extend the representation of documents for text categorization with knowledge concepts relevant to the document text
     • Wikipedia
        • Largest knowledge repository
        • Large-scale hierarchies
        • High-quality, standard written English
        • …

  6. Feature Construction
     • Receive a text fragment and map it to the most relevant Wikipedia articles
     • E.g. “Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge”
        • ENCYCLOPEDIA, WIKIPEDIA, ENTERPRISE CONTENT MANAGEMENT, BOTTLENECK, PERFORMANCE PROBLEM, HERMENEUTICS
     • Training documents -> features -> Wikipedia concepts -> augmented bag of words

  7. Feature Construction (cont.)
     • Which unit should be used for feature generation?
        • Word, sentence, paragraph, or document?
     • Multi-resolution approach: features are generated for
        • Individual words
        • Sentences
        • Paragraphs
        • The entire document
     • A polysemous word is mapped to the concepts that correspond to the sense shared by its context words
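
The multi-resolution idea can be sketched as a function that enumerates every context a feature generator would be applied to. The `contexts` helper, the blank-line paragraph convention, and the punctuation-based sentence splitter are simplifying assumptions, not the authors' implementation.

```python
import re
from collections import Counter

def contexts(document):
    """Return (level, text) pairs for words, sentences, paragraphs, and the document."""
    out = []
    # Assumption: paragraphs are separated by a blank line.
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    for paragraph in paragraphs:
        # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
        for sentence in sentences:
            for word in re.findall(r"\w+", sentence):
                out.append(("word", word))
            out.append(("sentence", sentence))
        out.append(("paragraph", paragraph))
    out.append(("document", document))
    return out

sample = "Jaguar builds cars. The XJ is one.\n\nJaguars hunt at night."
levels = Counter(level for level, _ in contexts(sample))
print(levels)
```

Each returned context would then be passed to the concept mapper separately, so that a concept matched only by one sentence is not drowned out by the rest of the document.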

  8. Feature Construction example
     • For “jaguar car models”, the Wikipedia-based feature generator returns:
        • JAGUAR (CAR)
        • DAIMLER and BRITISH LEYLAND MOTOR CORPORATION (companies merged with Jaguar)
        • V12 (Jaguar’s engine)
        • JAGUAR E-TYPE
        • JAGUAR XJ
     • For “jaguar Panthera onca”, it returns:
        • JAGUAR
        • FELIDAE (feline species family)
        • Related felines such as LEOPARD, PUMA and BLACK PANTHER, as well as KINKAJOU
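
A minimal sketch of how context can select the right sense of “jaguar”, assuming concepts are ranked by TF.IDF cosine similarity. The three-article `CONCEPTS` corpus, the `CONCEPT:` feature prefix, and all function names are hypothetical stand-ins; the real generator works over the full Wikipedia snapshot.

```python
import math
from collections import Counter

# Hypothetical miniature stand-in for Wikipedia: concept name -> article text.
CONCEPTS = {
    "JAGUAR (CAR)": "jaguar car british luxury car maker engine models",
    "JAGUAR": "jaguar big cat felidae panthera onca americas",
    "FEDERAL RESERVE": "central bank united states monetary policy chairman",
}

def tokenize(text):
    return text.lower().split()

def idf_table(docs):
    """Inverse document frequency, shifted by 1 so shared terms keep some weight."""
    df = Counter(t for d in docs for t in set(tokenize(d)))
    return {t: math.log(len(docs) / c) + 1.0 for t, c in df.items()}

IDF = idf_table(list(CONCEPTS.values()))

def tfidf(text):
    """TF.IDF vector restricted to terms that occur in the concept corpus."""
    return {t: c * IDF[t] for t, c in Counter(tokenize(text)).items() if t in IDF}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

CONCEPT_VECS = {name: tfidf(text) for name, text in CONCEPTS.items()}

def top_concepts(text, k=1):
    """Names of the k concepts most similar to the text (zero scores dropped)."""
    query = tfidf(text)
    scored = sorted(((cosine(query, vec), name) for name, vec in CONCEPT_VECS.items()),
                    reverse=True)
    return [name for score, name in scored[:k] if score > 0]

def augment(text, k=1):
    """Bag of words plus one extra feature per generated concept."""
    features = Counter(tokenize(text))
    for name in top_concepts(text, k):
        features["CONCEPT:" + name] += 1
    return features

print(top_concepts("jaguar car models"))
print(top_concepts("jaguar Panthera onca"))
```

The word “jaguar” alone matches both articles; the context words “car” and “models” or “Panthera” and “onca” tip the similarity toward one sense, and `augment` then adds that concept to the document's bag of words.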

  9. Feature generator design
     • A set of simple heuristics prunes the set of Wikipedia concepts by discarding:
        • Articles with fewer than 100 non-stop words (too short)
        • Articles with fewer than 5 incoming and outgoing links
        • Disambiguation pages
     • Each concept is represented as an attribute vector of words, with weights assigned using TF.IDF
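
The pruning heuristics can be sketched as a simple filter. The `Article` record and its field names are hypothetical. The slide does not say whether the link threshold applies to each count or to their sum; this sketch assumes an article is dropped when both the incoming and the outgoing count are below 5.

```python
from dataclasses import dataclass

@dataclass
class Article:
    # Hypothetical summary record for one Wikipedia article.
    title: str
    non_stop_words: int
    incoming_links: int
    outgoing_links: int
    is_disambiguation: bool = False

def keep(article, min_words=100, min_links=5):
    """Return True if the article survives all three pruning heuristics."""
    if article.is_disambiguation:
        return False
    if article.non_stop_words < min_words:
        return False
    # Assumption: "too few links" means both counts fall below the threshold.
    if article.incoming_links < min_links and article.outgoing_links < min_links:
        return False
    return True

articles = [
    Article("Well-linked article", 500, 20, 30),
    Article("Stub", 50, 20, 30),
    Article("Isolated article", 500, 2, 3),
    Article("Jaguar (disambiguation)", 500, 20, 30, is_disambiguation=True),
]
kept = [a.title for a in articles if keep(a)]
print(kept)
```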

  10. Using the link structure
     • Links and anchor text:
        • Anchor text may be identical to the canonical name of the target article
        • Different anchor texts can refer to the same article: alternative names, variant spellings, and related phrases
     • Incoming links indicate the significance of an article
     • Problem: taking all articles that a concept points to is ill-advised, because it brings in a lot of weakly related material
     • This direction is left for future work
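
A small sketch of what the link structure offers, assuming links are available as (anchor text, target article) pairs. The `links` data below is invented for illustration and the variable names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (anchor text, target article) pairs extracted from article bodies.
links = [
    ("United States", "UNITED STATES"),
    ("USA", "UNITED STATES"),
    ("U.S.", "UNITED STATES"),
    ("Fed", "FEDERAL RESERVE"),
]

aliases = defaultdict(set)   # target article -> distinct anchor texts used for it
incoming = Counter()         # target article -> number of incoming links

for anchor, target in links:
    aliases[target].add(anchor)
    incoming[target] += 1

print(sorted(aliases["UNITED STATES"]))
print(incoming.most_common(1))
```

The alias sets collect alternative names and spellings for a concept, while the incoming-link counts give a rough measure of an article's significance.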

  11. Empirical Evaluation
     • Wikipedia snapshot: November 5, 2005
     • 1.8 GB of text in 910,989 articles
     • After removing small and overly specific concepts, 171,332 articles remain
     • Stop words and rare words are removed
     • Remaining words are stemmed
     • 296,157 distinct terms represent the concepts

  12. Experimental Methodology
     • Datasets:
        1. Reuters-21578
        2. Reuters Corpus Volume I (RCV1)
        3. OHSUMED
        4. 20 Newsgroups (20NG)
        5. Movie Reviews (Movies)
     • Method: SVM with a linear kernel
     • Metrics:
        • Precision-recall break-even point (BEP)
        • Reuters and OHSUMED: micro- and macro-averaged BEP
        • 20NG and Movies: 4-fold cross-validation
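
The difference between macro- and micro-averaged BEP can be sketched as follows. The sketch assumes one common way of locating the break-even point: cut the ranked list at R, the number of true positives in the category, where precision and recall coincide. This is not necessarily the exact procedure used in the paper, and the scores and labels are invented; every category is assumed to contain at least one positive document.

```python
def hits_at_break_even(scores, labels):
    """Return (true positives among the top-R documents, R), with R = number of positives."""
    r = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = sum(label for _, label in ranked[:r])
    return true_positives, r

def macro_micro_bep(per_category):
    """per_category: list of (scores, labels) pairs, one per category."""
    stats = [hits_at_break_even(scores, labels) for scores, labels in per_category]
    macro = sum(tp / r for tp, r in stats) / len(stats)           # mean of per-category BEPs
    micro = sum(tp for tp, _ in stats) / sum(r for _, r in stats)  # pooled over all decisions
    return macro, micro

large_category = ([0.9, 0.8, 0.7, 0.6, 0.2, 0.1], [1, 1, 1, 0, 1, 0])
small_category = ([0.9, 0.3, 0.2], [0, 1, 0])

macro, micro = macro_micro_bep([large_category, small_category])
print(macro, micro)
```

Macro-averaging weights every category equally, so a poorly handled small category pulls it down sharply, whereas micro-averaging is dominated by the large categories. That is why both figures are informative when category sizes differ widely.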

  13. Feature generation is more effective for small categories, where the improvement is larger

  14. Experiment on short documents
     • Only the titles of the articles are used for classification

  15. Conclusions and Future Work
     • Feature generator:
        • Identifies the most relevant encyclopedia articles
        • Creates new features
        • Adds semantics to the conventional BOW
     • Latent semantic indexing (LSI)
        • LSI + SVM: did not perform well
        • Wikipedia + SVM: improves results
     • Information retrieval
