
Named Entity Mining From Click-Through Data Using Weakly Supervised LDA

Gu Xu (1), Shuang-Hong Yang (1,2), Hang Li (1). (1) Microsoft Research Asia, China; (2) College of Computing, Georgia Tech, USA.


Presentation Transcript


  1. Named Entity Mining From Click-Through Data Using Weakly Supervised LDA
     Gu Xu (1), Shuang-Hong Yang (1,2), Hang Li (1)
     (1) Microsoft Research Asia, China; (2) College of Computing, Georgia Tech, USA

  2. Talk Outline
     • Named Entity Mining
        • Exploiting click-through data
        • Applying Latent Dirichlet Allocation
        • Developing a weakly supervised learning approach
     • Weakly Supervised LDA
     • Experimental Results
     • Summary

  3. Named Entity Mining
     • Named Entity Mining (NEM)
        • Mine information about the named entities of a class from a large amount of data.
        • Example: mine movie titles from a textual data collection
        • Applications: Web search, etc.
     • Three challenges, and the approach taken for each
        • Suitable data source for NEM → click-through data
        • Ambiguity in classes of named entities → LDA (topic model)
        • Supervision from human knowledge → weakly supervised learning

  4. Click-through Data
     • Query context
        • [movie] trailer, [game] cheats
     • Click context
        • imdb.com for movies, gamespot.com for games
     • Wisdom of crowds
        • Very large-scale data that keeps growing
        • Frequently updated with emerging named entities
     • New data source for NEM
        • Over 70% of queries contain named entities.
        • Rich context for determining the classes of entities.
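
The two kinds of context on this slide can be illustrated with a minimal sketch. It assumes a query context is the query with the entity replaced by a `#` placeholder (the notation used on later slides) and a click context is the host name of the clicked URL. The function names `query_context` and `click_context` are hypothetical and not taken from the talk.

```python
from typing import Optional
from urllib.parse import urlparse


def query_context(query: str, entity: str, placeholder: str = "#") -> Optional[str]:
    """Replace the first occurrence of the entity in the query with a placeholder.

    Matching is on whole, lowercased tokens. Returns None if the entity is absent.
    """
    q_tokens = query.lower().split()
    e_tokens = entity.lower().split()
    if not e_tokens:
        return None
    n = len(e_tokens)
    for i in range(len(q_tokens) - n + 1):
        if q_tokens[i:i + n] == e_tokens:
            return " ".join(q_tokens[:i] + [placeholder] + q_tokens[i + n:])
    return None


def click_context(url: str) -> str:
    """Return the host name of a clicked URL, without a leading 'www.'."""
    if "://" not in url:
        url = "http://" + url  # urlparse only fills netloc when a scheme is present
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


print(query_context("harry potter trailer", "Harry Potter"))
print(click_context("http://www.imdb.com"))
```

Token-level matching is a deliberate simplification here; a production system would need to handle misspellings and entity variants, which the slides do not discuss.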

  5. Latent Dirichlet Allocation
     • Deal with ambiguity in classes of named entities
        • Classes of named entities are ambiguous.
        • Harry Potter: book, movie and game
        • Topic models (LDA)
     • Example: click-through records for "Harry Potter" (query → clicked website)
        • harry potter trailer → imdb.com
        • harry potter dvd → movies.yahoo.com
        • harry potter cheats → cheats.ign.com
        • harry potter game → gamespots.com
     • Classes of named entities as topics
        • Movie: query contexts "# trailer", "# dvd", "# movie"; click contexts imdb.com, movies.yahoo.com, disney.go.com
        • Game: query contexts "# cheats", "# walkthrough", "# game"; click contexts gamespots.com, cheats.ign.com, gamefaqs.com

  6. Weakly Supervised Learning
     • Supervise LDA training with examples
        • LDA is an unsupervised model.
        • Topics in LDA are latent and do not align with predefined semantic classes such as book, movie and game.
     • Human labels are inaccurate and partial.
        • Binary indicators rather than proportions
        • Labels only indicate that a named entity belongs to certain classes; they do not exclude the possibility that it belongs to other classes.
     • Weakly supervised LDA
        • Supervise LDA training with partial labels

  7. Weakly Supervised LDA
     • Overview
        • Seeds: e.g. Harry Potter
        • Click-through data (query → clicked URL):
           • harry potter book → http://www.amazon.com
           • harry potter cheats → http://cheats.ign.com
           • harry potter trailer → http://www.imdb.com
        • Create a virtual document for each seed and train WS-LDA
        • Virtual document:
           • # book, http://www.amazon.com
           • # cheats, http://cheats.ign.com
           • # trailer, http://www.imdb.com
        • Training yields contexts and websites for each class
        • Find new named entities, as well as their classes, by using the obtained query contexts and clicked websites
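
The "create a virtual document for each seed" step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `build_virtual_documents`, the dictionary layout, and the `example.com` record are hypothetical, and entity matching is done on whole lowercased tokens.

```python
from collections import Counter


def _context(tokens, entity_tokens, placeholder="#"):
    """Return the query with the entity replaced by a placeholder, or None."""
    n = len(entity_tokens)
    if n == 0:
        return None
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == entity_tokens:
            return " ".join(tokens[:i] + [placeholder] + tokens[i + n:])
    return None


def build_virtual_documents(click_log, seeds):
    """Group click-through records by seed entity.

    click_log: iterable of (query, clicked_url) pairs.
    seeds: iterable of seed entity names.
    Returns {seed: {"contexts": Counter, "websites": Counter}}.
    """
    seed_tokens = {s: s.lower().split() for s in seeds}
    docs = {s: {"contexts": Counter(), "websites": Counter()} for s in seed_tokens}
    for query, url in click_log:
        tokens = query.lower().split()
        for seed, e_tokens in seed_tokens.items():
            ctx = _context(tokens, e_tokens)
            if ctx is not None:
                docs[seed]["contexts"][ctx] += 1
                docs[seed]["websites"][url] += 1
    return docs


# Records taken from the slide, plus one hypothetical non-matching record.
log = [
    ("harry potter book", "http://www.amazon.com"),
    ("harry potter cheats", "http://cheats.ign.com"),
    ("harry potter trailer", "http://www.imdb.com"),
    ("weather today", "http://example.com"),
]
docs = build_virtual_documents(log, ["harry potter"])
print(sorted(docs["harry potter"]["contexts"]))
```

Each seed's virtual document is then the bag of its query contexts together with the bag of its clicked websites.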

  8. Weakly Supervised LDA (cont.)
     • LDA with two types of virtual words
        • w1: query context
        • w2: click context
     • Virtual document
        • Query contexts (w1): # book, # cheats, # trailer, ...
        • Click contexts (w2): http://www.amazon.com, http://cheats.ign.com, http://www.imdb.com, ...
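
To make the "two types of virtual words" idea concrete, here is a minimal sketch of how a document's log likelihood could be computed when query contexts and click contexts each have their own per-class word distributions. That separation is an assumption for illustration; the names `theta`, `beta_context` and `beta_site`, the toy probabilities, and the flooring constant are all hypothetical.

```python
import math


def doc_log_likelihood(theta, contexts, websites, beta_context, beta_site, floor=1e-12):
    """Log likelihood of one virtual document given its class proportions.

    theta: list of class proportions (sums to 1).
    contexts, websites: lists of observed words of each type.
    beta_context[k], beta_site[k]: dicts mapping a word to P(word | class k).
    Each word is a mixture over classes: p(w) = sum_k theta[k] * beta[k][w].
    """
    ll = 0.0
    for words, beta in ((contexts, beta_context), (websites, beta_site)):
        for w in words:
            p = sum(theta[k] * beta[k].get(w, 0.0) for k in range(len(theta)))
            ll += math.log(max(p, floor))  # floor avoids log(0) for unseen words
    return ll


# Toy two-class example: class 0 = movie, class 1 = game.
beta_context = [{"# trailer": 1.0}, {"# cheats": 1.0}]
beta_site = [{"imdb.com": 1.0}, {"cheats.ign.com": 1.0}]
ll_mixed = doc_log_likelihood([0.5, 0.5], ["# trailer"], ["imdb.com"], beta_context, beta_site)
ll_movie = doc_log_likelihood([1.0, 0.0], ["# trailer"], ["imdb.com"], beta_context, beta_site)
print(ll_mixed, ll_movie)
```

In this toy case the document is better explained by a pure movie mixture than by an even split, which is the kind of signal that lets the model resolve class ambiguity.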

  9. Weakly Supervised LDA (cont.)
     • Introduce weak supervision
        • Objective: LDA log likelihood + soft constraints
     • Soft constraints
        • [Equation not recovered. Its annotations: LDA probability; soft constraints; document probability on the i-th class; document binary label on the i-th class]
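
The equation itself is missing from this transcript; only its labels remain. One form consistent with those labels, offered as an assumption rather than as the authors' exact formula, adds to the LDA log likelihood a weighted sum, over classes, of the document's binary label times the document's probability on that class. The symbols below (the weight \lambda, the number of classes K, and the notation \bar{z}_i) are introduced here for illustration.

```latex
O(\mathbf{w} \mid \mathbf{y}, \Theta)
  = \underbrace{\log p(\mathbf{w} \mid \Theta)}_{\text{LDA log likelihood}}
  + \lambda \underbrace{\sum_{i=1}^{K} y_i \, \bar{z}_i}_{\text{soft constraints}},
\qquad y_i \in \{0, 1\}
```

Here y_i is the document's binary label on the i-th class and \bar{z}_i is the document's probability on the i-th class. Under this form, a labeled class (y_i = 1) is rewarded for receiving probability mass, while an unlabeled class (y_i = 0) contributes nothing and is therefore not penalized, which matches the earlier point that labels do not exclude other classes.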

  10. Experimental Results
     • Dataset
        • Seed named entities
           • About 1,000 seeds for each class, and 3,767 unique named entities in total
        • Click-through data
           • 1.5 billion query-URL pairs, containing 240 million unique queries and 17 million unique URLs

  11. Experimental Results (cont.)
     • Top contexts and websites
        • [Tables not recovered: top contexts and top websites for the Movie, Game, Book and Music classes]

  12. Experimental Results (cont.) • Accuracy of Mined Entities

  13. Summary
     • Proposed using click-through data as a new data source for NEM
     • Employed a topic model to deal with ambiguity in classes of named entities
     • Devised weakly supervised LDA for modeling click-through data
        • Two types of virtual words
        • Introduced weakly supervised learning into LDA
     • Experiments on large-scale data verified the effectiveness of the proposed approach

  14. THANKS
