1 / 26

To Link or Not to Link ? A Study on End-to-End Tweet Entity Linking

To Link or Not to Link ? A Study on End-to-End Tweet Entity Linking. Stephen Guo, Ming-Wei Chang , Emre Kıcıman. Motivation. Microblogs are data gold mines! Twitter reports that it alone captures over 340M short messages per day Many applications on tweet information extraction

jethro
Télécharger la présentation

To Link or Not to Link ? A Study on End-to-End Tweet Entity Linking

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. To Link or Not to Link? A Study on End-to-End Tweet Entity Linking Stephen Guo, Ming-Wei Chang, Emre Kıcıman

  2. Motivation • Microblogs are data gold mines! • Twitter reports that it alone captures over 340M short messages per day • Many applications on tweet information extraction • Election results (Tumasjan et al., 2010) • Disease spreading (Paul and Dredze, 2011) • Tracking product feedback and sentiment (Asur and Huberman, 2010) • ... • Existing tools (for example, NER) are often too limited • Stanford NER on tweets set achieves 44% F1 [Ritter et. al, 2011]

  3. Entity Linking (Wikifier) in Tweets Oh Yes!! giants vs packers game now!! Touchdown!! • Q1: Which phrase should be linked? (mention detection) • Q2: Which Wikipedia page should be linked for selected phrases? (disambiguation)

  4. Contributions • Proposed a new evaluation scheme for entity linking • A natural evaluation scheme for microblogs • A system that performs significantly better on tweets than other systems • Learn to detect mention and perform linking jointly • Outperform Tagme[Ferragina & Scaiella 2010] and [Cucerzan 07] by 15% F1 • What we have learned • Mention detection is a difficult problem • Entity information can help mention detection

  5. Outline • Task Definition (again!) • Two stage versus Joint • Model + Features • Results + Analysis

  6. What should be linked? Oh Yes!! giants vs packers game now!! Touchdown!! • Comparing different Wikifiersis a tough problem [Cornolti, WWW 2013] • Really, there is no good definition on what should be linked

  7. Our Scenario What people are talking about the movie “The Town” on twitter? • Assume our customers are only interested in entities of certain types • Movies; Video Games; Sports Team;… • Type information can be directly inferred by the corresponding Wikipedia page • Now, it is fair to compare different systems • We assume PER, LOC, ORG, BOOK, TVSHOW, MOVIE

  8. The Desired Results • Oh Yes!! giants vs packers game now!! Touchdown!!

  9. Terminology • Oh Yes!! giants vs packers game now!! Touchdown!! Assignment Mention Candidates Mentions Entity

  10. Related Work • Wikifier [Cucerzan, 2007; Milne and Witten, 2008…….] • Given a document, create Wikipedia-like links • Very difficult to evaluate/compare • Mention detection and disambiguation are often treated separately • NER [Li et al., 2012; Ritter et al., 2011, ...] • No Linking • Limited Types • KBP [Ji et al., 2010; Ji et al., 2011,...] • Focus on disambiguation aspect

  11. Outline • Task Definition (again!) • Two stage versus Joint • Model + Features • Results + Analysis

  12. What approach should we use? • Task: Wikifier to the entities of the certain types (all named entities) • Approach 1: • Train a general named entity recognizer for those types • Link to entities from the output of the first stage • Approach 2: • Learn to jointly detect mention and disambiguate entities • Take advantage of Wikipedia information • Take advantage of type information into our model Limited Types; Adaptation Advanced model

  13. The Necessity of the Joint Approach The town is so so good, Don’t worry Ben, we already forgave you for Gigli • Q: Is “the town” a mention? • Deep analysis with knowledge is required • Gigli is Ben Affleck’s movie, which did not receive a good review • Ben Affleck is the lead actor in the movie “The Town”

  14. Outline • Task Definition (again!) • Two stage versus Joint • Model + Features • Results + Analysis

  15. Features • Oh Yes!! giants vs packers game now!! Touchdown!! Mention, Entity Pair Features 2-nd Order Features Type Features Mention Specific Features

  16. Mention Specific Features • Mention Specific Features • How likely “giants” is being used as an anchor? • How likely “giants” is capitalized in Wikipedia? • Is the “giants” a stopword? The number of tokens… • Entity - Mention Pair Features • Given a string "giants". Estimate by Wikipedia link structure • Similarity between the context of the and the words in Wikipedia “” • View count

  17. View Count • The Wikipedia statistics • http://dumps.wikimedia.org/other/pagecounts-raw/ • Log exists for every hour • Very valuable data • View count is useful • Sometimes the most linked entity in Wikipedia is not the most popular one • “jersey shore” ==> ? • Jersey Shore links: 441 views: 509140 • Jersey Shore (TV_series) links: 324 views: 5081377

  18. Second Order Features • = the set of Wikipages that link to • The Jaccard score

  19. Type Features • The information content on Wikipedia are different from Twitter • Wikipedia is informational; Tweets are actionable • Misspelled words: “watchin, watchn, …… “ • We want to find context for PER, LOC, ORG,… for tweets • Step 1: train on a system • Step 2: labeled 10 million unlabeled tweets • Step 3: Collect popular contextual words for each type • Step 4: train a new system with one new feature • Check if the context match the type

  20. Mining Contextual Words

  21. Procedure • Testing: step 1 • Given a tweet • Tokenize it, remove symbols, segment hashtags • Testing: step 2 • For all k-gram words in the tweet, do table look up • To find mention candidates and the entities they can link to • Testing: step 3 • Construct features and output the assignment with the trained model • Learning: Structural SVM; Inference: Exact/Beamseach • A rule-base system for categorizing Wikipedia

  22. Outline • Task Definition (again!) • Two stage versus Joint • Model + Features • Results + Analysis

  23. Data • We sample two sets of tweets • Train, Test 1 from [Ritter 2011] • Test 2 from Twitter with entertainment keywords • “director, actress”…… • P@1 is very high • Many, many algorithms focus on disambiguation • However, if the mention are correctly extracted, the system is already very good

  24. Main Results • TagMe[Ferragina & Scaiella 2010] and Cucerzan [Cucerzan 07] • Cucerzan is designed for well-written documents • We have a more principle way to handle mention detection than Tagme

  25. Impact of Features • Entity information helps mention detections • Mining contextual words helps a bit • Capturing Entity-Entity relation also improves the model

  26. Conclusion & Discussions • We provide an experimental study on tweets • Jointly detect mentions and disambiguate • A structured learning approach • What have we learned • Mention detection is a difficult problem • Entity information could potentially help mention detection • Future work • Explore the connections between the joint approaches and the two stage approaches • [Illinois—ACL 2011, Aida-- VLDB 2011] • A more principled way to handle context

More Related