To Link or Not to Link ? A Study on End-to-End Tweet Entity Linking

To Link or Not to Link? A Study on End-to-End Tweet Entity Linking Stephen Guo, Ming-Wei Chang, Emre Kıcıman

Motivation • Microblogs are data gold mines! • Twitter reports that it alone captures over 340M short messages per day • Many applications on tweet information extraction • Election results (Tumasjan et al., 2010) • Disease spreading (Paul and Dredze, 2011) • Tracking product feedback and sentiment (Asur and Huberman, 2010) • ... • Existing tools (for example, NER) are often too limited • Stanford NER on tweets set achieves 44% F1 [Ritter et. al, 2011]

Entity Linking (Wikifier) in Tweets Oh Yes!! giants vs packers game now!! Touchdown!! • Q1: Which phrase should be linked? (mention detection) • Q2: Which Wikipedia page should be linked for selected phrases? (disambiguation)

Contributions • Proposed a new evaluation scheme for entity linking • A natural evaluation scheme for microblogs • A system that performs significantly better on tweets than other systems • Learn to detect mention and perform linking jointly • Outperform Tagme[Ferragina & Scaiella 2010] and [Cucerzan 07] by 15% F1 • What we have learned • Mention detection is a difficult problem • Entity information can help mention detection

Outline • Task Definition (again!) • Two stage versus Joint • Model + Features • Results + Analysis

What should be linked? Oh Yes!! giants vs packers game now!! Touchdown!! • Comparing different Wikifiersis a tough problem [Cornolti, WWW 2013] • Really, there is no good definition on what should be linked

Our Scenario What people are talking about the movie “The Town” on twitter? • Assume our customers are only interested in entities of certain types • Movies; Video Games; Sports Team;… • Type information can be directly inferred by the corresponding Wikipedia page • Now, it is fair to compare different systems • We assume PER, LOC, ORG, BOOK, TVSHOW, MOVIE

The Desired Results • Oh Yes!! giants vs packers game now!! Touchdown!!

Terminology • Oh Yes!! giants vs packers game now!! Touchdown!! Assignment Mention Candidates Mentions Entity

Related Work • Wikifier [Cucerzan, 2007; Milne and Witten, 2008…….] • Given a document, create Wikipedia-like links • Very difficult to evaluate/compare • Mention detection and disambiguation are often treated separately • NER [Li et al., 2012; Ritter et al., 2011, ...] • No Linking • Limited Types • KBP [Ji et al., 2010; Ji et al., 2011,...] • Focus on disambiguation aspect

What approach should we use? • Task: Wikifier to the entities of the certain types (all named entities) • Approach 1: • Train a general named entity recognizer for those types • Link to entities from the output of the first stage • Approach 2: • Learn to jointly detect mention and disambiguate entities • Take advantage of Wikipedia information • Take advantage of type information into our model Limited Types; Adaptation Advanced model

The Necessity of the Joint Approach The town is so so good, Don’t worry Ben, we already forgave you for Gigli • Q: Is “the town” a mention? • Deep analysis with knowledge is required • Gigli is Ben Affleck’s movie, which did not receive a good review • Ben Affleck is the lead actor in the movie “The Town”

Features • Oh Yes!! giants vs packers game now!! Touchdown!! Mention, Entity Pair Features 2-nd Order Features Type Features Mention Specific Features

Mention Specific Features • Mention Specific Features • How likely “giants” is being used as an anchor? • How likely “giants” is capitalized in Wikipedia? • Is the “giants” a stopword? The number of tokens… • Entity - Mention Pair Features • Given a string "giants". Estimate by Wikipedia link structure • Similarity between the context of the and the words in Wikipedia “” • View count

View Count • The Wikipedia statistics • http://dumps.wikimedia.org/other/pagecounts-raw/ • Log exists for every hour • Very valuable data • View count is useful • Sometimes the most linked entity in Wikipedia is not the most popular one • “jersey shore” ==> ? • Jersey Shore links: 441 views: 509140 • Jersey Shore (TV_series) links: 324 views: 5081377

Second Order Features • = the set of Wikipages that link to • The Jaccard score

Type Features • The information content on Wikipedia are different from Twitter • Wikipedia is informational; Tweets are actionable • Misspelled words: “watchin, watchn, …… “ • We want to find context for PER, LOC, ORG,… for tweets • Step 1: train on a system • Step 2: labeled 10 million unlabeled tweets • Step 3: Collect popular contextual words for each type • Step 4: train a new system with one new feature • Check if the context match the type

Mining Contextual Words

Procedure • Testing: step 1 • Given a tweet • Tokenize it, remove symbols, segment hashtags • Testing: step 2 • For all k-gram words in the tweet, do table look up • To find mention candidates and the entities they can link to • Testing: step 3 • Construct features and output the assignment with the trained model • Learning: Structural SVM; Inference: Exact/Beamseach • A rule-base system for categorizing Wikipedia

Data • We sample two sets of tweets • Train, Test 1 from [Ritter 2011] • Test 2 from Twitter with entertainment keywords • “director, actress”…… • P@1 is very high • Many, many algorithms focus on disambiguation • However, if the mention are correctly extracted, the system is already very good

Main Results • TagMe[Ferragina & Scaiella 2010] and Cucerzan [Cucerzan 07] • Cucerzan is designed for well-written documents • We have a more principle way to handle mention detection than Tagme

Impact of Features • Entity information helps mention detections • Mining contextual words helps a bit • Capturing Entity-Entity relation also improves the model

Conclusion & Discussions • We provide an experimental study on tweets • Jointly detect mentions and disambiguate • A structured learning approach • What have we learned • Mention detection is a difficult problem • Entity information could potentially help mention detection • Future work • Explore the connections between the joint approaches and the two stage approaches • [Illinois—ACL 2011, Aida-- VLDB 2011] • A more principled way to handle context

To Link or Not to Link ? A Study on End-to-End Tweet Entity Linking