Linking Named Entities in Tweets with Knowledge Base via User Interest Modeling KDD’13
Task • The task: link the named entity mentions detected in tweets with the corresponding real-world entities in the knowledge base. • Applications: personalized recommendation, user interest discovery.
Task • Challenge: the noisy, short, and informal nature of tweets. • Previous methods: designed for linking entities in Web documents; they rely largely on the context around the entity mention and the topical coherence between entities in the document. • KAURI: a graph-based framework to collectively linK all the nAmed entity mentions in all tweets posted by a user via modeling the UseR’s topics of Interest.
Main Idea • Intra-tweet local information: a candidate mapping entity is a likely match when • the prior probability of the entity being mentioned is high • the similarity between the context around the entity mention in the tweet and the context associated with the candidate mapping entity is high • the candidate mapping entity is topically coherent with the mapping entities of the other entity mentions within the same tweet (if the tweet contains any).
Main Idea • Inter-tweet user interest information • We assume each user has an underlying topic interest distribution over various topics of named entities. • If a candidate mapping entity is highly topically related to entities the user is interested in, we assume this user is likely to be interested in this candidate entity as well.
TWEET ENTITY LINKING • Topical relatedness • the edge weight is defined as the topical relatedness between the two candidate entities. • relatedness is measured with the Wikipedia Link-based Measure (WLM)
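The slide names WLM but does not show its formula. A minimal sketch of the commonly cited form of the measure follows, assuming each entity is represented by the set of Wikipedia pages that link to it. The function name `wlm_relatedness` and its parameters are illustrative and do not come from the slides.

```python
import math


def wlm_relatedness(inlinks_a, inlinks_b, total_pages):
    """Wikipedia Link-based Measure from two in-link sets.

    inlinks_a, inlinks_b: sets of page ids linking to each entity (assumed input).
    total_pages: number of pages in the Wikipedia snapshot.
    """
    common = inlinks_a & inlinks_b
    if not common:
        # No shared in-links: treat the two entities as unrelated.
        return 0.0
    larger = max(len(inlinks_a), len(inlinks_b))
    smaller = min(len(inlinks_a), len(inlinks_b))
    distance = (math.log(larger) - math.log(len(common))) / (
        math.log(total_pages) - math.log(smaller)
    )
    # Clamp so the result can be used directly as a non-negative edge weight.
    return max(0.0, 1.0 - distance)
```

Two entities with identical in-link sets score 1.0 and entities with no shared in-links score 0.0, so the value can serve as an edge weight before normalization.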
TWEET ENTITY LINKING • Initial interest score estimation • the prior probability of the candidate • the similarity between the context associated with the candidate and the context around the entity mention in tweet t • extract a short window of words around each occurrence of the entity mention in the tweet • TF-IDF weighting • Cosine similarity
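The context-similarity step above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the window size, the smoothed IDF formula, and the helper names `context_window`, `tfidf_vector`, and `cosine_similarity` are all hypothetical choices.

```python
import math
from collections import Counter


def context_window(tokens, mention_index, size=3):
    """Words within `size` positions of the mention, excluding the mention itself."""
    start = max(0, mention_index - size)
    return tokens[start:mention_index] + tokens[mention_index + 1:mention_index + 1 + size]


def tfidf_vector(tokens, doc_freq, num_docs):
    """Sparse TF-IDF vector as a dict; the smoothed IDF here is an assumed variant."""
    counts = Counter(tokens)
    return {
        term: tf * (math.log((1 + num_docs) / (1 + doc_freq.get(term, 0))) + 1.0)
        for term, tf in counts.items()
    }


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

In use, one vector would be built from the window around the mention in the tweet and the other from the words collected for the candidate entity, and the cosine of the two gives the context-similarity feature.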
TWEET ENTITY LINKING • Initial interest score estimation • Topical coherence: • measured as the topical relatedness between the candidate mapping entity and the other entities in the tweet • Initial interest score • A max-margin technique learns the feature weights automatically from the training data
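The slide does not show how the three features are combined. One plausible reading is a weighted linear combination with a learned weight vector, sketched below. Averaging the relatedness values to get coherence is also an assumption, and both function names are hypothetical.

```python
def topical_coherence(candidate, other_entities, relatedness):
    """Mean relatedness between a candidate and the other entities in the tweet.

    `relatedness` is any callable taking two entities, such as a WLM function
    (assumed input). Averaging is an illustrative choice.
    """
    if not other_entities:
        return 0.0
    total = sum(relatedness(candidate, other) for other in other_entities)
    return total / len(other_entities)


def initial_interest_score(prior, context_sim, coherence, weights):
    """Weighted sum of the three intra-tweet features (assumed linear form)."""
    features = (prior, context_sim, coherence)
    return sum(w * f for w, f in zip(weights, features))
```

The weights passed in would be those learned from the training data; the max-margin learning step itself is not sketched here.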
User interest propagation algorithm • normalize the interest score of each node in the graph • normalize the edge weight (the interest propagation strength) of each node pair in the graph • Let B be a |V| × |V| interest propagation strength matrix • Since matrix B is square, stochastic, irreducible and aperiodic, our user interest propagation algorithm is guaranteed to converge
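The slide lists the normalization steps and the matrix B but not the update rule. The sketch below assumes a PageRank-style iteration, s ← λ·p + (1 − λ)·B·s, where p holds the normalized initial interest scores; this form, the column normalization, and the stopping rule are assumptions for illustration.

```python
def propagate_interest(edge_weights, initial_scores, lam=0.4, tol=1e-9, max_iter=1000):
    """Iteratively propagate interest scores over the candidate-entity graph.

    edge_weights: square list of lists of raw topical relatedness values.
    initial_scores: list of raw initial interest scores, one per node.
    lam: weight kept on the initial scores (assumed role of lambda).
    """
    n = len(initial_scores)
    total = sum(initial_scores)
    p = [x / total for x in initial_scores]  # normalized initial scores

    # Column-normalize the edge weights so each column of B sums to 1.
    col_sums = [sum(edge_weights[i][j] for i in range(n)) for j in range(n)]
    B = [
        [edge_weights[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]

    s = p[:]
    for _ in range(max_iter):
        s_next = [
            lam * p[i] + (1 - lam) * sum(B[i][j] * s[j] for j in range(n))
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(s_next, s)) < tol:
            return s_next
        s = s_next
    return s
```

When every node has at least one edge, the scores keep summing to 1 across iterations, so the final values remain comparable across candidates.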
User interest propagation algorithm • Output of the tweet entity linking task: • Set a nil threshold to validate the entity. • The threshold is learned by linear search on the training data set.
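The thresholding step can be sketched as below, assuming the top-scoring candidate is returned unless its final score falls below the nil threshold, and that the linear search picks the threshold with the highest training accuracy. The grid step, the score range, and both function names are hypothetical.

```python
def link_or_nil(candidate_scores, tau):
    """Return the top-scoring candidate, or None (nil) if it scores below tau."""
    if not candidate_scores:
        return None
    best = max(candidate_scores, key=candidate_scores.get)
    return best if candidate_scores[best] >= tau else None


def learn_nil_threshold(training_examples, step=0.01, max_tau=1.0):
    """Linear search over a grid of thresholds, keeping the most accurate one.

    training_examples: list of (candidate_scores dict, gold entity or None).
    """
    best_tau, best_correct = 0.0, -1
    num_steps = int(round(max_tau / step))
    for k in range(num_steps + 1):
        tau = k * step
        correct = sum(
            link_or_nil(scores, tau) == gold for scores, gold in training_examples
        )
        if correct > best_correct:  # ties keep the smallest threshold
            best_tau, best_correct = tau, correct
    return best_tau
```

A mention whose gold label is nil is represented here with `None`, so the same accuracy count covers both linked and unlinkable mentions.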
EXPERIMENTS • The gold standard data set: 3,818 tweets from 20 users, containing 2,677 named entities (NE). • The May 2011 version of Wikipedia (3.5 million Wikipedia pages) is used to construct the dictionary D and to obtain the context for each candidate entity. • The YAGO knowledge base: an open-domain ontology combining Wikipedia and WordNet with high coverage and quality. YAGO uses unique canonical strings from Wikipedia as the entity names. Currently, YAGO contains over one million entities.
EXPERIMENTS • Parameters: • λ = 0.4 • The weight vector w and the nil threshold τ are learned using 2-fold cross-validation. • Baseline methods: • LINDEN: a state-of-the-art method • LOCAL: our framework with the initial interest score used as the final score
EXPERIMENTS • Experimental results
EXPERIMENTS • Sensitivity analysis.
EXPERIMENTS • Evaluation of efficiency