Using Encyclopedic Knowledge for Named Entity Disambiguation. Razvan Bunescu. Marius Pasca. Machine Learning Group Department of Computer Sciences University of Texas at Austin. Google Inc. 1600 Amphitheatre Parkway Mountain View, CA. razvan@cs.utexas.edu. mars@google.com.
 
                
Introduction: Disambiguation
• Some names denote multiple entities:
  • “John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood.”
    John Williams → John Williams (composer)
  • “John Williams lost a Taipei death match against his brother, Axl Rotten.”
    John Williams → John Williams (wrestler)
  • “John Williams won a Victoria Cross for his actions at the battle of Rorke’s Drift.”
    John Williams → John Williams (VC)
Introduction: Normalization
• Some entities have multiple names:
  • John Williams (composer): John Williams, John Towner Williams
  • John Williams (wrestler): John Williams, Ian Rotten
  • Venus (planet): Venus, Morning Star, Evening Star
Introduction: Motivation
• Web searches
  • Queries about Named Entities (NEs) constitute a significant portion of popular web queries.
  • Ideally, search results are clustered such that:
    • In each cluster, the queried name denotes the same entity.
    • Each cluster is enriched by querying the web with alternative names of the corresponding entity.
• Web-based Information Extraction (IE)
  • Aggregating extractions from multiple web pages can lead to improved accuracy in IE tasks (e.g. extracting relationships between NEs).
  • Named entity disambiguation is essential for performing a meaningful aggregation.
Introduction: Approach
• Build a dictionary D of named entities.
  • Use information from a large-coverage encyclopedia: Wikipedia.
  • Each name d ∈ D is mapped to d.E, the set of entities that d can refer to in Wikipedia.
• Design a method that takes as input a proper name in its document context, and can be trained to:
  • Detect when a proper name refers to an entity from D. [Detection]
  • Find the named entity referred to in that context. [Disambiguation]
Introduction: Example
[Diagram: the dictionary names John Towner Williams, John Williams, and Ian Rotten are linked to the entities John Williams (composer), John Williams (VC), John Williams (wrestler), and John Williams (other). The open question is which entity the name denotes in the document.]
• Document: “… this past weekend. John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood …”
Outline
• Introduction
• Wikipedia Structures
• Named Entity Dictionary
• Disambiguation Dataset
• Disambiguation & Detection
• Experimental Evaluation
• Future Work
• Conclusions
Wikipedia: A Wiki Encyclopedia
• Wikipedia is a free online encyclopedia written collaboratively by volunteers, using wiki software.
• 200 language editions, with varying levels of coverage.
• Very dynamic and quickly growing resource:
  • May 2005: 577,860 articles
  • Sep. 2005: 751,666 articles
Wikipedia Articles & Titles
• Each article describes a specific entity or concept.
• An article is uniquely identified by its title.
  • Usually, the title is the most common name used to denote the entity described in the article.
  • If the title name is ambiguous, it may be qualified with an expression between parentheses.
  • Example: John Williams (composer)
• Notation:
  • E: the set of all named entities from Wikipedia.
  • e ∈ E: an arbitrary named entity.
  • e.title: the title name.
  • e.T: the text of the article.
Wikipedia Structures
• In general, there is a many-to-many relationship between names and entities, captured in Wikipedia through:
  • Redirect articles.
  • Disambiguation articles.
• Hyperlinks: an article may contain links to other articles in Wikipedia.
• Categories: each article belongs to at least one Wikipedia category.
Redirect Articles
• A redirect article exists for each alternative name used to refer to an entity in Wikipedia.
  • Example: The article titled John Towner Williams consists of a pointer to the article John Williams (composer).
• Notation:
  • e.R: the set of all names that redirect to e.
• Example:
  • e.title = United States.
  • e.R = {USA, US, Estados Unidos, Untied States, Yankee Land, …}.
Disambiguation Articles
• A disambiguation article lists all Wikipedia entities (articles) that may be denoted by an ambiguous name.
  • Example: The article titled John Williams (disambiguation) lists 22 entities (articles).
• Notation:
  • e.D: the set of names whose disambiguation pages contain a link to e.
• Example:
  • e.title = Venus (planet).
  • e.D = {Venus, Morning Star, Evening Star}.
Named Entity Dictionary
• Named entities = entities with a proper name title.
• All Wikipedia titles begin with a capital letter ⇒ 3 heuristics for detecting proper name titles:
  • If e.title is a multiword title, then e is a named entity only if all content words are capitalized (e.g. The Witches of Eastwick).
  • If e.title is a one-word title that contains at least two capital letters, then e is a named entity (e.g. NATO).
  • If at least 75% of the title occurrences inside the article are capitalized, then e is a named entity.
• Notation:
  • d ∈ D is a proper name entry in the dictionary D (500K entries).
  • d.E is the set of entities that may be denoted by d in Wikipedia.
  • e ∈ d.E ⇔ d = e.name ∨ d ∈ e.R ∨ d ∈ e.D (e.name = e.title without the expression between parentheses).
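The membership rule for d.E can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `entity_name` and `build_dictionary` and the in-memory dict inputs are assumptions, and the proper-name heuristics are left out.

```python
import re
from collections import defaultdict

def entity_name(title):
    """e.name: the title without a trailing expression between parentheses."""
    return re.sub(r"\s*\([^()]*\)\s*$", "", title)

def build_dictionary(titles, redirects, disambiguations):
    """Map each name d to d.E, the set of entity titles it may denote.

    redirects:       title -> set of names that redirect to it (e.R)
    disambiguations: title -> set of names whose disambiguation page links to it (e.D)
    """
    d_E = defaultdict(set)
    for title in titles:
        # e is in d.E iff d = e.name, or d is in e.R, or d is in e.D
        names = {entity_name(title)}
        names |= redirects.get(title, set())
        names |= disambiguations.get(title, set())
        for name in names:
            d_E[name].add(title)
    return dict(d_E)

# Toy input built from the examples on the slides.
titles = ["John Williams (composer)", "John Williams (wrestler)", "Venus (planet)"]
redirects = {
    "John Williams (composer)": {"John Towner Williams"},
    "John Williams (wrestler)": {"Ian Rotten"},
}
disambiguations = {"Venus (planet)": {"Venus", "Morning Star", "Evening Star"}}
dictionary = build_dictionary(titles, redirects, disambiguations)
```

Here `dictionary["John Williams"]` contains both the composer and the wrestler, which is exactly the ambiguity the ranking model later has to resolve.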
Hyperlinks
• Mentions of entities in Wikipedia articles are often linked to their corresponding article, by using links or piped links.
• Wiki source: The [[Vatican City|Vatican]] is now an enclave surrounded by [[Rome]].
  • [[Vatican City|Vatican]] is a piped link; [[Rome]] is a link.
• Display string: The Vatican is now an enclave surrounded by Rome.
Disambiguation Dataset
• Hyperlinks in Wikipedia provide disambiguated named entity queries q.
• Example: The [[Vatican City|Vatican]] is now an enclave surrounded by [[Rome]].
  • q1 = [[Vatican City|Vatican]]: title “Vatican City”, display name “Vatican” (display name ≠ title).
  • q2 = [[Rome]]: display name = title.
• Notation:
  • q.E: the set of entities that are associated in the dictionary D with the display name from the link.
  • q.e ∈ q.E: the true entity associated with the query, given by the title included in the link.
  • q.T: the text contained in a window of 55 words [Gooi & Allan, 2004] centered on the link.
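A rough sketch of how queries could be harvested from wiki source follows. The regular expression handles only the two link forms shown on the slide, the whitespace tokenizer is a simplification, and `extract_queries` is a hypothetical helper name rather than anything from the original system.

```python
import re

# Matches [[title]] and [[title|display name]]; other wiki markup is ignored here.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")

def extract_queries(wiki_source, window=55):
    """Return one query per link: display name, true entity title, context words."""
    tokens = []   # tokens of the display string
    links = []    # (display name, title, index of the link's first token)
    pos = 0
    for m in LINK_RE.finditer(wiki_source):
        tokens.extend(wiki_source[pos:m.start()].split())
        title = m.group(1).strip()
        display = (m.group(2) or m.group(1)).strip()
        links.append((display, title, len(tokens)))
        tokens.extend(display.split())
        pos = m.end()
    tokens.extend(wiki_source[pos:].split())

    half = window // 2  # 27 words on each side for a 55-word window
    queries = []
    for display, title, i in links:
        context = tokens[max(0, i - half): i + half + 1]
        queries.append({"name": display, "true_entity": title, "context": context})
    return queries

queries = extract_queries(
    "The [[Vatican City|Vatican]] is now an enclave surrounded by [[Rome]]."
)
```

For the slide's sentence this yields two queries: the name "Vatican" with true entity "Vatican City", and the name "Rome" with true entity "Rome".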
Disambiguation Dataset
• Every entity ek ∈ q.E contributes a disambiguation example, labeled 1 if and only if ek = q.e.
• Example query q: “… this past weekend. [[John Williams]] and the Boston Pops conducted a summer Star Wars concert at Tanglewood …”
• 1,783,868 queries
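The labeling rule can be written down directly. The sketch below assumes a dictionary in the form of a plain Python dict from names to sets of entity titles; `make_examples` is a hypothetical name.

```python
def make_examples(query_name, true_entity, dictionary):
    """One (candidate, label) pair per e_k in q.E; the label is 1 iff e_k is q.e."""
    candidates = dictionary.get(query_name, set())
    return [(e, 1 if e == true_entity else 0) for e in sorted(candidates)]

# Toy dictionary entry using entity titles that appear on the slides.
dictionary = {
    "John Williams": {
        "John Williams (composer)",
        "John Williams (wrestler)",
        "John Williams (VC)",
    }
}
examples = make_examples("John Williams", "John Williams (composer)", dictionary)
```

Each query therefore produces exactly one positive example and one negative example for every other candidate entity.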
Categories
• Each article in Wikipedia is required to be associated with at least one category.
• Categories form a directed acyclic graph, which allows multiple categorization schemes to co-exist.
• 59,759 categories in the Wikipedia taxonomy.
• Notation:
  • e.C: the set of categories to which e belongs (ancestors included).
• Example:
  • e.title = Venus (planet).
  • e.C = {Venus, Planets of the Solar System, Planets, Solar System}.
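Since e.C includes ancestors, it is the upward closure of an article's direct categories in the category graph. A minimal sketch, assuming the graph is available as a child-to-parents mapping (the edges below are illustrative, chosen to match category names that appear elsewhere in the slides):

```python
def all_categories(direct_categories, parents):
    """e.C: the direct categories of e plus all of their ancestors."""
    seen = set()
    stack = list(direct_categories)
    while stack:
        category = stack.pop()
        if category not in seen:
            seen.add(category)
            # A category may have several parents because the taxonomy is a DAG.
            stack.extend(parents.get(category, ()))
    return seen

# Illustrative edges only; the real taxonomy is much larger.
parents = {
    "Film score composers": ["Composers"],
    "Composers": ["Musicians"],
    "Musicians": ["People by occupation"],
}
composer_categories = all_categories(["Film score composers"], parents)
```

The `seen` set keeps the traversal linear in the number of reachable categories even when several paths lead to the same ancestor.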
NE Disambiguation: Two Approaches
• Classification:
  • Train a classifier for each proper name in the dictionary D.
  • Not feasible: 500K proper names ⇒ need 500K classifiers!
• Ranking:
  • Design a scoring function score(q, ek) that computes the compatibility between the context of the proper name occurring in a query q, and any of the entities ek ∈ q.E that may be referred to by that proper name.
  • For a given named entity query q, select the highest ranking entity: ê = argmax over ek ∈ q.E of score(q, ek)
Context-Article Similarity
• NE disambiguation ⇒ ranking problem.
• Use cosine similarity between query context and article, based on the tf x idf formulation: score(q, ek) = cos(q.T, ek.T) = (q.T · ek.T) / (‖q.T‖ ‖ek.T‖)
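As a rough illustration of the cosine baseline, the sketch below weights words by raw term frequency times log(N/df). That particular tf x idf variant, the toy articles, and the corpus size `n_docs = 10` are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vector(words, doc_freq, n_docs):
    """Sparse tf x idf vector; words never seen in the corpus are dropped."""
    tf = Counter(words)
    return {w: count * math.log(n_docs / doc_freq[w])
            for w, count in tf.items() if doc_freq.get(w, 0) > 0}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy article texts (hypothetical content).
articles = {
    "John Williams (composer)": ["composer", "film", "score", "conducted",
                                 "Boston", "Pops", "concert"],
    "John Williams (wrestler)": ["wrestler", "death", "match", "Taipei"],
}
n_docs = 10  # assumed corpus size, so that idf is non-zero
doc_freq = Counter(w for text in articles.values() for w in set(text))

query = ["Boston", "Pops", "conducted", "concert", "Tanglewood"]
q_vec = tfidf_vector(query, doc_freq, n_docs)
scores = {title: cosine(q_vec, tfidf_vector(text, doc_freq, n_docs))
          for title, text in articles.items()}
best = max(scores, key=scores.get)
```

With these toy inputs the composer article shares four words with the query and the wrestler article shares none, so `best` is the composer.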
Word-Category Correlations
• Problem: In many cases, given a query q, the true entity q.e fails to rank first because cue words from the query context do not occur in q.e’s article.
  • The article may be too short, or incomplete.
  • Relevant concepts from the query context are captured in the article through synonymous words or phrases.
• Approach: Use correlations between words in the query context, w ∈ q.T, and categories to which the named entity belongs, c ∈ e.C.
Word-Category Correlations
[Diagram: category hierarchy. People by occupation > Musicians > Composers > Film score composers > John Williams (composer); People known in connection with sports and hobbies > Wrestlers > Professional wrestlers > John Williams (wrestler). The cue word “conducted” in the query is highlighted.]
• Query: “John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood.”
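Ranking Formulation
• Redefine q.E: the set of named entities from D that may be denoted by the display name in the query, plus an out-of-Wikipedia entity eout.
• Use a linear ranking function: score(q, ek) = w · Φ(q, ek)
• The feature vector Φ(q, ek) has three parts, [φcos | φw,c | φout]:
  • One feature for the context-article similarity: φcos(q, ek) = cos(q.T, ek.T)
  • Each word-category pair ⟨w, c⟩ ∈ V × C is translated into a feature: φw,c(q, ek) = 1 if w ∈ q.T and c ∈ ek.C, 0 otherwise
  • One special feature for out-of-Wikipedia entities: φout(q, ek) = 1 if ek = eout, 0 otherwise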
Ranking Formulation: Example
• q = “… this past weekend. John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood.”
• Candidates: e1 = John Williams (composer), e2 = John Williams (wrestler)
• q.T = {past, weekend, Boston, Pops, conducted, summer, Star, Wars, concert, Tanglewood, …}
• e1.C = {Film score composers, Composers, Musicians, People by occupation, …}
• eout.C = ∅
• φw,c(q, e1) = 1 if ⟨w, c⟩ ∈ q.T × e1.C, 0 otherwise
• φw,c(q, eout) = 0
[Diagram: the category hierarchy from the Word-Category Correlations slide, with the query q linked to the candidates e1 and e2.]
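The linear ranking step can be sketched with sparse feature dictionaries. Everything numeric below is made up for illustration: the weights are hypothetical, not learned values, and the cosine feature of the out-of-Wikipedia entity is assumed to be 0 because it has no article text.

```python
OUT = "e_out"  # placeholder name for the out-of-Wikipedia entity

def features(query_words, entity, cos_sim, categories):
    """Sparse feature map for a (query, candidate) pair."""
    if entity == OUT:
        # No article and no categories: only the special feature fires.
        return {"out": 1.0}
    phi = {"cos": cos_sim}
    for w in set(query_words):
        for c in categories:
            phi[(w, c)] = 1.0  # word-category indicator
    return phi

def score(weights, phi):
    """Linear score: dot product of the weight vector with the features."""
    return sum(weights.get(key, 0.0) * value for key, value in phi.items())

def disambiguate(query_words, candidates, weights):
    """candidates: entity -> (cosine with the query context, category set)."""
    scored = {OUT: score(weights, features(query_words, OUT, 0.0, set()))}
    for entity, (cos_sim, categories) in candidates.items():
        scored[entity] = score(weights, features(query_words, entity, cos_sim, categories))
    return max(scored, key=scored.get)

query_words = ["Boston", "Pops", "conducted", "summer", "concert", "Tanglewood"]
candidates = {
    "John Williams (composer)": (0.1, {"Film score composers", "Composers",
                                       "Musicians", "People by occupation"}),
    "John Williams (wrestler)": (0.1, {"Professional wrestlers", "Wrestlers"}),
}
# Hypothetical weights chosen by hand for the example.
weights = {
    "cos": 1.0,
    ("conducted", "Composers"): 0.8,
    ("concert", "Musicians"): 0.5,
    ("conducted", "Wrestlers"): -0.3,
    "out": 0.2,
}
answer = disambiguate(query_words, candidates, weights)
```

Both candidates have the same cosine score here, so the decision rests entirely on the word-category weights, which is the situation the correlation features are meant to handle.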
NE Disambiguation: Overview 1
[Diagram: data structures. Redirect pages and disambiguation pages feed the NE dictionary; hyperlinks, together with the NE dictionary, feed the disambiguation dataset.]
NE Disambiguation: Overview 2
[Diagram: training and testing.]
• Training: disambiguation dataset → ranking examples features(q, ek) → SVM training → ranking model (weights w).
• Testing: NE query q + NE dictionary → ranking instances features(q, ek) → ranking model (weights w) → answer.
Experimental Evaluation
• The normalized ranking kernel is trained and evaluated against cosine similarity in 4 scenarios:
  • S1: Disambiguation between entities with different categories in the set of 110 top-level categories under People by Occupation.
  • S2: Disambiguation between entities with different categories in the set of 540 most popular (size > 200) categories under People by Occupation.
  • S3: Disambiguation between entities with different categories in the set of 2847 most popular (size > 20) categories under People by Occupation.
  • S4: Detection & Disambiguation between entities with different categories in the set of 540 most popular (size > 200) categories under People by Occupation.
• Use SVMlight with the max-margin ranking approach from [Joachims 2002].
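For readers unfamiliar with SVMlight's ranking mode, its input groups examples by a `qid` field, and preferences are only formed between examples that share a `qid`. The sketch below writes ranking examples in that text format. It is illustrative only: the slides describe a kernel-based model, and the integer feature ids used here are hypothetical.

```python
def to_svmlight_ranking(queries):
    """Format ranking examples as SVMlight lines: '<target> qid:<id> <feat>:<val> ...'.

    queries: list of queries, each a list of (label, {feature_id: value}) pairs.
    Feature ids must be positive integers and are written in ascending order.
    """
    lines = []
    for qid, examples in enumerate(queries, start=1):
        for label, feats in examples:
            pairs = " ".join(f"{i}:{v:g}" for i, v in sorted(feats.items()) if v != 0)
            lines.append(f"{label} qid:{qid} {pairs}".rstrip())
    return lines

# One query with two candidates; feature 1 plays the role of the cosine feature,
# features 7 and 9 stand in for word-category indicators (hypothetical ids).
lines = to_svmlight_ranking([
    [(1, {1: 0.82, 7: 1}), (0, {1: 0.1, 9: 1})],
])
```

Within a query, the candidate with the higher target value (the true entity) is the one the learner is asked to rank first.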
Experimental Evaluation: S2
• The set of Wikipedia categories is restricted to C2 = the 540 categories under People by Occupation that have at least 200 articles.
• Train & Test only on ambiguous queries ⟨q, ek⟩ such that:
  • ek.C ∩ C2 ≠ ∅ (i.e. matching entities have categories in C2)
  • ek.C ∩ C2 ≠ q.e.C ∩ C2 (i.e. the true entity does not have exactly the same categories as other matching entities)
• Statistics & Results: [table not included in this transcript]
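One possible reading of the two filtering conditions is sketched below: a competing candidate is kept only if it has at least one category in the restricted set and its restricted categories differ from those of the true entity. This interpretation, the helper name `filter_candidates`, and the categories assigned to John Williams (VC) are assumptions.

```python
def filter_candidates(entity_categories, true_entity, restricted):
    """Return the candidates kept for a query, or [] if the query is dropped.

    entity_categories: entity -> e.C (set of categories)
    restricted:        the category set the scenario is restricted to
    """
    true_cats = entity_categories[true_entity] & restricted
    if not true_cats:
        return []
    kept = [true_entity]
    for entity, cats in entity_categories.items():
        if entity == true_entity:
            continue
        in_scope = cats & restricted
        # Keep e_k only if it has restricted categories that differ from q.e's.
        if in_scope and in_scope != true_cats:
            kept.append(entity)
    # A query is only useful if at least one competing candidate remains.
    return kept if len(kept) > 1 else []

restricted = {"Composers", "Musicians", "Wrestlers"}
entity_categories = {
    "John Williams (composer)": {"Film score composers", "Composers", "Musicians"},
    "John Williams (wrestler)": {"Professional wrestlers", "Wrestlers"},
    "John Williams (VC)": {"Victoria Cross recipients"},  # hypothetical category
}
kept = filter_candidates(entity_categories, "John Williams (composer)", restricted)
```

In this toy case the VC recipient is dropped because none of its categories fall in the restricted set, leaving the composer and the wrestler.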
Experimental Evaluation: S4
• The set of Wikipedia categories is restricted to C4 = the 540 categories under People by Occupation that have at least 200 articles.
• Train & Test:
  • Treat all entities that are not under People by Occupation as out-of-Wikipedia.
  • Randomly select queries such that 10% have an out-of-Wikipedia true answer.
• Statistics & Results: [table not included in this transcript]
Future Work
• Use the weight vector w explicitly: reduce its dimensionality by considering only features occurring frequently in training data.
• Augment article text with context from hyperlinks that point to it.
• Use correlations between categories and traditional WSD features such as (syntactic) bigrams and trigrams centered on the ambiguous proper name.
Conclusion
• A novel approach to Named Entity Disambiguation based on knowledge encoded in Wikipedia.
• Learned correlations between Wikipedia categories and context words substantially improve disambiguation accuracy.
• Potential applications:
  • Clustering the results of web searches for popular named entities.
  • NE disambiguation is essential for aggregating corpus-level results from Information Extraction.
Ranking Kernel
• The corresponding kernel is: [equation not included in this transcript]
• The normalized version: [equation not included in this transcript]
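A kernel of this kind can be evaluated without materializing the word-category features: with a cosine feature plus binary indicators for every pair in q.T × ek.C, the dot product of two feature vectors reduces to the product of the cosines plus |q.T ∩ q′.T| · |ek.C ∩ e′k.C|. The sketch below checks that identity numerically. It leaves out the out-of-Wikipedia feature, and the normalization shown (dividing each overlap by the geometric mean of the set sizes) is an assumed form, not one confirmed by the slide text.

```python
import math

def ranking_kernel(x, y):
    """x, y: (cosine feature, set of context words q.T, set of categories e.C)."""
    cos_x, words_x, cats_x = x
    cos_y, words_y, cats_y = y
    return cos_x * cos_y + len(words_x & words_y) * len(cats_x & cats_y)

def normalized_ranking_kernel(x, y):
    """Assumed normalization: scale each overlap by the geometric mean of set sizes."""
    cos_x, words_x, cats_x = x
    cos_y, words_y, cats_y = y
    if not (words_x and words_y and cats_x and cats_y):
        return cos_x * cos_y
    word_term = len(words_x & words_y) / math.sqrt(len(words_x) * len(words_y))
    cat_term = len(cats_x & cats_y) / math.sqrt(len(cats_x) * len(cats_y))
    return cos_x * cos_y + word_term * cat_term

def explicit_dot(x, y):
    """Same value as ranking_kernel, computed from explicit sparse feature vectors."""
    def phi(example):
        cos_sim, words, cats = example
        vec = {"cos": cos_sim}
        vec.update({(w, c): 1.0 for w in words for c in cats})
        return vec
    px, py = phi(x), phi(y)
    return sum(value * py[key] for key, value in px.items() if key in py)

x = (0.5, {"conducted", "concert", "Boston"}, {"Composers", "Musicians"})
y = (0.2, {"concert", "orchestra"}, {"Musicians", "Conductors"})
k = ranking_kernel(x, y)
k_norm = normalized_ranking_kernel(x, y)
```

The set-based form costs two set intersections per kernel evaluation, whereas the explicit feature vectors grow with the product of context size and category count.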
Experimental Evaluation: S1
• The set of Wikipedia categories is restricted to C1 = the 110 top-level categories under People by Occupation.
• Train & Test only on ambiguous queries ⟨q, ek⟩ such that:
  • ek.C ∩ C1 ≠ ∅ (i.e. matching entities have categories in C1)
  • ek.C ∩ C1 ≠ q.e.C ∩ C1 (i.e. the true entity does not have exactly the same categories as other matching entities)
• Statistics & Results: [table not included in this transcript]
Experimental Evaluation: S3
• The set of Wikipedia categories is restricted to C3 = the 2847 categories under People by Occupation that have at least 20 articles.
• Train & Test only on ambiguous queries ⟨q, ek⟩ such that:
  • ek.C ∩ C3 ≠ ∅ (i.e. matching entities have categories in C3)
  • ek.C ∩ C3 ≠ q.e.C ∩ C3 (i.e. the true entity does not have exactly the same categories as other matching entities)
• Statistics & Results: [table not included in this transcript]