
2 Is Bigger (and Better) Than 1: the Wikipedia Bitaxonomy Project

2 Is Bigger (and Better) Than 1: the Wikipedia Bitaxonomy Project. T. Flati, D. Vannella, T. Pasini, R. Navigli. ERC Starting Grant MultiJEDI No. 259234. The Wikipedia structure. Article pages ~4M. Category pages ~ 700K. Two noisy graphs with no explicit hypernym relation.


Presentation Transcript


  1. 2 Is Bigger (and Better) Than 1: the Wikipedia Bitaxonomy Project. T. Flati, D. Vannella, T. Pasini, R. Navigli. ERC Starting Grant MultiJEDI No. 259234.

  2. The Wikipedia structure. Article pages: ~4M. Category pages: ~700K. Two noisy graphs with no explicit hypernym relation.

  3. The Wikipedia structure: an example. [Diagram of pages and categories; node labels: Fictional characters, Cartoon, The Walt Disney Company, Fictional characters by medium, Comics by genre, Mickey Mouse, Donald Duck, Disney comics, Disney character, Funny Animal, Disney comics characters, Superman.]

  4. Our goal: to automatically create a Wikipedia Bitaxonomy for Wikipedia pages and categories in a simultaneous fashion. [Diagram: categories, pages.]

  5. Our goal: to automatically create a Wikipedia Bitaxonomy for Wikipedia pages and categories in a simultaneous fashion. KEY IDEA: the page and category levels are mutually beneficial for inducing a wide-coverage and fine-grained integrated taxonomy.

  6. Key idea. [Diagram: pages and categories connected by "is a" edges; node labels: Fictional characters, Cartoon, Mickey Mouse, The Walt Disney Company, Fictional characters by medium, Comics by genre, Donald Duck, Disney comics, Disney character, Disney comics characters, Funny Animal, Superman.]

  7. A 3-phase method • Starting from two noisy graphs (categories and pages)

  8. A 3-phase method • 1. Build the page taxonomy

  9. A 3-phase method • 1. Build the page taxonomy • 2. Bitaxonomy algorithm

  10. A 3-phase method • 1. Build the page taxonomy • 2. Bitaxonomy algorithm

  11. A 3-phase method • 1. Build the page taxonomy • 2. Bitaxonomy algorithm • 3. Refine the category taxonomy (+50% categories)

  12. Contributions • Self-contained approach • Page taxonomy and category taxonomy built simultaneously • State-of-the-art results when compared to all other available taxonomies

  13. The WiBi page taxonomy (phase 1)

  14. Assumptions • The first sentence of a page is a good definition (also called gloss)

  15. The WiBi page taxonomy • [Syntactic step] Extract the hypernym lemma from a page definition using a syntactic parser; • [Semantic step] Apply a set of linking heuristics to disambiguate the extracted lemma. Example: "Scrooge McDuck is a character […]" → syntactic step (dependency labels shown: nn, nsubj, cop) → hypernym lemma: character → semantic step.

  16. The semantic step: 5 cascading linking heuristics • Crowdsourced • Category • Multiword • Monosemous • Distributional. Example: target page (Cristiano Ronaldo), ambiguous hypernym ('player'), disambiguated hypernym (Football player).

  17. 1. Crowdsourced heuristic: use the links from the crowd! Example: "Mickey Mouse is a funny animal cartoon character and the official mascot of The Walt Disney Company."
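The cascade on this slide can be read as "try each heuristic in order and stop at the first one that returns a link". A minimal sketch of that control flow follows. The function name `link_hypernym`, the lookup tables, and the two toy heuristics are assumptions made for illustration; they are not the authors' implementation, and the slide does not say which heuristic resolves the Cristiano Ronaldo example.

```python
from typing import Callable, List, Optional, Tuple

# A linking heuristic maps (page title, ambiguous hypernym lemma) to a
# disambiguated target page, or None when it cannot decide.
Heuristic = Callable[[str, str], Optional[str]]


def link_hypernym(page: str, lemma: str,
                  cascade: List[Tuple[str, Heuristic]]) -> Optional[Tuple[str, str]]:
    """Return (heuristic name, target page) from the first heuristic that fires."""
    for name, heuristic in cascade:
        target = heuristic(page, lemma)
        if target is not None:
            return name, target
    return None  # no heuristic could disambiguate the lemma


# Toy stand-ins (hypothetical data): the first heuristic knows nothing,
# the second has a single hard-coded answer.
def toy_first(page: str, lemma: str) -> Optional[str]:
    return None


def toy_second(page: str, lemma: str) -> Optional[str]:
    table = {("Cristiano Ronaldo", "player"): "Football player"}
    return table.get((page, lemma))


cascade = [("first", toy_first), ("second", toy_second)]
result = link_hypernym("Cristiano Ronaldo", "player", cascade)
unresolved = link_hypernym("Some page", "thing", cascade)
```

Ordering the list is the only design decision here: earlier entries take priority, so the heuristics judged most reliable would go first.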

  18. 2. Category heuristic: given a page and its ambiguous hypernym, exploit its categories to build a distribution of the hypernym's senses. Page: Donald Duck. Ambiguous hypernym: Character. Categories: Characters in Disney package films; Disney comics characters. Other pages in these categories: Pluto, Hook, Mickey Mouse, Goofy, José Carioca.

  19. 2. Category heuristic: given a page and its ambiguous hypernym, exploit its categories to build a distribution of the hypernym's senses. Page: Donald Duck. Ambiguous hypernym: Character. Definitions of the other pages in its categories (Characters in Disney package films; Disney comics characters): "Pluto, also called Pluto the Pup, is a cartoon character […]"; "Captain James Hook is a fictional character […]"; "Mickey Mouse is a funny animal cartoon character […]"; "Goofy is a funny animal cartoon character […]"; "José Carioca is a Disney cartoon character […]".

  20. 2. Category heuristic: given a page and its ambiguous hypernym, exploit its categories to build a distribution of the hypernym's senses. Page: Donald Duck. Ambiguous hypernym: Character. Sense counts per category: Characters in Disney package films: Character (arts) 5, Funny animal 1. Disney comics characters: Character (arts) 3, Funny animal 1, Cartoon 1. Total: Character (arts) 8, Funny animal 2, Cartoon 1.

  21. 2. Category heuristic • Given a page and its ambiguous hypernym, exploit its categories to build a distribution of the hypernym's senses. Page: Donald Duck. Ambiguous hypernym: Character. Sense distribution: Character (arts) 8, Funny animal 2, Cartoon 1. Selected sense: Character (arts).
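The counting in the Donald Duck example can be sketched as follows: sum the per-category sense counts and take the most frequent sense. The per-category numbers are the ones shown on the slides; the function names and the dictionary layout are assumptions made for illustration, and the step that produces the per-category counts (reading the disambiguated hypernyms of the other pages in each category) is taken as given.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Sense counts per category, as shown on the slides for the lemma "character".
counts_by_category: Dict[str, Counter] = {
    "Characters in Disney package films": Counter(
        {"Character (arts)": 5, "Funny animal": 1}),
    "Disney comics characters": Counter(
        {"Character (arts)": 3, "Funny animal": 1, "Cartoon": 1}),
}


def sense_distribution(categories: Iterable[str],
                       counts: Dict[str, Counter]) -> Counter:
    """Sum the sense counts over all categories of the page."""
    total: Counter = Counter()
    for category in categories:
        total.update(counts.get(category, Counter()))
    return total


def pick_sense(distribution: Counter) -> Optional[str]:
    """Return the most frequent sense, or None if there is no evidence."""
    if not distribution:
        return None
    return distribution.most_common(1)[0][0]


donald_categories = ["Characters in Disney package films",
                     "Disney comics characters"]
distribution = sense_distribution(donald_categories, counts_by_category)
chosen = pick_sense(distribution)
print(distribution, chosen)
```

With the slide's numbers the summed distribution is Character (arts) 8, Funny animal 2, Cartoon 1, so the sketch selects Character (arts), matching the slide.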

  22. Page taxonomy linking heuristics: (1) Crowdsourced, 1.338M; (2) Category, 1.603M; (3) Multiword, 65K; (4) Monosemous, 161K; (5) Distributional, 561K.

  23. Page taxonomy evaluation

  24. The story so far. Phase 1: noisy page graph → page taxonomy.

  25. The Bitaxonomy algorithm (phase 2)

  26. The Bitaxonomy algorithm. The information available in the two taxonomies is mutually beneficial; • At each step, exploit one taxonomy to update the other and vice versa; • Repeat until convergence.

  27. The Bitaxonomy algorithm. Starting from the page taxonomy. [Diagram: pages Football team, Atlético Madrid, Real Madrid F.C.; categories Football teams, Football clubs, Football clubs in Madrid; one "is a" edge.]

  28. The Bitaxonomy algorithm. Exploit the cross links to infer hypernym relations in the category taxonomy. [Same diagram; two "is a" edges.]

  29. The Bitaxonomy algorithm. Take advantage of cross links to infer back is-a relations in the page taxonomy. [Same diagram; three "is a" edges.]

  30. The Bitaxonomy algorithm. Use the relations found in the previous step to infer new hypernym edges. [Same diagram; four "is a" edges.]

  31. The Bitaxonomy algorithm. Mutual enrichment of both taxonomies until convergence. [Same diagram; four "is a" edges.]
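The alternation described on these slides (pages update categories, categories update pages, repeat until nothing changes) can be sketched as a fixed-point loop. This is a heavily simplified illustration, not the authors' algorithm: the voting rule, the function names, and the toy cross links between pages and categories are all assumptions, and the "Football clubs" category from the diagram is left out to keep the example small.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional, Set, Tuple


def invert(page_cats: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Build category -> pages from page -> categories (the cross links)."""
    cat_pages: Dict[str, Set[str]] = defaultdict(set)
    for page, cats in page_cats.items():
        for cat in cats:
            cat_pages[cat].add(page)
    return cat_pages


def best(votes: Counter) -> Optional[str]:
    return votes.most_common(1)[0][0] if votes else None


def bitaxonomy(page_hyper: Dict[str, str],
               page_cats: Dict[str, Set[str]]) -> Tuple[Dict[str, str], Dict[str, str]]:
    page_hyper = dict(page_hyper)      # page -> hypernym page
    cat_hyper: Dict[str, str] = {}     # category -> hypernym category
    cat_pages = invert(page_cats)
    changed = True
    while changed:                     # repeat until convergence
        changed = False
        # Pages -> categories: the categories of the hypernyms of a
        # category's pages vote for that category's hypernym.
        for cat in sorted(cat_pages):
            if cat in cat_hyper:
                continue
            votes: Counter = Counter()
            for page in sorted(cat_pages[cat]):
                hyper = page_hyper.get(page)
                for c in sorted(page_cats.get(hyper, ())) if hyper else ():
                    if c != cat:
                        votes[c] += 1
            winner = best(votes)
            if winner is not None:
                cat_hyper[cat] = winner
                changed = True
        # Categories -> pages: the pages of the hypernyms of a page's
        # categories vote for that page's hypernym.
        for page in sorted(page_cats):
            if page in page_hyper:
                continue
            votes = Counter()
            for cat in sorted(page_cats[page]):
                hcat = cat_hyper.get(cat)
                for p in sorted(cat_pages.get(hcat, ())) if hcat else ():
                    if p != page:
                        votes[p] += 1
            winner = best(votes)
            if winner is not None:
                page_hyper[page] = winner
                changed = True
    return page_hyper, cat_hyper


# Toy input (cross links are assumed for illustration).
pages, cats = bitaxonomy(
    {"Real Madrid F.C.": "Football team"},
    {"Real Madrid F.C.": {"Football clubs in Madrid"},
     "Atlético Madrid": {"Football clubs in Madrid"},
     "Football team": {"Football teams"}},
)
```

The loop terminates because each pass either adds at least one new hypernym edge or stops, and edges are never removed. The sketch does not guard against cycles or restrict candidates to neighbours in the original noisy graphs, which a real implementation would likely need.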

  32. Page taxonomy evaluation (cont'd). An appreciable 3% increase in recall and coverage, with unchanged precision.

  33. Category taxonomy evaluation

  34. The story so far (phase 2)

  35. The WiBi category taxonomy refinement (phase 3)

  36. Category taxonomy refinement. Some categories are affected by structural problems, for example categories with no pages associated. [Diagram: categories Comics characters, Comics characters by protagonist, Garfield characters; annotation "No pages associated!".]

  37. Category taxonomy refinement • 3 refinement procedures to obtain broader coverage for categories • Single super category • Sub-categories • Super-categories

  38. Single super category. This category has only 1 outgoing edge, so we promote its only super category to hypernym. [Diagram node labels: Fictional characters by medium, Comics characters, Animated characters, Animation, Comics characters by protagonist, Animated television characters by series, Garfield characters.]

  39. Sub-categories. Focus on subcategories which have already been covered! [Diagram node labels: Comics by company, Comics characters, Comics characters by company, Comics titles by company, DC Comics characters, Marvel Comics characters, Disney comics.]

  40. Sub-categories. Focus on subcategories which have already been covered! Diagram annotations: 2 paths ending in v; only 1 path ending in u. [Diagram node labels: Comics by company, Comics characters, Comics characters by company, Comics titles by company, DC Comics characters, Marvel Comics characters, Disney comics.]
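The single-super-category rule can be sketched in a few lines: if a category still has no hypernym and exactly one super category in the category graph, that super category becomes its hypernym. The function name and the toy edges below are assumptions made for illustration; the slide's diagram does not survive in this transcript, so the edges are not taken from it.

```python
from typing import Dict, Set


def promote_single_super_category(super_cats: Dict[str, Set[str]],
                                  cat_hyper: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of cat_hyper extended by the single-super-category rule."""
    refined = dict(cat_hyper)
    for cat, supers in super_cats.items():
        # Only uncovered categories with exactly one outgoing edge qualify.
        if cat not in refined and len(supers) == 1:
            refined[cat] = next(iter(supers))
    return refined


# Hypothetical category graph: category -> its super categories.
super_cats = {
    "Comics characters by protagonist": {"Comics characters"},
    "Garfield characters": {"Comics characters by protagonist",
                            "Animated television characters by series"},
}
refined = promote_single_super_category(super_cats, {})
```

Categories with two or more super categories are left untouched here; per the slides, those are handled by the sub-category and super-category procedures.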

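  41. Category taxonomy evaluation: coverage. +50% categories covered! [Chart over the refinement procedures 1SUP, SUB, SUPER.]

  42. Category taxonomy evaluation: P & R. Chart annotations: 86%; +35% recall. [Chart over iterations of the refinement procedures 1SUP, SUB, SUPER.]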

  43. Experimental setup • We created 2 datasets: • 1000 randomly sampled pages; • 1000 randomly sampled categories. • Each item was annotated with the most suitable generalization (lemma+page or category).

  44. Competitors: WikiNet, MENTA, WikiTaxonomy. [Diagram labels: pages, categories.]

  45. Measures • We calculated typical measures to assess the quality of all the taxonomies: • Precision • Recall • Coverage • Specificity • Granularity
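The first three measures can be illustrated with a small computation. The slides do not define them, so the definitions below are assumptions: coverage is the fraction of items with a returned hypernym, precision is the fraction of returned hypernyms that are correct, and recall is the fraction of all items with a correct returned hypernym. The item names and gold annotations are invented placeholders, and the sketch assumes at most one predicted hypernym per item.

```python
from typing import Dict, Optional, Set, Tuple


def evaluate(predicted: Dict[str, Optional[str]],
             gold: Dict[str, Set[str]]) -> Tuple[float, float, float]:
    """Return (precision, recall, coverage) over the items in gold."""
    total = len(gold)
    returned = sum(1 for item in gold if predicted.get(item) is not None)
    correct = sum(1 for item, ok in gold.items() if predicted.get(item) in ok)
    precision = correct / returned if returned else 0.0
    recall = correct / total if total else 0.0
    coverage = returned / total if total else 0.0
    return precision, recall, coverage


# Placeholder data: 4 items, 3 with a predicted hypernym, 2 of them correct.
gold = {"item1": {"A"}, "item2": {"B"}, "item3": {"C"}, "item4": {"D"}}
predicted = {"item1": "A", "item2": "B", "item3": "X", "item4": None}
precision, recall, coverage = evaluate(predicted, gold)
```

Under these assumed definitions, recall can never exceed coverage, which is why a gain in coverage is a precondition for a gain in recall.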

  46. Page taxonomy comparison

  47. Page taxonomy comparison

  48. Category taxonomy comparison

  49. Category taxonomy comparison

  50. Category taxonomy comparison: specificity measure
