
Gerhard Weikum Max Planck Institute for Informatics http://www.mpi-inf.mpg.de/~weikum/

Knowledge Harvesting from Text and Web Sources. Part 3: Knowledge Linking. Gerhard Weikum Max Planck Institute for Informatics http://www.mpi-inf.mpg.de/~weikum/. Quiz Time. How many days do you need to visit all Shangri-La places on this planet? Source: geonames.org.


Presentation Transcript


  1. Knowledge Harvesting from Text and Web Sources Part 3: Knowledge Linking Gerhard Weikum Max Planck Institute for Informatics http://www.mpi-inf.mpg.de/~weikum/

  2. Quiz Time How many days do you need to visit all Shangri-La places on this planet? Source: geonames.org Answer: 365

  3. Quiz Time How many days do you need to visit all Shangri-La places on this planet?

  4. Linked Data: RDF Triples on the Web 30 billion triples, 500 million links http://richard.cyganiak.de/2007/10/lod/lod-datasets_2011-09-19_colored.png

  5. Linked RDF Triples on the Web yago/wordnet:Artist109812338 rdf:subclassOf rdf:subclassOf yago/wordnet:Actor109765278 rdf:type yago/wikicategory:ItalianComposer rdf:type imdb.com/name/nm0910607/ dbpedia.org/resource/Ennio_Morricone prop:actedIn prop:composedMusicFor imdb.com/title/tt0361748/ dbpprop:citizenOf dbpedia.org/resource/Rome owl:sameAs owl:sameAs rdf.freebase.com/ns/en.rome data.nytimes.com/51688803696189142301 owl:sameAs geonames.org/5134301/city_of_rome Coord N 43° 12' 46'' W 75° 27' 20''

  6. Linked RDF Triples on the Web yago/wordnet:Artist109812338 rdf:subclassOf rdf:subclassOf yago/wordnet:Actor109765278 rdf:type yago/wikicategory:ItalianComposer rdf:type imdb.com/name/nm0910607/ dbpedia.org/resource/Ennio_Morricone prop:actedIn prop:composedMusicFor imdb.com/title/tt0361748/ dbpprop:citizenOf dbpedia.org/resource/Rome ? ? owl:sameAs owl:sameAs rdf.freebase.com/ns/en.rome_ny data.nytimes.com/51688803696189142301 Referential data quality? Hand-crafted sameAs links? Generated sameAs links? ? owl:sameAs geonames.org/5134301/city_of_rome Coord N 43° 12' 46'' W 75° 27' 20''

  7. http://sig.ma RDF Entities on the Web

  8. RDF Entities on the Web http://sig.ma

  9. Entity-Name Ambiguity http://sameas.org

  10. Entities in HTML http://sindice.com

  11. Entity Markup in HTML: Towards Standardized Microformats http://schema.org/

  12. Entity Markup in HTML: Towards Standardized Microformats http://schema.org/

  13. Web Page in Standard HTML http://schema.org/ Jane Doe <img src="janedoe.jpg" /> Professor 20341 Whitworth Institute 405 Whitworth Seattle WA 98052 (425) 123-4567 <a href="mailto:jane-doe@xyz.edu">jane-doe@illinois.edu</a> Jane's home page: <a href="http://www.janedoe.com">janedoe.com</a> Graduate students: <a href="http://www.xyz.edu/students/alicejones.html">Alice Jones</a> <a href="http://www.xyz.edu/students/bobsmith.html">Bob Smith</a>

  14. Web Page in HTML with Microdata http://schema.org/ <div itemscope itemtype="http://schema.org/Person">   <span itemprop="name">Jane Doe</span>   <img src="janedoe.jpg" itemprop="image" />   <span itemprop="jobTitle">Professor</span>   <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">     <span itemprop="streetAddress">       20341 Whitworth Institute       405 N. Whitworth     </span>     <span itemprop="addressLocality">Seattle</span>,     <span itemprop="addressRegion">WA</span>     <span itemprop="postalCode">98052</span>   </div>   <span itemprop="telephone">(425) 123-4567</span>   <a href="mailto:jane-doe@xyz.edu" itemprop="email">     jane-doe@xyz.edu</a>   Jane's home page:   <a href="http://www.janedoe.com" itemprop="url">janedoe.com</a>   Graduate students:   <a href="http://www.xyz.edu/students/alicejones.html" itemprop="colleague">     Alice Jones</a>   <a href="http://www.xyz.edu/students/bobsmith.html" itemprop="colleague">     Bob Smith</a> </div>
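To illustrate what the microdata markup on slide 14 makes machine-readable, here is a minimal sketch in Python that pulls (itemprop, value) pairs out of a shortened copy of that page. It uses only the standard-library `html.parser`. The class name `ItempropCollector` is hypothetical, and the sketch deliberately ignores `itemscope` nesting, so it is not a complete microdata parser.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collects flat (itemprop, value) pairs; itemscope nesting is ignored."""

    # For these tags the property value is taken from a URL attribute.
    URL_ATTRS = {"a": "href", "img": "src"}

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open = None  # (property name, tag name, text chunks) or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if tag in self.URL_ATTRS:
            self.pairs.append((prop, attrs.get(self.URL_ATTRS[tag])))
        elif "itemscope" not in attrs:
            # Plain element: its text content becomes the value.
            self._open = (prop, tag, [])

    def handle_data(self, data):
        if self._open is not None:
            self._open[2].append(data)

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[1]:
            prop, _, chunks = self._open
            self.pairs.append((prop, " ".join("".join(chunks).split())))
            self._open = None


# Shortened version of the slide's example page.
page = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <img src="janedoe.jpg" itemprop="image" />
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="addressLocality">Seattle</span>
  </div>
  <a href="http://www.janedoe.com" itemprop="url">janedoe.com</a>
</div>
"""

collector = ItempropCollector()
collector.feed(page)
pairs = collector.pairs
```

Each pair can be read as the property and object of a triple about the enclosing `Person` item, which is the sense in which such markup adds structured data to ordinary Web pages.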

  15. Web-of-Data vs. Web-of-Contents • Critical for knowledge linkage: • entity name ambiguity • more structured data combined with text • boosted by knowledge harvesting methods

  16. Embedding RDFa in Web Contents <html … May 2, 2011 <div typeof=event:music> <span id="Maestro_Morricone"> Maestro Morricone <a rel="sameAs" resource="dbpedia…/Ennio_Morricone "/> </span> … <span property = "event:location" > Smetana Hall </span> … <span property="rdf:type" resource="yago:performance"> The concert</span> will feature … <span property="event:date" content="14-07-2011"></span> July 1 </div> May 2, 2011 Maestro Morricone will perform on the stage of the Smetana Hall to conduct the Czech National Symphony Orchestra and Choir. The concert will feature both Classical compositions and soundtracks such as the Ecstasy of Gold. In programme two concerts for July 14th and 15th. RDF data and Web contents need to be interconnected. RDFa & microformats provide the mechanism. Need ways of creating more embedded RDF triples!

  17. Outline Motivation Entity-Name Disambiguation Mapping Questions into Queries Entity Linkage Wrap-up ...

  18. Named-Entity Disambiguation Harry fought with you know who. He defeats the dark lord. Candidate entities: Dirty Harry, Harry Potter, Prince Harry of England, The Who (band), Lord Voldemort. Three NLP tasks: 1) named-entity detection: segment & label by HMM or CRF (e.g. Stanford NER tagger) 2) co-reference resolution: link to preceding NP (trained classifier over linguistic features) 3) named-entity disambiguation: map each mention (name) to canonical entity (entry in KB)

  19. Named Entity Disambiguation Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films. Mentions (surface names) and their KB entries: Sergio means Sergio_Leone; Sergio means Serge_Gainsbourg; Ennio means Ennio_Antonelli; Ennio means Ennio_Morricone; Eli means Eli_(bible); Eli means ExtremeLightInfrastructure; Eli means Eli_Wallach; Ecstasy means Ecstasy_(drug); Ecstasy means Ecstasy_of_Gold; trilogy means Star_Wars_Trilogy; trilogy means Lord_of_the_Rings; trilogy means Dollars_Trilogy; … Entities (meanings) in the KB: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Benny Goodman, Benny Andersson, Star Wars Trilogy, Lord of the Rings, Dollars Trilogy

  20. Mention-Entity Graph: weighted undirected graph with two types of nodes. bag-of-words or language model: words, bigrams, phrases. Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films. Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy. • Popularity (m,e): freq(e|m), length(e), #links(e) • Similarity (m,e): cos/Dice/KL (context(m), context(e)) KB+Stats

  21. Mention-Entity Graph: weighted undirected graph with two types of nodes. Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films. Candidate entities (joint mapping): Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy. • Popularity (m,e): freq(e|m), length(e), #links(e) • Similarity (m,e): cos/Dice/KL (context(m), context(e)) KB+Stats

  22. Mention-Entity Graph: weighted undirected graph with two types of nodes. Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films. Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy. • Popularity (m,e): freq(m,e|m), length(e), #links(e) • Similarity (m,e): cos/Dice/KL (context(m), context(e)) • Coherence (e,e'): dist(types), overlap(links), overlap(anchor words) KB+Stats

  23. Mention-Entity Graph: weighted undirected graph with two types of nodes. Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films. Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy. Entity types shown: [American Jews, film actors, artists, Academy Award winners] [Metallica songs, Ennio Morricone songs, artifacts, soundtrack music] [spaghetti westerns, film trilogies, movies, artifacts] • Popularity (m,e): freq(m,e|m), length(e), #links(e) • Similarity (m,e): cos/Dice/KL (context(m), context(e)) • Coherence (e,e'): dist(types), overlap(links), overlap(anchor words) KB+Stats

  24. Mention-Entity Graph: weighted undirected graph with two types of nodes. Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films. Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy. Links shown: [http://.../wiki/Dollars_Trilogy http://.../wiki/The_Good,_the_Bad,_the_Ugly http://.../wiki/Clint_Eastwood http://.../wiki/Honorary_Academy_Award] [http://.../wiki/The_Good,_the_Bad,_the_Ugly http://.../wiki/Metallica http://.../wiki/Bellagio_(casino) http://.../wiki/Ennio_Morricone] [http://.../wiki/Sergio_Leone http://.../wiki/The_Good,_the_Bad,_the_Ugly http://.../wiki/For_a_Few_Dollars_More http://.../wiki/Ennio_Morricone] • Popularity (m,e): freq(m,e|m), length(e), #links(e) • Similarity (m,e): cos/Dice/KL (context(m), context(e)) • Coherence (e,e'): dist(types), overlap(links), overlap(anchor words) KB+Stats

  25. Mention-Entity Graph: weighted undirected graph with two types of nodes. Sergio talked to Ennio about Eli's role in the Ecstasy scene. This sequence on the graveyard was a highlight in Sergio's trilogy of western films. Candidate entities: Eli (bible), Eli Wallach, Ecstasy (drug), Ecstasy of Gold, Star Wars, Lord of the Rings, Dollars Trilogy. Anchor words shown: [The Magnificent Seven; The Good, the Bad, and the Ugly; Clint Eastwood; University of Texas at Austin] [Metallica on Morricone tribute; Bellagio water fountain show; Yo-Yo Ma; Ennio Morricone composition] [For a Few Dollars More; The Good, the Bad, and the Ugly; Man with No Name trilogy; soundtrack by Ennio Morricone] • Popularity (m,e): freq(m,e|m), length(e), #links(e) • Similarity (m,e): cos/Dice/KL (context(m), context(e)) • Coherence (e,e'): dist(types), overlap(links), overlap(anchor words) KB+Stats

  26. Different Approaches • Combine Popularity, Similarity, and Coherence Features (Cucerzan: EMNLP'07, Milne/Witten: CIKM'08): • for sim(context(m), context(e)): consider surrounding mentions and their candidate entities • use their types, links, anchors as features of context(m) • set m-e edge weights accordingly • use greedy methods for solution • Collective Learning with Prob. Factor Graphs (Chakrabarti et al.: KDD'09): • model P[m|e] by similarity and P[e1|e2] by coherence • consider likelihood of P[m1 … mk | e1 … ek] • factorize by all m-e pairs and e1-e2 pairs • use hill-climbing, LP, etc. for solution

  27. Joint Mapping [Figure: mention-entity graph with weighted edges] • Build mention-entity graph or joint-inference factor graph from knowledge and statistics in KB • Compute high-likelihood mapping (ML or MAP) or dense subgraph such that: each m is connected to exactly one e (or at most one e)

  28. Mention-Entity Popularity Weights [Milne/Witten 2008, Spitkovsky/Chang 2012] • Need dictionary with entities' names: • full names: Arnold Alois Schwarzenegger, Los Angeles, Microsoft Corporation • short names: Arnold, Arnie, Mr. Schwarzenegger, New York, Microsoft, … • nicknames & aliases: Terminator, City of Angels, Evil Empire, … • acronyms: LA, UCLA, MS, MSFT • role names: the Austrian action hero, Californian governor, the CEO of MS, … • … • plus gender info (useful for resolving pronouns in context): Bill and Melinda met at MS. They fell in love and he kissed her. • Collect hyperlink anchor-text / link-target pairs from • Wikipedia redirects • Wikipedia links between articles • Interwiki links between Wikipedia editions • Web links pointing to Wikipedia articles • … • Build statistics to estimate P[entity | name]
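The last bullet of slide 28 (estimating P[entity | name] from anchor-text / link-target pairs) can be sketched as relative frequencies. The function name `build_prior` and the toy pairs below are assumptions made for illustration; the counts are invented, not real Wikipedia statistics.

```python
from collections import Counter, defaultdict


def build_prior(anchor_pairs):
    """Estimate P[entity | name] as a relative frequency over (name, entity) pairs."""
    counts = defaultdict(Counter)
    for name, entity in anchor_pairs:
        counts[name][entity] += 1
    prior = {}
    for name, entity_counts in counts.items():
        total = sum(entity_counts.values())
        prior[name] = {e: n / total for e, n in entity_counts.items()}
    return prior


# Invented toy data: each pair stands for one hyperlink whose anchor text is
# `name` and whose link target is the article of `entity`.
anchor_pairs = (
    [("LA", "Los_Angeles")] * 3
    + [("LA", "Louisiana")]
    + [("Arnie", "Arnold_Schwarzenegger")]
)

prior = build_prior(anchor_pairs)
```

With these toy counts, `prior["LA"]` assigns 0.75 to `Los_Angeles` and 0.25 to `Louisiana`. A real dictionary would also need name normalization and smoothing for rare names, which this sketch leaves out.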

  29. Mention-Entity Similarity Edges • Precompute characteristic keyphrases q for each entity e: anchor texts or noun phrases in e's page with high PMI, e.g. "Metallica tribute to Ennio Morricone" • Match keyphrase q of candidate e in context of mention m (score annotations: extent of partial matches, weight of matched words). Example context: The Ecstasy piece was covered by Metallica on the Morricone tribute album. • Compute overall similarity of context(m) and candidate e

  30. Entity-Entity Coherence Edges • Precompute overlap of incoming links for entities e1 and e2 • Alternatively compute overlap of anchor texts for e1 and e2, or overlap of keyphrases, or similarity of bag-of-words, or … • Optionally combine with type distance of e1 and e2 (e.g., Jaccard index for type instances) • For special types of e1 and e2 (locations, people, etc.) use spatial or temporal distance
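One simple way to turn "overlap of incoming links" into a number is the Jaccard index of the two link sets; this is only one possible choice, and the slides do not commit to it. The sketch below reuses the page names listed on slide 24, under the assumption that the three groups there belong to Eli Wallach, Ecstasy of Gold, and Dollars Trilogy. The link set for `Ecstasy_(drug)` is invented for contrast.

```python
def jaccard(links_a, links_b):
    """Jaccard index of two link sets: |intersection| / |union|."""
    union = links_a | links_b
    return len(links_a & links_b) / len(union) if union else 0.0


links = {
    # Groups as listed on slide 24 (assignment to entities is an assumption).
    "Eli_Wallach": {
        "Dollars_Trilogy", "The_Good,_the_Bad,_the_Ugly",
        "Clint_Eastwood", "Honorary_Academy_Award",
    },
    "Ecstasy_of_Gold": {
        "The_Good,_the_Bad,_the_Ugly", "Metallica",
        "Bellagio_(casino)", "Ennio_Morricone",
    },
    "Dollars_Trilogy": {
        "Sergio_Leone", "The_Good,_the_Bad,_the_Ugly",
        "For_a_Few_Dollars_More", "Ennio_Morricone",
    },
    # Invented link set, only to show a non-coherent pair.
    "Ecstasy_(drug)": {"MDMA", "Rave"},
}

coh_wallach_gold = jaccard(links["Eli_Wallach"], links["Ecstasy_of_Gold"])
coh_gold_dollars = jaccard(links["Ecstasy_of_Gold"], links["Dollars_Trilogy"])
coh_drug_dollars = jaccard(links["Ecstasy_(drug)"], links["Dollars_Trilogy"])
```

On this toy data the film-related entities share links (1/7 and 1/3), while the drug sense shares none with the trilogy, which is the signal the coherence edges are meant to capture.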

  31. Coherence Graph Algorithm [J. Hoffart et al.: EMNLP'11] [Figure: weighted mention-entity graph, step 1 of 4] • Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e) • Greedy approximation: iteratively remove weakest entity and its edges • Keep alternative solutions, then use local/randomized search

  32. Coherence Graph Algorithm [J. Hoffart et al.: EMNLP'11] [Figure: weighted mention-entity graph, step 2 of 4] • Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e) • Greedy approximation: iteratively remove weakest entity and its edges • Keep alternative solutions, then use local/randomized search

  33. Coherence Graph Algorithm [J. Hoffart et al.: EMNLP'11] [Figure: weighted mention-entity graph, step 3 of 4] • Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e) • Greedy approximation: iteratively remove weakest entity and its edges • Keep alternative solutions, then use local/randomized search

  34. Coherence Graph Algorithm [J. Hoffart et al.: EMNLP'11] [Figure: weighted mention-entity graph, step 4 of 4] • Compute dense subgraph to maximize min weighted degree among entity nodes such that: each m is connected to exactly one e (or at most one e) • Greedy approximation: iteratively remove weakest entity and its edges • Keep alternative solutions, then use local/randomized search
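The greedy approximation named on these slides (iteratively remove the weakest entity and its edges) can be sketched as follows. This is a simplified illustration, not the published AIDA implementation: it does not keep alternative solutions or run the local/randomized search, the function name `greedy_disambiguate` is hypothetical, and all edge weights are invented.

```python
def greedy_disambiguate(me_weights, ee_weights):
    """me_weights: {(mention, entity): w}; ee_weights: {(e1, e2): w}, undirected."""
    candidates = {}
    for mention, entity in me_weights:
        candidates.setdefault(mention, set()).add(entity)
    alive = {e for cands in candidates.values() for e in cands}

    def weighted_degree(e):
        # Mention-entity edges of e plus coherence edges to entities still alive.
        deg = sum(w for (_, x), w in me_weights.items() if x == e)
        deg += sum(w for (a, b), w in ee_weights.items()
                   if (a == e and b in alive) or (b == e and a in alive))
        return deg

    while True:
        # An entity may be removed only if no mention would lose its last candidate.
        removable = [e for e in alive
                     if all(len(c) > 1 for c in candidates.values() if e in c)]
        if not removable:
            break
        weakest = min(removable, key=lambda e: (weighted_degree(e), e))
        alive.discard(weakest)
        for cands in candidates.values():
            cands.discard(weakest)

    # If several candidates survive for a mention, fall back to the strongest m-e edge.
    return {m: max(c, key=lambda e: (me_weights[(m, e)], e))
            for m, c in candidates.items()}


# Invented weights for the running example.
me_weights = {
    ("Eli", "Eli_(bible)"): 50, ("Eli", "Eli_Wallach"): 30,
    ("Ecstasy", "Ecstasy_(drug)"): 60, ("Ecstasy", "Ecstasy_of_Gold"): 20,
    ("trilogy", "Star_Wars_Trilogy"): 40, ("trilogy", "Dollars_Trilogy"): 30,
}
ee_weights = {
    ("Eli_Wallach", "Ecstasy_of_Gold"): 80,
    ("Eli_Wallach", "Dollars_Trilogy"): 90,
    ("Ecstasy_of_Gold", "Dollars_Trilogy"): 100,
    ("Eli_(bible)", "Ecstasy_(drug)"): 5,
}

mapping = greedy_disambiguate(me_weights, ee_weights)
```

In this toy graph the mention-entity weights alone would favor Eli (bible), Ecstasy (drug), and Star Wars, but the coherence edges keep the three film-related entities alive, so the greedy removal ends with Eli_Wallach, Ecstasy_of_Gold, and Dollars_Trilogy.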

  35. Alternative: Random Walks [Figure: mention-entity graph with edge weights and transition probabilities] • for each mention run random walks with restart (like personalized PR with jumps to start mention(s)) • rank candidate entities by stationary visiting probability • very efficient, decent accuracy
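A random walk with restart can be approximated by power iteration, as in the sketch below. Transition probabilities are edge weights normalized per node, and with probability `restart` the walker jumps back to the start mention. The function name, the restart value of 0.15, the `m:` prefix for mention nodes, and all weights are assumptions chosen for illustration.

```python
def random_walk_with_restart(edges, start, restart=0.15, iterations=100):
    """edges: {(u, v): weight}, undirected. Returns approximate visiting probabilities."""
    adjacency = {}
    for (u, v), w in edges.items():
        adjacency.setdefault(u, {})[v] = w
        adjacency.setdefault(v, {})[u] = w

    prob = {node: 0.0 for node in adjacency}
    prob[start] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in adjacency}
        nxt[start] += restart  # jump back to the start mention
        for u, p in prob.items():
            total = sum(adjacency[u].values())
            for v, w in adjacency[u].items():
                nxt[v] += (1 - restart) * p * w / total
        prob = nxt
    return prob


# Invented weights; mention nodes carry an "m:" prefix.
edges = {
    ("m:Eli", "Eli_(bible)"): 45, ("m:Eli", "Eli_Wallach"): 35,
    ("m:Ecstasy", "Ecstasy_(drug)"): 60, ("m:Ecstasy", "Ecstasy_of_Gold"): 20,
    ("m:trilogy", "Star_Wars_Trilogy"): 40, ("m:trilogy", "Dollars_Trilogy"): 30,
    ("Eli_Wallach", "Ecstasy_of_Gold"): 80,
    ("Eli_Wallach", "Dollars_Trilogy"): 90,
    ("Ecstasy_of_Gold", "Dollars_Trilogy"): 100,
    ("Eli_(bible)", "Ecstasy_(drug)"): 5,
}

probs = random_walk_with_restart(edges, start="m:Eli")
ranking = sorted(["Eli_(bible)", "Eli_Wallach"], key=probs.get, reverse=True)
```

Although the direct edge from the mention to Eli_(bible) is heavier here, the walker keeps returning to Eli_Wallach through the densely connected film entities, so Eli_Wallach ends up with the higher visiting probability.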

  36. AIDA: Accurate Online Disambiguation http://www.mpi-inf.mpg.de/yago-naga/aida/

  37. AIDA: Accurate Online Disambiguation http://www.mpi-inf.mpg.de/yago-naga/aida/

  38. AIDA: Very Difficult Example http://www.mpi-inf.mpg.de/yago-naga/aida/

  39. AIDA: Very Difficult Example http://www.mpi-inf.mpg.de/yago-naga/aida/

  40. AIDA: Accurate Online Disambiguation http://www.mpi-inf.mpg.de/yago-naga/aida/

  41. AIDA: Accurate Online Disambiguation http://www.mpi-inf.mpg.de/yago-naga/aida/

  42. AIDA: Accurate Online Disambiguation http://www.mpi-inf.mpg.de/yago-naga/aida/

  43. AIDA: Accurate Online Disambiguation http://www.mpi-inf.mpg.de/yago-naga/aida/

  44. Some NED Online Tools for • J. Hoffart et al.: EMNLP 2011, VLDB 2011 • https://d5gate.ag5.mpi-sb.mpg.de/webaida/ • P. Ferragina, U. Scaiella: CIKM 2010 • http://tagme.di.unipi.it/ • R. Isele, C. Bizer: VLDB 2012 • http://spotlight.dbpedia.org/demo/index.html • Reuters Open Calais • http://viewer.opencalais.com/ • S. Kulkarni, A. Singh, G. Ramakrishnan, S. Chakrabarti: KDD 2009 • http://www.cse.iitb.ac.in/soumen/doc/CSAW/ • D. Milne, I. Witten: CIKM 2008 • http://wikipedia-miner.cms.waikato.ac.nz/demos/annotate/ • perhaps more • some use Stanford NER tagger for detecting mentions • http://nlp.stanford.edu/software/CRF-NER.shtml

  45. NED: Experimental Evaluation • Benchmark: Extended CoNLL 2003 dataset: 1400 newswire articles, originally annotated with mention markup (NER), now with NED mappings to Yago and Freebase • difficult texts: … Australia beats India … → Australian_Cricket_Team; … White House talks to Kremlin … → President_of_the_USA; … EDS made a contract with … → HP_Enterprise_Services • Results: Best: AIDA method with prior+sim+coh + robustness test: 82% precision @100% recall, 87% mean average precision • Comparison to other methods, see paper: J. Hoffart et al.: Robust Disambiguation of Named Entities in Text, EMNLP 2011 http://www.mpi-inf.mpg.de/yago-naga/aida/

  46. Ongoing Research & Remaining Challenges • More efficient graph algorithms (multicore, etc.) • Allow mentions of unknown entities, mapped to null • Leverage deep-parsing structures, leverage semantic types. Example: Page played Kashmir on his Gibson (parse labels: obj, subj, mod) • Short and difficult texts: tweets, headlines, etc.; fictional texts: novels, song lyrics, etc.; incoherent texts • Structured Web data: tables and lists • Disambiguation beyond entity names: coreferences: pronouns, paraphrases, etc.; common nouns, verbal phrases (general WSD)

  47. General Word Sense Disambiguation Which song writers covered ballads written by the Stones? Candidate senses: {songwriter, composer}; {cover, perform}, {cover, report, treat}, {cover, help out}

  48. Handling Out-of-Wikipedia Entities Cave composed haunting songs like Hallelujah, O Children, and the Weeping Song. Candidate entities: wikipedia.org/Good_Luck_Cave, wikipedia.org/Nick_Cave, wikipedia/Hallelujah_Chorus, wikipedia/Hallelujah_(L_Cohen), last.fm/Nick_Cave/Hallelujah, wikipedia/Children_(2011 film), last.fm/Nick_Cave/O_Children, wikipedia.org/Weeping_(song), last.fm/Nick_Cave/Weeping_Song

  49. Handling Out-of-Wikipedia Entities Gunung Mulu National Park Sarawak Chamber largest underground chamber wikipedia.org/Good_Luck_Cave Bad Seeds No More Shall We Part Murder Songs Cave composed haunting songs like Hallelujah, O Children, and the Weeping Song. wikipedia.org/Nick_Cave Messiah oratorio George Frideric Handel wikipedia/Hallelujah_Chorus Leonard Cohen Rufus Wainwright Shrek and Fiona wikipedia/Hallelujah_(L_Cohen) eerie violin Bad Seeds No More Shall We Part last.fm/Nick_Cave/Hallelujah wikipedia/Children_(2011 film) South Korean film Nick Cave & Bad Seeds Harry Potter 7 movie haunting choir last.fm/Nick_Cave/O_Children wikipedia.org/Weeping_(song) Dan Heymann apartheid system last.fm/Nick_Cave/Weeping_Song Nick Cave Murder Songs P.J. Harvey Nick and Blixa duet

  50. Handling Out-of-Wikipedia Entities [J. Hoffart et al.: CIKM'12] • Characterize all entities (and mentions) by sets of keyphrases • Entity coherence then becomes: keyphrase overlap, no need for href link data • For each mention add a "self" candidate: out-of-KB entity with keyphrases computed by Web search • Phrase overlap for phrases p, q with word weights p(w), q(w): $\mathrm{PO}(p,q) = \frac{\sum_{w \in p \cap q} \min(p(w), q(w))}{\sum_{w \in p \cup q} \max(p(w), q(w))}$ • Relatedness for entities e, f with phrase weights e(p), f(q): $\mathrm{KORE}(e,f) = \frac{\sum_{p \in e,\, q \in f} \mathrm{PO}(p,q)^2 \cdot \min(e(p), f(q))}{\sum_{p \in e} e(p) + \sum_{q \in f} f(q)}$ • Efficient comparison of two keyphrase sets: two-stage hashing, using min-hash sketches and LSH
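The phrase-overlap and KORE measures on slide 50 can be computed directly, as in this minimal sketch. It assumes the numerator of PO sums over words shared by both phrases and the denominator over words in either phrase. The data layout (a phrase as a word-to-weight dict, an entity as a list of `(phrase_weight, phrase)` pairs) and all weights are hypothetical, and the min-hash/LSH speed-up mentioned on the slide is not implemented.

```python
def phrase_overlap(p, q):
    """PO(p, q) for phrases given as {word: weight} dicts."""
    shared = p.keys() & q.keys()
    either = p.keys() | q.keys()
    numerator = sum(min(p[w], q[w]) for w in shared)
    denominator = sum(max(p.get(w, 0.0), q.get(w, 0.0)) for w in either)
    return numerator / denominator if denominator else 0.0


def kore(e, f):
    """KORE(e, f) for entities given as lists of (phrase_weight, phrase) pairs."""
    total = sum(w for w, _ in e) + sum(w for w, _ in f)
    if not total:
        return 0.0
    score = sum(phrase_overlap(p, q) ** 2 * min(we, wf)
                for we, p in e for wf, q in f)
    return score / total


# Invented weights; the keyphrases are taken from slide 49.
nick_cave = [
    (2.0, {"bad": 1.0, "seeds": 1.0}),
    (1.0, {"murder": 1.0, "songs": 1.0}),
]
o_children = [
    (2.0, {"bad": 1.0, "seeds": 1.0}),
    (1.0, {"haunting": 1.0, "choir": 1.0}),
]

partial = phrase_overlap({"bad": 1.0, "seeds": 1.0},
                         {"nick": 1.0, "cave": 1.0, "bad": 1.0, "seeds": 1.0})
relatedness = kore(nick_cave, o_children)
```

Here only the shared keyphrase "Bad Seeds" contributes, giving a relatedness of 2 / 6. Squaring PO means that partially matching phrases count much less than exact matches.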
