480 likes | 613 Vues
From Information to Knowledge :. Harvesting Entities , Relationships , and Temporal Facts from Web Sources. Gerhard Weikum Max Planck Institute for Informatics http://www.mpi-inf.mpg.de/~weikum/. Acknowledgements. Goal: Turn Web into Knowledge Base. Source: DB & IR methods for
E N D
From Information toKnowledge: HarvestingEntities, Relationships, and Temporal Facts from Web Sources Gerhard Weikum Max Planck Institute forInformatics http://www.mpi-inf.mpg.de/~weikum/
Goal: Turn Web into Knowledge Base Source: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009 • comprehensiveDB ofhuman knowledge • everythingthatWikipediaknows • everythingmachine-readable • capturingentities, classes, relationships
Approach: Harvesting Facts from Web PoliticianPosition Angela Merkel Chancellor Germany Karl-Theodor zu Guttenberg Minister of Defense Germany Christoph Hartmann Minister of Economy Saarland … ActorAward Christoph Waltz Oscar Sandra Bullock Oscar Sandra Bullock Golden Raspberry … PoliticianPolitical Party Angela Merkel CDU Karl-Theodor zu Guttenberg CDU Christoph Hartmann FDP … CompanyCEO Google Eric Schmidt Yahoo Overture Facebook FriendFeed Software AG IDS Scheer … MovieReportedRevenue Avatar $ 2,718,444,933 The Reader $ 108,709,522 Facebook FriendFeed Software AG IDS Scheer … PoliticalPartySpokesperson CDU Philipp Wachholz Die Grünen Claudia Roth FacebookFriendFeed Software AG IDS Scheer … CompanyAcquiredCompany Google YouTube Yahoo Overture FacebookFriendFeed Software AG IDS Scheer … SUMO IWP Cyc ReadTheWeb WikiTax2WordNet YAGO-NAGA TextRunner
Knowledge for Intelligence • entity recognition & disambiguation • understanding natural language & speech • knowledge services & reasoning for semantic apps • (e.g. deep QA) • semantic search: preciseanswers to advanced queries • (by scientists, students, journalists, analysts, etc.) German footballcoachwhenBastian Schweinsteiger was born? FIFA 2010 finalistswhoplayed in a Champions League final? Politicianswhoare also scientists? Relationshipsbetween Manfred Pinkal, Edsger Dijkstra, Michael Dell, and Renee Zellweger? Enzymes thatinhibit HIV? Influenza drugsforteenswithhighbloodpressure? ...
Outline WhatandWhy Automatic KB Construction Growing & Maintainingthe KB Temporal Knowledge Wrap-up ...
What is Knowledge (in a KB)? • facts / assertions: bornIn (BastianSchweinsteiger, Kolbermoor), • hasWon (BastianSchweinsteiger, BronzeFIFAWorldCup2010), • playedInFinal (BastianSchweinsteiger, ChampionsLeague2010), … • taxonomic: instanceOf (BastianSchweinsteiger, footballPlayer), • subclassOf (footballPlayer, athlete), … • lexical / terminology: means (“Big Apple“, NewYorkCity), • means (“Apple“, AppleComputerCorporation) • means (“MS“, Microsoft) , means (“MS“, MultipleSclerosis) … • common-senseproperties: • applesaregreen, red, juicy, sweet, sour … - but not fast, smart … • ballsareround, smooth, slippery … - but not square, funny … • common-senseaxioms: • x: human(x) male(x) female(x) • x: (male(x) female(x)) (female(x) ) male(x)) • x: animal(x) (hasLegs(x) isEven(numberOfLegs(x)) … • procedural: howto fix/install/prepare/remove … • epistemic / beliefs: believes (Ptolemy, shape(Earth, disc)), • believes (Copernicus, shape(Earth, sphere)) … ...
KB‘s: Example YAGO (Suchanek et al.: WWW‘07) 2 Mio. entities, 200 000 classes 40 Mio. RDF triples(facts) ( entity1-relation-entity2, subject-predicate-object) Entity subclass subclass subclass Organization Person Location subclass subclass Accuracy 95% subclass subclass subclass Country Scientist Politician subclass subclass State instanceOf instanceOf Biologist instanceOf Physicist City instanceOf Germany instanceOf instanceOf locatedIn Erwin_Planck Oct 23, 1944 diedOn locatedIn Kiel Schleswig-Holstein FatherOf bornIn Nobel Prize hasWon instanceOf citizenOf diedOn Oct 4, 1947 Max_Planck Society Max_Planck Angela Merkel Apr 23, 1858 bornOn means(0.9) means means means means(0.1) “Max Planck” “Max Karl Ernst Ludwig Planck” “Angela Merkel” “Angela Dorothea Merkel” http://www.mpi-inf.mpg.de/yago-naga/
KB‘s: Example YAGO (F. Suchanek et al.: WWW‘07) http://www.mpi-inf.mpg.de/yago-naga/
KB‘s: ExampleDBpedia(Auer, Bizer, et al.: ISWC‘07) • 3 Mio. entities, • 1 Bio. facts (RDF triples) • 1.5 Mio. entitiesmappedto • hand-craftedtaxonomyof • 259 classeswith 1200 properties http://www.dbpedia.org
Outline What and Why Automatic KB Construction Growing & Maintainingthe KB Temporal Knowledge Wrap-up ...
French Marriage Problem facts in KB: newfactsorfactcandidates: married(Cecilia, Nicolas) married (Carla, Benjamin) married (Carla, Mick) married (Michelle, Barack) married (Yoko, John) married (Kate, Leonardo) married (Carla, Sofie) married (Larry, Google) married (Hillary, Bill) married (Carla, Nicolas) married (Angelina, Brad) forrecall: pattern-basedharvesting forprecision: consistencyreasoning
Pattern-BasedHarvesting (Hearst 92, Brin98, Agichtein 00, Etzioni 04, …) Facts & Fact Candidates Patterns (Hillary, Bill) X and her husband Y (Carla, Nicolas) X and Y on their honeymoon (Angelina, Brad) (Victoria, David) X and Y and their children (Hillary, Bill) X has been dating with Y (Carla, Nicolas) X loves Y (Yoko, John) … • good for recall • noisy, drifting • not robust enough • for high precision (Kate, Pete) (Carla, Benjamin) (Larry, Google) (Angelina, Brad) (Victoria, David)
Reasoningabout Fact Candidates Useconsistencyconstraintstoprunefalsecandidates groundatoms: FOL rules (restricted): spouse(Hillary,Bill) spouse(Carla,Nicolas) spouse(Cecilia,Nicolas) spouse(Carla,Ben) spouse(Carla,Mick) spouse(Carla, Sofie) spouse(x,y) diff(y,z) spouse(x,z) spouse(x,y) diff(w,x) spouse(w,y) spouse(x,y) f(x) spouse(x,y) m(y) spouse(x,y) (f(x)m(y)) (m(x)f(y)) f(Hillary) f(Carla) f(Cecilia) f(Sofie) m(Bill) m(Nicolas) m(Ben) m(Mick) Rules revealinconsistencies Find consistentsubset(s) ofatoms (“possibleworld(s)“, “thetruth“) • Rules canbeweighted • (e.g. byfractionofgroundatomsthatsatisfy a rule) • uncertain / probabilistic data • compute prob. distr. ofsubsetofatomsbeingthetruth
MarkovLogic Networks (MLN‘s) (M. Richardson / P. Domingos 2006) Maplogicalconstraints & factcandidates intoprobabilisticgraph model: Markov Random Field (MRF) s(x,y) diff(y,z) s(x,z) s(x,y) f(x) f(x) m(x) s(Carla,Nicolas) s(Cecilia,Nicolas) s(Carla,Ben) s(Carla,Sofie) … s(x,y) diff(w,y) s(w,y) s(x,y) m(y) m(x) f(x) Grounding: Literal Boolean Var Literal binary RV s(Ca,Nic) s(Ce,Nic) s(Ca,Nic) s(Ca,Ben) s(Ca,Nic) m(Nic) s(Ca,Nic) s(Ca,So) s(Ce,Nic) m(Nic) s(Ca,Ben) s(Ca,So) s(Ca,Ben) m(Ben) s(Ca,Ben) s(Ca,So) s(Ca,So) m(So)
MarkovLogic Networks (MLN‘s) (M. Richardson / P. Domingos 2006) Maplogicalconstraints & factcandidates intoprobabilisticgraph model: Markov Random Field (MRF) s(x,y) diff(y,z) s(x,z) s(x,y) f(x) f(x) m(x) s(Carla,Nicolas) s(Cecilia,Nicolas) s(Carla,Ben) s(Carla,Sofie) … s(x,y) diff(w,y) s(w,y) s(x,y) m(y) m(x) f(x) s(Ce,Nic) RVs coupled by MRF edge iftheyappear in same clause m(Nic) s(Ca,Nic) s(Ca,Ben) m(Ben) s(Ca,So) MRF assumption: P[Xi|X1..Xn]=P[Xi|N(Xi)] m(So) Varietyofalgorithmsforjointinference: Gibbs sampling, other MCMC, belief propagation, randomized MaxSat, … jointdistribution hasproduct form over all cliques
Related Alternative Probabilistic Models ConstrainedConditional Models [D. Roth et al. 2007] log-linear classifierswithconstraint-violationpenalty mappedinto Integer Linear Programs Factor Graphs with Imperative Variable Coordination [A. McCallum et al. 2008] s(Ce,Nic) RV‘sshare “factors“ (jointfeaturefunctions) generalizes MRF, BN, CRF, … inference via advanced MCMC flexible coupling & constrainingofRV‘s m(Nic) s(Ca,Nic) s(Ca,Ben) m(Ben) s(Ca,So) m(So) softwaretools: alchemy.cs.washington.edu code.google.com/p/factorie/ research.microsoft.com/en-us/um/cambridge/projects/infernet/
Reasoning for KB Growth: Direct Route (F. Suchanek et al.: WWW‘09) newfactcandidates: facts in KB: married(Cecilia, Nicolas) married (Carla, Benjamin) married (Carla, Mick) married (Carla, Sofie) married (Larry, Google) ? + married (Hillary, Bill) married (Carla, Nicolas) married (Angelina, Brad) patterns: X and her husband Y X and Y andtheirchildren X hasbeendatingwith Y X loves Y Directapproach: factsaretrue; factcandidates& patterns hypotheses groundedconstraints clauseswithhypotheses as vars 2. type signaturesof relations greatlyreduce #clauses 3. castintoWeighted Max-Satwithweightsfrompatternstats customizedapproximationalgorithm unifies: factcandconsistency, patterngoodness, entitydisambig. www.mpi-inf.mpg.de/yago-naga/sofie/
Facts & Patterns Consistencywith SOFIE (F. Suchanek et al.: WWW’09, N. Nakashole et al.: WebDB‘10) constraintstoconnectfacts, factcandidates, patterns functionaldependencies: relationproperties: pattern-factduality: spouse(X,Y): X Y, Y X asymmetry, transitivity, acyclicity, … type constraints, inclusiondependencies: occurs(p,x,y) expresses(p,R) type(x)=dom(R) type(y)=rng(R) R(x,y) spouse Person Person capitalOfCountry cityOfCountry occurs(p,x,y) R(x,y) type(x)=dom(R) type(y)=rng(R) expresses(p,R) domain-specificconstraints: name(-in-context)-to-entitymapping: bornInYear(x) + 10years ≤ graduatedInYear(x) means(n,e1) means(n,e2) … hasAdvisor(x,y) graduatedInYear(x,t) graduatedInYear(y,s) s < t www.mpi-inf.mpg.de/yago-naga/sofie/
EntityDisambiguationRevisited occurs (“divorcedfrom“, Madonna, Guy Ritchie) expresses(“divorcedfrom“, wasMarriedTo) wasMarriedTo (Madonna, Guy Ritchie) entity level actuallyis: occurs (“divorcedfrom“, “Madonna“, “Guy Ritchie“) means(“Madonna“, Madonna Louise Ciccone) expresses(“divorcedfrom“, wasMarriedTo) wasMarriedTo (Madonna Louise Ciccone, Guy Ritchie) [0.7] word/phrase level occurs (“divorcedfrom“, “Madonna“, “Guy Ritchie“) means(“Madonna“, Madonna (Edvard Munch)) expresses(“divorcedfrom“, wasMarriedTo) wasMarriedTo (Madonna (Edvard Munch), Guy Ritchie) [0.3] • usecontext-similarityasdisambiguationprior • setclauseweightsaccordingly • reduced to normal case
Experimental Results • SOFIE (F. Suchanek et al.: WWW’09) • input: biographies of 400 US senators, 3500 HTML files • output:birth/deathdate&place, politicianOf (state) • run-time: 7 h parsing, 6 h hypotheses, 2 h Max-Sat • precision: 90-95 % (exceptfordeathplace) • recall: ca. 750 extractedfacts (300 politicianOffacts) • PROSPERA (N. Nakashole et al.: WebDB‘10): • input: 87 000 Wikipediaarticles and Web homepages of scientists • output:hasAdvisor, graduatedAt, hasCollaborator, • facultyAt, wonAward • run-time: 1 h total (largelyparallelized) • precision: 85-95 % • recall: ca. 4000 extractedfacts (400 hasAdvisorfacts) Nowrunningexperiments on ClueWeb‘09 corpus (500 Mio. English Web pages) withHadoopcluster of 10x16 cores and 10x48 GB
Outline What and Why Automatic KB Construction Growing & Maintainingthe KB Temporal Knowledge Wrap-up ...
Temporal Knowledge Whichfactsforgivenrelations hold atwhattime pointorduringwhichtime intervals? marriedTo (Madonna, Guy) [ 22Dec2000, Dec2008 ] capitalOf (Berlin, Germany) [ 1990, now ] capitalOf (Bonn, Germany) [ 1949, 1989 ] hasWonPrize (JimGray, TuringAward) [ 1998 ] graduatedAt (HectorGarcia-Molina, Stanford) [ 1979 ] graduatedAt (SusanDavidson, Princeton) [ Oct 1982 ] hasAdvisor (SusanDavidson, HectorGarcia-Molina) [ Oct 1982, forever ] Howcanwequery & reasonon entity-relationship facts in a “time-travel“manner - withuncertain/incomplete KB ? US presidentwhenBarackObama was born? students of Hector Garcia-Molina while he was at Princeton?
French Marriage Problem JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC facts in KB 1: 2: 3: married (Hillary, Bill) married (Carla, Nicolas) married (Angelina, Brad) newfactcandidates: 4: 5: 6: 7: 8: married (Cecilia, Nicolas) married (Carla, Benjamin) married (Carla, Mick) divorced (Madonna, Guy) domPartner (Angelina, Brad) validFrom (2, 2008) validFrom (4, 1996) validUntil(4, 2007) validFrom (5, 2010) validFrom (6, 2006) validFrom(7, 2008)
Challenge: Temporal Knowledge • consistencyconstraintsarepotentiallyhelpful: • functionaldependencies: husband, time wife • inclusiondependencies: marriedPerson adultPerson • age/time/genderrestrictions: birthdate + < marriage < divorce forall people in Wikipedia(300 000) gatherall spouses, incl. divorced & widowed, andcorrespondingtime periods! >95% accuracy, >95% coverage, in onenight recall: gather temporal scopesforbasefacts precision: reason on mutual consistency
(Even More Difficult) ImplicitDating explicit datesvs. implicitdatesrelative tootherdates
(Even More Difficult) Relative Dating vaguedates relative dates narrative text relative order
Framework for T-Fact Extraction (Theobald et al.: MUD’10, Wang et al.: EDBT’10; Zhang et al.: WebDB‘08) represent temporal scopes of facts in the presence of incompleteness and uncertainty • 2) gather & filter candidates for t-facts: • extract base facts R(e1, e2) first; then • focus on sentences with e1, e2 and date or temporal phrase 3) aggregate & reconcile evidence from observations 4) reason on joint constraints about facts and time scopes
1) Representing T-Fact Evidence event-style andstate-style facts meta-factstocapture temporal scopes • 1:married(Madonna, Sean), 2: married(Madonna, Guy), • validSince (1, 16-Aug-1985), validUntil (1, 14-Sep-1989), • validSince (2, 22-Dec-2000), validUntil (2, 15-Dec-2008) • 3: wonAward(Sean, AcademyAwardForBestActor) • validOn (3, 29-Feb-2004) different resolutions, laterrefinement µ=1987 σ2=1 After 4 years of happy marriage, Madonna and Sean got divorced in September 1989. • 1:married(Madonna, Sean), • earliestSince (1, 1-Jan-1985), latestSince (1, 31-Dec-1985), • earliestUntil (1, 1-Sep-1989), latestUntil (1, 30-Sep-1989) 1984 1987 1990 uncertain & inconsistentevidence confidencedistribution 0.7 0.4 0.1 1989 1990 1984 1985
2) Gather & Filter T-Fact Candidates Choice ofsources: news-stylebiography-style date in headermanydates in text relative tempexpr‘sexplicitdates, narrative simple languageelaboratedlanguage manypronounspronounsformainentity Naive approach: usedeep NLP (dependencyparser) on everysentence thenuseclassifier (orstructured-outputlearner) to detectt-facts too expensive Bruni met recently divorced president Sarkozy in November 2007 at a dinner party. A romance is said to have started a few weeks ago between her and Biolay. She has said she is easily "bored with monogamy“ …
2) Gather & Filter: Multi-Stage Approach stage 1: sentenceswith e1 and e2 from R • matchnounphrasesagainst YAGO meansrelation • usedisambiguationpriorforentitymentions stage 2: sentencesthatcontain a temporal expression • use TARSQI tool to extract relative t-expressions and • mapthem to absolute datesordurations stage 3: sentenceswherethe t-expression refers to R(e1,e2) • rundependencyparser: • check shortestpathconnecting e1, e2, verb, t-expr Jim married Sue, but later left her and began an affair with Jane in 2005. • alternatively, consideronlysentenceswith • twonoungroups & shortsurfacedistances of e1, e2, t-expr
3) Aggregate & Reconcile T-Fact Evidence Ideal input: Madonna and Sean weremarriedfrom 16-Aug-85 until 12-Sep-89. Madonna and Sean married on August 16, 1985. Madonna and Sean gotdivorced in September 1989. Impreciseinput: Madonna and Sean weremarriedfrom 1985 through 1989. Madonna and Sean weremarriedfouryears in thelatenineties. Madonna and Sean gotdivorced in fall 1989. Noisyinput: Madonna and Sean plan theirweddingin summer 1985. Madonna and Sean just returnedfromtheirhoneymoon(in Jan 1986). Madonna and Sean will bedivorcedbythethe end oftheyear (1989). The marriageof Madonna and Sean will not survivethisyear (1987). evidence time
3) Aggregate & Reconcile T-Fact Evidence Real input: … Madonna and Sean werechasedduringtheirhoneymoon … (Jan 19, 1986) Madonna and her husband Sean openedtheexhibition … (March 7, 1986) Madonna and her husband Sean wereseenat … (April 1, 1986) Madonna and Sean metothercouplesat … (June 22, 1986) Madonna and Sean plan tohavechildren … (July 4, 1986) Madonna and Sean wouldconsideradopting a child … (July 14, 1986) Sean andhiswife Madonna purchaseanothercastle in … (November 5, 1986) ... Madonna and Sean thinkaboutgettingdivorced … (April 21, 1989) The marriageof Madonna and Sean is in deepcrisis … (May 11, 1989) … evidence time
3) Aggregate & Reconcile T-Fact Evidence Real input: … Madonna and Sean werechasedduringtheirhoneymoon … (Jan 19, 1986) Madonna and her husband Sean openedtheexhibition … (March 7, 1986) Madonna and her husband Sean wereseenat … (April 1, 1986) Madonna and Sean metothercouplesat … (June 22, 1986) Madonna and Sean plan tohavechildren … (July 4, 1986) Madonna and Sean wouldconsideradopting a child … (July 14, 1986) Sean andhiswife Madonna purchaseanothercastle in … (November 5, 1986) ... Madonna and Sean thinkaboutgettingdivorced … (April 21, 1989) The marriageof Madonna and Sean is in deepcrisis … (May 11, 1989) … evidence …..……..… time
3) Aggregate & Reconcile: Solution • Classiferfor t-factobservations: begin vs. during vs. end • Build separate histogramforeachclass (andeach t-fact) • Combine histograms & derivehigh-confidence time scope evidence time eventhistogram (begin) statehistogram (during) eventhistogram (end)
4) Joint Reasoning on Facts and T-Facts Combine & reconcile t-scopesacross different facts constraint: marriedTo (m) is an injectivefunctionatanygivenpoint X, Y, Z, T1, T2: m(X,Y) m(X,Z) validTime(m(X,Y),T1) validTime(m(X,Z),T2) overlaps(T1, T2) after grounding: m(Ca,Nic) m(Ce,Nic) false m(Carla, Nicolas) m(Cecilia, Nicolas) overlaps ([2008,2010], [1996,2007]) m(Carla, Nicolas) m(Carla, Benjamin) overlaps ([2008,2010], [2009,2011]) m(Ca,Nic) m(Ca,Ben) true
4) Joint Reasoning on Facts and T-Facts m(Ca, Mi) m(Ca, Ben) m(Ca, Nic) m(Ce, Nic) m(Ce, Mi) time Conflictgraph: m(Ca, Mi) [2004,2008] m(Ca, Ben) [2009,2011] Find maximal independentset: subsetofnodes w/o adjacentpairs with (evidence-) weightednodes m(Ce, Nic) [1996,2007] m(Ca, Nic) [2008,2010] m(Ce, Mi) [1998,2005]
4) Joint Reasoning on Facts and T-Facts m(Ca, Mi) m(Ca, Ben) m(Ca, Nic) m(Ce, Nic) m(Ce, Mi) time Conflictgraph: m(Ca, Mi) [2004,2008] m(Ca, Ben) [2009,2011] 30 10 Find maximal independentset: subsetofnodes w/o adjacentpairs with (evidence-) weightednodes m(Ce, Nic) [1996,2007] m(Ca, Nic) [2008,2010] 100 80 m(Ce, Mi) [1998,2005] 20
4) Joint Reasoning on Facts and T-Facts alternative approach: split t-scopesandreason on consistencyof t-factpartitions m(Ca, Mi) m(Ca, Ben) m(Ca, Nic) m(Ce, Nic) m(Ce, Mi) time
Preliminary Results • automaticextractionof t-factsaboutfootball/soccer • fromWikipediaandnewsarticles • queryansweringbyreasoning on t-facts playsForTeam(X,Z)@T1 playsForTeam(Y,Z)@T2 overlaps (T1,T2) teammates(X,Y)
Outline What and Why Automatic KB Construction Growing & Maintainingthe KB Temporal Knowledge Wrap-up ...
KB Building: Where Do We Stand? KnowledgeBases on Entities & Classes • strong successstory, someproblemsleft: • large taxonomiesofclasseswith individual entities • longtailcallsfornewmethods • entitydisambiguationremainsgrandchallenge Relationships • goodprogress, but manychallengesleft: • recall & precisionbypatterns & reasoning • efficiency & scalability • soft rules, hardconstraints, richerlogics, … • open-domaindiscoveryofnewrelationtypes Temporal Knowledge • widelyopen (fertile) researchground: • uncertain / incomplete temporal scopesoffacts • jointreasoning on base-facts and time-scopes
Overall Take-Home Historicopportunity: reviveCycvision, makeit real & large-scale ! KB asenablerofmacroscopic„machinereading“ challenging & risky, but highpay-off Explore & exploitsynergiesbetween semantic, statistical, & social Web methods: statisticalevidence + logicalconsistency ! • Manyinterestingresearchtopicsfor CS (+ CoLi): • efficiency & scalability • constraints & reasoningon uncertaindata • NLP fortemporalstatements • statisticalrankingforsemanticsearch • knowledge-baselife-cycle: growth & maintenance ...