From Information to Knowledge. Harvesting Entities and Relationships From Web Sources. Gerhard Weikum Max Planck Institute for Informatics http://www.mpi-inf.mpg.de/~weikum/. Martin Theobald Max Planck Institute for Informatics http://www.mpi-inf.mpg.de/~mtb/.
Goal: Turn Web into Knowledge Base
Source: DB & IR methods for knowledge discovery. Communications of the ACM 52(4), 2009
• comprehensive DB of human knowledge
• everything that Wikipedia knows
• everything machine-readable
• capturing entities, classes, relationships
Approach: Harvesting Facts from Web
Politician | Position
  Angela Merkel | Chancellor, Germany
  Karl-Theodor zu Guttenberg | Minister of Defense, Germany
  Christoph Hartmann | Minister of Economy, Saarland
  …
Actor | Award
  Christoph Waltz | Oscar
  Sandra Bullock | Oscar
  Sandra Bullock | Golden Raspberry
  …
Politician | Political Party
  Angela Merkel | CDU
  Karl-Theodor zu Guttenberg | CDU
  Christoph Hartmann | FDP
  …
Company | CEO
  Google | Eric Schmidt
  …
Movie | ReportedRevenue
  Avatar | $ 2,718,444,933
  The Reader | $ 108,709,522
  …
PoliticalParty | Spokesperson
  CDU | Philipp Wachholz
  Die Grünen | Claudia Roth
  …
Company | AcquiredCompany
  Google | YouTube
  Yahoo | Overture
  Facebook | FriendFeed
  Software AG | IDS Scheer
  …
Cyc, IWP, ReadTheWeb, TextRunner, YAGO-NAGA
Knowledge as Enabling Technology
• entity recognition & disambiguation
• understanding natural language & speech
• knowledge services & reasoning for semantic apps (e.g. deep QA)
• semantic search: precise answers to advanced queries (by scientists, students, journalists, analysts, etc.)
Example queries:
  US president when Barack Obama was born?
  Indy 500 winners who are still alive?
  Politicians who are also scientists?
  Relationship between Angela Merkel, Jim Gray, Dalai Lama?
  Enzymes that inhibit HIV?
  Influenza drugs for teens with high blood pressure?
  ...
Knowledge Search (1) Who was US president when Barack Obama was born? http://www.wolframalpha.com
Knowledge Search (1) Who was mayor of Indianapolis when Barack Obama was born? not enough facts in KB ! http://www.wolframalpha.com
Knowledge Search (2) Indy500 winners? http://www.google.com/squared/
Knowledge Search (2) Indy500 winners from Europe? no types no inference ! http://www.google.com/squared/
Related Work Yago-Naga Text2Onto EntityRank Cazoodle Powerset ReadTheWeb Avatar System T Hakia Cyc information extraction ontologies UIMA Kylin KOG WebTables (Semantic Web) (Statistical Web) kosmix KnowItAll TextRunner WolframAlpha SWSE StatSnowball EntityCube sig.ma communities DBpedia (Social Web) Cimple DBlife PSOX TrueKnowledge GoogleSquared Freebase Answers START WorldWideTables Cyc IWP ReadTheWeb TextRunner YAGO-NAGA
Outline What and Why Framework Entities and Classes Relationships Temporal Knowledge Wrap-up ...
Framework: Types of Knowledge
• facts / assertions: bornIn (JohnDillinger, Indianapolis), hasWon (JimGray, TuringAward), …
• taxonomic: instanceOf (JohnDillinger, bankRobbers), subclassOf (bankRobbers, criminals), …
• lexical / terminology: means (“Big Apple”, NewYorkCity), means (“Big Mike”, MichaelStonebraker), means (“MS”, Microsoft), means (“MS”, MultipleSclerosis), …
• common-sense properties:
  apples are green, red, juicy, sweet, sour …, but not fast, smart …
  balls are round, smooth, slippery …, but not square, funny …
• common-sense axioms:
  ∀x: human(x) ⇒ male(x) ∨ female(x)
  ∀x: (male(x) ⇒ ¬female(x)) ∧ (female(x) ⇒ ¬male(x))
  ∀x: animal(x) ⇒ (hasLegs(x) ⇒ isEven(numberOfLegs(x)))
  …
• procedural: how to fix/install/prepare/remove …
• epistemic / beliefs: believes (Ptolemy, shape(Earth, disc)), believes (Copernicus, shape(Earth, sphere)), …
Framework: Information Extraction (IE)
Example text: "Surajit obtained his PhD in CS from Stanford University under the supervision of Prof. Jeff Ullman. He later joined HP and worked closely with Umesh Dayal …"
Source-centric IE (one source; priorities: 1) recall! 2) precision):
  instanceOf (Surajit, scientist)
  inField (Surajit, computer science)
  hasAdvisor (Surajit, Jeff Ullman)
  almaMater (Surajit, Stanford U)
  workedFor (Surajit, HP)
  friendOf (Surajit, Umesh Dayal)
  …
Yield-centric harvesting (many sources; priorities: 1) precision! 2) recall; near-human quality!):
  hasAdvisor: Student | Advisor
    Surajit Chaudhuri | Jeffrey Ullman
    Alon Halevy | Jeffrey Ullman
    Jim Gray | Mike Harrison
    …
  almaMater: Student | University
    Surajit Chaudhuri | Stanford U
    Alon Halevy | Stanford U
    Jim Gray | UC Berkeley
    …
Framework: Knowledge Representation
• RDF (Resource Description Framework, W3C): subject-property-object (SPO) triples, binary relations; structure, but no (prescriptive) schema
• Relations, frames
• Description logics: OWL, DL-lite
• Higher-order logics, epistemic logics
facts (RDF triples):
  1: (JimGray, hasAdvisor, MikeHarrison)
  2: (SurajitChaudhuri, hasAdvisor, JeffUllman)
  3: (Madonna, marriedTo, GuyRitchie)
  4: (NicolasSarkozy, marriedTo, CarlaBruni)
facts about facts:
  5: (1, inYear, 1968)
  6: (2, inYear, 2006)
  7: (3, validFrom, 22-Dec-2000)
  8: (3, validUntil, Nov-2008)
  9: (4, validFrom, 2-Feb-2008)
  10: (2, source, SigmodRecord)
Temporal & provenance annotations can refer to reified facts via fact identifiers (approx. equiv. to RDF quadruples: “Color” Sub Prop Obj)
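The reified-fact idea on this slide can be sketched in a few lines of Python. This is only an illustration: the `Fact` tuple and the `about` helper are hypothetical names, not part of RDF or of any library, and the fact identifiers 1 to 10 are taken from the slide.

```python
from typing import NamedTuple, Union


class Fact(NamedTuple):
    fact_id: int
    subject: Union[str, int]  # an entity name, or the id of another fact
    prop: str
    obj: str


facts = [
    Fact(1, "JimGray", "hasAdvisor", "MikeHarrison"),
    Fact(2, "SurajitChaudhuri", "hasAdvisor", "JeffUllman"),
    Fact(3, "Madonna", "marriedTo", "GuyRitchie"),
    Fact(4, "NicolasSarkozy", "marriedTo", "CarlaBruni"),
    # Facts about facts: the subject is the id of a base fact.
    Fact(5, 1, "inYear", "1968"),
    Fact(6, 2, "inYear", "2006"),
    Fact(7, 3, "validFrom", "22-Dec-2000"),
    Fact(8, 3, "validUntil", "Nov-2008"),
    Fact(9, 4, "validFrom", "2-Feb-2008"),
    Fact(10, 2, "source", "SigmodRecord"),
]


def about(fact_id, all_facts):
    """Collect the annotations (property -> value) attached to one base fact."""
    return {f.prop: f.obj for f in all_facts if f.subject == fact_id}
```

Calling `about(3, facts)` gathers the validity interval of the Madonna/GuyRitchie fact; a quadruple store would hold the same information with the fact identifier as a fourth column.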
KB‘s: Example YAGO (Suchanek et al.: WWW‘07)
2 million entities, 20 million facts, 40 million RDF triples (entity1-relation-entity2, subject-predicate-object); accuracy 95%
[Figure: excerpt of the YAGO graph. Classes (Entity, Organization, Person, Location, Country, State, City, Scientist, Politician, Biologist, Physicist) are connected by subclass edges. Entities (Max_Planck, Erwin_Planck, Angela Merkel, Max_Planck Society, Kiel, Schleswig-Holstein, Germany, Nobel Prize) and dates (Apr 23, 1858; Oct 4, 1947; Oct 23, 1944) are connected by instanceOf, locatedIn, bornIn, bornOn, diedOn, hasWon, FatherOf and citizenOf edges. Names (“Max Planck”, “Max Karl Ernst Ludwig Planck”, “Angela Merkel”, “Angela Dorothea Merkel”) are attached by means edges, some weighted, e.g. means(0.9) and means(0.1).]
http://www.mpi-inf.mpg.de/yago-naga/
KB‘s: Example YAGO (F. Suchanek et al.: WWW‘07) http://www.mpi-inf.mpg.de/yago-naga/
KB‘s: Example DBpedia (Auer, Bizer, et al.: ISWC‘07)
• 3 million entities
• 1 billion facts (RDF triples)
• 1.5 million entities mapped to a hand-crafted taxonomy of 259 classes with 1200 properties
http://www.dbpedia.org
Entities & Classes
Which entity types (classes, unary predicates) are there?
  scientists, doctoral students, computer scientists, …
  female humans, male humans, married humans, …
Which subsumptions should hold (subclass/superclass, hyponym/hypernym, inclusion dependencies)?
  subclassOf (computer scientists, scientists),
  subclassOf (scientists, humans), …
Which individual entities belong to which classes?
  instanceOf (Surajit Chaudhuri, computer scientists),
  instanceOf (Barbara Liskov, computer scientists),
  instanceOf (Barbara Liskov, female humans), …
Which names denote which entities?
  means (“Lady Di”, Diana Spencer),
  means (“Diana Frances Mountbatten-Windsor”, Diana Spencer), …
  means (“Madonna”, Madonna Louise Ciccone),
  means (“Madonna”, Madonna (painting by Edvard Munch)), …
WordNet Thesaurus [Miller/Fellbaum 1998] 3 concepts / classes & their synonyms (synset‘s) http://wordnet.princeton.edu/
WordNet Thesaurus [Miller/Fellbaum 1998] subclasses (hyponyms) superclasses (hypernyms) http://wordnet.princeton.edu/
WordNet Thesaurus [Miller & Fellbaum 1998]
• > 100 000 classes and lexical relations
• can be cast into description logics or into a graph, with weights for relation strengths (derived from co-occurrence statistics)
• but: only few individual entities (instances of classes)
Example:
  scientist, man of science (a person with advanced knowledge)
    => cosmographer, cosmographist
    => biologist, life scientist
    => chemist
    => cognitive scientist
    => computer scientist
    ...
    => principal investigator, PI
    …
    HAS INSTANCE => Bacon, Roger Bacon
    …
http://wordnet.princeton.edu/
Mapping: Wikipedia → WordNet [Suchanek: WWW‘07, Ponzetto & Strube: AAAI‘07]
[Figure: the Wikipedia entity Jim Gray (computer specialist) with its categories (People Lost at Sea, American Computer Scientists, Computer Scientists by Nation, Database Researcher, Databases, Fellows of the ACM, Members of Learned Societies, Engineering Societies) and candidate WordNet classes (Missing Person; Sailor, Crewman; American; Computer Scientist; Scientist; Chemist; Artist; Database; ACM; Fellow (1), Comrade; Fellow (2), Colleague; Fellow (3) (of Society); Member (1), Fellow; Member (2), Extremity). Some instanceOf and subclassOf edges are drawn; most candidate links carry question marks.]
Techniques for resolving the open links:
• name similarity (edit dist., n-gram overlap)
• context similarity (word/phrase level)
• machine learning
Mapping: Wikipedia → WordNet [Suchanek: WWW‘07, Ponzetto & Strube: AAAI‘07]
Given: entity e in Wikipedia categories c1, …, ck
Wanted: instanceOf(e,c) and subclassOf(ci,c) for WN class c
Problem: vagueness & ambiguity of names c1, …, ck
Analyzing category names with a noun group parser (pre-modifier | head | post-modifier):
  American | Musicians | of Italian Descent
  American Folk | Music | of the 20th Century
  American Indy 500 | Drivers | on Pole Positions
Head word is key, should be in plural for instanceOf
Mapping Wikipedia Entities to WordNet Classes [Suchanek: WWW‘07, Ponzetto & Strube: AAAI‘07]
Given: entity e in Wikipedia categories c1, …, ck
Wanted: instanceOf(e,c) and subclassOf(ci,c) for WN class c
Problem: vagueness & ambiguity of names c1, …, ck
Heuristic method:
  for each ci do
    if head word w of category name ci is plural {
      1) match w against synsets of WordNet classes
      2) choose best fitting class c and set e ∈ c
      3) expand w by pre-modifier and set ci ⊆ w+ ⊆ c
    }
• tuned conservatively: high precision, reduced recall
• can also derive features this way and feed them into a supervised classifier
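A rough Python sketch of the heuristic above, under heavy simplifying assumptions: `TOY_CLASSES` is a hand-made stand-in for WordNet synsets, the noun group parser is just a split at the first preposition, the plural test is a naive suffix check, and "best fitting class" is simply the first match. All function names are hypothetical; the actual systems are far more careful.

```python
import re

# Toy stand-in for WordNet: class name -> synonyms (hypothetical data).
TOY_CLASSES = {
    "musician": {"musician", "instrumentalist", "player"},
    "driver": {"driver"},
    "scientist": {"scientist", "man of science"},
}

# A post-modifier is assumed to start at the first preposition.
POST_MODIFIER = re.compile(r"\s+(?:of|on|in|by|from|at|for)\s+", re.IGNORECASE)


def split_category(name):
    """Split a category name into (pre-modifier, head, post-modifier)."""
    noun_group = POST_MODIFIER.split(name, maxsplit=1)[0]
    words = noun_group.split()
    post = name[len(noun_group):].strip()
    return " ".join(words[:-1]), words[-1], post


def singular_if_plural(word):
    """Naive plural test: return the singular form, or None if not plural."""
    w = word.lower()
    if w.endswith("ies"):
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return None


def map_categories(entity, categories, classes=TOY_CLASSES):
    facts = []
    for cat in categories:
        pre, head, _post = split_category(cat)
        lemma = singular_if_plural(head)
        if lemma is None:
            continue  # conservative: non-plural heads are skipped
        matches = [c for c, synonyms in classes.items() if lemma in synonyms]
        if not matches:
            continue
        cls = matches[0]  # "best fit" is simply the first match here
        facts.append(("instanceOf", entity, cls))
        if pre:
            facts.append(("subclassOf", f"{pre} {lemma}".lower(), cls))
    return facts
```

For example, `map_categories("Jim Gray", ["American Computer Scientists", "Databases"])` yields an instanceOf fact for the class scientist and a subclassOf fact for "american computer scientist", while "Databases" is dropped because its head word matches no class in the toy dictionary.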
Learning More Mappings [Wu & Weld: WWW‘08]
Kylin Ontology Generator (KOG): learn classifier for subclassOf across Wikipedia & WordNet using
• YAGO as training data
• advanced ML methods (MLN‘s, SVM‘s)
• rich features from various sources:
  • category/class name similarity measures
  • category instances and their infobox templates: template names, attribute names (e.g. knownFor)
  • Wikipedia edit history: refinement of categories
  • Hearst patterns: C such as X, X and Y and other C‘s, …
  • other search-engine statistics: co-occurrence frequencies
> 3 million entities, > 1 million with infoboxes, > 500 000 categories
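As an illustration of the Hearst-pattern feature listed above, here is a minimal Python sketch of the single pattern "C such as X, Y and Z". The regular expressions and the function name `hearst_pairs` are assumptions made for this example; they are not taken from KOG, and real extractors use part-of-speech tagging or chunking rather than plain regexes.

```python
import re

# "C such as X, Y and Z": one class word, then a list of instances up to the
# end of the clause. Deliberately crude.
SUCH_AS = re.compile(r"\b(\w+) such as ([^.;]+)")
LIST_SEPARATOR = re.compile(r",\s*(?:and\s+)?|\s+and\s+")


def hearst_pairs(sentence):
    """Return (instance, class) candidates found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        cls = match.group(1)
        for instance in LIST_SEPARATOR.split(match.group(2)):
            instance = instance.strip()
            if instance:
                pairs.append((instance, cls))
    return pairs
```

Each extracted pair is only a candidate; in a setup like the one on the slide it would serve as one feature among many for the learned classifier.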
Goal: Comprehensive & Consistent!
[Figure: Wikipedia entities (Jim Gray (computer specialist), Jeffrey Ullman, Madonna (entertainer), Bob Dylan) shown with their categories (American Computer Scientists, Fellows of the ACM, Members of Learned Societies, Database Researcher, Databases, Telecomm. History, Knuth Prize Laureate, Princeton Alumni, U Michigan Alumni, American People by Occupation, Americans of Italian Descent, World Record Holders, People by Status, American Songwriters, Hall of Fame Inductees, Guitar Players, …), infobox attributes (Doctoral Students, Known For, Alma Mater, Notable Awards, Born, Years Active, Genres, Also Known As, Website) and candidate classes (Academic, Scientist, Fellow(1), Fellow(2), Computer, Data, Award Winner, Athlete, Artist, Musician, Singer, American, Italian, Bell Labs).]
Goal: Comprehensive & Consistent!
Clean up the mess:
• graph algorithms?
  • random walk with restart
  • dense subgraphs …
• statistical machine learning?
• logical consistency reasoning?
• gigantic schema integration?
  • ontology merging
[Figure: same graph of entities, categories, infobox attributes and candidate classes as on the preceding slide.]
Long Tail of Class Instances [Etzioni et al. 2004, Cohen et al. 2008, Mitchell et al. 2010]
State-of-the-art approach (e.g. SEAL):
• Start with seeds: a few class instances
• Find lists, tables, text snippets (“for example: …”), … that contain one or more seeds
• Extract candidates: noun phrases from vicinity
• Gather co-occurrence stats (seed & cand, cand & className pairs)
• Rank candidates:
  • point-wise mutual information, …
  • random walk (PR-style) on seed-cand graph
But:
• Precision drops for classes with sparse statistics (DB profs, …)
• Harvested items are names, not entities
• Canonicalization (de-duplication) unsolved
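The ranking step can be illustrated with point-wise mutual information, PMI(x, y) = log( P(x, y) / (P(x) P(y)) ). The Python sketch below is an assumption-laden toy: the function names and the counts in the usage note are invented for illustration and are not taken from SEAL or any other system, and the random-walk alternative is not shown.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """PMI from raw counts: log( (c_xy / N) / ((c_x / N) * (c_y / N)) )."""
    if count_xy == 0:
        return float("-inf")
    return math.log((count_xy * total) / (count_x * count_y))


def rank_candidates(seed, cooccurrence, frequency, total):
    """Rank all candidates by their PMI with one seed, best first."""
    scored = [
        (cand, pmi(cooccurrence.get((seed, cand), 0), frequency[seed], frequency[cand], total))
        for cand in frequency
        if cand != seed
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With hypothetical counts such as `frequency = {"Jim Gray": 100, "Mike Stonebraker": 50, "Madonna": 1000}`, 20 co-occurrences of the first two names, 1 co-occurrence of Jim Gray with Madonna and 10 000 snippets in total, the frequent but unrelated name ends up last. When counts are sparse the estimate becomes unreliable, which is the precision problem noted on the slide.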
Individual Entity Disambiguation
Names: “Penn”, “U Penn”, “Penn State”, “PSU”
Entities: Sean Penn; University of Pennsylvania; Pennsylvania State University; Pennsylvania (US State); Passenger Service Unit
Which name maps to which entity?
• ill-defined with zero context
• known as record linkage for names in record fields
• Wikipedia offers rich candidate mappings: disambiguation pages, re-directs, inter-wiki links, anchor texts of href links
Collective Entity Disambiguation [McCallum 2003, Doan 2005, Getoor 2006, Domingos 2007, Chakrabarti 2009, …]
• Consider a set of names {n1, n2, …} in the same context and sets of candidate entities E1 = {e11, e12, …}, E2 = {e21, e22, …}, …
• Define joint objective function (e.g. likelihood for prob. model) that rewards coherence of mappings ni → eij
• Solve optimization problem
Example:
  “Stuart Russell” → Stuart Russell (DJ) or Stuart Russell (computer scientist)?
  “Michael Jordan” → Michael Jordan (computer scientist) or Michael Jordan (NBA)?
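A brute-force Python sketch of such a joint objective, meant only to make the idea concrete. The scoring scheme (sum of per-name priors plus pairwise coherence), the function name `disambiguate`, and every number in the usage note are assumptions for illustration; the cited approaches use probabilistic models and far more scalable inference.

```python
from itertools import product


def disambiguate(candidates, prior, coherence):
    """Pick one entity per name so that priors plus pairwise coherence are maximal.

    candidates: name -> list of candidate entities
    prior:      (name, entity) -> score of that mapping in isolation
    coherence:  frozenset({entity_a, entity_b}) -> reward for choosing both
    """
    names = list(candidates)
    best, best_score = None, float("-inf")
    # Exhaustive search over all joint assignments; only feasible for tiny inputs.
    for combo in product(*(candidates[n] for n in names)):
        score = sum(prior.get((n, e), 0.0) for n, e in zip(names, combo))
        score += sum(
            coherence.get(frozenset((a, b)), 0.0)
            for i, a in enumerate(combo)
            for b in combo[i + 1:]
        )
        if score > best_score:
            best, best_score = dict(zip(names, combo)), score
    return best
```

If, hypothetically, "Michael Jordan" has a prior of 0.9 for the NBA player and 0.1 for the computer scientist, but the two computer scientists share a coherence reward of 2.0, the joint optimum maps both names to the computer scientists even though the basketball player wins when the name is considered alone.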
Problems and Challenges
• Wikipedia categories reloaded: comprehensive & consistent instanceOf and subClassOf across Wikipedia and WordNet (via consistency reasoning?)
• Long tail of entities beyond Wikipedia: domain-specific entity catalogs; discover new entities, detect new names for known entities
• Tags, tables, topics: tap on other sources such as Web 2.0, Web tables, directories, etc.
• Robust disambiguation: near-real-time mapping of names to entities with near-human quality
Relationships
Which instances (pairs of individual entities) are there for given binary relations with specific type signatures?
  hasAdvisor (JimGray, MikeHarrison)
  hasAdvisor (HectorGarcia-Molina, GioWiederhold)
  hasAdvisor (SusanDavidson, HectorGarcia-Molina)
  graduatedAt (JimGray, Berkeley)
  graduatedAt (HectorGarcia-Molina, Stanford)
  hasWonPrize (JimGray, TuringAward)
  bornOn (JohnLennon, 9Oct1940)
  diedOn (JohnLennon, 8Dec1980)
  marriedTo (JohnLennon, YokoOno)
Which additional & interesting relation types are there between given classes of entities?
  competedWith(x,y), nominatedForPrize(x,y), …
  divorcedFrom(x,y), affairWith(x,y), …
  assassinated(x,y), rescued(x,y), admired(x,y), …
Deterministic Pattern Matching [Kushmerick 97, Califf & Mooney 99, Gottlob 01, …]
• Regular expressions matching
• Wrapper induction (grammar learning for restricted regular languages)
• Well understood
French Marriage Problem
facts in KB:
  married (Hillary, Bill)
  married (Carla, Nicolas)
  married (Angelina, Brad)
new facts or fact candidates:
  married (Cecilia, Nicolas)
  married (Carla, Benjamin)
  married (Carla, Mick)
  married (Michelle, Barack)
  married (Yoko, John)
  married (Kate, Leonardo)
  married (Carla, Sofie)
  married (Larry, Google)
• for recall: pattern-based harvesting
• for precision: consistency reasoning
Pattern-Based Harvesting (Hearst 92, Brin 98, Agichtein 00, Etzioni 04, …)
Facts & fact candidates: (Hillary, Bill), (Carla, Nicolas), (Angelina, Brad), (Victoria, David), (Yoko, John), (Kate, Pete), (Carla, Benjamin), (Larry, Google), …
Patterns:
  X and her husband Y
  X and Y on their honeymoon
  X and Y and their children
  X has been dating with Y
  X loves Y
  …
• good for recall
• noisy, drifting
• not robust enough for high precision
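The fact-pattern bootstrapping loop can be sketched in Python as follows. Everything here is a simplifying assumption for illustration: a pattern is just the literal text between the two arguments, names are single capitalized words, and the sentences in the usage note are invented. The cited systems add statistical weighting and filtering on top of this basic loop.

```python
import re

NAME = r"([A-Z]\w+)"  # crude stand-in for a named-entity recognizer


def learn_patterns(facts, sentences):
    """Collect the text between X and Y wherever a known pair (X, Y) occurs."""
    patterns = set()
    for x, y in facts:
        for s in sentences:
            i, j = s.find(x), s.find(y)
            if i >= 0 and i + len(x) <= j:
                patterns.add(s[i + len(x):j])
    return patterns


def apply_patterns(patterns, sentences):
    """Find all name pairs connected by one of the learned patterns."""
    found = set()
    for middle in patterns:
        regex = re.compile(NAME + re.escape(middle) + NAME)
        for s in sentences:
            found.update(regex.findall(s))
    return found
```

Starting from the single seed (Hillary, Bill) and sentences such as "Hillary and her husband Bill arrived.", "Carla and her husband Nicolas waved.", "Carla loves Nicolas." and "Larry loves Google.", the first round learns the husband pattern and finds (Carla, Nicolas). A second round then learns "loves" from the new fact and harvests (Larry, Google), which shows the drift the slide warns about.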
Reasoning about Fact Candidates
Use consistency constraints to prune false candidates
ground atoms:
  spouse(Hillary,Bill), spouse(Carla,Nicolas), spouse(Cecilia,Nicolas), spouse(Carla,Ben), spouse(Carla,Mick), spouse(Carla,Sofie)
  f(Hillary), f(Carla), f(Cecilia), f(Sofie)
  m(Bill), m(Nicolas), m(Ben), m(Mick)
FOL rules (restricted):
  spouse(x,y) ∧ diff(y,z) ⇒ ¬spouse(x,z)
  spouse(x,y) ∧ diff(w,y) ⇒ ¬spouse(w,y)
  spouse(x,y) ⇒ f(x)
  spouse(x,y) ⇒ m(y)
  spouse(x,y) ⇒ (f(x) ∧ m(y)) ∨ (m(x) ∧ f(y))
Rules reveal inconsistencies
Find consistent subset(s) of atoms (“possible world(s)”, “the truth”)
• Rules can be weighted (e.g. by fraction of ground atoms that satisfy a rule)
• uncertain / probabilistic data
• compute prob. distr. of subset of atoms being the truth
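To make "find consistent subset(s) of atoms" concrete, the Python sketch below enumerates subsets of the spouse atoms and keeps the largest ones that satisfy the two functional rules and the two type rules from this slide. The exhaustive search and the function names are assumptions for illustration; realistic systems rely on weighted or probabilistic inference instead.

```python
from itertools import combinations

FEMALE = {"Hillary", "Carla", "Cecilia", "Sofie"}
MALE = {"Bill", "Nicolas", "Ben", "Mick"}


def consistent(atoms):
    """Check spouse(x,y) atoms against the type and functional rules."""
    spouse_of_x, spouse_of_y = {}, {}
    for x, y in atoms:
        if x not in FEMALE or y not in MALE:
            return False  # violates spouse(x,y) => f(x) or spouse(x,y) => m(y)
        if spouse_of_x.setdefault(x, y) != y:
            return False  # x would have two different spouses
        if spouse_of_y.setdefault(y, x) != x:
            return False  # y would have two different spouses
    return True


def largest_consistent_subsets(atoms):
    """Return every consistent subset of maximal size (brute force)."""
    atoms = sorted(atoms)
    for size in range(len(atoms), -1, -1):
        worlds = [set(c) for c in combinations(atoms, size) if consistent(c)]
        if worlds:
            return worlds
    return []
```

On the six spouse atoms above, this returns two worlds of size three, both containing spouse(Cecilia, Nicolas) and neither containing spouse(Carla, Nicolas). Counting atoms alone therefore does not recover the intended truth, which motivates the weights on rules and atoms mentioned in the bullets.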
Markov Logic Networks (MLN‘s) (M. Richardson / P. Domingos 2006)
Map logical constraints & fact candidates into probabilistic graph model: Markov Random Field (MRF)
Constraints:
  s(x,y) ∧ diff(y,z) ⇒ ¬s(x,z)
  s(x,y) ∧ diff(w,y) ⇒ ¬s(w,y)
  s(x,y) ⇒ f(x)
  s(x,y) ⇒ m(y)
  f(x) ⇒ ¬m(x)
  m(x) ⇒ ¬f(x)
Fact candidates: s(Carla,Nicolas), s(Cecilia,Nicolas), s(Carla,Ben), s(Carla,Sofie), …
Grounding: literal → Boolean variable; literal → binary RV
Ground clauses:
  ¬s(Ca,Nic) ∨ ¬s(Ce,Nic)
  ¬s(Ca,Nic) ∨ ¬s(Ca,Ben)
  ¬s(Ca,Nic) ∨ ¬s(Ca,So)
  ¬s(Ca,Ben) ∨ ¬s(Ca,So)
  ¬s(Ca,Nic) ∨ m(Nic)
  ¬s(Ce,Nic) ∨ m(Nic)
  ¬s(Ca,Ben) ∨ m(Ben)
  ¬s(Ca,So) ∨ m(So)
Markov Logic Networks (MLN‘s), continued (M. Richardson / P. Domingos 2006)
[Figure: MRF over the random variables s(Ca,Nic), s(Ce,Nic), s(Ca,Ben), s(Ca,So), m(Nic), m(Ben), m(So).]
• RVs are coupled by an MRF edge if they appear in the same clause
• MRF assumption: P[Xi | X1..Xn] = P[Xi | N(Xi)], where N(Xi) denotes the neighbors of Xi
• joint distribution has product form over all cliques
• Variety of algorithms for joint inference: Gibbs sampling, other MCMC, belief propagation, randomized MaxSat, …
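The product form can be made tangible with a tiny Python example that computes marginals by enumerating all possible worlds, using P(world) ∝ exp(sum of weights of satisfied ground clauses). The clause weights below are invented purely for illustration, only three of the atoms are used, and exhaustive enumeration replaces the sampling and propagation algorithms named above, which are what one would use at realistic scale.

```python
import math
from itertools import product

ATOMS = ["s(Ca,Nic)", "s(Ce,Nic)", "s(Ca,Ben)"]

# Each clause: (weight, literals); a literal is (atom, required truth value),
# and a clause is satisfied if at least one literal holds.
# All weights are hypothetical.
CLAUSES = [
    (2.0, [("s(Ca,Nic)", False), ("s(Ce,Nic)", False)]),  # Nicolas has one spouse
    (2.0, [("s(Ca,Nic)", False), ("s(Ca,Ben)", False)]),  # Carla has one spouse
    (1.5, [("s(Ca,Nic)", True)]),  # evidence strength per candidate
    (0.5, [("s(Ce,Nic)", True)]),
    (0.5, [("s(Ca,Ben)", True)]),
]


def world_weight(world):
    """Unnormalized probability: exp of the total weight of satisfied clauses."""
    total = sum(w for w, lits in CLAUSES if any(world[a] == v for a, v in lits))
    return math.exp(total)


def marginals():
    """Exact marginal P[atom = True] by enumerating all 2^n worlds."""
    worlds = [dict(zip(ATOMS, vals)) for vals in product([False, True], repeat=len(ATOMS))]
    z = sum(world_weight(w) for w in worlds)
    return {a: sum(world_weight(w) for w in worlds if w[a]) / z for a in ATOMS}
```

With these made-up weights, s(Ca,Nic) receives the highest marginal, and the two competing candidates receive equal, lower marginals because they enter the model symmetrically.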
Related Alternative Probabilistic Models
Constrained Conditional Models [D. Roth et al. 2007]
• log-linear classifiers with constraint-violation penalty
• mapped into Integer Linear Programs
Factor Graphs with Imperative Variable Coordination [A. McCallum et al. 2008]
• RV‘s share “factors” (joint feature functions)
• generalizes MRF, BN, CRF, …
• inference via advanced MCMC
• flexible coupling & constraining of RV‘s
software tools:
  alchemy.cs.washington.edu
  code.google.com/p/factorie/
  research.microsoft.com/en-us/um/cambridge/projects/infernet/
Reasoning for KB Growth: Direct Route (F. Suchanek et al.: WWW‘09)
facts in KB:
  married (Hillary, Bill)
  married (Carla, Nicolas)
  married (Angelina, Brad)
new fact candidates (true or false?):
  married (Cecilia, Nicolas)
  married (Carla, Benjamin)
  married (Carla, Mick)
  married (Carla, Sofie)
  married (Larry, Google)
patterns:
  X and her husband Y
  X and Y and their children
  X has been dating with Y
  X loves Y
Direct approach:
• facts are true; fact candidates & patterns are hypotheses
• grounded constraints become clauses with hypotheses as variables
• cast into Weighted Max-Sat with weights from pattern stats
• customized approximation algorithm
• unifies: fact candidate consistency, pattern goodness, entity disambiguation
www.mpi-inf.mpg.de/yago-naga/sofie/
Facts & Patterns Consistency (F. Suchanek et al.: WWW‘09)
constraints to connect facts, fact candidates, patterns
functional dependencies:
  spouse(X,Y): X → Y, Y → X
relation properties:
  asymmetry, transitivity, acyclicity, …
type constraints, inclusion dependencies:
  spouse ⊆ Person × Person
  capitalOfCountry ⊆ cityOfCountry
pattern-fact duality:
  occurs(p,x,y) ∧ expresses(p,R) ⇒ R(x,y)
  occurs(p,x,y) ∧ R(x,y) ⇒ expresses(p,R)
domain-specific constraints:
  bornInYear(x) + 10 years ≤ graduatedInYear(x)
  hasAdvisor(x,y) ∧ graduatedInYear(x,t) ∧ graduatedInYear(y,s) ⇒ s < t
name(-in-context)-to-entity mapping:
  means(n,e1) ∨ means(n,e2) ∨ …
www.mpi-inf.mpg.de/yago-naga/sofie/
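The two domain-specific constraints can be checked directly over a small set of facts. The Python sketch below is an assumption-based illustration: the dictionary layout, the function name `violations`, and the placeholder entities in the usage note are hypothetical, and SOFIE itself treats such constraints as weighted clauses rather than hard checks.

```python
def violations(born, graduated, advisor):
    """Return readable descriptions of violated domain-specific constraints.

    born:      person -> year of birth
    graduated: person -> year of graduation
    advisor:   student -> advisor
    """
    found = []
    # bornInYear(x) + 10 years <= graduatedInYear(x)
    for person, year in graduated.items():
        if person in born and born[person] + 10 > year:
            found.append(f"{person}: graduated less than 10 years after birth")
    # hasAdvisor(x,y), graduatedInYear(x,t), graduatedInYear(y,s) => s < t
    for student, adv in advisor.items():
        if student in graduated and adv in graduated:
            if not graduated[adv] < graduated[student]:
                found.append(f"{student}: advisor {adv} did not graduate earlier")
    return found
```

For instance, with placeholder facts `born = {"student_b": 1980}`, `graduated = {"advisor_a": 1975, "student_b": 1985}` and `advisor = {"student_b": "advisor_a"}`, only the birth-to-graduation constraint is reported as violated, flagging one of the two year facts for student_b as a likely extraction error.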