Download
slide1 n.
Skip this Video
Loading SlideShow in 5 Seconds..
In collaboration with Giorgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath, Ralf Schenkel, Fabian Suchane PowerPoint Presentation
Download Presentation
In collaboration with Giorgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath, Ralf Schenkel, Fabian Suchane

In collaboration with Giorgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath, Ralf Schenkel, Fabian Suchane

180 Vues Download Presentation
Télécharger la présentation

In collaboration with Giorgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath, Ralf Schenkel, Fabian Suchane

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. In collaboration with Giorgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath, Ralf Schenkel, Fabian Suchanek, Martin Theobald

  2. Vision Opportunity: Turn the Web (and Web 2.0 and Web 3.0 ...) into the world‘s most comprehensive knowledge base (semantic DB) Challenge: seize opportunity and make it happen! • Approach: combine and exploit synergies of • hand-crafted, high-quality knowledge sources •  Semantic Web • automatic knowledge extraction •  Statistical Web • social networks and human computing •  Social Web Gerhard Weikum ADBIS Oct 1, 2007

  3. Proof of Relevance Vannevar Bush: As We May Think, 1945. There is a growing mountain of research. … A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory. Gerhard Weikum ADBIS Oct 1, 2007

  4. Proof of Relevance Tim Berners-Lee: In the Semantic Web information is given well-defined meaning. Jim Gray: … system can answer questions about the text as precisely and quickly as a human expert. Brewster Kahle: The goal of universal access to our cultural heritage is within our grasp. Jimmy Wales: Our big-picture vision is to share knowledge with all of humanity. Al Gore: The future will be better tomorrow. Gerhard Weikum ADBIS Oct 1, 2007

  5. Proof of Relevance ? To know that we know what we know, and that we do not know what we do not know, that is true knowledge.  You cannot open a book without learning something. A journey of a thousand miles begins with a single step. Confucius, 551-479 BC Ignorance is less remote from the truth than prejudice. When science, art, literature, and philosophy are simply the manifestation of personality, they can make a man's name live for thousands of years. Sentences are like sharp nails, which force truth upon our memories. Denis Diderot, 1713-1784 Gerhard Weikum ADBIS Oct 1, 2007

  6. Why Google and Wikipedia Are Not Enough Turn the Web, Web2.0, and Web3.0 into the world‘s most comprehensive knowledge base („semantic DB/graph“) ! Answer „knowledge queries“ such as: proteins that inhibit both protease and some other enzyme neutron stars with Xray bursts > 1040 erg s-1 & black holes in 10‘‘ differences in Rembetiko music from Greece and from Turkey connection between Thomas Mann and Goethe market impact of Web2.0 technology in December 2006 sympathy or antipathy for Germany from May to August 2006 Nobel laureate who survived both world wars and his children drama with three women making a prophecy to a British nobleman that he will become king Gerhard Weikum ADBIS Oct 1, 2007

  7. Outline Introduction: Search for Knowledge  • Harvesting Knowledge • Leibniz Approach • Planck Approach • Darwin Approach • Conclusion • Gerhard Weikum ADBIS Oct 1, 2007

  8. Three Roads to Knowledge Leibniz Approach: Handcrafted High-Quality Knowledge Sources (Semantic Web) Planck Approach: Large-scale Information Extraction & Harvesting (Statistical Web) Darwin Approach: Social Wisdom from Web 2.0 Communities (Social Web) Gerhard Weikum ADBIS Oct 1, 2007

  9. Leibniz Approach (Semantic Web) • Handcrafted High-Quality Knowledge: • Ontologies and other Lexical Sources • Build on Rigorous Knowledge Atoms • („Characteristica Universalis“) Gottfried Wilhelm Leibniz (1646 - 1716) Gerhard Weikum ADBIS Oct 1, 2007

  10. High-Quality Knowledge Sources General-purpose ontologies for Semantic Web: SUMO, Cyc, etc. Gerhard Weikum ADBIS Oct 1, 2007

  11. High-Quality Knowledge Sources General-purpose thesauri and concept networks: WordNet family woman, adult female – (an adult female person) => amazon, virago – (a large strong and aggressive woman) => donna -- (an Italian woman of rank) => geisha, geisha girl -- (...) => lady (a polite name for any woman) ... => wife – (a married woman, a man‘s partner in marriage) => witch – (a being, usually female, imagined to have special powers derived from the devil) Gerhard Weikum ADBIS Oct 1, 2007

  12. High-Quality Knowledge Sources General-purpose thesauri and concept networks: WordNet family • 200 000 concepts and relations; • can be cast into • description logics or • graph, with weights for relation strengths • (derived from co-occurrence statistics) enzyme -- (any of several complex proteins that are produced by cells and act as catalysts in specific biochemical reactions) => protein -- (any of a large group of nitrogenous organic compounds that are essential constituents of living cells; ...) => macromolecule, supermolecule ... => organic compound -- (any compound of carbon and another element or a radical) ... => catalyst, accelerator -- ((chemistry) a substance that initiates or accelerates a chemical reaction without itself being affected) => activator -- ((biology) any agency bringing about activation; ...) Gerhard Weikum ADBIS Oct 1, 2007

  13. High-Quality Knowledge Sources Domain ontologies (UMLS, GeneOntology, etc.) • 1 Mio. biomedical concepts, 135 categories, • 54 relationships (e.g. virus causes (disease | symptom) ) Gerhard Weikum ADBIS Oct 1, 2007

  14. High-Quality Knowledge Sources Wikipedia and other lexical sources • 2 Mio. articles • 40 Mio. hyperlinks • many 1000‘s of categories and lists • more than 100 languages • growing very fast Gerhard Weikum ADBIS Oct 1, 2007

  15. Exploit Hand-Crafted Knowledge Wikipedia, WordNet, and other lexical sources {{Infobox_Scientist | name = Max Planck | birth_date = [[April 23]], [[1858]] | birth_place = [[Kiel]], [[Germany]] | death_date = [[October 4]], [[1947]] | death_place = [[Göttingen]], [[Germany]] | residence = [[Germany]] | nationality = [[Germany|German]] | field = [[Physicist]] | work_institution = [[University of Kiel]]</br> [[Humboldt-Universität zu Berlin]]</br> [[Georg-August-Universität Göttingen]] | alma_mater = [[Ludwig-Maximilians-Universität München]] | doctoral_advisor = [[Philipp von Jolly]] | doctoral_students = [[Gustav Ludwig Hertz]]</br> … | known_for = [[Planck's constant]], [[Quantum mechanics|quantum theory]] | prizes = [[Nobel Prize in Physics]] (1918) … Gerhard Weikum ADBIS Oct 1, 2007

  16. YAGO: Yet Another Great Ontology[F. Suchanek, G. Kasneci, G. Weikum: WWW‘07] • Turn Wikipedia into explicit knowledge base (semantic DB) • Exploit hand-crafted categoriesand templates • Represent facts as explicit knowledge triples: • relation (entity1, entity2) • (in FOL, compatible with RDF, OWL-lite, XML, etc.) • Map (and disambiguate) relations into WordNet concept DAG relation entity1 entity2 Examples: bornIn isInstanceOf City Max_Planck Kiel Kiel Gerhard Weikum ADBIS Oct 1, 2007

  17. YAGO Knowledge Representation Knowledge Base # Facts KnowItAll 30 000 SUMO 60 000 WordNet 200 000 OpenCyc 300 000 Cyc 5 000 000 YAGO 6 000 000 Accuracy  97% Entity subclass subclass Person concepts Location subclass Scientist subclass subclass subclass subclass City Country Biologist Physicist instanceOf instanceOf Erwin_Planck Nobel Prize bornIn Kiel hasWon FatherOf individuals diedOn bornOn October 4, 1947 Max_Planck April 23, 1858 means means means “Max Karl Ernst Ludwig Planck” “Dr. Planck” “Max Planck” words Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/ Gerhard Weikum ADBIS Oct 1, 2007

  18. YAGO Disambiguation & Uncertainty capture confidence value for each fact Entity subclass 1.0 subclass 1.0 Person Location subclass 1.0 subclass 1.0 1.0 subclass subclass Mythological Figure City Country Celebrity 0.7 instanceOf 0.8 instanceOf 0.9 instanceOf 1.0 0.4 instanceOf locatedIn Paris(Myth.) Paris(France) France Paris Hilton 0.95 means 0.7 means 0.1 means means 0.9 0.2 means 0.05 “Paris” “La Grande Nation” “France” additional harvesting of relations from natural-language texts by info-extraction tools Gerhard Weikum ADBIS Oct 1, 2007

  19. NAGA: Graph IR on YAGO [G. Kasneci et al.: WWW‘07] Graph-based search on YAGO-style knowledge bases with built-in ranking based on statistical language model discovery queries hasWon diedOn Nobel prize $a $x bornIn isa Kiel $x scientist > hasSon diedOn $y $b connectedness queries isa * German novelist Thomas Mann Goethe queries with regular expressions hasFirstName | hasLastName isa Ling $x scientist (coAuthor | advisor)* worksFor locatedIn* $y Zhejiang Beng Chin Ooi Gerhard Weikum ADBIS Oct 1, 2007

  20. NAGA: Searching Knowledge q: Fisher isa scientist Fisher isa $x $@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = alumnus_109165182 $@Fisher = Irving_Fisher $@scientist = scientist_109871938 $X = social_scientist_109927304 $@Fisher = James_Fisher $@scientist = scientist_10981938 $X = ornithologist_109711173 $@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = theorist_110008610 $@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = colleague_109301221 $@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = organism_100003226 … mathematician_109635652   —subClassOf—>   scientist_109871938 Alumni_of_Gonville_and_Caius_College,_Cambridge   —subClassOf—>   alumnus_109165182 "Fisher"   —familyNameOf—>   Ronald_Fisher Ronald_Fisher   —type—>   Alumni_of_Gonville_and_Caius_College,_Cambridge Ronald_Fisher   —type—>   20th_century_mathematicians "scientist"   —means—>   scientist_109871938 Gerhard Weikum ADBIS Oct 1, 2007

  21. NAGA: Searching & Ranking Knowledge q: Fisher isa scientist Fisher isa $x $@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = mathematician_109635652 $@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = statistician_109958989 $@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = president_109787431 $@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = geneticist_109475749 $@Fisher = Ronald_Fisher $@scientist = scientist_109871938 $X = scientist_109871938 … Score: 7.184462521168058E-13 mathematician_109635652   —subClassOf—>   scientist_109871938 "Fisher"   —familyNameOf—>   Ronald_FisherRonald_Fisher   —type—>   20th_century_mathematicians "scientist"   —means—>   scientist_109871938 20th_century_mathematicians   —subClassOf—>   mathematician_109635652 Online access at http://www.mpi-inf.mpg.de/~kasneci/naga/ Gerhard Weikum ADBIS Oct 1, 2007

  22. Ranking Factors • Confidence: • Prefer results that are likely to be correct • Certainty of IE • Authenticity and Authority of Sources bornIn (Max Planck, Kiel) from „Max Planck was born in Kiel“ (Wikipedia) livesIn (Elvis Presley, Mars) from „They believe Elvis hides on Mars“ (Martian Bloggeria) • Informativeness: • Prefer results that are likely important • May prefer results that are likely new to user • Frequency in answer • Frequency in corpus (e.g. Web) • Frequency in query log q: isa (Einstein, $y) isa (Einstein, scientist) isa (Einstein, vegetarian) q: isa ($x, vegetarian) isa (Einstein, vegetarian) isa (Al Nobody, vegetarian) • Compactness: • Prefer results that are • tightly connected • Size of answer graph vegetarian Tom Cruise isa isa bornIn Einstein won 1962 won Nobel Prize Bohr diedIn Gerhard Weikum ADBIS Oct 1, 2007

  23. Summary of Leibniz Approach Hand-crafted knowledge sources are great assets, but expensive, partial, and isolated Great mileage even from informal & semiformal sources Connecting & reconciling different sources gives added value (and sometimes is not even that hard) • Challenge: • Develop methods for comprehensive, highly accurate • mappings across many knowledge sources • Cross-lingual • Cross-temporal • Scalable Gerhard Weikum ADBIS Oct 1, 2007

  24. Planck Approach (Statistical Web) • Information Extraction & Harvesting: • Gather Entities, Relations, Facts • Live with Uncertainty Max Planck (1858 - 1947) Gerhard Weikum ADBIS Oct 1, 2007

  25. Information Extraction (IE): Text to Records Person BirthDate BirthPlace ... Max Planck 4/23, 1858 Kiel Albert Einstein 3/14, 1879 Ulm Mahatma Gandhi 10/2, 1869 Porbandar Person ScientificResult Max Planck Quantum Theory Person Collaborator Max Planck Albert Einstein Max Planck Niels Bohr Constant Value Dimension Planck‘s constant 6.2261023 Js combine NLP, pattern matching, lexicons, statistical learning Gerhard Weikum ADBIS Oct 1, 2007

  26. IE Technology: Rules, Patterns, Learning <lecture> NP NP NN IN DT NP VB IN DT ADJ NN PP NP IN CD • For natural-language text and for heterogeneous sources: • NLP techniques (parser, PoS tagging) for tokenization • identify patterns (e.g. regular expressions) as features • train statistical learners for segmentation and labeling • use learned model to automatically tag newly seen input Training data: The WWW conference takes place in Banff in Canada. Today‘s keynote speaker is Dr. Berners-Lee from W3C. The panel in Edinburgh, chaired by Ron Brachman from Yahoo!, … … <location> <organization> <person> <event> Ian Foster, father of the Grid, talks at the GES conference in Germany on 05/02/07. <person> <event> <location> <date> Gerhard Weikum ADBIS Oct 1, 2007

  27. Knowledge Acquisition from the Web • Learn Semantic Relations from Entire Corpora at Large Scale • (as exhaustively as possible but with high accuracy) • Examples: • all cities, all basketball players, all composers • headquarters of companies, CEOs of companies, synonyms of proteins • birthdates of people, capitals of countries, rivers in cities • which musician plays which instruments • who discovered or invented what • which enzyme catalyzes which biochemical reaction Existing approaches and tools use almost-unsupervised pattern matching and learning: seeds (known facts)  patterns (in text)  (extraction) rule  (new) facts Gerhard Weikum ADBIS Oct 1, 2007

  28. Methods for Web-Scale Fact Extration Assessment of facts & generation of rules based on statistics Rules can be more sophisticated: playing NN: (ADJ|ADV)* NP & class(head(NP))=person  plays (head(NP), NN) seeds  text  rules  new facts Example: city (Seattle) in downtown Seattle city (Seattle) Seattle and other towns city (Las Vegas) Las Vegas and other towns plays (Zappa, guitar) playing guitar: … Zappa plays (Davis, trumpet) Davis … blows trumpet Example: city (Seattle) in downtown Seattle in downtown X city (Seattle) Seattle and other towns X and other towns city (Las Vegas) Las Vegas and other towns X and other towns plays (Zappa, guitar) playing guitar: … Zappa playing Y: … X plays (Davis, trumpet) Davis … blows trumpet X … blows Y Example: city (Seattle) in downtown Seattle in downtown X city (Seattle) Seattle and other towns X and other towns city (Las Vegas) Las Vegas and other towns X and other towns plays (Zappa, guitar) playing guitar: … Zappa playing Y: … X plays (Davis, trumpet) Davis … blows trumpet X … blows Y Example: city (Seattle) in downtown Seattlein downtown X city (Seattle) Seattle and other townsX and other towns city (Las Vegas) Las Vegas and other towns X and other towns plays (Zappa, guitar) playing guitar: … Zappa playing Y: … X plays (Davis, trumpet) Davis … blows trumpetX … blows Y in downtown Delhi city(Delhi) Coltrane blows sax plays(C., sax) city(Delhi) plays(Coltrane, sax) city(Delhi) old center of Delhi plays(Coltrane, sax) sax player Coltrane city(Delhi) old center of Delhi old center of X plays(Coltrane, sax) sax player Coltrane Y player X Gerhard Weikum ADBIS Oct 1, 2007

  29. Performance of Web-IE State-of-the-art precision/recall results: relation precision recall corpus systems countries 80% 90% Web KnowItAll cities 80% ??? Web KnowItAll scientists 60% ??? Web KnowItAll CEOs 80% 50% News Snowball, LEILA birthdates 80% 70% Wikipedia LEILA instanceOf 40% 20% Web Text2Onto, LEILA precision value-chain: entities 80%, attributes 70%, facts 60%, events 50% Anecdotic evidence: invented (A.G. Bell, telephone) married (Hillary Clinton, Bill Clinton) isa (yoga, relaxation technique) isa (zearalenone, mycotoxin) contains (chocolate, theobromine) contains (Singapore sling, gin) invented (Johannes Kepler, logarithm tables) married (Segolene Royal, Francois Hollande) isa (yoga, excellent way) isa (your day, good one) contains (chocolate, raisins) plays (the liver, central role) makes (everybody, mistakes) Gerhard Weikum ADBIS Oct 1, 2007

  30. Beyond Surface Learning with LEILA (Cairo, Rhine), (Rome, 0911), (, [0..9]*), … (Cologne, Rhine), (Cairo, Nile), … NP VP PP NP NP PP NP NP NP PP NP VP NP PP NP NP NP Cologne lies on the banks of the Rhine People in Cairo like wine from the Rhine valley Mp Js Os AN Ss MVp DMc Mp Dg Jp Js Sp Mvp Ds Js NP VP VP PP NP NP PP NP NP Paris was founded on an island in the Seine (Paris, Seine) Ss Pv MVp Ds DG Js Js MVp We visited Paris last summer. It has many museums along the banks of the Seine. Learning to Extract Information by Linguistic Analysis [F. Suchanek et al.: KDD’06] Limitation of surface patterns: who discovered or invented what “Tesla’s work formed the basis of AC electric power” “Al Gore funded more work for a better basis of the Internet” Almost-unsupervised Statistical Learning with Dependency Parsing LEILA outperforms other Web-IE methods in precision and recall, but dependency parser is slow Gerhard Weikum ADBIS Oct 1, 2007

  31. IE Efficiency and Accuracy Tradeoffs [see also tutorials by Cohen, Doan/Ramakrishnan/Vaithyanathan, Agichtein/Sarawagi] IE is cool, but what‘s in it for DB folks? • precision vs. recall: two-stage processing (filter pipeline) • recall-oriented harvesting • precision-oriented scrutinizing • preprocessing • indexing: NLP trees & graphs, N-grams, PoS-tag patterns ? • exploit ontologies? exploit usage logs ? • turn crawl&extract into set-oriented query processing • candidate finding • efficient phrase, pattern, and proximity queries • optimizing entire text-mining workflows [Ipeirotis et al.: SIGMOD‘06] Gerhard Weikum ADBIS Oct 1, 2007

  32. Summary of Planck Approach Human text (and speech) is diverse and produced at higher rate than manual high-quality annotations IE offers reasonably robust and scalable methods for harvesting named entities and binary relations ? Deep NLP and advanced ML are computational bottleneck ? Disambiguation (entity matching, record linkage) needed „Joe Hellerstein (UC Berkeley)“ = „Prof. Joseph M. Hellerstein, California)“ „Max Planck Institute“ = „MPI“ ≠ „MPI“ = „Message Passing Institute“ • Challenge: • Achieve Web-scale IE throughput that can • sustain rate of new content production (e.g. blogs) • (may need large-scale P2P/Grid) • with > 90% accuracy and Wikipedia-like coverage Gerhard Weikum ADBIS Oct 1, 2007

  33. Darwin Approach (Social Web) • Social Wisdom & Natural Selection: • Evolution of (Web 2.0) species • Survival of the fittest Charles Darwin (1809 - 1882) Gerhard Weikum ADBIS Oct 1, 2007

  34. „Wisdom of Crowds“ at Work on Web 2.0 • Information enrichment & knowledge extraction by humans: • Collaborative Recommendations & QA • Amazon (product ratings & reviews, recommended products) • Netflix: movie DVD rentals  $ 1 Mio. Challenge • answers.yahoo, iknow.baidu, etc. • Social Tagging and Folksonomies • del.icio.us: Web bookmarks and tags • flickr: photo annotation, categorization, rating • YouTube: same for video • Human Computing in Game Form • ESP and Google Image Labeler: image tagging • Peekaboom: image segmenting and tagging • Verbosity: facts from natural-language sentences • Online Communities • dblife.cs.wisc.edu for database research • www.lt-world.org for language technology • Yahoo! Groups, Myspace, Facebook, etc. etc. Gerhard Weikum ADBIS Oct 1, 2007

  35. Social Tagging: Example Flickr (1) Gerhard Weikum ADBIS Oct 1, 2007

  36. Social Tagging: Example Flickr (2) Gerhard Weikum ADBIS Oct 1, 2007

  37. Social Tagging: Example Flickr (3) Gerhard Weikum ADBIS Oct 1, 2007

  38. Social-Tagging Community > 1 Mio. users > 100 Mio. photos > 1 Bio. tags 30% monthly growth Gerhard Weikum ADBIS Oct 1, 2007 Source: www.flickr.com

  39. ESP Game [Luis von Ahn et al. 2004 ] played against random, anonymous partner on Internet taboo: pyramid Louvre museum Paris art • Game with a purpose • Collects annotations (wisdom) • Can exploit tag statistics (crowds) • Attracts people, fun to play, some play hours • ESP game collected > 10 Mio. tags from > 20000 users • 5000 people could tag all photos on the Web in 4 weeks • (human computing) your partner has suggested: 3 labels your partner has suggested: 7 labels your partner has suggested: 11 labels your partner has suggested: 17 labels my labels: my labels: reflection my labels: reflection water my labels: reflection water Mitterand Mona Lisa my labels: reflection water Mitterand Mona Lisa metro lignes 7, 14 my labels: reflection water Mitterand Mona Lisa metro lignes 7, 1 Da Vinci code Congratulations! You scored 1 point! Gerhard Weikum ADBIS Oct 1, 2007

  40. More Human Computing • Verbosity [von Ahn 2006]: • Collect common-knowledge facts (relation instances) • 2 players: Narrator (N) and Guessor (G) • N gives stylized clues: • is a kind of …, is used for …, is typically near/in/on …, is the opposite of … • random pairing for independence, • can build statistics over many games for same concept Peekaboom, Phetch, etc.: locating & tagging objects in images, finding images, etc. • incentives to play ? • game design for moving up the value-chain ? Gerhard Weikum ADBIS Oct 1, 2007

  41. Dark Side of Social Wisdom • Spam (Web spam – not just for email anymore): • lucky online casino, easy MBA diploma, cheap V!-4-gra, etc.; • law suits about „appropriate Google rank“ • Truthiness: • degree to which something is truthy (not necessarily facty); • truthy := property of something you know from your guts • Disputes: • editorial fights over critical Wikipedia articles; • Citizendium: new endeavor with "gentle expert oversight" • Dishonesty, Bias, … Gerhard Weikum ADBIS Oct 1, 2007

  42. Gerhard Weikum ADBIS Oct 1, 2007

  43. Gerhard Weikum ADBIS Oct 1, 2007

  44. The Wisdom of Crowds: PageRank Authority (page q) = stationary prob. of visiting q PageRank (PR): links are endorsements & increase page authority, authority is higher if links come from high-authority pages Social Ranking with and equivalent to principal eigenvector: random walk: uniformly random choice of links + random jumps; add bias to transitions and jumps for personal PR, TrustRank, etc. Gerhard Weikum ADBIS Oct 1, 2007

  45. The Wisdom of Crowds: Beyond PR users tags docs Typed graphs: data items, users, friends, groups, postings, ratings, queries, clicks, … with weighted edges spectral analysis of various graphs Evolving over time  tensor analysis Gerhard Weikum ADBIS Oct 1, 2007

  46. Decentralized Graph Analysis global graph local subgraph 1 local subgraph 3 local sub- graph 2 • Graph spectral analysis applied to: • pages, sites, tags, users, groups, queries, clicks, opinions, etc. as nodes • assessment and interaction relations as weighted edges • can compute various notions of authority, reputation, trust, quality Decentralized computation in peer-to-peer network with arbitrary, a-priori unknown overlaps of graph fragments Gerhard Weikum ADBIS Oct 1, 2007

  47. JXP Algorithm [J.X. Parreira, G. Weikum: WebDB’05, VLDB’06] Decentralized, asynchronous, peer-to-peer algorithm based on theory of Markov-chain aggregation (state lumping) [P.J. Courtois 1977, C.D. Meyer 1988] • each peer aggregates non-local part of global graph into „world node“ • peers meet randomly, • exchange data about their local computations, and • iterate their local computations Theorem: authority scores from local computations converge to global scores supported by Minerva system http://www.mpi-inf.mpg.de/departments/d5/software/minerva/index.html Gerhard Weikum ADBIS Oct 1, 2007

  48. Summary of Darwin Approach Social tagging and social networks (Web 2.0) are potentially valuable knowledge sources Games (human computing) are an interesting way of enticing „knowledge input“ and collecting statistics Spectral analysis is a highly versatile tool for rating & ranking that can be extended and scaled by decentralized algorithms • Challenges: • Design a game that intrigues serious scientists • to „semantically“ annotate their scholarly work • Develop an analysis method that identifies the „best“ facts, • resilient to egoistic and malicious behaviors (incl. coalitions) Gerhard Weikum ADBIS Oct 1, 2007

  49. Outline Introduction: Search for Knowledge  • Harvesting Knowledge • Leibniz Approach • Planck Approach • Darwin Approach  Conclusion • Gerhard Weikum ADBIS Oct 1, 2007

  50. Summary • Harvesting knowledge & organizing in semantic DB/graph for • scholarly Web, • digital libraries, • enterprise know-how, • online communities, etc. • Three roads to knowledge: • Leibniz / Semantic Web: ontologies, encyclopedia, etc. • Planck / Statistical Web: large-scale IE from text, speech, etc. • Darwin / Social Web: wisdom of crowds, tagging, folksonomies • Not covered here: search and ranking • graph IR (for ER graphs, RDF, cross-linked XML, etc.) • new ranking models (e.g. statistical LM for graphs) • efficient and scalable query processing Gerhard Weikum ADBIS Oct 1, 2007