
Knowledge Acquisition on the Web


Presentation Transcript


  1. Knowledge Acquisition on the Web Growing the amount of available knowledge from within Christopher Thomas

  2. Overview • Knowledge Representation • GlycO – Complex Carbohydrates domain ontology • Information Extraction • Taxonomy creation (Doozer/Taxonom.com) • Fact Extraction (Doozer++) • Validation

  3. Circle of knowledge on the Web Suggest new propositions Background knowledge Confirm new knowledge

  4. Goal: Harness the Wisdom of the Crowds to automatically model a domain, verify the model and give the verified knowledge back to the community

  5. Circle of knowledge on the Web What is knowledge? Suggest new propositions Background knowledge How do we turn propositions/beliefs into knowledge? Confirm new knowledge How do we acquire knowledge?

  6. Background Knowledge [15] Christopher Thomas and Amit Sheth, “On the Expressiveness of the Languages for the Semantic Web – Making a Case for ‘A Little More’,” in Fuzzy Logic and the Semantic Web, Elie Sanchez (Ed.), Elsevier, 2006. [11] Amit Sheth, Cartic Ramakrishnan, and Christopher Thomas, “Semantics for the Semantic Web: The Implicit, the Formal and the Powerful,” International Journal on Semantic Web & Information Systems, 1(1), 2005, pp. 1–18.

  7. Different Angles • Social construction • Large scale creation of knowledge vs. • Small communities define their domains • Normative vs. Descriptive • Top-Down vs. Bottom-Up • Formal vs. Informal • Machine-readable vs. human-readable

  8. Community-created knowledge • Descriptive • Bottom-up • Formally less rigid • May contain false information • If a statement in the world is in conflict with the Ontology, both may be wrong or both may be right • Good for broad, shallow domains • Good for human processing and IR tasks

  9. Wikipedia and Linked Open Data • Created by large communities • Constantly growing • Domains within the linked data are not always easily discernible • Contain few axioms and restrictions • Little value to evaluation using logics

  10. Formal - Modeling deep domains • Prescriptive / Normative • Top-down • Contains “true knowledge” • If a statement in the world is in conflict with the Ontology, the statement is false • Good for scientific domains • Good for computational reasoning/inference • Usually created by small communities of experts • Usually static, little change is expected

  11. Example: GlycO • Created in collaboration with the Complex Carbohydrate Research Center at the University of Georgia on an NCRR grant. • Deep modeling of glycan structures and metabolic pathways [6] Christopher Thomas, Amit P. Sheth, and William S. York, “Modular Ontology Design Using Canonical Building Blocks in the Biochemistry Domain,” in Formal Ontology in Information Systems (FOIS 2006). [5] Satya S. Sahoo, Christopher Thomas, Amit P. Sheth, William York, and Samir Tartir, “Knowledge Modeling and Its Application in Life Sciences: A Tale of Two Ontologies,” 15th International World Wide Web Conference (WWW 2006).

  12. GlycO

  13. N-Glycosylation metabolic pathway N-glycan_beta_GlcNAc_9 N-glycan_alpha_man_4 GNT-V attaches GlcNAc at position 6 N-acetyl-glucosaminyl_transferase_V UDP-N-acetyl-D-glucosamine + alpha-D-Mannosyl-1,3-(R1)-beta-D-mannosyl-R2 <=> UDP + N-Acetyl-beta-D-glucosaminyl-1,2-alpha-D-mannosyl-1,3-(R1)-beta-D-mannosyl-R2 UDP-N-acetyl-D-glucosamine + G00020 <=> UDP + G00021 GNT-I attaches GlcNAc at position 2

  14. Glycan Structures for the ontology • Import structures from heterogeneous databases • Possible connections modeled in the form of GlycoTree • Match structures to archetypes b-D-Manp-(1-6)+ | b-D-Manp-(1-4)-b-D-GlcpNAc-(1-4)-D-GlcNAc | b-D-Manp-(1-3)+ N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15, 2003: 235-251

  15. Interplay of extraction and evaluation • Errors in the source databases are propagated through various new databases → comparing multiple sources fails for error correction • Even less than 2% incorrect information can make a database useless for automatic validation of hypotheses • The ontology contains rules on how carbohydrate structures are known to be composed • By mapping information in databases to the ontology and analyzing how successful the mapping was, we can identify possible errors.

  16. Database Verification using GlycO b-D-Manp-(1-6)+ | a-D-Manp-(1-4)-b-D-GlcpNAc-(1-4)-D-GlcNAc | b-D-Manp-(1-3)+ a-D-Manp-(1-4) is not part of the identified canonical structure for N-Glycans, hence it is likely that the database entry is incorrect N. Takahashi and K. Kato, Trends in Glycosciences and Glycotechnology, 15, 2003: 235-251
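The verification step behind slides 14–16 can be illustrated with a minimal sketch. The residue names and canonical linkages below are simplified placeholders, not the actual GlycO encoding; the point is only that each linkage of a structure imported from an external database is checked against the canonical tree, and any linkage without a canonical counterpart is flagged as a possible database error.

```python
# Sketch: check each (child residue, linkage, parent residue) of an imported
# glycan against a set of canonical combinations. Illustrative data only,
# not the actual GlycO model.

CANONICAL_LINKS = {
    ("b-D-GlcpNAc", "1-4", "D-GlcNAc"),
    ("b-D-Manp",    "1-4", "b-D-GlcpNAc"),
    ("b-D-Manp",    "1-3", "b-D-Manp"),
    ("b-D-Manp",    "1-6", "b-D-Manp"),
}

def non_canonical_links(links):
    """Return the linkages of an imported structure with no canonical counterpart."""
    return [link for link in links if link not in CANONICAL_LINKS]

# The suspect case from slide 16: an alpha-mannose 1-4 linkage that does not
# occur in the canonical N-glycan core.
imported = [
    ("b-D-GlcpNAc", "1-4", "D-GlcNAc"),
    ("b-D-Manp",    "1-4", "b-D-GlcpNAc"),
    ("a-D-Manp",    "1-4", "b-D-GlcpNAc"),   # likely incorrect database entry
    ("b-D-Manp",    "1-6", "b-D-Manp"),
]

for child, linkage, parent in non_canonical_links(imported):
    print(f"non-canonical linkage: {child}-({linkage})-{parent}")
```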

  17. Pathway Steps - Reaction Evidence for this reaction from three experiments Pathway visualization tool by M. Eavenson and M. Janik, LSDIS Lab, Univ. of Georgia

  18. Summary - GlycO • The accuracy and detail found in ontologies such as GlycO could most likely not be acquired automatically • Only a small community of experts has the depth of knowledge to model such scientific ontologies

  19. Summary - GlycO • However, the automatic population shows that a highly restrictive, expert-created rule set allows for automation or involvement of larger communities. • → Frame-based population of knowledge • → The formal knowledge encoded in the ontology serves to acquire new knowledge • → The circle is completed

  20. Summary Background Knowledge • Large amounts of information and knowledge are available • Some machine readable by default • Others need specific algorithms to extract information • The more available information we can use, the better the extraction of new information will be.

  21. Circle of knowledge on the Web What is knowledge? Suggest new propositions Background knowledge How do we turn propositions into knowledge? Confirm new knowledge Part 2 How do we acquire knowledge?

  22. Knowledge Acquisition through Model Creation [2] [3] [1] [3] Christopher Thomas, Pankaj Mehra, Roger Brooks and Amit Sheth, Growing Fields of Interest - Using an Expand and Reduce Strategy for Domain Model Extraction, Web Intelligence 2008, pp. 496-502. [2] Christopher Thomas, Wenbo Wang, Delroy Cameron, Pablo Mendes, Pankaj Mehra and Amit Sheth, What Goes Around Comes Around - Improving Linked Open Data through On-Demand Model Creation, WebScience 2010. [1] Christopher Thomas, Pankaj Mehra, Wenbo Wang, Amit Sheth, Gerhard Weikum and Victor Chana, Automatic Domain Model Creation Using Pattern-Based Fact Extraction, Knoesis Center technical report.

  23. First create a domain hierarchy Example: a hierarchy for the domain of Human Performance and Cognition

  24. Connect with learned facts

  25. Example: strongly connected component

  26. Excerpt: strongly connected component

  27. Expert evaluation of facts in the ontology 1-2: Information that is overall incorrect 3-4: Information that is somewhat correct 5-6: Correct general information 7-9: Correct information not commonly known

  28. Technical Details

  29. Step 1 • Domain hierarchy creation • Input terms e.g. related to Human Performance and Cognition • Hierarchy is automatically carved from articles and categories on Wikipedia

  30. Overview - conceptual • Expand and Reduce approach • Start with ‘high recall’ methods • Exploration - Full text search • Exploitation – Node Similarity Method • Category growth • End with “high precision” methods • Apply restrictions on the concepts found • Remove unwanted terms and categories

  31. Expand - conceptually • Graph-based expansion • Full-text search on article texts • Delete results with low confidence score
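A toy sketch of the expand-and-reduce idea described in slides 30–31, assuming a small undirected link graph that stands in for Wikipedia's article and category links. The seeds, graph, connectedness score and threshold are all illustrative, not Doozer's actual parameters.

```python
from collections import defaultdict

# Illustrative article links (stand-in for Wikipedia article/category links).
LINKS = [
    ("Human performance", "Cognition"),
    ("Human performance", "Fatigue"),
    ("Human performance", "Ergonomics"),
    ("Cognition", "Memory"),
    ("Cognition", "Attention"),
    ("Fatigue", "Sleep deprivation"),
    ("Fatigue", "Attention"),
    ("Ergonomics", "Industrial design"),
    ("Industrial design", "Furniture"),
]

NEIGHBOURS = defaultdict(set)
for a, b in LINKS:
    NEIGHBOURS[a].add(b)
    NEIGHBOURS[b].add(a)

def expand(seeds, hops=2):
    """High-recall phase: collect everything reachable within `hops` links of the seeds."""
    found = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier = {n for node in frontier for n in NEIGHBOURS[node]} - found
        found |= frontier
    return found

def reduce_to_domain(candidates, threshold=0.6):
    """High-precision phase: keep candidates whose neighbourhood mostly stays in the domain."""
    kept = {}
    for node in candidates:
        score = len(NEIGHBOURS[node] & candidates) / len(NEIGHBOURS[node])
        if score >= threshold:
            kept[node] = round(score, 2)
    return kept

seeds = {"Human performance", "Cognition"}
candidates = expand(seeds, hops=2)
# "Industrial design" is reached during expansion but pruned in the reduce step,
# because half of its neighbourhood lies outside the candidate set.
print(reduce_to_domain(candidates, threshold=0.6))
```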

  32. Collecting Instances

  33. Creating a Hierarchy

  34. Step 2: Pattern-Based Relationship Extraction Extracting meaningful relationships by macro-reading free text

  35. Extracting from plain text or hypertext • Informal, human-readable presentation of information • Vast amounts of information available • Web • Scientific publications • Encyclopedias • Need sophisticated algorithms to extract information

  36. Pattern-based Fact Extraction • Learn textual patterns that express known relationship types • Search the text corpus for occurrences of known entities (e.g. from domain hierarchy) • Semi-open • Types are known and limited • Types are automatically expanded when LOD grows • Vector-Space Model • Probabilistic representation

  37. Training • Relationship data in the UMLS Metathesaurus or the Wikipedia Infobox-data provide a large set of facts in RDF Triple format • Limited set of relationships that can be arranged in a schema • Semi-open • Types are known and limited • Types are automatically expanded when LOD grows

  38. Training procedure • Iterate through all facts (S->P->O triples) • Find evidence for the fact in a corpus • Wikipedia, WWW, PubMed or any other collection • If triple subject and triple object occur in close proximity in text, add the pattern in-between to the learned patterns • Combined evidence from many different patterns increases the certainty of a relationship between the entities
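A minimal sketch of this training loop, assuming toy triples and a toy corpus. For each known fact, sentences where subject and object occur in close proximity contribute the text between them as a candidate pattern for that predicate; the real system also keeps context around the entities and aggregates evidence over far larger corpora.

```python
import re
from collections import Counter

# Toy facts in (subject, predicate, object) form and a toy text corpus.
triples = [
    ("Canberra", "capital_of", "Australia"),
    ("Ottawa",   "capital_of", "Canada"),
]

corpus = [
    "Canberra, the Australian capital city, was founded in 1913.",
    "Canberra, capital of the Commonwealth of Australia.",
    "Ottawa, capital of Canada, lies on the Ottawa River.",
]

MAX_GAP = 60  # "close proximity": at most this many characters between the entities

patterns = Counter()  # (predicate, generalized pattern) -> occurrence count
for subj, pred, obj in triples:
    for sentence in corpus:
        for first, second in ((subj, obj), (obj, subj)):
            m = re.search(re.escape(first) + r"(.{1,%d}?)" % MAX_GAP + re.escape(second),
                          sentence)
            if m:
                # generalize by replacing the matched entities with placeholders
                template = ("<Subject>%s<Object>" if first == subj
                            else "<Object>%s<Subject>") % m.group(1)
                patterns[(pred, template)] += 1

# Combined evidence from many patterns raises the certainty of a relationship.
for (pred, template), count in patterns.most_common():
    print(count, pred, repr(template))
```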

  39. Overview – initial computations [Diagram: Fact Collection and Text Corpus feed the matrix computations; a CP2P matrix and its modified version CP2Pmod are built and transformed into an R2P matrix and R2Pmod, using Entropy, SVD/LSI and Pertinence]

  40. Training procedure cont’d Example contexts: “Canberra, the Australian capital city” • “Canberra, capital of the Commonwealth of Australia” • “Canberra, the Australian capital” Learned patterns (1 occurrence each): <Subject>, the <Object> capital city • <Subject>, capital of the Commonwealth of <Object> • <Subject>, the <Object> capital

  41. Relationship Patterns Extracted Synonyms Generalize

  42. Relationship Patterns

  43. Resolve Relationships

  44. Resolve Relationships
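The figures for slides 43–44 are not reproduced in this transcript. A minimal sketch of the resolution idea, assuming illustrative pattern weights, relation types and sentences: a candidate concept pair is scored per relation by summing the weights of every learned pattern whose instantiation appears in the supporting text, and the best-scoring relation is chosen.

```python
# Hypothetical learned (relation -> pattern -> weight) table; values are illustrative.
learned = {
    "capital_of": {"<Subject>, capital of <Object>": 0.9,
                   "<Subject>, the <Object> capital": 0.7},
    "located_in": {"<Subject> is a city in <Object>": 0.8},
}

def resolve(subject, obj, sentences):
    """Return per-relation evidence scores for the concept pair (subject, obj)."""
    scores = {}
    for relation, templates in learned.items():
        total = 0.0
        for template, weight in templates.items():
            needle = template.replace("<Subject>", subject).replace("<Object>", obj)
            # combined evidence: every matching sentence adds the pattern's weight
            total += weight * sum(needle in s for s in sentences)
        if total:
            scores[relation] = round(total, 2)
    return scores

sentences = [
    "Wellington, capital of New Zealand, sits on Cook Strait.",
    "Wellington is a city in New Zealand.",
]
print(resolve("Wellington", "New Zealand", sentences))
# -> {'capital_of': 0.9, 'located_in': 0.8}
```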

  45. Advanced Computations [Diagram: same matrix pipeline as slide 39 — Fact Collection and Text Corpus, CP2P/CP2Pmod and R2P/R2Pmod matrices — with the Entropy, SVD/LSI and Pertinence steps highlighted]

  46. Advanced Computations • LSI to determine relationship similarities • Reduces sparsity in the matrix and makes relationship rows more comparable • Allows better use of pertinence computation • Entropy • Increase weights for more unique patterns • Pertinence • Smoothing of pattern occurrence frequencies
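A rough sketch of these matrix computations on a toy relation-by-pattern count matrix. Rows are relation types, columns are textual patterns; the entropy weighting, the SVD rank and the cosine similarity are illustrative choices, not the exact formulas used in Doozer++.

```python
import numpy as np

relations = ["capital_of", "located_in", "has_part"]
patterns = ["<S>, capital of <O>", "<S>, the <O> capital",
            "<S> is a city in <O>", "<S> contains <O>"]

counts = np.array([[12., 7., 1., 0.],
                   [ 2., 0., 9., 3.],
                   [ 0., 0., 1., 8.]])

# Entropy weighting: patterns concentrated on few relations get higher weight.
p = counts / counts.sum(axis=0, keepdims=True)
col_entropy = -(np.where(p > 0, p * np.log2(p + 1e-12), 0.0)).sum(axis=0)
weight = 1.0 - col_entropy / np.log2(len(relations))  # 1 = relation-specific, 0 = uniform
weighted = counts * weight

# SVD/LSI: a rank-k reconstruction reduces sparsity so relation rows become comparable.
U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
k = 2
smoothed = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Relation-relation similarity in the reduced space (cosine of row vectors).
rows = U[:, :k] * s[:k]
norms = np.linalg.norm(rows, axis=1, keepdims=True)
similarity = (rows @ rows.T) / (norms @ norms.T)

print("smoothed relation-by-pattern matrix:\n", np.round(smoothed, 1))
print("relation similarities:\n", np.round(similarity, 2))
```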

  47. Example Output (DBpedia)

  48. Pertinence for Relations • Looking at fact extraction as a classification of concept pairs into classes of relations • Class boundaries are not clear cut • E.g. has_physical_part vs. has_part • → don't punish the occurrence of the same pattern with relationship types that are similar
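A small sketch of that idea, assuming a hypothetical clustering of near-synonymous relation types and illustrative counts: before judging how selective a pattern is, counts for similar relations are merged so that a pattern shared only by near-synonymous relations is not penalized as ambiguous.

```python
import numpy as np

# Hypothetical clustering of near-synonymous relation types.
clusters = {"has_part": "part", "has_physical_part": "part", "capital_of": "capital"}

# How often one particular pattern co-occurred with each relation type (toy counts).
counts = {"has_part": 5.0, "has_physical_part": 4.0, "capital_of": 0.0}

def entropy(values):
    p = np.array([v for v in values if v > 0])
    p = p / p.sum()
    return max(0.0, float(-(p * np.log2(p)).sum()))

# Naive view: the pattern looks ambiguous because it serves two relation types.
print("entropy over relations:", round(entropy(counts.values()), 2))        # ~0.99

# Pertinence view: merge counts of similar relations first, then the pattern
# is effectively specific to one class and keeps its weight.
merged = {}
for relation, c in counts.items():
    merged[clusters[relation]] = merged.get(clusters[relation], 0.0) + c
print("entropy over merged classes:", round(entropy(merged.values()), 2))   # 0.0
```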

  49. Relationship Patterns
