Explore the fundamentals of knowledge acquisition and ontology while delving into semantic annotation and free-form query interpretation. Learn to distill raw web data into valuable insights. Discover techniques for scalable extraction ontology creation.
WoK: A Web of Knowledge David W. Embley Brigham Young University Provo, Utah, USA
A Web of Pages A Web of Facts • Birthdate of my great grandpa Orson • Price and mileage of red Nissans, 1990 or newer • Location and size of chromosome 17 • US states with property crime rates above 1%
Toward a Web of Knowledge • Fundamental questions • What is knowledge? • What are facts? • How does one know? • Philosophy • Ontology • Epistemology • Logic and reasoning
Ontology • Existence asks “What exists?” • Concepts, relationships, and constraints with formal foundation
Epistemology • The nature of knowledge asks: “What is knowledge?” and “How is knowledge acquired?” • Populated conceptual model
Logic and Reasoning • Principles of valid inference ask: “What is known?” and “What can be inferred?” • For us, they answer what can be inferred (in a formal sense) from conceptualized data • Example: find the price and mileage of red Nissans, 1990 or newer
Making this Work: How? • Distill knowledge from the wealth of digital web data • Annotate web pages • Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge [Figure: annotations on web pages yield facts]
Turning Raw Symbols into Knowledge • Symbols: $ 11,500 117K Nissan CD AC • Data: price(11,500) mileage(117K) make(Nissan) • Conceptualized data: • Car(C123) has Price($11,500) • Car(C123) has Mileage(117,000) • Car(C123) has Make(Nissan) • Car(C123) has Feature(AC) • Knowledge • “Correct” facts • Provenance
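The progression from symbols to conceptualized data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the recognizer patterns, the `conceptualize` function, and the source name `ad-42.html` are hypothetical and not part of the original slides. Provenance is modeled simply as the name of the page each fact came from.

```python
import re

# Hypothetical recognizers (not from the slides): concept name, pattern, converter.
RECOGNIZERS = [
    ("Price", re.compile(r"^\$\s*(\d{1,3}(?:,\d{3})*)$"),
     lambda m: int(m.group(1).replace(",", ""))),
    ("Mileage", re.compile(r"^(\d+)K$"), lambda m: int(m.group(1)) * 1000),
    ("Make", re.compile(r"^(Nissan|Toyota|Honda)$"), lambda m: m.group(1)),
    ("Feature", re.compile(r"^(AC|CD)$"), lambda m: m.group(1)),
]


def conceptualize(symbols, car_id, source):
    """Turn raw symbols into (object, relationship, value, provenance) facts."""
    facts = []
    for symbol in symbols:
        for concept, pattern, convert in RECOGNIZERS:
            match = pattern.match(symbol)
            if match:
                facts.append(
                    (f"Car({car_id})", f"has {concept}", convert(match), source)
                )
                break  # the first recognizer that fires wins
    return facts


# "ad-42.html" is a made-up source name used only to illustrate provenance.
facts = conceptualize(["$11,500", "117K", "Nissan", "CD", "AC"], "C123", "ad-42.html")
for fact in facts:
    print(fact)
```

Keeping the source alongside each fact is one simple way to let a user check a "correct" fact against the page it was extracted from.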
Actualization (with Extraction Ontologies) Find me the price and mileage of all red Nissans – I want a 1990 or newer.
Explanation: How it Works • Extraction Ontologies • Semantic Annotation • Free-Form Query Interpretation
Extraction Ontologies • Object sets (lexical, non-lexical, primary object set) • Relationship sets • Participation constraints • Aggregation • Generalization/Specialization
Extraction Ontologies • Data Frame • Internal Representation: float • Values • External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})? • Left Context: $ • Key Word Phrase • Key Words: ([Pp]rice)|([Cc]ost)| … • Operators • Operator: > • Key Words: (more\s*than)|(more\s*costly)|…
Generality & Resiliency of Extraction Ontologies • Generality: assumptions about web pages • Data rich • Narrow domain • Document types • Single-record documents (hard, but doable) • Multiple-record documents (harder) • Records with scattered components (even harder) • Resiliency: declarative • Still works when web pages change • Works for new, unseen pages in the same domain • Scalable, but takes work to declare the extraction ontology
Free-Form Query Interpretation • Parse Free-Form Query (with respect to data extraction ontology) • Select Ontology • Formulate Query Expression • Run Query Over Semantically Annotated Data
Parse Free-Form Query • “Find me the price and mileage of all red Nissans – I want a 1996 or newer” • Recognized: price, mileage, red, Nissan, 1996 • “or newer” is recognized as the >= operator
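Parsing a query with respect to an extraction ontology can be sketched by running the ontology's recognizers over the query text. The three pattern tables and the `parse_query` function below are hypothetical, written for a tiny car-ads domain, and not the recognizers of the original system.

```python
import re

# Hypothetical recognizers for a tiny car-ads ontology (assumptions, not from the slides).
OBJECT_SET_KEYWORDS = {"Price": r"\bprice\b", "Mileage": r"\bmileage\b"}
VALUE_PATTERNS = {
    "Color": r"\b(red|blue|white|black)\b",
    "Make": r"\b(Nissan|Toyota|Honda)",      # no trailing \b, so "Nissans" matches
    "Year": r"\b((?:19|20)\d{2})\b",
}
OPERATOR_KEYWORDS = {">=": r"\bor\s+newer\b", "<=": r"\bor\s+older\b"}


def parse_query(query):
    """Mark the parts of a free-form query that the ontology recognizes."""
    def found(pattern):
        return re.search(pattern, query, re.IGNORECASE)

    return {
        "object_sets": [n for n, p in OBJECT_SET_KEYWORDS.items() if found(p)],
        "values": {n: found(p).group(1) for n, p in VALUE_PATTERNS.items() if found(p)},
        "operators": [op for op, p in OPERATOR_KEYWORDS.items() if found(p)],
    }


parsed = parse_query(
    "Find me the price and mileage of all red Nissans – I want a 1996 or newer"
)
print(parsed)
```

Words the ontology does not recognize ("Find me the", "I want a") are simply ignored, which is what lets the query stay free-form.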
Select Ontology “Find me the price and mileage of all red Nissans – I want a 1996 or newer”
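The slides do not say how the ontology is chosen. One plausible approach, shown here purely as an assumption, is to pick the ontology whose recognizers match the most parts of the query. The `ONTOLOGIES` table and its patterns are hypothetical.

```python
import re

# Hypothetical ontologies, each reduced to a list of recognizer patterns.
ONTOLOGIES = {
    "CarAds": [
        r"\bprice\b", r"\bmileage\b", r"\b(red|blue|white|black)\b",
        r"\b(Nissan|Toyota|Honda)", r"\b(19|20)\d{2}\b",
    ],
    "Genealogy": [
        r"\bbirthdate\b", r"\b(grandpa|grandma|grandfather|grandmother)\b",
    ],
}


def select_ontology(query):
    """Return the name of the ontology with the most matching recognizers."""
    def score(patterns):
        return sum(1 for p in patterns if re.search(p, query, re.IGNORECASE))

    return max(ONTOLOGIES, key=lambda name: score(ONTOLOGIES[name]))


print(select_ontology(
    "Find me the price and mileage of all red Nissans – I want a 1996 or newer"
))
print(select_ontology("Birthdate of my great grandpa Orson"))
```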
Formulate Query Expression • Conjunctive queries and aggregate queries • Projection on mentioned object sets • Selection via values and operator keywords • Color = “red” • Make = “Nissan” • Year >= 1996 (“or newer” supplies the >= operator)
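The projection and selection described above can be sketched as a conjunctive query over annotated records. The `run_query` function and the sample records are made up for illustration; only the three selection conditions and the two projected object sets come from the slides.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le,
       ">": operator.gt, "<": operator.lt}


def run_query(records, projection, selections):
    """Keep records that satisfy every (field, op, value) condition,
    then project on the mentioned object sets."""
    results = []
    for record in records:
        if all(field in record and OPS[op](record[field], value)
               for field, op, value in selections):
            results.append({name: record.get(name) for name in projection})
    return results


# Invented sample data standing in for semantically annotated car ads.
cars = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 11500, "Mileage": 117000},
    {"Make": "Nissan", "Color": "red", "Year": 1993, "Price": 4200, "Mileage": 160000},
    {"Make": "Toyota", "Color": "red", "Year": 2001, "Price": 9800, "Mileage": 88000},
]

selections = [("Color", "=", "red"), ("Make", "=", "Nissan"), ("Year", ">=", 1996)]
result = run_query(cars, ["Price", "Mileage"], selections)
print(result)
```

Each condition is one conjunct, so adding another constraint from the query only appends one more triple to `selections`.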
Formulate Query Expression • Query clauses: For, Let, Where, Return
Great! But Problems Still Need Resolution • How do we create extraction ontologies? • Manual creation requires several dozen person-hours • Semi-automatic creation • TISP (Table Interpretation by Sibling Pages) • TANGO (Table ANalysis for Generating Ontologies) • Nested Schemas with Regular Expressions • Synergistic Bootstrapping • Form-based Information Harvesting • How do we scale up? • Practicalities of technology transfer and usage • Millions of queries over zillions of facts for thousands of ontologies
Manual Creation • Library of instance recognizers • Library of lexicons
Automatic Annotation with TISP (Table Interpretation with Sibling Pages) • Recognize tables (discard non-tables) • Locate table labels • Locate table values • Find label/value associations
Recognize Tables • Layout tables (discard) • Data table • Nested data tables
Locate Table Labels • Examples: Identification.Gene model(s).Protein; Identification.Gene model(s).2
Locate Table Labels • Examples: Identification.Gene model(s).Gene Model; Identification.Gene model(s).2
Locate Table Values
Find Label/Value Associations • Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
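A label/value association of this shape, where a value is addressed by a column label path and a row label path, can be sketched as follows. The `associate` function is hypothetical, and every cell value other than WP:CE28918 is a placeholder because the slides do not show the rest of the table.

```python
def associate(section, column_labels, rows):
    """Pair each cell with its (column label path, row label path)."""
    facts = {}
    for row_number, row in enumerate(rows, start=1):
        for column, value in zip(column_labels, row):
            key = (f"{section}.{column}", f"{section}.{row_number}")
            facts[key] = value
    return facts


# Placeholder cells stand in for values not shown in the slides.
facts = associate(
    "Identification.Gene model(s)",
    ["Gene Model", "Protein"],
    [
        ["(placeholder A)", "(placeholder B)"],
        ["(placeholder C)", "WP:CE28918"],
    ],
)

key = ("Identification.Gene model(s).Protein", "Identification.Gene model(s).2")
print(key, "=", facts[key])
```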
Interpretation Technique: Sibling Page Comparison [Figure callout: Almost Same]
Interpretation Technique: Sibling Page Comparison [Figure callouts: Different, Same]
Technique Details • Unnest tables • Match tables in sibling pages • “Perfect” match (layout table: discard) • “Reasonable” match (sibling table) • Determine & use table-structure pattern • Discover pattern • Pattern usage • Dynamic pattern adjustment
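The idea behind sibling page comparison can be sketched with two same-shaped tables: cells that are identical across siblings are likely labels, and cells that differ are likely values. The functions, the 0.3 threshold, and the sample tables below are assumptions for illustration, not the actual TISP matching procedure.

```python
def classify_cells(table_a, table_b):
    """Compare two sibling tables cell by cell.
    Returns (label positions, value positions)."""
    labels, values = [], []
    for r, (row_a, row_b) in enumerate(zip(table_a, table_b)):
        for c, (cell_a, cell_b) in enumerate(zip(row_a, row_b)):
            (labels if cell_a == cell_b else values).append((r, c))
    return labels, values


def match_kind(table_a, table_b, threshold=0.3):
    """Assumed rule: identical tables are layout tables ("perfect" match, discard);
    partially identical tables are sibling data tables ("reasonable" match)."""
    labels, values = classify_cells(table_a, table_b)
    total = len(labels) + len(values)
    if total == 0:
        return "no match"
    same = len(labels) / total
    if same == 1.0:
        return "perfect"
    return "reasonable" if same >= threshold else "no match"


# Invented sibling tables: same labels, different values.
page_a = [["Name", "Alpha"], ["Size", "10"]]
page_b = [["Name", "Beta"], ["Size", "12"]]
menu = [["Home", "Search", "Help"]]

print(classify_cells(page_a, page_b))
print(match_kind(page_a, page_b), match_kind(menu, menu))
```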
Semi-Automatic Annotation with TANGO (Table Analysis for Generating Ontologies) • Recognize and normalize table information • Construct mini-ontologies from tables • Discover inter-ontology mappings • Merge mini-ontologies into a growing ontology
Recognize Table Information
Country | Population (July 2001 est.) | Religion: Albanian Orthodox | Muslim | Roman Catholic | Shi’a Muslim | Sunni Muslim | other
Afghanistan | 26,813,057 | | | | 15% | 84% | 1%
Albania | 3,510,484 | 20% | 70% | 10% | | |
Construct Mini-Ontology • Input: the Country / Population / Religion table from the previous slide
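Constructing a mini-ontology from a normalized table can be sketched as deriving object sets and relationship sets from the table's columns. The `mini_ontology` function and the object-set name `Percent` are assumptions. The assignment of percentages to religions reflects one reading of the flattened table and is also an assumption.

```python
def mini_ontology(key, rows, value_names):
    """Derive object sets and relationship sets from a normalized table.
    `rows` maps each key value to a record; nested columns are dicts."""
    object_sets = {key}
    relationship_sets = set()
    for record in rows.values():
        for column, cell in record.items():
            object_sets.add(column)
            if isinstance(cell, dict):
                # A nested column relates the key, the column, and a measure.
                measure = value_names[column]
                object_sets.add(measure)
                relationship_sets.add((key, column, measure))
            else:
                relationship_sets.add((key, column))
    return object_sets, relationship_sets


# One reading of the table (column assignment of percentages is assumed).
rows = {
    "Afghanistan": {
        "Population": "26,813,057",
        "Religion": {"Shi'a Muslim": "15%", "Sunni Muslim": "84%", "other": "1%"},
    },
    "Albania": {
        "Population": "3,510,484",
        "Religion": {"Albanian Orthodox": "20%", "Muslim": "70%", "Roman Catholic": "10%"},
    },
}

objects, relationships = mini_ontology("Country", rows, {"Religion": "Percent"})
print(sorted(objects))
print(sorted(relationships))
```

Mini-ontologies produced this way from different tables would then be candidates for the mapping and merging steps.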
Semi-Automatic Annotation via Synergistic Bootstrapping (Based on Nested Schemas with Regular Expressions) • Build a page-layout, pattern-based annotator • Automate layout recognition based on examples • Auto-generate examples with extraction ontologies • Synergistically run pattern-based annotator & extraction-ontology annotator
Synergistic Execution • Extraction Ontology → Conceptual Annotator (ontology-based annotation) → Partially Annotated Document → Pattern Generation → Document Layout Patterns → Structural Annotator (layout-driven annotation) → Annotated Document
Form-Based Information Harvesting • Forms • General familiarity • Reasonable conceptual framework • Appropriate correspondence • Transformable to ontological descriptions • Capable of accepting source data • Instance recognizers • Some pre-existing instance recognizers • Lexicons • Automated extraction ontology creation?
Form Creation • Basic form-construction facilities: • single-entry field • multiple-entry field • nested form • …
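How a form might be transformed into an ontological description can be sketched as follows. The `Field` and `Form` classes, the `to_ontology` function, the `0:1` / `0:*` participation constraints, and the sample Car and Seller forms are all hypothetical. The slides state only that forms are transformable to ontological descriptions, not how.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    name: str
    multiple: bool = False           # multiple-entry vs. single-entry field
    nested: "Form | None" = None     # nested form, if any


@dataclass
class Form:
    title: str
    fields: list = field(default_factory=list)


def to_ontology(form):
    """Map a form to (object sets, relationship sets with participation constraints)."""
    object_sets = [form.title]
    relationship_sets = []
    for f in form.fields:
        maximum = "*" if f.multiple else "1"
        target = f.nested.title if f.nested else f.name
        relationship_sets.append((form.title, target, f"0:{maximum}"))
        if f.nested:
            nested_objects, nested_relationships = to_ontology(f.nested)
            object_sets += nested_objects
            relationship_sets += nested_relationships
        else:
            object_sets.append(f.name)
    return object_sets, relationship_sets


# Invented example form.
seller = Form("Seller", [Field("Phone")])
car = Form("Car", [Field("Price"), Field("Feature", multiple=True),
                   Field("Seller", nested=seller)])

objects, relationships = to_ontology(car)
print(objects)
print(relationships)
```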