Unleashing Knowledge: Transforming Data into Wisdom

WoK: A Web of Knowledge • David W. Embley • Brigham Young University, Provo, Utah, USA

From a Web of Pages to a Web of Facts • Birthdate of my great grandpa Orson • Price and mileage of red Nissans, 1990 or newer • Location and size of chromosome 17 • US states with property crime rates above 1%

Toward a Web of Knowledge • Fundamental questions • What is knowledge? • What are facts? • How does one know? • Philosophy • Ontology • Epistemology • Logic and reasoning

Ontology • Existence: asks “What exists?” • Concepts, relationships, and constraints with a formal foundation

Epistemology • The nature of knowledge: asks “What is knowledge?” and “How is knowledge acquired?” • Populated conceptual model

Logic and Reasoning • Principles of valid inference: asks “What is known?” and “What can be inferred?” • For us, it answers what can be inferred (in a formal sense) from conceptualized data • Example: find price and mileage of red Nissans, 1990 or newer

Making this Work: How? • Distill knowledge from the wealth of digital web data • Annotate web pages • Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge • (Diagram: annotations on web pages yielding facts)

Turning Raw Symbols into Knowledge • Symbols: $ 11,500 117K Nissan CD AC • Data: price(11,500) mileage(117K) make(Nissan) • Conceptualized data: • Car(C123) has Price($11,500) • Car(C123) has Mileage(117,000) • Car(C123) has Make(Nissan) • Car(C123) has Feature(AC) • Knowledge • “Correct” facts • Provenance
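The progression from symbols to conceptualized data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the recognizer patterns, the `RECOGNIZERS` table, and the `conceptualize` function are hypothetical names invented here, not part of the presented system.

```python
import re

# Hypothetical recognizers (illustrative assumptions, not the talk's actual rules).
RECOGNIZERS = {
    "Price": re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*)"),
    "Mileage": re.compile(r"\b(\d+)K\b"),
    "Make": re.compile(r"\b(Nissan|Toyota|Honda)\b"),
    "Feature": re.compile(r"\b(AC|CD)\b"),
}


def conceptualize(car_id, text):
    """Turn raw symbols into (subject, relationship, object) facts about one car."""
    facts = []
    for concept, pattern in RECOGNIZERS.items():
        for match in pattern.finditer(text):
            value = match.group(1)
            if concept == "Price":
                value = "$" + value                    # normalize "$ 11,500" to "$11,500"
            elif concept == "Mileage":
                value = f"{int(value) * 1000:,}"       # expand "117K" to "117,000"
            facts.append((f"Car({car_id})", "has", f"{concept}({value})"))
    return facts


facts = conceptualize("C123", "$ 11,500 117K Nissan CD AC")
for fact in facts:
    print(" ".join(fact))
```

Turning these facts into knowledge would additionally require checking correctness and recording provenance (for example, the source page and text offset), which this sketch leaves out.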

Actualization (with Extraction Ontologies) Find me the price and mileage of all red Nissans – I want a 1990 or newer.

Data Extraction Demo

Semantic Annotation Demo

Free-Form Query Demo

Explanation: How it Works • Extraction Ontologies • Semantic Annotation • Free-Form Query Interpretation

Extraction Ontologies • Object sets (lexical, non-lexical) • Primary object set • Relationship sets • Participation constraints • Aggregation • Generalization/Specialization

Extraction Ontologies: Data Frame • Internal Representation: float • Values • External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})? • Left Context: $ • Key Word Phrase • Key Words: ([Pp]rice)|([Cc]ost)| … • Operators • Operator: > • Key Words: (more\s*than)|(more\s*costly)|…
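A data frame of this kind can be approximated as a small Python class bundling an internal type, a value recognizer, keyword phrases, and operator keywords. The class below is a minimal sketch: the `DataFrame` class, its method names, and the price pattern (which adds comma-grouped thousands) are assumptions made for illustration and are not the regular expressions shown on the slide.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DataFrame:
    """Hypothetical stand-in for an extraction-ontology data frame."""
    name: str
    internal_type: type            # internal representation, e.g. float
    value_pattern: str             # external representation; group 1 is the value
    keyword_pattern: str           # key word phrases that signal the concept
    operators: dict = field(default_factory=dict)  # operator symbol -> keyword regex

    def find_values(self, text):
        """Extract values and convert them to the internal representation."""
        return [self.internal_type(m.group(1).replace(",", ""))
                for m in re.finditer(self.value_pattern, text)]

    def mentions(self, text):
        """True if any key word phrase for this concept appears in the text."""
        return re.search(self.keyword_pattern, text) is not None

    def find_operator(self, text):
        """Return the first operator whose keywords appear in the text, if any."""
        for symbol, keywords in self.operators.items():
            if re.search(keywords, text):
                return symbol
        return None


# Assumed pattern: "$" as left context, then digits with optional thousands and cents.
price = DataFrame(
    name="Price",
    internal_type=float,
    value_pattern=r"\$\s*(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)",
    keyword_pattern=r"[Pp]rice|[Cc]ost",
    operators={">": r"more\s*than|more\s*costly"},
)

print(price.find_values("Asking $ 11,500 or best offer"))
print(price.find_operator("anything more than $5,000"))
```

Keeping recognizers declarative like this is what lets the same ontology apply to unseen pages in the same domain: nothing in the data frame depends on a particular page layout.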

Generality & Resiliency of Extraction Ontologies • Generality: assumptions about web pages • Data rich • Narrow domain • Document types • Single-record documents (hard, but doable) • Multiple-record documents (harder) • Records with scattered components (even harder) • Resiliency: declarative • Still works when web pages change • Works for new, unseen pages in the same domain • Scalable, but takes work to declare the extraction ontology

Semantic Annotation

Free-Form Query Interpretation • Parse Free-Form Query (with respect to data extraction ontology) • Select Ontology • Formulate Query Expression • Run Query Over Semantically Annotated Data

Parse Free-Form Query • “Find me the price and mileage of all red Nissans – I want a 1996 or newer” • Recognized terms: price, mileage, red, Nissan, 1996 • “or newer” recognized as the >= operator

Select Ontology “Find me the price and mileage of all red Nissans – I want a 1996 or newer”

Formulate Query Expression • Conjunctive queries and aggregate queries • Projection on mentioned object sets • Selection via values and operator keywords • Color = “red” • Make = “Nissan” • Year >= 1996 (>= operator)
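The projection and selection steps can be illustrated with a small Python sketch. Everything named here (`OBJECT_SETS`, `VALUE_RECOGNIZERS`, `OPERATOR_KEYWORDS`, `formulate`) is a hypothetical simplification of a car-ads ontology, assumed for the example rather than taken from the actual system.

```python
import re

# Object sets that can be projected, keyed by an assumed keyword pattern.
OBJECT_SETS = {
    "Price": r"\bprice\b",
    "Mileage": r"\bmileage\b",
}

# Value recognizers; group 1 captures the value (assumed patterns).
VALUE_RECOGNIZERS = {
    "Color": r"\b(red|blue|green|white|black)\b",
    "Make": r"\b(Nissan|Toyota|Honda)s?\b",
    "Year": r"\b(19\d{2}|20\d{2})\b",
}

# Operator keywords that may directly follow a value.
OPERATOR_KEYWORDS = [(r"or\s+newer", ">="), (r"or\s+older", "<=")]


def formulate(query):
    """Return (projection, selection) for a free-form query."""
    projection = [name for name, keywords in OBJECT_SETS.items()
                  if re.search(keywords, query, re.I)]
    selection = []
    for name, pattern in VALUE_RECOGNIZERS.items():
        match = re.search(pattern, query, re.I)
        if match is None:
            continue
        operator_symbol = "="                 # default: equality on the value
        tail = query[match.end():]
        for keywords, symbol in OPERATOR_KEYWORDS:
            if re.match(r"\s*" + keywords, tail, re.I):
                operator_symbol = symbol
        value = int(match.group(1)) if name == "Year" else match.group(1)
        selection.append((name, operator_symbol, value))
    return projection, selection


projection, selection = formulate(
    "Find me the price and mileage of all red Nissans - I want a 1996 or newer"
)
print(projection)
print(selection)
```

The selection conditions are implicitly conjoined, which matches the conjunctive-query case; aggregate queries are not covered by this sketch.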

Formulate Query Expression • For • Let • Where • Return

Run Query Over Semantically Annotated Data
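As a rough illustration of this step, the sketch below evaluates a conjunctive selection and a projection over a handful of annotated records. The records are invented sample data, and `run_query`, `OPS`, and `ANNOTATED_CARS` are hypothetical names; a real deployment would query the stored annotations rather than an in-memory list.

```python
import operator

# Map operator symbols from the query expression to Python comparisons.
OPS = {
    "=": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}

# Invented sample annotations, for illustration only.
ANNOTATED_CARS = [
    {"Car": "C123", "Make": "Nissan", "Color": "red", "Year": 1998,
     "Price": 11500, "Mileage": 117000},
    {"Car": "C124", "Make": "Nissan", "Color": "red", "Year": 1989,
     "Price": 2500, "Mileage": 160000},
    {"Car": "C125", "Make": "Toyota", "Color": "red", "Year": 2001,
     "Price": 9000, "Mileage": 90000},
]


def run_query(records, projection, selection):
    """Keep records satisfying every condition, then project the requested object sets."""
    results = []
    for record in records:
        if all(name in record and OPS[op](record[name], value)
               for name, op, value in selection):
            results.append({name: record.get(name) for name in projection})
    return results


results = run_query(
    ANNOTATED_CARS,
    projection=["Price", "Mileage"],
    selection=[("Color", "=", "red"), ("Make", "=", "Nissan"), ("Year", ">=", 1996)],
)
print(results)
```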

Great! But Problems Still Need Resolution • How do we create extraction ontologies? • Manual creation requires several dozen person-hours • Semi-automatic creation • TISP (Table Interpretation by Sibling Pages) • TANGO (Table ANalysis for Generating Ontologies) • Nested Schemas with Regular Expressions • Synergistic Bootstrapping • Form-based Information Harvesting • How do we scale up? • Practicalities of technology transfer and usage • Millions of queries over zillions of facts for thousands of ontologies

Manual Creation

Manual Creation • Library of instance recognizers • Library of lexicons

Automatic Annotation with TISP (Table Interpretation by Sibling Pages) • Recognize tables (discard non-tables) • Locate table labels • Locate table values • Find label/value associations

Recognize Tables • Layout Tables (discard) • Data Table • Nested Data Tables

Locate Table Labels • Examples: • Identification.Gene model(s).Protein • Identification.Gene model(s).2

Locate Table Labels • Examples: • Identification.Gene model(s).Gene Model • Identification.Gene model(s).2

Locate Table Values

Find Label/Value Associations • Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE28918
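One way to picture a label/value association is as a mapping from a (column label path, row label path) pair to a cell value. The Python sketch below assumes a simple table whose rows are identified by their position; the `associate` function is hypothetical, and every cell except `WP:CE28918` is a placeholder invented for the example.

```python
def associate(prefix, header, rows):
    """Pair each cell with its column label path and its row label path.

    Rows are labeled by 1-based position, an assumption that mirrors
    the ".2" in the example label "Identification.Gene model(s).2".
    """
    facts = {}
    for row_number, cells in enumerate(rows, start=1):
        for label, value in zip(header, cells):
            key = (f"{prefix}.{label}", f"{prefix}.{row_number}")
            facts[key] = value
    return facts


header = ["Gene Model", "Protein"]
rows = [
    ["placeholder-model-1", "placeholder-protein-1"],  # invented placeholder cells
    ["placeholder-model-2", "WP:CE28918"],
]

facts = associate("Identification.Gene model(s)", header, rows)
key = ("Identification.Gene model(s).Protein", "Identification.Gene model(s).2")
print(key, "=", facts[key])
```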

Interpretation Technique: Sibling Page Comparison

Interpretation Technique: Sibling Page Comparison • Same

Interpretation Technique: Sibling Page Comparison • Almost Same

Interpretation Technique: Sibling Page Comparison • Different • Same

Technique Details • Unnest tables • Match tables in sibling pages • “Perfect” match (table for layout: discard) • “Reasonable” match (sibling table) • Determine & use table-structure pattern • Discover pattern • Pattern usage • Dynamic pattern adjustment
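The idea behind sibling page comparison can be sketched as a cell-by-cell comparison of two tables taken from sibling pages: cells that stay the same are likely labels, and cells that differ are likely values. The code below is a simplified illustration; the function names, the `low` threshold, and the rule that only same-shaped tables are compared are all assumptions, not details confirmed by the slides.

```python
def cell_agreement(table_a, table_b):
    """Fraction of identical cells; 0.0 if the tables differ in shape."""
    if len(table_a) != len(table_b):
        return 0.0
    if any(len(ra) != len(rb) for ra, rb in zip(table_a, table_b)):
        return 0.0
    pairs = [(x, y) for ra, rb in zip(table_a, table_b) for x, y in zip(ra, rb)]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def classify_table(table_a, table_b, low=0.3):
    """'layout' for a perfect match, 'sibling' for a reasonable one, else 'unmatched'."""
    score = cell_agreement(table_a, table_b)
    if score == 1.0:
        return "layout"       # identical on both pages: layout table, discard
    if score >= low:          # assumed threshold for a "reasonable" match
        return "sibling"
    return "unmatched"


def split_labels_values(table_a, table_b):
    """Positions of unchanged cells (labels) and changed cells (values)."""
    labels, values = [], []
    for i, (ra, rb) in enumerate(zip(table_a, table_b)):
        for j, (x, y) in enumerate(zip(ra, rb)):
            (labels if x == y else values).append((i, j))
    return labels, values


page_a = [["Gene Model", "Protein"], ["model-a", "protein-a"]]
page_b = [["Gene Model", "Protein"], ["model-b", "protein-b"]]
menu = [["Home", "Search"]]

print(classify_table(page_a, page_b))
print(classify_table(menu, menu))
print(split_labels_values(page_a, page_b))
```

A fuller treatment would first unnest nested tables and would adjust the discovered pattern dynamically when a page deviates from it, as the bullets above indicate.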

Generated RDF

WoK Demo (via TISP)

Semi-Automatic Annotation with TANGO (Table Analysis for Generating Ontologies) • Recognize and normalize table information • Construct mini-ontologies from tables • Discover inter-ontology mappings • Merge mini-ontologies into a growing ontology

Recognize Table Information
Country | Population (July 2001 est.) | Religion: Albanian Orthodox | Muslim | Roman Catholic | Shi’a Muslim | Sunni Muslim | other
Afghanistan | 26,813,057 | | | | 15% | 84% | 1%
Albania | 3,510,484 | 20% | 70% | 10% | | |

Construct Mini-Ontology • Source table: Country, Population (July 2001 est.), and Religion (Albanian Orthodox, Muslim, Roman Catholic, Shi’a Muslim, Sunni Muslim, other), with rows for Afghanistan (26,813,057) and Albania (3,510,484)

Discover Mappings

Merge
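The TANGO steps of constructing a mini-ontology, applying discovered mappings, and merging can be pictured with a toy data structure. The sketch below is a loose illustration only: representing an ontology as a dictionary of object sets and binary relationship sets, the function names, and the second table ("Nation", "Capital") are all assumptions invented for the example.

```python
def build_mini_ontology(key, columns):
    """Treat each column as an object set and relate the key column to the others."""
    return {
        "object_sets": {key, *columns},
        "relationships": {(key, column) for column in columns},
    }


def merge(growing, mini, mapping):
    """Merge a mini-ontology into the growing ontology.

    `mapping` renames object sets of the mini-ontology to the names of
    matching object sets in the growing ontology.
    """
    def rename(name):
        return mapping.get(name, name)

    return {
        "object_sets": growing["object_sets"]
        | {rename(name) for name in mini["object_sets"]},
        "relationships": growing["relationships"]
        | {(rename(a), rename(b)) for a, b in mini["relationships"]},
    }


growing = build_mini_ontology("Country", ["Population", "Religion"])
mini = build_mini_ontology("Nation", ["Capital"])   # hypothetical second table
merged = merge(growing, mini, mapping={"Nation": "Country"})

print(sorted(merged["object_sets"]))
print(sorted(merged["relationships"]))
```

Here the mapping is supplied by hand; discovering such inter-ontology mappings automatically is the hard part that this sketch does not attempt.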

Semi-Automatic Annotation via Synergistic Bootstrapping (Based on Nested Schemas with Regular Expressions) • Build a page-layout, pattern-based annotator • Automate layout recognition based on examples • Auto-generate examples with extraction ontologies • Synergistically run pattern-based annotator & extraction-ontology annotator

Synergistic Execution • Extraction Ontology • Conceptual Annotator (ontology-based annotation) • Partially Annotated Document • Pattern Generation • Document Layout Patterns • Structural Annotator (layout-driven annotation) • Annotated Document

Form-Based Information Harvesting • Forms • General familiarity • Reasonable conceptual framework • Appropriate correspondence • Transformable to ontological descriptions • Capable of accepting source data • Instance recognizers • Some pre-existing instance recognizers • Lexicons • Automated extraction ontology creation?

Form Creation • Basic form-construction facilities: • single-entry field • multiple-entry field • nested form • …
