C. S. K. An Algebraic Approach to Articulate Ontologies for Information Integration. October 2002 Gio Wiederhold Prasenjit Mitra, Jan Jannink Stanford University. Stanford Computer Forum. Prof Fouad Tobagi – Stanford CSD contact
C S K An Algebraic Approach to Articulate Ontologies for Information Integration October 2002 Gio Wiederhold Prasenjit Mitra, Jan Jannink Stanford University Gio Wiederhold SKC 1
Stanford Computer Forum
• Prof Fouad Tobagi – Stanford CSD contact
• Facilitated access to students and faculty for Forum members, such as FXpal
• Interview students in Gates during Fall, Winter, and Spring Quarters (full-time and summer internships)
• Job Fair – 22 January 2003
• Annual Meeting – 19-20 March 2003
• Computer Security Workshop – 21 March 2003
• Monthly HTML newsletter
• Comprehensive website
• Student/Advisor/Mentor Program

Research Topics
All related to exploiting data
• Building large-scale information systems
• Service composition
• Scheduling in process and data flow
• Web services – Semantic web
• Scalable Knowledge Composition (SKC)
• Adding prediction to information systems (SimQL)
• Image databases
• Privacy protection
• Applications
• PharmGKB – with Medical Informatics
• RegNet – with Civil Engineering

Data and Knowledge
Information is created at the confluence of
• data -- the state, and
• knowledge -- the ability to select and project the state into the future
[Figure: a Knowledge Loop and a Data Loop. Labels: Storage, Education, Selection, Recording, Integration, Abstraction, Experience, State changes, Decision-making, Action]

Information Creation
• Application Layer: decision-makers at workstations
• Mediation Layer: value-added services
• Foundation Layer: data and simulation resources

Overview of SKC
• Setting: mediation – intelligent middleware
• Metainformation: ontologies of sources
• Problem: scalability of integration
• Solution: interoperation of sources by articulation
  - articulation generator
• New problem: composability of many sources
• New solution: Ontology Algebra
  - algebraic properties
• Results

What are Ontologies?
Ontologies list the terms and their relationships that allow communication among partners in enterprises (in machine-readable form).
• Relationships determine meaning: parent, school, company
• Databases use ontologies (implicitly) during design in their E-R diagrams, and represent the leaf nodes in their schemas
• Knowledge bases use ontologies (often implicitly) and add class definitions (to hold instances), constraints, and, sometimes, operations among the terms

Functions of Ontologies
• Enable precision in understanding
  - People = designers, implementors, users, maintainers
  - Systems = implementors = users = maintainers
• Share the cost of knowledge acquisition & maintenance
  - reuse encoded knowledge, remain up-to-date as domains change
• Enable information interoperation*
* Define the terms that link domains

Ancestors of Ontologies
• Lexicons: collect terms used in information systems
• Taxonomies: categorize, abstract, classify terms
• Schemas of databases: attributes, ranges, constraints
• Data dictionaries: systems with multiple files, owners
• Object libraries: grouped attributes, inheritance, methods
• Symbol tables: terms bound to implemented programs
• Domain object models (XML DTD): interchange terms
• . . .
[Arrow label: more knowledge formalized]

Establishing Ontologies
Top-down:
• Commonly acceptable UPPER layers
Domain-specific:
• Sharing tools
• Object based
Bottom-up:
• Pragmatic, TASK-specific collections
• Database schemas and models

Heterogeneity among Domains
If interoperation involves distinct domains, mismatch ensues
• Autonomy conflicts with consistency
• Local needs have priority
• Outside uses are a byproduct
Heterogeneity must be addressed
• Platform and operating systems
• Representation and access conventions
• Naming and ontology

Semantic Mismatches
Information comes from many autonomous sources
• Differing viewpoints (by source)
  - differing terms for similar items: { lorry, truck }
  - same terms for dissimilar items: trunk (luggage, car)
  - differing coverage: vehicles (DMV, AIA)
  - differing granularity: trucks (shipper, manuf.)
  - different scope: student (museum fee, Stanford)
• Hinders use of information from disjoint sources
  - missed linkages: loss of information, opportunities
  - irrelevant linkages: overload on user or application program
• Poor precision when merged
Still OK for web browsing, poor for business & science

Ontology Sharing
Three alternatives
• Create a committee to define everybody’s terms
  - Takes many years, until people are worn out
  - Ignored when changes make deviation necessary
• Get all terms and put them into a large model [ Cyc, UMLS, Federated Schemas, . . . ]
  - Can be rapid
  - Ignores conflicts
  - Hard to maintain (requires committee)
• Keep all terms distinct, except where sharing
  - Requires initial effort
  - Empowers participants

Proposed Language Solutions
Specify and define terminology usage: ontology
• Domain-specific ontologies (XML DTD assumption)
  - small, focused, cooperating groups
  - high quality; some examples: genomics, arthritis, Shakespeare plays
  - allows sharable, formal tools
  - ongoing, local maintenance affecting users: annual updates
  - poor interoperation; users still face inter-domain mismatches
• Cannot achieve global consistency
  - wonderful for users and their programs
  - too many interacting sources
  - long time to achieve: 2 sources (UAL, BA), 3 (+ trucks), 4, … all ?
  - costly maintenance, since all sources evolve
  - no world-wide authority to dictate conformance

An unsolved problem
Common assumption in assembling and integrating distributed information resources:
• The language used by the resources is the same
• Sublanguages used by the resources are subsets of a globally consistent language
This assumption is provably false.
Working towards the goal of global consistency is
1. naïve -- the goal cannot be achieved
2. inefficient -- languages are efficient in local contexts

General Ontologies?
• Have all the knowledge together
  - simple for customers of KBs
  - hard for owners of KBs
• Large KB will cover multiple domains
  - created by a committee -- slow
  - maintained by a committee -- costly
• Differences in level of abstraction -- efficiency
  - homeowner: nail
  - carpenter: sinker, brad, boxnail, . . .

Structural Heterogeneity

Mismatches in Logistics
[Figure: graphs of the TransCom Ontology and the United AL Ontology, connected by SubClassOf arcs and by Equ (equivalence) links. Node labels: Airline, Wing, Date, Passenger, Cargo, EstCost, Time, Flight, Orders, Mode, From, Type, To, Land, Air, FlightInfo, Schedule, Route, Sortie, FlightNumber, Materiel, Departure City, Arrival City, AFB, Equipment, Dep.Time, Airport, Arr.Time, Name, GEOLOC, Size, Code]

Domains and Consistency
• a domain will contain many objects
• the object configuration is consistent
• within a domain all terms are consistent &
• relationships among objects are consistent
• context is implicit
No committee is needed to forge compromises* within a domain
* Compromises hide valuable details
[Figure label: Domain Ontology]

SKC grounded definition
• Ontology: a set of terms and their relationships
• Term: a reference to real-world and abstract objects
• Relationship: a named and typed set of links between objects
• Reference: a label that names objects
• Abstract object: a concept which refers to other objects
• Real-world object: an entity instance with a physical manifestation

Domain-specific Expertise
Knowledge needed is huge
• Partition into natural domains
• Determine domain responsibility and authority
• Empower domain owners
• Provide tools
Consider interaction: a society of specialists

An Ontology Algebra
A knowledge-based algebra for ontologies
• Intersection: create a subset ontology
  - keep sharable entries
• Union: create a joint ontology
  - merge entries
• Difference: create a distinct ontology
  - remove shared entries
The Articulation Ontology (AO) consists of matching rules that link domain ontologies

Sample Operation: INTERSECTION
• Source Domain 1: owned and maintained by Store
• Source Domain 2: owned and maintained by Factory
• Result contains shared terms: terms useful for purchasing

INTERSECTION support
• Articulation ontology: matching rules that use terms from the 2 source domains
• Terms useful for purchasing
[Figure labels: Store Ontology, Factory Ontology]

Sample Intersections
Shoe Factory
• Material inventory { . . . }
• Employees { . . . }
• Machinery { . . . }
• Processes { . . . }
• Shoes { . . . }
Shoe Store
• Shoes { . . . }
• Customers { . . . }
• Employees { . . . }
Articulation ontology matching rules:
• size = size
• color = table(colcode)
• style = style
Department Store
• Anatomy { . . . } and Hardware: foot = foot
• Employees, Employees
• Nail (toe, foot), Nail (fastener)
• . . .

Other Basic Operations
• DIFFERENCE: material fully under local control
• UNION: merging entire ontologies, typically prior intersections
[Figure label: Articulation ontology]

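The three operations can be illustrated with a minimal Python sketch. It assumes each ontology is reduced to a set of source-qualified terms and the articulation ontology to a set of term pairs. The function names, the flattening of the Shoe Store and Shoe Factory entries into plain terms, and the treatment of `color = table(colcode)` as a simple link (the conversion table is ignored) are assumptions made for illustration, not the SKC implementation.

```python
def qualify(source, terms):
    """Tag each term with its source so equal spellings stay distinct."""
    return {(source, term) for term in terms}

# Hypothetical flattening of the Shoe Store / Shoe Factory example.
store = qualify("Store", {"Shoes", "Customers", "Employees",
                          "size", "color", "style"})
factory = qualify("Factory", {"Shoes", "Employees", "Machinery", "Processes",
                              "Material inventory", "size", "colcode", "style"})

# Articulation rules as (store term, factory term) pairs.
rules = {
    (("Store", "size"), ("Factory", "size")),
    (("Store", "color"), ("Factory", "colcode")),  # conversion table omitted
    (("Store", "style"), ("Factory", "style")),
}

def linked_terms(rules):
    """All terms mentioned by at least one matching rule."""
    return {term for pair in rules for term in pair}

def intersection(rules):
    """Keep sharable entries: the terms that some matching rule links."""
    return linked_terms(rules)

def difference(ontology, rules):
    """Remove shared entries: what remains is under local control."""
    return ontology - linked_terms(rules)

def union(o1, o2, rules):
    """Merge entries: linked pairs become one entry, the rest stay separate."""
    merged = {frozenset(pair) for pair in rules}
    unlinked = {frozenset([term]) for term in (o1 | o2) - linked_terms(rules)}
    return merged | unlinked
```

In this sketch `difference(store, rules)` leaves Shoes, Customers, and Employees with the store, and the two Employees terms stay distinct in the union because no rule links them.
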
Features of an algebra
• Operations can be composed
• Operations can be rearranged
• Alternate arrangements can be evaluated
• Optimization is enabled
• The record of past operations can be kept and reused

Sample Processing in HPKB
Question: What is the most recent year an OPEC member nation was on the UN security council (SC)? (Related to the DARPA HPKB Challenge Problem)
SKC resolves 3 sources
• CIA Factbook ‘96 (nation)
• OPEC (members, dates)
• UN (SC members, years)
SKC obtains the correct answer: 1996 (Indonesia)
Other groups obtained more, but factually wrong, answers
Problems resolved by SKC
• Factbook, a secondary source, has out-of-date OPEC & UN SC lists: Indonesia not listed; Gabon (left OPEC 1994)
• different country names: Gambia => The Gambia
• historical country names: Yugoslavia
• UN lists future security council members: Gabon 1999
• needed ancillary data

Interoperation via Articulation
At application definition time
• Match ontologies
• Establish articulation rules
• Record the process
At execution time
• Query rewriting
• Optimization based on an Ontology Algebra
For maintenance
• Regenerate rules using the stored formulation

Semi-automatic approach
Provide a library of automatic match heuristics
• Lexical Methods -- spelling
• Structural Methods -- relative graph position
• Reasoning-based Methods
• Nexus
• Hybrid Methods
• Iterative/Non-iterative Methods
GUI tool to
• display matches
• verify generated matches using a human expert
• let the expert also supply matching rules

Information Flow
[Figure labels: End-User, Expert, GUI Tool, Thesaurus, Articulation Generator, Query Engine, Art123, Art12, Ontology1, Ontology2, Ontology3, Source1, Source2, Source3]

Articulation Generator
Being built by Prasenjit Mitra
[Figure labels: Ont1, Ont2, Driver, Thesaurus, Context-based Word Relator, Phrase Relator, Semantic Network (Nexus), Structural Matcher, Human Expert, OntA]

Lexical Methods
• Preprocessing rules
  - Expert-generated seed rules, e.g., (Match O1.President O2.PrimeMinister)
  - Context-based preprocessing directives
• Thesaurus: synonyms, relationships
• Distance of words as a measure of relatedness

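A lexical matcher along these lines might be sketched as follows. This is an illustration only: the function name `lexical_score`, the tiny thesaurus, and the 0.9 weight for thesaurus hits are hypothetical, and word distance is approximated here by spelling similarity from Python's standard `difflib`, which is an assumption rather than the measure used in SKC.

```python
from difflib import SequenceMatcher

# Expert-generated seed rule taken from the slide's example.
SEED_RULES = {("President", "PrimeMinister")}

# Hypothetical thesaurus: each entry is an unordered synonym pair.
THESAURUS = {frozenset({"lorry", "truck"})}

def lexical_score(a, b):
    """Return a relatedness score in [0, 1] for two term labels."""
    # 1. Seed rules supplied by the expert always win.
    if (a, b) in SEED_RULES or (b, a) in SEED_RULES:
        return 1.0
    x, y = a.lower(), b.lower()
    # 2. Thesaurus synonyms get a fixed, arbitrarily chosen high score.
    if frozenset({x, y}) in THESAURUS:
        return 0.9
    # 3. Otherwise fall back to spelling similarity.
    return SequenceMatcher(None, x, y).ratio()
```

Candidate pairs scoring above some threshold would then be shown to the expert for confirmation rather than accepted automatically.
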
Tools to create articulations
Combine ontology graphs with expert selection based on spelling, graph matching, and a nexus derived from a dictionary (O.E.D.)
[Screenshot labels: Vehicle registration ontology, Vehicle sales ontology, Suggestions for articulations]

Tools to create articulations
Graph matcher for the articulation-creating expert
[Screenshot labels: Transport ontology, Vehicle ontology, Suggestions for articulations]

Continue from initial point
• Also suggest similar terms for further articulation:
  - by spelling similarity
  - by graph position
  - by term match repository
• Expert response to this articulation:
  1. Okay
  2. False
  3. Irrelevant
• All results are recorded
• Okays are converted into articulation rules

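The review loop described above can be sketched in a few lines of Python. The names `Response`, `review`, and `ask_expert` are hypothetical, and the verdicts in the usage example are invented for illustration; the sketch only shows that every response is recorded while only Okay responses become articulation rules.

```python
from enum import Enum

class Response(Enum):
    OKAY = 1
    FALSE = 2
    IRRELEVANT = 3

def review(suggestions, ask_expert):
    """Ask the expert about each suggested match.

    Returns (rules, log): `log` records every response,
    `rules` keeps only the suggestions answered with OKAY.
    """
    rules, log = [], []
    for pair in suggestions:
        response = ask_expert(pair)
        log.append((pair, response))
        if response is Response.OKAY:
            rules.append(pair)
    return rules, log

# Usage with a scripted stand-in for the human expert (invented verdicts).
scripted = {
    ("lorry", "truck"): Response.OKAY,
    ("Nail (toe)", "Nail (fastener)"): Response.FALSE,
    ("Rose", "Tulip"): Response.IRRELEVANT,
}
rules, log = review(list(scripted), scripted.get)
```

Keeping the full log, including rejections, is what lets the rules be regenerated later without asking the expert the same questions again.
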
Candidate Match Repository
• Term linkages automatically extracted from the 1912 Webster’s dictionary*
• Based on processing headwords and their definitions using algebra primitives
• Notice the presence of 2 domains: chemistry, transport
* free; other sources have been processed

Using the match repository

Navigating the match repository

Relative Arc Importance
• PageRank (Google) limitations
  - node oriented
  - high rank to words with little semantic value
  - conjunctions, articles: And, The
  - prepositions, pronouns: to, it
• Relative arc importance
  - contribution of source rank to target rank

ArcRank
• For all source nodes s and target nodes t in the graph:
  - sort the outgoing arcs of s by arc importance, rank by sorted order
  - sort the incoming arcs of t by arc importance, rank by sorted order
  - for each arc, compute its ArcRank [formula not preserved in this transcript]
• In ranking
  - equal values take the same rank
  - ranks are numbered consecutively

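The ranking convention stated on this slide (equal values share a rank, ranks are numbered consecutively) is a dense ranking, which the sketch below implements. The formula that combines the two ranks did not survive in this transcript, so the sketch assumes the ArcRank of an arc is the mean of its rank among the source's outgoing arcs and its rank among the target's incoming arcs, and that the most important arc gets rank 1. Both choices, and the function names, are assumptions. The relative arc importance values are taken as given input.

```python
from collections import defaultdict

def dense_ranks(values):
    """Map each value to its rank: ties share a rank, ranks are 1, 2, 3, ..."""
    ordered = sorted(set(values), reverse=True)  # assumption: largest first
    return {value: position + 1 for position, value in enumerate(ordered)}

def arc_rank(importance):
    """importance: {(source, target): relative arc importance}.

    Returns {(source, target): combined rank}, where the combination
    (mean of the two ranks) is an assumption of this sketch.
    """
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for (s, t), r in importance.items():
        outgoing[s].append(r)
        incoming[t].append(r)
    out_rank = {s: dense_ranks(vals) for s, vals in outgoing.items()}
    in_rank = {t: dense_ranks(vals) for t, vals in incoming.items()}
    return {
        (s, t): (out_rank[s][r] + in_rank[t][r]) / 2
        for (s, t), r in importance.items()
    }

# Toy graph with invented importance values.
example = arc_rank({
    ("a", "x"): 0.5, ("a", "y"): 0.5, ("a", "z"): 0.1, ("b", "x"): 0.9,
})
```
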
All Pairs Similarity
• Compute a similarity value for all node pairs from
  - the product of inbound arc importance vectors
  - the product of outbound arc importance vectors
• Similarity matrix
  - Initial state: nodes similar only to themselves
  - Node substitution: terms replace similar ones
  - Iterative convergence: bounded substitution

Examples (Verb)

Examples (Adverb)

Examples (Proper Noun)

Country Graphs

To be matched to

Declaratively Specified Rules
FORALL X,Y,Z connection(X,Z) <- connection(X,Y) and connection(Y,Z).
Input to the Inference Engine:
<connection> <from>Washington D.C.</from> <to>Frankfurt</to> </connection>
<connection> <from>Rhein Main AFB</from> <to>al-Jaber AB</to> </connection>
Using articulation rule:
<Equ> <Airport>Frankfurt</Airport> <Airport>Rhein Main AFB</Airport> </Equ>
Results via Inference Engine:
<connection> <from>Washington D.C.</from> <to>al-Jaber</to> </connection>

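The inference on this slide can be reproduced with a small Python sketch: the articulation rule makes two airport names equivalent, and the transitivity rule is then applied until no new connections appear. This is a naive fixed-point computation written for illustration, not the inference engine used in SKC; the function names are hypothetical, and the sketch writes the destination as "al-Jaber AB" throughout.

```python
def make_canonical(equivalences):
    """Build a function mapping each name to one representative per Equ class."""
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in equivalences:
        parent[find(a)] = find(b)
    return find

def closure(connections, equivalences):
    """Apply connection(X,Z) <- connection(X,Y) and connection(Y,Z) to a fixed point."""
    find = make_canonical(equivalences)
    facts = {(find(a), find(b)) for a, b in connections}
    while True:
        derived = {(x, z) for x, y in facts for y2, z in facts if y == y2}
        if derived <= facts:
            return facts
        facts |= derived

facts = closure(
    connections=[("Washington D.C.", "Frankfurt"),
                 ("Rhein Main AFB", "al-Jaber AB")],
    equivalences=[("Frankfurt", "Rhein Main AFB")],  # the articulation rule
)
```

Without the equivalence the two connections share no intermediate node, so the connection from Washington D.C. to al-Jaber AB is derivable only through the articulation rule.
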
Nexus-based methods
• Consult a Nexus, a network of related words derived from a dictionary
  - Example: Owner : Buyer
• Generate a similarity or relatedness measure
  - words that have similar words in their definitions are similar
  - Example: Rose : Tulip
• Get more semantically meaningful relationships from WordNet (synonyms, hypernyms)
  - Example: Employee : Person

Corpus-based
• Collect a set of text documents, preferably from the same domain
  - search using keywords in Google
• Build a context vector (1000-character neighborhood) for each word
• Compute word-pair similarity based on the cosine of the vectors
• Use word-pair similarity to find similarity among labels of nodes/edges

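The context-vector and cosine steps might look like the following Python sketch. Several details are assumptions: the 1000-character neighborhood is split evenly before and after each occurrence, vectors hold raw word counts with no weighting or stop-word removal, words cut at the window edge are counted as they appear, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

def context_vector(word, text, window=1000):
    """Count the words found within `window` characters around each occurrence."""
    half = window // 2
    vector = Counter()
    pattern = r"\b%s\b" % re.escape(word)
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        before = text[max(0, match.start() - half):match.start()]
        after = text[match.end():match.end() + half]
        neighborhood = before + " " + after
        vector.update(w.lower() for w in re.findall(r"[A-Za-z]+", neighborhood))
    return vector

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Word pairs whose vectors have a high cosine would then contribute to the similarity of the node or edge labels that contain them.
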