Semantic Mappings for Data Mediation

Jayant Madhavan, University of Washington
Joint work with AnHai Doan, Pedro Domingos, and Alon Halevy

Charlie comes to town
• Query: Find houses with 2 bedrooms priced under 300K
• Sources: realestate.com, homeseekers.com, homes.com
Affiliates Meeting

Data Integration
• Query: Find houses with 2 bedrooms priced under 300K
• The query is posed against a mediated schema
• The mediated schema is mapped to source schema 1, source schema 2, and source schema 3
• One wrapper per source: realestate.com, homeseekers.com, homes.com

Semantic Mappings between Schemas
• Mediated schema: address, agent-name, agent-city, agent-state
• homes.com schema: area, contact-name, contact-address
• homes.com data:
  • Denver, CO | Laura Smith | Boulder, CO
  • Oakland, CA | Jean Brown | Davis, CA
• Mapping types shown: 1-1 mapping, complex mapping
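The difference between a 1-1 mapping and a complex mapping can be sketched in a few lines of Python. This is only an illustration: the slide does not say which homes.com column feeds which mediated attribute, so the pairing below (area to address, contact-address split into agent-city and agent-state) is an assumption, and the function names are hypothetical.

```python
# Hypothetical illustration: the column pairings are assumptions,
# not taken from the slide.

def split_city_state(value):
    """Split a 'City, ST' string into (city, state)."""
    city, state = (part.strip() for part in value.split(",", 1))
    return city, state

def to_mediated(row):
    """Translate one homes.com row into the mediated schema."""
    # Complex mapping: one source column yields two mediated attributes.
    agent_city, agent_state = split_city_state(row["contact-address"])
    return {
        "address": row["area"],             # 1-1 mapping (assumed)
        "agent-name": row["contact-name"],  # 1-1 mapping (assumed)
        "agent-city": agent_city,
        "agent-state": agent_state,
    }

rows = [
    {"area": "Denver, CO", "contact-name": "Laura Smith", "contact-address": "Boulder, CO"},
    {"area": "Oakland, CA", "contact-name": "Jean Brown", "contact-address": "Davis, CA"},
]
mediated = [to_mediated(r) for r in rows]
```

A 1-1 mapping only renames a column, while a complex mapping needs a transformation such as the split above.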

Why Schema Matching is Important
• An application with more than one schema needs schema matching
• Application areas: data integration, data translation, data warehousing, e-commerce, ontology matching
• Diagram labels: Enterprise 1, Enterprise 2, World-Wide Web, information agent, home users, Knowledge Base 1, Knowledge Base 2

Why Schema Matching is Difficult
• No access to exact semantics of concepts
  • Semantics not documented in sufficient detail
  • Schemas not adequately expressive to capture semantics
• Must rely on clues in schema & data
  • Using names, structures, types, data values, etc.
• Such clues can be unreliable
  • Synonyms: different names => same entity
    • area & address => location
  • Homonyms: same name => different entities
    • area => location or square-feet
• Done manually by domain experts
  • Expensive and time consuming

Previous work
• Mostly ad-hoc heuristics
  • Name matchers
  • Data types
  • Sample domain values
• Graph matching
  • Schemas are labeled graphs
• No single heuristic works across scenarios
• Systems are fragile and need a lot of tuning

How do we go about it?
• Make extensive use of data instances
• Incorporate multiple heuristics
  • Base learners that implement individual heuristics
• Machine Learning
  • Multi-strategy learning to combine base learners
• Extensible framework
  • Easy to add new heuristics/learners
  • Generic and domain specific constraints
• Robust solution with high accuracy

Multiple hypotheses
Mediated schema: address, price, agent-name, agent-phone, office-phone, description
realestate.com schema: location, price, contact-name, contact-phone, office, comments
realestate.com data:
• Miami, FL | $250K | James Smith | (305) 729 0831 | (305) 616 1822 | Fantastic house
• Boston, MA | $320K | Mike Doan | (617) 253 1429 | (617) 112 2315 | Great location
Content matcher: if “fantastic” & “great” occur frequently in data instances => description
Name matcher: if “office” occurs in the name => office-phone

Base Learners
Content Learner training examples:
• (“Miami, FL”, address)
• (“$250K”, price)
• (“James Smith”, agent-name)
• (“(305) 729 0831”, agent-phone)
• (“(305) 616 1822”, office-phone)
• (“Fantastic house”, description)
• (“Boston, MA”, address)
Name Learner training examples:
• (“location”, address)
• (“price”, price)
• (“contact name”, agent-name)
• (“contact phone”, agent-phone)
• (“office”, office-phone)
• (“comments”, description)
Mediated schema: address, price, agent-name, agent-phone, office-phone, description
realestate.com schema: location, price, contact-name, contact-phone, office, comments
realestate.com data:
• Miami, FL | $250K | James Smith | (305) 729 0831 | (305) 616 1822 | Fantastic house
• Boston, MA | $320K | Mike Doan | (617) 253 1429 | (617) 112 2315 | Great location
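To make the idea of base learners concrete, here is a minimal Python sketch of a name learner and a content learner trained on (value, label) pairs like those on the slide. Both classes are toy stand-ins built on token overlap; they are assumptions for illustration and not the learners used in LSD.

```python
import re
from collections import defaultdict

def tokens(text):
    """Lowercase word tokens, e.g. 'contact-phone' -> ['contact', 'phone']."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NameLearner:
    """Scores a mediated label by token overlap with known element names."""
    def __init__(self):
        self.examples = []

    def train(self, pairs):
        self.examples = [(set(tokens(name)), label) for name, label in pairs]

    def predict(self, name):
        query = set(tokens(name))
        scores = defaultdict(float)
        for toks, label in self.examples:
            union = query | toks
            sim = len(query & toks) / len(union) if union else 0.0
            scores[label] = max(scores[label], sim)
        return dict(scores)

class ContentLearner:
    """Scores a label by the fraction of instance tokens seen under it."""
    def __init__(self):
        self.vocab = defaultdict(set)

    def train(self, pairs):
        for value, label in pairs:
            self.vocab[label].update(tokens(value))

    def predict(self, value):
        toks = tokens(value)
        if not toks:
            return {}
        return {label: sum(t in words for t in toks) / len(toks)
                for label, words in self.vocab.items()}

name_learner = NameLearner()
name_learner.train([("location", "address"), ("price", "price"),
                    ("contact name", "agent-name"), ("contact phone", "agent-phone"),
                    ("office", "office-phone"), ("comments", "description")])

content_learner = ContentLearner()
content_learner.train([("Miami, FL", "address"), ("$250K", "price"),
                       ("James Smith", "agent-name"), ("Fantastic house", "description"),
                       ("Boston, MA", "address")])

name_scores = name_learner.predict("office phone")
content_scores = content_learner.predict("Beautiful house")
```

Each learner returns a score per mediated-schema label, so their outputs can later be combined rather than trusted individually.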

Learning Source Descriptions (LSD) [SIGMOD’01]
Training Phase
• Inputs: mediated schema, source schemas
• Training data for base learners
• Base-Learner1 .... Base-Learnerk produce Hypothesis1 .... Hypothesisk
• Meta-Learner produces weights for base learners
Matching Phase
• Base-Learner1 .... Base-Learnerk and the Meta-Learner produce predictions for data instances
• Prediction Combiner produces predictions for elements
• Constraint Handler applies domain constraints
• Output: mappings
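The combination step in the matching phase can be sketched as a weighted sum of base-learner scores. The scores and weights below are made-up numbers for illustration; in LSD the weights come from the meta-learner during training, and how they are learned is not shown here.

```python
def combine(base_predictions, weights):
    """Weighted sum of base-learner scores for each candidate label."""
    labels = {label for preds in base_predictions.values() for label in preds}
    return {
        label: sum(weights[name] * preds.get(label, 0.0)
                   for name, preds in base_predictions.items())
        for label in labels
    }

# Hypothetical scores for one source element from two base learners.
base_predictions = {
    "name": {"office-phone": 0.5, "agent-phone": 0.3},
    "content": {"office-phone": 0.4, "agent-phone": 0.4},
}
# Hypothetical per-learner weights (assumed, not learned here).
weights = {"name": 0.6, "content": 0.4}

combined = combine(base_predictions, weights)
best = max(combined, key=combined.get)  # "office-phone"
```

The content learner alone cannot separate the two phone attributes in this example, but the weighted combination can because the name learner breaks the tie.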

LSD’s performance
• Chart axis: Avg. Matching Accuracy (%)
• LSD’s accuracy: 71 - 92%
• Best single base learner: 42 - 72%
• + Meta-learner: + 5 - 22%
• + Constraint handler: + 7 - 13%
• Complete LSD system: + 0.8 - 6%

Matching Ontologies of Concepts
• Each ontology has an inheritance tree (taxonomy) and data instances at the leaves.
• For each concept, find the most similar concept in the other ontology.
• Example taxonomies: CS Dept U.S. and CS Dept Australia
• Node labels in the figure: Undergrad Courses, Grad Courses, Courses, People, Staff, Faculty, AcademicStaff, TechnicalStaff, Assistant Professor, Associate Professor, Professor, Senior Lecturer, Lecturer

The Glue System [WWW’2002]
• No manually performed mappings
  • Automatically collect training data for base learners.
• Similarity measures computed from the joint probability distribution of concepts
  • A random data instance can belong to both concepts, to exactly one of them, or to neither: P(A,B), P(A,B’), P(A’,B), P(A’,B’).
• General framework for incorporating constraints
  • Extension of relaxation labeling.
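As a rough sketch of how a similarity measure can be computed from the joint distribution, the Python below estimates the four probabilities by counting and then applies the Jaccard coefficient, P(A,B) / (P(A,B) + P(A,B’) + P(A’,B)). Jaccard is used here as one possible similarity function; the slide does not name a specific one. The instance names and membership sets are hypothetical, and membership is given directly rather than predicted by learners.

```python
def joint_distribution(instances, in_a, in_b):
    """Estimate P(A,B), P(A,B'), P(A',B), P(A',B') by counting instances."""
    counts = {"A,B": 0, "A,B'": 0, "A',B": 0, "A',B'": 0}
    for x in instances:
        key = ("A" if in_a(x) else "A'") + "," + ("B" if in_b(x) else "B'")
        counts[key] += 1
    n = len(instances)
    return {k: v / n for k, v in counts.items()}

def jaccard(p):
    """Jaccard coefficient: P(A and B) / P(A or B)."""
    denom = p["A,B"] + p["A,B'"] + p["A',B"]
    return p["A,B"] / denom if denom else 0.0

# Hypothetical instances and concept memberships.
instances = ["ann", "bob", "cat", "dan", "eve", "fay", "gus", "hal"]
concept_a = {"ann", "bob", "cat"}   # e.g. a concept in ontology 1
concept_b = {"bob", "cat", "dan"}   # e.g. a concept in ontology 2

p = joint_distribution(instances, concept_a.__contains__, concept_b.__contains__)
similarity = jaccard(p)  # 0.25 / (0.25 + 0.125 + 0.125) = 0.5
```

Because every similarity function of this kind is defined on the same four probabilities, a different measure can be swapped in without changing how the distribution is estimated.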

The Glue System
• Input: Taxonomy O1 (tree structure + data instances), Taxonomy O2 (tree structure + data instances)
• Distribution Estimator: base learners and a meta learner
• Output of the estimator: joint probability distribution P(A,B), P(A’,B)…
• Similarity Estimator: applies a similarity function to produce a similarity matrix
• Relaxation Labeling: uses common knowledge & domain constraints
• Output: mappings for O1, mappings for O2

Glue’s performance

Conclusion and Future Work
• LSD and Glue perform well
  • Combine predictions of different base learners
  • Incorporate constraints
  • Robust solution that results in good accuracy
• Future Work
  • Representation mapping system
    • Incorporates various heuristics
    • Can perform complex mappings
    • Can learn with experience.
  • Reasoning about mappings
    • Does a mapping enable answering of queries posed on other schema?
    • Is one mapping implied by another? Is a mapping minimal?
    • Can mappings be composed?
