Semantic Mappings for Data Mediation

Semantic Mappings for Data Mediation. Jayant Madhavan, University of Washington. Joint work with AnHai Doan, Pedro Domingos, and Alon Halevy. Motivating example: Charlie comes to town and wants to find houses with 2 bedrooms priced under 300K across realestate.com, homeseekers.com, and homes.com, a data integration problem.


Presentation Transcript


  1. Semantic Mappings for Data Mediation
     Jayant Madhavan, University of Washington
     Joint work with AnHai Doan, Pedro Domingos, and Alon Halevy

  2. Charlie comes to town
     Query: Find houses with 2 bedrooms priced under 300K
     Sources: realestate.com, homeseekers.com, homes.com
     Affiliates Meeting

  3. Data Integration
     Query: Find houses with 2 bedrooms priced under 300K
     [Diagram: the query is posed against a mediated schema, which maps to source schema 1, source schema 2, and source schema 3; each source schema reaches its site (realestate.com, homeseekers.com, homes.com) through a wrapper.]

  4. Semantic Mappings between Schemas
     Mediated schema: address, agent-name, agent-city, agent-state
     homes.com:
       area         contact-name  contact-address
       Denver, CO   Laura Smith   Boulder, CO
       Oakland, CA  Jean Brown    Davis, CA
     [Diagram: arrows between the two schemas are labeled "1-1 mapping" and "complex mapping".]

  5. Why Schema Matching is Important
     Application has more than one schema: need for schema matching!
     • Data integration
     • Data translation
     • Data warehousing
     • E-commerce
     • Ontology matching
     [Diagram labels: Enterprise 1, Enterprise 2, World-Wide Web, Information agent, Home users, Knowledge Base 1, Knowledge Base 2]

  6. Why Schema Matching is Difficult
     • No access to exact semantics of concepts
       • Semantics not documented in sufficient detail
       • Schemas not adequately expressive to capture semantics
     • Must rely on clues in schema & data
       • Using names, structures, types, data values, etc.
     • Such clues can be unreliable
       • Synonyms: different names => same entity: area & address => location
       • Homonyms: same name => different entities: area => location or square-feet
     • Done manually by domain experts
       • Expensive and time consuming

  7. Previous work
     • Mostly ad-hoc heuristics
       • Name matchers
       • Data types
       • Sample domain values
     • Graph matching
       • Schemas are labeled graphs
     • No single heuristic works across scenarios
     • Systems are fragile and need a lot of tuning

  8. How do we go about it?
     • Make extensive use of data instances
     • Incorporate multiple heuristics
       • Base learners that implement individual heuristics
     • Machine Learning
       • Multi-strategy learning to combine base learners
     • Extensible framework
       • Easy to add new heuristics/learners
       • Generic and domain-specific constraints
     • Robust solution with high accuracy

  9. Multiple hypotheses
     Mediated schema: address, price, agent-name, agent-phone, office-phone, description
     realestate.com:
       location    price  contact-name  contact-phone   office          comments
       Miami, FL   $250K  James Smith   (305) 729 0831  (305) 616 1822  Fantastic house
       Boston, MA  $320K  Mike Doan     (617) 253 1429  (617) 112 2315  Great location
     • Content matcher: If “fantastic” & “great” occur frequently in data instances => description
     • Name matcher: If “office” occurs in the name => office-phone

  10. Base Learners
      Mediated schema: address, price, agent-name, agent-phone, office-phone, description
      realestate.com:
        location    price  contact-name  contact-phone   office          comments
        Miami, FL   $250K  James Smith   (305) 729 0831  (305) 616 1822  Fantastic house
        Boston, MA  $320K  Mike Doan     (617) 253 1429  (617) 112 2315  Great location
      • Name Learner training pairs: (“location”, address), (“price”, price), (“contact name”, agent-name), (“contact phone”, agent-phone), (“office”, office-phone), (“comments”, description)
      • Content Learner training pairs: (“Miami, FL”, address), (“$250K”, price), (“James Smith”, agent-name), (“(305) 729 0831”, agent-phone), (“(305) 616 1822”, office-phone), (“Fantastic house”, description), (“Boston, MA”, address)
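
The training pairs on the Base Learners slide can be turned into a small runnable illustration. The sketch below is not LSD's actual learners: `TokenLearner`, `tokens`, and the token-frequency scoring are hypothetical stand-ins chosen for brevity, and the test inputs (`office_phone`, `Fantastic view`) are made up. It only shows the idea that one learner is trained on element names and another on data values, and that each returns a score per mediated-schema label.

```python
import re
from collections import Counter, defaultdict


def tokens(text):
    """Split text into lowercase word and digit tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+|\d+", text.lower())


class TokenLearner:
    """Toy base learner (hypothetical): scores each label by how often the
    input's tokens co-occurred with that label in the training pairs."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # label -> token -> count

    def train(self, pairs):
        for text, label in pairs:
            self.counts[label].update(tokens(text))

    def predict(self, text):
        """Return a normalised {label: score} distribution for the input text."""
        toks = tokens(text)
        scores = {}
        for label, counter in self.counts.items():
            total = sum(counter.values())
            scores[label] = sum(counter[t] for t in toks) / total if total else 0.0
        norm = sum(scores.values())
        if norm == 0:  # no token seen before: fall back to a uniform guess
            return {label: 1.0 / len(scores) for label in scores}
        return {label: s / norm for label, s in scores.items()}


# Training pairs copied from the slide.
name_learner = TokenLearner()
name_learner.train([
    ("location", "address"), ("price", "price"),
    ("contact name", "agent-name"), ("contact phone", "agent-phone"),
    ("office", "office-phone"), ("comments", "description"),
])

content_learner = TokenLearner()
content_learner.train([
    ("Miami, FL", "address"), ("$250K", "price"),
    ("James Smith", "agent-name"), ("(305) 729 0831", "agent-phone"),
    ("(305) 616 1822", "office-phone"), ("Fantastic house", "description"),
    ("Boston, MA", "address"),
])

# Hypothetical inputs from a new, unseen source.
name_pred = name_learner.predict("office_phone")
content_pred = content_learner.predict("Fantastic view")
print(max(name_pred, key=name_pred.get))
print(max(content_pred, key=content_pred.get))
```

Keeping both learners behind the same `train`/`predict` interface is what lets a later stage combine them without knowing which heuristic each one implements.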

  11. Learning Source Descriptions (LSD) [SIGMOD’01]
      Training Phase:
      • Inputs: mediated schema, source schemas
      • Training data for base learners
      • Base-Learner1 .... Base-Learnerk produce Hypothesis1 .... Hypothesisk
      • Meta-Learner produces weights for base learners
      Matching Phase:
      • Base-Learner1 .... Base-Learnerk and Meta-Learner give predictions for data instances
      • Prediction Combiner gives predictions for elements
      • Constraint Handler applies domain constraints
      • Output: mappings
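
The matching phase above can be sketched as two small steps: a weighted combination of base-learner predictions for each data instance, followed by an aggregation over instances into one prediction per schema element. This is a minimal sketch under stated assumptions, not LSD's implementation: the function names, the weights (which LSD learns in training; the values here are invented), the example scores, and the choice of averaging as the aggregation are all hypothetical. The constraint handler is left out.

```python
def combine(predictions, weights):
    """Weighted sum of per-learner label distributions, renormalised.

    predictions: {learner_name: {label: score}}
    weights:     {learner_name: weight}  (hypothetical values in this sketch)
    """
    combined = {}
    for learner, dist in predictions.items():
        for label, score in dist.items():
            combined[label] = combined.get(label, 0.0) + weights[learner] * score
    total = sum(combined.values())
    if total == 0:
        return combined
    return {label: s / total for label, s in combined.items()}


def element_prediction(instance_dists):
    """Average instance-level distributions into one prediction for the element."""
    labels = {label for dist in instance_dists for label in dist}
    n = len(instance_dists)
    return {label: sum(d.get(label, 0.0) for d in instance_dists) / n
            for label in labels}


# Invented weights and scores for the source element "office".
weights = {"name": 0.6, "content": 0.4}
name_scores = {"office-phone": 0.7, "agent-phone": 0.3}

per_instance = [
    combine({"name": name_scores,
             "content": {"office-phone": 0.4, "agent-phone": 0.6}}, weights),
    combine({"name": name_scores,
             "content": {"office-phone": 0.5, "agent-phone": 0.5}}, weights),
]

element = element_prediction(per_instance)
best = max(element, key=element.get)
print(best, element)
```

In this made-up example the content scores alone are ambiguous, and the name learner's evidence tips the element toward office-phone, which is the kind of case where combining learners helps.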

  12. LSD’s performance
      [Chart: Avg. Matching Accuracy (%)]
      • LSD’s accuracy: 71 - 92%
      • Best single base learner: 42 - 72%
      • + Meta-learner: + 5 - 22%
      • + Constraint handler: + 7 - 13%
      • Complete LSD system: + 0.8 - 6%

  13. Matching Ontologies of Concepts
      • Each ontology has an inheritance tree (taxonomy) and data instances at the leaves.
      • For each concept, find the most similar concept in the other ontology.
      [Diagram: taxonomies for CS Dept U.S. and CS Dept Australia. Node labels: Undergrad Courses, Grad Courses, Courses, People, Staff, Faculty, Staff, AcademicStaff, TechnicalStaff, Assistant Professor, Associate Professor, Senior Lecturer, Professor, Lecturer, Professor]

  14. The Glue System [WWW’2002]
      • No manually performed mappings
        • Automatically collect training data for base learners.
      • Similarity measures computed from the joint probability distribution of concepts
        • A random data instance can belong to both concepts, to either one, or to neither: P(A,B), P(A,B’), P(A’,B), P(A’,B’).
      • General framework for incorporating constraints
        • Extension of relaxation labeling.
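
To make the joint-distribution idea concrete, the sketch below estimates P(A,B), P(A,B’), P(A’,B), P(A’,B’) from membership flags and then computes one possible similarity function over them, the Jaccard coefficient P(A,B) / (P(A,B) + P(A,B’) + P(A’,B)). The slide does not name a specific similarity function, so Jaccard is used here as an illustrative choice; the function names and the ten sample instances are hypothetical. In practice the membership flags would come from learned classifiers rather than being given.

```python
def joint_distribution(memberships):
    """Estimate the four joint probabilities from (in_A, in_B) boolean flags.

    Keys of the result: (True, True) = P(A,B), (True, False) = P(A,B'),
    (False, True) = P(A',B), (False, False) = P(A',B').
    """
    n = len(memberships)
    if n == 0:
        raise ValueError("need at least one instance")
    counts = {(True, True): 0, (True, False): 0,
              (False, True): 0, (False, False): 0}
    for in_a, in_b in memberships:
        counts[(bool(in_a), bool(in_b))] += 1
    return {key: c / n for key, c in counts.items()}


def jaccard_similarity(joint):
    """P(A and B) / P(A or B), computed from the joint distribution."""
    overlap = joint[(True, True)]
    union = overlap + joint[(True, False)] + joint[(False, True)]
    return overlap / union if union > 0 else 0.0


# Hypothetical sample: 3 instances in both concepts, 2 only in A,
# 1 only in B, and 4 in neither.
sample = ([(True, True)] * 3 + [(True, False)] * 2
          + [(False, True)] * 1 + [(False, False)] * 4)

joint = joint_distribution(sample)
similarity = jaccard_similarity(joint)
print(joint)
print(similarity)
```

Because every similarity function in this style reads only the four probabilities, swapping Jaccard for another measure would change `jaccard_similarity` and nothing else.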

  15. The Glue System
      [Diagram, from input to output:]
      • Taxonomy O1 and Taxonomy O2 (each: tree structure + data instances)
      • Distribution Estimator (Base Learners, Meta Learner)
      • Joint Probability Distribution: P(A,B), P(A’,B)…
      • Similarity Estimator (Similarity Function)
      • Similarity Matrix
      • Relaxation Labeling (Common Knowledge & Domain Constraints)
      • Mappings for O1, Mappings for O2

  16. Glue’s performance

  17. Conclusion and Future Work
      • LSD and Glue perform well
        • Combine predictions of different base learners
        • Incorporate constraints
        • Robust solution that results in good accuracy
      • Future Work
        • Representation mapping system
          • Incorporates various heuristics
          • Can perform complex mappings
          • Can learn with experience
        • Reasoning about mappings
          • Does a mapping enable answering of queries posed on the other schema?
          • Is one mapping implied by another? Is a mapping minimal?
          • Can mappings be composed?
