This paper presents a machine-learning approach to the challenge of reconciling schemas from disparate data sources in large-scale data integration systems. The primary bottleneck, creating semantic mappings between source schemas and the mediated schema, is addressed with a multi-strategy learning system that combines several machine-learning techniques, including an XML structure learner and a naïve Bayes learner. By imposing integrity constraints and leveraging user feedback, the method improves matching accuracy and efficiency, yielding an extensible framework that can adapt to diverse domains and cope with ambiguities in the sources.
Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach • AnHai Doan, Pedro Domingos, Alon Halevy
Problem & Solution • Problem • Large-scale Data Integration Systems • Bottleneck: Semantic Mappings • Solution • Multi-strategy Learning • Integrity Constraints • XML Structure Learner • 1-1 Mappings
Learning Source Descriptions (LSD) • Components • Base learners • Meta-learner • Prediction converter • Constraint handler • Operating Phases • Training phase • Matching phase
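A minimal sketch of how the LSD components listed above might fit together as interfaces; class and method names are illustrative assumptions, not the authors' actual code.

```python
# Illustrative LSD component interfaces (not the authors' implementation).

class BaseLearner:
    def train(self, examples):     # examples: list of (instance, mediated-schema label)
        raise NotImplementedError
    def predict(self, instance):   # returns {label: confidence score}
        raise NotImplementedError

class LSD:
    def __init__(self, base_learners, meta_learner, converter, constraint_handler):
        self.base_learners = base_learners            # e.g. name/content matchers, naive Bayes, XML learner
        self.meta_learner = meta_learner              # combines base-learner scores (stacking)
        self.converter = converter                    # turns combined scores into candidate 1-1 mappings
        self.constraint_handler = constraint_handler  # enforces domain constraints and user feedback

    def match(self, source_data):
        # source_data: {source_tag: extracted data instances}
        predictions = {}
        for tag, instances in source_data.items():
            scores = [bl.predict(instances) for bl in self.base_learners]
            predictions[tag] = self.meta_learner.combine(scores)
        candidates = self.converter.to_mappings(predictions)
        return self.constraint_handler.apply(candidates)
```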
Learners • Base Learners • Name Matcher (Whirl) • Content Matcher (Whirl) • Naïve Bayes Learner • County-Name Recognizer • XML Learner • Meta-Learner (Stacking)
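A minimal sketch of stacking as a meta-learner: for each mediated-schema label it learns how much to trust each base learner and combines their confidence scores as a weighted sum. Learning the weights by least squares is an illustrative choice made here, not necessarily the paper's exact procedure.

```python
import numpy as np

class StackingMetaLearner:
    def __init__(self, labels, n_base_learners):
        self.labels = labels
        # one weight vector per label, initialised uniformly
        self.weights = {l: np.ones(n_base_learners) / n_base_learners for l in labels}

    def train(self, stacked_examples):
        # stacked_examples: list of (base_scores, true_label), where base_scores
        # maps each label to the list of scores the base learners gave it
        for label in self.labels:
            X, y = [], []
            for base_scores, true_label in stacked_examples:
                X.append(base_scores[label])
                y.append(1.0 if true_label == label else 0.0)
            X, y = np.array(X), np.array(y)
            self.weights[label], *_ = np.linalg.lstsq(X, y, rcond=None)

    def combine(self, base_scores):
        # base_scores: {label: [score per base learner]} -> {label: combined score}
        return {l: float(np.dot(self.weights[l], base_scores[l])) for l in self.labels}
```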
Naïve Bayes Learner • Input instance = a bag of tokens
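A minimal sketch of a multinomial naïve Bayes learner over bags of tokens, assuming each training example is a (token list, mediated-schema label) pair; the Laplace smoothing and tokenisation choices are illustrative.

```python
import math
from collections import defaultdict

class NaiveBayesLearner:
    def train(self, examples):
        # examples: list of (tokens, label)
        self.label_counts = defaultdict(int)
        self.token_counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()
        for tokens, label in examples:
            self.label_counts[label] += 1
            for t in tokens:
                self.token_counts[label][t] += 1
                self.vocab.add(t)
        self.total = sum(self.label_counts.values())

    def predict(self, tokens):
        # score per label: log P(label) + sum over tokens of log P(token | label)
        scores = {}
        for label, count in self.label_counts.items():
            logp = math.log(count / self.total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for t in tokens:
                logp += math.log((self.token_counts[label][t] + 1) / denom)
            scores[label] = logp
        return scores
```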
XML Learner • Input instance = a bag of tokens, including both text tokens and structure tokens
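A minimal sketch of turning an XML element into a bag of text tokens plus structure tokens (child-tag names), which can then be fed to a bag-of-tokens classifier like the one above; the element and tag names are made up for illustration.

```python
import xml.etree.ElementTree as ET

def xml_to_token_bag(xml_string):
    root = ET.fromstring(xml_string)
    tokens = []
    for elem in root.iter():
        tokens.append("<" + elem.tag + ">")       # structure token
        if elem.text and elem.text.strip():
            tokens.extend(elem.text.split())      # text tokens
    return tokens

# Example: a house-listing fragment
bag = xml_to_token_bag(
    "<house><location>Miami, FL</location><price>250000</price></house>")
print(bag)  # ['<house>', '<location>', 'Miami,', 'FL', '<price>', '250000']
```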
Domain Constraint Handler • Domain Constraints • Impose semantic regularities on schemas and source data in the domain • Can be specified at the beginning • When creating a mediated schema • Independent of any actual source schema • Constraint Handler • Domain constraints + Prediction Converter + Users’ feedback + Output mappings
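A minimal sketch of how a constraint handler might use domain constraints: each constraint is a predicate over a complete assignment of source tags to mediated-schema labels, and the handler searches for the highest-scoring assignment that violates no constraint. The exhaustive search and the example constraint are illustrative assumptions.

```python
from itertools import product

def satisfies_all(assignment, constraints):
    return all(c(assignment) for c in constraints)

def best_assignment(candidate_scores, constraints):
    # candidate_scores: {source_tag: {label: combined learner score}}
    tags = list(candidate_scores)
    label_choices = [list(candidate_scores[t]) for t in tags]
    best, best_score = None, float("-inf")
    for labels in product(*label_choices):   # exhaustive search; fine for small schemas
        assignment = dict(zip(tags, labels))
        if not satisfies_all(assignment, constraints):
            continue
        score = sum(candidate_scores[t][l] for t, l in assignment.items())
        if score > best_score:
            best, best_score = assignment, score
    return best

# Example domain constraint: at most one source tag maps to HOUSE-PRICE
at_most_one_price = lambda a: sum(1 for l in a.values() if l == "HOUSE-PRICE") <= 1
```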
Training Phase • Manually Specify Mappings for Several Sources • Extract Source Data • Create Training Data for each Base Learner • Train the Base Learners • Train the Meta-Learner
Example 1 (Cont.) • Source data: (location: Miami, FL) • Training data created: (“location”, ADDRESS) and (“Miami, FL”, ADDRESS)
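A minimal sketch of how Example 1's manually mapped source element could be turned into training examples for the different base learners: the name matcher trains on the source-tag name and the content learners on the data value, each labelled with the mediated-schema element ADDRESS. Function and key names are illustrative.

```python
def make_training_examples(source_tag, value, mediated_label):
    return {
        "name_matcher":    (source_tag, mediated_label),     # ("location", "ADDRESS")
        "content_matcher": (value, mediated_label),          # ("Miami, FL", "ADDRESS")
        "naive_bayes":     (value.split(), mediated_label),  # (["Miami,", "FL"], "ADDRESS")
    }

print(make_training_examples("location", "Miami, FL", "ADDRESS"))
```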
Matching Phase • Extract and Collect Data • Match each Source-DTD Tag • Apply the Constraint Handler
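A minimal sketch of the matching phase for one source, reusing the illustrative interfaces sketched earlier: every base learner scores each source tag's extracted instances, the meta-learner combines the scores, and the constraint handler selects a consistent 1-1 mapping.

```python
def match_source(source_data, base_learners, meta_learner, constraint_handler, constraints):
    # source_data: {source_tag: extracted data instances}
    combined = {}
    for tag, instances in source_data.items():
        # collect each base learner's score for every mediated-schema label
        per_label = {}
        for bl in base_learners:
            for label, score in bl.predict(instances).items():
                per_label.setdefault(label, []).append(score)
        combined[tag] = meta_learner.combine(per_label)
    # e.g. constraint_handler = best_assignment from the constraint-handler sketch
    return constraint_handler(combined, constraints)
```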
Experimental Evaluation • Measures • Matching accuracy of a source • Average matching accuracy of a source • Average matching accuracy of a domain • Experiment Results • Average matching accuracy for different domains • Contributions of base learners and domain constraint handler • Contributions of schema information and instance information • Performance sensitivity to the amount of data instances
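A minimal sketch of the accuracy measures, assuming "matching accuracy of a source" means the fraction of matchable source-schema tags whose predicted mediated-schema element equals the manually specified one, and that the domain-level measure averages over sources.

```python
def source_accuracy(predicted, truth):
    # predicted, truth: {source_tag: mediated_label}; truth holds only matchable tags
    correct = sum(1 for tag, label in truth.items() if predicted.get(tag) == label)
    return correct / len(truth)

def domain_accuracy(per_source_accuracies):
    # average matching accuracy of a domain = mean over its sources
    return sum(per_source_accuracies) / len(per_source_accuracies)
```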
Limitations • Requires Enough Training Data • Domain-Dependent Learners • Ambiguities in Sources • Efficiency • Overlapping Schemas
Conclusion and Future Work • Improves over time • Extensible framework • Exploits multiple types of knowledge • Non-1-1 mappings?