iTrails: Pay-as-you-go Information Integration in Dataspaces
Presented by Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi
ETH Zurich
2008-02-22
Summarized by Sungchan Park
Problem: Querying Several Sources Center for E-Business Technology
Solution #1: Use a Search Engine
Solution #2: Use an Information Integration System
iTrails Core Idea
• Is there an integration solution in between these two extremes?
• Declaratively add lightweight ‘hints’ to a search engine, allowing gradual enrichment of loosely integrated data sources
Example Scenario
• Query
  • “pdf yesterday”
• Hints (Trails)
  • The date attribute is mapped to the modified attribute
  • The date attribute is mapped to the received attribute
  • The yesterday keyword is mapped to a query for values of the date attribute equal to yesterday's date
  • The pdf keyword is mapped to a query for elements whose names end in pdf
Where Do Hints Come From?
• Given by the user
  • Explicitly
  • Via relevance feedback
• (Semi-)automatically
  • Information extraction techniques
  • Automatic schema matching
  • Ontologies and thesauri (e.g., WordNet)
  • User communities (e.g., trails on gene data, bookmarks)
• All these aspects are beyond the scope of this paper
Data and Query Model
• Data Model
  • All data is assumed to be represented by a logical graph G
  • Queries are also represented as graphs
Query Syntax
Query Example
• //Home/projects//*[“Mike”]
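The data model and the query example above can be made concrete with a small sketch. It is not iMeMex code: the `Node` class, the tree-shaped graph, the sample data, and the reading of `[“Mike”]` as a substring match on node content are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One resource in the logical graph G (hypothetical, simplified to a tree)."""
    name: str
    content: str = ""                      # text searched by keyword predicates
    children: list = field(default_factory=list)


def descendants(node):
    """Yield every node below `node`, mimicking the // axis."""
    for child in node.children:
        yield child
        yield from descendants(child)


def query_home_projects_mike(root):
    """Hand-coded evaluation of //Home/projects//*["Mike"]."""
    hits = []
    for home in [root, *descendants(root)]:
        if home.name != "Home":
            continue
        for projects in home.children:
            if projects.name != "projects":
                continue
            # Assumption: the keyword predicate is a substring test on content.
            hits.extend(n for n in descendants(projects) if "Mike" in n.content)
    return hits


# Made-up sample data.
root = Node("root", children=[
    Node("Home", children=[
        Node("projects", children=[
            Node("PIM", children=[Node("notes.txt", content="call Mike about the demo")]),
            Node("todo.txt", content="buy milk"),
        ]),
        Node("photos", children=[Node("mike.jpg", content="Mike at the lake")]),
    ]),
])

print([n.name for n in query_home_projects_mike(root)])
```

Only nodes below Home/projects are candidates, so the photo that mentions Mike is not returned.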
Basic Form of a Trail
• A unidirectional trail
• A bidirectional trail
Trail Example
• Trails in an example scenario
  • Trails
  • Given query
    • “pdf yesterday”
  • Transformed query
    • //*.pdf[modified=yesterday() OR received=yesterday()]
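The rewrite of “pdf yesterday” can be imitated at the string level. The sketch below is only meant to reproduce the transformed query shown on this slide; the table layout, the `rewrite` function, and the use of AND to combine several predicates are assumptions, and the real system rewrites logical query plans rather than strings.

```python
# Hypothetical trail tables; the names and layout are illustrative only.
KEYWORD_TRAILS = {
    "pdf": ("name", "//*.pdf"),                      # keyword -> name pattern
    "yesterday": ("predicate", "date=yesterday()"),  # keyword -> attribute predicate
}
ATTRIBUTE_TRAILS = {"date": ["modified", "received"]}  # attribute -> attributes


def rewrite(keyword_query):
    """Rewrite a keyword query into a path query using the trail tables."""
    path = "//*"
    predicates = []
    for keyword in keyword_query.split():
        if keyword not in KEYWORD_TRAILS:
            # No trail matches: keep the keyword as a plain keyword predicate.
            predicates.append('"' + keyword + '"')
            continue
        kind, target = KEYWORD_TRAILS[keyword]
        if kind == "name":
            path = target
        else:
            attribute, _, value = target.partition("=")
            # Expand the attribute through the attribute trails, if any match.
            targets = ATTRIBUTE_TRAILS.get(attribute, [attribute])
            predicates.append(" OR ".join(a + "=" + value for a in targets))
    if not predicates:
        return path
    if len(predicates) > 1:
        predicates = ["(" + p + ")" for p in predicates]
    return path + "[" + " AND ".join(predicates) + "]"


print(rewrite("pdf yesterday"))
```

With the four hints from the example scenario, the call prints the transformed query from the slide.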
iTrails Query Processing
• Matching
• Transforming
• Merging
iTrails Query Processing Example
• Given query: Q1 = //home/projects//*[“Mike”]
• Trail: Ψ8 := //home/*.name -> //calendar//*.tuple.category
• Resulting query: Q1{Ψ8} = //home/projects//*[“Mike”] ∪ //calendar//*[category=“projects”]//*[“Mike”]
• Utilizes G. Miklau and D. Suciu. Containment and Equivalence for an XPath Fragment. In PODS, 2002.
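The three steps (matching, transforming, merging) can be traced on this example with a deliberately simplified sketch. A regular expression stands in for the containment test, the function handles only this one trail, and reusing the bound name verbatim as the category value is an assumption; `apply_trail` is a hypothetical name.

```python
import re


def apply_trail(query):
    """Apply the trail //home/*.name -> //calendar//*.tuple.category to a path query.

    Returns the list of queries whose union forms the rewritten query.
    """
    # Matching: does the query select a named child of //home?
    # (A regex stands in for a real query-containment check.)
    match = re.match(r"//home/(\w+)(.*)$", query)
    if match is None:
        return [query]          # trail does not match; query is unchanged
    name, rest = match.groups()
    # Transforming: the bound name becomes the value of the category attribute.
    transformed = '//calendar//*[category="' + name + '"]' + rest
    # Merging: the original query and the transformation are unioned.
    return [query, transformed]


for part in apply_trail('//home/projects//*["Mike"]'):
    print(part)
```

The original query is always kept, so applying a trail can only add results, never remove them.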
Applying Multiple Trails
• MMCA (Multiple Match Colouring Algorithm)
• Without a safeguard, trails could be applied indefinitely
• To prevent infinite recursion, a trail must not be rematched to nodes in a logical plan that it generated itself
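The rematch rule can be illustrated with a toy version of the colouring idea. This is not the MMCA algorithm itself: plans are reduced to bare keywords, trails to keyword-to-keyword mappings, and the assumption that generated terms inherit the colours of the term they came from is made here for the sketch.

```python
def apply_trails(keywords, trails):
    """Expand keywords level by level; `trails` maps trail id -> (lhs, rhs).

    Every term carries the set of trails ("colours") that generated it,
    and a trail is never matched against a term carrying its own colour.
    """
    plan = [(kw, frozenset()) for kw in keywords]
    frontier = list(plan)
    while frontier:
        next_frontier = []
        for term, colours in frontier:
            for trail_id, (lhs, rhs) in trails.items():
                if term == lhs and trail_id not in colours:
                    # The new term inherits the colours and adds this trail's.
                    next_frontier.append((rhs, colours | {trail_id}))
        plan.extend(next_frontier)
        frontier = next_frontier
    return plan


# Two trails that would rewrite each other's output forever without the check.
trails = {"t1": ("car", "auto"), "t2": ("auto", "car")}
plan = apply_trails(["car"], trails)
print([term for term, _ in plan])
```

Because each trail can appear at most once in a term's colour set, the number of levels is bounded by the number of trails, and the loop terminates.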
Other Issues
• Trail Pruning
  • Problem: MMCA is exponential in the number of levels
  • Solution: trail pruning
    • Prune by number of levels
    • Prune by top-K trails matched in each level
      • Assign weights and probabilities to trails
    • Prune by both top-K trails and number of levels
• Trail Indexing
  • Precompute trail expressions to speed up query processing
  • Trail materialization
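The pruning strategies on this slide can be combined in a few lines. The sketch assumes that the matched trails of each level are already known and that trails are ranked by weight times probability; the slide only says that trails receive a weight and a probability, so the scoring formula, the `Trail` class, and the sample values are hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Trail:
    name: str
    weight: float
    probability: float

    @property
    def score(self):
        # Assumed ranking function; the slide does not specify one.
        return self.weight * self.probability


def prune(matched, k):
    """Keep the k best-scoring trails matched in one level."""
    return heapq.nlargest(k, matched, key=lambda t: t.score)


def apply_levels(matches_per_level, k, max_levels):
    """Prune by both the number of levels and the top-K trails per level."""
    return [prune(level, k) for level in matches_per_level[:max_levels]]


# Made-up matches for three levels.
levels = [
    [Trail("a", 1.0, 0.9), Trail("b", 0.5, 0.5), Trail("c", 1.0, 0.4)],
    [Trail("d", 0.8, 0.5)],
    [Trail("e", 1.0, 1.0)],
]
kept = apply_levels(levels, k=2, max_levels=2)
print([[t.name for t in level] for level in kept])
```

With K and the number of levels both fixed, the number of applied trails is at most their product, which is what keeps the rewrite time under control.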
Experiments
• Setting
  • iMeMex was configured to run in three modes
    • Baseline: graph / IR search engine
    • iTrails: rewrite search queries with trails
    • Perfect Query: semantics-aware query
• Data
Experiment, Quality
• Compared with the baseline
Experiment, Overhead
• Compared with the perfect query
• Overhead is not negligible
• However, this can be addressed by exploiting trail materializations
Experiment, Scalability #1
• Rewrite time
• Query-rewrite time can be controlled with pruning
Experiment, Scalability #2
• Quality
• Pruning improves precision
Conclusion
• Our Contributions
  • iTrails: a generic method to model semantic relationships (e.g., implicit meaning, bookmarks, dictionaries, thesauri, attribute matches, ...)
  • A framework and algorithms for pay-as-you-go information integration
  • Smooth transition between search and data integration
• Future Work
  • Trail creation
    • Use collections (ontologies, thesauri, Wikipedia)
    • Work on automatic mining of trails from the dataspace
  • Other types of trails