iTrails: Pay-as-you-go Information Integration in Dataspaces

Authors: Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi (ETH Zurich), VLDB 2007
Anat Heilper, Jan. 2009, CS Seminar in Databases (236826)

Problem: Querying Heterogeneous Data Sources
• Query: What is the impact of the global depression in Israel?
• Systems: ?
• Data sources: Laptop, Email Server, Web Server, DB Server

Solution 1: Use a Search Engine
• Query: global depression Israel
• System: Graph IR search engine, e.g. TopX [VLDB05], FleXPath [SIGMOD04], XSearch [VLDB03], XRank [SIGMOD03]
• Data sources (Web Server, DB Server, Email Server, Laptop) contribute text and links
• Drawback: query semantics are not precise!

Solution 2: Use an Information Integration System
• Query interface over a global schema (countries, unemployment, price index, crime rate)
• Source schemas of data sources 1, 2, and 3 are mapped to the global schema
• Drawback: too much effort to provide schema mappings!

Querying heterogeneous data sources: two opposite approaches
• Schema-first approach (SFA)
  • Semantically integrated view over the data sources
  • Mappings between source schemas and a mediated schema
  • Queries have clearly defined semantics
  • Expensive to construct and maintain
  • Not all data sources have schemas
• No-schema approach (NSA)
  • Keyword search
  • Requires good result ranking methods
  • Performs no integration
  • Query semantics are not well defined

Motivation of iTrails
• Two extremes: a Graph IR search engine (text, links) and a data integration system (schemas such as Temps, Cities, CO2, Sunspots)
• Can we find an integration solution in between these two extremes? A dataspace system
• The more effort you pay, the more query power you have

iTrails Core Idea: Add Integration Hints Incrementally
1) Provide a search service over the data
  • Use a general graph data model (iDM)
  • Handles unstructured documents, XML, and relations
2) Add integration semantics via hints (trails)
3) If more semantics are needed, apply trails
• Smooth transition between search and data integration
• Semantics are added incrementally to improve precision / recall

Example of an iDM graph
• Figure: folder tree rooted at home, with nodes Mike, papers, PIM, SIGMOD42.pdf, SIGMOD44.pdf, QP, VLDB12.pdf, VLDB10.pdf, projects, PIM, SIGMOD42.pdf
• Node 1: X1 = {.name = ‘home’, .tuple = {.owner = ‘root’, .lastmodified = ‘05.01.2000’}, .content = ‘’}
• Node 2: X2 = {.name = ‘mike’, .tuple = {.owner = ‘root’, .lastmodified = ‘04.17.2008’}, .content = ‘’}
• . . .
• Node 5: X5 = {.name = ‘SIGMOD42.pdf’, .tuple = {size = 10k, .owner = ‘mike’, .lastmodified = ‘04.01.2007’}, .content = ‘@PDF . . .’}
• . . .

General graph data model: iDM
• iDM (iMeMex Data Model) represents every structural component of the input data as a node
• Supports unstructured, semi-structured, and structured data, e.g., files and folders, XML, relations

iMeMex: integrated memex
• Vannevar Bush introduced the concept of the “memex” in 1945: "device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility."
• Bush predicted: "Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified."

Data model
• Data is represented by a directed graph G = (RV, E)
• RV = {V1, ..., Vn}: the set of resource views
• E: ordered pairs (Vi, Vj) of resource views
• Vi ⇝ Vj: Vj is reachable from Vi by traversing edges in E

Resource view
• A resource view Vi has three components: name, tuple, and content
• Example: {.name = ‘SIGMOD42.pdf’, .tuple = {size = 10k, .owner = ‘mike’, .lastmodified = ‘04.01.2007’}, .content = ‘@PDF . . .’}
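The data model can be sketched in a few lines of Python. This is only an illustration: the class names `ResourceView` and `Graph`, the `attrs` field (standing in for the .tuple component), and the small folder chain are assumptions, not part of iMeMex.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality: two views with equal fields stay distinct nodes
class ResourceView:
    name: str
    attrs: dict = field(default_factory=dict)  # stands in for the .tuple component
    content: str = ""


class Graph:
    """Directed graph G = (RV, E) over resource views."""

    def __init__(self):
        self.rv = []      # the resource views
        self.edges = {}   # id(view) -> list of child views

    def add(self, view, parent=None):
        self.rv.append(view)
        self.edges[id(view)] = []
        if parent is not None:
            self.edges[id(parent)].append(view)
        return view

    def reachable(self, start):
        """Views reachable from `start` by traversing edges."""
        seen, stack, out = set(), list(self.edges[id(start)]), []
        while stack:
            v = stack.pop()
            if id(v) in seen:
                continue
            seen.add(id(v))
            out.append(v)
            stack.extend(self.edges[id(v)])
        return out


g = Graph()
home = g.add(ResourceView("home", {"owner": "root", "lastmodified": "05.01.2000"}))
mike = g.add(ResourceView("mike", {"owner": "root", "lastmodified": "04.17.2008"}), home)
papers = g.add(ResourceView("papers"), mike)   # hypothetical intermediate folders
pim = g.add(ResourceView("PIM"), papers)
paper = g.add(ResourceView("SIGMOD42.pdf",
                           {"size": "10k", "owner": "mike", "lastmodified": "04.01.2007"},
                           "@PDF ..."), pim)

print([v.name for v in g.reachable(mike)])  # ['papers', 'PIM', 'SIGMOD42.pdf']
```

Identity-based equality is a deliberate choice here: the example graph contains two nodes named SIGMOD42.pdf, so names alone cannot identify a resource view.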

Query model
• Query expression: a query Q selects a set of nodes R := Q(G) ⊆ G.RV
  • Example: //mike/papers
• Component projection: C ∈ {.name, .tuple.<att_i>, .content} projects the set of resource views selected by query Q onto one component, i.e. the set of components R’ := {Vi.C | Vi ∈ Q(G)}

Component projection example
• Example: //mike//PIM/*.tuple.lastmodified
• Evaluated over the example iDM graph (home, Mike, papers, PIM, SIGMOD42.pdf, SIGMOD44.pdf, QP, VLDB12.pdf, VLDB10.pdf, projects, PIM, SIGMOD42.pdf)
• Node 1: X1 = {.name = ‘home’, .tuple = {.owner = ‘root’, .lastmodified = ‘05.01.2000’}, .content = ‘’}
• Node 2: X2 = {.name = ‘mike’, .tuple = {.owner = ‘root’, .lastmodified = ‘04.17.2008’}, .content = ‘’}
• . . .
• Node 5: X5 = {.name = ‘SIGMOD42.pdf’, .tuple = {size = 10k, .owner = ‘mike’, .lastmodified = ‘04.01.2007’}, .content = ‘@PDF . . .’}
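A rough sketch of how the projection example above could be evaluated. The node ids 3 and 4, the single-chain tree shape, and the helper names `evaluate` and `project` are assumptions made for illustration. A leading `/` is treated like `//` in this sketch.

```python
# Each node: (name, attrs, child ids). Ids 3 and 4 and the tree shape are illustrative.
NODES = {
    1: ("home", {"owner": "root", "lastmodified": "05.01.2000"}, [2]),
    2: ("mike", {"owner": "root", "lastmodified": "04.17.2008"}, [3]),
    3: ("papers", {}, [4]),
    4: ("PIM", {}, [5]),
    5: ("SIGMOD42.pdf", {"size": "10k", "owner": "mike", "lastmodified": "04.01.2007"}, []),
}


def descendants(node_id):
    """Yield all nodes reachable from node_id via child edges."""
    for child in NODES[node_id][2]:
        yield child
        yield from descendants(child)


def evaluate(steps):
    """Evaluate a path given as (separator, name) pairs; separator is '//' or '/'."""
    current = None  # None means "before the first location step"
    for sep, name in steps:
        if current is None:
            candidates = set(NODES)  # the first step searches the whole graph
        elif sep == "/":
            candidates = {c for n in current for c in NODES[n][2]}
        else:
            candidates = {d for n in current for d in descendants(n)}
        current = {n for n in candidates if name == "*" or NODES[n][0] == name}
    return current


def project(node_ids, attribute):
    """Component projection onto .tuple.<attribute>."""
    return {NODES[n][1][attribute] for n in node_ids if attribute in NODES[n][1]}


selected = evaluate([("//", "mike"), ("//", "PIM"), ("/", "*")])
print(project(selected, "lastmodified"))  # {'04.01.2007'}
```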

Syntax of query expressions
QUERY_EXPRESSION := (PATH | KT_PREDICATE) (union QUERY_EXPRESSION)*
PATH := (LOCATION_STEP)+
LOCATION_STEP := LS_SEP NAME_PREDICATE (`[` KT_PREDICATE `]`)?
LS_SEP := `//` | `/`
NAME_PREDICATE := `*` | (`*`)? VALUE (`*`)?
KT_PREDICATE := (KEYWORD | TUPLE) (LOGOP KT_PREDICATE)*
KEYWORD := `"` VALUE (WHITESPACE VALUE)* `"` | VALUE (WHITESPACE KEYWORD)*
TUPLE := ATTRIBUTE_IDENTIFIER OPERATOR VALUE
OPERATOR := `=` | `<` | `>`
LOGOP := `AND` | `OR`

Semantics of query expressions
• a : all nodes in the graph that have ‘a’ in their content
• //* : all nodes in the graph
• a b : all nodes in the graph that have ‘a’ and ‘b’ in their content
• //A : all nodes V in the graph such that V.name == ‘A’
• //A/B : all nodes V such that V.name == ‘B’ and there is an edge (W, V) from a node W with W.name == ‘A’
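The keyword part of these semantics can be illustrated with a tiny sketch. The three sample views and the function name `keyword_query` are made up for the example, and content matching is simplified to whitespace-separated tokens.

```python
# Hypothetical resource views; only name and content matter for keyword queries.
VIEWS = [
    {"name": "A", "content": "a b"},
    {"name": "B", "content": "a"},
    {"name": "C", "content": "c"},
]


def keyword_query(views, *keywords):
    """Names of views whose content contains every keyword (no keywords: all views)."""
    return [v["name"] for v in views
            if all(k in v["content"].split() for k in keywords)]


print(keyword_query(VIEWS))            # ['A', 'B', 'C']
print(keyword_query(VIEWS, "a"))       # ['A', 'B']
print(keyword_query(VIEWS, "a", "b"))  # ['A']
```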

Logical algebra for query expressions

Example

What have we seen so far?
• Problem: querying heterogeneous data sources
• Goal: find a solution between SFA and NSA
• A generic graph data model describes the data
• Queries describe paths in the graph

How do iTrails help?
• Queries are modified by hints (trails), which add or modify the search paths to look at
• Example: yesterday → //*[date = today() – 1]

iTrails: Defining Trails
• Basic form of a trail: QL[.CL] → QR[.CR]
  • QL, QR: queries (keyword and path expressions)
  • CL, CR: attribute projections
• Intuition: when I query for QL[.CL], you should also query for QR[.CR]

iTrails: Defining Trails
• Unidirectional trail: QL[.CL] → QR[.CR]
  • Intuition: when querying for QL[.CL], also query for QR[.CR]
• Bidirectional trail: QL[.CL] ↔ QR[.CR]
  • Example: ψi := //*.tuple.date ↔ //*.tuple.modified
• QL, QR are queries (keyword and path expressions); CL, CR are attribute projections
• Query examples: global warming zurich, or //Temperatures/*[celsius > 10]
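One way to hold trail definitions in code, as a sketch only: the `Trail` class and its `directed` method are hypothetical names, and the two sides are kept as plain strings rather than parsed queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trail:
    left: str                    # QL[.CL]
    right: str                   # QR[.CR]
    bidirectional: bool = False

    def directed(self):
        """Return the unidirectional trails this trail stands for."""
        forward = Trail(self.left, self.right)
        if not self.bidirectional:
            return [forward]
        return [forward, Trail(self.right, self.left)]


psi = Trail("//*.tuple.date", "//*.tuple.modified", bidirectional=True)
for t in psi.directed():
    print(t.left, "->", t.right)
```

Expanding a bidirectional trail into two unidirectional ones keeps the rewrite step simple, since it then only has to handle the left-to-right case.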

Trail Examples: Global Warming Zurich
• Query: global warming zurich
• Trail for implicit meaning: when querying for global warming, also query for temperature data above 10 degrees
  global warming → //Temperatures/*[celsius > 10]
• Trail for an entity: when querying for zurich, also query for references to zurich as a region
  zurich → //*[region = “ZH”]
Temperatures
city    region  date    celsius
Bern    BE      24-Sep  20
Uster   ZH      24-Sep  15
Zurich  ZH      25-Sep  14
Zurich  ZH      26-Sep  9
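The effect of the two trails on the Temperatures rows can be sketched as follows. The row dictionaries and the helper names `keyword` and `union` are assumptions, and keyword matching is simplified to exact, case-insensitive comparison against field values.

```python
ROWS = [
    {"city": "Bern",   "region": "BE", "date": "24-Sep", "celsius": 20},
    {"city": "Uster",  "region": "ZH", "date": "24-Sep", "celsius": 15},
    {"city": "Zurich", "region": "ZH", "date": "25-Sep", "celsius": 14},
    {"city": "Zurich", "region": "ZH", "date": "26-Sep", "celsius": 9},
]


def keyword(word):
    """Predicate: some field of the row equals the keyword (case-insensitive)."""
    return lambda row: any(word.lower() == str(v).lower() for v in row.values())


def union(*preds):
    """Predicate: the row satisfies at least one of the given predicates."""
    return lambda row: any(p(row) for p in preds)


# zurich, without and with the trail zurich -> //*[region = "ZH"]
zurich_plain = keyword("zurich")
zurich_with_trail = union(keyword("zurich"), lambda r: r["region"] == "ZH")

# right-hand side of the trail global warming -> //Temperatures/*[celsius > 10]
warm = lambda r: r["celsius"] > 10

print([r["city"] for r in ROWS if zurich_plain(r)])       # ['Zurich', 'Zurich']
print([r["city"] for r in ROWS if zurich_with_trail(r)])  # ['Uster', 'Zurich', 'Zurich']
print([r["city"] for r in ROWS if warm(r)])               # ['Bern', 'Uster', 'Zurich']
```

With the entity trail, the Uster row is found even though it never mentions the keyword zurich.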

Trail Example: Deep Web Bookmarks
• Query: train home
• Trail for a bookmark: when querying for train home, also query the train website (Web Server) with origin = Tel Aviv Uni and destination = Haifa Hof Hacarmel
  train home → //trainCompany.com//*[origin = “Tel Aviv Uni” and dest = “Haifa Hof Hacarmel”]

Trail Examples: Thesauri, Dictionaries, Language-agnostic Search
• Trail for thesauri: when querying for car, also query for auto
  car → auto
• Trails for dictionaries: when querying for car, also query for carro and vice versa
  car → automobile
  automobile → car
• Data sources: Email Server, Laptop

Trail Examples: Schema Equivalences
• Schemas (DB Server): Employee(empId, empName, salary), Person(SSN, name, age, income)
• Trail for schema match on names: when querying for Employee.empName, also query for Person.name
  //Employee//*.tuple.empName → //Person//*.tuple.name
• Trail for schema match on salaries: when querying for Employee.salary, also query for Person.income
  //Employee//*.tuple.salary → //Person//*.tuple.income

How are Trails Created?
• Given by the user
  • Explicitly
  • Via relevance feedback
• (Semi-)automatically
  • Automatic schema matching
  • Ontologies and thesauri (e.g., WordNet)
  • User communities (e.g., trails on gene data, bookmarks)

Uncertainty and Trails
• Probabilistic trails
  • Model uncertain trails
  • Probabilities are used to rank trails
  • QL[.CL] → QR[.CR] with probability p, 0 ≤ p ≤ 1
  • Example: car → auto, p = 0.9
  • The probability p reflects the likelihood that results obtained through the trail are correct

Uncertainty and Trails (continued)
• Scored trails
  • Give a higher value to certain trails
  • Scoring factors boost the scores of results obtained through the trail
  • QL[.CL] → QR[.CR] with scoring factor sf ≥ 1
  • Examples:
    • T1: weather → //Temperatures/*, sf ≥ 1
    • T2: yesterday → //*[date = today() – 1], sf ≥ 1
  • Intuition: sf reflects the relevance of the trail
  • Results obtained through the trail are scored sf times higher than results obtained without it
  • If no scoring factor is available, sf = 1
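A minimal sketch of how a scoring factor might be applied when merging results. The function name `merge_scores`, the result names, the base scores, and the rule that a result reached both ways keeps its best score are all assumptions, not taken from the paper.

```python
def merge_scores(original, via_trail, sf=1.0):
    """Merge two result sets given as dicts mapping result -> base score.

    Results reached through the trail are boosted by sf. A result reached
    both ways keeps its best score (an assumption of this sketch).
    """
    if sf < 1:
        raise ValueError("scoring factor must be >= 1")
    merged = dict(original)
    for result, score in via_trail.items():
        merged[result] = max(merged.get(result, 0.0), score * sf)
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)


ranked = merge_scores({"forecast.txt": 0.5}, {"Temperatures/row3": 0.4}, sf=2.0)
print([name for name, _ in ranked])  # ['Temperatures/row3', 'forecast.txt']
```

With sf = 1 the trail result would rank below the direct hit; the boost is what moves it to the top.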

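Rewriting Queries with Trails
• Query: weather yesterday
• Trail T2: yesterday → //*[date = today() – 1]
• (1) Matching: T2 matches the leaf yesterday
• (2) Transformation: the matched leaf becomes yesterday ∪ //*[date = today() – 1]
• (3) Merging: the transformed subtree is merged back into the query tree, giving weather (yesterday ∪ //*[date = today() – 1])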

Replacing Trails
• Trails that use replace instead of union semantics
• Query: weather yesterday
• Trail T2: yesterday → //*[date = today() – 1] (replace)
• (1) Matching: T2 matches the leaf yesterday
• (2) Transformation: the matched leaf is replaced by //*[date = today() – 1]
• (3) Merging: the rewritten query is weather //*[date = today() – 1]
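The matching, transformation, and merging steps, for both union and replace semantics, can be sketched on a toy query tree. Trees are nested tuples here, the `AND` root combining the two keywords is an assumption, and matching is reduced to exact equality on leaf strings.

```python
def rewrite(tree, left, right, replace=False):
    """Apply the trail left -> right once to every matching leaf of `tree`.

    Trees are nested tuples (op, child, child, ...); leaves are strings.
    """
    if isinstance(tree, str):
        if tree != left:                # (1) matching
            return tree
        if replace:                     # (2) transformation, replace semantics
            return right
        return ("UNION", tree, right)   # (2) transformation, union semantics
    op, *children = tree
    # (3) merging: the rewritten children take the place of the originals
    return (op, *[rewrite(c, left, right, replace) for c in children])


query = ("AND", "weather", "yesterday")
t2 = ("yesterday", "//*[date = today() - 1]")

print(rewrite(query, *t2))
# ('AND', 'weather', ('UNION', 'yesterday', '//*[date = today() - 1]'))
print(rewrite(query, *t2, replace=True))
# ('AND', 'weather', '//*[date = today() - 1]')
```

Calling `rewrite` a second time on the union result wraps the surviving `yesterday` leaf again, so naive repeated application never reaches a fixed point.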

Problem: Recursive Matches (1/2)
• T2: yesterday → //*[date = today() – 1]
• T2 matches the query weather yesterday and rewrites it to weather (yesterday ∪ //*[date = today() – 1])
• The new query still matches T2, so T2 could be applied again: yesterday ∪ //*[date = today() – 1] ∪ //*[date = today() – 1] ∪ ...
• Result: infinite recursion

Problem: Recursive Matches (2/2)
• Trails may be mutually recursive
• T3: //*.tuple.date → //*.tuple.modified
• T10: //*.tuple.modified → //*.tuple.date
• T3 matches //*[date = today() – 1] and adds //*[modified = today() – 1]
• T10 matches //*[modified = today() – 1] and adds //*[date = today() – 1]
• We again match T3 and enter an infinite loop

Algorithm to solve recursion: MMCA
• Multiple Match Coloring Algorithm (MMCA)
• Keep a history of all trails matched or introduced
• Given a set of trails Y, for every trail t in Y: apply t to the query Q iteratively and color the query tree nodes in Q according to the trails that have already touched them
• A trail is not matched again on nodes that already carry its color, which stops recursive rewrites

MultipleMatchColoringAlgorithm: example
• Query: weather yesterday
• Trails:
  • T1: weather → //Temperatures/*
  • T2: yesterday → //*[date = today() – 1]
  • T3: //*.tuple.date → //*.tuple.modified
  • T4: //*.tuple.date → //*.tuple.received
• First level: T1 matches weather, T2 matches yesterday
  • weather ∪ //Temperatures/*
  • yesterday ∪ //*[date = today() – 1]
• Second level: T3 and T4 match //*[date = today() – 1]
  • Adds //*[modified = today() – 1] and //*[received = today() – 1]
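The coloring idea can be sketched in a heavily simplified form. This is not the paper's algorithm: leaves are plain strings, substring search stands in for query-tree matching, and the function name `mmca` and its signature are assumptions. Each leaf carries the set of trails (colors) that produced it, and a trail is never applied to a leaf that already has its color.

```python
def mmca(leaves, trails, max_levels=None):
    """Expand query leaves with trails, level by level.

    leaves: list of leaf strings; trails: dict of trail id -> (left, right).
    Returns (leaf, colors) pairs for the original and all added leaves.
    """
    frontier = [(leaf, frozenset()) for leaf in leaves]
    result = list(frontier)
    level = 0
    while frontier and (max_levels is None or level < max_levels):
        next_frontier = []
        for leaf, colors in frontier:
            for tid, (left, right) in trails.items():
                if tid in colors or left not in leaf:
                    continue  # already colored by this trail, or no match
                next_frontier.append((leaf.replace(left, right), colors | {tid}))
        result.extend(next_frontier)
        frontier = next_frontier
        level += 1
    return result


TRAILS = {
    "T1": ("weather", "//Temperatures/*"),
    "T2": ("yesterday", "//*[date = today() - 1]"),
    "T3": ("date", "modified"),   # stands in for //*.tuple.date -> //*.tuple.modified
    "T4": ("date", "received"),   # stands in for //*.tuple.date -> //*.tuple.received
}

expanded = mmca(["weather", "yesterday"], TRAILS)
for leaf, colors in expanded:
    print(leaf, sorted(colors))
```

Every added leaf has strictly more colors than the leaf it came from, and there are only finitely many trails, so the loop terminates even when a reverse trail such as modified → date is added.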

MultipleMatchColoringAlgorithm (continued)
• MMCA is exponential in the number of levels
  • Any of the trails can be applied to every leaf, and each trail can generate additional leaves
• Solution: trail pruning
  • Limit the number of levels (punish recursive rewrites)
  • Keep only the top-K trails matched in each level
  • Rank by probability / certainty / weight
  • Other: timeout, progressively compute query results
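Top-K pruning of the trails matched in one level might look like the sketch below. The function name `prune_matches` and the (trail id, probability) pair format are assumptions, and probability is used as the only ranking criterion.

```python
import heapq


def prune_matches(matches, k):
    """Keep the k matched trails with the highest probability.

    matches: list of (trail_id, probability) pairs for one level.
    """
    return heapq.nlargest(k, matches, key=lambda m: m[1])


level_matches = [("T1", 0.9), ("T2", 0.5), ("T3", 0.7)]
print(prune_matches(level_matches, 2))  # [('T1', 0.9), ('T3', 0.7)]
```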

iTrails Evaluation in iMeMex
• Main questions in the evaluation
  • Quality: top-K precision and recall
  • Performance: use of materialization
  • Scalability: query-rewrite time vs. number of trails

iTrails Evaluation in iMeMex
• Scenario 1: few high-quality trails
  • Closer to information integration use cases
  • Obtained real datasets and indexed them
  • 18 hand-crafted trails
  • 14 hand-crafted queries
• Scenario 2: many low-quality trails
  • Closer to search use cases
  • Randomly generated up to 10,000 trails and queries with a mutual uniform match probability of 1%

iTrails Evaluation in iMeMex: Scenario 1
• Configured iMeMex to act in three modes
  • Baseline: Graph / IR search engine
  • iTrails: rewrite search queries with trails
  • Perfect Query: semantics-aware query
• Data: shipped to a central index
• [Table: data source sizes in MB for Email Server, Web Server, DB Server, Laptop]

Trails and queries used in Scenario 1
• Max original tree size: 14
• Max final tree size after applying trails: 35
• Max number of trails applied: 5

Quality: Top-K Precision and Recall (k = 20)
• Scenario 1: few high-quality trails (18 trails)
• [Plot: top-K precision and recall per query]
• Perfect Query always has precision and recall equal to 1
• Search query is partially semantics-aware
• Search engine misses relevant results
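For reference, top-K precision and recall can be computed as in the sketch below. It uses the common textbook definitions, which may differ in detail from the exact measure used in the paper; the function name and the sample data are made up.

```python
def precision_recall_at_k(ranked, relevant, k=20):
    """Precision and recall of the first k ranked results.

    ranked: list of result ids, best first; relevant: set of relevant ids.
    """
    top = ranked[:k]
    hits = sum(1 for r in top if r in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


p, r = precision_recall_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=4)
print(p, r)
```

A query whose top-K list contains exactly the relevant results gets precision and recall of 1, which is the behavior of the Perfect Query baseline.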

Performance: Use of Materialization
• Scenario 1: few high-quality trails (18 trails)
• Trail merging adds overhead to query execution
• Trail materialization improves performance for almost all queries

Scalability: Query-rewrite Time vs. Number of Trails (Scenario 2)
• No pruning approach → exponential growth in the query plan sizes
• Query-rewrite time can be controlled with pruning

Summary
• First framework to explore pay-as-you-go information integration in dataspaces
• iTrails: a generic method to model semantic relationships gradually
• Trails are used to rewrite queries
• An algorithm controls recursive query rewrites

Personal opinion: advantages
• The method is incremental
  • Integrators can collect statistics, find the most common queries, and define trails for popular queries first
• Dynamic system: if the popular queries change over time, trails for less popular queries can be disabled to reduce system workload
• Trails can be defined independently by domain experts for each data domain

Personal opinion: disadvantages
• Trails are global: every rewritten query is evaluated over every data source
  • A trail can have a different meaning for different data sources
• For good query result quality, trails have to be defined manually, which is a problem for large systems
  • Solution: use machine learning techniques to improve automatic trail creation
• Overlaps and inconsistencies between trails are possible, since a query returns the union of the results satisfying all trails
  • Solution: trail mining and weighting would be helpful here

Questions?

Bibliography
• Marcos Antonio Vaz Salles, Jens-Peter Dittrich, Shant Kirakos Karakashian, Olivier René Girard, Lukas Blunschi. iTrails: Pay-as-you-go Information Integration in Dataspaces. ETH Zurich, 8092 Zurich, Switzerland. dbis.ethz.ch | iMeMex.org
• Michael Franklin (University of California, Berkeley), Alon Halevy (Google Inc. and U. Washington), David Maier (Portland State University). From Databases to Dataspaces: A New Abstraction for Information Management.
• Wikipedia: dataspace, http://en.wikipedia.org/wiki/Data_Spaces; memex, http://en.wikipedia.org/wiki/Vannevar_Bush
• iMeMex information: http://imemex.ethz.ch/

Backup slides

Multiple Match Coloring Algorithm Analysis
• Algorithm runtime parameters
  • L: number of leaves in query Q
  • M: maximum number of leaves in the query introduced by a trail
  • N: number of trails
  • d ∈ {1, ..., N}: number of levels
• Theorem: the maximum number of trail applications performed by MMCA and the maximum number of leaves in the merged query tree are both bounded by O(L · M^d)
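To get a feeling for the bound, the expression L · M^d can simply be tabulated. The values of L and M below are hypothetical, and the constant factor hidden by the O-notation is ignored.

```python
def leaf_bound(L, M, d):
    """Evaluate L * M**d, the growth term of the bound (constants ignored)."""
    return L * M ** d


# Hypothetical query with L = 2 leaves and trails adding at most M = 2 leaves each.
for d in range(1, 6):
    print(d, leaf_bound(2, 2, d))
```

The value doubles with every extra level in this setting, which is why limiting the number of levels is an effective pruning knob.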
