Provenance-Based Annotation Management System

An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew Tan, Gaurav Vijayvargiya

Annotation Management System • A system that is able to propagate meta-dataalong with the data as the data is being moved around • Main motivation • To trace the provenance and flow of data • Many other uses b2 transformation transformation b1 b2 b3 a1 a2 b1 a3 a1 a2 a1 a2 b3 transformation step: a query, an ETL rule, etc.

Cost Type Serves fine French Cuisine in elegant setting. Formal attire. Restaurant Pacifica $ Chinese $ Soho Kitchen & Bar American Cost Type Restaurant Extensive wine list! Peacock Alley $$$ French Bull & Bear $$$ Seafood Pacifica $ Chinese $ Soho Kitchen & Bar American Our Vision NYRestaurants Cost Type Restaurant Zip Peacock Alley $$$ French 10022 Bull & Bear $$$ Seafood 10022 Pacifica $ Chinese 10013 $ Soho Kitchen & Bar American 10022 Yummy chicken curry!! Cheap Restaurants All Restaurants

Other Applications • Keep information that cannot be otherwise stored in the current database design • Highlight wrong data • Erroneous data may be copied around but the comment that it is wrong goes along with it • Security and quality metric • Annotate security or quality levels of data items

Some Related Work • Idea is not new though propagation of annotations was never explicitly stated as provenance-based: • Wang & Madnick [VLDB 90], • Lee, Bressan & Madnick [WIDM 98], • Bernstein & Bergstraesser [IEEE Data Eng. 99] • Superimposed Information. Maier and Delcambre [WebDB 99] • Annotations of Web documents • Annotations on genomic sequences • Why-Provenance • Cui, Widom, & Wiener [CWW00]

Outline • pSQL queries • Semantics • CUSTOM propagation scheme • DEFAULT propagation scheme • DEFAULT-ALL propagation scheme • Implementation • System architecture • Experimental results

pSQL – an extension of SQL • A pSQL fragment: SELECT DISTINCT selectlist FROM fromlist WHERE wherelist PROPAGATE DEFAULT | DEFAULT-ALL | r1.A1 TO B1, …, rn.An TO Bn • A pSQL query is a union of pSQL fragments

The CUSTOM Scheme R • Propagate annotations according to user specification a c b h Result1 a SELECT DISTINCT A FROM R r PROPAGATE r.A TO A UNION SELECT DISTINCT A FROM R r PROPAGATE r.B TO A b Result annotation UNION a c b Result2 c h h

The DEFAULT Scheme R S • Propagate annotations according to where data is copied from a c d g b f h e SELECT DISTINCT B FROM R r PROPAGATE DEFAULT UNION SELECT DISTINCT B FROM S s PROPAGATE DEFAULT Result r.B TO B c g f e h s.B TO B natural semantics for tracing the provenance of data

Annotation Propagation under the DEFAULT Scheme R S a b Q1: SELECT DISTINCT r.A, r.B, s.C FROM R r, S s WHERE r.B = s.B PROPAGATE DEFAULT versus SELECT DISTINCT * FROM R NATURAL JOIN S PROPAGATE DEFAULT Ans1 =a a equivalent queries, but different annotated output Q2: Ans2 a b

The DEFAULT-ALL scheme • Propagate annotations according to where data is copied from according to allequivalent formulations of the given query • User QueryQ: • Compute the results of Q on a database D – idea: • E(Q) denotes the set of all queries that are equivalent to Q(more precisely, (*)). • Execute each query in E(Q) on the database D under the DEFAULT scheme, then combine the results under a. • SELECT DISTINCT r.A, s.B, s.C • FROM R r, S s • WHERE r.B = s.B • PROPAGATE DEFAULT-ALL the SQL query corresponding to Q (*)

Computing the results of a DEFAULT-ALL query • Question: Given a pSQL query Q with DEFAULT-ALL propagation scheme and a database D, can we compute the result of Q(D)? • Problem: There are infinitely many queries in E(Q). It is therefore impossible to execute every query in E(Q) in order to obtain the result of Q(D). • Solution: Compute a finite basis of E(Q) first

A Query Basis of Q • A query basis of Q, denoted as B(Q), is a finite set of pSQL queries (with default propagation scheme) such that: Uaq(D)=aUaq(D) • Given B(Q), we can execute each query in B(Q) and combine the results to obtain the result of Q(D). • Question: Given Q, does B(Q) always exist and how can we compute B(Q)? qB(Q) qE(Q)

Generating a Query Basis of Q • Given R(A,B) and S(B,C) • User queryQ: • Representative QueryQ0: The representative query propagates annotations according to where data is copied from or equivalently copied from. • SELECT DISTINCT r.A, s.B, s.C • FROM R r, S s • WHERE r.B = s.B • PROPAGATE DEFAULT-ALL Ans(x,y,z) :- R(x,y), S(y,z). • SELECT DISTINCT r.A, s.B, s.C • FROM R r, S s • WHERE r.B = s.B • PROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B Propagations under the default propagation scheme Additional propagation due to the equality r.B = s.B

Generating a Query Basis of Q • Auxiliary Queries: Q1: Q2: Ans(x,y,z) :- R(x,y), S(y,z), R(x,w). • SELECT DISTINCT r.A, s.B, s.C • FROM R r, S s, R r’ • WHERE r.B = s.B, r’.A = r.A • PROPAGATE r.A TO A, s.B TO B, s.C TO C, • r.B TO B, r’.A TO A Ans(x,y,z) :- R(x,y), S(y,z), S(w,z). • SELECT DISTINCT r.A, s.B, s.C • FROM R r, S s, S s’ • WHERE r.B = s.B, s’.C = s.C • PROPAGATE r.A TO A, s.B TO B, s.C TO C, • r.B TO B, s’.C TO C

Generating a Query Basis of Q • Auxiliary Queries: Q3: Q4: Ans(x,y,z) :- R(x,y), S(y,z), R(w,y). • SELECT DISTINCT r.A, s.B, s.C • FROM R r, S s, R r’ • WHERE r.B = s.B, r’.B = r.B • PROPAGATE r.A TO A, s.B TO B, s.C TO C, • r.B TO B, r’.B TO B Ans(x,y,z) :- R(x,y), S(y,z), S(y,w). • SELECT DISTINCT r.A, s.B, s.C • FROM R r, S s, S s’ • WHERE r.B = s.B, s’.B = s.B • PROPAGATE r.A TO A, s.B TO B, s.C TO C, • r.B TO B, s’.B TO B

Correctness of the Algorithm • For the example, a query basis of Q consists of Q0, Q1, Q2, Q3, and Q4. Theorem: Given a pSQL query Q with DEFAULT-ALL propagation scheme, the algorithm generates a query basis of Q. Proof Idea: • Every query in B(Q) is an equivalent query of Q • Every equivalent query of Q is annotation-contained in Ua q(D) qB(Q)

Outline • pSQL queries • Semantics • CUSTOM propagation scheme • DEFAULT propagation scheme • DEFAULT-ALL propagation scheme • Implementation • System architecture • Experimental results

System Architecture • Translator Module • Input: a pSQLquery Q • Output: an SQL query Q’ written against the naïve storage scheme • Q’ is sent to the RDBMS and executed • Postprocessor Module • Input: sorted tuples (returned by the RDBMS) • Output: An annotated set of tuples. • Annotations for the same output location are collected together • Duplicate tuples are removed RDBMS SQL query USER pSQL query sorted tuples final result Translator Postprocessor

A Naïve Storage Scheme • For every attribute of every relation there is an additional attribute for storing the annotations • Conceivably, there are other possible storage schemes R R’ a d c b

The Translator module Generate a Query Basis pSQL query default-all scheme set of pSQL queries with custom scheme Translate default pSQL to custom pSQL Translate custom pSQL to SQL pSQL query default scheme SQL query pSQL query custom scheme defaultpSQL query custompSQL query

Experiments • Goals • compare the performance of pSQL queries under different propagation schemes (DEFAULT, DEFAULT-ALL, or no propagation scheme) • compare the performance of pSQL queries when the number of annotations in a database is varied

Experimental setup • Implemented on top of Oracle 9i • Datasets • 100MB, 500MB, 1GB TPCH database • Unannotated database on original schema • 30%, 60%, 100% annotations on naïve schema • buffer size: 256Mb • Test queries • SPJ queries • Varied the number of joins (0 to 4 joins) • Varied the number of selected attributes (1,3 or 5 attributes)

100MB dataset – 100% annotated Qi(j) denotes a query with i joins and j output attributes.

500MB dataset – 100% annotated Qi(j) denotes a query with i joins and j output attributes.

1GB dataset – 100% annotated Qi(j) denotes a query with i joins and j output attributes.

100MB dataset annotated in various degrees Qi(j) denotes a query with i joins and j output attributes.

Contributions • an annotation management system • for carrying annotations along as data is being transformed based on provenance • pSQL query language for propagation annotations • CUSTOM – user defined • DEFAULT – where data was copied from? • DEFAULT-ALL – invariant under equivalent queries • Generate-Query-Basis algorithm • an initial implementation

Future work • Performance of our annotation management system on other storage schemes • pSQL extensions • Aggregates • Bag Queries

END

The CUSTOM Scheme - Example R a c b SELECT DISTINCT B FROM R r PROPAGATE r.A TO B, r.B TO B h Result a c b h

Terminology • A location is a triple (R, t, A) • Definition: A query Q1 is annotation contained in a query Q2 if: • Q1 Q2 • for every database D, the set of annotations attached to every output location in Q1(D) is a subset of the set of annotations associated with the same location in the output of Q2(D). R The annotation “a” is attached to the location (R,(1,2),B) a

In a More Concise Notation Ans(x,y,z) :- R(x,y), S(y’,z), y = y’. { x ! 1, y ! 2, y’ ! 2, z ! 3 } Ans(x,y,z) :- R(x,y), S(y,z). { x ! 1, y ! 2, z ! 3 } • Annotations of values that reside in different source locations but are bound to the same variable are unioned together. Ans(y) :- R(x,y). Ans(y) :- S(y,z). Ans(2 ). • Annotations that belong to the same output location are unioned together. a b a b a b

Containment vs. annotation-containment R a b c d Q1 Ans(x,v) :- R(x,y,u), R(x,z,v), R(t,w,z). Q2 Ans(x,v) :- R(p,q,v), R(x,z,v), R(t,w,z). Ans1 a b c Ans2 c d b Q1 Q2 but… Q1aQ2and Q2aQ1

Translating a CUSTOM pSQL to SQL custompSQL query: SELECT DISTINCT r.A, s.B, s.C FROM R r, S s WHERE r.B = s.B PROPAGATE s.B TO B, s.C TO C, r.B TO B SQL query: SELECT DISTINCT * FROM ( Q1 UNION Q2 ) t ORDER BY t.A, t.B, t.C • Q1: • SELECT r.A, NULL, • s.B, s.B’, • s.C, s.C’ • FROM R r, S s • WHERE r.B = s.B • Q2: • SELECT r.A, NULL, • s.B, r.B’, • s.C, NULL • FROM R r, S s • WHERE r.B = s.B

Provenance-Based Annotation Management System

Provenance-Based Annotation Management System

Presentation Transcript

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases

RELATIONAL DATABASES

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases

Relational Databases:

Relational Databases

Relational Databases

Relational Databases