Explore a provenance framework in the Kepler system for capturing, storing, querying, and browsing data lineage to enhance the reproducibility and automation of complex scientific computations. Discover patterns and challenges in modeling provenance for scientific workflows. Analyze storage strategies for efficient data management and explore reduction algorithms for improved performance. Implement improved lineage structures to enable better querying and browsing capabilities.
A Provenance Framework to Capture, Store, Query, and Browse Data Lineage in Kepler Manish Kumar Anand maanand@ucdavis.edu Eighth Biennial Ptolemy Miniconference Berkeley, California
Scientific Workflows
• Discoveries achieved via complex computations
• Workflows replacing traditional scripting approaches
• Enable automation, reproducibility, sharing, provenance
[Figure labels: Perl script, Scientific workflow system]
Provenance
• A record of processes, inputs/outputs, dependencies
• Supports reproducibility, interpretation, verification
[Figure: workflow execution graph from inputs to outputs through the stages AlignWarp, Reslice, Softmean, Slicer, Convert]
Outline
• Capturing Provenance
• Storing Provenance
• Querying Provenance
• Browsing Provenance
Conventional Provenance Models
• Assumptions:
  - Data is atomic
  - Invocations consume all inputs and produce new outputs
  - Every output depends on all inputs
• Records
  - Inputs/outputs of invocations
• Infers
  - Data dependency
  - Invocation dependency
Example records: input(a, s1), output(a, s2), input(b, s2), input(c, s2), …
[Figure: workflow execution graph with the data dependency and invocation dependency graphs inferred from it]
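The inference step of the conventional model can be sketched in a few lines of Python. This is only an illustration under the slide's three assumptions; the function name `infer_dependencies` and the two `output` rows for `b` and `c` are hypothetical, since the slide elides the rest of its record list.

```python
from collections import defaultdict

# Recorded events as (kind, invocation, data item), mirroring input(a, s1), output(a, s2), ...
records = [
    ("input", "a", "s1"),
    ("output", "a", "s2"),
    ("input", "b", "s2"),
    ("input", "c", "s2"),
    ("output", "b", "s3"),  # hypothetical row, not on the slide
    ("output", "c", "s4"),  # hypothetical row, not on the slide
]

def infer_dependencies(records):
    """Infer data and invocation dependencies from input/output records."""
    inputs, outputs = defaultdict(set), defaultdict(set)
    for kind, invocation, item in records:
        (inputs if kind == "input" else outputs)[invocation].add(item)

    # Assumption from the slide: every output depends on all inputs of its invocation.
    data_deps = {
        (out, inp)
        for invocation, outs in outputs.items()
        for out in outs
        for inp in inputs.get(invocation, ())
    }

    # An invocation depends on whichever invocation produced one of its inputs.
    producer = {item: inv for inv, items in outputs.items() for item in items}
    invocation_deps = {
        (inv, producer[item])
        for inv, items in inputs.items()
        for item in items
        if item in producer
    }
    return data_deps, invocation_deps

data_deps, invocation_deps = infer_dependencies(records)
print(sorted(data_deps))        # pairs (item, item it depends on)
print(sorted(invocation_deps))  # pairs (invocation, invocation it depends on)
```

Because the model treats data as atomic and every output as dependent on every input, it over-approximates dependencies as soon as an invocation passes data through unchanged.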
Challenges in Modeling Provenance
Many scientific workflow systems also support:
• Both data “transformers” and “pass-through”
• Processes with different dependency patterns
• Structured data (XML)
• Models of provenance must consider these factors
[Figure: panels (a) to (c) showing different dependency patterns for an invocation a over data items s1 to s5]
Efficient Provenance Representation
• Instead of storing each version
• Only store a single combined version
• Along with a set of updates (Δ's)
• Updates and dependencies represented as annotations
Condensed: Δa = {ins(5,a), dep(5,2), del(3,a)}
Expanded: Δa = {ins(5,a), dep(5,2), del(3,a), ins(6,a), dep(5,3), dep(5,4), dep(6,2), dep(6,3), dep(6,4)}
[Figure: combined tree over nodes 1 to 6 with +a and -a annotations, in condensed and expanded form]
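A minimal sketch of how individual versions could be read back from one combined tree plus its update annotations. The tree shape (node 1 with children 2 and 5, node 2 with children 3 and 4, node 5 with child 6) is an assumption inferred from the annotation sets on the slide, and the helper names are hypothetical. It handles a single invocation `a` only.

```python
# child -> parent edges of the single combined tree (assumed shape)
parent = {2: 1, 5: 1, 3: 2, 4: 2, 6: 5}
nodes = {1, 2, 3, 4, 5, 6}

# Condensed update set for invocation a
delta_a = {("ins", 5, "a"), ("dep", 5, 2), ("del", 3, "a")}

def is_under(node, roots):
    """True if node is one of roots or has an ancestor in roots."""
    while node is not None:
        if node in roots:
            return True
        node = parent.get(node)
    return False

def version(hidden_kind):
    """Nodes visible once the subtrees annotated with hidden_kind are hidden."""
    hidden = {n for kind, n, _ in delta_a if kind == hidden_kind}
    return {n for n in nodes if not is_under(n, hidden)}

before_a = version("ins")  # what a read: nothing a inserted exists yet
after_a = version("del")   # what a left behind: nothing a deleted remains
print(sorted(before_a), sorted(after_a))
```

Both versions come out of one stored tree, which is the point of keeping a combined version plus Δ's rather than a copy per invocation.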
Expanding and Condensing Traces
[Figure: the tree over nodes 1 to 6 shown twice, once with expanded and once with condensed +a/-a annotations]
Trace Views
[Figure: condensed and expanded views of a trace over nested Images collections (states S1 to S6) produced by alignwarp, reslicewarp, softmean, slicer, convert]
Condensing a trace
• Uses a postorder (i.e., bottom-up, left-to-right) traversal
• Removes annotations from a node n:
  (i) dep(n,c) if dep(n,p) and child(p,c)
  (ii) dep(n,d) if child(p,n) and dep(p,d)
  (iii) ins(n,x) if child(p,n) and ins(p,x)
  (iv) del(n,y) if child(p,n) and del(p,y)
• Removes invocation order annotations
  - Those implied according to rules (3-8)
Expanding a trace
• Uses three distinct preorder (i.e., top-down, left-to-right) traversals
• Pass 1: rules (1-2) and rules (3-5)
  - Infers insertion and deletion annotations
  - Infers invocation order from nodes and parent-child relationships
• Pass 2: rules (6-8)
  - Infers remaining invocation precedence relationships
• Pass 3: rules (9-10)
  - Expands dependency sets and propagates dependencies to child nodes
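Rules (i) to (iv) above can be exercised on the six-node example from the Efficient Provenance Representation slide. The sketch below is not the traversal-based algorithm described on the slide: it swaps in a simple fixed-point iteration, ignores the invocation-order rules, and assumes the tree shape (1 over 2 and 5, 2 over 3 and 4, 5 over 6). The function names are hypothetical.

```python
# (parent, child) pairs of the assumed tree
CHILD = {(1, 2), (1, 5), (2, 3), (2, 4), (5, 6)}

CONDENSED = {("ins", 5, "a"), ("dep", 5, 2), ("del", 3, "a")}
EXPANDED = CONDENSED | {
    ("ins", 6, "a"),
    ("dep", 5, 3), ("dep", 5, 4),
    ("dep", 6, 2), ("dep", 6, 3), ("dep", 6, 4),
}

def one_step(annotations):
    """Annotations implied by one application of rules (i)-(iv)."""
    implied = set()
    for kind, n, x in annotations:
        if kind == "dep":
            # (i) dep(n, c) if dep(n, p) and child(p, c); here p is x
            implied |= {("dep", n, c) for p, c in CHILD if p == x}
        # (ii) for dep, (iii) for ins, (iv) for del:
        # an annotation on a parent also holds for each of its children
        implied |= {(kind, c, x) for p, c in CHILD if p == n}
    return implied

def expand(annotations):
    """Apply the rules until no new annotation appears."""
    result = set(annotations)
    while True:
        new = one_step(result) - result
        if not new:
            return result
        result |= new

def condense(annotations):
    """Drop every annotation that another annotation already implies."""
    return set(annotations) - one_step(annotations)

assert expand(CONDENSED) == EXPANDED
assert condense(EXPANDED) == CONDENSED
```

Dropping all one-step-implied annotations at once is safe here because each rule only derives facts further down the tree, so no two annotations can imply each other.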
Storage Strategies
• Store immediate and transitive dependencies
  - Faster query execution
• Reduction techniques
  - Represent dependencies in reduced form
• Use standard relational DBMS and minimize storage size, update time and query time
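The trade-off behind storing transitive dependencies can be illustrated with a small sketch: materializing the closure turns a lineage query into a single lookup, at the cost of more stored rows. The data and the function name `transitive_closure` are hypothetical, and this in-memory version only stands in for what a relational store would hold.

```python
def transitive_closure(immediate):
    """Map each item to everything it depends on, directly or indirectly."""
    closure = {}
    for item, direct in immediate.items():
        seen, stack = set(), list(direct)
        while stack:
            dep = stack.pop()
            if dep not in seen:
                seen.add(dep)
                stack.extend(immediate.get(dep, ()))
        closure[item] = seen
    return closure

# Hypothetical chain: s4 depends on s3, s3 on s2, s2 on s1
immediate = {"s2": {"s1"}, "s3": {"s2"}, "s4": {"s3"}}
closure = transitive_closure(immediate)

immediate_rows = sum(len(deps) for deps in immediate.values())
closure_rows = sum(len(deps) for deps in closure.values())
print(closure["s4"], immediate_rows, closure_rows)
```

Even on this short chain the closure needs twice as many rows as the immediate dependencies, which is the growth the reduction techniques are meant to contain.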
Storage Strategies
• 5 storage strategies
  - NC: Naive Collapsed (collapsed trace)
  - NE: Naive Expanded (expanded trace)
  - SE: Simple Expanded (expanded trace, transitive dependencies)
  - RE: Reduced Expanded (reduced expanded trace, transitive dependencies)
  - RC: Reduced Collapsed (reduced collapsed trace, transitive dependencies)
• RE and RC are produced by reduction algorithms
• Compare:
  - Storage size, update time, query time
Analysis of Storage Strategies
[Figure: three plots over traces comparing the strategies: Update Time (Time(s)), Storage Size (Cells (1000)), Query Time (Time(s))]
• SE: Worst storage size and update time
• RC: Very expensive query time
• RE: Recommended storage strategy
Querying Provenance Can Be Expensive
[Figure: nested Images collections across states S1 to S6 (structures) linked by the invocations alignwarp:1, reslicewarp:1, softmean:1, slicer:1, convert:1 (lineage)]
• Queries are often recursive
  - Complex to formulate
  - Expensive to evaluate
• Standard querying approaches
  - Tied to storage representation
  - Query language expertise
• Need to query across structures, lineage, or both
(Q) Select lineage path that derived all children of AtlasImage created by slicer
• How to express provenance queries easily and execute them efficiently?
To Express this Query …
• SQL (e.g., transitive dependencies)
• SQL (stored procedures)

Stored procedure excerpt:

create procedure depc(in runId_in varchar(255), in nodeId_in Integer)
begin
  DECLARE finished integer default 0;
  …
  declare cur_1 cursor for
    select depNodeId from dependency
    where runId=runId_in and itemNodeId=nodeId_tmp;
  set nodeId_tmp = nodeId_in;
  set depCnt = (select count(*) from dependency
                where runId=runId_in and itemNodeId=nodeId_tmp);
  if (depCnt is not null) then
    open cur_1;
    get_cur_1: loop
      fetch cur_1 into depNodeId_tmp;
      if finished then leave get_cur_1; end if;
      insert into depcT (nodeId) values(depNodeId_tmp);
    end LOOP get_cur_1;
    close cur_1;
    set cnt = 1;
    while (cnt <= depCnt) do
      set nodeId_tmp = (select nodeId from depcT where no=cnt);
      set row_limit = (select count(*) from dependency
                       where itemnodeId=nodeId_tmp and runId=runId_in);
      set row_cnt =0;
      open cur_1;
      get_cur_1: loop
        fetch cur_1 into depNodeId_tmp;
        set flag = (select 1 from depcT where nodeId = depNodeId_tmp);
        if (flag is null) then
          insert into depcT (nodeId) values(depNodeId_tmp);
        end if;
        if (row_cnt > row_limit) then leave get_cur_1; end if;
        set row_cnt = row_cnt + 1;
        …
…

Query excerpt:

select t.runId, t2.nodeId, t.nodeId as depNodeId
from (
  select d1.runId, d1.pDep, d1.nodeId
  from dependency d1
  where runId=runId_in
  union
  select p1.runId, p1.fromPointer as pDep, d2.nodeId
  from dependency d2, depSubsetPointer p1
  where p1.runId=runId_in and d2.runId=runId_in and d2.pDep=p1.toPointer
) as t, depMinMaxPointer p2, (
  select t.runId, r1.nodeId, t.pDep
  from (
    select dc1.runId, dc1.pDepC, dc1.pDep
    from depCdepPointer dc1
    where runId=runId_in
    union
    select p1.runId, p1.fromPointer as pDepC, dc2.pDep
    from depCdepPointer dc2, depCSubsetPointer p1
    where p1.runId=runId_in and dc2.runId=runId_in and dc2.pDepC=p1.toPointer
  ) as t, depCMinMaxPointer p2, runCollData r1, runItemProv rp1
  where p2.runId = runId_in and r1.runId=runId_in and rp1.runId=runId_in
    and r1.nodeId=nodeId_in and r1.pointer=rp1.pointer
    and rp1.pDep = p2.fromPointer and t.pDepC=p2.toPointer
    and t.pDep BETWEEN p2.depMin AND p2.depMax
  union
  …
…

• Hard for domain scientists (… and SQL experts)
• Optimization depends on SQL engine [He et al. SIGMOD 08]
• Need for higher-level provenance query language
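To make concrete what the stored procedure is computing, here is a small sketch of the same transitive-dependency lookup over an immediate-dependency table. It uses a different technique from the slides, a recursive common table expression in SQLite through Python's `sqlite3` module, and it covers only the naive table, not the pointer-based reduced representation in the second excerpt. The table and column names follow the excerpt above; the column types, the sample rows, and the function name `transitive_deps` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dependency (runId TEXT, itemNodeId INTEGER, depNodeId INTEGER)"
)
# Hypothetical rows: in run1, node 4 depends on 3, 3 on 2, 2 on 1
conn.executemany(
    "INSERT INTO dependency VALUES (?, ?, ?)",
    [("run1", 4, 3), ("run1", 3, 2), ("run1", 2, 1), ("run2", 4, 9)],
)

QUERY = """
WITH RECURSIVE depc(nodeId) AS (
    SELECT depNodeId FROM dependency
    WHERE runId = :run AND itemNodeId = :node
    UNION
    SELECT d.depNodeId
    FROM dependency d JOIN depc ON d.itemNodeId = depc.nodeId
    WHERE d.runId = :run
)
SELECT nodeId FROM depc ORDER BY nodeId
"""

def transitive_deps(run_id, node_id):
    """All nodes that node_id depends on, directly or indirectly, within one run."""
    rows = conn.execute(QUERY, {"run": run_id, "node": node_id}).fetchall()
    return [row[0] for row in rows]

print(transitive_deps("run1", 4))
```

The query still has to be written against one particular storage layout and says nothing about collection structure, so it does not replace a higher-level provenance query language.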
QLP Constructs
First Provenance Challenge Queries Formulated in QLP
Querying Multiple Dimensions
[Figure: nested Images collections across states S1 to S6 (structures) linked by alignwarp:1, reslicewarp:1, softmean:1, slicer:1, convert:1 (lineage); the structure at @out slicer is S5, where Images (1) contains AtlasImage (15), which contains AtlasSlice (18)]
(Q) Select lineage path that derived all children of AtlasImage created by slicer
1. Obtain structures from @in and @out version operators: @out slicer
2. Apply XPath expressions to structure: //AtlasImage/* (selects node 18)
3. Apply lineage queries to each resulting node: * derived 18
Q in QLP: * derived //AtlasImage/* @out slicer
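The three evaluation steps can be imitated in plain Python to show how the structural and lineage dimensions combine. This is not the QLP engine: the XML snapshot, the `id` attributes, the `DERIVED_FROM` edges (including node ids 16 and 11 as lineage sources) and all function names are made up for illustration. The XPath step uses the limited XPath support in `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Step 1 (assumed data): the collection structure as it looks at "@out slicer"
OUT_SLICER = ET.fromstring(
    '<Images id="1"><AtlasImage id="15"><AtlasSlice id="18"/></AtlasImage></Images>'
)

# Hypothetical immediate lineage: node id -> ids it was directly derived from
DERIVED_FROM = {18: {16}, 16: {11}}

def lineage_edges(node_id):
    """All (source, target) derivation edges on paths leading to node_id."""
    edges, stack = set(), [node_id]
    while stack:
        target = stack.pop()
        for source in DERIVED_FROM.get(target, ()):
            if (source, target) not in edges:
                edges.add((source, target))
                stack.append(source)
    return edges

def derived_query(structure, xpath):
    """Step 2: select nodes by XPath. Step 3: collect the lineage of each one."""
    selected = [int(elem.get("id")) for elem in structure.findall(xpath)]
    return {node_id: lineage_edges(node_id) for node_id in selected}

# Rough counterpart of: * derived //AtlasImage/* @out slicer
result = derived_query(OUT_SLICER, ".//AtlasImage/*")
print(result)
```

Keeping the structural selection and the lineage traversal as separate steps mirrors how the query composes an XPath expression with a `derived` lineage step.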
Provenance Browser • Browse different views of a trace • Data dependencies, collection structure, actor invocations • Move “forward” and “backward” through execution
Collection History • Collection and invocation view • Incrementally step through execution history
Conclusion
• Capture
  - Supports nested data collections, explicit data dependency, update semantics
• Storage
  - Reduce update time, storage size and query time
• Query
  - A high-level provenance query language (QLP)
  - Query structures with lineage graphs
  - Formulate queries easily and concisely
• Browse/Visualize
  - Provenance Browser, a visualization tool to view and navigate across provenance views
References
• M. K. Anand, S. Bowers, T. McPhillips, B. Ludäscher. Exploring Scientific Workflow Provenance using Hybrid Queries over Nested Data and Lineage Graphs. SSDBM 2009
• M. K. Anand, S. Bowers, T. McPhillips, B. Ludäscher. Efficient Provenance Storage over Nested Data Collections. EDBT 2009
• S. Bowers, T. McPhillips, S. Riddle, M. K. Anand, B. Ludäscher. Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life. IPAW 2008