
Provenance Challenge


Presentation Transcript


  1. Provenance Challenge Simon Miles, Mike Wilde, Ian Foster and Luc Moreau

  2. Provenance • In the study of fine art, provenance refers to the documented history of some art object. • If the provenance of data produced by computer systems could be determined like it can for some works of art, then users would be able to interpret and judge the quality of data better.

  3. The Provenance of the Challenge • Back in May: IPAW’06 (International Provenance and Annotation Workshop) • www.ipaw.info • Proceedings to appear in LNCS 4145

  4. Standardisation discussion at IPAW’06 • How can (workflow-based or other) systems inter-operate? • Individual systems may be able to track provenance of data • How can we track provenance of data across systems? • Would a standard be useful? • At the time, it was felt premature to standardise; we first needed to understand systems’ capabilities

  5. The Challenge Aims • The provenance challenge aims to establish an understanding of the capabilities of available provenance-related systems • The representations that systems use to document details of processes that have occurred • The capabilities of each system in answering provenance-related queries • What each system considers to be within scope of the topic of provenance (regardless of whether the system can yet address all problems in that scope) twiki.ipaw.info

  6. The Challenge Process Each participant in the challenge will have their own page on this TWiki, following the ChallengeTemplate, where they can inform the rest of the participants of their efforts in meeting the challenge:
  • Representations of the workflow in their system
  • Representations of provenance for the example workflow
  • Representations of the result of the core (and other) queries
  • Contributions to a matrix of queries vs systems, indicating for each that: (1) the query can be answered by the system, (2) the system cannot answer the query now but considers it relevant, or (3) the query is not relevant to the project.
  Optionally, the participants may also contribute the following:
  • Additional queries that illustrate the scope of their system
  • Extensions to the example workflow to best illustrate the unique aspects of their system
  • Any categorisation of queries that the project considers to have practical value
  twiki.ipaw.info

  7. twiki.ipaw.info

  8. The Queries
  • Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed, etc.
  • Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.
  • Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.
  • Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model (see model menu describing possible values of parameter "-m 12" of align_warp) that ran on a Monday.
  • Find all Atlas Graphic images output from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095. The contents of a header file can be extracted as text using the scanheader AIR utility.
  • Find all output averaged images of softmean (average) procedures, where the warped images taken as input were align_warped using a twelfth order nonlinear 1365 parameter model, i.e. "where softmean was preceded in the workflow, directly or indirectly, by an align_warp procedure with argument -m 12."
  • A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant.
  • A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align_warp where the inputs are annotated with center=UChicago.
  • A user has annotated some atlas graphics with a key-value pair whose key is studyModality. Find all the graphical atlas sets that have a metadata annotation studyModality with values speech, visual or audio, and return all other annotations on these files.
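The first two queries amount to walking a provenance graph backwards from an output artifact, collecting every process invocation and intermediate artifact that contributed to it, optionally stopping at a given stage. The sketch below is a minimal illustration of that traversal, not any participating system's implementation; the edge records, artifact names, and invocation names (align_warp_1, softmean_1, etc.) are hypothetical stand-ins loosely mirroring the challenge's fMRI workflow stages.

```python
from collections import defaultdict, deque

# Hypothetical provenance records: each triple says "output was derived
# from input via this process invocation". Names are illustrative only.
edges = [
    ("anatomy1.img",  "align_warp_1", "warp1.warp"),
    ("warp1.warp",    "reslice_1",    "resliced1.img"),
    ("resliced1.img", "softmean_1",   "atlas.img"),
    ("atlas.img",     "slicer_x",     "atlas_x.pgm"),
    ("atlas_x.pgm",   "convert_x",    "atlas_x.jpg"),
]

# Reverse index: output artifact -> list of (process, input artifact)
derived_from = defaultdict(list)
for inp, proc, out in edges:
    derived_from[out].append((proc, inp))

def lineage(artifact, stop=None):
    """Breadth-first walk backwards from `artifact`, collecting every
    process and artifact that led to it (Query 1). If `stop` names a
    process, its inputs are not traversed, approximating Query 2's
    'excluding everything prior to softmean'."""
    seen_procs, seen_artifacts = [], []
    visited = {artifact}
    queue = deque([artifact])
    while queue:
        node = queue.popleft()
        for proc, inp in derived_from.get(node, []):
            if proc not in seen_procs:
                seen_procs.append(proc)
            if proc == stop:
                continue  # cut off history before this process
            if inp not in visited:
                visited.add(inp)
                seen_artifacts.append(inp)
                queue.append(inp)
    return seen_procs, seen_artifacts

procs, artifacts = lineage("atlas_x.jpg")               # full lineage
procs_cut, _ = lineage("atlas_x.jpg", stop="softmean_1")  # truncated at softmean
```

Real entries in the challenge recorded far richer detail (parameters such as "-m 12", timestamps for the "ran on a Monday" query, and annotations for the center=UChicago query), but the backward-reachability core shown here underlies most of the lineage queries.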

  9. 17 Participating Teams • REDUX, Database Research Group, MSR • MINDSWAP, Semantic Web Research Group, University of Maryland, College Park • Karma, Computer Science Department, Indiana University • CESNET, GRID research group, CESNET z.s.p.o. Prague, Czech Republic • myGrid, University of Manchester • VisTrails, University of Utah • Gridprovenance, Cardiff University • ES3, University of California, Santa Barbara • UPenn, University of Pennsylvania, Database Group • RWS, UC Davis and SDSC, California • DAKS, Genome Center, UC Davis, California • PASS, Harvard • SDG, Pacific Northwest National Lab • NcsaD2k and NcsaCi, National Center for Supercomputing Applications • UChicago, University of Chicago Computation Institute • Southampton, University of Southampton, PASOA and Provenance projects • USC/ISI, University Of Southern California/Information Sciences Institute twiki.ipaw.info

  10. Schedule • Session 1: Wednesday 10.00-11.30 team presentations • Session 2: Wednesday 13.00-15.00 team presentations • Session 3: Wednesday 16.00-17.30 • Session 4: Thursday 9.30-11.00 analysing commonalities and differences • Session 5: Thursday 11.30-13.00 what next? Sessions 3-5 are open; contribute ideas on the TWiki http://twiki.ipaw.info/bin/view/Challenge/WorkshopAgenda

  11. 10.00-10.10: Introduction • 10.10-10.20: PNL • 10.20-10.30: UPenn, University of Pennsylvania, Database Group • 10.30-10.40: UChicago • 10.40-10.50: myGrid, University of Manchester • 10.50-11.00: Kepler (SDSC) • 11.00-11.10: Kepler (UCDavis) • 11.10-11.20: VisTrails, University of Utah • 13.00-13.10: REDUX, Database Research Group, MSR • 13.10-13.20: CESNET, GRID research group, CESNET z.s.p.o. Prague, Czech Republic • 13.20-13.30: Karma, Computer Science Department, Indiana University • 13.30-13.40: MINDSWAP, Semantic Web Research Group, University of Maryland, College Park • 13.40-13.50: PASS, Harvard • 13.50-14.00: Southampton, PASOA/EU Provenance • 14.00-14.10: Gridprovenance, Cardiff University • 14.10-14.20: ISI • 14.20-14.30: NCSA • 14.30-14.40: ES3, University of California, Santa Barbara
