1 / 51

Provenance: overview

Provenance: overview. Professor Luc Moreau L.Moreau@ecs.soton.ac.uk University of Southampton www.ecs.soton.ac.uk/~lavm. Provenance & PASOA Teams. University of Southampton Luc Moreau, Paul Groth, Simon Miles, Victor Tan, Miguel Branco, Sofia Tsasakou, Sheng Jiang, Steve Munroe, Zheng Chen

nijole
Télécharger la présentation

Provenance: overview

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Provenance: overview Professor Luc Moreau L.Moreau@ecs.soton.ac.uk University of Southampton www.ecs.soton.ac.uk/~lavm

  2. Provenance & PASOA Teams • University of Southampton • Luc Moreau, Paul Groth, Simon Miles, Victor Tan, Miguel Branco, Sofia Tsasakou, Sheng Jiang, Steve Munroe, Zheng Chen • IBM UK (EU Project Coordinator) • John Ibbotson, Neil Hardman, Alexis Biller • University of Wales, Cardiff • Omer Rana, Arnaud Contes, Vikas Deora, Ian Wootten, Shrija Rajbhandari • Universitad Politecnica de Catalunya (UPC) • Steven Willmott, Javier Vazquez • SZTAKI • Laszlo Varga, Arpad Andics, Tamas Kifor • German Aerospace • Andreas Schreiber, Guy Kloss, Frank Danneman

  3. Contents • Motivation • Provenance Concepts • Provenance Architecture • Standardisation • Conclusions

  4. Motivation

  5. Scientific Research Academic Peer Review

  6. Audit (Basel II) Audit (Sarbanes-Oxley) Business Regulations Accounting Banking

  7. Health Care Management European Recommendation R(97)5: on the protection of medical data

  8. e-Science datasets • How to undertake peer-reviewing and validation of e-Scientific results?

  9. The “next-compliance” problem Can we be certain that by ensuring compliance to a new regulation, we do not break previous compliance? Compliance to Regulations

  10. Current Solutions • Proprietary, Monolithic • Silos, Closed • Do not inter-operate with other applications • Not adaptable to new regulations

  11. Provenance • Oxford English Dictionary: • the fact of coming from some particular source or quarter; origin, derivation • the historyor pedigree of a work of art, manuscript, rare book, etc.; • concretely, a record of the passage of an item through its various owners. • Concept vs representation

  12. Provenance in Computer Systems • Our definition of provenance in the context of applications for which process matters to end users: The provenance of a piece of data is the process that led to that piece of data • Our aim is to conceive a computer-based representation of provenance that allows us to perform useful analysis and reasoning to support our use cases

  13. Our Approach • Define core concepts pertaining to provenance • Specify functionality required to become “provenance-aware” • Define open data models and protocols that allow systems to inter-operate • Standardise data models and protocols • Provide a reference implementation • Provide reasoning capability

  14. Context (1) Aerospace engineering: maintain a historical record of design processes, up to 99 years. Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and recovery rate of patients

  15. Context (2) Bioinformatics: verification and auditing of “experiments” (e.g. for drug approval) High Energy Physics: tracking, analysing, verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN)

  16. Provenance Concepts

  17. Application Data Results Record Documentation of Execution Provenance “Lifecycle” Core Interfaces to Provenance Store Provenance Store Administer Store and its contents Query and Reason over Provenance of Data

  18. Nature of Documentation • We represent the provenance of some data by documenting the process that led to the data: • documentation can be complete or partial; • it can be accurate or inaccurate; • it can present conflicting or consensual views of the actors involved; • it can provide operational details of execution or it can be abstract.

  19. p-assertion • A given element of process documentation will be referred to as a p-assertion • p-assertion: is an assertion that is made by an actor and pertains to a process.

  20. Service Oriented Architecture • Broad definition of service as component that takes some inputs and produces some outputs. • Services are brought together to solve a given problem typically via a workflow definition that specifies their composition. • Interactions with services take place with messages that are constructed according to services interface specification. • The term actor denotes either a client or a service in a SOA. • A process is defined as execution of a workflow

  21. I received M1, M4 I sent M2, M3 I received M3 I sent M4 Process Documentation (1) From these p-assertions, we can derive that M3 was sent by Actor 1 and received by Actor 2 (and likewise for M4) Actor 2 Actor 1 M1 M3 M4 M2 If actors are black boxes, these assertions are not very useful because we do not know dependencies between messages

  22. M4 is in reply to M3 M2 is in reply to M1 M3 is caused by M1 M2 is caused by M4 Process Documentation (2) These assertions help identify order of messages, but not how data was computed Actor 2 Actor 1 M1 M3 M4 M2

  23. f1 f f2 M3 = f1(M1) M2 = f2(M1,M4) M4 = f(M3) Process Documentation (3) These assertions help identify how data is computed, but provide no information about non-functional characteristics of the computation (time, resources used, etc) Actor 2 Actor 1 M1 M3 M4 M2

  24. I used sparc processor I used algorithm x version x.y.z I used 386 cluster Request sat in queue for 6min Process Documentation (4) Actor 2 Actor 1 M1 M3 M4 M2

  25. I received M1, M4 I sent M2, M3 Types of p-assertions (1) • Interaction p-assertion: is an assertion of the contents of a message by an actor that has sent or received that message

  26. M2 is in reply to M1 M3 is caused by M1 M2 is caused by M4 M3 = f1(M1) M2 = f2(M1,M4) Types of p-assertions (2) • Relationship p-assertion: is an assertion, made by an actor, that describes how the actor obtained an output message sent in an interaction by applying some function to input messages from other interactions (likewise for data)

  27. Types of p-assertions (3) • Actor state p-assertion: assertion made by an actor about its internal state in the context of a specific interaction I used sparc processor I used algorithm x version x.y.z

  28. Data flow • Interaction p-assertions allow us to specify a flow of data between actors • Relationship p-assertions allow us to characterise the flow of data “inside” an actor • Overall data flow (internal + external) constitutes a DAG, which characterises the process that led to a result

  29. Provenance Architecture

  30. Application Results Record Documentation of Execution Interfaces to Provenance Store Provenance Store Query and Reason over Provenance of Data Administer Store and its contents

  31. P-Assertion schemas

  32. The p-structure • The p-structure is a common logical structure of the provenance store shared by all asserting and querying actors • Hierarchical • Indexed by interactions (interaction= 1 message exchange)

  33. Abstract machines DS Properties Termination Liveness Safety Statelessness Documentation Properties Immutability Attribution Datatype safety Foundation for adding necessary cryptographic techniques Recording Protocol (Groth04-06)

  34. Querying Functionality (Miles06) • Process Documentation Query Interface: allows for “navigation” of the documentation of execution • Allows us to view the provenance store (i.e. the p-structure) as if containing XML data structures • Independent of technology used for running application and internal store representation • Seamless navigation of application dependent and application independent process documentation

  35. Querying Functionality (Miles06) • Provenance Query Interface: allows us to obtain the provenance of some specific data • A recognition that there is not “one” provenance for a piece of data, but there may be different, depending on the end-user’s interest • Hence, provenance is seen as the result of a query: • Identify a piece of data at a specific execution point • Scope of the process of interest: • Filter in/out p-assertions according to actors, process, types of relationships, etc

  36. Standardisation

  37. Standardisation Options

  38. Record Documentation of Execution Purpose of Standardisation Application Application Provenance Stores Allow for multiple applications to document their execution. Applications may be running in different institutions.

  39. Record Documentation of Execution Purpose of Standardisation Application Provenance Store Provenance Store Provenance Store Allow for multiple stores from multiple IT providers

  40. Purpose of Standardisation Provenance Store Provenance Store Query Provenance of Data Allow for multiple stores from multiple IT providers

  41. Purpose of Standardisation Convert in standard data format Allow for legacy, monolithic applications to expose their contents (according to standard schema)

  42. Provenance Store Purpose of Standardisation Application Allow third parties to host provenance stores, which are trusted by application owners but also auditors

  43. Compliance Oriented Architectures • Separate execution documentation from compliance verification • Allows for multiple compliance verifications • Allows for validation to take place across multiple applications, possibly run by different institutions (in particular, allows for outsourcing and subcontracting). • Approach is suitable for e-scientific peer-reviewing and business compliance verification

  44. Organ Transplant Scenario Hospital Electronic Healthcare Management Service Testing Lab

  45. Hospital Actors User Interface Brain Death Manager Donor Data Collector

  46. What’s on the CD • PReServ (Paul Groth & Simon Miles) • Offer recording and querying interfaces • Available from www.pasoa.org • Soon ogsa-dai based version available from www.gridprovenance.org • Is being used in a bioinformatics application (cf. hpdc’05, iswc’05)

  47. Conclusions

  48. Standardising the documentation of Business Processes Apply Record • Provenance • Architecture • Methodology Provenance Store To Sum Up Finance Aerospace Distribution Healthcare Automobile Pharmaceutical • Compliance check • Rerun/Reproduce • Analyse Query Slide from John Ibbotson

  49. Overview of Today’s Talks • Provenance Data Structures • Recording and Querying Provenance • Break (30 minutes) • Distribution and Scalability • Security • Methodology

More Related