
Data Workflow Management, Data Preservation and Stewardship


Presentation Transcript


  1. Data Workflow Management, Data Preservation and Stewardship Peter Fox Data Science – ITEC/CSCI/ERTH-6961 Week 10, November 6, 2012

  2. Contents • Scientific Data Workflows • Data Stewardship • Summary • Next class(es)

  3. Scientific Data Workflow • What it is • Why you would use it • Some more detail in the context of Kepler • www.kepler-project.org • Some pointers to other workflow systems

  4. What is a workflow? • General definition: series of tasks performed to produce a final outcome • Scientific workflow – “data analysis pipeline” • Automate tedious jobs that scientists traditionally performed by hand for each dataset • Process large volumes of data faster than scientists could do by hand
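In code terms, such a pipeline is simply a chain of steps in which each step consumes the previous step's output. A minimal Python sketch of the idea (the calibration step, file format, and values are invented for illustration):

    # A minimal data-analysis pipeline: each stage consumes the previous
    # stage's output. Stage names and logic are illustrative only.

    def read_raw(path):
        # Parse one number per line into floats.
        with open(path) as f:
            return [float(line) for line in f if line.strip()]

    def calibrate(values, offset=0.5):
        # Apply a (hypothetical) instrument calibration offset.
        return [v - offset for v in values]

    def summarize(values):
        # Reduce the calibrated series to a simple summary statistic.
        return sum(values) / len(values)

    # Automating the chain means every dataset is processed identically:
    print(summarize(calibrate([1.0, 2.0, 3.0])))
    # For a file on disk: summarize(calibrate(read_raw("run_042.dat")))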

  5. Background: Business Workflows • Example: planning a trip • Need to perform a series of tasks: book a flight, reserve a hotel room, arrange for a rental car, etc. • Each task may depend on outcome of previous task • Days you reserve the hotel depend on days of the flight • If hotel has shuttle service, may not need to rent a car • E.g. tripit.com?

  6. What about scientific workflows? • Perform a set of transformations/operations on a scientific dataset • Examples • Generating images from raw data • Identifying areas of interest in a large dataset • Classifying a set of objects • Querying a web service for more information on a set of objects • Many others…

  7. More on Scientific Workflows • Formal models of the flow of data among processing components • May be simple and linear or more complex • Can process many data types: • Archived data • Streaming sensor data • Images (e.g., medical or satellite) • Simulation output • Observational data

  8. Challenges • Questions: • What are some challenges for scientists implementing scientific workflows? • What are some challenges to executing these workflows? • What are limitations of writing a program?

  9. Challenges • Mastering a programming language • Visualizing workflow • Sharing/exchanging workflow • Formatting issues • Locating datasets, services, or functions

  10. Kepler Scientific Workflow Management System • Graphical interface for developing and executing scientific workflows • Scientists can create workflows by dragging and dropping • Automates low-level data processing tasks • Provides access to data repositories, compute resources, workflow libraries

  11. Benefits of Scientific Workflows • Documentation of aspects of analysis • Visual communication of analytical steps • Ease of testing/debugging • Reproducibility • Reuse of part or all of workflow in a different project

  12. Additional Benefits • Integration of multiple computing environments • Automated access to distributed resources via web services and Grid technologies • System functionality to assist with integration of heterogeneous components

  13. Why not just use a script? • Script does not specify low-level task scheduling and communication • May be platform-dependent • Can’t be easily reused • May not have sufficient documentation to be adapted for another purpose
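The contrast can be made concrete: a workflow is driven by an explicit dependency graph, so the engine, not the author, derives a valid schedule and can run independent tasks in parallel. A small Python sketch of that idea using the trip example from slide 5 (this illustrates the principle, not Kepler's internals):

    from graphlib import TopologicalSorter  # standard library, Python 3.9+

    # Declare tasks and dependencies instead of hard-coding an order.
    # Task names are illustrative; a real system attaches executables.
    dependencies = {
        "book_flight": set(),
        "reserve_hotel": {"book_flight"},  # hotel dates depend on flight dates
        "rent_car": {"reserve_hotel"},
    }

    # The engine derives the execution order from the declared graph,
    # which is what a hand-written script leaves implicit.
    print(list(TopologicalSorter(dependencies).static_order()))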

  14. Why is a GUI useful? • No need to learn a programming language • Visual representation of what workflow does • Allows you to monitor workflow execution • Enables user interaction • Facilitates sharing of workflows

  15. The Kepler Project • Goals • Produce an open-source scientific workflow system • enable scientists to design scientific workflows and execute them • Support scientists in a variety of disciplines • e.g., biology, ecology, astronomy • Important features • access to scientific data • flexible means for executing complex analyses • enable use of Grid-based approaches to distributed computation • semantic models of scientific tasks • effective UI for workflow design

  16. Usage statistics • Projects using Kepler: • SEEK (ecology) • SciDAC (molecular bio, ...) • CPES (plasma simulation) • GEON (geosciences) • CiPRes (phylogenetics) • CalIT2 • ROADnet (real-time data) • LOOKING (oceanography) • CAMERA (metagenomics) • Resurgence (computational chemistry) • NORIA (ocean observing CI) • NEON (ecology observing CI) • ChIP-chip (genomics) • COMET (environmental science) • Cheshire Digital Library (archival) • Digital preservation (DIGARCH) • Cell Biology (Scripps) • DART (X-ray crystallography) • Ocean Life • Assembling the Tree of Life project • Processing Phylodata (pPOD) • FermiLab (particle physics) • Source code access: 154 people accessed source code; 30 members have write permission • Kepler downloads (chart of Windows vs. Macintosh downloads): total = 9204, beta = 6675

  17. Distributed execution • Opportunities for parallel execution • Fine-grained parallelism • Coarse-grained parallelism • Few or no cycles • Limited dependencies among components • ‘Trivially parallel’ • Many science problems fit this mold • parameter sweep, iteration of stochastic models • Current ‘plumbing’ approaches to distributed execution • workflow acts as a controller • stages data resources • writes job description files • controls execution of jobs on nodes • requires expert understanding of the Grid system • Scientists need to focus on just the computations • try to avoid plumbing as much as possible
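For the "trivially parallel" case, such as a parameter sweep, each run is independent, so the runs can be farmed out with no inter-task communication. A minimal local sketch in Python (run_model is a hypothetical stand-in; a Grid system would scatter the calls across remote nodes instead of local worker processes):

    from multiprocessing import Pool

    def run_model(param):
        # Stand-in for one independent model run; it communicates with
        # no other run, which is what makes the sweep trivially parallel.
        return param, param ** 2

    if __name__ == "__main__":
        sweep = [0.1 * i for i in range(100)]     # the parameter grid
        with Pool() as pool:                      # local workers; a Grid
            results = pool.map(run_model, sweep)  # would distribute these
        print(results[:3])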

  18. Distributed Kepler • Higher-order component for executing a model on one or more remote nodes • Master and slave controllers handle setup and communication among nodes, and establish data channels • Extremely easy for scientists to utilize • requires no knowledge of grid computing systems • (Diagram: a master controller connected to a slave controller via IN/OUT data channels.)

  19. Data Management • Need for integrated management of external data • EarthGrid access is partial, needs refactoring • Include other data sources, such as JDBC, OPeNDAP, etc. • Data needs to be a first-class object in Kepler, not just represented as an actor • Need support for data versioning to support provenance • e.g., need to pass data by reference • workflows contain large data tokens (100s of megabytes) • intelligent handling of unique identifiers (e.g., LSID) • (Diagram: actor A passes actor B a small reference token, e.g. ref-276, in place of the data token {1,5,2} itself.)
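A sketch of the pass-by-reference idea: rather than streaming a 100-megabyte token between actors, the workflow passes a small token carrying a unique identifier (such as an LSID) that the consumer resolves on demand. This illustrates the concept only; it is not Kepler's token implementation, and the identifier scheme is invented:

    import uuid

    # A shared store standing in for a data repository / EarthGrid node.
    data_store = {}

    def publish(dataset):
        # Store the large object once; hand out a small reference token.
        ref = "urn:lsid:example.org:" + uuid.uuid4().hex  # hypothetical LSID
        data_store[ref] = dataset
        return ref

    def consume(ref):
        # Downstream actors dereference the token only when they
        # actually need the bytes.
        return data_store[ref]

    token = publish(list(range(1_000_000)))  # pass this tiny string around
    print(token, len(consume(token)))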

  20. Science Environment for Ecological Knowledge • SEEK is an NSF-funded, multidisciplinary research project to facilitate: • Access to distributed ecological, environmental, and biodiversity data • enable data sharing & reuse • enhance data discovery at global scales • Scalable analysis and synthesis • taxonomic, spatial, temporal, conceptual integration of data, addressing data heterogeneity issues • enable communication and collaboration for analysis • enable reuse of analytical components • support scientific workflow design and modeling

  21. SEEK data access, analysis, mediation • Data Access (EcoGrid) • Distributed data network for environmental, ecological, and systematics data • Interoperate diverse environmental data systems • Workflow Tools (Kepler) • Problem-solving environment for scientific data analysis and visualization → “scientific workflows” • Semantic Mediation (SMS) • Leverage ontologies for “smart” data/component discovery and integration

  22. Managing Data Heterogeneity • Data comes from heterogeneous sources • Real-world observations • Spatial-temporal contexts • Collection/measurement protocols and procedures • Many representations for the same information (count, area, density) • Data, syntax, schema, semantic heterogeneity • Discovery and “synthesis” (integration) performed manually • Discovery often based on an intuitive notion of “what is out there” • Synthesis of data is very time consuming, which limits use
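The "many representations" problem is concrete: one source may report a species count together with a plot area, another a density directly. A small sketch of normalizing both to one representation (the field names are invented):

    # Normalize heterogeneous records to one representation:
    # density = count / area.
    records = [
        {"site": "A", "count": 42, "area_m2": 100.0},  # count + area
        {"site": "B", "density_per_m2": 0.35},         # density directly
    ]

    def to_density(rec):
        if "density_per_m2" in rec:
            return rec["site"], rec["density_per_m2"]
        return rec["site"], rec["count"] / rec["area_m2"]

    print([to_density(r) for r in records])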

  23. Scientific workflow systems support data analysis (screenshot of the Kepler GUI)

  24. A simple Kepler workflow (T. McPhillips) • The screenshot highlights a composite component (sub-workflow) • Loops are often used in SWFs, e.g., in genomics and bioinformatics (collections of data, nested data, statistical regressions, ...)

  25. A simple Kepler workflow (T. McPhillips) • Component labels: lists Nexus files to process (project); reads text files; parses Nexus format; draws phylogenetic trees • PhylipPars infers trees from discrete, multi-state characters • The workflow runs PhylipPars iteratively to discover all of the most parsimonious trees • UniqueTrees discards redundant trees in each collection

  26. A simple Kepler workflow An example workflow run, executed as a Dataflow Process Network
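In a dataflow process network, each component runs as an independent process that fires whenever tokens arrive on its input channels. Python generators give a compact, single-threaded sketch of the same firing discipline; the stages below are placeholders, with the last one mirroring the UniqueTrees deduplication step from slide 25:

    # Each stage consumes tokens from its input channel and emits tokens
    # on its output channel; chaining the generators forms the network.

    def source():
        for record in ["tree_a", "tree_b", "tree_a"]:  # placeholder tokens
            yield record

    def transform(tokens):
        for t in tokens:
            yield t.upper()  # stand-in for a real analysis step

    def unique(tokens):
        seen = set()
        for t in tokens:     # mirrors UniqueTrees: drop duplicate tokens
            if t not in seen:
                seen.add(t)
                yield t

    for token in unique(transform(source())):
        print(token)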

  27. Smart Mediation Services (SMS) motivation • Scientific Workflow Life-cycle • Resource Discovery • discover relevant datasets • discover relevant actors or workflow templates • Workflow Design and Configuration • data → actor (data binding) • data → data (data integration / merging / interlinking) • actor → actor (actor / workflow composition) • Challenge: do all this in the presence of … • 100’s of workflows and templates • 1000’s of actors (e.g. actors for web services, data analytics, …) • 10,000’s of datasets • 1,000,000’s of data items • … highly complex, heterogeneous data • price to pay for these resources: $$$ (lots); scientist’s time wasted: priceless!

  28. Approach & SMS capabilities • Annotations “connect” resources to ontologies • Conceptually describe a resource and/or its “data schema” • Annotations provide the means for ontology-based discovery, integration, … • (Diagram: an iterative development cycle linking ontologies, semantic annotation, resource discovery, workflow validation, resource integration, and workflow elaboration.)

  29. “Hybrid” types … semantic + structural typing • Structural types: given a structural type language S, datasets, inputs, and outputs can be assigned structural types S ∈ S, e.g. S : SpeciesData(site, day, spp, occ) • Semantic types: given an ontology language O (e.g., OWL-DL), datasets, inputs, and outputs can be assigned ontology types O ∈ O, e.g. O : Observation ⊓ ∃obsProperty.SpeciesOccurrence • Two connected actors A1 and A2 can be semantically compatible (Oout ⊑ Oin) yet structurally incompatible (Sout ≠ Sin) • Semantic & structural types can be combined using logic constraints, e.g. Φ := ∀(site, day, sp, occ) SpeciesData(site, day, sp, occ) → ∃y Observation(y) ∧ obsProp(y, occ) ∧ SpeciesOccurrence(occ)
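Once ports carry both annotations, the compatibility check itself is mechanical: structural compatibility compares schemas, while semantic compatibility asks whether the producer's ontology class is subsumed by the consumer's. A toy Python sketch (the ontology fragment and port schemas are invented):

    # Toy ontology: child class -> parent class.
    is_a = {"SpeciesOccurrence": "Observation", "Observation": "Thing"}

    def subsumed_by(cls, ancestor):
        # Walk up the is-a hierarchy (ontology subsumption check).
        while cls is not None:
            if cls == ancestor:
                return True
            cls = is_a.get(cls)
        return False

    # Port annotations: (semantic type, structural type).
    producer_out = ("SpeciesOccurrence", ("site", "day", "spp", "occ"))
    consumer_in  = ("Observation",       ("site", "date", "taxon", "count"))

    semantically_ok = subsumed_by(producer_out[0], consumer_in[0])  # True
    structurally_ok = producer_out[1] == consumer_in[1]             # False
    # Semantically compatible but structurally incompatible, so a
    # structural adapter is needed between the two actors.
    print(semantically_ok, structurally_ok)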

  30. Semantic Type Annotation in Kepler • Component input and output port annotation • Each port can be annotated with multiple classes from multiple ontologies • Annotations are stored within the component metadata

  31. Component Annotation and Indexing • Component Annotations • New components can be annotated and indexed into the component library (e.g., specializing generic actors) • Existing components can also be revised, annotated, and indexed (hiding previous versions)

  32. Approach & SMS capabilities • Ontology-based “smart” search • Find components by semantic types • Find components by input/output semantic types • Ontology-based query rewriting for discovery/integration • Joint work with GEON project (see SSDBM-04, SWDB-04) • (Diagram: the same iterative development cycle as on slide 28.)

  33. Smart Search • Find a component (here: an actor) in different locations (“categories”) • … based on the semantic annotation of the component (or its ports) • (Screenshot labels: browse for components; search for component name; search for category / keyword)

  34. Searching in context • Search for components with compatible input/output semantic types • … searches over actor library • … applies subsumption checking on port annotations

  35. Approach & SMS capabilities • Workflow validation and analysis • Check that workflows are semantically & structurally well-typed • Infer semantic type annotations of derived data (i.e., type inference) • An initial approach and prototype based on mapping composition (see QLQP-05) • User-oriented provenance • Collect & query data lineage of WF runs (see IPAW-06) • (Diagram: the same iterative development cycle as on slide 28.)

  36. Workflow validation in Kepler • Statically perform semantic and structural type checking • Navigate errors and warnings within the workflow • Search for and insert “adapters” to fix (structural and semantic) errors …

  37. Approach & SMS capabilities • Integrating and transforming data • Merge (“smart union”) datasets • Find mappings between data schemas for transformation • data binding, component connections (see DILS-04) • (Diagram: the same iterative development cycle as on slide 28.)

  38. Smart (Data) Integration: Merge • Discover data of interest • … connect to merge actor • … “compute merge” • align attributes via annotations • open dialog for user refinement • store merge mapping in MOML • … enjoy! • … your merged dataset • almost, can be much more complicated

  39. Under the hood of “Smart Merge” … • Exploits semantic type annotations and ontology definitions to find mappings between sources • Executing the merge actor results in an integrated data product (via “outer union”) • (Diagram: two source tables whose Site and Biomass attributes appear under different attribute ids (a1…a8) are aligned via the annotation mappings and combined into a single merge-result table.)
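The outer union can be imitated with an ordinary dataframe library: attributes that the annotations declare equivalent are renamed to a common name, then the tables are stacked so attributes absent from one source come out as missing values. A sketch with pandas, in which the alignment mapping stands in for what the semantic annotations would supply:

    import pandas as pd

    # Two sources describing the same quantities under different schemas.
    src1 = pd.DataFrame({"Site": ["a", "b"], "Year": [2001, 2001],
                         "Biomass": [5.0, 6.0]})
    src2 = pd.DataFrame({"location": ["a", "c"], "biomass_kg": [0.1, 0.2]})

    # The mapping the semantic annotations would provide: which attributes
    # denote the same ontology concept.
    alignment = {"location": "Site", "biomass_kg": "Biomass"}

    # Outer union: align attribute names, then stack rows; attributes
    # missing from one source (here, Year in src2) come out as NaN.
    merged = pd.concat([src1, src2.rename(columns=alignment)],
                       ignore_index=True)
    print(merged)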

  40. Approach & SMS capabilities • Workflow design support • (Semi-) automatically combine resource discovery, integration, and validation • Abstract → executable WF • … ongoing work! • (Diagram: the iterative development cycle from slide 28, extended with automated SWF refinement.)

  41. Initial Work on Provenance Framework • Provenance • Track origin and derivation information about scientific workflows, their runs and derived information (datasets, metadata…) • Need for Provenance • Association of process and results • reproduce results • “explain & debug” results (via lineage tracing, parameter settings, …) • optimize: “Smart Re-Runs” • Types of Provenance Information: • Data provenance • Intermediate and end results including files and db references • Process (=workflow instance) provenance • Keep the wf definition with data and parameters used in the run • Error and execution logs • Workflow design provenance (quite different) • WF design is a (little supported) process (art, magic, …) • for free via cvs: edit history • need more “structure” (e.g. templates) for individual & collaborative workflow design

  42. Kepler Provenance Recording Utility • Parametric and customizable • Different report formats • Variable levels of detail • Verbose-all, verbose-some, medium, on error • Multiple cache destinations • Saves information on • User name, Date, Run, etc…
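A minimal version of such a recorder can be written as a decorator that captures who ran which step, when, with what parameters, and whether it failed. This is a sketch of the concept, not Kepler's actual utility:

    import functools, getpass, json, time

    provenance_log = []  # stand-in for a cache destination

    def record_provenance(func):
        # Wrap a processing step so every run logs user, time,
        # parameters, and outcome (including errors).
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            entry = {"user": getpass.getuser(),
                     "step": func.__name__,
                     "started": time.time(),
                     "params": json.dumps([repr(a) for a in args])}
            try:
                result = func(*args, **kwargs)
                entry["status"] = "ok"
                return result
            except Exception as exc:
                entry["status"] = "error: %r" % exc  # verbose on error
                raise
            finally:
                provenance_log.append(entry)
        return wrapper

    @record_provenance
    def calibrate(x):
        return x * 2.0

    calibrate(21.0)
    print(provenance_log)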

  43. Provenance: Possible Next Steps • Provenance Meeting: Last week at SDSC • Deciding on terms and definitions • .kar file generation, registration and search for provenance information • Possible data/metadata formats • Automatic report generation from accumulated data • A GUI to keep track of the changes • Adding provenance repositories • A relational schema for the provenance info in addition to the existing XML

  44. Some other workflow systems • SCIRun • Sciflo • Triana • Taverna • Pegasus • Some commercial tools: • Windows Workflow Foundation • Mac OS X Automator • http://www.isi.edu/~gil/AAAI08TutorialSlides/5-Survey.pdf • http://www.isi.edu/~gil/AAAI08TutorialSlides/ • See reading for this week

  45. Data Stewardship • Putting a number of data life-cycle and management aspects together • Keep the ideas in mind as you complete your assignments • Why it is important • Some examples

  46. Why it is important • 1976 NASA Viking mission to Mars (A. Hesseldahl, “Saving Dying Data,” Forbes, Sep. 12, 2002. [Online]. Available: http://www.forbes.com/2002/09/12/0912data_print.html) • 1986 BBC Digital Domesday (A. Jesdanun, “Digital memory threatened as file formats evolve,” Houston Chronicle, Jan. 16, 2003. [Online]. Available: http://www.chron.com/cs/CDA/story.hts/tech/1739675) • R. Duerr, M. A. Parsons, R. Weaver, and J. Beitler, “The international polar year: Making data available for the long-term,” in Proc. Fall AGU Conf., San Francisco, CA, Dec. 2004. [Online]. Available: ftp://sidads.colorado.edu/pub/ppp/conf_ppp/Duerr/The_International_Polar_Year:_Making_Data_and_Information_Available_for_the_Long_Term.ppt

  47. At the heart of it • Inability to read the underlying sources, e.g. the data formats, metadata formats, knowledge formats, etc. • Inability to know the inter-relations, assumptions and missing information • We’ll look at a (data) use case for this shortly • But first we will look at what, how and who in terms of the full life cycle

  48. What to collect? • Documentation • Metadata • Provenance • Ancillary Information • Knowledge
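In practice these items end up as a structured record deposited alongside the data. A purely hypothetical example of what such a record might hold (the field names are illustrative, not a standard schema):

    # Hypothetical stewardship record accompanying a dataset.
    stewardship_record = {
        "documentation": "docs/instrument_handbook_v3.pdf",
        "metadata": {"format": "NetCDF-4", "coverage": "global, 1x1 degree"},
        "provenance": [
            "raw -> calibrated (calibrate.py, 2012-11-01)",
            "calibrated -> gridded (regrid.py, 2012-11-02)",
        ],
        "ancillary": ["calibration tables", "station histories"],
        "knowledge": "sensor drift known after 2010; see project erratum",
    }

    # A curator can answer "where did this grid come from?" directly:
    print(stewardship_record["provenance"])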

  49. Who does this? • Roles: • Data creator • Data analyst • Data manager • Data curator

  50. How it is done • Opening and examining Archive Information Packages • Reviewing data management plans and documentation • Talking (!) to the people: • Data creator • Data analyst • Data manager • Data curator • Sometimes, reading the data and code
