Towards Scientific Workflows Based on Dataflow Process Networks (or from Ptolemy to Kepler)

Towards Scientific Workflows Based on Dataflow Process Networks (or from Ptolemy to Kepler) Bertram Ludäscher San Diego Supercomputer Center ludaesch@SDSC.edu

Acknowledgements (NSF, NIH, DOE) • GEOsciences Network (NSF) www.geongrid.org • Biomedical Informatics Research Network (NIH) www.nbirn.net • Science Environment for Ecological Knowledge (NSF) seek.ecoinformatics.org • Scientific Data Management Center (DOE) sdm.lbl.gov/sdmcenter/

Outline • Scientific Workflows • Business Workflows • [Problem Solving Environments (SCIRun)] • Dataflow Process Networks (Ptolemy-II) • Scientific Workflows (Kepler)

Promoter Identification Workflow (PIW) Source: Matt Coleman (LLNL)

Source: NIH BIRN (Jeffrey Grethe, UCSD)

GARP Invasive Species Pipeline [figure: workflow diagram] • Inputs: Registered Ecogrid Databases; User Sample Data • Processing steps shown: EcoGrid Query, Layer Integration, Data Calculation, Map Generation, Validation, Generate Metadata, Archive To Ecogrid • Data products: (a) Species presence & absence points (native range; invasion area) • (b) Environmental layers (native range; invasion area) • (c) Integrated layers (native range; invasion area) • (d) Training sample; Test sample • (e) GARP rule set • (f) Native range prediction map; Invasion area prediction map • (g) Model quality parameter • (h) Selected prediction maps • Source: NSF SEEK (Deana Pennington, UNM)

Scientific Workflow Aspects • Data orientation • Data volume • Data complexity • Data integration • Computational complexity • Grid-aspects • Distributed computation • Distributed data • Analysis and tool integration • User-interactions/WF steering • Data and workflow provenance

Business Workflows • show their office automation ancestry • documents and “work-tasks” are passed • no data streaming, no data-intensive pipelines • lots of standards to choose from: WfMC, WSFL, BPML, BPEL4WS, XPDL, … • but often no clear execution semantics for constructs as simple as this: [figure] Source: Expressiveness and Suitability of Languages for Control Flow Modelling in Workflows, PhD thesis, Bartosz Kiepuszewski, 2002

A ZOO of Workflow Standards and Systems Source: W.M.P. van der Aalst et al. http://tmitwww.tm.tue.nl/research/patterns/

More on Scientific WF vs Business WF • Business WF • Tasks, documents, etc. undergo modifications (e.g., flight reservation from reserved to ticketed), but modified WF objects are still identifiable throughout • Complex control flow, task-oriented • Transactions w/o rollback (ticket: reserved → purchased) • … • Scientific WF • data-in and data-out of an analysis step are not the same object! • dataflow, data-oriented (cf. AVS/Express, Khoros, …) • re-run automatically (a la distrib. comp., e.g. Condor) or user-driven/interactively (based on failure type) • data integration & semantic typing as part of SWF framework • …

Scientific Workflows: Some Findings • More dataflow than (business) workflow • but some branching, looping, merging, … • not: documents/objects undergoing modifications • instead often: dataset-out = analysis(dataset-in) • Need for “programming extension” • Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …) • Need for abstraction and nested workflows • Need for data transformations (compute/transform alternations) • Need for rich user interaction & workflow steering: • pause / revise / resume • select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF • Need for high-throughput transfers (“grid-enabling”, “streaming”) • Need for persistence of intermediate products → data provenance (“virtual data” concept)
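The "programming extension" items on this slide (foreach, filtering, functional composition, zip, map(f)) can be illustrated with a minimal Haskell sketch. The Sample record, the calibration offset and the plausibility rule are hypothetical and stand in for real analysis steps; the point is only that each step returns a new dataset rather than modifying its input.

```haskell
module Main where

-- Hypothetical sample record; field names are illustrative only.
data Sample = Sample { sampleId :: Int, reading :: Double }
  deriving (Show, Eq)

-- An analysis step produces a NEW value: dataset-out = analysis(dataset-in).
calibrate :: Double -> Sample -> Sample
calibrate offset s = s { reading = reading s + offset }

-- Filtering step: keep only non-negative readings (an assumed rule).
plausible :: Sample -> Bool
plausible s = reading s >= 0

-- Functional composition of a filter step and a map step into one sub-workflow.
cleanAndCalibrate :: [Sample] -> [Sample]
cleanAndCalibrate = map (calibrate 0.5) . filter plausible

-- zip pairs every kept input with its derived output,
-- a very crude stand-in for recording provenance.
withInputs :: [Sample] -> [(Sample, Sample)]
withInputs xs = zip kept (map (calibrate 0.5) kept)
  where
    kept = filter plausible xs

main :: IO ()
main = do
  let datasetIn = [Sample 1 2.0, Sample 2 (-1.0), Sample 3 4.5]
  mapM_ print (cleanAndCalibrate datasetIn)
  mapM_ print (withInputs datasetIn)
```

Because datasetIn is never modified, intermediate products stay available for persistence, which is what the provenance bullet asks for.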

Problem Solving Environments • SCIRun: a dynamic dataflow system (in the Ptolemy sense) → separate presentation

SWF vs Distributed Computing • Distributed Computing (e.g. a la Condor(-G)) • Batch oriented • Transparent distributed computing (“remote Unix/Java”; standard/Java universes in Condor) • HPC resource allocation & scheduling • SWF • Often highly interactive for decision making/steering of the WF and visualization (data analysis) • Transparent data access (Grid) and integration (database mediation & semantic extensions) • Desktop metaphor; often (but not always!) light-weight web service invocation

Dataflow Process Networks and Ptolemy-II • see! • read! • try! • Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

Dataflow Process Networks: Why Ptolemy-II? • PtII Objective: • “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.” • Data & Process oriented: • Dataflow process networks • Natural Data Streaming Support • Pragmatics • mature, actively maintained, open source system • leverage “sister projects” activities (e.g. SEEK)

Ptolemy-II Type System

Scientific Workflows = Dataflow Process Networks + ? • Kepler = Ptolemy-II + X • X = … • Grid extensions: • Actors as web/grid services • 3rd party data transfer, high-throughput data streaming • Data and service repositories, discovery • Extended type system (structural & semantic extensions) • Programming extensions (declarative/FP) • Rich user interactions/workflow steering • Rich data transformations (compute/transform alternations) • Data provenance • (semi-)automatic meta-data creation • … • … – (minus) upcoming Ptolemy-II extensions (PtII, SEEK, …)! • The slower we are, the less we have to do ourselves ;-)

X includes: The customer is always right … • Intuitive … • component composition • data binding • execution monitoring • Reusability of … • Generic components (actors) • Derived data products • Application specific packaging and “branding” • Transparent “gridification”

Some specific tasks for Kepler (legend: $ = DONE (or almost ;-), % = ONGOING, * = NEW) • User interaction, workflow steering • $ Pause/revise/resume • % BrowserUI actor (browser as a 0-learning display and selection tool) • Distributed execution • % Dynamically port-specializing WSDL actor • * Dynamically specializing Grid service actor • Port & actor type extensions (SEEK leverage) • * Structural types (XML Schema) • * Semantic types (OWL) incl. unit types w/ automatic conversion • Programming extensions • % Data transformation actors (XSLT, XQuery, Python, Perl, …) • * map, zip, zipWith, …, loop, switch “patterns” • Specialized Data Sources • $ EML (SEEK) • % MS Access (GEON) • * JDBC • * XML • * NetCDF • …

Some specific tasks for Kepler (all NEW) • Design & develop transparent, Grid-enabled PNs: • Communication protocol details • Grid-actor extensions and/or • Grid-Process Network director (G-PN) • Host/Source-location becomes actor parameter • add “active-inline” parameter display for grid-actors (@exec-loc), channels (@transport-protocol), source-actors (@{src-loc|catalog-loc}) • Activity Monitoring • Add “activity status” display (green, yellow, red) to replace PtII animation (needed for concurrently executing PN!) • Register & Deploy mechanism • Actor/Data/Workflow repository (=composite actors) • Shows up as (config’able) actor library • OGSA Service Registry approach? (SEEK leverage; UDDI complex & limited says MattJ) • http://www-unix.globus.org/toolkit/draft-ggf-ogsi-gridservice-33_2003-06-27.pdf • MOML extensions • Also separate language?

Example: Grid-enabling (again: SEEK leverage opportunity)

Dataflow Process Networks • [figure: two actors with typed i/o ports, connected by a FIFO channel] • Synchronous Dataflow Network (SDF) • Statically schedulable single-threaded dataflow • Can execute multi-threaded, but the firing-sequence is known in advance • Maximally well-behaved, but also limited expressiveness • Process Network (PN) • Multi-threaded dynamically scheduled dataflow • More expressive than SDF (dynamic token rate prevents static scheduling) • Natural streaming model • Other Execution Models (“Domains”) • Implemented through different “Directors” • [figure label: advanced push/pull]
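The PN model on this slide (concurrent actors, FIFO channels, blocking reads) can be sketched in a few lines of Haskell. This is an illustration of the model only, not Ptolemy-II code: each actor is a lightweight thread, each channel is an unbounded Chan from the base library, and the end-of-stream marker (Nothing) is an assumption of this sketch.

```haskell
module Main where

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Monad (forM_)

-- A token is either a data item or an end-of-stream marker (sketch convention).
type Token = Maybe Int

-- Source actor: emits the given items, then signals end-of-stream.
source :: [Int] -> Chan Token -> IO ()
source xs out = do
  forM_ xs (writeChan out . Just)
  writeChan out Nothing

-- Transformer actor: blocking read on its input FIFO, apply f, write the result.
mapActor :: (Int -> Int) -> Chan Token -> Chan Token -> IO ()
mapActor f inp out = loop
  where
    loop = do
      t <- readChan inp
      case t of
        Just x  -> writeChan out (Just (f x)) >> loop
        Nothing -> writeChan out Nothing

-- Sink: drains its input channel into a list on the calling thread.
sink :: Chan Token -> IO [Int]
sink inp = do
  t <- readChan inp
  case t of
    Just x  -> (x :) <$> sink inp
    Nothing -> return []

main :: IO ()
main = do
  c1 <- newChan
  c2 <- newChan
  c3 <- newChan
  -- Every actor runs in its own thread; scheduling is dynamic.
  _ <- forkIO (source [1 .. 5] c1)
  _ <- forkIO (mapActor (* 2) c1 c2)
  _ <- forkIO (mapActor (+ 1) c2 c3)
  result <- sink c3
  print result -- [3,5,7,9,11]
```

The result does not depend on how the threads are interleaved, because every actor only performs blocking reads on a single input. An SDF director could instead precompute the firing sequence for this chain, since every actor here consumes and produces exactly one token per firing.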

Transparently Grid-Enabling PtII: Handles • [figure: actors A and B in PtII space, grid hosts GA and GB in Grid space, messages numbered 1-7] • Logical token transfer (3) requires get_handle (1, 2); then exec_handle (4, 5, 6, 7) for completion. • 1. A → GA: get_handle • 2. GA → A: return &X • 3. A → B: send &X • 4. B → GB: request &X • 5. GB → GA: request &X • 6. GA → GB: send *X • 7. GB → B: send done(&X) • Example: • &X = “GA.17” • *X = <some_huge_file>
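The handle protocol can be simulated with pure functions to show which data moves where. In this sketch each grid host is modelled as a map from handles to payloads; the names Handle, Store, getHandle and thirdPartyTransfer are hypothetical and do not correspond to any Ptolemy-II or Globus API.

```haskell
module Main where

import qualified Data.Map as Map

-- A handle (&X on the slide) names data held at a grid host, e.g. "GA.17".
newtype Handle = Handle String
  deriving (Show, Eq, Ord)

-- A grid host's store maps handles to (potentially huge) payloads (*X).
type Store = Map.Map Handle String

-- Steps 1-2: actor A asks its grid host for a handle to the data it produced.
-- The payload stays in the host's store; only the handle is returned.
getHandle :: String -> Int -> String -> Store -> (Handle, Store)
getHandle host n payload store = (h, Map.insert h payload store)
  where
    h = Handle (host ++ "." ++ show n)

-- Steps 5-6: GB requests &X from GA and receives *X directly (third-party
-- transfer). Returns Nothing if GA does not know the handle.
thirdPartyTransfer :: Handle -> Store -> Store -> Maybe Store
thirdPartyTransfer h ga gb = do
  payload <- Map.lookup h ga
  return (Map.insert h payload gb)

main :: IO ()
main = do
  let (hx, gaStore) = getHandle "GA" 17 "<some_huge_file>" Map.empty
  -- Step 3: only the small handle travels from A to B inside PtII space.
  putStrLn ("A -> B: " ++ show hx)
  -- Steps 4-7: B hands the handle to GB, which pulls the payload from GA.
  case thirdPartyTransfer hx gaStore Map.empty of
    Just gbStore -> putStrLn ("done; GB now holds " ++ show (Map.keys gbStore))
    Nothing      -> putStrLn "unknown handle"
```

The payload never passes through A or B; the PtII channel carries only the handle, which is what makes the grid transfer transparent to the workflow.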

Transparently Grid-Enabling PtII • Different phases • Register designed WF (could include external validation service) • Find suitable grid service hosts for actors • Pre-stage execution • Execute • Archive execution log • Implementation choices: • Grid-actors (no change of director necessary) • and/or Grid-(PN)-director (also need to change actors!?) • Add grid service host id as actor parameter: A@GA • Similar for data: myDB@GA

Programming Extensions (some lessons from SciDAC/SSDBM demo)

Promoter Identification Workflow in Ptolemy-II (SSDBM’03) • [figure: PIW as a Ptolemy-II model, with annotations:] • actors “designed to fit” • hand-crafted Web-service actor • hand-crafted control solution; also: forces sequential execution! • No data transformations available • Complex backward control-flow

Promoter Identification Workflow in FP
genBankG :: GeneId -> GeneSeq
genBankP :: PromoterId -> PromoterSeq
blast :: GeneSeq -> [PromoterId]
promoterRegion :: PromoterSeq -> PromoterRegion
transfac :: PromoterRegion -> [TFBS]
gpr2str :: (PromoterId, PromoterRegion) -> String
d0 = Gid "7" -- start with some gene-id
d1 = genBankG d0 -- get its gene sequence from GenBank
d2 = blast d1 -- BLAST to get a list of potential promoters
d3 = map genBankP d2 -- get list of promoter sequences
d4 = map promoterRegion d3 -- compute list of promoter regions and ...
d5 = map transfac d4 -- ... get transcription factor binding sites
d6 = zip d2 d4 -- create list of pairs promoter-id/region
d7 = map gpr2str d6 -- pretty print into a list of strings
d8 = concat d7 -- concat into a single "file"
d9 = putStr d8 -- output that file

Simplified Process Network PIW • Back to purely functional dataflow process network (= a data streaming model!) • Re-introducing map(f) to Ptolemy-II (was there in PT Classic) • no control-flow spaghetti • data-intensive apps • free concurrent execution • free type checking • automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId] • [figure annotations:] • map(f)-style iterators • Powerful type checking • Generic, declarative “programming” constructs • Generic data transformation actors • Forward-only, abstractable sub-workflow piw(GeneId)
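The step from piw(GeneId) to PIW := map(piw) over [GeneId] can be made concrete with a runnable sketch. The service names and type signatures follow the FP slide; the function bodies are hypothetical stubs that only build strings, standing in for the real GenBank and BLAST web services. The transfac step is left out here because its result does not feed the printed output.

```haskell
module Main where

newtype GeneId = Gid String
newtype GeneSeq = GeneSeq String
newtype PromoterId = Pid String
newtype PromoterSeq = PromoterSeq String
newtype PromoterRegion = PromoterRegion String

-- Stubbed services (hypothetical bodies; real actors would call web services).
genBankG :: GeneId -> GeneSeq
genBankG (Gid g) = GeneSeq ("seq-of-gene-" ++ g)

blast :: GeneSeq -> [PromoterId]
blast (GeneSeq s) = [Pid (s ++ "/p1"), Pid (s ++ "/p2")]

genBankP :: PromoterId -> PromoterSeq
genBankP (Pid p) = PromoterSeq ("seq-of-" ++ p)

promoterRegion :: PromoterSeq -> PromoterRegion
promoterRegion (PromoterSeq s) = PromoterRegion ("region-of-" ++ s)

gpr2str :: (PromoterId, PromoterRegion) -> String
gpr2str (Pid p, PromoterRegion r) = p ++ " -> " ++ r ++ "\n"

-- The forward-only sub-workflow for a single gene id.
piw :: GeneId -> String
piw gid = concatMap gpr2str (zip pids regions)
  where
    pids    = blast (genBankG gid)
    regions = map (promoterRegion . genBankP) pids

-- PIW := map(piw): lifting to a list of gene ids needs no extra control flow.
piwAll :: [GeneId] -> [String]
piwAll = map piw

main :: IO ()
main = mapM_ putStr (piwAll [Gid "7", Gid "8"])
```

Since piw has no side effects, the elements of piwAll are independent of one another, which is the basis for the slide's "free concurrent execution" point.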

Optimization by Declarative Rewriting I • PIW as a declarative, referentially transparent functional process • → optimization via functional rewriting possible, e.g. map(f o g) = map(f) o map(g) • [figure annotations:] • map(f o g) instead of map(f) o map(g) • Combination of map and zip • Details: Technical report & PIW specification in Haskell • http://kbi.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf

Optimizing II: Streams & Pipelines • Clean functional semantics facilitates algebraic workflow (program) transformations (Bird-Meertens); e.g. mapS f • mapS g → mapS (f • g) • Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney
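The map-fusion rewrite can be checked on a small example. The sketch below assumes streams are modelled as lazy Haskell lists, so mapS is simply map; the functions f and g are arbitrary placeholders for two analysis steps.

```haskell
module Main where

-- Streams modelled as lazy (possibly infinite) lists; an assumption of this sketch.
mapS :: (a -> b) -> [a] -> [b]
mapS = map

-- Two placeholder per-token analysis steps.
f :: Int -> Int
f = (* 3)

g :: Int -> Int
g = (+ 1)

-- Two pipeline stages, each traversing the stream ...
twoStages :: [Int] -> [Int]
twoStages = mapS f . mapS g

-- ... versus one fused stage applying the composed function.
fused :: [Int] -> [Int]
fused = mapS (f . g)

main :: IO ()
main = do
  let stream = [1 ..] :: [Int]
  print (take 5 (twoStages stream)) -- [6,9,12,15,18]
  print (take 5 (fused stream))     -- [6,9,12,15,18]
  -- Compare a longer prefix of the infinite stream.
  print (take 100 (twoStages stream) == take 100 (fused stream)) -- True
```

The rewrite is only valid because f and g are pure. In a process network the fused form replaces two actors and the channel between them with a single actor.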

Data Transformation Actors: Our Approach (proposal) • Manual • XQuery, XSLT, Perl, Python, … transformation actor (development) • (Semi-)automatic • Semantic-type guided transformation generation (research) • Also: Web Service Composition is … • … a hot topic • … a reincarnation of many “old” ideas • (e.g., AI-style planning born-again; functional composition; query composition; … ) • … a separate topic

Contrast to Existing Dataflow Systems Here: Commercial

Workflow and distributed computation grid created with Kensington Discovery Edition from InforSense.

F I N: Words to/from the Wise • FYI: Flow-based programming has been re-discovered/re-invented several times by different communities. Here is an “IBM practitioner’s view” (Flow-based Programming, http://www.jpaulmorrison.com/fbp/): • “… In "Flow-Based Programming" (FBP), applications are defined as networks of "black box" processes, which exchange data across predefined connections. These black box processes can be reconnected endlessly to form different applications without having to be changed internally. It is thus naturally component-oriented. To describe this capability, the distinguished IBM engineer, Nate Edwards, coined the term "configurable modularity", which he calls the basis of all true engineered systems. When using FBP, the application developer works with flows of data, being processed asynchronously, rather than the conventional single hierarchy of sequential, procedural code. It is thus a good fit with multiprocessor computers, and also with modern embedded software. In many ways, an FBP application resembles more closely a real-life factory, where items travel from station to station, undergoing various transformations. Think of a soft drink bottling factory, where bottles are filled at one station, capped at the next and labelled at yet another one. FBP is therefore highly visual: it is quite hard to work with an FBP application without having the picture laid out on one's desk, or up on a screen! For an example, see Sample DrawFlow Diagram. Strangely though, in spite of being at the leading edge of application development, it is also simple enough that trainee programmers can pick it up, and it is a much better match with the primitives of data processing than the conventional primitives of procedural languages. The key, of course (and perhaps the reason why it hasn't caught on more widely), is that it involves a significant paradigm shift that changes the way you look at programming, and once you have made this transition, you find you can never go back! FBP seems to dovetail neatly with a concept that I call "smart data". There is a section on this in stuff about the author. A new web page on this topic has just been uploaded - see "Smart Data" and Business Data Types - and we will be publishing more as it develops. …”
