Kepler: Towards a Grid-Enabled System for Scientific Workflows

Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew Jones, Bertram Ludäscher*, Steve Mock (*ludaesch@SDSC.EDU) San Diego Supercomputer Center (SDSC), University of California, San Diego (UCSD)

Outline • Motivation: Scientific Workflows (SEEK, SDM, GEON, ..) • Current Features of the Kepler Scientific Workflows System • Extending Kepler: • Grid-Enabling Kepler: • 3rd party transfer • WF planning & optimization • Shipping and Handling Algebra (SHA) • Web Service Composition as Declarative Query Plans • Semantic Types for Scientific Workflows • Conclusions

Kepler Team, Projects, Sponsors • Ilkay Altintas (SDM) • Chad Berkley (SEEK) • Shawn Bowers (SEEK) • Jeffrey Grethe (BIRN) • Christopher H. Brooks (Ptolemy II) • Zhengang Cheng (SDM) • Efrat Jaeger (GEON) • Matt Jones (SEEK) • Edward A. Lee (Ptolemy II) • Kai Lin (GEON) • Bertram Ludäscher (BIRN, GEON, SDM, SEEK) • Steve Mock (NMI) • Steve Neuendorffer (Ptolemy II) • Jing Tao (SEEK) • Mladen Vouk (SDM) • Yang Zhao (Ptolemy II) • …

Example: SEEK (Science Environment for Ecological Knowledge), a large NSF ITR • Analysis & Modeling System • Design and execution of ecological models and analysis • End user focus • application-/upperware • Semantic Mediation System • Data Integration of hard-to-relate sources and processes • Semantic Types and Ontologies • upper middleware • EcoGrid • Access to ecology data and tools • middle-/underware • Architecture Overview (cf. Cyberinfrastructure)

Ecology: GARP Analysis Pipeline for Invasive Species Prediction [Figure: workflow diagram. Data products: (a) species presence & absence points (native range and invasion area), (b) environmental layers (native range and invasion area), (c) integrated layers, (d) training and test samples, (e) GARP rule set, (f) native range and invasion area prediction maps, (g) model quality parameter, (h) selected prediction maps. Steps: EcoGrid Query over registered EcoGrid databases, User Sample Data, Layer Integration, Data Calculation, Map Generation, Validation, Generate Metadata, Archive to EcoGrid.] Source: NSF SEEK (Deana Pennington et al., UNM)

Genomics Example: Promoter Identification Workflow (PIW) Source: Matt Coleman (LLNL)

Source: NIH BIRN (Jeffrey Grethe, UCSD)

Scientific “Workflows”: Some Findings • More dataflow than (business control-/) workflow • DiscoveryNet, Kepler, SCIRun, Scitegic, Taverna, Triana, … • Need for “programming extension” • Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …) • Need for abstraction and nested workflows • Need for data transformations (WS1 → DT → WS2) • Need for rich user interaction & workflow steering: • pause / revise / resume • select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF • Need for high-throughput transfers (“grid-enabling”, “streaming”) • Need for persistence of intermediate products and provenance

In a Flux: Workflow “Standards” Source: W.M.P. van der Aalst et al. http://tmitwww.tm.tue.nl/research/patterns/ http://tmitwww.tm.tue.nl/staff/wvdaalst/Publications/publications.html

Commercial & Open Source Scientific “Workflow” (well, Dataflow) Systems • Kensington Discovery Edition from InforSense • Triana • Taverna

SCIRun: Problem Solving Environments for Large-Scale Scientific Computing • SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations • New collaboration under Kepler/SDM • Component model, based on generalized dataflow programming Steve Parker (cs.utah.edu)

Our Starting Point: Ptolemy II & Dataflow Process Networks (see! read! try!) Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

Why Ptolemy II? • Ptolemy II Objective: • “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.” • Data & Process oriented: Dataflow process networks • Natural Data Streaming Support • User-Orientation • (“application-ware”, not middle-/under-ware) • Workflow design & exec console (Vergil GUI) • PRAGMATICS • mature, actively maintained, well-documented (500+pp) • open source system • developed across multiple projects (NSF/ITRs SEEK and GEON, DOE SciDAC SDM, …) • hoping to leverage e-sister projects (e.g. Taverna, …)

Dataflow Process Networks: Putting Computation Models (“Orchestration”) first! [Figure: two actors with typed i/o ports connected by a FIFO channel; advanced push/pull] • Synchronous Dataflow Network (SDF) • Statically schedulable single-threaded dataflow • Can execute multi-threaded, but the firing sequence is known in advance • Maximally well-behaved, but also limited expressiveness • Process Network (PN) • Multi-threaded dynamically scheduled dataflow • More expressive than SDF (dynamic token rate prevents static scheduling) • Natural streaming model • Other Execution Models (“Domains”) • Implemented through different “Directors”
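Why SDF is statically schedulable can be illustrated with the standard SDF balance equations: on every channel, firings of the producer times tokens produced must equal firings of the consumer times tokens consumed. The following is a minimal sketch for a linear chain of actors only; `Channel` and `repetitions` are hypothetical names for illustration and not part of the Ptolemy II or Kepler API.

```haskell
import Data.Ratio ((%), numerator, denominator)

-- Sketch only: one channel between two consecutive actors in a chain.
-- 'produced' = tokens written per firing of the upstream actor,
-- 'consumed' = tokens read per firing of the downstream actor.
-- Both are assumed to be positive.
data Channel = Channel { produced :: Integer, consumed :: Integer }

-- Smallest positive firing counts r_0 .. r_n such that
-- r_i * produced_i == r_(i+1) * consumed_i holds on every channel i.
repetitions :: [Channel] -> [Integer]
repetitions chans = map (`div` g) ints
  where
    -- Relative firing rates, with the first actor normalised to 1.
    rates = scanl step (1 % 1) chans
    step r (Channel p c) = r * (p % c)
    -- Scale to integers, then reduce to the smallest solution.
    l    = foldr (lcm . denominator) 1 rates
    ints = map (\r -> numerator (r * fromInteger l)) rates
    g    = foldr gcd 0 ints
```

For a chain whose channels are `[Channel 2 3, Channel 1 2]`, the balance equations give the firing counts `[3, 2, 1]`, so one fixed schedule can be computed before execution. In PN the token rates are not known up front, which is why no such vector can be computed in advance.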

Actor-/Dataflow Orientation vs Object-/ Control flow Orientation Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

Marrying or Divorcing Control- & Dataflow Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

Overview: Scientific Workflows in Kepler • Modeling and Workflow Design • Web services = individual components (“actors”) • “Minute-Made” Application Integration: • Plugging in and harvesting web service components is easy and fast • Rich SWF modeling semantics (“directors”): • Different and precise dataflow models of computation • Clear and composable component interaction semantics • Web service composition and application integration tool • Coming soon: • Shrink-wrapped, pre-packaged “Kepler-to-Go” • Structural and semantic typing (better design support) • Grid-enabled web services (for big data, big computations, …) • Different deployment models (web service, web site, applet, …)

The KEPLER GUI: Vergil (Steve Neuendorffer, Ptolemy II) • Drag and drop utilities, director and actor libraries.

Running a Genomics WF (Ilkay Altintas, SDM)

Support for Multiple Workflow Granularities • Abstraction: Sand to Rocks [Figure labels: Sand, Powder, Plumbing, Boulders]

Directors and Combining Different Component Interaction Semantics Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

Application Examples: Mineral Classification with Kepler … (Efrat Jaeger, GEON)

… inside the Classifier

Standard BrowserUI: Client-Side SVG

SWF Reengineering (Ashraf, Efrat, Kai, GEON)

DataMapper Sub-Workflow

Result launched via BrowserUI actor (coupling with ESRI’s ArcIMS)

Distributed Workflows in KEPLER • Web and Grid Service plug-ins • WSDL (now) and Grid services (stay tuned …) • ProxyInit, GlobusGridJob, GridFTP, DataAccessWizard • SSH, SCP, SDSC SRB, OGS?-???… coming • WS Harvester • Import query-defined WS operations as Kepler actors • XSLT and XQuery Data Transformers • to link web services that were not “designed to fit” • WS-deployment interface (planned)

Configure - select service operation Generic Web Service Actor (Ilkay Altintas) • Given a WSDL and the name of an operation of a web service, dynamically customizes itself to implement and execute that method.

Set Parameters and Commit

Specialized WS Actor (after instantiation)

Web Service Harvester (Ilkay Altintas, SDM) • Imports the web services in a repository into the actor library. • Has the capability to search for web services based on a keyword.

Composing 3rd-Party WSs (NMI, Steve Mock) [Figure labels: output of previous web service; user interaction & transformations; input of next web service]

A Special Generic Ingestion Actor for EML Data (SEEK, Chad Berkley) • Ingests any data format described by EML metadata • Converts raw data to Ptolemy format • Data can then be operated on with other actors

Wrapping Legacy Applications

Promoter Identification Workflow (PIW) Source: Matt Coleman (LLNL)

Execution Semantics Promoter Identification Workflow in Ptolemy-II [SSDBM’03]

[Figure annotations on the Ptolemy II PIW] • web services “designed to fit” • hand-crafted web-service actor • hand-crafted control solution; also: forces sequential execution! • No data transformations available • Complex backward control-flow

Promoter Identification Workflow in FP

    genBankG       :: GeneId -> GeneSeq
    genBankP       :: PromoterId -> PromoterSeq
    blast          :: GeneSeq -> [PromoterId]
    promoterRegion :: PromoterSeq -> PromoterRegion
    transfac       :: PromoterRegion -> [TFBS]
    gpr2str        :: (PromoterId, PromoterRegion) -> String

    d0 = Gid "7"                 -- start with some gene-id
    d1 = genBankG d0             -- get its gene sequence from GenBank
    d2 = blast d1                -- BLAST to get a list of potential promoters
    d3 = map genBankP d2         -- get list of promoter sequences
    d4 = map promoterRegion d3   -- compute list of promoter regions and ...
    d5 = map transfac d4         -- ... get transcription factor binding sites
    d6 = zip d2 d4               -- create list of pairs promoter-id/region
    d7 = map gpr2str d6          -- pretty print into a list of strings
    d8 = concat d7               -- concat into a single "file"
    d9 = putStr d8               -- output that file

Cleaned up Process Network PIW • Back to purely functional dataflow process network (= also a data streaming model!) • Re-introducing map(f) to Ptolemy-II (was there in PT Classic) • no control-flow spaghetti • data-intensive apps • free concurrent execution • free type checking • automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId] [Figure annotations: map(f)-style iterators; Powerful type checking; Generic, declarative “programming” constructs; Generic data transformation actors; Forward-only, abstractable sub-workflow piw(GeneId)]
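The step from piw(GeneId) to PIW := map(piw) over [GeneId] can be sketched as follows. This is an illustration under stated assumptions, not the authors' specification: every service is a pure stub standing in for a remote call (GenBank, BLAST), the returned strings are made up, and the transfac step is left out because its result does not feed the printed output.

```haskell
-- Sketch only: stub types and stub services with invented return values.
newtype GeneId         = Gid String
newtype GeneSeq        = GeneSeq String
newtype PromoterId     = Pid String
newtype PromoterSeq    = PromoterSeq String
newtype PromoterRegion = PromoterRegion String

genBankG :: GeneId -> GeneSeq
genBankG (Gid g) = GeneSeq ("seq-of-" ++ g)

blast :: GeneSeq -> [PromoterId]
blast (GeneSeq s) = [Pid (s ++ "/p1"), Pid (s ++ "/p2")]

genBankP :: PromoterId -> PromoterSeq
genBankP (Pid p) = PromoterSeq ("seq-of-" ++ p)

promoterRegion :: PromoterSeq -> PromoterRegion
promoterRegion (PromoterSeq s) = PromoterRegion ("region-of-" ++ s)

gpr2str :: (PromoterId, PromoterRegion) -> String
gpr2str (Pid p, PromoterRegion r) = p ++ " -> " ++ r ++ "\n"

-- Forward-only sub-workflow for a single gene id: no backward control flow.
piw :: GeneId -> String
piw gid = concatMap gpr2str (zip pids regions)
  where
    pids    = blast (genBankG gid)
    regions = map (promoterRegion . genBankP) pids

-- Lifting the sub-workflow to a list of gene ids is just map.
piwAll :: [GeneId] -> [String]
piwAll = map piw
```

Because `piw` is a pure function, `piwAll` needs no extra control logic, and the elements of the input list can in principle be processed concurrently.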

Optimization by Declarative Rewriting I • PIW as a declarative, referentially transparent functional process • optimization via functional rewriting possible, e.g. map(f o g) = map(f) o map(g) • use map(f o g) instead of map(f) o map(g) • Combination of map and zip • Technical report & PIW specification in Haskell: http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
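The rewriting rule can be spot-checked on ordinary Haskell lists. The sketch below is a finite sanity check under the assumption that f and g are pure; it is not a proof, and the function names are chosen here for illustration.

```haskell
-- Two traversals of the list, one per function.
twoPasses :: (b -> c) -> (a -> b) -> [a] -> [c]
twoPasses f g = map f . map g

-- One traversal applying the composed function (the "fused" form).
onePass :: (b -> c) -> (a -> b) -> [a] -> [c]
onePass f g = map (f . g)

-- Spot check of map (f . g) == map f . map g on a small sample.
fusionHolds :: Bool
fusionHolds = twoPasses (* 2) (+ 1) sample == onePass (* 2) (+ 1) sample
  where
    sample = [1 .. 10] :: [Int]

-- A related law combining map and zip:
-- zip (map f xs) (map g ys) == map (\(x, y) -> (f x, g y)) (zip xs ys)
zipLawHolds :: Bool
zipLawHolds =
  zip (map (+ 1) xs) (map show ys)
    == map (\(x, y) -> (x + 1, show y)) (zip xs ys)
  where
    xs = [1, 2, 3] :: [Int]
    ys = [10, 20] :: [Int]
```

In a workflow setting the fused form matters because each `map` stage may correspond to a separate actor with its own data shipping, so one pass can mean fewer transfers.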

Optimizing II: Streams & Pipelines • Clean functional semantics facilitates algebraic workflow (program) transformations (Bird-Meertens); e.g. mapS f • mapS g ⇒ mapS (f • g) Source: Real-Time Signal Processing: Dataflow, Visual, and Functional Programming, Hideki John Reekie, University of Technology, Sydney

Middle/Underware Access: Querying Databases • Database connection actor: • Opening a database connection and passing it to all actors accessing this database. • Database query actor: • A generic actor that queries a database and provides its result. • DBConnection type and DBConnectionToken: • A new IOPort type and a token to distinguish a database connection from any general type.

Database Connection Actor • OpenDBConnection actor: • Input: database connection information • Output: DBConnectionToken (reference to a DB connection instance, via a DBConnection output port)

Database Query Actor • Database Query actor: • Input: SQL query string and a DB connection token • Parameters: • output type: XML, Record, or String • tuple-at-a-time vs set-at-a-time • Process: • execute query • produce results according to parameters
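How the two parameters shape the actor's output can be sketched without a real database. Everything below is an assumption made for illustration: the connection token wraps an in-memory table, SQL execution is replaced by a row predicate, and none of the names correspond to Kepler classes.

```haskell
import Data.List (intercalate)

-- One result row as (column name, value) pairs.
type Row = [(String, String)]

-- Stand-in for DBConnectionToken: a reference to an open "database"
-- (here simply an in-memory table; hypothetical).
newtype DBConnectionToken = DBConnectionToken [Row]

data OutputType = AsXML | AsRecord | AsString
data Delivery   = TupleAtATime | SetAtATime

-- Render one row according to the chosen output type.
-- Note: no XML escaping is done in this sketch.
renderRow :: OutputType -> Row -> String
renderRow AsXML row =
  "<row>" ++ concat ["<" ++ k ++ ">" ++ v ++ "</" ++ k ++ ">" | (k, v) <- row] ++ "</row>"
renderRow AsRecord row =
  "{" ++ intercalate ", " [k ++ " = " ++ v | (k, v) <- row] ++ "}"
renderRow AsString row = intercalate "," (map snd row)

-- Each element of the result list is one output token:
-- one token per row, or a single token holding the whole result set.
queryActor :: OutputType -> Delivery -> DBConnectionToken -> (Row -> Bool) -> [String]
queryActor out delivery (DBConnectionToken rows) selects =
  case delivery of
    TupleAtATime -> rendered
    SetAtATime   -> [unlines rendered]
  where
    rendered = map (renderRow out) (filter selects rows)
```

Tuple-at-a-time delivery lets downstream actors start work on the first row in a streaming (PN-style) workflow, while set-at-a-time delivery hands over one token containing the complete result.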

Querying Example

An (oversimplified) Model of the Grid [Figure: X → f → Y → g → Z] • Hosts: {h1, h2, h3, …} • Data@Hosts: d1@{hi}, d2@{hj}, … • Functions@Hosts: f1@{hi}, f2@{hj}, … • Given: data/workflow: • … as a functional plan: […; Y := f(X); Z := g(Y); …] • … as a logic plan: […; f(X,Y) ∧ g(Y,Z); …] • Find: host assignment di → hi, fj → hj for all di, fj … s.t. […; d3@h3 := f@h2(d1@h1), …] is a valid plan

Shipping and Handling Algebra (SHA) • Logical view: plan Y@C = F@A of X@B [Figure: f@A applied to x@b yields y@c] • Physical view (SHA plans): • (1) [ X@B to A, Y@A := F@A(X@A), Y@A to C ] • (2) [ F@A => B, Y@B := F@B(X@B), Y@B to C ] • (3) [ X@B to C, F@A => C, Y@C := F@C(X@C) ]
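The three physical plans can be written down as data and compared under a cost model. The sketch below assumes a deliberately simple cost, the number of bytes shipped between distinct hosts; the `Sizes` record and the cost function are hypothetical and are not part of SHA as presented here.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing)

type Host = String

-- One step of a physical plan.
data Step
  = ShipInput Host Host    -- "X@from to to": move the input data
  | ShipFunc Host Host     -- "F@from => to": move the function (code)
  | Apply Host             -- "Y@h := F@h(X@h)": run the function at host h
  | ShipOutput Host Host   -- "Y@from to to": move the result
  deriving (Eq, Show)

-- The three plans for Y@c = F@a of X@b, in the order (1), (2), (3).
plans :: Host -> Host -> Host -> [[Step]]
plans a b c =
  [ [ShipInput b a, Apply a, ShipOutput a c]
  , [ShipFunc a b, Apply b, ShipOutput b c]
  , [ShipInput b c, ShipFunc a c, Apply c]
  ]

-- Hypothetical sizes (e.g. in bytes) of input, function code, and output.
data Sizes = Sizes { inputSize, funcSize, outputSize :: Integer }

-- Shipping within the same host is free in this model.
shipCost :: Integer -> Host -> Host -> Integer
shipCost size from to = if from == to then 0 else size

stepCost :: Sizes -> Step -> Integer
stepCost s (ShipInput from to)  = shipCost (inputSize s) from to
stepCost s (ShipFunc from to)   = shipCost (funcSize s) from to
stepCost s (ShipOutput from to) = shipCost (outputSize s) from to
stepCost _ (Apply _)            = 0

-- Pick the plan with the least shipped volume (input list must be non-empty).
cheapest :: Sizes -> [[Step]] -> [Step]
cheapest s = minimumBy (comparing (sum . map (stepCost s)))
```

With a large input and a small function and result, for example `Sizes 1000 1 10`, plan (2) wins under this model: shipping the function to the data costs 1 + 10, against 1000 + 10 for plan (1) and 1000 + 1 for plan (3).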

Grid-Enabling PTII: Handles [Figure: actors A and B in Keplerspace; grid nodes GA and GB in Gridspace; arrows numbered 1 to 7] • (1) A → GA: get_handle • (2) GA → A: return &X • (3) A → B: send &X • (4) B → GB: request &X • (5) GB → GA: request &X • (6) GA → GB: send *X • (7) GB → B: send done(&X) • Example: • &X = “GA.17” • *X = <some_huge_file> • Candidate Formalisms: • GridFTP • SSH, SCP • SDSC SRB • OGS?-??? … WSRF? • Logical token transfer (3) requires get_handle (1, 2); then exec_handle (4, 5, 6, 7) for completion.
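The handle protocol can be sketched as two operations over the stores of the grid nodes: only the small handle &X travels between actors, while the payload *X moves directly from GA to GB. The model below is an assumption for illustration (in-memory association lists, a made-up handle numbering scheme) and does not use any real GridFTP or SRB API.

```haskell
-- A handle such as "GA.17": small enough to pass as a workflow token.
newtype Handle = Handle String deriving (Eq, Show)

-- What one grid node stores: handle -> (potentially huge) payload.
type GridStore = [(Handle, String)]

-- Steps 1-2: actor A registers its data at grid node GA and gets &X back.
-- The numbering scheme (node name plus counter) is hypothetical.
getHandle :: String -> String -> GridStore -> (Handle, GridStore)
getHandle node payload store = (h, (h, payload) : store)
  where
    h = Handle (node ++ "." ++ show (length store + 1))

-- Steps 4-7: actor B asks GB to resolve &X; GB fetches *X from GA.
-- Returns GB's updated store (the "done" signal), or Nothing if GA
-- does not know the handle.
execHandle :: Handle -> GridStore -> GridStore -> Maybe GridStore
execHandle h ga gb = do
  payload <- lookup h ga
  pure ((h, payload) : gb)
```

Step 3, the logical token transfer from A to B, is then just passing the `Handle` value: the payload never passes through the actors themselves.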

Extensions: Semantic Types • Take concepts and relationships from an ontology to “semantically type” the data-in/out ports • Application: e.g., design support: • smart/semi-automatic wiring, generation of “massaging actors” [Figure: actor m1 (normalize) with ports p3 and p4; takes “Abundance Count Measurements for Life Stages”, returns “Mortality Rate Derived Measurements for Life Stages”]
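One way semantic port types could support smart wiring is a subsumption check: an output port may feed an input port if its concept is a sub-concept of what the input accepts. The sketch below is only an illustration; the tiny ontology and its concept names are hypothetical (loosely inspired by the figure labels) and are not taken from a SEEK ontology.

```haskell
type Concept = String

-- Hypothetical is-a edges: (sub-concept, super-concept).
-- Assumed to be acyclic, otherwise 'isA' would not terminate.
ontology :: [(Concept, Concept)]
ontology =
  [ ("AbundanceCount", "Measurement")
  , ("MortalityRate", "DerivedMeasurement")
  , ("DerivedMeasurement", "Measurement")
  ]

-- c `isA` d holds if c equals d or d is reachable from c via is-a edges.
isA :: Concept -> Concept -> Bool
isA c d = c == d || any (`isA` d) [sup | (sub, sup) <- ontology, sub == c]

-- An output port can be wired to an input port if everything it
-- produces is acceptable to the consumer.
canWire :: Concept -> Concept -> Bool
canWire outputType inputType = outputType `isA` inputType
```

Where `canWire` fails, a design tool could look for a “massaging actor” whose input and output types bridge the gap, instead of rejecting the connection outright.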
