
Introduction to Scientific Workflows and the KEPLER System

Introduction to Scientific Workflows and the KEPLER System. Instructors: Bertram Ludaescher Ilkay Altintas. Overview. 10:30-11:15 Introduction to Scientific Workflows 11:15-12:00 Scientific Workflows in KEPLER live demo, brains-on session … but first, one more time … (déjà déjà vu).


Presentation Transcript


  1. Introduction to Scientific Workflows and the KEPLER System Instructors: Bertram Ludaescher Ilkay Altintas

  2. Overview • 10:30-11:15 Introduction to Scientific Workflows • 11:15-12:00 Scientific Workflows in KEPLER live demo, brains-on session • … but first, one more time … (déjà déjà vu) TM

  3. Information Integration Challenges: S4 Heterogeneities • Systems Integration • platforms, devices, data & service distribution, APIs, protocols, … → Grid middleware technologies + e.g. single sign-on, platform independence, transparent use of remote resources, … • Syntax & Structure • heterogeneous data formats (one for each tool ...) • heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, …) • heterogeneous schemas (one for each DB ...) → Database mediation technologies + XML-based data exchange, integrated views, transparent query rewriting, … • Semantics • fuzzy metadata, terminology, “hidden” semantics, implicit assumptions, … → Knowledge representation & semantic mediation technologies + “smart” data discovery & integration + e.g. ask about X (‘mafic’); find data about Y (‘diorite’); be happy anyways!

  4. Information Integration Challenges: S5 Heterogeneities • Synthesis of applications, analysis tools, data & query components, … into “scientific workflows” • How to make use of these wonderful things & put them together to solve a scientist’s problem? • Scientific Problem Solving Environments (PSEs) • GEON Portal and Workbench (“scientist’s view”) + ontology-enhanced data registration, discovery, manipulation + creation and registration of new data products from existing ones, … • GEON Scientific Workflow System (“engineer’s view”) + for designing, re-engineering, deploying analysis pipelines and scientific workflows; a tool to make new tools … + e.g., creation of new datasets from existing ones, dataset registration,…

  5. What is a Scientific Workflow (SWF)? • Goals: • automate a scientist’s repetitive data management and analysis tasks • typical phases: • data access, scheduling, generation, transformation, aggregation, analysis, visualization → design, test, share, deploy, execute, reuse, … SWFs • Typical requirements/characteristics: • data-intensive and/or compute-intensive • plumbing-intensive • dataflow-oriented • distributed (data, processing) • user-interaction “in the middle”, … • … vs. (C-z; bg; fg)-ing (“detach” and reconnect) • advanced programming constructs (map(f), zip, takewhile, …) • logging, provenance, “registering back” (intermediate) products… • … easy to recognize a SWF when you see one!

  6. Promoter Identification Workflow Source: Matt Coleman (LLNL)

  7. Source: NIH BIRN (Jeffrey Grethe, UCSD)

  8. Ecology: GARP Analysis Pipeline for Invasive Species Prediction [Workflow figure. Steps: EcoGrid Query, Layer Integration, Data Calculation, Map Generation, Validation, Generate Metadata, Archive To Ecogrid. Data items: (a) species presence & absence points, (b) environmental layers, (c) integrated layers, each for native range and invasion area; (d) training sample and test sample; (e) GARP rule set; (f) native range and invasion area prediction maps; (g) model quality parameter; (h) selected prediction maps; registered Ecogrid databases; user sample data] Source: NSF SEEK (Deana Pennington et al., UNM)

  9. Digression: (Business) Workflows and Systems or: what you need to know when someone wants to sell you one ;-) or: the remote relatives (2nd-3rd cousins?) of scientific workflows

  10. What is a (Business) Workflow? • Workflow management (also called Business Process Management) is the coordination of work processes through software. • A workflow management system routes pending activities to process participants according to a model of the process. • WF management systems have been around since the late 1970s (e.g. Officetalk, Xerox PARC) • marketing waves: Office Automation (70’s-80’s), Business Process Reengineering (90’s), Web Services Choreography (00’s) • roots/related: document management apps, email system apps, database apps (active DBMS’s, federated DBMS’s) • Meanwhile (’69-’71) elsewhere: Flow-based programming (J. Paul Morrison) • … not quite workflow but rather dataflow … (we’ll come to that…) Src/cf: http://www.workflow-research.de/index.htm, M.z. Muehlen, 2003

  11. Some History Commercial Workflow Systems Source: http://www.workflow-research.de/index.htm, M.z. Muehlen, 2003

  12. Some History Commercial Workflow Systems Source: http://www.workflow-research.de/index.htm, M.z. Muehlen, 2003

  13. Play Time @ Petri Nets World • Petri Nets are the underlying abstract model of many B-WfMS’s (who said I can’t do bad acronyms, too? ;-) • http://www.daimi.au.dk/PetriNets/ • http://www.daimi.au.dk/PetriNets/introductions/aalst/ • Let’s see the basic ideas first …

  14. Formal Basis: Petri Nets • Mathematical model of discrete distributed systems (named after Carl Adam Petri, 1960’s) • Provides a modeling language w/ rich theory, analysis tools, … • A Petri net consists of places (P), transitions (T) and directed arcs (P→T or T→P). Places can hold tokens. • A transition is enabled if each of its input places contains at least one token. • An enabled transition can fire, removing input tokens and producing output tokens [Figure: example net with places P1, P2, P3, P4 and transitions T1, T2; one transition is marked “enabled”, the other “not enabled”]
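
The enabling and firing rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the API of any Petri net tool: the `PetriNet` class, its method names, and the example net (which arcs connect P1 to P4 with T1 and T2) are assumptions, since the slide's figure topology is not recoverable from the transcript. Every arc is assumed to carry weight 1.

```python
from dataclasses import dataclass, field


@dataclass
class PetriNet:
    """Minimal place/transition net; every arc has weight 1 (an assumption)."""
    marking: dict                                     # place name -> token count
    transitions: dict = field(default_factory=dict)   # name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        # Sets keep each place at most once per side, matching weight-1 arcs.
        self.transitions[name] = (frozenset(inputs), frozenset(outputs))

    def enabled(self, name):
        # Enabled: each input place contains at least one token.
        inputs, _ = self.transitions[name]
        return all(self.marking.get(place, 0) >= 1 for place in inputs)

    def fire(self, name):
        # Firing removes one token per input place and adds one per output place.
        if not self.enabled(name):
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        for place in inputs:
            self.marking[place] -= 1
        for place in outputs:
            self.marking[place] = self.marking.get(place, 0) + 1


# Hypothetical net: T1 consumes from P1 and P2 and produces into P3;
# T2 consumes from P3 and produces into P4.
net = PetriNet({"P1": 1, "P2": 1, "P3": 0, "P4": 0})
net.add_transition("T1", {"P1", "P2"}, {"P3"})
net.add_transition("T2", {"P3"}, {"P4"})

print(net.enabled("T1"), net.enabled("T2"))  # T1 has tokens on all inputs, T2 does not
net.fire("T1")
print(net.marking, net.enabled("T2"))        # firing T1 puts a token on P3, enabling T2
```

In this sketch T2 only becomes enabled after T1 has fired, which is the kind of dependency that analyses such as reachability and deadlock-freeness reason about.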


  16. Why Petri Nets • Modeling and designing concurrent systems w/ competing resources (dining philosophers), … • Lots of analysis techniques, tools, theory • boundedness (state space), • liveness (good things do happen), • safety (bad things do not happen), • reversibility, • deadlock(-freeness), • reachability (of certain states), • …

  17. In a Flux: WS-XX-“Standards” Source: W.M.P. van der Aalst et al. http://tmitwww.tm.tue.nl/research/patterns/ http://tmitwww.tm.tue.nl/staff/wvdaalst/Publications/publications.html

  18. Everything Flows? But what exactly? • Dataflow • Data flows through operations (zoom into your CPU…) • Activity diagrams: data flows through actions • Process networks: data flows between processes • Control-flow • Nodes are control-flow operations that start other operations on a state • Mixed approaches • Statecharts: events trigger state transitions • Petri nets: tokens mark control and dataflow • Workflow languages: mix control and dataflow • … many others …

  19. Scientific “Workflows” vs Business Workflows • Business Workflows (BPEL4WS* …) • Task-orientation: travel reservations; credit approval; BPM; … • Tasks, documents, etc. undergo modifications (e.g., flight reservation from reserved to ticketed), but modified WF objects still identifiable throughout • Complex control flow, complex process composition (danger of control flow/dataflow “spaghetti”) → Dataflow and control-flow are often divorced! • Scientific “Workflows” • Dataflow and data transformations • Data problems: volume, complexity, heterogeneity • Grid-aspects • Distributed computation • Distributed data • User-interactions/WF steering • Data, tool, and analysis integration → Dataflow and control-flow are often married! (can be a happy marriage… at times…) *Business Process Execution Language for Web Services (in case you wondered)

  20. Scientific “Workflows”: Some Findings • More dataflow than (business control-/) workflow • DiscoveryNet, Kepler, SCIRun, Scitegic, Triana, Taverna, …, • Need for “programming extensions” • Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …) • Need for abstraction and nested workflows • Need for data transformations (WS1 → DT → WS2) • Need for rich user interaction & workflow steering: • pause / revise / resume • select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF • Need for high-throughput data transfers and CPU cycles: “(Data-)Grid-enabling”, “streaming” • Need for persistence of intermediate products and provenance
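
The “programming extensions” named on this slide (foreach, filtering, functional composition, zip, map(f), and takewhile from slide 5) can be illustrated with plain Python built-ins. This is only a sketch of the constructs themselves, not of how any workflow system exposes them; the sequence data, the identifiers, and the `compose` helper are hypothetical.

```python
from itertools import takewhile


def compose(*functions):
    """Functional composition: apply the given functions left to right."""
    def composed(value):
        for function in functions:
            value = function(value)
        return value
    return composed


# Hypothetical input: raw sequence strings, with an empty string as end marker.
raw_sequences = [" acgt ", "ggc", "", "ttaga"]
sequence_ids = ["s1", "s2", "s3", "s4"]

normalize = compose(str.strip, str.upper)            # functional composition
cleaned = map(normalize, raw_sequences)              # map(f) over a list
before_marker = list(takewhile(bool, cleaned))       # stop at the first empty string
lengths = map(len, before_marker)                    # a second map(f)
labelled = list(zip(sequence_ids, lengths))          # zip pairs ids with lengths
long_ones = [pair for pair in labelled if pair[1] >= 4]  # filtering

for seq_id, length in labelled:                      # foreach-style iteration
    print(seq_id, length)
print(long_ones)
```

Note that `zip` stops at the shorter input, so the two unused identifiers are dropped silently; a workflow actor with the same semantics would need to document that choice.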

  21. Perspectives on Systems / Dataflow View Source: Workflow-based Process Controlling, Michael zur Muehlen, 2003

  22. A Dataflow Component (“Actor”) [Figure: an “actor” / component with parameters ($1, $2, …) and ports that connect it to input channels and output channels]

  23. Actor-Oriented Design • Object orientation: What flows through an object is sequential control (cf. CCA, MPI) [Figure: class name, data, methods; call and return] • Actor/Dataflow orientation: What flows through an object is a stream of data tokens (in SWFs/KEPLER also references!!) [Figure: actor name, data (state), parameters, ports; input data and output data] Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/

  24. Object-Oriented vs. Actor-Oriented Interfaces • Object Oriented: TextToSpeech with initialize(): void, notify(): void, isReady(): boolean, getSpeech(): double[]. The OO interface gives procedures that have to be invoked in an order not specified as part of the interface definition. • Actor/Dataflow Oriented: the AO interface definition says “Give me text and I’ll give you speech” Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
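
The contrast on this slide can be sketched as follows. The method names mirror the slide (in Python naming style), but everything else is an assumption made for illustration: the slide does not show how text reaches the object, so passing it to `notify()` is a guess, and `fake_synthesize` is a stand-in that returns one number per character rather than real audio.

```python
def fake_synthesize(text):
    """Placeholder for a real synthesizer: one float 'sample' per character."""
    return [float(ord(ch)) for ch in text]


class TextToSpeechObject:
    """OO style: the required call order lives only in the caller's head."""

    def __init__(self):
        self._initialized = False
        self._text = None

    def initialize(self):
        self._initialized = True

    def notify(self, text):
        # Assumption: text is handed over here; the slide does not say.
        if not self._initialized:
            raise RuntimeError("initialize() must be called before notify()")
        self._text = text

    def is_ready(self):
        return self._text is not None

    def get_speech(self):
        if not self.is_ready():
            raise RuntimeError("no text pending; call notify() first")
        speech, self._text = fake_synthesize(self._text), None
        return speech


def text_to_speech_actor(text_tokens):
    """Actor style: text tokens in, speech tokens out; no call protocol to learn."""
    for text in text_tokens:
        yield fake_synthesize(text)


# OO usage: four calls in the right order.
tts = TextToSpeechObject()
tts.initialize()
tts.notify("hi")
if tts.is_ready():
    print(tts.get_speech())

# Actor usage: connect an input stream and read the output stream.
print(list(text_to_speech_actor(["hi", "ok"])))
```

The generator version has a single obvious way to be used, which is the point of “Give me text and I’ll give you speech”.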

  25. Ptolemy II see! read! try! Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

  26. Ptolemy II: A laboratory for investigating design KEPLER: A problem-solving environment for Scientific Workflows KEPLER = “Ptolemy II + X” for Scientific Workflows History • Gabriel (1986-1991) • Written in Lisp • Aimed at signal processing • Synchronous dataflow (SDF) block diagrams • Parallel schedulers • Code generators for DSPs • Hardware/software co-simulators • Ptolemy Classic (1990-1997) • Written in C++ • Multiple models of computation • Hierarchical heterogeneity • Dataflow variants: BDF, DDF, PN • C/VHDL/DSP code generators • Optimizing SDF schedulers • Higher-order components • Ptolemy II (1996-2022) • Written in Java • Domain polymorphism • Multithreaded • Network integrated • Modal models • Sophisticated type system • CT, HDF, CI, GR, etc. • PtPlot (1997-??) • Java plotting package • Tycho (1996-1998) • Itcl/Tk GUI framework • Diva (1998-2000) • Java GUI framework • Copernicus (code generator) • KEPLER (2003-2028) • scientific workflow extensions Source (Ptolemy): Edward Lee et al. http://ptolemy.eecs.berkeley.edu/

  27. An “early” example: Promoter Identification SSDBM, AD 2003 • Scientist models application as a “workflow” of connected components (“actors”) • If all components exist, the workflow can be automated/executed • Different directors can be used to pick appropriate execution model (often “pipelined” execution: PN director)
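
“Pipelined” execution in the process-network style can be approximated with one thread per actor and FIFO queues as channels, where a read on an empty channel blocks. The sketch below is not the Ptolemy II or KEPLER PN director; the function names and the end-of-stream sentinel are assumptions added so that the example terminates (process networks are usually described over unbounded streams).

```python
import queue
import threading

_END = object()  # hypothetical end-of-stream marker, not part of the PN model


def source(out_channel, items):
    """Actor with no inputs: emits each item, then signals end of stream."""
    for item in items:
        out_channel.put(item)
    out_channel.put(_END)


def transform(in_channel, out_channel, function):
    """Actor loop: block until a token arrives, process it, pass it on."""
    while True:
        token = in_channel.get()          # blocks while the channel is empty
        if token is _END:
            out_channel.put(_END)
            return
        out_channel.put(function(token))


def run_pipeline(items, functions):
    """Wire source -> f1 -> f2 -> ... and collect what leaves the last actor."""
    channels = [queue.Queue() for _ in range(len(functions) + 1)]
    threads = [threading.Thread(target=source, args=(channels[0], items))]
    for index, function in enumerate(functions):
        threads.append(threading.Thread(
            target=transform,
            args=(channels[index], channels[index + 1], function)))
    for thread in threads:
        thread.start()
    results = []
    while True:
        token = channels[-1].get()
        if token is _END:
            break
        results.append(token)
    for thread in threads:
        thread.join()
    return results


print(run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 10]))
```

Because every channel has a single writer and a single reader, token order is preserved even though the actors run concurrently; downstream actors can start working before upstream ones have finished, which is what makes the execution pipelined.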

  28. Why Ptolemy II (and thus KEPLER)? • Ptolemy II Objective: • “The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.” • Dataflow Process Networks w/ natural support for abstraction, pipelining (streaming), actor-orientation, actor reuse • User-Orientation • Workflow design & exec console (Vergil GUI) • “Application/Glue-Ware” • excellent modeling and design support • run-time support, monitoring, … • not a middle-/underware (we use someone else’s, e.g. Globus, SRB, …) • but middle-/underware is conveniently accessible through actors! • PRAGMATICS • Ptolemy II is mature, continuously extended & improved, well-documented (500+ pp) • open source system • Ptolemy II folks actively participate in KEPLER

  29. The KEPLER/Ptolemy II GUI (Vergil) • “Directors” define the component interaction & execution semantics • Large, polymorphic component (“Actors”) and Directors libraries (drag & drop)

  30. Ptolemy II: Actor-Oriented Modeling • Component (“actor”) interaction semantics not hard-wired inside components, but “factored out” in a “director” • Different directors for different modeling and execution needs (… can even be combined!) • Better abstraction, modeling, component reuse, …
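
A toy version of “factoring out” the interaction semantics is shown below: the same two actors are run unchanged by two different directors that schedule them differently. The class names and the two scheduling policies are hypothetical and far simpler than Ptolemy II's domains; the sketch only illustrates that the actors contain no scheduling logic of their own.

```python
class Actor:
    """An actor only knows how to turn one input token into one output token."""

    def __init__(self, name, function):
        self.name = name
        self.function = function

    def fire(self, token):
        return self.function(token)


class StageByStageDirector:
    """Runs each actor over all tokens before the next actor starts."""

    def run(self, actors, tokens, trace):
        for actor in actors:
            produced = []
            for token in tokens:
                trace.append((actor.name, token))
                produced.append(actor.fire(token))
            tokens = produced
        return tokens


class TokenByTokenDirector:
    """Pushes each token through the whole chain before taking the next one."""

    def run(self, actors, tokens, trace):
        results = []
        for token in tokens:
            for actor in actors:
                trace.append((actor.name, token))
                token = actor.fire(token)
            results.append(token)
        return results


chain = [Actor("inc", lambda x: x + 1), Actor("double", lambda x: x * 2)]

for director in (StageByStageDirector(), TokenByTokenDirector()):
    trace = []
    result = director.run(chain, [1, 2], trace)
    print(type(director).__name__, result, trace)
```

Both directors produce the same results for this stateless chain, but the firing order recorded in `trace` differs, which is exactly the part of the semantics that the director, not the actor, decides.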

  31. Behavioral Polymorphism in Ptolemy • Behavioral polymorphism is the idea that components can be defined to operate with multiple models of computation and multiple middleware frameworks. • These polymorphic methods implement the communication semantics of a domain in Ptolemy II. The receiver instance used in communication is supplied by the director, not by the component. (cf. CCA, WS-??, [G]BPL4??, … !) [Figure: Director; producer actor, IOPort, Receiver, consumer actor] Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/

  32. Domains and Directors: Semantics for Component Interaction • CI – Push/pull component interaction • CSP – concurrent threads with rendezvous • CT – continuous-time modeling • DE – discrete-event systems • DDE – distributed discrete events • FSM – finite state machines • DT – discrete time (cycle driven) • Giotto – synchronous periodic • GR – 2-D and 3-D graphics • PN – process networks • SDF – synchronous dataflow • SR – synchronous/reactive • TM – timed multitasking For (finer-grained) concurrent jobs!? For (coarse grained) Scientific Workflows! Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/

  33. Polymorphic Actor Components Working Across Data Types and Domains • Actor Data Polymorphism: • Add numbers (int, float, double, Complex) • Add strings (concatenation) • Add complex types (arrays, records, matrices) • Add user-defined types • Actor Behavioral Polymorphism: • In dataflow, add when all connected inputs have data • In a time-triggered model, add when the clock ticks • In discrete-event, add when any connected input has data, and add in zero time • In process networks, execute an infinite loop in a thread that blocks when reading empty inputs • In CSP, execute an infinite loop that performs rendezvous on input or output • In push/pull, ports are push or pull (declared or inferred) and behave accordingly • In real-time CORBA, priorities are associated with ports and a dispatcher determines when to add By not choosing among these when defining the component, we get a huge increment in component re-usability. But how do we ensure that the component will work in all these circumstances? Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
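
Data polymorphism for an “Add” actor can be sketched by delegating to whatever `+` means for the incoming tokens. This is an illustration in Python, not Ptolemy II's type system: `add_actor` and `Vec2` are hypothetical names, and Python's `+` on lists concatenates, which need not match how a workflow system adds arrays.

```python
from dataclasses import dataclass
from functools import reduce
import operator


def add_actor(first, *rest):
    """Adds all inputs using the '+' defined by the tokens' own types."""
    return reduce(operator.add, rest, first)


@dataclass(frozen=True)
class Vec2:
    """A user-defined token type that supplies its own notion of addition."""
    x: float
    y: float

    def __add__(self, other):
        return Vec2(self.x + other.x, self.y + other.y)


print(add_actor(1, 2, 3))                  # integers
print(add_actor(1, 2.5))                   # mixed int and float
print(add_actor(1 + 2j, 3))                # complex
print(add_actor("ab", "cd"))               # strings: concatenation
print(add_actor(Vec2(1, 2), Vec2(3, 4)))   # user-defined type
```

The actor body never names a concrete type; when it fires is a separate question, answered by the domain it runs in, as the behavioral-polymorphism bullets describe.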

  34. Directors and Combining Different Component Interaction Semantics • Possible app. in SWF: • time-series aware … • parameter-sweep aware … • MPI aware • XYZ aware … • … execution models Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/

  35. Component Composition & Interaction • Components linked via ports • Dataflow (and msg/ctl-flow) • Where is the component interaction semantics defined?? • each component is its own director! • But still useful for special applications, e.g. parallel programs (MPI, …) [Figure: components labeled DIR1, DIR2, DIR3, DIR4, ???] Source: GRIST/SC4DEVO workshop, July 2004, Caltech

  36. CCA!? CCA via special (“look the other way”) Director(s)? • Dataflow in CCA • a CCA “convention” can be used to accommodate actor-oriented/dataflow modeling • CCA/Message Passing in KEPLER • Kepler/Ptolemy can be extended to accommodate message passing semantics (CSP is already in Ptolemy II)

  37. Data/Control-Flow Spectrum • Data (tokens) flow • (almost) no other side effects • WYSIWYG (usually) • References flow • token reference type may be “http-get”, “ftp-get”, “hsi put”… • generic handling still possible • Application specific tokens flow • e.g. current Nimrod job management in Resurgence • “invisible contract” between components • Director is unaware of what’s going on … (sounds familiar? ;-) • Specific message-passing protocols (e.g., CSP, MPI) • for systems of tightly coupled components [Figure labels: “clean” data(=ctl)-flow; special tokens flow; message passing, control flow; “actor”]

  38. KEPLER/CSP: Contributors, Sponsors, Projects (or loosely coupled Communicating Sequential Persons ;-) • Ilkay Altintas (SDM, Resurgence) • Kim Baldridge (Resurgence, NMI) • Chad Berkley (SEEK) • Shawn Bowers (SEEK) • Terence Critchlow (SDM) • Tobin Fricke (ROADNet) • Jeffrey Grethe (BIRN) • Christopher H. Brooks (Ptolemy II) • Zhengang Cheng (SDM) • Dan Higgins (SEEK) • Efrat Jaeger (GEON) • Matt Jones (SEEK) • Werner Krebs (EOL) • Edward A. Lee (Ptolemy II) • Kai Lin (GEON) • Bertram Ludaescher (SEEK, GEON, SDM, BIRN, ROADNet) • Mark Miller (EOL) • Steve Mock (NMI) • Steve Neuendorffer (Ptolemy II) • Jing Tao (SEEK) • Mladen Vouk (SDM) • Xiaowen Xin (SDM) • Yang Zhao (Ptolemy II) • Bing Zhu (SEEK) • …

  39. KEPLER: An Open Collaboration • Initiated by members from NSF SEEK and DOE SDM/SPA; now several other projects • Open Source (BSD-style license) • Intensive Communications: • Web-archived mailing lists • IRC (!) • Co-development: • via shared CVS repository • joining as a new co-developer (currently): • get a CVS account (read-only) • local development + contribution via existing KEPLER member • be voted “in” as a member/co-developer • Software & social engineering • How to better accommodate new groups/communities? • How to better accommodate different usage/contribution models (core dev … special purpose extender … user)?

  40. GEON Dataset Generation & Registration (a co-development in KEPLER) [Figure callouts: “% Makefile $> ant run”; SQL database access (JDBC); contributors: Matt, Chad, Dan et al. (SEEK); Efrat (GEON); Ilkay (SDM); Yang (Ptolemy); Xiaowen (SDM); Edward et al. (Ptolemy)]

  41. KEPLER then …

  42. … and KEPLER today… [Figure text: “… so, you see, scientific workflows need domain and data-polymorphic actors & must scale to HPC!” / “What’s a polymorphic actor?” / “What’s a scientific workflow?” / “What is HPC?”] BTW: Kepler is NOT a GUI (Vergil is)

  43. KEPLER Pedigree (to be determined…) • Graphical dataflow environments • Problem solving environments • Grid workflows [Figure: systems shown are Khoros, openDX, SCIRun, DiscoveryNet, AVS, Taverna, Gabriel, Ptolemy, Ptolemy II, KEPLER, Triana, Pegasus, Matrix]

  44. A Few Specific Kepler Features

  45. Web Services → Actors (WS Harvester) [Figure: screenshots of steps 1 to 4] • “Minute-made” (MM) WS-based application integration • Similarly: MM workflow design & sharing w/o implemented components

  46. Recent Actor Additions

  47. Digression: Who are the clients? • Domain scientists • C/Perl/Python/Java/WS/DB-enabled ones • others (e.g. visually-inclined rest of us?) • Goal: make life better for both! • Workflow automation • Plumbing support • Execution monitoring, steering, runtime revision (pause-inspect-modify-resume cycle)

  48. For the Geoscientist: GEON Mineral Classification Workflow
