Introduction to Scientific Workflows and the KEPLER System. Instructor: Ilkay ALTINTAS. Outline. Introduction to scientific workflows and scientific workflow systems Kepler collaboration Kepler scientific workflow system System demonstration. What is a Scientific Workflow (SWF)?. Goals :
Introduction to Scientific Workflows and the KEPLER System Instructor: Ilkay ALTINTAS
Outline • Introduction to scientific workflows and scientific workflow systems • Kepler collaboration • Kepler scientific workflow system • System demonstration
What is a Scientific Workflow (SWF)? • Goals: • automate a scientist’s repetitive data management and analysis tasks • typical phases: • data access, scheduling, generation, transformation, aggregation, analysis, visualization • design, test, share, deploy, execute, reuse, … SWFs • Some findings: • More dataflow than (business control-/) workflow • Need for “programming extensions” • Iterations over lists (foreach); filtering; functional composition; generic & higher-order operations (zip, map(f), …) • Need for abstraction and nested workflows • Need for data transformations (WS1 → DT → WS2) • Need for rich user interaction & workflow steering: • pause / revise / resume • select & branch; e.g., web browser capability at specific steps as part of a coordinated SWF • Need for high-throughput data transfers and CPU cycles: “(Data-)Grid-enabling”, “streaming” • Need for persistence of intermediate products and provenance • … easy to recognize a SWF when you see one!
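The "programming extensions" named on this slide (foreach, filtering, functional composition, zip) can be illustrated with a minimal Python sketch. This is not Kepler code: the helper names `compose`, `foreach` and `keep_if`, the station IDs, and the readings are all hypothetical, chosen only to show how such operations combine into a small dataflow.

```python
from functools import reduce

def compose(*steps):
    """Functional composition: apply the steps left to right."""
    def run(value):
        return reduce(lambda acc, step: step(acc), steps, value)
    return run

def foreach(func, tokens):
    """Iterate over a list of tokens, applying func to each (a 'map')."""
    return [func(token) for token in tokens]

def keep_if(predicate, tokens):
    """Filtering: keep only the tokens that satisfy the predicate."""
    return [token for token in tokens if predicate(token)]

# Hypothetical inputs: station IDs and Celsius readings, -999.0 marks "missing".
station_ids = ["A", "B", "C", "D"]
readings_c = [10.0, -999.0, 15.0, 20.0]

def is_valid(pair):
    return pair[1] > -100

def to_fahrenheit(pair):
    station, celsius = pair
    return (station, celsius * 9 / 5 + 32)

# zip pairs the two lists; the composed pipeline filters, then transforms.
pipeline = compose(
    lambda pairs: keep_if(is_valid, pairs),
    lambda pairs: foreach(to_fahrenheit, pairs),
)
result = pipeline(list(zip(station_ids, readings_c)))
print(result)
```

Each helper takes a function as an argument, which is what makes the operations generic: the same `foreach` works for any per-token step.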
Promoter Identification Workflow Source: Matt Coleman (LLNL)
Ecology: GARP Analysis Pipeline for Invasive Species Prediction [Figure: workflow diagram. Data products, each for native range and invasion area where applicable: (a) species presence & absence points, (b) environmental layers, (c) integrated layers, (d) training sample and test sample, (e) GARP rule set, (f) native range and invasion area prediction maps, (g) model quality parameter, (h) selected prediction maps. Processing steps: EcoGrid Query, Layer Integration, Data Calculation, Map Generation, Validation, Generate Metadata, Archive To Ecogrid. Inputs: registered Ecogrid databases and user sample data.] Source: NSF SEEK (Deana Pennington et al., UNM)
Scientific “Workflows” vs Business Workflows • Business Workflows (BPEL4WS* …) • Task-orientation: travel reservations; credit approval; BPM; … • Tasks, documents, etc. undergo modifications (e.g., flight reservation from reserved to ticketed), but modified WF objects still identifiable throughout • Complex control flow, complex process composition (danger of control flow/dataflow “spaghetti”) • Dataflow and control-flow are often divorced! • Scientific “Workflows” • Dataflow and data transformations • Data problems: volume, complexity, heterogeneity • Grid-aspects • Distributed computation • Distributed data • User-interactions/WF steering • Data, tool, and analysis integration • Dataflow and control-flow are often married!
SWF Systems – Requirements (1/2) • …it should work… (No kidding!) USER REQUIREMENTS: • Design tools, especially for non-expert users • Ease of use: a fairly simple user interface with the more complex features hidden in the background • Reusable generic features • Generic enough to serve different communities but specific enough to serve one domain (e.g. geosciences) • Extensibility for the expert user: almost a visual programming interface • Registration and publication of data products and “process products” (=workflows); provenance
SWF Systems – Requirements (2/2) TECHNICAL REQUIREMENTS: • Error detection and recovery from failure • Logging information for each workflow • Allow data-intensive and compute-intensive tasks (Maybe at the same time) • HPC + Data management/integration • Allow status checks and on the fly updates • Visualization… • Semantics and metadata… • Certification, trust, security…
Perspectives on Systems / Dataflow View Source: Workflow-based Process Controlling, Michael zur Muehlen, 2003
A Dataflow Component (“Actor”) [Figure: an “actor” / component with parameters ($1, $2, …); input channels and output channels attach to it at ports]
Actor-Oriented Design • Object orientation: class name, data, methods (call / return). What flows through an object is sequential control (cf. CCA, MPI) • Actor/Dataflow orientation: actor name, data (state), parameters, ports (input data / output data). What flows through an object is a stream of data tokens (in SWFs/KEPLER also references!!) Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
Object-Oriented vs. Actor-Oriented Interfaces • Object Oriented: TextToSpeech with initialize(): void, notify(): void, isReady(): boolean, getSpeech(): double[]. OO interface gives procedures that have to be invoked in an order not specified as part of the interface definition. • Actor/Dataflow Oriented: AO interface definition says “Give me text and I’ll give you speech” Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
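The contrast between the two interface styles can be sketched in Python. This is an illustration under stated assumptions, not Ptolemy II or Kepler code: the method names follow the slide (in Python naming), but passing the text through `notify`, the port names `text_in` / `speech_out`, and the placeholder `fake_synthesize` (which just turns characters into numbers) are all hypothetical.

```python
from collections import deque
from typing import List

def fake_synthesize(text: str) -> List[float]:
    """Placeholder for speech synthesis: one number per character."""
    return [float(ord(ch)) for ch in text]

class TextToSpeechOO:
    """Object-oriented style: the caller has to know the required call order."""
    def __init__(self) -> None:
        self._initialized = False
        self._ready = False
        self._speech: List[float] = []

    def initialize(self) -> None:
        self._initialized = True

    def notify(self, text: str) -> None:
        # The ordering constraint lives in the implementation, not the interface.
        if not self._initialized:
            raise RuntimeError("initialize() must be called before notify()")
        self._speech = fake_synthesize(text)
        self._ready = True

    def is_ready(self) -> bool:
        return self._ready

    def get_speech(self) -> List[float]:
        if not self._ready:
            raise RuntimeError("no speech available yet")
        return self._speech

class TextToSpeechActor:
    """Actor style: text tokens arrive on one port, speech leaves on another."""
    def __init__(self) -> None:
        self.text_in: deque = deque()
        self.speech_out: deque = deque()

    def fire(self) -> None:
        # Consume every pending text token and emit one speech token for each.
        while self.text_in:
            self.speech_out.append(fake_synthesize(self.text_in.popleft()))

# OO usage: four calls, in an order the signatures alone do not reveal.
oo = TextToSpeechOO()
oo.initialize()
oo.notify("hi")
oo_speech = oo.get_speech() if oo.is_ready() else []

# Actor usage: put text on the input port, fire, read the output port.
actor = TextToSpeechActor()
actor.text_in.append("hi")
actor.fire()
ao_speech = actor.speech_out.popleft()
```

In the actor version, whoever connects the ports and decides when to call `fire` does not need to know anything about the component's internal call protocol.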
see! Ptolemy II read! try! Source: Edward Lee et al. http://ptolemy.eecs.berkeley.edu/ptolemyII/
Ptolemy II: A laboratory for investigating design KEPLER: A problem-solving environment for Scientific Workflows KEPLER = “Ptolemy II + X” for Scientific Workflows History • Gabriel (1986-1991) • Written in Lisp • Aimed at signal processing • Synchronous dataflow (SDF) block diagrams • Parallel schedulers • Code generators for DSPs • Hardware/software co-simulators • Ptolemy Classic (1990-1997) • Written in C++ • Multiple models of computation • Hierarchical heterogeneity • Dataflow variants: BDF, DDF, PN • C/VHDL/DSP code generators • Optimizing SDF schedulers • Higher-order components • Ptolemy II (1996-2022) • Written in Java • Domain polymorphism • Multithreaded • Network integrated • Modal models • Sophisticated type system • CT, HDF, CI, GR, etc. • PtPlot (1997-??) • Java plotting package • Tycho (1996-1998) • Itcl/Tk GUI framework • Diva (1998-2000) • Java GUI framework • Copernicus (code generator) • KEPLER (2003-2028) • scientific workflow extensions Source (Ptolemy): Edward Lee et al. http://ptolemy.eecs.berkeley.edu/
What is Kepler? • … a scientific workflow system • … a cross-project collaboration New contributing partners: • Cheminformatics: Resurgence (Kim Baldridge et al.) • Life Sciences: EOL (Mark Miller et al.) • Data Mining: SKIDL (Tony Fountain et al.) • Neuroinformatics: BIRN (coming…) • … an emerging open source tool for “scientific discovery workflows” Kepler 1.0 alpha release August 15, 2004
KEPLER/CSP: Contributors, Sponsors, Projects (or loosely coupled Communicating Sequential Persons ;-) • Ilkay Altintas (SDM, Resurgence) • Kim Baldridge (Resurgence, NMI) • Chad Berkley (SEEK) • Shawn Bowers (SEEK) • Terence Critchlow (SDM) • Tobin Fricke (ROADNet) • Jeffrey Grethe (BIRN) • Christopher H. Brooks (Ptolemy II) • Zhengang Cheng (SDM) • Dan Higgins (SEEK) • Efrat Jaeger (GEON) • Matt Jones (SEEK) • Werner Krebs (EOL) • Edward A. Lee (Ptolemy II) • Kai Lin (GEON) • Bertram Ludaescher (SEEK, GEON, SDM, BIRN, ROADNet) • Mark Miller (EOL) • Steve Mock (NMI) • Steve Neuendorffer (Ptolemy II) • Jing Tao (SEEK) • Mladen Vouk (SDM) • Xiaowen Xin (SDM) • Yang Zhao (Ptolemy II) • Bing Zhu (SEEK) • …
KEPLER Model • Open Source (BSD-style license) • Intensive Communications: • Web-archived mailing lists • IRC (!) • Co-development: • via shared CVS repository • joining as a new co-developer (currently): • get a CVS account (read-only) • local development + contribution via existing KEPLER member • be voted “in” as a member/co-developer • Software & social engineering • How to better accommodate new groups/communities? • How to better accommodate different usage/contribution models (core dev … special purpose extender … user)?
A co-development in KEPLER: GEON Dataset Generation & Registration % Makefile $> ant run SQL database access (JDBC) Matt,Chad, Dan et al. (SEEK) Efrat (GEON) Ilkay (SDM) Yang (Ptolemy) Xiaowen (SDM) Edward et al.(Ptolemy)
The KEPLER/Ptolemy II GUI (Vergil) • “Directors” define the component interaction & execution semantics • Large, polymorphic component (“Actors”) and Directors libraries (drag & drop) • BTW: Kepler is NOT a GUI (Vergil is)
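The idea that a director, not the actors, defines the execution semantics can be sketched as follows. This is a toy model, not the Ptolemy II API: the classes `Actor`, `FixedScheduleDirector` and `DataDrivenDirector` are hypothetical and only loosely inspired by schedule-driven and data-driven execution. They are not faithful implementations of any real director.

```python
from collections import deque

class Actor:
    """Consumes tokens from an inbox and forwards results downstream."""
    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.inbox = deque()
        self.downstream = []  # actors connected to this actor's output port

    def fire(self):
        """Consume one token, apply func, and send the result downstream."""
        result = self.func(self.inbox.popleft())
        for actor in self.downstream:
            actor.inbox.append(result)
        return result

class FixedScheduleDirector:
    """Fires every actor once per source token, in a fixed order."""
    def run(self, actors, log):
        while actors[0].inbox:
            for actor in actors:
                log.append((actor.name, actor.fire()))

class DataDrivenDirector:
    """Lets each actor drain its pending input before moving on."""
    def run(self, actors, log):
        progressed = True
        while progressed:
            progressed = False
            for actor in actors:
                while actor.inbox:
                    log.append((actor.name, actor.fire()))
                    progressed = True

def build_workflow():
    """The same two-actor graph is reused under both directors."""
    double = Actor("double", lambda x: 2 * x)
    increment = Actor("increment", lambda x: x + 1)
    double.downstream.append(increment)
    double.inbox.extend([1, 2])
    return [double, increment]

fixed_log, driven_log = [], []
FixedScheduleDirector().run(build_workflow(), fixed_log)
DataDrivenDirector().run(build_workflow(), driven_log)
print(fixed_log)
print(driven_log)
```

Both runs compute the same results for `increment`, but the firing order differs, and the actors themselves are unchanged. That separation is what lets one actor library be reused under several directors.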
Web Services Actors (WS Harvester) • “Minute-made” (MM) WS-based application integration • Similarly: MM workflow design & sharing w/o implemented components
Digression: Who are the clients? • Domain scientists • C/Perl/Python/Java/WS/DB-enabled ones • others (e.g. visually-inclined rest of us?) • Goal: make life better for both! • Workflow automation • Plumbing support • Execution monitoring, steering, runtime revision (pause-inspect-modify-resume cycle)
… inside the Classifier BrowserUI actor w/ SVG client display
in KEPLER (interactive session) Source: Dan Higgins, Kepler/SEEK
in KEPLER (w/ editable script) Source: Dan Higgins, Kepler/SEEK
A Closer Look at Dataflow … (or: Do you know what’s going on under your carpet?) • Dataflow: what you see is what you get (almost…) • Need for a general way to handle references! • Control tokens flow, e.g., from “$”-actor to FileReader and ImageReader actors • Actual dataflow is “under the carpet” and through handles (file system, GridFTP, scp, SRB, …)
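The point about references can be made concrete with a small Python sketch in which the token passed between actors is a handle, and the data itself is fetched only where it is needed. The `Handle` class, the `resolve` function and `file_reader_actor` are hypothetical names, not Kepler classes. Only a local-file scheme is implemented, and the other transports named on the slide are deliberately left out.

```python
import os
import tempfile
from dataclasses import dataclass

@dataclass(frozen=True)
class Handle:
    """A small reference token: where the data lives, not the data itself."""
    scheme: str
    location: str

def resolve(handle: Handle) -> bytes:
    """Dereference a handle. Only the local file system is sketched here."""
    if handle.scheme == "file":
        with open(handle.location, "rb") as source:
            return source.read()
    raise NotImplementedError(f"no resolver for scheme {handle.scheme!r}")

def file_reader_actor(token: Handle) -> int:
    """Receives a handle on its input and reports the size of the data behind it."""
    return len(resolve(token))

# Create a small stand-in data file, then pass only its handle through the "workflow".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"seismic samples")
    path = tmp.name

token = Handle("file", path)
byte_count = file_reader_actor(token)
os.remove(path)
print(byte_count)
```

The token on the wire stays small regardless of the file size, which is why what is visible in the workflow graph is only "almost" what actually moves.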
Registered Resources show up in Vergil (joint SEEK, SPA, GEON, … Registry!?)
Traffic info for a list of highways: Uses iterate (higher-order “map”) actor to access highway info web service repeatedly, sending out one email per highway.
Re-engineered PIW w/ Iteration Constructs AD 2004 • map(GenbankWS) • Input: {“NM_001924”, “NM020375”} • Output: {“CAGT…AATATGAC”, “GGGGA…CAAAGA”}
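A rough Python sketch of the map(GenbankWS) construct: one higher-order step applies a sub-workflow to every element of the input list and collects the outputs in order. The function `genbank_ws_stub` is a hypothetical stand-in that makes no network call and returns a labeled placeholder instead of sequence data. `iterate` is likewise an illustrative name, not the Kepler actor.

```python
def genbank_ws_stub(accession: str) -> str:
    """Stand-in for the GenBank web service actor: returns a placeholder string."""
    return f"<sequence for {accession}>"

def iterate(sub_workflow, tokens):
    """Higher-order 'map': run the sub-workflow once per input token, keeping order."""
    return [sub_workflow(token) for token in tokens]

# Accession strings copied from the slide as written.
accessions = ["NM_001924", "NM020375"]
sequences = iterate(genbank_ws_stub, accessions)
print(sequences)
```

The workflow graph stays the same size whether the input list holds two accession numbers or two thousand.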
Streaming Real-time Data Straightforward Example: Laser Strainmeter Channels in; Scientific Workflow; Earth-tide signal out Seismic Waveforms
Job Management (here: NIMROD) • Job management infrastructure in place • Results database: under development • Goal: 1000’s of GAMESS jobs (quantum mechanics) – Fall/Winter’04
KEPLER Today • Support for SWF life cycle • Design, share, prototype, run, monitor, deploy, … • Coarse-grained scientific workflows, e.g., • web service actors, grid actors, command-line actors, … • Fine grained workflows and simulations, e.g., • Database access, XSLT transformations, … • Kepler Extensions • SDM Center/SPA: support for data- and compute-intensive workflows! • real-time data streaming (ROADNet) • other special and generic extensions (e.g. GEON, SEEK) • Status • first release (alpha) was in May 2004 • nightly builds w/ version tests • “Link-Up Sister Project” w/ other SWF systems (UK Taverna, Triana, …) • Participation in various workshops and conferences (GGF10, SSDBMs, eScience WF workshop, …)
KEPLER Tomorrow • Application-driven extensions: • access to/integration with other IDMAF components • SciRUN?, PnetCDF?, PVFS(2)?, MPI-IO?, parallel-R?, ASPECT?, FastBit, … • support for execution of new SWF domains • Astrophysics: TSI/Blondin (SPA/NCSU) • Nuclear Physics: Swesty (SPA/LLNL) • … • Generic extensions: • addtl. support for data-intensive and compute-intensive workflows (all SRB Scommands, CCA support, …) • “detach” and reconnect • workflow deployment models • Additional “domain awareness” (e.g. via new directors) • time series, parameter sweeps, job scheduling, … • hybrid type system with semantic types • Consolidation • More installers, regular releases, improved documentation, …
Desiderata for and Features of Scientific Workflow Automation • SWF design support • step-wise refinement, component/actor-oriented design, flow-oriented design, sharing (visual) design with others, … • better component reuse through actor-oriented modeling w/ (largely) independent directors • Rapid prototyping support • Web service actors and harvester • Shell/command line actor • Data transformations (e.g., via Perl, Python, XSLT, … actors) • Workflow “plumbing” support • data transformation actors e.g., in Perl, Python, XSLT, … • Runtime support • Execution monitoring • animation for SDF, planned “heartbeat” for PN, … • listening to and logging of token flow through ports and control messages of directors • Pause-inspect-modify-resume cycle
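The runtime-support items above (listening to token flow, and the pause-inspect-modify-resume cycle) can be sketched in a few lines of Python. The class `SteppableWorkflow`, its `scale` parameter and the listener callback are hypothetical. The sketch assumes a single scaling actor and says nothing about how Kepler implements monitoring.

```python
class SteppableWorkflow:
    """A one-actor workflow that can be run in steps and revised between them."""
    def __init__(self, tokens, scale):
        self.pending = list(tokens)
        self.scale = scale      # actor parameter that may be revised at runtime
        self.outputs = []
        self.listeners = []     # callbacks notified of every token in and out

    def step(self):
        """Process one token; return False when no input is left."""
        if not self.pending:
            return False
        token = self.pending.pop(0)
        result = token * self.scale
        self.outputs.append(result)
        for listener in self.listeners:
            listener(token, result)
        return True

    def run(self, max_steps=None):
        """Run until the input is exhausted or max_steps is reached (a 'pause')."""
        steps = 0
        while (max_steps is None or steps < max_steps) and self.step():
            steps += 1
        return steps

token_log = []
workflow = SteppableWorkflow([1, 2, 3, 4], scale=10)
workflow.listeners.append(lambda t_in, t_out: token_log.append((t_in, t_out)))

workflow.run(max_steps=2)           # run, then pause after two tokens
snapshot = list(workflow.outputs)   # inspect the intermediate products
workflow.scale = 100                # modify a parameter while paused
workflow.run()                      # resume with the revised parameter
print(snapshot, workflow.outputs, token_log)
```

The listener sees every token without the actor knowing who is listening, so the same hook could feed a log file or an animation.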
F I N Follows: A quick system demonstration Questions to: kepler-dev@ecoinformatics.org Website: http://kepler-project.org