Workflow Management in Grids
Ewa Deelman
Center for Grid Technologies, USC Information Sciences Institute
People Involved • Carl Kesselman, Saurabh Khurana, Gaurang Mehta, Sonal Patil, Gurmeet Singh, Jaskaran Singh, Karan Vahi (ISI) • Jim Blythe, Yolanda Gil (ISI) • Anirban Mandal (Rice)
Outline • The GriPhyN project and Grid Applications • Workflow Management in Grids • Pegasus, Planning for Execution in Grids • Framework Description • Generating Executable Workflows • Based on abstract workflow descriptions • Based on application-level metadata • Application of Pegasus to the LIGO pulsar search • Future Research Directions
GriPhyN Data Grid Challenge • Provide a framework that enables Virtual Organizations around the world to perform computationally demanding analysis of large, geographically distributed datasets. • The Virtual Organizations are large and highly distributed • The datasets are large: currently on the order of Terabytes, and expected to grow to hundreds of Petabytes over the next decade • Provide seamless access to data, whether experimental raw data or processed data products • Enable a user/application to ask for any domain-specific data, whether it has already been computed or not: the concept of Virtual Data
GriPhyN Project • 5-year, $12.5M NSF ITR proposal to realize the concept of virtual data • Four Applications: • ATLAS and CMS experiments at the Large Hadron Collider at CERN • SDSS (Sloan Digital Sky Survey) • LIGO (Laser Interferometer Gravitational-wave Observatory) • Key research areas: • Virtual data technologies (information models, management of virtual data software, etc.) • Request planning and scheduling (including policy representation and enforcement) • Task execution (fault tolerance) • CS Participants: U. Chicago, USC/ISI, UW-Madison, UCSD, UCB, Indiana, Northwestern, Florida • Close collaboration with, and participation from, the physics community
Applications • Increasing levels of complexity • Use of individual application components • Reuse of individual intermediate data products • Description of data products using metadata attributes • Execution environment is complex and very dynamic: • resources come and go • data is replicated • components can be found at various locations or staged in on demand • Separation between the application description and the actual execution description
Outline • The GriPhyN project and Grid Applications • Workflow Management in Grids • Pegasus, Planning for Execution in Grids • Framework Description • Generating Executable Workflows • Based on abstract workflow descriptions • Based on application-level metadata • Application of Pegasus to the LIGO pulsar search • Future Research Directions
[Diagram: the two stages of workflow management: Abstract Workflow Generation followed by Concrete Workflow Generation.]
Generating an Abstract Workflow • Available information: • specification of component capabilities • ability to generate the desired data products • Select and configure application components to form an abstract workflow: • assign input files that exist or that can be generated by other application components • specify the order in which the components must be executed • Components and files are referred to by their logical names: • logical transformation name • logical file name • Both transformations and data can be replicated
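To make the logical-name idea concrete, here is a minimal Python sketch (the names and structures are hypothetical, not Pegasus' actual data model): jobs refer only to logical transformations and logical files, and the execution order falls out of which job's outputs feed which job's inputs.

```python
from dataclasses import dataclass, field

@dataclass
class AbstractJob:
    """A node in the abstract workflow: logical names only, no physical locations."""
    transformation: str                          # logical transformation name, e.g. "pulsar_search"
    inputs: list = field(default_factory=list)   # logical file names
    outputs: list = field(default_factory=list)  # logical file names

# Execution order is implied by data flow: "b" consumes what "a" produces.
workflow = {
    "a": AbstractJob("make_sft", inputs=["raw.1"], outputs=["sft.1"]),
    "b": AbstractJob("pulsar_search", inputs=["sft.1"], outputs=["candidates.1"]),
}

def predecessors(wf, name):
    """Jobs whose outputs feed this job -- they must execute first."""
    needed = set(wf[name].inputs)
    return [j for j, job in wf.items() if needed & set(job.outputs)]

print(predecessors(workflow, "b"))   # ['a']
```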
Generating a Concrete Workflow • Available information: • location of files and component instances • state of the Grid resources • Select specific resources and files • Add the jobs required to form a concrete workflow that can be executed in the Grid environment: • data movement • data registration • Each component in the abstract workflow is turned into an executable job
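A hedged sketch of this selection step, with plain dictionaries standing in for the Replica Location Service, the Transformation Catalog, and an MDS resource query; as the Pegasus slide below notes, the initial strategy simply picks at random among feasible alternatives.

```python
import random

# Hypothetical stand-ins for the Grid information services Pegasus consults.
replica_catalog = {"sft.1": ["gsiftp://siteA/data/sft.1",
                             "gsiftp://siteB/data/sft.1"]}      # RLS: logical file -> physical replicas
transformation_catalog = {"pulsar_search": ["siteA", "siteB"]}  # logical transformation -> sites
up_sites = {"siteA"}                                            # resource state, e.g. from an MDS query

def concretize(transformation, inputs):
    """Pick an execution site and physical input files for one abstract job."""
    sites = [s for s in transformation_catalog[transformation] if s in up_sites]
    site = random.choice(sites)                       # random choice among feasible alternatives
    stage_in = {f: random.choice(replica_catalog[f])  # likewise a random replica per input file
                for f in inputs if f in replica_catalog}
    return {"site": site, "stage_in": stage_in}

print(concretize("pulsar_search", ["sft.1"]))
```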
Why Automate Workflow Generation? • Usability: limit the Grid knowledge a user needs • Monitoring and Directory Service • Replica Location Service • Transformation Catalog • Complexity: • the user needs to make choices among • alternative application components • alternative files • alternative locations • the user may reach a dead end • many different interdependencies may occur among components • Solution cost: • evaluate the costs of alternative solutions in terms of • performance • reliability • resource usage • Global cost: • minimizing cost within a community or a virtual organization • requires reasoning about individual users' choices in light of other users' choices
Outline • The GriPhyN project and Grid Applications • Workflow Management in Grids • Pegasus, Planning for Execution in Grids • Framework Description • Generating Executable Workflows • Based on abstract workflow descriptions • Based on application-level metadata • Application of Pegasus to the LIGO pulsar search • Future Research Directions
Pegasus, Planning for Execution in Grids • performs the mapping from an abstract workflow to a concrete workflow, which can be executed on the Grid • isolates the user from many Grid details • automatically locates physical locations for both components (transformations) and data, via Globus RLS and the Transformation Catalog • finds appropriate resources to execute the components (via Globus MDS) • whenever several alternatives are possible (e.g., alternative physical files, alternative resources) it makes a random choice • publishes newly derived data products • reuses existing data products where applicable
Abstract Workflow Reduction
[Diagram: input DAG of jobs a–i. Key: the original node, input transfer node, registration node, output transfer node, node deleted by the reduction algorithm.]
• The output jobs of the DAG are all the leaf nodes, i.e., f, h, i.
• Each job requires 2 input files and generates 2 output files.
• The user specifies the output location.
Optimizing from the point of view of Virtual Data
[Diagram: the same DAG with jobs a–f marked as deleted.]
• Jobs d, e, and f have output files that were found in the Replica Location Service, so they do not need to be re-executed.
• Their ancestor jobs become unnecessary as well, so additional jobs are deleted.
• As a result, all of the jobs a, b, c, d, e, and f are removed from the DAG.
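The reduction step can be sketched in a few lines of Python; the DAG edges below are illustrative, chosen only to match the a–i example (with f a leaf, as on the preceding slide).

```python
def reduce_workflow(jobs, parents, found_in_rls):
    """Delete every job whose outputs were found in the RLS, plus all its ancestors."""
    deleted = set()

    def delete_with_ancestors(j):
        if j not in deleted:
            deleted.add(j)
            for p in parents.get(j, ()):
                delete_with_ancestors(p)   # an ancestor's outputs are no longer needed either

    for j in jobs:
        if j in found_in_rls:
            delete_with_ancestors(j)
    return jobs - deleted

# Illustrative edges matching the slides' a-i example.
parents = {"d": {"a"}, "e": {"b"}, "f": {"c"},
           "g": {"d"}, "h": {"d", "e"}, "i": {"e"}}
jobs = set("abcdefghi")
print(sorted(reduce_workflow(jobs, parents, found_in_rls={"d", "e", "f"})))  # ['g', 'h', 'i']
```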
[Diagram: the reduced DAG with input transfer nodes added above the remaining root nodes.]
• The planner picks execution and replica locations.
• It plans for staging data in, adding transfer nodes for the input files of the root nodes.
Staging data out and registering new derived products in the RLS
[Diagram: the DAG with output transfer and registration nodes appended.]
• The output files of the leaf job (f) are transferred to the output location.
• Staging-out and registration nodes are added for each job that materializes data (g, h, i).
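A sketch of this augmentation step, again with illustrative structures rather than Pegasus' actual interfaces: each surviving compute job is wrapped with input-transfer nodes for files not already at the chosen site, plus output-transfer and RLS-registration nodes for the data it materializes.

```python
def add_data_nodes(compute_jobs, inputs, outputs, at_site=frozenset()):
    """Wrap each compute job with stage-in, stage-out, and registration nodes."""
    dag = []
    for job in compute_jobs:
        for f in inputs.get(job, []):
            if f not in at_site:
                dag.append(("transfer_in", f, job))   # stage input to the execution site
        dag.append(("compute", job))
        for f in outputs.get(job, []):
            dag.append(("transfer_out", f, job))      # stage derived data to the output location
            dag.append(("register_in_rls", f))        # publish the new replica
    return dag

for node in add_data_nodes(["g"], {"g": ["d.out1", "d.out2"]}, {"g": ["g.out1", "g.out2"]}):
    print(node)
```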
The final, executable DAG
[Diagram: the input DAG alongside the final executable DAG: compute jobs g, h, i surrounded by input transfer, output transfer, and registration nodes.]
Pegasus' Initial Solution • Reused existing data products • Used in a variety of complex applications: • astronomy • bioinformatics • high-energy physics • Example run: a cluster of 25 dual-processor Pentium machines; 7 days of computation, 678 jobs with 250 events each, producing ~200 GB of simulated data • Provided a feasible solution, but not necessarily a low-cost one • To improve the quality of the solution: • need to efficiently search a large problem space and apply both local and global optimizations • need to compose various optimization strategies • need to reuse the resource and component models
AI Planning Technologies • Provide a broad-based, generic implementation foundation • Use techniques such as backtracking, and domain-specific and domain-independent control rules • Allow the easy addition of new system constraints and rules • Incorporate optimality and policy into the search for solutions • Can integrate the generation of workflows across users and policies within VOs • Collaboration with Yolanda Gil and Jim Blythe (ISI)
AI-based Abstract and Concrete Planner • Goal (provided by the user): • a metadata specification of the information the user requires and the desired location for the output file • Initial state (automatically extracted): • information about the state of the Grid • information about data location • Operators (encoded for the application domain): • the execution of a component at a particular location and the generation of a particular file (or files) • a file movement across the network • Search control rules (Grid- or application-specific): • specify options that should be exclusively considered at any choice point in the search algorithm (e.g., execute "close" to the data)
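One way to picture such operators and control rules, as a STRIPS-style Python sketch rather than the planner's real encoding: preconditions are facts that must hold in the Grid state, effects add new facts, and the control rule prunes the sites considered at a choice point.

```python
# Illustrative STRIPS-style operator over a set-of-facts Grid state.
state = {("file-at", "sft.1", "siteA"), ("component-at", "pulsar_search", "siteA")}

def run_component(state, component, site, in_file, out_file):
    """Operator: execute `component` at `site`, deriving `out_file` there."""
    preconditions = {("file-at", in_file, site), ("component-at", component, site)}
    if not preconditions <= state:
        return None                               # operator not applicable in this state
    return state | {("file-at", out_file, site)}  # effect: a new fact is added

def close_to_data(state, in_file, candidate_sites):
    """Search-control rule: consider only sites that already hold the input."""
    return [s for s in candidate_sites if ("file-at", in_file, s) in state]

print(close_to_data(state, "sft.1", ["siteA", "siteB"]))                  # ['siteA']
print(run_component(state, "pulsar_search", "siteA", "sft.1", "cand.1"))
```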
Techniques Used by the Planner • Uses search algorithms to explore the solution space • Uses local heuristics to choose between alternatives and prune the solution space • Execute “close” to the data • Global evaluation of alternative plans (user, community specific) • Performance • Reliability • Resource usage
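A minimal sketch of such a global evaluation, with entirely illustrative weights and metrics: complete candidate plans are ranked by a weighted combination of performance, reliability, and resource usage.

```python
# Entirely illustrative weights and metrics for ranking complete candidate plans.
def plan_cost(plan, w_perf=1.0, w_rel=5.0, w_usage=0.1):
    return (w_perf * plan["expected_runtime_h"]   # performance
            + w_rel * plan["expected_failures"]   # reliability
            + w_usage * plan["cpu_hours"])        # resource usage

plans = [
    {"name": "all-at-siteA", "expected_runtime_h": 10, "expected_failures": 2, "cpu_hours": 400},
    {"name": "split-A-and-B", "expected_runtime_h": 12, "expected_failures": 0, "cpu_hours": 350},
]
print(min(plans, key=plan_cost)["name"])   # 'split-A-and-B' under these weights
```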
Outline • The GriPhyN project and Grid Applications • Workflow Management in Grids • Pegasus, Planning for Execution in Grids • Framework Description • Generating Executable Workflows • Based on abstract workflow descriptions • Based on application-level metadata • Application of Pegasus to the LIGO pulsar search • Future Research Directions
LIGO Experiment (Laser Interferometer Gravitational-Wave Observatory)
[Photo: the LIGO Livingston observatory.]
• Aims to detect gravitational waves predicted by Einstein's theory of relativity.
• Can be used to detect: • binary pulsars • mergers of black holes • "starquakes" in neutron stars
• Two installations: in Louisiana (Livingston) and in Washington State
• Other projects: Virgo (Italy), GEO (Germany), TAMA (Japan)
• Instruments are designed to measure the effect of gravitational waves on test masses suspended in vacuum.
• Data collected during experiments is a collection of (multi-channel) time series.
• Analysis is performed in the time and Fourier domains.
LIGO's Pulsar Search
[Pipeline diagram: raw channels from the interferometer archive → extract channel → short Fourier transform over 30-minute time frames (long time frames → short time frames, single frame) → transpose → construct time-frequency image → extract frequency range → find candidate → store events in the event DB. Axes: frequency (Hz) vs. time.]
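To ground the pipeline, here is a small numpy sketch of the short-Fourier-transform stage alone, assuming a single already-extracted channel and the 30-minute frames shown in the diagram; image construction and candidate search are omitted.

```python
import numpy as np

def short_fourier_transforms(channel, fs, frame_seconds=30 * 60):
    """Split one time series into 30-minute frames and FFT each; stacking the
    spectra (by magnitude) gives the time-frequency image searched for candidates."""
    n = int(fs * frame_seconds)
    frames = [channel[i:i + n] for i in range(0, len(channel) - n + 1, n)]
    sfts = np.array([np.fft.rfft(f) for f in frames])   # one spectrum per frame
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, sfts

# Tiny demo on synthetic data (1 Hz sampling keeps it small).
t = np.arange(3 * 1800)                                  # three 30-minute frames at 1 Hz
freqs, sfts = short_fourier_transforms(np.sin(2 * np.pi * 0.1 * t), fs=1.0)
print(sfts.shape)                                        # (3, 901)
```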
LIGO's Problem Decomposition • Originally, the entire pulsar search was conducted inside the LIGO Data Analysis System (LDAS) • Its resources were not sufficient to conduct all the desired analysis • The various components of the search (SFT, sSFT, pulsar search) were identified and described in terms of metadata attributes • Subsets of the code were ported to the Linux environment • An interface to LDAS was created to schedule jobs, stage data in and out, and send the results of the analysis to the LIGO event database • These formed the application-level components • High level of interest and participation from the GEO experiment in Germany • LIGO collaborators: Kent Blackburn, Albert Lazzarini (Caltech), Scott Koranda (UWM), and others • GEO collaborators: Maria Alessandra Papa (Albert Einstein Institute, Germany), Alicia Sintes (University of the Balearic Islands, Spain)
LIGO's Pulsar Search at SC 2002 • The pulsar search conducted at SC 2002: • used LIGO data collected during the first scientific run of the instrument • targeted a set of 1000 locations of known pulsars as well as random locations in the sky • Results of the analysis were published via LDAS (LIGO Data Analysis System) to the LIGO Scientific Collaboration • Performed using LDAS and compute and storage resources at Caltech, the University of Southern California, and the University of Wisconsin-Milwaukee • Visualization by Marcus Thiebaux
Results • SC 2002 demo: • over 58 pulsar searches • a total of 330 tasks • 469 data transfers • 330 output files produced • total runtime: 11:24:35 • To date: • 185 pulsar searches • a total of 975 tasks • 1365 data transfers • 975 output files • total runtime: 96:49:47
Summary so far • To date • Applied to gravitational wave physics applications • Solution based on application component composition • Reasoning at the metadata level • Simple optimizations • Challenges Tomorrow • Build and integrate models and ontologies • Application components • Grid components • Apply and extend AI planning technologies to the problem domain • More sophisticated solutions • Generic solution, but also application relevant • Investigate Reliability in Grids
Outline • The GriPhyN project and Grid Applications • Workflow Management in Grids • Pegasus, Planning for Execution in Grids • Framework Description • Generating Executable Workflows • Based on abstract workflow descriptions • Based on application-level metadata • Application of Pegasus to the LIGO pulsar search • Future Research Directions
Challenges • Dealing with faults in the system • Fault tolerance • Fault avoidance • Supporting a wide spectrum of applications • Optimizing application performance • Incorporating policies in the planning decisions • Providing data provenance information for derived data
Future Research Directions: Smart middleware for the Grid • Knowledge everywhere: expressive knowledge sources to support all levels of decision-making in the Grid • Intelligent reasoners: use knowledge and operate incrementally to reason about tradeoffs in task allocation • Smart workflows: augment workflows with annotations, such as constraints and alternatives, to support incremental updates by heterogeneous agents Continued Collaboration with Jim Blythe, Yolanda Gil and Hongsuda Tangmunarunkit
[Architecture diagram: community users submit high-level specifications of desired results, constraints, requirements, and user policies; workflow histories are kept per user and per community. A smart workflow pool is operated on by workflow refinement, policy management, resource matching, workflow repair, and a workflow manager (the intelligent reasoners), drawing on pervasive knowledge sources (policy KB, application KB, resource KB, other KBs), policy information services, resource indexes, replica locators, simulation codes, and other Grid services, all on top of community distributed resources (e.g., computers, storage, network, simulation codes, data).]
[Diagram: a user's request is refined incrementally over time, moving down through levels of abstraction: relevant components, logical tasks, full abstract workflow, and finally tasks bound to resources and sent for execution. Workflow refinement allows incremental workflow generation: parts of the workflow can be executed while others are not yet refined. The refinement is supported by application-level knowledge, a policy reasoner, an ontology-based matchmaker, and workflow repair.]
Publications
AI forum:
• "The Role of Planning in Grid Computing," Jim Blythe, Ewa Deelman, Yolanda Gil, Carl Kesselman, Amit Agarwal, Gaurang Mehta, Karan Vahi, accepted to the International Conference on Automated Planning and Scheduling (ICAPS) 2003.
• "Transparent Grid Computing: a Knowledge-Based Approach," Jim Blythe, Ewa Deelman, Yolanda Gil, Carl Kesselman, accepted to the Innovative Applications of Artificial Intelligence Conference 2003.
Grid forum:
• "Workflow Management in GriPhyN," E. Deelman, J. Blythe, Y. Gil, C. Kesselman, chapter in the book Grid Resource Management, 2003.
• "Mapping Abstract Complex Workflows onto Grid Environments," E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, A. Arbree, R. Cavanaugh, K. Blackburn, A. Lazzarini, S. Koranda, to appear in the Journal of Grid Computing, 2003.
• "Pegasus: Planning for Execution in Grids," Ewa Deelman, Jim Blythe, Yolanda Gil, Carl Kesselman, GriPhyN Technical Report 2002-20.
www.griphyn.org, www.isi.edu/~deelman/pegasus.htm