
Agent-Based Resource Management for Grid Computing

Junwei Cao, Darren J. Kerbyson, Graham R. Nudd
Department of Computer Science, University of Warwick



  1. Agent-Based Resource Management for Grid Computing
  Junwei Cao, Darren J. Kerbyson, Graham R. Nudd
  Department of Computer Science, University of Warwick

  2. Outline
  • Research background
  • Sweep3D: performance evaluation of parallel applications using PACE
  • A4 (Agile Architecture and Autonomous Agents): a reference model for building large-scale distributed software systems
  • ARMS: an Agent-based Resource Management System for grid computing
  • PMA: a Performance Monitor and Advisor for ARMS
  • Conclusions and future work

  3. Research Background

  4. Resource Management
  The overall aim of resource management is to efficiently schedule applications that need to utilise the available resources in the metacomputing environment. Approaches taken by existing systems include:
  • APIs defined by the LDAP service
  • objects as the main system abstraction throughout
  • a matchmaker/entity structure
  • agents, each acting as both a database and a resource broker
  • a metaserver/servers structure
  • a broker/agents structure

  5. Performance Evaluation
  Such goals within the high-performance community will rely on accurate performance evaluation and prediction capabilities.

  6. Multi-Agent Systems
  Agents are computer systems capable of flexible, autonomous action in dynamic, unpredictable, typically multi-agent domains. Key topics:
  • Agent coordination
  • Agent negotiation
  • Agent communication languages
  • Knowledge representation
  Software agents have become accepted as a powerful high-level abstraction for modelling complex software systems.

  7. Service Discovery A service is an entity that can be used by a person, a program, or another service. Service advertisement and discovery technologies enable device cooperation and reduce configuration hassles, a necessity in today’s increasingly mobile computing environment.

  8. Performance Evaluation Using PACE
  • PACE toolkit
    – Layered framework
    – Object definition
    – Model creation
    – Mapping relations
  • Sweep3D: a case study
    – Model decomposition
    – Parallel template
    – Validation on SGI O2000
    – Validation on Sun Ultra1

  9. PACE Toolkit
  [Diagram: the application tools (user interface, object editor, source code analysis, object library) compile PSL scripts into an application model; the resource tools (user interface, HMCL scripts with CPU, network, and cache models) produce a resource model; both feed the evaluation engine, which supports performance analysis and on-the-fly analysis.]

  10. Layered Framework
  [Diagram: an application in the application domain decomposes into subtasks (sequential description, entry level), parallel templates (parallel description), and a hardware description; the upper layers are hardware independent. Model parameters go in; time estimates and a predictive trace come out.]

  11. Uniform Object Definition
  [Diagram: a software object contains a type, an identifier, includes of lower objects, external variable definitions, links to higher and lower objects, options, and procedures. A hardware object contains CPU models (clc, flc, suif, ct), memory models (cache L1, cache L2, main), and network models (sockets, MPI, PVM).]

  12. Semi-Automatic Model Creation
  [Diagram: source code enters the SUIF front end and is converted to SUIF format; the ACT tool, guided by the user and a profiler, generates the application and parallelisation layers.]
  • Software model creation using the ACT tool
  • Hardware model creation using the HMCL language

  13. Parallel Template
  [Diagram: strict mapping relations between application source code and model scripts. The serial parts of a subtask map to Tx computation objects, the abstracted parallel part maps to a parallel template, and the platform is described by a hardware object (HMCL).]

  14. Overview of Sweep3D
  • Sweep3D is part of the Accelerated Strategic Computing Initiative (ASCI) application suite.
  • Sweep3D solves a 1-group, time-independent, discrete ordinates (Sn), 3D Cartesian (XYZ) geometry neutron transport problem.
  • Sweep3D exploits parallelism through a wavefront process.

  15. Model Decomposition
  [Diagram: the sweep3d model decomposes into an application layer (source, sweep, fixed, flux_err), a parallel template layer (async, pipeline, global_sum, global_max), and a hardware layer (SunUltra1, SgiOrigin2000).]

  16. Parallel Template

  HMCL hardware object (communication and computation parameters for the SGI Origin2000):

    config SgiOrigin2000 {
      hardware { ...... }
      pvm { ...... }
      mpi {
        ......
        DD_COMM_A  = 512, DD_COMM_B  = 33.228,  DD_COMM_C  = 0.02260, DD_COMM_D  = -5.9776, DD_COMM_E  = 0.10690,
        DD_TRECV_A = 512, DD_TRECV_B = 22.065,  DD_TRECV_C = 0.06438, DD_TRECV_D = -1.7891, DD_TRECV_E = 0.09145,
        DD_TSEND_A = 512, DD_TSEND_B = 14.2672, DD_TSEND_C = 0.05225, DD_TSEND_D = -12.327, DD_TSEND_E = 0.07646,
        ......
      }
      clc {
        ....
        CMLL = 0.0098327, CMLG = 0.0203127, CMSL = 0.0096327, CMSG = 0.0305927,
        CMCL = 0.0100327, CMCG = 0.0223627, CMFL = 0.0107527, CMFG = 0.0229227,
        ....
      }
    }

  Parallel template script (the pipeline mirrors the structure of the source code below):

    partmp pipeline {
      ......
      procexec init {
        ......
        step cpu { confdev Tx_sweep_init; }
        for( phase = 1; phase <= 8; phase = phase + 1 ) {
          step cpu { confdev Tx_octant; }
          step cpu { confdev Tx_get_direct; }
          for( i = 1; i <= mmo; i = i + 1 ) {
            step cpu { confdev Tx_pipeline_init; }
            for( j = 1; j <= kb; j = j + 1 ) {
              step cpu { confdev Tx_kk_loop_init; }
              for( x = 1; x <= npe_i; x = x + 1 )
                for( y = 1; y <= npe_j; y = y + 1 ) {
                  myid   = Get_myid( x, y );
                  ew_rcv = Get_ew_rcv( phase, x, y );
                  if( ew_rcv != 0 )
                    step mpirecv { confdev ew_rcv, myid, nib; }
                  else
                    step cpu on myid { confdev Tx_else_ew_rcv; }
                }
              step cpu { confdev Tx_comp_face; }
              for( x = 1; x <= npe_i; x = x + 1 )
                for( y = 1; y <= npe_j; y = y + 1 ) {
                  myid   = Get_myid( x, y );
                  ns_rcv = Get_ns_rcv( phase, x, y );
                  if( ns_rcv != 0 )
                    step mpirecv { confdev ns_rcv, myid, njb; }
                  else
                    step cpu on myid { confdev Tx_else_ns_rcv; }
                }
              step cpu { confdev Tx_work; }
              ......
            }
            step cpu { confdev Tx_last; }
          }
        }
      }
    }

  Corresponding Sweep3D source code:

    void sweep() {
      ......
      sweep_init();
      for( iq = 1; iq <= 8; iq++ ) {
        octant();
        get_direct();
        for( mo = 1; mo <= mmo; mo++ ) {
          pipeline_init();
          for( kk = 1; kk <= kb; kk++ ) {
            kk_loop_init();
            if (ew_rcv != 0)
              info = MPI_Recv(Phiib, nib, MPI_DOUBLE, tids[ew_rcv],
                              ew_tag, MPI_COMM_WORLD, &status);
            else
              else_ew_rcv();
            comp_face();
            if (ns_rcv != 0)
              info = MPI_Recv(Phijb, njb, MPI_DOUBLE, tids[ns_rcv],
                              ns_tag, MPI_COMM_WORLD, &status);
            else
              else_ns_rcv();
            work();
            ......
          }
          last();
        }
      }
    }

  17. Validation on SGI O2000
  [Plots: model vs. measured run times (sec) on 2-16 processors for grid sizes 15x15x15, 25x25x25, 35x35x35, and 50x50x50.]

  18. Validation on Sun Ultra1
  [Plots: model vs. measured run times (sec) on 2-9 processors for grid sizes 15x15x15, 25x25x25, 35x35x35, and 50x50x50.]

  19. PACE Summary
  Achieved:
  • Accurate prediction results – at most 15% error
  • Rapid evaluation time – typically less than 2s
  • Easy cross-platform comparisons
  Not yet addressed:
  • Scalability – multiple administrative domains, millions of computing resources
  • Adaptability – communication irregularities, performance changes

  20. The Question Is … ?

  21. A4 Methodology
  • Agility: quick adaptation to a changing environment
  • Architecture: an overview of the components in a system
  • Autonomy: acting without direct intervention
  • Agent: a high-level abstraction of complex systems
  Anything else …
  Key elements: agent hierarchy, agent structure, agent capability tables, service advertisement, service discovery, performance metrics, and the A4 simulator.

  22. Agent Hierarchy
  An agent is …
  • A local manager
  • A middleman for users
  • A broker
  • A coordinator
  • A service provider
  • A service requestor
  • A matchmaker
  • A router

  23. Agent Structure
  [Diagram: three layers, with the local management layer on top of the coordination layer, which sits on the communication layer.]
  • Communication Layer – agents in the system must be able to communicate with each other using common data models and communication protocols.
  • Coordination Layer – data an agent receives at the communication layer is interpreted and passed to the coordination layer, which decides how the agent should act on the data according to its own knowledge.
  • Local Management Layer – the agent, as a local manager, is responsible for maintaining local services and providing the service information the coordination layer needs to make decisions. A rough sketch of this layering in C follows.
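
  As a rough illustration of this layering (not the actual ARMS code; every type and function name below is an assumption), the three layers can be seen as nested structures, with a message climbing from the communication layer to the coordination layer, which consults local management:

    #include <stdio.h>

    /* Hypothetical message exchanged between agents (assumed layout). */
    typedef struct {
        int  sender_id;
        char kind[16];      /* e.g. "advertise" or "discover"          */
        char payload[256];  /* service info in a common data model     */
    } Message;

    /* Local management layer: keeps the local service information.    */
    typedef struct {
        int n_local_services;
    } LocalMgmt;

    /* Coordination layer: decides how to act on incoming data, using
       the knowledge held by the local management layer.               */
    typedef struct {
        LocalMgmt *local;
    } Coordination;

    /* Communication layer: speaks the common protocol with peers.     */
    typedef struct {
        Coordination *coord;
    } Communication;

    /* An incoming message is parsed by the communication layer and
       handed to the coordination layer, which consults local state.   */
    void on_message(Communication *comm, const Message *m)
    {
        printf("agent got '%s' from %d (local services: %d)\n",
               m->kind, m->sender_id,
               comm->coord->local->n_local_services);
    }

    int main(void)
    {
        LocalMgmt lm = { 3 };
        Coordination co = { &lm };
        Communication comm = { &co };
        Message m = { 7, "discover", "sweep3d on 16 processors?" };
        on_message(&comm, &m);
        return 0;
    }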

  24. [Diagram: two agents, each built from the same three layers (local management, coordination, communication), interact through their communication layers; one performs service advertisement, the other service discovery.]

  25. Service Advertisement
  “Hi, please find attached my service information.”
  “Hi, could you please give me some service information that you have?”
  • Full service advertisement requires no service discovery.
  • No service advertisement results in complex service discovery.
  A balance must be struck between the two.

  26. Agent Capability Tables
  Service advertisement and discovery correspond to the maintenance and lookup of the ACTs.
  Vary by source:
  • T_ACT: contains service info of local resources
  • L_ACT: contains service info coming from lower agents
  • G_ACT: contains service info coming from the upper agent
  • C_ACT: contains service info cached during discovery
  Maintenance strategies (sketched below):
  • Data-push: submit service info to other agents
  • Data-pull: request service info from other agents
  • Periodic: ACT maintenance at fixed intervals
  • Event-driven: ACT maintenance driven by system events
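
  The four maintenance strategies are easy to sketch in C. The helpers below (send_act_to, request_act_from) are invented stand-ins, not ARMS functions; in the real implementation agent communication goes through the file system (see slide 39):

    #include <stdio.h>

    enum act_kind { T_ACT, L_ACT, G_ACT, C_ACT };

    typedef struct {
        int  agent_id;        /* which agent offers the service */
        char service[64];     /* a service description          */
    } act_entry;

    /* Hypothetical transport stand-ins. */
    static void send_act_to(int peer, const act_entry *rows, int n)
    {                                           /* data-push */
        printf("push %d entries to agent %d\n", n, peer);
        (void)rows;
    }

    static void request_act_from(int peer)      /* data-pull */
    {
        printf("pull service info from agent %d\n", peer);
    }

    /* Periodic: maintain the ACTs on a fixed timer tick.          */
    static void on_timer_tick(int upper, const act_entry *t, int n)
    {
        send_act_to(upper, t, n);     /* advertise local services  */
        request_act_from(upper);      /* refresh cached info       */
    }

    /* Event-driven: maintain the ACTs only when something changes. */
    static void on_local_change(int upper, const act_entry *t, int n)
    {
        send_act_to(upper, t, n);
    }

    int main(void)
    {
        act_entry t_act[1] = { { 1, "16x SGI Origin 2000" } };
        on_timer_tick(2, t_act, 1);
        on_local_change(2, t_act, 1);
        return 0;
    }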

  27. Performance Metrics
  • Discovery speed
  • System efficiency
  • Success rate
  • Load balancing
  These metrics conflict with one another.

  28. A4 Simulator
  [Diagram: the input GUI feeds the model composer with agent-level modelling (requests, services, strategies, agent mobility) and system-level modelling (request distribution, service distribution, global strategies); the kernel builds a performance model and an agent hierarchy for the simulation engine; the output GUI offers step-by-step, accumulative, agent, and log views.]

  29. A4 Simulator Implementation
  [Screenshots: model browser, model viewer, agent viewer, simulation results.]
  • Support for all performance metrics
  • Support for all strategy configurations
  • Two-level performance modelling
  • Multi-view display of simulation results
  • Comparison of strategies
  • Agent mobility modelling

  30. A Case Study
  Impact of service mobility on discovery performance.
  [Diagram: in the agent hierarchy, an initial learning process settles into a stable state; when the service moves, a new learning process follows, ending in higher performance.]

  31. Summary
  A4 is a reference model for building large-scale distributed software systems with highly dynamic behaviours.
  A4 + PACE → ARMS

  32. ARMS for Grid Computing
  • ARMS in context
  • ARMS architecture
  • ARMS agent structure
  • Service information
  • Request information
  • Multi-processor scheduling
  • ARMS implementation
  • A case study: agents & resources; applications & requests; experiment results I and II
  At the meta level, agents cooperate with each other for service discovery; at the local level, PACE functions supply accurate performance information.

  33. ARMS in Context
  [Diagram: ARMS sits between grid users and grid resources. It builds on the A4 model and its simulator, is steered by the PMA, and uses the PACE components: application tools (AT), evaluation engine (EE), and resource tools (RT).]

  34. ARMS Architecture
  [Diagram: users submit application models and cost models through the application tools (AT) to a hierarchy of agents; each agent contains ACTs and a PACE evaluation engine (EE); the PMA attaches to the hierarchy; at the bottom, resource tools (RT) supply resource models for the processors. Could an agent in the hierarchy become a bottleneck?]

  35. ARMS Agent Structure
  [Diagram: the communication module handles advertisement and discovery to and from other agents; the ACT manager records service info (resource info and agent IDs) in the ACTs; the matchmaker passes application models and cost models to the PACE evaluation engine and acts on the evaluation results; the scheduler carries out resource monitoring, resource allocation, application management, and application execution, maintaining scheduling cost and application info locally.]
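
  The matchmaking flow this diagram suggests can be sketched as follows. This is a guess at the control flow, not the ARMS sources; pace_evaluate is a stand-in for the PACE evaluation engine and returns a canned prediction:

    #include <stdio.h>

    typedef struct {
        const char *app_model;   /* PACE application model       */
        double      deadline;    /* cost model: max run time (s) */
    } request;

    typedef struct {
        const char *res_model;   /* PACE resource model of local kit */
        int         n_act_peers; /* agents known through the ACTs    */
        const int  *act_peers;
    } agent;

    /* Stand-in for the PACE evaluation engine. */
    static double pace_evaluate(const char *app, const char *res)
    {
        (void)app; (void)res;
        return 12.5;  /* a predicted run time in seconds */
    }

    /* Serve locally if the cost model can be met; otherwise fall
       back to discovery through the ACTs (forward to a peer).    */
    static int handle_request(const agent *a, const request *r)
    {
        double t = pace_evaluate(r->app_model, a->res_model);
        if (t <= r->deadline) {
            printf("schedule locally, predicted %.1fs\n", t);
            return 0;
        }
        if (a->n_act_peers > 0) {
            printf("forward to agent %d (discovery)\n", a->act_peers[0]);
            return 1;
        }
        printf("no match found\n");
        return -1;
    }

    int main(void)
    {
        int peers[1] = { 4 };
        agent a = { "SgiOrigin2000", 1, peers };
        request r = { "sweep3d", 10.0 };
        return handle_request(&a, &r) < 0;
    }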

  36. Service Information
  The ACT manager controls the agent's access to the ACT database, where service information on grid resources is recorded. Each entry holds:
  • Resource info: for each processor 1 … n, an ID, a type, and a PACE resource model
  • Application info: for each application 1 … m, an ID, a start time, and an end time
  • The application-resource mapping
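
  In C, the entry layout described on this slide might look like the following sketch; the field names are read off the slide, the types are assumptions:

    /* Per-processor resource information. */
    typedef struct {
        int   id;
        char  type[32];          /* e.g. "SGI Origin 2000"          */
        void *pace_res_model;    /* handle to a PACE resource model */
    } proc_info;

    /* Per-application schedule information. */
    typedef struct {
        int    id;
        double start_time;
        double end_time;
    } app_info;

    /* One ACT database entry: resources, applications, and the
       application-resource mapping (app i runs on mapping[i]).   */
    typedef struct {
        proc_info *procs;  int n_procs;   /* Proc. 1 .. Proc. n   */
        app_info  *apps;   int n_apps;    /* App. 1 .. App. m     */
        int       *mapping;               /* app index -> proc id */
    } service_info;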

  37. Request Information
  A request sent to an ARMS agent should include all related information on the application and its execution requirements.
  • The PACE application model includes all performance-related information about the application to be executed, and can be input to and evaluated by the PACE evaluation engine.
  • The cost model includes the performance metrics, and their required values, that a grid service provided by a grid resource must meet. These may include execution time, memory usage, etc.
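
  A minimal sketch of such a request, assuming execution time and memory usage as the cost-model metrics (the struct and field names are illustrative, not the ARMS data structures):

    /* Cost model: bounds a grid service must satisfy. */
    typedef struct {
        double max_exec_time;   /* e.g. a deadline in seconds */
        double max_memory;      /* a memory-usage bound in MB */
    } cost_model;

    /* A request: a PACE application model plus its cost model. */
    typedef struct {
        void      *pace_app_model;  /* evaluable by the PACE engine  */
        cost_model cost;            /* metrics the service must meet */
    } arms_request;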

  38. Multi-processor Scheduling
  [Diagram: a Gantt chart of applications scheduled across Processors 1-8.]
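
  One plausible reading of this scheduling step: try successive processor counts, predict each run time with PACE, and keep the earliest finish that still meets the cost model. The sketch below assumes a sorted table of processor idle times and a toy prediction function standing in for a real PACE model:

    #include <stdio.h>

    #define N_PROCS 8

    /* Hypothetical: when each processor next becomes idle (sec),
       sorted ascending, as read off a Gantt chart.               */
    static double free_at[N_PROCS] = { 0, 3, 3, 5, 5, 5, 8, 8 };

    /* Toy stand-in for a PACE prediction: run time on p processors. */
    static double predict(int p) { return 40.0 / p + 0.5 * p; }

    int main(void)
    {
        double deadline = 30.0, best_end = 1e30;
        int best_p = -1;

        /* Allocate the p earliest-free processors, p = 1..8; since
           free_at[] is sorted, the start time is free_at[p-1].     */
        for (int p = 1; p <= N_PROCS; p++) {
            double start = free_at[p - 1];
            double end   = start + predict(p);
            if (end <= deadline && end < best_end) {
                best_end = end;
                best_p   = p;
            }
        }
        if (best_p > 0)
            printf("use %d processors, finish at %.1fs\n", best_p, best_end);
        else
            printf("cost model cannot be met locally\n");
        return 0;
    }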

  39. ARMS Implementation
  [Screenshots: agent platform, Gantt chart, auto clients, info browser.]
  • C/C++, X Windows
  • Simple data structures for data representation
  • File system for data management and agent communication
  • Multi-threaded agent kernel

  40. A Case Study
  • 8 agents, 8 grid resources, 16*8 processors
  • SGI Origin2000, Sun clusters
  • 7 applications, 149 requests
  • Sweep3D, fft, jacobi, memsort, etc.
  • 1 request every 3 sec, over 7 min.
  • Random request frequency, application, and target agent
  • 16% of requests needed 1-step discovery, 7% 2-step
  • Application distribution 7% – 19%
  • 97% success rate

  41. Agents & Resources

  Agent       Resource Type        #Processors/Hosts
  gem         SGI Origin 2000      16
  origin      SGI Origin 2000      16
  sprite      Sun Ultra 10         16
  tizer       Sun Ultra 10         16
  coke        Sun Ultra 1          16
  budweiser   Sun Ultra 5          16
  burroughs   Sun SPARCstation 2   16
  rubbish     Sun SPARCstation 2   16

  [Diagram: the agent hierarchy over these eight agents.]

  42. Applications & Requests

  43. Experiment Results I Applications scheduled @ tizer

  44. Experiment Results II
  [Charts: application distribution; statistical results.]

  45. The Answer Is ARMS

  46. PMA Agent
  • PMA structure
  • Performance optimisation strategies
    – Use of ACTs
    – Limit service lifetime
    – Limit scope of service advertisement and discovery
    – Agent mobility and service distribution
  • Performance steering policies
  • A case study
    – Agents & strategies
    – Requests & services
    – Simulation results I
    – Simulation results II

  47. PMA Structure
  [Diagram: the PMA monitors ARMS agents and collects statistical data; its model composer builds a performance model, which the simulation engine runs under given strategies and policies, feeding reconfiguration back to the agents.]
  Monitored statistics:
  • Relative request performance value
  • Request sending frequency
  • Relative service performance value
  • Service performance changing frequency

  48. Performance Optimisation Strategies
  Strategies vary by:
  • Dynamics
  • Hierarchy
  • Distribution
  • Pre-knowledge
  The strategies: use of ACTs; limit service lifetime; limit scope of service advertisement and discovery; agent mobility and service distribution. A sketch of the service-lifetime strategy follows.
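
  The service-lifetime strategy, sketched: stamp every ACT entry with an expiry time and prune entries whose lifetime has passed, so that stale advertisements stop steering discovery. Field and function names here are assumptions:

    #include <stdio.h>

    typedef struct {
        int    agent_id;
        double expires_at;   /* advertised service lifetime limit (sec) */
    } act_item;

    /* Drop expired entries in place; returns the new table size. */
    static int prune_expired(act_item *items, int n, double now)
    {
        int kept = 0;
        for (int i = 0; i < n; i++)
            if (items[i].expires_at > now)
                items[kept++] = items[i];   /* keep live entries */
        return kept;
    }

    int main(void)
    {
        act_item act[3] = { {1, 5.0}, {2, 15.0}, {3, 9.0} };
        int n = prune_expired(act, 3, 10.0);
        printf("%d live entries\n", n);     /* prints: 1 */
        return 0;
    }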

  49. Performance Steering Policies
  Policies for balancing the workload between service advertisement and discovery:
  • T_ACT: event-driven data-push
  • C_ACT: event-driven data-pull and data-push
  • L_ACT: avoid redundant advertisement
  • G_ACT: avoid data-push
  • Avoid using event-driven and periodic approaches simultaneously
  • Avoid using data-pull and data-push approaches simultaneously
  • Two-level performance steering
  • Comparing different combinations of strategies

  50. A Case Study
  • 251 agents, 3 layers
  • System-level configuration of strategies
  • 4 ACT usages, 6 strategies
  • 13 experiments
  • System-level definitions of services and requests
  • Comparison of different combinations of strategies
  • A middle strategy is chosen as best
  • Agent-level configuration may lead to better performance
