
Computing and Brokering






Presentation Transcript


  1. Computing and Brokering Grid Middleware 5 David Groep, lecture series 2005-2006

  2. Grid Middleware V 2 Outline • Classes of computing services • MPP SHMEM • Clusters with high-speed interconnect • Conveniently parallel jobs • Through the hourglass: basic functionalities • Representing computing services • resource availability, RunTimeEnvironment • Software installation and ESIA • Jobs as resources, or ? • Brokering • brokering models: central view, per-user broker, ‘neighbourhood’ P2P brokering • job farming and DAGs: Condor-G, gLite WMS, Nimrod-G, DAGMan • resource selection: ERT, freeCPUs, …? Prediction techniques and challenges • co-locating jobs and data, input & output sandboxes, LogicalFiles • Specialties • Supporting interactivity

  3. Computing Service resource variability and the hourglass model

  4. Grid Middleware V 4 The Famous Hourglass Model

  5. Grid Middleware V 5 Types of systems Very different models and pricing; suitability depends on application • shared memory MPP systems • vector systems • cluster computing with high-speed interconnect • can perform like MPP, except for the single memory image • e.g. Myrinet, Infiniband • coarse-grained compute clusters • ‘conveniently parallel’ applications without IPC • can be built of commodity components • specialty systems • visualisation, systems with dedicated co-processors, …

  6. Grid Middleware V 6 Quick, cheap, or both: how to run an app? Task: how to run your application • the fastest, or • the most cost-effective (this argument usually wins) Two choices to speed up an application • Use the fastest processor available • but this gives only a small factor over modest (PC) processors • Use many processors, doing many tasks in parallel • and since quite fast processors are inexpensive we can think of using very many processors in parallel • but the problem must first be decomposed “fast, cheap, good – pick any two”

  7. Grid Middleware V 7 High Performance – or – High Throughput? Key question: max. granularity of decomposition: • Have you got one big problem or a bunch of little ones? • To what extent can the “problem” be decomposed into sort-of-independent parts (‘grains’) that can all be processed in parallel? • Granularity • fine-grained parallelism – the independent bits are small, need to exchange information, synchronize often • coarse-grained – the problem can be decomposed into large chunks that can be processed independently • Practical limits on the degree of parallelism – • how many grains can be processed in parallel? • degree of parallelism v. grain size • grain size limited by the efficiency of the system at synchronising grains
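To make the coarse-grained case concrete: a minimal Python sketch (not from the lecture; the work function and grain sizes are invented) of a ‘conveniently parallel’ decomposition, where independent grains are processed with no communication between them:

```python
from multiprocessing import Pool

def process_grain(grain):
    """One independent 'grain' of work: no synchronisation with other grains."""
    return sum(i * i for i in range(grain))

if __name__ == "__main__":
    grains = [10_000, 20_000, 30_000, 40_000]   # independent chunks of the problem
    with Pool() as pool:                        # degree of parallelism = number of workers
        results = pool.map(process_grain, grains)
    print(results)
```

With grains this large and no inter-grain communication, throughput scales with the number of workers; fine-grained problems, by contrast, would spend their time in synchronisation rather than in process_grain.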

  8. Grid Middleware V 8 High Performance – v. – High Throughput? • fine-grained problems need a high performance system • that enables rapid synchronization between the bits that can be processed in parallel • and runs the bits that are difficult to parallelize as fast as possible • coarse-grained problems can use a high throughput system • that maximizes the number of parts processed per minute • High Throughput Systems use a large number of inexpensive processors, inexpensively interconnected • High Performance Systems use a smaller number of more expensive processors expensively interconnected

  9. Grid Middleware V 9 High Performance – v. – High Throughput? • There is nothing fundamental here – it is just a question of financial trade-offs like: • how much more expensive is a “fast” computer than a bunch of slower ones? • how much is it worth to get the answer more quickly? • how much investment is necessary to improve the degree of parallelization of the algorithm? • But the target is moving - • Since the cost chasm first opened between fast and slower computers 12-15 years ago an enormous effort has gone into finding parallelism in “big” problems • Inexorably decreasing computer costs and de-regulation of the wide area network infrastructure have opened the door to ever larger computing facilities – clusters → fabrics → (inter)national grids, demanding ever-greater degrees of parallelism

  10. Grid Middleware V 10 But the fact is: ‘the food chain has been reversed’, and supercomputer vendors are struggling to make a living. Graphic: Network of Workstations, Berkeley IEEE Micro, Feb, 1995, Thomas E. Anderson, David E. Culler, David A. Patterson

  11. Grid Middleware V 11 Using these systems • As clusters and capability systems are both ‘expensive’ (i.e. not on your desktop), they are resources that need to be scheduled • interface for scheduled access is a batch queue • job submit, cancel, status, suspend • sometimes: checkpoint-restart in the OS, e.g. on SGI IRIX • allocate #processors (and amount of memory; these may be linked!) as part of the job request • systems usually also have a smaller interactive partition • not intended for running production jobs …

  12. Grid Middleware V 12 Cluster batch system model

  13. Grid Middleware V 13 Some batch systems • Batch systems and schedulers • Torque (OpenPBS, PBS Pro) • Sun Grid Engine (that’s not a Grid) • Condor • LoadLeveler • Load Sharing Facility (LSF) • Dedicated schedulers: MAUI • can drive scheduling for Torque/PBS, SGE, LSF, … • support advanced scheduling features, like: reservation, fair-shares, accounts/banking, QoS • head node or UI system can usually be used for test jobs

  14. Grid Middleware V 14 Torque/PBS job description

    # PBS batch job script
    #PBS -l walltime=36:00:00
    #PBS -l cput=30:00:00
    #PBS -l vmem=1gb
    #PBS -q qlong

    # Executing user job
    UTCDATE=`date -u '+%Y%m%d%H%M%SZ'`
    echo "Execution started on $UTCDATE"
    echo "*****"
    printenv
    date
    sleep 3
    date
    id
    hostname

  15. Grid Middleware V 15 PBS queue

    bosui:tmp:1010$ qstat -an1|head -10
    tbn20.nikhef.nl:
                                                                       Req'd  Req'd   Elap
    Job ID               Username Queue    Jobname    SessID NDS TSK   Memory Time  S Time
    -------------------- -------- -------- ---------- ------ --- ---   ------ ----- - -----
    823302.tbn20.nikhef. biome034 qlong    STDIN       20253   1  --       -- 60:00 R 20:58   node15-11
    824289.tbn20.nikhef. biome034 qlong    STDIN        6775   1  --       -- 60:00 R 15:25   node15-5
    824372.tbn20.nikhef. biome034 qlong    STDIN       10495   1  --       -- 60:00 R 15:10   node16-21
    824373.tbn20.nikhef. biome034 qlong    STDIN        3422   1  --       -- 60:00 R 14:40   node16-32
    ...
    827388.tbn20.nikhef. lhcb031  qlong    STDIN          --   1  --       -- 60:00 Q    --   --
    827389.tbn20.nikhef. lhcb031  qlong    STDIN          --   1  --       -- 60:00 Q    --   --
    827390.tbn20.nikhef. lhcb002  qlong    STDIN          --   1  --       -- 60:00 Q    --   --

  16. Grid Middleware V 16 Example: Condor – clusters of idle workstations. Architecture diagram: a Central Manager (running the negotiator and collector daemons) exchanges ClassAds with cluster nodes and desktops; every machine runs a master, submit machines run a schedd and execute machines a startd (legend: process spawned; ClassAd communication pathway). The Condor Project, Miron Livny et al., University of Wisconsin, Madison. See http://www.cs.wisc.edu/condor/

  17. Grid Middleware V 17 Condor example • Write a submit file:

    Executable = dowork
    Input = dowork.in
    Output = dowork.out
    Arguments = 1 alpha beta
    Universe = vanilla
    Log = dowork.log
    Queue

  • Give it to Condor: condor_submit <submit-file> • Watch it run: condor_q • Files: on a shared filesystem, in a cluster at least; for other options see later. From: Alan Roy, IO Access in Condor and Grid, UW Madison. See http://www.cs.wisc.edu/condor/

  18. Grid Middleware V 18 Matching jobs to resources • For ‘homogeneous’ clusters mainly policy-based • FIFO • credential-based policy • fair-share • queue wait time • banks & accounts • QoS specific • For heterogeneous clusters (like condor pools) • matchmaking based on resource & job characteristics • see later in grid matchmaking
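The matchmaking idea for heterogeneous pools can be sketched in a few lines of Python. This is a schematic illustration only, not Condor's actual ClassAd machinery; the attribute names, requirements and rank expressions are invented:

```python
def match(job, resources):
    """Schematic matchmaking: filter resources on the job's requirements,
    then order the surviving candidates by the job's rank expression."""
    candidates = [r for r in resources if job["requirements"](r)]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])

resources = [                                      # invented resource ads
    {"name": "node15", "arch": "x86_64", "memory_mb": 2048, "load": 0.3},
    {"name": "node16", "arch": "x86_64", "memory_mb": 4096, "load": 0.9},
    {"name": "sgi01",  "arch": "mips",   "memory_mb": 8192, "load": 0.1},
]

job = {                                            # invented job ad
    "requirements": lambda r: r["arch"] == "x86_64" and r["memory_mb"] >= 1024,
    "rank":         lambda r: -r["load"],          # prefer the least loaded machine
}

print(match(job, resources))                       # -> the node15 entry
```

The same two-sided pattern (resources advertise attributes, jobs state requirements and a preference order) reappears later in grid-level matchmaking.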

  19. Grid Middleware V 19 Example: scheduling policies - MAUI

    RMTYPE[0]               PBS
    RMHOST[0]               tbn20.nikhef.nl
    ...
    NODEACCESSPOLICY        SHARED
    NODEAVAILABILITYPOLICY  DEDICATED:PROCS
    NODELOADPOLICY          ADJUSTPROCS
    FEATUREPROCSPEEDHEADER  xps
    BACKFILLPOLICY          ON
    BACKFILLTYPE            FIRSTFIT
    NODEALLOCATIONPOLICY    FASTEST
    FSPOLICY                DEDICATEDPES
    FSDEPTH                 24
    FSINTERVAL              24:00:00
    FSDECAY                 0.99
    GROUPCFG[users]    FSTARGET=1  PRIORITY=10   MAXPROC=50
    GROUPCFG[dteam]    FSTARGET=2  PRIORITY=5000 MAXPROC=32
    GROUPCFG[alice]    FSTARGET=9  PRIORITY=100  MAXPROC=200 QDEF=lhcalice
    GROUPCFG[alicesgm] FSTARGET=1  PRIORITY=100  MAXPROC=200 QDEF=lhcalice
    GROUPCFG[atlas]    FSTARGET=54 PRIORITY=100  MAXPROC=200 QDEF=lhcatlas
    QOSCFG[lhccms]     FSTARGET=1- MAXPROC=10

  MAUI is an open source product from ClusterResources, Inc. http://www.supercluster.org/

  20. Grid Interface to Computing

  21. Grid Middleware V 21 Grid Interfaces to the compute services • Need common interface for job management • for test jobs in ‘interactive’ mode: fork • like the interactive partition in clusters and supers • batch system interface: • executable • arguments • #processors • memory • environment • stdin/out/err • Note: • batch system usually doesn’t manage local file space • assumes executable is ‘just there’, because of shared FS or JIT copying of the files to the worker node in job prologue • local file space management needs to be exposed as part of the grid service and then implemented separately
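As an illustration of what such a common job-management interface has to carry, a minimal Python sketch of a job description holding the fields listed above (the class and field names are invented and do not correspond to any particular middleware's schema):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class JobDescription:
    """Illustrative container for the attributes a generic batch-style
    grid interface needs; names are invented for this sketch."""
    executable: str
    arguments: List[str] = field(default_factory=list)
    processors: int = 1
    memory_mb: Optional[int] = None
    environment: Dict[str, str] = field(default_factory=dict)
    stdin: Optional[str] = None
    stdout: str = "stdout.txt"
    stderr: str = "stderr.txt"

job = JobDescription(executable="/bin/hostname", processors=1, memory_mb=1024)
print(job)
```

Note that nothing in this description says where the executable or the input files live: as the slide points out, staging them to the worker node is a separate concern that the grid service has to expose explicitly.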

  22. Grid Middleware V 22 Expectations? What can a user expect from a compute service? • Different user scenarios are all valid: • paratrooper mode: come in, take all your equipment (files, executable &c) with you, do your thing and go away • you’re supposed to clean up, but the system will likely do that for you if you forget. In all cases, garbage left behind is likely to be removed • two-stage ‘prepare’ and ‘run’ • extra services to pre-install environment and later request it • see later on such Community Software Area services • don’t think but just do it • blindly assume the grid is like your local system • expect all software to be there • expect your results to be retained indefinitely • … realism of this scenario is quite low for ‘production’ grids, as it does not scale to larger numbers of users

  23. Grid Middleware V 23 Basic Operations • Direct run/submit • useless unless you have an environment already set up • Cancel • Signal • Suspend • Resume • List jobs/status • Purge (remove garbage) • retrieve output first … Other useful functions • Assess submission (eligibility, ERT) • Register & Start (needed if you have sandboxes)
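A minimal sketch of how these basic operations might map onto a Torque/PBS back-end, assuming the PBS client commands (qsub, qstat, qdel, qsig, qhold, qrls) are installed; option syntax varies per installation and this is not any middleware's actual adapter:

```python
import subprocess

class PBSComputeService:
    """Illustrative mapping of the generic operations onto Torque/PBS commands."""

    def submit(self, script_path: str) -> str:
        # qsub prints the new job identifier on stdout
        out = subprocess.run(["qsub", script_path],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()

    def status(self, job_id: str) -> str:
        out = subprocess.run(["qstat", job_id],
                             capture_output=True, text=True, check=True)
        return out.stdout

    def cancel(self, job_id: str) -> None:
        subprocess.run(["qdel", job_id], check=True)

    def signal(self, job_id: str, sig: str = "SIGUSR1") -> None:
        subprocess.run(["qsig", "-s", sig, job_id], check=True)

    def suspend(self, job_id: str) -> None:
        # places a hold on a queued job, the closest PBS analogue to 'suspend'
        subprocess.run(["qhold", job_id], check=True)

    def resume(self, job_id: str) -> None:
        subprocess.run(["qrls", job_id], check=True)
```

What the batch commands do not give you, and what a grid CE therefore has to add on top, are the last two items of the slide: assessing eligibility/ERT before submission, and a register-and-start split for handling sandboxes and purging leftover output.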

  24. Grid Middleware V 24 A job submission diagram for a single CE • Example: explicit interactions • diagram from: DJRA1.1, EGEE Middleware Architecture

  25. Grid Middleware V 25 WS-GRAM: Job management using WS-RF • same functionality modelled with jobs represented as resources • for the input sandbox it leverages an existing (GT4) data movement service • exploits re-useable components

  26. Grid Middleware V 26 GT4 WS GRAM Architecture. Architecture diagram: on the service host(s), a GT4 Java container runs the GRAM services, a Delegation service and RFT file transfer; the GRAM adapter invokes local job control (via sudo) and the local scheduler on the compute element(s), with job events reported back through the SEG; the client delegates credentials and issues transfer requests, and GridFTP moves the user job's files between the compute element and remote storage element(s) (FTP control/data channels). Diagram from: Carl Kesselman, ISI, ISOC/GFNL masterclass 2006

  27. Grid Middleware V 27 GT2 GRAM • Informational & historical: • so don’t blame the current Globus for this … single job submission flow chart

  28. Grid Middleware V 28 GRAM GT2 Protocol • RSL over http-g • targeted at a single specific resource • http-g is like https • modified protocol (by one byte) to specify delegation • no longer interoperable with standard https • delegation implicit in job submission • RSL Resource Specification Language • Used in the GRAM protocol to describe the job • required some (detailed) knowledge about the target system

  29. Grid Middleware V 29 GT2 RSL

    &(executable="/bin/echo")
     (arguments="12345")
     (stdout=x-gass-cache://$(GLOBUS_GRAM_JOB_CONTACT)stdout anExtraTag)
     (stderr=x-gass-cache://$(GLOBUS_GRAM_JOB_CONTACT)stderr anExtraTag)
     (queue=qshort)
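Since RSL is just an attribute-value syntax, a client can assemble a request like the one above with plain string formatting. A small Python sketch, illustrative only and covering just the attributes shown on this slide:

```python
def make_rsl(executable, arguments=None, queue=None, extra=None):
    """Assemble a GT2-style RSL string of the form shown above.
    Pure string formatting; 'extra' takes pre-formatted (attribute=value) clauses."""
    parts = [f'(executable="{executable}")']
    if arguments:
        parts += [f'(arguments="{a}")' for a in arguments]
    if queue:
        parts.append(f'(queue={queue})')
    if extra:
        parts += list(extra)
    return "&" + "\n ".join(parts)

print(make_rsl("/bin/echo", arguments=["12345"], queue="qshort"))
```

The ease of writing such strings by hand is also RSL's weakness noted on the previous slide: the author needs detailed knowledge of the target system (queue names, paths) for the request to make sense.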

  30. Grid Middleware V 30 GT2 Job Manager interface • One job manager per running or queued job • provides the control interface: cancel, suspend, status • GASS ‘Global Access to Secondary Storage’: • stdin, stdout, stderr • selected input/output files • listens on a specific TCP port on the Gatekeeper host • Some issues • protocol does not provide two-phase commit • no way to know if the job really made it • too many open ports • one process for each queued job, i.e. too many processes • Workaround • don’t submit a job, but instead a grid-manager process

  31. Grid Middleware V 31 Performance ? • Time to submit a basic GRAM job • Pre-WS GRAM: < 1 second • WS GRAM (in Java): 2 seconds so GT2-style GRAM did have one significant advantage … • Concurrent jobs • Pre-WS GRAM: 300 jobs • WS GRAM: 32,000 jobs

  32. Grid Middleware V 32 Scaling scheduling • load on the CE head node per VO cannot be controlled with a single common job manager • with many VOs • might need to resolve inter-VO resource contention • different VOs may want different policies • make the CE ‘pluggable’ • and provide a common CE interface, irrespective of the site-specific job submission mechanism • as long as the site supports a ‘fork’ JM

  33. Grid Middleware V 33 gLite job submission model • diagram: a site running one grid CEMON per VO or user

  34. Grid Middleware V 34 Unicore CE Other design and concept: • eats JSDL (GGF standard) as a description • describes job requirements in detail • security model cannot support dynamic VOs yet • grid-wide coordinated UID space • (or shared group accounts for all grid users) • no VO management tools (DEISA added a directory for that) • intra-site communication not secured • one big plus: job management uses only 1 port for all communications (including file transfer), and is thus firewall-friendly

  35. Grid Middleware V 35 Unicore CE Architecture. Architecture diagram: the UNICOREPro client (runtime interface, job preparation/control, plugins) or the Arcon client toolkit prepares and controls jobs using the user certificate and a UNICORE site list; AJO/UPL requests travel across the unsafe Internet (SSL), through optional firewalls, to the UNICORE Gateway of a site (e.g. FZJ), which authenticates the user; on the safe intranet (TCP) the Network Job Supervisor (NJS) performs user mapping, job incarnation and job scheduling using the IDB and UUDB, and passes incarnated jobs and status requests to a Target System Interface (TSI) per target machine (Blade, SV1, …), which hands commands and files to the batch subsystem of any cluster management system; jobs and data can also be transferred to other UNICORE sites. Graphic from: Dave Snelling, Fujitsu Labs Europe, “Unicore Technology”, Grid School July 2003

  36. Grid Middleware V 36 Unicore programming model • Abstract Job Object (AJO) • Collection of classes representing Grid functions • Encoded as Java objects (XML encoding possible) • Where to build AJOs • Pallas client GUI - the user’s view • Client plugins - Grid deployer • Arcon client toolkit - hard core • What can’t the AJO do • Application-level meta-computing • ??? from: Dave Snelling, Fujitsu Labs Europe, “Unicore Technology”, Grid School July 2003

  37. Grid Middleware V 37 Interfacing to the local system • Incarnation Data Base (IDB) • Maps abstract representation to concrete jobs • Includes resource description • Prototype auto-generation from MDS • Target System Interface (TSI) • Perl interface to host platform • Very small system-specific module for easy porting • Current: NQS (several versions), PBS, LoadLeveler, UNICOS, Linux, Solaris, MacOSX, PlayStation-2 • Condor: under development (& probably done by now) from: Dave Snelling, Fujitsu Labs Europe, “Unicore Technology”, Grid School July 2003

  38. Resource Representation CE attributes obtaining metrics GLUE CE

  39. Grid Middleware V 39 Describing a CE • Balance between completeness and timeliness • Some useful metrics almost impossible to obtain • ‘when will this job of mine be finished if I submit now?’ cannot be answered! • depends on system load • need to predict runtime for already running & queued jobs • simultaneous submission in a non-FIFO scheduling model (e.g. fair share, priorities, pre-emption &c)

  40. Grid Middleware V 40 GlueCE: a ‘resource description’ viewpoint From: the GLUE Information Model version 1.2, see document for details

  41. Grid Middleware V 41 Through the Glue Schema: Cluster Info • Performance info: SI2k, SF2k • Max wall time, CPU time: seconds; together these determine if a job completes in time • but clusters are not homogeneous • solve at the local end (scale max{CPU,wall} time on each node to the system speed; see the worked example below) CAVEAT: when doing cross-cluster grid-wide scheduling, this can make you choose the wrong resource entirely! • solve (i.e. multiply) at the broker end, but now you need a way to determine on which subcluster your job will run… oops.
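A worked example of the local-end scaling, as a Python sketch; the reference speed and the per-node SI2k figures are invented for illustration:

```python
REFERENCE_SI2K = 1000   # assumed speed that the published queue limits refer to

def scaled_cpu_limit(published_limit_s: float, node_si2k: float) -> float:
    """Scale the published max CPU time to a node's actual speed:
    a slower node needs proportionally more CPU seconds for the same work."""
    return published_limit_s * REFERENCE_SI2K / node_si2k

# A 30-hour CPU limit published for the 1000 SI2k reference node:
print(scaled_cpu_limit(30 * 3600, node_si2k=500))    # slow node  -> 216000 s (60 h)
print(scaled_cpu_limit(30 * 3600, node_si2k=2000))   # fast node  ->  54000 s (15 h)
```

This is exactly the caveat on the slide: a broker that only sees the published 30-hour figure cannot tell whether a given job will actually fit on the 500 SI2k or the 2000 SI2k subcluster.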

  42. Grid Middleware V 42 Cluster Info: total, free and max JobSlots • FreeJobSlots is the wrong metric to use for scheduling (a good cluster is always 100% full) • these metrics may be VO, user and job dependent • if a cluster has free CPUs, that does not mean that you can use them… • even if there are thousands of waiting jobs, you might still get to the front of the queue because of your priority or fair-share

  43. Grid Middleware V 43 Cluster info: ERT and WRT • Estimated/worst response time • when will my job start to run if I submit now? • Impossible to pre-determine in case of simultaneous submissions • The best one can do is estimate • Possible approaches • simulation – good but very, very slow: “Predicting Job Start Times on Clusters”, Hui Li et al., 2004 • historical comparisons • template approach – need to discover the proper template • look for ‘similar system states’ in the past (sketched below) • learning approach – adapt the estimation algorithm to the actual load and ‘learn’ the best approach • see the many other papers by Hui Li, bundle on Blackboard!
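A toy Python sketch of the template/historical idea: record past (queue, backlog, observed wait) states and average the waits of ‘similar system states’. The sample history and the similarity criterion are invented; real estimators such as the cited work are far more sophisticated:

```python
from statistics import mean
from typing import Optional

history = [
    # (queue, jobs ahead at submission, observed wait in seconds) -- invented samples
    ("qlong", 10, 1800), ("qlong", 12, 2400), ("qlong", 50, 14400),
    ("qshort", 3, 120),  ("qshort", 5, 300),
]

def estimate_wait(queue: str, jobs_ahead: int, tolerance: int = 5) -> Optional[float]:
    """Template approach: find past states with the same queue and a comparable
    backlog, and return the mean of their observed waits (None if no template fits)."""
    similar = [w for (q, n, w) in history
               if q == queue and abs(n - jobs_ahead) <= tolerance]
    return mean(similar) if similar else None

print(estimate_wait("qlong", jobs_ahead=11))   # -> 2100.0 seconds
```

The hard part, as the slide notes, is discovering which attributes make two system states genuinely ‘similar’ under fair-share and priority scheduling; the learning approach tries to adapt that choice to the observed load.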

  44. Brokering

  45. Grid Middleware V 45 Brokering models • All current grid broker systems use global brokering • consider all known resources when matching requests • brokering takes longer as the system grows • Models • ‘bubble-to-the-top’ information-system based • current Condor-G, gLite WMS • Ask the world for bids • Unicore Broker

  46. Grid Middleware V 46 Some grid brokers • Condor-G • uses Condor schedd (matchmaker) to match resources • a Condor submitter has a number of backends to talk to different CEs (GT2, GT4-GRAM, Condor (flocking)) • supports DAG workflows • schedd is ‘close’ to the user • gLite WMS • separation between broker (based on Condor-G) and the UI • additional Logging and Bookkeeping (generic, but actually only used for the WMS) • does job-data co-location scheduling

  47. Grid Middleware V 47 Grid brokers (contd.) • Nimrod-G • parameter sweep engine • cycles through a static list of resources • automatically inspects the job output and uses that to drive further automatic job submission • minimisation methods like simulated annealing built in • Unicore broker • based on a pricing model • asks for bids from resources • no large information system full of (mostly unused) resource data is needed; instead, bids are requested from all resources for every job • this shifts, but does nothing to resolve, the info-system explosion

  48. Grid Middleware V 48 Alternative brokering • Alternatives could be ‘P2P-style’ brokering, as sketched below • look in the ‘neighbourhood’ for ‘reasonable’ matches; if none are found, give the task to a peer super-scheduler • scheduler only considers ‘close’ resources (it has no global knowledge) • job submission pattern may or may not follow the brokering pattern • if it does, it needs recursive delegation for job submission, which opens the door for worms and trojans • trust is not very transitive (this is not a problem when sharing ‘public’ files, such as in the popular P2P file-sharing applications)
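A schematic Python sketch of such neighbourhood brokering with delegation to a peer; the class, attribute names and resource data are invented, and a real P2P scheduler would add discovery, trust decisions and loop avoidance:

```python
class NeighbourhoodBroker:
    """Sketch of 'P2P-style' brokering: only local resources are known; if none
    match, the request is handed on to a peer scheduler with a hop limit."""

    def __init__(self, name, local_resources, peers=None):
        self.name = name
        self.local_resources = local_resources     # list of resource dicts
        self.peers = peers or []                   # other NeighbourhoodBrokers

    def broker(self, requirements, max_hops=3):
        for r in self.local_resources:
            if requirements(r):
                return (self.name, r)              # a 'reasonable' local match
        if max_hops > 0:
            for peer in self.peers:
                hit = peer.broker(requirements, max_hops - 1)
                if hit:
                    return hit                     # delegated to a peer super-scheduler
        return None

site_a = NeighbourhoodBroker("A", [{"arch": "mips"}])
site_b = NeighbourhoodBroker("B", [{"arch": "x86_64"}])
site_a.peers = [site_b]
print(site_a.broker(lambda r: r["arch"] == "x86_64"))   # -> ('B', {'arch': 'x86_64'})
```

The security concern on the slide shows up in the last line: if the job itself followed the same path, site A would be submitting on the user's behalf to site B, which requires recursive delegation of the user's credential.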

  49. Grid Middleware V 49 Broker detailed example: gLite WMS • Job services in the gLite architecture • Computing Element (just discussed) • Workload Management System (brokering, submission control) • Accounting (for EGEE this comes in two flavours: per site or per user) • Job Provenance (to be done) • Package management (to be done) • continuous matchmaking solution • persistent list of pending jobs, waiting for matching resources (see the sketch below) • the WMS task is akin to what the resources themselves do in the Unicore bidding model
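The continuous-matchmaking idea, a persistent task queue periodically re-matched against a cached resource view (the ‘information supermarket’), can be sketched as follows. This is an invented illustration, not WMS code; the CE attributes and job requirements are made up:

```python
import time

task_queue = [
    {"id": 1, "requirements": lambda ce: ce["free_slots"] > 0 and ce["vo"] == "atlas"},
    {"id": 2, "requirements": lambda ce: ce["max_walltime_h"] >= 48},
]

def information_supermarket():
    """Stand-in for the cached resource view; in reality fed by the information system."""
    return [
        {"name": "ce01", "vo": "atlas", "free_slots": 4, "max_walltime_h": 36},
        {"name": "ce02", "vo": "lhcb",  "free_slots": 0, "max_walltime_h": 72},
    ]

def matchmaking_pass():
    """One matchmaking cycle: jobs stay in the task queue until some CE matches."""
    ces = information_supermarket()
    for task in list(task_queue):
        for ce in ces:
            if task["requirements"](ce):
                print(f"job {task['id']} -> {ce['name']}")
                task_queue.remove(task)
                break

for cycle in range(10):          # bounded here; a real broker loops continuously
    matchmaking_pass()
    if not task_queue:
        break
    time.sleep(1)
```

Jobs that find no match simply remain in the queue and are reconsidered on the next pass as the cached resource view changes, which is the essential difference from a one-shot, submit-time match.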

  50. Grid Middleware V 50 gLite LCG Architecture Overview: the Resource Broker node (Workload Manager, WM). Architecture diagram: the UI submits via the Network Server to the WMS services – Workload Manager, Match Maker, Task Queue, Information Supermarket, Job Adapter and Job Controller (CondorG) – which consult the Information System and the Replica Catalog, record job status in Logging & Bookkeeping, and submit through the grid interface of a Computing Element (LRMS), alongside a Storage Element. Slide from the EGEE Project, see www.eu-egee.org and www.glite.org
