Job Delegation and Planning in Condor-G. ISGC 2005, Taipei, Taiwan. The Condor Project (Established '85): Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff and students.
Job Delegation and Planning in Condor-G • ISGC 2005, Taipei, Taiwan
The Condor Project (Established '85) Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff and students who: • face software engineering challenges in a distributed UNIX/Linux/NT environment, • are involved in national and international grid collaborations, • actively interact with academic and commercial users, • maintain and support large distributed production environments, • and educate and train students. Funding: US Govt. (DoD, DoE, NASA, NSF, NIH), AT&T, IBM, Intel, Microsoft, UW-Madison, …
A Multifaceted Project • Harnessing the power of clusters, dedicated and/or opportunistic (Condor) • Job management services for Grid applications (Condor-G, Stork) • Fabric management services for Grid resources (Condor, GlideIns, NeST) • Distributed I/O technology (Parrot, Kangaroo, NeST) • Job-flow management (DAGMan, Condor, Hawk) • Distributed monitoring and management (HawkEye) • Technology for Distributed Systems (ClassAd, MW) • Packaging and Integration (NMI, VDT)
Some software produced by the Condor Project • Condor System • ClassAd Library • DAGMan • Fault Tolerant Shell (FTSH) • HawkEye • GCB • MW • NeST • Stork • Parrot • VDT • And others… all as open source
Who uses Condor? • Commercial • Oracle, Micron, Hartford Life Insurance, CORE, Xerox, ExxonMobil, Shell, Alterra, Texas Instruments, … • Research Community • Universities, Govt Labs • Bundles: NMI, VDT • Grid Communities: EGEE/LCG/gLite, Particle Physics Data Grid (PPDG), USCMS, LIGO, iVDGL, NSF Middleware Initiative GRIDS Center, …
[Diagram: Condor Pool. A MatchMaker, Schedds holding queues of jobs, and Startds]
[Diagram: Condor-G. A Schedd holding jobs, with the back ends it can delegate to: Condor-C, LSF, PBS, Globus 2, Globus 4, Unicore, (NorduGrid); also shown: a remote Schedd and a Startd]
[Diagram: layered stack. User/Application/Portal • Condor-G • Middleware (Globus 2, Globus 4, Unicore, …) • Condor Pool • Grid Fabric (processing, storage, communication)]
Atomic/Durable Job Delegation • Transfer of responsibility to schedule and execute a job • Stage in executable and data files • Transfer policy “instructions” • Securely transfer (and refresh?) credentials, obtain local identities • Monitor and present job progress (transparency!) • Return results • Multiple delegations can be combined in interesting ways
Simple Job Delegation in Condor-G [Diagram: Condor-G, then Globus GRAM, then Batch System Front-end, then Execute Machine]
Expanding the Model • What can we do with new forms of job delegation? • Some ideas • Mirroring • Load-balancing • Glide-in schedd, startd • Multi-hop grid scheduling
Mirroring • What it does • Jobs mirrored on two Condor-Gs • If primary Condor-G crashes, secondary one starts running jobs • On recovery, primary Condor-G gets job status from secondary one • Removes Condor-G submit point as single point of failure
Mirroring Example [Diagram: Condor-G 1 and Condor-G 2 both hold the jobs; when one Condor-G fails (marked X), the other keeps the jobs running on the Execute Machine]
Load-Balancing • What it does • Front-end Condor-G distributes all jobs among several back-end Condor-Gs • Front-end Condor-G keeps updated job status • Improves scalability • Maintains single submit point for users
Load-Balancing Example [Diagram: Condor-G Front-end connected to Condor-G Back-end 1, Condor-G Back-end 2, and Condor-G Back-end 3]
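The slides do not say how the front-end chooses a back-end. As a minimal sketch, assuming a least-loaded policy (an assumption, as are the names `distribute`, `b1`, `j1`, and so on), the front-end could assign jobs like this while keeping its own record of where each job went:

```python
def distribute(jobs, backends):
    """Assign each job to the back-end that currently holds the fewest jobs.

    The least-loaded policy is an assumption made for this sketch; the
    slides only say that jobs are distributed among the back-ends.
    """
    load = {backend: 0 for backend in backends}
    assignment = {}  # the front-end's record: job -> back-end
    for job in jobs:
        # min() returns the first least-loaded back-end, so ties go in list order.
        target = min(backends, key=lambda backend: load[backend])
        assignment[job] = target
        load[target] += 1
    return assignment

placement = distribute(["j1", "j2", "j3", "j4"], ["b1", "b2", "b3"])
```

Because the front-end keeps the `assignment` map, users still see a single submit point and can ask it where any job went.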
Glide-In • Schedd and Startd are separate services that do not require any special privileges • Thus we can submit them as jobs! • Glide-In Schedd • What it does • Drop a Condor-G onto the front-end machine of a remote cluster • Delegate jobs to the cluster through the glide-in schedd • Can apply cluster-specific policies to jobs • Not fork-and-forget… • Send a manager to the site, instead of managing across the internet
Glide-In Schedd Example [Diagram: Condor-G with jobs; on the remote Frontend: Middleware, a Glide-In Schedd with jobs, and the Batch System]
Glide-In Startd Example [Diagram: Condor-G (Schedd); Frontend with Middleware and Batch System; a Startd running the Job]
Glide-In Startd • Why? • Restores all the benefits that may have been washed away by the middleware • End-to-end management solution • Preserves job semantic guarantees • Preserves policy • Enables lazy planning
Sample Job Submit file
  universe = grid
  grid_type = gt2
  globusscheduler = cluster1.cs.wisc.edu/jobmanager-lsf
  executable = find_particle
  arguments = ….
  output = ….
  log = …
But we want metascheduling…
Represent grid clusters as ClassAds • ClassAds • are a set of uniquely named expressions; each named expression is called an attribute (a name/value pair) • combine query and data • extensible • semi-structured: no fixed schema (flexibility in an environment consisting of distributed administrative domains) • Designed for “MatchMaking”
Example of a ClassAd that could represent a compute cluster in a grid:
  Type = "GridSite";
  Name = "FermiComputeCluster";
  Arch = "Intel-Linux";
  Gatekeeper_url = "globus.fnal.gov/lsf";
  Load = [ QueuedJobs = 42; RunningJobs = 200; ];
  Requirements = ( other.Type == "Job" && Load.QueuedJobs < 100 );
  GoodPeople = { "howard", "harry" };
  Rank = member(other.Owner, GoodPeople) * 500;
Another Sample - Job Submit
  universe = grid
  grid_type = gt2
  owner = howard
  executable = find_particle.$$(Arch)
  requirements = other.Arch == "Intel-Linux" || other.Arch == "Sparc-Solaris"
  rank = 0 - other.Load.QueuedJobs
  globusscheduler = $$(gatekeeper_url)
  …
Note: We introduced augmentation of the job ClassAd based upon information discovered in its matching resource ClassAd.
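To make the matching and the `$$(...)` augmentation concrete, here is a rough Python sketch. It is not the ClassAd library: ads are plain dicts, `Requirements` and `Rank` are Python callables standing in for ClassAd expressions, and the helper names `match` and `substitute` as well as the second site `BusyCluster` are invented for illustration.

```python
import re

def match(job, sites):
    """Pick the mutually acceptable site that the job ranks highest, or None."""
    acceptable = [
        site for site in sites
        if job["Requirements"](job, site) and site["Requirements"](site, job)
    ]
    return max(acceptable, key=lambda site: job["Rank"](job, site), default=None)

def substitute(text, site):
    """Replace $$(Attr) with the matched site's attribute (names are case-insensitive)."""
    attrs = {name.lower(): value for name, value in site.items()}
    return re.sub(r"\$\$\((\w+)\)", lambda m: str(attrs[m.group(1).lower()]), text)

fermi = {
    "Type": "GridSite", "Name": "FermiComputeCluster", "Arch": "Intel-Linux",
    "Gatekeeper_url": "globus.fnal.gov/lsf",
    "Load": {"QueuedJobs": 42, "RunningJobs": 200},
    # Mirrors: other.Type == "Job" && Load.QueuedJobs < 100
    "Requirements": lambda my, other: other["Type"] == "Job" and my["Load"]["QueuedJobs"] < 100,
}
# Hypothetical second site, invented for this example: its queue is too long to accept jobs.
busy = dict(fermi, Name="BusyCluster", Gatekeeper_url="gk.example.org/pbs",
            Load={"QueuedJobs": 150, "RunningJobs": 300})
job = {
    "Type": "Job", "Owner": "howard",
    "Requirements": lambda my, other: other["Arch"] in ("Intel-Linux", "Sparc-Solaris"),
    # Mirrors: rank = 0 - other.Load.QueuedJobs (prefer shorter queues)
    "Rank": lambda my, other: 0 - other["Load"]["QueuedJobs"],
}

chosen = match(job, [busy, fermi])
scheduler = substitute("$$(gatekeeper_url)", chosen)
executable = substitute("find_particle.$$(Arch)", chosen)
```

Both sides get a veto (each ad's `Requirements` is evaluated against the other), and only after a match exists are the placeholders in the job filled in from the resource ad.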
Multi-Hop Grid Scheduling • Match a job to a Virtual Organization (VO), then to a resource within that VO • Easier to schedule jobs across multiple VOs and grids
Multi-Hop Grid Scheduling Example [Diagram: Experiment Condor-G with Experiment Resource Broker, then VO Condor-G with VO Resource Broker, then Globus GRAM, then Batch Scheduler; labels shown: HEP, CMS]
Endless Possibilities • These new models can be combined with each other or with other new models • Resulting system can be arbitrarily sophisticated
Job Delegation Challenges • New complexity introduces new issues and exacerbates existing ones • A few… • Transparency • Representation • Scheduling Control • Active Job Control • Revocation • Error Handling and Debugging
Transparency • Full information about job should be available to user • Information from full delegation path • No manual tracing across multiple machines • Users need to know what’s happening with their jobs
Representation • Job state is a vector • How best to show this to user • Summary • Current delegation endpoint • Job state at endpoint • Full information available if desired • Series of nested ClassAds?
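One way to picture the "series of nested ClassAds" idea is a nested record where each hop embeds the record of the hop it delegated to. The sketch below uses Python dicts in place of ClassAds; the attribute names `Host`, `State`, and `DelegatedTo`, and the host names, are hypothetical.

```python
def summarize(ad):
    """Walk the nested records and report the delegation path and endpoint state."""
    path = []
    while True:
        path.append(ad["Host"])
        if "DelegatedTo" not in ad:
            # Innermost record: this is the current delegation endpoint.
            return {"path": path, "endpoint": ad["Host"], "state": ad["State"]}
        ad = ad["DelegatedTo"]

# Hypothetical job record: each hop embeds the record of the next hop.
job_ad = {
    "Host": "condor-g.submit", "State": "Delegated",
    "DelegatedTo": {
        "Host": "gram.frontend", "State": "Delegated",
        "DelegatedTo": {"Host": "batch.cluster", "State": "Running"},
    },
}
summary = summarize(job_ad)
```

The summary gives the user the endpoint and its state at a glance, while the full nested record stays available for anyone who wants every hop.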
Scheduling Control • Avoid loops in delegation path • Give user control of scheduling • Allow limiting of delegation path length? • Allow user to specify part or all of delegation path
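A minimal sketch of the two checks named above, loop avoidance and an optional cap on path length. The function name `can_delegate`, the `max_path_len` parameter, and the scheduler names are assumptions for illustration.

```python
def can_delegate(path, next_hop, max_path_len=None):
    """Decide whether a job that has travelled `path` may be handed to `next_hop`.

    `path` lists every system that has held the job so far, in order.
    """
    if next_hop in path:
        return False  # handing the job back to an earlier holder would form a loop
    if max_path_len is not None and len(path) + 1 > max_path_len:
        return False  # the user capped how long the delegation path may grow
    return True

path = ["experiment-condor-g", "vo-condor-g"]
```

For this to work, each delegation has to carry the path so far along with the job, which is the same information the transparency requirement asks for.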
Active Job Control • User may request certain actions • hold, suspend, vacate, checkpoint • Actions cannot be completed synchronously for user • Must forward along delegation path • User checks completion later
Active Job Control (cont) • Endpoint systems may not support actions • If possible, execute them at furthest point that does support them • Allow user to apply action in middle of delegation path
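The "furthest point that supports the action" rule can be sketched as a search from the endpoint back toward the user. The hop names and the `supports` sets below are made up for the example.

```python
def furthest_supporting(path, action):
    """Return the name of the hop furthest along the path that supports `action`, or None."""
    for hop in reversed(path):  # start at the endpoint and walk back toward the user
        if action in hop["supports"]:
            return hop["name"]
    return None

# Hypothetical delegation path, listed from the submit point to the endpoint.
delegation_path = [
    {"name": "condor-g", "supports": {"hold", "suspend", "vacate", "checkpoint"}},
    {"name": "gram", "supports": {"hold"}},
    {"name": "batch", "supports": {"hold", "suspend"}},
]
```

Here a `suspend` request would be forwarded all the way to `batch`, while a `checkpoint` request could only be carried out at `condor-g`.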
Revocation • Leases • Lease must be renewed periodically for delegation to remain valid • Allows revocation during long-term failures • What are good values for lease lifetime and update interval?
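A lease can be sketched as an expiry time that the delegating side must keep pushing forward. The class below is illustrative only; the `Lease` name, the choice of seconds, and the rule that an expired lease cannot be renewed are assumptions, not Condor-G behavior.

```python
class Lease:
    """A delegation stays valid only while its lease is renewed in time."""

    def __init__(self, lifetime, granted_at):
        self.lifetime = lifetime                 # seconds one renewal is good for
        self.expires_at = granted_at + lifetime

    def is_valid(self, now):
        return now < self.expires_at

    def renew(self, now):
        """Extend the lease; refuse if it has already expired (assumed policy)."""
        if not self.is_valid(now):
            return False
        self.expires_at = now + self.lifetime
        return True

lease = Lease(lifetime=600, granted_at=0)
renewed_in_time = lease.renew(now=300)   # valid until t = 900
renewed_too_late = lease.renew(now=1000)
```

If the delegating side is unreachable for longer than `lifetime`, the lease lapses without any message being exchanged, which is how revocation works during long-term failures. A short lifetime revokes faster but needs more frequent renewals.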
Error Handling and Debugging • Many more places for things to go horribly wrong • Need clear, simple error semantics • Logs, logs, logs • Have them everywhere
From earlier • Transfer of responsibility to schedule and execute a job • Transfer policy “instructions” • Stage in executable and data files • Securely transfer (and refresh?) credentials, obtain local identities • Monitor and present job progress (transparency!) • Return results
Job Failure Policy Expressions • Condor/Condor-G augmented so users can supply job failure policy expressions in the submit file. • Can be used to describe a successful run, or what to do in the face of failure.
  on_exit_remove = <expression>
  on_exit_hold = <expression>
  periodic_remove = <expression>
  periodic_hold = <expression>
Job Failure Policy Examples
• Do not remove from queue (i.e. reschedule) if exits with a signal:
  on_exit_remove = ExitBySignal == False
• Place on hold if exits with nonzero status or ran for less than an hour:
  on_exit_hold = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600)
• Place on hold if job has spent more than 50% of its time suspended:
  periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
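To show how such expressions steer a job after it exits, here is a small Python model. It is a sketch, not Condor's evaluator: the policies are Python callables, the "nonzero status" case tests an `ExitCode` attribute, and checking hold before remove is an assumption of this sketch.

```python
def on_exit_action(job, on_exit_hold, on_exit_remove):
    """Decide what happens to a job that has just exited."""
    if on_exit_hold(job):        # assumed order: hold is checked first
        return "hold"
    if on_exit_remove(job):
        return "remove"          # the run counts as finished; leave the queue
    return "requeue"             # stay in the queue and run again

# Hold if the job exited on its own with a nonzero status.
hold_policy = lambda job: (not job["ExitBySignal"]) and job["ExitCode"] != 0
# Remove only if the job was not killed by a signal.
remove_policy = lambda job: not job["ExitBySignal"]

clean_exit = {"ExitBySignal": False, "ExitCode": 0}
bad_exit = {"ExitBySignal": False, "ExitCode": 1}
killed = {"ExitBySignal": True, "ExitSignal": 11}
```

With these policies a clean exit leaves the queue, a nonzero exit is held for the user to inspect, and a job killed by a signal is rescheduled.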
Data Placement* (DaP) must be an integral part of the end-to-end solution. (* Space management and data transfer)
Stork • A scheduler for data placement activities in the Grid • What Condor is for computational jobs, Stork is for data placement • Stork comes with a new concept: “Make data placement a first class citizen in the Grid.”
[Diagram: the three-step flow Stage-in • Execute the job • Stage-out expands to: Allocate space for input & output data • Stage-in • Execute the job • Release input space • Stage-out • Release output space. Legend: Data Placement Jobs, Computational Jobs]
DAG with DaP
DAG specification:
  DaP A A.submit
  DaP B B.submit
  Job C C.submit
  …..
  Parent A child B
  Parent B child C
  Parent C child D, E
  …..
[Diagram: DAGMan reads the DAG specification (nodes A to F) and feeds the Condor Job Queue and the Stork Job Queue]
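The Parent/child lines of the DAG specification define an execution order. The sketch below derives that order with Python's standard `graphlib` module (Python 3.9+). It is not how DAGMan is implemented, and it only reads the three dependency lines visible on the slide; the helper name `dag_order` is invented.

```python
from graphlib import TopologicalSorter

def dag_order(parent_lines):
    """Return one valid execution order for lines like 'Parent C child D, E'."""
    sorter = TopologicalSorter()
    for line in parent_lines:
        words = line.replace(",", " ").split()
        split_at = [word.lower() for word in words].index("child")
        parents, children = words[1:split_at], words[split_at + 1:]
        for child in children:
            sorter.add(child, *parents)  # a child may run only after all its parents
    return list(sorter.static_order())

order = dag_order([
    "Parent A child B",
    "Parent B child C",
    "Parent C child D, E",
])
```

A, B, and C come out in that order; D and E both depend only on C, so either may come first.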
Why Stork? • Stork understands the characteristics and semantics of data placement jobs. • Can make smart scheduling decisions, for reliable and efficient data placement.
Failure Recovery and Efficient Resource Utilization • Fault tolerance • Just submit a bunch of data placement jobs, and then go away… • Control number of concurrent transfers from/to any storage system • Prevents overloading • Space allocation and deallocation • Make sure space is available
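Limiting concurrent transfers per storage system can be sketched as a simple admission check. This is an illustration, not Stork's scheduler; the function name `admit`, the `max_per_host` parameter, and the host names are assumptions.

```python
from collections import Counter

def admit(pending, active, max_per_host):
    """Split pending transfers into those that can start now and those that must wait.

    Each transfer is a (source_host, destination_host) pair. A transfer counts
    against the limit of both hosts it touches.
    """
    load = Counter()
    for src, dst in active:
        load[src] += 1
        load[dst] += 1
    started, waiting = [], []
    for src, dst in pending:
        if load[src] < max_per_host and load[dst] < max_per_host:
            load[src] += 1
            load[dst] += 1
            started.append((src, dst))
        else:
            waiting.append((src, dst))  # retried once a running transfer finishes
    return started, waiting

started, waiting = admit(
    pending=[("a", "b"), ("a", "c"), ("d", "e")],
    active=[],
    max_per_host=1,
)
```

The second transfer waits because host `a` is already busy, while the unrelated transfer from `d` to `e` starts right away.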
Support for Heterogeneity Protocol translation using Stork memory buffer.
Support for Heterogeneity Protocol translation using Stork Disk Cache.
Flexible Job Representation and Multilevel Policy Support
[
  Type = "Transfer";
  Src_Url = "srb://ghidorac.sdsc.edu/kosart.condor/x.dat";
  Dest_Url = "nest://turkey.cs.wisc.edu/kosart/x.dat";
  ……
  ……
  Max_Retry = 10;
  Restart_in = "2 hours";
]
Run-time Adaptation • Dynamic protocol selection
[
  dap_type = "transfer";
  src_url = "drouter://slic04.sdsc.edu/tmp/test.dat";
  dest_url = "drouter://quest2.ncsa.uiuc.edu/tmp/test.dat";
  alt_protocols = "nest-nest, gsiftp-gsiftp";
]
[
  dap_type = "transfer";
  src_url = "any://slic04.sdsc.edu/tmp/test.dat";
  dest_url = "any://quest2.ncsa.uiuc.edu/tmp/test.dat";
]
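The `alt_protocols` attribute suggests a fallback loop: try the protocols named in the URLs first, then each alternative pair in turn. The sketch below illustrates that idea only. The helper names are invented, and `try_transfer` is a stand-in callable supplied by the caller, not a Stork API.

```python
def parse_alt_protocols(spec):
    """Turn 'nest-nest, gsiftp-gsiftp' into [('nest', 'nest'), ('gsiftp', 'gsiftp')]."""
    return [tuple(pair.strip().split("-")) for pair in spec.split(",")]

def first_working(primary, alt_spec, try_transfer):
    """Try the primary (source, destination) protocol pair, then each alternative in order.

    `try_transfer(src_protocol, dst_protocol)` is a caller-supplied function
    that returns True on success. Returns the pair that worked, or None.
    """
    for src_protocol, dst_protocol in [primary] + parse_alt_protocols(alt_spec):
        if try_transfer(src_protocol, dst_protocol):
            return (src_protocol, dst_protocol)
    return None

# Simulated environment for the example: only gsiftp transfers succeed.
only_gsiftp = lambda src_protocol, dst_protocol: src_protocol == "gsiftp"
used = first_working(("drouter", "drouter"), "nest-nest, gsiftp-gsiftp", only_gsiftp)
```

In this simulated run the `drouter` and `nest` attempts fail and the transfer completes over `gsiftp`, without the user resubmitting anything.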