Using New Features in Condor 7.2
Outline • Startd Hooks • Job Router • Job Router Hooks • Power Management • Dynamic Slot Partitioning • Concurrency Limits • Variable Substitution • Preemption Attributes
Startd Job Hooks • Users wanted to take advantage of Condor’s resource management daemon (condor_startd) to run jobs, but they had their own scheduling system. • Specialized scheduling needs • Jobs live in their own database or other storage rather than a Condor job queue
Our solution • Make a system of generic “hooks” that you can plug into: • A hook is a point during the life-cycle of a job where the Condor daemons will invoke an external program • Hook Condor to your existing job management system without modifying the Condor code
How does Condor communicate with hooks? • Passing around ASCII ClassAds via standard input and standard output • Some hooks get control data via a command-line argument (argv) • Hooks can be written in any language (scripts, binaries, whatever you want) so long as you can read/write Stdin/out
What hooks are available? • Hooks for fetching work (startd): • FETCH_WORK • REPLY_FETCH • EVICT_CLAIM • Hooks for running jobs (starter): • PREPARE_JOB • UPDATE_JOB_INFO • JOB_EXIT
HOOK_FETCH_WORK • Invoked by the startd whenever it wants to try to fetch new work • The FetchWorkDelay expression controls how long the startd waits between attempts • Stdin: slot ClassAd • Stdout: job ClassAd • If Stdout is empty, there’s no work
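A fetch hook along these lines could be sketched in Python as follows. This is a minimal sketch, not Condor's own code: it assumes the ClassAds arrive as plain "Attr = value" lines, that the slot ad carries a Memory attribute, and that `pending_jobs` is a hypothetical list pulled from your own job store. The job attributes written out (Cmd, Owner) are illustrative and may not be the full set the startd needs.

```python
import sys


def parse_classad(text):
    """Parse old-style ClassAd text (one "Attr = value" per line) into a dict.

    Values are kept as raw strings; no ClassAd expression evaluation is done.
    """
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip():
            ad[name.strip()] = value.strip()
    return ad


def format_classad(ad):
    """Serialize a dict back into "Attr = value" lines."""
    return "".join(f"{name} = {value}\n" for name, value in ad.items())


def fetch_work(slot_ad, pending_jobs):
    """Return a job ad for the first pending job that fits the slot, else None."""
    slot_memory = int(slot_ad.get("Memory", "0"))
    for job in pending_jobs:
        if job["memory_mb"] <= slot_memory:
            # Attribute choice is illustrative, not a verified minimum set.
            return {"Cmd": f'"{job["cmd"]}"', "Owner": f'"{job["owner"]}"'}
    return None


def run_hook(pending_jobs, stdin=sys.stdin, stdout=sys.stdout):
    """Hook body: slot ad in on stdin, job ad (or nothing) out on stdout."""
    job_ad = fetch_work(parse_classad(stdin.read()), pending_jobs)
    if job_ad is not None:
        stdout.write(format_classad(job_ad))
```

A real hook script would end by calling `run_hook(...)` with jobs loaded from the external system. Writing nothing to stdout is how it signals that there is no work.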
HOOK_REPLY_FETCH • Invoked by the startd once it decides what to do with the job ClassAd returned by HOOK_FETCH_WORK • Gives your external system a chance to know what happened • argv[1]: “accept” or “reject” • Stdin: slot and job ClassAds • Stdout: ignored
HOOK_EVICT_CLAIM • Invoked if the startd has to evict a claim that’s running fetched work • Informational only: you can’t stop or delay this train once it’s left the station • Stdin: both slot and job ClassAds • Stdout: ignored
HOOK_PREPARE_JOB • Invoked by the condor_starter when it first starts up (only if defined) • Opportunity to prepare the job execution environment • Transfer input files, executables, etc. • Stdin: both slot and job ClassAds • Stdout: ignored, but starter won’t continue until this hook exits • Not specific to fetched work
HOOK_UPDATE_JOB_INFO • Periodically invoked by the starter to let you know what’s happening with the job • Stdin: slot and job ClassAds • Job ClassAd is updated with additional attributes computed by the starter: • ImageSize, JobState, RemoteUserCpu, etc. • Stdout: ignored
HOOK_JOB_EXIT • Invoked by the starter whenever the job exits for any reason • argv[1] indicates what happened: • “exit”: Died a natural death • “evict”: Booted off prematurely by the startd (PREEMPT == TRUE, condor_off, etc.) • “remove”: Removed by condor_rm • “hold”: Held by condor_hold
HOOK_JOB_EXIT … • “HUH!?! condor_rm? What are you talking about?” • The starter hooks can be defined even for regular Condor jobs, local universe, etc. • Stdin: copy of the job ClassAd with extra attributes about what happened: • ExitCode, JobDuration, etc. • Stdout: ignored
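The four argv[1] values can drive a simple dispatch inside the exit hook. The sketch below assumes an external job database with its own status names; those names and the `EXIT_STATUS` table are hypothetical, and only the four reason strings come from the slides.

```python
# Hypothetical mapping from the hook's argv[1] to a status in an external
# job database. Only the keys are taken from the hook interface.
EXIT_STATUS = {
    "exit": "finished",    # job ran to completion
    "evict": "requeue",    # booted off early, so run it again later
    "remove": "cancelled", # removed with condor_rm
    "hold": "on_hold",     # held with condor_hold
}


def status_for_exit(argv):
    """Translate the hook's command line into an external status string."""
    if len(argv) < 2:
        raise ValueError("expected the exit reason in argv[1]")
    reason = argv[1]
    if reason not in EXIT_STATUS:
        raise ValueError(f"unknown exit reason: {reason!r}")
    return EXIT_STATUS[reason]
```

A hook script would call `status_for_exit(sys.argv)` and then record the result together with attributes such as ExitCode read from the job ClassAd on stdin.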
Defining hooks • Each slot can have its own hook “keyword” • Prefix for config file parameters • Can use different sets of hooks to talk to different external systems on each slot • Global keyword used when the per-slot keyword is not defined • Keyword is inserted by the startd into its copy of the job ClassAd and given to the starter
Defining hooks: example
# Most slots fetch work from the database system
STARTD_JOB_HOOK_KEYWORD = DATABASE
# Slot4 fetches and runs work from a web service
SLOT4_JOB_HOOK_KEYWORD = WEB
# The database system needs to both provide work and
# know the reply for each attempted claim
DB_DIR = /usr/local/condor/fetch/db
DATABASE_HOOK_FETCH_WORK = $(DB_DIR)/fetch_work.php
DATABASE_HOOK_REPLY_FETCH = $(DB_DIR)/reply_fetch.php
# The web system only needs to fetch work
WEB_DIR = /usr/local/condor/fetch/web
WEB_HOOK_FETCH_WORK = $(WEB_DIR)/fetch_work.php
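The keyword lookup described above (per-slot keyword first, global keyword as the fallback, keyword used as a parameter prefix) can be sketched in Python. This illustrates the lookup rule only, not how the startd implements it; the `config` dict stands in for an already-expanded configuration and is an assumption of the sketch.

```python
def hook_keyword(config, slot_id):
    """Return the per-slot keyword if defined, else the global one, else None."""
    return (config.get(f"SLOT{slot_id}_JOB_HOOK_KEYWORD")
            or config.get("STARTD_JOB_HOOK_KEYWORD"))


def hook_path(config, slot_id, hook_name):
    """Return the program configured for a hook on a slot, or None if undefined."""
    keyword = hook_keyword(config, slot_id)
    if keyword is None:
        return None
    return config.get(f"{keyword}_HOOK_{hook_name}")


# The example configuration, with $(DB_DIR) and $(WEB_DIR) already expanded.
EXAMPLE_CONFIG = {
    "STARTD_JOB_HOOK_KEYWORD": "DATABASE",
    "SLOT4_JOB_HOOK_KEYWORD": "WEB",
    "DATABASE_HOOK_FETCH_WORK": "/usr/local/condor/fetch/db/fetch_work.php",
    "DATABASE_HOOK_REPLY_FETCH": "/usr/local/condor/fetch/db/reply_fetch.php",
    "WEB_HOOK_FETCH_WORK": "/usr/local/condor/fetch/web/fetch_work.php",
}
```

With this configuration, slot 4 resolves to the WEB hooks and has no reply hook, while every other slot falls back to the DATABASE hooks.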
Semantics of fetched jobs • Condor_startd treats them just like any other kind of job: • All the standard resource policy expressions apply (START, SUSPEND, PREEMPT, RANK, etc). • Fetched jobs can coexist in the same pool with jobs pushed by Condor, COD, etc. • Fetched work != Backfill
Semantics continued • If the startd is unclaimed and fetches a job, a claim is created • If that job completes, the claim is reused and the startd fetches again • Keep fetching until either: • The claim is evicted by Condor • The fetch hook returns no more work
Limitations of the hooks • If the starter can’t run your fetched job because your ClassAd is bogus, no hook is invoked to tell you about it • We need a HOOK_STARTER_FAILURE • No hook when the starter is about to evict you (so you can checkpoint) • Can implement this yourself with a wrapper script and the SoftKillSig attribute
Job Router • Automated way to let jobs run on a wider array of resources • Transform jobs into different forms • Reroute jobs to different destinations
What is “job routing”?
[Diagram: the JobRouter matches the original job against its routing table (Site 1 …, Site 2 …) and produces a routed job; the final status is passed back]
Original (vanilla) job:
Universe = "vanilla"
Executable = "sim"
Arguments = "seed=345"
Output = "stdout.345"
Error = "stderr.345"
ShouldTransferFiles = True
WhenToTransferOutput = "ON_EXIT"
Routed (grid) job:
Universe = "grid"
GridType = "gt2"
GridResource = "cmsgrid01.hep.wisc.edu/jobmanager-condor"
Executable = "sim"
Arguments = "seed=345"
Output = "stdout"
Error = "stderr"
ShouldTransferFiles = True
WhenToTransferOutput = "ON_EXIT"
Routing is just site-level matchmaking • With feedback from job queue • number of jobs currently routed to site X • number of idle jobs routed to site X • rate of recent success/failure at site X • And with power to modify job ad • change attribute values (e.g. Universe) • insert new attributes (e.g. GridResource) • add a “portal” grid proxy if desired
Configuring the Routing Table • JOB_ROUTER_ENTRIES • list site ClassAds in configuration file • JOB_ROUTER_ENTRIES_FILE • read site ClassAds periodically from a file • JOB_ROUTER_ENTRIES_CMD • read periodically from a script • example: query a collector such as Open Science Grid Resource Selection Service
Syntax • List of sites in new ClassAd format
[ Name = "Grid Site 1"; … ]
[ Name = "Grid Site 2"; … ]
[ Name = "Grid Site 3"; … ]
…
Syntax
[
  Name = "Site 1";
  GridResource = "gt2 gk.foo.edu";
  MaxIdleJobs = 10;
  MaxJobs = 200;
  FailureRateThreshold = 0.01;
  JobFailureTest = other.RemoteWallClockTime < 1800;
  Requirements = target.WantJobRouter is True;
  delete_WantJobRouter = true;
  set_PeriodicRemove = JobStatus == 5;
]
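To make the roles of MaxJobs, MaxIdleJobs, and the set_/delete_ entries concrete, here is a small Python sketch of route selection and job transformation. It is a simplification under stated assumptions: routes and jobs are plain dicts, Requirements, JobFailureTest, and FailureRateThreshold are ignored, the `counts` structure is hypothetical, and setting Universe to "grid" follows the earlier slide example, not a verified default.

```python
def pick_route(routes, counts):
    """Return the first route that still has room for another job, else None.

    counts maps a route name to {"total": n, "idle": m} for jobs already
    routed there (the "feedback from the job queue").
    """
    for route in routes:
        used = counts.get(route["Name"], {"total": 0, "idle": 0})
        if used["total"] < route["MaxJobs"] and used["idle"] < route["MaxIdleJobs"]:
            return route
    return None


def apply_route(job_ad, route):
    """Build the routed copy of a job ad without modifying the original."""
    routed = dict(job_ad)
    routed["Universe"] = "grid"  # assumption, mirrors the slide example
    routed["GridResource"] = route["GridResource"]
    for key, value in route.items():
        if key.startswith("set_"):
            routed[key[len("set_"):]] = value
        elif key.startswith("delete_") and value:
            routed.pop(key[len("delete_"):], None)
    return routed
```

A site that already holds MaxIdleJobs idle jobs is skipped, which is what keeps work from piling up in a remote queue that is not draining.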
What Types of Input Jobs? • Vanilla Universe • Self-contained (everything needed is in the file transfer list) • High throughput (many more jobs than CPUs)
Grid Gotchas • Globus gt2 • no exit status from job (reported as 0) • Most grid universe types • must explicitly list desired output files
JobRouter vs. Glidein • Glidein - Condor overlays the grid • job never waits in remote queue • job runs in its normal universe • private networks doable, but add to complexity • need something to submit glideins on demand • JobRouter • some jobs wait in remote queue (MaxIdleJobs) • job must be compatible with target grid semantics • simple to set up, fully automatic to run
Job Router Hooks • Truly transform jobs, not just reroute them • E.g. stuff a job into a virtual machine (either VM universe or Amazon EC2) • Hooks invoked like startd ones
HOOK_TRANSLATE • Invoked when a job is matched to a route • Stdin: route name and job ad • Stdout: transformed job ad • Transformed job is submitted to Condor
HOOK_UPDATE_JOB_INFO • Invoked periodically to obtain extra information about routed job • Stdin: routed job ad • Stdout: attributes to update in routed job ad
HOOK_JOB_FINALIZE • Invoked when routed job has completed • Stdin: ads of original and routed jobs • Stdout: modified original job ad or nothing (no updates)
HOOK_JOB_CLEANUP • Invoked when original job returned to schedd (both success and failure) • Stdin: Original job ad • Use for cleanup of external resources
Power Management • Hibernate execute machines when not needed • Condor doesn’t handle waking machines up yet • Information to wake machines available in machine ads
Configuring Power Management • HIBERNATE • Expression evaluated periodically by all slots to decide when to hibernate • All slots must agree to hibernate • HIBERNATE_CHECK_INTERVAL • Number of seconds between hibernation checks
Setting HIBERNATE • HIBERNATE must evaluate to one of these strings: • “NONE”, “0” • “S1”, “1”, “STANDBY”, “SLEEP” • “S2”, “2” • “S3”, “3”, “RAM”, “MEM” • “S4”, “4”, “DISK”, “HIBERNATE” • “S5”, “5”, “SHUTDOWN” • These numbers are ACPI power states
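The aliases above collapse onto six ACPI levels, which a lookup table makes explicit. This Python sketch only restates the list from the slide; treating the strings as case-insensitive is an assumption.

```python
# ACPI power state -> strings that HIBERNATE may evaluate to (from the slide).
HIBERNATE_STATES = {
    0: ("NONE", "0"),
    1: ("S1", "1", "STANDBY", "SLEEP"),
    2: ("S2", "2"),
    3: ("S3", "3", "RAM", "MEM"),
    4: ("S4", "4", "DISK", "HIBERNATE"),
    5: ("S5", "5", "SHUTDOWN"),
}

# Reverse index: alias -> ACPI level.
_LEVEL_BY_NAME = {
    name: level for level, names in HIBERNATE_STATES.items() for name in names
}


def acpi_level(value):
    """Return the ACPI level for a HIBERNATE result string."""
    key = value.strip().upper()  # case-insensitivity is an assumption
    if key not in _LEVEL_BY_NAME:
        raise ValueError(f"not a recognized HIBERNATE value: {value!r}")
    return _LEVEL_BY_NAME[key]
```

For example, "RAM", "MEM", "S3", and "3" all select suspend-to-RAM, and "NONE" leaves the machine running.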
Power Management on Linux • On Linux, these methods are tried in order for setting the power level: • pm-utils tools • /sys/power • /proc/acpi • LINUX_HIBERNATION_METHOD can be set to pick a favored method
Sample Configuration
ShouldHibernate = ( (KeyboardIdle > $(StartIdleTime)) \
                    && $(CPUIdle) \
                    && ($(StateTimer) > (2 * $(HOUR))) )
HIBERNATE = ifThenElse( \
                    $(ShouldHibernate), "RAM", "NONE" )
HIBERNATE_CHECK_INTERVAL = 300
LINUX_HIBERNATION_METHOD = "/proc"
Dynamic Slot Partitioning • Divide slots into chunks sized for matched jobs • Readvertise remaining resources • Partitionable resources are cpus, memory, and disk
How It Works • When match is made… • New sub-slot is created for job and advertised • Slot is readvertised with remaining resources • Slot can be partitioned multiple times • Original slot ad never enters Claimed state • But may eventually have too few resources to be matched • When claim on sub-slot is released, resources are added back to original slot
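The bookkeeping behind these steps can be sketched as a toy model in Python. It tracks only the three partitionable resources and is not Condor's implementation; the class and method names are hypothetical.

```python
class PartitionableSlot:
    """Toy model of a partitionable slot and the dynamic slots carved from it."""

    def __init__(self, cpus, memory, disk):
        # Resources still advertised by the original (partitionable) slot.
        self.free = {"cpus": cpus, "memory": memory, "disk": disk}
        self.dynamic_slots = []

    def claim(self, request):
        """Carve out a dynamic slot for a matched job, or return None if it
        no longer fits in the remaining resources."""
        if any(request[name] > self.free[name] for name in self.free):
            return None
        for name in self.free:
            self.free[name] -= request[name]
        sub_slot = dict(request)
        self.dynamic_slots.append(sub_slot)
        return sub_slot

    def release(self, sub_slot):
        """Return a dynamic slot's resources to the original slot."""
        self.dynamic_slots.remove(sub_slot)
        for name in self.free:
            self.free[name] += sub_slot[name]
```

After a few claims the original slot may advertise too little to match anything, which is the situation the "too few resources" bullet describes. Releasing a sub-slot restores its share.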
Configuration • Resources still statically partitioned between slots • SLOT_TYPE_<N>_PARTITIONABLE • Set to True to enable dynamic partitioning within the indicated slot type
New Machine Attributes • In original slot machine ad • PartitionableSlot = True • In ad for dynamically-created slots • DynamicSlot = True • Can reference these in startd policy expressions
Job Submit File • Jobs can request how much of partitionable resources they need • request_cpus = 3 • request_memory = 1024 • request_disk = 10240
Dynamic Partitioning Caveats • Cannot preempt original slot or group of sub-slots • Potential starvation of jobs with large resource requirements • Partitioning happens once per slot each negotiation cycle • Scheduling of large slots may be slow
Concurrency Limits • Limit job execution based on admin-defined consumable resources • E.g. licenses • Can have many different limits • Jobs say what resources they need • Negotiator enforces limits pool-wide
Concurrency Example • Negotiator config file • MATLAB_LIMIT = 5 • NFS_LIMIT = 20 • Job submit file • concurrency_limits = matlab,nfs:3 • This requests 1 Matlab token and 3 NFS tokens
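The token accounting in this example can be sketched in Python. The parsing follows the `name[:count]` form shown in the submit file; lowercasing the names and treating an unconfigured limit as zero are assumptions of the sketch, and the negotiator's real bookkeeping is not shown in the slides.

```python
def parse_concurrency_limits(spec):
    """Parse a concurrency_limits value such as "matlab,nfs:3" into a dict
    of limit name -> tokens requested (default 1 per name)."""
    tokens = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, count = item.partition(":")
        name = name.strip().lower()
        tokens[name] = tokens.get(name, 0) + (int(count) if count else 1)
    return tokens


def can_start(request, limits, in_use):
    """True if granting the request keeps every limit within its pool-wide cap."""
    return all(
        in_use.get(name, 0) + count <= limits.get(name, 0)
        for name, count in request.items()
    )


# Limits from the negotiator config in the example above.
LIMITS = {"matlab": 5, "nfs": 20}
```

With these limits, a job asking for `matlab,nfs:3` can start while at most 4 Matlab tokens and at most 17 NFS tokens are already in use.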
New Variable Substitution • $$(Foo) in submit file • Existing feature • Attribute Foo from machine ad substituted • $$([Memory * 0.9]) in submit file • New feature • Expression is evaluated and then substituted
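The difference between the two forms can be illustrated with a small Python sketch. It is not Condor's ClassAd evaluator: the expression form is limited here to numbers, machine-ad attribute names, and the four arithmetic operators, and `machine_ad` is a plain dict standing in for the matched machine ad.

```python
import ast
import operator
import re

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node, machine_ad):
    """Evaluate a tiny arithmetic subset; anything else is rejected."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body, machine_ad)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.Name):
        return machine_ad[node.id]
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, machine_ad)
        right = _evaluate(node.right, machine_ad)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def substitute(text, machine_ad):
    """Expand $$([expr]) and $$(Attr) using values from the machine ad."""
    def expand_expr(match):
        tree = ast.parse(match.group(1).strip(), mode="eval")
        return str(_evaluate(tree, machine_ad))

    def expand_attr(match):
        return str(machine_ad[match.group(1)])

    # Handle the expression form first so its brackets are not mistaken
    # for a plain attribute reference.
    text = re.sub(r"\$\$\(\[(.*?)\]\)", expand_expr, text)
    return re.sub(r"\$\$\((\w+)\)", expand_attr, text)
```

`substitute("$$(Arch)", {"Arch": "X86_64"})` copies an attribute verbatim, while `substitute("$$([Memory * 0.9])", {"Memory": 2048})` computes a value from the matched machine's Memory before substituting it.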
More Info For Preemption • New attributes for these preemption expressions in the negotiator… • PREEMPTION_REQUIREMENTS • PREEMPTION_RANK • Used for controlling preemption due to user priorities
Preemption Attributes • Submitter/RemoteUserPrio • User priority of candidate and running jobs • Submitter/RemoteUserResourcesInUse • Number of slots in use by user of each job • Submitter/RemoteGroupResourcesInUse • Number of slots in use by each user’s group • Submitter/RemoteGroupQuota • Slot quota for each user’s group
Thank You! • Any questions?