
CERN Batch Service: HTCondor



Presentation Transcript


  1. CERN Batch Service: HTCondor

  2. Agenda • Batch Service • What is HTCondor? • Job Submission • Multiple Jobs & Requirements • File Transfer

  3. Batch Service • IT-CM-IS Mandate: "Provide high-level compute services to the CERN Tier-0 and WLCG" • HTCondor: our production batch service • Service used for both "grid" and "local" submission • Local means open to all CERN users: Kerberos, shared filesystem, managed submission nodes • ~218k cores in HTCondor • Over a million jobs a day • Service Element: Batch Service

  4. What is HTCondor? • Part of the content adapted from "An introduction to using HTCondor" by Christina Koch, HTCondor Week 2016 & HTCondor Week 2018

  5. What is HTCondor? • Open-source batch system developed at the CHTC (Center for High Throughput Computing) at the University of Wisconsin • "High Throughput Computing" • Long history in HEP and elsewhere (including previously at CERN) • Used extensively in OSG and in systems like the CMS global pool (200k+ cores) • System of symmetric matching of job requests to resources, using ClassAds of job requirements and machine resources

  6. HTCondor elements • [Diagram: submit side (Schedds) and execute side (Startds) connected through the central manager, which acts as broker] • The Collector receives machine properties (ClassAds) from the Startds, and the list of jobs is pulled from the Schedds • The Negotiator matches jobs to machines • Matched jobs are sent to the reserved slot on the execute side

  7. Execute Side • Slot: 1 CPU / 2 GB RAM / 20 GB disk • CPU and memory requests are scaled to whole slots: ask for 2 CPUs, get 4 GB RAM • Mostly CentOS7 at this point • CentOS8 in the works, but likely a CentOS7 platform for the next run • Docker & Singularity are available for containers
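A hedged illustration of the slot scaling described above (the file names my_job.sh and the submit snippet are hypothetical, not from the slides): asking for two CPUs should result in a correspondingly larger slot, roughly 4 GB RAM with the slot definition given here.

  # hypothetical submit file illustrating slot scaling
  universe     = vanilla
  executable   = my_job.sh
  request_cpus = 2
  output       = output/$(ClusterId).$(ProcId).out
  error        = error/$(ClusterId).$(ProcId).err
  log          = log/$(ClusterId).log
  queue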

  8. Jobs • A single computing task is called a "job" • The three main pieces of a job are the input, the executable and the output

  9. Job Example • [Diagram: wi.dat and us.dat are inputs to compare_states, which produces wi.dat.out] • $ compare_states wi.dat us.dat wi.dat.out • The executable must be runnable from the command line without any interactive input

  10. Job Translation • Submit file: communicates everything about your job(s) to HTCondor • The main goal of this training is to show you how to properly represent your job in a submit file

  executable = compare_states
  arguments = wi.dat us.dat wi.dat.out
  should_transfer_files = YES
  transfer_input_files = us.dat, wi.dat
  when_to_transfer_output = ON_EXIT
  log = job.log
  output = job.out
  error = job.err
  request_cpus = 1
  request_disk = 20MB
  request_memory = 20MB
  queue 1

  11. CERN HTCondor Service • [Diagram: the HTCondor pool, "CERN Condor Share"] • Central Managers: tweetybirdXY.cern.ch • Grid submission via CE Schedds (ce5XY.cern.ch), authenticated with grid certificates • Local submission via local Schedds (bigbirdXY.cern.ch) for local users, authenticated with Kerberos or grid certificates, typically from lxplus.cern.ch • Workers come in different flavours (SLC6 short, SLC6 mix, CC7, …) but share the same configuration: afs, cvmfs, eos, root, … "It's like lxplus"

  12. Ex. 1: Job Submission: sub file

  lxplus ~$ vi ex1.sub

  universe = vanilla
  executable = ex1.sh
  arguments = "training 2018"
  output = output/ex1.out
  error = error/ex1.err
  log = log/ex1.log
  queue

  • universe: an HTCondor execution environment. Vanilla is the default and it should cover 90% of cases.
  • executable / arguments: arguments are any options passed to the executable on the command line.
  • output / error: capture stdout & stderr.
  • log: file created by HTCondor to track job progress.
  • queue: keyword indicating "create a job".

  13. Ex. 1: Job Submission: script

  lxplus ~$ vi ex1.sh

  #!/bin/sh
  echo 'Date: ' $(date)
  echo 'Host: ' $(hostname)
  echo 'System: ' $(uname -spo)
  echo 'Home: ' $HOME
  echo 'Workdir: ' $PWD
  echo 'Path: ' $PATH
  echo "Program: $0"
  echo "Args: $*"

  • The shebang (#!) is mandatory when submitting script files to HTCondor: "#!/bin/sh", "#!/bin/bash", "#!/bin/env python"
  • A malformed or invalid shebang is silently ignored and no error is reported (yet)

  lxplus ~$ chmod +x ex1.sh

  14. Ex. 1: Job Submission

  universe = vanilla
  executable = ex1.sh
  arguments = "training 2018"
  output = output/ex1.out
  error = error/ex1.err
  log = log/ex1.log
  queue

  • To submit a job/jobs: condor_submit <submit_file>
  • To monitor submitted jobs: condor_q

  lxplus ~$ condor_submit ex1.sub
  Submitting job(s).
  1 job(s) submitted to cluster 162.
  lxplus ~$ condor_q
  -- Schedd: bigbird99.cern.ch : <137.138.120.138:9618?... @ 11/19/18 20:50:42
  OWNER     BATCH_NAME     SUBMITTED    DONE  RUN  IDLE  TOTAL  JOB_IDS
  fernandl  CMD: ex1.sh    11/19 20:49    _    _     1      1   162.0
  1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

  15. More about condor_q • By default condor_q shows: the user's jobs only, summarized in batches (same cluster, same executable or same batch name) • JobId = ClusterId.ProcId

  lxplus ~$ condor_q
  -- Schedd: bigbird99.cern.ch : <137.138.120.138:9618?... @ 11/19/18 20:50:42
  OWNER     BATCH_NAME           SUBMITTED    DONE  RUN  IDLE  TOTAL  JOB_IDS
  fernandl  CMD: /bin/hostname   11/19 20:49    _    _     1      1   162.0
  1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

  16. More about condor_q • To see individual job information, use: condor_q -nobatch • We will use the -nobatch option in the following slides to see extra detail about what is happening with a job

  lxplus ~$ condor_q -nobatch
  -- Schedd: bigbird99.cern.ch : <137.138.120.138:9618?... @ 11/19/18 20:50:32
  ID      OWNER     SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
  162.0   fernandl  11/19 20:49  0+00:00:00  I   0    0.0   hostname
  1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

  17. Job States • [Diagram: condor_submit puts the job in the queue as Idle (I); the executable and input are transferred to the execute node and the job becomes Running (R); output is transferred back to the submit node and the job leaves the queue as Completed (C)]

  18. Log File

  000 (168.000.000) 11/20 11:34:25 Job submitted from host: <137.138.120.138:9618?addrs=137.138.120.138-9618&noUDP&sock=1069_d2d4_3>
  ...
  001 (168.000.000) 11/20 11:37:26 Job executing on host: <188.185.217.222:9618?addrs=188.185.217.222-9618+[--1]-9618&noUDP&sock=3285_211b_3>
  ...
  006 (168.000.000) 11/20 11:37:30 Image size of job updated: 15
          0  -  MemoryUsage of job (MB)
          0  -  ResidentSetSize of job (KB)
  ...
  005 (168.000.000) 11/20 11:37:30 Job terminated.
          (1) Normal termination (return value 0)
          Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
          Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
          Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
          Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
          19  -  Run Bytes Sent By Job
          15768  -  Run Bytes Received By Job
          19  -  Total Bytes Sent By Job
          15768  -  Total Bytes Received By Job
          Partitionable Resources :  Usage  Request  Allocated
             Cpus                 :              1          1
             Disk (KB)            :     31      15    1841176
             Memory (MB)          :      0    2000       2000
  ...
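A hedged aside, not from the slides: the same log file can be used to block until the job finishes with the standard condor_wait tool (log name and job ID illustrative). condor_wait returns once the job has left the queue, or when the timeout given with -wait (in seconds) expires.

  lxplus ~$ condor_wait -wait 3600 log/168.log 168.0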

  19. The Central Manager: ClassAds & Matchmaking

  20. The Central Manager • HTCondor matches jobs with computers via a "central manager" • [Diagram: a submit node and several execute nodes, all connected to the central manager]

  21. Class Ads • HTCondor stores a list of information about each job and each computer • This information is stored as a "Class Ad" • Class Ads have the format: AttributeName = value • The value can be a Boolean, a number or a string

  22. Job Class Ads • The submit file combined with the HTCondor configuration produces the job ClassAd:

  executable = exe
  Arguments = "x y z"
  log = job.log
  output = job.out
  error = job.err
  queue 1

  + HTCondor configuration =

  RequestCpus = 1
  Err = "job.err"
  WhenToTransferOutput = "ON_EXIT"
  TargetType = "Machine"
  Cmd = "/afs/cern.ch/user/f/fernandl/condor/exe"
  Arguments = "x y z"
  JobUniverse = 5
  Iwd = "/afs/cern.ch/user/f/fernandl/condor"
  RequestDisk = 20480
  NumJobStarts = 0
  WantRemoteIO = true
  OnExitRemove = true
  MyType = "Job"
  Out = "job.out"
  UserLog = "/afs/cern.ch/user/f/fernandl/condor/job.log"
  RequestMemory = 20
  ...
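A hedged aside (the output line is illustrative, based on the ClassAd above): individual job attributes can also be printed with condor_q's autoformat option, which appears again later in these slides.

  lxplus ~$ condor_q -af Cmd Arguments RequestMemory <job id>
  /afs/cern.ch/user/f/fernandl/condor/exe x y z 20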

  23. Machine Class Ads • The machine resources combined with the HTCondor configuration produce the machine ClassAd:

  HasFileTransfer = true
  DynamicSlot = true
  TotalSlotDisk = 4300218.0
  TargetType = "Job"
  TotalSlotMemory = 2048
  Mips = 17902
  Memory = 2048
  UtsnameSysname = "Linux"
  MAX_PREEMPT = ( 3600 * 72 )
  Requirements = ( START ) && ( IsValidCheckpointPlatform ) && ( WithinResourceLimits )
  OpSysMajorVer = 6
  TotalMemory = 9889
  HasGluster = true
  OpSysName = "SL"
  HasDocker = true
  ...

  24. Job Matching • On a regular basis, the central manager reviews Job and Machine Class Ads and matches jobs to computers • [Diagram: the central manager matching jobs from the submit node to execute nodes]

  25. Job Execution • After the central manager makes the match, the submit and execute points communicate directly • [Diagram: the submit node talking directly to the matched execute node]

  26. Class Ads for People • Class Ads also provide lots of useful information about jobs and computers to HTCondor users and administrators

  27. Finding Job Attributes • Use the "long" option of condor_q: condor_q -l <JobId>

  $ condor_q -l 128.0
  Arguments = ""
  Cmd = "/bin/hostname"
  Err = "error/hostname.err"
  Iwd = "/afs/cern.ch/user/f/fernandl/temp/htcondor-training/module_single_jobs"
  JobUniverse = 5
  OnExitRemove = true
  Out = "output/hostname.out"
  RequestMemory = 2000
  Requirements = ( TARGET.Hostgroup =?= "bi/condor/gridworker/share/mixed" || TARGET.Hostgroup =?= "bi/condor/gridworker/shareshort" || TARGET.Hostgroup =?= "bi/condor/gridworker/share/singularity" || TARGET.Hostgroup =?= "bi/condor/gridworker/sharelong" ) && VanillaRequirements
  TargetType = "Machine"
  UserLog = "/afs/cern.ch/user/f/fernandl/temp/htcondor-training/module_single_jobs/log/hostname.log"
  WantRemoteIO = true
  WhenToTransferOutput = "ON_EXIT_OR_EVICT"
  ...

  28. Resource Request • Jobs use a part of the computer, not the whole thing • Important to size job requirements appropriately: memory, CPUs and disk • CERN HTCondor defaults: 1 CPU, 2 GB RAM, 20 GB disk • [Diagram: your request as a slice of the whole computer]

  29. Resource Request (II) • Even though the system sets default CPU, memory and disk requests, they may be too small for your job • Important to run the job and use the information in the log to request the right amount of resources: • Requesting too little causes problems for your jobs and for other jobs; jobs might be held by HTCondor or killed by the system • Requesting too much means jobs match fewer slots and waste resources
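A hedged sketch (values illustrative, not from the slides): after a test run, read the "Partitionable Resources" table in the log (slide 18) and set the requests slightly above the observed peak usage.

  # illustrative values; set them just above the usage reported in your log
  request_cpus   = 1
  request_memory = 1500MB
  request_disk   = 2GB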

  30. Time to start running • As we have seen, jobs don't start running immediately after submission • Many factors are involved: • Negotiation cycle: the central managers don't perform matchmaking continuously; it is an expensive operation (~5 min) • User priority: user priority is dynamic and recalculated according to usage • Availability of resources: there are many worker flavours, and the machines matching your job requirements might be busy • More info: BatchDocs (Fairshare) & Manual (User Priorities)
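As a hedged aside (not on the slide): the dynamic priorities mentioned above can be inspected with the standard condor_userprio tool; in HTCondor a lower effective priority value means a better priority.

  lxplus ~$ condor_userprio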

  31. Ex. 4: Multiple Jobs (queue)

  lxplus ~$ vi ex4.sub

  universe = vanilla
  executable = ex4.sh
  arguments = $(ClusterId) $(JobId)
  output = output/$(ClusterId).$(ProcId).out
  error = error/$(ClusterId).$(ProcId).err
  log = log/$(ClusterId).log
  queue 5

  • Pre-defined macros: the $(ClusterId) and $(ProcId) variables can be used to give each job unique file names.
  • queue: controls how many instances of the job are submitted (default 1). It supports dynamic input.

  32. Ex. 5: Multiple Jobs (queue)

  lxplus ~$ vi ex5.sub

  universe = vanilla
  executable = $(filename)
  output = output/$(ClusterId).$(ProcId).out
  error = error/$(ClusterId).$(ProcId).err
  log = log/$(ClusterId).log
  queue filename matching files ex5/*.sh

  • Using wildcard patterns in queue allows us to submit several different jobs at once.
  • The resulting jobs point to different executables, but they all belong to the same ClusterId with different ProcIds.

  33. Queue Statement Comparison
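The comparison table itself is not in the transcript; as a hedged summary, these are the common forms of the queue statement (the first three appear in the examples above, the last two are standard HTCondor submit syntax; arg and args.txt are illustrative names):
  • queue : submit one job (the default)
  • queue 5 : submit five identical jobs (Ex. 4)
  • queue filename matching files ex5/*.sh : one job per file matching the glob, with $(filename) set to it (Ex. 5)
  • queue arg from args.txt : one job per line of args.txt, with $(arg) set to that line
  • queue arg in (alpha, beta, gamma) : one job per listed item, with $(arg) set to the item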

  34. CERNism: JobFlavour • Set of pre-defined run times to bucket jobs easily (default: espresso)

  universe = vanilla
  executable = training.sh
  output = output/$(ClusterId).$(ProcId).out
  error = error/$(ClusterId).$(ProcId).err
  log = log/$(ClusterId).log
  +JobFlavour = "microcentury"
  queue

  espresso = 20 minutes
  microcentury = 1 hour
  longlunch = 2 hours
  workday = 8 hours
  tomorrow = 1 day
  testmatch = 3 days
  nextweek = 1 week

  35. Exceeding MaxRuntime • What happens if we set MaxRuntime to less than the job needs in order to finish? • The job will be removed by the system.

  universe = vanilla
  executable = training.sh
  output = output/$(ClusterId).$(ProcId).out
  error = error/$(ClusterId).$(ProcId).err
  log = log/$(ClusterId).log
  +MaxRuntime = 120
  queue

  [fprotops@lxplus088 training]$ condor_q -af MaxRuntime
  120
  [fprotops@lxplus088 training]$ condor_history -l <job id> | grep -i remove
  RemoveReason = "Job removed by SYSTEM_PERIODIC_REMOVE due to wall time exceeded allowed max."

  36. Debug (I): condor_status • It displays the status of machines in the pool:
  $ condor_status -avail
  $ condor_status -schedd
  $ condor_status <hostname>
  $ condor_status -l <hostname>
  • It supports filtering based on ClassAds:

  [fprotops@lxplus071 ssh]$ condor_status -const 'OpSysAndVer =?= "CentOS7"'
  slot1_3@b692bc24fe.cern.ch LINUX X86_64 Claimed Busy 3.350
  slot1_4@b692bc24fe.cern.ch LINUX X86_64 Claimed Busy 3.130
  slot1_5@b692bc24fe.cern.ch LINUX X86_64 Claimed Busy 3.250
  slot1_6@b692bc24fe.cern.ch LINUX X86_64 Claimed Busy 3.610
  slot1_7@b692bc24fe.cern.ch LINUX X86_64 Claimed Busy 11.510

  37. Debug (II): condor_ssh_to_job • Creates an ssh session to a running job:
  $ condor_ssh_to_job <job id>
  $ condor_ssh_to_job -auto-retry <job id>
  • This gives us access to the contents of our sandbox on the worker node: output, temp files, credentials, …

  [fprotops@lxplus071 ssh]$ condor_ssh_to_job -auto-retry <job id>
  slot1_4@b626c4b230.cern.ch: Rejecting request, because the job execution environment is not yet ready.
  Waiting for job to start...
  Welcome to slot1_4@b626c4b230.cern.ch!
  Your condor job is running with pid(s) 18694.
  [fprotops@b626c4b230 dir_18443]$ ls
  condor_exec.exe _condor_stderr _condor_stdout fprotops.cc test.txt tmpvar

  38. Debug (III): condor_tail • It displays the tail of the job output files:
  $ condor_tail -follow <job id>
  • The output can be controlled via flags:
  $ condor_tail -follow -no-stdout -stderr <job id>

  [fprotops@lxplus052 training]$ condor_tail -follow <job id>
  Welcome to the HTCondor training!

  39. Debug (IV): Hold & Removed • Apart from Idle, Running and Completed, HTCondor defines two more states: • Hold and Removed • Jobs can get into Hold or Removed status either by the user or by the system. • Related commands:$ condor_q –hold$ condor_q –afHoldReason <job_id>$ condor_hold$ condor_release$ condor_rm$ condor_history –limit 1 <job_id> -afRemoveReason • [fprotops@lxplus052 training]$ condor_q -afHoldReason 155.0 • via condor_hold (by user fprotops) 40

  40. File Transfer • A job will need input and output data. There are several ways to get data in or out of the batch system, so we need to know a little about the trade-offs: • Do you want to use a shared filesystem? • Do you want to have condor transfer data for you? • Should you do input or output in the job payload itself?

  41. Infrastructure • [Diagram: the condor data transfer infrastructure]

  42. Adding Input files • In order to add input files, we just need to add "transfer_input_files" to our submit file • It is a list of files to take from the working directory and send to the job sandbox • This example produces one output file, "merge.out"

  executable = merge.sh
  arguments = a.txt b.txt merge.out
  transfer_input_files = a.txt, b.txt
  log = job.log
  output = job.out
  error = job.err
  +JobFlavour = "longlunch"
  queue 1

  43. Transferring output back • By default condor will transfer back everything in your sandbox • To transfer back only the files you need, use transfer_output_files • Adding a file to transfer_output_files also adds it to the list of files that condor_tail can see

  executable = merge.sh
  arguments = a.txt b.txt merge.out
  transfer_input_files = a.txt, b.txt
  transfer_output_files = merge.out
  log = job.log
  output = job.out
  error = job.err
  +JobFlavour = "longlunch"
  queue 1

  44. Important considerations • Even when using a shared filesystem, files are transferred to a scratch space on the workers, the "sandbox" • Remember the impact on the filesystem! The most efficient use of network filesystems is typically to write once, at the end of a job • You have 20 GB of sandbox per CPU • There are limits to the amount of data we allow to be transferred using condor file transfer; the limit is currently 1 GB per job • The job itself can do file transfer, both input and output (see the sketch below)
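A hedged sketch of the last point (the payload name and EOS destination are illustrative, not from the slides): the job script works in the local sandbox and copies its result out once, at the end, in line with the "write once, at the end of a job" advice above.

  #!/bin/bash
  # run the (hypothetical) payload, writing into the local sandbox
  ./my_analysis > result.out
  # copy the result to EOS once, at the end of the job (destination illustrative)
  xrdcp -f result.out root://eosuser.cern.ch//eos/user/u/username/result.out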

  45. condor_submit -spool • You may not want condor to create files in your shared filesystem, particularly if you are submitting tens of thousands of jobs • condor_submit -spool transfers files to the Schedd • Important notes: • This makes the system asynchronous: to get any files back you need to run condor_transfer_data • The spool on the Schedd is limited! • Best practice for this mode: spool, but write data out to its end location within the job; use the spool only for stdout/stderr
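A minimal hedged workflow sketch for spooled submission (submit file and job ID illustrative):

  lxplus ~$ condor_submit -spool ex1.sub
  lxplus ~$ condor_q                        (wait until the job completes)
  lxplus ~$ condor_transfer_data <job id>   (fetch the spooled output back to the submit side)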

  46. Note on AFS & EOS condor data transfer • Shared filesystem is used a lot for batch jobs • Current best practices: • AFS, EOS FUSE, EOS via xrdcp are all available on the worker node • Between the submit node and the Schedd, only AFS is currently supported • No exe, log, stdout, err in EOS in your submit file • With all network filesystem, best to write at end of job, not constant I/O whilst the job is running • AFS supported for as long as it’s available • EOS FUSE will be supported when it is performant
