
Further elements of Condor and CamGrid


Presentation Transcript


  1. Further elements of Condor and CamGrid Mark Calleja

  2. What Condor Daemons are running on my machine, and what do they do?

  3. Condor Daemon Layout
  [Diagram: on the central manager the master spawns the collector, negotiator, schedd and startd; arrows in the original denote "process spawned".]
  Note: there can also be other, more specialist daemons.

  4. condor_master
  • Starts up all other Condor daemons
  • If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator
  • Checks the time stamps on the binaries of the other Condor daemons; if new binaries appear, the master will gracefully shut down the currently running version and start the new version
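  As a quick illustration, a couple of commands (assuming a standard installation with the command-line tools on the PATH) for seeing what the master is configured to spawn and asking it to restart its children:

      # Which daemons is the master configured to spawn on this host?
      condor_config_val DAEMON_LIST

      # Restart the Condor daemons on this host (the master restarts
      # everything beneath it, e.g. after a configuration change)
      condor_restart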

  5. condor_startd
  • Represents a machine to the Condor system
  • Responsible for starting, suspending, and stopping jobs
  • Enforces the wishes of the machine owner (the owner’s “policy”… more on this soon)
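  To make “policy” concrete, a minimal sketch of a desktop-owner policy in a startd’s local configuration; the idle and load thresholds here are illustrative, not CamGrid’s actual settings:

      # Only start jobs after 15 minutes of keyboard idleness and low load
      START    = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)
      # Suspend running jobs as soon as the owner returns to the keyboard
      SUSPEND  = (KeyboardIdle < 60)
      # Resume them once the machine has been idle again for 15 minutes
      CONTINUE = (KeyboardIdle > 15 * 60)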

  6. condor_schedd
  • Represents users to the Condor system
  • Maintains the persistent queue of jobs
  • Responsible for contacting available machines and sending them jobs
  • Services user commands which manipulate the job queue:
    • condor_submit, condor_rm, condor_q, condor_hold, condor_release, condor_prio, …
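  A typical session with these queue-manipulation commands might look like the following (the submit file name and job ID are made up):

      condor_submit my_job.sub    # place the job in the schedd's queue
      condor_q                    # list my idle/running jobs
      condor_hold 123.0           # put job 123.0 on hold
      condor_release 123.0        # let it run again
      condor_rm 123.0             # remove it from the queue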

  7. condor_collector
  • Collects information from all other Condor daemons in the pool
  • “Directory Service” / Database for a Condor pool
  • Each daemon sends a periodic update called a “ClassAd” to the collector
  • Services queries for information:
    • Queries from other Condor daemons
    • Queries from users (condor_status)
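  Some illustrative condor_status queries against the collector (the constraint and host name are only examples):

      condor_status                            # one line per execute slot in the pool
      condor_status -schedd                    # show the submit points instead
      condor_status -constraint 'Arch == "X86_64" && OpSys == "LINUX"'
      condor_status -long some.host.cam.ac.uk  # dump that machine's full ClassAd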

  8. condor_negotiator
  • Performs “matchmaking” in Condor
  • Gets information from the collector about all available machines and all idle jobs
  • Tries to match jobs with machines that will serve them
  • Both the job and the machine must satisfy each other’s requirements
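  The job’s half of this two-way match lives in its submit file; a sketch (the memory figure is illustrative) that pairs with an owner policy like the one above:

      # The job is only matched to machines satisfying these requirements,
      # and only to machines whose own START expression also accepts the job.
      requirements = OpSys == "LINUX" && Arch == "X86_64" && Memory >= 2048
      rank         = KFlops   # among matching machines, prefer the fastest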

  9. Typical Condor Pool
  [Diagram: the central manager runs all five daemons (master, collector, negotiator, schedd, startd); the two regular nodes run master, schedd and startd; the two execute-only nodes run master and startd; the submit-only node runs master and schedd. ClassAd communication pathways link the daemons to the central manager.]
  Each daemon maintains its own log; makes for interesting distributed debugging!

  10. Job Startup
  [Diagram: the submit machine’s schedd advertises the job to the collector on the central manager; the negotiator matches it to an execute machine; the schedd then spawns a shadow and the matched startd spawns a starter, which runs the job; the Condor syscall library in the job talks back to the shadow on the submit machine.]

  11. The Parallel Universe
  First, some caveats:
  • Using the PU requires extra set-up on the execute nodes by a sysadmin.
  • Execute hosts willing to run under the PU will only accept jobs from one dedicated scheduler.
  • No flocking!
  At its simplest, the PU just launches N identical jobs on N machines. The job will only run when all machines have been allocated. For example, the following submit script will run the command “uname -a” on four nodes.

  12. ######################################
      ## Example PU submit description file
      ######################################
      universe = parallel
      executable = /bin/uname
      arguments = -a
      should_transfer_files = yes
      when_to_transfer_output = on_exit
      requirements = OpSys == "LINUX" && Arch == "X86_64"
      log = logfile
      output = outfile.$(NODE)
      error = errfile.$(NODE)
      machine_count = 4
      queue

  13. The Parallel Universe and MPI
  • Condor leverages the PU to run MPI jobs.
  • It does this by launching a wrapper on N nodes. Then:
    • N-1 nodes exit (but don’t relinquish their claim on the node).
    • The wrapper on the first node launches the MPI job, grabbing all N nodes and calling the relevant MPI command (e.g. for MPICH2, OpenMPI, etc.).
  • Condor bundles wrappers for various MPI flavours, which you can then modify to suit your needs.
  • But what does MPI on CamGrid mean?
    • Job spans many pools: not practical! Consider packet latency over routers and firewalls. Also, it’s a heterogeneous environment.
    • Job spans many machines in one pool: possible, and some people do this. Inter-node connectivity is usually 1 Gb/s (at best).
    • Job sits on one machine, but spans all cores: we’re in the money! With multi-core machines this is becoming increasingly attractive, and comms take place over shared memory (avoiding the n/w stack) = very fast.

  14. MPI Example
      # This is a wrapper for an OpenMPI job using 4 cores on the same
      # physical host:
      executable = openmpi.sh
      transfer_input_files = castep, Al_00PBE.usp, O_00PBE.usp, \
                             corundum.cell, corundum.param
      WhenToTransferOutput = ON_EXIT
      output = myoutput
      error = myerror
      log = mylog
      # We want four processes
      machine_count = 4
      arguments = "castep corundum"
      +WantParallelSchedulingGroups = True
      requirements = OpSys == "LINUX" && Arch == "X86_64"
      queue

  15. MPI Example (cont.)
  • openmpi.sh does the necessary groundwork, e.g. sets paths to MPI binaries and libraries, creates machine file lists, etc., before invoking the MPI starter command for that flavour.
  • In parallel environments, machines can be divided into groups by suitable configuration on the execute hosts. For example, the following configuration entry would mean that all processes on the host reside in the same group, i.e. all on that same machine:
      ParallelSchedulingGroup = "$(HOSTNAME)"
  • This feature is then requested in the user’s submit script by having:
      +WantParallelSchedulingGroups = TRUE

  16. Case study: Ag3[Co(CN)6] energy surface from DFT

  17. Accessing data across CamGrid
  • CamGrid spans many administrative domains, with each pool generally run by different sysadmins.
  • This makes running a file system that needs privileged installation and administration, e.g. NFS, impractical.
  • So far we’ve got round this by sending input files from the submit node with every job submission.
  • However, there are times when it would be really nice to be able to mount a remote file store, e.g. maybe I don’t know exactly which files I need at submit time (they’re only identified at run time).

  18. Parrot
  • Fortunately, a tool to do just this in Linux has come out of the Condor project, called Parrot.
  • Parrot gives a transparent way of accessing these resources without the need for superuser intervention (unlike trying to export a directory via NFS, or setting up sshfs).
  • It supports many protocols (http, httpfs, ftp, anonftp, gsiftp, chirp, …) and authentication models (GSI, Kerberos, IP-address, …).
  • Parrot can be used on its own (outside of Condor).
  • It also allows server clustering for load balancing.

  19. Digression: Chirp
  • Chirp: a remote I/O protocol used by Condor.
  • I can start my own chirp_server and export a directory:
      chirp_server -r /home/mcal00/data -I 172.24.116.7 -p 9096
  • I set permissions per directory with a .__acl file in the exported directory:
      hostname:*.grid.private.cam.ac.uk rl
      hostname:*.escience.cam.ac.uk rl

  20. Using interactive shells with Parrot
  • Not that useful in a grid job, but a nice feature:
      parrot vi /chirp/woolly--escience.grid.private.cam.ac.uk:9096/readme
  • The default port is 9094, so different mount-points can be exported from the same resource by using different ports.
  • I can also mount a remote file system in a new shell:
      parrot -M /dbase=/http/woolly--escience.grid.private.cam.ac.uk:80 bash
  • /dbase appears as a local directory in the new shell.

  21. Parrot and non-interactive grid jobs
  • Consider an executable called a.out that needs to access the directories /Dir1 and /Dir2.
  • I start by constructing a mountpoint file (call it Mountfile):
      /Dir1   /chirp/woolly--escience.grid.private.cam.ac.uk:9094
      /Dir2   /chirp/woolly--escience.grid.private.cam.ac.uk:9096
  • Next I wrap it in an executable that provides all the Parrot functionality, and that is what I actually submit to the relevant grid scheduler, e.g. via condor_submit. Call this wrapper.sh (next slide):

  22. wrapper.sh
      #!/bin/bash
      export PATH=.:/bin:/usr/bin
      export LD_LIBRARY_PATH=.
      # Nominate the file with the mount points
      mountfile=Mountfile
      # What's the "real" executable called?
      my_executable=a.out
      chmod +x $my_executable parrot
      # Run the executable under parrot
      parrot -k -Q -m $mountfile $my_executable

  23. New in 7.4: File transfer by URL
  • It is now possible for vanilla jobs to specify a URL for their input files so that the execute host pulls the files over.
  • First, a sysadmin must have added an appropriate plugin and configured the execute host accordingly, e.g.:
      FILETRANSFER_PLUGINS = $(RELEASE_DIR)/plugins/curl-plugin
  • You can then submit a job to that machine without sending it any files, and instead direct it to pull them from an appropriate server, e.g. have in a submit script:
      URL = https://www.escience.cam.ac.uk/
      transfer_input_files = $(URL)/file1.txt, $(URL)/file2.txt
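  Putting those two submit lines into context, a minimal sketch of a complete vanilla submit file (the executable and file names are illustrative):

      universe                = vanilla
      executable              = process.sh
      # Inputs are fetched over HTTP on the execute host rather than
      # shipped from the submit node (needs the curl plugin above).
      URL                     = https://www.escience.cam.ac.uk/
      transfer_input_files    = $(URL)/file1.txt, $(URL)/file2.txt
      should_transfer_files   = YES
      when_to_transfer_output = ON_EXIT
      output = job.out
      error  = job.err
      log    = job.log
      queue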

  24. Checkpointing - DIY
  • Recap: Condor’s process checkpointing via the Standard Universe saves all the state of a process into a checkpoint file:
    • memory, CPU, I/O, etc.
  • Checkpoints are saved on the submit host unless a dedicated checkpoint server is nominated.
  • The process can then be restarted from where it left off.
  • Typically no changes to the job’s source code are needed; however, the job must be relinked with Condor’s Standard Universe support library.
  • Limitations: no forking, kernel threads, or some forms of IPC.
  • Not all combinations of OS/compilers are supported (none for Windows), and support is getting harder.
  • The VM universe is meant to be the successor, but users don’t seem too keen.
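  As a reminder of what relinking involves, a sketch assuming a C program built with gcc (the file names are illustrative):

      # Relink the program with Condor's Standard Universe support library
      condor_compile gcc -o my_app my_app.c

  The relinked binary is then submitted with universe = standard, e.g.:

      universe   = standard
      executable = my_app
      output     = my_app.out
      log        = my_app.log
      queue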

  25. Strategy 1 – Recursive shell scripts

  26. Recursive shell scripts (cont.)
  • We can run a recursive shell script (a sketch is given after this slide).
  • This script does a condor_submit on our required executable, and we ensure that the input files are such that the job only runs for a “short” duration.
  • The script then runs condor_wait on the job’s log file and waits for it to finish.
  • Once this happens, the script checks the output files to see if the completion criteria have been met; otherwise we move the output files to input files and resubmit the job.
  • Hence, there is a proviso that the output files can generate the next set of input files (not all applications can).
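  A minimal sketch of such a script, written as a loop rather than literal recursion for brevity; the submit file name, log and output file names, and the completion test are all placeholders for application-specific logic:

      #!/bin/bash
      # Submit a short run, wait for it, check for completion, repeat.
      while true; do
          condor_submit job.sub        # queue one "short" run
          condor_wait job.log          # block until that run finishes

          # Application-specific completion test goes here
          if grep -q CONVERGED output.dat; then
              echo "Completion criteria met; stopping."
              break
          fi

          # Not finished: this run's output becomes the next run's input
          mv output.dat input.dat
          rm -f job.log                # start the next run with a fresh log
      done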

  27. Recursive shell scripts (cont.)
  • There are some drawbacks with this approach:
    • We need to write the logic for checking for job completion. This will probably vary between applications.
    • We need to take into account how our recursive script will behave if the job exits abnormally, e.g. the execute host disappears, etc.
  • We can mitigate some of these concerns by running a recursive DAG (so Condor worries about abnormalities), and an example is given in CamGrid’s online documentation. However, we still need to write some application-specific logic.

  28. Checkpointing (linux) vanilla universe jobs
  • Many applications can’t link with Condor’s checkpointing libraries. And what about interpreted languages?
  • To perform this for arbitrary code we need:
    1) an API that checkpoints running jobs;
    2) a user-space file system to save the images.
  • For 1) we use the BLCR kernel modules; unlike Condor’s user-space libraries these run with root privilege, so there are fewer limitations on the codes one can use.
  • For 2) we use Parrot, which came out of the Condor project. It is used on CamGrid in its own right, but combined with BLCR it allows any code to be checkpointed.
  • I’ve provided a bash implementation, blcr_wrapper.sh, to accomplish this (it uses the chirp protocol with Parrot).

  29. Checkpointing linux jobs using BLCR kernel modules and Parrot
  1. Start a chirp server to receive the checkpoint images.
  2. The Condor job starts: blcr_wrapper.sh uses 3 processes (the job, a parent, and Parrot I/O).
  3. Start by checking for an image from a previous run.
  4. Start the job.
  5. The parent sleeps, waking periodically to checkpoint the job and save the images.
  6. The job ends: tell the parent to clean up.
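  For reference, the underlying BLCR user commands that such a wrapper builds on (file names and arguments are illustrative; blcr_wrapper.sh itself also handles shipping the images via Parrot/chirp):

      # Run the application under BLCR so it can be checkpointed later
      cr_run ./my_application A B &
      app_pid=$!

      # Periodically dump the whole process tree to a context file...
      cr_checkpoint --tree -f checkpoint.img $app_pid

      # ...and on a later (compatible) run, resume from that image
      cr_restart checkpoint.img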

  30. Example of submit script
  • The application is “my_application”, which takes arguments “A” and “B” and needs files “X” and “Y”.
  • There’s a chirp server at woolly--escience.grid.private.cam.ac.uk:9096.
      Universe = vanilla
      Executable = blcr_wrapper.sh
      arguments = woolly--escience.grid.private.cam.ac.uk 9096 60 $$([GlobalJobId]) \
                  my_application A B
      transfer_input_files = parrot, my_application, X, Y
      should_transfer_files = YES
      when_to_transfer_output = ON_EXIT_OR_EVICT
      Requirements = OpSys == "LINUX" && Arch == "X86_64" && HAS_BLCR == TRUE
      Output = test.out
      Log = test.log
      Error = test.error
      Queue
