Learn about the Condor checkpointing process, including installation, storage, and restoration of process state. Explore checkpointing details and limitations on homogeneous systems. Discover the Condor universes and the commands available for job submission and management.
Basic Grid Projects – Condor Part II
Sathish Vadhiyar
Sources/Credits: Condor Project web pages
Checkpointing
• Checkpointing is used to vacate a job from one workstation and resume it on another, idle workstation
• A Condor checkpoint library is linked with the program's code
• The checkpoint library installs a signal handler for the SIGTSTP signal
• Checkpoints are stored either on the local disk of the submitting machine or on checkpoint servers
• A checkpoint stores the UNIX process state, including the text, stack, and data segments, open files, pointers, etc.
• Condor also provides periodic checkpointing
Checkpointing Overview
• When the startd daemon detects a policy violation, it sends a signal to the process
• The signal handler in the process is invoked and the process state is checkpointed
• The checkpoint is sent to the shadow process, which stores it
• When a new machine is chosen, the executable and the checkpoint are sent to the remote machine
• When the job starts on the remote machine, it detects that it is a restart and reads the checkpoint; the process is then manipulated so that its state at the time of the checkpoint is restored
• To the user code, it appears that the process has just returned from the signal handler
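The signal-driven flow above can be imitated in a few lines of Python. This is only a toy sketch under stated assumptions: the "process state" is a small dictionary rather than real memory segments, the checkpoint is written to a local temporary file instead of being sent to the shadow, and the names CKPT_PATH, start_or_restart, and run are hypothetical. It assumes a POSIX system where SIGTSTP is available.

```python
import os
import pickle
import signal
import tempfile

# Hypothetical checkpoint location; real Condor ships the checkpoint to the shadow.
CKPT_PATH = os.path.join(tempfile.gettempdir(), "toy_condor_job.ckpt")

# The entire "process state" of this toy job.
state = {"next_i": 0, "total": 0}


def checkpoint_handler(signum, frame):
    """Signal handler: save the current state, as the checkpoint library would."""
    with open(CKPT_PATH, "wb") as f:
        pickle.dump(state, f)


def start_or_restart():
    """Install the handler and, if a checkpoint exists, resume from it."""
    signal.signal(signal.SIGTSTP, checkpoint_handler)
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH, "rb") as f:
            state.update(pickle.load(f))
        return True   # this run is a restart
    return False      # fresh start


def run(n, stop_after=None):
    """Sum 0..n-1; optionally get 'vacated' after stop_after steps."""
    steps = 0
    while state["next_i"] < n:
        state["total"] += state["next_i"]
        state["next_i"] += 1
        steps += 1
        if stop_after is not None and steps == stop_after:
            # Stand-in for the startd signalling the job on a policy violation.
            signal.raise_signal(signal.SIGTSTP)
            return None   # the job leaves this machine
    return state["total"]
```

A first call to start_or_restart() followed by run(10, stop_after=4) leaves a checkpoint behind; a later start_or_restart() reports a restart, and run(10) continues from step 4 rather than from zero.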
Checkpointing Details (refer to the postscript file)
• Preserving and restoring the text area (same executable), the data area (its extent found using sbrk(0)), and the stack
• Preserving the stack state consists of storing and restoring two parts: stack context and stack space
• Stack context is stored by setjmp and restored by longjmp
• Stack space replacement is tricky; it is performed by using a secure region of the data area as a temporary stack
• Open files
  • State is saved by augmenting the open calls
  • lseek is performed during checkpointing to obtain offset information
• Signals: sigaction, sigpending
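The open-file bookkeeping can be sketched as follows. This is a minimal illustration, not Condor's implementation: TrackedFiles is a hypothetical stand-in for the augmented open call, it records only the path and flags, and it assumes files are reopened with flags that are safe to repeat (a real library would have to avoid, for example, truncating a file again on restart).

```python
import os


class TrackedFiles:
    """Toy stand-in for the checkpoint library's augmented open() wrapper."""

    def __init__(self):
        self._table = {}   # fd -> (path, flags), recorded at open time

    def open(self, path, flags=os.O_RDONLY):
        """Open a file and remember how it was opened."""
        fd = os.open(path, flags)
        self._table[fd] = (path, flags)
        return fd

    def checkpoint(self):
        """Record path, flags, and current offset of every tracked file."""
        return [
            # lseek with offset 0 from SEEK_CUR returns the current position.
            (path, flags, os.lseek(fd, 0, os.SEEK_CUR))
            for fd, (path, flags) in self._table.items()
        ]

    def restore(self, saved):
        """Reopen each file and seek back to the saved offset."""
        fds = []
        for path, flags, offset in saved:
            fd = self.open(path, flags)
            os.lseek(fd, offset, os.SEEK_SET)
            fds.append(fd)
        return fds
```

After a restart, reads continue from the byte following the last one consumed before the checkpoint, which is what makes the migration invisible to the user code.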
Checkpoint Summary
• The checkpoint library installs a signal handler called checkpoint()
• It then calls main()
• At checkpoint time, the SIGTSTP signal is sent and checkpoint() is invoked
• checkpoint()
  • Writes open file, signal, and stack context information to the data area
  • Stores the data and stack segments
Restart Summary
• restore()
  • Overwrites the data segment with the one in the checkpoint
  • Restores file and signal information
  • Switches to a temporary location in the data segment and replaces its stack space
  • Performs longjmp() back into the checkpoint() signal handler
  • The checkpoint routine returns and restores the CPU registers
Limitations
• Cannot checkpoint programs that use fork()/exec() or that are multi-process
• Can checkpoint only on homogeneous systems
• Cannot checkpoint communicating multi-process applications
Condor Universes
• The universe is specified during job submission
• Types:
  • Standard
    • System calls are transferred to the submit machine
    • Provides checkpointing and migration
    • The program is relinked with condor_compile
  • Vanilla
    • For programs that cannot be relinked
    • Does not provide checkpointing and migration – WHY?
    • For file access, use the Condor File Transfer mechanism
  • Scheduler
    • For a job that should act as a metascheduler
  • MPI, PVM, Java, Globus
Condor Commands
• condor_compile
  • Relinks source or object files with the Condor libraries
  • The Condor library provides checkpointing, migration, and remote system calls
• condor_submit – takes a submit description file as input and produces a job ClassAd for further processing by the central manager
• condor_status – shows information about the machines in the Condor pool
• condor_q – shows job status
DAGMan
• Meta-scheduler for Condor
• Manages dependencies between jobs at a higher level
• Sits on top of Condor
• Handles cases where the input of one program depends on the output of another
• condor_submit_dag DAGInputFileName
• A DAG within a DAG is supported
Example input file for DAGMan
  # Filename: diamond.dag
  #
  Job A A.condor
  Job B B.condor
  Job C C.condor
  Job D D.condor
  PARENT A CHILD B C
  PARENT B C CHILD D
  Retry C 3
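To see what ordering the diamond DAG implies, the file can be parsed and sorted with a short Python sketch. This is an illustration only: parse_dag and run_order are hypothetical helpers that handle just the three keywords shown above (Job, PARENT ... CHILD, Retry), and the ordering uses Python's standard graphlib module (Python 3.9+) rather than anything DAGMan itself does internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

DIAMOND_DAG = """\
# Filename: diamond.dag
#
Job A A.condor
Job B B.condor
Job C C.condor
Job D D.condor
PARENT A CHILD B C
PARENT B C CHILD D
Retry C 3
"""


def parse_dag(text):
    """Parse Job, PARENT/CHILD, and Retry lines of a DAG input file."""
    jobs = {}      # job name -> submit description file
    parents = {}   # job name -> set of jobs it must wait for
    retries = {}   # job name -> number of retries
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue
        keyword = words[0].upper()
        if keyword == "JOB":
            jobs[words[1]] = words[2]
            parents.setdefault(words[1], set())
        elif keyword == "PARENT":
            split = [w.upper() for w in words].index("CHILD")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
        elif keyword == "RETRY":
            retries[words[1]] = int(words[2])
    return jobs, parents, retries


def run_order(parents):
    """Group jobs into waves; every job in a wave has all its parents done."""
    sorter = TopologicalSorter(parents)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For the diamond file, run_order yields three waves: A alone, then B and C together, then D.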
Condor File System and File Transfer Mechanism
• Applicable only to vanilla jobs
• By default, a shared file system is assumed between the submitting machine and the executing machine
• Machine ClassAd attributes: FileSystemDomain and UidDomain
• To bypass the default, specify something like:
  Requirements = UidDomain == "cs.wisc.edu" && \
                 FileSystemDomain == "cs.wisc.edu"
Condor File System and File Transfer Mechanism
• If the machines do not share a file system, or the file systems are not explicitly specified, enable the Condor File Transfer Mechanism:
  should_transfer_files = YES
  when_to_transfer_output = ON_EXIT
• Any files that are generated or modified in the remote working directory are transferred back to the submit machine
References / Sources / Credits
• Condor manual
• Condor web pages
• Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny, "Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System", University of Wisconsin-Madison Computer Sciences Technical Report #1346, April 1997.
• James Frey, Todd Tannenbaum, Ian Foster, Miron Livny, and Steven Tuecke, "Condor-G: A Computation Management Agent for Multi-Institutional Grids", Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10), San Francisco, California, August 7-9, 2001.
• Rajesh Raman, Miron Livny, and Marvin Solomon, "Matchmaking: Distributed Resource Management for High Throughput Computing", Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, Chicago, IL, July 28-31, 1998.
• Michael Litzkow, Miron Livny, and Matt Mutka, "Condor - A Hunter of Idle Workstations", Proceedings of the 8th International Conference of Distributed Computing Systems, pages 104-111, June 1988.
Submit description files
• Direct the queuing of jobs
• Contain
  • Executable location
  • Command-line arguments to the job
  • stdin, stderr, stdout
  • Initial working directory
  • should_transfer_files = <YES | NO | IF_NEEDED>. NO disables the Condor file transfer mechanism
  • when_to_transfer_output = <ON_EXIT | ON_EXIT_OR_EVICT>
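As a rough illustration of how these pieces fit together, the sketch below renders a submit description file from a dictionary. The helper render_submit and all file names and argument values (my_program, job.in, run_1, and so on) are hypothetical placeholders; the command names on the left-hand side are ones the slides or the Condor manual list, and the final queue line is what asks Condor to queue the job.

```python
def render_submit(commands, queue_count=1):
    """Render 'name = value' lines followed by a queue statement."""
    lines = [f"{name} = {value}" for name, value in commands.items()]
    lines.append("queue" if queue_count == 1 else f"queue {queue_count}")
    return "\n".join(lines) + "\n"


# All values below are made-up placeholders for illustration.
EXAMPLE = render_submit({
    "universe": "vanilla",
    "executable": "my_program",            # executable location
    "arguments": "-n 10",                  # command-line arguments
    "input": "job.in",                     # stdin
    "output": "job.out",                   # stdout
    "error": "job.err",                    # stderr
    "initialdir": "run_1",                 # initial working directory
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT",
})
```

Writing EXAMPLE to a file and passing that file to condor_submit would be the next step in a real pool.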
Submit description file
• requirements = <ClassAd Boolean Expression>
• By default, Arch, OpSys, Disk, VirtualMemory, and (for vanilla jobs) FileSystemDomain are set
• +<attribute> = <value>
Machine ClassAd Attributes
• Activity
• Arch
• CondorLoadAvg, ConsoleIdle, Disk, Cpus, KeyboardIdle, LoadAvg, KFlops, Mips, Memory, OpSys
• FileSystemDomain, Requirements, StartdIpAddr
• ClientMachine, CurrentRank, RemoteOwner, LastPeriodicCheckpoint
Job ClassAd Attributes
• CompletionDate, RemoteIwd
Heterogeneous job submission
• Works well with the vanilla universe, since no checkpoint is taken
• For the standard universe, Condor adds:
  # Added by Condor
  CkptRequirements = ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && \
                     ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED))
  Requirements = (<user specified policy>) && $(CkptRequirements)
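The effect of the CkptRequirements expression can be traced with a small Python model. This is a simplified sketch, not the ClassAd evaluator: UNDEFINED is modelled as None, only the ==, =?=, ||, and && operators needed here are imitated, string case rules are ignored, and the attribute values in the usage note are made up.

```python
UNDEFINED = None  # stand-in for the ClassAd UNDEFINED value


def eq(a, b):
    """ClassAd ==: the result is UNDEFINED if either operand is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a == b


def is_identical(a, b):
    """ClassAd =?=: always True or False, even for UNDEFINED operands."""
    return a == b


def or3(a, b):
    """Three-valued OR: True wins, otherwise UNDEFINED propagates."""
    if a is True or b is True:
        return True
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return False


def and3(a, b):
    """Three-valued AND: False wins, otherwise UNDEFINED propagates."""
    if a is False or b is False:
        return False
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return True


def ckpt_requirements(job, machine):
    """Evaluate the CkptRequirements expression for a job/machine pair."""
    ckpt_arch = job.get("CkptArch")
    ckpt_opsys = job.get("CkptOpSys")
    arch_ok = or3(eq(ckpt_arch, machine.get("Arch")),
                  is_identical(ckpt_arch, UNDEFINED))
    opsys_ok = or3(eq(ckpt_opsys, machine.get("OpSys")),
                   is_identical(ckpt_opsys, UNDEFINED))
    return and3(arch_ok, opsys_ok)
```

A job that has never checkpointed (no CkptArch or CkptOpSys) matches any machine, while a job that checkpointed on, say, an INTEL/LINUX machine only matches machines advertising the same Arch and OpSys. That is how a standard-universe job may start anywhere but can only resume on the platform where its checkpoint was taken.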
Submission steps
• Job preparation
• Choosing a universe
• Submit description file
• condor_submit
Job Migration
• SIGTSTP and the signal handler in the standard universe
• SIGTERM in the vanilla universe
Condor Security
• The schedd starts the shadow with the effective UID of the job owner
• Different authentication methods such as Kerberos and GSI, different encryption mechanisms, and authorization are supported between clients and daemons
• Sockets and ports: the Condor collector and negotiator start on well-known ports; other daemons start on ephemeral ports
Checkpointing
• The CkptArch, CkptOpSys, LastCkptServer, LastCkptTime, and NumCkpts ClassAd attributes are generated automatically for the job