
Checkpoint and Migration in the Condor Distributed Processing System


Presentation Transcript


  1. Checkpoint and Migration in the Condor Distributed Processing System Presentation by James Nugent CS739 University of Wisconsin Madison

  2. Primary Sources • J. Basney and M. Livny, "Deploying a High Throughput Computing Cluster", in High Performance Cluster Computing, Rajkumar Buyya, Editor, Vol. 1, Prentice Hall PTR, May 1999 • Condor web page: http://www.cs.wisc.edu/condor/ • M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny, "Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System", Computer Sciences Technical Report #1346, University of Wisconsin-Madison, April 1997 • Michael Litzkow, Miron Livny, and Matt Mutka, "Condor - A Hunter of Idle Workstations", Proceedings of the 8th International Conference on Distributed Computing Systems, pages 104-111, June 1988.

  3. Outline • Goals of Condor • Overview of Condor • Scheduling • Remote Execution • Checkpointing • Flocking

  4. Goals • Opportunistically use idle cycles for work • Remote machines have intermittent availability • Lower reliability, as cluster processing is not the primary function of the machines used • Portability • Only requires re-linking (non-relinked applications can't be checkpointed) • Remote machines are not a part of the system.

  5. Voluntary • People should see a benefit from using it, or at least no penalty for letting others use their machine • No security problems from using Condor, or from Condor using your machine • Low impact on the remote machine's performance.

  6. Implementation Challenges • Runs serial programs at remote locations without any source code changes to the program • The program must be re-linked • Programs may have to move to a different machine • Checkpointing allows process migration and provides reliability • Provides transparent access to remote resources.

  7. Basic Functioning-Condor Pool • [Diagram: a Condor pool of machines A, B, C, and D with a Central Manager; the CM selects machines A and B to run submitted jobs, while machine D is unavailable.]

  8. Basic Execution Model • [Diagram: source machine and remote machine, each running a Condor daemon] • 1. A shadow process is created on the source machine • 2. The job is sent to the remote machine • 3. The remote machine runs the job • 4. The job sends remote system calls to the shadow process on the source machine (see the sketch below).
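  The four steps above can be illustrated with a small, self-contained C sketch (not Condor code): the parent process plays the shadow on the source machine, the child plays the remote job, and a socketpair stands in for the network. The request format and the file name are invented for illustration.

    /* Sketch of the shadow round trip: the "job" ships an open request,
     * the "shadow" performs the call locally and returns the result. */
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/socket.h>
    #include <sys/wait.h>

    struct request { char path[64]; int flags; };

    int main(void)
    {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
            return 1;

        if (fork() == 0) {                        /* "remote job" (step 3) */
            struct request rq = { "/etc/hostname", O_RDONLY };
            int result;
            write(sv[1], &rq, sizeof rq);         /* step 4: send syscall to shadow */
            read(sv[1], &result, sizeof result);  /* wait for the result */
            printf("job: shadow returned fd %d\n", result);
            _exit(0);
        }

        struct request rq;                        /* "shadow" on the source machine */
        read(sv[0], &rq, sizeof rq);
        int fd = open(rq.path, rq.flags);         /* perform the call locally */
        write(sv[0], &fd, sizeof fd);             /* send the result back */
        wait(NULL);
        if (fd >= 0)
            close(fd);
        return 0;
    }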

  9. Scheduling • [Flowchart: starting from the CM's selection of a machine, if that machine has jobs to run and the next job can find a remote machine to run on, the job runs; otherwise the CM moves on.]

  10. Central Manager—Select a machine to have a job run • Priority(Machine) = Base Priority - (# Users) + (# Running Jobs) • (# Users) is the number of users who have jobs local to this machine, i.e. their jobs started on this machine • Lower values mean higher priority • Ties are broken randomly • The CM periodically receives information from each machine: • the number of remote jobs the machine is running • the number of remote jobs the machine wants to run • This is the only state about machines the CM stores, which reduces both the stored state and what machines must communicate to the CM (see the sketch below).
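  A minimal sketch of this selection rule, assuming hypothetical machine records; the field names and sample values are illustrative, not Condor's data structures:

    #include <stdio.h>
    #include <stdlib.h>

    struct machine_state {
        const char *name;
        int base_priority;   /* administrative base priority */
        int local_users;     /* users whose jobs started on this machine */
        int running_jobs;    /* remote jobs currently running here */
    };

    static int machine_priority(const struct machine_state *m)
    {
        /* Priority(Machine) = Base Priority - (# Users) + (# Running Jobs) */
        return m->base_priority - m->local_users + m->running_jobs;
    }

    int main(void)
    {
        struct machine_state pool[] = {
            { "A", 0, 2, 0 },
            { "B", 0, 0, 1 },
            { "C", 0, 1, 1 },
        };
        int n = (int)(sizeof pool / sizeof pool[0]);
        int best = 0;

        for (int i = 1; i < n; i++) {
            int pi = machine_priority(&pool[i]);
            int pb = machine_priority(&pool[best]);
            if (pi < pb || (pi == pb && rand() % 2))  /* lower value wins; random tie-break */
                best = i;
        }
        printf("CM grants capacity to machine %s\n", pool[best].name);
        return 0;
    }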

  11. Fair Scheduling • If user A submits 4 jobs and user B submits 1, then: • User A's jobs wait 1/5 of the time • User B's job must wait 4/5 of the time • This assumes there is insufficient capacity to schedule all jobs immediately • User B contributes as much computing power to the system as User A, yet receives far less service, which is what the up-down algorithm on the next slide addresses.

  12. Up-Down scheduling—Workstation priorities • Each machine has a scheduling index (SI) • Lower indexes have higher priority • The index goes up when a machine is granted capacity (its priority decreases) • The index goes down when a machine is denied access to capacity (its priority increases) • The index also decreases slowly over time (priority increases over time) (see the sketch below).
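  A sketch of the up-down index updates; only the directions of the changes come from the slide, so the increments and the decay rate below are placeholder constants, not Condor's actual parameters:

    #include <stdio.h>

    struct sched_index { double si; };            /* lower SI = higher priority */

    static void on_capacity_granted(struct sched_index *m) { m->si += 10.0; }  /* priority decreases */
    static void on_capacity_denied(struct sched_index *m)  { m->si -= 10.0; }  /* priority increases */
    static void on_time_tick(struct sched_index *m)        { m->si -= 1.0;  }  /* slow drift: priority rises over time */

    int main(void)
    {
        struct sched_index m = { 0.0 };
        on_capacity_granted(&m);
        on_time_tick(&m);
        on_capacity_denied(&m);
        printf("scheduling index is now %.1f\n", m.si);
        return 0;
    }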

  13. Host Machines • Priority(Job) = (UserPriority * 10) + Running(1|0) + (Wait Time / 1,000,000,000) • Running jobs consume disk space • Jobs are ordered first by user priority, then by whether they are running, and finally by wait time (see the sketch below).
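  A sketch of this ordering formula; the weights 10 and 1/1,000,000,000 simply make user priority dominate running status, which in turn dominates wait time. The function and parameter names are illustrative:

    #include <stdio.h>
    #include <stdbool.h>

    static double job_priority(int user_priority, bool running, double wait_seconds)
    {
        /* Priority(Job) = UserP*10 + Running(1|0) + WaitTime/1,000,000,000 */
        return user_priority * 10.0 + (running ? 1.0 : 0.0) + wait_seconds / 1e9;
    }

    int main(void)
    {
        printf("running job: %.9f\n", job_priority(1, true, 30.0));
        printf("waiting job: %.9f\n", job_priority(1, false, 600.0));
        return 0;
    }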

  14. ClassAds—Finding Servers for Jobs • ClassAds are tuples of attribute names and expressions, e.g. (Requirements = Other.Memory > 32) • Two ClassAds match if their Requirements attributes evaluate to true in the context of each other • Example (see the sketch below): • Job: (Requirements = Other.Memory > 32 && Other.Arch == "Alpha") • Machine: (Memory = 64), (Arch = "Alpha") • These two ClassAds would match.
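  A minimal sketch of the symmetric matching idea, using C function pointers in place of real ClassAd expression evaluation; the structures and names are hypothetical, not the actual ClassAd library:

    #include <stdbool.h>
    #include <string.h>
    #include <stdio.h>

    struct classad {
        int memory;                                   /* e.g. Memory = 64 (MB) */
        const char *arch;                             /* e.g. Arch = "Alpha" */
        bool (*requirements)(const struct classad *other);
    };

    static bool job_requirements(const struct classad *other)
    {
        /* Requirements = Other.Memory > 32 && Other.Arch == "Alpha" */
        return other->memory > 32 && strcmp(other->arch, "Alpha") == 0;
    }

    static bool machine_requirements(const struct classad *other)
    {
        (void)other;
        return true;   /* this machine accepts any job */
    }

    static bool ads_match(const struct classad *a, const struct classad *b)
    {
        /* Two ads match when each Requirements expression is true
         * in the context of the other ad. */
        return a->requirements(b) && b->requirements(a);
    }

    int main(void)
    {
        struct classad job     = { 0,  "",      job_requirements };
        struct classad machine = { 64, "Alpha", machine_requirements };
        printf("match: %s\n", ads_match(&job, &machine) ? "yes" : "no");
        return 0;
    }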

  15. Remote Execution-Goals • Execute a program on a remote machine as if it were executing on the local machine • The machines in a pool should not need to be identically configured in order to run remote jobs • The CM should only be involved in starting and stopping jobs, not while they are running.

  16. Resource Access • Remote resource access: • a shadow process runs on the local (submitting) machine • the Condor library sends system calls to the local machine • Thus the submitting machine pays a performance penalty for servicing its jobs' system calls, and there is extra network traffic • The user does not need an account on the remote machine; jobs execute as 'nobody' • Compensates for system differences • AFS, NFS: the shadow checks whether a shared file system can be used (i.e. is the file reachable? it looks up local mappings) • If the file can be reached via the file system, that is used instead of remote system calls, for performance (see the sketch below).
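  A sketch of the "is the file reachable here?" decision mentioned above; access() is only a stand-in for the real mount-table and mapping lookup the shadow performs:

    #include <unistd.h>
    #include <stdio.h>

    static int use_local_filesystem(const char *path)
    {
        /* If the path is reachable on this machine (e.g. via AFS/NFS),
         * prefer it; otherwise fall back to remote syscalls to the shadow. */
        return access(path, R_OK) == 0;
    }

    int main(void)
    {
        const char *path = "/etc/hostname";
        printf("%s: %s\n", path,
               use_local_filesystem(path) ? "open via the file system"
                                          : "open via remote syscalls to the shadow");
        return 0;
    }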

  17. Interposition-Sending Syscalls • [Diagram: remote job, Condor library, and the shadow process on the local machine] • The remote job calls, e.g., fd = open(args); • The Condor library's stub runs: send(open, args); receive(result); store(args, result); return result; • The shadow process on the local machine runs: receive(open, args); result = syscall(open, args); send(result); • A self-contained sketch of such a stub follows.
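  The sketch below assumes a Linux build: instead of shipping the call to a shadow, the wrapper forwards it to the kernel with syscall() and records the arguments the way a checkpoint library would need to. It is not Condor's actual remote-syscall code.

    /* Library-level interposition: this open() shadows libc's open when
     * linked into the program, just as Condor's re-linked library does. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdarg.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>   /* SYS_open exists on e.g. x86_64 Linux */

    static void remember_open(const char *path, int flags, int fd)
    {
        /* A real checkpoint library would store (path, flags, offset) so the
         * file can be re-opened and lseek()ed after a restart. */
        fprintf(stderr, "recorded: open(%s, %d) -> fd %d\n", path, flags, fd);
    }

    int open(const char *path, int flags, ...)
    {
        mode_t mode = 0;
        if (flags & O_CREAT) {              /* mode is only present with O_CREAT */
            va_list ap;
            va_start(ap, flags);
            mode = (mode_t)va_arg(ap, int);
            va_end(ap);
        }
        /* In Condor this would be send(open, args) / receive(result) to the
         * shadow process; here we issue the system call directly. */
        int fd = (int)syscall(SYS_open, path, flags, mode);
        if (fd >= 0)
            remember_open(path, flags, fd);
        return fd;
    }

    int main(void)
    {
        int fd = open("/etc/hostname", O_RDONLY);
        if (fd >= 0)
            close(fd);
        return 0;
    }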

  18. Limitations • File operations must be idempotent • Example: checkpoint, open a file for append, write to the file, then roll back to the previous checkpoint; after the rollback the write is replayed, which is why the operations must be idempotent • Some functionality is limited: • No socket communication • Only single-process jobs • Some signals are reserved for Condor's use (SIGUSR2, SIGTSTP) • No interprocess communication (pipes, semaphores, or shared memory) • Alarms, timers, and sleeping are not allowed • Multiple kernel threads are not allowed (user-level threads are OK) • Memory-mapped files are not allowed • File locks are not retained between checkpoints • Network communication must be brief • Communication defers checkpoints.

  19. When Users Return • [Flowchart] When the owner returns, the job is suspended • Wait 5 minutes; if the user is no longer using the machine, the job resumes execution • If the user is still using the machine, the job is vacated (checkpointing starts) • Wait 5 minutes; if the checkpoint is done, it migrates to another machine; otherwise the job is killed • (A sketch of this policy follows.)
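  A compact sketch of this policy as straight-line code; only the decision structure and the 5-minute waits come from the slide, so the checks are stubbed out and the actions merely printed:

    #include <stdbool.h>
    #include <stdio.h>

    /* Stubs standing in for the real checks, which in Condor come from
     * the machine's daemons. */
    static bool user_still_active(void)   { return true; }
    static bool checkpoint_finished(void) { return true; }
    static void wait_five_minutes(void)   { /* sleep(300) in a real daemon */ }

    static void on_user_return(void)
    {
        printf("suspend job\n");
        wait_five_minutes();
        if (!user_still_active()) {
            printf("job resumes execution\n");
            return;
        }
        printf("vacate job (start checkpointing)\n");
        wait_five_minutes();
        if (checkpoint_finished())
            printf("checkpoint migrates to another machine\n");
        else
            printf("kill job\n");
    }

    int main(void)
    {
        on_user_return();
        return 0;
    }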

  20. Goals of Checkpointing/Migration • Packages up a process, including its dynamic state • Does not require any special considerations on the programmer's part; done by a linked-in library • Provides fault tolerance & reliability.

  21. Checkpoint Considerations • Checkpoints must be scheduled to prevent excess checkpoint-server and network traffic • Example: many jobs submitted in a cluster with a fixed checkpoint interval would all checkpoint at the same time • Solution: have the checkpoint server schedule checkpoints to avoid excessive network use • Checkpoints use a transactional model (see the sketch below) • A checkpoint will either: • succeed, or • fail and roll back to the previous checkpoint • Only the most recent successful checkpoint is stored.
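  The transactional behaviour can be sketched with the classic write-to-temporary-then-rename idiom; the file names are illustrative and this is not Condor's checkpoint-server protocol:

    #include <stdio.h>

    /* Write the new checkpoint to a temporary file and atomically rename it
     * over the old one only if the write fully succeeds, so a failure leaves
     * the previous checkpoint intact. */
    static int commit_checkpoint(const char *image, size_t len)
    {
        const char *tmp = "job.ckpt.tmp", *final = "job.ckpt";
        FILE *f = fopen(tmp, "wb");
        if (!f)
            return -1;
        if (fwrite(image, 1, len, f) != len || fflush(f) != 0) {
            fclose(f);
            remove(tmp);           /* failed: roll back, old checkpoint survives */
            return -1;
        }
        fclose(f);
        return rename(tmp, final); /* commit: only the newest checkpoint is kept */
    }

    int main(void)
    {
        const char image[] = "fake checkpoint image";
        return commit_checkpoint(image, sizeof image) == 0 ? 0 : 1;
    }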

  22. Checkpointing—The Job Stack • Identify segments • Use the /proc interface to get segment addresses • Compare segment addresses to known data to determine which segment is which: • a global variable marks the data segment • a static function marks the text segment • the stack pointer and stack ending address mark the stack segments • all others are dynamic library text or data • Most segments can be stored with no special actions needed • The static text segment is the executable and does not need to be stored • Stack: • info saved by setjmp (includes registers & CPU state) • Signal state must also be saved, for instance if a signal is pending • [Address-space diagram: Text, Data, Heap, Shared Libraries, Condor Lib] • (A sketch of the setjmp and /proc steps follows.)
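  A sketch of the two pieces mentioned for the stack: capturing register and CPU state with setjmp() and using the /proc interface to see the segments. This uses Linux's /proc/self/maps for illustration; the paper's implementation used the /proc facilities of its target Unix systems, and a real checkpoint would also write out each writable segment and the signal state.

    #include <stdio.h>
    #include <setjmp.h>

    /* A global variable like this is also how the data segment is recognized. */
    static jmp_buf ckpt_env;   /* registers and stack pointer are saved here */

    static void take_checkpoint(void)
    {
        if (setjmp(ckpt_env) != 0) {
            /* Reached only if a restart longjmp()s back into the saved context. */
            printf("resumed from checkpoint\n");
            return;
        }
        FILE *maps = fopen("/proc/self/maps", "r");   /* Linux /proc interface */
        char line[256];
        while (maps && fgets(line, sizeof line, maps))
            fputs(line, stdout);                      /* text, data, heap, stack, libraries */
        if (maps)
            fclose(maps);
    }

    int main(void)
    {
        take_checkpoint();
        return 0;
    }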

  23. Restart—I • [Address-space diagram: Text, Data, Heap, Shared Libraries, Condor Lib] • A shadow process is created on the machine by the Condor daemon • It sets up the environment • Checks whether files must use remote syscalls or a networked file system • The shadow creates the segments • The shadow restores the segments.

  24. Restart—II • [Address-space diagram: Text, Data, Temp Stack, Stack, Heap, Shared Libraries, Condor Lib; kernel state: signals] • File state must be restored: lseek sets the file pointers to the correct positions • Blocked signals must be re-sent • The stack is restored • Control returns to the job; it appears like a return from a signal handler • (A sketch of the file-state restoration follows.)
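  A sketch of restoring one file's state after restart, as described above: re-open with the remembered flags, then lseek() back to the remembered offset. The bookkeeping structure is illustrative, not Condor's actual tables:

    #include <fcntl.h>
    #include <unistd.h>

    struct saved_file {
        const char *path;
        int flags;
        off_t offset;      /* file position recorded at checkpoint time */
    };

    static int restore_file(const struct saved_file *sf)
    {
        int fd = open(sf->path, sf->flags);
        if (fd < 0)
            return -1;
        if (lseek(fd, sf->offset, SEEK_SET) == (off_t)-1) {  /* reset the file pointer */
            close(fd);
            return -1;
        }
        return fd;   /* the caller would place this at the original descriptor number */
    }

    int main(void)
    {
        struct saved_file sf = { "/etc/hostname", O_RDONLY, 0 };
        int fd = restore_file(&sf);
        if (fd >= 0)
            close(fd);
        return 0;
    }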

  25. Flocking-Goals • Ties groups of Condor pools into, effectively, one large pool • No change to the CM • Users see no difference between a job running in the flock and in their local pool • Fault tolerance: a failure in one pool should not affect the rest of the flock • Allows each group to maintain authority over its own machines • Having one large pool, instead of a flock, could overload the central manager.

  26. Flocking • [Diagram: several Condor pools, each with its own Central Manager (CM), World Machine (WM), and machines (M), connected over the network; a possible job assignment crosses pool boundaries.]

  27. Flocking--Limitations • The World Machine uses the normal priority system • Machines can run flock jobs when local capacity is available • The World Machine cannot be used to actually run jobs • The World Machine can only present itself as having one set of hardware • The World Machine can only receive one job per scheduling interval, no matter how many machines are available.

  28. Bibliography • M.L. Powell and B.P. Miller, "Process Migration in DEMOS/MP", 9th Symposium on Operating Systems Principles, Bretton Woods, NH, October 1983, pp. 110-119. • E. Zayas, "Attacking the Process Migration Bottleneck", 11th Symposium on Operating Systems Principles, Austin, TX, November 1987, pp. 13-24. • Michael Litzkow, Miron Livny, and Matt Mutka, "Condor - A Hunter of Idle Workstations", Proceedings of the 8th International Conference on Distributed Computing Systems, pages 104-111, June 1988. • Matt Mutka and Miron Livny, "Scheduling Remote Processing Capacity In A Workstation-Processing Bank Computing System", Proceedings of the 7th International Conference on Distributed Computing Systems, pp. 2-9, September 1987. • Jim Basney and Miron Livny, "Deploying a High Throughput Computing Cluster", High Performance Cluster Computing, Rajkumar Buyya, Editor, Vol. 1, Chapter 5, Prentice Hall PTR, May 1999. • Miron Livny, Jim Basney, Rajesh Raman, and Todd Tannenbaum, "Mechanisms for High Throughput Computing", SPEEDUP Journal, Vol. 11, No. 1, June 1997.

  29. Bibliography II • Jim Basney, Miron Livny, and Todd Tannenbaum, "High Throughput Computing with Condor", HPCU News, Volume 1(2), June 1997. • D. H. J. Epema, Miron Livny, R. van Dantzig, X. Evers, and Jim Pruyne, "A Worldwide Flock of Condors: Load Sharing among Workstation Clusters", Journal on Future Generations of Computer Systems, Volume 12, 1996. • Scott Fields, "Hunting for Wasted Computing Power", 1993 Research Sampler, University of Wisconsin-Madison. • Jim Basney and Miron Livny, "Managing Network Resources in Condor", Proceedings of the Ninth IEEE Symposium on High Performance Distributed Computing (HPDC9), Pittsburgh, Pennsylvania, August 2000, pp. 298-299. • Jim Basney and Miron Livny, "Improving Goodput by Co-scheduling CPU and Network Capacity", International Journal of High Performance Computing Applications, Volume 13(3), Fall 1999. • Rajesh Raman and Miron Livny, "High Throughput Resource Management", Chapter 13 in The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, San Francisco, California, 1999. • "Matchmaking: Distributed Resource Management for High Throughput Computing", Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, July 28-31, 1998, Chicago, IL.

  30. Communication & Files • File state is saved by the overloaded library calls (the syscall stubs) • Signal state must also be saved • Communication channels are flushed and their state stored • If only one endpoint is checkpointable: • communication with only one checkpointed endpoint goes through a 'switchboard' process • Example: a license server • The switchboard buffers communication until the application is restarted; this may be a problem if the license server expects prompt replies.

  31. Basic Problems • How are jobs scheduled on remote machines? • How are jobs moved around? • How are issues such as portability and user impact dealt with?

  32. Basic Functioning-Condor Pool • [Diagram: the Central Manager selects machines A & B; the job, linked with the Condor library, is sent from host machine A over the network to the remote machines.]

  33. Checkpointing--Process • Identify segments via the /proc interface • Compare to known data to determine which segment is which: • a global variable marks the data segment • a static function marks the text segment • the stack pointer and stack ending address mark the stack segments • all others are dynamic library text or data • Most segments can simply be written out • Exception: the static text segment is the executable itself • Stack: info saved by setjmp (includes registers & CPU state).

  34. Additional Scheduling • Scheduling is affected by parallel PVM applications • (The master runs on the submit machine and is never preempted) • Workers are scheduled opportunistically • Since workers contact the master, this avoids the problem of having two mobile endpoints (the workers always know where the master is) • Jobs can be ordered in a directed acyclic graph.
