DMTCP is a robust Linux checkpointing tool developed at Northeastern U. and MIT. It supports sequential and multi-threaded computations, transparently working in user space without kernel modules. This checkpoint/restart system eliminates common restrictions like no pthreads and no mmap() support. DMTCP is widely available under LGPL license, with no recompiling required. Checkpoint compression on-the-fly, a stateless synchronization server, and additional wrappers for process and thread virtualization are among its key features. It can seamlessly work with various applications like MPICH-2, OpenMPI, Python, and more. Planned support includes applications like Bash and Matlab. The integration with Condor is experimental, aiming to explore scalability and stability.
DMTCP: A New Linux Checkpointing Mechanism For Vanilla Universe Jobs
Why DMTCP? • Why checkpoint at all? • Problems with Condor’s Standard Universe • Single process. • No pthreads. • No mmap() support. • Forced re-link to form a static executable. • DMTCP removes these restrictions!
What is DMTCP? • Distributed Multi-Threaded CheckPointing. • Works with Linux Kernel 2.6.9 and later. • Supports sequential and multi-threaded computations across single/multiple hosts. • Entirely in user space (no kernel modules or root privilege). • Transparent (no recompiling, no re-linking). • Written at Northeastern U. and MIT and under active development for 4+ years. • LGPL’d and freely available. • No Remote I/O.
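Process Structure • [Diagram: a coordinator connected over network sockets to Process 1 … Process N; each process contains a DMTCP checkpoint thread (CT) and user threads (T1, T2); signal (USR2) is shown inside the process.] • CT = DMTCP checkpoint thread • T = User thread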
How Does It Work? • ./dmtcp_checkpoint a.out # starts coordinator too • ./dmtcp_command -c # talks to coordinator • ./dmtcp_restart ckpt_a.out-*.dmtcp • Coordinator is a stateless synchronization server for the distributed checkpointing algorithm. • Checkpoint/restart performance depends on the size of memory, disk write speed, and synchronization.
How Does It Work? • LD_PRELOAD: Transparently preloads the checkpoint libraries, which install libc wrappers and checkpointing code. • SIGUSR2: Used internally from the checkpoint thread to the user threads. • Wrappers: Only on less heavily used calls to libc • fork, exec, system, pipe, bind, listen, setsockopt, connect, accept, clone, close, ptsname, openlog, closelog, signal, sigaction, sigvec, sigblock, sigsetmask, sigprocmask, rt_sigprocmask, pthread_sigmask • Overhead is negligible.
How Does It Work? • Additional wrappers when process id & thread id virtualization is enabled • getpid, getppid, gettid, tcgetpgrp, tcsetpgrp, getpgrp, setpgrp, getsid, setsid, kill, tkill, tgkill, wait, waitpid, waitid, wait3, wait4
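Why the pid-related calls need wrappers can be illustrated with a toy model. After a restart the kernel assigns new process ids, but the application still holds the ids it saw before the checkpoint, so each wrapped call has to translate between the two. The sketch below is only a conceptual model: the `PidTable` class, its method names, and the choice to use the original pid as the virtual pid are assumptions made for illustration, not DMTCP's actual data structures.

```python
import os
import signal


class PidTable:
    """Toy virtual-to-real pid map (hypothetical; not DMTCP's implementation)."""

    def __init__(self):
        self._real_by_virtual = {}

    def register(self, real_pid):
        # Assumption: the pid seen before the first checkpoint doubles as
        # the virtual pid that the application keeps using forever.
        self._real_by_virtual.setdefault(real_pid, real_pid)
        return real_pid

    def rebind(self, virtual_pid, new_real_pid):
        # Called after a restart, when the kernel has handed out a new pid.
        self._real_by_virtual[virtual_pid] = new_real_pid

    def to_real(self, virtual_pid):
        return self._real_by_virtual[virtual_pid]

    def to_virtual(self, real_pid):
        for virtual, real in self._real_by_virtual.items():
            if real == real_pid:
                return virtual
        raise KeyError(real_pid)


def virtual_kill(table, virtual_pid, sig, _kill=os.kill):
    """Wrapper in the spirit of a kill() wrapper: translate, then call through."""
    return _kill(table.to_real(virtual_pid), sig)
```

The `_kill` parameter exists only so the translation can be exercised without signalling a real process.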
How Does It Work? • Checkpoint image compression on-the-fly (default). • Currently only supports dynamically linking to libc.so. Support for static libc.a is feasible, but not implemented. • Stays close to POSIX API standards.
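"On-the-fly" compression means the image is compressed while it is being written, rather than written in full and compressed afterwards. A minimal sketch of that idea follows. The gzip container and the `write_compressed` helper are assumptions for illustration; the slides do not say which compressor or file format DMTCP uses.

```python
import zlib


def write_compressed(chunks, out):
    """Stream byte chunks (e.g. memory regions) into `out`, compressing as we go.

    Hypothetical helper: only one chunk at a time is held uncompressed.
    """
    # wbits=31 selects a gzip header and trailer (an assumption, see above).
    compressor = zlib.compressobj(wbits=31)
    for chunk in chunks:
        out.write(compressor.compress(chunk))
    # Flush whatever the compressor is still buffering.
    out.write(compressor.flush())
```

Any binary file object works as `out`, for example a file opened with `open(path, "wb")`.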
A Checkpoint Under DMTCP • dmtcphijack.so & mtcp.so present in executable’s memory. • Ask coordinator process for checkpoint via dmtcp_command. • Now what happens?
A Checkpoint Under DMTCP • Suspend user threads with SIGUSR2. • Elect shared file descriptor leaders. • Drain kernel buffers and do network handshake with peers. • Write checkpoint to disk. • Refill kernel buffers. • Resume user threads.
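The drain and refill steps can be modelled with a connected socket pair. In this toy version the sender pushes a marker through the connection, the receiver reads until it sees the marker (so nothing is left in the kernel buffer), and after the checkpoint the receiver hands the saved bytes back so the sender can put them in flight again. The marker value, the function names, and the single-direction, small-payload simplification are assumptions for illustration; this is not DMTCP's code.

```python
import socket

COOKIE = b"<<DRAIN-COOKIE>>"  # hypothetical end-of-data marker


def drain(sender, receiver):
    """Empty the kernel buffer between sender and receiver; return the bytes."""
    sender.sendall(COOKIE)
    data = b""
    while not data.endswith(COOKIE):
        chunk = receiver.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed while draining")
        data += chunk
    return data[: -len(COOKIE)]


def refill(sender, receiver, drained):
    """Put drained bytes back in flight so the application never notices."""
    # The receiver returns the saved bytes to the sender...
    receiver.sendall(drained)
    returned = b""
    while len(returned) < len(drained):
        chunk = sender.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed while refilling")
        returned += chunk
    # ...and the sender retransmits them into the kernel buffer.
    # Toy limitation: assumes the payload fits in the socket buffer.
    sender.sendall(returned)
```

After `refill`, an ordinary `recv` on the receiving end sees the same bytes it would have seen had no checkpoint taken place.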
Where Is the Checkpoint? • In the cwd of the application. • A set of ckpt_<exec>_<id>.dmtcp files. • In the cwd of the coordinator. • A dmtcp_restart_script.sh file. • The dmtcp_restart_script.sh may need tweaking depending upon circumstance.
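A small helper can gather the pieces needed for a restart from the two directories named above. This is a sketch under assumptions: `find_checkpoint` is a hypothetical name, and the pattern `ckpt_*.dmtcp` is chosen only because it matches the file names shown in these slides.

```python
from pathlib import Path


def find_checkpoint(app_cwd, coordinator_cwd):
    """Return (checkpoint image paths, restart script path or None)."""
    # Checkpoint images are written to the application's working directory.
    images = sorted(Path(app_cwd).glob("ckpt_*.dmtcp"))
    # The restart script is written to the coordinator's working directory.
    script = Path(coordinator_cwd) / "dmtcp_restart_script.sh"
    return images, (script if script.is_file() else None)
```

The two arguments may name the same directory when the coordinator and the application were started from the same place.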
A Restart Under DMTCP • Restart Process loads in memory. • Reopen files and recreate ptys. • Recreate and reconnect sockets. • Fork into user processes. • Rearrange file descriptors to initial layout. • Restore memory and threads. • Refill kernel buffers. • Resume user threads.
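The "rearrange file descriptors" step matters because the application remembers descriptor numbers, not the objects behind them. One way to move reopened descriptors back to their original numbers is sketched below, assuming a Unix system. The `rearrange_fds` function and its two-stage approach are illustrative assumptions, not DMTCP's implementation.

```python
import fcntl
import os


def rearrange_fds(current_to_original):
    """Move each open descriptor to the number it had at checkpoint time.

    `current_to_original` maps the fd number in use now to the number the
    application expects. Hypothetical helper for illustration.
    """
    if not current_to_original:
        return
    # Stage 1: park every descriptor above all numbers involved, so that
    # stage 2 cannot overwrite a descriptor that is still waiting to move.
    floor = max(list(current_to_original) + list(current_to_original.values())) + 1
    parked = {}
    for current, original in current_to_original.items():
        temporary = fcntl.fcntl(current, fcntl.F_DUPFD, floor)
        os.close(current)
        parked[temporary] = original
    # Stage 2: drop each parked descriptor onto its original number.
    # Note: dup2 silently replaces whatever was open at that number.
    for temporary, original in parked.items():
        os.dup2(temporary, original)
        os.close(temporary)
```

Parking first makes swaps and longer cycles safe, at the cost of one extra duplicate per descriptor.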
Supported OS Features • Threads, mutexes/semaphores, fork, exec • Shared memory (via mmap), TCP/IP sockets, UNIX domain sockets, pipes, ptys, terminal modes, ownership of controlling terminals, signal handlers, open and/or shared fds, I/O (including the readline library), parent-child process relationships, process id & thread id virtualization, session and process group ids, and more… • Trying to keep the implementation small!
Supported Applications • MPICH-2, OpenMPI, SciPy/IPython, Python • cmsRun, Perl, Ruby, PHP, GHCi (Glasgow Haskell Compiler), OCaml, Octave, Macaulay2, gnuplot, slsh (S-Lang scripts), MzScheme, GST (GNU Smalltalk virtual machine), tcsh, dash, csh, tclsh (Tcl-based interpreter), SQLite. • And many others!
Planned Application Support • Bash, gcl (GNU Common Lisp), maxima (based on gcl), and the Sun JVM. • These programs use sbrk() for their own memory management and induce a bug in DMTCP. • A fix is planned and will go in soon.
Planned Application Support • Matlab • Directly calling the binary without graphics works, but Matlab uses bash, which needs the sbrk() fix.
Condor/DMTCP Integration • Experimental at this time. • Determining scalability, stability, and extent of “weird edge cases” of DMTCP mixed with Condor. • Completely outside of Condor source code. • A vanilla job called “shim_dmtcp” that wraps the user’s job and stdfiles with DMTCP. • A submit description file which transfers needed dmtcp files over to the remote side and saves intermediate checkpoints. • No remote I/O!
Shim Script Execution • [Diagram showing condor_starter, shim_dmtcp, the job, and the coordinator.]
Submit File Example
universe = vanilla
executable = shim_dmtcp
arguments = logfile stdinf stdoutf stderrf a.out arg0 arg1…
should_transfer_files = YES
when_to_transfer_output = ON_EVICT_OR_EXIT
transfer_input_files = <dmtcp libraries and programs>,\
    a.out, stdinf, stdoutf, stderrf
environment = DMTCP_TMPDIR=./;JALIB_STDERR_PATH=/dev/null
kill_sig = 2
output = shim.$(Cluster).$(Process).out
error = shim.$(Cluster).$(Process).err
log = shim.log
queue
Condor/DMTCP Integration • Early Results • It works with our test case and thousands of jobs. • Problems • Checkpointing between PAE (Physical Address Extension) kernels and normal kernels is a challenge. • DMTCP’s API needs some improvement. • Coordinator failure means job failure. • Shim script is clunky, e.g. no streaming I/O. • Next: Integration into our stduniv test suite for full regression testing.
Future Condor Integration • Add WantCheckpoint = True and CheckpointMethod = DMTCP for a vanilla universe job. • Condor takes care of the wrapping of the job with DMTCP and transferal of needed DMTCP files--no shim script voodoo. • Condor should honor CheckpointPlatform for Vanilla universe jobs in case of pool segmentation. • Parallel universe support with single coordinator. • Doug Thain’s Parrot for remote I/O.
Challenges • C/C++ runtime library compatibility issues. • Recompile DMTCP on slot before job execution? • Dynamic library incompatibilities. • No Checkpoint Server. • Condor file transfer protocol enhancement? • Debugging methods and practices?
Further Reading • “DMTCP: Transparent Checkpointing for Cluster Computation and the Desktop” • http://arxiv.org/abs/cs/0701037 • Source Code • http://dmtcp.sourceforge.net
Questions? • DMTCP • http://dmtcp.sourceforge.net • Gene Cooperman: gene@ccs.neu.edu • Condor/DMTCP Integration • Pete Keller: psilord@cs.wisc.edu • Ask me if you want to try the Alpha Version out!