Process Manager Specification • Rusty Lusk • lusk@mcs.anl.gov • 1/15/04
Outline • Process Manager Functionality • Expected Consumers • Commands • Semantics • Examples • Schema
Process Manager Functionality • Process Execution • Start process groups • Provide status information during execution • Provide command output and error messages • Return exit status information • Process Group Control • Kill process groups • Signal process groups
Expected Consumers • Components which execute programs • Components which need to locate running processes • Components which need to control running processes
Schematic of Process Management Component in Scalable Systems Software Context • [Diagram: SSS components (NSM, SD, Sched, EM, QM, PM) exchanging SSS XML on the SSS side; a ring of MPD's with mpdrun on the prototype MPD-based implementation side; entry points include mpiexec (MPI Standard args), simple scripts or hairy GUIs using SSS XML, an XML file in QM's job submission language, and interactive use]
Commands • <create-process-group> - creates a new process group • <get-process-group-info> - get status information; includes current process ids, exit status information, and stdout/err information • <signal-process-group> - send a Unix signal to all processes in a process group • <kill-process-group> - kill all processes in a process group • <del-process-group-info> - allow the process manager to discard process group information after the process group has exited • All commands use the restriction syntax (a sketch of the matching rule follows below)
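The restriction syntax is illustrated by the queries on the Examples slides that follow. As a rough guide, the Python sketch below shows one plausible reading of the matching rule; the helper name and the exact semantics are assumptions, not part of the specification text shown here.

# Hypothetical sketch of restriction-syntax matching: a query element selects
# every record whose attributes equal all of the query's non-wildcard values.
def matches(query_attrs, record_attrs):
    return all(record_attrs.get(key) == value
               for key, value in query_attrs.items() if value != '*')

# <process-group user='desai' pgid='*'/> selects any of desai's process groups.
assert matches({'user': 'desai', 'pgid': '*'},
               {'user': 'desai', 'pgid': '12', 'totalprocs': '4'})
# A query for user 'lusk' does not select desai's groups.
assert not matches({'user': 'lusk', 'pgid': '*'},
                   {'user': 'desai', 'pgid': '12'})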
Examples
<create-process-group submitter='desai' totalprocs='4' pgid='*' output='merge'>
  <process-spec exec='/bin/cpi' cwd='/' path='/bin:/usr/bin'>
    <host-spec>node1 node2</host-spec>
  </process-spec>
</create-process-group>
Examples (continued)
<get-process-group-info>
  <process-group user='desai' pgid='*'/>
  <process-group user='lusk' pgid='*'/>
  <process-group pgid='*'>
    <process host='node4' pid='*'/>
  </process-group>
</get-process-group-info>
Examples (continued) Response:
<process-group-info>
  <process-groups>
    <process-group user='desai' pgid='12'/>
    <process-group user='desai' pgid='16'/>
    <process-group pgid='24'>
      <process host='node4' pid='15423'/>
      <process host='node4' pid='2523'/>
    </process-group>
  </process-groups>
</process-group-info>
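For completeness, the remaining commands follow the same restriction pattern as the queries above. The Python sketch below builds example <signal-process-group>, <kill-process-group>, and <del-process-group-info> messages with the standard xml.etree library; the placement and name of the 'signal' attribute are assumptions, since the slides only name the commands.

import xml.etree.ElementTree as ET

def restriction(tag, **attrs):
    # Build a restriction-syntax element such as <process-group pgid='12'/>.
    return ET.Element(tag, {key: str(value) for key, value in attrs.items()})

# Send SIGUSR1 to all of desai's process groups (the 'signal' attribute is an assumption).
sig = ET.Element('signal-process-group', {'signal': 'SIGUSR1'})
sig.append(restriction('process-group', user='desai', pgid='*'))

# Kill every process in process group 12.
kill = ET.Element('kill-process-group')
kill.append(restriction('process-group', pgid='12'))

# Allow the process manager to discard records for the exited group 12.
delete = ET.Element('del-process-group-info')
delete.append(restriction('process-group', pgid='12'))

for message in (sig, kill, delete):
    print(ET.tostring(message, encoding='unicode'))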
Chiba City • Medium-sized cluster at Argonne National Laboratory • 256 dual-processor 500MHz PIII’s • Myrinet • Linux (and sometimes others) • No shared file system, for scalability (but now a test platform for PVFS2) • Dedicated to Computer Science scalability research, not applications • Many groups use it as a research platform • Both academic and commercial • Also used by friendly, hungry applications • New requirement: support research requiring specialized kernels and alternate operating systems, for OS scalability research
New Challenges • Want to schedule jobs that require node rebuilds (for new OS’s, kernel module tests, etc.) as part of “normal” job scheduling • Want to build larger virtual clusters (using VMware or User Mode Linux) temporarily, as part of “normal” job scheduling • Requires major upgrade of Chiba City systems software
Chiba Commits to SSS • Fork in the road (August 2003): • Major overhaul of old Chiba systems software (OpenPBS + Maui scheduler + homegrown stuff), OR • Take great leap forward and bet on all-new software architecture of SSS • Problems with leaping approach: • SSS interfaces not finalized • Some components don’t yet use library (implement own protocols in open code, not encapsulated in library) • Some components not fully functional yet • Solutions to problems: • Collect components that are adequately functional and integrated (PM, SD, EM, BCM) • Write “stubs” for other critical components (Sched, QM) • Do without some components (CKPT, monitors, accounting) for the time being
Features of Adopted Solution • Stubs adequate, at least for the time being • Scheduler does FIFO + reservations + backfill, improving • QM implements “PBS compatibility mode” (accepts user PBS scripts) as well as asking Process Manager to start parallel jobs directly (see the sketch after this slide) • Process Manager wraps MPD-2 • Single ring of MPD’s runs as root, managing all jobs for all users • MPD’s started by Build-and-Config manager at boot time • An MPI program called MPISH (MPI Shell) wraps user jobs to handle file staging and multiple job steps • Python implementation of most components • Demonstrated feasibility of using SSS component approach to systems software • Running normal Chiba job mix for over five months now • Moving forward on meeting new requirements for research support
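As a concrete illustration of how a consumer such as the QM stub might hand work to the Process Manager, here is a minimal Python sketch that submits the create-process-group document from the earlier example. The host, port, and raw-socket one-message-per-connection framing are assumptions for illustration only; the actual prototype locates the PM through the SSS infrastructure (e.g., the Service Directory) rather than a fixed address.

import socket
import xml.etree.ElementTree as ET

# Hypothetical endpoint; real components discover the PM rather than hard-coding it.
PM_HOST, PM_PORT = 'localhost', 8101

CREATE = """<create-process-group submitter='desai' totalprocs='4' pgid='*' output='merge'>
  <process-spec exec='/bin/cpi' cwd='/' path='/bin:/usr/bin'>
    <host-spec>node1 node2</host-spec>
  </process-spec>
</create-process-group>"""

def send_request(xml_text):
    # Send one SSS XML request and read the reply until the PM closes the connection.
    with socket.create_connection((PM_HOST, PM_PORT)) as sock:
        sock.sendall(xml_text.encode())
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return ET.fromstring(b''.join(chunks).decode())

response = send_request(CREATE)
print('Process Manager replied with <%s>' % response.tag)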
Next Steps • Integrate other components into this structure • Integrate other instantiations of components into this structure • Replace stubs as possible • Easiest if they use same XML API’s • Put “unusual” capabilities into production • Rebuilding nodes on the fly