Toward Global HPC Platforms • Gabriel Mateescu Research Computing Support Group National Research Council Canada Gabriel.Mateescu@nrc.ca www.sao.nrc.ca/~gabriel/hpcs
Agenda • HPC applications for computational grids • Building Legion parallel programs • File staging and job running • MPI Performance • Job scheduling • PBS interoperability • Conclusions
Wide Area Parallel Programs • Some HPC applications benefit from running under Legion, namely those that: • tolerate high latencies • consist of many loosely coupled subtasks • use or generate very large amounts of data, e.g., terabytes • Examples: • parameter space studies • Monte Carlo simulations • particle physics, astronomy • Avaki HPC is the new brand name of the Legion grid middleware
Build Legion MPI Programs • Sample makefile for building and registering the MPI program called myprog • CC = cc • CFLAGS = -mips4 • myprog: myprog.c • $(CC) $(CFLAGS) -I$(LEGION)/include/MPI -c $? • legion_link -o $@ $@.o -mpi • legion_mpi_register $@ ./$@ $(LEGION_ARCH) • One must use Legion's mpi.h, found in $(LEGION)/include/MPI • Build with the command make myprog
Remote Build • Register the makefile for each architecture • legion_register_program legion_makefile \ • ./makefile sgi • legion_register_program legion_makefile \ • ./makefile linux • Run make remotely on a selected architecture • legion_run -a sgi \ • -IN myprog.c \ • -OUT myprog legion_makefile
Native MPI Programs • The host must be configured to run native MPI programs • Create the class /class/legion_native_mpi_backend • legion_native_mpi_init $LEGION_ARCH • Configure the host to invoke the native mpirun • legion_native_mpi_config_host hosts/nickel \ $LEGION/bin/${LEGION_ARCH}/legion_native_mpisgi_wrapper
Build Native MPI Programs • Sample makefile (compiles and links against the native MPI library) • CC = cc • CFLAGS = -mips4 • myprog_native: myprog.c • $(CC) $(CFLAGS) -o $@ $? -lmpi • legion_native_mpi_register $@ ./$@ $(LEGION_ARCH) • Run the program under Legion • legion_native_mpi_run -n 4 mpi/programs/myprog
File Staging • Stage input files with the -IN option and output files with the -OUT option of legion_run and legion_mpi_run • The names of the files created by the program must be known in advance • An options file is useful for repeated legion_run invocations • -IN par.in -OUT res.out • Wildcards are not allowed in the options file • A specification file is useful with legion_run_multi • # keyword(IN,OUT) file_name pattern • IN par.in /mydir/par*.in • OUT res.out /mydir/res*.out
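The staging rules above lend themselves to a quick local check before a job is submitted. The following is a minimal POSIX shell sketch, not a Legion tool: the helper names `check_opt_file` and `check_spec_file` are hypothetical, and it assumes the options file holds plain staging options and the specification file holds `#` comments or three-field `KEYWORD file_name pattern` lines, as on this slide.

```sh
# Sketch: pre-flight checks for Legion staging files (hypothetical helpers).

# check_opt_file FILE
# Fails if the options file is unreadable or contains a wildcard character
# (*, ? or [ ]), since wildcards are not allowed in options files.
check_opt_file() {
    if [ ! -r "$1" ]; then
        echo "error: cannot read $1" >&2
        return 1
    fi
    if grep -q '[][*?]' "$1"; then
        echo "error: wildcard in options file $1" >&2
        return 1
    fi
    return 0
}

# check_spec_file FILE
# Fails if any line that is neither blank nor a "#" comment lacks exactly
# three fields (keyword, file_name, pattern) with keyword IN or OUT.
check_spec_file() {
    awk '
        /^[ \t]*#/ || NF == 0 { next }
        NF != 3 || ($1 != "IN" && $1 != "OUT") {
            printf "error: %s line %d: %s\n", FILENAME, FNR, $0 > "/dev/stderr"
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

For example, `check_opt_file opt_file && legion_run -f opt_file home/gabriel/myprog` stops before submission if a pattern such as `par*.in` has crept into the options file.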
…File Staging • For MPI programs, use -a to get output files from all processes • Examples • legion_mpi_run -n 4 -a -OUT file.out mpi/programs/myprog • legion_run -OUT file.out home/gabriel/myprog • legion_run -f opt_file home/gabriel/myprog • legion_run_multi -f spec_file -n 2 \ home/gabriel/myprog
Capturing Standard Output • Method 1: use a tty object • legion_tty tty1 • legion_mpi_run -n 2 home/gabriel/myprog_run • legion_tty_off • Method 2: redirect standard output to a file • Options -A -STDOUT to legion_mpi_run • Options -stdout (to redirect) and -OUT (to copy back) to legion_run • legion_mpi_run -n 4 -A -STDOUT out mpi/programs/myprog • legion_run -stdout std.out -OUT std.out myprog_run
Debugging • View MPI program instances • legion_context_list mpi/instances/myprog • Probe MPI jobs with legion_mpi_probe • legion_mpi_probe mpi/instances/myprog • Trace Legion MPI jobs with legion_mpi_debug • legion_mpi_debug -q -c mpi/instances/myprog • Trace the execution of commands with the -v (verbose) option • legion_mpi_run -n 8 -v -v -v mpi/programs/myprog
Debugging • Find where an object is located with legion_whereis • Look for the process running the cached binary • ps -ef | grep $LEGION_OPR/Cached-myprog-Binary-version • Probe jobs • legion_run -nonblock -p probe_file myprog • legion_probe_run -p probe_file -statjob • legion_probe_run -p probe_file -kill • Some commands accept the -debug option • Error messages are not always helpful, and interpreting them may require knowledge of Legion internals
MPI Performance • Platform: SGI Origin 2000 • 4 x R12K 400 MHz CPUs • L1 instruction and data caches: 32 KB each • L2 cache: 8 MB • Main memory: 2 GB • Average latency from processor 0 to the other three processors • Native MPI: ~8 microseconds • Legion MPI: ~1900 microseconds
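To put the two latency figures above side by side, their ratio is simple arithmetic:

$$\frac{T_{\text{Legion MPI}}}{T_{\text{native MPI}}} \approx \frac{1900\ \mu\text{s}}{8\ \mu\text{s}} \approx 238$$

That is, on this machine Legion MPI latency is roughly 240 times the native MPI latency.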
Host Types • Interactive host: Legion starts a job on an interactive host as a time-shared job • Batch queue host: Legion submits the job to a batch queuing and scheduling system, such as PBS • Determine the type of a host with the command • legion_list_attributes -c hosts/nickel \ • host_property host_queue_type • Attribute host_property has the value 'interactive' or 'queue'
Interactive Host Scheduling • [Figure: Scheduler, Enactor, and Collection components; hosts hostA, hostB, hostC]
Interactive Host Scheduling • Legion can pick a set of hosts for running a job, but its scheduling algorithms do not appear to be very good • Legion may split a parallel job between two hosts even when a single SMP has enough resources to run it • The user can create a host file specifying candidate hosts • cat hf_monet • /hosts/monet.CERCA.UMontreal.CA 4 • /hosts/nickel 1 • The -HF option to legion_mpi_run specifies the host file • legion_mpi_run -n 2 -HF hf_monet mpi/programs/hello
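The SMP-splitting problem above can be sidestepped by handing Legion a host file with a single entry. Below is a minimal shell sketch under two assumptions not stated on the slide: the second column of a host file is the number of processes Legion may place on that host, and the candidates come from a hypothetical user-maintained file of `host_context cpu_count` lines. The function name `single_smp_hostfile` is likewise hypothetical.

```sh
# single_smp_hostfile N FILE
# FILE (hypothetical format): one "host_context cpu_count" pair per line.
# Prints a one-line host file that places all N processes on the smallest
# host with at least N CPUs, or returns 1 if no single host is large enough.
single_smp_hostfile() {
    awk -v n="$1" '
        NF >= 2 && $2 + 0 >= n + 0 && (best == "" || $2 + 0 < bestcpus) {
            best = $1
            bestcpus = $2 + 0
        }
        END {
            if (best == "") exit 1
            print best, n
        }
    ' "$2"
}
```

For example, `single_smp_hostfile 2 candidates > hf_monet` followed by `legion_mpi_run -n 2 -HF hf_monet mpi/programs/hello` would, under these assumptions, keep both processes on one host. Choosing the smallest host that fits leaves larger SMPs free for bigger jobs.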
Interactive Host Scheduling • legion_make_schedule converts a performance description of the architectures into a host file for Legion scheduling • % cat perf • sgi 2.0 • % legion_make_schedule -f perf mpi/programs/myprog \ • > hf_file • The file hf_file can then be passed to legion_mpi_run with the -HF option • % legion_mpi_run -n 2 -HF hf_file mpi/programs/myprog
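The slides do not describe the algorithm inside legion_make_schedule. Purely to illustrate the idea of turning relative speeds into a host file, the shell sketch below splits N processes across hosts in proportion to a speed weight using the largest-remainder method. The function `weighted_hostfile` and its input format (`host_context relative_speed` per line, rather than the per-architecture `sgi 2.0` form above) are hypothetical.

```sh
# weighted_hostfile N FILE
# FILE (hypothetical format): one "host_context relative_speed" pair per line.
# Splits N processes across the hosts in proportion to relative speed using
# the largest-remainder method and prints "host_context count" lines for
# every host that receives at least one process.
weighted_hostfile() {
    awk -v n="$1" '
        NF >= 2 { k++; host[k] = $1; w[k] = $2 + 0; total += $2 }
        END {
            if (k == 0 || total <= 0) exit 1
            given = 0
            for (i = 1; i <= k; i++) {
                share = n * w[i] / total
                cnt[i] = int(share)       # whole processes first
                rem[i] = share - cnt[i]   # leftover fraction
                given += cnt[i]
            }
            # Hand out the remaining processes, one each, to the hosts
            # with the largest leftover fractions.
            while (given < n) {
                best = 1
                for (i = 2; i <= k; i++)
                    if (rem[i] > rem[best]) best = i
                cnt[best]++
                rem[best] = -1
                given++
            }
            for (i = 1; i <= k; i++)
                if (cnt[i] > 0) print host[i], cnt[i]
        }
    ' "$2"
}
```

With two hosts weighted 2.0 and 1.0, four processes are split as 3 and 1. The output has the same `host_context count` shape as hf_monet, so it could be passed to `-HF`, assuming that reading of the host file columns.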
Batch Scheduling • Instead of relying on Legion scheduling or specifying the set of hosts by hand, use a batch scheduling system, e.g., PBS • Why use PBS? • Smart scheduling, job monitoring, and restarting • Combines the ubiquitous access provided by Legion with the efficient execution and communication obtained from locality and resource allocation • Legion MPI does not perform well • Think globally, act locally • ${LEGION} must be visible on, or copied to, all PBS nodes
Running on a Batch Host • A batch host has the attributes • legion_list_attributes -c hosts/nickel_pbs \ • host_property host_queue_type • host_property('queue') • host_queue_type('PBS') • Make sure that the job to be run on the batch host does not have the attribute desired_host_property with the value interactive • legion_update_attributes \ -c home/gabriel/mpi/programs/myprog \ -d "desired_host_property('interactive')"
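When scripting around batch hosts, it can help to test these attribute values programmatically. A minimal shell sketch, assuming legion_list_attributes prints each attribute on its own line as `name('value')` with ASCII quotes, as in the output shown above; `attr_value` and `is_batch_host` are hypothetical helper names.

```sh
# attr_value NAME
# Reads legion_list_attributes output on stdin and prints the quoted value
# of attribute NAME, e.g. "queue" for the line  host_property('queue')
attr_value() {
    sed -n "s/^[[:space:]]*$1('\([^']*\)').*/\1/p"
}

# is_batch_host
# Succeeds if the attribute listing on stdin marks the host as a queue host.
is_batch_host() {
    [ "$(attr_value host_property)" = queue ]
}
```

For instance, `legion_list_attributes -c hosts/nickel_pbs host_property host_queue_type | is_batch_host && echo batch` would print `batch` for the host above, provided the output format matches the assumption.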
Running on a Batch Host • Run an ordinary job • % legion_run -stdout stdout.o -OUT stdout.o \ -h /hosts/wolf-pbs uname_stdout • For legion_mpi_run, one must first create a batch host context, i.e., a context associated with the batch host (done once) • % legion_mkdir /home/gabriel/context_wolf-pbs • % legion_ln /hosts/wolf-pbs \ /home/gabriel/context_wolf-pbs/wolf-pbs • Run the MPI job with the batch host context • % legion_mpi_run -n 16 -A -STDOUT std.out \ -h /home/gabriel/context_wolf-pbs \ • mpi/programs/myprog
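The one-time context setup can be wrapped so that the context name is derived from the host name. A minimal shell sketch that calls only the two Legion commands shown on this slide; the function name `make_batch_context`, the `context_<host>` naming rule (taken from the example), and sending Legion's own messages to stderr are assumptions for illustration.

```sh
# make_batch_context HOST BASE
#   HOST: Legion batch host, e.g. /hosts/wolf-pbs
#   BASE: Legion context under which to create the new context, e.g. /home/gabriel
# Creates BASE/context_<name> containing a link to HOST (done once per host)
# and prints the new context path for use with legion_mpi_run -h.
make_batch_context() {
    host=$1
    base=$2
    name=$(basename "$host")
    ctx="$base/context_$name"
    # Send Legion's own messages to stderr so stdout carries only the path.
    legion_mkdir "$ctx" >&2 &&
        legion_ln "$host" "$ctx/$name" >&2 &&
        echo "$ctx"
}
```

Usage: `ctx=$(make_batch_context /hosts/wolf-pbs /home/gabriel)`, then `legion_mpi_run -n 16 -A -STDOUT std.out -h "$ctx" mpi/programs/myprog`.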
Fault Tolerance • Checking the consistency of the Legion collection is tricky • Legion tools tend to dump core when the bootstrap host does not respond • Watch for core dumps in the current directory or under /tmp • Aborting some commands, e.g., legion_login, may leave behind stale objects that confuse Legion • After a while, it is good to log out of Legion and log back in to refresh the working context • Apparently, a crashed MPI program is not restarted
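A minimal POSIX shell sketch for the core-dump check above. The helper name `find_core_dumps` is hypothetical, and it assumes core files are named `core` or `core.<suffix>`; the actual naming varies by platform and system settings.

```sh
# find_core_dumps [DIR...]
# Lists core files in the given directories (default: the current directory
# and /tmp). Assumes core files are named "core" or "core.<suffix>".
find_core_dumps() {
    [ "$#" -gt 0 ] || set -- . /tmp
    for dir in "$@"; do
        for f in "$dir"/core "$dir"/core.*; do
            # An unmatched glob is left as literal text, so keep only real files.
            if [ -f "$f" ]; then
                echo "$f"
            fi
        done
    done
}
```

Running `find_core_dumps` after a Legion command fails shows whether a tool crashed and left a core file behind.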
Conclusions • Legion provides good capacity but not-so-good performance • It is not trivial to run parallel jobs under Legion • The user must specify the staging of input and output files • Job information and debugging tools are limited • A legion_ps command is needed • How can one peek at the output files? • Too much information hiding? • Legion's scheduling of parallel jobs seems suboptimal • Integrating Legion with a batch system that provides scheduling and fault tolerance improves performance and reliability
Acknowledgments • Chris Cutter and Mike Herrick, Avaki • John Karpovich, University of Virginia • Janusz Pipin and Marcin Kolbuszewski, C3.ca and NRC • Roger Impey, NRC