Scaling Up MPI and MPI-I/O on seaborg.nersc.gov
David Skinner, NERSC Division, Berkeley Lab

Scaling: Motivation
• NERSC’s focus is on capability computation
• Capability == jobs that use ¼ or more of the machine’s resources
• Parallelism can deliver scientific results unattainable on workstations
• “Big Science” problems are more interesting!

Scaling: Challenges
• CPUs are outpacing memory bandwidth and switches, leaving FLOPs increasingly isolated
• Vendors often have machines < ½ the size of NERSC machines, so system software may be operating in uncharted regimes:
• MPI implementation
• Filesystem metadata systems
• Batch queue system
• NERSC consultants can help
Users need information on how to mitigate the impact of these issues for large-concurrency applications.

Seaborg.nersc.gov

Switch Adapter Bandwidth: csss

Switch Adapter Comparison: csss vs. css0
Tune message size to optimize throughput
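Why message size matters can be sketched with a common first-order model, which is an assumption here and not taken from the slides: the time to send a message is a fixed latency plus the payload divided by the peak bandwidth. The function name and parameters below are hypothetical, and real latency and bandwidth figures for csss or css0 would have to be measured.

```c
#include <assert.h>

/* Effective bandwidth in bytes/s for one message of nbytes, under the
 * assumed model: time = latency_s + nbytes / peak_bw.
 * Small messages are dominated by latency; large ones approach peak_bw. */
double effective_bandwidth(double nbytes, double latency_s, double peak_bw)
{
    assert(nbytes > 0.0 && latency_s >= 0.0 && peak_bw > 0.0);
    return nbytes / (latency_s + nbytes / peak_bw);
}
```

Evaluating this over a range of message sizes with measured latency and bandwidth gives a rough idea of how large messages must be before the adapter is used efficiently.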

Switch Adapter Considerations
• For data-decomposed applications with some locality, partition the problem along SMP boundaries (minimize the surface-to-volume ratio)
• Use MP_SHAREDMEMORY to minimize switch traffic
• csss is most often the best route to the switch

Job Start Up times

Synchronization
• On the SP each SMP image is scheduled independently, and while user code is waiting the OS will schedule other tasks
• A fully synchronizing MPI call requires everyone’s attention
• By analogy, imagine trying to go to lunch with 1024 people
• The probability that everyone is ready at any given time scales poorly
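A rough way to quantify the lunch analogy, assuming (hypothetically) that each of N tasks is independently ready at a given instant with the same probability p; the value of p is illustrative, not a measurement:

```latex
P(\text{all ready}) = p^{N},
\qquad
p = 0.999,\; N = 1024 \;\Rightarrow\; P = 0.999^{1024} \approx 0.36
```

Under this assumption, even tasks that are individually ready 99.9% of the time are all ready together only about a third of the time at 1024 tasks.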

Scaling of MPI_Barrier()

Load Balance
• If one task lags the others in time to complete, synchronization suffers; e.g., a 3% slowdown in one task can mean a 50% slowdown for the code overall
• Seek out and eliminate sources of variation
• Distribute the problem uniformly among nodes/CPUs
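The mechanism can be sketched with a simple model, assumed here for illustration: if every step ends in a full synchronization, each step costs as much as its slowest task, so the total time is the sum of per-step maxima. The function name and data layout are hypothetical, and the sketch does not attempt to reproduce the 3% and 50% figures quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Total wall time for nsteps synchronized steps.
 * t[s * ntasks + r] is the compute time of task r in step s.
 * Every step ends in a full synchronization, so a step takes as
 * long as its slowest task. */
double synchronized_time(const double *t, size_t ntasks, size_t nsteps)
{
    assert(t != NULL && ntasks > 0);
    double total = 0.0;
    for (size_t s = 0; s < nsteps; s++) {
        double slowest = t[s * ntasks];
        for (size_t r = 1; r < ntasks; r++) {
            double v = t[s * ntasks + r];
            if (v > slowest)
                slowest = v;
        }
        total += slowest;
    }
    return total;
}
```

Because the maximum is taken per step, a slowdown that moves from task to task is paid in every step, even though each individual task is only occasionally slow.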

Synchronization: MPI_Bcast 2048 tasks

Synchronization: MPI_Alltoall 2048 tasks

Synchronization (continued)
• MPI_Alltoall and MPI_Allreduce can be particularly bad in the range of 512 tasks and above
• Use MPI_Bcast where possible; it is not fully synchronizing
• Remove unneeded MPI_Barrier calls
• Use immediate (nonblocking) sends and asynchronous I/O when possible

Improving MPI Scaling on Seaborg

The SP switch
• Use MP_SHAREDMEMORY=yes (default)
• Use MP_EUIDEVICE=csss (default)
• Tune message sizes
• Reduce synchronizing MPI calls

64-bit MPI
• 32-bit MPI has inconvenient memory limits
• 256 MB per task default and 2 GB maximum
• 1.7 GB can be used in practice, but this depends on MPI usage
• The scaling of this internal usage is complicated, but larger-concurrency jobs have more of their memory “stolen” by MPI’s internal buffers and pipes
• 64-bit MPI removes these barriers
• 64-bit MPI is fully supported
• Just remember to use the “_r” compilers and “-q64”
• Seaborg has 16, 32, and 64 GB per node available

How to measure MPI memory usage? 2048 tasks

MP_PIPE_SIZE : 2*PIPE_SIZE*(ntasks-1)
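The formula above can be evaluated directly. The sketch below reads it as the pipe buffer memory in bytes consumed per task, which is an interpretation of the slide; the helper name is hypothetical, and the 64 KB pipe size used in the usage note is an illustrative value rather than a figure from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of pipe buffer memory implied by the slide's formula
 * 2 * PIPE_SIZE * (ntasks - 1). pipe_size is in bytes. */
uint64_t mpi_pipe_bytes(uint64_t pipe_size, uint64_t ntasks)
{
    assert(ntasks >= 1);
    return 2u * pipe_size * (ntasks - 1u);
}
```

For example, mpi_pipe_bytes(65536, 2048) evaluates to 268,304,384 bytes, roughly 256 MB, which shows how quickly this term grows with task count.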

OpenMP
• Using a mixed model, even when no underlying fine-grained parallelism is present, can take strain off the MPI implementation; e.g., on seaborg a 2048-way job can run with only 128 MPI tasks and 16 OpenMP threads
• Having hybrid code whose concurrencies can be tuned between MPI tasks and OpenMP threads has portability advantages

Beware Hidden Multithreading
• ESSL and IBM Fortran have autotasking-like “features” which function via creation of unspecified numbers of threads
• The Fortran RANDOM_NUMBER intrinsic has some well-known scaling problems: http://www.nersc.gov/projects/scaling/random_number.html
• XLF: “-qsmp=auto” uses threads to auto-parallelize code. ESSL: libesslsmp.a has an autotasking feature
• Synchronization problems are unpredictable when using these features. Performance suffers when too many threads are created.

MP_LABELIO, phost
• Labeled I/O will let you know which task generated the message “segmentation fault”, gave the wrong answer, etc.
    export MP_LABELIO=yes
• Run /usr/common/usg/bin/phost prior to your parallel program to map machine names to POE tasks
• MPI and LAPI versions are available
• Host lists are useful in general

Core files
• Core dumps don’t scale (no parallel work)
• MP_COREDIR=none → no corefile I/O
• MP_COREFILE_FORMAT=light_core → less I/O
• LL script to save just one full-fledged core file and throw away the others:
    …
    if [ "$MP_CHILD" != "0" ]; then
        export MP_COREDIR=/dev/null
    fi
    …

Debugging
• In general, debugging at 512 tasks and above is error prone and cumbersome
• Debug at a smaller scale when possible
• Use MPICH with its shared-memory device on a workstation with lots of memory as a mock-up of a high-concurrency environment
• For crashed jobs, examine the LL logs for memory usage history (ask a NERSC consultant for help with this)

Parallel I/O
• Can be a significant source of variation in task completion prior to synchronization
• Limit the number of readers or writers when appropriate. Pay attention to file creation rates.
• Output reduced quantities when possible

Summary
• Resources are present to face the challenges posed by scaling up MPI applications on seaborg
• Hopefully, scientists will expand their problem scopes to tackle increasingly challenging computational problems
• NERSC consultants can provide help in achieving scaling goals

Scaling of Parallel I/O on GPFS

Motivation
• NERSC uses GPFS for $HOME and $SCRATCH
• Local disk filesystems on seaborg (/tmp) are tiny
• Growing data sizes and concurrencies often outpace I/O methodologies

GPFS@Seaborg.nersc.gov
16 nodes are dedicated to serving GPFS filesystems.
Each compute node relies on the GPFS nodes as gateways to storage.

Common Problems when Implementing Parallel I/O
• CPU utilization suffers as time is lost to I/O
• Variation in write times can be severe, leading to batch job failure

Finding solutions
• Checkpoint (saving state) I/O pattern
• Survey strategies to determine the rate and variation in rate

Parallel I/O Strategies

Multiple File I/O
    if(private_dir) rank_dir(1,rank);
    fp=fopen(fname_r,"w");
    fwrite(data,nbyte,1,fp);
    fclose(fp);
    if(private_dir) rank_dir(0,rank);
    MPI_Barrier(MPI_COMM_WORLD);

Single File I/O
    fd=open(fname,O_CREAT|O_RDWR, S_IRUSR);
    lseek(fd,(off_t)(rank*nbyte)-1,SEEK_SET);
    write(fd,data,1);
    close(fd);
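A minimal sketch of the single-file pattern, assuming each task writes its whole block of nbyte bytes at offset rank*nbyte, the same layout the MPI-I/O slide expresses through its file view displacement. It is not the author's benchmark code: write_block is a hypothetical helper, pwrite stands in for the lseek/write pair, the file mode adds owner write permission so that every task can open the file read-write, and the offset is computed in off_t to avoid integer overflow at large rank*nbyte.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Write one task's nbyte-sized block at offset rank * nbyte in a
 * shared file. Returns 0 on success, -1 on error. */
int write_block(const char *fname, int rank, const void *data, size_t nbyte)
{
    assert(fname != NULL && rank >= 0);

    /* Owner read+write, so later opens with O_RDWR are permitted. */
    int fd = open(fname, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
    if (fd < 0)
        return -1;

    /* Cast before multiplying so the product is computed in off_t. */
    off_t offset = (off_t)rank * (off_t)nbyte;
    const char *p = data;
    size_t left = nbyte;

    /* pwrite may write fewer bytes than requested, so loop. */
    while (left > 0) {
        ssize_t n = pwrite(fd, p, left, offset);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        p += n;
        offset += n;
        left -= (size_t)n;
    }
    return close(fd);
}
```

In an MPI program each task would call write_block with its own rank; because the blocks do not overlap, no ordering between tasks is required for correctness.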

MPI-I/O
    MPI_Info_set(mpiio_file_hints, MPIIO_FILE_HINT0);
    MPI_File_open(MPI_COMM_WORLD, fname, MPI_MODE_CREATE | MPI_MODE_RDWR, mpiio_file_hints, &fh);
    MPI_File_set_view(fh, (off_t)rank*(off_t)nbyte, MPI_DOUBLE, MPI_DOUBLE, "native", mpiio_file_hints);
    MPI_File_write_all(fh, data, ndata, MPI_DOUBLE, &status);
    MPI_File_close(&fh);

Results

Scaling of single file I/O

Scaling of multiple file and MPI I/O

Large block I/O
• MPI I/O on the SP includes the file hint IBM_largeblock_io
• IBM_largeblock_io=true is used throughout; default values show large variation
• IBM_largeblock_io=true also turns off data shipping

Large block I/O = false
• MPI I/O on the SP includes the file hint IBM_largeblock_io
• Except for the results shown on this slide, IBM_largeblock_io=true is used throughout
• IBM_largeblock_io=true also turns off data shipping

Bottlenecks to scaling
• Single file I/O has a tendency to serialize
• Scaling up with multiple files creates filesystem problems
• Akin to data shipping, consider the intermediate case

Parallel IO with SMP aggregation (32 tasks)

Parallel IO with SMP aggregation (512 tasks)
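The bookkeeping behind SMP aggregation can be sketched as follows: one task per node collects the data of its node-mates and does the writing, so the number of writers drops from the task count to the node count. This is a sketch under stated assumptions, not the code behind the plots above: the helper names are hypothetical, ranks are assumed to be placed on nodes in contiguous blocks, and 16 tasks per node in the usage note matches the 16-way nodes implied by the OpenMP slide.

```c
#include <assert.h>

/* Rank that writes on behalf of `rank`: the lowest rank on the same
 * node, assuming ranks fill nodes in contiguous blocks. */
int aggregator_rank(int rank, int tasks_per_node)
{
    assert(rank >= 0 && tasks_per_node > 0);
    return (rank / tasks_per_node) * tasks_per_node;
}

/* Nonzero if `rank` is the task that performs I/O for its node. */
int is_aggregator(int rank, int tasks_per_node)
{
    return aggregator_rank(rank, tasks_per_node) == rank;
}

/* Number of tasks that actually touch the filesystem. */
int num_writers(int ntasks, int tasks_per_node)
{
    assert(ntasks > 0 && tasks_per_node > 0);
    return (ntasks + tasks_per_node - 1) / tasks_per_node;
}
```

With 16 tasks per node, a 512-task job has 32 writers and a 2048-task job has 128, which reduces both the file creation rate and the contention on a shared file.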

Summary