S3D: Comparing Performance of XT3+XT4 with XT4

Sameer Shende tau-team@cs.uoregon.edu

Acknowledgements • Alan Morris [UO] • Kevin Huck [UO] • Allen D. Malony [UO] • Kenneth Roche [ORNL] • Bronis R. de Supinski [LLNL] • John Mellor-Crummey [Rice] • Nick Wright [SDSC] • Jeff Larkin [Cray, Inc.] The performance data presented here is available at: http://www.cs.uoregon.edu/research/tau/s3d

TAU Parallel Performance System • http://www.cs.uoregon.edu/research/tau/ • Multi-level performance instrumentation • Multi-language automatic source instrumentation • Flexible and configurable performance measurement • Widely-ported parallel performance profiling system • Computer system architectures and operating systems • Different programming languages and compilers • Support for multiple parallel programming paradigms • Multi-threading, message passing, mixed-mode, hybrid

The Story So Far... • Scalability study of S3D using TAU • 3D scatter plots and the mapping of ranks to physical processors point to a partitioning of the job across XT3 and XT4 nodes • Memory and network on the XT3 partition cause the rest of the application to slow down • Hypothesis: Running S3D on a ‘pure’ XT4 system will help improve the performance significantly • Ran a 6400 core simulation on an XT4 partition to compare with XT3+XT4 (used #PBS -lfeature=xt4)...

3D Scatter Plots • Plot four routines along X, Y, Z, and Color axes • Each routine has a range (max, min) • Each process (rank) has a unique position along the three axes and a unique color • Allows us to examine the distribution of nodes (clusters)
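A minimal sketch of how the four axes could be populated, assuming each routine's per-rank times are rescaled to its own (min, max) range. The function name and the toy numbers are invented for illustration; this is not taken from TAU's implementation.

```python
def scatter_coordinates(times_by_routine):
    """Map each rank to a normalized (x, y, z, color) tuple.

    times_by_routine: dict of exactly four routine names -> list of
    per-rank times (same length for every routine). Hypothetical helper.
    """
    if len(times_by_routine) != 4:
        raise ValueError("expected exactly four routines (X, Y, Z, color)")
    columns = []
    for values in times_by_routine.values():
        lo, hi = min(values), max(values)
        span = hi - lo
        # A routine with identical times on every rank collapses to 0.0.
        columns.append([(v - lo) / span if span else 0.0 for v in values])
    # Transpose: one (x, y, z, color) tuple per rank.
    return list(zip(*columns))


# Toy data for four ranks; the numbers are invented for illustration.
toy_profile = {
    "MPI_Wait": [10.0, 20.0, 30.0, 50.0],
    "WRITE_SAVEFILE": [4.0, 4.0, 6.0, 8.0],
    "GETRATES": [1.0, 3.0, 5.0, 9.0],
    "MPI_Barrier": [2.0, 2.0, 2.0, 2.0],
}
points = scatter_coordinates(toy_profile)
```

Ranks that behave alike land close together in this normalized space, which is what makes clusters of nodes visible.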

Scatter Plot: 6400 cores XT3/XT4 - 2 Clusters! • Previous work proved: Blue nodes are XT3, Red are XT4

3D Triangle Mesh Display • Plot MPI rank, routine name, and exclusive time along X, Y and Z axes • Color can be shown by a fourth metric • Scalable view • Suitable for very large number of processors

XT3+XT4: MPI_Wait • Gap represents XT3 nodes

3D View: Large MPI_Wait times on most CPUs • To improve performance, we must reduce MPI_Wait time on other CPUs

3D View: XT3 Partition, Imbalance • On XT3: MPI_Wait takes less time, other routines take more time!

Getting Back to MPI_Wait() • MPI_Wait takes less time on XT3 nodes • Other routines take longer

XT3+XT4: MPI_Wait - Sorted by Exclusive Time • MPI_Wait takes 435.84 seconds on rank 3101 • It takes 15.49 seconds on rank 0! • Rank 3101 is on XT4, rank 0 is on XT3
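A gap this wide between ranks suggests a simple way to separate the two groups: sort the per-rank MPI_Wait times and split at the widest gap. The sketch below is a hypothetical heuristic, not something TAU provides. Only the times for ranks 0 and 3101 come from the profile; the other values are invented.

```python
def split_at_largest_gap(wait_times):
    """Split ranks into (low, high) groups at the widest gap in sorted times.

    wait_times: dict of rank -> MPI_Wait exclusive time in seconds.
    Hypothetical helper for illustration only.
    """
    ordered = sorted(wait_times.items(), key=lambda item: item[1])
    if len(ordered) < 2:
        raise ValueError("need at least two ranks to split")
    # Index i marks the gap between ordered[i] and ordered[i + 1].
    widest = max(range(len(ordered) - 1),
                 key=lambda i: ordered[i + 1][1] - ordered[i][1])
    low = sorted(rank for rank, _ in ordered[:widest + 1])
    high = sorted(rank for rank, _ in ordered[widest + 1:])
    return low, high


# Ranks 0 and 3101 use the times quoted above; the other entries are invented.
toy_wait = {0: 15.49, 1: 18.2, 2: 21.7,
            3100: 402.5, 3101: 435.84, 3102: 418.0}
low_wait_ranks, high_wait_ranks = split_at_largest_gap(toy_wait)
```

In this run the low-wait group would correspond to the XT3 ranks, since those spend their time computing while the XT4 ranks wait for them.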

Comparing XT4 and XT3 ranks (Best vs worst)

Improving S3D Performance • Hypothesis: Running S3D on a ‘pure’ XT4 system will help improve the performance significantly and reduce the time spent idling in MPI_Wait

XT4 Profile: Main Window

XT4: Mean Profile Sorted by Exclusive Time • MPI_Wait has moved down!

XT4: Mean Profile Sorted by Inclusive Time

Comparing XT4 with XT3+XT4 • On XT4, MPI_Wait takes only 26% of the time it took on the combined XT3+XT4 run!

Comparing Mean Inclusive Time

XT4: 3D View • The “exp” loop [~1GFlop] takes most time now!

XT3+XT4: Scatter Plot (Before)

XT4 Scatter Plot (After) • MPI_Wait takes from 78 to 121 s now!

Comparing Performance • Hypothesis confirmed: XT4 is faster than XT3+XT4 • Inclusive time down from 1935 to 1702 s • 12% improvement • Saved 24853.3 core-minutes (414 core-hours) in aggregate: 233 s on each of 6400 cores • Reduction in MPI_Wait time is most significant • 390 s (mean) down to 104 s (mean) • Lessons learned: • Slower XT3 nodes can have a significant impact on a large scale S3D run • S3D harness testcase does not perform well on non-homogeneous nodes • We recommend running S3D on XT4 partition only! • #PBS -lfeature=xt4
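The figures on this slide can be checked with a few lines of arithmetic. All inputs below are the numbers quoted in the slides; the variable names are ours. The 24853.3-minute figure matches 233 s saved on each of 6400 cores, so it reads as aggregate core time rather than elapsed time.

```python
CORES = 6400                 # core count of the run
MIXED_INCLUSIVE_S = 1935.0   # XT3+XT4 inclusive time (s)
XT4_INCLUSIVE_S = 1702.0     # XT4-only inclusive time (s)
MIXED_WAIT_S = 390.0         # mean MPI_Wait on XT3+XT4 (s)
XT4_WAIT_S = 104.0           # mean MPI_Wait on XT4 only (s)

# Time saved per core and as a fraction of the original run.
saved_per_core_s = MIXED_INCLUSIVE_S - XT4_INCLUSIVE_S
improvement_pct = 100.0 * saved_per_core_s / MIXED_INCLUSIVE_S

# Aggregate saving summed over every core in the job.
saved_core_minutes = saved_per_core_s * CORES / 60.0
saved_core_hours = saved_core_minutes / 60.0

# Mean MPI_Wait on XT4 relative to the mixed run.
wait_ratio_pct = 100.0 * XT4_WAIT_S / MIXED_WAIT_S

print(f"per-core saving: {saved_per_core_s:.0f} s ({improvement_pct:.1f}%)")
print(f"aggregate saving: {saved_core_minutes:.1f} core-minutes "
      f"({saved_core_hours:.1f} core-hours)")
print(f"MPI_Wait ratio: {wait_ratio_pct:.1f}%")
```

This prints a per-core saving of 233 s (12.0%), an aggregate saving of 24853.3 core-minutes (414.2 core-hours), and an MPI_Wait ratio of 26.7% from the rounded means, close to the 26% quoted earlier.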

Discussion • Did we get optimal performance on XT4 nodes? • Are the nodes performing at similar rates uniformly now? • Let us see the std. deviation plot of all routines...
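One way to answer the uniformity question is to compute, for each routine, the standard deviation of its time across ranks and sort by it. The sketch below assumes a population standard deviation (the slides do not say which form TAU reports) and uses invented toy data; the helper name is hypothetical.

```python
from statistics import mean, pstdev


def rank_routines_by_spread(times_by_routine):
    """Return (routine, mean, stddev) tuples, largest stddev first.

    times_by_routine: dict of routine name -> list of per-rank times (s).
    """
    rows = [(name, mean(values), pstdev(values))
            for name, values in times_by_routine.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Toy per-rank times for four ranks; the numbers are invented.
toy_profile = {
    "MPI_Wait": [100.0, 104.0, 108.0, 104.0],
    "WRITE_SAVEFILE": [2.0, 40.0, 42.0, 44.0],
    "GETRATES": [50.0, 50.0, 50.0, 50.0],
}
ranking = rank_routines_by_spread(toy_profile)
for name, avg, spread in ranking:
    print(f"{name:16s} mean={avg:8.2f} s  stddev={spread:6.2f} s")
```

Routines at the top of this ranking are the ones whose cost differs most from rank to rank, so they are the first candidates for a closer look.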

XT4: Standard Deviation • IO routines!

Scatter Plot: One CPU... WRITE_SAVEFILE

WRITE_SAVEFILE • Rank 0 is quicker!

MPI_Barrier

I/O is not performed uniformly

I/O Becomes a Bottleneck: XT3, XT3+XT4... WRITE_SAVEFILE MPI_Wait

Conclusions • Using pure XT4 improved performance by 12% • Need to investigate I/O in XT4/Lustre further to achieve better performance... • Discuss I/O issues with S3D developers

S3D - Building with TAU • Change name of compiler in build/make.XT3 • ftn => tau_f90.sh • cc => tau_cc.sh • Set compile time environment variables • setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi • Disabled tracking message communication statistics in TAU • MPI_Comm_compare() is not called inside TAU’s MPI wrapper • Choose callpath, PAPI counters, MPI profiling, PDT for source instrumentation • setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess' • Selective instrumentation file eliminates instrumentation in lightweight routines • Pre-process Fortran source code using cpp before compiling • Set runtime environment variables for instrumentation control and PAPI counter selection in job submission script: • export TAU_THROTTLE=1 • export COUNTER1=GET_TIME_OF_DAY • export COUNTER2=PAPI_FP_INS • export COUNTER3=PAPI_L1_DCM • export COUNTER4=PAPI_TOT_INS • export COUNTER5=PAPI_L2_DCM

Selective Instrumentation in TAU
% cat select.tau
BEGIN_EXCLUDE_LIST
MCADIF
GETRATES
TRANSPORT_M::MCAVIS_NEW
MCEDIF
MCACON
CKYTCP
THERMCHEM_M::MIXCP
THERMCHEM_M::MIXENTH
THERMCHEM_M::GIBBSENRG_ALL_DIMT
CKRHOY
MCEVAL4
THERMCHEM_M::HIS
THERMCHEM_M::CPS
THERMCHEM_M::ENTROPY
END_EXCLUDE_LIST
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
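When the exclude list is long or changes between experiments, it can be convenient to generate the file. The sketch below only reproduces the directives shown in select.tau above; the helper name is hypothetical and the routine list is a shortened sample.

```python
def render_select_file(excluded_routines, instrument_loops=True):
    """Build the text of a selective instrumentation file.

    Uses only the directives that appear in the select.tau listing:
    an exclude list and an optional loop-instrumentation section.
    """
    lines = ["BEGIN_EXCLUDE_LIST"]
    lines.extend(excluded_routines)
    lines.append("END_EXCLUDE_LIST")
    if instrument_loops:
        lines.extend([
            "BEGIN_INSTRUMENT_SECTION",
            'loops routine="#"',  # instrument loops in all routines
            "END_INSTRUMENT_SECTION",
        ])
    return "\n".join(lines) + "\n"


# A shortened sample of the lightweight routines excluded for S3D.
select_text = render_select_file(["MCADIF", "GETRATES", "THERMCHEM_M::MIXCP"])
print(select_text, end="")
```

Writing `select_text` to a file named select.tau would make it usable with the -optTauSelectFile option.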

Getting Access to TAU on Jaguar • set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path) • Choose Stub Makefiles (TAU_MAKEFILE env. var.) from /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.* • Makefile.tau-mpi-pdt-pgi (flat profile) • Makefile.tau-mpi-pdt-pgi-trace (event trace, for use with Vampir) • Makefile.tau-callpath-mpi-pdt-pgi (single metric, callpath profile) • Binaries of S3D can be found in: • ~sameer/scratch/S3D-BINARIES • withtau • papi, multiplecounters, mpi, pdt, pgi options • without_tau

Concluding Discussion • Performance tools must be used effectively • More intelligent performance systems for productive use • Evolve to application-specific performance technology • Deal with scale by “full range” performance exploration • Autonomic and integrated tools • Knowledge-based and knowledge-driven process • Performance observation methods do not necessarily need to change in a fundamental sense • They need to be more automatically controlled and more efficiently used • Develop next-generation tools and deliver to community • Open source with support by ParaTools, Inc. • http://www.cs.uoregon.edu/research/tau

Support Acknowledgements • Department of Energy (DOE) • Office of Science • LLNL, LANL, ORNL, ASC • PERI
