
S3D: Comparing Performance of XT3+XT4 with XT4



Presentation Transcript


  1. S3D: Comparing Performance of XT3+XT4 with XT4 Sameer Shende tau-team@cs.uoregon.edu

  2. Acknowledgements • Alan Morris [UO] • Kevin Huck [UO] • Allen D. Malony [UO] • Kenneth Roche [ORNL] • Bronis R. de Supinski [LLNL] • John Mellor-Crummey [Rice] • Nick Wright [SDSC] • Jeff Larkin [Cray, Inc.] The performance data presented here is available at: http://www.cs.uoregon.edu/research/tau/s3d

  3. TAU Parallel Performance System • http://www.cs.uoregon.edu/research/tau/ • Multi-level performance instrumentation • Multi-language automatic source instrumentation • Flexible and configurable performance measurement • Widely-ported parallel performance profiling system • Computer system architectures and operating systems • Different programming languages and compilers • Support for multiple parallel programming paradigms • Multi-threading, message passing, mixed-mode, hybrid

  4. The Story So Far... • Scalability study of S3D using TAU • 3D scatter plots and the mapping of ranks to physical processors point to a split between XT3 and XT4 nodes • Memory and network performance on the XT3 partition causes the rest of the application to slow down • Hypothesis: Running S3D on a ‘pure’ XT4 system will improve performance significantly • Ran a 6400-core simulation on an XT4 partition to compare with XT3+XT4 (used #PBS -lfeature=xt4)...

  5. 3D Scatter Plots • Plot four routines along X, Y, Z, and Color axes • Each routine has a range (max, min) • Each process (rank) has a unique position along the three axes and a unique color • Allows us to examine the distribution of nodes (clusters)
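
The placement rule on this slide can be sketched in Python. This is a minimal illustration and not the code used by TAU's visualizer: the helper names `normalize` and `scatter_coordinates`, the routine names other than MPI_Wait, and every timing value are invented for the example.

```python
def normalize(values):
    """Map each value to [0, 1] using the (min, max) range of its routine."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # a constant routine collapses to one point
    return [(v - lo) / (hi - lo) for v in values]


def scatter_coordinates(times_by_routine, axes):
    """Return one (x, y, z, color) tuple per rank.

    times_by_routine: dict mapping routine name -> list of per-rank times
    axes: four routine names assigned to the X, Y, Z and color axes
    """
    if len(axes) != 4:
        raise ValueError("a 3D scatter plot needs exactly four routines")
    columns = [normalize(times_by_routine[name]) for name in axes]
    return list(zip(*columns))


# Hypothetical per-rank exclusive times (seconds) for four ranks.
profile = {
    "MPI_Wait": [15.0, 20.0, 400.0, 435.0],
    "ROUTINE_A": [90.0, 88.0, 60.0, 61.0],
    "ROUTINE_B": [50.0, 52.0, 30.0, 30.0],
    "ROUTINE_C": [10.0, 10.0, 8.0, 8.0],
}
points = scatter_coordinates(
    profile, ["MPI_Wait", "ROUTINE_A", "ROUTINE_B", "ROUTINE_C"]
)
for rank, point in enumerate(points):
    print(rank, tuple(round(c, 3) for c in point))
```

Ranks whose routines behave alike end up close together, which is why two groups of nodes with different hardware would appear as two clusters.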

  6. Scatter Plot: 6400 cores XT3/XT4 - 2 Clusters! • Previous work showed: blue nodes are XT3, red nodes are XT4

  7. 3D Triangle Mesh Display • Plot MPI rank, routine name, and exclusive time along X, Y and Z axes • Color can be shown by a fourth metric • Scalable view • Suitable for very large number of processors

  8. XT3+XT4: MPI_Wait • Gap represents XT3 nodes

  9. 3D View: Large MPI_Wait times on most CPUs • To improve performance, we must reduce MPI_Wait time on the other CPUs

  10. 3D View: XT3 Partition, Imbalance • On XT3: MPI_Wait takes less time, other routines take more time!

  11. Getting Back to MPI_Wait() • MPI_Wait takes less time on XT3 nodes • Other routines take longer

  12. XT3+XT4: MPI_Wait - Sorted by Exclusive Time • MPI_Wait takes 435.84 seconds on rank 3101 • It takes 15.49 seconds on rank 0! • Rank 3101 is on XT4, rank 0 is on XT3
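
A quick check of how large this imbalance is, using only the two figures quoted on the slide. The variable names are illustrative.

```python
# MPI_Wait exclusive times quoted on the slide, in seconds.
# Rank 3101 runs on XT4, rank 0 runs on XT3.
mpi_wait = {3101: 435.84, 0: 15.49}

# Sort ranks by exclusive time, largest first.
ranked = sorted(mpi_wait.items(), key=lambda item: item[1], reverse=True)
slowest_rank, slowest = ranked[0]
fastest_rank, fastest = ranked[-1]

ratio = slowest / fastest
print(f"rank {slowest_rank} waits {ratio:.1f}x longer than rank {fastest_rank}")
```

The XT4 rank spends roughly 28 times longer in MPI_Wait than the XT3 rank, which is consistent with the faster nodes idling while the slower ones finish their work.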

  13. Comparing XT4 and XT3 ranks (Best vs worst)

  14. Improving S3D Performance • Hypothesis: Running S3D on a ‘pure’ XT4 system will help improve the performance significantly and reduce the time spent idling in MPI_Wait

  15. XT4 Profile: Main Window

  16. XT4: Mean Profile Sorted by Exclusive Time • MPI_Wait has moved down!

  17. XT4: Mean Profile Sorted by Inclusive Time

  18. Comparing XT4 with XT3+XT4 • On XT4, MPI_Wait takes 26% of the time it took on the combined XT3+XT4 run!

  19. Comparing Mean Inclusive Time

  20. XT4: 3D View • The “exp” loop [~1GFlop] takes most time now!

  21. XT3+XT4: Scatter Plot (Before)

  22. XT4 Scatter Plot (After) • MPI_Wait now takes between 78 and 121 s!

  23. Comparing Performance • Hypothesis confirmed: XT4 is faster than XT3+XT4 • Inclusive time down from 1935 to 1702 s • 12% improvement • Saved 24853.3 minutes (414 hours) of time, summed over all 6400 cores! • Reduction in MPI_Wait time is most significant • 390 s (mean) down to 104 s (mean) • Lessons learned: • Slower XT3 nodes can have a significant impact on a large-scale S3D run • The S3D harness test case does not perform well on non-homogeneous nodes • We recommend running S3D on an XT4 partition only! • #PBS -lfeature=xt4
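
The figures on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes that the 24853.3-minute saving is the per-core saving multiplied by the 6400 cores of the run. The slide does not state this, but it is the reading under which the numbers agree.

```python
CORES = 6400  # size of the run reported on the slides

incl_before, incl_after = 1935.0, 1702.0  # inclusive time (s): XT3+XT4, XT4
wait_before, wait_after = 390.0, 104.0    # mean MPI_Wait time (s)

saved_per_core = incl_before - incl_after   # seconds saved on each core
improvement = saved_per_core / incl_before  # fraction of the original time
aggregate_saved = saved_per_core * CORES    # seconds, summed over all cores

print(f"improvement: {improvement:.1%}")
print(f"aggregate saving: {aggregate_saved / 60:.1f} min "
      f"= {aggregate_saved / 3600:.0f} h")
print(f"MPI_Wait after/before: {wait_after / wait_before:.1%}")
```

The per-core saving is 233 s, or about 12% of 1935 s. The mean MPI_Wait time on XT4 is a little under 27% of its earlier value, close to the 26% shown on slide 18.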

  24. Discussion • Did we get optimal performance on XT4 nodes? • Are the nodes performing at similar rates uniformly now? • Let us see the std. deviation plot of all routines...

  25. XT4: Standard Deviation • IO routines!
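
As a rough illustration of what a standard deviation view computes, the sketch below ranks routines by how much their time varies across ranks. All timing values and the routine name COMPUTE_LOOP are hypothetical, and `rank_spread` is an illustrative helper, not a TAU function.

```python
from statistics import mean, pstdev


def rank_spread(times_by_routine):
    """Return (routine, mean, stddev) rows sorted by stddev, largest first."""
    rows = [(name, mean(t), pstdev(t)) for name, t in times_by_routine.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical per-rank exclusive times (seconds) for four ranks.
sample = {
    "WRITE_SAVEFILE": [5.0, 40.0, 42.0, 41.0],
    "MPI_Barrier": [38.0, 4.0, 2.0, 3.0],
    "COMPUTE_LOOP": [100.0, 101.0, 99.0, 100.0],
}
rows = rank_spread(sample)
for name, avg, sd in rows:
    print(f"{name:15s} mean={avg:7.2f} stddev={sd:6.2f}")
```

A routine with a large mean but a small deviation is expensive yet balanced. A routine with a large deviation is the one to inspect for non-uniform behavior across ranks.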

  26. Scatter Plot: One CPU... WRITE_SAVEFILE

  27. WRITE_SAVEFILE • Rank 0 is quicker!

  28. MPI_Barrier

  29. I/O is not performed uniformly

  30. I/O Becomes a Bottleneck: XT3, XT3+XT4... WRITE_SAVEFILE MPI_Wait

  31. Conclusions • Using pure XT4 improved performance by 12% • Need to investigate I/O in XT4/Lustre further to achieve better performance... • Discuss I/O issues with S3D developers

  32. S3D - Building with TAU • Change the compiler names in build/make.XT3 • ftn => tau_f90.sh • cc => tau_cc.sh • Set compile-time environment variables • setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi • Disabled tracking of message communication statistics in TAU • MPI_Comm_compare() is not called inside TAU's MPI wrapper • Choose callpath, PAPI counters, MPI profiling, PDT for source instrumentation • setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess' • Selective instrumentation file eliminates instrumentation in lightweight routines • Pre-process Fortran source code using cpp before compiling • Set runtime environment variables for instrumentation control and PAPI event counter selection in the job submission script: • export TAU_THROTTLE=1 • export COUNTER1=GET_TIME_OF_DAY • export COUNTER2=PAPI_FP_INS • export COUNTER3=PAPI_L1_DCM • export COUNTER4=PAPI_TOT_INS • export COUNTER5=PAPI_L2_DCM

  33. Selective Instrumentation in TAU
      % cat select.tau
      BEGIN_EXCLUDE_LIST
      MCADIF
      GETRATES
      TRANSPORT_M::MCAVIS_NEW
      MCEDIF
      MCACON
      CKYTCP
      THERMCHEM_M::MIXCP
      THERMCHEM_M::MIXENTH
      THERMCHEM_M::GIBBSENRG_ALL_DIMT
      CKRHOY
      MCEVAL4
      THERMCHEM_M::HIS
      THERMCHEM_M::CPS
      THERMCHEM_M::ENTROPY
      END_EXCLUDE_LIST
      BEGIN_INSTRUMENT_SECTION
      loops routine="#"
      END_INSTRUMENT_SECTION
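
To check what a file like this excludes before a long build, its sections can be read with a few lines of Python. This is a hypothetical helper, not TAU's own parser. It handles only the BEGIN_/END_ layout shown on the slide, and the sample text is shortened to three of the excluded routines.

```python
def parse_select_file(text):
    """Split a selective-instrumentation file into its named sections."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("BEGIN_"):
            current = line[len("BEGIN_"):]  # e.g. "EXCLUDE_LIST"
            sections[current] = []
        elif line.startswith("END_"):
            current = None
        elif current is not None:
            sections[current].append(line)
    return sections


# Shortened copy of the select.tau contents from the slide.
select_tau = """\
BEGIN_EXCLUDE_LIST
MCADIF
GETRATES
TRANSPORT_M::MCAVIS_NEW
END_EXCLUDE_LIST
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
"""
sections = parse_select_file(select_tau)
print(sections["EXCLUDE_LIST"])
print(sections["INSTRUMENT_SECTION"])
```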

  34. Getting Access to TAU on Jaguar • set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path) • Choose Stub Makefiles (TAU_MAKEFILE env. var.) from /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.* • Makefile.tau-mpi-pdt-pgi (flat profile) • Makefile.tau-mpi-pdt-pgi-trace (event trace, for use with Vampir) • Makefile.tau-callpath-mpi-pdt-pgi (single metric, callpath profile) • Binaries of S3D can be found in: • ~sameer/scratch/S3D-BINARIES • withtau • papi, multiplecounters, mpi, pdt, pgi options • without_tau

  35. Concluding Discussion • Performance tools must be used effectively • More intelligent performance systems for productive use • Evolve to application-specific performance technology • Deal with scale by “full range” performance exploration • Autonomic and integrated tools • Knowledge-based and knowledge-driven process • Performance observation methods do not necessarily need to change in a fundamental sense • They need to be more automatically controlled and more efficiently used • Develop next-generation tools and deliver to community • Open source with support by ParaTools, Inc. • http://www.cs.uoregon.edu/research/tau

  36. Support Acknowledgements • Department of Energy (DOE) • Office of Science • LLNL, LANL, ORNL, ASC • PERI
