

  1. A Recipe for Performance Analysis and Tuning in Parallel Computing Environments (Note 20) • 18 October 2001 (Rev. 21 October 2001) (Note 19) • Jack K. Horner • SAIC and LANL/CCN-8 • jkh@lanl.gov

  2. Overview • Objectives • Some trades in approaches to performance analysis • The recipe • Some open issues in performance analysis • Notes and references

  3. Objectives To provide • a tutorial-level workflow for parallel performance analysis and tuning, based on hands-on use on large parallel codes • in terms of this workflow, a high-level survey of representative production-class (mainly COTS) parallel performance analysis tools (Note 10) • a list of open issues in tool evolution

  4. Some trades in approaches to performance analysis (1 of 2) • Modeling-oriented • comprehensive in intent • but often requires long development lead time • Measurement-oriented • problem-specific by default; may lead to suboptimal results globally • but can often produce some useful results with little investment

  5. Some trades in approaches to performance analysis (2 of 2) • The two approaches are interdependent • calibration of an explicit performance model requires some performance measurement, taken as truth data • any interpretation of a performance measurement is (at least implicitly) based on some performance model (Note 12)

  6. The recipe (performed in order) (Note 17) • First, in single-thread-of-control mode • Get the right answers • Profile the program’s execution • Use existing, tuned library code for performance-critical single-thread functionality • Then, in multiple-thread-of-control mode • Get the right answers • Let the compiler optimize what it can • Profile the program’s execution • Optimize memory utilization

  7. IN SINGLE THREAD OF CONTROL MODE

  8. Get the right answers (1 of 2) • Seems embarrassingly obvious, but • often underfunded • well-calibrated software project effort and schedule estimation models (Note 13) predict this activity can easily consume half of the project effort • outside the existing calibration envelope, may require a V&V effort as large as the rest of the project (Note 7)

  9. Get the right answers (2 of 2) • Comparing results from multiple compilers (with associated math libraries) can help expose compiler/math-library-based numeric bugs (Note 6) • Use modern quality assurance tools and practices (Note 16)
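
A minimal sketch of the crosscheck described above (the file names and the 1e-10 tolerance are hypothetical): run the same input deck through builds made with two different compilers/math libraries, then flag answers that differ by more than a tolerance. A discrepancy that persists across optimization levels is a lead on a compiler- or math-library-level numeric bug.

#include <stdio.h>
#include <math.h>

/* Compare two columns of numeric results, one value per line, produced by
   builds of the same code under different compilers/math libraries.
   File names and tolerance are illustrative only. */
int main(void)
{
    FILE *fa = fopen("results_compilerA.txt", "r");
    FILE *fb = fopen("results_compilerB.txt", "r");
    if (!fa || !fb) { fprintf(stderr, "missing results file\n"); return 1; }

    const double tol = 1.0e-10;
    double a, b, worst = 0.0;
    long n = 0;
    while (fscanf(fa, "%lf", &a) == 1 && fscanf(fb, "%lf", &b) == 1) {
        n++;
        double scale = fmax(fabs(a), fabs(b));
        double rel = (scale > 0.0) ? fabs(a - b) / scale : 0.0;
        if (rel > worst) worst = rel;
        if (rel > tol)
            printf("value %ld differs: %.17g vs %.17g (rel. diff. %.2e)\n", n, a, b, rel);
    }
    printf("%ld values compared; worst relative difference %.2e\n", n, worst);
    return 0;
}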

  10. Profile the program’s execution (1 of 5) • Most vendor-supplied profilers are, on the whole, at least as good as “home-grown” timers/profilers; none are perfect • In any case, the relation between source and machine instructions may be many-to-many, making at least instruction-level accounting problematic • Comparison of multiple profilers is best (Note 3)

  11. Profile the program’s execution (2 of 5) • Low-level (machine/OS-event-oriented) • Performance Data API (PAPI) implementations (multiplatform; Ref. [4]) • perfex (SGI/IRIX; Ref. [1]) • DCPI (Note 1), ProfileMe (Compaq/Tru64; Refs. [2], [16]) • jm_status (IBM/AIX; Ref. [19]) • Xkstat (Sun/Solaris; Ref. [20]) (Note 15)
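
As a concrete illustration of the counter-based approach, here is a minimal sketch using PAPI’s classic high-level counter calls (Ref. [4]); event availability and names vary by platform, and the loop is only a stand-in for a real region of interest. The instructions-per-cycle figure it prints corresponds roughly to the “Graduated instructions/cycle” statistic perfex reports on slide 19.

#include <papi.h>
#include <stdio.h>

int main(void)
{
    int events[2] = { PAPI_TOT_CYC, PAPI_TOT_INS };  /* total cycles, completed instructions */
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    if (PAPI_start_counters(events, 2) != PAPI_OK) return 1;

    /* region of interest (stand-in for real work) */
    double s = 0.0;
    for (long i = 0; i < 10000000; i++) s += (double)i * 0.5;

    if (PAPI_stop_counters(counts, 2) != PAPI_OK) return 1;

    double ipc = (counts[0] > 0) ? (double)counts[1] / (double)counts[0] : 0.0;
    printf("cycles = %lld, instructions = %lld, instructions/cycle = %.3f (s = %g)\n",
           counts[0], counts[1], ipc, s);
    return 0;
}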

  12. Profile the program’s execution (3 of 5) • SpeedShop + prof (SGI O2K/IRIX 6.5.x) example (sage):
t01:~/sage20010624 [5] > mpirun -np 4 ssrun -pcsamp sage.x timing_c.input
t01:~/sage20010624 [12] > prof sage.x.pcsamp.f603300
-------------------------------------------------------------------------
[prof header material deleted here]
-------------------------------------------------------------------------
Function list, in descending order by time
-------------------------------------------------------------------------
[index]  secs     %      cum.%   samples  function (dso: file, line)
[1]      186.140  29.8%  29.8%   18614    TOKEN_GS_R8 (sage.x:module_token.f90, 6270)
[2]      136.220  21.8%  51.7%   13622    CSR_CG_SOLVER (sage.x:module_matrix.f90, 304)
[material deleted here]

  13. Profile the program’s execution (4 of 5) • Source-traceback profiling • TAU (multiplatform; Ref. [5]) • HPCView (multiplatform; Ref. [6]) (Note 4) • SpeedShop + prof (SGI/IRIX; Ref. [1]) • atom (Compaq/Tru64; Ref. [17]) • prof/gprof (Sun Enterprise; Ref. [3])

  14. Profile the program’s execution (5 of 5) • General dynamic instrumentation of executable • Paradyn/DynInst/DPCL (multiplatform; Refs. [12], [13], [14])

  15. Use existing, tuned library code for single-thread performance-critical regions • Unless library tuning IS your project, you will be hard-pressed to improve on existing, well-calibrated, tuned service-level code • Examples of good math libraries • NAG • ACM
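
To make the point concrete, a sketch (using the generic CBLAS interface as a stand-in for whatever tuned BLAS, NAG, or ACM routine actually applies): the hand-rolled triple loop is easy to write but rarely competitive with a library dgemm that has been blocked and tuned for the target’s memory hierarchy.

#include <cblas.h>   /* CBLAS interface to a tuned BLAS; link against the vendor library */

/* Naive C = A*B for n x n row-major matrices: easy to write, usually slow. */
void matmul_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}

/* Same operation via the tuned library: C = 1.0*A*B + 0.0*C. */
void matmul_blas(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}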

  16. IN MULTIPLE THREAD OF CONTROL MODE

  17. Get the right answers • See previous rant on this topic (Note 18)

  18. Profile the program’s execution (1 of 3) • Hypotheses about parallel performance bottlenecks that are not based on actual measurement are • like ants at a picnic (many and distracting) • often enough, spectacularly wrong • As a first step, use profilers mentioned in the “in single-thread mode” section of this presentation: most are parallel-tolerant

  19. Profile the program’s execution (2 of 3) • perfex (SGI O2K/IRIX 6.5.x) example (sage, 4 MPI processes):
t01:~/sage20010624 [16] > mpirun -np 4 perfex -mp -a -x -y -o myperf.txt sage.x timing_c.input
t01:~/sage20010624 [18] > vi myperf.txt
[material deleted here]
Statistics
===========================================================================
Graduated instructions/cycle ............................................ 0.583431
[material deleted here]
L1 Cache Line Reuse ..................................................... 16.610574
L2 Cache Line Reuse ..................................................... 4.691712
L1 Data Cache Hit Rate .................................................. 0.943216
L2 Data Cache Hit Rate .................................................. 0.824306
Time not making progress (probably waiting on memory) / Total time ..... 0.792328
[material deleted here]
MFLOPS (average per process) ............................................ 5.650960

  20. Profile the program’s execution (3 of 3) • Apply communication-oriented profilers • MPI • vampir/Guide (multiplatform; Refs. [8], [9]) • upshot/nupshot (multiplatform; Ref. [10]) • Ref. [11] is a good review, if a little dated • OpenMP • KAPro Toolset (multiplatform; Ref. [9])
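
These MPI tools typically attach through MPI’s standard profiling (PMPI) interface: the application’s call resolves to a wrapper that records an event and forwards to the real routine. A minimal sketch of the mechanism (the fprintf is illustrative; a real tracer writes timestamped records for later display):

#include <mpi.h>
#include <stdio.h>

/* Interposed MPI_Bcast: linked ahead of the MPI library, this wrapper times
   the operation and forwards to the real implementation via its PMPI_ name. */
int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Bcast(buf, count, datatype, root, comm);
    double t1 = MPI_Wtime();

    int rank;
    PMPI_Comm_rank(comm, &rank);
    fprintf(stderr, "rank %d: MPI_Bcast, %d elements, %.6f s\n", rank, count, t1 - t0);
    return rc;
}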

  21. Let the compiler optimize what it can • Give your compiler a fighting chance by providing it reasonable optimization opportunities (see Refs. [1] and [3]) • Modern optimizing compilers can produce binaries that typically have performance superior to that of most “hand-tuned” code • Custom-tuning at the source level almost invariably creates non-portable performance
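
An illustration of the trade above: the straightforward loop gives the optimizer room to unroll, schedule, and (on suitable hardware) vectorize; the hand-unrolled version bakes one machine’s “tuning” into the source and is rarely faster under -O2/-O3 with a modern compiler.

/* Straightforward form: let the compiler unroll, schedule, and pipeline it. */
void axpy_plain(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Source-level 4-way hand unroll: more code, rarely faster, non-portable "tuning". */
void axpy_hand_unrolled(int n, double a, const double *x, double *y)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        y[i]   += a * x[i];
        y[i+1] += a * x[i+1];
        y[i+2] += a * x[i+2];
        y[i+3] += a * x[i+3];
    }
    for (; i < n; i++)      /* remainder */
        y[i] += a * x[i];
}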

  22. Optimize memory utilization (1 of 3) • Proximity of process and memory could affect performance in distributed-shared-memory systems • In most of LANL’s large simulators, provided a process is not memory-starved, placement does NOT account for more than a few percent of wallclock (Note 5) • Applications that perform frequent large memory moves (Note 9) tend to be sensitive to placement

  23. Optimize memory utilization (2 of 3) • Profile memory utilization • dmalloc (various platforms; Ref. [7]) • perfex, dlook, dprof, nodememusg (SGI/IRIX; Refs. [1], [18]) • jm_status (IBM/AIX; Ref. [19]) • Xkstat (Sun/Solaris; Ref. [20]) (Note 15) • High cache-miss, TLB-miss, or swap rates are the hallmark of suboptimal memory utilization
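
A minimal sketch of what those counters typically expose: both routines below do the same arithmetic on a row-major C array, but the column-order traversal strides through memory and characteristically produces the high cache- and TLB-miss rates mentioned above (the 2048-by-2048 array size is illustrative).

#define N 2048                      /* ~32 MB of doubles: larger than typical caches */
static double a[N][N];

/* Inner loop walks contiguous memory: good cache-line and TLB reuse. */
double sum_row_order(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Same arithmetic, but each access strides by N doubles: poor reuse,
   high L1/L2 and TLB miss rates. */
double sum_column_order(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}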

  24. Optimize memory utilization (3 of 3) • In distributed memory architectures, size the process-specific memory requirements to ~50% of the local main memory block associated with the process • Use placement tools (Note 2) where available • Beware of (sometimes unadvertised) interactions between load-sharing runtimes and placement utilities (Note 8)

  25. Some open issues in performance analysis (1 of 4) • Accounting attribution is not always what it seems to be (requires the user to have too much knowledge of tool internals) • Concurrent presentation of metrics is not well supported in COTS • There is no standard model for cross-platform performance comparisons

  26. Some open issues in performance analysis (2 of 4) • C++ support, especially for templates and mangled names, is limited in COTS • Hardware support for performance metrics varies greatly among platforms • COTS performance analysis tools are often optimized for the vendor’s hardware-diagnostic interests

  27. Some open issues in performance analysis (3 of 4) • COTS products provide little support for dynamically identifying “performance groups” -- collections of blocks that act in concert and dominate wallclock • Mapping between the science-application-domain expert’s and the computer scientist’s views of the code is not supported in COTS (TAU does provide support)

  28. Some open issues in performance analysis (4 of 4) • Tool literature is not centralized -- a well-maintained database with hyperlinks would be welcome (a candidate for a Parallel Tools Consortium (Ref. [23]) project?) (Note 14) • Commodity-market incentives to address any of the above on speculation are very low, especially for large systems (Note 11)

  29. Notes (1 of 8) 1. As of 18 October 2001, it is unclear whether DCPI will be supported on the LANL Compaq Q (30 TeraOps) system. 2. Such as SGI’s dplace (Ref. [1]). 3. HPCView (Ref. [6]), for example, provides a convenient way to juxtapose multiple metrics. 4. HPCView presumes the existence of prof-like files produced by software other than HPCView.

  30. Notes (2 of 8) 5. J. K. Horner and P. Gustafson, Los Alamos National Laboratory, unpublished results using Ref. [15]. 6. W. R. Boland, D. L. Harris, and J. K. Horner, “A comparison of Fortran 90 compiler features on the LANL SGI Origin2000”, presentation at SGI Quarterly Review, Fall 1998. (NAG, Compaq, HP, and Lahey/Fujitsu Fortran compilers are good crosschecks.) 7. For example, the LANL Advanced Hydrodynamic Facility (AHF), with a nominal cost of ~$10^10.

  31. Notes (3 of 8) 8. For example, between LSF and dplace under SGI/IRIX. 9. For example, streaming visualization systems. 10. If I failed to mention your favorite tool, let me know, and I’ll try to include it in an update.

  32. Notes (4 of 8) 11. Most of today’s supercomputers are effectively networked commodity-class SMP nodes. Nominally, an ASCI-class supercomputer configured this way costs ~$10^8. The manufacturers of such systems are also typically in the PC/workstation market. 10% of the workstation/PC market, a nominal vendor market share, is ~$10^10 per year. On average, a successful ASCI-class computer vendor could hope to sell one supercomputer system every three years. Thus the revenue from the sale of an ASCI-class supercomputer represents only ~0.3% of the annual revenues of a nominal COTS workstation/PC vendor.

  33. Notes (5 of 8) 12. In the sense of Ref. [21], pp. 18-20. 13. Such as Revised Intermediate COCOMO (REVIC), Ref. [22]. 14. Federico Bassetti of LANL has an outstanding LANL-internal web page dedicated to this objective. 15. Xkstat, like many system monitoring utilities, provides system-wide, not process-specific metrics.

  34. Notes (6 of 8) 16. Getting the right answers requires a system of procedures and practices that maximizes the probability of getting them. These include, but are not limited to, tool-assisted configuration management, requirements definition, logical and physical design documentation, testing, and user documentation (Ref. [24]). Thanks to Bill Spangenberg, Richard Barrett, and Michael Ham of LANL for reminding me of the importance of this point. 17. See Ref. [3] for an outstanding tutorial on most of these topics.

  35. Notes (7 of 8) 18. Thanks to Larry Cox for chiding me on this point. In the original version of this presentation, I omitted explicit mention of this topic under the “multiple-thread-of-control” heading. 19. Based on comments from participants in the Los Alamos Computer Science Institute (LACSI) Symposium 2001, Workshop on Parallel Tools, 18 October 2001, Santa Fe, NM.

  36. Notes (8 of 8) 20. Los Alamos National Laboratory identifier LA-UR-01-5827. This work was supported in part by University of California/Los Alamos National Laboratory Subcontract 22237-001-01 4x with Science Applications International Corporation. This document is not claimed to represent the views of the U.S. Government or its contractors.

  37. References (1 of 7) [1] SGI, Origin2000 and Onyx2 Performance Tuning and Optimization Guide, Document Number 007-3430-003, URL http://techpubs.sgi.com/library/. [2] Compaq, Compaq (formerly Digital) Continuous Profiling Interface (DCPI), URL http://www.tru64unix.compaq.com/dcpi/. [3] P. Mucci and K. London, Application Optimization and Tuning of Cache Based Uniprocessors, SMPs, and MPPs, Including OpenMP and MPI, URL http://www.cs.utk.edu/~mucci/MPPopt.html.

  38. References (2 of 7) [4] S. Browne (Moore) et al., Ptools Project: Performance Data API (PAPI), URL http://www.cs.utk.edu/~browne/ptools98/perfAPI. [5] A. Malony, S. Shende et al., TAU, URL http://www.cs.oregon.edu/research/paracomp/proj/tau. [6] J. Mellor-Crummey, R. Fowler, and G. Marin, HPCView; contact rjf@cs.rice.edu. [7] G. Watson, dmalloc, URL http://dmalloc.com.

  39. References (3 of 7) [8] Pallas, vampir, URL http://www.pallas.com/pages/vampir.htm. [9] KAI Software, KAPro Toolset, URL http://www.kai.com. [10] Argonne National Laboratory, upshot, URL http://www-fp.mcs.anl.gov/~lusk/upshot. [11] S. Browne, K. London, J. Dongarra, “Review of Performance Analysis Tools for MPI Parallel Programs”, URL http://www.cs.utk.edu/~browne/perftools-review.

  40. References (4 of 7) [12] B. Miller et al., Paradyn, URL http://www.cs.wisc.edu/paradyn. [13] (IBM-sponsored), Dynamic Probe Class Library (DPCL) Users Manual, available at URL http://www.cs.wisc.edu/paradyn/DPCL. [14] J. Hollingsworth et al., Dyninst, URL http://www.dyninst.org.

  41. References (5 of 7) [15] P. Gustafson and J. Horner, LANL, Memory Utilization Tracking Tool (MUTT) Version 1.1, 1999 (no longer supported). [16] J. Dean et al., “ProfileMe: hardware support for instruction-level profiling on out-of-order processors”, Proceedings of the 30th Symposium on Microarchitecture (Micro-30), December 1997, URL ftp://ftp.digital.com/pub/DEC/SRC/publications/weihl/micro30.ps.

  42. References (6 of 7) [17] Compaq, atom, URL http://www.tru64unix.compaq.com/developerstoolkit/#atom. [18] J. Horner, nodememusg, LANL, contact jkh@lanl.gov. [19] A. Baker, “CU Boulder’s SP2”, URL http://www-ugrad.cs.colorado.edu/~csci4576/SP2/introsp.html#Useful. [20] Xkstat, URL http://www.sunperf.com/perfmontools.html.

  43. References (7 of 7) [21] C. C. Chang and H. J. Keisler, Model Theory, North-Holland, 1990. [22] U.S. Air Force Cost Analysis Agency, Revised Intermediate COCOMO, Version 9.2, URL http://www.hq.af.mil/afcaa/models/REVIC92.EXE. [23] URL http://www.ptools.org. [24] URL http://www.softstarsystems.com/faq.htm.
