
Performance Profiling Using hpmcount, poe+ & libhpm



  1. Performance Profiling Using hpmcount, poe+ & libhpm • Richard Gerber • NERSC User Services • ragerber@nersc.gov • 510-486-6820

  2. Introduction • How to obtain performance numbers • Tools based on IBM’s PMAPI • Relevant for FY2003 ERCAP

  3. Agenda • Low Level PAPI Interface • HPM Toolkit • hpmcount • poe+ • libhpm: hardware performance library

  4. Overview • These tools measure application performance • All can be used to tune applications • Performance numbers are needed for FY2003 ERCAP applications

  5. Vocabulary • PMAPI – IBM’s low-level interface • PAPI – Performance API (portable) • hpmcount, poe+ report overall code performance • libhpm can be used to instrument portions of code

  6. PAPI • Standard application programming interface • Portable; don't confuse it with IBM's low-level PMAPI interface • Can access hardware counter info • V2.1 at NERSC • See http://hpcf.nersc.gov/software/papi.html and http://icl.cs.utk.edu/projects/papi/

  7. Using PAPI • PAPI is available through a module: module load papi • You place calls in your source code and compile with xlf -O3 source.F $PAPI

      #include "fpapi.h"
      ...
      integer*8 values(2)
      integer counters(2), ncounters, irc
      ...
      irc = PAPI_VER_CURRENT
      CALL papif_library_init(irc)
      counters(1) = PAPI_FMA_INS
      counters(2) = PAPI_FP_INS
      ncounters = 2
      CALL papif_start_counters(counters, ncounters, irc)
      ...
      CALL papif_stop_counters(values, ncounters, irc)
      write(6,*) 'Total FMA ', values(1), ' Total FP ', values(2)
      ...
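For reference, here is the fragment above dropped into a minimal self-contained program (a sketch: the program name, matrix size, and initialization are illustrative, not from the original slide). Compile with xlf90 -O3 papi_flops.F $PAPI after module load papi:

      program papi_flops
#include "fpapi.h"
      integer, parameter :: n = 200
      real*8 x(n,n), y(n,n), z(n,n)
      integer*8 values(2)
      integer counters(2), ncounters, irc
      integer i, j, k

!     Initialize the operands; z accumulates the product
      x = 1.0d0
      y = 2.0d0
      z = 0.0d0

!     Start counting FMA and floating point instructions
      irc = PAPI_VER_CURRENT
      CALL papif_library_init(irc)
      counters(1) = PAPI_FMA_INS
      counters(2) = PAPI_FP_INS
      ncounters = 2
      CALL papif_start_counters(counters, ncounters, irc)

!     The kernel being measured: a matrix-matrix multiply
      DO j = 1, n
         DO k = 1, n
            DO i = 1, n
               z(i,j) = z(i,j) + x(i,k)*y(k,j)
            END DO
         END DO
      END DO

!     Stop counting and report the totals
      CALL papif_stop_counters(values, ncounters, irc)
      write(6,*) 'Total FMA ', values(1), ' Total FP ', values(2)
      end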

  8. hpmcount • Easy to use • Does not affect code performance • Profiles entire code • Uses hardware counters • Reports flip (floating point instruction) rate and many other quantities

  9. hpmcount usage • Serial • %hpmcount executable • Parallel • %poe hpmcount executable -nodes n -procs np • Gives performance numbers for each task • Prints output to STDOUT (or use -o filename) • Beware! These profile the poe command itself: • hpmcount poe executable • hpmcount executable (if compiled with mp* compilers)
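For example, a serial run that sends the report to a file instead of STDOUT (the output file name here is an assumption for illustration):

  % hpmcount -o ex1_report ./ex1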

  10. hpmcount example • ex1.f - Unoptimized matrix-matrix multiply

  % xlf90 -o ex1 -O3 -qstrict ex1.f
  % hpmcount ./ex1

  hpmcount (V 2.3.1) summary
  Total execution time (wall clock time): 17.258385 seconds
  ######## Resource Usage Statistics ########
  Total amount of time in user mode : 17.220000 seconds
  Total amount of time in system mode : 0.040000 seconds
  Maximum resident set size : 3116 Kbytes
  Average shared memory use in text segment : 6900 Kbytes*sec
  Average unshared memory use in data segment : 5344036 Kbytes*sec
  Number of page faults without I/O activity : 785
  Number of page faults with I/O activity : 1
  Number of times process was swapped out : 0
  Number of times file system performed INPUT : 0
  Number of times file system performed OUTPUT : 0
  Number of IPC messages sent : 0
  Number of IPC messages received : 0
  Number of signals delivered : 0
  Number of voluntary context switches : 1
  Number of involuntary context switches : 1727
  ####### End of Resource Statistics ########

  11. hpmcount output • ex1.f - Unoptimized matrix-matrix multiply

  % xlf90 -o ex1 -O3 -qstrict ex1.f
  % hpmcount ./ex1

  PM_CYC (Cycles) : 6428126205
  PM_INST_CMPL (Instructions completed) : 693651174
  PM_TLB_MISS (TLB misses) : 122468941
  PM_ST_CMPL (Stores completed) : 125758955
  PM_LD_CMPL (Loads completed) : 250513627
  PM_FPU0_CMPL (FPU 0 instructions) : 249691884
  PM_FPU1_CMPL (FPU 1 instructions) : 3134223
  PM_EXEC_FMA (FMAs executed) : 126535192
  Utilization rate : 99.308 %
  Avg number of loads per TLB miss : 2.046
  Load and store operations : 376.273 M
  Instructions per load/store : 1.843
  MIPS : 40.192
  Instructions per cycle : 0.108
  HW Float points instructions per Cycle : 0.039
  Floating point instructions + FMAs : 379.361 M
  Float point instructions + FMA rate : 21.981 Mflip/s
  FMA percentage : 66.710 %
  Computation intensity : 1.008
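A few of the derived lines follow directly from the raw counters and the 17.258 s wall time (a worked check): MIPS = 693651174 instructions / 17.258 s ≈ 40.19 million instructions per second; instructions per cycle = 693651174 / 6428126205 ≈ 0.108; load and store operations = 250513627 + 125758955 ≈ 376.273 M.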

  12. Floating point measures • PM_FPU0_CMPL (FPU 0 instructions) • PM_FPU1_CMPL (FPU 1 instructions) • The POWER3 processor has two Floating Point Units (FPUs) which operate in parallel. • Each FPU can start a new instruction every cycle. • This is the number of floating point instructions (add, multiply, subtract, divide, multiply+add) that have been executed by each FPU. • PM_EXEC_FMA (FMAs executed) • The POWER3 can execute a computation of the form x=s*a+b with one instruction. This is known as a Floating point Multiply & Add (FMA).
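As a sketch of the pattern (the loop and variable names are illustrative, not from the slides), a loop body in multiply+add form that the XL compiler can typically map onto a single FMA instruction per iteration at -O3:

!     Each iteration computes y(i) = s*x(i) + y(i): one FMA, 2 flops
      DO i = 1, n
         y(i) = s*x(i) + y(i)
      END DO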

  13. Total flop rate • Float point instructions + FMA rate • Floating point instructions + FMAs gives the total floating point operations; the FMA count is added in because each FMA instruction performs 2 floating point operations. • The rate gives the code's Mflops. • The POWER3 has a peak rate of 1500 Mflops (375 MHz clock x 2 FPUs x 2 flops/FMA instruction). • Our example: 22 Mflops.
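A worked check against the slide 11 counters: (249691884 + 3134223) FPU instructions + 126535192 FMAs ≈ 379.361 M floating point operations, and 379.361 M / 17.258 s ≈ 21.98 Mflip/s, about 1.5% of the 1500 Mflops peak.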

  14. Memory access • Average number of loads per TLB miss • Memory addresses that are in the Translation Lookaside Buffer can be accessed quickly. • Each time a TLB miss occurs, a new page (4KB, 512 8-byte elements) is brought into the buffer. • A value of ~500 means each element is accessed ~1 time while the page is in the buffer. • A small value indicates that needed data is stored in widely separated places in memory and a redesign of data structures may help performance significantly. • Our example: 2.0
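Worked from the slide 11 counters: 250513627 loads / 122468941 TLB misses ≈ 2.05, i.e., on average only about 2 of the 512 8-byte elements on each 4KB page are loaded while the page is resident, which is why the loop reordering in the next slides pays off.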

  15. Cache hits • The -sN option to hpmcount specifies a different statistics set • -s2 will include L1 data cache hit rate • 33.4% for our example • See http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_README.html for more options and descriptions.
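For our example this is simply (using the executable built earlier):

  % hpmcount -s2 ./ex1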

  16. Optimizing the code • Original code fragment

      DO I = 1, N
        DO K = 1, N
          DO J = 1, N
            Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
          END DO
        END DO
      END DO

  17. Optimizing the code • “Optimized” code: move I to the inner loop • Fortran stores arrays in column-major order, so with I innermost both Z(I,J) and X(I,K) are traversed with stride 1

      DO J = 1, N
        DO K = 1, N
          DO I = 1, N
            Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
          END DO
        END DO
      END DO

  18. Optimized results • Float point instructions + FMA rate • 461 vs. 22 Mflip/s (ESSL: 933) • Avg number of loads per TLB miss • 20,877 vs. 2.0 (ESSL: 162) • L1 cache hit rate • 98.9% vs. 33.4%

  19. Using libhpm • libhpm can instrument code sections • Embed calls into source code • Fortran, C, C++ • Contained in hpmtoolkit module • module load hpmtoolkit • compile with $HPMTOOLKIT • xlf -O3 source.F $HPMTOOLKIT • Execute program normally

  20. hpmlib example

      ...
      #include "f_hpm.h"
      ...
      CALL f_hpminit(0, "someid")
      CALL f_hpmstart(1, "matrix-matrix multiply")
      DO J = 1, N
        DO K = 1, N
          DO I = 1, N
            Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
          END DO
        END DO
      END DO
      CALL f_hpmstop(1)
      CALL f_hpmterminate(0)
      ...
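Sections are keyed by the integer ID passed to f_hpmstart/f_hpmstop, so several regions can be instrumented in one run; a hypothetical second section (the ID and label here are illustrative):

      CALL f_hpmstart(2, "initialization")
      ...
      CALL f_hpmstop(2)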

  21. Parallel programs • poe hpmcount executable -nodes n -procs np • Will print output to STDOUT separately for each task • poe+ executable -nodes n -procs np • Will print aggregate numbers to STDOUT • libhpm • Writes output to a separate file for each task • Do not do these! • hpmcount poe executable … • hpmcount executable (if compiled with mp* compiler)

  22. Summary • Utilities to measure performance • PAPI • hpmcount • poe+ • libhpm • You need to quote performance data in your FY2003 ERCAP application

  23. Where to Get More Information • NERSC Website: hpcf.nersc.gov • PAPI • http://hpcf.nersc.gov/software/tools/papi.html • hpmcount, poe+ • http://hpcf.nersc.gov/software/ibm/hpmcount/ • http://hpcf.nersc.gov/software/ibm/hpmcount/counter.html • hpmlib • http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_README.html
