
Performance Profiling Using hpmcount, poe+ & libhpm



  1. Performance Profiling Using hpmcount, poe+ & libhpm • Richard Gerber • NERSC User Services • ragerber@nersc.gov • 510-486-6820

  2. Introduction • How to obtain performance numbers • Tools based on IBM’s PMAPI • Relevant for FY2003 ERCAP

  3. Agenda • Low Level PAPI Interface • HPM Toolkit • hpmcount • poe+ • libhpm: hardware performance library

  4. Overview • These tools measure application performance • All can be used to tune applications • Performance numbers are needed for FY2003 ERCAP applications

  5. Vocabulary • PMAPI – IBM’s low-level interface • PAPI – Performance API (portable) • hpmcount, poe+ report overall code performance • libhpm can be used to instrument portions of code

  6. PAPI • Standard application programming interface • Portable; don't confuse it with IBM's low-level PMAPI interface • Can access hardware counter info • V2.1 at NERSC • See http://hpcf.nersc.gov/software/papi.html and http://icl.cs.utk.edu/projects/papi/

  7. Using PAPI • PAPI is available through a module: module load papi • You place calls in your source code and compile with xlf -O3 source.F $PAPI

      #include "fpapi.h"
      ...
      integer*8 values(2)
      integer counters(2), ncounters, irc
      ...
      irc = PAPI_VER_CURRENT
      CALL papif_library_init(irc)
      counters(1) = PAPI_FMA_INS
      counters(2) = PAPI_FP_INS
      ncounters = 2
      CALL papif_start_counters(counters, ncounters, irc)
      ...
      CALL papif_stop_counters(values, ncounters, irc)
      write(6,*) 'Total FMA ', values(1), ' Total FP ', values(2)
      ...
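For reference, here is the fragment above dropped into a minimal self-contained program (a sketch: the program name, matrix size, and initialization are illustrative, not from the original slide). Compile with xlf90 -O3 papi_flops.F $PAPI after module load papi:

      program papi_flops
#include "fpapi.h"
      integer, parameter :: n = 200
      real*8 x(n,n), y(n,n), z(n,n)
      integer*8 values(2)
      integer counters(2), ncounters, irc
      integer i, j, k

!     Initialize the operands; z accumulates the product
      x = 1.0d0
      y = 2.0d0
      z = 0.0d0

!     Start counting FMA and floating point instructions
      irc = PAPI_VER_CURRENT
      CALL papif_library_init(irc)
      counters(1) = PAPI_FMA_INS
      counters(2) = PAPI_FP_INS
      ncounters = 2
      CALL papif_start_counters(counters, ncounters, irc)

!     The kernel being measured: a matrix-matrix multiply
      DO j = 1, n
         DO k = 1, n
            DO i = 1, n
               z(i,j) = z(i,j) + x(i,k)*y(k,j)
            END DO
         END DO
      END DO

!     Stop counting and report the totals
      CALL papif_stop_counters(values, ncounters, irc)
      write(6,*) 'Total FMA ', values(1), ' Total FP ', values(2)
      end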

  8. hpmcount • Easy to use • Does not affect code performance • Profiles entire code • Uses hardware counters • Reports flip (floating point instruction) rate and many other quantities

  9. hpmcount usage • Serial • %hpmcount executable • Parallel • %poe hpmcount executable -nodes n -procs np • Gives performance numbers for each task • Prints output to STDOUT (or use -o filename) • Beware! These profile the poe command itself: • hpmcount poe executable • hpmcount executable (if compiled with mp* compilers)
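For example, a serial run that sends the report to a file instead of STDOUT (the output file name here is an assumption for illustration):

  % hpmcount -o ex1_report ./ex1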

  10. hpmcount example • ex1.f - Unoptimized matrix-matrix multiply

  % xlf90 -o ex1 -O3 -qstrict ex1.f
  % hpmcount ./ex1

  hpmcount (V 2.3.1) summary
  Total execution time (wall clock time): 17.258385 seconds
  ######## Resource Usage Statistics ########
  Total amount of time in user mode : 17.220000 seconds
  Total amount of time in system mode : 0.040000 seconds
  Maximum resident set size : 3116 Kbytes
  Average shared memory use in text segment : 6900 Kbytes*sec
  Average unshared memory use in data segment : 5344036 Kbytes*sec
  Number of page faults without I/O activity : 785
  Number of page faults with I/O activity : 1
  Number of times process was swapped out : 0
  Number of times file system performed INPUT : 0
  Number of times file system performed OUTPUT : 0
  Number of IPC messages sent : 0
  Number of IPC messages received : 0
  Number of signals delivered : 0
  Number of voluntary context switches : 1
  Number of involuntary context switches : 1727
  ####### End of Resource Statistics ########

  11. hpmcount output • ex1.f - Unoptimized matrix-matrix multiply

  % xlf90 -o ex1 -O3 -qstrict ex1.f
  % hpmcount ./ex1

  PM_CYC (Cycles) : 6428126205
  PM_INST_CMPL (Instructions completed) : 693651174
  PM_TLB_MISS (TLB misses) : 122468941
  PM_ST_CMPL (Stores completed) : 125758955
  PM_LD_CMPL (Loads completed) : 250513627
  PM_FPU0_CMPL (FPU 0 instructions) : 249691884
  PM_FPU1_CMPL (FPU 1 instructions) : 3134223
  PM_EXEC_FMA (FMAs executed) : 126535192
  Utilization rate : 99.308 %
  Avg number of loads per TLB miss : 2.046
  Load and store operations : 376.273 M
  Instructions per load/store : 1.843
  MIPS : 40.192
  Instructions per cycle : 0.108
  HW Float points instructions per Cycle : 0.039
  Floating point instructions + FMAs : 379.361 M
  Float point instructions + FMA rate : 21.981 Mflip/s
  FMA percentage : 66.710 %
  Computation intensity : 1.008
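A few of the derived lines follow directly from the raw counters and the 17.258 s wall time (a worked check): MIPS = 693651174 instructions / 17.258 s ≈ 40.19 million instructions per second; instructions per cycle = 693651174 / 6428126205 ≈ 0.108; load and store operations = 250513627 + 125758955 ≈ 376.273 M.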

  12. Floating point measures • PM_FPU0_CMPL (FPU 0 instructions) • PM_FPU1_CMPL (FPU 1 instructions) • The POWER3 processor has two Floating Point Units (FPUs) which operate in parallel. • Each FPU can start a new instruction every cycle. • This is the number of floating point instructions (add, multiply, subtract, divide, multiply+add) that have been executed by each FPU. • PM_EXEC_FMA (FMAs executed) • The POWER3 can execute a computation of the form x=s*a+b with one instruction. This is known as a Floating point Multiply & Add (FMA).
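As a sketch of the pattern (the loop and variable names are illustrative, not from the slides), a loop body in multiply+add form that the XL compiler can typically map onto a single FMA instruction per iteration at -O3:

!     Each iteration computes y(i) = s*x(i) + y(i): one FMA, 2 flops
      DO i = 1, n
         y(i) = s*x(i) + y(i)
      END DO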

  13. Total flop rate • Float point instructions + FMA rate • Floating point instructions + FMAs gives the total floating point operations; the FMA count is added in because each FMA instruction performs 2 floating point operations. • The rate gives the code's Mflops. • The POWER3 has a peak rate of 1500 Mflops (375 MHz clock x 2 FPUs x 2 flops/FMA instruction). • Our example: 22 Mflops.
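A worked check against the slide 11 counters: (249691884 + 3134223) FPU instructions + 126535192 FMAs ≈ 379.361 M floating point operations, and 379.361 M / 17.258 s ≈ 21.98 Mflip/s, about 1.5% of the 1500 Mflops peak.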

  14. Memory access • Average number of loads per TLB miss • Memory addresses that are in the Translation Lookaside Buffer can be accessed quickly. • Each time a TLB miss occurs, a new page (4KB, 512 8-byte elements) is brought into the buffer. • A value of ~500 means each element is accessed ~1 time while the page is in the buffer. • A small value indicates that needed data is stored in widely separated places in memory and a redesign of data structures may help performance significantly. • Our example: 2.0
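Worked from the slide 11 counters: 250513627 loads / 122468941 TLB misses ≈ 2.05, i.e., on average only about 2 of the 512 8-byte elements on each 4KB page are loaded while the page is resident, which is why the loop reordering in the next slides pays off.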

  15. Cache hits • The -sN option to hpmcount specifies a different statistics set • -s2 will include L1 data cache hit rate • 33.4% for our example • See http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_README.html for more options and descriptions.
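For our example this is simply (using the executable built earlier):

  % hpmcount -s2 ./ex1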

  16. Optimizing the code • Original code fragment

      DO I = 1, N
        DO K = 1, N
          DO J = 1, N
            Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
          END DO
        END DO
      END DO

  17. Optimizing the code • “Optimized” code: move I to the inner loop • Fortran stores arrays in column-major order, so with I innermost both Z(I,J) and X(I,K) are traversed with stride 1

      DO J = 1, N
        DO K = 1, N
          DO I = 1, N
            Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
          END DO
        END DO
      END DO

  18. Optimized results • Float point instructions + FMA rate • 461 vs. 22 Mflip/s (ESSL: 933) • Avg number of loads per TLB miss • 20,877 vs. 2.0 (ESSL: 162) • L1 cache hit rate • 98.9% vs. 33.4%

  19. Using libhpm • libhpm can instrument code sections • Embed calls into source code • Fortran, C, C++ • Contained in hpmtoolkit module • module load hpmtoolkit • compile with $HPMTOOLKIT • xlf -O3 source.F $HPMTOOLKIT • Execute program normally

  20. hpmlib example

      ...
      #include "f_hpm.h"
      ...
      CALL f_hpminit(0, "someid")
      CALL f_hpmstart(1, "matrix-matrix multiply")
      DO J = 1, N
        DO K = 1, N
          DO I = 1, N
            Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
          END DO
        END DO
      END DO
      CALL f_hpmstop(1)
      CALL f_hpmterminate(0)
      ...
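Sections are keyed by the integer ID passed to f_hpmstart/f_hpmstop, so several regions can be instrumented in one run; a hypothetical second section (the ID and label here are illustrative):

      CALL f_hpmstart(2, "initialization")
      ...
      CALL f_hpmstop(2)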

  21. Parallel programs • poe hpmcount executable -nodes n -procs np • Will print output to STDOUT separately for each task • poe+ executable -nodes n -procs np • Will print aggregate numbers to STDOUT • libhpm • Writes output to a separate file for each task • Do not do these! • hpmcount poe executable … • hpmcount executable (if compiled with mp* compiler)

  22. Summary • Utilities to measure performance • PAPI • hpmcount • poe+ • libhpm • You need to quote performance data in your FY2003 ERCAP application

  23. Where to Get More Information • NERSC Website: hpcf.nersc.gov • PAPI • http://hpcf.nersc.gov/software/tools/papi.html • hpmcount, poe+ • http://hpcf.nersc.gov/software/ibm/hpmcount/ • http://hpcf.nersc.gov/software/ibm/hpmcount/counter.html • hpmlib • http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_README.html
