HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS PERFORMANCE MEASUREMENT & ANALYSIS

Prof. Thomas Sterling, Department of Computer Science, Louisiana State University, March 1, 2011

Contact Info • Steven R. Brandt • sbrandt@cct.lsu.edu • AIM: RegexGuy

Links
• http://cct.lsu.edu/~sbrandt/csc7600l15demos.zip
• Xming: http://www.straightrunning.com/XmingNotes/
  • Scroll down, click on "Xming public release", and install
• PuTTY: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
  • Click on putty.exe and save it to the desktop

Topics
• Introduction
• Measuring System Operation
• Gprof
• Perfsuite
• PAPI
• Tau & PAPI
• Benchmarks b_eff
• MPI Tracing with PMPI
• Tau & MPI
• Summary – Material for the Test

Opening Remarks
• Up until now, two strategies for measuring performance:
  • 1) Wall-clock time for user applications
  • 2) Benchmarks for comparing
    • Machines of different types
    • Machines of different scales
• But we have identified factors that contribute to system operational performance, e.g.:
  • Effective use of parallelism
  • Cache behavior
• To make better use of HPC systems, we need to measure operational behavior
  • How the system is performing during application execution
  • What the application's demands and bottlenecks are
• Focus on SMP-class system operation during this segment
• Next segment: measuring MPP & cluster behavior

What you’ll Need to Know
• This is a skills-oriented lecture
• Understand the kinds and levels of metrics of system and processor operation that you can measure
• Know the kinds of tools that can expose valuable parameters of system & application operation
  • Hardware counters
  • Software instrumentation, data acquisition, and presentation
• Learn the basics of how to use specific tools when running your application code
  • Gprof
  • Perfsuite
  • PAPI
  • TAU

Final initial comments (yes, I know that’s an oxymoron)
• We are only going to scratch the surface today
• Try to get the basic ideas
• This will expose you to a range of concepts, strategies, and tools
• Lots of details will be left to future discussions
• Over the next weeks, we will extend our abilities in using these tools
• But don’t hesitate to read through the documentation
• Hey, try some things out for yourself
• You’ve got a sandbox to play in (Arete)

Hardware Counters
• Each processor has the ability to monitor events of various kinds
• A small set of registers is used to count events. Very processor-specific.
[Figure: SMP node diagram with four processors (MP), per-processor L1 and L2 caches, two L3 caches, memory banks M1 to Mn, a controller, and peripherals (PCI-e, JTAG, Ethernet, USB, NIC)]

Philip J. Mucci, “Performance Analysis Tools and PAPI”, UTK ICL

Hardware Events
• Floating-point operations: multiplies, adds, multiply-adds, etc.
• L1/L2 cache hits/misses (see http://en.wikipedia.org/wiki/CPU_cache)
• Translation Lookaside Buffer hits/misses (virtual-to-physical address translation table)
• Branch prediction counters (pipelined systems must guess the next instruction to fetch)

A Goal: Optimization
• Compile time:
  • Various optimization levels enabled by compiler options
  • Examine compiler output
• Run time (performance analysis):
  • Instrument code or execution to produce a trace
  • Tools to analyze the trace:
    • The standard/basic tool is gprof, but there are many others
• Note: the Java HotSpot environment collects data about execution and uses it to optimize a program as it runs

Performance Analysis Tools
• Widely ported low-level interface to hardware counters: PAPI (Performance API)
  • Supports AIX, Linux, Solaris, and even Windows!
  • http://icl.cs.utk.edu/papi/custom/index.html?lid=62&slid=96
• Many tools built on PAPI
  • Perfsuite (NCSA), psrun command
  • TAU (University of Oregon)
  • etc.
• Useful for:
  • Finding performance bottlenecks
  • Identifying cache problems (badly sized arrays)

time
• A simple Unix command that reports resource usage
• Runs a specified program
  • time [options] command [arguments …]
• Gives timing statistics about the program run
  • The elapsed real time between invocation and termination
  • User CPU time
  • System CPU time
• See: man time
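The same three quantities can be collected from inside a program. The following is a minimal sketch, assuming a POSIX system: it brackets a workload with clock_gettime() for elapsed real time and getrusage() for user and system CPU time. The names run_times, busy_work, measure, and print_times, and the DEMO_MAIN compile flag, are hypothetical and exist only for this illustration.

```c
#define _XOPEN_SOURCE 700   /* expose clock_gettime() and getrusage() */
#include <assert.h>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>

/* The three quantities that time(1) reports, in seconds. */
typedef struct {
    double real;  /* elapsed wall-clock time */
    double user;  /* CPU time spent in user mode */
    double sys;   /* CPU time spent in the kernel for this process */
} run_times;

static double tv_seconds(struct timeval tv) {
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Hypothetical CPU-bound workload: partial sum of the harmonic series.
 * The volatile accumulator keeps the compiler from deleting the loop. */
double busy_work(long n) {
    volatile double sum = 0.0;
    for (long i = 1; i <= n; i++)
        sum += 1.0 / (double)i;
    return sum;
}

/* Run work(n) once and return the time it consumed. */
run_times measure(double (*work)(long), long n) {
    struct timespec t0, t1;
    struct rusage r0, r1;
    run_times out;

    assert(work != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    getrusage(RUSAGE_SELF, &r0);
    work(n);
    getrusage(RUSAGE_SELF, &r1);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    out.real = (double)(t1.tv_sec - t0.tv_sec)
             + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    out.user = tv_seconds(r1.ru_utime) - tv_seconds(r0.ru_utime);
    out.sys  = tv_seconds(r1.ru_stime) - tv_seconds(r0.ru_stime);
    return out;
}

/* Print in roughly the same layout as the shell's time output. */
void print_times(run_times t) {
    printf("real %.3fs\nuser %.3fs\nsys  %.3fs\n", t.real, t.user, t.sys);
}

#ifdef DEMO_MAIN   /* build with -DDEMO_MAIN for a standalone program */
int main(void) {
    print_times(measure(busy_work, 200000000L));
    return 0;
}
#endif
```

For a CPU-bound loop like this one, user time should be close to real time and system time close to zero; a large gap between real and user+sys usually points to waiting (I/O, sleeping, or contention for the processor).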

top
• Gives an overview of system process status and resource usage
• Provides a dynamic real-time view of a running system
  • System summary information
  • Currently managed tasks
• Updates every few (e.g. 5) seconds
• top -hv | -bcisS -d delay -n iterations -p pid [, pid …]
• See: man top

Basic Tools
• time

    $ time du -s /usr > /dev/null 2>&1
    real    0m34.274s
    user    0m0.082s
    sys     0m0.957s

• top/ps

    top - 11:29:40 up 49 min,  2 users,  load average: 0.32, 0.26, 0.25
    Tasks: 125 total,   3 running, 121 sleeping,   0 stopped,   1 zombie
    Cpu(s):  4.5%us,  0.3%sy,  0.0%ni, 94.7%id,  0.2%wa,  0.3%hi,  0.0%si,  0.0%st
    Mem:   1030940k total,  1013376k used,    17564k free,   124616k buffers
    Swap:  2104472k total,       32k used,  2104440k free,   411968k cached

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
     4136 sbrandt   15   0 35208  15m  10m S    6  1.5   0:03.35 gnome-terminal
     3761 root      16   0 82676  50m  12m R    3  5.0   1:02.82 X
     5195 sbrandt   16   0  2176 1172  852 R    1  0.1   0:00.03 top
     3487 root      17   0  1820  572  496 S    0  0.1   0:00.25 hald-addon-stor
     3930 sbrandt   16   0 99.8m  40m  14m S    0  4.0   0:36.27 beagled

gprof: quick overview
• gprof
  • A utility that profiles procedures in programs, available on most Unix systems
• gprof provides information about:
  • An index for each procedure
  • The parents of each procedure
  • The percentage of CPU time used by a procedure and its calls
  • A breakdown of the time used by the procedure and its descendants
  • The number of times a procedure was called
  • The direct descendants of each procedure
• To use gprof:
  • Compile the source code with the -pg option
  • Running the resulting executable generates an output file, gmon.out, for serial programs
  • For serial programs: gprof exe gmon.out
  • For parallel programs, set the environment variable GMON_OUT_PREFIX so that each process writes its own profile file, then run: gprof exe gmon.out.*

GPROF: one-minute tutorial
• Steps to use gprof:
  • gcc -pg -g -o prog prog.c
  • ./prog
  • gprof prog gmon.out
• More reading: http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html
• Finds the subroutines where the most time is spent
• Cannot tell you why some routines are more costly than others; that requires more information...
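To try these steps without an application of your own, a small test program is enough. The following is a sketch of one possible prog.c, not the lecture's demo code: sum_of_sums() calls sum_to() repeatedly, so the flat profile should attribute most of the time to sum_to() while the call graph shows sum_of_sums() as its main parent. The function names and the DEMO_MAIN flag are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Linear-cost routine: returns 1 + 2 + ... + n. */
unsigned long long sum_to(unsigned long long n) {
    unsigned long long s = 0;
    for (unsigned long long i = 1; i <= n; i++)
        s += i;
    return s;
}

/* Quadratic-cost routine: calls sum_to() n times, so nearly all of its
 * time is spent in its descendant rather than in its own body. */
unsigned long long sum_of_sums(unsigned long long n) {
    unsigned long long s = 0;
    for (unsigned long long i = 1; i <= n; i++)
        s += sum_to(i);
    return s;
}

#ifdef DEMO_MAIN   /* build with -DDEMO_MAIN for a standalone program */
int main(void) {
    /* One direct call to sum_to() and many indirect ones. */
    printf("sum_to:      %llu\n", sum_to(50000000ULL));
    printf("sum_of_sums: %llu\n", sum_of_sums(20000ULL));
    return 0;
}
#endif
```

Assuming the file is saved as prog.c, the steps become: gcc -pg -g -DDEMO_MAIN -o prog prog.c, then ./prog, then gprof prog gmon.out. Compiling without optimization, as in the command on the slide, keeps the compiler from inlining sum_to() or collapsing the loops, which would otherwise blur the profile.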

Demo of gprof

Using psrun
• psrun cmd (e.g. psrun du -s /usr)
  • This test measures the performance counters exercised by the du command. No special compilation of du is required for this to work.
• psprocess cmd.* (e.g. psprocess du.*.xml)
  • At the bottom of this output, you will see a summary of events for numerous counters.

Demo of psrun

By Hand: Verifying the PAPI Version

    // When hand-instrumenting, you need to check the PAPI version first
    #include <stdio.h>   /* fprintf */
    #include <stdlib.h>  /* exit */
    #include <papi.h>
    ...
    /* Verifying PAPI Version */
    int v = PAPI_library_init(PAPI_VER_CURRENT);
    if (v != PAPI_VER_CURRENT) {
        fprintf(stderr, "Bad PAPI version\n");
        exit(2);
    }

By Hand: Measuring PAPI Counters
• Use "papi_avail -a" to identify the available counters
• Link with -lpapi

Demo: Hand instrumentation with PAPI

Statistical profiling
• profil() - Unix library call that periodically samples the program counter. Identifies the subroutines where the code spends the most time.
  • Used by gprof
• PAPI_profil() - Emulates profil(), but looks at a specific hardware counter. Identifies the file/line where the code spends the most time.
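The sampling idea can be illustrated without profil() itself. The sketch below is a deliberate simplification and an assumption of this note, not how gprof or PAPI are implemented: a POSIX profiling timer (setitimer with ITIMER_PROF) delivers SIGPROF at a fixed interval of consumed CPU time, and the handler charges each sample to whichever region of code is currently running. A real profiler records the program counter; here a hand-set region tag stands in for it. All function names and the DEMO_MAIN flag are hypothetical.

```c
#define _XOPEN_SOURCE 700   /* expose sigaction() and setitimer() */
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

enum { REGION_A = 0, REGION_B = 1, NUM_REGIONS = 2 };

/* Stand-in for the program counter: which region is executing now. */
static volatile sig_atomic_t current_region = REGION_A;
/* Histogram of samples, one bucket per region. */
static volatile sig_atomic_t samples[NUM_REGIONS];

/* Runs on every timer tick; charges the tick to the active region. */
static void on_sigprof(int signo) {
    (void)signo;
    samples[current_region]++;
}

/* Burn roughly `seconds` of CPU time while tagged as `region`. */
void spin(int region, double seconds) {
    assert(region >= 0 && region < NUM_REGIONS);
    current_region = region;
    clock_t end = clock() + (clock_t)(seconds * CLOCKS_PER_SEC);
    while (clock() < end) { /* busy wait on consumed CPU time */ }
}

/* Start sampling every interval_usec microseconds of CPU time. */
int start_sampling(long interval_usec) {
    struct sigaction sa;
    struct itimerval tv;

    assert(interval_usec > 0 && interval_usec < 1000000L);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigprof;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGPROF, &sa, NULL) != 0)
        return -1;

    tv.it_interval.tv_sec = 0;
    tv.it_interval.tv_usec = interval_usec;
    tv.it_value = tv.it_interval;
    return setitimer(ITIMER_PROF, &tv, NULL);
}

/* Disarm the profiling timer. */
int stop_sampling(void) {
    struct itimerval off;
    memset(&off, 0, sizeof off);
    return setitimer(ITIMER_PROF, &off, NULL);
}

long sample_count(int region) {
    assert(region >= 0 && region < NUM_REGIONS);
    return (long)samples[region];
}

#ifdef DEMO_MAIN   /* build with -DDEMO_MAIN for a standalone program */
int main(void) {
    if (start_sampling(1000) != 0) { perror("start_sampling"); return 1; }
    spin(REGION_A, 0.25);
    spin(REGION_B, 0.75);
    stop_sampling();
    printf("region A: %ld samples\n", sample_count(REGION_A));
    printf("region B: %ld samples\n", sample_count(REGION_B));
    return 0;
}
#endif
```

Because the counts are statistical, they vary from run to run; what matters is that the ratio of samples approximates the ratio of CPU time spent in each region. The same reasoning explains why a statistical profile of a very short run is unreliable.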

Using psrun to find hot spots
• gcc -g -o cmd cmd.c
• psrun -C -c papi_profile_cycles.xml cmd
  • "-C" instructs psrun to use the XML configurations in the install path rather than the current directory.
  • "-c papi_profile_cycles.xml" uses the named config file rather than the default.
  • "papi_profile_cycles.xml" directs psrun to collect file/line data.
• psprocess cmd.*.xml
  • Displays the results

Demo: 2nd Demo of psrun
