Parallel Computing Explained Timing and Profiling

Slides prepared from the CI-Tutor courses at NCSA (http://ci-tutor.ncsa.uiuc.edu/) by S. Masoud Sadjadi, School of Computing and Information Sciences, Florida International University, March 2009. (Additional slides by Javier Delgado.)

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
6 Timing and Profiling
  6.1 Timing
    6.1.1 Timing a Section of Code
      6.1.1.1 CPU Time
      6.1.1.2 Wall clock Time
    6.1.2 Timing an Executable
    6.1.3 Timing a Batch Job
  6.2 Profiling
    6.2.1 Profiling Tools
    6.2.2 Profile Listings
    6.2.3 Profiling Analysis
  6.3 Further Information

Timing and Profiling
• Now that your program has been ported to the new computer, you will want to know how fast it runs.
• This chapter describes how to measure the speed of a program using various timing routines.
• It also covers how to determine which parts of the program account for the bulk of the computational load, so that you can concentrate your tuning efforts on those computationally intensive parts.
• Finally, it gives a summary of some available profiling tools.

Timing
• In the following sections, we’ll discuss timers and review the profiling tools ssrun and prof on the Origin and vprof and gprof on the Linux clusters. The specific timing functions described are:
• Timing a section of code
  • FORTRAN
    • etime, dtime, cpu_time for CPU time
    • time and f_time for wall clock time
  • C
    • clock for CPU time
    • gettimeofday for wall clock time
• Timing an executable
  • time a.out
• Timing a batch run
  • busage
  • qstat
  • qhist

CPU Time
• etime
  • A section of code can be timed using etime.
  • It returns the elapsed CPU time in seconds since the program started.
        real*4 tarray(2),time1,time2,timeres
        ! … beginning of program
        time1=etime(tarray)
        ! … start of section of code to be timed
        ! … lots of computation
        ! … end of section of code to be timed
        time2=etime(tarray)
        timeres=time2-time1

CPU Time
• dtime
  • A section of code can also be timed using dtime.
  • It returns the elapsed CPU time in seconds since the last call to dtime.
        real*4 tarray(2),timeres
        ! … beginning of program
        timeres=dtime(tarray)
        ! … start of section of code to be timed
        ! … lots of computation
        ! … end of section of code to be timed
        timeres=dtime(tarray)
        ! … rest of program

CPU Time
The etime and dtime Functions
• User time.
  • This is returned as the first element of tarray.
  • It’s the CPU time spent executing user code.
• System time.
  • This is returned as the second element of tarray.
  • It’s the time spent executing system calls on behalf of your program.
• Sum of user and system time.
  • This is the function value that is returned.
  • It’s the time that is usually reported.
• Metric.
  • Timings are reported in seconds.
  • Timings are accurate to 1/100th of a second.
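
For C programmers, the same user/system breakdown can be sketched with the POSIX call getrusage. This is only an illustration, not part of the original course material: the names tv_seconds and cpu_times are hypothetical, and the array layout simply mirrors the tarray convention described above.

```c
#include <assert.h>
#include <sys/resource.h>   /* getrusage, struct rusage, RUSAGE_SELF */
#include <sys/time.h>       /* struct timeval */

/* Hypothetical helper: convert a struct timeval to seconds. */
static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

/*
 * Hypothetical C analogue of etime.
 * tarray[0] receives the user CPU time and tarray[1] the system CPU time,
 * both in seconds since the process started. The sum is returned.
 */
double cpu_times(double tarray[2])
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);            /* getrusage returns 0 on success */
    (void)rc;                   /* silence unused warning when NDEBUG is set */
    tarray[0] = tv_seconds(ru.ru_utime);
    tarray[1] = tv_seconds(ru.ru_stime);
    return tarray[0] + tarray[1];
}
```

Calling cpu_times before and after a section of code and subtracting the two return values gives the CPU time of that section, in the same way as the etime example.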

CPU Time
Timing Comparison Warnings
• For the SGI computers:
  • The etime and dtime functions return the MAX time over all threads for a parallel program.
  • This is the time of the longest thread, which is usually the master thread.
• For the Linux clusters:
  • The etime and dtime functions are contained in the VAX compatibility library of the Intel FORTRAN Compiler.
  • To use this library, include the compiler flag -Vaxlib.
• Another warning: do not put calls to etime and dtime inside a do loop. The overhead is too large.
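
The last warning applies to any timer, not only etime and dtime. A minimal C sketch of the recommended pattern follows: one timer call before the loop and one after it, with none inside. The function timed_sum and its summation workload are hypothetical stand-ins for a real loop body.

```c
#include <assert.h>
#include <stddef.h>   /* NULL */
#include <time.h>     /* clock, clock_t, CLOCKS_PER_SEC */

/*
 * Hypothetical workload: sums the integers 1..n and reports the CPU time
 * of the whole loop through *cpu_seconds.
 */
double timed_sum(long n, double *cpu_seconds)
{
    assert(cpu_seconds != NULL);

    volatile double sum = 0.0;   /* volatile keeps the loop from being optimized away */
    clock_t start = clock();     /* one timer call before the loop */
    for (long i = 1; i <= n; i++) {
        sum += (double)i;        /* no timer calls in the loop body */
    }
    clock_t stop = clock();      /* one timer call after the loop */

    *cpu_seconds = (double)(stop - start) / (double)CLOCKS_PER_SEC;
    return sum;
}
```

If a per-iteration figure is needed, dividing the total by the iteration count avoids paying the timer overhead on every pass.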

CPU Time
cpu_time
• The cpu_time routine is available only on the Linux clusters, as it is a component of the Intel FORTRAN compiler library.
• It provides substantially higher resolution and has substantially lower overhead than the older etime and dtime routines.
• It can be used as an elapsed timer.
        real*8 time1, time2, timeres
        ! … beginning of program
        call cpu_time(time1)
        ! … start of section of code to be timed
        ! … lots of computation
        ! … end of section of code to be timed
        call cpu_time(time2)
        timeres=time2-time1
        ! … rest of program

CPU Time
clock
• C programmers can call the cpu_time routine through a FORTRAN wrapper, or call the standard library function clock to determine elapsed CPU time.
        #include <time.h>
        /* seconds per clock tick */
        static const double iCPS = 1.0/(double)CLOCKS_PER_SEC;
        double time1, time2, timeres;
        /* … */
        time1 = clock()*iCPS;
        /* … do some work … */
        time2 = clock()*iCPS;
        timeres = time2 - time1;

Wall clock Time
time
• For the Origin, the function time returns the time since 00:00:00 GMT, Jan. 1, 1970.
• It is a means of getting the elapsed wall clock time.
• The wall clock time is reported in integer seconds.
        external time
        integer*4 time1,time2,timeres
        ! … beginning of program
        time1=time( )
        ! … start of section of code to be timed
        ! … lots of computation
        ! … end of section of code to be timed
        time2=time( )
        timeres=time2 - time1
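
The C standard library offers a function of the same name with the same whole-second granularity on typical systems. The sketch below is an illustration rather than part of the original slides; the helper name wall_seconds_since is hypothetical.

```c
#include <assert.h>
#include <time.h>   /* time, difftime, time_t */

/*
 * Hypothetical helper: wall clock seconds elapsed since `start`,
 * where `start` was obtained earlier from time(NULL).
 */
double wall_seconds_since(time_t start)
{
    time_t now = time(NULL);
    assert(now != (time_t)-1);     /* time returns (time_t)-1 on failure */
    return difftime(now, start);   /* difftime returns now - start in seconds */
}
```

Because the resolution is one second, this approach only suits sections of code that run for many seconds.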

Wall clock Time
f_time
• For the Linux clusters, the appropriate FORTRAN function for elapsed time is f_time.
        integer*8 f_time
        external f_time
        integer*8 time1,time2,timeres
        ! … beginning of program
        time1=f_time()
        ! … start of section of code to be timed
        ! … lots of computation
        ! … end of section of code to be timed
        time2=f_time()
        timeres=time2 - time1
• As with etime and dtime, the f_time function is in the VAX compatibility library of the Intel FORTRAN Compiler. To use this library, include the compiler flag -Vaxlib.

Wall clock Time
gettimeofday
• For C programmers, wall clock time can be obtained by using the very portable routine gettimeofday.
        #include <stddef.h>    /* definition of NULL */
        #include <sys/time.h>  /* definition of struct timeval and prototype of gettimeofday */
        double t1,t2,elapsed;
        struct timeval tp;
        int rtn;
        /* .... */
        rtn=gettimeofday(&tp, NULL);
        t1=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
        /* .... do some work .... */
        rtn=gettimeofday(&tp, NULL);
        t2=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
        elapsed=t2-t1;

Timing an Executable
• To time an executable (if using a csh or tcsh shell, explicitly call /usr/bin/time):
        time …options… a.out
• Here options can be ‘-p’ for a simple output or ‘-f format’, which allows the user to display more than just time-related information.
• Consult the man pages on the time command for format options.

Timing a Batch Job
• Time of a batch job running or completed.
• Origin
        busage jobid
• Linux clusters
        qstat jobid    # for a running job
        qhist jobid    # for a completed job

Profiling
• Profiling determines where a program spends its time.
  • It detects the computationally intensive parts of the code.
• Use profiling when you want to focus attention and optimization efforts on those loops that are responsible for the bulk of the computational load.
• Most codes follow the 90-10 Rule.
  • That is, 90% of the computation is done in 10% of the code.
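
The payoff of the 90-10 Rule can be made concrete with Amdahl's law, which gives the overall speedup when only part of the run time is improved. The sketch below is illustrative; the function name overall_speedup is hypothetical. With 90% of the run time in the hot spot, making that part 10 times faster gives an overall speedup of about 5.3, whereas making the remaining code (10% of the run time) 10 times faster gives only about 1.1.

```c
#include <assert.h>

/*
 * Amdahl's law: overall speedup when a fraction f of the run time
 * (0 <= f <= 1) is made s times faster (s > 0) and the rest is unchanged.
 */
double overall_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0);
    assert(s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

For example, overall_speedup(0.9, 10.0) corresponds to tuning the hot spot and overall_speedup(0.1, 10.0) to tuning everything else.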

Profiling Tools
Profiling Tools on the Origin
• On the SGI Origin2000 computer there are profiling tools named ssrun and prof.
  • Used together they do profiling, or what is called hot spot analysis.
  • They are useful for generating timing profiles.
• ssrun
  • The ssrun utility collects performance data for an executable that you specify.
  • The performance data is written to a file named "executablename.exptype.id".
• prof
  • The prof utility analyzes the data file created by ssrun and produces a report.
• Example
        ssrun -fpcsamp a.out
        prof -h a.out.fpcsamp.m12345 > prof.list

Profiling Tools
Profiling Tools on the Linux Clusters
• On the Linux clusters the profiling tools are still maturing. There are currently several efforts to produce tools comparable to the ssrun and perfex tools.
• gprof
  • Basic profiling information can be generated using the OS utility gprof.
  • First, compile the code with the compiler flags -p -g for the Intel compiler (-g on the Intel compiler does not change the optimization level) or -pg for the GNU compiler.
  • Second, run the program.
  • Finally, analyze the resulting gmon.out file using the gprof utility: gprof executable gmon.out.
        efc -O -p -g -o foo foo.f
        ./foo
        gprof foo gmon.out

The Performance API (PAPI)
• Provides an interface to the hardware performance counters integrated in the CPU
• Provides more in-depth details about resource utilization
  • E.g. cache misses, instructions per second
• Used by perfex, mpitrace, perfsuite, and other profiling tools
• Requires a kernel patch to deploy on Linux

Profiling Tools
Profiling Tools on the Linux Clusters
• vprof
  • On the IA32 platform there is a utility called vprof that provides performance information using the PAPI instrumentation library.
  • Instrumenting the whole application requires recompiling and linking to the vprof and PAPI libraries.
        setenv VMON PAPI_TOT_CYC
        ifc -g -O -o md md.f /usr/apps/tools/vprof/lib/vmonauto_gcc.o -L/usr/apps/tools/lib -lvmon -lpapi
        ./md
        /usr/apps/tools/vprof/bin/cprof -e md vmon.out

Profile Listings
Profile Listings on the Origin
• Prof Output First Listing
  • The first listing gives the number of cycles executed in each procedure (or subroutine). The procedures are listed in descending order of cycle count.
          Cycles      %   Cum%  Secs  Proc
        --------  -----  -----  ----  ------
        42630984  58.47  58.47  0.57  VSUB
         6498294   8.91  67.38  0.09  PFSOR
         6141611   8.42  75.81  0.08  PBSOR
         3654120   5.01  80.82  0.05  PFSOR1
         2615860   3.59  84.41  0.03  VADD
         1580424   2.17  86.57  0.02  ITSRCG
         1144036   1.57  88.14  0.02  ITSRSI
          886044   1.22  89.36  0.01  ITJSI
          861136   1.18  90.54  0.01  ITJCG

Profile Listings
Profile Listings on the Origin
• Prof Output Second Listing
  • The second listing gives the number of cycles per source code line.
  • The lines are listed in descending order of cycle count.
          Cycles      %   Cum%  Line  Proc
        --------  -----  -----  ----  ------
        36556944  50.14  50.14  8106  VSUB
         5313198   7.29  57.43  6974  PFSOR
         4968804   6.81  64.24  6671  PBSOR
         2989882   4.10  68.34  8107  VSUB
         2564544   3.52  71.86  7097  PFSOR1
         1988420   2.73  74.59  8103  VSUB
         1629776   2.24  76.82  8045  VADD
          994210   1.36  78.19  8108  VSUB
          969056   1.33  79.52  8049  VADD
          483018   0.66  80.18  6972  PFSOR

Profile Listings
Profile Listings on the Linux Clusters
• gprof Output First Listing
  • The listing gives a 'flat' profile of functions and routines encountered, sorted by 'self seconds', which is the number of seconds accounted for by this function alone.
    Flat profile:
    Each sample counts as 0.000976562 seconds.
         % cumulative     self                self      total
      time    seconds  seconds     calls   us/call    us/call  name
     -----  ---------  -------  --------  --------  ---------  -----------
     38.07       5.67     5.67       101  56157.18  107450.88  compute_
     34.72      10.84     5.17  25199500      0.21       0.21  dist_
     25.48      14.64     3.80                                 SIND_SINCOS
      1.25      14.83     0.19                                 sin
      0.37      14.88     0.06                                 cos
      0.05      14.89     0.01     50500      0.15       0.15  dotr8_
      0.05      14.90     0.01       100     68.36      68.36  update_
      0.01      14.90     0.00                                 f_fioinit
      0.01      14.90     0.00                                 f_intorange
      0.01      14.90     0.00                                 mov
      0.00      14.90     0.00         1      0.00       0.00  initialize_

Profile Listings
Profile Listings on the Linux Clusters
• gprof Output Second Listing
  • The second listing gives a 'call-graph' profile of functions and routines encountered. The definitions of the columns are specific to the line in question. Detailed information is contained in the full output from gprof.
    Call graph:
    index  % time    self  children             called  name
    -----  ------    ----  --------  -----------------  ----------------
    [1]      72.9    0.00     10.86                     main [1]
                     5.67      5.18            101/101      compute_ [2]
                     0.01      0.00            100/100      update_ [8]
                     0.00      0.00                1/1      initialize_ [12]
    ---------------------------------------------------------------------
                     5.67      5.18            101/101      main [1]
    [2]      72.8    5.67      5.18                101  compute_ [2]
                     5.17      0.00  25199500/25199500      dist_ [3]
                     0.01      0.00        50500/50500      dotr8_ [7]
    ---------------------------------------------------------------------
                     5.17      0.00  25199500/25199500      compute_ [2]
    [3]      34.7    5.17      0.00           25199500  dist_ [3]
    ---------------------------------------------------------------------
                                                            <spontaneous>
    [4]      25.5    3.80      0.00                     SIND_SINCOS [4]
    … …

Profile Listings
Profile Listings on the Linux Clusters
• vprof Listing
  • The following listing from vprof (using the -e option to cprof) displays not only the cycles consumed by functions (a flat profile) but also the lines in the code that contribute to those functions.
    Columns correspond to the following events:
        PAPI_TOT_CYC - Total cycles (1956 events)
    File Summary:
        100.0% /u/ncsa/gbauer/temp/md.f
    Function Summary:
         84.4% compute
         15.6% dist
    Line Summary:
         67.3% /u/ncsa/gbauer/temp/md.f:106
         13.6% /u/ncsa/gbauer/temp/md.f:104
          9.3% /u/ncsa/gbauer/temp/md.f:166
          2.5% /u/ncsa/gbauer/temp/md.f:165
          1.5% /u/ncsa/gbauer/temp/md.f:102
          1.2% /u/ncsa/gbauer/temp/md.f:164
          0.9% /u/ncsa/gbauer/temp/md.f:107
          0.8% /u/ncsa/gbauer/temp/md.f:169
          0.8% /u/ncsa/gbauer/temp/md.f:162
          0.8% /u/ncsa/gbauer/temp/md.f:105

Profile Listings
Profile Listings on the Linux Clusters
• vprof Listing (cont.)
          0.7% /u/ncsa/gbauer/temp/md.f:149
          0.5% /u/ncsa/gbauer/temp/md.f:163
          0.2% /u/ncsa/gbauer/temp/md.f:109
          0.1% /u/ncsa/gbauer/temp/md.f:100
    … …
    100   0.1%        do j=1,np
    101                 if (i .ne. j) then
    102   1.5%            call dist(nd,box,pos(1,i),pos(1,j),rij,d)
    103                   ! attribute half of the potential energy to particle 'j'
    104  13.6%            pot = pot + 0.5*v(d)
    105   0.8%            do k=1,nd
    106  67.3%              f(k,i) = f(k,i) - rij(k)*dv(d)/d
    107   0.9%            enddo
    108                 endif
    109   0.2%        enddo

Profiling Analysis
• The program being analyzed in the previous Origin example has approximately 10000 source code lines and consists of many subroutines.
  • The first profile listing shows that over 50% of the computation is done inside the VSUB subroutine.
  • The second profile listing shows that line 8106 in subroutine VSUB accounted for 50% of the total computation.
  • Going back to the source code, line 8106 is a line inside a do loop.
  • By putting an OpenMP compiler directive in front of that do loop, you can get 50% of the program to run in parallel with almost no work on your part.
• Because the compiler has rearranged the source lines, the line numbers given by ssrun/prof indicate only the general area of the code to inspect.
  • To view the rearranged source, use the option
        f90 … -FLIST:=ON
        cc … -CLIST:=ON
  • For the Intel compilers, the appropriate options are
        ifort … -E …
        icc … -E …
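
As a rough C illustration of placing an OpenMP directive in front of a hot loop, consider the sketch below. The source of VSUB is not shown in these slides, so the element-wise subtraction here is purely an assumed stand-in, and the function name vsub is hypothetical. The directive is only safe when the loop iterations are independent of one another, as they are in this example.

```c
#include <assert.h>

/*
 * Hypothetical hot loop: element-wise vector subtraction c = a - b.
 * The pragma asks the compiler to split the iterations across threads.
 * Without OpenMP support enabled (e.g. -fopenmp for GCC) the pragma is
 * ignored and the loop runs serially with the same result.
 */
void vsub(const double *a, const double *b, double *c, long n)
{
    assert(n >= 0);

    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        c[i] = a[i] - b[i];   /* each iteration touches only element i */
    }
}
```

Note that even with perfect scaling of the loop, parallelizing a section that accounts for 50% of the run time limits the overall speedup to a factor of 2, so the remaining serial half becomes the next tuning target.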

MPE and Jumpshot
• MPE is a tracing library that comes with MPI
• Jumpshot is a graphical application for analyzing the MPE output
• MPE requires inserting code at the specific locations to be analyzed
• Display options are specified in the code (e.g. “Show MPI_Broadcast events in dotted blue lines”)

Jumpshot

Perfsuite
• Collection of tools, utilities, and libraries for software performance analysis
• Intel architectures only
• Provides many in-depth statistics
  • Operations per cycle, cache miss/hit data, etc.
• Not difficult to use (but may be difficult to compile)
        mpiexec -np $NN psrun wrf.exe
        psprocess wrf.exe.NN_n.xml
• Requires PAPI kernel patch for showing most information

Perfsuite + Graphical App http://perfsuite.ncsa.uiuc.edu/examples/GenIDLEST/

CEPBA Tools
• Developed at the European Center for Parallelism of Barcelona
• Currently not free
• Provide text-based and graphical applications for:
  • Execution analysis and optimization
  • Execution prediction
• 3 main tools:
  • Mpitrace, Dimemas, Paraver

CEPBA Tools
• Powerful, but complex
• Require PAPI kernel patch for showing most information
• May require the application to be recompiled
• Very large trace files for long executions and/or high numbers of processors (e.g. over 10GB)

CEPBA Tools Source: Barcelona SuperComputing Center – http://www.bsc.es/plantillaA.php?cat_id=479

Visualizing with Paraver
• Process:
  • (Compile the application with the mpitrace libraries linked)
  • Execute the application (and preload the mpitrace libraries if they are not linked to the application)
  • Convert the individual trace files to a Paraver file
  • “Chop” the Paraver trace file if it is too big

Paraver Screenshots

Dimemas
• Estimates the impact of code changes without changing the code
• Estimates execution time on slightly different architectures

Further Information
• SGI Irix
  • man etime
  • man 3 time
  • man 1 time
  • man busage
  • man timers
  • man ssrun
  • man prof
  • Origin2000 Performance Tuning and Optimization Guide
• Linux Clusters
  • man 3 clock
  • man 2 gettimeofday
  • man 1 time
  • man 1 gprof
  • man 1B qstat
  • Intel Compilers
  • Vprof on NCSA Linux Cluster
