
Serial optimisation and profiling





  1. Serial optimisation and profiling Computer User Training Course 2004 Carsten Maaß User Support This unit is derived from presentations by John Hague and Daniel Boulet

  2. Topics • Tuning methodology • Timing • Profiling • Serial optimisation

  3. Why optimisation? Resources are limited/shared. With optimised code you could: • Make more cycles available for all users • Run more experiments • Run larger experiments • Get results earlier

  4. Performance measure • M1: ‘Science’ / time-unit using computer • M2: Computations / time-unit • M1/M2 = ?

  5. Tuning options In order of ease and preference: • Use existing package tuned for pSeries (IFS etc., etc., etc.) • Use ESSL (and/or MASS) • Use another tuned library such as NAG • Hand tune

  6. Tuning techniques • minimize I/O, including paging • compile overall with "-O4 -qarch=pwr4" and with "-O3 -qarch=pwr4" (and -qstrict) • measure performance & go with the better one • do a Hot-Spot Analysis (profiling) • for key routines: • consider replacing with the ESSL equivalent • check again which of -O3 and -O4 is best • hand tune

  7. Tuning methodology • start from un-optimised, correct code • define a performance target! • measure / profile • fast enough? Yes: you have your optimised code • No: tune the bottlenecks • correct results? Yes: measure / profile again • No: check the code • Don't abandon an otherwise successful tuning approach just because the program starts to generate wrong answers, as it may be possible to fix the answers while still getting faster results.

  8. Some reminders . . . • Neither hand tuning nor the compiler's optimizer is likely to triple or even double the performance of most programs. The only realistic exceptions are • very careful tuning of certain matrix-intensive applications (check out the SMP-aware ESSL library) • parallelization techniques • Never underestimate the potential improvement to be gained by switching to a better algorithm • after which, hand tuning and the compiler's optimizer can make things even faster

  9. Profiling or "Hot Spot Analysis" (1/2) • Remember why bank robbers rob banks: because that's where the money is! • Apply the same logic to performance tuning: • identify which parts of the program are the most expensive • concentrate tuning efforts on the expensive parts • REMEMBER: different input data may result in very different activity patterns

  10. Profiling or "Hot Spot Analysis" (2/2) • Rule of thumb: • Trying to double or triple the performance of a program by optimizing parts that each use less than 20% of the time is like trying to get rich by robbing lots of small grocery stores. • i.e. you have to be incredibly lucky to succeed!

  11. eoj — end-of-job information

      Node actual      : 16
      Adapter Req.     : (csss,MPI,not_shared,US)
      Resources        : ConsumableCpus(2) ConsumableMemory(1.758 gb)
      #*#* Next 3 times NOT up-to-date (TOTAL CPU TIME given later IS accurate)
      Step User Time   : 5+19:10:52.450000
      Step System Time : 00:25:27.960000
      Step Total Time  : 5+19:36:20.410000 (502580.41 secs)
      #*#* Last 3 times NOT up-to-date (TOTAL CPU TIME given later IS accurate)
      Context switches : involuntary = 180609151, voluntary = 2017625
                         per second  =     36383                  406
      Page faults      : with I/O = 41887, without I/O = 12774072
                         per second =      8                     2573

                        <--------- CPU -------->  <------------- MEM ------------>
      Node     ? #T #t  secs/CPU  (Eff%) (Now%)   max/TSK mb (Eff%) (Now% - mb )   Task list
      -------- - -- --  --------- ------ ------   ---------- ------ ------------   ---------
      hpcb0302 M  4  2    4076.71 ( 82%) ( 83%)       644.91 ( 35%) ( 36% - 7680)  0:1:2:3:
      …        …  …  …        …      …      …            …      …       …          …
      hpcb1903 .  4  2    4105.18 ( 82%) ( 95%)       627.06 ( 34%) ( 35% - 7680)  60:61:62:63:
      -------- - -- --  --------- ------ ------   ---------- ------ ------------   ---------
      Min      =          4033.09                     602.13 = Min
      Max      =          4118.67                     644.91 = Max
      -------- - -- --  --------- ------ ------   ---------- ------ ------------   ---------
      Elapsed  = 4964 secs                       1800 mb = ConsumableMemory
      CPU Tot  = 523004.47 ( 6+01:16:44)   Average: 32688 s/node, 8172 s/task
      System Billing Units used by this jobstep = 563.363

  Can be used at any time: eoj job-ID

  12. Profiling tools • Xprofiler (a really useful and user-friendly tool) • compile with -g -pg (and usual optimization options) and then execute program against chosen test case(s) • provides graphical indication of call tree • visual indication of most active routines • click on routine to get FORTRAN statement level profiling • part of IBM Parallel Environment • also available on ecgate • prof, gprof (standard Unix tools) • Use them if Xprofiler is unavailable

  13. Using Xprofiler • compile and link the program with the -g and -pg options:
      $ xlf -c -g -pg -O4 main.f
      $ xlf -c -g -pg -O4 qq.f
      $ xlf -g -pg main.o qq.o -o prog
  • run the program (creates a gmon.out file):
      $ ./prog data1
  • invoke Xprofiler on the binary and the gmon.out file:
      $ xprofiler prog gmon.out

  14. Xprofiler - overall view [call-graph display; function boxes shown: start, mcount, main, recurs, par1, par2, transfm, trans2, trans3, log, sin, cos, tan]

  15. Xprofiler - zoomed view • Hints: • to obtain a clear overview of the call tree for your executable only, use Filter -> Hide All Library Calls followed by Filter -> Uncluster • use View -> Zoom In to see the labels • right-click a function box for the Function menu

  16. Interpreting Xprofiler results • the width of a box indicates the relative amount of time spent by the routine and the routine's descendents • the height of a box indicates the relative amount of time spent in the routine itself • example: the program spent 2.631 seconds of CPU time in the routine recurs and its descendents; the routine itself consumed 1.230 seconds of CPU time • [4] is the index of the routine in the "Function index" report • arc labels give call counts: recurs called log 1000000 times

  17. Xprofiler source code view A "right click" on the recurs function's box brings up the routine's source code view: The "ticks" column is the number of times the line was "active" when the profiling clock "ticked"

  18. gmon.out file issues • each run of the program will create a new gmon.out file (overwriting any existing gmon.out file) • rename the old gmon.out file first if you want to keep it • recompiling the program invalidates all older gmon.out files • saving the old binary before recompiling can be used to keep older gmon.out files valid • modifying the source file invalidates the information shown in xprofiler's "Source Code" window

  19. Timing a program • In program: • mclock() returns CPU time • INTEGER FUNCTION • returns 1/100ths of seconds • rtc() returns elapsed (wall-clock) time • REAL*8 FUNCTION • returns seconds with microsecond resolution • AIX time(x) command gives: • 'Real' time (elapsed) • 'User' time (CPU) • 'System' time (CPU) • Total CPU time = User + System

      implicit none
      real*8  r0, rtc, cpu_secs, real_secs
      integer m0, mclock
      .
      r0 = rtc()
      m0 = mclock()
      .
      >code you want to time<
      .
      cpu_secs  = (mclock() - m0) * 0.01
      real_secs = rtc() - r0

  20. MCLOCK granularity • timed code must take significantly longer than 1/100th sec • an AIX restriction - not FORTRAN • wrap the loop in a "time multiplier" loop to improve timing results:

      NREP = 10000
      T0 = MCLOCK()
C===  REPEAT LOOP IS TO INCREASE TOTAL CPU TIME
      DO MMM = 1, NREP
        DO I = 1, 2000
          A(I) = A(I) + S*B(I)
        ENDDO
      ENDDO
      TLOOP = (MCLOCK() - T0) / 100. / NREP

  • BE CAREFUL: this may wrongly hide cache miss effects, since the data stays cache-resident after the first pass • the solution is to use rtc() and run CPU-bound on an otherwise quiet system

  21. Beware of the optimizer • the optimizer might move (or remove) the calls to rtc() and mclock():

      m0 = rtc()
      . . .
      m1 = rtc()
      delta = m1 - m0

  • insert print statements, or pass the timer values to an external dummy routine, to force serialization:

      call dummy(m0)
      . . .
      call dummy(m1)

  • use -qstrict • flush

  22. Hardware performance monitor (1/2) • counters provided by the processor • can be used via libraries in /usr/local/lib/trace • libmpihpm.a (and -lpmapi) • see /usr/local/lib/trace/README • Performance monitoring was developed for hardware engineers! • No. of floating point operations: FPU_FMA + FPU0_FIN + FPU1_FIN - FPU_STF

  23. Hardware performance monitor (2/2) • Other tools: • hpmcount • ~trx/hpm/hpmcount executable • libhpm • -L/home/ectrain/trx/hpm -lhpm -lpmapi • export HPM_GROUP=[0-60], see /usr/local/lib/trace/power4.ref • call hpm_begt(n) starts counting block n • call hpm_endt(n) stops counting block n • call hpm_prnt() prints counter values and labels • see ~trx/hpm

  24. The serial tuning top 10 list • instrument the code (i.e. mclock() and rtc()) • try different optimization flags • use optimised libraries (ESSL, MASS, (NAG)) • use stride 1 • use cache effectively • keep pipelines and FPUs busy • maximize: Floating Point ops / (Load+Store ops) • replace DIVIDEs • remove IF statements • help the compiler • replace ** with EXP of LOG • recode getting fractional part of a number

  25. FORTRAN compiler flags • -O2 • optimises, but retains order of computation • small amount of unrolling • better than -O3 -qhot for some routines • -O3 • optimises with reordering of computation • more aggressive unrolling • use -qstrict to retain order of computation • -qhot • blocks and transforms simple loops • good for F90 array notation instructions • use selectively

  26. FORTRAN compiler flags • -qarch=auto [pwr4, pwr3] • controls which instructions the compiler can generate; changing the default can improve performance but might produce code that can only run on specific machines • -O4 • shorthand for: -O3 -qhot -qipa -qarch=auto -qtune=auto -qcache=auto • -qipa: inter-procedural analysis - increases compilation time • -qcache=auto, -qtune=auto: tune for the processor doing the compilation

  27. FORTRAN compiler flags • -qessl • will substitute Fortran intrinsic functions with ESSL library versions when it is safe to do so (-lessl must be specified at link time) • Try various combinations, as many optimisations interfere with each other

  28. Performance libraries • MASS library • Mathematical Acceleration SubSystem • ESSL • Engineering and Scientific Subroutine Library • NAG • Numerical Algorithms Group (not particularly optimised for POWER4!); link with $NAGLIB

  29. MASS library • automatically provides high-performance alternatives to maths intrinsics • re-link only • vector versions require source code changes • some are very slightly less accurate (normally only one ULP, i.e. one bit) • at high optimization levels (-O4), xlf may automatically use routines from MASS • -L/usr/local/lib/mass -lmass -lmassv

  30. MASS library • scalar library • no change to code (i.e. the compiler uses them) • exp, log, **, sin, cos, tan, dnint speed up by a factor of about 2 • vector library • code change may be required • exp, log, sin, cos, tan, dnint, dint speed up by a factor of about 6 • if exp is guarded by an IF statement, create a reduced vector of the elements that need it (which may not be long enough to benefit) http://www.rs6000.ibm.com/resource/technology/MASS

  31. MASS library • example performance gains (speed-up factors) on POWER3 architecture: • log: 1.57, vlog: 10.4 • sin: 2.42, vsin: 10.0 • (reciprocal) vrec: 2.6 • Compiler flags: • -qarch=pwr4 enables hardware SQRT (very important) • -qnounroll to be avoided for small loops • -qhot: compiler uses vector MASS SQRT and 1/x • -qstrict disables vector MASS functions • hand-coded use of MASS is generally best

  32. ESSL functionality • BLAS (Basic Linear Algebra Subprograms) • linear algebra • eigensystem analysis • Fourier transform • etc.

  33. BLAS • three "levels" • Level 1: vector-vector operations, e.g. dot product (DDOT), DAXPY • Level 2: matrix-vector operations, e.g. DGEMV • Level 3: matrix-matrix operations, e.g. matrix multiply, DGEMM • standardised • portable across systems • hardware vendors are encouraged to supply high-performance BLAS • IBM's high-performance BLAS is ESSL

  34. (P)ESSL • both ESSL and PESSL have an SMP-parallel capability -lessl -lesslsmp • the "Parallel" in PESSL refers to the use of MPI message passing, usually over the SP switch

  35. ESSL library • particularly good for: • FFTs • matrix manipulation • linear equation solvers • sort • FORTRAN often better for level 2 BLAS • get the benefits of inlining etc.:

      CALL DAXPY(N,A,P,1,R,1)
      S = DDOT(N,R,1,P,1)

  versus the fused hand-written loop:

      S = 0.
      DO I=1,N
        R(I) = R(I) + A*P(I)
        S = S + P(I)*R(I)
      ENDDO

  36. Use stride 1 • bad (stride N):

      DO I=1,N
        DO J=1,N
          C(I,J) = C(I,J) + A(I,J)
        ENDDO
      ENDDO

  • good: the innermost loop runs over the leftmost Fortran index of the arrays (stride 1):

      DO J=1,N
        DO I=1,N
          C(I,J) = C(I,J) + A(I,J)
        ENDDO
      ENDDO

  • Note: the compiler may do this interchange with -qhot (or -O4)

  37. Avoid stores • stores may flush caches • stores may have to load cache lines first • don't zero arrays just as a precaution • mix storing zeros with storing data, or • use the CACHE_ZERO directive • zeroes a whole cache line without loading it from memory • needs to be put in a subroutine to handle partial-line zeroing • stores only go at half speed on POWER4 (unless cache lines are interleaved)

  38. Avoid stores • two separate loops store y and then reload it:

      do i=1,n
        y(i) = c*x(i)
      enddo
      . . .
      do i=1,n
        z(i) = 1.0 + y(i)
      enddo

  • if y is not needed elsewhere, fusing the loops avoids the intermediate stores and loads:

      do i=1,n
        z(i) = 1.0 + c*x(i)
      enddo

  39. Remove DIVIDEs • two divides per iteration:

      do i=1,N
        a(i) = x(i)/z(i)
        b(i) = y(i)/z(i)
      enddo

  • one divide and two multiplies:

      do i=1,N
        t = 1.d0/z(i)
        a(i) = x(i)*t
        b(i) = y(i)*t
      enddo

  • Note: the compiler usually uses the reciprocal with -O3 (without -qstrict)

  40. Remove DIVIDEs •

      do i=1,N
        z(i) = a/x(i) + b/y(i)
      enddo

  becomes

      do i=1,N
        z(i) = (a*y(i) + b*x(i)) / (x(i)*y(i))
      enddo

  • Remember: DIVIDEs take about 14 cycles, but the extra multiplies only take from 1 cycle (pipelined) to 6 cycles (totally unpipelined)

  41. Remove DIVIDEs • create a reciprocal array if a DIVIDE uses the same denominator more than once • try the MASS library VDIV (vector divide): call vdiv(out,nom,div,len) • may need to split the loop • the compiler will try to do this with -qhot, if the divide is not conditional

  42. Remove SQRTs • much harder to remove as there is usually no high-speed alternative • but sometimes there is: replace if ( sqrt(x) < y ) with if ( x < y * y ) (valid when y is non-negative)

  43. Remove IFs • the IF is true only for j=1:

      do j=1,N
        if(j.eq.1) then
          a(j) = 1.0
        else
          a(j) = b(j)
        endif
      enddo

  • peel the first iteration instead:

      a(1) = 1.0
      do j=2,N
        a(j) = b(j)
      enddo

  44. Remove IFs • the condition is loop-invariant:

      do j=1,N
        if(k(i).eq.0) x(j)=0.0
        a(j) = x(j) + c*b(j)
      enddo

  • hoist it out of the loop:

      if(k(i).eq.0) then
        do j=1,N
          x(j) = 0.0
          a(j) = c*b(j)
        enddo
      else
        do j=1,N
          a(j) = x(j) + c*b(j)
        enddo
      endif

  45. Remove IFs • use MAX or MIN instead of IF:

      do j=1,N
        if(a(j).lt.0) then
          b(j) = 0.0
        else
          b(j) = a(j)
        endif
      enddo

  becomes

      do j=1,N
        b(j) = max(0.0, a(j))
      enddo

  46. Help the compiler • if -qstrict is used, put parentheses around common subexpressions so they are computed only once:

      DO I=1,N
        A(I) = B(I)*C(J)*D(J)
        X(I) = Y(I)*C(J)*D(J)
      ENDDO

  becomes

      DO I=1,N
        A(I) = B(I)*(C(J)*D(J))
        X(I) = Y(I)*(C(J)*D(J))
      ENDDO

  47. Help the compiler • enable overlapping (pipelining):

      DO J=1,M
        S = S + C*X(J)
      ENDDO

  becomes

      S1=0.; S2=0.; S3=0.; S4=0.; S5=0.; S6=0.
      DO J=1,M,6
        S1 = S1 + C*X(J)
        S2 = S2 + C*X(J+1)
        S3 = S3 + C*X(J+2)
        S4 = S4 + C*X(J+3)
        S5 = S5 + C*X(J+4)
        S6 = S6 + C*X(J+5)
      ENDDO
      S = S1+S2+S3+S4+S5+S6

  • Notes: • need at least 6 independent FMAs for max MFLOPS • the compiler may unroll with -O3 (without -qstrict) • don't forget to handle the case where M isn't a multiple of 6!

  48. Help the compiler • unroll the outer loop (shown here assuming N is a multiple of 4):

      do j=1,N
        do i=2,M
          y(i,j) = y(i,j) - c*y(i-1,j)
        enddo
      enddo

  becomes

      do j=1,N,4
        do i=2,M
          y(i,j  ) = y(i,j  ) - c*y(i-1,j  )
          y(i,j+1) = y(i,j+1) - c*y(i-1,j+1)
          y(i,j+2) = y(i,j+2) - c*y(i-1,j+2)
          y(i,j+3) = y(i,j+3) - c*y(i-1,j+3)
        enddo
      enddo

  49. Use the cache effectively • poor use of the cache can reduce performance by a factor of 10 or more • a thorough understanding of the cache enables efficient program design • … but remember Feynman: if you think you understand how the cache works … then you don't understand how the cache works

  50. Use the cache effectively • block the inner strided loop, e.g. matrix transpose:

      do ii=1,N,NB
        do j=1,N
          do i=ii,ii+NB-1
            y(i,j) = x(j,i)
          enddo
        enddo
      enddo

  • choose NB so the working set fits in the L1 data cache (4 Kwords)
