

  1. Single-Processor Optimization, Stuart Johnson, SDSC (sjohnson@sdsc.edu)

  2. It is impossible to get good performance on a parallel computer unless the underlying serial code performs well (?). Optimization of the single-processor code is probably required for good performance (?).

  3. Questions to ask before undertaking a project in code optimization • How much effort should I put into optimization? • Where should I focus my efforts? • What is the best way to measure performance? • Should I optimize for a particular machine?

  4. 1. How much effort should I put into optimization? • Code optimization, like many other endeavors, quickly runs into the law of diminishing returns. Since programmer time is very valuable, you need to decide how much time is worth spending on optimization. • How much use will the code get? Frequently reused computational kernels generally deserve more effort than codes that will be used once. • How crucial is it that the code achieves good performance? Does it require an excessive amount of CPU time (especially considering the size of my allocation)?

  5. 2. Where should I focus my efforts? • Once you decide to optimize a code, you should focus your efforts on the most computationally intensive portions. • Use performance analysis tools (Xprofiler, TotalView, Hpmcount on BH; ATOM, Pixie on TCSini) to identify hotspots in test runs. (Talk on Friday by Dr. Carrington covers TotalView & Xprofiler.) • Make sure that test runs are indicative of the production runs that you expect to do (don’t profile a toy example and base your application tuning on that!) • Often, only a few subroutines account for a majority of the CPU time. Sometimes just one loop contains most of the work.

  6. 3. What is the best way to measure performance? Below are some of the common measures of code performance: • MIPS – Millions of Instructions Per Second • Drawback – number of instructions issued is not necessarily indicative of the amount of useful work done. • MFLOPS – Millions of Floating-Point Ops. Per Second • A better metric for numerically intensive codes, BUT • Different platforms measure Flops differently • Flops is not completely indicative of useful work done • Run time/CPU time • The only true measure of code performance! Accounts for algorithmic improvements to code. Can be converted to cycles.

  7. 3a. Counting cycles like the pros • Counting cycles means: • Estimate how many cycles your loop(s) should take • Compare to measured times (converted to cycles) and tune the code to narrow the difference • Advantages: • only non-black-magic technique for code optimization • Compares performance to expected (absolute) performance • The process of predicting performance familiarizes you with the important features of an architecture • Disadvantages: • Requires knowledge of the architecture, which may be hard to acquire • Probably requires reading assembly code (not that bad really!) • Requires thought and time

  8. 3b. Portable timing routines in Fortran 90
  integer :: start, finish, rate
  real :: time
  call system_clock(count_rate=rate)
  call system_clock(count=start)
  ! * code segment to be timed *
  call system_clock(count=finish)
  time = real(finish-start)/real(rate)
  This is the preferred way to time code segments since it uses Fortran 90 intrinsic routines and is therefore completely portable. On the MPP nodes of the T3E and SP, “all” of the time is spent on your code.
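
  For concreteness, here is a minimal, self-contained sketch (mine, not from the original slide) that wraps the same system_clock pattern in a complete program; the daxpy-style loop and the array size are assumed placeholders for the "code segment to be timed".

  program timing_example
    implicit none
    integer, parameter :: n = 1000000       ! assumed problem size
    integer :: start, finish, rate, i
    real :: time
    real*8 :: a
    real*8, allocatable :: x(:), y(:)

    allocate(x(n), y(n))
    a = 2.5d0
    x = 1.0d0
    y = 0.0d0

    call system_clock(count_rate=rate)
    call system_clock(count=start)
    ! code segment to be timed: an assumed daxpy-style loop
    do i = 1, n
       y(i) = a*x(i) + y(i)
    enddo
    call system_clock(count=finish)

    time = real(finish - start) / real(rate)
    print *, 'elapsed time (s): ', time
    print *, y(1)        ! keep the result live so the loop is not optimized away
  end program timing_example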

  9. 3c. T3E/SP timing routines in C
  #include <sys/times.h>
  #include <time.h>
  #include <stdio.h>
  struct tms before, after;
  clock_t utime, stime, starttime, endtime;
  starttime = times(&before);
  /* code to be timed */
  endtime = times(&after);
  utime = after.tms_utime - before.tms_utime;
  stime = after.tms_stime - before.tms_stime;
  printf("CPU time = %f sec or %ld ticks\n",
         (float)utime/(float)CLK_TCK, (long)utime);

  10. 4. Should I optimize for a particular machine? • To get good performance, you often need to perform machine-specific optimizations based on: • Cache size and number of levels/structure of cache • Instruction set • Availability of vendor-specific libraries • Special hardware features (registers, streams) • How square roots and divisions are performed • The more a code is optimized for a specific machine, the less portable it is to other platforms. Between various platforms there can be fundamental architectural differences which make portable efficiency impossible (e.g., vector <-> RISC).

  11. Optimization philosophy • Understand your hardware; strive to predict performance. For RISC processors this yields 2 general concerns: • program for the memory hierarchy: • modify your algorithms to maximize data reuse • avoid cache thrashing • program for the functional units: • maximize independent operations to keep the pipelines busy • understand and avoid or mitigate the effects of expensive ops

  12. Power3 Architecture

  13. Power3 Chip Layout

  14. Node Architecture

  15. System Architecture

  16. Blue Horizon • Processor clock rate: 375 MHz • L1: 64 KB (8192 W), 128-way associative, 16-word cache line • L2: 8 MB, 4-way associative, 16-word cache line • 4 GB main memory per 8-way SMP node • 1152 processors (144 nodes)

  17. Pipelining and parallelism • 5-stage (carpenter) chair pipeline, 1 cycle per stage • Performance: dependent chairs: 5 cycles per chair; independent chairs: 1 cycle/chair, 4-cycle latency • By analogy, pipelined functional units require independent operations for high performance. • The Power3 NightHawk II has a 7/8-stage pipeline.

  18. Optimization and ... • Counting cycles • Cache • Loop transformations • Using compiler options • Using intrinsic and library routines • Hardware-specific features • Miscellaneous tricks • Inspecting assembly code to check what the compiler is doing

  19. Counting cycles - a simple start • For the BH (Power3 NightHawk 2) (in-cache: >90% of predicted speed): • can do 2 independent FMAs per cycle, or two independent FLOPs (a multiply or an add) per cycle • can load or store 2 words/cycle between cache and registers • loads/stores and arithmetic can overlap • clock rate is 375 MHz • rate in L2 is cut by 2/3rds (2 GB/s vs 6 GB/s) • For the TCSini (Alpha 21264) (in-cache: >90% of predicted speed): • can do 2 independent FLOPs per cycle: an independent multiply and an independent add • can do a load and a store per cycle between cache and registers • clock rate is 667 MHz
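
  As a rough worked example of what these numbers imply (my arithmetic, not from the original slide): the Power3's peak floating-point rate is 2 FMAs/cycle x 2 flops/FMA x 375 MHz = 1500 MFLOPS, and its peak cache-to-register bandwidth is 2 words/cycle x 8 bytes/word x 375 MHz = 6 GB/s, which matches the in-L1 rate quoted above (vs. 2 GB/s out of L2).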

  20. Counting cycles - simple example • Code fragment (NOT optimized):
  do i=1,n
    do j=1,n
      do k=1,n
        c(i,j)=c(i,j)+a(i,k)*b(k,j)
      enddo
    enddo
  enddo
  Per-iteration cycle cost predictions (BH): • In-cache, double precision: • Loads: 2: 1/2 cycle • FMAs: 1: 1/2 cycle (independent???) • Out of (L1) cache, double precision: • Loads: 2: 10s of cycles (worst case for the a(i,k) loads) • FMAs: 1: 1/2 cycle (independent???) • Observed: 35 cycles per iteration (1/70 of peak!!!)
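
  A minimal sketch (mine, not from the original slide) of how such an "observed" number can be measured, reusing the system_clock timer from slide 8; the matrix size and the clock rate used in the conversion to cycles per inner iteration are illustrative assumptions.

  program count_cycles
    implicit none
    integer, parameter :: n = 500          ! assumed problem size
    real, parameter :: clock_mhz = 375.0   ! BH Power3 clock rate from slide 19
    real*8, allocatable :: a(:,:), b(:,:), c(:,:)
    integer :: start, finish, rate, i, j, k
    real :: time, cycles_per_iter

    allocate(a(n,n), b(n,n), c(n,n))
    a = 1.0d0; b = 1.0d0; c = 0.0d0

    call system_clock(count_rate=rate)
    call system_clock(count=start)
    do i=1,n
      do j=1,n
        do k=1,n
          c(i,j) = c(i,j) + a(i,k)*b(k,j)
        enddo
      enddo
    enddo
    call system_clock(count=finish)

    time = real(finish - start) / real(rate)
    ! convert measured seconds to cycles per inner-loop iteration
    cycles_per_iter = time * clock_mhz * 1.0e6 / real(n)**3
    print *, 'cycles per iteration: ', cycles_per_iter
    print *, c(1,1)      ! keep the result live
  end program count_cycles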

  21. Cache • The cache is a small, fast memory that buffers data passing between registers and main memory. The cache was designed with several concepts in mind: • Cost per space ($$$$$): main memory (DRAM) is much cheaper and physically denser than cache (SRAM) • Spatial data locality (an assumption about your code): when a location in memory is referenced, nearby locations will probably also be referenced • Temporal data locality (another assumption): when a location in memory is referenced, it will probably be referenced again soon.

  22. Cache • To exploit spatial locality, data is loaded one cache line at a time, rather than one word at a time. • Example: when element A(1) is referenced, elements A(1)-A(n), where n is the size of the cache line, are loaded into cache. • To exploit temporal locality, once data is loaded into cache, it remains in cache until the cache line needs to be flushed (or written back and flushed) to hold other data.

  23. Cache • Both the BH and TCSini processors have associative caches. This means that each address in memory maps to several possible locations in cache: cache set = (memory address) modulo (cache size / associativity), and the `associativity' candidate locations in that set are checked in parallel. • Main memory is much larger than cache, resulting in a many-to-one mapping. (Figure: main memory and cache; memory locations of the same color compete for the same locations in cache.)
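
  As a worked example (my numbers, derived from the BH cache parameters on slide 25): the 8 MB, 4-way L2 has 8 MB / 4 = 2 MB per way, so any two addresses that differ by a multiple of 2 MB map to the same set and compete for its 4 lines.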

  24. Cache • An n-way set-associative cache can be thought of as n independent caches. A location in main memory maps to the same location in each of the caches. (Figure: three-way set-associative cache and main memory.)

  25. Power3 Cache vs. 21264 Cache (small and fast near the CPU, big and slow toward main memory) • Power3: L1 cache: 8 K words (64 KB), 128-way set-associative, 16 words/line; L2 cache: 8 MB, 4-way associative, 16 words/line; main memory: 4 GB • 21264: L1 cache: 8 K words (64 KB), 2-way associative, 8 words/line; L2 cache: 8 MB, direct mapped, 8 words/line; main memory: 4 GB

  26. Cache Which bank of the set-associative cache do I go to? • 21264 random replacement • Power3 least recently used (LRU) Why do I care about the details of cache structure? • I need to know where my data ends up in cache to prevent cache thrashing and also to predict how much data I can expect to remain in cache during program execution What is cache thrashing? • We'll cover that in a few more slides

  27. Cache • Both systems should ensure that an array or common block will be aligned on a cache line. This is neat, but is really only important for code which accesses randomly located, cache-line-sized chunks from main memory. • (Figure: possible mappings of the first elements of array x to locations in cache lines; “?” refers to whatever is stored in memory before array x. In the worst case X(1) falls at the end of a line; in the best case X(1) X(2) X(3) X(4) ... start at the beginning of a line.) • The following flags guarantee that the array starts at the beginning of a line on BH: xlf -qalign=4k (page boundary); xlc -qalign=<alignopt>

  28. Cache • Array elements and elements of a common block are always stored sequentially in memory. Although we can’t predict where a particular array element will end up in cache, we can predict the relative locations of two elements of an array or common block. • common /xy/ x(8192), y(8192) • If the cache contains 8K words, y(i) will be mapped to the same associativity set as x(i).

  29. Cache • Arrays competing for the same cache locations: cache thrashing.
  common /xy/ x(8M words), y(8M words)
  do i=1,8M
    y(i) = a*x(i) + y(i)
  enddo
  • Load y(1)-y(8) into an L2 cache line • Load y(1)-y(8) from L2 cache into an L1 cache line • Load x(1)-x(8) into an L2 cache line • x(1) maps to the same L2 cache location as y(1): flush the L2 cache line • Load x(1)-x(8) from L2 cache into an L1 cache line • y(2) maps to the same L1 (D-cache) location as x(2), but associativity saves us • but now x is not cached in L2 even though it was brought in... • How about this?
  common /xy/ x(8M words), y(8M words), z(8M words)
  do i=1,8M
    y(i) = a*x(i) + y(i) - z(i)
  enddo

  30. Cache • Solution to cache thrashing (compiler flags should fix it; by hand we would pad the arrays):
  common /xy/ x(8M words), pad(8), y(8M words)
  do i=1,8M
    y(i) = a*x(i) + y(i)
  enddo
  • Load y(1)-y(8) into L2 cache • Load y(1)-y(8) from L2 cache into L1 cache • Load x(1)-x(8) into L2 cache - NO CONFLICT! • Load x(1)-x(8) from L2 cache into L1 cache - NO CONFLICT • y(2) is already in L1 cache • x(2) is already in L1 cache • F90: -qhot=arraypad
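
  Below is a minimal, self-contained sketch (mine, not from the original slides) that puts the timer from slide 8 around an unpadded and a padded pair of arrays so the effect can be measured. The array size and the 16-word pad are illustrative assumptions; on a real machine you would size the arrays so they actually conflict in the cache you care about and compare the two timings.

  program pad_demo
    implicit none
    integer, parameter :: n = 1048576        ! 1M double-precision words per array (assumed)
    real*8 :: x(n), y(n)                     ! unpadded pair: x(i), y(i) may map to the same set
    real*8 :: xp(n), padding(16), yp(n)      ! padded pair: the pad shifts yp by one cache line
    real*8 :: a
    common /xy/  x, y
    common /xyp/ xp, padding, yp
    integer :: start, finish, rate, i
    real :: t_plain, t_padded

    a = 2.5d0; x = 1.0d0; y = 0.0d0; xp = 1.0d0; yp = 0.0d0
    call system_clock(count_rate=rate)

    call system_clock(count=start)
    do i = 1, n
       y(i) = a*x(i) + y(i)
    enddo
    call system_clock(count=finish)
    t_plain = real(finish - start) / real(rate)

    call system_clock(count=start)
    do i = 1, n
       yp(i) = a*xp(i) + yp(i)
    enddo
    call system_clock(count=finish)
    t_padded = real(finish - start) / real(rate)

    print *, 'unpadded: ', t_plain, '  padded: ', t_padded
    print *, y(1), yp(1)       ! keep results live; padding itself is never referenced
  end program pad_demo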

  31. Loop Transformations • Pushing loops inside subroutine calls • Loop fusion • Loop interchange • Loop unrolling: inner • Blocking • Loop unrolling: outer These transformations can lead to better data reuse, increased opportunity for instruction level parallelism, reduced loop overhead, and elimination of redundant memory references.

  32. Avoiding function call overhead • A lot happens when you call a function • call foo(X) • save the PC and register state at the point of invocation by pushing them on the stack (accessing memory) • Pop them off when you return

  33. Wall clock time of function call overhead on several ASCI White configurations

  34. Pushing a loop inside a function/subroutine call • Replace a loop over calls to a subroutine/function with a call to a subroutine that contains the loop.
  Before:
  subroutine add(x,y,z)
  real*8 x,y,z
  z = x + y
  end
  ...
  do i=1,n
    call add(x(i),y(i),z(i))
  enddo
  After:
  subroutine add(x,y,z,n)
  integer i, n
  real*8 x(n),y(n),z(n)
  do i=1,n
    z(i) = x(i) + y(i)
  enddo
  end
  ...
  call add(x,y,z,n)
  • Eliminates the overhead associated with the call to the subroutine (at least 50 cycles on Power3) and allows pipelining of arithmetic.

  35. Function Inlining • Replace a call to a function with the actual body of the function.
  Before:
  function sum3(x,y,z)
  real*8 x,y,z,sum3
  sum3 = x+y+z
  return
  end
  ...
  do i=1,n
    w(i)=sum3(x(i),y(i),z(i))
  enddo
  After:
  do i=1,n
    w(i)=x(i)+y(i)+z(i)
  enddo
  • Provides the same benefits as pushing the loop inside the subroutine: reduced function call overhead, better pipelining. The compiler can perform some, but not all, function inlining. BH: -qipa=inline

  36. Loop fusion • Combine the bodies of two or more loops into a single loop.
  Before:
  do i=1,n
    x(i) = a(i) + b(i)
  enddo
  do i=1,n
    y(i) = a(i) * b(i)
  enddo
  After:
  do i=1,n
    x(i) = a(i) + b(i)
    y(i) = a(i) * b(i)
  enddo
  • Cycles per iteration, in L1 cache (BH): each original loop: L/S: 1 cycle, FP: 1/2 cycle; fused loop: L/S: 1 cycle, FP: 1 cycle. • Get better data reuse and utilization of the functional units. Most compilers can recognize opportunities for loop fusion and perform this optimization automatically, even if there are lines of code intervening between the two loops.

  37. Loop fusion • There are intervening operations that will prevent the compiler from performing loop fusion: • I/O operations • Calls to timing routines • Subroutine calls that could modify a variable used in the 2nd loop • Conditionals • Different loop limits, which can be handled by peeling the extra iterations, e.g.:
  Before:
  do i=1,n
    x(i)=a(i)+b(i)
  enddo
  do i=2,n-1
    y(i)=a(i)*c(i)
  enddo
  After:
  x(1)=a(1)+b(1)
  x(n)=a(n)+b(n)
  do i=2,n-1
    x(i)=a(i)+b(i)
    y(i)=a(i)*c(i)
  enddo

  38. Loop interchange • Interchange the nesting of loops to achieve a better data stride.
  Before:
  do i=1,m
    do j=1,n
      x(i,j)=x(i,j)+1.0
    enddo
  enddo
  After:
  do j=1,n
    do i=1,m
      x(i,j)=x(i,j)+1.0
    enddo
  enddo
  • Cycles per iteration, in memory: before: L/S: (at least) 30 cycles per iteration on BH; after: L/S: (30 + 7)/8 ~= 5 cycles per iteration! In cache: probably doesn’t matter much (~= 2 cycles per random load on BH). • This optimization can usually be done by the compiler, but it is suggested that the programmer do it manually since it is both easy to do and could have serious performance implications if missed by the compiler.

  39. Inner Loop Unrolling • Modification of a loop such that multiple iterations occur in the loop body.
  Before:
  do i=1,n
    y(i)=a*x(i)+b
  enddo
  After:
  do i=1,n,4
    y(i)=a*x(i)+b
    y(i+1)=a*x(i+1)+b
    y(i+2)=a*x(i+2)+b
    y(i+3)=a*x(i+3)+b
  enddo
  • Reduces loop overhead and generates a package of instructions for pipelining. The compiler can usually make the best choice for the number of times that an inner loop should be unrolled, but on occasion you may need to unroll manually for better performance. (See the sketch below for the remainder loop needed when n is not a multiple of the unrolling factor.)
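
  The unrolled version above only works as written when n is a multiple of 4; a common fix (my sketch, not from the original slide) is a cleanup loop for the leftover iterations:

  ! unrolled main loop plus a remainder (cleanup) loop;
  ! the mod(n,4) trailing iterations are handled separately
  do i = 1, n - mod(n,4), 4
     y(i)   = a*x(i)   + b
     y(i+1) = a*x(i+1) + b
     y(i+2) = a*x(i+2) + b
     y(i+3) = a*x(i+3) + b
  enddo
  do i = n - mod(n,4) + 1, n
     y(i) = a*x(i) + b
  enddo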

  40. Inner Loop Unrolling 2 • Modification of a loop to eliminate a data dependency.
  Before:
  do i=1,n
    a=a+x(i)*y(i)
  enddo
  After:
  a1=0; a2=0; a3=0
  do i=1,n,4
    a=a+x(i)*y(i)
    a1=a1+x(i+1)*y(i+1)
    a2=a2+x(i+2)*y(i+2)
    a3=a3+x(i+3)*y(i+3)
  enddo
  a=a+a1+a2+a3
  • Cycles (in cache, BH): before: L/S: n/2 cycles, FP: 2n cycles (dependent); after: L/S: n/2 cycles, FP: n/2 cycles. (Note: we should unroll to pipeline depth * number of functional units independent operations = 16.) • This involves a change in the order of summing and may be done by advanced compilers or preprocessors. The big payoff comes for x and y in cache.

  41. Tiling • Loop transformation whereby blocks of data that fit into cache are reused. Optimal block sizes depend on both the number of arrays being accessed and the cache sizes.
  Before:
  do i=1,n
    do j=1,n
      do k=1,n
        c(i,j)=c(i,j)+a(i,k)*b(k,j)
      enddo
    enddo
  enddo
  After:
  do ib=1,n,nb
    do jb=1,n,nb
      do kb=1,n,nb
        do i=ib,min(n,ib+nb-1)
          do j=jb,min(n,jb+nb-1)
            do k=kb,min(n,kb+nb-1)
              c(i,j)=c(i,j)+a(i,k)*b(k,j)
            enddo
          enddo
        enddo
      enddo
    enddo
  enddo
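
  As a rough worked example of choosing the block size (my arithmetic, not from the slides): tiles of a, b, and c together occupy about 3 x nb^2 x 8 bytes, so to fit in BH's 64 KB L1 you would want 3 x nb^2 x 8 <= 65536, i.e. nb <= ~52; something like nb = 48 is a plausible starting point before measuring.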

  42. Iteration Space Traversal with Tiling (Figure: the i-j iteration space is traversed in cache-sized blocks.)

  43. Tiling
  Before:
  for I <- 1 to n do
    b[I] <- a[I]/b[I]
    a[I+1] <- b[I] + 1.0
  endfor
  After (strip-mined with a block size of 2):
  for I <- 1 by 2 to n do
    for j <- I to min(I+1,n) do
      b[j] <- a[j]/b[j]
      a[j+1] <- b[j] + 1.0
    endfor
  endfor

  44. Outer loop unrolling • The following change reduces the number of loads and exposes independent operations to the floating-point pipelines.
  Before:
  do i=ib,min(n,ib+nb-1)
    do j=jb,min(n,jb+nb-1)
      do k=kb,min(n,kb+nb-1)
        c(i,j)=c(i,j)+a(i,k)*b(k,j)
      enddo
    enddo
  enddo
  After (i and j loops unrolled by 2):
  do i=ib,min(n,ib+nb-1),2
    do j=jb,min(n,jb+nb-1),2
      c00=c(i,j)
      c01=c(i,j+1)
      c10=c(i+1,j)
      c11=c(i+1,j+1)
      do k=kb,min(n,kb+nb-1)
        c00=c00+a(i,k)*b(k,j)
        c01=c01+a(i,k)*b(k,j+1)
        c10=c10+a(i+1,k)*b(k,j)
        c11=c11+a(i+1,k)*b(k,j+1)
      enddo
      c(i,j)=c00
      c(i,j+1)=c01
      c(i+1,j)=c10
      c(i+1,j+1)=c11
    enddo
  enddo

  45. Fortran 90 compiler options on the BH • A BH benchmark study showed an average 2X speedup just by using the recommended compiler options. • See the f90 man page for additional compiler flags. Also see www.npaci.edu/BlueHorizon and www.psc.edu. • (On TCSini, use -arch ev67 and experiment with the 4 levels of optimization; check the man pages.)

  46. Vendor libraries on TCSini • NAG - parallel sparse and dense linear algebra and other numerical routines • PETSc - solvers for systems arising from partial differential equations • ScaLAPACK • More information: http://www.psc.edu/machines/tcs/ • These libraries have been specially optimized for the Alpha. Functionality includes: • Linear algebra (Ax=b, BLAS, matrix factorizations, …) • Eigenvalue solvers • FFTs and signal processing routines • Sparse solvers, tridiagonal solvers, linear recurrences, …

  47. Vendor libraries on BH • MASS (Mathematical Acceleration Subsystem) - A set of functions that replace several of the mathematical functions in libm (sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y). The MASS scalar routines are not thread-safe. Do not link this library if there are any references to MASS scalar intrinsics in any shared-memory parallel region. MASS does have thread-safe vector intrinsics. • ESSL - The Engineering and Scientific Subroutine Library family of products is a collection of mathematical subroutines that provides a wide range of over 400 high-performance mathematical functions for many different scientific and engineering applications. When using ESSL routines in a parallel region, you must specify the thread-safe version by using -lessl_r at link time to ensure correct behavior. ESSL includes the following (see the DGEMM sketch below for a usage example): • Basic Linear Algebra Subroutines (BLAS) • Linear Algebraic Equations • Eigensystem Analysis • Fourier Transforms • PESSL - A parallel version of ESSL. • More information: http://www.npaci.edu/BlueHorizon/guide/progdev.html#LIBS
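
  As a usage sketch (mine, not from the slides): the level-3 BLAS routine DGEMM provided by ESSL can replace the hand-tuned matrix-multiply loops of slides 20, 41, and 44. The matrix size and the link lines below are illustrative assumptions.

  program essl_dgemm_example
    implicit none
    integer, parameter :: n = 500          ! assumed matrix size
    real*8 :: alpha, beta
    real*8, allocatable :: a(:,:), b(:,:), c(:,:)

    allocate(a(n,n), b(n,n), c(n,n))
    a = 1.0d0; b = 1.0d0; c = 0.0d0
    alpha = 1.0d0; beta = 1.0d0

    ! c = alpha*a*b + beta*c via the standard level-3 BLAS interface
    call dgemm('N', 'N', n, n, n, alpha, a, n, b, n, beta, c, n)

    print *, 'c(1,1) = ', c(1,1)
  end program essl_dgemm_example

  ! Illustrative link lines on BH (serial and thread-safe versions):
  !   xlf90   essl_dgemm_example.f90 -lessl
  !   xlf90_r essl_dgemm_example.f90 -lessl_r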

  48. Miscellaneous tricks • Replace repeated divides by multiplications by the reciprocal • Avoid exponentiation (real-to-real) by using integer powers and square roots • Replace integer division/multiplication by two by left/right shifts • For the BH, try to schedule two (or more) independent divides in the loop body
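
  A small before/after sketch of the reciprocal and shift tricks (illustrative, not from the original slides):

  ! Before: a divide in every iteration (divides are expensive on both machines)
  do i = 1, n
     y(i) = x(i) / a
  enddo

  ! After: hoist the reciprocal out of the loop and multiply instead
  ainv = 1.0d0 / a
  do i = 1, n
     y(i) = x(i) * ainv
  enddo

  ! Integer multiply/divide by two as shifts (ishft is a Fortran 90 intrinsic);
  ! note that for negative integers a right shift is not the same as division by 2
  k2 = ishft(k, 1)     ! k * 2
  kh = ishft(k, -1)    ! k / 2 for non-negative k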

  49. Summary • Use timers and performance analysis tools to identify the most computationally intensive parts of the code • Keep track of performance limits and know when to stop • Always keep in mind the big picture • Program for the RISC architecture: • Program for the memory hierarchy: • search your soul for data reuse • don’t thrash the cache • be aware of your load-store/flop ratio and act accordingly • Program for pipelines: • expose independent operations to the compiler

  50. References • POWER3: www.redbooks.ibm.com, search for “optimization”
