High Performance Computing – CISC 811


  1. High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@physics

  2. Today’s Lecture Vector and SIMD Computing • Part 1: Vector Computing & architectures • Part 2: Vectorization of code • Part 3: SIMD operations on scalar CPUs (MMX,SSE), generally programmable graphics processors

  3. Part 1: Vector Computing & Architectures • Motivation for using vectors, what is a vector? • Hardware implementation, vector registers • Modern vector machines

  4. A note on history • Vector ideas seem to have been first conceived around 1964 • The CDC Star-100 was the first machine with vector facilities (delivered 1974) • It lacked vector registers, though, so it took a long time to get words from memory • Cray-1 – the first effective vector machine, probably because of its excellent scalar performance and its vector registers

  5. Data parallel execution • Input data is divided into a series of independent parts • Processing of the parts is carried out independently • Similar in concept to SIMD, except that many instructions could be executed • Pipelined vector processors are not truly data parallel, but they share strong conceptual similarities • Vector lanes do extend vector processors to data parallel execution • 'Streams' (used in graphics) are data parallel

[Diagram: ARRAY A(i) split across three data-parallel tasks]

  Task 1:            Task 2:             Task 3:
  do i=1,10          do i=11,20          do i=21,30
    a(i)=a(i)*2.       a(i)=a(i)*2.        a(i)=a(i)*2.
  end do             end do              end do

  6. Reinventing the wheel? • Data parallel computing seems to undergo reinvigoration on a 20 year cycle – 1960’s vector computing 1980’s massively parallel SIMD (Connection Machine, MasPar) 2000’s Streams in GPUs

  7. Motivations for vector operations • Recall from the last lecture that loop counter overheads can contribute significantly • One way around this is to unroll:

  do i=1,n
    a(i)=b(i)+c(i)
  end do

becomes (assuming n is divisible by 4; in practice a clean-up loop handles any remainder):

  do i=1,n,4
    a(i)=b(i)+c(i)
    a(i+1)=b(i+1)+c(i+1)
    a(i+2)=b(i+2)+c(i+2)
    a(i+3)=b(i+3)+c(i+3)
  end do

  8. Load/store overhead • Loop unrolling helps reduce loop overheads • However, there are still 3 load/store operations per arithmetic operation • Lots of instructions streaming through the processor and to memory:

  do i=1,64
    a(i)=b(i)+c(i)
  end do

With vector instructions the entire loop reduces to four instructions:

  VecLoad  b,vreg1
  VecLoad  c,vreg2
  Vadd     vreg3,vreg1,vreg2
  VecStore vreg3,a

  9. Machine and Human Factors • Semantic advantage: reduction in the number of instructions required in the program • Reduction in the number of operations in the CPU • Note: machines are also limited in the number of instructions they can retrieve and perform per clock cycle – the “Flynn bottleneck” • (Dense) vectors have a prescribed access pattern – this can be used to amortize the cost of memory latency (i.e. super 'cache lines')

  10. Memory Interleaving • A single bank of memory has a long access time • After retrieving/writing values the bank's charge must be refreshed – it cannot be accessed for another cycle • Retrieving all the elements of a 64-element vector would therefore require multiple accesses, and hence multiple waits for refresh • Interleaved memory banks: store vector elements across multiple banks which can be accessed at the same time, avoiding the waits • Some machines had up to 512-way interleaving(!) • Expensive to build, but even PCs often have 2-way interleaving
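As a rough illustrative example (the numbers here are assumed, not from the lecture): with 8-way interleaving and a bank busy time of 8 clocks, consecutive elements a(1), a(2), … fall in banks 1, 2, …, 8, 1, …; by the time a(9) is requested, bank 1 has finished its refresh, so the memory system can stream one word per clock.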

  11. Vector pipeline [Diagram: two vector registers holding x1…xn and y1…yn feed a 6-stage vector add unit; partial results x6+y6, x5+y5, …, x1+y1 are in flight.] In this case the pipeline depth is 6, so results are returned after 6 clock cycles.

  12. (Recap) Why FP ops take more than 1 cycle • When multiplying, one must treat the exponent and mantissa separately • Similarly for an add, x+y=s: the exponents must first be aligned before the mantissas can be added, and the result then renormalized

  13. Vector execution time • For the previous example the time until the first result was 6 clock cycles (6t) • After that a result was produced every clock cycle (t) • Total of n-1 results to produce • Total time to produce all results: Tv=6t+(n-1)t • Tv=(n+5)t

  14. Serial execution time • Each floating point operation takes 6 cycles • Assume no start up time (but there will be in reality) • n total floating point operations • Total time: Ts = 6nt

  15. Vector Speed up • Ts/Tv=6n/(n+5) • For n=1 – same speed • For n=∞ (asymptotic limit) Ts/Tv=6 • Average performance will be somewhere in between – and is typically around 4
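Plugging numbers into Ts/Tv=6n/(n+5): n=15 gives 90/20=4.5, while n=64 gives 384/69≈5.6 – consistent with the "around 4" figure for typical vector lengths.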

  16. Vector speed-ups today • Fastest vector processor: NEC SX-8, achieves 16 Gflops on LINPACK • Average throughput ~8 Gflops (50% efficiency) • Scalar unit runs at 2 Gflops on LINPACK – still around a factor of 4 • Opteron – 3 Gflops on optimized LINPACK!

  17. Vector Definitions • Rn = Mflops achievable on a vector of length n

  18. R∞, n1/2, nv • R∞: asymptotic number of Mflops. For very long vectors the number of Mflops should be 1/t (assuming a new result is produced every cycle) • n1/2: the value of n such that Rn=R∞/2, i.e. the vector length necessary for the average time per result to be 2t • nv: the length of vector required to be faster than scalar execution (shows the balance between scalar start-up time and vector)
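For the 6-stage pipeline of the earlier example, Rn = R∞·n/(n+5), so Rn=R∞/2 when n/(n+5)=1/2, i.e. n1/2=5.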

  19. nv – the true measure of utility • If nv is large then a given machine will only be effective for very long vectors • CDC Star-100: nv=100 (!) • Cray-1: nv=2 • 'Long vector' machine: implies you need long vectors to get good speed-up (bad!) • Not surprising the Star-100 was unpopular

  20. Memory Architectures: Memory-Memory • The first vector machines were “memory-memory” (e.g. CDC Star-100) • Each vector was loaded from memory to the CPU and written back • Long start-up latencies: vectors had to be long to overcome this problem (why nv=100)

  21. Vector-register processors • Memory words are loaded into registers before being used • Results are in turn written to registers • Benefit: a result can be reused before being written to memory • (See the later slide on chaining)

  22. “Vectors hide latency” • Oft quoted: idea is that for a long enough vector the start up time can be hidden • Total time to produce all results: Tv=kt+(n-1)t • Provided that n-1 is significantly greater than k this is true • Also, predictable access pattern does imply possibility of prefetch • Vectors are not the latency hiding ‘magic bullet’

  23. Vectorization: Amdahl’s Law • Fraction of code that will only run in serial = Fs • Fraction of code that will vectorize = Fv • Relative vector speed up = Sv • Maximum vector speed up = 1/(Fs+(Fv/Sv)) • Average vectorization factor Fv = 70%, take Sv=4 • What performance improvement results?
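Working through the numbers: maximum vector speed up = 1/(0.3+0.7/4) = 1/0.475 ≈ 2.1 – even with 70% of the code vectorized at a 4x speed up, the overall gain is only about a factor of two.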

  24. Gather/Scatter operations • Gather: very important operation in computing:

  do i=1,n
    a(i)=b(ix(i))
  end do

• Scatter:

  do i=1,n
    a(ix(i))=b(i)
  end do

• Especially relevant in sparse matrix calculations • Machine dependent though…

  25. Gather/scatter hardware • Later Cray machines provided hardware speed-up for these indexed operations • An address register was used to load in the memory references into a single vector register • Can also combine gather/scatter operations • Performance advantage is potentially very large for some problems

  26. Sparse Matrix Example [Benchmark figure; the surviving annotations read "x8 slower!" and "20% slower".]

  27. Improving throughput • Consider the following loop:

  do i=1,n
    a(i)=b(i)*c(i)
    d(i)=a(i)+e(i)
  end do

• Normally one would have to wait for the first operation (the vector multiply) to complete before starting the second (the vector add) • Is it possible to pipeline these operations?

  28. Chaining: pipelining vector units together • Unchained: the vector add begins only when the vector multiplication is complete: (6 + 63) cycles for the vmult + (6 + 63) cycles for the vadd = 138 cycles • Chained: the output of the vector multiply is directed into the add unit before all operations have completed: 6 + 6 + 63 = 75 cycles • Performance gain ~ factor of 2 • The Cray-1 could pipeline 3 units together
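In general, for two units of pipeline depth k operating on vectors of length n, the unchained time is 2(k+n-1) cycles while the chained time is 2k+(n-1); with k=6 and n=64 this gives the 138 versus 75 cycles above.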

  29. Vector ideas still have mileage: Alpha EV8 Tarantula • The vector processor that never was… • 16(!) parallel pipelined vector lanes, each with 2 execution units • An extension of the SMT-based EV8 architecture (which by itself had a peak throughput of 20 Gflops) • 32 vector registers, each holding 128 64-bit words • Full vector ISA including masks • See the 29th International Conference on Computer Architecture, 2002, Alaska.

  30. Application benefits

  31. What happened? • Projected performance levels were very high – 80 Gflops throughput… • Alpha product development was canned • The memory subsystem was too expensive (surprise?) • It was unclear whether the chip could be fabbed • Perhaps these ideas will resurface elsewhere…

  32. Summary Part 1 • Vector speed comes (yet again!) from pipelining • Typical speed-ups for a single lane are around 4 • Vector registers: reduce start-up times, improve vector performance • Chaining: pipelining for vectors • Memory cost is the main barrier to implementing vector machines cheaply

  33. Interlude: Cray-3 “Goldfish Bowl” • Only 1 built (1993), 88 kW power dissipation!

  34. Part 2: Vectorization • Vector (array) notation in f90 • Vectorization strategies • Barriers to vectorization

  35. Caveat • Given the small number of vector machines available in Canada why teach vector programming? 1. Concepts still valid 2. Possible that vector concepts will make a return (we’re running out of options for speed up and ideas apply to streams) 3. Useful to know!

  36. Vector programming philosophy • Traditionally, vector instructions are inserted by the compiler • Therefore need to write code that easily enables the compiler to analyze for vectorization • Similar idea to scalar optimization where it is necessary to make ILP as easily visible to the compiler as possible

  37. Execution organization • Consider:

  do i=1,n
    a(i)=sin(b(i))
    c(i)=b(i)+d(i)
  end do

Scalar order:
  a(1)=sin(b(1))
  c(1)=b(1)+d(1)
  a(2)=sin(b(2))
  c(2)=b(2)+d(2)
  …
  a(n)=sin(b(n))
  c(n)=b(n)+d(n)

Vector order:
  a(1)=sin(b(1))
  a(2)=sin(b(2))
  …
  a(n)=sin(b(n))
  c(1)=b(1)+d(1)
  c(2)=b(2)+d(2)
  …
  c(n)=b(n)+d(n)

Vector ordering completes all operations on a() before proceeding to c().

  38. Inner loop emphasis • For multiply nested loops, vectorization occurs on the innermost loop • Strategy: the inner loop needs to be long for vectorization to be efficient

Good (long inner loop):

  do i=1,5
    do j=1,n
      do k=1,n
        a(k,j,i)=…
      end do
    end do
  end do

Bad (short inner loop):

  do i=1,n
    do j=1,n
      do k=1,5
        a(k,j,i)=…
      end do
    end do
  end do

  39. Vector Notation • Simple, but powerful, f90 syntax • For array a(i):

  do i=1,n
    a(i)=a(i)*2.0
  end do

becomes

  a(1:n)=a(1:n)*2.0

• Similarly, adding arrays:

  do i=1,n
    a(i)=b(i)+c(i)
  end do

becomes

  a(1:n)=b(1:n)+c(1:n)

  40. Dealing with increments • Strided loops can be dealt with too:

  do i=1,n,2
    a(i)=a(i)*2.0
  end do

becomes

  a(1:n:2)=a(1:n:2)*2.0

• Multidimensional arrays (a bare colon means all values):

  do i=1,n
    do j=1,n,2
      a(j,i)=a(j,i)*2.
    end do
  end do

becomes

  a(1:n:2,:)=a(1:n:2,:)*2.
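Putting these two slides together, a minimal self-contained f90 sketch (the program name and array size are illustrative, not from the lecture) of whole-array and strided assignment:

  program vecdemo
    implicit none
    integer, parameter :: n = 8
    integer :: i
    real :: a(n), b(n), c(n)

    b = (/ (real(i), i=1,n) /)  ! b = 1.0, 2.0, ..., 8.0
    c = 2.0                     ! scalar broadcast to the whole array
    a = b + c                   ! whole-array form of: do i=1,n a(i)=b(i)+c(i)
    a(1:n:2) = a(1:n:2)*2.0     ! strided section: odd-indexed elements only
    print *, a                  ! prints 6. 4. 10. 6. 14. 8. 18. 10.
  end program vecdemo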

  41. Hazards to vectorization • Functional hazards: • Loops should not contain I/O • Loops should not contain branching out of or into the loop • Loops should not contain user-defined functions • Loops should not contain subroutine calls • Loops should not be do while loops • Data hazards: • Recursion • Dependence • Unresolvable logic

  42. Recursion: Vectorization Inhibitor

  do i=1,n
    a(i)=a(i-1)
  end do

  do i=2,n,2
    a(i)=a(i-1)
  end do

  do i=1,n
    a(i)=a(i+1)
  end do

• a(i)=a(i-1) depends explicitly on the previous iteration – the first loop cannot be vectorized • The second loop can be vectorized because the stride of 2 eliminates the recursion: each even element reads an odd element that the loop never writes • The third loop is non-recursive (and vectorizable): each iteration reads an element that no earlier iteration has written

  43. Data Dependence: Another Inhibitor • Similar concept to recursion except the dependence is now across different arrays:

  do i=1,n-1
    d(i)=a(i)+1.0
    a(i+1)=b(i)+2.0
  end do

Scalar order:
  d(1)=a(1)+1.0
  a(2)=b(1)+2.0
  d(2)=a(2)+1.0
  a(3)=b(2)+2.0
  …

Vector order:
  d(1)=a(1)+1.0
  d(2)=a(2)+1.0
  …
  a(2)=b(1)+2.0
  a(3)=b(2)+2.0
  …

In vector order d(2) is computed before a(2) has been updated, so the results differ from scalar order.

  44. Some apparent dependencies can be vectorized by rearrangement:

  do i=1,n-1
    a(i)=b(i)
    c(i)=a(i+1)
  end do

rearranges to

  do i=1,n-1
    c(i)=a(i+1)
    a(i)=b(i)
  end do

Some compilers analyze for this situation.

  45. Indexing hazards • Indirect addressing is another difficulty:

  do i=1,n
    a(index1(i))=s*b(index2(i))
  end do

• It is a good programming habit to break this operation up:

  do i=1,n
    u(i)=b(index2(i))
  end do
  do i=1,n
    u(i)=s*u(i)
  end do
  do i=1,n
    a(index1(i))=u(i)
  end do

• This section of code will vectorize (with hardware assistance for the gather/scatter loops)

  46. Conditional execution • Some IF statements are vectorizable (but with limits and penalties) • Consider:

  do i=1,n
    if (b(i).gt.0.0) a(i)=a(i)+b(i)
  end do

• The operation carried out on all elements is still the same • However, vector pipelines cannot stop mid-execution to decide what needs to be done: code rearrangement is needed

  47. Conditional Execution • Strategy 1: replace the logic with a mathematical intrinsic:

  do i=1,n
    if (b(i).gt.0.0) a(i)=a(i)+b(i)
  end do

becomes

  do i=1,n
    a(i)=a(i)+max(b(i),0.)
  end do

  48. Masked Operations • Alternative strategy: execute all iterations of the loop as if the logic were true • Use the conditional IF statement to form a vector mask register that then controls which elements need to be updated • Leads to some overhead; most efficient when the conditional true-ratio is high

  Results:  7.2  0.2  3.1  8.9  1.2  3.7  7.7
  Mask:       1    0    0    0    1    0    1
  Stored:   7.2                  1.2       7.7
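In f90 the same masked update can be written directly with the where construct, which vectorizing compilers map onto the mask registers described above – a minimal sketch, reusing the arrays of the earlier conditional-execution example:

  where (b(1:n) > 0.0)
    a(1:n) = a(1:n) + b(1:n)  ! only elements whose mask entry is true are stored
  end where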

  49. Inlining function calls • Consider the following (trivial) example:

  do i=1,n
    a(i)=b(i)+c(i)
    call mult(e(i),d(i))
  end do

  subroutine mult(a1,a2)
  real a1,a2
  a1=a1*a2
  return
  end

• As written, it doesn't vectorize
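Inlining the body of mult by hand (or having the compiler do it) removes the call and lets the whole loop vectorize – a sketch of the inlined form:

  do i=1,n
    a(i)=b(i)+c(i)
    e(i)=e(i)*d(i)  ! body of mult(e(i),d(i)) inlined
  end do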
