
Examples of One-Dimensional Systolic Arrays






Presentation Transcript


  1. Examples of One-Dimensional Systolic Arrays

  2. Motivation & Introduction • We need high-performance, special-purpose computer systems to meet the requirements of specific applications. • The imbalance between I/O and computation is a notable problem. • The concept of systolic architecture maps high-level computation directly into hardware structures. • A systolic system works like an automobile assembly line. • Systolic systems are easy to implement because of their regularity, and easy to reconfigure. • Systolic architectures can yield cost-effective, high-performance special-purpose systems for a wide range of problems.

  3. Pipelined Computations [Diagram: data items f, e, d, c, b, a streaming into pipeline stages P1 → P2 → P3 → P4 → P5] • A pipelined program is divided into a series of tasks that have to be completed one after the other. • Each task is executed by a separate pipeline stage. • Data is streamed from stage to stage to form the computation.

  4. Pipelined Computations [Diagram: time-space chart showing items a–f advancing one stage per step through P1–P5] • The computation consists of data streaming through the pipeline stages. • Execution Time = time to fill the pipeline (P-1) + time to run in steady state (N-P+1) + time to empty the pipeline (P-1), where P = # of processors and N = # of data items (assume P < N). This slide must be explained in full detail. It is very important.
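  As a quick worked check (numbers chosen for illustration): with P = 5 stages and N = 6 data items, the total is (5-1) + (6-5+1) + (5-1) = 4 + 2 + 4 = 10 steps, which matches the N + P - 1 = 10 steps quoted on a later slide.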

  5. Pipelined Example: Sieve of Eratosthenes • The goal is to take a list of integers greater than 1 and produce a list of primes. • E.g., for input 2 3 4 5 6 7 8 9 10, the output is 2 3 5 7. • A pipelined approach: • Processor P_i divides each input by the i-th prime. • If the input is divisible (and not equal to the divisor), it is marked (with a negative sign) and forwarded. • If the input is not divisible, it is forwarded unchanged. • The last processor forwards only unmarked (positive) data [the primes].

  6. Sieve of Eratosthenes Pseudo-Code
  Code for processor P_i (holding prime p_i):
      x = recv(data, P_(i-1))
      if (x > 0) then
          if (p_i divides x and p_i ≠ x) then send(-x, P_(i+1))
          else send(x, P_(i+1))
      else
          send(x, P_(i+1))
  Code for the last processor:
      x = recv(data, P_(i-1))
      if (x > 0) then send(x, OUTPUT)
  [Diagram: pipeline P2 → P3 → P5 → P7 → out] Processor P_i divides each input by the i-th prime.
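  The pseudo-code can be checked with a small runnable Python sketch (a minimal model: one generator per processor instead of real message passing; the names stage and sieve_pipeline are illustrative, not from the original):

      # Each stage holds one prime and applies the marking rule above.
      def stage(prime, stream):
          for x in stream:
              if x > 0 and x % prime == 0 and x != prime:
                  yield -x   # divisible and not the prime itself: mark it
              else:
                  yield x    # forward unchanged (marked values pass through)

      def sieve_pipeline(data, primes):
          stream = iter(data)
          for p in primes:
              stream = stage(p, stream)        # chain one stage per prime
          return [x for x in stream if x > 0]  # last stage keeps unmarked values

      print(sieve_pipeline(range(2, 11), [2, 3, 5, 7]))  # -> [2, 3, 5, 7]

  Each generator plays the role of one processor; chaining them reproduces the lockstep forwarding of the pipeline.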

  7. Programming Issues [Diagram: pipeline P2 → P3 → P5 → P7 → P11 → P13 → P17] • The algorithm takes N+P-1 steps to run, where N is the number of data items and P is the number of processors. • One can also consider only the odd numbers, or handle some initial part of the sequence separately. • In the given implementation, there must be one processor for every prime that will appear in the sequence. • This is not a scalable approach. • We can fix this by having each processor do the job of multiple primes, i.e., mapping several logical "processors" in the pipeline to each physical processor (e.g., one processor does the job of three primes). • What is the impact of this on performance?

  8. Processors for such operations • In a pipelined algorithm, the flow of data moves through the processors in lockstep. • The design attempts to balance the work so that there is no bottleneck at any processor. • In the mid-80s, processors were developed to support this kind of parallel pipelined computation in hardware. • Two commercial products: • Warp (a 1D array, developed at CMU) • iWarp (Intel components for building 2D arrays) • Warp and iWarp were meant to operate synchronously. • The Wavefront Array Processor (S.Y. Kung) was meant to operate asynchronously, i.e., the arrival of data would signal that it was time to execute.

  9. Systolic Arrays from Intel • Warp and iWarp were examples of systolic arrays. • "Systolic" means regular and rhythmic: data was supposed to move through pipelined computational units in a regular and rhythmic fashion. • Systolic arrays were meant to be special-purpose processors or co-processors. • They were very fine-grained: the processors, usually called cells, implement a limited and very simple computation. • Communication is very fast; the granularity was meant to be around one operation per communication!

  10. Systolic Algorithms • Systolic arrays were built to support systolic algorithms, a hot area of research in the early 80s. • Systolic algorithms used pipelining through various kinds of arrays to accomplish computational goals: • Some of the data streaming and applications were very creative and quite complex. • CMU was a hotbed of systolic algorithm and array research (especially H.T. Kung and his group).

  11. Example 1: "Pipelined" Polynomial Evaluation • Polynomial evaluation is done using a linear array in which the PEs work in pairs. • Expression (Horner's rule): Y = ((((a_n·x + a_(n-1))·x + a_(n-2))·x + a_(n-3))·x + … + a_1)·x + a_0 • Function of each PE pair: 1. Multiply the input by x. 2. Pass the result to the right. 3. Add a_j to the result from the left. 4. Pass the result to the right. The first processor in the pair does steps 1-2; the second does steps 3-4.

  12. Example 1: Polynomial Evaluation Y = ((((a_n·x + a_(n-1))·x + a_(n-2))·x + a_(n-3))·x + … + a_1)·x + a_0 [Diagram: alternating multiplying (×) and adding (+) processors; x is broadcast to every multiplying cell, and the coefficients a_n, a_(n-1), …, a_0 are fed to the adding cells] • Using a systolic array for polynomial evaluation. • This is an example of a deeply pipelined computation: the pipeline has 2n stages, so after an initial latency of 2n cycles the array produces the value of the polynomial for a new x value on every cycle. • Another variant: you can also evaluate several different polynomials on the same x.
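  What one pass through the array computes can be sketched in a few lines of Python (an illustrative model only: each loop iteration stands for one multiply/add pair of cells, so the coefficient steps correspond to the 2n hardware stages):

      def horner_pipeline(coeffs, x):
          # coeffs = [a_n, ..., a_1, a_0], highest degree first
          y = 0
          for a in coeffs:
              y = y * x + a   # cell pair: multiply input by x, then add a_j
          return y

      # 2x^3 + 3x^2 + x + 5 at x = 2 -> 35
      print(horner_pipeline([2, 3, 1, 5], 2))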

  13. Example 2: Matrix-Vector Multiplication • There are many ways to solve matrix problems using systolic arrays; some of the methods are: • A triangular array performing Gaussian elimination with neighbor pivoting. • A triangular array performing orthogonal triangularization. • Simple matrix multiplication methods are shown in the next slides.

  14. Example 2: Matrix-Vector Multiplication • Each cell's function is: • 1. Multiply the top and bottom inputs. • 2. Add the left input to the product just obtained. • 3. Output the final result to the right. • Each cell consists of an adder and a few registers.

  15. Example 2: Matrix-Vector Multiplication [Diagram: matrix entries a–i enter skewed from the top of PE1, PE2, PE3; values l, m, n enter from the left; the vector p, q, r is fed from the bottom; results x, y, z emerge on the right] • At time t0 the array receives l, a, p, q, and r (the other inputs are all zero). • At time t1 the array receives m, d, b, p, q, and r, etc. • The results emerge after 5 steps. Analyze how row [a b c] is multiplied by column [p q r]^T to produce the first element x of the result vector [x y z]^T.

  16. [Diagram: the same array as the previous slide: skewed matrix entries from the top, the vector p, q, r from the bottom, results x, y, z on the right of PE1 → PE2 → PE3] • Each cell (PE1, PE2, PE3) does just one instruction: • Multiply the top and bottom inputs, add the left input to the product just obtained, and output the final result to the right. • The cells are simple: just an adder and a few registers. • The cleverness comes in the order in which you feed input into the systolic array. • At time t0 the array receives l, a, p, q, and r (the other inputs are all zero). • At time t1 the array receives m, d, b, p, q, and r. • And so on. • Results emerge after 5 steps. To visualize how it works, it is good to draw snapshot animations; see the sketch below.
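  A time-stepped Python sketch of these snapshots (an assumed register-level model, not a transcription of the slide: matrix rows enter skewed from the top, each vector element is held at its cell, which is equivalent to re-feeding p, q, r every cycle, left inputs start at zero, and partial sums move one cell to the right per step):

      def matvec_systolic(A, x):
          n = len(x)
          y = [0] * (n + 1)            # y[k]: partial sum entering cell k
          results = []
          for t in range(2 * n - 1):   # 2n-1 steps: 5 steps for n = 3
              new_y = [0] * (n + 1)
              for k in range(n):
                  i = t - k            # row whose entry reaches cell k now
                  a = A[i][k] if 0 <= i < n else 0
                  new_y[k + 1] = y[k] + a * x[k]  # multiply, add left input
              if t >= n - 1:
                  results.append(new_y[n])        # one element of A·x emerges
              y = new_y
          return results

      print(matvec_systolic([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [1, 1, 1]))  # [6, 15, 24]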

  17. Systolic Processors versus Cellular Automata versus Regular Networks of Automata [Diagram, one-dimensional cases only: a systolic processor drawn as a chain of data-path blocks; a cellular automaton drawn as a chain of control blocks]

  18. Systolic Processors versus Cellular Automata versus Regular Networks of Automata [Diagram: a cellular automaton as a chain of control blocks (examples: the Generals-and-Soldiers problem, a symmetric function evaluator); a regular network of automata as a chain of control blocks, each paired with a data-path block]

  19. Introduction to Polynomial Multiplication, Filtering, and Convolution Circuit Synthesis (Perkowski)

  20. Example 3: FIR Filter or Convolution

  21. Convolution as Polynomial Multiplication
      (a3 x^3 + a2 x^2 + a1 x + a0) * (b3 x^3 + b2 x^2 + b1 x + b0) =
        b3 a3 x^6 + b3 a2 x^5 + b3 a1 x^4 + b3 a0 x^3
                  + b2 a3 x^5 + b2 a2 x^4 + b2 a1 x^3 + b2 a0 x^2
                              + b1 a3 x^4 + b1 a2 x^3 + b1 a1 x^2 + b1 a0 x
                                          + b0 a3 x^3 + b0 a2 x^2 + b0 a1 x + b0 a0
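  The layout above is exactly the convolution of the two coefficient vectors; a short Python check (coefficient lists ordered lowest degree first, an ordering chosen here for convenience):

      def poly_mul(a, b):
          # a_i x^i * b_j x^j contributes to the coefficient of x^(i+j)
          c = [0] * (len(a) + len(b) - 1)
          for i, ai in enumerate(a):
              for j, bj in enumerate(b):
                  c[i + j] += ai * bj
          return c

      # (x + 1)^2 = x^2 + 2x + 1
      print(poly_mul([1, 1], [1, 1]))  # -> [1, 2, 1]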

  22. FIR-Filter-Like Structure [Snapshot t0: the input stream a4 0 0 0 slides over the stationary weights b4 b3 b2 b1 and an adder chain; the first product a4*b4 appears] The vector of b_i stands in place; the vector of a_i moves, from the highest coefficient of a towards the highest coefficient of b. First we will explain how it works.

  23. [Snapshot t1: inputs a3 a4 0 0; outputs so far: a4*b4, a3*b4 + a4*b3]

  24. [Snapshot t2: inputs a2 a3 a4 0; outputs so far: a4*b4, a3*b4 + a4*b3, a2*b4 + a3*b3 + a4*b2]

  25. [Snapshot t3: inputs a1 a2 a3 a4; outputs so far: a4*b4, a3*b4 + a4*b3, a2*b4 + a3*b3 + a4*b2, a1*b4 + a2*b3 + a3*b2 + a4*b1]

  26. [Snapshot t4: inputs 0 a1 a2 a3; next output: a1*b3 + a2*b2 + a3*b1]

  27. We redesign this architecture: we insert D flip-flops (DFFs) to avoid many levels of logic. [Snapshot: inputs a2 a3 a4 over weights b4 b3 b2 b1; the registered partial products are a4*b4, a4*b3, a4*b2, a4*b1] We simulate it again, shifting vector a. Vector a is broadcast as it moves, highest coefficient towards highest coefficient.

  28. [Snapshot: inputs a1 a2 a3; partial results: a4*b4, a4*b3 + a3*b4, a4*b2 + a3*b3, a4*b1 + a3*b2, a3*b1]

  29. [Snapshot: inputs 0 a1 a2; partial results: a4*b4, a4*b3 + a3*b4, a4*b2 + a3*b3 + a2*b4, a4*b1 + a3*b2 + a2*b3, a3*b1 + a2*b2, a2*b1] The disadvantage of this circuit is the broadcasting.

  30. Another way to draw exactly the same architecture with broadcast input

  31. A Family of Systolic Designs for the Convolution Computation • Given the sequence of weights {w1, w2, . . . , wk} • and the input sequence {x1, x2, . . . , xn}, • compute the result sequence {y1, y2, . . . , y(n+1-k)} • defined by y_i = w1·x_i + w2·x_(i+1) + . . . + wk·x_(i+k-1) (a reference implementation follows below).
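  A direct (non-systolic) reference implementation of this definition, useful for checking the designs that follow (0-indexed Python; an illustrative sketch):

      def convolve(w, x):
          # y_i = w_1*x_i + w_2*x_(i+1) + ... + w_k*x_(i+k-1)
          k, n = len(w), len(x)
          return [sum(w[j] * x[i + j] for j in range(k))
                  for i in range(n + 1 - k)]

      print(convolve([1, 2], [1, 2, 3, 4]))  # -> [5, 8, 11]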

  32. Design B1 • Previously proposed for circuits implementing a pattern-matching processor and for circuits implementing polynomial multiplication. • Broadcast input, move results systolically, weights stay. • (A semi-systolic convolution array with global data communication.)

  33. Types of Systolic Structure: Design B1 [Diagram: the input xin is broadcast to cells holding stationary weights W1, W2, W3; partial results y3 y2 y1 move out systolically; each cell computes yout = yin + W×xin] • A wider systolic path is needed, since the partial results y_i move. Please analyze this circuit by drawing snapshots, like frames of an animated movie of the data at subsequent moments in time; a simulation sketch follows below. Discuss the disadvantages of broadcast.
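  A cycle-level Python sketch of design B1 (a hypothetical simulation of the cell rule yout = yin + W×xin: the current input is broadcast to every cell, each weight stays put, and the partial results shift one cell per cycle; the add-then-shift ordering is one reasonable modeling choice):

      def design_b1(w, x):
          k = len(w)
          pipe = [0] * k                 # pipe[j]: partial result in cell j
          out = []
          for xt in x:                   # one broadcast input per cycle
              pipe = [p + wj * xt for p, wj in zip(pipe, w)]  # yout = yin + W*xin
              out.append(pipe[-1])       # value leaving the last cell
              pipe = [0] + pipe[:-1]     # results move right; a fresh y enters
          return out[k - 1:]             # first complete result after k cycles

      print(design_b1([1, 2], [1, 2, 3, 4]))  # -> [5, 8, 11], matching convolve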

  34. We go back to our unified way of drawing processors

  35. We insert more DFFs to avoid broadcasting. [Snapshot: inputs a2 a3 a4 0 0 0 over weights b4 b3 b2 b1; only a4*b4 has been produced, the other outputs are still 0] We simulate it again, shifting vector a. Vector a moves, highest coefficient towards highest coefficient.

  36. [Snapshot: inputs a1 a2 a3 a4 0 0; the products a4*b4, a3*b4, and a4*b3 appear in separate columns, the rest are 0] With this modification the circuit does not work correctly like this. We must try something new….

  37. Let us check what happens when we shift a through b with the highest-coefficient-towards-highest-coefficient approach. [Snapshot: inputs a1 a2 a3 a4 shifted through weights b4 b3 b2 b1; the products line up in columns: a4*b4 | a3*b4, a4*b3 | a2*b4, a3*b3, a4*b2 | a1*b4, a2*b3, a3*b2, a4*b1 | a1*b3, a2*b2, a3*b1 | a1*b2, a2*b1] When we add the results column by column, the timing is correct. But the trouble is that a big adder is needed to sum the results from each column.

  38. Another way of drawing this type of architecture

  39. Types of Systolic Structure: Design F [Diagram: inputs x3 x2 x1 move through cells holding stationary weights W3, W2, W1; each cell computes zout = W×xin and passes xout = xin; the partial zout values fan in to an ADDER that produces the y_i] • Inputs move, weights stay, partial results fan in. • Needs an adder. • Applications: signal processing, pattern matching. A simulation sketch follows below.
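  A matching Python sketch of design F (again a hypothetical model: the x values shift through the cells, the weights stay, and the global adder sums every cell's zout = W×xin each cycle):

      def design_f(w, x):
          k = len(w)
          regs = [0] * k                 # regs[j]: x value at the cell holding w[j]
          out = []
          for xt in x:
              regs = regs[1:] + [xt]     # x enters at the w_k end, moves toward w_1
              out.append(sum(wj * r for wj, r in zip(w, regs)))  # fan-in adder
          return out[k - 1:]             # valid once the array is full

      print(design_f([1, 2], [1, 2, 3, 4]))  # -> [5, 8, 11], matching convolve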

  40. Design F • When the number of cells is large, the adder can be implemented as a pipelined adder tree to avoid a large delay. • Designs of this type use unbounded fan-in. • Fan in the results, move the inputs, weights stay. • (A semi-systolic convolution array with global data communication.)

  41. FIR-Filter-Like Structure, Assuming Two Delays • So we invent a new trick: we create two delays, not one, so that the data can be shifted everywhere without broadcasting (a simulation sketch follows below). [Diagram: stationary weights b4 b3 b2 b1 with an adder chain and two delays per stage]
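  A register-level Python sketch of the two-delay idea (hedged: this follows the standard pure-systolic scheme in which x advances through two registers per cell while the partial result advances through one, with the weights stored in reverse order; the exact register placement in the snapshots that follow may differ):

      def fir_two_delays(w, x):
          k, n = len(w), len(x)
          xr1, xr2 = [0] * k, [0] * k    # the two x registers in each cell
          yr = [0] * k                   # the single y register in each cell
          out = []
          for t in range(n + k - 1):
              x0 = x[t] if t < n else 0  # feed zeros once the input is exhausted
              nx1, nx2, ny = [0] * k, [0] * k, [0] * k
              for c in range(k):
                  x_in = x0 if c == 0 else xr2[c - 1]
                  y_in = 0 if c == 0 else yr[c - 1]
                  ny[c] = y_in + w[k - 1 - c] * x_in  # tap with the reversed weight
                  nx1[c] = x_in
                  nx2[c] = xr1[c]        # second delay on the x path
              xr1, xr2, yr = nx1, nx2, ny
              if t >= 2 * k - 2:         # first complete result after 2k-2 cycles
                  out.append(yr[-1])
          return out

      print(fir_two_delays([1, 2], [1, 2, 3, 4]))  # -> [5, 8, 11], no broadcast needed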

  42.–50. [Animation snapshots: the input vector a is shifted step by step through the stationary weights b4 b3 b2 b1 of the two-delay structure; each frame shows the partial products advancing one tap per cycle until the full convolution emerges.]
