Optimizing Loop Performance with Software Pipelining Technique

CMPUT680 - Winter 2006
Topic E: Software Pipelining
José Nelson Amaral
http://www.cs.ualberta.ca/~amaral/courses/680
CMPUT 680 - Compiler Design and Optimization

Reading List
• Tiger book: chapter 20
• Other papers such as: GovindAltmanGao97, RutenbergAtAl97

Software Pipeline
Software pipelining is a technique that reduces the execution time of important loops by interweaving operations from many iterations to optimize the use of resources.
[Figure: overlapped loop iterations, each made of the operations ldf, fadds, stf, sub, cmp, bg, drawn on a time axis from cycle 0 to cycle 16]

Software Pipeline
[Figure: the same overlapped iterations (ldf, fadds, stf, sub, cmp, bg) on a time axis from cycle 0 to cycle 16, with the initiation interval marked between the starts of consecutive iterations]
What limits the speed of a loop?
• Data dependencies: recurrence initiation interval (rec_mii)
• Processor resources: resource initiation interval (res_mii)
• Memory accesses: memory initiation interval (mem_mii)

Problem Formulation (I)
Given a weighted dependence graph, derive a schedule which is “time-optimal” under a machine model M.
Def: A schedule S of a loop L is time-optimal if, among all “legal” schedules of L, no schedule is faster than S.
Note: There may be more than one time-optimal schedule.

Example: The Inner Product
Source loop:
    Q = 0.0
    DO k = 1, N
      Q = Q + Z(k)*X(k)
    ENDDO
The same loop in Dynamic Single Assignment form:
    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N
      u_k ← load z_{k-1}
      v_k ← load x_{k-1}
      w_k ← u_k * v_k
      q_k ← q_{k-1} + w_k
      z_k ← z_{k-1} + 4
      x_k ← x_{k-1} + 4
    END DO
Dynamic Single Assignment (DSA): uses expanded virtual registers (EVRs). An EVR is an infinite, linearly ordered set of virtual registers. A program in DSA form has no anti-dependences and no output dependences. (Dehnert, J. and Towle, R. A., “Compiling for the Cydra 5”)
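The DSA form can be mimicked in Python as a rough sketch, assuming each expanded virtual register is modeled as a list indexed by iteration number. The function name `inner_product_dsa` and the use of element indices in place of byte addresses are illustrative choices, not part of the slides.

```python
def inner_product_dsa(Z, X):
    """Inner product written so that every value is assigned exactly once."""
    n = len(Z)
    # Each list plays the role of one expanded virtual register (EVR):
    # element k holds the value defined in iteration k and is never overwritten.
    z = [0] * (n + 1)      # index into Z (stands in for the address &Z(k))
    x = [0] * (n + 1)      # index into X
    q = [0.0] * (n + 1)    # running sum; q[0] is the initial value
    u = [None] * (n + 1)
    v = [None] * (n + 1)
    w = [None] * (n + 1)
    for k in range(1, n + 1):
        u[k] = Z[z[k - 1]]        # (a) load through z_{k-1}
        v[k] = X[x[k - 1]]        # (b) load through x_{k-1}
        w[k] = u[k] * v[k]        # (c) multiply
        q[k] = q[k - 1] + w[k]    # (d) accumulate
        z[k] = z[k - 1] + 1       # (e) advance address (+4 bytes in the slides)
        x[k] = x[k - 1] + 1       # (f) advance address
    return q[n]
```

Because no element is written twice, the only dependences left are true (flow) dependences, which is what makes the later dependence graph easy to read off.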

Machine Model and Resource Constraints
Which unit does each operation in the loop use?
    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N
      u_k ← load z_{k-1}        MEM
      v_k ← load x_{k-1}        MEM
      w_k ← u_k * v_k           FMULT
      q_k ← q_{k-1} + w_k       FADD
      z_k ← z_{k-1} + 4         ADDR
      x_k ← x_{k-1} + 4         ADDR
    END DO
Machine model:
    Unit     Latency
    MEM1     6
    MEM2     6
    ADDR1    1
    ADDR2    1
    FMULT    2
    FADD     2
Without instruction-level parallelism, how long does the loop take to execute? (6+6+2+2+1+1)*N = 18*N cycles.

Resource Minimum Initiation Interval (ResMII)
Each processor resource defines a minimum initiation interval for the execution of the loop. For instance, in the machine model of the previous example (two ADDR units, one cycle per address computation), a loop that requires the computation of 6 addresses has ResMII(ADDR) = 6*1/2 = 3.
The Resource Minimum Initiation Interval of a loop is the largest of these per-resource bounds, taken over all resources.
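The per-resource bound and its maximum can be sketched as follows. The function name `res_mii`, the dictionary layout, and the rounding up with `ceil` (for the case where the division is not exact) are assumptions made for this illustration.

```python
from math import ceil

def res_mii(cycles_needed, units_available):
    """Lower bound on the initiation interval imposed by resources.

    cycles_needed:   resource class -> total cycles the loop body keeps it busy
    units_available: resource class -> number of identical units of that class
    """
    bounds = {
        r: ceil(cycles_needed[r] / units_available[r])
        for r in cycles_needed
    }
    # The most heavily loaded resource class sets the bound; II is at least 1.
    return max(1, max(bounds.values(), default=0)), bounds

# Example from the text: 6 address computations, 1 cycle each, 2 ADDR units.
ii, per_resource = res_mii({"ADDR": 6}, {"ADDR": 2})
print(ii, per_resource)
```

For the inner-product loop, assuming every unit is pipelined and therefore busy for one cycle per operation, the inputs would be `{"MEM": 2, "ADDR": 2, "FMULT": 1, "FADD": 1}` with `{"MEM": 2, "ADDR": 2, "FMULT": 1, "FADD": 1}` units, which gives a bound of 1.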

ResMII
For the inner-product loop (two MEM operations, one FMULT, one FADD, two ADDR) under the machine model with units MEM1, MEM2 (latency 6), ADDR1, ADDR2 (latency 1), FMULT (latency 2), and FADD (latency 2), there are enough units to schedule all the instructions of the loop in the same cycle. Therefore ResMII = 1.
Can we execute the loop in N+C cycles (C = a small constant)?

Recurrence Minimum Initiation Interval (RecMII)
    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N
      (a) u_k ← load z_{k-1}
      (b) v_k ← load x_{k-1}
      (c) w_k ← u_k * v_k
      (d) q_k ← q_{k-1} + w_k
      (e) z_k ← z_{k-1} + 4
      (f) x_k ← x_{k-1} + 4
    END DO
[Figure: dependence graph over operations a to f, unrolled for iterations k=1, k=2, k=3; edges between consecutive iterations are labeled (1)]

Recurrence Minimum Initiation Interval (RecMII)
    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N                      Unit    Lat.
      (a) u_k ← load z_{k-1}         MEM     (6)
      (b) v_k ← load x_{k-1}         MEM     (6)
      (c) w_k ← u_k * v_k            FMULT   (2)
      (d) q_k ← q_{k-1} + w_k        FADD    (2)
      (e) z_k ← z_{k-1} + 4          ADDR    (1)
      (f) x_k ← x_{k-1} + 4          ADDR    (1)
    END DO
[Figure: dependence graph over operations a to f; edges carry (dist,lat) labels, of which (1,2), (1,1), (1,1), (1,1), (1,1) are legible]

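Recurrence Minimum Initiation Interval (RecMII)
The recurrence minimum initiation interval (rec_mii) is given by the largest ratio, over all cycles c of the dependence graph, of the total latency around the cycle to the total distance around the cycle:
    rec_mii = max over cycles c of (sum of lat on c) / (sum of dist on c)
[Figure: dependence graph over operations a to f with (dist,lat) edge labels (1,2), (1,1), (1,1), (1,1), (1,1)]
Quiz: What is the rec_mii for the example?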
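One way to compute rec_mii for a small graph is to enumerate its simple cycles and take the largest latency-to-distance ratio. The sketch below is only an illustration: the function names are made up, and the edge list for the inner product is a reading of the loop body (self-recurrences on d, e, f, address uses e→a and f→b, and the intra-iteration chain a, b → c → d), not a transcription of the figure.

```python
from math import ceil

def simple_cycles(edges):
    """Yield each simple cycle as a list of edges (src, dst, dist, lat)."""
    out = {}
    for e in edges:
        out.setdefault(e[0], []).append(e)
    nodes = sorted({e[0] for e in edges} | {e[1] for e in edges})

    def extend(start, node, path, seen):
        for e in out.get(node, []):
            nxt = e[1]
            if nxt == start:
                yield path + [e]
            elif nxt not in seen and nxt > start:
                # Only walk through nodes ordered after `start`, so each
                # cycle is reported once, from its smallest node.
                yield from extend(start, nxt, path + [e], seen | {nxt})

    for s in nodes:
        yield from extend(s, s, [], {s})

def rec_mii(edges):
    """Largest ceil(total latency / total distance) over all cycles."""
    best = 0
    for cyc in simple_cycles(edges):
        dist = sum(e[2] for e in cyc)
        lat = sum(e[3] for e in cyc)
        best = max(best, ceil(lat / dist))   # dist > 0 in any legal loop
    return best

# Assumed edges for the inner-product loop: (src, dst, dist, lat).
INNER_PRODUCT = [
    ("a", "c", 0, 6), ("b", "c", 0, 6), ("c", "d", 0, 2),
    ("d", "d", 1, 2), ("e", "e", 1, 1), ("f", "f", 1, 1),
    ("e", "a", 1, 1), ("f", "b", 1, 1),
]
print(rec_mii(INNER_PRODUCT))
```

Under these assumed edges the critical cycle is the accumulation d → d, with latency 2 and distance 1.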

Minimum Initiation Interval
The Minimum Initiation Interval (MII) for a loop is constrained both by resources and by recurrences. Therefore, it is given by:
    MII = max(ResMII, RecMII)
In our example we have MII = max(1,2) = 2. Therefore the best that we can do without transforming the loop is to execute it in 2*N+C cycles.

Modulo Schedule
In modulo scheduling, we:
(1) start with the first instruction;
(2) schedule as many instructions as we can in every cycle, limited only by the resources available and by the dependences.
When a pattern emerges, we adopt the pattern as our modulo schedule. Instructions before this pattern form the loop prologue. Instructions after this pattern form the loop epilogue.

Recurrence Minimum Initiation Interval (RecMII)
    z_0 ← &Z(1)
    x_0 ← &X(1)
    q_0 ← 0.0
    DO k = 1, N                      Lat.
      (a) u_k ← load z_{k-1}         (6)
      (b) v_k ← load x_{k-1}         (6)
      (c) w_k ← u_k * v_k            (2)
      (d) q_k ← q_{k-1} + w_k        (2)
      (e) z_k ← z_{k-1} + 4          (1)
      (f) x_k ← x_{k-1} + 4          (1)
    END DO

Why an eager scheduler fails in our example
Operations issued in each cycle when every operation is issued as early as possible (the number after the letter is the iteration; the original table has one column per iteration, 1 to 18):
    Cycle   Operations issued
    0       b1
    1       b2
    2       b3
    3       b4
    4       b5
    5       b6
    6       b7
    7       c1  b8
    8       c2  b9
    9       d1  c3  b10
    10      c4  b11
    11      d2  c5  b12
    12      c6  b13
    13      d3  c7  b14
    14      c8  b15
    15      d4  c9  b16
    16      c10 b17
    17      d5  c11 b18
    18      c12
    19      d6  c13
    20      c14
    21      d7  c15
    22      c16
    23      d8  c17

Why an eager scheduler fails in our example
Operations issued in each cycle when a new iteration is started only every second cycle (the number after the letter is the iteration):
    Cycle   Operations issued
    0       b1
    1
    2       b2
    3
    4       b3
    5
    6       b4
    7       c1
    8       b5
    9       d1  c2
    10      b6
    11      d2  c3
    12      b7
    13      d3  c4
    14      b8
    15      d4  c5
    16      b9
    17      d5  c6
    18      b10
    19      d6  c7
    20      b11
    21      d7  c8
    22      b12
    23      d8  c9
Therefore we can do it in 2*N+9 cycles.
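The contrast between the two tables can be reproduced with a small simulation of the chain b (load) → c (multiply) → d (add). This is a sketch under simplifying assumptions: unlimited functional units, the latencies 6, 2, 2 from the machine model, and a result being usable exactly `latency` cycles after issue, so absolute cycle numbers need not match the tables. The names `simulate` and `slack` are illustrative.

```python
LAT_LOAD, LAT_MUL, LAT_ADD = 6, 2, 2   # latencies from the machine model

def simulate(n, issue_interval):
    """Issue cycles of b (load), c (multiply), d (add) for iterations 1..n.

    A new load is issued every `issue_interval` cycles; c and d start as soon
    as their operands are ready.
    """
    rows = []
    d_prev_done = 0                         # q_0 is available at cycle 0
    for k in range(1, n + 1):
        b = (k - 1) * issue_interval
        c = b + LAT_LOAD
        d = max(c + LAT_MUL, d_prev_done)   # d waits for w_k and for q_{k-1}
        d_prev_done = d + LAT_ADD
        rows.append((k, b, c, d))
    return rows

def slack(rows):
    """Cycles each product w_k waits between being ready and being consumed."""
    return [d - (c + LAT_MUL) for _, _, c, d in rows]

print(slack(simulate(8, 1)))   # eager: one load per cycle
print(slack(simulate(8, 2)))   # paced: one load every 2 cycles
```

With one load per cycle the waiting time grows by one cycle per iteration, so no repeating pattern emerges and ever more products are live at once. Pacing the loads at the recurrence-bound interval of 2 keeps the waiting time constant.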

Collision vectors
Given the reservation tables for two operations A and B, the set of forbidden intervals, i.e., the issue distances at which operations A and B cannot both be issued without a resource conflict, is called the collision vector for the reservation tables.
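The forbidden intervals can be derived directly from two reservation tables. The sketch below assumes a reservation table is represented as a mapping from resource name to the set of cycle offsets, relative to issue, at which the operation occupies that resource. That representation and the name `forbidden_intervals` are assumptions for illustration.

```python
def forbidden_intervals(table_a, table_b):
    """Distances d such that issuing B exactly d cycles after A causes a clash.

    A negative d means B is issued before A.
    """
    forbidden = set()
    for resource, cycles_a in table_a.items():
        for i in cycles_a:
            for j in table_b.get(resource, ()):
                # A holds the resource at time i, B at time d + j;
                # they clash exactly when d == i - j.
                forbidden.add(i - j)
    return sorted(forbidden)

# Hypothetical tables: a non-pipelined 2-cycle add and a 3-cycle multiply.
add = {"adder": {0, 1}}
mul = {"mult": {0, 1, 2}}
print(forbidden_intervals(add, add))
print(forbidden_intervals(add, mul))
```

Two such adds cannot be issued in the same cycle or one cycle apart, while an add and a multiply never collide because they share no resource.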

A Simplistic Modulo Scheduling Algorithm
1. Compute MII as discussed.
2. Use a modified list scheduling algorithm to generate a modulo schedule.
The scheduling algorithm must obey the following restriction: if an operation P is scheduled at time t, no operation that needs the same resource can be scheduled at any time t ± k*II, for any k ≥ 0.
The Modulo Reservation Table has II rows, representing the cycles of the initiation interval, and as many columns as the resources that it needs to keep track of.
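A modulo reservation table can be sketched as a small class in which every reservation is recorded at its start time modulo II. The class and function names, and the encoding of an operation's resource usage as `(resource, offset)` pairs, are assumptions for this illustration rather than the algorithm of any particular compiler.

```python
class ModuloReservationTable:
    def __init__(self, ii, resources):
        self.ii = ii
        # One row per cycle of the initiation interval, one column per resource.
        self.rows = [{r: None for r in resources} for _ in range(ii)]

    def fits(self, start, usage):
        """usage: list of (resource, offset) pairs occupied by the operation."""
        slots = [((start + off) % self.ii, res) for res, off in usage]
        if len(set(slots)) != len(slots):   # the operation collides with itself
            return False
        return all(self.rows[row][res] is None for row, res in slots)

    def reserve(self, start, usage, name):
        for res, off in usage:
            self.rows[(start + off) % self.ii][res] = name

def place(mrt, name, usage, earliest):
    """Try the II consecutive start times from `earliest`; return one or None."""
    for t in range(earliest, earliest + mrt.ii):
        if mrt.fits(t, usage):
            mrt.reserve(t, usage, name)
            return t
    return None

mrt = ModuloReservationTable(2, ["adder"])
print(place(mrt, "P", [("adder", 0)], 0))
print(place(mrt, "Q", [("adder", 0)], 0))
print(place(mrt, "R", [("adder", 0)], 0))
```

Only II consecutive start times need to be tried, because any later time maps onto a row that has already been examined. In the usage above, the third operation finds no free row, which signals that II = 2 is too small for three single-cycle uses of one adder.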

Heuristic Method for Modulo Scheduling
Why might a simple variant of list scheduling not work?
Problem: generate a modulo schedule of a loop by scheduling instructions until a pattern emerges.

Counter Example I: List Scheduling May Fail
[Figure: dependence graph over operations A, B, C, D with (dist,lat) edge labels (0,4), (0,2), (0,2), (1,2)]
There is only one cycle in the dependence graph, therefore RecMII is determined by that cycle alone.
Therefore, in a machine with infinite resources, we must be able to schedule the loop with an initiation interval of 4 cycles.

Counter Example I: List Scheduling May Fail
List scheduling: a greedy algorithm that schedules each operation at its earliest possible time.
[Figure: dependence graph over A, B, C, D with (dist,lat) edge labels (0,4), (0,2), (0,2), (1,2), next to a schedule with rows for cycles 0 to 3 in which A, C, and D are placed and the slot for B is marked "???"]
B must be scheduled after the A of the current iteration and before the C of the next iteration. We are deadlocked!

Counter Example I: List Scheduling May Fail
[Figure: dependence graph with (dist,lat) edge labels (0,4), (0,2), (0,2), (1,2), next to a schedule in which A and C occupy cycles 0 to 3 and B and D occupy cycles 4 to 7. The prologue issues A(0) and C(0); the kernel overlaps B and D of one iteration with A and C of the next; the epilogue issues B(N) and D(N)]
The solution is to create a kernel with operations from different iterations, and use a prologue and an epilogue.
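The split into prologue, kernel, and epilogue follows mechanically once the issue cycle of each operation within one iteration and the II are fixed. The sketch below assumes issue cycles A=0, C=2, B=4, D=6 with II = 4, which is one reading of the figure. The function name `pipeline` is illustrative, and "kernel" here means the whole steady-state part, i.e. every repetition of the II-cycle pattern.

```python
def pipeline(op_times, ii, n):
    """Split a software-pipelined loop into prologue, kernel, and epilogue.

    op_times: operation name -> issue cycle within one iteration
    Returns three lists of rows; each row lists the (op, iteration) pairs
    issued in one cycle. Iterations are numbered from 0, and iteration i
    starts at cycle i * ii. Assumes n is at least the number of stages.
    """
    last = max(op_times.values())
    stages = last // ii + 1
    total = (n - 1) * ii + last + 1
    rows = [[] for _ in range(total)]
    for i in range(n):
        for op, t in sorted(op_times.items()):
            rows[i * ii + t].append((op, i))
    fill = (stages - 1) * ii      # cycles until the pipeline is full
    steady_end = n * ii           # no new iteration starts after this cycle
    return rows[:fill], rows[fill:steady_end], rows[steady_end:]

prologue, kernel, epilogue = pipeline({"A": 0, "C": 2, "B": 4, "D": 6}, ii=4, n=3)
for name, part in (("prologue", prologue), ("kernel", kernel), ("epilogue", epilogue)):
    print(name, part)
```

In the kernel, B and D of iteration i are issued in the same cycles as A and C of iteration i+1, which is the overlap that plain list scheduling of a single iteration cannot produce.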

Counter Example II: List Scheduling May Fail
• A1, A3, and A4 are non-pipelined adds that take two cycles each on the adder.
• M5 and M6 are non-pipelined multiply operations that take three cycles each on the multiplier.
• C2 is a copy operation that uses the bus for one cycle.
[Figure: dependence graph over A1, C2, A3, A4, M5, M6 with (dist,lat) edge labels (0,2), (0,3), (0,1), (0,2), (0,2), (0,3)]
What is the ResMII for these operations in a machine that has one adder, one multiplier and one bus?
ResMII(Adder) = 6; ResMII(Multiplier) = 6; ResMII(Bus) = 1. Therefore ResMII = 6.

Counter Example II: List Scheduling May Fail
[Figure: the same dependence graph next to a modulo reservation table with columns Adder, Mult, Bus and rows 0 to 5. A1, M6, C2, and A3 are placed at their earliest possible times; the slot for A4 is marked "???"]
We cannot schedule A4 and achieve an MII = ResMII = 6!

Counter Example II: List Scheduling May Fail
[Figure: the same dependence graph next to a modulo reservation table with columns Adder, Mult, Bus and rows 0 to 5, now holding all of A1, C2, A3, A4, M5, and M6]
Although it seems counter-intuitive, we obtain a modulo schedule with MII = 6 if we initially schedule both M6 and A3 one cycle later than the earliest possible time for these operations.

Complex Reservation Tables
Consider three independent operations, A1 (0,2), M2 (0,3), and MA3 (0,4), with the reservation tables shown below.
[Figure: one reservation table per operation, each with columns Add, Mult, Bus]
What is the MII for a loop formed by these three operations?
ResMII(Add) = 1 + 0 + 1 = 2
ResMII(Mult) = 0 + 1 + 1 = 2
ResMII(Bus) = 1 + 1 + 0 = 2
ResMII = 2

Complex Reservation Tables
Is MII = 2 feasible?
[Figure: reservation tables for A1 (0,2), M2 (0,3), and MA3 (0,4), next to a modulo reservation table with columns Adder, Mult, Bus and rows 0 and 1 in which A1 and M2 have been placed]
Deadlocked: MA3 cannot be allocated. Even though MII = max(ResMII, RecMII) = 2, an initiation interval of 2 is not feasible!

Complex Reservation Tables
Does increasing the initiation interval to 3 help?
[Figure: reservation tables for A1 (0,2), M2 (0,3), and MA3 (0,4), next to a modulo reservation table with columns Adder, Mult, Bus and rows 0 to 2 in which A1, M2, and MA3 have all been placed]
We find a modulo schedule with II = 3!
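The infeasibility at II = 2 and the feasibility at II = 3 can be checked by brute force over all start times modulo II. The reservation tables in the figures were not preserved, so the tables below are hypothetical: they were chosen only to match the per-resource usage counts (A1 uses Add and Bus, M2 uses Mult and Bus, MA3 uses Mult and Add, one cycle each) and the stated outcome.

```python
from itertools import product

# Hypothetical reservation tables: resource -> cycle offsets at which it is busy.
TABLES = {
    "A1":  {"Add": {0}, "Bus": {1}},
    "M2":  {"Mult": {0}, "Bus": {2}},
    "MA3": {"Mult": {0}, "Add": {1}},
}

def feasible(tables, ii):
    """Return start times (mod ii) with no resource clash, or None."""
    names = sorted(tables)
    for starts in product(range(ii), repeat=len(names)):
        used = set()
        ok = True
        for name, start in zip(names, starts):
            for res, offsets in tables[name].items():
                for off in offsets:
                    slot = (res, (start + off) % ii)
                    if slot in used:
                        ok = False      # two uses fall in the same row
                    used.add(slot)
        if ok:
            return dict(zip(names, starts))
    return None

print(feasible(TABLES, 2))
print(feasible(TABLES, 3))
```

With these tables each pair of operations shares exactly one resource, and at II = 2 the three pairwise constraints cannot all hold at once, even though every resource is used only twice. Exhaustive search is affordable here only because the example has three operations.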

Interaction Between Recurrence Constraints and Resource Constraints
[Figure: reservation table for an operation A (0,2) with columns Add, Mult, Bus, next to a dependence graph A1 → A2 → A3 → A4 whose edges are labeled (0,2), with one loop-carried edge labeled (2,2)]
What is the RecMII for this loop?
RecMII = (2+2+2+2)/2 = 4
What is the ResMII for the loop?
ResMII(Add) = 1+1+1+1 = 4
ResMII(Mult) = 0+0+0+0 = 0
ResMII(Bus) = 1+1+1+1 = 4
ResMII = 4
Therefore MII = max(ResMII, RecMII) = 4

Interaction Between Recurrence Constraints and Resource Constraints
Is MII = 4 feasible?
[Figure: the same dependence graph and reservation table, next to a modulo reservation table with columns Adder, Mult, Bus and rows 0 to 3 in which A1 and A2 have been placed]
In order to finish A4 in time to produce the result needed two iterations later, A3 must be scheduled at time 4. But 4 modulo 4 = 0, which conflicts with A1. Therefore there is no feasible schedule with an initiation interval of 4.
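The argument can be checked by searching over start times that satisfy both the dependence constraints and the modulo resource constraint. The sketch assumes the chain A1 → A2 → A3 → A4 with (0,2) edges and a (2,2) edge from A4 back to A1, and models only the single adder, with each operation occupying it in its issue cycle. Both are readings of the figure, and the name `find_schedule` is illustrative.

```python
from itertools import product

EDGES = [  # (src, dst, dist, lat)
    ("A1", "A2", 0, 2), ("A2", "A3", 0, 2),
    ("A3", "A4", 0, 2), ("A4", "A1", 2, 2),
]
OPS = ["A1", "A2", "A3", "A4"]

def find_schedule(ii, horizon=12):
    """Search start times in [0, horizon) satisfying both kinds of constraint."""
    for starts in product(range(horizon), repeat=len(OPS)):
        t = dict(zip(OPS, starts))
        # Dependence: dst of iteration i+dist starts at least lat after src.
        deps_ok = all(t[d] + dist * ii - t[s] >= lat for s, d, dist, lat in EDGES)
        # Assumption: every operation occupies the one adder in its issue cycle.
        slots = {t[op] % ii for op in OPS}
        if deps_ok and len(slots) == len(OPS):
            return t
    return None

print(find_schedule(4))
print(find_schedule(5))
```

At II = 4 the recurrence leaves no freedom: the four operations must be issued exactly two cycles apart, so A1 and A3 land in the same row of the table. One extra cycle of II loosens the recurrence enough for the rows to become distinct.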

Scheduling Strategy
An exhaustive search will eventually reveal that the calculated MII is not feasible, but it might take too long.
In practice, we compute the MII and spend a pre-allocated budget of time trying to find a schedule with the MII. If we don’t find one, we increase the II.
In some commercial compilers, the search for the smallest feasible II is a binary search, where the II is doubled at each step until a feasible one is found, at which point a linear search between the last infeasible II and the feasible one is conducted.
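The doubling-then-linear search described above can be sketched as follows. The function name, the `max_ii` cut-off, and the `try_schedule` callback (standing in for a budget-limited modulo scheduler that returns a schedule or `None`) are hypothetical and not taken from any specific compiler.

```python
def smallest_feasible_ii(mii, try_schedule, max_ii=1024):
    """Find a feasible II >= mii by doubling, then scanning linearly.

    Returns (ii, schedule), or None if nothing up to max_ii works.
    """
    ii = max(1, mii)
    last_failed = ii - 1
    while ii <= max_ii:
        schedule = try_schedule(ii)
        if schedule is not None:
            break
        last_failed = ii
        ii *= 2
    else:
        return None      # the loop ran out without finding a feasible II
    # Linear scan strictly between the last failure and the first success.
    for candidate in range(last_failed + 1, ii):
        found = try_schedule(candidate)
        if found is not None:
            return candidate, found
    return ii, schedule

# Toy scheduler that succeeds only for II >= 7.
print(smallest_feasible_ii(4, lambda ii: "ok" if ii >= 7 else None))
```

Doubling bounds the number of failed attempts before the first success, at the cost of a few extra attempts in the final scan.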

Previous Approaches
• Approach I (Operational): “emulate” the loop execution under the machine model, and a “pattern” will eventually occur [AikenNic88, EbciogluNic89, GaoEtAl91]
• Approach II (Periodic scheduling): formulate the scheduling problem as a periodic scheduling problem and find an optimal solution [Lam88, RauEtAl81, GovindAltmanGao94]

Software Pipelining
• Operational Approach
  • Heuristic (Aiken88, AikenNic88, Ebcioglu89, etc)
  • Formal Model (GaoWonNin91)
• Periodic Scheduling (Modulo Scheduling)
  • Non-Exact Method (Heuristic) (RauGla81, Lam88, RauEtAl92, Huff93, DehnertTow93, Rau94, WanEis93)
  • Exact Method
    • Basic Formulation (DongenGao92)
    • Register Optimal (NingGao91, NingGao93, Ning93)
    • ILP based
      • Resource Constrained (GovindAltGao94)
      • Resource & Register (GovindAltGao95, Altman95, EichenbergerDav95)
    • Exhaustive Search (Altman95, AltmanGao96)
  • “Showdown” (RuttenbergGao StouchininWoody96)
