Advanced Compiler Techniques

Advanced Compiler Techniques Parallelism & Locality LIU Xianhua School of EECS, Peking University

Outline
• Data dependences
• Loop transformation
• Software pipelining
• Software prefetching & data layout

Data Dependence of Variables
• True dependence (write, then read):
    a =
      = a
• Anti-dependence (read, then write):
      = a
    a =
• Output dependence (write, then write):
    a =
    a =
• Input dependence (read, then read):
      = a
      = a

Domain of Data Dependence Analysis
• Only use loop bounds and array indexes that are affine functions of loop variables

    for i = 1 to n
      for j = 2i to 100
        a[i + 2j + 3][4i + 2j][i * i] = …
        … = a[1][2i + 1][j]

• Assume a data dependence between the read and the write operation if there exist integers $i_r, j_r, i_w, j_w$ such that
    $1 \le i_w, i_r \le n$
    $2 i_w \le j_w \le 100$
    $2 i_r \le j_r \le 100$
    $i_w + 2 j_w + 3 = 1$
    $4 i_w + 2 j_w = 2 i_r + 1$
• Equate each dimension of the array access; ignore non-affine ones (here the third dimension, i * i)
• No solution → no data dependence
• Solution → there may be a dependence
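A brute-force check can make the test on this slide concrete. The sketch below is only illustrative: it enumerates the bounded iteration space for a small, assumed value of n, whereas a real analyzer solves the integer constraints symbolically. The function name `may_depend` is hypothetical.

```python
from itertools import product


def may_depend(n: int) -> bool:
    """Return True if some write iteration (iw, jw) and some read iteration
    (ir, jr) agree on both affine array dimensions."""
    for iw, ir in product(range(1, n + 1), repeat=2):
        for jw in range(2 * iw, 101):
            # Dimension 1: write index iw + 2*jw + 3, read index 1.
            if iw + 2 * jw + 3 != 1:
                continue
            # Dimension 2: write index 4*iw + 2*jw, read index 2*ir + 1.
            if 4 * iw + 2 * jw != 2 * ir + 1:
                continue
            # Dimension 3 (i * i) is non-affine and is ignored.
            # jr is unconstrained by the equations; it only has to exist
            # within its loop bounds 2*ir <= jr <= 100.
            if 2 * ir <= 100:
                return True
    return False


print(may_depend(10))  # assumed n = 10
```

No solution exists here for any n: the first equation needs iw + 2*jw = -2 while both variables are positive, and the second equates an even number with an odd one.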

Iteration Space
• An abstraction for loops
• Each iteration is represented as a coordinate in the iteration space.

    for i = 0, 5
      for j = 0, 3
        a[i, j] = 3

(Figure: iteration space with axes i and j.)

Iteration Space
• An abstraction for loops

    for i = 0, 5
      for j = i, 3
        a[i, j] = 0

(Figure: iteration space with axes i and j.)

Iteration Space
• An abstraction for loops

    for i = 0, 5
      for j = i, 7
        a[i, j] = 0

(Figure: iteration space with axes i and j.)
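The three loop nests on the Iteration Space slides can be turned into explicit point sets. The following is a minimal sketch that assumes both loop bounds are inclusive (the slides do not say so explicitly); the helper name `iteration_space` is hypothetical.

```python
def iteration_space(i_lo, i_hi, j_lo, j_hi):
    """List the (i, j) points of a 2-deep loop nest with inclusive bounds.

    j_lo may be an int or a function of i, to model bounds such as
    `for j = i, 7`.
    """
    points = []
    for i in range(i_lo, i_hi + 1):
        lo = j_lo(i) if callable(j_lo) else j_lo
        for j in range(lo, j_hi + 1):
            points.append((i, j))
    return points


rectangle = iteration_space(0, 5, 0, 3)              # for j = 0, 3
triangle = iteration_space(0, 5, lambda i: i, 3)     # for j = i, 3
trapezoid = iteration_space(0, 5, lambda i: i, 7)    # for j = i, 7
print(len(rectangle), len(triangle), len(trapezoid))
```

When the lower bound of j depends on i, the space is no longer a rectangle, and rows with i > 3 in the second nest contain no iterations at all.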

Affine Access

Affine Transform
(Figure: iteration space with axes (i, j) mapped to axes (u, v).)

Loop Transformation

    for i = 1, 100
      for j = 1, 200
        A[i, j] = A[i, j] + 3
      end_for
    end_for

    for u = 1, 200
      for v = 1, 100
        A[v, u] = A[v, u] + 3
      end_for
    end_for

Interchange Loops?

    for i = 2, 1000
      for j = 1, 1000
        A[i, j] = A[i-1, j+1] + 3
      end_for
    end_for

    for u = 1, 1000
      for v = 2, 1000
        A[v, u] = A[v-1, u+1] + 3
      end_for
    end_for

• e.g. dependence vector $d_{old} = (1, -1)$
(Figure: iteration space with axes i and j.)
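One way to answer the question on this slide is to permute the dependence vector and check that it stays lexicographically positive. The sketch below is a minimal illustration, not a full legality analysis; the helper names `lex_positive`, `permute`, and `permutation_is_legal` are hypothetical, and only loop-carried dependence vectors should be passed in.

```python
def lex_positive(v):
    """True if the first nonzero component of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False  # the zero vector is not lexicographically positive


def permute(v, order):
    """Reorder a dependence vector; order[k] is the original loop index
    that becomes loop k after the permutation."""
    return tuple(v[k] for k in order)


def permutation_is_legal(dep_vectors, order):
    """A loop permutation is legal if every loop-carried dependence vector
    remains lexicographically positive afterwards."""
    return all(lex_positive(permute(d, order)) for d in dep_vectors)


d_old = (1, -1)
d_new = permute(d_old, (1, 0))   # interchange the two loops
print(d_new, permutation_is_legal([d_old], (1, 0)))
```

For d_old = (1, -1) the interchanged vector is (-1, 1), which is lexicographically negative: the read would run before the write that produces its value, so this interchange is not legal.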

GCD Test
• Is there any dependence?

    for i = 1, 100
      a[2*i] = …
      … = a[2*i+1] + 3

• Solve a linear Diophantine equation
    $2 i_w = 2 i_r + 1$

GCD
• The greatest common divisor (GCD) of integers $a_1, a_2, \ldots, a_n$, denoted $\gcd(a_1, a_2, \ldots, a_n)$, is the largest integer that evenly divides all these integers.
• Theorem: The linear Diophantine equation
    $a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = c$
  has an integer solution $x_1, x_2, \ldots, x_n$ iff $\gcd(a_1, a_2, \ldots, a_n)$ divides $c$.
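The theorem translates directly into a small test. The sketch below applies it to the GCD Test example, rewritten as 2*iw - 2*ir = 1; the function name `gcd_test` is hypothetical. The test ignores loop bounds, so a True result only means a dependence may exist.

```python
from functools import reduce
from math import gcd


def gcd_test(coeffs, c):
    """True if sum(coeffs[k] * x[k]) == c has an integer solution,
    i.e. if gcd(coeffs) divides c."""
    g = reduce(gcd, (abs(a) for a in coeffs))
    if g == 0:  # all coefficients are zero
        return c == 0
    return c % g == 0


# a[2*i] written, a[2*i+1] read:  2*iw - 2*ir = 1
print(gcd_test([2, -2], 1))
```

Here gcd(2, 2) = 2 does not divide 1, so the equation has no integer solution and the two accesses never touch the same element.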

Loop Transformation
• Loop Fusion
• Loop Distribution/Fission
• Loop Re-Indexing
• Loop Scaling
• Loop Reversal
• Loop Permutation/Interchange
• Loop Skew
• Uni-modular Transformation

Loop Transformation

Cache Locality
• Suppose array A has column-major layout
• The loop nest has poor spatial cache locality.

    for i = 1, 100
      for j = 1, 200
        A[i, j] = A[i, j] + 3
      end_for
    end_for
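The locality claim can be checked by computing the distance in memory between consecutive inner-loop accesses. This is a small sketch that assumes A has exactly 100 rows and 200 columns with 1-based indexes; the helper name `column_major_offset` is hypothetical.

```python
def column_major_offset(i, j, rows):
    """Element offset of A[i, j] (1-based) in column-major layout."""
    return (i - 1) + (j - 1) * rows


ROWS = 100  # assumed: i ranges over 1..100

# Original nest: j is the inner loop, so consecutive accesses differ in j.
stride_j_inner = column_major_offset(1, 2, ROWS) - column_major_offset(1, 1, ROWS)
# Interchanged nest: i is the inner loop, so consecutive accesses differ in i.
stride_i_inner = column_major_offset(2, 1, ROWS) - column_major_offset(1, 1, ROWS)
print(stride_j_inner, stride_i_inner)
```

With j innermost each access jumps a whole column (100 elements); with i innermost consecutive accesses are adjacent, which is why interchanging the loops improves spatial locality for this layout.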

Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

Resource Constraints
• Each instruction type has a resource reservation table
• Pipelined functional units: occupy only one slot
• Non-pipelined functional units: multiple time slots
• Instructions may use more than one resource
• Multiple units of same resource
• Limited instruction issue slots
    • may also be managed like a resource
(Figure: resource reservation table with functional units ld, st, alu, fmpy, fadd, br, … as columns and time steps 0, 1, 2 as rows.)

List Scheduling
• Scope: DAGs
• Schedules operations in topological order
• Never backtracks
• Variations:
    • Priority function for node n
        • critical path: max clocks from n to any node
        • resource requirements
        • source order
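The bullets above can be sketched as a small scheduler. This is a minimal illustration under stated assumptions, not a production algorithm: every operation occupies one unit of one resource for one clock, edge latencies are at least 1, priority is the critical-path length, and the DAG, latencies, and resource names in the example are hypothetical.

```python
def list_schedule(ops, edges, units):
    """ops: {name: resource}; edges: {(src, dst): latency};
    units: {resource: units available per clock}.
    Returns {name: start clock}, clocks numbered from 0."""
    succs = {n: [] for n in ops}
    preds = {n: [] for n in ops}
    for (s, d), lat in edges.items():
        succs[s].append((d, lat))
        preds[d].append((s, lat))

    # Priority: longest latency-weighted path from n to any node.
    prio = {}

    def critical(n):
        if n not in prio:
            prio[n] = max((lat + critical(d) for d, lat in succs[n]), default=0)
        return prio[n]

    for n in ops:
        critical(n)

    start, clock = {}, 0
    while len(start) < len(ops):
        free = dict(units)
        # Ready: all predecessors issued and their results available.
        ready = [n for n in ops if n not in start and
                 all(p in start and start[p] + lat <= clock for p, lat in preds[n])]
        for n in sorted(ready, key=lambda n: -prio[n]):
            if free[ops[n]] > 0:       # greedy placement, never undone
                free[ops[n]] -= 1
                start[n] = clock
        clock += 1
    return start


OPS = {"ld1": "mem", "ld2": "mem", "mul": "alu", "add": "alu", "st": "mem"}
EDGES = {("ld1", "mul"): 1, ("ld2", "mul"): 1, ("mul", "add"): 2, ("add", "st"): 2}
UNITS = {"mem": 1, "alu": 1}
print(list_schedule(OPS, EDGES, UNITS))
```

Because an operation is placed as soon as it is ready and a unit is free, and placements are never revisited, the result is fast to compute but not guaranteed to be optimal.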

Global Scheduling
Assume each clock can execute 2 operations of any kind.

    B1:  if (a==0) goto L
             LD R6 <- 0(R1)
             nop
             BEQZ R6, L

    B2:  c = b
             LD R7 <- 0(R2)
             nop
             ST 0(R3) <- R7

    B3:  L: e = d + d
          L: LD R8 <- 0(R4)
             nop
             ADD R8 <- R8,R8
             ST 0(R5) <- R8

Result of Code Scheduling

    B1:      LD R6 <- 0(R1) ; LD R8 <- 0(R4)
             LD R7 <- 0(R2)
             ADD R8 <- R8,R8 ; BEQZ R6, L

    B3:   L: ST 0(R5) <- R8

    B3’:     ST 0(R5) <- R8 ; ST 0(R3) <- R7

Code Motions
Goal: Shorten execution time probabilistically
Moving instructions up:
• Move instruction to a cut set (from entry)
• Speculation: even when not anticipated.
Moving instructions down:
• Move instruction to a cut set (from exit)
• May execute extra instruction
• Can duplicate code
(Figure: flow graphs with the source block of each moved instruction labeled src.)

General-Purpose Applications
• Lots of data dependences
• Key performance factor: memory latencies
    • Move memory fetches up
    • Speculative memory fetches can be expensive
• Control-intensive: get execution profile
    • Static estimation
        • Innermost loops are frequently executed
        • Back edges are likely to be taken
        • Edges that branch to exit and exception routines are not likely to be taken
    • Dynamic profiling
        • Instrument code and measure using representative data

A Basic Global Scheduling Algorithm
• Schedule innermost loops first
• Only upward code motion
• No creation of copies
• Only one level of speculation
• Extra:
    • Prepass before scheduling: loop unrolling

Software Pipelining

Software Pipelining
• Obtain parallelism by executing iterations of a loop in an overlapping way.
• Try to maximize parallelism across loop iterations
• We’ll focus on the simplest case: the do-all loop, where iterations are independent.
• Goal: Initiate iterations as frequently as possible.
• Limitation: Use the same schedule and delay for each iteration.

Example of DoAll Loops
• Machine:
    • Per clock: 1 read, 1 write, 1 (2-stage) arithmetic op, with hardware loop op and auto-incrementing addressing mode.
• Source code:

    for i = 1 to n
      D[i] = A[i] * B[i] + c

• Code for one iteration:

    1. LD  R5,0(R1++)
    2. LD  R6,0(R2++)
    3. MUL R7,R5,R6
    4.
    5. ADD R8,R7,R4
    6.
    7. ST  0(R3++),R8

• No parallelism in basic block

Unrolling

     1. L: LD
     2.    LD
     3.    LD
     4.    MUL  LD
     5.    MUL  LD
     6.    ADD  LD
     7.    ADD  LD
     8.    ST   MUL  LD
     9.    MUL
    10.    ST   ADD
    11.    ADD
    12.    ST
    13.    ST   BL (L)

• Let u be the degree of unrolling:
    • Length of u iterations = 7 + 2(u-1)
    • Execution time per source iteration = (7 + 2(u-1)) / u = 2 + 5/u

Software Pipelined Code

     1.  LD
     2.  LD
     3.  MUL  LD
     4.  LD
     5.  MUL  LD
     6.  ADD  LD
     7.  MUL  LD
     8.  ST   ADD  LD
     9.  MUL  LD
    10.  ST   ADD  LD
    11.  MUL
    12.  ST   ADD
    13.
    14.  ST   ADD
    15.
    16.  ST

• Unlike unrolling, software pipelining can give an optimal result.
• Locally compacted code may not be globally optimal
• DOALL: Can fill arbitrarily long pipelines with infinitely many iterations
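The table on this slide can be regenerated by starting the same per-iteration schedule every 2 clocks. The sketch below is an illustration under assumptions: the per-iteration offsets are read off the table (the ADD sits one clock later than in the single-iteration code, so it never shares a clock with a MUL), and the function name `software_pipeline` is hypothetical.

```python
def software_pipeline(offsets, interval, iterations):
    """offsets: list of (op, 0-based clock offset within one iteration).
    Returns {clock: [ops]} with clocks numbered from 1."""
    table = {}
    for k in range(iterations):
        for op, off in offsets:
            table.setdefault(1 + k * interval + off, []).append(op)
    return table


# Assumed relative schedule of one iteration (8 clocks long).
ONE_ITERATION = [("LD", 0), ("LD", 1), ("MUL", 2), ("ADD", 5), ("ST", 7)]
table = software_pipeline(ONE_ITERATION, interval=2, iterations=5)

# Machine limits from the DoAll example: per clock 1 read, 1 write,
# and 1 arithmetic operation.
ARITH = {"MUL", "ADD"}
for ops in table.values():
    assert ops.count("LD") <= 1 and ops.count("ST") <= 1
    assert sum(op in ARITH for op in ops) <= 1

for clock in range(1, max(table) + 1):
    print(clock, " ".join(table.get(clock, [])))
```

In the steady state a new iteration starts every 2 clocks, compared with 2 + 5/u clocks per iteration for unrolling by u.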

Example of DoAcross Loop

    Loop:
      Sum = Sum + A[i];
      B[i] = A[i] * c;

Software Pipelined Code

    1.  LD
    2.  MUL
    3.  ADD  LD
    4.  ST   MUL
    5.       ADD
    6.       ST

Doacross loops
• Recurrences can be parallelized
• Harder to fully utilize hardware with large degrees of parallelism
(Figure: dependence graph over LD, MUL, ADD, ST.)

Problem Formulation
Goals:
• maximize throughput
• small code size
Find:
• an identical relative schedule S(n) for every iteration
• a constant initiation interval (T)
such that
• the initiation interval is minimized
Complexity:
• NP-complete in general

Example with T = 2:

    S    First iteration    Next iteration
    0    LD
    1    MUL
    2    ADD                LD
    3    ST                 MUL
                            ADD
                            ST

Scheduling Constraints: Resource
• RT: resource reservation table for a single iteration
• RT_S: modulo resource reservation table
    $RT_S[i] = \sum_{\{t \mid t \bmod T = i\}} RT[t]$
(Figure: iterations 1 to 4, each using LD, Alu, ST and started T = 2 clocks apart, overlap over time into a repeating steady state of length T = 2.)
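The modulo reservation table is just the single-iteration table folded onto T rows. The following is a minimal sketch; the single-iteration table used here (a read, two ALU operations, then a write on consecutive clocks) and the resource names are assumptions chosen for illustration, and the function names are hypothetical.

```python
def modulo_reservation_table(rt, T):
    """rt: one dict per clock of a single iteration, mapping resource
    name -> units used. Returns T rows, where row i is the sum of rt[t]
    over all t with t mod T == i."""
    rts = [{} for _ in range(T)]
    for t, row in enumerate(rt):
        for res, used in row.items():
            rts[t % T][res] = rts[t % T].get(res, 0) + used
    return rts


def fits(rts, units):
    """True if no row of the modulo table exceeds the available units."""
    return all(used <= units[res] for row in rts for res, used in row.items())


RT = [{"read": 1}, {"alu": 1}, {"alu": 1}, {"write": 1}]   # assumed iteration
UNITS = {"read": 1, "alu": 1, "write": 1}                  # assumed machine

rts = modulo_reservation_table(RT, 2)
print(rts)
print(fits(rts, UNITS), fits(modulo_reservation_table(RT, 1), UNITS))
```

With T = 2 the two ALU uses fall into different rows and the schedule fits; with T = 1 both land in the same row and exceed the single ALU.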

Is It Possible to Have an Iteration Start at Every Clock?
• Hint: No.
• Why?
    • An iteration injects 2 MEM and 2 ALU resource requirements.
    • If injected every clock, the machine cannot possibly satisfy all requests.
• Minimum delay = 2.
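The argument above amounts to a resource lower bound on the delay between iterations: for each resource, divide the uses per iteration by the units available and round up. The sketch assumes one MEM unit and one ALU unit, which is what the stated minimum of 2 implies; the function name `resource_mii` is hypothetical.

```python
from math import ceil


def resource_mii(uses, units):
    """Lower bound on the initiation interval imposed by resources."""
    return max(ceil(uses[r] / units[r]) for r in uses)


USES = {"MEM": 2, "ALU": 2}     # requirements of one iteration (from the slide)
UNITS = {"MEM": 1, "ALU": 1}    # assumed machine
print(resource_mii(USES, UNITS))
```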

Assigning Registers
• We don’t need an infinite number of registers.
• We can reuse registers for iterations that do not overlap in time.
• But we can’t just use the same old registers for every iteration.

Assigning Registers --- (2)
• The inner loop may have to involve more than one copy of the smallest repeating pattern.
    • Enough so that registers may be reused at each iteration of the expanded inner loop.
• Our example: 3 iterations coexist, so we need 3 sets of registers and 3 copies of the pattern.

Example: Assigning Registers
• Our original loop used registers:
    • r9 to hold a constant 4N.
    • r8 to count iterations and index the arrays.
    • r1 to copy a[i] into b[i].
• The expanded loop needs:
    • r9 holds 12N.
    • r6, r7, r8 to count iterations and index.
    • r1, r2, r3 to copy certain array elements.

The Loop Body
Each register handles every third element of the arrays.

    L:  ADD r8,r8,#12     nop               LD  r3,a(r6)
        BGE r8,r9,L’      ST  b(r7),r2      nop
        LD  r1,a(r8)      ADD r7,r7,#12     nop
        nop               BGE r7,r9,L’’     ST  b(r6),r3
        nop               LD  r2,a(r7)      ADD r6,r6,#12
        ST  b(r8),r1      nop               BLT r6,r9,L

• The BGE instructions are there to break the loop early.
• L’ and L’’ are places for appropriate codas.
(Figure labels: Iteration i, Iteration i + 1, Iteration i + 2 above the table; Iteration i + 3, Iteration i + 4 below it.)

Cyclic Data Dependence Graph
• We assumed that data at an iteration depends only on data computed at the same iteration.
    • Not even true for our example.
        • r8 computed from its previous iteration.
        • But it doesn’t matter in this example.
• Fixup: edge labels have two components: (iteration change, delay).

Cyclic Data Dependence Graph
• Label edges with <δ, d>
    • δ = iteration difference, d = delay
    • $\delta \times T + S(n_2) - S(n_1) \ge d$
• Nodes:
    (A) LD  r1,a(r8)
    (B) ST  b(r8),r1
    (C) ADD r8,r8,#4
    (D) BLT r8,r9,L
• Edges:
    A → B: <0,2>
    B → C: <0,1>
    C → A: <1,1>
    C → D: <0,1>
• (C) must wait at least one clock after the (B) from the same iteration.
• (A) must wait at least one clock after the (C) from the previous iteration.

Matrix of Delays
• Let T be the delay between the start times of one iteration and the next.
• Replace edge label <i,j> by delay j-iT.
• Compute, for each pair of nodes n and m, the total delay along the longest acyclic path from n to m.
• Gives upper and lower bounds relating the times to schedule n and m.

Example: Delay Matrix

Edges (row = from, column = to, . = no edge):

         A      B      C      D
    A    .      2      .      .
    B    .      .      1      .
    C    1-T    .      .      1
    D    .      .      .      .

Acyclic transitive closure (. = no path):

         A      B      C      D
    A    .      2      3      4
    B    2-T    .      1      2
    C    1-T    3-T    .      1
    D    .      .      .      .

• S(B) ≥ S(A) + 2
• S(A) ≥ S(B) + 2 - T
• Together: S(B) - 2 ≥ S(A) ≥ S(B) + 2 - T
Note: Implies T ≥ 4 (because only one register is used for loop-counting). If T=4, then A (LD) must be 2 clocks before B (ST). If T=5, A can be 2-3 clocks before B.
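The bound T ≥ 4 can also be obtained from the cycles of the dependence graph: around any cycle, T times the total iteration difference must be at least the total delay. The sketch below enumerates simple cycles of the four-node example and takes the largest such bound. It is an illustration only, the function name `recurrence_mii` is hypothetical, and it assumes every cycle crosses at least one iteration boundary.

```python
from math import ceil

# (src, dst): (iteration difference, delay), from the example graph.
EDGES = {
    ("A", "B"): (0, 2),
    ("B", "C"): (0, 1),
    ("C", "A"): (1, 1),
    ("C", "D"): (0, 1),
}


def recurrence_mii(edges):
    """Largest ceil(total delay / total iteration difference) over all
    simple cycles; 0 if the graph has no cycle."""
    nodes = sorted({n for edge in edges for n in edge})
    best = 0

    def walk(start, node, visited, delta, delay):
        nonlocal best
        for (s, d), (it, dl) in edges.items():
            if s != node:
                continue
            if d == start:                       # closed a cycle
                if delta + it > 0:
                    best = max(best, ceil((delay + dl) / (delta + it)))
            elif d not in visited and d > start:  # visit each cycle once,
                walk(start, d, visited | {d},     # from its smallest node
                     delta + it, delay + dl)

    for n in nodes:
        walk(n, n, {n}, 0, 0)
    return best


print(recurrence_mii(EDGES))
```

The only cycle is A → B → C → A with total delay 2 + 1 + 1 = 4 and iteration difference 1, which reproduces the note on the slide.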

Iterative Modulo Scheduling
• Compute the lower bounds (MII) on the delay between the start times of one iteration and the next (initiation interval, aka II)
    • due to resources
    • due to recurrences
• Try to find a schedule for II = MII
• If no schedule can be found, try a larger II.

The Algorithm
• Choose an initiation interval, ii
    • Compute lower bounds on ii
    • Shorter ii means faster overall execution
• Generate a loop body that takes ii cycles
    • Try to schedule into ii cycles, using modulo scheduler
    • If it fails, bump ii by one and try again
• Generate the needed prologue & epilogue code
    • For prologue, work backward from upward exposed uses in the scheduled loop body
    • For epilogue, work forward from downward exposed definitions in the scheduled loop body
M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines", In Proceedings of the ACM SIGPLAN 88 Conference on Programming Language Design and Implementation (PLDI 88), July 1988, pages 318-328. Also published as ACM SIGPLAN Notices 23(7).
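The "try ii, bump on failure" loop can be sketched in a few lines. This is a deliberately simplified illustration and not the algorithm from the cited paper: it places operations greedily in the given order with no backtracking or eviction, and it ignores prologue and epilogue generation and register assignment. The example reuses the copy-loop graph from the Cyclic Data Dependence Graph slide; the resource names, unit counts, and the function name `modulo_schedule` are assumptions.

```python
def modulo_schedule(ops, edges, units, mii, max_ii=64):
    """ops: {name: resource}, listed in a topological order of the
    same-iteration edges; edges: {(src, dst): (delta, delay)}.
    Returns (ii, {name: start clock}) or None if no ii <= max_ii works."""
    for ii in range(mii, max_ii + 1):
        start, ok = {}, True
        mrt = [{} for _ in range(ii)]          # modulo reservation table
        for n, res in ops.items():
            lo, hi = 0, float("inf")
            for (s, d), (delta, delay) in edges.items():
                if s == d == n and delta * ii < delay:
                    hi = -1                    # self-recurrence needs larger ii
                elif d == n and s in start:    # scheduled predecessor
                    lo = max(lo, start[s] + delay - delta * ii)
                elif s == n and d in start:    # scheduled successor
                    hi = min(hi, start[d] - delay + delta * ii)
            slot = next((t for t in range(lo, lo + ii)
                         if t <= hi and mrt[t % ii].get(res, 0) < units[res]),
                        None)
            if slot is None:                   # give up on this ii
                ok = False
                break
            start[n] = slot
            mrt[slot % ii][res] = mrt[slot % ii].get(res, 0) + 1
        if ok:
            return ii, start
    return None


OPS = {"A": "mem", "B": "mem", "C": "alu", "D": "br"}   # LD, ST, ADD, BLT
EDGES = {("A", "B"): (0, 2), ("B", "C"): (0, 1),
         ("C", "A"): (1, 1), ("C", "D"): (0, 1)}
UNITS = {"mem": 1, "alu": 1, "br": 1}                   # assumed machine
print(modulo_schedule(OPS, EDGES, UNITS, mii=2))
```

Starting from the resource bound of 2 (two memory operations on one memory unit), ii = 2 and ii = 3 fail because (C) cannot satisfy the loop-carried edge back to (A), and the search settles at ii = 4, in line with the recurrence bound.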

A simple sum reduction loop
Source code:

    c = 0
    for i = 1 to n
      c = c + a[i]

LLIR code:

          rc ← 0
          r@a ← @a
          r1 ← n x 4
          rub ← r1 + r@a
          if r@a > rub goto Exit
    Loop: ra ← MEM(r@a)
          rc ← rc + ra
          r@a ← r@a + 4
          if r@a ≤ rub goto Loop
    Exit: c ← rc

OOO Execution
The loop’s steady state behavior

OOO Execution
The loop’s prologue
The loop’s epilogue

Implementing the Concept
General schema for the loop: Prologue, Kernel, Epilogue
The actual schedule must respect both the data dependences and the operation latencies.
(Figure: schema with Prologue, Kernel, and Epilogue regions; numeric labels 1 and 3 appear in the figure.)

The Example
Focus on the loop body

The Code:

    1         rc ← 0
    2         r@a ← @a
    3         r1 ← n x 4
    4         rub ← r1 + r@a
    5         if r@a > rub goto Exit
    6   Loop: ra ← MEM(r@a)
    7         rc ← rc + ra
    8         r@a ← r@a + 4
    9         if r@a ≤ rub goto Loop
        Exit: c ← rc

Its Dependence Graph
(Figure: dependence graph over ops 1 to 9.)
• Op 6 is not involved in a cycle

The Example
• Focus on the loop body
• ii = 2
(Figure: dependence graph of ops 6, 7, 8, 9 beside the template for the modulo schedule.)

The Example
• Focus on the loop body
• Schedule 6 on the fetch unit
(Figure: dependence graph of ops 6, 7, 8, 9 and the modulo schedule template with a simulated clock; numeric labels 3 and 1 appear in the figure.)
