Lecture 11: Advanced Static ILP

Lecture 11: Advanced Static ILP • Topics: loop unrolling, software pipelining (Section 4.4)

Loop Dependences • If a loop only has dependences within an iteration, the loop • is considered parallel  multiple iterations can be executed • together so long as order within an iteration is preserved • If a loop has dependeces across iterations, it is not parallel • and these dependeces are referred to as “loop-carried” • Not all loop-carried dependences imply lack of parallelism • Parallel loops are especially desireable in a multiprocessor • system

Examples For (i=1000; i>0; i=i-1) x[i] = x[i] + s; No dependences For (i=1; i<=100; i=i+1) { A[i+1] = A[i] + C[i]; S1 B[i+1] = B[i] + A[i+1]; S2 } S2 depends on S1 in the same iteration S1 depends on S1 from prev iteration S2 depends on S2 from prev iteration For (i=1; i<=100; i=i+1) { A[i] = A[i] + B[i]; S1 B[i+1] = C[i] + D[i]; S2 } S1 depends on S2 from prev iteration S1 depends on S1 from 3 prev iterations Referred to as a recursion Dependence distance 3; limited parallelism For (i=1000; i>0; i=i-1) x[i] = x[i-3] + s; S1

Finding Dependences – the GCD Test • Do A[ai + b] and A[ci + d] refer to the same element? • Restrict ourselves to affine array indices (expressible as • ai + b, where i is the loop index, a and b are constants) – • example of non-affine index: x[y[i]] • For a dependence to exist, must have two indices j and k • that are within the loop bounds, such that • aj + b = ck + d; • aj – ck = d – b; • G = GCD(a,c); • (aj/G - ck/G) = (d-b)/G; • If (d-b)/G is not an integer, the initial equality can not be true

Static vs. Dynamic ILP Loop: L.D F0, 0(R1) ; F0 = array element ADD.D F4, F0, F2 ; add scalar S.D F4, 0(R1) ; store result DADDUI R1, R1,# -8 ; decrement address pointer BNE R1, R2, Loop ; branch if R1 != R2 Loop: L.D F0, 0(R1) L.D F6, -8(R1) L.D F10,-16(R1) L.D F14, -24(R1) ADD.D F4, F0, F2 ADD.D F8, F6, F2 ADD.D F12, F10, F2 ADD.D F16, F14, F2 S.D F4, 0(R1) S.D F8, -8(R1) DADDUI R1, R1, # -32 S.D F12, 16(R1) BNE R1,R2, Loop S.D F16, 8(R1) L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) ….. Statically unrolled loop Large window dynamic ooo proc

Dynamic ILP L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R3, R1,# -8 BNE R3, R2, Loop L.D F6, 0(R3) ADD.D F8, F6, F2 S.D F8, 0(R3) DADDUI R4, R3,# -8 BNE R4, R2, Loop L.D F10, 0(R4) ADD.D F12, F10, F2 S.D F12, 0(R4) DADDUI R5, R4,# -8 BNE R5, R2, Loop L.D F14, 0(R5) ADD.D F16, F14, F2 S.D F16, 0(R5) DADDUI R6, R5,# -8 BNE R6, R2, Loop Renamed

Dynamic ILP L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R3, R1,# -8 BNE R3, R2, Loop L.D F6, 0(R3) ADD.D F8, F6, F2 S.D F8, 0(R3) DADDUI R4, R3,# -8 BNE R4, R2, Loop L.D F10, 0(R4) ADD.D F12, F10, F2 S.D F12, 0(R4) DADDUI R5, R4,# -8 BNE R5, R2, Loop L.D F14, 0(R5) ADD.D F16, F14, F2 S.D F16, 0(R5) DADDUI R6, R5,# -8 BNE R6, R2, Loop 1 3 6 1 3 2 4 7 2 4 3 5 8 3 5 4 6 9 4 6 Cycle of Issue Renamed

Loop Pipeline L.D ADD.D S.D DADDUI BNE L.D ADD.D S.D DADDUI BNE L.D ADD.D S.D DADDUI BNE L.D ADD.D S.D DADDUI BNE … L.D ADD.D DADDUI BNE … L.D ADD.D DADDUI BNE

Statically Unrolled Loop Loop: L.D F0, 0(R1) L.D F6, -8(R1) L.D F10,-16(R1) L.D F14, -24(R1) L.D F18, -32(R1) ADD.D F4, F0, F2 L.D F22, -40(R1) ADD.D F8, F6, F2 L.D F26, -48(R1) ADD.D F12, F10, F2 L.D F30, -56(R1) ADD.D F16, F14, F2 L.D F34, -64(R1) ADD.D F20, F18, F2 S.D F4, 0(R1) L.D F38, -72(R1) ADD.D F24, F22, F2 S.D F8, -8(R1) … … S.D F12, 16(R1) S.D F16, 8(R1) DADDUI R1, R1, # -32 S.D BNE R1,R2, Loop S.D

Static Vs. Dynamic New iterations completed 1 Dynamic ILP Cycles New iterations completed 1 Static ILP Cycles • What if I doubled the number of resources in each processor? • What if I unrolled the loop and executed it on a dynamic ILP processor?

Static vs. Dynamic • Dynamic: because of the loop index, at most one iteration • can start every cycle – even fewer if there are resource • constraints – in other words, we have a pipeline that has • a throughput of one iteration per cycle! • Static: by eliminating loop index, each iteration is • independent  as many loops can start in a cycle as there • are resources – however, after a while, we don’t start any • more iterations – thus, loop unrolling provides a brief steady • state, where an iteration starts/finishes every cycle and the • rest is start-up/wind-down for each unrolled loop

Software Pipeline?! L.D ADD.D S.D DADDUI BNE L.D ADD.D S.D DADDUI BNE L.D ADD.D S.D DADDUI BNE L.D ADD.D S.D DADDUI BNE … L.D ADD.D DADDUI BNE … L.D ADD.D DADDUI BNE

Software Pipelining Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop Loop: S.D F4, 16(R1) ADD.D F4, F0, F2 L.D F0, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop • Advantages: achieves nearly the same effect as loop unrolling, but • without the code expansion – an unrolled loop may have inefficiencies • at the start and end of each iteration, while a sw-pipelined loop is • almost always in steady state – a sw-pipelined loop can also be unrolled • to reduce loop overhead • Disadvantages: does not reduce loop overhead, may require more • registers

Midterm Exam • Show up early! Time will be a constraint • Attempt all questions – grading will be lenient • Open books and notes

Title • Bullet

Lecture 11: Advanced Static ILP

Lecture 11: Advanced Static ILP

Presentation Transcript

Advanced Graphics Lecture Three

Introduction to C# 2.0

Lecture #37: Memory

Lecture 13: Advanced Examples

Static Memory

David Notkin  Autumn 2009  CSE303 Lecture 20

Advanced Vitreous State – The Physical Properties of Glass

Static and Global

PHYS16 – Lecture 26

Lecture 25: Advanced Command patterns

Review

Lecture 8 Advanced Pipeline

AGEC/FNR 406 LECTURE 11

Lecture 5: Static ILP Basics

Lecture 10: Processes II-B

Static ILP Static (Compiler Based) Scheduling

Lecture 11 Memory Design

Lecture 4: Advanced Pipelines

Lecture 14

Lecture 11:

CSE 403 Lecture 11

Advanced Corporate Finance FINA 7330 Capital Structure Issues and Financing Lecture 07 and 08