Loop Scheduling and Software Pipelining
This presentation explores essential techniques in loop scheduling and software pipelining crucial for effective code generation. It covers the problem formulation of modulo scheduling, solution methods, and the significance of data dependence graphs in understanding loop behaviors. Students will gain the ability to identify, formulate, and solve challenges relating to scheduling problems while employing modern compiler tools. Additionally, insights on optimization strategies, including inner-loop scheduling and instruction scheduling, are presented to enhance students’ problem-solving capabilities in contemporary coding practices.
Presentation Transcript
Loop Scheduling and Software Pipelining \course\cpeg421-08s\Topic-7.ppt
Reading List • Slides: Topic 7 and 7a • Other papers as assigned in class or homework
ABET Outcome • Ability to apply knowledge of basic code generation techniques, e.g. loop scheduling and software pipelining, to solve code generation problems. • Ability to identify, formulate, and solve loop scheduling problems using software pipelining techniques. • Ability to analyze the basic algorithms behind these techniques and conduct experiments to show their effectiveness. • Ability to use a modern compiler development platform and tools to practice the above. • Knowledge of contemporary issues on this topic.
Outline • Brief overview • Problem formulation of the modulo scheduling problem • Solution methods • Summary
General Compiler Framework
Stages shown in the diagram: Source; Inter-Procedural Optimization (IPA); Loop Nest Optimization (LNO); Global Optimization (OPT) (these form the ME); BE/CG with innermost loop scheduling, global inst scheduling, reg alloc, and local inst scheduling, driven by Arch Models; Executable.
Requirements: • Good IPO • Good LNO • Good global optimization • Good integration of IPO/LNO/OPT • Smooth information passing between FE and CG • Complete and flexible support of inner-loop scheduling (SWP), instruction scheduling and register allocation
Questions • How to formulate the loop scheduling problem? • How to model it? • How to solve it?
Questions (cont’d) • Instruction scheduling for code without loops: a review • Dependence graphs may become cyclic, so the “critical path length” is less obvious! • Is the problem becoming harder? • What new insights are required to formulate and solve it?
Challenges of Loop Scheduling • What is the right strategy? [Figure: a DDG with cycles]
Observations • Execution of “good loops” tends to be “regular” and “repetitive”: a “pattern” may appear • This gives the “cyclic scheduling” problem a new twist! • How to efficiently derive a “pattern”?
Problem Formulation (I) Given a weighted dependence graph, derive a schedule which is “time-optimal” under a machine model M. Def: A schedule S of a loop L is time-optimal if, among all “legal” schedules of L, no other schedule is faster than S. Note: There may be more than one time-optimal schedule.
A Short Tour on Data Dependence Graphs for Loops
Basic Concept and Motivation • Two accesses are data dependent when: they refer to the same memory location; an execution path exists between them; and at least one of them is a write • Three types of data dependence • Dependence graphs • Things are not simple when dealing with loops
Types of Data Dependence • Flow dependence: X = ... followed by ... = X • Anti-dependence (edge marked -1): ... = X followed by X = ... • Output dependence (edge marked 0): X = ... followed by X = ...
Data Dependence
Example 1:
S1: A = 0
S2: B = A
S3: C = A + D
S4: D = 2
[Figure: dependence graph over S1, S2, S3, S4; an edge from Sx to Sy means Sy depends on Sx]
Data Dependence (cont’d)
Example 2:
S1: A = 0
S2: B = A
S3: A = B + 1
S4: C = A
[Figure: dependence graph over S1, S2, S3, S4. The edge S1 to S3 marked 0 is an output dependence; the edge S2 to S3 marked -1 is an anti-dependence]
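The two examples can be checked mechanically. The following is a minimal sketch, assuming each statement is described by hand as a label, the variable it writes, and the set of variables it reads; the function name `dependences` and this tuple encoding are illustrative choices, not part of the slides. An edge is dropped when an intervening statement rewrites the variable.

```python
def dependences(stmts):
    """stmts: list of (label, written_var, set_of_read_vars) in program order.
    Returns (source, sink, kind) triples for straight-line code."""
    deps = []
    for i, (si, wi, ri) in enumerate(stmts):
        for j in range(i + 1, len(stmts)):
            sj, wj, rj = stmts[j]
            # Variables rewritten strictly between the two statements.
            between = {w for _, w, _ in stmts[i + 1:j]}
            if wi in rj and wi not in between:
                deps.append((si, sj, "flow"))     # write, then read
            if wj in ri and wj not in between:
                deps.append((si, sj, "anti"))     # read, then write
            if wi == wj and wi not in between:
                deps.append((si, sj, "output"))   # write, then write
    return deps


example1 = [("S1", "A", set()), ("S2", "B", {"A"}),
            ("S3", "C", {"A", "D"}), ("S4", "D", set())]
example2 = [("S1", "A", set()), ("S2", "B", {"A"}),
            ("S3", "A", {"B"}), ("S4", "C", {"A"})]

for name, prog in (("Example 1", example1), ("Example 2", example2)):
    print(name, dependences(prog))
```

For Example 2 the sketch reports the S1 to S3 output dependence and the S2 to S3 anti-dependence named on the slide, together with the flow edges.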
Should we consider input dependence? ... = X followed by ... = X. Is the reading of the same X important? Well, it may be (if we intend to group the 2 reads together for cache optimization)!
Subscript Variables • Extension of def-use chains to employ a more precise treatment of arrays, especially in iterative loops.
DO I = 1, N
  A(I + 1) = X(I + 1) + B(I)
  X(I) = A(I) * 5
ENDDO
Dependence Graph (cont’d) Applications: • register allocation • instruction scheduling • loop scheduling • vectorization • parallelization • memory hierarchy optimization
Data Dependence in Loops: An Example
Find the dependence relations due to the array X in the program below:
(1) for I = 2 to 9 do
(2)   X[I] = Y[I] + Z[I]
(3)   A[I] = X[I-1] + 1
(4) end for
Solution: To find the data dependence relations in a simple loop, we can unroll the loop and see which statement instances depend on which others:
I = 2: X[2] = Y[2] + Z[2]; A[2] = X[1] + 1
I = 3: X[3] = Y[3] + Z[3]; A[3] = X[2] + 1
I = 4: X[4] = Y[4] + Z[4]; A[4] = X[3] + 1
Data Dependence in Loops (cont’d) In our example, there is a loop-carried, lexically forward flow dependence relation from S2 to S3 with dependence distance = 1. [Figure: data dependence graph for the statements in the loop, with an edge S2 to S3 labeled (1)] Dependences in a loop are classified as: • Loop-carried vs. loop-independent • Lexically forward vs. lexically backward
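The unrolling argument can be replayed in a few lines. This is a small sketch, assuming the loop bounds 2 to 9 from the example; the helper name `flow_distances` is hypothetical. It records which iteration last wrote each element of X and, for every read of X[I-1], how many iterations earlier that write happened.

```python
def flow_distances(lo, hi):
    """Simulate the example loop and collect the distances of all flow
    dependences on X (S2 writes X[I], S3 reads X[I-1])."""
    last_write = {}      # element index of X -> iteration that wrote it
    distances = set()
    for i in range(lo, hi + 1):
        last_write[i] = i                 # S2: X[I] = Y[I] + Z[I]
        src = last_write.get(i - 1)       # S3: A[I] = X[I-1] + 1
        if src is not None:               # X[1] is never written in the loop
            distances.add(i - src)
    return distances


print(flow_distances(2, 9))
```

Every dependence found has distance 1, which matches the edge label on the slide.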
An Example
for i = 0 to N - 1 do
  a: a[i] = a[i - 1] + R[i];
  b: b[i] = a[i] + c[i - 1];
  c: c[i] = b[i] + 1;
end;
Assume each operation takes 1 cycle and there is only one addition unit!
Note: We use a token here to represent a flow dependence of distance = 1.
So, initiation interval II = 2.
[Figure: iterations plotted against time for operations a, b, c, with II marked]
Software Pipeline: Concept Software pipelining is a technique that reduces the execution time of important loops by interweaving operations from many iterations to optimize the use of resources. [Figure: timeline over cycles 0 to 16 showing overlapped ldf, fadds, stf, sub, cmp, and bg operations]
The Structure of the SWP Code
% prologue
a[0] = a[-1] + R[0];
% pattern
for i = 0 to N-2 do
  b[i] = a[i] + c[i - 1];
  a[i + 1] = a[i] + R[i + 1];
  c[i] = b[i] + 1;
end;
% epilogue
b[N - 1] = a[N - 1] + c[N - 2];
c[N - 1] = b[N - 1] + 1
[Figure: prologue, then the repeating pattern b(i), c(i), a(i+1), then epilogue]
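A quick way to gain confidence in the prologue/pattern/epilogue structure is to run it next to the original loop. The sketch below assumes N >= 1 and stores the arrays in dictionaries so that the index -1 can hold the initial values; the function names and the sample inputs are illustrative only.

```python
def original(R, a_init, c_init):
    """The loop as written before software pipelining."""
    n = len(R)
    a, b, c = {-1: a_init}, {}, {-1: c_init}
    for i in range(n):
        a[i] = a[i - 1] + R[i]
        b[i] = a[i] + c[i - 1]
        c[i] = b[i] + 1
    return a, b, c


def pipelined(R, a_init, c_init):
    """Prologue, pattern, and epilogue from the slide (assumes len(R) >= 1)."""
    n = len(R)
    a, b, c = {-1: a_init}, {}, {-1: c_init}
    a[0] = a[-1] + R[0]                    # prologue
    for i in range(n - 1):                 # pattern: i = 0 .. N-2
        b[i] = a[i] + c[i - 1]
        a[i + 1] = a[i] + R[i + 1]
        c[i] = b[i] + 1
    b[n - 1] = a[n - 1] + c[n - 2]         # epilogue
    c[n - 1] = b[n - 1] + 1
    return a, b, c


R = [3, 1, 4, 1, 5]
print(original(R, 10, 20) == pipelined(R, 10, 20))
```

Both versions perform the same additions on the same operands; only the grouping into loop bodies changes.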
Software Pipeline (cont’d) • What limits the speed of a loop? • Data dependencies: recurrence initiation interval (rec_mii) • Processor resources: resource initiation interval (res_mii) [Figure: timeline over cycles 0 to 16 showing overlapped ldf, fadds, stf, sub, cmp, and bg operations, with the initiation interval marked]
Previous Approaches • Approach I (Operational): “Emulate” the loop execution under the machine model until a “pattern” eventually occurs [AikenNic88, EbciogluNic89, GaoEtAl91] • Approach II (Periodic scheduling): Formulate the scheduling problem as a periodic scheduling problem and find an optimal solution [Lam88, RauEtAl81, GovindAltmanGao94]
Periodic Schedule (Modulo Scheduling) The time (cycle) at which the i-th instance of operation v is scheduled: t(i, v) = T * i + A[v], where T = II, so t(i + 1, v) - t(i, v) = T * (i + 1) - T * i = T. For our example: t(i, v) = 2i + A[v], where A[a] = 1, A[b] = 0, A[c] = 1. Question: Is this an optimal schedule?
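A periodic schedule is legal for the dependences when every edge (src, dst) with latency w and distance d satisfies T*d + A[dst] >= A[src] + w. The sketch below checks only that condition and ignores resource limits. Two things are assumptions of this sketch rather than statements from the slides: the edge list (read off the example loop, with every latency equal to 1 cycle) and the offsets. The slide's offsets appear to follow the kernel, in which the a operation issued next to b[i] and c[i] is a[i+1]; indexing every operation by its own iteration instead gives A[a] = -1, A[b] = 0, A[c] = 1, which is the form used here.

```python
def is_legal(T, A, edges):
    """edges: (src, dst, latency, distance).
    Instance i + distance of dst must start no earlier than
    instance i of src finishes: T*distance + A[dst] >= A[src] + latency."""
    return all(T * dist + A[dst] >= A[src] + lat
               for src, dst, lat, dist in edges)


# Dependences of the example loop (assumed 1-cycle latency everywhere).
edges = [
    ("a", "a", 1, 1),   # a[i] uses a[i-1]
    ("a", "b", 1, 0),   # b[i] uses a[i]
    ("b", "c", 1, 0),   # c[i] uses b[i]
    ("c", "b", 1, 1),   # b[i] uses c[i-1]
]
offsets = {"a": -1, "b": 0, "c": 1}   # per-iteration indexing (assumption)

print(is_legal(2, offsets, edges))    # period T = 2
print(is_legal(1, offsets, edges))    # period T = 1
```

With T = 2 every constraint holds. With T = 1 the cycle b, c, b cannot be satisfied by any choice of offsets, because it needs 2 cycles of work per unit of distance.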
Periodic Schedule (cont’d) Yes, the schedule t(i, v) = 2i + A[v], with II = 2, is time-optimal!
Restate the Problem Given the DDG of a loop L, how do we determine the fastest computation rate of L, or equivalently its minimum initiation interval (MII)? Hint: Consider the “critical cycles” as well as the critical resource usage in L.
Recurrence MII (RecMII) • An Example
An Example (Revisit): RecMII
for i = 0 to N - 1 do
  a: a[i] = a[i - 1] + R[i];
  b: b[i] = a[i] + c[i - 1];
  c: c[i] = b[i] + 1;
end;
Assume each operation takes 1 cycle and there is only one addition unit!
So, RecMII = 2.
[Figure: iterations plotted against time for operations a, b, c, with II marked]
Maximum Computation Rate Theorem: The maximum computation rate of a loop is bounded by the ratio $r_{opt} = \min_{C} \{ D_C / W_C \}$, where C is a dependence cycle, $D_C = \sum_{i \in C} d_i$ is the total dependence distance along C, and $W_C = \sum_{i \in C} w_i$ is the total execution time of C ($d_i$ is the dependence distance along edge i in C; $w_i$ is the edge weight along edge i in C). Hint: now one must think about deadlines.
RecMII (cont’d) The optimal period is MII = Topt = 1/ropt. Def: A cycle is critical if the period of the cycle is equal to MII (strictly, we should write RecMII here). Note: a loop may have multiple critical cycles!
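Since Topt = 1/ropt, RecMII is the largest ratio of total latency to total distance over all dependence cycles, rounded up to a whole cycle. The following is a brute-force sketch for small graphs, assuming edges are given as (src, dst, latency, distance) tuples; it enumerates simple cycles by depth-first search, which is exponential in general and is used here only for clarity. The function name `rec_mii` and the edge list for the example loop are assumptions of the sketch.

```python
from fractions import Fraction
from math import ceil


def rec_mii(nodes, edges):
    """Return ceil(max over simple cycles C of W_C / D_C)."""
    order = {v: k for k, v in enumerate(nodes)}
    succ = {v: [] for v in nodes}
    for src, dst, lat, dist in edges:
        succ[src].append((dst, lat, dist))
    worst = Fraction(0)

    def walk(start, v, w_sum, d_sum, on_path):
        nonlocal worst
        for nxt, lat, dist in succ[v]:
            if nxt == start:                       # closed a cycle
                if d_sum + dist == 0:
                    raise ValueError("cycle with zero total distance")
                worst = max(worst, Fraction(w_sum + lat, d_sum + dist))
            elif order[nxt] > order[start] and nxt not in on_path:
                # Only visit nodes ranked after start, so each cycle
                # is found once, from its lowest-ranked node.
                walk(start, nxt, w_sum + lat, d_sum + dist, on_path | {nxt})

    for s in nodes:
        walk(s, s, 0, 0, {s})
    return ceil(worst)


edges = [("a", "a", 1, 1), ("a", "b", 1, 0), ("b", "c", 1, 0), ("c", "b", 1, 1)]
print(rec_mii(["a", "b", "c"], edges))
```

For the example loop the cycle through b and c has total latency 2 and total distance 1, so it is the critical cycle and the result is 2.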
How About Machine Resource and Register Constraints? • Number and types of FUs, etc. (ResMII) • Number of registers
Software Pipelining • Review of previous work [RauFisher93] [Rau94] • Minimum initiation interval (MII): MII = max {RecMII, ResMII}, where RecMII is determined by “critical” recurrence cycles in the DDG and ResMII is determined by resource constraints (number and types of FUs)
The Resource Constraint MII (ResMII) • Can be calculated by totaling, for each resource, the usage requirement imposed by one iteration of the loop • Can be done by bin-packing a resource reservation table (expensive) • Deriving a lower bound is usually enough as a starting point for the search process • Is the cost higher with complex hardware pipelines? Consider a simple example of n nodes and 1 FU: what is ResMII?
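The per-resource totaling can be sketched as follows, under a simplifying assumption that is not stated on the slide: every operation occupies one unit of a single resource class for exactly one cycle (fully pipelined units). The function name `res_mii` and the resource names are hypothetical. Under that assumption the lower bound is the largest value of ceil(uses / units) over all resource classes.

```python
from collections import Counter
from math import ceil


def res_mii(op_resources, units):
    """op_resources: the resource class used by each operation of one iteration.
    units: resource class -> number of identical units available.
    Assumes each operation holds its unit for a single cycle."""
    usage = Counter(op_resources)
    return max((ceil(n / units[r]) for r, n in usage.items()), default=0)


print(res_mii(["add", "add", "add"], {"add": 1}))                   # 3 ops, 1 FU
print(res_mii(["add", "add", "add", "mem"], {"add": 2, "mem": 1}))  # mixed
```

With n single-cycle operations on 1 FU this bound is n, since each iteration needs n distinct slots on that unit.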
Reservation Table: An Example [Figure: (a) a pipeline M; (b) the reservation table of M]
A Quiz • What is the RT of a fully pipelined adder with 3 stages? • How about an unpipelined adder? • How to compute ResMII for each case?
Modulo Reservation Table • The modulo reservation table has length = II only • Each entry in the table records a sequence of reservations, corresponding to the same slot repeated every II cycles • Hint: think about your weekly calendar
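The weekly-calendar idea maps onto a table indexed by cycle mod II. Below is a minimal sketch, assuming one-cycle reservations and a fixed number of identical units per resource class; the class name and its methods are illustrative, not an existing library API.

```python
class ModuloReservationTable:
    """A table with II rows; cycle t uses row t mod II."""

    def __init__(self, ii, units):
        self.ii = ii
        self.units = units                        # resource -> unit count
        self.rows = [dict() for _ in range(ii)]   # row -> resource -> ops

    def is_free(self, cycle, resource):
        booked = self.rows[cycle % self.ii].get(resource, [])
        return len(booked) < self.units[resource]

    def reserve(self, cycle, resource, op):
        if not self.is_free(cycle, resource):
            raise ValueError(f"{resource} is busy in slot {cycle % self.ii}")
        self.rows[cycle % self.ii].setdefault(resource, []).append(op)


mrt = ModuloReservationTable(ii=2, units={"add": 1})
mrt.reserve(0, "add", "b")
print(mrt.is_free(2, "add"))   # cycle 2 falls in the same slot as cycle 0
print(mrt.is_free(1, "add"))
```

Reserving cycle 0 blocks cycles 2, 4, 6, and so on, just as a weekly meeting blocks the same hour every week.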
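Modulo Scheduling (flowchart) • Start with II = MII and the list L of operations • Choose the next operation x and remove it from L • Try to place x in a time slot i in II • If the placement failed: set II = II + 1 and try again • If it succeeded: is L = <>? If no, choose the next operation • If yes: a legal schedule is found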
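The flowchart can be sketched as a loop over candidate II values. The version below is a deliberately simple greedy placement with no backtracking, so it may need a larger II than a smarter heuristic would; it is not the algorithm of any particular compiler. Assumptions: operations are given in a fixed priority order, each uses one unit of one resource class for one cycle, and edges are (src, dst, latency, distance) tuples. All names are hypothetical.

```python
def try_schedule(ops, edges, units, ii):
    """Place each (name, resource) in order; return name -> cycle or None."""
    # A self-recurrence must fit within ii * distance cycles.
    if any(s == d and lat > ii * dist for s, d, lat, dist in edges):
        return None
    time = {}
    used = [dict() for _ in range(ii)]            # modulo reservation table
    for name, res in ops:
        # Bounds imposed by operations that are already placed.
        lower = [time[s] + lat - ii * dist
                 for s, d, lat, dist in edges if d == name and s in time]
        upper = [time[d] - lat + ii * dist
                 for s, d, lat, dist in edges if s == name and d in time]
        latest = min(upper, default=None)         # deadline, if any
        if lower:
            earliest = max(lower)
        else:
            earliest = 0 if latest is None else latest - ii + 1
        for t in range(earliest, earliest + ii):  # ii consecutive slots
            if latest is not None and t > latest:
                return None                       # deadline missed
            if used[t % ii].get(res, 0) < units[res]:
                used[t % ii][res] = used[t % ii].get(res, 0) + 1
                time[name] = t
                break
        else:
            return None                           # every slot was busy
    return time


def modulo_schedule(ops, edges, units, mii, max_ii=64):
    """Try II = MII, MII + 1, ... until a legal schedule is found."""
    for ii in range(mii, max_ii + 1):
        sched = try_schedule(ops, edges, units, ii)
        if sched is not None:
            return ii, sched
    return None


ops = [("a", "add"), ("b", "add"), ("c", "add")]
edges = [("a", "a", 1, 1), ("a", "b", 1, 0), ("b", "c", 1, 0), ("c", "b", 1, 1)]
print(modulo_schedule(ops, edges, {"add": 1}, mii=2))
print(modulo_schedule(ops, edges, {"add": 2}, mii=2))
```

Starting from II = 2 with a single adder, the third operation finds its only admissible slot occupied, so the sketch takes the "II = II + 1" branch and succeeds at II = 3; with two adders it succeeds at II = 2.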
Heuristic Method for Modulo Scheduling Why might a simple variant of list scheduling not work? Hint: consider the deadline constraints of operations in a cycle.
Counter Example I: List Scheduling May Fail!
[Figure (a): a DDG over nodes A, B, C, D with edge weights 4, 2, 2, 4 and one edge of distance [1]]
(b) MII = RecMII = 4. Note: if simple list scheduling is used, B cannot be scheduled due to the deadline set by scheduling C: deadlock!
(c) MII = RecMII = 4. Note: we cannot fire C as early as possible!
A Taxonomy of Software Pipelining
Software Pipelining
• Operational Approach
  - Heuristic (Aiken88, AikenNic88, Ebcioglu89, etc.)
  - Formal Model (GaoWonNin91) PLDI'91
• Periodic Scheduling (Modulo Scheduling)
  - In-Exact Method (Heuristic) (RauGla81, Lam88, RauEtAl92, Huff93, DehnertTow93, Rau94, WanEis93, StouchininEtAl00)
  - Exact Method
    - Basic Formulation (DongenGao92); Register Optimal (Ninggao91, NingGao93, Ning93); venues shown: CONPAR'92, POPL'93
    - Resource Constrained, ILP based (GovindAITGao94) Micro27
    - Resource & Register: ILP based, Exhaustive Search, "Showdown"; citations shown: GovindAITGao95, Altman95, RuttenbergGao StouchininWoody96, EichenbergerDav95; venues shown: PLDI'95, EuroPar'96, PLDI'96
    - FSA Based Method (model hardware pipeline with sharing and hazards): Co-Scheduling formulation (Altman95) (GovindAltmanGao'96); FSA Construction Method (GovindAltmanGao98); FSA Heuristic/optimization (ZhangGovindRyanGao99); Theory of Co-Scheduling (GovindAltmanGao00)
Advanced Topics • Consider register constraints • Realistic pipeline architecture constraints (with structural hazards) • Loop body with conditionals • Multi-Dimensional Loops