

Coordinated Coarse-Grain and Fine-Grain Optimizations for High-Level Synthesis. Sumit Gupta. Center for Embedded Computer Systems, University of California, Irvine. http://www.cecs.uci.edu/~spark. Supported by Semiconductor Research Corporation.




Presentation Transcript


  1. Coordinated Coarse-Grain and Fine-Grain Optimizations for High-Level Synthesis Sumit Gupta Center for Embedded Computer Systems University of California, Irvine http://www.cecs.uci.edu/~spark Supported by Semiconductor Research Corporation

  2. High-Level Synthesis • Transform behavioral descriptions to RTL/gate level: from C to CDFG to Architecture • Example C input: x = a + b; c = a < b; if (c) then d = e - f; else g = h + i; j = d x g; l = e + x; (Figure: target architecture with memory, control, ALU, and data path; the CDFG of the example, with an If node on condition c selecting between d = e - f and g = h + i, followed by j = d x g and l = e + x.)

  3. High-Level Synthesis • Transform behavioral descriptions to RTL/gate level: from C to CDFG to Architecture • Problem #1: Poor quality of HLS results beyond straight-line behavioral descriptions • Problem #2: Poor/no controllability of the HLS results (Figure: same architecture and CDFG example as Slide 2.)

  4. Outline • Motivation and Background • Our Approach to Parallelizing High-Level Synthesis • Code Transformation Techniques for PHLS • Parallelizing Transformations • Dynamic Transformations • The PHLS Framework and Experimental Results • Multimedia and Image Processing Applications • Case Study: Intel Instruction Length Decoder • Contributions • Future Work

  5. High-level Synthesis • Well-researched area since the early 1980s • Recently, the level of design entry has moved up from schematic entry to coding in HDLs (VHDL, Verilog, C, C++) • A large number of optimizations have been proposed over the years • Many HLS optimizations are either at the operation level (e.g., algebraic transformations on DSP codes) or at the logic level (e.g., Don't Care based control optimizations) • In contrast, compiler transformations operate at both the operation level (fine-grain) and the source level (coarse-grain) • Few/no compiler optimizations have been applied to HLS • Quality of synthesis results is severely affected by complex control flow, characterized by nested ifs and loops

  6. Recent HLS Optimizations • Mostly related to code scheduling in the presence of conditionals • Condition Vector List Scheduling [Wakabayashi 89] • Symbolic Scheduling [Radivojevic 96] • WaveSched Scheduler [Lakshminarayana 98] • Basic Block Control Graph Scheduling [Santos 99] • Limitations • Arbitrary nesting of conditionals and loops not handled or handled poorly • ad hoc optimizations: optimizations applied in isolation • Limited/no analysis of logic and control costs • Not clear if an optimization has positive impact beyond scheduling

  7. Focus of this Work • Target Applications: • Descriptions with complex and nested conditionals and loops: • Multimedia & Image processing applications with a mix of data and control operations • Computationally expensive Microprocessor blocks: resource rich, tightly packed into a few cycles • Objectives: • Improve quality of overall HLS results (circuit delay and area) • Find balance of parallelism and hardware costs under resource constraints • Improve controllability of the HLS solutions • Analyze control and area costs of transformations • Approach • Explore and analyze compiler and parallelizing compiler transformations that are useful for high-level synthesis • Develop scheduling and control generation algorithms that use these/new transformations to solve Problems 1 and 2 • Build an experimental PHLS framework to enable high level design exploration for HLS.

  8. Focus of this Work (same content as Slide 7, now annotated with the two problems being addressed: Problem #1: Poor quality of HLS results beyond straight-line behavioral descriptions; Problem #2: Poor/no controllability of the HLS results.)

  9. Important Parallelizing Compiler Techniques • Transformations to exploit instruction-level parallelism • Speculative Code Motions • Attempt to move operations out of, and sometimes even duplicate into, conditional blocks • Percolation Scheduling and Trailblazing • Can produce an optimal schedule given enough resources • Loop Transformations • Loop Invariant Code Motion: reduce the number of operations executed • Loop Unrolling and Pipelining: expose inter-iteration parallelism • Induction Variable Analysis: operation strength reduction • Loop Fusion, Interchange: increase scope of transformations • Partial evaluation and removing redundant and useless operations • CSE, Copy Propagation, Constant Folding, Dead Code Elimination
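
A small hand-worked C sketch of the last bullet (the function and variable names are invented for illustration): copy propagation, constant folding, and dead code elimination together collapse a chain of trivial operations before any scheduling is attempted.

```c
/* Before: as a designer might write it */
int before(int a) {
    int t = 4 * 2;          /* constant expression, foldable to 8    */
    int copy = a;           /* simple copy of a                      */
    int unused = a - t;     /* result never used below: dead code    */
    return copy + t;        /* uses the copy and the folded constant */
}

/* After copy propagation, constant folding, and dead code elimination */
int after(int a) {
    return a + 8;           /* copy replaced by a, t folded, dead op removed */
}
```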

  10. Useful, but important differences with HLS • Different cost models between programmable processors and synthesized hardware • For instance, routing resources and control logic costs can be significant compared to functional unit costs • Transformations have implications on hardware • Non-trivial control and area costs • Operation duplication allows more flexible scheduling; however, it can lead to higher control costs • Integration with synthesis transformations • Operation Chaining • Notion of mutual exclusivity of operations • Resource Sharing

  11. Our Approach to Parallelizing HLS (PHLS) • Optimizing Compiler and Parallelizing Compiler transformations applied at Source-level (Pre-synthesis) and during Scheduling • Source-level code refinement using Pre-synthesis transformations • Code Restructuring by Speculative Code Motions • Operation replication to improve concurrency • Dynamic transformations: exploit new opportunities during scheduling (Flow diagram stages: C Input, Original CDFG, Source-Level Compiler Transformations, Optimized CDFG, Scheduling with Compiler & Dynamic Transformations, Scheduling & Binding, VHDL Output.)

  12. Our Approach to PHLS • Optimizing Compiler and Parallelizing Compiler transformations applied at Source-level (Pre-synthesis) and during Scheduling • Source-level code refinement using Pre-synthesis transformations • Code Restructuring by Speculative Code Motions • Operation replication to improve concurrency • Dynamic transformations: exploit new opportunities during scheduling • Develop heuristics that balance parallelism extracted by code transformations with the hardware costs: improve overall QOR • Increase Code Compaction (improve resource utilization) • Reduce impact of programming style/control constructs on HLS results • Useful in descriptions with nested conditionals and loops • Choice of Transformations we explore is based on: Improving Performance, Increasing Code Compaction, Invariance to Programming Style, Extracting Parallelism

  13. PHLS Transformations Organized into Four Groups • Pre-synthesis: Loop-invariant code motions, Loop unrolling, CSE • Scheduling: Speculative Code Motions, Multi-cycling, Operation Chaining, Loop Pipelining • Dynamic: Transformations applied dynamically during scheduling: Dynamic CSE, Dynamic Copy Propagation, Dynamic Branch Balancing • Basic Compiler Transformations: Copy Propagation, Dead Code Elimination

  14. 1. Pre-synthesis: Loop Invariant CM (Figure: loop-invariant code motion hoists operations 1: a = b + c and 2: d = a + c out of the loop body BB2, which originally also contains 3: e = e + d and i = i + 1, into the basic block before the loop node (i < n); only operation 3 and the loop increment remain inside the loop.)

  15. 1. Pre-synthesis: Loop Invariant CM • Reduce number of operations that execute in loops • Putting code inside loops is a programming convenience • Common situation in media applications (Figure: same loop-invariant code motion example as Slide 14.)
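
A minimal C rendering of the loop-invariant code motion example on these two slides (the loop bounds and return value are invented to make the fragment self-contained):

```c
/* Before: 1: a = b + c and 2: d = a + c are recomputed on every iteration */
int loop_before(int b, int c, int n) {
    int e = 0;
    for (int i = 0; i < n; i = i + 1) {
        int a = b + c;      /* loop-invariant: b and c do not change in the loop */
        int d = a + c;      /* loop-invariant as well                            */
        e = e + d;          /* 3: the only operation that really needs the loop  */
    }
    return e;
}

/* After loop-invariant code motion: the two additions execute once, before the loop */
int loop_after(int b, int c, int n) {
    int a = b + c;
    int d = a + c;
    int e = 0;
    for (int i = 0; i < n; i = i + 1)
        e = e + d;
    return e;
}
```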

  16. 1. Common Sub-Expression Elimination • C description: a = b + c; c = b < c; if (c) d = b + c; else e = g + h; (Figure: HTG representation: BB0 computes a = b + c, the If node in BB1 branches on the condition, BB2 computes d = b + c and BB3 computes e = g + h, joining in BB4; after CSE, the operation in BB2 becomes d = a.)

  17. 1. Common Sub-Expression Elimination • We use the notion of dominance of basic blocks • A basic block BBi dominates another basic block BBj if all control paths from the initial basic block of the design graph leading to BBj go through BBi • We can eliminate an operation opj in BBj using the common expression in opi if BBi dominates BBj (C description and HTG figure as on Slide 16.)
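
A C sketch of CSE under this dominance condition (the condition and variable names are adapted slightly from the slide so the fragment compiles on its own):

```c
/* Before: b + c is computed in the initial block and again inside the conditional */
int cse_before(int b, int c, int g, int h) {
    int a = b + c;          /* opi in BB0, which dominates both branches   */
    int d = 0, e = 0;
    if (b < c)
        d = b + c;          /* opj in BB2: the same expression, recomputed */
    else
        e = g + h;          /* BB3 */
    return a + d + e;       /* BB4 */
}

/* After CSE: since BB0 dominates BB2, the recomputation becomes a copy */
int cse_after(int b, int c, int g, int h) {
    int a = b + c;
    int d = 0, e = 0;
    if (b < c)
        d = a;              /* reuse the value computed in the dominating block */
    else
        e = g + h;
    return a + d + e;
}
```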

  18. 2. Scheduling Transformations: Speculative Code Motions (Figure: a Hierarchical Task Graph with basic blocks BB0, BB1, BB2 around an If node and BB3 after the join, holding operations a, b, and c; arrows illustrate the code motions: Speculation moves an operation up out of a conditional branch, Reverse Speculation moves an operation down into the branches, Conditional Speculation duplicates an operation into both branches, and operations can also move Across Hierarchical Blocks; a resource utilization chart shows the schedule under resource constraints.)

  19. 2. Scheduling Transformations: Speculative Code Motions • Lead to shorter schedule lengths by utilizing resources that are “idle” in earlier cycles • Reduce the impact of programming style (operation placement) on the quality of HLS results (Figure: same code motions and resource utilization as Slide 18.)
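
A C-level sketch of plain speculation, using the operations that appear on the next two slides (the function wrapper and output pointers are only there to make the fragment self-contained):

```c
/* Before: d = e + f cannot be scheduled until the condition c = a < b is known */
void spec_before(int a, int b, int e, int f, int h, int i, int *d, int *g) {
    int c = a < b;
    if (c)
        *d = e + f;
    else
        *g = h + i;
}

/* After speculation: e + f executes unconditionally, in parallel with a < b;
 * only the commit of the result stays under the condition.                  */
void spec_after(int a, int b, int e, int f, int h, int i, int *d, int *g) {
    int d_spec = e + f;     /* speculated operation; needs an extra register */
    int c = a < b;
    if (c)
        *d = d_spec;        /* only a copy remains inside the branch */
    else
        *g = h + i;
}
```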

  20. Hardware Costs of Speculative Code Motions: Speculation (Figure: original schedule; in state S0 the ALU computes c = a < b in BB0; in state S1 it computes either d = e + f in BB1 or g = h + i in BB2, with the inputs e, f, h, i and the outputs d, g steered by multiplexers controlled by S1.c and S1.!c.)

  21. Hardware Costs of Speculative Code Motions: Speculation • Might be able to eliminate the extra register by careful (interconnect-aware) resource binding (Figure: after speculation, state S0 computes d' = e + f together with c = a < b; state S1 holds g = h + i and the copy d = d'; the speculated result occupies an extra register d'.)

  22. Hardware Costs of Speculative Code Motions: Conditional Speculation (Figure: original schedule; state S0 computes g = h + i and c = a < b; the copy e = g sits in one conditional branch; d = e + f is scheduled after the join in BB3, in state S3.)

  23. Hardware Costs of Speculative Code Motions: Conditional Speculation • Fewer cycles, but more complex control and multiplexing (Figure: after conditional speculation, d = e + f is duplicated into both branches: as d = g + f in BB1, where the copy e = g has been propagated, and as d = e + f in BB2; the ALU inputs are now steered by condition-dependent multiplexers.)
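
A C sketch of the conditional speculation shown above (control structure simplified; the return value is invented so the fragment stands alone):

```c
/* Before: d = e + f sits after the join of the conditional, so it needs
 * its own state (S3) after both branches complete.                      */
int cspec_before(int a, int b, int e, int f, int h, int i) {
    int g = h + i;
    int c = a < b;
    if (c)
        e = g;              /* copy in the taken branch  */
    return e + f;           /* d = e + f, after the join */
}

/* After conditional speculation: the addition is duplicated into both branches
 * and scheduled earlier; the copy e = g is propagated into the duplicate.      */
int cspec_after(int a, int b, int e, int f, int h, int i) {
    int g = h + i;
    int c = a < b;
    int d;
    if (c)
        d = g + f;          /* duplicate, with e = g propagated */
    else
        d = e + f;          /* duplicate in the other branch    */
    return d;               /* the join no longer needs a state for the addition */
}
```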

  24. Hardware Costs of Speculative Code Motions • Speculative code motions lead to two opposite effects on the control, multiplexing, and area costs • 1. Shorter schedule lengths: leads to smaller controllers (fewer states in the state machine), which leads to smaller area • 2. More multiplexing and control costs to steer the data, particularly when operations are conditionally speculated: leads to longer critical paths (Figure: control logic driven by the current state and conditions steers the ALU inputs; the critical path runs through this steering logic.)

  25. 3. Dynamic Transformations • Called “dynamic” since they are applied during scheduling (versus a pass before/after scheduling) • Dynamic Branch Balancing • Increase the scope of code motions • Reduce impact of programming style on HLS results • Dynamic CSE and Dynamic Copy Propagation • Exploit the Operation movement and duplication due to speculative code motions • Create new opportunities to apply these transformations • Reduce the number of operations

  26. 3. Dynamic Branch Balancing (Figure: original design and scheduled design of an unbalanced conditional; the branches BB1 and BB2 of the If node in BB0 contain different numbers of operations, so the scheduled design needs more scheduling steps (S0-S3) along the longer branch, lengthening the longest path; the resource allocation is shown at the top.)

  27. 3. Dynamic Branch Balancing (Figure: a new scheduling step is inserted in the shorter branch, so both branches of the conditional end up with the same number of steps.)

  28. 3. Dynamic Branch Balancing • Dynamic branch balancing is done while traversing the design, and when it enables conditional speculation (Figure: with the new step inserted in the shorter branch, an operation (e) from below the conditional can be conditionally speculated into both branches without lengthening the schedule.)
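
A rough C sketch of the balancing decision these slides describe (the data structure and function names are hypothetical, not SPARK's actual interfaces):

```c
#include <stdbool.h>

/* Hypothetical view of a two-way conditional in the hierarchical task graph. */
typedef struct {
    int steps_true;         /* scheduling steps in the true branch  */
    int steps_false;        /* scheduling steps in the false branch */
} CondBlock;

/* Insert a scheduling step in the shorter branch, but only when the branches
 * are unbalanced, i.e., only when doing so can enable conditional speculation
 * without lengthening the longest path through the conditional.              */
bool balance_for_cond_speculation(CondBlock *cb) {
    if (cb->steps_true == cb->steps_false)
        return false;           /* already balanced: inserting a step would add a cycle */
    if (cb->steps_true < cb->steps_false)
        cb->steps_true++;       /* new step in the shorter (true) branch  */
    else
        cb->steps_false++;      /* new step in the shorter (false) branch */
    return true;                /* a duplicated operation can now be placed here */
}
```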

  29. 3. Dynamic CSE: Going beyond Traditional CSE • C description: a = b + c; c = b < c; if (c) d = b + c; else e = g + h; (Figure: the same HTG representation and after-CSE result as Slides 16-17.)

  30. New Opportunities for “Dynamic” CSE Due to Code Motions (Figure: originally a = b + c is in BB2 and d = b + c is in BB6, and CSE is not possible since BB2 does not dominate BB6; when the scheduler decides to speculate the operation into BB0 as dcse = b + c, leaving the copy a = dcse in BB2, CSE becomes possible since BB0 dominates BB6.)

  31. New Opportunities for “Dynamic” CSE Due to Code Motions • If the scheduler moves or duplicates an operation op, apply CSE on the remaining operations using op (Figure: after the speculation, d = b + c in BB6 is replaced by d = dcse, since BB0 now dominates BB6.)
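
A C-level sketch of the same effect (the guards p and q stand in for the conditionals around BB2 and BB6; the structure is simplified from the slide):

```c
/* Before scheduling: a = b + c lives inside one conditional (BB2) and
 * d = b + c inside a later one (BB6); since BB2 does not dominate BB6,
 * traditional CSE cannot touch the second computation.                 */
int dyn_cse_before(int b, int c, int p, int q) {
    int a = 0, d = 0;
    if (p) a = b + c;       /* BB2 */
    if (q) d = b + c;       /* BB6 */
    return a + d;
}

/* After the scheduler speculates the first addition into BB0, the value
 * dcse exists on every path, BB0 dominates BB6, and dynamic CSE reuses it. */
int dyn_cse_after(int b, int c, int p, int q) {
    int dcse = b + c;       /* speculated into BB0 by the scheduler */
    int a = 0, d = 0;
    if (p) a = dcse;
    if (q) d = dcse;        /* dynamic CSE applied using the moved operation */
    return a + d;
}
```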

  32. Conditional Speculation & Dynamic CSE (Figure: originally a = b + c sits below a pair of conditional branches and d = b + c in a later basic block; the scheduler decides to conditionally speculate the first addition, duplicating it as a' = b + c into both branches BB1 and BB2 and leaving the copy a = a' behind.)

  33. Conditional Speculation & Dynamic CSE • Use the notion of dominance by groups of basic blocks • All control paths leading up to BB8 come from either BB1 or BB2, so BB1 and BB2 together dominate BB8 • Hence dynamic CSE can replace d = b + c with d = a', using the duplicated operations (Figure: same example as Slide 32 after the replacement.)
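
A sketch of the group-dominance test in C (the CFG representation and helper names are hypothetical, and the recursion assumes an acyclic region; it is only meant to make the idea concrete):

```c
#include <stdbool.h>

#define MAX_PREDS 8

/* Hypothetical basic-block node of a control flow graph. */
typedef struct BasicBlock {
    struct BasicBlock *preds[MAX_PREDS];
    int num_preds;                    /* 0 for the initial basic block */
} BasicBlock;

static bool in_group(BasicBlock *bb, BasicBlock **group, int n) {
    for (int i = 0; i < n; i++)
        if (group[i] == bb) return true;
    return false;
}

/* A group of basic blocks dominates target if every control path reaching
 * target passes through some member of the group.  The walk stops on group
 * members; reaching the initial block means a path bypassed the group.     */
bool group_dominates(BasicBlock **group, int n, BasicBlock *target) {
    if (in_group(target, group, n))
        return true;
    if (target->num_preds == 0)
        return false;               /* entry reached without meeting the group */
    for (int i = 0; i < target->num_preds; i++)
        if (!group_dominates(group, n, target->preds[i]))
            return false;
    return true;
}
```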

  34. Integrating the Parallelizing Transformations into a HLS Scheduler • Employ Speculative Code Motions during Scheduling • Perform Branch Balancing • While traversing the design • If it enables code motion (during scheduling) • Perform Dynamic CSE after scheduling an operation • After the scheduled operation has been moved and possibly duplicated

  35. Architecture of the PHLS Scheduler (Figure: the scheduler is organized as a set of modules: the IR Walker traverses the design to find the next basic block to schedule; the Candidate Fetcher traverses the design to find candidate operations to schedule; the Candidate Chooser calculates the cost of the candidate operations and chooses the operation with the lowest cost; the Candidate Mover moves, duplicates, and schedules the chosen operation; the Dynamic Transforms module then dynamically applies transformations such as CSE on the remaining candidate operations using the scheduled operation.)

  36. Integrating Transformations into the Scheduler (Figure: the same scheduler pipeline, IR Walker, Candidate Fetcher/Walker/Validator, Candidate Chooser, Candidate Mover, and Dynamic Transforms, annotated with where the transformations plug in: branch balancing during design traversal, determining the code motions required to schedule an operation, applying the speculative code motions, and branch balancing during code motion.)
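
A highly simplified C sketch of the scheduling loop on these two slides (every type and function here is a placeholder standing in for a SPARK module, not its real API):

```c
#include <stddef.h>

/* Placeholder types for SPARK's internal data structures. */
typedef struct Design     Design;
typedef struct BasicBlock BasicBlock;
typedef struct Operation  Operation;
typedef struct OpList     OpList;

/* Placeholder declarations for the scheduler modules named on the slides. */
BasicBlock *next_basic_block(Design *d);                                  /* IR Walker             */
void        balance_branches(BasicBlock *bb);                            /* balancing on traversal */
OpList     *collect_candidates(Design *d, BasicBlock *bb);                /* Candidate Fetcher      */
Operation  *pick_lowest_cost(OpList *cands);                              /* Candidate Chooser      */
void        move_and_schedule(Design *d, Operation *op, BasicBlock *bb);  /* Candidate Mover        */
void        apply_dynamic_cse(OpList *cands, Operation *op);              /* Dynamic Transforms     */
void        remove_candidate(OpList *cands, Operation *op);
int         is_empty(OpList *cands);

void schedule_design(Design *design) {
    BasicBlock *bb;
    while ((bb = next_basic_block(design)) != NULL) {        /* traverse the design               */
        balance_branches(bb);                                /* branch balancing during traversal */
        OpList *cands = collect_candidates(design, bb);      /* candidate operations to schedule  */
        while (!is_empty(cands)) {
            Operation *op = pick_lowest_cost(cands);         /* lowest-cost candidate               */
            move_and_schedule(design, op, bb);               /* speculative code motions applied    */
            apply_dynamic_cse(cands, op);                    /* dynamic CSE on remaining candidates */
            remove_candidate(cands, op);                     /* op is now scheduled                 */
        }
    }
}
```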

  37. Scheduling Problem Formulation • Given: a data flow graph Gd(Vop, Edata), a control flow graph Gc(Vbb, Econtrol), a mapping Φ : Vop → Vbb, and a resource list R • Find: a new mapping Φsched : Vop → Vbb and start times Τ of all operations in Vop such that (a) each vop starts executing after all its predecessor operations in Gd have finished executing, (b) each vop is mapped to a resource in R, and (c) if vbbs = BBsched(vop) and vbbun = BBUnSched(vop), then either vbbs dominates vbbun or vbbun dominates vbbs, or vop is duplicated into a set of basic blocks β such that either β dominates vbbun or vbbun dominates β • Speculative code motions: speculation and movement along control paths without duplication; conditional and reverse speculation: duplication of operations into multiple conditional branches • Hierarchical Task Graphs are used to enable efficient code motions

  38. SPARK High-Level Synthesis Framework

  39. SPARK Parallelizing HLS Framework • C input and Synthesizable RTL VHDL output • Range of compiler, parallelizing compiler and HLS transformations applied during Pre-synthesis and Scheduling phases • Tool-box of Transformations and Heuristics • Each of these can be developed independently of the others • Complete HLS tool: Does Binding, Control Synthesis and Backend VHDL generation • Interconnect Minimizing Resource Binding • Enables Graphical Visualization of Design description and intermediate results • About 100,000+ lines of C++ code

  40. Graph Visualization (Screenshots: HTG and DFG views of a design.)

  41. Scheduling (Screenshot: resource utilization graph.)

  42. Example of Complex HTG • Example of a real design: MPEG-1 pred2 function • Multiple nested loops and conditionals

  43. Experiments • Experiments for several transformations • Pre-synthesis transformations • Speculative Code Motions • Dynamic CSE • We used SPARK to synthesize designs derived from several industrial designs • MPEG-1, MPEG-2, GIMP Image Processing software • Case Study of Intel Instruction Length Decoder • Scheduling Results • Number of States in FSM • Cycles on Longest Path through Design • VHDL: Logic Synthesis • Critical Path Length (ns) • Unit Area

  44. Target Applications

  45. Scheduling & Logic Synthesis Results (Chart: results for successive transformation sets, non-speculative code motions within BBs & across hierarchical blocks, + pre-synthesis transforms, + speculative code motions, + dynamic CSE; the annotated improvements are 36%, 39%, 42%, 36%, 10%, and 8%.)

  46. Scheduling & Logic Synthesis Results • Overall: 63-66% improvement in delay with almost constant area (Chart: same as Slide 45.)

  47. Scheduling & Logic Synthesis Results (Chart: same sequence of transformation sets as Slide 45; the annotated improvements are 33%, 52%, 20%, 1%, 41%, and 14%.)

  48. Scheduling & Logic Synthesis Results • Overall: 48-76% improvement in delay with almost constant area (Chart: same as Slide 47.)

  49. Case Study: Intel Instruction Length Decoder (Figure: a stream of instructions fills the instruction buffer; the instruction length decoder determines where the first, second, and third instructions begin.)

  50. Case Study: ILD Block from Intel • A design derived from the Instruction Length Decoder of the Intel Pentium® class of processors • Decodes length of instructions streaming from memory • Sequentially looks at up to 4 bytes for each instruction • Has to execute in one cycle and decode about 64 bytes of instructions • Characteristics of Microprocessor functional blocks • Low Latency: Single or Dual cycle implementation • Consist of several small computations • Intermix of control and data logic
