High-Level Synthesis with LegUp: A Crash Course for Users and Researchers

Jason Anderson, Stephen Brown, Andrew Canis, Jongsok (James) Choi
Dept. of Electrical and Computer Engineering, University of Toronto
11 February 2013, ACM FPGA Symposium, Monterey, CA

[Figure: LegUp labels shown alongside Berlin, Hong Kong, New York City, and Tokyo]

Tutorial Outline
• Overview of LegUp and its algorithms (60 min)
• Labs ("hands on" via VirtualBox)
  • Lab 1: Using the LegUp framework (30 min)
  • Break
  • Lab 2: Adding resource constraints (30 min)
  • Lab 3: Changing how LegUp implements hardware (30 min)

Project Motivation
• Hardware design has advantages over software:
  • Speed
  • Energy efficiency
• Hardware design is difficult and skills are rare:
  • 10 software engineers for every hardware engineer*
• We need a CAD flow that simplifies hardware design for software engineers
*US Bureau of Labor Statistics, '08

Top-Level Vision
[Figure: design flow. A C program such as

    int FIR(int ntaps, int sum) {
        int i;
        for (i = 0; i < ntaps; i++)
            sum += h[i] * z[i];
        return (sum);
    }

is compiled by a C compiler into program code for a self-profiling MIPS processor. The profiling data (execution cycles, power, cache misses) suggest program segments to target to hardware. High-level synthesis turns these into hardened program segments in the FPGA fabric, and the altered software binary calls the hardware accelerators.]

LegUp: Key Features
• C to Verilog high-level synthesis
• Many benchmarks (incl. 12 CHStone)
• MIPS processor (Tiger)
• Hardware profiler
• Automated verification tests
• Open source, freely downloadable
  • Like ABC (synthesis) or VPR (place & route)
  • 600+ downloads since March 2011
  • http://legup.eecg.utoronto.ca

System Architecture
[Figure: an FPGA (Cyclone II or Stratix IV) on an Altera DE2 or DE4 board. A MIPS processor and hardware accelerators, each accelerator with its own memory, connect through the Avalon interface to an on-chip cache and a memory controller for off-chip memory.]

High-Level Synthesis Framework
• Leverage LLVM compiler infrastructure:
  • Language support: C/C++
  • Standard compiler optimizations
  • More on this shortly
• We support a large subset of ANSI C

Hardware Profiler Architecture
• Monitor the instruction bus to detect function calls and returns.
  • Call: hash (in hardware) from function address to index; push onto the stack.
  • Return: pop the function index from the stack.
• Use function indexes to associate profiling data (e.g., cycles, power) with counters.
Address hash (in hardware), applied to the call target address tAddr:
    tAddr += V1
    tAddr += (tAddr << 8)
    tAddr ^= (tAddr >> 4)
    b = (tAddr >> B1) & B2
    a = (tAddr + (tAddr << A1)) >> A2
    fNum = (a ^ tab[b])
[Figure: the MIPS processor's instruction stream (PC, instruction cache) feeds an op decoder that detects call and ret. The call target address passes through the address hash to give a function number, which is pushed onto the call stack. A data counter for the current function increments when the PC changes, and counts are kept in a counter storage memory for all functions.]
See the IEEE ASAP'11 paper.
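The address hash above can be modeled in software to see how a call target address becomes a function number. This is a minimal sketch: the slide names V1, A1, A2, B1, B2, and tab but gives no values or table size, so the struct hash_params type, its field widths, and the 32-bit arithmetic are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical parameter set: the slide names these constants but does
   not give their values, so they are supplied by the caller here. */
struct hash_params {
    uint32_t v1;
    unsigned a1, a2, b1;   /* shift amounts */
    uint32_t b2;           /* mask that forms the table index */
    const uint32_t *tab;   /* must hold at least b2 + 1 entries */
};

/* Software model of the slide's hash: call target address -> function number.
   Assumes 32-bit wrap-around arithmetic. */
uint32_t profiler_hash(uint32_t t_addr, const struct hash_params *p) {
    assert(p->a1 < 32 && p->a2 < 32 && p->b1 < 32);
    t_addr += p->v1;
    t_addr += (t_addr << 8);
    t_addr ^= (t_addr >> 4);
    uint32_t b = (t_addr >> p->b1) & p->b2;
    uint32_t a = (t_addr + (t_addr << p->a1)) >> p->a2;
    return a ^ p->tab[b];
}
```

In the profiler, the returned function number would then be pushed on the call stack at a call and used to select the counter for that function.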

Processor/Accelerator Hybrid Flow
    int main() {
        …
        sum = dotproduct(N);
        ...
    }

    int dotproduct(int N) {
        …
        for (i = 0; i < N; i++) {
            sum += A[i] * B[i];
        }
        return sum;
    }

Processor/Accelerator Hybrid Flow
    int main() {
        …
        sum = dotproduct(N);
        ...
    }

    int dotproduct(int N) {
        …
        for (i = 0; i < N; i++) {
            sum += A[i] * B[i];
        }
        return sum;
    }

    #define dotproduct_DATA   (volatile int *) 0xf0000000
    #define dotproduct_STATUS (volatile int *) 0xf0000008
    #define dotproduct_ARG1   (volatile int *) 0xf000000C

    int legup_dotproduct(int N) {
        *dotproduct_ARG1 = (volatile int) N;
        *dotproduct_STATUS = 1;
        return *dotproduct_DATA;
    }

Processor/Accelerator Hybrid Flow
    int main() {
        …
        sum = dotproduct(N);
        ...
    }
[Figure: with set_accelerator_function "dotproduct", HLS compiles dotproduct into a HW accelerator.]

Processor/Accelerator Hybrid Flow
The call to dotproduct in main is replaced with a call to the wrapper:
    int main() {
        …
        sum = legup_dotproduct(N);
        ...
    }

    #define dotproduct_DATA   (volatile int *) 0xf0000000
    #define dotproduct_STATUS (volatile int *) 0xf0000008
    #define dotproduct_ARG1   (volatile int *) 0xf000000C

    int legup_dotproduct(int N) {
        *dotproduct_ARG1 = (volatile int) N;
        *dotproduct_STATUS = 1;
        return *dotproduct_DATA;
    }

Processor/Accelerator Hybrid Flow
    int main() {
        …
        sum = legup_dotproduct(N);
        ...
    }

    #define dotproduct_DATA   (volatile int *) 0xf0000000
    #define dotproduct_STATUS (volatile int *) 0xf0000008
    #define dotproduct_ARG1   (volatile int *) 0xf000000C

    int legup_dotproduct(int N) {
        *dotproduct_ARG1 = (volatile int) N;
        *dotproduct_STATUS = 1;
        return *dotproduct_DATA;
    }
[Figure: the modified software (SW) runs on the MIPS processor.]
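The wrapper writes the argument, sets the status register, and reads back the result. The sketch below imitates that handshake on a host machine so the sequence can be stepped through without an FPGA. The register offsets come from the slide's #defines. The mmio_write and mmio_read helpers, the four-element test vectors, and the instant completion are assumptions for illustration; how the real system makes the processor wait for the accelerator is not shown on the slide.

```c
#include <assert.h>
#include <stdint.h>

/* Register offsets from the slide's #defines (base 0xf0000000). */
enum { REG_DATA = 0x0, REG_STATUS = 0x8, REG_ARG1 = 0xC };

/* Hypothetical host-side stand-in for the accelerator and its data. */
static int vec_a[4] = {1, 2, 3, 4};
static int vec_b[4] = {5, 6, 7, 8};
static int reg_arg1, reg_data;

/* Simulated bus write: writing 1 to STATUS runs the accelerator model. */
static void mmio_write(uint32_t offset, int value) {
    if (offset == REG_ARG1) {
        reg_arg1 = value;
    } else if (offset == REG_STATUS && value == 1) {
        assert(reg_arg1 >= 0 && reg_arg1 <= 4);
        int sum = 0;
        for (int i = 0; i < reg_arg1; i++)
            sum += vec_a[i] * vec_b[i];
        reg_data = sum;   /* the model completes immediately */
    }
}

/* Simulated bus read: only the DATA register is modeled. */
static int mmio_read(uint32_t offset) {
    return offset == REG_DATA ? reg_data : 0;
}

/* Same three steps as the slide's wrapper, routed through the model. */
int legup_dotproduct(int N) {
    mmio_write(REG_ARG1, N);
    mmio_write(REG_STATUS, 1);
    return mmio_read(REG_DATA);
}
```

With these test vectors, legup_dotproduct(4) computes 1*5 + 2*6 + 3*7 + 4*8.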

How Does LegUp Handle Memory and Pointers?
• LegUp stores each array in a separate FPGA BRAM
  • BRAM data width matches the data in the array
  • Each BRAM is identified by a 9-bit tag
• Addresses consist of the RAM tag and the array index:
  • Bits 31 to 23: 9-bit tag
  • Bits 22 to 0: 23-bit index
• A shared memory controller uses the tag bits to determine which BRAM to read from or write to
• The array index is the address passed to the BRAM

Pointer Example
• We have two arrays in the C function:
  • int A[100], B[100]
• Tag 0 is reserved for NULL pointers
• Tag 1 is reserved for off-chip memory
• Assign tag 2 to array A and tag 3 to array B
• Address of A[3]: Tag=2 (bits 31 to 23), Index=3 (bits 22 to 0)
• Address of B[7]: Tag=3 (bits 31 to 23), Index=7 (bits 22 to 0)
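The tag/index layout can be checked with a few lines of C. This is a sketch of the encoding described on the slides (9-bit tag in bits 31 to 23, 23-bit index in bits 22 to 0); the function and macro names are made up for illustration and are not LegUp identifiers.

```c
#include <assert.h>
#include <stdint.h>

#define INDEX_BITS 23u
#define INDEX_MASK ((1u << INDEX_BITS) - 1u)  /* low 23 bits */
#define TAG_MAX    511u                       /* largest 9-bit tag */

/* Pack a BRAM tag and an array index into one 32-bit address. */
uint32_t make_addr(uint32_t tag, uint32_t index) {
    assert(tag <= TAG_MAX && index <= INDEX_MASK);
    return (tag << INDEX_BITS) | index;
}

/* Recover the two fields from an address. */
uint32_t addr_tag(uint32_t addr)   { return addr >> INDEX_BITS; }
uint32_t addr_index(uint32_t addr) { return addr & INDEX_MASK; }
```

Under this layout, A[3] with tag 2 encodes to 0x01000003 and B[7] with tag 3 encodes to 0x01800007.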

Shared Memory Controller
• Both arrays A and B have 100-element BRAMs
• Load from pointer D:
[Figure: pointer D carries Tag=2 (bits 31 to 23) and Index=13 (bits 22 to 0). The index 13 addresses both BRAMs (tag 2 holding A[0] to A[99], tag 3 holding B[0] to B[99]), which output A[13] and B[13]. The tag selects A[13] as the 32-bit result.]
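A load through the shared memory controller can be mimicked in software by splitting the address and using the tag to pick a BRAM. This is a behavioral sketch only: the arrays stand in for the two BRAMs of the example, mem_load and mem_store are hypothetical names, and the NULL and off-chip tags are not modeled.

```c
#include <assert.h>
#include <stdint.h>

enum { TAG_NULL = 0, TAG_OFFCHIP = 1, TAG_A = 2, TAG_B = 3 };

/* Stand-ins for the two 100-element BRAMs of the example. */
static int32_t bram_a[100];
static int32_t bram_b[100];

/* Pick the BRAM named by the tag; tags 0 and 1 are rejected in this model. */
static int32_t *select_bram(uint32_t tag) {
    assert(tag == TAG_A || tag == TAG_B);
    return tag == TAG_A ? bram_a : bram_b;
}

/* Load: the index addresses the BRAM, the tag chooses which one answers. */
int32_t mem_load(uint32_t addr) {
    uint32_t index = addr & 0x7FFFFFu;
    assert(index < 100);
    return select_bram(addr >> 23)[index];
}

/* Store: same decoding, opposite direction. */
void mem_store(uint32_t addr, int32_t value) {
    uint32_t index = addr & 0x7FFFFFu;
    assert(index < 100);
    select_bram(addr >> 23)[index] = value;
}
```

A pointer D with Tag=2 and Index=13 has the value (2u << 23) | 13u, so mem_load on it returns the contents of A[13].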

Core Benchmarks (+Many More)
• 12 CHStone benchmarks (JIP'09) and Dhrystone
• Too large/complex for academic HLS tools
• Include golden input/output test vectors
• Not supported by academic tools

Experimental Results: LegUp 1.0 (2011) for Cyclone II
1. Pure software on MIPS
Hybrid (software/hardware):
2. Second most compute-intensive function (and descendants) in H/W
3. Same as 2 but with the most compute-intensive function
4. Pure hardware using LegUp
5. Pure hardware using eXCite (commercial tool)

Experimental Results

Comparison: LegUp vs. eXCite
• Benchmarks compiled to hardware
• eXCite: commercial high-level synthesis tool
  • Couldn't compile Dhrystone

Energy Consumption 18x less energy than software

Current Release: LegUp 3.0
• Loop pipelining
• Dual and multi-ported memory support
• Bitwidth minimization
• Multi-pumping DSP units for area reduction
• Alias analysis for dependency checks
• Parallel accelerators via Pthreads & OpenMP
Results are now considerably better than in the LegUp 1.0 release.

LegUp 3.0 vs. LegUp 1.0

LLVM Compiler and HLS Algorithms

LLVM Compiler
• Open-source compiler framework
  • http://llvm.org
• Used by Apple, NVIDIA, AMD, and others
• Code quality competitive with gcc
• LegUp HLS is a "back-end" of LLVM
• LLVM: low-level virtual machine

LLVM Compiler
• LLVM compiles C code into a control flow graph (CFG)
• LLVM performs standard optimizations
  • 50+ different optimizations in LLVM
[Figure: the C program

    int FIR(int ntaps, int sum) {
        int i;
        for (i = 0; i < ntaps; i++)
            sum += h[i] * z[i];
        return sum;
    }

passes through the LLVM compiler, which produces a CFG with basic blocks BB0, BB1, and BB2.]

Control Flow Graph
• A control flow graph is composed of basic blocks
• Basic block: a sequence of instructions terminated with exactly one branch
  • Can be represented by an acyclic data flow graph
[Figure: a CFG with BB0, BB1, and BB2, next to a data flow graph containing three loads, two additions, and a store.]

LLVM Details
• Instructions in basic blocks are primitive computational operations:
  • shift, add, divide, xor, and, etc.
• Or are control-flow operations:
  • branch, call, etc.
• The CDFG is represented in LLVM's intermediate representation (IR)
  • IR is machine-independent assembly code

High-Level Synthesis Flow
[Figure: C program -> C compiler (LLVM) -> optimized LLVM IR -> allocation -> scheduling -> binding -> RTL generation -> synthesizable Verilog. Target H/W characterization and user constraints (timing, resource) are additional inputs to the flow.]

Scheduling
• Scheduling is the task of assigning operations to clock cycles, sequenced by a finite state machine (FSM)
[Figure: FSM states 0 to 3 alongside the schedule: two loads, then + and a load, then +, then store.]

Binding
• Binding is the task of assigning scheduled operations to functional units in the datapath
[Figure: the schedule (loads, additions, store) mapped onto a datapath with a 2-port RAM, an adder, and flip-flops (FF).]

High-Level Synthesis: Scheduling

SDC Scheduling
• SDC: System of Difference Constraints
  • Cong, Zhang, "An efficient and versatile scheduling algorithm based on SDC formulation". DAC 2006: 433-438.
• Basic idea: formulate scheduling as a mathematical optimization problem
  • Linear objective function + linear constraints (==, <=, >=)
• The problem is a linear program (LP)
  • Solvable in polynomial time with standard solvers

Define Variables
• For each operation i to schedule, create a variable t_i.
• The t_i's hold the cycle number in which each operation is scheduled.
• Here we have:
  • t_add, t_shift, t_sub
• Data flow graph (DFG): already accessible in LLVM.
[Figure: DFG with +, <<, and - operations.]

Dependency Constraints
• In this example, the subtract can only happen after the add and shift:
  • t_sub - t_add >= 0
  • t_sub - t_shift >= 0
• Hence the name difference constraints.
[Figure: DFG in which add and shift feed sub.]

Handling Clock Period Constraints
• Target period: P (e.g., 10 ns)
• For each chain of dependent operations in the DFG, estimate the path delay D (from LegUp's models)
  • E.g.: D from mod -> or = 23 ns
• Compute: R = ceiling(D/P) - 1
  • E.g.: R = ceiling(23/10) - 1 = 2
• Add the difference constraint:
  • t_or - t_mod >= 2
[Figure: chain of operations mod, xor, shr, or.]
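The formula R = ceiling(D/P) - 1 is easy to get wrong by one, so a small helper makes the arithmetic concrete. This sketch assumes delays and periods are given as whole nanoseconds; min_cycle_gap is a name chosen for illustration.

```c
#include <assert.h>

/* Minimum number of cycles that must separate the first and last
   operation of a chain with total delay delay_ns, for a target
   clock period period_ns: R = ceiling(D / P) - 1. */
int min_cycle_gap(int delay_ns, int period_ns) {
    assert(delay_ns > 0 && period_ns > 0);
    /* Integer ceiling division, then subtract one. */
    return (delay_ns + period_ns - 1) / period_ns - 1;
}
```

For the slide's example, D = 23 ns and P = 10 ns give a gap of 2, which is the right-hand side of t_or - t_mod >= 2. A chain that fits in one period (D <= P) gives 0, so its operations may share a cycle.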

Resource Constraints
• Restriction on the number of operations of a given type that can execute in a cycle
• Why do we need them?
  • Want to use dual-port RAMs in the FPGA
    • Allow up to 2 load/store operations in a cycle
  • Floating point
    • Do not want to instantiate many FP cores of a given type, probably just one
    • Scheduling must honour the number of FP cores available

Resource Constraints in SDC
• Resource-constrained scheduling is NP-hard.
• Implemented the approach in [Cong & Zhang DAC 2006]
• Say we want to schedule with only 2 adders in the HW (Lab 2)
[Figure: DFG of eight additions, labeled A through H.]

Add SDC Constraints
• Generate a topological ordering of the resource-constrained operations: A, B, C, E, F, D, G, H
• Say we are constrained to 2 adders in HW.
• Starting at C in the ordering, create a constraint: t_C - t_A > 0
• Next consider E and add the constraint: t_E - t_B > 0
• Continue to the end
• The resulting schedule will have <= 2 adds per cycle
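The constraint pattern on this slide pairs each operation with the one two positions earlier in the ordering. The sketch below generalizes that to k functional units, assuming the pattern simply continues to the end of the ordering as the slide suggests. The struct and function names are illustrative, and operations are identified by single letters as on the slide.

```c
#include <assert.h>

/* One generated constraint: t_later - t_earlier > 0. */
struct res_constraint { char later, earlier; };

/* Walk a topological ordering and force each operation to start
   strictly after the one n_units positions before it.
   Returns the number of constraints written to out. */
int add_resource_constraints(const char *order, int n_ops, int n_units,
                             struct res_constraint *out) {
    assert(n_units >= 1 && n_ops >= 0);
    int count = 0;
    for (int i = n_units; i < n_ops; i++) {
        out[count].later = order[i];
        out[count].earlier = order[i - n_units];
        count++;
    }
    return count;
}
```

For the ordering A, B, C, E, F, D, G, H and 2 adders this yields six constraints, beginning with (C after A) and (E after B). Operations at positions with the same remainder modulo n_units form a strictly increasing chain, so each chain contributes at most one operation per cycle, which is why no cycle ends up with more than n_units additions.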

ASAP Objective Function
• Minimize the sum of the variables: min Σ_i t_i
• Operations will be scheduled as early as possible, subject to the constraints
• The LP is solvable in polynomial time
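For a system made only of difference constraints of the form t_to - t_from >= gap with every t_i >= 0, the earliest feasible cycle for each operation can also be found by repeated relaxation (a longest-path computation), and that earliest solution minimizes the sum of the t_i. The sketch below uses this substitute technique instead of an LP solver, so it illustrates the ASAP result rather than LegUp's implementation. All names are hypothetical.

```c
#include <assert.h>

/* One difference constraint: t[to] - t[from] >= gap. */
struct sdc_constraint { int from, to, gap; };

/* Compute the earliest cycle for every operation by relaxation.
   Returns 0 on success, -1 if the constraints cannot all be met
   (they contain a cycle with positive total gap). */
int sdc_asap(int n_ops, const struct sdc_constraint *c, int n_c, int *t) {
    assert(n_ops > 0 && n_c >= 0);
    for (int i = 0; i < n_ops; i++)
        t[i] = 0;                       /* start everything in cycle 0 */
    for (int pass = 0; pass <= n_ops; pass++) {
        int changed = 0;
        for (int k = 0; k < n_c; k++) {
            int need = t[c[k].from] + c[k].gap;
            if (t[c[k].to] < need) {    /* push the operation later */
                t[c[k].to] = need;
                changed = 1;
            }
        }
        if (!changed)
            return 0;                   /* all constraints satisfied */
    }
    return -1;                          /* still changing: infeasible */
}
```

A strict constraint such as t_C - t_A > 0 is entered with gap 1, and a clock-period constraint such as t_or - t_mod >= 2 with gap 2.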

High-Level Synthesis: Binding

High-Level Synthesis: Binding
• Weighted bipartite matching-based binding
  • Huang, Chen, Lin, Hsu, "Data path allocation based on bipartite weighted matching". DAC 1990: 499-504.
• Finds the minimum weighted matching of a bipartite graph at each step
• Solve using the Hungarian method (polynomial time)
[Figure: bipartite graph with operations on one side, hardware functional units on the other, and edge costs between them.]
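To make the matching step concrete, the sketch below finds a minimum-cost assignment of operations to functional units for one cycle. It uses exhaustive search rather than the Hungarian method named on the slide, which is simpler to read but only practical for a handful of units. The cost values a caller supplies, the MAX_N limit, and all names are assumptions for illustration.

```c
#include <assert.h>
#include <limits.h>

#define MAX_N 8   /* exhaustive search is only sensible for small n */

/* Try every unused functional unit for operation op, recursively. */
static void bind_search(int n, int cost[MAX_N][MAX_N], int op,
                        unsigned used, int so_far, int *current,
                        int *best_cost, int *best) {
    if (op == n) {
        if (so_far < *best_cost) {      /* keep the cheapest full matching */
            *best_cost = so_far;
            for (int i = 0; i < n; i++)
                best[i] = current[i];
        }
        return;
    }
    for (int fu = 0; fu < n; fu++) {
        if (used & (1u << fu))
            continue;                   /* unit already taken this cycle */
        current[op] = fu;
        bind_search(n, cost, op + 1, used | (1u << fu),
                    so_far + cost[op][fu], current, best_cost, best);
    }
}

/* cost[op][fu] is the edge cost of binding operation op to unit fu.
   On return best[op] holds the chosen unit; the total cost is returned. */
int bind_min_cost(int n, int cost[MAX_N][MAX_N], int *best) {
    assert(n >= 1 && n <= MAX_N);
    int current[MAX_N];
    int best_cost = INT_MAX;
    bind_search(n, cost, 0, 0u, 0, current, &best_cost, best);
    return best_cost;
}
```

In a binder that proceeds cycle by cycle, this matching would be repeated for each cycle with edge costs updated to reflect what is already bound to each unit.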

Binding • Bind the following scheduled program

Binding • Resource Sharing: requires 3 multipliers

Binding: Functional Units
(values shown on the three functional units in the figure at each step)
• Bind the first cycle: 1, 1, 1
• Bind the second cycle: 2, 2, 1
• Bind the third cycle: 2, 2, 2
• Bind the fourth cycle: 3, 2, 2
• Required multiplexing: 3, 2, 2
