High-Level Synthesis with LegUp: A Crash Course for Users and Researchers

Jason Anderson, Stephen Brown, Andrew Canis, Jongsok (James) Choi. 11 February 2013, ACM FPGA Symposium, Monterey, CA. Dept. of Electrical and Computer Engineering, University of Toronto.

Presentation Transcript


  1. High-Level Synthesis with LegUp: A Crash Course for Users and Researchers. Jason Anderson, Stephen Brown, Andrew Canis, Jongsok (James) Choi. 11 February 2013, ACM FPGA Symposium, Monterey, CA. Dept. of Electrical and Computer Engineering, University of Toronto

  2. [Figure: map labelled with Berlin, Hong Kong, New York City, and Tokyo, with LegUp markers]

  3. Tutorial Outline • Overview of LegUp and its algorithms (60 min) • Labs (“hands on” via VirtualBox) • Lab 1: Using the LegUp Framework (30 min) • Break • Lab 2: Adding resource constraints (30 min) • Lab 3: Changing How LegUp implements hardware (30 min)

  4. Project Motivation • Hardware design has advantages over software: • Speed • Energy-efficiency • Hardware design is difficult and skills are rare: • 10 software engineers for every hardware engineer* • We need a CAD flow that simplifies hardware design for software engineers *US Bureau of Labour Statistics ‘08

  5. Top-Level Vision int FIR(int ntaps, int sum) { int i; for (i=0; i < ntaps; i++) sum += h[i] * z[i]; return (sum); } .... [Figure: flow diagram. The C compiler turns the program code into a binary for a self-profiling MIPS processor. Profiling data (execution cycles, power, cache misses) suggests program segments to target to HW. High-level synthesis hardens those segments in the FPGA fabric, and an altered SW binary calls the HW accelerators.]

  6. LegUp: Key Features • C to Verilog high-level synthesis • Many benchmarks (incl. 12 CHStone) • MIPS processor (Tiger) • Hardware profiler • Automated verification tests • Open source, freely downloadable • Like ABC (Synthesis) or VPR (Place & Route) • 600+ downloads since March 2011 • http://legup.eecg.utoronto.ca

  7. System Architecture [Figure: block diagram. On the FPGA (Cyclone II or Stratix IV), a MIPS processor and hardware accelerators with their own memories connect through the Avalon interface to an on-chip cache and a memory controller. The memory controller connects to off-chip memory on the Altera DE2 or DE4 board.]

  8. High-Level Synthesis Framework • Leverage LLVM compiler infrastructure: • Language support: C/C++ • Standard compiler optimizations • More on this shortly • We support a large subset of ANSI C:

  9. Hardware Profiler Architecture • Monitor instr. bus to detect function call/ret. • Call: Hash (in HW) from function address to index; push to stack. • Ret: pop function index from stack. • Use function indexes to associate profiling data (e.g. cycles, power) with counters. • Address hash (in hardware): tAddr += V1; tAddr += (tAddr << 8); tAddr ^= (tAddr >> 4); b = (tAddr >> B1) & B2; a = (tAddr + (tAddr << A1)) >> A2; fNum = (a ^ tab[b]) [Figure: the op decoder watches the PC and instruction stream between the MIPS processor and its instruction cache. On a call, the target address is hashed to a function number and pushed on the call stack; on a ret, the function number is popped. A data counter for the current function increments when the PC changes, is reset on (ret | call), and is accumulated into counter storage memory for all functions.] See paper: IEEE ASAP’11
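
The address hash on this slide can be sketched in C as follows. This is a software model only: the `HashParams` struct and the function name `function_number` are hypothetical, and the constants V1, A1, A2, B1, B2 and the table `tab` are parameters whose actual values are not given on the slide.

```c
#include <stdint.h>

/* Hypothetical container for the hash constants named on the slide.
 * The real values are not given in the slides; the caller supplies them. */
typedef struct {
    uint32_t V1;          /* additive constant */
    unsigned A1, A2;      /* shift amounts for the 'a' term (each < 32) */
    unsigned B1;          /* shift amount for the table index (< 32) */
    uint32_t B2;          /* mask for the table index */
    const uint32_t *tab;  /* lookup table with at least B2 + 1 entries */
} HashParams;

/* Maps a call-target address to a function number, following the
 * slide's steps in order. Arithmetic wraps modulo 2^32. */
uint32_t function_number(uint32_t tAddr, const HashParams *p)
{
    tAddr += p->V1;
    tAddr += (tAddr << 8);
    tAddr ^= (tAddr >> 4);
    uint32_t b = (tAddr >> p->B1) & p->B2;
    uint32_t a = (tAddr + (tAddr << p->A1)) >> p->A2;
    return a ^ p->tab[b];
}
```

In the profiler, the result would be pushed on the call stack at a call and used to select the counter for the current function.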

  10. Processor/Accelerator Hybrid Flow int main () { … sum = dotproduct(N); ... } int dotproduct(int N) { … for (i=0; i<N; i++) { sum += A[i] * B[i]; } return sum; }

  11. Processor/Accelerator Hybrid Flow int main () { … sum = dotproduct(N); ... } int dotproduct(int N) { … for (i=0; i<N; i++) { sum += A[i] * B[i]; } return sum; } #define dotproduct_DATA (volatile int *) 0xf0000000 #define dotproduct_STATUS (volatile int *) 0xf0000008 #define dotproduct_ARG1 (volatile int *) 0xf000000C int legup_dotproduct(int N) { *dotproduct_ARG1 = (volatile int) N; *dotproduct_STATUS = 1; return *dotproduct_DATA; }

  12. Processor/Accelerator Hybrid Flow int main () { … sum = dotproduct(N); ... } [Figure: the setting set_accelerator_function “dotproduct” directs HLS to compile dotproduct into a HW accelerator]

  13. Processor/Accelerator Hybrid Flow int main () { … sum = dotproduct(N); ... } #define dotproduct_DATA (volatile int *) 0xf0000000 #define dotproduct_STATUS (volatile int *) 0xf0000008 #define dotproduct_ARG1 (volatile int *) 0xf000000C int legup_dotproduct(int N) { *dotproduct_ARG1 = (volatile int) N; *dotproduct_STATUS = 1; return *dotproduct_DATA; } sum = legup_dotproduct(N);

  14. Processor/Accelerator Hybrid Flow int main () { … ... } #define dotproduct_DATA (volatile int *) 0xf0000000 #define dotproduct_STATUS (volatile int *) 0xf0000008 #define dotproduct_ARG1 (volatile int *) 0xf000000C int legup_dotproduct(int N) { *dotproduct_ARG1 = (volatile int) N; *dotproduct_STATUS = 1; return *dotproduct_DATA; } sum = legup_dotproduct(N); [Figure: the altered software (SW) runs on the MIPS processor]

  15. How Does LegUp Handle Memory and Pointers? • LegUp stores each array in a separate FPGA BRAM • BRAM data width matches the data in the array • Each BRAM is identified by a 9-bit tag • Addresses consist of the RAM tag and array index: bits 31-23 hold the 9-bit tag, bits 22-0 hold the 23-bit index • A shared memory controller uses the tag bits to determine which BRAM to read from or write to • The array index is the address passed to the BRAM

  16. Pointer Example • We have two arrays in the C function: • int A[100], B[100] • Tag 0 is reserved for NULL pointers • Tag 1 is reserved for off-chip memory • Assign tag 2 to array A and tag 3 to array B • Address of A[3]: Tag=2 (bits 31-23), Index=3 (bits 22-0) • Address of B[7]: Tag=3 (bits 31-23), Index=7 (bits 22-0)
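
A minimal sketch of the tag/index address layout, assuming the 9-bit tag sits in bits 31-23 and the 23-bit index in bits 22-0 as the slides show. The helper names `make_address`, `address_tag`, and `address_index` are hypothetical and not part of LegUp.

```c
#include <stdint.h>

#define TAG_BITS   9u
#define INDEX_BITS 23u
#define TAG_MASK   ((1u << TAG_BITS) - 1u)     /* 0x000001FF */
#define INDEX_MASK ((1u << INDEX_BITS) - 1u)   /* 0x007FFFFF */

/* Packs a BRAM tag and an array index into one 32-bit address. */
uint32_t make_address(uint32_t tag, uint32_t index)
{
    return ((tag & TAG_MASK) << INDEX_BITS) | (index & INDEX_MASK);
}

/* Extracts the tag (bits 31-23). */
uint32_t address_tag(uint32_t addr)
{
    return addr >> INDEX_BITS;
}

/* Extracts the index (bits 22-0). */
uint32_t address_index(uint32_t addr)
{
    return addr & INDEX_MASK;
}
```

With tag 2 for A and tag 3 for B, `make_address(2, 3)` gives 0x01000003 for A[3] and `make_address(3, 7)` gives 0x01800007 for B[7].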

  17. Shared Memory Controller • Both arrays A and B have 100-element BRAMs • Load from pointer D, with Tag=2 (bits 31-23) and Index=13 (bits 22-0) [Figure: index 13 addresses both BRAM Tag=2 (A[0] to A[99]) and BRAM Tag=3 (B[0] to B[99]); the tag selects between the two 32-bit outputs A[13] and B[13], so the load returns A[13]]
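
The load path can be modelled in software roughly as below. This is a sketch under assumptions, not LegUp's generated hardware: the function names, the return-code convention, and the decision to reject tags 0 and 1 are all hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

enum { TAG_NULL = 0, TAG_OFFCHIP = 1, TAG_A = 2, TAG_B = 3 };
enum { BRAM_DEPTH = 100 };

static int32_t bram_A[BRAM_DEPTH];   /* models BRAM Tag=2 */
static int32_t bram_B[BRAM_DEPTH];   /* models BRAM Tag=3 */

/* Models the tag-driven select: returns the BRAM for a tag, or NULL.
 * NULL pointers (tag 0) and off-chip memory (tag 1) are not modelled. */
static int32_t *select_bram(uint32_t tag)
{
    switch (tag) {
    case TAG_A: return bram_A;
    case TAG_B: return bram_B;
    default:    return NULL;
    }
}

/* Loads the word a tagged address points to. Returns 0 on success, -1 otherwise. */
int controller_load(uint32_t addr, int32_t *out)
{
    int32_t *bram = select_bram(addr >> 23);     /* tag: bits 31-23 */
    uint32_t index = addr & 0x007FFFFFu;         /* index: bits 22-0 */
    if (bram == NULL || index >= BRAM_DEPTH)
        return -1;
    *out = bram[index];
    return 0;
}

/* Stores a word through a tagged address. Returns 0 on success, -1 otherwise. */
int controller_store(uint32_t addr, int32_t value)
{
    int32_t *bram = select_bram(addr >> 23);
    uint32_t index = addr & 0x007FFFFFu;
    if (bram == NULL || index >= BRAM_DEPTH)
        return -1;
    bram[index] = value;
    return 0;
}
```

A load from pointer D with Tag=2 and Index=13 corresponds to `controller_load((2u << 23) | 13u, &value)`, which reads A[13].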

  18. Core Benchmarks (+Many More) • 12 CHStone Benchmarks (JIP’09) and Dhrystone • Too large/complex for academic HLS tools • Include golden input/output test vectors • Not supported by academic tools

  19. Experimental Results: LegUp 1.0 (2011) for Cyclone II • (1) Pure software on MIPS • Hybrid (software/hardware): • (2) Second most compute-intensive function (and descendants) in H/W • (3) Same as 2 but with the most compute-intensive function • (4) Pure hardware using LegUp • (5) Pure hardware using eXCite (commercial tool)

  20. Experimental Results

  21. Comparison: LegUp vs. eXCite • Benchmarks compiled to hardware • eXCite: Commercial high-level synthesis tool • Couldn’t compile Dhrystone

  22. Energy Consumption 18x less energy than software

  23. Current Release: LegUp 3.0 • Loop pipelining • Dual and multi-ported memory support • Bitwidth minimization • Multi-pumping DSP units for area reduction • Alias analysis for dependency checks • Parallel accelerators via Pthreads & OpenMP • Results now considerably better than LegUp 1.0 release

  24. LegUp 3.0 vs. LegUp 1.0

  25. LLVM Compiler and HLS Algorithms

  26. LLVM Compiler • Open-source compiler framework. • http://llvm.org • Used by Apple, NVIDIA, AMD, others. • Competitive quality with gcc. • LegUp HLS is a “back-end” of LLVM. • LLVM: low-level virtual machine.

  27. LLVM Compiler • LLVM will compile C code into a control flow graph (CFG) • LLVM will perform standard optimizations • 50+ different optimizations in LLVM [Figure: the C program int FIR(int ntaps, int sum) { int i; for (i=0; i < ntaps; i++) sum += h[i] * z[i]; return sum; } .... is compiled by LLVM into a CFG with basic blocks BB0, BB1, BB2]

  28. Control Flow Graph • Control flow graph is composed of basic blocks • A basic block is a sequence of instructions terminated with exactly one branch • Each basic block can be represented by an acyclic data flow graph [Figure: CFG with basic blocks BB0, BB1, BB2; one block is expanded into a data flow graph of three loads, two adds, and a store]

  29. LLVM Details • Instructions in basic blocks are primitive computational operations: • shift, add, divide, xor, and, etc. • Or are control-flow operations: • branch, call, etc. • The control/data flow graph (CDFG) is represented in LLVM’s intermediate representation (IR) • IR is machine-independent assembly code.

  30. High-Level Synthesis Flow [Figure: C Program, then C Compiler (LLVM), then Optimized LLVM IR, then Allocation, Scheduling, Binding, and RTL Generation, producing Synthesizable Verilog. Additional inputs: Target H/W Characterization and User Constraints (Timing, Resource).]

  31. Scheduling • Scheduling is the task of assigning operations to clock cycles, sequenced by a finite state machine (FSM) [Figure: an FSM with states 0 to 3 beside a schedule of three loads, two adds, and a store]

  32. Binding • Binding is the task of assigning scheduled operations to functional units in the datapath [Figure: the schedule (three loads, two adds, a store) mapped onto a datapath with a 2-port RAM, an adder, and a flip-flop (FF)]

  33. High-Level Synthesis: Scheduling

  34. SDC Scheduling • SDC: System of Difference Constraints • Cong, Zhang, “An efficient and versatile scheduling algorithm based on SDC formulation”. DAC 2006: 433-438. • Basic idea: formulate scheduling as a mathematical optimization problem • Linear objective function + linear constraints (==, <=, >=). • The problem is a linear program (LP) • Solvable in polynomial time with standard solvers

  35. Define Variables • For each operation i to schedule, create a variable t_i. • The t_i’s will hold the cycle # in which each op is scheduled. • Here we have: • t_add, t_shift, t_sub [Figure: data flow graph (DFG) with add (+), shift (<<), and subtract (-) nodes] • The DFG is already accessible in LLVM.

  36. Dependency Constraints • In this example, the subtract can only happen after the add and shift. • t_sub - t_add >= 0 • t_sub - t_shift >= 0 • Hence the name difference constraints. [Figure: DFG in which add and shift both feed sub]
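
One way to hold such constraints in code is as (after, before, minimum gap) triples. The sketch below encodes the slide's two dependency constraints and checks a candidate schedule against them; the `DiffConstraint` type, the operation enum, and `schedule_is_legal` are hypothetical names, not LegUp's data structures.

```c
#include <stddef.h>

/* One difference constraint: t[after] - t[before] >= min_gap. */
typedef struct {
    int after;
    int before;
    int min_gap;
} DiffConstraint;

/* Operation indices for the slide's three-node example. */
enum { OP_ADD, OP_SHIFT, OP_SUB, NUM_OPS };

/* The subtract may not be scheduled before the add or the shift.
 * A gap of 0 allows dependent operations to share a cycle. */
static const DiffConstraint dependency_constraints[] = {
    { OP_SUB, OP_ADD,   0 },
    { OP_SUB, OP_SHIFT, 0 },
};

/* Returns 1 if cycle assignment t satisfies all n constraints, else 0. */
int schedule_is_legal(const int *t, const DiffConstraint *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (t[c[i].after] - t[c[i].before] < c[i].min_gap)
            return 0;
    }
    return 1;
}
```

For example, the assignment {add: 0, shift: 0, sub: 0} is legal, while {add: 1, shift: 0, sub: 0} is not because the subtract would run before the add.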

  37. Handling Clock Period Constraints • Target period: P (e.g., 10 ns) • For each chain of dependent operations in DFG, estimate the path delay D (LegUp’s models) • E.g.: D from mod -> or = 23 ns. • Compute: R = ceiling(D/P) - 1 • E.g.: R = ceiling(23/10) - 1 = 2 • Add the difference constraint: • t_or - t_mod >= 2 [Figure: chain of dependent operations mod, xor, shr, or]
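
The formula R = ceiling(D/P) - 1 can be computed with integer arithmetic, as in this small sketch. The function name `min_cycle_gap` and the choice of picoseconds as the unit are assumptions made here to avoid floating-point rounding.

```c
/* Minimum number of cycles that must separate the first and last
 * operation of a chain with total delay delay_ps, for a target clock
 * period period_ps. Implements R = ceiling(D / P) - 1. */
int min_cycle_gap(long delay_ps, long period_ps)
{
    if (delay_ps <= 0 || period_ps <= 0)
        return 0;                                           /* nothing to constrain */
    long cycles = (delay_ps + period_ps - 1) / period_ps;   /* ceiling(D / P) */
    return (int)(cycles - 1);
}
```

For the slide's example, `min_cycle_gap(23000, 10000)` yields 2, which becomes the right-hand side of the constraint on the or and mod operations. A chain that fits in one period yields 0, so its operations may share a cycle.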

  38. Resource Constraints • Restriction on # of operations of a given type that can execute in a cycle • Why do we need them? • Want to use dual-port RAMs in FPGA • Allow up to 2 load/store operations in a cycle • Floating point • Do not want to instantiate many FP cores of a given type, probably just one • Scheduling must honour # of FP cores available

  39. Resource Constraints in SDC • Resource-constrained scheduling is NP-hard. • We implement the approach in [Cong & Zhang DAC2006] • Say we want to schedule with only 2 adders in the HW (Lab 2) [Figure: data flow graph of eight add operations labelled A to H]

  40. Add SDC Constraints • Generate a topological ordering of the resource-constrained operations: A, B, C, E, F, D, G, H • Say constrained to 2 adders in HW. • Each operation is constrained against the one 2 positions earlier in the ordering • Starting at C in the ordering, create a constraint: t_C - t_A > 0 • Next consider E and add the constraint: t_E - t_B > 0 • Continue to the end • Resulting schedule will have <= 2 adds / cycle
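
The constraint-generation step can be sketched as a single loop over the topological ordering: with `units` functional units, the operation at position i is constrained against the operation at position i - units. The names `DiffConstraint` and `add_resource_constraints` are hypothetical; the strict inequality t_X - t_Y > 0 is written as t_X - t_Y >= 1 because cycle numbers are integers.

```c
#include <stddef.h>

/* One difference constraint: t[after] - t[before] >= min_gap. */
typedef struct {
    int after;
    int before;
    int min_gap;
} DiffConstraint;

/* Writes resource constraints for n operations listed in topological
 * order, given `units` functional units. `out` must have room for n
 * entries. Returns the number of constraints written. */
size_t add_resource_constraints(const int *order, size_t n, size_t units,
                                DiffConstraint *out)
{
    size_t count = 0;
    if (units == 0)
        return 0;                          /* no units: nothing sensible to emit */
    for (size_t i = units; i < n; i++) {
        out[count].after = order[i];
        out[count].before = order[i - units];
        out[count].min_gap = 1;            /* strictly later cycle */
        count++;
    }
    return count;
}
```

With operations A to H numbered 0 to 7 and the ordering A, B, C, E, F, D, G, H, two units produce six constraints, beginning with C after A and E after B. Any three operations include two whose positions in the ordering have the same parity, and those two are linked by a chain of strict constraints, so no cycle can hold more than two adds.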

  41. ASAP Objective Function • Minimize the sum of the variables: minimize the sum of t_i over all operations i • Operations will be scheduled as early as possible, subject to the constraints • The LP is solvable in polynomial time
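
The slides solve the scheduling problem with an LP solver. As an illustrative substitute, the sketch below uses longest-path relaxation (Bellman-Ford style) instead: when every constraint has the form t_after - t_before >= gap and every t_i >= 0, repeated relaxation from all zeros reaches the smallest feasible value of each variable, which also minimizes their sum. The names `DiffConstraint` and `solve_asap` are hypothetical, and this is not LegUp's implementation.

```c
#include <stddef.h>

/* One difference constraint: t[after] - t[before] >= min_gap. */
typedef struct {
    int after;
    int before;
    int min_gap;
} DiffConstraint;

/* Computes the as-soon-as-possible schedule for num_ops operations.
 * Fills t with cycle numbers and returns 0, or returns -1 if the
 * constraints contradict each other (for example a dependency cycle
 * with positive total gap). */
int solve_asap(int *t, size_t num_ops, const DiffConstraint *c, size_t num_c)
{
    for (size_t i = 0; i < num_ops; i++)
        t[i] = 0;                                  /* earliest possible cycle */

    /* A feasible system settles within num_ops passes; one extra pass
     * is allowed before declaring the constraints contradictory. */
    for (size_t pass = 0; pass <= num_ops; pass++) {
        int changed = 0;
        for (size_t k = 0; k < num_c; k++) {
            int earliest = t[c[k].before] + c[k].min_gap;
            if (t[c[k].after] < earliest) {
                t[c[k].after] = earliest;          /* push the op later */
                changed = 1;
            }
        }
        if (!changed)
            return 0;                              /* all constraints hold */
    }
    return -1;
}
```

Dependency, clock-period, and resource constraints can all be passed in the same array, since each is a difference constraint of the same form.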

  42. High-Level Synthesis: Binding

  43. High-Level Synthesis: Binding • Weighted bipartite matching-based binding • Huang, Chen, Lin, Hsu, “Data path allocation based on bipartite weighted matching”. DAC 1990: 499-504. • Finds the minimum weighted matching of a bipartite graph at each step • Solve using the Hungarian Method (polynomial) [Figure: bipartite graph with operations on one side, hardware functional units on the other, and edge costs between them]
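
To make the matching step concrete, the sketch below finds a minimum-weight assignment of the operations in one cycle to functional units. It uses exhaustive search with pruning rather than the Hungarian Method named on the slide, so it is only practical for a handful of units. The names `bind_cycle` and `MAX_UNITS` are hypothetical, and the meaning of the edge costs (for instance, extra multiplexer inputs a binding would add) is an assumption; the slides only call them edge costs.

```c
#include <limits.h>

#define MAX_UNITS 8   /* exhaustive search is only sensible for small n */

/* Tries every unused unit for operation `op`, keeping the cheapest
 * complete assignment seen so far. Costs are assumed non-negative,
 * which makes the pruning test valid. */
static void search(int n, int cost[][MAX_UNITS], int op, unsigned used,
                   int partial, int *current, int *best_cost, int *best)
{
    if (partial >= *best_cost)
        return;                                   /* cannot beat the best */
    if (op == n) {
        *best_cost = partial;
        for (int i = 0; i < n; i++)
            best[i] = current[i];
        return;
    }
    for (int u = 0; u < n; u++) {
        if (used & (1u << u))
            continue;                             /* unit already taken */
        current[op] = u;
        search(n, cost, op + 1, used | (1u << u),
               partial + cost[op][u], current, best_cost, best);
    }
}

/* Binds n operations to n functional units for one cycle.
 * cost[op][unit] is the edge weight. Writes the chosen unit for each
 * operation into unit_of_op and returns the total cost, or -1 if n is
 * out of range. */
int bind_cycle(int n, int cost[][MAX_UNITS], int *unit_of_op)
{
    int current[MAX_UNITS];
    int best_cost = INT_MAX;
    if (n < 0 || n > MAX_UNITS)
        return -1;
    search(n, cost, 0, 0u, 0, current, &best_cost, unit_of_op);
    return best_cost;
}
```

Binding a whole schedule would call this once per cycle, updating the costs between cycles to reflect the bindings already made. The Hungarian Method gives the same answer in polynomial time.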

  44. Binding • Bind the following scheduled program

  45. Binding • Resource Sharing: requires 3 multipliers

  46. Binding • Functional Units • Bind the first cycle • 1 • 1 • 1

  47. Binding • Functional Units • Bind the second cycle • 2 • 2 • 1

  48. Binding • Functional Units • Bind the third cycle • 2 • 2 • 2

  49. Binding • Functional Units • Bind the fourth cycle • 3 • 2 • 2

  50. Binding • Functional Units • Required Multiplexing: • 3 • 2 • 2
