From Software to Circuits: High-Level Synthesis for FPGA-Based Processor/Accelerator Systems

Jason Anderson • Dept. of Electrical and Computer Engineering, University of Toronto • Tools to Tackle Big Data – Big Data Workshop, 3 July 2014

LegUp Research Team • Andrew Canis, James Choi, Nazanin Calagar, Lanny Lian, Blair Fort • Undergrad Researchers: Mathew Hall, Stefan Hadjis, Joy Chen • Faculty: Stephen Brown and myself • Industry Liaison: Tomasz Czajkowski, Altera

Computations in Two Ways • Write Software • Design Custom Circuits

Design Methodology • Write software: • Easy • Flexibility → lower performance • Design custom circuits: • Efficient, low power • Need specialized knowledge

Hardware’s Potential • Implementing computations in FPGA hardware can have speed/energy advantages over software: • Lithography simulation: 15X speed-up [Cong & Zou, TRETS’09] • Linear system solver: 2.2X speed-up, 5X more energy efficient [Zhang, Betz, Rose, TRETS’12] • Monte Carlo simulation for photodynamic therapy: 80X faster, 45X more energy efficient [Lo et al., J. Biomed Optics’09] • Options pricing: 4.6X faster, 25X more energy efficient [Tse, Thomas, Luk, TVLSI’12]

So Why Doesn’t Everybody Use Hardware? • Hardware design is difficult and skills are rare: • Requires use of hardware description languages: Verilog and VHDL • Low level of abstraction (individual bits) • 10 software engineers for every hardware engineer* • We need a CAD flow that simplifies hardware design for software engineers *US Bureau of Labor Statistics 2012

A Solution • High-Level Synthesis • Design circuits using software languages • From a software program, a high-level synthesis tool automatically “synthesizes” a circuit that does the same computations as the program • Benefits of software programmability and hardware performance

LegUp High-Level Synthesis for FPGAs • LegUp is a high-level synthesis tool we have been developing since 2009. • Takes a C program as input, and produces a circuit. • 1000+ downloads of our tool since its first release in 2011. • http://legup.eecg.toronto.edu

legup.eecg.toronto.edu

Why Use FPGAs to Implement Circuits? • Building fully fabricated custom chips is hard • Very complex design process • Costs $millions to prototype a chip • Takes 2-3 months to fabricate • Only done for high volume applications or apps that require high speed or lowest power • Alternative: pre-fabricated, programmable chips Field-Programmable Gate Arrays (FPGAs)

Field-Programmable Gate Arrays • Pre-fabricated chip consists of an “array” of logic blocks surrounded by programmable interconnect • Hardware “becomes” what you want by programming blocks and interconnect (electrically) [Figure: FPGA floorplan showing a grid of configurable logic blocks (CLBs), columns of block RAM (SRAM blocks, e.g., 18 kbits), hard IP blocks (common blocks: multiplier, DSP, processor, PCI, ADC, DLL), and channels of programmable interconnect]

A Real FPGA – Altera Stratix III

FPGA Advantages over “Hard” Chips • “Manufacture” takes seconds vs. months • Design, test and manufacture: $single-digit millions vs. $tens of millions • Giving: • Faster time-to-market for products • FPGA vendor handles difficult design & manufacture issues • FPGA vendor shares inventory risk across many customers • FPGA vendor does test • Two largest FPGA vendors: Xilinx and Altera

FPGAs and High-Level Synthesis • FPGAs mainly accessible to HW engineers • Vendors want to expand user base: make FPGAs usable as computing platforms • Area/power/delay gap between HLS-generated HW and manually crafted HW • In custom Si, user must “pay” for area gap • Power/performance one of main reasons to go custom • FPGAs likely the IC medium through which HLS goes “mainstream”

LegUp: Top-Level Vision • Program code: int FIR(int ntaps, int sum) { int i; for (i=0; i < ntaps; i++) sum += h[i] * z[i]; return (sum); } .... [Figure: design flow. Program code → C Compiler → Self-Profiling Processor (MIPS/ARM) → Profiling Data (execution cycles, power, cache misses) → Suggested program segments to target to HW → High-level synthesis → Hardened program segments in the FPGA fabric next to the processor (P), together with an altered SW binary that calls the HW accelerators]
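The FIR routine on the slide uses the arrays h and z without declaring them. A minimal self-contained sketch follows, assuming hypothetical global arrays of NTAPS entries; the array size and the values are illustrative only and do not come from the talk.

```c
#include <assert.h>

#define NTAPS 4  /* assumed filter length, for illustration only */

/* Assumed globals: the slide uses h and z without declaring them. */
static const int h[NTAPS] = {1, 2, 3, 4};  /* filter coefficients */
static const int z[NTAPS] = {5, 6, 7, 8};  /* most recent input samples */

/* Multiply-accumulate loop from the slide: adds the dot product of the
   first ntaps entries of h and z to the running value sum. */
int FIR(int ntaps, int sum) {
    int i;
    assert(ntaps >= 0 && ntaps <= NTAPS);  /* stay inside the arrays */
    for (i = 0; i < ntaps; i++)
        sum += h[i] * z[i];
    return sum;
}
```

With these assumed values, FIR(NTAPS, 0) returns 1*5 + 2*6 + 3*7 + 4*8 = 70. A multiply-accumulate loop like this is the kind of program segment the flow would suggest moving to hardware.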

LegUp: Key Features • C to Verilog high-level synthesis • Many benchmarks (incl. 12 CHStone) • Automated verification tests • Support for four different FPGAs: • Altera Cyclone II, Stratix IV, Cyclone IV, Cyclone V-SoC • Open source, freely downloadable

How Does High-Level Synthesis Work?

Digital Circuits • Example: you buy a “1 GHz processor” • 1 GHz = 1 nanosecond time steps • Some computation is done in each time step [Figure: time axis divided into 1 ns steps]
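The 1 ns figure follows from the clock period being the reciprocal of the clock frequency:

```latex
T = \frac{1}{f} = \frac{1}{10^{9}\ \text{Hz}} = 10^{-9}\ \text{s} = 1\ \text{ns}
```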

Example Circuit • Calculate A+B • Store computation after each step [Figure: inputs A and B feed an adder inside one 1 ns time step]

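Example Circuit • Computes (A+B)*(C–D) – (E*F) [Figure: inputs A to F over three 1 ns time steps. Step 1: A+B, C–D and E*F. Step 2: (A+B)*(C–D). Step 3: the final subtraction]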
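One way to model the three-step circuit in software is to treat each time step as a function whose return value is what gets stored in registers at the end of that step. This is only a sketch: the struct and function names are hypothetical, and int arithmetic stands in for whatever bit widths the real circuit would use.

```c
/* Values stored in registers at the end of time step 1. */
typedef struct { int sum, diff, prod; } Stage1;

/* Values stored in registers at the end of time step 2. */
typedef struct { int left, right; } Stage2;

/* Step 1: three independent operations run side by side. */
Stage1 step1(int a, int b, int c, int d, int e, int f) {
    Stage1 r = { a + b, c - d, e * f };
    return r;
}

/* Step 2: multiply the two step-1 results; E*F is simply carried forward. */
Stage2 step2(Stage1 s) {
    Stage2 r = { s.sum * s.diff, s.prod };
    return r;
}

/* Step 3: the final subtraction. */
int step3(Stage2 s) {
    return s.left - s.right;
}

/* The whole circuit: (A+B)*(C-D) - (E*F), three time steps end to end. */
int example_circuit(int a, int b, int c, int d, int e, int f) {
    return step3(step2(step1(a, b, c, d, e, f)));
}
```

For A=1, B=2, C=7, D=3, E=2, F=5 the steps produce (3, 4, 10), then (12, 10), then 2.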

Scheduling: Key Aspect of HLS • How to assign the computations of a program into the hardware time steps? • C language snippet: z = a+b; x = c+d; q = z+x; q = q-2; r = q*2; • Programs do not contain the notion of “time steps”. • Here, we have: 3 add operations, 1 subtract operation, 1 multiplication operation

Scheduling Questions: • Which operations can be scheduled in the same time step? • Which operations are dependent on others? • If addition takes 5 ns, subtraction takes 5 ns and multiplication takes 10 ns, how to schedule? • Target clock step length is 10 ns • C language snippet: z = a+b; x = c+d; q = z+x; q = q-2; r = q*2;

Scheduling [Figure: schedule over three 10 ns time steps. Step 1: a+b and c+d (parallel operations). Step 2: z+x followed by –2 in the same step (chaining). Step 3: *2]
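The schedule in the figure can be checked mechanically. The sketch below is a legality checker, not a scheduler: it confirms that every operation is placed no earlier than the operations it depends on, and that chained operations within one step fit in the clock period. It uses the delays from the previous slide (5 ns add/subtract, 10 ns multiply). The data layout and function name are hypothetical, and resource limits are ignored.

```c
#include <assert.h>

#define NUM_OPS 5
#define MAX_DEPS 2

/* One operation: its delay and the earlier operations whose results it reads. */
typedef struct {
    int delay_ns;
    int num_deps;
    int deps[MAX_DEPS];
} Op;

/* z = a+b; x = c+d; q = z+x; q = q-2; r = q*2; listed in program order. */
static const Op ops[NUM_OPS] = {
    {  5, 0, {0, 0} },  /* 0: z = a + b           */
    {  5, 0, {0, 0} },  /* 1: x = c + d           */
    {  5, 2, {0, 1} },  /* 2: q = z + x (ops 0,1) */
    {  5, 1, {2, 0} },  /* 3: q = q - 2 (op 2)    */
    { 10, 1, {3, 0} },  /* 4: r = q * 2 (op 3)    */
};

/* Returns 1 if step[i] (the time step of operation i) is a legal schedule
   for the given clock period, 0 otherwise. */
int schedule_is_legal(const int step[NUM_OPS], int period_ns) {
    int finish[NUM_OPS];  /* time within its own step when each result is ready */
    for (int i = 0; i < NUM_OPS; i++) {
        int start = 0;  /* results from earlier steps come from registers at time 0 */
        for (int k = 0; k < ops[i].num_deps; k++) {
            int d = ops[i].deps[k];
            assert(d < i);  /* program order is a valid dependency order */
            if (step[d] > step[i])
                return 0;   /* an input would be produced in a later step */
            if (step[d] == step[i] && finish[d] > start)
                start = finish[d];  /* chaining: wait for the same-step producer */
        }
        finish[i] = start + ops[i].delay_ns;
        if (finish[i] > period_ns)
            return 0;       /* the chain does not fit in one clock period */
    }
    return 1;
}
```

The figure's assignment, steps {0, 0, 1, 1, 2}, passes with a 10 ns period. Packing the first four operations into one step, {0, 0, 0, 0, 1}, fails because the chain of three 5 ns operations needs 15 ns.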

HLS Challenges • Performance of HLS-generated circuits not as good as human-designed circuits • However, HLS-generated circuits are already better than SW in many cases • Much of our research is aimed towards improving HLS quality

Loop Pipelining

Loop Pipelining • Cycles: 3N • Adders: 3 • Utilization: 33% • for (int i = 0; i < N; i++) { sum[i] = a + b + c + d; } [Figure: schedule of one iteration. Cycle 1: a+b. Cycle 2: +c. Cycle 3: +d]

Loop Pipelining Steady State • Cycles: N+2 (~1 cycle per iteration) • Adders: 3 • Utilization: 100% in steady state
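The two cycle counts (3N without pipelining, N+2 with it) can be reproduced with a small calculation. This sketch assumes a fixed number of cycles per iteration (the depth, 3 in the slides) and a new iteration starting every cycle in the pipelined case. The function names are hypothetical.

```c
#include <assert.h>

/* Cycles when each iteration must finish before the next one starts. */
long cycles_sequential(long n, long depth) {
    assert(n >= 0 && depth >= 1);
    return n * depth;
}

/* Cycles when a new iteration starts every cycle: the first iteration
   takes depth cycles, and each later iteration adds one more cycle. */
long cycles_pipelined(long n, long depth) {
    assert(n >= 0 && depth >= 1);
    return n == 0 ? 0 : depth + (n - 1);
}
```

For N = 100 and a depth of 3, the sequential loop takes 300 cycles and the pipelined loop takes 102, matching 3N and N+2.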

Loop Pipelining • Ideally, we could start a loop iteration every clock cycle • Initiation interval (II) = 1 • However, • Loops may have dependencies across iterations • There may be constraints on resources • e.g. only two memory accesses in a cycle • Loop pipelining seeks to minimize II subject to constraints
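The two kinds of constraint on the slide correspond to two standard lower bounds on II from the modulo scheduling literature, usually called ResMII (resources) and RecMII (recurrences). The sketch below handles a single resource type and a single cross-iteration dependency. The function names are hypothetical, and this is not necessarily how LegUp computes II.

```c
#include <assert.h>

/* Integer ceiling of a / b for a >= 0 and b > 0. */
static int ceil_div(int a, int b) {
    return (a + b - 1) / b;
}

/* Resource bound: if one iteration issues `accesses` memory accesses and
   the memory allows `ports` accesses per cycle, iterations cannot start
   more often than once every ceil(accesses / ports) cycles. */
int res_mii(int accesses, int ports) {
    assert(accesses >= 0 && ports > 0);
    return ceil_div(accesses, ports);
}

/* Recurrence bound: if a value takes `latency` cycles to compute and is
   consumed `distance` iterations later, II >= ceil(latency / distance). */
int rec_mii(int latency, int distance) {
    assert(latency >= 0 && distance > 0);
    return ceil_div(latency, distance);
}

/* Lower bound on II: the larger of the two bounds, and never below 1. */
int min_ii(int accesses, int ports, int latency, int distance) {
    int r = res_mii(accesses, ports);
    int c = rec_mii(latency, distance);
    int ii = r > c ? r : c;
    return ii < 1 ? 1 : ii;
}
```

For example, a loop body with four memory accesses and the slide's limit of two accesses per cycle cannot do better than II = 2, even if it has no dependencies across iterations.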

Exploiting Spatial Parallelism

Motivation • Speed benefits of HW arise from spatial parallelism • Extracting parallelism from a sequential program is difficult • Auto-parallelizing compilers do not work well! • Easier to start from parallel code • Pthreads/OpenMP can help!
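As an illustration of "starting from parallel code", the sketch below marks a loop as parallel with an OpenMP pragma, so the independence of its iterations is stated by the programmer rather than discovered by the compiler. The function name is hypothetical, and how a particular HLS tool maps such a loop to parallel hardware is outside the scope of this sketch.

```c
#include <assert.h>

/* Dot product with the parallelism stated explicitly. The pragma asks an
   OpenMP-enabled compiler to split the iterations across threads and to
   combine the per-thread partial sums. A compiler without OpenMP enabled
   ignores the pragma and runs the loop sequentially with the same result. */
long dot_product(const int *a, const int *b, int n) {
    long sum = 0;
    assert(n >= 0);
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += (long)a[i] * b[i];
    return sum;
}
```

The reduction clause matters: without it, all threads would update sum concurrently, which is a data race.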
