Center for Embedded Computer Systems University of California, Irvine and San Diego

Hardware and Interface Synthesis of FPGA Blocks using Parallelizing Code Transformations Sumit Gupta Manev Luthra, Nikil Dutt, Alex Nicolau Rajesh Gupta Center for Embedded Computer Systems University of California, Irvine and San Diego http://www.cecs.uci.edu/~spark Supported by Semiconductor Research Corporation

Interface Synthesis Overview of a HW/SW Co-design Flow Application Specs Logic Synthesis and P&R RTL VHDL High Level Synthesis Hardware Behavioral Description C HW/SW Partitioning ASIC FPGA Memory I/O Software Behavioral Description Software Compiler Processor Core

Outline • Motivation and Background • Our Approach to Parallelizing High-Level Synthesis • Our Approach to Interface Synthesis using Memory Mapping • Experimental Results • HLS: Multimedia and Image Processing Applications • Interface Synthesis: Case Study of MPEG-1 block • Conclusions and Future Work

M e m o r y Control ALU Poor quality of HLS results beyond straight-line behavioral descriptions Problem # 1 : Data path Problem # 2 : Poor/No controllability of the HLS results High Level Synthesis Transform behavioral descriptions to RTL/gate level From C to CDFG to Architecture x = a + b c = a < b x = a + b; c = a < b; if (c) then d = e – f; else g = h + i; j = d x g; l = e + x; If Node c T F d = e - f g = h + i j = d x g l = e + x

High-level Synthesis • Well-researched area: from early 1980’s • Renewed interest due to new system level design methodologies • Large number of synthesis optimizations have been proposed • Either operation level: algebraic transformations on DSP codes • or logic level: Don’t Care based control optimizations • In contrast, compiler transformations operate at both operation level (fine-grain) and source level (coarse-grain) • Parallelizing Compiler Transformations • Different optimization objectives and cost models than HLS • Our aim: Develop Synthesis and Parallelizing Compiler Transformations that are “useful” for HLS • Beyond scheduling results: in Circuit Area and Delay • For large designs with complex control flow (nested conditionals/loops)

C Input VHDL Output Original CDFG Scheduling & Binding Optimized CDFG Source-Level Compiler Transformations Scheduling Compiler & Dynamic Transformations Our Approach: Parallelizing HLS (PHLS) • Optimizing Compiler and Parallelizing Compiler transformations applied at Source-level (Pre-synthesis) and during Scheduling • Source-level code refinement using Pre-synthesis transformations • Code Restructuring by Speculative Code Motions • Operation replication to improve concurrency • Dynamic transformations: exploit new opportunities during scheduling

SPARKHigh Level Synthesis Framework

SPARK Parallelizing HLS Framework • C input and Synthesizable RTL VHDL output • Tool-box of Transformations and Heuristics • Each of these can be developed independently of the other • Script based control over transformations & heuristics • Hierarchical Intermediate Representation (HTGs) • Retains structural information about design (conditional blocks, loops) • Enables efficient and structured application of transformations • Complete HLS tool: Does Binding, Control Synthesis and Backend VHDL generation • Interconnect Minimizing Resource Binding • Enables Graphical Visualization of Design description and intermediate results • 100,000+ lines of C++ code

Interface Synthesis • “Glue” logic to make the hardware and software talk to each other • Data must be transferred or shared between the hardware and software • Our approach to Interface Synthesis: • Map common data to a shared memory • Our focus is on optimizing the number and size of memory banks used for mapping • Target architecture: Platform with FPGA and Processor core on same FPGA fabric • Take advantage of the Memory Blocks available on such FPGA platforms for implementing the shared memory

The Interface Hardware • System architecture illustrating the interface HW Data exchanged between HW and SW resides in the shared memory Processor Core Main Memory Peripheral Shared Memory Interface Logic Peripheral Interface Hardware Custom Logic CPU Bus

Target HW Interface Architecture FPGA Fabric Bus Interface Appl. Kernel Memory Controller Processor Core Appl. Kernel Control Logic Mapped Local Memory CPU Bus Multiple Ports Single Port

Efficient use of FPGA resources by utilizing its abundant memory blocks Mem K Mem 1 Var 3 Mem 2 Var N Var 2 Var 1 K:2M Mux FU1 FUM An Example of Memory Mapping Inefficient use of the FPGA fabric by implementing independent registers Local Memory Var 1 Var 2 Var N N:2M Mux FU1 FUM

Condition resolved at run time Mapping Designs with Control Flow • Mapping variables to memory banks is a well-known technique • What’s new here: Use of • Scheduling info from High Level Synthesis tool • Mutual exclusivity info in designs with control flow • Data access patterns resolved only during run time • So how do we perform memory mapping at compile time ? If Node T F BB 2 BB 1 C0 b c a a b C1 b a c a BB 3 Access Pattern: {(a, b), (a, c)} or {(a, b, c), (a, b)}

Our Approach to Memory Mapping • We use • Scheduling information from SPARK Synthesis tool • Cycles in which variables are accessed (read/write) • Mutual exclusivity information from the control flow graph • The conditions under which variables are accessed • Capture this information using Multi-color Conflict Graphs • Then create a Variable to Memory Bank Mapping that minimizes the number of memory instances • Details available in ICCD 2003 paper

Experiments for SPARK HLS Tool • Results presented here for • Pre-synthesis transformations • Speculative Code Motions • Dynamic CSE • We used SPARK to synthesize designs derived from several industrial designs • MPEG-1, MPEG-2, GIMP Image Processing software • Case Study of Intel Instruction Length Decoder • Scheduling Results • Number of States in FSM • Cycles on Longest Path through Design • VHDL: Logic Synthesis • Critical Path Length (ns) • Unit Area

Target Applications

36% 39% 42% 36% 10% 8% Overall: 63-66 % improvement in Delay Almost constant Area Scheduling & Logic Synthesis Results Non-speculative CMs: Within BBs & Across Hier Blocks + Pre-Synthesis Transforms + Speculative Code Motions + Dynamic CSE

33% 52% 20% 1% 41% 14% Overall: 48-76 % improvement in Delay Almost constant Area Scheduling & Logic Synthesis Results Non-speculative CMs: Within BBs & Across Hier Blocks + Pre-Synthesis Transforms + Speculative Code Motions + Dynamic CSE

Interface Synthesis: Case Study • Target Platform – Altera’s Nios development board @33.33 MHz • Soft core of the Nios embedded processor (no caches) • Library of peripheral IP’s like timer, UART, SRAM controller • Altera FPGA with 8320 Logic Cells • Tools • The SPARK parallelizing high level synthesis framework for high level synthesis • Leonardo Spectrum and Quartus II for logic synthesis and P&R respectively • The Nios specific GNU toolkit for compiling C code

Example: MPEG-1 Prediction Block • MPEG-1 pred1 used 1 out of 3 kernels mapped to HW • MPEG-2 pred2 used the remaining 2 kernels mapped to HW • HW/SW interface consisted of 181 words of data

Conclusions • Parallelizing code transformations enable a new range of HLS transformations • Provide the needed improvement in quality of HLS results • Possible to be competitive against manually designed circuits. • Can enable productivity improvements in microelectronic design • Built a C-to-VHDL synthesis system with a range of code transformations • Platform for applying Coarse and Fine-grain Optimizations • Tool-box approach where transformations and heuristics can be developed • Script-based synthesis enables designer to control HLS process • Performance improvements of 60-70 % across a number of designs • Presented an interface synthesis approach targeted at FPGA based platforms • Based on a memory mapping algorithm that utilizes scheduling information for synthesizing designs with control flow

SPARK Release • Available for download http://www.cecs.uci.edu/~spark • User Manual • Running the tool • Customizing the synthesis scripts • Tutorial • Synthesis of Portion of Motion Compensation algorithm in MPEG-1 player (along with logic synthesis scripts)

Thank You

Center for Embedded Computer Systems University of California, Irvine and San Diego

Center for Embedded Computer Systems University of California, Irvine and San Diego

Presentation Transcript

University of California, San Diego

University of California, San Diego

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Center for Embedded Computer Systems University of California, Irvine http://www.cecs.uci.edu/~spark

University of California Irvine

University of California, San Diego

University of California , San Diego (UCSD)

University of California, San Diego Brain – Computer Technology

Center for Embedded Computer Systems University of California, Irvine

Center for Astrophysics and Space Sciences, University of California, San Diego

University of California Irvine

University of California: San Diego

Center for Embedded Computer Systems University of California, Irvine

ISR #11 University of California , San Diego

University of California, Irvine and San Diego

University of California, San Diego

1 Center for Astrophysics and Space Sciences, University of California, San Diego

University of California, San Diego Center for Energy Research

Betsy Barton Center for Cosmology University of California, Irvine

University of California, Irvine

University of California, San Diego: An Engaged University

Betsy Barton Center for Cosmology University of California, Irvine