
Dynamic Hardware/Software Partitioning: A First Approach


Presentation Transcript


  1. Dynamic Hardware/Software Partitioning: A First Approach Greg Stitt, Roman Lysecky, Frank Vahid* Department of Computer Science and Engineering University of California, Riverside *Also with the Center for Embedded Computer Systems at UC Irvine

  2. Introduction • Dynamic optimizations are an increasing trend • Examples • Dynamo • Dynamic software optimizations • Transmeta Crusoe • Dynamic code morphing • Just In Time Compilation • Interpreted languages • Advantages • Transparent optimizations • No designer effort • No tool restrictions • Adapts to actual usage

  3. Introduction • Drawbacks of current dynamic optimizations • Currently limited to software optimizations • Limited speedup (1.1x to 1.3x common) • Alternatively, we could perform hw/sw partitioning • Achieves large speedups (2x to 10x common) • However, hw/sw partitioning is presently not done dynamically [Figure: a profiler identifies critical software regions and moves them from the processor to hardware on an ASIC/FPGA]

  4. Introduction • Ideally, we would perform hardware/software partitioning dynamically • Transparent partitioning • Supports all sw languages/tools • Most partitioning approaches have complex tool flows • Achieves better results than software optimizations • >2x speedup, energy savings • Adapts to actual usage • Appropriate architecture required • Requires a processor and configurable logic

  5. Introduction • Microprocessor/FPGA single-chip platforms make partitioning more attractive • More efficient communication, smaller size • Higher performance, low power • Examples • Xilinx Virtex II Pro, Triscend E5/A7, Altera Excalibur, Atmel FPSLIC • Makes dynamic hw/sw partitioning more feasible • However, partitioning must be performed at binary level [Figure: separate processor and FPGA chips in the 1990s vs. a single processor+FPGA chip in 2003]

  6. Introduction • Binary-level hw/sw partitioning • Binary is profiled and hardware candidates are determined • Regions to be partitioned are decompiled into a CDFG • CDFG is synthesized to hardware • Binary is updated to use the hardware • Many advantages over source-level partitioning • Supports any language or software compiler • No change in tools • Better software size and performance estimation at binary level • Enables dynamic hw/sw partitioning [Figure: binary → profiling → hw exploration → decompilation → behavioral synthesis → netlist (FPGA); binary updater → updated binary (processor)]

  7.–13. Dynamic Hw/Sw Partitioning (animation) • A single chip contains microprocessors, memory, configurable logic, and a dynamic partitioning module • Software executes from memory on a microprocessor while the dynamic partitioning module monitors the executed instructions (e.g., add and beq instructions) • The module detects frequent loops in the running software • Those frequent loops are then reimplemented as hardware in the configurable logic

  14. Dynamic Partitioning Module • Dynamic partitioning module executes the partitioning tools on chip • Profiler, partitioning compiler, synthesis, place & route [Figure: sw source flows through a profiler and partitioning compiler to a sw binary, and through synthesis and place & route to hw — all inside the dynamic partitioning module]

  15. Dynamic Partitioning Module • Synthesis and place & route tools are all moved on-chip • These tools typically execute on powerful workstations • Most people will cringe at the idea of moving these tools on-chip • However, dynamic partitioning deals with small regions of code • Typically, small innermost loops • Therefore, we can develop lean tools that work specifically for these small loops • Lean tools make on-chip execution possible • Area overhead is becoming less critical due to Moore's Law

  16. System Architecture • Microprocessors • MIPS (may be many) • On-chip memory • Configurable logic • Dynamic partitioning module [Figure: all four components on a single chip]

  17. Dynamic Partitioning Module • Dynamically detects frequent loops and then reimplements the loops in hardware running on the configurable logic • Architectural components • Profiler • Additional processor and memory • But SOCs may have dozens anyway • Alternatively, we could share the main processor [Figure: the module consists of a profiler, a partitioning co-processor, and memory]

  18. Configurable Logic • Greatly simplified in order to create lean place & route tools • DMA used to access memory • Two registers • R0_Input stores data from memory • R1_InOut stores temporary data & data to write back to memory • Fabric • Supports combinational logic • Implies loops must have their body implemented in a single cycle (temporary restriction) [Figure: DMA feeding R0_Input and R1_InOut, which connect to the configurable logic fabric]
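
The single-cycle execution model above can be pictured with a minimal C sketch (illustrative only: the register names follow the slide, but the function interface is an assumption; the accumulate body mirrors the example loop shown later on slide 30):

    /* Minimal sketch of the single-cycle execution model. */
    #include <stdint.h>
    #include <stdio.h>

    /* Combinational fabric: evaluates one loop body per cycle.
     * Here the body simply accumulates the incoming word. */
    static uint32_t fabric(uint32_t r0_input, uint32_t r1_inout)
    {
        return r0_input + r1_inout;
    }

    int main(void)
    {
        uint32_t mem[8] = {1, 2, 3, 4, 5, 6, 7, 8};   /* words fetched by DMA */
        uint32_t r0_input, r1_inout = 0;

        /* DMA in increment-address, block-request mode feeds one word per cycle. */
        for (int i = 0; i < 8; i++) {
            r0_input = mem[i];                         /* DMA read into R0_Input */
            r1_inout = fabric(r0_input, r1_inout);     /* combinational update   */
        }
        printf("R1_InOut = %u\n", (unsigned)r1_inout); /* written back via DMA   */
        return 0;
    }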

  19. Configurable Logic Fabric • Fabric • 3-input, 2-output LUTs surrounded by switch matrices • Switch Matrix • Connects a wire to the same channel on a different side • LUT • 3-input (8-word), 2-output SRAM [Figure: grid of LUTs and switch matrices; each LUT is an 8x2 SRAM]
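
As a concrete picture of such a LUT, here is a small C model of an 8-word x 2-bit SRAM addressed by three inputs (names and the full-adder configuration are illustrative, not taken from the paper):

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint8_t sram[8];                 /* each word holds the 2 output bits */
    } lut3x2_t;

    /* Pack the inputs into an address and read both output bits. */
    static uint8_t lut_eval(const lut3x2_t *lut, int a, int b, int c)
    {
        return lut->sram[(a << 2) | (b << 1) | c] & 0x3;
    }

    int main(void)
    {
        lut3x2_t full_adder;
        /* Program the LUT as a full adder: output bit 0 = sum, bit 1 = carry. */
        for (int a = 0; a < 2; a++)
            for (int b = 0; b < 2; b++)
                for (int cin = 0; cin < 2; cin++) {
                    int sum   = a ^ b ^ cin;
                    int carry = (a & b) | (a & cin) | (b & cin);
                    full_adder.sram[(a << 2) | (b << 1) | cin] =
                        (uint8_t)((carry << 1) | sum);
                }
        printf("1+1+1 -> %d\n", lut_eval(&full_adder, 1, 1, 1));  /* prints 3 */
        return 0;
    }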

  20. Tool Overview • Tool flow slightly different from standard partitioning flow • Decompilation • Binary modification [Flow: binary → loop profiling → small, frequent loops → decompilation → DMA configuration → RT and logic synthesis → tech. mapping → place & route → bitfile creation → hw; binary modification → updated binary]

  21. Loop Profiling • Non-intrusive profiler • Monitors instruction bus • Very little overhead • Small cache (~16 entries) and 2,300 logic gates • Less than 1% power overhead [Figure: a frequent loop cache and its controller sit on the bus between the microprocessor and L1 memory, counting short backward branches (sbb) with saturating counters]
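
A rough C sketch of such a frequent-loop cache follows (entry layout, counter width, and the eviction policy are assumptions): each observed short backward branch is looked up by address and counted with a saturating counter.

    #include <stdint.h>
    #include <stdio.h>

    #define ENTRIES 16
    #define SAT_MAX 0xFFFF

    typedef struct {
        uint32_t branch_addr;   /* address of the short backward branch */
        uint16_t count;         /* saturating execution count           */
        uint8_t  valid;
    } loop_entry_t;

    static loop_entry_t cache[ENTRIES];

    /* Called whenever the profiler observes a short backward branch on the bus. */
    static void profile_branch(uint32_t branch_addr)
    {
        int free_slot = -1, min_slot = 0;

        for (int i = 0; i < ENTRIES; i++) {
            if (cache[i].valid && cache[i].branch_addr == branch_addr) {
                if (cache[i].count < SAT_MAX)          /* saturating increment */
                    cache[i].count++;
                return;
            }
            if (!cache[i].valid) {
                if (free_slot < 0)
                    free_slot = i;
            } else if (cache[i].count < cache[min_slot].count) {
                min_slot = i;                          /* eviction candidate   */
            }
        }
        /* Miss: use a free entry if one exists, else evict the least-counted. */
        int slot = (free_slot >= 0) ? free_slot : min_slot;
        cache[slot].branch_addr = branch_addr;
        cache[slot].count = 1;
        cache[slot].valid = 1;
    }

    int main(void)
    {
        for (int i = 0; i < 100; i++)           /* the same loop branch fires 100x */
            profile_branch(0x4001F0u);
        printf("count = %u\n", cache[0].count); /* prints 100 */
        return 0;
    }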

  22. Decompilation • Decompilation recovers high-level information • Creates optimized CDFG • All instruction-set inefficiencies are removed • Binary partitioning has been shown to achieve similar results to source-level partitioning for many applications • [Greg Stitt, Frank Vahid, ICCAD 2002]
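
To make the idea concrete, here is an illustrative C sketch of a decompiled dataflow node; the structure and the register-to-node map are assumptions about what such a CDFG could look like, not the tool's actual IR:

    #include <stdlib.h>

    typedef enum { OP_CONST, OP_LOAD, OP_ADD, OP_LT } dfg_op_t;

    typedef struct dfg_node {
        dfg_op_t op;
        struct dfg_node *src[2];   /* operand edges                      */
        int imm;                   /* constant value when op == OP_CONST */
    } dfg_node_t;

    static dfg_node_t *dfg_new(dfg_op_t op, dfg_node_t *a, dfg_node_t *b, int imm)
    {
        dfg_node_t *n = malloc(sizeof *n);
        n->op = op; n->src[0] = a; n->src[1] = b; n->imm = imm;
        return n;
    }

    int main(void)
    {
        dfg_node_t *regs[32] = {0};                /* node currently defining each reg */

        regs[1] = dfg_new(OP_CONST, 0, 0, 0);      /* r1: array base address           */
        regs[3] = dfg_new(OP_CONST, 0, 0, 0);      /* r3: running sum                  */
        regs[2] = dfg_new(OP_LOAD, regs[1], 0, 0); /* Load r2, 0(r1)                   */

        /* "Add r3, r3, r2": the new definition of r3 is an ADD node whose inputs
         * are whatever nodes currently define r3 and r2; explicit register names,
         * an instruction-set artifact, disappear from the CDFG. */
        regs[3] = dfg_new(OP_ADD, regs[3], regs[2], 0);
        return 0;
    }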

  23. DMA Configuration • Maps memory accesses to our DMA architecture • Reads/writes • Increment/decrement address updates • Single/block request modes • Optimizes DFG for DMA • Removes address calculations • Removes loop counters/exit conditions [Figure: a memory read with an incrementing address maps to a DMA read in increment-address, block-request mode, shrinking the DFG]
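
A hedged C sketch of what such a DMA configuration record might hold (all field and type names are assumptions) matching the access modes listed above:

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { DMA_READ, DMA_WRITE }            dma_dir_t;
    typedef enum { ADDR_INCREMENT, ADDR_DECREMENT } dma_addr_mode_t;
    typedef enum { REQ_SINGLE, REQ_BLOCK }          dma_req_mode_t;

    typedef struct {
        dma_dir_t       direction;
        dma_addr_mode_t addr_mode;   /* replaces explicit address-update code */
        dma_req_mode_t  req_mode;
        uint32_t        base_addr;
        uint32_t        count;       /* replaces the loop counter/exit test   */
    } dma_config_t;

    /* Example: the loop on slide 30 reads 8 consecutive words starting at r1. */
    static const dma_config_t example_cfg = {
        DMA_READ, ADDR_INCREMENT, REQ_BLOCK, /*base*/ 0x0, /*count*/ 8
    };

    int main(void)
    {
        printf("block read of %u words, incrementing address\n",
               (unsigned)example_cfg.count);
        return 0;
    }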

  24. Register Transfer Synthesis • Maps DFG operations to hw library components • Adders, comparators, multiplexors, shifters • Creates a Boolean expression for each output bit in the dataflow graph by replacing hw components with their corresponding expressions • Example (32-bit adder r4 = r1 + r2): r4[0] = r1[0] xor r2[0], carry[0] = r1[0] and r2[0]; r4[1] = (r1[1] xor r2[1]) xor carry[0], carry[1] = …, and so on [Figure: DFG containing a 32-bit adder (r1 + r2 → r4) and a 32-bit comparator (r3 < 8 → r5)]
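
The per-bit expansion can be generated mechanically; the following small C program (an illustrative sketch, not the actual tool) prints the ripple-carry expressions for the 32-bit adder above:

    #include <stdio.h>

    int main(void)
    {
        /* Bit 0 has no carry-in; every other bit XORs in the previous carry. */
        printf("r4[0] = r1[0] xor r2[0]\n");
        printf("carry[0] = r1[0] and r2[0]\n");
        for (int i = 1; i < 32; i++) {
            printf("r4[%d] = (r1[%d] xor r2[%d]) xor carry[%d]\n", i, i, i, i - 1);
            printf("carry[%d] = (r1[%d] and r2[%d]) or (carry[%d] and "
                   "(r1[%d] xor r2[%d]))\n", i, i, i, i - 1, i, i);
        }
        return 0;
    }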

  25. Logic Synthesis • Optimizes Boolean equations from RT synthesis • Large opportunity for logic minimization due to use of immediate values in the binary • Simple on-chip 2-level logic minimization method • Lysecky/Vahid DAC’03, session 20.4 (9:45 Wed) • Example (r2 = r1 + 4), before minimization: r2[0] = r1[0] xor 0 xor 0, r2[1] = r1[1] xor 0 xor carry[0], r2[2] = r1[2] xor 1 xor carry[1], r2[3] = r1[3] xor 0 xor carry[2], … • After minimization: r2[0] = r1[0], r2[1] = r1[1] xor carry[0], r2[2] = r1[2]' xor carry[1], r2[3] = r1[3] xor carry[2], …
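
One ingredient of that minimization is constant folding through xor; the sketch below (illustrative only, not the DAC’03 algorithm) folds the immediate bits of 4 into the adder equations and reproduces the minimized form above:

    #include <stdio.h>

    /* Simplify "var xor const_bit": x xor 0 = x, x xor 1 = x'. */
    static const char *fold_xor_const(const char *var, int const_bit,
                                      char *buf, size_t n)
    {
        if (const_bit == 0)
            snprintf(buf, n, "%s", var);
        else
            snprintf(buf, n, "%s'", var);
        return buf;
    }

    int main(void)
    {
        char var[16], buf[32];
        const unsigned imm = 4;                /* immediate value from the binary */

        for (int i = 0; i < 4; i++) {
            snprintf(var, sizeof var, "r1[%d]", i);
            fold_xor_const(var, (imm >> i) & 1, buf, sizeof buf);
            if (i == 0)
                printf("r2[0] = %s\n", buf);                       /* no carry-in */
            else
                printf("r2[%d] = %s xor carry[%d]\n", i, buf, i - 1);
        }
        return 0;
    }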

  26. Technology Mapping • Maps logic operations to 3-input, 2-output LUTs • Traverse the logic network and combine nodes to determine single-output LUTs • Combine nodes to form two-output LUTs
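
A toy C sketch of the greedy traversal (the network, names, and the "cone fits if it has at most three distinct inputs" test are illustrative assumptions, not the mapper itself):

    #include <stdio.h>

    #define MAX_IN 3

    typedef struct gate {
        const char *name;
        struct gate *in[2];        /* NULL inputs mark a primary input */
    } gate_t;

    /* Count distinct primary inputs of the cone rooted at g, up to limit. */
    static int cone_inputs(gate_t *g, gate_t **set, int count, int limit)
    {
        if (!g->in[0] && !g->in[1]) {              /* primary input (leaf)  */
            for (int i = 0; i < count; i++)
                if (set[i] == g) return count;     /* already counted       */
            if (count == limit) return limit + 1;  /* too many inputs       */
            set[count] = g;
            return count + 1;
        }
        for (int k = 0; k < 2; k++)
            if (g->in[k]) {
                count = cone_inputs(g->in[k], set, count, limit);
                if (count > limit) return count;
            }
        return count;
    }

    int main(void)
    {
        gate_t a = {"a"}, b = {"b"}, c = {"c"}, d = {"d"};
        gate_t g1 = {"g1", {&a, &b}};              /* g1 = f(a, b)  */
        gate_t g2 = {"g2", {&g1, &c}};             /* g2 = f(g1, c) */
        gate_t g3 = {"g3", {&g2, &d}};             /* g3 = f(g2, d) */

        gate_t *set[MAX_IN];
        /* g2's cone has inputs {a, b, c}: it fits one 3-input LUT.            */
        printf("cone(g2) inputs = %d -> fits\n",  cone_inputs(&g2, set, 0, MAX_IN));
        /* g3's cone has inputs {a, b, c, d}: too many, so g3 starts a new LUT. */
        printf("cone(g3) inputs = %d -> split\n", cone_inputs(&g3, set, 0, MAX_IN));
        return 0;
    }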

  27. Placement • Nodes along the critical path are placed in a single horizontal row • Build dependencies between remaining nodes and placed nodes • Use dependencies to place remaining nodes • Either above or below the placed nodes [Figure: a grid of LUTs with the critical path occupying one row]
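
The heuristic can be pictured with a toy C example (the specific LUTs, columns, and above/below choices are made up for illustration):

    #include <stdio.h>

    #define COLS 8

    int main(void)
    {
        /* Rows: 0 = above, 1 = critical path, 2 = below. */
        const char *grid[3][COLS] = {{0}};

        /* 1. Critical-path LUTs fill the middle row, one per column. */
        const char *critical[] = {"L0", "L1", "L2", "L3"};
        for (int c = 0; c < 4; c++)
            grid[1][c] = critical[c];

        /* 2. Each remaining LUT goes above or below the placed LUT it is
         *    connected to (which side to use is an assumed convention here). */
        struct { const char *name; int col; int above; } rest[] = {
            { "L4", 1, 1 },   /* feeds L1 -> above column 1 */
            { "L5", 2, 1 },   /* feeds L2 -> above column 2 */
            { "L6", 3, 0 },   /* uses  L3 -> below column 3 */
        };
        for (int i = 0; i < 3; i++)
            grid[rest[i].above ? 0 : 2][rest[i].col] = rest[i].name;

        for (int r = 0; r < 3; r++) {
            for (int c = 0; c < COLS; c++)
                printf("%-4s", grid[r][c] ? grid[r][c] : ".");
            printf("\n");
        }
        return 0;
    }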

  28. Routing • Greedy algorithm • At each switch matrix, choose a direction to route • Continue to route until reaching a switch matrix that is already in use • Backtrack to the previous switch matrix, and try another direction • Place and route is the most complex task; currently working on improvements
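
The following C sketch illustrates the greedy route-and-backtrack idea on a toy grid of switch matrices (grid size, direction order, and the blocked cells are assumptions):

    #include <stdio.h>

    #define N 4
    static int in_use[N][N];    /* switch matrices already claimed by other nets */
    static int visited[N][N];

    /* Depth-first: try each direction, backtrack when a matrix is unusable. */
    static int route(int r, int c, int tr, int tc)
    {
        if (r < 0 || r >= N || c < 0 || c >= N) return 0;
        if (in_use[r][c] || visited[r][c]) return 0;
        visited[r][c] = 1;
        if (r == tr && c == tc) { printf("(%d,%d)\n", r, c); return 1; }
        /* Greedy direction order: right, down, left, up. */
        if (route(r, c + 1, tr, tc) || route(r + 1, c, tr, tc) ||
            route(r, c - 1, tr, tc) || route(r - 1, c, tr, tc)) {
            printf("(%d,%d)\n", r, c);   /* print the path while unwinding */
            return 1;
        }
        return 0;                        /* dead end: caller tries another way */
    }

    int main(void)
    {
        in_use[0][2] = in_use[1][2] = 1;       /* a blocked channel */
        if (!route(0, 0, 0, 3))
            printf("unroutable\n");
        return 0;
    }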

  29. Bitfile Creation • Combines the placed & routed hardware description with the DMA configuration into a bitfile • Used to initialize the configurable logic [Figure: DMA configuration + hw netlist → bitfile creation → bitfile loaded into the DMA, R0_Input/R1_InOut registers, and configurable logic fabric]
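
A purely hypothetical C layout for such a bitfile (field names and sizes are invented for illustration; the point is that fabric and DMA configuration are concatenated into one image):

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint32_t dma_words[4];     /* DMA configuration (direction, mode, ...) */
        uint8_t  lut_sram[64][2];  /* contents of each 8x2 LUT                 */
        uint8_t  switch_cfg[128];  /* switch-matrix routing bits               */
    } bitfile_t;

    int main(void)
    {
        printf("bitfile size: %zu bytes\n", sizeof(bitfile_t));
        return 0;
    }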

  30. Binary Modification • Updates the application binary in order to utilize the new hardware • Loop replaced with a jump to hw initialization code • Wisconsin Architectural Research Tool Set (WARTS) • EEL (Executable Editing Library) • We assume memory is RAM or programmable ROM
  Original loop:
    loop:       Load r2, 0(r1)
                Add r1, r1, 1
                Add r3, r3, r2
                Blt r1, 8, loop
    after_loop: …
  Modified binary:
    loop:       Jump hw_init
                …
    after_loop: …
  hw_init: • Initialize HW registers • Enable HW • Shutdown processor • Woken up by HW interrupt • Store any results • Jump to after_loop
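
For intuition, here is a hedged C-level sketch of what the hw_init stub does; the memory-mapped addresses, register roles, and sleep/interrupt mechanism are all assumptions (the real stub is emitted as binary code, and the platform supplies hw_done and processor_sleep):

    #include <stdint.h>

    #define HW_BASE        ((volatile uint32_t *)0x40000000u)  /* assumed address */
    #define HW_ENABLE      (HW_BASE[0])
    #define HW_R1_INOUT    (HW_BASE[1])
    #define HW_DMA_BASE    (HW_BASE[2])

    extern volatile int hw_done;         /* set by the HW completion interrupt   */
    extern void processor_sleep(void);   /* low-power wait, provided by platform */

    uint32_t hw_init(uint32_t array_base)
    {
        HW_DMA_BASE = array_base;        /* initialize HW registers              */
        HW_ENABLE   = 1;                 /* enable HW                            */
        while (!hw_done)                 /* shut down processor; woken up by     */
            processor_sleep();           /* the HW interrupt                     */
        return HW_R1_INOUT;              /* store any results, then control      */
                                         /* returns to after_loop                */
    }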

  31. Tool Statistics • Executed on SimpleScalar • Similar to a MIPS instruction set • Used 60 MHz clock (like Triscend A7 device) • Statistics • Total run time of only 1.09 seconds • Requires less than ½ megabyte of RAM • Code size much smaller than standard synthesis tools

  32. Experiments • Benchmark Information • Powerstone (Brev, g3fax1&2) • NetBench (url) • Logic minimization kernel (logmin) • Statistics • 55% of total time spent in loops that are moved to hardware • Ideal speedup of 2.8 • These loops were only 2.4% of the size of the original application

  33. Experiments • Results • Achieved average speedup of 2.6, close to ideal 2.8 • Hardware loops were 20X faster than software loops • Even with simple architecture and tools, large speedups were achieved
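
As a sanity check on how these figures relate (assuming "ideal" means the hardware loops take zero time; the 55%, 2.8x, and 2.6x numbers are averages across benchmarks, so they need not satisfy a single equation), for each benchmark i:

    \mathrm{ideal\ speedup}_i = \frac{1}{1 - f_i},
    \qquad
    \mathrm{speedup}_i = \frac{1}{(1 - f_i) + f_i / s_i}

where f_i is the fraction of execution time spent in the loops moved to hardware and s_i is the speedup of those loops in hardware (about 20x here).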

  34. Conclusion • Dynamic hardware/software partitioning has advantages over other partitioning approaches • Completely transparent • Designers get performance/energy benefits of hw/sw partitioning by simply writing software • Quality likely not as good as desktop CAD for some applications, so most suitable when transparency is critical (very often!) • Achieved average speedup of 2.6 • Very close to ideal speedup of 2.8 • Future work • More complex configurable logic fabric • Designed in close conjunction with on-chip CAD tools • Sequential logic and increased inputs/outputs • Support larger hardware regions, not just simple loops • Improved algorithms (especially place and route) • Handle more complex memory access patterns
