High-Level Programming Issues for Reconfigurable Computing Systems

Mark Jones • ECE, Virginia Tech • Blacksburg, Virginia • mtj@vt.edu • www.ccm.ece.vt.edu

The Virginia Tech Configurable Computing Lab • Focus on devices, architectures, applications, and programming issues for configurable computing • 30+ undergraduates, graduates, and post-docs • Variety of public and private sponsors • Peter Athanas & Mark Jones

Overview • Run-time reconfiguration (RTR) • Obstacles to RTR • Recent developments enabling RTR • New hardware • New bitstream generation tools • New runtime control software • RTR applications • Summary and predictions • Disclaimer: In the interests of time, I am not mentioning all of the relevant projects.

Run-Time Reconfiguration • Adaptive computing devices (e.g., FPGAs) • Hardware configurations can be changed • Speed of reconfiguration varies by device • Reconfigure during the runtime of one or more applications, in less than 1 ms • Goals of the DARPA ACS program include • Development of hardware supporting fast RTR • Creation of software to control RTR hardware • Applications that demonstrate the computational benefits of RTR in size, weight, and power

Types of RTR • Virtual Hardware • Provide programmer with an abstraction of unlimited hardware, similar to Virtual Memory • Useful abstraction which, like virtual memory, provides portability between devices • OS is responsible for directing the chip to context-switch user “hardware” (may include multiple processes) • Requires fast context-switching capability and software to effectively partition user hardware • Virtual co-processor work (e.g. DISC @ BYU) can be thought of in a similar fashion

Types of RTR (continued) • Data-driven RTR • Based on the data encountered, the hardware is reconfigured to process it • e.g., for a given DES key, the hardware is reconfigured to a DES core specific to the key • Can provide increased speed in a small package • Hardware must be able to reconfigure quickly and (in most cases) direct its own reconfiguration based on data encountered

Device Reconfiguration Methods • Entire device via a single bitstream • e.g. Xilinx 4K series • Long reconfiguration times • Logic-unit addressable reconfiguration • e.g. Xilinx 6200 • Significant chip area devoted to this function • Context-based reconfiguration • Sanders CSRC chip • Significant chip area devoted to this function

Device Reconfiguration Methods (continued) • Partial reconfiguration • e.g., Xilinx Virtex • Must reconfigure a column at a time • Stream-based reconfiguration • e.g., Colt/Stallion • Appropriate for stream-based computation • Pipeline-oriented reconfiguration • e.g., PipeRench • Appropriate for deeply pipelined applications

Types of Reconfigurable Apps • Stream-oriented applications • Intelligent network devices, software radios, video processing • Reconfiguration must occur quickly enough and w/o disruption of hardware to avoid losing data in stream (buffering too expensive in many situations) • “Batch”-type applications • Number-crunching simulations, off-line analysis of data • Reconfiguration must simply be cost-effective when trading off processing for reconfiguration
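The cost-effectiveness condition for "batch"-type applications can be written as simple arithmetic: reconfiguring pays off when the reconfiguration time plus the hardware processing time is less than the time to process the same batch without reconfiguring. The sketch below is only an illustration; the function names and the timing figures in the usage note are assumptions, not measurements from any of the projects surveyed here.

```python
def reconfiguration_pays_off(n_items, t_without, t_with, t_reconfig):
    """True if reconfiguring first makes the whole batch finish sooner.

    t_without  -- per-item time on the current (unreconfigured) setup
    t_with     -- per-item time after reconfiguration
    t_reconfig -- one-off cost of the reconfiguration itself
    All times must use the same unit.
    """
    return t_reconfig + n_items * t_with < n_items * t_without


def break_even_items(t_without, t_with, t_reconfig):
    """Smallest batch size for which reconfiguration is worthwhile.

    Returns None when the reconfigured hardware is not faster per item,
    because then no batch size can amortize the reconfiguration cost.
    """
    if t_with >= t_without:
        return None
    # Need n * (t_without - t_with) > t_reconfig, so take the next integer up.
    return t_reconfig // (t_without - t_with) + 1
```

For example, with hypothetical figures of 10,000 ns per item before, 1,000 ns per item after, and a 1 ms (1,000,000 ns) reconfiguration, `break_even_items(10_000, 1_000, 1_000_000)` gives 112 items.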

Prior Obstacles to RTR • Lack of hardware devices that support RTR in an appropriate fashion • Provide fast reconfiguration without sacrificing performance • Lack of software to support RTR • Generate and modify bitstream configurations during runtime • The following slides will survey projects which are overcoming these obstacles • These projects really represent evolutionary advances on previous research projects

Virtual Hardware: PipeRench (CMU) • Many applications, particularly stream-based applications, can be deeply pipelined to improve performance • PipeRench is built as a reconfigurable pipeline of n units • The programmer views PipeRench as a programmable pipeline of m units where m is arbitrarily large

PipeRench (CMU) • PipeRench supports this Virtual Hardware abstraction by moving the n physical stages through the m-stage abstract pipeline, reconfiguring a physical stage to hold the next abstract stage as it is needed

PipeRench (CMU) • Only one stage must be reconfigured at each step • Allows for fast reconfiguration because only part of chip must be reconfigured • Defines a scalable architecture series • No changes to code are needed as hardware increases in size • Realization in VLSI exists as well as compiler tools
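A toy software model can make the virtual pipeline idea concrete. The sketch below is a simplified illustration, not PipeRench's actual scheduling: it assumes virtual stage v is placed in physical slot v mod n, and it pushes a single value through the pipeline. The function name and the round-robin placement are assumptions made for this example.

```python
def run_virtual_pipeline(stages, n_physical, value):
    """Push one value through len(stages) virtual stages using n_physical slots.

    stages     -- list of callables, one per virtual pipeline stage
    n_physical -- number of physical stages available on the (modeled) chip
    Returns (result, number_of_stage_reconfigurations).
    """
    physical = [None] * n_physical          # what each physical slot holds now
    reconfigs = 0
    for v, stage in enumerate(stages):
        slot = v % n_physical               # assumed round-robin placement
        if physical[slot] is not stage:
            physical[slot] = stage          # only this one slot is rewritten
            reconfigs += 1
        value = physical[slot](value)
    return value, reconfigs


# The same five-stage "program" runs unchanged on 2 or 5 physical stages.
program = [
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x - 3,
    lambda x: x * x,
    lambda x: x + 10,
]
small_chip = run_virtual_pipeline(program, 2, 1)
large_chip = run_virtual_pipeline(program, 5, 1)
```

Both calls return the same result; only the physical slot each virtual stage lands in differs, which is the sense in which no code changes are needed as the hardware grows.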

Runtime Generation of Bitstreams: Loki Project (Xilinx and Virginia Tech) • [Figure: block diagram; labels: Application Program, New State, Functionality, Place & Route, State, Connectivity, Resources]

Loki Project (continued) • JBits provides an API to the Xilinx bitstream for the 4K and Virtex parts • Java-based API at the LUT/pip level • Executing a Java program with the JBits API can create or modify a bitstream • The Loki project builds on this API to provide a design environment • Focus is on Run-Time Parameterizable cores

Loki Project (continued) • RTP cores (tens of cores at this point) • Finite state machines • KCMs • CAMs • Execution time for customizing bitstreams • Milliseconds (or less) for modification of LUTs in an existing bitstream • Challenge is to provide similar speeds when routing is required

Loki Project (continued) • The RTP core-based approach is hierarchical • Routing & placement is handled within the core; a full chip-wide P&R is not required • The JBits & RTP-based approach in the Java environment makes development of new tools much easier • Simulator for Virtex devices • Visualization of routing delays • Visualization of core layout and runtime execution

BoardScope Core View • [Figure: BoardScope view of an evolved synchronous circuit; labels: Output Shift Register (Vertical); 3 Input Shift Registers (Horizontal), Center Register Highlighted]

Runtime Hardware Control: SLAAC & DRACS (Sanders, Virginia Tech, USC/ISI-East) • The new hardware that supports fast RTR requires new runtime control software to reduce/eliminate the software overhead associated with reconfiguration • Need to provide the programmer with an abstraction for RTR that is easy to use, yet doesn’t incur runtime overhead

Runtime Hardware Control: Target Hardware • The SLAAC-1V board • 3 Virtex 1000 chips capable of partial reconfiguration • On-board configuration controller (Virtex 100) with a local memory cache • The Sanders RCM board • 2 CSRC chips capable of context-switching • PowerPC and Xilinx 4085 with local memory cache

Runtime Hardware Control: Virtual Hardware • Consider an OS that is swapping hardware configurations in/out of chip (microseconds) • Partial configurations in and out of the Virtex parts on the SLAAC-1V • Switching contexts on the RCM board • Cannot afford to have the configurations sent by the OS to board on every configuration swap • Overwhelm the microsecond cost

Runtime Hardware Control: Virtual Hardware (continued) • Most programs exhibit temporal locality • Exploit this in a way similar to virtual memory • Both the SLAAC-1V and the Sanders RCM provide the memory and the control capability to build a configuration cache • Instead of sending configurations to the board, control signals are sent invoking reconfiguration from the cache • Transparent to the programmer
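The configuration cache can be modeled in a few lines. The sketch below is a minimal illustration that assumes a least-recently-used replacement policy; the slides do not say which policy SLAAC or DRACS uses, and the class name, the `fetch_from_host` callback, and the byte-string "bitstreams" are all hypothetical.

```python
from collections import OrderedDict


class ConfigurationCache:
    """Toy model of an on-board configuration cache (LRU policy assumed)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()      # config_id -> bitstream, oldest first
        self.host_transfers = 0         # how often the host had to send data

    def activate(self, config_id, fetch_from_host):
        """Return the bitstream for config_id, fetching it only on a miss."""
        if config_id in self.slots:
            self.slots.move_to_end(config_id)   # mark as most recently used
            return self.slots[config_id]
        bitstream = fetch_from_host(config_id)  # expensive host-to-board path
        self.host_transfers += 1
        self.slots[config_id] = bitstream
        if len(self.slots) > self.capacity:
            self.slots.popitem(last=False)      # evict least recently used
        return bitstream


# Usage: a request stream with temporal locality mostly hits the cache.
cache = ConfigurationCache(capacity=2)
for cid in ["A", "B", "A", "A", "B", "C", "A"]:
    cache.activate(cid, lambda c: b"cfg-" + c.encode())
```

In this run only four of the seven activations require a transfer from the host; the rest are served by a control signal naming a configuration already on the board.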

Runtime Hardware Control: Data-Driven RTR • Data-driven RTR requires extremely fast reconfiguration and virtually no overhead in the control of RTR • Little benefit to clock-cycle RTR (CSRC) if the control software takes longer • Must execute control of RTR near the chip • Need an abstraction for programmers to target

Runtime Hardware Control: Data-Driven RTR (continued) • Using a Finite State Machine (FSM) provides a suitable solution • The FSM monitors the data encountered, triggering changes in state • State change in the FSM reconfigures the chip from the configuration cache • FSM can execute in small space (e.g., fraction of Xilinx 4085) local to board • Interface familiar to most programmers
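The FSM abstraction described above can be sketched as follows. This is an illustrative software model only: the class, the state names, the input symbols, and the `load_config` callback (standing in for "reconfigure the chip from the configuration cache") are assumptions, not part of the SLAAC or DRACS software.

```python
class ReconfigFSM:
    """Toy FSM in which every state change triggers a reconfiguration."""

    def __init__(self, transitions, state_config, initial, load_config):
        self.transitions = transitions    # (state, symbol) -> next state
        self.state_config = state_config  # state -> configuration identifier
        self.state = initial
        self.load_config = load_config    # hypothetical hook to the board

    def step(self, symbol):
        """Consume one data symbol; reconfigure only if the state changes."""
        next_state = self.transitions.get((self.state, symbol), self.state)
        if next_state != self.state:
            self.state = next_state
            self.load_config(self.state_config[next_state])
        return self.state


# Usage: headers in the data stream select which core is loaded.
loaded = []
fsm = ReconfigFSM(
    transitions={
        ("idle", "hdr_des"): "des",
        ("des", "end"): "idle",
        ("idle", "hdr_fir"): "fir",
        ("fir", "end"): "idle",
    },
    state_config={"idle": "cfg_idle", "des": "cfg_des", "fir": "cfg_fir"},
    initial="idle",
    load_config=loaded.append,
)
for symbol in ["data", "hdr_des", "data", "end", "hdr_fir"]:
    fsm.step(symbol)
```

Symbols with no matching transition leave the state, and therefore the hardware configuration, unchanged, so ordinary data costs no reconfiguration.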

Application: DES Core (Xilinx) • The circuitry for DES computation can be significantly reduced if a specific key is “folded into” the circuitry • This reduction allows for a smaller, faster hardware realization of DES • Of course, a DES implementation that is specific to a single key isn’t useful unless it can be reconfigured…

DES Core (continued) • A DES core was implemented using JBits • A new core for each key is generated at runtime • Requires only changes to LUTs to configure for a new key • This implementation is faster than the current ASIC DES champion from Sandia • Technique being exploited for other encryption methods at Xilinx
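The idea of folding a key into LUT contents can be shown on a toy lookup. In the sketch below, a generic circuit XORs 6 data bits with 6 key bits and then consults a table, while the key-specific version precomputes a new table so that the XOR disappears and only table contents change per key. The table used here is a made-up placeholder and NOT a real DES S-box, and the function names are hypothetical; this is not the JBits DES core itself.

```python
def toy_sbox():
    """Hypothetical 6-bit-in, 4-bit-out table (not a real DES S-box)."""
    return [(7 * x + 3) % 16 for x in range(64)]


def generic_lookup(sbox, data6, key6):
    """Generic circuit: key mixing (XOR) followed by the table lookup."""
    return sbox[data6 ^ key6]


def fold_key(sbox, key6):
    """Build a key-specific table so that folded[x] == sbox[x ^ key6].

    Only the table contents depend on the key, which mirrors the claim
    that a new key requires only changes to LUTs.
    """
    return [sbox[x ^ key6] for x in range(len(sbox))]


sbox = toy_sbox()
key = 0b101101                      # example 6-bit subkey chunk (arbitrary)
folded = fold_key(sbox, key)
```

After folding, `folded[x]` replaces `generic_lookup(sbox, x, key)` for every input `x`, with no key input and no XOR stage left in the datapath.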

EPIC View of 16 Rounds Courtesy Cameron Patterson

Comparing Fully Unrolled and Pipelined Designs Courtesy Cameron Patterson

Application: Number Crunching (Virginia Tech) • Traditional “numerical-analysis” style computation has focused on the use of IEEE-compliant floating-point arithmetic on general purpose CPUs • Two trends are forcing a refocus • Intel (and others) do not focus design on this market • Embedded processing is becoming increasingly complex, requiring more “number-crunching”

Number Crunching (cont.) • Cannot do away with key features of IEEE-compliant arithmetic (too many algorithms depend on it) • Floating-point units, however, are large and expensive • Can customize hardware to provide performance in a reasonable package • Reconfiguration is key

Number Crunching (cont.) • Use constant floating-point multipliers • e.g., as coefficients in an FIR • These multipliers are smaller and faster than two-input multipliers • Analysis provides bounds on the size of IEEE-compliant implementations
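Why a constant multiplier is smaller than a two-input multiplier is easiest to see in the integer case. The sketch below models a table-based constant-coefficient multiplier for unsigned integers: the input is split into 4-bit digits, each digit indexes a precomputed table of multiples of the constant, and the partial products are shifted and added. This is a deliberate simplification; it does not reproduce the IEEE-compliant floating-point multipliers discussed here, and the function names and the 4-bit digit width are assumptions.

```python
def build_kcm_table(constant, chunk_bits=4):
    """Precompute constant * d for every possible digit d (the 'LUT contents')."""
    return [constant * d for d in range(1 << chunk_bits)]


def kcm_multiply(table, x, input_bits, chunk_bits=4):
    """Multiply unsigned x by the constant baked into table.

    Uses only table lookups, shifts, and adds; no general multiplier.
    """
    n_chunks = -(-input_bits // chunk_bits)     # ceiling division
    mask = (1 << chunk_bits) - 1
    total = 0
    for i in range(n_chunks):
        digit = (x >> (i * chunk_bits)) & mask  # i-th 4-bit digit of x
        total += table[digit] << (i * chunk_bits)
    return total


# Usage: an 8-bit input times the constant 37 needs two lookups and one add.
table_37 = build_kcm_table(37)
product = kcm_multiply(table_37, 200, input_bits=8)
```

Changing the coefficient means rewriting only the table contents, which is why such multipliers are natural candidates for run-time parameterization.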

Summary • Obstacles to practical RTR are being overcome • New hardware devices, experimental and commercial, are now available • New software is coming online to allow run-time bitstream generation • And now for some predictions…

RTR Predictions • Security of reconfigurable devices comes into question and changes are made to address this issue • APIs to commercial FPGA bitstreams become commonplace, allowing more widespread innovation in RTR software • RTR hardware becomes an essential aspect of SOC solutions which, by their nature, avoid the “scale by adding more hardware” aspect of PCs • Will proliferate in industries that need low-cost, low-power, small solutions (e.g., cellular phones)
