
High-Level Programming Issues for Reconfigurable Computing Systems

Mark Jones, ECE, Virginia Tech, Blacksburg, Virginia. mtj@vt.edu, www.ccm.ece.vt.edu. The Virginia Tech Configurable Computing Lab.


Presentation Transcript


  1. High-Level Programming Issues for Reconfigurable Computing Systems Mark Jones ECE Virginia Tech Blacksburg, Virginia mtj@vt.edu www.ccm.ece.vt.edu

  2. The Virginia Tech Configurable Computing Lab • Focus on devices, architectures, applications, and programming issues for configurable computing • 30+ undergraduates, graduates, and post-docs • Variety of public and private sponsors • Peter Athanas & Mark Jones

  3. Overview • Run-time reconfiguration (RTR) • Obstacles to RTR • Recent developments enabling RTR • New hardware • New bitstream generation tools • New runtime control software • RTR applications • Summary and predictions • Disclaimer: In the interests of time, I am not mentioning all of the relevant projects.

  4. Run-Time Reconfiguration • Adaptive computing devices (e.g., FPGAs) • Hardware configurations can be changed • Speed of reconfiguration varies by device • Reconfigure during the runtime of an application(s) – less than 1 ms • Goals of the DARPA ACS program include • Development of hardware supporting fast RTR • Creation of software to control RTR hardware • Applications that demonstrate the computational benefits of RTR in size, weight, and power

  5. Types of RTR • Virtual Hardware • Provide programmer with an abstraction of unlimited hardware, similar to Virtual Memory • Useful abstraction which, like virtual memory, provides portability between devices • OS is responsible for directing the chip to context-switch user “hardware” (may include multiple processes) • Requires fast context-switching capability and software to effectively partition user hardware • Virtual co-processor work (e.g. DISC @ BYU) can be thought of in a similar fashion

  6. Types of RTR (continued) • Data-driven RTR • Based on the data encountered, the hardware is reconfigured to process it • e.g., for a given DES key, the hardware is reconfigured to a DES core specific to the key • Can provide increased speed in a small package • Hardware must be able to reconfigure quickly and (in most cases) direct its own reconfiguration based on data encountered

  7. Device Reconfiguration Methods • Entire device via a single bitstream • e.g. Xilinx 4K series • Long reconfiguration times • Logic-unit addressable reconfiguration • e.g. Xilinx 6200 • Significant chip area devoted to this function • Context-based reconfiguration • Sanders CSRC chip • Significant chip area devoted to this function

  8. Device Reconfiguration Methods (continued) • Partial reconfiguration • e.g., Xilinx Virtex • Must reconfigure a column at a time • Stream-based reconfiguration • e.g., Colt/Stallion • Appropriate for stream-based computation • Pipeline-oriented reconfiguration • e.g., PipeRench • Appropriate for deeply pipelined applications

  9. Types of Reconfigurable Apps • Stream-oriented applications • Intelligent network devices, software radios, video processing • Reconfiguration must occur quickly enough, and without disrupting the hardware, to avoid losing data in the stream (buffering is too expensive in many situations) • “Batch”-type applications • Number-crunching simulations, off-line analysis of data • Reconfiguration must simply be cost-effective when trading off processing time against reconfiguration time
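The cost-effectiveness condition for batch-type applications can be illustrated with a small break-even calculation. This is a minimal sketch under an assumed cost model that is not in the slides: a one-time reconfiguration cost, plus a fixed per-item time on the general and on the specialized configuration. The function name and parameters are hypothetical.

```python
import math

def break_even_items(t_reconfig, t_general, t_specialized):
    """Smallest item count N for which reconfiguring pays off.

    Assumed model: reconfiguring costs t_reconfig once, after which each
    item takes t_specialized instead of t_general (same time unit).
    Reconfiguration wins when t_reconfig + N * t_specialized < N * t_general.
    Returns None if the specialized configuration is not faster per item.
    """
    gain_per_item = t_general - t_specialized
    if gain_per_item <= 0:
        return None
    # N must be strictly greater than t_reconfig / gain_per_item.
    return math.floor(t_reconfig / gain_per_item) + 1

# Hypothetical numbers: 1000 us to reconfigure, 5 us vs. 1 us per item.
print(break_even_items(1000, 5, 1))
```

Under this model, a stream of fewer items than the break-even count is better served by leaving the current configuration in place.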

  10. Prior Obstacles to RTR • Lack of hardware devices that support RTR in an appropriate fashion • Provide fast reconfiguration without sacrificing performance • Lack of software to support RTR • Generate and modify bitstream configurations during runtime • The following slides will survey projects which are overcoming these obstacles • These projects really represent evolutionary advances on previous research projects

  11. Virtual Hardware: PipeRench (CMU) • Many applications, particularly stream-based applications, can be deeply pipelined to improve performance • PipeRench is built as a reconfigurable pipeline of n units • The programmer views PipeRench as a programmable pipeline of m units where m is arbitrarily large

  12. PipeRench (CMU) • PipeRench supports this Virtual Hardware abstraction by reconfiguring the physical pipeline through the abstract pipeline

  13. PipeRench (CMU) • Only one stage must be reconfigured at each step • Allows for fast reconfiguration because only part of the chip must be reconfigured • Defines a scalable architecture series • No changes to code are needed as hardware increases in size • A VLSI realization exists, as do compiler tools
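The idea of moving a physical pipeline of n stages through a virtual pipeline of m stages, one stage per step, can be sketched as a simple schedule. This is a simplified model for illustration only, not PipeRench's actual scheduler: it assumes virtual stages are loaded round-robin into physical stages, and the function name is hypothetical.

```python
def virtualization_schedule(m_virtual, n_physical, steps):
    """Return, per step, which virtual stage each physical stage holds.

    Assumed model: at step t, virtual stage (t mod m_virtual) is configured
    into physical stage (t mod n_physical), overwriting what was there.
    Exactly one physical stage is reconfigured per step.
    """
    physical = [None] * n_physical  # None means "not yet configured"
    history = []
    for t in range(steps):
        physical[t % n_physical] = t % m_virtual
        history.append(list(physical))
    return history

# A 5-stage virtual pipeline on 3 physical stages.
for step, contents in enumerate(virtualization_schedule(5, 3, 6)):
    print(step, contents)
```

Because the schedule depends only on m and n, the same virtual pipeline description runs unchanged when n grows, which is the scalability property the slide describes.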

  14. Runtime Generation of Bitstreams: Loki Project (Xilinx and Virginia Tech) [Diagram labels: Application Program; New State; Functionality; Place & Route; State; Connectivity; Resources]

  15. Loki Project (continued) • JBits provides an API to the Xilinx bitstream for the 4K and Virtex parts • Java-based API at the LUT/pip level • Executing a Java program with the JBits API can create or modify a bitstream • The Loki project builds on this API to provide a design environment • Focus is on Run-Time Parameterizable cores

  16. Loki Project (continued) • RTP cores (tens of cores at this point) • Finite state machines • KCMs • CAMs • Execution time for customizing bitstreams • Milliseconds (or less) for modification of LUTs in an existing bitstream • Challenge is to provide similar speeds when routing is required

  17. Loki Project (continued) • The RTP core-based approach is hierarchical • Routing & placement is handled within the core; a full chip-wide P&R is not required • The JBits & RTP-based approach in the Java environment makes development of new tools much easier • Simulator for Virtex devices • Visualization of routing delays • Visualization of core layout and runtime execution

  18. BoardScope Core View [Figure labels: output shift register (vertical); 3 input shift registers (horizontal), center register highlighted; evolved synchronous circuit]

  19. Runtime Hardware Control: SLAAC & DRACS (Sanders, Virginia Tech, USC/ISI-East) • The new hardware that supports fast RTR requires new runtime control software to reduce/eliminate the software overhead associated with reconfiguration • Need to provide the programmer with an abstraction for RTR that is easy to use, yet doesn’t incur runtime overhead

  20. Runtime Hardware Control: Target Hardware • The SLAAC-1V board • 3 Virtex 1000 chips capable of partial reconfiguration • On-board configuration controller (Virtex 100) with a local memory cache • The Sanders RCM board • 2 CSRC chips capable of context-switching • PowerPC and Xilinx 4085 with local memory cache

  21. Runtime Hardware Control: Virtual Hardware • Consider an OS that is swapping hardware configurations in/out of chip (microseconds) • Partial configurations in and out of the Virtex parts on the SLAAC-1V • Switching contexts on the RCM board • Cannot afford to have the configurations sent by the OS to the board on every configuration swap • The transfer would overwhelm the microsecond cost of the swap itself

  22. Runtime Hardware Control: Virtual Hardware (continued) • Most programs exhibit temporal locality • Exploit this in a way similar to virtual memory • Both the SLAAC-1V and the Sanders RCM provide the memory and the control capability to build a configuration cache • Instead of sending configurations to the board, control signals are sent invoking reconfiguration from the cache • Transparent to the programmer
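A configuration cache of this kind can be sketched in a few lines. This is a minimal sketch, not the SLAAC or DRACS software: the class name, the `fetch_from_host` callback, and the least-recently-used replacement policy are all assumptions made for illustration (the slides do not state a replacement policy).

```python
from collections import OrderedDict

class ConfigCache:
    """Board-local cache of configurations, modeled with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()  # config_id -> bitstream, oldest first
        self.host_transfers = 0     # how often the host had to send data

    def activate(self, config_id, fetch_from_host):
        """Return the bitstream for config_id, fetching it only on a miss."""
        if config_id in self.slots:
            self.slots.move_to_end(config_id)  # hit: mark as recently used
            return self.slots[config_id]
        bitstream = fetch_from_host(config_id)  # miss: costly host transfer
        self.host_transfers += 1
        self.slots[config_id] = bitstream
        if len(self.slots) > self.capacity:
            self.slots.popitem(last=False)      # evict least recently used
        return bitstream

# Hypothetical usage: six activations, only three host transfers.
cache = ConfigCache(capacity=2)
for cid in ["A", "B", "A", "A", "C", "A"]:
    cache.activate(cid, lambda c: b"bitstream-" + c.encode())
print(cache.host_transfers)
```

With temporal locality in the activation sequence, most swaps become hits, so the host only sends a short control signal rather than a full configuration.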

  23. Runtime Hardware Control: Data-Driven RTR • Data-driven RTR requires extremely fast reconfiguration and virtually no overhead in the control of RTR • Little benefit to clock-cycle RTR (CSRC) if the control software takes longer • Must execute control of RTR near the chip • Need an abstraction for programmers to target

  24. Runtime Hardware Control: Data-Driven RTR (continued) • Using a Finite State Machine (FSM) provides a suitable solution • The FSM monitors the data encountered, triggering changes in state • State change in the FSM reconfigures the chip from the configuration cache • FSM can execute in a small space (e.g., a fraction of the Xilinx 4085) local to the board • Interface familiar to most programmers
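The FSM abstraction can be sketched as a transition table in which each state names the configuration that should be active. This is a software illustration only; the class, the `reconfigure` callback, and the example states and symbols are hypothetical, and the real FSM would run in hardware next to the chip.

```python
class ReconfigFSM:
    """FSM whose state changes trigger reconfiguration from a cache."""

    def __init__(self, transitions, state_config, start, reconfigure):
        self.transitions = transitions    # (state, symbol) -> next state
        self.state_config = state_config  # state -> configuration id
        self.state = start
        self.reconfigure = reconfigure    # callback taking a configuration id

    def feed(self, symbol):
        """Consume one data symbol; reconfigure only if the state changes."""
        next_state = self.transitions.get((self.state, symbol), self.state)
        if next_state != self.state:
            self.state = next_state
            self.reconfigure(self.state_config[next_state])
        return self.state

# Hypothetical example: switch cores when the data type in the stream changes.
loaded = []
fsm = ReconfigFSM(
    transitions={("idle", "video"): "video", ("video", "audio"): "audio",
                 ("audio", "video"): "video"},
    state_config={"idle": "cfg_idle", "video": "cfg_video", "audio": "cfg_audio"},
    start="idle",
    reconfigure=loaded.append,
)
for symbol in ["video", "video", "audio", "video"]:
    fsm.feed(symbol)
print(loaded)
```

Symbols that do not change the state cause no reconfiguration, so the control overhead is paid only when the data actually calls for different hardware.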

  25. Application: DES Core (Xilinx) • The circuitry for DES computation can be significantly reduced if a specific key is “folded into” the circuitry • This reduction allows for a smaller, faster hardware realization of DES • Of course, a DES implementation that is specific to a single key isn’t useful unless it can be reconfigured…

  26. DES Core (continued) • A DES core was implemented using JBits • A new core for each key is generated at runtime • Requires only changes to LUTs to configure for a new key • This implementation is faster than the current ASIC DES champion from Sandia • Technique being exploited for other encryption methods at Xilinx
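Why a key change needs only LUT changes can be illustrated with a toy lookup table. In DES the round key is XORed with the data before the S-box lookup, so for a fixed key the XOR can be absorbed into the table contents. The sketch below uses an arbitrary 4-bit table, not a real DES S-box, and a hypothetical helper name; it is not the JBits implementation.

```python
def fold_key(table, key):
    """Return a table t2 with t2[x] == table[x ^ key] for every input x.

    Assumes len(table) is a power of two and 0 <= key < len(table).
    The folded table replaces an XOR stage followed by a lookup.
    """
    return [table[x ^ key] for x in range(len(table))]

# Arbitrary 4-input toy table (illustrative values, not a DES S-box).
toy_table = [7, 12, 1, 9, 14, 3, 0, 10, 5, 15, 8, 2, 11, 6, 13, 4]
key_bits = 0b1010

folded = fold_key(toy_table, key_bits)

# The folded table alone matches "XOR with the key, then look up".
assert all(folded[x] == toy_table[x ^ key_bits] for x in range(16))
```

Generating hardware for a new key then amounts to recomputing table contents like `folded`, with no change to routing.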

  27. EPIC View of 16 Rounds Courtesy Cameron Patterson

  28. Comparing Fully Unrolled and Pipelined Designs Courtesy Cameron Patterson

  29. Application: Number Crunching (Virginia Tech) • Traditional “numerical-analysis” style computation has focused on the use of IEEE-compliant floating-point arithmetic on general purpose CPUs • Two trends are forcing a refocus • Intel (and others) do not focus design on this market • Embedded processing is becoming increasingly complex, requiring more “number-crunching”

  30. Number Crunching (cont.) • Cannot do away with key features of IEEE-compliant arithmetic (too many algorithms depend on it) • Floating-point units, however, are large and expensive • Can customize hardware to provide performance in reasonable package • Reconfiguration is a key

  31. Number Crunching (cont.) • Use constant floating-point multipliers • e.g., as coefficients in an FIR • These multipliers are smaller and faster than two-input multipliers • Analysis provides bounds on the size of IEEE-compliant implementations
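The size advantage of a constant multiplier comes from replacing general multiplication with table lookups and additions. The sketch below shows the integer version of this idea (the kind of product needed for the mantissas), assuming 4-bit digits to match a 4-input LUT. It does not model exponents, rounding, or any other part of IEEE-compliant arithmetic, and the function names are hypothetical.

```python
def build_kcm_table(constant, digit_bits=4):
    """Precompute constant * d for every possible digit value d."""
    return [constant * d for d in range(1 << digit_bits)]

def kcm_multiply(table, x, width, digit_bits=4):
    """Multiply unsigned x (x < 2**width) by the constant behind `table`.

    Each digit of x selects a precomputed partial product, which is
    shifted to its digit position and accumulated.
    """
    mask = (1 << digit_bits) - 1
    result = 0
    for shift in range(0, width, digit_bits):
        digit = (x >> shift) & mask
        result += table[digit] << shift
    return result

# Hypothetical coefficient 37 applied to an 8-bit operand.
table = build_kcm_table(37)
print(kcm_multiply(table, 200, width=8))
```

Changing the coefficient only changes the table contents, which is the same property that makes such multipliers a natural fit for run-time parameterizable cores.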

  32. Summary • Obstacles to practical RTR are being overcome • New hardware devices, experimental and commercial, are now available • New software is coming online to allow run-time bitstream generation • And now for some predictions…

  33. RTR Predictions • Security of reconfigurable devices comes into question and changes are made to address this issue • APIs to commercial FPGA bitstreams become commonplace, allowing more widespread innovation in RTR software • RTR hardware becomes an essential aspect of SOC solutions which, by their nature, avoid the “scale by adding more hardware” aspect of PCs • Will proliferate in industries that need low-cost, low-power, small solutions (e.g., cellular phones)
