
Reconfigurable Computing


Presentation Transcript


  1. Reconfigurable Computing • Dr. Christophe Bobda • CSCE Department • University of Arkansas

  2. Chapter 1 (Cont.) Architectures

  3. Agenda • Motivation • Coarse-Grained Reconfigurable Devices • DataFlow machines • The PACT XPP • The NEC DRP • The PicoChip • Network-Based architectures • The Quicksilver ACM • Embedded PLDs • The IPflex DAP/DNA • Tensilica reconfigurable processor • The Stretch Processor

  4. 1. Coarse-grained Reconfigurable Devices – Motivation

  5. 1. Recall • Brief historical development (Estrin's Fix-Plus and the Rammig machine) • Programmable Logic • PALs and PLAs • CPLDs • FPGAs • Technology • Architecture by means of examples • Actel • Xilinx • Altera

  6. 1. Once again: General purpose vs Special purpose • With LUTs as function generators, FPGAs can be seen as general-purpose devices • Like any general-purpose device, they are flexible and “inefficient“ • Flexible because any n-variable Boolean function can be implemented in an n-input LUT • Inefficient since complex functions must be implemented in many LUTs at different locations • The connections among the LUTs are made through the routing matrix, which increases the signal delays • A LUT implementation is usually slower than direct „wiring“
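
To make the first point concrete, here is a minimal software model (not from the slides) of an n-input LUT: the configuration is simply a 2^n-entry truth table, and evaluation is one table lookup. The function used (3-input majority) is an arbitrary illustration.

```python
# Minimal model of an n-input LUT: the configuration is just a truth table
# with 2**n entries; evaluation is a single table lookup addressed by the inputs.

def make_lut(func, n):
    """Configure a LUT by tabulating an arbitrary n-variable Boolean function."""
    table = []
    for idx in range(2 ** n):
        bits = [(idx >> i) & 1 for i in range(n)]   # input vector for this row
        table.append(func(*bits))
    return table

def lut_eval(table, *inputs):
    """Evaluate the LUT: pack the input bits into an address and look it up."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return table[addr]

# Example: a 3-input majority function fits in a single 3-input LUT.
maj3 = make_lut(lambda a, b, c: int(a + b + c >= 2), 3)
assert lut_eval(maj3, 1, 1, 0) == 1
assert lut_eval(maj3, 0, 0, 1) == 0
```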

  7. 1. Once again: General purpose vs Special purpose • Example: Implement the function using 2-input LUTs (see the sketch below) • LUTs are grouped in logic blocks (LBs), two 2-input LUTs per LB • Connections inside an LB are efficient (direct) • Connections outside the LBs are slow (connection matrix) [Figure: logic blocks of 2-input LUTs computing F from inputs A, B, C, D via the connection matrix]
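
The function used on the slide is not reproduced in this transcript, so the sketch below assumes, purely for illustration, F = (A AND B) OR (C AND D). It shows the decomposition into three 2-input LUTs; the signals feeding the final OR stage are the ones that may have to cross the slow connection matrix when the LUTs sit in different logic blocks.

```python
# Hypothetical example (the slide's actual function is not given here):
# F = (A AND B) OR (C AND D), built from three 2-input LUTs.
# LUT contents are truth tables indexed as in the previous sketch.

LUT_AND = [0, 0, 0, 1]   # 2-input AND
LUT_OR  = [0, 1, 1, 1]   # 2-input OR

def lut2(table, a, b):
    return table[a + (b << 1)]

def f(a, b, c, d):
    t1 = lut2(LUT_AND, a, b)     # first LUT, inside logic block 1
    t2 = lut2(LUT_AND, c, d)     # second LUT, inside logic block 2
    return lut2(LUT_OR, t1, t2)  # third LUT; t1/t2 may travel over the connection matrix

assert f(1, 1, 0, 0) == 1 and f(0, 1, 1, 1) == 1 and f(1, 0, 0, 1) == 0
```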

  8. 1. Once again: General purpose vs Special purpose • Idea: Implement frequently used blocks as hard-core modules in the device [Figure: the same design with the frequently used block implemented as a hard-core module next to the connection matrix]

  9. 1. Coarse-grained reconfigurable devices • Overcome the inefficiency of FPGAs by providing efficiently implemented coarse-grained functional units (adders, multipliers, integrators, etc.) • Advantage: Very efficient in terms of speed (no connections over the connection matrix needed for basic operators) • Advantage: Direct wiring instead of LUT implementation • A coarse-grained device is usually an array of identical programmable processing elements (PEs), each capable of executing a few operations like addition and multiplication • Depending on the manufacturer, the functional units communicate via busses or are directly connected using programmable routing matrices

  10. 1. Coarse-grained reconfigurable devices • Memory exists between and inside the PEs • Several other functional units, depending on the manufacturer • A PE is usually an 8-bit, 16-bit or 32-bit tiny ALU which can be configured to execute only one operation for a given period (until the next configuration) • Communication among the PEs can be either packet-oriented (on busses) or point-to-point (using crossbar switches) • Since each vendor has its own implementation approach, the study is done by means of a few examples: PACT XPP, Quicksilver ACM, NEC DRP, picoChip, IPflex DAP/DNA
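
A rough software sketch of such a PE array is given below; the operation set, word width and names are illustrative assumptions, since every vendor implements the PEs differently.

```python
# Illustrative model of a coarse-grained PE: a small ALU configured with a single
# operation (held until the next reconfiguration) and a fixed word width.

OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "sub": lambda a, b: a - b,
}

class PE:
    def __init__(self, width=16):
        self.mask = (1 << width) - 1   # 8-, 16- or 32-bit datapath
        self.op = None

    def configure(self, op_name):
        self.op = OPS[op_name]         # one operation per configuration period

    def execute(self, a, b):
        return self.op(a, b) & self.mask

# A 2-stage point-to-point pipeline: (x * y) + z computed on two configured PEs.
pe0, pe1 = PE(), PE()
pe0.configure("mul")
pe1.configure("add")
result = pe1.execute(pe0.execute(3, 7), 10)   # 3*7 + 10 = 31
assert result == 31
```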

  11. 2.1 Dataflow Machines

  12. 2.1 The PACT XPP – Overall structure • The XPP (Extreme Processing Platform) is a hierarchical structure consisting of: • An array of Processing Array Elements (PAEs) grouped in clusters called Processing Arrays (PAs) • A Processing Array Cluster (PAC) = Processing Array (PA) + Configuration Manager (CM) • A hierarchical configuration tree • Local CMs manage the configuration at the PA level • The local CMs access the local configuration memory, while the Supervisor CM (SCM) accesses external memory and supervises the whole configuration process on the device

  13. 2.1 The PACT XPP – Overall structure • The PAE: two types of PAE exist • The ALU-PAE • The RAM-PAE • The ALU-PAE: • Contains an ALU which can be configured to perform basic operations • The Back Register (BREG) provides routing channels for data and events from bottom to top • The Forward Register (FREG) provides routing channels from top to bottom

  14. 2.1 The PACT XPP – Overall structure • The DataFlow Register (DF-REG) can be used at the object outputs for buffering data • Input registers can be preloaded with configuration data • The RAM-PAE: • Differs from the ALU-PAE only in its function: instead of an ALU, a RAM-PAE contains a dual-ported RAM • Useful for data storage • Data is written or read once an address has been read at the RAM inputs (see the sketch below) • BREG, FREG, and DF-REG of the RAM-PAE have the same function as in the ALU-PAE
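
The read/write behaviour described for the RAM-PAE can be sketched as a memory object that performs an access whenever an address (and optionally data) arrives at its inputs. This is a strongly simplified, assumption-based model, not PACT's actual interface.

```python
# Simplified sketch of a RAM-PAE: a dual-ported RAM that performs a read or a write
# whenever an address (and, for writes, a data word) appears at its inputs.

class RamPAE:
    def __init__(self, depth=256):
        self.mem = [0] * depth

    def port(self, addr, data=None):
        """One access on either port: write if data is supplied, otherwise read."""
        if data is not None:
            self.mem[addr] = data
            return None
        return self.mem[addr]

ram = RamPAE()
ram.port(5, data=42)        # write packet: address 5, data 42
assert ram.port(5) == 42    # read packet: address 5
```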

  15. 2.1 The PACT XPP – Overall structure • Routing in the PACT XPP: • Two independent networks • One for data transmission • The other for event transmission • A configuration bus exists besides the data and event networks (very little information is available about the configuration bus) • All objects can be connected to horizontal routing channels using switch objects • Vertical routing channels are provided by the BREG and FREG • BREGs route from bottom to top • FREGs route from top to bottom [Figure: vertical and horizontal routing channels]

  16. 2.1 The PACT XPP - Interface • Interfaces are available inside the chip • Number and type of interfaces vary from device to device • On the XPP42-A1: 6 internal interfaces consisting of: • 4 identical general-purpose I/O on-chip interfaces (bottom left, upper left, upper right, and bottom right) • One configuration manager interface (not shown in the picture) • One JTAG (Joint Test Action Group, IEEE Standard 1149.1) boundary-scan interface for testing purposes [Figure: interfaces]

  17. 2.1 The PACT XPP - Interface • The I/O interfaces can operate independently of each other. Two operation modes: • The RAM mode • The streaming mode • RAM mode: • Each port can access external static RAM (SRAM) • Control signals for the SRAM transactions are available • No additional logic is required

  18. 2.1 The PACT XPP - Interface • Streaming mode: • For high-speed streaming of data to and from the device • Each I/O element provides two bidirectional ports for data streaming • Handshake signals are used to synchronize data packets with the external ports

  19. 2.1 The NEC DRP – Architecture • The NEC Dynamically Reconfigurable Processor (DRP) consists of: • A set of byte-oriented processing elements (PEs) • A programmable interconnection network for communication among the PEs • A sequencer that can be programmed as a finite state machine (FSM) to control the reconfiguration process • Memory around the device for storing configuration and computation data • Various interfaces

  20. 2.1 The NEC DRP - The Processing Element • ALU: ordinary byte arithmetic/logic operations • DMU (data management unit): handles byte select, shift, mask, constant generation, etc., as well as bit manipulations • An instruction dictates the ALU/DMU operations and the inter-PE connections • Source/destination operands can come from / go to either • the PE's own register file • other PEs (i.e., flow-through) • The instruction pointer (IP) is provided by the STC (state transition controller)

  21. 2.1 The NEC DRP - The Processing Element • The Instruction Pointer (IP) from the STC identifies a datapath plane • Spatial computation using a customized datapath plane • When the IP changes, the datapath plane switches instantaneously • The PE instructions, taken as a collection, behave like an extreme VLIW • Sequencing through instructions => dynamic reconfiguration [Figure: multiple datapath planes, e.g. AES, 3DES, MD5, SHA-1, Compress, with task selection by descriptor between Data In and Data Out]

  22. 2.1 The NEC DRP – Reconfiguration Process • Identify the instruction to be executed (IP from the STC) • Decode the instruction in the ALU plane • Configure the ALU plane according to the instruction [Figure: PE array with per-PE instruction memories; the IP selects which instruction each PE executes]
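
This process can be mimicked in software: every PE holds a few instructions, and the IP broadcast by the STC selects which of them (i.e. which datapath plane) is active. The sketch below is an illustrative model, not the actual DRP programming interface.

```python
# Sketch of DRP-style dynamic reconfiguration: every PE stores a few instructions,
# and the instruction pointer (IP) from the state transition controller selects
# which one is active, switching the whole datapath plane in a single step.

DRP_OPS = {"add": lambda a, b: a + b, "cmp": lambda a, b: int(a > b), "sel": lambda a, b: a}

class DrpPE:
    def __init__(self, instructions):
        self.instructions = instructions             # small local instruction memory

    def execute(self, ip, a, b):
        return DRP_OPS[self.instructions[ip]](a, b)  # the IP picks the active instruction

# Two PEs, each holding two contexts (two datapath planes).
pe_array = [DrpPE(["add", "cmp"]), DrpPE(["add", "sel"])]

# Plane 0: both PEs add.  Plane 1: PE0 compares, PE1 selects its first operand.
assert [pe.execute(0, 4, 3) for pe in pe_array] == [7, 7]
assert [pe.execute(1, 4, 3) for pe in pe_array] == [1, 4]
```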

  23. 2.1 The picoChip - Architecture • Hundreds of array elements, each with a versatile 16-bit processor and local data memory • Heterogeneous architecture with four types of elements optimized for different tasks (DSP or wireless functions) • Interfaces for: • SRAM • Host communication • External systems • Inter-picoChip systems

  24. 2.2 Network-Based Machines

  25. 2.2 The Quicksilver ACM - Architecture • Quicksilver ACM (Adaptive Computing Machine) • Fractal-like structure • Hierarchical grouping of four nodes with full communication among the nodes • 4 lower-level nodes are grouped into a higher-level node • The lowest level consists of 4 heterogeneous processing nodes • The connection is done through a Matrix Interconnect Network (MIN) • A system controller • Various I/O

  26. 2.2 The Quicksilver ACM – The processing node • An ACM processing node consists of: • An algorithmic engine, which is unique to each node type and defines the operations performed by the node • The node memory, for data storage at the node level • A node wrapper, which is common to all nodes and is used to hide the complexity of the heterogeneous architecture

  27. 2.2 The Quicksilver ACM – The processing node • Four types of nodes exist: • The Programmable Scalar Node (PSN) provides a standard 32-bit RISC architecture with 32-bit general-purpose registers • The Adaptive Execution Node (AXN) provides variable-size MAC and ALU operations • The Domain Bit Manipulation (DBM) node provides bit-manipulation and byte-oriented operations • The External Memory Controller node provides DDR RAM, SRAM, random memory access, and DMA controller interfaces [Figure: ACM PSN node]

  28. 2.2 The Quicksilver ACM – The processing node [Figures: ACM AXN node, ACM DBM node]

  29. 2.2 The Quicksilver ACM – The node wrapper • The node wrapper envelops the algorithmic engine and presents an identical interface to neighbouring nodes. It features: • A MIN interface to support the communication among nodes via the MIN network • A hardware task manager for task management at the node level • A DMA engine • Dedicated I/O circuitry • Memory controllers • Data distributors and aggregators [Figure: the ACM node wrapper]

  30. 2.2 The Quicksilver ACM – The node wrapper • The Matrix Interconnect Network (MIN) is the communication medium in an ACM chip • Hierarchically organized: the MIN at a given level connects many lower-level MINs • The MIN root is used for: • Off-chip communication • Configuration • Supporting the communication among nodes • It provides services like point-to-point dataflow streaming, real-time broadcasting, DMA, etc. [Figure: example of an ACM chip configuration]

  31. 2.2 The Quicksilver ACM – The System Controller • The system controller is in charge of the system management • Loads tasks into the nodes' ready-to-run queues for execution • Statically or dynamically sets the communication channels between the processing nodes • Carries out the reconfiguration of nodes on a clock-cycle-by-clock-cycle basis • The ACM chip features a set of I/O interface controllers like: • PCI • PLL • SDRAM and SRAM [Figure: the system controller and the interface controllers]

  32. 2.3 Embedded PLD

  33. 2.3 The IPflex DAP/DNA - Structure • The IPflex DAP/DNA has the structure of a System on Chip (SoC) with an embedded FPGA. It features: • An integrated RISC core that • carries out some of the computation • controls the reconfiguration process • A Distributed Network Architecture (DNA) matrix (a matrix of configurable operation units) • Communication over an internal bus • Different caches for data, instructions and configuration • I/O and memory interface controllers

  34. 2.3 Tensilica Xtensa • 32-bit synthesizable CPU core for SoC designs • High-performance, low-power • Tensilica Instruction Extension ("TIE") language • Designers can modify the feature set of the processor: cache sizes, address and data bus widths, and other core parameters • Automatic generation of a complete software development tool environment for a specific processor instance • "Application-specific extensions" defined at design time

  35. 2.3 Tensilica Xtensa • Instruction Set Architecture (ISA) • Consists of: • A base set of instructions (~80 instructions, a superset of traditional RISC) • A set of configurable options • Attributes: • Enables configurability • Minimizes code size • Reduces power requirements • Maximizes performance

  36. 2.3 Tensilica Xtensa

  37. 2.3 Tensilica Xtensa

  38. 2.3 Tensilica Xtensa • The user gets: • The resulting customized synthesizable hardware description • Additional logic to build a complete Xtensa CPU on an FPGA • A full set of diagnostics to verify the RTL • CAD tool scripts to assist with embedding the core in your design • A customized compiler toolchain tuned to the core • GNU C/C++ compiler, assembler and profiler • GDB and the DDD debugger GUI • XMON for debugging on a live FPGA

  39. 2.3 Stretch software-configurable processor • Xtensa V RISC CPU from Tensilica • Run-time instruction set extension through embedded programmable logic • Instruction Set Extension Fabric (ISEF) • SW-controlled • Wide load/store support for any alignment • Wide register file • Up to 3 wide operands • 1 or 2 wide results

  40. 2.3 Stretch software-configurable processor [Figures: Stretch S5000 ISEF, Stretch S5000 configurable processor]

  41. Device size • Usually measured by the number of transistors used in the device • This is not so helpful for reconfigurable devices, since the number of transistors does not reflect the number of usable resources in the chip. For example, FPGAs are among the most complex chips (more complex than Pentium processors), but their capacity is smaller than that of their ASIC counterparts • The capacity of an FPGA is usually measured in terms of the number of gate equivalents a design needs to be implemented • A gate equivalent is a unit of measure: 1 gate equivalent = one 2-input NAND gate • A one-million-gate FPGA is able to implement the equivalent of a circuit containing 1 million 2-input NAND gates
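
A quick back-of-the-envelope calculation illustrates how gate equivalents are used; the per-component gate counts below are assumed, textbook-style values, not vendor figures.

```python
# Rough gate-equivalent estimate for a small design.  The per-component counts are
# illustrative assumptions (1 gate equivalent = one 2-input NAND gate).

GATE_EQUIV = {
    "flip_flop": 6,        # assumed ~6 NAND2 per D flip-flop
    "full_adder": 9,       # assumed ~9 NAND2 per full-adder bit
}

design = {"flip_flop": 2000, "full_adder": 32 * 4}   # e.g. registers + four 32-bit adders

total = sum(count * GATE_EQUIV[part] for part, count in design.items())
print(f"Estimated size: {total} gate equivalents")        # 2000*6 + 128*9 = 13152
print(f"Fits in a 1M-gate FPGA: {total <= 1_000_000}")
```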
