
Lecture 14: (Re)configurable Computing Case Studies


  1. Lecture 14: (Re)configurable Computing Case Studies — Prof. Jan Rabaey, Computer Science 252, Spring 2000. The contributions of André DeHon (Caltech) and Ravi Subramanian (Morphics) to this slide set are gratefully acknowledged.

  2. Summary of Previous Class
   • Configurable computing: “programming in space,” versus the “programming in time” of traditional instruction-set computers
   • Key design choices:
     • Computational units and their granularity
     • Interconnect network
     • (Re)configuration time and frequency
   • Next class: some practical examples of reconfigurable computers

  3. Applicability of Configurable Processing
   • Stand-alone computational engines (e.g. PADDI, UCLA Mojave)
   • Adding programmable I/O to embedded processors (e.g. NAPA 1000)
   • Augmenting the instruction set of processors (e.g. GARP, RAW)
   • Providing programmable accelerator co-processors to embedded micros and DSPs (e.g. Chameleon, Pleiades, Morphics)

  4. Stand-Alone Computational Engines: UCLA Mojave System
   Template matching for automatic target recognition (ATR); i960 board.

  5. As Programmable Interface and I/O Processor
   • Logic used in place of: ASIC environment customization; external FPGA/PLD devices
   • Examples: bus protocols, peripherals, sensors, actuators
   • Case for:
     • There is always some system adaptation to do: varying glue-logic requirements
     • Modern chips have the capacity to hold a processor + glue logic
     • Reduces part count
     • Value added must now be accommodated on chip (formerly at board level)

  6. Triscend E5 Example: Interface/Peripherals

  7. Model: IO Processor
   • Array dedicated to servicing an IO channel: sensor, LAN, WAN, peripheral
   • Provides protocol handling and stream computation: compression, encryption
   • Looks like an IO peripheral to the processor
   • Case for: many protocols and services, but only a few needed at a time; dedicate attention to the channel and offload the processor

  8. IO Processing
   • A single-threaded processor cannot continuously monitor multiple data pipes (sources, sinks)
   • Need some minimal, local control to handle events
   • For performance or real-time guarantees, may need to service an event rapidly
   • E.g. checksum (decode) and acknowledge a packet, as in the sketch below
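A minimal sketch (not from the slides) of that example: an IO processor checksums an incoming packet and acknowledges it locally, so the host CPU never has to watch the pipe. The packet format and the checksum rule are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in checksum: complemented byte sum. */
static uint8_t checksum(const uint8_t *p, size_t n)
{
    unsigned s = 0;
    while (n--)
        s += *p++;
    return (uint8_t)~s;
}

/* Event handler: runs on the IO processor, not on the host. */
static void handle_packet(const uint8_t *pkt, size_t n, uint8_t csum)
{
    if (checksum(pkt, n) == csum)
        puts("ACK");            /* acknowledge immediately, locally */
    else
        puts("NAK");            /* ask the sender to retransmit     */
}

int main(void)
{
    const uint8_t pkt[] = { 0x01, 0x02, 0x03 };
    handle_packet(pkt, sizeof pkt, checksum(pkt, sizeof pkt)); /* ACK */
    return 0;
}
```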

  9. NAPA 1000 Block Diagram (source: National Semiconductor)
   Major blocks: CR32 CompactRISC™ 32-bit processor; BIU bus interface unit (external memory interface); PMA pipeline memory array; SMA scratchpad memory array; TBT ToggleBus™ transceiver (system port); RPC reconfigurable pipeline controller; ALP adaptive logic processor; CIO configurable I/O; CR32 peripheral devices.

  10. NAPA 1000 as IO Processor (source: National Semiconductor)
   Figure: the NAPA 1000 connects to the system host via the system port, to application-specific sensors, actuators, or other circuits via the CIO, and to ROM & DRAM via the memory interface.

  11. I/O Stream Processor: Philips Nexperia NX-2700
   A programmable HDTV media processor combining a TriMedia VLIW core with configurable media co-processors.

  12. Model: Instruction Augmentation
   • Observation: instruction bandwidth
     • A processor can only describe a small number of basic computations in a cycle: I instruction bits select among 2^I operations
     • This is a tiny fraction of the operations one could define even in terms of w-bit ops: there are (2^w)^(2^(2w)) = 2^(w·2^(2w)) distinct functions of two w-bit inputs
     • The processor could have to issue on the order of 2^(w·2^(2w) − I) operations just to describe some computations
   • An a priori selected base set of functions could be very bad for some applications
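As a worked instance of the counting argument (the choices w = 8 and I = 32 are ours, not from the slides):

```latex
% Number of distinct operations taking two w-bit inputs to a w-bit output:
% 2^{2w} input combinations, 2^w output choices for each.
\[
  N_{\mathrm{ops}} \;=\; \bigl(2^{w}\bigr)^{2^{2w}} \;=\; 2^{\,w \cdot 2^{2w}}
\]
% For w = 8: N_ops = 2^{8 \cdot 65536} = 2^{524288},
% while a 32-bit instruction word can name at most 2^{32} operations.
\[
  N_{\mathrm{ops}}\big|_{w=8} \;=\; 2^{524288} \;\gg\; 2^{32} \;=\; 2^{I}\big|_{I=32}
\]
```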

  13. Instruction Augmentation
   • Idea: provide a way to augment the processor’s instruction set with operations needed by a particular application; close the semantic gap / avoid the mismatch
   • What’s required:
     • some way to fit augmented instructions into the instruction stream
     • an execution engine for the augmented instructions (if programmable, it has its own instructions)
     • interconnect to the augmented instructions
   (A toy software model follows below.)
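A toy, compile-and-run model of the idea: the application sees one "instruction" where a plain ISA needs a loop. All names here are invented; no real toolchain or ISA is implied.

```c
#include <stdint.h>
#include <stdio.h>

/* Model of an array-mapped operation: Hamming distance of two words.
 * On an augmented processor this would be a single flow-through op;
 * here its behavior is modeled in plain C. */
static uint32_t rfu_hamming(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b, n = 0;
    while (x) {                 /* on a base ISA: a multi-cycle loop */
        n += x & 1u;
        x >>= 1;
    }
    return n;
}

int main(void)
{
    /* Application code issues one "augmented instruction". */
    printf("%u\n", rfu_hamming(0xF0F0u, 0x0FF0u));  /* prints 8 */
    return 0;
}
```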

  14. “First” Instruction Augmentation: PRISM
   • PRISM: Processor Reconfiguration through Instruction Set Metamorphosis
   • PRISM-I: 68010 (10 MHz) + XC3090
   • can reconfigure the FPGA in one second!
   • 50-75 clocks for operations [Athanas+Silverman: Brown]

  15. PRISM-I Results
   Raw kernel speedups (results table shown as a figure, not reproduced here).

  16. PRISM
   • FPGA on the bus, accessed as a memory-mapped peripheral
   • explicit context management
   • some software discipline required for use
   • …not much of an “architecture” presented to the user (see the sketch below)
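A minimal sketch of that usage model, assuming a memory-mapped control/operand/result window. The register layout is invented, and the "device" is modeled by a plain array so the sketch actually runs on a host.

```c
#include <stdint.h>
#include <stdio.h>

static volatile uint32_t fpga_regs[3];      /* model of the device window */
#define FPGA_CTRL   fpga_regs[0]
#define FPGA_OPND   fpga_regs[1]
#define FPGA_RESULT fpga_regs[2]

static uint32_t fpga_apply(uint32_t x)
{
    FPGA_OPND = x;                  /* write the operand over the "bus"  */
    FPGA_CTRL = 1;                  /* start; software manages contexts  */
    FPGA_RESULT = FPGA_OPND * 3u;   /* model: the device computes...     */
    FPGA_CTRL = 0;                  /* ...and clears its busy bit        */
    return FPGA_RESULT;             /* read the result back              */
}

int main(void)
{
    printf("%u\n", fpga_apply(7));  /* prints 21 */
    return 0;
}
```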

  17. PRISC [Razdan+Smith: Harvard]
   • Takes the next step:
     • what does it look like if we put it on chip?
     • how does it integrate into the processor ISA?
   • Architecture:
     • coupled into the register file as a “superscalar” functional unit
     • flow-through array (no state)

  18. PRISC
   • ISA integration:
     • adds an expfu instruction
     • 11-bit address space for user-defined expfu instructions
     • fault on PFU instruction mismatch; trap code services the instruction miss
     • all operations occur in one clock cycle
   • easily works with processor context switch: no state + fault on a mismatched PFU instruction (modeled below)
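A software model of the fault-on-mismatch semantics: if the PFU does not currently hold the requested opcode, a "trap" loads the configuration before the operation completes. expfu is the instruction name from the slides; everything else is invented.

```c
#include <stdint.h>
#include <stdio.h>

static int loaded_op = -1;              /* opcode resident in the PFU */

/* Stand-in for whatever the loaded configuration computes. */
static uint32_t pfu_compute(uint32_t a, uint32_t b)
{
    return a ^ (b << 1);
}

static uint32_t expfu(int op, uint32_t a, uint32_t b)
{
    if (op != loaded_op) {              /* mismatch: fault to a trap   */
        printf("trap: loading PFU configuration %d\n", op);
        loaded_op = op;                 /* handler services the miss   */
    }
    return pfu_compute(a, b);           /* then the op takes one cycle */
}

int main(void)
{
    printf("%u\n", expfu(5, 3, 4));     /* first use traps, then runs  */
    printf("%u\n", expfu(5, 3, 4));     /* resident now: no trap       */
    return 0;
}
```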

  19. PRISC Results [Razdan/MICRO-27]
   • all compiled, working from the MIPS binary
   • <200 4-LUTs (64×3 array?)
   • 200 MHz MIPS base

  20. Chimaera [Hauck: Northwestern]
   • Start from the PRISC idea:
     • integrate as a functional unit
     • no state
     • RFUOPs (like expfu)
     • stall the processor on an instruction miss, reload
   • Add:
     • manage multiple loaded instructions
     • more than 2 inputs possible

  21. Chimaera Architecture [Hauck/FCCM97]
   • A “live” copy of the register-file values feeds into the array
   • Each row of the array may compute from register values or from intermediates (other rows)
   • A tag on the array indicates the RFUOP
   • Results (speedup): Compress 1.11; Eqntott 1.8; Life 2.06 (160 hand parallelization)

  22. Instruction Augmentation
   • Small arrays with limited state
   • So far, for automatic compilation, reported speedups have been small
   • Open question: discovering less-local recodings that extract greater benefit

  23. GARP [Hauser+Wawrzynek: UCB]
   • Identified problems:
     • single-cycle flow-through is not the most promising usage style
     • moving data through the register file to/from the array can be a limitation, a bottleneck to achieving a high computation rate

  24. GARP
   • Integrate as a coprocessor:
     • similar bandwidth to the processor as an FU
     • own access to memory
   • Support multi-cycle operation:
     • allow state
     • cycle counter to track the operation
   • Fast operation selection:
     • cache for configurations
     • dense encodings, wide path to memory

  25. GARP
   • ISA: coprocessor operations
     • issue gaconfig to make a particular configuration resident (may be active or cached)
     • explicitly move data to/from the array: 2 writes, 1 read
     • the processor suspends during the coprocessor operation; a cycle count tracks it
   • The array may directly access memory:
     • processor and array share the memory space
     • cache/MMU keeps consistency
     • can exploit streaming data operations
   (Usage pattern modeled below.)
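A compile-and-run software model of the coprocessor calling pattern: gaconfig, mtga, and mfga are the instruction names from the slides, but the stubs below and the multiply-accumulate "configuration" are hypothetical stand-ins.

```c
#include <stdint.h>
#include <stdio.h>

static uint32_t ga_reg[4];                  /* model: array-edge registers */

static void gaconfig(int cfg) { (void)cfg; }           /* config resident   */
static void mtga(int r, uint32_t v) { ga_reg[r] = v; } /* processor -> array */
static uint32_t mfga(int r) { return ga_reg[r]; }      /* array -> processor */

/* Model of what the resident configuration computes. */
static void array_runs(void)
{
    ga_reg[2] = ga_reg[0] * ga_reg[1] + ga_reg[2];  /* multiply-accumulate */
}

int main(void)
{
    gaconfig(42);       /* make configuration 42 resident (may be cached)  */
    mtga(0, 3);         /* explicitly move data in: 2 writes...            */
    mtga(1, 5);
    array_runs();       /* processor suspends; a cycle counter tracks this */
    printf("%u\n", mfga(2));    /* ...and 1 read back: prints 15           */
    return 0;
}
```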

  26. GARP Processor Instructions
   (Instruction table shown as a figure, not reproduced here.)

  27. GARP Array
   • Row-oriented logic: denser for datapath operations
   • Dedicated path for processor/memory data
   • The processor does not have to be involved in the array-to-memory path

  28. GARP Results [Hauser+Wawrzynek/FCCM97]
   • General results: 10-20× on streaming, feed-forward operations
   • 2-3× when data dependencies limit pipelining

  29. PRISC/Chimaera … GARP
   • PRISC/Chimaera:
     • basic op is single cycle: expfu (rfuop)
     • no state
     • could conceivably have multiple PFUs?
     • discover parallelism => run in parallel?
     • can’t run deep pipelines
   • GARP:
     • basic op is multicycle: gaconfig, mtga, mfga
     • can have state / deep pipelining
     • multiple arrays viable?
     • identify mtga/mfga with the corresponding gaconfig?

  30. Common Theme
   • To get around instruction-expression limits:
     • define the new instruction in the array: many bits of configuration give broad expressibility, many parallel operators
     • give the array configuration a short “name” which the processor can call out: effectively the address of the operation
   • But the impact of using reconfiguration at the instruction level seems limited => explore opportunities at larger granularity levels (basic block, task, process)

  31. Applicability of Configurable Processing
   • Stand-alone computational engines (e.g. PADDI, UCLA Mojave)
   • Adding programmable I/O to embedded processors (e.g. NAPA 1000)
   • Augmenting the instruction set of processors (e.g. GARP, RAW)
   • Providing programmable accelerator co-processors to embedded micros and DSPs (e.g. Chameleon, Pleiades, Morphics)

  32. Example: Chameleon Reconfigurable Co-Processor (network and communication applications)
   Figure: an ARC CPU with instruction cache, a PCI interface, a JTAG debugging port, a memory controller, and data memory sit on a wide internal communications bus with a bus manager & DMA controllers; the reconfigurable logic (an array of 32-bit datapath operators & control logic) has local store memory (LSM), a background configuration plane fed by the configuration bit stream, and multiple banks of I/O.

  33. Reconfigurable Processor Tools Flow
   The customer application / IP (C code) splits into two paths: the C compiler produces ARC object code, while RTL HDL goes through synthesis & layout to produce configuration bits; the linker combines both into a Chameleon executable, which targets the development board or the C-model simulator / C debugger.

  34. Heterogeneous Reconfiguration
   • Reconfigurable logic: bit-level operations, e.g. encoding
   • Reconfigurable datapaths: dedicated data paths, e.g. filters, AGUs
   • Reconfigurable arithmetic: arithmetic kernels, e.g. convolution
   • Reconfigurable control: RTOS process management

  35. Multi-Granularity Reconfigurable Architecture: The Berkeley Pleiades Architecture
   Figure: a control processor and satellite processors (dedicated arithmetic, arithmetic processors, configurable datapath, configurable logic) joined by a communication network, network interfaces, and a configuration bus.
   • Computational kernels are “spawned” to satellite processors
   • The control processor supports the RTOS and reconfiguration
   • Order(s)-of-magnitude energy reduction over traditional programmable architectures

  36. Matching Computation and Architecture
   Figure: a control processor with satellites (address generators, memories, MACs) mapped onto a convolution kernel.
   • Two models of computation: communicating processes + data-flow
   • Two architectural models: sequential control + data-driven

  37. Execution Model of a Data-Flow Kernel
   The embedded processor runs the code segments before (start) and after (end) the kernel; the kernel itself is distributed across satellites (AddrGen, MEM: in, MEM: phi, two MPYs, ALU):

     for (i = 1; i <= L; i++)
       for (k = i; k <= L; k++)
         phi[i][k] = phi[i-1][k-1]
                   + in[NP-i]*in[NP-k]
                   - in[NA-1-i]*in[NA-1-k];

   • Distributed control and memory
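For reference, a self-contained version of the kernel above. The sizes L, NP, NA and the input values are placeholders; the slide does not give them.

```c
#include <stdio.h>

#define L  4
#define NP 16
#define NA 16
#define N  32

int main(void)
{
    double in[N], phi[L + 1][L + 1] = { { 0 } };

    for (int n = 0; n < N; n++)
        in[n] = 0.1 * n;                     /* dummy input samples */

    /* The data-flow kernel that the slide maps onto satellites. */
    for (int i = 1; i <= L; i++)
        for (int k = i; k <= L; k++)
            phi[i][k] = phi[i - 1][k - 1]
                      + in[NP - i] * in[NP - k]
                      - in[NA - 1 - i] * in[NA - 1 - k];

    printf("phi[%d][%d] = %f\n", L, L, phi[L][L]);
    return 0;
}
```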

  38. Reconfigurable Kernels for W-CDMA
   • The dominant kernel M(MTX) requires an array of MACs and segmented memories
   • Additional operations such as sqrt(x), 1/x, and Trellis decoding may be implemented using an FPGA or CORDIC satellite

  39. Inter-Satellite Communication
   • Data-driven execution: a satellite processor is enabled only when its input data is ready (toy model below)
   • Data sources generate data of different types: scalars, vectors, matrices; an end-of-vector token delimits vectors
   • Data-computing processors handle data inputs of different types
   Figure: data sources (embedded processor, AddrGen + memory, MPY) feeding data-computing processors (MPY, MAC).
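A toy model (not from the slides) of the data-driven firing rule: a MAC "satellite" consumes one token from each input stream per firing and emits its accumulated result when it sees an end-of-vector token.

```c
#include <stdio.h>

#define EOV -1                       /* end-of-vector token (model) */

int main(void)
{
    int a[] = { 1, 2, 3, EOV };      /* input stream A */
    int b[] = { 4, 5, 6, EOV };      /* input stream B */
    long acc = 0;

    for (int i = 0; ; i++) {
        /* Enabled only while both inputs hold a data token. */
        if (a[i] == EOV || b[i] == EOV) {
            printf("MAC result: %ld\n", acc);   /* 1*4 + 2*5 + 3*6 = 32 */
            break;
        }
        acc += (long)a[i] * b[i];
    }
    return 0;
}
```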

  40. Impact of Architectural Choice
   Example: 16-point complex radix-2 FFT (final stage). (Pleiades comparison table shown as a figure, not reproduced here.)

  41. Adaptive Multi-User Detector for W-CDMA: Pilot Correlator Unit Using LMS

  42. Architecture Comparison: LMS Correlator at 1.67 Msymbols/s Data Rate
   • Complexity: 300 Mmult/s and 357 Macc/s
   • 16 MMACs/mW!
   • Note: the TMS implementation requires 36 parallel processors to meet the data rate; validity questionable

  43. Maia: Reconfigurable Baseband Processor for Wireless
   • 0.25 µm technology: 4.5 mm × 6 mm; 1.2 million transistors; 40 MHz at 1 V
   • 1 mW VCELP voice coder
   • Hardware: 1 ARM-8; 8 SRAMs & 8 AGPs; 2 MACs; 2 ALUs; 2 in-ports and 2 out-ports; 14×8 FPGA

  44. Reconfigurable Interconnect Exploration
   Figure: mesh module and hierarchical mesh.

  45. Software Methodology Flow
   Figure: C++ input flows through SUIF+ and C-IF.

  46. Hardware-Software Exploration
   Figure: macromodel call.

  47. Implementation Fabrics for Protocols: Intercom TDMA MAC
   A protocol = an extended FSM (see the toy model below).
   Figure: RACH request/acknowledge handling with idle/write/read/update states, a slot-set table (Slot_Set_Tbl, 2×16, addr), buffers with R_ENA/W_ENA, and signals slot_set<31:0>, Slot_no<5:0>, slot start, packet end.
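A toy rendition of "a protocol = extended FSM": a handful of control states plus extended state (a slot counter). The state and event names follow the slide's figure; the transition logic is invented.

```c
#include <stdio.h>

typedef enum { IDLE, WRITE, READ, UPDATE } state_t;
typedef enum { RACH_REQ, RACH_AKN, PKT_END } event_t;

int main(void)
{
    state_t s = IDLE;
    int slot_no = 0;                          /* extended state */
    event_t trace[] = { RACH_REQ, RACH_AKN, PKT_END };

    for (int i = 0; i < 3; i++) {
        switch (s) {
        case IDLE:   if (trace[i] == RACH_REQ) s = WRITE;  break;
        case WRITE:  if (trace[i] == RACH_AKN) { s = READ; slot_no++; } break;
        case READ:   if (trace[i] == PKT_END)  s = UPDATE; break;
        case UPDATE: s = IDLE; break;
        }
    }
    printf("final state %d, slot_no %d\n", (int)s, slot_no);
    return 0;
}
```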

  48. Intercom TDMA MAC: Implementation Alternatives
   • ASIC: 1 V, 0.25 µm CMOS process
   • FPGA: 1.5 V, 0.25 µm CMOS low-energy FPGA
   • ARM8: 1 V, 25 MHz processor; n = 13,000
   • Ratio (ASIC : FPGA : ARM8): 1 : 8 : >>400

  49. The Software-Defined Radio
   Figure: candidate implementation fabrics: embedded µP, FPGA, dedicated FSM, dedicated DSP, reconfigurable datapath.

  50. An Industrial Example: Basestation for Cellular Wireless
   Figure: 800 MHz and 1900 MHz antenna systems (multiple sectors, multi-band); each receive chain is an RF/IF tuner followed by a block-spectrum A/D, all feeding a high-speed digital bus that connects to T1 / E1.
   • Modular/parameterizable: per carrier, per TDMA time-slot, per CDMA code
