Reconfigurable Architectures

AMANO, Hideharu
hunga@am.ics.keio.ac.jp

Reconfigurable System (Custom Computing Machine)
• A target algorithm is executed directly in hardware on SRAM-style FPGAs/PLDs.
• High performance, like special-purpose machines.
• High degree of flexibility, like general-purpose machines.
• A completely different execution mechanism from that of stored-program computers.

Simple gate arrays are being replaced with FPGAs/PLDs.
Recent FPGAs/PLDs:
• More than 1000K gates (difficult to use efficiently).
• The operating frequency is 30 MHz to 60 MHz.
• Large internal data RAMs.

SRAM FPGA (Field Programmable Gate Array, Xilinx's)
[Figure: FPGA structure. Labels: I/O, logic block, switch, configuration memory (SRAM), 5-input lookup table, 2 F.F.]
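The lookup table in the figure can be pictured as a small memory addressed by the logic inputs. The following is a minimal sketch, assuming a 5-input table holding 32 configuration bits; the function names `make_lut` and `lut_eval` are hypothetical and do not correspond to any vendor tool.

```python
def make_lut(func, n_inputs=5):
    """Build the configuration bits of a LUT: the truth table of func."""
    table = []
    for addr in range(1 << n_inputs):
        bits = [(addr >> b) & 1 for b in range(n_inputs)]
        table.append(func(*bits) & 1)
    return table


def lut_eval(config, inputs):
    """Evaluate the LUT: the input bits form the address into the table."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return config[addr]


# Configure the same hardware as a 5-input parity function.
parity = make_lut(lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e)
```

Rewriting the 32 stored bits changes the logic function without changing the circuit, which is what the SRAM configuration memory provides.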

SRAM CPLD (Complex Programmable Logic Device)
[Figure: CPLD structure. Labels: I/O, logic block, switch, SRAM (configuration memory).]

Reconfigurable Systems
• Stand-alone type
  • Implemented on boards or in a cabinet.
  • Splash 1/2, RM-I, II, III, IV, RASH (Mitsubishi), ATTRACTOR (NTT), FLEMING
• Co-processor type
  • Improves the performance of general-purpose processors.
  • PRISM-I, II, DISC II, Garp, CHIMAERA, Chameleon, PipeRench

Reconfigurable Systems
[Figure: timeline with three columns (stand-alone, co-processor, new device).]
• 1990: The 1st FPL; SPLASH, MPLD, PRISM-I
• 1992: The 1st Japanese FPGA/PLD Conf.; SPLASH-2, PRISM-II, RM-I, WASMII
• 1993: The 1st FCCM; RM-II, CacheLogic, RM-III, DISC, RM-IV
• 1995 onward: YARDS, Multi-context FPGA, RM-V, DISC-II, HOSMII, ATTRACTOR, FIPSOC, Context Switching FPGA, RASH, PipeRench, DRL, PCA
• 2000: CHIMAERA, Chameleon

Splash-2 (Arnold et al. 92)
• Center for Computing Sciences (USA)
• String matching, image processing, DNA matching
• 330 times faster than the Cray-II supercomputer
• Systolic algorithm
• VHDL, Parallel C
• Annapolis Micro Systems (WILDFIRE)

RM-IV (Kobe Univ.)
[Figure: array of FPGAs and memory modules (mem.) with FPIC and interface.]

RASH (Mitsubishi)
[Figure: RASH unit. Labels: CompactPCI bus, EXE boards, CPU board, disk, CD, display, Ethernet LAN.]
• 1 unit: 6 EXE boards, CPU boards (Pentium)
• Multiple units can be connected.
This slide is supported by Dr. Nakajima of Mitsubishi.

EXE boards of RASH
[Figure: EXE board. Labels: FPGAs (Altera FLEX10K100A, 62K to 158K gates), mesh links and buses, 2 clock lines, clocks/control signals, local bus, EXE-board controller, PCI bus I/F, PCI bus, SRAM (2 MB), a large SRAM/DRAM daughter board.]

ATTRACTOR (NTT)
• Specialized for ATM communication
• Using various boards
• Board-level reconfiguration
[Figure: ATTRACTOR. Labels: FPGA boards, RISC boards, RAM (LUT), ATM I/O, ATM SW, buffer, Mem., MPU, high-speed serial link (1 Gbps), Ethernet, CompactPCI.]

Co-processor type
• Tightly coupled with the core CPU.
• A part of the program is selected and executed on the reconfigurable hardware.
• Recently, on-chip implementation of the core CPU together with the reconfigurable part has become possible.
• Tightly coupled co-processors: NAPA, Garp, Chameleon, CHIMAERA, PipeRench

PRISM-II (Brown Univ.)
• The core part of a program is executed on the FPGA modules.
• A pioneer of co-processor-type systems.
[Figure: PRISM-II. Labels: Am2955 CPU, data/address/control, boot ROM, switch, DRAM, burst-mode memory controller, three FPGA modules.]

Garp (Hauser 97)
• Proposed at UCB.
• The MIPS core and the reconfigurable array share a cache system.
• Loops are extracted by a compiler and converted to hardware.
• Image processing: 43 times faster than an UltraSPARC.
[Figure: Garp. Labels: MIPS, cache, memory queues (Q), crossbar, 32-bit buses x 5, reconfigurable array.]

DISC (Wirthlin et al. 95)
• Brigham Young Univ.
• A general-purpose processor using a partially reconfigurable chip.
• Custom instructions can be attached.
• Each module can be designed by the user.
• Functions are called from the C language.
[Figure: DISC. Labels: host PC, bus I/F, FPGA 1 (processor core), FPGA 2 (custom instruction space), FPGA 3 (configuration controller), system memory.]

CHIMAERA (Ye et al. 2000)
• Northwestern Univ.
• A reconfigurable array is inserted in the datapath of a superscalar machine.
• 9 registers can be read in parallel from the shadow registers.
• Out-of-order control
• 10 to 20% performance improvement
[Figure: CHIMAERA. Labels: uP core, register file, shadow registers, reconfigurable array, controller.]

Chameleon (Chameleon Co.)
• Field Programmable System Level Integrated Circuits (FPSLICs)
• A coarse-grain Reconfigurable Processing Fabric, RISC core, PCI controller, memory controller, DMA controller, and SRAM are implemented on a single chip.
• In signal processing and communication protocol processing, it is 5 to 10 times faster than high-speed DSPs.

Chameleon CS2112
[Figure: CS2112 block diagram. Labels: 32-bit PCI bus, PCI controller, RISC core, memory controller, 64-bit memory bus, 128-bit RoadRunner bus, configuration subsystem, DMA subsystem, Reconfigurable Processing Fabric, 160-pin programmable I/O.]

Reconfigurable Processing Fabric in Chameleon
• 108 DPUs (Data Path Units) are organized into 4 slices (3 tiles each).
• 1 tile: 9 DPUs = 7 32-bit ALUs + 2 16-bit x 16-bit multipliers
• 8 instructions stored in the CTL are executed in the DPU.
• The CTL can select the next instruction in the same cycle.
• The configuration can be changed by loading a bitstream.
[Figure: Slice 0 to Slice 3, with tiles containing LM, DPU, and CTL.]

DPU
• OP: operations described in C or Verilog
• SIMD arrays and pipelines are formed with multiple DPUs.
[Figure: DPU datapath. Labels: routing MUX (x 2), register & mask (x 2), instruction, OP, barrel shifter, registers.]

Problems of Reconfigurable Systems
• Computing units built with SRAM-type FPGAs are 10 times slower than ASIC implementations and require 10 times the area.
• Weak connection to memory modules.
• No standard method for generating efficient hardware.
• The size limitation problem.

Toward solving problems (1)
• Speed and area problem compared with dedicated hardware
  • The disadvantage is reduced by using a newer process.
  • Coarse-grain FPGA
  • Implementation together with the CPU
• Weak connection to memory modules
  • Connection with a large-scale integrated SRAM
  • DRAM integration

DRAM-integrated FPGA (NEC)
[Figure: 256 x 256 DRAM module surrounded by logic elements, with word drivers (128) and sense amplifiers (128).]

FPAccA (Hiroshima City Univ.)
• Array of floating-point ALUs (add/mult)
• Model 2 (0.35 um): 12 ALUs x 25 MFLOPS
[Figure: ALU array with routing matrix.]

Toward solving problems (2)
• Algorithm conversion problem
  • Co-processing with an integrated CPU
  • High-level synthesis techniques
  • Data-driven execution
  • Systolic algorithm
• Size limitation problem
  • Partial reconfiguration
  • Multi-context FPGA
  • Virtual hardware

Systolic algorithm
• Data streams x and y are fed at a specific interval into a dedicated computational array.
• Suitable for reconfigurable computing.
[Figure: data x and data y flowing through a computational array.]

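Band matrix multiply y = Ax

\[
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix}
=
\begin{pmatrix}
a_{11} & a_{12} & 0 & 0 \\
a_{21} & a_{22} & a_{23} & 0 \\
0 & a_{32} & a_{33} & a_{34} \\
0 & 0 & a_{43} & a_{44}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
\]

Each cell takes a coefficient $a$, an input $x$, and a partial sum $y_i$, and outputs $y_o = a x + y_i$.

Band matrix multiply y = Ax (step 1)
• Coefficients queued: $a_{23}, a_{32}, a_{22}, a_{12}, a_{21}, a_{11}$
• In the array: $x_1$

Band matrix multiply y = Ax (step 2)
• Coefficients queued: $a_{33}, a_{23}, a_{32}, a_{22}, a_{12}, a_{21}$
• In the array: $x_2, x_1$
• $y_1 = a_{11} x_1$

Band matrix multiply y = Ax (step 3)
• Coefficients queued: $a_{34}, a_{43}, a_{33}, a_{23}, a_{32}, a_{22}$
• In the array: $x_3, x_2, x_1$
• $y_1 = a_{11} x_1 + a_{12} x_2$
• $y_2 = a_{21} x_1$

Band matrix multiply y = Ax (step 4)
• Coefficients queued: $a_{44}, a_{34}, a_{43}, a_{33}, a_{23}, a_{32}$
• In the array: $x_2, x_3$
• $y_2 = a_{21} x_1 + a_{22} x_2$

Band matrix multiply y = Ax (step 5)
• Coefficients queued: $a_{44}, a_{34}, a_{43}, a_{33}$
• In the array: $x_3, x_2$
• $y_2 = a_{21} x_1 + a_{22} x_2 + a_{23} x_3$
• $y_3 = a_{32} x_2$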
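The band matrix steps can be reproduced with a small cycle-by-cycle simulation. This is a sketch under stated assumptions, not the exact array of the slides: it assumes a 3-cell linear array (one cell per diagonal), x tokens moving left to right, y tokens moving right to left, and both streams entering on every other cycle. Indices are 0-based here, and the coefficient is looked up by index rather than streamed in from above. The function name `systolic_band_matvec` is hypothetical.

```python
CELLS = 3  # one cell per diagonal of a tridiagonal band matrix


def systolic_band_matvec(A, x):
    """Simulate y = A x on a 3-cell linear systolic array."""
    n = len(x)
    x_at = [None] * CELLS  # (j, x_j) token sitting at each cell
    y_at = [None] * CELLS  # (i, partial sum of y_i) token at each cell
    y = [0] * n
    for t in range(2 * n + CELLS):
        # A y token about to leave cell 0 is a finished result.
        if y_at[0] is not None:
            i, total = y_at[0]
            y[i] = total
        # Shift: x one cell to the right, y one cell to the left.
        # New x and y tokens are injected on even cycles only.
        inject = t % 2 == 0 and t // 2 < n
        x_at = [(t // 2, x[t // 2]) if inject else None] + x_at[:-1]
        y_at = y_at[1:] + [(t // 2, 0) if inject else None]
        # Every cell holding both tokens computes y_out = a * x + y_in.
        for k in range(CELLS):
            if x_at[k] is not None and y_at[k] is not None:
                j, xv = x_at[k]
                i, total = y_at[k]
                # Simulation shortcut: a real array would receive
                # A[i][j] from its coefficient stream on this cycle.
                y_at[k] = (i, total + A[i][j] * xv)
    return y


A = [[2, 1, 0, 0],
     [3, 4, 5, 0],
     [0, 6, 7, 8],
     [0, 0, 9, 1]]
print(systolic_band_matvec(A, [1, 2, 3, 4]))  # [4, 26, 65, 31]
```

With this timing, the first result accumulates its diagonal term and then its upper-diagonal term while the second result picks up its lower-diagonal term, the same order as in the steps above.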

Data flow algorithm
[Figure: data flow graph computing (a + b) x (c + (d x e)) from inputs a, b, c, d, e with two adders and two multipliers.]
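Data-driven execution of the graph for (a + b) x (c + (d x e)) can be sketched as follows: a node fires as soon as all of its operand tokens have arrived, so independent nodes fire in the same step. The node names and the `run_dataflow` helper are hypothetical, chosen only for illustration.

```python
import operator


def run_dataflow(nodes, inputs):
    """Fire nodes as their operands become available.

    nodes maps a node name to (operation, list of source names);
    inputs maps an input name to its value.
    Returns all token values and the nodes fired in each step.
    """
    tokens = dict(inputs)
    pending = dict(nodes)
    steps = []
    while pending:
        # Nodes whose operands are all present can fire in parallel.
        ready = sorted(name for name, (_, srcs) in pending.items()
                       if all(s in tokens for s in srcs))
        if not ready:
            raise ValueError("no node can fire: missing operands")
        for name in ready:
            op, srcs = pending.pop(name)
            tokens[name] = op(*(tokens[s] for s in srcs))
        steps.append(ready)
    return tokens, steps


graph = {
    "sum_ab": (operator.add, ["a", "b"]),
    "prod_de": (operator.mul, ["d", "e"]),
    "sum_c": (operator.add, ["c", "prod_de"]),
    "result": (operator.mul, ["sum_ab", "sum_c"]),
}
tokens, steps = run_dataflow(graph, {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5})
print(tokens["result"], steps)
```

In this run the addition a + b and the multiplication d x e fire together in the first step, which is the parallelism that a hardware mapping of the graph would exploit.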

PCA (Plastic Cell Architecture)
• Plastic part (logic/memory, variable): 16-word x 1-bit memory (LUT)
• Built-in part: hardware controller which executes 12 instructions
• Unit cells are connected in a mesh structure.
[Figure: unit cell. Labels: input 1 bit, output 1 bit, 16 bits, 14 bits, control 6 bits, data 8 bits.]

PCA (Plastic Cell Architecture)
• Self reconfiguration
• Asynchronous communication
[Figure: built-in parts and plastic parts linked by communication paths and configuration paths.]

Multi-context FPGA
• The configuration RAM in use (the context) can be switched.
• Fujitsu's MPLD (1990), WASMII (1992), Xilinx (1997), NEC's DRL (1999)
[Figure: logic cells between input data and output data; a multiplexer selects one of n SRAM slots (contexts 1, 2, ..., n).]
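The multiplexer over n SRAM slots can be sketched as a logic cell that keeps several truth tables and evaluates only the selected one. This is a minimal model assuming a LUT-based cell; the class `MultiContextCell` and its methods are hypothetical and do not describe any of the devices listed above.

```python
class MultiContextCell:
    """A LUT-based logic cell with several configuration slots."""

    def __init__(self, n_inputs, n_contexts):
        self.n_inputs = n_inputs
        # One truth table (SRAM slot) per context, initially all zeros.
        self.slots = [[0] * (1 << n_inputs) for _ in range(n_contexts)]
        self.active = 0

    def load(self, context, func):
        """Write a truth table into a slot; the active logic is untouched."""
        for addr in range(1 << self.n_inputs):
            bits = [(addr >> b) & 1 for b in range(self.n_inputs)]
            self.slots[context][addr] = func(*bits) & 1

    def switch(self, context):
        """The multiplexer: select which slot drives the logic cell."""
        self.active = context

    def eval(self, inputs):
        addr = sum(bit << i for i, bit in enumerate(inputs))
        return self.slots[self.active][addr]


cell = MultiContextCell(n_inputs=2, n_contexts=2)
cell.load(0, lambda a, b: a & b)  # context 0: AND
cell.load(1, lambda a, b: a ^ b)  # context 1: XOR
```

Because all slots are already on chip, `switch` only changes a selector, whereas `load` corresponds to the slower step of writing configuration data.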

Dynamic Reconfigurable Logic
• Multi-context (8 contexts) and partial reconfiguration.
• Chip
  • 4×12 Logic Blocks (LB)
  • Interface logic
• Logic Block
  • 4×4 Unified Cells (UC)
  • Reconfiguration Controller (RC)
  • Bus Connector (BC)

DRL
[Figure: DRL chip and Logic Block structure. Legend: UC: Unified Cell, RC: Reconfigurable Circuit, BC: Bus Connector, LB: Logic Block. Chip-level labels: array of LBs, global bus, switch, address decoder, memory, input selector, output selector, Configuration Store Address (10b), Configuration/Data Input (79b), Configuration/Data Output (79b), External Config. Control (4b), Config. Store Control (2b), Input Select, Output Select, Reset, CLK. LB-level labels: array of UCs, RC, BCs, vertical local bus, horizontal local bus, data and config lines (4b×2, 3b×2).]

Virtual Hardware: WASMII on the DRL (Keio U. + NEC)
[Figure: WASMII chip. Labels: active page, pages 1 to 4, configuration data line, token router, input token registers, page controller, controller, execution block.]

WASMII operation I
[Figure: FPGA with pages 1 to n on a configuration data line, token router, input token registers, and page controller.]

WASMII operation II (outside chip extension)
[Figure: WASMII chip with pages 1 to n, token router, input token registers, and page controller, extended with external input token registers and a backup RAM.]

LB layout of WASMII on DRL
• Execution block: 32 LBs, dynamically reconfigured
• Control block: 16 LBs, statically configured

WASMII on the DRL
• Small applications have been implemented.
  • Continuous system simulation
  • Neural network emulation
• Almost the same speed as recent PCs.
• A conservative implementation, because it is the first prototype. → To be drastically improved in the next version.
• The limit on the number of contexts is an essential problem.

PipeRench Architecture (CMU)
[Figure: stripes of PEs. Labels: PE, global buses, pass registers, interconnection between stripes.]

Pipelined Reconfiguration
[Figure: a virtual pipeline (stages 1 to 5) mapped cycle by cycle onto a physical pipeline (stages 1 to 3), shown over cycles 1 to 6.]
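The mapping of a 5-stage virtual pipeline onto 3 physical stages can be sketched as a schedule table. The sketch assumes that exactly one physical stage is configured per cycle, in round-robin order, while the other stages keep executing what they already hold; the function name `pipelined_reconfiguration` is hypothetical and the timing is a simplification of the figure.

```python
def pipelined_reconfiguration(n_virtual, n_physical, n_cycles):
    """Return, per cycle, the configured stage and the stage contents.

    Each entry is (physical stage being configured, list of the virtual
    stage held by every physical stage, None if still empty).
    Stages and cycles are numbered from 1, as in the figure.
    """
    held = [None] * n_physical
    schedule = []
    for cycle in range(1, n_cycles + 1):
        target = (cycle - 1) % n_physical         # round-robin target
        held[target] = (cycle - 1) % n_virtual + 1  # next virtual stage
        schedule.append((target + 1, list(held)))
    return schedule


for cycle, (target, held) in enumerate(pipelined_reconfiguration(5, 3, 6), 1):
    print(f"cycle {cycle}: configure physical stage {target}, holding {held}")
```

From cycle 4 on, a new virtual stage overwrites the oldest one, so the 5-stage pipeline runs on hardware that holds only 3 stages at a time.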

Applications
• No need for flexible program changes
• No need for IEEE standard floating point
• Not memory bound
• Image processing, analysis, pattern matching
• Logic simulation, fault simulation
• Neural network simulation
• Encryption/decryption
• Queuing models, Markov analysis
• Electric power flow
• Sensor processing
• Efficient use of on-the-fly processing
• Communication control, protocol control
• Software radio

Summary
• A computing system different from stored-program computers.
• Not a complete replacement for stored-program computers.
• Advances in semiconductor technology directly enhance performance.
• Many problems and research topics remain.

Historical flow of computer systems
[Figure: ENIAC; EDVAC, EDSAC; IBM machines; RISC, Intel's microprocessors; Reconfigurable Machine.]
