
Jared Casper, Ronny Krashinsky, Christopher Batten, Krste Asanović

A Parameterizable FPGA Prototype of a Vector-Thread Processor. Jared Casper, Ronny Krashinsky, Christopher Batten, Krste Asanović. MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA.


Presentation Transcript


  1. A Parameterizable FPGA Prototype of a Vector-Thread Processor
     Jared Casper, Ronny Krashinsky, Christopher Batten, Krste Asanović
     MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA

  2. SCALE Vector-Thread Processor
     [Figure: block diagram labeled Control Proc, Vector Execution Unit (Lane 0, Lane 1, Lane 2, Lane 3), VRU (Throttle Logic, Refill Unit), Stride, and four SEG units]
     Key Features
     • 4 lanes, 4 clusters
     • Cluster for indexed accesses
     • 4 segment address generators
     • 4 VLDQs
     • VRU includes throttle logic, refill address generator

  3. SCALE Cache
     [Figure: block diagram labeled Cache Arbiter and Crossbar, four Seg Buf blocks, four banks each with Tags, Data, and MSHR, and Memory Port Arbiter and Crossbar]
     Key Features
     • Two cycle hit latency
     • Four 8 KB banks
     • 32 way associative
     • 32B cachelines
     • 16B/cycle per bank
     • Four 16B segment buffers per bank
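
The cache parameters on this slide imply a few derived quantities (lines and sets per bank, address bit widths, aggregate bandwidth). The following is a minimal sketch of that arithmetic, not the authors' code: the class and field names are hypothetical, and it assumes that the 32-way associativity applies within each 8 KB bank, which the slide does not state explicitly.

```python
from dataclasses import dataclass

NUM_BANKS = 4  # "Four 8 KB banks"


@dataclass(frozen=True)
class CacheBank:
    """One cache bank, using the figures listed on the slide (names are hypothetical)."""
    size_bytes: int = 8 * 1024   # 8 KB per bank
    ways: int = 32               # assumed to be per-bank associativity
    line_bytes: int = 32         # 32B cachelines
    bytes_per_cycle: int = 16    # 16B/cycle per bank

    @property
    def lines(self) -> int:
        return self.size_bytes // self.line_bytes

    @property
    def sets(self) -> int:
        return self.lines // self.ways

    @property
    def offset_bits(self) -> int:
        # Bits needed to address a byte within one cache line.
        return (self.line_bytes - 1).bit_length()

    @property
    def set_index_bits(self) -> int:
        # Bits needed to select a set within one bank.
        return (self.sets - 1).bit_length()


bank = CacheBank()
total_kb = NUM_BANKS * bank.size_bytes // 1024
peak_bytes_per_cycle = NUM_BANKS * bank.bytes_per_cycle

print(f"lines per bank:   {bank.lines}")
print(f"sets per bank:    {bank.sets}")
print(f"offset bits:      {bank.offset_bits}")
print(f"set index bits:   {bank.set_index_bits}")
print(f"total capacity:   {total_kb} KB")
print(f"peak bandwidth:   {peak_bytes_per_cycle} B/cycle")
```

Under that assumption each bank holds 256 lines in 8 sets, and the four banks together give the 32 KB capacity and 4x128b per-cycle bandwidth quoted for the prototype chip.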

  4. SCALE Prototype Chip
     [Figure: chip floorplan, 4 mm × 2.5 mm. Labeled regions: Control Processor (ctrl, CP0, L/S, ALU, byp, shftr, PC, RF, MD), Mult Div, four Cache Banks (8KB each), Cache Tags, Memory Interface / Cache Control, Crossbar, Memory Unit, and the Lanes, each with an LDQ and Clusters containing ctrl, IQC, RF, ALU, shftr, and mux/latch blocks]
     • Prototype SCALE processor in development
     • Control processor: MIPS, 1 instr/cycle
     • VTU: 4 lanes, 4 clusters/lane, 32 registers/cluster, 128 VPs max
     • Primary I/D cache: 32 KB, 4x128b per cycle, non-blocking
     • DRAM: 64b, 200 MHz DDR2 (64b at 400 Mb/s per pin: 3.2 GB/s)
     • Estimated 10 mm² in 0.18 μm, 400 MHz (25 FO4)
     • Cycle-level execution-driven C++ microarchitectural simulator
     • Detailed VTU and memory system model
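
The bandwidth and capacity figures on this slide can be cross-checked with a few lines of arithmetic. This is a rough sketch using only the numbers listed above; the variable names are hypothetical. Two assumptions are not stated on the slide: that the cache runs at the 400 MHz core clock, and that the 128-VP maximum corresponds to each virtual processor using one register per cluster.

```python
# DRAM: 64-bit bus, 200 MHz DDR2 (two transfers per clock).
DRAM_BUS_BITS = 64
DRAM_CLOCK_MHZ = 200
TRANSFERS_PER_CLOCK = 2

mbit_per_s_per_pin = DRAM_CLOCK_MHZ * TRANSFERS_PER_CLOCK      # 400 Mb/s per pin
dram_mbyte_per_s = DRAM_BUS_BITS * mbit_per_s_per_pin // 8     # MB/s across the bus

# Primary cache: 4 x 128 bits per cycle.
CACHE_PORTS = 4
CACHE_PORT_BITS = 128
CORE_CLOCK_MHZ = 400  # assumption: cache clocked at the estimated core frequency

cache_bytes_per_cycle = CACHE_PORTS * CACHE_PORT_BITS // 8
cache_mbyte_per_s = cache_bytes_per_cycle * CORE_CLOCK_MHZ

# Vector-thread unit: 4 lanes, 32 registers per cluster.
LANES = 4
REGS_PER_CLUSTER = 32
REGS_PER_VP = 1  # assumption: the minimum per-VP register allocation
max_vps = LANES * (REGS_PER_CLUSTER // REGS_PER_VP)

print(f"DRAM peak bandwidth:  {dram_mbyte_per_s / 1000} GB/s")
print(f"Cache peak bandwidth: {cache_mbyte_per_s / 1000} GB/s")
print(f"Cache/DRAM ratio:     {cache_mbyte_per_s // dram_mbyte_per_s}x")
print(f"Max virtual processors: {max_vps}")
```

The DRAM figure reproduces the 3.2 GB/s quoted on the slide. The cache figure is a peak rate under the stated clock assumption, not a measured value.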

  5. SCALE Prototype Board
     • Single Xilinx Virtex-II FPGA
     • Configured via direct JTAG connection or SystemACE
     • Multiple Memory Chips
         • Six Micron DDR2 SDRAMs
         • Two Micron Mobile SDRAMs
         • One Micron RLDRAM
         • One Samsung SRAM
     • Two Logic Analyzer connections
     • Multiple separate power islands
     • Attached to custom test baseboard
     • Sixteen independently measurable power supplies
     • Byte-serial connection to a Linux PC

  6. Module Placement
     • Reduce the risk of the final custom chip implementation
     • Allow early rapid prototyping of many of the system interactions
     • Provide a parameterizable prototype for architectural experiments

  7. Testing Setup

  8. Testing Setup

  9. Status
     • Completed Work
         • Single-issue seven-stage pipeline MIPS processor core
             • Mapped to the board and passes our MIPS verification test suite
             • Will form the SCALE control processor
         • DDR2 memory controllers
             • Tested in isolation using simple memory traffic generators
     • Work in progress
         • Cache subsystem
         • Vector-thread unit

  10. Advantages of Using an FPGA
     • Rapid full system simulation of a large variety of designs
     • Allows extensive characterization of the design space
     • Parameterization allows exploration of various tradeoffs
         • Cache parameters and replacement policies
         • Prefetch strategies
         • DRAM access scheduling policies and power-down modes
         • DRAM types (e.g., DDR2 vs. Mobile DRAM)
     • Fast emulation system for SCALE software development
     • Allows thorough debugging before going to silicon
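
To illustrate what exploring these tradeoffs through parameterization could look like, here is a minimal sketch that enumerates a small design space as the cross product of a few parameters. Everything in it is hypothetical: the parameter names, the candidate values, and the policy labels are illustrative and do not come from the slides, which only name the categories of tradeoffs. On the actual prototype these would presumably be HDL parameters, with each configuration needing its own FPGA build.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PrototypeConfig:
    """One point in the design space (all field names are hypothetical)."""
    cache_bank_kb: int
    cache_ways: int
    replacement: str
    prefetch: str
    dram_type: str


# Hypothetical candidate values for each parameter; not taken from the slides.
SWEEP = {
    "cache_bank_kb": [4, 8],
    "cache_ways": [16, 32],
    "replacement": ["fifo", "lru"],
    "prefetch": ["none", "next_line"],
    "dram_type": ["ddr2", "mobile_sdram"],
}


def enumerate_configs(sweep):
    """Return every combination of the swept parameter values."""
    keys = list(sweep)
    return [
        PrototypeConfig(**dict(zip(keys, values)))
        for values in product(*(sweep[k] for k in keys))
    ]


configs = enumerate_configs(SWEEP)
print(f"{len(configs)} configurations to build and measure")
for cfg in configs[:3]:
    print(cfg)
```

Even five two-valued parameters yield 32 configurations, which is the kind of sweep where fast full-system runs on an FPGA are more practical than running each point through a cycle-level software simulator.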
