Hardware and Software Requirements for Implementing a High-Performance Superconductivity Circuits-Based Accelerator

Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Kazuaki Murakami
Kyushu University, Japan


Presentation Transcript


  1. Hardware and Software Requirements for Implementing a High-Performance Superconductivity Circuits-Based Accelerator. Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Kazuaki Murakami. Kyushu University, Japan

  2. CREST-JST (2006~): Low-power, high-performance, reconfigurable processor using single-flux quantum (SFQ) circuits
     • Superconducting Research Lab. (SRL): SFQ process (S. Nagasawa et al.)
     • Yokohama National Univ.: SFQ-FPU chip, cell library (N. Yoshikawa et al.)
     • Nagoya Univ.: SFQ-RDP chip, cell library, and wiring (A. Fujimaki et al.)
     • Nagoya Univ.: CAD for logic design and arithmetic circuits (N. Takagi (Leader) et al.)
     • Kyushu Univ.: Architecture, compiler and applications (K. Murakami, K. Inoue, H. Honda, F. Mehdipour, H. Kataoka)
     • SFQ-LSRDP. Our mission: architecture, compiler and application development

  3. Outline of Large-Scale Reconfigurable Data-Path (LSRDP) Processor
     SFQ features:
     • High-speed switching and signal transmission
     • Low power consumption
     • Compact implementation (smaller area)
     • Suitable for pipeline processing

  4. How it works
     [Figure labels: GPP, LSRDP, memory controller, memory, buffers, conf. bit-stream]
     Code executed on the GPP:
       inst; inst; …
       conf_LSRDP ( );
       Loop:
         rearrange_input_data ( );
         set_IO_info ( );
         run_LSRDP ( );
         inst; …
         sync_lsrdp ( );
         rearrange_output_data ( );
       End_Loop
       inst; …
     Timeline: the GPP executes its instructions, conf_LSRDP(), rearrange_input_data(), set_IO_info() and run_LSRDP(); at sync_lsrdp() it waits for the LSRDP; once the LSRDP terminates the operation, the GPP runs rearrange_output_data().
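The call sequence on this slide can be sketched as a small host-side simulation. This is only an illustration under stated assumptions: the slide gives the call names but not their arguments or semantics, so the `LSRDPStub` class, every parameter, and the idea of modelling the configuration as a Python callable are hypothetical.

```python
# Hypothetical stand-ins for the calls named on the slide; the real
# signatures and semantics are not given in the source.
class LSRDPStub:
    """Toy accelerator model: records calls and applies one function per run."""

    def __init__(self):
        self.log = []       # order in which the host issued calls
        self.kernel = None  # stands in for the loaded configuration
        self.in_buf = []
        self.out_buf = []

    def conf_LSRDP(self, kernel):
        # Load the "configuration bit-stream" (here just a Python callable).
        self.kernel = kernel
        self.log.append("conf_LSRDP")

    def set_IO_info(self, inputs):
        # Tell the accelerator which data to consume.
        self.in_buf = list(inputs)
        self.log.append("set_IO_info")

    def run_LSRDP(self):
        # Start the data-path; in this toy model it completes immediately.
        self.out_buf = [self.kernel(x) for x in self.in_buf]
        self.log.append("run_LSRDP")

    def sync_lsrdp(self):
        # The host would wait here until the accelerator has finished.
        self.log.append("sync_lsrdp")
        return self.out_buf


def rearrange_input_data(block):
    # Placeholder: a real version would reorder data for the input buffers.
    return list(block)


def rearrange_output_data(raw):
    # Placeholder: a real version would restore the application's layout.
    return list(raw)


def run_application(blocks, kernel):
    dev = LSRDPStub()
    dev.conf_LSRDP(kernel)  # configured once, before the loop
    results = []
    for block in blocks:
        data = rearrange_input_data(block)
        dev.set_IO_info(data)
        dev.run_LSRDP()
        # Independent GPP instructions could overlap with the run here.
        raw = dev.sync_lsrdp()
        results.append(rearrange_output_data(raw))
    return results, dev.log
```

For example, `run_application([[1, 2], [3]], lambda x: 2 * x)` configures the stub once and then issues one set/run/sync round per input block.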

  5. Architecture Exploration: LSRDP layouts, PE structures, ORN structures
     • PE structures: Basic PE arch. (3 inputs/2 outputs); PE arch. I (4 inputs/3 outputs); PE arch. II (3 inputs/3 outputs)
     • Number of rows (as labelled in the figure, in order): 2×M; 1.5×M; 1.5×M
     • Number of columns (as labelled in the figure, in order): 6×MCL+2; 4×MCL; 4×MCL+1
     • MCL values shown in the figure: 1; 1; 2
     [Figure labels: TU, FU]

  6. LSRDP Tool Chain
     Flow 1: assembly code generation for the GPP
     • Application C code
     • Modifying application code: inserting LSRDP instructions in the code
     • Modified application code
     • LSRDP library file: function definitions & declarations
     • ISAcc or COINS compiler
     • Binary code
     Flow 2: configuration bit-stream generation for the LSRDP
     • DFG extraction
     • Data flow graphs
     • LSRDP architecture description
     • Placing and routing tool
     • Configuration file + various text & schematic reports
     Simulator: performance evaluation

  7. Mapping DFGs onto LSRDP
     Inputs: DFG, LSRDP architecture description
     • Placing input nodes
     • Placing operational & output nodes
     • Routing nets
     • Routing IO nets
     • Final map
     [Figure label: Longest connections]
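The placement stages of this flow can be illustrated with a simple levelling pass. The sketch below is an assumption, not the authors' placer: it puts input nodes in row 0 and every other node one row below its deepest predecessor, then reports the largest row span of any edge, which is only one possible reading of the "Longest connections" label. The dictionary format and both function names are hypothetical.

```python
def assign_rows(preds):
    """Assign each DFG node to a row.

    preds maps node -> list of predecessor nodes (empty for input nodes).
    Inputs go to row 0; other nodes go one row below their deepest
    predecessor. The graph is assumed to be acyclic.
    """
    rows = {}

    def row_of(node):
        if node not in rows:
            ps = preds[node]
            rows[node] = 0 if not ps else 1 + max(row_of(p) for p in ps)
        return rows[node]

    for node in preds:
        row_of(node)
    return rows


def longest_connection(preds, rows):
    """Largest number of rows spanned by any edge (0 if there are no edges)."""
    spans = [rows[n] - rows[p] for n, ps in preds.items() for p in ps]
    return max(spans, default=0)


# Toy DFG for (a + b) * c: three inputs, one adder, one multiplier.
dfg = {"a": [], "b": [], "c": [], "add": ["a", "b"], "mul": ["add", "c"]}
rows = assign_rows(dfg)
span = longest_connection(dfg, rows)
```

In this toy graph the edge from `c` to `mul` skips a row, which is the kind of connection a placer would have to keep within the architecture's limits or pass through intermediate elements.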

  8. Global routing algorithms
     Routing DFG connections between source and destination PEs:
     • Exhaustive search-based: very time consuming
     • Branch and bound algorithm: very fast
     [Figure labels: src, dest, vacant, fully-occupied]
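To make the contrast with exhaustive search concrete, here is a minimal branch-and-bound sketch. It assumes a plain 2-D grid in which each cell is either vacant (0) or fully occupied (1) and routes move between 4-neighbours; the grid model, the Manhattan-distance bound, and the function name are illustrative assumptions, not the routing algorithm used in the LSRDP tool.

```python
def route_bnb(grid, src, dst):
    """Shortest 4-neighbour route from src to dst over vacant cells (0).

    Cells marked 1 are fully occupied. Returns a list of (row, col)
    cells including both endpoints, or None if no route exists.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    best = {"path": None}
    best_cost = {}  # fewest steps with which each cell has been reached

    def bound(cell):
        # Manhattan distance never overestimates the remaining steps.
        return abs(cell[0] - dst[0]) + abs(cell[1] - dst[1])

    def search(cell, path):
        steps = len(path) - 1
        limit = len(best["path"]) - 1 if best["path"] else float("inf")
        # Bound: this branch cannot beat the best complete route so far.
        if steps + bound(cell) >= limit:
            return
        # Dominance: the cell was already reached at least as cheaply.
        if best_cost.get(cell, float("inf")) <= steps:
            return
        best_cost[cell] = steps
        if cell == dst:
            best["path"] = list(path)
            return
        r, c = cell
        neighbours = [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
        neighbours.sort(key=bound)  # branch on the most promising cell first
        for nr, nc in neighbours:
            inside = 0 <= nr < n_rows and 0 <= nc < n_cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in path:
                path.append((nr, nc))
                search((nr, nc), path)
                path.pop()

    if grid[src[0]][src[1]] == 0 and grid[dst[0]][dst[1]] == 0:
        search(src, [src])
    return best["path"]
```

An exhaustive search would enumerate every simple path; the two pruning tests discard branches that cannot improve on the best route found so far, which is where the speed-up comes from.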

  9. Micro-Routing Problem Definition
     [Figure: the ORN between the i-th row and the (i+1)-th row of PEs]
     • Inputs
       • LSRDP basic specifications: layout, width (W), MCL, PE arch., etc.
       • List of connections between consecutive rows
       • ORN structure, including:
         • The number of CBs and T2s in each row
         • The number of CB rows
         • Topology of connections among CBs
     • Output
       • Detailed routes via cross-bar switches
       • The list of CBs used for routing each connection
       • Configuration of CBs
     A micro-routing algorithm has been implemented for the LSRDP with underlying layout II and PE arch. III

  10. ORN Micro-routing
      • CB: 2-input/2-output
      • ½CB: 1-input/2-output
      • Example connections: (PE1 → PE5), (PE2 → PE5, PE6, PE7), (PE3 → PE6, PE8), (PE4 → PE7, PE8)
      [Figure: PE1 to PE4 connected to PE5 to PE8 through CBs and ½CBs, with numbered micro-nets and CB configuration codes 00, 01, 10, 11]

  11. ORN Micro-Routing Example: Heat 8x2, ORN between 3rd and 4th rows
      [Figure: numbered micro-nets routed from the PEs in the 3rd row to the PEs in the 4th row]

  12. Specifications of Attempted DFGs

  13. Example of a DFG Mapping: Vibration 8x2

  14. Results of routing nets using the proposed algorithms

  15. Thank You for Your Attention! Any Questions?

  16. 10 TFLOPS SFQ-RDP computer
      • SFQ RDP (4.2 K, SFQ 0.5 μm process): 32 PEs × 32 chips, 2.5 GFLOPS/PE
      • 1024 FPUs per MCM (34 chips) × 4 MCMs
      • CMOS CPU (one chip)
      • 2 TB memory module: FB-DIMM (DDR3 @ 1333 MHz, 128 GB) × 16 modules
      • Memory bandwidth per MCM: 256 GB/s (= 16 GB/s × 16 channels)
      [Figure labels: FPU, ORN, SB, streaming memory access controller (SMAC)]
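The headline figures on this slide follow from simple multiplication, which the short check below reproduces. It assumes that "32 PE × 32 chips" describes one MCM and that each PE counts as one FPU, which matches the "1024 FPU@MCM" label; the bytes-per-FLOP ratio is derived here and does not appear on the slide. The function and key names are hypothetical.

```python
def peak_figures():
    """Recompute the slide's peak numbers from its stated parameters."""
    pes_per_chip = 32
    rdp_chips_per_mcm = 32   # assumption: the 32 RDP chips belong to one MCM
    gflops_per_pe = 2.5
    mcms = 4

    fpus_per_mcm = pes_per_chip * rdp_chips_per_mcm    # one FPU per PE
    gflops_per_mcm = fpus_per_mcm * gflops_per_pe
    total_tflops = gflops_per_mcm * mcms / 1000

    memory_tb = 128 * 16 / 1024        # 16 modules of 128 GB each
    bw_per_mcm_gbs = 16 * 16           # 16 channels at 16 GB/s each

    return {
        "fpus_per_mcm": fpus_per_mcm,
        "gflops_per_mcm": gflops_per_mcm,
        "total_tflops": total_tflops,
        "memory_tb": memory_tb,
        "bw_per_mcm_gbs": bw_per_mcm_gbs,
        # Derived, not on the slide: memory bytes available per peak FLOP.
        "bytes_per_flop": bw_per_mcm_gbs / gflops_per_mcm,
    }


figures = peak_figures()
```

Under these assumptions the peak is 10.24 TFLOPS across four MCMs, with roughly 0.1 byte of memory bandwidth per peak FLOP on each MCM, which suggests why data reuse inside the data-path matters for this design.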

  17. Development of RDP Architecture
      Chip micro-architecture:
      • Two types of PEs: FPA and FPM
      • PE layout: checkered pattern
      • PE: three inputs (A, B, C) → three outputs (A(*B), B, C)
      • Three scales of RDP (small, medium and large scale)
      • TU: data through
      [Figure labels: FU, FP, TU]

  18. Development of RDP Compiler
      Flow 1: assembly code generation for the GPP
      • Application C code
      • Modifying application code (manual): inserting LSRDP instructions in the code
      • Modified code
      • RDP library file: function definitions & declarations
      • ISAcc or COINS compiler
      • .asm code for MIPS-based GPP
      Flow 2: configuration bit-stream generation for the RDP
      • DFG extraction (semi-manual)
      • Data flow graphs
      • RDP architecture description
      • Placement and routing tool
      • Configuration file + various text and schematic reports
      Simulator: performance evaluation

  19. Development of RDP-Oriented Algorithms
      • One-dimensional heat and vibrational equations
      • Two-dimensional heat and FDTD equations
      • Two-electron repulsion integral calculation in quantum chemistry
      • Runge-Kutta calculation for ordinary differential equations
      Performance Evaluation
      • Two-dimensional heat equation (1024x1024 mesh)
      • SFQ-RDP 1): 50.6 GFlop/s vs. GPU 2): 63.0 GFlop/s
      1) Evaluation method. RDP: execution time model; the DFG has 21 inputs, 9 outputs, and 63 operations. GPP: cycle-accurate processor simulator; BW: 159.0 GB/s
      2) T. Aoki and A. Nukada, "CUDA programming premier," Kougakusya, ISBN-10: 4777514773, 2009 (in Japanese).
