Performance Evaluations of Finite Difference Applications Realized on a Single Flux Quantum Circuits-Based Reconfigurable Accelerator. Hiroaki Honda¹, Farhad Mehdipour², Hiroshi Kataoka¹, Koji Inoue¹, and Kazuaki J. Murakami¹
Performance Evaluations of Finite Difference Applications Realized on a Single Flux Quantum Circuits-Based Reconfigurable Accelerator
Hiroaki Honda¹, Farhad Mehdipour², Hiroshi Kataoka¹, Koji Inoue¹, and Kazuaki J. Murakami¹
¹ Department of Advanced Information Technology, Kyushu University, Fukuoka, Japan
² Center for Japan-Egypt Cooperation in Science and Technology, Kyushu University, Fukuoka, Japan
Email: dahon@soc.ait.kyushu-u.ac.jp
Agenda
• Introduction
  • Single-flux quantum (SFQ) circuit
  • SFQ-reconfigurable data-path (RDP) processor
  • Objective
• Implementing an Application on SFQ-RDP
  • Tool chain
  • Code modification
  • DFG extraction and mapping
• Performance Evaluation
  • Comparison with GPU and GPP results
• Conclusions
Top500 Supercomputer Ranking and Projection
[Figure: Top500 performance ranking and projection, with 1 EFlop/s and 10 EFlop/s (2022) marked]
• 1 ExaFlop/s (= 10^9 GFlop/s) may be attained in ~2019, and 10 ExaFlop/s in ~2022 (within the next ten years)
• The PetaFlop/s (= 10^6 GFlop/s) era began in 2009: a 1000-fold speedup in 10 years
http://www.top500.org/
Energy Consumption Estimation for Floating Point Units (FPUs)
• Power per FPU (2 GHz) is larger than 10 mW (CMOS, ~8 nm in ~2019) 1)
• Power per 1 GFlop/s is larger than 5 mW (1 ExaFlop/s = 10^9 GFlop/s)
• Power consumption of the FPUs for a 10 ExaFlop/s system is larger than 5 mW × 10 × 10^9 = 50 MW
• It is extremely power consuming to construct a 10 ExaFlop/s supercomputer system with CMOS circuit processors
• Additional power is consumed by memory, network, storage, …
1) http://www.cse.nd.edu/Reports/2008/TR-2008-13.pdf, p. 178
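The estimate above can be re-traced as a short calculation. This is only a sketch: the step from 10 mW per 2 GHz FPU to 5 mW per GFlop/s assumes one floating-point operation per cycle per FPU, which the slide does not state, and all variable names are illustrative.

```python
# Back-of-the-envelope check of the FPU power estimate.
# Assumption (not stated on the slide): each FPU completes one
# floating-point operation per clock cycle.

FPU_POWER_W = 10e-3      # lower bound per FPU, from the slide
FPU_CLOCK_HZ = 2e9       # 2 GHz, from the slide
FLOPS_PER_CYCLE = 1      # assumption

# Throughput of one FPU in GFlop/s, and power per GFlop/s.
gflops_per_fpu = FPU_CLOCK_HZ * FLOPS_PER_CYCLE / 1e9
power_per_gflops_w = FPU_POWER_W / gflops_per_fpu

# 10 ExaFlop/s expressed in GFlop/s (1 ExaFlop/s = 1e9 GFlop/s).
TARGET_GFLOPS = 10 * 1e9
total_power_mw = power_per_gflops_w * TARGET_GFLOPS / 1e6  # watts -> megawatts

print(f"{power_per_gflops_w * 1e3:.1f} mW per GFlop/s")
print(f"{total_power_mw:.0f} MW for the FPUs alone")
```

Under that assumption the numbers reproduce the slide's 5 mW per GFlop/s and 50 MW lower bound; memory, network, and storage power would come on top.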
Single-Flux Quantum (SFQ) Circuit
• 10 to 100 times faster operation
• ~1/10 energy consumption
[Figure: superconductivity loop with Josephson junction; SFQ pulse (quantized magnetic flux), ~1 mV, 2~3 ps]
• Pulse logic: bit-serial/bit-slice description for 32/64 bits
Advantages
• Ultra-high-speed switching
• Ultra-low power
• No cost for latch
• Suitable for pipeline processing
Disadvantages
• Difficult to implement feedback loops and conditional branches
• No practical SFQ memory
Single-Flux Quantum-Reconfigurable Data Path (SFQ-RDP) Computer
[Figure: SFQ-RDP chip, ~2.5 TFLOPS/chip, with 2-port/1-port data accesses for input/output]
• PE: one FPU and data-through units
• ORN: network connecting PEs to PEs
• Large-scale two-dimensional floating-point unit array, data-path architecture
• Reconfigurable Operand Routing Network (ORN)
• No on-chip memory
• Dynamically reconfigurable PEs and ORNs
• Data flow is unidirectional
• No feedback loop
• Minimal amount of control circuits
CREST-JST SFQ-RDP Project (2006~): A Low-Power, High-Performance Reconfigurable Processor Based on Single-Flux Quantum Circuits
• Nagoya Univ.: CAD for logic design and arithmetic circuits
• Yokohama National Univ.: SFQ-FPU chip, cell library
• Kyushu Univ.: architecture, compiler, and applications
• Superconducting Research Lab. (SRL): SFQ process
• Nagoya Univ.: SFQ-RDP chip, cell library, and wiring
Goals:
• Discovering appropriate computation-intensive scientific applications
• Developing compiler tools
• Developing performance evaluation tools
• Designing the SFQ-LSRDP architecture
Prototype 2x3 SFQ-RDP Processor and SFQ-MUL FPU
SFQ Floating-Point Multiplier 2)
• 16-bit FPUs: adder, multiplier
• MUL
  • Frequency: 32 GHz
  • Performance: 2.6 GFLOPS
  • Number of junctions: 11044 JJs
  • Power consumption: 3.5 mW
  • Circuit area: 6.22 × 3.78 mm²
2x3 SFQ-RDP processor 1)
• 8-bit ALUs implementing ADD, SUB, AND, OR, XOR
• Frequency: 25 GHz
• Process: 2 µm
• Area: 6.84 × 6.72 mm²
• Power: 4.1 mW
1) Fujimaki, et al., “Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008.
2) H. Hara, et al., “Design and Implementation of SFQ Half-Precision Floating-Point Multipliers,” ASC08, 2008.
Objectives
• Performance evaluations by implementing practical applications, showing the possibility of efficient computation on the SFQ-RDP computer system
• Applications: 2D-diffusion, 2D Finite-Difference Time-Domain (2D-FDTD)
• Comparisons of execution times with a general-purpose processor (GPP) and a GPU
Points
• 2D-FPU array, data-flow architecture
  • Data Flow Graphs (DFGs) are extracted from applications and mapped onto the SFQ-RDP
• Compiler tools
  • Compiler tools have to be developed
• No on-chip memory
  • DMA transfer of DRAM has to be fully used to avoid random accesses
• Dynamically reconfigurable PEs and ORNs
  • A one-time reconfiguration is enough for both the Diffusion and FDTD applications
Tool Chain for Implementation of an Application on SFQ-RDP
[Figure: tool-chain flow diagram; recoverable labels listed below]
• Input application: C/Fortran code
• Code modification using SFQ-RDP API; modified code
• Data Flow Graph (DFG) extraction (semi-manual); extracted DFG
• RDP library file: function definitions and declarations
• Compiler developed for SFQ-RDP
• GPP object code
• RDP architecture description
• Placement and routing tool
• RDP configuration file for the SFQ-RDP
The tool chain is almost complete.
Implementing an Application on SFQ-RDP: 2D Diffusion
• Basic Finite Difference Method (FDM) formula
[Figure: grid with x-axis (space, index i), y-axis (space, index j), and n-axis (time, index n); the FDM time-development calculation advances the field from step n to step n+1]
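The formula itself did not survive on this slide, but the loop body on the following slide gives the update rule, written out in the first line below. The second line is an assumption rather than something the slides state: it shows how the coefficients would read if the scheme is the standard explicit (forward-time, centered-space) discretization of the diffusion equation $\partial f/\partial t = D\,\nabla^2 f$, where $D$, $\Delta t$, $\Delta x$ and $\Delta y$ are symbols introduced here for illustration.

```latex
\begin{align}
f^{n+1}_{i,j} &= C_0\left(f^{n}_{i-1,j} + f^{n}_{i+1,j}\right)
              + C_1\left(f^{n}_{i,j-1} + f^{n}_{i,j+1}\right)
              + C_2\, f^{n}_{i,j}, \\
C_0 &= \frac{D\,\Delta t}{\Delta x^{2}}, \quad
C_1 = \frac{D\,\Delta t}{\Delta y^{2}}, \quad
C_2 = 1 - 2\,(C_0 + C_1).
\end{align}
```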
Code Implementation and Modification for SFQ-RDP
Original Code for GPP (n ⇒ n+1)
    loop n
      loop i, j
        f(n+1)[i,j] = C0 * ( f(n)[i-1,j] + f(n)[i+1,j] )
                    + C1 * ( f(n)[i,j-1] + f(n)[i,j+1] )
                    + C2 * f(n)[i,j]
      end
    end
Unrolled Loop Code for SFQ-RDP (n ⇒ n+1)
    loop n
      loop i, j, (+3, +3)
        f(n+1)[i,j]   = C0 * ( f(n)[i-1,j] + f(n)[i+1,j] )
                      + C1 * ( f(n)[i,j-1] + f(n)[i,j+1] )
                      + C2 * f(n)[i,j]
        f(n+1)[i+1,j] = C0 * ( f(n)[i,j] + f(n)[i+2,j] )
                      + C1 * ( f(n)[i+1,j-1] + f(n)[i+1,j+1] )
                      + C2 * f(n)[i+1,j]
        f(n+1)[i+2,j] = …
        …
        f(n+1)[i+2,j+2] = …
      end
    end
DFG Extraction
Extracted DFG: 9 formulas in loop-body
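As a rough illustration of the two loop forms, the sketch below implements the original update and a version that advances i and j by 3 and computes a 3x3 block (the nine formulas) per iteration, then checks that both give the same result. The function names, grid size, and coefficient values are assumptions for this example; the block's nine formulas are generated by a small inner loop instead of being written out one by one.

```python
# Sketch: original stencil loop versus a 3x3-unrolled version.
# Function names, grid size and coefficients are illustrative assumptions.

def step_original(f, c0, c1, c2):
    """One time step (n -> n+1); boundary cells are copied unchanged."""
    nx, ny = len(f), len(f[0])
    g = [row[:] for row in f]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            g[i][j] = (c0 * (f[i - 1][j] + f[i + 1][j])
                       + c1 * (f[i][j - 1] + f[i][j + 1])
                       + c2 * f[i][j])
    return g

def step_unrolled(f, c0, c1, c2):
    """Same update, stepping i and j by 3; each iteration evaluates the
    nine formulas of one 3x3 block. Assumes the interior size is a
    multiple of 3 in both directions."""
    nx, ny = len(f), len(f[0])
    assert (nx - 2) % 3 == 0 and (ny - 2) % 3 == 0
    g = [row[:] for row in f]
    for i in range(1, nx - 1, 3):
        for j in range(1, ny - 1, 3):
            # Nine outputs per loop body: (i..i+2, j..j+2).
            for di in range(3):
                for dj in range(3):
                    a, b = i + di, j + dj
                    g[a][b] = (c0 * (f[a - 1][b] + f[a + 1][b])
                               + c1 * (f[a][b - 1] + f[a][b + 1])
                               + c2 * f[a][b])
    return g

# Small 8x8 test grid (6x6 interior) with arbitrary values.
grid = [[float((3 * i + 5 * j) % 7) for j in range(8)] for i in range(8)]
same = step_original(grid, 0.1, 0.1, 0.6) == step_unrolled(grid, 0.1, 0.1, 0.6)
print("unrolled result matches original:", same)
```

Because every point is computed with the same expression in both versions, the results agree exactly; unrolling changes only how much work one pass through the loop body (and hence one DFG) covers.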
Mapping Extracted DFG onto SFQ-RDP
[Figure: the extracted DFG is passed through placement and routing; the DFG mapping result is the RDP configuration data]
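To make the idea of a DFG concrete, here is a minimal sketch for one of the nine stencil formulas. The dictionary representation, the node names, and the level-by-level grouping are hypothetical; the slides do not describe how the actual placement and routing tool represents or places a DFG.

```python
# Hypothetical DFG for one stencil formula:
#   out = C0*(west + east) + C1*(south + north) + C2*center
# Each entry: node name -> (operation, operand names). All names are illustrative.
dfg = {
    "add_x":  ("add", ["west", "east"]),
    "add_y":  ("add", ["south", "north"]),
    "mul_x":  ("mul", ["C0", "add_x"]),
    "mul_y":  ("mul", ["C1", "add_y"]),
    "mul_c":  ("mul", ["C2", "center"]),
    "sum_xy": ("add", ["mul_x", "mul_y"]),
    "out":    ("add", ["sum_xy", "mul_c"]),
}

def levels(graph):
    """Earliest level of each operation: inputs and constants count as
    level 0, and a node sits one level after its deepest operand."""
    memo = {}

    def level(name):
        if name not in graph:      # primary input or constant
            return 0
        if name not in memo:
            memo[name] = 1 + max(level(op) for op in graph[name][1])
        return memo[name]

    return {name: level(name) for name in graph}

node_levels = levels(dfg)
depth = max(node_levels.values())

# Group operations by level, as a stand-in for rows of a unidirectional array.
rows = {}
for name, lvl in sorted(node_levels.items()):
    rows.setdefault(lvl, []).append(name)

for lvl in sorted(rows):
    print(lvl, rows[lvl])
```

In this sketch one formula needs seven operations over four levels. A value produced early, such as `mul_c`, is consumed only at the last level, so in a unidirectional array it would have to be forwarded through the intermediate rows.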
Improving Data Access Efficiency: Data Structure Conversion for DMA Transfer
• f[i,j]: the unrolled loop includes 21 inputs and 9 outputs per calculation
  • Random memory accesses
• Data structure conversion: f[i,j] → A[i], B[i]
  • All two-dimensional f[i,j] values are divided and stored as two one-dimensional arrays, A[] and B[]
  • 15 (A) + 15 (B) input data are accessed via the two input ports
  • 9 output data are accessed
  • Sequential memory accesses: possible to use DMA transfer
  • Double buffering
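The figure of 21 inputs for 9 outputs follows directly from the five-point stencil applied to a 3x3 block, which the short sketch below counts. The helper name and the block origin are illustrative; the exact layout of the A[] and B[] arrays (15 values each) is not given on the slide and is not reproduced here.

```python
# Count the grid points read when a five-point stencil is applied
# to every cell of a 3x3 block. Names are illustrative.

def block_reads(i, j, size=3):
    """All (row, col) reads for a size x size block whose first cell is (i, j)."""
    reads = []
    for di in range(size):
        for dj in range(size):
            a, b = i + di, j + dj
            # Centre plus its four neighbours.
            reads += [(a, b), (a - 1, b), (a + 1, b), (a, b - 1), (a, b + 1)]
    return reads

reads = block_reads(1, 1)
unique_reads = set(reads)

print(len(reads), "reads in total,", len(unique_reads), "distinct inputs")
```

The nine formulas issue 45 reads but touch only 21 distinct points: the 3x3 block itself plus three points on each of its four sides (the four corner points are never read).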
Performance Evaluation
System Architecture
[Figure: system architecture; SFQ-RDP with 2 input / 1 output ports]
System Configuration
[Table not recoverable] * BW numbers are based on those for the GPU calculation
Estimation of execution times
• GPP: simulation by a cycle-accurate processor simulator
• SFQ-RDP: performance evaluation modeling
Results of Performance Evaluation
• Comparable results to GPU
• The SFQ-RDP processor, which is implemented with superconducting circuits and a simple 2D-array architecture, can be used as an efficient accelerator
1) T. Aoki, et al., “CUDA programming primer,” (Japanese), Kougakusya, ISBN-10: 4777514773, 2009.
2) N. Takada, et al., “Speeding up of FDTD finite difference calculations by efficient use of GPU and shared memory,” (Japanese), Proceedings of Forum of Information Science and Technology, 2009.
3) H. Kataoka, et al., “Reducing Preprocessing Overhead Times in a Reconfigurable Accelerator of Finite Difference Applications,” SAAHPC 10, Jul. 2010.
Why Can We Achieve Comparable Results?
• Based on the utilization of HW for rearrangement of input data
• Single-precision calculation, BW 159.0 GB/s, GeForce GTX 285
• GeForce GTX 285, 1 proc. calculation (1024x1204 mesh)
• T. Aoki, et al., “CUDA programming primer,” (Japanese), Kougakusya, ISBN-10: 4777514773, 2009.
Conclusions and Future Work
Conclusions
• A Single-Flux Quantum Reconfigurable Data-Path (SFQ-RDP) processor, with a two-dimensional floating-point array architecture implemented by superconducting circuits, was introduced.
• Two-dimensional heat (2D-Heat) and finite-difference time-domain (2D-FDTD) applications were implemented on the SFQ-RDP, and performance evaluations were conducted.
• For 2D-Heat and 2D-FDTD, computation 50.6 and 79.0 times faster, respectively, than on a general-purpose processor was achievable; these performance values were comparable to reported results for the GPU.
• The SFQ-RDP accelerator can be used for practical scientific calculations, especially those based on finite difference methods.
Future Work
• Implementations and performance evaluations of other applications
Acknowledgement
This research was supported by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST).
Other SFQ-RDP research members
• CAD for logic design and arithmetic circuits
  • Prof. N. Takagi (Leader), Prof. K. Takagi (Kyoto Univ.)
• SFQ-RDP chip, cell library, and wiring
  • Prof. A. Fujimaki, Prof. H. Akaike, Prof. M. Tanaka (Nagoya Univ.)
• SFQ-FPU chip, cell library
  • Prof. N. Yoshikawa (Yokohama National Univ.)
• SFQ process
  • Dr. S. Nagasawa, Dr. M. Hidaka (SRL)