
Next KEK machine


Presentation Transcript


  1. Next KEK machine Shoji Hashimoto (KEK) @ 3rd ILFT Network Workshop at Jefferson Lab., Oct. 3-6, 2005

  2. KEK supercomputer A leading computing facility of its time: • 1985 Hitachi S810/10 350 MFlops • 1989 Hitachi S820/80 3 GFlops • 1995 Fujitsu VPP500 128 GFlops • 2000 Hitachi SR8000 F1 1.2 TFlops • 2006 ??? Shoji Hashimoto (KEK)

  3. Formality • “KEK Large Scale Simulation Program”: a call for proposals for projects to be performed on the supercomputer. • Open to Japanese researchers working on high-energy accelerator science (particle and nuclear physics, astrophysics, accelerator physics, material science related to the Photon Factory). • The Program Advisory Committee (PAC) decides approval and machine-time allocation. Shoji Hashimoto (KEK)

  4. Usage Lattice QCD is the dominant user. • About 60-80% of the computer time goes to lattice QCD. • Of that, ~60% is for the JLQCD collaboration. • Others include Hatsuda-Sasaki, Nakamura et al., Suganuma et al., Suzuki et al. (Kanazawa), … • Simulation for accelerator design is the other big user: beam-beam simulation for the KEK-B factory. Shoji Hashimoto (KEK)

  5. JLQCD collaboration • 1995– (on VPP500) • Continuum limit in the quenched approximation: ms, BK, fB, fD. Shoji Hashimoto (KEK)

  6. JLQCD collaboration • 2000– (on SR8000) • Dynamical QCD with the improved Wilson fermion: mV vs. mPS^2, fB, fBs, Kl3 form factor. Shoji Hashimoto (KEK)

  7. Around the triangle Shoji Hashimoto (KEK)

  8. Chiral extrapolation: very hard to go beyond ms/2. A problem for every physical quantity; maybe solved by the new algorithms and machines… (Plot annotations: JLQCD Nf=2 (2002), MILC coarse lattice (2004), “the wall”, new generation of dynamical QCD.) Shoji Hashimoto (KEK)

  9. Upgrade Thanks to Hideo Matsufuru (Computing Research Center, KEK) for his hard work. • Upgrade scheduled for March 1st, 2006. • Called for bids from vendors. • At least 20x more computing power, measured mainly using the QCD codes. • No restriction on architecture (scalar or vector, etc.), but some fraction must be a shared-memory machine. • The decision was made recently. Shoji Hashimoto (KEK)

  10. The next machine A combination of two systems: • Hitachi SR11000 K1, 16 nodes, 2.15 TFlops peak performance. • IBM Blue Gene/L, 10 racks, 57.3 TFlops peak performance. Hitachi Ltd. is the prime contractor. Shoji Hashimoto (KEK)

  11. Hitachi SR11000 K1 Will be announced tomorrow. • POWER5+: 2.1 GHz, dual core, 2 simultaneous multiply/adds per cycle (8.4 GFlops/core), 1.875 MB L2 (on chip), 36 MB L3 (off chip). • 8.5 GB/s chip-memory bandwidth, hardware and software prefetch. • 16-way SMP (134.4 GFlops/node), 32 GB memory (DDR2 SDRAM). • 16 nodes (2.15 TFlops). • Interconnect: Federation switch, 8 GB/s (bidirectional). Shoji Hashimoto (KEK)
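
(As a quick arithmetic check, not from the slides: 2.1 GHz x 2 multiply/adds x 2 flops per multiply/add = 8.4 GFlops/core; x 16 cores = 134.4 GFlops per 16-way node; x 16 nodes ≈ 2.15 TFlops, matching the quoted figures.)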

  12. SR11000 node Shoji Hashimoto (KEK)

  13. 16-way SMP Shoji Hashimoto (KEK)

  14. High Density Module Shoji Hashimoto (KEK)

  15. IBM Blue Gene/L • Node: 2 PowerPC 440 cores (dual-core chip), 700 MHz, double FPU (5.6 GFlops/chip), 4 MB on-chip L3 (shared), 512 MB memory. • Interconnect: 3D torus, 1.4 Gbps/link (6 in + 6 out) from each node. • Midplane: 8x8x8 nodes (2.87 TFlops); rack = 2 midplanes. • 10-rack system. All the information in the following comes from the IBM Redbooks (ibm.com/redbooks) and articles in the IBM Journal of Research and Development. Shoji Hashimoto (KEK)
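
A minimal sketch (my own check, not part of the talk) of how the quoted BG/L peak figures follow from the per-chip rate of 0.7 GHz x 2 FPU pipes x 2 flops per fused multiply-add x 2 cores:

  #include <stdio.h>

  int main(void) {
      double chip     = 0.7 * 2.0 * 2.0 * 2.0;      /* GFlops per node ASIC   */
      double midplane = chip * 8 * 8 * 8 / 1000.0;  /* 512 nodes, in TFlops   */
      double rack     = 2.0 * midplane;             /* rack = 2 midplanes     */
      double system   = 10.0 * rack;                /* the 10-rack KEK system */
      printf("%.1f GFlops/chip, %.2f TFlops/midplane, %.1f TFlops total\n",
             chip, midplane, system);               /* 5.6, 2.87, 57.3        */
      return 0;
  }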

  16. BG/L system 10 Racks Shoji Hashimoto (KEK)

  17. BG/L node ASIC • Double floating-point unit (FPU) added to the PPC440 core: 2 fused multiply-adds per core. • Not a true SMP: L1 has no cache coherency, L2 has a snoop; shared 4 MB L3. • Communication between the two cores goes through the “multiported shared SRAM buffer”. • Embedded memory controller and networks. Shoji Hashimoto (KEK)

  18. Compute node modes • Virtual node mode: use both CPUs separately, running a different process on each core. Communication using MPI, etc. Memory and bandwidth are shared. • Co-processor mode: use the secondary processor as a co-processor for communication. Peak performance is halved. • Hybrid node mode: use the secondary processor also for computation. Needs special care because of the L1 cache incoherency. Used for Linpack. Shoji Hashimoto (KEK)

  19. QCD code optimization Jun Doi and Hikaru Samukawa (IBM Japan): • Use the virtual node mode. • Fully use the Double FPU (hand-written assembler code). • Use a low-level communication API. Shoji Hashimoto (KEK)

  20. Double FPU • SIMD extension of the PPC440. • 32 pairs of 64-bit FP registers; addresses are shared. • Quadword load and store. • Primary and secondary pipelines; fused multiply-add for each pipe. • Cross operations possible; best suited for complex arithmetic. Shoji Hashimoto (KEK)

  21. Examples Shoji Hashimoto (KEK)

  22. SU(3) matrix*vector

  y[0] = u[0][0] * x[0] + u[0][1] * x[1] + u[0][2] * x[2];
  y[1] = u[1][0] * x[0] + u[1][1] * x[1] + u[1][2] * x[2];
  y[2] = u[2][0] * x[0] + u[2][1] * x[1] + u[2][2] * x[2];

  The complex multiplication u[0][0] * x[0] maps onto two Double-FPU instructions:

  FXPMUL   (y[0], u[0][0], x[0])          re(y[0])  =  re(u[0][0]) * re(x[0])
                                          im(y[0])  =  re(u[0][0]) * im(x[0])
  FXCXNPMA (y[0], u[0][0], x[0], y[0])    re(y[0]) += -im(u[0][0]) * im(x[0])
                                          im(y[0]) +=  im(u[0][0]) * re(x[0])

  The remaining terms + u[0][1] * x[1] + u[0][2] * x[2] accumulate with:

  FXCPMADD (y[0], u[0][1], x[1], y[0])
  FXCXNPMA (y[0], u[0][1], x[1], y[0])
  FXCPMADD (y[0], u[0][2], x[2], y[0])
  FXCXNPMA (y[0], u[0][2], x[2], y[0])

  These must be combined with the other rows to avoid pipeline stalls (a 5-cycle wait). Shoji Hashimoto (KEK)
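
For reference, a plain-C sketch (mine, not the talk's hand-written kernel) of the same y = U * x, with the complex product split into the same real/imaginary steps as the instruction sequence above:

  #include <complex.h>

  /* y[i] = sum_j u[i][j] * x[j] for a 3x3 complex (SU(3)) matrix */
  void su3_mult(double complex y[3],
                const double complex u[3][3],
                const double complex x[3])
  {
      for (int i = 0; i < 3; i++) {
          double re = 0.0, im = 0.0;
          for (int j = 0; j < 3; j++) {
              re += creal(u[i][j]) * creal(x[j]);   /* FXPMUL / FXCPMADD part */
              im += creal(u[i][j]) * cimag(x[j]);
              re -= cimag(u[i][j]) * cimag(x[j]);   /* FXCXNPMA part          */
              im += cimag(u[i][j]) * creal(x[j]);
          }
          y[i] = re + im * I;
      }
  }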

  23. Scheduling • The 32+32 registers can hold 32 complex numbers: 3x3 (=9) for a gauge link, 3x4 (=12) for a spinor; two spinors are needed for input and output. • Load the gauge link while computing, using 6+6 registers. • Straightforward for y += U*x, but not so for y += conjg(U)*x. • Use the inline assembler of gcc; xlf and xlc have intrinsic functions. Early xlf/xlc wasn't good enough to produce such code, but has improved more recently. Shoji Hashimoto (KEK)
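
For comparison, the awkward case y += conjg(U)*x (the adjoint link) written out in plain C; again my own sketch, not the assembler kernel. The transposed indices and the sign flips on the imaginary parts are what make the register scheduling less straightforward:

  #include <complex.h>

  /* y[i] += sum_j conj(u[j][i]) * x[j], i.e. y += U^dagger x */
  void su3_adj_mult_add(double complex y[3],
                        const double complex u[3][3],
                        const double complex x[3])
  {
      for (int i = 0; i < 3; i++) {
          double re = creal(y[i]), im = cimag(y[i]);
          for (int j = 0; j < 3; j++) {
              re += creal(u[j][i]) * creal(x[j]) + cimag(u[j][i]) * cimag(x[j]);
              im += creal(u[j][i]) * cimag(x[j]) - cimag(u[j][i]) * creal(x[j]);
          }
          y[i] = re + im * I;
      }
  }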

  24. Parallelization on BG/L Example: 24^3 x 48 lattice. • Use the virtual node mode. • For a midplane, divide the entire lattice onto 2x8x8x8 processes; for one rack, 2x8x8x16 (the factor of 2 is within a node). • To use more than one rack, a 32^3 x 64 lattice is the minimum. • Each process gets a 12x3x3x6 (or 12x3x3x3) local lattice. Shoji Hashimoto (KEK)
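
A small check (assuming the simple even division implied by the slide) of the local lattice per virtual-node process:

  #include <stdio.h>

  int main(void) {
      int global[4]         = {24, 24, 24, 48};
      int procs_midplane[4] = { 2,  8,  8,  8};   /* 1024 processes (512 nodes x 2) */
      int procs_rack[4]     = { 2,  8,  8, 16};   /* 2048 processes                 */
      printf("midplane:");
      for (int mu = 0; mu < 4; mu++)
          printf(" %d", global[mu] / procs_midplane[mu]);   /* 12 3 3 6 */
      printf("\nrack:    ");
      for (int mu = 0; mu < 4; mu++)
          printf(" %d", global[mu] / procs_rack[mu]);       /* 12 3 3 3 */
      printf("\n");
      return 0;
  }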

  25. Communication • The hardware is fast: 6 links to nearest neighbors, 1.4 Gbps (bidirectional) for each link, 140 ns latency for one hop. • MPI is too heavy: it needs an additional buffer copy, which wastes cache and memory bandwidth; multi-threading is not available in the virtual node mode; overlapping computation and communication is not possible within MPI. Shoji Hashimoto (KEK)

  26. “QCD Enhancement Package” Low-level communication API • Directly send/recv by accessing the torus interface FIFO; no copy to a memory buffer. • Non-blocking send; blocking recv. • Up to 224 bytes of data to send/recv at once (the spinor at one site = 192 bytes). • Assumes nearest-neighbor communication. Shoji Hashimoto (KEK)
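
Where the 192 bytes come from (my own illustration, assuming double-precision complex components and no padding): a site spinor is 4 spin x 3 color complex numbers.

  #include <stdio.h>
  #include <complex.h>

  typedef struct {
      double complex c[4][3];   /* 4 spin x 3 color components */
  } site_spinor;

  int main(void) {
      /* 4 * 3 * 16 bytes = 192 bytes, within the 224-byte packet limit */
      printf("spinor at one site = %zu bytes\n", sizeof(site_spinor));
      return 0;
  }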

  27. An example

  #define BGLNET_WORK_REG   30
  #define BGLNET_HEADER_REG 30

  BGLNetQuad* fifo;
  BGLNet_Send_WaitReady(BGLNET_X_PLUS, fifo, 6);  /* create packet header */
  for (i = 0; i < Nx; i++) {
      // put results to reg 24--29
      BGLNet_Send_Enqueue_Header(fifo);   /* put the packet header to the send buffer */
      BGLNet_Send_Enqueue(fifo, 24);      /* put the data to the send buffer */
      BGLNet_Send_Enqueue(fifo, 25);
      BGLNet_Send_Enqueue(fifo, 26);
      BGLNet_Send_Enqueue(fifo, 27);
      BGLNet_Send_Enqueue(fifo, 28);
      BGLNet_Send_Enqueue(fifo, 29);
      BGLNet_Send_Packet(fifo);           /* kick! */
  }

  Shoji Hashimoto (KEK)

  28. Benchmark • Wilson solver (BiCGstab): 24^3 x 48 lattice on a midplane (8x8x8 = 512 nodes, half a rack); 29.2% of the peak performance, 32.6% if only the Dslash is measured. • Domain-wall solver (CG): 24^3 x 48 lattice on a midplane, Ns=16; doesn't fit in the on-chip L3; ~22% of the peak performance. Shoji Hashimoto (KEK)

  29. Comparison Compared with Vranas @ Lattice 2004: ~50% improvement. Shoji Hashimoto (KEK)

  30. Physics target “Future opportunities: ab initio calculations at the physical quark masses” • Using dynamical overlap fermions. • Details are under discussion (actions, algorithms, etc.). • Primitive code has been written; test runs are ongoing on the SR8000. • Many things to do by March… Shoji Hashimoto (KEK)

  31. Summary • The new KEK machine will be made available to the Japanese lattice community on March 1st, 2006. • Hitachi SR11000 (2.15 TF) + IBM Blue Gene/L (57.3 TF). Shoji Hashimoto (KEK)
