Architecture of Parallel Computers CSC / ECE 506 BlueGene Architecture

Presentation Transcript


  1. Architecture of Parallel Computers CSC / ECE 506 BlueGene Architecture 4/26/2007 Dr. Steve Hunter

  2. BlueGene/L Program • December 1999: IBM Research announced a 5-year, $100M US effort to build a petaflop/s-scale supercomputer to attack science problems such as protein folding. Goals: • Advance the state of the art of scientific simulation. • Advance the state of the art in computer design and software for capability and capacity markets. • November 2001: Announced a research partnership with Lawrence Livermore National Laboratory (LLNL). • November 2002: Announced the planned acquisition of a BG/L machine by LLNL as part of the ASCI Purple contract. • May 11, 2004: Four racks of DD1 (4,096 nodes at 500 MHz) ran Linpack at 11.68 TFlop/s, ranked #4 on the 23rd Top500 list. • June 2, 2004: Two racks of DD2 (1,024 nodes at 700 MHz) ran Linpack at 8.655 TFlop/s, ranked #8 on the 23rd Top500 list. • September 16, 2004: 8 racks ran Linpack at 36.01 TFlop/s. • November 8, 2004: 16 racks ran Linpack at 70.72 TFlop/s, ranked #1 on the 24th Top500 list. • December 21, 2004: First 16 racks of BG/L accepted by LLNL. CSC / ECE 506

  3. BlueGene/L Program • A massive collection of low-power CPUs instead of a moderate-sized collection of high-power CPUs. • A joint development of IBM and DOE's National Nuclear Security Administration (NNSA), installed at DOE's Lawrence Livermore National Laboratory. • BlueGene/L has occupied the No. 1 position on the last three TOP500 lists (http://www.top500.org/). • It has reached a Linpack benchmark performance of 280.6 TFlop/s ("teraflops," or trillions of calculations per second) and remains the only system ever to exceed the 100 TFlop/s level. • BlueGene/L systems hold the #1 and #3 positions in the Top 10. • "Objective was to retain exceptional cost/performance levels achieved by application-specific machines, while generalizing the massively parallel architecture enough to enable a relatively broad class of applications" – Overview of BG/L system architecture, IBM Journal of Research and Development (JRD). • The design approach was to use a very high level of integration, which made simplicity in packaging, design, and bring-up possible. • The JRD issue is available at: http://www.research.ibm.com/journal/rd49-23.html CSC / ECE 506

  4. BlueGene/L Program • BlueGene is a family of supercomputers. • BlueGene/L is the first step, intended as a multipurpose, massively parallel, and cost-effective supercomputer (12/04). • BlueGene/P is the petaflop generation (12/06). • BlueGene/Q is the third generation (~2010). • Requirements for future generations: • Processors will be more powerful. • Networks will have higher bandwidth. • Applications developed on BlueGene/L will run well on BlueGene/P. CSC / ECE 506

  5. BlueGene/L Fundamentals • Low-complexity nodes give more flops per transistor and per watt. • A 3D interconnect suits many scientific simulations, since nature as we see it is three-dimensional. CSC / ECE 506

  6. BlueGene/L Fundamentals • Cellular architecture: large numbers of low-power, more efficient processors interconnected. • Rmax of 280.6 TFlop/s (maximum achieved LINPACK performance). • Rpeak of 360 TFlop/s (theoretical peak performance; see the arithmetic below). • 65,536 dual-processor compute nodes. • 700 MHz IBM PowerPC 440 processors. • 512 MB memory per compute node, 16 TB in the entire system. • 800 TB of disk space. • 2,500 square feet. CSC / ECE 506
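
As a rough check (my arithmetic, not from the slide), the peak figure follows from the node design, assuming each PowerPC 440 core's double FPU retires two fused multiply-adds (4 flops) per cycle:

  R_{peak} \approx 65{,}536\ \text{nodes} \times 2\ \tfrac{\text{cores}}{\text{node}} \times 4\ \tfrac{\text{flops}}{\text{cycle}} \times 0.7\ \text{GHz} \approx 367\ \text{TFlop/s}

which is consistent with the quoted ~360 TFlop/s.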

  7. Comparing Systems (Peak) CSC / ECE 506

  8. Comparing Systems (Byte/Flop) • Red Storm: 2.0 (2003) • Earth Simulator: 2.0 (2002) • Intel Paragon: 1.8 (1992) • nCUBE/2: 1.0 (1990) • ASCI Red: 1.0 (0.6) (1997) • T3E: 0.8 (1996) • BG/L: 1.5 = 0.75 (torus) + 0.75 (tree) (2004) • Cplant: 0.1 (1997) • ASCI White: 0.1 (2000) • ASCI Q: 0.05, Quadrics (2003) • ASCI Purple: 0.1 (2004) • Intel Cluster: 0.1, InfiniBand (2004) • Intel Cluster: 0.008, GbE (2003) • Virginia Tech: 0.16, InfiniBand (2003) • Chinese Acad. of Sci.: 0.04, QsNet (2003) • NCSA Dell: 0.04, Myrinet (2003) CSC / ECE 506
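
The byte/flop figure is the ratio of per-node interconnect bandwidth to per-node peak floating-point rate. As one plausible reconstruction of the BG/L entry (the accounting convention is my assumption), taking 2.1 GB/s of aggregate torus bandwidth against a 2.8 GFlop/s per-processor peak:

  \frac{B}{F}\bigg|_{\text{torus}} = \frac{2.1\ \text{GB/s}}{2.8\ \text{GFlop/s}} = 0.75

with a similar figure for the tree, giving the quoted total of 1.5.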

  9. Comparing Systems (GFlops/Watt) • Power efficiencies of recent supercomputers • Blue: IBM Machines • Black: Other US Machines • Red: Japanese Machines IBM Journal of Research and Development CSC / ECE 506

  10. Comparing Systems * 10 megawatts is approximately the power usage of 11,000 households CSC / ECE 506

  11. BG/L Summary of Performance Results • DGEMM (Double-precision GEneral Matrix-Multiply): • 92.3% of dual-core peak on 1 node • Observed performance at 500 MHz: 3.7 GFlops • Projected performance at 700 MHz: 5.2 GFlops (tested in the lab up to 650 MHz) • LINPACK: • 77% of peak on 1 node • 70% of peak on 512 nodes (1,435 GFlops at 500 MHz) • sPPM (simplified Piecewise Parabolic Method), UMT2000: • Single-processor performance roughly on par with POWER3 at 375 MHz • Tested on up to 128 nodes (also NAS Parallel Benchmarks) • FFT (Fast Fourier Transform): • Up to 508 MFlops on a single processor at 444 MHz (TU Vienna) • Pseudo-ops performance (5N log N) at 700 MHz of 1300 MFlops (65% of peak) • STREAM – impressive results even at 444 MHz (a triad sketch follows below): • Tuned: Copy: 2.4 GB/s, Scale: 2.1 GB/s, Add: 1.8 GB/s, Triad: 1.9 GB/s • Standard: Copy: 1.2 GB/s, Scale: 1.1 GB/s, Add: 1.2 GB/s, Triad: 1.2 GB/s • At 700 MHz: would beat STREAM numbers for most high-end microprocessors • MPI: • Latency – < 4000 cycles (5.5 µs at 700 MHz) • Bandwidth – full link bandwidth demonstrated on up to 6 links CSC / ECE 506
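
The STREAM numbers above measure sustainable memory bandwidth. A minimal sketch of the triad kernel (illustrative only; the array size and the lack of a timing harness are my simplifications, not the official STREAM source):

  #include <stdio.h>

  #define N (2 * 1024 * 1024)   /* 16 MB per array, well beyond the 4 MB on-chip L3 */

  int main(void)
  {
      static double a[N], b[N], c[N];
      const double scalar = 3.0;

      /* Initialize operands. */
      for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

      /* STREAM "triad": a = b + scalar * c.
         Each iteration moves 24 bytes and performs 2 flops,
         so bandwidth = 24 * N / elapsed_time. */
      for (long i = 0; i < N; i++)
          a[i] = b[i] + scalar * c[i];

      printf("a[0] = %f\n", a[0]);   /* keep the loop from being optimized away */
      return 0;
  }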

  12. BlueGene/L Architecture • To achieve this level of integration, the machine was developed around a processor with moderate frequency, available in system-on-a-chip (SoC) technology • This approach was chosen because of its performance/power advantage • In terms of performance per watt, the low-frequency, low-power embedded IBM PowerPC core consistently outperforms high-frequency, high-power microprocessors by a factor of 2 to 10 • Industry focus is on performance per rack • Performance / rack = Performance / watt * Watt / rack • Watt / rack ≈ 20 kW for power and thermal (cooling) reasons • Power and cooling • Using conventional techniques, a 360 TFlops machine would require 10–20 megawatts • BlueGene/L uses only 1.76 megawatts (see the arithmetic below) CSC / ECE 506
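
A rough consistency check on these power figures (my arithmetic, using the slide's numbers):

  \frac{360\ \text{TFlop/s (peak)}}{1.76\ \text{MW}} \approx 0.20\ \text{GFlop/s per watt}

versus roughly 0.02–0.04 GFlop/s per watt for a conventional 360 TFlop/s machine drawing 10–20 MW.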

  13. Microprocessor Power Density Growth CSC / ECE 506

  14. System Power Comparison CSC / ECE 506

  15. BlueGene/L Architecture • Networks were chosen with extreme scaling in mind • Scale efficiently in terms of both performance and packaging • Support very small messages • As small as 32 bytes • Includes hardware support for collective operations • Broadcast, reduction, scan, etc. • Reliability, Availability and Serviceability (RAS) is another critical issue for scaling • BG/L needs to be reliable and usable even at extreme scaling limits • 20 failures per 1,000,000,000 hours per node ≈ 1 node failure every 4.5 weeks across the full machine (see the arithmetic below) • System software and monitoring are also important to scaling • BG/L is designed to efficiently utilize a distributed-memory, message-passing programming model • MPI is the dominant message-passing model, with hardware features added and parameters tuned for it CSC / ECE 506
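
The 4.5-week figure follows from the per-node failure rate (a worked check, assuming 65,536 nodes at 20 failures per 10^9 hours each):

  \lambda_{\text{system}} = 65{,}536 \times \frac{20}{10^9\ \text{h}} \approx 1.31\times10^{-3}\ \text{h}^{-1}, \qquad \text{MTBF} = \frac{1}{\lambda_{\text{system}}} \approx 763\ \text{h} \approx 4.5\ \text{weeks}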

  16. RAS (Reliability, Availability, Serviceability) • System designed for RAS from top to bottom • System issues • Redundant bulk supplies, power converters, fans, DRAM bits, cable bits • Extensive data logging (voltage, temp, recoverable errors … ) for failure forecasting • Nearly no single points of failure • Chip design • ECC on all SRAMs • All dataflow outside processors is protected by error-detection mechanisms • Access to all state via noninvasive back door • Low power, simple design leads to higher reliability • All interconnects have multiple error detections and correction coverage • Virtually zero escape probability for link errors CSC / ECE 506

  17. BlueGene/L System 136.8 TFlop/s on LINPACK (64K processors); 1 TFlop/s = 1,000,000,000,000 floating-point operations per second. Rochester Lab, 2005 CSC / ECE 506

  18. BlueGene/L System CSC / ECE 506

  19. BlueGene/L System CSC / ECE 506

  20. BlueGene/L System CSC / ECE 506

  21. Physical Layout of BG/L CSC / ECE 506

  22. Midplanes and Racks CSC / ECE 506

  23. The Compute Chip • System-on-a-chip (SoC) • 1 ASIC • 2 PowerPC processors • L1 and L2 Caches • 4MB embedded DRAM • DDR DRAM interface and DMA controller • Network connectivity hardware • Control / monitoring equip. (JTAG) CSC / ECE 506

  24. Compute Card CSC / ECE 506

  25. Node Card CSC / ECE 506

  26. BlueGene/L Compute ASIC • IBM CU-11, 0.13 µm • 11 x 11 mm die size • 25 x 32 mm CBGA • 474 pins, 328 signal • 1.5/2.5 Volt CSC / ECE 506

  27. BlueGene/L Interconnect Networks 3-Dimensional Torus • Main network, for point-to-point communication • High-speed, high-bandwidth • Interconnects all compute nodes (65,536) • Virtual cut-through hardware routing • 1.4 Gb/s on all 12 node links (2.1 GB/s per node) • 1 µs latency between nearest neighbors, 5 µs to the farthest • 4 µs latency for one hop with MPI, 10 µs to the farthest • Communications backbone for computations • 0.7/1.4 TB/s bisection bandwidth, 68 TB/s total bandwidth Global Tree • One-to-all broadcast functionality • Reduction operations functionality • MPI collective ops in hardware • Fixed-size 256-byte packets • 2.8 Gb/s of bandwidth per link • Latency of one-way tree traversal 2.5 µs • ~23 TB/s total binary-tree bandwidth (64K machine) • Interconnects all compute and I/O nodes (1,024) • Also guarantees reliable delivery Ethernet • Incorporated into every node ASIC • Active in the I/O nodes (1:64) • All external communication (file I/O, control, user interaction, etc.) Low-Latency Global Barrier and Interrupt • Latency of round trip 1.3 µs Control Network CSC / ECE 506
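
The 2.1 GB/s per-node torus figure is simply the sum over the 12 unidirectional links (six in, six out), each at 1.4 Gb/s (my arithmetic, consistent with the slide):

  12 \times 1.4\ \text{Gb/s} = 16.8\ \text{Gb/s} \approx 2.1\ \text{GB/s per node}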

  28. The Torus Network • 3-dimensional: 64 x 32 x 32 • Each compute node is connected to its six neighbors: x+, x-, y+, y-, z+, z- (see the sketch below) • A compute card is 1x2x1 • A node card is 4x4x2 • 16 compute cards in a 4x2x2 arrangement • A midplane is 8x8x8 • 16 node cards in a 2x2x4 arrangement • Communication path • Each unidirectional link is 1.4 Gb/s, or 175 MB/s • Each node can send and can receive at 1.05 GB/s • Supports cut-through routing, along with both deterministic and adaptive routing • Variable-sized packets of 32, 64, 96, …, 256 bytes • Guarantees reliable delivery CSC / ECE 506
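
A minimal sketch of how coordinates map to neighbors on a 64 x 32 x 32 torus (the helper names and the linear node-ID convention are my assumptions for illustration, not BG/L system calls):

  #include <stdio.h>

  #define DX 64
  #define DY 32
  #define DZ 32

  /* Wrap a coordinate back onto the torus. */
  static int wrap(int c, int dim) { return (c % dim + dim) % dim; }

  /* Hypothetical linearization of (x, y, z); the real machine's
     node numbering may differ. */
  static int node_id(int x, int y, int z)
  {
      return (wrap(x, DX) * DY + wrap(y, DY)) * DZ + wrap(z, DZ);
  }

  /* Minimum hop count along one torus dimension (wraparound allowed). */
  static int hops(int a, int b, int dim)
  {
      int d = wrap(a - b, dim);
      return d < dim - d ? d : dim - d;
  }

  int main(void)
  {
      int x = 0, y = 0, z = 0;

      /* Each node has exactly six neighbors: x+, x-, y+, y-, z+, z-. */
      printf("x+ neighbor of (0,0,0): %d\n", node_id(x + 1, y, z));
      printf("x- neighbor of (0,0,0): %d\n", node_id(x - 1, y, z));

      /* Farthest node is 32 + 16 + 16 = 64 hops away. */
      printf("max hops: %d\n",
             hops(0, DX / 2, DX) + hops(0, DY / 2, DY) + hops(0, DZ / 2, DZ));
      return 0;
  }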

  29. Complete BlueGene/L System at LLNL (system diagram) • 65,536 BG/L compute nodes and 1,024 I/O nodes • Federated Gigabit Ethernet switch (2,048 ports) • Cluster-wide file system (CWFS) • 8 front-end nodes and a service node • Control network • External links to the WAN, visualization, and archive systems CSC / ECE 506

  30. System Software Overview • Operating system - Linux • Compilers - IBM XL C, C++, Fortran95 • Communication - MPI, TCP/IP • Parallel File System - GPFS, NFS support • System Management - extensions to CSM • Job scheduling - based on LoadLeveler • Math libraries - ESSL CSC / ECE 506

  31. BG/L Software Hierarchical Organization • Compute nodes are dedicated to running the user application, and almost nothing else – a simple compute node kernel (CNK) • I/O nodes run Linux and provide a more complete range of OS services – files, sockets, process launch, signaling, debugging, and termination • The service node performs system management services (e.g., heartbeating, monitoring errors) – transparent to application software CSC / ECE 506

  32. BG/L System Software • Simplicity • Space-sharing • Single-threaded • No demand paging • Familiarity • MPI (MPICH2) • IBM XL Compilers for PowerPC CSC / ECE 506

  33. Operating Systems • Front-end nodes are commodity systems running Linux • I/O nodes run a customized Linux kernel • Compute nodes use an extremely lightweight custom kernel • Service node is a single multiprocessor machine running a custom OS CSC / ECE 506

  34. Compute Node Kernel (CNK) • Single user, dual-threaded • Flat address space, no paging • Physical resources are memory-mapped • Provides standard POSIX functionality (mostly) • Two execution modes: • Virtual node mode • Coprocessor mode CSC / ECE 506
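
The two execution modes differ in how many MPI ranks a node contributes: virtual node mode runs one rank per core (two per node), while coprocessor mode runs one rank per node and uses the second core to assist with communication. A trivial sketch that just reports the rank layout (illustrative only):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* On a 65,536-node partition, size would be 65,536 in
         coprocessor mode and 131,072 in virtual node mode. */
      if (rank == 0)
          printf("running with %d MPI ranks\n", size);

      MPI_Finalize();
      return 0;
  }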

  35. Service Node OS • Core Management and Control System (CMCS) • BG/L’s “global” operating system. • MMCS - Midplane Monitoring and Control System • CIOMAN - Control and I/O Manager • DB2 relational database CSC / ECE 506

  36. Running a User Job • Compiled, and submitted from front-end node. • External scheduler • Service node sets up partition, and transfers user’s code to compute nodes. • All file I/O is done using standard Unix calls (via the I/O nodes). • Post-facto debugging done on front-end nodes. CSC / ECE 506

  37. Performance Issues • User code is easily ported to BG/L. • However, an efficient MPI implementation requires effort and skill: • Torus topology instead of a crossbar • Special hardware, such as the collective network (see the Cartesian-topology sketch below) CSC / ECE 506
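
One common way to exploit the torus from MPI is a periodic Cartesian communicator, so the logical rank layout can match the physical 3D topology. A hedged sketch (letting MPI choose the factorization is my simplification; BG/L's MPI provides its own mapping support):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int size, trank, coords[3];
      int dims[3]    = {0, 0, 0};   /* let MPI pick a 3D factorization */
      int periods[3] = {1, 1, 1};   /* torus: every dimension wraps */
      MPI_Comm torus;

      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Dims_create(size, 3, dims);

      /* reorder = 1 lets the library place ranks to match the hardware. */
      MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &torus);
      MPI_Comm_rank(torus, &trank);            /* rank may be reordered */
      MPI_Cart_coords(torus, trank, 3, coords);

      /* Nearest neighbors in x are then one hop away on the torus. */
      int xminus, xplus;
      MPI_Cart_shift(torus, 0, 1, &xminus, &xplus);

      if (trank == 0)
          printf("grid %d x %d x %d\n", dims[0], dims[1], dims[2]);

      MPI_Comm_free(&torus);
      MPI_Finalize();
      return 0;
  }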

  38. BG/L MPI Software Architecture • GI = Global Interrupt • CIO = Control and I/O Protocol • CH3 = primary communication device distributed with MPICH2 • MPD = Multipurpose Daemon CSC / ECE 506

  39. MPI_Bcast CSC / ECE 506

  40. MPI_Alltoall CSC / ECE 506
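
Slides 39 and 40 show measured performance of these two collectives; on BG/L, broadcasts over MPI_COMM_WORLD can use the tree (collective) network, while all-to-all exchanges stress the torus. A minimal usage sketch (the buffer sizes are arbitrary choices of mine):

  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* Broadcast: the root sends the same 1 MB buffer to every rank. */
      const int n = 1 << 20;
      char *buf = malloc(n);
      if (rank == 0) for (int i = 0; i < n; i++) buf[i] = 1;
      MPI_Bcast(buf, n, MPI_CHAR, 0, MPI_COMM_WORLD);

      /* All-to-all: every rank sends a distinct 1 KB block to every
         other rank, an aggressive test of bisection bandwidth. */
      const int blk = 1024;
      char *sendbuf = malloc((size_t)blk * size);
      char *recvbuf = malloc((size_t)blk * size);
      for (long i = 0; i < (long)blk * size; i++) sendbuf[i] = (char)rank;
      MPI_Alltoall(sendbuf, blk, MPI_CHAR, recvbuf, blk, MPI_CHAR,
                   MPI_COMM_WORLD);

      free(buf); free(sendbuf); free(recvbuf);
      MPI_Finalize();
      return 0;
  }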

  41. References • IBM Journal of Research and Development, Vol. 49, No. 2-3. • http://www.research.ibm.com/journal/rd49-23.html • “Overview of the Blue Gene/L system architecture” • “Packaging the Blue Gene/L supercomputer” • “Blue Gene/L compute chip: Memory and Ethernet subsystems” • “Blue Gene/L torus interconnection network” • “Blue Gene/L programming and operating environment” • “Design and implementation of message-passing services for the Blue Gene/L supercomputer” CSC / ECE 506

  42. References (cont.) • BG/L homepage @ LLNL: <http://www.llnl.gov/ASC/platforms/bluegenel/> • BlueGene homepage @ IBM: <http://www.research.ibm.com/bluegene/> CSC / ECE 506

  43. The End CSC / ECE 506
