
STATE OF THE ART



  1. STATE OF THE ART

  2. Section 2 • Overview • We briefly describe some state-of-the-art supercomputers • The goal is to evaluate the degree of integration of the three main components: processing nodes, interconnection network, and system software • The analysis is limited to six supercomputers (ASCI Q, ASCI Thunder, System X, BlueGene/L, Cray XD1 and ASCI Red Storm), due to space and time limitations

  3. ASCI Q: Los Alamos National Laboratory

  4. ASCI Q • Total: 20.48 TF/s, #3 in the top 500 • Systems: 2,048 AlphaServer ES45s • 8,192 EV-68 1.25-GHz CPUs with 16-MB cache • Memory: 22 terabytes • System interconnect • Dual-rail Quadrics interconnect • 4,096 QSW PCI adapters • Four 1024-way QSW federated switches • Operational in 2002

  5. Node: HP (Compaq) AlphaServer ES45 21264 system architecture (block diagram). Four EV68 1.25 GHz CPUs, 16 MB cache per CPU, each on a 64-bit, 500 MHz (4.0 GB/s) path; up to 32 GB of memory on four MMBs over 256-bit, 125 MHz (4.0 GB/s) buses; two quad C-chip controllers feeding the memory DIMMs; ten PCI slots (several hot-swap) spread across 64-bit PCI buses at 33 MHz (266 MB/s) and 66 MHz (528 MB/s), plus PCI-USB and legacy I/O (serial, parallel, keyboard/mouse, floppy).

  6. QsNET: Quaternary Fat Tree • Hardware support for collective communication • MPI latency 4 µs, bandwidth 300 MB/s • Barrier latency less than 10 µs
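
Figures like these are normally obtained with ping-pong and barrier microbenchmarks. Below is a minimal MPI sketch of such a measurement; the message sizes and repetition count are illustrative assumptions, it must be run with at least two ranks, and it is not the benchmark actually used for the numbers above.

    /* netperf.c - rough MPI latency/bandwidth/barrier microbenchmark (illustrative sketch) */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define REPS 1000

    /* half round-trip time between ranks 0 and 1 for a message of 'bytes' bytes */
    static double pingpong(int rank, char *buf, int bytes) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        return (MPI_Wtime() - t0) / REPS / 2.0;
    }

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(1 << 20);
        double lat = pingpong(rank, buf, 8);        /* small message -> latency   */
        double bw  = pingpong(rank, buf, 1 << 20);  /* 1 MB message  -> bandwidth */

        MPI_Barrier(MPI_COMM_WORLD);                /* now time the barrier itself */
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) MPI_Barrier(MPI_COMM_WORLD);
        double bar = (MPI_Wtime() - t0) / REPS;

        if (rank == 0)
            printf("latency %.2f us, bandwidth %.0f MB/s, barrier %.2f us\n",
                   lat * 1e6, (1 << 20) / bw / 1e6, bar * 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }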

  7. Interconnection network (switch-level diagram): a federated quaternary fat tree with switch levels 1 through 6. Sixteen 64U64D switches form the mid level, each serving 64 nodes (the 1st covers nodes 0-63, the 16th covers nodes 960-1023), with super/top-level switches above them; 1,024 nodes per federated switch (x2 = 2,048 nodes).

  8. System Software • Operating System is Tru64 • Nodes organized in Clusters of 32 for resource allocation and administration purposes (TruCluster) • Resource management executed through Ethernet (RMS)

  9. ASCI Q: Overview • Node Integration • Low (multiple boards per node, network interface on I/O bus) • Network Integration • High (HW support for atomic collective primitives) • System Software Integration • Medium/Low (TruCluster)

  10. ASCI Thunder, 1,024 Nodes, 23 TF/s peak

  11. ASCI Thunder, Lawrence Livermore National Laboratory 1,024 Nodes, 4096 Processors, 23 TF/s, #2 in the top 500

  12. ASCI Thunder: Configuration • 1,024 nodes, quad 1.4 GHz Itanium2, 8 GB DDR266 SDRAM per node (8 terabytes total) • 2.5 µs MPI latency and 912 MB/s bandwidth over Quadrics Elan4 • Barrier synchronization 6 µs, allreduce 15 µs • 75 TB of local disk (73 GB/node, UltraSCSI320) • Lustre file system with 6.4 GB/s delivered parallel I/O performance • Linux RH 3.0, SLURM, CHAOS
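
The collective figures quoted above can be checked with the same kind of microbenchmark; a minimal sketch timing MPI_Allreduce (the repetition count and the single-double payload are assumptions for illustration):

    /* allreduce_time.c - average cost of an MPI_Allreduce on one double (illustrative) */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, reps = 1000;
        double in = 1.0, out, t0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);            /* start everyone together */
        t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++)
            MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            printf("allreduce ~ %.2f us per call\n", (MPI_Wtime() - t0) / reps * 1e6);

        MPI_Finalize();
        return 0;
    }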

  13. CHAOS: Clustered High Availability Operating System • Derived from Red Hat, but differs in the following areas • Modified kernel (Lustre and hardware-specific changes) • New packages for cluster monitoring, system installation, power/console management • SLURM, an open-source resource manager

  14. ASCI Thunder: Overview • Node Integration • Medium/Low (network interface on I/O bus) • Network Integration • Very High (HW support for atomic collective primitives) • System Software Integration • Medium (Chaos)

  15. System X: Virginia Tech

  16. System X, 10.28 TF/s • 1,100 dual Apple G5 2 GHz CPU nodes • 8 billion operations/second per processor (8 GFlops) peak double-precision floating-point performance • Each node has 4 GB of main memory and 160 GB of Serial ATA storage • 176 TB total secondary storage • InfiniBand, 8 µs latency and 870 MB/s bandwidth, partial support for collective communication • System-level fault tolerance (Déjà vu)
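
Déjà vu provides checkpoint/restart transparently at the system level. For contrast, the sketch below shows the application-level alternative that such support replaces: each rank periodically saving its own state at a coordinated point. The file names, state layout and checkpoint interval are made up for the example.

    /* ckpt.c - application-level checkpoint/restart sketch (illustrative only) */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000   /* elements of per-rank state (arbitrary) */

    /* each rank writes its part of the state to its own file (hypothetical naming) */
    static void checkpoint(int rank, int step, const double *state) {
        char name[64];
        snprintf(name, sizeof(name), "ckpt_rank%d.dat", rank);
        FILE *f = fopen(name, "wb");
        fwrite(&step, sizeof(int), 1, f);
        fwrite(state, sizeof(double), N, f);
        fclose(f);
    }

    /* returns the saved step, or 0 if there is no checkpoint to resume from */
    static int restart(int rank, double *state) {
        char name[64];
        snprintf(name, sizeof(name), "ckpt_rank%d.dat", rank);
        FILE *f = fopen(name, "rb");
        if (!f) return 0;
        int step = 0;
        if (fread(&step, sizeof(int), 1, f) == 1)
            fread(state, sizeof(double), N, f);
        fclose(f);
        return step;
    }

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *state = calloc(N, sizeof(double));
        int step = restart(rank, state);                   /* resume if a checkpoint exists */

        for (; step < 100; step++) {
            for (int i = 0; i < N; i++) state[i] += 1.0;   /* stand-in for real work */
            if (step % 10 == 0) {
                MPI_Barrier(MPI_COMM_WORLD);               /* coordinated, consistent cut */
                checkpoint(rank, step, state);
            }
        }

        free(state);
        MPI_Finalize();
        return 0;
    }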

  17. System X: Overview • Node Integration • Medium/Low (network interface on I/O bus) • Network Integration • Medium (limited support for atomic collective primitives) • System Software Integration • Medium (system-level fault-tolerance)

  18. BlueGene/L System (packaging hierarchy, peak performance / memory at each level) • Chip (2 processors): 2.8/5.6 GF/s, 4 MB • Compute card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR • Node card (16 compute cards, 32 chips, 4x4x2): 90/180 GF/s, 8 GB DDR • Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR • System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
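
The two numbers at each level correspond to one compute processor per chip (the other core handling communication) versus both processors computing. As a check of the arithmetic, assuming the 700 MHz production clock and the double FPU (two fused multiply-add pipes per processor):

    \begin{align*}
      \text{per processor} &: 0.7\,\text{GHz} \times 2\ \text{FMA pipes} \times 2\ \text{flops/FMA} = 2.8\ \text{GF/s}\\
      \text{chip (1 / 2 processors computing)} &: 2.8 \,/\, 5.6\ \text{GF/s}\\
      \text{compute card (2 chips)} &: 5.6 \,/\, 11.2\ \text{GF/s}\\
      \text{node card (32 chips)} &: 32 \times 2.8/5.6 \approx 90 \,/\, 180\ \text{GF/s}\\
      \text{cabinet (1024 chips)} &: 1024 \times 2.8/5.6 \approx 2.9 \,/\, 5.7\ \text{TF/s}\\
      \text{system (65{,}536 chips)} &: 65{,}536 \times 2.8/5.6 \approx 183 \,/\, 367\ \text{TF/s (quoted as 180/360)}
    \end{align*}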

  19. BlueGene/L Compute ASIC (block diagram). Two PowerPC 440 CPUs, each with 32 KB/32 KB L1 caches and a "double FPU"; L2 prefetch buffers with shared snooping behind a 4:1 PLB; 4 MB shared embedded-DRAM L3 cache with L3 directory, multiported shared SRAM buffer and ECC; one of the cores can serve as I/O processor. On-chip interfaces: torus (6 links out and 6 in, each at 1.4 Gb/s), tree (3 out and 3 in, each at 2.8 Gb/s), 4 global barriers or interrupts, Gbit Ethernet, JTAG access, and a 144-bit-wide DDR controller with ECC (256/512 MB external DDR). IBM CU-11, 0.13 µm, 11 x 11 mm die size, 25 x 32 mm CBGA, 474 pins (328 signal), 1.5/2.5 Volt.

  20. Node board (photo): 16 compute cards, 2 I/O cards, DC-DC converters stepping 40 V down to 1.5 V and 2.5 V.

  21. BlueGene/L Interconnection Networks • 3-dimensional torus: interconnects all compute nodes (65,536); virtual cut-through hardware routing; 1.4 Gb/s on all 12 node links (2.1 GB/s per node); 350/700 GB/s bisection bandwidth; communications backbone for computations • Global tree: one-to-all broadcast functionality; reduction operations functionality; 2.8 Gb/s of bandwidth per link; latency of tree traversal on the order of 5 µs; interconnects all compute and I/O nodes (1024) • Ethernet: incorporated into every node ASIC; active in the I/O nodes (1:64); all external comm. (file I/O, control, user interaction, etc.) • Low-latency global barrier: 8 single wires crossing the whole system, touching all nodes • Control network (JTAG): for booting, checkpointing, error logging
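
Applications on a torus machine typically express nearest-neighbour communication through a periodic Cartesian topology so the MPI library can map ranks onto the physical network. A minimal sketch (the grid factorization is left to MPI_Dims_create; nothing here is BlueGene/L-specific):

    /* torus_neighbors.c - map ranks onto a periodic 3D grid and find neighbours (sketch) */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int size, rank, dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Dims_create(size, 3, dims);     /* factor the job size into a 3D grid     */
        MPI_Comm torus;                     /* periodic in every dimension => a torus */
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1 /* allow reorder */, &torus);

        int coords[3], xm, xp, ym, yp, zm, zp;
        MPI_Comm_rank(torus, &rank);
        MPI_Cart_coords(torus, rank, 3, coords);

        /* ranks of the six torus neighbours (minus/plus one hop in each dimension) */
        MPI_Cart_shift(torus, 0, 1, &xm, &xp);
        MPI_Cart_shift(torus, 1, 1, &ym, &yp);
        MPI_Cart_shift(torus, 2, 1, &zm, &zp);

        if (rank == 0)
            printf("grid %dx%dx%d, rank 0 at (%d,%d,%d), neighbours x:%d/%d y:%d/%d z:%d/%d\n",
                   dims[0], dims[1], dims[2], coords[0], coords[1], coords[2],
                   xm, xp, ym, yp, zm, zp);

        MPI_Comm_free(&torus);
        MPI_Finalize();
        return 0;
    }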

  22. BlueGene/L System Software Organization • Compute nodes dedicated to running user application, and almost nothing else - simple compute node kernel (CNK) • I/O nodes run Linux and provide O/S services • file access • process launch/termination • debugging • Service nodes perform system management services (e.g., system boot, heart beat, error monitoring) - largely transparent to application/system software

  23. Operating Systems • Compute nodes: CNK • Specialized, simple O/S • 5,000 lines of code, 40 KB in core • No thread support, no virtual memory • Protection: protect kernel from application; some net devices in user space • File I/O offloaded ("function shipped") to I/O nodes through kernel system calls • "Boot, start app and then stay out of the way" • I/O nodes: Linux • 2.4.19 kernel (2.6 underway) w/ ramdisk • NFS/GPFS client • CIO daemon to start/stop jobs and execute file I/O • Global O/S (CMCS, service node) • Invisible to user programs • Global and collective decisions • Interfaces with external policy modules (e.g., job scheduler) • Commercial database technology (DB2) stores static and dynamic state • Partition selection, partition boot, running of jobs, system error logs, checkpoint/restart mechanism • Scalability, robustness, security • Execution mechanisms in the core; policy decisions in the service node
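
CNK forwards ("function ships") file-system calls to the CIO daemon on its I/O node. The toy MPI sketch below illustrates the idea only: the last rank stands in for an I/O node and performs the writes requested by the other ranks. The message format, tags and 1:N ratio are invented for the example and are not the real CIOD protocol.

    /* funcship.c - toy illustration of function-shipped writes (not the real CIOD protocol) */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define TAG_WRITE 1
    #define TAG_DONE  2

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int ionode = size - 1;                   /* last rank plays the I/O node */

        if (rank == ionode) {
            /* "I/O node": receive one request per compute rank and do the write */
            FILE *f = fopen("output.log", "w");  /* hypothetical output file */
            char buf[256];
            for (int i = 0; i < size - 1; i++) {
                MPI_Status st;
                MPI_Recv(buf, sizeof(buf), MPI_CHAR, MPI_ANY_SOURCE, TAG_WRITE,
                         MPI_COMM_WORLD, &st);
                fputs(buf, f);                   /* the real file I/O happens here */
                MPI_Send(NULL, 0, MPI_CHAR, st.MPI_SOURCE, TAG_DONE, MPI_COMM_WORLD);
            }
            fclose(f);
        } else {
            /* "compute node": no local file system, so ship the request instead */
            char msg[256];
            snprintf(msg, sizeof(msg), "hello from compute rank %d\n", rank);
            MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, ionode, TAG_WRITE, MPI_COMM_WORLD);
            MPI_Recv(NULL, 0, MPI_CHAR, ionode, TAG_DONE, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }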

  24. BlueGene/L: Overview • Node Integration • High (processing node integrates processors and network interfaces, network interfaces directly connected to the processors) • Network Integration • High (separate tree network) • System Software Integration • Medium/High (compute kernels are not globally coordinated) • #2 and #4 in the top 500

  25. Cray XD1

  26. Cray XD1 System Architecture Compute • 12 AMD Opteron 32/64 bit, x86 processors • High Performance Linux RapidArray Interconnect • 12 communications processors • 1 Tb/s switch fabric Active Management • Dedicated processor Application Acceleration • 6 co-processors • Processors directly connected to the interconnect

  27. Cray XD1 processing node (chassis diagram). Front: six 2-way SMP blades, 4 fans, a 500 Gb/s crossbar switch, six SATA hard drives. Rear: a 12-port inter-chassis connector, four independent PCI-X slots, and a connector to a second 500 Gb/s crossbar switch and 12-port inter-chassis connector.

  28. Cray XD1 compute blade (diagram): two AMD Opteron 2XX processors, each with 4 DIMM sockets for DDR 400 registered ECC memory, a RapidArray communications processor, and a connector to the main board.

  29. Fast access to the interconnect (sustained bandwidth per node) • Xeon server: memory 5.3 GB/s (DDR 333), I/O 1 GB/s (PCI-X), interconnect 0.25 GB/s (GigE) • Cray XD1: memory 6.4 GB/s (DDR 400), interconnect 8 GB/s (RapidArray)
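
Sustained memory-bandwidth numbers of this kind are usually quoted from STREAM-style kernels. A minimal triad sketch for a single node (array size and repetition count are arbitrary choices; compile with optimization):

    /* triad.c - STREAM-style triad to estimate sustained memory bandwidth (sketch) */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N    (1 << 24)   /* 16M doubles per array (128 MB), well beyond cache */
    #define REPS 10

    int main(void) {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < REPS; r++)
            for (long i = 0; i < N; i++)
                a[i] = b[i] + 3.0 * c[i];        /* triad: two loads + one store per element */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double bytes = 3.0 * sizeof(double) * N * REPS;   /* bytes actually moved */
        /* printing a[0] keeps the result live so the compiler cannot drop the loop */
        printf("triad bandwidth ~ %.2f GB/s (a[0]=%f)\n", bytes / secs / 1e9, a[0]);

        free(a); free(b); free(c);
        return 0;
    }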

  30. Communications optimizations in the RapidArray communications processor • HT/RA tunnelling with bonding • Routing with route redundancy • Reliable transport • Short message latency optimization • DMA operations • System-wide clock synchronization • (Diagram: AMD Opteron 2XX processor linked to the RapidArray communications processor at 3.2 GB/s, with 2 GB/s RapidArray links on the fabric side.)
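
System-wide clock synchronization in hardware removes the need for software offset estimation between nodes. For contrast, a minimal sketch of the usual software approach, where rank 0 estimates each peer's clock offset from round-trip messages and assumes symmetric delay (the averaging count is arbitrary):

    /* clockoffset.c - estimate each rank's clock offset from rank 0 (illustrative sketch) */
    #include <mpi.h>
    #include <stdio.h>

    #define ROUNDS 100

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        for (int peer = 1; peer < size; peer++) {
            if (rank == 0) {
                double offset = 0.0;
                for (int i = 0; i < ROUNDS; i++) {
                    double t_send = MPI_Wtime(), t_peer, t_recv;
                    MPI_Send(&t_send, 1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
                    MPI_Recv(&t_peer, 1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    t_recv = MPI_Wtime();
                    /* assume symmetric delay: peer read its clock at (t_send+t_recv)/2 */
                    offset += t_peer - 0.5 * (t_send + t_recv);
                }
                printf("rank %d offset ~ %.2f us\n", peer, offset / ROUNDS * 1e6);
            } else if (rank == peer) {
                for (int i = 0; i < ROUNDS; i++) {
                    double t_in, t_now;
                    MPI_Recv(&t_in, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                    t_now = MPI_Wtime();
                    MPI_Send(&t_now, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
                }
            }
        }
        MPI_Finalize();
        return 0;
    }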

  31. Active Manager System • Usability: single system command and control • Resiliency: dedicated management processors, real-time OS and communications fabric • Proactive background diagnostics with self-healing • Synchronized Linux kernels (Active Management Software)

  32. Cray XD1: Overview • Node Integration • High (direct access from HyperTransport to RapidArray) • Network Integration • Medium/High (HW support for collective communication) • System Software Integration • High (Compute kernels are globally coordinated) • Early stage

  33. ASCI Red STORM

  34. Red Storm Architecture • Distributed memory MIMD parallel supercomputer • Fully connected 3D mesh interconnect; each compute node processor has a bi-directional connection to the primary communication network • 108 compute node cabinets and 10,368 compute node processors (AMD Sledgehammer @ 2.0 GHz) • ~10 TB of DDR memory @ 333 MHz • Red/Black switching: ~1/4, ~1/2, ~1/4 • 8 service and I/O cabinets on each end (256 processors for each color) • 240 TB of disk storage (120 TB per color)

  35. Red Storm Architecture • Functional hardware partitioning: service and I/O nodes, compute nodes, and RAS nodes • Partitioned Operating System (OS): LINUX on service and I/O nodes, LWK (Catamount) on compute nodes, stripped down LINUX on RAS nodes • Separate RAS and system management network (Ethernet) • Router table-based routing in the interconnect
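
Minimal routes through a 3D mesh are commonly computed one dimension at a time and then loaded into the routers' tables. The sketch below walks a packet from source to destination in dimension order (X, then Y, then Z); it illustrates the idea only and is not the machine's actual routing algorithm.

    /* dor_route.c - dimension-ordered routing through a 3D mesh (illustrative sketch) */
    #include <stdio.h>

    typedef struct { int x, y, z; } Coord;

    /* walk from src to dst resolving X first, then Y, then Z, printing every hop */
    static int route(Coord src, Coord dst) {
        Coord c = src;
        int hops = 0;
        printf("(%d,%d,%d)", c.x, c.y, c.z);
        while (c.x != dst.x) { c.x += (dst.x > c.x) ? 1 : -1; hops++; printf(" -> (%d,%d,%d)", c.x, c.y, c.z); }
        while (c.y != dst.y) { c.y += (dst.y > c.y) ? 1 : -1; hops++; printf(" -> (%d,%d,%d)", c.x, c.y, c.z); }
        while (c.z != dst.z) { c.z += (dst.z > c.z) ? 1 : -1; hops++; printf(" -> (%d,%d,%d)", c.x, c.y, c.z); }
        printf("\n");
        return hops;
    }

    int main(void) {
        Coord src = {0, 0, 0};     /* arbitrary endpoints inside a mesh such as 27 x 16 x 24 */
        Coord dst = {3, 2, 1};
        int hops = route(src, dst);
        printf("%d hops (= |dx| + |dy| + |dz|)\n", hops);
        return 0;
    }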

  36. Red Storm architecture (diagram): compute, service, file I/O and net I/O partitions, with users attached on the service side and /home storage on the file I/O side.

  37. System layout (27 x 16 x 24 mesh): normally unclassified cabinets at one end, normally classified cabinets at the other, switchable nodes between them, separated by disconnect cabinets.

  38. Red Storm System Software • Run-time system • Logarithmic loader • Fast, efficient node allocator • Batch system – PBS • Libraries – MPI, I/O, math • File systems being considered include • PVFS – interim file system • Lustre – Pathforward support • Panasas… • Operating systems • LINUX on service and I/O nodes • Sandia's LWK (Catamount) on compute nodes • LINUX on RAS nodes

  39. ASCI Red Storm: Overview • Node Integration • High (direct access from HyperTransport to network through custom network interface chip) • Network Integration • Medium (No support for collective communication) • System Software Integration • Medium/High (scalable resource manager, no global coordination between nodes) • Expected to become the most powerful machine in the world (competition permitting)

  40. Overview
