
The Largest Linux Clusters



  1. The Largest Linux Clusters Neil Pundit, Scalable Computing Systems, Sandia National Laboratories ndpundi@sandia.gov http://www.cs.sandia.gov/cplant/

  2. Outline • Cplant™ hardware, software, and performance • Major difficulties and lessons learned • Research and development activities • Celera Genomics CRADA • Red Storm • Applications • Contributors • Additional Info

  3. What is Cplant™? • Cplant™ is a concept • Provide computational capacity at low cost • MPPs from commodity components • Cplant™ is an overall effort: • Multiple computing systems • Alaska, Barrow, Siberia, Antarctica/Ross, Antarctica/West, Hawaii, Carmel, Asilomar, Delmar, Zenia • Multiple projects • Portals 3.0 message passing, runtime, management tools, system integration & test, operations & management • Cplant™ is a software package • Released under commercial license to Unlimited Scale, Inc. • Released as open source under the GNU General Public License

  4. Cplant™ Architecture • [Diagram: partitioned node sets — compute nodes, file I/O nodes (/home), net I/O nodes (Ethernet, ATM, HiPPI, other), service nodes for users, and system support nodes for operators and sysadmin] • MPP “look and feel” • Distributed systems and services architecture • Scalable to 10,000 nodes • Embedded RAS features • Preserves the application code base • Extends ASCI Red advantages

  5. Current Deployment • NM clusters • Alaska, yellow, 272 nodes (FY98) • Barrow, red, 96 nodes (FY98) • Siberia, yellow, 592 nodes (FY99) • Ross/Antarctica, yellow, 1024 nodes (FY00) • West/Antarctica, green, 80 nodes (FY00) • CA clusters • Asilomar-SON, green, 64 nodes (FY97) • Asilomar-SRN, yellow, 64 nodes (FY97) • Carmel, yellow, 128 nodes (FY99) • Delmar, yellow, 256 nodes (FY00) • Zenia, red, 32 nodes (FY00)

  6. Antarctica – Current • Single plane connects up to 256 nodes via LAN • Center planes swing to 1 of 3 “heads” • Each “head” connects up to 256 CPU nodes via LAN • I/O & service nodes connected via SAN (Z direction) • 8x8x6+ aspect ratio • Supports periods processing on 3 networks • [Diagram: 256-node planes joined by 128-path links, plus a 1/4 plane on 32 paths and an 80-node section; heads carry 24 service & I/O nodes each, the 80-node section 16; starred units not yet operational]

  7. Antarctica – August ‘01 • [Diagram: planned August ’01 configuration — 256-node and 128-node planes joined by 128-path links (one 32-path link), with two banks of 24 and two banks of 16 service & I/O nodes]

  8. System Software • [Diagram: software stack — applications on top; Portable Batch System, management software, parallel I/O library, and MPI library; cluster services (boot/update virtual machine, discover utility, boot scalable unit, update, add/delete/find/power/role database, boot node) beside the runtime environment (yod, PCT, bebopd, pingd); distributed services library, device database, power control, remote distribution; IP and Portals; PERL; the Linux operating system; hardware] • Portals for fast message passing • Linux OS • Configuration & management tools enable managing large clusters

  9. Application Launch Performance

  10. ENFS – A Parallel File Server Capability • [Diagram: compute nodes reach multiple ENFS servers and a VizNFS server through a gigE switch] • Employs standard NFS • Direct data deposit onto the visualization machine • Parallel (100 MB/s) • Multiple paths to the server(s) • Scalable • Pushes scaling issues to the server side • Global • Available to all compute nodes

  11. ENFS • Removes locking semantics from NFS protocol • Parallel independent I/O to multiple files • Non-overlapping access to single file • Uses I/O nodes as proxies • Allows for investigation of third party solutions • Currently SGI’s XFS – 117 MB/s • Compaq’s Petal/Frangipani • Clemson’s PVFS
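The first of those modes — parallel independent I/O to multiple files — is the pattern ENFS serves most naturally: with locking removed, each rank simply writes its own file and the server never arbitrates between clients. A minimal C/MPI sketch of the pattern (file names and buffer size are illustrative; C99):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        char path[64];
        double buf[1024];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < 1024; i++)   /* rank-tagged payload */
            buf[i] = (double)rank;

        /* One file per rank: no shared file state, so the NFS-based
           server never has to arbitrate locks between clients. */
        snprintf(path, sizeof(path), "out.%04d", rank);
        FILE *fp = fopen(path, "wb");
        if (fp == NULL)
            MPI_Abort(MPI_COMM_WORLD, 1);
        fwrite(buf, sizeof(double), 1024, fp);
        fclose(fp);

        MPI_Finalize();
        return 0;
    }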

  12. Supporting Software Efforts • Etnus, Inc. – TotalView debugger (vers. 4.1.0-1) • Cplant™ runtime environment extended to support bulk debug server launch • Only works on GNU and Compaq Alpha/Linux binaries • Can launch yod or attach to running job • TotalView communications port to Portals 3.0 in progress • MPI Software Technology, Inc. – MPI/Pro • MPI/Pro ported to Portals 3.0 • Kuck and Associates, Inc./Pallas, Inc. - Vampir • Vampirtrace for MPI/Pro and ENFS • Mission Critical Linux - Linux enhancements • Kernel modifications to increase performance on Alpha processor systems

  13. Large Clusters Require an Extensive Integration-Test Process • [Chart: integration hardware error reports for 1024 nodes of Antarctica — categories include power supplies, ECC errors, mother boards, Ethernet cables, serial cables, bad/loose/misconfigured Myrinet cables, Myrinet cards, PCI riser cards, RPC units, terminal servers, misc. hardware, misc. software, and cases with no diagnosis; per-category counts ranged from 2 to 49]

  14. MPLinpack Performance • 552 Siberia Nodes • 309.2 GFLOPS • Would place 61st on November 2000 Top 500 list • 1000 Antarctica nodes • 512.4 GFLOPS • Would place 31st on November 2000 Top 500 list

  15. Usage Data

  16. Outline of Major Difficulties in the Last Two Years • Interconnect • Communication middleware • Runtime environment • Batch scheduler • Parallel I/O • System management • Testing and release process

  17. Major Difficulties • Interconnect (Myrinet) problems (2 PY) • GM mapper limitations (2 PM) • Each new cluster exceeded the number of nodes the mapper could handle • Non-deadlock-free routes (4 PM) • Code for routing algorithm gave only shortest path routes • Reliability • Error detection/correction (6 PM) • Switch diagnostics capture and display (1 PY)

  18. Myrinet Reliability • Alaska Myrinet is very reliable • Siberia Myrinet is very unreliable • Daily bit error rate can range from 10^-7 to 10^-14 • Storms of multi-bit errors • Added error detection/correction to the Myrinet driver (sketched below) • Implemented Myrinet switch monitoring software • Implemented switch error visualization tool
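The added error detection/correction boils down to checksumming each payload and retransmitting on mismatch. The actual driver code is not shown in the slides; the following is a hypothetical C sketch of the receiver-side check using a plain bitwise CRC-32:

    #include <stddef.h>
    #include <stdint.h>

    /* Bitwise CRC-32 (reflected, polynomial 0xEDB88320). A real driver
       would use a table-driven version for speed; this is illustrative. */
    static uint32_t crc32(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t crc = 0xFFFFFFFFu;
        while (len--) {
            crc ^= *p++;
            for (int k = 0; k < 8; k++)
                crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
        }
        return ~crc;
    }

    /* Receiver side: accept a payload only if its trailer CRC matches;
       otherwise the sender is asked to retransmit (names hypothetical). */
    int payload_ok(const void *payload, size_t len, uint32_t trailer_crc)
    {
        return crc32(payload, len) == trailer_crc;  /* 1 good, 0 resend */
    }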

  19. Switch Error Visualization Tool

  20. Major Difficulties (cont’d) • Communication middleware (3.5 PY) • Portals 2.0 in Linux (6 PM) • No API • Data structures in user space • Protection boundaries have to be crossed to access data structures • Data structures have to be copied, manipulated, and copied back • Requires interrupts • Address validation/translation on the fly • Incoming messages trigger address validation • Doesn’t fit the Linux model of validating addresses on a system call for the currently running process • Developed Portals 3.0 API (1 PY) • Implemented Portals 3.0 (1 PY) • Transition from P2 to P3 (1 PY)
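As a rough userspace caricature of why the Portals 2.0 design hurt: keeping the match structures in user space meant every incoming message paid two protection-boundary copies of those structures (plus an interrupt), whereas Portals 3.0 hides them behind an API so the implementation can own them. The sketch below only times the copies; the table size and message count are arbitrary:

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    /* Per-message cost model under Portals 2.0 (caricature): the kernel
       copies the user-space structures in, updates them, and copies
       them back on every message arrival. */
    #define TABLE_BYTES (64 * 1024)
    #define MESSAGES    100000

    int main(void)
    {
        static char user_table[TABLE_BYTES], kernel_copy[TABLE_BYTES];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < MESSAGES; i++) {
            memcpy(kernel_copy, user_table, TABLE_BYTES); /* copy in   */
            kernel_copy[i % TABLE_BYTES] ^= 1;            /* update    */
            memcpy(user_table, kernel_copy, TABLE_BYTES); /* copy back */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double s = (t1.tv_sec - t0.tv_sec)
                 + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("%.2f us per message spent in copies alone\n",
               1e6 * s / MESSAGES);
        return 0;
    }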

  21. Major Difficulties (cont’d) • Runtime environment (2 PY) • Most problems related to message passing • Runtime utilities must recover from network errors • Linux copy-on-write caused “lost” messages • Problems show up as • Failure to start job • Utilities become uncommunicative – compute nodes become stale, allocator is unresponsive • Interaction of Linux, Portals, and the utilities (60% rewrite, 30% debugging, 10% enhancement)
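On the copy-on-write point: when a network device deposits data directly into user pages, a fork() can remap those pages copy-on-write, so a later deposit lands in a page the process no longer owns and the message is effectively lost. The slides do not give Cplant's fix; the sketch below shows the mitigation Linux later standardized for RDMA stacks (madvise with MADV_DONTFORK), purely as an illustration:

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Allocate a page-aligned buffer and exclude it from fork()'s
       copy-on-write inheritance, so a NIC deposit always lands in
       pages this process still owns. MADV_DONTFORK postdates this
       talk and is shown as an illustration, not as Cplant's fix. */
    void *alloc_deposit_buffer(size_t len)
    {
        void *buf;
        long pg = sysconf(_SC_PAGESIZE);

        if (posix_memalign(&buf, (size_t)pg, len) != 0)
            return NULL;
        if (madvise(buf, len, MADV_DONTFORK) != 0) {
            free(buf);
            return NULL;
        }
        return buf;
    }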

  22. Major Difficulties (cont’d) • Batch scheduling (1 PY) • Enhanced OpenPBS • Added non-blocking I/O for enhanced reliability (patches available under GPL) • Integrated PBS into the runtime environment • Uses FIFO scheduler • Reflects “good citizen” rules established by users • Few problems with PBS
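The reliability issue with blocking I/O in a daemon like PBS is that one wedged peer can stall every other connection. The GPL patches themselves are not reproduced here; the sketch below shows the general technique — mark each descriptor non-blocking and treat EAGAIN as "retry later" (POSIX C, helper names hypothetical):

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Put a descriptor into non-blocking mode so one wedged peer
       cannot stall the daemon's main loop. */
    int set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags < 0)
            return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

    /* Attempt a write; report "not now" instead of blocking forever. */
    ssize_t try_write(int fd, const void *buf, size_t len)
    {
        ssize_t n = write(fd, buf, len);
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return 0;   /* caller re-queues the data and moves on */
        return n;       /* bytes written, or -1 on a real error   */
    }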

  23. Major Difficulties (cont’d) • Parallel I/O (6 PY) • Fyod – parallel independent files • Partial success (6 PM) • Striping fyod • Abandoned for lack of robustness (2 PY) • ENFS (3.5 PY) • Have MPI-IO for ENFS (see the sketch below), working on HDF5 • 119 MB/s from 8 I/O nodes to an SGI O2K with XFS
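The non-overlapping single-file mode maps directly onto MPI-IO's explicit-offset writes, which is presumably what the MPI-IO layer for ENFS exercises. A minimal C sketch (file name and block size are illustrative; C99):

    #include <mpi.h>

    #define N 1024   /* doubles per rank; illustrative */

    int main(int argc, char **argv)
    {
        int rank;
        double buf[N];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < N; i++)
            buf[i] = (double)rank;

        /* Each rank writes a disjoint byte range of one shared file,
           so accesses never overlap and no locking is required. */
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_write_at(fh, (MPI_Offset)rank * N * sizeof(double),
                          buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }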

  24. Major Difficulties (cont’d) • System management tools (6 PY) • All tools are homegrown • Commercial tools do not address scalability or the Cplant™ architecture • First implementation was too hardware-specific and too tightly integrated with the runtime environment • Latest implementation is flexible and separate from the runtime environment • Recent focus is on automation and robustness

  25. Major Difficulties (cont’d) • Testing and release process (5 PY) • Slow awakening that system tests were incomplete • Testing needs to include a few representative applications • Beyond infant mortality, we need to do stress testing • Five-phase testing procedure in place

  26. Five-Phase Testing Procedure • Phase 0: Repository regression tests • Runs nightly on a 32-node system to ensure the functionality of the repository • Phase 1: Runtime environment and basic message passing tests • Simple MPI tests and basic file I/O functions (see the ring sketch below) • Phase 2: Small applications and benchmarks • NAS benchmarks, MPLinpack, CTH and MPSalsa with small problems • Phase 3: Message passing stress tests • Based on the Intel acceptance tests for ASCI/Red • Phase 4: Friendly user applications • Friendly users running real applications
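Phase 1's "simple MPI tests" are of the kind sketched below — a token passed around a ring so that every node's network path is exercised once. The test is illustrative, not taken from the actual suite:

    #include <mpi.h>
    #include <stdio.h>

    /* Pass a token around a ring of ranks; a hang or a wrong value on
       any hop implicates that node's network path. */
    int main(int argc, char **argv)
    {
        int rank, size, token;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {                 /* a ring needs two nodes */
            MPI_Finalize();
            return 0;
        }

        if (rank == 0) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("ring ok: token = %d\n", token);
        } else {
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0,
                     MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }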

  27. Lessons Learned • Effort has split into bug fixing (50%), enhancements (30%), and release testing (20%) • Release testing is currently barely adequate • Need greater attention to robustness

  28. Current Research and Development • OS bypass performance enhancement • Dynamic compute node allocation • Intelligent compute node allocation • Portals 3.0 on Quadrics network • Support for multi-threaded apps • Support for SMP compute nodes • Enhance cluster management tools to support switching between heads

  29. Collaborative Research Efforts • Study of optimal error correction protocols (Ohio State) • Heterogeneous cluster study (Syracuse/U. of Virginia) • Study of performance with topology, communication, and applications (Ohio State) • OS bypass (U. of New Mexico) • Fault tolerance in applications (U. of Texas) • Portals 3.0 implementations (VIA, LAPI) and extensions (gather/scatter) (Mississippi State U.) • Scalable I/O (lock manager, coherence) (Northwestern) • New MPP architectures (Caltech/JPL) • SciDAC – Scalable Systems Software Enabling Technology Center (DOE)

  30. Celera Genomics • Multi-year Cooperative Research and Development Agreement • Develop advanced parallel bioinformatics algorithms • Develop massively parallel computer hardware designs • Incorporate these into a single, integrated, high-performance data analysis capability • Integrate technology advances into both parties’ mainstream business activities • Enhance Celera’s technical depth in high-performance parallel computing • Enhance Sandia’s technical depth in genomics and proteomics

  31. ASCI Red Storm • Tightly Coupled MPP • 20+ TFLOPS • Distributed Memory MIMD • 3-D Mesh Interconnect • Red/Black Switching • Partitioned Hardware - System and I/O, Compute, RAS • Partitioned System Software - System and I/O, Compute, RAS • Integrated System Management and Full System RAS • No Local Disk or User-Writable Non-volatile Memory

  32. Applications Work In Progress • CTH – 3D Eulerian shock physics • ALEGRA – 3D arbitrary Lagrangian-Eulerian solid dynamics • GILA – Unstructured low-speed flow solver • MPQuest – Quantum electronic structures • SALVO – 3D seismic imaging • LADERA – Dual control volume grand canonical MD simulation • Parallel MESA – Parallel OpenGL • Xpatch – Electromagnetism • RSM/TEMPRA – Weapon safety assessment • ITS – Coupled electron/photon Monte Carlo transport • TRAMONTO – 3D density functional theory for inhomogeneous fluids • CEDAR – Genetic algorithms

  33. Applications Work In Progress • AZTEC – Iterative sparse linear solver • DAVINCI – 3D charge transport simulation • SALINAS – Finite element modal analysis for linear structural dynamics • TORTILLA – Mathematical and computational methods for protein folding • EIGER • DAKOTA – Analysis kit for optimization • PRONTO – Numerical methods for transient solid dynamics • SnRAD – Radiation transport solver • ZOLTAN – Dynamic load balancing • MPSALSA – Numerical methods for simulation of chemically reacting flows • http://www.cs.sandia.gov/cplant/apps

  34. CTH Grind Time

  35. Cplant™ Contributors (System Software Development and Testing, Production Support, and Management Team) • Ron Brightwell • Lee Ann Fisk • Nathan Dauchy (HPTi) • Sue Goudy • Rena Haynes • Jeanette Johnston • Lisa Kennicott • Ruth Klundt (Compaq) • Jim Laros • Barney Maccabe (UNM) • Jim Otto • Rolf Riesen • Eric Russell • Lee Ward • David Evensky • Sophia Corwell • Bob Davis • Eric Enquvist • Cathy Houf • Donna Johnson • Mike McConkey • Geoff McGirt • Mike Kurtzer • Doug Clay • Doug Doerfler • John Noe • Neil Pundit • Art Hale, Deputy Director • Bill Camp, Director

  36. More Info • Web site • http://www.cs.sandia.gov/cplant/ • Recent papers • http://www.cs.sandia.gov/cplant/papers/ • Including: • “Scalable Parallel Application Launch on Cplant™”, extended abstract submitted to SC’01 • “Dynamic Allocation of Nodes on a Large Space-shared Cluster”, submitted to IEEE Cluster Computing 2001 • “Scalability and Performance of Two Large Linux Clusters”, Journal of Parallel and Distributed Computing, to appear 2001 • “Scalability and Performance of CTH on the Computational Plant”, Proceedings of 2nd International Conference on Cluster Computing • Sandia’s Computer Science Research Institute (CSRI) • http://www.cs.sandia.gov/CSRI/
