
Scaling Up: Teraflop to Petaflop Performance


Presentation Transcript


  1. Scaling Up: Teraflop to Petaflop Performance. SDSC Summer Institute 2006. Robert Harkness, SDSC

  2. Reality Check • Top500 is about politics, not productivity • HPC Challenge is a better measure, but narrow • Industry driven by mass marketing, not HPC • Cost of ownership (per peak flop) drives hardware designs that are sub-optimal for HPC • Gap between peak and sustained performance is growing exponentially • Continual increase in code complexity to compensate • Who measures the cost in productivity? • Performance on your application is what matters • Scientific results are the only measure of success

  3. Challenges • Processor speed falls below Moore’s law • Memory speed and cpu speed still diverging • Power, cooling and physical size • Reliability – HW & SW MTBF • Lack of HW investment forces MPP • MPP incurs overhead and high programmer load • Inherent limits to scaling & load balancing • Overcoming latency at every level • Operational model

  4. The end of Moore’s Law? [Figure: growth trends for Moore’s Law versus CPU, memory and disk performance]

  5. Easy or Hard? • Easy problems? • Embarrassingly Parallel • High degree of locality • Nearest neighbor communication only • Hard problems? • Wide range of physical scales • Highly non-local communication • Multi-physics • Long relaxation time, long dynamical times • Intrinsically serial processes

  6. Limits to Scalability • Physics • Long-range interaction requiring global communication • Local nonlinear effects leading to load imbalance • Separation of time scales • Relaxation over many dynamical times can limit useable parallelism for domain decomposition • Computation • Correctness, validation & verification • I/O • Scheduling • Cost

  7. Full development cycle • Mathematical statement and decomposition • Cost analysis for practical problem size • Coding • Debugging • Production • Post-processing and data management • Archival storage

  8. Reaching 1 TFlop • How do you reach 1 TFlop today? • Net efficiency 10% => 10 TF system @ full scale • 2000 processors @ 5 GFlop peak each • 2000 MPI tasks or threads • O/S redundancy, replication overhead • Only DataStar and BigBen sustain 1 TFlop • Most users still in 1-100 Gflop range
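
As a quick check of the arithmetic above (a sketch, using the stated 10% efficiency and 5 GFlop/s peak per processor):

\[
P_{\mathrm{peak}} = \frac{1\ \mathrm{TFlop/s}}{0.10} = 10\ \mathrm{TFlop/s},
\qquad
N_{\mathrm{cpu}} = \frac{10\ \mathrm{TFlop/s}}{5\ \mathrm{GFlop/s}} = 2000 .
\]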

  9. Reaching 1 Petaflop • 1 PFlop sustained requires 2-20 PFlop peak • Microprocessor cores may reach 5 GHz, 4 ops/cycle • Memory bandwidth starved in many cases • Efficiency likely less than today – say 5% • At 20 GFlop/core that means > 1 MILLION PROCESSORS • Custom processors could exceed 50% efficiency • FPGAs may be 100x faster than micros • Algorithms in hardware
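
The same back-of-the-envelope counting, using the assumptions on this slide (5 GHz x 4 ops/cycle = 20 GFlop/s peak per core, 5% sustained efficiency):

\[
\frac{1\ \mathrm{PFlop/s}}{0.05} = 20\ \mathrm{PFlop/s\ peak},
\qquad
\frac{20\ \mathrm{PFlop/s}}{20\ \mathrm{GFlop/s\ per\ core}} = 10^{6}\ \mathrm{cores}.
\]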

  10. [Figure comparing the Earth Simulator, Cray C90, DataStar and the LLNL BG/L]

  11. With Apologies to Jack Worlton: Belief in Petaflop Computing [Figure: a 3x3 grid classifying attitudes by R and P – Believers (+R,+P), Heretics (+R,P), Atheists (+R,-P), True Believers (R,+P), Fanatics (-R,+P), Luddites (-R,-P), with the center cell (R,P) marked “?”]

  12. Future HW directions • Vector registers and functional units • MTA with a large number of contexts • PIM for locality with SIMD-like economy • FPGAs and accelerators • Superconductors, non-electronic devices • Carbon nanotubes • Spintronics • Optical storage

  13. A Petaflop for the Rest of Us • Investment in hardware and software • Reduce the burden on the programmer! • Real improvements in specific performance require $$ • New languages may help, but adoption will be slow and very risky for users and vendors • Keep complexity at a manageable level • Design for the future

  14. Factors Limiting Useful Parallelism • Latency • Load Imbalance • Synchronization overhead • Task & thread management • Competition for shared resources (bandwidth) • Parallelism at differing scales • Empty pipes and other no-ops

  15. Latency is Enemy #1 • Pipeline latency ~10 cp (clock periods) • Cache latency ~1-10 cp • Memory latency ~100-1000+ cp • Switch latency ~10,000-100,000+ cp • Software latency ~10,000-100,000+ cp • Speed of light :-( It’s too slow! • Across the machine room ~1000 cp • Across the country ~100 ms ~ 10^8-10^9 cp • Latency is getting worse!
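
One way to see why latency dominates short messages is the usual latency-bandwidth (alpha-beta) cost model; this is an illustration rather than a figure from the talk, and the example numbers are assumed:

\[
T(n) \;\approx\; \alpha + \frac{n}{\beta}
\]

With, say, alpha = 5 µs of switch plus software latency and beta = 1 GB/s, an 8-byte message costs essentially the full 5 µs while a 1 MB message costs about 1 ms, so short messages are almost pure latency and aggregating them pays off.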

  16. Latency Function [figure]

  17. Bandwidth is Enemy #2 • Excellent bandwidth within a microprocessor • Immediate loss going to external cache • Enormous on-chip L3 is possible now (~2B transistors!) • Irregular memory access patterns are incompatible with current microprocessor cache designs, which are optimized for streaming • Typically a huge loss in going to memory • Memory is large and cheap, but slow • Multi-core chips share bandwidth – less per processor • Only custom hardware exposes parallelism in memory • Only custom hardware addresses irregular access • Network & I/O bandwidth usually least of all…

  18. Bandwidth-2 • Network bandwidth 10 Gbit/s – 40 Gbit/s • Requires many CPUs to drive at that rate • Still not adequate for runs with large processor counts • Reasonable for file transfers • Achievable bandwidth depends on software • What will happen if networks get busier?

  19. Strong Scaling • Fixed problem size, increasing cpu count N • Serial fraction f, maximum speedup 1/f • Parallel fraction (1-f) is not independent of N
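
Written out (Amdahl’s law), with serial fraction f and N processors:

\[
S(N) = \frac{1}{f + (1-f)/N} \;\le\; \frac{1}{f}
\]

For example, f = 0.01 caps the speedup at 100 no matter how many processors are added.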

  20. Weak Scaling • Scale problem size with the number of processors N • Limited by resources • Scaled speedup (Barsis): if s’ is the serial time and p’ the parallel time measured on the parallel system, then a serial processor would take s’ + p’N, giving a scaled speedup of (s’ + p’N)/(s’ + p’)
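
With the usual normalization s’ + p’ = 1 this is the Gustafson-Barsis scaled speedup; the numbers below are illustrative:

\[
S_{\mathrm{scaled}} = \frac{s' + p'N}{s' + p'} = s' + p'N = N - (N-1)\,s',
\qquad
s' = 0.05,\ N = 1000 \;\Rightarrow\; S_{\mathrm{scaled}} \approx 950 .
\]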

  21. Weak or Strong? • The truth lies in between • Serial fraction may not be independent of problem size • “The Corollary of Modest Potential” (Snyder) • Real resource constraints or policy may limit weak scaling in practice • How big an allocation can one get? • Can a calculation be finished in < resource cycle? • Can a calculation be finished in the life of the machine?

  22. Example • Hydrodynamics on a fixed Eulerian mesh • Courant condition on timestep • Ghost cells for 3D decomposition
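
For reference, the standard forms of the two constraints named above (a sketch; C is the Courant number, n the cells per side of a task’s block and g the number of ghost layers – illustrative symbols, not the talk’s exact notation):

\[
\Delta t \;\le\; C\,\frac{\Delta x}{\max\left(|v| + c_s\right)},
\qquad
\text{ghost-cell overhead} \;=\; \frac{(n+2g)^3 - n^3}{n^3} \;\approx\; \frac{6g}{n}\ \ (g \ll n).
\]

A 64^3 block with 3 ghost layers, for example, carries roughly 30% extra cells, which is one reason strong scaling of a fixed mesh eventually stops paying.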

  23. Cost at fixed work per cpu [figure]

  24. Work Smarter, Not Harder…? • Adaptive Mesh Refinement • Can vastly exceed capability of uniform meshes • Different scaling model – higher overheads • Different MPP model: shared memory or globally addressable memory • Strong implications for HW design

  25. ENZO Hydrodynamical Cosmology: 2048^3 mesh on 2048 processors

  26. ENZO Strong Scaling [figure]

  27. ENZO Weak Scaling [figure]

  28. Decomposition • Choose the decomposition • Select a method that exposes maximum parallel content - plan for execution with >> 10,000 threads • Choose the memory model • Is shared memory required? • Is globally addressable memory required? • Choose the I/O strategy • Massively parallel I/O must be designed for at the outset • Message passing is 100% overhead!

  29. Coding • No religion - use the best tool for the job • Stick to mainstream languages (C/C++/F95) • Strictly adhere to standards for portability • Use the minimum set of features you need • Check all possible result codes and design for error detection and recovery • Design in checkpointing and restart • Parallel I/O is essential • Fault tolerance?

  30. MPI • Send/Recv is not enough • Use buffered, asynchronous messaging • But does your HW and/or MPI really allow it? • Aggregate small messages to increase message length • Caution with derived types (holes) • Send bytes for speed (dangerous) • One-sided model: always use “get”, not “put” • Beware of cache effects • Use your own duplicate of COMM_WORLD
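
A minimal Fortran sketch of the pattern above: non-blocking (asynchronous) messaging with the receive posted first, and a private duplicate of MPI_COMM_WORLD. The ring neighbors, message size and tag are illustrative, not taken from the talk.

   program ring_sketch
      use mpi
      implicit none
      integer :: my_comm, rank, nproc, left, right, ierr
      integer :: req(2), stat(MPI_STATUS_SIZE, 2)
      double precision :: sendbuf(1000), recvbuf(1000)

      call MPI_Init(ierr)
      ! Never hand library code MPI_COMM_WORLD itself - use a private duplicate
      call MPI_Comm_dup(MPI_COMM_WORLD, my_comm, ierr)
      call MPI_Comm_rank(my_comm, rank, ierr)
      call MPI_Comm_size(my_comm, nproc, ierr)

      left  = mod(rank - 1 + nproc, nproc)   ! ring neighbors (illustrative)
      right = mod(rank + 1, nproc)
      sendbuf = dble(rank)

      ! Post the receive before the send: no reliance on internal buffering
      call MPI_Irecv(recvbuf, size(recvbuf), MPI_DOUBLE_PRECISION, left,  7, my_comm, req(1), ierr)
      call MPI_Isend(sendbuf, size(sendbuf), MPI_DOUBLE_PRECISION, right, 7, my_comm, req(2), ierr)

      ! ... overlap independent computation here ...

      call MPI_Waitall(2, req, stat, ierr)
      call MPI_Comm_free(my_comm, ierr)
      call MPI_Finalize(ierr)
   end program ring_sketch

Posting the receive first avoids depending on the MPI library’s internal buffering, and the duplicated communicator keeps the application’s messages from colliding with any library that also uses COMM_WORLD.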

  31. CAF and UPC • Coding for automatic messaging managed by the compiler – eliminates error-prone MP • Clean, logical approach but can lack flexibility • Exploit HW with real global addressing capability • Preserves investment in coding even with shared memory systems

  32. Co-Array Fortran • Co-arrays expected in standard Fortran in 2008 • Almost trivial extension to Fortran • Arrays replicated on all images • Co-size is always equal to NUM_IMAGES() • Upper co-bound must always be [*]

   REAL    :: X(NX)[ND,*]
   INTEGER :: II[*]
   …
   X(J)[2,3] = II[3]   ! Automatic put/get generation

  33. CAF-2 • Limitations (as of current implementation *) • Co-array can be a derived type but can never be a component of a derived type • Co-arrays cannot be assumed size • REAL :: X(*)[*] is not allowed • Co-arrays cannot be assumed-shape • REAL :: X(:)[*] is not allowed • REAL :: Y(:)[:] is not allowed • Automatic co-arrays are not supported • A significant problem for dynamic structures • But co-arrays can be allocatable and can also appear in COMMON and EQUIVALENCE

  34. CAF-3 • Explicit synchronization • CALL SYNC_IMAGES() • Image ID • Index = THIS_IMAGE() • Number of images • Image_count = NUM_IMAGES()
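
A small sketch combining these intrinsics in the Cray-style CAF shown on the previous slides: image 1 fills a co-array and every other image pulls it with a one-sided get (matching the “use get, not put” advice from the MPI slide). NX and the data values are illustrative.

   PROGRAM caf_broadcast
      IMPLICIT NONE
      INTEGER, PARAMETER :: NX = 64        ! illustrative array size
      REAL    :: X(NX)[*]                  ! one copy of X per image
      INTEGER :: me, np

      me = THIS_IMAGE()
      np = NUM_IMAGES()

      IF (me == 1) X = 1.0                 ! image 1 produces the data

      CALL SYNC_IMAGES()                   ! make sure image 1 has finished writing X
      IF (me /= 1) X(:) = X(:)[1]          ! one-sided get from image 1
      CALL SYNC_IMAGES()                   ! all images now hold the same data
   END PROGRAM caf_broadcast

In the 2008 standard the explicit synchronization becomes SYNC ALL / SYNC IMAGES statements, but the structure is the same.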

  35. I/O - Advantages of HDF5 • Machine-independent data format • No endian-ness issues • Easy control of precision • Parallel interface built on MPI I/O • High performance and very robust • Excellent logical design! • Hierarchical structure ideal for consolidation • Easy to accommodate local metadata • Useful inspection tools • h5ls • h5dump
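
A minimal sketch of the parallel-HDF5 pattern in Fortran: the file is created collectively through an MPI-IO file-access property list. The file name, dataset name and size are illustrative; a real code would also select a per-task hyperslab of the file dataspace and check every return code.

   program phdf5_sketch
      use hdf5
      use mpi
      implicit none
      integer :: ierr, mpierr
      integer(hid_t)   :: plist_id, file_id, space_id, dset_id
      integer(hsize_t) :: dims(1)
      double precision :: buf(1000)

      call MPI_Init(mpierr)
      buf     = 1.0d0
      dims(1) = 1000

      call h5open_f(ierr)

      ! File-access property list that routes HDF5 through MPI-IO
      call h5pcreate_f(H5P_FILE_ACCESS_F, plist_id, ierr)
      call h5pset_fapl_mpio_f(plist_id, MPI_COMM_WORLD, MPI_INFO_NULL, ierr)

      ! All tasks create the same file collectively
      call h5fcreate_f("results.h5", H5F_ACC_TRUNC_F, file_id, ierr, access_prp=plist_id)

      call h5screate_simple_f(1, dims, space_id, ierr)
      call h5dcreate_f(file_id, "density", H5T_NATIVE_DOUBLE, space_id, dset_id, ierr)
      call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, buf, dims, ierr)

      call h5dclose_f(dset_id, ierr)
      call h5sclose_f(space_id, ierr)
      call h5pclose_f(plist_id, ierr)
      call h5fclose_f(file_id, ierr)
      call h5close_f(ierr)
      call MPI_Finalize(mpierr)
   end program phdf5_sketch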

  36. Network Realities • Why do it at all? • NSF allocation (2.1 million SU in 2005) is too large for one center to support • Some architectures are better suited for different parts of the computational pipeline • Central location for processing and archival storage at SDSC (IBM P690s, HPSS, SAM-QFS, SRB) • TeraGrid backbone and GridFTP make it possible…

  37. Local and Remote Resources • SDSC • IBM Power4 P655 and P690 (DataStar) • IBM BlueGene/L • NCSA • TeraGrid IA-64 Cluster (Mercury) • SGI Altix (Cobalt) • PSC • Compaq ES45 Cluster (Lemieux) • Cray XT3 (Big Ben) • LLNL • IA-64 Cluster (Thunder) • NERSC • IBM Power3 (Seaborg) • IBM Power5 (Bassi) • Linux Networx Opteron cluster (Jacquard)

  38. Network Transfer Options • GridFTP • Clumsy, but fast: 250+ MB/sec across the TeraGrid • globus-url-copy, tgcp • bbFTP • Easy to use, moderate speed: 90 MB/sec across Abilene • SRB • Global accessibility, complex capability, wide support • Lower performance • Can be combined with faster methods but still provide global access • HPSS • Easy to use, moderate speed • Local support only

  39. Recommendations • Maximize parallelism in all I/O operations • Use HDF5 or MPI I/O • Process results while they are on disk • Never use scp when GridFTP or bbFTP is available • Containerize/tar your data before archiving it! • Use MD5 checksums when you move data • Archive your code and metadata as well as your results – the overhead is minimal and you will never regret it! • Use SRB to manage all results from a project • Maximize parallelism in your work flow

  40. Debugging • Built in self-test • Levels of debug detail and verbosity • Problem test suite for accuracy and performance • Regression tests • Reasonable scale for interactive debuggers • Make use of norms for error checking • Use full error detection • Check that results are independent of task count
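
A sketch of one way to apply the last two points: compute a global norm with a reduction and compare it across runs made with different task counts. The problem size and initial data are illustrative; small differences from summation order are expected, which is why the comparison should use a tolerance rather than bit-for-bit equality.

   program norm_check
      use mpi
      implicit none
      integer, parameter :: nglobal = 4096      ! fixed global problem size (illustrative)
      integer :: comm, rank, nproc, nlocal, ierr
      double precision, allocatable :: u(:)
      double precision :: local_sum, global_sum, l2norm

      call MPI_Init(ierr)
      call MPI_Comm_dup(MPI_COMM_WORLD, comm, ierr)
      call MPI_Comm_rank(comm, rank, ierr)
      call MPI_Comm_size(comm, nproc, ierr)

      nlocal = nglobal / nproc                  ! assume nproc divides nglobal evenly
      allocate(u(nlocal))
      u = 1.0d0                                 ! stand-in for this task's piece of the solution

      ! The global L2 norm should not depend on how the data is split across tasks
      local_sum = sum(u*u)
      call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, comm, ierr)
      l2norm = sqrt(global_sum)
      if (rank == 0) print *, 'global L2 norm =', l2norm   ! compare against a trusted reference run

      deallocate(u)
      call MPI_Comm_free(comm, ierr)
      call MPI_Finalize(ierr)
   end program norm_check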

  41. Dos and Don’ts • Master/slave will not scale up far enough • Never serialize any part of the process • In particular, plan for parallel I/O • Instrument your code for computation, I/O and communication performance • Design for checkpointing – it must be parallel! • Design for real-time visualization, monitoring and steering • Anticipate failure – check every result code, particularly with I/O and networking
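
A minimal sketch of “check every result code” for I/O: test iostat on the open and fail fast rather than letting one task hang the whole job. The file name and unit number are illustrative.

   program checked_open
      use mpi
      implicit none
      integer :: ios, ierr

      call MPI_Init(ierr)
      open(unit=10, file='checkpoint.dat', form='unformatted', &
           status='old', action='read', iostat=ios)
      if (ios /= 0) then
         write(*,*) 'FATAL: cannot open checkpoint.dat, iostat =', ios
         call MPI_Abort(MPI_COMM_WORLD, 1, ierr)   ! abort the whole job cleanly
      end if
      close(10)
      call MPI_Finalize(ierr)
   end program checked_open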

  42. Dos and Don’ts-2 • Always use 64-bit address mode • Code using a flexible approach to precision • Define your own types • Use strongly typed languages • Use 32-bit floating point with caution • Beware of lack of support for 128-bit FP – especially in C • Run with arithmetic checks ON – not always the default • Will you need 64-bit integers? • MPI, HDF5 and other libraries all assume 32-bit integer controls
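
One common way to “define your own types” in Fortran so the working precision can be changed in one place; the module and parameter names are illustrative.

   module my_kinds
      implicit none
      ! Change the whole code's working precision by editing one line
      integer, parameter :: sp = selected_real_kind(6, 37)      ! ~IEEE 32-bit
      integer, parameter :: dp = selected_real_kind(15, 307)    ! ~IEEE 64-bit
      integer, parameter :: qp = selected_real_kind(33, 4931)   ! 128-bit, where supported
      integer, parameter :: wp = dp                             ! working precision used everywhere
      integer, parameter :: i8 = selected_int_kind(18)          ! 64-bit integers
   end module my_kinds

   ! Usage:   use my_kinds
   !          real(wp) :: x
   !          x = 1.0_wp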

  43. Scaling is not all… • Scalability is far less important than: • Correctness • Reproducible results (consider global operations) • Robust operation • Throughput (scientific output) • Computational performance • Every code declines in efficiency beyond some processor/task count - measure where! • Fewer & faster always beats more & slower

  44. Site Policy Issues • Productive Petascale systems will require a completely different approach to operations • Observatory/instrument-style operations • Planned computational campaigns • Long dedicated runs at near full scale • Dedicated support staff to ensure run-time reliability • Error recovery procedures • Hot spares • Planned data transfer/archival storage capacity • Long-term storage policy

  45. Magic Bullets • Sorry – there are no magic bullets…
