
Achieving Strong Scaling On Blue Gene/L: Case Study with NAMD


Presentation Transcript


  1. Achieving Strong Scaling On Blue Gene/L: Case Study with NAMD Sameer Kumar, Gheorghe Almasi Blue Gene System Software, IBM T J Watson Research Center, Yorktown Heights, NY {sameerk,gheorghe}@us.ibm.com L. V. Kale, Chao Huang Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL {kale,chuang10}@uiuc.edu

  2. Outline • Background and motivation • NAMD and Charm++ • Blue Gene optimizations • Performance results • Summary

  3. Blue Gene/L • Slow embedded core at a clock speed of 700 MHz • 32 KB L1 cache • L2 is a small prefetch buffer • 4 MB embedded-DRAM L3 cache • 3D torus interconnect • Each node is connected to six torus links, each with a throughput of 175 MB/s • System optimized for massive scaling and power efficiency

  4. Blue Gene/L • Chip: 2 processors, 2.8/5.6 GF/s, 4 MB • Compute Card: 2 chips (1x2x1), 5.6/11.2 GF/s, 1.0 GB • Node Card: 32 chips (4x4x2), 16 compute and 0-2 I/O cards, 90/180 GF/s, 16 GB • Rack: 32 node cards, 2.8/5.6 TF/s, 512 GB • System: 64 racks (64x32x32), 180/360 TF/s, 32 TB • Has this slide been presented 65536 times?

  5. Can we scale on Blue Gene/L ? • Several applications have demonstrated weak scaling • NAMD was one of the first applications to achieve strong scaling on Blue Gene/L

  6. NAMD and Charm++

  7. NAMD: A Production MD Program • Fully featured program from the University of Illinois • NIH-funded development • Distributed free of charge (thousands of downloads so far) • Binaries and source code • Installed at NSF centers • User training and support • Large published simulations (e.g., aquaporin simulation featured in keynote)

  8. NAMD Benchmarks • BPTI: 3K atoms • Estrogen Receptor: 36K atoms (1996) • ATP Synthase: 327K atoms (2001) • A recent NSF peta-scale proposal presents a 100-million-atom system

  9. Molecular Dynamics in NAMD • Collection of [charged] atoms, with bonds • Newtonian mechanics • Thousands to even a million atoms • At each time-step: calculate forces on each atom • Bonded forces • Non-bonded: electrostatic and van der Waals • Short-range: every timestep • Long-range: using PME (3D FFT) • Multiple time stepping: PME every 4 timesteps • Calculate velocities and advance positions • Challenge: femtosecond time-steps, so millions of steps are needed!
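
As a rough illustration of the per-timestep work the slide describes, here is a minimal velocity Verlet sketch in plain C++ (hypothetical Atom struct and a caller-supplied force routine; no PME, pairlists, or parallelism):

    #include <vector>

    struct Vec3 { double x, y, z; };
    struct Atom { Vec3 pos, vel, force; double mass, charge; };

    // One velocity Verlet step: half-kick, drift, recompute forces, half-kick.
    void timestep(std::vector<Atom>& atoms, double dt,
                  void (*computeForces)(std::vector<Atom>&)) {
      for (Atom& a : atoms) {
        double s = 0.5 * dt / a.mass;
        a.vel.x += s * a.force.x; a.vel.y += s * a.force.y; a.vel.z += s * a.force.z;
        a.pos.x += dt * a.vel.x;  a.pos.y += dt * a.vel.y;  a.pos.z += dt * a.vel.z;
      }
      computeForces(atoms);          // bonded + non-bonded (cutoff, PME, ...)
      for (Atom& a : atoms) {
        double s = 0.5 * dt / a.mass;
        a.vel.x += s * a.force.x; a.vel.y += s * a.force.y; a.vel.z += s * a.force.z;
      }
    }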

  10. Spatial Decomposition with Movable Computes • Atoms distributed to cubes ("cells" or "patches") based on their location • Size of each cube: just a bit larger than the cutoff radius • Computation performed by movable compute objects • Communication-to-computation ratio: O(1) • However: load imbalance • Easily scales to about 8 times the number of patches • Typically 13 computes per patch
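
A toy sketch of the patch binning the slide describes (hypothetical names; real NAMD patches are Charm++ objects): each atom is placed in a cube whose side is slightly larger than the cutoff, so all of its cutoff neighbors lie in its own or an adjacent cube.

    #include <cmath>
    #include <map>
    #include <tuple>
    #include <vector>

    struct Pos { double x, y, z; };
    using PatchIndex = std::tuple<int, int, int>;

    // Bin atom positions into cubic "patches" of side >= cutoff.
    std::map<PatchIndex, std::vector<int>>
    binAtoms(const std::vector<Pos>& pos, double cutoff) {
      double side = cutoff * 1.05;                 // a bit larger than cutoff
      std::map<PatchIndex, std::vector<int>> patches;
      for (int i = 0; i < (int)pos.size(); ++i) {
        PatchIndex key{ (int)std::floor(pos[i].x / side),
                        (int)std::floor(pos[i].y / side),
                        (int)std::floor(pos[i].z / side) };
        patches[key].push_back(i);                 // store the atom index
      }
      return patches;
    }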

  11. NAMD Computation • Application data divided into data objects called patches • Sub-grids determined by cutoff • Computation performed by migratable computes • 13 computes per patch pair and hence much more parallelism • Computes can be further split to increase parallelism

  12. Charm++ and Converse • Charm++: application mapped to Virtual Processors (VPs); the runtime maps VPs to physical processors • Converse: communication layer for Charm++ • Send, recv, progress at the node level • (Diagram: user view of objects feeding a scheduler; system implementation with send/receive message queues over the network)
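
For context, a rough sketch of the Converse messaging idiom that sits underneath Charm++, using the classic calls (CmiRegisterHandler, CmiSetHandler, CmiSyncSendAndFree); this is an assumption-laden example, not NAMD code, and exact signatures may vary across Charm++ versions:

    #include "converse.h"   // Converse runtime, part of the Charm++ distribution

    // A Converse message must begin with the runtime's header bytes.
    struct PingMsg { char header[CmiMsgHeaderSizeBytes]; int hopCount; };

    static int pingHandlerIdx;

    // Handler fired on the destination PE when the message arrives.
    static void pingHandler(void* vmsg) {
      PingMsg* m = (PingMsg*)vmsg;
      CmiPrintf("PE %d received ping, hop %d\n", CmiMyPe(), m->hopCount);
      CmiFree(m);
    }

    static void startFn(int argc, char** argv) {
      pingHandlerIdx = CmiRegisterHandler((CmiHandler)pingHandler);
      if (CmiMyPe() == 0 && CmiNumPes() > 1) {
        PingMsg* m = (PingMsg*)CmiAlloc(sizeof(PingMsg));
        m->hopCount = 1;
        CmiSetHandler(m, pingHandlerIdx);
        CmiSyncSendAndFree(1, sizeof(PingMsg), (char*)m);  // send to PE 1
      }
    }

    int main(int argc, char** argv) {
      ConverseInit(argc, argv, startFn, 0, 0);  // enters the Converse scheduler
      return 0;
    }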

  13. NAMD Parallelization using Charm++ • The computation is decomposed into over 100,000 Virtual Processors (diagram labels: 108 VPs, 847 VPs, 100,000 VPs) • These 100,000+ Virtual Processors (VPs) are mapped to real processors by the Charm++ runtime system

  14. Optimizing NAMD on Blue Gene/L

  15. The Apolipoprotein A1 (APoA1) System • 92,000 atoms • A benchmark for testing NAMD performance on various architectures

  16. F1 ATP Synthase • 327K atoms • Can we run it on Blue Gene/L in virtual node mode?

  17. Lysozyme in 8M Urea Solution • Total ~40,000 atoms • Solvated in a 72.8Å x 72.8Å x 72.8Å box • Lysozyme: 129 residues, 1934 atoms • Urea: 1811 molecules • Water: 7799 molecules • Water/urea ratio: 4.31 • Red: protein, blue: urea, CPK: water • Ruhong Zhou, Maria Eleftheriou, Ajay Royyuru, Bruce Berne

  18. H5N1 Virus Hemagglutinin Binding

  19. HA Binding Simulation Setup • Homotrimer, each monomer with 2 subunits (HA1 & HA2) • Protein: 1491 residues, 23400 atoms • 3 sialic acids, 6 NAGs (N-acetyl-D-glucosamine) • Solvated in a 91Å x 94Å x 156Å water box, with a total of 35,863 water molecules • 30 Na+ ions to neutralize the system • Total ~131,000 atoms • PME for long-range electrostatic interactions • NPT simulation at 300 K and 1 atm

  20. NAMD 2.5 in May 2005 • Initial serial time: 17.6 s • (Plot: APoA1 step time (ms) with PME vs. processors, co-processor mode)

  21. Parallel MD: Easy or Hard? • Easy: tiny working data, spatial locality, uniform atom density, persistent repetition • Hard: sequential timesteps, very short iteration time, full electrostatics, fixed problem size, dynamic variations

  22. NAMD on BGL • Disadvantages • Slow embedded CPU • Small memory per node • Low bisection bandwidth • Hard to scale full electrostatics • Hard to overlap communication with computation • Advantages • Both application and hardware are 3D grids • Large 4MB L3 cache • Higher bandwidth for short messages • Six outgoing links from each node • Static TLB • No OS Daemons

  23. Single Processor Performance • Inner loops • Better software pipelining • Aliasing issues resolved through the use of #pragma disjoint (*ptr1, *ptr2) • Cache optimizations • Use of the double FPU (440d) to exploit more registers • Serial time down from 17.6 s (May 2005) to 7 s • Inner-loop iteration time down from 80 cycles to 32 cycles • Full 440d optimization would require converting some data structures from 24 to 32 bytes
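
For illustration, a minimal inner-loop sketch of the kind of aliasing hint the slide refers to, assuming the IBM XL C/C++ compiler (other compilers simply ignore the pragma); the function and array names are hypothetical, not NAMD's:

    // Accumulate scaled pairwise force components.  Asserting that the
    // pointers never alias lets XL C/C++ software-pipeline the loop more
    // aggressively.
    void addScaledForces(double* force, const double* delta,
                         const double* scale, int n) {
    #pragma disjoint(*force, *delta)
    #pragma disjoint(*force, *scale)
      for (int i = 0; i < n; ++i) {
        force[3*i + 0] += scale[i] * delta[3*i + 0];
        force[3*i + 1] += scale[i] * delta[3*i + 1];
        force[3*i + 2] += scale[i] * delta[3*i + 2];
      }
    }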

  24. Memory Performance • Memory overhead is high due to many small memory allocations • Group small memory allocations into larger buffers • We can now run the ATPase system in virtual node mode • Other sources of memory pressure • Parts of the atom structure are duplicated on all processors • Other duplication to support external clients like Tcl and VMD • These issues still need to be addressed
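
A toy illustration of the "group small allocations into larger buffers" idea (a simple bump arena, not NAMD's actual allocator): each individual malloc carries header and alignment overhead, so packing many small objects with the same lifetime into one large block cuts that overhead and reduces fragmentation.

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    // A minimal bump allocator: one big block serves many small requests.
    // Assumes each request is no larger than blockSize_; everything is freed
    // at once when the Arena is destroyed.
    class Arena {
    public:
      explicit Arena(std::size_t blockSize = 1 << 20) : blockSize_(blockSize) {}
      ~Arena() { for (char* b : blocks_) std::free(b); }

      void* allocate(std::size_t bytes, std::size_t align = 8) {
        offset_ = (offset_ + align - 1) & ~(align - 1);    // align the cursor
        if (blocks_.empty() || offset_ + bytes > blockSize_) {
          blocks_.push_back(static_cast<char*>(std::malloc(blockSize_)));
          offset_ = 0;
        }
        void* p = blocks_.back() + offset_;
        offset_ += bytes;
        return p;
      }

    private:
      std::size_t blockSize_;
      std::size_t offset_ = 0;
      std::vector<char*> blocks_;
    };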

  25. BGL Parallelization • Topology-driven problem mapping • Blue Gene/L has a 3D torus network • Near-neighbor communication has better performance • Load-balancing schemes • Choice of correct grain size • Communication optimizations • Overlap of computation and communication • Messaging performance

  26. Problem Mapping • (Diagram: the 3D application data space (X, Y, Z) and the 3D processor grid (X, Y, Z))

  27. Problem Mapping • (Diagram: the application data space oriented to match the processor grid)

  28. Problem Mapping • (Diagram: the application data space aligned with the processor grid)

  29. Problem Mapping • (Diagram: data objects and cutoff-driven compute objects placed onto the 3D processor grid)
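
A rough sketch of topology-driven mapping (hypothetical helper, not the actual NAMD/Charm++ mapper): a patch's integer coordinates in the patch grid are scaled onto torus coordinates, so neighboring patches land on the same or adjacent nodes and most cutoff communication stays within a few hops.

    struct Torus3D { int X, Y, Z; };          // processor grid dimensions
    struct Coord3D { int x, y, z; };

    // Map patch (p.x, p.y, p.z) in a patchGrid.x x patchGrid.y x patchGrid.z
    // patch grid onto a torus node by proportional scaling in each dimension.
    Coord3D mapPatchToTorus(Coord3D p, Coord3D patchGrid, Torus3D torus) {
      Coord3D node;
      node.x = (p.x * torus.X) / patchGrid.x;
      node.y = (p.y * torus.Y) / patchGrid.y;
      node.z = (p.z * torus.Z) / patchGrid.z;
      return node;
    }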

  30. Improving Grain Size: Two-Away Computation • Patches based on the cutoff alone are too coarse on BGL • Each patch can be split along a dimension • Patches then interact with neighbors of neighbors • Makes the application more fine-grained • Improves load balancing • Smaller messages are sent to more processors • Improves torus bandwidth utilization
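
To make the parallelism gain concrete, here is a small self-contained sketch (illustrative only) that counts pairwise compute objects per patch for one-away neighbors versus two-away along X; the self-compute is excluded, matching the "13 computes per patch" figure, and the two-away count comes out to 22.

    #include <cstdio>
    #include <vector>

    struct Offset { int dx, dy, dz; };

    // Enumerate unique interaction offsets for a patch.  With two-away X,
    // neighbors range over [-2,2] along X instead of [-1,1].  Keeping only
    // the lexicographically positive offset of each +/- pair avoids double
    // counting; the (0,0,0) self-interaction is excluded.
    std::vector<Offset> interactionOffsets(bool twoAwayX) {
      int xr = twoAwayX ? 2 : 1;
      std::vector<Offset> offs;
      for (int dx = -xr; dx <= xr; ++dx)
        for (int dy = -1; dy <= 1; ++dy)
          for (int dz = -1; dz <= 1; ++dz)
            if (dx > 0 || (dx == 0 && (dy > 0 || (dy == 0 && dz > 0))))
              offs.push_back({dx, dy, dz});
      return offs;
    }

    int main() {
      std::printf("one-away:   %zu computes per patch\n",
                  interactionOffsets(false).size());   // 13
      std::printf("two-away X: %zu computes per patch\n",
                  interactionOffsets(true).size());    // 22
    }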

  31. Two Away X • (Diagram: patches split along the X dimension)

  32. Load Balancing Steps • (Diagram: regular timesteps, periodically instrumented timesteps, a detailed and aggressive load-balancing step, followed later by refinement load balancing)

  33. Load-balancing Metrics • Balancing load • Minimizing communication hop-bytes • Place computes close to patches • Minimizing the number of proxies • Affects the connectivity of each patch object
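
For reference, a small sketch of the hop-bytes metric the slide mentions (hypothetical helper): the sum, over all messages, of message size times the torus distance between source and destination, with wraparound links taken into account.

    #include <cstdlib>
    #include <vector>

    struct Node { int x, y, z; };
    struct Message { Node src, dst; long bytes; };

    // Hops along one torus dimension of size `dim`, allowing wraparound.
    static int torusHops(int a, int b, int dim) {
      int d = std::abs(a - b);
      return d < dim - d ? d : dim - d;
    }

    // hop-bytes = sum over messages of (bytes * torus distance); lower is better.
    long hopBytes(const std::vector<Message>& msgs, int X, int Y, int Z) {
      long total = 0;
      for (const Message& m : msgs) {
        int hops = torusHops(m.src.x, m.dst.x, X)
                 + torusHops(m.src.y, m.dst.y, Y)
                 + torusHops(m.src.z, m.dst.z, Z);
        total += m.bytes * hops;
      }
      return total;
    }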

  34. Communication in NAMD • Three major communication phases • Coordinate multicast: heavy communication • Force reduction: messages trickle in • PME: long-range calculations requiring 3D FFTs and all-to-alls

  35. Optimizing Communication • Overlap of communication with computation • New messaging protocols • Adaptive eager • Active put • FIFO mapping schemes

  36. Overlap of Computation and Communication • Each FIFO has 4 packet buffers • The progress engine should be called every 4000 cycles • Progress overhead of about 200 cycles, i.e. about a 5% increase in computation • The remaining time can be used for computation

  37. Network Progress Calls • NAMD makes progress engine calls from the compute loops • Typical frequency is 10000 cycles, dynamically tunable

    for (i = 0; i < (i_upper SELF(- 1)); ++i) {
      CmiNetworkProgress();
      const CompAtom &p_i = p_0[i];
      // ... compute pairlists ...
      for (k = 0; k < npairi; ++k) {
        // compute forces
      }
    }

    void CmiNetworkProgress() {
      new_time = rts_get_timebase();
      if (new_time < lastProgress + PERIOD)
        return;                      // too soon since the last poll
      lastProgress = new_time;
      AdvanceCommunication();
    }

  38. Charm++ Runtime Scalability • Charm++ MPI driver • MPI_Iprobe-based implementation • Higher progress overhead of MPI_Test • Statically pinned FIFOs for point-to-point communication • BGX message layer (developed in collaboration with George Almasi) • Lower progress overhead makes overlap feasible • Active messages • Easy to design complex communication protocols • The Charm++ BGX driver was developed by Chao Huang last summer • Dynamic FIFO mapping
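
A generic sketch of the Iprobe-style polling that an MPI-based machine layer performs (plain MPI with a hypothetical handler callback; not the actual Charm++ driver): each pending message is discovered with MPI_Iprobe, sized with MPI_Get_count, received, and handed off.

    #include <mpi.h>
    #include <vector>

    // Poll for pending point-to-point messages and deliver them to a handler.
    // Returns the number of messages drained in this call.
    int pollIncoming(MPI_Comm comm, void (*handler)(const char*, int, int)) {
      int drained = 0, flag = 1;
      while (flag) {
        MPI_Status status;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
        if (!flag) break;
        int count = 0;
        MPI_Get_count(&status, MPI_CHAR, &count);
        std::vector<char> buf(count);
        MPI_Recv(buf.data(), count, MPI_CHAR, status.MPI_SOURCE,
                 status.MPI_TAG, comm, MPI_STATUS_IGNORE);
        handler(buf.data(), count, status.MPI_SOURCE);  // hand off to the runtime
        ++drained;
      }
      return drained;
    }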

  39. Better Message Performance: Adaptive Eager • Messages sent without rendezvous but with adaptive routing • Impressive performance results for messages in the 1KB-32KB range • Good performance for small non-blocking all-to-all operations like PME • Can achieve about 4 links of throughput
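
For context, a trivial sketch of the standard eager/rendezvous split that the adaptive eager protocol builds on (generic illustration, not the BGX implementation): small and medium messages are pushed without a handshake, large ones negotiate first; adaptive eager keeps the eager path for the 1KB-32KB range while routing its packets adaptively.

    #include <cstddef>

    enum class Protocol { Eager, Rendezvous };

    // Send-side protocol choice: below the limit, send the payload directly;
    // above it, exchange a request/ack so the receiver can post a buffer.
    Protocol chooseProtocol(std::size_t bytes,
                            std::size_t eagerLimit = 32 * 1024) {
      return bytes <= eagerLimit ? Protocol::Eager : Protocol::Rendezvous;
    }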

  40. Active Put • A put that fires a handler at the destination on completion • Persistent communication • Adaptive routing • Lower per message overheads • Better cache performance • Can optimize NAMD coordinate multicast
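
A conceptual, single-address-space sketch of the "active put" interface shape (illustrative only; the real transfer happens over the torus hardware, and the BGX API is not reproduced here): data lands in a pre-registered destination buffer, then a completion handler runs on the destination side.

    #include <cstddef>
    #include <cstring>
    #include <functional>

    struct ActivePutTarget {
      char* buffer;                                  // pre-registered buffer
      std::function<void(std::size_t)> onComplete;   // fires after data lands
    };

    // Copy the payload and fire the destination-side handler on completion.
    void activePut(ActivePutTarget& dst, const char* src, std::size_t bytes) {
      std::memcpy(dst.buffer, src, bytes);  // stands in for the network transfer
      dst.onComplete(bytes);                // "handler fired at the destination"
    }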

  41. FIFO Mapping • pinFifo algorithms • Decide which of the 6 injection FIFOs to use when sending a message to destination {x,y,z,t} • Cones, chessboard • Dynamic FIFO mapping • A special send queue from which a message can be injected into whichever FIFO is not full
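
A toy sketch of a "cones"-style pinFifo rule (illustrative only; the real BG/L layer has hardware-specific constraints): inject into the FIFO whose direction covers the largest signed displacement toward the destination on the torus.

    #include <cstdlib>

    // FIFO ids stand for the +X,-X,+Y,-Y,+Z,-Z directions (illustrative).
    enum Fifo { XPlus, XMinus, YPlus, YMinus, ZPlus, ZMinus };

    // Signed shortest displacement from a to b on a ring of size dim.
    static int torusDelta(int a, int b, int dim) {
      int d = (b - a + dim) % dim;
      return d <= dim / 2 ? d : d - dim;
    }

    // Pick the FIFO matching the dominant component of the displacement
    // from source (sx,sy,sz) to destination (dx,dy,dz) on an X x Y x Z torus.
    Fifo pinFifo(int sx, int sy, int sz, int dx, int dy, int dz,
                 int X, int Y, int Z) {
      int ex = torusDelta(sx, dx, X);
      int ey = torusDelta(sy, dy, Y);
      int ez = torusDelta(sz, dz, Z);
      int ax = std::abs(ex), ay = std::abs(ey), az = std::abs(ez);
      if (ax >= ay && ax >= az) return ex >= 0 ? XPlus : XMinus;
      if (ay >= az)             return ey >= 0 ? YPlus : YMinus;
      return ez >= 0 ? ZPlus : ZMinus;
    }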

  42. Performance Results

  43. BGX Message Layer vs MPI • The fully non-blocking version performed below par on MPI • Polling overhead is high for a long list of posted receives • The BGX native communication layer works well with asynchronous communication • (Chart: NAMD 2.6b1 co-processor mode performance, ms/step, October 2005)

  44. NAMD Performance • (Plot: APoA1 step time (ms) with PME vs. processors, co-processor mode; annotations: scaling = 2.5, scaling = 4.5, time-step = 4 ms)

  45. Virtual Node Mode • (Plot: APoA1 step time (ms) with PME vs. processors, comparing virtual node mode with co-processor mode on twice as many chips)

  46. Impact of Optimizations • (Chart: NAMD cutoff step time on the APoA1 system on 1024 processors)

  47. Blocking Communication (Projections timeline of a 1024-node run without aggressive network progress) • Network progress not aggressive enough: communication gaps result in a low utilization of 65%

  48. Effect of Network Progress (Projections timeline of a 1024-node run with aggressive network progress) • More frequent advance closes gaps: higher network utilization of about 75%

  49. Summary

  50. Impact on Science • Dr. Zhou ran the lysozyme system for 6.7 billion time steps over about two months on 8 racks of Blue Gene/L
