
Programming to PetaScale with Multicore Chips and Early Experience on Abe with Charm++


Presentation Transcript


  1. Programming to PetaScale with Multicore Chips and Early Experience on Abe with Charm++. Laxmikant Kale, http://charm.cs.uiuc.edu, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign. NCSA Abe Multicore Workshop

  2. Outline • A series of lessons learned • How one should parallelize applications for the petascale • Early Experience on Abe • New Programming Models • Simplifying Parallel Programming

  3. PPL Mission and Approach • To enhance Performance and Productivity in programming complex parallel applications • Performance: scalable to thousands of processors • Productivity: of human programmers • Complex: irregular structure, dynamic variations • Approach: Application-oriented yet CS-centered research • Develop enabling technology for a wide collection of apps • Develop, use and test it in the context of real applications • How? • Develop novel parallel programming techniques • Embody them into easy-to-use abstractions • So application scientists can use advanced techniques with ease • Enabling technology: reused across many apps

  4. Migratable Objects (aka Processor Virtualization): Benefits • Programmer: [over]decomposition into virtual processors • Runtime: assigns VPs to processors, enabling adaptive runtime strategies • Implementations: Charm++, AMPI • Benefits: • Software engineering: number of virtual processors can be independently controlled; separate VPs for different modules • Message-driven execution: adaptive overlap of communication; predictability (automatic out-of-core); asynchronous reductions • Dynamic mapping: heterogeneous clusters (vacate, adjust to speed, share); automatic checkpointing (change the set of processors used); automatic dynamic load balancing; communication optimization • [Figure: user view vs. system implementation of migratable objects]

  5. Migratable Objects: Benefits (continued) • Same benefits list as the previous slide, now shown alongside the figure • [Figure: user view of Virtual Processors (user-level migratable threads / MPI processes) mapped by the runtime onto Real Processors]

  6. Enabling CS technology of parallel objects and intelligent runtime systems (Charm++ and AMPI) has led to several collaborative applications in CSE. [Figure: applications built on Parallel Objects, Adaptive Runtime System, Libraries and Tools: Quantum Chemistry (QM/MM), Protein Folding, Molecular Dynamics, Computational Cosmology, Crack Propagation, Space-time Meshes, Dendritic Growth, Rocket Simulation]

  7. Application-Oriented Parallel Abstractions • Synergy between computer science research and apps has been beneficial to both • [Figure: applications (NAMD, LeanCP, ChaNGa, Rocket Simulation, Space-time Meshing, other applications) feeding issues into, and drawing techniques & libraries from, Charm++]

  8. Charm++ for Multicores • Announcing “beta” release of multicore version • A specialized stand-alone version for single desktops • Also, extended support for Abe-like multicore/SMP systems • Official release in a month or so

  9. Porting Charm++ to Abe • Charm++ has a machine-dependent layer • Frequently called the machine layer • First port: using existing MPI layers • MPI-based layer: MPICH-VMI, [MVAPICH] • Using lower-level layers • Multiple machine layers are usable on Abe • ibverbs layer that uses the verbs API directly • VMI

  10. Ibverbs layer • Reliable connection among all processors • Small messages: eager protocol • Large messages: RDMA • Eager protocol • Unexpected messages are the common case for a Charm++ program • We use an InfiniBand shared receive queue (SRQ) to post receive buffers for all processors (see the sketch below)
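As a concrete illustration of the eager path, here is a minimal ibverbs sketch of pre-posting fixed-size receive buffers to a shared receive queue. This is not the actual Charm++ machine-layer code; the buffer size, buffer count, and helper names are assumptions made for the example.

```cpp
// Illustrative only: pre-posting fixed-size eager buffers to an ibverbs
// shared receive queue (SRQ), so one pool of buffers serves all peers.
// EAGER_BUF_SIZE, NUM_RECV_BUFS and the helper names are invented here.
#include <infiniband/verbs.h>
#include <cstdint>
#include <cstdlib>
#include <cstring>

static const int EAGER_BUF_SIZE = 2048;   // one packet-sized buffer
static const int NUM_RECV_BUFS  = 1024;   // shared across all remote processors

ibv_srq* create_eager_srq(ibv_pd* pd) {
  ibv_srq_init_attr attr;
  std::memset(&attr, 0, sizeof(attr));
  attr.attr.max_wr  = NUM_RECV_BUFS;      // capacity of the shared queue
  attr.attr.max_sge = 1;
  return ibv_create_srq(pd, &attr);
}

void post_eager_buffer(ibv_pd* pd, ibv_srq* srq) {
  void* buf = std::malloc(EAGER_BUF_SIZE);
  ibv_mr* mr = ibv_reg_mr(pd, buf, EAGER_BUF_SIZE, IBV_ACCESS_LOCAL_WRITE);

  ibv_sge sge;
  std::memset(&sge, 0, sizeof(sge));
  sge.addr   = (uintptr_t) buf;
  sge.length = EAGER_BUF_SIZE;
  sge.lkey   = mr->lkey;

  ibv_recv_wr wr;
  std::memset(&wr, 0, sizeof(wr));
  wr.wr_id   = (uintptr_t) buf;           // identifies the buffer on completion
  wr.sg_list = &sge;
  wr.num_sge = 1;

  ibv_recv_wr* bad_wr = nullptr;
  // Any connection attached to this SRQ can consume the buffer, so the number
  // of posted buffers scales with the node, not with the number of peers.
  ibv_post_srq_recv(srq, &wr, &bad_wr);
}
```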

  11. Eager protocol contd. • Packet-based, since pre-posted receive buffers have to be of a fixed size • Flow control among processors • Prevents one processor from flooding another • Grants more resources to a processor that is sending more messages to a given peer • Has a memory pool for Charm++ messages • One copy for short messages, only on the receive side • Necessary to reassemble messages that span multiple packets

  12. RDMA Layer • Zero-copy messaging via RDMA • RDMA is also part of flow control between processors • RDMA is also used for a persistent communication API (apart from regular messaging); a zero-copy RDMA write is sketched below
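For the large-message path, a zero-copy RDMA write with ibverbs looks roughly like the following. Again, this is a sketch rather than the Charm++ layer itself; it assumes the receiver's buffer address and rkey were exchanged beforehand (e.g., in a rendezvous handshake) and that `mr` covers `buf`.

```cpp
// Minimal sketch of a zero-copy RDMA write with ibverbs (illustrative only).
#include <infiniband/verbs.h>
#include <cstdint>
#include <cstring>

int rdma_write(ibv_qp* qp, ibv_mr* mr, void* buf, size_t len,
               uint64_t remote_addr, uint32_t rkey) {
  ibv_sge sge;
  std::memset(&sge, 0, sizeof(sge));
  sge.addr   = (uintptr_t) buf;            // local source buffer, already registered
  sge.length = (uint32_t) len;
  sge.lkey   = mr->lkey;

  ibv_send_wr wr;
  std::memset(&wr, 0, sizeof(wr));
  wr.wr_id      = (uintptr_t) buf;         // lets the completion handler reuse buf
  wr.opcode     = IBV_WR_RDMA_WRITE;       // data lands directly in the remote buffer
  wr.send_flags = IBV_SEND_SIGNALED;       // request a completion so we know when it's done
  wr.sg_list    = &sge;
  wr.num_sge    = 1;
  wr.wr.rdma.remote_addr = remote_addr;    // destination advertised by the receiver
  wr.wr.rdma.rkey        = rkey;

  ibv_send_wr* bad_wr = nullptr;
  return ibv_post_send(qp, &wr, &bad_wr);  // 0 on success
}
```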

  13. Planned steps • Develop an SMP version of the ibverbs layer • Improve communication performance among processors within a node • A separate thread for communication • Requires locking • Reduce the memory cost of scaling to large numbers of processors

  14. Lesson 1: Choose Your Algorithms

  15. Choose your algorithms carefully • Create parallelism where there was none • Parallel prefix (scan) operation (see the sketch below) • Degree of parallelism • More is better, usually • Overlap of phases • Modern machines make one rethink algorithms: • Operation count may be less important than memory accesses • Degree of reuse
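As an example of creating parallelism where a computation looks inherently sequential, an inclusive prefix sum can be restructured into O(log n) doubling steps whose inner loop is fully independent and can be spread across processors. The sketch below is a generic illustration, not code from the talk.

```cpp
// Illustrative sketch: inclusive prefix sum in O(log n) doubling steps.
// Each step's inner loop has no dependences, so it can be distributed
// across processors or chares; that is how a scan "creates parallelism".
#include <vector>
#include <cstddef>

std::vector<double> inclusive_scan_doubling(std::vector<double> a) {
  const std::size_t n = a.size();
  for (std::size_t stride = 1; stride < n; stride *= 2) {
    std::vector<double> next(a);                 // double-buffer: reads see the previous step
    for (std::size_t i = stride; i < n; ++i)     // every iteration is independent
      next[i] = a[i] + a[i - stride];
    a.swap(next);
  }
  return a;   // a[i] now holds the sum of the first i+1 input elements
}
```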

  16. Analyze Scalability of the Algorithm (say via the iso-efficiency metric)

  17. Isoefficiency Analysis • An algorithm (*) is scalable if, when you double the number of processors available to it, it can retain the same parallel efficiency by increasing the size of the problem by some amount • Not all algorithms are scalable in this sense • Isoefficiency is the rate at which the problem size must be increased, as a function of the number of processors, to keep the same efficiency • Parallel efficiency = T1/(P * Tp), where T1 is the time on one processor and Tp is the time on P processors (the standard formulation is spelled out below) • [Figure: equal-efficiency curves in the problem size vs. processors plane]
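Spelled out in its usual textbook form (added here for reference rather than taken from the slides), with problem size W = T1 and total overhead To(W, P) = P·Tp - W:

```latex
% Isoefficiency in its standard form (reference material, not from the slides).
% W = problem size (serial work T_1); T_o(W,P) = P\,T_P - W is the total overhead.
\[
  E \;=\; \frac{T_1}{P\,T_P} \;=\; \frac{W}{W + T_o(W,P)}
  \qquad\Longrightarrow\qquad
  E \text{ held constant} \;\iff\; W \;=\; \tfrac{E}{1-E}\,T_o(W,P).
\]
% Example: adding n numbers on P processors has T_o = \Theta(P \log P),
% so the problem size must grow as \Theta(P \log P) to keep E fixed.
```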

  18. Molecular Dynamics in NAMD • Collection of [charged] atoms, with bonds • Newtonian mechanics • Thousands of atoms (10,000 – 5,000,000) • At each time-step • Calculate forces on each atom • Bonds • Non-bonded: electrostatic and van der Waals • Short-distance: every timestep • Long-distance: using PME (3D FFT) • Multiple time stepping: PME every 4 timesteps • Calculate velocities and advance positions (see the sketch of the loop below) • Challenge: femtosecond time-step, millions needed! • Collaboration with K. Schulten, R. Skeel, and coworkers
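The timestep structure above can be summarized in a short sketch. The force routines here are empty placeholders and all names are invented; this is not NAMD code, just the shape of the loop with multiple time stepping (PME only every pmeEvery steps).

```cpp
// Shape of the MD timestep described on the slide (placeholder code, not NAMD).
#include <vector>

struct Atom { double x[3], v[3], f[3], q, m; };

// Placeholder force kernels; the real ones are the expensive part.
void computeBondedForces(std::vector<Atom>&) {}
void computeShortRangeNonbonded(std::vector<Atom>&) {}   // electrostatics + van der Waals within the cutoff
void computeLongRangePME(std::vector<Atom>&) {}          // 3D-FFT-based long-range electrostatics

void mdLoop(std::vector<Atom>& atoms, int nsteps, double dt, int pmeEvery = 4) {
  for (int step = 0; step < nsteps; ++step) {
    for (Atom& a : atoms) a.f[0] = a.f[1] = a.f[2] = 0.0;

    computeBondedForces(atoms);
    computeShortRangeNonbonded(atoms);          // every timestep
    if (step % pmeEvery == 0)
      computeLongRangePME(atoms);               // multiple time stepping: PME every few steps

    for (Atom& a : atoms)                       // update velocities, advance positions
      for (int d = 0; d < 3; ++d) {
        a.v[d] += dt * a.f[d] / a.m;
        a.x[d] += dt * a.v[d];
      }
  }
}
```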

  19. Traditional Approaches (1996-2002): non-isoefficient • Replicated data: all atom coordinates stored on each processor • Communication/computation ratio: O(P log P); not scalable • Partition the atoms array across processors • Nearby atoms may not be on the same processor • C/C ratio: O(P); not scalable • Distribute the force matrix to processors • Matrix is sparse, non-uniform • C/C ratio: O(sqrt(P)); not scalable

  20. Spatial Decomposition via Charm++ • Atoms distributed to cubes based on their location • Size of each cube: just a bit larger than the cut-off radius • Communicate only with neighbors • Work: for each pair of neighboring objects • C/C ratio: O(1) • However: load imbalance, limited parallelism • Charm++ is useful to handle this • The cubes are called cells or “patches” (see the mapping sketch below)
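A small sketch of the patch assignment described above; the structure and names are illustrative placeholders, not NAMD or Charm++ internals.

```cpp
// Illustrative sketch: assigning atoms to patches (cubes slightly larger than
// the cutoff), so an atom only interacts with its own and neighboring patches.
#include <array>
#include <cmath>

struct PatchGrid {
  double origin[3];
  double patchSize;      // chosen >= cutoff (plus a margin), so only neighbors interact
  int    dims[3];        // number of patches in each dimension

  // 3D patch coordinates of an atom position
  std::array<int, 3> patchCoord(const double pos[3]) const {
    std::array<int, 3> c;
    for (int d = 0; d < 3; ++d)
      c[d] = (int) std::floor((pos[d] - origin[d]) / patchSize);
    return c;
  }

  // Flattened patch index; with patchSize >= cutoff, interactions involve only
  // the home patch and its 26 neighbors (the "1-away" scheme on the slide).
  int patchIndex(const double pos[3]) const {
    std::array<int, 3> c = patchCoord(pos);
    return (c[2] * dims[1] + c[1]) * dims[0] + c[0];
  }
};
```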

  21. Object Based Parallelization for MD: Force Decomposition + Spatial Decomposition • Now, we have many objects to load balance: • Each diamond can be assigned to any proc. • Number of diamonds (3D): • 14·Number of Patches • 2-away variation: • Half-size cubes • 5x5x5 interactions • 3-away interactions: 7x7x7

  22. Listen to Amdahl’s Law and Variants

  23. Amdahl and variants • The original Amdahl’s law, interpreted as: if there is an x% sequential component, speedup can’t be more than 100/x (see the formula below) • Variations: • If you decompose a problem into many parts, then the parallel time cannot be less than the largest of the parts • If the critical path through a computation is T, you cannot complete in less time than T, no matter how many processors you use • …
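For reference, the standard form of Amdahl's law (added here, not from the slides), with sequential fraction s = x/100:

```latex
% Amdahl's law (standard form, added for reference).
% s = sequential fraction of the work, P = number of processors.
\[
  \text{speedup}(P) \;=\; \frac{1}{\,s + \dfrac{1-s}{P}\,}
  \;\xrightarrow[\;P \to \infty\;]{}\; \frac{1}{s} \;=\; \frac{100}{x}.
\]
% Example: a 5% sequential component (s = 0.05) caps the speedup at 20,
% no matter how many processors are used.
```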

  24. Fine Grained Decomposition on BlueGene

  25. Decouple decomposition from Physical Processors

  26. Parallel Decomposition and Processors • MPI-style encourages • Decomposition into P pieces, where P is the number of physical processors available • If your natural decomposition is a cube, then the number of processors must be a cube • … • Charm++/AMPI-style “virtual processors” • Decompose into natural objects of the application • Let the runtime map them to processors • Decouple decomposition from load balancing (a minimal Charm++ sketch follows)
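A minimal Charm++-style sketch of this decoupling: the number of chares is chosen from the problem (here simply several per core), and the runtime decides where they run and may migrate them later. This is a generic over-decomposition pattern with invented names (blocks.ci, Block), not code from any of the applications mentioned.

```cpp
// blocks.ci -- interface file (illustrative)
mainmodule blocks {
  readonly CProxy_Main mainProxy;
  mainchare Main {
    entry Main(CkArgMsg* m);
    entry void done();
  };
  array [1D] Block {
    entry Block();
    entry void compute();
  };
};
```

```cpp
// blocks.C -- the decomposition is driven by the problem, not by CkNumPes()
#include "blocks.decl.h"

/* readonly */ CProxy_Main mainProxy;

class Main : public CBase_Main {
  int pending;
public:
  Main(CkArgMsg* m) {
    delete m;
    int numBlocks = 8 * CkNumPes();      // over-decompose: many more objects than cores
    pending = numBlocks;
    mainProxy = thisProxy;
    CProxy_Block blocks = CProxy_Block::ckNew(numBlocks);
    blocks.compute();                    // broadcast to every array element
  }
  void done() { if (--pending == 0) CkExit(); }
};

class Block : public CBase_Block {
public:
  Block() {}
  Block(CkMigrateMessage*) {}            // needed so the runtime can migrate the object
  void compute() {
    // ... work on this block, wherever the runtime placed it ...
    mainProxy.done();
  }
};

#include "blocks.def.h"
```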

  27. LeanCP: Car-Parrinello ab initio MD • Collaborative IT project with: R. Car, M. Klein, M. Tuckerman, Glenn Martyna, N. Nystrom, … • Specific software project (LeanCP): Glenn Martyna, Mark Tuckerman, L.V. Kale and co-workers (E. Bohm, Yan Shi, Ramkumar Vadali) • Funding: NSF-CHE, NSF-CS, NSF-ITR, IBM

  28. Parallelization under Charm++:

  29. [figure-only slide: no text]

  30. Parallel scaling of liquid water* as a function of system size on the Blue Gene/L installation at YKT: *Liquid water has 4 states per molecule. • Weak scaling is observed! • Strong scaling on processor numbers up to ~60x the number of states!

  31. Use Dynamic Load Balancing

  32. Load Balancing Steps • [Timeline figure: regular timesteps, instrumented timesteps, detailed/aggressive load balancing, refinement load balancing]

  33. Load Balancing • Processor utilization against time on 128 and 1024 processors (aggressive load balancing followed by refinement load balancing) • On 128 processors, a single load balancing step suffices, but on 1024 processors we need a “refinement” step as well (see the usage sketch below)
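For reference, a minimal sketch of how a Charm++ chare array typically triggers measurement-based load balancing; this is the generic library pattern, not the application code, and the class and helper names (Worker, contributeToNextStep) are invented for the example.

```cpp
// Generic Charm++ load-balancing hook (illustrative; assumes the usual
// generated CBase_Worker declarations from the module's .ci file).
class Worker : public CBase_Worker {
public:
  Worker() { usesAtSync = true; }            // opt in to measurement-based balancing
  Worker(CkMigrateMessage*) {}
  void pup(PUP::er& p) { /* pack/unpack member data so the object can migrate */ }

  void step(int iter) {
    // ... compute this object's share of the timestep ...
    if (iter % 100 == 0) AtSync();           // hand control to the load balancer
    else contributeToNextStep();             // hypothetical continuation of the computation
  }
  void ResumeFromSync() { contributeToNextStep(); }  // called after objects are migrated
private:
  void contributeToNextStep() { /* ... */ }
};
```

The balancing strategy itself is picked at run time, e.g. `+balancer GreedyLB` or `+balancer RefineLB` on the command line, which matches the aggressive-then-refine sequence described above.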

  34. ChaNGa: Parallel Gravity • Collaborative project (NSF ITR) • With Prof. Tom Quinn, Univ. of Washington • Components: gravity, gas dynamics • Barnes-Hut tree codes • Oct-tree is the natural decomposition: • Geometry has better aspect ratios, so you “open” fewer nodes • But it is traditionally not used because it leads to bad load balance • Assumption: one-to-one map between sub-trees and processors • Binary trees are considered better load balanced • With Charm++: use the oct-tree, and let Charm++ map subtrees to processors

  35. [figure-only slide: no text]

  36. Load balancing with GreedyLB: dwarf 5M on 1,024 BlueGene/L processors [timeline figure; step times of 5.6s and 6.1s]

  37. Load balancing with OrbRefineLB: dwarf 5M on 1,024 BlueGene/L processors [timeline figure; step times of 5.6s and 5.0s]

  38. ChaNGa Preliminary Performance • ChaNGa: Parallel Gravity Code, developed in collaboration with Tom Quinn (Univ. Washington) using Charm++

  39. ChaNGa Preliminary Performance on Abe • ChaNGa: Parallel Gravity Code, developed in collaboration with Tom Quinn (Univ. Washington) using Charm++

  40. ChaNGa Preliminary Performance on Abe • ChaNGa: Parallel Gravity Code, developed in collaboration with Tom Quinn (Univ. Washington) using Charm++

  41. ChaNGa on Abe: Larger dataset

  42. Load Balancing • Adaptive load balancing examples • 1-D elastic-plastic wave propagation • A bar is dynamically loaded, resulting in an elastic wave propagating down the bar; upon reflection from the fixed end, the material becomes plastic • 3-D dynamic elastic-plastic fracture • Load imbalance occurs at the onset of an element turning from elastic to plastic; a zone of plasticity forms over a limited number of processors as the crack propagates • Collaboration with Philippe Geubelle

  43. Fractography on Abe • Fractography: structural dynamics, with cohesive elements • Developed in collaboration with Philippe Geubelle

  44. Use Asynchronous Collectives • Barrier/reduction performance is not a problem • When you find processors waiting at a barrier, it’s usually because of load imbalances • But avoiding barriers to overlap phases is good! (see the reduction sketch below)
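The asynchronous pattern in Charm++ looks roughly like this: each object contributes to a reduction and immediately moves on, and the result is delivered later to a callback. The class and method names here (Patch, Main::reportEnergy, mainProxy, startNextPhase) are invented for the example; this is the generic library pattern, not NAMD source, and it assumes the corresponding entry methods are declared in the module's .ci file.

```cpp
// Generic Charm++ asynchronous reduction (illustrative fragment).
// No processor blocks at a barrier: contribute() returns immediately and the
// summed result arrives later at the callback on the main chare.
void Patch::finishTimestep(double localEnergy) {
  CkCallback cb(CkIndex_Main::reportEnergy(NULL), mainProxy);
  contribute(sizeof(double), &localEnergy, CkReduction::sum_double, cb);
  startNextPhase();              // overlap: keep computing while the reduction proceeds
}

void Main::reportEnergy(CkReductionMsg* msg) {
  double total = *(double*) msg->getData();
  CkPrintf("total energy = %f\n", total);
  delete msg;
}
```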

  45. NAMD Parallelization using Charm++: PME • [Figure labels: 192 + 144 VPs, 700 VPs, 30,000 VPs] • These 30,000+ Virtual Processors (VPs) are mapped to real processors by the Charm++ runtime system

  46. Apo-A1 on BlueGene/L, 1024 procs: 94% efficiency • Shallow valleys, high peaks, nicely overlapped PME • Charm++’s “Projections” analysis tool: time intervals on the x axis, activity summed across processors on the y axis • Green: communication • Red: integration • Blue/purple: electrostatics • Orange: PME • Turquoise: angle/dihedral

  47. Cray XT3, 512 processors: initial runs, 76% efficiency • Clearly needed further tuning, especially PME • But had more potential (much faster processors)

  48. On Cray XT3, 512 processors: after optimizations, 96% efficiency

  49. Abe: NAMD, Apo-A1, on 512 cores

  50. Analyze Performance with Sophisticated Tools
