
Scalable Molecular Dynamics for Large Biomolecular Systems

Presentation Transcript


  1. Scalable Molecular Dynamics for Large Biomolecular Systems • Robert Brunner, James C. Phillips, Laxmikant Kale • Department of Computer Science and Theoretical Biophysics Group, University of Illinois at Urbana-Champaign

  2. Parallel Computing with Data-driven Objects • Laxmikant (Sanjay) Kale, Parallel Programming Laboratory, Department of Computer Science • http://charm.cs.uiuc.edu

  3. Overview • Context: approach and methodology • Molecular dynamics for biomolecules • Our program, NAMD • Basic parallelization strategy • NAMD performance optimizations • Techniques • Results • Conclusions: summary, lessons and future work

  4. Parallel Programming Laboratory • Objective: Enhance performance and productivity in parallel programming • For complex, dynamic applications • Scalable to thousands of processors • Theme: • Adaptive techniques for handling dynamic behavior • Strategy: Look for optimal division of labor between human programmer and the “system” • Let the programmer specify what to do in parallel • Let the system decide when and where to run them • Data driven objects as the substrate: Charm++

  5. System Mapped Objects [figure: numbered work objects mapped by the runtime system onto processors]

  6. Data Driven Execution [figure: each processor runs a scheduler that picks the next available message from its message queue]

  7. Charm++ • Parallel C++ with data driven objects • Object Arrays and collections • Asynchronous method invocation • Object Groups: • global object with a “representative” on each PE • Prioritized scheduling • Mature, robust, portable • http://charm.cs.uiuc.edu
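
A minimal Charm++ sketch of the ideas on this slide (an object array with asynchronous method invocation, mapped by the runtime); the module and class names are illustrative and not NAMD code, so treat this as a sketch under those assumptions.

```cpp
// sketch.ci -- Charm++ interface file (illustrative names):
//   mainmodule sketch {
//     readonly CProxy_Main mainProxy;
//     mainchare Main {
//       entry Main(CkArgMsg* m);
//       entry void done(double result);
//     };
//     array [1D] Worker {
//       entry Worker();
//       entry void compute(int step);   // asynchronous entry method
//     };
//   };

// sketch.C -- implementation
#include "sketch.decl.h"

/*readonly*/ CProxy_Main mainProxy;

class Main : public CBase_Main {
  int received;
public:
  Main(CkArgMsg* m) : received(0) {
    mainProxy = thisProxy;
    CProxy_Worker workers = CProxy_Worker::ckNew(64);  // 64 objects; the runtime maps them to PEs
    workers.compute(0);             // broadcast: asynchronous invocation on every element
  }
  void done(double result) {
    if (++received == 64) CkExit(); // all workers reported back
  }
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  void compute(int step) {
    double partial = 0.0;           // ... do one chunk of work for this timestep ...
    mainProxy.done(partial);        // send the result back, again asynchronously
  }
};

#include "sketch.def.h"
```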

  8. Multi-partition Decomposition • Writing applications with Charm++ • Decompose the problem into a large number of chunks • Implement chunks as objects • Or, now, as MPI threads (AMPI on top of Charm++) • Let Charm++ map and remap objects • Allow for migration of objects • If desired, specify potential migration points (see the sketch below)
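
As a concrete illustration of migratable chunks, a Charm++ array element can describe its state in a pup() routine and mark a potential migration point with AtSync(); the runtime then serializes and moves the object when the load balancer decides to. The class name, data layout, and step interval below are assumptions for the sketch, not NAMD code.

```cpp
// Illustrative migratable array element. The .ci file would declare:
//   array [1D] Chunk { entry Chunk(); entry void step(int n); };
#include "chunk.decl.h"
#include "pup_stl.h"
#include <vector>

class Chunk : public CBase_Chunk {
  std::vector<double> state;   // per-chunk data that must move with the object
  int nextStep;                // where to resume after load balancing
public:
  Chunk() : state(1024, 0.0), nextStep(0) {
    usesAtSync = true;         // this element participates in AtSync load balancing
  }
  Chunk(CkMigrateMessage*) {}  // required migration constructor

  void pup(PUP::er& p) {       // (de)serialize the object for migration
    CBase_Chunk::pup(p);
    p | state;
    p | nextStep;
  }

  void step(int n) {
    // ... compute one chunk of work for timestep n ...
    nextStep = n + 1;
    if (n % 100 == 99) AtSync();               // potential migration point
    else thisProxy[thisIndex].step(nextStep);  // continue asynchronously
  }

  void ResumeFromSync() {                      // called after load balancing completes
    thisProxy[thisIndex].step(nextStep);
  }
};
#include "chunk.def.h"
```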

  9. Load Balancing Mechanisms • Re-map and migrate objects • Registration mechanisms facilitate migration • Efficient message delivery strategies • Efficient global operations • Such as reductions and broadcasts • Several classes of load balancing strategies provided • Incremental • Centralized as well as distributed • Measurement based

  10. Principle of Persistence • An observation about CSE applications • An extension of the principle of locality • The behavior of objects, including computational load and communication patterns, tends to persist over time • Application-induced imbalance: • Abrupt, but infrequent, or • Slow, cumulative • Rarely: frequent, large changes • Our framework still deals with this case as well • Measurement-based strategies

  11. Measurement-Based Load Balancing Strategies • Collect timing data for several cycles • Run heuristic load balancer • Several alternative ones • Robert Brunner’s recent Ph.D. thesis: • Instrumentation framework • Strategies • Performance comparisons

  12. Molecular Dynamics ApoA-I: 92k Atoms

  13. Molecular Dynamics and NAMD • MD is used to understand the structure and function of biomolecules • Proteins, DNA, membranes • NAMD is a production-quality MD program • Active use by biophysicists (published science) • 50,000+ lines of C++ code • 1000+ registered users • Features include: • CHARMM and XPLOR compatibility • PME electrostatics and multiple timestepping • Steered and interactive simulation via VMD

  14. NAMD Contributors • PIs: Laxmikant Kale, Klaus Schulten, Robert Skeel • NAMD Version 1: Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson • NAMD2: M. Bhandarkar, R. Brunner, Justin Gullingsrud, A. Gursoy, N. Krawetz, J. Phillips, A. Shinozaki, K. Varadarajan, Gengbin Zheng, ... • Theoretical Biophysics Group, supported by NIH

  15. Molecular Dynamics • Collection of [charged] atoms, with bonds • Newtonian mechanics • At each time-step: • Calculate forces on each atom • Bonds • Non-bonded: electrostatic and van der Waals • Calculate velocities and advance positions • 1 femtosecond time-step, millions needed! • Thousands of atoms (1,000 - 100,000) • (a serial sketch of one step follows below)
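
The per-timestep work on this slide, written out as a minimal serial sketch. This is a plain explicit update with placeholder force routines, not NAMD's integrator; the struct and function names are assumptions for illustration.

```cpp
#include <vector>

struct Vec3 { double x = 0, y = 0, z = 0; };
struct Atom { Vec3 pos, vel, force; double mass = 1.0, charge = 0.0; };

// Placeholder force routines; the actual terms are sketched on later slides.
static void computeBondedForces(std::vector<Atom>&) { /* bonds, angles, dihedrals */ }
static void computeNonbondedForces(std::vector<Atom>&, double /*cutoff*/) { /* electrostatics, van der Waals */ }

// One step is ~1 fs of simulated time; production runs take millions of steps.
void runMD(std::vector<Atom>& atoms, long nSteps, double dt, double cutoff) {
  for (long step = 0; step < nSteps; ++step) {
    for (Atom& a : atoms) a.force = Vec3{};       // zero force accumulators
    computeBondedForces(atoms);
    computeNonbondedForces(atoms, cutoff);
    for (Atom& a : atoms) {                       // advance velocities, then positions
      a.vel.x += dt * a.force.x / a.mass;  a.pos.x += dt * a.vel.x;
      a.vel.y += dt * a.force.y / a.mass;  a.pos.y += dt * a.vel.y;
      a.vel.z += dt * a.force.z / a.mass;  a.pos.z += dt * a.vel.z;
    }
  }
}
```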

  16. Cut-off Radius • A cut-off radius (8 - 14 Å) is used to reduce work • Far-away atoms are ignored! (screening effects) • 80-95% of the work is non-bonded force computation • Some simulations need the faraway contributions • Particle-Mesh Ewald (PME) • Even so, cut-off based computations are important: • Near-atom calculations constitute the bulk of the work • Multiple time-stepping is used: k cut-off steps per PME step • So, (k-1) of every k steps do just cut-off based computation (see the sketch below)
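
A sketch of the two points above: the pairwise loop skips atom pairs beyond the cut-off radius, and with multiple timestepping only every k-th step pays for the long-range (PME) contribution. The function names, the O(N^2) loop, and the omitted physical constants are simplifications for illustration.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { double x = 0, y = 0, z = 0; };
struct Atom { Vec3 pos, force; double charge = 0.0; };

// Cut-off non-bonded forces: O(N^2) here for clarity; the spatial decomposition
// on later slides restricts this to pairs in neighboring patches.
void cutoffForces(std::vector<Atom>& atoms, double cutoff) {
  const double cut2 = cutoff * cutoff;
  for (size_t i = 0; i < atoms.size(); ++i) {
    for (size_t j = i + 1; j < atoms.size(); ++j) {
      const double dx = atoms[i].pos.x - atoms[j].pos.x;
      const double dy = atoms[i].pos.y - atoms[j].pos.y;
      const double dz = atoms[i].pos.z - atoms[j].pos.z;
      const double r2 = dx*dx + dy*dy + dz*dz;
      if (r2 > cut2) continue;                 // far-away pair: ignored
      // Coulomb-like term, physical constants omitted for brevity.
      const double f = atoms[i].charge * atoms[j].charge / (r2 * std::sqrt(r2));
      atoms[i].force.x += f * dx;  atoms[j].force.x -= f * dx;
      atoms[i].force.y += f * dy;  atoms[j].force.y -= f * dy;
      atoms[i].force.z += f * dz;  atoms[j].force.z -= f * dz;
    }
  }
}

void longRangePME(std::vector<Atom>&) { /* reciprocal-space electrostatics; omitted */ }

// Multiple timestepping: (k-1) cut-off-only steps for every PME step.
void mtsSchedule(std::vector<Atom>& atoms, long nSteps, int k, double cutoff) {
  for (long step = 0; step < nSteps; ++step) {
    cutoffForces(atoms, cutoff);               // every step
    if (step % k == 0) longRangePME(atoms);    // only 1 step in k
    // ... integrate velocities and positions as in the previous sketch ...
  }
}
```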

  17. Early methods • Atom replication: • Each processor has data for all atoms • Force calculations are parallelized • Collection of forces: O(N log P) communication • Computation: O(N/P) • Communication-to-computation ratio: O(P log P): not scalable • Atom decomposition: • Partition the atom array across processors • Nearby atoms may not be on the same processor • Communication: O(N) per processor • Ratio: O(N) / (N/P) = O(P): not scalable

  18. Force Decomposition • Distribute the force matrix to processors • Matrix is sparse and non-uniform • Each processor has one block • Communication: O(N/√P) per processor • Ratio: O(N/√P) / (N/P) = O(√P) • Better scalability in practice • Can use 100+ processors • Plimpton • Hwang, Saltz, et al: • 6% on 32 processors • 36% on 128 processors • Yet not scalable in the sense defined here!

  19. Spatial Decomposition • Allocate close-by atoms to the same processor • Three variations possible: • Partitioning into P boxes, 1 per processor • Good scalability, but hard to implement • Partitioning into fixed size boxes, each a little larger than the cut-off distance • Partitioning into smaller boxes • Communication: O(N/P) • Communication/Computation ratio: O(1) • So, scalable in principle
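
A sketch of the second variation above: atoms are binned into boxes (patches) whose side is a little larger than the cut-off distance, so every pair within the cut-off lies in the same box or in one of its 26 neighbors. The types and the 1.05 padding factor are illustrative assumptions.

```cpp
#include <array>
#include <cmath>
#include <map>
#include <vector>

struct Vec3 { double x = 0, y = 0, z = 0; };
using PatchIndex = std::array<int, 3>;            // (ix, iy, iz) box coordinates

// Bin atom positions into boxes a little larger than the cut-off distance.
std::map<PatchIndex, std::vector<int>>
makePatches(const std::vector<Vec3>& pos, Vec3 origin, double cutoff) {
  const double side = 1.05 * cutoff;              // "a little larger than the cut-off"
  std::map<PatchIndex, std::vector<int>> patches;
  for (int i = 0; i < (int)pos.size(); ++i) {
    PatchIndex idx = {
      (int)std::floor((pos[i].x - origin.x) / side),
      (int)std::floor((pos[i].y - origin.y) / side),
      (int)std::floor((pos[i].z - origin.z) / side)
    };
    patches[idx].push_back(i);                    // atom i lives in this patch
  }
  // Any pair within the cut-off now lies in the same patch or a neighboring one,
  // so non-bonded work and communication are O(N/P) per processor.
  return patches;
}
```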

  20. Ongoing work • Plimpton, Hendrickson: • new spatial decomposition • NWChem (PNL) • Peter Kollman, Yong Duan et al: • microsecond simulation • AMBER version (SANDER)

  21. Spatial Decomposition in NAMD • But the load balancing problems are still severe

  22. Hybrid Decomposition

  23. FD + SD • Now we have many more objects to load balance: • Each diamond can be assigned to any processor • Number of diamonds (3D): 14 × number of patches (see the sketch below)
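
Where the factor of 14 comes from: each patch gets one self-compute plus one pair-compute with each of its 26 neighbors, and each pair is shared between two patches, giving 1 + 26/2 = 14 compute objects per patch. A small sketch that enumerates them, with illustrative types:

```cpp
#include <array>
#include <vector>

using PatchIndex = std::array<int, 3>;

struct ComputePair { PatchIndex a, b; };          // one force "diamond"

// One self-interaction plus the 13 "upper" neighbors of the 26; the other
// 13 pairs are owned by the neighboring patches, so nothing is counted twice.
std::vector<ComputePair> computesFor(const PatchIndex& p) {
  std::vector<ComputePair> result;
  result.push_back(ComputePair{p, p});            // self-compute
  for (int dx = -1; dx <= 1; ++dx)
    for (int dy = -1; dy <= 1; ++dy)
      for (int dz = -1; dz <= 1; ++dz) {
        if (dx == 0 && dy == 0 && dz == 0) continue;
        if (dx < 0 || (dx == 0 && (dy < 0 || (dy == 0 && dz < 0)))) continue; // lower half
        result.push_back(ComputePair{p, PatchIndex{p[0] + dx, p[1] + dy, p[2] + dz}});
      }
  return result;                                  // always 14 entries
}
```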

  24. Bond Forces • Multiple types of forces: • Bonds (2 atoms), angles (3), dihedrals (4), ... • Luckily, each involves atoms in neighboring patches only • Straightforward implementation: • Send a message to all neighbors, • receive forces from them • 26 × 2 messages per patch!

  25. Bond Forces • Assume one patch per processor: • An angle force involving atoms in patches (x1,y1,z1), (x2,y2,z2), (x3,y3,z3) is calculated in the patch (max{xi}, max{yi}, max{zi}) [figure: atoms A, B, C in neighboring patches]
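
The ownership rule above, written out: the patch whose index is the coordinate-wise maximum over the patches holding the atoms computes the bonded term, so each term is computed exactly once. A sketch with illustrative types:

```cpp
#include <algorithm>
#include <array>
#include <initializer_list>

using PatchIndex = std::array<int, 3>;

// The patch that owns a bonded term (bond, angle, dihedral) is the
// coordinate-wise maximum over the patches of the atoms involved.
PatchIndex bondedOwner(std::initializer_list<PatchIndex> patches) {
  PatchIndex owner = *patches.begin();
  for (const PatchIndex& p : patches)
    for (int d = 0; d < 3; ++d)
      owner[d] = std::max(owner[d], p[d]);
  return owner;
}

// Example: an angle force over atoms in three patches.
// PatchIndex o = bondedOwner({{1,2,0}, {2,2,1}, {1,3,1}});   // -> {2,3,1}
```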

  26. NAMD Implementation • Multiple objects per processor • Different types: patches, pairwise forces, bonded forces • Each may have its data ready at different times • Need ability to map and remap them • Need prioritized scheduling • Charm++ supports all of these

  27. Load Balancing • A major challenge for this application • Especially for a large number of processors • Unpredictable workloads: • Each diamond (force “compute” object) and patch encapsulates a variable amount of work • Static estimates are inaccurate • Very slow variation across timesteps • So we use the measurement-based load balancing framework [figure: patches connected by a compute object]

  28. Load Balancing Strategy
  Greedy variant (simplified):
    Sort compute objects (diamonds) by load
    Repeat (until all are assigned):
      S = set of all processors that are not overloaded and generate the least new communication
      P = least loaded processor in S
      Assign the heaviest remaining compute to P
  Refinement:
    Repeat:
      Pick a compute from the most overloaded PE
      Assign it to a suitable underloaded PE
    Until (no movement)
  (a C++ sketch of the greedy pass follows below)
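
A compact sketch of the greedy pass described above. The overload threshold is an illustrative parameter, and the communication-cost term that the real strategies also weigh is omitted here for brevity.

```cpp
#include <algorithm>
#include <vector>

struct Compute { double load; int id; };
struct Proc    { double load = 0.0; std::vector<int> computes; };

// Greedy pass: heaviest compute first, placed on the least-loaded processor
// among those that would not become overloaded.
void greedyAssign(std::vector<Compute> computes, std::vector<Proc>& procs,
                  double overloadLimit) {
  std::sort(computes.begin(), computes.end(),
            [](const Compute& a, const Compute& b) { return a.load > b.load; });
  for (const Compute& c : computes) {
    Proc* best = nullptr;
    for (Proc& p : procs) {
      if (p.load + c.load > overloadLimit) continue;   // skip overloaded targets
      if (!best || p.load < best->load) best = &p;     // least loaded eligible processor
    }
    if (!best)                                         // fall back: least loaded overall
      best = &*std::min_element(procs.begin(), procs.end(),
               [](const Proc& a, const Proc& b) { return a.load < b.load; });
    best->load += c.load;
    best->computes.push_back(c.id);
  }
  // The refinement pass would then repeatedly move one compute from the most
  // overloaded processor to a suitable underloaded one until nothing moves.
}
```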

  29. Speedups in 1998 ApoA-I: 92k atoms

  30. Optimizations • Series of optimizations • Examples discussed here: • Grainsize distributions (bimodal) • Integration: message sending overheads • Several other optimizations • Separation of bond/angle/dihedral objects • Inter-patch and intra-patch • Prioritization • Local synchronization to avoid interference across steps

  31. Grainsize and Amdahl’s Law • A variant of Amdahl’s law, for objects, would be: • The fastest time can be no shorter than the time for the biggest single object! • How did it apply to us? • Sequential step time was 57 seconds • To run on 2k processors, no object should take more than 28 ms (57 s / 2048) • It should be even shorter • Grainsize analysis via Projections showed that this was not so...

  32. Grainsize Analysis [figure: grainsize distribution, showing the problem] • Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms (see the sketch below)
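
A sketch of that fix: estimate each compute object's work from the number of interacting atom pairs and split any object whose estimate exceeds the target grainsize. The pair-count estimate and the splitting scheme are illustrative assumptions, not NAMD's exact heuristic.

```cpp
#include <vector>

struct ComputeObj {
  int atomsA, atomsB;                             // atoms in the two interacting patches
  long estimatedPairs() const {                   // estimated work ~ interacting pairs
    return (long)atomsA * atomsB;
  }
};

// Split any compute whose estimated work exceeds the grainsize target, so no
// single object dominates the critical path ("Amdahl's law for objects").
std::vector<ComputeObj> splitLarge(const std::vector<ComputeObj>& computes,
                                   long maxPairs) {
  std::vector<ComputeObj> result;
  for (const ComputeObj& c : computes) {
    if (c.estimatedPairs() <= maxPairs) { result.push_back(c); continue; }
    int parts = (int)((c.estimatedPairs() + maxPairs - 1) / maxPairs);
    for (int k = 0; k < parts; ++k) {             // split along one patch's atom list
      ComputeObj piece = c;
      piece.atomsA = c.atomsA / parts + (k < c.atomsA % parts ? 1 : 0);
      result.push_back(piece);
    }
  }
  return result;
}
```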

  33. Grainsize Reduced

  34. Performance Audit • Through the optimization process, an audit was kept to decide where to look to improve performance

      Component     Ideal    Actual
      Total         57.04    86
      nonBonded     52.44    49.77
      Bonds          3.16     3.9
      Integration    1.44     3.05
      Overhead       0        7.97
      Imbalance      0       10.45
      Idle           0        9.25
      Receives       0        1.61

  • Note: integration time doubled

  35. Integration Overhead Analysis [Projections timeline highlighting the integration phase] • Problem: integration time had doubled relative to the sequential run

  36. Integration Overhead Example • The Projections pictures showed the overhead was associated with sending messages • Many cells were sending 30-40 messages • Even so, the overhead was too large compared with the expected cost of the messages • Code analysis: memory allocations! • An identical message was being sent to 30+ processors • Simple multicast support was added to Charm++ • It mainly eliminates memory allocations (and some copying); see the schematic sketch below
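
The before/after pattern, schematically: the original code allocated and copied the same coordinate message once per destination, while with multicast support the message is built once and the runtime handles the fan-out. PatchMsg, sendTo, and multicastSend are hypothetical stand-ins, not the actual Charm++ calls.

```cpp
#include <vector>

struct PatchMsg { std::vector<double> coords; };   // hypothetical coordinate message

// Hypothetical transport calls standing in for the runtime's send primitives.
void sendTo(int /*pe*/, const PatchMsg&) { /* hand off to the runtime */ }
void multicastSend(const std::vector<int>& /*pes*/, const PatchMsg&) { /* runtime fan-out */ }

// Before: one allocation and one copy of an identical message per destination.
void sendCoordsPerPE(const std::vector<int>& neighbors,
                     const std::vector<double>& coords) {
  for (int pe : neighbors) {
    PatchMsg* m = new PatchMsg{coords};            // 30+ allocations and copies per step
    sendTo(pe, *m);
    delete m;
  }
}

// After: build the message once; the multicast layer does the fan-out,
// eliminating most of the allocation and copying overhead.
void sendCoordsMulticast(const std::vector<int>& neighbors,
                         const std::vector<double>& coords) {
  PatchMsg m{coords};                              // single allocation
  multicastSend(neighbors, m);
}
```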

  37. Integration Overhead: After Multicast

  38. ApoA-I on ASCI Red 57 ms/step

  39. ApoA-I on Origin 2000

  40. ApoA-I on Linux Cluster

  41. ApoA-I on T3E

  42. BC1 complex: 200k atoms

  43. BC1 on ASCI Red 58.4 GFlops

  44. Lessons Learned • Need to downsize objects! • Choose the smallest grainsize that still amortizes the overhead • One of the biggest challenges was getting time for performance-tuning runs on parallel machines

  45. ApoA-I with PME on T3E

  46. Future and Planned Work • Increased speedups on 2k-10k processors • Smaller grainsizes • Parallelizing integration further • New algorithms for reducing communication impact • New load balancing strategies • Further performance improvements for PME • With multiple timestepping • Needs multi-phase load balancing • Speedup on small molecules! • Interactive molecular dynamics

  47. More Information • Charm++ and associated framework: • http://charm.cs.uiuc.edu • NAMD and associated biophysics tools: • http://www.ks.uiuc.edu • Both include downloadable software

  48. Parallel Programming Laboratory • Funding: • Dept. of Energy (via the Rocket Center) • National Science Foundation • National Institutes of Health • Group members and affiliated members (NIH/Biophysics): Jim Phillips, Kirby Vandivoort, Joshua Unger, Gengbin Zheng, Jay Desouza, Sameer Kumar, Chee Wai Lee, Milind Bhandarkar, Terry Wilmarth, Orion Lawlor, Neelam Saboo, Arun Singla, Karthikeyan Mahesh

  49. The Parallel Programming Problem • Is there one? • We can all write MPI programs, right? • Several Large Machines in use • But: • New complex apps with dynamic and irregular structure • Should all application scientists also be experts in parallel computing?
