
Accelerating Molecular Dynamics on a GPU



Presentation Transcript


  1. Accelerating Molecular Dynamics on a GPU John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign http://www.ks.uiuc.edu/Research/gpu/ Careers in High-Performance Systems (CHiPS) Workshop National Center for Supercomputing Applications, July 25, 2009

  2. Computational Biology’s Insatiable Demand for Processing Power • Simulations still fall short of biological timescales • Large simulations extremely difficult to prepare, analyze • Order of magnitude increase in performance would allow use of more sophisticated models Satellite Tobacco Mosaic Virus (STMV)

  3. Programmable Graphics Hardware. Groundbreaking research systems: AT&T Pixel Machine (1989): 82 x DSP32 processors; UNC PixelFlow (1992-98): 64 x (PA-8000 + 8,192 bit-serial SIMD); SGI RealityEngine (1990s): up to 12 i860-XP processors perform vertex operations (ucode), fixed-function fragment hardware. All mainstream GPUs now incorporate fully programmable processors. [Images: UNC PixelFlow rack; SGI RealityEngine i860 vertex processors.]

  4. GLSL Sphere Fragment Shader • Written in OpenGL Shading Language • High-level C-like language with vector types and operations • Compiled dynamically by the graphics driver at runtime • Compiled machine code executes on GPU

  5. GPU Computing • Commodity devices, omnipresent in modern computers (over a million sold per week) • Massively parallel hardware, hundreds of processing units, throughput oriented architecture • Standard integer and floating point types supported • Programming tools allow software to be written in dialects of familiar C/C++ and integrated into legacy software • GPU algorithms are often multicore friendly due to attention paid to data locality and data-parallel work decomposition

  6. What Speedups Can GPUs Achieve? • Single-GPU speedups of 10x to 30x vs. one CPU core are common • Best speedups can reach 100x or more, attained on codes dominated by floating point arithmetic, especially native GPU machine instructions, e.g. expf(), rsqrtf(), … • Amdahl’s Law can prevent legacy codes from achieving peak speedups with shallow GPU acceleration efforts
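A quick worked example of that Amdahl's Law limit (figures chosen for illustration, not taken from the talk): if 90% of a program's runtime is offloaded and that portion runs 20x faster, the overall speedup is 1 / (0.1 + 0.9/20) = 1 / 0.145 ≈ 6.9x, far short of the 20x kernel speedup, because the unaccelerated 10% quickly dominates.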

  7. GPU Peak Single-Precision Performance: Exponential Trend

  8. GT200 GPU Peak Memory Bandwidth: Linear Trend

  9. Comparison of CPU and GPU Hardware Architecture. CPU: cache heavy, focused on individual thread performance. GPU: ALU heavy, massively parallel, throughput oriented.

  10. NVIDIA GT200 Streaming Processor Array. [Block diagram: the streaming processor array is built from Texture Processor Clusters (TPCs); each TPC contains several Streaming Multiprocessors (SMs), and each SM holds Streaming Processors (SPs), Special Function Units (SFUs), and a Texture Unit. The grid of thread blocks maps onto the whole array, multiple thread blocks and many warps of threads map onto each SM, and individual threads execute on the SPs.]

  11. GPU Memory Accessible in CUDA • Mapped host memory: up to 4GB, ~5.7GB/sec bandwidth (PCIe), accessible by multiple GPUs • Global memory: up to 4GB, high latency (~600 clock cycles), 140GB/sec bandwidth, accessible by all threads, atomic operations (slow) • Texture memory: read-only, cached, and interpolated/filtered access to global memory • Constant memory: 64KB, read-only, cached, fast/low-latency if data elements are accessed in unison by peer threads • Shared memory: 16KB, low-latency, accessible among threads in the same block, fast if accessed without bank conflicts
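A minimal CUDA sketch of how these memory spaces appear in source code (illustrative only; the names are invented, and the texture reference style matches the CUDA-era code shown later in this talk):

__constant__ float coeffs[64];        // constant memory: 64KB total, cached, fast when peer
                                      // threads read the same element
texture<float4, 1> atom_tex;          // texture memory: read-only, cached view of global memory
__device__ float results[1 << 20];    // global memory: large, high latency, visible to all threads

__global__ void memory_spaces_demo(const float *mapped_host) {
  __shared__ float tile[256];         // shared memory: 16KB per multiprocessor, low latency,
                                      // visible to the threads of one block
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  tile[threadIdx.x] = mapped_host[i]; // mapped host memory: read across PCIe
  __syncthreads();
  results[i] = tile[threadIdx.x] * coeffs[0];
}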

  12. An Approach to Writing CUDA Kernels • Find an algorithm that exposes substantial parallelism, thousands of independent threads… • Identify appropriate GPU memory subsystems for storage of data used by kernel • Are there trade-offs that can be made to exchange computation for more parallelism? • “Brute force” methods that expose significant parallelism do surprisingly well on current GPUs • Analyze the real-world use case for the problem and optimize the kernel for the problem size/characteristics that will be heavily used
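As a minimal sketch of the first point, the usual CUDA pattern assigns one thread per output element so that many thousands of independent threads are exposed to the hardware (a generic example, not one of the kernels discussed in this talk):

// "Brute force" but GPU-friendly decomposition: one thread per output element.
__global__ void saxpy_kernel(float a, const float *x, const float *y, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x; // unique global thread index
  if (i < n)                                     // guard the final partial block
    out[i] = a * x[i] + y[i];
}

// Host-side launch: enough 256-thread blocks to cover all n elements, e.g.
// saxpy_kernel<<<(n + 255) / 256, 256>>>(2.0f, d_x, d_y, d_out, n);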

  13. NAMD Parallel Molecular Dynamics. Kale et al., J. Comp. Phys. 151:283-312, 1999. • Designed from the beginning as a parallel program • Uses the Charm++ philosophy: decompose computation into a large number of objects; an intelligent run-time system (Charm++) assigns objects to processors for dynamic load balancing with minimal communication • Hybrid of spatial and force decomposition: spatial decomposition of atoms into cubes (called patches); for every pair of interacting patches, create one object for calculating electrostatic interactions • Recent: Blue Matter, Desmond, etc. use this idea in some form

  14. NAMD Overlapping Execution. Phillips et al., SC2002. [Diagram: example configuration (847 objects, 100,000 atoms); work offloaded to the GPU.] Objects are assigned to processors and queued as data arrives.

  15. Non-bonded Interactions • Calculate forces for pairs of atoms within the cutoff distance. [Diagram: atom[i] and atom[j] inside the cutoff radius; rij is the distance between atom[i] and atom[j].]
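For orientation, a plain serial sketch of the cutoff test described above (an illustrative reference loop, not NAMD code; the actual GPU kernel appears on slide 17):

#include <vector_types.h>  // float4, from the CUDA toolkit headers

// Count atom pairs whose separation is within the cutoff; comparing squared
// distances avoids a sqrt per pair.
int count_pairs_within_cutoff(const float4 *atom, int natoms, float cutoff) {
  float cutoff2 = cutoff * cutoff;
  int npairs = 0;
  for (int i = 0; i < natoms; ++i) {
    for (int j = i + 1; j < natoms; ++j) {
      float dx = atom[j].x - atom[i].x;
      float dy = atom[j].y - atom[i].y;
      float dz = atom[j].z - atom[i].z;
      if (dx*dx + dy*dy + dz*dz < cutoff2)
        ++npairs;  // a real code would evaluate the pair force here
    }
  }
  return npairs;
}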

  16. Nonbonded Forces on G80 GPU • Start with most expensive calculation: direct nonbonded interactions. • Decompose work into pairs of patches, identical to NAMD structure. • GPU hardware assigns patch-pairs to multiprocessors dynamically. [Diagram: force computation on a single multiprocessor (GeForce 8800 GTX has 16). A 32-way SIMD multiprocessor runs 32-256 multiplexed threads; 16kB of shared memory holds patch A coordinates & parameters; 32kB of registers hold patch B coords, params, & forces; the texture unit performs force table interpolation through an 8kB cache; constants and exclusions are read through a 64kB cache; 768 MB main memory, no cache, 300+ cycle latency.] Stone et al., J. Comp. Chem. 28:2618-2640, 2007.

  17. Nonbonded Forces CUDA Code. Stone et al., J. Comp. Chem. 28:2618-2640, 2007. (The slide's callouts marking the force interpolation, exclusions, parameters, and accumulation regions are reproduced here as comments.)

texture<float4> force_table;
__constant__ unsigned int exclusions[];
__shared__ atom jatom[];
atom iatom;    // per-thread atom, stored in registers
float4 iforce; // per-thread force, stored in registers

for ( int j = 0; j < jatom_count; ++j ) {
  float dx = jatom[j].x - iatom.x;
  float dy = jatom[j].y - iatom.y;
  float dz = jatom[j].z - iatom.z;
  float r2 = dx*dx + dy*dy + dz*dz;
  if ( r2 < cutoff2 ) {
    // Force interpolation: table lookup indexed by 1/sqrt(r2)
    float4 ft = texfetch(force_table, 1.f/sqrt(r2));

    // Exclusions: test the per-atom exclusion bitmap
    bool excluded = false;
    int indexdiff = iatom.index - jatom[j].index;
    if ( abs(indexdiff) <= (int) jatom[j].excl_maxdiff ) {
      indexdiff += jatom[j].excl_index;
      excluded = ((exclusions[indexdiff>>5] & (1<<(indexdiff&31))) != 0);
    }

    // Parameters: Lennard-Jones terms from per-atom sigma and epsilon
    float f = iatom.half_sigma + jatom[j].half_sigma; // sigma
    f *= f*f;                 // sigma^3
    f *= f;                   // sigma^6
    f *= ( f * ft.x + ft.y ); // sigma^12 * ft.x - sigma^6 * ft.y
    f *= iatom.sqrt_epsilon * jatom[j].sqrt_epsilon;
    float qq = iatom.charge * jatom[j].charge;
    if ( excluded ) { f = qq * ft.w; } // PME correction
    else { f += qq * ft.z; }           // Coulomb

    // Accumulation: add this pair's contribution to the per-thread force
    iforce.x += dx * f;
    iforce.y += dy * f;
    iforce.z += dz * f;
    iforce.w += 1.f; // interaction count or energy
  }
}

  18. NAMD Performance on NCSA GPU Cluster, April 2008 • STMV virus (1M atoms) • 60 GPUs match performance of 330 CPU cores • 5.5-7x overall application speedup w/ G80-based GPUs • Overlap with CPU • Off-node results done first • Plans for better performance: tune or port remaining work, balance GPU load. [Chart: STMV performance, 2.4 GHz Opteron + Quadro FX 5600; values 25.7, 13.8, 7.8, with an arrow marking the 'faster' direction.]

  19. NAMD Performance on GT200 GPU Cluster, August 2008 • 8 GT200s, 240 SPs @ 1.3GHz: • 72x faster than a single CPU core • 9x overall application speedup vs. 8 CPU cores • 32% faster overall than 8 nodes of G80 cluster • GT200 CUDA kernel is 54% faster • ~8% variation in GPU load • Cost of double-precision for force accumulation is minimal: only 8% slower than single-precision

  20. VMD – “Visual Molecular Dynamics” • Visualization and analysis of molecular dynamics simulations, sequence data, volumetric data, quantum chemistry simulations, particle systems, … • User extensible with scripting and plugins • http://www.ks.uiuc.edu/Research/vmd/

  21. GPU Acceleration in VMD Electrostatic field calculation, ion placement: factor of 20x to 44x faster Molecular orbital calculation and display: factor of 120x faster Imaging of gas migration pathways in proteins with implicit ligand sampling: factor of 20x to 30x faster

  22. Electrostatic Potential Maps • Electrostatic potentials evaluated on a 3-D lattice • Applications include: ion placement for structure building, time-averaged potentials for simulation, visualization and analysis. [Image: Isoleucine tRNA synthetase.]

  23. Direct Coulomb Summation • Each lattice point accumulates the electrostatic potential contribution from all atoms: potential[j] += charge[i] / rij, where rij is the distance from lattice[j] to atom[i]. [Diagram: lattice point j being evaluated, atom[i].]
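A simplified CUDA sketch of this summation, with one thread per lattice point (the parameter names and the float4 atom layout of x, y, z, charge are assumptions; this is not the optimized VMD kernel):

// Each thread evaluates the potential at one lattice point by summing over all atoms.
__global__ void direct_coulomb(const float4 *atoms, int natoms, float *potential,
                               float3 origin, float spacing, int nx, int ny, int nz) {
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j >= nx * ny * nz) return;

  // Convert the flat lattice index into x/y/z grid coordinates.
  int ix = j % nx, iy = (j / nx) % ny, iz = j / (nx * ny);
  float x = origin.x + ix * spacing;
  float y = origin.y + iy * spacing;
  float z = origin.z + iz * spacing;

  float pot = 0.0f;
  for (int i = 0; i < natoms; ++i) {
    float dx = x - atoms[i].x;
    float dy = y - atoms[i].y;
    float dz = z - atoms[i].z;
    // potential[j] += charge[i] / rij, using the fast reciprocal square root
    pot += atoms[i].w * rsqrtf(dx*dx + dy*dy + dz*dz);
  }
  potential[j] = pot;
}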

  24. Photobiology of Vision and Photosynthesis. Investigations of the chromatophore, a photosynthetic organelle. [Image: partial chromatophore model, ~10M atoms, with incident light.] Electrostatics needed to build the full structural model, place ions, and study macroscopic properties. Electrostatic field of the chromatophore model from the multilevel summation method: computed with 3 GPUs (G80) in ~90 seconds, 46x faster than a single CPU core. The full chromatophore model will permit structural, chemical, and kinetic investigations at a structural systems biology level.

  25. Lessons Learned • GPU algorithms need fine-grained parallelism and sufficient work to fully utilize the hardware • Much of per-thread GPU algorithm optimization revolves around efficient use of multiple memory systems and latency hiding • Concurrency can often be traded for per-thread performance, in combination with increased use of registers or shared memory • Fine-grained GPU work decompositions often compose well with the comparatively coarse-grained decompositions used for multicore or distributed memory programming

  26. Lessons Learned (2) • The host CPU can potentially be used to “regularize” the computation for the GPU, yielding better overall performance • Overlapping CPU work with the GPU can hide some communication and unaccelerated computation • Targeted use of double-precision floating point arithmetic or compensated summation can reduce the effects of floating point truncation at low cost to performance
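As one concrete form of compensated summation (a generic Kahan-style sketch, not the accumulation code NAMD actually uses):

// Kahan (compensated) summation: a running correction term preserves the low-order
// bits that plain single-precision accumulation would truncate. Beware of compiler
// options that permit floating-point reassociation (fast-math style flags), which
// can optimize the compensation away.
__device__ void kahan_add(float value, float &sum, float &comp) {
  float y = value - comp;  // apply the correction carried from previous additions
  float t = sum + y;       // low-order bits of y may be lost in this add
  comp = (t - sum) - y;    // recover exactly what was lost
  sum = t;                 // updated running sum
}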

  27. Acknowledgements Theoretical and Computational Biophysics Group, University of Illinois at Urbana-Champaign Wen-mei Hwu and the IMPACT group at University of Illinois at Urbana-Champaign NVIDIA Center of Excellence, University of Illinois at Urbana-Champaign NCSA Innovative Systems Lab David Kirk and the CUDA team at NVIDIA NIH support: P41-RR05969
