
Accelerating Molecular Dynamics on a GPU



Presentation Transcript


  1. Accelerating Molecular Dynamics on a GPU John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign http://www.ks.uiuc.edu/Research/gpu/ Careers in High-Performance Systems (CHiPS) Workshop National Center for Supercomputing Applications, July 25, 2009

  2. Computational Biology’s Insatiable Demand for Processing Power • Simulations still fall short of biological timescales • Large simulations extremely difficult to prepare, analyze • Order of magnitude increase in performance would allow use of more sophisticated models Satellite Tobacco Mosaic Virus (STMV)

  3. Programmable Graphics Hardware. Groundbreaking research systems: AT&T Pixel Machine (1989): 82 x DSP32 processors; UNC PixelFlow (1992-98): 64 x (PA-8000 + 8,192 bit-serial SIMD); SGI RealityEngine (1990s): up to 12 i860-XP processors perform vertex operations (ucode), fixed-function fragment hardware. All mainstream GPUs now incorporate fully programmable processors. [Images: UNC PixelFlow rack; SGI RealityEngine i860 vertex processors.]

  4. GLSL Sphere Fragment Shader • Written in OpenGL Shading Language • High-level C-like language with vector types and operations • Compiled dynamically by the graphics driver at runtime • Compiled machine code executes on GPU

  5. GPU Computing • Commodity devices, omnipresent in modern computers (over a million sold per week) • Massively parallel hardware, hundreds of processing units, throughput oriented architecture • Standard integer and floating point types supported • Programming tools allow software to be written in dialects of familiar C/C++ and integrated into legacy software • GPU algorithms are often multicore friendly due to attention paid to data locality and data-parallel work decomposition

  6. What Speedups Can GPUs Achieve? • Single-GPU speedups of 10x to 30x vs. one CPU core are common • Best speedups can reach 100x or more, attained on codes dominated by floating point arithmetic, especially native GPU machine instructions, e.g. expf(), rsqrtf(), … • Amdahl’s Law can prevent legacy codes from achieving peak speedups with shallow GPU acceleration efforts
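A quick worked example of that Amdahl's Law limit (figures chosen for illustration, not taken from the talk): if 90% of a program's runtime is offloaded and that portion runs 20x faster, the overall speedup is 1 / (0.1 + 0.9/20) = 1 / 0.145 ≈ 6.9x, far short of the 20x kernel speedup, because the unaccelerated 10% quickly dominates.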

  7. GPU Peak Single-Precision Performance: Exponential Trend

  8. GT200 GPU Peak Memory Bandwidth: Linear Trend

  9. Comparison of CPU and GPU Hardware Architecture. CPU: cache heavy, focused on individual thread performance. GPU: ALU heavy, massively parallel, throughput oriented.

  10. NVIDIA GT200 Streaming Processor Array. [Block diagram: the streaming processor array is built from Texture Processor Clusters (TPCs); each TPC contains several Streaming Multiprocessors (SMs), and each SM holds Streaming Processors (SPs), Special Function Units (SFUs), and a Texture Unit. The grid of thread blocks maps onto the whole array, multiple thread blocks and many warps of threads map onto each SM, and individual threads execute on the SPs.]

  11. GPU Memory Accessible in CUDA • Mapped host memory: up to 4GB, ~5.7GB/sec bandwidth (PCIe), accessible by multiple GPUs • Global memory: up to 4GB, high latency (~600 clock cycles), 140GB/sec bandwidth, accessible by all threads, atomic operations (slow) • Texture memory: read-only, cached, and interpolated/filtered access to global memory • Constant memory: 64KB, read-only, cached, fast/low-latency if data elements are accessed in unison by peer threads • Shared memory: 16KB, low-latency, accessible among threads in the same block, fast if accessed without bank conflicts
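A minimal CUDA sketch of how these memory spaces appear in source code (illustrative only; the names are invented, and the texture reference style matches the CUDA-era code shown later in this talk):

__constant__ float coeffs[64];        // constant memory: 64KB total, cached, fast when peer
                                      // threads read the same element
texture<float4, 1> atom_tex;          // texture memory: read-only, cached view of global memory
__device__ float results[1 << 20];    // global memory: large, high latency, visible to all threads

__global__ void memory_spaces_demo(const float *mapped_host) {
  __shared__ float tile[256];         // shared memory: 16KB per multiprocessor, low latency,
                                      // visible to the threads of one block
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  tile[threadIdx.x] = mapped_host[i]; // mapped host memory: read across PCIe
  __syncthreads();
  results[i] = tile[threadIdx.x] * coeffs[0];
}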

  12. An Approach to Writing CUDA Kernels • Find an algorithm that exposes substantial parallelism, thousands of independent threads… • Identify appropriate GPU memory subsystems for storage of data used by kernel • Are there trade-offs that can be made to exchange computation for more parallelism? • “Brute force” methods that expose significant parallelism do surprisingly well on current GPUs • Analyze the real-world use case for the problem and optimize the kernel for the problem size/characteristics that will be heavily used
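As a minimal sketch of the first point, the usual CUDA pattern assigns one thread per output element so that many thousands of independent threads are exposed to the hardware (a generic example, not one of the kernels discussed in this talk):

// "Brute force" but GPU-friendly decomposition: one thread per output element.
__global__ void saxpy_kernel(float a, const float *x, const float *y, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x; // unique global thread index
  if (i < n)                                     // guard the final partial block
    out[i] = a * x[i] + y[i];
}

// Host-side launch: enough 256-thread blocks to cover all n elements, e.g.
// saxpy_kernel<<<(n + 255) / 256, 256>>>(2.0f, d_x, d_y, d_out, n);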

  13. NAMD Parallel Molecular Dynamics. Kale et al., J. Comp. Phys. 151:283-312, 1999. • Designed from the beginning as a parallel program • Uses the Charm++ philosophy: decompose computation into a large number of objects; an intelligent run-time system (Charm++) assigns objects to processors for dynamic load balancing with minimal communication • Hybrid of spatial and force decomposition: spatial decomposition of atoms into cubes (called patches); for every pair of interacting patches, create one object for calculating electrostatic interactions • Recent: Blue Matter, Desmond, etc. use this idea in some form

  14. NAMD Overlapping Execution. Phillips et al., SC2002. [Diagram: example configuration (847 objects, 100,000 atoms); work offloaded to the GPU.] Objects are assigned to processors and queued as data arrives.

  15. Non-bonded Interactions • Calculate forces for pairs of atoms within the cutoff distance. [Diagram: atom[i] and atom[j] inside the cutoff radius; rij is the distance between atom[i] and atom[j].]
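For orientation, a plain serial sketch of the cutoff test described above (an illustrative reference loop, not NAMD code; the actual GPU kernel appears on slide 17):

#include <vector_types.h>  // float4, from the CUDA toolkit headers

// Count atom pairs whose separation is within the cutoff; comparing squared
// distances avoids a sqrt per pair.
int count_pairs_within_cutoff(const float4 *atom, int natoms, float cutoff) {
  float cutoff2 = cutoff * cutoff;
  int npairs = 0;
  for (int i = 0; i < natoms; ++i) {
    for (int j = i + 1; j < natoms; ++j) {
      float dx = atom[j].x - atom[i].x;
      float dy = atom[j].y - atom[i].y;
      float dz = atom[j].z - atom[i].z;
      if (dx*dx + dy*dy + dz*dz < cutoff2)
        ++npairs;  // a real code would evaluate the pair force here
    }
  }
  return npairs;
}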

  16. Nonbonded Forces on G80 GPU • Start with most expensive calculation: direct nonbonded interactions. • Decompose work into pairs of patches, identical to NAMD structure. • GPU hardware assigns patch-pairs to multiprocessors dynamically. [Diagram: force computation on a single multiprocessor (GeForce 8800 GTX has 16). A 32-way SIMD multiprocessor runs 32-256 multiplexed threads; 16kB of shared memory holds patch A coordinates & parameters; 32kB of registers hold patch B coords, params, & forces; the texture unit performs force table interpolation through an 8kB cache; constants and exclusions are read through a 64kB cache; 768 MB main memory, no cache, 300+ cycle latency.] Stone et al., J. Comp. Chem. 28:2618-2640, 2007.

  17. Nonbonded Forces CUDA Code. Stone et al., J. Comp. Chem. 28:2618-2640, 2007. (The slide's callouts marking the force interpolation, exclusions, parameters, and accumulation regions are reproduced here as comments.)

texture<float4> force_table;
__constant__ unsigned int exclusions[];
__shared__ atom jatom[];
atom iatom;    // per-thread atom, stored in registers
float4 iforce; // per-thread force, stored in registers

for ( int j = 0; j < jatom_count; ++j ) {
  float dx = jatom[j].x - iatom.x;
  float dy = jatom[j].y - iatom.y;
  float dz = jatom[j].z - iatom.z;
  float r2 = dx*dx + dy*dy + dz*dz;
  if ( r2 < cutoff2 ) {
    // Force interpolation: table lookup indexed by 1/sqrt(r2)
    float4 ft = texfetch(force_table, 1.f/sqrt(r2));

    // Exclusions: test the per-atom exclusion bitmap
    bool excluded = false;
    int indexdiff = iatom.index - jatom[j].index;
    if ( abs(indexdiff) <= (int) jatom[j].excl_maxdiff ) {
      indexdiff += jatom[j].excl_index;
      excluded = ((exclusions[indexdiff>>5] & (1<<(indexdiff&31))) != 0);
    }

    // Parameters: Lennard-Jones terms from per-atom sigma and epsilon
    float f = iatom.half_sigma + jatom[j].half_sigma; // sigma
    f *= f*f;                 // sigma^3
    f *= f;                   // sigma^6
    f *= ( f * ft.x + ft.y ); // sigma^12 * ft.x - sigma^6 * ft.y
    f *= iatom.sqrt_epsilon * jatom[j].sqrt_epsilon;
    float qq = iatom.charge * jatom[j].charge;
    if ( excluded ) { f = qq * ft.w; } // PME correction
    else { f += qq * ft.z; }           // Coulomb

    // Accumulation: add this pair's contribution to the per-thread force
    iforce.x += dx * f;
    iforce.y += dy * f;
    iforce.z += dz * f;
    iforce.w += 1.f; // interaction count or energy
  }
}

  18. NAMD Performance on NCSA GPU Cluster, April 2008 • STMV virus (1M atoms) • 60 GPUs match performance of 330 CPU cores • 5.5-7x overall application speedup w/ G80-based GPUs • Overlap with CPU • Off-node results done first • Plans for better performance: tune or port remaining work, balance GPU load. [Chart: STMV performance, 2.4 GHz Opteron + Quadro FX 5600; values 25.7, 13.8, 7.8, with an arrow marking the 'faster' direction.]

  19. NAMD Performance on GT200 GPU Cluster, August 2008 • 8 GT200s, 240 SPs @ 1.3GHz: • 72x faster than a single CPU core • 9x overall application speedup vs. 8 CPU cores • 32% faster overall than 8 nodes of G80 cluster • GT200 CUDA kernel is 54% faster • ~8% variation in GPU load • Cost of double-precision for force accumulation is minimal: only 8% slower than single-precision

  20. VMD – “Visual Molecular Dynamics” • Visualization and analysis of molecular dynamics simulations, sequence data, volumetric data, quantum chemistry simulations, particle systems, … • User extensible with scripting and plugins • http://www.ks.uiuc.edu/Research/vmd/

  21. GPU Acceleration in VMD Electrostatic field calculation, ion placement: factor of 20x to 44x faster Molecular orbital calculation and display: factor of 120x faster Imaging of gas migration pathways in proteins with implicit ligand sampling: factor of 20x to 30x faster

  22. Electrostatic Potential Maps • Electrostatic potentials evaluated on a 3-D lattice • Applications include: ion placement for structure building, time-averaged potentials for simulation, visualization and analysis. [Image: Isoleucine tRNA synthetase.]

  23. Direct Coulomb Summation • Each lattice point accumulates the electrostatic potential contribution from all atoms: potential[j] += charge[i] / rij, where rij is the distance from lattice[j] to atom[i]. [Diagram: lattice point j being evaluated, atom[i].]
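A simplified CUDA sketch of this summation, with one thread per lattice point (the parameter names and the float4 atom layout of x, y, z, charge are assumptions; this is not the optimized VMD kernel):

// Each thread evaluates the potential at one lattice point by summing over all atoms.
__global__ void direct_coulomb(const float4 *atoms, int natoms, float *potential,
                               float3 origin, float spacing, int nx, int ny, int nz) {
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j >= nx * ny * nz) return;

  // Convert the flat lattice index into x/y/z grid coordinates.
  int ix = j % nx, iy = (j / nx) % ny, iz = j / (nx * ny);
  float x = origin.x + ix * spacing;
  float y = origin.y + iy * spacing;
  float z = origin.z + iz * spacing;

  float pot = 0.0f;
  for (int i = 0; i < natoms; ++i) {
    float dx = x - atoms[i].x;
    float dy = y - atoms[i].y;
    float dz = z - atoms[i].z;
    // potential[j] += charge[i] / rij, using the fast reciprocal square root
    pot += atoms[i].w * rsqrtf(dx*dx + dy*dy + dz*dz);
  }
  potential[j] = pot;
}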

  24. Photobiology of Vision and Photosynthesis. Investigations of the chromatophore, a photosynthetic organelle. [Image: partial chromatophore model, ~10M atoms, with incident light.] Electrostatics needed to build the full structural model, place ions, and study macroscopic properties. Electrostatic field of the chromatophore model from the multilevel summation method: computed with 3 GPUs (G80) in ~90 seconds, 46x faster than a single CPU core. The full chromatophore model will permit structural, chemical, and kinetic investigations at a structural systems biology level.

  25. Lessons Learned • GPU algorithms need fine-grained parallelism and sufficient work to fully utilize the hardware • Much of per-thread GPU algorithm optimization revolves around efficient use of multiple memory systems and latency hiding • Concurrency can often be traded for per-thread performance, in combination with increased use of registers or shared memory • Fine-grained GPU work decompositions often compose well with the comparatively coarse-grained decompositions used for multicore or distributed memory programming

  26. Lessons Learned (2) • The host CPU can potentially be used to “regularize” the computation for the GPU, yielding better overall performance • Overlapping CPU work with the GPU can hide some communication and unaccelerated computation • Targeted use of double-precision floating point arithmetic or compensated summation can reduce the effects of floating point truncation at low cost to performance
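As one concrete form of compensated summation (a generic Kahan-style sketch, not the accumulation code NAMD actually uses):

// Kahan (compensated) summation: a running correction term preserves the low-order
// bits that plain single-precision accumulation would truncate. Beware of compiler
// options that permit floating-point reassociation (fast-math style flags), which
// can optimize the compensation away.
__device__ void kahan_add(float value, float &sum, float &comp) {
  float y = value - comp;  // apply the correction carried from previous additions
  float t = sum + y;       // low-order bits of y may be lost in this add
  comp = (t - sum) - y;    // recover exactly what was lost
  sum = t;                 // updated running sum
}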

  27. Acknowledgements Theoretical and Computational Biophysics Group, University of Illinois at Urbana-Champaign Wen-mei Hwu and the IMPACT group at University of Illinois at Urbana-Champaign NVIDIA Center of Excellence, University of Illinois at Urbana-Champaign NCSA Innovative Systems Lab David Kirk and the CUDA team at NVIDIA NIH support: P41-RR05969
