Parallel Particle-Based Reaction-Diffusion: A GPU Implementation for Systems Biology

Parallel particle-based reaction diffusion: a GPU implementation Lorenzo Dematté– Cosbi HIBI2010

Systems biology • Understand biological systems as a system. • Structure and dynamicsrather than the characteristics of isolated parts of a cell or organism (Kitano 2002). • Modelling & Simulation

M&S in Systems Biology What? A simplified abstract view of the system: complex reality, empirical objects, phenomena, and physical processes. Why? impossible or impractical to create experimental conditions; human thought processes amplified; (leverage computational power to simulate, visualize, manipulate) Modelling Simulation (and more) How? consistency to empirical data: ability to explain past observations, ability to predict future observations. Simulation. Visualization

Several simulation methods Computational complexity

Parallel M&S: motivation

Spatial simulation • Why? localized fluctuations, diffusion have an important role • mRNA movement within the cytoplasm [Fusco2003] • mRNA localization in budding yeast; Morphogen gradients [Alberts] • Bacterial chemotaxis [Bray2003] • Synapse specificity of long-term facilitation in neural cells [Kandel2001] • Microtubule assisted protein movement during the cell cycle [Novak2001] • … • Two main ways:

Parallel M&S: motivation A+B -> C

Scaling up model complexity • More complex models • More complex methods • Solution? • More powerful computers… • Moore’s law • More cores! Need for parallel processing • Different “threads” of execution • Partition of work/data into (mostly) independent sub-units

Parallelization techniques • Multiple Replication in Parallel (MRIP) • Launch independent replications (uncorrelated) on multiple processes (cores/CPUs/computers). • Allows for sensitivity analysis, statistics, evolutionary computations, optimization problems… (widespread, relatively easy) • Typically data parallel • Single Replication in Parallel (SRIP) • Decomposition of a stochastic trajectory into logical processes, running on different processors. • Challenging, as it have to preserve “exactness” (prove to be equivalent to a sequential simulation) • Task parallel: DES (Discrete Event Simulations) • Data parallel Evolutionary computations Spatial Gillespie GPU Particle based methods

GPU computing for SB • Recent survey [BiB2010] • No Brownian Dynamics • Lots of MD implementations • Why? • Many related problems (collision detection, ABMs) already have GPU implementations • Candidate: Smoldyn • Single molecular detail, fixed time step • Paper by author: no speed-up Andrews, Steven S. and Dennis Bray. "Stochastic simulation of chemical reactions with spatial resolution and single molecule detail." Phys. Biol. 1:137-151, 2004

GPUs • Strong points: • Peak performance in the TFlop range • Weak points: • GPU programming is challenging • Not everything fits on the GPU model! • Very fine-grained • Re-think a problem in a different way

GPUs • Streaming Processors (execution unit) • Single program on partitioned data + SIMD (SIMT) • Typically data parallel

GPUs CPU/OS Threads VS. GPU “Threads” //CPU code void increment_cpu(float *a, float b, int N) { for (intidx = 0; idx<N; idx++) a[idx] = a[idx] + b; } //GPU code __device__ void increment_gpu(float *a, float b, int N){ intidx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < N) a[idx] = a[idx] + b; } • Context switch is “free” • Actually, GPUs need to switch to keep SM busy • Thousands of threads in-flight • Each “thread” works on a piece of data

Initialize Smoldyn on the CPU • Zeroth reaction B A A • Diffusion • First order • Second order C # 0th reactions Binding/unbinding radii # molecules (foreach) # molecules # molecules^2, # molecules (space partition)

Challenges • General • Arithmetic intensity: operations / words transferred • Random number generation • Mersenne Twisters • Gaussian distributions • LUT in GPU constant memory (read-only, cached) • Variable number of molecules • Keeping a data parallel approach (streams) • Control flow: rethink (if (nProducts == 1) then…) • Natural CPU implementation: linked list w. pointers

Implementation: data structures 0.1 1.1 A 0.2 -2 C 1.2 A 2.4 2.3 A -3 ... ... ... ... A -2 1.2 ... Same for the reaction list Bim rates/radii in a 2D matrix

Implementation: data structures • Struct of arrays: better memory access (also compared to striping) • Parallel implementation of Mersenne Twister (MT) • Single twister not parallel • Instead, many twisters • different parameters (correlation) • dcmt • Gaussian LUT • Constant memory

Implementation: diffusion One kernel for each molecule uint index = blockIdx.x * blockDim.x + threadIdx.x; if (index >= numMolecules) return; // Ensure coalesced reads volatile float4 pos = posArray[index]; volatile inttypeId = typeArray[index]; intrngIndex = index % MT_RNG_COUNT; TwisterState* rngState = &(rngStateArray[rngIndex]); // Get diffusion rate float rate = tex1Dfetch(diffusionRatesTex, typeId); intrandX = MTRand(rngState, rngIndex); // Transform it to gaussian, combine with rate pos.x += rate * gaussianLookupTable[randX & tableDim-1]; // ... same for y, z

Implementation: first order ... A C A A ... A A C A+B 0 A A A 0 ... A 0 B 0 0 ... 0 A A A 0 B ... A 0 B 0 0 C 0 ... 0 0 1 2 3 3 ... n 0 0 1 1 1 2 ... m CUDPP “style” scan A A A B ... A B C ... n + m

Implementation: second order • GPU collision detection • Used for video-games • Missile-vehicle, vehicle-vehicle • “Special effects” (smoke, trails, water) • N-body, bruteforce • Excellent results on CUDA • But collisions are local interactions • Spatial subdivision with a uniform grid • CUDA Whitepaper [S.Green] • Uses sorting to improve performance

Implementation: second order Cell -> which particles? ?? cellId = SortedList[CellStart [CellIndex]].Second

Implementation: second order uintgridHash = calcGridHash(gridPos); // get start of bucket for this cell uintstartIndex = FETCH(cellStart, gridHash); uint mol1Index = gridParticleIndex[index], mol1Type = types[mol1Index]; // iterate over particles in this cell uintendIndex = FETCH(cellEnd, gridHash); for(uint j=startIndex; j<endIndex; j++) { if (j != index) { // check not colliding with self uintmol2Index = gridParticleIndex[j], mol2Type = types[mol2Index]; // is a reaction possible between them? inttableIdx = mol1Type * numTypes + mol2Type; int r = reactionTable[tableIdx]; if (r == -1) continue; float3 pos2 = make_float3(FETCH(oldPos, j)); float3 relPos = pos1 - pos2; // calculate relative position if (lengthSquared(relPos) <= reactionList->bindRadiusSquared[r]) { float p = reactionList->prob[r]; if (p == 1 || randReal(randState, threadId) <p) bimReact(...); } } } Index is the current particle index, j the one we will test . Both are indexes to the sorted array NOTE: as for the “diffusion” and “first” steps, code for the algorithm is unchanged. CUDA helps here: we just copy-pasted the core C code.

Initialize • Initialize Smoldyn on the GPU • Zeroth reaction • Zeroth reaction CPU CPU GPU • First order • Second order • Diffusion Transfer data Use aux array for “birth”. Compact “birth” and particles arrays (death). Diffusion First order (segmented scan) Read back N Use particle array (non-death), # particles can only diminish. Compact particle array. Fuse them (copy) (reallocate if needed) Second order (build grid, compute hash, process cell, segmented scan) Read back N (Opt) Reallocate • Data structures entirely on the GPU

Results

Results(2)

Summary • Think data parallel • Minimize data exchange CPU-GPU • New data and control flow structure also on multi- many-core • Memory key to great performances • ~140x speed-up only after memory tweaks • Specific to GPUs: prefix scan (compaction) and sorted grid (collision detection) • Future: • Smoldyn input/integration • Editor for spatial structure

Questions? Thank You

Parallel Particle-Based Reaction-Diffusion: A GPU Implementation for Systems Biology

Parallel Particle-Based Reaction-Diffusion: A GPU Implementation for Systems Biology

Presentation Transcript

Digestion and absorption of proteins Protein enters the stomach and the low pH:

Diagnostic Postural reactions

EXAM REVIEW

Dieckmann Condensation Reaction

Prentice Hall Chemistry c 2005

The Hetero-Ene Reaction: Development and Synthetic Utility

Realtime particle based fluid simulation

Enthalpy

Diffusion Mass Transfer

Parallel Programming Models, Languages and Compilers

Parallel Programming Models, Languages and Compilers

Chapter 11 Rate of Reaction

Introduction to Particle Physics

Parallel Algorithms on Networks of Processors

Parallel Real-Time Systems

Chapter 28

KINETICS

Chapter 13: rates of reaction