270 likes | 393 Vues
This paper presents a GPU implementation of parallel particle-based reaction-diffusion systems aimed at modeling complex biological phenomena. By leveraging computational power, we simulate the dynamics and structure of biological systems, moving beyond isolated components to capture the essence of cellular interactions. The study addresses challenges in parallel processing, introduces optimization techniques, and emphasizes the importance of accuracy in simulations. Key applications range from mRNA localization to bacterial chemotaxis, offering insights into cellular behavior through advanced modeling techniques.
E N D
Parallel particle-based reaction diffusion: a GPU implementation Lorenzo Dematté– Cosbi HIBI2010
Systems biology • Understand biological systems as a system. • Structure and dynamicsrather than the characteristics of isolated parts of a cell or organism (Kitano 2002). • Modelling & Simulation
M&S in Systems Biology What? A simplified abstract view of the system: complex reality, empirical objects, phenomena, and physical processes. Why? impossible or impractical to create experimental conditions; human thought processes amplified; (leverage computational power to simulate, visualize, manipulate) Modelling Simulation (and more) How? consistency to empirical data: ability to explain past observations, ability to predict future observations. Simulation. Visualization
Several simulation methods Computational complexity
Spatial simulation • Why? localized fluctuations, diffusion have an important role • mRNA movement within the cytoplasm [Fusco2003] • mRNA localization in budding yeast; Morphogen gradients [Alberts] • Bacterial chemotaxis [Bray2003] • Synapse specificity of long-term facilitation in neural cells [Kandel2001] • Microtubule assisted protein movement during the cell cycle [Novak2001] • … • Two main ways:
Parallel M&S: motivation A+B -> C
Scaling up model complexity • More complex models • More complex methods • Solution? • More powerful computers… • Moore’s law • More cores! Need for parallel processing • Different “threads” of execution • Partition of work/data into (mostly) independent sub-units
Parallelization techniques • Multiple Replication in Parallel (MRIP) • Launch independent replications (uncorrelated) on multiple processes (cores/CPUs/computers). • Allows for sensitivity analysis, statistics, evolutionary computations, optimization problems… (widespread, relatively easy) • Typically data parallel • Single Replication in Parallel (SRIP) • Decomposition of a stochastic trajectory into logical processes, running on different processors. • Challenging, as it have to preserve “exactness” (prove to be equivalent to a sequential simulation) • Task parallel: DES (Discrete Event Simulations) • Data parallel Evolutionary computations Spatial Gillespie GPU Particle based methods
GPU computing for SB • Recent survey [BiB2010] • No Brownian Dynamics • Lots of MD implementations • Why? • Many related problems (collision detection, ABMs) already have GPU implementations • Candidate: Smoldyn • Single molecular detail, fixed time step • Paper by author: no speed-up Andrews, Steven S. and Dennis Bray. "Stochastic simulation of chemical reactions with spatial resolution and single molecule detail." Phys. Biol. 1:137-151, 2004
GPUs • Strong points: • Peak performance in the TFlop range • Weak points: • GPU programming is challenging • Not everything fits on the GPU model! • Very fine-grained • Re-think a problem in a different way
GPUs • Streaming Processors (execution unit) • Single program on partitioned data + SIMD (SIMT) • Typically data parallel
GPUs CPU/OS Threads VS. GPU “Threads” //CPU code void increment_cpu(float *a, float b, int N) { for (intidx = 0; idx<N; idx++) a[idx] = a[idx] + b; } //GPU code __device__ void increment_gpu(float *a, float b, int N){ intidx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < N) a[idx] = a[idx] + b; } • Context switch is “free” • Actually, GPUs need to switch to keep SM busy • Thousands of threads in-flight • Each “thread” works on a piece of data
Initialize Smoldyn on the CPU • Zeroth reaction B A A • Diffusion • First order • Second order C # 0th reactions Binding/unbinding radii # molecules (foreach) # molecules # molecules^2, # molecules (space partition)
Challenges • General • Arithmetic intensity: operations / words transferred • Random number generation • Mersenne Twisters • Gaussian distributions • LUT in GPU constant memory (read-only, cached) • Variable number of molecules • Keeping a data parallel approach (streams) • Control flow: rethink (if (nProducts == 1) then…) • Natural CPU implementation: linked list w. pointers
Implementation: data structures 0.1 1.1 A 0.2 -2 C 1.2 A 2.4 2.3 A -3 ... ... ... ... A -2 1.2 ... Same for the reaction list Bim rates/radii in a 2D matrix
Implementation: data structures • Struct of arrays: better memory access (also compared to striping) • Parallel implementation of Mersenne Twister (MT) • Single twister not parallel • Instead, many twisters • different parameters (correlation) • dcmt • Gaussian LUT • Constant memory
Implementation: diffusion One kernel for each molecule uint index = blockIdx.x * blockDim.x + threadIdx.x; if (index >= numMolecules) return; // Ensure coalesced reads volatile float4 pos = posArray[index]; volatile inttypeId = typeArray[index]; intrngIndex = index % MT_RNG_COUNT; TwisterState* rngState = &(rngStateArray[rngIndex]); // Get diffusion rate float rate = tex1Dfetch(diffusionRatesTex, typeId); intrandX = MTRand(rngState, rngIndex); // Transform it to gaussian, combine with rate pos.x += rate * gaussianLookupTable[randX & tableDim-1]; // ... same for y, z
Implementation: first order ... A C A A ... A A C A+B 0 A A A 0 ... A 0 B 0 0 ... 0 A A A 0 B ... A 0 B 0 0 C 0 ... 0 0 1 2 3 3 ... n 0 0 1 1 1 2 ... m CUDPP “style” scan A A A B ... A B C ... n + m
Implementation: second order • GPU collision detection • Used for video-games • Missile-vehicle, vehicle-vehicle • “Special effects” (smoke, trails, water) • N-body, bruteforce • Excellent results on CUDA • But collisions are local interactions • Spatial subdivision with a uniform grid • CUDA Whitepaper [S.Green] • Uses sorting to improve performance
Implementation: second order Cell -> which particles? ?? cellId = SortedList[CellStart [CellIndex]].Second
Implementation: second order uintgridHash = calcGridHash(gridPos); // get start of bucket for this cell uintstartIndex = FETCH(cellStart, gridHash); uint mol1Index = gridParticleIndex[index], mol1Type = types[mol1Index]; // iterate over particles in this cell uintendIndex = FETCH(cellEnd, gridHash); for(uint j=startIndex; j<endIndex; j++) { if (j != index) { // check not colliding with self uintmol2Index = gridParticleIndex[j], mol2Type = types[mol2Index]; // is a reaction possible between them? inttableIdx = mol1Type * numTypes + mol2Type; int r = reactionTable[tableIdx]; if (r == -1) continue; float3 pos2 = make_float3(FETCH(oldPos, j)); float3 relPos = pos1 - pos2; // calculate relative position if (lengthSquared(relPos) <= reactionList->bindRadiusSquared[r]) { float p = reactionList->prob[r]; if (p == 1 || randReal(randState, threadId) <p) bimReact(...); } } } Index is the current particle index, j the one we will test . Both are indexes to the sorted array NOTE: as for the “diffusion” and “first” steps, code for the algorithm is unchanged. CUDA helps here: we just copy-pasted the core C code.
Initialize • Initialize Smoldyn on the GPU • Zeroth reaction • Zeroth reaction CPU CPU GPU • First order • Second order • Diffusion Transfer data Use aux array for “birth”. Compact “birth” and particles arrays (death). Diffusion First order (segmented scan) Read back N Use particle array (non-death), # particles can only diminish. Compact particle array. Fuse them (copy) (reallocate if needed) Second order (build grid, compute hash, process cell, segmented scan) Read back N (Opt) Reallocate • Data structures entirely on the GPU
Summary • Think data parallel • Minimize data exchange CPU-GPU • New data and control flow structure also on multi- many-core • Memory key to great performances • ~140x speed-up only after memory tweaks • Specific to GPUs: prefix scan (compaction) and sorted grid (collision detection) • Future: • Smoldyn input/integration • Editor for spatial structure
Questions? Thank You