
Memory Access Patterns For Cellular Automata Using GPGPUs





Presentation Transcript


  1. By: James M. Balasalle Memory Access Patterns For Cellular Automata Using GPGPUs

  2. Agenda • Background Information • Different Patterns and Techniques • Results • Case Study: Surface Water Flow • Conclusions • Questions

  3. Background Info: Parallel Processing • How is parallel processing related to Moore’s Law? • Super Computers • Multicore CPUs • Interconnected, Independent Machines • Clusters, MPI • Grid Computing • GPUs

  4. Background Info: Cellular Automata • A cellular automaton (CA) is a discrete mathematical model used to calculate the global behavior of a complex system using (ideally) simple local rules. • Usually grid-based model of states • Values are determined by local neighbors • Wide range of applications

  5. Background Info: Conway’s Game of Life • The Game of Life, showing several well-known patterns: crabs, gliders, etc.

  6. Background Info: Conway’s Game of Life • Cellular Automaton • Cell has two states: alive and dead • Next state is based on the surrounding 8 neighbors • Alive Cell: • 2 or 3 live neighbors: stay alive, else die • Dead Cell: • Exactly 3 live neighbors: come alive, else stay dead • Simple rules lead to complex patterns
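The four rules on this slide reduce to a single transition function. A minimal CPU-side Python sketch (not the GPU kernel from the talk):

```python
def next_state(alive, live_neighbors):
    """Conway's Game of Life transition for one cell.

    alive: current state (True = alive, False = dead)
    live_neighbors: count of live cells among the 8 surrounding cells
    """
    if alive:
        return live_neighbors in (2, 3)   # survive with 2 or 3 live neighbors
    return live_neighbors == 3            # birth with exactly 3 live neighbors
```

Applying this function to every cell of the grid simultaneously produces the next generation, which is what each GPU thread does for its assigned cell.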

  7. Background Info: SIFT • Scale Invariant Feature Transform • Calculation of robust features in an image • Features can then be used to identify images or portions of an image • Widely used in Computer Vision Applications • From: http://acmechimera.blogspot.com/2008/03/paper-review-distinctive-image-features.html

  8. Background Info: SIFT • SIFT is a pipeline of successive operations • Initial Keypoint detection • Keypoint refinement, edge removal • Keypoint orientation calculation • Keypoint descriptor creation

  9. Background Info: SIFT • Focus is on Step 1: initial keypoint detection • Scale Space creation: successive Gaussian blurring and downsampling • Difference of Gaussians, adjacent in scale space • Local extrema detection in DoG • Resulting extrema are initial candidate keypoints

  10. Nvidia GPUs • External coprocessor card, connected to system bus • Manages its own DRAM store • Made up of one or more Streaming Multiprocessors (SMs) • Each SM contains • 8 processing cores • 16KB of on-chip cache / storage • 2 Special Function Units (SFUs) for transcendentals, etc.

  11. Nvidia GPUs • Memory Regions: • Global Memory – non-cached memory, similar to RAM for CPU • Shared Memory – user-managed, on-chip cache • Texture Memory – alternative access path for accessing global memory, hardware calculations supported • Constant Memory – immutable cached memory store

  12. Patterns and Techniques • Two broad categories: • Resource Utilization • Different memory regions • Memory alignment and coalescence • Maximizing bus usage • Overhead Reduction • Instruction Reduction • Arithmetic intensity

  13. Patterns and Techniques • Global Memory • Conditional logic to handle boundary cells vs. memory halo • Halo achieves an 18% speed increase
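The halo idea: allocate the grid with an extra ring of dead cells so every interior cell can read all 8 neighbors without boundary branches. A Python sketch of the layout (function names are mine, not from the thesis):

```python
def make_halo_grid(cells, width, height):
    """Copy a width x height grid (flat list) into a (width+2) x (height+2)
    grid whose outer ring stays 0, so neighbor reads never need a bounds check."""
    padded_w = width + 2
    halo = [0] * (padded_w * (height + 2))
    for r in range(height):
        for c in range(width):
            halo[(r + 1) * padded_w + (c + 1)] = cells[r * width + c]
    return halo

def live_neighbors(halo, padded_w, r, c):
    """Neighbor count for interior cell (r, c) in halo coordinates;
    the halo ring is always 0, so no conditional boundary logic is run."""
    return sum(halo[(r + dr) * padded_w + (c + dc)]
               for dr in (-1, 0, 1) for dc in (-1, 0, 1)
               if (dr, dc) != (0, 0))
```

On the GPU the win is that all threads in a warp execute the same straight-line loads, which is the divergence-free behavior behind the 18% figure.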

  14. Patterns and Techniques • Shared vs. global memory • Utilize faster on-chip cache for frequently requested data • Shared memory is 30% faster

  15. Patterns and Techniques Coalescence: when all memory access requests for a half-warp are aggregated into a single request. • Aligned memory: • Align data on a 64 or 128-byte boundary • Achieved by padding each row • For a half-warp, coalescence reduces number of requests from 16 to 1 (or 2) • 8% performance improvement • Could possibly require significant host CPU processing
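The row-padding arithmetic behind the alignment bullet can be sketched as follows (a hedged illustration of the general technique, assuming 4-byte elements and the 64-byte boundary from the slide):

```python
def padded_pitch(width_elems, elem_size=4, alignment=64):
    """Bytes per row after padding so every row starts on an
    `alignment`-byte boundary (round row size up to a multiple of it)."""
    row_bytes = width_elems * elem_size
    return ((row_bytes + alignment - 1) // alignment) * alignment

def elem_offset(row, col, pitch_bytes, elem_size=4):
    """Byte offset of element (row, col) in the padded layout."""
    return row * pitch_bytes + col * elem_size
```

Because each row now begins on an aligned address, a half-warp reading 16 consecutive elements falls inside one aligned segment and coalesces into one (or two) transactions instead of 16.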

  16. Patterns and Techniques • Memory Region Shape • Minimum bus transaction is 32 bytes, even for 4-byte requests • Some halo cells are unaligned, minimize these • 16% faster

  17. Patterns and Techniques • Moving into overhead reduction and arithmetic intensity focused techniques • Index calculations, performed by every thread: • unsigned int row = blockIdx.y * blockDim.y + threadIdx.y; • unsigned int col = blockIdx.x * blockDim.x + threadIdx.x; • int idx = row * width + col; • Approximately 15 total instructions to compute idx • For 1M data elements, 15,000,000 instructions devoted to index calculation

  18. Patterns and Techniques • Calculate 2 (or more) elements per thread • Calculate first index, using ~15 instructions • Calculate second index, relative to first, in a single add instruction • For 1M elements, 8,000,000 instructions; a 46% reduction • 44% performance improvement, over aligned memory
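The savings come from deriving the second index from the first. A Python sketch of the index math, assuming each thread's second element sits a fixed number of rows below its first (my layout choice for illustration; the thesis layout may differ):

```python
def thread_indices(block_idx, block_dim, thread_idx, width, stride_rows):
    """Index math for a thread that processes two elements.

    block_idx, block_dim, thread_idx: (x, y) pairs, mirroring CUDA's
    blockIdx/blockDim/threadIdx. The first index pays the full row/col
    calculation; the second only adds a fixed offset (stride_rows * width
    is a per-launch constant in the real kernel, so it costs one add).
    """
    row = block_idx[1] * block_dim[1] + thread_idx[1]
    col = block_idx[0] * block_dim[0] + thread_idx[0]
    idx = row * width + col
    idx2 = idx + stride_rows * width
    return idx, idx2
```

With ~15 instructions for the first index and 1 for the second, two elements cost ~16 instructions instead of ~30, which is where the roughly 46% reduction on the slide comes from.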

  19. Patterns and Techniques • Arithmetic intensity: ratio of actual computation to memory loads and index calculations • Multiple elements per thread • Multi-generation implementations • Data packing / interleaving

  20. Patterns and Techniques • Multi-generational kernel • Compute 2 generations in a single kernel launch • Reduces total index calculations • Reduces total memory loads • Uses shared memory for temporary storage

  21. Patterns and Techniques • Multi-generational kernel • Results are poor • Instruction count is limiting factor • Index calculations!

  22. Patterns and Techniques • Multi-generational kernel thread allocations • One thread per effective element • Results in many threads loading multiple elements • And computing multiple elements for each generation • Each load, computation requires index calculations • One thread for each element required to be loaded • Not implemented, future work

  23. SIFT Results • 2-element is faster, approximately 37% • Improvement due to instruction reduction • Gaussian Blur • Implemented as a non-separable convolution • Multiply a square matrix by each element and its neighbors • Square matrix is result of Gaussian function • Data elements are pixel values of image in question

  24. SIFT Results • Difference of Gaussians • Simply subtract results of blurring kernel • Kernel is extremely simple: more index calculations than effective operations • Kernel utilizes data packing • Too simple to measure
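The DoG kernel really is just an elementwise subtraction of adjacent scale-space levels, which is why index overhead dominates it. A one-line Python sketch:

```python
def difference_of_gaussians(blur_a, blur_b):
    """Elementwise subtraction of two adjacent Gaussian-blurred levels.
    There is so little arithmetic per element that index calculation
    dominates, motivating the data packing mentioned on the slide."""
    return [a - b for a, b in zip(blur_a, blur_b)]
```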

  25. SIFT Results • Extrema Detection • Each element compares itself to its neighbors • Minimum and maximum values are extrema
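In SIFT, each DoG sample is compared against its 26 neighbors: 8 in its own level and 9 in each adjacent level. A Python sketch for interior points, assuming three equally sized levels stored as 2-D lists:

```python
def is_extremum(levels, s, r, c):
    """True if levels[s][r][c] is strictly greater or strictly less than
    all 26 neighbors across scales s-1, s, s+1 (interior points only)."""
    v = levels[s][r][c]
    neighbors = [levels[s + ds][r + dr][c + dc]
                 for ds in (-1, 0, 1)
                 for dr in (-1, 0, 1)
                 for dc in (-1, 0, 1)
                 if (ds, dr, dc) != (0, 0, 0)]
    return v > max(neighbors) or v < min(neighbors)
```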

  26. SIFT Results • Extrema Detection • 2-element kernel is fastest • Rectangular kernel not effective since algorithm has built-in bounds checking

  27. Case Study: Surface Water Flow • Based on Masters Thesis by Jay Parsons • Using a digital elevation map, determine the amount and location of water during and after a rain event • Built upon a CA model that uses elevation distance between cells to determine where water flows

  28. Case Study: Surface Water Flow • Sample output

  29. Case Study: Surface Water Flow • Initial Steps • Port from Java to C++ • Gain understanding • Create a baseline implementation for timing comparisons • Initial GPU implementation • Application of techniques

  30. Case Study: Surface Water Flow • Problem: • During the processing of one cell, state values of its neighbors were updated • Design decision to make calculation of incoming water easier • Complicates CA implementation • Push vs. Pull methods

  31. Case Study: Surface Water Flow • Modify implementation, simplify CA rules • New value is: • current value – outgoing volume + incoming volume • Incoming volume more difficult to calculate • Dramatic improvement: 3.6x speedup • Reduced instruction count • Better usage of shared memory
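The pull formulation lets each cell compute its own next value from neighbor state instead of writing into its neighbors, so GPU threads never race on the same cell. A 1-D Python sketch with a hypothetical outflow rule (a fraction of the surface-height drop, chosen for illustration; the thesis's actual flow rule differs):

```python
def step_water(elevation, water, flow_frac=0.1):
    """One pull-style CA step on a 1-D terrain.

    Each cell i computes new = current - outgoing + incoming, where
    incoming is recomputed from the neighbors' state using the same
    rule, so no thread ever updates another cell's value.
    """
    n = len(water)

    def outflow(src, dst):
        # Hypothetical rule: a fraction of the water-surface height drop,
        # capped by the water actually available in the source cell.
        drop = (elevation[src] + water[src]) - (elevation[dst] + water[dst])
        return min(water[src], flow_frac * drop) if drop > 0 else 0.0

    new = water[:]
    for i in range(n):
        out = sum(outflow(i, j) for j in (i - 1, i + 1) if 0 <= j < n)
        inc = sum(outflow(j, i) for j in (i - 1, i + 1) if 0 <= j < n)
        new[i] = water[i] - out + inc
    return new
```

Because every flow appears once as outgoing and once as incoming, total water volume is conserved, a useful sanity check for any implementation of the rule.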

  32. Case Study: Surface Water Flow

  33. Recap • What worked • Shared memory, memory alignment, 2-element processing, rectangular regions • What didn’t work • Multi-generation kernels – more investigation needed • Future work • Data packing • Texture memory

  34. Observations • Balance between instruction-bound and memory-bound • Strict CA rules help performance and implementation • Powerful analysis tools required • Compromises • Shared Memory • 2-element processing • Rectangular regions

  35. Conclusion • GPUs are a great platform for cellular automata • Other problems that exhibit spatial locality • Techniques presented have real, measurable impact • Straightforward implementation • Applicable to wide range of problems • Worthwhile area of research

  36. Questions??
