GPU-Efficient Recursive Filtering and Summed-Area Tables Jeremiah van Oosten Reinier van Oeveren
Table of Contents
• Introduction
• Related Works
  • Prefix Sums and Scans
  • Recursive Filtering
  • Summed-Area Tables
• Problem Definition
• Parallelization Strategies
  • Baseline (Algorithm RT)
  • Block Notation
  • Inter-block Parallelism
  • Kernel Fusion (Algorithm 2)
• Overlapping
  • Causal-Anticausal Overlapping (Algorithms 3 & 4)
  • Row-Column Causal-Anticausal Overlapping (Algorithm 5)
• Summed-Area Tables
  • Overlapped Summed-Area Tables (Algorithm SAT)
• Results
• Conclusion
Introduction
• Linear filtering is commonly used to blur, sharpen, or down-sample images.
• A direct implementation evaluating a filter of support d on an h × w image has a cost of O(hwd).
Introduction
• The cost of the image filter can be reduced using a recursive filter, in which case previously computed outputs are used to compute the current value:
  y_k = x_k − a_1·y_(k−1) − … − a_r·y_(k−r)
• The cost is reduced to O(hwr), where r is the number of recursive feedbacks (the filter order).
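The recurrence above can be sketched as follows (illustrative Python, not the paper's GPU code; function name and zero-prologue boundary are assumptions for the sketch):

```python
# Sketch: a causal recursive filter of order r applied to every row of
# an h-by-w image. Each output reuses previously computed outputs, so
# the cost is O(h*w*r) regardless of the effective filter support.
def recursive_filter_rows(image, coeffs):
    """coeffs = feedback coefficients a_1..a_r; prologue assumed zero."""
    out = []
    for row in image:
        y = []
        for k, x in enumerate(row):
            v = x
            for i, a in enumerate(coeffs, start=1):
                if k - i >= 0:
                    v -= a * y[k - i]
            y.append(v)
        out.append(y)
    return out
```

For example, an impulse fed through a first-order filter with a_1 = −0.5 (i.e., y_k = x_k + 0.5·y_(k−1)) decays geometrically: `recursive_filter_rows([[1, 0, 0, 0]], [-0.5])` yields `[[1, 0.5, 0.25, 0.125]]`.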
Recursive Filters
• At each step, the filter produces an output element from a linear combination of the input element and previously computed output elements, for a total cost of O(hwr).
Recursive Filters
• Applications of recursive filters:
  • Low-pass filtering, e.g., Gaussian-like kernels
  • Inverse convolution
  • Summed-area tables
Causality • Recursive filters can be causal or anticausal (or non-causal). • Causal filters operate on previous values. • Anticausal filters operate on “future” values.
Filter Sequences
• It is often required to perform a sequence of recursive image filters:
  • Independent columns: causal pass, then anticausal pass
  • Independent rows: causal pass, then anticausal pass
• (Diagram: input X with prologues P, P′ and epilogues E, E′; intermediate results Y, Z, U and final output V.)
Maximizing Parallelism
• The naïve approach to solving the sequence of recursive filters does not sufficiently utilize the processing cores of the GPU.
• The latest GPU from NVIDIA has 2,688 shader cores. Processing even large images (2048×2048) one row or column per thread will not make full use of all available cores.
• Underutilization of the GPU cores does not allow for latency hiding.
• We need a way to make better use of the GPU without increasing I/O.
Overlapping
• In the paper “GPU-Efficient Recursive Filtering and Summed-Area Tables”, Diego Nehab et al. introduce a new algorithmic framework that reduces memory bandwidth by overlapping computation over the full sequence of recursive filters.
Block Partitioning
• Partition the image into 2D blocks of size b × b, where b matches the warp size (b = 32).
Prefix Sums and Scans
• A prefix sum computes all running totals of a sequence: y_k = x_k + y_(k−1).
• It is a simple case of a first-order recursive filter.
• A scan generalizes the recurrence using an arbitrary binary associative operator.
• Parallel prefix sums and scans are important building blocks for numerous algorithms [Iverson 1962; Stone 1971; Blelloch 1989; Sengupta et al. 2007].
• An optimized implementation comes with the CUDPP library.
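Both notions can be sketched in a few lines (plain sequential Python; the point is the recurrence and the associative operator, not the parallel implementation):

```python
# A prefix sum is the first-order recurrence y_k = x_k + y_(k-1),
# i.e., a recursive filter with feedback coefficient a_1 = -1.
def prefix_sum(xs):
    y, acc = [], 0
    for x in xs:
        acc += x
        y.append(acc)
    return y

# A scan generalizes the recurrence to any binary associative operator.
def inclusive_scan(xs, op):
    y = [xs[0]]
    for x in xs[1:]:
        y.append(op(y[-1], x))
    return y
```

For example, `prefix_sum([1, 2, 3, 4])` gives `[1, 3, 6, 10]`, and `inclusive_scan` with `max` gives the running maximum; associativity is what allows parallel scan algorithms to apply.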
Recursive Filtering
• A generalization of the prefix sum using a weighted combination of prior outputs.
• This can be implemented as a scan operation with redefined basic operators.
• Ruijters and Thévenaz exploit parallelism across the rows and columns of the input.
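One way to see the “scan with redefined operators” claim (an illustrative sketch, not the paper's formulation): a first-order filter y_k = x_k − a·y_(k−1) can be recast as a scan over affine maps (m, b) meaning y ↦ m·y + b, and composition of affine maps is associative:

```python
# Sketch: first-order recursive filter as an inclusive scan over affine
# maps. Each sample x contributes the map y -> -a*y + x; composing maps
# is associative, so a parallel scan applies (shown sequentially here).
def filter_as_scan(xs, a, prologue=0.0):
    maps = [(-a, x) for x in xs]
    def compose(f, g):              # apply f first, then g
        mf, bf = f
        mg, bg = g
        return (mg * mf, mg * bf + bg)
    scanned = [maps[0]]
    for m in maps[1:]:
        scanned.append(compose(scanned[-1], m))
    # evaluate each composed map at the boundary value (prologue)
    return [m * prologue + b for (m, b) in scanned]
```

With a = −0.5 this reproduces the direct recurrence exactly, e.g., `filter_as_scan([1, 2, 3, 4], -0.5)` gives `[1.0, 2.5, 4.25, 6.125]`.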
Recursive Filtering
• Sung and Mitra use block parallelism and split the computation into two parts:
  • One based only on the block data, assuming zero initial conditions.
  • One based only on the initial conditions, assuming zero block data.
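The split rests on linearity: the block's true output equals the zero-initial-condition response to the block data plus the zero-input response to the initial conditions. A minimal first-order sketch (function names are ours, not Sung and Mitra's):

```python
# Sketch of the block split for y_k = x_k - a*y_(k-1): by linearity,
# filtering with a nonzero prologue equals filtering the data with a
# zero prologue plus filtering zeros with the prologue.
def causal(xs, a, prologue=0.0):
    y, prev = [], prologue
    for x in xs:
        prev = x - a * prev
        y.append(prev)
    return y

def block_split(xs, a, prologue):
    data_part = causal(xs, a, 0.0)                   # zero initial conditions
    ic_part = causal([0.0] * len(xs), a, prologue)   # zero block data
    return [d + i for d, i in zip(data_part, ic_part)]
```

Both parts can be computed independently (and in parallel across blocks) and summed afterwards.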
Summed-Area Tables
• Summed-area tables enable averaging rectangular regions of pixels with a constant number of reads: sum = LR − LL − UR + UL, where UL, UR, LL, LR are the table values at the four corners of the region.
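A minimal sketch of the idea (plain Python, hypothetical helper names; the table carries a zero row/column of padding so the four-corner formula needs no special cases):

```python
# Build a summed-area table: sat[i][j] holds the sum of all pixels
# above and to the left of (i, j), exclusive (one row/col of padding).
def build_sat(img):
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for i in range(h):
        for j in range(w):
            sat[i + 1][j + 1] = (img[i][j] + sat[i][j + 1]
                                 + sat[i + 1][j] - sat[i][j])
    return sat

# Sum over the inclusive rectangle [top..bottom] x [left..right]
# with exactly four reads: LR - UR - LL + UL.
def rect_sum(sat, top, left, bottom, right):
    return (sat[bottom + 1][right + 1] - sat[top][right + 1]
            - sat[bottom + 1][left] + sat[top][left])
```

Dividing `rect_sum` by the rectangle's area gives the box-filter average in O(1) reads per query, independent of the rectangle size.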
Summed-Area Tables
• The paper “Fast Summed-Area Table Generation…” by Justin Hensley et al. (2005) describes a method called recursive doubling, which requires multiple passes over the input image (a 256×256 image requires 16 passes to compute).
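Recursive doubling can be sketched in one dimension (an illustrative Hillis-Steele-style scan, not Hensley's shader code): each pass adds the value from 2^p positions to the left, so n elements need log2(n) passes, giving 8 passes per dimension for a 256-wide image, 16 in total for a 2D table.

```python
# Sketch of recursive doubling for a 1D prefix sum: pass p adds the
# element 2^p slots to the left; log2(n) passes complete the sum.
def doubling_prefix_sum(xs):
    n, passes, stride = len(xs), 0, 1
    while stride < n:
        # each position reads from the *previous* pass's values
        xs = [xs[i] + (xs[i - stride] if i >= stride else 0)
              for i in range(n)]
        stride *= 2
        passes += 1
    return xs, passes
```

Applying this along rows and then columns of the image produces the summed-area table.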
Summed-Area Tables
• In 2010, Justin Hensley extended his 2005 implementation to compute shaders, taking more samples per pass and storing intermediate results in shared memory. A 256×256 image then requires only 4 passes when reading 16 samples per pass.
Problem Definition
• Causal recursive filters of order r are characterized by a set of feedback coefficients a_1, …, a_r in the following manner.
• Given a prologue vector p and an input vector x of any size, the filter F produces the output y = F(p, x) such that:
  y_k = x_k − Σ_{i=1..r} a_i·y_(k−i)
• The output y has the same size as the input x.
Problem Definition
• Causal recursive filters depend on a prologue vector p.
• Similarly for the anticausal filter R: given an input vector u and an epilogue vector e, the output vector z = R(u, e) is defined by:
  z_k = u_k − Σ_{i=1..r} a′_i·z_(k+i)
Problem Definition
• For row processing, we define extended causal and anticausal filters F^T and R^T, which operate along the rows (i.e., on the transposed image).
Problem Definition
• With these definitions, we can formulate the problem of applying the full sequence of four recursive filters (down, up, right, left):
  • Columns, causal: Y = F(P, X)
  • Columns, anticausal: Z = R(Y, E)
  • Rows, causal: U = F^T(P′, Z)
  • Rows, anticausal: V = R^T(U, E′)
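The full sequence can be sketched on a small image (first-order filters and zero prologues/epilogues for brevity; plain Python standing in for the GPU kernels):

```python
# Sketch: causal then anticausal down the columns, then causal and
# anticausal across the rows (zero boundaries assumed for brevity).
def causal(xs, a):
    y, prev = [], 0.0
    for x in xs:
        prev = x - a * prev
        y.append(prev)
    return y

def anticausal(xs, a):
    # same recurrence, run in reverse order
    return causal(xs[::-1], a)[::-1]

def transpose(m):
    return [list(r) for r in zip(*m)]

def filter_sequence(img, a):
    cols = transpose(img)
    cols = [anticausal(causal(c, a), a) for c in cols]   # X -> Y -> Z
    rows = transpose(cols)
    rows = [anticausal(causal(r, a), a) for r in rows]   # Z -> U -> V
    return rows
```

Note the dependency chain: each stage consumes the whole output of the previous one, which is exactly what the overlapping framework later breaks up.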
Problem Definition • The goal is to implement this algorithm on the GPU to make full use of all available resources. • Maximize occupancy by splitting the problem up to make use of all cores. • Reduce I/O to global memory. • Must break the dependency chain in order to increase task parallelism. • Primary design goal: Increase the amount of parallelism without increasing memory I/O.
Prior Parallelization strategies • Baseline algorithm ‘RT’ • Block notation • Inter-block parallelism • Kernel fusion
Algorithm RT (Ruijters & Thévenaz): independent row and column processing
• Step RT1: In parallel for each column in X, apply F sequentially and store Y.
• Step RT2: In parallel for each column in Y, apply R sequentially and store Z.
• Step RT3: In parallel for each row in Z, apply F^T sequentially and store U.
• Step RT4: In parallel for each row in U, apply R^T sequentially and store V.
Algorithm RT in diagram form
(Diagram: input → column processing → row processing → output stages.)
Algorithm RT performance
• Completion takes 4r·hw⁄(sc) steps: four passes, each performing r operations per pixel over hw pixels, spread across s·c cores.
• Bandwidth usage is 8hw in total: each of the four passes reads and writes the full image.
• s = number of streaming multiprocessors
• c = number of cores (per processor)
• w = width of the input image
• h = height of the input image
• r = order of the applied filter
Block notation (1)
• Partition the input image into b × b blocks.
• b = number of threads in a warp (= 32)
• What means what?
  • B_{m,n}(X) = block of matrix X with index (m, n)
  • P_{m,n}(X) = column-prologue submatrix
  • E_{m,n}(X) = column-epilogue submatrix
• For rows we have (similar) transposed operators: P^T and E^T.
Block notation (2)
• Tail and head operators T and H select prologue- and epilogue-shaped submatrices from a block: the last r rows and the first r rows, respectively.
Block notation (3)
• Result: a blocked version of the problem definition, applied block by block:
  B_{m,n}(Y) = F(P_{m−1,n}(Y), B_{m,n}(X))
  B_{m,n}(Z) = R(B_{m,n}(Y), E_{m+1,n}(Z))
  B_{m,n}(U) = F^T(P^T_{m,n−1}(U), B_{m,n}(Z))
  B_{m,n}(V) = R^T(B_{m,n}(U), E^T_{m,n+1}(V))
Some useful key properties (1)
• Superposition (based on linearity): F(p, x) = F(p, 0) + F(0, x).
• The effects of the input and of the prologue/epilogue on the output can be computed independently.
Some useful key properties (2)
• The effects of prologues and epilogues can be expressed as matrix products, e.g., F(p, 0) = A_F·p and R(0, e) = A_R·e.
• A_F and A_R are precomputed matrices that depend only on the feedback coefficients of filters F and R, respectively (for a zero-size block, the matrix is the identity). Details in the paper.
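For a first-order filter the matrix is easy to write down explicitly (an illustrative sketch; the name A_F and the helper functions are ours): the effect of a scalar prologue p on output position k of a block is (−a)^(k+1)·p, so the "matrix" is just a precomputed vector of powers.

```python
# Sketch: for y_k = x_k - a*y_(k-1), the zero-input response to a
# scalar prologue p over a block of b outputs is A_F * p, where
# A_F = [(-a)^1, (-a)^2, ..., (-a)^b] depends only on the coefficient.
def prologue_matrix(a, b):
    return [(-a) ** (k + 1) for k in range(b)]

def prologue_effect(p, a, b):
    return [coef * p for coef in prologue_matrix(a, b)]
```

Because A_F depends only on the feedback coefficients, it is computed once and reused for every block.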
Inter-block parallelism (1)
• Perform each block computation independently: the output block is the superposition of the block's own response and the effect of the prologue (the tail of the previous output block).
Inter-block parallelism (2)
• The first term, computed from the block data alone, is the incomplete causal output; the second term adds the effect of the prologue to complete it.
Inter-block parallelism (3)
Recall:
(1) the prologue of each block follows from the previous block's incomplete output and prologue;
(2) each output block is the superposition of its incomplete output and the effect of its prologue.
Algorithm 1:
1.1 In parallel for all m, compute and store each incomplete causal output block.
1.2 Sequentially for each m, compute and store the prologues according to (1), using the previously computed incomplete blocks.
1.3 In parallel for all m, compute and store each output block using (2) and the previously computed prologues.
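A first-order, one-row sketch of Algorithm 1 (plain Python standing in for the CUDA kernels; zero boundary prologue and scalar prologues assumed, which is the r = 1 case):

```python
# Sketch of Algorithm 1 for y_k = x_k - a*y_(k-1) across blocks of b:
# 1.1 filter each block with zero prologue (parallelizable),
# 1.2 fix the block prologues sequentially with the precomputed
#     effect of a prologue on a block's last element,
# 1.3 superpose each prologue's contribution onto its block (parallel).
def causal(xs, a, prologue=0.0):
    y, prev = [], prologue
    for x in xs:
        prev = x - a * prev
        y.append(prev)
    return y

def algorithm1(xs, a, b):
    blocks = [xs[i:i + b] for i in range(0, len(xs), b)]
    # 1.1: incomplete outputs under zero initial conditions
    inc = [causal(blk, a, 0.0) for blk in blocks]
    # 1.2: propagate the true prologue of each block sequentially
    prologues, p = [], 0.0
    for y in inc:
        prologues.append(p)
        p = y[-1] + (-a) ** len(y) * p   # completed last element
    # 1.3: add each prologue's zero-input response to its block
    out = []
    for y, p in zip(inc, prologues):
        out.extend(v + (-a) ** (k + 1) * p for k, v in enumerate(y))
    return out
```

Only step 1.2 is sequential, and it touches r values per block rather than b, which is what exposes the inter-block parallelism.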
Inter-block parallelism (4)
• Processing all rows and columns using causal and anticausal filter pairs requires 4 successive applications of Algorithm 1.
• There are many independent tasks (one per block), which hides memory-access latency.
• However, the memory bandwidth usage is now significantly more than Algorithm RT's: each stage stores its intermediate results in global memory.
• Can this be solved?
Kernel fusion (1)
• Original idea: Kirk & Hwu.
• Use the output of one kernel as input for the next without going through global memory.
• Fused kernel: code from both kernels, but keep intermediate results in shared memory.
Kernel fusion (2)
• Use Algorithm 1 for all filters, then fuse:
  • Fuse the last stage of F with the first stage of R.
  • Fuse the last stage of R with the first stage of F^T.
  • Fuse the last stage of F^T with the first stage of R^T.
• We aimed for bandwidth reduction. Did it work? Comparing the total I/O of Algorithm 1 against Algorithm 2: yes, it did!
Kernel fusion (3), Algorithm 2*
(Diagram: input → fused stages with fix-up passes between them → output.)
*For the full algorithm in text, please see the paper.
Kernel fusion (4)
• Further I/O reduction is still possible by recomputing intermediary results instead of storing them in memory.
• More bandwidth reduction (= good).
• More steps (≈ bad*).
• Bandwidth usage is less than Algorithm RT(!) but involves more computations.
*But future hardware may tip the balance in favor of more computations.
Causal-Anticausal Overlapping
• Overlapping is introduced to reduce I/O to global memory.
• It is possible to work with twice-incomplete anticausal epilogues, computed directly from the incomplete causal output blocks.
• This is called causal-anticausal overlapping.
Causal-Anticausal Overlapping
• Recall that we can express the filter so that the effects of the input and of the prologue or epilogue can be computed independently and later added together (superposition).
Causal-Anticausal Overlapping • Using the previous properties, we can split the dependency chains of anticausal epilogues.
Causal-Anticausal Overlapping
• This expression can be further simplified; for the final form, please see the paper.