## GPU-Efficient Recursive Filtering and Summed-Area Tables


**GPU-Efficient Recursive Filtering and Summed-Area Tables**

Jeremiah van Oosten, Reinier van Oeveren

**Table of Contents**

• Introduction
• Related Works
  • Prefix Sums and Scans
  • Recursive Filtering
  • Summed-Area Tables
• Problem Definition
• Parallelization Strategies
  • Baseline (Algorithm RT)
  • Block Notation
  • Inter-block Parallelism
  • Kernel Fusion (Algorithm 2)
• Overlapping
  • Causal-Anticausal Overlapping (Algorithms 3 & 4)
  • Row-Column Causal-Anticausal Overlapping (Algorithm 5)
• Summed-Area Tables
  • Overlapped Summed-Area Tables (Algorithm SAT)
• Results
• Conclusion

**Introduction**

• Linear filtering is commonly used to blur, sharpen, or down-sample images.
• A direct implementation evaluating a filter of support d on an h × w image has a cost of O(hwd).
• The cost of the image filter can be reduced using a recursive filter, in which case previous results are used to compute the current value.
• Cost is reduced to O(hwr), where r is the number of recursive feedbacks.

**Recursive Filters**

• At each step, the filter produces an output element by a linear combination of the input element and previously computed output elements, at a cost of O(hwr).
• Applications of recursive filters:
  • Low-pass filtering, e.g. approximating Gaussian kernels
  • Inverse convolution
  • Summed-area tables

**Causality**

• Recursive filters can be causal, anticausal, or non-causal.
• Causal filters operate on previous values.
• Anticausal filters operate on "future" values, processing the input in reverse order.

**Filter Sequences**

• It is often required to perform a sequence of recursive image filters:
  • Independent columns: a causal pass followed by an anticausal pass.
  • Independent rows: a causal pass followed by an anticausal pass.
• (Diagram: the input X is filtered down into Y using prologue P, up into Z using epilogue E, then right into U using prologue P′, and left into V using epilogue E′.)

**Maximizing Parallelism**

• The naïve approach to solving the sequence of recursive filters does not sufficiently utilize the processing cores of the GPU.
• The latest GPU from NVIDIA has 2,668 shader cores. Processing even large images (2048 × 2048) one row or column at a time will not make full use of all available cores.
• Under-utilization of the GPU cores does not allow for latency hiding.
• We need a way to make better use of the GPU without increasing I/O.

**Overlapping**

• In the paper "GPU-Efficient Recursive Filtering and Summed-Area Tables", Diego Nehab et al. introduce a new algorithmic framework that reduces memory bandwidth by overlapping computation over the full sequence of recursive filters.

**Block Partitioning**

• Partition the image into 2D blocks of size b × b, where b is the number of threads in a warp (32).

**Prefix Sums and Scans**

• A prefix sum is the simple case of a first-order recursive filter: each output element is the sum of all preceding input elements.
• A scan generalizes the recurrence using an arbitrary binary associative operator.
• Parallel prefix sums and scans are important building blocks for numerous algorithms [Iverson 1962; Stone 1971; Blelloch 1989; Sengupta et al. 2007].
• An optimized implementation comes with the CUDPP library [2011].

**Recursive Filtering**

• A generalization of the prefix sum using a weighted combination of prior outputs.
• This can be implemented as a scan operation with redefined basic operators.
• Ruijters and Thévenaz [2010] exploit parallelism across the rows and columns of the input.
• Sung and Mitra [1986] use block parallelism and split the computation into two parts:
  • One computation based only on the block data, assuming zero initial conditions.
  • One computation based only on the initial conditions, assuming zero block data.

**Summed-Area Tables**

• Summed-area tables enable averaging rectangular regions of pixels with a constant number of reads: the sum over a rectangle follows from the four corner values as LR − UR − LL + UL.
• The paper "Fast Summed-Area Table Generation…" by Justin Hensley et al. (2005) describes a method called recursive doubling, which requires multiple passes over the input image (a 256 × 256 image requires 16 passes to compute).
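The four-corner lookup above can be sketched in a few lines of Python. This is an illustration of why the query cost is constant, not the paper's GPU generation method; the function names are mine.

```python
def summed_area_table(img):
    """Build a SAT: sat[y][x] = sum of img over all rows <= y and cols <= x."""
    h, w = len(img), len(img[0])
    sat = [[0.0] * w for _ in range(h)]
    for y in range(h):
        row = 0.0  # running sum of the current row
        for x in range(w):
            row += img[y][x]
            sat[y][x] = row + (sat[y - 1][x] if y > 0 else 0.0)
    return sat

def box_sum(sat, y0, x0, y1, x1):
    """Sum over the inclusive rectangle [y0..y1] x [x0..x1] with four reads:
    LR - UR - LL + UL (the UL/UR/LL terms lie just outside the rectangle)."""
    total = sat[y1][x1]                       # LR
    if y0 > 0:
        total -= sat[y0 - 1][x1]              # UR
    if x0 > 0:
        total -= sat[y1][x0 - 1]              # LL
    if y0 > 0 and x0 > 0:
        total += sat[y0 - 1][x0 - 1]          # UL
    return total
```

Dividing the box sum by the rectangle's area gives the average with the same constant number of reads.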
**Summed-Area Tables**

• In 2010, Justin Hensley extended the 2005 implementation to compute shaders, taking more samples per pass and storing intermediate results in shared memory. A 256 × 256 image then requires only 4 passes when reading 16 samples per pass.

**Problem Definition**

• Causal recursive filters of order r are characterized by a set of feedback coefficients a_1, …, a_r.
• Given a prologue vector p and an input vector x of any size, the filter produces the output
  y(k) = x(k) − a_1·y(k−1) − … − a_r·y(k−r),
  where y has the same size as the input x and the values y(−1), …, y(−r) are taken from the prologue p.
• Similarly for the anticausal filter: given an input vector y and an epilogue vector e, the output vector z is defined by
  z(k) = y(k) − a′_1·z(k+1) − … − a′_r·z(k+r),
  where values past the end of z are taken from the epilogue e.
• For row processing, we define an extended causal filter and an extended anticausal filter that operate on rows rather than columns.
• With these definitions, we can formulate the problem of applying the full sequence of four recursive filters (down, up, right, left): filter the columns of the input X causally into Y (using prologue P) and anticausally into Z (using epilogue E), then filter the rows causally into U (using prologue P′) and anticausally into V (using epilogue E′).

**Problem Definition**

• The goal is to implement this algorithm on the GPU making full use of all available resources:
  • Maximize occupancy by splitting the problem up to make use of all cores.
  • Reduce I/O to global memory.
  • Break the dependency chain in order to increase task parallelism.
• Primary design goal: increase the amount of parallelism without increasing memory I/O.

**Prior Parallelization Strategies**

• Baseline algorithm 'RT'
• Block notation
• Inter-block parallelism
• Kernel fusion

**Algorithm RT (Ruijters & Thévenaz)**

Independent row and column processing:
• Step RT1: In parallel for each column of the input, apply the causal filter sequentially and store the result.
• Step RT2: In parallel for each column of that result, apply the anticausal filter sequentially and store the result.
• Step RT3: In parallel for each row of that result, apply the causal filter sequentially and store the result.
• Step RT4: In parallel for each row of that result, apply the anticausal filter sequentially and store the output.

**Algorithm RT in Diagram Form**

• (Diagram: input → column processing (causal, then anticausal) → row processing (causal, then anticausal) → output.)

**Algorithm RT Performance**

• Completion takes 4r·(…) steps; total bandwidth usage is (…), where:
  • … = number of streaming multiprocessors
  • … = number of cores (per processor)
  • w = width of the input image
  • h = height of the input image
  • r = order of the applied filter

**Block Notation (1)**

• Partition the input image into b × b blocks, where b is the number of threads in a warp (32).
• Notation (symbols as in the paper):
  • the block of the matrix with a given index
  • the column-prologue submatrix
  • the column-epilogue submatrix
• For rows we have similar transposed operators.

**Block Notation (2)**

• Tail and head operators select prologue- and epilogue-shaped submatrices from a block.

**Block Notation (3)**

• Result: a blocked version of the problem definition, expressing each of the four filter passes block by block.

**Some Useful Key Properties (1)**

• Superposition (based on linearity): the effects of the input and of the prologue/epilogue on the output can be computed independently and then added together.

**Some Useful Key Properties (2)**

• These effects can be expressed as matrix products.
• For any …, … is the identity matrix.
• The matrices are precomputed and depend only on the feedback coefficients of the causal and anticausal filters, respectively. Details in the paper.

**Inter-block Parallelism (1)**

• Perform the block computation independently: each output block is the superposition of the contribution of its own data and that of its prologue (the tail of the previous output block).

**Inter-block Parallelism (2)**

• The first term is the incomplete causal output, computed from the block data alone; the second term is the correction due to the prologue.

**Inter-block Parallelism (3)**

• Algorithm 1:
  • 1.1 In parallel for all m, compute and store each incomplete causal output block.
  • 1.2 Sequentially for each m, compute and store the block prologues according to (1), using the previously computed incomplete outputs.
  • 1.3 In parallel for all m, compute and store each output block using (2) and the previously computed prologues.

**Inter-block Parallelism (4)**

• Processing all rows and columns using causal and anticausal filter pairs requires 4 successive applications of Algorithm 1.
• There are many independent tasks, which hides memory-access latency. However…
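The three steps of Algorithm 1 can be sketched serially in Python for a first-order filter y(k) = x(k) − a·y(k−1) on a 1D signal. This is a minimal illustration under my own naming; in the paper, steps 1.1 and 1.3 run as parallel GPU kernels and the order-r case uses precomputed r × r matrices instead of the scalar factor (−a)^k.

```python
def blocked_first_order(x, a, b):
    """Inter-block computation of y(k) = x(k) - a*y(k-1) in blocks of size b."""
    blocks = [x[i:i + b] for i in range(0, len(x), b)]
    # Step 1.1 (parallel on a GPU): incomplete causal output per block,
    # assuming a zero prologue for every block.
    incomplete = []
    for blk in blocks:
        y, prev = [], 0.0
        for v in blk:
            prev = v - a * prev
            y.append(prev)
        incomplete.append(y)
    # Step 1.2 (sequential): propagate the true prologue of each block,
    # i.e. the completed last element of the previous block.
    prologues = [0.0]
    for y in incomplete[:-1]:
        prologues.append(y[-1] + prologues[-1] * (-a) ** len(y))
    # Step 1.3 (parallel on a GPU): superpose the prologue's contribution
    # p * (-a)^(k+1) onto each incomplete output element.
    out = []
    for p, y in zip(prologues, incomplete):
        out.extend(v + p * (-a) ** (k + 1) for k, v in enumerate(y))
    return out
```

With a = −1 this reduces to a blocked prefix sum, matching the slide's observation that the prefix sum is the first-order special case.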
• The total memory bandwidth usage is now significantly more than that of Algorithm RT. This can be solved with kernel fusion.

**Kernel Fusion (1)**

• Original idea: Kirk & Hwu [2010].
• Use the output of one kernel as input for the next without going through global memory.
• A fused kernel contains the code from both kernels but keeps intermediate results in shared memory.

**Kernel Fusion (2)**

• Use Algorithm 1 for all four filters, and fuse adjacent stages:
  • Fuse the last stage of the causal column pass with the first stage of the anticausal column pass.
  • Fuse the last stage of the anticausal column pass with the first stage of the causal row pass.
  • Fuse the last stage of the causal row pass with the first stage of the anticausal row pass.
• We aimed for bandwidth reduction. Did it work? Comparing the bandwidth of Algorithm 1 with that of Algorithm 2: yes, it did.

**Kernel Fusion (3): Algorithm 2***

• (Diagram: input → fused stages with fix-up passes in between → output.)
*For the full algorithm in text, please see the paper.

**Kernel Fusion (4)**

• Further I/O reduction is still possible by recomputing intermediary results instead of storing them in memory.
• More bandwidth reduction (good), but a larger number of steps (bad*).
• Bandwidth usage becomes less than that of Algorithm RT(!) but involves more computation.
  *Future hardware may tip the balance in favor of more computation.

**Causal-Anticausal Overlapping**

• Overlapping is introduced to reduce I/O to global memory.
• It is possible to work with twice-incomplete anticausal epilogues, computed directly from the incomplete causal output blocks.
• This is called causal-anticausal overlapping.

**Causal-Anticausal Overlapping**

• Recall that we can express the filter so that the contributions of the input and of the prologue or epilogue can be computed independently and later added together.

**Causal-Anticausal Overlapping**

• Using the previous properties, we can split the dependency chains of the anticausal epilogues, which can be further simplified (see the paper for the resulting expressions).
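The superposition property that overlapping relies on can be checked numerically for a first-order filter. This is an illustrative Python sketch with my own function name and test values, not the paper's formulation:

```python
def causal(p, x, a):
    """First-order causal filter y(k) = x(k) - a*y(k-1), with y(-1) = p."""
    y, prev = [], p
    for v in x:
        prev = v - a * prev
        y.append(prev)
    return y

# Superposition (linearity): the effect of the input (with zero prologue)
# and the effect of the prologue (with zero input) add up to the full output.
a, p, x = -0.5, 2.0, [1.0, 1.0, 1.0]
full = causal(p, x, a)
input_only = causal(0.0, x, a)                # contribution of the block data
prologue_only = causal(p, [0.0] * len(x), a)  # contribution of the prologue
assert full == [u + v for u, v in zip(input_only, prologue_only)]
```

It is exactly this decomposition that lets the overlapped algorithms compute incomplete outputs first and patch in the prologue and epilogue contributions later.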