GPU-Efficient Recursive Filtering and Summed-Area Tables

##### Presentation Transcript

1. GPU-Efficient Recursive Filtering and Summed-Area Tables • Jeremiah van Oosten • Reinier van Oeveren

2. Table of Contents • Introduction • Related Works • Prefix Sums and Scans • Recursive Filtering • Summed-Area Tables • Problem Definition • Parallelization Strategies • Baseline (Algorithm RT) • Block Notation • Inter-block Parallelism • Kernel Fusion (Algorithm 2) • Overlapping • Causal-Anticausal Overlapping (Algorithms 3 & 4) • Row-Column Causal-Anticausal Overlapping (Algorithm 5) • Summed-Area Tables • Overlapped Summed-Area Tables (Algorithm SAT) • Results • Conclusion

3. Introduction

4. Introduction • Linear filtering is commonly used to blur, sharpen, or down-sample images. • A direct implementation evaluating a filter of support d on an h × w image has a cost of O(hwd).

5. Introduction • The cost of the image filter can be reduced by using a recursive filter, in which previous results are used to compute the current value. • The cost is then reduced to O(hwr), where r is the number of recursive feedbacks.

6. Recursive Filters • At each step, the filter produces an output element by a linear combination of the input element and previously computed output elements, for a total cost of O(hwr).
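
The recurrence described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: the name `causal_filter`, the sign convention (feedback terms are subtracted), and the prologue layout (oldest value first) are assumptions made for the example.

```python
def causal_filter(x, feedback, prologue=None):
    """Order-r causal recursive filter (assumed convention):
    y[i] = x[i] - sum(feedback[k-1] * y[i-k] for k in 1..r)."""
    r = len(feedback)
    # The prologue supplies the r outputs that precede the signal;
    # it defaults to zeros. history[-k] is always y[i-k].
    history = list(prologue) if prologue is not None else [0.0] * r
    y = []
    for xi in x:
        acc = xi
        for k in range(1, r + 1):
            acc -= feedback[k - 1] * history[-k]
        y.append(acc)
        history.append(acc)
    return y
```

Each output costs r multiply-adds regardless of how far the filter's influence reaches, which is where the O(hwr) total comes from. With a single feedback coefficient of -1 the filter reduces to a running sum: `causal_filter([1, 2, 3, 4], [-1.0])` gives `[1.0, 3.0, 6.0, 10.0]`.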

7. Recursive Filters • Applications of recursive filters: • Low-pass filtering, such as with Gaussian kernels • Inverse convolution • Summed-area tables • [Figure: an input image and its blurred result after recursive filtering]

8. Causality • Recursive filters can be causal or anticausal (or non-causal). • Causal filters operate on previous values. • Anticausal filters operate on “future” values.

9. Anticausal • Anticausal filters operate on “future” values.
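
To make the direction of the dependency concrete, here is a small Python sketch of an anticausal filter that walks the signal from its last element to its first. The name `anticausal_filter`, the subtracted-feedback sign convention, and the epilogue layout (nearest future value first) are assumptions for illustration only.

```python
def anticausal_filter(y, feedback, epilogue=None):
    """Order-r anticausal recursive filter (assumed convention):
    z[i] = y[i] - sum(feedback[k-1] * z[i+k] for k in 1..r)."""
    r = len(feedback)
    # The epilogue supplies the r outputs that follow the signal;
    # it defaults to zeros. future[k-1] is always z[i+k].
    future = list(epilogue) if epilogue is not None else [0.0] * r
    z = [0.0] * len(y)
    for i in range(len(y) - 1, -1, -1):
        acc = y[i]
        for k in range(1, r + 1):
            acc -= feedback[k - 1] * future[k - 1]
        z[i] = acc
        future.insert(0, acc)  # z[i] becomes the nearest "future" value
    return z
```

With a feedback coefficient of -1 this produces suffix sums: `anticausal_filter([1, 2, 3, 4], [-1.0])` gives `[10.0, 9.0, 7.0, 4.0]`.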

10. Filter Sequences • It is often required to perform a sequence of recursive image filters: • Independent columns: causal, then anticausal • Independent rows: causal, then anticausal • [Diagram labels: P, P’, X, Y, Z, U, V, E’, E]

11. Maximizing Parallelism • The naïve approach to solving the sequence of recursive filters does not sufficiently utilize the processing cores of the GPU. • The latest GPU from NVIDIA has 2,668 shader cores. Processing even large images (2048x2048) will not make full use of all available cores. • Underutilization of the GPU cores does not allow for latency hiding. • We need a way to make better use of the GPU without increasing I/O.

12. Overlapping • The paper “GPU-Efficient Recursive Filtering and Summed-Area Tables” by Diego Nehab et al. introduces a new algorithmic framework that reduces memory bandwidth by overlapping computation over the full sequence of recursive filters.

13. Block Partitioning • Partition the image into 2D blocks of size .

14. Related Works

15. Prefix Sums and Scans • A prefix sum is a simple case of a first-order recursive filter. • A scan generalizes the recurrence using an arbitrary binary associative operator. • Parallel prefix sums and scans are important building blocks for numerous algorithms [Iverson 1962; Stone 1971; Blelloch 1989; Sengupta et al. 2007]. • An optimized implementation comes with the CUDPP library [2011].
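
As a small illustration of the difference between a prefix sum and a general scan, the sequential Python sketch below takes the operator as a parameter. The function name `inclusive_scan` is hypothetical, and this is only a reference for what a scan computes, not the parallel GPU implementation the cited works describe.

```python
from operator import add


def inclusive_scan(values, op=add):
    """Inclusive scan: out[i] = values[0] op values[1] op ... op values[i].
    op must be associative; with op=add this is the prefix sum."""
    out = []
    for v in values:
        out.append(v if not out else op(out[-1], v))
    return out
```

For example, `inclusive_scan([3, 1, 4, 1, 5])` gives `[3, 4, 8, 9, 14]`, while `inclusive_scan([3, 1, 4, 1, 5], max)` gives the running maximum `[3, 3, 4, 4, 5]`. Associativity is what lets a parallel implementation regroup the operations.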

16. Recursive Filtering • A generalization of the prefix sum using a weighted combination of prior outputs. • This can be implemented as a scan operation with redefined basic operators. • Ruijters and Thévenaz [2010] exploit parallelism across the rows and columns of the input.
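
One way to read "a scan with redefined basic operators" is sketched below for a first-order filter, under the assumed convention y[i] = x[i] - a * y[i-1]. Each input element is turned into an affine map y ↦ m·y + c, and the scan operator is map composition, which is associative. The names `combine` and `filter_via_scan` are hypothetical, and the scan itself is written sequentially for clarity.

```python
def combine(first, second):
    """Compose two affine maps y -> m*y + c, applying `first` then `second`."""
    m1, c1 = first
    m2, c2 = second
    return (m2 * m1, m2 * c1 + c2)


def filter_via_scan(x, a, y_prev=0.0):
    """First-order causal filter y[i] = x[i] - a*y[i-1], expressed as a scan."""
    # Element i is the map that turns y[i-1] into y[i].
    elems = [(-a, xi) for xi in x]
    # Inclusive scan under `combine`: entry i maps y[-1] directly to y[i].
    acc = []
    for e in elems:
        acc.append(e if not acc else combine(acc[-1], e))
    return [m * y_prev + c for m, c in acc]
```

Because `combine` is associative, the sequential loop could be replaced by any parallel scan without changing the result.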

17. Recursive Filtering • Sung and Mitra [1986] use block parallelism and split the computation into two parts: • One computation based only on the block data, assuming zero initial conditions. • One computation based only on the initial conditions, assuming zero block data.
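
The two-part split can be checked numerically. The sketch below uses a first-order filter with an assumed subtracted-feedback convention; `first_order_filter` and the sample values are hypothetical and chosen only so the arithmetic is exact.

```python
def first_order_filter(x, a, prologue):
    """y[i] = x[i] - a * y[i-1], with y[-1] given by `prologue`."""
    out, prev = [], prologue
    for s in x:
        prev = s - a * prev
        out.append(prev)
    return out


x, a, p = [1.0, 2.0, 3.0], -0.5, 4.0

# Full computation: block data and initial condition together.
full = first_order_filter(x, a, p)
# Part 1: block data only, zero initial condition.
from_data = first_order_filter(x, a, 0.0)
# Part 2: initial condition only, zero block data.
from_initial = first_order_filter([0.0] * len(x), a, p)
# By linearity, the two parts add up to the full result.
recombined = [u + v for u, v in zip(from_data, from_initial)]
```

Here `full` and `recombined` are both `[3.0, 3.5, 4.75]`. The first part needs no information from neighbouring blocks, which is what makes the block computations independent.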

18. Summed-Area Tables • Summed-area tables enable averaging rectangular regions of pixels with a constant number of reads. • [Figure: a width × height rectangle whose corner entries UL, UR, LL, and LR are combined with + and − signs]
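
A minimal Python sketch of the idea follows. It pads the table with a leading row and column of zeros so that every rectangle sum is exactly four reads; the padding, the function names, and the half-open rectangle convention are choices made for this example, not details taken from the slides.

```python
def summed_area_table(img):
    """sat[i][j] holds the sum of img[0:i][0:j] (padded with a zero row/column)."""
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for i in range(h):
        for j in range(w):
            sat[i + 1][j + 1] = (img[i][j] + sat[i][j + 1]
                                 + sat[i + 1][j] - sat[i][j])
    return sat


def rect_sum(sat, top, left, bottom, right):
    """Sum of img[top:bottom][left:right] using four table reads."""
    return (sat[bottom][right] - sat[top][right]
            - sat[bottom][left] + sat[top][left])


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
sat = summed_area_table(img)
lower_right_sum = rect_sum(sat, 1, 1, 3, 3)   # 5 + 6 + 8 + 9
lower_right_avg = lower_right_sum / 4
```

The number of reads stays at four no matter how large the rectangle is, which is what makes box averaging constant-time.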

19. Summed-Area Tables • The paper “Fast Summed-Area Table Generation…” by Justin Hensley et al. (2005) describes a method called recursive doubling, which requires multiple passes over the input image (a 256x256 image requires 16 passes). • [Figure: passes alternating between Image A and Image B]
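
The pass count can be reproduced with a one-dimensional sketch of recursive doubling, where each pass adds to every element the element `offset` positions before it and the offset doubles between passes. This is an illustrative CPU emulation under that assumption; `recursive_doubling_scan` is a hypothetical name, and building a new list per pass stands in for alternating between two images.

```python
def recursive_doubling_scan(row):
    """Prefix sum by recursive doubling; returns (result, number of passes)."""
    data = list(row)
    passes = 0
    offset = 1
    while offset < len(data):
        # One pass: every element reads itself and one element `offset` back.
        data = [data[i] + (data[i - offset] if i >= offset else 0)
                for i in range(len(data))]
        offset *= 2
        passes += 1
    return data, passes


result, passes = recursive_doubling_scan([1] * 256)
```

For a row of 256 elements this takes 8 passes; doing the same along the other dimension gives the 16 passes quoted for a 256x256 image.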

20. Summed-Area Tables • In 2010, Justin Hensley extended his 2005 implementation to compute shaders, taking more samples per pass and storing the result in intermediate shared memory. A 256x256 image then requires only 4 passes when reading 16 samples per pass.
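
The pass counts on this slide and the previous one follow from simple arithmetic, assuming each pass multiplies the span already summed by the number of samples read. The helper `passes_needed` below is a hypothetical name used only to check that arithmetic.

```python
def passes_needed(size, samples_per_pass):
    """Passes along one dimension until the summed span covers `size` elements."""
    passes, covered = 0, 1
    while covered < size:
        covered *= samples_per_pass
        passes += 1
    return passes


# Two dimensions (rows and columns) for a 256x256 image.
passes_2_samples = 2 * passes_needed(256, 2)     # recursive doubling
passes_16_samples = 2 * passes_needed(256, 16)   # 16 samples per pass
```

Under this model, 2 samples per pass gives 16 passes and 16 samples per pass gives 4, matching the figures quoted in the slides.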

21. Problem Definition

22. Problem Definition • Causal recursive filters of order r are characterized by a set of r feedback coefficients. • Given a prologue vector and an input vector of any size, the filter produces an output vector of the same size as the input.

23. Problem Definition • Causal recursive filters depend on a prologue vector. • The anticausal filter is similar: given an input vector and an epilogue vector, it produces an output vector of the same size, with each output element depending on later output elements.
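
The formulas for these two slides did not survive in the transcript. A plausible reconstruction in the standard form of an order-r recursive filter is given below; the symbols F, R, a_k, a'_k, p, e, and n, and the sign of the feedback terms, are assumptions and may differ from the original slides.

$$
\mathbf{y} = F(\mathbf{p}, \mathbf{x}): \qquad y_i = x_i - \sum_{k=1}^{r} a_k \, y_{i-k}, \qquad i = 0, \dots, n-1,
$$

where the values $y_{-1}, \dots, y_{-r}$ are read from the prologue $\mathbf{p}$, and

$$
\mathbf{z} = R(\mathbf{y}, \mathbf{e}): \qquad z_i = y_i - \sum_{k=1}^{r} a'_k \, z_{i+k}, \qquad i = n-1, \dots, 0,
$$

where the values $z_{n}, \dots, z_{n+r-1}$ are read from the epilogue $\mathbf{e}$.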

24. Problem Definition • For row processing, we define an extended causal filter and anticausal filter.

25. Problem Definition • With these definitions, we are able to formulate the problem of applying the full sequence of four recursive filters (down, up, right, left): • Independent columns: causal, then anticausal • Independent rows: causal, then anticausal • [Diagram labels: P, P’, X, Y, Z, U, V, E’, E]
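
Using the labels that survive from the diagram, the four-filter sequence can be written compactly as below. The operator names F and R (column causal and anticausal filters) and their transposed, row-wise counterparts are assumed notation, as is the reading of P, P′ as prologues and E, E′ as epilogues.

$$
\begin{aligned}
Y &= F(P, X) && \text{(down: causal, columns)} \\
Z &= R(Y, E) && \text{(up: anticausal, columns)} \\
U &= F^{T}(P', Z) && \text{(right: causal, rows)} \\
V &= R^{T}(U, E') && \text{(left: anticausal, rows)}
\end{aligned}
$$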

26. Problem Definition • The goal is to implement this algorithm on the GPU to make full use of all available resources. • Maximize occupancy by splitting the problem up to make use of all cores. • Reduce I/O to global memory. • Must break the dependency chain in order to increase task parallelism. • Primary design goal: Increase the amount of parallelism without increasing memory I/O.

27. Prior Parallelization Strategies

28. Prior Parallelization strategies • Baseline algorithm ‘RT’ • Block notation • Inter-block parallelism • Kernel fusion

29. Algorithm Ruijters & Thévenaz • Independent row and column processing • Step RT1: In parallel for each column in X, apply the causal filter sequentially and store Y. • Step RT2: In parallel for each column in Y, apply the anticausal filter sequentially and store Z. • Step RT3: In parallel for each row in Z, apply the causal filter sequentially and store U. • Step RT4: In parallel for each row in U, apply the anticausal filter sequentially and store V.
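
The four steps can be emulated on the CPU as follows. This sketch assumes a first-order filter with subtracted feedback and zero prologues and epilogues; the list comprehensions stand in for the "in parallel for each column/row" loops, and all function names are hypothetical.

```python
def causal(seq, a):
    """y[i] = seq[i] - a * y[i-1], with a zero prologue."""
    out, prev = [], 0.0
    for s in seq:
        prev = s - a * prev
        out.append(prev)
    return out


def anticausal(seq, a):
    """Same recurrence run from the last element to the first (zero epilogue)."""
    return causal(seq[::-1], a)[::-1]


def transpose(m):
    return [list(col) for col in zip(*m)]


def algorithm_rt(image, a):
    cols = transpose(image)                    # one list per column
    cols = [causal(c, a) for c in cols]        # step 1: columns, causal
    cols = [anticausal(c, a) for c in cols]    # step 2: columns, anticausal
    rows = transpose(cols)                     # back to one list per row
    rows = [causal(r, a) for r in rows]        # step 3: rows, causal
    rows = [anticausal(r, a) for r in rows]    # step 4: rows, anticausal
    return rows
```

Within each step the columns (or rows) are independent, but every element of a single column is processed sequentially, so the available parallelism is limited to the image width or height.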

30. Algorithm RT in diagram form • [Diagram: input → stages → output, with column processing followed by row processing]

31. Algorithm RT performance • Completion takes 4r ) steps • Bandwidth usage in total is • = streaming multiprocessors • = number of cores (per processor) • w = width of the input image • h = height of the input image • r = order of the applied filter

32. Block notation (1) • Partition the input image into blocks • = number of threads in a warp (= 32) • What means what? • = block in matrix with index • = column-prologue submatrix • = column-epilogue submatrix • For rows we have (similar) transposed operators: and

33. Block notation (1 cont’d)

34. Block notation (2) • Tail and head operators: selecting prologue- and epilogue-shaped submatrices from

35. Block notation (3) • Result: blocked version of the problem definition

36. Some useful key properties (1) • Superposition (based on linearity) • Effects of the input and prologue/epilogue on the output can be computed independently

37. Some useful key properties (2) • Express as matrix products • For any , is the identity matrix • Precomputed matrices that depend only on the feedback coefficients of filters and respectively. Details in paper.

38. Inter-block parallelism (1) • Perform block computation independently • [Equation labels: output block, superposition, prologue / tail of previous output block]

39. Inter-block parallelism (2) • [Equation labels: first term, second term, incomplete causal output]

40. Inter-block parallelism (3) • Recall equations (1) and (2). • Algorithm 1 • 1.1 In parallel for all m, compute and store each incomplete prologue. • 1.2 Sequentially for each m, compute and store the complete prologue according to (1), using the previously computed incomplete prologues. • 1.3 In parallel for all m, compute and store the output block using (2) and the previously computed complete prologues.
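
The three stages can be sketched for the simplest case, a one-dimensional first-order causal filter with the assumed convention y[i] = x[i] - a * y[i-1]. Equations (1) and (2) are missing from the transcript, so the correction factor used in the sequential stage is derived here directly from the recurrence rather than copied from the slides; all names are hypothetical and the block loops are written sequentially.

```python
def filter_block(block, a, prologue):
    """y[i] = block[i] - a * y[i-1], with y[-1] given by `prologue`."""
    out, prev = [], prologue
    for s in block:
        prev = s - a * prev
        out.append(prev)
    return out


def blocked_causal_filter(x, a, b):
    blocks = [x[i:i + b] for i in range(0, len(x), b)]

    # Stage 1 (independent per block): filter with a zero prologue and keep
    # only the tail, i.e. the last output of each incomplete block.
    incomplete_tails = [filter_block(blk, a, 0.0)[-1] for blk in blocks]

    # Stage 2 (sequential over blocks): complete each tail. A prologue p
    # contributes (-a)**len(blk) * p to the last output of a block.
    prologues = [0.0]
    for blk, tail in zip(blocks, incomplete_tails):
        prologues.append(tail + (-a) ** len(blk) * prologues[-1])

    # Stage 3 (independent per block): filter again with the correct prologue.
    out = []
    for blk, p in zip(blocks, prologues):
        out.extend(filter_block(blk, a, p))
    return out
```

Only the second stage is sequential, and it touches one value per block instead of every sample. The price is that each block is read and filtered twice, which is the extra bandwidth the following slides set out to recover. For higher-order filters the tail would hold r values and the scalar factor would become a small matrix.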

41. Inter-block parallelism (4) • Processing all rows and columns using causal and anticausal filter pairs requires 4 successive applications of Algorithm 1. • There are independent tasks, which hides memory access latency. • However, the memory bandwidth usage is now significantly more than that of Algorithm RT. • This can be solved.

42. Kernel fusion (1) • Original idea: Kirk & Hwu [2010] • Use the output of one kernel as input for the next without going through global memory. • Fused kernel: code from both kernels, but keep intermediate results in shared memory.

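43. Kernel fusion (2) • Use Algorithm 1 for all filters and fuse adjacent stages. • Fuse the last stage of the column causal filter with the first stage of the column anticausal filter • Fuse the last stage of the column anticausal filter with the first stage of the row causal filter • Fuse the last stage of the row causal filter with the first stage of the row anticausal filter • We aimed for bandwidth reduction. Did it work? • Algorithm 1: • Algorithm 2: yes, it did!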

44. Kernel fusion (3), Algorithm 2* • [Diagram: input → stages with four “fix” steps → output] • *For the full algorithm in text, please see the paper.

45. Kernel fusion (4) • Further I/O reduction is still possible by recomputing intermediary results instead of storing them in memory. • More bandwidth reduction: (= good) • No. of steps: (≈ bad*) • Bandwidth usage is less than Algorithm RT (!) but involves more computations. • *But future hardware may tip the balance in favor of more computations.

46. Overlapping

47. Causal-Anticausal Overlapping • Overlapping is introduced to reduce I/O to global memory. • It is possible to work with twice-incomplete anticausal epilogues, computed directly from the incomplete causal output block. • This is called causal-anticausal overlapping.

48. Causal-Anticausal Overlapping • Recall that we can express the filter so that the contributions of the input and of the prologue or epilogue can be computed independently and later added together.

49. Causal-Anticausal Overlapping • Using the previous properties, we can split the dependency chains of anticausal epilogues.

50. Causal-Anticausal Overlapping • Which can be further simplified to: