GPU-Efficient Recursive Filtering and Summed-Area Tables

Presentation Transcript

  1. GPU-Efficient Recursive Filtering and Summed-Area Tables • D. Nehab (IMPA), A. Maximo (IMPA), R. S. Lima (Digitok), H. Hoppe (Microsoft Research)

  2. Recursive filters • Linear, shift-invariant filters • But use feedback from earlier outputs [Figure labels: input, prologue, output]

  3. Recursive filters • Linear, shift-invariant filters • But use feedback from earlier outputs • Sequential dependency chain [Figure labels: input, prologue, output]
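
The feedback and the sequential dependency chain can be sketched in a few lines. This is a minimal illustration, assuming the convention y[i] = x[i] - sum_k a[k] * y[i-k]; the sign convention and the name `causal_filter` are assumptions made here, not taken from the slides.

```python
def causal_filter(x, a, prologue):
    """Causal recursive filter of order r = len(a) (assumed convention).

    y[i] = x[i] - sum_{k=1..r} a[k-1] * y[i-k]
    `prologue` holds the r outputs that precede x, oldest first.
    """
    r = len(a)
    hist = list(prologue)  # hist[-1] is the most recent output
    y = []
    for xi in x:
        # Each output needs the previous r outputs: a sequential chain.
        yi = xi - sum(a[k] * hist[-1 - k] for k in range(r))
        y.append(yi)
        hist.append(yi)
    return y


# Impulse response of a first-order filter with feedback coefficient -0.5.
print(causal_filter([1.0, 0.0, 0.0, 0.0], [-0.5], [0.0]))
```

Because every output depends on the one before it, a single row or column cannot be split across threads without extra work, which is the problem the later slides address.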

  4. Applications of recursive filtering • B-Spline (or other) interpolation [Figure labels: input, recursive preprocessing step, coefficients, interpolation (from coefficients)]
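
As an illustration of the recursive preprocessing step, the sketch below computes cubic B-spline coefficients with one causal and one anticausal first-order pass. It relies on the standard cubic B-spline prefilter pole sqrt(3) - 2 and overall gain 6; the zero boundary initialization and the function names are simplifying assumptions, so values near the borders are only approximate.

```python
import math

Z1 = math.sqrt(3.0) - 2.0  # pole of the cubic B-spline prefilter


def cubic_bspline_coefficients(s):
    """Causal then anticausal first-order passes (zero boundaries assumed)."""
    n = len(s)
    c = [0.0] * n
    prev = 0.0
    for k in range(n):  # causal pass, front to back
        prev = s[k] + Z1 * prev
        c[k] = prev
    nxt = 0.0
    for k in range(n - 1, -1, -1):  # anticausal pass, back to front
        nxt = Z1 * (nxt - c[k])
        c[k] = nxt
    return [6.0 * v for v in c]  # overall gain of the prefilter


def interpolate_at_sample(c, k):
    """Cubic B-spline evaluated at integer position k (weights 1/6, 4/6, 1/6)."""
    return (c[k - 1] + 4.0 * c[k] + c[k + 1]) / 6.0


# An impulse far from the borders is reproduced at the sample positions.
signal = [0.0] * 41
signal[20] = 1.0
coeffs = cubic_bspline_coefficients(signal)
print(round(interpolate_at_sample(coeffs, 20), 9))
```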

  5. Applications of recursive filtering • B-Spline (or other) interpolation • Fast, wide, Gaussian-blur approximation • Summed-area tables [Figure labels: input, recursive filters, blurred]

  6. Causality and order • Recursive filters can be causal or anticausal • Causal goes forward, anticausal in reverse direction • Filter order is simply the number r of feedbacks [Figure labels: input, epilogue, output]
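
Direction and order can both be made explicit in one small routine. The sketch below is an assumption-laden illustration (the name `recursive_filter`, its parameters, and the sign convention are hypothetical): the order r is the number of feedback coefficients, and the anticausal case is the same recurrence run from the end, seeded by an epilogue instead of a prologue.

```python
def recursive_filter(x, a, boundary, anticausal=False):
    """Recursive filter of order r = len(a), run forward or in reverse.

    `boundary` is the prologue (causal) or epilogue (anticausal): the r
    outputs just outside the sequence, ordered so that boundary[-1] is the
    one adjacent to the first element processed.
    """
    r = len(a)
    seq = reversed(x) if anticausal else x
    hist = list(boundary)
    out = []
    for v in seq:
        o = v - sum(a[k] * hist[-1 - k] for k in range(r))
        out.append(o)
        hist.append(o)
    return out[::-1] if anticausal else out


# Second-order example (r = 2): same coefficients, two directions.
print(recursive_filter([1, 0, 0, 0, 0], [-1, -1], [0, 0]))
print(recursive_filter([0, 0, 0, 0, 1], [-1, -1], [0, 0], anticausal=True))
```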

  7. Filter sequences and separability • Often, sequences of recursive filters are needed • Independent columns • Causal • Anticausal • Independent rows • Causal • Anticausal

  8. Algorithm RT • The baseline algorithm • Process columns in parallel, then rows in parallel • Ruijters et al. 2010 “GPU prefilter […]” [Figure labels: input, stages, output, column processing, row processing]
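
The structure of the baseline can be sketched on the CPU as follows. This is not the GPU implementation from Ruijters et al.; it is a plain-Python illustration, assuming a first-order filter with zero boundaries and hypothetical function names, in which each column (and then each row) is an independent task that a GPU would assign to its own thread.

```python
def causal(x, a):
    """First-order causal pass, y[i] = x[i] - a * y[i-1], zero prologue."""
    y, prev = [], 0.0
    for v in x:
        prev = v - a * prev
        y.append(prev)
    return y


def anticausal(x, a):
    """Same recurrence run from the end, zero epilogue."""
    return causal(x[::-1], a)[::-1]


def filter_image(img, a):
    h, w = len(img), len(img[0])
    # Column stage: the w columns are independent of each other.
    cols = [[img[i][j] for i in range(h)] for j in range(w)]
    cols = [anticausal(causal(c, a), a) for c in cols]
    # Row stage: the h rows are independent of each other.
    rows = [[cols[j][i] for j in range(w)] for i in range(h)]
    return [anticausal(causal(r, a), a) for r in rows]


print(filter_image([[1.0, 0.0], [0.0, 0.0]], -0.5))
```

Note that every pass reads and writes the whole image, and the available parallelism is only h or w tasks, which motivates the optimizations that follow.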

  9. First-order filter benchmarks • Alg. RT is the baseline implementation • Ruijters et al. 2010 “GPU prefilter […]” [Plot: Cubic B-Spline Interpolation (GeForce GTX 480); throughput (GiP/s) vs. input size (pixels), 64² to 4096²; curve: RT]

  10. Optimization roadmap • Modern GPUs have several hundred cores • Latency-hiding requires many times more tasks • Images are not large enough: must parallelize further

  11. Increasing parallelism • Similar to parallel prefix-sum algorithms • Sengupta et al. 2007 “Scan primitives for GPU computing” • Dotsenko et al. 2008 “Fast scan algorithms […]” • Compute and store incomplete prologues • Fix incomplete prologues • Somewhat more complicated than a recursive invocation • Use prologues to compute and store causal results

  12. Fixing incomplete prologues [Figure labels: superposition, linearity]
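
The three steps (compute incomplete prologues, fix them, then compute the causal results) can be sketched for a first-order filter. The code is an illustrative assumption, not the authors' kernel: by superposition, a block's output with prologue p equals its output with a zero prologue plus the response of a zero input to p, and by linearity that response at the end of a block of length b is (-a)^b * p.

```python
def causal(x, a, p=0.0):
    """First-order causal pass, y[i] = x[i] - a * y[i-1], with prologue p."""
    y = []
    for v in x:
        p = v - a * p
        y.append(p)
    return y


def blocked_causal(x, a, b):
    blocks = [x[i:i + b] for i in range(0, len(x), b)]
    # Stage 1 (parallel over blocks): incomplete prologues, i.e. the last
    # output of each block computed as if its own prologue were zero.
    incomplete = [causal(blk, a)[-1] for blk in blocks]
    # Stage 2 (short sequential pass over blocks): fix the prologues.
    prologues, p = [], 0.0
    for blk, inc in zip(blocks, incomplete):
        prologues.append(p)                # true prologue of this block
        p = inc + (-a) ** len(blk) * p     # true last output of this block
    # Stage 3 (parallel over blocks): final results from true prologues.
    out = []
    for blk, p in zip(blocks, prologues):
        out.extend(causal(blk, a, p))
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(blocked_causal(x, -0.5, 3) == causal(x, -0.5))
```

Only stage 2 is sequential, and it touches one value per block rather than one per pixel.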

  13. Algorithm 2 • Adds block parallelism • Sung et al. 1986 “Efficient […] recursive […]”, or • Blelloch 1990 “Prefix sums […]” • + tricks from GPU parallel scan algorithms [Figure labels: input, stages, output, fix (×4)]

  14. First-order filter benchmarks • Alg. RT is the baseline implementation • Ruijters et al. 2010 “GPU prefilter […]” • Alg. 2 adds block parallelism & tricks • Sung et al. 1986 “Efficient […] recursive […]” • Blelloch 1990 “Prefix sums […]” • + tricks from GPU parallel scan algorithms [Plot: Cubic B-Spline Interpolation (GeForce GTX 480); throughput (GiP/s) vs. input size (pixels), 64² to 4096²; curves: RT, 2]

  15. Optimization roadmap • Modern GPUs have several hundred cores • Latency-hiding requires many times more tasks • Images are not large enough: must parallelize further • FLOP/IO ratio of recursive filters is too low • Can use even more FLOPs but must reduce IO • To do so, we introduce overlapping

  16. Causal-anticausal overlapping • Start anticausal processing before causal is done • Saves reading and writing causal results! • Compute and store incomplete prologues & epilogues • Fix incomplete prologues & twice-incomplete epilogues • Twice-incomplete epilogues are trickier • Use them to compute and store anticausal results

  17. Fixing twice-incomplete epilogues • Repeatedly apply linearity and superposition • Tedious derivation, simple result [Figure labels: corrected epilogue, corrected prologue, twice-incomplete epilogue]
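
For a first-order filter the result of that derivation can be sketched directly. The code below is an illustration under stated assumptions (same coefficient `a` for the causal and anticausal passes, zero global boundaries, hypothetical function names), not the formulation used in the paper. An epilogue is "twice-incomplete" because it was computed with a zero prologue and a zero epilogue; the fix adds one term for the true prologue and one for the true epilogue of the block.

```python
def causal(x, a, p=0.0):
    y = []
    for v in x:
        p = v - a * p
        y.append(p)
    return y


def anticausal(y, a, e=0.0):
    return causal(y[::-1], a, e)[::-1]


def overlapped(x, a, b):
    blocks = [x[i:i + b] for i in range(0, len(x), b)]
    # Stage 1 (parallel): per block, causal then anticausal with zero
    # boundaries; keep only the incomplete prologue and epilogue.
    inc_p, inc_e = [], []
    for blk in blocks:
        y = causal(blk, a)
        inc_p.append(y[-1])
        inc_e.append(anticausal(y, a)[0])
    # Stage 2a: fix prologues, first block to last.
    pro, p = [], 0.0
    for blk, ip in zip(blocks, inc_p):
        pro.append(p)
        p = ip + (-a) ** len(blk) * p
    # Stage 2b: fix twice-incomplete epilogues, last block to first.
    epi, e = [0.0] * len(blocks), 0.0
    for m in range(len(blocks) - 1, -1, -1):
        n = len(blocks[m])
        epi[m] = e
        # Anticausal response, at the block start, to the prologue's effect.
        c = sum((-a) ** i * (-a) ** (i + 1) for i in range(n))
        e = inc_e[m] + c * pro[m] + (-a) ** n * e
    # Stage 3 (parallel): both passes per block; the causal results are
    # never written out between the passes.
    out = []
    for blk, p, e in zip(blocks, pro, epi):
        out.extend(anticausal(causal(blk, a, p), a, e))
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ref = anticausal(causal(x, -0.5), -0.5)
print(max(abs(u - v) for u, v in zip(overlapped(x, -0.5, 3), ref)) < 1e-12)
```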

  18. Algorithm 4 • Adds causal-anticausal overlapping • Eliminates reading and writing causal results • Both in column and in row processing • Modest increase in computation [Figure labels: input, stages, output, fix both (×2)]

  19. First-order filter benchmarks • Alg. RT is the baseline implementation • Ruijters et al. 2010 “GPU prefilter […]” • Alg. 2 adds block parallelism & tricks • Sung et al. 1986 “Efficient […] recursive […]” • Blelloch 1990 “Prefix sums […]” • + tricks from GPU parallel scan algorithms • Alg. 4 adds causal-anticausal overlapping • Eliminates 4hw of IO • Modest increase in computation [Plot: Cubic B-Spline Interpolation (GeForce GTX 480); throughput (GiP/s) vs. input size (pixels), 64² to 4096²; curves: RT, 2, 4]

  20. Algorithm 5 • Adds row-column overlapping • Eliminates reading and writing column results • Modest increase in computation [Figure labels: input, stages, output, fix all!]

  21. Start from input and global borders

  22. Load blocks into shared memory

  23. Compute & store incomplete borders

  24. Compute & store incomplete borders

  25. Compute & store incomplete borders

  26. Compute & store incomplete borders

  27. Compute & store incomplete borders

  28. Compute & store incomplete borders

  29. Compute & store incomplete borders

  30. Compute & store incomplete borders

  31. All borders in global memory

  32. Fix incomplete borders

  33. Fix twice-incomplete borders

  34. Fix thrice-incomplete borders

  35. Fix four-times-incomplete borders

  36. Done fixing all borders

  37. Load blocks into shared memory

  38. Finish causal columns

  39. Finish anticausal columns

  40. Finish causal rows

  41. Finish anticausal rows

  42. Store results to global memory

  43. Done!

  44. Row-column overlapping rules • Fixing thrice-incomplete row-prologues • Fixing four-times-incomplete row-epilogues

  45. First-order filter benchmarks • Alg. RT is the baseline implementation • Ruijters et al. 2010 “GPU prefilter […]” • Alg. 2 adds block parallelism & tricks • Sung et al. 1986 “Efficient […] recursive […]” • Blelloch 1990 “Prefix sums […]” • + tricks from GPU parallel scan algorithms • Alg. 4 adds causal-anticausal overlapping • Eliminates 4hw of IO • Modest increase in computation • Alg. 5 adds row-column overlapping • Eliminates additional 2hw of IO • Modest increase in computation [Plot: Cubic B-Spline Interpolation (GeForce GTX 480); throughput (GiP/s) vs. input size (pixels), 64² to 4096²; curves: RT, 2, 4, 5]

  46. Second-order filter benchmarks • Alg. 4₂ uses causal-anticausal overlapping • Alg. 5₂ adds row-column overlapping • Added complexity outweighs IO reduction • Balance will change (hardware, compiler, implementation) [Plot: Quintic B-Spline Interpolation (GeForce GTX 480); throughput (GiP/s) vs. input size (pixels), 64² to 4096²; curves: 4₂, 5₂]

  47. Gaussian blur results • CUFFT is in frequency domain • complexity • DIR is direct convolution • complexity • Podlozhnyuk 2007 whitepaper “Image convolution with CUDA” • Overlapped recursive • 3rd order approximation • complexity • van Vliet et al. 1998 “Recursive Gaussian derivative filters” • Implemented as 5₁ fused with 4₂ • Recursive approximation is faster • Even for modest size images • Also modest standard-deviations [Plot: Gaussian Blur (GeForce GTX 480); throughput (GiP/s) vs. input size (pixels), 64² to 4096²; curves: Overlapped Recursive, DIR 2.5, DIR 5, DIR 10, CUFFT]

  48. Summed-area table benchmarks • Harris et al. 2008, GPU Gems 3 • “Parallel prefix-scan […]” • Multi-scan + transpose + multi-scan • Implemented with CUDPP • Hensley 2010, Gamefest • “High-quality depth of field” • Multi-wave method • Our improvements • + specialized row and column kernels • + save only incomplete borders • + fuse row and column stages • Overlapped SAT • Row-column overlapping • First-order filter, unit coefficient, no anticausal component [Plot: Summed-area Table (GeForce GTX 480); throughput (GiP/s) vs. input size (pixels), 64² to 4096²; curves: Overlapped SAT, Improved Hensley [2010], Hensley [2010], Harris et al. [2008]]
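
The description of the summed-area table as a first-order filter with unit coefficient and no anticausal component can be made concrete. The sketch below is a sequential CPU illustration with hypothetical function names: a running sum down every column, then along every row, followed by the usual four-lookup box sum that makes the table useful.

```python
def summed_area_table(img):
    """sat[i][j] = sum of img over rows 0..i and columns 0..j."""
    h, w = len(img), len(img[0])
    sat = [[0] * w for _ in range(h)]
    for j in range(w):  # causal pass over each column: y[i] = x[i] + y[i-1]
        acc = 0
        for i in range(h):
            acc += img[i][j]
            sat[i][j] = acc
    for i in range(h):  # causal pass over each row of the column results
        acc = 0
        for j in range(w):
            acc += sat[i][j]
            sat[i][j] = acc
    return sat


def box_sum(sat, top, left, bottom, right):
    """Sum of the original image over an inclusive rectangle, 4 lookups."""
    total = sat[bottom][right]
    if top > 0:
        total -= sat[top - 1][right]
    if left > 0:
        total -= sat[bottom][left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]
    return total


sat = summed_area_table([[1, 2], [3, 4]])
print(sat, box_sum(sat, 1, 1, 1, 1))
```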

  49. Future work • Volumetric processing • Overlapping should generalize • Not enough shared memory (yet?) • CPU implementation • Blocking should increase L1 cache effectiveness • Is doubling the amount of computation worth it? • Solving general narrow-banded linear systems • Overlapping back- and forward-substitution

  50. Conclusions • Recursive filters are useful in many applications • Cubic and quintic B-Spline interpolation • Gaussian-blur approximation • Summed-area table computation • We introduced parallel algorithms for GPUs • Overlapping reduces IO requirements • Leads to faster algorithms • Code is available from project page • Most is already there, rest is on the way