# GPU-Efficient Recursive Filtering and Summed-Area Tables
##### Presentation Transcript

1. GPU-Efficient Recursive Filtering and Summed-Area Tables D. Nehab¹ A. Maximo¹ R. S. Lima² H. Hoppe³ (¹IMPA, ²Digitok, ³Microsoft Research)

2. Recursive filters • Linear, shift-invariant filters • But use feedback from earlier outputs [diagram: input, prologue, output]

3. Recursive filters • Linear, shift-invariant filters • But use feedback from earlier outputs • Sequential dependency chain [diagram: input, prologue, output]
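
The feedback structure above can be sketched in a few lines. This is an illustrative CPU version, not the paper's GPU code; the function name and the coefficient convention y[i] = b0*x[i] - a1*y[i-1] are assumptions:

```python
def causal_filter_1st_order(x, b0, a1, prologue=0.0):
    """First-order causal recursive filter:
    y[i] = b0*x[i] - a1*y[i-1], seeded with the prologue y[-1]."""
    y = []
    prev = prologue  # feedback carries an earlier *output*, not an input
    for v in x:
        prev = b0 * v - a1 * prev
        y.append(prev)
    return y
```

The loop-carried `prev` is exactly the sequential dependency chain the slides mention: y[i] cannot be computed before y[i-1].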

4. Applications of recursive filtering • B-Spline (or other) interpolation: a recursive preprocessing step turns the input into coefficients [diagram: input → coefficients → interpolation (from coefficients)]

5. Applications of recursive filtering • B-Spline (or other) interpolation • Fast, wide Gaussian-blur approximation • Summed-area tables [diagram: input → blurred]

6. Causality and order • Recursive filters can be causal or anticausal • Causal filters run forward, anticausal filters run in the reverse direction • The filter order is simply the number r of feedbacks [diagram: input, epilogue, output]
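
A first-order anticausal pass is the same recurrence run right-to-left, seeded by an epilogue instead of a prologue (again an illustrative sketch with assumed names and coefficient convention):

```python
def anticausal_filter_1st_order(x, b0, a1, epilogue=0.0):
    """First-order anticausal recursive filter:
    y[i] = b0*x[i] - a1*y[i+1], processed right-to-left,
    seeded with the epilogue y[n]."""
    y = [0.0] * len(x)
    nxt = epilogue
    for i in reversed(range(len(x))):
        nxt = b0 * x[i] - a1 * nxt
        y[i] = nxt
    return y
```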

7. Filter sequences and separability • Often, sequences of recursive filters are needed • Independent columns • Causal • Anticausal • Independent rows • Causal • Anticausal
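
The four-pass sequence listed above, on a small 2D array, might look like this (illustrative serial code; zero prologues and epilogues are assumed for simplicity, so boundaries are not handled as a real prefilter would):

```python
def filter_sequence_2d(img, b0, a1):
    """Apply causal then anticausal first-order filters down each
    column, then across each row (zero boundary conditions assumed)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    # Independent columns: causal (top-down), then anticausal (bottom-up).
    for c in range(w):
        prev = 0.0
        for r in range(h):
            prev = b0 * out[r][c] - a1 * prev
            out[r][c] = prev
        nxt = 0.0
        for r in reversed(range(h)):
            nxt = b0 * out[r][c] - a1 * nxt
            out[r][c] = nxt
    # Independent rows: causal (left-right), then anticausal (right-left).
    for r in range(h):
        prev = 0.0
        for c in range(w):
            prev = b0 * out[r][c] - a1 * prev
            out[r][c] = prev
        nxt = 0.0
        for c in reversed(range(w)):
            nxt = b0 * out[r][c] - a1 * nxt
            out[r][c] = nxt
    return out
```

Because the filters are separable, each column (and later each row) is independent, which is what the GPU algorithms in the following slides exploit.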

8. Algorithm RT • The baseline algorithm • Process columns in parallel, then rows in parallel • Ruijters et al. 2010 “GPU prefilter […]” [diagram: input → column processing → row processing → output]

9. First-order filter benchmarks • Alg. RT is the baseline implementation • Ruijters et al. 2010 “GPU prefilter […]” [chart: Cubic B-Spline Interpolation (GeForce GTX 480), throughput in GiP/s vs. input size from 64² to 4096² pixels; curve: RT]

10. Optimization roadmap • Modern GPUs have several hundred cores • Latency-hiding requires many times more tasks • Images are not large enough: must parallelize further

11. Increasing parallelism • Similar to parallel prefix-sum algorithms • Sengupta et al. 2007 “Scan primitives for GPU computing” • Dotsenko et al. 2008 “Fast scan algorithms […]” • Compute and store incomplete prologues • Fix incomplete prologues • Somewhat more complicated than a recursive invocation • Use prologues to compute and store causal results [diagram: per-block partial results, incomplete prologues marked ✗]

12. Fixing incomplete prologues [diagram: incomplete prologues corrected by superposition and linearity]

13. Algorithm 2 • Adds block parallelism • Sung et al. 1986 “Efficient […] recursive […]”, or • Blelloch 1990 “Prefix sums […]” • + tricks from GPU parallel scan algorithms [diagram: input → stages with per-block fixes → output]
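
A serial sketch of the block-parallel scheme for a causal first-order filter (illustrative only; in Alg. 2 the per-block stages run in parallel on the GPU, and the fix-up step below follows from linearity rather than from the paper's exact kernels):

```python
def block_parallel_causal(x, b0, a1, block=4):
    """Block-parallel first-order causal filter y[i] = b0*x[i] - a1*y[i-1].
    Stage 1: filter each block with a zero prologue, keeping only its
    incomplete prologue (last output). Stage 2: fix prologues across
    blocks; by linearity, adding p to a block's prologue adds
    (-a1)**len * p to its last output. Stage 3: re-filter each block
    with its corrected prologue."""
    n = len(x)
    blocks = [x[s:s + block] for s in range(0, n, block)]
    # Stage 1 (one task per block; parallel on the GPU).
    incomplete = []
    for blk in blocks:
        prev = 0.0
        for v in blk:
            prev = b0 * v - a1 * prev
        incomplete.append(prev)
    # Stage 2 (cheap sequential fix over one value per block).
    fixed = []
    prev = 0.0
    for blk, inc in zip(blocks, incomplete):
        prev = inc + (-a1) ** len(blk) * prev
        fixed.append(prev)
    # Stage 3 (one task per block; parallel on the GPU).
    y = []
    for k, blk in enumerate(blocks):
        prev = fixed[k - 1] if k > 0 else 0.0
        for v in blk:
            prev = b0 * v - a1 * prev
            y.append(prev)
    return y
```

Stages 1 and 3 expose one task per block, which is the extra parallelism the roadmap calls for; only the tiny stage-2 fix remains sequential.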

14. First-order filter benchmarks • Alg. RT is the baseline implementation • Ruijters et al. 2010 “GPU prefilter […]” • Alg. 2 adds block parallelism & tricks • Sung et al. 1986 “Efficient […] recursive […]” • Blelloch 1990 “Prefix sums […]” • + tricks from GPU parallel scan algorithms [chart: Cubic B-Spline Interpolation (GeForce GTX 480), throughput in GiP/s vs. input size from 64² to 4096² pixels; curves: RT, 2]

15. Optimization roadmap • Modern GPUs have several hundred cores • Latency-hiding requires many times more tasks • Images are not large enough: must parallelize further • FLOP/IO ratio of recursive filters is too low • Can use even more FLOPs but must reduce IO • To do so, we introduce overlapping

16. Causal-anticausal overlapping • Start anticausal processing before causal is done • Saves reading and writing causal results! • Compute and store incomplete prologues & epilogues • Fix incomplete prologues & twice-incomplete epilogues • Twice-incomplete epilogues are trickier • Use them to compute and store anticausal results

17. Fixing twice-incomplete epilogues • Repeatedly apply linearity and superposition • Tedious derivation, simple result [diagram: corrected epilogue, corrected prologue, twice-incomplete epilogue]

18. Algorithm 4 • Adds causal-anticausal overlapping • Eliminates reading and writing causal results • Both in column and in row processing • Modest increase in computation [diagram: input → stages (fix both) → output]

19. First-order filter benchmarks • Alg. RT is the baseline implementation • Ruijters et al. 2010 “GPU prefilter […]” • Alg. 2 adds block parallelism & tricks • Sung et al. 1986 “Efficient […] recursive […]” • Blelloch 1990 “Prefix sums […]” • + tricks from GPU parallel scan algorithms • Alg. 4 adds causal-anticausal overlapping • Eliminates 4hw of IO • Modest increase in computation [chart: Cubic B-Spline Interpolation (GeForce GTX 480), throughput in GiP/s vs. input size from 64² to 4096² pixels; curves: RT, 2, 4]

20. Algorithm 5 • Adds row-column overlapping • Eliminates reading and writing column results • Modest increase in computation [diagram: input → stages (fix all!) → output]

21. Start from input and global borders

22. Load blocks into shared memory

23. Compute & store incomplete borders

24. Compute & store incomplete borders

25. Compute & store incomplete borders

26. Compute & store incomplete borders

27. Compute & store incomplete borders

28. Compute & store incomplete borders

29. Compute & store incomplete borders

30. Compute & store incomplete borders

31. All borders in global memory

32. Fix incomplete borders

33. Fix twice-incomplete borders

34. Fix thrice-incomplete borders

35. Fix four-times-incomplete borders

36. Done fixing all borders

37. Load blocks into shared memory

38. Finish causal columns

39. Finish anticausal columns

40. Finish causal rows

41. Finish anticausal rows

42. Store results to global memory

43. Done!

44. Row-column overlapping rules • Fixing thrice-incomplete row-prologues • Fixing four-times-incomplete row-epilogues

45. First-order filter benchmarks • Alg. RT is the baseline implementation • Ruijters et al. 2010 “GPU prefilter […]” • Alg. 2 adds block parallelism & tricks • Sung et al. 1986 “Efficient […] recursive […]” • Blelloch 1990 “Prefix sums […]” • + tricks from GPU parallel scan algorithms • Alg. 4 adds causal-anticausal overlapping • Eliminates 4hw of IO • Modest increase in computation • Alg. 5 adds row-column overlapping • Eliminates an additional 2hw of IO • Modest increase in computation [chart: Cubic B-Spline Interpolation (GeForce GTX 480), throughput in GiP/s vs. input size from 64² to 4096² pixels; curves: RT, 2, 4, 5]

46. Second-order filter benchmarks • Alg. 4₂ uses causal-anticausal overlapping • Alg. 5₂ adds row-column overlapping • Added complexity outweighs the IO reduction • The balance will change (hardware, compiler, implementation) [chart: Quintic B-Spline Interpolation (GeForce GTX 480), throughput in GiP/s vs. input size from 64² to 4096² pixels; curves: 4₂, 5₂]

47. Gaussian blur results • CUFFT works in the frequency domain (cost independent of the blur width) • DIR is direct convolution (cost grows with the kernel width) • Podlozhnyuk 2007 whitepaper “Image convolution with CUDA” • Overlapped recursive: 3rd-order approximation (constant cost per pixel) • van Vliet et al. 1998 “Recursive Gaussian derivative filters” • Implemented as Alg. 5₁ fused with Alg. 4₂ • The recursive approximation is faster, even for modest image sizes and modest standard deviations [chart: Gaussian Blur (GeForce GTX 480), throughput in GiP/s vs. input size from 64² to 4096² pixels; curves: Overlapped Recursive, DIR 2.5, DIR 5, DIR 10, CUFFT]
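
As a toy stand-in for the 3rd-order van Vliet approximation (whose coefficients are not reproduced here), a first-order causal-plus-anticausal smoothing pass shows why recursive blurs cost the same per pixel regardless of the smoothing width; `alpha` is a hypothetical smoothing parameter, not a value from the paper:

```python
def recursive_smooth_1d(x, alpha):
    """Symmetric exponential smoothing: a causal pass
    y[i] = alpha*x[i] + (1-alpha)*y[i-1], then the same recurrence
    anticausally. Cost per sample is constant, whereas direct
    convolution grows with the kernel width."""
    y = list(x)
    for i in range(1, len(y)):               # causal pass
        y[i] = alpha * y[i] + (1 - alpha) * y[i - 1]
    for i in range(len(y) - 2, -1, -1):      # anticausal pass
        y[i] = alpha * y[i] + (1 - alpha) * y[i + 1]
    return y
```

Smaller `alpha` gives a wider blur at no extra arithmetic, which is the property that lets the recursive approximation beat direct convolution even at modest standard deviations.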

48. Summed-area table benchmarks • Harris et al. 2008, GPU Gems 3 “Parallel prefix-scan […]”: multi-scan + transpose + multi-scan, implemented with CUDPP • Hensley 2010, Gamefest “High-quality depth of field”: multi-wave method • Our improvements: specialized row and column kernels + save only incomplete borders + fuse row and column stages • Overlapped SAT: row-column overlapping with a first-order filter, unit coefficient, no anticausal component [chart: Summed-area Table (GeForce GTX 480), throughput in GiP/s vs. input size from 64² to 4096² pixels; curves: Overlapped SAT, Improved Hensley [2010], Hensley [2010], Harris et al. [2008]]
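
A summed-area table is the degenerate case the last bullet describes: a causal first-order filter with unit coefficient and no anticausal pass, applied along rows and then columns. A minimal CPU sketch (the `box_sum` helper is illustrative, not part of the paper's interface):

```python
def summed_area_table(img):
    """sat[r][c] = sum of img[i][j] for i <= r, j <= c, built from
    two causal prefix-sum passes (unit-coefficient first-order filters)."""
    h, w = len(img), len(img[0])
    sat = [row[:] for row in img]
    for r in range(h):                 # causal pass along each row
        for c in range(1, w):
            sat[r][c] += sat[r][c - 1]
    for c in range(w):                 # causal pass along each column
        for r in range(1, h):
            sat[r][c] += sat[r - 1][c]
    return sat

def box_sum(sat, r0, c0, r1, c1):
    """Sum over the inclusive rectangle [r0..r1] x [c0..c1]
    from four table lookups."""
    total = sat[r1][c1]
    if r0 > 0:
        total -= sat[r0 - 1][c1]
    if c0 > 0:
        total -= sat[r1][c0 - 1]
    if r0 > 0 and c0 > 0:
        total += sat[r0 - 1][c0 - 1]
    return total
```

Once the table is built, any axis-aligned box sum costs four lookups, which is what makes SATs useful for depth-of-field and other variable-width filtering.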

49. Future work • Volumetric processing • Overlapping should generalize • Not enough shared memory (yet?) • CPU implementation • Blocking should increase L1 cache effectiveness • Is doubling the amount of computation worth it? • Solving general narrow-banded linear systems • Overlapping back- and forward-substitution

50. Conclusions • Recursive filters are useful in many applications • Cubic and quintic B-Spline interpolation • Gaussian-blur approximation • Summed-area table computation • We introduced parallel algorithms for GPUs • Overlapping reduces IO requirements • Leads to faster algorithms • Code is available from the project page • Most is already there; the rest is on the way