Maximizing Pixel-Parallelism via Fragment-Parallel Composite and Filters

Fragment-Parallel Composite and Filter • Anjul Patney, Stanley Tzeng, and John D. Owens • University of California, Davis

Parallelism in Interactive Graphics • Well-expressed in hardware as well as APIs • Consistently growing in degree & expression • More and more cores on upcoming GPUs • From programmable shaders to pipelines • We should rethink algorithms to exploit this • This paper provides one example • Parallelization of composite/filter stages

A Feed-Forward Rendering Pipeline • Primitives → Geometry Processing → Rasterization → Composite → Filter → Pixels

Composite & Filter [Figure labels: sample locations, pixel] • Input: unordered list of fragments • Output: pixel colors • Assumption: no fragments are discarded

Basic Idea [Figure: processors; label: pixel-parallel]

Basic Idea [Figure: processors; labels: insufficient parallelism, irregularity, fragment-parallel]

Motivation • Most applications have low depth complexity • Pixel-level parallelism is sufficient • We are interested in applications with • Very high depth complexity • High variation in depth complexity • Further • Future platforms will demand more parallelism • High depth-complexity can limit pixel-parallelism

Motivation

Related Work Order-Independent Transparency (OIT) • Depth-Peeling [Everitt 01] • One pass per transparent layer • Stencil-Routed A-buffer [Myers & Bavoil 07] • One pass per 8 depth layers (maximum MSAA samples per pixel) • Bucket Depth-Peeling [Liu et al. 09] • One pass per up to 32 layers (maximum render targets)

Related Work Order-Independent Transparency (OIT) • OIT using Direct3D 11 [Gruen et al. 10] • Use fragment linked-lists • Per-pixel sort and composite • Hair Self-Shadowing [Sintorn et al. 09] • Each fragment computes its contribution • Assumes constant opacity

Related Work Programmable Rendering Pipelines • RenderAnts [Zhou et al. 09] • Sort fragments globally • Per-pixel composite/filter • FreePipe [Liu et al. 10] • Sort fragments globally • Per-pixel composite/filter

Pixel-Parallel Formulation [Figure: thread IDs j … j+6 laid out against subsamples Sj … S(j+6), which are grouped under pixels Pi, P(i+1), P(i+2). P: pixel, S: subsample]

Fragment-Parallel Formulation [Figure: thread IDs j … j+23 laid out against the fragments of subsamples Sj … S(j+6), which are grouped under pixels Pi, P(i+1), P(i+2). P: pixel, S: subsample]

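Fragment-Parallel Formulation
• How can this behavior be achieved?
• Revisit the composite equation for one subsample covered by fragment 1, fragment 2, …, fragment N, then the background (color C_B):

$$C_s = \alpha_1 C_1 + (1-\alpha_1)\{\alpha_2 C_2 + (1-\alpha_2)(\dots(\alpha_N C_N + (1-\alpha_N) C_B)\dots)\}$$

• Expanded:

$$C_s = 1\cdot\alpha_1 C_1 + (1-\alpha_1)\,\alpha_2 C_2 + (1-\alpha_1)(1-\alpha_2)\,\alpha_3 C_3 + \dots + (1-\alpha_1)(1-\alpha_2)\dots(1-\alpha_{k-1})\,\alpha_k C_k + \dots + (1-\alpha_1)(1-\alpha_2)\dots(1-\alpha_N)\,C_B$$

• In the k-th term, the leading product of (1-α) factors is the global contribution G_k, and α_k C_k is the local contribution L_k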

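Fragment-Parallel Formulation

$$C_s = G_1 L_1 + G_2 L_2 + G_3 L_3 + \dots + G_N L_N$$

$$G_k = (1-\alpha_1)(1-\alpha_2)\dots(1-\alpha_{k-1}), \qquad L_k = \alpha_k C_k$$

• G_1 = 1 (empty product); the background fits the same form as a final fragment with α = 1 and color C_B
• L_k is trivially parallel (local computation)
• G_k is the result of a scan operation (product)
• For the list of input fragments
• Compute G[ ] and L[ ], multiply
• Perform reduction to add subpixel contributions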
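The scan-and-reduce form can be checked against the nested equation with a minimal sketch. It assumes a single color channel, fragments already ordered so that fragment 1 comes first, and a background handled as one extra fragment with α = 1. The function names `composite_nested` and `composite_scan` are hypothetical, and the sequential loops stand in for the parallel primitives.

```python
from itertools import accumulate
from operator import mul

def composite_nested(alphas, colors, background):
    """Reference: evaluate the nested composite equation from the innermost term out."""
    result = background
    for a, c in zip(reversed(alphas), reversed(colors)):
        result = a * c + (1.0 - a) * result
    return result

def composite_scan(alphas, colors, background):
    """Scan form: Cs = sum over k of G_k * L_k (assumes Python 3.8+ for `initial`)."""
    # Assumption: append the background as a final fragment with alpha = 1.
    alphas = list(alphas) + [1.0]
    colors = list(colors) + [background]
    # Exclusive product scan of (1 - alpha): G_1 = 1, G_k = (1-a_1)...(1-a_{k-1}).
    G = list(accumulate((1.0 - a for a in alphas[:-1]), mul, initial=1.0))
    # Local contributions need no communication between fragments.
    L = [a * c for a, c in zip(alphas, colors)]
    # Reduction (sum) of the per-fragment products.
    return sum(g * l for g, l in zip(G, L))
```

Both functions should agree up to floating-point rounding for any fragment list; the scan version only reorders the arithmetic so that each fragment's term can be computed independently once G is known.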

Fragment-Parallel Formulation
• Filter, for every pixel:

$$C_p = C_{s_1}\kappa_1 + C_{s_2}\kappa_2 + \dots + C_{s_M}\kappa_M$$

• This can be expressed as another reduction
• After multiplying with subpixel weights κ_m
• Can be merged with previous reduction
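One way to see why the two reductions merge is to substitute the per-subsample sum into the filter sum. The double indices below are an assumption introduced for illustration: G_{m,k} and L_{m,k} denote the global and local contributions of fragment k in subsample m, and N_m is the number of fragments in that subsample.

```latex
\begin{align*}
C_p &= \sum_{m=1}^{M} \kappa_m\, C_{s_m}
     = \sum_{m=1}^{M} \kappa_m \sum_{k=1}^{N_m} G_{m,k}\, L_{m,k} \\
    &= \sum_{m=1}^{M} \sum_{k=1}^{N_m} \bigl(\kappa_m\, G_{m,k}\, L_{m,k}\bigr)
\end{align*}
```

Each fragment's term κ_m G_{m,k} L_{m,k} can be formed independently, so a single sum over all fragments of a pixel yields the filtered color.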

Fragment-Parallel Composite & Filter Final Algorithm • Two-key sort (Subpixel ID, depth) • Segmented Scan (obtain Gk) • Premultiply with weights (Lk, κm) • Segmented Reduction
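The four steps can be sketched sequentially as follows. This is an illustration under stated assumptions, not the CUDPP-based implementation: colors are single-channel, the background is omitted (equivalent to black), fragments are hypothetical `(pixel, subsample, depth, alpha, color)` tuples, `weights` is a hypothetical dict of filter weights keyed by `(pixel, subsample)`, and plain loops stand in for the parallel sort, segmented scan, and segmented reduction.

```python
from collections import defaultdict

def fragment_parallel_composite(fragments, weights):
    """Return {pixel: filtered color} from an unordered fragment list."""
    # 1. Two-key sort: subsample ID (here the pixel/subsample pair), then depth.
    frags = sorted(fragments, key=lambda f: (f[0], f[1], f[2]))

    # 2. Segmented exclusive product scan of (1 - alpha);
    #    the running product restarts at every new subsample.
    G = []
    prev_key, running = None, 1.0
    for pixel, sub, _depth, alpha, _color in frags:
        key = (pixel, sub)
        if key != prev_key:
            prev_key, running = key, 1.0
        G.append(running)
        running *= 1.0 - alpha

    # 3. Premultiply: global term * local term (alpha * color) * filter weight.
    contrib = [
        g * alpha * color * weights[(pixel, sub)]
        for g, (pixel, sub, _depth, alpha, color) in zip(G, frags)
    ]

    # 4. Segmented reduction (sum), one segment per pixel.
    pixels = defaultdict(float)
    for f, c in zip(frags, contrib):
        pixels[f[0]] += c
    return dict(pixels)
```

For example, a pixel with two equally weighted subsamples, one holding a half-transparent white fragment in front of an opaque black one and the other holding a single opaque mid-gray fragment, reduces to 0.25 + 0.0 + 0.25 = 0.5.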

Fragment-Parallel Formulation [Figure: fragments of pixels Pi, P(i+1), P(i+2) processed by a segmented scan (product) followed by a segmented reduction (sum). P: pixel, S: subsample]

Implementation • Hardware used: NVIDIA GeForce GTX 280 • We require fast Segmented Scan and Reduce • CUDPP library provides that • Restricts implementation to NVIDIA CUDA • No direct access to hardware rasterizer • We wrote our own

Example System – Polygons • Applications • Games • Depth Complexity • 1 to few tens of layers • Suited to pixel-parallel • Fragment-parallel software rasterizer

Example System – Particles • Applications • Simulations, games • Depth Complexity • Hundreds of layers • High depth-variance • Particle-parallel sprite rasterizer

Example System – Volumes • Applications • Scientific Visualization • Depth Complexity • Tens to Hundreds of layers • Low depth-variance • Major-axis-slice rasterizer

Example System – Reyes • Applications • Offline rendering • Depth Complexity • Tens of layers • Moderate depth variance • Data-parallel micropolygon rasterizer

Performance Results

Performance Variation

Limitations • Increased memory traffic • Several passes through CUDPP primitives • Unclear how to optimize for special cases • Threshold opacity • Threshold depth complexity

Summary and Conclusion • Parallel formulation of composite equation • Maps well to known primitives • Can be integrated with filter • Consistent performance across varying workloads • FPC is applicable to future rendering pipelines • Exploits higher degree of parallelism • Better related to size of rendering workload • A tool for building programmable pipelines

Future Work • Performance • Reduction in memory traffic • Extension to special-case scenes • Hybrid PPC-FPC formulations • Applications • Integration with hardware rasterizer • Cinematic rendering, Photoshop

Acknowledgments • NSF Award 0541448 • SciDAC Institute for Ultrascale Visualization • NVIDIA Research Fellowship • Equipment donated by NVIDIA • Discussions and Feedback • Shubho Sengupta (UC Davis), Matt Pharr (Intel), Aaron Lefohn (Intel), Mike Houston (AMD) • Anonymous reviewers • Implementation assistance • Jeff Stuart, Shubho Sengupta

Thanks!
