Streaming v. Multicore in Graphics Applications
Jared Hoberock, Victor Lu, Sanjay Patel, John C. Hart
Univ. of Illinois

Dynamic Virtual Environments
• World of Warcraft
  • Social Internet World
  • Completely Unconstrained (can build & share things)
  • Lower Quality Graphics
• Grand Theft Auto IV
  • “Sandbox” World
  • Free Interaction (within gamespace)
  • High Quality Graphics
• Halo 3
  • First-Person Shooter
  • Constrained Interaction
  • Photorealistic Graphics (much precomputation)
Multicore enables both flexibility and photorealism
[Figure labels: Dynamic, Flexible “Game” Graphics; Precomputed, Rigid “Film” Graphics]

Videogame Production
• Costly
  • Expensive: $10M/title
  • Slow: 3+ years/title
• Compromises
  • Precomputed visibility: restricts viewer mobility and environment complexity
  • Precomputed lighting: restricts scene dynamics, user alterations
  • Precomputed motion: restricts movement to mocap data, rigging
• Consequences
  • Significant development effort to achieve realtime rates
  • Dynamic social gamespace quality lags that of solo/team shooter levels
• Solution
  • Leverage multicore power to ray trace for dynamic visibility & lighting

How Close Are We?
• Single CPU ray tracing
  • RTRT Core renders at 1~5 Hz on 2.5 GHz P4
  • Need 60 Hz for games
  • 30 GHz CPU needed to ray trace game scenes [Schmittler et al., Realtime Ray Tracing for Current & Future Games, SIGGRAPH 2006 Real Time Ray Tracing Course Notes]
• We won’t see a 30 GHz serial processor (burns too brightly!)
• We will see 16+ cores
• But can we do in parallel what we predict in serial?
Ingo Wald, RTRT Core, SIGGRAPH 2005 Real Time Ray Tracing Course Notes

Spatial Data Structures
Nearest Neighbor Problems in Graphics
• Rendering: Photon Mapping (k-NN)
  • Find 500 photons nearest to a ray-surface intersection to compute surface’s illumination
• Modeling: Surface Reconstruction (ε-NN)
  • Surface reconstructed at each point depends on locations of nearest points within a given distance
• Animation: Collision Detection (ε-NN)
  • Collision between multiple interacting elements accelerated by avoiding all-pairs intersections
Built on hierarchical spatial data structures
How can we build, query and maintain on SIMD GPUs?

kD-Tree
• Hierarchy of axis-aligned partitions
  • 2-D partitions are lines
  • 3-D partitions are planes
• Axis of partitions alternates wrt depth of the tree
• Average access time is O(log n)
• Worst case O(n) when tree is severely lopsided
• Need to maintain a balanced tree, O(n log n)
• Can find k nearest neighbors in O(k + log n) time using a heap
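The kD-tree properties listed above can be illustrated with a minimal sketch. It assumes points are tuples, builds a balanced tree by sorting on the current axis and splitting at the median, and answers k-NN queries with a bounded max-heap. The names `build_kdtree` and `k_nearest` are hypothetical, and sorting at every level costs more than the O(n log n) a linear-time median selection would give; it is used here only for brevity.

```python
import heapq

def build_kdtree(points, depth=0):
    """Build a balanced kD-tree; the split axis alternates with depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2  # median element becomes this node's split point
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid + 1:], depth + 1),
    }

def k_nearest(root, query, k):
    """Return the k points nearest to query, nearest first."""
    heap = []  # max-heap of (-squared_distance, point), at most k entries

    def visit(node):
        if node is None:
            return
        point, axis = node["point"], node["axis"]
        d2 = sum((a - b) ** 2 for a, b in zip(point, query))
        if len(heap) < k:
            heapq.heappush(heap, (-d2, point))
        elif d2 < -heap[0][0]:
            heapq.heapreplace(heap, (-d2, point))
        diff = query[axis] - point[axis]
        near, far = ((node["left"], node["right"]) if diff < 0
                     else (node["right"], node["left"]))
        visit(near)
        # Cross the splitting plane only if the far side could hold a closer point.
        if len(heap) < k or diff * diff < -heap[0][0]:
            visit(far)

    visit(root)
    return [p for _, p in sorted(heap, reverse=True)]

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
nearest_two = k_nearest(tree, (9, 2), 2)
```

The heap keeps the current k-th best distance at its root, which is what lets the search prune whole subtrees on the far side of a splitting plane.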

GPU Hierarchy Traversal
• SIMD “stackless” hierarchy traversal
• Prethread with hit/miss pointers
  • Hit pointer points to first child
  • Miss pointer points to next sibling or, if last sibling, then ancestor’s sibling
• References
  • Foley & Sugerman, kD-tree Acceleration Structures for a GPU Raytracer, Graphics HW 05
  • Carr, Hoberock, Crane & Hart, Fast GPU Ray Tracing of Dynamic Meshes Using Geometry Images, Graphics Interface 2006
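A minimal sketch of hit/miss threading follows, under simplifying assumptions that are not from the slides: bounding volumes are 1-D intervals, the hierarchy is given as nested `(lo, hi, children)` tuples, and nodes are stored in pre-order. In that layout the hit pointer of an interior node is the next index, and the miss pointer is the first index past the node's subtree, which is exactly the next sibling or, failing that, an ancestor's sibling. The names `flatten` and `traverse` are hypothetical.

```python
def flatten(tree):
    """Store a hierarchy in pre-order with precomputed hit/miss indices.

    Each stored node is (lo, hi, is_leaf, hit, miss). An index equal to
    len(nodes) means the traversal is finished.
    """
    nodes = []

    def walk(node):
        lo, hi, children = node
        index = len(nodes)
        nodes.append(None)  # reserve the slot so children follow it
        for child in children:
            walk(child)
        end = len(nodes)  # first index past this node's subtree
        hit = index + 1 if children else end
        nodes[index] = (lo, hi, not children, hit, end)

    walk(tree)
    return nodes

def traverse(nodes, x):
    """Return indices of leaves whose interval contains x, with no stack."""
    found = []
    i = 0
    while i < len(nodes):
        lo, hi, is_leaf, hit, miss = nodes[i]
        if lo <= x <= hi:
            if is_leaf:
                found.append(i)
            i = hit   # descend to first child (or move on, for a leaf)
        else:
            i = miss  # skip the whole subtree
    return found

hierarchy = (0, 10, [
    (0, 5, [(0, 2, []), (3, 5, [])]),
    (6, 10, [(6, 8, []), (9, 10, [])]),
])
nodes = flatten(hierarchy)
leaves_hit = traverse(nodes, 7)
```

Because the loop carries only a single index, every SIMD lane needs the same small amount of state, which is the appeal of the stackless form.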

GPU Hierarchy Construction
• Recent approaches sort first, then organize into hierarchy
  • Zhou, Hou, Wang, Guo, “Real-Time KD-Tree Construction on Graphics Hardware,” SIGGRAPH Asia 2008
  • Godiyal, Hoberock, Hart, Garland, “Rapid Multipole Graph Drawing on the GPU,” Graph Drawing 2008
• Latter uses kD-tree for fast n-body approximation to compute force directed layout
• CPU+GPU
  • CPU builds kD-tree
  • GPU performs median selection
  • Practical when > 50K elements

Incoherent Shader Execution
• Videogame graphics rasterize triangles
  • Same shader applied to all pixels (fragments) in triangle
  • Shading & visibility occur simultaneously
• Future videogames will also trace rays
  • Visibility first, then shading
  • Primary eye rays are coherent
  • Secondary rays are reflected or scattered into incoherent shader queries
  • Different shader (not just different shader data) applied to each ray
  • e.g. hair, skin, cloth, liquids, foliage
Chris Wyman

GPU Architecture
• GPU = MIMD of SIMD
• MIMD processing
  • Cell: 8 MIMD nodes
  • GF8800: 16 MIMD nodes
  • LRB: 32 MIMD nodes
• SIMD processing
  • Cell: 4 per MIMD node
  • GF8800: 8 per MIMD node
  • LRB: 16 per MIMD node
• Some MIMD nodes have distinct “control” processors, though similar processing could occur via one SIMD node (masking rest)
• LRB “core” is a MIMD proc., NVIDIA “core” is a SIMD proc.
• NVIDIA “warp” is 32 threads streaming on one MIMD node
[Figure labels: MIMD Node, SIMD Node]

IBM Cell Architecture
[Figure: block diagram of the Cell processor. Eight synergistic processor elements, each with an SPU (SXU and local store) and an SMF, connect to the element interconnect bus (up to 96 bytes/cycle). Also on the bus: the Power processor element (Power execution unit, L1 and L2 caches; 64-bit Power Architecture with vector media extensions), the memory interface controller (Dual XDR) and the bus interface controller (Flex I/O). Links are labeled 16 or 32 bytes/cycle.]
Gschwind et al., Synergistic Processing in Cell’s Multicore Architecture, IEEE Micro, 2006

NVIDIA Tesla Architecture

Conditional Program Flow
• High-performance stuck with low-level streaming SIMD
  • Even in multicore
• Problem with SIMD: Conditional Program Flow
  • If a data-dependent condition leads to two different program flows
  • Then both program flows must be executed on all SIMD nodes (serialization)
  • Result masked per SIMD processor by the condition data

    MIMD for loop
      SIMD for loop
        if (X) then A else B

[Figure: eight SIMD lanes with X = T T T T T T T F. Every lane executes both A and B; masking on X keeps A A A A A A A B.]
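The masking behavior described above can be emulated on the CPU. This is a sketch only: the function name `simd_if_else` is hypothetical, and it assumes a branch body is skipped when no lane requests it, which the slide does not state.

```python
def simd_if_else(mask, data, branch_a, branch_b):
    """Emulate one SIMD group evaluating `if X then A else B`.

    Returns (results, passes), where passes counts how many branch
    bodies the whole group had to step through.
    """
    results = [None] * len(data)
    passes = 0
    if any(mask):  # at least one lane wants A, so every lane runs A
        passes += 1
        a_all = [branch_a(v) for v in data]
        results = [a if m else r for m, a, r in zip(mask, a_all, results)]
    if not all(mask):  # at least one lane wants B, so every lane runs B
        passes += 1
        b_all = [branch_b(v) for v in data]
        results = [r if m else b for m, b, r in zip(mask, b_all, results)]
    return results, passes

# The slide's example: seven lanes take A, one lane takes B.
mask = [True] * 7 + [False]
results, passes = simd_if_else(mask, list(range(8)),
                               lambda v: "A", lambda v: "B")
```

With this mask the group pays for two passes even though only one lane needed B, which is the serialization cost the slide points at.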

Deferred Shading
• Handle visibility first
  • Intersect rays w/ scene
  • Store result for later shading
• Shade ray intersections
  • If different rays in the same MIMD node need different shaders, then shaders are serialized
• O(NS) performance
  • N = # of rays
  • S = # of shaders (per MIMD node)
  • O(S) when distributed across N processes

    MIMD for all rays
      SIMD for all rays
        intersect ray with scene
        set mask to shader #
    MIMD for all rays
      SIMD for all rays
        for all shaders in SIMD ray warp
          shader(ray) if mask == shader
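The shading loop above can be sketched serially to show where the cost comes from. The function name `shade_deferred`, the `lane_slots` counter and the toy shaders are assumptions for illustration: each warp makes one masked pass per distinct shader it contains, and every pass occupies the whole warp.

```python
def shade_deferred(shader_ids, shaders, warp_size=32):
    """Shade rays warp by warp, serializing over the shaders in each warp.

    shader_ids[i] is the shader requested by ray i (the mask from the
    visibility pass). Returns (colors, lane_slots), where lane_slots
    counts SIMD lanes occupied, whether or not they did useful work.
    """
    colors = [None] * len(shader_ids)
    lane_slots = 0
    for start in range(0, len(shader_ids), warp_size):
        warp = range(start, min(start + warp_size, len(shader_ids)))
        for sid in sorted({shader_ids[i] for i in warp}):
            lane_slots += warp_size  # the whole warp steps through this shader
            for i in warp:
                if shader_ids[i] == sid:  # mask == shader
                    colors[i] = shaders[sid](i)
    return colors, lane_slots

# Hypothetical shaders that just tag the ray index.
shaders = [lambda i: ("wall", i), lambda i: ("glass", i), lambda i: ("light", i)]
ids = [0, 0, 0, 0, 0, 1, 2, 1]
colors, lane_slots = shade_deferred(ids, shaders, warp_size=4)
```

Here the first warp is coherent and costs one pass, while the second holds three shaders and costs three, so 16 lane slots are spent to shade 8 rays.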

Process Sorting
• Need to bucket computations to move those with identical control flows onto the SIMD processors of the same MIMD node
• When is it worth the trouble?

Scan (Prefix Sum)
Input:                   1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
Output (exclusive scan): 0 1 1 1 1 1 1 1 2 2 3 3 3 3 4 4
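The example above is an exclusive prefix sum: each output is the sum of all inputs before it. A minimal serial sketch shows how the scan output doubles as scatter addresses for stream compaction. The names `exclusive_scan` and `compact` are hypothetical, and a GPU version would compute the scan in parallel rather than in a loop.

```python
def exclusive_scan(flags):
    """Return (offsets, total): offsets[i] is the sum of flags[0:i]."""
    offsets, total = [], 0
    for flag in flags:
        offsets.append(total)
        total += flag
    return offsets, total

def compact(items, flags):
    """Keep items whose flag is 1, using scan offsets as scatter addresses."""
    offsets, count = exclusive_scan(flags)
    out = [None] * count
    for item, flag, offset in zip(items, flags, offsets):
        if flag:
            out[offset] = item  # scatter to its compacted position
    return out

flags = [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0]
offsets, count = exclusive_scan(flags)
kept = compact(list(range(16)), flags)
```

Each flagged element learns its destination without coordinating with the others, which is why scan is the usual building block for compaction on SIMD hardware.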

Shader Scheduling
• Sort jobs based on shader request
  • Radix sort
  • Segmented scan
  • Global v. local sort
• Load MIMD nodes only with rays requesting the same shader
• Still O(NS)
  • Performing O(N) scan on each of S shaders
  • Can we scan on all shaders simultaneously?

    MIMD for all rays
      SIMD for all rays
        intersect ray with scene
    MIMD for all shaders
      Scan rays needing that shader
      MIMD for all rays needing that shader
        SIMD for all rays
          shader(ray)
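One way to picture the scheduling step is a counting sort on shader ID (a single radix pass) followed by cutting each bucket into warps. This is a sketch under assumptions, not the authors' implementation: `schedule_by_shader` is a hypothetical name and the sort is global and serial.

```python
def schedule_by_shader(shader_ids, num_shaders, warp_size=32):
    """Group ray indices by shader and cut each group into warps.

    Returns a list of (shader_id, ray_indices) warps, each holding rays
    that request the same shader.
    """
    counts = [0] * num_shaders
    for sid in shader_ids:
        counts[sid] += 1
    # Exclusive scan of the counts gives each bucket's start offset.
    offsets = [0] * num_shaders
    for s in range(1, num_shaders):
        offsets[s] = offsets[s - 1] + counts[s - 1]
    order = [None] * len(shader_ids)
    cursor = list(offsets)
    for ray, sid in enumerate(shader_ids):
        order[cursor[sid]] = ray  # stable scatter into the shader's bucket
        cursor[sid] += 1
    warps = []
    for s in range(num_shaders):
        bucket = order[offsets[s]:offsets[s] + counts[s]]
        for start in range(0, len(bucket), warp_size):
            warps.append((s, bucket[start:start + warp_size]))
    return warps

# Rays alternate between two shaders; unsorted, every warp of 4 holds both.
warps = schedule_by_shader([0, 1, 0, 1, 0, 1, 0, 1], num_shaders=2, warp_size=4)
```

After sorting, the eight rays fill two coherent warps instead of two warps that each need two masked passes. Buckets smaller than a warp still leave lanes idle, so the gain depends on how many rays each shader receives.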

Stanford Bunny in Cornell Box
• Three shaders: wall, glass, light
• Shaders simple
• Warp size: 32
[Chart labels: How often ray’s shader differs from previous ray’s; Average # of branches per warp]
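The two quantities named in the chart labels can be computed from a list of per-ray shader IDs. The sketch below is one plausible reading of those labels, not the authors' measurement code: `incoherence` is taken as the fraction of rays whose shader differs from the previous ray's, and `branches_per_warp` as the average number of distinct shaders per warp. Both function names are hypothetical.

```python
def incoherence(shader_ids):
    """Fraction of rays whose shader differs from the previous ray's."""
    if len(shader_ids) < 2:
        return 0.0
    changes = sum(a != b for a, b in zip(shader_ids, shader_ids[1:]))
    return changes / (len(shader_ids) - 1)

def branches_per_warp(shader_ids, warp_size=32):
    """Average number of distinct shaders requested within each warp."""
    if not shader_ids:
        return 0.0
    warps = [shader_ids[i:i + warp_size]
             for i in range(0, len(shader_ids), warp_size)]
    return sum(len(set(w)) for w in warps) / len(warps)

ids = [0, 0, 0, 0, 0, 1, 2, 1]
ray_incoherence = incoherence(ids)                   # 3 changes over 7 pairs
avg_branches = branches_per_warp(ids, warp_size=4)   # (1 + 3) / 2
```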

Automotive/CAD Viz
• DJ_Designs via Google 3D WH
• 16 simple shaders
• Small parts ameliorate their shader’s impact on overall efficiency

Angel in Cornell Box
• Four shaders:
  • wall, light simple
  • marble, wood are more expensive, procedural

Bounce   Incoherence   Efficiency
1        1.2%          77%
2        52%           23%
3        53%           21%
4        47%           22%
5        40%           23%

Siebel Center Staircase
• Six shaders
  • Copper, glass girder, chrome, marble, light
• Efficiency bump due to smooth glass/chrome coherence and rays exiting the scene

Efficiency Images
Branching Penalties
• Warp size: 32
• All 32 SIMD threads must follow the same control flow
[Image labels: one shader; 16 shaders]

Memory Coherence
• Shader execution
  • Serial: one at a time
  • SIMD: as a “big switch”
• Serialized
  • Slower, wastes processors
  • Avoids locks
  • Can conserve memory
• Compare w/ & w/o stream compaction
[Figure: three diagrams, each mapping Processes to Memory]

Scheduling Approaches
Five Options
• Serial Unsorted
• Serial Global Compaction
• Parallel Unsorted
• Parallel Local Compaction
• Parallel Global Compaction
Each variation involves bookkeeping overhead

[Chart: bars for Serialized (Sorted Global, Unsorted) and SIMD Parallel (Sorted Local, Sorted Global, Unsorted); vertical axis from 0 to 3500, units not given]
• Observations
  • Even for these modest scenes there are significant performance gains
  • Local per-node compaction doesn't work
  • Even zero-time sort would not improve most cases
  • Local per-node workloads hindered by too many shaders to schedule
  • Faster stream compaction: Prefix sum, Scatter/Gather

Conclusions
• Stream compaction
  • Not practical for simple shaders
  • Practical for procedural textures (wood, marble)
  • Probably for complex shaders (hair, cloth, skin)
• Warp coherence nevertheless leads to data incoherence
  • Even when all threads in a MIMD node run the same shader, their data is still distributed across memory, outside of cache boundaries
• Static tuning ok, but run-time better
• Broader implication to object polymorphism
  • Streaming same objects with different virtual function tables
