Introduction to Realtime Ray Tracing Course 41

Introduction to Realtime Ray Tracing, Course 41. Philipp Slusallek, Peter Shirley, Bill Mark, Gordon Stoll, Ingo Wald

Hardware for Realtime Ray Tracing • Custom Hardware for Realtime Ray Tracing • Characteristics and requirements • RPU Design and Implementation • GPU + Recursion + Custom Traversal HW • Programming Model • FPGA Prototype • Performance and Scalability

Ray Tracing on CPUs • Characteristics • Commodity, well understood HW • High FP performance, yet still too slow • Limited parallelism, bulky clusters • Poor silicon usage (e.g. cache) • Outlook • Multi-core designs are coming • Will still take too long

Ray Tracing on GPUs • Characteristics • Very high raw FP performance • High degree of parallelism • Fast development cycle • Stream programming model • Still too limited for efficient ray tracing • No support for recursion • Limited memory access

Ray Tracing Characteristics: kd-Tree Traversal • One-dimensional computation along ray • Compute location of split distance d relative to t_min / t_max • Iterate or recurse with updated t_min / t_max • Near: t_min < t_max < d • Both: t_min < d < t_max • Far: d < t_min < t_max

Ray Tracing Characteristics: kd-Tree Traversal • Inner traversal loop
    tmp  = node.split - ray.origin
    d    = tmp * 1/ray.direction
    near = d > t_min
    far  = d < t_max
    if (near & far) push(node.far, d, t_max)
    if (near) iterate(node.near, t_min, d)
    else      iterate(node.far, d, t_max)
• Advantages of using kd-trees • Simple and fast traversal & building algorithm • Robust & very good handling of large scenes
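A minimal Python sketch of the traversal loop above, under stated assumptions. The node layout (`Inner`, `Leaf`), the function name `traverse`, and the handling of axis-parallel rays are illustrative and are not taken from the RPU design. Unlike the simplified slide pseudocode, this version picks the near and far child from the sign of the ray direction and only ever narrows the [t_min, t_max] interval.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    prims: List[str]      # hypothetical primitive ids stored in this leaf


@dataclass
class Inner:
    axis: int             # split axis: 0, 1 or 2
    split: float          # split plane position on that axis
    left: "Node"          # child below the split plane
    right: "Node"         # child above the split plane


Node = Union[Leaf, Inner]


def traverse(root, origin, direction, t_min, t_max):
    """Return the leaves a ray visits, in front-to-back order."""
    visited = []
    stack = []
    node = root
    while True:
        while isinstance(node, Inner):
            o, dr = origin[node.axis], direction[node.axis]
            if dr == 0.0:
                # Ray parallel to the split plane: it stays on one side.
                node = node.left if o < node.split else node.right
                continue
            d = (node.split - o) / dr            # distance to split plane
            if dr > 0:
                near, far = node.left, node.right
            else:
                near, far = node.right, node.left
            if t_min < d < t_max:                # both children
                stack.append((far, d, t_max))
                node, t_max = near, d
            elif d > t_min:                      # near only (d >= t_max)
                node = near
            else:                                # far only (d <= t_min)
                node = far
        visited.append(node)
        # A real tracer would intersect the leaf here and stop at the
        # first hit inside [t_min, t_max]; this sketch visits all leaves.
        if not stack:
            return visited
        node, t_min, t_max = stack.pop()
```

For a tree with one split at x = 1, a ray travelling in +x from the origin visits the left leaf first and then the right one; the same ray with t_max = 0.5 never reaches the split plane and visits only the left leaf.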

Ray Tracing Characteristics: kd-Tree Traversal • Traversal Processing • 50-80 kd-tree steps per ray @ 10 instructions/step: many instructions → many clock cycles • Serial dependency → low pipeline efficiency, stalls, latency • Limited but flexible control flow and memory access → Custom HW unit • One clock tick per traversal step (fully pipelined) • Up to 100:1 improvement
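The per-ray cost follows directly from the figures on this slide. The short Python check below is only a sketch: the one-instruction-per-cycle software baseline is an assumption made for illustration, and the quoted 100:1 figure also reflects stalls and latency that this arithmetic does not model.

```python
steps_per_ray = (50, 80)     # kd-tree traversal steps per ray (from the slide)
instr_per_step = 10          # software instructions per step (from the slide)

# Software traversal: instructions needed per ray.
sw_instr = tuple(s * instr_per_step for s in steps_per_ray)

# Custom traversal unit: one clock tick per step when fully pipelined.
hw_cycles = steps_per_ray

# Assumption: the software path retires exactly one instruction per cycle.
ideal_ratio = sw_instr[0] / hw_cycles[0]

print(sw_instr, hw_cycles, ideal_ratio)
```

Under that assumption the software path needs 500 to 800 instructions per ray against 50 to 80 cycles on the custom unit, so instruction count alone accounts for a factor of 10.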

Ray Tracing Characteristics: Intersection • Intersection computation • Triggered by traversal at every leaf node • Called with: ray and address of geometry • Option 1: Custom hardware [SaarCOR’05] • Option 2: Software on programmable processor • Can be implemented efficiently • Enables arbitrary programmable primitives → Do not use costly dedicated hardware

Ray Tracing Characteristics: Shading • Shading computation • Triggered by finished ray traversal • Called with: ray, hit point, shader-id, address of parameters • Characteristics: • General-purpose computation, many 3-/4-vectors • Needs support for efficient texture and memory access • Needs support for arbitrary recursive ray tracing • E.g. support dependent ray tracing → Main feature of ray tracing: do not put limits on it

Ray Tracing Characteristics: Coherence • Ray coherence • Neighboring primary rays • Traverse highly similar kd-tree nodes in the same order • Often hit the same geometric primitives • Often execute the same shader, access the same textures, … • Similar for shadow rays to one light source • Often (but not always) applies to secondary rays → HW should take advantage of this coherence

Previous Work • SaarCOR I • Fixed function ray tracing chip [GH’05]

RPU Approach • Take GPUs as basis and core component • Highly parallel, highly efficient • Improve programming model • Add efficient recursion, conditionals • Add memory access options • Add custom traversal unit • Slave to RPU • Performs indirect, data-dependent function calls

RPU Design • Shader Processing Units (SPU) • General purpose computation • For shading, geometry, lighting computations • Operates on 4-component vectors • Integer and float • Dual issue, split vector • GPU-like instruction set • Arbitrary read/write • Texture addressing mode • No texture filtering → SW

RPU Design • Shader Processing Units (SPU) • Custom Ray Traversal Unit (TPU) • Efficient traversal of k-D trees • Communicates with SPU over dedicated registers

RPU Design • Shader Processing Units (SPU) • Custom Ray Traversal Unit (TPU) • Multi-Threading • Increases usage of HW resources • Hides latency due to • Memory access • Instruction dependencies • Long traversal operations • Separate thread pool for SPU & TPU • Software scheduling (compiler) • No overhead for switching threads • Increases resources (mainly register file)

RPU Design • Shader Processing Units (SPU) • Custom Ray Traversal Unit (TPU) • Multi-Threading • Chunking • SIMD execution (SPUs & TPUs) • Takes advantage of coherence • Reduces hardware complexity • Can combine memory requests • Reduces external bandwidth • Must allow for incoherence • Chunks may split at conditionals • Inactive sub-chunk put on stack • Masked execution • Worst case: serial computation
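The chunk-splitting behavior described above can be sketched in Python. This is a software model under stated assumptions, not the hardware mechanism: `run_chunk`, its arguments, and the pass counter are hypothetical names introduced only to show how a conditional splits a chunk, parks the inactive sub-chunk on a stack, and runs each side under a lane mask.

```python
def run_chunk(values, cond, then_fn, else_fn):
    """Run `if cond: then_fn else: else_fn` over one chunk of threads.

    Returns the per-thread results and the number of SIMD passes used.
    """
    n = len(values)
    results = [None] * n
    taken = [i for i in range(n) if cond(values[i])]
    not_taken = [i for i in range(n) if not cond(values[i])]

    stack = []
    if taken and not_taken:
        # Incoherent chunk: it splits, and the inactive sub-chunk waits
        # on the stack while the active one executes.
        stack.append((not_taken, else_fn))
        work = (taken, then_fn)
    elif taken:
        work = (taken, then_fn)
    else:
        work = (not_taken, else_fn)

    passes = 0
    while work is not None:
        lanes, fn = work
        for i in lanes:              # masked execution: only active lanes
            results[i] = fn(values[i])
        passes += 1
        work = stack.pop() if stack else None
    return results, passes
```

A coherent chunk finishes in one pass, while a chunk whose threads disagree on the condition needs two. With nested conditionals the splitting can continue down to one thread per pass, which is the serial worst case named on the slide.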

RPU Design • Shader Processing Units (SPU) • Custom Ray Traversal Unit (TPU) • Multi-Threading • Chunking • Mailbox Processing (MPU) • Per thread caching mechanism • Avoids multiple processing of same kd-tree entry (e.g. triangle) • 10x performance for some scenes
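A primitive that overlaps several kd-tree leaves would otherwise be intersected once per leaf. The Python sketch below illustrates the mailbox idea; the direct-mapped layout, the slot count, and the names `Mailbox` and `count_tests` are assumptions for illustration, since the slide does not describe how the MPU organizes its cache.

```python
class Mailbox:
    """Small per-thread cache of recently tested primitive ids."""

    def __init__(self, size=4):
        self.size = size
        self.slots = [None] * size

    def seen(self, prim_id):
        """Return True if prim_id was tested recently; record it otherwise."""
        slot = prim_id % self.size       # assumed direct-mapped placement
        if self.slots[slot] == prim_id:
            return True
        self.slots[slot] = prim_id       # may evict an older entry
        return False


def count_tests(leaves, mailbox=None):
    """Count intersection tests for one ray over the leaves it visits."""
    tests = 0
    for leaf in leaves:
        for prim_id in leaf:
            if mailbox is not None and mailbox.seen(prim_id):
                continue                 # already intersected by this ray
            tests += 1
    return tests
```

For a ray visiting leaves `[[1, 2], [2, 3], [3, 1]]`, the count drops from 6 tests to 3 with the mailbox. An evicted entry only costs a redundant test, so correctness does not depend on the cache size.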

RPU Architecture

SPU Vector Registers • All registers have 4 components (float or integer) • R0 to R15: General registers • Index into a HW managed register stack • Allows for single-cycle function call • P0 to P15: shader parameters • I0 to I3: data read from memory • A = (A0,A1,A2,A3) • Memory addressing • ORG, DIR, ... • TPU communication registers

Instruction Set of SPU • Short vector instruction set • mov, add, mul, mad, frac • dph2, dp3, dph3, dp4 • Input modifiers • Swizzling, negation, masking • Multiply with power of 2 • Special operations (modifiers) • rcp, rsq, sat • Fast 2D texture lookups • texload, texload4x • Read from and write to memory • load, load4x, store • Ray traversal operation • trace • Conditional instructions (paired) • if <condition> jmp label • if <condition> call <fun> • if <condition> return • Dual issue (pairing) • 3/1 and 2/2 arithmetic splitting • Arithmetic + load • Arithmetic + conditional jump, call, return

Ray Triangle Intersection: Unit-Triangle Test
    ; load triangle transformation
    load4x A.y,0
    ; transform ray
    dp3_rcp R7.z,I2,R3
    dp3 R7.y,I1,R3
    dp3 R7.x,I0,R3
    dph3 R6.x,I0,R2
    dph3 R6.y,I1,R2
    dph3 R6.z,I2,R2
    ; compute hit distance
    mul R8.z,-R6.z,S.z + if z <0 return
    ; barycentric coordinates
    mad R8.xy,R8.z,R7,R6 + if or xy (<0 or >=1) return
    ; hit if u + v < 1
    add R8.w,R8.x,R8.y + if w >=1 return
    ; hit distance closer than last one?
    add R8.w,R8.z,-R4.z + if w >=0 return
    ; save hit information
    mov SID,I3.x + mov MAX,R8.z
    mov R4.xyz,R8 + return
Callouts: Input • Arithmetic (dot products) • Multi-issue (arith. & cond.)
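The same test can be written out in Python to make the arithmetic explicit. This is a sketch under stated assumptions: the function name, the argument layout, and the reading of `dph3` as a 3-component dot product plus the w component of the matrix row are interpretations, not a specification of the SPU instruction set. The transform is taken to map the triangle onto the unit triangle (0,0,0), (1,0,0), (0,1,0) in the plane z = 0.

```python
def dot3(a, b):
    """3-component dot product (the role dp3 plays in the shader)."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def unit_triangle_hit(rows, origin, direction, t_max):
    """Intersect a ray with a triangle given as an affine transform.

    rows: three 4-component rows of a world-to-unit-triangle transform.
    Returns (t, u, v) for a hit closer than t_max, else None.
    """
    # Transform ray: the direction uses only the linear part, the origin
    # also adds the translation stored in the fourth component.
    d = [dot3(r, direction) for r in rows]
    o = [dot3(r, origin) + r[3] for r in rows]

    if d[2] == 0.0:
        return None                      # ray parallel to triangle plane
    t = -o[2] / d[2]                     # hit distance with plane z = 0
    if t < 0:
        return None
    u = o[0] + t * d[0]                  # barycentric coordinates
    v = o[1] + t * d[1]
    if u < 0 or u >= 1 or v < 0 or v >= 1:
        return None
    if u + v >= 1:                       # outside the unit triangle
        return None
    if t >= t_max:                       # not closer than the last hit
        return None
    return t, u, v
```

With an identity transform, a ray from (0.25, 0.25, 1) pointing in -z hits at distance 1 with u = v = 0.25, while a ray from (0.75, 0.75, 1) is rejected because u + v exceeds 1.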

Shader Processing Unit: Pipelining • [Figure: SPU pipeline diagram. Labeled stages and units: Read Instruction; Read 3 Source Registers; Swizzling; Memory Access; multipliers (×4); adders (×4); Thread Control; Clamp; Branching; RCP, RSQ; Masking; Stack Control; Writeback (I0 – I3). Example instructions shown: mov R0,R1; mov R2,R3; mov R0,R2]

RPU Programming Model • ↨: Direct function calls • ↔: Indirect function calls via TPU • [Figure: shader call graph. SPU processing: Primary Ray Shader, Top-Level Object Intersector, Geometry Intersector, Surface/BRDF Shader, Lighting Shader, Light Source Shaders. TPU/MPU processing: primary rays, secondary rays, shadow rays]

RPU Programming Model • Threads are started for each pixel • Registers initialized from an input stream • 2D Hilbert curve generator sampling the screen • Memory stream for multi-pass • Shader computes ray
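A Hilbert curve visits pixels so that consecutive threads stay spatially close, which helps the ray coherence the hardware relies on. The Python sketch below uses the standard index-to-coordinate conversion for a power-of-two square; the function names and the way a non-square screen is handled (walking the enclosing square and skipping pixels outside the screen) are assumptions, since the slide does not say how the hardware generator works.

```python
def hilbert_d2xy(n, d):
    """Map index d on a Hilbert curve over an n x n grid to (x, y).

    n must be a power of two and 0 <= d < n * n.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_pixels(width, height):
    """Yield all screen pixels in Hilbert order (assumed scheme)."""
    n = 1
    while n < max(width, height):
        n *= 2
    for d in range(n * n):
        x, y = hilbert_d2xy(n, d)
        if x < width and y < height:     # skip padding outside the screen
            yield x, y
```

On a power-of-two square every step moves to a directly adjacent pixel; for other screen sizes the skipping scheme above still visits each pixel exactly once but may jump at the screen border.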

RPU Programming Model • Shooting Primary Rays • Ray traversal performed on the TPU • Started in top-level kd-tree • Intersector transforms ray into local coordinate system

RPU Programming Model • Shooting Primary Rays (II) • Transformed ray traversed through object-level kd-tree on TPU • Geometry intersection performed on programmable SPU • Programmable geometry: triangles, spheres, bicubic splines, quadrics, …
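Because geometry intersection runs as software, supporting a new primitive type means writing another intersector routine. As an illustration, here is a hedged Python sketch of a ray-sphere intersector; the function name and signature are hypothetical, and the code is the textbook quadratic solution rather than anything taken from the RPU shaders.

```python
import math


def intersect_sphere(center, radius, origin, direction, t_max):
    """Return the nearest hit distance t with 0 < t < t_max, or None."""
    oc = [origin[i] - center[i] for i in range(3)]
    a = sum(c * c for c in direction)
    if a == 0.0:
        return None                      # degenerate ray direction
    b = 2.0 * sum(oc[i] * direction[i] for i in range(3))
    c = sum(x * x for x in oc) - radius * radius

    disc = b * b - 4.0 * a * c           # discriminant of the quadratic
    if disc < 0.0:
        return None                      # ray misses the sphere
    root = math.sqrt(disc)
    # Try the closer root first, then the farther one (ray starts inside).
    for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
        if 0.0 < t < t_max:
            return t
    return None
```

Like the triangle test, the routine receives the ray in the object's local coordinate system and reports a hit only if it is closer than the current t_max.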

RPU Programming Model • Surface shading performed on programmable SPU • Surface shader is called directly from primary shader • Arguments passed on HW stack • May trace secondary rays at any time: reflection, refraction, … • Writing shaders is easy due to global access to the scene and physically-based computation

RPU Programming Model • Light properties and illumination can be abstracted using function calls • Illumination shader iterates over all light sources • For each light source a light source shader is called
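The shader structure described above can be sketched in Python. Everything named here is hypothetical: `point_light`, `illuminate`, the `occluded` callback standing in for a shadow ray, and the diffuse-only shading model are illustration choices, not the shaders used in the prototype.

```python
import math


def point_light(position, power):
    """Build a light source shader for a point light (assumed model)."""
    def shader(hit_point):
        v = [position[i] - hit_point[i] for i in range(3)]
        dist = math.sqrt(sum(c * c for c in v))
        direction = [c / dist for c in v]
        return direction, dist, power / (dist * dist)
    return shader


def illuminate(hit_point, normal, albedo, lights, occluded):
    """Illumination shader: iterate over all lights, call each light shader."""
    total = 0.0
    for light in lights:
        direction, dist, intensity = light(hit_point)
        # The shadow ray would be traced on the traversal unit; here it
        # is a callback so the sketch stays self-contained.
        if occluded(hit_point, direction, dist):
            continue
        cos_theta = max(0.0, sum(normal[i] * direction[i] for i in range(3)))
        total += albedo * intensity * cos_theta
    return total
```

Adding a new light type only requires another function with the same return convention; the illumination loop does not change.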

Prototype Implementation

Prototype Performance • FPGA prototype • Xilinx Virtex II 6000 • 128 MB DDR-RAM at 350 MB/s • PCI bus for up-/download (no VGA) • Single RPU at only 66 MHz • Up to 4 million rays per second • Up to 20 fps @ 512x384 • Same ray tracing performance as Intel P4 @ 2.66 GHz
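The frame rate and ray rate quoted above are consistent with each other, as a quick Python check shows. The assumption of exactly one primary ray per pixel with no secondary or shadow rays is made here for the arithmetic and is not stated on the slide.

```python
width, height, fps = 512, 384, 20        # from the slide
clock_hz = 66_000_000                    # single RPU at 66 MHz
peak_rays_per_second = 4_000_000         # "up to 4 million rays per second"

# Assumption: one primary ray per pixel, no secondary or shadow rays.
pixels = width * height
primary_rays_per_second = pixels * fps

# Average clock cycles available per ray at the peak ray rate.
cycles_per_ray = clock_hz / peak_rays_per_second

print(pixels, primary_rays_per_second, cycles_per_ray)
```

20 fps at 512x384 needs about 3.9 million primary rays per second, just under the quoted peak, and at 66 MHz that peak leaves an average of 16.5 clock cycles per ray.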

Scalability • Larger Chunk Size • Less ray coherence • More data is accessed • Increased cache bandwidth • Larger caches

Scalability • Larger Chunk Size • Multiple RPUs on a Chip • Limited by • VLSI technology • Memory bandwidth • FPGA prototype versus current GPUs (GPU advantage) • Floating point units 50x • Memory bandwidth 100x • Clock rate 7x

Scalability • Larger Chunk Size • Multiple RPUs on a Chip • Multiple chips on a board • Fast interconnect for data exchange • Cache sizes accumulate • Managed through virtual memory [Schmittler’2003] • Limited through external bandwidth due to scene changes

Scalability • Larger Chunk Size • Multiple RPUs on a Chip • Multiple chips on a board • Multiple boards in a PC • Similar to today’s PC clusters in a much smaller form factor

Video

Future Work • Support for fully dynamic scenes • Vertex shader + building kd-trees • Efficient photon mapping • kd-tree construction + kNN filtering • OpenRT-API [Dietrich’03] • ASIC prototype

Questions? http://graphics.cs.uni-sb.de http://www.OpenRT.de http://www.SaarCOR.de
