This work presents Kd-Jump, a path-preserving, stackless traversal method for Kd-trees that speeds up isosurface ray tracing on GPUs. The method follows the same path as a stack-based traversal while referencing nodes with indices rather than pointers, which cuts per-ray memory use, keeps memory access coalesced, and reduces the dependence on ray coherence. By minimising off-chip memory transactions and keeping the traversal state in on-chip registers, Kd-Jump outperforms stack-based and restart-based traversal in the authors' isosurface renderer.
Kd-Jump: A Path-Preserving Stackless Traversal for Faster Isosurface Ray Tracing on GPUs. David Meirion Hughes. Ik Soo Lim. Bangor University, UK.
Problem Setting and Previous Work
Problem Setting: Ray Tracing
• Tracing rays from the camera
• Find the intersections
• Avoid uninteresting areas
• Acceleration structure: division of space
• Requires ray traversal
Problem Setting: Traversal of Kd-Trees
• Downward traversal
• Two branch choices: remember the furthest, traverse the nearest
• Test for intersections
• If the branch had no hit? Traversal restore
• Go back to the other branch
[Figure: per-ray stack; each element holds tnear, tfar and a node pointer, e.g. 0.5, 0.75, 0x1...]
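As a concrete reference for the stack-based traversal above, here is a minimal CPU-side sketch; the Node layout, the leafHasHit test and the near/far child convention are illustrative assumptions, not the paper's exact data structure.

```cpp
#include <vector>

// Illustrative node layout (the paper's implicit kd-tree instead stores
// min/max scalar values and derives children implicitly).
struct Node {
    float split;     // split-plane position
    int   axis;      // 0 = x, 1 = y, 2 = z; -1 marks a leaf
    int   child[2];  // child node indices
};

struct StackEntry { int node; float tnear, tfar; };  // far branch + its ray segment

bool leafHasHit(int /*leaf*/) { return false; }      // placeholder intersection test

// Classic stack-based descent: traverse the nearer child first and push the
// farther child so the traversal can be restored if the near branch misses.
bool traverseStack(const std::vector<Node>& tree, const float org[3],
                   const float dir[3], float tnear, float tfar)
{
    std::vector<StackEntry> stack;   // the per-ray memory this talk wants to remove
    int node = 0;                    // root
    for (;;) {
        while (tree[node].axis >= 0) {
            const int a = tree[node].axis;
            const float t = (tree[node].split - org[a]) / dir[a];  // (zero-direction handling omitted)
            const int nearIdx = (dir[a] >= 0.0f) ? 0 : 1;
            if (t >= tfar) {
                node = tree[node].child[nearIdx];          // far child not reached
            } else if (t <= tnear) {
                node = tree[node].child[1 - nearIdx];      // near child not reached
            } else {                                       // ray spans both children
                stack.push_back({tree[node].child[1 - nearIdx], t, tfar});
                node = tree[node].child[nearIdx];
                tfar = t;
            }
        }
        if (leafHasHit(node)) return true;                 // intersection found in this leaf
        if (stack.empty()) return false;                   // nothing left to restore
        node  = stack.back().node;                         // traversal restore: other branch
        tnear = stack.back().tnear;
        tfar  = stack.back().tfar;
        stack.pop_back();
    }
}
```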
Problem Setting: GPUs
• Several MPUs, parallel execution
• Kernels: thousands of threads, light-weight code
• On-chip memory very fast
• On-board memory slow
Problem Setting: Ray Tracing on GPUs
• Stack – still a problem? Memory size and coalesced access
• One stack element: ray segment + node address/look-up
• Times depth-of-tree, times ray-count
• One kernel call
[Figure: per-ray stacks (tnear, tfar, node*) for Ray 1, Ray 2, Ray 3, ..., costing one, two, three memory transactions]
Previous Work: Stackless Traversal
• Avoid using a stack
• Current thinking: less memory, no global memory use, faster
[Figure: the per-ray stacks (tnear, tfar, node*) that stackless traversal removes]
Previous Work: Stackless Traversal
• Avoid using a stack
• Kd-Restart: restart from the root
• + Very little memory
• - Revisits previous nodes (tested twice)
• - Longer thread life
• - Exacerbates incoherence
Previous Work: Stackless Traversal
• Avoid using a stack
• Kd-Restart
• Kd-Backtrack: backtrack up the tree
• + Very little memory
• + Better than Kd-Restart
• - Revisits previous nodes (tested twice)
• - Longer thread life
• - Exacerbates incoherence
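For comparison, a sketch of the Kd-Restart idea under the same illustrative Node layout as the stack example above: no stack is kept, but already-visited nodes are re-descended from the root each time.

```cpp
// Kd-Restart sketch (reuses the illustrative Node and leafHasHit from above):
// after an unsuccessful leaf, advance the ray segment past it and restart
// from the root, re-descending nodes that were already visited.
bool traverseRestart(const std::vector<Node>& tree, const float org[3],
                     const float dir[3], float sceneTnear, float sceneTfar)
{
    float tnear = sceneTnear;
    while (tnear < sceneTfar) {
        int node = 0;                       // restart from the root every time
        float tfar = sceneTfar;
        while (tree[node].axis >= 0) {      // repeats work the stack would have saved
            const int a = tree[node].axis;
            const float t = (tree[node].split - org[a]) / dir[a];
            const int nearIdx = (dir[a] >= 0.0f) ? 0 : 1;
            if      (t >= tfar)  node = tree[node].child[nearIdx];
            else if (t <= tnear) node = tree[node].child[1 - nearIdx];
            else               { node = tree[node].child[nearIdx]; tfar = t; }
        }
        if (leafHasHit(node)) return true;
        tnear = tfar;                       // push the segment past the visited leaf
    }
    return false;
}
```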
Previous Work: Stackless Traversal
• Avoid using a stack
• Kd-Restart
• Kd-Backtrack
• Ropes: nodes have neighbour links (additional pointers per node)
• + Shorter ray life
• - Lots of extra memory
Kd-Jump: Motivation and Description
Kd-Jump: Motivation
• Goal: same path as the stack method, with the least amount of memory
• How:
• Indices rather than pointers
• Downward traversal with an equation
• Return using its inverse
• Binary bits as return markers
Kd-Jump: Index Reference
• Each node is referenced by an index
• x, y, z, etc... plus depth
[Figure: per-level memory blocks and the [x,y,z] memory map]
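One possible index-to-memory mapping, shown only to make the idea concrete; the exact layout and split-matrix format in the paper may differ, and a complete (fully balanced) tree is assumed.

```cpp
#include <cstdint>

// Hypothetical index-to-memory map for a complete (balanced) kd-tree: each
// depth level is one contiguous block, and a node is addressed by its
// per-dimension indices [x,y,z] plus its depth.  splitsAbove[depth][dim]
// counts how often `dim` has been split before that depth is reached
// (the per-depth "split matrix" mentioned later in the talk).
uint32_t nodeAddress(uint32_t x, uint32_t y, uint32_t z, uint32_t depth,
                     const uint32_t splitsAbove[][3])
{
    const uint32_t levelOffset = (1u << depth) - 1u;   // nodes in all shallower levels
    const uint32_t nx = 1u << splitsAbove[depth][0];   // level extent in x
    const uint32_t ny = 1u << splitsAbove[depth][1];   // level extent in y
    return levelOffset + x + nx * (y + ny * z);        // linear index inside the level
}
```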
Kd-Jump: Method Description
• Traversal into children
• Update one index element, determined by the split dimension
• Multiply by 2 and add the child offset f: C = 2x + f
• An x-dimension split takes [x,y,z] to [2x+f,y,z] (f = 0 or f = 1)
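A direct transcription of the descent rule; the three-element index plus the depth counter is the whole per-ray traversal state.

```cpp
#include <cstdint>

// Descent step, C = 2x + f: only the index element matching the node's split
// dimension changes.  f = 0 selects one child, f = 1 the other.
void descend(uint32_t idx[3], uint32_t& depth, int splitDim, uint32_t f)
{
    idx[splitDim] = 2u * idx[splitDim] + f;
    ++depth;
}
// Example: an x-dimension split takes [x,y,z] at depth d to [2x+f,y,z] at depth d+1.
```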
Kd-Jump: Method Description
• Traversal back to the parent
• Apply the inverse of the downward step
• Can replace f with a floor function; no need to consider what f was
• f = 0 or 1, so (C − f)/2 = x becomes floor(C/2) = x
• Returning across an x-dimension split: [x,y,z] → [floor(x/2),y,z]
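The inverse step from the slide, with the floor division written as a single right shift on the integer index:

```cpp
// Return step: floor(C / 2) recovers the parent element without knowing
// which child offset f was taken, since f is only ever 0 or 1.
void ascend(uint32_t idx[3], uint32_t& depth, int splitDim)
{
    idx[splitDim] >>= 1;   // floor(C / 2) = x
    --depth;
}
```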
Kd-Jump: Method Description
• Traversal to a common parent
• Apply the inverse on all indices: divide each element by a power of 2
• The power is the number of splits in that dimension, e.g. floor(x/2^1), floor(y/2^2), ...
• Matrix of split information, stored in constant memory (cached)
• (alternative) Compute it on the fly
[Figure: split matrix giving per-dimension split counts per depth, applied to [x,y,z]]
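Jumping several levels at once applies the same inverse to every dimension, with the shift amount read from the split matrix (the splitsAbove table assumed in the earlier sketch):

```cpp
// Jump to a common parent: divide each index element by 2^(number of splits
// of that dimension between targetDepth and the current depth), e.g.
// floor(x / 2^1), floor(y / 2^2), ...  splitsAbove is the per-depth split
// matrix assumed in the nodeAddress() sketch (constant memory on the GPU).
void jumpTo(uint32_t idx[3], uint32_t& depth, uint32_t targetDepth,
            const uint32_t splitsAbove[][3])
{
    for (int dim = 0; dim < 3; ++dim) {
        const uint32_t n = splitsAbove[depth][dim] - splitsAbove[targetDepth][dim];
        idx[dim] >>= n;                      // floor(element / 2^n)
    }
    depth = targetDepth;
}
```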
Kd-Jump: Method Description
• Determine the jump amount
• Mark common parents with 1 bit each, stored in MSB order in a 32-bit register
• On return: count the right-trailing zero bits
• This gives the return depth; subtract from the current depth for the jump amount
[Figure: 32-bit flag register, e.g. 0 1 0 0 0]
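One way to realise the 1-bit markers is a shift register of "far child still pending" flags; this particular encoding is an assumption for illustration (the slide stores the bits in MSB order), but the trailing-zero count plays the same role in locating the pending branch.

```cpp
#include <cstdint>

// Illustrative encoding of the per-ray depth flags: each descent records
// whether the far child of the node just left is still pending.  The count
// of right-trailing zero bits then says how many levels to jump back up
// (a single bit-scan instruction on the GPU, written as a loop here).
struct DepthFlags {
    uint32_t bits = 0;

    void onDescend(bool farChildPending) { bits = (bits << 1) | (farChildPending ? 1u : 0u); }

    uint32_t jumpAmount() const {            // 32 means: no pending branch, the ray is done
        uint32_t z = 0;
        while (z < 32 && ((bits >> z) & 1u) == 0u) ++z;
        return z;
    }

    void onJump(uint32_t z) {                // call with z = jumpAmount()
        bits = (z < 32) ? (bits >> z) & ~1u : 0u;   // consume the pending marker
    }
};
```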
Kd-Jump: Method Description
• Re-clip the ray to the node
• Bounds are either stored or computed
[Figure: re-clipping the ray against the node's X and Y bounds]
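The re-clip itself is the usual slab test against the node's bounds, wherever those bounds come from (stored per node, or recomputed from the index as above):

```cpp
#include <algorithm>
#include <utility>

// Re-clip the ray segment [tnear, tfar] against a node's bounding box
// (slab test); invDir holds the precomputed reciprocal ray direction.
bool reclip(const float org[3], const float invDir[3],
            const float bmin[3], const float bmax[3], float& tnear, float& tfar)
{
    for (int a = 0; a < 3; ++a) {
        float t0 = (bmin[a] - org[a]) * invDir[a];
        float t1 = (bmax[a] - org[a]) * invDir[a];
        if (t0 > t1) std::swap(t0, t1);
        tnear = std::max(tnear, t0);
        tfar  = std::min(tfar, t1);
    }
    return tnear <= tfar;    // false: the ray misses this node entirely
}
```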
Kd-Jump: Scope
• Nodes referenced with indices
• Traversal equations are invertible; route choices can be forgotten in the inverse
• Index-to-memory map: limit wasted memory (balanced kd-tree, implicit kd-tree)
• Requires node bounds, re-computed with an implicit kd-tree
Kd-Jump: Isosurfacing with an Implicit Kd-Tree
• Wald's implicit kd-tree
• min/max of each node's branch
• Left-balanced, no-waste memory map
• Bounds/splits are computed
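The node test enabled by the stored min/max values is just a range check on the iso-value, for example:

```cpp
// Implicit kd-tree node test for isosurfacing: a branch can only contain the
// surface if the iso-value falls inside that branch's stored min/max range.
struct MinMaxNode { float vmin, vmax; };

bool mayContainIsosurface(const MinMaxNode& n, float isoValue)
{
    return n.vmin <= isoValue && isoValue <= n.vmax;
}
```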
Implementation: Isosurfacing with an Implicit Kd-Tree
• Minor differences
• Node test prior to traversing: reduces the number of returns
• Stack, Kd-Jump, Kd-Restart
Results: Isosurfacing with an Implicit Kd-Tree
• Kd-Jump is faster
• Ray time-active is important (Kd-Restart)
• Stack is only slightly slower
• High occupancy (75%)
• High ray coherence = automatic coalesced access
[Figure: frames per second, averaged across multiple iso-values/views]
Results: Isosurfacing with an Implicit Kd-Tree
• All methods use one 32-bit register (stack_size, tfar_max, depth_flags)
• Stack memory is allocated for all rays
• Single kernel
• Constant memory is as fast as registers once the data is cached
[Figure: memory use]
Analysis: Kd-Jump
• Theoretical performance
• Memory access is not hidden
• However, it is perfectly coalesced
Analysis: Kd-Jump
• Bottlenecks
• Stack is memory bound
• Kd-Jump is computation bound
Hybrid Kd-Tree: Exploiting Texture Caching
• Build the implicit tree only down to a depth threshold
• Below it, volume stepping through the texture cache – very fast
• Choosing the threshold depth? Depends on the intersection method, iso-surface and view direction
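Below the threshold depth the hybrid scheme falls back to plain ray marching through the cached volume. This sketch assumes a sampleVolume() fetch (a stand-in for a hardware 3D texture read) and a fixed step size, both illustrative.

```cpp
// Volume stepping inside a node reached at the threshold depth: march the ray
// at a fixed step and report the first sign change of (sample - isoValue).
// sampleVolume() is a placeholder for a (texture-cached) trilinear fetch.
float sampleVolume(const float /*p*/[3]) { return 0.0f; }   // stub for illustration

bool marchIsosurface(const float org[3], const float dir[3],
                     float tnear, float tfar, float isoValue, float step, float& tHit)
{
    float prev = 0.0f;
    bool  havePrev = false;
    for (float t = tnear; t <= tfar; t += step) {
        const float p[3] = { org[0] + t*dir[0], org[1] + t*dir[1], org[2] + t*dir[2] };
        const float v = sampleVolume(p);
        if (havePrev && (prev - isoValue) * (v - isoValue) <= 0.0f) {
            tHit = t;          // a real renderer would refine this crossing (e.g. bisection)
            return true;
        }
        prev = v;
        havePrev = true;
    }
    return false;
}
```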
Hybrid Kd-Tree: Results
[Figure: frames per second, averaged across multiple views]
Conclusions
• Kd-Jump
• Stackless, index based
• Immediate backtrack to the common parent
• No dependency on ray coherency (at least if bounds can be computed)
• Hybrid Kd-Jump
• Texture cache over the acceleration structure
• Variable depth threshold of branches: view, intersection method, iso-value
Conclusions
• Future prediction: memory access and speed improving (current trend)
• Usefulness of stackless
• Reduced memory cost
• Reduced dependency on coherency
• Fewer iterations than stack (ropes)
• Big question: one kernel versus many?
• Stack favours one kernel, i.e. no reorganising (which can break coalesced access)
• Could organise into groups of the same depth, though?
• Many kernels = better device occupancy, memory access better hidden
Future Work
• Kd-Jump with general Kd-Trees?
• Real-time explicit from implicit
• CUDA 3.0: dynamic warps? Ideal for ray tracing? Inter-device communication?
Addendum: Indices for General-Case Kd-Trees?
• Nodes need bounds for the re-clip
• Accept the cost? Compute them somehow? (A BVH stores them anyway)
• Memory map: very difficult to remove wasted space; feasible to minimise it?