A Hardware Processing Unit For Point Sets
This presentation describes a hardware architecture for efficient processing of point sets in graphics applications. Because GPUs handle neighborhood queries poorly (incoherent branching, slow dynamic data structures, no recursion), the architecture combines a kd-tree neighbor search module, a coherent neighbor cache, and a multi-threaded, reconfigurable processing module, all prototyped on FPGAs. The prototype runs at 75 MHz and delivers good performance relative to its small hardware footprint on kNN search and MLS projection.
Presentation Transcript
A Hardware Processing Unit For Point Sets • S. Heinzle, G. Guennebaud, M. Botsch, M. Gross • Graphics Hardware 2008
Motivation • Point-based graphics established • Powerful algorithms: representation, processing, manipulation, rendering • Decomposition: get neighborhood, operate on neighbors
Motivation • GPUs not suited for getting neighborhood: SIMD, incoherent branching, dynamic data structures slow, recursive calls not supported • CPUs: small number of FPUs, inflexible memory caches
Contributions • Hardware architecture for point sets • Neighbor search module • Novel advanced caching mechanism • Reconfigurable processing module • Programmability using FPGA compiler • FPGA prototype and measurements • Small and lean • Integration into multi-core CPU/GPU possible
Outline • Related Work • Spatial Searching and Caching • Architecture and Prototype • Results • Conclusion
Related Work • Kd-tree [Bentley 75] • kNN on GPUs [Ma and McCool 02] • Kd-tree on GPUs [Popov et al. 07] • Kd-tree hardware [Woop et al. 05] [Woop et al. 06]
Related Work • Adaptive SPH fluid simulation [Adams et al. 07] • Algebraic moving least squares [Guennebaud and Gross 07] • Linear moving least squares [Adamson and Alexa 04]
Linear Moving Least Squares • Implicit surface defined by a set of points
Linear Moving Least Squares • [Figure: query point x and sample points p_i with normals n_i]
Linear Moving Least Squares • Iterative projections onto plane: x → x' → x'' → x'''
Linear Moving Least Squares • Surface defined by points projecting onto themselves
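The projection loop on these slides can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `mls_project`, the Gaussian weight, and the bandwidth `h` are assumptions, and the sketch weights all sample points instead of only the neighbors returned by a spatial search.

```python
import math

def mls_project(x, points, normals, h=1.0, iterations=10, tol=1e-9):
    """Iteratively project x onto a locally fitted plane (hypothetical helper).

    Assumed scheme: the plane passes through the weighted average of the
    sample points and uses the normalized weighted average of their normals.
    """
    x = list(x)
    for _ in range(iterations):
        wsum = 0.0
        a = [0.0, 0.0, 0.0]  # weighted sum of positions
        n = [0.0, 0.0, 0.0]  # weighted sum of normals
        for p, pn in zip(points, normals):
            d2 = sum((xc - pc) ** 2 for xc, pc in zip(x, p))
            w = math.exp(-d2 / (h * h))  # assumed Gaussian weight
            wsum += w
            for c in range(3):
                a[c] += w * p[c]
                n[c] += w * pn[c]
        a = [c / wsum for c in a]
        length = math.sqrt(sum(c * c for c in n))
        n = [c / length for c in n]
        # Signed distance of x to the fitted plane, then one projection step.
        f = sum(nc * (xc - ac) for nc, xc, ac in zip(n, x, a))
        x = [xc - f * nc for xc, nc in zip(x, n)]
        if abs(f) < tol:  # x now projects onto itself
            break
    return x
```

For samples lying on the plane z = 0 with normals (0, 0, 1), a query point above the plane lands on z = 0 after the first step and the loop stops on the second, which matches the "points projecting onto themselves" definition.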
Outline • Related Work • Spatial Searching and Caching • Architecture and Prototype • Results • Conclusion
Spatial Search • Spatial search: kNN (k nearest neighbors) and eNN (all neighbors within a given radius) • Common in most point operations • Based on kd-tree • Example eNN: [figure]
Spatial Search • kNN search similar to eNN search: • Start with infinite radius • Sort leaf points into priority queue • Shrink radius with every point sorted
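The three steps above can be sketched in software as follows. This is an illustrative sketch only: the tuple-based tree layout, the names `build` and `knn`, and the `leaf_size` parameter are assumptions, and Python's `heapq` stands in for the hardware priority queue.

```python
import heapq

def build(points, depth=0, leaf_size=4):
    """Build a simple kd-tree as nested tuples (hypothetical layout)."""
    if len(points) <= leaf_size:
        return ("leaf", points)
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return ("node", axis, pts[mid][axis],
            build(pts[:mid], depth + 1, leaf_size),
            build(pts[mid:], depth + 1, leaf_size))

def knn(tree, q, k):
    """Return the k nearest points to q as sorted (squared distance, point) pairs."""
    heap = []  # max-heap on squared distance, stored negated for heapq

    def radius2():
        # The search radius is infinite until k candidates have been found.
        return -heap[0][0] if len(heap) == k else float("inf")

    def visit(node):
        if node[0] == "leaf":
            for p in node[1]:
                d2 = sum((a - b) ** 2 for a, b in zip(p, q))
                if d2 < radius2():
                    if len(heap) == k:
                        heapq.heapreplace(heap, (-d2, p))  # radius shrinks
                    else:
                        heapq.heappush(heap, (-d2, p))
            return
        _, axis, split, left, right = node
        diff = q[axis] - split
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near)
        if diff * diff < radius2():  # far side may still hold closer points
            visit(far)

    visit(tree)
    return sorted((-nd2, p) for nd2, p in heap)
```

Calling `knn(build(points), q, 5)` returns the five nearest points in ascending order of squared distance. An eNN query would use the same traversal with a fixed radius instead of the shrinking one.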
Coherent Neighbor Cache (eNN) • Find neighbors in slightly bigger radius • Re-use result for spatially close query • Re-use if: [condition shown as a formula on the slide]
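The re-use formula did not survive in the transcript, so the sketch below uses a condition that follows from the triangle inequality and may differ from the one on the slide: if the cached search used radius (1 + alpha) * r around a center c, then every point within r of a new query q is in the cached set whenever |q - c| <= alpha * r. The class name `EpsCache`, the parameter `alpha`, and the brute-force search standing in for the kd-tree are all assumptions.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class EpsCache:
    """Single-entry coherent neighbor cache for range queries (hypothetical)."""

    def __init__(self, points, r, alpha=0.5):
        self.points, self.r, self.alpha = points, r, alpha
        self.center = None     # query position of the cached search
        self.candidates = []   # points within (1 + alpha) * r of center
        self.hits = 0
        self.misses = 0

    def query(self, q):
        if self.center is not None and dist(q, self.center) <= self.alpha * self.r:
            self.hits += 1     # cached superset is guaranteed to be sufficient
        else:
            self.misses += 1
            big = (1.0 + self.alpha) * self.r
            # Brute force stands in for the kd-tree search module here.
            self.candidates = [p for p in self.points if dist(p, q) <= big]
            self.center = q
        # Filter the cached superset down to the exact radius around q.
        return [p for p in self.candidates if dist(p, q) <= self.r]
```

A larger `alpha` raises the hit rate for coherent query sequences but makes every cached neighborhood bigger, so each hit has more candidates to filter.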
Coherent Neighbor Cache (kNN, exact) • Find (k+1) neighbors • Re-use result for spatially close query • Re-use if: [condition shown as a formula on the slide]
Coherent Neighbor Cache (kNN, approximation) • Approximation error e • Enlarge radius • Re-use if: [condition shown as a formula on the slide]
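Since the slide formulas are missing, the following sketch derives its own sufficient condition from the triangle inequality; it is not necessarily the condition the authors use. Let d_k and d_k1 be the distances from the cached query c to its k-th and (k+1)-th neighbors, and let s = |q - c|. The cached k neighbors are all within d_k + s of q, and every other point is at least d_k1 - s away, so re-use is safe when (2 + eps) * s <= (1 + eps) * d_k1 - d_k. With eps = 0 the result is exact; with eps > 0 the farthest returned neighbor is at most (1 + eps) times the true k-th distance. The class name `KnnCache` and the brute-force search are assumptions.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class KnnCache:
    """Single-entry coherent neighbor cache for kNN queries (hypothetical).

    Requires more than k points, because each miss fetches k + 1 neighbors.
    """

    def __init__(self, points, k, eps=0.0):
        self.points, self.k, self.eps = points, k, eps
        self.center = None
        self.neighbors = []
        self.d_k = self.d_k1 = 0.0
        self.hits = 0
        self.misses = 0

    def query(self, q):
        if self.center is not None:
            s = dist(q, self.center)
            # Assumed re-use test, derived from the triangle inequality.
            if (2.0 + self.eps) * s <= (1.0 + self.eps) * self.d_k1 - self.d_k:
                self.hits += 1
                return sorted(self.neighbors, key=lambda p: dist(p, q))
        self.misses += 1
        # Brute force stands in for the kd-tree search of k + 1 neighbors.
        ranked = sorted(self.points, key=lambda p: dist(p, q))[: self.k + 1]
        self.neighbors = ranked[: self.k]
        self.d_k = dist(ranked[self.k - 1], q)
        self.d_k1 = dist(ranked[self.k], q)
        self.center = q
        return list(self.neighbors)
```

Raising `eps` widens the re-use region, which is the trade-off the approximation slide describes: more cache hits in exchange for a bounded error in the returned neighborhood.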
Outline • Related Work • Spatial Searching and Caching • Architecture and Prototype • Results • Conclusion
The Architecture • [Block diagram; surviving label: Host]
Coherent Neighbor Cache • Eight cached neighborhoods • Problem: parallel queries in kd-tree module • Interleave spatially similar queries
Kd-Tree Traversal
NodeRecurse • Kd-tree structure on chip • 16 threads • Pipelining and multi-threading
Stacks • 16 stacks • Parallel read/write • Bounded in depth • 6 bytes per thread per recursion
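Because recursion is replaced by per-thread stacks of bounded depth, a software analogue of one traversal thread looks roughly like the sketch below. It runs an eNN (fixed radius) query with an explicit, fixed-size stack. The tree layout, the names `build` and `range_query`, and the `max_depth` bound are assumptions for illustration; the sketch does not model the 16-way threading or the 6-byte stack entries.

```python
def build(points, depth=0, leaf_size=2):
    """Build a simple kd-tree as nested tuples (hypothetical layout)."""
    if len(points) <= leaf_size:
        return ("leaf", points)
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return ("node", axis, pts[mid][axis],
            build(pts[:mid], depth + 1, leaf_size),
            build(pts[mid:], depth + 1, leaf_size))

def range_query(tree, q, r, max_depth=32):
    """Collect all points within radius r of q without recursion."""
    stack = [None] * max_depth  # fixed storage, like a bounded hardware stack
    top = 0
    stack[top] = tree
    top += 1
    found = []
    r2 = r * r
    while top > 0:
        top -= 1
        node = stack[top]
        if node[0] == "leaf":
            found.extend(p for p in node[1]
                         if sum((a - b) ** 2 for a, b in zip(p, q)) <= r2)
            continue
        _, axis, split, left, right = node
        diff = q[axis] - split
        near, far = (left, right) if diff < 0 else (right, left)
        # Push the far child only if the query ball crosses the split plane.
        children = [near] if abs(diff) > r else [far, near]
        for child in children:
            if top == max_depth:
                raise OverflowError("traversal stack exceeded its bound")
            stack[top] = child
            top += 1
    return found
```

The near child is pushed last so that it is popped first, which keeps the visiting order the same as in a recursive depth-first traversal.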
Leaf • 16 parallel priority queues (1-cycle ops) • Queues store pointers and distances • Bandwidth bottleneck
Processing Module • Multithreaded quad-port bank of 16 registers • 128 threads • Programmability using FPGA technology
Further Data • Implemented on two FPGAs • 64-bit DDR DRAM • Interconnection: no overhead • Resource usage (registers and LUTs) • Virtex 2 Pro 100 (kNN): 26% registers, 38% LUTs • Virtex 2 Pro 70 (MLS): 47% registers, 52% LUTs • Clock frequency: 75 MHz
Outline • Related Work • Spatial Searching and Caching • Architecture and Prototype • Results • Conclusion
Applications • Tested on various applications • PCI interface of prototype slow • [Weyrich et al. 04] • [Adams et al. 07]
Results kNN • Small hardware footprint • FPGA slightly slower • Realistic clock frequency: prototype faster than CPU/GPU • [Chart: number of queries vs. number of neighbors; clock labels: 75 MHz, 2200 MHz, 1200 MHz; bar labels relative to FPGA x1: CUDA x4, CUDA w/o sort x4.0, CPU x1.5, CUDA x2.4, CUDA w/o sort x3.1, CPU x1.4, CUDA x1.6, CPU x1.1; ASIC estimate at 500 MHz: x6.6]
Results MLS • FPGA faster than CPU • kNN bottleneck (FPGA, GPU) • [Chart: number of queries vs. number of neighbors; clock labels: 75 MHz, 2200 MHz, 1200 MHz; bar labels relative to FPGA x1: MLS CUDA x3.8, MLS CPU x0.4]
Coherent Neighbor Cache • [Chart: number of queries vs. level of coherence; series: CPU, e=0.1; FPGA, e=0.1; FPGA, exact]
Results Approximation Error (MLS projection) • [Chart: MLS error vs. approximation e, with the no-approximation case as reference]
Results Approximation Error (MLS projection) • [Chart: cache hits vs. approximation e]
Approximation Error (visual)
Approximation Error (visual) • Coherent Neighbor Cache: • Not optimal for exact queries • Approximate queries can be tolerated in most cases • Approximate queries greatly increase performance, even for small approximations
Outline • Related Work • Spatial Searching and Caching • Architecture and Prototype • Results • Conclusion
Conclusion • Novel hardware architecture for nearest-neighbor searches and generic meshless processing operators • Cache exploiting spatial coherence • Good performance considering resources • Possible GPU integration
Future Work • Programmable data structure • Support different data structures • Programmability in data structure • Construction on-chip • 'Real' programmability in point processing module