Glift: An Abstraction for Generic, Efficient GPU Data Structures

Glift: An Abstraction for Generic, Efficient GPU Data Structures Aaron Lefohn University of California, Davis

Problem Statement • Goal • Simplify creation and use of random-access GPU data structures for graphics and GPGPU programming • Challenges • GPU memory model different than CPU • Memory model spans multiple processors and languages • Efficiency • Solution • Abstraction for GPU data structures • Glift template library

Collaborators • Joe Kniss, University of Utah • Robert Strzodka, CAESAR Research Institute • Shubhabrata Sengupta, University of California, Davis • John Owens, University of California, Davis

Abstraction Spoiler… • GPU data structures are not as complex as they appear • A large number of GPU data structures can be easily understood if the following three items are specified • Virtual address domain • Physical address domain • Address translator

Overview • Motivation and Previous Work • Abstraction • Glift template library • Case study • Adaptive shadow maps and octree 3D paint • Conclusions

Motivation CPU Data Structures • Why? • Algorithms expressed in natural data domain • Decouple algorithms and data structures • Reuse [Diagram: virtual representation of memory (N-D array, stack, hash table, queue, …); abstractions provided by library (STL, Boost, …); physical representation of memory (1D array)]

Motivation GPU State of the Art • GPU code accesses physical memory directly • GPU code is tangled mess of data structure access and algorithms • No reuse • Greatly complicates GPU algorithm design

Abstraction We Want To Transform This…
    float3 getAddr3D( float2 winPos, float2 winSize, float3 sizeConst3D ) {
        float3 addr3D;
        float2 winPosInt = floor(winPos);
        float addr1D = winPosInt.y * winSize.x + winPosInt.x;
        addr3D.z = floor( addr1D / sizeConst3D.z );
        addr1D -= addr3D.z * sizeConst3D.z;
        addr3D.y = floor( addr1D / sizeConst3D.y );
        addr3D.x = addr1D - addr3D.y * sizeConst3D.y;
        return addr3D;
    }
    float2 getAddr2D( float3 addr3D, float2 winSize, float3 sizeConst3D ) {
        float addr1D = dot( addr3D, sizeConst3D );
        float normAddr1D = addr1D / winSize.x;
        return float2( frac(normAddr1D) * winSize.x, normAddr1D );
    }
    float3 main( uniform samplerRECT data, uniform float2 winSize,
                 uniform float3 sizeConst3D, float2 winPos : WPOS ) : COLOR {
        float3 hereAddr3D = getAddr3D( winPos, winSize, sizeConst3D );
        float3 neighborAddr3D = hereAddr3D - float3(1, 1, 1);
        return texRECT( data, getAddr2D(neighborAddr3D, winSize, sizeConst3D) );
    }

Abstraction Into This.
    void main( uniform VMem3D srcData, Iterator3D iter, out float result ) {
        float3 va = iter.addr();
        result = srcData.vTex3D( va - float3(1,1,1) );
    }

Previous Work Previous Work • GPU computation • GPU data structures

Previous Work Shader/Kernel Languages • Vertex and fragment assembly 2001-2002 • Real-Time Shading Language 2001 • Cg, HLSL, GLSL, Ashli 2002-2004

Previous Work Higher-Level Languages • Abstract kernels, streams, and “glue” • Brook 2003 • Sh 2004 • Scout 2004

Previous Work Previous Work • GPU computation • GPU data structures

Previous Work Virtualized GPU Data Structures • Brook • Virtualizes CPU/GPU interface for 1D – 4D arrays • Sh • Virtualizes 1D arrays and CPU/GPU data access

Previous Work Example Brook Data Structure
    kernel void myKernel( float srcData[], out float dstData<> ) {
        dstData = srcData[ i - 1 ];   // i: index of the current output element (slide shorthand)
    }
    float srcData<5000>, dstData<5000>;
    streamRead( srcData, dataPtr );
    myKernel( srcData, dstData );
    streamWrite( dstData, dataPtr );

Previous Work Limitations of Brook Data Virtualization • Difficult to add new, equivalently virtualized structures • Requires adopting entire system • Only exposes GL_NEAREST, GL_TEXTURE_2D

Previous Work What about other Data Structures? • Photon map Purcell • Sparse matrix Boltz, Krueger • Sparse simulation grid Lefohn • Polycube (3D grid, cubeMap, …) Tarini • N-tree Lefebvre • …

Motivation GPU Data Structures • What’s Missing? • Standalone abstraction for GPU data structures for graphics or GPGPU programming [Diagram labels: Brook, Sh, STL, ???, C++, Cg, OpenGL]

Overview • Motivation and Previous Work • Abstraction • Glift template library • Case study • Adaptive shadow maps and octree 3D paint • Conclusions

Why a Data Structure Abstraction? • Separate data structures and algorithms • Enable more complex structures • Enable more complex algorithms • Provide perspective on class of GPU-compatible structures • Is random read required? • What is required for stream read/write?

Abstraction Abstraction Design Goals • GPU data structure abstraction that • Enables easy creation of new structures • Virtualizes CPU and GPU memory interfaces • Separates containers from algorithms • Encourages efficiency

Abstraction Abstraction Design Approach • Minimal efficient abstraction of GPU memory model • GPU is different than CPU (1D/2D/3D/Cube/Mip) • Identify common patterns in GPU papers and code • Other inspiration • STL, Boost, STAPL, Stepanov • Brook

Abstraction What is the GPU Memory Model? • Natively multi-dimensional • CPU interface • glTexImage: malloc • glDeleteTextures: free • glTexSubImage: memcpy CPU -> GPU • glGetTexSubImage*: memcpy GPU -> CPU • glCopyTexSubImage: memcpy GPU -> GPU • glBindTexture: read-only parameter bind • glFramebufferTexture: write-only parameter bind • * Does not exist. Emulate with glReadPixels

Abstraction What is the GPU Memory Model? • GPU Interface (shown in Cg) • uniform samplerND: parameter declaration • texND(tex, addr): random-access read • streamND(tex)*: stream read • * Does not exist, but is a useful construct for efficiency reasons

Abstraction GPU Data Structure Abstraction • Factor GPU data structures into • Physical memory • Virtual memory • Address translator • Iterators
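The factoring on this slide can be mimicked on the CPU. The sketch below is only an illustration under stated assumptions: it runs on the CPU, uses a row-major array in place of a 2D texture, and every name in it (PhysMem2D, Addr3to2, VirtMem, makeExample) is hypothetical rather than part of Glift. It shows virtual memory as nothing more than an address translator composed with physical memory.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// All names here are hypothetical and CPU-only; this is not the Glift API.
using Addr2 = std::array<std::size_t, 2>;  // physical address (x, y)
using Addr3 = std::array<std::size_t, 3>;  // virtual address (x, y, z)

// Physical memory: a 2D array stored row by row, standing in for a 2D texture.
struct PhysMem2D {
    Addr2 size;
    std::vector<float> texels;
    explicit PhysMem2D(Addr2 s) : size(s), texels(s[0] * s[1], 0.0f) {}
    float& at(Addr2 pa) {
        assert(pa[0] < size[0] && pa[1] < size[1]);
        return texels[pa[1] * size[0] + pa[0]];
    }
};

// Address translator: analytic 3D-to-2D mapping through a linear index.
struct Addr3to2 {
    Addr3 virtSize;
    Addr2 physSize;
    Addr2 translate(Addr3 va) const {
        const std::size_t linear =
            va[0] + virtSize[0] * (va[1] + virtSize[1] * va[2]);
        return Addr2{linear % physSize[0], linear / physSize[0]};
    }
};

// Virtual memory: the composition of a translator and physical memory.
template <typename AddrTrans, typename PhysMem>
struct VirtMem {
    AddrTrans trans;
    PhysMem mem;
    float& at(Addr3 va) { return mem.at(trans.translate(va)); }
};

using VMem3D = VirtMem<Addr3to2, PhysMem2D>;

// Build a 10x10x10 virtual array backed by a 40x25 physical array.
inline VMem3D makeExample() {
    return VMem3D{Addr3to2{Addr3{10, 10, 10}, Addr2{40, 25}},
                  PhysMem2D(Addr2{40, 25})};
}
```

Swapping in a different translator type changes the data structure without touching the physical memory class or the code that calls at().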

Motivation Virtualized GPU Memory Model • Natively multi-dimensional • CPU interface • glTexImage: C++ constructor • glDeleteTextures: C++ destructor • glTexSubImage: write( orig, size, dataPtr ) • glGetTexSubImage*: read( orig, size, dataPtr ) • glCopyTexSubImage: copy( orig, size, dst ) • glBindTexture: bind_for_read( cgParam ) • glFramebufferTexture: bind_for_write( attach ) • * Does not exist. Emulate with glReadPixels

Motivation Virtualized GPU Memory Model • GPU Interface (shown in Cg) • uniform samplerND: uniform VMemND • texND(tex, addr): vTexND(addr) • streamND(tex)*: streamRead(iter) • * Does not exist, but is a useful construct

Abstraction Physical Memory • Native GPU textures • Choose based on algorithm efficiency requirements • 1D • Read-write, linear, 4096 max size • 2D • Read-write, bilinear, 4096^2 max size • 3D • Read-only, trilinear, 512^3 max size • Cube • Read-write, bilinear, square, array of six 2D textures • Mipmaps • Additional (multiresolution) dimension to address

Abstraction Virtual Memory • Virtual N-D address space • Choose based on problem space of algorithm • Defined by physical memory and address translator [Figure: virtual representation of memory (a 3D grid) with translations to three physical layouts: 3D native memory, 2D slices, and a flat 3D texture]

Abstraction Address Translator • Mapping between physical and virtual addrs • Core of data structure • Small amount of code defines all required C++ and Cg memory interfaces • Select based on virtual and physical domains and memory/compute efficiency requirements of algorithm

Abstraction Address Translator Examples • Examples • ND-to-2D • 3D-to-2D tiled “flat 3D textures” • Page table • Grid of lists • Hash table • Silmap
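Of the translators listed, the page table is the easiest to show in a few lines. The following is a minimal CPU-side sketch, assuming square pages and a 2D virtual domain; the class and member names (PageTable2D, mapPage, translate, Vec2i) are made up for illustration and do not come from Glift. In the classification on the next slide it would count as a discrete, partial translator with O(1) lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec2i { int x, y; };

// Discrete, partial translator: the virtual domain is split into square pages,
// and a table records where (if anywhere) each page lives in physical memory.
class PageTable2D {
public:
    PageTable2D(int pagesX, int pagesY, int pageSize)
        : pagesX_(pagesX), pageSize_(pageSize),
          origins_(static_cast<std::size_t>(pagesX) * pagesY, Vec2i{-1, -1}) {}

    // Map one virtual page to a physical page origin (in texels).
    void mapPage(int pageX, int pageY, Vec2i physOrigin) {
        origins_[index(pageX, pageY)] = physOrigin;
    }

    // Translate a virtual address; returns false if the page is unmapped.
    bool translate(Vec2i va, Vec2i& pa) const {
        const Vec2i origin = origins_[index(va.x / pageSize_, va.y / pageSize_)];
        if (origin.x < 0) return false;
        pa = Vec2i{origin.x + va.x % pageSize_, origin.y + va.y % pageSize_};
        return true;
    }

private:
    std::size_t index(int pageX, int pageY) const {
        assert(pageX >= 0 && pageX < pagesX_ && pageY >= 0);
        const std::size_t i = static_cast<std::size_t>(pageY) * pagesX_ + pageX;
        assert(i < origins_.size());
        return i;
    }
    int pagesX_, pageSize_;
    std::vector<Vec2i> origins_;
};
```

Only mapped pages consume physical memory, which is what makes this kind of translator suitable for sparse or adaptive structures.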

Abstraction Address Translator Classifications • Representation • Analytic / Discrete • Memory Complexity • O(1), O(log N), O(N), … • Compute Complexity • O(1), O(log N), O(N), … • Compute Consistency • Uniform vs. non-uniform • Total / Partial • Complete vs. sparse • One-to-one / Many-to-one • Uniform vs. adaptive

Abstraction Iterators • Abstraction for virtual and physical addrs • CPU and GPU iterators • Boundary between containers and algorithms • Allows separate definition!

Abstraction Which Iterators? • Forward iterator • Required for GPU execution • Enables stream read and stream write • Random access iterator • Required for indexing (i.e., texture-like structures) • Graphics needs random access. GPGPU often does not. • Others Coming… • Use iterators to explicitly declare access patterns
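The role of a forward iterator as the boundary between container and algorithm can be sketched on the CPU. This is a simplified cursor-style iterator, not an STL-conforming one, and RangeIterator3D, forEach, and visit are hypothetical names chosen for the example. The kernel only ever sees the virtual address the iterator hands it, which is the "stream" access pattern the slide describes.

```cpp
#include <array>
#include <cassert>
#include <vector>

using Addr3 = std::array<int, 3>;  // virtual address (x, y, z)

// Forward iterator over the virtual addresses in [origin, origin + size).
class RangeIterator3D {
public:
    RangeIterator3D(Addr3 origin, Addr3 size)
        : origin_(origin), size_(size), count_(0) {
        assert(size[0] > 0 && size[1] > 0 && size[2] > 0);
    }
    bool done() const { return count_ == size_[0] * size_[1] * size_[2]; }
    void next() { ++count_; }
    // Current virtual address, with x varying fastest.
    Addr3 addr() const {
        return Addr3{origin_[0] + count_ % size_[0],
                     origin_[1] + (count_ / size_[0]) % size_[1],
                     origin_[2] + count_ / (size_[0] * size_[1])};
    }
private:
    Addr3 origin_, size_;
    int count_;
};

// The "algorithm" side: run a kernel once per address the iterator produces.
template <typename Kernel>
void forEach(RangeIterator3D it, Kernel kernel) {
    for (; !it.done(); it.next()) kernel(it.addr());
}

// Example kernel use: collect every visited address in order.
inline std::vector<Addr3> visit(Addr3 origin, Addr3 size) {
    std::vector<Addr3> out;
    forEach(RangeIterator3D(origin, size),
            [&out](Addr3 va) { out.push_back(va); });
    return out;
}
```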

Abstraction Simple Example • 3D Array with 2D physical memory CPU (C++)
    #include <boost/multi_array.hpp>
    typedef boost::multi_array<float, 3> array_type;
    array_type srcData( boost::extents[10][10][10] );
    array_type dstData( boost::extents[10][10][10] );
    … initialize data …
    // Copy each element from its (-1, -1, -1) neighbor.
    for (size_t z = 1; z < 10; ++z) {
        for (size_t y = 1; y < 10; ++y) {
            for (size_t x = 1; x < 10; ++x) {
                dstData[z][y][x] = srcData[z-1][y-1][x-1];
            }
        }
    }

Abstraction Example : GPU Shader w/out Abstraction
    float3 getAddr3D( float2 winPos, float2 winSize, float3 sizeConst3D ) {
        float3 addr3D;
        float addr1D = winPos.y * winSize.x + winPos.x;
        addr3D.z = floor( addr1D / sizeConst3D.z );
        addr1D -= addr3D.z * sizeConst3D.z;
        addr3D.y = floor( addr1D / sizeConst3D.y );
        addr3D.x = addr1D - addr3D.y * sizeConst3D.y;
        return addr3D;
    }
    float2 getAddr2D( float3 addr3D, float2 winSize, float3 sizeConst3D ) {
        float addr1D = dot( addr3D, sizeConst3D );
        float normAddr1D = addr1D / winSize.x;
        float2 neighborAddr2D = float2( frac(normAddr1D) * winSize.x, normAddr1D );
        return neighborAddr2D;
    }
    float3 main( uniform samplerRECT data, uniform float2 winSize,
                 uniform float3 sizeConst3D, float2 winPos : WPOS ) : COLOR {
        float3 hereAddr3D = getAddr3D( floor(winPos), winSize, sizeConst3D );
        float3 neighborAddr3D = hereAddr3D - float3(1, 1, 1);
        return texRECT( data, getAddr2D(neighborAddr3D, winSize, sizeConst3D) );
    }

Abstraction Example : Rename Variables
    // Physical-to-virtual address translation
    float3 physToVirt( float2 pa, float2 physSize, float3 virtSizes ) {
        float3 va;
        float addr1D = pa.y * physSize.x + pa.x;
        va.z = floor( addr1D / virtSizes.z );
        addr1D -= va.z * virtSizes.z;
        va.y = floor( addr1D / virtSizes.y );
        va.x = addr1D - va.y * virtSizes.y;
        return va;
    }
    // Virtual-to-physical address translation
    float2 virtToPhys( float3 va, float2 physSize, float3 virtSizes ) {
        float addr1D = dot( va, virtSizes );
        float normAddr1D = addr1D / physSize.x;
        float2 pa = float2( frac(normAddr1D) * physSize.x, normAddr1D );
        return pa;
    }
    float3 main( uniform samplerRECT physMem, uniform float2 physSize,
                 uniform float3 virtSizes, float2 pa : WPOS ) : COLOR {
        float3 va = physToVirt( floor(pa), physSize, virtSizes );
        float3 neighborAddr = va - float3(1, 1, 1);
        // Physical memory read
        return texRECT( physMem, virtToPhys(neighborAddr, physSize, virtSizes) );
    }
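The two translation functions on this slide are meant to be inverses of each other, which is easy to check on the CPU. The sketch below ports them to integer C++ under one assumption that the slides do not state explicitly: that virtSizes holds the strides (1, sizeX, sizeX * sizeY) of the virtual 3D array. The helper names makeVirtSizes and roundTrips, and the Int2/Int3 types, are hypothetical.

```cpp
#include <cassert>

struct Int2 { int x, y; };
struct Int3 { int x, y, z; };

// Assumed meaning of virtSizes: strides (1, sizeX, sizeX * sizeY).
inline Int3 makeVirtSizes(int sizeX, int sizeY) {
    return Int3{1, sizeX, sizeX * sizeY};
}

// Physical 2D address -> linear index -> virtual 3D address.
inline Int3 physToVirt(Int2 pa, Int2 physSize, Int3 virtSizes) {
    assert(pa.x >= 0 && pa.x < physSize.x && pa.y >= 0 && pa.y < physSize.y);
    int addr1D = pa.y * physSize.x + pa.x;
    Int3 va;
    va.z = addr1D / virtSizes.z;
    addr1D -= va.z * virtSizes.z;
    va.y = addr1D / virtSizes.y;
    va.x = addr1D - va.y * virtSizes.y;
    return va;
}

// Virtual 3D address -> linear index -> physical 2D address.
inline Int2 virtToPhys(Int3 va, Int2 physSize, Int3 virtSizes) {
    const int addr1D =
        va.x * virtSizes.x + va.y * virtSizes.y + va.z * virtSizes.z;
    return Int2{addr1D % physSize.x, addr1D / physSize.x};
}

// Check that every physical address survives a round trip.
inline bool roundTrips(Int2 physSize, Int3 virtSizes) {
    for (int y = 0; y < physSize.y; ++y) {
        for (int x = 0; x < physSize.x; ++x) {
            const Int3 va = physToVirt(Int2{x, y}, physSize, virtSizes);
            const Int2 pa = virtToPhys(va, physSize, virtSizes);
            if (pa.x != x || pa.y != y) return false;
        }
    }
    return true;
}
```

Integer division replaces the floor and frac calls of the Cg version, which avoids the rounding questions that floating-point addresses raise on the GPU.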

Abstraction Example : Glift Components
    // Iterator3D
    float3 physToVirt( float2 pa, float2 physSize, float3 virtSizes ) {
        float3 va;
        float addr1D = pa.y * physSize.x + pa.x;
        va.z = floor( addr1D / virtSizes.z );
        addr1D -= va.z * virtSizes.z;
        va.y = floor( addr1D / virtSizes.y );
        va.x = addr1D - va.y * virtSizes.y;
        return va;
    }
    // VMem3D
    float2 virtToPhys( float3 va, float2 physSize, float3 virtSizes ) {
        float addr1D = dot( va, virtSizes );
        float normAddr1D = addr1D / physSize.x;
        float2 pa = float2( frac(normAddr1D) * physSize.x, normAddr1D );
        return pa;
    }
    float3 main( uniform samplerRECT physMem, uniform float2 physSize,
                 uniform float3 virtSizes, float2 pa : WPOS ) : COLOR {
        float3 va = physToVirt( floor(pa), physSize, virtSizes );
        float3 neighborAddr = va - float3(1, 1, 1);
        // VMem3D
        return texRECT( physMem, virtToPhys(neighborAddr, physSize, virtSizes) );
    }

Abstraction Example : GPU Shader with Glift Cg Usage
    float3 main( uniform VMem3D srcData, Iterator3D iter ) : COLOR {
        float3 va = iter.addr();
        return srcData.vTex3D( va - float3(1,1,1) );
    }

Abstraction Example : Glift Data Structures C++ Usage
    vec3i origin(0,0,0);
    vec3i size(10,10,10);
    ArrayGpuND<vec3i,vec1f> srcData( size );
    ArrayGpuND<vec3i,vec1f> dstData( size );
    … initialize dataPtr …
    srcData.write( origin, size, dataPtr );
    gpu_range_iterator it = dstData.gpu_range( origin, size );
    it.bind_for_read( iterCgParam );
    srcData.bind_for_read( srcCgParam );
    dstData.bind_for_write( COLOR0, myFrameBufferObject );
    gpuForEach( it );

Abstraction Other Benefits of Abstraction • Multiple PhysMem with same AddrTrans • “Unlimited” amount of data in structures • Multiple AddrTrans with one PhysMem • “reinterpret_cast” physical memory • Contiguous memory layout • Efficient stream processing of PhysMem or AddrTrans
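The "reinterpret_cast" benefit can be illustrated with two translators that index one shared buffer. This is a CPU-only sketch with hypothetical names (View2D, View3D, writeThenReinterpret); it assumes both views use a row-major linear layout, which is what makes them address the same elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two hypothetical translators that index the same physical 1D buffer.
struct View2D {
    std::size_t width;
    std::size_t translate(std::size_t x, std::size_t y) const {
        assert(x < width);
        return y * width + x;
    }
};
struct View3D {
    std::size_t width, height;
    std::size_t translate(std::size_t x, std::size_t y, std::size_t z) const {
        assert(x < width && y < height);
        return (z * height + y) * width + x;
    }
};

// Write through the 3D view, then read the same element through the 2D view.
inline float writeThenReinterpret() {
    std::vector<float> physMem(4 * 4 * 4, 0.0f);  // shared physical memory
    const View3D as3D{4, 4};                       // 4 x 4 x 4 virtual grid
    const View2D as2D{16};                         // 16 x 4 virtual grid
    physMem[as3D.translate(1, 2, 3)] = 9.0f;       // linear index 57
    return physMem[as2D.translate(9, 3)];          // 3 * 16 + 9 = 57
}
```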

Overview • Motivation and previous work • Abstraction • Glift template library implementation • Case study • Adaptive shadow maps and octree 3D paint • Conclusions

Implementation Glift Design Goals • Generic implementation of abstraction • As efficient as hand-coding • Unified C++ and Cg code base • Easily extensible • Incrementally adoptable • Easy integration with Cg/OpenGL

Implementation Glift Components [Diagram labels: Application, PhysMem, VirtMem, Container Adaptors, AddrTrans, C++/Cg/OpenGL]

4D Array Declaration Example • Build 4D array of vec3f values
    typedef PhysMemGPU<vec2i, vec3f> PMem2D;
    typedef NdTo2DAddrTrans<vec4i,vec2i> Addr4to2;
    typedef VirtMemGPU<Addr4to2, PMem2D> VMem4D;
    vec4i virtSize( 10, 10, 10, 10 );
    vec2i physSize( 100, 100 );
    PMem2D pMem2D( physSize );
    Addr4to2 addrTrans( virtSize, physSize );
    VMem4D array4D( addrTrans, pMem2D );

Implementation PhysMem • Templated texture class • Defines all C++ and Cg GPU memory interfaces • GPU, CPU, and CPU-GPU • Template parameters • Address type: dimension and value type • Value type: dimension and value type • Example typedef PhysMemGPU<vec2f, vec1f> PMem2D; PMem2D pMem2D( vec2i(100, 100) );

Implementation AddrTrans • Template parameters • Virtual address type • Physical address type • Boundary condition • …Specific parameters… • Example typedef NdTo2DAddrTrans<vec4i,vec2f> Addr4to2; Addr4to2 addrTrans( vec4i(10, 10, 10, 10), vec2i(100, 100) );

Implementation AddrTrans • Core of data structure • Extension point for creating new structures • Must define translate(…), translate_range(…), cpu_range(…), gpu_range(…)

Implementation VirtMem • Composition of an AddrTrans and PhysMem • Defines all C++ and Cg GPU memory interfaces • Parameters • Address translator type • Physical memory type • Example typedef VirtMemGPU<Addr4to2, PMem2D> VMem4D; VMem4D vMem4D( addrTrans, pMem2D );

Implementation Container Adaptors • High-Level Containers • Apply behavior to underlying container • STL stack is adaptor atop deque, vector, … • Wrap up typedefs • Examples ArrayGpuND<vec1i, vec4ub> myArray( 20000 ); StackGPU<vec3f> myGpuStack( 1000 );
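The STL analogy on this slide can be made concrete with std::stack, which adapts an underlying sequence container. The sketch below is a CPU-only illustration: FixedStack is a hypothetical name, not a Glift class, and the fixed capacity is an assumption made to echo the size argument passed to StackGPU in the slide's example.

```cpp
#include <cassert>
#include <cstddef>
#include <stack>
#include <vector>

// CPU-only analogy; FixedStack is a hypothetical name, not a Glift class.
// The adaptor adds stack behavior and a capacity limit to an underlying container.
template <typename T, typename Container = std::vector<T>>
class FixedStack {
public:
    explicit FixedStack(std::size_t capacity) : capacity_(capacity) {}

    // Returns false instead of growing when the stack is full.
    bool push(const T& value) {
        if (stack_.size() == capacity_) return false;
        stack_.push(value);
        return true;
    }

    // Removes and returns the most recently pushed value.
    T pop() {
        assert(!stack_.empty());
        T top = stack_.top();
        stack_.pop();
        return top;
    }

    std::size_t size() const { return stack_.size(); }

private:
    std::size_t capacity_;
    std::stack<T, Container> stack_;
};
```

The adaptor owns no storage logic of its own; replacing the Container template argument changes where the data lives without changing the stack interface.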
