Advancing GPGPU: Harnessing GPUs for General-Purpose Computing
This paper explores the evolution of Graphics Processing Units (GPUs) into powerful, flexible tools for general-purpose computation (GPGPU). Modern GPUs exhibit remarkable programmability and precision, supporting high-level languages and advanced features such as 64-bit floating-point computation. The paper introduces the stream programming abstraction, which enables efficient parallel processing through data streams and kernels. By analyzing practical applications, including a simulation of cloth dynamics, it highlights the efficiency of GPGPU in solving complex computational problems. The findings point toward new potential applications of these techniques in broader contexts.
Presentation Transcript
David Angulo Rubio, FAMU CIS Graduate Student. GPGPU: General-Purpose Programmability on Graphics Processing Units
Introduction • The GPU (Graphics Processing Unit) on video cards has evolved rapidly in recent years, becoming extremely powerful and flexible in • Programmability • Precision • Power • GPGPU computing is an emerging field whose objective is to harness GPUs for general-purpose computation
Motivation: Flexible and Precise • Modern GPUs are deeply programmable • Programmable pixel, vertex, and video engines • Solidifying high-level language support • Modern GPUs support high precision • 32-bit floating point throughout the pipeline • High enough for many (not all) applications • Newest GPUs have 64-bit support
Stream Programming Abstraction • Streams • Collections of data records • All data is expressed in streams • Kernels • Inputs/outputs are streams • Perform computation on streams • Can be chained together: stream → KERNEL → stream
Stream Programming Abstraction • [Figure: a dolphin modeled as a triangle mesh]
Stream Programming Abstraction • Benchmark Funnel: In this simulation, a cloth falls into a funnel and passes through it under the pressure of a ball. The model has 47K vertices, 92K triangles, and many self-collisions. The GPU-based continuous collision detection (CCD) algorithm takes 4.4 ms and 10 ms per frame to compute all collisions on an NVIDIA GeForce GTX 480 and an NVIDIA GeForce GTX 285, respectively.
Why Streams • Ample computation by exposing parallelism • Streams expose data parallelism • Multiple stream elements can be processed in parallel • Pipeline (task) parallelism • Multiple tasks can be processed in parallel • Kernels yield high arithmetic intensity • Efficient communication • Producer-consumer locality • Predictable memory access patterns • Optimize for throughput of all elements, not latency of one • Processing many elements at once allows latency hiding
CPU-GPU Analogies • Stream / data array = Texture • Memory read = Texture sample
Structuring a GPU Program • CPU assembles input data • CPU transfers data to the GPU ("GPU main memory" or "device memory") • CPU calls the GPU program (or set of kernels); the GPU runs out of GPU main memory • When the GPU finishes, the CPU copies results back into CPU memory • Recent interfaces allow overlap • What lessons can we draw from this sequence of operations?
Kernels • CPU-GPU analogy: kernel / loop body / algorithm step = Fragment program (e.g., an ADVECT step) • You write one program. It runs on every vertex/fragment.
Conclusion • Can we apply these techniques to more general problems? • GPUs should excel at tasks that: • Require ample computation • Have regular computation • Need efficient communication