
Designing Physics Algorithms for GPU Architecture


Presentation Transcript


  1. Designing Physics Algorithms for GPU Architecture • Takahiro Harada, AMD

  2. Narrow phase on GPU • Narrow phase is parallel • How to solve each pair? • Design it for a specific architecture

  3. GPU Architecture • Radeon HD 5870 • 2.72 TFLOPS (single precision), 544 GFLOPS (double precision), 153.6 GB/s memory bandwidth • Many cores: 20 SIMDs, each a 64-wide SIMD (a CPU's SSE is a 4-wide SIMD) • The program of a work item is packed into VLIW instructions, then executed on the SIMD • [Diagram: Radeon HD 5870 with 20 SIMD engines vs. a Phenom II X4 with 4 cores] • 20 (SIMDs) x 16 (thread processors) x 5 (stream cores) = 1600
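
  As a sanity check on those peak numbers (assuming the HD 5870's 850 MHz engine clock, which the slide does not state): 20 SIMDs x 16 thread processors x 5 stream cores = 1600 ALUs, and 1600 ALUs x 2 single-precision operations per cycle (multiply-add) x 0.85 GHz ≈ 2.72 TFLOPS, matching the figure above.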

  4. Memory • Register • Global memory • "Main memory" • Large (> 1 GB) • High latency • Local Data Store (LDS) • Low latency • High bandwidth • Like a user-managed cache • The key to getting high performance • [Diagram: the GPU's 20 SIMDs reach global memory (> 1 GB) at 153.6 GB/s; each SIMD has a 32 KB Local Data Share]
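
  A minimal OpenCL sketch of the "user-managed cache" idea above, assuming a work-group size of 64 and a global size that is a multiple of 64 (the kernel name and the sum computation are placeholders for illustration):

      __kernel void sumTile(__global const float4* posIn,
                            __global float4*       sumOut)
      {
          __local float4 ldsPos[64];            // tile staged in the Local Data Store

          int lid = get_local_id(0);
          int gid = get_global_id(0);

          ldsPos[lid] = posIn[gid];             // one global read per work item
          barrier(CLK_LOCAL_MEM_FENCE);         // wait until the whole tile is in LDS

          float4 sum = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
          for (int i = 0; i < 64; i++)          // repeated reads now hit low-latency LDS
              sum += ldsPos[i];

          sumOut[gid] = sum;                    // single write back to global memory
      }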

  5. Narrow phase on CPU • Methods on CPUs (e.g. GJK) • Handle any convex shape • Possible to implement on the GPU, but complicated for a GPU • Divergence => low utilization of the ALUs • The GPU prefers a simpler algorithm with less logic • Why is the GPU not good at complicated logic? Its wide SIMD architecture: in the kernel below, lanes 0-7 split across the cases (25% A, 25% B, 50% C), so the SIMD steps through every taken case with the non-matching lanes masked off

      void kernel()
      {
          executeX();
          switch (shapeType)        // shape-dependent branch: diverges across SIMD lanes
          {
              case A: executeA(); break;
              case B: executeB(); break;
              case C: executeC(); break;
          }
          finish();
      }

  6. Narrow phase on GPU • Particles • Search for neighboring particles • Collide against all of them • Accurate shape representation needs a higher resolution and an acceleration structure in each shape, which increases complexity, explodes the number of contacts, etc. • Can we make it better but keep it simple?

      void kernel()
      {
          prepare();
          // every lane runs the same straight-line sequence: no divergence
          collide(p0); collide(p1); collide(p2); collide(p3);
          collide(p4); collide(p5); collide(p6); collide(p7);
      }
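
  A sketch of the branch-free particle pattern in OpenCL. It loops over every particle instead of using the neighbor search mentioned above, and the penalty-force math, kernel name, and parameters are illustrative assumptions, not the method from the slides:

      __kernel void collideParticles(__global const float4* pos,      // xyz = position, w unused
                                     __global float4*       force,
                                     const int   n,
                                     const float diameter)
      {
          int i = get_global_id(0);
          if (i >= n) return;

          float4 pi = pos[i];
          float4 f  = (float4)(0.0f, 0.0f, 0.0f, 0.0f);

          // Every work item runs the same straight-line loop: no shape-dependent
          // switch, so the lanes of a wavefront stay converged.
          for (int j = 0; j < n; j++)
          {
              float4 d = pi - pos[j];
              d.w = 0.0f;
              float dist    = length(d) + 1e-9f;            // epsilon also zeroes the j == i term
              float overlap = fmax(diameter - dist, 0.0f);  // zero unless the particles overlap
              f += d * (overlap / dist);                    // branch-free repulsive penalty
          }
          force[i] = f;
      }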

  7. A good approach for GPUs, derived from the architecture • We have to know what the GPU likes • Fewer branches • Less divergence • Use of LDS on the SIMD • Latency hiding • Why latency?

  8. Work group (WG), work item (WI) • Work Group 0 -> Particle[0-63], Work Group 1 -> Particle[64-127], Work Group 2 -> Particle[128-191], ... • Each work group of 64 work items maps onto one 64-lane SIMD of the Radeon HD 5870 (20 SIMDs)
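
  In OpenCL terms, the mapping above (work-group size 64, one particle per work item) could look like the following skeleton; the kernel name and body are placeholders:

      // Launched with a local size of 64: work group 0 covers Particle[0-63],
      // work group 1 covers Particle[64-127], and so on.
      __kernel void perParticleKernel(__global float4* particles)
      {
          int wg       = get_group_id(0);        // which work group (runs on one SIMD)
          int lane     = get_local_id(0);        // which lane of the 64-wide SIMD
          int particle = wg * 64 + lane;         // same as get_global_id(0) here

          // ... process particles[particle] ...
      }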

  9. How does the GPU hide latency? • Memory access latency • It does not rely on caches • A SIMD hides latency by switching between work groups: while one WG waits on a memory request, another WG computes • The more WGs per SIMD, the better; 1 WG/SIMD cannot hide latency • Work is overlapped with memory requests • What determines the number of WGs per SIMD? Local resource usage

      void kernel()
      {
          readGlobalMem();   // long-latency access: the SIMD switches to another WG while this one waits
          compute();
      }

  10. Why reduce resource usage? • Registers are a limited resource • # of WGs/SIMD is set by (SIMD registers)/(kernel register use) and (SIMD LDS)/(kernel LDS use) • Fewer registers => more WGs => latency is hidden • Register overflow spills to global memory • [Diagram: a SIMD engine with 8 registers holds 1 WG of KernelA (8 regs), 2 WGs of KernelB (4 regs), 4 WGs of KernelC (2 regs)]
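
  As a worked example with the toy figures from the slide, taking the number of resident WGs per SIMD as the smaller of the two ratios above: on a SIMD engine with 8 registers, KernelA (8 regs) fits floor(8/8) = 1 work group, KernelB (4 regs) fits floor(8/4) = 2, and KernelC (2 regs) fits floor(8/2) = 4, so only KernelC leaves enough resident work groups to hide much memory latency.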

  11. Preview of the current approach • 1 WG processes 1 pair • Pair data is per WG, not per WI; the WIs work together • Reduces resource usage • Fewer branches: compute() is branch free, with no dependencies • Use of LDS: no global memory access inside compute(); random access goes to the LDS • Latency hiding • A unified method for all the shapes

      void kernel()
      {
          fetchToLDS();      // global memory -> LDS
          BARRIER;
          compute();         // works on LDS only
          BARRIER;
          workTogether();
          BARRIER;
          writeBack();       // LDS -> global memory
      }
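
  A skeleton of the per-pair pattern in OpenCL, assuming a 64-item work group and 64 float4 vertices per shape packed per pair; the per-lane computation and the final reduction are placeholders that only mirror the fetch / barrier / compute / barrier / write-back structure, not the actual collision math:

      __kernel void narrowPhasePerPair(__global const float4* vertsA,
                                       __global const float4* vertsB,
                                       __global float4*       contactOut)
      {
          __local float4 ldsA[64];
          __local float4 ldsB[64];
          __local float4 ldsWork[64];

          int pair = get_group_id(0);            // one work group <-> one pair
          int lane = get_local_id(0);            // 64 work items cooperate on that pair

          // fetchToLDS(): cooperative load, one element per work item
          ldsA[lane] = vertsA[pair * 64 + lane];
          ldsB[lane] = vertsB[pair * 64 + lane];
          barrier(CLK_LOCAL_MEM_FENCE);

          // compute(): branch free, touches LDS only (placeholder work)
          ldsWork[lane] = ldsA[lane] - ldsB[lane];
          barrier(CLK_LOCAL_MEM_FENCE);

          // workTogether() + writeBack(): lane results are combined and written
          // back once per work group (placeholder: lane 0 sums them up)
          if (lane == 0)
          {
              float4 sum = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
              for (int i = 0; i < 64; i++)
                  sum += ldsWork[i];
              contactOut[pair] = sum;
          }
      }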

  12. Solver

  13. Fusion

  14. Choosing a processor • The CPU can do everything, but it is not as good as the GPU at highly parallel computation • The GPU is a very powerful processor, but only for parallel computation • Real problems contain both kinds of work • The GPU is far from the CPU, so moving work between them is costly

  15. Fusion • GPU and CPU are close • Faster communication between GPU and CPU • Use both GPU and CPU • Parallel workload -> GPU • Serial workload -> CPU

  16. Collision between large and small particles • Granularity of computation: a large particle collides with more neighbors than a small one • Inefficient use of the GPU

  17. Q & A
