
Designing Physics Algorithms for GPU Architecture


Presentation Transcript


  1. Designing Physics Algorithms for GPU Architecture • Takahiro Harada, AMD

  2. Narrow phase on GPU • Narrow phase is parallel • How to solve each pair? • Design it for a specific architecture

  3. GPU Architecture • Radeon HD 5870 • 2.72 TFLOPS (single precision), 544 GFLOPS (double precision), 153.6 GB/s memory bandwidth • Many cores: 20 SIMDs, each a 64-wide SIMD (a CPU's SSE is a 4-wide SIMD) • The program of a work item is packed into VLIW instructions, then executed on the SIMD • [Diagram: Radeon HD 5870 with 20 SIMD engines vs. a Phenom II X4 with 4 cores] • 20 (SIMDs) x 16 (thread processors) x 5 (stream cores) = 1600
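
  As a sanity check on those peak numbers (assuming the HD 5870's 850 MHz engine clock, which the slide does not state): 20 SIMDs x 16 thread processors x 5 stream cores = 1600 ALUs, and 1600 ALUs x 2 single-precision operations per cycle (multiply-add) x 0.85 GHz ≈ 2.72 TFLOPS, matching the figure above.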

  4. Memory • Register • Global memory • "Main memory" • Large (> 1 GB) • High latency • Local Data Store (LDS) • Low latency • High bandwidth • Like a user-managed cache • The key to getting high performance • [Diagram: the GPU's 20 SIMDs reach global memory (> 1 GB) at 153.6 GB/s; each SIMD has a 32 KB Local Data Share]
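
  A minimal OpenCL sketch of the "user-managed cache" idea above, assuming a work-group size of 64 and a global size that is a multiple of 64 (the kernel name and the sum computation are placeholders for illustration):

      __kernel void sumTile(__global const float4* posIn,
                            __global float4*       sumOut)
      {
          __local float4 ldsPos[64];            // tile staged in the Local Data Store

          int lid = get_local_id(0);
          int gid = get_global_id(0);

          ldsPos[lid] = posIn[gid];             // one global read per work item
          barrier(CLK_LOCAL_MEM_FENCE);         // wait until the whole tile is in LDS

          float4 sum = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
          for (int i = 0; i < 64; i++)          // repeated reads now hit low-latency LDS
              sum += ldsPos[i];

          sumOut[gid] = sum;                    // single write back to global memory
      }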

  5. Narrow phase on CPU • Methods on CPUs (e.g. GJK) • Handle any convex shape • Possible to implement on the GPU, but complicated for a GPU • Divergence => low utilization of the ALUs • The GPU prefers a simpler algorithm with less logic • Why is the GPU not good at complicated logic? Its wide SIMD architecture: in the kernel below, lanes 0-7 split across the cases (25% A, 25% B, 50% C), so the SIMD steps through every taken case with the non-matching lanes masked off

      void kernel()
      {
          executeX();
          switch (shapeType)        // shape-dependent branch: diverges across SIMD lanes
          {
              case A: executeA(); break;
              case B: executeB(); break;
              case C: executeC(); break;
          }
          finish();
      }

  6. Narrow phase on GPU • Particles • Search for neighboring particles • Collide against all of them • Accurate shape representation needs a higher resolution and an acceleration structure in each shape, which increases complexity, explodes the number of contacts, etc. • Can we make it better but keep it simple?

      void kernel()
      {
          prepare();
          // every lane runs the same straight-line sequence: no divergence
          collide(p0); collide(p1); collide(p2); collide(p3);
          collide(p4); collide(p5); collide(p6); collide(p7);
      }
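
  A sketch of the branch-free particle pattern in OpenCL. It loops over every particle instead of using the neighbor search mentioned above, and the penalty-force math, kernel name, and parameters are illustrative assumptions, not the method from the slides:

      __kernel void collideParticles(__global const float4* pos,      // xyz = position, w unused
                                     __global float4*       force,
                                     const int   n,
                                     const float diameter)
      {
          int i = get_global_id(0);
          if (i >= n) return;

          float4 pi = pos[i];
          float4 f  = (float4)(0.0f, 0.0f, 0.0f, 0.0f);

          // Every work item runs the same straight-line loop: no shape-dependent
          // switch, so the lanes of a wavefront stay converged.
          for (int j = 0; j < n; j++)
          {
              float4 d = pi - pos[j];
              d.w = 0.0f;
              float dist    = length(d) + 1e-9f;            // epsilon also zeroes the j == i term
              float overlap = fmax(diameter - dist, 0.0f);  // zero unless the particles overlap
              f += d * (overlap / dist);                    // branch-free repulsive penalty
          }
          force[i] = f;
      }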

  7. A good approach for GPUs, derived from the architecture • We have to know what the GPU likes • Fewer branches • Less divergence • Use of LDS on the SIMD • Latency hiding • Why latency?

  8. Work group (WG), work item (WI) • Work Group 0 -> Particle[0-63], Work Group 1 -> Particle[64-127], Work Group 2 -> Particle[128-191], ... • Each work group of 64 work items maps onto one 64-lane SIMD of the Radeon HD 5870 (20 SIMDs)
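
  In OpenCL terms, the mapping above (work-group size 64, one particle per work item) could look like the following skeleton; the kernel name and body are placeholders:

      // Launched with a local size of 64: work group 0 covers Particle[0-63],
      // work group 1 covers Particle[64-127], and so on.
      __kernel void perParticleKernel(__global float4* particles)
      {
          int wg       = get_group_id(0);        // which work group (runs on one SIMD)
          int lane     = get_local_id(0);        // which lane of the 64-wide SIMD
          int particle = wg * 64 + lane;         // same as get_global_id(0) here

          // ... process particles[particle] ...
      }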

  9. How does the GPU hide latency? • Memory access latency • It does not rely on caches • A SIMD hides latency by switching between work groups: while one WG waits on a memory request, another WG computes • The more WGs per SIMD, the better; 1 WG/SIMD cannot hide latency • Work is overlapped with memory requests • What determines the number of WGs per SIMD? Local resource usage

      void kernel()
      {
          readGlobalMem();   // long-latency access: the SIMD switches to another WG while this one waits
          compute();
      }

  10. Why reduce resource usage? • Registers are a limited resource • # of WGs/SIMD is set by (SIMD registers)/(kernel register use) and (SIMD LDS)/(kernel LDS use) • Fewer registers => more WGs => latency is hidden • Register overflow spills to global memory • [Diagram: a SIMD engine with 8 registers holds 1 WG of KernelA (8 regs), 2 WGs of KernelB (4 regs), 4 WGs of KernelC (2 regs)]
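
  As a worked example with the toy figures from the slide, taking the number of resident WGs per SIMD as the smaller of the two ratios above: on a SIMD engine with 8 registers, KernelA (8 regs) fits floor(8/8) = 1 work group, KernelB (4 regs) fits floor(8/4) = 2, and KernelC (2 regs) fits floor(8/2) = 4, so only KernelC leaves enough resident work groups to hide much memory latency.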

  11. Preview of the current approach • 1 WG processes 1 pair • Pair data is per WG, not per WI; the WIs work together • Reduces resource usage • Fewer branches: compute() is branch free, with no dependencies • Use of LDS: no global memory access inside compute(); random access goes to the LDS • Latency hiding • A unified method for all the shapes

      void kernel()
      {
          fetchToLDS();      // global memory -> LDS
          BARRIER;
          compute();         // works on LDS only
          BARRIER;
          workTogether();
          BARRIER;
          writeBack();       // LDS -> global memory
      }
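
  A skeleton of the per-pair pattern in OpenCL, assuming a 64-item work group and 64 float4 vertices per shape packed per pair; the per-lane computation and the final reduction are placeholders that only mirror the fetch / barrier / compute / barrier / write-back structure, not the actual collision math:

      __kernel void narrowPhasePerPair(__global const float4* vertsA,
                                       __global const float4* vertsB,
                                       __global float4*       contactOut)
      {
          __local float4 ldsA[64];
          __local float4 ldsB[64];
          __local float4 ldsWork[64];

          int pair = get_group_id(0);            // one work group <-> one pair
          int lane = get_local_id(0);            // 64 work items cooperate on that pair

          // fetchToLDS(): cooperative load, one element per work item
          ldsA[lane] = vertsA[pair * 64 + lane];
          ldsB[lane] = vertsB[pair * 64 + lane];
          barrier(CLK_LOCAL_MEM_FENCE);

          // compute(): branch free, touches LDS only (placeholder work)
          ldsWork[lane] = ldsA[lane] - ldsB[lane];
          barrier(CLK_LOCAL_MEM_FENCE);

          // workTogether() + writeBack(): lane results are combined and written
          // back once per work group (placeholder: lane 0 sums them up)
          if (lane == 0)
          {
              float4 sum = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
              for (int i = 0; i < 64; i++)
                  sum += ldsWork[i];
              contactOut[pair] = sum;
          }
      }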

  12. Solver

  13. Fusion

  14. Choosing a processor • The CPU can do everything, but it is not as good as the GPU at highly parallel computation • The GPU is a very powerful processor, but only for parallel computation • Real problems contain both kinds of work • The GPU is far from the CPU, so moving work between them is costly

  15. Fusion • GPU and CPU are close • Faster communication between GPU and CPU • Use both GPU and CPU • Parallel workload -> GPU • Serial workload -> CPU

  16. Collision between large and small particles • Granularity of computation: a large particle collides with more neighbors than a small one • Inefficient use of the GPU

  17. Q & A
