
GPU Programming: eScience or Engineering?


Presentation Transcript


  1. Commit/ GPU Programming: eScience or Engineering? Henri Bal, Vrije Universiteit Amsterdam

  2. Graphics Processing Units • GPUs and other accelerators take top-500 by storm • Many application success stories • But GPUs are very difficult to program and optimize http://www.nvidia.com/object/tesla-case-studies.html

  3. Example: convolution • Naive versus fully optimized kernel (figure) • Fully optimizing it: about half a Ph.D. thesis

  4. Parallel Programming Lab course • Lab course for MSc students (next to lectures) • CUDA: • Simple image processing application on 1 node • MPI: • Parallel all pairs shortest path algorithms • CUDA: 11 out of 21 passed (52 %) • MPI: 17 out of 21 passed (80 %)

  5. Questions • Why are accelerators so difficult to program? • What are the challenges for Computer Science? • What role do applications play?

  6. Background • Netherlands eScience Center • Bridge between ICT and applications • Climate modeling, astronomy, water management, digital forensics, … • COMMIT: (100 M€) public-private ICT program • http://www.commit-nl.nl/ • Distributed ASCI Supercomputer (DAS) • Testbed for Computer Science (Euro-Par 2014 keynote) Commit/

  7. My background • Cluster computing • Zoo (1994), Orca • Wide-area computing • DAS-1 (1997), Albatross • Grid computing • DAS-2 (2002), Manta, Satin • eScience & optical grids • DAS-3 (2006), Ibis • Hybrid computing • DAS-4 (2010), Glasswing, MCL

  8. Background (team) Staff • Rob van Nieuwpoort (NLeSC) • Ana Varbanescu (UvA) Scientific programmers • Rutger Hofman • Ceriel Jacobs Ph.D. students • Ben van Werkhoven • Alessio Sclocco • Ismail El Helw • Pieter Hijma

  9. Agenda • Application case studies • Multimedia kernel (convolution) • Astronomy kernel (dedispersion) • Climate modelling: optimizing multiple kernels • Lessons learned: why is GPU programming hard? • Programming methodologies • 'Stepwise refinement for performance' methodology • Glasswing: MapReduce on accelerators

  10. Application case study 1: Convolution operations • Image I of size Iw × Ih • Filter F of size Fw × Fh • Thread block of size Bw × Bh • Naïve CUDA kernel: 1 thread per output pixel • Each filter iteration does 2 arithmetic operations and 2 loads (8 bytes) • Arithmetic Intensity (AI) = 0.25
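To make the starting point concrete, a minimal sketch of such a naive kernel is shown below (kernel name, argument layout, and the padded-input assumption are illustrative, not the code from the case study):

```cuda
// Naive 2D convolution sketch: one thread per output pixel, no data reuse.
// Assumes the input is padded by the filter border, so its row stride is
// Iw + Fw - 1 and no boundary checks are needed inside the loop.
__global__ void convolution_naive(float *output, const float *input,
                                  const float *filter,
                                  int Iw, int Ih, int Fw, int Fh) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= Iw || y >= Ih) return;

    float sum = 0.0f;
    for (int j = 0; j < Fh; j++) {
        for (int i = 0; i < Fw; i++) {
            // 2 loads (8 bytes) and one multiply-add (2 flops) per iteration:
            // arithmetic intensity 0.25, hence memory-bandwidth bound.
            sum += input[(y + j) * (Iw + Fw - 1) + (x + i)] * filter[j * Fw + i];
        }
    }
    output[y * Iw + x] = sum;
}
```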

  11. Hierarchy of concurrent threads (figure: a grid of thread blocks, each thread block containing a 2D array of threads indexed by row and column)

  12. Memory optimizations for tiled convolution • Threads within a block cooperatively load the entire input area they need into a small (e.g. 96KB) shared memory • The filter (small) goes into constant memory (figure: grid of thread blocks, each with per-thread registers and per-block shared memory, on top of global and constant memory)
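A sketch of the tiled variant under the same (illustrative) assumptions: the filter lives in constant memory, the block's input area is staged in shared memory, the input is padded by the filter border, and the image dimensions are multiples of the block size:

```cuda
#define Bw 16   // thread block width  (illustrative)
#define Bh 16   // thread block height
#define Fw 11   // filter width  (matches the 11x7 example below)
#define Fh 7    // filter height

// Small filter placed in constant memory, as described above.
__constant__ float d_filter[Fh * Fw];

__global__ void convolution_tiled(float *output, const float *input, int Iw) {
    // Shared tile covers the block's output area plus the filter border.
    __shared__ float tile[Bh + Fh - 1][Bw + Fw - 1];

    int tx = threadIdx.x, ty = threadIdx.y;
    int x = blockIdx.x * Bw + tx;
    int y = blockIdx.y * Bh + ty;

    // Threads cooperatively load the (Bw+Fw-1) x (Bh+Fh-1) input area
    // needed by this block into shared memory.
    for (int j = ty; j < Bh + Fh - 1; j += Bh)
        for (int i = tx; i < Bw + Fw - 1; i += Bw)
            tile[j][i] = input[(blockIdx.y * Bh + j) * (Iw + Fw - 1)
                               + blockIdx.x * Bw + i];
    __syncthreads();

    // Each thread now reads its inputs from shared/constant memory only.
    float sum = 0.0f;
    for (int j = 0; j < Fh; j++)
        for (int i = 0; i < Fw; i++)
            sum += tile[ty + j][tx + i] * d_filter[j * Fw + i];

    output[y * Iw + x] = sum;
}
```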

  13. Tiled convolution • Arithmetic Intensity for a 16×16 thread block processing an 11×7 filter
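As a back-of-the-envelope estimate (counting only the 4-byte input loads from global memory; the constant-memory filter and the output store are ignored, so the exact accounting in the thesis may differ):

```latex
\[
\mathrm{AI} \;=\; \frac{2\,B_w B_h F_w F_h}{4\,(B_w+F_w-1)(B_h+F_h-1)}
            \;=\; \frac{2\cdot 16\cdot 16\cdot 11\cdot 7}{4\cdot 26\cdot 22}
            \;\approx\; 17 \ \text{flops/byte}
\]
```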

  14. Analysis • If the filter size increases: • Arithmetic Intensity increases: kernel shifts from memory-bandwidth bound to compute-bound • Amount of shared memory needed increases → fewer thread blocks can run concurrently on each SM

  15. Tiling • Each thread block computes 1×N tiles in the horizontal direction • Increases the amount of work per thread • Saves loading overlapping borders • Saves redundant instructions • More shared memory, fewer concurrent thread blocks • No shared memory bank conflicts

  16. Adaptive tiling • Tiling factor is selected at runtime depending on the input data and the resource limitations of the device • Highest possible tiling factor that fits within the available shared memory (depending on filter size) • Plus loop unrolling, memory banks, and searching for the optimal configuration Ph.D. thesis Ben van Werkhoven, 27 Oct. 2014 + FGCS journal, 2014
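A simplified host-side sketch of such a runtime decision is given below; the real adaptive-tiling system also takes registers, loop unrolling, and the memory-bank layout into account, and the 1×N footprint formula here is only illustrative:

```cuda
#include <cuda_runtime.h>

// Pick the largest horizontal (1xN) tiling factor whose shared-memory
// tile still fits on the device. Illustrative only.
int select_tiling_factor(int Bw, int Bh, int Fw, int Fh, int device) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    size_t available = prop.sharedMemPerBlock;

    int best = 1;
    for (int n = 1; n <= 16; n++) {
        // A 1xN tile needs (n*Bw + Fw - 1) x (Bh + Fh - 1) input floats.
        size_t needed = sizeof(float)
                      * (size_t)(n * Bw + Fw - 1) * (Bh + Fh - 1);
        if (needed <= available) best = n;
    }
    return best;
}
```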

  17. Lessons learned • Everything must be in balance to obtain high performance • Subtle interactions between resource limits • Runtime decision system (adaptive tiling), in combination with standard optimizations • Loop unrolling, memory bank conflicts

  18. Application case study 2: Auto-tuning Dedispersion • Used for searching for pulsars in radio astronomy data • Pulsar signals get dispersed: lower radio frequencies arrive progressively later • Can be reversed by shifting the signal’s lower frequencies in time (dedispersion) Alessio Sclocco et al.: Auto-Tuning Dedispersion for Many-Core Accelerators, IPDPS 2014
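Schematically, brute-force dedispersion amounts to the kernel below (a sketch with an assumed data layout and a precomputed per-channel shift table, not the tuned kernel from the paper):

```cuda
// One thread per (DM, sample) pair sums the frequency channels after undoing
// their dispersion delays. 'shift' is a precomputed table of per-channel
// delays (in samples) for each trial DM; inputStride must be at least
// nSamples plus the largest delay.
__global__ void dedisperse(float *output, const float *input,
                           const int *shift, int nChannels,
                           int nSamples, int nDMs, int inputStride) {
    int sample = blockIdx.x * blockDim.x + threadIdx.x;
    int dm     = blockIdx.y * blockDim.y + threadIdx.y;
    if (sample >= nSamples || dm >= nDMs) return;

    float sum = 0.0f;
    for (int ch = 0; ch < nChannels; ch++)
        sum += input[ch * inputStride + sample + shift[dm * nChannels + ch]];

    output[dm * nSamples + sample] = sum;
}
```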

  19. Auto-tuning • Using auto-tuning to find optimal configuration for: • Different many-core platforms • NVIDIA & AMD GPUs, Intel Xeon Phi • Different observational scenarios • LOFAR, Apertif • Different number of Dispersion Measures (DMs) • Represents number of free electrons between source & receiver • Measure of distance between emitting object & receiver • Parameters: • Number of threads per sample or DM, thread block size, number of registers per thread, ….
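What the auto-tuner does can be sketched as an exhaustive search; the parameter ranges are illustrative and benchmarkKernel is a hypothetical helper that launches and times the kernel for one configuration:

```cuda
#include <cfloat>

// Hypothetical configuration and timing helper (not the actual tuner).
struct Config { int threadsPerSample; int threadsPerDM; int unrollFactor; };
float benchmarkKernel(const Config &c);   // assumed: times one kernel launch

// Exhaustive auto-tuning sketch: try every configuration, keep the fastest.
Config autotune() {
    Config best{1, 1, 1};
    float bestTime = FLT_MAX;
    for (int ts = 1; ts <= 32; ts *= 2)
        for (int td = 1; td <= 32; td *= 2)
            for (int unroll = 1; unroll <= 8; unroll *= 2) {
                Config c{ts, td, unroll};
                float t = benchmarkKernel(c);
                if (t < bestTime) { bestTime = t; best = c; }
            }
    return best;
}
```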

  20. Auto-tuning: number of threads per thread block (figures: tuning results for the LOFAR and Apertif scenarios)

  21. Histogram of achieved GFLOP/s • 396 configurations, the winner is an outlier

  22. Lessons learned • Auto-tuning allows algorithms to adapt to different platforms and scenarios • Auto-tuning has large impact on dedispersion • Guessing a good configuration without auto-tuning is difficult

  23. Application case study 3: Global Climate Modeling • Understand future local sea level changes • Needs high-resolution simulations • Combine two approaches: • Distributed computing (multiple resources) • GPUs Commit/

  24. Distributed Computing • Use Ibis to couple different simulation models • Land, ice, ocean, atmosphere • Wide-area optimizations similar to the Albatross project (16 years ago), like hierarchical load balancing

  25. Enlighten Your Research Global award (figure: sites connected by 10G light paths: STAMPEDE (USA), SUPERMUC (GER), CARTESIUS (NLD), KRAKEN (USA), EMERALD (UK); top-500 ranks #7 and #10 among them)

  26. GPU Computing • Offload expensive kernels of the Parallel Ocean Program (POP) from CPU to GPU • Many different kernels, fairly easy to port to GPUs • Execution time becomes virtually 0 • New bottleneck: moving data between CPU & GPU over the PCI Express link (figure: host with CPU and host memory connected via PCI Express to device with GPU and device memory)

  27. Different methods for CPU-GPU communication • Memory copies (explicit) • No overlap with GPU computation • Device-mapped host memory (implicit) • Allows fine-grained overlap between computation and communication in either direction • CUDA Streams or OpenCL command-queues • Allows overlap between computation and communication in different streams • Any combination of the above
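For example, the streams-based method can be sketched as follows (placeholder kernel and buffer names; the host buffers must be allocated with cudaMallocHost for the asynchronous copies to actually overlap with computation):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: stands in for e.g. one of the POP kernels.
__global__ void myKernel(const float *in, float *out, size_t n);

// Process the data in chunks so that the transfers of one chunk overlap
// with computation on another.
void run_overlapped(const float *h_in, float *h_out,   // pinned host buffers
                    float *d_in, float *d_out,          // device buffers
                    int nChunks, size_t chunkSize) {
    const int nStreams = 4;
    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; i++) cudaStreamCreate(&streams[i]);

    for (int c = 0; c < nChunks; c++) {
        cudaStream_t s = streams[c % nStreams];
        size_t off = (size_t)c * chunkSize;
        // Within a stream: copy-in, compute, copy-out of this chunk in order;
        // across streams: the stages of different chunks overlap.
        cudaMemcpyAsync(d_in + off, h_in + off, chunkSize * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        myKernel<<<(chunkSize + 255) / 256, 256, 0, s>>>(d_in + off,
                                                         d_out + off, chunkSize);
        cudaMemcpyAsync(h_out + off, d_out + off, chunkSize * sizeof(float),
                        cudaMemcpyDeviceToHost, s);
    }
    for (int i = 0; i < nStreams; i++) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```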

  28. Problem • Problem: which method will be most efficient for a given GPU kernel? Implementing all of them can be a large effort • Solution: create a performance model that identifies the best implementation: what implementation strategy for overlapping computation and communication is best for my program? Ben van Werkhoven, Jason Maassen, Frank Seinstra & Henri Bal: Performance models for CPU-GPU data transfers, CCGrid 2014 (nominated for best-paper award)

  29. MOVIE

  30. Example result (figure: measured versus modelled performance)

  31. Different GPUs (state kernel)

  32. Different GPUs (buoydiff)

  33. Comes with spreadsheet

  34. Lessons learned • PCIe transfers can have a large performance impact for applications with many small kernels • Several methods for transferring data and overlapping computation & communication exist • Performance modelling helps to select the best mechanism

  35. Why is GPU programming hard? • Mapping an algorithm to the architecture is difficult, especially as the architecture itself is difficult: • Many levels of parallelism • Limited resources (registers, shared memory) • Less of everything than a CPU (except parallelism), especially per thread, which makes problem partitioning difficult • Everything must be in balance to obtain performance

  36. Why is GPU programming hard? • Many crucial high-impact optimizations needed: • Data reuse • Use shared memory efficiently • Limited by #registers per thread, shared memory per thread block • Memory access patterns • Shared memory bank conflicts, global memory coalescing • Instruction stream optimization • Control flow divergence, loop unrolling • Moving data to/from the GPU • PCIe transfers
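As a toy illustration of the memory-access-pattern point, compare a coalesced and a strided (uncoalesced) copy kernel; names and layout are illustrative:

```cuda
// Coalesced: consecutive threads in a warp access consecutive addresses, so
// the warp's loads/stores combine into a few wide memory transactions.
__global__ void copy_coalesced(float *dst, const float *src, int width) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y;
    if (col < width) dst[row * width + col] = src[row * width + col];
}

// Uncoalesced: consecutive threads access addresses 'width' apart (walking a
// row-major array column-wise), so every access becomes its own transaction.
__global__ void copy_strided(float *dst, const float *src, int width, int height) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y;
    if (row < height) dst[row * width + col] = src[row * width + col];
}
```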

  37. Why is GPU programming hard? • Portability • Optimizations are architecture-dependent, and the architectures change frequently • Optimizations are often input dependent • Finding the right parameter settings is difficult • Need better performance models • Like Roofline and our I/O model

  38. Why is GPU programming hard? • Bottom line: tension between • control over hardware to achieve performance • a higher abstraction level to ease programming • Programmers need understandable performance • An old problem in Computer Science, but now in extreme form (1989)

  39. Agenda • Application case studies • Multimedia kernel (convolution) • Astronomy kernel (dedispersion) • Climate modelling: optimizing multiple kernels • Lessons learned: why is GPU programming hard? • Programming methodologies • 'Stepwise refinement for performance' methodology • Glasswing: MapReduce on accelerators

  40. Programming methodology: stepwise refinement for performance • Methodology: • Programmers can work on multiple levels of abstraction • Integrate hardware descriptions into the programming model • Performance feedback from the compiler, based on the hardware description and the kernel • Cooperation between compiler and programmer P. Hijma et al., "Stepwise refinement for performance: a methodology for many-core programming", Concurrency and Computation: Practice and Experience (accepted)

  41. MCL: Many-Core Levels • An MCL program is an algorithm mapped to hardware • Start at a suitable abstraction level • E.g. idealized accelerator, NVIDIA Kepler GPU, Xeon Phi • The MCL compiler guides the programmer on which optimizations to apply at a given abstraction level, or when to move to deeper levels

  42. MCL ecosystem

  43. Convolution example

  44. Compiler feedback

  45. Performance (GTX480, 9×9 filters): 380 GFLOPS; MCL + compiler: 302 GFLOPS

  46. Performance evaluation • Compared to known, fully optimized versions (* measured on a C2050, ** using a different input)

  47. Current work on MCL:Heterogeneous many-core clusters • New GPUs become available frequently, but older-generation GPUs often still are fast enough • Clusters become heterogeneous and contain different types of accelerators • VU DAS-4 cluster: • NVIDIA GTX480 GPUs (22) • NVIDIA K20 GPUs (8) • Intel Xeon Phi (2) • NVIDIA C2050 (2), Titan, GTX680 GPU • AMD HD7970 GPU

  48. Cashmere • Integration of MCL with the Satin divide-and-conquer system • Satin [ACM TOPLAS 2010] does: • Load balancing (cluster-aware random work-stealing) • Latency hiding • MCL allows kernels to be written and optimized for each type of hardware • Cashmere does the integration, application logic, mapping, and load balancing for multiple GPUs/node

  49. Cashmere skeleton

  50. Kernel performance (GFLOP/s)
