
Parallel Concept and Hardware Architecture CUDA Programming Model Overview


Presentation Transcript


  1. Parallel Concept and Hardware Architecture / CUDA Programming Model Overview Yukai Hung, Department of Mathematics, National Taiwan University

  2. Parallel Concept Overview: making your program faster

  3. Parallel Computing Goals • Solve problems in less time • - divide one problem into smaller pieces • - solve the smaller pieces concurrently • - makes it possible to solve much bigger problems • Prepare to parallelize a problem • - represent the algorithm as a Directed Acyclic Graph (DAG) • - identify dependencies in the problem • - identify critical paths in the algorithm • - modify dependencies to shorten the critical path • - for example, summing 8 numbers as a balanced reduction tree shortens the critical path from 7 sequential additions to 3 levels

  4. Parallel Computing Goals • What is parallel computing?

  5. Amdahl’s Law • Speedup of a parallel program is limited by the amount of serial work

  6. Amdahl’s Law • Speedup of a parallel program is limited by the amount of serial work
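In symbols (a worked form added here, not on the original slides): if P is the fraction of the program that can be parallelized and N is the number of processors, Amdahl's law gives

```latex
S(N) = \frac{1}{(1 - P) + P / N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - P}
```

For example, with P = 0.9 the speedup can never exceed 10x, no matter how many processors are added.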

  7. Race Condition • Consider the following parallel program • - the threads almost never execute in lockstep, so the result depends on how their instructions interleave

  8. Race Condition • Scenario 1 • - the resulting value of R is 2 if the initial value of R is 1

  9. Race Condition • Scenario 2 • - the resulting value of R is 2 if the initial value of R is 1

  10. Race Condition • Scenario 3 • - the resulting value of R is 3 if the initial value of R is 1
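The three scenarios are different interleavings of the same read-modify-write sequence. Below is a minimal CUDA sketch of the race (the kernel and variable names are mine, not from the slides): two threads each increment R once, so the correct result from R = 1 is 3, but a lost update can yield 2.

```cuda
#include <cstdio>

// Each thread reads R, adds one, and writes it back. The three steps
// of the two threads can interleave, so an update may be lost.
__global__ void racyIncrement(int *r)
{
    int local = *r;     // read
    local = local + 1;  // modify
    *r = local;         // write
}

int main()
{
    int *d_r, h_r = 1;                      // initial value R = 1
    cudaMalloc((void **)&d_r, sizeof(int));
    cudaMemcpy(d_r, &h_r, sizeof(int), cudaMemcpyHostToDevice);

    racyIncrement<<<1, 2>>>(d_r);           // two threads race on R
    cudaMemcpy(&h_r, d_r, sizeof(int), cudaMemcpyDeviceToHost);

    printf("R = %d (3 if no update was lost, 2 otherwise)\n", h_r);
    cudaFree(d_r);
    return 0;
}
```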

  11. Race Condition with Lock • Solve the race condition by locking • - manage the shared resource between threads • - avoid deadlock and load-imbalance problems

  12. Race Condition with Lock • Guarantees that the executed instruction order is correct • - but the locked section reverts to a sequential procedure • - and the lock and release procedures have high overhead

  13. Race Condition with Semaphore • Solve the race condition by semaphore • - a multi-value locking method (an extension of binary locking) • - the instructions in procedures P and V are atomic operations
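The slides fix the race with locks and semaphores; in CUDA, which the later slides introduce, the idiomatic fix for this particular read-modify-write pattern is a hardware atomic operation rather than an explicit lock. A minimal sketch (the kernel name is mine):

```cuda
// atomicAdd performs the read-modify-write as one indivisible
// operation, so concurrent increments are never lost and no
// explicit lock/release procedure is needed.
__global__ void safeIncrement(int *r)
{
    atomicAdd(r, 1);
}
```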

  14. Instruction Level Parallelism • Multiple instructions are executed simultaneously • - reorder the instructions carefully to gain efficiency • - the compiler reorders the assembly instructions automatically • - (figure: step 1, step 2, step 3; see the sketch below)
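A small illustration of the three steps (my own example, not from the slides): the first two statements are independent, so the compiler or the hardware pipeline may overlap or reorder them; only the third depends on both.

```cuda
// step 1 and step 2 are independent and may execute in either order
// or simultaneously; step 3 must wait for both results.
__global__ void ilpExample(const float *a, const float *b, float *out)
{
    float x = a[0] + 1.0f;  // step 1
    float y = b[0] * 2.0f;  // step 2: no dependency on x
    out[0] = x + y;         // step 3: depends on x and y
}
```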

  15. Data Level Parallelism • Multiple data operations are executed simultaneously • - the computational data is separable and independent • - a single operation is repeated over different input data • - (figure: sequential procedure versus parallel procedure; see the sketch below)
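A minimal CUDA sketch of this pattern (a standard SAXPY-style kernel, added here for illustration): one thread handles one element, so the same operation runs over independent input data in parallel.

```cuda
// y[i] = a * x[i] + y[i], one thread per element: a single operation
// repeated over different, independent input data.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```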

  16. Flynn’s Taxonomy • Classification for parallel computers and programs

  17. Flynn’s Taxonomy • Classification for parallel computers and programs • - SISD (Single Instruction, Single Data) and SIMD (Single Instruction, Multiple Data)

  18. Flynn’s Taxonomy • Classification for parallel computers and programs • - MISD (Multiple Instruction, Single Data) and MIMD (Multiple Instruction, Multiple Data)

  19. CPU and GPU Hardware Comparison

  20. CPU versus GPU

  21. CPU versus GPU • Intel Penryn quad-core: 255 mm², 0.82B transistors • NVIDIA GTX280: >500 mm², 1.4B transistors

  22. CPU versus GPU • (figures: computing throughput in GFLOPS and memory bandwidth)

  23. CPU versus GPU • GPU/CPU comparison of control logic and cache size versus number of ALUs • comparison of clock rates, core counts, and execution latency

  24. General Purpose GPU Computation

  25. General Purpose GPU Computation • algorithm conversion requires knowledge of graphics APIs (OpenGL and DirectX)

  26. General Purpose GPU Computation • data must be converted to and from pixel formats, which restricts the general use of algorithms

  27. From the Traditional GPU Architecture to Today's

  28. Simplified Graphic Pipeline • sorting stage • z-buffer collection

  29. Simplified Graphic Pipeline • maximum depth z-cull feedback

  30. Simplified Graphic Pipeline • scale up some units

  31. Simplified Graphic Pipeline • add framebuffer access • the bottleneck is the framebuffer interface (FBI) unit, which manages memory

  32. Simplified Graphic Pipeline • add programmability via ALU units • add programmable geometry and pixel shaders

  33. Simplified Graphic Pipeline • consider two similar units and a special case in the pipeline • (1) one triangle but many pixels: the pixel shader is busy • (2) many triangles but one pixel: the geometry shader is busy

  34. Iterative Graphic Pipeline • combine the two units into a unified shader • scalable between geometry and pixel workloads • memory resource management becomes important

  35. Graphic Pipeline Comparison • (figure: software pipeline versus hardware pipeline)

  36. Unified Graphic Architecture • Switch between two modes: graphics mode and CUDA mode • (diagram, graphics mode: Host → Input Assembler → Setup/Rstr/ZCull with vertex, geometry, and pixel thread issue; arrays of streaming processors (SP) with texture filter (TF) units and L1 caches; a thread processor; L2 caches; and framebuffer (FB) partitions)

  37. Unified Graphic Architecture • Switch between two modes: graphics mode and CUDA mode • (diagram, CUDA mode: Host → Input Assembler → Thread Execution Manager; processor arrays with parallel data caches and texture units; load/store paths to global memory)

  38. Thread Streaming Processing • no data communication between threads, which suits traditional graphics workloads

  39. Shader Register File/Cache • Separate register files • - strict streaming processing mode • - no data sharing at the instruction level • - dynamic allocation and renaming • - no memory addressing order exists • - registers overflow into local memory

  40. Thread Streaming Processing • communication between different threads in the same shader becomes an issue

  41. Shader Register File/Cache • Shared register files • - an extra memory hierarchy between the shader registers and global memory • - share data within the same shader, or among threads in the same block • - synchronize all threads in the shader
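In CUDA terms, this shared register file corresponds to the per-block shared memory. A minimal sketch of sharing and synchronizing within a block (the kernel name and block size are my assumptions):

```cuda
// Threads in one block stage data in on-chip shared memory, then
// synchronize so every thread can safely read a slot written by
// another thread. Launch as, e.g., reverseInBlock<<<1, 256>>>(d_data).
__global__ void reverseInBlock(float *data)
{
    __shared__ float tile[256];          // assumes blockDim.x == 256
    int t = threadIdx.x;
    tile[t] = data[t];                   // each thread writes one slot
    __syncthreads();                     // make all writes visible block-wide
    data[t] = tile[blockDim.x - 1 - t];  // read another thread's slot
}
```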

  42. Unified Graphic Architecture

  43. CPU and GPU Chip Architecture

  44. CPU and GPU Chip Architecture

  45. GPU Applications Today • Traditional game rendering • - clothing demo and star tales benchmark • - real-time rendering of physical phenomena • - mixed mode for simulating physical phenomena • Scientific computing • - molecular dynamics simulation • - protein folding and n-body simulation • - medical imaging and computational fluid dynamics • CUDA community showcase

  46. GPU Features Today • GPUs are becoming more programmable than before • - only a standard C extension is needed to program the unified scalable shaders • GPUs now support 32-bit and 64-bit floating point operations • - almost IEEE floating point compliant, except for some special cases • - lack mantissa denormalization for small floating point numbers • GPUs have much higher memory bandwidth than CPUs • - multiple memory banks, driven by the needs of high-performance graphics • Massive data-level parallel architecture • - hundreds of thread processors on the chip • - thousands of concurrent threads on the shaders • - lightweight thread switching hides the long memory latency

  47. General Purpose GPU Environment • CUDA: Compute Unified Device Architecture • - a realistic hardware and software GPGPU solution • - a minimal set of standard C language extensions • - the tool set includes a compiler and software development kits • OpenCL: Open Computing Language • - similar to CUDA from a GPGPU point of view • - supports both CPU and GPU hardware architectures • - executes across heterogeneous platform resources

  48. CUDA Programming Overview

  49. CUDA Programming Model • Integrated host and device application C program • - serial or modestly parallel parts in host C code • - highly parallel parts in device C extension code

  50. CUDA Programming Model • What is the compute device? • - a coprocessor to the host • - it has its own device memory space • - it runs many active threads in parallel • What is the difference between CPU and GPU threads? • - GPU threads are extremely lightweight • - GPU threads have almost no creation overhead • - the GPU needs more than 1000 threads for full occupancy • - a multi-core CPU can execute or create only a few threads
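A minimal integrated host/device program of the kind these slides describe (a sketch; the names and sizes are my own): the host part is serial, while the single kernel launch creates roughly a million lightweight GPU threads.

```cuda
#include <cstdio>
#include <cstdlib>

// Device code: the highly parallel part, one lightweight thread per element.
__global__ void addOne(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += 1.0f;
}

// Host code: the serial part that manages device memory and launches the kernel.
int main()
{
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Far more than 1000 threads, as the slide recommends for full occupancy.
    addOne<<<(n + 255) / 256, 256>>>(d, n);

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f, h[n-1] = %f\n", h[0], h[n - 1]);

    cudaFree(d);
    free(h);
    return 0;
}
```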
