GPGPU

Presentation Transcript


  1. GPGPU Ing. Martino Ruggiero (martino.ruggiero@unibo.it), Ing. Andrea Marongiu (a.marongiu@unibo.it)

  2. Old and New Wisdom in Computer Architecture • Old: Power is free, Transistors are expensive • New: “Power wall”, Power expensive, Transistors free (Can put more transistors on chip than can afford to turn on) • Old: Multiplies are slow, Memory access is fast • New: “Memory wall”, Multiplies fast, Memory slow (200 clocks to DRAM memory, 4 clocks for FP multiply) • Old: Increasing Instruction Level Parallelism via compilers, innovation (Out-of-order, speculation, VLIW, …) • New: “ILP wall”, diminishing returns on more ILP HW (Explicit thread and data parallelism must be exploited) • New: Power Wall + Memory Wall + ILP Wall = Brick Wall

  3. Uniprocessor Performance (SPECint)

  4. SW Performance: 1993-2008

  5. Instruction-Stream Based Processing

  6. Data-Stream-Based Processing

  7. Instruction- and Data-Streams

  8. Architectures: Data–Processor Locality • Field Programmable Gate Array (FPGA) • Compute by configuring Boolean functions and local memory • Processor Array / Multi-core Processor • Assemble many (simple) processors and memories on one chip • Processor-in-Memory (PIM) • Insert processing elements directly into RAM chips • Stream Processor • Create data locality through a hierarchy of memories • Graphics Processor Unit (GPU) • Hide data access latencies by keeping 1000s of threads in-flight GPUs often excel in the performance/price ratio

  9. Graphics Processing Unit (GPU) • Development driven by the multi-billion dollar game industry • Bigger than Hollywood • Need for physics, AI and complex lighting models • Impressive Flops / dollar performance • Hardware has to be affordable • Evolution speed surpasses Moore’s law • Performance doubles approximately every 6 months

  10. What is GPGPU? • The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor • Programmability • Precision • Power • GPGPU: an emerging field seeking to harness GPUs for general-purpose computation other than 3D graphics • GPU accelerates critical path of application • Data parallel algorithms leverage GPU attributes • Large data arrays, streaming throughput • Fine-grain SIMD parallelism • Low-latency floating point (FP) computation • Applications – see //GPGPU.org • Game effects (FX) physics, image processing • Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
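The "fine-grain SIMD parallelism" over "large data arrays" that slide 10 describes boils down to one pattern: the same operation applied independently to every array element. A minimal sketch in plain Python (no GPU; SAXPY chosen as the classic kernel shape, not taken from the slides):

```python
def saxpy(a, x, y):
    """a*x + y over whole arrays: the classic data-parallel kernel shape."""
    # Each output element depends only on the inputs at the same index,
    # so all iterations are independent and could run in parallel on a GPU.
    return [a * xi + yi for xi, yi in zip(x, y)]

result = saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```

Any algorithm that can be phrased this way (image filters, matrix algebra, convolution, as listed above) maps well onto the GPU's streaming-throughput model.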

  11. Motivation 1: • Computational Power • GPUs are fast… • GPUs are getting faster, faster

  12. Motivation 2: • Flexible, Precise and Cheap: • Modern GPUs are deeply programmable • Solidifying high-level language support • Modern GPUs support high precision • 32 bit floating point throughout the pipeline • High enough for many (not all) applications

  13. Parallel Computing on a GPU • NVIDIA GPU Computing Architecture • Via a separate HW interface • In laptops, desktops, workstations, servers • 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications • GPU parallelism is doubling every year • Programming model scales transparently • Programmable in C with CUDA tools • Multithreaded SPMD model uses application data parallelism and thread parallelism [Pictured: GeForce 8800, Tesla D870, Tesla S870]

  14. Towards GPGPU • The previous 3D GPU • A fixed-function graphics pipeline • The modern 3D GPU • A programmable parallel processor • NVIDIA’s Tesla and Fermi architectures • Unify the vertex and pixel processors

  15. The evolution of the pipeline • Elements of the graphics pipeline: • A scene description: vertices, triangles, colors, lighting • Transformations that map the scene to a camera viewpoint • “Effects”: texturing, shadow mapping, lighting calculations • Rasterizing: converting geometry into pixels • Pixel processing: depth tests, stencil tests, and other per-pixel operations • Parameters controlling the design of the pipeline: • Where is the boundary between CPU and GPU? • What transfer method is used? • What resources are provided at each step? • What units can access which GPU memory elements?
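The pipeline elements listed on slide 15 can be sketched as a chain of functions, each consuming the previous stage's output. A deliberately tiny toy model (2D shift for "transform", a line walk for "rasterize", a dictionary for the depth buffer; all names and simplifications are assumptions, not a real graphics API):

```python
def transform(vertices, camera_offset):
    # Map scene-space vertices into the camera's view (here: a 2D shift).
    return [(x - camera_offset[0], y - camera_offset[1]) for x, y in vertices]

def rasterize(v0, v1):
    # Convert a line segment into pixel coordinates (x-major walk).
    (x0, y0), (x1, y1) = v0, v1
    steps = max(int(abs(x1 - x0)), 1)
    return [(x0 + i, round(y0 + (y1 - y0) * i / steps)) for i in range(steps + 1)]

def depth_test(fragments, depth_buffer, depth):
    # Per-pixel operation: keep fragments nearer than the stored depth.
    return [p for p in fragments if depth < depth_buffer.get(p, float('inf'))]

verts = transform([(2.0, 2.0), (6.0, 4.0)], camera_offset=(2.0, 2.0))
pixels = rasterize(*verts)                        # geometry -> pixels
visible = depth_test(pixels, {(0, 0): 0.1}, depth=0.5)
```

The "parameters" questions on the slide amount to deciding which of these functions run on the CPU versus the GPU, which is exactly what the following generation-by-generation slides trace.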

  16. Generation I: 3dfx Voodoo (1996) • One of the first true 3D game cards • Worked by supplementing a standard 2D video card • Did not do vertex transformations: these were done on the CPU • Did do texture mapping and z-buffering [Pipeline diagram: CPU (Vertex Transforms, Primitive Assembly) → PCI → GPU (Rasterization and Interpolation, Raster Operations) → Frame Buffer]

  17. Generation II: GeForce/Radeon 7500 (1998) • Main innovation: shifting the transformation and lighting calculations to the GPU • Allowed multi-texturing: giving bump maps, light maps, and others • Faster AGP bus instead of PCI [Pipeline diagram: GPU (Vertex Transforms → Primitive Assembly → Rasterization and Interpolation → Raster Operations) → Frame Buffer, connected over AGP]

  18. Generation III: GeForce3/Radeon 8500 (2001) • For the first time, allowed a limited amount of programmability in the vertex pipeline • Also allowed volume texturing and multi-sampling (for antialiasing) [Pipeline diagram: GPU (small vertex shaders → Vertex Transforms → Primitive Assembly → Rasterization and Interpolation → Raster Operations) → Frame Buffer, connected over AGP]

  19. Generation IV: Radeon 9700/GeForce FX (2002) • The first generation of fully programmable graphics cards • Different versions have different resource limits on fragment/vertex programs [Pipeline diagram: Programmable Vertex Shader → Primitive Assembly → Rasterization and Interpolation → Programmable Fragment Processor → Raster Operations → Frame Buffer, connected over AGP]

  20. [Diagram: 3D Application or Game → 3D API (OpenGL or Direct3D) → CPU–GPU boundary (AGP/PCIe) → GPU Front End → Programmable Vertex Processor → Primitive Assembly → Rasterization and Interpolation → Programmable Fragment Processor → Raster Operations → Frame Buffer; streams: GPU command and data stream, vertex index stream, assembled primitives, pixel location stream, pre-transformed and transformed vertices and fragments, pixel updates] • Vertex processors • Operate on the vertices of primitives • Points, lines, and triangles • Typical operations • Transforming coordinates • Setting up lighting and texture parameters • Pixel processors • Operate on rasterizer output • Typical operations • Filling the interior of primitives

  21. The road to unification • Vertex and pixel processors have evolved at different rates • Because GPUs typically must process more pixels than vertices, pixel-fragment processors traditionally outnumber vertex processors by about three to one. • However, typical workloads are not well balanced, leading to inefficiency. • For example, with large triangles, the vertex processors are mostly idle, while the pixel processors are fully busy. With small triangles, the opposite is true. • The addition of more-complex primitive processing makes it much harder to select a fixed processor ratio. • Increased generality → increased design complexity, area, and cost of developing two separate processors • All these factors influenced the decision to design a unified architecture: • to execute vertex and pixel-fragment shader programs on the same unified processor architecture.
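The imbalance argument on slide 21 is easy to quantify with a back-of-the-envelope model (all numbers below are assumed for illustration): with a fixed pool of pixel and vertex units, total time is set by whichever pool is the bottleneck, and the other pool idles.

```python
def utilization(pixel_units, vertex_units, pixel_work, vertex_work):
    # Time is set by the slower (bottleneck) pool; the other pool idles.
    t = max(pixel_work / pixel_units, vertex_work / vertex_units)
    total_busy = pixel_work + vertex_work
    return total_busy / ((pixel_units + vertex_units) * t)

# Fixed 3:1 pixel:vertex ratio, as in pre-unified GPUs:
large_tris = utilization(3, 1, pixel_work=300, vertex_work=10)  # pixel-bound
small_tris = utilization(3, 1, pixel_work=30, vertex_work=100)  # vertex-bound
```

Under these toy numbers the pixel-heavy (large-triangle) workload keeps the chip far busier than the vertex-heavy (small-triangle) one, and no single fixed ratio fixes both — a unified pool that can run either shader type sidesteps the problem.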

  22. Previous GPGPU Constraints

  23. What’s wrong with GPGPU?

  24. From pixel/fragment to thread program…

  25. CPU-“style” cores

  26. Slimming down

  27. Two cores

  28. Four cores

  29. Sixteen cores

  30. Add ALUs

  31. 128 elements in parallel

  32. But what about branches?

  33. But what about branches?

  34. But what about branches?

  35. But what about branches?

  36. Clarification: SIMD processing does not imply SIMD instructions • Option 1: Explicit vector instructions – Intel/AMD x86 SSE, Intel Larrabee • Option 2: Scalar instructions, implicit HW vectorization • HW determines instruction stream sharing across ALUs (amount of sharing hidden from software) • NVIDIA GeForce (“SIMT” warps), ATI Radeon architectures
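Option 2 ("SIMT") also answers the earlier "but what about branches?" slides: scalar-looking code runs in lockstep over a group of lanes, and when lanes disagree on a branch, the hardware executes both paths and masks out the inactive lanes. A simplified simulation in plain Python (the lockstep/masking structure is the point; the helper names are made up):

```python
def simt_step(values, predicate, then_fn, else_fn):
    # All lanes evaluate the branch predicate together.
    mask = [predicate(v) for v in values]
    # Both sides of the branch execute over all lanes; the mask decides
    # which lanes commit results from each side (masked lanes do no work
    # that is visible, but the cycles are still spent).
    then_out = [then_fn(v) for v in values]
    else_out = [else_fn(v) for v in values]
    return [t if m else e for m, t, e in zip(mask, then_out, else_out)]

# Scalar source "if x > 0: x * 2 else: -x", implicitly vectorized:
out = simt_step([3, -1, 0, 5],
                predicate=lambda x: x > 0,
                then_fn=lambda x: x * 2,
                else_fn=lambda x: -x)
```

This also shows the cost of divergence: when a group of lanes splits across a branch, both paths consume issue slots, so peak throughput is only reached when all lanes in a group take the same path.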

  37. Stalls! • Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation • Memory access latency = 100s to 1000s of cycles • We’ve removed the fancy caches and logic that help avoid stalls • But we have LOTS of independent work items • Idea #3: Interleave processing of many elements on a single core to avoid stalls caused by high-latency operations
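Idea #3 can be made concrete with a toy cycle-count model (all latencies assumed, steady-state, no cache effects): each element needs a few compute cycles plus one long-latency memory access, and interleaving lets other elements' compute cover each outstanding access.

```python
def total_cycles(n_elems, compute=4, mem_latency=20, interleaved=True):
    if not interleaved:
        # One element at a time: every memory access is a full stall.
        return n_elems * (compute + mem_latency)
    # Round-robin over elements: while one element waits on memory, the
    # core computes on the others. A stall remains only if there is not
    # enough independent work to cover the latency.
    work = n_elems * compute
    uncovered = max(mem_latency - (n_elems - 1) * compute, 0)
    return work + n_elems * uncovered

serial = total_cycles(8, interleaved=False)
overlapped = total_cycles(8, interleaved=True)
```

With 8 in-flight elements the 20-cycle latency is fully hidden behind the other elements' compute; with fewer elements some stall cycles leak back in, which is why GPUs keep thousands of threads in flight rather than dozens.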

  38. Hiding stalls

  39. Hiding stalls
