The landscape of computer architecture has changed significantly: power and transistor economics no longer dominate design; data locality and parallel computing do. This presentation traces the evolution of general-purpose computing on graphics processing units (GPGPU), which have grown into versatile processors suitable for many applications beyond 3D graphics. It discusses the challenges posed by the "Power Wall," "Memory Wall," and "ILP Wall," surveys alternative architectures such as Processor-in-Memory, and follows the rise of programmable GPUs, which exploit fine-grain SIMD parallelism to deliver high performance at low cost across a range of computational tasks.
GPGPU
Ing. Martino Ruggiero, Ing. Andrea Marongiu
martino.ruggiero@unibo.it, a.marongiu@unibo.it
Old and New Wisdom in Computer Architecture
• Old: Power is free, transistors are expensive
• New: "Power wall": power is expensive, transistors are free (we can put more transistors on a chip than we can afford to turn on)
• Old: Multiplies are slow, memory access is fast
• New: "Memory wall": multiplies are fast, memory is slow (about 200 clocks to DRAM versus 4 clocks for an FP multiply)
• Old: Increase instruction-level parallelism via compilers and hardware innovation (out-of-order execution, speculation, VLIW, ...)
• New: "ILP wall": diminishing returns on more ILP hardware; explicit thread and data parallelism must be exploited
• New: Power Wall + Memory Wall + ILP Wall = Brick Wall
Architectures: Data-Processor Locality
• Field-Programmable Gate Array (FPGA): compute by configuring Boolean functions and local memory
• Processor Array / Multi-core Processor: assemble many (simple) processors and memories on one chip
• Processor-in-Memory (PIM): insert processing elements directly into RAM chips
• Stream Processor: create data locality through a hierarchy of memories
• Graphics Processing Unit (GPU): hide data-access latency by keeping thousands of threads in flight
GPUs often excel in the performance/price ratio.
Graphics Processing Unit (GPU)
• Development driven by the multi-billion-dollar game industry (bigger than Hollywood)
• Need for physics, AI, and complex lighting models
• Impressive FLOPS-per-dollar performance: the hardware has to be affordable
• Evolution speed surpasses Moore's law: performance doubles approximately every 6 months
What is GPGPU?
• The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor: programmability, precision, power
• GPGPU: an emerging field seeking to harness GPUs for general-purpose computation other than 3D graphics
• The GPU accelerates the critical path of the application
• Data-parallel algorithms leverage GPU attributes: large data arrays, streaming throughput, fine-grain SIMD parallelism, low-latency floating-point (FP) computation
• Applications (see GPGPU.org): game effects (FX), physics, image processing, physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
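The canonical shape of a data-parallel algorithm is a loop whose iterations are all independent. SAXPY (scaled vector addition, a staple of matrix algebra) is a minimal example; written here as plain C, but the key property is that every element could be handled by its own GPU thread:

```c
#include <stddef.h>

/* SAXPY: y[i] = a*x[i] + y[i].
 * Every iteration is independent of every other, so all n elements
 * can be processed in parallel: on a GPU, one thread per element. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Convolution, correlation, and sorting networks fit the same mold: large arrays, regular access, little control flow.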
Motivation 1: Computational Power
• GPUs are fast...
• GPUs are getting faster, faster
Motivation 2: Flexible, Precise, and Cheap
• Modern GPUs are deeply programmable, with solidifying high-level language support
• Modern GPUs support high precision: 32-bit floating point throughout the pipeline, high enough for many (not all) applications
Parallel Computing on a GPU
• NVIDIA GPU Computing Architecture, accessed via a separate HW interface
• Found in laptops, desktops, workstations, and servers (GeForce 8800, Tesla D870, Tesla S870)
• 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications
• GPU parallelism is doubling every year
• The programming model scales transparently
• Programmable in C with the CUDA tools
• The multithreaded SPMD model exploits both data parallelism and thread parallelism in the application
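In the SPMD (single program, multiple data) model, the programmer writes one kernel body; the hardware runs it once per logical thread, each thread discovering its own index. A sketch of the idea in plain C, emulating the launch with loops (in CUDA, the two loops are replaced by the hardware scheduler, which supplies the block and thread indices; the names `kernel` and `launch` here are illustrative, not CUDA API):

```c
/* SPMD sketch: one "kernel" body, many logical threads. */
typedef struct { int block; int thread; } thread_id;

/* The kernel: identical code for every thread; only the index differs. */
static void kernel(thread_id id, int threads_per_block, float *out) {
    int i = id.block * threads_per_block + id.thread; /* global index */
    out[i] = (float)i * 2.0f;
}

/* Emulated launch: in CUDA these loops are the hardware's job. */
void launch(int blocks, int threads_per_block, float *out) {
    for (int b = 0; b < blocks; b++)
        for (int t = 0; t < threads_per_block; t++)
            kernel((thread_id){b, t}, threads_per_block, out);
}
```

Because the kernel never assumes how many threads run concurrently, the same program scales transparently from a small GPU to a large one.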
Towards GPGPU
• The earlier 3D GPU: a fixed-function graphics pipeline
• The modern 3D GPU: a programmable parallel processor
• NVIDIA's Tesla and Fermi architectures unify the vertex and pixel processors
The evolution of the pipeline
• Elements of the graphics pipeline:
• A scene description: vertices, triangles, colors, lighting
• Transformations that map the scene to a camera viewpoint
• "Effects": texturing, shadow mapping, lighting calculations
• Rasterization: converting geometry into pixels
• Pixel processing: depth tests, stencil tests, and other per-pixel operations
• Parameters controlling the design of the pipeline:
• Where is the boundary between CPU and GPU?
• What transfer method is used?
• What resources are provided at each step?
• What units can access which GPU memory elements?
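The transformation stage is, at its core, a 4x4 matrix multiplied against every homogeneous vertex (x, y, z, w); this is the work the early pipeline generations below moved from the CPU onto the GPU. A minimal sketch of that per-vertex operation:

```c
/* The per-vertex transformation step: a 4x4 matrix applied to a
 * homogeneous vertex (x, y, z, w). Done once per vertex per frame. */
typedef struct { float v[4]; } vec4;
typedef struct { float m[4][4]; } mat4;

vec4 transform(const mat4 *m, vec4 p) {
    vec4 r = {{0.0f, 0.0f, 0.0f, 0.0f}};
    for (int row = 0; row < 4; row++)
        for (int col = 0; col < 4; col++)
            r.v[row] += m->m[row][col] * p.v[col];
    return r;
}
```

Model, view, and projection transforms are all matrices of this form, so the whole chain composes into one matrix applied uniformly to a large, independent stream of vertices: exactly the workload GPUs are built for.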
Generation I: 3dfx Voodoo (1996)
• One of the first true 3D game cards
• Worked by supplementing a standard 2D video card
• Did not do vertex transformations: these were done on the CPU
• Did do texture mapping and z-buffering
[Pipeline: vertex transforms on the CPU → PCI → GPU: primitive assembly, rasterization and interpolation, raster operations → frame buffer]
Generation II: GeForce/Radeon 7500 (1998)
• Main innovation: shifting the transformation and lighting calculations to the GPU
• Allowed multi-texturing: bump maps, light maps, and others
• Faster AGP bus instead of PCI
[Pipeline: GPU: vertex transforms, primitive assembly, rasterization and interpolation, raster operations → frame buffer; connected over AGP]
Generation III: GeForce3/Radeon 8500 (2001)
• For the first time, allowed a limited amount of programmability in the vertex pipeline
• Also allowed volume texturing and multi-sampling (for antialiasing)
[Pipeline: GPU: small vertex shaders, primitive assembly, rasterization and interpolation, raster operations → frame buffer; connected over AGP]
Generation IV: Radeon 9700/GeForce FX (2002)
• The first generation of fully programmable graphics cards
• Different versions have different resource limits on fragment/vertex programs
[Pipeline: GPU: programmable vertex shader, primitive assembly, rasterization and interpolation, programmable fragment processor, raster operations → frame buffer; connected over AGP]
The programmable pipeline
[Pipeline: 3D application or game → 3D API commands (OpenGL or Direct3D) → CPU-GPU boundary (AGP/PCIe) → GPU front end → programmable vertex processor (pre-transformed vertices in, transformed vertices out) → primitive assembly (vertex index stream, assembled primitives) → rasterization and interpolation (pixel location stream, pre-transformed fragments) → programmable fragment processor (transformed fragments) → raster operations (pixel updates) → frame buffer]
• Vertex processors: operate on the vertices of primitives (points, lines, and triangles); typical operations: transforming coordinates, setting up lighting and texture parameters
• Pixel processors: operate on the rasterizer output; typical operations: filling the interior of primitives
The road to unification
• Vertex and pixel processors have evolved at different rates
• Because GPUs typically must process more pixels than vertices, pixel-fragment processors traditionally outnumber vertex processors by about three to one
• However, typical workloads are not well balanced, leading to inefficiency: with large triangles the vertex processors are mostly idle while the pixel processors are fully busy; with small triangles the opposite is true
• The addition of more complex primitive processing makes it much harder to select a fixed processor ratio
• Increased generality increased the design complexity, area, and cost of developing two separate processors
• All these factors led to a unified architecture: vertex and pixel-fragment shader programs execute on the same unified processor architecture
Clarification: SIMD processing does not imply SIMD instructions
• Option 1: Explicit vector instructions (Intel/AMD x86 SSE, Intel Larrabee)
• Option 2: Scalar instructions with implicit HW vectorization: the hardware determines instruction-stream sharing across ALUs (the amount of sharing is hidden from software); NVIDIA GeForce ("SIMT" warps), ATI Radeon architectures
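The two options can be contrasted in a few lines of plain C. In option 1 the programmer writes one operation over an explicit fixed-width vector; in option 2 the programmer writes scalar code for a single element and the hardware runs it across the lanes of a warp in lockstep. A toy 4-wide sketch (the types and the `warp_execute` driver are illustrative, not any vendor's API):

```c
#define WIDTH 4
typedef struct { float lane[WIDTH]; } vecf;

/* Option 1: explicit SIMD. The vector width is visible to the programmer,
 * who writes one "vector instruction" over all lanes. */
vecf vec_add(vecf a, vecf b) {
    vecf r;
    for (int i = 0; i < WIDTH; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Option 2: implicit vectorization (SIMT). The programmer writes scalar
 * code for one element... */
float scalar_add(float a, float b) { return a + b; }

/* ...and the hardware (modeled by this loop) runs that same scalar
 * instruction stream across every lane of the warp in lockstep. */
void warp_execute(vecf a, vecf b, vecf *out) {
    for (int lane = 0; lane < WIDTH; lane++)
        out->lane[lane] = scalar_add(a.lane[lane], b.lane[lane]);
}
```

Both options perform identical arithmetic; the difference is whether the vector width is part of the programming model (SSE) or hidden inside the hardware (SIMT warps).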
Stalls!
• A stall occurs when a core cannot run the next instruction because of a dependency on a previous operation
• Memory access latency is hundreds to thousands of cycles
• We've removed the fancy caches and logic that help avoid stalls
• But we have LOTS of independent work items
• Idea #3: interleave processing of many elements on a single core to avoid stalls caused by high-latency operations
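The payoff of interleaving can be shown with a toy cycle-count model (an assumption for illustration, not a real machine model): each element needs one long-latency load followed by one cycle of compute. A serial core waits out every load; an interleaved core issues the next element's load while earlier loads are still in flight, so only the first load's latency is ever exposed:

```c
#define MEM_LAT 100 /* assumed load latency, in cycles */

/* Serial: the core stalls for the full load latency on every element. */
long cycles_serial(long elems) {
    return elems * (MEM_LAT + 1);
}

/* Interleaved: one load issues per cycle across independent elements,
 * so loads overlap; after the first load returns, one element's compute
 * completes every cycle. Assumes enough elements and in-flight capacity. */
long cycles_interleaved(long elems) {
    return MEM_LAT + elems;
}
```

With 1000 elements and a 100-cycle load, the serial core spends 101,000 cycles versus roughly 1,100 interleaved: nearly a 100x difference. This is exactly why GPUs keep thousands of threads in flight instead of building large caches.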