The landscape of computer architecture has changed significantly: power and transistor economics no longer dominate design; data locality and parallel computing do. This presentation traces the evolution of general-purpose computing on graphics processing units (GPGPU), which have grown into versatile processors suitable for many applications beyond 3D graphics. It discusses the challenges posed by the "Power Wall," "Memory Wall," and "ILP Wall," surveys alternative architectures such as Processor-in-Memory, and follows the rise of programmable GPUs, which exploit fine-grain SIMD parallelism to deliver high performance at low cost across a range of computational tasks.
GPGPU
Ing. Martino Ruggiero, Ing. Andrea Marongiu
martino.ruggiero@unibo.it, a.marongiu@unibo.it
Old and New Wisdom in Computer Architecture
• Old: Power is free, transistors are expensive
• New: "Power wall": power is expensive, transistors are free (we can put more transistors on a chip than we can afford to turn on)
• Old: Multiplies are slow, memory access is fast
• New: "Memory wall": multiplies are fast, memory is slow (about 200 clocks to DRAM versus 4 clocks for an FP multiply)
• Old: Increase instruction-level parallelism via compilers and hardware innovation (out-of-order execution, speculation, VLIW, ...)
• New: "ILP wall": diminishing returns on more ILP hardware; explicit thread and data parallelism must be exploited
• New: Power Wall + Memory Wall + ILP Wall = Brick Wall
Architectures: Data-Processor Locality
• Field-Programmable Gate Array (FPGA): compute by configuring Boolean functions and local memory
• Processor Array / Multi-core Processor: assemble many (simple) processors and memories on one chip
• Processor-in-Memory (PIM): insert processing elements directly into RAM chips
• Stream Processor: create data locality through a hierarchy of memories
• Graphics Processing Unit (GPU): hide data-access latency by keeping thousands of threads in flight
GPUs often excel in the performance/price ratio.
Graphics Processing Unit (GPU)
• Development driven by the multi-billion-dollar game industry (bigger than Hollywood)
• Need for physics, AI, and complex lighting models
• Impressive FLOPS-per-dollar performance: the hardware has to be affordable
• Evolution speed surpasses Moore's law: performance doubles approximately every 6 months
What is GPGPU?
• The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor: programmability, precision, power
• GPGPU: an emerging field seeking to harness GPUs for general-purpose computation other than 3D graphics
• The GPU accelerates the critical path of the application
• Data-parallel algorithms leverage GPU attributes: large data arrays, streaming throughput, fine-grain SIMD parallelism, low-latency floating-point (FP) computation
• Applications (see GPGPU.org): game effects (FX), physics, image processing, physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
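The canonical shape of a data-parallel algorithm is a loop whose iterations are all independent. SAXPY (scaled vector addition, a staple of matrix algebra) is a minimal example; written here as plain C, but the key property is that every element could be handled by its own GPU thread:

```c
#include <stddef.h>

/* SAXPY: y[i] = a*x[i] + y[i].
 * Every iteration is independent of every other, so all n elements
 * can be processed in parallel: on a GPU, one thread per element. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Convolution, correlation, and sorting networks fit the same mold: large arrays, regular access, little control flow.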
Motivation 1: Computational Power
• GPUs are fast...
• GPUs are getting faster, faster
Motivation 2: Flexible, Precise, and Cheap
• Modern GPUs are deeply programmable, with solidifying high-level language support
• Modern GPUs support high precision: 32-bit floating point throughout the pipeline, high enough for many (not all) applications
Parallel Computing on a GPU
• NVIDIA GPU Computing Architecture, accessed via a separate HW interface
• Found in laptops, desktops, workstations, and servers (GeForce 8800, Tesla D870, Tesla S870)
• 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications
• GPU parallelism is doubling every year
• The programming model scales transparently
• Programmable in C with the CUDA tools
• The multithreaded SPMD model exploits both data parallelism and thread parallelism in the application
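In the SPMD (single program, multiple data) model, the programmer writes one kernel body; the hardware runs it once per logical thread, each thread discovering its own index. A sketch of the idea in plain C, emulating the launch with loops (in CUDA, the two loops are replaced by the hardware scheduler, which supplies the block and thread indices; the names `kernel` and `launch` here are illustrative, not CUDA API):

```c
/* SPMD sketch: one "kernel" body, many logical threads. */
typedef struct { int block; int thread; } thread_id;

/* The kernel: identical code for every thread; only the index differs. */
static void kernel(thread_id id, int threads_per_block, float *out) {
    int i = id.block * threads_per_block + id.thread; /* global index */
    out[i] = (float)i * 2.0f;
}

/* Emulated launch: in CUDA these loops are the hardware's job. */
void launch(int blocks, int threads_per_block, float *out) {
    for (int b = 0; b < blocks; b++)
        for (int t = 0; t < threads_per_block; t++)
            kernel((thread_id){b, t}, threads_per_block, out);
}
```

Because the kernel never assumes how many threads run concurrently, the same program scales transparently from a small GPU to a large one.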
Towards GPGPU
• The earlier 3D GPU: a fixed-function graphics pipeline
• The modern 3D GPU: a programmable parallel processor
• NVIDIA's Tesla and Fermi architectures unify the vertex and pixel processors
The evolution of the pipeline
• Elements of the graphics pipeline:
• A scene description: vertices, triangles, colors, lighting
• Transformations that map the scene to a camera viewpoint
• "Effects": texturing, shadow mapping, lighting calculations
• Rasterization: converting geometry into pixels
• Pixel processing: depth tests, stencil tests, and other per-pixel operations
• Parameters controlling the design of the pipeline:
• Where is the boundary between CPU and GPU?
• What transfer method is used?
• What resources are provided at each step?
• What units can access which GPU memory elements?
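The transformation stage is, at its core, a 4x4 matrix multiplied against every homogeneous vertex (x, y, z, w); this is the work the early pipeline generations below moved from the CPU onto the GPU. A minimal sketch of that per-vertex operation:

```c
/* The per-vertex transformation step: a 4x4 matrix applied to a
 * homogeneous vertex (x, y, z, w). Done once per vertex per frame. */
typedef struct { float v[4]; } vec4;
typedef struct { float m[4][4]; } mat4;

vec4 transform(const mat4 *m, vec4 p) {
    vec4 r = {{0.0f, 0.0f, 0.0f, 0.0f}};
    for (int row = 0; row < 4; row++)
        for (int col = 0; col < 4; col++)
            r.v[row] += m->m[row][col] * p.v[col];
    return r;
}
```

Model, view, and projection transforms are all matrices of this form, so the whole chain composes into one matrix applied uniformly to a large, independent stream of vertices: exactly the workload GPUs are built for.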
Generation I: 3dfx Voodoo (1996)
• One of the first true 3D game cards
• Worked by supplementing a standard 2D video card
• Did not do vertex transformations: these were done on the CPU
• Did do texture mapping and z-buffering
[Pipeline: vertex transforms on the CPU → PCI → GPU: primitive assembly, rasterization and interpolation, raster operations → frame buffer]
Generation II: GeForce/Radeon 7500 (1998)
• Main innovation: shifting the transformation and lighting calculations to the GPU
• Allowed multi-texturing: bump maps, light maps, and others
• Faster AGP bus instead of PCI
[Pipeline: GPU: vertex transforms, primitive assembly, rasterization and interpolation, raster operations → frame buffer; connected over AGP]
Generation III: GeForce3/Radeon 8500 (2001)
• For the first time, allowed a limited amount of programmability in the vertex pipeline
• Also allowed volume texturing and multi-sampling (for antialiasing)
[Pipeline: GPU: small vertex shaders, primitive assembly, rasterization and interpolation, raster operations → frame buffer; connected over AGP]
Generation IV: Radeon 9700/GeForce FX (2002)
• The first generation of fully programmable graphics cards
• Different versions have different resource limits on fragment/vertex programs
[Pipeline: GPU: programmable vertex shader, primitive assembly, rasterization and interpolation, programmable fragment processor, raster operations → frame buffer; connected over AGP]
The programmable pipeline
[Pipeline: 3D application or game → 3D API commands (OpenGL or Direct3D) → CPU-GPU boundary (AGP/PCIe) → GPU front end → programmable vertex processor (pre-transformed vertices in, transformed vertices out) → primitive assembly (vertex index stream, assembled primitives) → rasterization and interpolation (pixel location stream, pre-transformed fragments) → programmable fragment processor (transformed fragments) → raster operations (pixel updates) → frame buffer]
• Vertex processors: operate on the vertices of primitives (points, lines, and triangles); typical operations: transforming coordinates, setting up lighting and texture parameters
• Pixel processors: operate on the rasterizer output; typical operations: filling the interior of primitives
The road to unification
• Vertex and pixel processors have evolved at different rates
• Because GPUs typically must process more pixels than vertices, pixel-fragment processors traditionally outnumber vertex processors by about three to one
• However, typical workloads are not well balanced, leading to inefficiency: with large triangles the vertex processors are mostly idle while the pixel processors are fully busy; with small triangles the opposite is true
• The addition of more complex primitive processing makes it much harder to select a fixed processor ratio
• Increased generality increased the design complexity, area, and cost of developing two separate processors
• All these factors led to a unified architecture: vertex and pixel-fragment shader programs execute on the same unified processor architecture
Clarification: SIMD processing does not imply SIMD instructions
• Option 1: Explicit vector instructions (Intel/AMD x86 SSE, Intel Larrabee)
• Option 2: Scalar instructions with implicit HW vectorization: the hardware determines instruction-stream sharing across ALUs (the amount of sharing is hidden from software); NVIDIA GeForce ("SIMT" warps), ATI Radeon architectures
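The two options can be contrasted in a few lines of plain C. In option 1 the programmer writes one operation over an explicit fixed-width vector; in option 2 the programmer writes scalar code for a single element and the hardware runs it across the lanes of a warp in lockstep. A toy 4-wide sketch (the types and the `warp_execute` driver are illustrative, not any vendor's API):

```c
#define WIDTH 4
typedef struct { float lane[WIDTH]; } vecf;

/* Option 1: explicit SIMD. The vector width is visible to the programmer,
 * who writes one "vector instruction" over all lanes. */
vecf vec_add(vecf a, vecf b) {
    vecf r;
    for (int i = 0; i < WIDTH; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Option 2: implicit vectorization (SIMT). The programmer writes scalar
 * code for one element... */
float scalar_add(float a, float b) { return a + b; }

/* ...and the hardware (modeled by this loop) runs that same scalar
 * instruction stream across every lane of the warp in lockstep. */
void warp_execute(vecf a, vecf b, vecf *out) {
    for (int lane = 0; lane < WIDTH; lane++)
        out->lane[lane] = scalar_add(a.lane[lane], b.lane[lane]);
}
```

Both options perform identical arithmetic; the difference is whether the vector width is part of the programming model (SSE) or hidden inside the hardware (SIMT warps).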
Stalls!
• A stall occurs when a core cannot run the next instruction because of a dependency on a previous operation
• Memory access latency is hundreds to thousands of cycles
• We've removed the fancy caches and logic that help avoid stalls
• But we have LOTS of independent work items
• Idea #3: interleave processing of many elements on a single core to avoid stalls caused by high-latency operations
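The payoff of interleaving can be shown with a toy cycle-count model (an assumption for illustration, not a real machine model): each element needs one long-latency load followed by one cycle of compute. A serial core waits out every load; an interleaved core issues the next element's load while earlier loads are still in flight, so only the first load's latency is ever exposed:

```c
#define MEM_LAT 100 /* assumed load latency, in cycles */

/* Serial: the core stalls for the full load latency on every element. */
long cycles_serial(long elems) {
    return elems * (MEM_LAT + 1);
}

/* Interleaved: one load issues per cycle across independent elements,
 * so loads overlap; after the first load returns, one element's compute
 * completes every cycle. Assumes enough elements and in-flight capacity. */
long cycles_interleaved(long elems) {
    return MEM_LAT + elems;
}
```

With 1000 elements and a 100-cycle load, the serial core spends 101,000 cycles versus roughly 1,100 interleaved: nearly a 100x difference. This is exactly why GPUs keep thousands of threads in flight instead of building large caches.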