Why GPUs? Robert Strzodka
Overview
• Computation / Bandwidth / Power
• CPU – GPU Comparison
• GPU Characteristics
Data Processing in General
[Diagram: data flows IN from memory, through the processor, and OUT to memory; the two bottlenecks are the lack of parallelism in the processor and the memory wall between processor and memory.]
Old and New Wisdom in Computer Architecture
• Old: Power is free, transistors are expensive
• New: "Power wall", power expensive, transistors free (can put more transistors on a chip than we can afford to turn on)
• Old: Multiplies are slow, memory access is fast
• New: "Memory wall", multiplies fast, memory slow (200 clocks to DRAM memory, 4 clocks for an FP multiply)
• Old: Increasing instruction-level parallelism via compilers, innovation (out-of-order, speculation, VLIW, …)
• New: "ILP wall", diminishing returns on more ILP hardware (explicit thread and data parallelism must be exploited)
• New: Power wall + memory wall + ILP wall = brick wall
slide courtesy of Christos Kozyrakis
Uniprocessor Performance (SPECint)
[Chart: uniprocessor SPECint performance over time; growth flattens, leaving roughly a 3x gap versus the earlier trend. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
Sea change in chip design: multiple "cores" or processors per chip
slide courtesy of Christos Kozyrakis
Instruction-Stream-Based Processing
[Diagram: the processor consumes an instruction stream from memory and reads and writes individual data items through a cache.]
Instruction- and Data-Streams
Addition of 2D arrays: C = A + B

Instruction stream processing:
    for(y=0; y<HEIGHT; y++)
      for(x=0; x<WIDTH; x++) {
        C[y][x] = A[y][x] + B[y][x];
      }

Data streams undergoing a kernel operation:
    inputStreams(A,B);
    outputStream(C);
    kernelProgram(OP_ADD);
    processStreams();
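On the GPU, the kernel of such a stream program is a fragment program written in the graphics language described on a later slide. As a minimal Cg-style sketch (not from the original slides; the names addArrays, texA and texB are illustrative), the OP_ADD kernel could look like this, assuming A and B are bound as textures:

    // Kernel for C = A + B: executed once per output array element, order unknown.
    float4 addArrays(float2 coords : TEXCOORD0,         // output position, supplied by the rasterizer
                     uniform samplerRECT texA,          // input array A, bound as a texture
                     uniform samplerRECT texB) : COLOR  // input array B; the result goes to output array C
    {
        return texRECT(texA, coords) + texRECT(texB, coords);
    }

The host still performs the inputStreams/outputStream/processStreams style setup; only the kernel body is expressed in the shading language.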
Data-Stream-Based Processing
[Diagram: data streams flow from memory through a configurable pipeline in the processor and back to memory; a separate configuration input sets up the pipeline.]
Architectures: Data – Processor Locality
• Field Programmable Gate Array (FPGA): compute by configuring Boolean functions and local memory
• Processor Array / Multi-core Processor: assemble many (simple) processors and memories on one chip
• Processor-in-Memory (PIM): insert processing elements directly into RAM chips
• Stream Processor: create data locality through a hierarchy of memories
Overview
• Computation / Bandwidth / Power
• CPU – GPU Comparison
• GPU Characteristics
The GPU is a Fast, Parallel Array Processor
• Input arrays: 1D, 2D (typical), 3D
• Output arrays: 1D, 2D (typical), 3D (slice)
• Vertex Processor (VP): kernel changes index regions of the output arrays
• Rasterizer: creates data streams from the index regions (a stream of array elements, order unknown)
• Fragment Processor (FP): kernel changes each datum independently, reads more input arrays
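To make the vertex-processor stage concrete, here is a minimal Cg-style vertex program sketch (an illustration, not code from the slides; modelViewProj and offset are assumed uniforms) whose kernel shifts the vertices bounding an output region, i.e. changes the index region of the output array:

    // Moves the vertices that bound the output index region by a 2D offset.
    void shiftRegion(float4 position : POSITION,
                     uniform float4x4 modelViewProj,   // projection set up by the host
                     uniform float2 offset,            // shift of the output index region
                     out float4 oPosition : POSITION)
    {
        float4 shifted = position + float4(offset, 0.0, 0.0);
        oPosition = mul(modelViewProj, shifted);
    }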
Index Regions in Output Arrays
• Quads and triangles: fastest option
• Line segments: slower, try to pair lines into 2xh or wx2 quads
• Point clouds: slowest, try to gather points into larger forms
High Level Graphics Language for the Kernels
• Float data types: half 16-bit (s10e5), float 32-bit (s23e8)
• Vectors, structs and arrays: float4, float vec[6], float3x4, float arr[5][3], struct {}
• Arithmetic and logic operators: +, -, *, /; &&, ||, !
• Trigonometric and exponential functions: sin, asin, exp, log, pow, …
• User-defined functions: max3(float a, float b, float c) { return max(a, max(b, c)); }
• Conditional statements and loops: if, for, while; dynamic branching in PS3
• Streaming and random data access
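As a rough illustration of these features (a sketch assuming a Cg-style profile; the kernel and texture names are made up), the following kernel combines vector types, swizzles, a constant-bound loop and the user-defined max3 from above:

    float max3(float a, float b, float c) { return max(a, max(b, c)); }

    // 3-tap box filter along x, then take the brightest color channel.
    float4 filterAndMax(float2 coords : TEXCOORD0,
                        uniform samplerRECT img) : COLOR
    {
        float4 sum = float4(0.0, 0.0, 0.0, 0.0);
        for (int i = -1; i <= 1; i++)                     // constant-bound loop
            sum += texRECT(img, coords + float2(i, 0.0));
        sum /= 3.0;
        float m = max3(sum.r, sum.g, sum.b);              // user-defined function, swizzles
        return float4(m, m, m, sum.a);
    }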
Input and Output Arrays
• CPU: input and output arrays may overlap
• GPU: input and output arrays must not overlap
Native Memory Layout – Data Locality
• CPU: 1D input and 1D output; higher dimensions via offsets
• GPU: 1D, 2D, 3D input; 2D output; other dimensions via offsets
[Diagram: color-coded locality of array elements, red (near) to blue (far).]
Data-Flow: Gather and Scatter
• CPU: arbitrary gather, arbitrary scatter
• GPU: arbitrary gather, restricted scatter
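In kernel terms, gather is out[i] = in[idx[i]] and scatter is out[idx[i]] = in[i]. A minimal Cg-style sketch of a gather through a dependent texture read (illustrative names, not from the slides):

    // Gather: the read position itself comes from an input array.
    float4 gatherKernel(float2 coords : TEXCOORD0,
                        uniform samplerRECT addresses,  // array of 2D read positions (idx)
                        uniform samplerRECT data) : COLOR
    {
        float2 readPos = texRECT(addresses, coords).xy;  // out[i] = in[idx[i]]
        return texRECT(data, readPos);
    }
    // Scatter (out[idx[i]] = in[i]) has no fragment-program equivalent: a fragment
    // cannot change where its own result is written. Output positions can only be
    // moved earlier, in the vertex processor, hence "restricted scatter" on the GPU.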
Overview
• Computation / Bandwidth / Power
• CPU – GPU Comparison
• GPU Characteristics
1) Computational Performance
[Chart: peak GFLOPS of successive GPUs (up to the ATI R520) compared with CPUs. Chart courtesy of John Owens.]
Note: sustained performance is usually much lower and depends heavily on the memory system!
2) Memory Performance
• CPU: large cache, few processing elements, optimized for spatial and temporal data reuse
• GPU: small cache, many processing elements, optimized for sequential (streaming) data access
[Chart: memory bandwidth of the GeForce 7800 GTX versus the Pentium 4 for cached, sequential and random access. Chart courtesy of Ian Buck.]
3) Configuration Overhead
[Chart: small workloads are configuration limited, large workloads are computation limited. Chart courtesy of Ian Buck.]
Conclusions
• Parallelism is now indispensable for further performance increases
• Both memory-dominated and processing-element-dominated designs have pros and cons
• Mapping algorithms to the appropriate architecture allows enormous speedups
• Many of the GPU's restrictions are crucial for its parallel efficiency (eat the cake or have it)