

  1. NVIDIA’s Fermi: The First Complete GPU Computing Architecture. A white paper by Peter N. Glaskowsky. Presented by: Ahmad Hammad. Course: CSE 661 - Fall 2011

  2. Outline • Introduction • What is GPU Computing? • Fermi • The Programming Model • The Streaming Multiprocessor • The Cache and Memory Hierarchy • Conclusion

  3. Introduction • Traditional microprocessor technology is seeing diminishing returns. • Improvements in clock speed and architectural sophistication are slowing. • Focus has shifted to multicore designs. • These, too, are reaching practical limits for personal computing.

  4. Introduction (2) • CPUs are optimized for applications where the work is done by a limited number of threads • Threads exhibit high data locality • A mix of different operations • A high percentage of conditional branches • CPUs are inefficient for high-performance computing applications • The number of integer and floating-point execution units is small • Most of the CPU’s area, complexity, and the heat it generates are devoted to: • Caches, instruction decoders, branch predictors, and other features that enhance single-threaded performance.

  5. Introduction (3) • GPU design targets applications with many threads dominated by long sequences of computational instructions. • GPUs have become much better at thread handling, data caching, virtual memory management, flow control, and other CPU-like features. • CPUs will never go away, but • GPUs deliver more cost-effective and energy-efficient performance.

  6. Introduction (4) • The key GPU design goal is to maximize floating-point throughput. • Most of the circuitry within each core is dedicated to computation rather than to speculative features • Most of the power consumed by a GPU goes into the application’s actual algorithmic work.

  7. What is GPU Computing? • The use of a graphics processing unit to do general-purpose scientific and engineering computing. • GPU computing is not a replacement for CPU computing. • Each approach has advantages for certain kinds of software. • The CPU and GPU work together in a heterogeneous co-processing computing model. • The sequential part of the application runs on the CPU • The computationally intensive part is accelerated by the GPU.

  8. What is GPU Computing? • From the user’s perspective, the application simply runs faster because the GPU is used to boost performance.

  9. Fermi • Code name for NVIDIA’s next-generation CUDA architecture • Consists of • 16 streaming multiprocessors (SMs) • Each consisting of 32 cores • Each core can execute one floating-point or integer instruction per clock • The SMs are supported by a second-level cache • A host interface • The GigaThread scheduler • Multiple DRAM interfaces.

  10. Fermi • Code name for NVIDIA’s next-generation CUDA architecture

  11. The Programming Model • The complexity of the Fermi architecture is managed by a multi-level programming model • Allows software developers to focus on algorithm design • No need to know the details of mapping the algorithm to the hardware • Improves productivity

  12. The Programming Model: Kernels • In NVIDIA’s CUDA software platform, the computational elements of algorithms are called kernels • Kernels can be written in the C language • Extended with additional keywords to express parallelism directly • Once compiled, kernels consist of many threads that execute the same program in parallel
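
To make the kernel concept concrete, here is a minimal sketch of a CUDA C kernel (not from the white paper; the name vecAdd and its parameters are illustrative). The __global__ qualifier and the threadIdx/blockIdx/blockDim built-ins are the kind of C extensions the slide refers to.

    // Minimal CUDA C kernel: every thread computes one element of the result.
    // __global__ marks a function that runs on the GPU and is launched from host code.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
        if (i < n)                                      // guard threads past the end of the data
            c[i] = a[i] + b[i];
    }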

  13. The Programming Model: Thread Blocks

  14. The Programming Model: Warps • Thread blocks are divided into warps of 32 threads. • The warp is the fundamental unit of dispatch within a single SM. • Two warps from different thread blocks can be issued and executed concurrently • Increases hardware utilization and energy efficiency. • Thread blocks are grouped into grids • Each grid executes a unique kernel
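
As a rough sketch of how these units relate at launch time (assuming the vecAdd kernel above, with device pointers d_a, d_b, d_c already allocated via cudaMalloc): the host specifies a grid of thread blocks, and the hardware then splits each block into 32-thread warps.

    // Hypothetical launch: 256 threads per block, so each block is executed as
    // 256 / 32 = 8 warps; the grid is sized to cover all n elements.
    int n = 1 << 20;
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);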

  15. The Programming Model • At any one time, the entire Fermi device is dedicated to a single application • An application may include multiple kernels • Fermi supports simultaneous execution of multiple kernels from the same application • Each kernel is distributed to one or more SMs • This capability increases the utilization of the device
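
Concurrent kernel execution is exposed to the programmer through CUDA streams. The sketch below is an assumption-laden illustration (kernel bodies, buffer names d_x/d_y, and launch sizes are made up, not from the white paper): two independent kernels from the same application are issued into separate streams so that Fermi may run them on different SMs at the same time.

    // Two trivial, independent kernels.
    __global__ void scaleA(float *x, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) x[i] *= 2.0f; }
    __global__ void addOne(float *y, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) y[i] += 1.0f; }

    // Issue each kernel into its own stream; kernels in different streams may overlap.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    scaleA<<<blocksPerGrid, 256, 0, s0>>>(d_x, n);
    addOne<<<blocksPerGrid, 256, 0, s1>>>(d_y, n);
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);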

  16. The Programming Model: GigaThread • Switching from one application to another takes about 25 µs • Short enough to maintain high utilization even when running multiple applications • This switching is managed by GigaThread, the hardware thread scheduler • Manages 1,536 simultaneously active threads for each streaming multiprocessor across 16 kernels.

  17. The Programming Model: Languages • Fermi supports • The C language • FORTRAN (with independent solutions from The Portland Group and NOAA) • Java, MATLAB, and Python • Fermi brings new instruction-level support for C++ • Previously unsupported on GPUs • This will make GPU computing more widely available than ever.

  18. Supported software platforms • NVIDIA’s own CUDA development environment • The OpenCL standard managed by the Khronos Group • Microsoft’s DirectCompute API.

  19. The Streaming Multiprocessor • Each SM comprises: • 32 cores, each of which can perform floating-point and integer operations • 16 load-store units for memory operations • Four special-function units • 64K of local SRAM split between L1 cache and shared memory.
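
These per-SM resources can be inspected at run time through the CUDA runtime API; a small sketch (device index 0 assumed, compiled with nvcc):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                          // query device 0
        printf("SMs:                   %d\n", prop.multiProcessorCount);
        printf("Warp size:             %d threads\n", prop.warpSize);
        printf("Shared mem per block:  %zu bytes\n", prop.sharedMemPerBlock);
        return 0;
    }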

  20. The Streaming Multiprocessor: Core • Floating-point operations follow the IEEE 754-2008 floating-point standard. • Each core can perform • One single-precision fused multiply-add (FMA) operation in each clock period • One double-precision FMA in two clock periods • No rounding of the intermediate result • Fermi performs more than 8× as many double-precision operations per clock as previous GPU generations

  21. The Streaming Multiprocessor: Core • FMA support increases the accuracy and performance of other mathematical operations • Division and square root • Extended-precision arithmetic • Interval arithmetic • Linear algebra • The integer ALU supports the usual mathematical and logical operations • Including multiplication, on both 32-bit and 64-bit values.
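
One well-known example of FMA-based extended precision (a general technique, not specific to the white paper): because the FMA rounds only once, it can recover exactly the error term of a product.

    // Splits a * b into a rounded product and the exact rounding error.
    // fma(a, b, -p) computes a*b - p with a single rounding, so err is exact.
    __device__ void twoProduct(double a, double b, double *prod, double *err)
    {
        double p = a * b;
        *prod = p;
        *err  = fma(a, b, -p);
    }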

  22. The Streaming Multiprocessor: Memory operations • Handled by a set of 16 load-store units in each SM • Load/store instructions can refer to memory in terms of two-dimensional arrays • Providing addresses in terms of x and y values • Data can be converted from one format to another as it passes between DRAM and the core registers, at full rate • Examples of optimizations unique to GPUs
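
This is not the hardware mechanism itself, but the closest runtime-level analogue of two-dimensional addressing is pitched allocation; a brief sketch (width, height, and d_img are illustrative):

    // Allocate a 2-D array of floats with a hardware-friendly row pitch,
    // then address an element by its x and y coordinates.
    float *d_img;
    size_t pitch;
    cudaMallocPitch((void **)&d_img, &pitch, width * sizeof(float), height);
    // Element (x, y): its row starts y * pitch bytes into the allocation.
    // float *row = (float *)((char *)d_img + y * pitch);  float v = row[x];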

  23. The Streaming Multiprocessor: Special Function Units • The four SFUs handle special operations such as sin, cos, and exp • Four of these operations can be issued per cycle in each SM.
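
In CUDA C these units are typically reached through the fast-math intrinsics; a minimal device-code sketch (function and buffer names illustrative):

    // __sinf and __expf are hardware-approximated transcendentals handled by the SFUs.
    __global__ void sfuDemo(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __expf(__sinf(in[i]));
    }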

  24. The Streaming Multiprocessor: execution blocks • Each Fermi SM has four execution blocks • The 32 cores are divided into two execution blocks of 16 cores each • One block holds the 16 load-store units • One block holds the four SFUs • In each cycle, 32 instructions can be dispatched from one or two warps to these blocks • It takes two cycles to execute the 32 instructions on the cores or load/store units • 32 special-function instructions can be issued in a single cycle • They take eight cycles to complete on the four SFUs (32 / 4 = 8)

  25. This figure shows how instructions are issued to the execution blocks.

  26. The Cache and Memory Hierarchy: L1 • The Fermi architecture provides local memory in each SM that can be split between • Shared memory • A first-level (L1) cache for global memory references • The local memory is 64K in size • Split 16K/48K or 48K/16K between L1 cache and shared memory, depending on • How much shared memory is needed • How predictable the kernel’s accesses to global memory are likely to be
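
The split is selected per kernel through the CUDA runtime; a short sketch (myKernel is a hypothetical kernel name):

    // Ask for 48K of shared memory and 16K of L1 cache for this kernel...
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    // ...or the opposite split when the kernel needs little shared memory
    // but benefits from caching unpredictable global-memory accesses.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);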

  27. The Cache and Memory Hierarchy: L2 • Fermi comes with an L2 cache • 768KB in size for a 512-core chip • Covers GPU local DRAM as well as system memory • The L2 cache subsystem implements a set of memory read-modify-write atomic operations • For managing access to data shared across thread blocks or kernels • Atomic operations are 5× to 20× faster than on previous GPUs using conventional synchronization methods.
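
A small sketch of such an atomic read-modify-write used to share one result across all thread blocks (names illustrative, not from the white paper):

    // Every thread that finds a positive element atomically increments a single
    // global counter; the memory system performs the read-modify-write atomically.
    __global__ void countPositives(const float *x, int n, unsigned int *count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && x[i] > 0.0f)
            atomicAdd(count, 1u);
    }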

  28. The Cache and Memory Hierarchy: DRAM • The final stage of the local memory hierarchy • Fermi provides six 64-bit DRAM channels that support SDDR3 and GDDR5 DRAMs • Up to 6GB of GDDR5 DRAM can be connected to the chip.

  29. Error Correcting Code (ECC) • Fermi is the first GPU to provide ECC protection for • DRAM, register files, shared memories, and the L1 and L2 caches • The level of protection is known as SECDED: • Single (bit) error correction, double error detection • Instead of each 64-bit memory channel carrying eight extra bits for ECC information • NVIDIA uses an undisclosed solution for packing the ECC bits into reserved lines of memory.

  30. Conclusion • CPUs are best for dynamic workloads with short sequences of computational operations and unpredictable control flow. • Workloads dominated by computational work performed within a simpler control flow are better suited to GPUs.
