The Fundamentals of GPU Technology and CUDA Programming



  1. The Fundamentals of GPU Technology and CUDA Programming Nicholas Lykins Kentucky State University May 7, 2012

  2. Outline • Introduction • Why pursue GPU accelerated computing? • Performance figures • Historical background • Graphics rendering pipeline • History of GPU technology • NVIDIA and GPU implementations • Alternative GPU processing frameworks • CUDA • Background and available libraries • Terminology • Architectural design • Syntax • Hands-on CUDA sample demonstration • Line by line illustration of code execution • Animated execution pipeline for sample application • Conclusion and future outlook

  3. Thesis Guidelines • Initial goal: Demonstrate the potential for GPU technology to meet the expanding data processing needs of the scientific community. • Objectives • Deliver an account of the history of GPU technology • Provide an overview of NVIDIA’s CUDA framework • Demonstrate the motivation for scientists to pursue GPU acceleration and apply it to their own scientific disciplines

  4. High-Performance Computing • Multi-Core Processing • GPU Acceleration • …how are they different? • Hardware differences: CPU vs. GPU

  5. Hardware Review • CPU (SISD: Single-Instruction, Single-Data) • Control unit, arithmetic and logic unit, internal registers, internal data bus • Speed limitations • One instruction stream operating on one data element at a time • GPU (SIMD: Single-Instruction, Multiple-Data) • Many processing cores and onboard memory • Parallel execution across all cores • One instruction stream operating on many data elements at once

  6. Performance Trends • GPU processing time is measurably faster than comparable CPU processing time when working with large-scale input data.

  7. Performance Trends, Continued

  8. GPU Technology – Pipeline Overview • Graphics rendering pipeline • The entire process through which an image is generated by a graphics processing device • Vertex calculations • Color generation • Shadows and lighting • Shaders • Specialized programs executed on the graphics hardware to produce a particular aspect of the resulting image

  9. Traditional Pipelining Process • Traditional pipelining process • System collects data to be graphically represented • Modeling transformations within the world space • Vertices are “shaded” according to various properties • Lighting, materials, textures • Viewing transformation is performed – reorienting the graphical object with respect to the human eye • Clipping is performed, eliminating constructed content outside the frustum

  10. Traditional Pipelining, Continued • The three-dimensional scene is then rendered onto a two-dimensional viewing plane, or screen space • Rasterization takes place, in which the continuous geometric representation of objects is translated into a set of discrete fragments for a particular display • Color, transparency, and depth • Results are stored within the frame buffer, where Z-buffering and alpha blending determine each pixel’s final appearance on the screen.

  11. Graphics Processing APIs (Application Programming Interfaces) • OpenGL • OpenGL 1.0 first developed by Silicon Graphics in 1992. • Early middle layer for interpreting between the operating system and the underlying graphics hardware. • An industry-wide standard for graphics development, with each vendor crafting its hardware architecture with those standards in mind. • Cross-platform compatibility • DirectX • Developed by Microsoft employees Craig Eisler, Alex St. John, and Eric Engstrom in 1995, to give programmers low-level hardware access within Windows’ restricted memory space. • Set of related APIs (Direct3D, DirectDraw, DirectSound) that enable multimedia development. • Each vendor provides a device driver that makes its own hardware compatible across all Windows systems. • Restricted to Windows only.

  12. GeForce 256 • Announced in August of 1999, it was marketed as the world’s first GPU. • Integrated all graphics processing stages onto a single chip. • Implemented a fixed-function rendering pipeline

  13. OpenGL 1.x Fixed Pipeline

  14. Programmable Pipeline • OpenGL 2.0 • Programmable shaders • Programmers could write their own instructions for accessing hardware functionality • Programmability enabled by proprietary “shading languages” • ARB assembly language (named for the OpenGL Architecture Review Board) • Low-level assembly-based language for directly interfacing with hardware elements • Unintuitive and difficult to use effectively • GLSL (OpenGL Shading Language) • High-level language derived from C • The compiler translates the high-level code into corresponding low-level instructions, comparable to the ARB language • Cg • High-level shader language designed by NVIDIA • Compiles into assembly-based, GLSL, or HLSL code for consumption by OpenGL and Direct3D

  15. Programmable Pipeline, Continued

  16. G80 Architecture • Released in November of 2006, first implemented within the GeForce 8800. • First architecture to implement the CUDA framework, and the first instance of a unified graphics rendering pipeline • Vertex and fragment shaders integrated into one hardware component • Programmability exposed over individual processing elements on the device • Scalability based on the targeted consumer market • Proportions of processing cores, memory, etc.

  17. G80 Architecture, Continued

  18. G80 Architecture, Continued • GeForce 8800 GTX • Each “tile” represents a separate multiprocessor • Eight streaming cores per multiprocessor, 16 multiprocessors per card • Shared L1 cache per pair of tiles • Texture handling units attached to each tile • Recirculating method for handling graphics rendering • Output data from one core becomes input data for another • Six discrete memory partitions, each 64-bit, totaling a 384-bit interface. • Bus width and memory size vary based on the specific G80 device.

  19. Discrete vs. Unified Architecture

  20. Fermi Architecture • Second-generation CUDA GPU architecture, released in March of 2010. • The most recent NVIDIA architecture until Kepler was released in March of 2012. • Rebranded the streaming processor cores as CUDA cores. • Overall superior design in terms of performance and computational precision

  21. Fermi Architecture, Continued • Core count increased from 240 (on the intermediate GT200) to 512. • 32 cores per multiprocessor, totaling 16 streaming multiprocessors • Memory interface similar to the G80’s: six 64-bit memory partitions totaling a 384-bit interface. • 64 KB of on-chip memory per streaming multiprocessor, split between shared memory and L1 cache

  22. Fermi Architecture, Continued • Unified memory address space spanning the per-thread local, per-block shared, and global spaces. • Enables C++-compatible reads and writes through ordinary pointer handling. • Configurable shared memory: 48 KB shared with 16 KB of L1 cache, or 48 KB of L1 cache with 16 KB of shared memory • L2 cache common across all streaming multiprocessors
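
As a minimal sketch of how a program might select between the two configurations (the kernel name myKernel is hypothetical), the CUDA runtime exposes a per-kernel cache preference:

    #include <cuda_runtime.h>

    __global__ void myKernel(float *data) { /* ... device code ... */ }

    int main(void) {
        // Prefer 48 KB shared memory / 16 KB L1 for this kernel on Fermi;
        // cudaFuncCachePreferL1 would select the opposite 16/48 split.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
        return 0;
    }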

  23. Fermi Architecture, Continued • Added CUDA compatibility with the implementation of PTX (Parallel Thread Execution) 2.0 • A low-level equivalent of assembly language • A low-level virtual machine and instruction set, responsible for translating compiled CUDA code into hardware instructions interpretable by the GPU’s onboard hardware. • The CUDA compiler translates high-level CUDA code into PTX. • The driver maps the PTX into corresponding low-level machine code. • Those hardware instructions are then executed by the GPU itself.

  24. AMD-ATI • Rival GPU manufacturer – develops its own proprietary line of graphics cards • Significant architectural differences from NVIDIA products • Evergreen chipset – ATI Radeon HD 5870 – Comparison • NVIDIA’s GTX 480 – 480 active cores (of 512 on the GF100 chip), 3 billion transistors • Radeon HD 5870 – 20 parallel engines × 16 cores × 5 processing elements, totaling 1,600 work units, 2.15 billion transistors

  25. Parallel Computing Frameworks • OpenCL • Parallel computing framework similar to CUDA • Initially introduced by Apple; development of its standard is now handled by the Khronos Group • Emphasis on portability and cross-platform implementations • Flagship parallel computing API of AMD • Runs across CPUs, GPUs, and other processors on many platforms • Adopted by Intel, AMD, NVIDIA, and ARM Holdings • CTM (Close to Metal) • Released in 2006 by AMD as a low-level API providing hardware access, similar in spirit to NVIDIA’s PTX instruction set. • Discontinued in 2008 and replaced by OpenCL as the principal framework

  26. CUDA • Programming framework by NVIDIA for performing GPGPU (General-Purpose GPU) computing • Potential for applying parallel processing capabilities of GPU hardware to traditional software applications • NVIDIA Libraries • Ready-made libraries for implementing complex computational functions • cuFFT (NVIDIA CUDA Fast Fourier Transform), cuBLAS (NVIDIA CUDA Basic Linear Algebra Subroutines), and cuSPARSE (NVIDIA CUDA Sparse)
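
As a hedged illustration of calling one of these libraries (assuming the cuBLAS v2 interface; the sizes and values are purely illustrative), a single SAXPY call computes y = alpha*x + y on the device:

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void) {
        const int n = 4;
        float hx[] = {1, 2, 3, 4}, hy[] = {0, 0, 0, 0}, alpha = 2.0f;
        float *dx, *dy;

        // Stage the operands in GPU memory.
        cudaMalloc((void**)&dx, n * sizeof(float));
        cudaMalloc((void**)&dy, n * sizeof(float));
        cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

        // y = alpha * x + y, executed by the library on the device.
        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);
        cublasDestroy(handle);

        cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dx);
        cudaFree(dy);
        return 0;
    }

Beyond the standard toolkit, only linking against the library is needed (e.g., nvcc saxpy.cu -lcublas).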

  27. Terminology – What is CUDA? • Hardware or software? …or both. • A development framework that mediates between hardware elements on the GPU and the algorithms responsible for accessing and manipulating those elements • Expands on its original definition as a C-compatible compiler with special extensions for recognizing CUDA code

  28. Scalability Model • Resource allocation is dynamically handled by the framework. • Scalable to different hardware devices without the need to recode an application.

  29. Encapsulation and Abstraction • CUDA is designed as a high-level API, hiding low-level hardware details from the user. • Three major abstractions sit between the architecture and the programmer: a hierarchy of thread groups, shared memories, and barrier synchronization. • Computational features implemented as functions, with input data passed as parameters. • High-level functionality makes for a gentle learning curve. • Allows applications to run on any GPU card with a compatible architecture. • Backward compatible with older versions.

  30. Threading Framework • Resource allocation handled through threading • A thread represents a single work unit or operation • The lowest level of resource allocation in CUDA • Hierarchical structure • Threads, blocks, and grids, from lowest to highest • Analogous to multiple layers of nested execution loops
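
A small sketch of how this hierarchy is expressed in code (the kernel name is hypothetical; the launch syntax itself is covered on the Kernels slide): CUDA’s dim3 type sizes blocks and grids in up to three dimensions.

    #include <cuda_runtime.h>

    __global__ void myKernel(float *data) { /* ... */ }  // hypothetical kernel

    int main(void) {
        dim3 blocksPerGrid(2, 2);      // grid: 2 x 2 = 4 blocks
        dim3 threadsPerBlock(16, 16);  // block: 16 x 16 = 256 threads
        float *d = 0;
        // 4 blocks x 256 threads = 1,024 threads, arranged hierarchically.
        myKernel<<<blocksPerGrid, threadsPerBlock>>>(d);
        cudaDeviceSynchronize();
        return 0;
    }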

  31. Threading Framework, Continued • Visual representation of the thread hierarchy • Multiple threads embedded in blocks, multiple blocks embedded in grids • Intuitive scheme for understanding the allocation mechanism

  32. Threading Framework, Continued • Threading syntax • Recognized by the framework for handling thread usage within an application. • Each variable provides for tracking and monitoring of individual thread activity. • Resource assignment for an application is not itself covered by these syntax elements.

  33. Threading Framework, Continued • Keywords • threadIdx.x/y/z – Index of the current thread within its block, three-dimensional. • blockIdx.x/y/z – Index of the current block within the grid, three-dimensional. • blockDim.x/y/z – Total number of threads allocated along a single dimension of a block, three-dimensional. • gridDim.x/y/z – Block count per dimension of the grid, three-dimensional • tid – Conventional name for a computed thread ID, not a built-in keyword; a unique value for each allocated thread

  34. Threading Framework, Continued • Flexibility for managing threads within an application. • Example: int tid = threadIdx.x + blockIdx.x * blockDim.x • Current block number, multiplied by the number of threads per block, added to the thread’s index within its block. • Simultaneous evaluation for all thread IDs. • The equation is mapped across all threads in parallel, as opposed to one thread at a time.

  35. Sample Thread Allocation • blockDim.x = 4 • blockIdx.x = {0, 1, 2, 3…} • threadIdx.x = {0, 1, 2, 3}…{0, 1, 2, 3}… • idx/tid = blockDim.x * blockIdx.x + threadIdx.x • Problem size of ten operations, so two threads go to waste.
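
A minimal kernel sketch of this exact allocation (the kernel name and scaling operation are hypothetical): three blocks of four threads cover the ten elements, and the bounds check idles the two surplus threads.

    __global__ void scale(float *data, int n) {
        // Global thread ID: blockDim.x * blockIdx.x + threadIdx.x
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        // For n = 10, threads 10 and 11 fail this guard and simply idle.
        if (tid < n)
            data[tid] *= 2.0f;
    }

    // Host-side launch: 3 blocks, 4 threads per block, for n = 10.
    // scale<<<3, 4>>>(d_data, 10);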

  36. Thread Incrementation • The current scheme handles thread execution, but not subsequent incrementation of thread IDs. • There is a right and a wrong way to increment thread IDs; done wrongly, a thread overflows into other threads’ allocated IDs. • Increment based on grid dimensions, not on block and thread counts • Example: • tid += blockDim.x * gridDim.x • Each thread ID is incremented by the product of threads per block and blocks per grid.
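
The same hypothetical kernel, re-sketched with the recommended increment (commonly called a grid-stride loop), lets a fixed-size grid cover a problem of any size:

    __global__ void scaleLarge(float *data, int n) {
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        while (tid < n) {
            data[tid] *= 2.0f;
            // Stride by the total thread count: threads per block times
            // blocks per grid. No two threads ever land on the same element.
            tid += blockDim.x * gridDim.x;
        }
    }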

  37. Compute Capability • Indicates structural limitations of hardware architectures • Determines various technical thresholds such as block and thread ceilings, etc. • Revision 1.x – Pre-Fermi architectures • Revision 2.x – Fermi architecture
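
A small sketch of querying these thresholds at run time through the CUDA runtime (device 0 is assumed):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // query device 0
        // major.minor encodes the compute capability:
        // 1.x for pre-Fermi parts, 2.x for Fermi.
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        return 0;
    }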

  38. Serial vs. Parallel Distinction • Host memory vs. device memory • Each platform has a separate memory space • Host can read and write to host only, device can read and write to device only • Synchronization needed between CPU and GPU activity • GPU only handles computationally intensive calculations – CPU still executes serial code
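
A minimal sketch of this division of labor (the kernel and sizes are hypothetical): a kernel launch is asynchronous, so the host synchronizes explicitly before depending on device results.

    #include <cuda_runtime.h>

    __global__ void work(float *data, int n) {   // hypothetical kernel
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    int main(void) {
        const int n = 256;
        float *d_data;
        cudaMalloc((void**)&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        // Control returns to the CPU immediately after the launch,
        // so independent serial code can overlap with GPU execution.
        work<<<2, 128>>>(d_data, n);

        cudaDeviceSynchronize();  // block the host until the GPU finishes
        cudaFree(d_data);
        return 0;
    }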

  39. Serial vs. Parallel Execution Model • Application pipeline • Represents CPU and GPU activity • Illustrates the behavior of the application and the invocation of GPU computations

  40. Memory Architecture – Conceptual Overview • Three address spaces • Local memory • Unique to each thread • Shared memory • Shared among threads within a particular block • Global memory • Accessible by threads and blocks across a given grid
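
A sketch of how the three address spaces appear inside a kernel (names hypothetical; assumes a single block of 256 threads):

    __global__ void spaces(const float *global_in, float *global_out) {
        // Local: automatic variables are private to each thread.
        float mine = global_in[threadIdx.x];

        // Shared: one copy per block, visible to all of its threads.
        __shared__ float buf[256];          // assumes blockDim.x == 256
        buf[threadIdx.x] = mine;
        __syncthreads();                    // barrier, covered on a later slide

        // Global: visible to every thread in the grid, and to the host.
        global_out[threadIdx.x] = buf[255 - threadIdx.x];
    }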

  41. Memory Architecture – Hardware Level • More accurate representation of hardware-level interaction between address spaces • Two new spaces: constant memory and texture memory • Constant memory is read-only and globally accessible. • Texture memory is a subset of global memory, useful in graphics rendering • Optimized for two-dimensional access patterns • Surface memory • Similar functionality to texture memory, but with different technical elements

  42. Memory Allocation • Three basic steps of the allocation process • 1. Declare host and device memory allocations • 2. Copy input data from host memory to device memory • 3. Transfer processed data back to host upon completion • Bare memory requirements for successfully executing a GPU application • More sophisticated memory functions exist, but are geared towards more complex functionality and better performance

  43. Memory Handling Syntax • CUDA-specific keywords for dynamically allocating memory • cudaMalloc – Allocates a dynamic reference to a location in GPU memory. Analogous in function to malloc in C. • cudaMemcpy – Transfers data from CPU memory to GPU memory, or in the reverse direction, according to a direction flag. • cudaFree – Deallocates a reference to a GPU memory location. Analogous to free in C. • Basic syntax needed for handling memory allocation • Additional features available for more sophisticated applications
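
A minimal sketch stringing the three steps and the three keywords together (the buffer name and ten-element size are illustrative):

    #include <cuda_runtime.h>

    int main(void) {
        const int n = 10;
        float h_data[n] = {0};  // host-side buffer
        float *d_data;          // reference to device memory

        // 1. Allocate the device-side memory.
        cudaMalloc((void**)&d_data, n * sizeof(float));

        // 2. Copy input data from host memory to device memory.
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

        // ... kernel launches would operate on d_data here ...

        // 3. Transfer processed data back to the host on completion.
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_data);       // release the device memory
        return 0;
    }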

  44. Kernels • Kernel – Executes processing instructions for data loaded onto the GPU • Executes an operation N times across N threads simultaneously • Structured similarly to a normal function, but with its own unique additions • Kernel syntax – the declaration and the launch are separate (see the sketch below) • Declaration: __global__ void example1(A, B, C) • Launch: example1<<<M, N>>>(A, B, C) • __global__ – Declaration specifier identifying a function as a GPU kernel. • void example1 – Return type and kernel name • <<<M, N>>> – Execution configuration: M is the number of blocks to set aside for executing the kernel, and N the number of threads per block. • (A, B, C) – Argument list to be passed to the kernel
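
A sketch separating the two halves of that syntax (A, B, C, M, and N are the slide’s placeholders; the device pointers are assumed to be allocated already):

    // Declaration: the execution configuration is NOT part of the signature.
    __global__ void example1(const float *A, const float *B, float *C) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        C[i] = A[i] + B[i];   // one element-wise addition per thread
    }

    // Invocation from host code: M blocks of N threads each, i.e.
    // M * N copies of the kernel body run in parallel.
    // (Assumes M * N exactly equals the element count.)
    // example1<<<M, N>>>(d_A, d_B, d_C);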

  45. Warps • During kernel execution, threads are organized into warps. • A warp is a grouping of 32 threads, all executed in parallel with one another. • Threads in a warp start at the same program address, but each is mapped onto its own instruction counter and register state. • Allows parallel execution, but independent pacing of each thread in terms of completion. • Handling of the threads in a warp is managed by a warp scheduler. • Two warp schedulers available per streaming multiprocessor on Fermi • Warp execution is optimized when there is no data dependence between threads. • Otherwise, dependent threads remain disabled until the required data is received from completed operations

  46. Thread Synchronization • Separation of threads between warps can cause data to get “tangled”. • Completed data does not coalesce back into memory as it should, due to out-of-order warp execution. • The problem is avoided by using __syncthreads() • A block-wide barrier: each arriving thread halts until every thread in the block has reached the same point. • Threads that finish early sit idle briefly, but computations that depend on other threads’ partial results stay correct
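
A minimal sketch of the barrier in use (kernel name hypothetical; assumes 256 threads per block): no thread may read a neighbor’s shared-memory element until the whole block has finished writing.

    __global__ void neighborSum(const float *in, float *out, int n) {
        __shared__ float tile[256];          // assumes blockDim.x == 256
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = in[i];

        // Block-wide barrier: no thread continues until every thread
        // in the block has executed the staging store above.
        __syncthreads();

        // Only now is it safe to read another thread's element.
        if (i < n && threadIdx.x > 0)
            out[i] = tile[threadIdx.x] + tile[threadIdx.x - 1];
    }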

  47. Sample Execution • Animated visualization indicating the relation between CPU and GPU elements • Sample code obtained from: Sanders, Jason and Kandrot, Edward. CUDA by Example: An Introduction to General-Purpose GPU Programming. Boston: Pearson Education, Inc., 2011. • Highlights the activities needed to facilitate completion of a GPU-based data processing application. • Code Animation Link

  48. Conclusion • Major topics covered: • Performance benefits of GPU accelerated applications. • Historical account of GPU technology and graphics processing. • Hands-on demonstration of CUDA, including syntax, architecture, and implementation.

  49. Future Outlook • Promising future, with positive projected market demand for GPU technology • Growing market share for NVIDIA products • Gaming applications, scientific computing, and video editing and engineering purposes • Release of the Kepler architecture – March 2012 • Indicates a further increase in performance and more efficient resource consumption • Currently little documentation released in terms of technical specifications • The role of GPU technology is sure to keep growing across the professional market as its capabilities continue to rise.

