
B. Wilkinson, April 15, 2014, GPUMemories



  1. GPU Memories These notes will introduce: • The basic memory hierarchy in the NVIDIA GPU • global memory, shared memory, register file, constant memory • How to declare variables for each memory • Memory coalescing • Cache memory and how to make the most effective use of it in a program B. Wilkinson, April 15, 2014, GPUMemories.ppt

  2. Host-Device Connection Host (CPU) to device (GPU) bandwidths are limited by the memory and processor-memory connection bandwidths: HyperTransport and Intel's QuickPath, currently 25.6 GB/s; GPU bus: C2050 1030.4 GB/s, GTX 280 141.7 GB/s; PCIe x16: 4 GB/s; PCIe x16 Gen 2: 8 GB/s peak; device global memory (GDDR5): 230 GB/s; host memory (DDR-400): 3.2 GB/s. Note that transferring between the host and the GPU is much slower than transferring between the device and global memory, hence the need to minimize host-device transfers. A GPU in a laptop such as a MacBook Pro may share the system memory.
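As an aside on minimizing host-device transfers, below is a minimal sketch (not from the lecture code; the names h_buf, dev_buf, CHUNKS and CHUNK_SIZE are invented) contrasting many small cudaMemcpy calls with one large batched copy, which amortizes the PCIe transfer overhead:

    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define CHUNKS      64
    #define CHUNK_SIZE  1024                                   // floats per chunk

    int main() {
       size_t bytes = CHUNKS * CHUNK_SIZE * sizeof(float);
       float *h_buf = (float *)malloc(bytes);
       float *dev_buf;
       cudaMalloc((void **)&dev_buf, bytes);

       // Slow pattern: one PCIe transaction per chunk.
       // for (int i = 0; i < CHUNKS; i++)
       //    cudaMemcpy(dev_buf + i*CHUNK_SIZE, h_buf + i*CHUNK_SIZE,
       //               CHUNK_SIZE * sizeof(float), cudaMemcpyHostToDevice);

       // Better: one large transfer over PCIe.
       cudaMemcpy(dev_buf, h_buf, bytes, cudaMemcpyHostToDevice);

       cudaFree(dev_buf);
       free(h_buf);
       return 0;
    }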

  3. GPU Memory Hierarchy Global memory is off-chip, on the GPU card. Even though global memory is an order of magnitude faster than CPU memory, it is still relatively slow and a bottleneck for performance. The GPU is therefore provided with faster on-chip memory, although data has to be transferred explicitly from global memory into shared memory. Two principal levels on-chip: shared memory and registers.

  4. Scope of global memory, shared memory, and registers (Diagram: the host and host memory connect to the device's global memory and constant memory; within the grid, each block has its own shared memory, and each thread has its own registers and local memory.) Constant memory is for storing global constants, see later. There is also a read-only global memory called texture memory.

  5. Currently data can only be transferred from the host to global memory (and constant memory), not from the host directly to shared memory. Constant memory is used for data that does not change (i.e. it is read-only by the GPU). Shared memory is said to provide up to 15x the speed of global memory, and similar speed to registers if all threads read the same address or there are no bank conflicts.

  6. Declaring program variables for registers, shared memory and global memory

    Memory      Declaration                          Scope     Lifetime
    Registers   Automatic variables* (not arrays)    Thread    Kernel
    Local       Automatic array variables            Thread    Kernel
    Shared      __shared__                           Block     Kernel
    Global      __device__                           Grid      Application
    Constant    __constant__                         Grid      Application

*Automatic variables are allocated automatically when entering the scope of the variable and de-allocated when leaving the scope. In C, all variables declared within a block are "automatic" by default, see http://en.wikipedia.org/wiki/Automatic_variable
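As a quick illustration of the table above, a minimal sketch (variable names invented) with one declaration of each kind:

    #include <cuda_runtime.h>

    __constant__ int coeff[16];        // constant memory: grid scope, application lifetime
    __device__   int globalCounter;    // global memory: grid scope, application lifetime

    __global__ void kernel() {
       int i = threadIdx.x;            // automatic scalar: register, per thread
       int localBuf[32];               // automatic array: local memory, per thread
       __shared__ int tile[256];       // shared memory: one copy per block
       localBuf[0] = i;
       tile[i] = localBuf[0] + globalCounter + coeff[0];
    }

    int main() {
       kernel<<<1, 256>>>();
       cudaDeviceSynchronize();
       return 0;
    }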

  7. Global Memory: __device__ For data available to all threads in the device. Declared outside function bodies. Scope of grid and lifetime of the application.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda.h>
    #define N 1000
    …
    __device__ int A[N];

    __global__ void kernel() {
       int tid = blockIdx.x * blockDim.x + threadIdx.x;
       A[tid] = …
       …
    }

    int main() {
       …
    }

  8. Issues with using Global memory • Long delays, slow • Access congestion • Cannot synchronize accesses • Need to ensure no conflicts of accesses between threads

  9. Shared Memory Shared memory is on the GPU chip and very fast. A separate copy of the data is available to all the threads in one block. Declared inside function bodies. Scope of block and lifetime of the kernel call, so each block would have its own array A[N].

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda.h>
    #define N 1000
    …
    __global__ void kernel() {
       __shared__ int A[N];
       int tid = threadIdx.x;
       A[tid] = …
       …
    }

    int main() {
       …
    }

  10. Transferring data to shared memory

    __global__ void myKernel (int *A_global) {
       __shared__ int A_sh[N][N];                // declare shared memory (N a compile-time constant)
       int row = …
       int col = …
       A_sh[row][col] = A_global[row + col*N];   // copy from global to shared
       …
    }

    int main () {
       …
       cudaMalloc((void**)&dev_A, size);                    // allocate global memory
       cudaMemcpy(dev_A, A, size, cudaMemcpyHostToDevice);  // copy to global memory
       myKernel<<<G,B>>>(dev_A);
       …
    }

  11. Issues with Shared Memory Shared memory is not immediately synchronized after access. Usually it is the writes that matter: use __syncthreads() before you read data that has been altered. Shared memory is very limited (Fermi has up to 48 KB per SM (streaming multiprocessor), NOT per block). Hence you may have to divide your data into “chunks”.
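For example, a minimal sketch (not from the lecture code; reverseBlock and dev_data are invented names, and it assumes one block of 256 threads) where __syncthreads() is needed because each thread reads an element written by a different thread:

    __global__ void reverseBlock(int *dev_data) {
       __shared__ int s[256];
       int t = threadIdx.x;
       s[t] = dev_data[t];         // every thread writes one shared element
       __syncthreads();            // wait until all writes are visible to the whole block
       dev_data[t] = s[255 - t];   // now safe to read an element written by another thread
    }

    // launched as, say, reverseBlock<<<1, 256>>>(dev_data);

Without the barrier, a thread could read s[255 - t] before the owning thread has written it.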

  12. Example uses of shared data Where the data can be divided into independent parts: • Image processing – the image can be divided into blocks and placed into shared memory for processing • Block matrix multiplication – sub-matrices can be stored in shared memory (see the sketch below)
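A minimal sketch of the tiled (block) matrix multiplication idea (not the textbook's code; it assumes square N x N matrices with N a multiple of TILE, and TILE x TILE thread blocks):

    #define TILE 16

    __global__ void matmulTiled(const float *A, const float *B, float *C, int N) {
       __shared__ float As[TILE][TILE];   // one tile of A per block
       __shared__ float Bs[TILE][TILE];   // one tile of B per block

       int row = blockIdx.y * TILE + threadIdx.y;
       int col = blockIdx.x * TILE + threadIdx.x;
       float sum = 0.0f;

       for (int m = 0; m < N / TILE; m++) {
          // each thread loads one element of the A tile and one of the B tile
          As[threadIdx.y][threadIdx.x] = A[row * N + m * TILE + threadIdx.x];
          Bs[threadIdx.y][threadIdx.x] = B[(m * TILE + threadIdx.y) * N + col];
          __syncthreads();                // tiles fully loaded before they are used

          for (int k = 0; k < TILE; k++)
             sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
          __syncthreads();                // finished with the tiles before they are overwritten
       }
       C[row * N + col] = sum;
    }

    // launched as, say:  dim3 Block(TILE, TILE), Grid(N/TILE, N/TILE);
    //                    matmulTiled<<<Grid, Block>>>(dev_a, dev_b, dev_c, N);

Each element of A and B is then read from global memory only N/TILE times instead of N times.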

  13. Registers The compiler will place variables declared in a kernel in registers when possible. There is a limit to the number of registers: Fermi has 32768 32-bit registers per SM. Registers are divided across “warps” (groups of 32 threads that operate in the SIMT mode) and have the lifetime of the warps.

    __global__ void kernel() {
       int x, y, z;
       …
    }

  14. Arrays declared within a kernel (automatic array variables) Generally stored in global memory, but a private copy is made for each thread. Access can be as slow as global memory, except that it is cached, see later. If the array is indexed only with constant values, the compiler may use registers instead (see the sketch below).

    __global__ void kernel() {
       int A[10];
       …
    }
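One common case where the indices become compile-time constants is a fully unrolled loop. A hedged sketch (the names acc and out are invented):

    __global__ void kernel(float *out) {
       float acc[4];                    // small per-thread array
       #pragma unroll
       for (int k = 0; k < 4; k++)      // fully unrolled, so acc is indexed only with constants
          acc[k] = k * threadIdx.x;     // and may be promoted to registers by the compiler
       out[threadIdx.x] = acc[0] + acc[1] + acc[2] + acc[3];
    }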

  15. Constant Memory: __constant__ For data not altered by the device. Although stored in global memory, it is cached and has fast access. Declared outside function bodies. Scope of grid and lifetime of the application. Size currently limited to 65536 bytes.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda.h>
    …
    __constant__ int n;

    __global__ void kernel() {
       …
    }

    int main() {
       int h_n = …;                               // value to place in constant memory
       cudaMemcpyToSymbol(n, &h_n, sizeof(int));  // host code cannot assign to n directly
       …
    }

  16. Local memory Resides in the device memory space (global memory) and is slow, except that it is organized so that consecutive 32-bit words are accessed by consecutive thread IDs, for best performance when possible. For compute capability 2.x, it is cached in the on-chip L1 and L2 caches. Used to hold arrays if they are not indexed with constant values, and for variables when there are no more registers available for them.

  17. Cache memory More recent GPUs have L1 and L2 (data) cache memory, but apparently without cache coherence, so it is up to the programmer to ensure correctness: make sure each thread accesses different locations, and ideally arrange accesses to fall in the same cache lines. Compute capability 1.3 Teslas do not have cache memory; compute capability 2.0 Fermis and later have L1/L2 caches.

  18. Fermi Caches (Diagram: each streaming multiprocessor (SM) contains its streaming processors (SPs), a register file and an L1 cache/shared memory; a common L2 cache sits between the SMs and global memory.)

  19. Taking Advantage of Cache Poor performance from poor data layout:

    __global__ void kernel(int *A) {
       int i = threadIdx.x + blockDim.x*blockIdx.x;
       A[1000*i] = …
    }

Very bad! Each thread accesses a location on a different cache line. Fermi line size is 32 integers or floats.

  20. Taking Advantage of Cache

    __global__ void kernel(int *A) {
       int i = threadIdx.x + blockDim.x*blockIdx.x;
       A[i] = …
    }

Good! Groups of 32 accesses by consecutive threads fall on the same line, and those threads will be in the same warp. Fermi line size is 32 integers or floats.

  21. Warp A “warp” in CUDA is a group of 32 threads that will operate in the SIMT mode. A “half warp” (16 threads) actually executes simultaneously (on current GPUs). Using knowledge of warps and of how the memory is laid out can improve code performance.

  22. Memory Banks (Diagram: on the device (GPU), consecutive array elements A[0], A[1], A[2], A[3] are placed in successive memory banks Memory 1, Memory 2, Memory 3, Memory 4.) Consecutive locations are on successive memory banks, so the device can fetch A[0], A[1], A[2], A[3] … A[B-1] at the same time, where there are B banks.

  23. Shared Memory Banks Shared memory is divided into 16 or 32 banks of 32-bit width, and the banks can be accessed simultaneously. Compute capability 1.x has 16 banks, with accesses processed per half warp. Compute capability 2.x and 3.0/3.5 have 32 banks, with accesses processed per warp. To achieve maximum bandwidth, threads in a half warp (or warp) should access different banks of shared memory. Exception: all threads read the same location, which results in a broadcast operation. (coit-grid06 and coit-grid07: C2050, compute capability 2.0, so 32 banks.) A sketch of avoiding bank conflicts follows below.
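A common way to avoid bank conflicts is to pad a shared 2-D array by one extra column, so that the elements of a column no longer all map to the same bank. A minimal sketch (names invented; assumes 32 banks and one warp of 32 threads):

    #define TILE 32

    __global__ void bankDemo(float *out) {
       // Without the +1 pad, s[t][0] for t = 0..31 would all fall in bank 0
       // (a 32-way conflict). With the pad, each row starts one bank further on,
       // so the 32 accesses hit 32 different banks.
       __shared__ float s[TILE][TILE + 1];
       int t = threadIdx.x;
       s[t][0] = (float)t;      // column access, conflict-free thanks to the padding
       __syncthreads();
       out[t] = s[t][0];
    }

This is why shared tiles are often declared as [TILE][TILE + 1].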

  24. Memory Coalescing Aligned memory accesses: threads can read 4, 8, or 16 bytes at a time from global memory, but only if the accesses are aligned. That is: a 4-byte read must start at address …xxxxx00, an 8-byte read must start at address …xxxx000, and a 16-byte read must start at address …xxx0000. Then access is much faster (twice?).
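One way to get aligned 16-byte accesses is to read and write through the built-in float4 vector type, which is 16-byte aligned (pointers returned by cudaMalloc are suitably aligned). A sketch, assuming the array length is a multiple of 4 (copy4, in and out are invented names):

    __global__ void copy4(const float4 *in, float4 *out) {
       int i = threadIdx.x + blockDim.x * blockIdx.x;
       out[i] = in[i];          // one aligned 16-byte load and one 16-byte store per thread
    }

    // launched on N floats as, say, copy4<<<(N/4)/256, 256>>>((float4 *)dev_in, (float4 *)dev_out);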

  25. Ideally, try to arrange for threads to access different memory modules at the same time, and consecutive addresses. A bad case would be: • Thread 0 accesses A[0], A[1], ... A[15] • Thread 1 accesses A[16], A[17], ... A[31] • Thread 2 accesses A[32], A[33], ... A[47] … etc. A good case would be: • Thread 0 accesses A[0], A[16], A[32], … • Thread 1 accesses A[1], A[17], A[33], … • Thread 2 accesses A[2], A[18], A[34], … … etc., if there are 16 banks. Need to know that detail!

  26. Memory coalescing and cache memories Compute capability 2.x onwards has data caches. Accesses to locations in the cache will still be in blocks of consecutive locations, so a data layout that allows memory coalescing will still be advantageous. Unfortunately, one needs to know the detailed physical arrangement of global memory and the caches to gain maximum benefit. Different compute capability devices have different constraints for memory coalescing; see the NVIDIA documentation for more information: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-3-0

  27. Effects of memory access in matrix multiplication

  28. Effects of memory access in matrix multiplication One thread is responsible for computing one result Cij and needs access to one row of A and one column of B. Each thread accesses one row of A and one column of B. There are N^2 row/column combinations, so N^2 threads.

  29. Seen another way, in the first time period each thread (Thread 0, …, Thread i, …, Thread N-1) accesses the first element in a row of A. Consider those threads that access different rows: given the row-major order in which A is stored, the locations those threads access are not consecutive. Bad! Cannot do memory coalescing. Question: how many threads access the same location?

  30. Next, each thread (Thread 0, …, Thread i, …, Thread N-1) accesses the first element in a column of B. Consider those threads that access different columns: given the row-major order in which B is stored, the locations those threads access are consecutive. Good! Can do memory coalescing. Question: how many threads access the same location?

  31. How can we get better memory accesses and memory coalescing? • Transpose one array • Copy all rows of A to columns and all columns of A to rows before accessing A, and modify the program accordingly. • (Not mentioned in the course textbook or the other NVIDIA book, although it appears an obvious way – see next about whether it works!)

  32. Sequential code for a transpose using the same array:

    for (i=0; i < N; i++)
       for (j=0; j < i; j++) {
          temp = B[i][j];
          B[i][j] = B[j][i];
          B[j][i] = temp;
       }

(In my code, I use separate arrays.) This could be done on the host prior to copying to the device. How would the code look if done on the device? A sketch follows below.
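A device version might look like the following sketch (transposeKernel is an invented name; it transposes into a separate array and is launched with a 2-D grid of 2-D blocks covering the n x n matrix):

    __global__ void transposeKernel(const int *B, int *B_T, int n) {
       int col = threadIdx.x + blockDim.x * blockIdx.x;
       int row = threadIdx.y + blockDim.y * blockIdx.y;
       if (row < n && col < n)
          B_T[col * n + row] = B[row * n + col];   // element (row,col) goes to (col,row)
    }

    // launched as, say:  dim3 Block(16,16), Grid((N+15)/16, (N+15)/16);
    //                    transposeKernel<<<Grid, Block>>>(dev_B, dev_B_T, N);

Here the reads of B are coalesced but the writes to B_T are strided; a more elaborate version stages a tile in shared memory so that both the reads and the writes are coalesced.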

  33. /* ------ COMPUTATION DONE ON GPU USING A TRANSPOSED ARRAY ------ */

    transposeArray(a, a_T, N);                             // transpose array
    cudaEventRecord(start, 0);                             // here time measured before
                                                           // host-device copy, but not transpose
    // cudaEventSynchronize(start);                        // Needed?
    cudaMemcpy(dev_a, a_T, size, cudaMemcpyHostToDevice);  // copy transposed A
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);    // copy B
    gpu_matrixmult_T<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);
    cudaMemcpy(c_T, dev_c, size, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);                              // measure end time
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms2, start, stop);
    printf("Time to calculate results on GPU with transposed array: %f ms.\n", elapsed_time_ms2);  // print out execution time

  34. Some results 8 x 8 array 1 block of 8 x 8 threads Speedup = 1.62 over not transposing array

  35. Some results 32 x 32 array 1 block of 32 x 32 threads Speedup = 1.17 over not transposing array

  36. Some results 256 x 256 array 8 blocks of 32 x 32 threads Speedup = 0.89!! over not transposing array

  37. Some results 1024 x 1024 array 32 blocks of 32 x 32 threads Speedup = 0.93!! over not transposing array

  38. Some notes on the NVIDIA Tesla K20 GPU card Released late 2012. Uses the GK110 chip: 7.1 billion transistors! A “big-die” GPU with the “Kepler” architecture. 64 KB shared memory/L1 cache, 48 KB uniform (?) cache, up to 1.5 MB L2 cache. The K20 card has 5 GB of global memory (320-bit GDDR5) and draws 225 watts. Sources: http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last/3

  39. Kepler compute architecture (e.g. K20) • Includes: • Dynamic Parallelism capability -- Enables GPU threads to automatically spawn new threads. By adapting to the data without going back to the CPU, it greatly simplifies parallel programming and enables greater GPU acceleration. • Hyper-Q feature that enables multiple CPU cores to simultaneously utilize the CUDA cores on a single Kepler GPU, dramatically increasing GPU utilization. Useful for cluster applications that use MPI. Source: http://www.nvidia.com/content/tesla/pdf/NV_DS_TeslaK_Family_May_2012_LR.pdf
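A minimal sketch of dynamic parallelism (invented names; requires compute capability 3.5 or higher and compiling with nvcc -arch=sm_35 -rdc=true ... -lcudadevrt):

    __global__ void childKernel(int *data) {
       data[threadIdx.x] += 1;
    }

    __global__ void parentKernel(int *data) {
       if (threadIdx.x == 0)
          childKernel<<<1, 32>>>(data);   // launched from the GPU, without returning to the CPU
    }

The child grid is guaranteed to complete before the parent grid is considered complete.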

  40. Unified Virtual Addressing CUDA Version 4 (2012): host and device memories share a single address space, so pointers can point to either host or device memory. The cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost and cudaMemcpyDeviceToDevice options are replaced by a single option, cudaMemcpyDefault, with the copy direction deduced from the pointer values. (Diagram: the CPU and GPUs see one address space, from 0x0000 to 0xFFFF, containing the host memory and each GPU's memory.) May not be necessary to explicitly copy data between memories in the program – check?
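A minimal sketch of copying with the direction deduced from the pointers (invented names; assumes a 64-bit system with UVA support):

    #include <cuda_runtime.h>

    int main() {
       int *h, *d;
       size_t bytes = 1024 * sizeof(int);
       cudaMallocHost((void **)&h, bytes);            // pinned host memory, in the unified address space
       cudaMalloc((void **)&d, bytes);

       cudaMemcpy(d, h, bytes, cudaMemcpyDefault);    // direction inferred: host -> device
       cudaMemcpy(h, d, bytes, cudaMemcpyDefault);    // direction inferred: device -> host

       cudaFree(d);
       cudaFreeHost(h);
       return 0;
    }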

  41. Questions
