Matrix Multiplication Performance Improvements
ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 6, 2011, MatrixMult.ppt


  1. Matrix Multiplication Performance Improvements These notes will introduce: • Basic matrix multiplication on a 2-D grid/block review • Limitations • Block matrix multiplication • Memory access patterns in matrix multiplication • Coalescing memory accesses • Transpose array • Using shared memory tiles ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 6, 2011 MatrixMult.ppt

  2. Matrix Multiplication Matrix multiplication is an important operation in HPC and appears in many applications. C = A * B, where A, B, and C are matrices (two-dimensional arrays). A restricted case is when B has only one column -- matrix-vector multiplication -- which appears in the representation of linear equations and partial differential equations.

  3. Matrix multiplication, C = A x B

  4. Implementing Matrix Multiplication Sequential Code Assume square matrices (N x N).

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    c[i][j] = 0;
    for (k = 0; k < N; k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
  }

Requires n^3 multiplications and n^3 additions. Sequential time complexity of O(n^3). Very easy to parallelize.

  5. CUDA Kernel for multiplying two arrays

__global__ void gpu_matrixmult(int *gpu_a, int *gpu_b, int *gpu_c, int N) {
  int k, sum = 0;
  int col = threadIdx.x + blockDim.x * blockIdx.x;
  int row = threadIdx.y + blockDim.y * blockIdx.y;
  if (col < N && row < N) {
    for (k = 0; k < N; k++)
      sum += gpu_a[row * N + k] * gpu_b[k * N + col];
    gpu_c[row * N + col] = sum;
  }
}

In this example, one thread computes one element of C, so the number of threads must be equal to or greater than the number of elements.
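As an aside (not part of the original program), the grid can be sized with ceiling division so that the thread count always covers every element, even when N is not a multiple of the block edge; the names Block_Dim and Grid_Dim below are hypothetical, and the if (col < N && row < N) guard in the kernel makes the surplus threads harmless. A minimal sketch:

  // Minimal launch sketch, assuming the kernel above; Block_Dim is a chosen
  // block edge (e.g. 16 or 32) and must not exceed the device limit.
  int Block_Dim = 16;                                // threads per block edge (assumed)
  int Grid_Dim  = (N + Block_Dim - 1) / Block_Dim;   // ceiling division so threads >= elements
  dim3 Block(Block_Dim, Block_Dim);
  dim3 Grid(Grid_Dim, Grid_Dim);
  gpu_matrixmult<<<Grid, Block>>>(dev_a, dev_b, dev_c, N);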

  6. Sequential version with flattened arrays for comparison

void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) {
  int row, col, k, sum;
  for (row = 0; row < N; row++)        // row of a
    for (col = 0; col < N; col++) {    // column of b
      sum = 0;
      for (k = 0; k < N; k++)
        sum += cpu_a[row * N + k] * cpu_b[k * N + col];
      cpu_c[row * N + col] = sum;
    }
}

  7. Matrix mapped onto 2-D grid and 2-D blocks. The arrays are mapped onto the grid/block structure, one element per thread: a thread's row index is blockIdx.y * blockDim.y + threadIdx.y and its column index is blockIdx.x * blockDim.x + threadIdx.x. Basically the array is divided into "tiles" and one tile is mapped onto one block.

  8. Complete Program (several slides)

// Matrix multiplication program MatrixMult.cu, Barry Wilkinson, Dec. 28, 2010.
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

__global__ void gpu_matrixmult(int *gpu_a, int *gpu_b, int *gpu_c, int N) {
  …
}

void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) {
  …
}

int main(int argc, char *argv[]) {
  int i, j;                              // loop counters
  int Grid_Dim_x = 1, Grid_Dim_y = 1;    // grid structure values
  int Block_Dim_x = 1, Block_Dim_y = 1;  // block structure values
  int noThreads_x, noThreads_y;          // number of threads available in device, each dimension
  int noThreads_block;                   // number of threads in a block
  int N = 10;                            // size of array in each dimension
  int *a, *b, *c, *d;
  int *dev_a, *dev_b, *dev_c;
  int size;                              // number of bytes in arrays
  cudaEvent_t start, stop;               // using cuda events to measure time
  float elapsed_time_ms;                 // which is applicable for asynchronous code also

/* -------------------- ENTER INPUT PARAMETERS AND ALLOCATE DATA ----------------------- */
  …                                      // keyboard input

  dim3 Grid(Grid_Dim_x, Grid_Dim_y);     // grid structure
  dim3 Block(Block_Dim_x, Block_Dim_y);  // block structure, threads/block limited by specific device

  size = N * N * sizeof(int);            // number of bytes in total in arrays

  a = (int*) malloc(size);               // dynamically allocated memory for arrays on host
  b = (int*) malloc(size);
  c = (int*) malloc(size);               // results from GPU
  d = (int*) malloc(size);               // results from CPU
  …                                      // load arrays with some numbers

  9. /* ------------- COMPUTATION DONE ON GPU ---------------------------- */

  cudaMalloc((void**)&dev_a, size);      // allocate memory on device
  cudaMalloc((void**)&dev_b, size);
  cudaMalloc((void**)&dev_c, size);

  cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
  cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

  cudaEventRecord(start, 0);             // here start time, after memcpy

  gpu_matrixmult<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);

  cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

  cudaEventRecord(stop, 0);              // measure end time
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&elapsed_time_ms, start, stop);

  printf("Time to calculate results on GPU: %f ms.\n", elapsed_time_ms);

Where you measure time will make a big difference.

  10. /* ------------- COMPUTATION DONE ON HOST CPU ---------------------------- */

  cudaEventRecord(start, 0);             // use same timing*

  cpu_matrixmult(a, b, d, N);            // do calculation on host

  cudaEventRecord(stop, 0);              // measure end time
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&elapsed_time_ms, start, stop);

  printf("Time to calculate results on CPU: %f ms.\n", elapsed_time_ms);   // exe. time

/* ------------------- check device creates correct results ----------------- */
  …
/* --------------------- repeat program ---------------------------------------- */
  …                                      // while loop to repeat calc with different parameters
/* -------------- clean up --------------------------------------- */
  free(a); free(b); free(c);
  cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}

* I have found problems if I do the CPU timing before the GPU timing. Anyone else?

  11. Some Preliminaries: Effects of First Launch The program is written so that it can repeat with different parameters without stopping, to eliminate the effect of the first kernel launch. It might also take advantage of caching, although that seems not as significant as the first-launch effect.
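A minimal sketch of the same idea (not from the original program): issue an untimed warm-up launch before the timed one, so that one-time start-up costs are excluded from the measurement. The variables are assumed to be those of the program above.

  // Warm-up sketch: the first (untimed) launch absorbs one-time start-up costs.
  gpu_matrixmult<<<Grid, Block>>>(dev_a, dev_b, dev_c, N);   // warm-up launch, result discarded
  cudaDeviceSynchronize();                                   // make sure warm-up has finished

  cudaEventRecord(start, 0);                                 // timed launch starts here
  gpu_matrixmult<<<Grid, Block>>>(dev_a, dev_b, dev_c, N);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&elapsed_time_ms, start, stop);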

  12. Some results 32 x 32 array, 1 block of 32 x 32 threads. Speedup = 1.65, first time. Random numbers 0-9. Answer check: both CPU and GPU give the same answers.

  13. Some results 32 x 32 array 1 block of 32 x 32 threads Speedup = 2.12 Second time

  14. Some results 32 x 32 array, 1 block of 32 x 32 threads. Speedup = 2.16, third time. Subsequent runs vary 2.12 - 2.18.

  15. Some results 256 x 256 array 8 blocks of 32 x 32 threads Speedup = 151.86

  16. Some results 1024 x 1024 array 32 blocks of 32 x 32 threads Speedup = 860.9

  17. Some results 2048 x 2048 array, 64 blocks of 32 x 32 threads. Speedup = ?? GPU appears to freeze. Why? 2^11 x 2^11 threads = 2^22 threads. Memory needed = 2^22 integers = 2^24 bytes = 16 Mbytes per array. Max number of threads on GPU appears to be 2^16 x 2^16 = 2^32 threads. Server has 4 Gbytes of main memory.

  18. Some results 2048 x 2048 array 64 blocks of 32 x 32 threads Speedup = ??

  19. Different Array Sizes • * These results include the time of the memory copy back from the device but not the memory copy to the device. • ** These results include the time of the memory copy back from the device and the memory copy to the device. Block size 32 x 32. Number of blocks to suit array size.

  20. GPU Limitations • The previous program has limitations: • Number of threads >= number of array elements (the code will not work if the number of threads < number of array elements). • The number of threads/block and blocks/grid has GPU limits, which will limit the size of arrays that can be processed. • Keyboard input must be checked for invalid grid and block values. • There are memory bandwidth issues.

  21. Compute capability 1.x Maximum number of threads per block = 512. Maximum sizes of x- and y-dimension of thread block = 512. Maximum size of each dimension of grid = 65535. A maximum of 512 threads per block means a square 2-D block cannot be greater than 16 x 16 (256 threads). So the maximum square array size is 16 x 65535 (approximately 2^4 x 2^16 = 2^20), i.e. an array of 2^20 x 2^20. Is this a problem?

  22. Compute capability 2.x (coit-grid06.uncc.edu) Maximum number of threads per block = 1024. Maximum sizes of x- and y-dimension of thread block = 1024. Maximum size of each dimension of grid = 65535. A maximum of 1024 threads per block means a square 2-D block cannot be greater than 32 x 32; now all 1024 threads are used. So the maximum square array size is 32 x 65535 (approximately 2^5 x 2^16 = 2^21), i.e. an array of 2^21 x 2^21. Is this a problem?
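As an aside (not in the original notes), these limits can be read at run time rather than hard-coded; a minimal sketch using cudaGetDeviceProperties:

  // Minimal sketch: query the limits discussed above at run time.
  #include <stdio.h>
  #include <cuda_runtime.h>

  int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);           // properties of device 0
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
  }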

  23. Increasing the size of arrays beyond the thread limitation of the GPU -- using fewer threads than array elements* Actually this one is easy and can draw upon the regular technique of sub-matrix multiplication (figure: the matrices divided into sub-matrices). Not in the textbooks (not their "tiling"). * Would you want to use fewer threads than elements?

  24. To demonstrate that sub-matrix multiplication produces the correct final answer, consider simple 2 x 2 sub-matrices:
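The worked example itself does not survive in this transcript; as a small illustration in the notation of the next slide, assuming a 4 x 4 matrix split into four 2 x 2 sub-matrices:

  C0,0 = A0,0 * B0,0 + A0,1 * B1,0

  Taking the top-left element of C0,0:

  c00 = (a00*b00 + a01*b10)   from A0,0 * B0,0
      + (a02*b20 + a03*b30)   from A0,1 * B1,0
      = a00*b00 + a01*b10 + a02*b20 + a03*b30

which is exactly the full dot product of row 0 of A with column 0 of B, so accumulating sub-matrix products gives the same result as ordinary matrix multiplication.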

  25. Pseudo Code for Block Multiplication Suppose N x N matrices, divided into s^2 submatrices. Each submatrix has N/s x N/s elements.

for (p = 0; p < s; p++)
  for (q = 0; q < s; q++) {
    Cp,q = 0;                      // clear elements of submatrix
    for (r = 0; r < s; r++)        // submatrix multiplication
      Cp,q = Cp,q + Ap,r * Br,q;   // add to accumulating submatrix
  }

Cp,q means the submatrix at row p and column q of matrix C. Cp,q = Cp,q + Ap,r * Br,q means multiply submatrices Ap,r and Br,q using matrix multiplication and add to submatrix Cp,q using matrix addition.

  26. Code for Block Multiplication

N = …;        // number of elements in cols/rows of matrices
s = …;        // number of sub-matrices in each dimension
ss = N/s;     // number of elements in sub-matrix cols/rows (N/s must be an integer)

for (i = 0; i < N; i += ss)             // go through sub-matrix rows of C (rows of A)
  for (j = 0; j < N; j += ss) {         // and sub-matrix columns of C (columns of B)
    for (p = i; p < i + ss; p++)        // clear elements of sub-matrix Ci,j
      for (q = j; q < j + ss; q++)
        C[p][q] = 0;
    for (r = 0; r < N; r += ss)         // for each pair of sub-matrices Ai,r and Br,j
      for (p = i; p < i + ss; p++)      // sub-matrix multiplication,
        for (q = j; q < j + ss; q++)    // accumulated into sub-matrix Ci,j
          for (k = r; k < r + ss; k++)
            C[p][q] += A[p][k] * B[k][q];
  }

N/s an integer. Each thread does one value of i and j, i.e. s x s threads. Not tested yet. Mistakes?
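A sketch of how this might look as a kernel (not in the original notes): each thread handles one (i, j) sub-matrix of C, so only s x s threads are needed regardless of N. The kernel name and the flattened indexing are assumptions.

  // Hypothetical sketch: one thread per ss x ss sub-matrix of C, launched with
  // s x s threads in total (e.g. one s x s block, or a grid of smaller blocks).
  __global__ void gpu_blockmult(int *A, int *B, int *C, int N, int ss) {
    int i = (blockIdx.y * blockDim.y + threadIdx.y) * ss;   // first row of this thread's sub-matrix
    int j = (blockIdx.x * blockDim.x + threadIdx.x) * ss;   // first column of this thread's sub-matrix
    if (i < N && j < N) {
      for (int p = i; p < i + ss; p++)                      // clear sub-matrix Ci,j
        for (int q = j; q < j + ss; q++)
          C[p * N + q] = 0;
      for (int r = 0; r < N; r += ss)                       // accumulate Ai,r * Br,j
        for (int p = i; p < i + ss; p++)
          for (int q = j; q < j + ss; q++)
            for (int k = r; k < r + ss; k++)
              C[p * N + q] += A[p * N + k] * B[k * N + q];
    }
  }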

  27. Effects of memory access in matrix multiplication One thread is responsible for computing one result Cij and needs to access one row of A and one column of B. Each thread accesses one row of A and one column of B: N^2 row/column combinations, N^2 threads.

  28. Seen another way, in the first time period each thread (Thread 0, …, Thread i, …, Thread N-1) accesses the first element of a row of A. Consider those threads that access different rows: given the row-major order in which A is stored, the locations those threads access are not consecutive. Bad: memory coalescing cannot be done. Question: how many threads access the same location?

  29. Next, each thread (Thread 0, …, Thread i, …, Thread N-1) accesses the first element of a column of B. Consider those threads that access different columns: given the row-major order in which B is stored, the locations those threads access are consecutive. Good: memory coalescing can be done. Question: how many threads access the same location?

  30. How can we get better memory accesses and memory coalescing? • Transpose one array • Copy all rows of A to columns and all columns of A to rows before accessing A, and modify the program accordingly. • (Not mentioned in the course textbook or other NVIDIA book, although it appears an obvious way -- see next about whether it works!)

  31. Sequential code for a transpose using the same array:

for (i = 0; i < N; i++)
  for (j = 0; j < i; j++) {
    temp = B[i][j];
    B[i][j] = B[j][i];
    B[j][i] = temp;
  }

(In my code, I use separate arrays.) Could be done on the host prior to copying to the device. How would the code look if done on the device?
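One possible answer to that question, as a sketch only (not from the original notes): a simple out-of-place transpose kernel with one thread per element. The kernel name is an assumption, and no attempt is made here to coalesce the scattered writes.

  // Hypothetical device transpose sketch: out-of-place, one thread per element.
  // in is an N x N array in row-major order; out receives its transpose.
  __global__ void gpu_transpose(int *in, int *out, int N) {
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    if (row < N && col < N)
      out[col * N + row] = in[row * N + col];   // element (row,col) goes to (col,row)
  }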

  32. /* ------ COMPUTATION DONE ON GPU USING A TRANSPOSED ARRAY ----- */

  transposeArray(a, a_T, N);             // transpose array

  cudaEventRecord(start, 0);             // here time measured before host-device copy,
                                         // but not the transpose
  // cudaEventSynchronize(start);        // Needed?

  cudaMemcpy(dev_a, a_T, size, cudaMemcpyHostToDevice);   // copy transposed A
  cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);     // copy B

  gpu_matrixmult_T<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);

  cudaMemcpy(c_T, dev_c, size, cudaMemcpyDeviceToHost);

  cudaEventRecord(stop, 0);              // measure end time
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&elapsed_time_ms2, start, stop);

  printf("Time to calculate results on GPU with transposed array: %f ms.\n",
         elapsed_time_ms2);              // print out execution time
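The kernel gpu_matrixmult_T itself is not shown in the transcript; a sketch of what it presumably looks like, assuming dev_a now holds the transpose of A, so that A[row][k] is found at a_T[k*N + row]:

  // Hypothetical sketch of the kernel used with the transposed A array.
  __global__ void gpu_matrixmult_T(int *gpu_a_T, int *gpu_b, int *gpu_c, int N) {
    int k, sum = 0;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    if (col < N && row < N) {
      for (k = 0; k < N; k++)
        sum += gpu_a_T[k * N + row] * gpu_b[k * N + col];   // A[row][k] read from its transpose
      gpu_c[row * N + col] = sum;
    }
  }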

  33. Some results 8 x 8 array 1 block of 8 x 8 threads Speedup = 1.62 over not transposing array

  34. Some results 32 x 32 array 1 block of 32 x 32 threads Speedup = 1.17 over not transposing array

  35. Some results 256 x 256 array 8 blocks of 32 x 32 threads Speedup = 0.89!! over not transposing array

  36. Some results 1024 x 1024 array 32 blocks of 32 x 32 threads Speedup = 0.93!! over not transposing array

  37. 2. Using shared memory with "Tiling" Copies of tiles are made in shared memory (labelled As, Bs, Cs in the figure). Note: this is not using block matrix multiplication; the fundamental algorithm is just divided into phases. Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.10, page 109.

  38. Developing Code 1. Declaring shared memory:

__global__ void gpu_matrixmult(int* Md, int* Nd, int* Pd, int Width) {
  __shared__ int Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ int Nds[TILE_WIDTH][TILE_WIDTH];

This needs static memory allocation, so the tile size must be fixed at compile time. It is convenient to choose it the same as the kernel block size, say 32 x 32.
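For instance (an assumption, not shown in the notes), TILE_WIDTH would be a compile-time constant matched by the launch configuration:

  // Hypothetical setup matching the shared-memory declarations above:
  // TILE_WIDTH fixed at compile time, block dimensions set to the same value.
  #define TILE_WIDTH 32

  dim3 Block(TILE_WIDTH, TILE_WIDTH);           // one thread per tile element
  dim3 Grid(N / TILE_WIDTH, N / TILE_WIDTH);    // assumes N is a multiple of TILE_WIDTH
  gpu_matrixmult<<<Grid, Block>>>(dev_a, dev_b, dev_c, N);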

  39. 2. Index into shared memory: For convenience declare tile (block) and thread indices, and indices into the final C array:

int bx = blockIdx.x;   // tile (block) indices
int by = blockIdx.y;
int tx = threadIdx.x;  // thread indices
int ty = threadIdx.y;

To access (later):

Mds[ty][tx] = …        // element associated with thread
Nds[ty][tx] = …

  40. 3. Global address: For convenience declare row and column:

int Row = by * TILE_WIDTH + ty;   // global indices
int Col = bx * TILE_WIDTH + tx;

Note: the same as the usual global thread ID and result index in normal matrix multiplication:

int row = threadIdx.y + blockDim.y * blockIdx.y;
int col = threadIdx.x + blockDim.x * blockIdx.x;

  41. 4. Loading shared memory: Done using SIMT thread collaboration (very tricky). For each tile along the row of Md and down the column of Nd, all elements of the tile are transferred, one per thread:

for (int m = 0; m < N/TILE_WIDTH; m++) {
  Mds[ty][tx] = Md[Row*N + (m*TILE_WIDTH + tx)];    // one row element per thread (thread ID tx)
  Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*N + Col];    // one column element per thread (thread ID ty)
  __syncthreads();                                  // wait for all threads in the block
  // do matrix multiplication operations on the pair of tiles …

The book says this achieves memory coalescing, although it does not look like it does so in both cases.

  42. Example (3 x 3 tiles):
Row = by * TILE_WIDTH + ty, Col = bx * TILE_WIDTH + tx, m = tile number (0, 1, and 2).
Global memory address into A (Md): Row*N + (m*TILE_WIDTH + tx).
Global memory address into B (Nd): (m*TILE_WIDTH + ty)*N + Col.
[Figure: global arrays A and B shown divided into 3 x 3 tiles, indexed by by/bx, with m and k marking the tile and element being accessed.]
Based upon Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.10, page 109.

  43. 5. Matrix multiplication in shared memory:

int Pvalue = 0;
for (int m = 0; m < N/TILE_WIDTH; m++) {
  …                                      // copy tiles to shared memory (step 4)
  for (int k = 0; k < TILE_WIDTH; k++)
    Pvalue += Mds[ty][k] * Nds[k][tx];   // multiply tiles, accumulating values
}
Pd[Row * N + Col] = Pvalue;              // copy back to global memory
}

  44. Code given in the book, Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.11, page 110:

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;            // thread ID
  int Row = by * TILE_WIDTH + ty;   // identify row, column of Pd element to work on
  int Col = bx * TILE_WIDTH + tx;
  float Pvalue = 0;
  for (int m = 0; m < Width/TILE_WIDTH; m++) {           // loop over Md, Nd tiles to compute Pd element
    Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];   // load Md, Nd tiles into shared memory
    Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();
    for (int k = 0; k < TILE_WIDTH; k++)
      Pvalue += Mds[ty][k] * Nds[k][tx];
  }
  Pd[Row][Col] = Pvalue;
}

Note: copying to shared memory is a collaborative effort between threads, each thread doing one element of each array. (In the wrong place on page 110, although it does not affect the final answer!) Pd[Row][Col] is a mistake in the book, as Pd is a pointer and the size of the array is unknown; it will not compile as is. Need to use a flattened index.
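A compilable variant, as a sketch only (the corrections are mine, not the book's or the lecturer's): the result is stored with a flattened index, and a second __syncthreads() is added so that no thread starts overwriting a tile that another thread is still reading.

  // Sketch of a corrected version of the book's kernel (corrections assumed):
  // flattened store of Pd and a barrier after the per-tile computation.
  __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;
    float Pvalue = 0;
    for (int m = 0; m < Width / TILE_WIDTH; m++) {
      Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];   // load one element of each tile
      Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
      __syncthreads();                                         // tiles fully loaded before use
      for (int k = 0; k < TILE_WIDTH; k++)
        Pvalue += Mds[ty][k] * Nds[k][tx];
      __syncthreads();                                         // finished with tiles before they are overwritten
    }
    Pd[Row * Width + Col] = Pvalue;                            // flattened index instead of Pd[Row][Col]
  }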

  45. Some results 32 x 32 array, 1 block of 32 x 32 threads (results of second run):
Speedup, GPU to CPU: 2.12
Speedup, GPU shared memory to CPU: 2.73
Speedup, GPU shared memory to GPU without shared memory: 1.28

  46. Some results 256 x 256 array, 8 blocks of 32 x 32 threads:
Speedup, GPU to CPU: 153
Speedup, GPU shared memory to CPU: 217
Speedup, GPU shared memory to GPU without shared memory: 1.41

  47. Some results 1024 x 1024 array, 32 blocks of 32 x 32 threads:
Speedup, GPU to CPU: 864
Speedup, GPU shared memory to CPU: 2214 !!!
Speedup, GPU shared memory to GPU without shared memory: 2.56

  48. Some results 2048 x 2048 array, 64 blocks of 32 x 32 threads:
Speedup, GPU to CPU: 989
Speedup, GPU shared memory to CPU: 2962 !!
Speedup, GPU shared memory to GPU without shared memory: 2.99

  49. Different Array Sizes Block size 32 x 32. Number of blocks to suit array size

  50. Bandwidth improvements by using shared memory Using 32 x 32 tiles reduces the number of global memory accesses by a factor of 32 (two global loads per thread per tile phase instead of 2 x 32). According to the PMPP book, page 90, using 16 x 16 tiles: "it allows the 86.4 GB/s global bandwidth to serve a much larger floating-point computation rate … Can now support 86.4/4 x 16 = 345.6 gigaflops, very close to the peak floating point performance of the G80 … effectively removes global memory bandwidth as the major limiting factor of matrix multiplication performance."
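The arithmetic behind the quoted figure, spelled out (my reading of the book's argument, for a 16 x 16 tile): each float fetched from global memory is 4 bytes, and each fetched value is reused for 16 operations, so

  86.4 GB/s / 4 bytes per float = 21.6 Gfloat/s loaded from global memory
  21.6 Gfloat/s x 16 operations per loaded float = 345.6 GFLOPS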
