
Presentation Transcript


  1. Using Shared Memory
     These notes will demonstrate the improvements achieved by using shared memory.
     ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 1, 2011. SharedMem.ppt

  2. Approach
     Objective: as with the memory-coalescing demo, to load numbers into a two-dimensional array.
     The flattened global thread ID of each thread is loaded into its array element, so one can
     tell which thread accesses which location when the array is printed out.
     For comparison purposes, the access is done three ways:
     1. Using global memory only
     2. Using shared memory, with local 2-D arrays copied back to global memory
     3. As 2., but using separate pointer arithmetic for speed
     GPU structure: one or more 2-D blocks in a 2-D grid. Each block is fixed at 32 x 32 threads
     (the maximum for compute capability 2.x).
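     A minimal host-side sketch of this launch configuration (an addition, not from the slides;
     the 512 x 512 array and 16 x 16 grid of blocks are taken from the results slides later on):

     #define BlockSize 32                        // 32 x 32 threads per block (max for compute 2.x)

     int N = 512;                                // array dimension, e.g. the 512 x 512 case
     dim3 Block(BlockSize, BlockSize);           // 2-D block of 32 x 32 threads
     dim3 Grid(N / BlockSize, N / BlockSize);    // 2-D grid, here 16 x 16 blocks
     // Flattened global thread ID, as computed in the kernels that follow:
     //     col      = threadIdx.x + blockDim.x * blockIdx.x
     //     row      = threadIdx.y + blockDim.y * blockIdx.y
     //     threadID = col + row * N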

  3. 1. Using global memory only

     __global__ void gpu_WithoutSharedMem(int *h, int N, int T) {
         // Array loaded with the global thread ID that accesses that location.
         // Coalescing should be possible.
         int col = threadIdx.x + blockDim.x * blockIdx.x;
         int row = threadIdx.y + blockDim.y * blockIdx.y;
         int threadID = col + row * N;
         int index = col + row * N;
         for (int t = 0; t < T; t++)    // repeated T times to reduce other time effects
             h[index] = threadID;       // load array with global thread ID
     }
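     For contrast (an illustration added here, not in the original slides): coalescing in the
     kernel above relies on threads with consecutive threadIdx.x writing consecutive addresses
     in h. A hypothetical variant with transposed indexing would break that, making each warp's
     writes N elements apart:

     // Hypothetical uncoalesced variant, for contrast only.
     __global__ void gpu_Uncoalesced(int *h, int N, int T) {
         int col = threadIdx.x + blockDim.x * blockIdx.x;
         int row = threadIdx.y + blockDim.y * blockIdx.y;
         int threadID = col + row * N;
         int index = row + col * N;     // transposed: consecutive threads write strided addresses
         for (int t = 0; t < T; t++)
             h[index] = threadID;       // uncoalesced global-memory writes
     }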

  4. 2. Using shared memory

     __global__ void gpu_SharedMem(int *h, int N, int T) {
         __shared__ int h_local[BlockSize][BlockSize];      // shared memory, one array per block
         int col = threadIdx.x + blockDim.x * blockIdx.x;
         int row = threadIdx.y + blockDim.y * blockIdx.y;
         int threadID = col + row * N;
         int index = col + row * N;
         // h_local[threadIdx.y][threadIdx.x] = h[index];   // not necessary here, but might be
         //                                                 // in other calculations
         for (int t = 0; t < T; t++)
             h_local[threadIdx.y][threadIdx.x] = threadID;  // load shared-memory array
         h[index] = h_local[threadIdx.y][threadIdx.x];      // copy back to global memory
     }
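     A side note (not in the original slides): no __syncthreads() barrier is needed above,
     because each thread writes and reads only its own shared-memory element. If threads read
     elements written by other threads, a barrier would be required, as in this sketch:

     // Sketch: each thread reads its left neighbour's element, so all writes must
     // complete before any read. Kernel name and access pattern are illustrative only.
     __global__ void gpu_SharedMem_neighbour(int *h, int N) {
         __shared__ int h_local[BlockSize][BlockSize];
         int col = threadIdx.x + blockDim.x * blockIdx.x;
         int row = threadIdx.y + blockDim.y * blockIdx.y;
         h_local[threadIdx.y][threadIdx.x] = col + row * N;
         __syncthreads();               // wait until every thread in the block has written
         int left = (threadIdx.x + BlockSize - 1) % BlockSize;   // wrap within the block
         h[col + row * N] = h_local[threadIdx.y][left];
     }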

  5. 3. Using shared memory, with the index calculation outside the loop

     __global__ void gpu_SharedMem_ptr(int *h, int N, int T) {
         __shared__ int h_local[BlockSize][BlockSize];
         int col = threadIdx.x + blockDim.x * blockIdx.x;
         int row = threadIdx.y + blockDim.y * blockIdx.y;
         int threadID = col + row * N;
         int index = col + row * N;
         int *ptr = &h_local[0][0];     // index into shared memory calculated once, outside loop
         int index_local = threadIdx.x + threadIdx.y * BlockSize;  // row width of h_local is
                                                                   // BlockSize, not N
         for (int t = 0; t < T; t++)
             ptr[index_local] = threadID;
         h[index] = h_local[threadIdx.y][threadIdx.x];
     }

     This code I am still checking out.
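     A step worth spelling out (an addition, not from the slides): the local index must use the
     shared array's own row width, BlockSize, rather than the global array width N. Using N would
     only give the correct offset when the grid is a single block, so that N == BlockSize == 32.
     A worked example of the arithmetic, assuming BlockSize = 32:

     // Thread (threadIdx.x = 3, threadIdx.y = 2) must address h_local[2][3].
     // Row-major offset from &h_local[0][0]:
     //     index_local = threadIdx.x + threadIdx.y * BlockSize = 3 + 2*32 = 67    (correct)
     // With the global width N = 512 instead (the 16 x 16 grid case):
     //     3 + 2*512 = 1027, which is outside the 32*32 = 1024-element shared array.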

  6. Main program

     …
     /* ------------------------- Allocate memory ------------------------- */
     int size = N * N * sizeof(int);     // number of bytes in total in array
     int *h, *dev_h;                     // pointers to arrays holding numbers on host and device
     h = (int*) malloc(size);            // array on host
     cudaMalloc((void**)&dev_h, size);   // allocate device memory

     /* ------------ GPU computation without shared memory ------------ */
     gpu_WithoutSharedMem<<< Grid, Block >>>(dev_h, N, T);   // once outside timing
     cudaEventRecord(start, 0);
     gpu_WithoutSharedMem<<< Grid, Block >>>(dev_h, N, T);
     cudaEventRecord(stop, 0);
     cudaEventSynchronize(stop);
     cudaEventElapsedTime(&elapsed_time_ms1, start, stop);
     cudaMemcpy(h, dev_h, size, cudaMemcpyDeviceToHost);     // get results to check
     printf("\nComputation without shared memory\n");
     printArray(h, N);
     printf("\nTime to calculate results on GPU: %f ms.\n", elapsed_time_ms1);

     Computations 2 and 3 are timed similarly.
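     The excerpt assumes the timing events and a printArray() helper are declared elsewhere. A
     minimal sketch of what those might look like (hypothetical; neither appears in the slides):

     #include <stdio.h>

     // Possible printArray() helper: prints the N x N array of thread IDs, row by row.
     void printArray(int *h, int N) {
         for (int row = 0; row < N; row++) {
             for (int col = 0; col < N; col++)
                 printf("%6d ", h[col + row * N]);
             printf("\n");
         }
     }

     // Event set-up assumed earlier in main(), before the timed region:
     //     cudaEvent_t start, stop;
     //     float elapsed_time_ms1;
     //     cudaEventCreate(&start);
     //     cudaEventCreate(&stop);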

  7. Some results
     A grid of one block and one iteration; array 32 x 32.
     Shared-memory speedup = 1.18

  8. A grid of one block and 1000000 iterations; array 32 x 32.
     Shared-memory speedup = 1.24

  9. Repeat, just to check that the results are consistent.

  10. A grid of 16 x 16 blocks and 10000 iterations; array 512 x 512.
      Speedup = 1.74. Different numbers of iterations produce similar results.

  11. Different array sizes
      1000 iterations; block size 32 x 32; the number of blocks is chosen to suit the array size
      (see the sketch below).
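      How the block count might be derived for an arbitrary array size (a sketch; the slides only
      show sizes that are multiples of 32):

      dim3 Block(32, 32);
      dim3 Grid((N + 31) / 32, (N + 31) / 32);   // round up so the grid covers the whole array
      // When N is not a multiple of 32, kernels then need a bounds check, e.g.:
      //     if (col < N && row < N) h[index] = threadID;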

  12. Questions
