
Presentation Transcript


  1. Using Shared Memory
     These notes will demonstrate the improvements achieved by using shared memory.
     ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 1, 2011. SharedMem.ppt

  2. Approach
     Objective: as with the memory-coalescing demo, to load numbers into a two-dimensional array.
     The flattened global thread ID of each thread is loaded into its array element, so one can
     tell which thread accesses which location when the array is printed out.
     For comparison purposes, the access is done three ways:
     1. Using global memory only
     2. Using shared memory, with local 2-D arrays copied back to global memory
     3. As 2., but using separate pointer arithmetic for speed
     GPU structure: one or more 2-D blocks in a 2-D grid. Each block is fixed at 32 x 32 threads
     (the maximum for compute capability 2.x).
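     A minimal host-side sketch of this launch configuration (an addition, not from the slides;
     the 512 x 512 array and 16 x 16 grid of blocks are taken from the results slides later on):

     #define BlockSize 32                        // 32 x 32 threads per block (max for compute 2.x)

     int N = 512;                                // array dimension, e.g. the 512 x 512 case
     dim3 Block(BlockSize, BlockSize);           // 2-D block of 32 x 32 threads
     dim3 Grid(N / BlockSize, N / BlockSize);    // 2-D grid, here 16 x 16 blocks
     // Flattened global thread ID, as computed in the kernels that follow:
     //     col      = threadIdx.x + blockDim.x * blockIdx.x
     //     row      = threadIdx.y + blockDim.y * blockIdx.y
     //     threadID = col + row * N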

  3. 1. Using global memory only

     __global__ void gpu_WithoutSharedMem(int *h, int N, int T) {
         // Array loaded with the global thread ID that accesses that location.
         // Coalescing should be possible.
         int col = threadIdx.x + blockDim.x * blockIdx.x;
         int row = threadIdx.y + blockDim.y * blockIdx.y;
         int threadID = col + row * N;
         int index = col + row * N;
         for (int t = 0; t < T; t++)    // repeated T times to reduce other time effects
             h[index] = threadID;       // load array with global thread ID
     }
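     For contrast (an illustration added here, not in the original slides): coalescing in the
     kernel above relies on threads with consecutive threadIdx.x writing consecutive addresses
     in h. A hypothetical variant with transposed indexing would break that, making each warp's
     writes N elements apart:

     // Hypothetical uncoalesced variant, for contrast only.
     __global__ void gpu_Uncoalesced(int *h, int N, int T) {
         int col = threadIdx.x + blockDim.x * blockIdx.x;
         int row = threadIdx.y + blockDim.y * blockIdx.y;
         int threadID = col + row * N;
         int index = row + col * N;     // transposed: consecutive threads write strided addresses
         for (int t = 0; t < T; t++)
             h[index] = threadID;       // uncoalesced global-memory writes
     }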

  4. 2. Using shared memory

     __global__ void gpu_SharedMem(int *h, int N, int T) {
         __shared__ int h_local[BlockSize][BlockSize];      // shared memory, one array per block
         int col = threadIdx.x + blockDim.x * blockIdx.x;
         int row = threadIdx.y + blockDim.y * blockIdx.y;
         int threadID = col + row * N;
         int index = col + row * N;
         // h_local[threadIdx.y][threadIdx.x] = h[index];   // not necessary here, but might be
         //                                                 // in other calculations
         for (int t = 0; t < T; t++)
             h_local[threadIdx.y][threadIdx.x] = threadID;  // load shared-memory array
         h[index] = h_local[threadIdx.y][threadIdx.x];      // copy back to global memory
     }
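     A side note (not in the original slides): no __syncthreads() barrier is needed above,
     because each thread writes and reads only its own shared-memory element. If threads read
     elements written by other threads, a barrier would be required, as in this sketch:

     // Sketch: each thread reads its left neighbour's element, so all writes must
     // complete before any read. Kernel name and access pattern are illustrative only.
     __global__ void gpu_SharedMem_neighbour(int *h, int N) {
         __shared__ int h_local[BlockSize][BlockSize];
         int col = threadIdx.x + blockDim.x * blockIdx.x;
         int row = threadIdx.y + blockDim.y * blockIdx.y;
         h_local[threadIdx.y][threadIdx.x] = col + row * N;
         __syncthreads();               // wait until every thread in the block has written
         int left = (threadIdx.x + BlockSize - 1) % BlockSize;   // wrap within the block
         h[col + row * N] = h_local[threadIdx.y][left];
     }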

  5. 3. Using shared memory, with the index calculation outside the loop

     __global__ void gpu_SharedMem_ptr(int *h, int N, int T) {
         __shared__ int h_local[BlockSize][BlockSize];
         int col = threadIdx.x + blockDim.x * blockIdx.x;
         int row = threadIdx.y + blockDim.y * blockIdx.y;
         int threadID = col + row * N;
         int index = col + row * N;
         int *ptr = &h_local[0][0];     // index into shared memory calculated once, outside loop
         int index_local = threadIdx.x + threadIdx.y * BlockSize;  // row width of h_local is
                                                                   // BlockSize, not N
         for (int t = 0; t < T; t++)
             ptr[index_local] = threadID;
         h[index] = h_local[threadIdx.y][threadIdx.x];
     }

     This code I am still checking out.
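     A step worth spelling out (an addition, not from the slides): the local index must use the
     shared array's own row width, BlockSize, rather than the global array width N. Using N would
     only give the correct offset when the grid is a single block, so that N == BlockSize == 32.
     A worked example of the arithmetic, assuming BlockSize = 32:

     // Thread (threadIdx.x = 3, threadIdx.y = 2) must address h_local[2][3].
     // Row-major offset from &h_local[0][0]:
     //     index_local = threadIdx.x + threadIdx.y * BlockSize = 3 + 2*32 = 67    (correct)
     // With the global width N = 512 instead (the 16 x 16 grid case):
     //     3 + 2*512 = 1027, which is outside the 32*32 = 1024-element shared array.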

  6. Main program

     …
     /* ------------------------- Allocate memory ------------------------- */
     int size = N * N * sizeof(int);     // number of bytes in total in array
     int *h, *dev_h;                     // pointers to arrays holding numbers on host and device
     h = (int*) malloc(size);            // array on host
     cudaMalloc((void**)&dev_h, size);   // allocate device memory

     /* ------------ GPU computation without shared memory ------------ */
     gpu_WithoutSharedMem<<< Grid, Block >>>(dev_h, N, T);   // once outside timing
     cudaEventRecord(start, 0);
     gpu_WithoutSharedMem<<< Grid, Block >>>(dev_h, N, T);
     cudaEventRecord(stop, 0);
     cudaEventSynchronize(stop);
     cudaEventElapsedTime(&elapsed_time_ms1, start, stop);
     cudaMemcpy(h, dev_h, size, cudaMemcpyDeviceToHost);     // get results to check
     printf("\nComputation without shared memory\n");
     printArray(h, N);
     printf("\nTime to calculate results on GPU: %f ms.\n", elapsed_time_ms1);

     Computations 2 and 3 are timed similarly.
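     The excerpt assumes the timing events and a printArray() helper are declared elsewhere. A
     minimal sketch of what those might look like (hypothetical; neither appears in the slides):

     #include <stdio.h>

     // Possible printArray() helper: prints the N x N array of thread IDs, row by row.
     void printArray(int *h, int N) {
         for (int row = 0; row < N; row++) {
             for (int col = 0; col < N; col++)
                 printf("%6d ", h[col + row * N]);
             printf("\n");
         }
     }

     // Event set-up assumed earlier in main(), before the timed region:
     //     cudaEvent_t start, stop;
     //     float elapsed_time_ms1;
     //     cudaEventCreate(&start);
     //     cudaEventCreate(&stop);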

  7. Some results
     A grid of one block and one iteration; array 32 x 32.
     Shared-memory speedup = 1.18

  8. A grid of one block and 1000000 iterations; array 32 x 32.
     Shared-memory speedup = 1.24

  9. Repeat, just to check that the results are consistent.

  10. A grid of 16 x 16 blocks and 10000 iterations; array 512 x 512.
      Speedup = 1.74. Different numbers of iterations produce similar results.

  11. Different array sizes
      1000 iterations; block size 32 x 32; the number of blocks is chosen to suit the array size
      (see the sketch below).
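      How the block count might be derived for an arbitrary array size (a sketch; the slides only
      show sizes that are multiples of 32):

      dim3 Block(32, 32);
      dim3 Grid((N + 31) / 32, (N + 31) / 32);   // round up so the grid covers the whole array
      // When N is not a multiple of 32, kernels then need a bounds check, e.g.:
      //     if (col < N && row < N) h[index] = threadID;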

  12. Questions
