
ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2013 ConstantMemTiming






Presentation Transcript


  1. Using Constant Memory
  These notes will introduce:
  - How to declare and use constant memory
  - Results of an experiment using constant memory
  ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2013, ConstantMemTiming.ppt

  2. Global memory, shared memory, and registers
  [Diagram: host memory on the host side; on the device, a grid of blocks of threads, with registers and local memory per thread, shared memory per block, and global and constant memory accessible to the whole grid]
  Constant memory is used for storing global constants. There is also a read-only global memory called texture memory.

  3. Constant memory programming
  - Constant memory is part of global memory but much faster because it is cached; it is limited to 64 KB (for all compute capabilities up to 3.5).
  - Declared statically using the __device__ and __constant__ qualifiers together, with global (application) scope.
  - Lifetime of the application, like global memory.
  - Read-only from a GPU kernel, i.e. it cannot be altered by the kernel (this is what enables the caching to work).
  - Read/write from the host using cudaMemcpyToSymbol() and cudaMemcpyFromSymbol(). A minimal sketch follows this list.
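  A minimal sketch of the declaration and host-side transfers described above; the array name dev_coeff and its size are illustrative only, not taken from the slides:

    #define M 1024

    // declared at file scope; read-only from kernels, lifetime of the application
    __device__ __constant__ float dev_coeff[M];

    // on the host (the offset and direction arguments take their default values here)
    float h_coeff[M];
    cudaMemcpyToSymbol(dev_coeff, h_coeff, M * sizeof(float));    // host -> constant memory
    cudaMemcpyFromSymbol(h_coeff, dev_coeff, M * sizeof(float));  // constant memory -> host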

  4. Sample Code and Experimental Results
  The test program simply adds two vectors A and B together to produce a third vector, C.
  One version uses constant memory for A and B; another version uses regular global memory for A and B.
  Note that the maximum constant memory available on the GPU (for all compute capabilities so far) is 64 Kbytes in total.

  5. Code
  Array declarations:

    #define N 8192   // max size allowed for two int vectors in constant memory
                     // (2 x 8192 x 4 bytes = 64 Kbytes)

    // constants held in constant memory
    __device__ __constant__ int dev_a_Cont[N];
    __device__ __constant__ int dev_b_Cont[N];

    // regular global memory for comparison
    __device__ int dev_a[N];
    __device__ int dev_b[N];

    // result in device global memory
    __device__ int dev_c[N];
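  The host-side variables used by the timing code on the following slides (a, b, c, the CUDA events, and the grid/block sizes B and T) are not shown in the transcript; a plausible sketch of what they might look like, with T = 256 being only one possible choice:

    int a[N], b[N], c[N];                   // host copies of the vectors
    cudaEvent_t start, stop;                // events used to time each kernel
    float elapsed_time_Cont, elapsed_time;  // ms, with and without constant memory
    int T = 256;                            // threads per block (assumed value)
    int B = (N + T - 1) / T;                // enough blocks to cover N elements

    cudaEventCreate(&start);
    cudaEventCreate(&stop);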

  6. Kernel routines

    // kernel using constant memory
    __global__ void add_Cont() {
       int tid = blockIdx.x * blockDim.x + threadIdx.x;
       if (tid < N) {
          dev_c[tid] = dev_a_Cont[tid] + dev_b_Cont[tid];
       }
    }

    // kernel not using constant memory
    __global__ void add() {
       int tid = blockIdx.x * blockDim.x + threadIdx.x;
       if (tid < N) {
          dev_c[tid] = dev_a[tid] + dev_b[tid];
       }
    }

  7. /*----------- GPU using constant memory ------------------------*/

    printf("GPU using constant memory\n");

    for (int i = 0; i < N; i++) {      // load arrays with some numbers
       a[i] = i;
       b[i] = i*2;
    }

    // copy vectors to constant memory
    cudaMemcpyToSymbol(dev_a_Cont, a, N*sizeof(int), 0, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(dev_b_Cont, b, N*sizeof(int), 0, cudaMemcpyHostToDevice);

    cudaEventRecord(start, 0);         // start time
    add_Cont<<<B,T>>>();               // does not need array ptrs
    cudaThreadSynchronize();           // wait for all threads to complete
    cudaEventRecord(stop, 0);          // end time

    cudaMemcpyFromSymbol(a, dev_a_Cont, N*sizeof(int), 0, cudaMemcpyDeviceToHost);
    cudaMemcpyFromSymbol(b, dev_b_Cont, N*sizeof(int), 0, cudaMemcpyDeviceToHost);
    cudaMemcpyFromSymbol(c, dev_c, N*sizeof(int), 0, cudaMemcpyDeviceToHost);

    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_Cont, start, stop);

  Instructor's note: watch for the zero (the offset argument) in the cudaMemcpyToSymbol()/cudaMemcpyFromSymbol() calls; it was originally missed off and took some time to spot.
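  The transcript shows no check of the results. Since a[i] = i and b[i] = 2*i, every element of c should equal 3*i; a simple verification sketch (an assumption, not shown on the slides):

    for (int i = 0; i < N; i++) {
       if (c[i] != 3*i) {               // expected: c[i] = a[i] + b[i]
          printf("Error at element %d: %d\n", i, c[i]);
          break;
       }
    }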

  8. /*----------- GPU not using constant memory --------------------*/

    printf("GPU not using constant memory\n");

    for (int i = 0; i < N; i++) {      // load arrays with some numbers
       a[i] = i;
       b[i] = i*2;
    }

    // copy vectors to regular global memory
    cudaMemcpyToSymbol(dev_a, a, N*sizeof(int), 0, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(dev_b, b, N*sizeof(int), 0, cudaMemcpyHostToDevice);

    cudaEventRecord(start, 0);         // start time
    add<<<B,T>>>();                    // does not need array ptrs
    cudaThreadSynchronize();           // wait for all threads to complete
    cudaEventRecord(stop, 0);          // end time

    cudaMemcpyFromSymbol(a, dev_a, N*sizeof(int), 0, cudaMemcpyDeviceToHost);
    cudaMemcpyFromSymbol(b, dev_b, N*sizeof(int), 0, cudaMemcpyDeviceToHost);
    cudaMemcpyFromSymbol(c, dev_c, N*sizeof(int), 0, cudaMemcpyDeviceToHost);

    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time, start, stop);
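  The speedups quoted on the next slide are presumably computed from the two measured times; a minimal sketch (the output format is an assumption):

    // speedup = time without constant memory / time with constant memory
    printf("Time with constant memory:    %f ms\n", elapsed_time_Cont);
    printf("Time without constant memory: %f ms\n", elapsed_time);
    printf("Speedup: %f\n", elapsed_time / elapsed_time_Cont);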

  9. Speedup of around 1.2 (20%) after the first launch:
    1st launch: 1.6
    2nd run:    1.217
    3rd run:    1.225
  No explanation for why the speedup is higher on the first launch.
  The textbook reports about a 50% improvement with a ray-tracing application.

  10. Questions
