GMAC: Global Memory for Accelerators
Isaac Gelado
PUMPS Summer School, Barcelona
Vector Addition CUDA code
• Vector addition: a really simple kernel code
• But, what about the CPU code?
• GMAC is a complement to the CUDA run-time
  • Simplifies the CPU code
  • Exploits advanced CUDA features for free

__global__ void vector(float *c, float *a, float *b, size_t size)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx < size)
        c[idx] = a[idx] + b[idx];
}
Some easy CUDA code (I)
• Read from disk, transfer to GPU and compute

int main(int argc, char *argv[])
{
    float *h_a, *h_b, *h_c, *d_a, *d_b, *d_c;
    size_t size = LENGTH * sizeof(float);

    assert((h_a = (float *)malloc(size)) != NULL);
    assert((h_b = (float *)malloc(size)) != NULL);
    assert((h_c = (float *)malloc(size)) != NULL);
    assert(cudaMalloc((void **)&d_a, size) == cudaSuccess);
    assert(cudaMalloc((void **)&d_b, size) == cudaSuccess);
    assert(cudaMalloc((void **)&d_c, size) == cudaSuccess);

    read_file(argv[A], h_a);
    read_file(argv[B], h_b);

    assert(cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice) == cudaSuccess);
    assert(cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice) == cudaSuccess);
Some easy CUDA code (and II)
• Read from disk, transfer to GPU and compute

    dim3 Db(BLOCK_SIZE);
    dim3 Dg(LENGTH / BLOCK_SIZE);
    if(LENGTH % BLOCK_SIZE) Dg.x++;

    vector<<<Dg, Db>>>(d_c, d_a, d_b, LENGTH);
    assert(cudaThreadSynchronize() == cudaSuccess);

    assert(cudaMemcpy(h_c, d_c, LENGTH * sizeof(float),
                      cudaMemcpyDeviceToHost) == cudaSuccess);
    save_file(argv[C], h_c);

    free(h_a); cudaFree(d_a);
    free(h_b); cudaFree(d_b);
    free(h_c); cudaFree(d_c);

    return 0;
}
Some really easy GMAC code

int main(int argc, char *argv[])
{
    float *a, *b, *c;
    size_t size = LENGTH * sizeof(float);

    assert(gmacMalloc((void **)&a, size) == gmacSuccess);
    assert(gmacMalloc((void **)&b, size) == gmacSuccess);
    assert(gmacMalloc((void **)&c, size) == gmacSuccess);

    read_file(argv[A], a);
    read_file(argv[B], b);

    dim3 Db(BLOCK_SIZE);
    dim3 Dg(LENGTH / BLOCK_SIZE);
    if(LENGTH % BLOCK_SIZE) Dg.x++;

    vector<<<Dg, Db>>>(c, a, b, LENGTH);
    assert(gmacThreadSynchronize() == gmacSuccess);

    save_file(argv[C], c);

    gmacFree(a); gmacFree(b); gmacFree(c);
    return 0;
}

There is no memory copy: the same pointers are used by the CPU code and by the kernel.
Getting GMAC
• GMAC is at http://adsm.googlecode.com/
• Debian / Ubuntu binary and development .deb packages
• UNIX (also Mac OS X) source code package
• Experimental versions from the Mercurial repository
Outline
• Introduction
• GMAC Memory Model
  • Asymmetric Memory
  • Global Memory
• GMAC Execution Model
  • Multi-threading
  • Inter-thread communication
• Conclusions
GMAC Memory Model
• Unified CPU / GPU virtual address space
• Asymmetric address space visibility: shared data is visible to both the CPU and the GPU, while CPU-private data is visible only to the CPU
GMAC Consistency Model
• Implicit acquire / release primitives at accelerator call / return boundaries: the CPU releases shared data when it calls the accelerator and acquires it again when the call returns
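As a rough sketch of what these implicit boundaries mean for application code (the scale kernel, LENGTH, and the initialization loop below are illustrative, not part of GMAC):

#include <assert.h>
#include <stdio.h>
#include <gmac.h>

#define LENGTH 1024   /* illustrative size */

__global__ void scale(float *v, float f, size_t n)
{
    size_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if(i < n) v[i] *= f;
}

int main(void)
{
    float *v;
    assert(gmacMalloc((void **)&v, LENGTH * sizeof(float)) == gmacSuccess);

    /* The CPU owns the shared data and writes it directly */
    for(size_t i = 0; i < LENGTH; i++) v[i] = (float)i;

    /* The kernel call acts as an implicit release: CPU modifications
       become visible to the accelerator before the kernel runs */
    scale<<<LENGTH / 256 + 1, 256>>>(v, 2.0f, LENGTH);

    /* Returning from the accelerator acts as an implicit acquire: once the
       call has completed, the CPU reads the results through the same pointer */
    assert(gmacThreadSynchronize() == gmacSuccess);
    printf("v[0] = %f\n", v[0]);

    gmacFree(v);
    return 0;
}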
GMAC Memory API
• Memory allocation
  gmacError_t gmacMalloc(void **ptr, size_t size)
  • ptr: allocated memory address (returned by reference)
  • size: size of the data to be allocated
  • Returns: error code, gmacSuccess if no error
• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    gmacError_t error;
    if((error = gmacMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess)
        FATAL("Error allocating memory %s", gmacErrorString(error));
    . . .
}
GMAC Memory API
• Memory release
  gmacError_t gmacFree(void *ptr)
  • ptr: memory address to be released
  • Returns: error code, gmacSuccess if no error
• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    gmacError_t error;
    if((error = gmacMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess)
        FATAL("Error allocating memory %s", gmacErrorString(error));
    . . .
    gmacFree(foo);
}
GMAC Memory API
• Memory translation (workaround for Fermi)
  void *gmacPtr(void *ptr)
  template<typename T> T *gmacPtr(T *ptr)
  • ptr: CPU memory address
  • Returns: GPU memory address
• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    . . .
    kernel<<<Dg, Db>>>(gmacPtr(buffer), size);
    . . .
}
GMAC Execution Example
• Get advanced CUDA features for free
  • Asynchronous data transfers
  • Pinned memory
GMAC Global Memory
• Data accessible by all accelerators, but owned by the CPU
(Figure: a single CPU-owned memory region shared with every GPU)
GMAC Global Memory API
• Memory allocation
  gmacError_t gmacGlobalMalloc(void **ptr, size_t size)
  • ptr: allocated memory address (returned by reference)
  • size: size of the data to be allocated
  • Returns: error code, gmacSuccess if no error
• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    gmacError_t error;
    if((error = gmacGlobalMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess)
        FATAL("Error allocating memory %s", gmacErrorString(error));
    . . .
}
GMAC and Multi-threading
• In the past: one thread, one CPU
• In GMAC, one thread:
  • One CPU
  • One GPU
• A thread runs on the GPU or on the CPU, but not on both at the same time
• Create threads using what you already know: pthread_create(...) (see the sketch below)
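A minimal sketch of per-thread virtual GPUs, assuming the vector kernel, LENGTH, BLOCK_SIZE, and read_file from the earlier example; the worker function and the two-thread setup are illustrative:

#include <assert.h>
#include <pthread.h>
#include <gmac.h>

static void *worker(void *arg)
{
    const char *filename = (const char *)arg;
    float *a, *b, *c;
    size_t size = LENGTH * sizeof(float);

    /* Allocations and kernel launches below use this thread's virtual GPU */
    assert(gmacMalloc((void **)&a, size) == gmacSuccess);
    assert(gmacMalloc((void **)&b, size) == gmacSuccess);
    assert(gmacMalloc((void **)&c, size) == gmacSuccess);

    read_file(filename, a);
    for(size_t i = 0; i < LENGTH; i++) b[i] = 1.0f;

    dim3 Db(BLOCK_SIZE);
    dim3 Dg(LENGTH / BLOCK_SIZE);
    if(LENGTH % BLOCK_SIZE) Dg.x++;
    vector<<<Dg, Db>>>(c, a, b, LENGTH);
    assert(gmacThreadSynchronize() == gmacSuccess);

    gmacFree(a); gmacFree(b); gmacFree(c);
    return NULL;
}

int main(int argc, char *argv[])
{
    pthread_t t[2];
    assert(argc >= 3);
    for(int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, argv[i + 1]);
    for(int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}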
GMAC and Multi-threading
• Virtual memory accessibility:
  • Complete address space in CPU code
  • Partial address space in GPU code
  (illustrated in the sketch below)
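A hedged illustration of that asymmetry; the produce kernel, LENGTH, the plain global pointer, and the omitted thread synchronization are illustrative assumptions, not GMAC API:

#include <assert.h>
#include <gmac.h>

#define LENGTH 1024                             /* illustrative size */
__global__ void produce(float *v, size_t n);    /* illustrative kernel */

float *buffer;   /* allocated on thread A's virtual GPU */

/* Thread A: allocates the buffer and fills it on its own virtual GPU */
static void *thread_a(void *arg)
{
    assert(gmacMalloc((void **)&buffer, LENGTH * sizeof(float)) == gmacSuccess);
    produce<<<LENGTH / 256 + 1, 256>>>(buffer, LENGTH);
    assert(gmacThreadSynchronize() == gmacSuccess);
    return NULL;
}

/* Thread B: its CPU code sees the complete address space, so it may read the
   buffer directly; kernels launched by thread B, however, only see the
   partial address space mapped to B's own virtual GPU, so they cannot
   dereference this pointer */
static void *thread_b(void *arg)
{
    float first = buffer[0];   /* fine from CPU code (after proper synchronization) */
    (void)first;
    return NULL;
}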
GPU Passing and Copying
• GPU passing:
  • Send the thread's virtual GPU to another thread
  • Do not move data, move computation
• API calls (a sketch of their use follows below):
  • Virtual GPU sending: gmacError_t gmacSend(thread_id dest)
  • Virtual GPU receiving: gmacError_t gmacReceive()
  • Virtual GPU copying: gmacError_t gmacCopy(thread_id dest)
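A hedged sketch of GPU passing between a producer and a consumer thread, using only the calls listed above. The stage1 / stage2 kernels, LENGTH, and the way consumer_id is obtained are assumptions; this slide does not say how a thread_id value is acquired, nor whether gmacReceive blocks:

#include <assert.h>
#include <gmac.h>

#define LENGTH 1024                              /* illustrative size */
__global__ void stage1(float *v, size_t n);      /* illustrative kernels */
__global__ void stage2(float *v, size_t n);

/* Assumed to hold the consumer's identifier; obtaining it is not covered here */
thread_id consumer_id;

static void *producer(void *arg)
{
    float *data = (float *)arg;
    dim3 Db(256), Dg(LENGTH / 256 + 1);

    /* Run the first stage on this thread's virtual GPU */
    stage1<<<Dg, Db>>>(data, LENGTH);
    assert(gmacThreadSynchronize() == gmacSuccess);

    /* Hand the virtual GPU over to the consumer: the computation moves,
       the data stays where it is */
    assert(gmacSend(consumer_id) == gmacSuccess);
    return NULL;
}

static void *consumer(void *arg)
{
    float *data = (float *)arg;
    dim3 Db(256), Dg(LENGTH / 256 + 1);

    /* Wait until another thread passes its virtual GPU to this one
       (blocking behavior assumed) */
    assert(gmacReceive() == gmacSuccess);

    /* Continue working on the same data from the received virtual GPU */
    stage2<<<Dg, Db>>>(data, LENGTH);
    assert(gmacThreadSynchronize() == gmacSuccess);
    return NULL;
}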
Conclusions
• Single virtual address space for CPUs and GPUs
• Advanced CUDA features used under the hood
• Automatic overlap of data communication and computation
• Access to any GPU from any CPU thread
• More performance from your application, more easily
• Go: http://adsm.googlecode.com
GMAC Unified Address Space
• When allocating memory:
  • Allocate accelerator memory
  • Allocate CPU memory at the same virtual address
GMAC Unified Address Space
• Use fixed-size segments to map accelerator memory
• Implement and export Accelerator Virtual Memory
(Figure: each accelerator's memory at device address 0x00100000 is mapped into its own fixed-size segment of the CPU address space, e.g. 0x100100000 and 0x200100000)
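A minimal sketch of how fixed-size segments could translate a host virtual address into the corresponding accelerator address; the segment size, table layout, and function name are illustrative, not GMAC's real internals:

#include <stddef.h>
#include <stdint.h>

#define SEGMENT_SIZE (1ULL << 32)   /* illustrative fixed segment size */

typedef struct {
    uintptr_t host_base;   /* start of the segment in the CPU address space */
    uintptr_t acc_base;    /* start of the mapped region in accelerator memory */
} segment_t;

static segment_t segments[16];   /* one entry per mapped segment */
static size_t num_segments;

/* Translate a host pointer to the accelerator address it is mapped to,
   or return 0 if the pointer falls outside every mapped segment */
static uintptr_t host_to_accelerator(const void *ptr)
{
    uintptr_t addr = (uintptr_t)ptr;
    for(size_t i = 0; i < num_segments; i++) {
        if(addr >= segments[i].host_base &&
           addr <  segments[i].host_base + SEGMENT_SIZE)
            return segments[i].acc_base + (addr - segments[i].host_base);
    }
    return 0;
}

Conceptually, this is the kind of CPU-to-GPU address translation that gmacPtr exposes on Fermi.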
GMAC Data Transfers
• Avoid unnecessary data copies
• Lazy-update:
  • Call: transfer modified data
  • Return: transfer when needed
GMAC Data Transfers
• Overlap CPU execution and data transfers
• Minimal on-demand transfers
• Rolling-update:
  • Memory-block size granularity
(see the sketch below)
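A rough sketch of what block-granularity, overlapped transfers look like in plain CUDA terms; the block size, function name, and use of cudaMemcpyAsync here are illustrative, not GMAC's actual implementation:

#include <cuda_runtime.h>
#include <stddef.h>

#define BLOCK_BYTES (512 * 1024)   /* illustrative block size */

/* Queue one asynchronous copy per memory block on a stream. The calls
   return immediately, so the CPU can keep executing while blocks are in
   flight; src_host should be pinned (e.g. allocated with cudaMallocHost)
   for the copies to overlap with CPU work. A kernel launched later on the
   same stream only starts once the copies it depends on have completed. */
static void copy_in_blocks(void *dst_dev, const void *src_host, size_t bytes,
                           cudaStream_t stream)
{
    for(size_t off = 0; off < bytes; off += BLOCK_BYTES) {
        size_t chunk = bytes - off < BLOCK_BYTES ? bytes - off : BLOCK_BYTES;
        cudaMemcpyAsync((char *)dst_dev + off, (const char *)src_host + off,
                        chunk, cudaMemcpyHostToDevice, stream);
    }
}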