GMAC: Global Memory for Accelerators
Isaac Gelado
PUMPS Summer School, Barcelona
Vector Addition CUDA code
• Vector addition: a really simple kernel code
• But, what about the CPU code?
• GMAC is a complement to the CUDA run-time
  • Simplifies the CPU code
  • Exploits advanced CUDA features for free

__global__ void vector(float *c, float *a, float *b, size_t size)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if(idx < size)
        c[idx] = a[idx] + b[idx];
}
Some easy CUDA code (I)
• Read from disk, transfer to GPU and compute

int main(int argc, char *argv[])
{
    float *h_a, *h_b, *h_c, *d_a, *d_b, *d_c;
    size_t size = LENGTH * sizeof(float);

    assert((h_a = (float *)malloc(size)) != NULL);
    assert((h_b = (float *)malloc(size)) != NULL);
    assert((h_c = (float *)malloc(size)) != NULL);
    assert(cudaMalloc((void **)&d_a, size) == cudaSuccess);
    assert(cudaMalloc((void **)&d_b, size) == cudaSuccess);
    assert(cudaMalloc((void **)&d_c, size) == cudaSuccess);

    read_file(argv[A], h_a);
    read_file(argv[B], h_b);

    assert(cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice) == cudaSuccess);
    assert(cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice) == cudaSuccess);
Some easy CUDA code (and II)
• Read from disk, transfer to GPU and compute

    dim3 Db(BLOCK_SIZE);
    dim3 Dg(LENGTH / BLOCK_SIZE);
    if(LENGTH % BLOCK_SIZE) Dg.x++;

    vector<<<Dg, Db>>>(d_c, d_a, d_b, LENGTH);
    assert(cudaThreadSynchronize() == cudaSuccess);

    assert(cudaMemcpy(h_c, d_c, LENGTH * sizeof(float),
                      cudaMemcpyDeviceToHost) == cudaSuccess);
    save_file(argv[C], h_c);

    free(h_a); cudaFree(d_a);
    free(h_b); cudaFree(d_b);
    free(h_c); cudaFree(d_c);

    return 0;
}
Some really easy GMAC code

int main(int argc, char *argv[])
{
    float *a, *b, *c;
    size_t size = LENGTH * sizeof(float);

    assert(gmacMalloc((void **)&a, size) == gmacSuccess);
    assert(gmacMalloc((void **)&b, size) == gmacSuccess);
    assert(gmacMalloc((void **)&c, size) == gmacSuccess);

    read_file(argv[A], a);
    read_file(argv[B], b);

    dim3 Db(BLOCK_SIZE);
    dim3 Dg(LENGTH / BLOCK_SIZE);
    if(LENGTH % BLOCK_SIZE) Dg.x++;

    vector<<<Dg, Db>>>(c, a, b, LENGTH);
    assert(gmacThreadSynchronize() == gmacSuccess);

    save_file(argv[C], c);

    gmacFree(a); gmacFree(b); gmacFree(c);
    return 0;
}

There is no memory copy: the same pointers are used by the CPU code and by the kernel.
Getting GMAC
• GMAC is at http://adsm.googlecode.com/
• Debian / Ubuntu binary and development .deb packages
• UNIX (also Mac OS X) source code package
• Experimental versions from the Mercurial repository
Outline
• Introduction
• GMAC Memory Model
  • Asymmetric Memory
  • Global Memory
• GMAC Execution Model
  • Multi-threading
  • Inter-thread communication
• Conclusions
GMAC Memory Model
• Unified CPU / GPU virtual address space
• Asymmetric address space visibility: shared data is visible to both the CPU and the GPU, while CPU-private data is visible only to the CPU
GMAC Consistency Model
• Implicit acquire / release primitives at accelerator call / return boundaries: the CPU releases shared data when it calls the accelerator and acquires it again when the call returns
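As a rough sketch of what these implicit boundaries mean for application code (the scale kernel, LENGTH, and the initialization loop below are illustrative, not part of GMAC):

#include <assert.h>
#include <stdio.h>
#include <gmac.h>

#define LENGTH 1024   /* illustrative size */

__global__ void scale(float *v, float f, size_t n)
{
    size_t i = threadIdx.x + blockIdx.x * blockDim.x;
    if(i < n) v[i] *= f;
}

int main(void)
{
    float *v;
    assert(gmacMalloc((void **)&v, LENGTH * sizeof(float)) == gmacSuccess);

    /* The CPU owns the shared data and writes it directly */
    for(size_t i = 0; i < LENGTH; i++) v[i] = (float)i;

    /* The kernel call acts as an implicit release: CPU modifications
       become visible to the accelerator before the kernel runs */
    scale<<<LENGTH / 256 + 1, 256>>>(v, 2.0f, LENGTH);

    /* Returning from the accelerator acts as an implicit acquire: once the
       call has completed, the CPU reads the results through the same pointer */
    assert(gmacThreadSynchronize() == gmacSuccess);
    printf("v[0] = %f\n", v[0]);

    gmacFree(v);
    return 0;
}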
GMAC Memory API
• Memory allocation
  gmacError_t gmacMalloc(void **ptr, size_t size)
  • ptr: allocated memory address (returned by reference)
  • size: size of the data to be allocated
  • Returns: error code, gmacSuccess if no error
• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    gmacError_t error;
    if((error = gmacMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess)
        FATAL("Error allocating memory %s", gmacErrorString(error));
    . . .
}
GMAC Memory API
• Memory release
  gmacError_t gmacFree(void *ptr)
  • ptr: memory address to be released
  • Returns: error code, gmacSuccess if no error
• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    gmacError_t error;
    if((error = gmacMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess)
        FATAL("Error allocating memory %s", gmacErrorString(error));
    . . .
    gmacFree(foo);
}
GMAC Memory API
• Memory translation (workaround for Fermi)
  void *gmacPtr(void *ptr)
  template<typename T> T *gmacPtr(T *ptr)
  • ptr: CPU memory address
  • Returns: GPU memory address
• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    . . .
    kernel<<<Dg, Db>>>(gmacPtr(buffer), size);
    . . .
}
GMAC Execution Example
• Get advanced CUDA features for free
  • Asynchronous data transfers
  • Pinned memory
GMAC Global Memory
• Data accessible by all accelerators, but owned by the CPU
(Figure: a single CPU-owned memory region shared with every GPU)
GMAC Global Memory API
• Memory allocation
  gmacError_t gmacGlobalMalloc(void **ptr, size_t size)
  • ptr: allocated memory address (returned by reference)
  • size: size of the data to be allocated
  • Returns: error code, gmacSuccess if no error
• Example usage

#include <gmac.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    gmacError_t error;
    if((error = gmacGlobalMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess)
        FATAL("Error allocating memory %s", gmacErrorString(error));
    . . .
}
GMAC and Multi-threading
• In the past: one thread, one CPU
• In GMAC, one thread:
  • One CPU
  • One GPU
• A thread runs on the GPU or on the CPU, but not on both at the same time
• Create threads using what you already know: pthread_create(...) (see the sketch below)
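A minimal sketch of per-thread virtual GPUs, assuming the vector kernel, LENGTH, BLOCK_SIZE, and read_file from the earlier example; the worker function and the two-thread setup are illustrative:

#include <assert.h>
#include <pthread.h>
#include <gmac.h>

static void *worker(void *arg)
{
    const char *filename = (const char *)arg;
    float *a, *b, *c;
    size_t size = LENGTH * sizeof(float);

    /* Allocations and kernel launches below use this thread's virtual GPU */
    assert(gmacMalloc((void **)&a, size) == gmacSuccess);
    assert(gmacMalloc((void **)&b, size) == gmacSuccess);
    assert(gmacMalloc((void **)&c, size) == gmacSuccess);

    read_file(filename, a);
    for(size_t i = 0; i < LENGTH; i++) b[i] = 1.0f;

    dim3 Db(BLOCK_SIZE);
    dim3 Dg(LENGTH / BLOCK_SIZE);
    if(LENGTH % BLOCK_SIZE) Dg.x++;
    vector<<<Dg, Db>>>(c, a, b, LENGTH);
    assert(gmacThreadSynchronize() == gmacSuccess);

    gmacFree(a); gmacFree(b); gmacFree(c);
    return NULL;
}

int main(int argc, char *argv[])
{
    pthread_t t[2];
    assert(argc >= 3);
    for(int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, argv[i + 1]);
    for(int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}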
GMAC and Multi-threading
• Virtual memory accessibility:
  • Complete address space in CPU code
  • Partial address space in GPU code
  (illustrated in the sketch below)
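A hedged illustration of that asymmetry; the produce kernel, LENGTH, the plain global pointer, and the omitted thread synchronization are illustrative assumptions, not GMAC API:

#include <assert.h>
#include <gmac.h>

#define LENGTH 1024                             /* illustrative size */
__global__ void produce(float *v, size_t n);    /* illustrative kernel */

float *buffer;   /* allocated on thread A's virtual GPU */

/* Thread A: allocates the buffer and fills it on its own virtual GPU */
static void *thread_a(void *arg)
{
    assert(gmacMalloc((void **)&buffer, LENGTH * sizeof(float)) == gmacSuccess);
    produce<<<LENGTH / 256 + 1, 256>>>(buffer, LENGTH);
    assert(gmacThreadSynchronize() == gmacSuccess);
    return NULL;
}

/* Thread B: its CPU code sees the complete address space, so it may read the
   buffer directly; kernels launched by thread B, however, only see the
   partial address space mapped to B's own virtual GPU, so they cannot
   dereference this pointer */
static void *thread_b(void *arg)
{
    float first = buffer[0];   /* fine from CPU code (after proper synchronization) */
    (void)first;
    return NULL;
}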
GPU Passing and Copying
• GPU passing:
  • Send the thread's virtual GPU to another thread
  • Do not move data, move computation
• API calls (a sketch of their use follows below):
  • Virtual GPU sending: gmacError_t gmacSend(thread_id dest)
  • Virtual GPU receiving: gmacError_t gmacReceive()
  • Virtual GPU copying: gmacError_t gmacCopy(thread_id dest)
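A hedged sketch of GPU passing between a producer and a consumer thread, using only the calls listed above. The stage1 / stage2 kernels, LENGTH, and the way consumer_id is obtained are assumptions; this slide does not say how a thread_id value is acquired, nor whether gmacReceive blocks:

#include <assert.h>
#include <gmac.h>

#define LENGTH 1024                              /* illustrative size */
__global__ void stage1(float *v, size_t n);      /* illustrative kernels */
__global__ void stage2(float *v, size_t n);

/* Assumed to hold the consumer's identifier; obtaining it is not covered here */
thread_id consumer_id;

static void *producer(void *arg)
{
    float *data = (float *)arg;
    dim3 Db(256), Dg(LENGTH / 256 + 1);

    /* Run the first stage on this thread's virtual GPU */
    stage1<<<Dg, Db>>>(data, LENGTH);
    assert(gmacThreadSynchronize() == gmacSuccess);

    /* Hand the virtual GPU over to the consumer: the computation moves,
       the data stays where it is */
    assert(gmacSend(consumer_id) == gmacSuccess);
    return NULL;
}

static void *consumer(void *arg)
{
    float *data = (float *)arg;
    dim3 Db(256), Dg(LENGTH / 256 + 1);

    /* Wait until another thread passes its virtual GPU to this one
       (blocking behavior assumed) */
    assert(gmacReceive() == gmacSuccess);

    /* Continue working on the same data from the received virtual GPU */
    stage2<<<Dg, Db>>>(data, LENGTH);
    assert(gmacThreadSynchronize() == gmacSuccess);
    return NULL;
}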
Conclusions
• Single virtual address space for CPUs and GPUs
• Advanced CUDA features used under the hood
• Automatic overlap of data communication and computation
• Access to any GPU from any CPU thread
• More performance from your application, more easily
• Go: http://adsm.googlecode.com
GMAC Unified Address Space
• When allocating memory:
  • Allocate accelerator memory
  • Allocate CPU memory at the same virtual address
GMAC Unified Address Space
• Use fixed-size segments to map accelerator memory
• Implement and export Accelerator Virtual Memory
(Figure: each accelerator's memory at device address 0x00100000 is mapped into its own fixed-size segment of the CPU address space, e.g. 0x100100000 and 0x200100000)
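A minimal sketch of how fixed-size segments could translate a host virtual address into the corresponding accelerator address; the segment size, table layout, and function name are illustrative, not GMAC's real internals:

#include <stddef.h>
#include <stdint.h>

#define SEGMENT_SIZE (1ULL << 32)   /* illustrative fixed segment size */

typedef struct {
    uintptr_t host_base;   /* start of the segment in the CPU address space */
    uintptr_t acc_base;    /* start of the mapped region in accelerator memory */
} segment_t;

static segment_t segments[16];   /* one entry per mapped segment */
static size_t num_segments;

/* Translate a host pointer to the accelerator address it is mapped to,
   or return 0 if the pointer falls outside every mapped segment */
static uintptr_t host_to_accelerator(const void *ptr)
{
    uintptr_t addr = (uintptr_t)ptr;
    for(size_t i = 0; i < num_segments; i++) {
        if(addr >= segments[i].host_base &&
           addr <  segments[i].host_base + SEGMENT_SIZE)
            return segments[i].acc_base + (addr - segments[i].host_base);
    }
    return 0;
}

Conceptually, this is the kind of CPU-to-GPU address translation that gmacPtr exposes on Fermi.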
GMAC Data Transfers
• Avoid unnecessary data copies
• Lazy-update:
  • Call: transfer modified data
  • Return: transfer when needed
GMAC Data Transfers
• Overlap CPU execution and data transfers
• Minimal on-demand transfers
• Rolling-update:
  • Memory-block size granularity
(see the sketch below)
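A rough sketch of what block-granularity, overlapped transfers look like in plain CUDA terms; the block size, function name, and use of cudaMemcpyAsync here are illustrative, not GMAC's actual implementation:

#include <cuda_runtime.h>
#include <stddef.h>

#define BLOCK_BYTES (512 * 1024)   /* illustrative block size */

/* Queue one asynchronous copy per memory block on a stream. The calls
   return immediately, so the CPU can keep executing while blocks are in
   flight; src_host should be pinned (e.g. allocated with cudaMallocHost)
   for the copies to overlap with CPU work. A kernel launched later on the
   same stream only starts once the copies it depends on have completed. */
static void copy_in_blocks(void *dst_dev, const void *src_host, size_t bytes,
                           cudaStream_t stream)
{
    for(size_t off = 0; off < bytes; off += BLOCK_BYTES) {
        size_t chunk = bytes - off < BLOCK_BYTES ? bytes - off : BLOCK_BYTES;
        cudaMemcpyAsync((char *)dst_dev + off, (const char *)src_host + off,
                        chunk, cudaMemcpyHostToDevice, stream);
    }
}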