
An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


Presentation Transcript


  1. An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems
  Isaac Gelado, Javier Cabezas, John Stone, Sanjay Patel, Nacho Navarro and Wen-mei Hwu
  ASPLOS 2010 -- Pittsburgh

  2. 1. Introduction: Heterogeneous Computing
  • Heterogeneous parallel systems:
    • CPU: sequential, control-intensive code
    • Accelerators: massively data-parallel code
  • Existing programming models are DMA based:
    • Explicit memory copies
    • Programmer-managed memory coherence

  3. Outline
  • Introduction
  • Motivation
  • ADSM: Asymmetric Distributed Shared Memory
  • GMAC: Global Memory for ACcelerators
  • Experimental Results
  • Conclusions

  4. 2.1 Motivation: Reference System
  • Reference system: a multi-core CPU with system memory and a GPU-like accelerator with device memory, connected through a PCIe bus
  • Device memory: high bandwidth, weak consistency, large page size
  • System memory: low latency, strong consistency, small page size

  5. 2.2 Motivation: Memory Requirements
  • Accelerator memory keeps growing (e.g. 6GB NVIDIA Fermi, 16GB PowerXCell 8i)
  • High memory bandwidth requirements
  • Systems are not fully coherent:
    • Long-latency coherence traffic
    • Different coherence protocols

  6. 2.3 Motivation: DMA-Based Programming
  • CUDA sample code
  • Duplicated pointers: foo on the CPU, dev_foo on the GPU
  • Explicit coherence management

  void compute(FILE *file, int size) {
      float *foo, *dev_foo;
      foo = malloc(size);                                       /* CPU copy */
      fread(foo, size, 1, file);
      cudaMalloc(&dev_foo, size);                               /* duplicated pointer: GPU copy */
      cudaMemcpy(dev_foo, foo, size, cudaMemcpyHostToDevice);   /* explicit coherence management */
      kernel<<<Dg, Db>>>(dev_foo, size);
      cudaMemcpy(foo, dev_foo, size, cudaMemcpyDeviceToHost);   /* explicit coherence management */
      cpuComputation(foo);
      cudaFree(dev_foo);
      free(foo);
  }

  7. 3.1 ADSM: Unified Virtual Address Space
  • Unified virtual shared address space
  • CPU: accesses both system and accelerator memory
  • Accelerator: accesses only its own memory
  • Under ADSM, both use the same virtual address when referencing a shared object

  8. 3.2 ADSM: Simplified Code
  • Single pointer shared by the CPU and the GPU
  • Simpler CPU code than in DMA-based programming models
  • Hardware-independent code

  void compute(FILE *file, int size) {
      float *foo;
      foo = adsmMalloc(size);      /* single pointer; data assignment */
      fread(foo, size, 1, file);   /* legacy support; peer DMA */
      kernel<<<Dg, Db>>>(foo, size);
      cpuComputation(foo);
      adsmFree(foo);
  }

  9. 3.3 ADSM: Memory Distribution
  • Asymmetric Distributed Shared Memory principles:
    • The CPU accesses objects in accelerator memory, but not vice versa
    • All coherence actions are performed by the CPU
  • Thrashing is unlikely to happen:
    • Synchronization variables: interrupt-based and dedicated hardware
    • False sharing: data-object sharing granularity

  10. 3.4 ADSM: Consistency and Coherence
  • Release consistency:
    • Consistency only relevant from the CPU perspective
    • Implicit release/acquire at accelerator call/return (illustrated in the sketch below)
  • Memory coherence:
    • Data ownership information enables eager data transfers
    • The CPU maintains coherence
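
  The implicit release/acquire can be pictured as a wrapper the runtime executes around every accelerator call. The sketch below is only a conceptual illustration; the helper functions are hypothetical placeholders, not GMAC code.

  /* Conceptual sketch of ADSM consistency around an accelerator call.
   * All helpers are hypothetical placeholders. */
  static void flush_dirty_blocks(void)    { /* copy CPU-modified blocks to the accelerator */ }
  static void wait_for_accelerator(void)  { /* block until the kernel has finished */ }
  static void invalidate_cpu_copies(void) { /* mark CPU copies stale; refetch on next read */ }

  void accelerator_call(void (*launch_kernel)(void)) {
      flush_dirty_blocks();      /* implicit release: the accelerator sees a consistent view */
      launch_kernel();           /* the accelerator only touches its own memory */
      wait_for_accelerator();
      invalidate_cpu_copies();   /* implicit acquire: the CPU re-reads what the kernel wrote */
  }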

  11. 4. Global Memory for ACcelerators (GMAC)
  • ADSM implementation
  • User-level shared library
  • GNU/Linux systems
  • NVIDIA CUDA GPUs

  12. 4.1 GMAC: Overall Design
  • Layered design: CUDA-like front-end, memory manager (different policies), kernel scheduler (FIFO), operating system abstraction layer, accelerator abstraction layer (CUDA); see the interface sketch below
  • Multiple memory consistency protocols
  • Operating system and accelerator independent code
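
  As a rough illustration of what this layered split can look like, the hypothetical interfaces below separate OS-specific and accelerator-specific operations; the names are invented for this sketch and are not GMAC's internal API.

  #include <stddef.h>

  /* Hypothetical abstraction layers mirroring the layered design above. */
  typedef struct {                     /* operating system abstraction layer */
      void *(*map)(void *addr, size_t size);
      void  (*protect)(void *addr, size_t size, int prot);
  } OsLayer;

  typedef struct {                     /* accelerator abstraction layer (CUDA back-end) */
      void *(*alloc)(size_t size);
      void  (*to_accelerator)(void *dst, const void *src, size_t size);
      void  (*to_host)(void *dst, const void *src, size_t size);
  } AccLayer;

  /* The memory manager and kernel scheduler are written only against these
   * interfaces, which keeps them OS- and accelerator-independent. */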

  13. 4.2 GMAC: Unified Address Space
  • Virtual address space formed by the GPU and system physical memories
  • The GPU memory address range cannot be selected
  • Allocate the same virtual memory address range on both the GPU and the CPU (see the sketch below)
  • Accelerator virtual memory would ease this process
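
  One way to picture this allocation step is to allocate on the GPU first and then ask the OS for a host mapping at the same address. The sketch below is a hedged illustration, not GMAC source; the mmap flags and the error handling are assumptions.

  #include <cuda_runtime.h>
  #include <sys/mman.h>
  #include <stddef.h>

  /* Sketch: allocate accelerator memory, then try to reserve the same
   * virtual range in the CPU address space so one pointer names both. */
  void *unified_alloc(size_t size) {
      void *dev_ptr = NULL;
      if (cudaMalloc(&dev_ptr, size) != cudaSuccess)
          return NULL;

      /* mmap with a hint address: only succeeds at dev_ptr if that range
       * is still free on the host (the failure case slide 20 discusses). */
      void *host_ptr = mmap(dev_ptr, size, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (host_ptr != dev_ptr) {
          if (host_ptr != MAP_FAILED)
              munmap(host_ptr, size);
          cudaFree(dev_ptr);
          return NULL;              /* fall back to a translation function */
      }
      return host_ptr;              /* valid on the CPU; same value passed to kernels */
  }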

  14. 4.3 GMAC: Coherence Protocols
  • Batch-update: copy all shared objects
  • Lazy-update: copy only modified / needed shared objects
    • Data-object granularity
    • Detects CPU read/write accesses to shared objects (see the page-protection sketch below)
  • Rolling-update: copy only modified / needed memory
    • Memory-block granularity
    • Fixed maximum number of modified blocks in system memory; data is flushed when the maximum is reached
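
  As a hedged illustration of how CPU accesses to shared objects can be detected on Linux, the minimal sketch below protects a page and catches the first write in a SIGSEGV handler. It is not GMAC's code; the names and the single-object setup are invented for the example.

  #include <signal.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  static char  *shared_obj;    /* tracked shared object */
  static size_t obj_size;
  static volatile int dirty;   /* set on the first CPU write */

  /* Fault handler: a write hit the read-only object, so record the
   * modification and unprotect the page so the write can proceed. */
  static void on_fault(int sig, siginfo_t *info, void *ctx) {
      char *addr = (char *)info->si_addr;
      (void)sig; (void)ctx;
      if (addr >= shared_obj && addr < shared_obj + obj_size) {
          dirty = 1;
          mprotect(shared_obj, obj_size, PROT_READ | PROT_WRITE);
      } else {
          _exit(1);              /* unrelated fault */
      }
  }

  int main(void) {
      struct sigaction sa;
      obj_size   = (size_t)sysconf(_SC_PAGESIZE);
      shared_obj = mmap(NULL, obj_size, PROT_READ,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

      memset(&sa, 0, sizeof(sa));
      sa.sa_sigaction = on_fault;
      sa.sa_flags     = SA_SIGINFO;
      sigaction(SIGSEGV, &sa, NULL);

      shared_obj[0] = 1;               /* faults once; the handler marks the block dirty */
      printf("dirty = %d\n", dirty);   /* prints 1 */
      return 0;
  }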

  15. 5.1 Results: GMAC vs. CUDA
  • Batch-update overheads:
    • Output data is copied on every call
    • Unused data is copied
  • Similar performance for CUDA, Lazy-update and Rolling-update

  16. 5.2 Results: Lazy vs. Rolling on 3D Stencil
  • Extra data copies for small data objects
  • Trade-off between bandwidth and page-fault overhead

  17. 6. Conclusions
  • A unified virtual shared address space simplifies programming of heterogeneous systems
  • Asymmetric Distributed Shared Memory:
    • The CPU accesses accelerator memory, but not vice versa
    • Coherence actions are executed only by the CPU
  • Experimental results show no performance degradation
  • Memory translation in accelerators is key to implementing ADSM efficiently

  18. Thank you for your attention
  Eager to start using GMAC? http://code.google.com/p/adsm/
  igelado@ac.upc.edu  adsm-users@googlegroups.com

  19. Backup Slides

  20. 4.4 GMAC: Memory Mapping
  • Allocation might fail if the virtual address range is already in use on the host
  • Software workaround: allocate a different address range and provide a translation function (gmacSafePtr()); see the usage sketch below
  • Hardware solution: implement virtual memory in the GPU
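
  A hypothetical usage sketch of the software workaround is shown below; gmacSafePtr() is named on this slide, but its exact signature and the surrounding fragment are assumptions rather than documented GMAC usage.

  /* Sketch: when host and accelerator ranges differ, translate the
   * CPU-visible pointer before handing it to the accelerator. */
  float *foo = adsmMalloc(size);               /* CPU-visible address */
  fread(foo, size, 1, file);                   /* the CPU uses foo directly */
  kernel<<<Dg, Db>>>(gmacSafePtr(foo), size);  /* GPU-visible address */
  adsmFree(foo);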

  21. 4.5 GMAC: Protocol States
  • Protocol states: Invalid, Read-only, Dirty
  • Batch-update events: call / return
  • Lazy-update events: call / return, read / write
  • Rolling-update events: call / return, read / write, flush
  [State-transition diagram over Invalid, Read-only and Dirty omitted]
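
  A hedged sketch of how such a per-block state machine might be written is shown below; the specific transitions are one plausible reading of the protocol description, not GMAC source.

  /* Hypothetical per-block coherence state machine. */
  typedef enum { INVALID, READ_ONLY, DIRTY } BlockState;
  typedef enum { CPU_READ, CPU_WRITE, ACC_CALL, ACC_RETURN, FLUSH } Event;

  BlockState next_state(BlockState s, Event e) {
      switch (e) {
      case CPU_READ:   return s == INVALID ? READ_ONLY : s;  /* fetch block from the accelerator */
      case CPU_WRITE:  return DIRTY;                          /* CPU copy is now the newest */
      case ACC_CALL:   return INVALID;    /* dirty blocks are sent to the accelerator first */
      case ACC_RETURN: return s;          /* CPU copies are refreshed lazily, on demand */
      case FLUSH:      return s == DIRTY ? READ_ONLY : s;     /* rolling-update eager copy */
      }
      return s;
  }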

  22. 4.6 GMAC: Rolling vs. Lazy
  • Batch-update: transfer on kernel call
  • Rolling-update: transfer while the CPU computes

  23. 5.3 Results: Break-down of Execution

  24. 5.4 Results: Rolling Size vs. Block Size
  • A small rolling size leads to performance aberrations
  • Prefer relatively large rolling sizes
  • No appreciable effect on most benchmarks

  25. 6.1 Conclusions: Wish-list
  • GPU anonymous memory mappings:
    • GPU-to-CPU mappings never fail
    • Dynamic memory re-allocations
  • GPU dynamic pinned memory:
    • No intermediate data copies on flush
  • Peer DMA:
    • Speeds up I/O operations
    • No intermediate copies on GPU-to-GPU transfers
