
Hybrid MPI/CUDA



Presentation Transcript


  1. Hybrid MPI/CUDA: Scaling accelerator code

  2. Why Hybrid CUDA?
  • CUDA is fast! (for some problems)
  • CUDA on a single card is like OpenMP (it doesn't scale beyond that card)
  • MPI can only scale so far
    • Excessive power consumption
    • Communication overhead
    • A large amount of work remains for each node
  • What if you could harness the power of multiple accelerators across multiple MPI processes?

  3. Hybrid Architectures
  • Tesla S1050 connected to nodes
    • 1 GPU, connected directly to a node
    • Al-Salam @ Earlham (as11 & as12)
  • Tesla S1070
    • A server node with 4 GPUs, typically connected via PCI-E to 2 nodes
    • Sooner @ OU has some of these
    • Lincoln @ NCSA (192 nodes)
    • Accelerator Cluster (AC) @ NCSA (32 nodes)
  [Diagram: two nodes, each with its own RAM, sharing the 4 GPUs of a Tesla unit]

  4. MPI/CUDA Approach
  • CUDA will be:
    • Doing the computational heavy lifting
    • Dictating your algorithm & parallel layout (data parallel)
  • Therefore:
    • Design the CUDA portions first
    • Use MPI to move work to each node

  5. Implementation
  • Do as much work as possible on the GPU before bringing data back to the CPU and communicating it
    • Sometimes you won't have a choice…
  • Debugging tips:
    • Develop/test/debug the one-node version first
    • Then test it with multiple nodes to verify the communication
  • The overall pattern (see the sketch after this slide):

      move data to each node
      while not done:
          copy data to GPU
          do work <<< >>>
          get new state out of GPU
          communicate with others
      aggregate results from all nodes
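A minimal sketch of that loop from the MPI side. The helpers gpu_step() and check_convergence() are hypothetical stand-ins: gpu_step() would copy the data to the GPU, launch the kernel, and copy the new state back (it belongs in a .cu file, as the compiling slide explains), and check_convergence() decides whether this process is done. Error checking is omitted.

      #include <mpi.h>

      /* Hypothetical helpers, defined elsewhere. */
      void gpu_step(float *data, int n);
      int  check_convergence(const float *data, int n);

      void simulate(float *local_data, int n)
      {
          int done = 0;
          while (!done) {
              gpu_step(local_data, n);      /* copy to GPU, do work <<< >>>, copy back */

              /* communicate with others: here, agree on whether everyone is finished */
              int local_done = check_convergence(local_data, n);
              MPI_Allreduce(&local_done, &done, 1, MPI_INT, MPI_LAND, MPI_COMM_WORLD);
          }
          /* aggregate results from all nodes afterwards, e.g. with MPI_Reduce or MPI_Gather */
      }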

  6. Multi-GPU Programming
  • A CPU thread can only have a single active context for communicating with a GPU
  • cudaGetDeviceCount(int *count)
  • cudaSetDevice(int device)
  • Be careful using MPI rank alone: the device count only counts the cards visible from each node
  • Use MPI_Get_processor_name() to determine which processes are running where (see the sketch after this slide)
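One way to put those calls together, as a sketch: gather every rank's processor name, count how many lower-ranked processes share this node, and use that node-local index to pick a device. The function name select_gpu() is illustrative, not from the slides.

      #include <mpi.h>
      #include <stdlib.h>
      #include <string.h>
      #include <cuda_runtime.h>

      void select_gpu(void)
      {
          int rank, size, len, device_count, local_rank = 0;
          char name[MPI_MAX_PROCESSOR_NAME];

          memset(name, 0, sizeof(name));            /* zero-pad so names compare cleanly */
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);
          MPI_Get_processor_name(name, &len);

          /* collect every rank's node name */
          char *all = (char *)malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
          MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                        all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

          /* my index among the ranks running on this node */
          for (int r = 0; r < rank; r++)
              if (strcmp(all + r * MPI_MAX_PROCESSOR_NAME, name) == 0)
                  local_rank++;

          cudaGetDeviceCount(&device_count);        /* cards visible from THIS node */
          cudaSetDevice(local_rank % device_count); /* one process per visible GPU */
          free(all);
      }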

  7. Compiling
  • CUDA needs nvcc, MPI needs mpicc
  • Dirty trick: wrap mpicc with nvcc
    • nvcc processes the .cu files and sends the rest to its wrapped compiler
  • The kernel, kernel invocations, and cudaMalloc are all best off in a .cu file somewhere
  • MPI calls should be in .c files
  • There are workarounds, but this is the simplest approach (the file split is sketched after this slide):

      nvcc --compiler-bindir mpicc main.c kernel.cu
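As a sketch of that split, a hypothetical kernel.cu could keep the kernel, its launch, and cudaMalloc behind a plain C wrapper that the MPI code in main.c simply declares and calls; the kernel and wrapper names here are made up.

      /* kernel.cu -- compiled by nvcc */
      #include <cuda_runtime.h>

      __global__ void scale_kernel(float *d, int n, float s)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n)
              d[i] *= s;                    /* trivial stand-in for the real work */
      }

      extern "C" void gpu_scale(float *h, int n, float s)
      {
          float *d;
          cudaMalloc((void **)&d, n * sizeof(float));
          cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
          scale_kernel<<<(n + 255) / 256, 256>>>(d, n, s);
          cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
          cudaFree(d);
      }

      /* main.c -- compiled by the wrapped mpicc; it only needs the prototype
         void gpu_scale(float *h, int n, float s); and makes the MPI calls. */

The extern "C" matters because nvcc compiles .cu files as C++: without it the wrapper's name would be mangled and the .c file could not link against it.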

  8. Executing
  • Typically one MPI process per available GPU
  • On Sooner (OU), each node has 2 GPUs available, so ppn should be 2:

      #BSUB -R "select[cuda > 0]"
      #BSUB -R "rusage[cuda=2]"
      #BSUB -l nodes=1:ppn=2

  • On AC, each node has 4 GPUs, and the number of GPUs used corresponds to the number of processors requested, so this requests a total of 8 GPUs on 2 nodes:

      #BSUB -l nodes=2:tesla:cuda3.2:ppn=4
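For a concrete picture (not taken from the slide), a minimal job-script sketch assuming a PBS-style scheduler and mpirun as the launcher; the resource directives above are exactly the part that varies from site to site, and hybrid_area is a hypothetical executable name.

      #!/bin/bash
      #PBS -l nodes=2:ppn=4            # 2 nodes x 4 processes = one MPI process per GPU
      #PBS -l walltime=00:10:00
      cd $PBS_O_WORKDIR                # run from the submission directory
      mpirun -np 8 ./hybrid_area       # 8 processes, each selecting its own GPU at startup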

  9. Hybrid CUDA Lab
  • We already have Area Under a Curve code for MPI and CUDA independently
  • You can write a hybrid code that has each GPU calculate a portion of the area, then use MPI to combine the subtotals into the complete area (a sketch follows this slide)
  • Otherwise, feel free to take any code we've used so far and experiment!
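A minimal sketch of how the hybrid lab could fit together, assuming a hypothetical routine gpu_area() (the CUDA reduction from the earlier lab, wrapped in a .cu file as on the compiling slide) that integrates the curve over an interval on this rank's GPU.

      #include <stdio.h>
      #include <mpi.h>

      double gpu_area(double a, double b, long n);   /* hypothetical, defined in a .cu file */

      int main(int argc, char **argv)
      {
          int rank, size;
          double a = 0.0, b = 1.0, total = 0.0;
          long n = 1 << 24;                          /* rectangles per process */

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          double width = (b - a) / size;             /* each GPU integrates one slice */
          double local = gpu_area(a + rank * width, a + (rank + 1) * width, n);

          /* combine the per-GPU subtotals into the complete area */
          MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
          if (rank == 0)
              printf("area = %f\n", total);

          MPI_Finalize();
          return 0;
      }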
