
CS 179: GPU Programming


Presentation Transcript


  1. CS 179: GPU Programming Lecture 14: Inter-process Communication

  2. The Problem • What if we want to use GPUs across a distributed system? GPU cluster, CSIRO

  3. Distributed System • A collection of computers • Each computer has its own local memory! • Communication over a network • Communication suddenly becomes harder! (and slower!) • GPUs can’t be trivially used between computers

  4. We need some way to communicate between processes…

  5. Message Passing Interface (MPI) • A standard for message-passing • Multiple implementations exist • Standard functions that allow easy communication of data between processes • Works on non-networked systems too! • (Equivalent to memory copying on local system)

  6. MPI Functions • There are seven basic functions: • MPI_Init: initialize MPI environment • MPI_Finalize: terminate MPI environment • MPI_Comm_size: how many processes we have running • MPI_Comm_rank: the ID of our process • MPI_Isend: send data (nonblocking) • MPI_Irecv: receive data (nonblocking) • MPI_Wait: wait for request to complete

  7. MPI Functions • Some additional functions: • MPI_Barrier: wait for all processes to reach a certain point • MPI_Bcast: send data to all other processes • MPI_Reduce: receive data from all processes and reduce to a value • MPI_Send: send data (blocking) • MPI_Recv: receive data (blocking)
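
For illustration (a minimal sketch, not taken from the lecture), here is how MPI_Bcast and MPI_Barrier are typically used; the value 42 and the variable names are made up:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int param = 0;
        if (rank == 0)
            param = 42;                  // only the root knows this value initially

        // MPI_Bcast: after this call, every process has the root's copy of param
        MPI_Bcast(&param, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("process %d sees param = %d\n", rank, param);

        // MPI_Barrier: no process continues past this point until all have reached it
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }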

  8. The workings of MPI • Big idea: • We have some process ID • We know how many processes there are • A send request/receive request pair is necessary to communicate data!

  9. MPI Functions – Init • int MPI_Init( int *argc, char ***argv ) • Takes in pointers to command-line parameters (more on this later)

  10. MPI Functions – Size, Rank • int MPI_Comm_size( MPI_Comm comm, int *size ) • int MPI_Comm_rank( MPI_Comm comm, int *rank ) • Take in a “communicator” • For us, this will be MPI_COMM_WORLD – essentially, our communication group contains all processes • Take in a pointer to where we will store our size (or rank)
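
A minimal sketch of the usual "identify yourself" preamble (an illustration, not from the slides):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's ID, in 0..size-1
        printf("I am process %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }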

  11. MPI Functions – Send • int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) • Non-blocking send • Program can proceed without waiting for the send to complete! • (Can use MPI_Send for “blocking” send)

  12. MPI Functions – Send • int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) • Takes in… • A pointer to what we send • Size of data (number of elements) • Type of data (special MPI designations MPI_INT, MPI_FLOAT, etc) • Destination process ID …

  13. MPI Functions – Send • int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request) • Takes in… (continued) • A “tag” – a nonnegative integer (0 to at least 32767) • Uniquely identifies our operation – used to identify messages when many are sent simultaneously • MPI_ANY_TAG allows all possible incoming messages • A “communicator” (MPI_COMM_WORLD) • A “request” object • Holds information about our operation

  14. MPI Functions - Recv • int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) • Non-blocking receive (similar idea) • (Can use MPI_Recv for “blocking” receive)

  15. MPI Functions - Recv • int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) • Takes in… • A buffer (pointer to where received data is placed) • Size of incoming data • Type of data (special MPI designations MPI_INT, MPI_FLOAT, etc) • Sender’s process ID …

  16. MPI Functions - Recv • int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request) • Takes in… • A “tag” • Should be the same tag as the tag used to send the message! (assuming MPI_ANY_TAG not in use) • A “communicator” (MPI_COMM_WORLD) • A “request” object

  17. Blocking vs. Non-blocking • MPI_Isend and MPI_Irecv are non-blocking communications • Calling these functions returns immediately • Operation may not be finished! • Should use MPI_Wait to make sure operations are completed • Special “request” objects for tracking status • MPI_Send and MPI_Recv are blocking communications • Functions don’t return until operation is complete • Can cause deadlock! • (we won’t focus on these)
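
As a sketch of the deadlock risk mentioned above (an illustration, not from the lecture): if two processes both call MPI_Send to each other before either posts a receive, each MPI_Send may block waiting for a matching receive that is never posted. The buffer size below is made up, chosen to be large enough that the implementation is unlikely to buffer the message internally.

    #include <mpi.h>
    #include <stdio.h>

    #define N (1 << 20)   // large message: MPI_Send will likely block until matched

    int main(int argc, char **argv) {
        int rank;
        static int send_buf[N], recv_buf[N];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int other = 1 - rank;   // run with exactly 2 processes

        // Both processes send first, then receive: each blocking send waits for a
        // receive that the other process never gets to post -> likely deadlock.
        MPI_Send(send_buf, N, MPI_INT, other, 0, MPI_COMM_WORLD);
        MPI_Recv(recv_buf, N, MPI_INT, other, 0, MPI_COMM_WORLD, &status);

        printf("process %d finished (only reached if the messages were buffered)\n", rank);
        MPI_Finalize();
        return 0;
    }

Swapping the order on one of the two processes (receive first, send second), or using MPI_Isend/MPI_Irecv followed by MPI_Wait, avoids the problem.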

  18. MPI Functions - Wait • int MPI_Wait(MPI_Request *request, MPI_Status *status) • Takes in… • A “request” object corresponding to a previous operation • Indicates what we’re waiting on • A “status” object • Basically, information about incoming data
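
As a small sketch of what the status object is good for (an illustration, not from the lecture): after a receive completes, MPI_Get_count on the status tells us how many elements actually arrived, which can be fewer than the count we posted.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Request request;
        MPI_Status status;

        if (rank == 0) {
            int data[4] = {1, 2, 3, 4};
            // Send only 3 of the 4 elements to process 1
            MPI_Isend(data, 3, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
            MPI_Wait(&request, &status);
        } else if (rank == 1) {
            int recv_buf[4];
            // Post a receive that could hold up to 4 elements
            MPI_Irecv(recv_buf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
            MPI_Wait(&request, &status);   // status describes the completed receive

            int received;
            MPI_Get_count(&status, MPI_INT, &received);
            printf("process 1 received %d ints\n", received);   // prints 3
        }

        MPI_Finalize();
        return 0;
    }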

  19. MPI Functions - Reduce • int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm) • Takes in… • A “send buffer” (data obtained from every process) • A “receive buffer” (where our final result will be placed) • Number of elements in send buffer • Can reduce element-wise: array -> array • Type of data (MPI label, as before) …

  20. MPI Functions - Reduce • int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm) • Takes in… (continued) • Reducing operation (special MPI labels, e.g. MPI_SUM, MPI_MIN) • ID of the process that obtains the result • MPI communication object (as before)
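
A minimal sketch of an element-wise reduction (an illustration, not from the lecture): each process contributes a 3-element array, and process 0 receives the per-element sums.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Each process fills a small array with values depending on its rank
        double local[3] = { rank + 0.0, rank + 1.0, rank + 2.0 };
        double total[3] = { 0.0, 0.0, 0.0 };

        // Element-wise sum across all processes; the result lands on root = 0
        MPI_Reduce(local, total, 3, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("element-wise sums: %f %f %f\n", total[0], total[1], total[2]);

        MPI_Finalize();
        return 0;
    }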

  21. MPI Example

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, numprocs;
        MPI_Status status;
        MPI_Request request;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int tag = 1234;
        int source = 0;
        int destination = 1;
        int count = 1;
        int send_buffer;
        int recv_buffer;

        if (rank == source) {
            send_buffer = 5678;
            MPI_Isend(&send_buffer, count, MPI_INT, destination, tag,
                      MPI_COMM_WORLD, &request);
        }
        if (rank == destination) {
            MPI_Irecv(&recv_buffer, count, MPI_INT, source, tag,
                      MPI_COMM_WORLD, &request);
        }

        MPI_Wait(&request, &status);

        if (rank == source) {
            printf("processor %d sent %d\n", rank, send_buffer);
        }
        if (rank == destination) {
            printf("processor %d got %d\n", rank, recv_buffer);
        }

        MPI_Finalize();
        return 0;
    }

  • Two processes • Sends a number from process 0 to process 1 • Note: Both processes are running this code!
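
(For reference, not from the slides: with common MPI implementations this example is typically built with the mpicc wrapper compiler and launched with mpirun or mpiexec, e.g. "mpicc example.c -o example" followed by "mpirun -np 2 ./example", where -np 2 starts the two processes the example assumes.)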

  22. Solving Problems • Goal: To use GPUs across multiple computers! • Two levels of division: • Divide work between processes (using MPI) • Divide work within each GPU (using CUDA, as usual) • Important note: • Single-program, multiple-data (SPMD) for our purposes, i.e. running multiple copies of the same program!

  23. Multiple Computers, Multiple GPUs • Can combine MPI, CUDA streams/device selection, and single-GPU CUDA!

  24. MPI/CUDA Example • Suppose we have our “polynomial problem” again • Suppose we have n processes for n computers • Each with one GPU • (Suppose array of data is obtained through process 0) • Every process will run the following:

  25. MPI/CUDA Example

    MPI_Init
    declare rank, numberOfProcesses
    set rank with MPI_Comm_rank
    set numberOfProcesses with MPI_Comm_size

    if process rank is 0:
        array <- (Obtain our raw data)
        for all processes idx = 0 through numberOfProcesses - 1:
            Send fragment idx of the array to process ID idx using MPI_Isend
            (This fragment will have size (size of array) / numberOfProcesses)

    declare local_data
    Read into local_data for our process using MPI_Irecv
    MPI_Wait for our process to finish its receive request

    local_sum <- Process local_data with CUDA and obtain a sum

    declare global_sum
    set global_sum on process 0 with MPI_Reduce

    (use global_sum somehow)

    MPI_Finalize
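
A rough, self-contained sketch of this pattern (an illustration, not the course's code; the CUDA kernel is replaced here by a plain CPU loop standing in for the GPU sum, and the data is just made-up numbers):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int rank, numberOfProcesses;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &numberOfProcesses);

        const int N = 1024;                       // total elements (assume divisible)
        int fragment = N / numberOfProcesses;     // elements per process

        MPI_Request recv_req;
        float *local_data = malloc(fragment * sizeof(float));

        // Every process posts a receive for its fragment (including process 0,
        // which sends a message to itself below).
        MPI_Irecv(local_data, fragment, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &recv_req);

        if (rank == 0) {
            // Process 0 obtains the raw data and scatters it with MPI_Isend
            float *array = malloc(N * sizeof(float));
            for (int i = 0; i < N; i++)
                array[i] = 1.0f;                  // made-up data

            MPI_Request *send_reqs = malloc(numberOfProcesses * sizeof(MPI_Request));
            for (int idx = 0; idx < numberOfProcesses; idx++)
                MPI_Isend(array + idx * fragment, fragment, MPI_FLOAT, idx, 0,
                          MPI_COMM_WORLD, &send_reqs[idx]);
            MPI_Waitall(numberOfProcesses, send_reqs, MPI_STATUSES_IGNORE);
            free(send_reqs);
            free(array);
        }

        MPI_Wait(&recv_req, MPI_STATUS_IGNORE);

        // Stand-in for "process local_data with CUDA and obtain a sum"
        float local_sum = 0.0f;
        for (int i = 0; i < fragment; i++)
            local_sum += local_data[i];

        float global_sum = 0.0f;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %f\n", global_sum);   // prints N if N divides evenly

        free(local_data);
        MPI_Finalize();
        return 0;
    }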

  26. MPI/CUDA – Wave Equation • Recall our calculation:
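
The equation on this slide did not survive the transcript. For reference, the standard explicit finite-difference update for the 1D wave equation, presumably the calculation meant here, is (in LaTeX):

    % Assumed form; the original slide image is not reproduced here.
    y_i^{t+1} = 2\,y_i^{t} - y_i^{t-1}
              + \left(\frac{c\,\Delta t}{\Delta x}\right)^{2}
                \left(y_{i+1}^{t} - 2\,y_i^{t} + y_{i-1}^{t}\right)

where y_i^t is the displacement at grid point i and timestep t, c is the wave speed, and \Delta t, \Delta x are the time and space steps.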

  27. MPI/CUDA – Wave Equation • Big idea: Divide our data array between n processes!

  28. MPI/CUDA – Wave Equation • In reality: We have three regions of data at a time (old, current, new)

  29. MPI/CUDA – Wave Equation • Calculation for timestep t+1 uses the following data: [figure: stencil diagram over rows labeled t+1, t, t-1]

  30. MPI/CUDA – Wave Equation • Problem if we’re at the boundary of a process! [figure: stencil at a process boundary, rows t+1, t, t-1, with the needed neighboring value marked x] Where do we get x? (It’s outside our process!)

  31. Wave Equation – Simple Solution • After every time-step, each process gives its leftmost and rightmost piece of “current” data to neighbor processes! [figure: data array divided among Proc0 through Proc4]

  32. Wave Equation – Simple Solution • Pieces of data to communicate: [figure: boundary elements exchanged between neighboring processes Proc0 through Proc4]

  33. Wave Equation – Simple Solution • Can do this with MPI_Irecv, MPI_Isend, MPI_Wait: • Suppose process has rank r: • If we’re not the rightmost process: • Send data to process r+1 • Receive data from process r+1 • If we’re not the leftmost process: • Send data to process r-1 • Receive data from process r-1 • Wait on requests
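
A sketch of that exchange (an illustration, not the course's solution code; LOCAL_N and the ghost-cell layout are assumptions): each process keeps one extra "ghost" cell at each end of its chunk of the current timestep and refreshes it from its neighbors.

    #include <mpi.h>

    #define LOCAL_N 1024   // interior points per process (made-up size)

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // current[0] and current[LOCAL_N + 1] are ghost cells holding the
        // neighbors' boundary values; current[1..LOCAL_N] is our own data.
        static float current[LOCAL_N + 2];

        MPI_Request reqs[4];
        int nreqs = 0;

        if (rank < size - 1) {   // not the rightmost process: exchange with rank + 1
            MPI_Isend(&current[LOCAL_N], 1, MPI_FLOAT, rank + 1, 0,
                      MPI_COMM_WORLD, &reqs[nreqs++]);
            MPI_Irecv(&current[LOCAL_N + 1], 1, MPI_FLOAT, rank + 1, 0,
                      MPI_COMM_WORLD, &reqs[nreqs++]);
        }
        if (rank > 0) {          // not the leftmost process: exchange with rank - 1
            MPI_Isend(&current[1], 1, MPI_FLOAT, rank - 1, 0,
                      MPI_COMM_WORLD, &reqs[nreqs++]);
            MPI_Irecv(&current[0], 1, MPI_FLOAT, rank - 1, 0,
                      MPI_COMM_WORLD, &reqs[nreqs++]);
        }

        // Wait on all outstanding requests before using the ghost cells
        MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);

        MPI_Finalize();
        return 0;
    }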

  34. Wave Equation – Simple Solution • Boundary conditions: • Use MPI_Comm_rank and MPI_Comm_size • Rank 0 process will set leftmost condition • Rank (size-1) process will set rightmost condition

  35. Simple Solution – Problems • Communication can be expensive! • It’s expensive to communicate every timestep just to send one value! • Better solution: Send some m values every m timesteps!
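
For example (an illustration of the idea, not from the slides): with m = 4, each process keeps a halo of 4 extra “current” values on each side. Each local timestep can only invalidate the outermost remaining halo cell, so the process can advance 4 timesteps before the halo is used up, and it then exchanges 4 boundary values per side in a single message. The per-message latency is thus paid once every 4 timesteps instead of every timestep.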
