  1. Lesson 12 Collective Operations Dr. Stephen Tse stse@forbin.qc.edu 908-872-2108

  2. Collective Communication
  • A collective communication is a communication pattern that involves all the processes in a communicator; it involves more than two processes.
  • Different collective communication operations:
    • Broadcast
    • Gather and Scatter
    • Allgather
    • Alltoall

  3. Consider the following Arrangement
  Rows are processes, columns are data items:

                    ----> Data
            0 | A0 A1 A2 A3 A4 . . .
            1 |
            2 |
            : |
            n |
              V
         Processes

  4. Broadcast
  • A broadcast is a collective communication in which a single process sends the same data to every process in the communicator.

                    ----> Data
            | A0                  A0
            |       =======>      A0
            |        bcast        A0
            |                     A0
            V
         Processes
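  A minimal sketch of a broadcast in C with MPI (the variable names and the value 100 are illustrative, not taken from the lesson):

    /* Broadcast sketch: process 0 owns an integer n and sends the same
     * value to every process in MPI_COMM_WORLD. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, n = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) n = 100;                        /* only the root has the data */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* now every process has n    */
        printf("process %d: n = %d\n", rank, n);
        MPI_Finalize();
        return 0;
    }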

  5. Matrix-Vector Product
  • If A = (aij) is an m x n matrix and x = (x0, x1, ..., xn-1)^T is an n-dimensional vector, then the matrix-vector product is y = Ax, where each component yi is the dot product of row i of A with x: yi = ai0*x0 + ai1*x1 + ... + ai,n-1*xn-1.
  • (Figure: A, x, and y distributed by rows across Process 0 through Process 3.)
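  For reference, a plain serial version of y = Ax, written as a sketch with illustrative names (not code from the lesson):

    /* Serial matrix-vector product: y[i] is the dot product of row i of A with x. */
    void mat_vect(int m, int n, const double A[], const double x[], double y[]) {
        for (int i = 0; i < m; i++) {
            y[i] = 0.0;
            for (int j = 0; j < n; j++)
                y[i] += A[i * n + j] * x[j];   /* A stored in row-major order */
        }
    }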

  6. A Gather
  • A gather is a collective communication in which a root process receives data from every other process.
  • In order to form the dot product of each row of A with x, we need to gather all of x onto each process (see the sketch below).
  • (Figure: x distributed by blocks: Process 0 holds x0, Process 1 holds x1, Process 2 holds x2, Process 3 holds x3.)
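  A hedged sketch of MPI_Gather collecting one block of x from each process onto the root; the block size BLOCK and the names are assumptions made for the example:

    #include <mpi.h>

    #define BLOCK 25   /* assumed block size: each process owns BLOCK entries of x */

    /* Gather the distributed pieces of x into full_x on the root (rank 0).
     * local_x has BLOCK entries on every process; full_x needs p*BLOCK
     * entries on the root and may be NULL elsewhere. */
    void gather_x(double local_x[], double full_x[], MPI_Comm comm) {
        MPI_Gather(local_x, BLOCK, MPI_DOUBLE,   /* what each process sends    */
                   full_x,  BLOCK, MPI_DOUBLE,   /* where the root receives it */
                   0, comm);                     /* root rank, communicator    */
    }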

  7. A Scatter
  • A scatter is a collective communication in which a fixed root process sends a distinct collection of data to every other process.
  • Scatter the rows of A across the processes (see the sketch below).
  • (Figure: after the scatter, Process 0 holds row 0 of A: a00 a01 a02 a03, Process 1 holds row 1, and so on.)
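  A hedged sketch of MPI_Scatter distributing the rows of A, assuming the matrix is N x N and the communicator has exactly N processes (both assumptions for brevity):

    #include <mpi.h>

    #define N 4   /* assumed matrix order: A is N x N, one row per process */

    /* The root (rank 0) holds the full N x N matrix A in row-major order;
     * after the call, every process holds its own row in local_row. */
    void scatter_rows(double A[], double local_row[], MPI_Comm comm) {
        MPI_Scatter(A,         N, MPI_DOUBLE,   /* root sends N entries per process */
                    local_row, N, MPI_DOUBLE,   /* each process receives one row    */
                    0, comm);
    }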

  8. Gather and Scatter

                    ----> Data
            | A0 A1 A2 A3 A4                A0
            |                  ====>        A1
            |                 Scatter       A2
            |                  <====        A3
            |                 Gather        A4
            V
         Processes

  9. Allgather

                    ----> Data
            | A0             A0 B0 C0 D0 E0
            | B0             A0 B0 C0 D0 E0
            | C0             A0 B0 C0 D0 E0
            | D0             A0 B0 C0 D0 E0
            | E0             A0 B0 C0 D0 E0
            V
         Processes

  • Simultaneously gather all of x onto each process.
  • An allgather gathers a distributed array to every process.
  • It gathers the contents of each process's send_data into each process's recv_data.
  • After the function returns, every process in the communicator has the result stored in the memory referenced by its recv_data parameter.
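  A hedged sketch of MPI_Allgather assembling the full vector on every process; BLOCK and the names are assumptions for the example:

    #include <mpi.h>

    #define BLOCK 25   /* assumed block size per process */

    /* Every process contributes BLOCK entries of x; after the call every
     * process holds the complete vector of p*BLOCK entries in full_x. */
    void allgather_x(double local_x[], double full_x[], MPI_Comm comm) {
        MPI_Allgather(local_x, BLOCK, MPI_DOUBLE,
                      full_x,  BLOCK, MPI_DOUBLE,
                      comm);
    }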

  10. Alltoall (Transpose)

                    ----> Data
            | A0 A1 A2 A3 A4        A0 B0 C0 D0 E0
            | B0 B1 B2 B3 B4        A1 B1 C1 D1 E1
            | C0 C1 C2 C3 C4        A2 B2 C2 D2 E2
            | D0 D1 D2 D3 D4        A3 B3 C3 D3 E3
            | E0 E1 E2 E3 E4        A4 B4 C4 D4 E4
            V
         Processes

  • The heart of the redistribution of the keys is each process sending its original local keys to the appropriate process.
  • This is a collective communication operation in which each process sends a distinct collection of data to every other process.
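  A hedged sketch of this key exchange with MPI_Alltoall, under the assumption that each process holds exactly one integer key destined for each of the p processes:

    #include <mpi.h>

    /* Assumed setup: p processes, each holding p integer keys, one destined
     * for each process.  After the call, recv_keys[j] on process i is the key
     * that process j had stored for process i (the "transpose" in the slide). */
    void exchange_keys(int send_keys[], int recv_keys[], MPI_Comm comm) {
        MPI_Alltoall(send_keys, 1, MPI_INT,   /* one key goes to each process  */
                     recv_keys, 1, MPI_INT,   /* one key arrives from each one */
                     comm);
    }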

  11. Tree-Structured Communication
  • To improve the coding, we should focus on the distribution of the input data.
  • How can we divide the work more evenly among processes?
  • Think of the processes as forming a tree, with process 0 as the root:
    • During the 1st stage of the data distribution: 0 sends data to 1.
    • During the 2nd stage: 0 sends data to 2 while 1 sends data to 3.
    • During the 3rd stage: 0 sends to 4, while 1 sends to 5, 2 sends to 6, and 3 sends to 7.
  • So we reduce the input distribution loop from 7 stages to 3 stages.
  • In general, with p processes this procedure distributes the input data in ⌈log2(p)⌉ stages, where ⌈log2(p)⌉ is the smallest whole number greater than or equal to log2(p), called the ceiling of log2(p). (See the process configuration tree; a code sketch follows this list.)
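  A minimal sketch of this staged distribution using point-to-point calls, assuming p is a power of two and the data is a single int (both assumptions made for brevity):

    #include <mpi.h>

    /* Tree-structured distribution of one int from rank 0, assuming p is a
     * power of two.  At stage s the processes with rank < 2^s each send to
     * the process 2^s ranks above them, so the data reaches all p processes
     * in ceil(log2(p)) stages instead of p-1. */
    void tree_distribute(int *data, int p, int my_rank, MPI_Comm comm) {
        for (int half = 1; half < p; half *= 2) {       /* half = 2^stage */
            if (my_rank < half)
                MPI_Send(data, 1, MPI_INT, my_rank + half, 0, comm);
            else if (my_rank < 2 * half)
                MPI_Recv(data, 1, MPI_INT, my_rank - half, 0, comm,
                         MPI_STATUS_IGNORE);
        }
    }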

  12. Data Distribution Stages
  • (Figure: the distribution tree rooted at process 0; stage 1: 0 → 1; stage 2: 0 → 2, 1 → 3; stage 3: 0 → 4, 1 → 5, 2 → 6, 3 → 7.)
  • This distribution reduces the original p-1 stages to ⌈log2(p)⌉ stages.
  • If p = 7, it reduces the number of stages needed to complete the data distribution from 6 to 3, i.e. a 50% reduction.
  • There is no canonical choice of ordering.
  • We have to know the topology of the system in order to make a better choice of scheme.

  13. Reduce the Burden of the Final Sum
  • In the final summation phase, process 0 always gets a disproportionate amount of work, i.e. computing the global sum of the results from all other processes.
  • To accelerate the final phase, we can use the tree concept in reverse to reduce the load on process 0.
  • Distribute the work as:
    • Stage 1: 4 sends to 0; 5 sends to 1; 6 sends to 2; 7 sends to 3. Then 0 adds the integral from 4; 1 adds the integral from 5; 2 adds the integral from 6; 3 adds the integral from 7.
    • Stage 2: 2 sends to 0; 3 sends to 1. Then 0 adds the integral from 2; 1 adds the integral from 3.
    • Stage 3: 1 sends to 0. Then 0 adds the integral from 1.
  (See the reverse-tree process configuration; a code sketch follows this list.)
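  A matching sketch of the reverse (summation) tree, again assuming p is a power of two; the name my_integral is illustrative:

    #include <mpi.h>

    /* Tree-structured global sum of one double onto rank 0, assuming p is a
     * power of two.  At each stage the upper half of the remaining processes
     * send their partial sums down and drop out, so rank 0 performs only
     * ceil(log2(p)) additions instead of p-1. */
    void tree_sum(double *my_integral, int p, int my_rank, MPI_Comm comm) {
        for (int half = p / 2; half >= 1; half /= 2) {
            if (my_rank >= half && my_rank < 2 * half) {
                MPI_Send(my_integral, 1, MPI_DOUBLE, my_rank - half, 0, comm);
            } else if (my_rank < half) {
                double received;
                MPI_Recv(&received, 1, MPI_DOUBLE, my_rank + half, 0, comm,
                         MPI_STATUS_IGNORE);
                *my_integral += received;
            }
        }
    }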

  14. Reverse-Tree Process Configuration
  • (Figure: the summation tree in reverse; the leaves 0 4 2 6 1 5 3 7 combine pairwise to 0 2 1 3, then to 0 1, then finally to 0.)

  15. Reduction Operations
  • The "global sum" calculation belongs to a general class of collective communication operations called reduction operations.
  • In a global reduction operation, all the processes in a communicator contribute data, and those data are combined using a binary operation.
  • Typical operations are addition, max, min, logical and, etc.
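  As a sketch (the names are illustrative), a global sum with MPI_Reduce using the built-in MPI_SUM operation looks like this:

    #include <mpi.h>

    /* Reduce each process's local partial sum into a global sum on rank 0.
     * Only the root's *global_sum_out is meaningful after the call. */
    void global_sum(double local_sum, double *global_sum_out, MPI_Comm comm) {
        MPI_Reduce(&local_sum, global_sum_out, 1, MPI_DOUBLE,
                   MPI_SUM,          /* the binary combining operation */
                   0, comm);         /* root rank, communicator        */
    }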

  16. Simple Reduce

                    ----> Data
            | A0 A1 A2        A0+B0+C0  A1+B1+C1  A2+B2+C2
            | B0 B1 B2
            | C0 C1 C2
            V
         Processes

  17. Allreduce
  • In the simple reduce, only process 0 receives the global sum result; the other processes do not get the result.
  • If we want to use the result in subsequent calculations, we would like every process to end up with the same correct result.
  • The obvious approach is to follow the call to MPI_Reduce with a call to MPI_Bcast; MPI_Allreduce performs both steps in a single operation.
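  A hedged sketch of the two equivalent approaches (names are illustrative):

    #include <mpi.h>

    /* (a) Reduce onto rank 0, then broadcast the answer back out. */
    void allreduce_sum_two_step(double local_sum, double *result, MPI_Comm comm) {
        MPI_Reduce(&local_sum, result, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
        MPI_Bcast(result, 1, MPI_DOUBLE, 0, comm);
    }

    /* (b) Let MPI combine the two steps in a single collective. */
    void allreduce_sum(double local_sum, double *result, MPI_Comm comm) {
        MPI_Allreduce(&local_sum, result, 1, MPI_DOUBLE, MPI_SUM, comm);
    }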

  18. Every Process Has the Same Result

                    ----> Data             Results
            | A0 A1 A2        A0+B0+C0  A1+B1+C1  A2+B2+C2
            | B0 B1 B2        A0+B0+C0  A1+B1+C1  A2+B2+C2
            | C0 C1 C2        A0+B0+C0  A1+B1+C1  A2+B2+C2
            V
         Processes

  19. Implementation in MPI: MPI_Gather
  • MPI_Gather(sendbuffer, sendcount, sendtype, recvbuffer, recvcount, recvtype, root, comm)
  Remarks:
  1. All processes in "comm", including the root, send their "sendbuffer" contents to the root's "recvbuffer".
  2. The root collects these "sendbuffer" contents and puts them in rank order in "recvbuffer".
  3. "recvbuffer" is ignored in all processes except the root.
  4. Its inverse operation is MPI_Scatter().
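  For reference, the C binding looks roughly like this (modern MPI-3 form; older versions omit the const qualifier):

    int MPI_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                   void *recvbuf, int recvcount, MPI_Datatype recvtype,
                   int root, MPI_Comm comm);
    /* recvcount is the number of items received from EACH process,
     * not the total; it is only significant at the root. */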

  20. Implementation in MPI: MPI_Scatter
  • MPI_Scatter(sendbuffer, sendcount, sendtype, recvbuffer, recvcount, recvtype, root, comm)
  Remarks:
  1. The root sends pieces of "sendbuffer" to all processes, including itself.
  2. The pieces are assigned in rank order; each process stores its piece in "recvbuffer".
  3. The root cuts its message into "n" equal parts and sends them to the "n" processes.
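  The corresponding C binding is roughly:

    int MPI_Scatter(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                    void *recvbuf, int recvcount, MPI_Datatype recvtype,
                    int root, MPI_Comm comm);
    /* sendcount is the number of items sent to EACH process; sendbuf is
     * only significant at the root. */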

  21. Implementation in MPI: MPI_Gatherv
  • MPI_Gatherv(sendbuffer, sendcount, sendtype, recvbuffer, recvcounts, displacements /* integer arrays */, recvtype, root, comm)
  Remarks:
  1. This is a more general and more flexible version of MPI_Gather.
  2. It allows a varying count of data from each process.
  3. The variation is described by "recvcounts" and "displacements", which are arrays with one entry per process.
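  The C binding is roughly:

    int MPI_Gatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                    void *recvbuf, const int recvcounts[], const int displs[],
                    MPI_Datatype recvtype, int root, MPI_Comm comm);
    /* recvcounts[i] items from rank i are placed at offset displs[i]
     * (in units of recvtype) within recvbuf on the root. */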

  22. Implementation in MPI: MPI_Allgather
  • MPI_Allgather(sendbuffer, sendcount, sendtype, recvbuffer, recvcount, recvtype, comm)
  Remarks:
  1. This operation is like MPI_Gather, except that all processes receive the result.
  2. Instead of specifying a "root", every process sends its data to all other processes.
  3. The block of data sent from the "j-th" process is received by every process and placed in the "j-th" block of "recvbuffer".
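  The C binding is roughly:

    int MPI_Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                      void *recvbuf, int recvcount, MPI_Datatype recvtype,
                      MPI_Comm comm);
    /* Equivalent to an MPI_Gather followed by a broadcast of recvbuf
     * to every rank. */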

  23. Implementation in MPI: MPI_Allgatherv
  • MPI_Allgatherv(sendbuffer, sendcount, sendtype, recvbuffer, recvcounts, displacements, recvtype, comm)
  Remarks:
  1. This is the variable-count version of MPI_Allgather.
  2. Instead of specifying a "root", every process sends its data to all other processes.
  3. The block of data sent from the "j-th" process is received by every process and placed in the "j-th" block of "recvbuffer".
  4. The blocks from different processes need not be uniform in size.
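  The C binding is roughly:

    int MPI_Allgatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                       void *recvbuf, const int recvcounts[], const int displs[],
                       MPI_Datatype recvtype, MPI_Comm comm);
    /* recvcounts[j] items from rank j land at offset displs[j] in every
     * process's recvbuf. */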

  24. Implementation in MPI: MPI_Alltoall
  • MPI_Alltoall(sendbuffer, sendcount, sendtype, recvbuffer, recvcount, recvtype, comm)
  Remarks:
  1. This is an all-to-all operation.
  2. The "j-th" block sent from process "i" is placed in the "i-th" location of process "j"'s receive buffer.
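  The C binding is roughly:

    int MPI_Alltoall(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                     void *recvbuf, int recvcount, MPI_Datatype recvtype,
                     MPI_Comm comm);
    /* Every process sends sendcount items to every other process; block j
     * of sendbuf on rank i ends up as block i of recvbuf on rank j. */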

  25. Implementation in MPI: MPI_Alltoallv
  • MPI_Alltoallv(sendbuffer, sendcounts, send-displacements, sendtype, recvbuffer, recvcounts, recv-displacements, recvtype, comm)
  Remarks:
  1. This is the variable-count version of the all-to-all operation.
  2. The "j-th" block sent from process "i" is placed in the "i-th" location of process "j"'s receive buffer.
  3. The counts and displacements are arrays with one entry per process, so each pair of processes can exchange a different amount of data.
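  The C binding is roughly:

    int MPI_Alltoallv(const void *sendbuf, const int sendcounts[],
                      const int sdispls[], MPI_Datatype sendtype,
                      void *recvbuf, const int recvcounts[],
                      const int rdispls[], MPI_Datatype recvtype,
                      MPI_Comm comm);
    /* sendcounts[j]/sdispls[j] describe the block going to rank j;
     * recvcounts[i]/rdispls[i] describe the block arriving from rank i. */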

  26. See You in the next Class !!
