
Presentation Transcript


  1. Parallel Processing: MPI Tutorial (ACOE401)

  2. Example 1: A 1024X1024 gray-scaled bitmap image is stored in the file “INPUT.BMP”. Write a program that displays the number of pixels that have a value greater than the average value of all pixels (illustrated on the slide with the sample pixel values 3, 3, 1, 2, 2, 6, whose sum is 17).
  (a) Write the program for a message passing system, using MPI without any of the collective communication functions.
  (b) Write the program for a message passing system, using MPI with the most appropriate collective communication functions.

  3. Example 1a: Methodology (flowchart, transcribed)
  All processes: Init. MPI, then branch on Id = 0?
  Process 0 (root): Read the file; send a data segment to each of the others; calculate my local sum; receive the local sums from the others and find the global sum and average; send the average to the others; count pixels > average; receive the local counts from the others; find and display the global count; exit.
  All other processes: Receive my data segment from the root; calculate my local sum; send my local sum to the root; receive the average from the root; count pixels > average; send my count to the root; exit.

  4. Example 1a:
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int myid, nprocs, i, k, myrows, mysize, aver, num = 0, mysum = 0, temp;
    MPI_Status stat;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    myrows = 1024 / nprocs;
    mysize = 1024 * myrows;
    int lvals[myrows][1024];
    if (myid == 0)      /********* Code executed only by process with id = 0 *********/
    {
      int dvals[1024][1024];
      get_data(filename, dvals);               /* Copy data from the data file to array dvals[][] */
      for (i = 1; i < nprocs; i++)             /* Send one block of rows to each other process */
        MPI_Send(dvals[i * myrows], mysize, MPI_INT, i, 0, MPI_COMM_WORLD);
      for (i = 0; i < myrows; i++)
        for (k = 0; k < 1024; k++)
          mysum += dvals[i][k];                /* Calculate my partial sum */
      for (i = 1; i < nprocs; i++) {           /* Collect the partial sums */
        MPI_Recv(&temp, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
        mysum += temp;
      }                                        /* mysum now holds the global sum */
      aver = mysum / (1024 * 1024);
      for (i = 1; i < nprocs; i++)
        MPI_Send(&aver, 1, MPI_INT, i, 2, MPI_COMM_WORLD);
      for (i = 0; i < myrows; i++)
        for (k = 0; k < 1024; k++)
          if (dvals[i][k] > aver) num++;       /* Count pixels greater than the average */
      for (i = 1; i < nprocs; i++) {           /* Collect the partial counts */
        MPI_Recv(&temp, 1, MPI_INT, i, 3, MPI_COMM_WORLD, &stat);
        num += temp;
      }                                        /* num now holds the global count */
      printf("Result = %d\n", num);
    }                  /********* End of code executed only by process 0 *********/

  5. Example 1a: Continued
    if (myid != 0)     /***** Code executed by all processes except process 0 *****/
    {
      MPI_Recv(lvals, mysize, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat);
      for (i = 0; i < myrows; i++)
        for (k = 0; k < 1024; k++)
          mysum += lvals[i][k];                /* Calculate my partial sum */
      MPI_Send(&mysum, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
      MPI_Recv(&aver, 1, MPI_INT, 0, 2, MPI_COMM_WORLD, &stat);
      for (i = 0; i < myrows; i++)
        for (k = 0; k < 1024; k++)
          if (lvals[i][k] > aver) num++;       /* Count pixels greater than the average */
      MPI_Send(&num, 1, MPI_INT, 0, 3, MPI_COMM_WORLD);
    }
    MPI_Finalize();
  }

  6. Example 1b: Methodology (flowchart, transcribed)
  Init. MPI.
  If Id = 0: read the file.
  Scatter the data segments (root = process 0).
  Calculate my local sum.
  Reduce the local sums to the global sum (root = process 0).
  If Id = 0: find the average.
  Broadcast the average (root = process 0).
  Count my local pixels > average.
  Reduce the local counts to the global count (root = process 0).
  If Id = 0: display the global count.
  Exit.

  7. Example 1b:
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int myid, nprocs, i, k, myrows, aver, res, num = 0;
    int dvals[1024][1024];
    int mysum = 0, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0)
      get_data(filename, dvals);               /* Copy data from the data file to array dvals[][] */
    myrows = 1024 / nprocs;
    int lvals[myrows][1024];
    /* Note: the send count of MPI_Scatter is the number of elements sent to EACH process */
    MPI_Scatter(dvals, myrows * 1024, MPI_INT, lvals, myrows * 1024, MPI_INT, 0, MPI_COMM_WORLD);
    for (i = 0; i < myrows; i++)
      for (k = 0; k < 1024; k++)
        mysum += lvals[i][k];                  /* Calculate my partial sum */
    MPI_Reduce(&mysum, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) aver = sum / (1024 * 1024);
    MPI_Bcast(&aver, 1, MPI_INT, 0, MPI_COMM_WORLD);
    for (i = 0; i < myrows; i++)
      for (k = 0; k < 1024; k++)
        if (lvals[i][k] > aver) num++;         /* Count pixels greater than the average */
    MPI_Reduce(&num, &res, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) printf("The result is %d.\n", res);
    MPI_Finalize();
  }
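  A quick way to sanity-check either version is to compare it against a plain sequential count. The sketch below is not part of the original slides; the function name and the tiny test image (the same sample values 3, 3, 1, 2, 2, 6 used in the problem statement) are illustrative assumptions.

  #include <stdio.h>

  /* Sequential reference: count the pixels greater than the average value. */
  int count_above_average(int rows, int cols, int img[rows][cols])
  {
    long sum = 0;
    int i, k, num = 0;
    for (i = 0; i < rows; i++)
      for (k = 0; k < cols; k++)
        sum += img[i][k];
    int aver = (int)(sum / ((long)rows * cols));   /* same integer average as the MPI versions */
    for (i = 0; i < rows; i++)
      for (k = 0; k < cols; k++)
        if (img[i][k] > aver) num++;
    return num;
  }

  int main(void)
  {
    int test[2][3] = { {3, 3, 1}, {2, 2, 6} };     /* sum = 17, integer average = 2 */
    printf("Result = %d\n", count_above_average(2, 3, test));   /* prints 3 */
    return 0;
  }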

  8. Example 2: An image enhancement technique on gray-scaled images requires that all pixels that have a value less than the average value of all pixels of the image are set to zero, while the rest remain the same. You are required to write a program to implement the above technique. Use as input the file ‘input.dat’, which contains a 1024X1024 gray-scaled bitmap image. Store the new image in the file ‘output.dat’. Write the program for a message passing system, using MPI with the most appropriate collective communication functions.

  9. Example 2:
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int myid, nprocs, i, k, myrows, aver;
    int dvals[1024][1024];
    int mysum = 0, sum = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0)
      get_data(filename, dvals);               /* Copy data from the input file to array dvals[][] */
    myrows = 1024 / nprocs;
    int lvals[myrows][1024];
    MPI_Scatter(dvals, myrows * 1024, MPI_INT, lvals, myrows * 1024, MPI_INT, 0, MPI_COMM_WORLD);
    for (i = 0; i < myrows; i++)
      for (k = 0; k < 1024; k++)
        mysum += lvals[i][k];                  /* Calculate my partial sum */
    MPI_Reduce(&mysum, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) aver = sum / (1024 * 1024);
    MPI_Bcast(&aver, 1, MPI_INT, 0, MPI_COMM_WORLD);
    for (i = 0; i < myrows; i++)
      for (k = 0; k < 1024; k++)
        if (lvals[i][k] < aver) lvals[i][k] = 0;   /* If value less than average, set it to 0 */
    /* Gather the processed segments back into dvals[][] on process 0 */
    MPI_Gather(lvals, myrows * 1024, MPI_INT, dvals, myrows * 1024, MPI_INT, 0, MPI_COMM_WORLD);
    if (myid == 0) write_data(filename, dvals);    /* write_data() and filename are placeholders for the output file */
    MPI_Finalize();
  }

  10. Example 3: The following MPI program uses only the MPI_Send and MPI_Recv functions to transfer data between processes. Rewrite the program for a message passing system, using MPI with the most appropriate collective communication functions.

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int myid, nprocs, i, k, myrows, mysize, aver, mkey = 0, temp;
    int dvals1[1024][1024], dvals2[1024][1024];   /* full images, used by process 0 */
    int lvals2[1024][1024];                       /* full copy of the second image on every process */
    MPI_Status stat;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    myrows = 1024 / nprocs;
    mysize = 1024 * myrows;
    int lvals1[myrows][1024];
    if (myid == 0)
    {
      get_data(filename1, dvals1);
      get_data(filename2, dvals2);
    }
    ... see next slide ...
    MPI_Finalize();
  }

  11. Example 3: (cont.)
    if (myid == 0)
    {
      for (i = 1; i < nprocs; i++) {
        MPI_Send(dvals1[i * myrows], mysize, MPI_INT, i, 0, MPI_COMM_WORLD);   /* each process gets its block of rows */
        MPI_Send(dvals2, 1024 * 1024, MPI_INT, i, 1, MPI_COMM_WORLD);          /* every process gets the whole second image */
      }
      for (i = 0; i < myrows; i++)
        for (k = 0; k < 1024; k++) {
          dvals1[i][k] = dvals1[i][k] * dvals2[k][i];
          mkey += dvals1[i][k];
        }
      for (i = 1; i < nprocs; i++) {
        MPI_Recv(dvals1[i * myrows], mysize, MPI_INT, i, 0, MPI_COMM_WORLD, &stat);
        MPI_Recv(&temp, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat);
        mkey += temp;
      }
      aver = mkey / (1024 * 1024);
      printf("Result = %d\n", mkey);
    }
    if (myid != 0)
    {
      MPI_Recv(lvals1, mysize, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat);
      MPI_Recv(lvals2, 1024 * 1024, MPI_INT, 0, 1, MPI_COMM_WORLD, &stat);
      for (i = 0; i < myrows; i++)
        for (k = 0; k < 1024; k++) {
          lvals1[i][k] = lvals1[i][k] * lvals2[k][myid * myrows + i];   /* the column index is the global row number */
          mkey += lvals1[i][k];
        }
      MPI_Send(lvals1, mysize, MPI_INT, 0, 0, MPI_COMM_WORLD);
      MPI_Send(&mkey, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }

  12. Example 3: (Solution 1/2) The point-to-point code of the previous slide maps onto collective operations as follows:
  • The loop in which process 0 sends a different block of rows of dvals1 to each process → MPI_Scatter.
  • The loop in which process 0 sends the whole of dvals2 to every process → MPI_Bcast.
  • The loop in which process 0 receives the computed blocks back into dvals1 → MPI_Gather.
  • The loop in which process 0 receives the partial mkey values and accumulates them → MPI_Reduce (with MPI_SUM).

  13. Example 3: (Solution 2)
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int myid, nprocs, i, k, myrows, mysize, mkey = 0, gkey = 0;
    int dvals1[1024][1024];                    /* full first image, filled on process 0 */
    int lvals2[1024][1024];                    /* full second image, broadcast to every process */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    myrows = 1024 / nprocs;
    mysize = 1024 * myrows;
    int lvals1[myrows][1024];
    if (myid == 0) {
      get_data(filename1, dvals1);
      get_data(filename2, lvals2);             /* read directly into the broadcast buffer; dvals2 is not needed */
    }
    MPI_Scatter(dvals1, myrows * 1024, MPI_INT, lvals1, myrows * 1024, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(lvals2, 1024 * 1024, MPI_INT, 0, MPI_COMM_WORLD);
    for (i = 0; i < myrows; i++)
      for (k = 0; k < 1024; k++) {
        lvals1[i][k] = lvals1[i][k] * lvals2[k][myid * myrows + i];   /* the column index is the global row number */
        mkey += lvals1[i][k];
      }
    MPI_Gather(lvals1, myrows * 1024, MPI_INT, dvals1, myrows * 1024, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Reduce(&mkey, &gkey, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) printf("Result = %d\n", gkey);
    MPI_Finalize();
  }

  14. Example 4:
  • Rewrite the program below for a message passing system, without using any of the collective communication functions.

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int myid, nprocs, i, k, myrows, mysize, aver, num = 0, mysum = 0, temp;
    int arr1[1024][1024], arr2[1024][1024];
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    myrows = 1024 / nprocs;
    mysize = 1024 * myrows;
    int arr3[myrows][1024];
    int arr4[myrows][1024];
    if (myid == 0) {
      get_data(filename1, arr1);
      get_data(filename2, arr2);
    }
    MPI_Scatter(arr1, myrows * 1024, MPI_INT, arr3, myrows * 1024, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(arr2, 1024 * 1024, MPI_INT, 0, MPI_COMM_WORLD);
    for (i = 0; i < myrows; i++)
      for (k = 0; k < 1024; k++) {
        arr3[i][k] = arr3[i][k] + arr2[k][i];
        mysum += arr3[i][k];
      }
    MPI_Allreduce(&mysum, &temp, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    aver = temp / (1024 * 1024);
    for (i = 0; i < myrows; i++)
      for (k = 0; k < 1024; k++)
        if (arr3[i][k] < aver) arr4[i][k] = 0; else arr4[i][k] = arr3[i][k];
    MPI_Gather(arr4, myrows * 1024, MPI_INT, arr1, myrows * 1024, MPI_INT, 0, MPI_COMM_WORLD);
    …….
  }

  15. Answer 4: (part 1)
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
    int myid, nprocs, i, k, myrows, mysize, aver, num = 0, mysum = 0, temp;
    int arr1[1024][1024], arr2[1024][1024];
    MPI_Status stat;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    myrows = 1024 / nprocs;
    mysize = 1024 * myrows;
    int arr3[myrows][1024];
    int arr4[myrows][1024];
    if (myid == 0) {
      get_data(filename1, arr1);
      get_data(filename2, arr2);
    }
    /* Replaces: MPI_Scatter(arr1, myrows*1024, MPI_INT, arr3, myrows*1024, MPI_INT, 0, MPI_COMM_WORLD); */
    if (myid == 0) {
      for (i = 1; i < nprocs; i++)                     /* send one block of rows to each other process */
        MPI_Send(arr1[i * myrows], mysize, MPI_INT, i, 0, MPI_COMM_WORLD);
      for (i = 0; i < myrows; i++)                     /* process 0 copies its own block locally */
        for (k = 0; k < 1024; k++)
          arr3[i][k] = arr1[i][k];
    } else
      MPI_Recv(arr3, mysize, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat);
    /* Replaces: MPI_Bcast(arr2, 1024*1024, MPI_INT, 0, MPI_COMM_WORLD); */
    if (myid == 0) {
      for (i = 1; i < nprocs; i++)                     /* no need for process 0 to send to itself */
        MPI_Send(arr2, 1024 * 1024, MPI_INT, i, 1, MPI_COMM_WORLD);
    } else
      MPI_Recv(arr2, 1024 * 1024, MPI_INT, 0, 1, MPI_COMM_WORLD, &stat);
    for (i = 0; i < myrows; i++)
      for (k = 0; k < 1024; k++) {
        arr3[i][k] = arr3[i][k] + arr2[k][i];
        mysum += arr3[i][k];
      }
    /* ... continued on the next slide ... */

  16. Answer 4: (part 2)
    /* Replaces: MPI_Allreduce(&mysum, &temp, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD); */
    if (myid == 0) {
      for (i = 1; i < nprocs; i++) {                   /* collect the partial sums */
        MPI_Recv(&temp, 1, MPI_INT, i, 2, MPI_COMM_WORLD, &stat);
        mysum += temp;
      }
      aver = mysum / (1024 * 1024);
      for (i = 1; i < nprocs; i++)                     /* send the average to everyone */
        MPI_Send(&aver, 1, MPI_INT, i, 3, MPI_COMM_WORLD);
    } else {
      MPI_Send(&mysum, 1, MPI_INT, 0, 2, MPI_COMM_WORLD);
      MPI_Recv(&aver, 1, MPI_INT, 0, 3, MPI_COMM_WORLD, &stat);
    }
    for (i = 0; i < myrows; i++)
      for (k = 0; k < 1024; k++)
        if (arr3[i][k] < aver) arr4[i][k] = 0; else arr4[i][k] = arr3[i][k];
    /* Replaces: MPI_Gather(arr4, myrows*1024, MPI_INT, arr1, myrows*1024, MPI_INT, 0, MPI_COMM_WORLD); */
    if (myid == 0) {
      for (i = 0; i < myrows; i++)                     /* process 0 copies its own result block */
        for (k = 0; k < 1024; k++)
          arr1[i][k] = arr4[i][k];
      for (i = 1; i < nprocs; i++)                     /* collect the result blocks from the others */
        MPI_Recv(arr1[i * myrows], mysize, MPI_INT, i, 4, MPI_COMM_WORLD, &stat);
    } else
      MPI_Send(arr4, mysize, MPI_INT, 0, 4, MPI_COMM_WORLD);
    …….
  }

  17. Question 3 (Message Passing System) The figure below shows the timeline diagram of the execution of an MPI program with four processes.
  • Calculate the speedup achieved, the efficiency, and the utilization factor for each process.
  • List three reasons for achieving such a low performance. For each reason, propose a change that will improve the performance.

  18. Answer 3a (Message Passing System)
  • Calculate the speedup achieved, the efficiency and the utilization factor for each process.
  • Speedup = Computation Time / Parallel Time = (22 + 4 + 15 + 20 + 14) / 54 = 1.39
  • Efficiency = Speedup / Number of processors = 1.39 / 4 = 0.35 = 35%
  • Utilization of P0 = 26/54 = 0.48 = 48%
  • Utilization of P1 = 15/54 = 0.28 = 28%
  • Utilization of P2 = 20/54 = 0.37 = 37%
  • Utilization of P3 = 14/54 = 0.26 = 26%
  (The short program after this list reproduces these numbers.)
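  A small self-contained program that reproduces the arithmetic above. The busy times 26 (= 22 + 4), 15, 20, 14 and the total parallel time 54 are the values read off the timeline diagram in the question; nothing else is assumed.

  #include <stdio.h>

  int main(void)
  {
    double busy[4] = {26, 15, 20, 14};   /* useful computation time of P0..P3 */
    double parallel_time = 54;           /* end-to-end parallel execution time */
    double total_work = 0;
    for (int p = 0; p < 4; p++)
      total_work += busy[p];

    double speedup = total_work / parallel_time;   /* 75 / 54 = 1.39 */
    double efficiency = speedup / 4;               /* 0.35 = 35% */
    printf("Speedup = %.2f, Efficiency = %.0f%%\n", speedup, 100 * efficiency);
    for (int p = 0; p < 4; p++)
      printf("Utilization of P%d = %.0f%%\n", p, 100 * busy[p] / parallel_time);
    return 0;
  }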

  19. Answer 3b (Message Passing System)
  • List three reasons for achieving such a low performance. For each reason propose a change that will improve the performance.
  • Bad load balancing (P0 useful work = 26, P3 useful work = 14). Improve the speedup by improving the load balancing: P0 must be assigned less computation than the rest, since it also has to spend time distributing the data and then collecting the results (a sketch of such an uneven distribution follows below).
  • Process P0 handles most of the communication using point-to-point communication (see next slide). Use collective communications to obtain a balanced communication load.
  • Processors are idle while waiting for a communication or synchronization event to complete (see next slide). Hide the communication or synchronization latency by allowing the processor to do useful work while waiting for a long-latency event, using techniques such as non-blocking communication (pre-communication).
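  A hedged sketch of the load-balancing suggestion, using MPI_Scatterv so that process 0 receives a smaller block of rows than the workers. The image size, the "root gets half a worker's share" split and the commented-out get_data() call are illustrative assumptions, not part of the original slides, and the code assumes nprocs >= 2.

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  #define ROWS 1024
  #define COLS 1024

  int main(int argc, char *argv[])
  {
    int myid, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* Give process 0 roughly half a worker's share of rows, since it also
       reads the file and collects the results; spread the rest evenly. */
    int *counts = malloc(nprocs * sizeof(int));
    int *displs = malloc(nprocs * sizeof(int));
    int root_rows = (ROWS / nprocs) / 2;
    int worker_rows = (ROWS - root_rows) / (nprocs - 1);
    int row = 0;
    for (int p = 0; p < nprocs; p++) {
      int rows_p = (p == 0) ? root_rows : worker_rows;
      if (p == nprocs - 1) rows_p = ROWS - row;        /* last process takes the remainder */
      counts[p] = rows_p * COLS;
      displs[p] = row * COLS;
      row += rows_p;
    }

    int *image = NULL;
    if (myid == 0) {
      image = malloc(ROWS * COLS * sizeof(int));
      /* get_data(filename, image);   <- placeholder from the slides */
    }
    int *myblock = malloc(counts[myid] * sizeof(int));
    MPI_Scatterv(image, counts, displs, MPI_INT,
                 myblock, counts[myid], MPI_INT, 0, MPI_COMM_WORLD);

    /* ... local computation on counts[myid] elements ... */

    printf("Process %d got %d rows\n", myid, counts[myid] / COLS);
    free(myblock); free(counts); free(displs);
    if (myid == 0) free(image);
    MPI_Finalize();
    return 0;
  }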

  20. Hiding Communication Latency in MPI
  In a message passing system, communication latency significantly reduces performance. MPI offers a number of facilities that aim at hiding the communication latency and thus allow the processor to do useful work while the communication is in progress.
  Non-Blocking Communication: MPI_Send and MPI_Recv are blocking communication functions – both nodes must wait until the data has been transferred correctly. MPI supports non-blocking communication with the MPI_Isend() and MPI_Irecv() functions. With MPI_Isend() the transmitting node starts the send and continues; after executing a number of instructions it can check the status of the transfer and act accordingly. With MPI_Irecv() the receiving node posts the receive before the data is needed, expecting that the data will have arrived by the time it is needed.
  MPI supports data packing with the MPI_Pack() function, which packs different types of data into a single message. This reduces the communication overheads, since the number of messages is reduced.
  MPI supports collective communications that can take advantage of the network facilities. For example, MPI_Bcast() can exploit the broadcast capability of an Ethernet network and thus send data to many nodes with a single message.
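  A minimal sketch of the non-blocking pattern described above, assuming a simple two-process exchange: each rank posts MPI_Irecv() and MPI_Isend(), does some independent work while the transfer is in flight, and only calls MPI_Waitall() at the point where the received data is actually needed. The buffer contents are illustrative.

  #include <stdio.h>
  #include <mpi.h>

  #define N 1024

  int main(int argc, char *argv[])
  {
    int myid, nprocs, i;
    int sendbuf[N], recvbuf[N];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    for (i = 0; i < N; i++) sendbuf[i] = myid * N + i;

    if (nprocs >= 2 && myid < 2) {
      int partner = 1 - myid;                      /* ranks 0 and 1 exchange buffers */
      MPI_Irecv(recvbuf, N, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[0]);
      MPI_Isend(sendbuf, N, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[1]);

      /* Useful work that does not depend on recvbuf overlaps the transfer. */
      long local = 0;
      for (i = 0; i < N; i++) local += sendbuf[i];

      MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* the received data is needed from here on */
      long remote = 0;
      for (i = 0; i < N; i++) remote += recvbuf[i];
      printf("Rank %d: local sum %ld, remote sum %ld\n", myid, local, remote);
    }
    MPI_Finalize();
    return 0;
  }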
