
Parallel Programming in MPI part 2



  1. Parallel Programming in MPI, part 2 (January 18, 2011)

  2. Sample answer for last week's report (1):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include "mpi.h"
    int main(int argc, char *argv[])
    {
        int i, r, myid, procs, *result;
        struct timeval tv;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &procs);
        if (myid == 0) {
            result = (int *)malloc(sizeof(int) * procs);
            if (result == NULL) {
                printf("Not enough memory\n");
                exit(EXIT_FAILURE);
            }
        }
        gettimeofday(&tv, NULL);
        srand(tv.tv_usec);
        r = rand();
        MPI_Gather(&r, 1, MPI_INT, result, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (myid == 0) {
            for (i = 0; i < procs; i++)
                printf("%d: %d\n", i, result[i]);
            free(result);
        }
        MPI_Finalize();
        return 0;
    }

  3. Sample answer for last week's report (2):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include "mpi.h"
    int main(int argc, char *argv[])
    {
        MPI_Status status;
        int val = 0;
        int r, myid, procs;
        struct timeval tv;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &procs);
        if (myid != 0)
            MPI_Recv(&val, 1, MPI_INT, myid - 1, 0, MPI_COMM_WORLD, &status);
        gettimeofday(&tv, NULL);
        srand(tv.tv_usec);
        r = rand();
        printf("%d: %d\n", myid, r);
        if (myid != procs - 1)
            MPI_Send(&val, 1, MPI_INT, myid + 1, 0, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }

  4. Today's Topic: 並列処理 Parallel Computing: 複数のCPUコアに仕事を分担 Distribute the work into processes. (Figure: in non-parallel computing, task1, task2 and task3 run one after another; in parallel computing they run at the same time on different processes.)

  5. プログラム例 (並列化前) Sample Program (Before Parallelization): 2個のベクトルの和 Add two vectors (A = B + C), 一つずつ順に処理 Sequential computation, one element after another. (Figure: elements 0 to 99 of A are computed one by one as A = B + C.)

Sequential program:
    double A[100], B[100], C[100];
    ...
    for (i = 0; i < 100; i++)
        A[i] = B[i] + C[i];

  6. プログラム例 (並列化後) Sample (After Parallelization): 処理とデータを各プロセスに分割 Distribute the computation and the data into processes. (Figure: Rank 0 to Rank 3 each hold elements 0 to 24 of A, B and C, and each computes A = B + C on its own part.)

Parallel program:
    double A[25], B[25], C[25];
    ...
    MPI_Init(...);
    ...
    for (i = 0; i < 25; i++)
        A[i] = B[i] + C[i];

  7. 並列化の効果 Effect of Parallelization • 速度向上率=(1プロセスによる実行時間) / (複数プロセスによる実行時間)Speed Up Ratio = (Execution time of sequential program) / (Execution time of parallel program) • 並列化によって何倍速くなったかHow many times the program got faster by the parallelization
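For example (illustrative numbers, not from the slides): if the sequential program takes 10 seconds and the parallel program with 4 processes takes 3 seconds, the speed-up ratio is 10 / 3 ≈ 3.3, i.e. the program got about 3.3 times faster.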

  8. 並列処理に対する期待と現実Hope and Reality of Parallelization • プログラマ Programmer:「CPUを 4台使うんだから,並列化で 4倍速くなって欲しい」"I want the program to get 4 times faster by parallelization, when I use 4 CPUs" • 計算機製作者 Maker of Computers:「CPU 4台で 3倍くらい速くなれば十分だろう」"3 times faster should be enough with 4 CPUs" • Why? • アムダールの法則 Amdahl's Law • 負荷のバランス Load Balance • 通信のコスト Time for Communication

  9. アムダールの法則 Amdahl's Law • 並列化にあてはめて考えると Apply the law to parallelization: • 並列化による性能向上率の理論的な限界 Theoretical limit of the speed-up ratio by parallelization = 1 / ((1-P) + P/N) • P: プログラム中の並列化対象部分が全処理時間に占める割合 Ratio of the execution time of the part that will be parallelized to the total execution time • N: プロセス数 The number of processes • Example) N=4 で 3.5倍以上高速化するためには 95%以上の部分を並列化 To achieve more than 3.5 times speed-up with N=4, more than 95% of the execution time must be parallelized. (Figure: limit of the speed-up ratio for N=4 as a function of P.)
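As a quick check of the 95% figure above, the formula can be evaluated for N = 4 and P = 0.95:

    1 / ((1 - 0.95) + 0.95/4) = 1 / (0.05 + 0.2375) ≈ 3.48

so exactly 95% gives a limit just below 3.5. Solving 1 / ((1 - P) + P/4) = 3.5 for P gives P ≈ 0.952, i.e. slightly more than 95% of the execution time has to be parallelized, which matches the statement on the slide.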

  10. 負荷のバランス Load Balance • 並列プログラムの処理時間は 「最も遅いプロセスの処理時間」である The execution time of a parallel program is the execution time of the slowest process. (Figure: two timelines of Rank 0 to Rank 3; in each, the total execution time is determined by the rank that finishes last.)

  11. 通信時間 Time for communication • 並列化前は不要だった時間 = 並列化によるオーバーヘッド Time that was not necessary before parallelization = overhead caused by parallelization. (Figure: timelines of Rank 0 to Rank 3 with communication periods inserted between computations.)

  12. By the way: 並列化だけが高速化の手段ではない Parallelization is not the only way to speed a program up. For example: add optimization options when compiling the program, or modify the program to fit the sizes of the cache memories. 並列化は簡単な作業ではない Parallelization is not a simple task. Careful analysis of the cost and the effect of parallelization is important.

  13. では,プログラムを並列化してみましょう OK, let's parallelize the program. • プログラムのどこを並列化するか? Which part of the program should be parallelized? • そこは並列化できるか? Can that part be parallelized? • どうやって並列化するか? How should it be parallelized?

  14. どこを並列化するか? Which part to parallelize? • アムダールの法則を考慮すると, 「可能であれば、最も時間のかかっている場所を並列化した方がよい」 Considering Amdahl's Law, choose the part that takes the longest computing time, if possible. • Usually: 最大の多重ループ try the largest loop nest first, さらに、その最外ループ and parallelize the outermost loop of that nest. • 最大の多重ループが、最も所要時間が長い可能性が高い The largest loop nest tends to consume the longest computing time. • 内側のループを並列化すると,並列化のオーバーヘッドが大きくなる可能性が高い Parallelizing an inner loop tends to cause larger parallelization overhead.

  15. 並列化可能か? Can it be parallelized? • 並列処理 = 複数の処理を同時進行 Parallel computing = carry out multiple computations at the same time. • 仕事の順番が不確定 The order of the computations is not deterministic. • 並列化できるプログラム = 実行の順番が変わっても結果が変わらない Parallelizable program = changing the order of the computations does not change the results.

    for (i = 0; i < 12; i++)
        a[i] = b[i] + c[i];

(Figure: with b[i] = 1 and c[i] = i+1, each a[i] = b[i] + c[i] is independent, so the 12 iterations can be computed in any order.)

  16. 並列化が難しいプログラム Programs that are hard to parallelize • 実行の順序によって結果が変わる Changing the order of the computations can change the results. • Example: 前に計算した値を参照して計算 Computation that uses the results of previous iterations.

    for (i = 1; i < 12; i++)
        a[i] = a[i] + a[i-1];

(Figure: starting from a = 1, 2, ..., 12, Rank 0 can compute a[1] = 3 and a[2] = 6, but if a[] is split across Rank 0 to Rank 3, the other ranks cannot proceed because each iteration needs a[i-1] from the previous rank.)

  17. ループの並列化 Parallelization of loops • どのようにプロセスに割り当てるか? How to distribute the loop iterations among processes • どのようにプロセスにデータを配置するか? How to allocate data on the processes • 計算結果のとりまとめ How to assemble the final results of the distributed loops

  18. ループの分割 How to distribute loops

    /* procs: number of processes, myid: rank */
    for (i = (100/procs)*myid; i < (100/procs)*(myid+1); i++)
        a[i] = work(i);

    Rank 0: i = 0, 1, 2, ..., 24
    Rank 1: i = 25, 26, 27, ..., 49
    Rank 2: i = 50, 51, 52, ..., 74
    Rank 3: i = 75, 76, 77, ..., 99

• Block: 連続した繰り返し毎に分割 Distribute contiguous regions of iterations. • 通常、このBlock分割で十分な高速化が得られる場合が多い。 Usually this block distribution gives sufficient speed-up. • 繰り返し数がプロセス数で割り切れない場合があるので注意。 Note that the number of iterations may not be divisible by the number of processes (see the sketch below).
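For the non-divisible case mentioned above, one common approach is to give the first n % procs ranks one extra iteration. A minimal sketch (the names n, procs, myid, a and work() follow the slide's example; the split chosen here is just one possibility):

    int n = 100;                     /* total number of iterations            */
    int base  = n / procs;           /* iterations every rank gets            */
    int extra = n % procs;           /* the first 'extra' ranks get one more  */
    int start = base * myid + (myid < extra ? myid : extra);
    int end   = start + base + (myid < extra ? 1 : 0);
    for (i = start; i < end; i++)
        a[i] = work(i);

With n = 10 and procs = 4, ranks 0 and 1 get three iterations and ranks 2 and 3 get two, so no iteration is lost or duplicated.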

  19. ループの分割 (つづき) How to distribute loops (cont.)

• Cyclic: とびとびに分割 Distribute with a stride

    for (i = myid; i < 100; i += procs)
        a[i] = work(i);

    Rank 0: i = 0, 4, 8, 12, ..., 96
    Rank 1: i = 1, 5, 9, 13, ..., 97
    Rank 2: i = 2, 6, 10, 14, ..., 98
    Rank 3: i = 3, 7, 11, 15, ..., 99

• Block-Cyclic: 二つの組み合わせ Mixture of Block & Cyclic (here with block size 4)

    for (i = myid*4; i < 100; i += procs*4)
        for (ii = i; ii < i+4; ii++)
            a[ii] = work(ii);

    Rank 0: i = 0, 1, 2, 3, 16, 17, 18, 19, ...
    Rank 1: i = 4, 5, 6, 7, 20, 21, 22, 23, ...
    Rank 2: i = 8, 9, 10, 11, 24, 25, 26, 27, ...
    Rank 3: i = 12, 13, 14, 15, 28, 29, 30, 31, ...

• 繰り返し毎に計算の量が異なる場合に利用 Use these distributions when the amount of computation differs from one iteration to another.

  20. データの配置 How to allocate data • 通常,それぞれのプロセスで使用するデータのみを宣言 Usually, allocate only the data that is used by each process. • Good: efficient usage of memory. • Bad: the program must be rewritten according to the (local) sizes of the arrays.

    double *a, *b, *c;
    int local_size;
    ...
    MPI_Init(...);
    ...
    local_size = N / procs;
    ...
    a = (double *)malloc(local_size * sizeof(double));
    ...
    for (i = 0; i < local_size; i++)
        c[i] = a[i] * b[i];

(Figure: each of Rank 0 to Rank 3 holds only its own local part of a, b and c.)

  21. データの配置 (つづき) How to allocate data (cont.) • 全プロセスで全配列を宣言 Redundant allocation: declare the whole arrays on every process. • Good: less modification is needed for parallelization. • Bad: the problem size cannot be increased by adding processes, and regions that are never used are still allocated. 小さいデータの配置に利用 Use this allocation for small data.

    double a[N], b[N], c[N];
    int local_size;
    ...
    MPI_Init(...);
    ...
    for (i = N/procs*myid; i < N/procs*(myid+1); i++)
        c[i] = a[i] * b[i];

(Figure: every rank holds the whole arrays but computes only its own part.)

  22. 結果のとりまとめ How to assemble the final results (1) • Gather the distributed data from all processes onto one rank. • 通常,MPI_Gather を使用 MPI_Gather can generally be used. • ループの分割やデータの配置によって,調整が必要 Some adjustment is needed according to the loop distribution and the data allocation.

With local allocation (slide 20):
    MPI_Gather(c, local_size, MPI_DOUBLE,
               c_total, local_size, MPI_DOUBLE, 0, MPI_COMM_WORLD);

With redundant allocation (slide 21):
    MPI_Gather(&(c[myid*local_size]), local_size, MPI_DOUBLE,
               c_total, local_size, MPI_DOUBLE, 0, MPI_COMM_WORLD);

(Figure: the local c data of Rank 0 to Rank 3 is gathered into c_total on Rank 0.)

  23. 結果のとりまとめ How to assemble the final results (2) • とりまとめ結果を全プロセスにコピー Copy the gathered data to all processes: use MPI_Allgather.

    MPI_Allgather(c, local_size, MPI_DOUBLE,
                  c_total, local_size, MPI_DOUBLE, MPI_COMM_WORLD);

(Figure: every rank ends up with the complete c_total.)

  24. 結果のとりまとめ How to assemble the final results (3) • 全プロセスの結果の総計 Reduce the results of all processes into one value per element, e.g. a total sum. • Possible operations: MPI_SUM, MPI_PROD, etc. • Use MPI_Allreduce to copy the result to all processes.

    MPI_Reduce(c, c_total, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

(Figure: the c arrays of Rank 0 to Rank 3 are summed element-wise into c_total on the root rank, Rank 0.)

  25. MPIプログラムの時間計測 Measuring the time of MPI programs • MPI_Wtime: 現在時間(秒)を実数で返す関数 Returns the current time in seconds as a floating point number.

Example (the interval between the two calls is measured):
    double t1, t2;
    ...
    t1 = MPI_Wtime();
    /* computation to be measured */
    t2 = MPI_Wtime();
    printf("Elapsed time: %e sec.\n", t2 - t1);

  26. 並列プログラムにおける時間計測の問題 Problem of measuring time in parallel programs • プロセス毎に違う時間を測定: どの時間が本当の所要時間か? Each process measures a different time. Which one is the time we want? (Figure: Rank 0, Rank 1 and Rank 2 each call MPI_Wtime() at different moments around their own reads, sends and receives, so the measured intervals differ from rank to rank.)

  27. 集団通信 MPI_Barrierを使った解決策 A solution using the collective operation MPI_Barrier • 時間計測前にMPI_Barrierで同期 Synchronize the processes with MPI_Barrier before each measurement. • Suitable for measuring the total execution time. (Figure: all ranks call MPI_Barrier and then read t1 = MPI_Wtime() at almost the same moment; see the sketch below.)
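A minimal sketch of this pattern, assuming myid is the rank as in the other slides; the second MPI_Barrier before reading t2 is one common variant used when the total time of the whole parallel section is wanted:

    double t1, t2;
    MPI_Barrier(MPI_COMM_WORLD);   /* every rank starts measuring together      */
    t1 = MPI_Wtime();
    /* ... parallel computation and communication ... */
    MPI_Barrier(MPI_COMM_WORLD);   /* wait until the slowest rank has finished  */
    t2 = MPI_Wtime();
    if (myid == 0)
        printf("Total elapsed time: %e sec.\n", t2 - t1);

Because of the barriers, the interval t2 - t1 seen on Rank 0 covers the whole parallel section, including the waiting time of the faster ranks.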

  28. より細かい解析 Detailed analysis • Average: MPI_Reduce can be used to compute the average elapsed time. • MAX and MIN: use MPI_Gather to gather all of the measured times to Rank 0, and let Rank 0 find the MAX and MIN (see the sketch below).

    double t1, t2, t, total;
    t1 = MPI_Wtime();
    ...
    t2 = MPI_Wtime();
    t = t2 - t1;
    MPI_Reduce(&t, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        printf("Ave. elapsed: %e sec.\n", total/procs);
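A minimal sketch of the MAX/MIN part described above, assuming t already holds each rank's elapsed time as in the code on this slide; the buffer name times is hypothetical:

    double *times = NULL;
    if (myid == 0)
        times = (double *)malloc(procs * sizeof(double));
    MPI_Gather(&t, 1, MPI_DOUBLE, times, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (myid == 0) {
        double tmax = times[0], tmin = times[0];
        for (i = 1; i < procs; i++) {
            if (times[i] > tmax) tmax = times[i];
            if (times[i] < tmin) tmin = times[i];
        }
        printf("Max: %e sec.  Min: %e sec.\n", tmax, tmin);
        free(times);
    }

(MPI_Reduce with MPI_MAX or MPI_MIN on MPI_DOUBLE would also work; gathering the individual times additionally shows which rank is the slow one.)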

  29. 最大(Max)、平均(Ave)、最小(Min)の関係 Relationships among Max, Ave and Min • プロセス毎の負荷(仕事量)のばらつき検証に利用 Can be used to check how the load (amount of work) varies among processes. • Note that the measured time includes both computation time and communication time.

  30. 通信時間の計測 Measuring the time spent on communication

    double t1, t2, t3, t4, comm = 0;
    t3 = MPI_Wtime();
    for (i = 0; i < N; i++) {
        /* computation */
        t1 = MPI_Wtime();
        /* communication */
        t2 = MPI_Wtime();
        comm += t2 - t1;
        /* computation */
        t1 = MPI_Wtime();
        /* communication */
        t2 = MPI_Wtime();
        comm += t2 - t1;
    }
    t4 = MPI_Wtime();

  31. Analyze computation time • Computation time = total time - communication time, or just measure the computation time directly. • 計算時間のばらつき = 負荷の不均衡の度合い The spread of the computation times shows the imbalance of the amount of computation. • 注意: 通信時間には、負荷の不均衡によって生じた待ち時間が含まれるので、単純な評価は難しい Note: communication time is difficult to evaluate simply, since it includes waiting time caused by load imbalance. ==> Balance the computation first.

  32. Today's MPI functions • MPI_Allgather • MPI_Reduce • MPI_Allreduce • MPI_Barrier • MPI_Wtime

  33. MPI_Allgather • Usage: int MPI_Allgather(void *sb, int sc, MPI_Datatype st, void *rb, int rc, MPI_Datatype rt, MPI_Comm comm); • MPI_Gatherの結果を全プロセスにコピー Like MPI_Gather, but the gathered result is copied to all processes. • Parameters: send data (start address, number of elements, data type), receive data (start address, number of elements, data type), communicator. There is no root rank: every process receives the result. • Example:

    MPI_Allgather(a, 3, MPI_DOUBLE, b, 3, MPI_DOUBLE, MPI_COMM_WORLD);

(Figure: the a buffers of Rank 0 to Rank 3 are concatenated into b on every rank.)

  34. MPI_Reduce • Usage: int MPI_Reduce(void *sb, void *rb, int c, MPI_Datatype t, MPI_Op op, int root, MPI_Comm comm); • 全プロセスからのデータを集めて計算(総和等)をする Gather data from all processes and apply a reduction operation (sum, etc.); the result is stored on the root rank. • Parameters: start address of the original data, start address to store the result, number of elements, data type, operation, root rank, communicator. • Possible operations: MPI_SUM, MPI_PROD, MPI_MAX, MPI_MIN, etc. (MPI_MAX and MPI_MIN can be used with both integer and floating point types.) • Example:

    MPI_Reduce(a, b, 3, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

(Figure: the a buffers of Rank 0 to Rank 3 are summed into b on Rank 0.)

  35. MPI_Allreduce • Usage: int MPI_Allreduce(void *sb, void *rb, int c, MPI_Datatype t, MPI_Op op, MPI_Comm comm); • MPI_Reduceの結果を全プロセスにコピー Like MPI_Reduce, but the result is copied to all processes. • Parameters: start address of the original data, start address to store the result, number of elements, data type, operation, communicator. • Example:

    MPI_Allreduce(a, b, 3, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

(Figure: the a buffers of all ranks are summed into b on every rank.)

  36. MPI_Barrier • Usage: int MPI_Barrier(MPI_Comm comm); • 同期。全プロセスがMPI_Barrierを実行するまで待つ。 Synchronization: each process waits here until all processes have called MPI_Barrier. • Parameter: communicator.

  37. MPI_Wtime • Usage: double MPI_Wtime(); • 現在の時刻を秒で取得。秒は倍精度実数で返す。 Returns the current time in seconds as a double precision floating point number. • Parameters: none.

  38. レポート課題: ベクトルの内積の並列化 Report: Parallelization of a dot product • 以下のプログラムの計算部分を並列化しなさい Parallelize the computation part of the following program.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #define N 1000
    int main(int argc, char *argv[])
    {
        double *a, *b, c, r;
        int i;
        struct timeval tv;
        a = (double *)malloc(N * sizeof(double));
        b = (double *)malloc(N * sizeof(double));

(continued on the next slide)

  39. (continuation)

        gettimeofday(&tv, NULL);
        srand(tv.tv_usec);
        for (i = 0; i < N; i++) {
            r = rand();
            a[i] = r / RAND_MAX;
        }
        for (i = 0; i < N; i++) {
            r = rand();
            b[i] = r / RAND_MAX;
        }
        c = 0.0;
        for (i = 0; i < N; i++)
            c = c + a[i]*b[i];
        printf("Result: %e\n", c);
        free(a);
        free(b);
    }

  40. 並列化にあたって Notes on the parallelization • ベクトル a, b の初期値はランク0で生成し,他のランクへ適切にコピーする Let Rank 0 create the initial values of the elements of a and b, and copy the data to the other processes appropriately. • 結果はランク0で表示する Let Rank 0 display the result. • 時間と興味があったら,N やプロセス数を変えながらMPI_Wtimeで所要時間や計算部分の時間を計測してみる If you have time and interest, measure the total execution time or the time of the computation part with MPI_Wtime, varying N and the number of processes.
