
Open Multiprocessing


Presentation Transcript


  1. Open Multiprocessing Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

  2. OpenMP
  • An API for shared-memory multiprocessing (parallel) programming in C, C++ and Fortran.
  • Supports multiple platforms (processor architectures and operating systems).
  • A relatively high-level approach: you mark a block of code that should be executed in parallel.
  • A method of parallelizing whereby a master thread forks a number of slave threads and a task is divided among them.
  • Based on preprocessor directives (pragmas); requires compiler support.
  • Header file: omp.h
  • References:
    • http://openmp.org/
    • https://computing.llnl.gov/tutorials/openMP/
    • http://supercomputingblog.com/openmp/

  3. Hello, World!
  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  void Hello(void);

  int main(int argc, char* argv[]) {
     /* Get number of threads from command line */
     int thread_count = strtol(argv[1], NULL, 10);

  #  pragma omp parallel num_threads(thread_count)
     Hello();

     return 0;
  }

  void Hello(void) {
     int my_rank = omp_get_thread_num();
     int thread_count = omp_get_num_threads();
     printf("Hello from thread %d of %d\n", my_rank, thread_count);
  }

  4. Definitions
  # pragma omp parallel [clauses]
  {
     code_block
  }   /* implicit barrier at the end of the block */

  • A clause is text that modifies the directive.
  • Thread team = master + slaves.

  Error checking:
  #ifdef _OPENMP
  #  include <omp.h>
  #endif

  #ifdef _OPENMP
     int my_rank = omp_get_thread_num();
     int thread_count = omp_get_num_threads();
  #else
     int my_rank = 0;
     int thread_count = 1;
  #endif

  5. The Trapezoidal Rule
  /* Input: a, b, n */
  h = (b-a)/n;
  approx = (f(a)+f(b))/2.0;
  for (i = 1; i <= n-1; i++) {
     x_i = a + i*h;
     approx += f(x_i);
  }
  approx = h*approx;

  Multiple threads (e.g., thread 0 and thread 2) updating a shared variable in shared memory leads to a race condition, so the update is protected:

  # pragma omp critical
  global_result += my_result;

  6. The critical Directive
  # pragma omp critical
  y = f(x);
  ...
  double f(double x) {
  #  pragma omp critical
     z = g(x);
     ...
  }
  Cannot be executed simultaneously!

  # pragma omp critical(one)
  y = f(x);
  ...
  double f(double x) {
  #  pragma omp critical(two)
     z = g(x);
     ...
  }
  Deadlock

  7. The atomic Directive
  # pragma omp atomic
  x <op>= <expression>;

  • <op> can be one of the binary operators: +, *, -, /, &, ^, |, <<, >>
  • Also supported: x++; ++x; x--; --x;
  • Higher performance than the critical directive.
  • Only a single C assignment statement is protected.
  • Only the load and store of x are protected.
  • <expression> must not reference x.

  # pragma omp atomic
  x += f(y);

  # pragma omp critical
  x = g(x);

  Can be executed simultaneously! (The atomic and critical directives may be implemented with different mechanisms, so they do not necessarily exclude each other.)
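
  The following minimal sketch (not from the slides) shows the typical use of atomic: the right-hand-side value is computed without any protection, and only the update of the shared variable hits is made atomic.

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      long long hits = 0;
      long long i;

  #   pragma omp parallel for num_threads(4)
      for (i = 0; i < 1000000; i++) {
          long long contribution = (i % 3 == 0);   /* computed privately, not protected */
  #       pragma omp atomic
          hits += contribution;                    /* only this load/store of hits is protected */
      }

      printf("hits = %lld\n", hits);
      return 0;
  }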

  8. Locks
  /* Executed by one thread */
  Initialize the lock data structure;
  ...
  /* Executed by multiple threads */
  Attempt to lock or set the lock data structure;
  Critical section;
  Unlock or unset the lock data structure;
  ...
  /* Executed by one thread */
  Destroy the lock data structure;

  void omp_init_lock(omp_lock_t* lock_p);
  void omp_set_lock(omp_lock_t* lock_p);
  void omp_unset_lock(omp_lock_t* lock_p);
  void omp_destroy_lock(omp_lock_t* lock_p);
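
  As a concrete illustration, here is a minimal sketch (not part of the original slides) that follows the lock pattern above; the shared sum and the per-thread values are made up for the example.

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      omp_lock_t lock;
      double global_sum = 0.0;

      omp_init_lock(&lock);              /* executed once, before the parallel region */

  #   pragma omp parallel num_threads(4)
      {
          double my_sum = omp_get_thread_num() + 1.0;  /* stand-in for real work */
          omp_set_lock(&lock);           /* enter the critical section */
          global_sum += my_sum;
          omp_unset_lock(&lock);         /* leave the critical section */
      }

      omp_destroy_lock(&lock);           /* executed once, after the parallel region */
      printf("global_sum = %f\n", global_sum);
      return 0;
  }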

  9. Trapezoidal Rule in OpenMP
  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  void Trap(double a, double b, int n, double* global_result_p);

  int main(int argc, char* argv[]) {
     double global_result = 0.0;
     double a, b;
     int n, thread_count;

     thread_count = strtol(argv[1], NULL, 10);
     printf("Enter a, b, and n\n");
     scanf("%lf %lf %d", &a, &b, &n);

  #  pragma omp parallel num_threads(thread_count)
     Trap(a, b, n, &global_result);

     printf("With n = %d trapezoids, our estimate\n", n);
     printf("of the integral from %f to %f = %.15e\n", a, b, global_result);
     return 0;
  }

  10. Trapezoidal Rule in OpenMP
  void Trap(double a, double b, int n, double* global_result_p) {
     double h, x, my_result;
     double local_a, local_b;
     int i, local_n;
     int my_rank = omp_get_thread_num();
     int thread_count = omp_get_num_threads();

     h = (b-a)/n;
     local_n = n/thread_count;
     local_a = a + my_rank*local_n*h;
     local_b = local_a + local_n*h;
     my_result = (f(local_a) + f(local_b))/2.0;
     for (i = 1; i <= local_n-1; i++) {
        x = local_a + i*h;
        my_result += f(x);
     }
     my_result = my_result*h;

  #  pragma omp critical
     *global_result_p += my_result;
  }

  11. Scope of Variables
  • In serial programming: function-wide scope and file-wide scope.
  • Shared scope: accessible by all threads in a team; variables declared before a parallel directive.
    In the trapezoidal program: a, b, n, global_result and thread_count are shared.
  • Private scope: accessible only by a single thread; variables declared in the code block.
    In the trapezoidal program: my_rank and my_result are private.
  • global_result_p is a parameter of Trap, so each thread has its own copy, but it points to the shared global_result: *global_result_p is shared.
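
  A minimal sketch (not from the slides) of the two scopes: x is declared before the parallel directive, so it has shared scope; y is declared inside the block, so each thread gets its own private copy.

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      int x = 10;                           /* shared: one copy for all threads */

  #   pragma omp parallel num_threads(4)
      {
          int y = omp_get_thread_num();     /* private: one copy per thread */
          printf("thread %d sees shared x = %d and private y = %d\n",
                 omp_get_thread_num(), x, y);
      }
      return 0;
  }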

  12. Another Trap Function
  double Local_trap(double a, double b, int n);

  /* Version 1: the whole call is inside the critical section,
     so the threads compute their local results one at a time. */
  global_result = 0.0;
  # pragma omp parallel num_threads(thread_count)
  {
  #  pragma omp critical
     global_result += Local_trap(a, b, n);
  }

  /* Version 2: each thread computes its result in a private variable
     and only the final addition is protected. */
  global_result = 0.0;
  # pragma omp parallel num_threads(thread_count)
  {
     double my_result = 0.0;  /* private */
     my_result = Local_trap(a, b, n);
  #  pragma omp critical
     global_result += my_result;
  }

  13. The Reduction Clause
  • Reduction: a computation (binary operation) that repeatedly applies the same reduction operator (e.g., addition or multiplication) to a sequence of operands in order to get a single result.
  • Syntax: reduction(<operator>: <variable list>)
  • Notes:
    • The reduction variable itself is shared.
    • A private copy is created for each thread in the team.
    • The private copies are initialized to 0 for the addition operator.

  global_result = 0.0;
  # pragma omp parallel num_threads(thread_count) \
       reduction(+: global_result)
  global_result += Local_trap(a, b, n);
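
  A minimal, self-contained sketch (not from the slides): summing an array with a reduction clause. Each thread gets a private copy of total initialized to 0, and the copies are combined with + when the parallel loop ends.

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      double a[1000];
      double total = 0.0;
      int i;

      for (i = 0; i < 1000; i++)
          a[i] = i * 0.5;

  #   pragma omp parallel for num_threads(4) reduction(+: total)
      for (i = 0; i < 1000; i++)
          total += a[i];

      printf("total = %f\n", total);   /* expected: 0.5 * 999 * 1000 / 2 = 249750 */
      return 0;
  }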

  14. The parallel for Directive
  /* Parallel version */
  h = (b-a)/n;
  approx = (f(a)+f(b))/2.0;
  # pragma omp parallel for num_threads(thread_count) \
       reduction(+: approx)
  for (i = 1; i <= n-1; i++) {
     approx += f(a+i*h);
  }
  approx = h*approx;

  /* Serial version */
  h = (b-a)/n;
  approx = (f(a)+f(b))/2.0;
  for (i = 1; i <= n-1; i++) {
     approx += f(a+i*h);
  }
  approx = h*approx;

  • The code block must be a for loop.
  • Iterations of the for loop are divided among the threads.
  • approx is a reduction variable.
  • i is a private variable.

  15. The parallel for Directive
  • Sounds like a truly wonderful approach to parallelizing serial programs.
  • It does not work with while or do-while loops. How about converting them into for loops?
  • The number of iterations must be determined in advance, so loops such as the following cannot be parallelized:

  for (; ;) {
     ...
  }

  for (i = 0; i < n; i++) {
     if (...) break;
     ...
  }

  • In a nested loop only the parallelized loop's index is private by default; an inner index declared outside the block, such as y below, must be listed in a private clause (see the sketch after this slide):

  int x, y;
  # pragma omp parallel for num_threads(thread_count) private(y)
  for (x = 0; x < width; x++) {
     for (y = 0; y < height; y++) {
        finalImage[x][y] = f(x, y);
     }
  }
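
  A self-contained version of the nested-loop pattern above (WIDTH, HEIGHT and the image contents are made up for illustration). Only the outer index x is implicitly private; y must appear in a private clause because it is declared outside the parallel region.

  #include <stdio.h>
  #include <omp.h>

  #define WIDTH  4
  #define HEIGHT 3

  int main(void) {
      int finalImage[WIDTH][HEIGHT];
      int x, y;

  #   pragma omp parallel for num_threads(2) private(y)
      for (x = 0; x < WIDTH; x++) {
          for (y = 0; y < HEIGHT; y++) {
              finalImage[x][y] = x * 10 + y;   /* stand-in for f(x, y) */
          }
      }

      for (x = 0; x < WIDTH; x++) {
          for (y = 0; y < HEIGHT; y++)
              printf("%3d ", finalImage[x][y]);
          printf("\n");
      }
      return 0;
  }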

  16. Estimating π
  /* Serial version */
  double factor = 1.0;
  double sum = 0.0;
  for (k = 0; k < n; k++) {
     sum += factor/(2*k+1);
     factor = -factor;
  }
  pi_approx = 4.0*sum;

  /* Naive parallel version: is it correct? */
  double factor = 1.0;
  double sum = 0.0;
  # pragma omp parallel for num_threads(thread_count) \
       reduction(+: sum)
  for (k = 0; k < n; k++) {
     sum += factor/(2*k+1);
     factor = -factor;     /* loop-carried dependence! */
  }
  pi_approx = 4.0*sum;

  17. Estimating π
  /* Two equivalent ways to compute factor from k alone: */
  if (k % 2 == 0)
     factor = 1.0;
  else
     factor = -1.0;
  sum += factor/(2*k+1);

  factor = (k % 2 == 0) ? 1.0 : -1.0;
  sum += factor/(2*k+1);

  /* Parallel version with factor recomputed in every iteration: */
  double factor = 1.0;
  double sum = 0.0;
  # pragma omp parallel for num_threads(thread_count) \
       reduction(+: sum) private(factor)
  for (k = 0; k < n; k++) {
     if (k % 2 == 0)
        factor = 1.0;
     else
        factor = -1.0;
     sum += factor/(2*k+1);
  }
  pi_approx = 4.0*sum;

  18. Scope Matters
  • With the default(none) clause, we need to specify the scope of every variable used in the block that was declared outside the block.
  • The value of a variable with private scope is unspecified at the beginning (and after completion) of a parallel or parallel for block, so the private factor must be assigned inside the loop before it is used.

  double factor = 1.0;
  double sum = 0.0;
  # pragma omp parallel for num_threads(thread_count) \
       default(none) reduction(+: sum) private(k, factor) shared(n)
  for (k = 0; k < n; k++) {
     if (k % 2 == 0)
        factor = 1.0;
     else
        factor = -1.0;
     sum += factor/(2*k+1);
  }
  pi_approx = 4.0*sum;
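
  A minimal, self-contained sketch combining slides 17 and 18: estimating π with the alternating series 4*(1 - 1/3 + 1/5 - ...). Reading thread_count and n from the command line is an assumption made for the example.

  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  int main(int argc, char* argv[]) {
      int thread_count = strtol(argv[1], NULL, 10);
      long n = strtol(argv[2], NULL, 10);
      double factor;
      double sum = 0.0;
      long k;

  #   pragma omp parallel for num_threads(thread_count) \
          default(none) reduction(+: sum) private(k, factor) shared(n)
      for (k = 0; k < n; k++) {
          factor = (k % 2 == 0) ? 1.0 : -1.0;   /* recomputed from k: no loop-carried dependence */
          sum += factor / (2*k + 1);
      }

      printf("With n = %ld terms, pi is approximately %.14f\n", n, 4.0*sum);
      return 0;
  }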

  19. Bubble Sort
  for (len = n; len >= 2; len--)
     for (i = 0; i < len-1; i++)
        if (a[i] > a[i+1]) {
           tmp = a[i];
           a[i] = a[i+1];
           a[i+1] = tmp;
        }

  • Can we make it faster?
  • Can we parallelize the outer loop?
  • Can we parallelize the inner loop?

  20. Odd-Even Sort Any opportunities for parallelism?

  21. Odd-Even Sort
  void Odd_even_sort(int a[], int n) {
     int phase, i, temp;
     for (phase = 0; phase < n; phase++)
        if (phase % 2 == 0) {   /* Even phase */
           for (i = 1; i < n; i += 2)
              if (a[i-1] > a[i]) {
                 temp = a[i]; a[i] = a[i-1]; a[i-1] = temp;
              }
        } else {                /* Odd phase */
           for (i = 1; i < n-1; i += 2)
              if (a[i] > a[i+1]) {
                 temp = a[i]; a[i] = a[i+1]; a[i+1] = temp;
              }
        }
  }

  22. Odd-Even Sort in OpenMP
  for (phase = 0; phase < n; phase++) {
     if (phase % 2 == 0) {   /* Even phase */
  #     pragma omp parallel for num_threads(thread_count) \
           default(none) shared(a, n) private(i, temp)
        for (i = 1; i < n; i += 2)
           if (a[i-1] > a[i]) {
              temp = a[i]; a[i] = a[i-1]; a[i-1] = temp;
           }
     } else {                /* Odd phase */
  #     pragma omp parallel for num_threads(thread_count) \
           default(none) shared(a, n) private(i, temp)
        for (i = 1; i < n-1; i += 2)
           if (a[i] > a[i+1]) {
              temp = a[i]; a[i] = a[i+1]; a[i+1] = temp;
           }
     }
  }

  23. Odd-Even Sort in OpenMP
  # pragma omp parallel num_threads(thread_count) \
       default(none) shared(a, n) private(i, temp, phase)
  for (phase = 0; phase < n; phase++) {
     if (phase % 2 == 0) {   /* Even phase */
  #     pragma omp for
        for (i = 1; i < n; i += 2)
           if (a[i-1] > a[i]) {
              temp = a[i]; a[i] = a[i-1]; a[i-1] = temp;
           }
     } else {                /* Odd phase */
  #     pragma omp for
        for (i = 1; i < n-1; i += 2)
           if (a[i] > a[i+1]) {
              temp = a[i]; a[i] = a[i+1]; a[i+1] = temp;
           }
     }
  }

  24. Data Partitioning
  [Figure: two ways of assigning loop iterations to threads 0, 1 and 2: a block partition, where each thread gets one contiguous block of iterations, and a cyclic partition, where iterations are dealt out to the threads one at a time in round-robin fashion.]

  25. Scheduling Loops
  sum = 0.0;
  for (i = 0; i <= n; i++)
     sum += f(i);

  double f(int i) {
     int j, start = i*(i+1)/2, finish = start + i;
     double return_val = 0.0;
     for (j = start; j <= finish; j++) {
        return_val += sin(j);
     }
     return return_val;
  }

  The cost of f(i) grows with i, so simply splitting the iterations into equal blocks gives the threads very unequal amounts of work: load balancing matters.

  26. The schedule clause
  sum = 0.0;
  # pragma omp parallel for num_threads(thread_count) \
       reduction(+: sum) schedule(static, 1)
  for (i = 0; i < n; i++)
     sum += f(i);

  • The second argument of schedule is the chunksize.
  • The default schedule is roughly equivalent to schedule(static, total_iterations/thread_count).

  27. The dynamic and guided Types
  • In a dynamic schedule:
    • Iterations are broken into chunks of chunksize consecutive iterations (default chunksize: 1).
    • Each thread executes a chunk; when a thread finishes a chunk, it requests another one.
  • In a guided schedule:
    • Each thread executes a chunk; when a thread finishes a chunk, it requests another one.
    • As chunks are completed, the size of the new chunks decreases; each new chunk is approximately the number of iterations remaining divided by the number of threads.
    • The chunk size decreases down to chunksize, or to 1 if no chunksize is specified.
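
  A minimal sketch (not from the slides) of a dynamic schedule applied to the load-imbalanced f from slide 25; the cost of f(i) grows with i, so handing out modest chunks on demand keeps the threads busy. A guided schedule is used the same way, with chunk sizes that shrink over time. (With gcc this would be compiled with -fopenmp and linked with -lm.)

  #include <stdio.h>
  #include <math.h>
  #include <omp.h>

  double f(int i) {
      int j, start = i*(i+1)/2, finish = start + i;
      double return_val = 0.0;
      for (j = start; j <= finish; j++)
          return_val += sin(j);
      return return_val;
  }

  int main(void) {
      int n = 10000, i;
      double sum = 0.0;

  #   pragma omp parallel for num_threads(4) reduction(+: sum) \
          schedule(dynamic, 16)          /* chunks of 16 iterations, handed out on demand */
      for (i = 0; i < n; i++)
          sum += f(i);

      printf("sum = %f\n", sum);
      return 0;
  }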

  28. Which schedule?
  • The optimal schedule depends on the type of problem, the number of iterations and the number of threads.
  • Scheduling overhead: guided > dynamic > static.
  • If you are getting satisfactory results (e.g., close to the theoretical maximum speedup) without a schedule clause, go no further.
  • The cost of the iterations:
    • If it is roughly the same for every iteration, use the default schedule.
    • If it decreases or increases linearly as the loop executes, a static schedule with a small chunksize will work well.
    • If it cannot be determined in advance, try different options.

  29. Performance Issue
  /* Matrix-vector multiplication y = A*x, where A is m-by-n */
  # pragma omp parallel for num_threads(thread_count) \
       default(none) private(i, j) shared(A, x, y, m, n)
  for (i = 0; i < m; i++) {
     y[i] = 0.0;
     for (j = 0; j < n; j++)
        y[i] += A[i][j]*x[j];
  }

  30. Performance Issue
  [Figure: cache misses and false sharing in the matrix-vector multiplication.]

  31. Performance Issue
  • 8,000,000-by-8 matrix:
    • y has 8,000,000 elements, so there is potentially a large number of write misses.
  • 8-by-8,000,000 matrix:
    • x has 8,000,000 elements, so there is potentially a large number of read misses.
    • y has only 8 elements (8 doubles), which could all be stored in the same cache line (64 bytes).
    • Potentially serious false sharing among the processors.
  • 8000-by-8000 matrix:
    • y has 8,000 elements (8,000 doubles).
    • Thread 2 handles rows 4000 to 5999 and thread 3 handles rows 6000 to 7999; only the boundary elements {y[5996], y[5997], y[5998], y[5999], y[6000], y[6001], y[6002], y[6003]} could end up in the same cache line.
    • The effect of false sharing is highly unlikely to matter here.
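
  A hedged sketch (not from the slides) of one common way to limit false sharing on a small result vector: pad each element so adjacent elements never share a cache line. The 64-byte cache line and the tiny 8-element vector are assumptions matching the 8-by-8,000,000 case above.

  #include <stdio.h>
  #include <omp.h>

  #define CACHE_LINE 64
  #define M 8   /* tiny y vector, as in the 8-by-8,000,000 case */

  typedef struct {
      double value;
      char pad[CACHE_LINE - sizeof(double)];   /* pad each element to a full cache line */
  } padded_double;

  int main(void) {
      padded_double y[M];
      int i;

  #   pragma omp parallel for num_threads(4)
      for (i = 0; i < M; i++) {
          /* adjacent elements no longer share a cache line */
          y[i].value = 2.0 * i;   /* stand-in for the row computation */
      }

      for (i = 0; i < M; i++)
          printf("y[%d] = %f\n", i, y[i].value);
      return 0;
  }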

  32. Thread Safety
  • How to generate random numbers in C?
    • First, call srand() with an integer seed.
    • Second, call rand() to create a sequence of random numbers.
  • rand() is a pseudorandom number generator (PRNG) with shared internal state.
  • Is it thread safe? Can it be executed simultaneously by multiple threads without causing problems?
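
  A minimal sketch (assuming the POSIX rand_r function is available): rand() keeps hidden shared state and is therefore not guaranteed to be thread safe, whereas rand_r takes the generator state as an explicit argument, so each thread can keep its own.

  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  int main(void) {
  #   pragma omp parallel num_threads(4)
      {
          unsigned int seed = 100 + omp_get_thread_num();  /* private, per-thread state */
          int i;
          for (i = 0; i < 3; i++) {
              int r = rand_r(&seed);   /* touches only this thread's seed */
              printf("thread %d: %d\n", omp_get_thread_num(), r);
          }
      }
      return 0;
  }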

  33. Foster’s Methodology
  • Partitioning
    • Divide the computation and the data into small tasks.
    • Identify tasks that can be executed in parallel.
  • Communication
    • Determine what communication needs to be carried out.
    • Local communication vs. global communication.
  • Agglomeration
    • Group tasks into larger tasks.
    • Reduce communication; respect task dependences.
  • Mapping
    • Assign the composite tasks to processes/threads.

  34. Foster’s Methodology

  35. The n-body Problem
  • To predict the motion of a group of objects that interact with each other gravitationally over a period of time.
  • Inputs: the mass, position and velocity of each object.
  • For an astrophysicist: the positions and velocities of a collection of stars.
  • For a chemist: the positions and velocities of a collection of molecules.

  36. Newton’s Law
  F_qk = -G * m_q * m_k * (s_q - s_k) / |s_q - s_k|^3
  (the gravitational force exerted on particle q by particle k, which is what the force loop on the next slide computes)

  37. The Basic Algorithm
  Get input data;
  for each timestep {
     if (timestep output)
        Print positions and velocities of particles;
     for each particle q
        Compute total force on q;
     for each particle q
        Compute position and velocity of q;
  }

  /* Computing the total force on each particle: */
  for each particle q {
     forces[q][0] = forces[q][1] = 0;
     for each particle k != q {
        x_diff = pos[q][0] - pos[k][0];
        y_diff = pos[q][1] - pos[k][1];
        dist = sqrt(x_diff*x_diff + y_diff*y_diff);
        dist_cubed = dist*dist*dist;
        forces[q][0] -= G*masses[q]*masses[k]/dist_cubed*x_diff;
        forces[q][1] -= G*masses[q]*masses[k]/dist_cubed*y_diff;
     }
  }

  38. The Reduced Algorithm
  for each particle q
     forces[q][0] = forces[q][1] = 0;
  for each particle q {
     for each particle k > q {
        x_diff = pos[q][0] - pos[k][0];
        y_diff = pos[q][1] - pos[k][1];
        dist = sqrt(x_diff*x_diff + y_diff*y_diff);
        dist_cubed = dist*dist*dist;
        force_qk[0] = -G*masses[q]*masses[k]/dist_cubed*x_diff;
        force_qk[1] = -G*masses[q]*masses[k]/dist_cubed*y_diff;
        forces[q][0] += force_qk[0];
        forces[q][1] += force_qk[1];
        forces[k][0] -= force_qk[0];
        forces[k][1] -= force_qk[1];
     }
  }

  39. Euler Method
  s(t + Δt) ≈ s(t) + Δt * v(t),   v(t + Δt) ≈ v(t) + Δt * F(t)/m
  (one explicit Euler step, as implemented on the next slide)

  40. Position and Velocity
  for each particle q {
     pos[q][0] += delta_t*vel[q][0];
     pos[q][1] += delta_t*vel[q][1];
     vel[q][0] += delta_t*forces[q][0]/masses[q];
     vel[q][1] += delta_t*forces[q][1]/masses[q];
  }

  41. Communications
  [Figure: communications between the tasks for two particles q and r: computing the forces Fq(t) and Fr(t) requires the positions of both particles, and each particle's position and velocity at t + Δt are computed from its own position, velocity and force at t.]

  42. Agglomeration
  [Figure: after agglomeration, all of the computations for one particle (s, v and F) at a timestep form a single task, with positions communicated between the tasks for q and r from timestep t to timestep t + Δt.]

  43. Parallelizing the Basic Solver
  # pragma omp parallel
  for each timestep {
     if (timestep output) {
  #     pragma omp single nowait
        Print positions and velocities of particles;
     }
  #  pragma omp for
     for each particle q
        Compute total force on q;
  #  pragma omp for
     for each particle q
        Compute position and velocity of q;
  }

  Race conditions?

  44. Parallelizing the Reduced Solver
  # pragma omp for
  for each particle q
     forces[q][0] = forces[q][1] = 0;
  # pragma omp for
  for each particle q {
     for each particle k > q {
        x_diff = pos[q][0] - pos[k][0];
        y_diff = pos[q][1] - pos[k][1];
        dist = sqrt(x_diff*x_diff + y_diff*y_diff);
        dist_cubed = dist*dist*dist;
        force_qk[0] = -G*masses[q]*masses[k]/dist_cubed*x_diff;
        force_qk[1] = -G*masses[q]*masses[k]/dist_cubed*y_diff;
        forces[q][0] += force_qk[0];
        forces[q][1] += force_qk[1];
        forces[k][0] -= force_qk[0];
        forces[k][1] -= force_qk[1];
     }
  }

  45. Does it work properly?
  • Consider 2 threads and 4 particles.
  • Thread 1 is assigned particle 0 and particle 1.
  • Thread 2 is assigned particle 2 and particle 3.
  • F3 = -f03 - f13 - f23
  • Who will calculate f03 and f13? Who will calculate f23?
  • Any race conditions?

  46. Thread Contributions 3 Threads, 6 Particles, Block Partition

  47. First Phase
  # pragma omp for
  for each particle q {
     for each particle k > q {
        x_diff = pos[q][0] - pos[k][0];
        y_diff = pos[q][1] - pos[k][1];
        dist = sqrt(x_diff*x_diff + y_diff*y_diff);
        dist_cubed = dist*dist*dist;
        force_qk[0] = -G*masses[q]*masses[k]/dist_cubed*x_diff;
        force_qk[1] = -G*masses[q]*masses[k]/dist_cubed*y_diff;
        loc_forces[my_rank][q][0] += force_qk[0];
        loc_forces[my_rank][q][1] += force_qk[1];
        loc_forces[my_rank][k][0] -= force_qk[0];
        loc_forces[my_rank][k][1] -= force_qk[1];
     }
  }

  48. Second Phase
  # pragma omp for
  for (q = 0; q < n; q++) {
     forces[q][0] = forces[q][1] = 0;
     for (thread = 0; thread < thread_count; thread++) {
        forces[q][0] += loc_forces[thread][q][0];
        forces[q][1] += loc_forces[thread][q][1];
     }
  }

  • In the first phase, each thread carries out the same calculations as before, but the values are stored in its own array of forces (loc_forces).
  • In the second phase, the thread that has been assigned particle q adds up the contributions computed by the different threads.
  • Race conditions?

  49. Evaluating the OpenMP Codes
  • In the reduced code:
    • Loop 1: initialization of the loc_forces array.
    • Loop 2: the first phase of the computation of forces.
    • Loop 3: the second phase of the computation of forces.
    • Loop 4: the updating of positions and velocities.
  • Which schedule should be used for each loop?

  50. Review
  • What are the major differences between MPI and OpenMP?
  • What is the scope of a variable?
  • What is a reduction variable?
  • How do we ensure mutual exclusion in a critical section?
  • What are the common loop scheduling options?
  • What factors may potentially affect the performance of an OpenMP program?
  • What is a thread-safe function?
