Open Multiprocessing Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn
OpenMP
• An API for shared-memory multiprocessing (parallel) programming in C, C++ and Fortran.
• Supports multiple platforms (processor architectures and operating systems).
• Higher-level than explicit threading: the programmer marks a block of code that should be executed in parallel.
• A method of parallelizing whereby a master thread forks a number of slave threads and a task is divided among them.
• Based on preprocessor directives (pragmas); requires compiler support; header file: omp.h
• References
  • http://openmp.org/
  • https://computing.llnl.gov/tutorials/openMP/
  • http://supercomputingblog.com/openmp/
Hello, World!

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Hello(void);

int main(int argc, char* argv[]) {
   /* Get number of threads from command line */
   int thread_count = strtol(argv[1], NULL, 10);

#  pragma omp parallel num_threads(thread_count)
   Hello();

   return 0;
}

void Hello(void) {
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();

   printf("Hello from thread %d of %d\n", my_rank, thread_count);
}
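With GCC, the -fopenmp flag enables OpenMP support. A possible way to build and run the program above (the file and executable names are assumptions, not from the slides):

   gcc -g -Wall -fopenmp -o omp_hello omp_hello.c
   ./omp_hello 4

The command-line argument (here 4) is the thread count read by strtol in main.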
Definitions

# pragma omp parallel [clauses]   /* clauses: text that modifies the directive */
{
   code_block
}                                 /* implicit barrier at the end of the block */

Error Checking

#ifdef _OPENMP
#  include <omp.h>
#endif

#ifdef _OPENMP
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();
#else
   int my_rank = 0;
   int thread_count = 1;
#endif

Thread Team = Master + Slaves
The Trapezoidal Rule

/* Input: a, b, n */
h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
for (i = 1; i <= n-1; i++) {
   x_i = a + i*h;
   approx += f(x_i);
}
approx = h*approx;

When several threads (e.g., Thread 0 and Thread 2) add their partial results to a shared variable in shared memory, there is a race condition; the update must be protected:

# pragma omp critical
global_result += my_result;
The critical Directive

# pragma omp critical
y = f(x);
...
double f(double x) {
#  pragma omp critical
   z = g(x);
   ...
}

Cannot be executed simultaneously!

# pragma omp critical(one)
y = f(x);
...
double f(double x) {
#  pragma omp critical(two)
   z = g(x);
   ...
}

Deadlock
The atomic Directive

# pragma omp atomic
x <op>= <expression>;

• <op> can be one of the binary operators: +, *, -, /, &, ^, |, <<, >>
• The forms x++, ++x, x-- and --x are also allowed.
• Higher performance than the critical directive.
• Only a single C assignment statement is protected.
• Only the load and store of x are protected.
• <expression> must not reference x.

# pragma omp atomic          # pragma omp critical
x += f(y);                   x = g(x);

Can be executed simultaneously! (an atomic statement and a critical section are not guaranteed to exclude each other, so mixing them to protect the same variable x is unsafe)
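A minimal sketch of the atomic directive, assuming a shared counter named hits and a thread_count set as in the earlier examples (both names are assumptions):

   int hits = 0;
#  pragma omp parallel num_threads(thread_count)
   {
      /* Each thread increments the shared counter once;      */
      /* atomic protects the read-modify-write of hits.       */
#     pragma omp atomic
      hits += 1;
   }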
Locks

/* Executed by one thread */
Initialize the lock data structure;
...
/* Executed by multiple threads */
Attempt to lock or set the lock data structure;
Critical section;
Unlock or unset the lock data structure;
...
/* Executed by one thread */
Destroy the lock data structure;

void omp_init_lock(omp_lock_t* lock_p);
void omp_set_lock(omp_lock_t* lock_p);
void omp_unset_lock(omp_lock_t* lock_p);
void omp_destroy_lock(omp_lock_t* lock_p);
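A minimal usage sketch of these functions; shared_total and my_value are hypothetical names, not from the slides:

   omp_lock_t lock;
   double shared_total = 0.0;
   omp_init_lock(&lock);              /* executed once, before the parallel region */
#  pragma omp parallel num_threads(thread_count)
   {
      double my_value = 1.0;          /* placeholder for per-thread work */
      omp_set_lock(&lock);            /* blocks until the lock is acquired */
      shared_total += my_value;       /* critical section: one thread at a time */
      omp_unset_lock(&lock);          /* releases the lock */
   }
   omp_destroy_lock(&lock);           /* executed once, after the parallel region */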
Trapezoidal Rule in OpenMP

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Trap(double a, double b, int n, double* global_result_p);

int main(int argc, char* argv[]) {
   double global_result = 0.0;
   double a, b;
   int n, thread_count;

   thread_count = strtol(argv[1], NULL, 10);
   printf("Enter a, b, and n\n");
   scanf("%lf %lf %d", &a, &b, &n);

#  pragma omp parallel num_threads(thread_count)
   Trap(a, b, n, &global_result);

   printf("With n = %d trapezoids, our estimate\n", n);
   printf("of the integral from %f to %f = %.15e\n", a, b, global_result);
   return 0;
}
Trapezoidal Rule in OpenMP

void Trap(double a, double b, int n, double* global_result_p) {
   double h, x, my_result;
   double local_a, local_b;
   int i, local_n;
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();

   h = (b-a)/n;
   local_n = n/thread_count;
   local_a = a + my_rank*local_n*h;
   local_b = local_a + local_n*h;
   my_result = (f(local_a)+f(local_b))/2.0;
   for (i = 1; i <= local_n-1; i++) {
      x = local_a + i*h;
      my_result += f(x);
   }
   my_result = my_result*h;

#  pragma omp critical
   *global_result_p += my_result;
}
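The slides never define the integrand f; any smooth function works. A placeholder definition such as the following (an assumption, not from the slides) makes the program compile, given a prototype double f(double x); near the top of the file:

   double f(double x) {
      return x*x;   /* example integrand: the code then estimates the integral of x^2 over [a, b] */
   }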
Scope of Variables
• In serial programming:
  • Function-wide scope
  • File-wide scope
• Shared scope (in the Trap example: a, b, n, global_result, thread_count)
  • Accessible by all threads in a team
  • Variables declared before a parallel directive
• Private scope (in the Trap example: my_rank, my_result, global_result_p)
  • Only accessible by a single thread
  • Variables declared in the code block
• Note: global_result_p itself is private, but the object it points to, *global_result_p, is shared.
Another Trap Function

double Local_trap(double a, double b, int n);

global_result = 0.0;
# pragma omp parallel num_threads(thread_count)
{
#  pragma omp critical
   global_result += Local_trap(a, b, n);   /* the calls to Local_trap are serialized! */
}

global_result = 0.0;
# pragma omp parallel num_threads(thread_count)
{
   double my_result = 0.0;   /* private */
   my_result = Local_trap(a, b, n);
#  pragma omp critical
   global_result += my_result;
}
The Reduction Clause
• Reduction: a computation that repeatedly applies the same binary reduction operator (e.g., addition or multiplication) to a sequence of operands in order to get a single result.
• Syntax: reduction(<operator>: <variable list>)
• Note:
  • The reduction variable itself is shared.
  • A private copy is created for each thread in the team.
  • The private copies are initialized to 0 for the addition operator.

global_result = 0.0;
# pragma omp parallel num_threads(thread_count) \
   reduction(+: global_result)
global_result += Local_trap(a, b, n);
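Besides +, OpenMP accepts *, -, &, |, ^, && and || as reduction operators, and (since OpenMP 3.1) min and max. A small sketch with max, assuming an array a of n doubles and an int loop index i (these names are assumptions):

   double max_val = a[0];
#  pragma omp parallel for num_threads(thread_count) \
      reduction(max: max_val)
   for (i = 1; i < n; i++)
      if (a[i] > max_val)
         max_val = a[i];      /* each thread finds the max of its chunk; results are combined at the end */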
The parallel for Directive

Parallel version:
h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
# pragma omp parallel for num_threads(thread_count) \
   reduction(+: approx)
for (i = 1; i <= n-1; i++) {
   approx += f(a+i*h);
}
approx = h*approx;

Serial version:
h = (b-a)/n;
approx = (f(a)+f(b))/2.0;
for (i = 1; i <= n-1; i++) {
   approx += f(a+i*h);
}
approx = h*approx;

• The code block must be a for loop.
• Iterations of the for loop are divided among the threads.
• approx is a reduction variable.
• i is a private variable.
The parallel for Directive
• Sounds like a truly wonderful approach to parallelizing serial programs, but:
• Does not work with while or do-while loops.
  • How about converting them into for loops?
• The number of iterations must be determined in advance, so loops like these cannot be parallelized:

for (; ;) {
   ...
}

for (i = 0; i < n; i++) {
   if (...) break;
   ...
}

Nested loops: only the outer loop is parallelized, and the inner index y, declared before the directive, must be listed in a private clause:

int x, y;
# pragma omp parallel for num_threads(thread_count) private(y)
for (x = 0; x < width; x++) {
   for (y = 0; y < height; y++) {
      finalImage[x][y] = f(x, y);
   }
}
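An alternative sketch for the same image loop, for OpenMP 3.0 and later: the collapse(2) clause merges both loop levels into one iteration space and privatizes both indices automatically (width, height, finalImage and f are the identifiers from the slide; this variant is not shown in the slides):

   int x, y;
#  pragma omp parallel for num_threads(thread_count) collapse(2)
   for (x = 0; x < width; x++) {
      for (y = 0; y < height; y++) {
         finalImage[x][y] = f(x, y);   /* every (x, y) pixel is independent */
      }
   }

Collapsing can improve load balance when width is small relative to the number of threads.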
Estimating π

The code estimates π from the series π = 4·(1 - 1/3 + 1/5 - 1/7 + ...).

Serial version:
double factor = 1.0;
double sum = 0.0;
for (k = 0; k < n; k++) {
   sum += factor/(2*k+1);
   factor = -factor;
}
pi_approx = 4.0*sum;

A first attempt at parallelization:
double factor = 1.0;
double sum = 0.0;
# pragma omp parallel for \
   num_threads(thread_count) \
   reduction(+: sum)
for (k = 0; k < n; k++) {
   sum += factor/(2*k+1);
   factor = -factor;
}
pi_approx = 4.0*sum;

Does it work? No: the value of factor in one iteration depends on the previous iteration, a loop-carried dependence.
Estimating π

Compute factor directly from k:
if (k % 2 == 0)
   factor = 1.0;
else
   factor = -1.0;
sum += factor/(2*k+1);

Equivalently:
factor = (k % 2 == 0) ? 1.0 : -1.0;
sum += factor/(2*k+1);

Corrected parallel version:
double factor = 1.0;
double sum = 0.0;
# pragma omp parallel for num_threads(thread_count) \
   reduction(+: sum) private(factor)
for (k = 0; k < n; k++) {
   if (k % 2 == 0)
      factor = 1.0;
   else
      factor = -1.0;
   sum += factor/(2*k+1);
}
pi_approx = 4.0*sum;
Scope Matters
• With the default(none) clause, we must specify the scope of every variable used in the block that was declared outside the block.
• The value of a variable with private scope is unspecified at the beginning (and after completion) of a parallel or parallel for block.

double factor = 1.0;
double sum = 0.0;
# pragma omp parallel for num_threads(thread_count) \
   default(none) reduction(+: sum) private(k, factor) shared(n)
for (k = 0; k < n; k++) {
   if (k % 2 == 0)
      factor = 1.0;
   else
      factor = -1.0;
   sum += factor/(2*k+1);
}
pi_approx = 4.0*sum;

Note: the initial value of the private copy of factor is unspecified (it does not inherit the value 1.0), which is why factor is re-assigned in every iteration.
Bubble Sort

for (len = n; len >= 2; len--)
   for (i = 0; i < len-1; i++)
      if (a[i] > a[i+1]) {
         tmp = a[i];
         a[i] = a[i+1];
         a[i+1] = tmp;
      }

• Can we make it faster?
• Can we parallelize the outer loop?
• Can we parallelize the inner loop?
Odd-Even Sort Any opportunities for parallelism?
Odd-Even Sort

void Odd_even_sort(int a[], int n) {
   int phase, i, temp;

   for (phase = 0; phase < n; phase++)
      if (phase % 2 == 0) {   /* Even phase */
         for (i = 1; i < n; i += 2)
            if (a[i-1] > a[i]) {
               temp = a[i];
               a[i] = a[i-1];
               a[i-1] = temp;
            }
      } else {                /* Odd phase */
         for (i = 1; i < n-1; i += 2)
            if (a[i] > a[i+1]) {
               temp = a[i];
               a[i] = a[i+1];
               a[i+1] = temp;
            }
      }
}
Odd-Even Sort in OpenMP

for (phase = 0; phase < n; phase++) {
   if (phase % 2 == 0) {   /* Even phase */
#     pragma omp parallel for num_threads(thread_count) \
         default(none) shared(a, n) private(i, temp)
      for (i = 1; i < n; i += 2)
         if (a[i-1] > a[i]) {
            temp = a[i];
            a[i] = a[i-1];
            a[i-1] = temp;
         }
   } else {                /* Odd phase */
#     pragma omp parallel for num_threads(thread_count) \
         default(none) shared(a, n) private(i, temp)
      for (i = 1; i < n-1; i += 2)
         if (a[i] > a[i+1]) {
            temp = a[i];
            a[i] = a[i+1];
            a[i+1] = temp;
         }
   }
}
Odd-Even Sort in OpenMP

# pragma omp parallel num_threads(thread_count) \
   default(none) shared(a, n) private(i, temp, phase)
for (phase = 0; phase < n; phase++) {
   if (phase % 2 == 0) {   /* Even phase */
#     pragma omp for
      for (i = 1; i < n; i += 2)
         if (a[i-1] > a[i]) {
            temp = a[i];
            a[i] = a[i-1];
            a[i-1] = temp;
         }
   } else {                /* Odd phase */
#     pragma omp for
      for (i = 1; i < n-1; i += 2)
         if (a[i] > a[i+1]) {
            temp = a[i];
            a[i] = a[i+1];
            a[i+1] = temp;
         }
   }
}

The thread team is forked only once, instead of once per phase; the inner omp for directives reuse the existing team.
Data Partitioning
• Block: the iterations are divided into contiguous chunks, one chunk per thread (threads 0, 1, 2).
• Cyclic: the iterations are dealt out to threads 0, 1, 2 one at a time, in round-robin order.
Scheduling Loops

sum = 0.0;
for (i = 0; i <= n; i++)
   sum += f(i);

double f(int i) {
   int j, start = i*(i+1)/2, finish = start+i;
   double return_val = 0.0;

   for (j = start; j <= finish; j++) {
      return_val += sin(j);
   }
   return return_val;
}

Load balancing: f(i) performs i+1 sine evaluations, so later iterations cost more; dividing the iterations evenly by index gives some threads far more work than others.
The schedule clause

sum = 0.0;
# pragma omp parallel for num_threads(thread_count) \
   reduction(+: sum) schedule(static, 1)
for (i = 0; i < n; i++)
   sum += f(i);

• The second argument of schedule(static, 1) is the chunksize.
• The default schedule is roughly equivalent to schedule(static, total_iterations/thread_count).
The dynamic and guided Types
• In a dynamic schedule:
  • Iterations are broken into chunks of chunksize consecutive iterations.
  • Default chunksize value: 1
  • Each thread executes a chunk; when a thread finishes a chunk, it requests another one.
• In a guided schedule:
  • Each thread executes a chunk; when a thread finishes a chunk, it requests another one.
  • As chunks are completed, the size of the new chunks decreases.
  • A new chunk is approximately the number of iterations remaining divided by the number of threads.
  • The chunk size decreases down to chunksize, or to 1 by default.
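A sketch of how these schedule types are requested, reusing the sum += f(i) loop from the scheduling slides (the chunk size of 2 is an arbitrary choice):

   sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) schedule(dynamic, 2)    /* chunks of 2 iterations, handed out on demand */
   for (i = 0; i < n; i++)
      sum += f(i);

   sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: sum) schedule(guided)        /* chunk sizes shrink as the loop progresses */
   for (i = 0; i < n; i++)
      sum += f(i);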
Which schedule?
• The optimal schedule depends on:
  • The type of problem
  • The number of iterations
  • The number of threads
• Overhead: guided > dynamic > static
• If you are getting satisfactory results (e.g., close to the theoretical maximum speedup) without a schedule clause, go no further.
• The cost of the iterations:
  • If it is roughly the same for all iterations, use the default schedule.
  • If it decreases or increases linearly as the loop executes, a static schedule with small chunksize values will be good.
  • If it cannot be determined in advance, try exploring different options.
Performance Issue

Matrix-vector multiplication y = A x, where A is m-by-n:

# pragma omp parallel for num_threads(thread_count) \
   default(none) private(i, j) shared(A, x, y, m, n)
for (i = 0; i < m; i++) {
   y[i] = 0.0;
   for (j = 0; j < n; j++)
      y[i] += A[i][j]*x[j];
}
Performance Issue
• Cache misses
• False sharing
Performance Issue
• 8,000,000-by-8: y has 8,000,000 elements, so there is a potentially large number of write misses.
• 8-by-8,000,000: x has 8,000,000 elements, so there is a potentially large number of read misses.
• 8-by-8,000,000: y has 8 elements (8 doubles), which can all fit in the same cache line (64 bytes), so there is a potentially serious false-sharing effect with multiple processors.
• 8000-by-8000: y has 8,000 elements (8,000 doubles). With thread 2 assigned rows 4000 to 5999 and thread 3 assigned rows 6000 to 7999, only the boundary elements {y[5996], y[5997], y[5998], y[5999], y[6000], y[6001], y[6002], y[6003]} could share a cache line, so the effect of false sharing is highly unlikely.
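One common way to soften the write-miss and false-sharing problems in the matrix-vector code is to accumulate each row's result in a per-thread temporary so that y[i] is written only once. A sketch under that assumption (A, x, y, m, n follow the earlier slide; temp is a name introduced here):

   int i, j;
   double temp;
#  pragma omp parallel for num_threads(thread_count) \
      default(none) private(i, j, temp) shared(A, x, y, m, n)
   for (i = 0; i < m; i++) {
      temp = 0.0;                 /* private accumulator for row i */
      for (j = 0; j < n; j++)
         temp += A[i][j]*x[j];
      y[i] = temp;                /* a single write to y[i], fewer chances for false sharing */
   }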
Thread Safety
• How do we generate random numbers in C?
  • First, call srand() with an integer seed.
  • Second, call rand() to create a sequence of random numbers.
  • This is a pseudorandom number generator (PRNG).
• Is it thread safe?
  • Can it be executed simultaneously by multiple threads without causing problems?
  • rand() updates hidden internal state that is shared by all callers, so simultaneous calls from multiple threads race on that shared state.
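One thread-safe alternative on POSIX systems is rand_r() (declared in stdlib.h), which keeps its state in a caller-supplied variable instead of hidden shared state. A minimal sketch, assuming thread_count is set as before:

#  pragma omp parallel num_threads(thread_count)
   {
      /* Each thread owns its seed, so no PRNG state is shared between threads. */
      unsigned int seed = omp_get_thread_num() + 1;   /* per-thread seed */
      int r = rand_r(&seed);                          /* reentrant PRNG call */
      printf("Thread %d drew %d\n", omp_get_thread_num(), r);
   }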
Foster's Methodology
• Partitioning
  • Divide the computation and the data into small tasks.
  • Identify tasks that can be executed in parallel.
• Communication
  • Determine what communication needs to be carried out.
  • Local communication vs. global communication
• Agglomeration
  • Group tasks into larger tasks.
  • Reduce communication.
  • Task dependence
• Mapping
  • Assign the composite tasks to processes/threads.
The n-body Problem
• Goal: predict the motion of a group of objects that interact with each other gravitationally over a period of time.
• Inputs: mass, position and velocity of each particle.
• Astrophysicist: the positions and velocities of a collection of stars.
• Chemist: the positions and velocities of a collection of molecules.
The Basic Algorithm

Get input data;
for each timestep {
   if (timestep output)
      Print positions and velocities of particles;
   for each particle q
      Compute total force on q;
   for each particle q
      Compute position and velocity of q;
}

Computing the total force on each particle q:

for each particle q {
   forces[q][0] = forces[q][1] = 0;
   for each particle k != q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      forces[q][0] -= G*masses[q]*masses[k]/dist_cubed * x_diff;
      forces[q][1] -= G*masses[q]*masses[k]/dist_cubed * y_diff;
   }
}
The Reduced Algorithm

for each particle q
   forces[q][0] = forces[q][1] = 0;
for each particle q {
   for each particle k > q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
      force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;
      forces[q][0] += force_qk[0];
      forces[q][1] += force_qk[1];
      forces[k][0] -= force_qk[0];
      forces[k][1] -= force_qk[1];
   }
}
Position and Velocity

for each particle q {
   pos[q][0] += delta_t*vel[q][0];
   pos[q][1] += delta_t*vel[q][1];
   vel[q][0] += delta_t*forces[q][0]/masses[q];
   vel[q][1] += delta_t*forces[q][1]/masses[q];
}
Communications

[Task-dependency diagram: the positions s_q(t), s_r(t) and velocities v_q(t), v_r(t) feed the force computations F_q(t), F_r(t); the forces, positions and velocities at time t are then used to compute s_q(t + Δt), v_q(t + Δt), s_r(t + Δt), v_r(t + Δt), which in turn feed F_q(t + Δt), F_r(t + Δt).]
Agglomeration

[Diagram: the per-timestep tasks s_q, v_q, F_q (and s_r, v_r, F_r) are agglomerated into one composite task per particle; between particles q and r only the positions s_q, s_r are communicated at each timestep t, while each particle's composite task carries s_q, v_q, F_q forward to time t + Δt.]
Parallelizing the Basic Solver

# pragma omp parallel
for each timestep {
   if (timestep output) {
#     pragma omp single nowait
      Print positions and velocities of particles;
   }
#  pragma omp for
   for each particle q
      Compute total force on q;
#  pragma omp for
   for each particle q
      Compute position and velocity of q;
}

Race conditions?
Parallelizing the Reduced Solver

# pragma omp for
for each particle q
   forces[q][0] = forces[q][1] = 0;
# pragma omp for
for each particle q {
   for each particle k > q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
      force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;
      forces[q][0] += force_qk[0];
      forces[q][1] += force_qk[1];
      forces[k][0] -= force_qk[0];
      forces[k][1] -= force_qk[1];
   }
}
Does it work properly?
• Consider 2 threads and 4 particles.
  • Thread 1 is assigned particle 0 and particle 1.
  • Thread 2 is assigned particle 2 and particle 3.
• F3 = -f03 - f13 - f23
  • Who will calculate f03 and f13?
  • Who will calculate f23?
• Any race conditions?
Thread Contributions

[Table: with 3 threads, 6 particles and a block partition, which thread contributes to the force on each particle.]
First Phase

# pragma omp for
for each particle q {
   for each particle k > q {
      x_diff = pos[q][0] - pos[k][0];
      y_diff = pos[q][1] - pos[k][1];
      dist = sqrt(x_diff*x_diff + y_diff*y_diff);
      dist_cubed = dist*dist*dist;
      force_qk[0] = -G*masses[q]*masses[k]/dist_cubed * x_diff;
      force_qk[1] = -G*masses[q]*masses[k]/dist_cubed * y_diff;
      loc_forces[my_rank][q][0] += force_qk[0];
      loc_forces[my_rank][q][1] += force_qk[1];
      loc_forces[my_rank][k][0] -= force_qk[0];
      loc_forces[my_rank][k][1] -= force_qk[1];
   }
}
Second Phase

# pragma omp for
for (q = 0; q < n; q++) {
   forces[q][0] = forces[q][1] = 0;
   for (thread = 0; thread < thread_count; thread++) {
      forces[q][0] += loc_forces[thread][q][0];
      forces[q][1] += loc_forces[thread][q][1];
   }
}

• In the first phase, each thread carries out the same calculations as before, but the values are stored in its own array of forces (loc_forces).
• In the second phase, the thread that has been assigned particle q adds the contributions computed by the different threads.
• Race conditions?
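The slides do not show how loc_forces is created. One possible C99 allocation, sized thread_count x n x 2 so it matches the indexing loc_forces[thread][q][0..1] above (this layout is an assumption, not from the slides):

   /* one (x, y) force accumulator per thread per particle */
   double (*loc_forces)[n][2] = malloc(thread_count * sizeof *loc_forces);
   /* every loc_forces[thread][q][c] must be zeroed by the initialization loop
      (Loop 1 on the next slide) before the first phase runs */
   ...
   free(loc_forces);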
Evaluating the OpenMP Codes
• In the reduced code:
  • Loop 1: Initialization of the loc_forces array
  • Loop 2: The first phase of the computation of forces
  • Loop 3: The second phase of the computation of forces
  • Loop 4: The updating of positions and velocities
• Which schedule should be used for each loop?
Review
• What are the major differences between MPI and OpenMP?
• What is the scope of a variable?
• What is a reduction variable?
• How do we ensure mutual exclusion in a critical section?
• What are the common loop scheduling options?
• What factors may potentially affect the performance of an OpenMP program?
• What is a thread-safe function?