Lecture 5. OpenMP Intro.



  1. COM503 Parallel Computer Architecture & Programming Lecture 5. OpenMP Intro. Prof. Taeweon Suh, Computer Science Education, Korea University

  2. References • Lots of the lecture slides are based on the following web materials with some modifications • https://computing.llnl.gov/tutorials/openMP/ • Official OpenMP page • http://openmp.org/ • It contains up-to-date information on OpenMP • It also includes tutorials, example codes from books, etc.

  3. OpenMP • What does OpenMP stand for? • Short version: Open Multi-Processing • Long version: Open specifications for Multi-Processing via collaborative work between interested parties from the hardware and software industry, government, and academia • Standardized: • Jointly defined and endorsed by a group of major computer hardware and software vendors • Comprised of three primary API components (a small combined sketch follows below): • Compiler Directives (#pragma) • Runtime Library Routines • Environment Variables • Portable: • The API is specified for C/C++ and Fortran • Implementations exist for most major platforms, including Linux and Windows
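
A minimal sketch (not from the original slides) showing the three API components used together; the file name hello_omp.c and the thread count are illustrative only:

/* hello_omp.c -- compile with: gcc hello_omp.c -o hello_omp -fopenmp
   Run, setting the environment variable: OMP_NUM_THREADS=4 ./hello_omp */
#include <stdio.h>
#include <omp.h>                 /* declares the runtime library routines */

int main(void)
{
  #pragma omp parallel           /* compiler directive: fork a thread team */
  {
    /* omp_get_thread_num() and omp_get_num_threads() are runtime routines */
    printf("Hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}

The OMP_NUM_THREADS environment variable (revisited on slide 19) selects the team size without touching the source code.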

  4. History • In the early 1990s, shared-memory machine vendors supplied similar, directive-based Fortran programming extensions • The user would augment a serial Fortran program with directives specifying which loops were to be parallelized • The compiler would be responsible for automatically parallelizing such loops across the SMP processors • The OpenMP standard specification started in 1997 • Led by the OpenMP Architecture Review Board (ARB) • The ARB members included Compaq/Digital, HP, Intel, IBM, Kuck & Associates, Inc. (KAI), Silicon Graphics, Sun Microsystems, and the U.S. Department of Energy

  5. Release History • Our textbook covers the OpenMP 2.5 specification • Starting with OpenMP 2.5, the specification has been released together for C and Fortran • OpenMP 4.0 was released in July 2013

  6. Shared Memory Machines • OpenMP is designed for shared memory machines (UMA or NUMA)

  7. Fork-Join Model • Begin as a single process (master thread) • The master thread executes sequentially until the first parallel region construct is encountered • FORK: the master thread creates a team of parallel threads • The statements in the program that are enclosed by the parallel region construct are then executed in parallel among the various team threads • JOIN: when the team threads complete the statements in the parallel region, they synchronize and terminate, leaving only the master thread • Source: https://computing.llnl.gov/tutorials/openMP/#Introduction

  8. Other Parallel Programming Models? • MPI: Message Passing Interface • Developed for distributed-memory architectures, where multiple processes execute independently and communicate data • Most widely used in the high-end technical computing community, where clusters are common • Most vendors of shared memory systems also provide MPI implementations • Most MPI implementations consist of a specific set of APIs callable from C, C++, Fortran, or Java • MPI implementations • MPICH • Freely available, portable implementation of MPI • Free Software and is available for most flavors of Unix and Windows • OpenMPI

  9. Other Parallel Programming Models? • Pthreads: POSIX (Portable Operating System Interface) Threads • Shared-memory programming model • Defined as a set of C and C++ programming types and procedure calls • A collection of routines for creating, managing, and coordinating a collection of threads • Programming with Pthreads is much more complex than with OpenMP

  10. Serial Code Example • Serial version of the dot product program • The dot product is an algebraic operation that takes two equal-length sequences of numbers (usually coordinate vectors) and returns a single number obtained by multiplying corresponding entries and adding up those products

#include <stdio.h>

int main(argc, argv)
int argc; char *argv[];
{
  double sum;
  double a[256], b[256];
  int i, n;

  n = 256;
  for (i = 0; i < n; i++) {
    a[i] = i * 0.5;
    b[i] = i * 2.0;
  }

  sum = 0.0;
  for (i = 0; i < n; i++) {
    sum = sum + a[i] * b[i];
  }

  printf("sum = %9.2lf\n", sum);
}

  11. MPI Example • MPI • To compile, 'mpicc dot_product_mpi.c -o dot_product_mpi' • To run, 'mpirun -np 4 -machinefile machine_file dot_product_mpi'

#include <stdio.h>
#include <mpi.h>

int main(argc, argv)
int argc; char *argv[];
{
  double sum, sum_local;
  double a[256], b[256];
  int i, n;
  int numprocs, myid, my_first, my_last;

  n = 256;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  printf("#procs = %d\n", numprocs);

  my_first = myid * n / numprocs;
  my_last  = (myid + 1) * n / numprocs;

  for (i = 0; i < n; i++) {
    a[i] = i * 0.5;
    b[i] = i * 2.0;
  }

  sum_local = 0.0;
  for (i = my_first; i < my_last; i++) {
    sum_local = sum_local + a[i] * b[i];
  }

  MPI_Allreduce(&sum_local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (myid == 0)
    printf("sum = %9.2lf\n", sum);

  MPI_Finalize();
  return 0;
}

  12. Pthreads Example • Pthreads • To compile, 'gcc dot_product_pthread.c -o dot_product_pthread -pthread'

#include <stdio.h>
#include <pthread.h>
#define NUMTHRDS 4

double sum = 0;
double a[256], b[256];
int status;
int n = 256;
pthread_t thds[NUMTHRDS];
pthread_mutex_t mutexsum;

void *dotprod(void *arg);

int main(argc, argv)
int argc; char *argv[];
{
  pthread_attr_t attr;
  int i;

  for (i = 0; i < n; i++) {
    a[i] = i * 0.5;
    b[i] = i * 2.0;
  }

  pthread_mutex_init(&mutexsum, NULL);
  pthread_attr_init(&attr);
  pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

  for (i = 0; i < NUMTHRDS; i++) {
    pthread_create(&thds[i], &attr, dotprod, (void *)i);
  }
  pthread_attr_destroy(&attr);

  for (i = 0; i < NUMTHRDS; i++) {
    pthread_join(thds[i], (void **)&status);
  }

  printf("sum = %9.2lf\n", sum);
  pthread_mutex_destroy(&mutexsum);
  pthread_exit(NULL);
}

  13. Pthreads Example • Pthreads • To compile, 'gcc dot_product_pthread.c -o dot_product_pthread -pthread'

void *dotprod(void *arg)
{
  int myid, i, my_first, my_last;
  double sum_local;

  myid = (int)arg;
  my_first = myid * n / NUMTHRDS;
  my_last  = (myid + 1) * n / NUMTHRDS;

  sum_local = 0.0;
  for (i = my_first; i < my_last; i++) {
    sum_local = sum_local + a[i] * b[i];
  }

  pthread_mutex_lock(&mutexsum);
  sum = sum + sum_local;
  pthread_mutex_unlock(&mutexsum);

  pthread_exit((void *)0);
}

  14. Another Pthread Example • Pthreads • To compile, 'gcc pthread_creation.c -o pthread_creation -pthread'

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void *print_message_function(void *ptr);

main()
{
  pthread_t thread1, thread2;
  char *message1 = "Thread 1";
  char *message2 = "Thread 2";
  int iret1, iret2;

  /* Create independent threads, each of which will execute the function */
  iret1 = pthread_create(&thread1, NULL, print_message_function, (void *)message1);
  iret2 = pthread_create(&thread2, NULL, print_message_function, (void *)message2);

  /* Wait until the threads are complete before main continues. Unless we
     wait, we run the risk of executing an exit, which will terminate
     the process and all threads before the threads have completed. */
  pthread_join(thread1, NULL);
  pthread_join(thread2, NULL);

  printf("Thread 1 returns: %d\n", iret1);
  printf("Thread 2 returns: %d\n", iret2);
  exit(0);
}

void *print_message_function(void *ptr)
{
  char *message;
  message = (char *)ptr;
  printf("%s \n", message);
}

  15. OpenMP Example • OpenMP • To compile, 'gcc dot_product_omp.c -o dot_product_omp -fopenmp'

#include <stdio.h>
#include <omp.h>

int main(argc, argv)
int argc; char *argv[];
{
  double sum;
  double a[256], b[256];
  int i, n;

  n = 256;
  for (i = 0; i < n; i++) {
    a[i] = i * 0.5;
    b[i] = i * 2.0;
  }

  sum = 0.0;
  #pragma omp parallel for reduction(+:sum)
  for (i = 0; i < n; i++) {
    sum = sum + a[i] * b[i];
  }

  printf("sum = %9.2lf\n", sum);
}

  16. OpenMP • Compiler Directive Based • Parallelism is specified through the use of compiler directives in source code • Nested Parallelism Support • Parallel regions may be placed inside other parallel regions • Dynamic Threads • The number of threads used to execute parallel regions can be altered dynamically • Memory Model • OpenMP provides a "relaxed-consistency" memory model; in other words, threads can cache their data and are not required to maintain exact consistency with real memory all the time • When it is critical that all threads view a shared variable identically, the programmer is responsible for ensuring that the variable is flushed by all threads as needed (a small flush sketch follows below)
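
A minimal sketch (not from the original slides) of the flush directive: one thread publishes a value through a flag, and the other spins until the flag becomes visible. It assumes at least two threads are granted.

#include <stdio.h>
#include <omp.h>

int main(void)
{
  int flag = 0, data = 0;

  #pragma omp parallel num_threads(2) shared(flag, data)
  {
    if (omp_get_thread_num() == 0) {
      data = 42;                  /* produce the value                   */
      #pragma omp flush(data)     /* make data visible before the flag   */
      flag = 1;
      #pragma omp flush(flag)     /* publish the flag                    */
    } else {
      while (1) {                 /* spin until the flag becomes visible */
        #pragma omp flush(flag)
        if (flag == 1) break;
      }
      #pragma omp flush(data)     /* re-read data after seeing the flag  */
      printf("Consumer read data = %d\n", data);
    }
  }
  return 0;
}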

  17. OpenMP Parallel Construct • A parallel region is a block of code executed by multiple threads. This is the fundamental OpenMP parallel construct.

#pragma omp parallel [clause[[,] clause]...]
  structured block

• This construct is used to specify the block that should be executed in parallel • A team of threads is created to execute the associated parallel region • Each thread in the team is assigned a unique thread number (0 to #threads-1) • The master is a member of that team and has thread number 0 • Starting from the beginning of this parallel region, the code is duplicated and all threads will execute that code • It does not distribute the work of the region among the threads in a team if the programmer does not use the appropriate syntax to specify this action • There is an implied barrier at the end of a parallel region. Only the master thread continues execution past this point. • The code not enclosed by a parallel construct will be executed serially

  18. Example

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(argc, argv)
int argc; char *argv[];
{
  #pragma omp parallel
  {
    printf("Parallel region is executed by thread ID %d\n", omp_get_thread_num());
    if (omp_get_thread_num() == 2) {
      printf(" Thread %d does things differently\n", omp_get_thread_num());
    }
  }
}

  19. How Many Threads? • The number of threads in a parallel region is determined by the following factors, in order of precedence • Evaluation of the IF clause • The IF clause is supported on the parallel construct only • #pragma omp parallel if (n > 5) • num_threads clause with a parallel construct • The num_threads clause is supported on the parallel construct only • #pragma omp parallel num_threads(8) • omp_set_num_threads() library function • omp_set_num_threads(8) • OMP_NUM_THREADS environment variable • In bash, use 'export OMP_NUM_THREADS=4' • Implementation default - usually the number of CPUs on a node, even though it could be dynamic

  20. Example

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NUM_THREADS 8

int main(argc, argv)
int argc; char *argv[];
{
  int n = 6;

  omp_set_num_threads(NUM_THREADS);

  //#pragma omp parallel
  #pragma omp parallel if (n > 5) num_threads(n)
  {
    printf("Parallel region is executed by thread ID %d\n", omp_get_thread_num());
    if (omp_get_thread_num() == 2) {
      printf(" Thread %d does things differently\n", omp_get_thread_num());
    }
  }
}

  21. Work-Sharing Constructs • Work-sharing constructs are used to distribute computation among the threads in a team • #pragma omp for • #pragma omp sections • #pragma omp single • By default, threads wait at a barrier at the end of a work-sharing region until the last thread has completed its share of the work • However, the programmer can suppress this barrier by using the nowait clause (see the sketch after this slide)
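
A minimal sketch (not from the original slides) of the nowait clause: the implicit barrier after the first loop is removed, so a thread that finishes its iterations early can move straight on to the second loop. This is safe here because the two loops write independent arrays.

#include <stdio.h>
#include <omp.h>

int main(void)
{
  int i, a[8], b[8];

  #pragma omp parallel
  {
    #pragma omp for nowait     /* no barrier after this loop          */
    for (i = 0; i < 8; i++)
      a[i] = i * 2;

    #pragma omp for            /* implicit barrier after this loop    */
    for (i = 0; i < 8; i++)
      b[i] = i * 3;
  }

  for (i = 0; i < 8; i++)
    printf("a[%d] = %d, b[%d] = %d\n", i, a[i], i, b[i]);
  return 0;
}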

  22. Loop Construct • The loop construct causes the immediately following loop iterations to be executed in parallel

#pragma omp for [clause[[,] clause]...]
  for loop

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4

int main(argc, argv)
int argc; char *argv[];
{
  int n = 8;
  int i;

  omp_set_num_threads(NUM_THREADS);

  #pragma omp parallel shared(n) private(i)
  {
    #pragma omp for
    for (i = 0; i < n; i++)
      printf(" Thread %d executes loop iteration %d\n", omp_get_thread_num(), i);
  }
}

  23. Section Construct • The section construct is the easiest way to have different threads execute different kinds of work

#pragma omp sections [clause[[,] clause]...]
{
  [#pragma omp section]
    structured block
  [#pragma omp section]
    structured block
}

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NUM_THREADS 4

void funcA();
void funcB();

int main(argc, argv)
int argc; char *argv[];
{
  int n = 8;
  int i;

  omp_set_num_threads(NUM_THREADS);

  #pragma omp parallel sections
  {
    #pragma omp section
    funcA();
    #pragma omp section
    funcB();
  }
}

void funcA()
{
  printf("In funcA: this section is executed by thread %d\n", omp_get_thread_num());
}

void funcB()
{
  printf("In funcB: this section is executed by thread %d\n", omp_get_thread_num());
}

  24. Section Construct • At run time, the specified code blocks are executed by the threads in the team • Each thread executes one code block at a time • Each code block will be executed exactly once • If there are fewer threads than code blocks, some threads execute multiple code blocks • If there are fewer code blocks than threads, the remaining threads will be idle • Assignment of code blocks to threads is implementation-dependent • Depending on the type of work performed in the various code blocks and the number of threads used, this construct might lead to a load-balancing problem (a small sketch of the fewer-threads case follows below)
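
A minimal sketch (not from the original slides) of the fewer-threads case: four sections but only two threads, so each thread ends up executing more than one section. Which section goes to which thread is implementation-dependent.

#include <stdio.h>
#include <omp.h>

int main(void)
{
  omp_set_num_threads(2);          /* two threads, four sections */

  #pragma omp parallel sections
  {
    #pragma omp section
    printf("Section 1 run by thread %d\n", omp_get_thread_num());
    #pragma omp section
    printf("Section 2 run by thread %d\n", omp_get_thread_num());
    #pragma omp section
    printf("Section 3 run by thread %d\n", omp_get_thread_num());
    #pragma omp section
    printf("Section 4 run by thread %d\n", omp_get_thread_num());
  }
  return 0;
}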

  25. Single Construct • The single construct specifies that the block should be executed by one thread only

#pragma omp single [clause[[,] clause]...]
  structured block

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4
#define N 8

int main(argc, argv)
int argc; char *argv[];
{
  int i, a, b[N];

  omp_set_num_threads(NUM_THREADS);

  #pragma omp parallel shared(a, b) private(i)
  {
    #pragma omp single
    {
      a = 10;
      printf("Single construct executed by thread %d\n", omp_get_thread_num());
    }

    #pragma omp for
    for (i = 0; i < N; i++)
      b[i] = a;
  }

  printf("After the parallel region\n");
  for (i = 0; i < N; i++)
    printf("b[%d] = %d\n", i, b[i]);
}

  26. Single Construct • Only one thread executes the block with the single construct • The other threads wait at an implicit barrier until the thread executing the single code block has completed • What if the single construct were omitted in the previous example? • Memory consistency issue? • Performance issue? • Would a barrier then be required before the #pragma omp for? (a small sketch of this alternative follows below)
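
A minimal sketch (not from the original slides) of that alternative: every thread writes the same value to a, and an explicit barrier keeps any thread from reading a before it has been set. The redundant writes are wasted work compared to single and, strictly speaking, concurrent unsynchronized stores to a are still a data race even though they store the same value.

#include <stdio.h>
#include <omp.h>
#define N 8

int main(void)
{
  int i, a, b[N];

  #pragma omp parallel shared(a, b) private(i) num_threads(4)
  {
    a = 10;                    /* every thread performs the same write   */
    #pragma omp barrier        /* ensure a is written before it is read  */

    #pragma omp for
    for (i = 0; i < N; i++)
      b[i] = a;
  }

  for (i = 0; i < N; i++)
    printf("b[%d] = %d\n", i, b[i]);
  return 0;
}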

  27. Misc • A useful Linux command • top (display Linux tasks) provides a dynamic real-time view of a running system • Try pressing 1, z, and H after running the top command

  28. Misc • Useful Linux commands • ps -eLf • Display thread IDs for OpenMP and Pthreads • top • Display process IDs, which can be used to monitor the processes created by MPI

  29. Misc • top does not show threads by default • ps -eLf • Display thread IDs for OpenMP and Pthreads

  30. Backup Slides

  31. Goal of OpenMP • Standardization: • Provide a standard among a variety of shared memory architectures/platforms • Lean and Mean: • Establish a simple and limited set of directives for programming shared memory machines • Significant parallelism can be implemented by using just 3 or 4 directives • Ease of Use: • Provide the capability to incrementally parallelize a serial program, unlike message-passing libraries, which typically require an all-or-nothing approach • Provide the capability to implement both coarse-grain and fine-grain parallelism • Portability: • Supports Fortran (77, 90, and 95), C, and C++ • Public forum for API and membership
