OpenMP EXERCISE part 1 – OpenMP v2.5




  1. OpenMP EXERCISE part 1 – OpenMP v2.5 Ing. Andrea Marongiu a.marongiu@unibo.it

  2. Download, compile and run
  • Download file OpenMP_Exercise.tgz from the website
  • Extract it to a local folder: tar xvf OpenMP_Exercise.tgz
  • What’s in the package: all tests are in file test.c
  • Compile and run with: make clean all run
  • Take a look at test.c. Different exercises are #ifdef-ed
  • To compile and execute the desired one: make clean all run -e MYOPTS="-DEX1 -DEX2 …"

  3. EX 1 – Hello world! Parallelism creation

      #pragma omp parallel num_threads (?)
          printf("Hello world, I'm thread ??");

  • Use the parallel directive to create multiple threads
  • Each thread executes the code enclosed within the scope of the directive
  • Use runtime library functions to determine the thread ID
  • All SPMD parallelization is based on this approach
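  A minimal runnable sketch of EX 1 (this is not the package's test.c; the thread count 4 and the use of omp_get_thread_num() are illustrative assumptions):

      #include <stdio.h>
      #include <omp.h>

      int main(void)
      {
          /* Create a team of 4 threads; each one executes the enclosed block. */
          #pragma omp parallel num_threads(4)
          {
              /* Runtime library call: returns this thread's ID (0..3). */
              printf("Hello world, I'm thread %d\n", omp_get_thread_num());
          }
          return 0;
      }

  Compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp (an assumption; the package's Makefile may pass its own flags).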

  4. EX 2 - Loop partitioning – Static scheduling

      #pragma omp parallel for \
          num_threads (4) schedule (static)
      for (uint i=0; i<16; i++)
        { /* BALANCED LOOP CODE */ }
      /* (implicit) SYNCH POINT */

  [Figure: timeline of threads T0–T3; each runs 4 contiguous iterations: T0 → 0–3, T1 → 4–7, T2 → 8–11, T3 → 12–15]

  • Iterations are statically assigned to threads
  • 16 iter / 4 threads = 4 iter/thread
  • Small overhead: loop indexes are computed according to thread ID
  • Optimal scheduling if workload is balanced
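  A self-contained sketch of the static schedule; printing which thread runs each iteration is an illustrative addition, not part of the exercise code:

      #include <stdio.h>
      #include <omp.h>

      int main(void)
      {
          /* 16 iterations / 4 threads = 4 contiguous iterations per thread,
             computed once from the thread ID (small, fixed overhead). */
          #pragma omp parallel for num_threads(4) schedule(static)
          for (int i = 0; i < 16; i++)
              printf("iter %2d -> thread %d\n", i, omp_get_thread_num());
          /* implicit synchronization point at the end of the loop */
          return 0;
      }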

  5. EX 2 - Loop partitioning – Dynamic scheduling

      #pragma omp parallel for \
          num_threads (4) schedule (dynamic, 4)
      for (uint i=0; i<16; i++)
        { /* BALANCED LOOP CODE */ }
      /* (implicit) SYNCH POINT */

  [Figure: threads T0–T3 each fetch one CHUNK (size = 4 iter): T0 → 0–3, T1 → 4–7, T2 → 8–11, T3 → 12–15; OVERHEAD marked at each chunk fetch]

  • Iterations are dynamically assigned to threads
  • 16 iter, 4 by 4
  • Same allocation of iterations as static (prev. slide)
  • Coarse granularity
  • Overhead only at beginning and end

  6. EX 2 - Loop partitioning – Dynamic scheduling

      #pragma omp parallel for \
          num_threads (4) schedule (dynamic, 1)
      for (uint i=0; i<16; i++)
        { /* BALANCED LOOP CODE */ }
      /* (implicit) SYNCH POINT */

  [Figure: threads T0–T3 fetch one CHUNK (size = 1 iter) at a time; OVERHEAD marked before every iteration]

  • Iterations are dynamically assigned to threads
  • 16 iter, 1 by 1
  • Finest granularity possible
  • Overhead at every iteration
  • Worst performance under balanced workloads
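  A sketch covering both dynamic variants of slides 5 and 6; only the chunk size changes. The work() busy-wait stub is a hypothetical stand-in for the package's balanced loop body:

      #include <stdio.h>
      #include <omp.h>

      /* Hypothetical stand-in for a balanced loop body. */
      static void work(unsigned long n)
      {
          volatile unsigned long k;
          for (k = 0; k < n; k++) ;
      }

      int main(void)
      {
          /* Chunks of 4 iterations are handed out first-come first-served;
             each hand-out is a synchronized access to a shared loop counter.
             Change the chunk size to 1 to reproduce the slide-6 behavior. */
          #pragma omp parallel for num_threads(4) schedule(dynamic, 4)
          for (int i = 0; i < 16; i++) {
              work(1000000);
              printf("iter %2d -> thread %d\n", i, omp_get_thread_num());
          }
          return 0;
      }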

  7. EX 3 – Unbalanced Loop partitioning

      #pragma omp parallel for \
          num_threads (4) schedule (dynamic, 4)
      for (uint i=0; i<16; i++)
        { /* UNBALANCED LOOP CODE */ }
      /* (implicit) SYNCH POINT */

  [Figure: with chunks of 4 iterations, threads finish at different times and all wait at the SYNCH POINT for the slowest one]

  • Iterations are dynamically assigned to threads
  • 16 iter, 4 by 4
  • Coarse granularity (same as static scheduling)
  • Due to the barrier at the end of the parallel region, all threads have to wait for the slowest one

  8. EX 3 – Unbalanced Loop partitioning

      #pragma omp parallel for \
          num_threads (4) schedule (dynamic, 1)
      for (uint i=0; i<16; i++)
        { /* UNBALANCED LOOP CODE */ }
      /* (implicit) SYNCH POINT */

  [Figure: with chunks of 1 iteration, the workload spreads evenly across T0–T3 and the loop finishes earlier: SPEEDUP]

  • Iterations are dynamically assigned to threads
  • 16 iter, 1 by 1
  • Finest granularity balances the workload among threads
  • In this case, the per-iteration overhead is worth paying: the balanced timeline yields a speedup (see the sketch below)
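  A sketch of the unbalanced case; making the cost grow with the iteration index is a hypothetical stand-in for the package's UNBALANCED LOOP CODE:

      #include <stdio.h>
      #include <omp.h>

      static void work(unsigned long n)  /* busy-wait stub, as before */
      {
          volatile unsigned long k;
          for (k = 0; k < n; k++) ;
      }

      int main(void)
      {
          /* With schedule(dynamic, 1), a thread that finishes a cheap early
             iteration immediately grabs the next one, so the expensive tail
             iterations spread across the team instead of piling up on the
             thread that owns the last chunk. */
          #pragma omp parallel for num_threads(4) schedule(dynamic, 1)
          for (int i = 0; i < 16; i++) {
              work((unsigned long)(i + 1) * 1000000);  /* cost grows with i */
              printf("iter %2d -> thread %d\n", i, omp_get_thread_num());
          }
          return 0;
      }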

  9. EX 4 – Chunking overhead

      #pragma omp parallel for \
          num_threads (4) schedule (dynamic, 1)
      for (uint i=0; i<16; i++)
        { /* SMALL LOOP CODE */ }
      /* (implicit) SYNCH POINT */

  [Figure: with a tiny loop body, the per-iteration scheduling OVERHEAD dominates the timeline]

  • Iterations are dynamically assigned to threads
  • 16 iter, 1 by 1
  • Finest granularity possible
  • Overhead at every iteration
  • Serious performance loss for very small loop bodies
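  One way to see the chunking overhead is to time the same tiny loop body under both schedules. omp_get_wtime() is a standard OpenMP routine; the iteration count is an arbitrary choice for this sketch:

      #include <stdio.h>
      #include <omp.h>

      int main(void)
      {
          long long sum1 = 0, sum2 = 0;
          double t;

          t = omp_get_wtime();
          #pragma omp parallel for num_threads(4) schedule(static) reduction(+:sum1)
          for (int i = 0; i < 1000000; i++) sum1 += i;   /* tiny body */
          printf("static:      %f s\n", omp_get_wtime() - t);

          t = omp_get_wtime();
          #pragma omp parallel for num_threads(4) schedule(dynamic, 1) reduction(+:sum2)
          for (int i = 0; i < 1000000; i++) sum2 += i;   /* same tiny body */
          printf("dynamic, 1:  %f s\n", omp_get_wtime() - t);

          return sum1 == sum2 ? 0 : 1;  /* keep the sums from being optimized away */
      }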

  10. EX 5 – Task parallelism with sections

      void sections() {
          work(1000000);
          printf("%hu: Done with first elaboration!\n", …);
          work(2000000);
          printf("%hu: Done with second elaboration!\n", …);
          work(3000000);
          printf("%hu: Done with third elaboration!\n", …);
          work(4000000);
          printf("%hu: Done with fourth elaboration!\n", …);
      }

  • Distribute the workload among 4 threads using SPMD parallelization
  • Get the thread id
  • Use if/else or switch/case to differentiate the workload
  • Implement the same workload partitioning with the sections directive (a sketch follows)
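  A sketch of the sections version. The work() stub and the use of omp_get_thread_num() for the elided printf argument are assumptions; the function is renamed here only to keep it distinct from the directive name:

      #include <stdio.h>
      #include <omp.h>

      static void work(unsigned long n)  /* busy-wait stub */
      {
          volatile unsigned long k;
          for (k = 0; k < n; k++) ;
      }

      void sections_version(void)
      {
          /* Each section is executed exactly once, by whichever thread
             of the team picks it up. */
          #pragma omp parallel sections num_threads(4)
          {
              #pragma omp section
              { work(1000000); printf("%d: Done with first elaboration!\n",  omp_get_thread_num()); }
              #pragma omp section
              { work(2000000); printf("%d: Done with second elaboration!\n", omp_get_thread_num()); }
              #pragma omp section
              { work(3000000); printf("%d: Done with third elaboration!\n",  omp_get_thread_num()); }
              #pragma omp section
              { work(4000000); printf("%d: Done with fourth elaboration!\n", omp_get_thread_num()); }
          }
      }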

  11. EX 6 – Task parallelism with task

      void tasks() {
          unsigned int i;
          for (i = 0; i < 4; i++) {
              work((i+1)*1000000);
              printf("%hu: Done with elaboration\n", …);
          }
      }

  • Distribute the workload among 4 threads using the task directive
  • Same program as before
  • But we had to manually unroll the loop to use sections
  • Performance? (a sketch follows)
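  A sketch of the task version; each iteration becomes an explicit task that any idle thread may pick up. Same assumptions as before (work() stub, omp_get_thread_num() for the elided argument); note that the task construct needs an OpenMP 3.0 compiler, slightly newer than the v2.5 baseline in the title:

      #include <stdio.h>
      #include <omp.h>

      static void work(unsigned long n)  /* busy-wait stub */
      {
          volatile unsigned long k;
          for (k = 0; k < n; k++) ;
      }

      void tasks_version(void)
      {
          #pragma omp parallel num_threads(4)
          {
              /* Only one thread runs the loop and creates the tasks... */
              #pragma omp single
              {
                  unsigned int i;
                  for (i = 0; i < 4; i++) {
                      /* firstprivate(i) captures the value of i at creation time. */
                      #pragma omp task firstprivate(i)
                      {
                          work((i + 1) * 1000000);
                          printf("%d: Done with elaboration\n", omp_get_thread_num());
                      }
                  }
              }
              /* ...while the other threads execute tasks at the implicit barrier. */
          }
      }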

  12. EX 6 – Task parallelism with task

      void tasks() {
          unsigned int i;
          for (i = 0; i < 1024; i++) {
              work(1000000);
              printf("%hu: Done with elaboration\n", …);
          }
      }

  Modify the EX 6 exercise code as indicated on this slide:
  • Parallelize the loop with the task directive
  • Use the single directive to force a single thread to create the tasks
  • Use the nowait clause to allow task creation and task execution to proceed in parallel (a sketch follows)
  • Performance?
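  A sketch of the modified EX 6 under the same assumptions as above; with nowait, the threads that skip the single block do not wait at its end and can start executing queued tasks while the single thread is still creating them:

      #include <stdio.h>
      #include <omp.h>

      static void work(unsigned long n)  /* busy-wait stub */
      {
          volatile unsigned long k;
          for (k = 0; k < n; k++) ;
      }

      void tasks_version(void)
      {
          #pragma omp parallel num_threads(4)
          {
              /* nowait: no barrier at the end of single, so task creation
                 and task execution overlap. */
              #pragma omp single nowait
              {
                  unsigned int i;
                  for (i = 0; i < 1024; i++) {
                      #pragma omp task
                      {
                          work(1000000);
                          printf("%d: Done with elaboration\n", omp_get_thread_num());
                      }
                  }
              }
          }   /* implicit barrier of the parallel region: all tasks complete here */
      }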
