
Programming the Origin2000 with OpenMP: Part II


Presentation Transcript


  1. Programming the Origin2000 with OpenMP: Part II William Magro Kuck & Associates, Inc.

  2. Outline • A Simple OpenMP Example • Analysis and Adaptation • Debugging • Performance Tuning • Advanced Topics

  3. A Simple Example (dotprod.f)

     [Figure: vectors x and y, elements 1 through n]

               real*8 function ddot(n,x,y)
               integer n
               real*8 x(n), y(n)
               ddot = 0.0
         !$omp parallel do private(i)
         !$omp& reduction(+:ddot)
               do i=1,n
                  ddot = ddot + x(i)*y(i)
               enddo
               return
               end

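  4. A Less Simple Example (dotprod2.f)

     [Figure: vectors x and y, elements 1 through n, split among threads; each thread forms a private partial sum ddot1, and the partial sums are combined into ddot]

               real*8 function ddot(n,x,y)
               integer n
               real*8 x(n), y(n), ddot1
         ! the shared result must be zeroed before threads accumulate into it
               ddot = 0.0
         !$omp parallel private(ddot1)
               ddot1 = 0.0
         !$omp do private(i)
               do i=1,n
                  ddot1 = ddot1 + x(i)*y(i)
               enddo
         !$omp end do nowait
         !$omp atomic
               ddot = ddot + ddot1
         !$omp end parallel
               return
               end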

  5. Analysis and Adaptation • Thread-safety • Automatic Parallelization • Finding Parallel Opportunities • Classifying Data • A Different Approach

  6. Thread-safety
     • Confirm code works with -automatic in serial

         f77 -automatic -DEBUG:trap_uninitialized=ON <source files>
         a.out

     • Synchronize access to static data

               logical function overflows()
               integer count
               save count
               data count /0/
               overflows = .false.
         !$omp critical
               count = count + 1
               if (count .gt. 10) overflows = .true.
         !$omp end critical
               return
               end

  7. Automatic Parallelization
     • Power Fortran Accelerator (PFA)
       • Detects parallelism
       • Implements parallelism
     • Using PFA

         module swap MIPSpro MIPSpro.beta721
         f77 -pfa <source files>

     • PFA options to try
       • -IPA enables interprocedural analysis
       • -OPT:roundoff=3 enables reductions

  8. Basic Compiler Transformations
     Work variable privatization:

     Original loop:

               DO I=1,N
                  x = ...
                  . . .
                  y(I) = x
               ENDDO

     Transformed loop:

         !$omp parallel do
         !$omp& private(x)
               DO I=1,N
                  x = ...
                  . . .
                  y(I) = x
               ENDDO

  9. Basic Compiler Transformations
     Parallel reduction:

     Original loop:

               DO I=1,N
                  . .
                  x = ...
                  sum = sum + x
                  . .
               ENDDO

     Transformed by hand, with a private partial sum per thread:

         !$omp parallel
         !$omp& private(x, sum1)
               sum1 = 0.0
         !$omp do
               DO I=1,N
                  .
                  x = ...
                  sum1 = sum1 + x
                  .
               ENDDO
         !$omp atomic
               sum = sum + sum1
         !$omp end parallel

     Transformed with the REDUCTION clause:

         !$omp parallel do
         !$omp& private(x)
         !$omp& reduction(+:sum)
               DO I=1,N
                  .
                  x = ...
                  sum = sum + x
                  .
               ENDDO

  10. Basic Compiler Transformations
      Induction variable substitution:

      Original loop:

                i1 = 0
                i2 = 0
                DO I=1,N
                   i1 = i1 + 1
                   B(i1) = ...
                   i2 = i2 + I
                   A(i2) = …
                ENDDO

      Transformed loop:

          !$omp parallel do
          !$omp& private(I)
                DO I=1,N
                   B(I) = ...
                   A((I**2 + I)/2) = …
                ENDDO
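The substitution on slide 10 works because both induction variables have closed forms in the loop index. A short derivation, writing i_1(I) and i_2(I) for the values of i1 and i2 after iteration I (this notation is an assumption for illustration, not from the slide):

```latex
i_1(I) = \sum_{k=1}^{I} 1 = I,
\qquad
i_2(I) = \sum_{k=1}^{I} k = \frac{I(I+1)}{2} = \frac{I^2 + I}{2}
```

Once every subscript depends only on I, no iteration needs a value computed by an earlier one, so the loop can be divided among threads.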

  11. Automatic Limitations • IPA is slow for large codes • Without IPA, only small loops go parallel • Analysis must be repeated with each compile • Can’t parallelize data dependent algorithms • Results usually don’t scale

  12. Compiler Listing
      • Generate listing with ‘-pfa keep’

          f77 -pfa keep <source files>

      • The listing gives many useful clues:
        • Loop optimization tables
        • Data dependencies
        • Explanations about applied transformations
        • Optimization summary
        • Transformed OpenMP source code
      • Use listing to help write OpenMP version
      • Workshop MPF presents listing graphically

  13. Picking Parallel Loops
      • Avoid inherently serial loops
        • Time stepping loops
        • Iterative convergence loops
      • Parallelize at highest level possible
      • Choose loops with large trip count
      • Always parallelize in same dimension, if possible
      • Workshop MPF’s static analysis can help

  14. Profiling
      • Use SpeedShop to profile your program
        • Compile normally in serial
        • Select typical data set
        • Profile with ‘ssrun’:

            ssrun -ideal <program> <arguments>
            ssrun -pcsamp <program> <arguments>

        • Examine profile with ‘prof’:

            prof -gprof <program>.ideal.<pid>

      • Look for routines with:
        • Large combined ‘self’ and ‘child’ time
        • Small invocation count

  15. Example Profile (apsi.profile)

                         self          kids   called/total  parents
      index  cycles(%)   self(%)   kids(%)    called+self   name   index
                         self          kids   called/total  children
      [...]
                     20511398  453309309775            1/1  PSET [4]
      [5]  453329821173(100.00%)  20511398( 0.00%)  453309309775(100.00%)  1  RUN [5]
                  18305495901  149319136904  267589/268116  DCTDX [6]
                  19503577587   22818946546        527/527  DKZMH [13]
                  13835415346   24761094596        526/526  DUDTZ [14]
                  12919215922   24761094596        526/526  DVDTZ [15]
                  11953815047   25150873141        527/527  DTDTZ [16]
                   4541238123   24964028293    66920/66920  DPDX [18]
                   3883200260   24920009235    66802/66803  DFTDX [19]
                   5749986857   17489462744        527/527  DCDTZ [21]
                   8874949202   11380650840        526/526  WCONT [24]
                  10830140377             0        527/527  HYD [30]
                   3873808360    1583161052        527/527  ADVU [36]
                   3592836688    1580156951        526/526  ADVV [37]
                   1852017128    1583161052        527/527  ADVC [39]
                   1680678888    1583161052        527/527  ADVT [40]
      [...]

  16. Multiple Parallel Loops
      • Nested parallel loops
        • Prefer outermost loop
        • Preserve locality: choose the same index as in other parallel loops
        • If relative sizes of trip counts are not known
          • Use NEST() clause
          • Use IF clause to select best based on dataset
      • Non-nested parallel loops
        • Consider fusing loops
        • Execute code between loops in parallel
        • Privatize data in redundant calculations
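One way to read the IF-clause advice on slide 16 is to put a directive on both loops of a nest and let complementary IF tests pick the loop with the larger trip count at run time. The sketch below assumes that reading; the routine name `column_sums` and its arguments are hypothetical, and it relies on the inner region being allowed to go parallel when the outer IF test is false, which is worth confirming on the target compiler.

```fortran
      subroutine column_sums(n, m, a, s)
      implicit none
      integer n, m, i, j
      real*8 a(m, n), s(n), t
! Outer loop runs in parallel only when it has the larger trip count.
!$omp parallel do private(i, j, t) if (n .ge. m)
      do i = 1, n
         t = 0.0d0
! Otherwise the inner loop is the parallel one, as a sum reduction.
!$omp parallel do private(j) reduction(+:t) if (n .lt. m)
         do j = 1, m
            t = t + a(j, i)
         enddo
!$omp end parallel do
         s(i) = t
      enddo
!$omp end parallel do
      end subroutine column_sums
```

Calling `column_sums(n, m, a, s)` fills s(i) with the sum of column i of a. The results do not depend on which loop ends up parallel, apart from floating-point rounding in the reduction order.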

  17. Nested Parallel Loops (copy.f)

                subroutine copy (imx,jmx,kmx,imp2,jmp2,kmp2,w,ws)
                do nv=1,5
          !$omp do
                   do k = 1,kmx
                      do j = 1,jmx
                         do i = 1,imx
                            ws(i,j,k,nv) = w(i,j,k,nv)
                         end do
                      end do
                   end do
          !$omp end do nowait
                end do
          !$omp barrier
                return
                end

  18. Variable Classification
      • In OpenMP, data is shared by default
      • OpenMP provides several privatization mechanisms
      • A correct OpenMP program must have its variables properly classified

          !$omp parallel
          !$omp& PRIVATE(x,y,z)
          !$omp& FIRSTPRIVATE(q)
          !$omp& LASTPRIVATE(I)

                common /blk/ l,m,n
          !$omp THREADPRIVATE(/blk/)

  19. Shared Variables
      • Shared is OpenMP default
      • Most things are shared
        • The major arrays
        • Variables whose indices match loop index

          !$omp parallel do
                do I = 1,N
                   do J = 1, M
                      x(I) = x(I) + y(J)

        • Variables only read in parallel region
        • Variables read, then written, requiring synchronization
          • maxval = max(maxval, currval)
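The last case on slide 19, a shared variable that is read and then written, can be sketched as follows. This is a minimal illustration, not code from the presentation: the function name `shared_max` and its arguments are hypothetical, and the sketch assumes n is at least 1.

```fortran
      function shared_max(n, v) result(vmax)
      implicit none
      integer n, i
      real*8 v(n), vmax
      vmax = v(1)
!$omp parallel do private(i) shared(v, vmax)
      do i = 2, n
! The read of vmax and the write back must not interleave
! between threads, so the update sits in a critical section.
!$omp critical
         vmax = max(vmax, v(i))
!$omp end critical
      enddo
!$omp end parallel do
      end function shared_max
```

The critical section serializes every update, so this form shows the synchronization requirement rather than a fast way to compute a maximum. Slide 22 recommends a REDUCTION clause for scalars such as this one.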

  20. Private Variables

                program main
          !$omp parallel
                call compute
          !$omp end parallel
                end

                subroutine compute
                integer i,j,k
                [...]
                return
                end

      • Local variables in called routines are automatically private
      • Common access patterns
        • Work variables written then read (PRIVATE)
        • Variables read on first iteration, then written (FIRSTPRIVATE)
        • Variables read after last iteration (LASTPRIVATE)
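The three clauses on slide 20 can appear together on one loop. The sketch below is illustrative only: the routine name `classify` and its arguments are hypothetical. Here `t` is a work variable written then read in each iteration, each thread's copy of `offset` starts with the value set before the loop, and `last` is read by the caller after the loop.

```fortran
      subroutine classify(n, a, offset, last)
      implicit none
      integer n, i
      real*8 a(n), offset, last, t
!$omp parallel do private(t) firstprivate(offset) lastprivate(last)
      do i = 1, n
! t is written before it is read in every iteration: PRIVATE
         t = 2.0d0*a(i)
! offset carries its value from before the loop: FIRSTPRIVATE
         a(i) = t + offset
! the caller needs the value from iteration n: LASTPRIVATE
         last = a(i)
      enddo
!$omp end parallel do
      end subroutine classify
```

With a = (1, 2, 3) and offset = 0.5, the call leaves a = (2.5, 4.5, 6.5) and last = 6.5, the same as a serial run.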

  21. Variable Typing (wcont.f, wcont_omp.f, dwdz.f)

                DIMENSION HELP(NZ),HELPA(NZ),AN(NZ),BN(NZ),CN(NZ)
                ...
                [...]
          !$omp parallel
          !$omp& default(shared)
          !$omp& private(help,helpa,i,j,k,dv,topow,nztop,an,bn,cn)
          !$omp& reduction(+: wwind, wsq)
                HELP(1)=0.0
                HELP(NZ)=0.0
                NZTOP=NZ-1
          !$omp pdo
                DO 40 I=1,NX
                DO 30 J=1,NY
                DO 10 K=2,NZTOP
                [...]
             40 CONTINUE
          !$omp end pdo
          !$omp end parallel

  22. Synchronization (maxpy.f)
      • Reductions
        • Max, min values
        • Global sums, products, etc.
        • Use REDUCTION() clause for scalars

          !$omp do reduction(max: ymax)
                do i=1,n
                   y(i) = a*x(i) + y(i)
                   ymax = max(ymax,y(i))
                enddo

        • Code array reductions by hand

  23. Array Reductions (histogram.f, histogram.omp.f)

          !$omp parallel private(hist1,i,j,ibin)
                do i=1,nbins
                   hist1(i) = 0
                enddo
          !$omp do
                do i=1,m
                   do j=1,m
                      ibin = 1 + data(j,i)*rscale*nbins
                      hist1(ibin) = hist1(ibin) + 1
                   enddo
                enddo
          !$omp critical
                do i=1,nbins
                   hist(i) = hist(i) + hist1(i)
                enddo
          !$omp end critical
          !$omp end parallel

  24. Building the Parallel Program
      • Analyze, insert directives, and compile:

          module swap MIPSpro MIPSpro.beta721
          f77 -mp -n32 <optimization flags> <source files>

        - or -

          source /usr/local/apps/KAI/setup.csh
          guidef77 -n32 <optimization flags> <source files>

      • Run multiple times; compare output to serial

          setenv OMP_NUM_THREADS 3
          setenv OMP_DYNAMIC false
          a.out

      • Debug

  25. Correctness and Debugging • OpenMP is easier than MPI, but bugs are still possible • Common Parallel Bugs • Debugging Approaches

  26. Debugging Tips
      • Check parallel P=1 results

          setenv OMP_NUM_THREADS 1
          setenv OMP_DYNAMIC false
          a.out

      • If results differ from serial, check:
        • Uninitialized private data
        • Missing lastprivate clause
      • If results are same as serial, check for:
        • Unsynchronized access to shared variables
        • Shared variables that should be private
        • Variable size THREADPRIVATE common declarations

  27. Parallel Debugging Is Hard (parbugs.f)
      • What can go wrong?
        • Incorrectly classified variables
        • Unsynchronized writes
        • Data read before written
        • Uninitialized private data
        • Failure to update global data
        • Other race conditions
        • Timing-dependent bugs

  28. Parallel Debugging Is Hard
      • What else can go wrong?
        • Unsynchronized I/O
        • Thread stack collisions
          • Increase the stack size with the mp_set_slave_stacksize() function or the KMP_STACKSIZE variable
        • Privatization of improperly declared arrays
        • Inconsistently declared private common blocks

  29. Debugging Options • Print statements • Multithreaded debuggers • Automatic parallel debugger

  30. Print Statements
      • Advantages
        • WYSIWYG
        • Can be useful
        • Can monitor scheduling of iterations on threads
      • Disadvantages
        • Slow, human-time intensive bug hunting
      • Tips
        • Include thread ID
        • Checksum shared memory regions
        • Protect I/O with a CRITICAL section
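Two of the tips on slide 30, printing the thread ID and protecting I/O with a CRITICAL section, can be combined as in the sketch below. The routine name `trace_sum` and its arguments are hypothetical. The lines starting with `!$` are OpenMP conditional compilation, so the routine also builds as plain serial Fortran, where the thread ID is reported as 0.

```fortran
      subroutine trace_sum(n, x, total)
!$    use omp_lib
      implicit none
      integer n, i, tid
      real*8 x(n), total
      total = 0.0d0
!$omp parallel do private(i, tid) reduction(+:total)
      do i = 1, n
         tid = 0
! Compiled only when OpenMP is enabled.
!$       tid = omp_get_thread_num()
         total = total + x(i)
! One thread prints at a time, so output lines do not interleave.
!$omp critical
         print *, 'thread', tid, ' iteration', i, ' x =', x(i)
!$omp end critical
      enddo
!$omp end parallel do
      end subroutine trace_sum
```

The printed lines show which thread ran which iteration, which is the scheduling information the slide mentions. The order of the lines varies from run to run.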

  31. Multithreaded Debugger
      • Advantages
        • Can find causes of deadlock, such as threads waiting at different barriers
      • Disadvantages
        • Locates symptom, not cause
        • Hard to reproduce errors, especially those which are timing-dependent
        • Difficult to relate parallel (MP) library calls back to original source
        • Human intensive

  32. WorkShop Debugger
      • Graphical User Interface
      • Using the debugger
        • Add debug symbols with ‘-g’ on compile and link:

            f77 -g -mp <source files>

          - or -

            guidef77 -g <source files>

        • Run the debugger

            setenv OMP_NUM_THREADS 3
            setenv OMP_DYNAMIC false
            cvd a.out

        • Follow threads and try to reproduce the bug

  33. Automatic OpenMP Debugger
      • Advantages
        • Systematically finds parallel bugs
          • Deadlocks and race conditions
          • Uninitialized data
          • Reuse of PRIVATE data outside parallel regions
        • Measures thread stack usage
        • Uses computer time rather than human time
      • Disadvantages
        • Data set dependent
        • Requires sequentially consistent program
        • Increased memory usage and CPU time

  34. KAI’s Assure • Looks like an OpenMP compiler • Generates an ideal parallel computer simulation • Itemizes parallel bugs • Locates exact location of bug in source • Includes GUI to browse error reports

  35. Serial Consistency
      • Parallel program must have a serial counterpart
        • Algorithm can’t depend on number of threads
        • Code can’t manually assign domains to threads
        • Can’t call omp_get_thread_num()
        • Can’t use OpenMP lock API
      • Serial code defines correct behavior
        • Serial code should be well debugged
        • Assure sometimes finds serial bugs as well

  36. Using Assure
      • Pick a project database file name: e.g., “buggy.prj”
      • Compile all source files with “assuref77”:

          source /usr/local/apps/KAI/setup.csh
          assuref77 -WA,-pname=./buggy.prj -c buggy.f
          assuref77 -WA,-pname=./buggy.prj buggy.o

      • Source files in multiple directories must specify same project file
      • Run with a small, but representative workload

          a.out
          setenv DISPLAY your_machine:0
          assureview buggy.prj

  37. Assure Tips
      • Select small, but representative data sets
      • Increase test coverage with multiple data sets
      • No need to run job to completion (control-c)
      • Get intermediate reports (e.g., every 2 minutes)

          setenv KDD_INTERVAL 2m
          a.out &
          assureview buggy.prj
          [ wait a few minutes ]
          assureview buggy.prj

      • Quickly learn about stack usage and call graph

          setenv KDD_DELAY 48h

  38. A Different Approach to Parallelization md.f md.omp.f • Locate candidate parallel loop(s) • Identify obvious shared and private variables • Insert OpenMP directives • Compile with Assure parallel debugger • Run program • View parallel errors with AssureView • Update directives

  39. Parallel Performance • Limiters of Parallel Performance • Detecting Performance Problems • Fixing Performance Problems

  40. Parallel Performance
      • Limiters of performance, ordered from easy and obvious to hard and subtle:
        • Amdahl’s law
        • Load imbalance
        • Synchronization
        • Overheads
        • False sharing

  41. Amdahl’s Law • Maximum Efficiency • Fraction parallel limits scalability • Key: Parallelize everything significant
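The equations from slide 41 did not survive in the transcript. For reference, the standard statement of Amdahl's law is given below, using assumed notation that is not from the slide: f is the fraction of serial run time that is parallelized, P is the number of processors, S is speedup, and E is efficiency.

```latex
% Assumed notation: f = parallel fraction, P = processors
S(P) = \frac{1}{(1 - f) + f/P},
\qquad
E(P) = \frac{S(P)}{P},
\qquad
\lim_{P \to \infty} S(P) = \frac{1}{1 - f}

% Worked example: f = 0.95, P = 16
S(16) = \frac{1}{0.05 + 0.95/16} = \frac{1}{0.109375} \approx 9.14,
\qquad
E(16) \approx 0.57
```

Even with 95% of the run time parallelized, 16 processors reach only about 57% efficiency, and no processor count can exceed a speedup of 20. This is why the slide stresses parallelizing everything significant.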

  42. Load Imbalance
      • Unequal work loads lead to idle threads and wasted time

      [Figure: thread timelines between !$omp parallel do and !$omp end parallel do, plotted against time]

  43. Synchronization
      • Lost time waiting for locks

      [Figure: thread timelines through !$omp parallel, !$omp critical, !$omp end critical, and !$omp end parallel, plotted against time]

  44. Parallel Loop Size

      $$\text{Max loop speedup} = \frac{\text{serial loop execution}}{\text{parallel loop startup} + \dfrac{\text{serial loop execution}}{\text{number of processors}}}$$

      • Successful loop parallelization requires large loops.
      • !$OMP PARALLEL DO SCHEDULE(STATIC) startup time
        • ~3500 cycles or 20 microseconds on 4 processors
        • ~200,000 cycles or 1 millisecond on 128 processors
      • Loop time should be large compared to parallel overheads
      • Data size must grow faster than number of threads to maintain parallel efficiency
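The startup figures on slide 44 can be turned into a rough minimum loop size. The derivation below uses assumed symbols that are not from the slide: T_s is the serial loop execution time, T_0 the parallel loop startup time, and P the number of processors. The 50% efficiency target is chosen only for illustration.

```latex
% Maximum loop speedup, with T_s = serial loop time, T_0 = startup time
S_{\max} = \frac{T_s}{T_0 + T_s / P}

% Requiring at least 50% efficiency, S_max >= P/2:
\frac{T_s}{T_0 + T_s / P} \ge \frac{P}{2}
\;\Longleftrightarrow\;
T_s \ge P \, T_0

% With the startup times quoted on the slide:
P = 4:\;   T_s \ge 4 \times 20\,\mu\text{s} = 80\,\mu\text{s},
\qquad
P = 128:\; T_s \ge 128 \times 1\,\text{ms} = 128\,\text{ms}
```

Going from 4 to 128 processors multiplies P by 32 but raises the required serial loop time by a factor of 1600, which illustrates why data size must grow faster than the number of threads.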

  45. False Sharing (false.f)
      • False sharing occurs when multiple threads repeatedly write to the same cache line
      • Use perfex to detect if cache invalidation is a problem

          perfex -a -y -mp <program> <arguments>

      • Use SpeedShop to find the location of the problem

          ssrun -dc_hwc <program> <arguments>
          ssrun -dsc_hwc <program> <arguments>

  46. Measuring Parallel Performance
      • Measure wall clock time with ‘timex’

          setenv OMP_DYNAMIC false
          setenv OMP_NUM_THREADS 1
          timex a.out
          setenv OMP_NUM_THREADS 16
          timex a.out

      • Profilers (speedshop, perfex)
        • Find remaining serial time
        • Identify false sharing
      • Guide’s instrumented parallel library

  47. Using GuideView
      • Compile with Guide OpenMP compiler and normal compile options

          source /usr/local/apps/KAI/setup.csh
          guidef77 -c -Ofast=IP27 -n32 -mips4 source.f ...

      • Link with instrumented library

          guidef77 -WGstats source.o …

      • Run with real parallel workload

          setenv KMP_STACKSIZE 32M
          a.out

      • View performance report

          guideview guide_stats

  48. GuideView
      • Compare achieved to ideal performance
      • Identify parallel bottlenecks such as barriers, locks, and sequential time
      • Compare multiple runs

  49. [GuideView screenshot]
      • Analyze each thread’s performance
      • See how performance bottlenecks change as processors are added

  50. Performance Data By Region
      • Analyze each parallel region
      • Find serial regions that are hurt by parallelism
      • Sort or filter regions to navigate to hotspots
