
Scheduling and Performance Issues for Programming using OpenMP


Presentation Transcript


  1. Scheduling and Performance Issues for Programming using OpenMP Alan Real ISS, University of Leeds

  2. Overview • Scheduling for worksharing constructs • STATIC. • DYNAMIC. • GUIDED. • RUNTIME. • Performance.

  3. SCHEDULE clause • Gives control over which loop iterations are executed by which thread. • Syntax: • Fortran: SCHEDULE(kind[,chunksize]) • C/C++: schedule(kind[,chunksize]) • kind is one of STATIC, DYNAMIC, GUIDED or RUNTIME. • chunksize is an integer expression with a positive value, e.g. !$OMP DO SCHEDULE(DYNAMIC,4) • The default schedule is implementation dependent.
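
  For illustration, a minimal free-form Fortran sketch of the clause in use (the program name and loop body are placeholders, not from the slides):

     program schedule_demo
       implicit none
       integer, parameter :: n = 1000
       integer :: i
       real :: a(n)

       ! Iterations are handed out in chunks of 4 on a first-come, first-served basis.
       !$OMP PARALLEL DO SCHEDULE(DYNAMIC,4)
       do i = 1, n
          a(i) = sqrt(real(i))   ! illustrative per-iteration work
       end do
       !$OMP END PARALLEL DO

       print *, a(n)
     end program schedule_demo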

  4. STATIC schedule • With no chunksize specified: the iteration space is divided into (approximately) equal chunks, and one chunk is assigned to each thread (block schedule). • If chunksize is specified: the iteration space is divided into chunks of chunksize iterations each, and the chunks are assigned to the threads cyclically (block-cyclic schedule).

  5. STATIC schematic • [Diagram: iterations 1-46 assigned to threads T0-T3 as four contiguous blocks under SCHEDULE(STATIC), and as cyclically assigned chunks of 4 under SCHEDULE(STATIC,4)] • A block-cyclic schedule is good for triangular matrices, to avoid load imbalance, e.g.:

     do i = 1, n
        do j = i, n
           ...
        end do
     end do
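
  As a quick way to see the difference, this hedged sketch (program name assumed, not from the slides) prints which thread executes each iteration; run it with, say, 4 threads and compare SCHEDULE(STATIC) against SCHEDULE(STATIC,4):

     program static_chunks
       use omp_lib
       implicit none
       integer :: i

       ! With 4 threads and SCHEDULE(STATIC,4), iterations 1-4 go to thread 0,
       ! 5-8 to thread 1, 9-12 to thread 2, 13-16 to thread 3, then 17-20 back
       ! to thread 0, and so on (block-cyclic).
       !$OMP PARALLEL DO SCHEDULE(STATIC,4)
       do i = 1, 32
          print *, 'iteration', i, 'executed by thread', omp_get_thread_num()
       end do
       !$OMP END PARALLEL DO
     end program static_chunks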

  6. DYNAMIC schedule • Divides the iteration space up into chunks of size chunksize and assigns them to threads on a first-come first-served basis. • i.e. as a thread finishes a chunk, it is assigned the next chunk in the list. • When no chunksize is specified it defaults to 1.

  7. GUIDED schedule • Similar to DYNAMIC, but the chunks start off large and get smaller exponentially. • The size of the next chunk is (roughly) the number of remaining iterations divided by the number of threads. • The chunksize specifies the minimum size of the chunks. • When no chunksize is specified it defaults to 1.

  8. DYNAMIC and GUIDED schematic • [Diagram: iterations 1-46 assigned to threads T0-T3 under SCHEDULE(DYNAMIC,3) and SCHEDULE(GUIDED,3); the dynamic chunks are a fixed 3 iterations, while the guided chunks shrink towards the end of the iteration space]
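
  A hedged sketch of the kind of imbalanced (triangular) loop these schedules help with; the program name and sizes are illustrative, not from the slides, and the schedule kind can be swapped between DYNAMIC and GUIDED:

     program dyn_vs_guided
       implicit none
       integer, parameter :: n = 46
       integer :: i, j
       real :: b(n)

       b = 0.0
       ! The inner trip count grows with i, so later iterations are heavier;
       ! DYNAMIC (or GUIDED) hands out chunks of 3 iterations on demand,
       ! keeping the threads roughly equally busy.
       !$OMP PARALLEL DO SCHEDULE(DYNAMIC,3) PRIVATE(j)
       do i = 1, n
          do j = 1, i
             b(i) = b(i) + real(j)
          end do
       end do
       !$OMP END PARALLEL DO

       print *, sum(b)
     end program dyn_vs_guided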

  9. RUNTIME schedule • Allows the choice of schedule to be deferred until runtime. • Set by the environment variable OMP_SCHEDULE, e.g. % setenv OMP_SCHEDULE "guided,4" • There is no chunksize associated with the RUNTIME schedule.
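
  A minimal sketch (program name and loop body assumed) of a loop whose schedule is picked up from the environment at run time:

     program runtime_schedule
       implicit none
       integer, parameter :: n = 1000
       integer :: i
       real :: a(n)

       ! The kind and chunk size come from OMP_SCHEDULE at run time, e.g.
       !   setenv OMP_SCHEDULE "guided,4"    (csh, as on the slide)
       !   export OMP_SCHEDULE="guided,4"    (sh/bash)
       !$OMP PARALLEL DO SCHEDULE(RUNTIME)
       do i = 1, n
          a(i) = real(i)**2   ! illustrative work
       end do
       !$OMP END PARALLEL DO

       print *, a(n)
     end program runtime_schedule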

  10. Choosing a schedule • STATIC is best for balanced loops – least overhead. • STATIC,n is good for loops with mild or smooth load imbalance – but can introduce “false sharing” (see later). • DYNAMIC is useful if iterations have widely varying loads, but it destroys data locality (and can cause extra cache misses). • GUIDED is often less expensive than DYNAMIC, but beware of loops where the first iterations are the most expensive! • Use RUNTIME for convenient experimentation.

  11. Performance • It is easy to write an OpenMP program, but hard to write an efficient one! • Five main causes of poor performance: • Sequential code. • Communication. • Load imbalance. • Synchronisation. • Compiler (non-)optimisation.

  12. Sequential code • Amdahl’s law: a parallel code can never run faster than the time taken by the parts which can only be executed in serial. • Limits performance. • Need to find ways of parallelising it! • In OpenMP, all code outside of parallel regions, and inside MASTER, SINGLE and CRITICAL directives, is sequential. • This code should be as small as possible.
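
  The usual statement of the law, in LaTeX form (the worked numbers are an illustrative example, not from the slides):

     % Speedup on p threads when a fraction s of the runtime is inherently serial:
     S(p) = \frac{1}{s + \frac{1-s}{p}}, \qquad \lim_{p\to\infty} S(p) = \frac{1}{s}
     % Example: s = 0.05 (5% serial) and p = 16 threads gives
     %   S(16) = 1/(0.05 + 0.95/16) \approx 9.1, and never more than 1/0.05 = 20.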

  13. Communication • On shared-memory machines, communication is “disguised” as increased memory access costs. • It takes longer to access data in main memory or in another processor’s cache than it does from local cache. • Memory accesses are expensive! • ~100 cycles for a main memory access compared to 1-3 cycles for a flop. • Unlike message passing, communication is spread throughout the program. • Much harder to analyse and monitor.

  14. Caches and coherency • Shared memory programming assumes that a shared variable has a unique value at a given time. • Caching means that multiple copies of a memory location may exist in the hardware. • To avoid two processors caching different values of the same memory location, caches must be kept coherent. • Coherency operations are usually performed on the cache lines in the level of cache closest to memory (Level 2 cache on most systems). • Maxima has: • 64 KB Level 1 cache with 32-byte cache lines (4 d.p. words). • 8 MB Level 2 cache with 512-byte cache lines (64 d.p. words).

  15. Memory hierarchy • [Diagram: four processors (P), each with a private Level 1 and Level 2 cache (L1C, L2C), connected through an interconnect to main memory]

  16. Coherency • Highly simplified view: • Each cache line can exist in one of 3 states: • Exclusive: the only valid copy in any cache. • Read-only: a valid copy but other caches may contain it. • Invalid: out of date and cannot be used. • A READ on an invalid or absent cache line will be cached as read-only. • A READ on an exclusive cache line will be changed to read only. • A WRITE on a line not in an exclusive state will cause all other copies to be marked invalid and the written line to be marked exclusive.

  17. Coherency example • [Diagram: reads (R) and writes (W) from four processors moving cache lines between the Exclusive, Read-Only and Invalid states across the L1/L2 caches, interconnect and memory]

  18. False sharing • Cache lines consist of several words of data. • What happens when two processors are both writing to different words on the same cache line? • Each write will invalidate the other processor’s copy. • Lots of remote memory accesses. • Symptoms: • Poor speedup. • High, non-deterministic numbers of cache misses. • Mild, non-deterministic, unexpected load imbalance.

  19. False sharing example 1

     integer count(maxthreads)
!$OMP PARALLEL
     :
     count(myid) = count(myid) + 1

  • Each thread writes a different element of the same array. • Words with consecutive addresses are very likely to be in the same cache line. • Solution: pad the array by the dimension of a cache line:

     parameter (linesize=16)
     integer count(linesize,maxthreads)
!$OMP PARALLEL
     :
     count(1,myid) = count(1,myid) + 1

  20. False sharing example 2 • Small chunk sizes:

!$OMP DO SCHEDULE(STATIC,1)
     do j = 1, n
        do i = 1, j
           b(j) = b(j) + a(i,j)
        end do
     end do

  • will induce false sharing on b.

  21. Data affinity • Data is cached on the processors which access it. • Must reuse cached data as much as possible. • Write code with good data affinity: • Ensure the same thread accesses the same subset of program data as much as possible. • Try to make these subsets large, contiguous chunks of data. • Will avoid false sharing and other problems.

  22. Data affinity example

!$OMP DO PRIVATE(i)
     do j = 1, n
        do i = 1, n
           a(i,j) = i + j
        end do
     end do

!$OMP DO SCHEDULE(STATIC,16)
!$OMP& PRIVATE(i)
     do j = 1, n
        do i = 1, j
           b(j) = b(j) + a(i,j)
        end do
     end do

  • [Diagram: columns j of a distributed across the threads by the two different schedules] • The first 16 values of j are OK – the rest are cache misses!

  23. Distributed shared memory systems • Location of data in main memory is important. • OpenMP has no support for controlling this. • Normally data is placed on the first processor which accesses it. • Data placement can be controlled indirectly by parallelisation of data initialisation. • Not an issue on Maxima as CPUs have reasonably uniform memory access within each machine.
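

  Where placement does matter, a hedged sketch of controlling it indirectly through parallel initialisation (first touch); the program name and sizes are illustrative, not from the slides:

     program first_touch
       implicit none
       integer, parameter :: n = 1000000
       integer :: i
       real, allocatable :: a(:)

       allocate(a(n))

       ! Initialise in parallel with the same (STATIC) schedule used later,
       ! so each page is first touched (and hence placed) near the thread
       ! that will keep accessing it.
       !$OMP PARALLEL DO SCHEDULE(STATIC)
       do i = 1, n
          a(i) = 0.0
       end do
       !$OMP END PARALLEL DO

       ! Later loops over a(:) with the same schedule then reuse local memory.
       !$OMP PARALLEL DO SCHEDULE(STATIC)
       do i = 1, n
          a(i) = a(i) + real(i)
       end do
       !$OMP END PARALLEL DO

       print *, a(n)
     end program first_touch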

  24. Load imbalance • Load imbalance can arise from both communication and computation. • Worth experimenting with different scheduling options. • Can use SCHEDULE(RUNTIME). • If none are appropriate, it may be best to do your own scheduling!

  25. Synchronisation • Barriers can be very expensive (typically 1000s of cycles to synchronise 32 processors). • Avoid barriers via: • Careful use of the NOWAIT clause (see the sketch below). • Parallelising at the outermost level possible. • May require re-ordering of loops and/or indices. • Consider using point-to-point synchronisation for nearest-neighbour type problems. • Choice of CRITICAL / ATOMIC / lock routines may impact performance.
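
  A minimal sketch of the NOWAIT case (program name and loop bodies assumed): the two worksharing loops update independent arrays, so the implicit barrier between them can be dropped:

     program nowait_demo
       implicit none
       integer, parameter :: n = 1000
       integer :: i
       real :: a(n), b(n)

       !$OMP PARALLEL
       ! No thread needs the results of the other loop, so the implicit
       ! barrier at the end of this loop can be removed with NOWAIT.
       !$OMP DO SCHEDULE(STATIC)
       do i = 1, n
          a(i) = real(i)
       end do
       !$OMP END DO NOWAIT

       !$OMP DO SCHEDULE(STATIC)
       do i = 1, n
          b(i) = 2.0*real(i)
       end do
       !$OMP END DO
       !$OMP END PARALLEL

       print *, a(n), b(n)
     end program nowait_demo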

  26. Compiler (non-)optimisation • Sometimes the addition of parallel directives can inhibit the compiler from performing sequential optimisations. • Symptoms: • 1-thread parallel code has a longer execution time and higher instruction count than the sequential code. • Can sometimes be cured by making shared data private, or local to a routine.

  27. Performance tuning • “My code is giving me poor speedup. I don’t know why. What do I do now?” • A: • Say “this machine/language is a heap of junk”. • Give up and go back to your workstation/PC. • B: • Try to classify and localise the sources of overhead: what type of problem is it, and where in the code does it occur? • Use any available tools that can help you: add timers, use profilers (e.g. the Sun Performance Analyzer). • Fix the problems responsible for the largest overheads first. • Iterate. • Good luck!
