
High Performance Computing – CISC 811


Presentation Transcript


  1. High Performance Computing – CISC 811 Dr Rob Thacker Dept of Physics (308A) thacker@astro

  2. LINPACK numbers • Without optimization a reasonable figure for the 1000 × 1000 problem is 200-300 Mflops • CPU = Athlon 64, 3200+ • Core 2 CPUs should do even better • Turning on -O2 should close to double this number • Athlon 64, 3200+ gives around 510 Mflops • Optimizations beyond -O2 don’t add much more improvement • Unrolling loops helps (~5% improvement) on some platforms but not others

  3. Assignment 1 solutions available • Figure: variable-size LINPACK versus HINT, with performance regions corresponding to the L1 cache, L2 cache and main memory • HINT’s performance profile can be seen in other programs.

  4. Top 500 • Doubling time for the number of CPUs in a system really is ~1.5 years • Example: 2000 SHARCNET places 310 on top 500 with 128 processors • 2004(.5) SHARCNET wants to have a higher position, RFP asks for 1,500 processors (1.5 year doubling time suggests 1024 CPUs would get them back to 310) • They actually placed at 116 with 1536 processors • ‘e-waste’ anyone?

  5. Quick note on Assignment Q2 • The grid overlays the particles: each particle contributes to its nearest grid points (four of them in 2D, with one weight per point, Weights 1-4 in the figure), and the weights from all particles accumulate.

  6. Today’s Lecture Shared Memory Parallelism I • Part 1: Shared memory programming concepts • Part 2: Shared memory architectures • Part 3: OpenMP I

  7. Part 1: Shared Memory Programming Concepts • Comparison of shared memory versus distributed memory programming paradigms • Administration of threads • Data dependencies • Race conditions

  8. Shared Address Model • Each processor can access every physical memory location in the machine • Each process is aware of all the data it shares with other processes • Data communication between processes is implicit: memory locations are updated • Processes are allowed to have local variables that are not visible to other processes

  9. Comparison: MPI vs OpenMP

    Feature                               OpenMP           MPI
    Apply parallelism in steps            YES              NO
    Scale to large number of processors   MAYBE            YES
    Code complexity                       small increase   major increase
    Code length increase                  2-80%            30-500%
    Runtime environment*                  $$ compilers     FREE
    Cost of hardware                      $$$$$$$!         CHEAP

  *gcc (& gfortran) now supports OpenMP

  10. Distinctions: Process vs thread • A process is an OS-level task • Operates independently of other processes • Has its own process id, memory area, program counter, registers… • The OS has to create a process control block containing information about the process • Expensive to create • Heavyweight process = task • Sometimes called a heavyweight thread

  11. Thread • A thread (lightweight process) is a subunit of a process • Shares data, code and OS resources with other threads • Controlled by one process, but one process may control more than one thread • Has its own program counter, register set, and stack space

  12. Diagrammatically • A process: its code, heap and files, together with a single stack and a single program counter (PC).

  13. Threads • Threads within a process share the code, heap and files, but each thread has its own stack and program counter (PC).

  14. Kernel versus User level threads • Threads can be implemented via two main mechanisms (or combinations of both) • Kernel-level threads: natively supported by the OS • Changing between different threads still requires action from the OS, which adds a significant amount of time • Still not as bad as changing between tasks though • “Middleweight threads” • User-level threads are implemented via a library, e.g. POSIX threads (see the sketch below) • All issues of control are handled by the task rather than the OS • “Lightweight threads” (can be switched very quickly)
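The POSIX threads (Pthreads) library mentioned above is the usual way to work with such threads from C. A minimal sketch, assuming nothing beyond the standard Pthreads API; the worker routine and thread count are purely illustrative:

    /* Create two threads and wait for both to finish.
       Compile with something like: cc threads.c -lpthread */
    #include <pthread.h>
    #include <stdio.h>

    void *worker(void *arg)                 /* routine run by each thread */
    {
        int id = *(int *)arg;
        printf("hello from thread %d\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        int id[2] = {0, 1};
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &id[i]);  /* spawn */
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);                     /* wait for completion */
        return 0;
    }

Each thread gets its own stack and program counter but shares the process’s heap and globals, which is exactly the sharing model described on the previous slides.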

  15. Thread Hierarchy • Approximate time to switch between: • Processes (tasks): 1 ms • Kernel-level threads: 100 µs • User-level threads: 10 µs

  16. Interprocess communication (IPC) • Despite tasks having their own data space, there are methods for communicating between them • “System V shared memory segments” are the best known • Shared regions are created and attached to by issuing specific system calls (shmget, shmat); see the sketch below • You still need a mechanism to create the processes that share this region • Used in the SMP version of GAUSSIAN • See section 8.4 of Wilkinson & Allen, Parallel Programming
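A minimal sketch of the shmget/shmat calls named above; the key, segment size and stored value are arbitrary illustrative choices, and error checking is omitted:

    /* Create a System V shared memory segment, attach it, and write to it.
       A second process calling shmget() with the same key sees the same data. */
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <stdio.h>

    int main(void)
    {
        key_t key = 1234;                                  /* illustrative key */
        int id = shmget(key, 4096, IPC_CREAT | 0600);      /* create a 4 KB segment */
        double *shared = (double *) shmat(id, NULL, 0);    /* attach to our address space */
        shared[0] = 3.14;                                  /* visible to any attached process */
        printf("%f\n", shared[0]);
        shmdt(shared);                                     /* detach */
        shmctl(id, IPC_RMID, NULL);                        /* mark segment for removal */
        return 0;
    }

As the slide notes, shmget/shmat only create and map the shared region; you still need fork() (next slides) or similar to create the processes that share it.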

  17. Thread-based execution • Serial execution, interspersed with parallel sections • The master thread runs each serial section; at a parallel section worker threads are forked, and they join back into the master thread when the section ends (serial - FORK - parallel - JOIN - serial - …) • In practice many compilers block execution of the extra threads during serial sections; this saves the overhead of the ‘fork-join’ operation

  18. Segue: UNIX fork() • Used to create processes for SysV shared memory segments • The UNIX system call fork() creates a child process that is an exact copy of the calling process, with a unique process ID (wait() joins) • All of the parent’s variables are copied into the child’s data space • In the child fork() returns 0 (in the parent it returns the child’s PID)

    pid = fork();
    .
    code to be run by pair
    .
    if (pid==0) exit(0); else wait(0);

  19. Not limited to work replication (MPMD):

    pid = fork();
    if (pid==0) {
      .
      code to be run by servant
      .
    } else {
      .
      code to be run by master
      .
    }
    if (pid==0) exit(0); else wait(0);

  Remember: all the variables in the original process are duplicated.
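The same master/servant pattern written out as a complete, compilable sketch; the printed messages are placeholders, and only fork(), wait() and exit() come from the slide:

    /* MPMD with fork(): parent and child execute different branches. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();              /* duplicate the calling process */
        if (pid == 0) {
            printf("servant: pid %d\n", getpid());       /* child branch */
            exit(0);                     /* child finishes here */
        } else {
            printf("master: child is %d\n", (int) pid);  /* parent branch */
            wait(NULL);                  /* join: wait for the child to exit */
        }
        return 0;
    }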

  20. OpenMP Programs • OpenMP programs use a simple API to create threads-based code • At their simplest they just divide up the iterations of data-parallel loops among threads

  21. What is actually happening in the threads? • 4 threads, n=400 • Thread 1: i=1,100, updates Y(1:100), reads X(1:100) • Thread 2: i=101,200, updates Y(101:200), reads X(101:200) • Thread 3: i=201,300, updates Y(201:300), reads X(201:300) • Thread 4: i=301,400, updates Y(301:400), reads X(301:400) • In memory, each thread touches only its own block of Y() and the matching block of X()
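In OpenMP the picture above corresponds to something like the following C sketch; the arrays and the arithmetic are placeholders, and the contiguous 100-iteration blocks are what a static schedule (typically the default) produces:

    /* 400 iterations split across 4 threads: each thread updates its own
       block of y[] and reads the matching block of x[]. */
    #include <omp.h>

    #define N 400

    void update(double *y, const double *x)
    {
        #pragma omp parallel for num_threads(4)
        for (int i = 0; i < N; i++)
            y[i] = 2.0 * x[i];   /* with a static schedule: thread 0 gets i=0..99,
                                    thread 1 gets i=100..199, and so on */
    }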

  22. Race Conditions • A common operation is to resolve a spatial position into an array index: consider the following loop, where r() is an array of particle positions and A() is an array that is modified using information from r() • It looks innocent enough, but suppose two particles have the same position…

    C$OMP PARALLEL DO
    C$OMP& DEFAULT(NONE)
    C$OMP& PRIVATE(i,j)
    C$OMP& SHARED(n,r,A)
          do i=1,n
            j=int(r(i))
            A(j)=A(j)+1.
          end do

  23. Race Conditions: A concurrency problem • Two different threads of execution can concurrently attempt to update the same memory location • Start: A(j)=1. • Thread 1: gets A(j)=1., adds 1., has A(j)=2. • Thread 2: gets A(j)=1., adds 1., has A(j)=2. • Thread 1: puts A(j)=2. • Thread 2: puts A(j)=2. • End state: A(j)=2. INCORRECT (two increments produced a net change of only one)

  24. Dealing with Race Conditions • Need a mechanism to ensure that updates to single variables occur within a critical section • Any thread entering a critical section blocks all others from entering it • Critical sections can be established by using: • Lock variables (single-bit variables) • Semaphores (Dijkstra 1968) • Monitor procedures (Hoare 1974, used in Java) • An OpenMP example follows below
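OpenMP exposes these critical-section mechanisms directly. A C sketch of the particle-binning loop from the earlier slide with the race removed; the function and variable names are illustrative:

    /* The A(j)=A(j)+1 race fixed by making each update atomic,
       i.e. a tiny critical section around the increment. */
    #include <omp.h>

    void bin_particles(int n, const double *r, double *A)
    {
        #pragma omp parallel for default(none) shared(n, r, A)
        for (int i = 0; i < n; i++) {
            int j = (int) r[i];          /* resolve position to an array index */
            #pragma omp atomic
            A[j] += 1.0;                 /* concurrent updates are now safe */
        }
    }

A #pragma omp critical section would also work, but atomic protects only the single update and so serializes less.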

  25. Simple (spin) lock usage • Each thread executes the same code, so access to the critical section is serialized:

    do while (lock.eq.1)
      spin
    end do
    lock=1
    .
    Critical section
    .
    lock=0
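Note that the test and the set in the pseudocode above must happen atomically, otherwise two threads can both see lock=0 and both enter. A sketch using C11 atomics; the function names acquire/release are illustrative:

    /* Spin lock built on an atomic test-and-set flag. */
    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    void acquire(void)
    {
        while (atomic_flag_test_and_set(&lock))
            ;                      /* spin until the holder releases the lock */
    }

    void release(void)
    {
        atomic_flag_clear(&lock);  /* equivalent to lock=0 */
    }

    /* usage:
         acquire();
         ...critical section: access is serialized...
         release();
    */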

  26. Deadlocks: The pitfall of locking • Must ensure a situation is not created where held locks and outstanding requests form a deadlock • Nested locks are a classic example of this • Can also create the problem with multiple processes, the ‘deadly embrace’: Process 1 holds Resource 1 and requests Resource 2, while Process 2 holds Resource 2 and requests Resource 1

  27. Data Dependencies • Suppose you try to parallelize the following loop • It won’t work as written, since iteration i depends upon iteration i-1 and thus we can’t start anything in parallel

    c=0.0
    do i=1,n
      c=c+1.0
      Y(i)=c
    end do

  28. Simple solution • This loop can easily be re-written in a way that can be parallelized • There is no longer any dependence on the previous iteration • Private variables: i; Shared variables: Y(), c, n

    c=0.0
    do i=1,n
      Y(i)=c+i
    end do
    c=c+n
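With the dependence gone, the loop parallelizes directly; a C/OpenMP sketch of the rewritten version (the value of N and the function name are illustrative):

    /* The rewritten loop has no loop-carried dependence, so the iterations
       can be handed out to threads in any order. */
    #define N 1000

    double fill(double Y[N + 1])
    {
        double c = 0.0;
        #pragma omp parallel for
        for (int i = 1; i <= N; i++)
            Y[i] = c + i;          /* each iteration depends only on i */
        c = c + N;                 /* single serial update after the loop */
        return c;                  /* same final value as the serial version */
    }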

  29. Types of Data Dependencies • Suppose we have operations O1,O2 • True Dependence: • O2 has a true dependence on O1 if O2 reads a value written by O1 • Anti Dependence: • O2 has an anti-dependence on O1 if O2 writes a value read by O1 • Output Dependence: • O2 has an output dependence on O1 if O2 writes a variable written by O1

  30. Examples • True dependence: A1=A2+A3 followed by B1=A1+B2 • Anti-dependence: B1=A1+B2 followed by A1=C2 • Output dependence: B1=5 followed by B1=2

  31. Bernstein’s Conditions (Bernstein, 1966, IEEE Trans. Elec. Comp., Vol. EC-15, pp. 746) • Set of conditions that are sufficient to determine whether two threads can be executed simultaneously • Ii: set of memory locations read by thread Pi • Oj: set of memory locations altered by thread Pj • For threads 1 & 2 to be concurrent: • I1∩O2=Ø (input of 1 cannot intersect output of 2) • I2∩O1=Ø (input of 2 cannot intersect output of 1) • O1∩O2=Ø (outputs cannot intersect)

  32. Example • Consider two threads, a=x+y, b=x+z • Inputs: I1=(x,y), I2=(x,z) • Outputs: O1=(a), O2=(b) • All conditions are satisfied: • I1∩O2=Ø • I2∩O1=Ø • O1∩O2=Ø • This forms the basis for auto-parallelizing compilers: the difficult part is determining access patterns at compile time

  33. Dealing with Data Dependencies • Any loop where iterations depend upon the previous one has a potential problem • Any result which depends upon the order of the iterations will be a problem • A good first test of whether something can be parallelized: reverse the loop iteration order • Not all data dependencies can be eliminated • Accumulations of variables (e.g. the sum of the elements in an array) can be dealt with easily, as the reduction sketch below shows
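OpenMP handles accumulations with a reduction clause, which gives each thread a private partial sum and combines them at the join. A C sketch; the array contents are placeholders:

    /* Sum of array elements without a race, via an OpenMP reduction. */
    #include <omp.h>

    double array_sum(int n, const double *a)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i];           /* each thread accumulates into a private copy */
        return sum;                /* private copies are combined when threads join */
    }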

  34. Summary Part 1 • The concurrent nature of shared memory programming entails dealing with two key issues • Race conditions • Data dependencies • Race conditions can be (partially) solved by locking • Dealing with data dependencies frequently involves algorithmic changes

  35. Part 2: Shared Memory Architectures • Consistency models • Cache coherence • Snoopy versus directory based coherence

  36. Shared Memory design issues • Race conditions cannot be avoided on shared memory architectures • must be dealt with at the programming level • However, updates to shared variables must be propagated to other threads • Variable values must be propagated through the machine via hardware • For multiple CPUs with caches this is non-trivial – let’s look at single CPU first

  37. Caching Architectures • In cache-based architectures we will often have two copies of a variable: one in main memory and one in cache • How do we deal with memory updates if there are two copies? • Two options: either try and keep the main memory in sync with the cache (hard) or wait to update (i.e. send bursts/packets of updates)

  38. Write-through Cache • When cache update occurs result is immediately written back up to main memory • Advantage: consistent picture of memory • Disadvantage: Uses up a lot of memory bandwidth • Disadvantage: Writes to main memory even slower than reads!

  39. Write-back Cache • Wait until a certain number of updates have been made and then write to main memory • Mark a new cache line as being `clean’ • When modified an entry becomes `dirty’ • Clean values can be thrown away, dirty values must be written back to memory • Disadvantage: Memory is not consistent • Advantage: Overcomes problem of waiting for main memory

  40. Inclusive Caching • Modern systems often have multiple layers of cache • However, each cache has only one parent (but may have more than one child) • Only the parent and children communicate directly • Inclusion: data held in a child cache is also held in its parent (e.g. four CPUs with private L1s, pairs of L1s sharing a Level 2, and both Level 2 caches below a single Level 3 cache) • Note not all systems are built this way: exclusive caching (Opteron) does not have this relationship

  41. Common Architecture • Very common to have the L1 cache as write-through to the L2 cache • L2 cache is then write-back to main memory • For this inclusive architecture increasing the size of the L2 can have significant performance gains

  42. Multiprocessor design issue: Cache (line) Coherency • 2 processors, write-back caches • CPU 1 changes a value in its own cache; the line is now dirty but has not been written back to main memory • CPU 2 then requests that location from main memory and receives the stale value • The individual caches no longer agree: coherency is lost

  43. End state differences • Write-back cache: both memory and second cache hold stale values (until dirty cache entries are written) • Write-through cache: Only the second cache is stale • Cannot avoid issue of memory being read before update occurs • Need additional mechanism to preserve some kind of cache coherency

  44. Segue: Consistency models • Strictly consistent model • The intuitive idea of what memory should be • Any read of a memory location X returns the value stored by the most recent write operation to X

    P1: W(x)1 -----------------------
    P2:          R(x)1   R(x)1

  Gharachorloo, Lenoski, Laudon, Gibbons, Gupta & Hennessy, 1990, Proceedings of the 17th International Symposium on Computer Architecture

  45. A couple more examples • Allowed (the first read occurs before the write has completed):

    P1: W(x)1 -----------------------
    P2: R(x)0            R(x)1

  • Not allowed under strict consistency (both reads occur after the write, so reading 0 is impossible):

    P1: W(x)1 -----------------------
    P2:          R(x)0   R(x)1

  46. Sequential Consistency • Lamport 79: • A multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. • Note the order of execution is not specified under this model • Best explained via example

  47. Results under SC • Start: X=Y=0 • Two threads: • Thread 1: R1=X ; Y=1 • Thread 2: R2=Y ; X=1 • Is it possible to have R1=R2=1 under sequential consistency?
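For concreteness, here is the same pair of threads written with C11 atomics, whose default ordering is sequentially consistent; thread creation is omitted and the variable names simply mirror the slide:

    /* Litmus test from the slide, using seq_cst atomics. */
    #include <stdatomic.h>

    atomic_int X = 0, Y = 0;
    int R1, R2;

    void thread1(void)
    {
        R1 = atomic_load(&X);    /* R1 = X */
        atomic_store(&Y, 1);     /* Y  = 1 */
    }

    void thread2(void)
    {
        R2 = atomic_load(&Y);    /* R2 = Y */
        atomic_store(&X, 1);     /* X  = 1 */
    }
    /* Under sequential consistency no interleaving gives R1 == R2 == 1,
       as the enumeration on the next slide shows. */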

  48. Instruction orders allowed under SC • R1=X(=0), Y=1, R2=Y(=1), X=1 gives R1=0, R2=1 • R1=X(=0), R2=Y(=0), Y=1, X=1 gives R1=0, R2=0 • R1=X(=0), R2=Y(=0), X=1, Y=1 gives R1=0, R2=0 • R2=Y(=0), X=1, R1=X(=1), Y=1 gives R1=1, R2=0 • R2=Y(=0), R1=X(=0), Y=1, X=1 gives R1=0, R2=0 • R2=Y(=0), R1=X(=0), X=1, Y=1 gives R1=0, R2=0 • NO! No SC-allowed ordering can produce R1=R2=1

  49. SC doesn’t save you from data races • This is an important point! • Even though you have SC, the operations from the threads can still be ordered differently in the global execution • However, SC does help in some situations; the following pair is race free (Start: X=Y=0): • Thread 1: R1=X ; If (R1>0) Y=1 • Thread 2: R2=Y ; If (R2>0) X=1 • Other memory models exist with weaker requirements

  50. Causality is lost! • The following result is allowed under SC:

    P1: W(x)1 -------------------------------------
    P2:        W(x)2 ------------------------------
    P3:               R(x)2   R(x)1 ---------------
    P4:               R(x)2   R(x)1

  • It is allowed because we can reorder the writes into a global sequence in which W(x)2 precedes W(x)1: every processor then sees the writes in the same order, even though P1’s write happened first in real time
