
Shared Memory Multiprocessors



Presentation Transcript


  1. Shared Memory Multiprocessors A. Jantsch / Z. Lu / I. Sander

  2. Outline • Shared memory architectures • Centralized memory • Distributed memory • Caches • Write-through / write-back caches • The cache coherence problem • Shared memory programming • Critical section • Mutex and semaphore SoC Architecture

  3. Shared Memory Architectures

  4. Shared Memory Architectures • Shared memory multiprocessors are widely used • Symmetric Multiprocessors (SMP) • Symmetric access to all of main memory from any processor • Also called UMA (uniform memory access) • Distributed Shared Memory (DSM) • Access time depends on the location of the data word in memory • Also called NUMA (non-uniform memory access) SoC Architecture

  5. Shared Memory Architectures • A shared memory programming model has a direct representation in hardware • Caches • Increase performance • Require cache coherence and memory consistency protocols SoC Architecture

  6. Shared Cache Architecture • Several processors are connected via a switch to a shared cache and main memory • Has been used for a very small number of processors • Difficult to use for a large number of processors, since the shared cache must deliver extremely high bandwidth (diagram: P1 … Pm, switch, shared cache, main memory) SoC Architecture

  7. Bus-based Shared Memory • The interconnect is a shared bus between the processors' local caches and main memory • Has been used with up to 20 to 30 processors • Scaling is limited by the bandwidth of the shared bus (diagram: P1 … Pm, each with a cache, on a shared bus to main memory) SoC Architecture

  8. Dancehall Architecture • A scalable point-to-point network is placed between the caches and the memory modules that together form main memory • Due to the size of the interconnection network, memory can be very far from the processors (diagram: P1 … Pm with caches, interconnection network, memory modules) SoC Architecture

  9. Distributed Memory • Not a symmetric approach: the local memory is much closer to a processor than the rest of the global memory • The structure scales very well • Using the local memory efficiently is an important design goal (diagram: P1 … Pm, each with a cache and local memory, connected by an interconnection network) SoC Architecture

  10. Shared Memory Programming

  11. Process and history • A process executes a sequence of statements • Each statement consists of one or more atomic (indivisible) actions that transform one state into another (state transition) • A process state is the set of values of its variables at a point in time • A process history is a trace of one execution: a sequence of atomic operations • Example P1: S0 →A1→ S1 →A2→ S2 → … →Am→ Sm SoC Architecture

  12. Atomic Operations • An indivisible sequence of state transitions • Fine-grained atomic operations • Machine instructions (read, write, test-and-set, read-modify-write, swap, etc.) • Atomicity is guaranteed by hardware • Coarse-grained atomic actions • A sequence of fine-grained atomic actions that executes indivisibly • Must not be interrupted • Internal state transitions are not visible ”outside” SoC Architecture

  13. Concurrent execution • The concurrent execution of multiple processes can be viewed as an interleaving of their sequences of atomic actions • A history is a trace of ONE execution, i.e., one interleaving of the processes' atomic actions • Example • Individual histories P1: s0 → s1 P2: p0 → p1 • Interleaved execution histories Trace 1: s0→p0→s1→p1 Trace 2: s0→s1→p0→p1 SoC Architecture

  14. How many traces? • A concurrent program of n processes, each with m atomic actions, can produce N = (n·m)! / (m!)^n different histories! • Example • 3 processes, each with 2 actions, i.e., n=3, m=2, gives N=90 • Implication • This makes it impossible to show the correctness of a program by testing (run the program and see what happens) • Design a ”correct” program in the first place. For shared-variable programming, the problems concern access to shared variables. Therefore a key issue is process synchronization. SoC Architecture
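The formula on this slide can be checked directly; a minimal sketch in Python (the function name is mine):

```python
from math import factorial

def num_histories(n, m):
    """Interleavings of n processes, each a fixed sequence of m atomic actions."""
    return factorial(n * m) // factorial(m) ** n

print(num_histories(3, 2))  # 90
```

Already for modest n and m the count explodes (e.g. 4 processes with 4 actions each give over 60 million histories), which is why testing cannot establish correctness.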

  15. Concurrent Execution Example Task A: x:=0; y:=0; print(x+y); Task B: x:=x+1; y:=y+2; Possible results: 0, 1, 3 What about: undefined, 2? SoC Architecture
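One way to answer the slide's question is to enumerate every interleaving mechanically. The sketch below is my own model: it assumes each statement is atomic and both variables start at 0 (so no read is undefined). Under those assumptions the enumeration shows that 2 is reachable as well (B's x:=x+1 slipping in before A's x:=0, with y:=y+2 landing after y:=0):

```python
from itertools import permutations

# Statements modeled as atomic functions on the shared state s.
A = [lambda s: s.update(x=0),                   # x := 0
     lambda s: s.update(y=0),                   # y := 0
     lambda s: s.update(out=s['x'] + s['y'])]   # print(x + y)
B = [lambda s: s.update(x=s['x'] + 1),          # x := x + 1
     lambda s: s.update(y=s['y'] + 2)]          # y := y + 2

def all_results():
    results = set()
    tagged = [('A', i) for i in range(3)] + [('B', i) for i in range(2)]
    for order in set(permutations(tagged)):
        # Keep only interleavings that preserve each task's program order.
        if [i for t, i in order if t == 'A'] != [0, 1, 2]:
            continue
        if [i for t, i in order if t == 'B'] != [0, 1]:
            continue
        s = {'x': 0, 'y': 0, 'out': None}
        for task, i in order:
            (A if task == 'A' else B)[i](s)
        results.add(s['out'])
    return results

print(sorted(all_results()))  # [0, 1, 2, 3]
```

If the variables start uninitialized instead, the printed value can also be undefined, since print(x+y) may run before both assignments in B complete and before A's resets take effect.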

  16. Synchronization • Synchronization constrains the possible histories to the desirable (good) histories • Synchronization methods • Mutual exclusion (mutex) • Exclusive access to shared variables within a critical section • A mechanism that guarantees serialization of critical sections (atomicity of critical sections with respect to each other) • Condition synchronization • Delaying a process until the state satisfies a boolean condition • More general than mutex • Lesson learnt: synchronization is required whenever processes read and write shared variables, to preserve data dependencies SoC Architecture

  17. Critical section • CS: a piece of code that can only be executed by one process at a time • Provides mutually exclusive access to shared resources (a sequence of statements accessing shared variables) • Two sections are critical with respect to each other if they must not be executed simultaneously, i.e., they are mutually exclusive sections • Some synchronization mechanism is required at the entry and exit of the CS to ensure exclusive use SoC Architecture

  18. The critical section problem • Design entry and exit protocols that satisfy the following properties: • Mutual exclusion • At most one process at a time is entering, executing and exiting the critical section • Absence of deadlocks (livelocks) • One of the competing processes succeeds in entering • Termination: the CS should terminate in finite time • Absence of unnecessary delay • A process is not prevented from entering if no others compete • Fairness (eventual entry, liveness) • A process that tries to enter should eventually enter the CS SoC Architecture

  19. Solutions • Locking mechanisms • Lock on entry; unlock on exit • Variants of locks: spin locks (busy-waiting), queueing locks, etc. • Semaphores • A general solution to the synchronization problem, for both mutual exclusion and condition synchronization SoC Architecture

  20. Lock • Enter CS: set the lock when it is cleared <await (!lock) lock = true;> • Exit CS: clear/release the lock lock = false; • Synonyms: enter-exit, lock-unlock, acquire-release • Example bool lock = false; process CS1 { while (true) { <await (!lock) lock = true;> //entry CS; lock = false; //exit non-critical section; } } process CS2 { // same body as CS1 while (true) { <await (!lock) lock = true;> //entry CS; lock = false; //exit non-critical section; } } SoC Architecture
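The lock-unlock protocol on this slide maps directly onto a library lock. A minimal sketch in Python (variable and function names are mine), using a lock to make a read-modify-write of a shared counter atomic:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(increments):
    global counter
    for _ in range(increments):
        lock.acquire()   # entry protocol: blocks until the lock is free
        counter += 1     # critical section: read-modify-write of counter
        lock.release()   # exit protocol: releases the lock

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 on every run; without the lock, updates could be lost
```

Removing the acquire/release pair turns `counter += 1` into an unprotected read-modify-write, and concurrent workers may then lose increments.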

  21. Lock implementation • Lock/unlock in terms of instructions: • Locking consists of several instructions • Unlock is an ordinary store instruction • To make locking atomic, locks need hardware support, i.e., special atomic memory instructions • General semantics: <read location, test the value read, compute a new value and store the new value to the location> • Many variants: read-modify-write, test&set, fetch&increment, swap, etc. lock: load register, location //copy location to register cmp register, #0 //compare with 0 bnz lock //if not 0, try again store location, #1 //store 1, marking it locked (the load-test-store sequence must execute atomically) ret unlock: store location, #0 //clearing the lock is an ordinary store ret SoC Architecture

  22. Semaphore • A semaphore is a special kind of shared variable manipulated by two atomic operations, P and V • Semaphores provide a low-level but efficient signaling mechanism for both mutual exclusion and condition synchronization • Inspired by a railroad semaphore: an up/down signal flag • The semaphore operations are named in Dutch • P (decrement when positive) stands for ”proberen” (test) or ”passeren” (pass) • V (increment) stands for ”verhogen” (increase) or ”vrijgeven” (release) SoC Architecture

  23. Semaphore syntax and semantics • Declaration sem s = expr // single semaphore • Initialization: defaults to 1 • The value of a semaphore is a non-negative integer • Operations P(s): <await (s>0) s=s-1;> //wait, down V(s): <s=s+1;> //signal, up SoC Architecture
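The P/V semantics can be tried out with a library semaphore; a minimal sketch in Python, where `acquire` corresponds to P and `release` to V (the variable names are mine):

```python
import threading

s = threading.Semaphore(1)               # sem s = 1

s.acquire()                              # P(s): s > 0, so decrement 1 -> 0
blocked = not s.acquire(blocking=False)  # a second P(s) would block: s == 0
s.release()                              # V(s): increment 0 -> 1
ok = s.acquire(blocking=False)           # P(s) succeeds again

print(blocked, ok)  # True True
```

The non-blocking acquire is used here only to observe the "P(s) waits while s == 0" rule without actually suspending the thread.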

  24. Semaphore types • A binary semaphore takes only the values 0 and 1 • A split binary semaphore is a set of binary semaphores of which at most one is 1 at a time • The sum of the semaphore values stays in [0,1] • A general (counting) semaphore takes any non-negative integer value and can be used for condition synchronization, for example • serving as a resource counter: it counts the number of available resources SoC Architecture

  25. Mutex semaphore • A CS may be executed with mutual exclusion by enclosing it within P and V operations on a binary semaphore • Example: initialized to 1 to indicate that the CS is free sem mutex=1; process CS[i=0 to n] { while (true) { P(mutex); //entry, down CS; V(mutex); //exit, up non-critical section; } } SoC Architecture
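The same pattern in Python, a sketch where the process bodies run once rather than forever so the program terminates (all names are mine); the shared counters only observe how many threads are inside the CS at once:

```python
import threading

mutex = threading.Semaphore(1)   # sem mutex = 1: the CS is initially free
in_cs = 0
max_in_cs = 0

def cs_process():
    global in_cs, max_in_cs
    mutex.acquire()              # P(mutex): entry
    in_cs += 1                   # critical section: at most one thread here
    max_in_cs = max(max_in_cs, in_cs)
    in_cs -= 1
    mutex.release()              # V(mutex): exit

threads = [threading.Thread(target=cs_process) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(max_in_cs)  # 1: the semaphore serialized the critical sections
```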

  26. Caches and Cache Coherency

  27. Caches and Cache Coherence • Caches play a key role in all cases • They reduce the average data access time • They reduce the bandwidth demands placed on the shared interconnect • But private processor caches create a problem • Copies of a variable can be present in multiple caches • A write by one processor may not become visible to the others • They will keep accessing the stale value in their caches • This is the cache coherence problem • Actions must be taken to ensure visibility SoC Architecture

  28. Cache Memories • A cache memory is used to reduce the access time to memory • Cache misses can occur since the cache is much smaller than the memory Main Memory Processor Cache SoC Architecture

  29. Cache Memories • The decision about which parts of memory reside in the cache is made by a replacement algorithm • There are different protocols for a write operation: Write-Back and Write-Through Main Memory Processor Cache SoC Architecture

  30. Cache Memories: Read Operation • If the memory location is in the cache (cache hit), the data is read from the cache • If the memory location is not in the cache (cache miss), the block containing the data is read from memory and the cache is updated Main Memory Processor Cache SoC Architecture

  31. Cache Memories: Write Operation (Write Hit) • Write-Through Protocol • A write operation updates the main memory location • Depending on the protocol, the cache may also be updated • In this course we assume the cache is updated on a write hit Main Memory Processor Cache SoC Architecture

  32. Cache Memories: Write Operation (Write Hit) • Write-Back Protocol • A write operation updates only the cache location and marks it as updated with an associated flag bit (dirty flag) • The main memory is updated later, when the block containing the marked address is removed from the cache Main Memory Processor Cache SoC Architecture

  33. Cache Memories: Write Operation (Write Miss) • Since the data is not necessarily needed on a write, there are two options • Write Allocate: the block is allocated on a write miss, followed by the corresponding write-hit actions • No-Write Allocate: write misses do not affect the cache; only the lower-level memory is updated Main Memory Processor Cache SoC Architecture

  34. Cache Memories: Write Operation (Write Miss) • Write-through and write-back can be combined with write-allocate or no-write-allocate • Typically • Write-back caches use write-allocate • Write-through caches use no-write-allocate • To keep the following discussion simple, we consider • Write-back caches with write-allocate • Write-through caches with no-write-allocate Main Memory Processor Cache SoC Architecture

  35. States for Cache Blocks • Write-through • Invalid • Valid • Write-Back • Invalid • Valid • Dirty (not updated in memory) SoC Architecture
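These states and write policies can be illustrated with a toy single-level cache model; a sketch in Python (class and method names are mine), tracking Valid/Dirty per block as on this slide:

```python
class Cache:
    """Toy one-level cache over a dict-backed main memory."""
    def __init__(self, memory, write_back):
        self.memory = memory        # address -> value (main memory)
        self.write_back = write_back
        self.data = {}              # cached blocks: address -> value
        self.state = {}             # address -> 'VALID' or 'DIRTY'

    def read(self, addr):
        if addr not in self.data:              # read miss: fetch block
            self.data[addr] = self.memory[addr]
            self.state[addr] = 'VALID'
        return self.data[addr]                 # read hit

    def write(self, addr, value):
        if self.write_back:                    # write-back, write-allocate
            self.data[addr] = value            # update only the cache ...
            self.state[addr] = 'DIRTY'         # ... and mark the block dirty
        else:                                  # write-through, no-write-allocate
            self.memory[addr] = value          # always update memory
            if addr in self.data:              # write hit: update the copy
                self.data[addr] = value

    def evict(self, addr):
        if self.state.get(addr) == 'DIRTY':    # write dirty block back
            self.memory[addr] = self.data[addr]
        self.data.pop(addr, None)
        self.state.pop(addr, None)

mem = {'u': 5}
wb = Cache(mem, write_back=True)
wb.write('u', 7)
print(mem['u'], wb.state['u'])   # 5 DIRTY: memory is stale until eviction
wb.evict('u')
print(mem['u'])                  # 7
```

With `write_back=False` the same write updates memory immediately, and a write miss leaves the cache untouched (no-write-allocate).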

  36. Cache Coherence Problem (Uniprocessor) • Single processor with a write-back cache, running 3 processes (diagram: P1 P2 P3, cache, bus, main memory) • P1 reads location u (value 5) from main memory • P3 reads location u from main memory • P3 writes u, changing the value to 7 • P1 reads value u again • P2 reads location u from main memory SoC Architecture

  37. Cache Coherence Problem (Uniprocessor) • P1 reads location u (value 5) from main memory • The cache is updated: the block containing u=5 is loaded into the cache (diagram: cache u=5, memory u=5) SoC Architecture

  38. Cache Coherence Problem (Uniprocessor) • P3 reads location u from main memory • Cache and memory still hold the value u=5 (diagram: cache u=5, memory u=5) SoC Architecture

  39. Cache Coherence Problem (Uniprocessor) • P3 writes u, changing the value to 7 • The cache is updated (u=7) and the block is marked dirty; memory is not changed! (diagram: cache u=7, memory u=5) SoC Architecture

  40. Cache Coherence Problem (Uniprocessor) • P1 reads value u again • Since the cache is shared by all processes, there is no problem even though main memory is not updated: all processes have the same view of the cache! (diagram: cache u=7, memory u=5) SoC Architecture

  41. Cache Coherence Problem (Uniprocessor) • P2 reads location u from main memory • Again no problem: P2 hits on u=7 in the shared cache, even though main memory is not updated (diagram: cache u=7, memory u=5) SoC Architecture

  42. Cache Coherence Problem (Uniprocessor) • With a single processor there is no cache coherence problem! • However, if another device on the bus has direct memory access (e.g., a DMA controller), the cache may no longer reflect the contents of memory and the cache coherence problem can occur! SoC Architecture

  43. Cache Coherence Problem • Three processors, each with a private cache, on a shared bus (diagram: P1 P2 P3, one cache each, bus, main memory) • P1 reads location u (value 5) from main memory • P3 reads location u from main memory • P3 writes u, changing the value to 7 • P1 reads value u again • P2 reads location u from main memory SoC Architecture

  44. Cache Coherence Problem (Write-Through Cache) • P1 reads location u (value 5) from main memory • P1's cache is updated (u=5) (diagram: P1 cache u=5, memory u=5) SoC Architecture

  45. Cache Coherence Problem (Write-Through Cache) • P3 reads location u from main memory • P3's cache is updated (u=5) (diagram: P1 cache u=5, P3 cache u=5, memory u=5) SoC Architecture

  46. Cache Coherence Problem (Write-Through Cache) • P3 writes u, changing the value to 7 • Main memory is updated (u=7) • P3's cache is not updated (no-write-allocate); the block is invalidated (diagram: P1 cache u=5, P3 cache u=5 (Inv), memory u=7) SoC Architecture

  47. Cache Coherence Problem (Write-Through Cache) • P1 reads value u again • P1 reads the value from its cache (u=5), which is not the correct value! (diagram: P1 cache u=5, P3 cache u=5 (Inv), memory u=7) SoC Architecture

  48. Cache Coherence Problem (Write-Through Cache) • P2 reads location u from main memory • P2 reads the value from main memory (u=7) (diagram: P1 cache u=5, P3 cache u=5 (Inv), memory u=7) SoC Architecture
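The write-through scenario above can be reproduced with a toy model; a sketch in Python (all names are mine), with one private write-through cache per processor and no coherence protocol. The model updates the local copy on a write hit; either way, nothing invalidates P1's copy, so P1's later read is stale:

```python
memory = {'u': 5}   # shared main memory

class PrivateCache:
    """Write-through, no-write-allocate cache with no coherence protocol."""
    def __init__(self):
        self.data = {}                  # address -> value

    def read(self, addr):
        if addr not in self.data:       # read miss: fetch from memory
            self.data[addr] = memory[addr]
        return self.data[addr]          # read hit: may return stale data!

    def write(self, addr, value):
        memory[addr] = value            # write through to main memory
        if addr in self.data:           # write hit: update the local copy
            self.data[addr] = value     # (no allocation on a write miss)

p1, p2, p3 = PrivateCache(), PrivateCache(), PrivateCache()
p1.read('u')            # P1 caches u = 5
p3.read('u')            # P3 caches u = 5
p3.write('u', 7)        # memory now holds 7; P1's copy is not invalidated
stale = p1.read('u')    # cache hit in P1: returns the stale value
fresh = p2.read('u')    # miss in P2: fetches the up-to-date value
print(stale, fresh)     # 5 7
```

A coherence protocol would fix this by invalidating or updating P1's copy when P3's write appears on the bus.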

  49. Cache Coherence Problem (Write-Back Cache) • P1 reads location u (value 5) from main memory • P1's cache is updated (u=5) (diagram: P1 cache u=5, memory u=5) SoC Architecture

  50. Cache Coherence Problem (Write-Back Cache) • P3 reads location u from main memory • P3's cache is updated (u=5) (diagram: P1 cache u=5, P3 cache u=5, memory u=5) SoC Architecture
