
Shared Memory Multiprocessors



Presentation Transcript


  1. Shared Memory Multiprocessors A. Jantsch / Z. Lu / I. Sander

  2. Outline • Shared memory architectures • Centralized memory • Distributed memory • Caches • Write-through / write-back caches • The cache coherence problem • Shared memory programming • Critical section • Mutex and semaphore SoC Architecture

  3. Shared Memory Architectures

  4. Shared Memory Architectures • Shared memory multiprocessors are widely used • Symmetric Multiprocessors (SMP) • Symmetric access to all of main memory from any processor • Also called UMA (uniform memory access) • Distributed Shared Memory (DSM) • Access time depends on the location of the data word in memory • Also called NUMA (non-uniform memory access) SoC Architecture

  5. Shared Memory Architectures • A shared memory programming model has a direct representation in hardware • Caches • Increase performance • Require cache coherence and memory consistency protocols SoC Architecture

  6. Shared Cache Architecture • Several processors are connected via a switch to a shared cache and main memory • Has been used for a very small number of processors • Difficult to use for a large number of processors, since the shared cache must deliver extremely high bandwidth (diagram: P1 … Pm, switch, shared cache, main memory) SoC Architecture

  7. Bus-based Shared Memory • The interconnect is a shared bus between the processors' local caches and main memory • Has been used with up to 20 to 30 processors • Scaling is limited by the bandwidth of the shared bus (diagram: P1 … Pm, each with a cache, on a shared bus to main memory) SoC Architecture

  8. Dancehall Architecture • A scalable point-to-point network is placed between the caches and the memory modules that together form main memory • Due to the size of the interconnection network, memory can be very far from the processors (diagram: P1 … Pm with caches, interconnection network, memory modules) SoC Architecture

  9. Distributed Memory • Not a symmetric approach: the local memory is much closer to a processor than the rest of the global memory • The structure scales very well • Using the local memory efficiently is an important design goal (diagram: P1 … Pm, each with a cache and local memory, connected by an interconnection network) SoC Architecture

  10. Shared Memory Programming

  11. Process and history • A process executes a sequence of statements • Each statement consists of one or more atomic (indivisible) actions that transform one state into another (state transition) • A process state is the set of values of its variables at a point in time • A process history is a trace of one execution: a sequence of atomic operations • Example P1: S0 →A1→ S1 →A2→ S2 → … →Am→ Sm SoC Architecture

  12. Atomic Operations • An indivisible sequence of state transitions • Fine-grained atomic operations • Machine instructions (read, write, test-and-set, read-modify-write, swap, etc.) • Atomicity is guaranteed by hardware • Coarse-grained atomic actions • A sequence of fine-grained atomic actions that executes indivisibly • Must not be interrupted • Internal state transitions are not visible ”outside” SoC Architecture

  13. Concurrent execution • The concurrent execution of multiple processes can be viewed as an interleaving of their sequences of atomic actions • A history is a trace of ONE execution, i.e., one interleaving of the processes' atomic actions • Example • Individual histories P1: s0 → s1 P2: p0 → p1 • Interleaved execution histories Trace 1: s0→p0→s1→p1 Trace 2: s0→s1→p0→p1 SoC Architecture

  14. How many traces? • A concurrent program of n processes, each with m atomic actions, can produce N = (n·m)! / (m!)^n different histories! • Example • 3 processes, each with 2 actions, i.e., n=3, m=2, gives N=90 • Implication • This makes it impossible to show the correctness of a program by testing (run the program and see what happens) • Design a ”correct” program in the first place. For shared-variable programming, the problems concern access to shared variables. Therefore a key issue is process synchronization. SoC Architecture
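The formula on this slide can be checked directly; a minimal sketch in Python (the function name is mine):

```python
from math import factorial

def num_histories(n, m):
    """Interleavings of n processes, each a fixed sequence of m atomic actions."""
    return factorial(n * m) // factorial(m) ** n

print(num_histories(3, 2))  # 90
```

Already for modest n and m the count explodes (e.g. 4 processes with 4 actions each give over 60 million histories), which is why testing cannot establish correctness.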

  15. Concurrent Execution Example Task A: x:=0; y:=0; print(x+y); Task B: x:=x+1; y:=y+2; Possible results: 0, 1, 3 What about: undefined, 2? SoC Architecture
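One way to answer the slide's question is to enumerate every interleaving mechanically. The sketch below is my own model: it assumes each statement is atomic and both variables start at 0 (so no read is undefined). Under those assumptions the enumeration shows that 2 is reachable as well (B's x:=x+1 slipping in before A's x:=0, with y:=y+2 landing after y:=0):

```python
from itertools import permutations

# Statements modeled as atomic functions on the shared state s.
A = [lambda s: s.update(x=0),                   # x := 0
     lambda s: s.update(y=0),                   # y := 0
     lambda s: s.update(out=s['x'] + s['y'])]   # print(x + y)
B = [lambda s: s.update(x=s['x'] + 1),          # x := x + 1
     lambda s: s.update(y=s['y'] + 2)]          # y := y + 2

def all_results():
    results = set()
    tagged = [('A', i) for i in range(3)] + [('B', i) for i in range(2)]
    for order in set(permutations(tagged)):
        # Keep only interleavings that preserve each task's program order.
        if [i for t, i in order if t == 'A'] != [0, 1, 2]:
            continue
        if [i for t, i in order if t == 'B'] != [0, 1]:
            continue
        s = {'x': 0, 'y': 0, 'out': None}
        for task, i in order:
            (A if task == 'A' else B)[i](s)
        results.add(s['out'])
    return results

print(sorted(all_results()))  # [0, 1, 2, 3]
```

If the variables start uninitialized instead, the printed value can also be undefined, since print(x+y) may run before both assignments in B complete and before A's resets take effect.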

  16. Synchronization • Synchronization constrains the possible histories to the desirable (good) histories • Synchronization methods • Mutual exclusion (mutex) • Exclusive access to shared variables within a critical section • A mechanism that guarantees serialization of critical sections (atomicity of critical sections with respect to each other) • Condition synchronization • Delaying a process until the state satisfies a boolean condition • More general than mutex • Lesson learnt: synchronization is required whenever processes read and write shared variables, to preserve data dependencies SoC Architecture

  17. Critical section • CS: a piece of code that can only be executed by one process at a time • Provides mutually exclusive access to shared resources (a sequence of statements accessing shared variables) • Two sections are critical with respect to each other if they must not be executed simultaneously, i.e., they are mutually exclusive sections • Some synchronization mechanism is required at the entry and exit of the CS to ensure exclusive use SoC Architecture

  18. The critical section problem • Design entry and exit protocols that satisfy the following properties: • Mutual exclusion • At most one process at a time is entering, executing and exiting the critical section • Absence of deadlocks (livelocks) • One of the competing processes succeeds in entering • Termination: the CS should terminate in finite time • Absence of unnecessary delay • A process is not prevented from entering if no others compete • Fairness (eventual entry, liveness) • A process that tries to enter should eventually enter the CS SoC Architecture

  19. Solutions • Locking mechanisms • Lock on entry; unlock on exit • Variants of locks: spin locks (busy-waiting), queueing locks, etc. • Semaphores • A general solution to the synchronization problem, for both mutual exclusion and condition synchronization SoC Architecture

  20. Lock • Enter CS: set the lock when it is cleared <await (!lock) lock = true;> • Exit CS: clear/release the lock lock = false; • Synonyms: enter-exit, lock-unlock, acquire-release • Example bool lock = false; process CS1 { while (true) { <await (!lock) lock = true;> //entry CS; lock = false; //exit non-critical section; } } process CS2 { // same body as CS1 while (true) { <await (!lock) lock = true;> //entry CS; lock = false; //exit non-critical section; } } SoC Architecture
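The lock-unlock protocol on this slide maps directly onto a library lock. A minimal sketch in Python (variable and function names are mine), using a lock to make a read-modify-write of a shared counter atomic:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(increments):
    global counter
    for _ in range(increments):
        lock.acquire()   # entry protocol: blocks until the lock is free
        counter += 1     # critical section: read-modify-write of counter
        lock.release()   # exit protocol: releases the lock

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000 on every run; without the lock, updates could be lost
```

Removing the acquire/release pair turns `counter += 1` into an unprotected read-modify-write, and concurrent workers may then lose increments.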

  21. Lock implementation • Lock/unlock in terms of instructions: • Locking consists of several instructions • Unlock is an ordinary store instruction • To make locking atomic, locks need hardware support, i.e., special atomic memory instructions • General semantics: <read location, test the value read, compute a new value and store the new value to the location> • Many variants: read-modify-write, test&set, fetch&increment, swap, etc. lock: load register, location //copy location to register cmp register, #0 //compare with 0 bnz lock //if not 0, try again store location, #1 //store 1, marking it locked (the load-test-store sequence must execute atomically) ret unlock: store location, #0 //clearing the lock is an ordinary store ret SoC Architecture

  22. Semaphore • A semaphore is a special kind of shared variable manipulated by two atomic operations, P and V • Semaphores provide a low-level but efficient signaling mechanism for both mutual exclusion and condition synchronization • Inspired by a railroad semaphore: an up/down signal flag • The semaphore operations are named in Dutch • P (decrement when positive) stands for ”proberen” (test) or ”passeren” (pass) • V (increment) stands for ”verhogen” (increase) or ”vrijgeven” (release) SoC Architecture

  23. Semaphore syntax and semantics • Declaration sem s = expr // single semaphore • Initialization: defaults to 1 • The value of a semaphore is a non-negative integer • Operations P(s): <await (s>0) s=s-1;> //wait, down V(s): <s=s+1;> //signal, up SoC Architecture
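The P/V semantics can be tried out with a library semaphore; a minimal sketch in Python, where `acquire` corresponds to P and `release` to V (the variable names are mine):

```python
import threading

s = threading.Semaphore(1)               # sem s = 1

s.acquire()                              # P(s): s > 0, so decrement 1 -> 0
blocked = not s.acquire(blocking=False)  # a second P(s) would block: s == 0
s.release()                              # V(s): increment 0 -> 1
ok = s.acquire(blocking=False)           # P(s) succeeds again

print(blocked, ok)  # True True
```

The non-blocking acquire is used here only to observe the "P(s) waits while s == 0" rule without actually suspending the thread.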

  24. Semaphore types • A binary semaphore takes only the values 0 and 1 • A split binary semaphore is a set of binary semaphores of which at most one is 1 at a time • The sum of the semaphore values stays in [0,1] • A general (counting) semaphore takes any non-negative integer value and can be used for condition synchronization, for example • serving as a resource counter: it counts the number of available resources SoC Architecture

  25. Mutex semaphore • A CS may be executed with mutual exclusion by enclosing it within P and V operations on a binary semaphore • Example: initialized to 1 to indicate that the CS is free sem mutex=1; process CS[i=0 to n] { while (true) { P(mutex); //entry, down CS; V(mutex); //exit, up non-critical section; } } SoC Architecture
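The same pattern in Python, a sketch where the process bodies run once rather than forever so the program terminates (all names are mine); the shared counters only observe how many threads are inside the CS at once:

```python
import threading

mutex = threading.Semaphore(1)   # sem mutex = 1: the CS is initially free
in_cs = 0
max_in_cs = 0

def cs_process():
    global in_cs, max_in_cs
    mutex.acquire()              # P(mutex): entry
    in_cs += 1                   # critical section: at most one thread here
    max_in_cs = max(max_in_cs, in_cs)
    in_cs -= 1
    mutex.release()              # V(mutex): exit

threads = [threading.Thread(target=cs_process) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(max_in_cs)  # 1: the semaphore serialized the critical sections
```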

  26. Caches and Cache Coherency

  27. Caches and Cache Coherence • Caches play a key role in all cases • They reduce the average data access time • They reduce the bandwidth demands placed on the shared interconnect • But private processor caches create a problem • Copies of a variable can be present in multiple caches • A write by one processor may not become visible to the others • They will keep accessing the stale value in their caches • This is the cache coherence problem • Actions must be taken to ensure visibility SoC Architecture

  28. Cache Memories • A cache memory is used to reduce the access time to memory • Cache misses can occur since the cache is much smaller than the memory Main Memory Processor Cache SoC Architecture

  29. Cache Memories • The decision about which parts of memory reside in the cache is made by a replacement algorithm • There are different protocols for a write operation: Write-Back and Write-Through Main Memory Processor Cache SoC Architecture

  30. Cache Memories: Read Operation • If the memory location is in the cache (cache hit), the data is read from the cache • If the memory location is not in the cache (cache miss), the block containing the data is read from memory and the cache is updated Main Memory Processor Cache SoC Architecture

  31. Cache Memories: Write Operation (Write Hit) • Write-Through Protocol • A write operation updates the main memory location • Depending on the protocol, the cache may also be updated • In this course we assume the cache is updated on a write hit Main Memory Processor Cache SoC Architecture

  32. Cache Memories: Write Operation (Write Hit) • Write-Back Protocol • A write operation updates only the cache location and marks it as updated with an associated flag bit (dirty flag) • The main memory is updated later, when the block containing the marked address is removed from the cache Main Memory Processor Cache SoC Architecture

  33. Cache Memories: Write Operation (Write Miss) • Since the data is not necessarily needed on a write, there are two options • Write Allocate: the block is allocated on a write miss, followed by the corresponding write-hit actions • No-Write Allocate: write misses do not affect the cache; only the lower-level memory is updated Main Memory Processor Cache SoC Architecture

  34. Cache Memories: Write Operation (Write Miss) • Write-through and write-back can be combined with write-allocate or no-write-allocate • Typically • Write-back caches use write-allocate • Write-through caches use no-write-allocate • To keep the following discussion simple, we consider • Write-back caches with write-allocate • Write-through caches with no-write-allocate Main Memory Processor Cache SoC Architecture

  35. States for Cache Blocks • Write-through • Invalid • Valid • Write-Back • Invalid • Valid • Dirty (not updated in memory) SoC Architecture
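These states and write policies can be illustrated with a toy single-level cache model; a sketch in Python (class and method names are mine), tracking Valid/Dirty per block as on this slide:

```python
class Cache:
    """Toy one-level cache over a dict-backed main memory."""
    def __init__(self, memory, write_back):
        self.memory = memory        # address -> value (main memory)
        self.write_back = write_back
        self.data = {}              # cached blocks: address -> value
        self.state = {}             # address -> 'VALID' or 'DIRTY'

    def read(self, addr):
        if addr not in self.data:              # read miss: fetch block
            self.data[addr] = self.memory[addr]
            self.state[addr] = 'VALID'
        return self.data[addr]                 # read hit

    def write(self, addr, value):
        if self.write_back:                    # write-back, write-allocate
            self.data[addr] = value            # update only the cache ...
            self.state[addr] = 'DIRTY'         # ... and mark the block dirty
        else:                                  # write-through, no-write-allocate
            self.memory[addr] = value          # always update memory
            if addr in self.data:              # write hit: update the copy
                self.data[addr] = value

    def evict(self, addr):
        if self.state.get(addr) == 'DIRTY':    # write dirty block back
            self.memory[addr] = self.data[addr]
        self.data.pop(addr, None)
        self.state.pop(addr, None)

mem = {'u': 5}
wb = Cache(mem, write_back=True)
wb.write('u', 7)
print(mem['u'], wb.state['u'])   # 5 DIRTY: memory is stale until eviction
wb.evict('u')
print(mem['u'])                  # 7
```

With `write_back=False` the same write updates memory immediately, and a write miss leaves the cache untouched (no-write-allocate).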

  36. Cache Coherence Problem (Uniprocessor) • Single processor with a write-back cache, running 3 processes (diagram: P1 P2 P3, cache, bus, main memory) • P1 reads location u (value 5) from main memory • P3 reads location u from main memory • P3 writes u, changing the value to 7 • P1 reads value u again • P2 reads location u from main memory SoC Architecture

  37. Cache Coherence Problem (Uniprocessor) • P1 reads location u (value 5) from main memory • The cache is updated: the block containing u=5 is loaded into the cache (diagram: cache u=5, memory u=5) SoC Architecture

  38. Cache Coherence Problem (Uniprocessor) • P3 reads location u from main memory • Cache and memory still hold the value u=5 (diagram: cache u=5, memory u=5) SoC Architecture

  39. Cache Coherence Problem (Uniprocessor) • P3 writes u, changing the value to 7 • The cache is updated (u=7) and the block is marked dirty; memory is not changed! (diagram: cache u=7, memory u=5) SoC Architecture

  40. Cache Coherence Problem (Uniprocessor) • P1 reads value u again • Since the cache is shared by all processes, there is no problem even though main memory is not updated: all processes have the same view of the cache! (diagram: cache u=7, memory u=5) SoC Architecture

  41. Cache Coherence Problem (Uniprocessor) • P2 reads location u from main memory • Again no problem: P2 hits on u=7 in the shared cache, even though main memory is not updated (diagram: cache u=7, memory u=5) SoC Architecture

  42. Cache Coherence Problem (Uniprocessor) • With a single processor there is no cache coherence problem! • However, if another device on the bus has direct memory access (e.g., a DMA controller), the cache may no longer reflect the contents of memory and the cache coherence problem can occur! SoC Architecture

  43. Cache Coherence Problem • Three processors, each with a private cache, on a shared bus (diagram: P1 P2 P3, one cache each, bus, main memory) • P1 reads location u (value 5) from main memory • P3 reads location u from main memory • P3 writes u, changing the value to 7 • P1 reads value u again • P2 reads location u from main memory SoC Architecture

  44. Cache Coherence Problem (Write-Through Cache) • P1 reads location u (value 5) from main memory • P1's cache is updated (u=5) (diagram: P1 cache u=5, memory u=5) SoC Architecture

  45. Cache Coherence Problem (Write-Through Cache) • P3 reads location u from main memory • P3's cache is updated (u=5) (diagram: P1 cache u=5, P3 cache u=5, memory u=5) SoC Architecture

  46. Cache Coherence Problem (Write-Through Cache) • P3 writes u, changing the value to 7 • Main memory is updated (u=7) • P3's cache is not updated (no-write-allocate); the block is invalidated (diagram: P1 cache u=5, P3 cache u=5 (Inv), memory u=7) SoC Architecture

  47. Cache Coherence Problem (Write-Through Cache) • P1 reads value u again • P1 reads the value from its cache (u=5), which is not the correct value! (diagram: P1 cache u=5, P3 cache u=5 (Inv), memory u=7) SoC Architecture

  48. Cache Coherence Problem (Write-Through Cache) • P2 reads location u from main memory • P2 reads the value from main memory (u=7) (diagram: P1 cache u=5, P3 cache u=5 (Inv), memory u=7) SoC Architecture
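The write-through scenario above can be reproduced with a toy model; a sketch in Python (all names are mine), with one private write-through cache per processor and no coherence protocol. The model updates the local copy on a write hit; either way, nothing invalidates P1's copy, so P1's later read is stale:

```python
memory = {'u': 5}   # shared main memory

class PrivateCache:
    """Write-through, no-write-allocate cache with no coherence protocol."""
    def __init__(self):
        self.data = {}                  # address -> value

    def read(self, addr):
        if addr not in self.data:       # read miss: fetch from memory
            self.data[addr] = memory[addr]
        return self.data[addr]          # read hit: may return stale data!

    def write(self, addr, value):
        memory[addr] = value            # write through to main memory
        if addr in self.data:           # write hit: update the local copy
            self.data[addr] = value     # (no allocation on a write miss)

p1, p2, p3 = PrivateCache(), PrivateCache(), PrivateCache()
p1.read('u')            # P1 caches u = 5
p3.read('u')            # P3 caches u = 5
p3.write('u', 7)        # memory now holds 7; P1's copy is not invalidated
stale = p1.read('u')    # cache hit in P1: returns the stale value
fresh = p2.read('u')    # miss in P2: fetches the up-to-date value
print(stale, fresh)     # 5 7
```

A coherence protocol would fix this by invalidating or updating P1's copy when P3's write appears on the bus.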

  49. Cache Coherence Problem (Write-Back Cache) • P1 reads location u (value 5) from main memory • P1's cache is updated (u=5) (diagram: P1 cache u=5, memory u=5) SoC Architecture

  50. Cache Coherence Problem (Write-Back Cache) • P3 reads location u from main memory • P3's cache is updated (u=5) (diagram: P1 cache u=5, P3 cache u=5, memory u=5) SoC Architecture
