
Advanced Computer Architecture 5MD00 / 5Z032 Multi-Processing

Advanced Computer Architecture 5MD00 / 5Z032 Multi-Processing. Henk Corporaal www.ics.ele.tue.nl/~heco/courses/aca h.corporaal@tue.nl TUEindhoven 2007. Topics. Why Parallel Processors Communication models Challenge of parallel processing Coherence problem Consistency problem


Presentation Transcript


  1. Advanced Computer Architecture 5MD00 / 5Z032: Multi-Processing. Henk Corporaal, www.ics.ele.tue.nl/~heco/courses/aca, h.corporaal@tue.nl, TU Eindhoven, 2007

  2. Topics • Why Parallel Processors • Communication models • Challenge of parallel processing • Coherence problem • Consistency problem • Synchronization • Fundamental design issues • Interconnection networks • Book: Chapter 4, appendix E, H ACA H.Corporaal

  3. Which parallelism are we talking about? Classification: Flynn Categories • SISD (Single Instruction Single Data) • Uniprocessors • MISD (Multiple Instruction Single Data) • Systolic arrays / stream based processing • SIMD (Single Instruction Multiple Data = DLP) • Examples: Illiac-IV, CM-2 (Thinking Machines), Xetal (Philips), Imagine (Stanford), Vector machines, Cell architecture (Sony) • Simple programming model • Low overhead • Now applied as sub-word parallelism !! • MIMD (Multiple Instruction Multiple Data) • Examples: Sun Enterprise 5000, Cray T3D, SGI Origin, Multi-core Pentiums, and many more…. • NoCs (Networks-on-Chip) • Flexible • Use off-the-shelf processor cores ACA H.Corporaal

  4. Why parallel processing • Performance drive • Diminishing returns for exploiting ILP and OLP • Multiple processors fit easily on a chip • Cost effective (just connect existing processors or processor cores) • Low power: parallelism may allow lowering Vdd However: • Parallel programming is hard ACA H.Corporaal

  5. Low power through parallelism
     • Sequential processor (one CPU)
       • Switching capacitance C
       • Frequency f
       • Voltage V
       • $P_1 = f C V^2$
     • Parallel processor (CPU1 and CPU2: two times the number of units)
       • Switching capacitance 2C
       • Frequency f/2
       • Voltage V' < V
       • $P_2 = \frac{f}{2} \cdot 2C \cdot V'^2 = f C V'^2 < P_1$
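As a hedged worked example of the comparison on slide 5: the slide only states V' < V, so the value V' = 0.7 V below is an assumption chosen for illustration, not a figure from the course.

```latex
\begin{align*}
P_1 &= f\,C\,V^2 \\
P_2 &= \frac{f}{2}\cdot 2C\cdot V'^2 = f\,C\,V'^2 \\
\frac{P_2}{P_1} &= \left(\frac{V'}{V}\right)^2 = 0.7^2 = 0.49
  \qquad \text{(assuming } V' = 0.7\,V\text{)}
\end{align*}
```

Under that assumption the two half-speed units deliver the same nominal throughput at roughly half the power; the saving comes entirely from the quadratic dependence on voltage.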

  6. Parallel Architecture
     • Parallel architecture extends traditional computer architecture with a communication network:
       • abstractions (HW/SW interface)
       • organizational structure to realize the abstraction efficiently
     [Figure: five processing nodes attached to a communication network]

  7. Communication models: Shared Memory
     [Figure: processes P1 and P2 both read and write a shared memory]
     • Coherence problem
     • Memory consistency issue
     • Synchronization problem

  8. Communication models: Shared memory • Shared address space • Communication primitives: • load, store, atomic swap Two varieties: • Physically shared => Symmetric Multi-Processors (SMP) • usually combined with local caching • Physically distributed => Distributed Shared Memory (DSM) ACA H.Corporaal

  9. SMP: Symmetric Multi-Processor
     • Memory: centralized with uniform access time (UMA) and bus interconnect, I/O
     • Examples: Sun Enterprise 6000, SGI Challenge, Intel
     [Figure: four processors, each with one or more cache levels, connected to main memory and the I/O system]

  10. DSM: Distributed Shared Memory
     • Nonuniform access time (NUMA) and scalable interconnect (distributed memory)
     [Figure: four processor/cache/memory nodes on an interconnection network; also labeled: main memory, I/O system]

  11. Shared Address Model Summary • Each processor can name every physical location in the machine • Each process can name all data it shares with other processes • Data transfer via load and store • Data size: byte, word, ... or cache blocks • Memory hierarchy model applies: • communication moves data to local proc. cache ACA H.Corporaal

  12. Communication models: Message Passing
     [Figure: processes P1 and P2, each with send and receive, connected through FIFOs]
     • Communication primitives
       • e.g., send and receive library calls
     • Note that message passing (MP) can be built on top of shared memory (SM) and vice versa

  13. Message Passing Model
     • Explicit message send and receive operations
     • Send specifies a local buffer plus the receiving process on the remote computer
     • Receive specifies the sending process on the remote computer plus a local buffer in which to place the data
     • Typically blocking communication, but may use DMA
     • Message structure: Header | Data | Trailer
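The send/receive model of slide 13 can be sketched as a small single-threaded FIFO of messages. This is only an illustration under stated assumptions: the field names, the fixed sizes `FIFO_SLOTS` and `MAX_DATA`, the byte-sum trailer, and the non-blocking "return false instead of waiting" behaviour are all hypothetical choices, not part of the course material, and the code is not thread-safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FIFO_SLOTS 4      /* hypothetical channel depth */
#define MAX_DATA   32     /* hypothetical maximum payload size */

typedef struct {
    int      dest;            /* header: receiving process id */
    size_t   length;          /* header: payload size in bytes */
    char     data[MAX_DATA];  /* data */
    unsigned checksum;        /* trailer: simple byte sum */
} Message;

typedef struct {
    Message slot[FIFO_SLOTS];
    int head, tail, count;
} Fifo;

void fifo_init(Fifo *f)
{
    assert(f != NULL);
    f->head = f->tail = f->count = 0;
}

static unsigned byte_sum(const char *p, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (unsigned char)p[i];
    return sum;
}

/* Copies the local buffer into the channel; false means "would block". */
bool fifo_send(Fifo *f, int dest, const void *buf, size_t n)
{
    assert(f != NULL && buf != NULL);
    if (n > MAX_DATA || f->count == FIFO_SLOTS)
        return false;
    Message *m = &f->slot[f->tail];
    m->dest = dest;
    m->length = n;
    memcpy(m->data, buf, n);
    m->checksum = byte_sum(m->data, n);
    f->tail = (f->tail + 1) % FIFO_SLOTS;
    f->count++;
    return true;
}

/* Copies the oldest message into the local buffer; false if none fits or arrives intact. */
bool fifo_receive(Fifo *f, void *buf, size_t cap, size_t *n)
{
    assert(f != NULL && buf != NULL && n != NULL);
    if (f->count == 0)
        return false;
    Message *m = &f->slot[f->head];
    if (m->length > cap || byte_sum(m->data, m->length) != m->checksum)
        return false;
    memcpy(buf, m->data, m->length);
    *n = m->length;
    f->head = (f->head + 1) % FIFO_SLOTS;
    f->count--;
    return true;
}
```

A blocking variant, as the slide describes for the typical case, would wait instead of returning `false`; that needs synchronization between sender and receiver, which this sketch deliberately leaves out.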

  14. Message passing communication
     [Figure: four nodes, each with processor, cache, memory, DMA and a network interface, connected by an interconnection network]

  15. Communication Models: Comparison
     • Shared Memory
       • Compatibility with well-understood (language) mechanisms
       • Ease of programming for complex or dynamic communication patterns
       • Shared-memory applications; sharing of large data structures
       • Efficient for small items
       • Supports hardware caching
     • Message Passing
       • Simpler hardware
       • Explicit communication
       • Implicit synchronization (with any communication)

  16. Network: Performance metrics • Network Bandwidth • Need high bandwidth in communication • How does it scale with number of nodes? • Communication Latency • Affects performance, since processor may have to wait • Affects ease of programming, since it requires more thought to overlap communication and computation How can a mechanism help hide latency? • overlap message send with computation, • prefetch data, • switch to other task or thread ACA H.Corporaal

  17. Challenges of parallel processing
     • Q1: Can we get linear speedup? Suppose we want a speedup of 80 with 100 processors. What fraction of the original computation can be sequential (i.e. non-parallel)?
     • Q2: How important is communication latency? Suppose 0.2% of all accesses are remote and require 100 cycles, on a processor with base CPI = 0.5. What is the communication impact?
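The two questions on slide 17 can be worked out with Amdahl's law and a simple CPI model. The sketch below is an illustration, not the course's own solution; the function names are hypothetical, and for Q2 it assumes that the 0.2% remote rate is counted per instruction and that each remote access stalls the processor for the full 100 cycles.

```c
#include <assert.h>

/* Amdahl's law: speedup = 1 / (s + (1 - s) / n),
 * with s the sequential fraction and n the number of processors. */
double amdahl_speedup(double seq_fraction, int n)
{
    assert(n > 0 && seq_fraction >= 0.0 && seq_fraction <= 1.0);
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / n);
}

/* Amdahl's law solved for s: s = (n / speedup - 1) / (n - 1). */
double max_seq_fraction(double target_speedup, int n)
{
    assert(n > 1 && target_speedup > 0.0 && target_speedup <= n);
    return ((double)n / target_speedup - 1.0) / (n - 1);
}

/* Effective CPI when a fraction of instructions stalls on a remote access. */
double effective_cpi(double base_cpi, double remote_rate, double remote_cycles)
{
    assert(base_cpi > 0.0 && remote_rate >= 0.0 && remote_cycles >= 0.0);
    return base_cpi + remote_rate * remote_cycles;
}
```

For Q1, `max_seq_fraction(80.0, 100)` gives about 0.0025, so only roughly 0.25% of the original computation may be sequential. For Q2, `effective_cpi(0.5, 0.002, 100.0)` gives 0.5 + 0.002 × 100 = 0.7, which under the stated assumptions is 1.4 times slower than a machine with only local accesses.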

  18. Three fundamental issues for shared memory multiprocessors
     • Coherence: do I see the most recent data?
     • Consistency: when do I see a written value?
       • e.g. do different processors see writes at the same time (w.r.t. other memory accesses)?
     • Synchronization: how to synchronize processes?
       • how to protect access to shared data?

  19. Coherence problem, in single CPU system
     [Figure: three snapshots of one CPU with cache, memory and I/O]
     • Initial state: cache a' = 100, b' = 200; memory a = 100, b = 200
     • CPU writes to a: cache a' = 550, b' = 200; memory a = 100, b = 200 (not coherent)
     • I/O writes b: cache a' = 100, b' = 200; memory a = 100, b = 440 (not coherent)

  20. Coherence problem, in Multi-Proc system
     [Figure: CPU-1 cache holds a' = 550, b' = 200; CPU-2 cache holds a'' = 100, b'' = 200; memory holds a = 100, b = 200]

  21. What Does Coherency Mean? • Informally: • “Any read must return the most recent write” • Too strict and too difficult to implement • Better: • “Any write must eventually be seen by a read” • All writes are seen in proper order (“serialization”) ACA H.Corporaal

  22. Two rules to ensure coherency • “If P writes x and P1 reads it, P’s write will be seen by P1 if the read and write are sufficiently far apart” • Writes to a single location are serialized: seen in one order • Latest write will be seen • Otherwise could see writes in illogical order (could see older value after a newer value) ACA H.Corporaal

  23. Potential HW Coherency Solutions
     • Snooping Solution (Snoopy Bus):
       • Send all requests for data to all processors (or local caches)
       • Processors snoop to see if they have a copy and respond accordingly
       • Requires broadcast, since caching information is at processors
       • Works well with bus (natural broadcast medium)
       • Dominates for small scale machines (most of the market)
     • Directory-Based Schemes
       • Keep track of what is being shared in one centralized place
       • Distributed memory => distributed directory for scalability (avoids bottlenecks)
       • Send point-to-point requests to processors via network
       • Scales better than Snooping
       • Actually existed BEFORE Snooping-based schemes

  24. Example Snooping protocol
     • 3 states for each cache line:
       • invalid, shared, modified (exclusive)
     • FSM per cache, receives requests from both processor and bus
     [Figure: four processors with caches, connected to main memory and the I/O system]

  25. Cache coherence protocol
     • Write invalidate protocol for write-back cache
     • Showing state transitions for each block in the cache
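The state diagram of slide 25 did not survive in this transcript. As a stand-in, here is a minimal sketch of a three-state (invalid, shared, modified) write-invalidate FSM for a write-back cache, in the textbook style. The enum and function names are hypothetical, all events are assumed to refer to the same block (replacement misses are not modeled), and it is not claimed to match the course's diagram transition for transition.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { INVALID, SHARED, MODIFIED } LineState;

typedef enum {
    CPU_READ, CPU_WRITE,            /* requests from the local processor */
    BUS_READ_MISS, BUS_WRITE_MISS   /* requests snooped on the bus */
} Event;

typedef enum { NONE, PLACE_READ_MISS, PLACE_WRITE_MISS, WRITE_BACK } BusAction;

/* Returns the next state of one cache line and reports the bus action taken. */
LineState msi_next(LineState s, Event e, BusAction *action)
{
    assert(action != NULL);
    *action = NONE;
    switch (s) {
    case INVALID:
        if (e == CPU_READ)  { *action = PLACE_READ_MISS;  return SHARED; }
        if (e == CPU_WRITE) { *action = PLACE_WRITE_MISS; return MODIFIED; }
        return INVALID;   /* snooped traffic does not affect an invalid line */
    case SHARED:
        if (e == CPU_WRITE) { *action = PLACE_WRITE_MISS; return MODIFIED; }
        if (e == BUS_WRITE_MISS) return INVALID;   /* another cache writes */
        return SHARED;    /* local read hit, or another cache reading */
    case MODIFIED:
        if (e == BUS_READ_MISS)  { *action = WRITE_BACK; return SHARED; }
        if (e == BUS_WRITE_MISS) { *action = WRITE_BACK; return INVALID; }
        return MODIFIED;  /* local read or write hit */
    }
    return s;
}
```

The key property is that a line can be MODIFIED in at most one cache: any write first invalidates the other copies, and a modified line is written back before another cache may read or write it.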

  26. Synchronization problem
     • Computer system of a bank has a credit process (P_c) and a debit process (P_d)

         /* Process P_c */              /* Process P_d */
         shared int balance             shared int balance
         private int amount             private int amount

         balance += amount              balance -= amount

         lw  $t0,balance                lw  $t2,balance
         lw  $t1,amount                 lw  $t3,amount
         add $t0,$t0,$t1                sub $t2,$t2,$t3
         sw  $t0,balance                sw  $t2,balance
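Why the two instruction sequences on slide 26 are a problem becomes visible when one particular interleaving is replayed by hand. The sketch below is a deterministic, single-threaded simulation of that interleaving, not real concurrent code; the function name and the chosen order of steps are illustrative assumptions.

```c
#include <assert.h>

/* Replays one bad interleaving of P_c (credit) and P_d (debit):
 * both processes load the old balance before either stores. */
int lost_update_demo(int balance, int credit, int debit)
{
    assert(credit >= 0 && debit >= 0);
    int t0, t2;           /* stand-ins for registers $t0 and $t2 */
    t0 = balance;         /* P_c: lw  $t0,balance */
    t2 = balance;         /* P_d: lw  $t2,balance */
    t0 = t0 + credit;     /* P_c: add $t0,$t0,$t1 */
    t2 = t2 - debit;      /* P_d: sub $t2,$t2,$t3 */
    balance = t0;         /* P_c: sw  $t0,balance */
    balance = t2;         /* P_d: sw  $t2,balance  (overwrites the credit) */
    return balance;
}
```

With a balance of 100, a credit of 50 and a debit of 30, this interleaving ends at 70, whereas any serial execution ends at 120: the credit is lost because both processes read the balance before either wrote it back.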

  27. Critical Section Problem
     • n processes all competing to use some shared data
     • Each process has a code segment, called the critical section, in which the shared data is accessed
     • Problem: ensure that when one process is executing in its critical section, no other process is allowed to execute in its critical section
     • Structure of a process:

         while (TRUE) {
             entry_section();
             critical_section();
             exit_section();
             remainder_section();
         }

  28. Attempt 1 – Strict Alternation

         /* Process P0 */                   /* Process P1 */
         shared int turn;                   shared int turn;

         while (TRUE) {                     while (TRUE) {
             while (turn != 0);                 while (turn != 1);
             critical_section();                critical_section();
             turn = 1;                          turn = 0;
             remainder_section();               remainder_section();
         }                                  }

     Two problems:
     • Satisfies mutual exclusion, but not progress (works only when both processes strictly alternate)
     • Busy waiting

  29. Attempt 2 – Warning Flags

         /* Process P0 */                   /* Process P1 */
         shared int flag[2];                shared int flag[2];

         while (TRUE) {                     while (TRUE) {
             flag[0] = TRUE;                    flag[1] = TRUE;
             while (flag[1]);                   while (flag[0]);
             critical_section();                critical_section();
             flag[0] = FALSE;                   flag[1] = FALSE;
             remainder_section();               remainder_section();
         }                                  }

     • Satisfies mutual exclusion
       • P0 in critical section: flag[0] && !flag[1]
       • P1 in critical section: !flag[0] && flag[1]
     • However, contains a deadlock (both flags may be set to TRUE!)

  30. Software solution: Peterson's Algorithm (combining warning flags and alternation)

         /* Process P0 */                       /* Process P1 */
         shared int flag[2];                    shared int flag[2];
         shared int turn;                       shared int turn;

         while (TRUE) {                         while (TRUE) {
             flag[0] = TRUE;                        flag[1] = TRUE;
             turn = 0;                              turn = 1;
             while (turn==0 && flag[1]);            while (turn==1 && flag[0]);
             critical_section();                    critical_section();
             flag[0] = FALSE;                       flag[1] = FALSE;
             remainder_section();                   remainder_section();
         }                                      }

     Software solution is slow!
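The algorithm of slide 30 can be written in C11 roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and it relies on the default sequentially consistent ordering of `<stdatomic.h>` operations, because with plain `int` variables the compiler or the hardware may reorder the flag store and the loads, which would break mutual exclusion.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool flag[2];   /* flag[i]: process i wants to enter */
static atomic_int  turn;      /* id of the process that wrote turn last */

void peterson_lock(int self)
{
    assert(self == 0 || self == 1);
    int other = 1 - self;
    atomic_store(&flag[self], true);
    atomic_store(&turn, self);
    /* Wait only while the other process is interested and we wrote turn last. */
    while (atomic_load(&turn) == self && atomic_load(&flag[other]))
        ;   /* busy wait */
}

void peterson_unlock(int self)
{
    assert(self == 0 || self == 1);
    atomic_store(&flag[self], false);
}
```

A thread with id 0 or 1 would bracket its critical section with `peterson_lock(id)` and `peterson_unlock(id)`. As in the slide, the process that writes `turn` last is the one that waits, so the algorithm works for exactly two processes.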

  31. Hardware solution for Synchronization
     • For large scale MPs, synchronization can be a bottleneck; techniques are needed to reduce contention and latency of synchronization
     • Hardware primitives needed
       • all solutions are based on "atomically inspect and update a memory location"
     • Higher level synchronization solutions can be built on top

  32. Uninterruptible Instructions to Fetch and Update Memory
     • Atomic exchange: interchange a value in a register for a value in memory
       • 0 => synchronization variable is free
       • 1 => synchronization variable is locked and unavailable
     • Test-and-set: tests a value and sets it if the value passes the test (also compare-and-swap)
     • Fetch-and-increment: returns the value of a memory location and atomically increments it
       • 0 => synchronization variable is free

  33. Build a 'spin-lock' using exchange primitive
     • Spin locks: the processor continuously tries to acquire the lock, spinning around a loop:

                 LI   R2,#1       ; load immediate
         lockit: EXCH R2,0(R1)    ; atomic exchange
                 BNEZ R2,lockit   ; already locked?

     • What about MP with cache coherency?
       • Want to spin on the cache copy to avoid full memory latency
       • Likely to get cache hits for such variables
     • Problem: exchange includes a write, which invalidates all other copies; this generates considerable bus traffic
     • Solution: start by simply repeatedly reading the variable; when it changes, then try exchange ("test and test&set"):

         try:    LI   R2,#1       ; load immediate
         lockit: LW   R3,0(R1)    ; load var
                 BNEZ R3,lockit   ; not free => spin
                 EXCH R2,0(R1)    ; atomic exchange
                 BNEZ R2,try      ; already locked?
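The "test and test&set" loop of slide 33 maps onto C11 atomics as sketched below. The type and function names are hypothetical; `atomic_exchange` plays the role of EXCH and `atomic_load` the role of the plain LW, and the default sequentially consistent ordering is used for simplicity rather than for performance.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { atomic_int locked; } SpinLock;   /* 0 = free, 1 = locked */

/* One atomic exchange: returns true if the lock was free and is now ours. */
bool spin_try_lock(SpinLock *l)
{
    assert(l != NULL);
    return atomic_exchange(&l->locked, 1) == 0;
}

void spin_lock(SpinLock *l)
{
    assert(l != NULL);
    for (;;) {
        /* Test: spin with reads only, which hit in the local cache. */
        while (atomic_load(&l->locked) != 0)
            ;
        /* Test&set: the lock looked free, so now try the exchange. */
        if (spin_try_lock(l))
            return;
    }
}

void spin_unlock(SpinLock *l)
{
    assert(l != NULL);
    atomic_store(&l->locked, 0);
}
```

The inner read-only loop is what keeps the bus quiet: waiting processors only generate invalidation traffic when the lock is released and they race for it with the exchange.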

  34. Alternative to Fetch and Update
     • Hard to have read & write in 1 instruction: use 2 instead
     • Load Linked (or load locked) + Store Conditional
       • Load linked returns the initial value
       • Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise
     • Example doing atomic swap with LL & SC:

         try:  OR    R3,R4,R0    ; R3 = R4 (value to store)
               LL    R2,0(R1)    ; load linked
               SC    R3,0(R1)    ; store conditional
               BEQZ  R3,try      ; branch if store fails (R3=0)

     • Example doing fetch & increment with LL & SC:

         try:  LL    R2,0(R1)    ; load linked
               ADDUI R3,R2,#1    ; increment
               SC    R3,0(R1)    ; store conditional
               BEQZ  R3,try      ; branch if store fails (R3=0)
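C has no direct load-linked/store-conditional operations, but the retry pattern of slide 34 has a close analogue in `atomic_compare_exchange_weak`, which (like SC) may fail and must be retried. The sketch below uses it for both examples; the function names are hypothetical, and a compare-and-swap loop is a substitute technique here, not the LL/SC mechanism itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Returns the old value of *loc and atomically adds one to it. */
int fetch_and_increment(atomic_int *loc)
{
    assert(loc != NULL);
    int old = atomic_load(loc);                 /* plays the role of LL */
    /* Plays the role of SC: on failure 'old' is refreshed and we retry. */
    while (!atomic_compare_exchange_weak(loc, &old, old + 1))
        ;
    return old;
}

/* Atomically stores new_value into *loc and returns the previous value. */
int swap_value(atomic_int *loc, int new_value)
{
    assert(loc != NULL);
    int old = atomic_load(loc);
    while (!atomic_compare_exchange_weak(loc, &old, new_value))
        ;
    return old;
}
```

In production code `atomic_fetch_add` and `atomic_exchange` do the same jobs directly; the explicit loops are shown only because they mirror the try/branch-back structure of the assembly.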

  35. Another MP Issue: Memory Consistency
     • What is consistency? When must a processor see a new memory value?
     • Example:

         P1: A = 0;                 P2: B = 0;
             .....                      .....
             A = 1;                     B = 1;
         L1: if (B == 0) ...        L2: if (A == 0) ...

     • Seems impossible for both if-statements L1 & L2 to be true?
     • What if write invalidate is delayed & the processor continues?
     • Memory consistency models: what are the rules for such cases?

  36. Sequential Consistency (SC)
     • Sequential consistency: the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved
       => in the example of slide 35, the assignments finish before the if-statements
     • SC implementation: delay all memory accesses until all invalidates are done
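That sequential consistency rules out the surprising outcome of slide 35 can be checked by brute force: enumerate every interleaving that keeps each processor's accesses in program order and count those in which both if-conditions hold. The sketch below does this for the four relevant accesses; the function name and the bitmask encoding are illustrative choices, and A and B are taken to be 0 initially, as in the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Enumerates the sequentially consistent interleavings of
 *     P1: A = 1; r1 = B;        P2: B = 1; r2 = A;
 * and returns how many of them end with r1 == 0 and r2 == 0,
 * i.e. with both if-statements L1 and L2 taken. */
int count_both_reads_zero(int *total)
{
    assert(total != NULL);
    int hits = 0;
    *total = 0;
    for (unsigned mask = 0; mask < 16; mask++) {
        /* Bit i set: step i is executed by P1, otherwise by P2. */
        int p1_steps = 0;
        for (int i = 0; i < 4; i++)
            p1_steps += (mask >> i) & 1;
        if (p1_steps != 2)
            continue;               /* each processor has exactly 2 steps */

        int A = 0, B = 0, r1 = -1, r2 = -1, pc1 = 0, pc2 = 0;
        for (int i = 0; i < 4; i++) {
            if ((mask >> i) & 1) {
                if (pc1++ == 0) A = 1; else r1 = B;   /* P1 in program order */
            } else {
                if (pc2++ == 0) B = 1; else r2 = A;   /* P2 in program order */
            }
        }
        (*total)++;
        if (r1 == 0 && r2 == 0)
            hits++;
    }
    return hits;
}
```

The function visits 6 interleavings and finds none with both reads returning 0. On hardware that lets a read complete before an earlier write is visible to the other processor, that outcome becomes possible, which is exactly the case the slide's "write invalidate is delayed" question points at.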

  37. Sequential consistency overkill?
     • Schemes for faster execution than sequential consistency
     • Most programs are synchronized
       • A program is synchronized if all accesses to shared data are ordered by synchronization operations
     • Example (the write by P1 and the read by P2 are ordered):
       • P1: write (x) ... release (s) {unlock} ...
       • P2: acquire (s) {lock} ... read (x)

  38. Relaxed Memory Consistency Models
     • Several relaxed models for memory consistency exist, since most programs are synchronized
     • Key: (partially) allow reads and writes to complete out-of-order
     • Models are characterized by their attitude towards (for accesses to different addresses):
       • W → R: total store ordering
       • W → W: partial store ordering
       • R → W and R → R: weak ordering, and others
     • Note, seq. consistency means that all four orderings are kept:
       • W → R, W → W, R → W and R → R

  39. Fundamental MP design decision We have already discussed: • Shared memory versus Message passing • Coherence, Consistency and Synchronization issues Other extremely important decisions: • Processing units: • Homogeneous versus Heterogeneous? • Generic versus Application specific ? • Interconnect: • Bus versus Network ? • Type (topology) of network • What types of parallelism to support ? • Focus on Performance, Power or Cost ? • Memory organization ? ACA H.Corporaal

  40. Homogeneous or Heterogeneous
     • Homogeneous:
       • replication effect
       • memory dominated anyway
       • solve realization issues once and for all
       • less flexible

  41. Homogeneous or Heterogeneous • Heterogeneous • better fit to application domain • smaller increments ACA H.Corporaal

  42. Homogeneous or Heterogeneous
     • Middle of the road approach
       • Flexible tiles
       • Fixed tile structure at top level

  43. Bus (shared) or Network (switched)
     • Network:
       • claimed to be more scalable
       • no bus arbitration
       • point-to-point connections
       • but router overhead
     [Figure: Example NoC with a 2x4 mesh routing network; each of the eight nodes is attached to a router R]

  44. Network design parameters
     Important network design space:
     • topology, degree
     • routing algorithm
       • path, path control, collision resolution, network support, deadlock handling, livelock handling
     • virtual layer support
     • flow control
     • buffering
     • QoS guarantees
     • error handling
     • etc.

  45. Switch / Network Topology Topology determines: • Degree: number of links from a node • Diameter: max number of links crossed between nodes • Average distance: number of links to random destination • Bisection: minimum number of links that separate the network into two halves • Bisection bandwidth = link bandwidth x bisection ACA H.Corporaal

  46. Common Topologies

     Type        Degree     Diameter          Ave Dist        Bisection
     1D mesh     2          N-1               N/3             1
     2D mesh     4          2(N^(1/2) - 1)    2N^(1/2) / 3    N^(1/2)
     3D mesh     6          3(N^(1/3) - 1)    3N^(1/3) / 3    N^(2/3)
     nD mesh     2n         n(N^(1/n) - 1)    nN^(1/n) / 3    N^((n-1)/n)
     Ring        2          N/2               N/4             2
     2D torus    4          N^(1/2)           N^(1/2) / 2     2N^(1/2)
     Hypercube   log2 N     n = log2 N        n/2             N/2
     2D Tree     3          2 log2 N          ~2 log2 N       1
     Crossbar    N-1        1                 1               N^2 / 2

     N = number of nodes, n = dimension
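To make the formulas of slide 46 concrete, the sketch below evaluates degree, diameter and bisection for a few of the listed topologies. The struct and function names are hypothetical, each function takes the side length or dimension directly instead of N (to avoid computing integer roots), and the torus and ring formulas are only applied to even sizes.

```c
#include <assert.h>

typedef struct { int degree, diameter, bisection; } TopoMetrics;

/* 2D mesh with k x k nodes (N = k*k). */
TopoMetrics mesh2d(int k)
{
    assert(k >= 2);
    return (TopoMetrics){ 4, 2 * (k - 1), k };
}

/* 2D torus with k x k nodes (N = k*k), k even. */
TopoMetrics torus2d(int k)
{
    assert(k >= 2 && k % 2 == 0);
    return (TopoMetrics){ 4, k, 2 * k };
}

/* Hypercube of dimension n (N = 2^n nodes). */
TopoMetrics hypercube(int n)
{
    assert(n >= 1 && n < 31);
    return (TopoMetrics){ n, n, (1 << n) / 2 };
}

/* Ring with an even number of nodes. */
TopoMetrics ring(int nodes)
{
    assert(nodes >= 4 && nodes % 2 == 0);
    return (TopoMetrics){ 2, nodes / 2, 2 };
}
```

For N = 64 nodes this gives: an 8 x 8 mesh with diameter 14 and bisection 8, an 8 x 8 torus with diameter 8 and bisection 16, a 6-dimensional hypercube with degree 6, diameter 6 and bisection 32, and a ring with diameter 32 and bisection 2. The hypercube buys its short diameter and wide bisection with a degree that grows with log2 N.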

  47. Topology examples
     [Figure: hypercube, grid/mesh and torus topologies, assuming 64 nodes]

  48. Butterfly or Omega Network
     • All paths equal length
     • Unique path from any input to any output
     • Try to avoid conflicts
     • How to make a bigger butterfly network?
     [Figure: 8 x 8 butterfly switch; the figure shows two N/2 butterfly blocks]

  49. Multistage Fat Tree • A multistage fat tree (CM-5) avoids congestion at the root node • Randomly assign packets to different paths on way up to spread the load • Increase degree near root, decrease congestion ACA H.Corporaal

  50. Old (off-chip) MP Networks

     Name        Number     Topology   Bits   Clock     Link (MBytes/s)   Bis. BW (MBytes/s)   Year
     nCube/ten   1-1024     10-cube    1      10 MHz    1.2               640                  1987
     iPSC/2      16-128     7-cube     1      16 MHz    2                 345                  1988
     MP-1216     32-512     2D grid    1      25 MHz    3                 1,300                1989
     Delta       540        2D grid    16     40 MHz    40                640                  1991
     CM-5        32-2048    fat tree   4      40 MHz    20                10,240               1991
     CS-2        32-1024    fat tree   8      70 MHz    50                50,000               1992
     Paragon     4-1024     2D grid    16     100 MHz   200               6,400                1992
     T3D         16-1024    3D Torus   16     150 MHz   300               19,200               1993

     No standard topology! However, for on-chip networks, mesh and torus are in favor.
