
Software Distributed Shared Memory (SDSM): The MultiView Approach


Presentation Transcript


  1. Software Distributed Shared Memory (SDSM): The MultiView Approach
  Agenda:
  • SDSM, false sharing, previous solutions.
  • The MultiView approach.
  • Limitations.
  • Dynamic granularity adaptation.
  • Transparent SDSM.
  • Integrated services by the memory management.
  Ayal Itzkovitz, Assaf Schuster

  2. Types of Parallel Systems
  (Slide chart axes: communication efficiency vs. scalability.)
  1. In-core multi-threading
  2. Multi-core/SMP multi-threading
  3. Tightly-coupled cluster, customized interconnect (SGI’s Altix)
  4. Tightly-coupled cluster, off-the-shelf interconnect (InfiniBand)
  5. WAN, Internet, Grid, peer-to-peer
  Traditionally, 1+2 are programmable using shared memory and 3+4 using message passing; in 5, peer processes communicate with central control only. HDSM: systems in 3 move towards presenting a shared-memory interface to a physically distributed system. What about 4 and 5? Software Distributed Shared Memory = SDSM.

  3. A Multi-Core System - Simplistic View
  [Figure: four cores attached to one local memory.]
  • A parallel program may spawn processes (threads) that work together concurrently, in order to utilize all computing units.
  • Processes communicate by writing to / reading from shared memory, physically located on the local machine.

  4. Virtual Shared Memory
  [Figure: three machines, each a core with its own local memory, connected by a network and presented as one virtual shared memory.]
  A distributed system:
  • Emulation of the same programming paradigm.
  • No changes to source.

  5. Matrix Multiplication, Two Threads
  A = malloc(MATSIZE);
  B = malloc(MATSIZE);
  C = malloc(MATSIZE);
  parfor(n) mult(A, B, C);

  mult(id):
      for (line = N*id .. N*(id+1))
          for (col = 0 .. N)
              C[line,col] = multline(A[line], B[col]);

  A and B are read-only matrices; C is the write matrix.
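For concreteness, here is one way the slide's pseudocode could look as an ordinary shared-memory pthreads program in C; the matrix size N, the thread count NPROC, and the row-major flattening are assumptions made for this sketch.

    #include <pthread.h>
    #include <stdlib.h>

    #define N     1024          /* matrix dimension (assumption)  */
    #define NPROC 2             /* number of worker threads       */

    static double *A, *B, *C;   /* N x N matrices, row-major      */

    /* Each worker multiplies its own band of rows: thread id
     * computes rows [id*N/NPROC .. (id+1)*N/NPROC). A and B are
     * only read; each worker writes a disjoint part of C. */
    static void *mult(void *arg) {
        long id = (long)arg;
        for (int line = id * N / NPROC; line < (id + 1) * N / NPROC; line++)
            for (int col = 0; col < N; col++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[line * N + k] * B[k * N + col];
                C[line * N + col] = sum;
            }
        return NULL;
    }

    int main(void) {
        A = malloc(N * N * sizeof *A);
        B = malloc(N * N * sizeof *B);
        C = malloc(N * N * sizeof *C);
        pthread_t t[NPROC];
        for (long id = 0; id < NPROC; id++)
            pthread_create(&t[id], NULL, mult, (void *)id);
        for (int id = 0; id < NPROC; id++)
            pthread_join(t[id], NULL);
        return 0;
    }

Under an SDSM the same source runs unchanged; the workers become processes on different machines and the three arrays live in the virtual shared memory.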

  6. Matrix Multiplication
  [Figure: A x B = C on two hosts connected by a network; the read-only pages of A and B are sent once, and each host holds the pages of C it writes as RW (per-page protections shown as RO/RW).]

  7. Matrix Multiplication
  [Figure: the same computation in progress; each host reads A and B and writes only its own rows of C, so no page is written by more than one host.]

  8. Matrix Multiplication - False Sharing
  [Figure: A and B are again sent once, but now a page of C contains rows written by different hosts; protections RO/RW/NA shown per page.]

  9. Matrix Multiplication - False Sharing
  [Figure: the falsely shared C page moves to the host that writes it, becoming NA on the other host.]

  10. Matrix Multiplication - False Sharing
  [Figure: the other host writes the same page and pulls it back.]

  11. Matrix Multiplication - False Sharing
  [Figure: reads and writes to the falsely shared page keep it ping-ponging between the hosts.]

  12. N-Body Simulation - False Sharing

  13. N-Body Simulation - False Sharing
  [Figure: the same simulation distributed over a network.]

  14. The False Sharing Problem
  Two variables that happen to reside in the same page can cause false sharing (a minimal illustration follows this slide):
  • Causes ping-pong between processes.
  • Slows down the computation.
  • Can even lead to livelocks.
  Possible solution: the Delta mechanism [Munin, Rice]:
  • A page is not allowed to leave the process for a time interval Delta.
  • Setting this Delta value is difficult (fine tuning required): large Deltas cause delays, short Deltas may not solve the livelock.
  • The best Delta value is application specific, and even page specific.
  SDSM systems do not work well for fine-grain applications, because they use page-based OS protection  But hardware DSMs using cache lines are slow too 
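To make the problem concrete, here is a minimal C illustration (the variable names and the typical 4 KB page size are assumptions): two counters that are never logically shared still trigger coherence traffic, because the page is the unit the SDSM protects and transfers.

    #include <stdio.h>

    /* Two logically unrelated counters that happen to share one page.
     * Under a page-based SDSM, host A's writes to x and host B's
     * writes to y each steal the whole page from the other host. */
    static int x;   /* updated only by the process on host A */
    static int y;   /* updated only by the process on host B */

    int main(void) {
        /* On most platforms both globals fall within the same page. */
        printf("x at %p, y at %p - likely the same 4 KB page\n",
               (void *)&x, (void *)&y);
        return 0;
    }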

  15. The First SDSM System
  • The first software SDSM system: Ivy [Li & Hudak, Yale, ‘86].
  • Page-based SDSM.
  • Provided strict semantics (Lamport’s sequential consistency).
  • The major performance limitation: page size → false sharing.
  Two traditional approaches for designing SDSMs:
  • Weak semantics (relaxed consistencies).
  • Code instrumentation.

  16. First Approach: Weak Semantics
  [Figure: two hosts each hold an RW copy of a page and its twin; diffs are computed against the twins and applied at the manager.]
  Example - Release Consistency:
  • Allow multiple writers to a page (assuming each portion of the page is updated by only one writer).
  • Each page has a twin copy.
  • At synchronization time, all pages perform a “diff” against their twins and send the diffs to managers.
  • Managers hold master copies.
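A minimal sketch of the twin/diff step in C, assuming hypothetical helper names (make_twin, diff) and a word-granularity comparison; real systems run-length encode the diffs:

    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* On the first write fault to a page, keep a pristine copy: the twin. */
    char *make_twin(const char *page) {
        char *twin = malloc(PAGE_SIZE);
        memcpy(twin, page, PAGE_SIZE);
        return twin;
    }

    /* At a release, compare the page with its twin and collect only the
     * modified words as (offset, value) pairs; the manager applies them
     * to the master copy. Returns the number of entries. */
    size_t diff(const int *page, const int *twin, int *offs, int *vals) {
        size_t n = 0, words = PAGE_SIZE / sizeof(int);
        for (size_t i = 0; i < words; i++)
            if (page[i] != twin[i]) {
                offs[n] = (int)i;
                vals[n] = page[i];
                n++;
            }
        return n;
    }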

  17. First Approach: Weak Semantics
  • Allow memory to reside in an inconsistent state for time intervals.
  • Enforce consistency only at synchronization points.
  • Reaching a consistent view of the memory requires computation.
  • Reduces (but does not always eliminate) false sharing.
  • Reduces the number of protocol messages.
  • Weak memory semantics.
  • Involves both memory and processing-time overhead.
  • Still coarse-grain sharing (why diff at locations not touched? )

  18. Software DSM Evolution - Weak Semantics
  Page-grain, relaxed consistency:
  • IVY [Li & Hudak, Yale, ‘86]
  • Munin [‘92, Rice] - release consistency.
  • Midway [‘93, CMU] - entry consistency.
  • TreadMarks [‘94, Rice] - lazy release consistency.
  • Brazos [‘97, Rice] - scope consistency.

  19. Software DSM Evolution - Multithreading
  Page-grain, relaxed consistency:
  • IVY [Li & Hudak, Yale, ‘86]
  • Munin [‘92, Rice] - release consistency.
  • Midway [‘93, CMU] - entry consistency.
  • TreadMarks [‘94, Rice] - lazy release consistency.
  Multithreading:
  • CVM [‘96, Maryland] and Millipede [‘96, Technion] - multi-protocol.
  • Brazos [‘97, Rice] - scope consistency.
  • Quarks [‘98, Utah] - protocol latency hiding.

  20. Second Approach: Code Instrumentation
  Example - binary rewriting: wrap each load and store with instructions that check whether the data is available locally.

  Source:
      line += 3;
      v = v - line;

  Compiled:
      load  r1, ptr[line]
      load  r2, ptr[v]
      add   r1, 3h
      store r1, ptr[line]
      sub   r2, r1
      store r2, ptr[v]

  After code instrumentation:
      push ptr[line]
      call __check_r
      load r1, ptr[line]
      push ptr[v]
      call __check_r
      load r2, ptr[v]
      add  r1, 3h
      push ptr[line]
      call __check_w
      store r1, ptr[line]
      push ptr[line]
      call __done
      sub  r2, r1
      push ptr[v]
      call __check_w
      store r2, ptr[v]
      push ptr[v]
      call __done

  After optimization (both locations checked for write up front):
      push ptr[line]
      call __check_w
      load r1, ptr[line]
      push ptr[v]
      call __check_w
      load r2, ptr[v]
      add  r1, 3h
      store r1, ptr[line]
      push ptr[line]
      call __done
      sub  r2, r1
      store r2, ptr[v]
      push ptr[v]
      call __done
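In C terms, the injected __check_r/__check_w calls amount to something like the following; the block size, state table, and fetch_exclusive() are illustrative assumptions, not the actual code of any of the systems above.

    #include <stddef.h>

    /* Shared data is split into fixed-size blocks, each with a local
     * access state; the check takes the slow path only on a miss. */
    enum state { INVALID, READONLY, WRITABLE };

    #define BLOCK_SHIFT 6                  /* 64-byte blocks (assumed)  */
    extern enum state block_state[];       /* one entry per block       */
    extern char *shared_base;
    extern void fetch_exclusive(size_t b); /* hypothetical miss handler */

    void __check_w(void *addr) {
        size_t b = (size_t)((char *)addr - shared_base) >> BLOCK_SHIFT;
        if (block_state[b] != WRITABLE) {
            fetch_exclusive(b);            /* get an exclusive copy     */
            block_state[b] = WRITABLE;
        }
    }

Because every load and store pays at least the fast-path test, the overhead exists even on a single machine, which is exactly the cost the next slide lists.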

  21. Second Approach: Code Instrumentation
  • Provides fine-grain access control, thus avoids false sharing.
  • Bypasses the page protection mechanism.
  • Usually a fixed granularity for all application data (still, false sharing ).
  • Needs a special compiler or binary-level rewriting tools.
  Cost:
  • High overheads (even on a single machine).
  • Inflated code.
  • Not portable (among architectures).

  22. Software DSM Evolution
  Page-grain, relaxed consistency:
  • IVY [Li & Hudak, Yale, ‘86]
  • Munin [‘92, Rice] - release consistency.
  • Midway [‘93, CMU] - entry consistency.
  • TreadMarks [‘94, Rice] - lazy release consistency.
  • Brazos [‘97, Rice] - scope consistency.
  Fine-grain, code instrumentation:
  • Blizzard [‘94, Wisconsin] - binary instrumentation.
  • Shasta [‘97, Digital WRL] - transparent, works for commercial apps.
  Multithreading:
  • CVM [‘96, Maryland] and Millipede [‘96, Technion] - multi-protocol.
  • Quarks [‘98, Utah] - protocol latency hiding.

  23. The MultiView Approach
  Attack the major limitation in SDSMs - the page size: implement small pages through novel mechanisms [OSDI’99].
  More goals:
  • Without compromising the strict memory consistency [ICS’04, EuroPar’04].
  • Utilize low-latency networks [DSM’00, IPDPS’04].
  • Transparency [EuroPar’03].
  • Adaptive sharing granularity [ICPP’00, IPDPS’01 best paper].
  • Maximize locality through migration and load sharing [DISK’01].
  • Additional “service layers”: garbage collection, data-race detection [ESA’99, JPDC’01, JPDC’02].

  24. The Traditional Memory Layout
  struct a { … };
  struct b;
  int x, y, z;

  main() {
      w = malloc(sizeof(struct a));
      v = malloc(sizeof(struct a));
      u = malloc(sizeof(struct b));
      …
  }
  [Figure: traditional layout - x, y, z, w, v, u packed consecutively into the same pages.]

  25. The MultiView Technique
  [Figure: the same variables x, y, z, w, v, u shown twice - once under the traditional layout and once under MultiView, where each variable is reached through its own view.]

  26. The MultiView Technique
  [Figure: the same layout, with one view set to RW, one to NA, and one to R.]
  • Protection is now set independently per view.
  • Variables reside in the same page but are not shared.

  27. The MultiView Technique
  [Figure: a single memory object holding x, y, z, w, v, u, accessed through View 1, View 2, and View 3.]

  28. The MultiView Technique
  [Figure: the memory layout - one memory object with three views (View 1, View 2, View 3) mapped into the address space.]

  29. The MultiView Technique
  [Figure: Hosts A and B each hold the memory object with Views 1-3; protections differ per host and per view (e.g., a minipage that is RW on Host A is NA on Host B).]

  30. The MultiView Technique
  [Figure: the same two-host view configuration, next step of the animation.]

  31. The Enabling Technology
  • Memory-mapped I/O is meant to be used as an inter-process communication method.
  [Figure: several processes mapping the same shared memory object.]

  32. The Enabling Technology
  • BUT, using multiple memory mappings within a single process can provide the desired functionality.
  [Figure: one process mapping the same shared memory object several times.]

  33. MultiView - Implementation in Millipede
  • Implemented on Windows NT 4.0 (now portable to all of Solaris, BSD, Linux, and NT).
  • Uses CreateFileMapping() and MapViewOfFileEx() for allocating views.
  • Views are constructed at initialization time.
  • The Minipage Table (MPT) provides translation from a pointer address to minipage boundaries.
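A minimal sketch of the enabling trick using the Win32 calls the slide names; the object size and backing the mapping by the system paging file (INVALID_HANDLE_VALUE) are assumptions made for the sketch.

    #include <windows.h>
    #include <stdio.h>

    /* Back one memory object by a file mapping, then map it several
     * times into the SAME process. Each view aliases the same physical
     * pages but carries its own protection. */
    int main(void) {
        SIZE_T size = 1 << 16;   /* one 64 KB memory object (assumed) */

        HANDLE obj = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                       PAGE_READWRITE, 0, (DWORD)size, NULL);

        /* Two views of the same object; NULL lets the system choose the
         * base addresses (Millipede builds its views at init time). */
        char *view1 = MapViewOfFileEx(obj, FILE_MAP_ALL_ACCESS, 0, 0, size, NULL);
        char *view2 = MapViewOfFileEx(obj, FILE_MAP_ALL_ACCESS, 0, 0, size, NULL);

        DWORD old;
        /* Protections are per view: view1 becomes read-only while view2
         * stays read/write, yet both name the same underlying pages. */
        VirtualProtect(view1, size, PAGE_READONLY, &old);

        view2[0] = 42;                /* allowed                        */
        printf("%d\n", view1[0]);     /* reads the same byte: prints 42 */
        /* view1[0] = 7; would fault: this view is read-only            */
        return 0;
    }

The MPT then maps any application pointer back to the boundaries of its minipage, so a fault can be attributed to the right minipage.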

  34. MultiView - Overheads
  • Application: traverse an array of integers, all packed into minipages.
  • The number of minipages is derived from the value of max views per page.
  • Limitations of the experiments:
  • 1.63 GB contiguous address space available.
  • Up to 1664 views.
  •  Need 64 bits!!!

  35. MultiView - Overheads
  [Chart: overhead as a function of the number of views.]
  • As expected, committed (physical) memory is constant.
  • Only a negligible overhead (< 4%), due to TLB misses.

  36. MultiView - Taking It to the Extreme
  [Chart: overhead for 1 MB, 2 MB, 4 MB, and 8 MB arrays.]
  • Beyond critical points the overhead becomes substantial.
  • The number of minipages at the critical points is 128K.
  • Slowdown is due to the L2 cache being exhausted by PTEs.

  37. MultiView - Taking It to the Extreme
  [Chart: the same data, with the SDSM operating range marked below the critical points.]
  • Beyond critical points the overhead becomes substantial.
  • The number of minipages at the critical points is 128K.
  • Slowdown is due to the L2 cache being exhausted by PTEs.

  38. SDSMs on Emerging Fast Networks
  • Fast networking is an emerging technology.
  • MultiView provides only one aspect: reducing message sizes.
  • The next magnitude of improvement shifts from the network layer to the system architectures and protocols that use those networks.
  Challenges:
  • Efficiently employ and integrate fast networks.
  • Provide a “thin” protocol layer: reduce protocol complexity, eliminate buffer copying, use home-based management, etc.

  39. Adding the Privileged View
  [Figure: a memory object with application views (RW/NA/R) plus one privileged view that is always RW.]
  • Constant read/write permissions.
  • Separates application threads from SDSM-injected threads.
  • Atomic updates: DSM threads can access (and update) memory while application threads are prohibited.
  • Direct send/receive: memory-to-memory, no buffer copying.
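Continuing the Win32 sketch above, here is how a privileged view could support an atomic, copy-free update; atomic_update and its parameters are hypothetical, and offsets are assumed page-aligned.

    #include <windows.h>
    #include <string.h>

    /* app_view and priv_view are two views of the same memory object;
     * priv_view keeps constant PAGE_READWRITE protection. */
    void atomic_update(char *app_view, char *priv_view,
                       const char *incoming, size_t off, size_t len) {
        DWORD old;
        /* shut application threads out of the region being updated */
        VirtualProtect(app_view + off, len, PAGE_NOACCESS, &old);
        /* the DSM thread writes the arriving data straight into place
         * through the always-writable view - no intermediate buffer */
        memcpy(priv_view + off, incoming, len);
        /* re-admit the application; it now sees the new contents */
        VirtualProtect(app_view + off, len, PAGE_READWRITE, &old);
    }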

  40. Basic Costs in Millipede (using Myrinet, 1998)
  • Message sizes directly influence latency.
  • The most compute-demanding operation: minipage translation - 7 usec.
  • In relaxed-consistency systems, protocol operations might take hundreds of usecs; example: a run-length diff of a 4 KB page takes 250 usec.

  Access fault                26 usec
  get protection               7 usec
  set protection              12 usec
  Messages (one way):
    header msg                12 usec
    data msg (1/2 KB)         22 usec
    data msg (1 KB)           34 usec
    data msg (4 KB)           90 usec
  MPT translation              7 usec

  41. Minimal Application Modifications
  • Minipages are allocated at malloc time (via a malloc-like API).
  • The memory layout is transparent to the programmer.
  • Allocation routines should be slightly modified, e.g. replacing per-row allocation with one contiguous allocation:

  Before:
      mat = malloc(lines*sizeof(int*));
      for (i = 0; i < N; i++)
          mat[i] = malloc(cols*sizeof(int));
      …
      mat[i][j] = mat[i-1][j] + mat[i][j-1];
      …

  After:
      mat = malloc(lines*cols*sizeof(int));
      …
      mat[i][j] = mat[i-1][j] + mat[i][j-1];
      …

  • SOR and LU were not modified at all.
  • WATER: changed ~20 lines out of 783.
  • IS: changed 5 lines out of 93.
  • TSP: changed ~15 lines out of ~400.

  42. Tradeoff: False Sharing vs. Aggregation (e.g. WATER)

  43. Performance with Fixed Granularity - NBodyW application on 8 nodes

  44. Dynamic Sharing Granularity
  • Use the best sharing granularity, as determined by application requirements (not architecture constants!).
  • Dynamically adapt the sharing granularity to the changing behavior of the application.
  • Adaptation is done transparently by the SDSM.
  • No special hardware support, compiler modifications, or application code changes.

  45. Dynamic Sharing Granularity
  [Chart: sharing granularity of the shared data elements changing over the application's run time.]

  46. Coarse Granularity
  [Figure: Host 1 performs a memory access spanning minipages 1-6 and sends Request(1-6) to the manager; the manager forwards requests to Host 2 and Host 3, which reply with Data 2,4,5 and Data 1,3 respectively.]

  47. Automatic Adaptation of Granularity
  [Figure: minipages 1-6 moving between coarse-granularity and fine-granularity units on Hosts A and B.]
  • Split - when different hosts update different minipages.
  • Recompose - when the same host accesses consecutive minipages.
  A sketch of these two rules follows.
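A minimal C sketch of the two adaptation rules; the per-minipage bookkeeping, field names, and functions are assumptions for illustration, not Millipede's actual data structures.

    #include <stdbool.h>

    /* Each minipage records which host last faulted on it; consecutive
     * minipages may be fetched together as one coarse unit. */
    struct minipage {
        int  last_host;   /* host that last faulted on this minipage */
        bool grouped;     /* currently transferred with its neighbor */
    };

    /* Split: different hosts are updating different minipages of one
     * coarse unit - keeping them together would ping-pong the unit. */
    bool should_split(const struct minipage *a, const struct minipage *b) {
        return a->grouped && b->grouped && a->last_host != b->last_host;
    }

    /* Recompose: the same host keeps faulting on consecutive
     * minipages - fetching them as one unit aggregates the messages. */
    bool should_recompose(const struct minipage *a, const struct minipage *b) {
        return !a->grouped && !b->grouped && a->last_host == b->last_host;
    }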

  48. Millipede Memory Faults Reduction - Barnes application

  49. Water-nsq Performance

  50. Water-nsq Performance (cont’d)
