
Software Distributed Shared Memory (SDSM): The MultiView Approach


Presentation Transcript


  1. Software Distributed Shared Memory (SDSM): The MultiView Approach
  Agenda:
  • SDSM, false sharing, previous solutions.
  • The MultiView approach.
  • Limitations.
  • Dynamic granularity adaptation.
  • Transparent SDSM.
  • Integrated services by the memory management.
  Ayal Itzkovitz, Assaf Schuster

  2. Types of Parallel Systems
  (Slide chart axes: communication efficiency vs. scalability.)
  1. In-core multi-threading
  2. Multi-core/SMP multi-threading
  3. Tightly-coupled cluster, customized interconnect (SGI’s Altix)
  4. Tightly-coupled cluster, off-the-shelf interconnect (InfiniBand)
  5. WAN, Internet, Grid, peer-to-peer
  Traditionally, 1+2 are programmable using shared memory and 3+4 using message passing; in 5, peer processes communicate with central control only. HDSM: systems in 3 move towards presenting a shared-memory interface to a physically distributed system. What about 4 and 5? Software Distributed Shared Memory = SDSM.

  3. A Multi-Core System - Simplistic View
  [Figure: four cores attached to one local memory.]
  • A parallel program may spawn processes (threads) that work together concurrently, in order to utilize all computing units.
  • Processes communicate by writing to / reading from shared memory, physically located on the local machine.

  4. Virtual Shared Memory
  [Figure: three machines, each a core with its own local memory, connected by a network and presented as one virtual shared memory.]
  A distributed system:
  • Emulation of the same programming paradigm.
  • No changes to source.

  5. Matrix Multiplication, Two Threads
  A = malloc(MATSIZE);
  B = malloc(MATSIZE);
  C = malloc(MATSIZE);
  parfor(n) mult(A, B, C);

  mult(id):
      for (line = N*id .. N*(id+1))
          for (col = 0 .. N)
              C[line,col] = multline(A[line], B[col]);

  A and B are read-only matrices; C is the write matrix.
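For concreteness, here is one way the slide's pseudocode could look as an ordinary shared-memory pthreads program in C; the matrix size N, the thread count NPROC, and the row-major flattening are assumptions made for this sketch.

    #include <pthread.h>
    #include <stdlib.h>

    #define N     1024          /* matrix dimension (assumption)  */
    #define NPROC 2             /* number of worker threads       */

    static double *A, *B, *C;   /* N x N matrices, row-major      */

    /* Each worker multiplies its own band of rows: thread id
     * computes rows [id*N/NPROC .. (id+1)*N/NPROC). A and B are
     * only read; each worker writes a disjoint part of C. */
    static void *mult(void *arg) {
        long id = (long)arg;
        for (int line = id * N / NPROC; line < (id + 1) * N / NPROC; line++)
            for (int col = 0; col < N; col++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[line * N + k] * B[k * N + col];
                C[line * N + col] = sum;
            }
        return NULL;
    }

    int main(void) {
        A = malloc(N * N * sizeof *A);
        B = malloc(N * N * sizeof *B);
        C = malloc(N * N * sizeof *C);
        pthread_t t[NPROC];
        for (long id = 0; id < NPROC; id++)
            pthread_create(&t[id], NULL, mult, (void *)id);
        for (int id = 0; id < NPROC; id++)
            pthread_join(t[id], NULL);
        return 0;
    }

Under an SDSM the same source runs unchanged; the workers become processes on different machines and the three arrays live in the virtual shared memory.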

  6. Matrix Multiplication
  [Figure: A x B = C on two hosts connected by a network; the read-only pages of A and B are sent once, and each host holds the pages of C it writes as RW (per-page protections shown as RO/RW).]

  7. Matrix Multiplication
  [Figure: the same computation in progress; each host reads A and B and writes only its own rows of C, so no page is written by more than one host.]

  8. Matrix Multiplication - False Sharing
  [Figure: A and B are again sent once, but now a page of C contains rows written by different hosts; protections RO/RW/NA shown per page.]

  9. Matrix Multiplication - False Sharing
  [Figure: the falsely shared C page moves to the host that writes it, becoming NA on the other host.]

  10. Matrix Multiplication - False Sharing
  [Figure: the other host writes the same page and pulls it back.]

  11. Matrix Multiplication - False Sharing
  [Figure: reads and writes to the falsely shared page keep it ping-ponging between the hosts.]

  12. N-Body Simulation - False Sharing

  13. N-Body Simulation - False Sharing
  [Figure: the same simulation distributed over a network.]

  14. The False Sharing Problem
  Two variables that happen to reside in the same page can cause false sharing (a minimal illustration follows this slide):
  • Causes ping-pong between processes.
  • Slows down the computation.
  • Can even lead to livelocks.
  Possible solution: the Delta mechanism [Munin, Rice]:
  • A page is not allowed to leave the process for a time interval Delta.
  • Setting this Delta value is difficult (fine tuning required): large Deltas cause delays, short Deltas may not solve the livelock.
  • The best Delta value is application specific, and even page specific.
  SDSM systems do not work well for fine-grain applications, because they use page-based OS protection  But hardware DSMs using cache lines are slow too 
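To make the problem concrete, here is a minimal C illustration (the variable names and the typical 4 KB page size are assumptions): two counters that are never logically shared still trigger coherence traffic, because the page is the unit the SDSM protects and transfers.

    #include <stdio.h>

    /* Two logically unrelated counters that happen to share one page.
     * Under a page-based SDSM, host A's writes to x and host B's
     * writes to y each steal the whole page from the other host. */
    static int x;   /* updated only by the process on host A */
    static int y;   /* updated only by the process on host B */

    int main(void) {
        /* On most platforms both globals fall within the same page. */
        printf("x at %p, y at %p - likely the same 4 KB page\n",
               (void *)&x, (void *)&y);
        return 0;
    }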

  15. The First SDSM System
  • The first software SDSM system: Ivy [Li & Hudak, Yale, ‘86].
  • Page-based SDSM.
  • Provided strict semantics (Lamport’s sequential consistency).
  • The major performance limitation: page size → false sharing.
  Two traditional approaches for designing SDSMs:
  • Weak semantics (relaxed consistencies).
  • Code instrumentation.

  16. First Approach: Weak Semantics
  [Figure: two hosts each hold an RW copy of a page and its twin; diffs are computed against the twins and applied at the manager.]
  Example - Release Consistency:
  • Allow multiple writers to a page (assuming each portion of the page is updated by only one writer).
  • Each page has a twin copy.
  • At synchronization time, all pages perform a “diff” against their twins and send the diffs to managers.
  • Managers hold master copies.
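A minimal sketch of the twin/diff step in C, assuming hypothetical helper names (make_twin, diff) and a word-granularity comparison; real systems run-length encode the diffs:

    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* On the first write fault to a page, keep a pristine copy: the twin. */
    char *make_twin(const char *page) {
        char *twin = malloc(PAGE_SIZE);
        memcpy(twin, page, PAGE_SIZE);
        return twin;
    }

    /* At a release, compare the page with its twin and collect only the
     * modified words as (offset, value) pairs; the manager applies them
     * to the master copy. Returns the number of entries. */
    size_t diff(const int *page, const int *twin, int *offs, int *vals) {
        size_t n = 0, words = PAGE_SIZE / sizeof(int);
        for (size_t i = 0; i < words; i++)
            if (page[i] != twin[i]) {
                offs[n] = (int)i;
                vals[n] = page[i];
                n++;
            }
        return n;
    }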

  17. First Approach: Weak Semantics
  • Allow memory to reside in an inconsistent state for time intervals.
  • Enforce consistency only at synchronization points.
  • Reaching a consistent view of the memory requires computation.
  • Reduces (but does not always eliminate) false sharing.
  • Reduces the number of protocol messages.
  • Weak memory semantics.
  • Involves both memory and processing-time overhead.
  • Still coarse-grain sharing (why diff at locations not touched? )

  18. Software DSM Evolution - Weak Semantics
  Page-grain, relaxed consistency:
  • IVY [Li & Hudak, Yale, ‘86]
  • Munin [‘92, Rice] - release consistency.
  • Midway [‘93, CMU] - entry consistency.
  • TreadMarks [‘94, Rice] - lazy release consistency.
  • Brazos [‘97, Rice] - scope consistency.

  19. Software DSM Evolution - Multithreading
  Page-grain, relaxed consistency:
  • IVY [Li & Hudak, Yale, ‘86]
  • Munin [‘92, Rice] - release consistency.
  • Midway [‘93, CMU] - entry consistency.
  • TreadMarks [‘94, Rice] - lazy release consistency.
  Multithreading:
  • CVM [‘96, Maryland] and Millipede [‘96, Technion] - multi-protocol.
  • Brazos [‘97, Rice] - scope consistency.
  • Quarks [‘98, Utah] - protocol latency hiding.

  20. Second Approach: Code Instrumentation
  Example - binary rewriting: wrap each load and store with instructions that check whether the data is available locally.

  Source:
      line += 3;
      v = v - line;

  Compiled:
      load  r1, ptr[line]
      load  r2, ptr[v]
      add   r1, 3h
      store r1, ptr[line]
      sub   r2, r1
      store r2, ptr[v]

  After code instrumentation:
      push ptr[line]
      call __check_r
      load r1, ptr[line]
      push ptr[v]
      call __check_r
      load r2, ptr[v]
      add  r1, 3h
      push ptr[line]
      call __check_w
      store r1, ptr[line]
      push ptr[line]
      call __done
      sub  r2, r1
      push ptr[v]
      call __check_w
      store r2, ptr[v]
      push ptr[v]
      call __done

  After optimization (both locations checked for write up front):
      push ptr[line]
      call __check_w
      load r1, ptr[line]
      push ptr[v]
      call __check_w
      load r2, ptr[v]
      add  r1, 3h
      store r1, ptr[line]
      push ptr[line]
      call __done
      sub  r2, r1
      store r2, ptr[v]
      push ptr[v]
      call __done
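In C terms, the injected __check_r/__check_w calls amount to something like the following; the block size, state table, and fetch_exclusive() are illustrative assumptions, not the actual code of any of the systems above.

    #include <stddef.h>

    /* Shared data is split into fixed-size blocks, each with a local
     * access state; the check takes the slow path only on a miss. */
    enum state { INVALID, READONLY, WRITABLE };

    #define BLOCK_SHIFT 6                  /* 64-byte blocks (assumed)  */
    extern enum state block_state[];       /* one entry per block       */
    extern char *shared_base;
    extern void fetch_exclusive(size_t b); /* hypothetical miss handler */

    void __check_w(void *addr) {
        size_t b = (size_t)((char *)addr - shared_base) >> BLOCK_SHIFT;
        if (block_state[b] != WRITABLE) {
            fetch_exclusive(b);            /* get an exclusive copy     */
            block_state[b] = WRITABLE;
        }
    }

Because every load and store pays at least the fast-path test, the overhead exists even on a single machine, which is exactly the cost the next slide lists.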

  21. Second Approach: Code Instrumentation
  • Provides fine-grain access control, thus avoids false sharing.
  • Bypasses the page protection mechanism.
  • Usually a fixed granularity for all application data (still, false sharing ).
  • Needs a special compiler or binary-level rewriting tools.
  Cost:
  • High overheads (even on a single machine).
  • Inflated code.
  • Not portable (among architectures).

  22. Software DSM Evolution
  Page-grain, relaxed consistency:
  • IVY [Li & Hudak, Yale, ‘86]
  • Munin [‘92, Rice] - release consistency.
  • Midway [‘93, CMU] - entry consistency.
  • TreadMarks [‘94, Rice] - lazy release consistency.
  • Brazos [‘97, Rice] - scope consistency.
  Fine-grain, code instrumentation:
  • Blizzard [‘94, Wisconsin] - binary instrumentation.
  • Shasta [‘97, Digital WRL] - transparent, works for commercial apps.
  Multithreading:
  • CVM [‘96, Maryland] and Millipede [‘96, Technion] - multi-protocol.
  • Quarks [‘98, Utah] - protocol latency hiding.

  23. The MultiView Approach
  Attack the major limitation in SDSMs - the page size: implement small pages through novel mechanisms [OSDI’99].
  More goals:
  • Without compromising the strict memory consistency [ICS’04, EuroPar’04].
  • Utilize low-latency networks [DSM’00, IPDPS’04].
  • Transparency [EuroPar’03].
  • Adaptive sharing granularity [ICPP’00, IPDPS’01 best paper].
  • Maximize locality through migration and load sharing [DISK’01].
  • Additional “service layers”: garbage collection, data-race detection [ESA’99, JPDC’01, JPDC’02].

  24. The Traditional Memory Layout
  struct a { … };
  struct b;
  int x, y, z;

  main() {
      w = malloc(sizeof(struct a));
      v = malloc(sizeof(struct a));
      u = malloc(sizeof(struct b));
      …
  }
  [Figure: traditional layout - x, y, z, w, v, u packed consecutively into the same pages.]

  25. The MultiView Technique
  [Figure: the same variables x, y, z, w, v, u shown twice - once under the traditional layout and once under MultiView, where each variable is reached through its own view.]

  26. The MultiView Technique
  [Figure: the same layout, with one view set to RW, one to NA, and one to R.]
  • Protection is now set independently per view.
  • Variables reside in the same page but are not shared.

  27. The MultiView Technique
  [Figure: a single memory object holding x, y, z, w, v, u, accessed through View 1, View 2, and View 3.]

  28. The MultiView Technique
  [Figure: the memory layout - one memory object with three views (View 1, View 2, View 3) mapped into the address space.]

  29. The MultiView Technique
  [Figure: Hosts A and B each hold the memory object with Views 1-3; protections differ per host and per view (e.g., a minipage that is RW on Host A is NA on Host B).]

  30. The MultiView Technique
  [Figure: the same two-host view configuration, next step of the animation.]

  31. The Enabling Technology
  • Memory-mapped I/O is meant to be used as an inter-process communication method.
  [Figure: several processes mapping the same shared memory object.]

  32. The Enabling Technology
  • BUT, using multiple memory mappings within a single process can provide the desired functionality.
  [Figure: one process mapping the same shared memory object several times.]

  33. MultiView - Implementation in Millipede
  • Implemented on Windows NT 4.0 (now portable to all of Solaris, BSD, Linux, and NT).
  • Uses CreateFileMapping() and MapViewOfFileEx() for allocating views.
  • Views are constructed at initialization time.
  • The Minipage Table (MPT) provides translation from a pointer address to minipage boundaries.
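A minimal sketch of the enabling trick using the Win32 calls the slide names; the object size and backing the mapping by the system paging file (INVALID_HANDLE_VALUE) are assumptions made for the sketch.

    #include <windows.h>
    #include <stdio.h>

    /* Back one memory object by a file mapping, then map it several
     * times into the SAME process. Each view aliases the same physical
     * pages but carries its own protection. */
    int main(void) {
        SIZE_T size = 1 << 16;   /* one 64 KB memory object (assumed) */

        HANDLE obj = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                       PAGE_READWRITE, 0, (DWORD)size, NULL);

        /* Two views of the same object; NULL lets the system choose the
         * base addresses (Millipede builds its views at init time). */
        char *view1 = MapViewOfFileEx(obj, FILE_MAP_ALL_ACCESS, 0, 0, size, NULL);
        char *view2 = MapViewOfFileEx(obj, FILE_MAP_ALL_ACCESS, 0, 0, size, NULL);

        DWORD old;
        /* Protections are per view: view1 becomes read-only while view2
         * stays read/write, yet both name the same underlying pages. */
        VirtualProtect(view1, size, PAGE_READONLY, &old);

        view2[0] = 42;                /* allowed                        */
        printf("%d\n", view1[0]);     /* reads the same byte: prints 42 */
        /* view1[0] = 7; would fault: this view is read-only            */
        return 0;
    }

The MPT then maps any application pointer back to the boundaries of its minipage, so a fault can be attributed to the right minipage.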

  34. MultiView - Overheads
  • Application: traverse an array of integers, all packed into minipages.
  • The number of minipages is derived from the value of max views per page.
  • Limitations of the experiments:
  • 1.63 GB contiguous address space available.
  • Up to 1664 views.
  •  Need 64 bits!!!

  35. MultiView - Overheads
  [Chart: overhead as a function of the number of views.]
  • As expected, committed (physical) memory is constant.
  • Only a negligible overhead (< 4%), due to TLB misses.

  36. MultiView - Taking It to the Extreme
  [Chart: overhead for 1 MB, 2 MB, 4 MB, and 8 MB arrays.]
  • Beyond critical points the overhead becomes substantial.
  • The number of minipages at the critical points is 128K.
  • Slowdown is due to the L2 cache being exhausted by PTEs.

  37. MultiView - Taking It to the Extreme
  [Chart: the same data, with the SDSM operating range marked below the critical points.]
  • Beyond critical points the overhead becomes substantial.
  • The number of minipages at the critical points is 128K.
  • Slowdown is due to the L2 cache being exhausted by PTEs.

  38. SDSMs on Emerging Fast Networks
  • Fast networking is an emerging technology.
  • MultiView provides only one aspect: reducing message sizes.
  • The next magnitude of improvement shifts from the network layer to the system architectures and protocols that use those networks.
  Challenges:
  • Efficiently employ and integrate fast networks.
  • Provide a “thin” protocol layer: reduce protocol complexity, eliminate buffer copying, use home-based management, etc.

  39. Adding the Privileged View
  [Figure: a memory object with application views (RW/NA/R) plus one privileged view that is always RW.]
  • Constant read/write permissions.
  • Separates application threads from SDSM-injected threads.
  • Atomic updates: DSM threads can access (and update) memory while application threads are prohibited.
  • Direct send/receive: memory-to-memory, no buffer copying.
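Continuing the Win32 sketch above, here is how a privileged view could support an atomic, copy-free update; atomic_update and its parameters are hypothetical, and offsets are assumed page-aligned.

    #include <windows.h>
    #include <string.h>

    /* app_view and priv_view are two views of the same memory object;
     * priv_view keeps constant PAGE_READWRITE protection. */
    void atomic_update(char *app_view, char *priv_view,
                       const char *incoming, size_t off, size_t len) {
        DWORD old;
        /* shut application threads out of the region being updated */
        VirtualProtect(app_view + off, len, PAGE_NOACCESS, &old);
        /* the DSM thread writes the arriving data straight into place
         * through the always-writable view - no intermediate buffer */
        memcpy(priv_view + off, incoming, len);
        /* re-admit the application; it now sees the new contents */
        VirtualProtect(app_view + off, len, PAGE_READWRITE, &old);
    }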

  40. Basic Costs in Millipede (using Myrinet, 1998)
  • Message sizes directly influence latency.
  • The most compute-demanding operation: minipage translation - 7 usec.
  • In relaxed-consistency systems, protocol operations might take hundreds of usecs; example: a run-length diff of a 4 KB page takes 250 usec.

  Access fault                26 usec
  get protection               7 usec
  set protection              12 usec
  Messages (one way):
    header msg                12 usec
    data msg (1/2 KB)         22 usec
    data msg (1 KB)           34 usec
    data msg (4 KB)           90 usec
  MPT translation              7 usec

  41. Minimal Application Modifications
  • Minipages are allocated at malloc time (via a malloc-like API).
  • The memory layout is transparent to the programmer.
  • Allocation routines should be slightly modified, e.g. replacing per-row allocation with one contiguous allocation:

  Before:
      mat = malloc(lines*sizeof(int*));
      for (i = 0; i < N; i++)
          mat[i] = malloc(cols*sizeof(int));
      …
      mat[i][j] = mat[i-1][j] + mat[i][j-1];
      …

  After:
      mat = malloc(lines*cols*sizeof(int));
      …
      mat[i][j] = mat[i-1][j] + mat[i][j-1];
      …

  • SOR and LU were not modified at all.
  • WATER: changed ~20 lines out of 783.
  • IS: changed 5 lines out of 93.
  • TSP: changed ~15 lines out of ~400.

  42. Tradeoff: False Sharing vs. Aggregation (e.g. WATER)

  43. Performance with Fixed Granularity - NBodyW application on 8 nodes

  44. Dynamic Sharing Granularity
  • Use the best sharing granularity, as determined by application requirements (not architecture constants!).
  • Dynamically adapt the sharing granularity to the changing behavior of the application.
  • Adaptation is done transparently by the SDSM.
  • No special hardware support, compiler modifications, or application code changes.

  45. Dynamic Sharing Granularity
  [Chart: sharing granularity of the shared data elements changing over the application's run time.]

  46. Coarse Granularity
  [Figure: Host 1 performs a memory access spanning minipages 1-6 and sends Request(1-6) to the manager; the manager forwards requests to Host 2 and Host 3, which reply with Data 2,4,5 and Data 1,3 respectively.]

  47. Automatic Adaptation of Granularity
  [Figure: minipages 1-6 moving between coarse-granularity and fine-granularity units on Hosts A and B.]
  • Split - when different hosts update different minipages.
  • Recompose - when the same host accesses consecutive minipages.
  A sketch of these two rules follows.
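A minimal C sketch of the two adaptation rules; the per-minipage bookkeeping, field names, and functions are assumptions for illustration, not Millipede's actual data structures.

    #include <stdbool.h>

    /* Each minipage records which host last faulted on it; consecutive
     * minipages may be fetched together as one coarse unit. */
    struct minipage {
        int  last_host;   /* host that last faulted on this minipage */
        bool grouped;     /* currently transferred with its neighbor */
    };

    /* Split: different hosts are updating different minipages of one
     * coarse unit - keeping them together would ping-pong the unit. */
    bool should_split(const struct minipage *a, const struct minipage *b) {
        return a->grouped && b->grouped && a->last_host != b->last_host;
    }

    /* Recompose: the same host keeps faulting on consecutive
     * minipages - fetching them as one unit aggregates the messages. */
    bool should_recompose(const struct minipage *a, const struct minipage *b) {
        return !a->grouped && !b->grouped && a->last_host == b->last_host;
    }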

  48. Millipede Memory Faults Reduction - Barnes application

  49. Water-nsq Performance

  50. Water-nsq Performance (cont’d)
