
LOTS: A Software DSM Supporting Large Object Space

Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau. Department of Computer Science, The University of Hong Kong. September 2004.


Presentation Transcript


  1. LOTS: A Software DSM Supporting Large Object Space Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau Department of Computer Science The University of Hong Kong September, 2004

  2. Presentation Outline • Why LOTS? (Objectives) • DSM Background and Related Work • Design of LOTS • Performance Testing and Results • Conclusion and Future Work

  3. The Problem in Current DSM • Lack of shared object (memory) space • Another major problem apart from performance • Fixed address mapping in virtual memory • Shared object space size < process space • TreadMarks: ~ min RAM size among all machines • JIAJIA V1.0: 128 MB • 32-bit machines → max 4 GB shared space • Unscalable: Fixed regardless of # machines • Large problems (with > 4 GB shared memory need) can't be run directly → The programmer needs to change the application code to reduce memory utilization.

  4. Objectives of LOTS • Using 64-bit machines is not a total solution! • 32-bit machines dominate the market (poor man's clusters :<) • Problems keep increasing their memory consumption [Figure: a rich man's cluster and a poor man's cluster]

  5. Objectives of LOTS • Hence we introduce LOTS: • Large Shared Object Space > 4 GB • Dynamic run-time memory mapping technique • Local disk as the backing store for temporarily unused objects • Shared space size now limited by disk space • Lazy disk read/write → reasonable performance

  6. Some DSM Background • Memory Consistency Issues: • Memory Consistency Models • Sequential Consistency (IVY) performs poorly • Relaxed models reduce redundant data traffic • Lazy Release Consistency (TreadMarks) • Scope Consistency (JIAJIA) [Figure: P executes Y=5, Acq(L), X=3, Rel(L); Q then executes Acq(L), X=?, Y=?, Rel(L). In Scope, Q sees X to be 3, but Y may not be 5.]
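
The P/Q example on this slide can be mimicked with a toy model. The sketch below is only an illustration under simplifying assumptions: `Lock`, `Process` and `scope_demo` are hypothetical names, each process keeps a private view of the shared variables, and only writes made while a lock is held travel with that lock. It is not how JIAJIA or LOTS implement Scope Consistency. In this toy model Y is always stale at Q, whereas the slide only says Y may not be 5.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy model (hypothetical names): writes made inside a lock's scope are
// attached to that lock and become visible to the next acquirer.
struct Lock {
    std::map<std::string, int> updates;  // writes made while this lock was held
};

class Process {
public:
    void acquire(Lock& l) {
        // On acquire, pick up the writes from earlier scopes of this lock.
        for (const auto& kv : l.updates) view_[kv.first] = kv.second;
        held_ = &l;
    }
    void release() { held_ = nullptr; }

    void write(const std::string& var, int value) {
        view_[var] = value;
        // Only writes made while holding the lock are tied to its scope.
        if (held_) held_->updates[var] = value;
    }
    int read(const std::string& var) const {
        auto it = view_.find(var);
        return it == view_.end() ? 0 : it->second;  // 0 is the initial value
    }

private:
    std::map<std::string, int> view_;  // this process's private view
    Lock* held_ = nullptr;
};

struct Observed { int x; int y; };

// Replays the slide's sequence and returns what Q observes.
Observed scope_demo() {
    Lock L;
    Process p, q;
    p.write("Y", 5);   // outside any scope
    p.acquire(L);
    p.write("X", 3);   // inside the scope of L
    p.release();
    q.acquire(L);
    Observed seen{q.read("X"), q.read("Y")};
    q.release();
    return seen;
}
```

Calling `scope_demo()` yields X equal to 3 at Q, while Y still holds its initial value because the write to Y happened outside the scope of L.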

  7. Some DSM Background • More Memory Consistency Issues: • Coherence Protocols • Home-based (JIAJIA) vs Homeless (TreadMarks) vs Migrating-Home (JUMP) • Write-update vs Write-invalidate • Adaptive Protocol (DOSA, ADSM) • The coherence protocol has to match the memory model for higher efficiency • No DSM deals with Large Object Space!

  8. Related Work • Large object space support: • Pointer swizzling • Artificial, invalid addresses are translated to machine-addressable form during access • Used in persistent stores (QuickStore, Thor-1) [Figure: process space. Compiler-generated addresses cause a page fault at runtime and are translated to valid ones. Unused objects free their virtual addresses and are swapped out (i.e., swizzled out) to hard disk.]

  9. Design of LOTS • Dynamic Memory Mapping (DMM) • Uses C++ Operator Overloading as the interface • Overloads [], +, -, *, /, ++, --, >=, <=, !=, etc. • Purely runtime [Figure: the program statement A[5]=7; on Array A. (1) Access invokes mapping mechanism. (2) Bring in object from disk / network. (3) Internal structure points to object data for access. Labels: Program, Heap Area (A->ctrl), Data (DMM) Area in Virtual Memory Area, Local Hard Disk, Network, Remote Memory.]
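
The three steps on this slide can be sketched with an overloaded `operator[]`. This is a minimal illustration, not the LOTS implementation: `SharedArray`, `BackingStore`, `swap_out` and `dmm_demo` are hypothetical names, and an in-memory `std::map` stands in for the local hard disk.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for the local disk: object id -> saved contents.
using BackingStore = std::map<int, std::vector<int>>;

// Hypothetical handle to one shared array (not the LOTS Pointer class).
class SharedArray {
public:
    SharedArray(int id, std::size_t n, BackingStore& disk)
        : id_(id), size_(n), disk_(disk) {
        disk_[id_] = std::vector<int>(n, 0);  // the object starts on "disk"
    }

    // (1) Every access goes through the overloaded operator, which
    // invokes the mapping step before handing out a reference.
    int& operator[](std::size_t i) {
        assert(i < size_);
        map_in();
        return data_[i];
    }

    // Write the object back and drop its in-memory copy.
    void swap_out() {
        if (!mapped_) return;
        disk_[id_] = data_;
        data_.clear();
        data_.shrink_to_fit();
        mapped_ = false;
    }

    bool mapped() const { return mapped_; }
    int loads() const { return loads_; }

private:
    // (2) Bring the object in from the backing store if it is not mapped.
    // (3) data_ then plays the role of the control structure's data pointer.
    void map_in() {
        if (mapped_) return;
        data_ = disk_[id_];
        mapped_ = true;
        ++loads_;
    }

    int id_;
    std::size_t size_;
    BackingStore& disk_;
    std::vector<int> data_;
    bool mapped_ = false;
    int loads_ = 0;
};

// Returns the value read back after a swap-out / map-in round trip.
int dmm_demo() {
    BackingStore disk;
    SharedArray a(1, 10, disk);
    a[5] = 7;       // first access maps the object in
    a.swap_out();   // the object leaves memory
    return a[5];    // this access maps it in again
}
```

The program only ever writes `a[5] = 7`; whether the data is currently in memory is decided inside the operator, which is the point of using overloading as the interface.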

  10. LOTS Shared Objects Creation • Through the LOTS memory allocator • Exists as a C++ class • Memory allocation through alloc() function • Put data into specific part of process space • Object control info in heap area [Figure: process space layout, labels from high to low addresses: 0xffffffff, Kernel Reserved, C++ Stack, 0xc0000000, 0xb0000000, DSM Control Area, 0x90000000, Twin Area, 0x70000000, DMM Area, 0x50000000, Heap Area, DAT Segment, TXT Segment, 0x00000000. Array A and its control structure A->Ctrl are marked in the layout.]

  11. LOTS Memory Allocator • Bypasses Doug Lea's Memory Allocator used in original C/C++ • Uses mmap() to get physical memory, and maps the shared object data to the process space • Free queues and used queues • Small & large objects allocated separately [Figure: the DMM area (0x50000000 to 0x70000000) between the Heap Area and the Twin and Control Area, holding used blocks and free blocks. A free queue and a used queue are each indexed by block size: 8, 16, 24, 32, 40, 48, …, 1M, 2M, …, ½G.]
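
A free-queue / used-queue allocator over an mmap()-ed region might look like the sketch below. Everything here is an assumption made for illustration: `ArenaAllocator` and its methods are hypothetical, size classes are simply multiples of 8 bytes, the region is placed wherever the kernel chooses rather than at the fixed DMM addresses, and the separate handling of small and large objects is left out. The calls to `mmap` and `munmap` are the standard POSIX ones.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>
#include <sys/mman.h>

// Hypothetical allocator sketch; names and policy are assumptions.
class ArenaAllocator {
public:
    explicit ArenaAllocator(std::size_t bytes) {
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p != MAP_FAILED) {          // on failure the arena stays empty
            base_ = static_cast<char*>(p);
            capacity_ = bytes;
        }
    }
    ~ArenaAllocator() {
        if (base_) munmap(base_, capacity_);
    }
    ArenaAllocator(const ArenaAllocator&) = delete;
    ArenaAllocator& operator=(const ArenaAllocator&) = delete;

    // Round a request up to its size class (a multiple of 8 bytes).
    static std::size_t size_class(std::size_t n) {
        return n == 0 ? 8 : (n + 7) / 8 * 8;
    }

    void* alloc(std::size_t n) {
        std::size_t cls = size_class(n);
        std::vector<char*>& fq = free_[cls];
        char* block = nullptr;
        if (!fq.empty()) {              // reuse a block from the free queue
            block = fq.back();
            fq.pop_back();
        } else {                        // otherwise carve a fresh block
            if (next_ + cls > capacity_) return nullptr;
            block = base_ + next_;
            next_ += cls;
        }
        used_[block] = cls;             // record it in the used queue
        return block;
    }

    void release(void* p) {
        auto it = used_.find(static_cast<char*>(p));
        if (it == used_.end()) return;  // not one of our blocks
        free_[it->second].push_back(it->first);
        used_.erase(it);
    }

    std::size_t used_blocks() const { return used_.size(); }

private:
    char* base_ = nullptr;
    std::size_t capacity_ = 0;
    std::size_t next_ = 0;                             // bump pointer
    std::map<std::size_t, std::vector<char*>> free_;   // size class -> free blocks
    std::map<char*, std::size_t> used_;                // block -> size class
};
```

Keeping a free queue per size class means a released block can be handed out again for any later request of the same class without searching the region.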

  12. Shared Memory Behavior • Goal: Reduce redundant data traffic • Memory Consistency Model: Scope • Memory Coherence Protocol: Mixed • Lock-synchronized objects : Homeless + write-update • Barrier-synchronized objects : Migrating-Home + write-invalidate • Principle: To eliminate as much all-to-all data communication as possible

  13. Mixed Coherence Protocol • An Example: [Figure: timeline of processes P0 to P3 sharing objects X and Y, with the home of X and Y marked. Inside Acq(L1)/Rel(L1) and Acq(L2)/Rel(L2) scopes the processes perform x1++, x2++ and y1++; values shown include x1=1, x2=3, y1=5 and x2=4. Arrows show updates movement and home token movement. At the barrier a new home is marked, Inv X, Y is sent to the other processes, and the final values are x1 = 2, x2 = 4.] When the processes arrive at the barrier, the process that holds the token of the object becomes the new home of that object, and the other processes send their updates to the home.

  14. Making LOTS More Efficient • Eliminating Diff Accumulation Problem • Lock and timestamp info in DSM control area • Calculate diff on request, no redundancy
      Example: eight values X1 to X8 updated over four time steps.
      Traditional Method (one diff per time step): T=1 (len=6): X1 X2 X3 X6 X7 X8; T=2 (len=4): X1 X3 X5 X8; T=3 (len=4): X2 X5 X7 X8; T=4 (len=3): X3 X5 X7. All updates above need to be sent (17 units of data + 8 units of control data).
      Last Updated Time per Value: X1=2, X2=3, X3=4, X4=0, X5=4, X6=1, X7=4, X8=3.
      LOTS Method (grouped by time): Time 1 (length 1): X6; Time 2 (length 1): X1; Time 3 (length 2): X2 X8; Time 4 (length 3): X3 X5 X7. Only 7 units of data + 8 units of control data are sent.
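
The 17-versus-7 comparison on this slide can be reproduced with per-element timestamps. The sketch below is an illustration only: the function names are hypothetical, X1 to X8 are stored as indices 0 to 7, and the control-data cost is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One traditional diff: the indices (0-based) written in one time step.
using Diff = std::vector<int>;

// Traditional method: every stored diff is sent, so repeated updates
// to the same element are transmitted more than once.
std::size_t accumulated_units(const std::vector<Diff>& diffs) {
    std::size_t units = 0;
    for (const Diff& d : diffs) units += d.size();
    return units;
}

// Timestamp method: remember only when each element was last written.
std::vector<int> last_update_times(const std::vector<Diff>& diffs,
                                   std::size_t n) {
    std::vector<int> stamp(n, 0);  // 0 means "never written"
    for (std::size_t t = 0; t < diffs.size(); ++t)
        for (int idx : diffs[t]) stamp[idx] = static_cast<int>(t) + 1;
    return stamp;
}

// Diff computed on request: the elements written after time `since`.
std::vector<int> diff_on_request(const std::vector<int>& stamp, int since) {
    std::vector<int> out;
    for (std::size_t i = 0; i < stamp.size(); ++i)
        if (stamp[i] > since) out.push_back(static_cast<int>(i));
    return out;
}

// The update history from the slide (X1..X8 stored as indices 0..7).
std::vector<Diff> slide_history() {
    return {
        {0, 1, 2, 5, 6, 7},  // T=1: X1 X2 X3 X6 X7 X8
        {0, 2, 4, 7},        // T=2: X1 X3 X5 X8
        {1, 4, 6, 7},        // T=3: X2 X5 X7 X8
        {2, 4, 6},           // T=4: X3 X5 X7
    };
}
```

With this history, `accumulated_units` gives 17, while `diff_on_request(stamp, 0)` returns 7 elements, one per value that was ever written. A requester that is already up to date as of time 3 would receive only X3, X5 and X7.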

  15. Other Components in LOTS • C++ runtime library in Linux • Minimal set of functions as interface • Retains as much C++ syntax as possible to improve programmability • Synchronization: Locks and Barriers • Barriers: With/Without memory effect • Communication: Sockets with UDP/IP • SIGIO handler for incoming messages

  16. Performance Testing • Two Kinds of Testing • Without invoking large object space support • Compare performance with another DSM (JIAJIA V1.0, as both have a similar communication protocol) • Report no. of messages and bytes sent • Calculate large object space support overhead • 16 Pentium IV 2 GHz machines with 100 Mbps Fast Ethernet connection, 128 MB memory, Fedora Linux • With large object space support • Use an application with large memory demand • Run on different platforms for analysis • Expect disk read/write overhead to dominate

  17. Test 1: Timing Performance [Figure: four plots of execution time against problem size, annotated LOTS<JiaJia, LOTS<JiaJia, LOTS>JiaJia and LOTS<JiaJia. Legend: LOTS: LOTS enabled; LOTS-x: LOTS disabled. x-axis: problem size; y-axis: execution time in seconds.]

  18. Performance Results Summary • LOTS beats JIAJIA V1.0 in most applications • Mixed protocol + "diff accumulation elimination" reduce data traffic • Large object space support and access checking incur a considerable overhead • About 5-15% of total execution time (application dependent)

  19. Test 2: Large Object Space • Using 4-node PC and server clusters • Test program: simple matrix operations • With 120GB (SCSI) hard disk in each machine, able to claim 117.77GB Shared Object Space • Disk read and write time is closely related to the OS version.

  20. Conclusions • LOTS succeeds in: • Providing a shared object space larger than the local process space at runtime • Performing reasonably well by reducing data traffic through Scope Consistency, the mixed coherence protocol and the "diff accumulation elimination" technique • Offering a programming interface similar to C++

  21. Future Work • A Number of Optimizations: • Further increase shared object space → "the minimum hard disk space × number of processes / 2". • Recent progress: 64 GB (4 GB × 16) of shared objects can be allocated in 16 machines, each having a 9 GB hard disk. • Reduce disk overhead • Reduce operator-overloading overhead (access check) • Load-aware migrating-home protocol: coherence protocol adapting to network traffic and processor loading (e.g., avoid too many "homes" in a single machine)
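
As a worked check of the quoted expression against the reported configuration, assume it is meant as a target size for the shared object space. The symbols $S$, $D_{\min}$ and $p$ are introduced here for illustration and do not appear in the slides.

```latex
S = \frac{D_{\min} \times p}{2}
  = \frac{9\ \text{GB} \times 16}{2}
  = 72\ \text{GB},
\qquad
\text{reported: } 4\ \text{GB} \times 16 = 64\ \text{GB} < 72\ \text{GB}.
```

Under that reading, the 64 GB already allocated on 16 machines with 9 GB disks is 8 GB short of the stated expression.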

  22. Questions ?

  23. Test 1: No. of Messages Sent • Due to the mixed protocol, LOTS sends fewer messages through the network than JIAJIA [Figure: y-axis: %; x-axis: No. of procs (p). The percentage is obtained by dividing the number of messages sent in LOTS by that in JIAJIA for the same application.]

  24. Test 1: No. of Bytes Sent [Figure: y-axis: %; x-axis: No. of procs (p). The percentage is obtained by dividing the number of bytes sent in LOTS by that in JIAJIA for the same application.]

  25. Test 2: Large Object Space • Allocate shared objects with total size > 4GB, and another process accesses each of them once (array addition with p=4)

      int main(int argc, char **argv) {
          int i, j, pp, local[4];
          // 2D int array
          Pointer <Pointer <int> > a;
          lots_init();              // init LOTS
          // shared memory allocation
          a.alloc(X);
          for (i = 0; i < X; i++)
              a[i].alloc(size);
          nm_barrier();             // barrier
          // array addition
          for (j = 0; j < linec; j++) {
              pp = (dsmid + j) % linec;
              for (i = pp; i < X; i += 4) {
                  acq(i);
                  a[i][0] += rand();
                  rel(i);
              }
          }
          nm_barrier();
          return 0;
      }
