
Exploiting Store Locality through Permission Caching in Software DSMs

Håkan Zeffer, Zoran Radovic, Oskar Grenholm and Erik Hagersten. Uppsala University, Dept. of Information Technology, Div. of Computer Systems, Uppsala Architecture Research Team [UART]. zeffer@it.uu.se



Presentation Transcript


  1. Exploiting Store Locality through Permission Caching in Software DSMs. Uppsala University, Dept. of Information Technology, Div. of Computer Systems, Uppsala Architecture Research Team [UART]. Håkan Zeffer, Zoran Radovic, Oskar Grenholm and Erik Hagersten, zeffer@it.uu.se

  2. Software Distributed Shared Memory

  3. Traditional Software DSMs
     • Page-based coherence [e.g., Ivy, Munin, TreadMarks]
     • Virtual memory hardware for coherence checks
       • Expensive TLB traps
     • Large coherence unit size
       • Problem: false sharing
       • Solution: weak memory consistency models
     [Figure: a store miss sends a request from the CPUs to the directory (dir) entry guarding the DATA]

  4. Fine-Grain Software DSMs
     • Fine-grain access-control checks [Shasta, Blizzard]
     • Relies on binary instrumentation
       • Avoids operating system trapping
       • Less false sharing
     • Extra instructions introduce overhead
     [Figure: checking code is instrumented into the application; a check ("if (miss) goto st_protocol") precedes the ST, and a miss sends a request from the CPUs to the directory (dir) for the DATA]

  5. Fine-Grain Pros and Cons
     • Pros
       • Small coherence unit
       • Hardware-like memory consistency model
     • Cons
       • Extra check instructions to execute
     • Our proposal: Write Permission Cache (WPC)
       • Exploits store locality
       • Caches write permission
       • Effectively reduces the store instrumentation cost

  6. Outline • Motivation • Problem: Instrumentation Overhead • Solution: Write Permission Cache • Experimental Setup • Results on Real HW- and SW-DSM Systems • Conclusions

  7. Software Fine-Grain Coherence
     • Binary instrumentation of global loads and stores
     • Inserted code “snippet” maintains coherence

     Original program:
           add R1, R2 -> R3
     loop: ld [R1 + R4] -> R6    // G_LD1
           ld [R2 + R4] -> R7    // G_LD2
           sub R9, 1 -> R9
           add R6, R7 -> R8
           st R8 -> [R3 + R4]    // G_ST1
           add R4, 4 -> R4
           bnz R9, loop
     L134: st R3 -> [R7 + 4]

     Instrumented program:
           add R1, R2 -> R3
     loop: load snippet for G_LD1
           call coherence protocol if load miss
           load snippet for G_LD2
           call coherence protocol if load miss
           sub R9, 1 -> R9
           add R6, R7 -> R8
           store snippet for G_ST1
           call coherence protocol if store miss
           add R4, 4 -> R4
           bnz R9, loop
     L134: st R3 -> [R7 + 4]

  8. The Lock Problem (original DSZOOM)
     • Example store access pattern (array traversal)

     Operation        CUID   Original snippet handling
     ST 0xE22F0000    98     lock dir entry 98; store; unlock dir entry 98
     ST 0xE22F0008    98     lock dir entry 98; store; unlock dir entry 98
     ST 0xE22F0010    98     lock dir entry 98; store; unlock dir entry 98
     ST 0xE22F0018    98     lock dir entry 98; store; unlock dir entry 98
     ST 0xE22F0020    98     lock dir entry 98; store; unlock dir entry 98
     ST 0xE22F0028    98     lock dir entry 98; store; unlock dir entry 98
     ST 0xE22F0030    98     lock dir entry 98; store; unlock dir entry 98
     ST 0xE22F0038    98     lock dir entry 98; store; unlock dir entry 98
     ST 0xE22F0040    99     lock dir entry 99; store; unlock dir entry 99
     ST 0xE22F0048    99     lock dir entry 99; store; unlock dir entry 99

  9. DSZOOM Fine-Grain Coherence
     • Magic value (load), atomic operations (store)

     Original program:
           add R1, R2 -> R3
     loop: ld [R1 + R4] -> R6    // G_LD1
           ld [R2 + R4] -> R7    // G_LD2
           sub R9, 1 -> R9
           add R6, R7 -> R8
           st R8 -> [R3 + R4]    // G_ST1
           add R4, 4 -> R4
           bnz R9, loop
     L134: st R3 -> [R7 + 4]

     Instrumented program:
           add R1, R2 -> R3
     loop: ld [R1 + R4] -> R6              // original load
           if (R6 == MAGIC)                // test permission
             LD_PROTOCOL();                // protocol if miss
           ld [R2 + R4] -> R7              // original load
           if (R7 == MAGIC)                // test permission
             LD_PROTOCOL();                // protocol if miss
           sub R9, 1 -> R9
           add R6, R7 -> R8
           LOCK(LOCAL_DIR);                // lock local dir
           if (LOCAL_DIR != WRITE_PERMISSION)
             ST_PROTOCOL();                // protocol if miss
           st R8 -> [R3 + R4]              // original store
           UNLOCK(LOCAL_DIR);              // unlock local dir
           add R4, 4 -> R4
           bnz R9, loop
     L134: st R3 -> [R7 + 4]
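To make the inserted snippets concrete, a minimal C sketch of what the load and store checks above do is given below. All names here (checked_load, checked_store, dir_lock, st_protocol, cuid_of, and the MAGIC and WRITE_PERMISSION encodings) are illustrative assumptions, not DSZOOM's actual code; the real snippets are inlined SPARC assembly as shown above.

    #include <stdint.h>

    #define MAGIC            0xDEADBEEFu   /* "invalid data" pattern (assumed value) */
    #define WRITE_PERMISSION 0x2           /* directory state encoding (assumed) */

    extern uint8_t  local_dir[];                          /* one state byte per coherence unit */
    extern void     dir_lock(unsigned cuid);              /* hypothetical helpers */
    extern void     dir_unlock(unsigned cuid);
    extern void     ld_protocol(volatile uint32_t *addr); /* fetch data on a load miss */
    extern void     st_protocol(unsigned cuid);           /* obtain write permission */
    extern unsigned cuid_of(const volatile void *addr);   /* address -> coherence-unit id */

    /* Load snippet: if the loaded value equals the magic pattern, the data may be
     * invalid, so invoke the load protocol and reload. */
    static inline uint32_t checked_load(volatile uint32_t *addr)
    {
        uint32_t v = *addr;
        if (v == MAGIC) {
            ld_protocol(addr);
            v = *addr;
        }
        return v;
    }

    /* Store snippet (original DSZOOM): lock the directory entry around every store. */
    static inline void checked_store(volatile uint32_t *addr, uint32_t v)
    {
        unsigned cuid = cuid_of(addr);
        dir_lock(cuid);
        if (local_dir[cuid] != WRITE_PERMISSION)
            st_protocol(cuid);             /* upgrade to write permission */
        *addr = v;                         /* original store */
        dir_unlock(cuid);
    }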

  10. Sequential Instrumentation Overhead
     Average instrumentation overhead when run on a single processor (SPLASH2, -O3):
     • Integer load instrumentation overhead: 3%
       • Overhead when only integer loads are instrumented
     • Float load instrumentation overhead: 31%
       • Only floating-point loads instrumented
     • Store instrumentation overhead: 61%
       • Only stores instrumented

  11. Write Permission Caching in Action
     • Example store access pattern (array traversal)
     [Figure: a Write Permission Cache holding entries 98 and 99]

     Operation        CUID   WPC snippet handling
     ST 0xE22F0000    98     check WPC; miss; upd. WPC; lock dir entry 98; store
     ST 0xE22F0008    98     check WPC; hit; store
     ST 0xE22F0010    98     check WPC; hit; store
     ST 0xE22F0018    98     check WPC; hit; store
     ST 0xE22F0020    98     check WPC; hit; store
     ST 0xE22F0028    98     check WPC; hit; store
     ST 0xE22F0030    98     check WPC; hit; store
     ST 0xE22F0038    98     check WPC; hit; store
     ST 0xE22F0040    99     check WPC; miss; unlock 98; upd. WPC; lock 99; store
     ST 0xE22F0048    99     check WPC; hit; store

  12. The Write Permission Cache Idea
     • Keep the lock
     • Rely on store locality
     • SPARC application registers

     Original program:
           add R1, R2 -> R3
     loop: ld [R1 + R4] -> R6    // G_LD1
           ld [R2 + R4] -> R7    // G_LD2
           sub R9, 1 -> R9
           add R6, R7 -> R8
           st R8 -> [R3 + R4]    // G_ST1
           add R4, 4 -> R4
           bnz R9, loop
     L134: st R3 -> [R7 + 4]

     Write Permission Cache Snippet:
     WPC_FASTPATH:
           if (WPC != CU_ID(ADDR))
             WPC_SLOWPATH()
           st R8 -> [R3 + R4]              // original store
     WPC_SLOWPATH:
           UNLOCK(WPC)
           WPC = CU_ID(ADDR)
           LOCK(WPC);
           if (LOCAL_DIR != WRITE_PERMISSION)
             ST_PROTOCOL();
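A rough C sketch of the WPC_FASTPATH/WPC_SLOWPATH logic above follows, with a single cached entry per thread. The names (wpc_entry, wpc_store, dir_lock, st_protocol, cuid_of) are hypothetical stand-ins; in the real system the WPC entry lives in SPARC application registers and the check is emitted as assembly by the instrumentation tool.

    #include <stdint.h>

    #define WPC_INVALID      ((unsigned)-1)
    #define WRITE_PERMISSION 0x2           /* directory state encoding (assumed) */

    extern uint8_t  local_dir[];           /* one state byte per coherence unit */
    extern void     dir_lock(unsigned cuid);      /* hypothetical helpers */
    extern void     dir_unlock(unsigned cuid);
    extern void     st_protocol(unsigned cuid);
    extern unsigned cuid_of(const volatile void *addr);

    static unsigned wpc_entry = WPC_INVALID;      /* one-entry WPC; a register in the real snippet */

    /* Slow path: release the previously cached permission, then lock the new
     * directory entry and obtain write permission for it. The lock is held
     * until the next WPC miss (or a WPC flush). */
    static void wpc_slowpath(unsigned cuid)
    {
        if (wpc_entry != WPC_INVALID)
            dir_unlock(wpc_entry);
        wpc_entry = cuid;
        dir_lock(cuid);
        if (local_dir[cuid] != WRITE_PERMISSION)
            st_protocol(cuid);
    }

    /* Fast path: on a WPC hit the store costs only a compare and a branch. */
    static inline void wpc_store(volatile uint32_t *addr, uint32_t v)
    {
        unsigned cuid = cuid_of(addr);
        if (wpc_entry != cuid)
            wpc_slowpath(cuid);
        *addr = v;                         /* original store; permission already held */
    }

The point of the design is that the common case pays only a register compare and a branch, while the directory lock is held across consecutive stores to the same coherence unit, which is exactly the access pattern shown on the previous slide.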

  13. Experimental Setup: Software
     • Benchmarks: unmodified SPLASH2
     • Compiler: GCC 3.3.3 (-O0 and -O3)
     • Instrumentation tool: custom-made

  14. Experimental Setup: Hardware
     • SMP: Sun Enterprise E6000 Server
       • 16 UltraSPARC II (250 MHz)
       • Memory access time 330 ns [lmbench]
     • HW-DSM: Sun Wildfire (2 E6000 nodes)
       • Remote memory access time 1700 ns [lmbench]
       • Hardware-coherent interconnect, BW 800 MB/s
     • DSZOOM: runs in user space on the Wildfire system
       • put (get) = uncacheable block load (store) operation
       • atomic = ldstub (load store unsigned byte, SPARC V9)
       • Maintains coherence between private copies of G_MEM
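As a sketch of the ldstub-style locking mentioned above, the following C approximation uses GCC's __sync builtins in place of the SPARC instruction; this is an assumption for illustration, since the slides only state that ldstub is the atomic primitive and that it operates on directory memory reachable over Wildfire.

    #include <stdint.h>

    /* Spin lock on a single byte, mimicking SPARC ldstub (which atomically reads
     * the byte and writes 0xFF). Here a GCC builtin stands in for the
     * instruction; the real system issues the atomic against remotely mapped
     * directory memory. */
    static inline void dir_lock_acquire(volatile uint8_t *lock_byte)
    {
        while (__sync_lock_test_and_set(lock_byte, 1) != 0)
            ;                              /* busy-wait until the byte was free (0) */
    }

    static inline void dir_lock_release(volatile uint8_t *lock_byte)
    {
        __sync_lock_release(lock_byte);    /* store 0 with release semantics */
    }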

  15. Write Permission Cache Hit Rate

  16. Sequential Instrumentation Overhead

  17. Execution Time, 16 processors (2x8)
     • Performance bug in paper (popc)

  18. Conclusions
     • Write permission cache (WPC)
       • Effectively reduces store instrumentation overhead
       • 2 entries are sufficient
     • Store instrumentation overhead reduction: 42%
     • HW-/SW-DSM gap reduction: 28%
     • Parallel performance improvement: 9%

  19. Thanks and Questions http://www.it.uu.se/research/group/uart

  20. Memory Consistency
     • The base architecture implements sequential consistency by requiring all acknowledgments from sharing nodes before a global store request is granted
     • Introducing the WPC in an invalidation-based environment will not weaken the memory model
       • The WPC just extends the duration of the permission tenure before the write permission is given up
     • If the memory model of each node is weaker than SC, it will decide the memory model of the system

  21. Deadlock
     • WPC entries are flushed at:
       • Synchronization points
       • Failures to acquire directory locks
       • Thread termination
     • WPC + flag synchronization can lead to deadlock
       • Timers
       • Interrupt other CPUs
       • Lack of forward progress
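A minimal sketch of such a flush, assuming a two-entry WPC and the same hypothetical dir_unlock helper as in the earlier sketches (the function and array names are illustrative, not taken from DSZOOM):

    #define WPC_ENTRIES 2                  /* two entries suffice per the conclusions */
    #define WPC_INVALID ((unsigned)-1)

    extern void dir_unlock(unsigned cuid); /* hypothetical helper */

    static unsigned wpc[WPC_ENTRIES] = { WPC_INVALID, WPC_INVALID };

    /* Release every cached write permission so other nodes can acquire the
     * directory locks; called at synchronization points, on failures to acquire
     * directory locks, and at thread termination (and, per the slide, from a
     * timer when flag synchronization would otherwise block forward progress). */
    static void wpc_flush(void)
    {
        for (int i = 0; i < WPC_ENTRIES; i++) {
            if (wpc[i] != WPC_INVALID) {
                dir_unlock(wpc[i]);
                wpc[i] = WPC_INVALID;
            }
        }
    }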

  22. Directory Collisions
     • A directory collision occurs when a requesting processor fails to acquire a directory lock
     • The number of directory collisions does not increase when fewer than 32 WPC entries are used
     • More information in the paper
