
Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors


Presentation Transcript


  1. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors • John M. Mellor-Crummey and Michael L. Scott • Presented by Joseph Garvey & Joshua San Miguel

  2. Dance Hall Machines?

  3. Atomic Instructions • Various insns known as fetch_and_Φ insns: test_and_set, fetch_and_store, fetch_and_add, compare_and_swap • Some can be used to simulate others, but often with extra overhead • Some lock types need a particular primitive in order to be implemented at all, or to be implemented efficiently
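To make the "simulate others, but with overhead" point concrete, here is a minimal C11 sketch (my own illustration, not from the slides) of fetch_and_increment built from compare_and_swap; the retry loop is exactly the extra overhead.

      #include <stdatomic.h>

      /* fetch_and_increment emulated with compare_and_swap. */
      int fetch_and_increment(atomic_int *counter)
      {
          int old = atomic_load(counter);
          /* Retry until no other processor changed the counter between our
             load and our compare_and_swap; this loop is the extra overhead. */
          while (!atomic_compare_exchange_weak(counter, &old, old + 1))
              ;   /* on failure, old is refreshed with the current value */
          return old;   /* the value before the increment */
      }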

  4. Test_and_set: Basic
     type lock = (unlocked, locked)

     procedure acquire_lock (lock *L)
         while test_and_set (L) == locked
             ;   // spin

     procedure release_lock (lock *L)
         *L = unlocked
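A minimal C11 translation of this basic test_and_set lock (my own sketch, assuming atomic_flag maps onto a test_and_set-style primitive):

      #include <stdatomic.h>

      typedef atomic_flag tas_lock;          /* initialise with ATOMIC_FLAG_INIT */

      void acquire_lock(tas_lock *L)
      {
          /* Every iteration is a read-modify-write on the shared lock word,
             which is exactly why this lock hammers the interconnect. */
          while (atomic_flag_test_and_set(L))
              ;   /* spin: old value was 'locked' */
      }

      void release_lock(tas_lock *L)
      {
          atomic_flag_clear(L);
      }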

  5. Test_and_set: Basic [diagram: several processors with private caches all doing test_and_set on one lock word in shared memory]

  6. Test_and_set: test_and_test_and_set
     type lock = (unlocked, locked)

     procedure acquire_lock (lock *L)
         while 1
             if *L == unlocked                    // read the cached copy first
                 if test_and_set (L) == unlocked  // old value unlocked: we got it
                     return

     procedure release_lock (lock *L)
         *L = unlocked
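A C11 sketch of test_and_test_and_set (my own translation): the waiter spins with ordinary loads on its cached copy and issues the read-modify-write only when the lock looks free.

      #include <stdatomic.h>
      #include <stdbool.h>

      typedef atomic_bool ttas_lock;         /* false = unlocked, true = locked */

      void acquire_lock(ttas_lock *L)
      {
          for (;;) {
              if (!atomic_load(L)) {                 /* spin in the local cache */
                  if (!atomic_exchange(L, true))     /* old value was unlocked */
                      return;                        /* we now hold the lock */
              }
          }
      }

      void release_lock(ttas_lock *L)
      {
          atomic_store(L, false);
      }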

  7. Test_and_set: test_and_test_and_set [diagram: processors spinning on cached copies of the lock word, going to memory only when it looks free]

  8. Test_and_set: test_and_set with backoff
     type lock = (unlocked, locked)

     procedure acquire_lock (lock *L)
         delay = 1
         while test_and_set (L) == locked
             pause (delay)
             delay = delay * 2

     procedure release_lock (lock *L)
         *L = unlocked
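A C11 sketch of test_and_set with exponential backoff (my own translation; the busy-wait pause and the backoff cap are assumptions, since the slide leaves pause unspecified):

      #include <stdatomic.h>
      #include <stdbool.h>

      typedef atomic_bool tas_lock;          /* false = unlocked, true = locked */

      static void pause_for(unsigned iterations)
      {
          for (volatile unsigned i = 0; i < iterations; i++)
              ;                              /* crude delay loop */
      }

      void acquire_lock(tas_lock *L)
      {
          unsigned delay = 1;
          while (atomic_exchange(L, true)) { /* old value true => still locked */
              pause_for(delay);
              if (delay < (1u << 16))        /* cap the backoff (assumption) */
                  delay *= 2;
          }
      }

      void release_lock(tas_lock *L)
      {
          atomic_store(L, false);
      }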

  9. Ticket Lock
     type lock = record
         next_ticket = 0
         now_serving = 0

     procedure acquire_lock (lock *L)
         my_ticket = fetch_and_increment (L->next_ticket)
         while 1
             if L->now_serving == my_ticket
                 return

     procedure release_lock (lock *L)
         L->now_serving = L->now_serving + 1
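A C11 sketch of the ticket lock (my own translation; atomic_fetch_add plays the role of fetch_and_increment):

      #include <stdatomic.h>

      typedef struct {
          atomic_uint next_ticket;           /* next ticket to hand out */
          atomic_uint now_serving;           /* ticket currently allowed in */
      } ticket_lock;                         /* zero-initialise both fields */

      void acquire_lock(ticket_lock *L)
      {
          unsigned my_ticket = atomic_fetch_add(&L->next_ticket, 1);
          while (atomic_load(&L->now_serving) != my_ticket)
              ;   /* spin with reads only: no read-modify-write traffic */
      }

      void release_lock(ticket_lock *L)
      {
          /* Only the holder writes now_serving, so a plain store is enough. */
          atomic_store(&L->now_serving, atomic_load(&L->now_serving) + 1);
      }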

  10. Array-Based Queuing Locks
      type lock = record
          slots = array [0 … numprocs – 1] of (has_lock, must_wait)
          next_slot = 0

      procedure acquire_lock (lock *L)
          my_place = fetch_and_increment (L->next_slot)
          // various modulo work to handle overflow
          while L->slots[my_place] == must_wait
              ;   // spin on our own slot
          L->slots[my_place] = must_wait   // reset the slot for its next use

      procedure release_lock (lock *L)
          L->slots[my_place + 1] = has_lock   // my_place saved from acquire_lock
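A C11 sketch of the array-based queueing lock for a fixed NUMPROCS (my own translation; the cache-line padding size and having acquire return my_place to the caller are assumptions the slide leaves implicit):

      #include <stdatomic.h>
      #include <stdbool.h>

      #define NUMPROCS 64                    /* must divide the wrap-around of next_slot */

      typedef struct {
          struct {
              atomic_bool must_wait;         /* slot 0 starts out as has_lock (false) */
              char pad[64 - sizeof(atomic_bool)];   /* one slot per cache line */
          } slots[NUMPROCS];
          atomic_uint next_slot;
      } array_lock;

      void array_lock_init(array_lock *L)
      {
          for (int i = 0; i < NUMPROCS; i++)
              atomic_store(&L->slots[i].must_wait, i != 0);
          atomic_store(&L->next_slot, 0);
      }

      /* Returns my_place, which the caller must hand back to release_lock. */
      unsigned acquire_lock(array_lock *L)
      {
          unsigned my_place = atomic_fetch_add(&L->next_slot, 1) % NUMPROCS;
          while (atomic_load(&L->slots[my_place].must_wait))
              ;   /* spin on our own slot */
          atomic_store(&L->slots[my_place].must_wait, true);   /* reset for next use */
          return my_place;
      }

      void release_lock(array_lock *L, unsigned my_place)
      {
          atomic_store(&L->slots[(my_place + 1) % NUMPROCS].must_wait, false);
      }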

  11. Array-Based Queuing Locks [diagram: each processor holds its own my_place index; next_slot and the slots array live in shared memory]

  12. MCS Locks
      type qnode = record
          qnode *next
          bool  locked
      type lock = qnode*

      procedure acquire_lock (lock *L, qnode *I)
          I->next = Null
          qnode *predecessor = fetch_and_store (L, I)
          if predecessor != Null             // queue was non-empty
              I->locked = true
              predecessor->next = I
              while I->locked
                  ;   // spin only on our own qnode

      procedure release_lock (lock *L, qnode *I)
          if I->next == Null                 // no known successor
              if compare_and_swap (L, I, Null)
                  return                     // really no successor: lock is free
              while I->next == Null
                  ;   // a successor is linking itself in; wait for it
          I->next->locked = false            // pass the lock to the successor
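A C11 sketch of the MCS lock (my own translation; atomic_exchange stands in for fetch_and_store, and atomic_compare_exchange_strong for compare_and_swap):

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stddef.h>

      typedef struct qnode {
          struct qnode *_Atomic next;
          atomic_bool locked;
      } qnode;

      typedef struct qnode *_Atomic mcs_lock;   /* tail of the queue; NULL = free */

      void acquire_lock(mcs_lock *L, qnode *I)
      {
          atomic_store(&I->next, (qnode *)NULL);
          qnode *predecessor = atomic_exchange(L, I);   /* fetch_and_store */
          if (predecessor != NULL) {                    /* queue was non-empty */
              atomic_store(&I->locked, true);
              atomic_store(&predecessor->next, I);      /* link behind predecessor */
              while (atomic_load(&I->locked))
                  ;   /* spin only on our own qnode */
          }
      }

      void release_lock(mcs_lock *L, qnode *I)
      {
          qnode *successor = atomic_load(&I->next);
          if (successor == NULL) {
              qnode *expected = I;
              /* compare_and_swap: if we are still the tail, the lock is free. */
              if (atomic_compare_exchange_strong(L, &expected, (qnode *)NULL))
                  return;
              /* A waiter swapped itself in but has not linked yet; wait for it. */
              while ((successor = atomic_load(&I->next)) == NULL)
                  ;
          }
          atomic_store(&successor->locked, false);      /* hand over the lock */
      }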

  13. MCS Locks [diagram: step-by-step walkthrough of release_lock on a queue of waiting processors' qnodes]

  14. Results: Scalability – Distributed Memory Architecture

  15. Results: Scalability – Cache Coherent Architecture

  16. Results: Single Processor Lock/Release Time • The Butterfly's atomic insns are very expensive • The Butterfly's atomic insns can't manipulate 24-bit pointers

  17. Results: Network Congestion

  18. Which lock should I use?
      • Is fetch_and_store supported?  Yes → MCS
      • No: is fetch_and_increment supported?  Yes → Ticket.  No → test_and_set with exponential backoff
      • Exceptions:
          • Atomic insns >> normal insns && single-processor latency is very important → don't use MCS
          • If processes might be preempted → test_and_set with exponential backoff

  19. Centralized Barrier [diagram: four processors arriving at and updating a single shared counter]
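The slide is a diagram only; here is a C11 sketch (my own code) of a centralized counter barrier, with sense reversal added so the same barrier can be reused across episodes, a detail the diagram does not show:

      #include <stdatomic.h>
      #include <stdbool.h>

      typedef struct {
          atomic_uint count;                 /* arrivals still outstanding */
          atomic_bool sense;                 /* flips once per barrier episode */
          unsigned numprocs;
      } central_barrier;

      void barrier_init(central_barrier *b, unsigned numprocs)
      {
          atomic_store(&b->count, numprocs);
          atomic_store(&b->sense, false);
          b->numprocs = numprocs;
      }

      /* Each thread keeps its own local_sense (initially true) across calls. */
      void barrier_wait(central_barrier *b, bool *local_sense)
      {
          bool my_sense = *local_sense;
          if (atomic_fetch_sub(&b->count, 1) == 1) {
              /* Last to arrive: reset the count, then release everyone. */
              atomic_store(&b->count, b->numprocs);
              atomic_store(&b->sense, my_sense);
          } else {
              while (atomic_load(&b->sense) != my_sense)
                  ;   /* every waiter spins on the same shared location */
          }
          *local_sense = !my_sense;          /* prepare for the next episode */
      }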

  20. Software Combining Tree Barrier [diagram: processors combine their arrivals up a tree of counters instead of at a single shared counter]

  21. Tournament Barrier [diagram: processors paired into statically determined winners and losers at each round; losers wait, winners advance]

  22. Dissemination Barrier [diagram: in round k each processor signals the processor 2^k positions away and waits for the symmetric signal]
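The slide is a diagram only; below is a C11 sketch of the dissemination barrier it depicts (my own code: the fixed P, the round count, and the monotonic counters used in place of the paper's parity/sense flags are all simplifying assumptions). In round k, processor i signals processor (i + 2^k) mod P and waits to be signalled by (i - 2^k) mod P; after ceil(log2 P) rounds every processor has transitively heard from every other.

      #include <stdatomic.h>

      #define P      8                       /* number of processors (assumption) */
      #define ROUNDS 3                       /* ceil(log2(P)) */

      static atomic_uint flags[P][ROUNDS];   /* flags[i][k]: signals received by i in round k */
      static _Thread_local unsigned episode; /* how many barriers this thread has entered */

      /* 'me' must be a distinct, stable id in [0, P) for each participating thread. */
      void dissemination_barrier(unsigned me)
      {
          episode++;
          for (unsigned k = 0, dist = 1; k < ROUNDS; k++, dist *= 2) {
              atomic_fetch_add(&flags[(me + dist) % P][k], 1);   /* signal partner */
              while (atomic_load(&flags[me][k]) < episode)
                  ;   /* wait for this episode's signal aimed at us */
          }
      }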

  23. New Tree-Based Barrier [diagram: arrival tree with per-child not-ready flags, plus a separate wakeup tree]

  24. Summary

  25. Results – Distributed Shared Memory

  26. Results – Broadcast-Based Cache-Coherent

  27. Results – Local vs. Remote Spinning

  28. Barrier Decision Tree
      • Which multiprocessor?
          • Distributed shared memory → New Tree-Based Barrier (tree wakeup) or Dissemination Barrier
          • Broadcast-based cache-coherent → New Tree-Based Barrier (central wakeup) or Centralized Barrier

  29. Architectural Recommendations • No dance hall machines • No need for complicated hardware synchronization support • Need a full set of fetch_and_Φ primitives
