
Synchronization without Contention



Presentation Transcript


  1. ECE 259 / CPS 221 Advanced Computer Architecture II Synchronization without Contention John M. Mellor-Crummey and Michael L. Scott+ Presenter : Tae Jun Ham 2012. 2. 16

  2. Problem • Busy-waiting synchronization incurs high memory/network contention • Creates hot spots, degrading performance • Causes a cache-line invalidation on every write to the lock • Possible approach: add special-purpose hardware for synchronization • Add synchronization variables to the switching nodes of the interconnect • Implement a lock-queuing mechanism in the cache controller • Suggestion in this paper: use a scalable synchronization algorithm (MCS) instead of special-purpose hardware

  3. Review of Synchronization Algorithms • Test and Set • Requires: test&set (atomic operation) • Problems: • High contention – cache/memory • Lack of fairness – acquisitions happen in random order

LOCK:   while (test&set(x) == 1) ;
UNLOCK: x = 0;

  4. Review of Synchronization Algorithms • Test and Set with Backoff • Same as Test and Set but inserts a delay between attempts • Delay policy: • Linear: delay += some constant • Exponential: delay *= some constant • Performance: reduced contention, but still not fair

LOCK:   while (test&set(x) == 1) { delay(time); }
UNLOCK: x = 0;

  5. Review of Synchronization Algorithms • Ticket Lock • Requires: fetch&increment (atomic operation) • Advantage: fair (FIFO) • Disadvantage: contention (memory/network)

LOCK:   myticket = fetch&increment(&L->next_ticket);
        while (myticket != L->now_serving)
            delay(time * (myticket - L->now_serving));
UNLOCK: L->now_serving = L->now_serving + 1;

  6. Review of Synchronization Algorithms • Anderson Lock (array-based queue lock) • Requires: fetch&increment (atomic operation) • Advantage: fair (FIFO), no cache contention • Disadvantage: requires a coherent cache / O(P) space per lock

LOCK:   myplace = fetch&increment(&L->next_location);
        while (L->location[myplace mod P] == must_wait) ;
        L->location[myplace mod P] = must_wait;
UNLOCK: L->location[(myplace + 1) mod P] = has_lock;

  7. MCS Lock • MCS Lock – based on a linked list • Acquire: • fetch&store on the tail pointer (get predecessor, install self as new tail) • set the arriving node's Locked flag to true • set the predecessor's next pointer to the arriving node • spin until Locked == false

[Diagram: queue of nodes 1–4 with a tail pointer; the head node has Locked = false and runs, while the others have Locked = true and spin on their own nodes]

  8. MCS Lock • MCS Lock – based on a linked list • Release (successor visible): • check whether the next pointer is set (i.e., the successor has completed its acquisition) • if set, clear the successor's Locked flag so it runs

[Diagram: the head node finishes and hands off; node 2's Locked flag becomes false and it runs, while nodes 3–4 keep spinning]

  9. MCS Lock • MCS Lock – based on a linked list • Release (no successor visible): • if next is not set, compare&swap the tail pointer with nil to check whether it still points at this node • if it does, the queue is empty and the release completes • if not, a successor is mid-enqueue: wait until next is set, then clear the successor's Locked flag

[Diagram: two-node queue showing a releasing node racing with an arriving node until the next link is established]

  10. MCS Lock – Concurrent Read Version • Based on the same linked list, extended to allow concurrent readers

  11. MCS Lock – Concurrent Read Version • Start_Read: if the predecessor is nil or an active reader, reader_count++ (atomic) and proceed; else spin until another Start_Read or End_Write unblocks this reader, which then unblocks its successor reader (if any) • End_Read: if the successor is a writer, set next_writer = successor; reader_count-- (atomic); if this was the last reader (reader_count == 0), check next_writer and unblock it • Start_Write: if the predecessor is nil and there is no active reader (reader_count == 0), proceed; else spin until the last End_Read unblocks this writer • End_Write: if the successor is a reader, reader_count++ (atomic) and unblock it

  12. Review of Barriers • Centralized counter barrier: every processor atomically updates a central counter and spins checking it • Advantage: simplicity • Disadvantage: hot spot, contention

  13. Review of Barriers • Combining Tree Barrier • Advantage: simplicity, less contention, parallelized fetch&increment • Disadvantage: still spins on non-local locations

  14. Review of Barriers • Bidirectional Tournament Barrier • The winner of each round is statically determined • Advantage: no fetch&op needed; local spinning

  15. Review of Barriers • Dissemination Barrier • Can be understood as a variation of the tournament barrier (partners are statically determined) • Suitable for message-passing (MPI-style) systems

  16. MCS Barriers • MCS Barrier (arrival tree) • Similar to the Combining Tree Barrier • Local spinning / O(P) space / 2P-2 network transactions / O(log P) critical path

  17. MCS Barriers • MCS Barrier (wakeup tree) • Similar to the Combining Tree Barrier • Local spinning / O(P) space / 2P-2 network transactions / O(log P) critical path

[Diagram: wakeup tree over processors 0–5]

  18. Spin Lock Evaluation • Butterfly machine results • Three of the locks scaled badly and four scaled well; MCS was best • Backoff was effective

  19. Spin Lock Evaluation • Butterfly machine results • Measured the time between consecutive lock acquisitions on separate processors, rather than a single acquire/release pair from start to finish

  20. Spin Lock Evaluation • Symmetry machine results • The MCS and Anderson locks scale well • The ticket lock cannot be implemented on the Symmetry, which lacks a fetch&increment operation • The Symmetry results seem more reliable

  21. Spin Lock Evaluation • Network latency • MCS greatly reduces the growth in network latency • Local spinning reduces contention

  22. Barrier Evaluation • Butterfly machine • Dissemination was best • The bidirectional tournament and MCS tree barriers were acceptable • Remote memory accesses degrade performance significantly

  23. Barrier Evaluation • Symmetry machine • The counter method was best and dissemination worst • A bus-based architecture makes broadcast cheap • The MCS arrival tree outperforms the counter beyond 16 processors

  24. Local Memory Evaluation

  25. Local Memory Evaluation • Having local memory is extremely important • It affects both performance and network contention • A dance-hall system is not really scalable

  26. Summary • This paper proposed a scalable spin-lock algorithm that avoids network contention • It proposed a scalable barrier algorithm • It showed that network contention due to busy-wait synchronization need not be a problem • It argued that special-purpose hardware such as the QOSB lock would not be cost-effective compared with the MCS lock • It suggests using distributed memory or coherent caches rather than dance-hall memory without coherent caches

  27. Discussion • What is the primary disadvantage of the MCS lock? • In what cases would the MCS lock perform worse than other locks? • What do you think of special-purpose hardware locks? • Does the space usage of a lock matter? • Can we benefit from a dance-hall style memory architecture (disaggregated memory?) • Is there a way to implement an energy-efficient spin lock?
