
Speculative Lock Elision

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution. Ravi Rajwar and James R. Goodman, University of Wisconsin-Madison. Motivation: multithreaded programming is gaining importance (SMPs, multithreaded architectures) and is now used in non-traditional fields such as the desktop.


Presentation Transcript


  1. Speculative Lock Elision Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and James R. Goodman University of Wisconsin-Madison

  2. Motivation • Multithreaded Programming gains Importance • SMPs, Multithreaded Architectures • Used in non-traditional fields (Desktop) • Synchronization Mechanisms required for exclusive access • Serialize access to shared data structures • Necessary to prevent conflicting updates

  3. Dynamically Unnecessary Synchronization
     Example from ocean (requires no lock if there is no write access):
         LOCK(locks->error_lock)
         if (local_err > multi->err_multi)
             multi->err_multi = local_err;
         UNLOCK(locks->error_lock);
     Similar example from SHORE (requires no lock if different fields are accessed):
         Thread 1:
             LOCK(hash_tbl.lock)
             var = hash_tbl.lookup(X)
             if (!var) hash_tbl.add(X);
             UNLOCK(hash_tbl.lock)
         Thread 2:
             LOCK(hash_tbl.lock)
             var = hash_tbl.lookup(Y)
             if (!var) hash_tbl.add(Y);
             UNLOCK(hash_tbl.lock)

  4. Multithreaded Programming • Conservative Locking • Easier to show correctness • Lock more often than necessary • Locking Granularity • Trade-Off between Performance and Complexity • Thread-unsafe legacy libraries • Require global locking

  5. False Dependencies • Locks introduce Control and Data Dependence • Removal • Key is Appearance of Instantaneous Change • Lock can be elided if memory operations remain atomic • Data read is not modified by other threads • Data written is not accessed by other threads • Any instruction that violates these conditions must not be retired
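
The two atomicity conditions on this slide can be written as a check over address sets. The following is a minimal sketch in C, not the hardware mechanism: the function names and the idea of passing explicit read and write sets are assumptions made for illustration, since real hardware observes conflicts through coherence traffic instead of comparing arrays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if addr appears in set[0..n). */
static bool contains(const uintptr_t *set, size_t n, uintptr_t addr) {
    for (size_t i = 0; i < n; i++) {
        if (set[i] == addr) return true;
    }
    return false;
}

/* Illustrative check of the slide's two conditions:
 *  - nothing this thread read was written by the other thread, and
 *  - nothing this thread wrote was read or written by the other thread. */
bool elision_is_safe(const uintptr_t *my_reads, size_t n_reads,
                     const uintptr_t *my_writes, size_t n_writes,
                     const uintptr_t *other_reads, size_t n_oreads,
                     const uintptr_t *other_writes, size_t n_owrites) {
    for (size_t i = 0; i < n_reads; i++) {
        if (contains(other_writes, n_owrites, my_reads[i])) return false;
    }
    for (size_t i = 0; i < n_writes; i++) {
        if (contains(other_reads, n_oreads, my_writes[i]) ||
            contains(other_writes, n_owrites, my_writes[i])) return false;
    }
    return true;
}
```

Two threads that only read a shared location, or that touch disjoint locations, pass the check; any read-write or write-write overlap fails it.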

  6. Silent store pairs • i16 and i6 are a silent pair • i16 undoes i6 • Not silent individually • No write to _lock_ • Read _lock_ is ok • SLE does not depend on semantic information • Simply observe silent store pairs

  7. Two SLE Predictions • Silent pair prediction • On a store • Predict another store will undo changes • Don’t perform stores • Monitor memory location • If validated, elide two stores • Atomicity prediction • Predict that all memory accesses within the silent pair occur atomically • No partial updates visible to other threads
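
Silent pair prediction can be illustrated with a small piece of bookkeeping. This is a hedged sketch in C, assuming a single outstanding elided store; the type `silent_pair_t` and the functions `begin_elision` and `validate_pair` are hypothetical names, not taken from the presentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of one elided store (for example, a lock acquire). */
typedef struct {
    bool      active;     /* a first store is currently being elided */
    uintptr_t addr;       /* location the first store targeted */
    uint32_t  old_value;  /* value in memory before the first store */
    uint32_t  new_value;  /* value the first store would have written */
} silent_pair_t;

/* First store: predict that a later store will undo it, so skip the
 * write and remember what memory held. */
void begin_elision(silent_pair_t *p, uintptr_t addr,
                   uint32_t old_value, uint32_t new_value) {
    p->active = true;
    p->addr = addr;
    p->old_value = old_value;
    p->new_value = new_value;
}

/* Later store: a store to a different address leaves the prediction
 * pending. A store to the same address ends the prediction and
 * validates it only if it restores the original value.
 * Returns true when both stores can be elided. */
bool validate_pair(silent_pair_t *p, uintptr_t addr, uint32_t value) {
    if (!p->active || addr != p->addr) return false;
    p->active = false;
    return value == p->old_value;
}
```

With a lock whose free value is 0, an acquire writing 1 followed by a release writing 0 validates the pair, so neither store needs to reach memory.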

  8. Initiating Speculation • Detect candidate pairs • Filter indexed by program counter • Add confidence metric for each pair • Predict whether the lock is held • If yes, don’t speculate
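
A PC-indexed filter with a confidence metric might look like the sketch below. The table size is borrowed from the 32-entry predictor mentioned in the evaluation; the 2-bit saturating counter, the threshold, the untagged indexing, and all identifiers are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FILTER_ENTRIES 32  /* assumption: same size as the evaluated predictor */
#define CONF_MAX        3  /* assumption: 2-bit saturating counter */
#define CONF_THRESHOLD  2  /* assumption: speculate at or above this value */

typedef struct {
    uint8_t confidence[FILTER_ENTRIES];
} sle_filter_t;

/* Untagged index derived from the store's program counter. */
static unsigned filter_index(uintptr_t pc) {
    return (unsigned)((pc >> 2) % FILTER_ENTRIES);
}

/* Decide whether to start speculating at this store. */
bool should_speculate(const sle_filter_t *f, uintptr_t pc, bool lock_held) {
    if (lock_held) return false;  /* lock appears held: don't speculate */
    return f->confidence[filter_index(pc)] >= CONF_THRESHOLD;
}

/* Raise confidence when a silent pair is confirmed for this PC,
 * reset it when the prediction turns out wrong. */
void filter_train(sle_filter_t *f, uintptr_t pc, bool pair_confirmed) {
    uint8_t *c = &f->confidence[filter_index(pc)];
    if (pair_confirmed) {
        if (*c < CONF_MAX) (*c)++;
    } else {
        *c = 0;
    }
}
```

In this sketch a store must be seen as part of a confirmed pair twice before speculation starts, and a single failure drops it back to zero.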

  9. Buffering Speculative State • Register state • Reorder Buffer • Already used for branch prediction • Register Checkpoint • Memory state • Augment write buffers • Keep speculative writes in write buffer, until validated • Allows merging of speculative writes
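
Holding speculative writes in an augmented write buffer, with merging, can be sketched as follows. The four-entry size, word-granularity entries, and the names `spec_write_buffer_t`, `wb_write`, and `wb_squash` are assumptions chosen to keep the example small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WB_ENTRIES 4  /* assumption: tiny buffer for illustration */

typedef struct {
    uintptr_t addr;
    uint32_t  value;
} wb_entry_t;

typedef struct {
    wb_entry_t entry[WB_ENTRIES];
    size_t     used;
} spec_write_buffer_t;

/* Buffer a speculative write. A write to an address that is already
 * buffered merges into the existing entry. Returns false when a new
 * entry is needed but the buffer is full, in which case the caller
 * has to stop speculating. */
bool wb_write(spec_write_buffer_t *wb, uintptr_t addr, uint32_t value) {
    for (size_t i = 0; i < wb->used; i++) {
        if (wb->entry[i].addr == addr) {
            wb->entry[i].value = value;  /* merge */
            return true;
        }
    }
    if (wb->used == WB_ENTRIES) return false;
    wb->entry[wb->used].addr = addr;
    wb->entry[wb->used].value = value;
    wb->used++;
    return true;
}

/* Discard all buffered writes after a misspeculation. */
void wb_squash(spec_write_buffer_t *wb) {
    wb->used = 0;
}
```

Because repeated writes to one location occupy a single entry, merging lets a critical section perform more stores than the buffer has entries.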

  10. Misspeculation conditions • Atomicity violation • Detected by coherence protocol • Sufficient for ROB register state • For register checkpoint add speculative access bit to L1 cache • Resource constraints • Cache Size • Write Buffer Size
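
One way to picture coherence-based detection of atomicity violations is with per-line speculative bits. The sketch below is an assumption-laden illustration: it uses separate read and written bits where the slide mentions a single speculative access bit, models a tiny untagged direct-mapped cache, and invents the names `l1_spec_bits_t`, `mark_access`, and `violates_atomicity`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define L1_LINES    8  /* assumption: tiny direct-mapped cache */
#define LINE_BYTES 64  /* assumption: 64-byte cache lines */

typedef enum { REQ_READ, REQ_WRITE } ext_req_t;

typedef struct {
    bool spec_read[L1_LINES];     /* line read during speculation */
    bool spec_written[L1_LINES];  /* line written during speculation */
} l1_spec_bits_t;

/* Untagged index, so aliasing addresses are treated conservatively. */
static size_t line_index(uintptr_t addr) {
    return (addr / LINE_BYTES) % L1_LINES;
}

/* Record a speculative access made inside the elided critical section. */
void mark_access(l1_spec_bits_t *c, uintptr_t addr, bool is_write) {
    size_t i = line_index(addr);
    if (is_write) c->spec_written[i] = true;
    else          c->spec_read[i] = true;
}

/* An external coherence request violates atomicity if it touches a line
 * written speculatively, or wants to write a line read speculatively. */
bool violates_atomicity(const l1_spec_bits_t *c, uintptr_t addr,
                        ext_req_t req) {
    size_t i = line_index(addr);
    if (c->spec_written[i]) return true;
    return req == REQ_WRITE && c->spec_read[i];
}
```

External reads of a line that was only read speculatively are harmless, which is why two threads can both observe the lock without conflicting.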

  11. Committing Speculative Memory State • Commits must appear instantaneously • Cache state and Cache data • Can speculate on state • Can’t speculate on data • Speculative store • Send GETX request • Block already exclusive when speculative writes commit

  12. Evaluation • 3 Systems • CMP, SMP and DSM • Total Store Order • Single Register Checkpoint • 32-entry lock predictor • Run on SimpleMP • Based on SimpleScalar • Benchmarks • SPLASH • mp3d, barnes, cholesky • SPLASH-2 • Radiosity, water-nsq, ocean • Microbenchmark

  13. Microbenchmark Results

  14. Lock acquires/releases elided

  15. SPLASH Performance

  16. Reasons for Performance Gain • Concurrent critical section execution • Reduced observed memory latencies • Locks can remain shared • Reduced memory traffic • No transfer of locks over the bus

  17. Conclusions • Remove unnecessary serialization • Locks do not need to be acquired, but only observed • Control dependence converted to data dependence • No Programmer-based static analysis • But Hardware-based dynamic analysis • No coherence protocol changes • Independent of consistency model • Accesses are atomic in any model • Permit programmers to use conservative synchronization

  18. Questions • Are programmers really off the hook? • Are all critical sections short enough for SLE? • What about the legacy library example? • Will this work for OLTP? • Do all MP papers come from Wisconsin?
