
Efficient Debugging of Shared-Memory Programs Using Timetraveler

Shared-memory programs pose challenges for debugging due to non-deterministic memory races. This paper introduces Timetraveler, a distributed race-recording scheme that efficiently records memory races for deterministic replay. By exploiting the acyclicity of races, Timetraveler significantly reduces log size with minimal hardware overhead, outperforming existing schemes such as Rerun. Its two mechanisms, post-dating and time-delay buffering, order acyclic races and detect cyclic ones, improving repeatability and scalability for debugging shared-memory programs.



Presentation Transcript


  1. Gwendolyn Voskuilen, Faraz Ahmad, and T. N. Vijaykumar Electrical & Computer Engineering ISCA 2010

  2. - Shared-memory programs are hard to debug
  ▪ Due to non-deterministic memory races
- Memory races depend on thread interleaving
  ▪ Read/write by thread A + write by thread B to the same location
- Deterministic replay:
  ▪ Checkpoint the initial program state when recording starts
  ▪ Record races in a log
  ▪ Enforce the same race ordering at replay
Race recording provides repeatability.
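The interleaving dependence above can be shown with a deterministic toy model (invented for illustration, not from the paper): the same two racing operations yield different final values depending on which thread goes first, which is exactly what replay must pin down.

```python
# Toy illustration of a memory race: thread A increments x, thread B
# doubles x. The final value depends entirely on the interleaving,
# so a replay system must enforce the recorded order.

def run(interleaving):
    x = 0
    ops = {'A': lambda v: v + 1,   # thread A: read-modify-write (+1)
           'B': lambda v: v * 2}   # thread B: read-modify-write (*2)
    for thread in interleaving:
        x = ops[thread](x)
    return x

print(run('AB'))  # A before B -> (0 + 1) * 2 = 2
print(run('BA'))  # B before A -> (0 * 2) + 1 = 1
```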

  3. - Record the predecessor–successor ordering of threads involved in a memory race
- Races always involve a write, so leverage coherence
  ▪ A global event (e.g., write invalidation) exists for memory races
- Captures all races – synchronization and data
- Two key overheads:
  ▪ Log size
  ▪ Hardware to track race ordering

  4. - Centralized – Strata [ASPLOS06], DeLorean [ISCA08]
  ▪ Logging/ordering at a central entity
  ▪ DeLorean has a shorter log, but Strata uses less hardware
  ▪ Both are less scalable
- Distributed – FDR [ISCA03], RTR [ASPLOS06], Rerun [ISCA08]
  ▪ Use Lamport clocks with directory coherence
  ▪ All exploit transitivity to reduce logs: avoid recording races made redundant by transitivity
  ▪ Rerun significantly reduces hardware
Our focus – distributed schemes.

  5. - Goal: further reduce log size with minimal hardware
  ▪ Rerun logs 38 GB/hour on 16 2-GHz cores
- Our key novelty: exploit the acyclicity of races
  ▪ Previous schemes record all non-transitive races
  ▪ Timetraveler records only cyclic, non-transitive races

  6. - Two novel and elegant mechanisms
  ▪ Post-dating: correctly orders acyclic races and detects cyclic races via L1 & L2 – no messy cycle-detection hardware (just a 32-bit timestamp per core)
  ▪ Time-delay buffers: avoid false cycles through L2
- Reduce the log by 8x (commercial) and 123x (scientific) over Rerun
- Minimal hardware: two 32-bit timestamps per core + a 696-byte time-delay buffer
- 696 MB/hour on 16 2-GHz cores
Timetraveler significantly reduces the log with minimal, elegant hardware.

  7. - Introduction
- Timetraveler operations
  ▪ Rerun background
  ▪ Post-dating
  ▪ Time-delay buffer
- Results
- Conclusion

  8. - Rerun eliminates per-block timestamps in L1 and L2; it needs only one timestamp per core and per L2 bank
- Rerun divides each thread into atomic sections (episodes)
  ▪ Ends an episode at a race; the successor's timestamp = the predecessor's timestamp + 1 (piggybacked on the coherence message)
  ▪ Logs the length and timestamp of each episode
- In replay, the serial order of episodes is known
- Races fall into two categories [Strata]:
  ▪ Current – block last accessed in another thread's current episode
  ▪ Past – block last accessed in a past episode
  ▪ Distinguished by R/W bits per block (or a Bloom filter)
Past races are implied by transitivity and need not be logged.
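The episode mechanism above can be sketched in software (a hypothetical model of Rerun's hardware; the `Core` and `record_race` names are invented): each core carries one Lamport timestamp, a race ends the predecessor's episode, and the successor's clock advances past it.

```python
# Sketch of Rerun-style episode recording. A current race ends the
# predecessor's episode and orders the successor after it via a
# Lamport timestamp carried on the coherence message.

class Core:
    def __init__(self, cid):
        self.cid = cid
        self.timestamp = 0      # one Lamport timestamp per core
        self.length = 0         # instructions in the current episode
        self.log = []           # logged (timestamp, length) entries

    def execute(self, n=1):
        self.length += n

    def end_episode(self):
        # Log the finished episode; replay re-runs episodes in timestamp order.
        self.log.append((self.timestamp, self.length))
        self.length = 0

def record_race(predecessor, successor):
    """A current race: end the predecessor's episode and advance the
    successor's clock past it (Lamport rule: max + 1)."""
    predecessor.end_episode()
    successor.timestamp = max(successor.timestamp, predecessor.timestamp + 1)

a, b = Core(0), Core(1)
a.execute(5)          # thread A writes a block during its episode
b.execute(3)
record_race(a, b)     # B now accesses that block: A precedes B
b.end_episode()
```

After this run, A's log holds one episode at timestamp 0 and B's one episode at timestamp 1, so replay knows B's episode follows A's.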

  9. [Diagram: Rerun example on a dynamic execution of two threads – timestamps advance from 20 to 27 as current races on blocks A and B end episodes; the execution is split into 2 episodes, producing 2 log entries.]

  10. - Timetraveler logs only current, cyclic races
  ▪ Rerun logs all current races
- Post-dating: upon a current race, the predecessor gives a post-dated timestamp to the successor and guarantees not to exceed it due to future races – without ending its chapter
  ▪ Gives the predecessor breathing room to avoid ending immediately
  ▪ Correctly orders an acyclic successor
  ▪ Detects cycles, which cause the post-dated timestamp to be exceeded
- Minimal hardware over Rerun
Post-dating exploits acyclicity and detects cycles with minimal hardware.
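Post-dating can be illustrated with a small sketch (the `Chapter` class and its fields are invented; the offset of 10 is the value used in the paper's evaluation): the predecessor hands out its current timestamp plus an offset and promises never to exceed it, so an ordering that would break the promise reveals a cycle.

```python
# Sketch of post-dating. On a current race the predecessor hands the
# successor a post-dated timestamp (current + offset) without ending
# its chapter. A later race that would force the chapter past its own
# promise is a cycle, and only such races need logging.

OFFSET = 10  # "breathing room"; the evaluation uses a post-dating offset of 10

class Chapter:
    def __init__(self):
        self.timestamp = 0
        self.promised = None   # highest post-dated TS handed out, if any

    def post_date(self):
        """Predecessor side of a current race: hand out TS + OFFSET."""
        pd = self.timestamp + OFFSET
        self.promised = pd if self.promised is None else max(self.promised, pd)
        return pd

    def receive(self, pd_timestamp):
        """Successor side: order this chapter after the predecessor.
        Returns True iff taking the ordering would break our promise."""
        cyclic = self.promised is not None and pd_timestamp >= self.promised
        if not cyclic:
            # Acyclic race: just advance the clock; nothing is logged.
            self.timestamp = max(self.timestamp, pd_timestamp)
        return cyclic

p, s = Chapter(), Chapter()
pd = p.post_date()            # P -> S race: S receives post-dated TS 10
assert s.receive(pd) == False # acyclic: S simply moves its clock to 10
back = s.post_date()          # S -> P race closes a cycle: TS 20
assert p.receive(back) == True  # would exceed P's promise of 10 -> cycle
```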

  11. [Diagram: Rerun vs. Timetraveler on the same dynamic execution – Rerun ends an episode at each current race (timestamps 20–28, 2 episodes, 2 log entries), while Timetraveler hands out post-dated timestamps (current TS 23 and 20, post-dated TS 45 and 34) and covers the execution with 1 chapter where Rerun needs 2 episodes.]

  12. - Rerun conservatively ends episodes upon replacements/downgrades of current blocks to L2
  ▪ Places a timestamp at L2 for successors
  ▪ Orders a racing successor after the predecessor
- Timetraveler employs post-dating to avoid ending the chapter
  ▪ Places a post-dated timestamp at L2
Post-dating extends chapters beyond replacements.

  13. - Problem: only one timestamp per L2 bank
  ▪ All blocks look recent, even if only a single block was recently accessed and the others long ago
  ▪ Causes false cycles when accessing one of the others (L2 timestamp > thread's post-dated timestamp implies a cycle)
- Solution: buffer the most recently arrived timestamps at L2
  ▪ Delays the update of the L2 timestamp, so the L2 bank retains the old timestamp
  ▪ L2 timestamp < thread's post-dated timestamp implies no cycle
  ▪ Requests get data from L2, and the timestamp from the buffer or L2
  ▪ 8 entries per L2 bank suffice
The time-delay buffer avoids false cycles through L2.
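The time-delay buffer can be sketched as follows (the `L2Bank` class and method names are invented): a small FIFO of recently arrived (block, timestamp) pairs sits in front of the single per-bank timestamp, which is only raised when an entry ages out, so blocks that were not touched recently still see an old, small timestamp.

```python
# Sketch of a time-delay buffer in front of a single per-bank L2
# timestamp. Recently arrived timestamps are held in a small FIFO;
# the coarse bank timestamp is updated only when an entry ages out,
# so untouched blocks do not appear falsely recent.

from collections import OrderedDict

class L2Bank:
    def __init__(self, buffer_entries=8):
        self.bank_ts = 0                # single coarse per-bank timestamp
        self.buffer = OrderedDict()     # block -> recently arrived timestamp
        self.capacity = buffer_entries

    def record_replacement(self, block, timestamp):
        """A current block's timestamp arrives at this L2 bank."""
        self.buffer[block] = timestamp
        self.buffer.move_to_end(block)
        if len(self.buffer) > self.capacity:
            # Oldest entry falls out; only now does it raise the bank TS.
            _, old_ts = self.buffer.popitem(last=False)
            self.bank_ts = max(self.bank_ts, old_ts)

    def lookup_timestamp(self, block):
        """Requests take the precise buffered TS if present, else the bank TS."""
        return self.buffer.get(block, self.bank_ts)

bank = L2Bank(buffer_entries=2)
bank.record_replacement('A', 100)
print(bank.lookup_timestamp('A'))  # 100: the recent block looks recent
print(bank.lookup_timestamp('B'))  # 0: other blocks still look old (no false cycle)
```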

  14. - Introduction
- Timetraveler operations
  ▪ Rerun background
  ▪ Post-dating
  ▪ Time-delay buffer
- Results
- Conclusion

  15. - GEMS + Simics
  ▪ 8 in-order cores, MESI coherence
  ▪ 32 KB split I & D L1 caches, 8 MB 8-bank L2
- Workloads
  ▪ Commercial: Apache, OLTP, SPECjbb2005
  ▪ Scientific: SPLASH Ocean, Raytrace, Water-nsquared
- Timetraveler: R/W bits per L1 block, an 8-entry time-delay buffer per L2 bank, 32-bit timestamps, 16-bit chapter length, post-dating offset = 10
- Rerun: R/W Bloom filters, 32-bit timestamps, 16-bit episode length

  16. [Chart: log growth in bytes per 1K instructions for Rerun, Timetraveler, and Ideal across SpecJBB, Apache, OLTP, Water-nsquared, Ocean, and Raytrace, plus commercial and scientific means – Timetraveler's log is 8x (commercial) and 123x (scientific) smaller than Rerun's.]
Large reduction in log growth due to post-dating; post-dating and the time-delay buffer effectively capture true cycles.

  17. Per-chapter counts:

Benchmark    Current-races  Non-races  Current-block replacements  Total current-races
SpecJBB      0.6            1.1        21.0                        1.7
Apache       1.5            8.0        26.1                        9.5
OLTP         3.4            5.8        12.2                        9.3
Water-n2     2.3            6.4        228.2                       8.7
Ocean        1.8            2.4        5.1                         4.1
Raytrace     2.4            3.9        197.8                       6.3
Mean-com     1.8            4.9        19.8                        6.8
Mean-sci     2.1            4.2        143.7                       6.4

Multiple races occur per chapter; ending on current-block replacements would significantly shorten chapters.

  18. - Timetraveler exploits the acyclicity of races to reduce log size
  ▪ 8x (commercial) and 123x (scientific) reduction over Rerun
- Two novel techniques elegantly exploit acyclicity and detect cycles
  ▪ Post-dating
  ▪ Time-delay buffer
- Introduces minimal hardware
  ▪ Two timestamps per core
  ▪ A 696-byte time-delay buffer
With CMPs on the rise and debugging increasingly important, Timetraveler is valuable.

  19. Gwendolyn Voskuilen, Faraz Ahmad, and T. N. Vijaykumar Electrical & Computer Engineering ISCA 2010

  20. - Two requirements for replay:
  ▪ All original races must occur in replay
  ▪ No new races (not seen originally) may occur
- Replay need not be terribly fast, but cannot be terribly slow
  ▪ Thus the simplest scheme is sequential replay of chapters
  ▪ Can leverage speculation for faster replay
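Sequential replay can be sketched in a few lines (the log layout `(timestamp, core, length)` is an assumption): because chapter timestamps give a serial order consistent with every recorded race, replaying chapters one at a time in timestamp order satisfies both requirements above.

```python
# Sketch of sequential replay: run each logged chapter to completion,
# in timestamp order, one at a time. No two chapters overlap, so the
# original races recur in order and no new races can appear.

def sequential_replay(log, run_chapter):
    """Replay chapters serially in timestamp order."""
    for ts, core, length in sorted(log):
        run_chapter(core, length)   # execute `length` instructions on `core`

order = []
log = [(1, 1, 3), (0, 0, 5), (2, 0, 4)]
sequential_replay(log, lambda core, n: order.append((core, n)))
print(order)  # [(0, 5), (1, 3), (0, 4)]
```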
