0 likes | 13 Vues
Shared-memory programs pose challenges for debugging due to non-deterministic memory races. This paper introduces Timetraveler, a novel approach that efficiently records and replays memory races in distributed systems. By exploiting the acyclicity of races, Timetraveler significantly reduces log size with minimal hardware overhead, outperforming existing schemes like Rerun. The mechanisms of post-dating and time-delay buffering help in detecting and ordering races, enhancing repeatability and scalability for debugging shared-memory programs.
E N D
Gwendolyn Voskuilen, Faraz Ahmad, and T. N. Vijaykumar Electrical & Computer Engineering ISCA 2010
Shared-memory programs are hard to debug Due to non-deterministic memory races Memory races depend on thread interleaving ▪ Read/write by thread A + write by thread B to same location Deterministic replay Check-point initial program state at recording start Record races in a log Enforce same race ordering at replay Race recording provides repeatability Gwendolyn Voskuilen et al. 2
Record predecessor-successor ordering of threads involved in a memory race Races always involve a write leverage coherence ▪ Global event (e.g., write invalidation) for memory races Captures all races – synchronization and data Two key overheads Log size Hardware to track race ordering Gwendolyn Voskuilen et al. 3
Centralized - Strata [ASPLOS06], DeLorean [ISCA08] Logging/ordering at a central entity DeLorean has shorter log but Strata uses less hardware Both less scalable Distributed - FDR [ISCA03], RTR [ASPLOS06], Rerun [ISCA08] Use Lamport clocks with directory coherence All exploit transitivity to reduce logs ▪ Avoid recording races made redundant by transitivity Rerun significantly reduces hardware Our focus – distributed schemes Gwendolyn Voskuilen et al. 4
Goal: further reduce log size with minimal hardware Rerun logs 38 GB/hour on 16 2-GHz cores Our key novelty: Exploit acyclicity of races Previous schemes record all non-transitive races Timetraveler records only cyclic, non-transitive races Gwendolyn Voskuilen et al. 5
Two novel and elegant mechanisms Post-dating : correctly orders acyclic races and detects cyclic races via L1 & L2 ▪ No messy cycle detection hardware (just a 32-bit timestamp/core) Time-delay buffers: avoids false cycles through L2 Reduce log by 8x (commercial) & 123x (scientific) over Rerun Minimal hardware: 2 32-bit timestamps/core + 696-byte time- delay 696 MB/hour on 16 2-GHz cores Timetraveler significantly reduces log with minimal, elegant hardware Gwendolyn Voskuilen et al. 6
Introduction Timetraveler operations Rerun background Post-dating Time-delay buffer Results Conclusion Gwendolyn Voskuilen et al. 7
Rerun eliminates per-block timestamps in L1 and L2 needs only one timestamp per core/L2 bank Rerun divides thread into atomic sections (episodes) Ends episode at a race; successor’s timestamp = predecessor timestamp+1 (piggybacked on coherence message) Logs length and timestamp of episode In replay, the serial order of episodes is known Races fall in two categories [Strata]: Current – block last accessed in another thread’s current episode Past – block last accessed in a past episode Distinguished by R/W bit per block (or Bloom filter) Past races are implied by transitivity, need not be logged Gwendolyn Voskuilen et al. 8
Timestamp: 23 24 27 20 25 26 (A,B) 24 A? 26 Dynamic Execution A? B? Episodes: 2 2 log entries Gwendolyn Voskuilen et al. 9
Timetraveler logs only current, cyclic races Rerun logs all current races Post-dating Upon current race, predecessor gives post-dated timestamp to successor, guarantees not to exceed it due to future races ▪ Without ending ▪ Breathing room for predecessor to avoid ending immediately ▪ Correctly orders acyclic successor ▪ Detects cycles causing post-dated timestamp to be exceeded Minimal hardware over Rerun Postdating exploits acyclicity & detects cycles with minimal hardware Gwendolyn Voskuilen et al. 10
Rerun Timetraveler 23 45 20 34 Current TS: Post-dated TS: Timestamp: 23 24 27 20 25 26 28 (A,B) --- 33 --- --- 44 (A,B) A? 33 Dynamic Execution A? 44 Dynamic Execution A? A? B? B? 1 chapter 2 episodes Gwendolyn Voskuilen et al. 11
Rerun conservatively ends episodes upon replacements/downgrades of current blocks to L2 Places timestamp at L2 for successors Orders racing successor after predecessor Timetraveler employs post-dating to avoid ending Places post-dated timestamp at L2 Postdating extends chapters beyond replacements Gwendolyn Voskuilen et al. 12
Problem: Only one timestamp per L2 bank All blocks look recent, even if only a single block recently accessed and others accessed long ago Causes false cycles when accessing one of the others ▪ L2 timestamp > thread’s post-dated timestamp cycle Solution: Buffer most-recently arrived timestamps at L2 Delays update of L2 timestamp so L2 bank retains old timestamp L2 timestamp < thread’s post-dated timestamp no cycle Requests get data from L2, timestamp from buffer or L2 8 entries per L2 bank suffice Time-delay buffer avoids false cycles through L2 Gwendolyn Voskuilen et al. 13
Introduction Timetraveler operations Rerun background Post-dating Time-delay buffer Results Conclusion Gwendolyn Voskuilen et al. 14
GEMS + Simics 8 in-order cores, MESI coherence 32 KB split I & D, 8 MB 8 bank L2 Workloads Commercial: Apache, OLTP, SpecJBB 2005 Scientific: SPLASH Ocean, Raytrace, Water-nsquared Timetraveler R/W bits per L1 block, 8-entry time-delay buffer per L2 bank, 32-bit timestamps, 16-bit chapter length, postdating offest = 10 Rerun R/W bloom filters, 32-bit timestamps, 16-bit episode length Gwendolyn Voskuilen et al. 15
5 Log growth (bytes / 1K instructions) Rerun Timetraveler Ideal 4.5 4 3.5 3 2.5 2 1.5 8x 123x 1 0.5 0 SpecJBB Apache OLTP Water nsquared Ocean Raytrace Mean - com Mean - sci Large reduction in log growth due to post-dating Post-dating & time-delay buffer effectively capture true cycles Gwendolyn Voskuilen et al. 16
Current races Total current- races per chapter Current-block replacements Current-races Non-races Benchmarks Specjbb 0.6 1.1 21.0 1.7 Apache 1.5 8.0 26.1 9.5 OLTP 3.4 5.8 12.2 9.3 Water-n2 2.3 6.4 228.2 8.7 Ocean 1.8 2.4 5.1 4.1 Raytrace 2.4 3.9 197.8 6.3 Mean-com 1.8 4.9 19.8 6.8 Mean-sci 2.1 4.2 143.7 6.4 Multiple races per chapter Ending on current-block replacements would significantly shorten chapters Gwendolyn Voskuilen et al. 17
Timetraveler exploits acyclicity of races to reduce log size 8X (commercial) & 123X (scientific) reduction over Rerun Two novel techniques elegantly exploit and detect cycles Post-dating Time-delay buffer Introduces minimal hardware Two timestamps per core 696 byte time-delay buffer CMPs on the rise + debugging important Timetraveler valuable Gwendolyn Voskuilen et al. 18
Gwendolyn Voskuilen, Faraz Ahmad, and T. N. Vijaykumar Electrical & Computer Engineering ISCA 2010
Two requirements for replay All original races must occur in replay No new races (not seen originally) may occur Replay need not be terribly fast but cannot be terribly slow Thus simplest scheme is sequential replay of chapters Can leverage speculation for faster replay Gwendolyn Voskuilen et al. 20