
Efficient Debugging of Shared-Memory Programs Using Timetraveler

Shared-memory programs pose challenges for debugging due to non-deterministic memory races. This paper introduces Timetraveler, a distributed race-recording scheme that efficiently records memory races for deterministic replay. By exploiting the acyclicity of races, Timetraveler significantly reduces log size with minimal hardware overhead, outperforming existing schemes such as Rerun. Its two mechanisms, post-dating and time-delay buffering, order acyclic races and detect cyclic ones, improving repeatability and scalability for debugging shared-memory programs.



Presentation Transcript


  1. Gwendolyn Voskuilen, Faraz Ahmad, and T. N. Vijaykumar Electrical & Computer Engineering ISCA 2010

  2. - Shared-memory programs are hard to debug
  ▪ Due to non-deterministic memory races
- Memory races depend on thread interleaving
  ▪ Read/write by thread A + write by thread B to the same location
- Deterministic replay:
  ▪ Checkpoint the initial program state when recording starts
  ▪ Record races in a log
  ▪ Enforce the same race ordering at replay
Race recording provides repeatability.
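The interleaving dependence above can be shown with a deterministic toy model (invented for illustration, not from the paper): the same two racing operations yield different final values depending on which thread goes first, which is exactly what replay must pin down.

```python
# Toy illustration of a memory race: thread A increments x, thread B
# doubles x. The final value depends entirely on the interleaving,
# so a replay system must enforce the recorded order.

def run(interleaving):
    x = 0
    ops = {'A': lambda v: v + 1,   # thread A: read-modify-write (+1)
           'B': lambda v: v * 2}   # thread B: read-modify-write (*2)
    for thread in interleaving:
        x = ops[thread](x)
    return x

print(run('AB'))  # A before B -> (0 + 1) * 2 = 2
print(run('BA'))  # B before A -> (0 * 2) + 1 = 1
```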

  3. - Record the predecessor–successor ordering of threads involved in a memory race
- Races always involve a write, so leverage coherence
  ▪ A global event (e.g., write invalidation) exists for memory races
- Captures all races – synchronization and data
- Two key overheads:
  ▪ Log size
  ▪ Hardware to track race ordering

  4. - Centralized – Strata [ASPLOS06], DeLorean [ISCA08]
  ▪ Logging/ordering at a central entity
  ▪ DeLorean has a shorter log, but Strata uses less hardware
  ▪ Both are less scalable
- Distributed – FDR [ISCA03], RTR [ASPLOS06], Rerun [ISCA08]
  ▪ Use Lamport clocks with directory coherence
  ▪ All exploit transitivity to reduce logs: avoid recording races made redundant by transitivity
  ▪ Rerun significantly reduces hardware
Our focus – distributed schemes.

  5. - Goal: further reduce log size with minimal hardware
  ▪ Rerun logs 38 GB/hour on 16 2-GHz cores
- Our key novelty: exploit the acyclicity of races
  ▪ Previous schemes record all non-transitive races
  ▪ Timetraveler records only cyclic, non-transitive races

  6. - Two novel and elegant mechanisms
  ▪ Post-dating: correctly orders acyclic races and detects cyclic races via L1 & L2 – no messy cycle-detection hardware (just a 32-bit timestamp per core)
  ▪ Time-delay buffers: avoid false cycles through L2
- Reduce the log by 8x (commercial) and 123x (scientific) over Rerun
- Minimal hardware: two 32-bit timestamps per core + a 696-byte time-delay buffer
- 696 MB/hour on 16 2-GHz cores
Timetraveler significantly reduces the log with minimal, elegant hardware.

  7. - Introduction
- Timetraveler operations
  ▪ Rerun background
  ▪ Post-dating
  ▪ Time-delay buffer
- Results
- Conclusion

  8. - Rerun eliminates per-block timestamps in L1 and L2; it needs only one timestamp per core and per L2 bank
- Rerun divides each thread into atomic sections (episodes)
  ▪ Ends an episode at a race; the successor's timestamp = the predecessor's timestamp + 1 (piggybacked on the coherence message)
  ▪ Logs the length and timestamp of each episode
- In replay, the serial order of episodes is known
- Races fall into two categories [Strata]:
  ▪ Current – block last accessed in another thread's current episode
  ▪ Past – block last accessed in a past episode
  ▪ Distinguished by R/W bits per block (or a Bloom filter)
Past races are implied by transitivity and need not be logged.
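The episode mechanism above can be sketched in software (a hypothetical model of Rerun's hardware; the `Core` and `record_race` names are invented): each core carries one Lamport timestamp, a race ends the predecessor's episode, and the successor's clock advances past it.

```python
# Sketch of Rerun-style episode recording. A current race ends the
# predecessor's episode and orders the successor after it via a
# Lamport timestamp carried on the coherence message.

class Core:
    def __init__(self, cid):
        self.cid = cid
        self.timestamp = 0      # one Lamport timestamp per core
        self.length = 0         # instructions in the current episode
        self.log = []           # logged (timestamp, length) entries

    def execute(self, n=1):
        self.length += n

    def end_episode(self):
        # Log the finished episode; replay re-runs episodes in timestamp order.
        self.log.append((self.timestamp, self.length))
        self.length = 0

def record_race(predecessor, successor):
    """A current race: end the predecessor's episode and advance the
    successor's clock past it (Lamport rule: max + 1)."""
    predecessor.end_episode()
    successor.timestamp = max(successor.timestamp, predecessor.timestamp + 1)

a, b = Core(0), Core(1)
a.execute(5)          # thread A writes a block during its episode
b.execute(3)
record_race(a, b)     # B now accesses that block: A precedes B
b.end_episode()
```

After this run, A's log holds one episode at timestamp 0 and B's one episode at timestamp 1, so replay knows B's episode follows A's.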

  9. [Diagram: Rerun example on a dynamic execution of two threads – timestamps advance from 20 to 27 as current races on blocks A and B end episodes; the execution is split into 2 episodes, producing 2 log entries.]

  10. - Timetraveler logs only current, cyclic races
  ▪ Rerun logs all current races
- Post-dating: upon a current race, the predecessor gives a post-dated timestamp to the successor and guarantees not to exceed it due to future races – without ending its chapter
  ▪ Gives the predecessor breathing room to avoid ending immediately
  ▪ Correctly orders an acyclic successor
  ▪ Detects cycles, which cause the post-dated timestamp to be exceeded
- Minimal hardware over Rerun
Post-dating exploits acyclicity and detects cycles with minimal hardware.
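Post-dating can be illustrated with a small sketch (the `Chapter` class and its fields are invented; the offset of 10 is the value used in the paper's evaluation): the predecessor hands out its current timestamp plus an offset and promises never to exceed it, so an ordering that would break the promise reveals a cycle.

```python
# Sketch of post-dating. On a current race the predecessor hands the
# successor a post-dated timestamp (current + offset) without ending
# its chapter. A later race that would force the chapter past its own
# promise is a cycle, and only such races need logging.

OFFSET = 10  # "breathing room"; the evaluation uses a post-dating offset of 10

class Chapter:
    def __init__(self):
        self.timestamp = 0
        self.promised = None   # highest post-dated TS handed out, if any

    def post_date(self):
        """Predecessor side of a current race: hand out TS + OFFSET."""
        pd = self.timestamp + OFFSET
        self.promised = pd if self.promised is None else max(self.promised, pd)
        return pd

    def receive(self, pd_timestamp):
        """Successor side: order this chapter after the predecessor.
        Returns True iff taking the ordering would break our promise."""
        cyclic = self.promised is not None and pd_timestamp >= self.promised
        if not cyclic:
            # Acyclic race: just advance the clock; nothing is logged.
            self.timestamp = max(self.timestamp, pd_timestamp)
        return cyclic

p, s = Chapter(), Chapter()
pd = p.post_date()            # P -> S race: S receives post-dated TS 10
assert s.receive(pd) == False # acyclic: S simply moves its clock to 10
back = s.post_date()          # S -> P race closes a cycle: TS 20
assert p.receive(back) == True  # would exceed P's promise of 10 -> cycle
```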

  11. [Diagram: Rerun vs. Timetraveler on the same dynamic execution – Rerun ends an episode at each current race (timestamps 20–28, 2 episodes, 2 log entries), while Timetraveler hands out post-dated timestamps (current TS 23 and 20, post-dated TS 45 and 34) and covers the execution with 1 chapter where Rerun needs 2 episodes.]

  12. - Rerun conservatively ends episodes upon replacements/downgrades of current blocks to L2
  ▪ Places a timestamp at L2 for successors
  ▪ Orders a racing successor after the predecessor
- Timetraveler employs post-dating to avoid ending the chapter
  ▪ Places a post-dated timestamp at L2
Post-dating extends chapters beyond replacements.

  13. - Problem: only one timestamp per L2 bank
  ▪ All blocks look recent, even if only a single block was recently accessed and the others long ago
  ▪ Causes false cycles when accessing one of the others (L2 timestamp > thread's post-dated timestamp implies a cycle)
- Solution: buffer the most recently arrived timestamps at L2
  ▪ Delays the update of the L2 timestamp, so the L2 bank retains the old timestamp
  ▪ L2 timestamp < thread's post-dated timestamp implies no cycle
  ▪ Requests get data from L2, and the timestamp from the buffer or L2
  ▪ 8 entries per L2 bank suffice
The time-delay buffer avoids false cycles through L2.
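The time-delay buffer can be sketched as follows (the `L2Bank` class and method names are invented): a small FIFO of recently arrived (block, timestamp) pairs sits in front of the single per-bank timestamp, which is only raised when an entry ages out, so blocks that were not touched recently still see an old, small timestamp.

```python
# Sketch of a time-delay buffer in front of a single per-bank L2
# timestamp. Recently arrived timestamps are held in a small FIFO;
# the coarse bank timestamp is updated only when an entry ages out,
# so untouched blocks do not appear falsely recent.

from collections import OrderedDict

class L2Bank:
    def __init__(self, buffer_entries=8):
        self.bank_ts = 0                # single coarse per-bank timestamp
        self.buffer = OrderedDict()     # block -> recently arrived timestamp
        self.capacity = buffer_entries

    def record_replacement(self, block, timestamp):
        """A current block's timestamp arrives at this L2 bank."""
        self.buffer[block] = timestamp
        self.buffer.move_to_end(block)
        if len(self.buffer) > self.capacity:
            # Oldest entry falls out; only now does it raise the bank TS.
            _, old_ts = self.buffer.popitem(last=False)
            self.bank_ts = max(self.bank_ts, old_ts)

    def lookup_timestamp(self, block):
        """Requests take the precise buffered TS if present, else the bank TS."""
        return self.buffer.get(block, self.bank_ts)

bank = L2Bank(buffer_entries=2)
bank.record_replacement('A', 100)
print(bank.lookup_timestamp('A'))  # 100: the recent block looks recent
print(bank.lookup_timestamp('B'))  # 0: other blocks still look old (no false cycle)
```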

  14. - Introduction
- Timetraveler operations
  ▪ Rerun background
  ▪ Post-dating
  ▪ Time-delay buffer
- Results
- Conclusion

  15. - GEMS + Simics
  ▪ 8 in-order cores, MESI coherence
  ▪ 32 KB split I & D L1 caches, 8 MB 8-bank L2
- Workloads
  ▪ Commercial: Apache, OLTP, SPECjbb2005
  ▪ Scientific: SPLASH Ocean, Raytrace, Water-nsquared
- Timetraveler: R/W bits per L1 block, an 8-entry time-delay buffer per L2 bank, 32-bit timestamps, 16-bit chapter length, post-dating offset = 10
- Rerun: R/W Bloom filters, 32-bit timestamps, 16-bit episode length

  16. [Chart: log growth in bytes per 1K instructions for Rerun, Timetraveler, and Ideal across SpecJBB, Apache, OLTP, Water-nsquared, Ocean, and Raytrace, plus commercial and scientific means – Timetraveler's log is 8x (commercial) and 123x (scientific) smaller than Rerun's.]
Large reduction in log growth due to post-dating; post-dating and the time-delay buffer effectively capture true cycles.

  17. Per-chapter counts:

Benchmark    Current-races  Non-races  Current-block replacements  Total current-races
SpecJBB      0.6            1.1        21.0                        1.7
Apache       1.5            8.0        26.1                        9.5
OLTP         3.4            5.8        12.2                        9.3
Water-n2     2.3            6.4        228.2                       8.7
Ocean        1.8            2.4        5.1                         4.1
Raytrace     2.4            3.9        197.8                       6.3
Mean-com     1.8            4.9        19.8                        6.8
Mean-sci     2.1            4.2        143.7                       6.4

Multiple races occur per chapter; ending on current-block replacements would significantly shorten chapters.

  18. - Timetraveler exploits the acyclicity of races to reduce log size
  ▪ 8x (commercial) and 123x (scientific) reduction over Rerun
- Two novel techniques elegantly exploit acyclicity and detect cycles
  ▪ Post-dating
  ▪ Time-delay buffer
- Introduces minimal hardware
  ▪ Two timestamps per core
  ▪ A 696-byte time-delay buffer
With CMPs on the rise and debugging increasingly important, Timetraveler is valuable.

  19. Gwendolyn Voskuilen, Faraz Ahmad, and T. N. Vijaykumar Electrical & Computer Engineering ISCA 2010

  20. - Two requirements for replay:
  ▪ All original races must occur in replay
  ▪ No new races (not seen originally) may occur
- Replay need not be terribly fast, but cannot be terribly slow
  ▪ Thus the simplest scheme is sequential replay of chapters
  ▪ Can leverage speculation for faster replay
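Sequential replay can be sketched in a few lines (the log layout `(timestamp, core, length)` is an assumption): because chapter timestamps give a serial order consistent with every recorded race, replaying chapters one at a time in timestamp order satisfies both requirements above.

```python
# Sketch of sequential replay: run each logged chapter to completion,
# in timestamp order, one at a time. No two chapters overlap, so the
# original races recur in order and no new races can appear.

def sequential_replay(log, run_chapter):
    """Replay chapters serially in timestamp order."""
    for ts, core, length in sorted(log):
        run_chapter(core, length)   # execute `length` instructions on `core`

order = []
log = [(1, 1, 3), (0, 0, 5), (2, 0, 4)]
sequential_replay(log, lambda core, n: order.append((core, n)))
print(order)  # [(0, 5), (1, 3), (0, 4)]
```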
