Dongyoon Lee † , Mahmoud Said, Satish Narayanasamy † , and Zijiang James Yang

Offline Symbolic Analysis toInfer Total Store Order Dongyoon Lee†, Mahmoud Said*, SatishNarayanasamy†, and Zijiang James Yang* University of Michigan, Ann Arbor † Western Michigan University *

Deterministic Replay What is deterministic replay? • Record and reproduce non-deterministic events • Program input (interrupt, I/O, DMA, etc.) • Shared-memory dependencies Deterministic replay uses 1) Debugging • Reproducing concurrency bugs • Time-travel debugging 2) Heavyweight dynamic analysis 3) Forensics

Traditional Deterministic Replay Systems Thread 1 Thread 2 Thread 3 Checkpoint Memory and Register State Log non-deterministic program input Interrupts, I/O values, DMA, etc. Write Read Read Log shared memory dependencies

Recording Shared-Memory Dependencies Problem • Need to monitor every memory operation Software-based replay systems • PinSEL(Intel), iDNA(Microsoft) • ODR (Berkeley), PRES (UCSD) • Respec/DoublePlay(Michigan) Hardware-based replay systems • RTR/ReRun(Wisconsin) • Strata (UCSD) • DeLorean(UIUC) • LReplay(CAS) • Timetraveler (Purdue) → 10-100x → Replay is not guaranteed → 2x throughput overhead : Total Store Order → Complex because precise recorder → Only few systems support relaxed consistency model

Hardware Complexity in Previous Solutions Support for precisely logging shared-memory dependencies • Monitor and log cache coherence messages Support for Total Store Order (TSO) model • Detect sequential consistency (SC) violation • Log SC-violatingloads Complexity • Require invasive changes to coherence sub-system • Complex to design and verify • 9 design bugs in coherence mechanism of AMD64 [Narayanasamy et al. ICCD’06] RTR [Xu et al. ASPLOS’06]

Our Approach Complexity-effective solution [MICRO’09] • Do NOT record shared-memory dependencies at all • Infer shared-memory dependencies offline • UsingSatisfiability Modulo Theory (SMT) solver Contribution • Find a causal order compliant to Total Store Order • Bound search space under TSO

System Overview Thread 1 Thread 2 Thread 3 Checkpoint Memory and Registers Checkpoint Registers BugNet[ISCA’05] Load-based Hardware Recorder Log non-deterministic program input Interrupts, I/O values, DMA, etc. Write Read Read Satisfiability-Modulo-Theory (SMT) solver reconstructs thread interleaving offline Log shared memory dependency

Roadmap • Motivation • Background: Load-based logging architecture [ISCA’05] T1 T2 T3 Checkpoint Registers BugNet[ISCA’05] Load-based Hardware Recorder Write Read Read Satisfiability-Modulo-Theory (SMT) solver reconstructs interleaving offline • Offline Symbolic Analysis • Bounding Search Space • Evaluation • Conclusion

Load-based Logging Architecture [Narayanasamy et al, ISCA’05] [Lee et al. MICRO’09] Insight • Recording initial register state and values of loads is sufficient for deterministic replay • Implicitly captures the program input from I/O, DMA, interrupts, etc. • Input and output of other instructions are reproduced during replay Optimization • Record a load only if it is the first access to a memory location Processor support • Recording data fetched on a cache miss captures first loads • Any first access to a location would result in a cache miss • May unnecessarily record data due to store misses, but that is OK

Recording Cache Miss Data (First Loads) • Checkpoint • Register Values • Program Counter Log file Checkpoint • Record cache misses • <Memory count , Data> • Implicitly capture first loads Load A = 0 <cnt1, 0> Load B = 5 <cnt2, 5> • Deterministic Replay • Input and output (including • address) of all instructions • are replayed Load A = 0 <cnt3, 0> Store C = 1 • On a store miss • Record old value – data before • store update • New value – data after store • update – can be reproduced • deterministically Execution Time First Load Cache Miss

Cache Miss Logging for Multithreaded Programs Insight • Load-based recorder (initial register state + loads) for each thread is sufficient for replaying that thread => Recording cache miss data is sufficient for multithreaded programs => No additional hardware support required for recording dependencies Reason • Load dependent on a remote write will cause a cache miss to ensure coherence => Implicitly records load values dependent on remote writes Effect • Canreplay each thread in isolation (independent of other threads) regardless of underlying consistency model

Replaying Each Thread Independently Proc 1 LOG Proc 2 LOG Proc 2 Proc 1 • Cache Coherence • Invalidates cache block • to gain exclusive • permission Load A=0 (1st, 0) Load A=0 Store A=1 (1st, 0) • Log cache miss data • Implicitly records loads • dependent on remote • writes • No change to • coherence mechanism Invalidation Cache Block Invalidated 1 (3rd, 1) Load A= Replay each thread independent of others Cache Miss

Shared Memory Dependency x Old value Address Thread 1 Thread 2 New value Address B A Load Old value A Store ? A New value B Store Load • Billions of instructions • Offline analysis • may not scale C C Store Store A Store Store Final State : A, B, C SMT Solver for finding shared memory dependencies Strata for bounding search space

Roadmap • Motivation • Load-based Logging Architecture • Offline Symbolic Analysis T1 T2 T3 Checkpoint Registers BugNet[ISCA’09] Load-based Hardware Recorder Write Read Read Satisfiability-Modulo-Theory (SMT) solver reconstructs interleaving offline • Bounding Search Space • Evaluation • Conclusion

Offline Symbolic Analysis Step 1) Encode Ordering Constraints • Coherence Constraints • Memory Model Constraints (SC, TSO) Coherence Constraints (X1 < X2) AND (X2 < X4 OR X5 < X2) AND … Memory Model Constraints (Y1 < X1 < X2 < Y2) AND (X3 < X4 < X5 < Y3) AND …. y y x x 3 3 1 1 x x x x 1 1 4 4 x x x x 2 2 5 5 y y y y 2 2 3 3 Final State Step 2) Find a valid casual order • Satisfiability Module Theory (SMT) Solver • Yices[Dutertre and Moura CAV’06]

Offline Symbolic Analysis ` Step 1) Encode Ordering Constraints • Coherence Constraints • Memory Model Constraints (SC, TSO) Coherence Constraints (X1 < X2) AND (X2 < X4 OR X5 < X2) AND … Memory Model Constraints (Y1 < X1 < X2 < Y2) AND (X3 < X4 < X5 < Y3) AND …. y y x x 3 3 1 1 x x x x 1 1 4 4 x x x x 2 2 5 5 y y y y 2 2 3 3 Final State Step 2) Find a valid casual order • Satisfiability Module Theory (SMT) Solver • Yices[Dutertre and Moura CAV’06]

Encoding Coherence Constraints Proc 1 Proc 2 x Coherence Constraints ( M→old == M→prev→new) X1: X1 < X3 AND X2: (X3 < X2 < X4 x 1 3 x x 2 • OR X5 < X2) AND 4 x 5 x Final x Old value Address New value

Multiple Memory Locations Proc 1 Proc 2 Coherence Constraints ( M→old == M→prev→new) X1: X1 < X3 AND X2: (X3 < X2 < X4 OR X5 < X2) AND : Y1: Y1 < Y2 AND Y2: Y2 < Y3 AND : x x y 1 3 1 x x 2 4 x 5 y y 2 3 x y Final Final x Old value Address New value

Offline Symbolic Analysis ` Step 1) Encode Ordering Constraints • Coherence Constraints • Memory Model Constraints (SC, TSO) Coherence Constraints (X1 < X2) AND (X2 < X4 OR X5 < X2) AND … Memory Model Constraints (Y1 < X1 < X2 < Y2) AND (X3 < X4 < X5 < Y3) AND …. y y x x 3 3 1 1 x x x x 1 1 4 4 x x x x 2 2 5 5 y y y y 2 2 3 3 Final State Step 2) Find a valid casual order • Satisfiability Module Theory (SMT) Solver • Yices[Dutertre and Moura CAV’06]

Encoding Sequential Consistency Constraints Coherence Constraints ( M→old== M→prev→new) X1: X1 < X3 AND X2: (X3 < X2 < X4 OR X5 < X2) AND : Y1: Y1 < Y2 AND Y2: Y1 < Y2 < Y3 AND : Proc 1 Proc 2 x x 1 y 3 1 x x 2 4 x 5 Memory Model Constraints (under Sequential Consistency) P1 : Y1 < X1 < X2 < Y2 AND P2 : X3 < X4 < X5 < Y3 AND y y 2 3 x y Final Final

Total Store Order Relaxations Alice = Bob = 0 Stores Store Buffer Hit Store→Load P1 P2 SC Alice = 1 Bob = 1 TSO r1 = Bob r2 = Alice r1 = r2 = 0 Alice = Bob = Charlie = 0 Load →Load P2 P1 SC Alice = 1 Bob = 1 r3 = Charlie Charlie = 1 TSO r1 = Bob r2 = Alice r1 = r2 = 0, r3 = 1 Log memory counts of LoadSB-Hit Alice = Bob = 0 0 0 LoadSB-Hit→Load P1 P2 Alice = 1 • 1 Bob = 1 • 1 SC r3 = Alice r4= Bob TSO r1 = Bob r2 = Alice r1 = r2 = 0, r3 = r4 = 1

Encoding Total Store Order Constraints Stores Store Buffer Hit SMT Solver Encoded TSO constraints Derived TSO schedule y y 1 x x x x 1 1 1 3 3 x x x x 2 2 4 4 Store LoadSB-Hit x x y y y y 5 5 2 3 2 Load 3 Load x x y y Final Final Final Final AND X1 < Y2 P1 : Y1 < X1 < X2 < Y2 P2 : X3 < X4 < X5 < Y3 AND X3 < Y3

Roadmap • Motivation • BugNet • Offline Symbolic Analysis • Bounding Search Space • Evaluation • Conclusion

Bounding Search Space using Strata Proc 1 Strata • After every N broadcasted coherence messages, each processor records “Strata Hints” Strata regions are ordered SMT Solver • Analyzes one Strata region at a time • Starts from the last Strata region • Initial state of a Strata region = Final state of preceding region Proc 2 Coherence messages Strata Region 1 Nmsgs Final State hint 1 hint 2 Stratum Strata Region 2 Initial State Nmsgs Final State hint 3 hint 4 Strata Region 3 Stratum Initial State Final State Final State

Strata Hint = Mem. Count + I-Store Count Strata hint • Every processor records • memory count • number of in-flight stores in the store buffer (called I-Store) • Reconstruct Stratum offline based on the memory count • Move the pending stores and dependent loadsto the next Strata region MemCnt. MemCnt. 1 0 1 0 Alice = Bob = 0 P1 P2 I-Store Cnt. I-Store Cnt. P1 P2 1 0 0 1 Alice = 1 Bob = 1 … remains in store-buffer Alice = 1 Bob = 1 Log {1,1} • Log • {1,1} {1,1} {1,1} r1 = Bob r2 = Alice … executed, not committed yet r1 = Bob r2 = Alice r1 = r2 = 0 r1 = r2 = 0 < Recording Strata Hints > < Reconstructing Strata Offline>

Filtering Local, Read-only & Cache-hit Accesses Thread 1 Thread 2 Local accesses • No shared-memory dependencies Read-only accesses • Any order is valid Intermediate cache-hit accesses • No interleaved accesses on successive cache hits Effectiveness • < 1%of memory accesses remain Load A Load C Store A Load B Load C Store B Load C Load C Load C Load D Store D … Miss Store D … Hit Load D … Hit Store D … Hit Store D Strata Region

Roadmap • Motivation • Load-based Logging Architecture • Offline Symbolic Analysis • Evaluation • Conclusion

Evaluation Simulation Framework • Simics: simulate multi-processor execution (2, 4, 8,16 cores) • Modified FeS2 : simulate Total Store Order • Fast-forward up to known synchronization points • Simulated for 500 million instructions Benchmarks • SPLASH2: barnes, fmm, ocean • PARSEC2.0: blackscholes, bodytrack, x264 • SPEComp: wupwise, swim • Servers: apache, mysql Offline Symbolic Analysis • Yices SMT solver [Dutertre and Moura CAV’06]

Program Input and LoadSB-Hit Log Size Program Input (Data & Instr. Cache Misses) LoadSB-Hit 192MB/sec • On average, 192 MB/sec (8 threads, TSO) • Dominated by cache miss logging • 2.45% of load instructions read their values from the store buffer

Strata Hint Log Size SC TSO (with I-Stores) 1.2MB/sec • On average, 1.2MB/sec (8 threads, TSO, b-bound 10) • 15% increase over SC to log the number of pending stores (I-Store) • No hardware support for precise shared-memory dependency logging

Offline Analysis Overhead SC TSO 260secs • On average, 260 seconds/sec (8 threads, TSO, b-bound 10) • Overall, 30% more efficient than SC • One time cost before replay

Performance Overhead etc. Performance Overhead • Performance is mainly dominated by cache miss logging • Additional logging overhead for LoadSBH and I-Store is small • On average, <1% slowdown in IPC Paper includes more results on • Comparisons between different Strata bounding schemes • Effectiveness of cache hit filtering • Scalability results on different number of processors • …

Conclusion Reproducing concurrent executions is a huge challenge Our proposal: a complexity-effective hardware solution for TSO • Record cache miss data and store buffer hits • Record Stratum (memory count + store buffer count) to bound search space • Determine shared memory dependencies using offline symbolic analysis Result (8 threads, TSO) • Performance overhead: less than 1% • Total log size: 193 MB/sec (Program input + Strata hints) • Offline analysis: 260 seconds/sec(30% more efficient than SC)

Thank you

Dongyoon Lee † , Mahmoud Said, Satish Narayanasamy † , and Zijiang James Yang