Hardware Memory Race Recording for Deterministic Replay

Mark D. Hill, University of Wisconsin-Madison, August 10, 2007. Based on joint work with Min Xu & Ras Bodik: ISCA 2003, ASPLOS 2006, IEEE Micro Top Picks 2007, & Xu UW Ph.D. 5/2006 (slides updated from defense talk).

Wisconsin Multifacet Project • Seek improved architectures for (mostly) servers that are (mostly) chip multiprocessors (CMPs, multi-core) • Led by Mark Hill & David Wood • LogTM work w/ Ben Liblit & Mike Swift • Funding • Grants from U.S. National Science Foundation • Donations from Intel and Sun

Selected Multifacet Results (1 of 2) • Multiprocessor Flight Data Recorder • Records memory races for deterministic replay • Piggybacks on coherence protocol & logs 0.001 B/instruction • Supports SC & TSO • Adaptive L2 Cache & Memory Link Compression • Cache compression creates level 2½ cache (or 3½) • Adaptive so as “to do no harm” • Link compression husbands memory link bandwidth • Multifacet GEMS MP Simulation Infrastructure • Simics==Correctness; GEMS==Performance • GPL Distribution

Selected Multifacet Results (2 of 2) • Log-based Transactional Memory (LogTM) • Accelerates commit by writing new values in place (after saving old values in a per-thread log) • Gracefully handles cache eviction of TM data • LogTM Signature Edition (LogTM-SE) • Signatures summarize read/write sets • HW mechanisms: simple, policy-free, SW accessible • Forthcoming • Mechanisms to handle thread switching/migration & paging of transactions with OS or OS/VMM

Race Recorder Overview • Increasingly useful to replay multithreaded code • Race recording: key to dealing with nondeterminism • A Case Study: an effective and inexpensive recorder • Long recording: 1 byte/kilo-instr • Low overhead, always-on recording: less than 2% overhead • Low cost: 24 KB RAM/core • More applicable: supports both SC & TSO (x86-like)

Contributions • Effective • Small log size: Transitive Reduction & Regulated TR • Low runtime overhead: Coherence Piggyback • Inexpensive • Low cost hardware: Set/LRU Approximation • SC & TSO applicability: Order-Value Hybrid

Outline • Motivation & Problem (6 slides) • An Effective and Inexpensive Race Recorder (21 slides) • TR & RTR Algorithms • Coherence Piggyback • Set/LRU Approximation • Order-Value Hybrid • Evaluation Method & Results (6 slides) • Conclusions, etc. (3 slides)

Motivation & Problem

Multithreaded Debugging
Single-threaded program, the crash reproduces under gdb: • % gcc hash.c • % a.out • Segmentation fault • % • % gdb a.out • gdb> run • Program received SIGSEGV. • In get() at hash.c:45 • 45 a = bucket->d;
Multithreaded program, the crash does not reproduce under gdb: • % gcc para-hash.c • % a.out • Segmentation fault • % • % gdb a.out • gdb> run • Program exited normally. • gdb>
Multithreaded program with race recording, the crash reproduces from the log: • % gcc para-hash.c • % a.out • Segmentation fault • Race recorded in “log” • % • % gdb a.out log • gdb> run • Program received SIGSEGV. • In get() at para-hash.c:67 • 67 a = bucket->d;

Applications of Deterministic Replay • Deterministic Replay is logically recreating a program execution • Cyclic Debugging ([Pancake & Netzer ‘93]) • Fault Tolerance (ExtraVirt [Lucchetti et al. ’05]) • Intrusion Analysis (ReVirt [Dunlap et al. ’02]) • Data Recovery (Continuous Checkpointing)? • See VMware Workstation 6 • Replay included for single-processor guest VM

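Race Recording • Thread I: X = 1; X++; print(X) • Thread J: X = X*5 • Original (recording): X = X*5 runs between X = 1 and X++, so X=6; this order is written to the log • Replay without the log: X = X*5 may run after X++, so X=10 • Replay with the log: the recorded order is enforced, so X=6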
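The two outcomes on this slide can be reproduced with a small model. This is a minimal sketch, not the recorder itself: the `run` function and the schedule lists are hypothetical names, and the interleavings are the ones that yield the slide's values 6 and 10.

```python
# Hypothetical model of the slide's two-thread example; all names are illustrative.
def run(schedule):
    """Execute the shared-variable operations in the given global order
    and return the value seen by print(X) in thread I."""
    state = {"X": 0, "printed": None}
    ops = {
        ("I", 1): lambda s: s.update(X=1),              # X = 1
        ("I", 2): lambda s: s.update(X=s["X"] + 1),     # X++
        ("I", 3): lambda s: s.update(printed=s["X"]),   # print(X)
        ("J", 1): lambda s: s.update(X=s["X"] * 5),     # X = X*5
    }
    for step in schedule:
        ops[step](state)
    return state["printed"]

# Order observed while recording: J's multiply lands between I's two writes.
recorded = [("I", 1), ("J", 1), ("I", 2), ("I", 3)]
# A replay that ignores the log may let J's multiply run after X++.
unlogged = [("I", 1), ("I", 2), ("J", 1), ("I", 3)]

original_value = run(recorded)
replay_without_log = run(unlogged)
replay_with_log = run(recorded)   # enforcing the logged order repeats the run
```

Replaying the logged order is what makes the third run match the first; nothing else about the program changes.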

Recording for Multithreaded Replay • Race Recording (the focus of this talk) • Not an issue for a single thread • Create the same general & data races • Checkpointing • Provide a snapshot of the program state • Many proposals (e.g., SafetyNet), not the focus here • Input Recording • Provide repeatable inputs • Some proposals (e.g., part of FDR), not the focus here

A Good Race Recorder • Low runtime overhead • Long recording: small log • Low cost • Applicability
Example session: • % gcc para-hash.c • % a.out • Segmentation fault • Race recorded in “log” • % • % gdb a.out log • gdb> run • Program received SIGSEGV. • In get() at para-hash.c:67 • 67 a = bucket->d;

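Desired & Existing Race Recorders • [Table: our recorder compared with existing race recorders, including Strata (ASPLOS ’06), against the desired properties; surviving cell entries: V V V X V V, but global]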

Small Log Size: Transitive Reduction & Regulated TR

Problem Formulation • Thread I: ld A, st B, st C, ld D, sub, ld B • Thread J: add, st C, ld B, st A, st C, st D • [Figure: the same two threads during recording and during replay, with dependences in black and conflicts in red; the log carries the conflicts from recording to replay] • Reproduce exact same conflicts: no more, no less

Log All Conflicts • Detect conflicts → Write log • Assign IC (logical timestamps) • Thread I (IC 1 to 6): ld A, st B, st C, ld D, sub, ld B • Thread J (IC 1 to 6): add, st C, ld B, st A, st C, st D • Dependence log, 16 bytes per entry • Log J: 2→3, 1→4, 3→5, 4→6 • Log I: 2→3 • Log size: 5*16 = 80 bytes (10 integers) • But too many conflicts

Netzer’s Transitive Reduction • Thread I (IC 1 to 6): ld A, st B, st C, ld D, sub, ld B • Thread J (IC 1 to 6): add, st C, ld B, st A, st C, st D • TR reduced: 1→4 is dropped from Log J because 2→3 together with program order already implies it • TR Reduced Log • Log J: 2→3, 3→5, 4→6 • Log I: 2→3 • Log size: 64 bytes (8 integers)
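For two threads, the reduction on this slide can be sketched with one scalar per source thread. This is an illustrative pairwise filter, not Netzer's full algorithm (which tracks transitivity through other threads with vector timestamps); `pairwise_reduce` and the variable names are hypothetical, and instruction counts are assumed to start at 1.

```python
def pairwise_reduce(deps):
    """deps: (src_ic, dst_ic) pairs from one source thread to one destination
    thread, in the order the destination thread executed them.
    A pair is kept only if no earlier kept pair already implies it."""
    kept = []
    newest_src = 0  # largest source IC already ordered before the destination
    for src_ic, dst_ic in deps:
        if src_ic > newest_src:
            kept.append((src_ic, dst_ic))
            newest_src = src_ic
        # else: an earlier kept pair (s2, d2) has s2 >= src_ic and d2 <= dst_ic,
        # so program order on both threads makes this pair redundant.
    return kept

ENTRY_BYTES = 16  # two integers per logged dependence, as on the slide

full_log_j = [(2, 3), (1, 4), (3, 5), (4, 6)]  # dependences from I into J
full_log_i = [(2, 3)]                          # dependences from J into I

tr_log_j = pairwise_reduce(full_log_j)
tr_log_i = pairwise_reduce(full_log_i)
tr_bytes = (len(tr_log_j) + len(tr_log_i)) * ENTRY_BYTES
```

Only 1→4 is filtered out here, because 2→3 was logged earlier with a larger source IC and a smaller destination IC.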

The Intuition of the New RTR Algorithm • After reduction: dependence vectors from I to J and from J to I • Regulate Replay (RTR)

Stricter Dependences to Aid Vectorization • Thread I (IC 1 to 6): ld A, st B, st C, ld D, sub, ld B • Thread J (IC 1 to 6): add, st C, ld B, st A, st C, st D • New Reduced Log • Log J: 2→3, 4→5 (4→5 is stricter than, and replaces, 3→5 and 4→6) • Log I: 2→3 • Reduced log size: 48 bytes (6 integers)

Compress Vectorized Dependencies • Thread I (IC 1 to 6): ld A, st B, st C, ld D, sub, ld B • Thread J (IC 1 to 6): add, st C, ld B, st A, st C, st D • Vectorized Log (vector deps.) • Log J: x=3,5, ∆=1 • Log I: x=3, ∆=1 • Log size: 40 bytes (5 integers) • Reduce log size to KB/core/second
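The step from the TR log to the vectorized log can be checked with a few lines. This is a sketch under stated assumptions: `implies` and `vectorize` are hypothetical helpers, the stricter dependence 4→5 is taken from the slide rather than discovered by the code, and the sketch does not verify that 4→5 actually held in the recorded execution (RTR may only introduce dependences that did).

```python
def implies(strict, dep):
    """True if enforcing `strict` = (s2, d2) also enforces `dep` = (s, d):
    the source is no later (s <= s2) and the destination no earlier (d2 <= d)."""
    (s2, d2), (s, d) = strict, dep
    return s <= s2 and d2 <= d

def vectorize(deps):
    """Encode dependences that share one stride as (destination ICs, delta).
    Returns None when the strides differ and no single vector fits."""
    deltas = {d - s for s, d in deps}
    if len(deltas) != 1:
        return None
    return [d for _, d in deps], deltas.pop()

tr_log_j = [(2, 3), (3, 5), (4, 6)]   # after transitive reduction
rtr_log_j = [(2, 3), (4, 5)]          # after replacing two entries by 4->5
rtr_log_i = [(2, 3)]

covered = all(any(implies(strict, dep) for strict in rtr_log_j) for dep in tr_log_j)
vec_j = vectorize(rtr_log_j)          # destinations x = 3, 5 with delta 1
vec_i = vectorize(rtr_log_i)          # destination x = 3 with delta 1
INT_BYTES = 8
vec_bytes = sum(len(dsts) + 1 for dsts, _ in (vec_j, vec_i)) * INT_BYTES
```

The TR log itself does not vectorize (`vectorize(tr_log_j)` returns `None` because its strides are 1, 2, and 2), which is why the stricter dependence is introduced first.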

Low Runtime Overhead: Coherence Piggyback

Detect Conflicts • Per-block metadata: A.readers, A.writer • Thread I (IC 1 to 4): ld A, st B, st C, st A • Thread J (IC 1 to 3): add, st C, ld B • I: ld A at IC 1: A.readers.add(I, 1) • I: st B at IC 2: B.writer = (I, 2) • J: st C at IC 2: C.writer = (J, 2) • I: st C at IC 3: if (C.writer != I) log(WAW); foreach C.readers: if (reader != I) log(WAR); C.readers.clear(); C.writer = (I, 3) • J: ld B at IC 3: if (B.writer != J) log(RAW); B.readers.add(J, 3) • … • Expensive in software
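The bookkeeping on this slide can be written out as a runnable software detector. This is a minimal sketch: `detect_conflicts` and the tuple layout are hypothetical, and the global interleaving in `trace` is an assumption chosen to be consistent with the dependence logs shown on the earlier slides (the slides do not give the exact interleaving). Non-memory instructions (add, sub) are omitted.

```python
from collections import defaultdict

def detect_conflicts(trace):
    """trace: (thread, ic, op, addr) tuples in global order; op is 'ld' or 'st'.
    Returns (kind, src_thread, src_ic, dst_thread, dst_ic) for every conflict."""
    writer = {}                   # addr -> (thread, ic) of the last store
    readers = defaultdict(dict)   # addr -> {thread: ic} of loads since that store
    log = []
    for thread, ic, op, addr in trace:
        last = writer.get(addr)
        if op == "ld":
            if last and last[0] != thread:
                log.append(("RAW", last[0], last[1], thread, ic))
            readers[addr][thread] = ic
        else:
            if last and last[0] != thread:
                log.append(("WAW", last[0], last[1], thread, ic))
            for r_thread, r_ic in readers[addr].items():
                if r_thread != thread:
                    log.append(("WAR", r_thread, r_ic, thread, ic))
            readers[addr].clear()
            writer[addr] = (thread, ic)
    return log

# Assumed interleaving of the example threads I and J.
trace = [
    ("I", 1, "ld", "A"), ("J", 2, "st", "C"), ("I", 2, "st", "B"),
    ("I", 3, "st", "C"), ("J", 3, "ld", "B"), ("I", 4, "ld", "D"),
    ("J", 4, "st", "A"), ("J", 5, "st", "C"), ("I", 6, "ld", "B"),
    ("J", 6, "st", "D"),
]
log = detect_conflicts(trace)
log_j = [(s, d) for _, _, s, dst, d in log if dst == "J"]
log_i = [(s, d) for _, _, s, dst, d in log if dst == "I"]
```

Every load and store pays for a metadata lookup and update here, which illustrates why the slide calls the software approach expensive.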

Use Cache and Cache Coherence • Cache timestamps stand in for A.readers, A.writer, B.readers, B.writer • Proc J executes ld B and sends a Get/S request • Proc I answers with a data response that carries its timestamp • Proc I cache (Tag, State, Data, Timestamp): A S … 1; B M … 2 • Proc J cache (Tag, State, Data, Timestamp): A S … 3; B I … 2 • RAW detected & logged • Detect conflict in hardware with little runtime cost

Cache Evictions and Writebacks • [Figure: caches of Proc I and Proc J (Tag, State, Data, Timestamp) around st A, with Get/S, Inv, and Ack (timestamp?) messages; entries shown: A S … 1, B M … 2, A S … 3, B I … 2, C M … 3] • Directory of A: Shared(I,J) Owner() • WAR detected & logged • OK with nonsilent eviction & directory eviction

Implement TR and RTR in Hardware • Ideal TR requires vector timestamps • Too expensive • New idea: Pairwise-TR (use scalar timestamp) • Enable pairwise transitive reduction • Optimal RTR algorithm is likely expensive • Implement a greedy RTR algorithm • One-pass, online algorithm • Keep a sliding window of vectorizable dependencies

Hardware Implementation

Low Cost Hardware: Set/LRU Approximation

Timestamp Approximation • Thread I (IC 1 to 4): ld A, st B, st C, ld D • Thread J (IC 1 to 4): add, st C, ld B, st A • One set of I’s $ (Tag, State, Data, Timestamp): A S … 1; B M … 2; C M … 3 • When a needed timestamp is no longer in the set, use the current IC of thread I • Directory of A: Shared(I) • Correct, but more evictions → more logged conflicts

Hardware Cost Log Size

Set/LRU Approximation • Thread I (IC 1 to 4): ld A, st B, st C, ld D • Thread J (IC 1 to 4): add, st C, ld B, st A • One set of I’s $ (Tag, State, Data, Timestamp): A S … 1; B M … 2; C M … 3 • Plain timestamp approximation: use current IC of thread I • LRU guarantees B’s TS > A’s TS • Set/LRU better preserves reducibility • Small $ → more misses → but still small log
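One reading of the Set/LRU idea is that, when a block's timestamp has been evicted, the timestamp of the least recently used block still in the set is a safe upper bound, since LRU replacement only evicts blocks touched earlier. The sketch below models a single set under that assumption; the `TimestampSet` class and its methods are hypothetical and are not taken from the slides.

```python
from collections import OrderedDict

class TimestampSet:
    """One set of a timestamp memory with LRU replacement (illustrative only)."""

    def __init__(self, ways):
        self.ways = ways
        self.entries = OrderedDict()  # tag -> IC of last access, LRU entry first

    def access(self, tag, ic):
        """Record that the local thread touched `tag` at instruction count `ic`."""
        if tag in self.entries:
            self.entries.move_to_end(tag)
        elif len(self.entries) == self.ways:
            self.entries.popitem(last=False)  # evict the LRU entry
        self.entries[tag] = ic

    def timestamp(self, tag, current_ic):
        """Exact timestamp if present, else a conservative upper bound."""
        if tag in self.entries:
            return self.entries[tag]
        if self.entries:
            # Every evicted block was last touched no later than the LRU survivor.
            return next(iter(self.entries.values()))
        return current_ic  # nothing to bound with: fall back to the current IC

ts_set = TimestampSet(ways=2)
ts_set.access("A", 1)
ts_set.access("B", 2)
ts_set.access("C", 3)   # the set is full, so A is evicted

bound_for_a = ts_set.timestamp("A", current_ic=4)
```

With the current-IC fallback the remote request would be ordered after instruction 4; the LRU bound orders it after instruction 2, which is closer to A's true timestamp of 1 and leaves more room for transitive reduction.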

Hardware Cost of Timestamps: Coupled Timestamp Memory • Cache line layout (Tag, State, Data, Timestamp): A S … 1; B M … 2 • Coupled timestamp memory: overhead ∝ cache size • Not flexible • 64B line + 64b (24b) timestamp → 12.5% (4.7%) overhead • 192 KB for a 4MB L2 • Need to modify cache

Decoupled Timestamp Memory • Cache (Tag, State, Data): A S …; B M … • Separate timestamp memory (Tag, Timestamp): A 1; B 2 • Decoupling → Small timestamp memory (Set/LRU) • e.g., 32-set, 64-way → 99% transitive reduction • Timestamp memory: 24 KB • No need to modify cache • Compared with the coupled timestamp memory (Tag, State, Data, Timestamp: A S … 1; B M … 2), from 192 KB to 24 KB: 8x reduction
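The storage figures on the coupled and decoupled slides follow from simple arithmetic, reproduced here as a check. The constant and function names are illustrative; the sizes are the ones quoted on the slides.

```python
LINE_BYTES = 64                 # cache line size from the slides
L2_BYTES = 4 * 2**20            # 4 MB L2
DECOUPLED_KB = 24               # size of the decoupled timestamp memory

def coupled_overhead(timestamp_bits):
    """Timestamp bits added per line, as a fraction of the line's data bits."""
    return timestamp_bits / (LINE_BYTES * 8)

lines = L2_BYTES // LINE_BYTES                 # 65536 lines in the L2
coupled_kb = lines * 24 / 8 / 1024             # one 24-bit timestamp per line
reduction = coupled_kb / DECOUPLED_KB

overhead_64b = coupled_overhead(64)            # 0.125
overhead_24b = coupled_overhead(24)            # 0.046875, quoted as 4.7%
```

The coupled cost scales with the number of cache lines, while the decoupled memory is sized independently of the cache, which is where the 8x difference comes from.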

SC & TSO Applicability: Order-Value Hybrid

Recording with Total Store Order (TSO) • Initially A=B=0 • Thread I: 1 st A,1; 2 ld B • Thread J: 1 st B,1; 2 ld A • [Figure: orderings of the four accesses and the values of A and B that the loads observe under SC and under TSO] • Most existing multiprocessors are non-SC • TSO is well defined, x86-like

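TSO Execution • Initially A=B=0 • Thread I: 1 st A,1; 2 ld B • Thread J: 1 st B,1; 2 ld A • Each store waits in its thread's write buffer (WrBuf) while the following load reads the memory system • The loads observe A=0 and B=0; the memory system later holds A=1 and B=1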
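The outcome on this slide (both loads return 0) cannot be explained by any SC interleaving, which is why an order-only log is not enough under TSO. The sketch below enumerates the SC interleavings and then models one write-buffered execution; it is a toy model of the example only, and all function names are hypothetical.

```python
from itertools import permutations

PROGRAM = {
    "I": [("st", "A", 1), ("ld", "B")],
    "J": [("st", "B", 1), ("ld", "A")],
}

def sc_outcomes():
    """All (value I loads from B, value J loads from A) pairs allowed by SC."""
    results = set()
    steps = [("I", 0), ("I", 1), ("J", 0), ("J", 1)]
    for order in permutations(steps):
        # Keep only interleavings that respect each thread's program order.
        if order.index(("I", 0)) > order.index(("I", 1)):
            continue
        if order.index(("J", 0)) > order.index(("J", 1)):
            continue
        mem = {"A": 0, "B": 0}
        loaded = {}
        for thread, idx in order:
            op = PROGRAM[thread][idx]
            if op[0] == "st":
                mem[op[1]] = op[2]
            else:
                loaded[thread] = mem[op[1]]
        results.add((loaded["I"], loaded["J"]))
    return results

def tso_with_buffered_stores():
    """Both stores sit in write buffers while both loads read memory."""
    mem = {"A": 0, "B": 0}
    wrbuf = {"I": [("A", 1)], "J": [("B", 1)]}
    ld_b = mem["B"]   # thread I: B is not in I's own buffer, so read memory
    ld_a = mem["A"]   # thread J: A is not in J's own buffer, so read memory
    for buffered in wrbuf.values():   # the buffers drain afterwards
        for addr, value in buffered:
            mem[addr] = value
    return (ld_b, ld_a), mem

sc = sc_outcomes()
tso_loads, final_mem = tso_with_buffered_stores()
```

Because `(0, 0)` is absent from `sc`, no ordering of the four accesses reproduces the TSO run, so a replayer needs something beyond order alone for such loads.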

Order-Value-Hybrid Recording • Initially A=B=0 • Thread I: 1 st A,1; 2 ld B • Thread J: 1 st B,1; 2 ld A • Recording: the stores wait in the write buffers (WrBuf) while the loads read A=0 and B=0 from the memory system • I starts to monitor B; J starts to monitor A; I stops monitoring B • A changed! The WAR is omitted and the loaded value (A=0) is logged • Replay: the logged value A=0 is used for the load

Hybrid Recording with TR and RTR • Hybrid recording • All loads get correct values • Hardware similar to OoO SC [Gharachorloo et al. ’91] • Hybrid + TR & RTR • TR will not use the omitted WAR in reduction • RTR vectorizes dependencies more conservatively

Evaluation Method & Results

Put-it-together: Determinizer/CMP • [Figure: 4-core CMP (Core 1 to Core 4) sharing an L2 cache (L1 directory); each core has L1_I$, L1_D$, an L1 coherence controller, IC, a timestamp memory (TSM), a log, and TR and RTR registers]

Simulation Method • Commercial server hardware • GEMS: http://www.cs.wisc.edu/gems • Full-system (OS + application) executions • 4-core CMP (sequentially consistent) • 1-way in-order issue, 2 GHz • 64KB I/D L1, 4MB L2, 64-byte lines, MOSI directory • Commercial server software • Apache – static web serving • SpecJBB – middleware • OLTP – TPC-C like • Zeus – static web serving

Log Size: 1 byte/kilo-instr • [Figure: log size for Apache, JBB, OLTP, Zeus, and AVG, plotted in KB/core/s and in byte/core/kilo-instr] • Well within the capability of current machines • Long recording (days – months) needs improvement

Runtime Overhead • [Figure: execution time and interconnection message bandwidth for Apache, JBB, OLTP, and Zeus; baseline vs. with race recorder] • Our recorder can be “always-on”

Benefits of RTR and Set/LRU (Log Size) • [Figure, improvement by RTR: log size with Pairwise-TR vs. our RTR for Apache, JBB, OLTP, Zeus, and AVG] • [Figure, effectiveness of Set/LRU: log size with a perfect TSM vs. a 24KB Set/LRU TSM for Apache, JBB, OLTP, Zeus, and AVG]

Why Do RTR and Set/LRU Work Well? • RTR • Processors execute instructions at similar speeds • Therefore, we can find “vectorizable” dependencies • Set/LRU • Temporal locality makes the LRU timestamps old • We only need to know if a timestamp is “old-enough”

Sensitivity and Scalability • A design space of the timestamp memory (TSM) • Size: smaller TSM -> larger log • Read/write timestamp: should be used when TSM is large • Partial timestamp: 24 bits are enough • Associativity: higher is better for RTR • Scalability of the recorder • Studied with modest processor counts (2p – 16p) • Commercial workloads, not scientific workloads • Log size increases slowly with number of cores

Conclusions, etc.

Conclusions & Future Work • Race recording → Key to combat nondeterminism • Contributions → Effective & inexpensive recorder • Transitive Reduction & RTR algorithm → Small log size • Coherence piggyback → Negligible slowdown • Timestamp approximation → Low hardware cost • Order-value hybrid → Support SC & TSO • Future work • Operate with Hardware Transactional Memory • Seek to Eliminate Timestamp on Acknowledgements

Toward Recording w/ Snooping Protocols • Key problem is combined/implicit response • Not a problem for AMD Hammer • [Figure: caches of Proc I and Proc J (Tag, State, Data, Timestamp) around st A, with Get/X and Pull Shared signals; entries shown: A S … 1, B M … 4, A S … 3, B I … 2; timestamp + current IC] • WAR detected & logged

Timestamp at L2-Directory or Memory? • [Figure: caches of Proc I and Proc J (Tag, State, Data, Timestamp) around st A, with Get/S, eviction, and Ack (timestamp) messages and a timestamp memory; entries shown: A S … 1, B M … 4, A S … 3, B I … 2, C M … 3] • Directory of A: Shared(J) Owner() StickyS(I,J) • Directory eviction: more false conflicts, like snooping
