
Transparent Dynamic Binding with Fault-Tolerant Cache Coherence Protocol for Chip Multiprocessors

Shuchang Shan †‡, Yu Hu †, Xiaowei Li † — † Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences





Presentation Transcript


  1. Transparent Dynamic Binding with Fault-Tolerant Cache Coherence Protocol for Chip Multiprocessors — Shuchang Shan †‡, Yu Hu †, Xiaowei Li † • † Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences • ‡ Graduate University of Chinese Academy of Sciences (GUCAS)

  2. Outline • Introduction • TDB execution model • Experimental results • Conclusion

  3. Architectural-level Dual Modular Redundancy • Thread-level DMR: a leading thread executes and a trailing thread checks it through the memory system; e.g., AR-SMT [FTCS’99], SRT [ISCA’00] • Instruction-level DMR: e.g., DIVA [MICRO’99], SHREC [MICRO’04], EDDI [TR’02] • Core-level DMR: for CMP systems, make use of the abundant hardware resources by coupling whole cores; e.g., CRTR [ISCA’03], Reunion [MICRO’06], DCC [DSN’07]

  4. Core-level Dual Modular Redundancy (DMR): using coupled cores to verify each other’s execution • Static binding: lacks flexibility; e.g., Reunion [MICRO’06], CRT [ISCA’02], CRTR [ISCA’03] • Dynamic binding: lacks scalability for parallel processing; e.g., DCC [DSN’07, WDDD’08]
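The coupled-core checking idea above can be sketched in a few lines. This is an illustrative toy, not the paper's mechanism: `run_core`, `dmr_check`, and the fault-injection hook are all hypothetical names.

```python
# Illustrative sketch (not from the paper) of core-level DMR:
# two coupled cores run the same instruction stream and a checker
# compares their retired results; a mismatch signals a fault.

def run_core(program, fault_at=None):
    """Toy 'core': evaluates a list of (op, a, b) instructions."""
    results = []
    for i, (op, a, b) in enumerate(program):
        r = a + b if op == 'add' else a * b
        if i == fault_at:       # optionally inject a transient fault
            r ^= 1
        results.append(r)
    return results

def dmr_check(master_results, slave_results):
    # Checker: compare retired results instruction by instruction.
    return all(m == s for m, s in zip(master_results, slave_results))

prog = [('add', 2, 3), ('mul', 4, 5)]
good = run_core(prog)
assert dmr_check(good, run_core(prog))      # fault-free pair: match
assert not dmr_check(good, run_core(prog, fault_at=1))  # fault caught
```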

  5. Key issue in core-level DMR: maintaining master-slave memory consistency • The coupled cores must read the same memory values • External writes cause consistency violations • Reunion [Smolens-MICRO’06]: rollback and recovery on inconsistency • Dynamic Core Coupling (DCC) [LaFrieda-DSN’07]: a consistency window stalls the external writes, which introduces a scalability problem

  6. Scalability problem • External writes occur earlier and more frequently as the system scales: the number of external writes within 1K cycles grows from 0.3 on a 4-core CMP to 3.3 on a 16-core CMP • Probability of an external write occurring within a given slack: 28% within 100 cycles and 37% within 500 cycles for a 4-core system; 43% within 100 cycles and 55% within 500 cycles for a 16-core system • Reunion: unacceptable recovery overhead for consistency violations • DCC: unacceptable stall latency caused by the consistency window • A scalable solution is needed that reduces the consistency-maintenance overhead

  7. Basic idea: reduce the scope of master-slave memory consistency maintenance • Sphere of Consistency (SoC): shrink it from the whole memory hierarchy to the private caches • Transparent Dynamic Binding (TDB): reduce the SoC to the scale of the private caches and provide a scalable, flexible core-level DMR solution (figure: master and slave L1 caches above the shared global memory)

  8. Outline • Introduction • TDB execution model • Experimental results • Conclusion

  9. TDB principle • The pair receives the same program input, so master and slave exhibit similar memory access behavior • Transparent binding: the master issues L1 miss requests on behalf of the logical pair, and the slave is prevented from accessing the global memory • Dynamic binding: the system network is used for data communication and result comparison
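The transparent-binding rule above can be sketched as follows, under the assumption that the master's fill data is forwarded over the network into the slave's private cache. Class and method names are illustrative, not from the paper.

```python
# Hypothetical sketch of TDB transparent binding: only the master
# issues L1 miss requests; each fill is forwarded to the slave's
# private cache, so the slave never touches global memory itself.

class PrivateCache:
    def __init__(self):
        self.lines = {}               # block address -> data

    def lookup(self, addr):
        return self.lines.get(addr)

    def fill(self, addr, data):
        self.lines[addr] = data

class TDBPair:
    def __init__(self, memory):
        self.memory = memory          # shared global memory (a dict)
        self.master = PrivateCache()
        self.slave = PrivateCache()

    def master_load(self, addr):
        data = self.master.lookup(addr)
        if data is None:              # L1 miss: the master alone
            data = self.memory[addr]  # goes to global memory...
            self.master.fill(addr, data)
            self.slave.fill(addr, data)  # ...and forwards the fill
        return data

    def slave_load(self, addr):
        # The slave passively waits: it only consumes forwarded fills.
        data = self.slave.lookup(addr)
        assert data is not None, "slave must never miss to memory"
        return data

pair = TDBPair({0x40: 7})
assert pair.master_load(0x40) == 7
assert pair.slave_load(0x40) == 7    # served from the forwarded fill
```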

  10. Transparent dynamic binding • The logical pair follows a consumer-consumer data access pattern, with the global memory as the producer • Slaves are transparent to the rest of the system and passively wait for data • Sphere of Consistency: the private caches

  11. Maintain Consistency under Out-of-Order Execution • Out-of-order execution brings in wrong-path effects [1] (figure: master and slave issue memory accesses MA1–MA6 against their LRU/MRU-ordered private caches) • [1] R. Sendag, et al., “The impact of wrong-path memory references in cache-coherent multiprocessor systems,” JPDC 2007

  12.–15. Maintain Consistency under Out-of-Order Execution (animation continued) • Wrong-path memory references, e.g., those squashed by a pipeline refresh, fill and reorder the master’s and slave’s private caches differently • Result: master-slave private cache consistency violation • Invariant: the in-order memory instruction retirement sequence

  16. Victim Buffer Assisted Conservative Private Cache Ingress Rule • Victim buffer: filters the wrong-path (WP) data blocks before they enter the private cache

  17.–18. Victim Buffer Assisted Conservative Private Cache Ingress Rule (animation continued)

  19. Victim Buffer Assisted Conservative Private Cache Ingress Rule • Conservative private cache ingress rule: accept only data blocks from the correct path into the private caches
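The ingress rule above can be sketched as follows, assuming fills are staged in the victim buffer and promoted into the private cache only when the requesting instruction retires on the correct path. All names are illustrative, not from the paper.

```python
# Hypothetical sketch of the conservative private-cache ingress rule:
# fill data first lands in a victim buffer; it enters the private
# cache only on instruction retirement, so wrong-path (WP) fills
# never pollute the cache and the master/slave contents stay equal.

class ConservativeCache:
    def __init__(self):
        self.cache = {}           # committed (correct-path) blocks
        self.victim_buffer = {}   # speculative fills, keyed by addr

    def on_fill(self, addr, data):
        self.victim_buffer[addr] = data     # stage the fill

    def on_retire(self, addr):
        # Correct-path instruction retired: admit its block.
        if addr in self.victim_buffer:
            self.cache[addr] = self.victim_buffer.pop(addr)

    def on_squash(self, addr):
        # Wrong-path fill: discard without touching the cache.
        self.victim_buffer.pop(addr, None)

c = ConservativeCache()
c.on_fill(0x10, 'A')   # fill for a load that will retire
c.on_fill(0x20, 'B')   # fill for a wrong-path load
c.on_retire(0x10)
c.on_squash(0x20)
assert 0x10 in c.cache and 0x20 not in c.cache
```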

  20. Maintain Consistency under Out-of-Order Execution • A potential master-slave consistency violation remains (figure: wrong-path accesses perturb the LRU/MRU order of the private caches) • Invariant: the in-order memory instruction retirement sequence

  21. update-after-retirement LRU Replacement policy (uar-LRU)

  22. update-after-retirement LRU Replacement policy (uar-LRU) • uar-LRU: update the MRU position only after instruction retirement, preventing wrong-path memory references from violating the consistency
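The uar-LRU idea above can be sketched as a cache whose recency order is touched only at retirement time. This is a minimal sketch under that assumption; the class and method names are hypothetical.

```python
# Hypothetical sketch of uar-LRU: a block's MRU position is updated
# only when its memory instruction retires, not at access time, so
# speculative wrong-path references cannot perturb the replacement
# order that the master and slave must keep identical.

from collections import OrderedDict

class UarLRU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # oldest (LRU) first

    def access(self, addr, data=None):
        # Speculative access: insert/read but do NOT touch recency.
        if addr not in self.lines:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)   # evict the LRU block
            self.lines[addr] = data
        return self.lines[addr]

    def retire(self, addr):
        # Instruction retired on the correct path: promote to MRU now.
        if addr in self.lines:
            self.lines.move_to_end(addr)

cache = UarLRU(2)
cache.access(1, 'x')
cache.access(2, 'y')
cache.retire(1)        # block 1 becomes MRU only at retirement
cache.access(3, 'z')   # evicts block 2 (LRU), not the retired block 1
assert 1 in cache.lines and 2 not in cache.lines and 3 in cache.lines
```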

  23. Master-slave memory consistency violation • External writes violate the master-slave memory consistency • Preserving the atomicity of master-slave data access behavior lacks scalability as external writes become more frequent • Figure: master-slave input coherence — (a) an external write violates the consistency; (b) the master-slave consistency window in DCC

  24. Transparent Input Coherence Strategy • Take advantage of transparent dynamic binding • Break the atomicity of master-slave data access behavior (figure: a checker verifies the pair)
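One way to read the strategy above: since the slave is invisible to the coherence protocol, an external write need only reach the master immediately; the slave can replay the invalidation later, at the matching point in its own identical retirement stream, so the pair no longer acts atomically. The sketch below assumes such a log-and-replay mechanism; the names and the mechanism itself are illustrative, not taken from the paper.

```python
# Hypothetical sketch of non-atomic input coherence: an external
# write invalidates the master at once and is logged together with
# the master's retirement point; the lagging slave applies each
# logged invalidation when it reaches the same retirement point.

class CoherentPair:
    def __init__(self):
        self.master = {0x40: 1}      # private cache contents
        self.slave = {0x40: 1}
        self.invalidation_log = []   # (retire_seq, addr) records

    def external_write(self, addr, master_retire_seq):
        # Coherence protocol invalidates the master immediately and
        # records at which retired instruction the write was seen.
        self.master.pop(addr, None)
        self.invalidation_log.append((master_retire_seq, addr))

    def slave_retire(self, seq):
        # The slave replays invalidations once it reaches the same
        # point in the (identical) retirement sequence.
        while self.invalidation_log and self.invalidation_log[0][0] <= seq:
            _, addr = self.invalidation_log.pop(0)
            self.slave.pop(addr, None)

pair = CoherentPair()
pair.external_write(0x40, master_retire_seq=10)
pair.slave_retire(9)            # slave has not reached the write yet
assert 0x40 in pair.slave
pair.slave_retire(10)           # reaches it: replay the invalidation
assert 0x40 not in pair.slave
```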

  25. Outline • Introduction • TDB execution model • Experimental results • Conclusion

  26. Experimental Setup • Full-system simulator: Simics + GEMS • Parallel workloads: SPLASH-2 • Baseline dual modular redundancy system: N active cores plus another N disabled cores, simulating a DMR system in which the slaves work without interfering with the masters

  27. The Performance of the TDB Proposal • The conservative private cache ingress rule helps filter the wrong-path effects • TDB achieves 97.2%, 99.8%, 101.2% and 105.4% of baseline performance for 4, 8, 16 and 32 cores, respectively

  28. Network Traffic of the TDB Proposal • Total traffic is increased by 5.2%, 3.6%, 1.3% and 2.5% for 4-, 8-, 16- and 32-core CMP systems, respectively

  29. Comparison against DCC [DSN’07] (figure data: 37.1%, 18%, 10.4%, 9.2%) • Transparent Dynamic Binding (TDB): a scalable and flexible core-level DMR solution!

  30. Conclusion • Transparent Dynamic Binding reduces the SoC to the scale of the private caches • Techniques to maintain the consistency: the consumer-consumer data access pattern, the victim buffer assisted conservative ingress rule, the uar-LRU replacement policy, and the transparent input coherence policy • Result: a scalable and flexible core-level DMR solution

  31. Q&A?
