
Transparent Dynamic Binding with Fault-Tolerant Cache Coherence Protocol for Chip Multiprocessors

Shuchang Shan †‡, Yu Hu †, Xiaowei Li † — † Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences





Presentation Transcript


  1. Transparent Dynamic Binding with Fault-Tolerant Cache Coherence Protocol for Chip Multiprocessors — Shuchang Shan †‡, Yu Hu †, Xiaowei Li † • † Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences • ‡ Graduate University of Chinese Academy of Sciences (GUCAS)

  2. Outline • Introduction • TDB execution model • Experimental results • Conclusion

  3. Architectural-level Dual Modular Redundancy • Thread-level DMR: a leading thread executes and a trailing thread checks it through the memory system; e.g., AR-SMT [FTCS’99], SRT [ISCA’00] • Instruction-level DMR: e.g., DIVA [MICRO’99], SHREC [MICRO’04], EDDI [TR’02] • Core-level DMR: for CMP systems, make use of the abundant hardware resources by coupling whole cores; e.g., CRTR [ISCA’03], Reunion [MICRO’06], DCC [DSN’07]

  4. Core-level Dual Modular Redundancy (DMR): using coupled cores to verify each other’s execution • Static binding: lacks flexibility; e.g., Reunion [MICRO’06], CRT [ISCA’02], CRTR [ISCA’03] • Dynamic binding: lacks scalability for parallel processing; e.g., DCC [DSN’07, WDDD’08]
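The coupled-core checking idea above can be sketched in a few lines. This is an illustrative toy, not the paper's mechanism: `run_core`, `dmr_check`, and the fault-injection hook are all hypothetical names.

```python
# Illustrative sketch (not from the paper) of core-level DMR:
# two coupled cores run the same instruction stream and a checker
# compares their retired results; a mismatch signals a fault.

def run_core(program, fault_at=None):
    """Toy 'core': evaluates a list of (op, a, b) instructions."""
    results = []
    for i, (op, a, b) in enumerate(program):
        r = a + b if op == 'add' else a * b
        if i == fault_at:       # optionally inject a transient fault
            r ^= 1
        results.append(r)
    return results

def dmr_check(master_results, slave_results):
    # Checker: compare retired results instruction by instruction.
    return all(m == s for m, s in zip(master_results, slave_results))

prog = [('add', 2, 3), ('mul', 4, 5)]
good = run_core(prog)
assert dmr_check(good, run_core(prog))      # fault-free pair: match
assert not dmr_check(good, run_core(prog, fault_at=1))  # fault caught
```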

  5. Key issue in core-level DMR: maintaining master-slave memory consistency • The coupled cores must read the same memory values • External writes cause consistency violations • Reunion [Smolens-MICRO’06]: rollback and recovery on inconsistency • Dynamic Core Coupling (DCC) [LaFrieda-DSN’07]: a consistency window stalls the external writes, which introduces a scalability problem

  6. Scalability problem • External writes occur earlier and more frequently as the system scales: the number of external writes within 1K cycles grows from 0.3 on a 4-core CMP to 3.3 on a 16-core CMP • Probability of an external write occurring within a given slack: 28% within 100 cycles and 37% within 500 cycles for a 4-core system; 43% within 100 cycles and 55% within 500 cycles for a 16-core system • Reunion: unacceptable recovery overhead for consistency violations • DCC: unacceptable stall latency caused by the consistency window • A scalable solution is needed that reduces the consistency-maintenance overhead

  7. Basic idea: reduce the scope of master-slave memory consistency maintenance • Sphere of Consistency (SoC): shrink it from the whole memory hierarchy to the private caches • Transparent Dynamic Binding (TDB): reduce the SoC to the scale of the private caches and provide a scalable, flexible core-level DMR solution (figure: master and slave L1 caches above the shared global memory)

  8. Outline • Introduction • TDB execution model • Experimental results • Conclusion

  9. TDB principle • The pair receives the same program input, so master and slave exhibit similar memory access behavior • Transparent binding: the master issues L1 miss requests on behalf of the logical pair, and the slave is prevented from accessing the global memory • Dynamic binding: the system network is used for data communication and result comparison
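The transparent-binding rule above can be sketched as follows, under the assumption that the master's fill data is forwarded over the network into the slave's private cache. Class and method names are illustrative, not from the paper.

```python
# Hypothetical sketch of TDB transparent binding: only the master
# issues L1 miss requests; each fill is forwarded to the slave's
# private cache, so the slave never touches global memory itself.

class PrivateCache:
    def __init__(self):
        self.lines = {}               # block address -> data

    def lookup(self, addr):
        return self.lines.get(addr)

    def fill(self, addr, data):
        self.lines[addr] = data

class TDBPair:
    def __init__(self, memory):
        self.memory = memory          # shared global memory (a dict)
        self.master = PrivateCache()
        self.slave = PrivateCache()

    def master_load(self, addr):
        data = self.master.lookup(addr)
        if data is None:              # L1 miss: the master alone
            data = self.memory[addr]  # goes to global memory...
            self.master.fill(addr, data)
            self.slave.fill(addr, data)  # ...and forwards the fill
        return data

    def slave_load(self, addr):
        # The slave passively waits: it only consumes forwarded fills.
        data = self.slave.lookup(addr)
        assert data is not None, "slave must never miss to memory"
        return data

pair = TDBPair({0x40: 7})
assert pair.master_load(0x40) == 7
assert pair.slave_load(0x40) == 7    # served from the forwarded fill
```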

  10. Transparent dynamic binding • The logical pair follows a consumer-consumer data access pattern, with the global memory as the producer • Slaves are transparent to the rest of the system and passively wait for data • Sphere of Consistency: the private caches

  11. Maintain Consistency under Out-of-Order Execution • Out-of-order execution brings in wrong-path effects [1] (figure: master and slave issue memory accesses MA1–MA6 against their LRU/MRU-ordered private caches) • [1] R. Sendag, et al., “The impact of wrong-path memory references in cache-coherent multiprocessor systems,” JPDC 2007

  12.–15. Maintain Consistency under Out-of-Order Execution (animation continued) • Wrong-path memory references, e.g., those squashed by a pipeline refresh, fill and reorder the master’s and slave’s private caches differently • Result: master-slave private cache consistency violation • Invariant: the in-order memory instruction retirement sequence

  16. Victim Buffer Assisted Conservative Private Cache Ingress Rule • Victim buffer: filters the wrong-path (WP) data blocks before they enter the private cache

  17.–18. Victim Buffer Assisted Conservative Private Cache Ingress Rule (animation continued)

  19. Victim Buffer Assisted Conservative Private Cache Ingress Rule • Conservative private cache ingress rule: accept only data blocks from the correct path into the private caches
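The ingress rule above can be sketched as follows, assuming fills are staged in the victim buffer and promoted into the private cache only when the requesting instruction retires on the correct path. All names are illustrative, not from the paper.

```python
# Hypothetical sketch of the conservative private-cache ingress rule:
# fill data first lands in a victim buffer; it enters the private
# cache only on instruction retirement, so wrong-path (WP) fills
# never pollute the cache and the master/slave contents stay equal.

class ConservativeCache:
    def __init__(self):
        self.cache = {}           # committed (correct-path) blocks
        self.victim_buffer = {}   # speculative fills, keyed by addr

    def on_fill(self, addr, data):
        self.victim_buffer[addr] = data     # stage the fill

    def on_retire(self, addr):
        # Correct-path instruction retired: admit its block.
        if addr in self.victim_buffer:
            self.cache[addr] = self.victim_buffer.pop(addr)

    def on_squash(self, addr):
        # Wrong-path fill: discard without touching the cache.
        self.victim_buffer.pop(addr, None)

c = ConservativeCache()
c.on_fill(0x10, 'A')   # fill for a load that will retire
c.on_fill(0x20, 'B')   # fill for a wrong-path load
c.on_retire(0x10)
c.on_squash(0x20)
assert 0x10 in c.cache and 0x20 not in c.cache
```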

  20. Maintain Consistency under Out-of-Order Execution • A potential master-slave consistency violation remains (figure: wrong-path accesses perturb the LRU/MRU order of the private caches) • Invariant: the in-order memory instruction retirement sequence

  21. update-after-retirement LRU Replacement policy (uar-LRU)

  22. update-after-retirement LRU Replacement policy (uar-LRU) • uar-LRU: update the MRU position only after instruction retirement, preventing wrong-path memory references from violating the consistency
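The uar-LRU idea above can be sketched as a cache whose recency order is touched only at retirement time. This is a minimal sketch under that assumption; the class and method names are hypothetical.

```python
# Hypothetical sketch of uar-LRU: a block's MRU position is updated
# only when its memory instruction retires, not at access time, so
# speculative wrong-path references cannot perturb the replacement
# order that the master and slave must keep identical.

from collections import OrderedDict

class UarLRU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # oldest (LRU) first

    def access(self, addr, data=None):
        # Speculative access: insert/read but do NOT touch recency.
        if addr not in self.lines:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)   # evict the LRU block
            self.lines[addr] = data
        return self.lines[addr]

    def retire(self, addr):
        # Instruction retired on the correct path: promote to MRU now.
        if addr in self.lines:
            self.lines.move_to_end(addr)

cache = UarLRU(2)
cache.access(1, 'x')
cache.access(2, 'y')
cache.retire(1)        # block 1 becomes MRU only at retirement
cache.access(3, 'z')   # evicts block 2 (LRU), not the retired block 1
assert 1 in cache.lines and 2 not in cache.lines and 3 in cache.lines
```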

  23. Master-slave memory consistency violation • External writes violate the master-slave memory consistency • Preserving the atomicity of master-slave data access behavior lacks scalability as external writes become more frequent • Figure: master-slave input coherence — (a) an external write violates the consistency; (b) the master-slave consistency window in DCC

  24. Transparent Input Coherence Strategy • Take advantage of transparent dynamic binding • Break the atomicity of master-slave data access behavior (figure: a checker verifies the pair)
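One way to read the strategy above: since the slave is invisible to the coherence protocol, an external write need only reach the master immediately; the slave can replay the invalidation later, at the matching point in its own identical retirement stream, so the pair no longer acts atomically. The sketch below assumes such a log-and-replay mechanism; the names and the mechanism itself are illustrative, not taken from the paper.

```python
# Hypothetical sketch of non-atomic input coherence: an external
# write invalidates the master at once and is logged together with
# the master's retirement point; the lagging slave applies each
# logged invalidation when it reaches the same retirement point.

class CoherentPair:
    def __init__(self):
        self.master = {0x40: 1}      # private cache contents
        self.slave = {0x40: 1}
        self.invalidation_log = []   # (retire_seq, addr) records

    def external_write(self, addr, master_retire_seq):
        # Coherence protocol invalidates the master immediately and
        # records at which retired instruction the write was seen.
        self.master.pop(addr, None)
        self.invalidation_log.append((master_retire_seq, addr))

    def slave_retire(self, seq):
        # The slave replays invalidations once it reaches the same
        # point in the (identical) retirement sequence.
        while self.invalidation_log and self.invalidation_log[0][0] <= seq:
            _, addr = self.invalidation_log.pop(0)
            self.slave.pop(addr, None)

pair = CoherentPair()
pair.external_write(0x40, master_retire_seq=10)
pair.slave_retire(9)            # slave has not reached the write yet
assert 0x40 in pair.slave
pair.slave_retire(10)           # reaches it: replay the invalidation
assert 0x40 not in pair.slave
```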

  25. Outline • Introduction • TDB execution model • Experimental results • Conclusion

  26. Experimental Setup • Full-system simulator: Simics + GEMS • Parallel workloads: SPLASH-2 • Baseline dual modular redundancy system: N active cores plus another N disabled cores, simulating a DMR system in which the slaves work without interfering with the masters

  27. The Performance of the TDB Proposal • The conservative private cache ingress rule helps filter the wrong-path effects • TDB achieves 97.2%, 99.8%, 101.2% and 105.4% of baseline performance for 4, 8, 16 and 32 cores, respectively

  28. Network Traffic of the TDB Proposal • Total traffic is increased by 5.2%, 3.6%, 1.3% and 2.5% for 4-, 8-, 16- and 32-core CMP systems, respectively

  29. Comparison against DCC [DSN’07] (figure data: 37.1%, 18%, 10.4%, 9.2%) • Transparent Dynamic Binding (TDB): a scalable and flexible core-level DMR solution!

  30. Conclusion • Transparent Dynamic Binding reduces the SoC to the scale of the private caches • Techniques to maintain the consistency: the consumer-consumer data access pattern, the victim buffer assisted conservative ingress rule, the uar-LRU replacement policy, and the transparent input coherence policy • Result: a scalable and flexible core-level DMR solution

  31. Q&A?
