Fault Tolerance and Performance Enhancement using Chip Multi-Processors

Işıl ÖZ

Outline
• Introduction
• Related Work
• Dual-Core Execution (DCE)
• DCE for Fault Tolerance
• DCE with Energy Optimization
• Experimental Results
• Conclusion

CMP
• Single-chip multi-core, or chip multiprocessor (CMP)
  • High system throughput
  • Exploits only explicit parallelism
  • No improvement in single-thread performance
  • Processor cores sit idle if there are too few parallel tasks
• Dual-core execution
  • Uses multiple cores to improve the performance of single-thread workloads
  • Computation redundancy

Run-Ahead Execution
• When the processor is blocked by a long-latency cache miss
  • Checkpoint the processor state
  • Enter run-ahead mode
• When the blocking miss completes
  • Return to normal mode
  • Re-execute using the warmed-up caches
• Limitations
  • Re-executes even when the run-ahead results are correct
  • Multiple executions for miss-dependent misses

CFP (Continual Flow Pipelines)
• Similar to run-ahead execution
• Stores dependent (slice) instructions
• Executes independent instructions speculatively
• Commits speculative results
• Limitation
  • Requires a large centralized load/store queue

Leader-Follower Architectures
• Run a program on two processors
  • One leader
  • One follower, which uses the leader’s results to make faster progress
• Limitations
  • If the leader is slower, the follower cannot use its results
  • If the follower is slower, the leader has to wait to retire

Dual-Core Execution (DCE)

Front Superscalar Core
• Executes instructions in the normal way, except
  • For long-latency cache misses (L2 misses)
    • Substitutes an invalid value for the data being fetched
    • Sets the INV bit in the physical register
    • Invalidates the dependent instructions
    • Propagates the INV flag through data dependences
• Retires instructions in order, except
  • Store instructions
    • No data cache or memory update
    • Update the run-ahead cache for use by subsequent loads
  • Exceptions
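The INV propagation described on this slide can be sketched in a few lines. This is only an illustrative model under stated assumptions: it tracks architectural register names rather than physical registers, and the `Instr` type and `propagate_inv` function are hypothetical names, not taken from the original papers.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    dest: str               # destination register name (hypothetical encoding)
    srcs: tuple             # source register names
    l2_miss: bool = False   # True if this is a load that misses in the L2 cache


def propagate_inv(program):
    """Return, per instruction, whether its result is marked INV."""
    inv_regs = set()        # registers whose INV bit is currently set
    flags = []
    for ins in program:
        # A result is invalid if the load missed in L2 or any source is INV.
        inv = ins.l2_miss or any(s in inv_regs for s in ins.srcs)
        if inv:
            inv_regs.add(ins.dest)
        else:
            inv_regs.discard(ins.dest)   # a valid write clears the INV bit
        flags.append(inv)
    return flags


program = [
    Instr("r1", ("r0",), l2_miss=True),  # load misses in L2: INV
    Instr("r2", ("r1",)),                # depends on r1: INV
    Instr("r3", ("r0",)),                # independent: valid
    Instr("r1", ("r3",)),                # overwrites r1 with a valid value
    Instr("r4", ("r1",)),                # r1 is valid again: valid
]
print(propagate_inv(program))
```

Only the first two instructions end up invalid here; the independent instructions keep executing, which is what lets the front core run ahead of the miss.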

Result Queue
• First-in, first-out structure
• Holds the retired instruction stream from the front processor
• Provides a continuous instruction stream to the back processor

Back Superscalar Core
• Fetches instructions from the result queue
• Processes instructions in the normal way, except
  • Mispredicted branches
    • All instructions in the back and front processors are squashed
    • The result queue is emptied
    • The back processor’s register values are copied into the front processor’s physical registers
    • The run-ahead cache is invalidated
• Retires instructions in order
  • Store instructions update the data caches
  • Precise state for exception handling
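As a rough illustration of how the result queue and the misprediction recovery fit together, the sketch below models the shared state as plain Python containers. The class name `DualCoreState` and its fields are hypothetical; real hardware would of course also squash in-flight instructions in both pipelines, which this toy model does not represent.

```python
from collections import deque


class DualCoreState:
    """Toy model of the state shared between the front and back cores."""

    def __init__(self):
        self.result_queue = deque()   # FIFO of instructions retired by the front core
        self.front_regs = {}          # front core register values (may be speculative)
        self.back_regs = {}           # back core register values (precise state)
        self.runahead_cache = {}      # store data kept by the front core

    def front_retire(self, instr):
        # The front core appends each retired instruction at the tail.
        self.result_queue.append(instr)

    def back_fetch(self):
        # The back core fetches from the head, or gets None if the queue is empty.
        return self.result_queue.popleft() if self.result_queue else None

    def recover(self):
        # Recovery after the back core resolves a mispredicted branch.
        self.result_queue.clear()                 # empty the result queue
        self.front_regs = dict(self.back_regs)    # copy back registers to the front
        self.runahead_cache.clear()               # invalidate the run-ahead cache


state = DualCoreState()
state.back_regs = {"r1": 10}
state.front_regs = {"r1": 99}
state.runahead_cache[0x40] = 7
for instr in ("add", "load", "branch"):
    state.front_retire(instr)

first = state.back_fetch()    # the oldest instruction comes out first
state.recover()
print(first, len(state.result_queue), state.front_regs, state.runahead_cache)
```

After `recover()`, the front core restarts from the back core's register values, so nothing speculative survives the misprediction.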

Memory Hierarchy
• Separate L1 data caches for the back and front processors
• Shared L2 cache
• An L1 D-cache miss in the front processor triggers a prefetch request for the back processor’s L1 D-cache
• The back processor updates both L1 D-caches when a store instruction retires

Simulation Methodology
• Simulator infrastructure
  • SimpleScalar toolset
• Baseline
  • MIPS-R10000-style superscalar processor
• SPEC CPU 2000 benchmarks
  • Memory-intensive benchmarks

DCE_R
• DCE for transient fault tolerance
• DCE with redundancy check
  • Compares the front processor’s results that are not invalid with the back processor’s results
  • In case of a discrepancy
    • The branch misprediction recovery mechanism provides fault tolerance by rewinding the processors
• Only partial redundancy coverage
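The redundancy check and its partial coverage can be illustrated with a small sketch. It assumes each retired instruction is represented by a (front result, back result) pair and uses `None` as a stand-in for an invalidated front result; the function names `check` and `coverage` are hypothetical.

```python
INV = None  # assumed marker for a front-core result that was invalidated


def check(front_result, back_result):
    """Classify one retired instruction."""
    if front_result is INV:
        return "unchecked"      # no valid front result to compare against
    if front_result == back_result:
        return "ok"
    return "rewind"             # discrepancy: reuse misprediction recovery


def coverage(pairs):
    """Fraction of instructions whose results were actually compared."""
    checked = sum(1 for front, _ in pairs if front is not INV)
    return checked / len(pairs)


pairs = [(5, 5), (INV, 7), (3, 4), (8, 8)]
outcomes = [check(front, back) for front, back in pairs]
print(outcomes, coverage(pairs))
```

Instructions invalidated in the front core have nothing to compare against, which is why DCE_R covers only part of the retired instruction stream.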

Redundancy Checking Results
• Percentage of retired instructions with redundancy checking

DCE_FR
• DCE_R with full redundancy coverage
• An F_INV flag on each instruction shows whether it was validated by the front processor
  • If invalidated, the back processor fetches the same instruction twice, once for normal and once for redundant execution
  • If validated, the front processor’s result is used as the redundant copy
• Changes in renaming logic
  • Redundant execution
    • Source operands access the rename table as usual
    • Destination registers obtain a new physical register but do not update the rename table
    • At the retire stage, destination registers are freed after the comparison

DCE_FR_t
• DCE_FR with renaming scheme
• An additional renaming table (A_table) alongside the original renaming table (R_table)
  • Invalidated, normal execution: accesses and updates R_table
  • Invalidated, redundant execution: accesses and updates A_table
  • Validated execution: accesses R_table, updates both R_table and A_table
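The table-selection rules above amount to a small decision function. The sketch below encodes them directly; the function name `rename_access` and its return shape are assumptions made for illustration, not part of the original design description.

```python
def rename_access(invalidated, redundant=False):
    """Return (table read for sources, tables updated for the destination)."""
    if invalidated:
        # Normal and redundant copies each keep to their own table.
        table = "A_table" if redundant else "R_table"
        return table, (table,)
    # A validated instruction executes once: it reads R_table and
    # updates both tables so the two mappings stay consistent.
    return "R_table", ("R_table", "A_table")


for invalidated, redundant in [(True, False), (True, True), (False, False)]:
    print(invalidated, redundant, rename_access(invalidated, redundant))
```

Keeping the redundant copy in A_table lets both executions of an invalidated instruction rename independently, while validated instructions keep the two tables in step.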

Performance Impact
• DCE_R and DCE_FR perform better than Base, except on benchmarks with many branch mispredictions
• DCE_R and DCE_FR are not much better than DCE
• DCE_FR achieves a 23.5% performance improvement

Energy Consumption
• DCE_R and DCE_FR have high energy overhead

Energy Overhead Problems
• Wrong-path instructions
  • Large instruction window
  • A branch misprediction results in fetching and executing a large number of wrong-path instructions
• Redundant execution of invalidated instructions
  • They still access some structures (register file, access table, etc.) although they produce no useful results
  • DCE_FR has to dual-execute them

Energy Overhead Solutions-1
• FR_rs
  • Adapts the instruction window size
    • Reduces it for workloads with high misprediction rates
    • Keeps it large for other workloads, to exploit large-window benefits
• FR_rs_tl
  • Selective invalidation
    • Does not invalidate traversal address loads
    • Only the special “load ra, x(ra)” instructions are treated this way, because deciding load types in general requires compiler support

Energy Overhead Solutions-2
• FR_rs_tl_in
  • Adaptively enables or disables invalidation
  • Based on the workload’s dynamic behavior
  • Invalidate when the workload is
    • Memory-intensive with moderate mispredictions, or
    • Memory-intensive with low mispredictions, or
    • Moderately memory-intensive with extremely low mispredictions
  • Otherwise, do not invalidate
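One way to read the enable/disable rule is as a lookup on two coarse workload classes. The sketch below is an interpretation under stated assumptions: the level names are hypothetical, the slide gives no numeric thresholds so none are used, and misprediction levels are assumed to be ordered so that a workload with fewer mispredictions than a listed case also qualifies.

```python
# Assumed ordering of misprediction levels, from best to worst.
MISPREDICT_RANK = {"extremely_low": 0, "low": 1, "moderate": 2, "high": 3}


def should_invalidate(memory_level, mispredict_level):
    """Decide whether the front core should invalidate on L2 misses.

    memory_level is one of "high", "moderate", "low" (hypothetical labels
    for how memory-intensive the workload currently is).
    """
    rank = MISPREDICT_RANK[mispredict_level]
    if memory_level == "high":
        # Memory-intensive: worthwhile up to moderate misprediction rates.
        return rank <= MISPREDICT_RANK["moderate"]
    if memory_level == "moderate":
        # Moderately memory-intensive: only with extremely low mispredictions.
        return rank == MISPREDICT_RANK["extremely_low"]
    return False   # otherwise, do not invalidate


for mem in ("high", "moderate", "low"):
    for mis in ("extremely_low", "low", "moderate", "high"):
        print(mem, mis, should_invalidate(mem, mis))
```

The intent is that running ahead pays off only when cache misses are frequent enough to hide and mispredictions are rare enough that little wrong-path work is wasted.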

Performance Impact
• Not much performance improvement over DCE_FR

Energy Consumption
• Significantly reduces the energy overhead

Energy Overhead Solutions-3
• Reducing redundant execution
  • No redundant execution for instructions that were not invalidated
  • Re-execute only loads and invalidated instructions
• Switching between DCE and single-core mode
  • For workloads with high misprediction rates
  • Switch from dual-core mode to single-core mode

Performance Impact
• Ratio of executed instructions to retired instructions in the back processor
  • 41% on average

Energy Optimization Results-1

Energy Optimization Results-2

Conclusion
• DCE
  • Improves the performance of single-threaded applications on CMPs
  • Works best for memory-intensive workloads with a low misprediction rate
  • A dynamic scheme enables or disables DCE
• DCE with full redundancy checking
  • 24.9% speedup, 87% energy overhead
• DCE without reliability requirement
  • 34% speedup, 31% energy overhead

References
• H. Zhou, “A Case for Fault-Tolerance and Performance Enhancement Using Chip Multiprocessors,” Computer Architecture Letters, Sept. 2005.
• H. Zhou, “Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window,” Proc. 14th Int’l Conf. Parallel Architectures and Compilation Techniques (PACT ’05), 2005.
• Y. Ma, H. Gao, M. Dimitrov, and H. Zhou, “Optimizing Dual-Core Execution for Power Efficiency and Transient-Fault Recovery,” IEEE Transactions on Parallel and Distributed Systems, vol. 18, No. 2007.
