Towards Performance-Efficient Temporal Redundancy

CSE 598C Project • Towards Performance-Efficient Temporal Redundancy • Sudhanva Gurumurthi

Approaches to Redundancy • Spatial Redundancy • Hardware duplication • IBM G3, HP NonStop Himalaya • Informational Redundancy • Parity and ECC • Temporal Redundancy

“Sphere of Replication” • [Figure: Sphere of Replication; labels: Input Replicator, Output Comparator, Rest of the System] • Source: Mukherjee et al., “Detailed Design and Evaluation of Redundant Multithreading Alternatives”, ISCA’02

Sphere of Replication in Superscalar Processors • [Figure: superscalar datapath; labels: PC, I-Cache, Decode, RAT, ROB, I0, I1, FU, Registers, Commit, “=”] • Source: J. Ray et al., “Dual Use of Superscalar Datapath for Transient-Fault Detection and Recovery”, MICRO’01.

System Configuration • 8-wide superscalar processor • 128-entry RUU, 64-entry LSQ • 4 I-ALUs, 3 I-MULT/DIVs, 2 FP-ALUs, 1 FP-MULT/DIV/SQRT • 32 KB, 2-way L1 D-cache • 64 KB, 2-way L1 I-cache • 512 KB, 4-way L2 cache • 112-cycle memory latency

Performance Loss

Towards Single-Thread Performance • The instruction scheduler is oblivious to the presence of two execution contexts. • Goals • Minimize impact on the critical-path of execution. • Critical-instruction scheduling

The Concept of Criticality and Slack • Only some of the instructions in a program cause bottlenecks. • Critical Instructions • Critical instructions cannot be delayed without lengthening execution. • Slack measures how far an instruction is from being critical: the number of cycles it can be delayed without lengthening execution.
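A minimal sketch of how slack could be computed offline on a small dependence graph. The four-instruction graph, the latencies, and the function name `compute_slack` are illustrative assumptions, not taken from the slides, and the sketch ignores resource dependences entirely.

```python
from collections import defaultdict

def compute_slack(latency, deps):
    """Return {instr: slack in cycles} for a small dependence graph.

    latency: {instr: cycles}, listed in program (topological) order.
    deps:    {instr: [producer instrs]} (hypothetical input format).
    """
    order = list(latency)

    # Forward pass: earliest finish time of each instruction.
    finish = {}
    for i in order:
        start = max((finish[p] for p in deps.get(i, [])), default=0)
        finish[i] = start + latency[i]
    total = max(finish.values())

    # Invert the producer lists to find each instruction's consumers.
    consumers = defaultdict(list)
    for i, producers in deps.items():
        for p in producers:
            consumers[p].append(i)

    # Backward pass: latest finish time that does not lengthen execution.
    latest = {}
    for i in reversed(order):
        latest[i] = min((latest[c] - latency[c] for c in consumers[i]),
                        default=total)

    return {i: latest[i] - finish[i] for i in order}

# Assumed example: I1 (4 cycles) and I2 (2 cycles) both feed I3, which feeds I4.
latency = {"I1": 4, "I2": 2, "I3": 1, "I4": 1}
deps = {"I3": ["I1", "I2"], "I4": ["I3"]}
slack = compute_slack(latency, deps)
critical = [i for i, s in slack.items() if s == 0]
```

In this assumed graph, I2 can be delayed by two cycles without affecting I3, while I1, I3 and I4 have zero slack and form the critical path.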

Microarchitectural Critical-Path • A compiler can examine data dependences but not resource dependences. • ROB stalls • Branch mispredictions • Routing network stalls • The critical path is a function of data dependences and inherent instruction latencies, as well as resource usage at runtime.

Critical-Path Prediction [B. Fields et al., ISCA’01] • Tries to construct a dependence-graph model of the critical path at runtime. • Considers both machine-independent data dependences and machine-specific resource dependences. • Tracks the last-arriving chain of edges along the dynamically constructed microarchitectural dependence graph using a token-passing algorithm.

Token-Passing Example • 1. Plant token • 2. Propagate token • 3. Is token alive? • 4. Yes: train critical • [Figure labels: ROB Size, Critical] • Found CP without constructing the entire graph • Adapted from Brian Fields’s ISCA’01 slides
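The four token-passing steps can be sketched roughly as follows. This is a simplification under stated assumptions: each instruction is reduced to a single node that records only the index of its last-arriving producer, whereas the model in the cited work tracks several nodes per instruction. The names `token_alive`, `last_arriving` and `window` are hypothetical.

```python
def token_alive(last_arriving, plant, window):
    """Plant a token at instruction `plant` and report whether it survives.

    last_arriving[i] is the index of the earlier instruction whose result
    arrived last for instruction i, or None if i had no such producer.
    `window` is how many instructions to wait before checking the token.
    """
    has_token = [False] * len(last_arriving)
    has_token[plant] = True                      # step 1: plant token

    for i in range(plant + 1, len(last_arriving)):
        src = last_arriving[i]
        if src is not None and has_token[src]:   # step 2: propagate along
            has_token[i] = True                  # last-arriving edges only

    # Step 3: is the token still alive `window` instructions later?
    return any(has_token[plant + window:])

# Assumed trace: instruction 0 feeds nothing; 1 -> 2 -> 3 -> 4 -> 5 is a chain.
trace = [None, None, 1, 2, 3, 4]
train_0 = token_alive(trace, plant=0, window=3)  # token dies: not critical
train_1 = token_alive(trace, plant=1, window=3)  # token alive: train critical
```

A surviving token means the planted instruction lies on a long chain of last-arriving edges, so step 4 would train it as critical. Only one bit per instruction is needed, which is why the full graph never has to be built.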

Hardware Implementation • Two components • Critical-path table • Trainer • Critical-Path Table • Stores predictions indexed by PC (16K-entry with 6-bit hysteresis) • Looked up in parallel with instruction fetch • Trainer • Token array • Stores dependence graph of ROB-size most recent instructions with 1 bit to show if token propagated into that node.
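A software sketch of the critical-path table, assuming one saturating hysteresis counter per entry. The table size (16K) and counter width (6 bits) come from the slide; the PC hashing, the increment and decrement amounts, and the prediction threshold are assumptions chosen for illustration.

```python
class CriticalPathTable:
    """PC-indexed table of saturating counters (illustrative model only)."""

    def __init__(self, entries=16 * 1024, bits=6, inc=8, dec=1, threshold=8):
        self.entries = entries
        self.max_count = (1 << bits) - 1   # 6-bit counter saturates at 63
        self.inc = inc                     # assumed step when trained critical
        self.dec = dec                     # assumed step when trained non-critical
        self.threshold = threshold         # assumed prediction threshold
        self.counters = [0] * entries

    def _index(self, pc):
        # Assumed hash: drop the low two bits of a word-aligned PC.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Looked up at fetch: is this instruction predicted critical?"""
        return self.counters[self._index(pc)] >= self.threshold

    def train(self, pc, critical):
        """Called by the trainer once a token check resolves."""
        i = self._index(pc)
        if critical:
            self.counters[i] = min(self.max_count, self.counters[i] + self.inc)
        else:
            self.counters[i] = max(0, self.counters[i] - self.dec)

table = CriticalPathTable()
table.train(0x400, critical=True)
predicted = table.predict(0x400)
```

With these assumed steps, one critical training event is enough to flip a cold entry to predicted-critical, and repeated non-critical events drain it back gradually.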

Hardware Implementation

Evaluation Methodology • SimpleScalar 3.0 • Code modified to simulate temporally redundant execution based on [Ray’01]. • Critical-path predictor integrated with the above. • Fast-forward 1 billion instructions. • Detailed simulation of 1 billion instructions.

Scheduling Strategies • base • Default scheduler used in SimpleScalar • Loads, long-latency instructions and branches selected first. • Other instructions selected in oldest-first order.

Scheduling Strategies • cp-all • All predicted-critical instructions selected first. • Other instructions selected in oldest-first order. • cp-dual • All predicted-critical instructions in redundant context selected first. • Other predicted-critical instructions next. • Other instructions (both contexts) in oldest-first order.
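The three selection policies (base, cp-all, cp-dual) can be expressed as sort keys over the ready instructions. This is a sketch, not the simulator's code: the `Instr` record, its field names, and the use of the sequence number as the age tiebreak within every priority class are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    seq: int                  # program-order age; smaller is older
    redundant: bool = False   # True if it belongs to the redundant context
    critical: bool = False    # prediction from the critical-path table
    favored: bool = False     # load, long-latency instruction, or branch

def base_key(i):
    # base: loads, long-latency instructions and branches first, then oldest.
    return (not i.favored, i.seq)

def cp_all_key(i):
    # cp-all: all predicted-critical instructions first, then oldest.
    return (not i.critical, i.seq)

def cp_dual_key(i):
    # cp-dual: critical in the redundant context, then other critical, then rest.
    if i.critical and i.redundant:
        rank = 0
    elif i.critical:
        rank = 1
    else:
        rank = 2
    return (rank, i.seq)

def select(ready, key, width):
    """Pick up to `width` ready instructions in priority order."""
    return sorted(ready, key=key)[:width]

ready = [
    Instr(0),
    Instr(1, critical=True),
    Instr(2, redundant=True, critical=True),
    Instr(3, favored=True),
]
base_order = [i.seq for i in select(ready, base_key, 4)]
cp_all_order = [i.seq for i in select(ready, cp_all_key, 4)]
cp_dual_order = [i.seq for i in select(ready, cp_dual_key, 4)]
```

Writing each policy as a key keeps the selection logic itself unchanged, so only the priority function differs between the three configurations.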

Availability of Critical Instructions

Performance Loss

Summary • More workloads. • Might need to stagger threads. • Larger ROBs needed to prevent stalls. • AR-SMT-style delay-buffer • Difficult to reconcile two far-flung threads

Backup Slides

Why not just provision more functional-units? • Scheduler complexity • Broadcast-free schedulers • Why couldn’t we have just used them to boost single-thread performance?

Temporal Redundancy • Execute multiple contexts of the same program. • At the commit-stage, check results of all the copies and re-execute if necessary. • Simple voting • Checker processor
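The commit-stage check can be sketched in software as follows, assuming the redundant copies are modeled as repeated calls to a function `execute` that may occasionally return a corrupted value. The function names, the retry limit, and the three-copy majority vote are illustrative assumptions.

```python
from collections import Counter

def commit_with_check(execute, copies=2, max_retries=3):
    """Run `copies` executions, commit only if all agree, else re-execute."""
    for _ in range(max_retries + 1):
        results = [execute() for _ in range(copies)]
        if all(r == results[0] for r in results):
            return results[0]
    raise RuntimeError("copies still disagree after re-execution")

def commit_with_vote(execute, copies=3):
    """Simple voting: commit the value that a strict majority of copies agree on."""
    results = [execute() for _ in range(copies)]
    value, count = Counter(results).most_common(1)[0]
    if count > copies // 2:
        return value
    raise RuntimeError("no majority among copies")

# Assumed fault model: the second execution returns a corrupted value once.
outcomes = iter([5, 7, 5, 5])
checked = commit_with_check(lambda: next(outcomes))

votes = iter([5, 7, 5])
voted = commit_with_vote(lambda: next(votes))
```

With two copies a mismatch can only be detected, so recovery needs re-execution. With three copies a single corrupted result can be outvoted without re-executing.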

Sphere of Replication • Components within the Sphere protected via redundant execution. • Components outside the Sphere protected via spatial/informational redundancy. • Temporal redundancy does not preclude extra hardware support.

Illustrative Example – Program Graph • [Figure: program graph over instructions I1, I2, I3, I4 with edge weights 4 and 2; annotations: Slack = 2, TES = 4, Critical!]
