Timing Speculation and the Razor Latch: Enhancing Pipeline Performance Through Speculative Execution

Circuit-Level Timing Speculation: The Razor Latch • Developed by Trevor Mudge’s group at the University of Michigan, 2003

We’ve Already Encountered Speculation in ECE 568 • Branch prediction • When a branch is encountered, guess whether it is taken or not • If the guess is correct, we have gained time • If the guess is incorrect, we must undo any incorrectly executed instructions and move on • Multi-word cache lines • When a cache miss is encountered, we bring in the entire cache line, not just the word we’re looking for • If the access pattern shows spatial locality, we are prefetching other words that the program will soon ask for, thereby saving time. • If the speculation is too aggressive (i.e., the cache lines are too long), we’ll fetch many words uselessly.

Speculation (contd.) • Value Prediction • (Not covered in this course) • Idea is to predict what the value of a variable will be and use the predicted value. • If the predicted value was right, we gain some time; if it was wrong, we did some useless execution. If this execution changed processor state, these changes will have to be undone. • Not used in practice (to my knowledge): mainly an academic exercise so far.

Speculating on Time • The pipeline clock cycle is the time by which each stage is guaranteed to complete its assigned operation • This time is a function of • Actual hardware parameters: Gate and wire delays vary within the same die, from one die to another, and from one wafer to another. • Data involved in the computation: • Example: Ripple-carry adder. Worst-case execution time is the time it takes to ripple the carry through from carry-in of the least significant, to the carry-out of the most significant, stage. Actual execution times may vary considerably. • Requiring the worst-case delays to be accounted for often forces designers to be overly conservative in setting the clock rates
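The data dependence of the ripple-carry delay can be illustrated with a small sketch. It assumes a simplified model in which delay is proportional to the longest carry chain: a chain starts at a stage that generates a carry and extends through each following stage that propagates it. The function names and the 32-bit width are illustrative choices, not part of the lecture.

```python
import random

def longest_carry_chain(a: int, b: int, width: int = 32) -> int:
    """Longest run of adder stages that a single carry ripples through
    when adding a and b with carry-in 0 (simplified delay model)."""
    longest = 0
    run = 0  # length of the carry chain currently in flight (0 = none)
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        if x & y:            # generate: a new carry is created at this stage
            run = 1
        elif x ^ y:          # propagate: an incoming carry ripples onward
            run = run + 1 if run else 0
        else:                # kill: any incoming carry is absorbed
            run = 0
        longest = max(longest, run)
    return longest

def mean_longest_chain(samples: int = 10_000, width: int = 32, seed: int = 0) -> float:
    """Average longest carry chain over uniformly random operand pairs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        total += longest_carry_chain(rng.getrandbits(width), rng.getrandbits(width), width)
    return total / samples

if __name__ == "__main__":
    print("worst case:", longest_carry_chain(0xFFFFFFFF, 1))  # carry ripples through every stage
    print("random-operand average:", mean_longest_chain())
```

Under this model the worst case spans the full width, while random operands typically produce much shorter chains, which is the gap that timing speculation tries to exploit.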

Timing Speculation: Basic Idea • Suppose F is the frequency at which the pipeline is guaranteed to function correctly • Run the pipeline at a somewhat higher rate, f. • Much of the time, this clock period, t_p=1/f, will be sufficient for all pipeline stages, and we’ll gain in execution speed • Some of the time, we may need more time: • Need to discover when this is the case • When this is the case, provide additional time by allowing the pipeline stages additional cycles to complete their operation

Implementation • Recall that pipeline stages are separated by latches • Duplicate each pipeline latch by introducing a shadow latch • Consider any stage of the pipeline. Suppose it starts some activity at time 0. • At time t_p=1/f, latch the output of that stage into the regular pipeline latch. • At time T_p=1/F, latch the output of the stage into the shadow latch. • Compare the results of the regular and shadow latches • If they agree, • do nothing: running at a higher speed has paid off • If they don’t agree, • Use the result of the shadow latch as the correct one • Squelch the computation that the following stage began on the basis of the incorrect regular-latch results • Restart the computation in the following stage using the correct results, as stored in the shadow latch
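The sample-twice-and-compare scheme can be sketched as a behavioral model. This is a toy illustration, not the circuit from the Razor paper: the stage is assumed to show a stale value until its (data-dependent) delay has elapsed and the new value afterwards, and the names `RazorSample`, `make_stage`, and `razor_sample` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RazorSample:
    main_value: int      # captured by the regular latch at time t_p
    shadow_value: int    # captured by the shadow latch at time T_p
    error: bool          # True when the two latches disagree

    @property
    def value_to_forward(self) -> int:
        """The shadow latch is treated as correct whenever there is a mismatch."""
        return self.shadow_value if self.error else self.main_value

def make_stage(old_value: int, new_value: int, delay: float) -> Callable[[float], int]:
    """Toy stage: its output shows old_value until `delay`, then new_value."""
    return lambda t: new_value if t >= delay else old_value

def razor_sample(stage_output_at: Callable[[float], int], t_p: float, T_p: float) -> RazorSample:
    """Sample the stage output at the fast clock edge and at the safe clock edge."""
    main = stage_output_at(t_p)
    shadow = stage_output_at(T_p)
    return RazorSample(main, shadow, main != shadow)

if __name__ == "__main__":
    t_p, T_p = 0.75, 1.0                                   # assumed periods, arbitrary units
    fast = razor_sample(make_stage(7, 42, delay=0.5), t_p, T_p)
    slow = razor_sample(make_stage(7, 42, delay=0.875), t_p, T_p)
    print(fast, fast.value_to_forward)
    print(slow, slow.value_to_forward)
```

An operation that settles before t_p produces matching latches; one that settles between t_p and T_p is flagged, and the shadow value is the one to restart the following stage with.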

Unless otherwise stated, all figures are from Ernst, et al., MICRO-36, 2003.

Issues to Consider • How aggressive should we be? • If f is too high, a large fraction of the results will require correction with the shadow latch and we’ll actually lose time • If f is too low, the clock will be unnecessarily slow and we won’t gain much
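The trade-off in choosing f can be made concrete with a rough cost model. The sketch assumes that every timing error costs a fixed number of extra cycles and that the error rate at period t_p is simply the fraction of operations whose delay exceeds t_p. The delay samples and candidate periods are made-up numbers for illustration only.

```python
def error_rate(delays, t_p):
    """Fraction of operations that have not settled by the fast clock edge."""
    return sum(d > t_p for d in delays) / len(delays)

def avg_time_per_op(delays, t_p, penalty_cycles=1):
    """Average time per operation when each error costs `penalty_cycles` extra cycles."""
    return t_p * (1 + penalty_cycles * error_rate(delays, t_p))

def best_period(delays, candidate_periods, penalty_cycles=1):
    """Candidate clock period with the lowest average time per operation."""
    return min(candidate_periods, key=lambda t: avg_time_per_op(delays, t, penalty_cycles))

if __name__ == "__main__":
    # Hypothetical stage delays (arbitrary units); the safe period T_p is 1.0.
    delays = [0.6] * 50 + [0.8] * 40 + [1.0] * 10
    for t_p in (1.0, 0.8, 0.6, 0.55):
        print(t_p, error_rate(delays, t_p), avg_time_per_op(delays, t_p))
    print("best:", best_period(delays, (1.0, 0.8, 0.6, 0.55)))
```

With these numbers a moderate speed-up beats the safe clock, while the most aggressive candidate errs on every operation and ends up slower than not speculating at all.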

Issues to Consider (contd.) • What about F? • The upper bound on F is set by the worst-case path (for the worst-case inputs): T_p=1/F must be long enough for that path to settle • What happens if F is too small? [This is one of the few instances in design when being too conservative at one level affects correctness of functioning!] • T_p=1/F may then be so long that the results of the next computation propagate through the stage and arrive at the shadow latch before it samples • We’d then be comparing the results of two different operations!
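The short-path hazard can be written as a simple inequality. Assume the next operation is launched at time t_p (the next fast clock edge) and that the shortest path through the stage has delay `min_path_delay`. Its earliest effect on the stage output then appears at t_p + min_path_delay, which must fall after the shadow latch samples at T_p. The function names are hypothetical, and hold-time margins are ignored in this sketch.

```python
def shadow_latch_is_safe(t_p: float, T_p: float, min_path_delay: float) -> bool:
    """True if the next operation cannot reach the stage output before
    the shadow latch samples the current one at time T_p."""
    return t_p + min_path_delay > T_p

def required_min_path_delay(t_p: float, T_p: float) -> float:
    """Short paths must be slower than this (e.g. by padding them with buffers)."""
    return T_p - t_p

if __name__ == "__main__":
    t_p, T_p = 0.75, 1.0     # assumed periods, arbitrary units
    print(required_min_path_delay(t_p, T_p))
    print(shadow_latch_is_safe(t_p, T_p, min_path_delay=0.5))
    print(shadow_latch_is_safe(t_p, T_p, min_path_delay=0.125))
```

The wider the gap between t_p and T_p, the more the short paths have to be slowed down, so the choice of F also constrains how aggressive f can be.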

Metastability • If the input data is not stable when the clock transition happens, the output of the latch may float at a voltage that is in neither the logic-0 nor the logic-1 range • Duration of the metastable state is not bounded • Different gates may interpret such indeterminate voltages differently (in terms of logic values) • Cannot reduce the probability of metastability to zero: all we can do is to keep it sufficiently low for all practical purposes
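How low "sufficiently low" is can be estimated with the commonly used first-order synchronizer model, MTBF = exp(t_r / tau) / (T_0 * f_clk * f_data), where t_r is the time allowed for the latch to resolve and tau and T_0 are device-dependent constants. This formula is not from the slides, and the parameter values below are placeholders chosen only to show the exponential trend.

```python
import math

def metastability_mtbf(t_resolve: float, tau: float, t0: float,
                       f_clk: float, f_data: float) -> float:
    """Mean time between metastability failures (seconds), first-order model.

    t_resolve: time allowed for the latch output to settle (s)
    tau:       regeneration time constant of the latch (s), device-dependent
    t0:        metastability window parameter (s), device-dependent
    f_clk:     clock frequency (Hz)
    f_data:    rate of data transitions near the clock edge (Hz)
    """
    return math.exp(t_resolve / tau) / (t0 * f_clk * f_data)

if __name__ == "__main__":
    # Placeholder device constants, not measured values.
    tau, t0 = 20e-12, 50e-12
    for t_resolve in (0.2e-9, 0.4e-9, 0.8e-9):
        print(t_resolve, metastability_mtbf(t_resolve, tau, t0, f_clk=1e9, f_data=1e8))
```

The failure interval grows exponentially with the resolution time, which is why giving the latch output a little extra settling time is so effective, yet the probability never reaches exactly zero.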

Recovery Technique 1: Global Clock Gating • If any stage detects a timing problem • Stall the entire pipeline for one clock cycle. • Use this additional clock cycle to recompute using the correct shadow-latch values
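The cost of global clock gating can be sketched with a cycle count. The model assumes an otherwise ideal pipeline (no hazards or other stalls) and charges one stall cycle for every cycle in which at least one stage flags a mismatch. The function name and the error trace are hypothetical.

```python
def total_cycles(n_instructions: int, depth: int, errors_by_cycle) -> int:
    """Cycles to drain n_instructions through a `depth`-stage pipeline when any
    cycle with at least one timing error stalls the whole pipeline once.

    errors_by_cycle: iterable of per-cycle sequences of booleans, one per stage.
    """
    base = n_instructions + depth - 1          # ideal pipeline fill + drain
    stalls = sum(1 for stage_errors in errors_by_cycle if any(stage_errors))
    return base + stalls

if __name__ == "__main__":
    no_error = [False] * 5
    trace = [no_error, [True, False, True, False, False], no_error]
    print(total_cycles(10, 5, trace))
```

In this model two stages erring in the same cycle cost a single stall, because every stage recomputes from its shadow-latch inputs during that one extra cycle.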

Recovery Technique 2: Counterflow Pipelining • When a mismatch (between regular and shadow latch contents) is detected: • Assert a bubble signal, to mark the erring pipeline slot as a bubble. • In the subsequent cycle, inject the shadow latch value into the next stage, allowing the errant operation to continue with the correct values • Trigger a flush train, traveling backwards from the errant stage, flushing operations at each stage it visits (Question: Is this flush operation necessary? Can we do something else to avoid it?)

Power Consumption: Using a Processor to Fry an Egg • From: www.phys.ncku.edu.tw/~htsu/humor/fry_egg.html

Power Density From: Hsu and Feng, “A Power-Aware Real-Time System…”, 2005

Power Implications: Dynamic Power From Krishna & Lee: IEEE Trans. Computers, 2003.

Static Power • Even when there is no switching, transistors leak current • Leakage power is a strongly increasing function of temperature and supply voltage; it falls steeply (roughly exponentially, for subthreshold leakage) as the threshold voltage is raised.
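The temperature and threshold-voltage trends can be sketched with the textbook first-order subthreshold model, in which the off-state current scales as exp(-V_th / (n * kT/q)). This is a simplification: it ignores the supply-voltage dependence (for example through drain-induced barrier lowering), the drift of V_th with temperature, and the temperature-dependent prefactor. The slope factor n = 1.5 and the voltages are assumed values.

```python
import math

K_OVER_Q = 8.617333262e-5   # Boltzmann constant / electron charge, in V/K

def thermal_voltage(temp_kelvin: float) -> float:
    """kT/q in volts (about 26 mV at 300 K)."""
    return K_OVER_Q * temp_kelvin

def relative_off_current(v_th: float, temp_kelvin: float, n: float = 1.5) -> float:
    """Subthreshold off-current up to a constant factor (gate at 0 V)."""
    return math.exp(-v_th / (n * thermal_voltage(temp_kelvin)))

if __name__ == "__main__":
    ref = relative_off_current(0.3, 300.0)
    print("hotter die   :", relative_off_current(0.3, 350.0) / ref)
    print("lower V_th   :", relative_off_current(0.2, 300.0) / ref)
    print("higher V_th  :", relative_off_current(0.4, 300.0) / ref)
```

Even this crude model shows leakage rising with temperature and dropping exponentially as the threshold voltage is raised.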

Subthreshold leakage vs temperature From: Do, et al.: Tech Report 2007-06, Dept. of CSE, Chalmers University of Technology

Leakage Current vs Vdd From Do et al., op cit.

Voltage Control for Razor Latch System
