How to Turn the Technological Constraint Problem into a Control Policy Problem Using Slack

Brian Fields, Rastislav Bodík, Mark D. Hill
University of Wisconsin-Madison

The Problem: Managing constraints
Technological constraints dominate memory design
Constraint: Memory latency
Design: Cache hierarchy
Non-uniformity: Load latencies
Policy: What to replace?

The Problem: Managing constraints
In the future, technological constraints will also dominate microprocessor design
Constraint: Wires | Power | Complexity
Design: Clusters | Fast/slow ALUs | Grid, ILDP
Non-uniformity: Bypasses | Exe. latencies | L1 latencies
Policy: ? | ? | ?
• Policy Goal: Minimize effect of lower-quality resources

Key Insight: Control policy crucial
With non-uniform machines, the technological constraint problem becomes a control policy problem.
The best possible policy: Delays are imposed only on instructions so that execution time is not increased.
Achieved through slack: The amount an instruction can be delayed without increasing execution time.

Contributions/Outline
Understanding (measure slack in a simulator?)
• determining slack: resource constraints important
• reporting slack: apportion to individual instructions
• analysis: suggest nonuniform machines to build
Predicting (how to predict slack in hardware?)
• simple, delay and observe approach works well
Case study (how to design a control policy?)
• on power-efficient machine, up to 20% speedup

Determining slack: Why hard?
“Probe the processor” approach: Delay and observe
1. Delay dynamic instruction by n cycles
2. See if execution time increased
3. If not, increase n, restart, and go to step 1
Microprocessors are complex: Sometimes slack is determined by resources (e.g. ROB)
Srinivasan and Lebeck approximation, for loads (MICRO ’98)
• heuristics to predict execution time increase
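The delay-and-observe loop on this slide can be sketched as follows. This is a minimal illustration, not the authors' tooling: `run_with_delay` is a hypothetical hook standing in for a full simulator run, `max_delay` is an arbitrary cap, and the toy simulator at the end is invented purely to exercise the loop.

```python
from typing import Callable


def probe_slack(run_with_delay: Callable[[int], int], max_delay: int = 64) -> int:
    """Return the largest delay n (cycles) that leaves execution time unchanged.

    run_with_delay(n) is a hypothetical hook: it re-runs the simulation with one
    chosen dynamic instruction delayed by n cycles and returns total cycles.
    """
    baseline = run_with_delay(0)
    slack = 0
    for n in range(1, max_delay + 1):
        # Each probe restarts the run with a larger delay.
        if run_with_delay(n) > baseline:
            break  # execution time increased: the previous n was the slack
        slack = n
    return slack


def toy_simulator(n: int) -> int:
    """Invented stand-in: the probed instruction tolerates 6 cycles of delay."""
    return 100 + max(0, n - 6)


print(probe_slack(toy_simulator))  # 6
```

Each probe costs one complete simulation, which is one reason the graph-based analysis on the following slides is attractive.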

Determining slack
Alternative approach: Dependence-graph analysis
• Build resource-sensitive dependence graph
• Analyze to find slack
But, how to build resource-sensitive graph?
Casmira and Grunwald’s solution (Kool Chips Workshop ’00): Graphs only with instructions in issue window

Data-Dependence Graph
[Figure: data-dependence graph with numeric edge labels. Slack = 0 cycles]

Our Dependence Graph Model (ISCA ’01)
[Figure: dependence graph with five F nodes, five E nodes, and five C nodes. Slack = 0 cycles]

Our Dependence Graph Model (ISCA ’01)
[Figure: dependence graph of F, E, and C nodes with numeric edge weights. Slack = 6 cycles]
• Modeling resources increases observable slack

Reporting slack
Global slack: # cycles a dynamic operation can be delayed without increasing execution time
Apportioned slack: Distribute global slack among operations using an apportioning strategy
[Figure: example graph with two operations labeled GS = 15, apportioned AS = 10 and AS = 5]

Slack measurements (Perl)
6-wide out-of-order superscalar, 128-entry issue window, 12-stage pipeline
[Figure: slack measurements for Perl; series: global, apportioned]

Analysis via apportioning strategy
What non-uniform designs can slack tolerate?
Design: Fast/slow ALU
Non-uniformity: Exe. latency
App. strategy: Double latency
Good news: 80% of dynamic instructions can have latency doubled

Contributions/Outline: Predicting (how to predict slack in hardware?)
• simple, delay and observe approach works well

Measuring slack in hardware: delay and observe
Goal: Determine whether static instruction has n cycles of slack
• Delay a dynamic instance by n cycles
• Check if critical (via critical-path analyzer, ISCA ’01):
  • No: instruction has n cycles of slack
  • Yes: instruction does not have n cycles of slack

Two predictor designs
• Implicit slack predictor
  • delay and observe with natural non-uniform delays
  • “Bin” instructions to match non-uniform hardware
• Explicit slack predictor
  • Retry delay and observe with different values of slack
Problem: obtaining unperturbed measurements

Contributions/Outline: Case study (how to design a control policy?)
• on power-efficient machine, up to 20% speedup

Fast/slow pipeline microarchitecture
P ∝ F²: save ~37% core power
[Figure: Fetch + Rename and Steer feed a fast, 3-wide pipeline and a slow, 3-wide pipeline, each with Reg, WIN, and ALUs; also shown: Bypass Bus, Data Cache]
• Design has three nonuniformities:
  • Higher execution latencies
  • Increased (cross-domain) bypass latency
  • Decreased effective issue bandwidth
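A quick check of the ~37% figure, reading the slide's power relation as power proportional to frequency squared. The remaining inputs are assumptions not stated on the slide: the slow pipeline runs at half the fast pipeline's frequency, both pipelines would draw equal power at full frequency, and together they account for the core power being compared. The function name is hypothetical.

```python
def relative_core_power(freq_ratios):
    """Mean of f^2 over the pipelines, relative to every pipeline at full frequency."""
    return sum(f ** 2 for f in freq_ratios) / len(freq_ratios)


# Assumed configuration: one pipeline at full frequency, one at half frequency.
saving = 1.0 - relative_core_power([1.0, 0.5])
print(f"{saving:.1%}")  # 37.5%
```

Under these assumptions the half-speed pipeline draws a quarter of its full-speed power, so the pair draws 62.5% of the two-fast-pipeline baseline.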

Selecting bins for implicit slack predictor
• Two decisions
  • Steer to fast/slow pipeline, then
  • Schedule with high/low priority within a pipeline
Use implicit slack predictor with four (2²) bins:
Schedule \ Steer | Fast | Slow
High | 1 | 3
Low | 2 | 4
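One way to encode the four bins is a lookup from bin number to the pair of decisions. The sketch below reads the slide's grid as bin 1 = fast/high, bin 2 = fast/low, bin 3 = slow/high, bin 4 = slow/low; the type and function names are hypothetical.

```python
from enum import Enum
from typing import NamedTuple


class Pipeline(Enum):
    FAST = "fast"
    SLOW = "slow"


class Priority(Enum):
    HIGH = "high"
    LOW = "low"


class Decision(NamedTuple):
    pipeline: Pipeline   # steering decision
    priority: Priority   # scheduling decision within the chosen pipeline


# Bin numbering as read from the slide's 2 x 2 grid (an interpretation).
BIN_TO_DECISION = {
    1: Decision(Pipeline.FAST, Priority.HIGH),
    2: Decision(Pipeline.FAST, Priority.LOW),
    3: Decision(Pipeline.SLOW, Priority.HIGH),
    4: Decision(Pipeline.SLOW, Priority.LOW),
}


def decide(slack_bin: int) -> Decision:
    """Translate a predicted slack bin into steer and schedule decisions."""
    return BIN_TO_DECISION[slack_bin]


print(decide(3).pipeline.value, decide(3).priority.value)  # slow high
```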

Putting it all together
Prediction Path: PC indexes the slack prediction table (4 KB), which supplies a slack bin # to the fast/slow pipeline core
Training Path: Criticality Analyzer (~1 KB) and 4-bin slack state machine
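A minimal sketch of a PC-indexed slack prediction table with a 4-bin state machine per entry. The slide gives only the table size and the bin count, so everything else here is an assumption: 2 bits per entry (hence 16K entries in 4 KB), 4-byte instruction alignment for indexing, entries starting in bin 1, and a saturating update that moves one bin toward more slack when the delayed instance was not critical and one bin toward less slack when it was.

```python
class SlackPredictor:
    """Hypothetical PC-indexed table of slack bins (1 = least slack, 4 = most)."""

    NUM_BINS = 4

    def __init__(self, entries: int = 16 * 1024):
        # 16K entries x 2 bits = 4 KB (assumed organization).
        self.entries = entries
        self.table = [1] * entries  # assumed initial state: no slack

    def _index(self, pc: int) -> int:
        # Drop the low two bits, assuming 4-byte aligned instructions.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> int:
        """Prediction path: look up the slack bin for this static instruction."""
        return self.table[self._index(pc)]

    def train(self, pc: int, was_critical: bool) -> None:
        """Training path: update from the criticality analyzer's verdict."""
        i = self._index(pc)
        if was_critical:
            self.table[i] = max(1, self.table[i] - 1)
        else:
            self.table[i] = min(self.NUM_BINS, self.table[i] + 1)


predictor = SlackPredictor()
for _ in range(3):
    predictor.train(0x400, was_critical=False)
print(predictor.predict(0x400))  # 4
predictor.train(0x400, was_critical=True)
print(predictor.predict(0x400))  # 3
```

The one-step saturating update is only one plausible choice; a real design might add hysteresis before leaving a bin.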

Fast/slow pipeline performance
[Figure: performance of three configurations: 2 fast, high-power pipelines; slack-based policy; reg-dep steering]

Slack used up
Average global slack per dynamic instruction
[Figure: series: 2 fast, high-power pipelines; slack-based policy; reg-dep steering]

Conclusion: Future processor design flow
Future processors will be non-uniform. A slack-based policy can control them.
• Measure slack in a simulator
  • decide early on what designs to build
• Predict slack in hardware
  • simple implementation
• Design a control policy
  • policy decisions map to slack bins

Backup slides

Define local slack
Local Slack: # cycles edge latency can be increased without delaying subsequent instructions
[Figure: example graph with numeric edge latencies; annotations: 2 cycles, 1 cycle, 1 cycle]
In real programs, ~20% insts have local slack of at least 5 cycles

Compute local slack
Local Slack: # cycles edge latency can be increased without delaying subsequent instructions
[Figure: example graph with numeric edge latencies and an arrival time at each node; annotations: 2 cycles, 1 cycle, 1 cycle]
In real programs, ~20% insts have local slack of at least 5 cycles
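Local slack can be computed from arrival times: an edge's local slack is how much later its destination actually starts than that edge alone would require. The sketch below is a minimal illustration on an invented three-edge graph, not the graph from the slide; the function names and the edge-dictionary representation are assumptions.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def arrival_times(edges):
    """edges maps (src, dst) -> latency. Returns the earliest start time of each node."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, set()).add(src)
        preds.setdefault(src, set())
    arrival = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        arrival[node] = max(
            (arrival[p] + edges[(p, node)] for p in preds[node]), default=0
        )
    return arrival


def local_slack(edges):
    """Cycles each edge's latency can grow without delaying its destination node."""
    arrival = arrival_times(edges)
    return {(s, d): arrival[d] - (arrival[s] + lat) for (s, d), lat in edges.items()}


# Invented example: c waits 3 cycles for b, so the 1-cycle edge a->c has 2 cycles to spare.
example = {("a", "c"): 1, ("b", "c"): 3, ("c", "d"): 1}
print(local_slack(example))  # {('a', 'c'): 2, ('b', 'c'): 0, ('c', 'd'): 0}
```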

Define global slack
Global Slack: # cycles edge latency can be increased without delaying the last instruction in the program
[Figure: example graph with numeric edge latencies; annotations: 2 cycles, 2 cycles, 1 cycle, 1 cycle]
In real programs, >90% insts have global slack of at least 5 cycles

Compute global slack
Calculate global slack: backward propagate, accumulating local slacks
Local slacks: LS1=1, LS2=0, LS3=1, LS5=2
GS6=LS6=0
GS5=LS5=2
GS3=GS6+LS3=1
GS1=MIN(GS3,GS5)+LS1=2
In real programs, >90% insts have global slack of at least 5 cycles
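The backward propagation can be sketched as a reverse topological pass: each item's global slack is its local slack plus the minimum global slack among its successors. The local slack values below are the slide's. The successor lists are inferred from the slide's equations, and the 2 -> 3 link is an assumption chosen because it reproduces GS2=1 from the apportioned-slack example. The function name is hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def global_slack(local_slack, successors):
    """GS[n] = LS[n] + min(GS[s] for successors s); GS[n] = LS[n] if n has none."""
    # Passing successors where graphlib expects predecessors makes
    # static_order() yield every node after all of its successors.
    graph = {n: successors.get(n, ()) for n in local_slack}
    gs = {}
    for node in TopologicalSorter(graph).static_order():
        succ = successors.get(node, ())
        gs[node] = local_slack[node] + (min(gs[s] for s in succ) if succ else 0)
    return gs


LS = {1: 1, 2: 0, 3: 1, 5: 2, 6: 0}   # from the slide
SUCC = {1: [3, 5], 2: [3], 3: [6]}    # inferred; 2 -> 3 is an assumption

gs = global_slack(LS, SUCC)
print({n: gs[n] for n in sorted(gs)})  # {1: 2, 2: 1, 3: 1, 5: 2, 6: 0}
```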

Apportioned slack
Goal: Distribute slack to instructions that need it
Thus, apportioning strategy depends upon nature of non-uniformities in machine
e.g.:
non-uniformity: 2 speed bypass busses (1 cycle, 2 cycle)
strategy: give 1 cycle slack to as many edges as possible

Define apportioned slack
Apportioned slack: Distribute global slack among edges
For example:
GS1=2, AS1=1
GS2=1, AS2=1
GS3=1, AS3=0
GS5=2, AS5=1
In real programs, >75% insts can be apportioned slack of at least 5 cycles
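One simple way to apportion slack is a greedy pass: try to add k cycles to each operation in turn and keep the delay only if the critical path length is unchanged. This is a sketch of a generic greedy strategy, not necessarily the one used for the measurements; it works on node latencies rather than edges for brevity, its result depends on visiting order, and the graph and function names are invented for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def critical_path_length(latency, successors):
    """Length of the longest latency-weighted path through the graph."""
    graph = {n: successors.get(n, ()) for n in latency}
    longest_from = {}
    # Successors are passed as graphlib "predecessors", so consumers come first.
    for node in TopologicalSorter(graph).static_order():
        tail = max((longest_from[s] for s in successors.get(node, ())), default=0)
        longest_from[node] = latency[node] + tail
    return max(longest_from.values())


def apportion(latency, successors, k=1):
    """Greedily grant k cycles to as many nodes as possible without lengthening the run."""
    baseline = critical_path_length(latency, successors)
    delayed = dict(latency)
    granted = {}
    for node in latency:
        delayed[node] += k
        if critical_path_length(delayed, successors) == baseline:
            granted[node] = k          # delay is absorbed by slack: keep it
        else:
            delayed[node] -= k         # delay would cost time: undo it
            granted[node] = 0
    return granted


# Invented example: b -> c -> d is the critical path (5 cycles).
latency = {"a": 1, "b": 3, "c": 1, "d": 1, "e": 1}
successors = {"a": ["c"], "b": ["c"], "c": ["d"], "e": ["d"]}
print(apportion(latency, successors))  # {'a': 1, 'b': 0, 'c': 0, 'd': 0, 'e': 1}
```

Because every grant is checked against the current, already-delayed graph, applying all granted delays together leaves the critical path length unchanged by construction.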

Slack measurements
[Figure: slack measurements; series: global, apportioned, local]

Multi-speed ALUs Can we tolerate ALUs running at half frequency? Yes, but: • For all types of operations? (needed for multi-speed clusters) • Can we make all integer ops double latency?

Load slack
Can we tolerate a long-latency L1 hit?
design: wire-constrained machine, e.g. Grid
non-uniformity: multi-latency L1
apportioning strategy: apportion ALL slack to load instructions

Most loads can tolerate an L2 cache hit
[Figure: apportion all slack to loads]

Multi-speed ALUs
Can we tolerate ALUs running at half frequency?
design: fast/slow ALUs
non-uniformity: multi-latency execution latency, bypass
apportioning strategy: give slack equal to original latency + 1

Most instructions can tolerate doubling their latency
[Figure: Latency+1 apportioning]

Breakdown by operation (Latency+1 apportioning)

Validation
Two steps:
• Increase latencies of insts. by their apportioned slack
  • for three apportioning strategies: 1) latency+1, 2) 5 cycles to as many instructions as possible, 3) 12 cycles to as many loads as possible
• Compare to baseline (no delays inserted)

Validation
Worst case: Inaccuracy of 0.6%

Predicting slack
Two steps to PC-indexed, history-based prediction:
• Measure slack of a dynamic instruction
• Store in array indexed by PC of static instruction
Need: Ability to measure slack of a dynamic instruction
Need: Locality of slack
• can capture 80% of potential exploitable slack

Locality of slack experiment
For each static instruction:
• Measure % slackful dynamic instances
• Multiply by # of dynamic instances
• Sum across all static instructions
• Compare to total slackful dynamic instructions (ideal case)
slackful = has enough apportioned slack to double latency

Locality of slack
PC-indexed, history-based predictor can capture most of the available slack

