Applying Perceptrons to Speculation in Computer Architecture



  1. Applying Perceptrons to Speculation in Computer Architecture Michael Black Dissertation Defense April 2, 2007

  2. Presentation Outline • Background and Objectives • Perceptron behavior • Local value prediction • Global value prediction • Criticality prediction • Conclusions

  3. Motivation: Jimenez’s Perceptron Branch Predictor • 27% reduction in mispredictions over gshare • 15.8% increase in performance over gshare¹ Why better? It can consider longer histories. ¹Jimenez and Lin, “Dynamic Branch Prediction with Perceptrons,” 2002.

  4. Problem of Lookup Tables • Size grows exponentially with history length • Result: must consider only a small subset of the available data

  5. Global vs. Local • Local history: past iterations of same instruction • Global history: all past dynamic instructions

  6. Perceptron Predictions: • Dot product of binary inputs and integer weights • Apply a threshold: if the sum is positive, predict 1; if negative, predict 0 Learning objective: each weight’s value should reflect its input’s correlation with the outcome
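
A minimal sketch of this prediction step in Python. The ±1 input encoding and the explicit bias weight weights[0] are assumptions for illustration; the slide specifies only the dot product and the sign threshold.

```python
def perceptron_predict(weights, history):
    """Dot product of weights and +/-1 history bits, then a sign threshold."""
    y = weights[0]                        # bias weight with an implicit always-on input
    for w, x in zip(weights[1:], history):
        y += w * x                        # each weight scales its input bit
    return (1 if y >= 0 else 0), y        # thresholded prediction and raw sum

# Example: weights favoring a direct correlation with the first bit
# and an inverse correlation with the second.
print(perceptron_predict([0, 3, -1], [+1, -1]))   # -> (1, 4)
```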

  7. Training strategies Training by correlation: if actual == input_k: w_k++ else: w_k-- Training by error: error = actual - predicted; w_k = w_k + input_k × error
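
Hedged Python sketches of both training rules as stated on the slide, with outcomes as 0/1 and inputs as ±1 to match the prediction sketch above; the weight saturation a hardware implementation would need is omitted.

```python
def train_by_correlation(weights, history, actual):
    t = 1 if actual else -1                       # outcome as +1/-1
    weights[0] += t                               # bias tracks the outcome itself
    for k, x in enumerate(history):
        weights[k + 1] += 1 if x == t else -1     # w_k++ on agreement, else w_k--

def train_by_error(weights, history, actual, predicted):
    error = actual - predicted                    # 0 if correct, +/-1 if wrong
    weights[0] += error
    for k, x in enumerate(history):
        weights[k + 1] += x * error               # w_k += input_k * error
```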

  8. Linear Separability Each weight can learn only one correlation direction: • direct (positive weight) • inverse (negative weight) A perceptron can therefore learn only linearly separable functions; the classic counterexample is XOR, checked in the sketch below.
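
A brute-force check of the XOR example (my illustration, not the slides’): no assignment of small integer weights classifies all four XOR cases, because neither a direct nor an inverse weight fits an input whose correlation flips with the other input.

```python
from itertools import product

# XOR over +/-1 inputs: output 1 exactly when the inputs differ.
xor = [((-1, -1), 0), ((-1, +1), 1), ((+1, -1), 1), ((+1, +1), 0)]

def separable(cases, r=range(-4, 5)):
    """True if some (bias, w1, w2) classifies every case correctly."""
    return any(
        all((1 if w0 + w1 * x1 + w2 * x2 >= 0 else 0) == t
            for (x1, x2), t in cases)
        for w0, w1, w2 in product(r, repeat=3)
    )

print(separable(xor))   # False: no single perceptron learns XOR
```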

  9. Dissertation Objectives • Analyze behavior of perceptrons when used to replace tables • Coping with limitations of perceptrons and their implementations • Applying perceptrons to value prediction • Applying perceptrons to criticality prediction

  10. Dissertation Contributions • Perceptron Local Value Predictor • can consider longer local histories • Perceptron Global-based Local Value Predictor • can use global information to choose local values • Two Perceptron Global Value Predictors • Perceptron Global Criticality Predictor • Comparison and analysis of: • perceptron training approaches • multiple-bit topologies • interference reduction strategies

  11. Analyses • How perceptrons behave when replacing tables • What effect the training approach has • Design and behavior of different multiple-bit perceptrons • Dealing with history interference

  12. Context-based Learning The concatenated history pattern (the “context”) directly indexes a prediction table.
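
A sketch of this table-based scheme for contrast (sizes are illustrative): the concatenated history is itself the table index, which is why table size doubles with every added history bit.

```python
HISTORY_BITS = 8
table = [0] * (1 << HISTORY_BITS)        # one entry per possible context

def context_index(history):
    """Pack the last HISTORY_BITS outcome bits (0/1) into a table index."""
    idx = 0
    for bit in history[-HISTORY_BITS:]:
        idx = (idx << 1) | bit
    return idx

# table[context_index(history)] holds the prediction for that exact
# pattern -- hence the exponential growth noted on slide 4.
```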

  13. Pattern Compatibility

  14. What affects perceptron learning? • Noise from uncorrelated inputs • Imbalance between pattern occurrences • False correlations Effects: • Perceptron takes longer to learn • Perceptron never learns

  15. Noise Training by correlation: • weights grow large rapidly: less susceptible to noise Training by error: • weights don’t grow until a misprediction occurs: more susceptible Solution? Exponential Weight Growth

  16. Studying Noise • Perceptron modeled independently of any application • p random patterns chosen for each level of correlation: at n correlated bits, a random correlation direction (direct/inverse) is chosen for each of the n bits; a target is randomly chosen for each pattern; the correlation directions determine the pattern’s first n bits; the remaining bits are chosen randomly • Perceptron is trained on each pattern set • Training time, averaged over 1000 random pattern sets, is plotted
     Example pattern-set generation for n=4, p=2, with per-bit directions d,d,i,d (d = direct, i = inverse):
     1101xxxx → 1
     0010xxxx → 0
     filled in randomly as
     11010101 → 1
     00101110 → 0
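
A hedged reconstruction of this pattern-set generator in Python; the details are inferred from the slide, and the dissertation’s actual harness may differ.

```python
import random

def make_pattern_set(n, p, width=8):
    """n correlated bits, p patterns, width total input bits."""
    # One random correlation direction per correlated bit:
    # True = direct (bit equals the target), False = inverse.
    directions = [random.choice((True, False)) for _ in range(n)]
    patterns = []
    for _ in range(p):
        target = random.randint(0, 1)                       # per-pattern target
        bits = [target if d else 1 - target for d in directions]
        bits += [random.randint(0, 1) for _ in range(width - n)]
        patterns.append((bits, target))                     # trailing bits are noise
    return patterns

# make_pattern_set(4, 2) can yield the slide's example: directions
# d,d,i,d with patterns 11010101 -> 1 and 00101110 -> 0.
```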

  17. How does noise affect training time?

  18. How does imbalance affect training time?

  19. How does imbalance affect learning?

  20. Why can’t training-by-correlation handle imbalance?

  21. Findings • Increasing the history size hurts if the percentage of correlated inputs decreases • Training-by-error must be used when correlation is poor and the patterns are imbalanced

  22. Multibit Perceptron Predicts values, not single bits. What is a value correlation? • A particular input value implies a particular output value • e.g., seeing 5 implies 4 Approaches: • Disjoint • Fully Coupled • Weight per Value

  23. Disjoint Perceptron Tradeoff: + small size - can only learn from respective bits

  24. Fully Coupled Perceptron Tradeoff: + can learn from any past bit - more weights

  25. Learning abilities compared

  26. Weight-per-Value Perceptron Tradeoff: + Can always learn - Tons of weights
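
A hedged comparison of how the three topologies’ weight storage scales; the shapes below are my reading of the slides’ tradeoffs, not the dissertation’s exact structures.

```python
def weight_counts(b, h, v):
    """b = bits per value, h = history depth, v = distinct values tracked."""
    return {
        # Disjoint: output bit i sees only bit i of each past value.
        "disjoint": b * (h + 1),               # +1 for a bias weight per bit
        # Fully coupled: every output bit sees every past history bit.
        "fully_coupled": b * (b * h + 1),
        # Weight per value: a perceptron per candidate value over the
        # value-index history rather than over individual bits.
        "weight_per_value": v * (h + 1),
    }

# With 16-bit values, a 4-deep history, and every value tracked:
print(weight_counts(b=16, h=4, v=2 ** 16))
# {'disjoint': 80, 'fully_coupled': 1040, 'weight_per_value': 327680}
```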

  27. History Interference

  28. How common is interference?

  29. How does interference affect perceptrons? • constructive • destructive • neutral • weight-destructive • value-destructive

  30. Interference in Perceptron Branch Prediction

  31. Coping: Assigned Seats Tradeoff: + no additional size - can’t consider multiple iterations of an instruction

  32. Weight for each interfering branch (“Piecewise Linear”) Tradeoff: + interference is completely removed - massive size
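
A hedged sketch of the “weight per interfering branch” idea in the spirit of piecewise linear prediction: each weight is selected by both the predicting branch’s PC and the PC of the branch that produced the history bit, so two branches sharing a history slot no longer share a weight. Table dimensions are illustrative and the bias weight is omitted.

```python
N_PC, N_HPC, HLEN = 64, 64, 16
# One weight per (predicting branch, history branch, history position).
W = [[[0] * HLEN for _ in range(N_HPC)] for _ in range(N_PC)]

def predict(pc, ghist):
    """ghist: list of (branch_pc, outcome as +1/-1), newest first."""
    y = 0
    for i, (hpc, h) in enumerate(ghist[:HLEN]):
        y += W[pc % N_PC][hpc % N_HPC][i] * h
    return 1 if y >= 0 else 0
```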

  33. Simulator A new superscalar, cycle-accurate, execution-driven simulator that can accurately model value prediction and criticality

  34. Value Prediction What is it? • predicting instructions’ data values to overcome data dependencies Why consider it? • requires a multiple-bit prediction, not a single-bit

  35. Table-based Predictor Limitations: • exponential growth with the number of past values and the value-history length (a table indexed by a depth-h history over v distinct values needs on the order of v^h entries) • can only consider local history Storage: 70 kB for 4 values, 34 MB for 8 values, 74×10¹⁸ B for 16 values

  36. Perceptron in Pattern Table (PPT) Tradeoff: + few perceptrons needed (for 4 past values) + can consider longer histories - exponential growth with the number of past values

  37. Perceptron in Value Table (PVT) Tradeoff: + linear growth in both value-history length and the number of past values - more perceptrons needed
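
A hedged sketch of one plausible PVT organization, inferred from the stated tradeoffs rather than copied from the dissertation: each static instruction’s table entry stores its N most recent values plus one perceptron per stored value, fed by the local history of which value occurred when. Storage grows linearly in both the history length and the number of past values, matching the tradeoff above.

```python
N, HLEN = 4, 8          # illustrative: 4 candidate values, 8-deep history

class PVTEntry:
    def __init__(self):
        self.values = [0] * N               # most recent distinct values
        self.history = [0] * HLEN           # index of the value seen each time
        self.weights = [[0] * (HLEN + 1) for _ in range(N)]  # +1 bias each

    def predict(self):
        def score(v):                       # perceptron for candidate value v
            w = self.weights[v]
            y = w[0]
            for j, h in enumerate(self.history):
                y += w[j + 1] * (1 if h == v else -1)
            return y
        return self.values[max(range(N), key=score)]
```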

  38. Results: PVT 2.4-5.6% accuracy increase, 0.5-1.2% performance increase 102kB-1.3MB storage needed

  39. Results: PPT 1.4-2.8% accuracy decrease: not a good approach 72kB-115kB storage needed

  40. Global-Local Value Prediction Uses global correlation to predict locally available values

  41. Global-Local Predictor

  42. Global-Global Prediction Tradeoff: + Less value storage - More bits needed per perceptron input

  43. Global Bitwise Tradeoff: + no value storage + not limited to previously seen values - many more bits needed per perceptron input

  44. Global Predictors Compared • Global-Local: 3.1% accuracy increase, 1.6% performance increase, 1.2MB storage needed • Global-Global: 7.6% accuracy increase, 6.7% performance increase, 1.3MB storage needed • Bitwise: 12.7% accuracy increase, 5.3% performance increase, 4.2MB storage needed

  45. Can Bitwise Predict New Values? 5.0% of all predictions are correct values never seen before. A further 9.8% are correct values not seen in the local history.

  46. Multibit Topologies Compared • Disjoint: 3.1% accuracy increase, 1.6% performance increase, 1.2MB storage needed • Fully Coupled: 6.8% accuracy decrease, 1.5% performance decrease, 3.8MB storage needed • Weight per Value: 10.7% accuracy increase, 4.4% performance increase, 21.5MB storage needed

  47. Training Approaches Compared: Global-Local

  48. Training Approaches Compared: PVT Local

  49. Final Weight Values: Distribution and Accuracy

  50. Anti-Interference Compared
