

  1. Runtime Workload Behavior Prediction Using Statistical Metric Modeling with Application to Dynamic Power Management. R. Sarikaya, C. Isci, A. Buyuktosunoglu & Li Zhang, IBM T. J. Watson Research Center, IISWC 12-03-2011

  2. OUTLINE: Motivation for Runtime Prediction; Example of the State of the Art: Table-Based Prediction; Our Approach: Statistical Metric Model (Intuition, Definition, Formulation, Implementation); Experimental Results; Conclusions

  3. MOTIVATION. Dynamically-varying workload characteristics necessitate adaptive computing. Significant inter- and (repetitive) intra-workload variations exist. Understanding and predicting this dynamic behavior helps apply adaptations proactively and more effectively. (Figures: intra-workload variability and inter-workload variability.)

  4. Dynamic Management with Live, Runtime Phase Prediction. The current (reactive) dynamic adaptation approach assumes that the last/recently observed behavior will persist: great for stable execution, but inaccurate for highly variable behavior. Key question: how can we accurately predict future application phase behavior on all types of execution? (Figure: a tracked characteristic over time t.)

  5. Table-Based Prediction. A Global Phase History Register (GPHR, of configurable depth) holds the last observed phases from the performance counters; a Pattern History Table (PHT) stores previously seen phase patterns along with tags, a prediction entry, and an age/invalid field. The predicted phase comes from the corresponding PHT prediction entry if a matching pattern is found in the PHT, and from GPHR(0), the last observed phase, otherwise. (Figure: GPHR and PHT structures with example entries.)
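A minimal sketch of this table-based scheme, to make the lookup concrete (the class and parameter names such as TablePredictor, gphr_depth, and pht_entries are illustrative assumptions, not the paper's implementation):

```python
# Sketch of a GPHR/PHT table-based phase predictor (illustrative only).
from collections import OrderedDict

class TablePredictor:
    def __init__(self, gphr_depth=8, pht_entries=1024):
        self.gphr_depth = gphr_depth     # length of the global phase history register
        self.pht_entries = pht_entries   # capacity of the pattern history table
        self.gphr = []                   # most recent phase first: [P_t, P_t-1, ...]
        self.pht = OrderedDict()         # pattern (tuple) -> predicted next phase

    def predict(self):
        key = tuple(self.gphr)
        if key in self.pht:              # matching pattern: use PHT prediction entry
            return self.pht[key]
        return self.gphr[0] if self.gphr else None  # else last observed phase, GPHR(0)

    def update(self, observed_phase):
        if len(self.gphr) == self.gphr_depth:
            key = tuple(self.gphr)
            self.pht[key] = observed_phase          # record what followed this pattern
            self.pht.move_to_end(key)
            if len(self.pht) > self.pht_entries:
                self.pht.popitem(last=False)        # evict oldest (age-based) entry
        self.gphr = ([observed_phase] + self.gphr)[:self.gphr_depth]
```

On a PHT hit the stored next phase is returned; otherwise the predictor degenerates to last-value prediction.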

  6. OUTLINE: Motivation for Runtime Prediction; Example of the State of the Art: Table-Based Prediction; Our Approach: Statistical Metric Model (Intuition, Definition, Formulation, Implementation); Experimental Results; Conclusions

  7. SMM Prediction. SMM is a probability distribution over metric sequences, P(s), where plausible sequences receive non-negligible probability and implausible ones receive probability near zero: P("how are you doing today") ≈ 0.001; P("10 11 12 10 15") ≈ 0.001; P("or apples sing on the moon") ≈ 0; P("1 2 3 4 5 6 7 8 9 10") ≈ 0.

  8. Intuition for SMM. The SMM is inspired by the way natural language is generated: it treats the metric samples as the words in a language and builds a language model for each metric. In natural language, an underlying structure defined by the grammar determines the order in which words combine into meaningful sentences. We can likewise treat metric modeling as a language modeling problem: we assume there is an underlying structure in each metric, and if such a structure indeed exists (e.g., repetitive patterns), SMM can reveal and model it. In the rest of the presentation we often draw parallels between natural language modeling and the SMM to explain the concepts used in this work.

  9. How does SMM work? 1. Set the quantization levels (depending on the granularity of prediction). 2. Choose n (the past history length) based on the available data. 3. Observe 100-200 metric values for initialization. 4. Estimate the statistical metric model (SMM) parameters. 5. Predict the next value by picking the quantized metric value with the highest probability. 6. Observe the true value of the metric. 7. Re-estimate the model parameters. 8. Go to step 5.
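The loop above might look like the following sketch (quantize, read_metric, and the model interface fit / most_likely_next / update are hypothetical names; the paper's actual implementation may differ):

```python
# Sketch of the SMM predict/observe/re-estimate loop (hypothetical helpers).

def quantize(value, v_min, v_max, bins=20):
    """Step 1: map a raw metric value onto one of `bins` quantization levels."""
    level = int((value - v_min) / (v_max - v_min) * bins)
    return min(max(level, 0), bins - 1)

def run_smm(read_metric, model, v_min=0.0, v_max=1.0, init_samples=150):
    # Steps 2-3: fix n inside `model`; observe ~100-200 samples to initialize.
    history = [quantize(read_metric(), v_min, v_max) for _ in range(init_samples)]
    model.fit(history)                                  # step 4: estimate parameters
    while True:
        pred = model.most_likely_next(history)          # step 5: argmax_q P(q | history)
        actual = quantize(read_metric(), v_min, v_max)  # step 6: observe true value
        model.update(history, actual)                   # step 7: re-estimate counts
        history.append(actual)                          # step 8: loop back to step 5
        yield pred, actual
```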

  10. Formulation of SMM. It is difficult to compute P("how are you doing today") directly. Step 1: decompose the probability with the chain rule: P("how are you doing today") = P("how") × P("are" | "how") × P("you" | "how are") × P("doing" | "how are you") × P("today" | "how are you doing"). The n-gram approximation: assume each word depends only on the previous n − 1 words (n words in total; "gram" is Greek for writing), e.g. P("today" | "how are you doing") ≈ P("today" | "you doing").

  11. SMM n-gram Model: a hierarchy of conditional models of decreasing order, P(s4 | s3, s2, s1), P(s4 | s3, s2), P(s4 | s3), P(s4), with higher-order models backing off to lower-order ones.

  12. Conditional Probability Estimation. How do we find the probabilities? Maximum Likelihood Estimation (MLE): P("today" | "you doing") ≈ C("you doing today") / C("you doing"). But what about P("sing" | "or apples") = C("or apples sing") / C("or apples")? If "or apples sing" was never observed, the probability would be 0: very bad!
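As a concrete illustration of MLE by counting (a sketch; ngram_counts and mle_prob are hypothetical helper names), note that the denominator counts only history occurrences that actually have a continuation:

```python
# MLE by counting, as on this slide: P(w | h) = C(h, w) / C(h).
from collections import Counter

def ngram_counts(seq, n):
    """Count every subsequence of length 1..n in seq."""
    counts = Counter()
    for i in range(len(seq)):
        for order in range(1, n + 1):
            if i + order <= len(seq):
                counts[tuple(seq[i:i + order])] += 1
    return counts

def mle_prob(counts, vocab, history, w):
    """C(h, w) / C(h), counting only histories that have a continuation."""
    h = tuple(history)
    ctx = sum(counts[h + (v,)] for v in vocab)
    return counts[h + (w,)] / ctx if ctx else 0.0

toks = "how are you doing today how are you".split()
vocab = set(toks) | {"or", "apples", "sing"}
c = ngram_counts(toks, 3)
print(mle_prob(c, vocab, ["you", "doing"], "today"))  # 1.0: only continuation seen
print(mle_prob(c, vocab, ["or", "apples"], "sing"))   # 0.0: never observed -> very bad
```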

  13. Model Smoothing. Smoothing: modify the conditional distributions away from pure relative frequency estimates in order to compensate for data sparsity. Add-one smoothing: P(w | h) = (C(h, w) + 1) / (C(h) + V). Model interpolation: P(w | h) = λ · P_MLE(w | h) + (1 − λ) · P(w | h′), where h′ is the history shortened by one word. Absolute discounting: P(w | h) = max(C(h, w) − D, 0) / C(h) + α(h) · P(w | h′), with fixed discount D and back-off weight α(h).
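The three schemes in their standard textbook forms, reusing the counting helpers from the previous sketch (the λ and D values are arbitrary placeholders, not the paper's choices):

```python
# Standard smoothing forms (a sketch; constants are placeholders).

def add_one(counts, vocab, history, w):
    """Add-one (Laplace): P(w|h) = (C(h,w) + 1) / (C(h) + V)."""
    h = tuple(history)
    ctx = sum(counts[h + (v,)] for v in vocab)
    return (counts[h + (w,)] + 1) / (ctx + len(vocab))

def interpolated(counts, vocab, history, w, lam=0.7):
    """Interpolation: mix the n-gram estimate with the (n-1)-gram estimate."""
    if not history:
        total = sum(counts[(v,)] for v in vocab)
        return counts[(w,)] / total if total else 1.0 / len(vocab)
    h = tuple(history)
    ctx = sum(counts[h + (v,)] for v in vocab)
    higher = counts[h + (w,)] / ctx if ctx else 0.0
    return lam * higher + (1 - lam) * interpolated(counts, vocab, history[1:], w, lam)

def absolute_discount(counts, vocab, history, w, d=0.5):
    """Subtract d from seen counts; redistribute the freed mass via back-off."""
    if not history:
        total = sum(counts[(v,)] for v in vocab)
        return counts[(w,)] / total if total else 1.0 / len(vocab)
    h = tuple(history)
    ctx = sum(counts[h + (v,)] for v in vocab)
    if ctx == 0:
        return absolute_discount(counts, vocab, history[1:], w, d)
    seen = sum(1 for v in vocab if counts[h + (v,)] > 0)
    backoff_mass = d * seen / ctx                   # alpha(h): freed probability mass
    p_lower = absolute_discount(counts, vocab, history[1:], w, d)
    return max(counts[h + (w,)] - d, 0) / ctx + backoff_mass * p_lower
```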

  14. OUTLINE: Motivation for Runtime Prediction; Example of the State of the Art: Table-Based Prediction; Our Approach: Statistical Metric Model (Intuition, Definition, Formulation, Implementation); Experimental Results; Conclusions

  15. Prediction Errors: Equake. (Figure: quantized absolute errors of different predictors during the entire run of equake.)

  16. SMM: Resilience to Variations. (Figure: quantized absolute errors of different predictors during a segment of the equake run.)

  17. SMM: Benefit of Back-off to Lower-Order Models.
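The back-off mechanism behind this result, sketched with the counting helpers above (the function name and the max_order default, mirroring the n = 8 used later, are assumptions):

```python
def backoff_predict(counts, vocab, history, max_order=8):
    """Try the longest available history first; back off to shorter ones."""
    for start in range(max(0, len(history) - (max_order - 1)), len(history) + 1):
        h = tuple(history[start:])
        ctx = sum(counts[h + (v,)] for v in vocab)
        if ctx > 0:                                  # longest history with data wins
            return max(vocab, key=lambda v: counts[h + (v,)])
    return None                                      # no data observed at all
```

Rather than failing on an unseen long pattern, the predictor silently degrades to shorter histories, which is exactly what makes it resilient to small fluctuations.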

  18. Comparison of Predictors on SPEC CPU2000. The SMM predictor performs significantly better than existing approaches.

  19. Long-Term Pattern Prediction. SMM accuracy improves over time, with repetitions of observed behavior.

  20. SMM Application to Power Management. SMM predicts the memory-boundedness of applications, which guides the DVFS settings: a higher bin number means more memory-bound execution, and therefore a lower (V, f) setting.
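An illustrative mapping from predicted bin to operating point (the bin count and the (V, f) pairs below are invented for illustration; the paper's platform-specific DVFS levels are not shown here):

```python
# Hypothetical bin -> (V, f) mapping for DVFS (values are made up).
DVFS_LEVELS = [                                # (volts, GHz), highest first
    (1.20, 2.0), (1.10, 1.8), (1.00, 1.6), (0.90, 1.4), (0.85, 1.2),
]

def dvfs_setting(predicted_bin, num_bins=20):
    """Higher bin = more memory-bound = lower (V, f) setting."""
    idx = min(predicted_bin * len(DVFS_LEVELS) // num_bins, len(DVFS_LEVELS) - 1)
    return DVFS_LEVELS[idx]

print(dvfs_setting(2))    # compute-bound -> (1.20, 2.0)
print(dvfs_setting(18))   # memory-bound  -> (0.85, 1.2)
```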

  21. Power Management Results. SMM power savings are within 10% of other predictors, while it reduces performance degradation by 30%.

  22. CONCLUSIONS. A Statistical Metric Model (SMM) for predicting dynamically-varying program behavior. Four main strengths of SMM: (1) models long-term global patterns in application behavior; (2) can respond to variable-length patterns; (3) is resilient to small fluctuations in observed patterns; (4) is adaptive: as it learns more, it predicts better. Superior accuracy over existing predictors: reduces prediction errors by 10X compared to last-value and 3X compared to table-based predictors, with average improvements of more than 60% and 40% respectively for highly varying benchmarks, and reduces performance degradation by 30% when applied to power management.

  23. BACKUP

  24. Overview. (Diagram: live, runtime phase monitoring of the application via performance counters → phase classification → phase prediction → dynamic power management; real systems, real measurements.)

  25. WHAT IS PREDICTION? Given the past observations …, x(n−2), x(n−1), x(n), what will the next value be? Note: the fundamental difficulty with prediction is the uncertainty associated with a future event.

  26. Compare to reactive approaches: Last Value, Fixed Window History, and Variable Window History. GPHT performs significantly better for highly varying applications: up to 6X and on average 2.4X misprediction improvement. (Figure: prediction accuracies (%) of LastValue, FixWindow_8, VarWindow_128_0.005, and GPHT (PHT:1024, GPHR:8) across SPEC benchmarks such as gap_ref, gcc_200, gcc_166, mcf_inp, apsi_ref, gzip_log, applu_in, mgrid_in, ammp_in, parser_ref, equake_in, wupwise_ref, and the bzip2 inputs.)

  27. SMM: Long-Term and Short-Term Modeling. The global metric model is the n-gram hierarchy P(s4 | s3, s2, s1), P(s4 | s3, s2), P(s4 | s3), P(s4); the temporal metric model, P_temporal = P(w_i), is estimated using recently observed samples. The two are combined as P_final = b1 · P_global + b2 · P_temporal. Vector length and quantization levels control the model size; the combined model captures both long-term and short-term behavior.
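A sketch of the combination (the weights b1, b2 and the window length are assumptions; P_global would come from the n-gram model above):

```python
# Combining global and temporal models: P_final = b1*P_global + b2*P_temporal.
from collections import Counter, deque

def temporal_prob(recent, w):
    """P_temporal(w): unigram estimated over recently observed samples only."""
    c = Counter(recent)
    return c[w] / len(recent) if recent else 0.0

def combined_prob(p_global, recent, w, b1=0.8, b2=0.2):
    return b1 * p_global + b2 * temporal_prob(recent, w)

recent = deque([2, 3, 2, 2, 4], maxlen=128)   # sliding window of recent samples
print(combined_prob(0.25, recent, 2))         # 0.8*0.25 + 0.2*(3/5) = 0.32
```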

  28. COMPUTATIONAL COMPLEXITY. The computational complexity of SMM model training depends on three parameters: the number of quantization bins (V = 20), the length of the finite sequence (n = 8), and how often the model needs to be updated. If the SMM parameters are updated at every sample, the computational upper bound is O(kVn) division and multiplication operations, where k is a small constant (k < 4), so the computational overhead is less than 1K multiply and divide operations, with compute-time overheads on the order of microseconds. SMM can therefore be implemented within the operating system software at context-switch time granularities with no visible performance impact. The storage requirements could be a more important concern for large n and very long metric sequences; model pruning helps, and the model size can inherently be controlled by limiting V and n.

  29. Model Update Overhead Issues. Example sequence: 2, 3, 1, 4, 1, 2, 3, 2, 2, 3, 4, 2, 3, followed by a newly observed 2. Model probabilities BEFORE we observe the new 2: p(2 | 2,3) = 1/3, the probability of observing [2,3,2]: [2,3] is observed 3 times as a history, and only one occurrence is followed by 2. p(2 | 3) = 1/3, the probability of observing [3,2]: [3] is observed 3 times as a history, and only one occurrence is followed by 2. p(2) = 5/13, the probability of observing 2 in isolation: 13 samples have been observed and 5 of them are 2. Model probabilities AFTER we observe the new 2: p(2 | 2,3) = 2/4: [2,3] has now been observed 4 times as a history, and two of those occurrences are followed by 2. p(2 | 3) = 2/4: [3] has now been observed 4 times as a history, and two of those occurrences are followed by 2. p(2) = 6/14: 14 samples have been observed and 6 of them are 2. For n = 8, in the extreme case there are 8+7+6+5+4+3+2+1 = 36 sequences whose probabilities must be updated; this computation is negligible.
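The same worked example, reproduced with the counting helpers from the earlier MLE sketch, shows that one new sample touches only n counters:

```python
# Incremental update for the example above (reuses ngram_counts / mle_prob).
seq = [2, 3, 1, 4, 1, 2, 3, 2, 2, 3, 4, 2, 3]
vocab = {1, 2, 3, 4}
c = ngram_counts(seq, 3)
print(mle_prob(c, vocab, [2, 3], 2), mle_prob(c, vocab, [3], 2))  # 1/3, 1/3

# Observing one new sample only increments the n suffix counts that end at it;
# no full re-count of the sequence is needed.
seq.append(2)
for order in range(1, 4):                     # n = 3 affected suffixes
    c[tuple(seq[-order:])] += 1
print(mle_prob(c, vocab, [2, 3], 2), mle_prob(c, vocab, [3], 2))  # 2/4, 2/4
```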

  30. Model Storage Overhead Issues. This is an issue in theory: with V = 20 and model order n = 8, the sequence space is as large as 20^8. In practice only a tiny fraction is observed, due to the underlying pattern in the data. V and n can be used to control the model size without significantly hurting performance: reducing n = 8 → n = 6 shrinks the space of possible sequences from 20^8 to 20^6, and additionally reducing V = 20 → V = 8 shrinks it to 8^6. There are also pruning techniques to control the model size. Idea: eliminate higher-order sequences from the tables whose probabilities fall below a threshold, and instead rely on lower-order sequences to predict the next value, as sketched below.
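A sketch of that pruning idea (the threshold and order cutoff are placeholders, not the paper's settings):

```python
# Prune rare high-order entries; back-off to lower orders covers them.
def prune(counts, vocab, min_prob=0.01, min_order=2):
    snapshot = dict(counts)                       # freeze counts for this pass
    for key in list(counts):
        if len(key) <= min_order:                 # always keep low-order entries
            continue
        h = key[:-1]
        ctx = sum(snapshot.get(h + (v,), 0) for v in vocab)
        if ctx and snapshot[key] / ctx < min_prob:
            del counts[key]                       # rely on shorter histories instead
```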

  31. Variability in Different Applications. Different applications show different levels of variability: less variability is easier to predict; more variability is harder to predict and more important for SMM.

  32. Table-Based Predictor. (Figure: prediction error vs. table size in the table-based predictor.)

  33. Impact of Model Order on SMM Performance. (Figure: the improvements from larger model orders seem to level off.)
