Identifying the Optimal Energy-Efficient Operating Points of Parallel Workloads

Ryan Cochran¹, Can Hankendi², Ayse K. Coskun², Sherief Reda¹
ICCAD’11, San Jose, CA
¹ Brown University School of Engineering
² Boston University ECE Department
This research has in part been funded by the Dean’s Catalyst Award at the College of Engineering, Boston University. R. Cochran and S. Reda are partially supported by NSF grants 0952866 and 1115424.

Challenges for Large-Scale Computing Systems
• Energy efficiency and budget/cost control are the major challenges (~40%)
• Energy consumption is increasing by 15% per year [Koomey et al., 2008]
• Server management and power/cooling are the major contributors to increasing costs (Source: International Data Corporation (IDC), 2009)

Energy Management on Multicore Systems
Goal: Identifying optimum operating points to improve energy efficiency
Control knobs
• Dynamic voltage and frequency scaling (DVFS)
• Thread count (for parallel workloads)
• Components that can be turned on/off (CPU)
• Parallel workloads
• Inter-thread dependencies
• Software structure
• Parallelization model
[Figure: large scale computing system (several racks), rack (several servers), server (multicore processors); processor energy plot for blackscholes]

Outline
• Background on Dynamic Power Management
• Objective Functions for Energy-efficient Computing
• Proposed Technique
  • Multinomial Logistic Regression
  • L1 Regularization
• Experimental Results
• Conclusions

Background
• HW techniques for fine-granularity DVFS, clock domain design, voltage island partitioning, etc. (e.g., Magklis, ISCA’03; Kim, HPCA’09)
• SW optimization for optimally scheduling a known set of tasks, compiler / application-level tuning, etc. (e.g., Shin, DAC’01; Azevedo, DATE’02)
• Recent DVFS techniques:
  • Goals: low performance overhead; adaptability to dynamically changing workload
  • Performance Monitoring Units (Event Counters) (e.g., Choi, ISLPED’04; Rangan, ISCA’09, etc.)
  • Online learning using CPI metrics to identify CPU usage/perf. overhead [Dhiman, ISLPED’07]
  • Choosing V-f setting based on “memory-boundedness” (mem/uop) [Isci, MICRO’06]

Optimizing for Energy-Efficiency
• Optimum operating points show significant variations depending on:
  (1) Workload
  (2) Objective functions

Optimization Formulations
• Minimize the energy-delay product (EDP) or energy-delay-squared product (ED2P)
  • No performance/power guarantees
• Minimize Delay under Power Constraints (minDPC)
• Minimize Energy under Performance and Power Constraints (minEDPC)
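The three formulations above can be sketched as selections over a table of profiled operating points. This is a minimal illustration, not the authors' implementation: the `Measurement` type, the function names, and every number in `POINTS` are hypothetical, and the performance constraint in minEDPC is assumed to be expressed as a maximum allowed delay.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Measurement:
    """One profiled operating point (all values below are hypothetical)."""
    freq_ghz: float
    threads: int
    power_w: float   # average power draw
    delay_s: float   # execution time

    @property
    def energy_j(self) -> float:
        return self.power_w * self.delay_s


def min_edp(points: List[Measurement], delay_exponent: int = 1) -> Measurement:
    """Minimize E * D**n (n=1: EDP, n=2: ED2P); no power or delay guarantees."""
    return min(points, key=lambda m: m.energy_j * m.delay_s ** delay_exponent)


def min_dpc(points: List[Measurement], power_cap_w: float) -> Optional[Measurement]:
    """minDPC: fastest point whose power stays under the cap."""
    feasible = [m for m in points if m.power_w <= power_cap_w]
    return min(feasible, key=lambda m: m.delay_s) if feasible else None


def min_edpc(points: List[Measurement], power_cap_w: float,
             max_delay_s: float) -> Optional[Measurement]:
    """minEDPC: lowest-energy point meeting both the power and delay limits."""
    feasible = [m for m in points
                if m.power_w <= power_cap_w and m.delay_s <= max_delay_s]
    return min(feasible, key=lambda m: m.energy_j) if feasible else None


# Made-up measurements, for illustration only.
POINTS = [
    Measurement(1.60, 1, 40.0, 100.0),
    Measurement(2.67, 1, 60.0, 65.0),
    Measurement(1.60, 4, 70.0, 30.0),
    Measurement(2.67, 4, 110.0, 19.0),
]
```

With these invented numbers, the unconstrained EDP optimum is 2.67 GHz with 4 threads, while an 80 W cap under minDPC moves the choice to 1.60 GHz with 4 threads, which illustrates why the optimum depends on the objective function.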

Outline
• Background on Dynamic Power Management
• Objective Functions for Energy-efficient Computing
• Proposed Technique
  • Multinomial Logistic Regression (MLR)
  • L1 Regularization
• Experimental Results
• Conclusions

Methodology Overview
[Figure: two-stage flow]
• Learning model: takes performance counters, temperatures, and control settings (frequency, thread #), together with power and performance constraints; produces weights for related performance counter ratios, stored in a lookup table
• Runtime estimation: takes performance counters, temperatures, and control settings, the lookup table, and the power and performance constraints; produces the optimum operating points (thread #, V-f setting)

MLR Learning Model
[Figure: learning flow]
• Measurements: performance counters, temperatures, and control settings; power; delay
• All possible input ratios Φ(x), e.g., μ-ops retired/thread count, l2-miss/load locks, …
• Optimal setting calculation: y (optimal setting)
• MLR learning model: Φ(x) and y produce the logistic weights (Ŵ), stored in a lookup table

Multinomial Logistic Regression
• Fits multinomial logistic function to continuous input data
• Estimates probability of discrete set of outputs:
  • V-f setting
  • Thread count
[Equation: multinomial logistic model, with the input vector and logistic weights labeled]
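A multinomial logistic model in its standard textbook form assigns each candidate setting a probability through a softmax over linear scores of the input vector. The sketch below assumes that standard form; the slide's exact notation is not preserved, and the function names, weight layout, and example values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def mlr_probabilities(weights: Sequence[Sequence[float]],
                      features: Sequence[float]) -> List[float]:
    """Softmax over per-class linear scores w_c . phi(x).

    weights[c] is the (hypothetical) logistic weight vector for class c.
    """
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_setting(weights: Sequence[Sequence[float]],
                    features: Sequence[float],
                    settings: Sequence[Tuple[float, int]]) -> Tuple[Tuple[float, int], float]:
    """Return the most probable (frequency, thread count) setting and its probability."""
    probs = mlr_probabilities(weights, features)
    best = max(range(len(settings)), key=probs.__getitem__)
    return settings[best], probs[best]


# Illustrative values only: three candidate settings, two input features.
SETTINGS = [(1.60, 1), (2.00, 2), (2.67, 4)]
WEIGHTS = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
chosen, confidence = predict_setting(WEIGHTS, [2.0, 0.5], SETTINGS)
```

Treating each (V-f setting, thread count) pair as one output class lets a single classifier choose both knobs at once.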

L1 Regularized Input Selection
• Prevents over-fitting of the classifier model
• Reduces the dimension of the input set, and thus the complexity
• Determines the most relevant metrics
[Equations: weight calculation without L1 regularization and with L1 regularization]
[Figure: weight distributions (%) for α = 1e−8, α = 7.74e−4, and α = 3.6]
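An L1 penalty selects inputs because it pushes small weights to exactly zero. The sketch below shows the soft-thresholding step used by proximal-gradient solvers for L1-penalized problems, applied to a made-up weight vector. The slides do not say which solver the authors used, so this is an assumption for illustration; only the three α values come from the slide.

```python
from typing import List


def soft_threshold(w: float, t: float) -> float:
    """Shrink w toward zero by t; values with |w| <= t become exactly 0."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def surviving_inputs(weights: List[float], alpha: float, step: float = 1.0) -> int:
    """Count weights that stay nonzero after one shrinkage step of size alpha * step."""
    return sum(1 for w in weights if soft_threshold(w, alpha * step) != 0.0)


# Hypothetical weights; the alpha values are the ones shown on the slide.
EXAMPLE_WEIGHTS = [4.1, -2.5, 0.9, 5e-4, -1e-4]
SURVIVORS = {alpha: surviving_inputs(EXAMPLE_WEIGHTS, alpha)
             for alpha in (1e-8, 7.74e-4, 3.6)}
```

For this toy vector, larger α leaves fewer nonzero weights (5, then 3, then 1), which is the mechanism behind reducing the input set to the most relevant metrics.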

L1 Regularized Input Selection
• Inputs: μ-ops retired, load locks, l3-cache misses, l2-cache misses, resource stalls, branch prediction misses, floating point operations, core temperature, frequency, thread count, constant
• All possible input ratios, e.g., μ-ops retired/thread count, l2-miss/load locks, stalls/frequency, …
• L1 regularization selects among the ratios
[Figure: top 10 most relevant metrics]
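Building "all possible input ratios" from the eleven listed inputs can be sketched as follows. The metric key names, the sample values, and the choice to map a zero denominator to 0.0 are assumptions made for this example, not details taken from the slides.

```python
from itertools import permutations
from typing import Dict


def ratio_features(metrics: Dict[str, float], eps: float = 1e-12) -> Dict[str, float]:
    """Form every ordered ratio a/b of two distinct metrics.

    A (near-)zero denominator yields 0.0, an assumed convention for this sketch.
    """
    features = {}
    for num, den in permutations(metrics, 2):
        d = metrics[den]
        features[f"{num}/{den}"] = metrics[num] / d if abs(d) > eps else 0.0
    return features


# Hypothetical sample; including a constant input makes x/constant and
# constant/x (the raw metric and its reciprocal) available as ratios.
SAMPLE = {
    "uops_retired": 8.0e9, "load_locks": 2.0e6, "l3_misses": 4.0e6,
    "l2_misses": 1.0e7, "resource_stalls": 3.0e8, "branch_misses": 5.0e6,
    "fp_ops": 1.0e9, "core_temp": 55.0, "frequency": 2.67,
    "thread_count": 4.0, "constant": 1.0,
}
RATIOS = ratio_features(SAMPLE)
```

Eleven inputs give 11 × 10 = 110 ordered ratios, a candidate set large enough that automatic selection is preferable to choosing inputs by hand.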

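MLR Runtime Estimation
[Figure: runtime flow]
• Measurements: performance counters, temperatures, and control settings
• All possible input ratios are computed from the measurements
• Runtime estimation combines the input ratios with the logistic weights (Ŵ) from the lookup table and the power and performance constraints
• Output: estimated optimum operating point (thread #, V-f setting)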

Experimental Setup
• Data collected on Intel Core i7-940 quad-core processor running the PARSEC 2.1 [Bienia et al., 2008] benchmark suite
• Performance counters: uOPs retired, load locks, L2/L3 cache misses, resource stalls, branch misses, FP operations
• Operating points:
  • Frequency (GHz): 1.60, 1.73, 2.00, 2.13, 2.40, 2.53, 2.67
  • Thread count: 1, 2, 4
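The operating points listed above form the classifier's output classes. A minimal sketch of enumerating them follows; the variable names are hypothetical, and treating each (frequency, thread count) pair as one class is an assumption about the encoding.

```python
from itertools import product

# Values taken from the experimental setup slide.
FREQUENCIES_GHZ = (1.60, 1.73, 2.00, 2.13, 2.40, 2.53, 2.67)
THREAD_COUNTS = (1, 2, 4)

# Every (frequency, thread count) combination, in a fixed order.
OPERATING_POINTS = list(product(FREQUENCIES_GHZ, THREAD_COUNTS))

# Map each operating point to a class index for use as a classifier label.
CLASS_INDEX = {point: i for i, point in enumerate(OPERATING_POINTS)}
```

Seven frequencies and three thread counts give 21 candidate operating points.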

Experimental Results
• Implemented DVFS techniques based on performance counters:
  • Memory Operations/uOP [Isci, MICRO’06]
  • CPI based metric [Dhiman, ISLPED’07]
  • Threshold based look-up tables
• 51% higher accuracy

EDP Reduction w.r.t. Prior Work
• 10.9% higher EDP savings on average in comparison to the best performing previous method
• Maximum savings reach 30.9%

Accuracy & Scalability
• MLR provides over 95% accuracy for predicting the optimal operating points
• minDPC(P): minimize delay under power constraint (P)
• minEDPC(P,D): minimize energy under power constraint (P) with performance degradation (D)

MLR vs. Prior Techniques
L1 Regularized MLR
• Automatic feature (input) selection
• Both V-f and thread count selection
• Ability to handle various objective functions
• Scalability to high number of operating points
Threshold-based approaches
• Manual input selection
• V-f selection
• Focus on EDP/ED2P
• Poor scalability

Conclusions
• We identified energy-efficiency tradeoffs and challenges for multi-threaded workloads on multicore systems
  • Tradeoffs vary significantly depending on the objective function and the workload characteristics
• We proposed a novel technique based on MLR to choose optimum settings (DVFS, thread count) for various objective functions (e.g., minimizing execution time with a power constraint)
• On a real system, we experimentally demonstrated that the proposed technique outperforms previous work:
  • 51% higher accuracy than previous techniques, which brings 10.9% more EDP savings
  • Over 95% accuracy for various objective functions
  • Estimation accuracy is preserved (scalability) as the number of potential operating settings increases
