Identifying the Optimal Energy-Efficient Operating Points of Parallel Workloads

Presentation Transcript

  1. Identifying the Optimal Energy-Efficient Operating Points of Parallel Workloads • Ryan Cochran (1), Can Hankendi (2), Ayse K. Coskun (2), Sherief Reda (1) • ICCAD’11, San Jose, CA • (1) Brown University, School of Engineering; (2) Boston University, ECE Department • This research has in part been funded by the Dean’s Catalyst Award at the College of Engineering, Boston University. R. Cochran and S. Reda are partially supported by NSF grants 0952866 and 1115424.

  2. Challenges for large-scale computing systems • Energy efficiency and budget/cost control are the major challenges (~40%) • Energy consumption is increasing by 15% per year [Koomey et al., 2008] • Server management and power/cooling are the major contributors to increasing costs (Source: International Data Corporation (IDC), 2009)

  3. Energy Management on Multicore Systems • Goal: identifying optimum operating points to improve energy efficiency • Control knobs: dynamic voltage and frequency scaling (DVFS); thread count (for parallel workloads); components that can be turned on/off (CPU) • Parallel workloads: inter-thread dependencies; software structure; parallelization model • [Figure labels: large-scale computing system (several racks); rack (several servers); server (multicore processors)] • [Figure labels: blackscholes; processor energy]

  4. Outline • Background on Dynamic Power Management • Objective Functions for Energy-efficient Computing • Proposed Technique • Multinomial Logistic Regression • L1 Regularization • Experimental Results • Conclusions

  5. Background • HW techniques for fine-granularity DVFS, clock domain design, voltage island partitioning, etc. (e.g., Magklis, ISCA’03; Kim, HPCA’09) • SW optimization for optimally scheduling a known set of tasks, compiler/application-level tuning, etc. (e.g., Shin, DAC’01; Azevedo, DATE’02) • Recent DVFS techniques use performance monitoring units (event counters) (e.g., Choi, ISLPED’04; Rangan, ISCA’09); goals: low performance overhead and adaptability to dynamically changing workloads • Online learning using CPI metrics to identify CPU usage/performance overhead [Dhiman, ISLPED’07] • Choosing the V-f setting based on “memory-boundedness” (mem/uop) [Isci, MICRO’06]

  6. Optimizing for Energy-Efficiency • Optimum operating points show significant variations depending on: (1) Workload (2) Objective functions

  7. Optimization Formulations • Minimize EDP/ED2P (energy-delay product / energy-delay-squared product): no performance or power guarantees • Minimize Delay under Power Constraints (minDPC) • Minimize Energy under Performance and Power Constraints (minEDPC)
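
As a rough illustration of the three formulations on this slide, the sketch below selects an operating point from a small table of (power, delay) measurements. Everything concrete here is an assumption made for illustration: the `OperatingPoint` type, the function names, the measurement values, the use of energy = average power × delay, and the reading of the performance constraint in minEDPC as a maximum allowed delay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    freq_ghz: float
    threads: int
    power_w: float  # average power (made-up measurement)
    delay_s: float  # execution time (made-up measurement)

    @property
    def energy_j(self):
        # Assumption: energy is average power times execution time.
        return self.power_w * self.delay_s


def min_edp(points, delay_exponent=1):
    """Minimize E * D (EDP) or, with delay_exponent=2, E * D^2 (ED2P)."""
    return min(points, key=lambda p: p.energy_j * p.delay_s ** delay_exponent)


def min_dpc(points, power_cap_w):
    """Minimize delay among points that respect the power cap."""
    feasible = [p for p in points if p.power_w <= power_cap_w]
    return min(feasible, key=lambda p: p.delay_s) if feasible else None


def min_edpc(points, power_cap_w, max_delay_s):
    """Minimize energy among points that respect both constraints."""
    feasible = [
        p for p in points
        if p.power_w <= power_cap_w and p.delay_s <= max_delay_s
    ]
    return min(feasible, key=lambda p: p.energy_j) if feasible else None


POINTS = [
    OperatingPoint(1.60, 1, 20.0, 100.0),
    OperatingPoint(1.60, 4, 35.0, 30.0),
    OperatingPoint(2.67, 1, 40.0, 62.0),
    OperatingPoint(2.67, 4, 70.0, 20.0),
]

print(min_edp(POINTS))
print(min_dpc(POINTS, power_cap_w=50.0))
print(min_edpc(POINTS, power_cap_w=80.0, max_delay_s=40.0))
```

With these invented numbers the unconstrained EDP choice and the power-capped choice differ, which is the kind of variation across objective functions that slide 6 describes.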

  8. Outline • Background on Dynamic Power Management • Objective Functions for Energy-efficient Computing • Proposed Technique • Multinomial Logistic Regression (MLR) • L1 Regularization • Experimental Results • Conclusions

  9. Methodology Overview • Learning model: takes performance counters, temperatures, & control settings (frequency, thread #) together with power & performance constraints; produces weights for related performance counter ratios, stored in a lookup table • Runtime estimation: takes performance counters, temperatures, & control settings together with power & performance constraints and the lookup table; produces the optimum operating points (thread #, V-f setting)

  10. MLR Learning Model • Measurements: performance counters, temperatures, & control settings; power; delay • All possible input ratios Φ(x), e.g., μ-ops retired/thread count, l2-miss/load locks, … • Optimal setting calculation: y (optimal setting) • MLR learning model → logistic weights (Ŵ) → lookup table
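
One plausible way to build the "all possible input ratios" vector Φ(x) is to divide every metric by every other metric. The sketch below does that for a hypothetical sample; the function name, the metric names and values, the zero-denominator guard, and the inclusion of a constant entry (so that raw metrics and their inverses also appear as ratios) are assumptions, not details confirmed by the slides.

```python
from itertools import permutations


def input_ratios(metrics, eps=1e-12):
    """Return {"a/b": value} for every ordered pair of distinct metrics.

    Denominators with magnitude below eps are replaced by eps to avoid
    division by zero (counters are assumed non-negative).
    """
    feats = {}
    for (name_a, val_a), (name_b, val_b) in permutations(metrics.items(), 2):
        denom = val_b if abs(val_b) > eps else eps
        feats[f"{name_a}/{name_b}"] = val_a / denom
    return feats


# Hypothetical sample from one measurement interval.
sample = {
    "uops_retired": 8.0e9,
    "l2_misses": 2.0e6,
    "load_locks": 4.0e5,
    "thread_count": 4.0,
    "constant": 1.0,
}

phi = input_ratios(sample)
print(len(phi))
print(phi["uops_retired/thread_count"], phi["l2_misses/load_locks"])
```

Five metrics give 5 × 4 = 20 ordered ratios, so the feature count grows quadratically with the number of metrics; that growth is what motivates the input selection step on slides 12 and 13.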

  11. Multinomial Logistic Regression • Fits a multinomial logistic function to continuous input data • Estimates the probability of a discrete set of outputs: V-f setting, thread count • [Equation not recovered; labels: multinomial logistic model, input vector, logistic weights]
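
The equation on this slide did not survive extraction. For orientation, the textbook form of a multinomial logistic model is given below; the authors' notation may differ. Here Φ(x) is the input vector, ŵ_c is the row of logistic weights Ŵ for class c, and K (the number of candidate operating points) is a symbol introduced for this note.

```latex
P(y = c \mid x) \;=\;
  \frac{\exp\!\big(\hat{w}_c^{\top}\,\Phi(x)\big)}
       {\sum_{k=1}^{K} \exp\!\big(\hat{w}_k^{\top}\,\Phi(x)\big)},
  \qquad c = 1, \dots, K
```

The predicted operating point is the class c with the highest probability, which is also the class with the highest linear score ŵ_c^T Φ(x).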

  12. L1 Regularized Input Selection • Prevents over-fitting of the classifier model • Reduces the dimension of the input set, and thus the complexity • Determines the most relevant metrics • [Equations not recovered: weight calculation without L1 regularization and with L1 regularization] • [Figure: weight distributions (%) for α = 1e−8, α = 7.74e−4, α = 3.6]
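
To make the effect of the regularization weight α concrete, here is a minimal sketch of L1-regularized multinomial logistic regression. The slides do not say which solver the authors used; this example substitutes plain proximal gradient descent (soft-thresholding after each gradient step) because it fits in a few lines. The data set, step size, and iteration count are invented for illustration.

```python
import math


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_threshold(value, amount):
    """Proximal operator of the L1 norm: shrink toward zero, clip at zero."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def fit_l1_mlr(X, y, n_classes, alpha, step=0.1, iters=500):
    """Proximal gradient descent on mean cross-entropy + alpha * ||W||_1."""
    n, d = len(X), len(X[0])
    W = [[0.0] * d for _ in range(n_classes)]
    for _ in range(iters):
        grad = [[0.0] * d for _ in range(n_classes)]
        for x, label in zip(X, y):
            scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in W]
            probs = softmax(scores)
            for c in range(n_classes):
                err = probs[c] - (1.0 if c == label else 0.0)
                for j in range(d):
                    grad[c][j] += err * x[j] / n
        for c in range(n_classes):
            for j in range(d):
                W[c][j] = soft_threshold(W[c][j] - step * grad[c][j],
                                         step * alpha)
    return W


def count_nonzero(W):
    return sum(1 for row in W for w in row if w != 0.0)


# Invented toy data: feature 0 separates the classes, feature 1 is noise.
X = [[1.0, 0.2], [0.9, -0.1], [-1.0, 0.1], [-0.8, -0.2]]
y = [0, 0, 1, 1]

W_weak = fit_l1_mlr(X, y, n_classes=2, alpha=1e-8)
W_strong = fit_l1_mlr(X, y, n_classes=2, alpha=3.6)
print(count_nonzero(W_weak), count_nonzero(W_strong))
```

On this toy data the largest α drives every weight to exactly zero, while the smallest leaves the informative feature with a nonzero weight. Intermediate values are where L1 regularization acts as input selection, keeping only the inputs that carry the most information.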

  13. L1 Regularized Input Selection • Metrics listed on the slide: μ-ops retired, load locks, l3-cache misses, l2-cache misses, resource stalls, branch prediction misses, floating point operations, core temperature, frequency, thread count, constant • All possible input ratios, e.g., μ-ops retired/thread count, l2-miss/load locks, stalls/frequency, … • L1 regularization • Top 10 most relevant metrics: [list not recovered]

  14. MLR Runtime Estimation • Measurements: performance counters, temperatures, & control settings • All possible input ratios • Runtime estimation, using the logistic weights (Ŵ) from the lookup table and the power & performance constraints • Estimated optimum operating point: thread #, V-f setting
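
The runtime step on this slide can be sketched as a table lookup followed by an argmax over linear scores. The layout of the lookup table (keyed by objective and constraint value, holding one weight row per candidate setting), the function name, and all numbers below are assumptions for illustration; the slides only state that the logistic weights are kept in a lookup table.

```python
def estimate_operating_point(lookup, constraint_key, phi):
    """Return the (frequency, thread count) with the highest linear score.

    The softmax is monotone in the scores, so the most probable class can
    be found without computing the probabilities themselves.
    """
    entry = lookup[constraint_key]
    scores = [
        sum(w_j * x_j for w_j, x_j in zip(row, phi))
        for row in entry["weights"]
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return entry["settings"][best]


# Hypothetical table: one entry per (objective, constraint value).
LOOKUP = {
    ("minDPC", 50.0): {
        "settings": [(1.60, 4), (2.67, 1), (2.67, 4)],
        "weights": [[0.2, 1.0], [0.1, -1.0], [-0.5, 3.0]],
    },
}

# phi = [constant, one selected counter ratio]; values are made up.
print(estimate_operating_point(LOOKUP, ("minDPC", 50.0), [1.0, 0.3]))
print(estimate_operating_point(LOOKUP, ("minDPC", 50.0), [1.0, 0.5]))
```

Skipping the exponentials keeps the per-interval cost to a handful of multiply-adds per candidate setting, which matters if the estimate is refreshed frequently at runtime.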

  15. Experimental Setup • Data collected on an Intel Core i7-940 quad-core processor running the PARSEC 2.1 [Bienia et al., 2008] benchmark suite • Performance counters: μ-ops retired, load locks, L2/L3 cache misses, resource stalls, branch misses, FP operations • Operating points: frequency (GHz): 1.60, 1.73, 2.00, 2.13, 2.40, 2.53, 2.67; thread count: 1, 2, 4
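
If every frequency can be paired with every thread count (the slide does not state this explicitly), the settings above give 7 × 3 = 21 candidate operating points, each of which would be one output class of the classifier. A small sketch of that enumeration, with names chosen for this note:

```python
from itertools import product

FREQUENCIES_GHZ = (1.60, 1.73, 2.00, 2.13, 2.40, 2.53, 2.67)
THREAD_COUNTS = (1, 2, 4)

# One class label per (frequency, thread count) pair.
OPERATING_POINTS = list(product(FREQUENCIES_GHZ, THREAD_COUNTS))
CLASS_INDEX = {point: i for i, point in enumerate(OPERATING_POINTS)}

print(len(OPERATING_POINTS))
```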

  16. Experimental Results • Implemented prior DVFS techniques based on performance counters for comparison: • Memory operations/μ-op [Isci, MICRO’06] • CPI-based metric [Dhiman, ISLPED’07] • Threshold-based look-up tables • Result: 51% higher accuracy than the previous techniques

  17. EDP Reduction w.r.t. Prior Work • 10.9% higher EDP savings on average in comparison to the best-performing previous method • Maximum savings reach 30.9%

  18. Accuracy & Scalability • MLR provides over 95% accuracy for predicting the optimal operating points • minDPC(P): minimize delay under power constraint (P) • minEDPC(P,D): minimize energy under power constraint (P) with performance degradation (D)

  19. MLR vs. Prior Techniques • L1 Regularized MLR: automatic feature (input) selection; both V-f and thread count selection; ability to handle various objective functions; scalability to a high number of operating points • Threshold-based approaches: manual input selection; V-f selection only; focus on EDP/ED2P; poor scalability

  20. Conclusions • We identified energy-efficiency tradeoffs and challenges for multi-threaded workloads on multicore systems • Tradeoffs vary significantly depending on the objective function and the workload characteristics • We proposed a novel technique based on MLR to choose optimum settings (DVFS, thread count) for various objective functions (e.g., minimizing execution time under a power constraint) • On a real system, we experimentally demonstrated that our proposed technique outperforms previous work: • 51% higher accuracy w.r.t. previous techniques, which brings 10.9% more EDP savings • Over 95% accuracy for various objective functions • Estimation accuracy is preserved (scalability) as the number of potential operating settings increases