Energy Optimization and Stability in Green Data Centers. Tarek Abdelzaher, Dept. of Computer Science, University of Illinois at Urbana-Champaign, USA. On sabbatical at the Department of Automatic Control, Lund University, Sweden.
Energy Management in Data Centers • Total consumption: 2% of energy spent in the US (EPA estimate) • Energy bill is 20-50% of total profit • Energy is expended on: • Computing (powering up racks of machines); sensors: utilization, delay, throughput, …; actuators: DVS (dynamic voltage scaling), turning machines on/off • Cooling; sensors: temperature, air flow, …; actuators: air-conditioning units, fans, …
Current Status • Increased emphasis on energy control • More “manipulation knobs” are introduced to manage energy and performance • Challenge • Knobs may interact in unexpected ways • Different performance and energy management policies may interfere with one another • Uncoordinated interference of multiple knobs can lead to instability or poor efficiency
Energy Saving: A Tale of Two Policies • DVS + On/Off: more energy consumption than DVS alone or On/Off alone! • Empirical measurements from a 30-machine, 3-tier testbed of a shopping site [Figure: curves for DVS + On/Off, DVS alone, and On/Off alone]
Three Performance Management Challenges • Avoid the “avoidable” (bad) interactions • Manage the “unavoidable” interactions (so they do not lead to instability) • Troubleshoot remaining interaction problems
Response Time Control Problem in VMs • Goal: dynamically change the CPU shares of VMs to meet a response time (RT) constraint • CPU allocation has been a popular knob for controlling response time • With only CPU control, the response time constraint is severely violated. Why? [Figure: results for VM1 and VM2]
Memory Utilization, Disk I/O, and CPU Consumption • Page faults increase drastically after a certain memory utilization threshold • Significant CPU overhead after the threshold; the increase in CPU usage is mainly caused by extra paging activity [Figures: CPU consumption as a function of memory utilization; number of page faults as a function of memory utilization]
Response Time and Memory Utilization • Sharp increase in response time after a certain memory utilization threshold, say 90% • To achieve the desired performance, we need to avoid the “bad” region
CPU and Memory Control • CPU controller: controls response time • Memory controller: makes sure memory utilization doesn’t go over 90% [Figure: for each VM (VM 1 running App 1 through VM n running App n), a CPU controller and a memory controller take the application SLOs, application-level performance, and resource usage as inputs and set the CPU allocation and memory allocation through the VMM's CPU scheduler and memory manager]
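The two controllers on this slide can be illustrated with a minimal sketch. It is not the controller design from the talk: the proportional update rule, the gain, the clamping limits, and the function names `next_cpu_share` and `next_memory_alloc` are all assumptions chosen for illustration. Only the 90% memory utilization ceiling comes from the slide.

```python
def next_cpu_share(share, measured_rt, target_rt, gain=0.5, lo=0.05, hi=1.0):
    """Hypothetical CPU controller step.

    Raises the CPU share when the measured response time exceeds the
    target and lowers it otherwise. The gain and the [lo, hi] clamp are
    illustrative assumptions, not values from the talk.
    """
    error = (measured_rt - target_rt) / target_rt
    return min(hi, max(lo, share * (1.0 + gain * error)))


def next_memory_alloc(used_mb, max_util=0.90, lo_mb=128.0, hi_mb=4096.0):
    """Hypothetical memory controller step.

    Gives the VM just enough memory that used / allocated stays at or
    below max_util (90% on the slide), within assumed size limits.
    """
    return min(hi_mb, max(lo_mb, used_mb / max_util))


if __name__ == "__main__":
    share = next_cpu_share(0.4, measured_rt=2.0, target_rt=1.0)
    alloc = next_memory_alloc(used_mb=900.0)
    print(f"new CPU share: {share:.2f}, new memory allocation: {alloc:.0f} MB")
```

The memory controller runs independently of the response time error: its only job is to keep the VM out of the high-paging region so that the CPU knob keeps behaving predictably.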
Performance of Joint Controllers with Synthetic Workload (cont.) • Without dynamic memory control, VMs cannot get enough memory when memory becomes scarce • The joint controller gives each VM just enough memory to stay out of the bad region, so physical memory is used efficiently [Figure: results for VM1 and VM2]
Three Performance Management Challenges • Avoid the “avoidable” (bad) interactions • Manage the “unavoidable” interactions (so they do not lead to instability) • Troubleshoot remaining interaction problems
DVS and On/Off Interactions in Energy Minimization • The DVS and On/Off “knobs” must be controlled holistically, in a coordinated manner, as the solution to an optimization problem [Figure: curves for DVS + On/Off, DVS alone, and On/Off alone]
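To make "solution to an optimization problem" concrete, here is a small sketch that searches jointly over the number of powered-on machines and a shared frequency level. Everything numeric in it is an assumption for illustration: the power model (a fixed idle cost plus a term cubic in normalized frequency), the frequency levels, the constants `P_IDLE` and `P_DYN`, and the capacity model in which a machine at frequency f serves f units of load. The talk does not give its formulation.

```python
from itertools import product

FREQS = (0.5, 0.6, 0.7, 0.8, 0.9, 1.0)  # assumed normalized DVS levels
P_IDLE, P_DYN = 100.0, 100.0            # assumed watts per machine


def power(machines, freq):
    """Total power under the assumed model: idle cost plus cubic dynamic cost."""
    return machines * (P_IDLE + P_DYN * freq ** 3)


def feasible(machines, freq, load):
    """Load is measured in 'full-speed machines'; capacity scales with frequency."""
    return machines * freq >= load - 1e-9


def best(load, machine_counts, freqs):
    """Exhaustively pick the cheapest feasible (power, machines, freq) setting."""
    options = [
        (power(m, f), m, f)
        for m, f in product(machine_counts, freqs)
        if feasible(m, f, load)
    ]
    return min(options)


def compare(load, total=30):
    """Compare each knob used alone against the joint search."""
    every_count = range(1, total + 1)
    return {
        "dvs_alone": best(load, [total], FREQS),       # all machines stay on
        "onoff_alone": best(load, every_count, [1.0]), # always full speed
        "joint": best(load, every_count, FREQS),       # both knobs together
    }


if __name__ == "__main__":
    for policy, (watts, m, f) in compare(load=12).items():
        print(f"{policy}: {m} machines at f={f} -> {watts:.0f} W")
```

Because the joint search space contains both single-knob settings, the joint result is never worse than either knob alone under this model. Two independent controllers that each adjust one knob have no such guarantee, which is the point of the slide.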
Results [Figure: curves for DVS + On/Off, DVS alone, On/Off alone, and Optimal]
Energy Saving: Measurements from a Machine Room [Figure: holistic optimization, with panels for a fixed cooling set point and a fixed number of machines, comparing the Even, Bottom-Up, Bottom-Up + Off, and Optimal policies]
Three Performance Management Challenges • Avoid the “avoidable” (bad) interactions • Manage the “unavoidable” interactions (so they do not lead to instability) • Troubleshoot remaining interaction problems
Diagnostics • In software systems, key variables in adaptive actions are correlated • Monitor changes in these correlations to diagnose performance problems • In mechanical systems, components are connected and correlated • If the correlations are broken, the system may not perform as expected
Diagnostics 1. Learning phase: learn the adaptation graph by calculating correlation coefficients 2. At run-time: periodically recalculate the sign of the edges in the adaptation graph 3. Check the sign (learned vs. estimated) [Figure: adaptation graph over the nodes AC, D, R, and U with signed edges. The regulation policy monitors the target system through sensors (system output, system workload), compares against the target performance reference, and sets the control knobs (actuators). Automated detection translates the adaptation graph into causality assumptions, detects assumption violations, and invokes the backup policy]
Diagnostics • Stop the component causing the sign problem • Execute the backup action: an open-loop action • Try several times [Figure: adaptation graph over AC, D, R, and U, and the detection loop: regulation policy, backup policy, sensors, performance knobs (actuators), target system, automated detection of assumption violations]
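The learn-then-check procedure from the Diagnostics slides can be sketched as follows. This is an illustration under assumptions, not the tool from the talk: the function names, the dead band `eps` used to call a correlation "no sign", and the sample traces are all hypothetical. The only elements taken from the slides are the use of correlation coefficients, the comparison of edge signs, and the Req/Util variable names.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def sign(r, eps=0.2):
    """Map a correlation to +1, -1, or 0 (too weak to call); eps is an assumption."""
    return 1 if r > eps else -1 if r < -eps else 0


def learn_graph(trace, edges):
    """Learning phase: record the sign of each edge from a known-good trace."""
    return {(a, b): sign(pearson(trace[a], trace[b])) for a, b in edges}


def violated_edges(learned, window):
    """Run-time check: edges whose sign in the recent window differs from the learned one."""
    return [
        edge
        for edge, expected in learned.items()
        if expected != 0
        and sign(pearson(window[edge[0]], window[edge[1]])) != expected
    ]


if __name__ == "__main__":
    good = {"Req": [1, 2, 3, 4, 5], "Util": [10, 20, 30, 40, 50]}
    recent = {"Req": [5, 6, 7, 8, 9], "Util": [50, 30, 20, 10, 5]}
    graph = learn_graph(good, [("Req", "Util")])
    print("violated:", violated_edges(graph, recent))
```

A non-empty result is the trigger for the response described on the slide: stop the component behind the offending edge and fall back to the open-loop backup action.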
Example • Increased workload → switch from interrupt handling to polling → utilization drops • Controller tries to accept more requests, aggravating the situation; most new requests are dropped by the kernel • No prioritization is enforced • Unintended interaction between a utilization controller in a Web server and the kernel anti-livelock mechanism: • Admission control is based on utilization • It drops lower-priority requests first [Figure: adaptation graphs over the nodes AC, Req, Util, and Pd with signed edges]
Diagnostics Example 1. Network processing is overloaded: switching from interrupt handling to polling 2. Utilization drops sharply due to the decrease in the number of interrupts 3. Admission control policy tries to accept more requests, aggravating the situation • The correlation Req → Util becomes broken [Figure: CPU utilization and number of network interrupts, marking (1) closed loop and (2) closed loop with violation]
More on Diagnostics • Correlations between continuous variables do not uncover problems due to sequences of discrete events • Focus on runtime events related to performance • E.g., turn on machines, decrease DVS, send a packet, etc. • Find a (cyclic) sequence of events that discriminates “good” and “bad” performance cases • Data mining technique: discriminative sequence analysis
Main Idea • Log different events during runtime • Most of the time the system works • Occasionally it performs poorly • Generate the frequent sequences of events that occur when the system works correctly • Generate the frequent sequences of events that occur when the system exhibits undesirable behavior • Identify the “culprit” sequences of events that are found only in the latter case but not the former
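The steps above can be sketched with a deliberately simplified stand-in for discriminative sequence mining: count contiguous event subsequences (n-grams) per log, keep those that are frequent, and report the ones frequent in "bad" logs but not in "good" logs. Real sequence miners handle gaps, cycles, and far larger logs. The function names, the support threshold, and the sample logs are assumptions made for illustration.

```python
from collections import Counter


def subsequences(log, max_len=3):
    """All contiguous event subsequences of length 2..max_len in one log."""
    found = set()
    for n in range(2, max_len + 1):
        for i in range(len(log) - n + 1):
            found.add(tuple(log[i:i + n]))
    return found


def frequent(logs, min_support, max_len=3):
    """Subsequences that appear in at least min_support (a fraction) of the logs."""
    counts = Counter()
    for log in logs:
        counts.update(subsequences(log, max_len))
    return {seq for seq, c in counts.items() if c / len(logs) >= min_support}


def culprits(good_logs, bad_logs, min_support=1.0, max_len=3):
    """Sequences frequent in the bad logs but not frequent in the good logs."""
    return frequent(bad_logs, min_support, max_len) - frequent(
        good_logs, min_support, max_len
    )


if __name__ == "__main__":
    good = [["Assign", "Run", "Done"], ["Assign", "Run", "Done"]]
    bad = [
        ["Assign", "SleepEvent", "WakeUpEvent", "SleepEvent"],
        ["SleepEvent", "WakeUpEvent", "Run"],
    ]
    print(culprits(good, bad))
```

Set difference implements the "found only in the latter case" rule directly. A production tool would more likely rank sequences by how much more frequent they are in the bad logs rather than require complete absence from the good ones.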
A Case Study on a “Hot” Day: Throughput of a Server Farm Low Throughput
Three Performance Control Policies • Thermal Management Policy • Puts machine to sleep if machine is overheated • Energy Aware Load Balancer • Distributes load based on average CPU utilization • Attempts to minimize the number of machines in use • Machine On/Off Policy • Turns off idle machines to save energy
Regular Operating Condition • Maximum temperature is well below 60 degrees
Anomalous Condition • Maximum temperature is above 60 degrees • Eventually, only the overheated machine remained on!
Diagnostics Output: Reported Culprit Event Sequences • Cycle: SleepEvent, WakeUpEvent • Cycle: Temp: 65 - 70, Temp: 60 - 65 • Oops: utilization is computed as a recent time average that includes “sleep” time, so it is artificially low if the machine sleeps
What was going on? • No matter how many tasks are assigned to the overheated machine, its utilization remains well below the threshold because of periodic sleeping • The load balancer keeps assigning more and more tasks to the overheated machine • The On/Off policy keeps turning off the other machines
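The averaging artifact behind this failure is simple arithmetic, shown in the sketch below. The sample format, the function names, and the 50% sleep duty cycle are assumptions made for illustration. The slides only state that utilization is a recent time average that includes sleep time.

```python
def windowed_utilization(samples):
    """Average busy fraction over the whole window, counting sleep time as idle.

    This mirrors the flawed metric: sleeping intervals drag the average down.
    Each sample is a (state, busy_fraction) pair; the format is hypothetical.
    """
    return sum(busy if state == "awake" else 0.0 for state, busy in samples) / len(samples)


def awake_utilization(samples):
    """Average busy fraction over awake intervals only."""
    awake = [busy for state, busy in samples if state == "awake"]
    return sum(awake) / len(awake) if awake else 0.0


if __name__ == "__main__":
    # An overheated machine that is saturated whenever awake but sleeps half the time.
    samples = [("awake", 1.0), ("sleep", 0.0)] * 5
    print("including sleep time:", windowed_utilization(samples))
    print("awake time only:     ", awake_utilization(samples))
```

With half of the window spent asleep, a fully saturated machine reports half its real utilization. A load balancer that compares this figure against a threshold therefore sees spare capacity that does not exist.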
Conclusions (the needs) • Must identify the right knobs to manipulate (e.g., the example with virtual machine memory allocation) • Must manage them in a jointly optimal manner to avoid instability or poor performance • Must develop automated self-diagnostic techniques to reduce administrator effort
Conclusions (the tools) • Control theory of positive systems offers interesting insights into distributing the holistic management of interacting feedback control knobs in data centers • Advances in event-based control offer opportunities to significantly reduce actuation overhead (e.g., the number of times machines are turned on/off) without degrading performance • Advances in discriminative sequence mining offer opportunities for improving self-diagnostic capabilities in complex systems