Effective Thermal Management Strategies in High-Density Computing Environments
This overview explores critical thermal-aware issues impacting computer performance. As power densities rise, effective thermal management becomes essential to prevent material degradation and overheating. Key strategies include dynamic voltage and frequency scaling, optimal chassis design for air flow, and effective software solutions for CPU load balancing. The challenges of heat recirculation in data centers are also discussed, emphasizing the need for monitoring systems to detect hot spots and ensure efficient cooling. A comprehensive understanding of these aspects is vital for maintaining system reliability and efficiency in modern computing.
Presentation Transcript
Thermal-aware Issues in Computers IMPACT Lab
Importance of thermal management • Cooling cost is very high: • Providing cool air: costs about as much power as is consumed in computation • Bringing the cooling medium (air/liquid) to the circuitry: new densities require $2/Watt of material/equipment for ICs of 40+ Watts • Excessive heat accelerates material degradation • Power density will only increase in the future
Thermal management at various levels • Physical dimension • At IC level • At chassis/case level • At room level • Software dimension • Firmware level • Operating system level • Middleware level • Application level Source: Intel Source: Apple Source: Berkeley Lab
At integrated circuit level • Issues • Higher temperature → increased power leakage • Increased power leakage → higher temperature • Heat density – hot spots • Applied solutions • Dynamic Voltage Scaling • Dynamic Frequency Scaling • Clock gating (“pause” mode) • Research solutions • Redundant circuitry • Redundant “cores” [Chapparro 2004] • Redundant pipelines [Lim 2002] • Switch from one circuit to the other, either regularly or when the temperature exceeds a set level
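The temperature-triggered switch between redundant units can be sketched as follows. This is a minimal illustration, not the mechanism from the cited papers: the function name, the per-unit temperature list, and the threshold value are all hypothetical.

```python
def pick_active_unit(temps, active, limit):
    """Return the index of the redundant unit that should be active next.

    temps  -- current temperature of each redundant unit (hypothetical readings)
    active -- index of the unit currently doing the work
    limit  -- temperature level that triggers a switch (assumed value)
    """
    if temps[active] <= limit:
        return active  # still within the limit, keep working on this unit
    # Over the limit: hand the work to the coolest unit (may be the same one)
    return min(range(len(temps)), key=lambda i: temps[i])


# Unit 0 is at 85 degrees with a limit of 80, so work moves to unit 1.
next_unit = pick_active_unit([85, 55], active=0, limit=80)
```

A real design would also switch on a timer ("regularly") and would account for the cost of migrating state between units.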
At chassis/case level • Issues • Fan capacity at low RPMs not enough for generated heat • Fan noise level at high RPMs too high • Solutions • Dynamic Fan Speed • CPU load balancing • Activity Adjustments • Dynamic Memory bandwidth scaling [Apple TN2156] • Dynamic FSB frequency scaling • Layout forces flow of air in a linear fashion Source: Apple Source: Intel Terms: inlets, outlets
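Dynamic fan speed is often implemented as a fan curve that maps a temperature reading to a target RPM. The sketch below assumes a simple linear curve clamped at both ends; the temperature and RPM limits are made-up defaults, not values from any vendor.

```python
def fan_rpm(temp_c, t_low=40.0, t_high=80.0, rpm_min=1000, rpm_max=5000):
    """Map a temperature in Celsius to a fan speed (all limits are assumed)."""
    if temp_c <= t_low:
        return rpm_min  # quiet floor: low noise when the system is cool
    if temp_c >= t_high:
        return rpm_max  # full speed once the upper limit is reached
    # Linear interpolation between the two limits
    frac = (temp_c - t_low) / (t_high - t_low)
    return round(rpm_min + frac * (rpm_max - rpm_min))
```

The clamped floor and ceiling reflect the two issues on the slide: too little capacity at low RPMs and too much noise at high RPMs.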
At room level • Solutions: • Pause execution of tasks • Turn machines off • Performance impacts • Degraded performance Source: www.cix.ie Source: Elibo, Hong Kong Terms: hot aisle, cold aisle, raised floors, CRAC/HVAC
A typical data center Source: Siemens Terms: hot aisle, cold aisle, raised floors, CRAC/HVAC
CRAC & thermal maps: knowing where the hot spots are • Purpose • Know the air temperature at any 3-D point • Adjust CRAC operation • Adjust computer operation • Obtained from • Strategically placed sensors • On-board sensors • Predicted by • Thorough testing • CFD simulations
Thermal issues in dense computer rooms (Data centers, Computer Clusters, Data warehouses) • Heat recirculation • Hot air from the equipment outlets is fed back to the equipment inlets • Hot spots • Effect of heat recirculation • Areas in the data center with alarmingly high temperature • Impact • Cooling has to be set very low to keep all inlet temperatures within the safe operating range Courtesy: Intel Labs Terms: heat recirculation, hot spots, inlet temperatures, outlet temperatures, redline temperature, peak temperature
Thermal management solutions • Figure: solutions plotted along a software dimension (firmware, O/S, middleware, application) and a physical dimension (IC, case/chassis, room) • Solutions shown: circuitry redundancy, dynamic voltage scaling, dynamic frequency scaling, fan speed scaling, CPU load balancing, thermal-aware JVM, data center job scheduling (middleware)
Reducing heat recirculation (1) • Heat recirculation is the only reason for increased inlet temperatures • Without recirculation, the inlet temperatures would equal the supplied air temperature • The peak inlet temperature defines the CRAC operational temperature • Figure: inlet temperature distribution without cooling and with cooling (25 °C)
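The link between peak inlet temperature and CRAC setting can be made concrete with a small calculation. The sketch below assumes that each inlet's temperature rise above the supplied air stays the same when the supply temperature changes (an additive simplification); the function name and all readings are hypothetical.

```python
def max_supply_temp(inlet_temps, supply_temp, redline):
    """Highest CRAC supply temperature that keeps every inlet at or below redline.

    Assumes the rise caused by recirculation at each inlet does not depend
    on the supply temperature itself.
    """
    # Rise of each inlet above the supplied air, caused by recirculation
    rises = [t - supply_temp for t in inlet_temps]
    # The worst (peak) inlet decides how cold the supplied air must be
    return redline - max(rises)


# Inlets measured at 18, 22 and 27 degrees with 15-degree supply air:
# the peak rise is 12, so a 25-degree redline allows at most 13-degree supply.
limit = max_supply_temp([18, 22, 27], supply_temp=15, redline=25)
```

With no recirculation all rises are zero and the supply temperature can sit at the redline, which is why reducing the peak rise lowers cooling cost.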
Reducing heat recirculation (2) • First things first • Find the causes of it • Find ways to predict it • What is causing it • 1. The air flow from the CRAC is not adequate to feed all inlets • 2. Imperfect layout • Usually 1. and 2. are not adjustable once the equipment is bought and in place • Find other ways to reduce it
Reducing heat recirculation (3) • Other ways to reduce it • Find which units contribute the most to heat recirculation • Mitigate heat recirculation by throttling activity at its main contributors (contributor = equipment unit that is generating heat) (throttling activity = changing the jobs or how they are executed) • How do we know how much recirculated heat each unit contributes? • But first: how do we know how much heat each unit generates? (i.e., its power profile)
Reducing heat recirculation (general plan of action) • Assess the effect of a task on the equipment (CPU, memory, I/O) • Assess the heat generated by the equipment from the task • Assess how much of that heat is recirculated • Assess the inlet temperatures given the heat recirculation • With a mechanism like this, we could predict the effects of a running (or potentially running) job and decide its fate accordingly Terms: task profile, power profile, thermal map prediction
Task profiling (1) • Task profiling • Assess how much CPU utilization, memory activity, disk I/O, network traffic, etc. the application generates • Task profiling can be done • Offline, by code analyzers, or • Online, by test runs • Dirty (and convenient) fact about HPC (high-performance computing): • Incoming jobs have highly predictable profiles
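Online profiling by test run can be approximated with nothing more than two clocks. The sketch below covers only CPU utilization for a task that runs inside the current Python process; the helper name is hypothetical, and memory, disk, and network activity would need other tools.

```python
import time


def profile_cpu_utilization(task):
    """Run task() once and return its CPU time divided by wall-clock time."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    if wall_used <= 0:
        return 0.0  # guard against a zero-length measurement
    return cpu_used / wall_used


# A task that only sleeps uses almost no CPU, so its utilization is near zero.
idle_utilization = profile_cpu_utilization(lambda: time.sleep(0.05))
```

Because HPC jobs tend to repeat with similar behavior, a value measured on one test run can reasonably stand in for later runs of the same job.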
Power profiling • Power profiling • Assess how much heat is generated by each component (i.e. CPU, memory, disk I/O, network, etc.) • Assess how much power is consumed by each component (i.e. CPU, memory, disk I/O, network, etc.) • Power profiling is usually performed offline
Example results of power profiling • Power consumption is mainly affected by the CPU utilization • Power consumption is linear in the CPU utilization: P = a U + b
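The coefficients a and b in P = a U + b can be estimated from offline measurements with an ordinary least-squares fit. The sketch below uses plain Python; the function name and the three sample measurements are invented for illustration and are not results from the presentation.

```python
def fit_power_model(utils, powers):
    """Least-squares fit of P = a*U + b from paired measurements."""
    n = len(utils)
    mean_u = sum(utils) / n
    mean_p = sum(powers) / n
    # Slope a = covariance(U, P) / variance(U)
    sxx = sum((u - mean_u) ** 2 for u in utils)
    sxy = sum((u - mean_u) * (p - mean_p) for u, p in zip(utils, powers))
    a = sxy / sxx
    # Intercept b is the idle power (U = 0)
    b = mean_p - a * mean_u
    return a, b


# Hypothetical readings: utilization (0..1) and measured power in watts
a, b = fit_power_model([0.0, 0.5, 1.0], [100.0, 150.0, 200.0])
```

Here b plays the role of idle power and a the extra power drawn at full utilization.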
A simple thermal model • Figure: a machine takes in air from the A/C and from other machines, consumes power, and sends air to the A/C and to other machines
Effect of CPU utilization on outlet temperature • Task profiling • Assess how much CPU utilization the application generates • Outlet temperature is a function of utilization plus the inlet temperature: T_outlet = f(U) + T_inlet
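One plausible form for f(U) follows from an energy balance: if all consumed power P = a U + b leaves as heat in the air stream, the temperature rise is P divided by the air's mass flow times its specific heat. The sketch below assumes exactly that; the flow rate and the model coefficients are hypothetical inputs, and the slides do not state which f they use.

```python
AIR_DENSITY = 1.19  # kg/m^3, approximate value for air near room temperature
AIR_CP = 1005.0     # J/(kg*K), specific heat of air at constant pressure


def outlet_temp(u, t_inlet, a, b, flow_m3_s):
    """T_outlet = f(U) + T_inlet with f(U) = (a*U + b) / (rho * flow * c_p).

    Assumes all consumed power becomes heat carried away by the air flow.
    """
    power = a * u + b                                  # watts, from P = a*U + b
    rise = power / (AIR_DENSITY * flow_m3_s * AIR_CP)  # kelvin
    return t_inlet + rise
```

Under this assumption the rise scales linearly with power, so doubling the power at a fixed air flow doubles the difference between outlet and inlet temperature.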
Assessing recirculation for the given computational tasks • Assessing recirculation • Obtain the thermal map for the given task assignment • Compare with offline measurements • But we don’t need to know the temperature at every point in the air • Only at the inlets and the outlets • Figure (courtesy: Intel Labs): nodes N1 to N5
Recirculation coefficients • Purpose • Know the air temperature at any 3-D point • Adjust CRAC operation • Adjust computer operation • Obtained from • Strategically placed sensors • On-board sensors • Predicted by • Thorough testing • CFD simulations
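One way to make recirculation coefficients concrete is a linear model in which each machine's inlet temperature is the supply temperature plus a weighted sum of the power drawn by every machine. The sketch below assumes that form; the coefficient matrix is hypothetical and would in practice come from sensor measurements or CFD simulations, and the slides may define the coefficients differently.

```python
def inlet_temps(supply_temp, powers, coeffs):
    """Predict inlet temperatures under an assumed linear recirculation model.

    coeffs[i][j] is the (hypothetical) temperature rise at inlet i
    per watt consumed by machine j.
    """
    return [
        supply_temp + sum(d * p for d, p in zip(row, powers))
        for row in coeffs
    ]


# Two machines: machine 1's heat reaches inlet 0, and machine 0's reaches inlet 1
coeffs = [[0.0, 0.01],
          [0.02, 0.0]]
predicted = inlet_temps(15.0, [100.0, 200.0], coeffs)
```

Only the inlets are predicted, matching the observation that the full 3-D temperature field is not needed.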
How scheduling impacts cooling cost • Different schedules place different demands on cooling capacity • Figure: inlet temperature distribution without cooling and with cooling (25 °C), shown for Scheduling 1 and Scheduling 2
Functional model of scheduling • Tasks arrive at the data center • Scheduler figures out the best placement • Placement that has minimal impact on peak inlet temperatures • Assigns task accordingly • Figure: tasks flow into the scheduler, which assigns each task to a machine
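The placement step can be sketched as a greedy search: try the task on each free machine, predict the resulting peak inlet temperature, and keep the best. This is an illustrative sketch, not the authors' scheduling algorithm; it assumes a linear recirculation model with a hypothetical coefficient matrix, and all names are invented.

```python
def peak_inlet(supply_temp, powers, coeffs):
    """Peak inlet temperature under an assumed linear recirculation model."""
    return max(
        supply_temp + sum(d * p for d, p in zip(row, powers))
        for row in coeffs
    )


def place_task(task_power, powers, coeffs, supply_temp, busy=()):
    """Return (node, peak) for the free node that minimizes peak inlet temperature."""
    best_node, best_peak = None, None
    for node in range(len(powers)):
        if node in busy:
            continue  # node already runs a task
        trial = list(powers)
        trial[node] += task_power  # extra power drawn if the task runs here
        peak = peak_inlet(supply_temp, trial, coeffs)
        if best_peak is None or peak < best_peak:
            best_node, best_peak = node, peak
    return best_node, best_peak


# Heat from node 1 recirculates more strongly than heat from node 0
coeffs = [[0.0, 0.02],
          [0.01, 0.0]]
node, peak = place_task(100.0, [100.0, 100.0], coeffs, supply_temp=15.0)
```

In this made-up example the task lands on node 0, because loading node 1 would push more heat back into node 0's inlet.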
Architectural View • Scheduler (SLURM)
Scheduling Algorithms • Current work assumes incoming jobs that • Are identical (same profile) • Are long-running • Enhance the scheduling algorithm to work with • Heterogeneous data centers • Asynchronous job arrival • Jobs with non-identical execution times
Scheduler Programming • Enhance existing job management software (Moab, SLURM, etc.) to support • Gathering thermal data • Assigning jobs according to policy