
Unobtrusive power proportionality for Torque: Design and Implementation


Presentation Transcript


  1. Unobtrusive power proportionality for Torque: Design and Implementation Arka Bhattacharya Acknowledgements: Jeff Anderson Lee Andrew Krioukov Albert Goto

  2. Introduction • What is power proportionality? • A system is power proportional when its performance-power ratio at every performance level equals that at the maximum performance level • Servers consume a high percentage of their maximum power even when idle • Hence, a practical route to power proportionality is to switch off idle servers
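To see why idle servers break proportionality, here is a minimal numeric sketch; the 200 W peak and 60% idle fraction are assumed, illustrative values, not measurements from the PSI cluster:

```python
# Hypothetical figures: a server that draws a large fraction of its
# peak power while idle is far from power proportional.
P_MAX = 200.0          # peak power draw (W), assumed
IDLE_FRACTION = 0.6    # idle draw as a fraction of peak, assumed

def power(utilization):
    """Power draw of a typical non-proportional server at a given utilization."""
    return P_MAX * (IDLE_FRACTION + (1 - IDLE_FRACTION) * utilization)

def proportional_power(utilization):
    """Power draw of an ideal power-proportional server."""
    return P_MAX * utilization

for u in (0.0, 0.25, 0.5, 1.0):
    print(f"util={u:.2f}  actual={power(u):6.1f} W  ideal={proportional_power(u):6.1f} W")
# At util=0 the non-proportional server still burns 120 W; powering it
# off recovers that energy, which is the approach taken in this talk.
```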

  3. NapSAC – Krioukov et al., CPS 2011 [architecture diagram: incoming requests flow through load distribution, scheduling, and power management components, backed by a computational "spinning reserve"; plot of Wikipedia request rate vs. power]

  4. The need for power proportionality of IT equipment in Soda Hall • Soda Hall power: 450-500 kW • Cluster room power: 120-130 kW (~25%) • Total HVAC for cluster rooms: 75-85 kW (~15%)

  5. PSI Cluster • PSI Cluster: 20-25 kW (~5% of Soda) • Cluster room power: 120-130 kW (~25% of Soda) • Total HVAC for the PSI Cluster room: 20-25 kW (~5% of Soda) • Total HVAC for cluster rooms: 75-85 kW (~15% of Soda)

  6. The PSI Cluster • Consumes ~20-25 kW of power irrespective of workload and contains about 110 servers • Recent server faults have reduced the usable cluster to 78 servers (most of the faulty servers remain powered on all the time) • Used mainly by NLP, Vision, AI and ML graduate students • An HPC cluster running Torque

  7. PSI Cluster

  8. Possible energy savings: switching off idle servers could save ~50% of the energy

  9. Current state:

  10. Result: cluster power drops to 10 kW; we save 49% of the energy

  11. What is Torque? • Terascale Open-source Resource and QUEue Manager • Built upon the original Portable Batch System (PBS) project • Resource manager: manages availability of, and requests for, compute node resources • Widely used by academic institutions throughout the world for batch processing

  12. Maui Scheduler • Job scheduler • Implements and manages: • Scheduling policies • Dynamic priorities • Reservations • Fairshare

  13. Sample Job Flow • A script is submitted to TORQUE specifying the required resources • Maui periodically retrieves from TORQUE the list of potential jobs, available node resources, etc. • When resources become available, Maui tells TORQUE to execute certain jobs on particular nodes • TORQUE dispatches jobs to the PBS MOMs (machine-oriented mini-servers) running on the compute nodes; pbs_mom is the process that starts the job script • Job status changes are reported back to Maui, which updates its information
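To make the first step of this flow concrete, here is a minimal sketch of a submission from the user's side; the resource requests, script contents, and file names are illustrative assumptions, not taken from the deployment:

```python
# Minimal sketch: write a PBS job script and submit it with qsub.
import subprocess
import tempfile

JOB_SCRIPT = """#!/bin/bash
# Request 1 node with 4 processors and an hour of wall time (illustrative).
#PBS -N example_job
#PBS -l nodes=1:ppn=4
#PBS -l walltime=01:00:00
cd $PBS_O_WORKDIR
python train_model.py   # the actual workload (hypothetical)
"""

with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
    f.write(JOB_SCRIPT)
    path = f.name

# On success qsub prints the new job's identifier (e.g. "12345.master").
job_id = subprocess.check_output(["qsub", path]).decode().strip()
print("submitted:", job_id)
```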

  14. Why are we building power-proportional Torque? • To shed load in Soda Hall • To investigate why production clusters don't implement power proportionality • To integrate power proportionality into software used in many clusters throughout the world

  15. Desired properties of an unobtrusive power-proportionality feature • Avoid modifications to the Torque source code • Use only existing Torque interfaces • Make the feature completely transparent to end users • Maintain system responsiveness • Centralized • No dependence on the resource manager/scheduler version

  16. Analysis of the PSI cluster • Logs: active and idle queue logs, and job placement statistics • Logs cover 68 days in Feb-April 2011 • Logs were recorded once every minute • Logs contain information on ~169k jobs and ~40 users

  17. Types of servers in the PSI cluster [table of server classes] • Each server class is further divided according to various features • Not all servers listed above are switched on all the time

  18. CDF of server idle durations. TAKEAWAY 1: Most idle periods are short

  19. Contribution of server idle periods to total idle time. TAKEAWAY 2: To save energy, target the long idle periods

  20. CDF of job durations [plot regions labeled INTERACTIVE (50-500 s) and BATCH]. TAKEAWAY 3: Most jobs are long, so a slight increase in queuing time won't hurt

  21. Summary of takeaways • Short server idle periods, though numerous, contribute very little to total server idle time • The power-proportionality algorithm therefore need not be aggressive in switching off servers • Waking a server takes ~5 minutes; compared to the running time of a typical job, this is negligible

  22. Loiter Time vs Energy Savings

  23. Design of unobtrusive Power Proportionality for Torque

  24. Using Torque interfaces • What useful state information does Torque/Maui maintain? • The state (active/offline/down) of each server and the jobs running on it, obtained through the "pbsnodes" command • The list of running and queued jobs, obtained through the "qstat" command • The constraints and scheduling details of each job, obtained through the "checkjob" command
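A minimal sketch of how a monitoring script can poll the first two of these interfaces; the parsing assumes the usual plain-text layouts of `pbsnodes -a` and `qstat`, which can vary across Torque versions:

```python
# Minimal sketch: poll Torque's CLI interfaces for cluster state.
import subprocess

def node_states():
    """Map node name -> state (free/offline/down/...) from pbsnodes -a."""
    out = subprocess.check_output(["pbsnodes", "-a"]).decode()
    states = {}
    node = None
    for line in out.splitlines():
        if line and not line.startswith(" "):     # unindented line = node name
            node = line.strip()
        elif "state =" in line and node:          # indented attribute line
            states[node] = line.split("=", 1)[1].strip()
    return states

def queued_jobs():
    """List job ids currently in the queued ('Q') state from qstat."""
    out = subprocess.check_output(["qstat"]).decode()
    jobs = []
    for line in out.splitlines()[2:]:             # skip the two header rows
        fields = line.split()
        if len(fields) >= 6 and fields[4] == "Q": # column 5 is the job state
            jobs.append(fields[0])
    return jobs

print(node_states())
print(queued_jobs())
```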

  25. First implementation: a state machine for each server (states: Active, Offline, Waking, Down, Problematic) • Active → Offline: server_idle_time > LOITER_TIME • Offline → Active: an idle job can be scheduled on the server • Offline → Down: server_offline_time > OFFLINE_LOITER_TIME and no job has been scheduled on the server • Down → Waking: an idle job exists • Waking → Active: the server has woken up • Waking → Problematic: the server is not waking

  26. Does not work! • Each job is submitted to a specific queue • We must ensure that the right server wakes up

  27. Next implementation: state machine for each server • Down → Waking now requires that an idle job exists and the server belongs to the desired queue (other transitions as before)

  28. Still did not work! • Each job has specific constraints which Torque takes into account while scheduling • Job constraints can be obtained through the "checkjob" command

  29. Next implementation: state machine for each server • Down → Waking now requires that an idle job exists, the server belongs to the desired queue, and the server satisfies the job's constraints (other transitions as before)

  30. Scheduling problem: job submission characteristics • Users tend to submit multiple jobs at a time (often >20) • Torque has its own fairness mechanisms, which won't schedule all the jobs even if there are free servers • To accurately predict which jobs Torque will schedule, and avoid switching on extra servers, we would have to emulate the Torque scheduling logic! • That would tie the power-proportionality feature to a specific Torque policy • Solution: switch on only a few servers at a time and check whether Torque schedules the idle job

  31. Next implementation: state machine for each server • Down → Waking now additionally throttled: switch on only a few servers at a time (other transitions as before)

  32. Maintain responsiveness/headroom • The debug cycle usually consists of users running short jobs and validating the output • If no server satisfying the job's constraints is switched on, a user might have to wait a long time just to verify that the job runs • If a job throws errors, the user might have to wait for an entire server power cycle to run the modified job • Solution: • Group servers according to features • In each group, keep a limited number of servers powered on as a spinning reserve at all times

  33. Final implementation: state machine for each server • Active → Offline: server_idle_time > LOITER_TIME, and only if switching the server off still leaves headroom • Offline → Active: an idle job can be scheduled on the server • Offline → Down: server_offline_time > OFFLINE_LOITER_TIME and no job has been scheduled on the server • Down → Waking: an idle job exists, the server belongs to the desired queue and satisfies the job's constraints (switching on at most MAX_SERVERS at a time), or the server is woken to maintain headroom • Waking → Active: the server has woken up • Waking → Problematic: the server is not waking
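A minimal Python sketch of this final state machine; the parameter values mirror the deployment slide, the helper checks (queue membership, constraint matching, headroom accounting, power control) are stand-in callables rather than real implementations, and the wake timeout is an assumption:

```python
# Illustrative reconstruction of the per-server state machine,
# not the deployed script.
import time
from dataclasses import dataclass, field
from enum import Enum, auto

LOITER_TIME = 7 * 60          # idle seconds before an active server is drained
OFFLINE_LOITER_TIME = 3 * 60  # drained seconds before a server is halted
MAX_SERVERS = 5               # servers woken at a time
HEADROOM_PER_GROUP = 3        # spinning reserve kept per feature group

class State(Enum):
    ACTIVE = auto()
    OFFLINE = auto()
    DOWN = auto()
    WAKING = auto()
    PROBLEMATIC = auto()

@dataclass
class Server:
    name: str
    group: str                                       # feature group
    state: State = State.ACTIVE
    since: float = field(default_factory=time.time)  # time of last transition

    def transition(self, new_state):
        self.state = new_state
        self.since = time.time()

def step(server, idle_job_matches, waking_count, active_in_group,
         halt, wake, woke_up, health_ok):
    """One poll-cycle update for one server. `idle_job_matches` stands in
    for the queue/constraint checks via qstat and checkjob; `since` is
    assumed to be refreshed whenever the server starts or finishes a job."""
    elapsed = time.time() - server.since
    if server.state is State.ACTIVE:
        # Drain an idle server, but never below the group's headroom.
        if elapsed > LOITER_TIME and active_in_group(server.group) > HEADROOM_PER_GROUP:
            server.transition(State.OFFLINE)
    elif server.state is State.OFFLINE:
        if idle_job_matches(server):
            server.transition(State.ACTIVE)   # reactivate without a reboot
        elif elapsed > OFFLINE_LOITER_TIME:
            halt(server)                      # e.g. ssh '/sbin/halt'
            server.transition(State.DOWN)
    elif server.state is State.DOWN:
        # Wake for a matching idle job (throttled) or to restore headroom.
        if (idle_job_matches(server) and waking_count() < MAX_SERVERS) \
                or active_in_group(server.group) < HEADROOM_PER_GROUP:
            wake(server)                      # e.g. wake-on-LAN magic packet
            server.transition(State.WAKING)
    elif server.state is State.WAKING:
        if woke_up(server) and health_ok(server):
            server.transition(State.ACTIVE)
        elif elapsed > 10 * 60:               # assumed wake timeout
            server.transition(State.PROBLEMATIC)
```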

  34. But the servers don't wake up!!! • Each server has to bootstrap a list of services, such as network file systems, work directories, the portmapper, etc. • These bootstraps often fail, leaving servers in an undesired state (e.g., with no home directories mounted to write user output to!) • Solution: • Run a health-check script on each server • Check for proper configuration of the needed services, and make a server available for scheduling only if the health check succeeds
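A minimal sketch of such a health check; which mounts and services a node must verify is site-specific, so the paths below are illustrative assumptions:

```python
# Minimal node health check: verify required mounts and the portmapper.
import subprocess

REQUIRED_MOUNTS = ["/home", "/work"]   # hypothetical NFS mount points

def mounts_ok():
    """Verify that the required file systems are actually mounted."""
    with open("/proc/mounts") as f:
        mounted = {line.split()[1] for line in f}
    return all(m in mounted for m in REQUIRED_MOUNTS)

def portmapper_ok():
    """Check that the portmapper answers (NFS depends on it)."""
    return subprocess.call(["rpcinfo", "-p", "localhost"],
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0

if __name__ == "__main__":
    # Exit nonzero on failure so the master script keeps this node
    # out of scheduling until the checks pass.
    raise SystemExit(0 if mounts_ok() and portmapper_ok() else 1)
```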

  35. Power-proportional Torque at a glance: • Completely transparent to the user • No modifications to the Torque source code • A ~1000-line Python script which runs only on the Torque master server • Halts servers through ssh • Wakes servers through wake-on-LAN • Separates scheduling policy from mechanism: Torque continues to dictate the scheduling policy
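A minimal sketch of the two power-control mechanisms named above, assuming passwordless root ssh; the host name and MAC address are placeholders:

```python
# Halt a node over ssh; wake one with a wake-on-LAN magic packet.
import socket
import subprocess

def halt(host):
    """Power a node down over ssh (assumes passwordless root ssh)."""
    subprocess.check_call(["ssh", f"root@{host}", "/sbin/halt", "-p"])

def wake(mac, broadcast="255.255.255.255", port=9):
    """Send a wake-on-LAN magic packet: 6 x 0xFF, then the MAC 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    packet = b"\xff" * 6 + mac_bytes * 16
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(packet, (broadcast, port))

# wake("00:11:22:33:44:55")   # placeholder MAC for a compute node
# halt("psi-node-17")         # placeholder host name
```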

  36. Deployment • Deployed on 57 of the 78 active nodes in the PSI cluster, 150 cores in total • Servers were classified into 5 groups based on features • HEADROOM_PER_GROUP = 3 • MAX_SERVERS_TO_WAKE_AT_A_TIME = 5 • LOITER_TIME = 7 minutes • OFFLINE_LOITER_TIME = 3 minutes
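The same parameters, laid out as a hypothetical configuration block; the key names are illustrative, the values are those deployed:

```python
# Deployment parameters from the talk, as a configuration dict.
DEPLOY_CONFIG = {
    "groups": 5,                          # server feature groups
    "headroom_per_group": 3,              # spinning-reserve servers per group
    "max_servers_to_wake_at_a_time": 5,
    "loiter_time_s": 7 * 60,              # idle time before going offline
    "offline_loiter_time_s": 3 * 60,      # offline time before halting
}
```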

  37. Average Statistics • Deployed since last week • ~800 jobs analyzed • Avg utilization of cluster = 40% • % Energy saved = 49%

  38. Results:

  39. HVAC power savings

  40. Number of servers powered on at a time: Headroom

  41. Expected vs Actual savings

  42. Submission vs Execution profile

  43. CDF of job queue time as a percentage of job length

  44. Conclusions: what we achieved • Power proportionality is easy to achieve for Torque without changing any source code at all • The script could be run on any standard Torque cluster to save energy • Switching servers back on in a consistent state is the single biggest roadblock to deploying the script • We saved a maximum of ~17 kW of power in Soda Hall (~3%), and this was from only half the PSI cluster!
