Scheduling in HPC Resource Management Systems: Queuing vs. Planning
Matthias Hovestadt, Odej Kao, Alex Keller, and Achim Streit
2003 Job Scheduling Strategies for Parallel Processing (JSSPP) Workshop
Jerry Chou, 8/29/2005

Outline
• Background
• Queuing and Planning Systems
• Advanced Planning Functions
• Example: Computing Center Software
• Conclusion
• Discussion

Background
• HPC systems are operated by resource management systems (RMS) based on the queuing approach
  • PBS, SGE, LoadLeveler, etc.
• Grid middleware is emerging between resource management systems and applications
  • Globus, vgES, etc.
• High-level functions (e.g. co-allocation) need features from the RMS
  • Advance reservation, quality of service
• These features are hard to realize with a queuing-based RMS because it only considers present resource usage
=> This paper proposes planning systems to close the gap

Big Picture
[Figure: layered diagram. Labels, top to bottom: Application; Co-allocation; Grid Middleware (Globus, vgES); Advance Reservation, QoS; RMS (PBS), RMS (LoadLeveler), RMS (SGE), RMS (Condor); Resources]

Queuing and Planning Systems
• Queuing Systems
• Planning Systems
• Queuing vs. Planning Systems

Queuing Systems
• Queues have different limits on resource requests
  • Number of resources requested
  • Execution time
  • Interactive/batch jobs
• Jobs in each queue are sorted by the scheduling policy
  • The highest-priority request is the queue head
• If more than one queue head can be started, further criteria are needed, such as queue priority
• If no queue head can be started, idle resources may be used via backfilling

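The backfilling step can be illustrated with a small sketch. The slide does not say which backfilling variant is meant, so this assumes the common "EASY" variant: a lower-priority job may start now only if it does not delay the blocked queue head. It also assumes a single queue, that the head fits on the machine at all, and that every job carries a run time estimate. `Job` and `jobs_to_start` are hypothetical names for illustration, not part of any RMS.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int    # number of nodes requested
    runtime: int  # estimated run time (assumed to be available)


def jobs_to_start(queue, running, free, now=0):
    """Return the names of the jobs that may start at time `now`.

    queue:   jobs sorted by the scheduling policy, head first
    running: list of (estimated_end_time, nodes) for running jobs
    free:    number of currently idle nodes
    """
    queue, running = list(queue), list(running)
    started = []
    # Start queue heads for as long as they fit into the idle nodes.
    while queue and queue[0].nodes <= free:
        job = queue.pop(0)
        free -= job.nodes
        running.append((now + job.runtime, job.nodes))
        started.append(job.name)
    if not queue:
        return started
    # The head is blocked. Find the time at which enough nodes will be
    # free for it, and how many nodes will be left over at that time.
    head, avail = queue[0], free
    shadow, spare = None, 0
    for end, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            shadow, spare = end, avail - head.nodes
            break
    if shadow is None:
        return started  # head never fits; outside this sketch's assumptions
    # Backfill: a later job may start now if it ends before the head's
    # start time, or if it only uses nodes the head will not need.
    for job in queue[1:]:
        if job.nodes > free:
            continue
        if now + job.runtime <= shadow:
            pass
        elif job.nodes <= spare:
            spare -= job.nodes
        else:
            continue
        free -= job.nodes
        started.append(job.name)
    return started
```

For example, with 2 idle nodes, 6 nodes busy until time 10, and a blocked head that needs 4 nodes, a 2-node job with an estimated run time of 8 can be backfilled because it ends before the head could start anyway.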
Planning Systems - Replanning
• Each request specifies
  • Start time (fixed for reservations)
  • Estimated run time
• When to replan
  • A new request is submitted
  • A running request ends before its estimated end time
• How to replan
  • Delete all non-reservations from the schedule
  • Sort the non-reservations according to the scheduling policy
  • Arrange the reservations in the schedule
  • Insert the non-reservations into the schedule at the earliest possible start time

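The replanning steps can be sketched in a few lines. This is a minimal illustration, not the planner described in the paper: it assumes a machine with a fixed number of identical nodes, treats a request as a reservation when it has a fixed start time, ignores currently running jobs, assumes the reservations themselves do not conflict, and uses first-come-first-served as the default policy. `Request`, `replan`, and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    name: str
    nodes: int
    runtime: int                 # estimated run time
    start: Optional[int] = None  # fixed start time marks a reservation
    submitted: int = 0           # submission time, used by the FCFS policy


def used_nodes(schedule, t):
    """Nodes occupied at time t by the planned (start, request) pairs."""
    return sum(r.nodes for s, r in schedule if s <= t < s + r.runtime)


def fits(schedule, req, start, total):
    # Usage only rises when a planned request starts, so it is enough to
    # check at `start` and at every planned start inside the interval.
    points = [start] + [s for s, _ in schedule if start < s < start + req.runtime]
    return all(used_nodes(schedule, t) + req.nodes <= total for t in points)


def earliest_start(schedule, req, total, now):
    # The earliest feasible start is `now` or the end of a planned request.
    ends = {s + r.runtime for s, r in schedule if s + r.runtime > now}
    for t in sorted({now} | ends):
        if fits(schedule, req, t, total):
            return t
    raise ValueError(f"{req.name} needs more nodes than the machine has")


def replan(requests, total, now=0, policy=lambda r: r.submitted):
    """Rebuild the whole schedule; returns {request name: start time}."""
    # Steps 1 and 2: drop all non-reservations and sort them by policy.
    others = sorted((r for r in requests if r.start is None), key=policy)
    # Step 3: reservations keep their fixed start times.
    schedule = [(r.start, r) for r in requests if r.start is not None]
    # Step 4: insert each non-reservation at its earliest possible start.
    for r in others:
        schedule.append((earliest_start(schedule, r, total, now), r))
    return {r.name: s for s, r in schedule}
```

On a 4-node machine with a full-machine reservation at time 5, a 2-node request with run time 6 has to wait until time 7, while a later and shorter request still fits into the gap before the reservation. Filling such gaps falls out of the "earliest possible start time" rule without a separate backfilling step.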
Advanced Planning Functions
• Requesting Resources
• Dynamic Aspects
• Service Level Agreements

Requesting Resources
• Diffuse requests
  • Give a range: “need 32~128 CPUs”
  • Let the RMS optimize: “need as many nodes as possible”
• Negotiation

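A diffuse request leaves the RMS room to choose. The sketch below shows one way such a choice could look, under assumptions that are not in the slides: the planner already knows a list of candidate time slots with their free CPU counts, and the user states whether an early start or a large allocation matters more. `offers` and `choose` are hypothetical names; the list returned by `offers` could also serve as the set of alternatives presented to the user in a negotiation.

```python
def offers(min_cpus, max_cpus, slots):
    """List the feasible (start_time, cpus) offers for a range request.

    slots: list of (start_time, free_cpus) pairs known to the planner.
    """
    return [(t, min(free, max_cpus)) for t, free in slots if free >= min_cpus]


def choose(candidates, prefer="earliest"):
    """Pick one offer: the earliest start, or the largest allocation."""
    if not candidates:
        return None
    if prefer == "earliest":
        # Earliest start first; among equal starts, the most CPUs.
        return min(candidates, key=lambda o: (o[0], -o[1]))
    # Most CPUs first; among equal sizes, the earliest start.
    return max(candidates, key=lambda o: (o[1], -o[0]))
```

For the request “need 32~128 CPUs” and slots with 16, 64, and 200 free CPUs at times 0, 10, and 30, the first slot is too small, so the feasible offers are 64 CPUs at time 10 and 128 CPUs at time 30.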
Dynamic Aspects
• Variable Reservations
  • Make a reservation ASAP
  • Different from reserved jobs:
    • No fixed start time
  • Different from non-reserved jobs:
    • Never planned later than its first planned start time
• Resource Reclaiming
  • Replace requested resources at run time
• Automatic Duration Extension
  • Extend the runtime of jobs while they are running
  • How long can it be extended?
  • How many times can it be extended?

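The two open questions for automatic duration extension (how long, how many times) map naturally onto two limits per job. The following sketch is one possible bookkeeping scheme, not the mechanism from the paper. It additionally assumes that an extension is refused when it would run into the next request planned on the same resources; that rule and all names (`RunningJob`, `try_extend`) are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunningJob:
    end: int            # currently planned end time
    max_extra: int      # total time the job may be extended by
    max_times: int      # how many extensions may be granted
    extra_used: int = 0
    times_used: int = 0


def try_extend(job: RunningJob, amount: int, next_start: Optional[int]) -> bool:
    """Extend `job` by `amount` if both limits and the schedule allow it.

    next_start: start time of the next request planned on the same
                resources, or None if nothing is planned (assumed rule).
    """
    if job.times_used >= job.max_times:
        return False
    if job.extra_used + amount > job.max_extra:
        return False
    if next_start is not None and job.end + amount > next_start:
        return False
    job.end += amount
    job.extra_used += amount
    job.times_used += 1
    return True
```

A job ending at time 100 with `max_extra=30` and `max_times=2` can be extended by 10 when the next planned request starts at 120, but a further extension by 15 would be refused because it would overlap that request.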
Dynamic Aspects (Cont.)
• Automatic Restart
  • Can utilize short time slots in the schedule
• Space-sharing “Cycle Stealing”
  • Run as a background job to steal resources in a space-sharing system (like Condor)
• Deployment Servers
  • The RMS plans both the requested resources and the time needed to reconfigure the hardware

Service Level Agreements (SLA)
• An SLA has to be considered not only in the scheduling process but also at runtime
• At runtime, the scheduler is not responsible for measuring the fulfillment of the SLA, only for providing all granted resources

Computing Center Software (CCS)
• Architecture
  • User Interface (UI): provides a single access point to one or more systems
  • Access Manager (AM): manages the user interfaces and is responsible for authentication, authorization, and accounting
  • Planning Manager (PM): plans the user requests onto the machine
  • Machine Manager (MM): provides machine-specific features
  • Island Manager (IM): provides CCS-internal services and watchdog facilities to keep the island in a stable condition

Process Flow
[Flowchart, reconstructed from its labels]
• User: specifies the expected duration of each request
• PM: re-plans the schedule from the requests
  • Fix-time request: reserves resources for a given time
  • Var-time request: can move to an earlier time slot when replanning
• MM: maps the schedule to machines and verifies whether it can be realized with the available hardware
  • Yes: done
  • No: find an alternative time and send a conflict list to the PM
• PM: decides whether it can accept the conflict list (Yes/No branches; their targets are not recoverable from the extracted flowchart)

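The PM/MM exchange can be read as a small verification loop. The sketch below shows one possible reading of the flowchart, since its branch targets are only partly recoverable: the MM returns a conflict list of alternative start times, and the PM either adopts all of them or keeps its schedule and has to re-plan. The functions and the three callbacks (`can_realize`, `find_alternative`, `pm_accepts`) are hypothetical stand-ins, not CCS interfaces.

```python
def verify(schedule, can_realize, find_alternative):
    """MM side: build the conflict list for a planned schedule.

    schedule: {request name: planned start time}
    Returns {request name: alternative start time} for every entry that
    cannot be realized with the available hardware.
    """
    return {
        name: find_alternative(name, start)
        for name, start in schedule.items()
        if not can_realize(name, start)
    }


def negotiate(schedule, can_realize, find_alternative, pm_accepts):
    """One PM/MM round; returns (schedule, done)."""
    conflicts = verify(schedule, can_realize, find_alternative)
    if not conflicts:
        return schedule, True  # schedule can be realized as planned
    if all(pm_accepts(name, t) for name, t in conflicts.items()):
        # PM adopts the alternative start times proposed by the MM.
        return {**schedule, **conflicts}, True
    return schedule, False     # assumed outcome: PM has to re-plan
```

A PM might, for instance, accept alternatives only for var-time requests and reject them for fix-time requests; that policy would live entirely in `pm_accepts`.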
Conclusion
• Classifies and compares queuing systems with planning systems
• Presents possible advanced planning functionality
• The aim of the paper is to show the benefit of planning systems for managing HPC machines

Discussion
• Do planning systems solve all the problems?
  • What if most jobs want to run ASAP?
  • What if run times are not estimated precisely?
  • How do performance and utilization compare between queuing systems and planning systems?
• If you were a resource provider, would you use it?
• Which features could be provided by vgES?
  • Diffuse requests
  • Resource reclaiming
  • Variable reservation
  • Negotiation