Local Resource Management System & State Estimation

Local Resource Management System & State Estimation • Local resource management systems • Condor, Maui, LSF, PBS • Prediction techniques • example NWS • improve resource selection

Condor - Introduction • Batch job system that allows usage of both dedicated and non-dedicated systems. • Provides users with extra computing power • Introduces complexities • remove jobs before they are finished (preemption) • run on a wide array of machines (matchmaking)

Condor – Preemptive Resume Scheduling • Advantages • use resources that are only available occasionally, through checkpoints, preemption and allocation • no backfilling (backfilling takes advantage of holes in the schedule to run more jobs and thereby increase efficiency) • fair sharing among jobs and users • compute on demand (low vs high priority)

Condor – Scheduling • Submit jobs to local computer queue • Interact with matchmaker to run job (1 cpu/job) • Run appropriate (ClassAd) job by claiming it

Triumvirate • User agent – make sure job finishes, on failure resubmit, etc. • Owner agent – ensure owner's policy of how computer is used, responsible for running submitted jobs • Matchmaker – find matches between user and owner agent and implement system-wide policies

Triumvirate (2)

Condor – Matchmaking & Claiming • User submits job to queue, where it gets a unique identification • User agent sends a ClassAd (every 5 min) as long as there are jobs that are not running • Owner agent sends a ClassAd (every 5 min) to describe the computer it is responsible for • Matchmaker accepts ClassAds and attempts to find matches – negotiation • On a match, user and owner agent work out the details independently of the matchmaker (using up-to-date information) • User agent sends job to owner agent, and it runs
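
The two-sided matching described above can be sketched in a few lines. This is a toy model, not Condor's ClassAd language: the `Ad` class, the attribute names and the "first acceptable machine wins" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Ad:
    """Toy stand-in for a ClassAd: attributes plus a requirements predicate."""
    name: str
    attrs: dict
    requirements: Callable[[dict], bool]

def match(job: Ad, machines: List[Ad]) -> Optional[Ad]:
    """Return the first machine for which BOTH sides' requirements hold."""
    for machine in machines:
        if job.requirements(machine.attrs) and machine.requirements(job.attrs):
            return machine
    return None

# Hypothetical job ad (user agent) and machine ads (owner agents).
job = Ad("job-1", {"owner": "alice", "memory_mb": 512},
         lambda m: m["arch"] == "x86_64" and m["memory_mb"] >= 512)
machines = [
    Ad("small", {"arch": "x86_64", "memory_mb": 256}, lambda j: True),
    Ad("private", {"arch": "x86_64", "memory_mb": 2048},
       lambda j: j["owner"] == "bob"),
    Ad("shared", {"arch": "x86_64", "memory_mb": 1024},
       lambda j: j["memory_mb"] <= 1024),
]
chosen = match(job, machines)
```

Here "small" fails the job's memory requirement and "private" fails the owner's policy, so the sketch settles on "shared". The point is that a match needs agreement from both the user side and the owner side.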

Condor – Matchmaking & Claiming (2) • On problems outside the process, redo matchmaking; on program error, record the problem and inform the user • When the program starts, another process (shadow) is started on the user agent; it is responsible for Condor’s remote I/O capabilities • Running jobs continue even if matchmaker fails

Condor - preemption • Preemption is necessary to respect interests of all parties • Key to success is checkpoint creation • when preempted from a machine • manual checkpoint creation • periodic checkpoint creation to safeguard against failures • Crashes/disruptions happen frequently in grids • Checkpointing and reacting to preemptions are an essential part of Condor’s approach to reliability.
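
As a rough illustration of periodic checkpointing and resume, the sketch below saves a job's state to a file every few steps and picks up from the last saved state after a simulated preemption. It is application-level checkpointing written for this example only; the file format, the `checkpoint_every` parameter and the `preempt_at` switch are assumptions and say nothing about how Condor itself captures process state.

```python
import json
import os
import tempfile

def run(total_steps, ckpt_path, checkpoint_every=10, preempt_at=None):
    """Sum 0..total_steps-1, checkpointing periodically; resume if a checkpoint exists."""
    state = {"step": 0, "acc": 0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)          # resume from the last checkpoint
    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            return None                   # preempted: work since last checkpoint is lost
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            with open(ckpt_path, "w") as f:
                json.dump(state, f)       # periodic checkpoint
    return state["acc"]

path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
first = run(100, path, preempt_at=57)     # preempted; last checkpoint was at step 50
second = run(100, path)                   # resumes at step 50 and finishes
```

Only the steps after the last checkpoint are repeated on resume, which is why a shorter checkpoint interval trades extra I/O for less lost work.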

Condor – user preemption • Manual preemption • Automation of above process (eg. running time) • Preemption on behalf of Condor • eg. check if job can run on a better machine • not supported in current version of Condor • needs consideration of problems such as ‘thrashing’ (always looking for a better computer and never getting any job done)

Condor – owner / matchmaker preemption • Owner removes job running on his machine • automated by Condor (eg. check keyboard inactivity) • manually by running a command • Matchmaker can enforce administrator policies to increase efficiency • eg. run a better job on a machine already running one • Condor strongly prefers however not to preempt jobs if they can be run on an idle machine.

Condor - conclusion • Condor can balance the desires of all stakeholders • Condor can both take advantage of sporadically available resources and react to problems such as failures • This flexibility and robustness is its key to success

Maui Scheduler - Introduction • High performance scheduler for local clusters • Includes resource reservation, availability estimation and allocation management • External manager, extends and enhances the capabilities and performance of existing scheduler

Maui – Allocation properties • Concept of reservation to maintain resource allocations • most important feature is future allocations • set aside a block of resources for various purposes such as cluster maintenance, guaranteed job start time • resource expression: resource quantity and type conditions which must be met to include • access control list (ACL): which consumers may utilize the reserved resources • timeframe: time period over which reservation actually blocks resources
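
The three parts of a reservation listed above (resource expression, ACL, timeframe) map naturally onto a small record. The sketch below is an illustrative data structure only; the field names and the single `min_procs` condition are assumptions, not Maui's configuration syntax.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reservation:
    min_procs: int      # resource expression, simplified to one condition
    acl: frozenset      # consumers who may use the reserved resources
    start: int          # timeframe start
    end: int            # timeframe end (exclusive)

    def includes(self, node_procs: int) -> bool:
        """True if a node with this many processors meets the resource expression."""
        return node_procs >= self.min_procs

    def blocks(self, user: str, t: int) -> bool:
        """True if the reservation keeps `user` off its resources at time t."""
        return self.start <= t < self.end and user not in self.acl

# Hypothetical maintenance window that only "admin" may use.
maintenance = Reservation(min_procs=16, acl=frozenset({"admin"}), start=100, end=200)
```

Outside the timeframe the reservation blocks nobody, and inside it only consumers missing from the ACL are kept away.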

Maui – Allocation properties (2) • Revocation of allocation • support for revocable and irrevocable reservations • eg. strict time constraints on data availability or job completion • default is irrevocable; reservations maintained until timeframe has expired or explicitly removed • Guaranteed completion time of allocations • locked to exact time, guaranteed to complete before certain time or guaranteed to start after given time • scheduler regularly tries to optimize

Maui – Allocation properties (3) • Guaranteed number of attempts to complete a job • don’t attempt to start job until all prerequisites are met • using the defer mechanism, Maui can specify how many times to try to locate resources for a job before giving up or putting it on hold • Allocation run-to-completion • configure to disable all or a subset of preemptions, thus guaranteeing a job to complete without interference • Exclusive allocations • request dedicated resources to guarantee exclusive access

Maui – Allocation properties (4) • Malleable Allocations • all aspects can be dynamically modified • if job consumes excessive resources, Maui can preempt or even cancel job depending on the resource utilization policy

Maui - Access to available scheduling info • Access to the tentative scheduler • provide information on all possible availability times • scheduler can request single estimated start time for job • Exclusive control • Maui maintains exclusive control over the execution • Event notification • generalized event management interface; respond immediately to changes in the environment

Maui – Requesting resources • Allocation offers • full contextual information regarding the request and if and how Maui can satisfy this request • Allocation cost or objective information • interface with allocation management systems that assist to assign costs to resource consumption • Advance reservation • allows full control to peers over the scheduling of jobs through time • Requirement for providing maximum allocation time in advance • credential-based walltime limits can be configured based on various criteria

Maui – Requesting resources (2) • Deallocation policy • support for single-step resource allocation requests; create resource allocation valid until job completion • two-phase courtesy reservation; after courtesy is sent, needs to receive a reservation commit; otherwise remove job • Remote co-scheduling • stage remote jobs to a local cluster • Consideration of job dependencies • offer basic job dependency support to block certain job steps until specific prerequisites are met
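
The two-phase courtesy reservation can be pictured as a small state machine: a tentative reservation is recorded, and it either receives a commit or is dropped. The sketch below assumes a fixed `commit_window` after which uncommitted reservations are removed; the slide does not say how long Maui waits, so that parameter and all names here are hypothetical.

```python
class CourtesyReservations:
    """Toy two-phase reservation table: reserve first, commit later or lose it."""

    def __init__(self, commit_window):
        self.commit_window = commit_window   # assumed timeout, not from the source
        self.pending = {}                    # job_id -> time the courtesy reservation was made
        self.committed = set()

    def reserve(self, job_id, now):
        self.pending[job_id] = now

    def commit(self, job_id, now):
        made = self.pending.pop(job_id, None)
        if made is None or now - made > self.commit_window:
            return False                     # unknown or too late: reservation is gone
        self.committed.add(job_id)
        return True

    def expire(self, now):
        """Remove and return jobs whose commit never arrived in time."""
        stale = [j for j, t in self.pending.items() if now - t > self.commit_window]
        for j in stale:
            del self.pending[j]
        return stale

table = CourtesyReservations(commit_window=30)
table.reserve("a", now=0)
table.reserve("b", now=0)
committed_a = table.commit("a", now=10)
expired = table.expire(now=40)
```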

Maui – Manipulating the allocation execution • Preemption • suspend operations are supported as far as that capability is available in the underlying manager • Checkpointing • ‘checkpoint and terminate’ & ‘checkpoint and continue’ are supported • Migration • support for intra-domain job migration, but no support for QoS, load balancing, or other optimization • Restart • checkpoints used if available

LSF - Introduction • As a low-level scheduler • Load Sharing Facility

LSF – Available-information attributes • Access to the tentative scheduler • often impractical in real-world applications, no support • Exclusive control • LSF executes in user space, so its control is not exclusive; it can only provide the necessary measures • Event notification • supplies an event-notification service for high-level schedulers

LSF – Requesting resources • Allocation offers • doesn’t expose potential resource allocations • Allocation cost or objective information • unsupported • Advance reservation • provides built-in and Maui-integrated capabilities • Requirement for providing maximum allocation time in advance • high regard

LSF – Requesting resources (2) • Deallocation policy • automatic • Remote co-scheduling • supported by higher-order scheduling instances • Consideration of job dependencies • built-in support for job dependencies through logical expressions based on 15 dependency conditions

LSF – Allocation properties • Revocation of allocation • not needed because of resource shortness, etc. • Guaranteed completion time of allocations

LSF – Allocation properties (2) • Guaranteed number of attempts to complete a job • distinguish between attempts that are execution pre-condition and execution condition with complete flexibility • Allocation run-to-completion • with implicit assumptions that allocations don’t exceed resource limits for example • Exclusive allocations • can dispatch jobs to hosts where no other LSF job is running

LSF – Allocation properties (3) • Malleable Allocations • built-in mechanisms allow allocations to decay consumption over time on a per-resource basis

LSF – Manipulating the allocation execution • Preemption • support since 1995, preempted workloads retain resources • Checkpointing • assuming application supports it, LSF provides interface • Migration • provide mechanism to be done by high-level scheduler • Restart • provides interface

LSF - Conclusion • Supports most attributes of a low-level scheduler that can be exploited by a high-level scheduler

PBS – Introduction • Portable Batch System • Flexible workload management and batch job scheduling system • Covers the entire Grid computing space: security, information, compute and data • Middleware technology that sits between compute-intensive or data-intensive applications and the network, hardware and OS • All jobs go to a single virtual pool, which is scheduled and distributed on the grid

PBS – Security • Fundamental capabilities are secure authentication and authorization • Internally it makes use of user-name based auth • Support for X.509 Grid standard identification • certificate lifetime (expire/renew) • Identity mapping between sites is handled by a mapping function

PBS - Information • Information management with access to the state of the infrastructure • Collect real-time data on state with job executor daemon processes (MOMs) • Easy integration with larger Grid information databases

PBS - Compute • Advance reservation support • check for conflicts • eg. reserve resources for car-crash test including computer cycles, network, database, facility • Cycle harvesting • expand available computing resources by using idle workstations • Peer scheduling • enable a site or sites with different PBS installations to automatically run jobs from each other • no job will be moved if it cannot run immediately
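
The conflict check behind an advance reservation comes down to an interval-overlap test. The sketch below is a generic version of that test, assuming reservations on one resource are half-open time intervals `[start, end)`; it is not PBS code, and the function name is made up for the example.

```python
def conflicts(existing, start, end):
    """Return the existing reservations that overlap the requested [start, end).

    Two half-open intervals overlap exactly when each one starts
    before the other one ends.
    """
    return [(s, e) for (s, e) in existing if start < e and s < end]

booked = [(0, 10), (20, 30)]
free_slot = conflicts(booked, 10, 20)      # fits exactly between the two bookings
clash = conflicts(booked, 5, 25)           # touches both bookings
```

With half-open intervals, a request that starts exactly when another reservation ends is not a conflict, which lets reservations be scheduled back to back.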

PBS - Data • Most basic capability of data Grid: file staging • automatic handling of copying files onto execution nodes (stage-in) prior to running job • copying files off execution nodes (stage-out) after job completes • PBS will not run jobs until stage-in is fully done • Support for Globus Toolkit, scp, Gridftp, etc.

PBS – Available-information attributes • Access basic information by typing qstat • Email notification

PBS – Requesting resources • Single resource solution to a job request • Estimated completion time is configurable • absence of this information however hampers performance (needed by backfilling for example) • Job dependencies • Co-scheduling by simply configuring the queues of the system

PBS – Allocation properties • Revoke any allocation, whether the job is queued or running • Preemption by the scheduler is also possible; choice of suspension, checkpointing, requeuing, termination • Configurable job completion attempts • Configurable exclusive allocation, etc. • No support for malleable allocation (ie. addition or revocation of resources during runtime)

PBS - Manipulating the allocation execution • Support for requeue, restart • On preemption checkpoint generation and migration

Prediction techniques • The problems of scheduling and resource allocation are central to Grid performance • Applications must balance performance against the communication overhead that parallelism produces • Grid resources differ widely in performance • A resource allocator must choose the right combination of resources from a pool that is constantly changing

Prediction techniques (2) • Categorization into static and dynamic performance characteristics based on speed of change • static: clock speed (CPU) for example • dynamic: CPU load, network throughput

Grid resource performance prediction • For a grid scheduler two characteristics can be exploited to overcome the complexities introduced by the dynamics of Grid performance response • Observable Forecast Accuracy • predictions for future performance measurements can be evaluated by recording the accuracy once the measurements are actually gathered • Near-term Forecasting Epochs • scheduler can make decisions dynamically, just before execution begins. Since accuracy usually degrades into the future, make decision at last possible moment

Prediction – an example (NWS) • Provide 3 fundamental functionalities • Monitoring, Forecasting, Reporting • NWS – Network Weather Service • grid monitoring and forecasting tool designed to support dynamic resource allocation and scheduling • sensor control subsystem • historical data for future performance prediction • multiple reporting interfaces • convenient methodology for replication and caching

Prediction – an example (NWS) (2) • Performance monitoring and forecasting system must be able to execute on all platforms available to the user • written in C; highest portability with standard libs • Two types of monitors (CPU probe) • passive: read measurement gathered through some other means (eg. local OS) eg. UNIX load average • non-intrusive • inaccurate? • active: load own resource and observe performance response • know exact performance • intrusive
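
The difference between the two monitor types can be shown with two small probes. This is a sketch, not NWS code (NWS is written in C): the passive probe only reads the UNIX load average through `os.getloadavg()`, and turning it into an availability figure with `1 / (1 + load)` is an assumption made here; the active probe runs a fixed amount of work and reports how much CPU time it actually received.

```python
import os
import time

def passive_probe():
    """Estimate the CPU share a new process would get from the 1-minute load average."""
    try:
        load1, _, _ = os.getloadavg()
    except (AttributeError, OSError):
        return None                      # load average not available on this platform
    return 1.0 / (1.0 + load1)           # assumed conversion, cheap but indirect

def active_probe(iterations=200_000):
    """Load the CPU briefly and return CPU time received divided by wall-clock time."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    total = 0
    for i in range(iterations):
        total += i * i                   # the probe's own artificial workload
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    return cpu / wall if wall > 0 else None

passive = passive_probe()
active = active_probe()
```

The passive probe costs almost nothing but trusts the operating system's figure, while the active probe measures the response directly at the price of consuming the resource it measures.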

Prediction – an example (NWS) (3) • Intrusiveness vs Scalability (Network probe) • probe the network by timing packet travel duration • for more hosts, probe collision will occur, resulting in loss of bandwidth • NWS uses a token-passing method to prevent such problems
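
A minimal way to picture the token-passing idea: only the host that currently holds the token is allowed to probe, so two probes never run at the same time. The ring order and the "probe every other host, then pass the token" rule below are assumptions for illustration, not the actual NWS protocol.

```python
from collections import deque

def probe_schedule(hosts, rounds):
    """Return (prober, target) pairs in the order they would run.

    Only the token holder probes; afterwards the token moves to the next host.
    """
    ring = deque(hosts)
    schedule = []
    for _ in range(rounds):
        holder = ring[0]
        for target in hosts:
            if target != holder:
                schedule.append((holder, target))
        ring.rotate(-1)                  # pass the token to the next host in the ring
    return schedule

order = probe_schedule(["a", "b", "c"], rounds=2)
```

Serialising probes this way avoids collisions, but each host probes less often as the ring grows, which is the intrusiveness versus scalability trade-off the slide names.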

Prediction – an example (NWS) (4) • Forecasting • an inherent problem of prediction • assumptions are made about what resource performance will be when the job runs • in Grid settings, available resource performance can fluctuate dynamically • NWS uses statistical methods to attempt to mechanize and automate forecasting based on historical data
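
One way to automate forecasting from historical data is to run several simple forecasters side by side, score each against the measurements that later arrive, and use whichever has been most accurate so far. The sketch below illustrates that idea; the three forecasters and the cumulative-absolute-error score are choices made for this example, not a description of the methods NWS ships.

```python
from statistics import mean, median

# Candidate forecasters: each maps the history so far to a prediction.
FORECASTERS = {
    "last": lambda h: h[-1],
    "mean": lambda h: mean(h),
    "median5": lambda h: median(h[-5:]),
}

def forecast(history):
    """Replay the history, score every forecaster, and predict with the best one."""
    errors = {name: 0.0 for name in FORECASTERS}
    for i in range(1, len(history)):
        past, actual = history[:i], history[i]
        for name, f in FORECASTERS.items():
            errors[name] += abs(f(past) - actual)   # cumulative absolute error
    best = min(errors, key=errors.get)
    return best, FORECASTERS[best](history)

trend = forecast([1, 2, 3, 4, 5, 6])          # steadily rising series
spiky = forecast([10, 10, 50, 10, 10, 10])    # stable series with one outlier
```

On the rising series the "last value" forecaster wins, while on the series with a single spike the sliding median does, so no single method needs to be chosen in advance.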

Prediction - Conclusions • Effective resource allocation and scheduling are critical to performance • Immediate performance history data is used to make implicit prediction • To be truly effective the performance gathering system must be robust, portable and non-intrusive • Overhead introduced by the performance gathering system must be carefully controlled • Using fast, robust techniques it is possible to improve accuracy of performance predictions

Improve resource selection with prediction • Run time predictions • statistical analysis of applications that have already run • automatic code analysis or instrumentation • Explanation of two techniques, both using statistical data with information provided to scheduler upon run

Categorization prediction technique • Derive run time predictions from historical information based on previous similar runs • many ways to look at similar applications; application name, user, arguments, submission time, etc. • use of genetic algorithm to identify good templates (eg user+time) for a given workload • use a mean prediction type • results are an average error of 39%
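
The categorization idea can be sketched as follows: a template names the job attributes that define "similar", past runs are grouped by those attributes, and the prediction for a new job is the mean run time of its group. The record layout and field names are hypothetical, and the genetic-algorithm search for good templates is left out; the template is simply passed in.

```python
from collections import defaultdict
from statistics import mean

def build_predictor(history, template):
    """Group past runs by the template's fields and predict each category's mean run time."""
    groups = defaultdict(list)
    for run in history:
        key = tuple(run[field] for field in template)
        groups[key].append(run["runtime"])

    def predict(job):
        key = tuple(job[field] for field in template)
        return mean(groups[key]) if key in groups else None   # None: no similar run seen

    return predict

# Hypothetical workload history.
history = [
    {"user": "alice", "app": "sim", "runtime": 100},
    {"user": "alice", "app": "sim", "runtime": 140},
    {"user": "bob", "app": "sim", "runtime": 300},
]
predict = build_predictor(history, template=("user", "app"))
estimate = predict({"user": "alice", "app": "sim"})
unknown = predict({"user": "carol", "app": "sim"})
```

A narrower template gives more specific categories with fewer samples each, which is the trade-off a template search has to balance.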
