Integrated Workload Management for Beowulf Clusters

Bill DeSalvo, April 14, 2004, wdesalvo@platform.com

What We’ll Cover • Platform LSF Family of Products • What is Platform LSF HPC • Key Features & Benefits • How it Works • Q&A

What is the Platform LSF Family of Products?

What Problems Are We Solving? • Solve large, grand challenge, complex problems by optimizing the placement of workload in High Performance Computing environments

Platform LSF HPC • Intelligent, policy-driven high performance computing (HPC) workload processing • Parallel & sequential batch workload management for High Performance Computing (HPC) • Includes patent-pending topology-based scheduling • Intelligently schedules parallel batch jobs • Virtualizes resources • Prioritizes service levels based on policies • Based on Platform LSF: • Standards-based, OGSI-compliant, grid-enabled solution • Commercial production quality product

Platform Customers

Platform LSF HPC • Platform LSF HPC AlphaServer SC • Platform LSF HPC for IBM • Platform LSF HPC for Linux • Platform LSF HPC for SGI • Platform LSF HPC for Cray

Extensive Hardware Support

Platform LSF HPC – Linux Support • HP • HP XC Systems running Unlimited Linux • HP Itanium 2 systems running LINUX 2.4.x kernel, glibc 2.2 with RMS on Quadrics QsNet/Elan3 • HP Alpha/AXP systems running LINUX 2.4.x kernel, glibc 2.2.x with RMS on Quadrics QsNet/Elan3 • Linux • IA-64 systems, Kernel 2.4.x, compiled with glibc 2.2.x, tested on RedHat 7.3 • x86 systems: • Kernel 2.2.x, compiled with glibc 2.1.x, tested on Debian 2.2, OpenLinux 2.4, RedHat 6.2 and 7.0, SuSE 6.4 and 7.0, TurboLinux 6.1 • Kernel 2.4.x, compiled with glibc 2.1.x, tested on RedHat 7.x and 8.0, and SuSE 7.0, and RedHat Linux Advanced Server 2.1 • Clustermatic Linux 3.0 Kernel 2.4.x, compiled with glibc 2.2.x, tested on RedHat 8.0 • Scyld Linux, Kernel 2.4.x, compiled with glibc 2.2.x. • SGI • SGI Altix systems running Linux Kernel 2.4.x compiled with glibc 2.2.x and SGI Propack 2.2 and higher

Key Features and Benefits Platform LSF HPC

Key Features • Optimized Application, System and Hardware Performance • Enhanced Accounting, Auditing & Control • Commercial Grade System Scalability & Reliability • Extensive Hardware Support • Comprehensive, Extensible and Standards-based Security

Key Features – Platform LSF HPC • Optimized Application, System and Hardware Performance Enhanced Accounting, Auditing & Control Commercial Grade System Scalability & Reliability Comprehensive, Extensible and Standards-based Security

Adaptive Interconnect Performance Optimization • Scheduling that takes advantage of unique interconnect properties • IBM SP Switch at the POE software level • RMS on AlphaServer SC (Quadrics) • SGI topology hardware graph • Out-of-the-box functionality without any customization required

Generic Parallel Job Launcher • Generic support for all types of parallel job launchers • LAM/MPI, MPICH-GM, MPICH-P4, POE, SCALI, CHAMPION PRO, etc. • Customizable for any vendor or publicly available parallel solution • Control: ensures that no job can escape the workload management system

Integrated out-of-the-box Parallel Launcher Support • Full integration with IRIX MPI and array session daemon • Full integration with SGI MPI for Linux • Full integration with Sun HPC ClusterTools, providing full MPI control, accounting and integration with Sun's Prism debugger • Vendor MPI libraries provide better performance than open source libraries • Full support for vendor MPI libraries • Vendor integration supported by Platform • Seamless control and accounting

HPC Workload Scheduling • Dynamic load balancing supporting heterogeneous workloads • IBM SP switch aware scheduling • Scheduling of parallel jobs • Number of CPUs, min/max, node span • Backfill on processor & memory • Processor & memory reservation • Topology aware scheduling • Exclusive scheduling • Advance Reservation • Fairshare, Preemption • Accounting

High Performing, Open, Scalable Architecture • Scalable scheduler architecture • Modular, with support for over 500,000 active jobs per cluster • More than 2,000 multi-processor hosts per cluster • Process 5x more work & achieve 100% utilization • Scale with business growth • External executable support • Collect information from multiple external resources to track site-specific local and global resources • Extends out-of-the-box capabilities to manage additional resources and customer application execution • Differentiation • Multiple external resource collectors rather than a single one • Job Groups • Organize jobs into higher-level work units in a hierarchical tree • Easy to manage and control work, increasing user productivity by reducing complexity • OGSI compliance • Future-proof and protect grid investment using standards-based solutions that interoperate with third-party systems

Intelligent Scheduling Policies • Fairshare (User & Project-based) • Ensures job resources are used for the right work • Guarantees that resource allocations among users and projects are met • Coordinates access to the right number of resources for different users and projects according to pre-defined shares • Differentiation • Hierarchical & guaranteed • Policy-based Preemption • Maximizes throughput of high priority critical work based on priority and load conditions • Prevents starvation of lower priority work • Differentiation • Platform LSF supports multiple preemption policies • Goal-oriented SLA-driven policies • Based on customer SLA-driven goals: Deadline, Velocity, Throughput • Guarantees projects are completed on time • Reduces project and administration costs • Provides visibility into the progress of projects • Allows the admin to focus on “what work and when” needs to be done, not “how” the resources are to be allocated • [Diagram: Intelligent Scheduler with its scheduling modules: Fairshare, Preemption, Resource Reservation, Advance Reservation, License Scheduling, SLA (Service Level Agreement) Scheduling, MultiCluster, other scheduling modules, plug-in schedulers]

Advanced Self-Management • Flexible, Comprehensive Resource Definitions • Resources defined on a node basis across an entire cluster or subset of the nodes in a cluster • Auto-detectable or user defined resources • Adaptive membership – nodes join and leave Platform LSF clusters dynamically and automatically without administration effort • Dynamic or static resources • Job Level Exception Management • Exception-based error detection to take automatic, configurable, corrective actions • Increased job reliability & predictability • Improved visibility on job and system errors & reduced administration overhead and costs • Automatic Job Migration and Requeue • Automatically migrate and requeue jobs based on policies in the event of host or network failures • Reduce user and administrator overhead in managing failures & reduce risk of running critical workloads • Master Scheduler Failover • Automatically fail over to another host if the master host is unavailable • Continuous scheduling service and execution of jobs & eliminate manual intervention

Backfill • Policy configured at the queue level and applies to all jobs in a queue • Smaller sequential jobs are ‘backfilled’ behind larger parallel jobs • Improves hardware utilization • Users provided with an accurate time when their job will start
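The backfill rule above can be illustrated with a very small model. This is a sketch under stated assumptions, not Platform LSF's scheduler: the function name `can_backfill`, its parameters, and the use of the job's run limit as an upper bound on its run time are all hypothetical.

```python
def can_backfill(job_slots, job_run_limit_min, free_slots,
                 minutes_until_reserved_start):
    """Return True if a small job may use idle slots without delaying
    the larger job that has reserved them (illustrative model only)."""
    # The job must fit into the slots that are idle right now.
    fits_in_space = job_slots <= free_slots
    # The job must be guaranteed to finish before the reserved start time,
    # which is why a run limit is needed to reason about backfill.
    finishes_in_time = job_run_limit_min <= minutes_until_reserved_start
    return fits_in_space and finishes_in_time


# A 1-slot, 30-minute job fits into a 60-minute gap with 4 idle slots.
print(can_backfill(1, 30, 4, 60))
```

A job with no run limit could not be backfilled under this model, since nothing bounds when it would release the slots.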

Key New Features & Benefits Platform LSF V6.0

Feature Overview • OGSI Compliance • Goal-Oriented SLA-Driven Scheduling • License-Aware Scheduling • Job-Level Exception Management (Self Management Enhancement) • Job Group Support • Other Scheduling Enhancements • Queue-Based Fairshare • User Fairshare by Queue Priority • Job Starvation Prevention plug-in

Feature Overview (Cont.) • HPC Enhancements • Dynamic ptile Enforcement • Resource Requirement Specification for Advance Reservation • Thread Limit Enforcement • General Parallel Support • Parallel Job Size Scheduling • Job Limit Enhancements • Non-normalized Job Run Limit • Resource Allocation Limit Display • Administration and Diagnostics • Scheduler Dynamic Debug • Administrator Action Messages

Goal-Oriented SLA-Driven Scheduling • What is it? • A new scheduling policy. • Unlike current scheduling policies based on configured shares or limits, SLA-driven scheduling is based on customer-provided goals: • Deadline-based goal: Specify the deadline for a group of jobs. • Velocity-based goal: Specify the number of jobs running at any one time. • Throughput-based goal: Specify the number of finished jobs per hour. • This scheduling policy works on top of queues and host partitions. • Benefits • Guarantees projects are completed on time according to explicit SLA definitions. • Provides visibility into the progress of projects to see how well projects are tracking to SLAs. • Allows the admin to focus on “what work and when” needs to be done, not “how” the resources are to be allocated. • Guarantees service level deliveries to the user community and reduces project risk and administration cost.

Use case • Problem: we need to finish all simulation jobs before 15:00. • Solution: configure a deadline service class in the lsb.serviceclasses file:
    Begin ServiceClass
    NAME = simulation
    PRIORITY = 100
    GOALS = [deadline timeWindow (13:00-15:00)]
    DESCRIPTION = A simple deadline demo
    End ServiceClass
• Submitting and monitoring jobs:
    $ bsub -sla simulation -W 10 -J A[1-50] mySimulation
    $ date; bsla
    Wed Aug 20 14:00:16 EDT 2003
    SERVICE_CLASS_NAME: simulation
    GOAL: DEADLINE ACTIVE_WINDOW: (13:00-15:00)
    STATUS: Active:Ontime
    DEAD_LINE: (Wed Aug 20 15:00)
    ESTIMATED_FINISH_TIME: (Wed Aug 20 14:30)
    Optimum Number of Running Jobs: 5
    NJOBS PEND RUN SSUSP USUSP FINISH
    50 25 5 20
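The "Optimum Number of Running Jobs" figure in the bsla output can be sanity-checked with simple arithmetic. The sketch below is one plausible way to arrive at such a number and is not claimed to be how Platform LSF computes it: the function `running_jobs_needed` is hypothetical, and it assumes every unfinished job may use its full run limit (the `-W 10` value).

```python
import math


def running_jobs_needed(unfinished_jobs, run_limit_min, minutes_to_deadline):
    """Estimate how many jobs must run concurrently to meet a deadline,
    assuming each unfinished job takes its full run limit (assumption)."""
    if minutes_to_deadline <= 0:
        raise ValueError("the deadline has already passed")
    total_job_minutes = unfinished_jobs * run_limit_min
    return math.ceil(total_job_minutes / minutes_to_deadline)


# Values from the use case: 25 pending + 5 running jobs, a 10-minute run
# limit, and 60 minutes between 14:00 and the 15:00 deadline.
print(running_jobs_needed(25 + 5, 10, 60))  # → 5
```

Under these assumptions the estimate agrees with the value of 5 shown in the output.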

Job-Level Exception Management (Self Management Enhancement) • What is it? • Platform LSF can monitor exceptional job and host behavior and take action accordingly. • Benefits • Increased reliability of job execution • Improved visibility on job and system errors • Reduced administration overhead and costs • How it works • Platform LSF V6 handles the following exceptions: • “Job eating” machine (or “black-hole” machine): for some reason, jobs keep exiting abnormally on a machine (e.g. no processes, mount daemon dies, etc.) • Job underrun (job run time less than the configured minimum time) • Job overrun (job run time more than the configured maximum time) • Job run idle (job runs without its CPU usage increasing).
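The three job-level exceptions listed above amount to simple threshold checks. The following is a minimal sketch of that logic, not Platform LSF code: the function `detect_job_exceptions`, its parameters, and the choice to check underrun only for finished jobs and idleness only for running jobs are assumptions made for illustration.

```python
def detect_job_exceptions(run_time_min, finished, cpu_time_increase_s,
                          underrun_min=None, overrun_min=None):
    """Return the exception names that apply to one job (hypothetical model)."""
    exceptions = []
    # Underrun: the job ended sooner than the configured minimum run time.
    if finished and underrun_min is not None and run_time_min < underrun_min:
        exceptions.append("underrun")
    # Overrun: the job has run longer than the configured maximum run time.
    if overrun_min is not None and run_time_min > overrun_min:
        exceptions.append("overrun")
    # Idle: the job is still running but its CPU usage has not increased
    # since the previous check.
    if not finished and cpu_time_increase_s <= 0:
        exceptions.append("idle")
    return exceptions


# A job that has been running for 200 minutes against a 180-minute maximum.
print(detect_job_exceptions(200, False, 5.0, overrun_min=180))
```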

Job-Level Exception Management (Self Management Enhancement) (Cont.) • Use Case 1: • Requirement: If more than 30 jobs have exited on a host in the past 5 minutes, I want LSF to close that machine, then notify me and tell me the machine name. • Solution: • Configure host exceptions (EXIT_RATE in lsb.hosts); 30 jobs in 5 minutes is an average of 6 jobs per minute:
    Begin Host
    HOST_NAME   MXJ   EXIT_RATE   # Keywords
    Default     !     6
    End Host
• Set JOB_EXIT_RATE_DURATION = 5 in lsb.params (the default value is 10 minutes).
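The relationship between the requirement (30 jobs in 5 minutes) and the configured value of 6 can be sketched as a windowed rate check. This is an illustrative model only: the function `host_exceeds_exit_rate` is hypothetical, and treating the threshold as an average number of exits per minute over the window, exceeded only when strictly greater, is an assumption rather than documented LSF behavior.

```python
def host_exceeds_exit_rate(exit_times_min, now_min, window_min,
                           threshold_per_min):
    """Return True if the average job exit rate on a host over the last
    `window_min` minutes is above the threshold (hypothetical model)."""
    # Keep only the exits that fall inside the sliding window.
    recent = [t for t in exit_times_min if now_min - window_min < t <= now_min]
    return len(recent) / window_min > threshold_per_min


# 31 exits spread over the last 5 minutes: 6.2 per minute, above 6.
exits = [100 - i * (5 / 31) for i in range(31)]
print(host_exceeds_exit_rate(exits, now_min=100, window_min=5,
                             threshold_per_min=6))
```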

Job-Level Exception Management (Self Management Enhancement) (Cont.) • Use Case 2: • Requirement: If any job runs for more than 3 hours, I want LSF to notify me and tell me the job ID. • Solution: • Configure job exceptions (lsb.queues):
    Begin Queue
    …
    JOB_OVERRUN = 180   # run time in minutes (3*60)
    End Queue

Job Starvation Prevention Plug-in • What is it? • External scheduler plug-in that allows users to define their own equation for job priority • Benefits • Low priority work is guaranteed to run after waiting for a specified time, so a job does not wait forever (starvation). • How it works • By default, the scheduler provides the following calculation:
    Job priority = A * q_priority * MIN(1, int(wait_time / T0)) * (B * requested_processors + MAX(C * wait_time * (1 + 1 / run_time), D) + E * requested_memory)
  where A, B, C, D and E are coefficients, T0 is the grace period, and run_time defaults to infinity. • The admin can define different coefficients for each queue with the following format: MANDATORY_EXTSCHED=JOBWEIGHT[A=val1; B=val2; …]

Job Starvation Prevention Plug-in • Use Case: • Requirement: Jobs in the lowest priority queue should wait no more than 10 hours. • Solution: Suppose the highest priority queue has PRIORITY = 100 and the lowest priority queue has PRIORITY = 20. Configure the following in the lowest priority queue: • MANDATORY_EXTSCHED=JOBWEIGHT[A=1;B=0;C=10;D=1;E=0;T0=0.1] • After waiting 10 hours, a job in the lowest priority queue will have higher priority than jobs in the highest priority queue. • Note: The formula for calculating job weight is open source and customers can customize it.
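The default job priority calculation can be written out directly as a function. This is a transcription of the formula as given in the slides, not the plug-in's source: the function name `job_priority` is hypothetical, and the slides do not state the unit of wait_time, so the value 10 below is only assumed to mean 10 hours.

```python
import math


def job_priority(q_priority, wait_time, requested_processors,
                 requested_memory, *, A, B, C, D, E, T0,
                 run_time=math.inf):
    """Evaluate the default job weight formula from the slides."""
    # Grace period: the weight is zero until wait_time reaches T0.
    grace = min(1, int(wait_time / T0))
    # Aging term: grows with wait time; 1 / run_time is 0 for the
    # default infinite run time.
    aging = max(C * wait_time * (1 + 1 / run_time), D)
    return A * q_priority * grace * (
        B * requested_processors + aging + E * requested_memory
    )


# Coefficients from the use case, applied to the lowest queue (PRIORITY = 20)
# after a wait_time of 10 (unit assumed, not stated in the slides).
coeffs = dict(A=1, B=0, C=10, D=1, E=0, T0=0.1)
print(job_priority(20, 10, 1, 0, **coeffs))
```

With B and E set to 0, the result depends only on the queue priority and the wait time, which is what lets a long-waiting job in a low priority queue overtake others.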

Resource Requirement Specification For Advance Reservation • What is it? • Enables users to select the hosts for an advance reservation based on a resource requirement. • Benefit • More flexibility when reserving host slots for mission-critical jobs. • How it works • The brsvadd command supports a select string: brsvadd -R "select[type==LINUX]" -n 4 -u xwei -b 10:00 -e 12:00

Key Features – Platform LSF HPC Optimized Application, System and Hardware Performance Commercial Grade System Scalability & Reliability Comprehensive, Extensible and Standards-based Security • Enhanced Accounting, Auditing & Control

Job Termination Reasons • Accounting log with detailed audit & error information for every job in the system • Indicates why a job was terminated • Distinguishes between an abnormal termination and a termination caused by Platform LSF HPC

Key Features – Platform LSF HPC Optimized Application, System and Hardware Performance Enhanced Accounting, Auditing & Control Comprehensive, Extensible and Standards-based Security • Commercial Grade System Scalability & Reliability

Enterprise Proven • Running on several of the top 10 supercomputers in the world on the “TOP500” (#2,4,5,6) • More than 250,000 licenses in use spanning 1,500 customer sites • Scales to over 100 clusters, 200,000 CPUs and 500,000 active jobs per cluster • 11+ years experience in distributed & grid computing • Risk free investment – proven solution • Commercial production quality

Key Features – Platform LSF HPC Optimized Application, System and Hardware Performance Enhanced Accounting, Auditing & Control Commercial Grade System Scalability & Reliability • Comprehensive, Extensible and Standards-based Security

Comprehensive, Extensible, Standards-based Security • Scalable scheduler architecture • Multiple scheduler plug-in API support • External executable support • Web GUI • Open source components • Risk free investment – proven solution • Commercial grade • Scalability and flexibility as a business grows

How It Works Platform LSF HPC

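Fault Tolerance via Master Election • [Diagram: Host 1 through Host N each run a LIM and an sbd. The LIMs exchange load info, each LIM checks “Am I master?”, and the master LIM sends a master announcement to the slave LIMs. mbd and mbsched are shown alongside the master LIM.]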

Virtual Server Technology • LIM: Collects & centralizes status of all resources in cluster • RES: Transparent remote task execution • Load information: free memory, disk I/O rate, host status, free swap space, number of CPUs, idle time, custom status • [Diagram: workload management system, monitor, admin tools and cluster APIs connect to the master LIM; four slave LIMs, each with a RES, and an ELIM]

Executing Work • MBD chooses the best available resource to process the job • [Diagram: clients submit jobs (Gaussian Distribution Job, BLAST Sequence Job, Computational Chemistry Job, Protein Modeling Job) to MBD, shown with the master LIM; four hosts each run an SBD and a slave LIM; an ELIM is also shown]

Grid-enabled, Scalable Architecture Open, modular plug-in schedulers scale with the growth of your business

Scheduler Framework • The framework hides the complexity of interacting with core services. • The Resource Broker is responsible for collecting resource information from other core services. • Minimize the inter-dependencies between scheduling policies • Maximize extensibility through the plug-in scheduler module stack • [Diagram: scheduler modules stacked on the scheduler framework and the Resource Broker]

The Four Scheduling Phases • Input: pre-selected jobs • 1. Pre-Processing: localized setup • 2. Matching / Limits: match eligible resources to nodes • 3. Order / Allocation: prioritize jobs and allocate resources • 4. Post-Processing: allocation adjustments • Output: scheduling decisions / job control decisions

Multiple Scheduling Modules • Vendor-specific matching policies (without changing the existing scheduler) • Support for external scheduler • [Diagram: the internal module and add-on modules 1 through N each implement the four phases: Pre-Processing, Matching / Limits, Order / Allocation, Post-Processing]
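The idea of several modules each contributing to the same four phases can be sketched as a small plug-in stack. This is an illustration of the architecture described in the slides, not the Platform LSF plug-in API: every class, method and field name below is hypothetical, and the one-job-per-host allocation is a deliberate simplification.

```python
class SchedulerModule:
    """Hypothetical plug-in interface with one hook per scheduling phase."""

    def pre_process(self, jobs, hosts):      # phase 1: localized setup
        pass

    def match(self, job, hosts):             # phase 2: matching / limits
        return hosts

    def order(self, jobs):                   # phase 3: order / allocation
        return jobs

    def post_process(self, decisions):       # phase 4: adjustments
        return decisions


class MinCpuModule(SchedulerModule):
    """Add-on matching policy: keep only hosts with enough CPUs."""

    def match(self, job, hosts):
        return [h for h in hosts if h["cpus"] >= job["cpus"]]


class PriorityOrderModule(SchedulerModule):
    """Add-on ordering policy: highest priority first."""

    def order(self, jobs):
        return sorted(jobs, key=lambda j: j["priority"], reverse=True)


def schedule(jobs, hosts, modules):
    """Run every module's hook for each phase, in stack order."""
    for m in modules:
        m.pre_process(jobs, hosts)
    candidates = {}
    for job in jobs:
        eligible = hosts
        for m in modules:
            eligible = m.match(job, eligible)
        candidates[job["name"]] = eligible
    ordered = jobs
    for m in modules:
        ordered = m.order(ordered)
    decisions, used = [], set()
    for job in ordered:
        # Simplification: each host takes at most one job.
        for host in candidates[job["name"]]:
            if host["name"] not in used:
                used.add(host["name"])
                decisions.append((job["name"], host["name"]))
                break
    for m in modules:
        decisions = m.post_process(decisions)
    return decisions


jobs = [{"name": "small", "cpus": 1, "priority": 10},
        {"name": "big", "cpus": 4, "priority": 50}]
hosts = [{"name": "h1", "cpus": 4}, {"name": "h2", "cpus": 1}]
print(schedule(jobs, hosts, [MinCpuModule(), PriorityOrderModule()]))
```

Because each policy overrides only the hook it cares about, a new matching policy can be added to the stack without touching the existing modules, which is the point the slide makes.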

Maui Integration • [Diagram of the MAUI plug-in and the MAUI scheduler. Labels: Event Handle (wait until GO event); RMGetInfo; Job, Host, Res Info; Pre-processing; Order jobs; Decisions and ack; SCH_FM; QueueScheduleSJobs; QueueScheduleRJobs; QueueScheduleIJobs; QueueBackFill; Sync MBD; Post-Processing; UIProcessClients]

Linux-specific Solutions

Controlling an MPI job • On a distributed system (Linux cluster) there are many problems to address: • Job launch across multiple nodes • Gather resource usage while job executes • Propagate signals • Job “clean-up” to eliminate “dangling” MPI processes • Comprehensive job accounting
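Two of the problems listed above, signal propagation and cleanup of dangling processes, can be illustrated on a single POSIX host with process groups. This is a local sketch only and not how Platform LSF controls MPI jobs: the function names are hypothetical, and on a real cluster an agent on every node would have to do the equivalent for the tasks running there.

```python
import os
import signal
import subprocess
import sys


def launch_in_own_group(argv):
    """Start a task in its own session so that it and every process it
    spawns share one process group that can be signalled as a unit."""
    return subprocess.Popen(argv, start_new_session=True)


def signal_job(proc, sig=signal.SIGTERM):
    """Propagate a signal to the whole process group of the task."""
    try:
        os.killpg(os.getpgid(proc.pid), sig)
    except ProcessLookupError:
        pass  # the group has already exited


def clean_up(proc, grace_s=5.0):
    """Ask the job to stop, then force-kill anything left dangling."""
    signal_job(proc, signal.SIGTERM)
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        signal_job(proc, signal.SIGKILL)
        proc.wait()
    return proc.returncode


if __name__ == "__main__":
    job = launch_in_own_group(
        [sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit status:", clean_up(job))
```

Signalling the group rather than the single launcher process is what prevents child processes from outliving the job.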
