Grid Scheduling through Service-Level Agreement

Karl Czajkowski, The Globus Project, http://www.globus.org/

Overview • Introduction to Grid Environments • The Resource Management Problem • Cross-domain applications • Resource owner goals vs. application goals • An Open Architecture to Manage Resources • Service-Level Agreement (SLA) • GRAM and Managed Services • Related and Ongoing Work

Grid Resource Environment • Distributed users and resources • Variable resource status • Variable grouping and connectivity • Decentralized scheduling/policy [Figure: resources (R), some with unknown status (?), and dispersed users joined by a network and grouped into VO-A and VO-B]

Social/Policy Conflicts • Application Goals • Users: deadlines and availability goals • Applications: need coordinated resources • Localized Resource Owner Goals • Policies towards users • Optimization goals • Community Goals Emerge As: • An aggregate user/application? • A virtual resource? Both!

Data-Intensive Example • Concurrent resource requirements • Large-scale storage, computing, network, graphics • Datapath involves autonomous domains [Figure labels: Parallel I/O, Reduction, Sorting, Transport, Rendering, TCP/IP, Receive Buffer]

Early Co-Allocation in Grids • SF-Express (1997-8) • Real-time simulation • 12+ supercomputers, 1400 processors • Required advance reservation • Brokered by telephone! • Globus DUROC software to sync startup • Over 45 minutes to recover from failure • In use today in MPICH-G2 (MPI library)

Traditional Scheduling • Closed-System Model • Presumption of global owner/authority • Sandboxed applications with no interactions • “Toss job over the fence and wait” • Utilization as Primary Metric • Deep batch queues allow tighter packing • No incentives for matching user schedule • Sub-cultures Counter Site Policies • Users learn tricks for “gaming” their site

An Open Negotiation Model • Resources in a Global Context • Advertisement and negotiation • Normalized remote client interface • Resource maintains autonomy • Users or Agents Bridge Resources • Drive task submission and provisioning • Coordinate acts across domains • Community-based Mediation • Coordination for collective interest

Community Scheduling Example • Individual users • Require service • Have application goals • Community schedulers • Broker service • Aggregate scheduling • Individual resources • Provide service • Have policy autonomy • Serve above clients

Negotiation Phases • Discovery • “What resources are relevant to interest?” • Finds service providers • Monitoring • “What’s happening to them now?” • Compare service providers • Service-Level Agreement • “Will they provide what I need?” • The core Resource Management problem • Process can iterate due to adaptation
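The three phases above can be sketched as a small loop. This is only an illustration under stated assumptions: the registry layout, the field names (`kind`, `load`, `free_cpus`) and the functions `discover`, `monitor`, `request_sla` and `acquire` are hypothetical and are not part of GRAM or any Globus API.

```python
def discover(registry, kind):
    """Discovery: which providers are relevant to the request?"""
    return [p for p in registry if p["kind"] == kind]

def monitor(providers):
    """Monitoring: compare providers by current load (lowest first)."""
    return sorted(providers, key=lambda p: p["load"])

def request_sla(provider, cpus):
    """SLA: ask one provider to commit; None means it declined."""
    if provider["free_cpus"] >= cpus:
        return {"provider": provider["name"], "cpus": cpus}
    return None

def acquire(registry, kind, cpus):
    """Iterate over candidates until one provider agrees."""
    for provider in monitor(discover(registry, kind)):
        sla = request_sla(provider, cpus)
        if sla is not None:
            return sla
    return None  # caller may adapt the request and try again

registry = [
    {"name": "a", "kind": "compute", "load": 0.2, "free_cpus": 8},
    {"name": "b", "kind": "compute", "load": 0.5, "free_cpus": 64},
    {"name": "c", "kind": "storage", "load": 0.1, "free_cpus": 0},
]
```

With this registry, a request for 32 CPUs is declined by the lightly loaded provider `a` and granted by `b`, which is the kind of iteration the last bullet refers to.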

Service-Level Agreement • Three kinds of SLA • Task submission (do something) • Resource reservation (pre-agreement) • Lazy task/resource binding (apply reservation) • Simple protocol for negotiating SLAs • Basic 2-party negotiation • Support for basic offer/accept pattern • Optional counter-offer patterns • Variable commitment phase for stricter promises • Client may maintain multiple 2-party SLAs
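A minimal sketch of the two-party offer/accept pattern with an optional counter-offer, assuming terms that consist only of a CPU count and a duration. The `Offer` and `Provider` classes and the `negotiate` function are hypothetical names for illustration; the slides do not define a wire protocol.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Offer:
    cpus: int
    hours: int

class Provider:
    """Hypothetical provider that accepts any offer within its capacity."""
    def __init__(self, max_cpus: int):
        self.max_cpus = max_cpus

    def respond(self, offer: Offer) -> Tuple[str, Offer]:
        if offer.cpus <= self.max_cpus:
            return "accept", offer
        # Counter-offer: same duration, capped CPU count.
        return "counter", Offer(self.max_cpus, offer.hours)

def negotiate(provider: Provider, offer: Offer, min_cpus: int,
              max_rounds: int = 3) -> Optional[Offer]:
    """Client side: resubmit counter-offers that still meet min_cpus."""
    for _ in range(max_rounds):
        verdict, terms = provider.respond(offer)
        if verdict == "accept":
            return terms          # agreed terms
        if terms.cpus < min_cpus:
            return None           # counter-offer is unacceptable
        offer = terms             # take the counter-offer as the new offer
    return None
```

A client that holds several two-party SLAs would simply run `negotiate` once per provider and keep the returned terms.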

Many Types of Service • Must support service heterogeneity • Resources • Hardware: disks, CPU, memory, networks, display… • Logical: accounts, services… • Capabilities: space, throughput… • Tasks • Data: stored file, data read/write • Compute: execution, suspended/swapped job • SLAs bear embedded term languages • Isolate domain-specific details

Domain Extension: File Transfer • Single goal • Reliable deadline transfer • Specialized scheduler • Brokers basic services • Synthesizes new service • Fault-handling logic • Distributed resources • Storage space • Storage bandwidth • Network bandwidth

Technical Challenges • Complex Security Requirements • Global Scalability • Similar ideals to Internet • Interoperable infrastructure • Policy-configurable for social needs • Permanence or “Evolve in Place” • Cannot take World off-line for service • Over time: upgrade, extend, adapt • Accept heterogeneity

GRAM Architecture [Figure labels: Application, Planner, Coordinator, Information Service, Monitor & Discover, Domain-specific SLA, Concrete SLA, Incremental SLAs, SLA implementation, GRAM2 (three instances), Local resource managers, Job, CPU, Disk]

WS-Agreement • New standardization effort • Generalizes GRAM ideas • Service-oriented architecture • Resource becomes Service Provider • Tasks become NegotiatedServices • SLAs presented as Agreement services • Still supports extensible domain terms

WS-Agreement Entities

WS-Agreement Adds Management

Virtualized Providers

Agreement-based Jobs • Agreement represents “queue entry” • Commitment with job parameters etc. • Agreement Provider • i.e. Job scheduler/Queuing system • Management interface to service provider • Service Provider • i.e. scheduled resource (compute nodes) • Service is the Job computation

Advance Reservation for Jobs • Schedule-based commitment of service • Requires schedule-based SLA terms • Optional Pre-Agreement (RSLA) • Agreement to facilitate future Job Agreement • Characterizes virtual resource needed for Job • May not need full job terms • Job Agreement almost as usual • May exploit Pre-Agreement • Reference existing promise of resource schedule • May get schedule commitment in one shot • Directly include schedule terms • (Can think of as atomic advance reserve/claim)
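The two paths above (reference an existing Pre-Agreement, or include schedule terms directly) can be sketched with a toy reservation table. Everything here is an assumption made for illustration: time is an integer, a reservation is a half-open interval `[start, end)` over a node count, and `ReservationTable`, `reserve` and `submit_job` are hypothetical names.

```python
from typing import Dict, Optional, Tuple

class ReservationTable:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.reservations: Dict[int, Tuple[int, int, int]] = {}  # id -> (start, end, nodes)
        self._next_id = 1

    def _peak(self, start: int, end: int) -> int:
        """Highest number of nodes already promised at any time in [start, end)."""
        booked = list(self.reservations.values())
        # Usage only rises at reservation start times, so checking those suffices.
        points = [start] + [s for s, _, _ in booked if start < s < end]
        return max(sum(n for s, e, n in booked if s <= t < e) for t in points)

    def reserve(self, start: int, end: int, nodes: int) -> Optional[int]:
        """Pre-Agreement: promise nodes for a time window, or decline."""
        if self._peak(start, end) + nodes > self.capacity:
            return None
        rid = self._next_id
        self._next_id += 1
        self.reservations[rid] = (start, end, nodes)
        return rid

    def submit_job(self, name, nodes, start, end, reservation=None):
        """Job Agreement: reference a reservation, or reserve and claim in one shot."""
        if reservation is None:
            reservation = self.reserve(start, end, nodes)
            if reservation is None:
                return None
        else:
            s, e, n = self.reservations[reservation]
            if not (s <= start and end <= e and nodes <= n):
                return None  # job does not fit the promised schedule
        return {"job": name, "reservation": reservation, "start": start, "end": end}
```

The one-shot path is just `reserve` followed immediately by the claim, which mirrors the "atomic advance reserve/claim" reading on the slide.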

Need for Complex Description • 128 physical nodes • Physical topology • Interconnect • RAM, disk size • Subject of RSLA • Single MPI job • Subject of TSLA • May reference RSLAs • Quality requirements • Real-time parameters • CPU, disk performance • Subject of BSLA

MDS Resource Models (History)

Future Models • Service behavioral descriptions • Unified service term model • Capture user/application requirements • Capture provider capabilities • Core meta-language • Facilitates planner/decision designs • Extends with domain concepts • Extensible negotiability mark-up • Capture range of negotiability for variable terms • Capture importance of terms (required/optional) • Capture cost of options (fees/penalties)
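One way to picture the negotiability mark-up is a term record that carries a permitted range, a required/optional flag and a cost. This is a sketch under assumptions: the `Term` fields, the linear per-unit fee and the `evaluate` function are hypothetical and do not come from any published term language.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Term:
    name: str
    low: float                  # lower bound of the negotiable range
    high: float                 # upper bound of the negotiable range
    required: bool = True       # importance of the term
    fee_per_unit: float = 0.0   # cost of the option (assumed linear)

def evaluate(terms: List[Term], proposal: Dict[str, float]) -> Tuple[bool, float]:
    """Return (acceptable, total_fee) for a proposal against the marked-up terms."""
    fee = 0.0
    for term in terms:
        if term.name not in proposal:
            if term.required:
                return (False, 0.0)   # a required term is missing
            continue                   # an optional term may be omitted
        value = proposal[term.name]
        if not (term.low <= value <= term.high):
            return (False, 0.0)       # outside the negotiable range
        fee += term.fee_per_unit * value
    return (True, fee)

terms = [
    Term("nodes", 1, 128, required=True, fee_per_unit=2.0),
    Term("scratch_gb", 0, 50, required=False, fee_per_unit=0.5),
]
```

A planner could call `evaluate` on candidate proposals to compare fees before making an offer.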

SLA Types in Depth • Resource SLA (RSLA), i.e. reservation • A promise of resource availability • Client must utilize promise in subsequent SLAs • Task SLA (TSLA), i.e. execution • A promise to perform a task • Complex task requirements • May reference an RSLA (implicit binding) • Binding SLA (BSLA), i.e. claim • Binds a resource capability to a TSLA • May reference an RSLA (otherwise obtain implicitly) • May be created lazily to provision the task
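The relationship between the three SLA types can be sketched as plain records, with a `bind` step that draws on a reservation. The class fields and the `bind` function are hypothetical; the case where the provider obtains a reservation implicitly is not modelled and raises an error instead.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RSLA:
    """Reservation: a promise that a resource will be available."""
    resource: str
    amount: int
    claimed: int = 0

@dataclass
class TSLA:
    """Execution: a promise to perform a task; may reference an RSLA."""
    task: str
    rsla: Optional[RSLA] = None

@dataclass
class BSLA:
    """Claim: binds part of a reserved capability to a task."""
    tsla: TSLA
    rsla: RSLA
    amount: int

def bind(tsla: TSLA, amount: int, rsla: Optional[RSLA] = None) -> BSLA:
    """Create a BSLA lazily, using an explicit RSLA or the one the TSLA references."""
    rsla = rsla or tsla.rsla
    if rsla is None:
        raise ValueError("no RSLA given; implicit acquisition is not modelled here")
    if rsla.claimed + amount > rsla.amount:
        raise ValueError("reservation cannot cover this binding")
    rsla.claimed += amount
    return BSLA(tsla, rsla, amount)
```

For example, a 50 GB reservation can back a 30 GB binding, after which only 20 GB of the promise remains to be claimed.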

Resource Lifecycle • S0: Start with no SLAs • S1: Create SLAs • TSLA or RSLA • S2: Bind task/resource • Explicit BSLA • Implicit provider schedule • S3: Active task • Resource consumption • Backtrack to S0 • On task completion • On expiration • On failure
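The lifecycle above reads naturally as a small state machine. The sketch below assumes event names (`create_sla`, `bind`, `start`, `complete`, `expire`, `fail`) that are not on the slide, and assumes that expiration and failure can send any non-initial state back to S0 while completion applies only to an active task.

```python
from enum import Enum

class State(Enum):
    S0 = "no SLAs"
    S1 = "SLAs created"
    S2 = "task/resource bound"
    S3 = "active task"

# Forward transitions follow the slide; event names are hypothetical.
TRANSITIONS = {
    (State.S0, "create_sla"): State.S1,   # TSLA or RSLA
    (State.S1, "bind"): State.S2,         # explicit BSLA or implicit provider schedule
    (State.S2, "start"): State.S3,        # resource consumption begins
    (State.S3, "complete"): State.S0,     # backtrack on task completion
}
# Assumption: expiration or failure backtracks to S0 from any later state.
for _state in (State.S1, State.S2, State.S3):
    for _event in ("expire", "fail"):
        TRANSITIONS[(_state, _event)] = State.S0

def step(state: State, event: str) -> State:
    """Apply one event, rejecting transitions the lifecycle does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state.name}")
```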

Incremental Negotiation • RSLA: reserve resources for future use • TSLA: submit task to scheduler • BSLA: bind reservation to task • Resources change state due to SLAs and scheduler decisions

Linking SLAs for Complex Case • Dependent SLAs nest intrinsically • BSLA2 defined in terms of RSLA2 and TSLA4 • Chained SLAs simplify negotiation • Optionally link destruction/reclamation time [Figure labels: TSLA1 (account tmpuser1), RSLA1 (50 GB in /scratch filesystem), BSLA1 (30 GB for /scratch/tmpuser1/foo/* files), TSLA2 (Complex job), TSLA3, TSLA4, RSLA2, Net, Stage in, Stage out, BSLA2]
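Linked destruction can be sketched as a walk over a dependency map. Only the BSLA2 entry is stated on the slide; the BSLA1 entry is an assumption read off the figure labels, and the `reclaim` function is a hypothetical helper.

```python
from typing import Dict, List

DEPENDS_ON: Dict[str, List[str]] = {
    "BSLA2": ["RSLA2", "TSLA4"],   # stated on the slide
    "BSLA1": ["RSLA1", "TSLA1"],   # assumed from the figure labels
}

def reclaim(sla: str, depends_on: Dict[str, List[str]]) -> List[str]:
    """Return every SLA that goes away when `sla` is destroyed, in sorted order."""
    doomed = {sla}
    changed = True
    while changed:                 # repeat until no further dependents are found
        changed = False
        for child, parents in depends_on.items():
            if child not in doomed and doomed.intersection(parents):
                doomed.add(child)
                changed = True
    return sorted(doomed)
```

Destroying RSLA2 therefore also reclaims BSLA2, while destroying an SLA with no dependents affects only itself.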

Related Work • Academic Contemporaries • Condor Matchmaking • Economy-based Scheduling • Work-flow Planning • Commercial Scheduler Examples • Many examples for traditional sites • Several generalized for “the enterprise” • Platform Computing • LSF scaled to lots of jobs • MultiCluster for site-to-site resource sharing • IBM eWLM • Goal-based provisioning of transactional flows

Condor Matchmaking • At heart: a scheduling algorithm • Heuristics for pairing job with resource • Match symmetric “Classified Ads” • Great for bulk/commodity matching • Closed system view • Subsumes resource through “lease” • Sandboxed job environment • Favor vertical integration over generality • Tuned high-throughput system
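The symmetric matching idea can be illustrated without Condor itself. The sketch below is not the ClassAd language or any Condor API: ads are plain Python dictionaries, `requirements` and `rank` are ordinary functions, and the greedy pairing in `matchmake` is an assumed heuristic.

```python
def matches(job, machine):
    """Symmetric match: each ad's requirements must accept the other ad."""
    return job["requirements"](machine) and machine["requirements"](job)

def matchmake(jobs, machines):
    """Greedily pair each job with its highest-ranked compatible free machine."""
    free = list(machines)
    pairs = []
    for job in jobs:
        candidates = [m for m in free if matches(job, m)]
        if candidates:
            best = max(candidates, key=job["rank"])
            free = [m for m in free if m is not best]   # machine is now leased
            pairs.append((job["name"], best["name"]))
    return pairs

machines = [
    {"name": "m1", "memory_gb": 8,
     "requirements": lambda job: job["owner"] != "guest"},
    {"name": "m2", "memory_gb": 32,
     "requirements": lambda job: True},
]
jobs = [
    {"name": "j1", "owner": "alice",
     "requirements": lambda m: m["memory_gb"] >= 4,
     "rank": lambda m: m["memory_gb"]},
    {"name": "j2", "owner": "guest",
     "requirements": lambda m: m["memory_gb"] >= 4,
     "rank": lambda m: m["memory_gb"]},
    {"name": "j3", "owner": "bob",
     "requirements": lambda m: m["memory_gb"] >= 4,
     "rank": lambda m: m["memory_gb"]},
]
```

Here `j2` goes unmatched because the only machine left after `j1` is placed refuses guest jobs, which shows both sides of the ad exercising their requirements.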

Future Work • SLA interaction with policy • SLA negotiation subject to policy • One SLA affects another, e.g. RSLA subdivision • One client “more important” than another • SLA implemented by low-level policies • Domain-specific SLA maps to resource SLAs • Resource SLAs map to resource control mechanisms • Resource characterization • Advertisement of resources: options, cost • Interoperable capability languages

Conclusion • Generic SLA management • Compositional for complex scenarios • Extensible for unique requirements • Requires work on Grid service modeling • To describe jobs, resource requirements, etc. • Enhancement to proven architectures • Encompasses GRAM+GARA • Evolution of the Globus Toolkit RM • GRAM evolving since 1997 • WS-Agreement standard in progress
