QoS-aware Resource Management in Distributed System

QoS-aware Resource Management in Distributed System ECE7610

QoS-Aware Resource Management • Physical Environment • Job scheduling • Load balancing • Data locality • Application deployment • Server/Resource allocation • Virtualized environment (Cloud Computing) • Similar issues as in Physical Environment • Interference-aware scheduling • VM deployment • VM migration • Virtual resource allocation

Physical Resource Management • Typical systems in practice • Hadoop Cluster • Resource-aware Scheduling • Data locality-aware Scheduling • Resource Management Framework (YARN) • Grid Computing • QoS-aware resource management • Multi-tier Web System • Dynamic application placement • Dynamic servers allocation • Dynamic resource provisioning

Hadoop resource-aware Scheduling • Fair Scheduler (Facebook) • A Hadoop cluster is shared by multiple users with multiple jobs • Assigns cluster capacity to jobs so that, over time, all jobs get an equal share of it • Also works with job priorities: the priorities are used as weights that determine the fraction of total compute time each job gets • Guarantees minimum shares to resource pools or jobs • Maintains a job queue sorted by fairness: the job farthest below its fair share is scheduled first
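The "farthest below its fair share" rule can be sketched in a few lines. This is a minimal illustration, not the Hadoop Fair Scheduler's code: the `Job` class, its `weight` and `running` fields, and the slot counts are hypothetical, and minimum shares are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    weight: float     # derived from the job priority (hypothetical field)
    running: int = 0  # slots the job currently holds


def fair_shares(jobs, total_slots):
    """Weighted fair share: slots in proportion to each job's weight."""
    total_weight = sum(j.weight for j in jobs)
    return {j.name: total_slots * j.weight / total_weight for j in jobs}


def next_job(jobs, total_slots):
    """Pick the job farthest below its fair share (largest deficit)."""
    shares = fair_shares(jobs, total_slots)
    return max(jobs, key=lambda j: shares[j.name] - j.running)


jobs = [
    Job("a", weight=1, running=4),
    Job("b", weight=2, running=3),
    Job("c", weight=1, running=1),
]
print(next_job(jobs, total_slots=12).name)
```

With 12 slots the shares are 3, 6, and 3, so job `b` (3 slots below its share) is served before job `c` (2 below) and job `a` (1 above).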

Hadoop resource-aware Scheduling • Capacity Scheduler (Yahoo) • Jobs share the capacity of the cluster • Jobs are submitted into queues • Each queue is allocated a fraction of the total resource capacity • Free resources can be allocated to queues beyond their allocated capacity • Within a queue, higher-priority jobs get access to the queue's resources first • There is no preemption once a job is running.
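The queue-capacity idea can be illustrated as follows. This is a sketch under stated assumptions, not the Capacity Scheduler's implementation: the function name `allocate`, the queue names, and the rule for lending spare capacity (queue-name order) are all made up for the example.

```python
def allocate(total, capacity_fraction, demand):
    """Give each queue min(demand, guaranteed share), then lend spare capacity.

    capacity_fraction: {queue: fraction of the cluster guaranteed to it}
    demand:            {queue: slots the queue's jobs currently want}
    """
    guaranteed = {q: total * f for q, f in capacity_fraction.items()}
    alloc = {q: min(demand[q], guaranteed[q]) for q in guaranteed}
    spare = total - sum(alloc.values())
    # Lend unused capacity to queues that want more than their guarantee.
    # Queue-name order is an arbitrary choice made for this sketch.
    for q in sorted(alloc):
        extra = min(spare, demand[q] - alloc[q])
        alloc[q] += extra
        spare -= extra
    return alloc


result = allocate(
    total=100,
    capacity_fraction={"prod": 0.6, "dev": 0.4},
    demand={"prod": 30, "dev": 90},
)
print(result)
```

Here `prod` only needs 30 of its 60 guaranteed slots, so `dev` can run beyond its 40-slot capacity and use 70.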

Hadoop Locality-aware Scheduling • Delay Scheduling (Facebook) • Try to place each task as close as possible to its input data • Local data access is much more efficient than remote data access • Locality levels: node-local, rack-local, and off-rack • The scheduling order is based on fairness; a strictly fair policy may hurt data locality • Delay some jobs to achieve high data locality, compromising fairness slightly

Hadoop Locality-aware Scheduling [Figure: the master holds Job 1 and Job 2 in scheduling order and assigns tasks (Task 2, Task 5, Task 3, Task 1, Task 7, Task 4) to six slaves. Block replicas stored on the slaves: File 1: 2 3 6 2 5 8 3 5 8 1 4 9 1 7 9 4 6 7; File 2: 1 2 1 3 2 3]

Hadoop Locality-aware Scheduling [Figure: same cluster with Job 2 ahead of Job 1 in the scheduling order; tasks of both jobs are spread over the six slaves. Block replicas stored on the slaves: File 1: 2 3 6 2 5 8 3 5 8 1 4 9 1 7 9 4 6 7; File 2: 1 2 1 3 2 3] • Problem: a strictly fair decision hurts locality • Especially bad for jobs with small input files

Hadoop Locality-aware Scheduling [Figure: same cluster; the job at the head of the scheduling order waits while tasks are placed on slaves that hold their input blocks. Block replicas stored on the slaves: File 1: 2 3 6 2 5 8 3 5 8 1 4 9 1 7 9 4 6 7; File 2: 1 2 1 3 2 3] • Idea: wait a short time to get data-local scheduling opportunities
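The waiting idea can be sketched as a skip counter per job. This is a simplified illustration rather than the scheduler's actual code: the dictionary layout, the `assign` function, and the `max_skips` parameter are assumptions, and only node-level locality is modeled (no rack level).

```python
def assign(node, jobs, max_skips):
    """Return (job_name, block, is_local) for a free slot on `node`, or None.

    `jobs` is a list already sorted by fairness. Each job is a dict with
    'name', 'pending' ({block id: set of nodes holding a replica}), 'skips'.
    """
    for job in jobs:
        if not job["pending"]:
            continue
        local = [b for b, nodes in job["pending"].items() if node in nodes]
        if local:
            block = local[0]
            job["skips"] = 0              # got a local slot, reset the wait
            del job["pending"][block]
            return job["name"], block, True
        if job["skips"] >= max_skips:
            # Waited long enough: accept a non-local slot.
            block = next(iter(job["pending"]))
            del job["pending"][block]
            return job["name"], block, False
        job["skips"] += 1                 # skip this job, try the next one
    return None


jobs = [
    {"name": "j1", "pending": {1: {"n2"}}, "skips": 0},
    {"name": "j2", "pending": {7: {"n1"}}, "skips": 0},
]
first = assign("n1", jobs, max_skips=1)   # j1 is skipped, j2 runs locally
second = assign("n3", jobs, max_skips=1)  # j1 has waited, runs non-locally
print(first, second)
```

A larger `max_skips` trades more waiting (less fairness) for a higher chance of a data-local launch.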

Hadoop Resource Manager • Hadoop NextGen MapReduce (YARN) • Splits the resource management and job scheduling/monitoring functions into separate daemons • Has a global Resource Manager (RM), a Node Manager (NM) on each node, and an application-specific Application Master (AM) per application • The RM is the authority that allocates resources among all the applications in the system • Each NM periodically reports its node status

Resource Management in Grid • Grid Computing • Combines a large amount of resources from multiple locations to reach a common goal • Usually considered a distributed system with non-interactive workloads that involve a large number of files • Tends to be loosely coupled, heterogeneous, and geographically dispersed • Resource Management Challenges in Grid • Satisfactory end-to-end performance • Availability of computational resources • Handling of conflicting resource demands • Fault tolerance • Common critical resources • Computing power, disk space, memory, network bandwidth, etc.

Resource Management in Grid • Stages of Resource Management • Resource Discovery • Find the available resources • System Selection • Allocate the resources • Job Execution • Run the job • Log resource usage • Release resources • Target • Guarantee Quality of Service • Rapid and cost-effective access to large amounts of resources • Scheduling resources regardless of network topology

Key Issues in RMS • RMS Organization • Flat/Cells/Hierarchical • Job Resource Demand Estimation • Predictive • Heuristics prediction/Statistical Modeling/Machine Learning • Non-predictive • Heuristics/Probability Distribution • Scheduling Policy • Fixed • System Oriented/ Application Oriented • Extensible • Ad-hoc/ Structured

Grid RMS Examples

Multi-tier Web Systems • Typical Architecture • Web server tier (presentation tier) • Application server tier (logic tier) • Database server tier (data access tier) • Resource Management Challenges • Interactive, time-sensitive jobs • Heterogeneous applications with different demands • Dynamic workload • Resource Management Issues • Dynamic application placement • Dynamic resource allocation • Dynamic server allocation

Dynamic Application Placement • Problem • Given a set of servers with constrained resources and a set of applications with dynamic demands, how many instances of each application should run, and where should they be placed? • Objective • Maximize the total satisfied application demand • Minimize placement overhead • Balance the workload • Highly scalable A Scalable Application Placement Controller for Enterprise Data Centers WWW’ 07

Dynamic Application Placement A Scalable Application Placement Controller for Enterprise Data Centers WWW’ 07

Dynamic Application Placement • Approaches • NP-hard problem, a variant of the Class Constrained Multiple-Knapsack Problem; traditional approaches are not scalable • Compute the maximum total application demand that can be satisfied by the current placement solution • First shift the workload among instances of the same application • Max-flow and min-cost max-flow problem • At most one underutilized instance • Residual memory and CPU co-located • Perform application placement • Outermost loop: rank the applications in increasing load-memory ratio, rank the machines in decreasing CPU-memory ratio • Intermediate loop: test all the applications • Innermost loop: find appropriate applications A Scalable Application Placement Controller for Enterprise Data Centers WWW’ 07

Dynamic Resource Allocation • Problem • How to guarantee web service quality with limited resources under dynamic user demand • How to evaluate and monitor the service quality • Objective • Guarantee client-perceived QoS by dynamically adjusting resource allocation • Consider the response time of whole pages instead of single packets • Approach • Model-independent two-level self-tuning fuzzy controller for resource allocation • A framework to guarantee client-perceived end-to-end QoS eQoS: Provisioning of Client-Perceived End-to-End QoS Guarantees in Web Servers IEEE Trans. Computers 2006

Client-Perceived QoS [Figure: page-view timeline between client and server: connection setup, base page, object 1, object 2, last object, waiting for new requests, connection close; contrasts request-based QoS with client-perceived pageview QoS] [Figure: measurement pipeline with the labels HTTPS traffic, Internet, mirrored HTTPS traffic, TCP packets, HTTPS transactions, packet capture, packet analyzer, performance analyzer] Wei/Xu, sMonitor for Measurement of User-Perceived Latency, USENIX’2006

Dynamic Resource Allocation • Architecture • QoS controller makes resource allocation decisions • Resource manager manages requests • QoS monitor measures the client-perceived page-view response time • QoS Controller • Resource controller with fuzzy rules • Scaling factor controller eQoS: Provisioning of Client-Perceived End-to-End QoS Guarantees in Web Servers IEEE Trans. Computers 2006

Dynamic Server Allocation • Objective • Automatically allocate computing resource (coarse-grained, number of servers) to each application in a data center to maximize performance. • Approach • Machine Learning algorithm Online Resource Allocation Using Decompositional Reinforcement Learning AAAI 2005

QoS-Aware Resource Management • Physical Environment • Job scheduling • Load balancing • Data locality • Server/Resource allocation • Application deployment • Virtualized environment (Cloud Computing) • Similar issues as in Physical Environment • Interference-aware scheduling • Virtual resource allocation • VM deployment • VM migration

Interference-Aware Task Scheduling • Co-hosted VMs share hardware and software • Interference slows down the tasks dramatically

Interference-Aware Task Scheduling TRACON: Interference-Aware Scheduling for Data-Intensive Applications in Virtualized Environments. SC’11 Interference and Locality-Aware Task Scheduling for MapReduce Applications in Virtual Clusters HPDC’ 13 System architecture

Interference Prediction Model TRACON: Interference-Aware Scheduling for Data-Intensive Applications in Virtualized Environments. SC’11 Interference and Locality-Aware Task Scheduling for MapReduce Applications in Virtual Clusters HPDC’ 13 • Quantify the interference impact on system performance • Different Models • Linear Model • Quadratic Model • Exponential Model

Interference-Aware Task Scheduling
• Least Interference Scheduling
  Given an available node
  Predict the slowdown S for all jobs
  Sort jobs
  Accept the job with least interference
• Dynamic Threshold Scheduling
  Given a job and an available node
  Given an initial threshold H
  Predict the slowdown rate S
  If S < H Then accept this job Else reject this job
  // Lr: number of working slots
  // Hd: dynamic threshold
  Set Hd = H
  If (Lr+1)/S > Lr/Hd Then accept the job and update Hd = S
  Else reject this job
TRACON: Interference-Aware Scheduling for Data-Intensive Applications in Virtualized Environments. SC’11
Interference and Locality-Aware Task Scheduling for MapReduce Applications in Virtual Clusters HPDC’ 13
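Both policies on this slide can be written as short Python functions. This is a sketch that follows the slide's pseudocode only; the `predict` callable stands in for the interference prediction model and is hypothetical, as are the job and node names and the slowdown values.

```python
def least_interference(node, jobs, predict):
    """Pick the queued job with the smallest predicted slowdown on `node`."""
    return min(jobs, key=lambda job: predict(job, node))


class DynamicThreshold:
    """Dynamic threshold policy: Hd starts at the initial threshold H."""

    def __init__(self, initial_threshold):
        self.hd = initial_threshold

    def accept(self, slowdown, working_slots):
        """Accept if (Lr + 1) / S > Lr / Hd, then update Hd = S."""
        if (working_slots + 1) / slowdown > working_slots / self.hd:
            self.hd = slowdown
            return True
        return False


# Hypothetical predicted slowdowns for two jobs on node "n1".
table = {("grep", "n1"): 1.8, ("sort", "n1"): 1.2}
chosen = least_interference("n1", ["grep", "sort"], lambda j, n: table[(j, n)])

policy = DynamicThreshold(initial_threshold=2.0)
a = policy.accept(slowdown=1.5, working_slots=2)  # 3/1.5 > 2/2.0, accept
b = policy.accept(slowdown=3.0, working_slots=3)  # 4/3.0 < 3/1.5, reject
print(chosen, a, b, policy.hd)
```

One way to read the dynamic test is as a throughput comparison: adding the job is accepted only if the slots-per-slowdown ratio improves.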

Dynamic Virtual Resource Allocation [Figure: application demand compared with over-provisioning (resource waste), under-provisioning (SLA violation), and the expected dynamic provisioning] • 1. When to allocate resources? • 2. How much resource to allocate?

Dynamic Virtual Resource Allocation • Fine-grained resource management • Dynamically adjust VM capacity • Virtual CPU/Memory/Disk I/O bandwidth • Challenges • Heterogeneous applications with different characteristics consolidated on a single machine • Dynamic workloads • Interference between co-hosted applications/VMs • Interplay with related application components • Scalability and adaptability • Objective • Guarantee SLA and QoS for each application • Maximize resource utilization • Maximize system throughput

Dynamic Virtual Resource Allocation • Multi-Input, Multi-Output (MIMO) Controller • Allocates multiple types of resources to multiple enterprise applications • A set of application controllers to determine the amount of resources • A set of node controllers to detect resource bottlenecks and allocate “actual” resources to multiple types of individual applications Automated control of Multiple Virtualized Resource. EuroSys’ 09

Approaches • Application Controller Design • Model Estimator: auto-regressive-moving-average model • Optimizer: minimizes a cost function with two terms, a performance cost and a control cost Automated control of Multiple Virtualized Resource. EuroSys’ 09
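The slide names the two ingredients but the formulas did not survive extraction. One common generic form is shown below as an illustration only: the symbols (measured performance y, allocation vector u, target y_ref, weight q, model orders n and m) are assumptions and may differ from the cited paper's notation.

```latex
% Model estimator: predict next-interval performance from past
% performance y and past resource allocations u (assumed generic form).
\hat{y}(k+1) = \sum_{i=1}^{n} a_i \, y(k+1-i)
             + \sum_{j=0}^{m-1} \mathbf{b}_j^{\top} \mathbf{u}(k-j)

% Optimizer: choose u(k) to minimize performance cost + control cost.
J(\mathbf{u}(k)) =
    \underbrace{\bigl(\hat{y}(k+1) - y_{\mathrm{ref}}\bigr)^{2}}_{\text{performance cost}}
  + \underbrace{q \,\bigl\lVert \mathbf{u}(k) - \mathbf{u}(k-1) \bigr\rVert^{2}}_{\text{control cost}}
```

The performance cost penalizes deviation from the target, while the control cost penalizes large changes in allocation between intervals, so a larger q yields smoother but slower adjustments.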

Approaches • Node Controller Design • Allocates resources based on the resources requested by the application controllers and the resources available at the node • Scenarios • Adequate CPU and disk resources • Adequate disk but inadequate CPU resources • Adequate CPU but inadequate disk resources • Inadequate CPU and disk resources Automated control of Multiple Virtualized Resource. EuroSys’ 09
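The four scenarios reduce to a per-resource check of requests against node capacity. The sketch below is an assumption-laden illustration: the slide does not say how a shortfall is divided, so proportional scaling is used here purely as a placeholder policy, and the function and application names are hypothetical.

```python
def node_allocate(requests, capacity):
    """Grant requests per resource; scale down proportionally if short.

    requests: {app: {"cpu": x, "disk": y}} from the application controllers
    capacity: {"cpu": C, "disk": D} available at the node
    Proportional scaling is an assumption of this sketch.
    """
    alloc = {app: {} for app in requests}
    for res, cap in capacity.items():
        total = sum(r[res] for r in requests.values())
        scale = 1.0 if total <= cap else cap / total
        for app, r in requests.items():
            alloc[app][res] = r[res] * scale
    return alloc


# Adequate disk but inadequate CPU: 120 CPU units requested, 100 available.
alloc = node_allocate(
    requests={"web": {"cpu": 60, "disk": 20}, "db": {"cpu": 60, "disk": 30}},
    capacity={"cpu": 100, "disk": 100},
)
print(alloc)
```

Because each resource is handled independently, the same function covers all four scenarios listed on the slide.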

Why is modeling hard? • Cloud resources are not uniform • Resource management needs to be model-free, adaptive, and scalable

Reinforcement Learning Method [Figure: agent-system loop. The agent observes the application state (S1, S2, S3, …), applies resource adjustments (Act1, Act2, Act3, …, Actn-1) to the system, and receives feedback rewards (r1, r2, r3, …, rn-1) on the way to the goal. Evaluate decision: (S1, Act1) = r1 + r2 + r3 + … + rn-1] • Learning process through interactions with the environment • Model-free • Optimal control, feedback control • Statistical modeling • Optimizes long-term reward • The current decision may have delayed consequences on both future reward and future state • Avoids local optima: mathematical optimization VCONF: A Reinforcement Learning Approach to Virtual Machines Auto-configuration. ICAC’ 09 A Distributed Self-learning Approach for Elastic Provisioning of Virtualized Cloud Resources. MASCOTS’ 11

Q-Learning [Figure: estimating the future with Q(s, a); labels: application, state, action, exploitation, exploration, good/bad, positive/negative] • Q-value • Estimated accumulated reward • Evaluates the “goodness” of an action at a state • Continuously updated using the temporal difference method • Policy • Exploitation • Select the best action • Exploration • Try a random action VCONF: A Reinforcement Learning Approach to Virtual Machines Auto-configuration. ICAC’ 09 A Distributed Self-learning Approach for Elastic Provisioning of Virtualized Cloud Resources. MASCOTS’ 11
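The Q-value update and the exploitation/exploration policy can be sketched as a tabular Q-learner with an epsilon-greedy policy. This is the textbook form, not code from the cited papers; the class name, the action labels, and the parameter values (`alpha`, `gamma`, `epsilon`) are illustrative assumptions.

```python
import random
from collections import defaultdict


class QLearner:
    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.q = defaultdict(float)  # Q(s, a), unseen pairs default to 0
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def choose(self, state):
        """Epsilon-greedy: explore with probability epsilon, else exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)                    # exploration
        return max(self.actions, key=lambda a: self.q[(state, a)])  # exploitation

    def update(self, state, action, reward, next_state):
        """Temporal-difference update of the estimated accumulated reward."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td_error


agent = QLearner(["inc", "dec", "keep"], alpha=0.5, gamma=0.9, epsilon=0.0)
agent.update("s1", "inc", reward=1.0, next_state="s2")
print(agent.q[("s1", "inc")], agent.choose("s1"))
```

With `epsilon=0.0` the agent always exploits, so after one positive reward for `inc` in state `s1` it picks `inc` again.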

VM Resource Management as an RL task • Centralized Resource Management • Goal (host-wide) • Maximize performance • Minimize resource cost • State • Resource allocations • Action • Resource adjustment • Reward • System performance VCONF: A Reinforcement Learning Approach to Virtual Machines Auto-configuration. ICAC’ 09 A Distributed Self-learning Approach for Elastic Provisioning of Virtualized Cloud Resources. MASCOTS’ 11

VM Resource Management as a RL task Distributed Resource Management VCONF: A Reinforcement Learning Approach to Virtual Machines Auto-configuration . ICAC’ 09 A Distributed Self-learning Approach for Elastic Provisioning of Virtualized Cloud Resources. MASCOTS’ 11

VM Deployment and Migration • Dynamic VM Deployment • Adjust resource allocation according to demand in order to satisfy the SLA • Minimize the number of working nodes • Minimize power consumption • Minimize reconfiguration cost • VM Live Migration • Moves a running VM between physical servers • Supports dynamic deployment • Supports dynamic workload balancing

Data and VM Placement for Hadoop • Job-specific awareness • Map-input heavy: grep • Map-and-Reduce-input heavy: sort • Reduce-input heavy: generator Purlieus: Locality-aware resource Allocation for MapReduce in a Cloud. SC’ 11

Reduce Task Locality Purlieus: Locality-aware resource Allocation for MapReduce in a Cloud. SC’ 11

Data and VM Placement for Hadoop • Load-awareness • Computation load • Storage load • Network load [Figure: expected-load-unaware data placement vs. expected-load-aware data placement] Purlieus: Locality-aware resource Allocation for MapReduce in a Cloud. SC’ 11

Placement Techniques • Minimizing Cost Functions Purlieus: Locality-aware resource Allocation for MapReduce in a Cloud. SC’ 11

Placement Techniques • Map-input heavy jobs • Data placement: load balancing • VM placement: on the physical machine holding the local data, or close to it • Map-and-Reduce-input heavy jobs • Data placement: load balancing/reduce locality • VM placement: on the physical machine holding the local data, or close to it • Reduce-input heavy jobs • Data placement: anywhere • VM placement: close to each other Purlieus: Locality-aware resource Allocation for MapReduce in a Cloud. SC’ 11
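The VM placement rules per job class can be sketched as a small selection function. This is an illustration of the rules listed on the slide, not the Purlieus algorithm: the `distance` metric (0 same machine, 1 same rack, 2 off rack), the `(rack, host)` machine naming, and the job-class strings are hypothetical.

```python
def distance(a, b):
    """Hypothetical topology metric on (rack, host) pairs."""
    if a == b:
        return 0                     # same physical machine
    return 1 if a[0] == b[0] else 2  # same rack, otherwise off rack


def place_vm(job_class, free_machines, data_machine=None, placed=()):
    """Choose a physical machine for the next VM of a job."""
    if job_class in ("map-input-heavy", "map-and-reduce-input-heavy"):
        # Put the VM on, or as close as possible to, the machine with its data.
        return min(free_machines, key=lambda m: distance(m, data_machine))
    if job_class == "reduce-input-heavy":
        # Data can live anywhere; keep the job's VMs close to each other.
        return min(free_machines, key=lambda m: sum(distance(m, p) for p in placed))
    raise ValueError(f"unknown job class: {job_class}")


free = [("r1", "h1"), ("r1", "h2"), ("r2", "h1")]
near_data = place_vm("map-input-heavy", free, data_machine=("r1", "h2"))
near_peers = place_vm("reduce-input-heavy", free, placed=[("r2", "h3")])
print(near_data, near_peers)
```

The map-heavy VM lands on the machine that stores its input, while the reduce-heavy VM lands in the rack where the job's other VM already runs.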

Data and VM Placement for Hadoop [Figure: map phase and reduce phase of a Map-and-Reduce heavy job] Purlieus: Locality-aware resource Allocation for MapReduce in a Cloud. SC’ 11

QoS-Aware Resource Management • Physical Environment • Job scheduling • Load balancing • Data locality • Application deployment • Server/Resource allocation • Virtualized environment (Cloud Computing) • Similar issues as in Physical Environment • Interference-aware scheduling • VM deployment • VM migration • Virtual resource allocation
