Optimal Task Assignment Policies for Distributed Server Load Balancing

Task Assignment with Unknown Duration
Mor Harchol-Balter, Carnegie Mellon

Distributed Server
[Figure: a large number of jobs arrive at a load balancer (L.B.), which dispatches them to hosts 1-4.]
The load balancer employs a TAP (Task Assignment Policy): a rule for assigning jobs to hosts.
Age-old question: What’s a good TAP?

The Model
[Figure: a large number of jobs arrive at a load balancer (L.B.) feeding four hosts, each with a FCFS queue.]
• The processing requirement (size) of a job is not known.
• Jobs are not preemptible.
• Jobs queued at a host are processed in FCFS order.
• Hosts are identical.
Motivation for model: distributed servers for supercomputing, where each host is a multi-processor.

Which TAP is best (given model)?
[Figure: load balancer (L.B.) dispatching to hosts 1-4.]
1. Round-Robin
2. Random
3. Shortest-Queue: send job to the host with the fewest jobs.
4. Least-Work-Left: send job to the host with the least total work left. Central-Queue: host grabs next job when free.
Known: optimal for exponentially distributed sizes.
“Best” means minimizing mean waiting time.
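The dispatch rules listed above can be sketched in a few lines. This is only an illustration, assuming each host's queue is represented as a list of the sizes of the jobs it holds; the function names are hypothetical and not from the talk. The sizes are used purely for bookkeeping in the sketch, since in the model the dispatcher does not know them.

```python
import random

def make_round_robin():
    """Return a chooser that cycles through the hosts in order."""
    state = {"next": 0}
    def choose(queues):
        i = state["next"] % len(queues)
        state["next"] += 1
        return i
    return choose

def choose_random(queues, rng=random):
    """Pick a host uniformly at random, ignoring queue state."""
    return rng.randrange(len(queues))

def choose_shortest_queue(queues):
    """Pick the host with the fewest jobs (ties go to the lowest index)."""
    return min(range(len(queues)), key=lambda i: len(queues[i]))

def choose_least_work_left(queues):
    """Pick the host with the least total work left (ties go to the lowest index)."""
    return min(range(len(queues)), key=lambda i: sum(queues[i]))

# Hypothetical snapshot: host 0 holds one big job, host 2 holds three tiny ones.
queues = [[5.0], [1.0, 1.0], [0.5, 0.2, 0.1]]
print(choose_shortest_queue(queues), choose_least_work_left(queues))
```

In this snapshot Shortest-Queue and Least-Work-Left disagree: the host with the fewest jobs is the one holding the single large job.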

But real jobs do NOT have exponentially-distributed sizes! They have heavy-tailed sizes.

Unix process CPU lifetime measurements [Harchol-Balter, Downey TOCS 97]
[Figure: log-log plot of the fraction of jobs with CPU duration > x against duration x (secs); the measured tail is $\Pr\{\text{Size} > x\} = 1/x$.]
• We measured over 1 million UNIX processes.
• Instructional, research, and sys. admin. machines.
• A job of CPU age x has probability 1/2 of using another x.
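The last bullet follows from reading the measured tail as Pr{Size > x} = 1/x. A short check, assuming that fit holds at both x and 2x:

```latex
\Pr\{\text{Size} > 2x \mid \text{Size} > x\}
  = \frac{\Pr\{\text{Size} > 2x\}}{\Pr\{\text{Size} > x\}}
  = \frac{1/(2x)}{1/x}
  = \frac{1}{2}
```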

Bounded Pareto (heavy-tailed) distribution
$\Pr\{\text{Size} > x\} \sim x^{-\alpha}$, $0 < \alpha < 2$
α: degree of variability. α near 0: more variable and more heavy-tailed. α near 2: less variable and less heavy-tailed.
Properties:
• Decreasing failure rate
• Very high variance!
• Heavy-tail property: a minuscule fraction (<1%) of the very largest jobs comprise half the load.
[Figure: the distribution plotted over job size, from min to max.]
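The heavy-tail property can be checked numerically. The sketch below assumes the standard Bounded Pareto form with density proportional to x^(-α-1) on [lo, hi]; the parameter values (lo = 1, hi = 10^10, α = 1.1) are illustrative assumptions, not figures from the talk, and the function names are hypothetical.

```python
import random

def bp_quantile(q, lo, hi, alpha):
    """Inverse CDF of a Bounded Pareto distribution on [lo, hi]."""
    return lo / (1.0 - q * (1.0 - (lo / hi) ** alpha)) ** (1.0 / alpha)

def bp_sample(lo, hi, alpha, rng):
    """Draw one job size by inverting the CDF at a uniform random point."""
    return bp_quantile(rng.random(), lo, hi, alpha)

def load_share_of_largest(frac, lo, hi, alpha):
    """Fraction of total work contributed by the largest `frac` of jobs.

    Uses the closed form of the partial mean, which assumes alpha != 1.
    """
    if alpha == 1.0:
        raise ValueError("this closed form assumes alpha != 1")
    t = bp_quantile(1.0 - frac, lo, hi, alpha)  # size threshold of the top `frac`
    e = 1.0 - alpha
    return (hi ** e - t ** e) / (hi ** e - lo ** e)

# Illustrative parameters (assumed): sizes from 1 to 1e10, alpha = 1.1.
share = load_share_of_largest(0.01, 1.0, 1e10, 1.1)
print(f"largest 1% of jobs carry {share:.2f} of the load")
```

For these assumed parameters the largest 1% of jobs carry roughly 0.62 of the total work, consistent with the "half the load" property above.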

Which TAP is best for heavy-tailed job sizes?
[Figure: load balancer (L.B.) dispatching to hosts 1-4.]
1. Round-Robin
2. Random
3. Shortest-Queue: send job to the host with the fewest jobs.
4. Least-Work-Left: send job to the host with the least total work left. Central-Queue: host grabs next job when free.
Known: optimal for exponentially distributed sizes.
“Best” means minimizing mean waiting time.

The TAGS algorithm: “Task Assignment by Guessing Size”
[Figure: outside arrivals enter at Host 1; Hosts 1, 2, and 3 have cutoffs s1, s2, and s3 and feed Hosts 2, 3, and 4 in turn.]
When a job at host j reaches size sj, it is killed and restarted from scratch at host j+1.
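The kill-and-restart rule can be made concrete with a small simulation. This is a minimal sketch under stated assumptions, not the analysis from the talk: every job enters at the first host, each host serves FCFS, the last host has no cutoff, and a job whose size exactly equals a cutoff is allowed to finish. The function names are hypothetical.

```python
def tags_route(size, cutoffs):
    """Return (index of the host that completes the job, total work consumed)."""
    work = 0.0
    for host, s in enumerate(cutoffs):
        if size <= s:
            return host, work + size
        work += s  # ran for s, was killed, and restarts from scratch
    return len(cutoffs), work + size  # last host has no cutoff

def simulate_tags(jobs, cutoffs):
    """Simulate TAGS on jobs given as (arrival_time, size), sorted by arrival.

    cutoffs holds s_1 .. s_{h-1}. Returns completion times in job order.
    """
    done = [None] * len(jobs)
    pending = [(t, i) for i, (t, _) in enumerate(jobs)]
    for limit in list(cutoffs) + [float("inf")]:
        free_at = 0.0      # time at which this host next becomes idle
        passed_on = []     # jobs killed here, with their arrival time downstream
        for arrival, i in pending:
            start = max(arrival, free_at)
            size = jobs[i][1]
            if size <= limit:
                free_at = start + size
                done[i] = free_at
            else:
                free_at = start + limit
                passed_on.append((free_at, i))
        pending = passed_on  # already in time order, since free_at only grows
    return done

jobs = [(0.0, 1.0), (0.0, 5.0), (1.0, 1.0)]
print(simulate_tags(jobs, cutoffs=[2.0]))
print(tags_route(5.0, [2.0]))
```

In this example the size-5 job costs 7 units of work in total, 2 of them wasted at the first host, but the small job behind it waits only for the 2-unit cutoff rather than for the full 5.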

3 Flavors of TAGS
How to choose the cutoffs s1, s2, s3, …:
• TAGS-opt-meanslowdown
• TAGS-opt-meanwaitingtime
• TAGS-opt-fairness

TAGS is counterintuitive
• TAGS wastes resources … it is non-work-conserving.
• Big jobs seem unfairly penalized … yet somehow TAGS turns out to be fair?
• TAGS always operates under unbalanced load.

Results of Analysis
2 hosts only, system load = .5
[Figure: two plots, one comparing Random, Least-Work-Left, and TAGS-opt-slowdown, the other comparing Random, Least-Work-Left, and TAGS-opt-fairness.]

Results of Analysis
2 hosts only, system load = .5
[Figure: plot comparing Random, Least-Work-Left, and TAGS-opt-waitingtime.]

More Results
4 hosts, system load = .3
[Figure: plot comparing Random, Least-Work-Left, and TAGS.]

More Results
New metric: Server Expansion
Server expansion = number of hosts we would have to add to the system to get mean slowdown down to 2 or 3.
(Initial system: 2 hosts, system load = .7)
[Figure: plot comparing Least-Work-Left and TAGS.]

WHY does TAGS work so well?
1) Reduction of variance of job size distribution
2) Load unbalancing

WHY does TAGS work so well?
Recall the P-K formula for the mean waiting time in an M/G/1 FCFS queue:
$E\{W\} = \dfrac{\lambda E\{X^2\}}{2(1 - \rho)}$
where $E\{X^2\}$ is the second moment of the job size distribution.
1) Reduction of variance of job size distribution: TAGS reduces the variance of the job size distribution at the hosts. No other policy does this!
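To see how strongly the second moment drives waiting time, the P-K formula can be evaluated directly. The sketch below assumes the usual definition ρ = λ·E{X} for the load; the function name and the three example distributions are illustrative choices, not taken from the talk.

```python
def pk_mean_wait(lam, ex, ex2):
    """Mean waiting time E{W} in an M/G/1 FCFS queue (P-K formula).

    lam: arrival rate, ex: mean job size E{X}, ex2: second moment E{X^2}.
    """
    rho = lam * ex  # load, assumed to be lam * E{X}
    if not 0.0 <= rho < 1.0:
        raise ValueError("formula requires load rho in [0, 1)")
    return lam * ex2 / (2.0 * (1.0 - rho))

# Same arrival rate and mean size (load 0.5), different second moments.
print(pk_mean_wait(0.5, 1.0, 1.0))    # deterministic sizes: E{X^2} = 1
print(pk_mean_wait(0.5, 1.0, 2.0))    # exponential sizes:   E{X^2} = 2
print(pk_mean_wait(0.5, 1.0, 100.0))  # a high-variance case (assumed value)
```

At a fixed load, mean waiting time scales linearly with E{X^2}, which is why lowering the variance of the job sizes seen at each host matters so much under heavy-tailed workloads.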

WHY does TAGS work so well?
2) Load unbalancing: All other policies aim to balance the load. TAGS unbalances load.
[Figure: Host 1 and Host 2 compared under TAGS-opt-slowdown and under TAGS-opt-fairness.]
This is fair? YES

Conclusion
This research challenges our common wisdom:
• Load unbalancing may be better than load balancing.
• It may be worthwhile to waste resources by restarting a job from scratch at a new machine … even if the new machine has a much higher load than the original machine!
• A policy which appears to greatly penalize large jobs may actually be fair.
