Task Assignment with Unknown Duration
Mor Harchol-Balter, Carnegie Mellon
Distributed Server
[Figure: a large number of jobs arrive at a load balancer (L.B.), which dispatches them to hosts 1 to 4.]
The load balancer employs a TAP (Task Assignment Policy): a rule for assigning jobs to hosts.
Age-old question: what's a good TAP?
The Model
[Figure: a large number of jobs arrive at the load balancer (L.B.), which feeds four hosts, each serving its own queue in FCFS order.]
• The processing requirement (size) of a job is not known.
• Jobs are not preemptible.
• Jobs queued at a host are processed in FCFS order.
• Hosts are identical.
Motivation for the model: distributed servers for supercomputing, where each host is a multiprocessor.
Which TAP is best (given the model)?
[Figure: load balancer (L.B.) dispatching to hosts 1 to 4.]
1. Round-Robin
2. Random
3. Shortest-Queue: send the job to the host with the fewest jobs.
4. Least-Work-Left: send the job to the host with the least total work left. Central-Queue: a host grabs the next job when it is free.
Known: optimal for exponentially-distributed sizes.
"Best" means minimizing mean waiting time.
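The dispatch rules listed above can be sketched in a few lines. This is only an illustration, not code from the talk: the function names, the 0-based host indices, and the `queue_lengths` / `work_left` arguments are assumptions made for the example. Note that `least_work_left` reads the remaining work at each host, which a real dispatcher would not know under this model; the slide pairs it with Central-Queue, where a free host simply pulls the next job.

```python
import random
from itertools import count

# Each policy returns the (0-based) index of the host that receives the next job.
# queue_lengths[i] = number of jobs at host i; work_left[i] = unfinished work at host i.
# All names here are hypothetical, chosen for this sketch.

def make_round_robin(num_hosts):
    turn = count()
    def choose(queue_lengths, work_left):
        # The i-th arriving job goes to host i mod num_hosts.
        return next(turn) % num_hosts
    return choose

def make_random(num_hosts, seed=0):
    rng = random.Random(seed)
    def choose(queue_lengths, work_left):
        # Each host is equally likely.
        return rng.randrange(num_hosts)
    return choose

def shortest_queue(queue_lengths, work_left):
    # Host with the fewest jobs (ties go to the lowest index).
    return min(range(len(queue_lengths)), key=queue_lengths.__getitem__)

def least_work_left(queue_lengths, work_left):
    # Host with the least total remaining work (ties go to the lowest index).
    return min(range(len(work_left)), key=work_left.__getitem__)
```

Shortest-Queue and Least-Work-Left can disagree: a host holding one very large job has the shortest queue but possibly the most work left.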
But real jobs do NOT have exponentially-distributed sizes! They have heavy-tailed sizes.
Unix process CPU lifetime measurements [Harchol-Balter, Downey TOCS 97]
[Figure: log-log plot of the fraction of jobs with CPU duration > x against duration x (secs).]
$\Pr\{\text{Size} > x\} = 1/x$
• We measured over 1 million UNIX processes.
• Instructional, research, and sys. admin. machines.
• A job of CPU age x has probability 1/2 of using another x.
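The "probability 1/2 of using another x" statement follows directly from the measured tail. As a short worked step, assuming the $1/x$ tail holds over the range of interest:

$$
\Pr\{\text{Size} > 2x \mid \text{Size} > x\}
= \frac{\Pr\{\text{Size} > 2x\}}{\Pr\{\text{Size} > x\}}
= \frac{1/(2x)}{1/x}
= \frac{1}{2}.
$$

So the longer a job has already run, the longer it is expected to keep running.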
Bounded Pareto (heavy-tailed) distribution
$\Pr\{\text{Size} > x\} \sim x^{-\alpha}$, $0 < \alpha < 2$
$\alpha$: degree of variability. As $\alpha$ approaches 0, sizes are more variable and more heavy-tailed; as $\alpha$ approaches 2, sizes are less variable and less heavy-tailed.
Properties:
• Decreasing failure rate
• Very high variance!
• Heavy-tail property: a minuscule fraction (<1%) of the very largest jobs comprises half the load.
[Figure: the distribution plotted against job size, from min to max.]
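To make the heavy-tail property concrete, here is a minimal sketch of a Bounded Pareto on the interval [k, p] with exponent alpha, using the standard inverse-CDF form. The parameter values in the usage lines (k = 1, p = 10^6, alpha = 1.1) are assumptions picked for illustration, not values from the talk, and the function names are hypothetical.

```python
import random

def bp_quantile(u, k, p, alpha):
    """Inverse CDF of a Bounded Pareto on [k, p] with exponent alpha, at probability u."""
    return k / (1.0 - u * (1.0 - (k / p) ** alpha)) ** (1.0 / alpha)

def bp_sample(rng, k, p, alpha):
    """Draw one job size by inverse-transform sampling."""
    return bp_quantile(rng.random(), k, p, alpha)

def load_share_of_largest(fraction, k, p, alpha):
    """Share of total work carried by the largest `fraction` of jobs (requires alpha != 1).

    Uses the partial first moment: the integral of x * f(x) over [t, p] is
    proportional to p**(1 - alpha) - t**(1 - alpha).
    """
    t = bp_quantile(1.0 - fraction, k, p, alpha)  # size threshold of the top jobs
    e = 1.0 - alpha
    return (p ** e - t ** e) / (p ** e - k ** e)

# Illustrative (assumed) parameters.
rng = random.Random(42)
sizes = [bp_sample(rng, k=1.0, p=1e6, alpha=1.1) for _ in range(1000)]
share = load_share_of_largest(0.01, k=1.0, p=1e6, alpha=1.1)
```

With these assumed parameters, `share` comes out at roughly half of the total work, in line with the heavy-tail property stated on the slide.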
Which TAP is best for heavy-tailed job sizes?
[Figure: load balancer (L.B.) dispatching to hosts 1 to 4.]
1. Round-Robin
2. Random
3. Shortest-Queue: send the job to the host with the fewest jobs.
4. Least-Work-Left: send the job to the host with the least total work left. Central-Queue: a host grabs the next job when it is free.
Known: optimal for exponentially-distributed sizes.
"Best" means minimizing mean waiting time.
The TAGS algorithm: "Task Assignment by Guessing Size"
[Figure: outside arrivals enter Host 1; Hosts 1, 2, and 3 have cutoffs s1, s2, and s3, followed by Host 4.]
When a job at host j reaches size sj, the job is killed and restarted from scratch at host j+1.
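The kill-and-restart rule can be traced for a single job. The sketch below is an illustration only: `tags_trace` and its return values are hypothetical names, the cutoff values in the usage lines are made up, and it ignores queueing entirely. It assumes a job whose size exactly equals a cutoff completes at that host.

```python
def tags_trace(size, cutoffs):
    """Follow one job through TAGS.

    cutoffs = [s1, s2, ...] for hosts 1, 2, ...; the last host has no cutoff.
    Returns (host where the job finishes, total CPU time consumed, wasted CPU time).
    """
    wasted = 0.0
    for host, cutoff in enumerate(cutoffs, start=1):
        if size <= cutoff:
            # The job completes here before hitting this host's cutoff.
            return host, wasted + size, wasted
        # Killed at the cutoff; everything done so far is thrown away.
        wasted += cutoff
    # The final host runs the job to completion.
    return len(cutoffs) + 1, wasted + size, wasted

# Illustrative (assumed) cutoffs for a 4-host system.
print(tags_trace(0.5, [1.0, 10.0, 100.0]))    # small job: finishes at host 1, nothing wasted
print(tags_trace(500.0, [1.0, 10.0, 100.0]))  # large job: restarted three times
```

The wasted work for a large job is the sum of the cutoffs it passed through, which is small relative to its own size when the cutoffs grow quickly.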
3 Flavors of TAGS
How to choose the cutoffs s1, s2, s3, …:
• TAGS-opt-meanslowdown
• TAGS-opt-meanwaitingtime
• TAGS-opt-fairness
TAGS is counterintuitive
• TAGS wastes resources: it is non-work-conserving.
• Big jobs seem unfairly penalized, yet somehow TAGS turns out to be fair?
• TAGS always operates under unbalanced load.
Results of Analysis (2 hosts only, system load = .5)
[Figure: Random vs. Least-Work-Left vs. TAGS-opt-slowdown.]
[Figure: Random vs. Least-Work-Left vs. TAGS-opt-fairness.]
[Figure: Random vs. Least-Work-Left vs. TAGS-opt-waitingtime.]
More Results (4 hosts, system load = .3)
[Figure: Random vs. Least-Work-Left vs. TAGS.]
More Results
New metric: Server Expansion
Server expansion = the number of hosts we would have to add to the system to get mean slowdown down to 2 or 3. (Initial system: 2 hosts, system load = .7)
[Figure: Least-Work-Left vs. TAGS.]
WHY does TAGS work so well? 1) Reduction of variance of job size distribution 2) Load Unbalancing
WHY does TAGS work so well?
Recall the P-K formula for the M/G/1 FCFS queue, where $E\{W\}$ is the mean waiting time and $E\{X^2\}$ is the second moment of the job size distribution:
$$E\{W\} = \frac{\lambda \, E\{X^2\}}{2(1 - \rho)}$$
1) Reduction of variance of job size distribution: TAGS reduces the variance of the job size distribution at the hosts. No other policy does this!
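A small numeric sketch shows how strongly the second moment drives the P-K mean waiting time. The function name and the example inputs are assumptions for illustration; it uses the standard relation that the load is the arrival rate times the mean job size.

```python
def pk_mean_waiting_time(arrival_rate, mean_size, second_moment):
    """Mean waiting time in an M/G/1 FCFS queue via the P-K formula."""
    rho = arrival_rate * mean_size  # load = lambda * E{X}
    if not 0.0 <= rho < 1.0:
        raise ValueError("the queue is only stable for load in [0, 1)")
    return arrival_rate * second_moment / (2.0 * (1.0 - rho))

# Same mean size (1.0) and same load (0.5), different variability:
deterministic = pk_mean_waiting_time(0.5, 1.0, 1.0)  # all jobs have size 1, E{X^2} = 1
exponential = pk_mean_waiting_time(0.5, 1.0, 2.0)    # exponential sizes, E{X^2} = 2
```

At equal load, doubling the second moment doubles the mean waiting time, which is why narrowing the range of job sizes seen by each host pays off so heavily for heavy-tailed workloads.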
WHY does TAGS work so well?
2) Load Unbalancing:
[Figure: load at Host 1 and Host 2 under TAGS-opt-slowdown and under TAGS-opt-fairness.]
This is fair? YES.
All other policies aim to balance the load. TAGS unbalances the load.
Conclusion
This research challenges our common wisdom:
• Load unbalancing may be better than load balancing.
• It may be worthwhile to waste resources by restarting a job from scratch at a new machine … even if the new machine has a much higher load than the original machine!
• A policy which appears to greatly penalize large jobs may actually be fair.