
Lecture 14: Combating Outliers in MapReduce Clusters








  1. Lecture 14: Combating Outliers in MapReduce Clusters Xiaowei Yang

  2. References: • Reining in the Outliers in Map-Reduce Clusters using Mantri by Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Edward Harris • http://research.microsoft.com/en-us/UM/people/srikanth/data/Combating%20Outliers%20in%20Map-Reduce.web.pptx

  3. [Figure: log(cluster size) vs. log(dataset size) — MapReduce clusters (~10^3 to 10^5 machines) sit above HPC and parallel databases, handling GB-to-EB datasets such as the Internet, click logs, and bio/genomic data] • MapReduce • Decouples customized data operations from mechanisms to scale • Is widely used • Cosmos (based on SVC’s Dryad) + Scope @ Bing • MapReduce @ Google • Hadoop inside Yahoo! and on Amazon’s Cloud (AWS)

  4. An Example What the user says: SELECT Query, COUNT(*) AS Freq FROM QueryTable HAVING Freq > X • Goal: Find frequent search queries to Bing • How it works: a job manager assigns work and tracks progress; map tasks read file blocks and write their output locally; after a barrier, reduce tasks read the map outputs and write the final output blocks
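The query on this slide can be mimicked with a minimal single-process map/reduce sketch in Python. The block contents, the threshold X, and the function names below are hypothetical illustrations, not from the lecture:

```python
from collections import defaultdict

# Sketch of: SELECT Query, COUNT(*) AS Freq FROM QueryTable HAVING Freq > X
# Map emits (query, 1) per record; reduce sums per key and filters on Freq.

def map_phase(file_block):
    # Each map task reads one file block and emits (query, 1) pairs.
    for query in file_block:
        yield (query, 1)

def reduce_phase(pairs, x):
    # Each reduce task sums the counts per key and keeps queries with Freq > x.
    counts = defaultdict(int)
    for query, one in pairs:
        counts[query] += one
    return {q: c for q, c in counts.items() if c > x}

# Hypothetical file blocks of logged queries:
blocks = [["bing", "maps", "bing"], ["bing", "news"]]
pairs = [p for block in blocks for p in map_phase(block)]
frequent = reduce_phase(pairs, x=2)
print(frequent)  # {'bing': 3}
```

In a real cluster the map outputs would be partitioned by key across many reduce tasks; here everything runs in one process to show the data flow.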

  5. We find that: Outliers slow down map-reduce jobs [Figure: job workflow through the file system — Map.Read (22K phases), Map.Move (15K), Map (13K), Barrier, Reduce (51K)] • Goals • Speeding up jobs improves productivity • Predictability supports SLAs • … while using resources efficiently

  6. What is an outlier? • A phase (map or reduce) has n tasks and s slots (available compute resources) • Task i takes ti = f(datasize, code, machine, network) seconds to run • If every task took T seconds, the ideal run time would be ceiling(n/s) * T • A naïve scheduler assigns tasks to slots as they free up • Goal: finish close to the ideal run time
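A short sketch of why one outlier stretches a phase well past the ceiling(n/s) * T ideal; the task times and slot counts below are hypothetical:

```python
import heapq
import math

def ideal_runtime(n_tasks, n_slots, t):
    # With n tasks, s slots, and a uniform task time T,
    # the phase needs ceil(n/s) waves of T seconds each.
    return math.ceil(n_tasks / n_slots) * t

def naive_runtime(task_times, n_slots):
    # Naive scheduler: assign each task, in arrival order,
    # to whichever slot frees up earliest.
    free = [0.0] * n_slots
    heapq.heapify(free)
    for t in task_times:
        heapq.heappush(free, heapq.heappop(free) + t)
    return max(free)

# Hypothetical phase: 6 tasks, 2 slots, nominal T = 10s, one 50s outlier.
print(ideal_runtime(6, 2, 10))                     # 30
print(naive_runtime([10, 10, 10, 10, 10, 50], 2))  # 70.0
```

One slow task more than doubles the phase time here, since every slot but one sits idle while the outlier finishes.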

  7. From a phase to a job • A job may have many phases • An outlier in an early phase has a cumulative effect • Data loss may cause multi-phase recomputes → outliers

  8. Why outliers? Problem: Due to unavailable input, tasks have to be recomputed [Figure: map → sort → reduce pipeline; a delay due to a recompute in an early phase readily cascades into later phases]

  9. Previous work • The original MapReduce paper observed the problem • But didn’t deal with it in depth • Its solution was to duplicate the slow tasks • Drawbacks • Some duplicates may be unnecessary • Duplicates use extra resources • If placement is the problem, duplication doesn’t help

  10. Quantifying the Outlier Problem • Approach: • Understanding the problem first before proposing solutions • Understanding often leads to solutions • Prevalence of outliers • Causes of outliers • Impact of outliers

  11. Why bother? Frequency of outliers • Stragglers = tasks that take ≥ 1.5 times the median task in that phase • Recomputes = tasks that are re-run because their output was lost • 50% of phases have 10% stragglers and no recomputes • 10% of the stragglers take >10X longer
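The straggler definition above is easy to operationalize. A minimal sketch, assuming per-task durations for a phase are available (the sample durations are hypothetical):

```python
from statistics import median

def find_stragglers(durations, factor=1.5):
    # Stragglers: tasks taking >= factor times the phase's median duration.
    m = median(durations)
    return [i for i, d in enumerate(durations) if d >= factor * m]

# Hypothetical per-task durations (seconds) for one phase:
phase = [10, 11, 9, 12, 10, 30, 100]
print(find_stragglers(phase))  # [5, 6]
```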

  12. Causes of outliers: data skew • In 40% of the phases, all the tasks with high runtimes (>1.5x the median task) process a larger amount of data • Duplicating such tasks will not help!

  13. Non-outliers can be improved as well • 20% of them are 55% longer than the median

  14. Problem: Tasks reading input over the network experience variable congestion [Figure: reduce tasks pull map output from unevenly loaded racks] • Uneven placement is typical in production • Reduce tasks are placed at the first available slot

  15. Causes of outliers: cross-rack traffic • 70% of cross-rack traffic is reduce traffic • Tasks in a spot with a slow network run slower • Tasks compete with each other for network bandwidth • Every reduce reads from every map • A reduce is put into any spare slot • 50% of phases take 62% longer to finish than under ideal placement

  16. Causes of outliers: bad and busy machines • 50% of recomputes happen on 5% of the machines • Recomputes increase resource usage

  17. Outliers cluster by time • Resource contention might be the cause • Recomputes cluster by machine • Data loss may cause multiple recomputes

  18. Why bother? Cost of outliers (what-if analysis: replay logs in a trace-driven simulator) • At the median, jobs are slowed down by 35% due to outliers

  19. Mantri Design

  20. High-level idea • Cause-aware and resource-aware • Runtime = f(input, network, machine, dataToProcess, …) • Fix each problem with a different strategy

  21. Resource-aware restarts • Duplicate or kill long outliers

  22. When to restart • Every ∆ seconds, tasks report progress • Estimate trem (remaining time of the running copy) and tnew (expected runtime of a fresh copy)

  23. Restart rules (γ = 3 caps the number of copies per task) • Schedule a duplicate only if the total running time shrinks in expectation: with c copies running, require P(c · trem > (c+1) · tnew) > δ • When there are available slots, kill and restart a task if the expected time saved exceeds the restart cost: E(trem − tnew) > ρ∆
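The duplicate-scheduling test can be sketched by approximating P(c·trem > (c+1)·tnew) with an empirical sample of remaining-time estimates. The δ value and sample numbers below are hypothetical stand-ins, not Mantri's actual implementation:

```python
def should_duplicate(t_rem_samples, t_new, c, delta=0.25):
    # Launch a duplicate only if it shrinks total work in expectation:
    # with c copies running, require P(c * t_rem > (c + 1) * t_new) > delta.
    # t_rem_samples approximates the remaining-time distribution.
    exceed = sum(1 for t_rem in t_rem_samples if c * t_rem > (c + 1) * t_new)
    return exceed / len(t_rem_samples) > delta

# Hypothetical remaining-time estimates (seconds) for a suspected outlier:
samples = [120, 150, 90, 200, 40]
print(should_duplicate(samples, t_new=50, c=1))  # True: 3/5 exceed 100s
```

The c vs. c+1 weighting makes the check resource-aware: a duplicate must pay for the extra slot it occupies, not merely finish sooner.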

  24. Network-aware placement • Compute the rack location for each task • Find the placement that minimizes the maximum data transfer time • If rack i has di map output and ui, vi bandwidths available on its uplink and downlink, place an ai fraction of the reduces on rack i so as to minimize, over racks, the larger of the uplink time di(1−ai)/ui and the downlink time (Σj dj − di)·ai/vi
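Under this model the transfer time of a candidate placement can be evaluated directly. The sketch below assumes the uplink/downlink bottleneck formulation stated above; the rack sizes and bandwidths are hypothetical:

```python
def transfer_time(d, u, v, a):
    # For rack i with map output d[i], uplink u[i], downlink v[i], and a
    # fraction a[i] of the reduces placed there, the transfer time is
    # bounded by the slower of:
    #   uplink:   shipping the (1 - a[i]) share of d[i] to other racks,
    #   downlink: pulling in a[i] of the data produced on other racks.
    total = sum(d)
    return max(
        max(di * (1 - ai) / ui, (total - di) * ai / vi)
        for di, ui, vi, ai in zip(d, u, v, a)
    )

# Hypothetical two racks with equal links; rack 0 holds 80% of map output.
d, u, v = [80, 20], [1.0, 1.0], [1.0, 1.0]
even = transfer_time(d, u, v, [0.5, 0.5])    # even split congests rack 0's uplink
skewed = transfer_time(d, u, v, [0.8, 0.2])  # placing reduces near the data wins
print(even, skewed)
```

Here the data-proportional placement cuts the bottleneck transfer time from 40 to 16 time units, illustrating why first-available-slot placement leaves so much on the table.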

  25. Avoid recomputation • Replicate task output • Restart a task early if its data are lost • Replicate the output that is most costly to recompute

  26. Data-aware task ordering • Outliers arise from large inputs • Schedule tasks in descending order of dataToProcess • The resulting schedule is at most 33% worse than the optimal
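Descending-order scheduling is classic longest-processing-time (LPT) list scheduling, which is where the 33% (i.e. 4/3) worst-case bound comes from. A small sketch with hypothetical task sizes:

```python
import heapq

def phase_time(task_sizes, n_slots, descending):
    # Greedy list scheduling: each task goes to the earliest-free slot,
    # either in arrival order or in descending order of dataToProcess.
    tasks = sorted(task_sizes, reverse=True) if descending else list(task_sizes)
    free = [0.0] * n_slots
    heapq.heapify(free)
    for t in tasks:
        heapq.heappush(free, heapq.heappop(free) + t)
    return max(free)

# Hypothetical dataToProcess per task, 2 slots:
sizes = [1, 1, 1, 3]
print(phase_time(sizes, 2, descending=False))  # 4.0
print(phase_time(sizes, 2, descending=True))   # 3.0
```

Running the big task first lets the small ones pack around it; in arrival order the big task starts last and extends the phase.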

  27. Estimation of trem • d: input data size • dread: the amount of data read so far • trem is estimated by extrapolating the elapsed time over the fraction of input still to be read

  28. Estimation of tnew • processRate: estimated from all tasks in the phase • locationFactor: relative machine performance • d: input size
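One plausible reading of these estimators is linear extrapolation from progress reports. The formulas below are a sketch consistent with the variable definitions on slides 27–28, not necessarily Mantri's exact expressions:

```python
def estimate_t_rem(t_elapsed, d, d_read, t_wrapup=0.0):
    # If a task read d_read of d bytes in t_elapsed seconds, extrapolate
    # linearly: the remaining bytes take proportionally more time, plus
    # any fixed wrap-up cost.
    return t_elapsed * (d - d_read) / d_read + t_wrapup

def estimate_t_new(d, process_rate, location_factor):
    # A fresh copy's runtime: input size over the phase-wide process rate,
    # scaled by the candidate machine's relative performance.
    return (d / process_rate) * location_factor

print(estimate_t_rem(t_elapsed=30, d=100, d_read=25))                # 90.0
print(estimate_t_new(d=100, process_rate=2.0, location_factor=1.0))  # 50.0
```

With these numbers, the running copy needs an estimated 90s more while a fresh copy would take 50s, which is exactly the kind of gap the restart rules on slide 23 act on.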

  29. Results • Deployed in production Cosmos clusters • Prototype Jan ’10, baking on pre-production clusters → released May ’10 • Trace-driven simulations: thousands of jobs; mimic workflow, task runtime, data skew, failure probability; compare with existing schemes and idealized oracles

  30. Evaluation Methodology • Mantri run on production clusters • Baseline is results from Dryad • Use trace-driven simulations to compare with other systems

  31. Comparing jobs in the wild • With and without Mantri for one month of jobs in Bing’s production cluster • 340 jobs that each repeated at least five times during May 25–28 (release) vs. Apr 1–30 (pre-release)

  32. In production, restarts… improve on native Cosmos by 25% while using fewer resources

  33. In trace-replay simulations, restarts… are much better dealt with in a cause- and resource-aware manner [Figure: CDFs of completion time and % cluster resources] • Each job repeated thrice

  34. Network-aware placement • Equal: assumes all links have the same bandwidth • Start: uses bandwidths measured at the start of the phase • Ideal: uses available bandwidth at run time

  35. Protecting against recomputes [Figure: CDF of % cluster resources]

  36. Summary • Recomputation: preferentially replicate the output of tasks that are costly to recompute • Poor network: each job locally avoids network hot-spots • Bad machines: quarantine persistently faulty machines • DataToProcess: schedule tasks in descending order of data size • Others: restart or duplicate tasks, cognizant of resource cost, and prune unneeded copies

  37. Conclusion • Outliers in map-reduce clusters are a significant problem • They happen due to many causes • interplay between storage, network, and map-reduce • Cause- and resource-aware mitigation improves on prior art
