Natjam: Supporting Deadlines and Priorities in a MapReduce Cluster
Brian Cho (Samsung/Illinois), Muntasir Rahman, Tej Chajed, Indranil Gupta, Cristina Abad, Nathan Roberts (Yahoo! Inc.), Philbert Lin
University of Illinois (Urbana-Champaign)
Distributed Protocols Research Group (DPRG): http://dprg.cs.uiuc.edu
Hadoop Jobs have Priorities
• Dual priority case
  • Production jobs (high priority)
    • Time sensitive
    • Directly affect criticality or revenue
  • Research jobs (low priority)
    • e.g., long-term analysis
• Example: an ad provider
  • Ad click-through logs → count clicks → update ads; slow counts → show old ads → don't get paid $$$
  • Daily and historical logs → run machine learning analysis ("Is there a better way to place ads?")
  • ⇒ Prioritize production jobs
http://dprg.cs.uiuc.edu
State-of-the-art: Separate clusters
• Production cluster receives production jobs (high priority)
• Research cluster receives research jobs (low priority)
• Traces reveal large periods of under-utilization in each cluster
• Long job completion times
• Human involvement in job management
• Goal: a single consolidated cluster for all priorities and deadlines
  • Prioritize production jobs, yet affect research jobs the least
• Today's options:
  • Wait for research tasks to finish (e.g., Capacity Scheduler) → prolongs production jobs
  • Kill research tasks (e.g., Fair Scheduler) → can lead to repeated work → prolongs research jobs
http://dprg.cs.uiuc.edu
Natjam's Techniques
• Scale down research jobs by
  • Preempting some Reduce tasks
  • Fast on-demand automated checkpointing of task state
  • Later, Reduce tasks can resume where they left off
• Focus on Reduces: Reduce tasks take longer, so there is more work to lose (median Map 19 seconds vs. Reduce 231 seconds [Facebook])
• Job eviction policies
• Task eviction policies
http://dprg.cs.uiuc.edu
Natjam built into Hadoop YARN Architecture
• Preemptor (at the Resource Manager, alongside the Capacity Scheduler)
  • Chooses victim job
  • Reclaims queue resources
• Releaser (at each Application Master)
  • Chooses victim task, given the number of containers to release
• Local Suspender (at each Application Master)
  • Saves the state of the victim task
[Architecture figure: the Resource Manager's Preemptor calls preempt(); the victim job's Application Master is told how many containers to release; its Releaser picks tasks, the Local Suspender suspends each one and saves its state, and the now-empty container is released for the asking job; a suspended task is later resumed from the saved state via resume()]
(A minimal code sketch of this flow follows below.)
http://dprg.cs.uiuc.edu
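To make the control flow concrete, here is a minimal Java sketch of the preemption path on this slide: the Preemptor at the Resource Manager picks a victim job, the victim's Releaser picks tasks, and the Local Suspender checkpoints and frees containers. All class and method names are hypothetical stand-ins rather than real Hadoop/YARN APIs, and the policies are hard-coded to the MR + SRT combination recommended later in the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; these are not real Hadoop/YARN classes or method names. */
public class PreemptionFlowSketch {

    static class Task {
        final String id;
        final long remainingSecs;          // the AM's estimate of time left for this Reduce
        Task(String id, long remainingSecs) { this.id = id; this.remainingSecs = remainingSecs; }
    }

    static class ResearchJob {
        final String id;
        final List<Task> runningReduces = new ArrayList<>();
        ResearchJob(String id) { this.id = id; }
        int containersHeld() { return runningReduces.size(); }
    }

    /** Resource Manager side: choose a victim job, then ask its AM to release containers. */
    static void preempt(List<ResearchJob> researchJobs, int containersNeeded) {
        ResearchJob victim = researchJobs.stream()                 // job eviction: Most Resources
                .max(Comparator.comparingInt(ResearchJob::containersHeld))
                .orElseThrow(IllegalStateException::new);
        release(victim, containersNeeded);                         // an RM -> AM message in the real system
    }

    /** Application Master side: choose victim tasks and hand them to the Local Suspender. */
    static void release(ResearchJob victim, int containersNeeded) {
        for (int i = 0; i < containersNeeded && !victim.runningReduces.isEmpty(); i++) {
            Task t = victim.runningReduces.stream()                // task eviction: Shortest Remaining Time
                    .min(Comparator.comparingLong((Task task) -> task.remainingSecs))
                    .get();
            victim.runningReduces.remove(t);
            suspend(victim, t);
        }
    }

    /** Node side: checkpoint the task's state and free its container for the production job. */
    static void suspend(ResearchJob job, Task t) {
        System.out.printf("suspend %s of %s (state saved, container released)%n", t.id, job.id);
    }

    public static void main(String[] args) {
        ResearchJob r1 = new ResearchJob("research-1");
        r1.runningReduces.add(new Task("r1-reduce-0", 200));
        r1.runningReduces.add(new Task("r1-reduce-1", 40));
        ResearchJob r2 = new ResearchJob("research-2");
        r2.runningReduces.add(new Task("r2-reduce-0", 90));
        preempt(Arrays.asList(r1, r2), 1);   // a production container request arrives in a full cluster
    }
}
```

In the real system these calls are messages between the Resource Manager, the victim job's Application Master, and the Node Manager hosting the task, but the division of labor is the same.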
Suspending and Resuming Tasks
• Existing intermediate data is reused
  • Reduce inputs, stored at the local host
  • Reduce outputs, stored on HDFS
• Suspended task state saved locally, so resume can avoid network overhead
• Checkpoint state saved
  • Key counter
  • Reduce input path
  • Hostname
  • List of suspended task attempt IDs
[Figure: Task Attempt 1 is suspended (container freed, suspend state saved); its partial output under outdir/ on HDFS and its key counter are kept; the resumed Task Attempt 2 reads the same local inputs, skips the keys already processed, and continues from there]
(A sketch of the checkpoint record follows below.)
http://dprg.cs.uiuc.edu
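To give a feel for how little state is saved, here is a hypothetical sketch of a checkpoint record with the four fields listed above; the class and field names are illustrative only, not Natjam's actual implementation inside Hadoop.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical suspend-checkpoint record; field names are illustrative only. */
public class SuspendCheckpoint implements Serializable {
    long keyCounter;                 // reduce input keys already processed (output already on HDFS)
    String reduceInputPath;          // local path of the merged reduce inputs
    String hostname;                 // host holding the local inputs and this checkpoint
    List<String> suspendedAttemptIds = new ArrayList<>();   // earlier suspended attempts of this task

    /** A resumed attempt replays the input key stream but skips keys the first attempt finished. */
    <K> void skipProcessedKeys(Iterator<K> reduceInputKeys) {
        for (long i = 0; i < keyCounter && reduceInputKeys.hasNext(); i++) {
            reduceInputKeys.next();   // output for this key already exists under outdir/ on HDFS
        }
    }
}
```

Because both the reduce inputs and this record stay on the original host, scheduling the resumed attempt on that same hostname avoids re-fetching map outputs over the network.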
Two-level Eviction Policies
• On a container request in a full cluster:
  • Job eviction @ Preemptor (Resource Manager)
  • Task eviction @ Releaser (Application Master)
[Figure: same YARN architecture as the previous slide; the Preemptor's preempt() tells the victim Application Master how many containers to release, and its Releaser chooses which tasks to suspend]
http://dprg.cs.uiuc.edu
Job Eviction Policies
• Based on the total amount of resources (e.g., containers) held by the victim job (known at the Resource Manager)
• Least Resources (LR)
  + Large research jobs unaffected
  − Starvation for small research jobs (e.g., under repeated production arrivals)
• Most Resources (MR)
  + Small research jobs unaffected
  − Starvation for the largest research job
• Probabilistically-weighted on Resources (PR)
  + Weighs jobs by number of containers: treats all tasks the same, across jobs
  − Affects multiple research jobs
(The three policies are sketched in code below.)
http://dprg.cs.uiuc.edu
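The three policies amount to simple selections over per-job container counts. The sketch below is written against a hypothetical JobInfo type (not a real YARN class) to show the selection logic only.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** Sketch of the job-eviction policies; JobInfo is a hypothetical stand-in, not a YARN class. */
public class JobEvictionPolicies {

    public static final class JobInfo {
        final String id;
        final int containersHeld;
        public JobInfo(String id, int containersHeld) { this.id = id; this.containersHeld = containersHeld; }
        public int containersHeld() { return containersHeld; }
    }

    /** LR: evict the research job holding the least resources. */
    public static JobInfo leastResources(List<JobInfo> jobs) {
        return jobs.stream().min(Comparator.comparingInt(JobInfo::containersHeld))
                   .orElseThrow(IllegalStateException::new);
    }

    /** MR: evict the research job holding the most resources. */
    public static JobInfo mostResources(List<JobInfo> jobs) {
        return jobs.stream().max(Comparator.comparingInt(JobInfo::containersHeld))
                   .orElseThrow(IllegalStateException::new);
    }

    /** PR: pick a victim with probability proportional to the containers it holds (assumes total > 0). */
    public static JobInfo probabilisticallyWeighted(List<JobInfo> jobs, Random rng) {
        int total = jobs.stream().mapToInt(JobInfo::containersHeld).sum();
        int pick = rng.nextInt(total);
        for (JobInfo job : jobs) {
            pick -= job.containersHeld();
            if (pick < 0) return job;
        }
        throw new IllegalStateException("unreachable when total > 0");
    }
}
```

Under repeated preemption, leastResources keeps hitting the same small jobs (the starvation noted above), mostResources concentrates the loss on the largest job, and probabilisticallyWeighted spreads it roughly in proportion to job size.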
Task Eviction Policies
• Based on time remaining (known at the Application Master)
• Shortest Remaining Time (SRT)
  + Leaves the tail of the research job alone
  − Holds on to containers that would have been released soon anyway
• Longest Remaining Time (LRT)
  − May lengthen the tail
  + Releases more containers earlier
• However: SRT is provably optimal under some conditions
  • Counter-intuitive: SRT = longest-job-first scheduling
(Both policies are sketched in code below.)
http://dprg.cs.uiuc.edu
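The task-level policies are the analogous selections over the Application Master's remaining-time estimates; TaskInfo below is again a hypothetical stand-in rather than a Hadoop class.

```java
import java.util.Comparator;
import java.util.List;

/** Sketch of the task-eviction policies; TaskInfo is a hypothetical stand-in, not a Hadoop class. */
public class TaskEvictionPolicies {

    public static final class TaskInfo {
        final String id;
        final long remainingSecs;    // the AM's progress-based estimate of time left
        public TaskInfo(String id, long remainingSecs) { this.id = id; this.remainingSecs = remainingSecs; }
        public long remainingSecs() { return remainingSecs; }
    }

    /** SRT: suspend the task closest to finishing. */
    public static TaskInfo shortestRemainingTime(List<TaskInfo> tasks) {
        return tasks.stream().min(Comparator.comparingLong(TaskInfo::remainingSecs))
                    .orElseThrow(IllegalStateException::new);
    }

    /** LRT: suspend the task furthest from finishing. */
    public static TaskInfo longestRemainingTime(List<TaskInfo> tasks) {
        return tasks.stream().max(Comparator.comparingLong(TaskInfo::remainingSecs))
                    .orElseThrow(IllegalStateException::new);
    }
}
```

The intuition behind SRT: a suspended task's remaining work is deferred until it resumes, so suspending the task with the least time left adds the least to the job's tail, which is what the theorem on the next slide formalizes.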
Eviction Policies in Practice
• Task eviction
  • SRT 20% faster than LRT for research jobs
  • Production job similar across SRT vs. LRT
  • Theorem: when research tasks resume simultaneously, SRT results in the shortest job completion time
• Job eviction
  • MR best, PR very close behind
  • LR 14%–23% worse than MR
• MR + SRT is the best combination
http://dprg.cs.uiuc.edu
Natjam-R: Multiple Priorities
• Special case of priorities: jobs with real-time deadlines
  • Best-effort only (no admission control)
• Resource Manager keeps a single queue of jobs, sorted by increasing priority (derived from deadline)
  • Periodically scans the queue: evicts a later job to give resources to an earlier job that is waiting
• Job eviction policies (sketched in code below)
  • Maximum Deadline First (MDF): priority = deadline
    + Prefers short-deadline jobs
    − May miss deadlines, e.g., schedules a large job instead of a small job with a slightly larger deadline
  • Maximum Laxity First (MLF): priority = laxity = deadline minus the job's projected completion time
    + Pays attention to the job's resource requirements
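A minimal sketch of how the two priorities could be computed, assuming a hypothetical JobState type that carries each job's deadline and projected completion time; the real Natjam-R scheduler keeps equivalent information in its sorted queue at the Resource Manager.

```java
import java.util.Comparator;
import java.util.List;

/** Sketch of Natjam-R's deadline-based priorities; JobState is a hypothetical stand-in. */
public class NatjamRPriorities {

    public static final class JobState {
        final String id;
        final long deadlineMillis;              // absolute deadline
        final long projectedCompletionMillis;   // scheduler's estimate of when the job would finish
        public JobState(String id, long deadlineMillis, long projectedCompletionMillis) {
            this.id = id;
            this.deadlineMillis = deadlineMillis;
            this.projectedCompletionMillis = projectedCompletionMillis;
        }
    }

    /** MDF: priority = deadline; the job with the maximum (latest) deadline is evicted first. */
    public static Comparator<JobState> maximumDeadlineFirst() {
        return Comparator.comparingLong((JobState j) -> j.deadlineMillis);
    }

    /** MLF: priority = laxity = deadline minus projected completion; maximum laxity is evicted first. */
    public static Comparator<JobState> maximumLaxityFirst() {
        return Comparator.comparingLong((JobState j) -> j.deadlineMillis - j.projectedCompletionMillis);
    }

    /** The queue's tail under the chosen ordering is the next eviction victim. */
    public static JobState nextVictim(List<JobState> queue, Comparator<JobState> priority) {
        return queue.stream().max(priority).orElseThrow(IllegalStateException::new);
    }
}
```

MLF's laxity term is what lets it account for a job's resource requirements, as noted above.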
MDF vs. MLF in Practice
• 8-node cluster
• Yahoo! trace experiments in the paper
[Figure: completion times against job deadlines; MDF prefers short deadlines, while MLF makes jobs move in lockstep and misses all deadlines]
Natjam vs. Alternatives
• Microbenchmark: 7-node cluster
  • Research-XL (100% of cluster) submitted at t=0s to an empty cluster
  • Production-S (25% of cluster) submitted at t=50s
[Figure: completion time (seconds) per scheduler; Natjam's jobs finish within 2%–7% of Ideal, 40% better than Soft cap and 15% better than Killing, while the alternatives range from 20% to 90% worse than Ideal]
Large Experiments
• 250 nodes @ Yahoo!, driven by Yahoo! traces
• Natjam vs. waiting for research tasks (Hadoop Capacity Scheduler: soft cap)
  • Production jobs: 53% benefit, 97% delayed < 5 s
  • Research jobs: 63% benefit, very few outliers (low starvation)
• Natjam vs. killing research tasks
  • Production jobs: largely unaffected
  • Research jobs:
    • 38% finish more than 100 s faster
    • 5th percentile finishes more than 750 s faster
    • Biggest improvement: 1880 s
    • Negligible starvation
http://dprg.cs.uiuc.edu
Related Work
• Single-cluster job scheduling has focused on:
  • Locality of Map tasks [Quincy, Delay Scheduling]
  • Speculative execution [LATE Scheduler]
  • Average fairness between queues [Capacity Scheduler, Fair Scheduler]
• Recent work: elastic queues, but uses Sailfish, which needs a special intermediate file system and does not work with Hadoop [Amoeba]
• MAPREDUCE-5269 JIRA: preemption in Hadoop
http://dprg.cs.uiuc.edu
Takeaways
• Natjam supports dual priority and arbitrary priorities (derived from deadlines)
• SRT (Shortest Remaining Time): best policy for task eviction
• MR (Most Resources): best policy for job eviction
• MDF (Maximum Deadline First): best policy for job eviction in Natjam-R
• 2–7% overhead for the dual-priority case
• Please see our poster + demo video later today!
http://dprg.cs.uiuc.edu
Backup slides http://dprg.cs.uiuc.edu
Contributions
• Our system Natjam allows us to
  • Maintain one cluster, with a production queue and a research queue
  • Prioritize production jobs and complete them quickly
  • While affecting research jobs the least
  • (Later: extend to multiple priorities)
http://dprg.cs.uiuc.edu
Hadoop 0.23's Capacity Scheduler
• Limitation: research jobs cannot scale down
• Hadoop capacity is shared using queues
  • Guaranteed capacity (G)
  • Maximum capacity (M)
• Example
  • Production (P) queue: G 80% / M 80%
  • Research (R) queue: G 20% / M 40%
  • Production job submitted first: P takes 80%; R is capped at its 40% maximum even once capacity frees up (under-utilization)
  • Research job submitted first: R takes 40% and can grow no further (under-utilization); a production job arriving later cannot grow beyond 60%
(A numeric sketch of this example follows below.)
http://dprg.cs.uiuc.edu
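A back-of-the-envelope numeric model of this example, assuming a single resource normalized to 100% and a hypothetical attainable() helper; this illustrates the guaranteed/maximum interaction only and is not the Capacity Scheduler's actual allocation code.

```java
/** Minimal model of the slide's two-queue example; not real Capacity Scheduler logic. */
public class CapacityExample {

    /** A queue can use at most its maximum capacity, and never more than the other queue leaves free. */
    static int attainable(int maximumCapacity, int otherQueueUsage) {
        return Math.min(maximumCapacity, 100 - otherQueueUsage);
    }

    public static void main(String[] args) {
        final int P_MAX = 80;   // production: guaranteed 80%, maximum 80%
        final int R_MAX = 40;   // research:   guaranteed 20%, maximum 40%

        // Production job submitted first: it takes 80%; even after it finishes,
        // the research job is still capped at 40%, so 60% of the cluster idles.
        System.out.println("P running at 80%, R capped at " + attainable(R_MAX, 80) + "%");
        System.out.println("P finished, R still capped at " + attainable(R_MAX, 0) + "%");

        // Research job submitted first: it grows to 40%; a production job arriving later
        // cannot exceed 60%, below its 80% guarantee, because nothing is preempted.
        System.out.println("R running at 40%, P capped at " + attainable(P_MAX, 40) + "%");
    }
}
```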
Natjam Scheduler
• Does not require maximum capacity
• Scales down research jobs by
  • Preempting Reduce tasks
  • Fast on-demand automated checkpointing of task state
  • Resumption where the task left off
• Focus on Reduces: Reduce tasks take longer, so there is more work to lose (median Map 19 seconds vs. Reduce 231 seconds [Facebook])
• P/R guaranteed 80%/20% and P/R guaranteed 100%/0%
[Figure: in both configurations the research job first takes 100% of the cluster; when the production job arrives, Natjam scales the research job down so the production job gets its 80% (or 100%) share: prioritize production jobs]
http://dprg.cs.uiuc.edu
Yahoo! Hadoop Traces: CDF of differences (negative is good)
[Figure: CDFs of completion-time differences on a 7-node cluster and a 250-node Yahoo! cluster; only two starved jobs (260 s and 390 s); largest benefit 1880 s]