
Surviving Failures in Bandwidth-Constrained Datacenters


Presentation Transcript


  1. Surviving Failures in Bandwidth-Constrained Datacenters. Peter Bodik, Ishai Menache @ Microsoft; Pradeepkumar Mani, David A. Maltz @ Microsoft Research; Mosharaf Chowdhury, Ion Stoica @ UC Berkeley. Presenter: Xin Li

  2. How to allocate services to physical machines? A simplified search engine example with a Web Server, a Database Server, and an Indexing Server. Given the network topology & VM allocation: Is it optimal? What is optimal?

  3. What Matters: revenue and user experience, via smoothness, availability, and transparency.

  4. Availability (diagram: Services A and B placed on racks under aggregation switches and a core switch)

  5. Availability. WCS (Worst-Case Survival): the smallest fraction of a service's machines that remain functional during any single failure in the datacenter. FT (Fault Tolerance): the average WCS across all services. (Diagram: Services A and B placed across racks under aggregation switches and a core switch.)
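
For concreteness, a minimal Python sketch (not the authors' code) of these two metrics; the function names, data layout, and example counts are illustrative assumptions.

```python
def wcs(machines_per_fault_domain):
    """Worst-Case Survival: smallest fraction of a service's machines that
    survive any single failure.

    machines_per_fault_domain: list of machine counts, one per fault domain
    the service is spread across.
    """
    total = sum(machines_per_fault_domain)
    # Losing one fault domain removes its machines; the worst case is the largest domain.
    return min((total - lost) / total for lost in machines_per_fault_domain)

def ft(services):
    """FT = average WCS across all services."""
    return sum(wcs(m) for m in services.values()) / len(services)

# Hypothetical example: Service A is split 2+2 across two racks,
# Service B sits entirely in one rack.
print(ft({"A": [2, 2], "B": [3]}))  # A survives 0.5, B survives 0.0 -> FT = 0.25
```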

  6. Smoothness (diagram: Services A and B each generate 1000 MB/s per communicating machine pair; core-link traffic of 1000*2 + 1000*2 + 1000*2 + 1000*2 = 8000 MB/s against a 6000 MB/s core link, and aggregation-link traffic of 1000*3 + 1000*3 = 6000 MB/s against a 4000 MB/s aggregation link, causing congestion)

  7. Oversubscription Ratio (diagram: servers 1 through n, each with link bandwidth B, share an uplink of bandwidth UB). Oversubscription Ratio = B*n / UB
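
A tiny sketch of the ratio as defined above; the 40-server example values are made up.

```python
def oversubscription_ratio(n_servers, server_bw, uplink_bw):
    """Oversubscription ratio = B*n / UB, as defined on the slide above."""
    return n_servers * server_bw / uplink_bw

# Hypothetical example: 40 servers with 1 Gb/s links sharing a 10 Gb/s uplink.
print(oversubscription_ratio(40, 1, 10))  # 4.0
```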

  8. Smoothness (same diagram as slide 6). BW (Bandwidth): the aggregate bandwidth usage on the core links.

  9. Smoothness (diagram: an improved allocation of Services A and B that reduces the traffic crossing the 6000 MB/s core link and the 4000 MB/s aggregation link)

  10. Transparency: when we optimize the two metrics above, the cost of doing so should be low.

  11. Transparency (diagram: Services A and B under the core and aggregation switches). When we optimize the two metrics above, the cost should be low.

  12. Transparency (diagram continued)

  13. Transparency (diagram continued). NM (Number of Moves): the number of server moves needed to reach the target allocation.
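
A minimal sketch (assumed representation, not the paper's) of NM: count the servers whose service assignment changes between the initial and the proposed allocation.

```python
def number_of_moves(initial, proposed):
    """NM: servers whose assigned service differs and would have to be re-imaged.

    initial, proposed: dicts mapping server id -> service name.
    """
    return sum(1 for server in initial if initial[server] != proposed.get(server))

# Hypothetical four-server example: two servers change service, so NM = 2.
initial  = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
proposed = {"s1": "A", "s2": "B", "s3": "B", "s4": "A"}
print(number_of_moves(initial, proposed))  # 2
```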

  14. What this paper sets out to do: optimize the datacenter running Bing.com

  15. Challenges

  16. Optimizing for one metric degrades the other (plot: improvement in worst-case survival vs. reduction in BW usage, comparing allocations that optimize only worst-case survival, allocations that optimize only core bandwidth, and the initial allocation; results from 6 Microsoft datacenters)

  17. Motivation for a combined solution. A cluster manager maps a set of services forming an application onto machines. The service communication matrix over (App, Service) pairs is sparse and skewed: only 2% of service pairs communicate, and 1% of services generate 64% of the traffic (lots more in the paper).

  18. (diagram: Service A with 1000 MB/s and Service B with only 20 MB/s of traffic, placed under the 6000 MB/s core link and 4000 MB/s aggregation link; communication demands are highly skewed)

  19. Problem Statement: the framework. Metrics: Bandwidth (BW) is the sum of the rates on the core links, the overall measure of bandwidth usage at the core of the network. Fault Tolerance (FT) is the average Worst-Case Survival (WCS) across all services. Number of Moves (NM) is the number of servers that have to be re-imaged to get from the initial datacenter allocation to the proposed allocation. Optimization: maximize FT - α·BW subject to NM ≤ N0, where α is a tunable positive parameter and N0 is an upper limit on the number of moves.
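
A hedged sketch of the objective on this slide; the function and parameter names are illustrative, and the FT, BW, and NM values would come from metric computations like those sketched elsewhere in this transcript.

```python
def objective(ft_value, bw_value, num_moves, alpha, n0):
    """Score an allocation by FT - alpha*BW, rejecting it if it needs more than N0 moves."""
    if num_moves > n0:
        return float("-inf")  # infeasible: exceeds the move budget NM <= N0
    return ft_value - alpha * bw_value

# Hypothetical numbers: FT of 0.75, core BW of 3.2 (in some unit), 80 moves with a budget of 100.
print(objective(ft_value=0.75, bw_value=3.2, num_moves=80, alpha=0.1, n0=100))
```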

  20. Metric #1: Fault Tolerance. FT = fraction (%) of a service available during a single worst-case failure. Fault domain: the set of all machines affected by a single (any) failure, e.g. the network core, switches (EoR/Agg), containers, racks (ToR), or power distribution. Fault domains are complex.

  21. Fault Tolerance. Cells: subsets of physical machines that belong to exactly the same fault domains (e.g. the same power supply and rack); cells reduce the size of the optimization problem. An allocation variable (written x_{n,k} below) indicates the number of machines within cell n allocated to service k.

  22. Formal Definitions. I is the indicator function: I(n1, n2) = 1 if traffic from cell n1 to cell n2 traverses a core link, and I(n1, n2) = 0 otherwise. Bandwidth is given by BW = Σ_{n1,n2} Σ_{k1,k2} I(n1, n2) · x_{n1,k1} · x_{n2,k2} · b_{k1,k2}, where b_{k1,k2} is the required bandwidth between a pair of machines from services k1 and k2. To define FT, let f_{k,j} be the total number of machines allocated to service k that are affected by fault j, and m_k the total number of machines of service k; then FT = (1/K) Σ_k min_j (m_k - f_{k,j}) / m_k, where K is the total number of services.
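
A Python sketch of these two formulas under the notation introduced above; x, b, f, and m are this transcript's illustrative symbols, not necessarily the paper's.

```python
def bandwidth(x, b, crosses_core):
    """BW = sum over cell pairs (n1, n2) and service pairs (k1, k2) of
    I(n1, n2) * x[n1][k1] * x[n2][k2] * b[(k1, k2)].

    x: dict cell -> dict service -> machine count
    b: dict (service, service) -> required bandwidth per machine pair
    crosses_core: the indicator I, a predicate over cell pairs
    """
    total = 0.0
    for n1 in x:
        for n2 in x:
            if not crosses_core(n1, n2):  # I(n1, n2) = 0: traffic stays below the core
                continue
            for k1, x1 in x[n1].items():
                for k2, x2 in x[n2].items():
                    total += x1 * x2 * b.get((k1, k2), 0.0)
    return total

def fault_tolerance(f, m):
    """FT = (1/K) * sum over services k of min over faults j of (m[k] - f[k][j]) / m[k].

    f: dict service -> list of machine counts hit by each fault domain
    m: dict service -> total machines allocated to the service
    """
    return sum(min((m[k] - hit) / m[k] for hit in f[k]) for k in m) / len(m)
```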

  23. But… how do we maximize FT - α·BW subject to NM ≤ N0?

  24. Min-Cut (diagram: a weighted graph whose edges, with weights such as 10, 9, 8, 7, 6 and 1, are separated by a minimum cut)
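
As a concrete illustration of a minimum cut, here is a small weighted graph split with NetworkX's Stoer-Wagner global min-cut; the graph and its edge weights are invented and only loosely mirror the slide's picture.

```python
import networkx as nx

# Made-up weighted graph: heavy edges inside each half, light edges between them.
G = nx.Graph()
G.add_edge("a", "b", weight=10)
G.add_edge("c", "d", weight=8)
G.add_edge("b", "c", weight=6)
G.add_edge("a", "c", weight=1)
G.add_edge("b", "d", weight=1)

# Global minimum cut: the cheapest set of edges whose removal splits the graph in two.
cut_value, (part1, part2) = nx.stoer_wagner(G)
print(cut_value, part1, part2)
```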

  25. Optimizing for BW only: a k-way min graph cut over the machine communication graph and the network topology, considered previously in [Meng et al., INFOCOM'10]. Drawbacks: it ignores NM, reshuffling almost all machines (about 99% of machines would migrate), and it ignores FT and cannot easily be extended to account for it.

  26. (diagram: reducing core BW by applying a k-way graph cut to the communication graph)

  27. What about FT?

  28. FT algorithm. Input: service map + VM placement, fault domain map. 1. Calculate the initial FTC (fault tolerance cost) of the datacenter. 2. For every candidate swap: a. calculate the new FTC after the swap; b. ∆FTC = FTC_old - FTC_new. 3. Execute the swap with the maximum ∆FTC. Because of symmetry, many "good" swaps exist, so only a small random set of swaps (~1000) is evaluated.
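
A hedged Python sketch of this greedy swap loop (not the authors' implementation); `alloc` and `ftc` are assumed interfaces, with `ftc` standing in for the FTC computation.

```python
import random

def improve_ft(alloc, ftc, num_candidate_swaps=1000, num_rounds=100):
    """Greedy hill climbing on FTC: repeatedly apply the best of ~1000 random swaps.

    alloc: dict server -> service (mutated in place)
    ftc:   function mapping an allocation to its fault tolerance cost (lower is better)
    """
    servers = list(alloc)
    current_cost = ftc(alloc)                      # 1. initial FTC of the datacenter
    for _ in range(num_rounds):
        best_swap, best_delta = None, 0.0
        # 2. evaluate only a small random set of swaps; symmetry means many
        #    equally good swaps exist, so sampling ~1000 is enough in practice
        for _ in range(num_candidate_swaps):
            s1, s2 = random.sample(servers, 2)
            alloc[s1], alloc[s2] = alloc[s2], alloc[s1]   # try the swap
            delta = current_cost - ftc(alloc)             # ∆FTC = FTC_old - FTC_new
            alloc[s1], alloc[s2] = alloc[s2], alloc[s1]   # undo it
            if delta > best_delta:
                best_swap, best_delta = (s1, s2), delta
        if best_swap is None:
            break                                  # no improving swap found
        s1, s2 = best_swap                         # 3. execute the swap with max ∆FTC
        alloc[s1], alloc[s2] = alloc[s2], alloc[s1]
        current_cost -= best_delta
    return alloc
```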

  29. A non-solution: simply combining the FT swap algorithm with the graph cut (FT + cut) is not good enough!

  30. Conclusion. Study of the communication patterns of Bing.com: a sparse communication matrix and a very skewed communication pattern. Principled optimization of both BW and FT that exploits these communication patterns and can handle arbitrary fault domains. Reduction in BW: 20-50%. Improvement in FT: 40-120%.

  31. Thanks!
