
Surviving Failures in Bandwidth Constrained Datacenters


Presentation Transcript


  1. Surviving Failures in Bandwidth Constrained Datacenters Authors: Peter Bodik, Ishai Menache, Mosharaf Chowdhury, Pradeepkumar Mani, David A. Maltz, Ion Stoica Presented by: Sneha Arvind Mani

  2. OUTLINE • Introduction • Motivation and Background • Problem Statement • Algorithmic Solutions • Evaluation of the Algorithms • Related Work • Conclusion

  3. Introduction • The main goals of this paper: • Improve the fault tolerance of the deployed applications. • Reduce bandwidth usage in the network core. • How? By optimizing the allocation of applications to physical machines. • Both of the above problems are NP-hard, so the authors formulate a related convex optimization problem that: • Incentivizes spreading the machines of individual services across fault domains. • Adds a penalty term for machine reallocations that increase bandwidth usage.

  4. Introduction (2) • Their algorithm achieves a 20%-50% reduction in bandwidth usage while improving worst-case survival by 40%-120%. • Improvement in fault tolerance – the fraction of services affected by potential hardware failures is reduced by up to a factor of 14. • The contribution of this paper is three-fold: • A measurement study • Algorithms • A methodology

  5. Motivation and Background • Bing.com – a large-scale Web application running in multiple datacenters around the world. • Some definitions used in this paper: • Logical Machine: The smallest logical component of a web application. • Service: A set of many logical machines executing the same code. • Environment: Consists of many services. • Physical Machine: A physical server that can run a single logical machine. • Fault Domain: A set of physical machines that share a single point of failure.

  6. Communication Patterns • Tracing communication between all pairs of servers, aggregated per pair of services i and j, showed that the datacenter network core is highly utilized. • The traffic matrix is very sparse: only 2% of service pairs communicate at all.

  7. Communication Patterns (2) • The communication pattern is very skewed: 0.1% of the communicating service pairs generate 60% of all traffic, and 4.8% of service pairs generate 99% of the traffic. • Services that do not require a lot of bandwidth can be spread out across the datacenter, improving their fault tolerance.

  8. Communication Patterns (3) • The majority of the traffic, 45%, stays within the same service; 23% leaves the service but stays within the same environment; and 23% crosses environments. • The median service talks to nine other services. • Communicating services form both small and large components.

  9. Failure Characteristics • Networking hardware failures cause significant outages. • Redundancy reduces the impact of failures on lost bytes by only 40%. • Power fault domains create non-trivial patterns. • Implications for the optimization framework: it has to consider the complex patterns of the power and networking fault domains, instead of simply spreading the services across several racks, to achieve good fault tolerance.

  10. Problem Statement • Metrics: • Bandwidth (BW): The sum of the rates on the core links – the overall measure of bandwidth usage at the core of the network. • Fault Tolerance (FT): The average of the Worst-Case Survival (WCS) across all services. • Number of Moves (NM): The number of servers that have to be re-imaged to get from the initial datacenter allocation to the proposed allocation. • Optimization: Maximize FT − α·BW subject to NM ≤ N0, where α is a tunable positive parameter and N0 is an upper limit on the number of moves.
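The scalarized objective on this slide can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation; the candidate-allocation records and the names `objective`, `feasible`, and `best_allocation` are hypothetical.

```python
# Sketch of the slide's optimization: maximize FT - alpha*BW subject to NM <= N0.
# The candidate-allocation dictionaries and all names here are illustrative.

def objective(ft, bw, alpha):
    """Scalarized objective from the slide: higher is better."""
    return ft - alpha * bw

def feasible(num_moves, n0):
    """Move-budget constraint NM <= N0."""
    return num_moves <= n0

def best_allocation(candidates, alpha, n0):
    """Pick the best candidate allocation that respects the move budget."""
    ok = [c for c in candidates if feasible(c["nm"], n0)]
    return max(ok, key=lambda c: objective(c["ft"], c["bw"], alpha))
```

With α = 0.1, a candidate with FT 0.6 and BW 9 beats one with FT 0.5 and BW 10, and a high-FT candidate that needs too many moves is never considered.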

  11. Algorithmic Solutions • The solution roadmap is as follows: • Cells – subsets of physical machines that belong to exactly the same fault domains. Working at cell granularity reduces the size of the optimization problem. • Fault Tolerance Cost (FTC) has a convex structure, so minimizing FTC improves FT. • Bandwidth is optimized by performing a minimum k-way cut on the communication graph. • CUT+FT+BW consists of two phases: • A minimum k-way cut to compute an initial assignment that minimizes bandwidth at the network core. • Iterative machine moves to improve FT. • FT+BW does not perform a graph cut; it starts with the current allocation and improves it by greedy moves that reduce the weighted sum of BW and FTC.

  12. Formal Definitions • I – the indicator function: I(n1, n2) = 1 if traffic from machine n1 to machine n2 traverses a core link, and I(n1, n2) = 0 otherwise. • Bandwidth is given by: BW = Σ_(n1,n2) I(n1, n2) · B_(k1,k2), where B_(k1,k2) is the required bandwidth between a pair of machines from services k1 and k2. • To define FT, let z_(k,j) be the total number of machines allocated to service k that are affected by fault j, and m_k the number of machines of service k. FT is the average worst-case survival: FT = (1/K) Σ_k (m_k − max_j z_(k,j)) / m_k • K – the total number of services.

  13. Formal Definitions (2) • Fault Tolerance Cost (FTC) is given by: FTC = Σ_k Σ_j b_k · w_j · z_(k,j)², where b_k and w_j are positive weights assigned to services and faults. • A decrease in FTC should increase FT: squaring the z_(k,j) variables incentivizes keeping their values small, which is achieved by spreading the machine assignment across multiple fault domains. • Minimization of BW is based on a minimum k-way cut, which partitions the logical machines into a given number of clusters.
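The three quantities defined above can be written down concretely. This is a small sketch under assumed data structures: the machine-pair list, the per-service-pair rate table, and the `crosses_core` predicate (standing in for the indicator I) are illustrative, not from the paper.

```python
# Illustrative computation of BW, FT, and FTC from an allocation.
# z[k][j] = number of machines of service k affected by fault domain j;
# machines_per_service[k] = m_k. All data shapes are assumptions.

def bandwidth(pairs, rate, crosses_core):
    # BW = sum over machine pairs of I(n1, n2) * B_{k(n1), k(n2)}
    return sum(rate[(k1, k2)]
               for (n1, k1), (n2, k2) in pairs
               if crosses_core(n1, n2))

def fault_tolerance(z, machines_per_service):
    # FT = (1/K) * sum_k WCS_k, with WCS_k = (m_k - max_j z_{k,j}) / m_k
    wcs = [(machines_per_service[k] - max(z[k].values())) / machines_per_service[k]
           for k in z]
    return sum(wcs) / len(wcs)

def ftc(z, b, w):
    # FTC = sum_{k,j} b_k * w_j * z_{k,j}^2  (convex in z)
    return sum(b[k] * w[j] * cnt ** 2 for k in z for j, cnt in z[k].items())
```

For a service with 3 machines, 2 of which share a fault domain, the worst single fault leaves 1 machine, so its WCS is 1/3; spreading the machines 1-1-1 would raise it to 2/3 and also lower FTC (1+1+1 vs. 4+1).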

  14. Algorithms to Improve Both BW & FT • CUT+FT: Apply CUT in the first phase, then minimize FTC in the second phase using machine swaps. • CUT+FT+BW: As above, but in the second phase a penalty term for bandwidth is added, i.e., a swap is scored by ΔFTC + α·ΔBW, where α is the weighting factor. • NM-aware algorithm – FT+BW: Start with the initial allocation and perform only the second phase of CUT+FT+BW.
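The second phase shared by these algorithms is a greedy swap loop. A hedged sketch: here `cost` stands in for the weighted score (e.g., FTC + α·BW) as any callable, and the machine-to-service allocation format is an assumption, not the authors' data model.

```python
import random

# Illustrative greedy second phase in the spirit of FT+BW: repeatedly try a
# machine swap and keep it only if it lowers cost(alloc, alpha), where cost
# plays the role of the weighted sum FTC + alpha*BW from the slide.

def greedy_swaps(alloc, cost, alpha, rounds=100, seed=0):
    rng = random.Random(seed)
    machines = list(alloc)
    current = cost(alloc, alpha)
    for _ in range(rounds):
        a, b = rng.sample(machines, 2)
        alloc[a], alloc[b] = alloc[b], alloc[a]      # tentative swap
        new = cost(alloc, alpha)
        if new < current:                            # keep improving swaps
            current = new
        else:
            alloc[a], alloc[b] = alloc[b], alloc[a]  # revert
    return alloc, current
```

Because each accepted swap strictly lowers the score, the loop is a simple descent; the slides note that when the score is the convex FTC alone, descent until convergence reaches the global minimum.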

  15. Scaling to Large Datacenters • CUT+RandLow – an algorithm that directly exploits the skewness of the communication matrix: apply the cut in the first phase, determine the subset of services whose aggregate bandwidth is lower than the others', then randomly permute the machine allocation of all services in that subset. • To scale to large datacenters, the algorithms sample a large number of candidate swaps and choose the one that most improves FTC. • Also, during the graph cut, logical machines of the same service are grouped into a smaller number of representative nodes.
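The random-spreading step of CUT+RandLow can be sketched as follows. The slot-to-service `placement` map, the `service_bw` table, and the threshold are hypothetical stand-ins for the paper's data.

```python
import random

# Illustrative sketch of the CUT+RandLow spreading step: machine slots held
# by services whose aggregate bandwidth is below a threshold are randomly
# permuted among themselves, while high-talking services stay where the
# graph cut placed them.

def spread_low_talkers(placement, service_bw, threshold, seed=0):
    rng = random.Random(seed)
    low = [slot for slot, svc in placement.items()
           if service_bw[svc] < threshold]
    shuffled = low[:]
    rng.shuffle(shuffled)
    new = dict(placement)
    # permute the machine slots among the low-traffic services only
    for src, dst in zip(low, shuffled):
        new[dst] = placement[src]
    return new
```

This works precisely because of the skew measured earlier: most services transfer little data, so scattering them across fault domains barely changes core bandwidth while improving their fault tolerance.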

  16. Evaluation of Algorithms • CUT+FT+BW: When ignoring server moves, it achieves a 30%-60% reduction in BW usage while improving FT by 40%-120%. • FT+BW comes close to CUT+FT+BW even though it performs only steepest-descent moves; it can be used in scenarios where the number of concurrent server moves is limited. • The random allocation in CUT+RandLow works well because many services transfer relatively little data and can be spread randomly across the datacenter.

  17. Methodology to Evaluate • The following information is needed to perform the evaluation: • Network topology of a cluster • Services running in the cluster and the list of machines required by each service • List of fault domains and the machines in each fault domain • Traffic matrix for the services in the cluster • The algorithms are compared on their entire achievable tradeoff boundaries instead of single performance points.

  18. Comparing Different Algorithms • The solid circles represent the FT and BW of the starting allocation (at the origin), after BW-only optimization (bottom-left corner), and after FT-only optimization (top-right corner).

  19. Optimizing for Both BW and FT • Artificially partitioning each service into several subgroups did not lead to satisfactory results. • Augmenting the cut procedure with "spreading" requirements for services did not scale to large applications. • CUT+FT: The graph is plotted by increasing the number of server swaps. • By changing the number of swaps, the tradeoff between FT and BW can be controlled. • The formulation is convex, so performing steepest descent until convergence leads to the global minimum w.r.t. fault tolerance.

  20. Optimizing for Both BW and FT (2) • CUT+FT+BW: Depends on α – the higher the value of α, the more weight is placed on improving BW at the cost of not improving FT. • It does not optimize over a convex function, so it is not guaranteed to reach the global optimum. • CUT+RandLow: Performs close to CUT+FT+BW, but optimizes neither the BW of low-talking services nor the FT of high-talking ones.

  21. These graphs show the tradeoff boundary between FT and BW for the different algorithms across three more datacenters.

  22. Optimizing for BW, FT and NM • Significant improvements are achieved by moving just 5% of the cluster; moving 29% of the cluster achieves results similar to moving most of the machines using CUT+FT+BW.

  23. When run until convergence, FT+BW achieves results close to CUT+FT+BW even without the graph cut. • This is significant because it means FT+BW can be used incrementally and still reach performance similar to CUT+FT+BW, which reshuffles the whole datacenter.

  24. Improvements in FT & BW • For α = 0.1, FT+BW reduced BW usage by 26% while improving FT by 140%; FT was reduced for only 2.7% of services, far fewer than for α = 1.0. • For α = 1.0, FT+BW reduced core BW usage by 47% and improved average FT by 121%.

  25. Additional Scenarios • Optimization of bandwidth across multiple layers. • Preparing for maintenance and online recovery. • Adapting to changes in traffic patterns. • Hard constraints on fault tolerance and placement. • Multiple logical machines on a server.

  26. Related Work • Datacenter traffic analysis • Datacenter resource allocation • Virtual network embedding • High availability in distributed systems • VPN and network testbed allocation

  27. Conclusion • The analysis shows that the communication volume between pairs of services has a long tail, with the majority of traffic generated by a small fraction of service pairs. • This allows the optimization algorithm to spread most of the services across fault domains without significantly increasing BW usage in the core.

  28. Thank You!
