
Limiting the Impact of Failures on Network Performance Joint work with Supratik Bhattacharyya and Christophe Diot High Performance Networking Group, 25 Feb. 2004 Yashar Ganjali Computer Systems Lab. Stanford University yganjali@stanford.edu http://www.stanford.edu/~yganjali

Motivation • The core of the Internet consists of several large networks (IP backbones). • IP backbones are carefully provisioned to guarantee low latency and jitter for packet delivery. • Failures occur on a daily basis as a result of physical layer malfunction, router hardware/software failures, maintenance, human errors, … • Failures affect the quality of service delivered to backbone customers.

Outline • Background • Sprint’s IP backbone • Data • Impact Metrics • Time-based metrics • Link-based metrics • Measurements • Reducing the impact • Identifying critical failures • Causes analysis • Reducing critical failures

Background – Sprint’s IP backbone • IP layer operates above DWDM with SONET framing. • IS-IS protocol used to route traffic inside the network. • IP-level restoration • When an IP link fails, all routers in the network independently compute a new path around the failure • No protection in the underlying optical infrastructure.

Data • IS-IS Link State PDU logs • Collected by passive listeners from Sprint’s North America backbone. • Feb. 1st, 2003 to Jun. 30th, 2003. • SNMP logs • Link loads recorded once every 5 minutes. • SONET layer alarms • Corresponding to minor and major problems in the optical layer. • We are only interested in two alarms: SLOS and SLOS cleared.

Link Failures in Sprint’s IP Backbone – 9408 Failures

Inter-POP vs. Intra-POP [figure: example POP with routers ANA-1, ANA-2, ANA-3, ANA-4]

Inter-POP Link Failures in Sprint’s IP Backbone

Two Perspectives • For a given impact metric • Time-based analysis: Measure the impact of failures on the given metric as a function of time. • Link-based analysis: Measure the impact of failures on the given metric as a function of failing links.

Time-based Impact Metrics • Number of Simultaneous Link Failures • Number of affected O-D pairs • Number of affected BGP prefixes • Path unavailability • Total rerouted traffic • Maximum load

Number of Simultaneous Failures
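The number of simultaneous link failures as a function of time can be sketched with a simple sweep over failure start and end events. This is only an illustration: the function name `simultaneous_failures` and the `(start, end)` interval representation are assumptions, and the example intervals are made up, not taken from the Sprint data set.

```python
from typing import List, Tuple

def simultaneous_failures(failures: List[Tuple[float, float]]) -> List[Tuple[float, int]]:
    """Return a step series [(time, number of links down)] from (start, end) intervals."""
    events = []
    for start, end in failures:
        events.append((start, 1))    # a link goes down
        events.append((end, -1))     # a link comes back up
    # Sort by time; at equal times, recoveries (-1) are processed before failures (+1).
    events.sort()
    series: List[Tuple[float, int]] = []
    current = 0
    for t, delta in events:
        current += delta
        if series and series[-1][0] == t:
            series[-1] = (t, current)   # merge events that share a timestamp
        else:
            series.append((t, current))
    return series

# Hypothetical failure intervals in seconds (illustrative only).
example = [(0, 10), (5, 15), (8, 9), (20, 25)]
series = simultaneous_failures(example)
peak = max(count for _, count in series)
print(series, peak)
```

The peak of the series gives the worst-case number of links that were down at the same time.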

Number of Affected O-D Pairs [figure: example topology with nodes A, B, C, D, E, F]

Number of Affected O-D Pairs
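One way to count affected O-D pairs is sketched below, under the assumption that a pair is affected when its pre-failure shortest path crosses the failed link. The slides do not give the exact definition or the edges of the example topology, so the hop-count routing, the helper names, and the six-node topology here are all hypothetical.

```python
from collections import deque
from itertools import combinations

def shortest_path(adj, src, dst):
    """Hop-count shortest path via BFS; returns a list of nodes, or None if unreachable."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in sorted(adj[u]):
            if v not in prev:
                prev[v] = u
                queue.append(v)
    if dst not in prev:
        return None
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def affected_od_pairs(adj, failed_link):
    """Unordered O-D pairs whose pre-failure shortest path uses failed_link (assumed definition)."""
    a, b = failed_link
    affected = []
    for src, dst in combinations(sorted(adj), 2):
        path = shortest_path(adj, src, dst)
        if path is None:
            continue
        hops = set(zip(path, path[1:]))
        if (a, b) in hops or (b, a) in hops:
            affected.append((src, dst))
    return affected

# Hypothetical topology: ring A-B-C-D-E-A with F attached to A.
topology = {
    "A": {"B", "E", "F"}, "B": {"A", "C"}, "C": {"B", "D"},
    "D": {"C", "E"}, "E": {"D", "A"}, "F": {"A"},
}
print(affected_od_pairs(topology, ("A", "B")))
```

A real backbone would use IS-IS link weights rather than hop counts, so a weighted shortest-path computation would replace the BFS.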

Number of Affected BGP Prefixes

Path Unavailability [figure: example topology with nodes A, B, C, D, E, F]

Path Unavailability

Total Rerouted Traffic

Maximum Load Throughout the Network

Maximum Load Throughout the Network 96% of link failures were not followed by an immediate change in maximum load.

Number of Failures per Link

Number of Affected O-D Pairs per Link

Number of Affected BGP Prefixes per Link

Path Coverage [figure: example topology with nodes A, B, C, D, E, F]

Path Coverage of Links

Total Rerouted Traffic on a Link

Peak Factor of a Link

Link-based Impact Metrics • Number of Link Failures • Number of affected O-D pairs • Number of affected BGP prefixes • Path coverage • Total rerouted traffic • Peak factor

Critical Failures • For each time-based metric • Removing failures occurring during 1-5% of the time improves the metric by a factor of at least 5. • For each link-based metric • Removing failures on 1-7% of links improves the metric by a factor of at least 3.

Critical Time Periods

Critical Links • Any link that has a critical failure is called a critical link. • We are interested in fixing such links.

Correlation of Critical Sets

Correlation of the Critical Sets Overall 23% of all links are critical.
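As a rough sketch of how the critical sets of different metrics might be compared, the snippet below takes one set of critical links per metric, forms their union, and computes the overall critical fraction. The pairwise overlap measure (Jaccard index) is an assumption, since the slides do not state how correlation was measured; the metric names and link IDs are hypothetical.

```python
def critical_summary(critical_by_metric, all_links):
    """Union of per-metric critical link sets, overall fraction, and pairwise overlap."""
    union = set().union(*critical_by_metric.values())
    fraction = len(union) / len(all_links)
    overlap = {}
    names = sorted(critical_by_metric)
    for i, m1 in enumerate(names):
        for m2 in names[i + 1:]:
            a, b = critical_by_metric[m1], critical_by_metric[m2]
            # Jaccard index: an assumed stand-in for "correlation" between two sets.
            overlap[(m1, m2)] = len(a & b) / len(a | b) if (a | b) else 0.0
    return union, fraction, overlap

# Hypothetical critical sets over a ten-link network (illustrative only).
links = [f"link{i}" for i in range(10)]
critical = {
    "od_pairs": {"link0", "link1"},
    "rerouted_traffic": {"link1", "link2"},
}
union, fraction, overlap = critical_summary(critical, links)
print(sorted(union), fraction, overlap)
```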

Cause Analysis • Markopoulou et al. used IS-IS update messages to classify link failures into the following categories [MIB+04]. • Maintenance • Unplanned: • Shared failures (router-related, optical-related, unspecified) • Individual failures (about 70% of all unplanned failures)

Matching SLOS Alarms with IP Link Failures [figure: timeline of an IP link failure against the SLOS and SLOS-cleared alarms, with intervals labeled ~20 ms and ~12 sec] • 58% of all link failures are due to optical layer problems. • 84% of critical failures are due to optical layer problems.
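Matching SLOS alarms to IP link failures can be sketched as a per-link timestamp join: a failure is attributed to the optical layer if an SLOS alarm on the same link falls within a small time window of it. The `window` parameter, the record layout, and the function name are assumptions for illustration; the slides do not specify the matching rule used.

```python
from bisect import bisect_left

def optical_failures(ip_failures, slos_alarms, window=1.0):
    """Return the IP link failures that have an SLOS alarm on the same link within +/- window seconds.

    ip_failures: list of (link_id, failure_time)
    slos_alarms: dict mapping link_id -> sorted list of SLOS alarm times
    window: matching tolerance in seconds (hypothetical value)
    """
    matched = []
    for link, t in ip_failures:
        times = slos_alarms.get(link, [])
        i = bisect_left(times, t - window)       # first alarm at or after t - window
        if i < len(times) and times[i] <= t + window:
            matched.append((link, t))
    return matched

# Hypothetical records (illustrative only, not from the Sprint logs).
failures = [("L1", 100.0), ("L1", 500.0), ("L2", 200.0)]
alarms = {"L1": [99.98, 300.0], "L2": []}
matched = optical_failures(failures, alarms)
print(matched, len(matched) / len(failures))
```

The share of matched failures is the kind of quantity the 58% and 84% figures report; the window would need tuning against the observed delay between the optical alarm and the IS-IS update.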

Reducing Critical Failures • Replace old optical fibers/parts. • Optical Protection. • Push the traffic away. • Also works for maximum load and peak factor.

Performance Improvement

Reducing Link Down-time • Low-failure links: • Failures are very rare. • Damping doesn’t help. • High-failure links: • Failure rate changes very slowly. • Fixed damping is wasteful.

Adaptive Damping
(The original symbols were lost in extraction; Δ, T and α below are stand-in names.)
Output: ADT: adaptive damping timer
Input: Δ: time difference between the last two failures
       T: threshold
       α: constant
function Adaptive_Damping
begin
  if (Δ < T) ADT := α × [operand lost];
  else ADT := 0;
end;
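The adaptive damping rule above might look as follows in Python. The slide's symbols did not survive extraction, so the parameter names `delta`, `threshold` and `c` are stand-ins, and the choice to scale the timer by the inter-failure time (rather than by the threshold) is an assumption, not something the source confirms.

```python
def adaptive_damping(delta: float, threshold: float, c: float) -> float:
    """Return the adaptive damping timer (ADT) in the same time unit as delta.

    delta: time difference between the last two failures on the link
    threshold: failures spaced further apart than this are not damped
    c: scaling constant
    """
    if delta < threshold:
        # ASSUMPTION: the timer scales with the observed inter-failure time.
        return c * delta
    return 0.0

# Hypothetical values: failures 10 s apart, 60 s threshold, constant 2.
print(adaptive_damping(10.0, 60.0, 2.0))
print(adaptive_damping(120.0, 60.0, 2.0))
```

Under this reading, a link that fails rarely gets no damping, while a frequently failing link is held down for a period proportional to how closely its failures are spaced.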

Number – Duration Pareto Curve

Thank you!
