
Limiting the Impact of Failures on Network Performance. Joint work with Supratik Bhattacharyya and Christophe Diot. High Performance Networking Group, 25 Feb. 2004. Yashar Ganjali, Computer Systems Lab., Stanford University. yganjali@stanford.edu, http://www.stanford.edu/~yganjali


Presentation Transcript


  1. Limiting the Impact of Failures on Network Performance • Joint work with Supratik Bhattacharyya and Christophe Diot • High Performance Networking Group, 25 Feb. 2004 • Yashar Ganjali, Computer Systems Lab., Stanford University • yganjali@stanford.edu • http://www.stanford.edu/~yganjali

  2. Motivation • The core of the Internet consists of several large networks (IP backbones). • IP backbones are carefully provisioned to guarantee low latency and jitter for packet delivery. • Failures occur on a daily basis as a result of physical layer malfunction, router hardware/software failures, maintenance, human errors, … • Failures affect the quality of service delivered to backbone customers.

  3. Outline • Background • Sprint’s IP backbone • Data • Impact Metrics • Time-based metrics • Link-based metrics • Measurements • Reducing the impact • Identifying critical failures • Causes analysis • Reducing critical failures

  4. Background – Sprint’s IP backbone • The IP layer operates above DWDM with SONET framing. • The IS-IS protocol is used to route traffic inside the network. • IP-level restoration: when an IP link fails, all routers in the network independently compute a new path around the failure. • There is no protection in the underlying optical infrastructure.
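The IP-level restoration described on slide 4 can be illustrated with a minimal sketch: each router reruns a shortest-path computation on the topology with the failed link removed. The four-node topology, the link weights, and the function name `shortest_path` are hypothetical and are not taken from Sprint's network or from the slides.

```python
import heapq

def shortest_path(graph, src, dst, failed=frozenset()):
    """Dijkstra on an undirected weighted graph, ignoring failed links.

    graph: {node: {neighbor: weight}}; failed: set of frozenset({u, v}).
    Returns the node list of a shortest path, or None if dst is unreachable.
    """
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            if frozenset((u, v)) in failed:
                continue  # this link is down
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical toy topology (not from the slides).
graph = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "D": 1},
    "C": {"A": 1, "D": 2},
    "D": {"B": 1, "C": 2},
}
before = shortest_path(graph, "A", "D")
after = shortest_path(graph, "A", "D", failed={frozenset(("B", "D"))})
```

In this toy case the path moves from A-B-D to A-C-D once the B-D link is marked as failed.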

  5. Data • IS-IS Link State PDU logs • Collected by passive listeners from Sprint’s North America backbone. • Feb. 1st, 2003 to Jun. 30th, 2003. • SNMP logs • Link loads recorded once every 5 minutes. • SONET layer alarms • Corresponding to minor and major problems in the optical layer. • We are only interested in two alarms: SLOS and SLOS cleared.

  6. Link Failures in Sprint’s IP Backbone – 9408 Failures

  7. Inter-POP vs. Intra-POP (figure: routers ANA-1, ANA-2, ANA-3, ANA-4)

  8. Outline • Background • Sprint’s IP backbone • Data • Impact Metrics • Time-based metrics • Link-based metrics • Measurements • Reducing the impact • Identifying critical failures • Causes analysis • Reducing critical failures

  9. Inter-POP Link Failures in Sprint’s IP Backbone

  10. Two Perspectives • For a given impact metric • Time-based analysis: Measure the impact of failures on the given metric as a function of time. • Link-based analysis: Measure the impact of failures on the given metric as a function of failing links.

  11. Time-based Impact Metrics • Number of Simultaneous Link Failures • Number of affected O-D pairs • Number of affected BGP prefixes • Path unavailability • Total rerouted traffic • Maximum load

  12. Number of Simultaneous Failures

  13. Number of Simultaneous Failures
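As a rough illustration of this time-based metric, the number of links down at each instant can be computed with a sweep over failure start and end events. This is a sketch under assumptions: the input format (one `(start, end)` interval per failure) and the sample intervals are hypothetical, not the format of the IS-IS logs.

```python
def simultaneous_failures(intervals):
    """Return [(time, count)]: number of links down from `time` onward.

    intervals: list of (start, end) failure intervals, end exclusive.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a link goes down
        events.append((end, -1))    # a link comes back up
    events.sort()  # at equal times, recoveries (-1) sort before failures (+1)
    steps = []
    count = 0
    for t, change in events:
        count += change
        if steps and steps[-1][0] == t:
            steps[-1] = (t, count)  # merge events sharing a timestamp
        else:
            steps.append((t, count))
    return steps

# Hypothetical failure intervals, in seconds.
steps = simultaneous_failures([(0, 10), (5, 15), (12, 20)])
peak = max(count for _, count in steps)
```

Here `steps` is a step function of time, and `peak` is the largest number of links that were down at once.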

  14. Time-based Impact Metrics • Number of Simultaneous Link Failures • Number of affected O-D pairs • Number of affected BGP prefixes • Path unavailability • Total rerouted traffic • Maximum load

  15. Number of Affected O-D Pairs (figure: example topology with nodes A to F)

  16. Number of Affected O-D Pairs
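The slides do not spell out when an origin-destination (O-D) pair counts as affected. The sketch below assumes a pair is affected when its pre-failure path traverses the failed link; the paths and the function name `affected_od_pairs` are hypothetical.

```python
def affected_od_pairs(paths, failed_link):
    """Return the O-D pairs whose path uses the failed (undirected) link.

    paths: {(origin, destination): [node, node, ...]} pre-failure paths.
    failed_link: (u, v) endpoints of the failed link.
    """
    failed = frozenset(failed_link)
    affected = set()
    for od, path in paths.items():
        links = {frozenset(hop) for hop in zip(path, path[1:])}
        if failed in links:
            affected.add(od)
    return affected

# Hypothetical pre-failure paths (assumption, not from the slides).
paths = {
    ("A", "D"): ["A", "B", "D"],
    ("A", "C"): ["A", "C"],
    ("B", "D"): ["B", "D"],
    ("C", "D"): ["C", "D"],
}
hit = affected_od_pairs(paths, ("B", "D"))
```

Counting `len(hit)` per failure gives one data point of the metric; repeating it over time or over links gives the time-based or link-based view.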

  17. Time-based Impact Metrics • Number of Simultaneous Link Failures • Number of affected O-D pairs • Number of affected BGP prefixes • Path unavailability • Total rerouted traffic • Maximum load

  18. Number of Affected BGP Prefixes

  19. Time-based Impact Metrics • Number of Simultaneous Link Failures • Number of affected O-D pairs • Number of affected BGP prefixes • Path unavailability • Total rerouted traffic • Maximum load

  20. Path Unavailability (figure: example topology with nodes A to F)

  21. Path Unavailability

  22. Time-based Impact Metrics • Number of Simultaneous Link Failures • Number of affected O-D pairs • Number of affected BGP prefixes • Path unavailability • Total rerouted traffic • Maximum load

  23. Total Rerouted Traffic

  24. Time-based Impact Metrics • Number of Simultaneous Link Failures • Number of affected O-D pairs • Number of affected BGP prefixes • Path unavailability • Total rerouted traffic • Maximum load

  25. Maximum Load Throughout the Network

  26. Maximum Load Throughout the Network • 96% of link failures were not followed by an immediate change in maximum load.

  27. Time-based Impact Metrics • Number of Simultaneous Link Failures • Number of affected O-D pairs • Number of affected BGP prefixes • Path unavailability • Total rerouted traffic • Maximum load

  28. Number of Failures per Link

  29. Number of Affected O-D Pairs per Link

  30. Number of Affected BGP Prefixes per Link

  31. Path Coverage (figure: example topology with nodes A to F)

  32. Path Coverage of Links

  33. Total Rerouted Traffic on a Link

  34. Peak Factor of a Link

  35. Link-based Impact Metrics • Number of Link Failures • Number of affected O-D pairs • Number of affected BGP prefixes • Path coverage • Total rerouted traffic • Peak factor

  36. Outline • Background • Sprint’s IP backbone • Data • Impact Metrics • Time-based metrics • Link-based metrics • Measurements • Reducing the impact • Identifying critical failures • Causes analysis • Reducing critical failures

  37. Critical Failures • For each time-based metric • Removing the failures occurring during 1-5% of the time improves the metric by a factor of at least 5. • For each link-based metric • Removing the failures on 1-7% of the links improves the metric by a factor of at least 3.

  38. Critical Time Periods

  39. Critical Links • Any link that has a critical failure is called a Critical Link. • We are interested in fixing such links.

  40. Correlation of Critical Sets

  41. Correlation of the Critical Sets • Overall, 23% of all links are critical.

  42. Cause Analysis • Markopoulou et al. have used IS-IS update messages to classify link failures into the following categories [MIB+04]. • Maintenance • Unplanned • Shared failures: router-related, optical-related, unspecified • Individual failures: about 70% of all unplanned failures

  43. Matching SLOS Alarms with IP Link Failures (figure: timeline showing SLOS, IP link failure, and SLOS cleared, annotated with ~20 ms and ~12 s) • 58% of all link failures are due to optical layer problems. • 84% of critical failures are due to optical layer problems.
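One way to picture the matching step is to pair each IP link failure with a SLOS alarm on the same link that falls within a small time window. This is only a sketch: the record format, the shared link identifiers, and the `window_s` tolerance are assumptions, and the slides do not state the actual matching rule.

```python
def match_failures_to_alarms(failures, alarms, window_s):
    """Return the failures that have a SLOS alarm on the same link
    within window_s seconds.

    failures, alarms: lists of (link_id, timestamp_s).
    """
    alarm_times = {}
    for link, t in alarms:
        alarm_times.setdefault(link, []).append(t)
    matched = []
    for link, t in failures:
        if any(abs(t - a) <= window_s for a in alarm_times.get(link, [])):
            matched.append((link, t))
    return matched

# Hypothetical records (assumption, not measured data).
failures = [("L1", 100.0), ("L2", 200.0), ("L1", 500.0)]
alarms = [("L1", 99.99), ("L2", 350.0)]
matched = match_failures_to_alarms(failures, alarms, window_s=1.0)
optical_share = len(matched) / len(failures)
```

The share of matched failures is the kind of quantity the slide reports as a percentage; the values produced here come from made-up inputs and say nothing about the real data.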

  44. Reducing Critical Failures • Replace old optical fibers/parts. • Optical Protection. • Push the traffic away. • Also works for maximum load and peak factor.

  45. Performance Improvement

  46. Reducing Link Down-time • Low-failure links: • Failures are very rare. • Damping doesn’t help. • High-failure links: • Failure rate changes very slowly. • Fixed damping is wasteful.

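  47. Adaptive Damping • Output: ADT, the adaptive damping timer. • Input: Δ, the time difference between the last two failures; T, a threshold; c, a constant. (The input symbols did not survive extraction; Δ, T, and c are stand-in names.) • Pseudocode:
      function Adaptive_Damping
      begin
        if (Δ < T) ADT := [symbol lost] × [symbol lost];
        else ADT := 0;
      end;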
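The adaptive damping rule on slide 47 can be sketched as follows. The operands of the product in the slide's pseudocode were lost in extraction, so this sketch does not guess them: the caller passes the scaled quantity explicitly as `base_s`. All parameter names are hypothetical.

```python
def adaptive_damping_timer(delta_s, threshold_s, constant, base_s):
    """Return the adaptive damping timer (ADT) in seconds.

    delta_s:     time between the last two failures of the link
    threshold_s: failures closer together than this trigger damping
    constant:    multiplier applied when damping is triggered
    base_s:      quantity scaled by the constant (ASSUMPTION: the slide's
                 operand is lost, so it is left to the caller)
    """
    if delta_s < threshold_s:
        return constant * base_s
    return 0.0  # failures are far apart: no damping

# Hypothetical values: two failures 10 s apart, 60 s threshold.
damped = adaptive_damping_timer(10.0, 60.0, 2.0, base_s=10.0)
undamped = adaptive_damping_timer(600.0, 60.0, 2.0, base_s=600.0)
```

This matches the previous slide's point: rarely failing links get no damping at all, while links that fail in quick succession are held down for a time that is computed rather than fixed.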

  48. Number – Duration Pareto Curve

  49. Thank you!
