1 / 62

Fault management and WDM network survivability

Fault management and WDM network survivability. Lena Wosinska, The Royal Institute of Technology KTH lena@it.kth.se. Outline. Motivation Fault detection and isolation Fault recovery Terminology and concepts Connection availability modeling Protection taxonomy

noreen
Télécharger la présentation

Fault management and WDM network survivability

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Fault managementandWDM network survivability Lena Wosinska, The Royal Institute of Technology KTH lena@it.kth.se

  2. Outline • Motivation • Fault detection and isolation • Fault recovery • Terminology and concepts • Connection availability modeling • Protection taxonomy • Optical layer protection, basics • NP, NE and TE • Connection availability in WDM mesh networks with multiple failures • Introduction • Shared path protection • Our model • Assumptions and connection availability calculation • Evaluation of the algorithm • Conclusions

  3. Motivation (1/2) IP Router ADM S OXC D Single-cable cut: 864 fibers/cable * 160/fiber * OC192/  1.38 Pbps

  4. Motivation (2/2) • The reliability and fault management aspects of optical high capacity networks are extremely important. • Critical infrastructure • Reliability performance of network elements can not guarantee a sufficient availability of optical channel • Survivability mechanisms in optical layer are required. • Maintaining acceptable level of survivability while reducing network protection costs has become an important challenge for network planners and engineers. 

  5. Service classes • Service Availability • Differentiated services • Customer paying more can get better quality of service • “Availability-aware provisioning”

  6. Fault detection and isolation • BER monitoring • Requires access to the bit stream • In transponders and regenerators • Requires known bit pattern or check sums • Optical path trace • An identification of the lightpath • For instance provided by a pilot tone • Alarm management • Forward and backward defect indicators (FDIs and BDIs) • FDI for a section can lead to FDIs for larger sections downstream • Rest of the path

  7. Monitoring methods • Physical layer monitoring using optical time domain reflectometer (OTDR) technology • Optical signal-to-noise ratio (OSNR) using optical spectrum analyzer (OSA) technology • Bit error rate (BER) monitoring using SONET/SDH BER test technology

  8. Fault diagnosis • Hard failures: unexpected events that suddenly interrupt the established channels. • Soft failures: events that progressively degrade the quality of transmission.

  9. Layers within the optical layer OMS OMS OMS OTS OTS OCh-S OCh-S OCh-P • OTS: Optical transmission section • OMS: Optical multiplexing section • OCh-TS: Optical channel transparent section (i.e. all-optical section) • OCh-S: Optical channel section • OCh-P:Optical channel path

  10. Alarm management • Single failure might cause many alarms • Alarm suppression necessary • Forward defect indicator • Backward defect indicator • Defect indicators used for protection switching

  11. Fault recovery: terminology and concepts • Protection, restoration and survivability, • Working and protection paths • Often disjoint paths • Node disjoint or link disjoint • Dedicated and shared protection paths • Revertive and non-revertive schemes • Revertive schemes for shared paths • Protection switching • Path, span and ring switching

  12. Protection vs. restoration (1/2) • Protection • Pre-assigned capacity • Can be very fast • Protection: 1+1 (dedicated), 1:1 and 1:N (shared) • Restoration • Dynamically provisioning of spare resources in the network after failure occurs. • Based on the use of capacity available in the network to replace a failed transport entity. • Network survivability • Network’s ability to continue to provide service in the presence of failures. • Applying protection or restoration. • Usually, networks are designed to survive a single failure

  13. Protection vs. restoration (2/2) Protection Restoration Backup resources are Pre-computed and reserved in advance Backup resources are dynamically discovered upon failure • Advantages • Guaranteed recovery after single failure • Shorter recovery time • Disadvantages • Backup resources “wasted” Disadvantages No guarantee on recovery (backup resources may not be found) Longer recovery time Suitable for layer 3 (IP packet switching) Suitable for lower layers Ring Protection Mesh Protection

  14. Different Fault Recovery Schemes Fault Recovery Schemes Protection Schemes Restoration Schemes Link-based Restoration Path-based Restoration Link-based Protection Path-based Protection Shared Protection Shared Protection Dedicated Protection Dedicated Protection

  15. Protection • Path protection • Find shortest link or node disjoint path pair. • Algorithm: e.g., Bhandari’s algorithm using modified Dijkstra algorithm • Link (span) protection • Find the shortest path bridging working span. • Algorithm: e.g., Dijkstra single source shortest path algorithm. • P-cycle(preconfigured protection-cycle) • A sort of shared span protection • Shortest path as working path. • Find candidate cycles with a hop limit of 10. • Minimum Spare Capacity for 100% restorability. • Mixed Integer Program (MIP) problem formulation. • Make use of optimization tool.

  16. EXAMPLE Link protection Path protection P-cycles Seattle NewYork

  17. Survivable network topologies Standby line Point-to point Mesh Working line S H R

  18. Self healing ring (SHR): fiber break

  19. S H R: nodefailure

  20. Mesh recovery: Link protection

  21. Mesh recovery: Path protection

  22. Connection Availability Modeling • Unprotected connection • Available when all the network components used are available • Dedicated-path protected connection • Available when either the primary path or backup path is available • Shared-path protected connection • More complex…

  23. LINK PROTECTION: availability model • UAdd = UC+UT , UDrop = UC+UR • U1 = U2 = …= U n-1 = LUFC + (L/50 - 1)UOFA, • U1p = U2p = …= U (n-1)p = LpUFC + (Lp/50 - 1)UOFA,

  24. PATH PROTECTION: availability model • UAdd = UC + UT , Udrop = UC + UR • UO = (nO-1)LUFC + (nO-2)Unode + (L/50 - 1)(nO-1)UOFA • UR = (nR-1)LUFC + (nR-2)Unode + (L/50 - 1)(nR-1)UOFA

  25. SHR: availability model • UW = nLUFC + nUnode+ n(L/50 - 1)UOFA • UR(SH) = (n-1)UFC + (n-1)(L/50 - 1)UOFA + nUnode UR(NF) = (n-2)UFC + (n-2)(L/50 - 1)UOFA + (n-1)Unode

  26. Mesh recovery: dedicated vs. shared protection 1:1 Protection 1+1 Protection primary primary 2 2 2 3 3 3 backup backup 1 1 1 4 4 4 Shared Protection primary 1 more expensive 6 6 6 5 5 5 share the Backup b/w on link (6,5) 7 A primary 2 less expensive 8 9 B. Mukherjee: Invited Talk at KTH, Sweden Both primary and backup are carrying “live” traffic Backup activated after failure detected…normally, can carry other low-priority preemptable traffic Different categories of recovery… • 1+1 • 1:1 • Shared • 0:1 Not preemptable • 0:1 Preemptable • “X ms” guaranteed recovery time “Multiplexed” protection… more efficient than 1:1

  27. Example: Path Protection B. Mukherjee: Invited Talk at KTH, Sweden 19 1 2,600 1,000 800 1,200 1,900 11 1,300 950 6 2 1,300 15 20 1,400 1,200 900 1,100 1,000 600 700 1,000 1,000 21 16 1,000 12 1,000 1,000 9 3 7 300 800 250 22 900 1,000 4 850 1,000 600 1,000 1,150 850 800 1,100 950 13 1,000 23 17 10 5 650 8 900 800 1,200 900 850 14 18 24 1,200 900 Overall Fiber Distance = 11,300 Km

  28. Path Protection: Failure Recovery B. Mukherjee: Invited Talk at KTH, Sweden 19 1 2,600 1,000 800 1,200 1,900 11 1,300 950 6 2 1,300 15 20 1,400 1,100 1,200 900 1,000 600 700 1,000 21 16 16 1,000 12 12 1,000 1,000 9 9 3 7 7 alarm alarm 300 alarm alarm 800 250 22 900 1,000 4 850 1,000 600 1,000 1,150 850 800 1,100 950 13 1,000 23 17 10 5 650 8 900 800 1,200 900 850 14 18 24 1,200 900 20 ms 20 ms Failure occurrence Failure detection Failure notification

  29. Path Protection Failure Recovery B. Mukherjee: Invited Talk at KTH, Sweden 19 1 2,600 1,000 800 1,200 1,900 11 1,300 950 6 2 1,300 15 20 1,400 1,200 900 1,100 1,000 600 700 1,000 21 16 16 1,000 12 12 1,000 1,000 9 9 3 7 7 alarm alarm 300 alarm alarm 800 250 22 900 1,000 4 850 1,000 600 1,000 1,150 850 800 1,100 950 13 1,000 23 17 10 5 650 8 900 800 1,200 900 850 14 18 24 1,200 900 20 ms 31.5 ms 20 ms 71.5 ms > 50 ms Failure recovery Failure occurrence Failure detection Failure notification

  30. Topology Link Layer (OMS) (terminology used owing to ITU/ANSI) Path Layer (OCh) Protection taxonomy Dedicated Protection (DP) Shared Protection (SP) Dedicated Protection (DP) Shared Protection (SP) Ring OULSR (optical unidirectional line-switched ring) or OMS-DPRing OBLSR (optical bi-directional line switched ring) or OMS-SPRing OUPSR (optical unidirectional path-switched ring) or OCh-DPRing OBPSR (optical bi-directional path switched ring) or OCh-SPRing Point-to-Point 1+1 linear APS (OMS-DP) 1:1/1:N linear APS (OMS-SP) 1+1 linear APS (OCh-DP) 1:1/1:N linear APS (OCh-SP) Mesh No standard term available No standard term available 1+1 OSNCP 1:N mesh protection

  31. Optical layer protection • Advantages • Faster reaction to optical layer failures • More efficient • Disadvantages • Cannot handle client layer failures • Limits to monitoring due to transparency • Smallest protected unit is a wavelength channel (no grooming) • Interworking with higher layers • Protection paths might be substantially longer than working paths

  32. Restoration • Only for mesh topology • Link Restoration • End nodes of the failed link dynamically discover route around link for each connection; link restoration is faster than path restoration • Path Restoration • The source and destination nodes of each connection independently discover back-up route on end-to-end basis; path restoration is slower than link restoration

  33. TE, NE and NP • Traffic Engineering (TE) • Network Engineering (NE) • Network Planning (NP) • “Put the traffic where the bandwidth is” • “Put the bandwidth where the traffic is” • “Put the bandwidth where the traffic is forecasted to be” • TE – online, dynamic, provisioning problem, ms time scale • NE – intermediate problem, months time scale • NP – offline, static dimensioning problem, 5-yr time scale

  34. Example: NP Problem – Path Protection Network Planning • Given: • Network topology • Static traffic demands • Need/Requirements/Constraints: • Set up a primary path and a backup path for each demand • The two paths must be link disjoint (node disjoint too?) • Dedicated or shared backup (based on problem specs) • Guaranteed to recover from a single fiber cut • Goal: minimize cost, e.g., # of wavelength channel

  35. Example: TE Problem – Restoration Traffic Engineering • Given: • Network topology (including # of wavelengths per fiber) • Dynamic traffic demands (or connections) • Need/Requirements/Constraints: • Set up only one path (primary path) for each connection • Try to quickly restore the connection when a fault occurs • Control signaling (GMPLS?) for connection setup + restoration • Note: This method can handle multiple network failures • Three restoration methods: path, sub-path, link

  36. Cross-layer protection • Multiple layers provide survivability • IP • MPLS • SDH/SONET • Each layer acts independently of other layers • Protect paths on all layers, wasteful • Possible contention and unpredictable behavior • Add coordination or separate timescales

  37. Answer the following questions • Which sublayer within the optical layer would be responsible for handling the following functions: • Setting up and taking down lightpaths in the network • Monitoring and changing the optical wrapper overhead in a lightpath • Rerouting all wavelengths (except the optical supervisory channel) from a failed fiber link onto another fiber link • Detecting a fiber cable cut in a WDM line substem • Detecting failure of an individual lightpath • Detecting bit errors in a lightpath

  38. Connection availability in WDM mesh networks with multiple failures

  39. Motivation • Connection availability requirement for critical services is 99.999% or higher • Survivability mechanisms • Dedicated or shared resources? (c.f. trade-off) • Maintaining acceptable level of survivability while reducing network protection costs has become an important challenge for network planners and engineers.  • Networks are usually designed to provide 100% survivability for a single failure. • In large networks multiple concurrent failures can not be neglected.

  40. Significance of multiple failures Based on: M. To, P. Neusy, “Unavailability Analysis of Long-Haul Networks”, IEEE JSAC, vol. 12., no. 1, January 1994, pp. 100-109

  41. NSFNet topology Number of nodes: 19 Number of links: 28 Number of connections: 171

  42. Some previous results 5-nines: 99.999% (< 5.26 min/year)  cost up to $0.6 M/year 4-nines: 99.99% (< 52.6 min/year)  cost up to $6 M/year 3-nines: 99.9% (< 526 min/year)  cost up to $60 M/year

  43. Aim To develop a model to calculate connection availability in Shared Path Protection mesh networks with multiple failure assumption

  44. Related Works • A few connection availability models that work with multiple failure assumption in SPP scheme have been proposed. • Limited number of faults (double link failures) or • Limited number of protection paths • Limitations are due to the high computational complexity of these models.

  45. Example 1 Multiple concurrent failures One shared protection path Sharing group Sp= {p1, p2, ..., pn}. Connection twill be available if: 1) pis available; or 2) pis unavailable but bis available, and the failure on phappens before failures to the other primary paths in Sp. Availability of connection t: pi: probability that exact i primary paths are unavailable J. Zhang, K.Zhu, H.Zhang and B.Mukherjee, ICC’2003

  46. Validation p = 1000 km b = 3000 km /10km

  47. Example 2: Matrix based model At most two concurrent link failures can occur One (shared) protection path associated with each working path Balance equations: D.A.A. Mello, D.A.Schupke and H.Waldman, Com.Letters, Sept.2005

  48. Comparison with other models 1 2 3 1 2 3 1: D.Arci, D.Petecchi, G.Maier and M. Tornatore, DRCN’2003 2: J. Zhang, K.Zhu, H.Zhang and B.Mukherjee, ICC’2003 3: M. Jaeger and R.Huerselman, ECOC’2004 Results for Italian network. Scenario 1: working paths of different connections do not traverse common links, fiber failure rate for protection paths were assumed twice the fiber failure rate of working paths (200 FIT/km and 100 FIT/km respectively). Scenario 2: working paths of different connections can traverse common links, fiber failure rate = 200 FIT/km.

  49. Shared Path Protection • Assumptions: • only link failures are considered • a link is either available or failed (two states) • all fiber links fail independently • no wavelength conversion • protection paths can share a wavelength on a common link only if the corresponding working paths are route disjoint

  50. Sharing Conflicts B A D C working lightpath backup lightpath

More Related