Protection and Restoration
• Definitions
• A major application for MPLS
The problem
• Network resources will fail
  • Nodes and links
• The IGP will re-converge
  • But this may take some time: tens of seconds
• Fast convergence has a price
  • It may make the IGP more sensitive/unstable
• Some traffic is sensitive and cannot afford interruptions
  • Voice, consumer TV
• Do something for the time until the IGP re-converges
Terminology
• Restoration
  • Bring traffic back to normal
• Backup
  • Alternative resources to be used when there is a failure
• Protection
  • Determine and allocate the backup resources before the failure
  • When there is a failure, just activate them
  • Can be very fast
• Repair
  • Determine, allocate, and activate the backup resources after the failure
  • Will be slower
Failure Modes
• Single vs. multiple link failures
  • If the duration of a link failure is short, we can assume there will be only a single link failure at a time
  • Much harder to deal with multiple link failures
• Node vs. link failures
  • Links can be assumed to fail more frequently than nodes
  • Node failures are harder to handle
Backup resources
• Can be of multiple types
  • Links, paths, trees, cycles, whole topologies
• To avoid network overload after a failure, some extra capacity is needed for backup resources
• The problem is how to engineer them so the network does not become too expensive
  • Minimize the amount of backup capacity that is reserved
More jargon
• 1:1
  • 1 working, 1 backup
  • Wastes a lot of bandwidth on backups
• 1:N
  • N working and 1 backup
  • Assume that only 1 working resource will fail at a time
  • Then 1 backup is enough, saving bandwidth
• Revertive: when the failure is fixed, revert to the primary
• SRLG: Shared Risk Link Group
  • A set of network links that fail together
  • E.g. fibers that are in the same conduit: a bulldozer will cut all of them together
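The SRLG idea above can be made concrete: a backup path is only useful if it shares no SRLG with the primary. Below is a small sketch (topology, SRLG numbers, and paths are made-up examples, not from the slides) of a disjointness check.

```python
# Hedged sketch: checking that a backup path shares no SRLG with the
# primary, so one conduit cut cannot take down both. The SRLG map and
# topology are illustrative.

# Map each undirected link to the set of SRLGs it belongs to.
SRLG = {
    frozenset({"A", "B"}): {1},   # fiber in conduit 1
    frozenset({"A", "C"}): {1},   # same conduit: shares fate with A-B
    frozenset({"C", "B"}): {4},
    frozenset({"A", "D"}): {2},
    frozenset({"D", "B"}): {3},
}

def srlgs_of(path):
    """Collect the SRLGs of every link along a path (list of nodes)."""
    groups = set()
    for u, v in zip(path, path[1:]):
        groups |= SRLG.get(frozenset({u, v}), set())
    return groups

def srlg_disjoint(primary, backup):
    """True iff no single SRLG failure hits both paths."""
    return not (srlgs_of(primary) & srlgs_of(backup))

print(srlg_disjoint(["A", "B"], ["A", "C", "B"]))  # False: both in conduit 1
print(srlg_disjoint(["A", "B"], ["A", "D", "B"]))  # True: no shared risk
```

Note that A-C-B looks node- and link-disjoint from A-B, yet the SRLG check still rejects it: disjointness at the IP layer says nothing about shared fibers underneath.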
Other issues
• How to detect the failure fast
  • BFD is one general solution
  • There are medium-specific solutions
    • OAM for ATM, alarms for SONET
    • Preferable if they exist
  • Protocol mechanisms (RSVP Hellos, OSPF Hellos, etc.)
• How to activate the backup
  • I.e. how to make traffic use an alternate path or tree
Backbone failure analysis
• Sprint backbone, ca. March 2002
  • Link on the class website
  • Monitors IS-IS traffic
  • Data only for link failures, not node failures
• Failure duration
  • 50% of failures last less than 1 min
  • 40% of failures last between 1 and 20 min
• Maintenance
  • 50% of failures occur during maintenance windows
• Mean time between failures (MTBF)
  • MTBF varies a lot across links: there are "good" and "bad" links
  • 3 bad links account for 25% of the failures
More analysis
• Unplanned failure breakdown
  • Shared link failures = 30%
  • Router-related = 16.5%
  • Optical-related = 11.5%
  • Individual link failures = 70%
• Node failures are less common than single link failures
• About 16.5% of failures affect more than 1 link
Handling failures with IP
• Easy case
  • ECMP: no need to do anything extra during a failure
  • But it may not repair all failures
  • Coverage: the percentage of possible failures that can be repaired
• In general, activating backup resources is hard with IP
  • Packets will follow the IP routing table/FIB
  • Forwarding is hop-by-hop
  • Even if a node computes a backup link for a failure, it has no control over what happens after the next hop
  • May cause routing loops
IP protection
• Backup next-hop
  • Each node computes a backup next-hop for each destination, chosen so that there are no routing loops
  • May not have 100% coverage
• For more general solutions, tunneling is needed
  • Must force packets to reach their destination without crossing the failed resource
  • Tunnel to the node after the failed link, or to an intermediate node
• IP tunneling is an expensive operation
  • It is packet encapsulation
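The "backup next-hop chosen so that there are no routing loops" bullet is usually expressed as the loop-free-alternate (LFA) condition, an assumption here since the slides do not spell it out: neighbor N of source S is a safe backup toward destination D iff dist(N, D) < dist(N, S) + dist(S, D), i.e. N's own shortest path to D does not come back through S. A minimal sketch on a made-up topology:

```python
# Illustrative LFA check. Topology, costs, and names are assumptions.
import heapq

GRAPH = {  # undirected link costs
    "S":  {"N1": 1, "N2": 1, "D": 1},
    "N1": {"S": 1, "D": 1},
    "N2": {"S": 1},
    "D":  {"S": 1, "N1": 1},
}

def dijkstra(src):
    """Shortest-path distances from src (plain Dijkstra)."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in GRAPH[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def is_lfa(s, n, d):
    """True iff neighbor n is a loop-free backup next-hop from s to d."""
    dn, ds = dijkstra(n), dijkstra(s)
    return dn[d] < dn[s] + ds[d]

print(is_lfa("S", "N1", "D"))  # True: N1 reaches D without going via S
print(is_lfa("S", "N2", "D"))  # False: N2 would loop back through S
```

N2 illustrates the coverage gap from the slide: it is a neighbor of S, but it cannot serve as a backup, so this destination has no loop-free alternate through N2.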
Not-Via addresses
• Consider router A, with interfaces A1, A2, A3
  • A1 connects to interface B1 of router B
  • A2 connects to interface C2 of router C
• B1 has a second address, B1-not-via-A
• All routers compute paths to B1-not-via-A by removing router A from the topology and running SPF
• When router A fails, if C wants to reach B it sends packets to address B1-not-via-A
  • It encapsulates the packets
• 100% coverage
• Can handle node and link failures
• Still needs encapsulation
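The not-via computation above is just SPF on a pruned topology. A small sketch (topology and names are illustrative, not from the slides), where C normally reaches B through A but the path to B1-not-via-A avoids A:

```python
# Sketch of the not-via computation: prune the protected router from the
# topology, then run SPF on what is left.
import heapq

LINKS = {  # undirected costs; C normally reaches B through A
    "C": {"A": 1, "E": 2},
    "A": {"C": 1, "B": 1},
    "E": {"C": 2, "B": 2},
    "B": {"A": 1, "E": 2},
}

def spf_path(graph, src, dst, avoid=None):
    """Dijkstra returning the node list src..dst, skipping node `avoid`."""
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u].items():
            if v == avoid or d + w >= dist.get(v, float("inf")):
                continue
            dist[v], prev[v] = d + w, u
            heapq.heappush(heap, (d + w, v))
    path, node = [], dst
    while node != src:
        path.append(node)
        node = prev[node]
    return [src] + path[::-1]

print(spf_path(LINKS, "C", "B"))             # normal path, through A
print(spf_path(LINKS, "C", "B", avoid="A"))  # the B1-not-via-A path
```

When A fails, C encapsulates the packet toward B1-not-via-A and it follows the second (pruned) path; B decapsulates and delivers it.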
Multi-topology protection
• A newer approach
• Have multiple subsets of the topology
  • IGP protocols already support multi-topology routing
• Switch to a different topology when there is a failure
  • By modifying the packet header, or even by using an MPLS label
• Allows for more flexible routing of traffic after a failure
Using MPLS
• MPLS can conveniently direct traffic where we want
  • Ideal for setting up backup resources, mostly backup paths
• Can be used to repair both IP and MPLS failures (i.e. LSP failures)
• LSP protection can be
  • Path
  • Local
Path protection
• For each (primary) LSP, have a backup LSP
  • It is already established (with RSVP) but carries no traffic
  • Primary and backup LSPs should be link- and node-disjoint
• When there is a failure, the source of the LSP starts sending traffic over the backup
  • The source needs to be notified of the failure
  • It may take some time until the traffic is repaired
• Can work in both 1:1 and 1:N modes
Local protection
• When a link or node fails, the node upstream of the failure repairs the traffic
  • Traffic is put into a backup LSP that does not go over the failed resource
  • The backup LSP merges with the primary LSP
• The repairing router does not send a PathErr upstream
  • Instead it notifies upstream nodes that it is repairing the failure
• It is very fast
• Can work in 1:1 and 1:N modes
• Can be
  • Node: bypass a failed node
  • Link: bypass a failed link
Link local protection
• The node upstream of the failed link initiates the protection
  • Point of local repair (PLR)
• The backup LSP merges back into the primary one
  • At the next hop (NHop) of the PLR
• Can work in 1:1 and 1:N modes
• Usually a single backup LSP protects multiple primary LSPs
  • Otherwise scalability is poor
Node local protection
• When a node fails, assume its links have failed too
• The node upstream of the failed node initiates the protection
  • Point of local repair (PLR)
• The backup LSP merges back into the primary one
  • At the next-next-hop (NNHop) of the PLR
• What label does the NNHop use for the primary LSP?
  • Need RSVP's help to find out
• Multiple backup LSPs are needed for each node
  • At least one for each NNHop
  • Can optionally configure more
Label stacking
• Each time traffic is sent into an LSP, a label is pushed onto the packets
• Packets in the primary LSP already have a label
  • This creates a label stack
• The top label is popped by the router just before the merge point
• A catch
  • At the merge point, the packet arrives from an interface different from the expected one
  • Must have a global (platform-wide) label space
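A toy illustration of the stack during local repair (label values are made up): the PLR pushes the backup LSP's label on top of the primary label, and the router just before the merge point pops it, so the merge point sees the primary label it expects.

```python
# Label stack sketch; index 0 is the top of the stack.
packet = {"labels": [101], "payload": "data"}  # 101 = primary LSP label

def push(pkt, label):
    pkt["labels"].insert(0, label)

def pop(pkt):
    return pkt["labels"].pop(0)

push(packet, 202)        # PLR: wrap the packet into the backup LSP (202)
print(packet["labels"])  # [202, 101]

pop(packet)              # hop before the merge point pops the top label
print(packet["labels"])  # [101]: merge point forwards on the primary label
```

This is also why the "catch" above matters: since 101 now arrives on an unexpected interface, the merge point can only look it up correctly if labels are valid platform-wide rather than per interface.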
Need some RSVP support
• If the LSP is protected, do not send errors upstream/downstream when there is a failure
  • Instead, notify upstream nodes that repair is in progress
• During the failure, the PATH and RESV messages for the primary LSP must continue
  • Send them through the backup LSP
• For node protection, need to know the label the NNHop is using for the primary
  • Use the label recording option for the LSP
  • All the labels used at all the hops are recorded in the RESV message
LSPs protecting IP
• The above techniques can also be used to protect IP traffic
• If a link fails, all the traffic that would go through the link is sent over the backup LSP
• Similar for node failures
  • But in this case, how do we know the NNHop for IP?
• In general, if the network runs MPLS, all traffic will be inside MPLS tunnels anyway
Observations
• If the node degree is d and there are N nodes, then
  • At least O(N·d) tunnels are needed for link protection
  • And at least O(N·d²) for node protection
• Of course, failures of the ingress or egress node cannot be protected against
• The assumption is that failures will be short-lived
  • Traffic may be unbalanced during the failure
  • Links can get overloaded
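The counts above can be sanity-checked with back-of-the-envelope arithmetic, assuming N nodes of average degree d: roughly one link-protection tunnel per directed link, and one node-protection tunnel per (upstream neighbor, NNHop) pair around each node.

```python
# Rough tunnel counts for local protection; N and d are example values.
def link_protection_tunnels(n, d):
    return n * d          # O(N*d): one bypass per directed link

def node_protection_tunnels(n, d):
    return n * d * d      # O(N*d^2): one per (neighbor, NNHop) pair

print(link_protection_tunnels(100, 4))  # 400
print(node_protection_tunnels(100, 4))  # 1600
```

Even for a modest 100-node backbone, node protection quadruples the tunnel count here, which is why a single backup LSP usually protects many primary LSPs.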
The resource allocation problem
• How to set up the backup tunnels so that
  • No link is overloaded after a failure
  • The amount of extra bandwidth reserved for the backups is minimized
• It is a form of traffic engineering (TE)
  • We will see more on TE later on
• Has been studied a lot
  • In optical and telephone networks, and recently in MPLS-type networks
• Solutions can be
  • On-line (as the requests arrive)
  • Off-line
Example
• Kodialam and Lakshman, 2001
• Local link and node protection
• Assume the bandwidth demands of all LSPs are known
• Assume that only one link or node can fail at a time
• Find a set of backup paths that minimizes the amount of bandwidth for both primary and backup LSPs
  • Backup LSPs can share bandwidth on some links
• What do we know about the links?
  • How much bandwidth is used by each LSP: complete, but expensive to maintain
  • How much bandwidth is available: almost zero information
  • How much bandwidth is used by backup LSPs: a little better than zero
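The bandwidth-sharing idea in the bullet above can be sketched as follows, under the single-failure assumption: on a given link, it suffices to reserve the maximum backup bandwidth that any one failure activates, not the sum over all failures. The data below is made up for illustration.

```python
# Hedged sketch of backup bandwidth sharing under single failures.
from collections import defaultdict

# (protected resource, link the backup traverses, bandwidth in Mb/s)
BACKUPS = [
    ("link-AB", "L", 10),
    ("link-CD", "L", 7),
    ("link-AB", "L", 5),
]

def reservation(backups, link):
    """Backup bandwidth to reserve on `link`: the worst single failure."""
    per_failure = defaultdict(int)
    for failed, via, bw in backups:
        if via == link:
            per_failure[failed] += bw
    return max(per_failure.values(), default=0)

print(reservation(BACKUPS, "L"))  # 15 (link-AB's backups), not the 22 sum
```

Since link-AB and link-CD cannot fail at the same time, their backups can share reservation on L; this sharing is exactly what the optimization tries to maximize.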