Apricot 2006 Advanced BGP Convergence Techniques Pradosh Mohapatra pmohapat@cisco.com
Agenda • Terminology • Convergence Scenarios • Core Link Failure • Edge Node Failure • Edge Link Failure
Basic Terminology • Prefix – A route that is learnt by routing protocols, e.g. 12.0.0.0/16. • Pathlist – A list of next-hop paths learnt by routing protocols. • 12.0.0.0/16 (non-recursive) • Via POS1/0 • Via GE2/0, 5.5.5.5 • 10.0.0.0/16 (recursive: depends on the resolution of the next-hop) • Via 5.5.5.5
Forwarding Table Structure [Figure: a BGP pathlist (BGP PL) with path 1 and path 2, two IGP pathlists (IGP PL) each with path 1 and path 2, and the interface/next-hop entries Intf1/NH1, Intf2/NH2, Intf3/NH3, Intf4/NH4.]
Salient Features • Pathlist Sharing: • All BGP prefixes that have the same set of paths point to a single pathlist. • Hierarchical Structure: • BGP prefixes (recursive) point to IGP prefixes (non-recursive).
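The two features above can be sketched in a few lines of Python. This is only an illustrative model, not the actual forwarding-table implementation: the `Pathlist` class, the `resolve` helper, the BGP prefixes 10.0.0.0/16 and 10.1.0.0/16 and the second next hop 6.6.6.6 are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Pathlist:
    """A list of paths, shared by every prefix that has the same path set."""
    paths: list = field(default_factory=list)

# Non-recursive IGP pathlists: each path is an (interface, next hop) pair.
igp_pl_a = Pathlist([("Intf1", "NH1"), ("Intf2", "NH2")])
igp_pl_b = Pathlist([("Intf3", "NH3"), ("Intf4", "NH4")])

# IGP prefixes (the BGP next hops; addresses are made up) point to IGP pathlists.
igp_table = {"5.5.5.5/32": igp_pl_a, "6.6.6.6/32": igp_pl_b}

# Recursive BGP pathlist: each path names the IGP prefix it resolves through.
bgp_pl = Pathlist(["5.5.5.5/32", "6.6.6.6/32"])

# Pathlist sharing: both BGP prefixes reference the very same object.
bgp_table = {"10.0.0.0/16": bgp_pl, "10.1.0.0/16": bgp_pl}

def resolve(bgp_prefix):
    """Follow BGP PL -> IGP PL and collect every (interface, next hop)."""
    result = []
    for igp_prefix in bgp_table[bgp_prefix].paths:
        result.extend(igp_table[igp_prefix].paths)
    return result

print(resolve("10.0.0.0/16"))
```

Because the BGP prefixes hold a reference rather than a copy, the number of pathlist objects depends on the number of distinct path sets, not on the number of BGP prefixes.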
Core Link Failure
Multipath BGP, Multipath IGP, IGP path goes down • Initial organization before failure of IGP path 1. • Link to Path 1 goes down. [Figure: a BGP PL with path 1 and path 2, and two IGP PLs, each with path 1 and path 2.]
Multipath BGP, Multipath IGP, IGP path goes down • IGP pathlist modified after Path 1 failure. • BGP Convergence = IGP Convergence. [Figure: the BGP PL still has path 1 and path 2; the affected IGP PL now holds only path 2, the other IGP PL is unchanged.]
Multipath BGP, Multipath IGP, IGP prefix is deleted • Initial organization before deletion of IGP prefix 1. • IGP Prefix 1 gets deleted. • Fix-up BGP PL to point to the second path. [Figure: a BGP PL with path 1 and path 2, and two IGP PLs, each with path 1 and path 2.]
Multipath BGP, Multipath IGP, IGP prefix is deleted • BGP pathlist modified after deletion of IGP prefix 1. • BGP Convergence = IGP Convergence. [Figure: the BGP PL now has a single path, pointing to the remaining IGP PL with path 1 and path 2.]
Multipath BGP, Multipath IGP, IGP path modified • Initial organization before modification of IGP Path 1. • IGP Path 1 gets modified. • BGP Convergence = IGP Convergence. [Figure: a BGP PL with path 1 and path 2, and two IGP PLs, each with path 1 and path 2.]
Conclusion • In case of core link failure: • Sub-second convergence. • BGP Prefix-independent & In-place modification of the forwarding table. • Make-before-break solution
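To illustrate why the in-place modification is BGP prefix-independent, here is a minimal sketch under stated assumptions: the `Pathlist` class, the `usable_paths` helper, the prefix names and the figure of 100,000 prefixes are all hypothetical, and the model ignores everything a real forwarding plane does beyond pointer sharing.

```python
class Pathlist:
    """Toy pathlist: just a mutable list of paths."""
    def __init__(self, paths):
        self.paths = list(paths)

# Non-recursive IGP pathlist toward a BGP next hop (two equal-cost paths).
igp_pl = Pathlist(["path 1", "path 2"])

# Recursive BGP pathlist that resolves through the IGP pathlist.
bgp_pl = Pathlist([igp_pl])

# Hypothetical table: 100,000 BGP prefixes, all sharing the same BGP pathlist.
bgp_table = {f"prefix-{i}": bgp_pl for i in range(100_000)}

def usable_paths(prefix):
    """Walk BGP PL -> IGP PL and return the paths currently usable."""
    return [p for igp in bgp_table[prefix].paths for p in igp.paths]

# Core link failure: IGP path 1 goes down. Only the IGP pathlist is edited,
# in place; no BGP prefix entry is touched.
igp_pl.paths.remove("path 1")

print(usable_paths("prefix-0"), usable_paths("prefix-99999"))
```

The single list update is visible to every BGP prefix at once, so in this model the repair cost does not grow with the size of the BGP table.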
Edge Node Failure
Edge node failure • PE1 has selected PE2 as bestpath and has installed only that path in the forwarding table. • Upon PE2's failure, PE1 needs fast detection of unreachability. • Declaring PE2 unreachable requires all of PE2's IGP neighbors to have detected the failure and to have sent their LSPs to PE1. • PE1 then needs to point to PE3. [Figure: topology with PE1, P1, P2, PE2 and PE3.]
BGP Next-Hop Tracking • Event-driven reaction to BGP next-hop changes • BGP communicates its next-hops to RIB. • If RIB gets a modify/delete/add of an entry covering these next-hops, it notifies BGP. • BGP runs bestpath algorithm. • Stability requirement • Fast reaction to isolated events • Delayed reaction to too frequent events • Classification of Events • Next-hop unreachable is critical: React faster. • Metric Change is non-critical: React slower.
BGP NHT – Implementation highlights
• RIB implements a dampening algorithm.
  • Next-hops flapping too often are dampened.
• RIB classifies next-hop changes as critical or non-critical.
  • Critical events are sent immediately to BGP. Non-critical events are delayed by up to 3 seconds.
• BGP has an initial delay before it reacts to next-hop changes.
  • Default: 5s. Configurable.
  • Capture as many changes as possible within the initial delay before running bestpath.

    router bgp 1
     bgp nexthop-trigger-delay 1
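The interplay of the two delays can be modelled with a short sketch. It is a toy model, not the router's implementation: the function names, the event kinds and the node name PE4 are hypothetical, dampening is left out, and non-critical events are assumed to always wait the full 3 seconds (the slide gives this only as an upper bound).

```python
NON_CRITICAL_DELAY = 3.0   # seconds; upper bound quoted above, used as worst case
DEFAULT_INITIAL_DELAY = 5.0  # seconds; BGP default quoted above

def rib_notify_time(event_time, kind):
    """Time at which the RIB passes a next-hop change to BGP."""
    if kind == "unreachable":          # critical: sent immediately
        return event_time
    return event_time + NON_CRITICAL_DELAY  # e.g. metric change: delayed

def bestpath_run(events, initial_delay=DEFAULT_INITIAL_DELAY):
    """events: list of (time, next_hop, kind) tuples.

    Returns (time bestpath runs, next hops whose changes are in that run).
    The initial delay timer starts at the first notification.
    """
    notified = sorted((rib_notify_time(t, kind), nh) for t, nh, kind in events)
    run_at = notified[0][0] + initial_delay
    batch = [nh for t, nh in notified if t <= run_at]
    return run_at, batch

events = [(0.0, "PE2", "unreachable"), (0.5, "PE4", "metric-change")]
print(bestpath_run(events, initial_delay=1.0))  # shorter delay, smaller batch
print(bestpath_run(events))                     # default delay, larger batch
```

A shorter initial delay reacts sooner to the critical event but captures fewer changes per bestpath run, which is the trade-off the configurable timer exposes.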
BGP NHT - example • T1: Link failure triggering IGP convergence. • T2: First next-hop notification from RIB to BGP. • T3: BGP reads the next-hop updates and starts initial delay timer. • T4: Initial delay period expires. BGP does NHScan and bestpath change (a function of the table size). [Figure: timeline showing link down at T1, IGP convergence, the first NH notification from RIB at T2, T3, and NHScan + bestpath at T4.]
BGP NHT • Principle: The first SPF must declare PE2 as unreachable. • If PE2 fails, all its neighbors must have had time to detect the failure, originate their LSPs and flood them to PE1. • When PE1 starts its SPF, the LSPs of all PE2's neighbors must already be in PE1's database. • Dependencies • Fast failure detection • Fast flooding • SPF initial-wait conservative enough
BGP NHT – Typical Timing
• 0: PE2 fails.
• 50ms: PE1 receives the first LSP and schedules SPF at T=200ms.
  • The other LSPs have ample time to arrive in the meantime.
• 200ms: PE1 starts SPF.
  • We budget 30ms for it, but with iSPF it takes ~1ms.
• 232ms: PE1 deletes PE2's loopback and schedules BGP NHT at T=1232ms.
  • Few prefixes need modifying, as this is a node failure.
• 1232ms: PE1 runs BGP NHT.
  • Table scan: ~6us per entry; if PE1 has 20k routes: ~120ms.
  • RIB modify: ~140us per entry; if PE1 has 5k routes from PE2: ~700ms.
  • Distribution download: 70ms.
• 2122ms: PE1/LC has finished modifying the BGP entries to use nh=PE3. They still need to be resolved.
  • Resolution starts within [0, 1000ms].
  • Resolution lasts ~100us per entry.
• 3622ms: Convergence is finished in the worst case.
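The timeline above can be re-added step by step. The sketch below uses only the figures quoted in the timeline, with two assumptions flagged in the comments: the 2ms between the end of SPF (230ms) and the loopback deletion (232ms) is not itemized in the slide, and the per-entry resolution cost is applied to the 5k routes learnt from PE2, which is the reading that reproduces the 3622ms total.

```python
US = 1            # work in integer microseconds to avoid rounding
MS = 1000 * US

spf_start = 200 * MS            # SPF scheduled after the first LSP
spf_duration = 30 * MS          # budgeted SPF run time
rib_delete = 2 * MS             # assumption: gap between 230ms and 232ms
nht_delay = 1000 * MS           # BGP NHT scheduled one second later

loopback_deleted = spf_start + spf_duration + rib_delete
nht_start = loopback_deleted + nht_delay

table_scan = 20_000 * 6 * US    # 20k routes at ~6us each
rib_modify = 5_000 * 140 * US   # 5k routes from PE2 at ~140us each
download = 70 * MS              # distribution download

entries_modified = nht_start + table_scan + rib_modify + download

resolution_wait = 1000 * MS     # worst case of the [0, 1000ms] window
resolution = 5_000 * 100 * US   # assumption: the 5k modified entries, ~100us each

worst_case = entries_modified + resolution_wait + resolution

for name, value in [("loopback deleted", loopback_deleted),
                    ("BGP NHT starts", nht_start),
                    ("entries modified", entries_modified),
                    ("worst case", worst_case)]:
    print(f"{name}: {value // MS}ms")
```

About half of the worst case comes from the two fixed timers (the 1s NHT delay and the up-to-1s resolution wait), and most of the rest from the per-entry RIB modify and resolution work.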
Conclusion – Edge node failure • Sub-5s is achievable • The analyzed scenario leads to a worst case (WC) of ~3600ms • Sub-second is challenging • Ongoing work to improve this further: • Backup path
Backup Path • No Multipath. Prefix always points to Path 1. • Reroute triggered per IGP prefix: fix-up Path 1 to point to the backup path. [Figure: a BGP PL with path 1 and a backup path, two IGP PLs each with path 1 and path 2, and the interface/next-hop entries Intf1/NH1, Intf2/NH2, Intf3/NH3, Intf4/NH4.]
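A minimal sketch of the backup-path fix-up, assuming a pre-installed backup is already known. The class `BackupPathlist`, its method, the prefix names and the loopback labels are hypothetical and only model the pointer swap described above.

```python
class BackupPathlist:
    """Toy BGP pathlist: one active path plus a pre-installed backup."""
    def __init__(self, primary, backup):
        self.active = primary   # "Path 1": the only path used for forwarding
        self.backup = backup    # installed ahead of time, unused until failure

    def igp_prefix_down(self, igp_prefix):
        """Fix up the active path if it resolved through the failed IGP prefix."""
        if self.active == igp_prefix and self.backup is not None:
            self.active, self.backup = self.backup, None

# All BGP prefixes learnt via PE2 share one pathlist (no multipath).
pl = BackupPathlist(primary="PE2-loopback", backup="PE3-loopback")
bgp_table = {"10.0.0.0/16": pl, "10.1.0.0/16": pl}

# PE2 fails: the IGP deletes PE2's loopback, which triggers the fix-up once.
pl.igp_prefix_down("PE2-loopback")

print({prefix: entry.active for prefix, entry in bgp_table.items()})
```

In this model the reroute does not wait for the BGP bestpath run, which is why knowing the backup path in advance matters.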
Backup Path – Contd. • Problem: • How to know the backup path? BGP advertises only one path. • Peering with RRs: RR sends only the bestpath it computes. • Solution: • Add-path draft.
ADD-PATH • Mechanism that allows the advertisement of multiple paths for the same prefix without the new paths implicitly replacing any previous ones. • Adds a path identifier to the encoding to distinguish between different paths for the same prefix.

Prefix encoding:
+-----------------------------+
| Path Identifier (4 octets)  |
+-----------------------------+
| Length (1 octet)            |
+-----------------------------+
| Prefix (variable)           |
+-----------------------------+

Labeled prefix encoding:
+-----------------------------+
| Path Identifier (4 octets)  |
+-----------------------------+
| Length (1 octet)            |
+-----------------------------+
| Label (3 octets)            |
+-----------------------------+
...............................
+-----------------------------+
| Prefix (variable)           |
+-----------------------------+
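The unlabeled encoding above can be sketched with Python's `struct` and `ipaddress` modules. This is an illustration for IPv4 only, under the usual BGP convention that the length octet counts prefix bits and the prefix is sent in the minimum number of whole octets; the function names `encode_nlri` and `decode_nlri` are hypothetical.

```python
import ipaddress
import struct

def encode_nlri(path_id, prefix):
    """Path Identifier (4 octets) + Length (1 octet, in bits) + Prefix."""
    net = ipaddress.ip_network(prefix)
    nbytes = (net.prefixlen + 7) // 8          # minimum whole octets
    header = struct.pack("!IB", path_id, net.prefixlen)
    return header + net.network_address.packed[:nbytes]

def decode_nlri(data):
    """Inverse of encode_nlri for a single IPv4 entry."""
    path_id, length = struct.unpack("!IB", data[:5])
    nbytes = (length + 7) // 8
    addr = data[5:5 + nbytes] + bytes(4 - nbytes)  # pad back to 4 octets
    return path_id, f"{ipaddress.IPv4Address(addr)}/{length}"

# Two paths for the same prefix differ only in the path identifier, so the
# second advertisement does not implicitly replace the first.
first = encode_nlri(1, "10.0.0.0/16")
second = encode_nlri(2, "10.0.0.0/16")
print(first.hex(), second.hex())
print(decode_nlri(first), decode_nlri(second))
```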
ADD-PATH - Operation • New capability: Add-path • Advertisement of the capability indicates ability to receive multiple paths for all negotiated AFI/SAFI. • Advertisement of specific AFI/SAFI information in the capability indicates the intent to send multiple paths. • Only in these cases must the new encoding be used. • Concern: does the cost of advertising multiple paths outweigh the convergence benefits?
Edge Link Failure
Example: PE-CE Link Failure [Figure: L3VPN topology with VPN1 HQ and a VPN1 site, customer edges CE1, CE2, CE3, provider edges PE1, PE2, PE3, and route reflectors RRA1, RRA2, RRB1, RRB2.]
Edge Link Failure scenarios • Edge Link Failure: Next-hop on the peering link • Convergence behavior same as the last two scenarios. • Edge Link Failure: Next-hop-self • Default behavior for L3VPN • In-place modification and/or BGP NHT do not help. • Advanced BGP signaling required.