Surviving Large Scale Internet Outages

Dr. Krishna Kant, Intel Research
Acknowledgements: Work supported by the National Science Foundation. Collaborative work with A. Sahoo & P. Mohapatra.

Outline
• Overview
  • Routing and name resolution infrastructures
  • Some large scale failures
• Routing Vulnerabilities
  • Routing algorithms & their properties
  • Improving inter-domain routing
• Dealing with Name Resolution Failures
  • Name resolution preliminaries
  • DNS vulnerabilities & solutions
K. Kant, Surviving Large Scale Internet Outages -- A Tutorial

The Problem
• The Internet has two critical elements
  • Routing (inter- & intra-domain)
  • Name resolution
• How robust are they against large scale failures/attacks?
• How do we improve them?

Internet Routing
• Not a homogeneous network
  • A network of autonomous systems (ASes)
  • Large variation in AS sizes: typical heavy tail
• Inter-AS routing
  • Border Gateway Protocol (BGP)
  • Complex configuration parameters
  • Flexible, but with serious stability, recoverability and configurability issues
• Intra-AS routing
  • Usually easier to manage
  • Central control, smaller network, …
  • But can suffer from similar problems
[Figure: network of ASes; legend: inter-domain router, intra-domain router]

Internet Name Resolution
• Domain Name System (DNS)
  • Translates names to IP addresses
  • Critical for all networking services
  • Hierarchical structure
  • Caching of data in proxy servers & resolvers
• DNS vulnerabilities
  • Complex dependencies & easy to “poison”
  • Can lead to large scale “failures”
  • Inability to access sites, or diversion to malicious sites
[Figure: resolution of acme.com to 10.7.196.31 for “ftp acme.com”; labels: application, resolver, DNS proxy server, authoritative DNS server]

Large Scale Failures
• Characteristics
  • Large service impact
  • Usually non-uniformly distributed, e.g., an affected geographical area, a hijacked .com domain, etc.
• Why study large scale failures?
  • Several moderate-sized incidents have already occurred
  • Larger failures will happen
  • Can cause other undesirable impacts
    • Secondary failures due to large recovery traffic
    • Substantial imbalance in load, …

Routing Failures
• Physical damage
  • Earthquake, hurricane, high-bandwidth cable cuts, …
• Software bugs & configuration errors
  • Incorrect input or output filtering rules
  • Aggregation of large un-owned IP blocks
  • Incompatible policies among ASes
• Network-wide congestion (DoS attack)
• Malicious route advertisements via worms

Name Resolution Failures
• Compromising name resolution
  • Poisoning (alteration/insertion) of address records
    • Doesn’t even require compromising the server
    • Extensive caching → more points of entry
  • Substitution of a rogue DNS server
  • Security holes due to configuration errors
• Potential large scale effects
  • Poisoning at higher levels → large scale disruption
    • Example: March 2005 .com attack
  • Redirection to malicious sites to collect sensitive info

Some Significant Failure Events

Taiwan Earthquake (Dec 2006)
• Major outage in SE Asia, 60% drop in traffic
• Issues:
  • Global traffic passes through a small number of seismically active choke points
    • Luzon Strait, Malacca Strait, south coast of Japan
  • Satellite & overland cables → inadequate backup capacity
  • Several countries depend on 1-2 landing points
• Outlook: potential repeat performance
  • Economics makes change unlikely
  • May be exploited by pirates & terrorists
• Reference: http://master.apan.net/meetings/xian2007/publication/051_Kitamura.pdf

Hurricane Katrina (Aug 2005)
• Major local outages. No major regional cable routes through the worst affected areas.
• Outages persisted for weeks & months. Notable after-effects in FL (significant outages 4 days later!)
• Reference: http://www.renesys.com/tech/presentations/pdf/Renesys-Katrina-Report-9sep2005.pdf

NY Power Outage (Aug 2003)
• [Graph: number of concurrent network outages vs. time]
• Large ASes suffered less than smaller ones.
• Many ASes had all routers down for >4 hours.
• Very similar power outage in Italy, Sept 2003.

Slammer Worm (Jan 2003)
• Worm started with a buffer overflow in MS SQL.
• Very rapid replication, huge congestion buildup in 10 mins
• Korea falls out, 5 of 13 DNS root servers fail, failed ATMs, …
• High BGP activity to find working routes.
• Reference: http://www.cs.ucsd.edu/~savage/papers/IEEESP03.pdf

DNS Attack (Jan 2006)
• Attack type
  • Authoritative TLD DNS servers attacked using ~100 zombie clients & 51K recursive servers.
  • 55-byte zombie query → 4.2 KB response.
  • Responses directed to target name server (w/ spoofed IP address).
• Impact
  • Failures in networks on the path, including transit providers, to the authoritative TLD DNS servers
• [Graph: number of unanswered queries (Y-axis) vs. time (X-axis); red: failure, yellow: slow]
• Reference: http://www.oecd.org/dataoecd/34/40/38653402.pdf
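
The amplification implied by the query and response sizes above can be worked out directly. This is only a back-of-the-envelope sketch: it treats 4.2 KB as 4200 bytes, which is an assumption, and the function name is made up for illustration.

```python
# Sizes quoted on the slide; reading "4.2KB" as 4200 bytes is an assumption.
QUERY_BYTES = 55
RESPONSE_BYTES = 4200


def amplification_factor(query_bytes: int, response_bytes: int) -> float:
    """Bytes delivered to the victim per byte the attacker sends."""
    return response_bytes / query_bytes


factor = amplification_factor(QUERY_BYTES, RESPONSE_BYTES)
print(f"amplification is roughly {factor:.0f}x")
```

Each spoofed query therefore costs the target on the order of 76 times the bandwidth the zombie spends, which is why ~100 zombies were enough when fanned out over many recursive servers.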

Infrastructure Induced Failures
• En-masse use of backup routes by 4000 Cisco routers in May 2007 (Japan)
  • Routing table rewrites → 7 hr downtime in NE Japan
  • Ref: http://www.networkworld.com/news/2007/051607-cisco-routers-major-outage-japan.html
• Akamai CDN failure – June 2004
  • Probably widespread failures in Akamai’s DNS.
  • Ref: http://www.landfield.com/isn/mail-archive/2004/Jun/0064.html
• Worldcom router misconfiguration – Oct 2002
  • Misconfigured eBGP router flooded internal routers with routes.
  • Ref: http://www.isoc-chicago.org/internetoutage.pdf

Routing Infrastructure

Routing Basics
• Distance vector based (DV)
  • RIP (Routing Information Protocol)
  • IGRP (Interior Gateway Routing Protocol)
• Link state based (LS)
  • OSPF (Open Shortest Path First)
  • IS-IS (Intermediate System to Intermediate System)
• Path vector based (PV)
  • BGP (Border Gateway Protocol)
  • Intra-domain (iBGP) & inter-domain (eBGP) versions

Distance Vector (DV) Protocols
• Build the routing table (RT) using successive path advertisements.
• May use stale info to handle failures
  • “Count to infinity” problem; several versions exist to fix this.
• Difficult to use policies
[Figure: network with nodes A to F and the routing table for A; advertised distances to D: A-D=3, E-D=2, F-D=1, C-D=3]
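
How a routing table is built from successive advertisements can be sketched with a small synchronous simulation. This is a minimal illustration, not any particular protocol: the unit-cost topology below is an assumption chosen so that the resulting distances to D match the labels in the figure, and it does not model failures or count-to-infinity.

```python
INF = float("inf")


def distance_vector(links):
    """Run synchronous distance-vector rounds until nothing changes.

    links: {(u, v): cost} for undirected links.
    Returns {node: {dest: (cost, next_hop)}}.
    """
    nodes = sorted({n for link in links for n in link})
    neighbors = {n: {} for n in nodes}
    for (u, v), cost in links.items():
        neighbors[u][v] = cost
        neighbors[v][u] = cost

    # Each node starts out knowing only itself.
    table = {n: {n: (0, n)} for n in nodes}
    for _ in range(len(nodes)):
        # Snapshot what every node advertised at the end of the last round.
        adverts = {n: {d: c for d, (c, _) in table[n].items()} for n in nodes}
        changed = False
        for n in nodes:
            for nbr, link_cost in neighbors[n].items():
                for dest, cost in adverts[nbr].items():
                    new_cost = link_cost + cost
                    if new_cost < table[n].get(dest, (INF, None))[0]:
                        table[n][dest] = (new_cost, nbr)
                        changed = True
        if not changed:
            break
    return table


# Hypothetical unit-cost topology (an assumption, not taken from the slide).
links = {("A", "B"): 1, ("A", "C"): 1, ("A", "E"): 1,
         ("C", "E"): 1, ("E", "F"): 1, ("F", "D"): 1}
tables = distance_vector(links)
print(tables["A"]["D"])  # (cost, next hop) that A uses to reach D
```

Note that each node only stores a cost and a next hop per destination, which is exactly why stale information can circulate after a failure.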

Link State (LS) Protocols
• Each node keeps the complete adjacency/cost matrix & computes shortest paths locally
• Any failure is propagated via flooding
  • Expensive in a large network
• Loop-free & can use policies easily.
[Figure: example network with nodes A to E and link costs 1 to 6]
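
The local shortest-path computation that each link-state node performs is typically Dijkstra's algorithm. The sketch below assumes the node already holds the full adjacency/cost map (flooding is not modeled), and the five-node graph is a hypothetical example, not the one in the figure.

```python
import heapq


def shortest_paths(adjacency, source):
    """Dijkstra over a complete link-state database.

    adjacency: {node: {neighbor: cost}}.
    Returns (dist, next_hop) as seen from source.
    """
    dist = {source: 0}
    next_hop = {}
    done = set()
    # Heap entries: (distance, node, first hop used to reach node).
    heap = [(0, source, "")]
    while heap:
        d, node, first = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if first:
            next_hop[node] = first
        for nbr, cost in adjacency[node].items():
            nd = d + cost
            if nbr not in done and nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                # Direct neighbors of the source are their own first hop.
                heapq.heappush(heap, (nd, nbr, first or nbr))
    return dist, next_hop


# Hypothetical topology and costs (assumptions for illustration).
adjacency = {
    "A": {"B": 3, "C": 2},
    "B": {"A": 3, "C": 4, "D": 1},
    "C": {"A": 2, "B": 4, "E": 5},
    "D": {"B": 1, "E": 6},
    "E": {"C": 5, "D": 6},
}
dist, next_hop = shortest_paths(adjacency, "A")
print(dist, next_hop)
```

Because every node runs the same computation on the same database, the resulting forwarding decisions are consistent, which is where the loop-freedom comes from.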

Path Vector Protocols
• Each node is initialized with a set of paths for each destination
• Active paths updated much like in DV
• Explicitly withdraw failed paths (& advertise next best)
• Filtering on incoming/outgoing paths, path selection policies
• Paths A to D:
  • Via B: cost 3
  • Via C: cost 4
• Entire path not stored (only cost, next hop)
[Figure: network with nodes A to F; path labels B/E/F/D:3, E/F/D:2, F/D:1, C/E/F/D:4; Link_cost=2]

Intra-domain Routing under Failures
• Routing algorithms
  • Link state (OSPF)
    • Flooding can handle failures quickly.
  • Path vector (iBGP)
    • iBGP routers are fully meshed in small networks (routing not much of an issue)
    • In large networks, route reflectors may be used for scalability
• Can recover rather quickly
  • Single domain of control
    • High visibility, common management network, etc.
    • Easy to configure consistent values at all routers
  • iBGP with route reflection has been shown to suffer from oscillations, but this can be remedied.
    • Reference: A. Rawat & M.A. Shayman, “Preventing persistent oscillations and loops in IBGP configuration with route reflection”, Computer Networks, Vol 50, No 18, Dec 2006, pp 3642-3665

Inter-domain Routing
• BGP: default inter-AS protocol (RFC 1771)
• Path vector protocol, runs on TCP
• Scalable, “rich” policy settings
• But prone to long “convergence delays”
• High packet loss & delay during convergence
[Figure: three ASes (AS1, AS2, AS3) with routers R1 to R5; labels: border router, internal router, I-BGP, IGP, E-BGP, announce B]

Inter-domain Routing: BGP Specifics and Vulnerabilities

BGP Routing Table
• Prefix: origin address for destination & mask (e.g., 207.8.128.0/17)
• Next hop: neighbor that announced the route
• One active route, others kept as backup
  • Only the active route can be advertised
• Route “attributes” (may be conveyed outside)
  • AS path: used for loop avoidance
  • MED (multi-exit discriminator): preferred incoming path
  • Local pref: used for local path selection
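
The way these attributes interact can be sketched as a small selection function. This is a simplified illustration under stated assumptions: it only uses local pref, AS path length and MED, in that order, and it compares MED across all candidates, whereas a real BGP implementation has more tie-breakers and compares MED only among routes from the same neighboring AS. The `Route` class, AS numbers and router names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    as_path: Tuple[int, ...]
    local_pref: int = 100
    med: int = 0


def accept(route: Route, my_as: int) -> bool:
    """Loop avoidance: reject any route whose AS path already contains us."""
    return my_as not in route.as_path


def best_route(candidates: Sequence[Route], my_as: int) -> Optional[Route]:
    """Pick the active route; the remaining usable routes act as backups."""
    usable = [r for r in candidates if accept(r, my_as)]
    if not usable:
        return None
    # Higher local pref wins, then shorter AS path, then lower MED.
    return min(usable, key=lambda r: (-r.local_pref, len(r.as_path), r.med))


MY_AS = 65000  # hypothetical private AS number
candidates = [
    Route("207.8.128.0/17", "peer1", (65001, 65010)),
    Route("207.8.128.0/17", "peer2", (65002, 65003, 65010), local_pref=200),
    Route("207.8.128.0/17", "peer3", (65004, 65000, 65010)),  # contains MY_AS
]
active = best_route(candidates, MY_AS)
print(active.next_hop)
```

Here the longer path through peer2 wins because local pref is evaluated before path length, which is the kind of policy knob that makes inter-AS behavior hard to predict.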

BGP Messages
• Message types
  • Open (establish TCP conn), notification, update, keepalive
• Update
  • Withdraw zero or more old routes
  • Optionally advertise exactly one new route.
  • May need to also advertise a sub-prefix
    • E.g., 207.8.240.0/24, which is contained in 207.8.128.0/17
• Update message layout (fields in order):
  • Withdrawn routes length (2 octets)
  • Withdrawn routes (variable length)
  • Length of all path attributes (2 octets)
  • Advertised path attributes (variable length)
  • Reachability information (variable length)
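
The sub-prefix relationship in the example can be checked with Python's standard `ipaddress` module. The sketch also shows why a sub-prefix matters: forwarding prefers the most specific matching prefix. The helper names are made up for illustration.

```python
import ipaddress


def is_sub_prefix(sub: str, parent: str) -> bool:
    """True if every address in `sub` is also covered by `parent`."""
    return ipaddress.ip_network(sub).subnet_of(ipaddress.ip_network(parent))


def longest_match(address: str, prefixes):
    """Return the most specific prefix covering `address`, or None."""
    addr = ipaddress.ip_address(address)
    matches = [ipaddress.ip_network(p) for p in prefixes
               if addr in ipaddress.ip_network(p)]
    if not matches:
        return None
    return str(max(matches, key=lambda net: net.prefixlen))


prefixes = ["207.8.128.0/17", "207.8.240.0/24"]
print(is_sub_prefix("207.8.240.0/24", "207.8.128.0/17"))
print(longest_match("207.8.240.9", prefixes))
```

An address inside the /24 follows the /24 route even though the /17 also covers it, so advertising (or hijacking) a sub-prefix redirects that slice of traffic.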

Routing Process
• Input policy engine
  • Filter routes by path attributes, prefix, etc.
• Output policy engine
  • Manipulate attributes, e.g., local pref, MED, etc.
• Multiple points for possible configuration errors & mismatch between ASes
[Figure: routes received from peers → input policy engine (accept, deny, set preferences) → BGP decision process (BGP routing table, IP routing table, route packets) → output policy engine (forward/not forward, set MEDs) → routes sent to peers]

BGP Recovery
• BGP convergence delay
  • Time for all routes to stabilize following an event
  • Four durations of interest
    • Tup, Tshort, Tlong, Tdown
• Min. Route Advertisement Interval (MRAI)
  • Applies only to advertisements, not withdrawals
  • Intended: per destination; implemented: per peer
  • Damps out oscillations
[Graph: convergence delay vs. MRAI]
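
A per-peer MRAI timer of the kind described above might look like the following sketch. It is not taken from any router implementation: the class and method names are hypothetical, the 30-second default is an illustrative assumption, and time is passed in explicitly so the behavior is easy to follow.

```python
class MraiLimiter:
    """Per-peer rate limiting of advertisements; withdrawals bypass it."""

    def __init__(self, mrai_seconds: float = 30.0):
        self.mrai = mrai_seconds
        self.last_sent = {}  # peer -> time of the last advertisement burst
        self.pending = {}    # peer -> {prefix: latest held-back route}

    def withdraw(self, peer, prefix, now):
        # Withdrawals are sent at once and cancel any held advertisement.
        self.pending.get(peer, {}).pop(prefix, None)
        return [("withdraw", peer, prefix)]

    def advertise(self, peer, prefix, route, now):
        last = self.last_sent.get(peer)
        if last is None or now - last >= self.mrai:
            self.last_sent[peer] = now
            return [("advertise", peer, prefix, route)]
        # Timer still running: keep only the newest route per prefix.
        self.pending.setdefault(peer, {})[prefix] = route
        return []

    def tick(self, now):
        """Flush held advertisements for peers whose timer has expired."""
        out = []
        for peer, held in self.pending.items():
            if held and now - self.last_sent[peer] >= self.mrai:
                out.extend(("advertise", peer, p, r) for p, r in held.items())
                held.clear()
                self.last_sent[peer] = now
        return out


lim = MraiLimiter(30.0)
first = lim.advertise("peer1", "207.8.128.0/17", "route-1", now=0.0)
held = lim.advertise("peer1", "207.8.128.0/17", "route-2", now=5.0)
held_again = lim.advertise("peer1", "207.8.128.0/17", "route-3", now=10.0)
gone = lim.withdraw("peer1", "207.8.240.0/24", now=12.0)
flushed = lim.tick(now=30.0)
print(first, held, gone, flushed)
```

Only `route-3` is ever sent after the first advertisement; the intermediate `route-2` is suppressed, which is how the timer damps oscillations at the cost of added delay.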

Impact of BGP Recovery
• Long recovery times
  • > 3 min. for 30% of isolated failures
  • > 15 min. for 10% of cases
  • Longer for larger failures
• Consequences
  • Connection attempts over invalid routes fail.
  • Big increase in packet loss (30X) and delay (4X)
  • Compromised QoS
• Graphs taken from ref #2, Labovitz et al.

BGP Illustration (1)
• Best path P_SD = (N, cost) [X]
  • S, D: source & destination nodes
  • N: next hop
  • X: actual path (for illustration only)
• Sample starting paths to C
  • P_BC = (D, 3) [BDAC], P_DC = (A, 2) [DAC], etc.
  • Paths shown using arrows (all share segment AC)
• Failure of A
  • BGP does not attempt to diagnose the problem or broadcast failure events.
[Figure (two copies): example network with nodes A to I; link cost labels 2, 3, 10]

BGP Convergence Delay Analysis

Known Analytic Results
• Lots of work on isolated failures, none on large scale failures.
• Labovitz [1]: convergence delay bound for full mesh networks
  • O(n^3) for the average case, O(n!) for the worst case.
• Labovitz [2], Obradovic [3], Pei [8]:
  • Convergence delay ∝ length of longest path involved
  • Applies only for unit cost hops
• Griffin and Premore [4]:
  • V-shaped curve of convergence delay wrt MRAI.
  • #Messages wrt MRAI decreases at a decreasing rate.

Evaluation of Large Scale (LS) Failures
• Evaluation methods
  • Primarily simulation. Analysis is intractable
• BGP simulation tools
  • Several available, but simulation expense is the key!
  • SSFNet: scalable, but max 240 nodes on a 32-bit machine
• SSFNet default parameter settings
  • MRAI jittered by 25% to avoid synchronization
  • OSPFv2 used as the intra-domain protocol

Topology Modeling
• Topology generation: BRITE
  • Enhanced to generate arbitrary degree distributions
  • Heavy tailed, based on actual measurements.
  • Approx. 70% low & 30% high degree nodes.
  • Mostly used 1 router/AS → easier to see trends.
• Failure topology: geographical placement
  • Emulated by placing all AS routers and ASes on a 1000x1000 grid
  • The “area” of an AS ∝ no. of routers in the AS

Convergence Delay vs. Failure Extent
• Initial rapid increase & then flattens out.
• Delays & increase rate both go up with network size → large failures can pose a problem!

Delay & Msg Traffic vs. MRAI
• Small networks in simulation → optimal MRAI for isolated failures is small (0.375 s).
• Main observations
  • Larger failure → larger MRAI more effective

Convergence Delay vs. MRAI
• A V-shaped curve, as expected
• Curve flattens out as failure extent increases
• Optimal MRAI shifts to the right with failure extent.

Impact of AS “Distance”
• ASes are more likely to be connected to other “nearby” ASes.
• b indicates the preference for shorter distances (smaller b → higher preference)
• Lower convergence delay for lower b.

Improving BGP Convergence Delay

Reducing Convergence Delays
• Many schemes, mostly evaluated for isolated failures
• Some popular schemes
  • Ghost Flushing
  • Consistency Assertions
  • Root Cause Notification
• Our work (large scale failure focused)
  • Dynamic MRAI
  • Batching
  • Speculative Invalidation

Ghost Flushing
• Bremler-Barr, Afek, Schwarz: Infocom 2003
• An advertisement implicitly replaces the old path
  • GF withdraws the old path immediately.
• Pros
  • Withdrawals will cascade through the network
  • More likely to install new working routes
• Cons
  • Substantial additional load on routers
  • Flushing takes away a working route!
    • Install BC →
      • Routes at D, F, I via B will start working
      • Flushing will take them away.
[Figure: example network with nodes A to I; link cost labels 2, 3, 10]

Consistency Assertion
• Pei, Zhao, et al., Infocom 2002
• If S has two paths S:N1xD & S:N2yN1xD, & the first path is withdrawn, then the second path is not used (considered infeasible).
• Pros
  • Avoids trying out paths that are unlikely to be working.
• Cons
  • Consistency checking can be expensive
[Figure: S reaches D via N1 over segment x, and via N2 over segment y, then N1 and x]
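
The rule stated above can be sketched as a simple filter over the paths S has learned. This covers only the case described on the slide (a path through a neighbor whose own path was withdrawn is treated as infeasible); the published scheme performs further checks, and the data layout and names here are assumptions for illustration.

```python
def feasible_paths(paths_by_neighbor):
    """Drop paths that traverse a neighbor whose own path was withdrawn.

    paths_by_neighbor: {neighbor: tuple of hops to the destination,
                        or None if that neighbor withdrew its path}.
    """
    withdrawn = {n for n, path in paths_by_neighbor.items() if path is None}
    return {
        n: path
        for n, path in paths_by_neighbor.items()
        if path is not None and not withdrawn.intersection(path)
    }


# S's view of destination D after N1 withdraws its path N1-x-D.
paths = {
    "N1": None,                         # withdrawn
    "N2": ("N2", "y", "N1", "x", "D"),  # still advertised, but goes via N1
    "N3": ("N3", "z", "D"),             # hypothetical independent path
}
usable = feasible_paths(paths)
print(usable)
```

S skips the stale path via N2 and goes straight to N3, instead of installing N2's path and waiting for it to be withdrawn in turn.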

Root Cause Notification
• Pei, Azuma, Massey, Zhang: Computer Networks, 2004
• Modify BGP messages to carry the root cause (e.g., node/link failure).
• Pros
  • Avoid paths with failed nodes/links → substantial reduction in convergence delay.
• Cons
  • Change to the BGP protocol. Unlikely to be adopted.
  • Applicability to large scale failures unclear (diagnosis difficult)
• Example
  • D, E, G diagnose whether A or the link to A has failed.
  • Propagate this info to neighbors
[Figure: example network with nodes A to I; link cost labels 2, 3, 10]

Large Scale Failures: Our Approach
• What can’t or wouldn’t we do?
  • No coordination between ASes
    • Business issues, security issues, very hard to do, …
  • No change to the wire protocol (i.e., no new message type).
  • No substantial router overhead
  • Solution must be applicable to both isolated & large scale failures.
• What can we do?
  • Change MRAI based on network and/or load parameters
    • E.g., degree dependent, backlog dependent, …
  • Process messages (& generate updates) differently

Key Idea: Dynamic MRAI
• Increase MRAI when the router is heavily loaded
  • Reduces load & number of route changes.
• Relationship to large scale failure
  • Larger failure size → greater router loading → larger MRAI more appropriate.
  • Router-load-directed MRAI caters to all failure sizes!
• Implementation:
  • Queue length threshold based MRAI adjustment.
[Figure: queue length thresholds: Decrease th1, Decrease th2, Increase th1, Increase th2]
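
A queue-length-driven MRAI adjustment with separate increase and decrease thresholds could be sketched as follows. Only the 0.375 s base value appears in the slides; the other MRAI levels, the threshold values and the class name are assumptions chosen for illustration.

```python
class DynamicMrai:
    """Step MRAI up or down as the update queue crosses thresholds."""

    def __init__(self,
                 levels=(0.375, 1.0, 2.25),  # MRAI values in seconds
                 increase_th=(20, 60),       # queue length to move up a level
                 decrease_th=(5, 30)):       # queue length to move back down
        if not (len(increase_th) == len(decrease_th) == len(levels) - 1):
            raise ValueError("need one threshold pair per level boundary")
        self.levels = levels
        self.increase_th = increase_th
        self.decrease_th = decrease_th
        self.level = 0

    def update(self, queue_len: int) -> float:
        """Return the MRAI to use given the current backlog."""
        top = len(self.levels) - 1
        while self.level < top and queue_len > self.increase_th[self.level]:
            self.level += 1
        while self.level > 0 and queue_len < self.decrease_th[self.level - 1]:
            self.level -= 1
        return self.levels[self.level]


ctl = DynamicMrai()
trace = [ctl.update(q) for q in (10, 25, 70, 40, 25, 10, 3)]
print(trace)
```

Keeping each decrease threshold below the matching increase threshold gives hysteresis, so a backlog hovering near one threshold does not make the timer flap between values.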

Dynamic MRAI: Effect on Delay
• Change wrt fixed MRAI = 0.375 secs.
• Improves convergence delay as compared to fixed values.

Key Idea: Message Batching
• BGP default: FIFO message processing → unnecessary processing, if
  • A later update (already in queue) changes the route to the destination.
  • Timer expiry occurs before a later message is processed.
• Relationship to large scale failure
  • Significant batching (and hence batching advantage) likely for large scale failures only.
• Algorithm
  • A separate logical queue per destination: allows processing of all updates to a destination as a batch.
  • >1 update from the same neighbor → delete older ones.
[Figure: FIFO queue of updates for destinations A, B, C regrouped into per-destination batches]
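
The per-destination logical queue can be sketched with an ordered dictionary. This is a minimal illustration of the two rules in the algorithm (batch by destination, keep only the newest update per neighbor); the class name and the string placeholders for updates are hypothetical.

```python
from collections import OrderedDict


class BatchingQueue:
    """One logical queue per destination; newest update per neighbor wins."""

    def __init__(self):
        # dest -> {neighbor: latest update}; destinations keep arrival order.
        self.by_dest = OrderedDict()

    def enqueue(self, dest, neighbor, update):
        # A newer update from the same neighbor replaces the older one.
        self.by_dest.setdefault(dest, {})[neighbor] = update

    def next_batch(self):
        """Pop all pending updates for the oldest waiting destination."""
        if not self.by_dest:
            return None
        return self.by_dest.popitem(last=False)


q = BatchingQueue()
q.enqueue("A", "n1", "u1")
q.enqueue("B", "n1", "u2")
q.enqueue("A", "n1", "u3")  # supersedes u1
q.enqueue("A", "n2", "u4")
batch_a = q.next_batch()
batch_b = q.next_batch()
print(batch_a, batch_b)
```

Four queued messages collapse into two decision-process runs, and the superseded update `u1` is never processed at all.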

Batching: Effect on Delay
• Behavior similar to dynamic MRAI without actually making it dynamic
• Combination with dynamic MRAI works somewhat better.

Key Idea: Speculative Invalidation
• Large scale failure
  • A lot of route withdrawals for the failed AS, say X
  • #withdrawn paths with AS X ∈ AS_path > threshold → invalidate all paths containing X
• Implementation issues
  • Going through the routes for invalidation is inefficient
    • Use output route filters at each node
  • Threshold estimation → computed (see paper)
  • Reverting routes to valid state → time-slot based
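
The threshold rule can be sketched as a counter plus a filter, so that routes are checked when they are about to be used rather than by scanning the whole table. The threshold value, the AS numbers and the class name below are assumptions; the slides compute the threshold separately and revert routes per time slot, which is reduced here to a plain `reset()`.

```python
from collections import Counter


class SpeculativeInvalidator:
    """Suspect an AS once enough withdrawn paths have contained it."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.withdrawn_count = Counter()
        self.suspect = set()

    def on_withdraw(self, as_path):
        # Count each AS once per withdrawn path.
        for asn in set(as_path):
            self.withdrawn_count[asn] += 1
            if self.withdrawn_count[asn] > self.threshold:
                self.suspect.add(asn)

    def is_valid(self, as_path) -> bool:
        """Filter: a path through any suspect AS is treated as invalid."""
        return not self.suspect.intersection(as_path)

    def reset(self):
        """End of time slot: revert all routes to the valid state."""
        self.withdrawn_count.clear()
        self.suspect.clear()


inv = SpeculativeInvalidator(threshold=2)
for path in [(1, 7, 9), (2, 7, 8), (3, 7, 5)]:
    inv.on_withdraw(path)
through_x = inv.is_valid((4, 7, 6))  # path via the suspected AS 7
around_x = inv.is_valid((4, 5, 6))
print(through_x, around_x)
```

After three withdrawals that all include AS 7, a path through AS 7 that has not yet been withdrawn is already skipped, which is where the savings in path exploration come from.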

Effect of Invalidation
• Avoids exploring unnecessary paths
• Reduces convergence delay significantly, but …
  • May affect connectivity adversely.
  • Implement only at nodes with degree 4 or higher

Comparison of Various Schemes
• CA (Consistency Assertion) is the best scheme throughout!
• GF (Ghost Flushing) is rather poor
• Batching & dynamic MRAI do pretty well considering their simplicity
