Network Measurements in Overlay Networks

Network Measurements in Overlay Networks Richard Cameron Craddock School of Electrical and Computer Engineering Georgia Institute of Technology

Outline • Resilient Overlay Networks • Best-Path vs. Multi-Path Overlay Routing • Measuring the Effect of Internet Path Faults on Reactive Routing

Resilient Overlay Networks D. Andersen, H. Balakrishnan, F. Kaashoek, and R. Morris Proc. 18th ACM SOSP October 2001 ANDERSEN, D. G., BALAKRISHNAN, H., KAASHOEK, M. F., AND MORRIS, R. Resilient Overlay Networks. In Proc. 18th ACM SOSP (Banff, Canada, Oct. 2001), pp. 131–145.

Resilient Overlay Networks • RONs seek to quickly detect and respond to network failures • Network nodes participate in a limited size overlay network • Overlay nodes cooperate with one another to forward data on behalf of any other nodes in the RON • RON detects problems by aggressively probing the paths connecting its nodes • RON nodes exchange information about the quality of paths among themselves, and build forwarding tables based on a variety of path metrics • Latency, Packet Loss, and Available Throughput

Resilient Overlay Networks Goals • Failure detection and recovery in less than 20 seconds • Tighter integration of routing and path selection with the application • Expressive policy routing

Active Probing • Each RON node probes every other node once every PROBE_INTERVAL, plus a random jitter of up to 1/3 PROBE_INTERVAL • A probe not returned within PROBE_TIMEOUT is considered lost
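
A minimal sketch of the probe timing described on this slide. The numeric values of PROBE_INTERVAL and PROBE_TIMEOUT, the uniform jitter distribution, and the function names are illustrative assumptions, not values or code from the paper.

```python
import random

# Illustrative values only; the slide does not state them.
PROBE_INTERVAL = 12.0  # seconds between probes to a given node
PROBE_TIMEOUT = 3.0    # seconds to wait for a probe reply


def next_probe_delay(rng: random.Random, interval: float = PROBE_INTERVAL) -> float:
    """Delay before the next probe: the base interval plus up to 1/3 of it as jitter.

    Drawing the jitter uniformly is an assumption of this sketch.
    """
    return interval + rng.uniform(0.0, interval / 3.0)


def probe_lost(rtt, timeout: float = PROBE_TIMEOUT) -> bool:
    """A probe counts as lost if no reply arrived (rtt is None) or it arrived too late."""
    return rtt is None or rtt > timeout
```

The jitter keeps nodes from synchronizing their probes, so a short burst of congestion is less likely to hit every probe on a path at once.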

Link-State Dissemination • RON nodes disseminate their performance metrics to the other nodes every ROUTING_INTERVAL • This information is sent over the RON overlay • The only time that a RON node has incomplete information about any other node is when it is completely cut off from the Overlay

Outage Detection • On the loss of a probe, several consecutive probes spaced by PROBE_TIMEOUT are sent out • If OUTAGE_THRESH probes elicit no response the path is considered “dead” • If even one probe gets a response then high frequency probing is cancelled • Paths experiencing outages are rated on their packet loss history
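
The outage check above can be sketched as a short loop. Here `send_probe` is a hypothetical callable that sends one probe, waits up to PROBE_TIMEOUT, and returns True if a reply arrived; the default threshold of 4 is an illustrative assumption.

```python
def path_is_dead(send_probe, outage_thresh: int = 4) -> bool:
    """Run after a probe loss: send up to `outage_thresh` follow-up probes.

    Returns True (path considered dead) only if every follow-up probe goes
    unanswered. A single reply cancels the high-frequency probing early.
    """
    for _ in range(outage_thresh):
        if send_probe():
            return False  # one response is enough to keep the path alive
    return True
```

Because each follow-up waits at most PROBE_TIMEOUT, the time to declare an outage in this sketch is bounded by roughly OUTAGE_THRESH times PROBE_TIMEOUT after the first loss.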

Latency and Loss Rate • Latency is the round trip time calculated from the probes • Smoothed as Latency = A * Latency + (1 - A) * New Sample • A is chosen to be 0.9 • Overall path latency is the SUM of the individual virtual link latencies • Loss Rate is the average of the last k = 100 probe samples • If losses are assumed independent then the overall path delivery rate is the PRODUCT of the individual virtual link delivery rates, so path loss rate = 1 - PRODUCT(1 - link loss rate)
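
A small sketch of these estimators, using A = 0.9 and k = 100 from the slide. The function names are illustrative, and the path loss formula assumes independent losses on each virtual link.

```python
from collections import deque
from math import prod

ALPHA = 0.9   # weight on the previous estimate (A on the slide)
WINDOW = 100  # k, number of probe samples averaged for the loss rate


def update_latency(old: float, sample: float, alpha: float = ALPHA) -> float:
    """Exponentially weighted moving average of round trip time."""
    return alpha * old + (1.0 - alpha) * sample


def loss_rate(window) -> float:
    """Fraction of lost probes in the window (each entry is True if lost)."""
    return sum(window) / len(window) if window else 0.0


def path_latency(link_latencies) -> float:
    """Path latency is the sum of the virtual link latencies."""
    return sum(link_latencies)


def path_loss_rate(link_loss_rates) -> float:
    """Under independence, delivery rates multiply: 1 - prod(1 - p_i)."""
    return 1.0 - prod(1.0 - p for p in link_loss_rates)


# Usage: keep only the last k outcomes for one virtual link.
history = deque(maxlen=WINDOW)
history.extend([False] * 98 + [True] * 2)  # 2 losses in 100 probes
```

For example, two virtual links with loss rates 0.1 and 0.2 give a path loss rate of 1 - 0.9 * 0.8 = 0.28, not 0.02.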

Throughput • Throughput is scored using Eq. (2) of the paper, a simplified TCP throughput equation • p is the one-way packet loss probability • Estimated as half of the measured two-way packet loss probability • rtt is the end-to-end round trip time • Throughput cannot be aggregated across virtual links • To simplify the selection of throughput-optimized paths, only one intermediate node is considered • An indirect path is chosen only if it improves throughput by at least 50%
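
A hedged sketch of the throughput-based choice. It assumes Eq. (2) has the common simplified TCP form score = sqrt(1.5) / (rtt * sqrt(p)); that form should be checked against the paper. The loss floor, the function names, and the dictionary of per-intermediate scores are assumptions made for illustration.

```python
from math import sqrt

MIN_LOSS = 1e-6  # assumed floor so a loss-free path does not divide by zero


def throughput_score(rtt: float, two_way_loss: float) -> float:
    """Score a path from its round trip time (seconds) and two-way loss rate.

    The one-way loss p is estimated as half of the two-way loss.
    """
    p = max(two_way_loss / 2.0, MIN_LOSS)
    return sqrt(1.5) / (rtt * sqrt(p))


def choose_path(direct_score: float, indirect_scores: dict, min_gain: float = 0.5):
    """Return the best intermediate node, or None to stay on the direct path.

    An indirect path is used only if its score beats the direct score by at
    least `min_gain` (50% by default).
    """
    if not indirect_scores:
        return None
    best = max(indirect_scores, key=indirect_scores.get)
    if indirect_scores[best] >= (1.0 + min_gain) * direct_score:
        return best
    return None
```

The 50% margin acts as hysteresis: small, noisy differences between scores do not cause the route to flap between the direct and indirect paths.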

Experiment • The raw measurement data consists of probe packets • To probe, each RON node independently repeated the following steps • Pick a random node j • Pick a probe type from one of {direct, latency, loss} using round-robin • Send a probe to j • Delay for a random interval between 1 and 2 seconds

Results • Two distinct datasets • RON1 • 64 hours between 3/21/2001 and 3/23/2001 • 12 nodes with 132 distinct paths • Traverses 36 different ASes and 74 distinct inter-AS links • RON2 • 85 hours between 5/7/2001 and 5/11/2001 • 16 nodes with 240 distinct paths • Traverses 50 different ASes and 118 distinct inter-AS links

Results

Overcoming Path Outages • A RON win occurred when the direct Internet loss rate was >= p% and the RON loss rate was < p% • There were 10 complete communication outages, and RON routed around all of them

Loss Rate • Improved loss rate by more than 0.05 more than 5% of the time in RON1 • RON can make loss rates worse too • Improved loss rate by more than 0.04 more than 5% of the time in RON2

Handling Packet Floods • Three hosts connected in a triangle • Indirect routing is possible through the third node but not preferable • Flood attack beginning at 5s • RON recovered in 13s • Non-RON doesn’t recover

Latency • RON reduces communication latency in many cases • 11% saw improvements of 40 ms or more in RON1 • 8.2% saw improvements of 40 ms or more in RON2

TCP Throughput • RON’s throughput-optimizing router does not attempt to change paths unless it obtains a 50% improvement in throughput • 5% of samples doubled their throughput • 2% increased their throughput by more than 5 times • 9 samples increased it by a factor of 10

Conclusions • Resilient overlay networks can greatly improve the reliability of the Internet • RON was able to overcome 100% (RON1) and 60% (RON2) of the several hundred observed outages • RON takes 18 seconds on average to detect and recover from a fault • RON can substantially improve loss rate, latency, and TCP throughput • Forwarding packets via at most one intermediate node is sufficient for fault recovery and latency improvements

Best-Path vs. Multi-Path Overlay Routing D. Andersen, A. Snoeren, and H. Balakrishnan IMC, Miami, FL, October 2003. D. G. Andersen, A. C. Snoeren, and H. Balakrishnan, "Best-Path vs Multi-Path Overlay Routing," IMC 2003.

Best-Path vs. Multi-Path Overlay Routing • Best-path and multi-path routing techniques have been proposed to reduce packet loss • These techniques are compared in terms of loss rate and latency reduction • This comparison is made in the context of an overlay network

Best-Path vs. Multi-Path Overlay Routing • Multi-Path Routing • Packets are duplicated and sent on different paths through overlay • Reactive Routing • Overlay nodes constantly measure the paths between themselves using probes • Packets are sent on either the direct path or forwarded via a sequence of other overlay nodes

Routing Methods • Direct: • Single packet using the direct path • Loss: • Loss optimized reactive routing • Lat: • Latency optimized reactive routing

Routing Methods • Direct rand: • 2-redundant multi-path routing • First packet is sent on the direct path • Second packet is sent via a randomly chosen intermediate node • Lat Loss: • 2-redundant multi-path routing with reactive routing • First packet is sent on the latency-optimized path • Second packet is sent on the loss-optimized path

Routing Methods • Direct direct • 2-redundant direct routing with back-to-back packets on the same path • DD 10 ms • Direct direct with 10ms delay between packets • DD 20 ms • Direct direct with 20ms delay between packets

Method • Nodes periodically initiate one or two request packets to a target • Each request has a random 64-bit identifier which is logged along with send and receive times • Nodes cycle through the different request types • Targets are chosen randomly • Nodes delay for a random period between 0.6 and 1.2 seconds

Base Network Statistics

Packet Loss Rate

Conditional Loss Probability and Latency

Conclusion • There is loss and failure independence in the Internet • 40% of observed losses were avoidable • The benefits of multi-path routing can be achieved with direct duplication • 10 or 20 ms delay between packets • Reactive and redundant routing can work in concert to reduce loss • 45% decrease in packet loss rate

Measuring the Effect of Internet Path Faults on Reactive Routing Nick Feamster, David G. Andersen, Hari Balakrishnan, and M. Frans Kaashoek SigMetrics San Diego, June 2003 FEAMSTER, N., ANDERSEN, D., BALAKRISHNAN, H., AND KAASHOEK, M. F. Measuring the effects of Internet path faults on reactive routing. In Proc. Sigmetrics (San Diego, CA, June 2003).

Measuring the Effect of Internet Path Faults on Reactive Routing • Where do failures appear? • How long do failures last? • How well do failures correlate with BGP routing instability?

Data Collection • Based on the analysis of data collected for one year on a test bed of 31 hosts • Geographically as well as topologically diverse test bed • Paths between these hosts traverse more than 50% of the well-connected ASes on the Internet • Data includes: • Active probes between hosts • Traceroutes • BGP messages collected at 8 locations

Active Probing • An active probe consists of a request packet from the initiator to a target and a reply packet from the target to the initiator • Each probe has a 32-bit ID that is logged along with send and receive times • A central monitoring machine aggregates logs • Post-processing finds all probes received within 60 minutes of when they are sent • Each host independently initiates a probe to a random target and then sleeps between 1 and 2 seconds • Mean time between probes on a particular path is 30s • With a 95% probability each path is probed at least once every 80s • Failures are defined as 3 or more consecutive lost probes • This limits the time resolution of failure detection to a few minutes
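
The failure definition above (3 or more consecutive lost probes) can be sketched as a scan over one path's probe log. The log layout, a list of (send_time, received) pairs sorted by send time, is an assumption made for illustration.

```python
def find_failures(probes, min_run: int = 3):
    """Return (first_lost_send_time, last_lost_send_time) for each failure.

    `probes` is a time-sorted list of (send_time, received) pairs for a single
    path. A failure is a run of at least `min_run` consecutive lost probes.
    """
    failures = []
    run = []  # send times of the current streak of lost probes
    for send_time, received in probes:
        if received:
            if len(run) >= min_run:
                failures.append((run[0], run[-1]))
            run = []
        else:
            run.append(send_time)
    if len(run) >= min_run:  # the log may end in the middle of a failure
        failures.append((run[0], run[-1]))
    return failures
```

With probes on a path about 30 s apart on average, three consecutive losses span roughly a minute or more, which is why the slide notes a resolution of a few minutes.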

Loss-triggered traceroutes • Path failure indicated by the active prober initiates a single traceroute • Traceroute is limited to 30 hops • The last reachable IP address is considered point of failure • The failure of a traceroute could be due to either the forward or reverse path • One-way reachability from active probes ensures that the traceroute measurement corresponds to failure on the forward path • Measurement hosts periodically push traceroute logs to the central monitoring machine

Network Depth Estimation • Want to determine if a failure occurs near an end host or in the middle of the network • Assign an estimated network depth to each link based on its connectivity to other network nodes • Links between routers and measurement nodes have a network depth of 0 • Any edge that connects a 0 depth router to other routers has a depth of 1, and so on • Edges that can receive more than one value are assigned the smaller value • By computing the depth of all links, the depth at which a traceroute fails can be estimated
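
One way to realize this depth assignment is a breadth-first search started from all measurement hosts at once. This is a sketch under the assumption that the topology is given as an undirected edge list; the function and node names are illustrative.

```python
from collections import deque


def edge_depths(edges, hosts):
    """Map each (u, v) edge to its estimated network depth.

    Edges touching a measurement host get depth 0, edges leaving those
    first-hop routers get depth 1, and so on. Taking the minimum over both
    endpoints gives an edge the smaller value when two are possible.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    # Multi-source BFS: hop distance from the nearest measurement host.
    distance = {h: 0 for h in hosts}
    queue = deque(hosts)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)

    return {
        (u, v): min(distance[u], distance[v])
        for u, v in edges
        if u in distance and v in distance
    }
```

For a chain h1, r1, r2, r3, h2 with hosts h1 and h2, the two host links get depth 0 and the two inner links get depth 1.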

Inferring AS Topology • Inferring AS topology requires: • Mapping interfaces to routers (alias resolution) • Assigning routers to ASes • Alias resolution • Based on Rocketfuel’s “Ally” technique • A pair of IP addresses is a candidate for alias resolution if they both have the same next or previous hop in a traceroute • For each candidate pair the alias resolution test is performed 100 times • If the test is positive 80% or more of the time, the two IP addresses are assigned to the canonical ID of the router
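
The candidate selection and the 80%-of-100 rule can be sketched as follows. The traceroute representation (a list of hop addresses) and `ally_test`, a hypothetical callable standing in for one run of the Ally test, are assumptions; this is not Rocketfuel's interface.

```python
from itertools import combinations


def candidate_pairs(traceroutes):
    """Pairs of addresses that share a previous or next hop in some traceroute."""
    followers_of = {}     # hop -> addresses seen immediately after it
    predecessors_of = {}  # hop -> addresses seen immediately before it
    for hops in traceroutes:
        for a, b in zip(hops, hops[1:]):
            followers_of.setdefault(a, set()).add(b)
            predecessors_of.setdefault(b, set()).add(a)

    pairs = set()
    # Followers of one hop share a previous hop; predecessors share a next hop.
    for group in list(followers_of.values()) + list(predecessors_of.values()):
        for pair in combinations(sorted(group), 2):
            pairs.add(frozenset(pair))
    return pairs


def are_aliases(ally_test, a, b, trials: int = 100, threshold: float = 0.8) -> bool:
    """Repeat the alias test and accept if at least `threshold` of runs are positive."""
    positives = sum(1 for _ in range(trials) if ally_test(a, b))
    return positives / trials >= threshold
```

Restricting the test to candidate pairs keeps the number of probes manageable, since testing every pair of observed addresses would grow quadratically.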

Inferring AS Topology • Routers are assigned to ASes based on the AS’s address space • If a router has addresses from more than one AS, it is assigned to the AS with the most addresses and is considered a border router • Routers that cannot be identified in the above manner are assigned by neighbor router votes • If the majority of the links from a router lead into one AS, the router is assigned to that AS • If the router has links to multiple ASes it is considered a border router • Routers that cannot be assigned in the above manner are assigned by hand using traceroute information
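
The two automatic assignment rules can be sketched with simple counting. The inputs (one AS number per interface address or per neighbor link, with None where the AS is unknown) and the tie handling are assumptions of this sketch.

```python
from collections import Counter


def assign_by_addresses(interface_asns):
    """Assign a router to the AS that owns most of its interface addresses.

    Returns (asn, is_border), or None if no address maps to a known AS.
    Ties go to the AS seen first, which is an arbitrary choice in this sketch.
    """
    counts = Counter(asn for asn in interface_asns if asn is not None)
    if not counts:
        return None
    asn, _ = counts.most_common(1)[0]
    return asn, len(counts) > 1


def assign_by_neighbor_votes(neighbor_asns):
    """Fallback: assign to the AS that a strict majority of links lead into.

    Returns (asn, is_border), or None if no AS holds a majority, in which
    case the router is left for manual assignment.
    """
    counts = Counter(asn for asn in neighbor_asns if asn is not None)
    total = sum(counts.values())
    if total == 0:
        return None
    asn, votes = counts.most_common(1)[0]
    if 2 * votes <= total:
        return None
    return asn, len(counts) > 1
```

A router whose interfaces map to AS numbers 7018, 7018, and 3356 (illustrative values) would be assigned to 7018 and flagged as a border router.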

BGP Data Collection • 8 nodes in the test bed collected BGP messages using Zebra 0.92a • Configured to see only BGP messages that cause a change in the border router’s choice of best route • Monitors observe most BGP messages relevant to routing stability

Failure Location

Failure Length

Failures after RON

Correlating Failures and BGP

Conclusions • Failures are more likely to appear within an AS than on the boundary • 70% of observed failures last less than 5 min, 90% less than 15 min • Failures near the core are more likely to coincide with BGP messages • Failures typically precede BGP messages by about 4 minutes • RON can typically route around 50% of path failures • 20% of the failures masked by RON were preceded by at least one BGP message • This suggests that reactive routing can be improved by using BGP instability as an indicator of path failures

Discussion • Can passive measurements be used for a RON? • How do you guarantee that you have enough data? • How do you handle old data? • How practical are RONs? • Are the performance gains worth the overhead?
