PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services

Ming Zhang, Chi Zhang, Vivek Pai, Larry Peterson, Randy Wang (Princeton University)

Motivation • Routing anomalies are common on the Internet • Maintenance • Power outages • Fiber cuts • Misconfiguration • … • Anomalies can affect end-to-end performance • Packet loss • Packet delay • Loss of connectivity

Background • Anomaly detection and diagnosis are nontrivial • Asymmetric paths • Failure information propagation • Highly varied durations • Limited coverage

Contributions • New techniques for • Anomaly detection • Anomaly isolation • Anomaly classification • Large-scale study of anomalies • Broad coverage • High detection rate, low overhead • Characterization of anomalies • End-to-end effects • Benefits to host service

Outline • State of the Art • PlanetSeer Components • MonD – passive monitoring • ProbeD – active probing • Anomaly Analysis • Loop-based anomaly • Non-loop anomaly • Bypassing Anomalies • Summary

State of the Art • Routing messages • BGP: AS-level diagnosis • IS-IS, OSPF: Within single ISP • Router/link traffic statistics • SNMP, NetFlow: proprietary • End-to-end measurement • Ping, traceroute

End-to-End Probing • All-pairs probes among n nodes • O(n^2) measurement cost • Not scalable as n grows
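
To make the scaling argument concrete, the short calculation below counts directed source-to-destination probes for a full mesh. The function name is made up for this illustration, and 353 is borrowed from the ProbeD node count given later in the deck purely as an example size.

```python
def all_pairs_probe_count(n: int) -> int:
    """Number of directed probes when each of n nodes probes every other node."""
    return n * (n - 1)


# Each tenfold increase in nodes costs roughly a hundredfold more probes.
for n in (10, 100, 353):
    print(f"{n:>4} nodes -> {all_pairs_probe_count(n):>7} probes per round")
```

At 353 nodes a single round already needs 124,256 probes, before counting any destinations outside the mesh.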

Key Observation • Combine passive monitoring with active probing • Peer-to-Peer (P2P), Content Distribution Network (CDN) • Large client population • Geographically distributed nodes • Large traffic volume • Highly diverse paths • The traffic generated by the services reveals information about the network.

Our Approach • Host service • CDN • Components • Passive monitoring • Active probing • Advantages • Low overhead • Wide coverage [Figure: diagram labeled Client C, R1, R2, B, A]

MonD: Anomaly Detection • Anomaly indicators • Time-to-live (TTL) change • Routing change • n consecutive timeouts (n = 4 in current system) • Idling period of 3 to 16 seconds • Most congestion periods < 220 ms
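
The two indicators above can be sketched as a small per-flow state machine. This is an illustrative sketch, not MonD's actual code: the class and method names are hypothetical, and it assumes that any received packet resets the timeout counter and that a flow is flagged once, on the n-th consecutive timeout.

```python
from dataclasses import dataclass
from typing import Dict, Optional

TIMEOUT_THRESHOLD = 4  # n = 4, as stated on the slide


@dataclass
class FlowState:
    last_ttl: Optional[int] = None
    consecutive_timeouts: int = 0


class FlowMonitor:
    """Hypothetical passive monitor that tracks TTL and timeouts per flow."""

    def __init__(self, timeout_threshold: int = TIMEOUT_THRESHOLD) -> None:
        self.timeout_threshold = timeout_threshold
        self.flows: Dict[str, FlowState] = {}

    def on_packet(self, flow_id: str, ttl: int) -> bool:
        """Record an incoming packet; return True if its TTL differs from the last one seen."""
        state = self.flows.setdefault(flow_id, FlowState())
        ttl_changed = state.last_ttl is not None and ttl != state.last_ttl
        state.last_ttl = ttl
        state.consecutive_timeouts = 0  # assumption: progress on the flow resets the count
        return ttl_changed

    def on_timeout(self, flow_id: str) -> bool:
        """Record a timeout; return True exactly when the n-th consecutive one occurs."""
        state = self.flows.setdefault(flow_id, FlowState())
        state.consecutive_timeouts += 1
        return state.consecutive_timeouts == self.timeout_threshold
```

A `True` return from either method would be the point at which a possible anomaly is handed to active probing.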

ProbeD Operation • Baseline probes • When a new IP appears • From local node • Forward probes • When a possible anomaly detected • From multiple nodes (including local node) • Reprobes • At 0.5, 1.5, 3.5 and 7.5 hours later • From local node
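
The reprobe offsets follow a doubling pattern, which the sketch below makes explicit. It assumes the offsets are measured from the moment the possible anomaly is flagged; the function names are illustrative only.

```python
from datetime import datetime, timedelta
from typing import List, Sequence

REPROBE_OFFSETS_HOURS = (0.5, 1.5, 3.5, 7.5)  # from the slide


def reprobe_times(detected_at: datetime) -> List[datetime]:
    """Wall-clock times of the four reprobes, relative to the detection time."""
    return [detected_at + timedelta(hours=h) for h in REPROBE_OFFSETS_HOURS]


def gaps_hours(offsets: Sequence[float] = REPROBE_OFFSETS_HOURS) -> List[float]:
    """Waiting time before each reprobe, measured from the previous probe."""
    gaps = []
    previous = 0.0
    for offset in offsets:
        gaps.append(offset - previous)
        previous = offset
    return gaps


print(gaps_hours())  # each wait is twice the previous one
```

Doubling the wait between reprobes keeps the probing cost low while still covering both short-lived and long-lived anomalies.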

ProbeD Groups • 353 nodes, 145 sites, 30 groups • According to geographic location • One traceroute per group
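
One way to read "one traceroute per group" is sketched below. The slide does not say how the probing node within a group is chosen, so the random choice here is an assumption, as is the rule that the local node stands in for its own group; all names are hypothetical.

```python
import random
from typing import Dict, List, Optional


def pick_probers(groups: Dict[str, List[str]], local_node: str,
                 rng: Optional[random.Random] = None) -> List[str]:
    """Return the local node plus one node from every other group."""
    rng = rng or random.Random()
    probers = [local_node]
    for nodes in groups.values():
        if local_node in nodes:
            continue  # assumption: the local node already covers its own group
        if nodes:
            probers.append(rng.choice(nodes))  # assumption: random pick within the group
    return probers
```

With 30 groups this would issue at most 30 traceroutes per suspected anomaly instead of one from each of the 353 nodes.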

Estimating Scope • Which routers might be affected? • Routers which possibly change their next hops • Traceroutes from multiple locations can narrow the scope [Figure: local ProbeD, remote ProbeD, and client, with routers ra, rb, rc, rd]

Path Diversity • Monitoring Period: 02/2004 – 05/2004 • Unique IPs: 887,521 • Traversed ASes: 10,090 [Figure: AS tiers from core to edge, labeled 215, 22, 1392, 1420, and 13872 ASes]

Confirming Anomalies • Reported anomalies • 2,259,588 • Conditions • Loops • Route change • Partial unreachability • ICMP unreachable • Very conservative confirmation [Pie chart: Non-anomaly 66%, Undecided 22%, Anomaly 12%]

Confirmed Anomaly Breakdown • Confirmed anomalies • 271,898 • 2 per minute • 100x more • Temp anomalies • Inconsistent probes [Pie chart: Path change 44%, Other outage 23%, Temporary anomalies 16%, Forward outage 9%, Persistent loop 7%, Temporary loop 1%]
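
The "2 per minute" figure can be checked with a quick calculation. The only assumption is that the three-month monitoring period is taken as 90 days; the deck gives the period only to the month.

```python
CONFIRMED_ANOMALIES = 271_898  # from the slide
ASSUMED_DAYS = 90              # assumption: "3 months" approximated as 90 days

minutes = ASSUMED_DAYS * 24 * 60
rate_per_minute = CONFIRMED_ANOMALIES / minutes

print(f"{rate_per_minute:.2f} confirmed anomalies per minute")
```

Under that assumption the rate comes out at roughly 2.1 per minute, consistent with the rounded figure on the slide.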

Scope of Loops • How many routers or ASes are involved? • Temporary loops involve more routers than persistent loops • 97% of persistent loops and 51% of temporary loops contain 2 hops • 1% of persistent loops and 15% of temporary loops cross ASes

Distribution of Loops • Many persistent loops in tier-3, few in tier-1 • Worst 10% of tier-1 ASes – implications for largest ISPs • 20% traffic • 35% persistent loops

Duration of Persistent Loops • How long do persistent loops last? • Either resolve quickly or last for an extended period

Scope of Forward Anomalies • How many routers or ASes are affected? • 60% of outages within 1 hop • 75% of outages and 68% of changes within 4 hops • 78% of outages and 57% of changes within 2 ASes

Location of Forward Anomalies • How close are the anomalies to the edges of the network? • 44% outages at the last hop • 72% outages and 40% changes within 4 hops

Distribution of Forward Anomalies • Which ASes are affected? • Tier-1 ASes most stable • Tier-3 ASes most likely to be affected

Overlay Routing • Use alternate path when default path fails [Figure: source, intermediate, and destination nodes]

Bypassing Anomalies • How useful is overlay routing for bypassing failures? • Effective in 43% of 62,815 failures, lower than previous studies • 32% bypass paths inflate RTTs by more than a factor of two
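
The sketch below shows one simple form of one-hop overlay bypass and how RTT inflation could be computed for it. The deck does not describe how intermediate nodes were chosen in the study, so picking the lowest total RTT is an assumption, and the function names and sample numbers are invented for illustration.

```python
from typing import Dict, Optional, Tuple


def best_bypass(source_to_mid: Dict[str, Optional[float]],
                mid_to_dest: Dict[str, Optional[float]]) -> Optional[Tuple[str, float]]:
    """Pick the intermediate with the lowest total RTT in ms; None means unreachable."""
    best = None
    for mid, first_leg in source_to_mid.items():
        second_leg = mid_to_dest.get(mid)
        if first_leg is None or second_leg is None:
            continue  # this intermediate cannot relay the traffic
        total = first_leg + second_leg
        if best is None or total < best[1]:
            best = (mid, total)
    return best


def rtt_inflation(bypass_rtt_ms: float, direct_rtt_ms: float) -> float:
    """Ratio of the bypass RTT to the RTT of the original direct path."""
    return bypass_rtt_ms / direct_rtt_ms


# Made-up RTTs: node "c" cannot be reached from the source.
choice = best_bypass({"a": 40.0, "b": 30.0, "c": None},
                     {"a": 50.0, "b": 90.0, "c": 10.0})
print(choice, rtt_inflation(choice[1], 40.0))
```

In this made-up case the bypass works but inflates the RTT by more than a factor of two, the situation the slide reports for 32% of bypass paths.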

Summary • Confirm 272,000 anomalies in 3 months • Persistent and temporary loops • Persistent loops narrower scope, either resolve quickly or last for a long time • Path outages and changes • Outages closer to edge, narrower scope • Anomaly distribution • Skewed. Tier-1 most stable. Tier-3 most problematic. • Overlay routing • Bypasses 43% failures, latency inflation

More Information • In the paper • More details about anomaly characteristics • End-to-end impacts • Classification methodology • Optimizations to reduce overheads & improve confirmation rate • mzhang@cs.princeton.edu • http://www.cs.princeton.edu/nsg/infoplane

Classifying Anomalies • Temporary vs. persistent loops • Whether the probe exits the loop before the maximum hop count • Path changes vs. outages • Changes: probes follow different paths to clients • Outages: probes stop at intermediate hops [Figure: ProbeD probing toward a client]
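
The classification rules above can be sketched over traceroute hop lists as follows. This is a simplified illustration rather than the paper's methodology: the hop limit of 30, the loop test (an address repeated with at least one hop in between), and all names are assumptions.

```python
from typing import List, Optional, Set

MAX_HOPS = 30  # assumed hop limit for a traceroute


def loop_routers(hops: List[Optional[str]]) -> Set[str]:
    """Routers between the two occurrences of the first repeated address, if any."""
    first_seen = {}
    for i, hop in enumerate(hops):
        if hop is None:
            continue  # unanswered hop
        if hop in first_seen and i - first_seen[hop] >= 2:
            return {h for h in hops[first_seen[hop]:i] if h is not None}
        first_seen.setdefault(hop, i)
    return set()


def classify(baseline: List[str], current: List[Optional[str]], client: str,
             max_hops: int = MAX_HOPS) -> str:
    """Label a new traceroute relative to the baseline path to the client."""
    in_loop = loop_routers(current)
    if in_loop:
        # Persistent: the probe is still inside the loop when the hop limit is hit.
        stuck = len(current) >= max_hops and current[-1] in in_loop
        return "persistent loop" if stuck else "temporary loop"
    if current and current[-1] == client:
        return "path change" if current != baseline else "no anomaly"
    return "outage"  # the probe stopped at an intermediate hop
```

For example, a trace that reaches the client through a different router is labeled a path change, while one that ends two hops in is labeled an outage.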

Non-anomalies • Non-anomalies • Ultrashort anomalies • Path-based TTL • Aggressive timeout

Identifying Forward Outages • Forward outages • Route change • ICMP dest unreachable • Forward timeout

Loop Effect on RTT • How do loops affect RTTs? • Loops can incur high latency inflation

Loop Effect on Loss Rate • How do loops affect loss rates? • 65% temporary and 55% persistent loops preceded by loss rates exceeding 30%

Forward Anomaly Effect on RTT • How do forward anomalies affect RTTs? • Outages and changes can incur latency inflation • Outages have more negative effect on RTTs

Forward Anomaly Effect on Loss Rate • How do forward anomalies affect loss rates? • 45% outages and 40% changes preceded by loss rates exceeding 30%

Reducing Measurement Overhead • Can we reduce the number of probes? • 15 probes can achieve the same accuracy in 80% cases • Flow-based TTL

Traffic Breakdown By Tiers
