Investigating Sampling Biases in IP Topology Measurements

Sampling Biases in IP Topology Measurements John Byers with Anukool Lakhina, Mark Crovella and Peng XieDepartment of Computer ScienceBoston University

Discovering the Internet topology 163.55.221.98 212.12.58.3 212.12.5.77 163.55.1.10 163.55.1.41 source destination Measurement Primitive:tracerouteReports the IP path from A to B i.e., how IP paths are overlaid on the router graph • Goal: Discover the Internet Router Graph • Vertices represent routers, • Edges connect routers that are one IP hop apart

Traceroute studies today • k sources: Few active sources, strategically located. • m destinations: Many passive destinations, globally dispersed. • Union of many traceroute paths. • (k,m)-traceroute study Sources Destinations

Heavy tails in Topology Measurements Frequency log(Pr[X>x]) Dataset from [PG98] Degree Subsequent measurements show that the degree distribution is a heavy tail, [GT00, BC01, …] log( ) A surprising finding: [FFF99] Let be a given node degree. Let be frequency of degree vertices in a graph Power-law relationship:

We’re skeptical • We will argue that the evidence for power laws is at best insufficient. • Insufficient does not mean noisy or incomplete. (which these datasets certainly are!) • For us, insufficient means that measurements are statistically biased. We will show that (k,m)-traceroute studies exhibit significant sampling bias.

A thought experiment • Idea: Simulate topology measurements on a random graph. • Generate a sparse Erdös-Rényi random graph, G=(V,E). Each edge present independently with probability pAssign weights: w(e) = 1 + e , where e in • Pick k unique source nodes, uniformly at random • Pick m unique destination nodes, uniformly at random • Simulate traceroute from k sources to m destinations, i.e. learn shortest paths between k sources and m destinations. • Let Ĝ be union of shortest paths. • Ask: How does Ĝ compare with G ?

Underlying Random Graph, G log(Pr[X>x]) MeasuredGraph, Ĝ Underlying Graph: N=100000, p=0.00015Measured Graph: k=3, m=1000 log(Degree) Ĝ is a biasedsample of G that looks heavy-tailedAre heavy tails a measurement artifact?

Outline • Motivation and Thought Experiments • Understanding Bias on Simulated Topologies Where and Why • Detecting and Defining Bias Statistical hypotheses to infer presence of bias • Examining Internet Maps

Understanding Bias (k,m)-traceroute sampling of graphs is biased An intuitive explanation: When traces are run from few sources to large destinations, some portions of underlying graph are explored more than others. We now investigate the causes behind bias.

Are nodes sampled unevenly? True Degrees of nodes in Ĝ Degrees of all nodes in G Measured Graph:k=5,m=1000 • Conclusion: Difference between true degrees of Ĝ and degrees of G is insignificant; dismiss conjecture. • Conjecture:Shortest path routing favors higher degree nodes nodes sampled unevenly • Validation:Examine true degrees of nodes in measured graph, Ĝ. Expect true degrees of nodes in Ĝ to be higher than degrees of nodes in G, on average.

Are edges sampled unevenly? Observed Degree True Degree • Conclusion: Edges incident to a node are sampled disproportionately; supports conjecture. • Conjecture:Edges selected incident to a node in Ĝ not proportional to true degree. • Validation:For each node in Ĝ, plot true degree vs. measured degree. If unbiased, ratio of true to measured degree should be constant. Points clustered around y=cx line (c<1).

Why: Analyzing Bias • Result of adding more destinations: most new nodes and edges closer to the source. • Question:Given some vertex in Ĝ that is h hops from the source, what fraction of its true edges are contained in Ĝ? • Messages: • As h increases, number of edges discovered falls off sharply.* 1000dst Fraction of node edges discovered 600dst 100dst Distance from source * We can prove exponential fall-off analytically, in a simplified model.

What does this suggest? Measured Graph Underlying Graph log(Pr[X>x]) Hop2 Hop3 Hop4 Hop1 log(Degree) Summary:Edges are sampled unevenly by (k,m)-traceroute methods.Edges close to the source are sampled more often than edges further away. Intuitive Picture: Neighborhood near sources is well explored but, visibility of edges declines sharply with hop distance from sources.

Outline • Motivation and Thought Experiments • Understanding Bias in Simulated Topologies Where and Why • Detecting and Defining Bias Statistical hypotheses to infer presence of bias • Examining Internet Maps

Inferring Bias Goal: Given a measured Ĝ, does it appear to be biased? Why this is difficult: Don’t have underlying graph. Don’t have formal criteria for checking bias. General Approach:Examine statistical properties as a function of distance from nearest source. Unbiased sample  No change Change  Bias

Detecting Bias Examine Pr[D=d|H=h], the conditional probability that a node has degree d, given that it is at distance h from the source. Underlying Graph log(Pr[X>x]) Ĝ degrees| H=3 Ĝ degrees| H=2 log(Degree) Two observations:1.Highest degree nodes are near the source.2. Degree distribution of nodes near the source different from those far away

A Statistical Test for C1 C1:Are the highest-degree nodes near the source? If so, then consistent with bias. The 1% highest degree nodes occur at random with distance to nearest source. H0C1 Cut vertex set in half: N(near) and F(far), by distance from nearest source. Let v : (0.01) |V| k : fraction of vthat lies in N Can bound likelihood k deviates from 1/2 using Chernoff-bounds: Reject hypothesis with confidence 1-a if:

A Statistical Test for C2 C2: Is the degree distribution of nodes near the source different from those further away? If so, consistent with bias. Chi Square Test succeeds on degree distribution for nodes near the source and far from the source. H0C2 Partition vertices across median distance: N (near) and F (far) Compare degree distribution of nodes in NandF, using the Chi-Square Test: where O and E are observed and expected degree frequencies and l is histogram bin size. Reject hypothesis with confidence 1-a if:

Our Definition of Bias • Bias (Definition):Failure of a sampled graph to meet statistical tests for randomness associated with C1 and C2. • Disclaimers: Tests are not conclusive. Tests are binary and don’t tell us how biased datasets are. • But dataset that fails both tests is a poor choice to make generalizations of underlying graph.

Introducing datasets Pansiot-Grad Mercator Skitter log(Pr[X>x]) log(Degree)

Testing C1 H0C1 The 1% highest degree nodes occur at random with distance to source. Pansiot-Grad: 93% of the highest degree nodes are in N Mercator: 90% of the highest degree nodes are in N Skitter: 84% of the highest degree nodes are in N

Testing C2 Mercator Pansiot-Grad Skitter Near Near Near log(Pr[X>x]) Far Far Far All All All log(Degree) H0C2

Summary of Statistical Tests • All datasets pass both statistical tests for evidence of bias. • Likely that true degree distribution of the routers is different than that of these datasets.

Final Remarks • Using (k,m)-traceroute methods to discover Internet topology yields biased samples. • Rocketfuel [SMW:02] is limited-scale but may avoid some pitfalls of (k,m)-traceroute studies. • One open question: How to sample the degree of a router at random? • Node degree distributions are especially sensitive to biased sampling  may not be a sufficiently robust metric for characterizing or comparing graphs.

Sampling Power-Law Graphs Underlying, Power-Law Graph log(Pr[X>x]) MeasuredGraph Underlying PLRG: N=100000Measured Graph: k=3, m=1000 log(Degree) Even though distributional shape similar,different exponents matter for topology modeling. Again, Ĝ is a biasedsample of G

Investigating Sampling Biases in IP Topology Measurements