
More “normal” than Normal: Scaling distributions in complex systems






Presentation Transcript


  1. More “normal” than Normal: Scaling distributions in complex systems. Walter Willinger (AT&T Labs-Research), David Alderson (Caltech), John C. Doyle (Caltech), Lun Li (Caltech). Winter Simulation Conference 2004

  2. Acknowledgments • Reiko Tanaka (RIKEN, Japan) • Matt Roughan (U. Adelaide, Australia) • Steven Low (Caltech) • Ramesh Govindan (USC) • Neil Spring (U. Maryland) • Stanislav Shalunov (Abilene) • Heather Sherman (CENIC)

  3. Agenda More “normal” than Normal • Scaling distributions, power laws, heavy tails • Invariance properties High Variability in Network Measurements • Case Study: Internet Traffic (HTTP, IP) • Model Requirement: Internal Consistency • Choice: Pareto vs. Lognormal • Case Study: Internet Topology (Router-level) • Model Requirement: Resilience to Ambiguity • Choice: Scale-Free vs. HOT

  4. [Figure: log-log plot of the 20th Century’s 100 largest disasters worldwide, log(rank) vs. log(size): technological disasters ($10B), natural disasters ($100B), and US power outages (10M of customers).]

  5. Note: it is helpful to use cumulative distributions to avoid statistics mistakes. [Figure: the same log-log plot; log(cumulative frequency) = log(rank) vs. log(size).]

  6. [Figure: the same log-log rank-size plot, with raw frequency counts (100, 10, 3, 2, 1) marked along the rank axis.]

  7. Typical events are relatively small (near or below the median), while the largest events are huge, exceeding the median by orders of magnitude. [Figure: log-log rank-size plot with the median marked.]

  8. [Figure: the 20th Century’s 100 largest disasters worldwide, with a reference line of slope = -1: technological ($10B), natural ($100B), US power outages (10M of customers, 1985-1997).]

  9. A random variable X is said to follow a power law with index α > 0 if P[X > x] ≈ c x^(-α) for large x. [Figure: the disaster data with a fitted line of slope = -1, i.e. α = 1.]

  10. [Figure: US power outages (10M of customers, 1985-1997) on log-log axes, with a line of slope = -1 (α = 1) extrapolated past the largest observation.] A large event is not inconsistent with the statistics.

  11. Observed power law relationships • Species within plant genera (Yule 1925) • Mutants in bacterial populations (Luria and Delbrück 1943) • Economics: income distributions, city populations (Simon 1955) • Linguistics: word frequencies (Mandelbrot 1997) • Forest fires (Malamud et al. 1998) • Internet traffic: flow sizes, file sizes, web documents (Crovella and Bestavros 1997) • Internet topology: node degrees in physical and virtual graphs (Faloutsos et al. 1999) • Metabolic networks (Barabási and Oltvai 2004)

  12. Notation • Nonnegative random variable X • CDF: F(x) = P[X ≤ x] • Complementary CDF (CCDF): 1 – F(x) = P[X > x] NB: Avoid descriptions based on probability density f(x)! [Figure panels: cumulative rank-size relationship vs. frequency-based relationship.]

  13. [Figure: the same data shown two ways on log-log axes: a cumulative rank-size plot (straight lines for α = 0 and α = 1) and a non-cumulative frequency-size plot (noisy, especially in the tail).] Avoid non-cumulative frequency relationships for power laws.
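The cumulative rank-size relationship recommended above is easy to compute directly from data. A minimal sketch in Python (the toy data values are invented for illustration; the rank-k/n convention gives the fraction of samples at least as large as each observation):

```python
# Empirical CCDF: sort sizes in decreasing order, so that the k-th largest
# value x_(k) is assigned k/n -- exactly the rank-size relationship.

def empirical_ccdf(samples):
    """Return (size, fraction >= size) pairs suitable for a log-log plot."""
    xs = sorted(samples, reverse=True)   # largest first
    n = len(xs)
    return [(x, (k + 1) / n) for k, x in enumerate(xs)]

# Toy data (illustrative only, not from the talk's measurements):
data = [1, 2, 2, 4, 8, 16, 32, 64, 128, 256]
pairs = empirical_ccdf(data)
```

Plotting `pairs` on log-log axes produces the CCDF curves used throughout these slides.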

  14. Notation • Nonnegative random variable X • CDF: F(x) = P[X ≤ x] • Complementary CDF (CCDF): 1 – F(x) = P[X > x] NB: Avoid descriptions based on probability density f(x)! Avoid non-cumulative frequency relationships for power laws.

  15. Notation • Nonnegative random variable X • CDF: F(x) = P[X ≤ x] • Complementary CDF (CCDF): 1 – F(x) = P[X > x] NB: Avoid descriptions based on probability density f(x)! For many commonly used distribution functions • Right tails decrease exponentially fast • All moments exist and are finite • Corresponding variable X exhibits low variability (i.e. concentrates tightly around its mean)

  16. Subexponential Distributions Following Goldie and Klüppelberg (1998), we say that F (or X) is subexponential if P[X1 + X2 + … + Xn > x] ~ P[max(X1, …, Xn) > x] as x → ∞, where X1, X2, …, Xn are IID non-negative random variables with distribution function F. This says that Σ Xi is likely to be large iff max(Xi) is large (i.e. there is a non-negligible probability of extremely large values in a subexponential sample). This implies for subexponential distributions that e^(λx) P[X > x] → ∞ as x → ∞, for all λ > 0 (i.e. the right tail decays more slowly than any exponential).

  17. Heavy-tailed (Scaling) Distributions A subexponential distribution function F(x) (or random variable X) is called heavy-tailed or scaling if P[X > x] ~ c x^(-α) as x → ∞  (1) for some 0 < α < 2 and some constant 0 < c < ∞. The parameter α is called the tail index • 1 < α < 2: F has finite mean, infinite variance • 0 < α ≤ 1: F has infinite mean, infinite variance • In general, all moments of order β ≥ α are infinite.
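To see concretely why moments of order β ≥ α are infinite, one can look at truncated moments in closed form. A small sketch, assuming the standard Pareto tail P[X > x] = x^(-α) for x ≥ 1 (a normalization chosen for illustration, not stated on the slide); with α = 1.5 the mean is finite but the truncated second moment grows without bound as the truncation level T increases:

```python
# Pareto with density alpha * x**(-alpha-1) on [1, inf), tail index alpha = 1.5.
alpha = 1.5

def truncated_second_moment(T, alpha):
    # integral_1^T x^2 * alpha * x^(-alpha-1) dx = alpha/(2-alpha) * (T^(2-alpha) - 1);
    # for alpha < 2 this diverges like T^(2-alpha) as T grows.
    return alpha / (2 - alpha) * (T ** (2 - alpha) - 1)

def pareto_mean(alpha):
    # finite for alpha > 1: E[X] = alpha / (alpha - 1)
    return alpha / (alpha - 1)

growth = [truncated_second_moment(T, alpha) for T in (1e2, 1e4, 1e6)]
```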

  18. Simple Constructions for Heavy-Tails • For U uniform in [0,1], set X = 1/U; then X is heavy-tailed with α = 1. • For E (standard) exponential, set X = exp(E); then X is heavy-tailed with α = 1. • The mixture of exponential distributions whose parameter 1/λ has a (centered) Gamma(a,b) distribution is a Pareto distribution with α = a. • The distribution of the time between consecutive visits to zero of a symmetric random walk is heavy-tailed with α = 1/2.
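The first two constructions can be checked numerically; a sketch using a Hill-type estimate of the tail index (the estimator choice, sample sizes, and seed are ours, not from the talk):

```python
import math
import random

random.seed(42)

def hill_estimate(samples, k):
    """Hill estimator of the tail index alpha from the k largest order statistics."""
    xs = sorted(samples, reverse=True)
    return k / sum(math.log(xs[i] / xs[k]) for i in range(k))

n, k = 100_000, 1_000
# X = 1/U (using 1 - U to stay away from division by zero; same distribution):
x_inv = [1 / (1 - random.random()) for _ in range(n)]
# X = exp(E) for E standard exponential:
x_exp = [math.exp(random.expovariate(1.0)) for _ in range(n)]

alpha_inv = hill_estimate(x_inv, k)   # should be close to 1
alpha_exp = hill_estimate(x_exp, k)   # should be close to 1
```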

  19. Power Laws Note that (1) implies log P[X > x] ≈ log c – α log x. • Scaling distributions are also called power law distributions. • We will use the notions of power laws, scaling distributions, and heavy tails interchangeably, requiring only that P[X > x] ≈ c x^(-α) for large x. In other words, the CCDF, when plotted on log-log scale, follows an approximate straight line with slope -α.

  20. [Figure: the 20th Century’s 100 largest disasters worldwide again, with the fitted line of slope = -1 (α = 1).]

  21. Why “Heavy Tails” Matter … • Risk modeling (insurance) • Load balancing (CPU, network) • Job scheduling (Web server design) • Combinatorial search (Restart methods) • Complex systems studies (SOC vs. HOT) • Understanding the Internet • Behavior (traffic modeling) • Structure (topology modeling)

  22. Power laws are ubiquitous • High variability phenomena abound in natural and man-made systems • Tremendous attention has been directed at whether or not such phenomena are evidence of universal properties underlying all complex systems • Recently, discovering and explaining power law relationships has been a minor industry within the complex systems literature • We will use the Internet as a case study to examine what power laws do or don’t have to say about its behavior and structure. First, we review some basic properties of scaling distributions

  23. Response to Conditioning • If X is heavy-tailed with index α, then the conditional distribution of X given that X > w satisfies P[X > x | X > w] ≈ (x/w)^(-α) for x ≥ w. For large values of x, this is identical to the unconditional distribution P[X > x], except for a change in scale. • The non-heavy-tailed exponential distribution has conditional distribution of the form P[X > x | X > w] = e^(-λ(x-w)). The response to conditioning is a change in location, rather than a change in scale.
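Both conditioning identities can be verified numerically for concrete distributions. A sketch, assuming an exact Pareto tail P[X > x] = x^(-α) for x ≥ 1 and a unit-rate exponential (normalizations chosen for illustration):

```python
import math

alpha = 1.5

def pareto_tail(x):              # unconditional: P[X > x] = x**(-alpha), x >= 1
    return x ** -alpha

def pareto_cond_tail(x, w):      # P[X > x | X > w] = tail(x) / tail(w), x >= w
    return pareto_tail(x) / pareto_tail(w)

def exp_tail(x, lam=1.0):        # exponential: P[X > x] = exp(-lam * x)
    return math.exp(-lam * x)

def exp_cond_tail(x, w, lam=1.0):
    return exp_tail(x, lam) / exp_tail(w, lam)

# Pareto: conditioning on X > w only rescales -- cond_tail(x, w) == tail(x / w):
scale_errs = [abs(pareto_cond_tail(x, 10.0) - pareto_tail(x / 10.0))
              for x in (10.0, 20.0, 100.0)]
# Exponential: conditioning only shifts -- cond_tail(x, w) == tail(x - w):
shift_errs = [abs(exp_cond_tail(x, 3.0) - exp_tail(x - 3.0))
              for x in (3.0, 5.0, 10.0)]
```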

  24. Mean Residual Lifetime • An important feature that distinguishes heavy-tailed distributions from non-heavy-tailed counterparts • For the exponential distribution with parameter λ, the mean residual lifetime E[X – w | X > w] = 1/λ is constant in w • For a scaling distribution with tail index α > 1, the mean residual lifetime E[X – w | X > w] = w/(α – 1) is (linearly) increasing in w
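The two closed forms above can be sketched directly (the Pareto formula assumes an exact Pareto distribution; for general scaling distributions it holds only asymptotically):

```python
# Closed forms for the mean residual lifetime E[X - w | X > w]:
# Exponential(lam): constant 1/lam, by memorylessness.
# Pareto(alpha):    w / (alpha - 1) for alpha > 1 -- grows linearly in w.

def mrl_exponential(w, lam):
    return 1.0 / lam

def mrl_pareto(w, alpha):
    return w / (alpha - 1)

exp_vals = [mrl_exponential(w, lam=2.0) for w in (1, 10, 100)]   # flat
par_vals = [mrl_pareto(w, alpha=1.5) for w in (1, 10, 100)]      # linear in w
```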

  25. Key Mathematical Properties of Scaling Distributions • Response to conditioning (change in scale) • Mean residual lifetime (linearly increasing) Invariance Properties • Invariant under aggregation • Non-classical CLT and stable laws • (Essentially) invariant under maximization • Domain of attraction of Fréchet distribution • (Essentially) invariant under mixture • Example: The largest disasters worldwide • Invariant under marginalization

  26. Linear Aggregation: Classical Central Limit Theorem • A well-known result • X(1), X(2), … independent and identically distributed random variables with distribution function F (finite mean μ and finite variance σ²) • S(n) = X(1) + X(2) + … + X(n), the n-th partial sum • (S(n) – nμ) / (σ n^(1/2)) → N(0,1) as n → ∞ • More general formulations are possible • Often-used argument for the ubiquity of the normal distribution

  27. Linear Aggregation: Non-classical Central Limit Theorem • A less well-known result • X(1), X(2), … independent and identically distributed with common distribution function F that is heavy-tailed with 1 < α < 2 • S(n) = X(1) + X(2) + … + X(n), the n-th partial sum • (S(n) – a(n)) / n^(1/α) converges to a stable law with index α, for suitable centering constants a(n) • The limit distribution is heavy-tailed with index α • More general formulations are possible • The Gaussian distribution is the special case α = 2 • Rarely taught in most Stats/Probability courses
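One qualitative way to see why the classical CLT fails here: for heavy-tailed summands, the largest single term carries a non-negligible share of the partial sum, while for light-tailed summands that share vanishes quickly. A Monte Carlo sketch (sample sizes, trial counts, and seed are arbitrary choices of ours):

```python
import random

random.seed(0)

def max_share(draw, n, trials):
    """Average of max(X_1..X_n) / (X_1 + ... + X_n) over several trials."""
    total = 0.0
    for _ in range(trials):
        xs = [draw() for _ in range(n)]
        total += max(xs) / sum(xs)
    return total / trials

alpha = 1.5
def pareto():                       # Pareto via inverse transform, tail index 1.5
    return (1 - random.random()) ** (-1 / alpha)
def expo():                         # light-tailed comparison
    return random.expovariate(1.0)

heavy_share = max_share(pareto, n=1000, trials=30)   # stays substantial
light_share = max_share(expo, n=1000, trials=30)     # tiny (~ log(n)/n)
```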

  28. Maximization: Maximum Domain of Attraction • A not so well-known result (extreme-value theory) • X(1), X(2), … independent and identically distributed with common distribution function F that is heavy-tailed with 1 < α < 2 • M(n) = max(X(1), …, X(n)), the n-th successive maximum • M(n) / n^(1/α) → G as n → ∞ • G is the Fréchet distribution exp(-x^(-α)) • G is heavy-tailed with index α
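For exactly Pareto samples, the convergence to the Fréchet limit can be checked in closed form, since P[M(n) ≤ x] = F(x)^n, so the rescaled maximum has CDF (1 – x^(-α)/n)^n → exp(-x^(-α)). A small sketch (the tail index and the evaluation point x = 1.3 are arbitrary choices for illustration):

```python
import math

alpha = 1.5

def rescaled_max_cdf(x, n):
    # P[M(n)/n**(1/alpha) <= x] for exact Pareto samples:
    # (1 - (x * n**(1/alpha))**(-alpha))**n = (1 - x**(-alpha)/n)**n
    return (1 - x ** (-alpha) / n) ** n

def frechet_cdf(x):
    return math.exp(-x ** (-alpha))

# Approximation error shrinks as n grows:
errs = [abs(rescaled_max_cdf(1.3, n) - frechet_cdf(1.3)) for n in (10, 100, 10_000)]
```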

  29. Weighted Mixture • A little known result • X(1), X(2), … independent random variables having distribution functions Fi that are heavy-tailed with common index 1 < α < 2, but possibly different scale coefficients ci • Consider the weighted mixture W(n) of the X(i)’s • Let pi be the probability that W(n) = X(i), with p1 + … + pn = 1; then one can show P[W(n) > x] ≈ cW x^(-α), where cW = Σ pi ci is the weighted average of the separate scale coefficients ci • Thus, the weighted mixture of scaling distributions is also scaling, with the same tail index but a different scale coefficient

  30. Multivariate Case: Marginalization • For a random vector X ∈ Rd, if all linear combinations Y = Σk bk Xk are stable with index α ≥ 1, then X is a stable vector in Rd with index α. • Conversely, if X is an α-stable random vector in Rd, then any linear combination Y = Σk bk Xk is an α-stable random variable. • Marginalization • The marginal distribution of a multivariate heavy-tailed random variable is also heavy-tailed • Consider the convex combination with multipliers b = (0, …, 0, 1, 0, …, 0) that projects X onto the k-th axis • All stable laws (including the Gaussian) are invariant under this type of transformation

  31. Invariance Properties • For low variability data, minimal conditions on the distribution of individual constituents (i.e. finite variance) yields classical CLT • For high variability data, more restrictive assumption (i.e. right tail of the distribution of the individual constituents must decay at a certain rate) yields greater invariance

  32. Scaling: “more normal than Normal” • Aggregation, mixture, maximization, and marginalization are transformations that occur frequently in natural and engineered systems and are inherently part of many measured observations that are collected about them. • Invariance properties suggest that the presence of scaling distributions in data obtained from complex natural or engineered systems should be considered the norm rather than the exception. • Scaling distributions should not require “special” explanations.

  33. Our Perspective • Gaussian distributions as the natural null hypothesis for low variability data • i.e. when variance estimates exist, are finite, and converge robustly to their theoretical value as the number of observations increases • Scaling distributions as the natural and parsimonious null hypothesis for high variability data • i.e. when variance estimates tend to be ill-behaved and converge either very slowly or fail to converge altogether as the size of the data set increases

  34. High-Variability in Network Measurements: Implications for Internet Modeling and Model Validation. Walter Willinger (AT&T Labs-Research), David Alderson (Caltech), John C. Doyle (Caltech), Lun Li (Caltech). Winter Simulation Conference 2004

  35. Agenda More “normal” than Normal • Scaling distributions, power laws, heavy tails • Invariance properties High Variability in Network Measurements • Case Study: Internet Traffic (HTTP, IP) • Model Requirement: Internal Consistency • Choice: Pareto vs. Lognormal • Case Study: Internet Topology (Router-level) • Model Requirement: Resilience to Ambiguity • Choice: Scale-Free vs. HOT

  36. G.E.P. Box: “All models are wrong, … • … but some are useful.” • Which ones? • In what sense? • … but some are less wrong. • Which ones? • In what sense? • Mandelbrot’s version: • “When exactitude is elusive, it is better to be approximately right than certifiably wrong.”

  37. What about Internet measurements? • High-volume data sets • Individual data sets are huge • Huge number of different data sets • Even more and different data in the future • Rich semantic context of the data • A packet is more than arrival time and size • Internet is full of “high variability” • Link bandwidth: Kbps – Gbps • File sizes: a few bytes – Mega/Gigabytes • Flows: a few packets – 100,000+ packets • In/out-degree (Web graph): 1 – 100,000+ • Delay: Milliseconds – seconds and beyond

  38. On Traditional Internet Modeling • Step 0: Data Analysis • One or more sets of comparable measurements • Step 1: Model Selection • Choose parametric family of models/distributions • Step 2: Parameter Estimation • Take a strictly static view of data • Step 3: Model Validation • Select “best-fitting” model • Rely on some “goodness-of-fit” criteria/metrics • Rely on some performance comparison How to deal with “high variability”? • Option 1: High variability = large, but finite variance • Option 2: High variability = infinite variance

  39. Some Illustrative Examples • Some commonly-used plotting techniques • Probability density functions (pdf) • Cumulative distribution functions (CDF) • Complementary CDF (CCDF) • Different plots emphasize different features • Main body of the distribution vs. tail • Variability vs. concentration • Uni- vs. multi-modal

  40. [Figure: probability density functions f(x) of Lognormal(0,1), Gamma(0.53,3), Exponential(1.6), Weibull(0.7,0.9), and Pareto(1,1.5), plotted on linear axes for x in [0, 5].]

  41. [Figure: cumulative distribution functions F(x) of Lognormal(0,1), Gamma(0.53,3), Exponential(1.6), Weibull(0.7,0.9), and Pareto(1,1.5), plotted on linear axes for x in [0, 20].]

  42. [Figure: complementary CDFs, log(1-F(x)) vs. log(x), for Lognormal(0,1), Gamma(0.53,3), Exponential(1.6), and Weibull(0.7,0.9).]

  43. [Figure: complementary CDFs, log(1-F(x)) vs. log(x), for Lognormal(0,1), Gamma(0.53,3), Exponential(1.6), Weibull(0.7,0.9), ParetoII(1,1.5), and ParetoI(0.1,1.5).]

  44. By Example: Internet Traffic • HTTP Connection Sizes from 1996 • IP Flow Sizes (2001) Internet Topology • Router-level connectivity (1996, 2002)

  45. HTTP Connection Sizes (1996) • 1 day of LBL’s WAN traffic (in- and outbound) • About 250,000 HTTP connection sizes (bytes) • Courtesy of Vern Paxson [Figure: CCDF 1-F(x) of HTTP connection size on log-log axes.]

  46. HTTP Connection Sizes (1996) [Figure: the HTTP data CCDF with a fitted Lognormal and a fitted Pareto overlaid.] How to deal with “high variability”? • Option 1: High variability = large, but finite variance • Option 2: High variability = infinite variance Fitted 2-parameter Lognormal (μ = 6.75, σ = 2.05); fitted 2-parameter Pareto (α = 1.27, m = 2000)
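Both candidate fits have simple maximum-likelihood forms: the lognormal MLE is the mean and standard deviation of the log-sizes, while the Pareto MLE above a cutoff m has a closed form. A sketch on synthetic data (not the LBL trace; the true tail index and sample size below are arbitrary):

```python
import math
import random

random.seed(1)

def fit_lognormal(sizes):
    """MLE for Lognormal(mu, sigma): mean and std of the log-sizes."""
    logs = [math.log(x) for x in sizes]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma

def fit_pareto(sizes, m):
    """MLE for the Pareto tail index above cutoff m: n / sum(log(x_i / m))."""
    tail = [x for x in sizes if x >= m]
    return len(tail) / sum(math.log(x / m) for x in tail)

# Synthetic Pareto data with cutoff 2000 and tail index 1.3 (illustrative):
alpha_true = 1.3
data = [2000 * (1 - random.random()) ** (-1 / alpha_true) for _ in range(50_000)]

alpha_hat = fit_pareto(data, m=2000)       # should recover ~1.3
mu_hat, sigma_hat = fit_lognormal(data)    # lognormal fit of the same data
```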

  47. IP Flow Sizes (2001) • 4-day period of traffic at Auckland • About 800,000 IP flow sizes (bytes) • Courtesy of NLANR and Joel Summers [Figure: CCDF 1-F(x) of IP flow size on log-log axes.]

  48. IP Flow Sizes (2001) [Figure: the IP flow data CCDF with a fitted Lognormal and a fitted Pareto overlaid.] How to deal with “high variability”? • Option 1: High variability = large, but finite variance • Option 2: High variability = infinite variance

  49. Samples from Pareto Distribution [Figure: CCDF 1-F(x) of the fitted Pareto together with samples drawn from it, on log-log axes.]

  50. Samples from Lognormal Distribution [Figure: CCDF 1-F(x) of the fitted Lognormal together with samples drawn from it, on log-log axes.]
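The consistency check behind slides 49-50 — drawing fresh samples from each fitted model and overlaying their empirical CCDF on the fitted curve — can be sketched with inverse-transform sampling (the parameter values are the fits quoted for the HTTP data on slide 46; the sample count is our choice):

```python
import math
import random

random.seed(2)

def sample_pareto(alpha, m):
    """Inverse-transform sample from Pareto(alpha) with minimum m."""
    return m * (1 - random.random()) ** (-1 / alpha)

def sample_lognormal(mu, sigma):
    """Sample from Lognormal(mu, sigma) as exp of a Gaussian."""
    return math.exp(random.gauss(mu, sigma))

# Draws from the two fitted models (parameters from the HTTP fits):
pareto_draws = [sample_pareto(1.27, 2000) for _ in range(10_000)]
lognorm_draws = [sample_lognormal(6.75, 2.05) for _ in range(10_000)]
```

Plotting the empirical CCDF of each sample set against its fitted curve reproduces the kind of visual consistency check shown on these two slides.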
