
Network Tomography Based on Flow Level Measurements

Network Tomography Based on Flow Level Measurements. Dogu Arifler Ph.D. Defense Committee Members: Prof. Ross Baldick Prof. Melba M. Crawford Prof. Gustavo de Veciana (Co-advisor) Prof. Brian L. Evans (Co-advisor) Prof. Theodore S. Rappaport Prof. Sanjay Shakkottai April 19, 2004.


Presentation Transcript


  1. Network Tomography Based on Flow Level Measurements Dogu Arifler Ph.D. Defense Committee Members: Prof. Ross Baldick Prof. Melba M. Crawford Prof. Gustavo de Veciana (Co-advisor) Prof. Brian L. Evans (Co-advisor) Prof. Theodore S. Rappaport Prof. Sanjay Shakkottai April 19, 2004

  2. Outline • Introduction • Background and motivation • Overview of contributions • Methodology for inferring network resource sharing • Conditional sampling • Flow filtering • Dimensionality reduction • Validation • Simulation studies • Application to real data with the bootstrap • Conclusion • Summary • Future work

  3. Inference of network properties • Motivation: Network managers need information about properties of networks to better plan for services and diagnose performance problems • Problem: In general, properties of networks outside one’s administrative domain are unknown • Little or no information on routing and topology • Little or no information on link and server utilizations • Solution: Network tomography • Inferring characteristics of networks from available network traffic measurements • Application of statistical methods to network measurements

  4. Inference of congested resource sharing [Figure: examples of congested resources: a link failure, a congested content server, a wireless hot spot] • Internet service providers • Diagnose misconfigurations, link failures • End users • Assess routing diversity • Infer how resources are allocated • Content providers • Balance workload among servers • Plan placement of caches • Wireless service providers • Evaluate adequacy of backhaul link capacity • Determine if access point is configured properly

  5. Related work • Brute force: via a Unix utility, traceroute • Cooperation of routers along packet’s route required • Providers unwilling to disclose information for security concerns • Topology visualization: skitter [CAIDA], rocketfuel [UWA] • Location-based approximations [Savage, Cardwell, Anderson, 1999] • Packets destined for a given network address generally follow the same path • Statistical techniques on packet level measurements • Correlation of end-to-end packet losses [Harfoush, Bestavros, Byers, 2000] • Clustering based on minimizing entropy of inter-packet spacing [Katabi, Bazzi, Yang, 2001] • Correlation of end-to-end packet losses and delays [Rubenstein, Kurose, Towsley, 2002]

  6. Network tomography based on flows • Packet level measurements are • Data intensive to collect and store • Dependent on cooperation of network and/or collaboration of users • Complex to analyze • Propose a significantly different strategy to infer network properties • Correlation of passive flow level measurements available at a local measurement site • A flow is a sequence of packets associated with a given instance of an application • Packets corresponding to transfer of a Web page, file, e-mail, etc. • Flow is an abstraction at higher protocol layers, i.e. closer to the application layer

  7. Flow level measurements [Diagram: packets of a flow observed on a monitored link are summarized into flow records (identifiers, start time, end time, response time, timeout) and stored in a data warehouse] • Flow records • Summary information • Easier to collect and store • State-of-the-art networking equipment can collect flow records (e.g. Cisco NetFlow, sFlow, Argus) • Records contain • Source/destination IP addresses, port numbers, number of packets and bytes in the flow, and start time and end time of flow

  8. TCP flows [Figure: two TCP flows sharing the available capacity of a link over time] • Approximately 80% of flows in the Internet are transferred via TCP [CAIDA, 1999] • TCP adapts its data transmission rate to available network capacity • Congested link bandwidth sharing among flows is roughly fair • One performance measure for TCP flows is perceived throughput • Amount of data in bytes (flow size) divided by response time • Premise: Throughputs of TCP flows that temporally overlap at a congested resource are correlated

  9. Overview of contributions • New approach to network tomography based on flow level measurements • Methodology for inferring congested resource sharing: 1. Conditional sampling strategy • Estimation of correlation matrix from pairwise correlations 2. Flow filtering criteria • Preprocessing flow records: omitting flows based on size in bytes, duration, and number of packets 3. Dimensionality reduction • Exploratory factor analysis via principal component method 4. Validation with measured data • Bootstrap methods to estimate confidence intervals for factor analysis results

  10. Outline • Introduction • Background and motivation • Overview of contributions • Methodology for inferring network resource sharing • Conditional sampling • Flow filtering • Dimensionality reduction • Validation • Simulation studies • Application to real data with the bootstrap • Conclusion • Summary • Future work

  11. Contribution #1 Throughput of a flow class • Flow class is a collection of flow records that have a common identifier, e.g. source/destination address • How can one infer which flow classes share resources? • Correlate flow class throughput processes [Diagram: flow records collected at a measurement site are grouped into per-class (class 1, class 2, …) throughput processes over time]

  12. Contribution #1 Conditional sampling of random processes [Diagram: activity of each class over n time slots; samples are taken where two classes, e.g. the red and blue classes, are simultaneously active] • Which flow class throughput samples can be used to capture flow class throughput correlations? • Use a pairwise approach to estimate the correlation matrix • Estimate throughput correlations between class pairs by using samples at times when both classes of the pair are active • Construct correlation matrix R whose element R_ij is the sample correlation of the throughputs of classes i and j over the times when both are active (R_ii = 1)
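The conditional sampling step can be sketched in Python. This is a minimal illustration under assumed data layout (a slotted throughput array plus a boolean activity array), not the dissertation's code:

```python
import numpy as np

def conditional_correlation_matrix(throughput, active):
    """Estimate the flow-class throughput correlation matrix R.

    throughput : (T, p) array of per-class throughput samples over T slots
    active     : (T, p) boolean array; active[t, j] is True when class j
                 has at least one flow in progress during slot t

    Entry R[i, j] is the sample correlation of classes i and j computed
    only over slots where BOTH classes are active (conditional sampling).
    """
    T, p = throughput.shape
    R = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            both = active[:, i] & active[:, j]
            xi, xj = throughput[both, i], throughput[both, j]
            if both.sum() >= 2 and xi.std() > 0 and xj.std() > 0:
                R[i, j] = R[j, i] = np.corrcoef(xi, xj)[0, 1]
    return R
```

Note that a matrix assembled from pairwise estimates over different sample sets is symmetric with unit diagonal by construction, but each off-diagonal entry is based on its own set of overlap times.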

  13. Contribution #2 Flow filtering • Can one better capture correlations due to resource sharing if only a subset of flow records are used? • Throughputs of short TCP flows are noisy, because they do not have an opportunity to “learn” the congestion state • Amount of temporal overlap between a long TCP flow and a short TCP flow is small • What is the impact of short flows and long flows on throughput correlations? • Model instantaneous link bandwidth available to a flow as an autoregressive process • Analyze the effect of flow duration and amount of overlap between flows on throughput correlation

  14. Contribution #2 Autoregressive model for available bandwidth [Figure: two flows overlapping in time] • Suppose that the link bandwidth available to a flow at time i is a first-order autoregressive process denoted by B(i) • Express the perceived throughputs of flows f1 and f2 as the time average of B(i) over each flow’s lifetime plus noise terms that model the inability of short TCP flows to “learn” the congestion state of the network
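A small simulation makes the model concrete. The AR(1) parameters and noise levels below are illustrative choices, not values from the dissertation:

```python
import numpy as np

def ar1_bandwidth(T, phi=0.9, sigma=1.0, rng=None):
    """B(i) = phi * B(i-1) + w(i): first-order autoregressive model of
    the link bandwidth available to a flow (phi, sigma illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    B = np.empty(T)
    B[0] = sigma * rng.normal()
    for i in range(1, T):
        B[i] = phi * B[i - 1] + sigma * rng.normal()
    return B

def flow_throughput(B, start, duration, noise_std, rng):
    """Perceived throughput of a flow active over [start, start+duration):
    the time average of available bandwidth plus a noise term; a large
    noise_std models a short flow's inability to 'learn' congestion."""
    return B[start:start + duration].mean() + noise_std * rng.normal()
```

Averaging over many realizations reproduces the trend on the next slide: for perfectly overlapping flows, throughput correlation approaches 1 as duration grows and the noise averages out, while heavily noisy (short) flows stay weakly correlated.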

  15. Contribution #2 Correlation between flow throughputs [Plots: throughput correlation vs. duration of f1 and f2 for perfectly overlapping flows, and vs. start time of f2 when the duration of f1 = 20] • The effect of noise vanishes as flow duration increases, and correlation approaches 1 • High correlation for temporally overlapping flows • Correlation depends on overlap relative to the longer flow

  16. Contribution #2 Flow filtering criteria • Resource sharing flow classes • Long flows with large amounts of overlap result in high throughput correlations, but this situation does not arise frequently • Long flows overlapping with short flows result in lower correlations • “Noisy” short flows result in lower correlations even when the amount of overlap is large • Removing large- and small-sized flows helps in capturing positive throughput correlations due to resource sharing • Long (short) flows will typically be large (small) in size • Unlike duration of a flow, size of a flow is invariant regardless of the capacity of links • Flow size is the proper attribute to consider for filtering out flows

  17. Contribution #3 Exploratory factor analysis • Interpretation of flow class throughput correlation matrix to infer resource sharing is difficult • Correlation structure of flow class throughputs can often be represented by a few latent factors • Orthogonal factor model (m ≤ p): the p standardized class throughputs are expressed as X = ΛF + ε, a linear combination of m common factors F plus specific factors ε • No hypothesis on m, but factors must have high explanatory power • Λ_ij is the loading (or weight) of factor j on variable i

  18. Contribution #3 Principal component method • Use the spectral decomposition of R to estimate Λ and the specific variances • Eigenvalue-eigenvector pairs (λ_i, ξ_i), 1 ≤ i ≤ p • Determine the m “significant” eigenvalues of R using Kaiser’s rule [Kaiser, 1960]: retain eigenvalues exceeding 1, the variance of a normalized variable • Variances of factors are given by eigenvalues; the j-th column of loadings is √λ_j ξ_j [Scree plot: eigenvalues in decreasing order, with the m significant eigenvalues above the threshold of 1]
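The principal component estimate can be sketched with numpy alone; the toy block correlation matrix in the usage below is hypothetical:

```python
import numpy as np

def principal_component_loadings(R, kaiser=1.0):
    """Principal component method: eigendecompose the correlation
    matrix, keep the m eigenvalues exceeding the Kaiser threshold
    (1 = variance of a normalized variable), and form the j-th
    loading column as sqrt(lambda_j) * xi_j."""
    eigvals, eigvecs = np.linalg.eigh(R)           # ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    m = int(np.sum(eigvals > kaiser))              # Kaiser's rule
    loadings = eigvecs[:, :m] * np.sqrt(eigvals[:m])
    return loadings, eigvals, m
```

The explanatory power of the m retained factors is the sum of the top m eigenvalues divided by p, since each normalized variable contributes unit variance.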

  19. Contribution #3 Inference of resource sharing • Structure of a p×p correlation matrix R is explained by a p×m factor loading matrix Λ • Columns of Λ represent shared congested resources • Magnitudes of loadings tell us which shared resource has the most effect on the variability of class throughput • Loading matrix can be rotated via varimax rotation to obtain Λ* that potentially gives a better description of resource sharing • Example: consider five flow classes and suppose that the correlation matrix has two significant eigenvalues; boxing the factor loading with the largest magnitude in each row of the 5×2 rotated matrix shows that classes 1, 2, and 5 share one resource while classes 3 and 4 share another
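The varimax rotation step can be sketched with the classic SVD-based iteration (a standard textbook algorithm, not the dissertation's own implementation):

```python
import numpy as np

def varimax(L, n_iter=100, tol=1e-8):
    """Rotate a p x m loading matrix by an orthogonal matrix chosen to
    maximize the variance of squared loadings within each column, which
    pushes every row toward one dominant loading."""
    p, m = L.shape
    rot = np.eye(m)
    prev = 0.0
    for _ in range(n_iter):
        Lr = L @ rot
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p))
        rot = u @ vt
        if s.sum() - prev < tol:
            break
        prev = s.sum()
    return L @ rot
```

Because the rotation is orthogonal, the communalities (row sums of squared loadings) and the implied correlation structure Λ Λᵀ are unchanged; only the interpretation of the columns improves.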

  20. Outline • Introduction • Background and motivation • Overview of contributions • Methodology for inferring network resource sharing • Conditional sampling • Flow filtering • Dimensionality reduction • Validation • Simulation studies • Application to real data with the bootstrap • Conclusion • Summary • Future work

  21. TCP simulations • Primary goals of simulations: • Evaluate effectiveness of exploratory factor analysis in identifying flow classes that share resources in a controlled environment • Find a range of flow sizes that better captures the network’s congestion dynamics • Simulations are performed using OPNET Modeler • A discrete-event environment for network modeling and simulation (http://www.opnet.com) • Simulate two hours of file download activity • File requests from users arrive according to a Poisson process • Each user downloads a file whose size is chosen from a lognormal distribution with mean 16 kB, std 131 kB [Downey, 2001] • File sizes, request times, and download response times are recorded to create NetFlow-like data for statistical analysis

  22. Assessment of factor model • Need a metric to evaluate if loadings correctly determine which classes are associated with which resources • Define a squared error loss between the “ideal” loading matrix (constructed from known routing) and the estimated loading matrix • Couple explanatory power with squared error loss to evaluate factor analysis in inferring resource sharing • Assess inference accuracy • Empirically search for size thresholds for filtering out flows to improve accuracy
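One way to compute such a loss is sketched below. Handling factor order and sign indeterminacy by minimizing over column permutations and sign flips is my assumption; the slide does not specify how the ideal and estimated matrices are aligned:

```python
import itertools
import numpy as np

def squared_error_loss(L_ideal, L_hat):
    """Squared error between an 'ideal' loading matrix (built from known
    routing in simulation) and an estimated one.  Factor order and sign
    are arbitrary in factor analysis, so minimize over column
    permutations and sign flips before summing squared differences."""
    m = L_ideal.shape[1]
    best = np.inf
    for perm in itertools.permutations(range(m)):
        for signs in itertools.product([1.0, -1.0], repeat=m):
            Lp = L_hat[:, perm] * np.array(signs)
            best = min(best, float(((L_ideal - Lp) ** 2).sum()))
    return best
```

With this alignment, an estimate that recovers the ideal structure up to column order and sign scores a loss of zero.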

  23. Tree topology with three bottlenecks [Diagram: seven subnets, each a 10 Mbps LAN with 10 workstations, reach file server S1 through bottleneck links A1, A2, and A3] • Consider a scenario in which users in seven subnets download files from a file server • Each file server-subnet pair is a flow class • Bottlenecks A1, A2, and A3 are loaded equally • The effects on inference of the load offered by classes and of filtering out small and/or large flows will be investigated

  24. Tree topology with three bottlenecks: results [Plots: explanatory power (% variance) and squared error loss of loadings vs. load offered by each class on its bottleneck, for flow size filters of 4, 8, 16, and 32 kB] • Explanatory power increases with increasing offered load • Squared error loss decreases with increasing offered load • Filtering out small and large flows has significant benefits • Compromise between statistical accuracy and reliability of inference!

  25. Interaction of coupled traffic [Diagram: a “linear” network of three 10 Mbps LANs with 10 workstations each, downloading from file servers 1, 2, and 3 over a chain of shared links] • Consider a “linear” network to evaluate the effect of interactions of coupled network traffic • Can throughputs of two flow classes that do not share a link be correlated due to interactions through another flow class? • Results of fluid simulations show that the degree of correlation between throughputs of classes not sharing a link is negligible

  26. Interaction of coupled traffic: an example [Diagram: the “linear” network with bottleneck links loaded to 80%; file server 1 offers 20% load, file servers 2 and 3 offer 40% each, and background traffic utilizes 20% of the bottleneck links] • Consider the “linear” network in the diagram • Discard flows with sizes < 4 kB or > 32 kB • Based on 2 significant factors, determine factor loadings • Rotated factor loading estimates: rows correspond to classes, columns correspond to shared links

  27. Wireless LANs [Diagrams: two 802.11b scenarios reaching a file server over a 1 Mbps backhaul link: (a) stations operate at 11 Mbps, but the backhaul link is underprovisioned for the traffic generated by wireless users; (b) stations operate at 1 Mbps because the access point’s location is not optimal with respect to users] • 802.11b wireless LANs with 20 users • Differentiate between two cases in which poor throughput performance (40 kbps) is being reported • Discard flows with sizes < 4 kB or > 32 kB • Correlate throughputs of 4 users; the eigenvalues are • Underprovisioned backhaul link: {3.0254, 0.6139, 0.2066, 0.1541} • Poor signal strength: {1.2571, 0.9530, 0.9416, 0.8484}

  28. Discussion of wireless LAN results • Consider bottlenecks with capacity 1 Mbps • M active users, each having Ni active flows • M is almost constant (has low variance) • Total number of active flows N = N1+N2+…+NM • Underprovisioned backhaul link: the 1 Mbps link capacity is divided among all N flows, so per-flow allocation is one common source of variability shared by every user • Access point bottleneck: the 1 Mbps capacity is scheduled per user and then divided among that user’s Ni flows, so each user has its own source of variability (per-user scheduling)

  29. Summary of methodology [Diagram: flow filtering, conditional sampling, exploratory factor analysis, and the bootstrap combine into network tomography based on flow level measurements]

  30. Contribution #4 The bootstrap • Validation with real data is extremely difficult! • Unlike controlled simulations, we do not know routing information • We would like to be able to make inferential statements • Estimate 95% confidence intervals for eigenvalues and loadings • Modify Kaiser’s rule for selecting significant eigenvalues • The bootstrap, a computer-based method, can be used to compute confidence intervals [Efron and Tibshirani, 1993] • From the data at hand, construct the empirical distribution and generate many realizations (B independent replications of samples of size n) • No distributional assumptions on data required • Applicable to any statistic, s(X), simple or complicated
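A percentile bootstrap for eigenvalue confidence intervals can be sketched as follows; resampling rows of the sample matrix is a simplification that ignores the conditional-sampling structure of the real estimator:

```python
import numpy as np

def bootstrap_eigenvalue_ci(X, B=1000, alpha=0.05, rng=None):
    """Percentile bootstrap confidence intervals for the eigenvalues of
    the sample correlation matrix of X (n samples x p classes).
    Resamples rows with replacement; no distributional assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    evs = np.empty((B, p))
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        R = np.corrcoef(X[idx].T)
        evs[b] = np.sort(np.linalg.eigvalsh(R))[::-1]
    lo = np.percentile(evs, 100 * alpha / 2, axis=0)
    hi = np.percentile(evs, 100 * (1 - alpha / 2), axis=0)
    return lo, hi
```

An eigenvalue can then be called significant when the lower end of its interval exceeds the (possibly modified) Kaiser threshold, rather than just its point estimate.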

  31. Real data: preprocessing • Two NetFlow datasets from UT Austin’s border router • Assume that traffic is stationary over one-hour periods • Choose two incoming flow classes that are very likely to experience congestion at the server • Select IP addresses associated with AOL and HotMail • Divide each class into two: AOL1, AOL2 and HotMail1, HotMail2 • Filter flow records based on • Packets: Discard flows consisting of only 1 packet • Duration: Discard flows with duration shorter than 1 second • Size: Discard flows with sizes < 8 kB or > 64 kB
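The three filtering criteria on this slide can be sketched as one preprocessing pass. The dict field names (`packets`, `start`, `end`, `bytes`) are illustrative stand-ins for NetFlow record fields, not NetFlow's own names:

```python
def preprocess_flows(records, min_packets=2, min_duration=1.0,
                     min_bytes=8 * 1024, max_bytes=64 * 1024):
    """Apply the slide's three criteria to NetFlow-style flow records:
    drop single-packet flows, flows shorter than 1 second, and flows
    outside the 8-64 kB size window.  Times are in seconds."""
    kept = []
    for r in records:
        if r['packets'] < min_packets:
            continue                      # single-packet flows
        if r['end'] - r['start'] < min_duration:
            continue                      # flows shorter than 1 s
        if not (min_bytes <= r['bytes'] <= max_bytes):
            continue                      # outside the size window
        kept.append(r)
    return kept
```

The thresholds are passed as parameters so the same pass can apply the 4-32 kB window used in the simulation studies.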

  32. Real data: eigenvalues • Parent class (AOL and HotMail) throughput correlation is -0.07 for Dataset2002 and 0.05 for Dataset2004 • 95% bootstrap confidence intervals of the eigenvalues of the throughput correlation matrix of the 4 classes AOL1, AOL2, HotMail1, and HotMail2 [table of intervals not reproduced] • 2 significant factors with explanatory power of 72% for Dataset2002 and 63% for Dataset2004

  33. Real data: factor loadings • Based on 2 significant factors, determine factor loadings • Rotated factor loading estimates: • Rows correspond to classes (AOL1, AOL2, HotMail1, HotMail2) • Columns correspond to shared infrastructure • Estimate 95% bootstrap confidence intervals for loadings to establish accuracy • With 95% confidence, we can identify which flow classes share infrastructure! [Tables: rotated loading estimates for Dataset2002 and Dataset2004]

  34. Outline • Introduction • Background and motivation • Overview of contributions • Methodology for inferring network resource sharing • Conditional sampling • Flow filtering • Dimensionality reduction • Validation • Simulation studies • Application to real data with the bootstrap • Conclusion • Summary • Future work

  35. Methodology for inferring resource sharing

  36. Impact of research • Application of a structural analysis technique, factor analysis, to explore network properties • Methodology for inferring resource sharing • Use of bootstrap methods to make inferential statements about resource sharing • Possible applications • Network monitoring and root cause analysis of poor performance • Problem diagnosis and off-line evaluation of congestion status of networks • Route configuration by service providers • Configuration and placement of access points in wireless LANs • Development of new network service charging schemes

  37. Future work • An active measurement approach • Probe packets have been used in previous network research • Propose “probe flows” for on-demand inference, control of temporal overlaps, and sending “right-sized” flows • Key question: How many probes are required for reliable inference? • Wireless networks • Investigate possibility of clustering wireless users experiencing “similar network conditions” based only on flow measurements • Explore applicability to optimal access point and/or backhaul link configuration more extensively • Validation with more extensive datasets • Use flow records from major Internet service providers, possibly accompanied by routing information

  38. Outline • Introduction • Background and motivation • Overview of contributions • Methodology for inferring network resource sharing • Conditional sampling • Flow filtering • Dimensionality reduction • Validation • Simulation studies • Application to real data with the bootstrap • Conclusion • Summary • Future work

  39. Publications related to dissertation • Journal • D. Arifler, G. de Veciana, and B. L. Evans, “Network tomography based on flow level measurements,” IEEE/ACM Trans. on Networking, submitted Feb. 2004. • Conferences • D. Arifler, G. de Veciana, and B. L. Evans, “Network tomography based on flow level measurements,” in IEEE Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, May 2004, to appear. • D. Arifler, G. de Veciana, and B. L. Evans, “Inferring path sharing based on flow level TCP measurements,” in IEEE Proc. Int. Conf. on Communications, June 2004, to appear.

  40. Other publications • Self-similarity • D. Arifler and B. L. Evans, “Modeling the self-similar behavior of packetized MPEG-4 video using wavelet-based methods,” in Proc. Int. Conf. on Image Processing, Sep. 2002. • Measurement-based network traffic analysis • S. Li, S. Park, D. Arifler, “SMAQ: A measurement-based tool for traffic modeling and queueing analysis. Part I – Design methodologies and software architecture,” IEEE Communications Magazine, vol. 36, no. 8, pp. 56-65, Aug. 1998. • S. Li, S. Park, D. Arifler, “SMAQ: A measurement-based tool for traffic modeling and queueing analysis. Part II – Network applications,” IEEE Communications Magazine, vol. 36, no. 8, pp. 66-77, Aug. 1998.
