Predicting and Bypassing End-to-End Internet Service Degradation

Anat Bremler-Barr, Edith Cohen, Haim Kaplan, Yishay Mansour (Tel-Aviv University; AT&T Labs) • Talk: Omer Ben-Shalom, Tel-Aviv University

Outline: • Degradation • deviation from “normal” (minimum) RTT. • Predicting Degradation: • Different Predictors • Performance Evaluation: • Precision/recall methodology • Suggested Application: Gateway selection

Motivating Application • Gateway selection (intelligent routing device) • Choosing peering links [Figure: an intelligent routing device choosing among peering links to several ASes]

Data and Measurements: Sources • Aciri (CA2) • AT&T (CA1) • AT&T (NJ1) • Princeton (NJ2) • Base measurements from 4 different locations (ASes) simulate 4 gateways: • California (CA): AT&T + ACIRI • New Jersey (NJ): AT&T + Princeton

Data and Measurements: Destinations • Obtaining a representative set of web servers and their weights (derived from proxy logs)

Data and Measurements: RTT • Data: weekly RTT traces (SYN), measured end to end (path + server) • Hourly measurements to 35,124 servers • Once-a-minute measurements to a weighted sample of 100 servers

Degradation: Definition • Deviation from the minimum recorded RTT (which approximates the propagation delay) • Discrete degradation levels 1-6.
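A minimal sketch of how an RTT trace could be mapped to discrete degradation levels. It assumes additive thresholds on the deviation from the minimum RTT; the six cut-offs in `HYPOTHETICAL_THRESHOLDS_MS` are invented placeholders (the slides do not give the real values), and treating "degraded at level l" as "level l or worse" is also an assumption.

```python
from typing import List, Sequence

# HYPOTHETICAL thresholds (ms of extra delay over the minimum RTT) for levels 1..6.
# The slides do not state the real cut-offs; replace these with the actual ones.
HYPOTHETICAL_THRESHOLDS_MS = (20.0, 50.0, 100.0, 200.0, 500.0, 1000.0)


def degradation_level(rtt_ms: float, min_rtt_ms: float,
                      thresholds: Sequence[float] = HYPOTHETICAL_THRESHOLDS_MS) -> int:
    """Return 0 if not degraded, else the highest level whose threshold is exceeded.

    A lost SYN can be encoded as float("inf"), which maps to the top level
    (an assumption, not something the slides specify).
    """
    deviation = rtt_ms - min_rtt_ms
    level = 0
    for lvl, threshold in enumerate(thresholds, start=1):
        if deviation > threshold:
            level = lvl
    return level


def degraded_flags(rtts_ms: Sequence[float], level: int,
                   thresholds: Sequence[float] = HYPOTHETICAL_THRESHOLDS_MS) -> List[bool]:
    """Mark each measurement as degraded at `level` or worse, relative to the trace minimum."""
    min_rtt = min(rtts_ms)
    return [degradation_level(r, min_rtt, thresholds) >= level for r in rtts_ms]
```

The boolean sequence produced by `degraded_flags` is the per-destination "failure" signal that the predictors below consume.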

Objective: Avoiding degradation • Attempt to reroute through a different gateway • Two conditions must hold: • We must be able to predict the failure through a gateway • A substitute gateway must be available (low correlation between gateways) • Blackout: consecutive degradations through one gateway

Blackout durations • Longer blackouts are easier to predict. • The majority of blackouts are short: 1-3 consecutive measurement points. • However, a considerable fraction last longer. [Figure: long-duration blackout]
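Blackout durations are run lengths of consecutive degraded points, so they can be extracted from a per-gateway degradation sequence in a few lines. This is a sketch; the function names and the `max_len=3` default (chosen to match the "1-3 consecutive points" bucket above) are illustrative.

```python
from itertools import groupby
from typing import List, Sequence


def blackout_durations(degraded: Sequence[bool]) -> List[int]:
    """Lengths of the maximal runs of consecutive degraded measurements."""
    return [sum(1 for _ in run) for is_bad, run in groupby(degraded) if is_bad]


def short_fraction(durations: Sequence[int], max_len: int = 3) -> float:
    """Fraction of blackouts lasting at most `max_len` measurement points."""
    if not durations:
        return 0.0
    return sum(1 for d in durations if d <= max_len) / len(durations)
```

For example, the sequence `[F, T, T, F, T, F, T, T, T]` contains three blackouts of durations 2, 1 and 3.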

Gateways Correlation • Gateways are correlated, but the correlation is often not strong

Gateways Correlation • Longer blackouts are more likely to be shared by gateways, suggesting a failure closer to the server • The majority of blackouts shared by two gateways involved same-coast pairs
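One way to quantify whether a blackout is "shared" is to check, for each blackout through gateway A, whether gateway B was also degraded at every point of it; if so, rerouting through B would not have helped. This sketch uses that all-points definition of sharing as an assumption; the slides do not say exactly how sharing was measured.

```python
from typing import List, Sequence, Tuple


def blackouts_with_sharing(deg_a: Sequence[bool],
                           deg_b: Sequence[bool]) -> List[Tuple[int, bool]]:
    """For each blackout through gateway A, return (duration, shared), where
    `shared` means gateway B was degraded at every point of that blackout."""
    if len(deg_a) != len(deg_b):
        raise ValueError("sequences must be aligned in time")
    result = []
    t, n = 0, len(deg_a)
    while t < n:
        if deg_a[t]:
            start = t
            while t < n and deg_a[t]:
                t += 1
            # Shared only if B offers no escape at any point of A's blackout.
            result.append((t - start, all(deg_b[start:t])))
        else:
            t += 1
    return result
```

Grouping the resulting pairs by duration gives the fraction of shared blackouts per blackout length, which is the quantity the slide relates to blackout duration.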

Building predictors • For a given degradation level l • Prediction per IP address • Input: previous RTT measurements for the IP address • Output: probability of a failure • Predict “failure” if the probability > Ф

Precision/Recall Methodology • $\text{Precision} = \frac{|\text{actual degraded} \cap \text{predicted degraded}|}{|\text{predicted degraded}|}$ • $\text{Recall} = \frac{|\text{actual degraded} \cap \text{predicted degraded}|}{|\text{actual degraded}|}$

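Precision-recall curve • Sweep the threshold Ф over [0,1] to obtain a precision-recall curve. • In other words, let P(t) be the predicted failure probability at time t; for each Ф, predict a failure at time t if P(t) > Ф, and compute the resulting precision and recall.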
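The precision/recall definitions and the threshold sweep can be sketched as follows. The evenly spaced grid of `steps + 1` thresholds and the choice to return `None` when a denominator is empty are conventions picked for this illustration, not taken from the slides.

```python
from typing import List, Optional, Sequence, Tuple


def precision_recall(probs: Sequence[float], actual: Sequence[bool],
                     phi: float) -> Tuple[Optional[float], Optional[float]]:
    """Precision and recall when predicting a failure iff P(t) > phi.

    A value is None when its denominator is empty (a convention chosen here).
    """
    predicted = [p > phi for p in probs]
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    n_pred = sum(predicted)
    n_actual = sum(actual)
    precision = true_pos / n_pred if n_pred else None
    recall = true_pos / n_actual if n_actual else None
    return precision, recall


def pr_curve(probs: Sequence[float], actual: Sequence[bool],
             steps: int = 100) -> List[Tuple[float, Optional[float], Optional[float]]]:
    """Sweep phi over an evenly spaced grid on [0, 1]; return (phi, precision, recall)."""
    return [(phi, *precision_recall(probs, actual, phi))
            for phi in (i / steps for i in range(steps + 1))]
```

Raising Ф trades recall for precision: with predictions `[0.9, 0.2, 0.7, 0.1]` and failures at positions 0 and 3, Ф = 0.5 gives precision 0.5 and recall 0.5, while Ф = 0.8 gives precision 1.0 and recall 0.5.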

What is important for prediction? • Recency principle: more recent RTTs are more important. • Quantity principle: the more measurements, the higher the accuracy.

Recency Principle: Importance • Test case: single-measurement predictor • Predict according to the measurement taken x minutes ago. • Observe how the quality of the prediction changes with x. → A 15% difference between using the measurement from the last minute and the measurement from 15 minutes ago
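The single-measurement test case can be sketched as a predictor that copies the degradation state observed x steps earlier (x minutes, for the once-a-minute data). This is an illustrative evaluation on one sequence; the function name and the `None` convention for empty denominators are assumptions.

```python
from typing import Optional, Sequence, Tuple


def lag_precision_recall(degraded: Sequence[bool],
                         x: int) -> Tuple[Optional[float], Optional[float]]:
    """Evaluate the predictor that forecasts a failure at time t iff the
    measurement taken x steps earlier was degraded."""
    if x < 1:
        raise ValueError("x must be at least 1")
    preds = list(degraded[:-x])   # state at t - x
    actual = list(degraded[x:])   # outcome at t
    true_pos = sum(1 for p, a in zip(preds, actual) if p and a)
    n_pred, n_actual = sum(preds), sum(actual)
    precision = true_pos / n_pred if n_pred else None
    recall = true_pos / n_actual if n_actual else None
    return precision, recall
```

On the toy sequence `[F, T, T, T, F, F, T, T]`, x = 1 yields precision 0.75 and recall 0.6, whereas x = 2 drops to precision 1/3 and recall 0.25, a small-scale illustration of the recency principle.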

Quantity Principle: Importance • Test case: Fixed-Window-Count (FWC) • The prediction is the fraction of failures among the W most recent measurements. → Using more measurements achieves better precision at high recall. [Figure: precision-recall curves for FWC 1, FWC 5, FWC 10, FWC 50]
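A minimal sketch of the Fixed-Window-Count predictor, keeping a running failure count over a bounded window so each update is O(1). Returning 0.0 when there is no history yet, and using a shorter window while fewer than W measurements exist, are assumptions of this sketch.

```python
from collections import deque
from typing import Deque, List, Sequence


def fwc_predictions(degraded: Sequence[bool], window: int) -> List[float]:
    """P(t) = fraction of failures among the `window` measurements before t."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent: Deque[bool] = deque(maxlen=window)
    count = 0
    preds = []
    for failed in degraded:
        # Predict from history only, before seeing the measurement at time t.
        preds.append(count / len(recent) if recent else 0.0)
        if len(recent) == window:
            count -= recent[0]  # this oldest entry is evicted by the append below
        recent.append(failed)
        count += failed
    return preds
```

FWC 1 reduces to the previous-measurement predictor; larger windows (FWC 10, FWC 50) average over more history, which is the "quantity" effect the slide describes.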

Our predictors • Exponential Decay • Polynomial Decay • Model-based predictors: • VW-Cover: Variable Window Cover algorithm • HMM: Hidden Markov Model

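Exponential-decay predictors • The weight of each measurement decreases exponentially with its age, by a factor λ per time unit. • The binary variable $f_t$ indicates a failure at time $t$. • For consecutive measurements: $P(t+1) = \lambda P(t) + (1-\lambda) f_t$ • In general, over the measurement times $s < t$: $P(t) = \frac{\sum_{s<t} \lambda^{t-s} f_s}{\sum_{s<t} \lambda^{t-s}}$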
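A sketch of an exponential-decay predictor as a normalized, exponentially weighted average that also handles irregularly spaced measurements. Two running sums suffice because aging every weight by the same factor λ^Δ leaves their ratio unchanged. The class and method names are hypothetical.

```python
class ExpDecayPredictor:
    """Normalized exponentially weighted failure average (illustrative sketch)."""

    def __init__(self, lam: float) -> None:
        if not 0.0 < lam < 1.0:
            raise ValueError("lam must be in (0, 1)")
        self.lam = lam
        self.num = 0.0          # sum of lam**age * f over past measurements
        self.den = 0.0          # sum of lam**age over past measurements
        self.last_time = None   # time of the most recent measurement

    def observe(self, time: float, failed: bool) -> None:
        """Age the existing weights, then add the new measurement with weight 1."""
        if self.last_time is not None:
            factor = self.lam ** (time - self.last_time)
            self.num *= factor
            self.den *= factor
        self.num += float(failed)
        self.den += 1.0
        self.last_time = time

    def predict(self) -> float:
        """Predicted failure probability; 0.0 before any measurement (a convention)."""
        return self.num / self.den if self.den else 0.0
```

With λ = 0.5, a failure at time 0 followed by a success at time 1 gives P = 0.25 / 0.75 = 1/3. Over a long run of consecutive measurements this normalized average behaves like the recursion P(t+1) = λP(t) + (1-λ)f_t.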

Polynomial-decay predictors • The weight of each measurement decreases polynomially with its age. • Exact computation requires maintaining the complete history. • We therefore use an approximation.
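The exact computation can be sketched to show why it needs the full history. Assuming a weight of `age ** -alpha` (the exponent `alpha` is a hypothetical parameter), the relative weights of old measurements change as t advances, so two running sums are not enough, unlike the exponential case. The approximation used in the talk is not shown here.

```python
from typing import Sequence, Tuple


def poly_decay_prediction(history: Sequence[Tuple[float, bool]], t: float,
                          alpha: float = 1.0) -> float:
    """Exact polynomial-decay prediction at time t.

    history: (time, failed) pairs with time < t; weight = (t - time) ** -alpha.
    Every past measurement must be revisited on each call.
    """
    num = den = 0.0
    for s, failed in history:
        if s >= t:
            raise ValueError("history must precede t")
        w = (t - s) ** (-alpha)
        num += w * float(failed)
        den += w
    return num / den if den else 0.0
```

For a failure at time 0 and a success at time 1, the prediction is 1/3 at t = 2 but 0.4 at t = 3: the same two measurements are re-weighted differently as they age.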

The VW-Cover predictor • Consists of a list of pairs (a1, b1), (a2, b2), …, (an, bn) • Predict a failure if there exists i such that at least bi of the previous ai measurements are failures
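The VW-Cover prediction rule translates directly into code. In this sketch the history is a list of degradation flags with the most recent last; when fewer than a_i measurements exist, the whole history is used, which is an assumption.

```python
from typing import Sequence, Tuple


def vw_cover_predict(pairs: Sequence[Tuple[int, int]],
                     history: Sequence[bool]) -> bool:
    """Predict a failure if, for some (a, b), at least b of the last a
    measurements in `history` (most recent last) were failures."""
    for a, b in pairs:
        if a < 1:
            raise ValueError("window sizes must be at least 1")
        if sum(history[-a:]) >= b:
            return True
    return False
```

With pairs `[(1, 1), (10, 3)]`, a failure in the last measurement triggers the first pair, while three failures anywhere among the last ten trigger the second.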

VW-Cover predictor: Building • Build the predictor greedily to cover the failures, using a learning set of measurements • Pick (a1, b1) as the pair that maximizes precision • Pick each subsequent (ai, bi) as the pair that maximizes precision among the failures not yet covered
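A possible greedy construction on a learning sequence is sketched below. It interprets "precision among uncovered failures" as precision measured only on the time points not already predicted by earlier pairs. That interpretation, the candidate window sizes `windows`, the `max_pairs` limit, and stopping when no pair covers a new failure are all assumptions, not details from the slides.

```python
from itertools import accumulate
from typing import List, Sequence, Set, Tuple


def fire_times(degraded: Sequence[bool], a: int, b: int) -> Set[int]:
    """Times t >= 1 at which pair (a, b) fires, using only degraded[:t]."""
    prefix = [0, *accumulate(map(int, degraded))]  # prefix[t] = failures before t
    return {t for t in range(1, len(degraded))
            if prefix[t] - prefix[max(0, t - a)] >= b}


def build_vw_cover(degraded: Sequence[bool], windows: Sequence[int],
                   max_pairs: int = 3) -> List[Tuple[int, int]]:
    """Greedily choose (a, b) pairs that maximize precision on uncovered times."""
    failures = {t for t in range(1, len(degraded)) if degraded[t]}
    candidates = {(a, b): fire_times(degraded, a, b)
                  for a in windows for b in range(1, a + 1)}
    chosen: List[Tuple[int, int]] = []
    covered: Set[int] = set()
    for _ in range(max_pairs):
        best, best_key = None, None
        for pair, fires in candidates.items():
            new = fires - covered            # points this pair would newly predict
            if not new:
                continue
            hits = len(new & failures)
            key = (hits / len(new), hits)    # precision first, then coverage
            if best_key is None or key > best_key:
                best, best_key = pair, key
        if best is None or best_key[1] == 0:
            break                            # nothing left that covers a new failure
        chosen.append(best)
        covered |= candidates[best]
    return chosen
```

Ordering candidates by precision first and newly covered failures second breaks ties in favor of pairs that cover more of the remaining failures.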

Hidden Markov Model • A finite set of states S (we use 3 states) • Output probabilities $a_s(0)$, $a_s(1)$ for each state s • A transition function that determines the probability distribution of the next state • The probability of a failure at time t: $P(t) = \sum_{s \in S} p_s(t)\, a_s(1)$, where $p_s(t)$ is the probability of being in state s at time t. $p_s(t)$ is updated according to the output at time t-1.
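The state-probability update can be sketched with standard HMM forward filtering: condition on the output observed at t-1, then apply the transition matrix. Reading "updated according to the output of time t-1" this way is an interpretation, and the numeric parameters below are hypothetical; in practice they would be learned from measurements.

```python
from typing import List, Sequence

# HYPOTHETICAL 3-state parameters, for illustration only.
A1 = [0.01, 0.5, 0.99]                 # a_s(1): probability of a failure output in state s
TRANSITION = [[0.95, 0.04, 0.01],      # TRANSITION[s][r] = P(next = r | current = s)
              [0.10, 0.80, 0.10],
              [0.02, 0.08, 0.90]]


def failure_probability(p: Sequence[float], a1: Sequence[float]) -> float:
    """P(t) = sum over s of p_s(t) * a_s(1)."""
    return sum(ps * a for ps, a in zip(p, a1))


def hmm_step(p: Sequence[float], observed_failure: bool, a1: Sequence[float],
             transition: Sequence[Sequence[float]]) -> List[float]:
    """Compute p(t) from p(t-1) and the output observed at time t-1."""
    likelihood = [a if observed_failure else 1.0 - a for a in a1]
    post = [ps * lk for ps, lk in zip(p, likelihood)]
    z = sum(post)
    if z == 0.0:
        raise ValueError("observation has zero probability under the model")
    post = [x / z for x in post]       # state distribution given the output
    n = len(post)
    return [sum(post[s] * transition[s][r] for s in range(n)) for r in range(n)]
```

Usage: start from a uniform `p = [1/3, 1/3, 1/3]`, call `p = hmm_step(p, observed, A1, TRANSITION)` after each measurement, and compare `failure_probability(p, A1)` with the threshold Ф.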

Experimental Evaluation

Predictor Performance – Level 3 [Figure: precision-recall curves for FWC 10, FWC 50, ExpDecay 0.99, ExpDecay 0.95, VW-Cover, HMM] → At recall 0.5, precision is close to 0.9

Predictor Performance – Level 6 [Figure: precision-recall curves for FWC 10, FWC 50, ExpDecay 0.99, ExpDecay 0.95, VW-Cover, HMM] • Level-6 degradations are harder to predict: at recall 0.5, precision is 0.4

Predictor Performance: Conclusion • The best predictors at levels 3 and 6 are VW-Cover and HMM • But they only slightly outperform ExpDecay 0.95, which is considerably simpler to implement

Gateway Selection [Figure: gateway-selection results for level 3 and level 6]

Gateway Selection: Conclusion • Active gateway selection resulted in a 50% reduction in the degradation rate with respect to the best single gateway. • Static gateway selection can avoid at most 25% of degradations. • Again, ExpDecay 0.95 only slightly underperforms the best predictor (VW-Cover).
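Active selection can be sketched for a single destination: at each time step, route through the gateway with the lowest predicted failure probability, then compare the resulting degradation rate with that of the best single gateway chosen in hindsight. The min-probability selection rule and the per-destination evaluation are assumptions of this illustration.

```python
from typing import Sequence


def active_selection_rate(predictions: Sequence[Sequence[float]],
                          degraded: Sequence[Sequence[bool]]) -> float:
    """Degradation rate when each step uses the gateway with the lowest
    predicted failure probability. Indexing: predictions[g][t], degraded[g][t]."""
    n = len(degraded[0])
    gateways = range(len(degraded))
    bad = 0
    for t in range(n):
        g = min(gateways, key=lambda gw: predictions[gw][t])  # ties -> lowest index
        bad += degraded[g][t]
    return bad / n


def best_single_gateway_rate(degraded: Sequence[Sequence[bool]]) -> float:
    """Degradation rate of the best fixed gateway, chosen in hindsight."""
    return min(sum(d) / len(d) for d in degraded)
```

In a two-gateway toy case where the predictions track each gateway's failures, active selection can reach a rate of 0 while the best single gateway is still degraded a quarter of the time; real gains depend on how weakly the gateways are correlated.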

Performance of gateway selection as a function of recency

Correlation between coasts • Gateway selection on a same-coast pair resulted in only a 10% reduction. • Choose independent gateways.

Controlling prediction overhead • Types of measurements: • Active measurements: initiate probes (SYN, ping, HTTP request); these raise a scalability problem. • Passive measurements: collected from regular traffic • Controlling the prediction overhead: • Use less-recent measurements • Send active measurements only to a small set of destinations that cover the majority of traffic • Cluster destinations, so that the measurements of one destination can be used to predict another
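Choosing "a small set of destinations that cover the majority of traffic" can be sketched by taking destinations in decreasing order of traffic weight until a target share is reached; this yields the fewest destinations for that share. The 0.8 coverage target and the example weights are hypothetical.

```python
from typing import Dict, List


def destinations_to_probe(weights: Dict[str, float], coverage: float = 0.8) -> List[str]:
    """Smallest set of destinations whose traffic weights reach `coverage`
    of the total (coverage value is a hypothetical default)."""
    total = sum(weights.values())
    chosen: List[str] = []
    covered = 0.0
    for dest, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= coverage * total:
            break
        chosen.append(dest)
        covered += w
    return chosen
```

For weights `{"a": 50, "b": 35, "c": 10, "d": 5}`, only `a` and `b` need active probes to cover 80% of the traffic.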

Questions ?? natali@cs.tau.ac.il edith@research.att.com haimk@cs.tau.ac.il mansour@cs.tau.ac.il
