Troubleshooting Chronic Conditions in Large IP

Network Reliability
• Applications demand high reliability and performance
• Network outages require accurate and timely troubleshooting
• Traditionally, troubleshooting has focused on hard failures

Chronic Conditions: A Big Problem
• Individual events disappear before you can react to them
• They keep recurring
• They degrade performance for customers
• They can even turn into serious hard failures
• Examples: chronic link flaps, chronic router CPU utilization anomalies

Key Points of Troubleshooting Chronic Conditions
• Mining measurement data – the heart of the troubleshooting process
• Find chronic patterns
• Reproduce patterns in lab settings (if needed)
• Perform software and hardware analysis (if needed)
• Traditionally, troubleshooting chronic conditions has been performed manually, making it a cumbersome, time-consuming and error-prone process

Troubleshooting Challenges
• Massive scale
  • Potential root causes hidden in thousands of event-series
  • E.g., root causes for packet loss include link congestion (SNMP), protocol down (route data), software errors (syslogs)
• Complex spatial and topology models
  • Cross-layer dependency
  • Causal impact scope
  • Local versus global (propagation through protocols)
• Imperfect timing information
  • Propagation (events take time to show impact – timers)
  • Measurement granularity (point versus range events)

NICE (Network-wide Information Correlation and Exploration)
• A novel infrastructure that enables the troubleshooting of chronic network conditions by detecting and analyzing statistical correlations across multiple data sources
[Figure: NICE overview. Labels: Chronic Symptom, Network Data, Spatial Proximity Model, Unified Data Model, Statistical Correlation, Statistically Correlated Events]

Spatial Proximity Model
• Hierarchical structure: captures event location
• Proximity distance: captures impact scope of event

Unified Data Model
• Facilitates easy cross-event correlations
• Padding time-margins to handle diverse data
  • Convert any event-series to a range-series
• Common time-bin to simplify correlations
  • Convert range-series to binary time-series
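The two conversions on this slide can be sketched as follows. This is a minimal illustration, not NICE's actual implementation: the function names, the margin and bin-size parameters, and the sample events are all assumptions made for the example.

```python
from typing import List, Tuple, Union

Range = Tuple[float, float]          # (start, end) in seconds
Event = Union[float, Range]          # point event (timestamp) or range event


def to_ranges(events: List[Event], margin_before: float, margin_after: float) -> List[Range]:
    """Pad every event with time margins so that point and range events share one form."""
    ranges = []
    for ev in events:
        # A point event is treated as a zero-length range.
        start, end = ev if isinstance(ev, tuple) else (ev, ev)
        ranges.append((start - margin_before, end + margin_after))
    return ranges


def to_binary_series(ranges: List[Range], t0: float, t1: float, bin_size: float) -> List[int]:
    """Mark each common time bin in [t0, t1) with 1 if any range overlaps it, else 0."""
    n_bins = int((t1 - t0) // bin_size)
    series = [0] * n_bins
    for start, end in ranges:
        first = max(0, int((start - t0) // bin_size))
        last = min(n_bins - 1, int((end - t0) // bin_size))
        for i in range(first, last + 1):
            series[i] = 1
    return series


# Hypothetical data: two link flaps (point events) and one CPU anomaly (range event).
link_flaps = [100, 430]
cpu_anomalies = [(90, 150)]

flap_series = to_binary_series(to_ranges(link_flaps, 5, 5), 0, 600, 60)
cpu_series = to_binary_series(to_ranges(cpu_anomalies, 5, 5), 0, 600, 60)
```

Once both series use the same bins, any pair of event types can be compared position by position, which is what makes the later correlation step uniform across data sources.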

Statistical Correlation Testing
• Measure statistical time co-occurrence
  • Pair-wise Pearson's correlation coefficient
• Test the significance of the correlation score using a novel circular permutation-based significance test
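A rough sketch of the idea on this slide, assuming two binary time-series of equal length. The slides do not give NICE's exact scoring rule, so the empirical p-value used here (the fraction of circular shifts whose correlation is at least the observed one) is an assumption, as are the function names. Circular shifts are used for the null distribution because they keep each series' internal ordering intact while breaking the time alignment between the two series.

```python
import math
from typing import List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson's correlation coefficient; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def circular_permutation_test(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Return (observed correlation, empirical p-value) using every non-zero circular shift of y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    observed = pearson(x, y)
    n = len(x)
    # Null distribution: correlation of x against y rotated by k bins.
    null = [pearson(x, y[k:] + y[:k]) for k in range(1, n)]
    exceed = sum(1 for r in null if r >= observed)
    p_value = (1 + exceed) / (1 + len(null))
    return observed, p_value


# Hypothetical binary series, e.g. symptom bins versus candidate-cause bins.
symptom = [0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
candidate = [0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
score, p = circular_permutation_test(symptom, candidate)
```

A small p-value suggests the co-occurrence is unlikely to be a coincidence of timing; with only ten bins the smallest attainable p-value is 0.1, so real use would need far longer series.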

Conclusions
• This part focused on troubleshooting chronic conditions
• Briefly introduced NICE
• Any comments or questions?

Automating Cross-Layer Diagnosis of Enterprise 802.11 Wireless Networks

Different Standards
• 802.11 -- applies to wireless LANs and provides 1 or 2 Mbps transmission in the 2.4 GHz band.
• 802.11a -- an extension to 802.11 that applies to wireless LANs and provides up to 54 Mbps in the 5 GHz band.
• 802.11b (also referred to as 802.11 High Rate or Wi-Fi) -- an extension to 802.11 that applies to wireless LANs and provides 11 Mbps transmission (with a fallback to 5.5, 2 and 1 Mbps) in the 2.4 GHz band.
• 802.11g -- applies to wireless LANs and provides up to 54 Mbps in the 2.4 GHz band.

Enterprise Wireless Networks
• May comprise hundreds of distinct APs
• APs are carefully sited and configured in accordance with an RF (radio frequency) site survey
• The aim is to minimize contention, maximize throughput, and provide the illusion of seamless coverage
• In practice, there are numerous opportunities for disrupting or degrading a user's connectivity in an 802.11 network

Problems in the Life of a Packet
• Physical layer: a wide range of non-802.11 devices, from cordless phones to microwave ovens, share the 2.4 GHz spectrum
• Link layer: transmission delays, management delays
• Infrastructure support
• Transport layer

Problems Can Be Anywhere
• Across layers – protocols
• Even in the same layer – 802.11 {a,b,f,g,h,i,n,s}
• Software incompatibilities – vendor variations
• Transient or persistent – time
• Radio propagates in free space – locations
• Radio spreads across channels – frequencies
• Shared spectrum makes it worse
• APs bridge wireless and wired worlds – infrastructure

Shaman
• Goal: develop a system to automatically diagnose problems in wireless networks
• Pervasive data collection (Jigsaw)
  • Extensive passive monitoring system
  • Observes all transmissions across locations, channels, and time
  • Provides a unified, synchronized trace of every packet transmission
• Explicitly model protocols on the critical path
  • DHCP, 802.11 MAC, TCP, etc.
  • Provides a complete delay and loss breakdown
  • For every packet transmission, all protocol stages
• Framework for diagnostic tools
  • Use model outputs to determine the root cause of problems
  • Users can query on demand; the system can also alert admins

Another Good Idea
• The goal is to determine the various delays an actual monitored frame encountered as it traversed the stages of the wireless network path.
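To make the idea of a per-stage delay breakdown concrete, here is a minimal sketch. It assumes a synchronized trace already yields one timestamp per stage for a frame; the stage names, the microsecond values, and the function name are hypothetical and are not taken from Shaman itself.

```python
from typing import Dict, List, Tuple


def delay_breakdown(stage_times: List[Tuple[str, int]]) -> Dict[str, int]:
    """Given (stage_name, entry_time_us) pairs in path order, return the time spent in each stage.

    The last entry only marks when the final stage ended, so it gets no delay of its own.
    """
    breakdown = {}
    for (name, t_enter), (_, t_next) in zip(stage_times, stage_times[1:]):
        if t_next < t_enter:
            raise ValueError(f"timestamps out of order at stage {name!r}")
        breakdown[name] = t_next - t_enter
    return breakdown


# Hypothetical frame: timestamps in microseconds since the frame reached the AP.
frame = [
    ("ap_queue", 0),
    ("channel_access", 4000),
    ("transmission", 6000),
    ("acknowledged", 7000),
]

delays = delay_breakdown(frame)
slowest_stage = max(delays, key=delays.get)
```

With a breakdown like this for every frame, a diagnostic tool can ask which stage dominates the delay for a given user or time window instead of only observing that the connection is slow.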

Conclusion
• Modern enterprise networks are of sufficient complexity that even simple faults can be difficult to diagnose.
• Good solutions exist, such as Shaman
• Any comments or questions?