This guide addresses the persistent challenge of troubleshooting chronic conditions in large IP networks, where accurate and timely analysis is needed to avoid outages and performance degradation. Traditional methods often fail to detect hard-to-spot chronic issues such as link flaps or router CPU anomalies. NICE (Network-wide Information Correlation and Exploration) is a proposed infrastructure that analyzes statistical correlations across multiple data sources to simplify and automate this troubleshooting. The guide also covers automated cross-layer diagnosis of enterprise 802.11 wireless networks.
Network Reliability • Applications demand high reliability and performance • Network outages require accurate and timely troubleshooting • Traditionally, troubleshooting focused on hard failures
Chronic Conditions: A Big Problem • Individual events disappear before you can react to them • They keep recurring • Cause performance degradation to customers • Can even turn into serious hard failures • Examples: chronic link flaps, chronic router CPU utilization anomalies
Key points of Troubleshooting Chronic Conditions • Mining measurement data – the heart of the troubleshooting process • Find chronic patterns • Reproduce patterns in lab settings (if needed) • Perform software and hardware analysis (if needed) • Traditionally, troubleshooting chronics has been performed manually, making it a cumbersome, time-consuming and error-prone process
Troubleshooting Challenges • Massive Scale • Potential root-causes hidden in thousands of event-series • E.g., root-causes for packet loss include link congestion (SNMP), protocol down (Route data), software errors (syslogs) • Complex spatial and topology models • Cross-layer dependency • Causal impact scope • Local versus global (propagation through protocols) • Imperfect timing information • Propagation (events take time to show impact – timers) • Measurement granularity (point versus range events)
NICE (Network-wide Information Correlation and Exploration) • A novel infrastructure that enables the troubleshooting of chronic network conditions by detecting and analyzing statistical correlations across multiple data sources. • [Figure: NICE architecture. Labels: Chronic Symptom, Network Data, Spatial Proximity Model, Unified Data Model, Statistical Correlation, Statistically Correlated Events]
Spatial Proximity Model • Hierarchical structure: captures event location • Proximity distance: captures impact scope of event
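One way to picture the hierarchical structure and proximity distance is the minimal Python sketch below. The three-level location tuple (router, line card, interface), the function names, and the "shared levels" notion of proximity are all assumptions made for illustration; the slides do not say how NICE encodes locations or distances.

```python
def shared_levels(loc_a, loc_b):
    """Count how many leading hierarchy levels two locations have in common."""
    count = 0
    for a, b in zip(loc_a, loc_b):
        if a != b:
            break
        count += 1
    return count


def in_impact_scope(symptom_loc, event_loc, min_shared_levels):
    """Treat an event as 'near' a symptom if they share enough hierarchy levels.

    min_shared_levels is a hypothetical knob: 1 = same router,
    2 = same line card, 3 = same interface.
    """
    return shared_levels(symptom_loc, event_loc) >= min_shared_levels


# Hypothetical locations written as (router, line card, interface).
symptom = ("router-a", "card-1", "if-0")
same_card = ("router-a", "card-1", "if-3")
other_router = ("router-b", "card-1", "if-0")

print(in_impact_scope(symptom, same_card, 2))     # shares router and card
print(in_impact_scope(symptom, other_router, 1))  # different router
```

Only events that fall inside the chosen scope would then be passed on to correlation testing, which keeps the number of candidate event-series manageable.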
Unified Data Model • Facilitates easy cross-event correlations • Pad time margins to handle diverse data • Convert any event-series to a range series • Use a common time-bin to simplify correlations • Convert range series to binary time-series
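The conversion steps on this slide can be sketched in Python as follows. This is an illustrative sketch, not NICE's implementation: the function names, the integer-second timestamps, and the margin and bin sizes are assumptions chosen for the example.

```python
def to_ranges(events, margin):
    """Convert point or range events to padded (start, end) ranges.

    events: list of timestamps (point events) or (start, end) tuples.
    margin: seconds of padding added on each side to absorb timing error.
    """
    ranges = []
    for ev in events:
        start, end = (ev, ev) if isinstance(ev, int) else ev
        ranges.append((start - margin, end + margin))
    return ranges


def to_binary_series(ranges, t0, t1, bin_size):
    """Mark each common time-bin in [t0, t1) with 1 if any range touches it."""
    n_bins = (t1 - t0 + bin_size - 1) // bin_size  # ceiling division
    series = [0] * n_bins
    for start, end in ranges:
        first = max(0, (start - t0) // bin_size)
        last = min(n_bins - 1, (end - t0) // bin_size)
        for b in range(first, last + 1):
            series[b] = 1
    return series


# One point event (t=100) and one range event (400..520), padded by 30 s.
padded = to_ranges([100, (400, 520)], margin=30)
binary = to_binary_series(padded, t0=0, t1=600, bin_size=100)
print(padded)  # [(70, 130), (370, 550)]
print(binary)  # [1, 1, 0, 1, 1, 1]
```

Once every event-series is a binary series over the same bins, any two of them can be compared position by position.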
Statistical Correlation Testing • Measure statistical time co-occurrence using the pair-wise Pearson correlation coefficient • Test the significance of the correlation score using a novel circular permutation-based significance test
Conclusions • This part focused on troubleshooting chronic conditions • Briefly introduced NICE • Any comments or questions?
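The sketch below illustrates the general idea of a circular-shift significance test on two binary time-series. It is an assumption-laden illustration, not the test defined by NICE: here the null distribution is built by rotating one series through every non-zero offset, and significance is summarized as an empirical p-value. The function names and sample data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def circular_shift_test(x, y):
    """Return (observed correlation, empirical p-value).

    Rotating y keeps its internal burstiness intact while breaking its
    alignment with x, so the rotated correlations approximate what
    chance co-occurrence would look like.
    """
    observed = pearson(x, y)
    n = len(y)
    at_least = 0
    for k in range(1, n):
        shifted = y[k:] + y[:k]
        if pearson(x, shifted) >= observed:
            at_least += 1
    # Count the unshifted alignment itself so the p-value is never zero.
    return observed, (at_least + 1) / n


# Hypothetical symptom and candidate event series over 20 time bins.
symptom = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0]
candidate = list(symptom)  # perfectly co-occurring event
r, p = circular_shift_test(symptom, candidate)
print(r, p)
```

A plain permutation test would shuffle bins independently and destroy autocorrelation in the series; rotating preserves it, which is presumably why a circular variant suits bursty network events.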
Automating Cross-Layer Diagnosis of Enterprise 802.11 Wireless Networks
Different Standards • 802.11 -- applies to wireless LANs and provides 1 or 2 Mbps transmission in the 2.4 GHz band. • 802.11a -- an extension to 802.11 that applies to wireless LANs and provides up to 54 Mbps in the 5 GHz band. • 802.11b (also referred to as 802.11 High Rate or Wi-Fi) -- an extension to 802.11 that applies to wireless LANs and provides 11 Mbps transmission (with a fallback to 5.5, 2 and 1 Mbps) in the 2.4 GHz band. • 802.11g -- applies to wireless LANs and provides up to 54 Mbps in the 2.4 GHz band.
Enterprise Wireless Networks • May comprise hundreds of distinct APs • Carefully sited and configured in accordance with an RF (radio frequency) site survey • Designed to minimize contention, maximize throughput and provide the illusion of seamless coverage • In practice, there are numerous opportunities for disrupting or degrading a user's connectivity in an 802.11 network.
Problems in the life of a packet • There are numerous opportunities for disrupting or degrading a user's connectivity in an 802.11 network. • Physical Layer: A wide range of non-802.11 devices, from cordless phones to microwave ovens, share the 2.4 GHz spectrum. • Link Layer: Transmission delays. Management delays. • Infrastructure support. • Transport Layer.
Problems can be anywhere • Across layers – protocols • Even in the same layer – 802.11 {a,b,f,g,h,i,n,s} • Software incompatibilities – vendor variations • Transient or persistent – time • Radio propagates in free space – locations • Radio spreads across channels – frequencies • Shared spectrum makes it worse • APs bridge wireless and wired worlds – infrastructure
Shaman • Goal: Develop a system to automatically diagnose problems in wireless networks • Pervasive data collection (Jigsaw) • Extensive passive monitoring system • Observe all transmissions across locations, channels, and time • Provides a unified synchronized trace of every packet transmission • Explicitly model protocols on critical path • DHCP, 802.11 MAC, TCP, etc. • Provides complete delay and loss breakdown • For every packet transmission, all protocol stages • Framework for diagnostic tools • Use model outputs to determine root cause of problems • Users can query on demand, also alert admins
Another good idea • The goal is to determine the various delays an actual monitored frame encountered as it traversed the stages of the wireless network path.
Conclusion • Modern enterprise networks are of sufficient complexity that even simple faults can be difficult to diagnose. • Good solutions exist, such as Shaman • Any comments or questions?
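As a rough illustration of a per-frame delay breakdown, the Python sketch below subtracts consecutive stage timestamps for one monitored frame. The stage names, the microsecond timestamps, and the function name are hypothetical; the slides do not specify which stages Shaman models or how it represents them.

```python
def delay_breakdown(stage_times):
    """Compute the delay between consecutive stages of one frame.

    stage_times: list of (stage_name, timestamp_us) in path order.
    Returns a dict mapping "from->to" to the elapsed microseconds.
    """
    breakdown = {}
    for (prev_name, prev_t), (name, t) in zip(stage_times, stage_times[1:]):
        breakdown[f"{prev_name}->{name}"] = t - prev_t
    return breakdown


# Hypothetical trace of a single frame (timestamps in microseconds).
frame = [
    ("ap_queued", 0),
    ("first_tx", 1200),
    ("last_retx", 4700),
    ("acked", 5000),
]
delays = delay_breakdown(frame)
slowest = max(delays, key=delays.get)
print(delays)
print(slowest)  # first_tx->last_retx
```

With a breakdown like this for every frame, a diagnostic tool can attribute a slow transfer to the stage that dominates the total, for example retransmissions rather than queuing.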