
Peer-to-Peer-based Automatic Fault Diagnosis in VoIP


Presentation Transcript


  1. Peer-to-Peer-based Automatic Fault Diagnosis in VoIP Henning Schulzrinne (Columbia U.) Kai X. Miao (Intel) International SIP 2008 (Paris)

  2. Overview • The transition in IT cost metrics • End-to-end application-visible reliability still poor (~ 99.5%) • even though network elements have gotten much more reliable • particular impact on interactive applications (e.g., VoIP) • transient problems • Lots of voodoo network management • Existing network management doesn’t work for VoIP and other modern applications • Need user-centric rather than operator-centric management • Proposal: peer-to-peer management • “Do You See What I See?” • Using VoIP as running example -- most complex consumer application • but also applies to IPTV and other services • Also use for reliability estimation and statistical fault characterization International SIP 2008 (Paris)

  3. Circle of blame • ISP, VSP, OS vendor and application vendor pass the blame around the circle • typical answers: “probably packet loss in your Internet connection → reboot your DSL modem”, “probably a gateway fault → choose us as provider”, “must be a Windows registry problem → re-install Windows”, “must be your software → upgrade” International SIP 2008 (Paris)

  4. Diagnostic undecidability • symptom: “cannot reach server” • more precise: send packet, but no response • causes: • NAT problem (return packet dropped)? • firewall problem? • path to server broken? • outdated server information (moved)? • server dead? • 5 causes → very different remedies • no good way for non-technical user to tell • Whom do you call? International SIP 2008 (Paris)
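The five causes call for very different remedies, and a few cheap probes can already separate some of them. The following is a minimal Java sketch, not part of the original slides; the host name sip.example.com and port 5060 are placeholders, and the checks are deliberately simplified.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.UnknownHostException;

// Sketch only: separates "stale server record" from "path/firewall problem"
// from "host up but service dead". Host and port are hypothetical.
public class ReachabilityCheck {
    public static void main(String[] args) throws Exception {
        String host = "sip.example.com";   // placeholder server name
        int port = 5060;                   // SIP over TCP, for illustration

        InetAddress addr;
        try {
            addr = InetAddress.getByName(host);   // outdated/missing DNS record?
        } catch (UnknownHostException e) {
            System.out.println("DNS lookup failed -- stale or missing server information");
            return;
        }

        // ICMP (or TCP echo fallback) reachability: path broken vs. service down
        boolean hostUp = addr.isReachable(2000);

        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(addr, port), 2000);
            System.out.println("TCP connect OK -- problem is likely above the transport layer");
        } catch (Exception e) {
            if (hostUp) {
                System.out.println("Host answers but the port does not -- server dead or firewalled");
            } else {
                System.out.println("No response at all -- NAT, firewall, or broken path");
            }
        }
    }
}
```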

  5. Traditional network management model • SNMP-based “management from the center” (diagram) International SIP 2008 (Paris)

  6. Old assumptions, now wrong • Single provider (enterprise, carrier) • has access to most path elements • professionally managed • Problems are hard failures & elements operate correctly • element failures (“link dead”) • substantial packet loss • Mostly L2 and L3 elements • switches, routers • rarely 802.11 APs • Problems are specific to a protocol • “IP is not working” • Indirect detection • MIB variable vs. actual protocol performance • End systems don’t need management • DMI & SNMP never succeeded • each application does its own updates International SIP 2008 (Paris)

  7. Managing the protocol stack • media: echo, gain problems, VAD action • RTP: protocol problem, playout errors • SIP: protocol problem, authorization, asymmetric conn (NAT) • UDP/TCP: TCP neg. failure, NAT time-out, firewall policy • IP: no route, packet loss International SIP 2008 (Paris)

  8. Types of failures • Hard failures • connection attempt fails • no media connection • NAT time-out • Soft failures (degradation) • packet loss (bursts) • access network? backbone? remote access? • delay (bursts) • OS? access networks? • acoustic problems (microphone gain, echo) • a software bug (poor voice quality) • protocol stack? Codec? Software framework? International SIP 2008 (Paris)
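For the soft failures above, delay variation is usually summarized by the RFC 3550 interarrival-jitter estimate. The sketch below (Java, not from the slides) shows how a passive monitor could maintain that estimate from observed RTP arrival times; the class and field names are ours.

```java
// Minimal sketch of the RFC 3550 interarrival-jitter estimator that a
// passive monitor could keep per RTP stream; names are illustrative.
public class JitterEstimator {
    private double jitter = 0;               // smoothed jitter, in timestamp units
    private long prevTransit = Long.MIN_VALUE;

    /** @param rtpTimestamp     sender timestamp from the RTP header
        @param arrivalTimestamp local arrival time expressed in the same units */
    public void onPacket(long rtpTimestamp, long arrivalTimestamp) {
        long transit = arrivalTimestamp - rtpTimestamp;
        if (prevTransit != Long.MIN_VALUE) {
            double d = Math.abs(transit - prevTransit);
            jitter += (d - jitter) / 16.0;    // J = J + (|D| - J)/16 (RFC 3550)
        }
        prevTransit = transit;
    }

    public double jitter() { return jitter; }
}
```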

  9. Examples of additional problems • ping and traceroute no longer work reliably • WinXP SP2 turns off ICMP • some networks filter all ICMP messages • Early NAT binding time-out • initial packet exchange succeeds, but then the TCP binding is removed (“web-only Internet”) • policy intent vs. failure • “broken by design” • “we don’t allow port 25” vs. “SMTP server temporarily unreachable” International SIP 2008 (Paris)

  10. Fault localization • Fault classification – local vs. global • Does it affect only me or does it affect others also? • Global failures • Server failure • e.g., SIP proxy, DNS failure, database failures • Network failures • Local failures • Specific source failure • node A cannot make call to anyone • Specific destination or participant failure • no one can make call to node B • Locally observed, but global failures • DNS service failed, but only B observed it International SIP 2008 (Paris)
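A hedged sketch of the local-vs-global split described above: a node that sees a failure asks peers whether they can use the same service and classifies the fault from their answers. All names are illustrative, and the real DYSWIS logic is richer (e.g., the "locally observed, but global" case needs further probing).

```java
import java.util.List;

// Sketch of local-vs-global fault classification from peer answers.
enum FaultScope { LOCAL, GLOBAL, UNKNOWN }

class FaultLocalizer {
    /** @param peerSucceeded one entry per queried peer: could it reach the service? */
    static FaultScope classify(boolean localFailed, List<Boolean> peerSucceeded) {
        if (!localFailed) return FaultScope.UNKNOWN;         // nothing to diagnose
        if (peerSucceeded.isEmpty()) return FaultScope.UNKNOWN;
        long ok = peerSucceeded.stream().filter(b -> b).count();
        if (ok == 0) return FaultScope.GLOBAL;               // nobody can reach the service
        return FaultScope.LOCAL;                             // only we (or few) fail
    }
}
```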

  11. Proposal: “Do You See What I See?” (DYSWIS) • Each node has a set of active and passive measurement tools • Use intercept (NDIS, pcap) • to detect problems automatically • e.g., no response to SIP, HTTP or DNS request • deviation from normal protocol exchange behavior • gather performance statistics (packet jitter) • capture RTCP and similar measurement packets • Nodes can ask others for their view • possibly also dedicated “weather stations” • Iterative process, leading to: • user indication of cause of failure • in some cases, work-around (application-layer routing) → TURN server, use remote DNS servers • Nodes collect statistical information on failures and their likely causes International SIP 2008 (Paris)
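One piece of the passive side of DYSWIS is spotting requests that never receive a response. A minimal, assumption-laden Java sketch: intercepted requests and responses (delivered by pcap/NDIS hooks, not shown) are matched by a transaction identifier, and unanswered requests older than a timeout raise a symptom.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the passive "deviation from normal exchange" detector; the
// transaction-id handling and timeout value are simplifying assumptions.
class PassiveDetector {
    private final Map<String, Long> pending = new ConcurrentHashMap<>();
    private final long timeoutMillis = 4000;   // e.g., DNS/SIP retransmit horizon

    void onRequest(String transactionId) {
        pending.put(transactionId, System.currentTimeMillis());
    }

    void onResponse(String transactionId) {
        pending.remove(transactionId);
    }

    /** Call periodically; returns true (and clears the stale entries)
        if some request went unanswered past the timeout. */
    boolean sweep() {
        long now = System.currentTimeMillis();
        return pending.values().removeIf(t -> now - t > timeoutMillis);
    }
}
```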

  12. Architecture • three types of nodes: sensor, probe, and diagnosis • (diagram: sensor and probe nodes attached to services such as a SIP proxy, DNS server, SMTP server, firewall and others, coordinated by diagnosis nodes) International SIP 2008 (Paris)

  13. Architecture • sensor node reports “not working” (notification) • diagnosis node orchestrates tests, contacts other nodes, and requests diagnostics • inspect protocol requests (DNS, HTTP, RTCP, …) • ping 127.0.0.1 • can a buddy reach our resolver? • result, e.g., “DNS failure for 15m” • notify admin (email, IM, SIP events, …) International SIP 2008 (Paris)
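A simplified Java sketch of that orchestration, using only the tests named on the slide (loopback ping, local DNS resolution, asking a buddy about our resolver). The askBuddy call stands in for the DYSWIS p2p protocol and is purely hypothetical here.

```java
import java.net.InetAddress;

// Sketch of the diagnosis-node test sequence; peer and admin calls are stubs.
class DiagnosisNode {
    void onNotification(String symptom) throws Exception {
        // 1. Is the local IP stack alive at all?
        boolean loopbackOk = InetAddress.getByName("127.0.0.1").isReachable(1000);
        if (!loopbackOk) { notifyAdmin("local IP stack failure"); return; }

        // 2. Can we resolve names? Can a buddy reach our resolver?
        boolean localDnsOk = tryResolve("example.com");
        boolean buddyDnsOk = askBuddy("dns-check", "example.com");   // hypothetical p2p request
        if (!localDnsOk && buddyDnsOk) notifyAdmin("DNS failure on this host or site");
        else if (!localDnsOk)          notifyAdmin("DNS failure (seen by peers too)");
        else                           notifyAdmin("DNS fine; escalate to SIP-level tests");
    }

    boolean tryResolve(String name) {
        try { InetAddress.getByName(name); return true; }
        catch (Exception e) { return false; }
    }

    boolean askBuddy(String test, String arg) { return true; }       // stub
    void notifyAdmin(String msg) { System.out.println("ADMIN: " + msg); }
}
```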

  14. Solution architecture • nodes in different domains cooperate to determine the cause of a failure • (diagram: a call fails at P1 in Domain A; peers P2–P8, spread across Service Provider 1 and Service Provider 2, run SIP, DNS and PESQ tests over the p2p network against the SIP server and DNS server) International SIP 2008 (Paris)

  15. Failure detection tools • STUN server • what is your IP address? • ping and traceroute • transport-level liveness and QoS • open TCP connection to port • send UDP ping to port • measure packet loss & jitter • Need scriptable tools with dependency graph • using DROOLS for now • TBD: remote diagnostic • fixed set (“do DNS lookup”) or • applets (only remote access) International SIP 2008 (Paris)
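The transport-level tools reduce to a couple of small probes. Below is a Java sketch of a TCP connect test and a UDP “ping” loop that estimates loss; it assumes a cooperating peer that echoes the UDP probes, which the slide does not specify.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Sketch of transport-level liveness/QoS probes; host, port and the
// echoing peer are assumptions, not part of the slides.
class TransportProbes {
    /** Is a TCP service listening on host:port? */
    static boolean tcpOpen(String host, int port) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), 2000);
            return true;
        } catch (Exception e) { return false; }
    }

    /** Sends n UDP probes and returns the observed loss fraction (needs an echoing peer). */
    static double udpLoss(String host, int port, int n) throws Exception {
        byte[] buf = new byte[16];
        int lost = 0;
        try (DatagramSocket sock = new DatagramSocket()) {
            sock.setSoTimeout(1000);
            InetAddress addr = InetAddress.getByName(host);
            for (int i = 0; i < n; i++) {
                sock.send(new DatagramPacket(buf, buf.length, addr, port));
                try {
                    sock.receive(new DatagramPacket(new byte[16], 16));
                } catch (SocketTimeoutException e) {
                    lost++;                      // no echo within 1 s counts as lost
                }
            }
        }
        return (double) lost / n;
    }
}
```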

  16. Components and operations • Distributed p2p architecture with an iterative process involving all of these functions: • data gathering from multiple perspectives • knowledge in existence or built over time (learning) • tools (with intelligence built in) for active probing or observation • inference, analysis, and decision making • Peer nodes: detection nodes, diagnosis nodes, and probe nodes • P2P protocol for fault diagnosis • Operation rules used to generate tests – built or learned in real time • Inference based on rules (inference modeling) International SIP 2008 (Paris)

  17. Learning & modeling (diagram) • components: analysis/inference/diagnosis; fault profiles; diagnostic analysis; statistical inference; diagnostic tests; active probes; adaptive probes; passive tests/active tests; monitoring deviant behavior; dependency relationships/decision trees; dependency graphs; normal network behavior; fault types (hard vs. soft) • together these form the fault diagnosis architecture, components, and domain agents International SIP 2008 (Paris)

  18. Dependency classification • Functional dependency • At generic service level • e.g., SIP proxy depends on DB service, DNS service • Structural dependency • Configuration time • e.g., Columbia CS SIP proxy is configured to use mysql database on host metro-north • Operational dependency • Runtime dependencies or run time bindings • e.g., the call which failed was using failover SIP server obtained from DNS which was running on host a.b.c.d in IRT lab International SIP 2008 (Paris)
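One way to record these three dependency types, sketched in Java (records require Java 16+); the class and field names are ours, and the example entries just restate the slide’s examples.

```java
// Sketch of a data model for the three dependency types; names are ours.
enum DependencyType { FUNCTIONAL, STRUCTURAL, OPERATIONAL }

record Dependency(DependencyType type, String dependent, String dependsOn, String detail) {}

class DependencyExamples {
    static Dependency[] examples() {
        return new Dependency[] {
            // generic service level
            new Dependency(DependencyType.FUNCTIONAL, "SIP proxy", "DNS service", ""),
            // fixed at configuration time
            new Dependency(DependencyType.STRUCTURAL, "Columbia CS SIP proxy", "mysql",
                           "configured on host metro-north"),
            // bound at run time, per call
            new Dependency(DependencyType.OPERATIONAL, "failed call", "failover SIP server",
                           "obtained via DNS running on host a.b.c.d")
        };
    }
}
```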

  19. Dependency graph (diagram) International SIP 2008 (Paris)

  20. Dependency graph encoded as decision tree • example: A = SIP call, B = DNS server, C = SIP proxy, D = connectivity • when A fails, its decision tree is used: the components A depends on (C, B, D) are tested in turn, and the decision tree of a failing component is invoked in its place • if no test identifies a cause: report “cause not known” and add a new dependency International SIP 2008 (Paris)
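A compact Java sketch of the traversal implied by this slide: each node runs its own test and, on failure, descends into the decision trees of the components it depends on; if every dependency passes, the node itself is reported as the likely cause. The stubbed tests and the recursion order are assumptions.

```java
import java.util.function.BooleanSupplier;

// Sketch of "dependency graph encoded as decision tree"
// (A = SIP call, B = DNS server, C = SIP proxy, D = connectivity).
class DecisionNode {
    final String name;
    final BooleanSupplier test;          // returns true if this component works
    final DecisionNode[] dependencies;

    DecisionNode(String name, BooleanSupplier test, DecisionNode... deps) {
        this.name = name; this.test = test; this.dependencies = deps;
    }

    /** Returns the name of the likely failing component, or null if this node is fine. */
    String diagnose() {
        if (test.getAsBoolean()) return null;        // this component works
        for (DecisionNode dep : dependencies) {
            String cause = dep.diagnose();           // invoke the dependency's decision tree
            if (cause != null) return cause;
        }
        return name;   // all dependencies pass, so blame this component itself
    }
}
```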

  21. Current work • Building decision tree system • Using JBoss Rules (Drools 3.0) International SIP 2008 (Paris)

  22. Future work • Learning the dependency graph from failure events and diagnostic tests • Learning using random or periodic testing to identify failures and determine relationships • Self-healing • Predicting failures • Protocols for labeling event failures → enable new devices/applications to be incorporated into the dependency system automatically • Decision-tree (dependency-graph) based event correlation International SIP 2008 (Paris)

  23. Conclusion • Hypothesis: network reliability is the single largest open technical issue → prevents (some) new applications • Existing management tools are of limited use to most enterprises and end users • Transition to “self-service” networks • support non-technical users, not just NOCs running HP OpenView or Tivoli • Need better view of network reliability International SIP 2008 (Paris)
