
Do You See What I See (DYSWIS)? or Leveraging end systems to improve network reliability


Presentation Transcript


  1. Do You See What I See (DYSWIS)? or Leveraging end systems to improve network reliability
  Henning Schulzrinne, Dept. of Computer Science, Columbia University, July 2005

  2. Overview
  • End-to-end application-visible reliability still poor (~99.5%)
    • even though network elements have gotten much more reliable
    • particular impact on interactive applications (e.g., VoIP)
    • transient problems
  • Lots of voodoo network management
  • Existing network management doesn't work for VoIP and other modern applications
  • Need user-centric rather than operator-centric management
  • Proposal: peer-to-peer management
    • "Do You See What I See?"
    • also use it for reliability estimation and statistical fault characterization

  3. Transition in cost balance
  • Total cost of ownership
    • Ethernet port cost ~ $10
    • about 80% of Columbia CS's system support cost is staff cost
    • about $2500/person/year → 2 new PCs/year
    • much of the rest is backup & licenses for spam filters
  • Does not count hours of employee or son/daughter time
  • PC, Ethernet port and router cost seem to have reached a plateau
    • just that the $10 now buys a 100 Mb/s port instead of 10 Mb/s
  • All of our switches, routers and hosts are SNMP-enabled, but no suggestion that this would help at all

  4. Circle of blame
  (diagram: ISP, VSP, OS vendor and app vendor each pass the blame along the circle)
  • "probably packet loss in your Internet connection → reboot your DSL modem"
  • "probably a gateway fault → choose us as provider"
  • "must be a Windows registry problem → re-install Windows"
  • "must be your software → upgrade"

  5. Diagnostic undecidability
  • symptom: "cannot reach server"
    • more precise: send packet, but no response
  • causes:
    • NAT problem (return packet dropped)?
    • firewall problem?
    • path to server broken?
    • outdated server information (moved)?
    • server dead?
  • 5 causes → very different remedies
  • no good way for a non-technical user to tell
  • Whom do you call?
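The layered probing this slide implies can be sketched as follows; this is an illustrative example rather than part of DYSWIS, and the host and port are placeholders. Each probe rules some of the five causes in or out, but silence after a TCP connect attempt still leaves several candidates, which is exactly where a peer's view is needed.

```python
# Minimal sketch (not part of DYSWIS): layered probing that narrows down
# "cannot reach server". Host and port are placeholders.
import socket

def classify_unreachable(host, port=80, timeout=3.0):
    # 1. Name resolution: outdated server information (moved) shows up here.
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror:
        return "DNS failure: server information outdated or resolver broken"
    # 2. TCP connect: separates a silent drop from an explicit refusal.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "TCP connect succeeded: problem is above the transport layer"
    except socket.timeout:
        # Silence is consistent with a NAT/firewall drop, a broken path,
        # or a dead server; a peer's view is needed to tell these apart.
        return "timeout: NAT/firewall drop, broken path, or server dead"
    except ConnectionRefusedError:
        return "RST received: host is up, but the service is not listening"
    except OSError as err:
        return f"network error: {err} (e.g., no route to host)"

print(classify_unreachable("www.example.org", 80))
```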

  6. VoIP user experience
  • Only 95-99.5% call attempt success
    • "Keynote was able to complete VoIP calls 96.9% of the time, compared with 99.9% for calls made over the public network. Voice quality for VoIP calls on average was rated at 3.5 out of 5, compared with 3.9 for public-network calls and 3.6 for cellular phone calls. And the amount of delay the audio signals experienced was 295 milliseconds for VoIP calls, compared with 139 milliseconds for public-network calls." (InformationWeek, July 11, 2005)
  • Mid-call disruptions common
  • Lots of knobs to turn
  • Separate problem: manual configuration

  7. Traditional network management model
  (diagram: a central SNMP manager polling the network elements, i.e. "management from the center")

  8. Old assumptions, now wrong
  • Single provider (enterprise, carrier)
    • has access to most path elements
    • professionally managed
  • Problems are hard failures & elements operate correctly
    • element failures ("link dead")
    • substantial packet loss
  • Mostly L2 and L3 elements
    • switches, routers
    • rarely 802.11 APs
  • Problems are specific to a protocol
    • "IP is not working"
  • Indirect detection
    • MIB variable vs. actual protocol performance
  • End systems don't need management
    • DMI & SNMP never succeeded
    • each application does its own updates

  9. Management
  (diagram: management tasks: network understanding, fault location, configuration, element inspection; annotated "what causes the most trouble?" and "we've only succeeded here")

  10. Managing the protocol stack
  • SIP: protocol problem, authorization, asymmetric conn (NAT)
  • media: echo, gain problems, VAD action
  • RTP: protocol problem, playout errors
  • UDP/TCP: TCP neg. failure, NAT time-out, firewall policy
  • IP: no route, packet loss

  11. Types of failures
  • Hard failures
    • connection attempt fails
    • no media connection
    • NAT time-out
  • Soft failures (degradation)
    • packet loss (bursts)
      • access network? backbone? remote access?
    • delay (bursts)
      • OS? access networks?
    • acoustic problems (microphone gain, echo)

  12. Examples of additional problems
  • ping and traceroute no longer work reliably
    • WinXP SP2 turns off ICMP
    • some networks filter all ICMP messages
    • (a TCP-based fallback is sketched below)
  • Early NAT binding time-out
    • initial packet exchange succeeds, but then the TCP binding is removed ("web-only Internet")
  • Policy intent vs. failure
    • "broken by design"
    • "we don't allow port 25" vs. "SMTP server temporarily unreachable"
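When ICMP is filtered, reachability can still be probed at the transport layer. A minimal sketch, assuming a host that accepts TCP connections on port 80 (the host name here is a placeholder):

```python
# Minimal sketch (assumption, not from the slides): when ICMP echo is
# filtered, fall back to timing a TCP handshake to a well-known port.
import socket
import time

def tcp_ping(host, port=80, timeout=2.0):
    """Round-trip time of a TCP handshake in ms, or None if it fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None  # timed out, refused, or no route

rtt = tcp_ping("www.example.org")
print(f"tcp ping: {rtt:.1f} ms" if rtt is not None else "tcp ping: failed")
```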

  13. Proposal: "Do You See What I See?"
  • Each node has a set of active and passive measurement tools
  • Use intercept (NDIS, pcap)
    • to detect problems automatically
      • e.g., no response to HTTP or DNS request
    • gather performance statistics (packet jitter)
    • capture RTCP and similar measurement packets
  • Nodes can ask others for their view
    • possibly also dedicated "weather stations"
  • Iterative process, leading to:
    • user indication of cause of failure
    • in some cases, a work-around (application-layer routing) → TURN server, use remote DNS servers
  • Nodes collect statistical information on failures and their likely causes
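A node asking another for its view could look roughly like the sketch below. The wire format (one JSON line over TCP), the port number, and the buddy host name are assumptions for illustration, not the DYSWIS protocol.

```python
# Minimal sketch of a hypothetical "do you see what I see?" exchange.
import json
import socket

PROBE_PORT = 9999  # hypothetical port a cooperating peer listens on

def local_dns_probe(name):
    """Run the probe locally: can we resolve this name?"""
    try:
        socket.getaddrinfo(name, None)
        return {"probe": "dns", "name": name, "ok": True}
    except socket.gaierror as err:
        return {"probe": "dns", "name": name, "ok": False, "error": str(err)}

def ask_peer(peer_host, request, timeout=5.0):
    """Send a probe request to a peer and return its observation."""
    with socket.create_connection((peer_host, PROBE_PORT), timeout=timeout) as s:
        s.sendall(json.dumps(request).encode() + b"\n")
        return json.loads(s.makefile().readline())

# My resolver fails; does my buddy see the same failure?
mine = local_dns_probe("www.example.org")
if not mine["ok"]:
    try:
        theirs = ask_peer("buddy.example.net",
                          {"probe": "dns", "name": "www.example.org"})
        if theirs.get("ok"):
            print("peer can resolve it: suspect my resolver or access network")
        else:
            print("peer fails too: suspect the server or its DNS")
    except OSError:
        print("buddy unreachable: cannot compare views")
```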

  14. Architecture
  (diagram: diagnostic workflow)
  • "not working" (notification) → request diagnostics
  • orchestrate tests: inspect protocol requests (DNS, HTTP, RTCP, …), ping 127.0.0.1
  • contact others: can a buddy reach our resolver?
  • conclusion, e.g., "DNS failure for 15m"
  • notify admin (email, IM, SIP events, …)
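The orchestration step might be structured like the sketch below: run local tests in order and report the first layer that fails. This is an assumption for illustration, not the DYSWIS code; it uses a Unix-style ping and a placeholder host name.

```python
# Minimal sketch of test orchestration after a "not working" notification.
import socket
import subprocess

def loopback_ok():
    """'ping 127.0.0.1': is the local IP stack alive?"""
    return subprocess.run(["ping", "-c", "1", "127.0.0.1"],
                          capture_output=True).returncode == 0

def resolver_ok(name="www.example.org"):
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False

def diagnose():
    checks = [("local IP stack", loopback_ok),
              ("DNS resolution", resolver_ok)]
    for layer, check in checks:
        if not check():
            # repeated over time this becomes e.g. "DNS failure for 15m",
            # which could be reported to an admin by email, IM or SIP events
            return f"failure at: {layer}"
    return "local checks pass: ask a buddy node for its view"

print(diagnose())
```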

  15. Failure detection tools
  • STUN server
    • what is your IP address?
  • ping and traceroute
  • Transport-level liveness and QoS
    • open TCP connection to port
    • send UDP ping to port
    • measure packet loss & jitter
  • TBD: need scriptable tools with dependency graph
    • initially, we'll be using 'make'
    • (a dependency-graph sketch follows below)
  • TBD: remote diagnostic
    • fixed set ("do DNS lookup") or
    • applets (only remote access)
  (protocol stack graphic: media, RTP, UDP/TCP, IP)
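One way to read the "dependency graph" idea is sketched below: each check names the checks it depends on and only runs once those have passed, in the spirit of make targets. The check names and hosts are placeholders, not part of the slides.

```python
# Minimal sketch (assumption) of diagnostic checks as a dependency graph.
import socket

def dns_ok(name="www.example.org"):
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False

def tcp_open(host, port, timeout=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# name -> (dependencies, test); hypothetical targets, like rules in a makefile
CHECKS = {
    "dns":    ((),        dns_ok),
    "tcp_80": (("dns",),  lambda: tcp_open("www.example.org", 80)),
}

def run(name, results):
    """Evaluate one check after its dependencies (make-style), memoized."""
    if name not in results:
        deps, test = CHECKS[name]
        results[name] = all(run(d, results) for d in deps) and test()
    return results[name]

results = {}
for check in CHECKS:
    print(check, "PASS" if run(check, results) else "FAIL (or dependency failed)")
```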

  16. Failure statistics
  • Which parts of the network are most likely to fail (or degrade)?
    • access network
    • network interconnects
    • backbone network
    • infrastructure servers (DHCP, DNS)
    • application servers (SIP, RTSP, HTTP, …)
    • protocol failures/incompatibility
  • Currently, mostly guesses
  • End nodes can gather and accumulate statistics
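The accumulation itself can be as simple as counting diagnosed failure locations per node; a minimal sketch, with illustrative category names and made-up example inputs:

```python
# Minimal sketch (assumption): an end node counting where diagnosed failures
# were located, so "mostly guesses" can become measured statistics.
from collections import Counter

failure_counts = Counter()

def record_failure(location):
    """location: e.g. 'access network', 'backbone', 'DNS', 'SIP server'."""
    failure_counts[location] += 1

# Illustrative observations only, not real measurements:
for loc in ["access network", "DNS", "access network", "SIP server"]:
    record_failure(loc)

for location, count in failure_counts.most_common():
    print(f"{location}: {count}")
```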

  17. Conclusion
  • Hypothesis: network reliability is the single largest open technical issue → prevents (some) new applications
  • Existing management tools are of limited use to most enterprises and end users
  • Transition to "self-service" networks
    • support non-technical users, not just NOCs running HP OpenView or Tivoli
  • Need a better view of network reliability
