Diagnostic Steps

Diagnostic Steps. Les Cottrell – SLAC. Presented at the Networks for Non Networkers 2nd International Workshop, 21-22 June 2005, Edinburgh, Scotland. http://www.slac.stanford.edu/grp/scs/net/talk05/nfnn2-jun05.ppt

Presentation Transcript


  1. Diagnostic Steps Les Cottrell – SLAC Presented at the Networks for Non Networkers 2nd International Workshop, 21-22 June 2005, Edinburgh, Scotland http://www.slac.stanford.edu/grp/scs/net/talk05/nfnn2-jun05.ppt Partially funded by DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM), also supported by IUPAP

  2. Overview Goal: provide a practical guide to debugging common problems (Brian covered high performance problems) • Why is diagnosis difficult yet important? • Local host • Ping, Traceroute, PingRoute • Looking at time series • Locating bottlenecks • Correlation of problems with routes • More tools and problems • Where is a node • Who do you tell, what do you say? • Case studies and More Information

  3. Why is diagnosis difficult? • The Internet evolved as a composition of independently developed and deployed protocols, technologies, and core applications • Diverse, highly unpredictable, hard to find “invariants” • Rapid evolution & change, no equilibrium so far • Findings may be out of date • Measurement/diagnosis not high on vendors' list of priorities • Resources/skill focus on more interesting and profitable issues • Tools lacking or inadequate • Implementations are flaky & not fully tested with new releases

  4. Add to that … • Distributed systems are very hard • “A distributed system is one in which I can't get my work done because a computer I've never heard of has failed.” (Butler Lampson) • Network is deliberately transparent • The bottlenecks can be in any of the following components: • the applications • the OS • the disks, NICs, bus, memory, etc. on sender or receiver • the network switches and routers, and so on • Problems may not be logical • Most problems are operator errors, configurations, bugs • When building distributed systems, we often observe unexpectedly low performance • the reasons for which are usually not obvious • Just when you think you’ve cracked it, in steps security • Firewall, NAT boxes etc. • Block pings, traceroute looks like port scan, diagnostic tool ports are blocked … • ISPs worried about providing access to core, making results public, & privacy issues

  5. Sources of problems • Host “errors” • TCP buffers, heavy utilization … • Duplex mismatch (Ethernet) • Misconfigured router/switches • Including routing errors, especially for backup paths • Bad equipment, wiring/fiber problem • Congestion

  6. Local Host (also see NDT later) • Usual Unix tools (uname -a, top, vmstat, iostat ...) • Is the host overloaded? Do you have a gateway (route) and a name server (nslookup)? Which interface are you using? (mii-tool, which needs root, gives duplex & speed, a common source of errors) • Net: ifconfig -a (look at errors), netstat -a • Is the server running (if you know the port)?
       >telnet localhost 2811
       Trying 127.0.0.1
       220 aftpexp04.bnl.gov GridFTP Server 1.12 GSSAPI type Globus/GSI wu-2.6.2 (gcc32dbg, 1069715860-42) ready.
       ^]
       telnet> quit
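
The telnet check on this slide can also be scripted. The following is a minimal sketch, not part of the original talk: the function name `read_banner` is made up for illustration, and port 2811 is simply the GridFTP port used in the slide's example.

```python
import socket

def read_banner(host, port, timeout=5.0):
    """Connect to host:port and return the first text the server sends.

    Returns None if the TCP connection cannot be made (nothing listening,
    filtered, host down) and "" if the connection works but the server
    stays silent until the timeout.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            try:
                data = sock.recv(1024)
            except socket.timeout:
                return ""  # connected, but no greeting arrived in time
            return data.decode("ascii", errors="replace").strip()
    except OSError:
        return None  # connection refused, unreachable, or timed out
```

Calling `read_banner("localhost", 2811)` would correspond to the telnet session above: a string starting with `220` suggests the server is up, while `None` suggests nothing is listening on that port.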

  7. Local Host - LISA • Localhost Information Service Agent (LISA) is a Java Web Start application which provides: • Integration with MonALISA • Complete monitoring of the system (load, CPU, memory, disk, disk I/O, paging, processes, network traffic and connectivity...). • History and instantaneous values • Filters to trigger actions when predefined conditions are detected. • A user-friendly GUI to present the monitoring information. • Optimization modules for distributed applications. • It is a lightweight application that can be easily deployed on any system. • Modules for end-to-end network measurements (e.g. IPERF). • See monalisa.caltech.edu/dev_lisa.html

  8. Ping • Ping to localhost, ping to gateway, ping to well known host & to relevant remote host • Use IP address to avoid nameserver problems • Look for connectivity, loss, RTT, jitter, dups • May need to run for a long time to see some pathologies (e.g. bursty loss due to DSL loss of sync) • Try flood pings if you suspect rate limiting • Use synack or sting if ICMP is blocked • www-iepm.slac.stanford.edu/tools/synack/

  9. Ping example
       syrup:/home$ ping -c 6 -s 64 thumper.bellcore.com
       PING thumper.bellcore.com (128.96.41.1): 64 data bytes
       72 bytes from 128.96.41.1: icmp_seq=0 ttl=240 time=641.8 ms
       72 bytes from 128.96.41.1: icmp_seq=2 ttl=240 time=1072.7 ms
       72 bytes from 128.96.41.1: icmp_seq=3 ttl=240 time=1447.4 ms
       72 bytes from 128.96.41.1: icmp_seq=4 ttl=240 time=758.5 ms
       72 bytes from 128.96.41.1: icmp_seq=5 ttl=240 time=482.1 ms
       --- thumper.bellcore.com ping statistics ---
       6 packets transmitted, 5 packets received, 16% packet loss
       round-trip min/avg/max = 482.1/880.5/1447.4 ms
     Callouts: -c = repeat count, -s = packet size, final argument = remote host, time = RTT; icmp_seq=1 is the missing sequence number; the last three lines are the summary.
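
For long ping runs it can help to pick out the missing sequence numbers and duplicates automatically. The sketch below is an illustration only: it assumes reply lines in the format shown on this slide (`icmp_seq=N ttl=N time=X ms`), which varies between ping implementations, and the name `summarize_ping` is hypothetical.

```python
import re

# Matches reply lines in the format shown on the slide (an assumption;
# other ping implementations print slightly different lines).
REPLY = re.compile(r"icmp_seq=(\d+) ttl=\d+ time=([\d.]+) ms")

def summarize_ping(output, sent, first_seq=0):
    """Summarize ping output: missing sequence numbers, loss, dups, RTTs."""
    replies = [(int(seq), float(rtt)) for seq, rtt in REPLY.findall(output)]
    seen = {seq for seq, _ in replies}
    rtts = [rtt for _, rtt in replies]
    missing = [s for s in range(first_seq, first_seq + sent) if s not in seen]
    return {
        "missing_seq": missing,
        "loss_pct": 100.0 * len(missing) / sent,
        "duplicates": len(replies) - len(seen),
        "min": min(rtts) if rtts else None,
        "avg": sum(rtts) / len(rtts) if rtts else None,
        "max": max(rtts) if rtts else None,
    }
```

Fed the output above with `sent=6`, this reports sequence number 1 as missing, matching the slide's "missing seq #" callout.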

  10. 3rd party ping (via Looking Glass) • Find servers: • www.caida.org/analysis/routing/reversetrace/ • Example: http://stats.geant.net/cgi-bin/lg/lg.cgi • OK for checking connectivity and RTT but not for losses (unless huge)
       Looking Glass Results - ch1.ch.geant.net
       Date: Mon May 30 21:28:39 2005 GMT
       Query: Ping <IP_Addr | FQDN>
       Real Query: ping rapid count 5
       Argument(s): www.slac.stanford.edu
       PING www8.slac.stanford.edu (134.79.18.163): 56 data bytes
       !!!!!
       --- www8.slac.stanford.edu ping statistics ---
       5 packets transmitted, 5 packets received, 0% packet loss
       round-trip min/avg/max/stddev=167.316/172.212/191.222/9.506 ms

  11. Traceroute • Traceroute to remote host • Is the route direct, or does it go over congested commercial nets? • Reverse traceroute from remote host to you or 3rd party • www.slac.stanford.edu/comp/net/wan-mon/traceroute-srv.html • www.tracert.com/ • www.caida.org/analysis/routing/reversetrace/ [Figure: CAIDA mouse-sensitive map]

  12. Traceroute • UDP/ICMP tool to show route packets take from local to remote host
       17cottrell@flora06:~>traceroute -q 1 -m 20 lhr.comsats.net.pk
       traceroute to lhr.comsats.net.pk (210.56.16.10), 20 hops max, 40 byte packets
        1 RTR-CORE1.SLAC.Stanford.EDU (134.79.19.2) 0.642 ms
        2 RTR-MSFC-DMZ.SLAC.Stanford.EDU (134.79.135.21) 0.616 ms
        3 ESNET-A-GATEWAY.SLAC.Stanford.EDU (192.68.191.66) 0.716 ms
        4 snv-slac.es.net (134.55.208.30) 1.377 ms
        5 nyc-snv.es.net (134.55.205.22) 75.536 ms
        6 nynap-nyc.es.net (134.55.208.146) 80.629 ms
        7 gin-nyy-bbl.teleglobe.net (192.157.69.33) 154.742 ms
        8 if-1-0-1.bb5.NewYork.Teleglobe.net (207.45.223.5) 137.403 ms
        9 if-12-0-0.bb6.NewYork.Teleglobe.net (207.45.221.72) 135.850 ms
       10 207.45.205.18 (207.45.205.18) 128.648 ms
       11 210.56.31.94 (210.56.31.94) 762.150 ms
       12 islamabad-gw2.comsats.net.pk (210.56.8.4) 751.851 ms
       13 *
       14 lhr.comsats.net.pk (210.56.16.10) 827.301 ms
     Callouts: -q = probes/hop, -m = max hops, final argument = remote host; router names give the location; long delay = satellite; * = no response (lost packet or router ignores the probe).
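
One way to spot a long-delay hop such as the satellite link above is to look for the largest RTT increase between consecutive responding hops. This is a rough sketch under stated assumptions: it expects one probe per hop (`-q 1`) in the line format shown on this slide, and the names `parse_hops` and `largest_rtt_jump` are invented for the example. A single RTT sample per hop is noisy, so the result is a hint rather than a diagnosis.

```python
import re

# One probe per hop (traceroute -q 1): "<hop> <name> (<ip>) <rtt> ms".
HOP = re.compile(r"^\s*(\d+)\s+(\S+)\s+\(([\d.]+)\)\s+([\d.]+)\s+ms")

def parse_hops(text):
    """Return (hop number, router name, RTT in ms) for responding hops."""
    hops = []
    for line in text.splitlines():
        m = HOP.match(line)
        if m:  # header lines and non-responding hops ("13 *") are skipped
            hops.append((int(m.group(1)), m.group(2), float(m.group(4))))
    return hops

def largest_rtt_jump(hops):
    """Return (hop number, name, RTT increase in ms) for the biggest jump."""
    best = None
    for (_, _, prev_rtt), (hop, name, rtt) in zip(hops, hops[1:]):
        jump = rtt - prev_rtt
        if best is None or jump > best[2]:
            best = (hop, name, jump)
    return best
```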

  13. RTT from California to world [Figure: RTT (ms) versus longitude (degrees), and frequency distribution of RTT (ms), for source = Palo Alto CA. Labeled regions: W. Coast US, E. Coast US, Brazil, Europe, Europe & S. America; 300 ms marker; reference line 0.3*0.6c. Data from CAIDA Skitter project]

  14. Traceroute server results • Example: www.slac.stanford.edu/cgi-bin/nph-traceroute.pl [Screenshot of the traceroute server page. Callouts: related info, security warning, traceroute output, form to enter IP address or name]

  15. Pingroute • Ping routers along the route, e.g. a tool to install that helps: • www.slac.stanford.edu/comp/net/fpingroute.pl • or www.slac.stanford.edu/comp/net/fpingroute.pl if fping N/A
       15cottrell@noric04:~>fpingroute.pl
       fpingroute.pl does a traceroute to the selected host. For each of the hops along the route it then uses fping to ping each node (in parallel) 'count' times. Output includes traceroute information, RTTs, losses for 100 and 'size' byte pings.
       Version=0.21, 8/24/04
       Usage: fpingroute.pl [Opts] host
         where host is the remote host's IP address or name, e.g. www.slac.stanford.edu
       Opts: [-c count default=10] [-s size default=1400] [-i initial default=1]
       Example: fpingroute.pl -i 3 -c 10 -s 1400 www.triumf.ca

  16. Pingroute example • May help tell where losses start • Will need many pings if losses are small [Figure: pingroute output. Callouts: start of losses? But routers may not respond; start of sustained losses]
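
To put a number on "many pings if losses are small": if each ping is lost independently with probability p, the chance of seeing at least one loss in n pings is 1 - (1 - p)^n. The sketch below solves that for n. Independence is an assumption made here for simplicity (real losses are often bursty, as the Ping slide notes), and `pings_needed` is a hypothetical helper name.

```python
import math

def pings_needed(loss_rate, confidence=0.95):
    """Pings needed to see at least one loss with the given probability.

    Assumes each ping is lost independently with probability loss_rate,
    so P(at least one loss in n pings) = 1 - (1 - loss_rate) ** n.
    """
    if not 0.0 < loss_rate < 1.0:
        raise ValueError("loss_rate must be between 0 and 1 (exclusive)")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1 (exclusive)")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - loss_rate))
```

Under these assumptions a 1% loss rate needs about 300 pings per router just to observe a single loss with 95% probability, and considerably more to estimate the rate itself.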

  17. Look at time series • Look at history plots (PingER, AMP, IEPM-BW, ISPs, own border router etc.), when did problem start, how big an effect is it? • Assumes you know “proximity” of paths for which there are archived active measurements to the path that you are interested in • Also that relevant measurements exist • www-iepm.slac.stanford.edu/pinger/ • amp.nlanr.net/ • ISPs plots: • Abilene: http://stryper.uits.iu.edu/abilene/ • GEANT: http://stats.geant.net/usagemap/usagemap • RIPE: http://www.ripe.net/projects/ttm/Plots/ • ESnet: http://measurement.es.net/ (OWAMP) • Collaboration between Internet2/ESnet/Geant to provide access to router measurements holds promise • Look at traceroute histories (see later)

  18. Example time series • Look for change in measured value • Note time • Correlate [Figure: time series plot. Callout: Italy disconnected]

  19. Find location of a bottleneck • Look at hops along the path • Pingroute (see earlier) • If possible look at utilizations or active probes launched from there • Pipechar (son of pathchar, pchar) • Send packets of varying sizes to each router along path • Look at RTT as a function of packet size • From slope deduce “bandwidth” • Differentiate to find capacity at each hop • However pchar is no longer supported, pathchar is very slow, pipechar has uncertain support (ask Brian) • Packet size variation limited to 1-MTU (~1500) Bytes, so on fast links timing is difficult, with the result that estimates may not be reliable • Find pipechar at: http://www.dsd.lbl.gov/OldProjects/NCS/
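
The slope-and-difference idea can be illustrated with a simplified model. This is not pipechar's or pathchar's actual algorithm. It assumes that the minimum RTT to hop n grows linearly with probe size, with slope equal to the sum of 1/capacity over the forward links up to hop n (replies are taken to be small and fixed-size). The function names are invented for the example.

```python
def fit_slope(sizes, rtts):
    """Least-squares slope of RTT (seconds) against probe size (bytes)."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(rtts) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, rtts))
    return sxy / sxx

def hop_capacities(sizes_bytes, min_rtts_per_hop):
    """Estimate per-hop capacity in bits/s from minimum RTTs per probe size.

    min_rtts_per_hop[i] holds the minimum RTT (seconds) to hop i+1 for each
    size in sizes_bytes. The slope for hop n is cumulative, so the capacity
    of link n comes from the difference between consecutive slopes.
    """
    capacities = []
    previous_slope = 0.0
    for rtts in min_rtts_per_hop:
        slope = fit_slope(sizes_bytes, rtts)  # seconds per byte, cumulative
        delta = slope - previous_slope        # seconds per byte on this link
        # A non-positive difference means noise swamped the signal.
        capacities.append(8.0 / delta if delta > 0 else None)
        previous_slope = slope
    return capacities
```

With probe sizes limited to about 1500 bytes, the per-byte delay on a fast link is tiny compared with timing noise, so the slope differences become unreliable. That is the limitation the slide warns about.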

  20. Divide & Conquer • Abilene has hosts at major PoPs running bwctl • So make measurements from end to middle to ID loss of performance • http://e2epi.internet2.edu/pipes/ami/bwctl/

  21. Correlate with routes (traceanal)

  22. Visualizing traceroutes • One compact page per day • One row per host, one column per hour • One character per traceroute to indicate pathology or change (usually period (.) = no change) • Identify unique routes with a number • Be able to inspect the route associated with a route number • Provide for analysis of long term route evolutions [Figure: traceroute summary table. Callouts: route # at start of day gives an idea of route stability; multiple route changes (due to GEANT), later restored to original route; period (.) means no change]
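
The encoding described on this slide can be sketched in a few lines. This is a simplified illustration, not the traceanal implementation: it only covers "same route as before" (a period) versus "route changed" (the route number) and ignores the other pathology symbols. The function name `encode_day` is hypothetical.

```python
def encode_day(traceroutes):
    """Encode one host's traceroutes for a day as one symbol each.

    traceroutes is a time-ordered list of routes, each a sequence of hop
    addresses. Each unique route gets a number in order of first appearance.
    A traceroute that matches the previous one is shown as "."; otherwise
    its route number is shown (so the first symbol is always a number).
    """
    route_numbers = {}
    symbols = []
    previous = None
    for route in traceroutes:
        route = tuple(route)
        if route not in route_numbers:
            route_numbers[route] = len(route_numbers)
        symbols.append("." if route == previous else str(route_numbers[route]))
        previous = route
    return " ".join(symbols), route_numbers
```

The returned dictionary plays the role of the route table, so a number in the row can be looked up to inspect the actual route.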

  23. Changes in network topology (BGP) can result in dramatic changes in performance [Figure: snapshot of the traceroute summary table (one column per hour, one row per remote host) and samples of traceroute trees generated from the table] Notes: 1. Caltech misrouted via Los-Nettos 100 Mbps commercial net 14:00-17:00 2. ESnet/GEANT working on routes from 2:00 to 14:00 3. A previous occurrence went unnoticed for 2 months 4. Next step is to auto detect and notify [Figure: ABwE measurements in Mbits/s, one per minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am, showing dynamic bandwidth capacity (DBC), cross-traffic (XT) and available bandwidth = DBC - XT. Callouts: drop in performance when the path changed from the original SLAC-CENIC-Caltech to SLAC-ESnet-LosNettos (100 Mbps)-Caltech, with the ESnet-LosNettos segment (100 Mbits/s) in the path; back to original path; changes detected by IEPM-Iperf and ABwE]

  24. Moving towards application • See Brian’s talk • Try user application (mem to mem & disk to disk) • GridFTP, bbcp, bbftp … • Iperf or thrulay (also provides RTT) to test TCP or UDP throughput • dast.nlanr.net/Projects/Iperf/ • www.internet2.edu/~shalunov/thrulay/ • NDT • What are the interface speeds? • What is the bottleneck? • Is there a duplex mismatch? • Are buffers set right (both ends)?

  25. NDT example (Rich Carlson)

  26. Other tools • Ntop • Summarizes libpcap (sniffer) information • Internet2 Detective: • Tests connectivity to I2, bandwidth, multicast, IPv6 • Can run as Java applet • http://detective.internet2.edu/ • NLANR Internet Advisor • Ethereal, tcpdump, snoop for masochists • Passive tools: • Netflow for characterizing network, spotting abnormalities, e.g. • www.itec.oar.net/abilene-netflow • www.slac.stanford.edu/comp/net/slac-netflow/html/SLAC-netflow.html • SNMP based tools

  27. And then … • Wireless • Avoid peer-to-peer/ad-hoc connections • Disable connecting to ad-hoc (set infrastructure only) • Disable bridging • How to do it varies by OS (XP, OSX, Linux) • Ad hoc can still interfere if on same channel • Tools to locate an access point (e.g. Yellow-Jacket) • See • www2.slac.stanford.edu/comp/net/wireless/Wireless-Meeting-Handout.mht • NAT boxes may block or not support application • Private addresses: • 10.0.0.0 - 10.255.255.255 a single class A net • 172.16.0.0 - 172.31.255.255 16 contiguous class Bs • 192.168.0.0 – 192.168.255.255 256 contiguous class Cs
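
When a NAT box is suspected, a quick check is whether an address falls in one of the three private ranges listed above. Here is a minimal sketch using Python's standard `ipaddress` module. The CIDR forms are the equivalents of the ranges on the slide, and `is_private_range` is a name chosen for this example.

```python
import ipaddress

# CIDR equivalents of the three private ranges listed on the slide.
PRIVATE_NETS = [
    ipaddress.ip_network("10.0.0.0/8"),      # 10.0.0.0 - 10.255.255.255
    ipaddress.ip_network("172.16.0.0/12"),   # 172.16.0.0 - 172.31.255.255
    ipaddress.ip_network("192.168.0.0/16"),  # 192.168.0.0 - 192.168.255.255
]

def is_private_range(address):
    """Return True if the IPv4 address is in one of the private ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in PRIVATE_NETS)
```

If a host's own interface address is in one of these ranges but remote servers see a different source address, a NAT box sits somewhere on the path.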

  28. “Where is” a host? • Beware: some of the following information is ephemeral; in general, use heuristics with Google • Google “Internet country codes” for TLDs • Host may not be in the TLD's country; developing regions especially often use proxies elsewhere • Location may be encoded in router name • ipls=Indianapolis, snv=Sunnyvale … • Name server lookup to find hostname given IP address
       47cottrell@netflow:~>nslookup 210.56.16.10
       Server: localhost
       Address: 127.0.0.1
       Name: lhr.comsats.net.pk
       Address: 210.56.16.10
     • Use a whois server, e.g. • www.networksolutions.com/cgi-bin/whois/whois (Americas & Africa) • www.ripe.net/cgi-bin/whois (Europe) • www.apnic.net/ (Asia) • May identify site name, address, contact, etc.; not all domains are in the databases (e.g. will not find comsats.net.pk)
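
The router-name heuristic can be automated with a small lookup table. The sketch below is illustrative only: the table holds just the two codes given on the slide, real naming conventions differ between ISPs, and `guess_locations` is a hypothetical helper.

```python
import re

# Only the two codes mentioned on the slide; extend from local knowledge.
LOCATION_CODES = {
    "ipls": "Indianapolis",
    "snv": "Sunnyvale",
}

def guess_locations(hostname):
    """Return place names for any known location codes in a router name."""
    # Split on anything that is not a letter (dots, dashes, digits).
    tokens = re.split(r"[^a-z]+", hostname.lower())
    return [LOCATION_CODES[t] for t in tokens if t in LOCATION_CODES]
```

For example, `guess_locations("snv-slac.es.net")` picks out Sunnyvale from the fourth hop of the traceroute shown earlier.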

  29. “Where is” a host – cont. • Find the Autonomous System (AS) administering the host • Form giving AS for domain name • http://www.fixedorbit.com/search.htm • Gives AS number, name, adjacent ASes, web page for the AS • Given an AS, find out more about it: • Use http://bgp.potaroo.net/cidr/, go to the bottom and enter the AS into the form: • Gives ISP name, web page, phone number, email, hours etc. • Review list of ASes ordered by Upstream AS Adjacency • www.telstra.net/ops/bgp/bgp-as-upsstm.txt • Tells what AS is upstream of an ISP

  30. “Where is” a host - cont. • May be able to get latitude & longitude: • http://www.hostip.info/index.html • http://www.ip2location.com/ • But it is a subscriber service ($$$, but …), however it is probably best for developing regions • Triangulate pings from landmarks (in development) • planetlab-01.ipv6.lip6.fr:10000/cbg.php

  31. Who you gonna tell? • Local network support people • Internet Service Provider (ISP) usually done by local networker • Usually will know immediate one, e.g. trouble@es.net • Use puck.nether.net/netops/nocs.cgi to find ISP • Use www.telstra.net/ops/bgp/bgp-as-upsstm.txt to find upstream ISPs • Well managed sites and ISPs maintain a list of email addresses such as abuse@ or postmaster@, that one can send email to, for example to complain about spam etc. • This follows an Internet recommendation (RFC 2142). • Some less helpful sites do not provide such services, for more on these, see RFC-ignorant.org

  32. What ya gonna tell ‘em? • Describe problem with details • What is affected? • Application, host OS (uname -a), NIC (ifconfig, route) • How is it affected? • Non responsiveness, unable to contact remote host • Slow performance (see Brian’s talk), packet loss • When did it start? • Send ping output between hosts • Send traceroute forward & reverse, if possible • Maybe use -I (ICMP option) • NDT • Identify when it started • If complex think about creating web page with details • Top, vmstat, pingroute, pipechar, application output (GridFTP, iperf)…
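
Gathering these details can be scripted so that the same report is produced every time. The following is a minimal sketch under stated assumptions: it assumes a Unix-like host where the commands named on the slides (`uname`, `ifconfig`, `ping`, `traceroute`) are installed, and the function names are invented for the example. Commands that are missing are noted in the report rather than aborting the run.

```python
import subprocess

def default_commands(remote_host):
    """Commands from the slides; availability and flags vary by OS."""
    return [
        ["uname", "-a"],
        ["ifconfig", "-a"],
        ["ping", "-c", "6", remote_host],
        ["traceroute", remote_host],
    ]

def run_commands(commands, timeout=60):
    """Run each command and return one text report with all outputs."""
    sections = []
    for argv in commands:
        header = "$ " + " ".join(argv)
        try:
            done = subprocess.run(argv, capture_output=True, text=True,
                                  timeout=timeout)
            body = done.stdout + done.stderr
        except OSError as exc:  # e.g. command not installed
            body = "(could not run: %s)\n" % exc
        except subprocess.TimeoutExpired:
            body = "(timed out after %s s)\n" % timeout
        sections.append(header + "\n" + body)
    return "\n".join(sections)
```

A report for a problem path could then be produced with `run_commands(default_commands("remote.example.org"))`, where the host name is a placeholder, and pasted into the email or web page sent to support.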

  33. Web page examples: Case studies • http://www.slac.stanford.edu/grp/scs/net/case/html/ • http://e2epi.internet2.edu/case-studies/

  34. More Information • Tutorial on monitoring • www.slac.stanford.edu/comp/net/wan-mon/tutorial.html • RFC 2151 on Internet tools • www.freesoft.org/CIE/RFC/Orig/rfc2151.txt • Network monitoring tools • www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html • www.caida.org/tools/taxonomy/ • Network Performance Tools: an I2 Cookbook • e2epi.internet2.edu/network-perf-wk/tools-cookbook.pdf • Network Monitoring sites • www.slac.stanford.edu/comp/net/wan-mon/netmon.html

  35. Pathology Encodings [Figure: legend of the traceroute summary table symbols. Callouts: no change; change but same AS; change in only 4th octet; probe type; end host not pingable; hop does not respond; stutter; multihomed; ICMP checksum; ! annotation (!X)]

  36. Navigation
       traceroute to CCSVSN04.IN2P3.FR (134.158.104.199), 30 hops max, 38 byte packets
        1 rtr-gsr-test (134.79.243.1) 0.102 ms
       …
       13 in2p3-lyon.cssi.renater.fr (193.51.181.6) 154.063 ms !X

       #rt# firstseen  lastseen   route
       0    1086844945 1089705757 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
       1    1087467754 1089702792 ...,192.68.191.83,171.64.1.132,137,...,131.215.xxx.xxx
       2    1087472550 1087473162 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
       3    1087529551 1087954977 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
       4    1087875771 1087955566 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,(n/a),131.215.xxx.xxx
       5    1087957378 1087957378 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
       6    1088221368 1088221368 ...,192.68.191.146,134.55.209.1,134.55.209.6,...,131.215.xxx.xxx
       7    1089217384 1089615761 ...,192.68.191.83,137.164.23.41,(n/a),...,131.215.xxx.xxx
       8    1089294790 1089432163 ...,192.68.191.83,137.164.23.41,137.164.22.37,(n/a),...,131.215.xxx.xxx
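
The firstseen and lastseen columns look like Unix epoch timestamps (seconds since 1970-01-01 UTC). The slide does not say so, so treat that as an assumption. Under it, the values can be turned into readable dates and route lifetimes. The helper names below are made up for illustration.

```python
from datetime import datetime, timezone

def epoch_to_utc(seconds):
    """Format a Unix timestamp (assumed epoch seconds) as a UTC date-time."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")

def lifetime_days(firstseen, lastseen):
    """Days between the first and last sighting of a route."""
    return (lastseen - firstseen) / 86400.0
```

Under that assumption, route 0 was first seen on 2004-06-10 and was still being seen about 33 days later, while routes 5 and 6 were each seen only once.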

  37. History Channel

  38. AS’ information
