
End-2-End Network Monitoring: What do we do? What do we use it for?


Presentation Transcript


  1. End-2-End Network Monitoring: What do we do? What do we use it for?
  Richard Hughes-Jones (many people are involved)
  GNEW2004, CERN, March 2004

  2. DataGRID WP7: Network Monitoring Architecture for Grid Sites
  • Architecture diagram: Grid applications (e.g. GridFTP) query a local LDAP server; a backend LDAP script fetches metrics and a monitor process pushes metrics.
  • Local network monitoring tools: PingER (RIPE TTB), iperf, UDPmon, rTPL, NWS, etc.
  • Store & analysis of data, accessed via the local LDAP server.
  • Grid applications access, via the LDAP schema, the monitoring metrics and the location of monitoring data (an illustrative LDAP query is sketched below).
  • Access to current and historic data and metrics via the Web (WP7 NM pages), including metric forecasts.
  Slide credit: Robin Tasker
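As an illustration of the access path described above, the sketch below queries a site's monitoring LDAP server for published network metrics using the python-ldap module. The server URI, base DN, object class and attribute names are hypothetical placeholders, not the actual WP7 schema.

```python
# Minimal sketch: query a site's monitoring LDAP server for network metrics.
# The URI, base DN, object class and attribute names are illustrative
# placeholders, NOT the real DataGrid WP7 schema.
import ldap

def fetch_metrics(uri="ldap://ldap.example-site.org:2135",
                  base_dn="ou=NetworkMonitoring,o=Grid"):
    conn = ldap.initialize(uri)
    conn.simple_bind_s()  # anonymous bind
    results = conn.search_s(
        base_dn,
        ldap.SCOPE_SUBTREE,
        filterstr="(objectClass=networkMetric)",
        attrlist=["metricName", "metricValue", "sourceHost", "destHost", "timestamp"],
    )
    for dn, attrs in results:
        print(dn, {k: [v.decode() for v in vals] for k, vals in attrs.items()})
    conn.unbind_s()

if __name__ == "__main__":
    fetch_metrics()
```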

  3. WP7 Network Monitoring Components
  • Component diagram: clients (Web display, predictions, Grid brokers, analysis) read from LDAP servers and a Web interface.
  • Data layers: tables, plots and raw measurements.
  • A scheduler controls cron scripts that drive the measurement tools: Ping, Netmon, UDPmon, iPerf, RIPE.

  4. WP7 MapCentre: Grid Monitoring & Visualisation
  • Grid network monitoring architecture uses LDAP & R-GMA (DataGrid WP7)
  • Central MySQL archive hosting all network metrics and GridFTP logging
  • Probe Coordination Protocol deployed, scheduling tests
  • MapCentre also provides site & node fabric health checks
  Slide credit: Franck Bonnassieux, CNRS Lyon

  5. WP7 MapCentre: Grid Monitoring & Visualisation
  [Plots of monitored throughput: CERN – RAL UDP, CERN – IN2P3 UDP, CERN – RAL TCP, CERN – IN2P3 TCP]

  6. UK e-Science: Network Monitoring
  • Technology transfer: DataGrid WP7 (Manchester) → UK e-Science (Daresbury Laboratory)
  • Architecture taken from DataGrid WP7

  7. UK e-Science: Network Problem Solving
  • 24 Jan to 4 Feb 04, TCP iperf, DL to HEP sites: DL -> RAL ~80 Mbit/s
  • 24 Jan to 4 Feb 04, TCP iperf, RAL to HEP sites: only 2 sites >80 Mbit/s; RAL -> DL 250-300 Mbit/s

  8. Tools: UDPmon – Latency & Throughput
  • UDP/IP packets sent between end systems
  • Latency:
  • Round trip times using Request-Response UDP frames
  • Latency as a function of frame size
  • Slope s given by: mem-mem copy(s) + PCI + Gig Ethernet + PCI + mem-mem copy(s)
  • Intercept indicates processing times + HW latencies
  • Histograms of 'singleton' measurements
  • UDP throughput (see the sketch after this slide):
  • Send a controlled stream of UDP frames (n packets of a given size) spaced at a regular wait time
  • Vary the frame size and the frame transmit spacing & measure:
  • The time of the first and last frames received
  • The number of packets received, lost, & out of order
  • Histogram of the inter-packet spacing of received packets
  • Packet loss pattern
  • 1-way delay
  • CPU load
  • Number of interrupts
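A minimal sketch of the throughput side of such a measurement is shown below: a sender paces a stream of UDP frames, and a receiver counts frames, detects loss and reordering from a sequence number, and timestamps the first and last arrivals. This illustrates the method only; it is not the actual UDPmon code, and the host name, port and packet counts are placeholders.

```python
# Illustrative sketch of a UDPmon-style throughput probe (not the real UDPmon).
# Sender paces n frames of a chosen size; receiver derives throughput, loss and
# reordering from a sequence number carried in each frame.
import socket, struct, time

def send_stream(dest=("receiver.example.org", 5001), n_frames=10000,
                frame_bytes=1472, spacing_s=100e-6):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"\x00" * (frame_bytes - 4)          # 4 bytes reserved for seq no
    next_send = time.perf_counter()
    for seq in range(n_frames):
        sock.sendto(struct.pack("!I", seq) + payload, dest)
        next_send += spacing_s                      # regular transmit spacing
        while time.perf_counter() < next_send:      # busy-wait pacing
            pass

def receive_stream(port=5001, n_expected=10000, frame_bytes=1472):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    sock.settimeout(2.0)
    first = last = None
    received, out_of_order, prev_seq = 0, 0, -1
    for _ in range(n_expected):
        try:
            data, _ = sock.recvfrom(frame_bytes + 64)
        except socket.timeout:
            break
        now = time.perf_counter()
        first = first if first is not None else now
        last = now
        seq = struct.unpack("!I", data[:4])[0]
        out_of_order += seq < prev_seq              # arrived behind a later frame
        prev_seq = max(prev_seq, seq)
        received += 1
    duration = (last - first) if received > 1 else float("nan")
    print(f"received={received} lost={n_expected - received} "
          f"out_of_order={out_of_order} "
          f"throughput={received * frame_bytes * 8 / duration / 1e6:.1f} Mbit/s")
```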

  9. UDPmon Example: 1 Gigabit NIC, Intel PRO/1000
  • Motherboard: Supermicro P4DP6; Chipset: E7500 (Plumas)
  • CPU: Dual Xeon 2.2 GHz with 512k L2 cache
  • Mem bus 400 MHz; PCI-X 64 bit 66 MHz
  • HP Linux kernel 2.4.19 SMP; MTU 1500 bytes
  • NIC: Intel PRO/1000 XT
  [Plots: throughput, latency, bus activity (send transfer / receive transfer)]

  10. Tools: Trace-Rate – Hop-by-hop Measurements
  • A method to measure the hop-by-hop capacity, delay, and loss up to the path bottleneck
  • Not intrusive; operates in a high-performance environment; does not need cooperation of the destination
  • Based on the Packet Pair method (see the sketch below):
  • Send sets of back-to-back packets with increasing time-to-live
  • For each set, filter "noise" from the rtt
  • Calculate the spacing – hence the bottleneck bandwidth
  • Robust regarding the presence of invisible nodes
  [Figure: effect of the bottleneck on a packet pair, where L is the packet size and C the capacity; parameters are iteratively analysed to extract the capacity mode]
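The packet-pair relation behind this is that a bottleneck of capacity C spreads two back-to-back packets of size L to a spacing Δt = L/C, so each observed dispersion gives an estimate C = L/Δt. The sketch below applies that estimate to a set of measured dispersions and takes the mode of the resulting histogram; it is only an illustration of the idea, not the Trace-Rate implementation.

```python
# Illustrative packet-pair capacity estimate (not the Trace-Rate implementation).
# A bottleneck of capacity C stretches two back-to-back packets of size L to a
# spacing dt = L / C, so each observed dispersion gives an estimate C = L / dt.
from collections import Counter

def estimate_capacity(dispersions_s, packet_bytes=1500, bin_mbit=10):
    """dispersions_s: measured packet-pair spacings in seconds."""
    estimates = [packet_bytes * 8 / dt for dt in dispersions_s if dt > 0]  # bit/s
    # Histogram the estimates and take the mode, which is less sensitive to
    # cross-traffic "noise" than the mean.
    bins = Counter(round(c / 1e6 / bin_mbit) * bin_mbit for c in estimates)
    mode_mbit, _ = bins.most_common(1)[0]
    return mode_mbit

# Example: spacings of ~120 microseconds for 1500-byte packets -> ~100 Mbit/s.
print(estimate_capacity([118e-6, 121e-6, 119e-6, 250e-6, 120e-6]), "Mbit/s")
```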

  11. Tools: Trace-Rate – Some Results
  • Capacity measurements as a function of load (in Mbit/s) from tests on the DataTAG link
  • Comparison of the number of packets required
  • Validated by simulations in NS-2
  • Linux implementations, working in a high-performance environment
  • Research report: http://www.inria.fr/rrrt/rr-4959.html
  • Research paper: ICC 2004, International Conference on Communications, Paris, France, June 2004. IEEE Communications Society.

  12. Network Monitoring as a Tool to study:
  • Protocol behaviour
  • Network performance
  • Application performance
  • Tools include: web100, tcpdump
  • Output from the test tools: UDPmon, iperf, …
  • Output from the applications: GridFTP, bbcp, Apache

  13. Protocol Performance: RUDP (Hans Blom)
  • Monitoring from a data-moving application & a network test program (DataTAG WP3 work)
  • Test setup: path Ams-Chi-Ams, Force10 loopback; moving data from the DAS-2 cluster with RUDP – a UDP-based transport
  • Apply 11*11 TCP background streams from iperf (see the sketch below)
  • Conclusions: RUDP performs well; it does back off and share bandwidth, and rapidly expands when bandwidth is free
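Background load of this kind can be generated by launching several iperf TCP clients in parallel; the sketch below starts a number of iperf processes with -P parallel streams each. The host names are placeholders and the exact way the original tests were driven is not documented on the slide, so treat this as an illustrative harness only.

```python
# Illustrative harness for generating TCP background load with iperf
# (iperf2 client flags: -c server, -P parallel streams, -t duration in seconds).
# Server names are placeholders; the slide's load was 11 x 11 TCP streams.
import subprocess

def start_background_load(servers, streams_per_server=11, duration_s=60):
    procs = []
    for host in servers:
        cmd = ["iperf", "-c", host, "-P", str(streams_per_server),
               "-t", str(duration_s)]
        procs.append(subprocess.Popen(cmd, stdout=subprocess.DEVNULL))
    return procs

if __name__ == "__main__":
    # 11 servers x 11 streams each ~= the 11*11 background load on the slide.
    servers = [f"bg-host{i}.example.org" for i in range(11)]
    for p in start_background_load(servers):
        p.wait()
```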

  14. Performance of the GÉANT Core Network
  • Test setup: Supermicro PCs in the London & Amsterdam GÉANT PoPs; Smartbits in the London & Frankfurt GÉANT PoPs
  • Long link: UK-SE-DE2-IT-CH-FR-BE-NL; short link: UK-FR-BE-NL
  • Network Quality of Service: LBE, IP Premium
  • High-throughput transfers with standard and advanced TCP stacks
  • Packet re-ordering effects
  [Plots: jitter for IPP and BE flows under load – BE flow with 60+40% BE+LBE background, IPP flow with 60+40% BE+LBE background, IPP flow with no background]

  15. Tests on the GÉANT Core: Packet Re-ordering
  • Effect of LBE background, Amsterdam–London, BE test flow
  • Packets sent at 10 µs spacing – line speed; 10,000 sent
  • Packet loss ~0.1%
  • Re-order distributions (a way of deriving them from sequence numbers is sketched below)
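One simple way to build such a re-order distribution is to record the sequence number of every received packet and, for each packet that arrives behind a higher sequence number, note how far behind it is. The sketch below does this from a list of received sequence numbers; it is a generic illustration, not the analysis code used for these tests.

```python
# Illustrative re-ordering analysis: given the packet sequence numbers in
# arrival order, count losses and histogram a simple re-ordering "extent"
# (how far behind the highest sequence number seen so far a late packet is).
from collections import Counter

def reorder_distribution(arrived_seqs, n_sent):
    lost = n_sent - len(set(arrived_seqs))
    extents = Counter()
    highest = -1
    for seq in arrived_seqs:
        if seq > highest:
            highest = seq                 # packet is in order
        else:
            extents[highest - seq] += 1   # packet arrived late by this many seqs
    return lost, extents

# Example: 10 packets sent, one pair swapped and one packet lost.
lost, extents = reorder_distribution([0, 1, 3, 2, 4, 5, 7, 8, 9], 10)
print(lost, dict(extents))   # -> 1 {1: 1}
```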

  16. MB-NG: Application Throughput + Web100
  • 2 Gbyte file transferred from RAID0 disks; Web100 output every 10 ms
  • GridFTP: throughput alternates between 600/800 Mbit/s and zero
  • Apache web server + curl-based client: steady 720 Mbit/s

  17. VLBI Project: Throughput, Jitter, 1-way Delay, Loss
  • 1472-byte packets, Manchester -> Dwingeloo (JIVE)
  • Jitter: FWHM 22 µs (back-to-back 3 µs)
  • Packet loss distribution – probability density function P(t) = λ e^(-λt), with mean λ = 2360 /s [426 µs]
  • 1-way delay – note the packet loss (points with zero 1-way delay)

  18. Passive Monitoring
  • Time-series data from routers and switches, usually derived from SNMP; immediate, but usually historical (MRTG)
  • Spot mis-configured / infected / misbehaving end systems (or users?) – note Data Protection laws & confidentiality
  • Site, MAN and backbone topology & load
  • Helps the user/sysadmin isolate a problem – e.g. a slow TCP transfer
  • Essential for proof-of-concept tests or protocol testing
  • Trends used for capacity planning
  • Control of P2P traffic
  (A minimal utilisation calculation from SNMP interface counters is sketched below.)
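Tools like MRTG derive link load by polling the SNMP interface octet counters (ifInOctets/ifOutOctets) at fixed intervals and differencing them. The sketch below shows just that arithmetic, including 32-bit counter wrap, applied to two raw readings; how the readings are obtained (which SNMP library, which OIDs, which community) is deliberately left out and would depend on the deployment.

```python
# Utilisation from two SNMP octet-counter readings, MRTG-style.
# Counter values are assumed to come from ifInOctets/ifOutOctets polls;
# the polling itself (library, OIDs, community strings) is omitted here.

COUNTER32_MAX = 2**32

def link_utilisation(octets_t0, octets_t1, interval_s, link_capacity_bit_s):
    """Return (bit rate, fractional utilisation) between two polls."""
    delta = (octets_t1 - octets_t0) % COUNTER32_MAX   # handle 32-bit wrap
    bit_rate = delta * 8 / interval_s
    return bit_rate, bit_rate / link_capacity_bit_s

# Example: a 1 Gbit/s access link polled every 300 s (a typical MRTG interval).
rate, util = link_utilisation(3_123_456_789, 3_323_456_789, 300, 1e9)
print(f"{rate/1e6:.1f} Mbit/s, {util:.1%} of link capacity")
```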

  19. Users: The Campus & the MAN [1] (Pete White, Pat Myers)
  • NNW-to-SJ4 access: 2.5 Gbit PoS – hits 1 Gbit, 50%
  • Manchester-to-NNW access: 2 * 1 Gbit Ethernet

  20. Users: The Campus & the MAN [2]
  • Message: not a complaint – continue to work with your network group, understand the traffic levels, understand the network topology
  • LMN to site 1 access: 1 Gbit Ethernet
  • LMN to site 2 access: 1 Gbit Ethernet

  21. VLBI Traffic Flows (only testing – could be worse!)
  • Manchester – NetNorthWest – SuperJANET access links: two 1 Gbit/s
  • Access links: SJ4 to GÉANT, GÉANT to SURFnet

  22. GGF: Hierarchy Characteristics Document
  • Network Measurement Working Group: "A Hierarchy of Network Performance Characteristics for Grid Applications and Services"
  • The document defines terms & relations: network characteristics, measurement methodologies, observation; it discusses nodes & paths
  • For each characteristic it defines the meaning, the attributes that SHOULD be included, and the issues to consider when making an observation
  • Status:
  • Originally submitted to GFSG as a Community Practice Document, draft-ggf-nmwg-hierarchy-00.pdf, Jul 2003
  • Revised to Proposed Recommendation, http://www-didc.lbl.gov/NMWG/docs/draft-ggf-nmwg-hierarchy-02.pdf, 7 Jan 04
  • Now in a 60-day public comment period from 28 Jan 04 – 18 days to go

  23. GGF: Schemata for Network Measurements
  • Model: a Network Monitoring Service accepts an XML test request and returns XML test results
  • Request schema: ask for results / ask to make a test; a schema requirements document has been produced
  • Uses DAMED-style names, e.g. path.delay.oneWay
  • Send: characteristic, time, subject = node | path, methodology, statistics
  • Response schema: interpret results; includes the observation environment
  • Much work in progress: common components, drafts almost done
  • 2 (3) proof-of-concept implementations: 2 using XML-RPC by Internet2 & SLAC; an implementation in progress using Document/Literal by DL & UCL
  (An illustrative request message in this spirit is sketched below.)
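To make the request/response idea concrete, the sketch below builds a small XML request asking for path.delay.oneWay observations between two nodes. The element and attribute names are invented for illustration; they are not the actual NMWG request schema, which was still being drafted at the time.

```python
# Illustrative network-measurement request in the spirit of the NMWG work.
# Element/attribute names here are invented placeholders, NOT the real schema.
import xml.etree.ElementTree as ET

def build_request(src, dst, characteristic="path.delay.oneWay",
                  start="2004-03-15T00:00:00Z", end="2004-03-16T00:00:00Z"):
    req = ET.Element("measurementRequest")
    subj = ET.SubElement(req, "subject", type="path")
    ET.SubElement(subj, "source").text = src
    ET.SubElement(subj, "destination").text = dst
    ET.SubElement(req, "characteristic").text = characteristic  # DAMED-style name
    ET.SubElement(req, "time", start=start, end=end)
    ET.SubElement(req, "statistic").text = "mean"
    return ET.tostring(req, encoding="unicode")

print(build_request("node1.example.org", "node2.example.org"))
```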

  24. So What do we Use Monitoring for: A Summary
  • Detect or cross-check problem reports; isolate / determine a (user) performance issue
  • Capacity planning; SLA verification
  • Publication of data: network "cost" for middleware – e.g. Resource Brokers for optimized matchmaking, the WP2 Replica Manager, GridFTP throughput
  • Isolate / determine a throughput bottleneck – work with real user problems
  • Test conditions for protocol/hardware investigations: protocol performance / development, hardware performance / development, application analysis
  • End-to-end time series: UDP/TCP throughput, rtt, packet loss
  • Passive monitoring: routers & switches via SNMP, historical MRTG
  • Packet/protocol dynamics: tcpdump, web100, output from application tools

  25. More Information – Some URLs
  • DataGrid WP7 Mapcenter: http://ccwp7.in2p3.fr/wp7archive/ & http://mapcenter.in2p3.fr/datagrid-rgma/
  • UK e-Science monitoring: http://gridmon.dl.ac.uk/gridmon/
  • MB-NG project web site: http://www.mb-ng.net/
  • DataTAG project web site: http://www.datatag.org/
  • UDPmon / TCPmon kit + write-up: http://www.hep.man.ac.uk/~rich/net
  • Motherboard and NIC tests: www.hep.man.ac.uk/~rich/net
  • IEPM-BW site: http://www-iepm.slac.stanford.edu/bw

  26. [blank slide]

  27. Network Monitoring to Grid Sites
  • Network tools developed
  • Using network monitoring as a study tool
  • Applications & network monitoring – real users
  • Passive monitoring
  • Standards – links to GGF

  28. Data Flow: SuperMicro 370DLE, SysKonnect
  • Motherboard: SuperMicro 370DLE; Chipset: ServerWorks III LE
  • CPU: PIII 800 MHz; PCI: 64 bit, 66 MHz
  • RedHat 7.1, kernel 2.4.14
  • 1400 bytes sent, wait 100 µs
  • ~8 µs for send or receive; send PCI ~36 µs
  • Stack & application overhead ~10 µs / node
  [Timeline: send CSR setup, send transfer, packet on Ethernet fibre, receive PCI, receive transfer]

  29. 10 GigEthernet: Throughput
  • 1500 byte MTU gives ~2 Gbit/s; used 16144 byte MTU, max user length 16080
  • DataTAG Supermicro PCs: dual 2.2 GHz Xeon, FSB 400 MHz, PCI-X mmrbc 512 bytes – wire-rate throughput of 2.9 Gbit/s
  • SLAC Dell PCs: dual 3.0 GHz Xeon, FSB 533 MHz, PCI-X mmrbc 4096 bytes – wire rate of 5.4 Gbit/s
  • CERN OpenLab HP Itanium PCs: dual 1.0 GHz 64-bit Itanium, FSB 400 MHz, PCI-X mmrbc 4096 bytes – wire rate of 5.7 Gbit/s

  30. Tuning PCI-X: Variation of mmrbc, IA32
  • 16080-byte packets every 200 µs; Intel PRO/10GbE LR adapter
  • PCI-X bus occupancy vs mmrbc (maximum memory read byte count)
  • Plots compare measured times, times based on PCI-X timings from the logic analyser, and the expected throughput (see the sketch below for the effect of mmrbc on burst count)
  [Logic-analyser traces for mmrbc = 512, 1024, 2048, 4096 bytes: CSR access, PCI-X sequence, data transfer, interrupt & CSR update]
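The reason mmrbc matters is that it caps the size of each PCI-X burst, so a 16080-byte packet needs many more bursts (each with its own arbitration and header overhead) at mmrbc = 512 than at 4096. The sketch below simply computes the burst count and a rough per-packet bus time for the mmrbc values on the slide, assuming a 64-bit, 133 MHz PCI-X bus; the per-burst overhead in clocks is chipset-dependent and is left as a free parameter with an arbitrary illustrative value.

```python
# Effect of PCI-X mmrbc (maximum memory read byte count) on per-packet bursts.
# Assumes a 64-bit, 133 MHz PCI-X bus (8 bytes per clock in the data phase);
# per-burst overhead cycles depend on the chipset, so they are a free parameter.
import math

PACKET_BYTES = 16080          # max user length used on the slide
BUS_BYTES_PER_CLOCK = 8       # 64-bit bus
BUS_CLOCK_HZ = 133e6          # PCI-X at 133 MHz

def per_packet_bus_time(mmrbc, overhead_clocks_per_burst=20):
    bursts = math.ceil(PACKET_BYTES / mmrbc)
    data_clocks = math.ceil(PACKET_BYTES / BUS_BYTES_PER_CLOCK)
    total_clocks = data_clocks + bursts * overhead_clocks_per_burst
    return bursts, total_clocks / BUS_CLOCK_HZ * 1e6   # microseconds

for mmrbc in (512, 1024, 2048, 4096):
    bursts, t_us = per_packet_bus_time(mmrbc)
    print(f"mmrbc={mmrbc:4d}: {bursts:2d} bursts, ~{t_us:.1f} µs on the bus")
```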

  31. 10 GigEthernet at the SC2003 BW Challenge
  • Three server systems with 10 GigEthernet NICs; used the DataTAG altAIMD stack, 9000 byte MTU
  • Sent mem-mem iperf TCP streams from the SLAC/FNAL booth in Phoenix to:
  • Palo Alto PAIX: rtt 17 ms, window 30 MB; shared with the Caltech booth; 4.37 Gbit HSTCP I=5%, then 2.87 Gbit I=16% (the fall corresponds to 10 Gbit on the link); 3.3 Gbit Scalable I=8%; tested 2 flows, sum 1.9 Gbit I=39%
  • Chicago Starlight: rtt 65 ms, window 60 MB; Phoenix CPU 2.2 GHz; 3.1 Gbit HSTCP I=1.6%
  • Amsterdam SARA: rtt 175 ms, window 200 MB; Phoenix CPU 2.2 GHz; 4.35 Gbit HSTCP I=6.9%; very stable
  • Both the Chicago and Amsterdam paths used Abilene to Chicago

  32. Summary & Conclusions
  • Intel PRO/10GbE LR adapter and driver gave stable throughput and worked well
  • Need a large MTU (9000 or 16114) – 1500 bytes gives ~2 Gbit/s
  • PCI-X tuning: mmrbc = 4096 bytes increases throughput by 55% (3.2 to 5.7 Gbit/s)
  • PCI-X sequences clear on transmit, gaps ~950 ns
  • Transfers: transmission (22 µs) takes longer than receiving (18 µs)
  • Tx rate 5.85 Gbit/s, Rx rate 7.0 Gbit/s (Itanium); PCI-X max 8.5 Gbit/s
  • CPU load considerable: 60% Xeon, 40% Itanium
  • Bandwidth of the memory system is important – the data crosses it 3 times!
  • Sensitive to OS / driver updates; more study needed

  33. PCI Activity: Read Multiple Data Blocks, 0 wait
  • Read 999424 bytes; each data block (131,072 bytes): set up CSRs, data movement, update CSRs
  • For 0 wait between reads: data blocks are ~600 µs long and the read takes ~6 ms, followed by a 744 µs gap
  • PCI transfer rate 1188 Mbit/s (148.5 Mbytes/s); read_sstor rate 778 Mbit/s (97 Mbyte/s)
  • PCI bus occupancy: 68.44%; PCI bursts of 4096 bytes
  • Concern about Ethernet traffic: 64 bit 33 MHz PCI needs ~82% occupancy for 930 Mbit/s – expect ~360 Mbit/s
  [Logic-analyser trace: data transfer, data block, CSR access]

  34. PCI Activity: Read Throughput
  • Flat, then a 1/t dependence
  • ~860 Mbit/s for read blocks >= 262144 bytes
  • CPU load ~20%; concern about the CPU load needed to drive a Gigabit link

  35. BaBar Case Study: RAID Throughput & PCI Activity
  • 3Ware 7500-8, RAID5, parallel EIDE; 3Ware forces the PCI bus to 33 MHz
  • BaBar Tyan to MB-NG SuperMicro: network mem-mem 619 Mbit/s
  • Disk-to-disk throughput with bbcp: 40-45 Mbytes/s (320-360 Mbit/s)
  • PCI bus effectively full!
  [Traces: read from RAID5 disks, write to RAID5 disks]

  36. BaBar: Serial ATA RAID Controllers
  [Plots: ICP at 66 MHz PCI, 3Ware at 66 MHz PCI]

  37. VLBI Project: Packet Loss Distribution
  • Measure the time between lost packets in the time series of packets sent
  • Lost 1410 packets in 0.6 s – is it a Poisson process?
  • Assume the Poisson process is stationary, λ(t) = λ, and use the probability density function P(t) = λ e^(-λt)
  • Mean λ = 2360 /s [426 µs]
  • Log plot: slope -0.0028, expect -0.0024 – there could be an additional process involved
  (The expected slope and mean spacing follow directly from λ; see the check below.)
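As a quick consistency check of the numbers quoted on the slide (a sketch using only those figures): with λ = 2360 losses per second, the mean spacing between losses is 1/λ ≈ 424 µs, matching the ~426 µs quoted, and the expected slope of ln P(t) against t in microseconds is -λ = -0.00236 per µs, i.e. the -0.0024 the slide expects, compared with the -0.0028 actually fitted.

```python
# Consistency check of the Poisson-loss numbers quoted on the slide.
lam_per_s = 2360.0                         # fitted loss rate, losses per second
mean_spacing_us = 1e6 / lam_per_s          # mean time between losses
expected_slope_per_us = -lam_per_s / 1e6   # slope of ln P(t) vs t (t in µs)

print(f"mean spacing   = {mean_spacing_us:.0f} µs   (slide: ~426 µs)")
print(f"expected slope = {expected_slope_per_us:.5f} /µs "
      f"(slide expects -0.0024; fit gave -0.0028)")

# Losses expected in 0.6 s at this rate:
print(f"expected losses in 0.6 s = {lam_per_s * 0.6:.0f}  (slide: 1410 observed)")
```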
