
Protocols Recent and Current Work.



Presentation Transcript


  1. Protocols – Recent and Current Work. Richard Hughes-Jones, The University of Manchester, www.hep.man.ac.uk/~rich/ then “Talks”. ESLEA Technical Collaboration Meeting, 20-21 Jun 2006, R. Hughes-Jones, Manchester

  2. Outline
  • SC|05 – TCP and UDP memory-to-memory & disk-to-disk flows; 10 Gbit Ethernet
  • VLBI – Jodrell Mark5 problem (see Matt’s talk); data delay on a TCP link – how suitable is TCP? (4th-year MPhys project, Stephen Kershaw & James Keenan); throughput on the 630 Mbit JB-JIVE UKLight link; 10 Gbit in FABRIC
  • ATLAS – network tests on the Manchester Tier-2 farm; the Manc-Lanc UKLight link; ATLAS remote farms
  • RAID tests – HEP server with 8-lane PCIe RAID card

  3. Collaboration at SC|05
  • Caltech booth
  • The BWC at the SLAC booth
  • SCINet
  • StorCloud
  • ESLEA: Boston Ltd. & Peta-Cache Sun

  4. Bandwidth Challenge wins Hat Trick
  • SC2004: 101 Gbit/s; at SC|05 the maximum aggregate bandwidth was >151 Gbit/s – 130 DVD movies in a minute, enough to serve 10,000 MPEG2 HDTV movies in real time
  • 22 × 10 Gigabit Ethernet waves to the Caltech & SLAC/FERMI booths
  • In 2 hours transferred 95.37 TByte; in 24 hours moved ~475 TBytes
  • Showed real-time particle event analysis
  • SLAC/Fermi/UK booth:
  • 1 × 10 Gbit Ethernet to the UK over NLR & UKLight: transatlantic HEP disk-to-disk, VLBI streaming
  • 2 × 10 Gbit links to SLAC: rootd low-latency file access application for clusters; Fibre Channel StorCloud
  • 4 × 10 Gbit links to Fermi: dCache data transfers
  [Plots: traffic into and out of the booth]

  5. ESLEA and UKLight
  • 6 × 1 Gbit transatlantic Ethernet layer 2 paths, UKLight + NLR
  • Disk-to-disk transfers with bbcp, Seattle to UK: set TCP buffer and application to give ~850 Mbit/s; one stream of data at 840-620 Mbit/s (reverse TCP)
  • Stream UDP VLBI data, UK to Seattle: 620 Mbit/s

  6. SLAC 10 Gigabit Ethernet
  • 2 lightpaths: routed over ESnet; layer 2 over UltraScience Net
  • 6 Sun V20Z systems per λ
  • dCache remote disk data access: 100 processes per node; each node sends or receives; one data stream 20-30 Mbit/s
  • Used Neterion NICs & Chelsio TOE
  • Data also sent to StorCloud using fibre channel links
  • Traffic on the 10 GE link for 2 nodes: 3-4 Gbit/s per node, 8.5-9 Gbit/s on the trunk

  7. VLBI Work: TCP Delay and VLBI Transfers
  • Manchester 4th-year MPhys project by Stephen Kershaw & James Keenan

  8. VLBI Network Topology

  9. VLBI Application Protocol: TCP & Network
  [Timing diagram: sender and receiver exchange timestamped data messages (Data1/Timestamp1, Data2/Timestamp2, …); a packet loss delays following messages by an RTT while the ACK is awaited; segment time on wire = bits in segment / BW]
  • Remember the Bandwidth*Delay Product: BDP = RTT × BW
  • VLBI data is Constant Bit Rate
  • tcpdelay: an instrumented TCP program that emulates sending CBR data and records the relative 1-way delay
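The idea behind tcpdelay can be sketched in a few lines: pace fixed-size timestamped messages over TCP at a constant bit rate and record the relative one-way delay at the receiver. This is a minimal single-host sketch, not the real tcpdelay tool – the message size matches the tests, but the rate, port handling and structure are illustrative (in the real measurements the two clocks are on different machines, so only *relative* delay is meaningful).

```python
import socket
import struct
import threading
import time

MSG_SIZE = 1448          # message size used in the Manchester tests
N_MSGS = 200             # short run for illustration
RATE_BPS = 50_000_000    # illustrative constant bit rate (50 Mbit/s)

delays = []

srv = socket.socket()
srv.bind(("127.0.0.1", 0))   # ephemeral port, avoids collisions
srv.listen(1)

def receiver():
    conn, _ = srv.accept()
    for _ in range(N_MSGS):
        buf = b""
        while len(buf) < MSG_SIZE:                # reassemble one message
            chunk = conn.recv(MSG_SIZE - len(buf))
            if not chunk:
                return
            buf += chunk
        sent, = struct.unpack("!d", buf[:8])      # timestamp in first 8 bytes
        delays.append(time.perf_counter() - sent) # relative 1-way delay
    conn.close()

rt = threading.Thread(target=receiver)
rt.start()

s = socket.create_connection(("127.0.0.1", srv.getsockname()[1]))
gap = MSG_SIZE * 8 / RATE_BPS        # inter-message time for CBR pacing
filler = b"x" * (MSG_SIZE - 8)
for _ in range(N_MSGS):
    t0 = time.perf_counter()
    s.sendall(struct.pack("!d", t0) + filler)    # timestamp + payload
    while time.perf_counter() - t0 < gap:        # hold the constant bit rate
        pass
s.close()
rt.join()
print(f"{len(delays)} messages, max 1-way delay {max(delays)*1e3:.2f} ms")
```

On a loopback path the recorded delays are tiny; over the Man-JIVE route the same instrumentation exposes the RTT-sized steps shown on the following slides.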

  10. Check the Send Time
  [Plot: send time (s) vs message number for 10,000 packets, 1 s scale]
  • 10,000 messages; message size 1448 bytes; wait time 0; TCP buffer 64k
  • Route: Man-ukl-JIVE-prod-Man; RTT ~26 ms
  • Slope 0.44 ms/message
  • From TCP buffer size & RTT expect ~42 messages/RTT, ~0.6 ms/message
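The expectation quoted above follows directly from the buffer size and RTT: with the send buffer full, roughly buffer/message-size messages are in flight per round trip. A back-of-envelope sketch (the slide's ~42 messages/RTT presumably allows for per-segment bookkeeping overhead in the buffer, which this ignores):

```python
buffer_bytes = 64 * 1024   # TCP send buffer (64k)
msg_bytes = 1448           # one message = one MSS-sized segment
rtt_ms = 26.0

msgs_per_rtt = buffer_bytes / msg_bytes   # ~45 segments in flight per RTT
ms_per_msg = rtt_ms / msgs_per_rtt        # ~0.57 ms per message
print(f"{msgs_per_rtt:.0f} msgs/RTT, {ms_per_msg:.2f} ms/msg")
```

The measured slope of 0.44 ms/message is somewhat faster than this buffer-limited estimate, consistent with the bursty send pattern seen in the next slide.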

  11. Send Time Detail
  [Plot detail: send time (s) vs message number; bursts of 26 messages about 25 µs apart, one RTT between bursts; annotations at message 76 and message 102; 100 ms scale]
  • TCP send buffer limited
  • After SlowStart the buffer is full
  • Packets sent out in bursts each RTT
  • Program blocked on sendto()

  12. 1-Way Delay
  [Plot: 1-way delay vs message number for 10,000 packets, 100 ms scale]
  • 10,000 messages; message size 1448 bytes; wait time 0; TCP buffer 64k
  • Route: Man-ukl-JIVE-prod-Man; RTT ~26 ms

  13. 1-Way Delay Detail
  [Plot detail: 1-way delay (10 ms scale) vs message number; steps at 1 × RTT (26 ms) and 1.5 × RTT – not 0.5 × RTT]
  • Why not just 1 RTT?
  • After SlowStart the TCP buffer is full
  • Messages at the front of the TCP send buffer have to wait for the next burst of ACKs – 1 RTT later
  • Messages further back in the TCP send buffer wait for 2 RTT

  14. 1-Way Delay with Packet Drop
  [Plot: 1-way delay (10 ms scale) vs message number; steps of ~5 ms; annotations at 28 ms and 800 µs]
  • Route: LAN gig8-gig1; ping 188 µs
  • 10,000 messages; message size 1448 bytes; wait time 0 µs
  • Drop 1 in 1000
  • Manc-JIVE tests show times increasing with a “saw-tooth” around 10 s

  15. 10 Gbit in FABRIC

  16. FABRIC 4 Gbit Demo
  • 4 Gbit lightpath between GÉANT PoPs
  • Collaboration with Dante
  • Continuous (days) data flows – VLBI_UDP and multi-gigabit TCP tests

  17. 10 Gigabit Ethernet: UDP Data Transfer on PCI-X
  [PCI-X trace: data transfer and CSR access; CSR access 2.8 µs]
  • Sun V20z, 1.8 GHz to 2.6 GHz dual Opterons, connected via a 6509
  • Xframe II NIC; PCI-X mmrbc 2048 bytes, 66 MHz
  • One 8000 byte packet: 2.8 µs for CSRs, 24.2 µs data transfer – effective rate 2.6 Gbit/s
  • 2000 byte packet, wait 0 µs: ~200 ms pauses
  • 8000 byte packet, wait 0 µs: ~15 ms between data blocks
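The effective rate quoted for the 8000 byte packet can be checked from the trace numbers alone – packet bits divided by the measured bus transfer time:

```python
pkt_bytes = 8000     # one 8000 byte packet
data_us = 24.2       # measured data-transfer time on the PCI-X bus
csr_us = 2.8         # measured CSR access time

eff_gbit = pkt_bytes * 8 / data_us / 1000            # ~2.6 Gbit/s during transfer
with_csr = pkt_bytes * 8 / (data_us + csr_us) / 1000 # ~2.4 Gbit/s incl. CSRs
print(f"{eff_gbit:.2f} Gbit/s transfer, {with_csr:.2f} Gbit/s incl. CSRs")
```

So even before any inter-packet gaps, the per-packet CSR accesses already shave ~10% off the achievable bus rate.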

  18. ATLAS

  19. ESLEA: ATLAS on UKLight
  • 1 Gbit lightpath Lancaster-Manchester
  • Disk-to-disk transfers
  • Storage Element with SRM using distributed disk pools, dCache & xrootd

  20. udpmon: Lanc-Manc Throughput
  • Lanc → Manc: plateau ~640 Mbit/s wire rate; no packet loss
  • Manc → Lanc: ~800 Mbit/s but packet loss
  • Send times: pause of 695 µs every 1.7 ms, so expect ~600 Mbit/s
  • Receive times (Manc end): no corresponding gaps

  21. udpmon: Manc-Lanc Throughput
  • Manc → Lanc: plateau ~890 Mbit/s wire rate
  • Packet loss: large frames 10% at line rate; small frames 60% at line rate
  • [Plot: 1-way delay]

  22. ATLAS Remote Computing: Application Protocol
  [Timing diagram: Event Filter Daemon (EFD) ↔ SFI and SFO – request event, send event data, process event, request buffer, send OK, send processed event; request-response time histogram]
  • Event request: EFD requests an event from SFI; SFI replies with the event, ~2 Mbytes
  • Processing of event
  • Return of computation: EF asks SFO for buffer space; SFO sends OK; EF transfers the results of the computation
  • tcpmon: an instrumented TCP request-response program that emulates the Event Filter EFD-to-SFI communication
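The request-response pattern that tcpmon instruments can be sketched as a small TCP client/server pair: a 64 byte request, a multi-megabyte "event" in reply, and the request-response time recorded for each event. This is a single-host illustrative sketch, not the real tcpmon (which adds Web100 TCP-state sampling); the sizes match the tests but the names and structure are mine.

```python
import socket
import struct  # not strictly needed here; kept for a framed-variant extension
import threading
import time

REQ_SIZE = 64                 # small request, as in the EFD -> SFI tests
RESP_SIZE = 2 * 1024 * 1024   # ~2 Mbyte event record
N_EVENTS = 5

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

def sfi_server():
    # plays the SFI: waits for a request, replies with one "event"
    conn, _ = srv.accept()
    event = b"e" * RESP_SIZE
    for _ in range(N_EVENTS):
        req = conn.recv(REQ_SIZE)
        if not req:
            break
        conn.sendall(event)
    conn.close()

threading.Thread(target=sfi_server, daemon=True).start()

times = []
c = socket.create_connection(("127.0.0.1", srv.getsockname()[1]))
for _ in range(N_EVENTS):
    t0 = time.perf_counter()
    c.sendall(b"R" * REQ_SIZE)            # event request
    got = 0
    while got < RESP_SIZE:                # read the full event back
        got += len(c.recv(65536))
    times.append(time.perf_counter() - t0)  # request-response time
c.close()
print(f"{N_EVENTS} events, mean req-resp {sum(times)/len(times)*1e3:.2f} ms")
```

Over a 20 ms RTT path the interesting behaviour is how many round trips each 2 Mbyte response costs – which is exactly what the next two slides examine.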

  23. tcpmon: TCP Activity Manc-CERN Req-Resp
  • Web100 hooks for TCP status
  • Round trip time 20 ms
  • 64 byte request (green), 1 Mbyte response (blue)
  • TCP in slow start: 1st event takes 19 rtt or ~380 ms
  • TCP congestion window gets re-set on each request – TCP stack RFC 2581 & RFC 2861 reduction of cwnd after inactivity
  • Even after 10 s, each response takes 13 rtt or ~260 ms
  • Transfer achievable throughput 120 Mbit/s
  • Event rate very low – application not happy!
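A toy slow-start model gives a feel for why the first response costs so many round trips. This is my own rough model, not the real stack: cwnd (in segments) is multiplied each RTT – by 2 with an ACK per segment, by roughly 1.5 with delayed ACKs – until all segments of a 1 Mbyte response have been sent. The observed 19 RTTs also includes connection setup, the request leg, and stack details the model ignores.

```python
import math

mss = 1448
segs = math.ceil(2**20 / mss)   # ~1 Mbyte response = 725 segments

def slow_start_rounds(growth, init_cwnd=2):
    """RTT rounds needed to deliver `segs` segments in slow start."""
    cwnd, sent, rounds = init_cwnd, 0, 0
    while sent < segs:
        sent += cwnd                       # one window sent per RTT
        cwnd = math.floor(cwnd * growth)   # window growth for next round
        rounds += 1
    return rounds

doubling = slow_start_rounds(2.0)   # ACK every segment: 9 rounds
delayed = slow_start_rounds(1.5)    # delayed ACKs (~1.5x/RTT): 14 rounds
print(f"doubling: {doubling} RTTs, delayed ACKs: {delayed} RTTs")
```

Either way the model lands in the same ballpark as the measurement: tens of RTTs – hundreds of milliseconds at 20 ms RTT – before the first event arrives, and the RFC 2861 idle reset forces much of that cost to be paid again on every request.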

  24. tcpmon: TCP Activity Manc-CERN Req-Resp, no cwnd reduction
  [Plot annotations: 3 round trips, 2 round trips]
  • Round trip time 20 ms
  • 64 byte request (green), 1 Mbyte response (blue)
  • TCP starts in slow start: 1st event takes 19 rtt or ~380 ms
  • TCP congestion window grows nicely; response takes 2 rtt after ~1.5 s
  • Rate ~10/s (with 50 ms wait)
  • Transfer achievable throughput grows to 800 Mbit/s
  • Data transferred WHEN the application requires the data
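The event-rate difference between the two cases is simple arithmetic on the numbers from these two slides – 2 RTTs per response with the window held open, versus 13 RTTs per response when cwnd is re-set on each request:

```python
rtt = 0.020    # 20 ms round trip
wait = 0.050   # 50 ms pause between requests in the test

warm = 1 / (2 * rtt + wait)   # cwnd open: response in 2 RTTs -> ~11 events/s
cold = 1 / (13 * rtt)         # cwnd re-set: 13 RTTs/response -> ~3.8 events/s cap
print(f"{warm:.1f} events/s open vs {cold:.1f} events/s ceiling with reset")
```

That is the ~10/s rate quoted above versus a hard ceiling of under 4 events/s with the idle reset – before adding any application wait time at all.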

  25. Recent RAID Tests: Manchester HEP Server

  26. “Server Quality” Motherboards
  • Boston/Supermicro H8DCi
  • Two dual-core Opterons, 1.8 GHz
  • 550 MHz DDR memory
  • HyperTransport
  • Chipset: nVidia nForce Pro 2200/2050; AMD 8132 PCI-X bridge
  • PCI: 2 × 16-lane PCIe buses, 1 × 4-lane PCIe, 133 MHz PCI-X
  • 2 × Gigabit Ethernet
  • SATA

  27. Disk_test:
  • areca PCI-Express 8-port controller; Maxtor 300 GB SATA disks
  • RAID0, 5 disks: read 2.5 Gbit/s, write 1.8 Gbit/s
  • RAID5, 5 data disks: read 1.7 Gbit/s, write 1.48 Gbit/s
  • RAID6, 5 data disks: read 2.1 Gbit/s, write 1.0 Gbit/s

  28. Any Questions?

  29. More Information – Some URLs 1
  • UKLight web site: http://www.uklight.ac.uk
  • MB-NG project web site: http://www.mb-ng.net/
  • DataTAG project web site: http://www.datatag.org/
  • UDPmon / TCPmon kit + write-up: http://www.hep.man.ac.uk/~rich/net
  • Motherboard and NIC tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/ – “Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards”, FGCS Special Issue 2004, http://www.hep.man.ac.uk/~rich/
  • TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
  • TCP stack comparisons: “Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks”, Journal of Grid Computing 2004
  • PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
  • Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

  30. More Information – Some URLs 2
  • Lectures, tutorials etc. on TCP/IP:
  • www.nv.cc.va.us/home/joney/tcp_ip.htm
  • www.cs.pdx.edu/~jrb/tcpip.lectures.html
  • www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
  • www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
  • www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
  • www.jbmelectronics.com/tcp.htm
  • Encyclopaedia: http://www.freesoft.org/CIE/index.htm
  • TCP/IP resources: www.private.org.il/tcpip_rl.html
  • Understanding IP addresses: http://www.3com.com/solutions/en_US/ncs/501302.html
  • Configuring TCP (RFC 1122): ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
  • Assigned protocols, ports etc. (RFC 1010): http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols

  31. Backup Slides

  32. SuperComputing

  33. SC2004: Disk-Disk bbftp
  • bbftp file transfer program uses TCP/IP
  • UKLight: path London-Chicago-London; PCs: Supermicro + 3Ware RAID0
  • MTU 1500 bytes; socket size 22 Mbytes; rtt 177 ms; SACK off
  • Move a 2 Gbyte file; Web100 plots
  • Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
  • Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
  • Disk-TCP-Disk at 1 Gbit/s is here!

  34. SC|05 HEP: Moving Data with bbcp
  [PCI-X traces: RAID controller bus – read from disk for 44 ms every 100 ms; Ethernet NIC bus – write to network for 72 ms]
  • What is the end-host doing with your network protocol? Look at the PCI-X
  • 3Ware 9000 controller RAID0; 1 Gbit Ethernet link; 2.4 GHz dual Xeon; ~660 Mbit/s
  • Power needed in the end hosts
  • Careful application design

  35. 10 Gigabit Ethernet: UDP Throughput
  • 1500 byte MTU gives ~2 Gbit/s
  • Used 16144 byte MTU, max user length 16080
  • DataTAG Supermicro PCs: dual 2.2 GHz Xeon CPU, FSB 400 MHz; PCI-X mmrbc 512 bytes; wire-rate throughput of 2.9 Gbit/s
  • CERN OpenLab HP Itanium PCs: dual 1.0 GHz 64-bit Itanium CPU, FSB 400 MHz; PCI-X mmrbc 4096 bytes; wire rate of 5.7 Gbit/s
  • SLAC Dell PCs: dual 3.0 GHz Xeon CPU, FSB 533 MHz; PCI-X mmrbc 4096 bytes; wire rate of 5.4 Gbit/s

  36. 10 Gigabit Ethernet: Tuning PCI-X
  [PCI-X sequence: CSR access, data transfer, interrupt & CSR update – traced for mmrbc 512, 1024, 2048 and 4096 bytes; 5.7 Gbit/s at mmrbc 4096]
  • 16080 byte packets every 200 µs; Intel PRO/10GbE LR adapter
  • PCI-X bus occupancy vs mmrbc
  • Measured times; times based on PCI-X times from the logic analyser
  • Expected throughput ~7 Gbit/s; measured 5.7 Gbit/s
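For scale, the theoretical ceiling of the bus itself is easy to compute. This assumes the slot runs at the full 133 MHz PCI-X clock (an assumption – the slides elsewhere show some slots at 66 MHz), 64 bits wide:

```python
clock_hz = 133.3e6   # assumed PCI-X clock for this slot
bus_bits = 64        # PCI-X bus width

raw_gbit = clock_hz * bus_bits / 1e9   # ~8.5 Gbit/s theoretical ceiling
print(f"{raw_gbit:.2f} Gbit/s raw PCI-X bandwidth")
```

Per-burst arbitration, CSR accesses and interrupts then eat into that ~8.5 Gbit/s; the larger the mmrbc, the fewer bursts per packet and the closer the bus gets to the ~7 Gbit/s expected (5.7 Gbit/s measured).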

  37. 10 Gigabit Ethernet: TCP Data Transfer on PCI-X
  [PCI-X trace: data transfer and CSR access]
  • Sun V20z, 1.8 GHz to 2.6 GHz dual Opterons, connected via a 6509
  • Xframe II NIC; PCI-X mmrbc 4096 bytes, 66 MHz
  • Two 9000 byte packets back-to-back; average rate 2.87 Gbit/s
  • Burst of packets length 646.8 µs; gap between bursts 343 µs; 2 interrupts per burst

  38. TCP on the 630 Mbit Link: Jodrell – UKLight – JIVE

  39. TCP Throughput on the 630 Mbit UKLight Link
  • Manchester gig7 – JBO mk5 606
  • 4 Mbyte TCP buffer
  • Test 0: dup ACKs seen; other reductions
  • Test 1, test 2

  40. Comparison of Send Time & 1-Way Delay
  [Plot: send time (s) and 1-way delay vs message number; bursts of 26 messages; annotations at message 76 and message 102; 100 ms scale]

  41. 1-Way Delay, 1448 byte msg
  [Plot: 1-way delay vs message number, 50 ms scale]
  • Route: Man-ukl-ams-prod-man; rtt 27 ms
  • 10,000 messages; message size 1448 bytes; wait time 0 µs
  • BDP = 3.4 MByte; TCP buffer 10 MByte
  • Web100 plot: starts after 5.6 s due to clock sync; ~400 pkts/10 ms; rate similar to iperf
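The 3.4 MByte figure is the bandwidth-delay product of the path. Assuming a 1 Gbit/s bottleneck (an assumption – the slide does not state the link rate), it checks out:

```python
rtt_s = 0.027      # measured rtt on the Man-ukl-ams-prod-man route
link_bps = 1e9     # assumed 1 Gbit/s bottleneck

bdp_mbyte = link_bps * rtt_s / 8 / 1e6   # ~3.4 MByte, as quoted
print(f"BDP = {bdp_mbyte:.2f} MByte")
```

The 10 MByte TCP buffer comfortably exceeds the BDP, so here the window – not the buffer – limits the flow.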

  42. Related Work: RAID, ATLAS Grid
  • RAID0 and RAID5 tests – 4th-year MPhys project last semester
  • Throughput and CPU load
  • Different RAID parameters: number of disks, stripe size, user read/write size
  • Different file systems: ext2, ext3, XFS
  • Sequential file write, read
  • Sequential file write, read with continuous background read or write
  • Status: need to check some results & document; independent RAID controller tests planned

  43. HEP: Service Challenge 4
  • Objective: demo 1 Gbit/s aggregate bandwidth between RAL and 4 Tier 2 sites
  • RAL has SuperJANET4 and UKLight links; RAL capped firewall traffic at 800 Mbit/s
  • SuperJANET sites: Glasgow, Manchester, Oxford, QMUL
  • UKLight site: Lancaster
  • Many concurrent transfers from RAL to each of the Tier 2 sites: ~700 Mbit UKLight, peak 680 Mbit SJ4
  • Applications able to sustain high rates
  • SuperJANET5, UKLight & new access links very timely

  44. Network Switch Limits Behaviour
  • End-to-end UDP packets from udpmon
  • Only 700 Mbit/s throughput
  • Lots of packet loss
  • Packet loss distribution shows throughput limited
