

  1. The ongoing evolution from Packet based networks to Hybrid Networks in Research & Education Networks Olivier Martin, CERN Swiss ICT Task Force (Fribourg)

  2. Presentation Outline • The demise of conventional packet-based networks in the R&E community • The advent of community-managed dark fiber networks • The Grid & its associated Wide Area Networking challenges • « On-Demand Lambda Grids » • Ethernet over SONET & new standards • WAN-PHY, GFP, VCAT/LCAS, G.709, OTN Disclaimer: The views expressed herein are not necessarily those of CERN. Furthermore, although I formally remain a CERN staff member until July 31, 2006, I have not worked for CERN since October 3, as I am on a pre-retirement program. Swiss ICT Task Force

  3. [Chart: system capacity (Mbit/s) vs. year, 1985-2005, comparing Ethernet (Fast Ethernet, GigE, 10-GE, 40-GE), Internet backbone links (T1, T3, 135 Mbit/s, 565 Mbit/s, 1.7 Gbit/s, OC-3c, OC-12c, OC-48c, OC-192c, OC-768c) and optical DWDM capacity (2, 4, 8, 16, 32, 160, up to 1024 wavelengths of 10 Gbit/s). The important threshold is reached where host I/O rates equal the capacity of a single optical wavelength.] Swiss ICT Task Force

  4. Some facts (5 of 12) • Internet is everywhere • Ethernet is everywhere • The advent of next-generation G.709 Optical Transport Networks is very uncertain! • hence one has to learn how to live best with existing network infrastructures, which may well explain all the “hype” about “on-demand” lambda Grids! • For the first time in the history of the Internet, the commercial and the Research & Education Internet appear to follow different routes • Will they ever converge again? • Dark-fiber-based, customer-owned long-distance networks are booming! • users are becoming their own telecom operators • Is it a good or a bad thing? Swiss ICT Task Force

  5. Internet Backbone Speeds [Chart: backbone speeds (Mbps) over time, stepping up from T1 and T3 lines through ATM VCs and OC3c to IP over OC12c.]

  6. High-Speed IP Network Transport Trends [Diagram: the protocol stack thins over time, from IP over ATM over SONET/SDH over optical (the B-ISDN model, with multiplexing, protection, management and signalling at every layer), to IP over SONET/SDH over optical, to IP directly over optical: higher speed, lower cost, complexity and overhead.]

  7. Network Exponentials • Network vs. computer performance • Computer speed doubles every 18 months • Network speed doubles every 9 months • Difference = an order of magnitude every 5 years • 1986 to 2000 • Computers: x 500 • Networks: x 340,000 • 2001 to 2010 • Computers: x 60 • Networks: x 4,000 (the doubling arithmetic is checked in the sketch below) Moore's Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan-2001) by Cleo Vilett, source Vinod Khosla, Kleiner Perkins Caufield & Byers. Intro to Grid Computing and Globus Toolkit™
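
The growth factors above follow directly from compound doubling; a minimal sketch (the slide's figures are rounded, so the computed values differ slightly):

```python
# Growth factor after a number of years, given a doubling period in months.
def growth(years: float, doubling_months: float) -> float:
    return 2 ** (years * 12 / doubling_months)

# 1986-2000: computers double every ~18 months, networks every ~9 months.
print(f"computers, 1986-2000: x{growth(14, 18):,.0f}")  # ~x645 (slide: x500)
print(f"networks,  1986-2000: x{growth(14, 9):,.0f}")   # ~x420,000 (slide: x340,000)

# Over any 5-year span the gap widens by 2**(60/9 - 60/18) ~= x10,
# i.e. the "order of magnitude every 5 years" quoted on the slide.
print(f"gap per 5 years: x{growth(5, 9) / growth(5, 18):.0f}")
```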

  8. Know the user (3 of 12) [Chart: number of users vs. bandwidth requirements F(t), from ADSL to GigE LAN, for three user classes.] • A -> Lightweight users: browsing, mailing, home use • B -> Business applications: multicast, streaming, VPNs, mostly LAN • C -> Special scientific applications: computing, data grids, virtual presence

  9. What the user needs (4 of 12) [Chart: total bandwidth vs. bandwidth requirements, from ADSL to GigE LAN, for the same three classes.] • A -> Needs full Internet routing, one to many • B -> Needs VPN services and/or full Internet routing, several to several • C -> Needs very fat pipes, a limited number of Virtual Organizations, few to few

  10. So what are the facts (5 of 12) • The cost of fat pipes (fibers) is about one third of the cost of the equipment needed to light them up • at least, that is what lambda salesmen told Cees de Laat (University of Amsterdam & SURFnet) • The cost of optical equipment is about 10% of switching, which in turn is about 10% of full routing equipment, for the same throughput • A 100-byte packet at 10 Gb/s leaves 80 ns for the lookup in a 100-MByte routing table (roughly the light travel time from me to you on the back row!); see the sketch below • Big sciences need fat pipes • Bottom line: create a hybrid architecture which serves all users in one coherent and cost-effective way Swiss ICT Task Force
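
The 80 ns figure is simply the serialization time of the packet, i.e. the per-packet budget a line-rate router has for its routing-table lookup; a quick back-of-the-envelope check:

```python
# Per-packet lookup budget at line rate = packet size / link speed.
# Since 1 bit per Gbit/s equals 1 ns, the result is directly in nanoseconds.
def lookup_budget_ns(packet_bytes: int, link_gbps: float) -> float:
    return packet_bytes * 8 / link_gbps

print(lookup_budget_ns(100, 10))   # 80.0 ns for a 100-byte packet at 10 Gb/s
print(lookup_budget_ns(1500, 10))  # 1200.0 ns for a full 1500-byte frame
```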

  11. Utilization trends [Chart: traffic in Gbps over time, approaching the network capacity limit as of Jan 2005.]

  12. Today's hierarchical IP network [Diagram: universities connect to regional networks, which connect to their NREN (A, B, C, D); the NRENs connect to a national or pan-national IP network, which peers with other national networks.]

  13. Tomorrow's peer-to-peer IP network [Diagram: a national DWDM network interconnects NRENs A-D and the wider world; child lightpaths run from the DWDM network directly to universities, regional networks and servers.]

  14. Creation of application VPNs [Diagram: discipline-specific networks (a High Energy Physics network reaching CERN, a bio-informatics network, an eVLBI network) connect university departments directly, bypassing the campus firewall, alongside the commodity Internet and the research network.]

  15. Production vs Research Campus Networks • Increasingly, campuses are deploying parallel networks for high-end users • Reduces costs by providing high-end network capability only to those who need it • Limitations of the campus firewall and border router are eliminated • Many issues remain in regard to security, back-door routing, etc. • Campus networks may follow the same evolution as campus computing • Discipline-specific networks are being extended into the campus

  16. NLR Condominium lambda network [Diagram: the original CAVEwave on the National LambdaRail footprint.] CAVEwave acquires a separate wavelength between Seattle and Chicago and wants to manage it as part of its own network, including add/drop, routing, partitioning, etc. UCLP is intended for projects like National LambdaRail.

  17. GEANT2 POP Design [Diagram]

  18. LHC Data Grid Hierarchy [Diagram] • The online system at the CERN experiment generates ~PByte/sec and feeds Tier 0+1 at CERN (700k SI95, ~1 PB disk, tape robot, HPSS) at ~100-400 MBytes/sec • Tier 1 centers (FNAL: 200k SI95, 600 TB; IN2P3; INFN; RAL; all with HPSS) are linked to CERN at 10 Gbps • Tier 2 centers are linked at 2.5/10 Gbps • Tier 3 institutes (~0.25 TIPS, holding physics data caches) are linked at ~2.5 Gbps • Tier 4 workstations connect at 0.1-1 Gbps • CERN/outside resource ratio ~1:2; Tier0 : (sum of Tier1) : (sum of Tier2) ~ 1:1:1 • Physicists work on analysis “channels”; each institute has ~10 physicists working on one or more channels Swiss ICT Task Force

  19. Main Networking Challenges • Fulfill the as-yet-unproven assertion that the network can be « nearly » transparent to the Grid • Deploy suitable Wide Area Network infrastructure (50-100 Gb/s) • Deploy suitable Local Area Network infrastructure (matching or exceeding that of the WAN) • Seamless interconnection of LAN & WAN infrastructures • firewall? • End-to-end issues (transport protocols, PCs (Itanium, Xeon), 10GigE NICs (Intel, S2io)); where are we today: • memory to memory: 7.5 Gb/s (PCI bus limit) • memory to disk: 1.2 GByte/s (Windows 2003 Server/Newisys) • disk to disk: 400 MByte/s (Linux), 600 MByte/s (Windows) Swiss ICT Task Force

  20. Main TCP issues • Does not scale to some environments • High speed, high latency • Noisy links • Unfair behaviour with respect to: • Round Trip Time (RTT) • Frame size (MSS) • Access bandwidth • Widespread use of multiple streams to compensate for inherent TCP/IP limitations (e.g. GridFTP, bbFTP); see the sketch below: • A bandage rather than a cure • New TCP/IP proposals aim to restore performance in single-stream environments • Not clear if/when they will have a real impact • In the meantime there is an absolute requirement for backbones with: • Zero packet losses, • And no packet re-ordering • Which reinforces the case for “lambda Grids” Swiss ICT Task Force
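
Why multiple streams are a bandage that nevertheless works: under the classic Mathis et al. throughput bound, rate ≈ MSS/(RTT·√p)·√(3/2), and N parallel streams scale the achievable rate roughly N-fold. A sketch with hypothetical transatlantic path parameters (the constant varies with delayed ACKs and the exact loss model):

```python
import math

# Mathis et al. (1997) steady-state TCP throughput bound, in bits per second.
def tcp_rate_bps(mss_bytes: int, rtt_s: float, loss_rate: float,
                 streams: int = 1) -> float:
    return streams * (mss_bytes * 8 / rtt_s) * math.sqrt(1.5 / loss_rate)

# Hypothetical path: 1500-byte MSS, 100 ms RTT, 0.01% packet loss.
print(f"1 stream : {tcp_rate_bps(1500, 0.1, 1e-4) / 1e6:5.0f} Mb/s")     # ~15 Mb/s
print(f"8 streams: {tcp_rate_bps(1500, 0.1, 1e-4, 8) / 1e6:5.0f} Mb/s")  # ~118 Mb/s
```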

  21. TCP dynamics (10 Gbps, 100 ms RTT, 1500-byte packets) • Window size (W) = Bandwidth * Round Trip Time • Wbits = 10 Gbps * 100 ms = 1 Gb • Wpackets = 1 Gb / (8*1500) = 83,333 packets • Standard Additive Increase Multiplicative Decrease (AIMD) mechanisms: • W = W/2 (halving the congestion window on a loss event) • W = W + 1 (increasing the congestion window by one packet every RTT) • Time to recover from W/2 to W (congestion avoidance) at 1 packet per RTT: • RTT * Wpackets/2 = 1.157 hours • In practice, 1 packet per 2 RTT because of delayed acks, i.e. 2.31 hours • Packets per second: • Wpackets / RTT = 833,333 packets/s (see the sketch below) Swiss ICT Task Force
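
The slide's arithmetic, reproduced as a short sketch:

```python
# 10 Gb/s link, 100 ms RTT, 1500-byte packets, as on the slide.
bw_bps, rtt_s, pkt_bytes = 10e9, 0.1, 1500

w_bits = bw_bps * rtt_s               # window needed to fill the pipe: 1 Gbit
w_packets = w_bits / (8 * pkt_bytes)  # ~83,333 packets in flight
pkts_per_s = w_packets / rtt_s        # ~833,333 packets per second

# AIMD congestion avoidance grows the window by 1 packet per RTT,
# so recovering from W/2 back to W takes W/2 round trips.
recovery_s = (w_packets / 2) * rtt_s
print(f"window  : {w_packets:,.0f} packets")
print(f"rate    : {pkts_per_s:,.0f} packets/s")
print(f"recovery: {recovery_s / 3600:.2f} h (x2 with delayed ACKs: 2.31 h)")
```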

  22. Internet2 land speed record history (IPv4 & IPv6) period 2000-2004 Swiss ICT Task Force

  23. Layer1/2/3 networking (1) • Conventional layer 3 technology is no longer fashionable because of: • High associated costs, e.g. 200-300 kUSD for a 10G router interface • The implied use of shared backbones • The use of layer 1 or layer 2 technology is very attractive because it helps to solve a number of problems, e.g. • the 1500-byte Ethernet frame size limit (layer 1) • Protocol transparency (layers 1 & 2) • Minimal functionality, hence, in theory, much lower costs (layers 1 & 2) Swiss ICT Task Force

  24. Layer1/2/3 networking (2) • « On-demand Lambda Grids » are becoming very popular: • Pros: • circuit-oriented model like the telephone network, hence no need for complex transport protocols • Lower equipment costs (i.e. « in theory » a factor 2 or 3 per layer) • the concept of a dedicated end-to-end lightpath is very elegant • Cons: • « End to end » is still very loosely defined, i.e. site to site, cluster to cluster or really host to host? • Higher circuit costs, scalability, additional middleware to deal with circuit set-up/tear-down, etc. • Extending dynamic VLAN functionality is a potential nightmare! Swiss ICT Task Force

  25. « Lambda Grids » What does it mean? • Clearly different things to different people, hence the apparently easy consensus! • Conservatively, on-demand « site to site » connectivity • Where is the innovation? • What does it solve in terms of transport protocols? • Where are the savings? • Fewer interfaces needed (customer) but more standby/idle circuits needed (provider) • Economics from the service provider vs the customer perspective? • Traditionally, switched services have been very expensive • Usage vs flat charge • Break-even, switched vs leased, at a few hours/day • Why would this change? • In case there are no savings, why bother? • More advanced: cluster to cluster • Implies even more active circuits in parallel • Is it realistic? • Even more advanced: host to host or even « per flow » • All optical • Is it really realistic? Swiss ICT Task Force

  26. Some Challenges • Realistic bandwidth estimates, given the chaotic nature of the requirements • End-to-end performance, given the whole chain involved • (disk-bus-memory-bus-network-bus-memory-bus-disk) • Provisioning over complex network infrastructures (GEANT, NRENs, etc.) • Cost model for the options (packet + SLAs, circuit switched, etc.) • Consistent performance (dealing with firewalls) • Merging leading-edge research with production networking Swiss ICT Task Force

  27. Tentative conclusions • There is a very clear trend towards community-managed dark fiber networks • As a consequence, National Research & Education Networks are evolving into telecom operators; is this right? • In the short term, almost certainly YES • In the longer term, probably NO • In many countries, there is NO other way to get affordable access to multi-Gbit/s networks, therefore this is clearly the right move • The Grid & its associated Wide Area Networking challenges • « on-demand Lambda Grids » are, in my opinion, extremely doubtful! • Ethernet over SONET & new standards will revolutionize the Internet • WAN-PHY (IEEE) has, in my opinion, NO future! • However, GFP, VCAT/LCAS, G.709, OTN are very likely to have a very bright future. Swiss ICT Task Force

  28. Single TCP stream performance under periodic losses • TCP throughput is much more sensitive to packet loss in WANs than in LANs • TCP's congestion control algorithm (AIMD) is not well suited to gigabit networks • The effect of packet loss can be disastrous • TCP is inefficient in high bandwidth*delay networks • The future performance outlook for computational grids looks bad if we continue to rely solely on the widely deployed TCP Reno • With 1 Gbps of available bandwidth and a loss rate of 0.01%: LAN BW utilization = 99%, WAN BW utilization = 1.2% (see the sketch below)
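
The LAN/WAN asymmetry follows from the same loss-bounded throughput model used earlier: at a fixed loss rate the achievable rate falls off as 1/RTT, so a loss rate that is harmless on a 1 ms LAN cripples a 100 ms WAN. A sketch (again the Mathis et al. bound; the slide quotes 1.2% for the WAN, the simple model gives the same order of magnitude):

```python
import math

# Fraction of a link a single TCP Reno flow can fill at loss rate p,
# using the Mathis et al. bound capped at the link capacity.
def utilization(rtt_s: float, p: float = 1e-4, mss: int = 1500,
                link_bps: float = 1e9) -> float:
    rate = (mss * 8 / rtt_s) * math.sqrt(1.5 / p)
    return min(rate, link_bps) / link_bps

print(f"LAN (1 ms RTT)  : {utilization(0.001):.0%}")  # loss-limited rate >> 1 Gb/s
print(f"WAN (100 ms RTT): {utilization(0.1):.1%}")    # ~1.5%
```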

  29. Responsiveness • Time to recover from a single packet loss: r = C * RTT^2 / (2 * MSS), where C is the capacity of the link • A large MTU accelerates the growth of the window • The time to recover from a packet loss decreases with a larger MTU • A larger MTU also reduces the per-frame overhead (saves CPU cycles, reduces the number of packets) • (worked numbers in the sketch below)
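
Plugging numbers into the responsiveness formula shows why the large-MTU argument matters; a minimal sketch for a 10 Gb/s, 100 ms path:

```python
# Recovery time after a single loss: r = C * RTT^2 / (2 * MSS).
# The window halves to W/2 and regrows at 1 MSS per RTT, i.e. W/2 round trips,
# with W = C * RTT. MSS is converted from bytes to bits in the denominator.
def recovery_s(c_bps: float, rtt_s: float, mss_bytes: int) -> float:
    return c_bps * rtt_s**2 / (2 * 8 * mss_bytes)

print(f"MSS 1500: {recovery_s(10e9, 0.1, 1500) / 60:.0f} min")  # ~69 min
print(f"MSS 9000: {recovery_s(10e9, 0.1, 9000) / 60:.0f} min")  # ~12 min, 6x faster
```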

  30. Single TCP stream between Caltech and CERN • Available (PCI-X) bandwidth = 8.5 Gbps • RTT = 250 ms (16,000 km) • 9000-byte MTU • 15 min to increase throughput from 3 to 6 Gbps • Sending station: • Tyan S2882 motherboard, 2x Opteron 2.4 GHz, 2 GB DDR • Receiving station: • CERN OpenLab: HP rx4640, 4x 1.5 GHz Itanium-2, zx1 chipset, 8 GB memory • Network adapter: • S2io 10 GbE [Chart: throughput over time at 100% CPU load, showing the effect of a single packet loss and of a burst of packet losses.]

  31. High Throughput Disk to Disk Transfers: From 0.1 to 1 GByte/sec • Server hardware (rather than network) bottlenecks: • Write/read and transmit tasks share the same limited resources: CPU, PCI-X bus, memory, IO chipset • PCI-X bus bandwidth: 8.5 Gbps (133 MHz x 64 bit) • Link aggregation (802.3ad): logical interface with two physical interfaces on two independent PCI-X buses • LAN test: 11.1 Gbps (memory to memory) Performance in this range (from 100 MByte/sec up to 1 GByte/sec) is required to build a responsive Grid-based Processing and Analysis System for LHC

  32. Transferring a TB from Caltech to CERN in 64-bit MS Windows • Latest disk-to-disk over a 10 Gbps WAN: 4.3 Gbits/sec (536 MB/sec); 8 TCP streams from CERN to Caltech; 1 TB file • 3 Supermicro Marvell SATA disk controllers + 24 x 7200 rpm SATA disks • Local disk IO: 9.6 Gbits/sec (1.2 GBytes/sec read/write, with <20% CPU utilization) • S2io SR 10GE NIC • 10 GE NIC: 7.5 Gbits/sec (memory to memory, with 52% CPU utilization) • 2x 10 GE NIC (802.3ad link aggregation): 11.1 Gbits/sec (memory to memory) • The memory-to-memory WAN data flow and the local memory-to-disk read/write flow are not matched when combining the two operations • Quad Opteron AMD848 2.2 GHz processors with 3 AMD-8131 chipsets: 4 64-bit/133 MHz PCI-X slots • Interrupt Affinity Filter: allows a user to change the CPU affinity of the interrupts in a system • Packet loss is overcome with re-connect logic • Proposed Internet2 Terabyte File Transfer Benchmark
