
Lambda Station


Presentation Transcript


  1. Lambda Station BNL D. Petravick, Fermilab October 25, 2004

  2. Lambda Station • Fermilab (Petravick) • Caltech (Newman) • Funded by DOE Office of Science Network Research program. • 2 or 3 year investigation of issues bridging local production LANs to advanced networks. Don Petravick -- Fermilab

  3. Flow Distribution on ESNet Don Petravick -- Fermilab

  4. CMS Service Challenge • Even the initial LHC service challenge would dominate ESNet. • R&E networks outside the production network framework. • Advanced concepts • Fewer 9’s • Much bandwidth Don Petravick -- Fermilab

  5. What’s Potential Performance? Don Petravick -- Fermilab

  6. R&E networks in the USA • National Lambda Rail • DOE UltraScienceNet • UltraLight • LHCNet • HOPI • FNAL <-> Starlight (humbly) Don Petravick -- Fermilab

  7. Characteristics • DOE UltraScienceNet • Scheduled availability of 1 and 10 Gbit light paths at its POPs • UltraLight • More lambdas • Optical switching (Glimmerglass switch controlled by MonALISA) Don Petravick -- Fermilab

  8. What’s all this about? Cost: • National Fiber Infrastructure for R&E • Between big POPs only • Lightpath based • Low cost, low-level transport • Belief that general packet routing logic at high packet rates (and perhaps with large variation in destinations) makes networks prohibitively costly. • Constrained to circuits • Separate work to get out of the POPs, and to the data. • Higher-layer agnostic • General transport (e.g. IP, Fibre Channel, …) Don Petravick -- Fermilab

  9. Quality • Immense efforts on network weather and network quality for shared networks • Highest performance is achieved by knowledgeable, careful administration. • Over the WAN? Consistent, repeated occurrences of such care. • The inside of an optical path should be • Congestion-free • Lossless, except for bit errors • And measurably so in a straightforward way. • More lambdas Don Petravick -- Fermilab

  10. Why? • Why do we seem to have created an industry? • Doesn’t this just work with IP? • Why are people tinkering with what seems to be a successful model? • Naïve views from network-aware HEP storage-system fellows (HEPiX talks) Don Petravick -- Fermilab

  11. Wide Area Characteristics • Most prominent characteristic, compared to a LAN, is the very large bandwidth*delay product. • Underlying structure – it’s a packet world! • Possible to use pipes between specific sites • These circuits can be both static and dynamic • Both IP and non-IP (for example, Fibre Channel over SONET) • FNAL has proposed investigations and has just begun studies with its storage systems to optimize WAN file transfers using pipes. Don Petravick -- Fermilab

  12. Bandwidth*Delay • At least bandwidth*delay bytes must be kept in flight on the network to maintain bandwidth. • This fact is independent of protocol. • Current practice uses more than this lower limit. For example, US CMS used ~2x for their DC04. • CERN <-> FNAL has a measured ~60 ms delay • Using the 2x factor, 120 ms delay gives • 30 MB/sec → ~3-4 MB “in flight” • 1000 MB/sec → ~120 MB “in flight” Don Petravick -- Fermilab
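
A quick back-of-the-envelope check of the slide's numbers (a minimal sketch in Python; only the ~60 ms delay and the ~2x DC04 factor come from the slide, the rest is arithmetic):

```python
# Rough bandwidth*delay calculation matching the slide's figures.
# Assumes the ~60 ms CERN<->FNAL delay and the ~2x factor used in DC04.

def in_flight_bytes(rate_mb_per_s, delay_ms, factor=2.0):
    """Bytes that must be kept 'in flight' to sustain the given rate."""
    return rate_mb_per_s * 1e6 * (delay_ms / 1000.0) * factor

for rate in (30, 1000):
    mb = in_flight_bytes(rate, 60) / 1e6
    print(f"{rate:5d} MB/s -> ~{mb:.1f} MB in flight")
# 30 MB/s   -> ~3.6 MB in flight   (slide: ~3-4 MB)
# 1000 MB/s -> ~120.0 MB in flight (slide: ~120 MB)
```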

  13. Bandwidth*Delay and IP • Given a single lost packet and a standard MTU size of 1500 bytes, the host will receive many out-of-order packets before receiving the retransmitted missing packet. • Must incur at least 2 “delays worth” • FNAL <-> CERN (2*60 ms delay) • 30 MB/sec: more than 2400 packets • 1000 MB/sec: more than 80000 packets Don Petravick -- Fermilab
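
The packet counts above follow from the same product divided by the 1500-byte MTU; a small sketch, using the 2*60 ms figure stated on the slide:

```python
# Packets received out of order while waiting ~2 delays for a retransmit,
# assuming a 1500-byte MTU and the 2*60 ms CERN<->FNAL figure from the slide.
MTU = 1500  # bytes

def packets_before_retransmit(rate_mb_per_s, round_trip_ms):
    bytes_in_flight = rate_mb_per_s * 1e6 * (round_trip_ms / 1000.0)
    return bytes_in_flight / MTU

print(packets_before_retransmit(30, 120))    # ~2400 packets
print(packets_before_retransmit(1000, 120))  # ~80000 packets
```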

  14. Knee-Cliff-Collapse Model • When load on a segment approaches a threshold, a modest increase in throughput is accompanied by a great increase in delay. • Pushing for even more throughput results in congestion collapse. • Cannot load a network arbitrarily. • TCP tries to avoid collapse, but its solution has problems at large bandwidth*delay Don Petravick -- Fermilab

  15. Bandwidth and Delay and TCP • Stream model of TCP implies packet buffering is in the kernel - this leads to kernel efficiency issues. • Vanilla TCP behaves as if all packet loss is caused by congestion. • TCP’s solution is to back off throughput to avoid congestion collapse in AIMD fashion: • Lost packet? Cut packets in flight by ½ • Success? Open the window next time by one more packet • This leads to a very large recovery time at high bandwidth*delay: • Rho – recovery time is proportional to RTT*RTT/MTU Don Petravick -- Fermilab
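
A toy illustration of that AIMD scaling (not any particular TCP implementation): after one loss the window is halved, then regrows by one MTU-sized segment per round trip, so recovery takes roughly (bandwidth*RTT/2)/MTU round trips, i.e. the RTT*RTT/MTU dependence on the slide:

```python
# Toy AIMD recovery-time estimate: after one loss the window is halved and
# grows back by one segment per RTT, so recovery time scales as RTT^2/MTU.
MTU = 1500  # bytes per segment

def aimd_recovery_time(rate_mb_per_s, rtt_ms):
    window = rate_mb_per_s * 1e6 * (rtt_ms / 1000.0)   # bytes needed in flight
    segments_to_regrow = (window / 2) / MTU            # half the window was lost
    return segments_to_regrow * (rtt_ms / 1000.0)      # one segment per RTT

print(f"~{aimd_recovery_time(1000, 120):.0f} s to recover at 1000 MB/s, 120 ms RTT")
```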

  16. Experience from the test stands. Resolved as local switch issue Don Petravick -- Fermilab

  17. Strategies • Smaller, lower-bandwidth TCP streams in parallel • Examples of these are GridFTP and bbFTP • Tweak the AIMD algorithm • Logic is in the sender’s kernel stack only (congestion window) • FAST, and others – USCMS used an FNAL kernel mod in DC04 • May not be “fair” to others using shared network resources • Break the stream model, use UDP and ‘cleverness’, especially for file transfers. But: • You have to be careful and avoid congestion collapse. • You need to be fair to other traffic, and be very certain of it • Isolate the strategy by confining the transfer to a “pipe” Don Petravick -- Fermilab

  18. Series of TCP investigations Don Petravick -- Fermilab

  19. Pipes and File Transfer Primitives • Tell the network the bandwidth of your stream using RSVP, the Resource Reservation Protocol • Network will forward the packets/sec you reserved and drop the rest (QoS) • Network will not oversubscribe the total bandwidth. • Network leaves some bandwidth out of the QoS for others. • Unused bandwidth is not available to others at high QoS. Don Petravick -- Fermilab
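
To illustrate the "forward what you reserved, drop the rest" behavior, here is a minimal token-bucket-style policer sketch; the class and its parameters are hypothetical and not an RSVP implementation:

```python
# Minimal token-bucket-style policer: packets within the reserved rate are
# forwarded, packets beyond it are dropped. Purely illustrative.
class ReservedPipe:
    def __init__(self, reserved_pps, burst):
        self.rate = reserved_pps    # reserved packets per second
        self.burst = burst          # maximum packet credit
        self.tokens = burst
        self.last = 0.0

    def admit(self, now):
        """Return True if a packet arriving at time `now` is forwarded."""
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False    # over the reservation: dropped
```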

  20. Storage Element [Diagram: grid-side file servers (FileSrv) perform File Stage In / File Stage Out across the WAN, while worker nodes on the LAN side access storage with POSIX-style I/O] Don Petravick -- Fermilab

  21. Storage System and Bandwidth • A Storage Element does not know the bandwidth of an individual stream very well at all • For example, a disk may have many simultaneous accessors, or the file may be in memory cache and transferred immediately • Bandwidth depends on the fileserver disk and your disk. • Requested bandwidth too small? • If QoS tosses a packet, AIMD will drastically affect the transfer rate • Requested bandwidth too high? • Bandwidth at that QoS level is wasted, and the overall experimental rate suffers • A Storage Element may know the aggregate bandwidth better than individual stream bandwidth. • The Storage Element therefore needs to aggregate flows onto a pipe between sites, not deal with QoS on a single flow. • This means the local network will be involved in aggregation. Don Petravick -- Fermilab
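
A hypothetical illustration of that last point: the aggregate site-to-site rate can be estimated from recent transfer history far more reliably than any single stream can be predicted, and a pipe reservation sized from it (the function and history format are assumptions, not an existing interface):

```python
# Hypothetical sketch: estimate the aggregate rate from recent overlapping
# transfers, rather than trying to predict any single stream's bandwidth.
def aggregate_rate_mb_per_s(recent_transfers):
    """recent_transfers: list of (bytes_moved, seconds_elapsed) for streams
    that ran concurrently over roughly the same window (a simplification)."""
    total_bytes = sum(b for b, _ in recent_transfers)
    window = max(s for _, s in recent_transfers)
    return total_bytes / 1e6 / window

history = [(10e9, 300), (12e9, 300), (8e9, 300)]   # three concurrent streams
print(f"aggregate ~{aggregate_rate_mb_per_s(history):.0f} MB/s")  # ~100 MB/s
```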

  22. Lambda Station investigations Investigate support of static and dynamic pipes by storage systems in WAN transfers. • Fiber to Starlight optical exchange at Northwestern University. • Local improvements to forward traffic flows onto the pipe from our LAN • Local improvements to admit traffic flows onto our LAN from the pipe • Need changes to Storage System to exploit the WAN changes. Don Petravick -- Fermilab

  23. Why last hop LAN? • Very, very large commodity infrastructures have been built on LANs and used in HEP. • Specialized SANs are not used generally in HEP • It must at least be the starting point for mingling advanced networks and large HENP data systems. Don Petravick -- Fermilab

  24. Fiber to Starlight • FNAL’s fiber pair has the potential for 33 channels between FNAL and Starlight (3 to be activated soon) • Starlight provides FNAL’s access to Research and Education Networks: • ESnet • DOE Science Ultranet • Abilene • LHCnet (DOE-funded link to CERN) • SurfNet • UKLight • CA*Net • National Lambda Rail Don Petravick -- Fermilab

  25. LAN – Pipe investigation • Starlight path bypasses FNAL border router • Aggregation of many flows to fill a (dynamic) pipe. • We believe that pipes will be ‘owned’ by a VO. • Forwarding to the pipe is done on a per flow basis • Starlight path ties directly to production LAN and production Storage Element (no dual NICs). Don Petravick -- Fermilab

  26. Forwarding Server [Diagram: file server, router and core network, and forwarding server, with paths toward ESNet and Starlight] Don Petravick -- Fermilab

  27. Flow-by-flow Strategy • The storage element identifies flows to the forwarding server using layer-5 information • Host IP, Dest IP, Host Port, Dest Port and Transfer Protocol • And VO information • The forwarding server informs the peer site to allow admission • The forwarding server configures the local router to forward the flow over the DWDM link, or the flow takes the default route • A 1 GB/s pipe is about 30 flows at 30 MB/s. • If flows are 1 GB files, this yields about 1 flow change/sec • The forwarding server allows flows to take the alternate path when the dynamic path is torn down. • Firewalls may have issues with this. • Incoming flows are analogous • The flow-by-flow solution seems to suit the problem well, but there are plenty of implementation issues. Don Petravick -- Fermilab
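
A sketch of the per-flow bookkeeping the slide describes; the record fields mirror the bullet list, but the names and the forwarding function are hypothetical, not Lambda Station code:

```python
# Hypothetical flow record and forwarding decision for the flow-by-flow
# strategy: the storage element identifies a flow, and the forwarding server
# either steers it onto the DWDM pipe or leaves it on the default route.
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str      # transfer protocol, e.g. "gridftp"
    vo: str            # virtual organization that 'owns' the pipe

def route_for(flow, pipe_up, peer_admits):
    """Use the DWDM pipe when it exists and the peer site admits the flow;
    otherwise fall back to the default routed path."""
    if pipe_up and peer_admits(flow):
        return "dwdm-pipe"
    return "default-route"
```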

  28. Changes to the Storage Element to exploit dynamic pipes • Build semantics into bulk copy interfaces that allow for batching transfers to use bandwidth when available. • Based on bandwidth availability, dynamically change the number of files transferred in parallel • Based on bandwidth availability, change the layer-5 (FTP) protocols used • Switch from FTP to a UDP blaster (SABUL), for example. • Or change the parameters used to tune layer-5 protocols, for example parallelism within FTP. • Deal with flows which have not completed when the dynamic pipe is de-allocated. Don Petravick -- Fermilab
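
As a hedged sketch of the "change the number of files transferred in parallel" idea (the 30 MB/s per-stream rate comes from slide 27; the cap and function are illustrative assumptions, not values from the talk):

```python
# Hypothetical sizing of parallel file transfers against the bandwidth a
# dynamic pipe currently offers.
def parallel_streams(pipe_mb_per_s, per_stream_mb_per_s=30, max_streams=64):
    """Number of files to transfer in parallel to roughly fill the pipe."""
    if pipe_mb_per_s <= 0:
        return 1                      # pipe torn down: fall back to one stream
    n = int(pipe_mb_per_s // per_stream_mb_per_s) or 1
    return min(n, max_streams)

print(parallel_streams(900))   # ~30 streams at 30 MB/s, as on slide 27
```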

  29. Summary (HEPiX talk) • There are conventional and research approaches to wide area networks. • The interactions in the wide area are interesting and important to grid-based data systems • FNAL now has the facilities in place to investigate a number of these issues. • Storage Elements are important parts of the investigation and require changes to achieve high throughput and reliable transfers over the WAN Don Petravick -- Fermilab

  30. Summary (intro talk) • The vision is that large scale science is enabled by having systems which move data in a state-of-the-art manner. • A problem is that software time constants are many years • The tactic is to create demand and mutual understanding via interoperation of advanced networks and HEP data systems. Don Petravick -- Fermilab
