
ATLAS Network Requirements – an Open Discussion


Presentation Transcript


  1. ATLAS Network Requirements – an Open Discussion ATLAS Distributed Computing Technical Interchange Meeting University of Tokyo

  2. ATLAS Networking Needs • Is this something to worry about? • Maybe not: millions of Netflix users generate much more traffic than HEP users do • If it works for them it must work for us too! • Maybe yes, because Netflix users don’t compare well with us • Commercial Internet providers optimize their infrastructure for mainstream clients and not for the specific needs of the HEP community, e.g. • Traffic patterns characterized by small flows (tiny “transactions”) • A few lost packets at 10 Gbps can cause an ~80-fold throughput drop (illustrated below) • Connectivity issues between NRENs and “Commercials” • Availability/Reliability issues
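
To see why a handful of lost packets hurts so much on a long path, here is a rough back-of-the-envelope sketch (not from the slides) using the Mathis et al. approximation for single-stream TCP throughput; the MSS, RTT and loss-rate values below are illustrative assumptions, not measurements.

# Rough illustration of why a small packet-loss rate devastates single-stream
# TCP throughput on a long-RTT 10 Gbps path.
# Mathis et al. approximation: rate ~ (MSS / RTT) * C / sqrt(p).
import math

MSS_BYTES = 1460          # typical Ethernet MSS (assumed)
C = math.sqrt(3.0 / 2.0)  # model constant for periodic loss

def mathis_throughput_gbps(rtt_s: float, loss_rate: float) -> float:
    """Approximate steady-state TCP Reno throughput in Gbit/s."""
    bits_per_s = (MSS_BYTES * 8 * C) / (rtt_s * math.sqrt(loss_rate))
    return bits_per_s / 1e9

# Transatlantic path: ~150 ms RTT (assumed)
for loss in (1e-9, 1e-8, 1e-7, 1e-6):
    print(f"loss {loss:.0e}: ~{mathis_throughput_gbps(0.150, loss):.2f} Gbit/s")

With these assumed numbers, going from a loss rate of 1e-9 to 1e-7 already drops the achievable single-stream rate from a few Gbit/s to a few hundred Mbit/s, which is the kind of collapse commercial providers have little incentive to engineer against.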

  3. Bandwidth and Throughput • We often confuse Bandwidth with Throughput • Bandwidth is what providers have in their infrastructure • Throughput is what we observe with our applications • The two are (very) unlikely to be the same • It depends on how applications use the network • Shared infrastructure • Throughput is very dependent on server configuration
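
As a concrete illustration of the bandwidth/throughput gap, the sketch below (my own numbers, not from the slides) shows how an end host's TCP socket buffer caps achievable throughput on a long-RTT path regardless of the provisioned bandwidth: throughput <= window / RTT, and filling a 10 Gbps transatlantic path needs a buffer of roughly the bandwidth-delay product.

# Back-of-the-envelope: window-limited throughput vs. socket buffer size.
def window_limited_throughput_gbps(window_bytes: int, rtt_s: float) -> float:
    return window_bytes * 8 / rtt_s / 1e9

def bdp_bytes(bandwidth_gbps: float, rtt_s: float) -> int:
    """Bandwidth-delay product: the buffer needed to fill the pipe."""
    return int(bandwidth_gbps * 1e9 / 8 * rtt_s)

rtt = 0.150  # assumed transatlantic RTT in seconds
print(f"BDP to fill a 10 Gbps path: {bdp_bytes(10, rtt)/2**20:.0f} MiB")
for win_mib in (4, 16, 64, 256):
    gbps = window_limited_throughput_gbps(win_mib * 2**20, rtt)
    print(f"{win_mib:>4} MiB socket buffer -> at most {gbps:5.2f} Gbit/s")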

  4. Thoughts on LHCOPN (1/3) • The LHCOPN is the current private, physical circuit infrastructure that serves data transfers from the Tier 0 to the Tier 1s and between the Tier 1s, and it is an example of where circuits – physical or virtual – are needed. The reasons for this are the original requirements for: • guaranteed delivery of data over a long period of time, since the source at CERN is essentially a continuous, real-time data source; • long-term use of substantial fractions of the link capacity; • a well-understood and deterministic backup / fail-over mechanism; • guaranteed capacity that does not impact other uses of the network; • a clear cost model with long-term capacity associated with specific sites (the Tier 1s); • a mechanism that is easily integrated into a production operations environment, one that monitors circuit health, has established trouble-shooting and resolution responsibility, and provides for problem tracking and reporting.

  5. Thoughts on LHCOPN (2/3) However, things (may) have changed a bit in the meantime • Is service availability still the main driver? The original specification was 99.95% on average over a full year, which is hard to achieve for transatlantic (TA) links • Instead, quality requirements on uptime could be expressed in terms of capabilities. This could be formulated e.g. like: • "minimum n PB per day for at least 4 days in any week, with no more than 1 week deviation from this per quarter, and never more than 4 consecutive days of no connectivity, and never less than x PB per 2-week interval" or something like that (I'm making this up, don't take the values literally, but I think they are indicative; see the sketch below). • After all, we are perfectly capable of a) re-routing traffic to other networks, i.e. LHCONE, either directly or if need be by e.g. "hopping" file transfers (I would hope there are more elegant ways of doing it), and b) catching up on n days of downtime, as we have done when storage systems, firewalls, software releases etc. gave us grief in the past.
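
To make the capability-style formulation concrete, here is a minimal sketch of how such a requirement could be checked against a series of daily transferred volumes; the thresholds N_PB_PER_DAY and X_PB_PER_FORTNIGHT are placeholders for the "n" and "x" above, not agreed values.

# Evaluate a capability-style requirement over daily transfer volumes (PB/day).
from typing import List

N_PB_PER_DAY = 1.0         # "minimum n PB per day" -- placeholder value
X_PB_PER_FORTNIGHT = 10.0  # "never less than x PB per 2-week interval" -- placeholder

def weeks(daily_pb: List[float]):
    return [daily_pb[i:i + 7] for i in range(0, len(daily_pb) - 6, 7)]

def meets_capability(daily_pb: List[float]) -> bool:
    # At least 4 days per week above the daily minimum, with at most
    # one failing week per quarter (13 weeks).
    bad_weeks = [sum(d >= N_PB_PER_DAY for d in w) < 4 for w in weeks(daily_pb)]
    for q in range(0, len(bad_weeks), 13):
        if sum(bad_weeks[q:q + 13]) > 1:
            return False
    # Never more than 4 consecutive days with no connectivity at all.
    run = 0
    for d in daily_pb:
        run = run + 1 if d == 0 else 0
        if run > 4:
            return False
    # Never less than x PB in any sliding 2-week window.
    return all(sum(daily_pb[i:i + 14]) >= X_PB_PER_FORTNIGHT
               for i in range(len(daily_pb) - 13))

The point is only that such a check is mechanical and unambiguous, which makes it easier to monitor and report on than an averaged availability percentage.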

  6. Thoughts on LHCOPN (3/3) • Possibilities, as recently discussed in the LHCONE WG, include moving the LHCOPN to a virtual circuit (VC) service: • VCs can be moved around on an underlying physical infrastructure to make better use of available capacity and, potentially, to provide greater robustness in the face of physical circuit outages; • VCs have the potential to allow sharing of a physical link when a VC is idle or uses less than its committed bandwidth. • Our requirements (?) for and benefits from circuits include • Topological flexibility • A circuit implementation that allows sharing the underlying physical link • That is, bandwidth committed to, but not used by, the circuit is available for other traffic, i.e. Tier-1 ↔ Tier-2 via LHCONE

  7. LHCOPN in a Circuit Scenario • Useful semantics in a shared infrastructure • Although the virtual circuits are rate-limited at the ingress • to limit utilization to what was requested by the users • they are permitted to burst above the allocated bandwidth if idle capacity is available • This must be done without interfering with other circuits, or with other uses of the link such as general IP traffic, for example by marking the over-allocation bandwidth as low-priority traffic • A user can request a second circuit that is diversely routed from the first circuit • in order to provide high reliability, for a backup circuit ……… • Why is this interesting? • With the rise of a general infrastructure that is 100G per link, using dedicated 10G links for Tier 0 – Tier 1 becomes increasingly inefficient • Shifting the OPN circuits to virtual circuits on the general (or LHCONE) infrastructure could facilitate sharing while still meeting the minimum required guaranteed OPN bandwidth
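
The sketch below is an illustrative model (my own, not the actual LHCOPN/LHCONE mechanism or any NSI implementation) of the sharing semantics described above: each virtual circuit is guaranteed its committed rate, and capacity left idle by the others can be borrowed, with the borrowed portion treated as low-priority so it never displaces committed traffic.

# Toy allocation of a shared physical link among rate-limited virtual circuits.
from dataclasses import dataclass

@dataclass
class Circuit:
    name: str
    committed_gbps: float  # rate the circuit is guaranteed
    demand_gbps: float     # what it currently wants to send

def share_link(link_gbps: float, circuits: list[Circuit]) -> dict[str, tuple[float, float]]:
    """Return {name: (guaranteed_share, low_priority_burst)} per circuit."""
    out = {c.name: (min(c.demand_gbps, c.committed_gbps), 0.0) for c in circuits}
    idle = link_gbps - sum(g for g, _ in out.values())
    # Hand the idle capacity to circuits that still have unmet demand;
    # this over-allocation would be marked as low-priority traffic.
    for c in circuits:
        want = c.demand_gbps - out[c.name][0]
        burst = min(want, max(idle, 0.0))
        out[c.name] = (out[c.name][0], burst)
        idle -= burst
    return out

# Example: a 100 Gbit/s link carrying two 10 Gbit/s OPN-style circuits.
print(share_link(100, [Circuit("T0-T1_A", 10, 35), Circuit("T0-T1_B", 10, 2)]))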

  8. B. Johnston et al

  9. Cost Models, Managing Allocations • The reserved bandwidth of a circuit is a scarce commodity • This commodity must be manageable • What sorts of manageability do we/the users require? • What do we need to control in terms of circuit creation?

  10. FAX/Remote IO and the Network • Remote I/O is a scenario that puts the WAN between the data and the executing analysis code • Today's processing model is based on data affinity • data is staged to the site where the compute resources are located, and data access by the analysis code is from local, site-resident storage • Inserting the WAN is a change that potentially requires special measures to ensure the smooth flow of data between disk and computing system, and therefore the "smooth" job execution needed to make effective use of the compute resources.
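
A rough model of why this matters (assumed numbers, not measurements from FAX): when analysis code issues synchronous reads, every read pays a full round trip, so effective throughput over a ~150 ms WAN path collapses for small reads unless the client hides latency with read-ahead, caching or vectored reads.

# Effective read throughput when each request pays one RTT plus transfer time.
def effective_mbps(read_size_kib: float, rtt_ms: float, link_gbps: float) -> float:
    rtt_s = rtt_ms / 1e3
    bytes_per_read = read_size_kib * 1024
    transfer_s = bytes_per_read * 8 / (link_gbps * 1e9)
    return bytes_per_read / (rtt_s + transfer_s) / 1e6

for size_kib in (32, 256, 2048, 16384):
    lan = effective_mbps(size_kib, rtt_ms=0.2, link_gbps=10)
    wan = effective_mbps(size_kib, rtt_ms=150, link_gbps=10)
    print(f"{size_kib:>6} KiB reads: LAN ~{lan:7.1f} MB/s, WAN ~{wan:6.1f} MB/s")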

  11. Programmatic Replication and Remote IO • Mix of remote I/O and bulk data transfer • What will be the mix of remote I/O-based access and bulk data transfer? • The change to remote I/O is, to a certain extent, aimed at lessening the use of the bulk data transfers that use GridFTP, so is addressing GridFTP addressing a dying horse? • How much bulk transfer will there be in a mature remote I/O scenario, and between what mix of sites?

  12. A possible path forward • Build a full mesh of static circuits whose bandwidth can be increased/decreased based on application/workflow needs: the R&E networks and end sites affected by the decisions above should share their possible NSI v2.0 deployment plans. How do ScienceDMZ/DYNES/CC-NIE installations play into this picture? • The bandwidth used would be a portion of the bandwidth used for the VRF infrastructure today, as it has capacity available. • The routing infrastructure for this circuit infrastructure is still to be discussed. A couple of alternatives include migrating a portion of the VRFs over the circuits at the participating sites, with the option to shift the routes back to the current VRF infrastructure if something happens to the circuit infrastructure due to experimentation. • Hold the right level of API/abstraction discussion between application developers and network folks to design the right interface into the circuit infrastructure (a sketch of such an interface follows below). Try to address concerns that circuits are complex to deploy and debug. • Continue the joint application and networking experts meetings at CERN. There is interest in optimization using information from the network and in co-scheduling of resources. • Decide on the right metrics and design an experiment to quantify if and how circuits help applications, since there is divided opinion within the group. The experiment needs to be designed properly.
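
Purely as a discussion aid for the API/abstraction point above, here is a sketch of the level at which workflow systems might want to talk to the circuit infrastructure (sites, bandwidth and time windows rather than VLANs and VRFs). All names below are hypothetical; they do not correspond to any existing NSI v2.0, DYNES or other API.

# Hypothetical application-facing abstraction over a circuit service.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CircuitRequest:
    src_site: str          # e.g. "CERN-T0" (illustrative site name)
    dst_site: str          # e.g. "BNL-T1" (illustrative site name)
    guaranteed_gbps: float # committed bandwidth
    starts: datetime
    ends: datetime

class CircuitService:
    """Hypothetical facade over an NSI-like provisioning backend."""

    def request(self, req: CircuitRequest) -> str:
        """Reserve a circuit; returns an opaque circuit id."""
        raise NotImplementedError("backend-specific (e.g. an NSI v2.0 requester)")

    def resize(self, circuit_id: str, guaranteed_gbps: float) -> None:
        """Grow/shrink the committed bandwidth as workflow needs change."""
        raise NotImplementedError

    def release(self, circuit_id: str) -> None:
        raise NotImplementedError

Whether a resize-style call is even desirable, or whether applications should only express aggregate transfer deadlines and let the network decide, is exactly the kind of question the joint meetings would need to settle.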
