Experiences and results from implementing the QBone Scavenger
Les Cottrell – SLAC
Presented at the CENIC meeting, San Diego, May 2002
www.slac.stanford.edu/grp/scs/talk/cenic-may02.html
Partially funded by the DOE/MICS Field Work Proposal on Internet End-to-end Performance Monitoring (IEPM); also supported by IUPAP
Outline
• Needs of High Energy & Nuclear Physics (HENP)
• Why we need a scavenger service
• What the scavenger service is
• How it is used
• Results of tests with 10Mbps, 100Mbps and 2Gbps bottlenecks
• How we could use it
HENP Experiment Model
• World-wide collaborations necessary for large undertakings
• Regional computer centers in France, Italy, UK & US
  • Spending Euros on a data center at SLAC is not attractive
  • Leverage local equipment & expertise
  • Resources available to all collaborators
• Requirements – bulk (60% of SLAC traffic):
  • Bulk data replication (current goal > 100MBytes/s)
  • Optimized cached read access to 10-100GB from a 1PB data set
Data requirements for HEP
• HEP accelerator experiments generate 10's to 100's of MBytes/s of raw data (100MBytes/s ≈ 0.36TB/hr)
  • Already heavily filtered in trigger hardware/software to keep only "potentially" interesting events
  • Data rate limited by the ability to record and use the data
• The raw electronics signal data is analyzed to reconstruct tracks, events, etc.
  • Requires computing resources at several (Tier 1) sites worldwide; for BaBar this includes France, UK & Italy
  • Data has to be sent to these sites, and the reconstructions have to be shared
• Reconstructed data is summarized into an object-oriented database providing the parameters of the events
• Summarized data is analyzed by physicists around the world looking for physics and equipment understanding: thousands of physicists in hundreds of institutions in tens of countries
• In addition, Monte Carlo methods are used to create simulated events for comparison with real events
  • Also very CPU intensive, so done at multiple sites such as LBNL, LLNL, Caltech, with results shared with other sites
HENP Data Grid Hierarchy
[Diagram of the tiered data grid: the online system at the experiment (~PByte/s) feeds CERN (Tier 0+1: ~700k SI95, ~1 PB disk, tape robot, HPSS) at ~100-400 MBytes/s. Tier 1 regional centers (FNAL: 200k SI95, 600 TB; IN2P3; INFN; RAL) connect at ~2.5 Gbits/s, Tier 2 centers at ~2.5 Gbits/s, Tier 3 institutes (~0.25 TIPS) at 100-1000 Mbits/s, and Tier 4 physicists' workstations sit behind physics data caches. CERN/outside resource ratio ~1:2; Tier0:Tier1:Tier2 ~1:1:1. Physicists work on analysis "channels"; each institute has ~10 physicists working on one or more channels.]
HEP Next Generation Network needs
• Providing rapid access to event samples and subsets from massive data stores
  • From ~400 Terabytes in 2001, ~Petabytes by 2002, ~100 Petabytes by 2007, to ~1 Exabyte by ~2012
• Providing analyzed results with rapid turnaround, by coordinating and managing the LIMITED computing, data handling and NETWORK resources effectively
• Enabling rapid access to the data and the collaboration
  • Across an ensemble of networks of varying capability
• Advanced integrated applications, such as Data Grids, rely on seamless operation of our LANs and WANs
  • With reliable, quantifiable (monitored), high performance
  • For "Grid-enabled" event processing, data analysis, and collaboration
Throughputs today
• Can get 400Mbits/s TCP throughput regularly from SLAC to well connected sites on production ESnet or Internet2 within the US
  • Needs big windows & multiple streams, and > 500MHz CPUs
• Usually a single transfer is disk limited to < 70Mbits/s
[Table of measured throughputs to remote sites; asterisks mark trans-Atlantic links.]
Also see http://www-iepm.slac.stanford.edu/monitoring/bulk/ and the Internet2 E2E Initiative: http://www.internet2.edu/e2e
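To make the "big windows & multiple streams" point concrete, here is a minimal sketch (in Python, not the iperf/bbcp tools actually used at SLAC): it opens several parallel TCP connections and requests large socket send buffers, the two host-side knobs referred to above. The destination host/port, buffer size and stream count are illustrative placeholders.

```python
import socket
import threading

REMOTE = ("receiver.example.org", 5001)   # placeholder destination
WINDOW = 4 * 1024 * 1024                  # request a 4 MB socket buffer ("big window")
STREAMS = 8                               # number of parallel TCP streams
CHUNK = b"\x00" * 65536                   # 64 kB payload per send

def one_stream(n_bytes):
    """Push n_bytes over one TCP connection with a large send buffer."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the kernel for a large send buffer; the effective window also
    # depends on system-wide TCP settings (e.g. net.ipv4.tcp_wmem on Linux).
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, WINDOW)
    s.connect(REMOTE)
    sent = 0
    while sent < n_bytes:
        sent += s.send(CHUNK)
    s.close()

# Run several streams in parallel (the same idea as the parallel-stream
# options offered by the bulk-transfer tools mentioned above).
threads = [threading.Thread(target=one_stream, args=(100 * 1024 * 1024,))
           for _ in range(STREAMS)]
for t in threads: t.start()
for t in threads: t.join()
```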
Why do we need even higher speeds?
• Data growth exceeds Moore's law
• New experiments coming on line
• Experiment with higher speeds to:
  • Understand the next limitations:
    • End hosts: disks, memory, compression
    • Application steering, windows, streams, tuning stacks, choosing replicas…
    • Improving or replacing TCP stacks, forward error correction, non-congestion-related losses…
    • Coexistence, need for QoS, firewalls…
  • Set expectations
  • Change mindset
• NTON enabled us to be prepared to change from shipping tapes to using the network, and assisted in more realistic planning
In addition …
• Requirements – interactive:
  • Remote login, video conferencing, document sharing, joint code development, co-laboratory (remote operations, reduced travel, more humane shifts)
  • Modest bandwidth – often < 1 Mbps
  • Emphasis on quality of service & sub-second responses
• How to get the best of both worlds:
  • Use all available bandwidth
  • Minimize impact on others
• One answer is to be a scavenger
What is QBSS?
• QBSS stands for QBone Scavenger Service. It is an Internet2 initiative that lets users and applications:
  • take advantage of otherwise unused bandwidth,
  • without affecting the performance of the default best-effort class of service.
• QBSS corresponds to a specific Differentiated Services Code Point (DSCP): DSCP = 001000 (binary)
• The IPv4 ToS (Type of Service) octet is laid out as (bits 0-7):
  • Bits 0-5 = DSCP (Differentiated Services Code Point), of which bits 0-2 are the class selector
  • Bits 6-7 = Explicit Congestion Notification (ECN)
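To illustrate what "marking traffic with the QBSS codepoint" means in practice, the sketch below sets the ToS byte on an ordinary TCP socket. DSCP 001000 occupies the top six bits of the ToS octet, so the byte value is 0x20 (32 decimal). This is only a minimal illustration, not the actual modification made to bbcp; the destination host and port are placeholders.

```python
import socket

QBSS_DSCP = 0b001000          # QBone Scavenger Service codepoint
QBSS_TOS = QBSS_DSCP << 2     # DSCP sits in the upper 6 bits of the ToS octet -> 0x20 (32 decimal)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Mark every packet this socket sends with the QBSS codepoint; routers that
# recognise it can serve the traffic at lower priority than best effort.
s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, QBSS_TOS)
s.connect(("replica.example.org", 5001))   # placeholder destination
s.sendall(b"bulk replication payload ...")
s.close()
```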
How is it used?
• Users can voluntarily mark their traffic with the QBSS codepoint, much as they would type nice on Unix; alternatively, routers can mark packets on behalf of users/applications.
• Routers that see traffic marked with the QBSS codepoint can:
  • Be configured to handle it:
    • Forward it at a lower priority than best-effort traffic, with the possibility of expanding its bandwidth when other traffic is not using all the capacity
  • Not know about it:
    • Treat it as regular Best Effort (DSCP 000000)
Impact on others
• Make ping measurements with & without iperf TCP loading
  • Loss, loaded vs. unloaded
  • RTT
• Looking at how to avoid impact: e.g. QBSS/LBE, application pacing, a control loop on RTT, reducing streams; want to avoid scheduling
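The measurement method is easy to reproduce. Below is one hedged way to sample RTT with the system ping while a separate bulk transfer runs; it assumes a Unix-like ping whose output contains "time=… ms", and the target host is a placeholder.

```python
import re
import subprocess
import statistics

TARGET = "remote.example.org"   # placeholder measurement target
SAMPLES = 20

def ping_rtt_ms(host):
    """Send one ping and return the RTT in milliseconds (assumes Unix ping output)."""
    out = subprocess.run(["ping", "-c", "1", host],
                         capture_output=True, text=True).stdout
    m = re.search(r"time=([\d.]+) ms", out)
    return float(m.group(1)) if m else None

# Start the bulk load separately (e.g. a transfer marked QBSS, BE or Priority),
# then sample RTTs while it runs.
rtts = [r for r in (ping_rtt_ms(TARGET) for _ in range(SAMPLES)) if r is not None]
if rtts:
    print(f"RTT avg {statistics.mean(rtts):.2f} ms, "
          f"min {min(rtts):.2f} ms, max {max(rtts):.2f} ms over {len(rtts)} samples")
```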
QBSS test bed with Cisco 7200s
• Set up a QBSS testbed with Cisco 7200s
  • Has a 10Mbps bottleneck (other links in the testbed run at 100Mbps and 1Gbps)
• Configure the router interfaces for 3 traffic types: QBSS, BE, Priority
• Define a policy, e.g. QBSS > 1%, Priority < 30%
• Apply the policy to the router interface queues
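The intended semantics of that policy (Priority policed below a cap, QBSS guaranteed a tiny floor and otherwise scavenging whatever Best Effort leaves idle) can be sketched as a toy calculation. This is emphatically not how the router's queueing actually works; it is just a simplified model of the behaviour described above, and the allocate() helper is hypothetical.

```python
def allocate(capacity_mbps, offered, priority_cap=0.30, qbss_floor=0.01):
    """Toy model of the testbed policy: Priority is policed to a cap, Best Effort
    takes what it needs next, and QBSS scavenges the leftover but is guaranteed
    a small floor. 'offered' maps class name -> offered load in Mbps."""
    alloc = {}
    # Priority traffic is policed to at most 30% of the bottleneck.
    alloc["priority"] = min(offered.get("priority", 0), priority_cap * capacity_mbps)
    # QBSS is guaranteed a small slice (> 1%) whenever it has traffic to send.
    qbss_guarantee = qbss_floor * capacity_mbps if offered.get("qbss", 0) > 0 else 0
    # Best effort gets what it wants from the remainder, above the QBSS floor.
    remaining = capacity_mbps - alloc["priority"] - qbss_guarantee
    alloc["be"] = min(offered.get("be", 0), max(remaining, 0))
    # QBSS scavenges whatever is left over, plus its guaranteed floor.
    leftover = capacity_mbps - alloc["priority"] - alloc["be"]
    alloc["qbss"] = min(offered.get("qbss", 0), max(leftover, qbss_guarantee))
    return alloc

# All three classes saturating a 10 Mbps bottleneck: QBSS only keeps its floor
# (roughly priority 3.0, be 6.9, qbss 0.1 Mbps).
print(allocate(10, {"priority": 10, "be": 10, "qbss": 10}))
# No competing BE or Priority traffic: QBSS expands to use the full link.
print(allocate(10, {"qbss": 10}))
```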
Using bbcp to make QBSS measurements
• Run bbcp with /dev/zero as the source and /dev/null as the destination, reporting throughput at 1-second intervals
  • with TOS=32 (decimal), i.e. QBSS
• After 20 s, run a second bbcp with no TOS bits specified (BE)
• After 20 more s, run a third bbcp with TOS=40 (decimal), i.e. Priority
• After 20 more s, turn off Priority
• After 20 more s, turn off BE
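Without relying on bbcp's exact command-line options (not given in the slide), the same 20-second on/off schedule can be approximated with plain sockets: one memory-to-memory sender per traffic class, each marked with the corresponding ToS value from above and started or stopped on the 20 s boundaries. The Sender class and the sink host/port below are illustrative placeholders.

```python
import socket
import threading
import time

DEST = ("sink.example.org", 5001)             # placeholder /dev/null-style sink
TOS = {"qbss": 32, "be": 0, "priority": 40}   # ToS byte values from the slide

class Sender(threading.Thread):
    """Push zero-filled data to DEST with a given ToS until stopped,
    printing throughput once per second."""
    def __init__(self, label, tos):
        super().__init__(daemon=True)
        self.label, self.tos, self.stop = label, tos, threading.Event()

    def run(self):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        if self.tos:
            s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, self.tos)
        s.connect(DEST)
        chunk, sent, t0 = b"\x00" * 65536, 0, time.time()
        while not self.stop.is_set():
            sent += s.send(chunk)
            if time.time() - t0 >= 1.0:
                print(f"{self.label}: {sent * 8 / (time.time() - t0) / 1e6:.1f} Mbit/s")
                sent, t0 = 0, time.time()
        s.close()

# The schedule from the slide: QBSS first, BE after 20 s, Priority after 40 s,
# then Priority off at 60 s and BE off at 80 s.
qbss, be, prio = Sender("QBSS", TOS["qbss"]), Sender("BE", TOS["be"]), Sender("Priority", TOS["priority"])
qbss.start(); time.sleep(20)
be.start();   time.sleep(20)
prio.start(); time.sleep(20)
prio.stop.set(); time.sleep(20)
be.stop.set();   time.sleep(20)
qbss.stop.set()
for sender in (qbss, be, prio):
    sender.join()
```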
Example of effects
[Chart of throughput per traffic class over the test schedule.] Also tried: 1 stream for all, and Priority at 70%.
QBSS with Cisco 6500s
• 6500s + Policy Feature Card (PFC): routing by the PFC2, policing on the switch interfaces
• 2 queues, with 2 thresholds each
  • QBSS assigned to its own queue with 5% of the bandwidth – guarantees QBSS gets something
  • BE & Priority traffic share the 2nd queue with 95% of the bandwidth
• Apply an ACL to the switch port to police Priority traffic to < 30%
[Diagram: Cisco 6500s + MSFC/Sup2 testbed with 100Mbps and 1Gbps links, and a sketch of bandwidth vs. time for BE (up to 100%), Priority (30%) and QBSS (~5%).]
Impact on response time (RTT)
• Run ping with iperf loading (~93Mbps) under various QoS settings
• No iperf: ping avg RTT ~300usec (regardless of QoS)
• Iperf = QBSS, ping = BE or Priority: RTT ~550usec
  • 70% greater than unloaded
• Iperf QoS = ping QoS (except Priority): RTT ~5msec
  • More than a factor of 10 larger than unloaded
SC2001 – Our challenge: bandwidth to the world
• Demonstrate current data transfer capabilities to several sites worldwide:
  • 26 sites all over the world
  • Iperf servers on each remote side that can accept data coming from the show floor
  • Mimic a high energy physics Tier 0 or Tier 1 site (an accelerator or major computation site) distributing copies of the raw data to multiple replica sites
SC2001 setup
• The setup at SC2001 had three Linux PCs with a total of 5 Gigabit Ethernet interfaces
• The Gigabit lines to the SC2001 NOC (and from there to the world) were Ether-Channeled together to give an aggregate 2Gbps link
• The configuration of the two 6509 switches set the baseline for QBSS traffic at 5% of the total bandwidth
SC2001 demo 1/2
• Send data from 3 SLAC/FNAL booth computers to over 20 other sites with good connections in about 6 countries
• Iperf TCP throughputs ranged from 3Mbps to ~300Mbps
• Saturate the 2Gbps connection to the floor network
  • Maximum aggregate throughput averaged over 5 min. ~1.6Gbps
• Apply QBSS to the highest-performance site, and BE to the rest
[Charts: iperf TCP throughput per GE interface (100 Mbits/s scale over ~5 min) with and without QBSS, and pings to a host on the show floor: Priority 9±2 ms, BE 18.5±3 ms, QBSS 54±100 ms.]
Possible usage
• Apply Priority to lower-volume interactive voice/video-conferencing and real-time control
• Apply QBSS to high-volume data replication
• Leave the rest as Best Effort
• Since 40-65% of the bytes to/from SLAC come from a single application, we have modified it to enable setting of the TOS bits
• Need to identify the bottlenecks and implement QBSS there
  • Bottlenecks tend to be at the edges, so we hope to try this with a few HEP sites
Acknowledgements & more information
• Official Internet2 page: http://qbone.internet2.edu/qbss/
• IEPM/PingER home site: www-iepm.slac.stanford.edu/
• Bulk throughput site: www-iepm.slac.stanford.edu/bw
• QBSS measurements: www-iepm.slac.stanford.edu/monitoring/qbss/measure.html
• CENIC Network Applications Magazine, vol 2, April '02: www.cenic.org/InterAct/interactvol2.pdf
• Thanks to Stanislav Shalunov of Internet2 for inspiration and encouragement, and to Paola Grosso, Stefan Luitz, Warren Matthews & Gary Buhrmaster of SLAC for setting up routers and helping with measurements.