
Web100/Net100 at Oak Ridge National Lab



Presentation Transcript


  1. Web100/Net100 at Oak Ridge National Lab
     Tom Dunigan (thd@ornl.gov)
     October 21, 2002

  2. Net100: developing network-aware operating systems
     • DOE-funded (Office of Science) project ($1M/yr, 3 yrs beginning 9/01)
     • Principal investigators
       • Matt Mathis, PSC (mathis@psc.edu)
       • Brian Tierney, LBNL (bltierney@lbl.gov)
       • Tom Dunigan, ORNL (thd@ornl.gov), with Florence Fowler and Nagi Rao
     • Objectives
       • measure and understand end-to-end network and application performance
       • tune network applications (grid and bulk transfer)
       • first-year emphasis: bulk transfer over high delay/bandwidth nets
     • Components (leverage Web100)
       • Network Tool Analysis Framework (NTAF)
         • tool design and analysis
         • active network probes and passive sensors
         • network metrics database
       • transport protocol analysis
       • tuning daemon (WAD) to tune network flows based on network metrics
     www.net100.org

  3. TCP tuning with Web100+/Net100
     • Path characterization (NTAF)
       • both active and passive measurement
       • database of measurement data
       • NTAF/Web100 hosts at PSC, NCAR, LBL, ORNL
     • Application tuning (tuning daemon, WAD)
       • Web100 extensions
         • disable Linux 2.4 caching/SendStall
         • event notification
         • more tuning options
       • daemon tunes the application at startup
         • static tuning information
         • query NTAF and calculate optimum TCP parameters
       • dynamically tune the application (Web100 feedback)
         • adjust parameters during the flow
         • split the optimum among parallel flows
     • Transport protocol optimizations
       • what to tune?
       • is it fair? stable?

  4. Motivation
     • Poor network performance …
       • high-bandwidth paths, but applications are slow
       • is it the application, the OS, or the network? … Yes
     • Changing: bandwidths
       • 9.6 Kbs … 1.5 Mbs … 45 … 100 … 1000 … ? Mbs
     • Unchanging TCP:
       • speed of light (RTT)
       • MTU (still 1500 bytes)
       • TCP congestion avoidance
     • TCP is lossy by design!
       • 2x overshoot at startup, sawtooth
       • recovery after a loss can be very slow on today’s high delay/bandwidth links
       • recovery rate proportional to MSS/RTT²
     [Figure: ORNL-to-NERSC ftp trace showing instantaneous and average bandwidth, early startup losses, and linear recovery at 0.5 Mb/s]
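The MSS/RTT² recovery claim above can be made concrete with a back-of-envelope script. The link rate and RTT are taken from the ns simulation mentioned on the next slide (500 Mbs, 80 ms); the 1460-byte MSS is an assumption, not from the slides:

```python
# Rough AIMD recovery time on a long fat pipe: after a loss, standard TCP
# halves cwnd and regains only one MSS per RTT, so recovery scales as MSS/RTT^2.
MSS = 1460          # bytes (assumed Ethernet-sized segment)
RTT = 0.080         # seconds
link = 500e6        # bits/s

bdp_segments = link * RTT / (8 * MSS)   # window (in segments) needed to fill the pipe
rtts_to_recover = bdp_segments / 2      # +1 MSS per RTT, starting from half window
seconds = rtts_to_recover * RTT
print(round(seconds))                   # roughly 137 s; ~double with delayed ACKs
```

This is why the slide's figure shows minutes of linear recovery from a single early loss.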

  5. Net100 TCP tuning
     • TCP performance
       • reliable/stable/fair
       • needs buffer = bandwidth × RTT
         • ORNL/NERSC (80 ms, OC12) needs 6 MB
       • TCP slow-start and loss recovery proportional to MSS/RTT²
         • slow on today’s high delay/bandwidth paths
       • TCP is lossy by design
     • TCP tuning
       • set optimal (?) buffer size
       • avoid losses
         • modified slow-start
         • reduce bursts
         • anticipate (Vegas?) loss
         • reorder threshold
       • speed recovery
         • bigger MTU or “virtual MSS”
         • modified AIMD (0.5, 1)
         • delayed ACKs and initial window
     [Figure: ns simulation, 500 Mbs link, 80 ms RTT; packet loss early in slow start. Standard TCP with delayed ACK takes 10 minutes to recover!]
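The buffer = bandwidth × RTT rule above is a one-line calculation. A minimal sketch, treating OC-12 as roughly 600 Mbs of usable payload (SONET overhead glossed over):

```python
# Bandwidth-delay product: the socket buffer needed to keep a path full.
def bdp_bytes(bandwidth_bps, rtt_s):
    # bits in flight over one RTT, converted to bytes
    return bandwidth_bps * rtt_s / 8

buf = bdp_bytes(600e6, 0.080)     # ORNL/NERSC figures from the slide
print(round(buf / 1e6, 1))        # → 6.0 (MB), matching the slide's 6 MB
```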

  6. Net100 TCP tuning
     • Work-around Daemon (WAD)
       • tunes an unknowing sender/receiver at startup and/or during the flow
       • Web100 kernel extensions
         • uses netlink to alert the daemon of socket open/close
         • besides existing Web100 buffer tuning, new code and WAD_* variables
         • knobs to disable Linux 2.4 caching and sendstall
       • config file with static tuning data
         • mode specifies dynamic tuning (Floyd AIMD, NTAF buffer size, concurrent streams)
       • daemon periodically polls NTAF for fresh tuning data
       • written in C (LBL has a Python version)

     WAD config file:
       [bob]
       src_addr: 0.0.0.0
       src_port: 0
       dst_addr: 10.5.128.74
       dst_port: 0
       mode: 1
       sndbuf: 2000000
       rcvbuf: 100000
       wadai: 6
       wadmd: 0.3
       maxssth: 100
       divide: 1
       reorder: 9
       delack: 0
       floyd: 1
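The WAD config format above is INI-like (the real daemon is written in C). A hypothetical sketch of reading it with Python's stdlib configparser, using the section and key names from the slide:

```python
# Hypothetical reader for a WAD-style config; configparser accepts the
# "key: value" delimiter used in the slide's example.
import configparser

WAD_CONF = """
[bob]
src_addr: 0.0.0.0
src_port: 0
dst_addr: 10.5.128.74
dst_port: 0
mode: 1
sndbuf: 2000000
rcvbuf: 100000
wadai: 6
wadmd: 0.3
"""

cfg = configparser.ConfigParser()
cfg.read_string(WAD_CONF)
flow = cfg["bob"]                                    # one tuned flow per section
print(flow.getint("sndbuf"), flow.getfloat("wadmd"))  # → 2000000 0.3
```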

  7. WAD tuning results (your mileage may vary …)
     • Classic buffer tuning: ORNL to PSC, OC12, 80 ms RTT
       • network-challenged app gets 10 Mbs
       • same app, WAD/NTAF-tuned buffer gets 143 Mbs
     • Virtual MSS
       • tune TCP’s additive increase (WAD_AI)
       • add k segments per RTT during recovery
       • k = 6 is like a GigE jumbo frame
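The virtual-MSS effect above is easy to quantify: growing cwnd by k segments per RTT shortens recovery by a factor of k. A sketch, with the window size assumed (roughly the 500 Mbs / 80 ms case, 1460-byte segments):

```python
# Why WAD_AI ("virtual MSS") speeds recovery: AIMD(0.5, k) must regain
# half a window, at k segments per RTT instead of 1.
def rtts_to_regain(window_segments, ai_segments_per_rtt):
    return (window_segments / 2) / ai_segments_per_rtt

full_window = 3400                      # segments; assumed long-fat-pipe window
print(rtts_to_regain(full_window, 1))   # → 1700.0 RTTs for standard TCP
print(rtts_to_regain(full_window, 6))   # WAD_AI = 6: 6x faster, like a jumbo frame
```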

  8. WAD tuning
     • Modified slow-start and AI: ORNL to NERSC, OC12, 80 ms RTT
       • often losses in slow start
       • WAD-tuned Floyd slow-start (WAD_MaxThresh) and AI (6)
     • WAD-tuned AIMD and slow start: ORNL to CERN, OC12, 150 ms RTT
       • parallel streams: AIMD (1/(2k), k)
       • WAD-tuned single stream: (0.125, 4) via WAD_MD
     • Can a tuned single stream compete with parallel streams?
       • pre-tune Floyd AIMD or dynamically adjust
       • tune concurrent flows -- subdivide buffer
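The slide's rule of thumb maps k parallel streams onto one stream with AIMD(1/(2k), k): gentler multiplicative decrease, steeper additive increase. A minimal sketch:

```python
# Single-stream AIMD parameters roughly equivalent to k parallel TCP streams.
def equivalent_aimd(k):
    md = 1 / (2 * k)   # multiplicative decrease on loss
    ai = k             # additive increase, segments per RTT
    return md, ai

print(equivalent_aimd(4))   # → (0.125, 4), the single-stream tuning on the slide
```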

  9. Net100 TCP tuning
     • Reorder threshold
       • seeing more out-of-order packets
       • WAD tunes a bigger reorder threshold
       • Linux 2.4 does a good job already
       • LBL to ORNL (using our TCP-over-UDP): the dup3 case had 289 retransmits, but all were unneeded!
     • WAD could turn off delayed ACKs
       • 2x improvement in recovery rate and slow-start
       • Linux 2.4 already turns off delayed ACKs for the initial slow-start
       • WARNING: could be unfair, though probably stable; use only on an intranet
     Web100 has proven very useful for experimenting with TCP tuning options.

  10. Web100 tools
     • Java applet bandwidth/client tester
       • measures in/out data rates
       • reports flow characteristics
       • try it: http://firebird.ccs.ornl.gov:7123
       • INSIGHTS: what happened, and what you can expect
         • from the server log: 25,755 flows; 53% with loss, 23% with timeouts
     • Post-transfer statistics
       • ttcp100/iperf100
     • Web100 daemon
       • avoids modifying applications
       • logs designated paths/ports/variables
       • INSIGHTS: later …

  11. Web100 tools
     • Tracer daemon
       • collects Web100 variables at 0.1-second intervals
       • config file specifies
         • source/port and dest/port
         • Web100 variables (current/delta)
       • logs to disk with timestamp and CID
       • C and Python (LBL-based) versions
     • INSIGHTS:
       • watch uninstrumented apps (GridFTP)
       • analyze flow dynamics with plots (cwnd, ssthresh, re-xmits, RTT …)
       • analyze tuned flows
       • aggregate parallel-flow data

     traced config file:
       # local lport remote rport
       0.0.0.0 0 124.55.182.7 0
       0.0.0.0 0 134.67.45.9 0
       # v=value d=delta
       d PktsOut
       d PktsRetrans
       v CurrentCwnd
       v SampledRTT
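A hypothetical parser for the traced config format above: four-field lines name a flow to watch, and "v"/"d" lines select Web100 variables as current values or per-interval deltas. This is a sketch of the format as shown on the slide, not the daemon's actual code:

```python
TRACED_CONF = """\
# local lport remote rport
0.0.0.0 0 124.55.182.7 0
0.0.0.0 0 134.67.45.9 0
# v=value d=delta
d PktsOut
d PktsRetrans
v CurrentCwnd
v SampledRTT
"""

flows, variables = [], []
for line in TRACED_CONF.splitlines():
    line = line.strip()
    if not line or line.startswith("#"):       # skip blanks and comments
        continue
    fields = line.split()
    if len(fields) == 4:                       # a flow: local lport remote rport
        flows.append((fields[0], int(fields[1]), fields[2], int(fields[3])))
    elif fields[0] in ("v", "d"):              # a variable: value vs delta
        variables.append((fields[0], fields[1]))

print(len(flows), len(variables))   # → 2 4
```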

  12. PIX SACK problem
     • Web100 reports timeouts into ORNL, but not at other sites??
     • Theory 1: yet another Linux 2.4 TCP feature
       • our TCP-over-UDP: no timeouts
     • tcpdump/tcptrace/xplot of the flow, both inside and outside ORNL
       • a tcptrace bug -- SACK blocks wrong for one of the dumps … NOT.
     • The ORNL PIX firewall was randomizing TCP sequence numbers but failed to adjust the SACK blocks
     • RESULT: TCP timeouts
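The failure mode above can be illustrated with hypothetical numbers: a sequence-randomizing firewall must apply the same offset to SACK blocks in returning ACKs, or the SACK ranges no longer match the sender's sequence space and loss recovery falls back to timeouts.

```python
# Illustration (assumed offset and sequence numbers, not from the slides).
OFFSET = 0x1A2B3C4D          # hypothetical per-connection randomization offset
MOD = 2 ** 32                # TCP sequence space wraps at 2^32

def rewrite_seq(seq):
    return (seq + OFFSET) % MOD

sent_block = (rewrite_seq(1000), rewrite_seq(2460))   # bytes as seen on the wire
sack_untranslated = (1000, 2460)                      # what the broken PIX forwarded
sack_translated = (rewrite_seq(1000), rewrite_seq(2460))

print(sack_untranslated == sent_block)   # → False: SACK info is useless, timeout
print(sack_translated == sent_block)     # → True: a consistent rewrite recovers fast
```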

  13. Futures (www.net100.org)
     • Net100
       • analyze effectiveness of current tuning options
       • NTAF probes -- characterizing a path to tune a flow
       • additional tuning algorithms (Vegas)
       • parallel/multipath selection/tuning
       • WAD-to-WAD tuning
     • Web100 extensions
       • Web100 trace files -- log all data efficiently
       • variable for a count of duplicate data segments at the receiver
       • remove the wscale restriction
     • ESnet
       • jumbo frames
       • router/switch data
