Scalable Integrated Performance Analysis of Multi-Gigabit Networks
Ezra Kissel, U. Delaware; Ahmed El-Hassany, Guilherme Fernandes, Martin Swany, Indiana U.; Dan Gunter, Taghrid Samak, LBNL; Jen Schopf, WHOI
What I hope you learn
• Why we care about bulk data transfer at multi-gigabit rates
• Why and how detailed monitoring is helpful
• How dynamic control of monitoring is related to Session Layer protocols
Bulk data transfer needs
• Some domains of interest:
  • Climate simulation (Earth System Grid)
  • Genomics (JGI)
  • High-energy physics (Large Hadron Collider)
  • Astronomy (Large Synoptic Survey Telescope)
  • Astrophysics (FLASH)
(Diagram: huge data sets moving from experiments to analysis sites)
Multi-gigabit rates
• Networks connecting national labs and universities have 10 Gb/s and soon 100 Gb/s capability (one PB = one day at 100 Gb/s)
• Rarely achieved due to bottlenecks:
  • Host: application or disks
  • Campus/local networks
  • Wide-area networks
• Hard to tell why, where, or even if there is a problem
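The "one PB = one day at 100 Gb/s" rule of thumb on this slide is easy to verify; a quick back-of-the-envelope check:

```python
# Sanity check for the slide's rule of thumb: moving one petabyte
# at a sustained 100 Gb/s takes roughly one day.
PETABYTE_BITS = 8 * 10**15   # 1 PB = 10^15 bytes = 8 * 10^15 bits
RATE_BPS = 100 * 10**9       # 100 Gb/s

seconds = PETABYTE_BITS / RATE_BPS
hours = seconds / 3600
print(f"{hours:.1f} hours")  # ~22.2 hours, i.e. about one day
```

Note this assumes the full line rate is sustained end to end, which is exactly what the bottlenecks listed above tend to prevent.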
Solution
• Monitor all the time
• Analyze all the time... but much more when something interesting is happening
• Use analysis results as feedback
System components
• eXtensible Session Protocol (XSP)
  • Associates multiple TCP connections and L2 circuits as a "session"
  • Provides channels for bi-directional metadata
• NL-Calipers
  • Summarizes in situ the timings of every read/write
• BLiPP
  • Host and TCP stack info via XSP channels
• perfSONAR
  • Standard information formats and exchange protocols
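The key idea behind NL-Calipers is that every read()/write() is folded into running statistics in situ, so no per-event log is kept. A minimal sketch of that pattern (this `Calipers` class and its methods are illustrative, not the actual NL-Calipers API):

```python
import math


class Calipers:
    """Hypothetical sketch of NL-Calipers-style in-situ summarization:
    each observed read()/write() updates running count, mean, variance
    (via Welford's algorithm), min, and max of the instantaneous rate,
    so a fixed-size summary replaces a log of every event."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0           # sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def record(self, nbytes, seconds):
        rate = nbytes / seconds  # instantaneous throughput of this I/O call
        self.n += 1
        delta = rate - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (rate - self.mean)
        self.min = min(self.min, rate)
        self.max = max(self.max, rate)

    def summary(self):
        """Fixed-interval summary: n, mean, variance, min, max."""
        var = self._m2 / (self.n - 1) if self.n > 1 else 0.0
        return {"n": self.n, "mean": self.mean, "var": var,
                "min": self.min, "max": self.max}
```

At each reporting interval the summary can be emitted over an XSP metadata channel and the accumulator reset; the per-call overhead is a few arithmetic operations.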
Dynamic Session Monitoring
(Diagram: a user and a network engineer looking at the performance)
(1) User starts transfer
(2) Session opened
(3) NL-Calipers data reported
(4) Engineer signals for TCP info
(5) TCP data returned
Bottleneck detection
• Instrumentation: on fixed intervals, summarize all measurements into mean, min, max, and variance for both rate and number of bytes (triangles in the plots give "instantaneous" throughput)
• Analysis: pick the lowest mean value as the bottleneck; apply a t-test
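Because the instrumentation only keeps per-interval summaries (mean, variance, n), the t-test can be computed directly from those summaries without raw samples. A sketch of that analysis step, assuming hypothetical names and a threshold chosen purely for illustration:

```python
import math


def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Welch's t statistic for two summarized samples; works from
    (mean, variance, n) alone, so no raw measurements are needed."""
    se = math.sqrt(var1 / n1 + var2 / n2)
    return (mean1 - mean2) / se


def pick_bottleneck(summaries):
    """summaries: {component_name: (mean_rate, variance, n)}.
    The candidate bottleneck is the component with the lowest mean rate;
    a t-test against the next-slowest component decides whether the gap
    is significant.  The threshold 2.0 is an illustrative choice, not a
    value from the talk."""
    ranked = sorted(summaries.items(), key=lambda kv: kv[1][0])
    name, (m1, v1, n1) = ranked[0]           # slowest: candidate bottleneck
    _, (m2, v2, n2) = ranked[1]              # next slowest
    t = welch_t(m2, v2, n2, m1, v1, n1)      # positive if gap is real
    return name, t > 2.0
```

For example, disk summaries averaging 200 MB/s against network summaries averaging 900 MB/s would flag the disk as a significant bottleneck.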
TCP throughput Time series of throughput* for representative TCP experiments: (a) 1 stream memory-to-disk with 100ms latency, (b) 1 stream memory-to-memory with no latency, (c) 1 stream disk-to-disk with no latency, (d) 4 streams memory-to-disk with 100ms latency and 1% loss added at 60 seconds.
UDT throughput Time series of throughput* for representative UDT experiments: (a) 4 streams memory-to-disk with 100ms latency, (b) 4 streams memory-to-disk with 100ms latency and 1% loss added at 60 seconds, (c) 4 streams disk-to-disk with 100ms latency, (d) 4 streams memory-to-memory with 100ms latency.
Variance
• Half as many read()s; the others return zero and are not counted
• Less work being done
Review
• Why we care about bulk data transfer at multi-gigabit rates
• Why and how detailed monitoring is helpful
• How monitoring is related to Session Layer protocols
  • ...and how that might integrate with a management framework
• Questions?
Related projects
• NetLogger: netlogger.lbl.gov
• perfSONAR: perfsonar.org
• XSP: damsl.cis.udel.edu/
• GENI: geni.net
• CEDPS: cedps-scidac.org