
Scalability and Accuracy in a Large-Scale Network Emulator


Presentation Transcript


  1. Scalability and Accuracy in a Large-Scale Network Emulator Amin Vahdat, Ken Yocum, Kevin Walsh, Priya Mahadevan, Dejan Kostić, Jeff Chase, and David Becker Duke University Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI 2002)

  2. Introduction • Evaluate Internet-scale distributed systems • E.g. peer-to-peer, overlay, wide-area replication • Realistic scenarios: real world • Difficult to deploy and administer • Results not reproducible or not necessarily representative of future behaviour • Simulations: e.g. NS • More control • May miss important system interactions • Emulation • Run unmodified code on target platforms • More control: can subject system traffic to constraints (bandwidth, latency, loss rate, topology, …) • Thus far limited to small and static systems → ModelNet

  3. Goal of ModelNet • Environment should support: • Unmodified applications • Reproducible results • Experimentation under broad range of network topologies and dynamically changing network characteristics • Large-scale experiments with large number of nodes and high traffic

  4. ModelNet Architecture • Scalable Internet emulation environment • Based on dummynet, extended to improve accuracy and include multi-hop and multi-core emulation • Edge nodes running user-specified OS and applications • Each instance is a virtual edge node (VN) with a unique IP in the emulated topology • Route traffic through core routers • Core nodes emulate behaviour of the configured target network • Captures effects of congestion and cross-traffic • Uses emulated links, or pipes
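
The sketch below is only an illustration, in Python for readability, of the kind of per-link parameters a dummynet-style pipe carries; the class and field names are assumptions made for this example, not ModelNet's actual data structures.

```python
# Illustrative sketch (not ModelNet code): the per-link parameters a
# dummynet-style pipe models for each hop of the emulated topology.
from dataclasses import dataclass

@dataclass
class Pipe:
    bandwidth_bps: float   # link capacity
    delay_s: float         # one-way propagation delay
    loss: float            # packet loss rate, 0.0 to 1.0
    queue_len: int         # queue capacity; packets beyond this are dropped
```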

  5. ModelNet Phases • CREATE • Generate network topology → GML graph(*) • Can use Internet traces, BGP dumps, synthetic topology generators • User can annotate graph to specify packet loss rates, failure distribution, etc. (*) GML – graph modeling language
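
As a concrete illustration of what CREATE produces, the sketch below builds a tiny annotated topology and writes it out as GML; the use of networkx and the attribute names are assumptions for the example, not part of ModelNet.

```python
# Hypothetical example of producing an annotated GML topology for CREATE.
# networkx and the attribute names are assumptions made for illustration.
import networkx as nx

g = nx.Graph()
# One 20 Mb/s router-to-router link and four 2 Mb/s access links to VNs.
g.add_edge("router0", "router1", bandwidth_kbps=20000, delay_ms=5, loss=0.0)
for i in range(4):
    router = "router0" if i < 2 else "router1"
    g.add_edge(f"vn{i}", router, bandwidth_kbps=2000, delay_ms=1, loss=0.001)

nx.write_gml(g, "target_topology.gml")  # annotated graph handed to DISTILL
```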

  6. ModelNet Phases • DISTILL • Transform GML graph to pipe topology to model target network • Simplify network • Trade accuracy for reduced emulation cost

  7. ModelNet Phases • ASSIGN • Map distilled topology to core nodes, load balancing • Ideal assignment is an NP-complete problem • Mapping pipes to cores depends on routing, link properties and traffic load • Use simple greedy k-clusters assignment • Randomly pick one node in the topology for each core node, then cores greedily select from connected nodes in round-robin
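
A minimal sketch of the greedy k-clusters idea, assuming the topology is given as an adjacency dict; this illustrates the strategy described on the slide, not ModelNet's implementation.

```python
# Sketch of greedy k-clusters assignment: each core seeds a cluster with a
# random node, then cores take turns claiming an unassigned neighbor.
import random
from collections import deque

def assign(topology, num_cores):
    """topology: dict mapping node -> set of neighboring nodes."""
    nodes = list(topology)
    seeds = random.sample(nodes, num_cores)
    owner = {seed: core for core, seed in enumerate(seeds)}
    frontiers = [deque(topology[seed]) for seed in seeds]
    unassigned = set(nodes) - set(seeds)

    while unassigned:
        progressed = False
        for core in range(num_cores):            # round-robin over cores
            frontier = frontiers[core]
            while frontier:
                node = frontier.popleft()
                if node in unassigned:
                    owner[node] = core
                    unassigned.discard(node)
                    frontier.extend(topology[node])
                    progressed = True
                    break
        if not progressed:                        # disconnected leftovers
            owner[unassigned.pop()] = 0
    return owner                                  # node -> core emulating its pipes
```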

  8. ModelNet Phases • BIND • Assign VNs to edge nodes • Can have multiple VNs per physical edge node • Bind each physical node to a single core • Install sets of pipes in distilled topology and routing tables with shortest-path between VN pairs • Configure edge nodes with IP addresses for each VN

  9. ModelNet Phases • RUN • Execute target applications on edge nodes

  10. The Core • Principal tasks (in steady state) • Receive packets from network interface • Move packets • Pipe to pipe • Pipe to final destination • Moving packets is strictly higher priority than receiving packets • Preferentially emulate packets already in core → core CPU saturation results in packets dropped at the physical level rather than inside the emulation

  11. The Core • Traffic routing • Emulate links as pipes • Pre-computed shortest paths for all VN pairs require O(n²) space • Route is an ordered list of pipes • Move packets through pipes by reference (packet descriptor)
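
The sketch below illustrates precomputing a route, as an ordered list of pipes, for every VN pair; networkx and the data layout are assumptions for the example only.

```python
# Illustration of per-pair routes stored as ordered lists of pipes (links).
from itertools import combinations
import networkx as nx

def build_routes(g, vns):
    """g: distilled topology as a networkx graph; vns: list of VN names."""
    routes = {}                                  # (src, dst) -> [pipe, pipe, ...]
    for src, dst in combinations(vns, 2):        # O(n^2) VN pairs
        path = nx.shortest_path(g, src, dst)     # list of nodes along the path
        pipes = list(zip(path, path[1:]))        # consecutive hops = pipes
        routes[(src, dst)] = pipes
        routes[(dst, src)] = [(b, a) for a, b in reversed(pipes)]
    return routes
```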

  12. The Core • Packet scheduling • Heap of pipes sorted by earliest deadline (exit time for first packet in queue) • Scheduler executes once per clock tick (10 kHz), runs at kernel’s highest priority • Find pipes with deadlines at or before the current time • Move packets to next destination (tail of next pipe or VN) • Calculate new deadlines and reinsert pipes into heap
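
The following is a simplified, user-space Python sketch of the earliest-deadline scheduling step described here; the real scheduler is a FreeBSD kernel module, and the deadline formula below is an assumption made for illustration.

```python
# Simplified earliest-deadline pipe scheduler (illustrative, not kernel code).
import heapq
import itertools

_seq = itertools.count()   # tie-breaker so equal deadlines never compare pipes

class Pipe:
    def __init__(self, delay_s, bandwidth_bps):
        self.delay_s = delay_s
        self.bandwidth_bps = bandwidth_bps
        self.queue = []    # FIFO of (packet_descriptor, size_bytes)

def tick(heap, now, forward):
    """One scheduler pass per clock tick: service pipes whose deadline passed."""
    while heap and heap[0][0] <= now:
        _, _, pipe = heapq.heappop(heap)
        pkt, _ = pipe.queue.pop(0)
        forward(pkt)       # to the tail of the next pipe, or to the destination VN
        if pipe.queue:     # new deadline = exit time of the pipe's next packet
            _, size = pipe.queue[0]
            deadline = now + pipe.delay_s + size * 8 / pipe.bandwidth_bps
            heapq.heappush(heap, (deadline, next(_seq), pipe))
```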

  13. The Core • Multi-core configuration • Next pipe may be on a different core node • Transfer packet descriptor to next node • Packet contents stay buffered at the entry core node and are forwarded to the destination edge node once the descriptor completes its emulated path

  14. Scalability Issues • Bandwidth limitation • Traffic through ModelNet core is limited to cluster’s physical internal bandwidth • Memory requirement • ModelNet must buffer up to full bandwidth-delay product of target network • Routing protocol • Assumes perfect routing protocol: shortest path between all pairs of hosts • Instantaneous discovery of new shortest path upon node or link failure
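
A back-of-the-envelope check of the buffering requirement, with made-up link numbers purely for illustration:

```python
# Worst case, the core holds up to the full bandwidth-delay product of the
# target network in packet buffers. Numbers below are invented for the example.
pipes = [{"bandwidth_bps": 20e6, "delay_s": 0.010}] * 500   # 500 links, 20 Mb/s, 10 ms

bytes_needed = sum(p["bandwidth_bps"] * p["delay_s"] / 8 for p in pipes)
print(f"{bytes_needed / 2**20:.1f} MiB of packet buffers")  # ~11.9 MiB here
```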

  15. Setup for Experiments • Core routers: • 1.4 GHz Pentium-IIIs w/ 1 GB memory • FreeBSD-4.5-STABLE • Connected via 1 Gb/s switch • Edge nodes: • 1 GHz Pentium-IIIs w/ 256 MB memory • Linux 2.4.17 • Connected via 100 Mb/s Ethernet

  16. Baseline Accuracy • Accurately emulate target packet characteristics on hop-by-hop basis • Use kernel logging to track performance and accuracy • Run ModelNet scheduler at highest kernel priority • Results: • Each hop accurately emulated to granularity of hardware timer (100 μs) • Maintains accuracy up to 100% CPU utilization • Future improvement: • adjust packet departure times at subsequent hops to correct for accumulated emulation errors

  17. Capacity • Quantify as function of load and # of hops • Single core • 1 Gb/s link • 1-5 edge nodes • Each with up to 24 netperf senders (24 VNs) and 24 receivers • 1 Gb/s Ethernet connection • For 1 hop: • At 120 flows CPU is 50% used • Network link is bottleneck • >4 hops • CPU is bottleneck

  18. Additional Cores • Deliver higher throughput • Increasing probability of a packet’s path crossing a node boundary → cross-core traffic • Introduces communication overhead • Ability to scale depends on • Application communication characteristics • Partitioning of topology (minimize cross-core traffic)

  19. VN Multiplexing • Mapping of VNs to physical edge nodes • Enables larger-scale emulations • Affects emulation accuracy and scalability • Context switch overhead • Scheduling behaviour • Resource contention at edge nodes

  20. Tradeoff: Accuracy vs. Scalability • Impractical to model every packet and link for large portion of Internet • Create controlled Internet-like execution context for applications • Reduce overhead by making approximations that minimally impact application behaviour • Ideally automate tradeoff to satisfy resource conditions and report degree of inaccuracy to user

  21. Distillation • Hop-by-hop emulation • Distilled topology isomorphic to target network • Accurate but highest per-packet cost • End-to-end emulation • Collapse each path to a single pipe → full mesh • Lowest overhead • Can capture raw network latency, bandwidth and loss rate • Cannot emulate link contention among competing flows
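
A minimal sketch of how an end-to-end collapse might compose the links along one path into a single pipe; the composition rules (sum of delays, minimum bandwidth, compounded loss) are assumptions made for illustration.

```python
# Illustrative end-to-end collapse of one VN-to-VN path into a single pipe.
def collapse(path_pipes):
    """path_pipes: list of dicts with 'delay_s', 'bandwidth_bps', 'loss'."""
    delay = sum(p["delay_s"] for p in path_pipes)             # latencies add up
    bandwidth = min(p["bandwidth_bps"] for p in path_pipes)   # bottleneck link
    deliver = 1.0
    for p in path_pipes:                                       # losses compound
        deliver *= 1.0 - p["loss"]
    return {"delay_s": delay, "bandwidth_bps": bandwidth, "loss": 1.0 - deliver}
```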

  22. Distillation • Walk-in • Preserve first walk-in links, replace interior by full mesh • Breadth-first traversal to find successive frontier sets (first frontier set is set of all VNs) • Each packet traverses at most (2*walk-in)+1 pipes • Cannot model contention in interior • Walk-out • Model under-provisioned core • Extend walk-in algorithm to preserve inner core • Find “topological center” by generating successive frontiers until one of size one or zero is found • Collapse paths between walk-in and walk-out
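
The frontier computation behind walk-in (and walk-out) distillation can be sketched as a breadth-first sweep from the VNs inward; this illustrates the traversal described above, not the actual implementation.

```python
# Successive frontier sets: start from all VNs and move one hop inward per step.
def frontiers(topology, vns):
    """topology: dict node -> set of neighbors; vns: iterable of VN names."""
    frontier, seen = set(vns), set(vns)
    while frontier:
        yield frontier
        nxt = {n for f in frontier for n in topology[f]} - seen
        seen |= nxt
        frontier = nxt

# Walk-in distillation would keep the pipes touched by the first `walk_in`
# frontiers and replace everything deeper in the topology with a full mesh.
```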

  23. Distillation • Ring topology • 20 routers • Interconnected at 20 Mb/s • 20 VNs connected to each router by 2 Mb/s links • VNs partitioned into generator and receiver sets • Each generator sends to random receiver • Hop by hop: 419 pipes • End to end: 79,800 pipes • Last-mile only: 400 edge links and 190 interior links
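
The pipe counts on this slide can be re-derived from the topology description (an end-to-end full mesh over 400 VNs, and last-mile distillation keeping the 400 access links plus a full mesh among the 20 routers):

```python
# Re-deriving the ring-topology pipe counts quoted above.
routers, vns_per_router = 20, 20
vns = routers * vns_per_router               # 400 VNs in total

print(vns * (vns - 1) // 2)                  # 79,800 pipes for end-to-end full mesh
print(vns, routers * (routers - 1) // 2)     # last-mile: 400 edge links, 190 interior
                                             # links (full mesh among the 20 routers)
```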

  24. Changing Network Characteristics • Evaluation of adaptive Internet systems • User can • directly incorporate generators for competing traffic • accurate for emulation of “background” cross traffic • consumes resources at edge nodes and bandwidth at core • modify pipe parameters during emulation to inject cross traffic dynamically • low overhead, scales independently of traffic rate • does not capture all details of Internet packet dynamics (e.g. slow start, bursty traffic) • not responsive to congestion → emulation error grows with link utilization level • Fault injection
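
A tiny sketch of the second option, shrinking a pipe's available bandwidth to stand in for background cross traffic; the field names and the simple subtraction rule are assumptions for illustration only.

```python
# Emulate background cross traffic by adjusting a pipe's parameters directly,
# instead of generating real packets (illustrative rule, not ModelNet's).
def apply_cross_traffic(pipe, background_bps):
    pipe["available_bps"] = max(0.0, pipe["bandwidth_bps"] - background_bps)

pipe = {"bandwidth_bps": 20e6}
apply_cross_traffic(pipe, 5e6)   # 5 Mb/s of emulated background load
print(pipe["available_bps"])     # 15 Mb/s left for foreground flows
```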

  25. Case Studies • Network of Gnutella clients • 10,000 nodes (100 VNs for each of the 100 edge nodes) • Support for emulation of ad hoc wireless environments • Implemented but not presented in this paper • CFS(1) • Able to reproduce results from CFS implementation running on RON(2) testbed (published by another group) • Replicated web services • Replay of trace to IBM’s main website • Able to show that one additional replica improves latency, third replica only marginally beneficial • Ability to emulate contention on interior links crucial for obtaining these results • Adaptive overlays • ACDC: overlay that adapts to changing network conditions • Similar experiment results obtained by ModelNet and ns2 (1) CFS - Cooperative File System (2) RON - Resilient Overlay Network (MIT)

  26. Related Work • Many other efforts on emulation • Mostly focus on specific, static and small-scale systems • Netbed (Emulab) • Similar to ModelNet, except that ModelNet focuses on scalable emulation of large-scale networks • Will integrate ModelNet efforts into Netbed • Competing research by WASP(1) project • Emulate network characteristics at end host • Requires emulation software on all edge nodes • Cannot capture congestion of multiple flows on single pipe (1) WASP – Wide Area Server Performance (J.-Y. Pan, H. Bhanoo, E. Nahum, M. Rosu, C. Faloutsos, and S. Seshan)

  27. Summary • ModelNet designed to support • Unmodified applications • Reproducible results • Broad range of network topologies and dynamically changing characteristics • Large-scale experiments • Provided means of balancing accuracy and cost • Presented case studies to show generality of approach
