This workshop talk explores methods for managing cloud resources through distributed rate limiting. With clients and resources distributed across the globe, services need centralized-style control over aggregate resource consumption. The talk details architectures such as token-bucket limiters, a global token bucket, global random drop, and flow proportional share, emphasizing fairness across flow rates and responsiveness to shifting application demand. It covers the engineering trade-offs and presents experiments showing how effectively the various rate-limiting strategies deliver equitable resource usage across multiple TCP flows.
Managing Cloud Resources: Distributed Rate Limiting
Alex C. Snoeren
Kevin Webb, Bhanu Chandra Vattikonda, Barath Raghavan, Kashi Vishwanath, Sriram Ramabhadran, and Kenneth Yocum
Building and Programming the Cloud Workshop
13 January 2010
Centralized Internet services
• Hosting with a single physical presence
• However, clients are across the Internet
Cloud-based services
• Resources and clients distributed across the world
• Often incorporate resources from multiple providers (e.g., Windows Live)
Resources in the Cloud
• Distributed resource consumption
• Clients consume resources at multiple sites
• Metered billing is state-of-the-art
• Service “punished” for popularity
• Those unable to pay are disconnected
• No control of resources used to serve increased demand
• Overprovision and pray
• Application designers typically cannot describe needs
• Individual service bottlenecks varied but severe
• IOps, network bandwidth, CPU, RAM, etc.
• Need a way to balance resource demand
Two lynchpins for success
• Need a way to control and manage distributed resources as if they were centralized
• All current models from the OS scheduling and provisioning literature assume full knowledge and absolute control
• (This talk focuses specifically on network bandwidth)
• Must be able to efficiently support rapidly evolving application demand
• Match resource needs to the hardware realization automatically, without application-designer input
• (Another talk if you’re interested)
Ideal: Emulate a single limiter
• Make distributed feel centralized
• Packets should experience the same limiter behavior
[Diagram: source–destination pairs at several sites pass through a set of limiters connected by 0 ms links, together behaving like one limiter]
Engineering tradeoffs
Accuracy (how close to K Mbps is delivered; flow rate fairness) + Responsiveness (how quickly demand shifts are accommodated) vs. Communication Efficiency (how much and how often rate limiters must communicate)
An initial architecture
[Diagram: Limiters 1–4 exchanging gossip messages]
• On packet arrival: enforce the local limit
• On the estimate-interval timer: estimate local demand, gossip with the other limiters, compute global demand, and set the local allocation
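As a rough illustration of this control loop (a minimal sketch, not the authors' implementation), the Python fragment below assumes a hypothetical `gossip_exchange` helper that returns the other limiters' demand estimates; each limiter periodically estimates local demand, aggregates a global figure, and resets its local allocation.

```python
import time

GLOBAL_LIMIT_MBPS = 10.0   # aggregate limit shared by all limiters
ESTIMATE_INTERVAL = 0.5    # seconds between estimate/gossip rounds

class Limiter:
    def __init__(self):
        self.bytes_seen = 0                        # bytes observed since the last estimate
        self.local_allocation = GLOBAL_LIMIT_MBPS  # Mbps this limiter may currently admit

    def on_packet(self, size_bytes):
        # Enforcement path: count arrivals; actual limiting would use a
        # token bucket filled at self.local_allocation.
        self.bytes_seen += size_bytes

    def estimate_loop(self, gossip_exchange):
        while True:
            time.sleep(ESTIMATE_INTERVAL)
            # 1. Estimate local demand (Mbps) over the last interval
            local_demand = self.bytes_seen * 8 / ESTIMATE_INTERVAL / 1e6
            self.bytes_seen = 0
            # 2. Gossip: learn the other limiters' demand estimates (hypothetical helper)
            remote_demands = gossip_exchange(local_demand)
            global_demand = local_demand + sum(remote_demands)
            # 3. Set the allocation in proportion to this limiter's share of global demand
            if global_demand > 0:
                self.local_allocation = GLOBAL_LIMIT_MBPS * local_demand / global_demand
```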
Token bucket limiters
[Diagram: packets drain a token bucket filled at rate K Mbps]
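For concreteness, here is a minimal token-bucket sketch (standard textbook behavior, not code from the talk): tokens accrue at the fill rate up to the bucket depth, and a packet is forwarded only when enough tokens are available.

```python
import time

class TokenBucket:
    def __init__(self, fill_rate_bps, depth_bytes):
        self.fill_rate = fill_rate_bps / 8.0   # bytes of credit per second
        self.depth = depth_bytes               # maximum burst size
        self.tokens = depth_bytes
        self.last = time.monotonic()

    def allow(self, packet_bytes):
        now = time.monotonic()
        # Accrue tokens for the elapsed time, capped at the bucket depth
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.fill_rate)
        self.last = now
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes   # forward: spend the tokens
            return True
        return False                      # drop (or queue) the packet

# Example: a 5 Mbps limiter with a 15 KB burst allowance
bucket = TokenBucket(fill_rate_bps=5_000_000, depth_bytes=15_000)
```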
A global token bucket (GTB)?
[Diagram: Limiter 1 and Limiter 2 exchange demand info (bytes/sec) so each can drain its bucket as if all arrivals were local]
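One way to read the GTB idea (a sketch under the assumption that each limiter learns remote byte counts from the demand-info exchange): every limiter fills a bucket at the full global rate K, but drains it for reported remote arrivals as well as local packets, so stale reports directly distort the bucket state.

```python
import time

class GlobalTokenBucket:
    """Sketch of GTB: a bucket filled at the global rate K that is drained by
    local packets and by byte counts reported from the other limiters."""
    def __init__(self, fill_rate_bps, depth_bytes):
        self.fill_rate = fill_rate_bps / 8.0   # bytes of credit per second
        self.depth = depth_bytes
        self.tokens = depth_bytes
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.fill_rate)
        self.last = now

    def on_remote_report(self, remote_bytes):
        # Demand info from another limiter: drain tokens for its arrivals too
        self._refill()
        self.tokens = max(0.0, self.tokens - remote_bytes)

    def allow(self, packet_bytes):
        self._refill()
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes
            return True
        return False
```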
A baseline experiment
• Reference: 10 TCP flows through a single token bucket
• Distributed: 3 TCP flows through Limiter 1, 7 TCP flows through Limiter 2
GTB performance
[Plots: flow rates under the global token bucket vs. a single token bucket, for the 10-, 7-, and 3-flow aggregates]
Problem: GTB requires near-instantaneous arrival info
Take 2: Global Random Drop
• Limiters send and collect global rate info from the others
• Case 1: global arrival rate (e.g., 4 Mbps) below the 5 Mbps limit → forward the packet
Global Random Drop (GRD)
• Case 2: global arrival rate (e.g., 6 Mbps) above the 5 Mbps limit → drop with probability excess / global arrival rate = 1/6, the same at all limiters
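A minimal sketch of the GRD forwarding decision described above (the global rate estimate is assumed to come from the gossip layer): every limiter applies the same drop probability, the excess over the global limit divided by the global arrival rate.

```python
import random

def grd_forward(global_rate_mbps, global_limit_mbps):
    """Decide whether to forward one packet under Global Random Drop."""
    if global_rate_mbps <= global_limit_mbps:
        return True                             # Case 1: below the limit, always forward
    excess = global_rate_mbps - global_limit_mbps
    drop_prob = excess / global_rate_mbps       # Case 2: e.g., (6 - 5) / 6 = 1/6
    return random.random() >= drop_prob

# Example from the slide: 6 Mbps of global arrivals against a 5 Mbps limit
forwarded = grd_forward(global_rate_mbps=6.0, global_limit_mbps=5.0)
```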
GRD baseline performance
[Plots: flow rates under GRD vs. a single token bucket, for the 10-, 7-, and 3-flow aggregates]
Delivers flow behavior similar to a central limiter
GRD under dynamic arrivals (50-ms estimate interval)
Returning to our baseline
[Diagram: 3 TCP flows through Limiter 1, 7 TCP flows through Limiter 2]
Basic idea: flow counting
• Goal: provide inter-flow fairness for TCP flows
• Local token-bucket enforcement; limiters report flow counts (“3 flows” at Limiter 1, “7 flows” at Limiter 2)
Estimating TCP demand
• Local token rate (limit) = 10 Mbps
• Flow A = 5 Mbps, Flow B = 5 Mbps (one TCP flow each)
• Flow count = 2 flows
FPS (Flow Proportional Share) under dynamic arrivals (500-ms estimate interval)
Comparing FPS to GRD
[Plots: FPS (500-ms est. int.) vs. GRD (50-ms est. int.)]
• Both are responsive and provide similar utilization
• GRD requires accurate estimates of the global rate at all limiters
Estimating skewed demand
[Diagram: 2 TCP flows through Limiter 1, 3 TCP flows through Limiter 2]
Estimating skewed demand
• Local token rate (limit) = 10 Mbps
• Flow A = 8 Mbps; Flow B = 2 Mbps (bottlenecked elsewhere)
• Flow count ≠ demand
• Key insight: use a TCP flow’s rate to infer demand
Estimating skewed demand
• Local token rate (limit) = 10 Mbps; Flow A = 8 Mbps; Flow B = 2 Mbps (bottlenecked elsewhere)
• Effective flow count = Local Limit / Largest Flow’s Rate = 10 / 8 = 1.25 flows
FPS example
• Global limit = 10 Mbps; Limiter 1 weight = 1.25 flows, Limiter 2 weight = 3 flows
• Set local token rate = global limit × local flow count / total flow count = 10 Mbps × 1.25 / (1.25 + 3) = 2.94 Mbps
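Putting the last two slides together, a hedged sketch of the FPS calculation (simplified from the slides' example; the full estimator also handles the case where no flow is bottlenecked elsewhere): estimate a limiter's weight in "flows" from its largest flow's rate, then set the local token rate in proportion to that weight.

```python
def fps_flow_weight(local_limit_mbps, largest_flow_mbps):
    """Per the skewed-demand example: when the largest flow does not fill the
    local limit (other flows are bottlenecked elsewhere), estimate demand as
    local limit / largest flow rate, measured in "flows"."""
    return local_limit_mbps / largest_flow_mbps

def fps_local_rate(global_limit_mbps, local_weight, all_weights):
    """Set the local token rate in proportion to this limiter's flow weight."""
    return global_limit_mbps * local_weight / sum(all_weights)

# Numbers from the example: Limiter 1 sees an 8 Mbps largest flow under a
# 10 Mbps local limit (weight 1.25); Limiter 2 reports 3 unbottlenecked flows.
w1 = fps_flow_weight(local_limit_mbps=10.0, largest_flow_mbps=8.0)  # 1.25
rate1 = fps_local_rate(10.0, w1, [w1, 3.0])                         # ≈ 2.94 Mbps
```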
FPS bottleneck example
• Initially, a 3:7 split between 10 unbottlenecked flows
• At 25 s, the 7-flow aggregate is bottlenecked to 2 Mbps
• At 45 s, an unbottlenecked flow arrives: a 3:1 split of the remaining 8 Mbps
Real-world constraints
• Resources spent tracking usage are pure overhead
• Efficient implementation (<3% CPU, sample & hold)
• Modest communication budget (<1% bandwidth)
• Control channel is slow and lossy
• Need to extend gossip protocols to tolerate loss
• An interesting research problem on its own…
• The nodes themselves may fail or partition
• In an asynchronous system, you cannot tell the difference
• Need a mechanism that deals gracefully with both
Robust control communication
• 7 limiters enforcing a 10 Mbps limit
• Demand fluctuates every 5 sec between 1 and 100 flows
• Varying loss on the control channel
Handling partitions
• Failsafe operation: each disconnected group of k limiters (out of n) enforces k/n of the limit
• Ideally: the Bank-o-mat problem (credit/debit scheme)
• Challenge: group membership with asymmetric partitions
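As a sketch of the k/n failsafe named above (my reading of the slide, not the authors' code): a disconnected group of k limiters out of n falls back to enforcing k/n of the global limit until connectivity returns.

```python
def partition_limit(global_limit_mbps, partition_size, total_limiters):
    """k/n failsafe: a disconnected group of k limiters (out of n configured)
    collectively enforces k/n of the global limit; within the group, the normal
    allocation scheme (e.g., FPS) still applies."""
    return global_limit_mbps * partition_size / total_limiters

# Example: a 3-limiter partition out of 7 limiters, under a 10 Mbps aggregate limit
group_limit = partition_limit(10.0, partition_size=3, total_limiters=7)  # ≈ 4.29 Mbps
```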
Following PlanetLab demand
• Apache Web servers on 10 PlanetLab nodes
• 5 Mbps aggregate limit
• Shift load over time from 10 nodes to 4
Current limiting options
[Plot: demands at 10 Apache servers on PlanetLab; static per-node limits leave wasted capacity when demand shifts to just 4 nodes]
Applying FPS on PlanetLab
Hierarchical limiting
A sample use case
• T = 0: A: 5 flows at L1
• T = 55: A: 5 flows at L2
• T = 110: B: 5 flows at L1
• T = 165: B: 5 flows at L2
Worldwide flow join
• 8 nodes split between UCSD and Polish Telecom
• 5 Mbps aggregate limit
• A new flow arrives at each limiter every 10 seconds
Worldwide demand shift
• Same demand-shift experiment as before
• At 50 sec, Polish Telecom demand disappears
• Reappears at 90 sec
Where to go from here
• Need to “let go” of full control and make decisions with only a “cloudy” view of actual resource consumption
• Distinguish between what you know and what you don’t know
• Operate efficiently when you know you know
• Have failsafe options when you know you don’t
• Moreover, we cannot rely upon application/service designers to understand their resource demands
• The system needs to dynamically adjust to shifts
• We’ve started to manage the demand equation
• We’re now focusing on the supply side: custom-tailored resource provisioning