Decentralised Load Balancing in Closed and Open Systems at A.J. Ganesh University of Bristol

Decentralised load balancing in closed and open systems A. J. Ganesh University of Bristol Joint work with S. Lilienthal, D. Manjunath, A. Proutiere and F. Simatos

Model • Fixed set of m servers • Closed system • Fixed set of n clients • Open system • Clients arrive according to independent Poisson processes of rates $\lambda_1,\dots,\lambda_m$ • Exponential job sizes, iid with unit mean • Service rates are $\mu_1,\dots,\mu_m$ • Processor sharing service discipline

Objective • Closed system • Balance the server loads • Open system • Maximise throughput • Minimise delay • Seek decentralised algorithms • a client can sample an arbitrary server and decide to move based on the loads at its current and sampled servers

Motivation • Dynamic spectrum access in wireless • servers are channels • Multipath TCP or dynamic routing • servers are routes • Route choice in transport networks • servers are routes • All are examples of congestion games • time to reach Nash equilibrium

Algorithm 1: Random local search (RLS) • Each client samples a server uniformly at random at the points of an independent unit-rate Poisson process • The client moves if doing so would strictly improve its individual service rate (= rate of server divided by its load)
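The RLS rule above can be simulated directly in the closed system. The following is a minimal sketch, not the authors' code: the function names `rls_closed` and `no_improving_move`, the `max_events` cap, and the choice to count the arriving client in the sampled server's load are all assumptions made for illustration.

```python
import random

def no_improving_move(loads, rates):
    """True if no client could strictly raise its service rate by moving."""
    m = len(loads)
    return not any(
        loads[i] > 0 and rates[j] * loads[i] > rates[i] * (loads[j] + 1)
        for i in range(m) for j in range(m)
    )

def rls_closed(loads, rates, rng, max_events=1_000_000):
    """Run RLS until no improving move exists; return (elapsed time, final loads)."""
    loads = list(loads)
    m, n = len(loads), sum(loads)
    t = 0.0
    for _ in range(max_events):
        if no_improving_move(loads, rates):
            return t, loads
        # n independent unit-rate clocks: the next ring comes after an Exp(n) time
        t += rng.expovariate(n)
        # the ringing client is uniform over clients, so its server i is load-weighted
        i = rng.choices(range(m), weights=loads)[0]
        j = rng.randrange(m)  # sampled server, uniform over all m servers
        # move only if rate_j / (load_j + 1) > rate_i / load_i (cross-multiplied)
        if rates[j] * loads[i] > rates[i] * (loads[j] + 1):
            loads[i] -= 1
            loads[j] += 1
    raise RuntimeError("event cap reached before equilibrium")

if __name__ == "__main__":
    print(rls_closed([12, 0, 0, 0], [1, 1, 1, 1], random.Random(0)))
```

With unit service rates the stopping condition reduces to the maximum and minimum loads differing by at most 1.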

Algorithm 2: Random load oblivious (RLO) • Clients are impatient and simply perform independent random walks over the servers until they leave • Random walk described by continuous time Markov chain with rate matrix Q and invariant distribution $\pi > 0$ • Moves are oblivious of server load
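For a concrete rate matrix Q, the invariant distribution of the clients' random walk can be approximated numerically. The sketch below uses uniformisation followed by power iteration in plain Python; the function name `invariant_distribution`, the iteration count and the example matrix are illustrative assumptions, and Q is assumed to be irreducible.

```python
def invariant_distribution(Q, iters=10_000):
    """Approximate pi with pi Q = 0 for an irreducible rate matrix Q (list of rows)."""
    m = len(Q)
    # uniformisation rate strictly above the largest exit rate keeps the chain aperiodic
    big = 1.1 * max(-Q[i][i] for i in range(m))
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / big for j in range(m)]
         for i in range(m)]
    pi = [1.0 / m] * m
    for _ in range(iters):
        # one step of the embedded discrete-time chain: pi <- pi P
        pi = [sum(pi[i] * P[i][j] for i in range(m)) for j in range(m)]
    return pi

if __name__ == "__main__":
    # hypothetical 3-server cycle 0 -> 1 -> 2 -> 0 with rates 1, 2, 3
    Q = [[-1.0, 1.0, 0.0],
         [0.0, -2.0, 2.0],
         [3.0, 0.0, -3.0]]
    print(invariant_distribution(Q))
```

For the cycle in the example, the invariant probabilities are proportional to the mean holding times 1, 1/2 and 1/3.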

Related work: a synchronous model • Berenbrink et al. (2005) • At each time step, each client picks a server at random • If load at current server is A and at new server is B<A, then the client moves with probability $(A-B)/A$
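One round of the synchronous rule described above might look as follows. This is a sketch only: the function name `synchronous_round` and the representation of the state as a list mapping each client to its server are assumptions, and all clients are taken to decide on the loads observed at the start of the round.

```python
import random

def synchronous_round(assignment, m, rng):
    """One synchronous round; assignment[c] is the server currently used by client c."""
    loads = [0] * m
    for server in assignment:
        loads[server] += 1
    new_assignment = list(assignment)
    for client, current in enumerate(assignment):
        sampled = rng.randrange(m)          # server picked uniformly at random
        a, b = loads[current], loads[sampled]
        # migrate with probability (A - B) / A, and only when B < A
        if b < a and rng.random() < (a - b) / a:
            new_assignment[client] = sampled
    return new_assignment

if __name__ == "__main__":
    print(synchronous_round([0, 0, 0, 0, 1, 2], 3, random.Random(1)))
```

Because every client acts on the same snapshot of the loads, several clients can jump to the same lightly loaded server in one round; the migration probability dampens this.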

Previous results: Closed systems • Expected time to reach load balance in asynchronous model is • $O(m^2)$: Goldberg (2004) • Expected time to reach balance in synchronous model is • $O(\log\log(m) + n^4)$: Berenbrink et al. (2005) • $O(\log(m) + n\log(n))$: Berenbrink et al. (2007)

Our results • Closed systems • Time to reach perfect balance is $O(m^2\log(m)/n + \log^2(m))$ • Time to reach $\epsilon$-balance is $O(\log(m)/\epsilon)$ • Open systems • Both RLS and RLO are throughput maximising: • system is stable whenever $\sum_i \lambda_i < \sum_i \mu_i$

Notation and definitions • $N(t) = (N_1(t),\dots,N_m(t))$: number of clients at servers 1,…,m at time t • N(t) is balanced if $|N_i(t)-N_j(t)| \le 1$ for all i and j • N(t) is $\epsilon$-balanced if $(1-\epsilon)p \le N_i(t) \le (1+\epsilon)p$ for all i, where $p=n/m$ • $\tau$ = time to reach balance • $\tau_\epsilon$ = time to reach $\epsilon$-balance
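The two notions of balance defined above translate into short predicates on the load vector. A minimal sketch, with the function names `is_balanced` and `is_eps_balanced` chosen here for illustration:

```python
def is_balanced(loads):
    """Perfect balance: any two server loads differ by at most 1."""
    return max(loads) - min(loads) <= 1

def is_eps_balanced(loads, eps):
    """eps-balance: every load lies within a factor (1 +/- eps) of the mean p = n/m."""
    p = sum(loads) / len(loads)
    return all((1 - eps) * p <= x <= (1 + eps) * p for x in loads)

if __name__ == "__main__":
    print(is_balanced([3, 3, 4]), is_eps_balanced([9, 10, 11], 0.2))
```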

Notation and definitions • $V(t) = \max_j N_j(t)$ • $C_v(t)$ = number of servers with exactly v clients at time t • $B_v(t) = C_{v-1}(t)$ • $A_v(t)$ = number of servers with $v-2$ or fewer clients at time t
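The quantities on this slide can be read off a load vector as follows. The helper name `phase_statistics` is an assumption for illustration; it evaluates the counts at the current maximum load v = V(t).

```python
def phase_statistics(loads):
    """Return (v, C_v, B_v, A_v) for the current maximum load v."""
    v = max(loads)
    c_v = sum(1 for x in loads if x == v)        # servers with exactly v clients
    b_v = sum(1 for x in loads if x == v - 1)    # B_v = C_{v-1}
    a_v = sum(1 for x in loads if x <= v - 2)    # servers with v-2 or fewer clients
    return v, c_v, b_v, a_v

if __name__ == "__main__":
    print(phase_statistics([5, 5, 4, 3, 1]))
```

The three counts always sum to the number of servers m, since every server holds v, v-1, or at most v-2 clients.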

Results for closed systems • $E[\tau] = O(m^2 \log(m)/n + \log^2(m))$ • $E[\tau_\epsilon] = O(\log(m)/\epsilon)$ • $E[\tau] = \Omega(m^2/n + \log(m))$ • Typically interested in $n \gg m$

Proof (perfect balance) • Previous work used quadratic Lyapunov functions • We use V(t) as Lyapunov function • Say RLS algorithm is in phase v at time t if V(t)=v • $C_v(t)$ decreases monotonically during phase v • Phase v ends when $C_v(t)$ hits 0

Proof (cont.) • $C_v$ decreases by 1 when one of the $vC_v$ clients at a maximally loaded server samples one of the $A_v$ servers with $v-2$ or fewer clients • This happens at rate $vC_vA_v/m$ • Lower bound for $A_v$: no more than $n/(v-1)$ servers can have $v-1$ or more clients • Implies upper bound on mean time for • $C_v$ to decrease by 1 • and hence for V to decrease by 1
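The counting argument above can be turned into an explicit bound on the mean length of a phase. The following is a sketch under stated assumptions: unit service rates, and $v - 1 > n/m$ so that the lower bound on $A_v$ is positive. The bookkeeping is illustrative and is not taken from the slides.

```latex
% A_v counts servers with at most v-2 clients; at most n/(v-1) servers hold v-1 or more
A_v \;\ge\; m - \frac{n}{v-1}

% While C_v = c, the next decrease of C_v occurs at rate v c A_v / m, hence
\mathbb{E}\bigl[\text{time for } C_v \text{ to drop from } c \text{ to } c-1\bigr]
  \;\le\; \frac{m}{v\,c\,\bigl(m - \tfrac{n}{v-1}\bigr)}

% Summing over c = 1, ..., C_v with C_v <= m bounds the mean duration of phase v
\mathbb{E}\bigl[\text{duration of phase } v\bigr]
  \;\le\; \frac{m\,(1 + \ln m)}{v\,\bigl(m - \tfrac{n}{v-1}\bigr)}
```

The factor $1 + \ln m$ comes from the harmonic sum $\sum_{c=1}^{m} 1/c$. The bound degenerates as $v - 1$ approaches $n/m$, so the final phases need a finer argument than this sketch provides.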

Proof (-balance) • Involves counting the number of -balanced, underloaded and overloaded servers, • and the number of clients at overloaded servers, • and using these to bound the expected time till one such client moves to an underloaded or -balanced server

Stability results for open systems • If $\sum_i \lambda_i < \sum_i \mu_i$, then the system is stable under both RLO and RLS policies
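The stability claim can be explored with a small event-driven simulation of the open system under RLS. This is a sketch only, under assumptions that go beyond the slides: clients of stream i join server i on arrival, the state is the vector of loads (sufficient because job sizes are exponential and service is processor sharing), and the names `simulate_open_rls` and `horizon` are hypothetical.

```python
import random

def simulate_open_rls(lam, mu, horizon, rng):
    """Simulate the open system under RLS; return (final loads, time-average clients)."""
    m = len(lam)
    loads = [0] * m
    t, area = 0.0, 0.0
    arrival_rate = sum(lam)
    while True:
        n = sum(loads)
        # under processor sharing, a busy server i completes jobs at total rate mu[i]
        busy = [mu[i] if loads[i] > 0 else 0.0 for i in range(m)]
        service_rate = sum(busy)
        total = arrival_rate + service_rate + n   # n unit-rate sampling clocks
        dt = rng.expovariate(total)
        if t + dt > horizon:
            area += n * (horizon - t)
            return loads, area / horizon
        area += n * dt
        t += dt
        u = rng.random() * total
        if u < arrival_rate:                      # arrival at server i w.p. lam[i]/sum(lam)
            loads[rng.choices(range(m), weights=lam)[0]] += 1
        elif u < arrival_rate + service_rate:     # departure from a busy server
            loads[rng.choices(range(m), weights=busy)[0]] -= 1
        else:                                     # a client samples a server (RLS step)
            i = rng.choices(range(m), weights=loads)[0]
            j = rng.randrange(m)
            if mu[j] * loads[i] > mu[i] * (loads[j] + 1):
                loads[i] -= 1
                loads[j] += 1

if __name__ == "__main__":
    print(simulate_open_rls([1.5, 0.0], [1.0, 1.0], 2000.0, random.Random(0)))
```

In the usage example the first server alone is overloaded (arrival rate 1.5 against service rate 1), yet the total arrival rate is below the total service rate, so migration is what keeps the queue lengths bounded.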

Proof of stability for RLO algorithm • Proof uses Foster’s criterion, with the total number of clients in the system as Lyapunov function • Denote by |x| the L1-norm of vector x • |N(t)| is the total number of clients in system at time t • $|\lambda|$ is the total arrival rate • $|\mu|$ is the maximum service rate

Foster’s criterion • Suppose there exist K, $\epsilon > 0$ and $t>0$ such that $E_n[|N(t)| - |n|] < -\epsilon$ for all n with $|n|>K$ • Then N(t) is ergodic

Bounding the drift • $E_n[|N(t)| - |n|] = |\lambda| t - \sum_i \mu_i E[Y_i(t)]$ • where $Y_i(t)$ is the time up to time t that server i is non-idle (has at least 1 client) • If $E[Y_i(t)]$ is very nearly equal to t, then have Foster’s criterion from condition $|\lambda| < |\mu|$ • Need a lower bound on $Y_i(t)$ to get an upper bound on the drift
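To make the step from busy times to negative drift explicit, suppose every server is busy for at least a fraction $1-\delta$ of the interval in expectation. The symbol $\delta$ is introduced here purely for illustration and does not appear in the slides.

```latex
% Assumption: E[Y_i(t)] >= (1 - \delta) t for every server i
\mathbb{E}_n\bigl[\,|N(t)| - |n|\,\bigr]
  \;=\; |\lambda|\,t - \sum_{i=1}^{m} \mu_i\,\mathbb{E}[Y_i(t)]
  \;\le\; \bigl(|\lambda| - (1-\delta)\,|\mu|\bigr)\,t

% The right-hand side is strictly negative as soon as
\delta \;<\; 1 - \frac{|\lambda|}{|\mu|}
```

Such a $\delta$ exists exactly when the total arrival rate is below the total service rate, which is why the remaining work in the proof is a lower bound on the busy times.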

Bounding the idle time • Clients perform independent random walks on system, but don’t leave • Independent rate-$\mu_i$ Poisson processes of ‘virtual’ services at servers • If number of clients at server i at time t is more than the total number of virtual services at all servers on [0,t], then queue i has to be non-empty at time t

Bounding the idle time (cont.) • Suppose |n| is large • Markov chain describing random walks reaches equilibrium in constant time • Number of clients at each server is $\Omega(|n|)$ from this time onwards • Number of virtual services is O(1)

Proof of stability for RLS algorithm • Uses a slightly different Lyapunov function $f(n) = |n| + \epsilon k(n)$ • for suitably small $\epsilon>0$, where k(n) is the number of empty servers in state n

Performance estimates in open systems • Consider large m asymptotics • $X_k^m(t)$: proportion of servers with exactly k clients at time t • $X^m(t)$ evolves as density dependent Markov process • By Kurtz’s theorem, evolution converges to solution of deterministic differential equation over finite time-horizons

Performance estimates in open systems (cont.) • Idea: look at equilibrium points of deterministic dynamics • If there is a unique stable equilibrium, we expect that stochastic dynamics will live in vicinity of this equilibrium • Use the equilibrium to derive performance measures

In more detail … • Kurtz’s theorem only applies for finite time horizons • Doesn’t tell us about long-term behaviour • Can get around this by using propagation of chaos techniques developed by Sznitman

Numerical results • Asymptotic estimates pretty accurate even for small m, say m =10 • RLO is only a little bit worse than RLS in terms of mean delay (about 20% worse in parameter range considered)

Conclusions • Random local search balances loads very quickly in closed systems • polylog in number of servers • Impatience is a virtue • impatient customers help to balance load and achieve resource pooling, even if they migrate oblivious of load

Open problems • Have assumed all clients can use all servers, and also that they can move between any pair of servers • What if clients can only move from a server to its neighbours in some graph? • What if clients are of different types, and each type can only use a subset of the servers?

Open problems • Suppose clients can only migrate to neighbouring servers in a graph • Can the time to balance loads be related to mixing times of random walks on this graph?

Open problems • Performance measures in open systems obtained in terms of equilibrium points of a differential equation • Is there perfect resource pooling in heavy traffic limit? • Can we get tail bounds on delays? • What if clients can use multiple servers simultaneously?
