This paper presents an asymptotically optimal algorithm for the max k-armed bandit problem, a variant of the classic bandit problem in which one aims to maximize the highest single payoff obtained from k slot machines, each generating payoffs from an unknown distribution. The authors, Matthew Streeter and Stephen Smith of Carnegie Mellon University, discuss the problem statement, the motivations for the approach, and how payoff distributions are modeled. The proposed algorithm guarantees performance asymptotically matching that of the best single arm, using the generalized extreme value (GEV) distribution to model payoffs and guide decision-making under uncertainty.
An Asymptotically Optimal Algorithm for the Max k-Armed Bandit Problem
Matthew Streeter & Stephen Smith
Carnegie Mellon University
NESCAI, April 29, 2006
Outline • Problem statement & motivations • Modeling payoff distributions • An asymptotically optimal algorithm
The k-Armed Bandit
[Figure: machines 1, 2, 3 with payoff distributions D1, D2, D3]
• You are in a room with k slot machines
• Pulling the arm of machine i returns a payoff drawn (independently at random) from unknown distribution Di
• Allowed n total pulls
• Goal: maximize total payoff
• > 50 years of papers
The Max k-Armed Bandit
[Figure: machines 1, 2, 3 with payoff distributions D1, D2, D3]
• You are in a room with k slot machines
• Pulling the arm of machine i returns a payoff drawn (independently at random) from unknown distribution Di
• Allowed n total pulls
• Goal: maximize highest payoff
• Introduced ~2003
The Max k-Armed Bandit: Motivations
[Figure: heuristics (Tabu Search, Hill Climbing, Simulated Annealing) with solution-quality distributions D1, D2, D3]
• Given: some optimization problem, k randomized heuristics
• Each time you run a heuristic, get a solution with a certain quality
• Allowed n runs
• Goal: maximize quality of best solution
• Assumption: each run has the same computational cost
• Cicirello & Smith (2005) show competitive performance on RCPSP
The Max k-Armed Bandit: Example • Given n pulls, what strategy maximizes the (expected) maximum payoff? • If n=1, should pull arm 1 (higher mean) • If n=1000, should pull arm 2 (higher variance)
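A minimal Monte Carlo sketch of this tradeoff (my illustration, not from the talk), assuming two hypothetical Gaussian arms: arm 1 with higher mean, arm 2 with higher variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical arms: (mean, std). Arm 1: higher mean; arm 2: higher variance.
arms = {1: (1.0, 0.1), 2: (0.0, 1.0)}

def expected_max(mean, std, n_pulls, trials=10_000):
    """Monte Carlo estimate of E[max payoff] from pulling one arm n_pulls times."""
    samples = rng.normal(mean, std, size=(trials, n_pulls))
    return samples.max(axis=1).mean()

for n in (1, 1000):
    best = max(arms, key=lambda i: expected_max(*arms[i], n))
    print(f"n={n}: pull arm {best}")  # n=1 -> arm 1, n=1000 -> arm 2
```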
Can’t Handle Arbitrary Payoff Distributions
• Needle in the haystack: with all-or-nothing payoffs (e.g., each arm pays 1 with some tiny unknown probability and 0 otherwise), you can’t distinguish the arms until you get a payoff > 0, at which point the highest payoff can’t be improved
Assumption
• We will assume each machine returns payoffs drawn from a generalized extreme value (GEV) distribution
• Why? Extremal Types Theorem: the maximum of n independent draws from some fixed distribution converges in distribution to a GEV
• Compare to the Central Limit Theorem: the sum of n draws converges in distribution to a Gaussian
The GEV distribution
• Z has a GEV distribution if
  Pr[Z ≤ z] = exp(−(1 + s(z − μ)/σ)^(−1/s))
  for constants s, μ, and σ > 0 (the s = 0 case is read as the limit exp(−e^(−(z − μ)/σ)))
• μ determines the mean
• σ determines the standard deviation
• s determines the shape
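A short Python sketch of this CDF (my illustration, not from the talk), treating s = 0 as the Gumbel limit:

```python
import numpy as np

def gev_cdf(z, mu, sigma, s):
    """Pr[Z <= z] for a GEV with location mu, scale sigma > 0, shape s."""
    t = (np.asarray(z, dtype=float) - mu) / sigma
    if s == 0:
        return np.exp(-np.exp(-t))  # Gumbel limit
    u = 1 + s * t
    inside = np.exp(-np.maximum(u, 1e-300) ** (-1.0 / s))
    # Outside the support (u <= 0): CDF is 0 below it (s > 0), 1 above it (s < 0).
    return np.where(u > 0, inside, 0.0 if s > 0 else 1.0)
```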
Example payoff distribution: Job Shop Scheduling • Job shop scheduling: assign start times to operations, subject to constraints. • Length of schedule = latest completion time of any operation • Goal: find a schedule with minimum length • Many heuristics (branch and bound, simulated annealing...)
Example payoff distribution: Job Shop Scheduling • “ft10” is a notorious instance of the job shop scheduling problem • Heuristic h: do hill-climbing 500 times • Ran h 1000 times on ft10; fit GEV to payoff data
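A sketch of such a fit using SciPy (an assumption; the talk doesn't name its fitting software, and the file name below is a placeholder). Note that SciPy's genextreme uses shape c = −s relative to the slides' convention:

```python
import numpy as np
from scipy.stats import genextreme

# Payoff = -(schedule length), so that maximizing payoff minimizes length.
lengths = np.loadtxt("ft10_schedule_lengths.txt")  # placeholder: 1000 runs of h
payoffs = -lengths

c, mu, sigma = genextreme.fit(payoffs)  # SciPy's shape c corresponds to s = -c
print(f"fitted GEV: s={-c:.3f}, mu={mu:.1f}, sigma={sigma:.2f}")
```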
Example payoff distribution: Job Shop Scheduling
[Figure: fitted GEV over empirical payoff data (probability vs. −(schedule length)), and E[max. payoff] vs. num. runs]
• Distribution truncated at 931; optimal schedule length = 930 (Carlier & Pinson, 1986)
• Best of 50,000 sampled schedules has length 1014
Notation
• mi(t) = expected maximum payoff you get from pulling the ith arm t times
• m*(t) = max1≤i≤k mi(t)
• S(t) = expected maximum payoff you get by following strategy S for t pulls
The Algorithm
• Strategy S* (δ and ε to be determined):
  • For i from 1 to k:
    • Using D pulls, estimate mi(n). Pick D so that with probability 1 − δ, the estimate is within ε of the true mi(n).
  • For the remaining n − kD pulls:
    • Pull the arm with max. estimated mi(n)
• Guarantee: S*(n) = m*(n) − o(1)
Behavior of the GEV
[Figure: three panels showing the GEV for s > 0 (“lots of algebra”), s = 0 (“not so bad”), and s < 0]
Predicting mi(n)
[Figure: empirical mi(1) and mi(2) on a log-t axis, with the fitted line extended to the predicted mi(n)]
• Estimation procedure: linear interpolation!
• Estimate mi(1) and mi(2), then interpolate to get mi(n)
Predicting mi(n): Lemma
• Let X be a random variable with (unknown) mean μ and standard deviation σ ≤ σmax. O(ε^−2 log δ^−1) samples of X suffice to obtain an estimate such that, with probability at least 1 − δ, the estimate is within ε of the true mean.
• Proof idea: use “median of means”
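A minimal sketch of a median-of-means estimator (my illustration of the proof idea, not the authors' code): split the samples into groups, average each group, and report the median of the group means.

```python
import numpy as np

def median_of_means(samples, n_groups):
    """Median-of-means estimate of E[X].

    Each group mean lands within eps of the true mean with constant
    probability (Chebyshev), so the median fails only if about half the
    groups fail -- probability exp(-Omega(n_groups)) by a Chernoff bound.
    Taking n_groups = O(log(1/delta)) gives the lemma's guarantee.
    """
    groups = np.array_split(np.asarray(samples, dtype=float), n_groups)
    return float(np.median([g.mean() for g in groups]))

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=4000)  # skewed samples, true mean 2.0
print(median_of_means(x, n_groups=20))
```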
Predicting mi(n)
[Figure: empirical mi(1) and mi(2), with the line extended to the predicted mi(n)]
• Equation for the line: mi(n) = mi(1) + [mi(2) − mi(1)] log₂ n
• Estimating mi(n) requires O((log n)^2 ε^−2 log δ^−1) pulls
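A sketch of the interpolation estimator for the s = 0 (Gumbel) case, where mi(t) = mi(1) + σ ln t; this is my reconstruction, and it uses plain sample means where the lemma calls for median-of-means:

```python
import numpy as np

def predict_m_n(pull, n, n_samples=1000):
    """Predict mi(n) by linear extrapolation in log2(t).

    pull: zero-argument function returning one payoff from the arm.
    Since mi(2) - mi(1) = sigma*ln(2) for a Gumbel arm,
    mi(n) = mi(1) + [mi(2) - mi(1)] * log2(n).
    """
    m1 = np.mean([pull() for _ in range(n_samples)])               # estimate mi(1)
    m2 = np.mean([max(pull(), pull()) for _ in range(n_samples)])  # estimate mi(2)
    return m1 + (m2 - m1) * np.log2(n)

# Check on a standard Gumbel arm: true mi(1000) = gamma + ln(1000) ~ 7.49.
rng = np.random.default_rng(2)
print(predict_m_n(lambda: rng.gumbel(0.0, 1.0), n=1000))
```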
The Algorithm
• Strategy S* (δ and ε to be determined):
  • For i from 1 to k:
    • Using D pulls, estimate mi(n). Pick D so that with probability 1 − δ, the estimate is within ε of the true mi(n).
  • For the remaining n − kD pulls:
    • Pull the arm with max. predicted mi(n)
• Guarantee: S*(n) = m*(n) − o(1)
• Three things make S* less than optimal (see the sketch below):
  • the estimation error ε
  • the failure probability δ
  • m*(n) − m*(n − kD)
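A minimal end-to-end sketch of S* (my illustration; D would be set from ε and δ as on the analysis slide, but is an explicit argument here):

```python
import numpy as np

def strategy_s_star(pulls, n, D):
    """Sketch of S*: spend ~D pulls per arm estimating mi(n), then commit
    the remaining n - k*D pulls to the arm with the best predicted mi(n).

    pulls: list of k zero-argument functions, one per arm.
    """
    k = len(pulls)
    b = D // 3  # b pulls estimate mi(1); 2b pulls estimate mi(2)
    predicted = []
    for pull in pulls:
        m1 = np.mean([pull() for _ in range(b)])
        m2 = np.mean([max(pull(), pull()) for _ in range(b)])
        predicted.append(m1 + (m2 - m1) * np.log2(n))
    best = int(np.argmax(predicted))
    # Exploit: best payoff seen over the remaining budget.
    return max(pulls[best]() for _ in range(n - k * D))
```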
Analysis
• Three things make S* less than optimal:
  • the estimation error ε
  • the failure probability δ
  • m*(n) − m*(n − kD)
• Setting δ = n^−2, ε = n^−1/3 takes care of the first two. Then:
  m*(n) − m*(n − kD) = O(log n − log(n − kD))
                     = O(kD/n)
                     = O(k(log n)^2 ε^−2 (log δ^−1)/n)
                     = O(k(log n)^3 n^−1/3)
                     = o(1)
Summary & Future Work • Defined max k-armed bandit problem and discussed applications to heuristic search • Presented an asymptotically optimal algorithm for GEV payoff distributions (we analyzed special case s=0) • Working on applications to scheduling problems
The Extremal Types Theorem
• Define Mn = max. of n draws, and suppose
  lim(n→∞) Pr[rn(Mn) ≤ z] = G(z)
  where each rn is a linear “rescaling function”. Then G is either a point mass or a “generalized extreme value distribution”:
  G(z) = exp(−(1 + s(z − μ)/σ)^(−1/s))
  for constants s, μ, and σ > 0.