
A Simple Distribution-Free Approach to the Max k-Armed Bandit Problem

  1. A Simple Distribution-Free Approach to the Max k-Armed Bandit Problem Matthew Streeter and Stephen Smith Carnegie Mellon University

  2. Outline • The max k-armed bandit problem • Previous work • Our distribution-free approach • Experimental evaluation

  3. What is the max k-armed bandit problem?

  4. The classical k-armed bandit • You are in a room with k slot machines • Pulling the arm of machine i returns a payoff drawn (independently at random) from unknown distribution Di • Allowed n total pulls • Goal: maximize total payoff • > 50 years of papers

  5. The max k-armed bandit • You are in a room with k slot machines • Pulling the arm of machine i returns a payoff drawn (independently at random) from unknown distribution Di • Allowed n total pulls • Goal: maximize highest payoff • Introduced ~2003

  6. Why study it?

  7. Goal: improve multi-start heuristics • A multi-start heuristic runs an underlying randomized heuristic a bunch of times and returns the best solution • Examples: • HBSS (Bresina 1996) • VBSS (Cicirello & Smith 2005) • GRASPs (Feo & Resende 1995, and many others)

  8. Application: selecting among heuristics • Given: some optimization problem, k randomized heuristics • Each time you run a heuristic, get a solution with a certain quality • Allowed n runs • Goal: maximize quality of best solution

  9. The max k-armed bandit: example • The slide shows payoff distributions for two arms: a blue arm with higher mean and a maroon arm with higher variance • Given n pulls, how can we maximize the (expected) maximum payoff? • If n=1, should pull the blue arm (higher mean) • If n=1000, should mainly pull the maroon arm (higher variance)
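
The slide's two distributions are shown only as a picture, so the following Monte Carlo sketch uses made-up Gaussian arms (a higher-mean "blue" arm and a higher-variance "maroon" arm; the numbers are illustrative assumptions, not from the slides) to show why the best arm depends on n.

```python
import random

def expected_max(mean, stdev, n_pulls, trials=2000):
    """Monte Carlo estimate of E[max payoff] when all n_pulls go to one Gaussian arm."""
    total = 0.0
    for _ in range(trials):
        total += max(random.gauss(mean, stdev) for _ in range(n_pulls))
    return total / trials

# "blue" arm: mean 0.6, stdev 0.05; "maroon" arm: mean 0.5, stdev 0.20
for n in (1, 1000):
    print(n,
          round(expected_max(0.6, 0.05, n), 3),   # blue
          round(expected_max(0.5, 0.20, n), 3))   # maroon
# With n = 1 the blue arm gives the larger expected maximum;
# with n = 1000 the high-variance maroon arm wins by a wide margin.
```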

  10. Distributional assumptions? • Without distributional assumptions, optimal strategy is not interesting. • For example suppose payoffs are in {0,1}; arms are shuffled so you don’t know which is which. • Optimal strategy samples the arms in round-robin order! • Can’t distinguish a “good” arm until you receive payoff 1, at which point max payoff can’t be improved

  11. Distributional assumptions? • All previous work assumed each machine returns payoffs from a generalized extreme value (GEV) distribution • Why? • Extremal Types Theorem: let Mn = maximum of n independent draws from some fixed distribution. As n → ∞, the distribution of Mn converges to a GEV distribution • GEV sometimes gives an excellent fit to payoff distributions we care about
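
For reference, the standard form of the GEV family (not spelled out on the slide; this is the textbook definition) has distribution function

```latex
G(x) = \exp\left\{ -\left[\, 1 + \xi\,\frac{x-\mu}{\sigma} \,\right]^{-1/\xi} \right\},
\qquad 1 + \xi\,\frac{x-\mu}{\sigma} > 0,
```

with location μ, scale σ > 0, and shape ξ; the Gumbel case assumed by Cicirello & Smith is the limit ξ → 0, where G(x) = exp(−e^(−(x−μ)/σ)).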

  12. Previous work • Cicirello & Smith (CP 2004, AAAI 2005): • Assumed Gumbel distributions (special case of GEV), no rigorous performance guarantees • Good results selecting among heuristics for the RCPSP/max • Streeter & Smith (AAAI 2006) • Rigorous result for general GEV distributions • But no experimental evaluation

  13. Our contributions • Threshold Ascent: a strategy that solves the max k-armed bandit problem using a classical k-armed bandit solver as a subroutine • Chernoff interval estimation: a strategy for the classical k-armed bandit that works well when mean payoffs are small (we assume payoffs in [0,1])

  14. Threshold Ascent • Parameters: strategy S for the classical k-armed bandit, integer m > 0 • Idea: • Initialize t ← −∞ • Use S to maximize the number of payoffs that exceed t • Once m payoffs > t have been received, increase t and repeat

  15. Threshold Ascent • Designed to work well when: • For t > t_critical, there is a growing gap between the probability that the eventually-best arm yields a payoff > t and the corresponding probability for the other arms

  16. Threshold Ascent (continued) • m controls the exploration/exploitation tradeoff (larger m means the algorithm converges more before increasing t) • as t gets large, S sees a classical k-armed bandit instance where almost all payoffs are zero • we don't really start S from scratch each time we increase t
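
The slides describe Threshold Ascent only at a high level, so the Python sketch below fills in details that are assumptions rather than the authors' exact pseudocode: the pull(i) interface, the rule for raising t (here it jumps to the m-th largest payoff seen so far), and the way the classical strategy S is passed in as a function over per-arm (payoffs-above-t, pulls) statistics.

```python
import math

def threshold_ascent(pull, select_arm, k, n, m=100):
    """Minimal sketch of Threshold Ascent (slides 14-16); details are assumptions.

    pull(i)              -> payoff in [0, 1] from arm i
    select_arm(stats, n) -> index of the arm to pull next, where
        stats[i] = (payoffs_above_t, total_pulls) for arm i; this plays
        the role of the classical k-armed bandit strategy S.
    """
    t = -math.inf                      # threshold: initialize t <- -infinity
    payoffs = [[] for _ in range(k)]   # raw payoffs observed per arm
    seen = []                          # all payoffs, used for raising t
    for _ in range(n):
        # once m payoffs exceed t, raise t; the slides leave the update rule
        # unspecified, so here t jumps to the m-th largest payoff seen so far
        if sum(1 for x in seen if x > t) >= m:
            t = sorted(seen, reverse=True)[m - 1]
        # statistics are recomputed against the current t rather than
        # restarting S from scratch after each threshold increase
        stats = [(sum(1 for x in p if x > t), len(p)) for p in payoffs]
        i = select_arm(stats, n)       # use S to chase payoffs above t
        x = pull(i)
        payoffs[i].append(x)
        seen.append(x)
    return max(seen) if seen else None
```

A concrete choice for select_arm, modeled on the Chernoff interval estimation of slides 17-18, is sketched after slide 18.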

  17. Interval Estimation • Interval estimation (Lai & Robbins 1987, Kaelbling 1993) maintains a confidence interval for each arm's mean payoff and pulls the arm with the highest upper bound • [Figure: confidence intervals for Arm 1, Arm 2, and Arm 3]

  18. Chernoff Interval Estimation • We analyze a variant of interval estimation with confidence intervals derived from Chernoff bounds • regret = μ* − average_payoff(strategy), where μ* = mean payoff of the best arm • We prove an O(√(μ*) · X) regret bound, where X = √(k (log n) / n) • Using Hoeffding's inequality just gives O(X) (Auer et al. 2002). As μ* → 0, our bound is much better • Comparable bounds can be obtained using "multiplicative weight update" algorithms
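
Below is a matching sketch of a Chernoff-style index rule, usable as the select_arm argument in the Threshold Ascent sketch above. The exact form of the confidence radius is an assumption chosen so that it shrinks when the empirical mean is small, which is the property the regret bound on this slide relies on; it is not copied from the paper.

```python
import math

def chernoff_interval_estimation(stats, n, delta=0.01):
    """Sketch of interval estimation with Chernoff-style confidence bounds.

    stats[i] = (successes, pulls) for arm i.  Returns the index of the arm
    whose upper confidence bound on mean payoff is largest (slide 17).
    """
    k = len(stats)
    alpha = math.log(2.0 * n * k / delta)        # union-bound style confidence term

    def upper_bound(successes, pulls):
        if pulls == 0:
            return math.inf                      # unexplored arms are tried first
        mean = successes / pulls
        # Chernoff-style radius: roughly sqrt(mean * alpha / pulls) + alpha / pulls,
        # much tighter than Hoeffding's sqrt(alpha / pulls) when mean is small
        return mean + (alpha + math.sqrt(2.0 * pulls * mean * alpha + alpha ** 2)) / pulls

    return max(range(k), key=lambda i: upper_bound(*stats[i]))
```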

  19. Experimental Evaluation

  20. The RCPSP/max • Assign start times to activities subject to resource and temporal constraints • Goal: find a schedule with minimum makespan • NP-hard, “one of the most intractable problems in operations research” (Möhring 2000) • Multi-start heuristics give state-of-the-art performance (Cicirello & Smith 2005)

  21. Evaluation (Note: we use a less aggressive variant of interval estimation in these experiments) • Five multi-start heuristics; each is a randomized rule for greedily building a schedule • LPF - “longest path following” • LST - “latest start time” • MST - “minimum slack time” • MTS - “most total successors” • RSM - “resource scheduling method” • Three max k-armed bandit strategies: • Threshold Ascent (m=100, S = Chernoff interval estimation with 99% confidence intervals) • round-robin sampling • QD-BEACON (Cicirello & Smith 2004, 2005)

  22. Evaluation • Ran on 169 instances from ProGen/max library • For each instance, ran each of five rules 10,000 times and saved results in file • For each of three strategies, solve as max 5-armed bandit with n=10,000 pulls • Define regret = difference between max. possible payoff and max. payoff actually obtained
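
Conceptually, this evaluation is a replay harness over the saved runs. The following is a hypothetical sketch (the data layout, the interpretation of "max possible payoff" as the maximum over all saved payoffs, and the reuse of the threshold_ascent and chernoff_interval_estimation sketches above are all assumptions), not the authors' actual code.

```python
def replay_regret(saved_runs, n=10000, m=100):
    """saved_runs[i] = list of payoffs recorded for heuristic i
    (10,000 saved runs per heuristic, so n <= 10,000 pulls never
    exhausts any arm).  Returns regret as defined on slide 22."""
    cursors = [0] * len(saved_runs)

    def pull(i):
        # replay the next saved payoff of heuristic i instead of re-running it
        x = saved_runs[i][cursors[i]]
        cursors[i] += 1
        return x

    best_possible = max(max(runs) for runs in saved_runs)   # max over all saved payoffs
    best_found = threshold_ascent(pull, chernoff_interval_estimation,
                                  k=len(saved_runs), n=n, m=m)
    return best_possible - best_found
```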

  23. Results • Threshold Ascent outperforms the other max k-armed bandit strategies, as well as the five “pure” strategies

  24. Summary & Conclusions • The max k-armed bandit problem is a simple online learning problem with applications to heuristic search • We described a new, distribution-free approach to the max k-armed bandit problem • Our strategy is effective at selecting among randomized priority dispatching rules for the RCPSP/max
