
Weakly Coupled Stochastic Decision Systems




  1. Weakly Coupled Stochastic Decision Systems Kamesh Munagala Duke University (joint work with Sudipto Guha, UPenn and Peng Shi, Duke)

  2. Stochastic Decision System [Diagram: the Stochastic Model provides Guidance to the Decision Algorithm; the Decision Algorithm issues a Decision to the System; feedback from the System drives Model Refinement]

  3. Example 1: Multi-armed Bandits

  4. Multi-armed Bandits • n treatments of unknown effectiveness • Model “effectiveness” as probability pi ∈ [0,1] • All pi are independent and unknown a priori

  5. Multi-armed Bandits • n treatments of unknown effectiveness • Model “effectiveness” as probability pi ∈ [0,1] • All pi are independent and unknown a priori • At any step: • Choose a treatment i and test it on a patient

  6. Multi-armed Bandits • n treatments of unknown effectiveness • Model “effectiveness” as probability pi ∈ [0,1] • All pi are independent and unknown a priori • At any step: • Choose a treatment i and test it on a patient • Test either passes/fails and costs ci

  7. Multi-armed Bandits • n treatments of unknown effectiveness • Model “effectiveness” as probability pi ∈ [0,1] • All pi are independent and unknown a priori • At any step: • Choose a treatment i and test it on a patient • Test either passes/fails and costs ci • Repeat until cost budget T is exhausted

  8. Multi-armed Bandits • n treatments of unknown effectiveness • Model “effectiveness” as probability pi ∈ [0,1] • All pi are independent and unknown a priori • At any step: • Choose a treatment i and test it on a patient • Test either passes/fails and costs ci • Repeat until cost budget T is exhausted • Choose best treatment/treatments for marketing
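
A tiny simulator makes this setup concrete. The sketch below is illustrative only and not from the talk: the class name BanditTrialEnvironment, the uniform draw of the hidden pi, and the error on budget overrun are all assumptions.

```python
import random

class BanditTrialEnvironment:
    """Toy model of the trial setting: n treatments with hidden success
    probabilities p_i, a cost c_i per test, and a total cost budget T."""

    def __init__(self, n, costs, budget, seed=0):
        self.rng = random.Random(seed)
        # The true p_i are unknown to the decision maker; they are drawn
        # here (uniformly, as an arbitrary choice) only to drive simulation.
        self.p = [self.rng.random() for _ in range(n)]
        self.costs = costs        # c_i = cost of one test of treatment i
        self.budget = budget      # total cost budget T
        self.spent = 0

    def test(self, i):
        """Run one trial of treatment i; True = pass, False = fail."""
        if self.spent + self.costs[i] > self.budget:
            raise RuntimeError("cost budget T exhausted")
        self.spent += self.costs[i]
        return self.rng.random() < self.p[i]
```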

  9. Stochastic Decision System [Diagram: Decision Algorithm = which treatment to try next?; Stochastic Model = estimates of pi; System = n treatments; arrows for Guidance, Decision, and Model Refinement]


  11. Stochastic Model? • Key hurdle for decision maker: • pi are unknown • Stochastic assumption: • pi are drawn from a known “prior distribution” • Contrast with adversarial assumption: • Assume nothing about pi • Will justify stochastic assumption in a bit…

  12. Example: Beta Prior • pi ~ Beta(a,b) • Pr[pi = x] ∝ x^(a−1) (1−x)^(b−1)

  13. Example: Beta Prior • pi ~ Beta(a,b) • Pr[pi = x] ∝ x^(a−1) (1−x)^(b−1) • Intuition: • Suppose we have previously observed (a−1) 1’s and (b−1) 0’s • Beta(a,b) is the posterior distribution given these observations • Updated according to Bayes’ rule starting with: • Beta(1,1) = Uniform[0,1] • Expected reward = E[pi] = a/(a+b)

  14. Stochastic Decision System [Diagram: Decision Algorithm = which treatment to try next?; Stochastic Model = pi ~ Beta(ai, bi); System = n treatments; arrows for Guidance, Decision, and Model Refinement]

  15. Prior Refined using Bayes’ Rule • If pi = x then the next trial passes with probability x • Implies that conditioned on the trial passing: • Pr[pi = x] ∝ x × x^(a−1)(1−x)^(b−1)  (i.e., Pr[Success | Prior] × Pr[Prior])

  16. Prior Refined using Bayes’ Rule • If pi = x then the next trial passes with probability x • Implies that conditioned on the trial passing: • Pr[pi = x] ∝ x × x^(a−1)(1−x)^(b−1) ∝ x^a (1−x)^(b−1) = Beta(a+1, b)

  17. Prior Refined using Bayes’ Rule • If pi = x then the next trial fails with probability 1−x • Implies that conditioned on the trial failing: • Pr[pi = x] ∝ (1−x) × x^(a−1)(1−x)^(b−1) ∝ x^(a−1) (1−x)^b = Beta(a, b+1)
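
The Beta-prior bookkeeping on slides 12–17 reduces to two counters per arm. A minimal sketch (the class name BetaArm is an invention for illustration; exact fractions are used so the slide numbers can be reproduced):

```python
from fractions import Fraction

class BetaArm:
    """State of one arm with a Beta(a, b) prior on its success probability."""

    def __init__(self, a=1, b=1):
        self.a, self.b = a, b     # Beta(1,1) = Uniform[0,1] by default

    def mean(self):
        """Expected reward E[p_i] = a / (a + b)."""
        return Fraction(self.a, self.a + self.b)

    def update(self, passed):
        """Bayes' rule: pass -> Beta(a+1, b), fail -> Beta(a, b+1)."""
        return BetaArm(self.a + 1, self.b) if passed else BetaArm(self.a, self.b + 1)

arm = BetaArm(1, 2)
print(arm.mean())               # 1/3
print(arm.update(True).mean())  # Beta(2,2): 1/2
print(arm.update(False).mean()) # Beta(1,3): 1/4
```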

  18. Prior Update for Arm i [Diagram: state-update tree rooted at Beta(1,1); from state (a,b), the edge for reward 1 is taken with Pr[Reward = 1 | Prior] = a/(a+b) and the edge for reward 0 with Pr[Reward = 0 | Prior] = b/(a+b). Level 1: (1,1) → (2,1) w.p. 1/2, (1,2) w.p. 1/2. Level 2: (2,1) → (3,1) w.p. 2/3, (2,2) w.p. 1/3; (1,2) → (2,2) w.p. 1/3, (1,3) w.p. 2/3. Level 3: (3,1) → (4,1) w.p. 3/4, (3,2) w.p. 1/4; (2,2) → (3,2) or (2,3) w.p. 1/2 each; (1,3) → (2,3) w.p. 1/4, (1,4) w.p. 3/4. For example, E[Reward | Prior] at state (3,1) is 3/4.]
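
Using the BetaArm sketch above, the update tree on this slide can be regenerated: from state (a, b) the next play passes with probability a/(a+b), the arm’s current posterior mean.

```python
def print_update_tree(arm, depth, prefix="root"):
    """Enumerate the prior-update tree starting from the given Beta state."""
    print(f"{prefix}: Beta({arm.a},{arm.b})  E[reward | prior] = {arm.mean()}")
    if depth == 0:
        return
    print_update_tree(arm.update(True), depth - 1, prefix + " ->1")
    print_update_tree(arm.update(False), depth - 1, prefix + " ->0")

# Reproduces the slide: Beta(1,1) splits 1/2 / 1/2, Beta(2,1) splits 2/3 / 1/3,
# Beta(3,1) has E[reward | prior] = 3/4, and so on.
print_update_tree(BetaArm(1, 1), depth=3)
```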

  19. Multi-armed Bandit Lingo System: Multi-armed bandit [Wald ‘47; Arrow et al. ‘49] • Treatment: Bandit arm • Clinical Trial: Playing the arm • Outcome (1/0): Reward

  20. Convenient Abstraction • Posterior density of arm captured by: • Observed rewards from arm so far • Called the “state” of the arm

  21. Convenient Abstraction • Posterior density of arm captured by: • Observed rewards from arm so far • Called the “state” of the arm • State space of a single arm is tractable • Number of states is O(T^2) • At most T plays • Each play yields reward 0 or 1
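
A quick count confirms the O(T^2) bound: after t plays the arm’s state is just the pair (#passes, #fails) with sum t, so at most T plays give (T+1)(T+2)/2 distinct states. A small sketch (the function name is hypothetical):

```python
def num_single_arm_states(T):
    """States reachable from Beta(1,1) in at most T plays: level t has t+1
    states (a, b) with a + b = t + 2, so the total is (T+1)(T+2)/2 = O(T^2)."""
    return sum(t + 1 for t in range(T + 1))

for T in (1, 2, 10, 100):
    print(T, num_single_arm_states(T))   # 3, 6, 66, 5151
```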

  22. Stochastic Decision System [Diagram: Decision Algorithm = which treatment to try next?; Stochastic Model = pi ~ Beta(ai, bi); System = n treatments; the model-refinement arrow is now labeled Bayes’ Rule]

  23. Decision Policy for Playing Arms • Specifies which arm to play next • Function of current states of each arm • Defines a decision tree

  24. Policy: Decision Tree [Diagram: at the root, play arm A; the observed reward is 1 w.p. pA and 0 w.p. 1−pA, and it determines which arm to play next (A or B), whose outcome in turn leads to a leaf; each leaf specifies the finally chosen arm, e.g. “Choose arm B”.]
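
One way to make “policy = decision tree” concrete in code: a node either plays an arm and branches on the observed reward, or stops and chooses an arm. The representation and recursive value computation below are a sketch under that reading (reusing the BetaArm class from earlier), not code from the talk.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PolicyNode:
    """Internal node: play `play_arm`, then follow on_pass / on_fail.
    Leaf node: play_arm is None and choose_arm is the finally chosen arm."""
    play_arm: Optional[int] = None
    on_pass: Optional["PolicyNode"] = None
    on_fail: Optional["PolicyNode"] = None
    choose_arm: Optional[int] = None

def policy_value(node, arms):
    """Expected posterior mean of the chosen arm, taken over tree paths."""
    if node.play_arm is None:                 # leaf: choose an arm now
        return arms[node.choose_arm].mean()
    i = node.play_arm
    p = arms[i].mean()                        # Pr[this play passes | prior]
    arms_pass = arms[:i] + [arms[i].update(True)] + arms[i + 1:]
    arms_fail = arms[:i] + [arms[i].update(False)] + arms[i + 1:]
    return (p * policy_value(node.on_pass, arms_pass)
            + (1 - p) * policy_value(node.on_fail, arms_fail))
```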

  25. Goal • Find decision policy with maximum value: • Value = E [ Reward of chosen arm ] • What is expectation over?

  26. The Bayesian Objective • Find the policy maximizing expected reward of the chosen arm when pi is drawn from prior distribution Qi [Wald ‘47, Robbins ‘50, Gittins, Jones ‘72] • Optimize: E[Posterior mean of chosen arm] • Expectation is over paths in the decision tree

  27. The Bayesian Objective • Find the policy maximizing expected reward of the chosen arm when pi is drawn from prior distribution Qi [Wald ‘47, Robbins ‘50, Gittins, Jones ‘72] • Optimize: E[Posterior mean of chosen arm] • Expectation is over paths in the decision tree • Expectation over two kinds of randomness: • The underlying pi drawn from distribution Qi • The rewards drawn from Bernoulli(1, pi)

  28. The Bayesian Objective • Find the policy maximizing expected reward of the chosen arm when pi is drawn from prior distribution Qi [Wald ‘47, Robbins ‘50, Gittins, Jones ‘72] • Optimize: E[Posterior mean of chosen arm] • Expectation is over paths in the decision tree • Expectation over two kinds of randomness: • The underlying pi drawn from distribution Qi • The rewards drawn from Bernoulli(1, pi) • Expected reward of a policy is a unique number • Depends only on the known Qi, not on the unknown pi

  29. Multi-armed Bandits: Summary • Slot machine (bandit) with n arms • Arm = Treatment • When played, arm yields reward • Distribution of reward unknown a priori • Prior specified over possible distributions • Goal: • Design policy for playing arms • Optimize: E[ Σt αt Rt ], where αt is the scaling factor and Rt the reward at step t

  30. Types of Objectives (choices of the scaling factor αt) • Discounted reward: αt = γ^t for γ < 1 • Finite horizon: αt = 1 for t ≤ T • Budgeted Learning (B-L): αt = 1 only for t = T+1 • Solutions are all related to each other

  31. Weak Coupling [Singh & Cohn ’97; Meuleau et al. ‘98] • Arms are independent: • If played, state evolution of an arm is not conditioned on states of other arms • Playing an arm does not affect states of other arms • Only a few constraints couple arms together in decision policy • T plays over all the arms together • At most one arm chosen finally (in B-L)

  32. Space of Decision Policies

  33. Single Trial (T = 1) • Arm 1 ~ Beta(1,2), E[p1] = 0.33 (a priori better) • Arm 2 ~ Beta(1,3), E[p2] = 0.25

  34. Single Trial (T = 1) • Arm 1 ~ Beta(1,2), E[p1] = 0.33 (a priori better); Arm 2 ~ Beta(1,3), E[p2] = 0.25 • Policy 1 (not so good): Play Arm 1; for either outcome, choose Treatment 1 • Y (prob 1/3): Arm 1 → B(2,2), μ1 = 0.5, μ2 = 0.25 • N (prob 2/3): Arm 1 → B(1,3), μ1 = 0.25, μ2 = 0.25 • Effectiveness of finally chosen treatment: Reward = 1/3 × 0.5 + 2/3 × 0.25 = 0.33

  35. Single Trial (T = 1) • Arm 1 ~ Beta(1,2), E[p1] = 0.33 (a priori better); Arm 2 ~ Beta(1,3), E[p2] = 0.25 • Policy 2 (optimal): Play Arm 2; if Y then choose Arm 2, else choose Arm 1 • Y (prob 1/4): Arm 2 → B(2,3), μ1 = 0.33, μ2 = 0.4 → choose 2 • N (prob 3/4): Arm 2 → B(1,4), μ1 = 0.33, μ2 = 0.2 → choose 1 • Reward = 1/4 × 2/5 + 3/4 × 1/3 = 0.35 (Policy 1 is shown alongside on the slide for comparison)
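
The two single-trial policies above can be checked with the PolicyNode sketch from earlier (arm indices 0 and 1 stand for Arm 1 and Arm 2):

```python
arms = [BetaArm(1, 2), BetaArm(1, 3)]       # Arm 1 ~ Beta(1,2), Arm 2 ~ Beta(1,3)
choose1, choose2 = PolicyNode(choose_arm=0), PolicyNode(choose_arm=1)

# Policy 1: play Arm 1, then choose Arm 1 regardless of the outcome.
policy1 = PolicyNode(play_arm=0, on_pass=choose1, on_fail=choose1)
# Policy 2: play Arm 2; choose Arm 2 on a pass, otherwise fall back to Arm 1.
policy2 = PolicyNode(play_arm=1, on_pass=choose2, on_fail=choose1)

print(policy_value(policy1, arms))   # 1/3  (= 0.33)
print(policy_value(policy2, arms))   # 7/20 (= 0.35)
```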

  36. T = 2: Adaptive Solution • Priors: p1 ~ B(1,1), p2 ~ B(5,2), p3 ~ B(21,11) • Play Arm 1 • Y (prob 1/2): p1 ~ B(2,1); play Arm 1 again: Y (2/3) → p1 ~ B(3,1), choose 1; N (1/3) → p1 ~ B(2,2), choose 2 • N (prob 1/2): p1 ~ B(1,2); play Arm 2: Y (5/7) → p2 ~ B(6,2), choose 2; N (2/7) → p2 ~ B(5,3), choose 3

  37. Curse of Dimensionality [Bellman ‘54] • Policy specifies an action for each “joint state” • Joint state: Cartesian product of current states of the arms • Joint state space has size O(T^(2n))

  38. Curse of Dimensionality [Bellman ‘54] • Policy specifies an action for each “joint state” • Joint state: Cartesian product of current states of the arms • Joint state space has size O(T^(2n)) • Dynamic program on the joint state space • Exponential running time and space requirement • Approximately optimal poly-size policies?
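
For tiny n and T the exponential dynamic program can be written out directly, which also makes the blow-up visible: the memoization key is the entire joint state. The sketch assumes unit costs and the budgeted-learning objective (expected posterior mean of the finally chosen arm).

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def opt_value(joint_state, budget):
    """joint_state = tuple of (a_i, b_i) per arm; returns the optimal
    expected posterior mean of the finally chosen arm."""
    means = [Fraction(a, a + b) for (a, b) in joint_state]
    best = max(means)                      # stop now and choose the best arm
    if budget == 0:
        return best
    for i, (a, b) in enumerate(joint_state):
        p = Fraction(a, a + b)             # Pr[a play of arm i passes]
        succ = list(joint_state); succ[i] = (a + 1, b)
        fail = list(joint_state); fail[i] = (a, b + 1)
        best = max(best, p * opt_value(tuple(succ), budget - 1)
                       + (1 - p) * opt_value(tuple(fail), budget - 1))
    return best

# Matches the optimal T = 1 policy from the earlier example: value 7/20 = 0.35.
print(opt_value(((1, 2), (1, 3)), 1))
# The memo table ranges over the product of per-arm states: O(T^(2n)) entries.
```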

  39. Our Results • General solution technique: • Works for weakly coupled stochastic systems • Objective needs to be reward maximization • Constant factor approximations • Technique based on: • LP duality • Markov’s inequality

  40. Connection to Existing Heuristics • Gittins and Whittle indexes: • Compute quality measure for each state of each arm • Play arm whose current quality is highest • Exploit weak coupling to separate computation • Greedy algorithm – optimal for discounted reward! • Extremely efficient to compute and execute • Our policies are subtle variants of these indexes • Just as efficient to compute and execute!
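
Structurally, an index policy is just the loop below: score each arm from its own state alone and play the argmax. The index function here is only a placeholder (the posterior mean, i.e. pure greed), not the actual Gittins or Whittle index computation; the sketch reuses the BanditTrialEnvironment and BetaArm classes from earlier and assumes unit costs.

```python
def index(arm):
    # Placeholder score; the real Gittins/Whittle index is a more involved
    # per-arm computation, but it too depends only on the arm's own (a, b).
    return arm.mean()

def run_index_policy(env, arms, plays):
    """Repeatedly play the arm whose current state has the highest index,
    then return the index of the arm with the best posterior mean."""
    for _ in range(plays):
        i = max(range(len(arms)), key=lambda j: index(arms[j]))
        arms[i] = arms[i].update(env.test(i))   # Bayes-update only arm i
    return max(range(len(arms)), key=lambda j: arms[j].mean())

env = BanditTrialEnvironment(n=3, costs=[1, 1, 1], budget=10)
arms = [BetaArm() for _ in range(3)]
print(run_index_policy(env, arms, plays=10))
```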

  41. Solution Overview (STOC ‘07)

  42. Solution Idea • Consider any decision policy P • Consider its behavior restricted to arm i

  43. Example • Priors: p1 ~ B(1,1), p2 ~ B(5,2), p3 ~ B(21,11) • Play Arm 1 • Y (prob 1/2): p1 ~ B(2,1); play Arm 1 again: Y (2/3) → p1 ~ B(3,1), choose 1; N (1/3) → p1 ~ B(2,2), choose 2 • N (prob 1/2): p1 ~ B(1,2); play Arm 2: Y (5/7) → p2 ~ B(6,2), choose 2; N (2/7) → p2 ~ B(5,3), choose 3

  44. Behavior Restricted to Arm 2 • Priors: p1 ~ B(1,1), p2 ~ B(5,2), p3 ~ B(21,11) • Keep only the parts of the policy that touch Arm 2 • Path Y (1/2) then N (1/3): p1 ~ B(2,2), choose 2 • Path N (1/2): p1 ~ B(1,2), play Arm 2: Y (5/7) → p2 ~ B(6,2), choose 2

  45. Behavior Restricted to Arm 2 • Priors: p1 ~ B(1,1), p2 ~ B(5,2), p3 ~ B(21,11) • W.p. 1/6: Choose 2 (without playing) • W.p. 1/2: Play Arm 2; Y (5/7) → p2 ~ B(6,2), choose 2; N (2/7) → do nothing • With remaining probability, do nothing

  46. Behavior Restricted to Arm i • Yields a randomized policy for arm i • At each state of the arm, policy probabilistically: • Does nothing • Plays the arm • Chooses the arm and obtains posterior reward

  47. Notation • Ti = E[Number of plays made for arm i] • Ci = E[Number of times arm i chosen] • Ri = E[Reward from events when i chosen]

  48. Behavior Restricted to Arm 2 • p2 ~ B(5,2) • W.p. 1/6: Choose 2; w.p. 1/2: Play Arm 2 (Y (5/7) → p2 ~ B(6,2), choose 2; N (2/7) → do nothing); with remaining probability, do nothing • T2 = 1/2 • C2 = 1/6 + 1/2 × 5/7 = 11/21 • R2 = 1/6 × 5/7 + 1/2 × 5/7 × 3/4 = 65/168
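
These three numbers follow directly from the restricted policy (Beta(5,2) has mean 5/7; after a pass it becomes Beta(6,2), with mean 3/4) and can be checked in a few lines:

```python
from fractions import Fraction as F

q_choose, q_play = F(1, 6), F(1, 2)   # probabilities in the restricted policy
p_pass = F(5, 7)                      # Pr[a play of arm 2 passes | Beta(5,2)]

T2 = q_play                                           # expected plays of arm 2
C2 = q_choose + q_play * p_pass                       # expected times arm 2 chosen
R2 = q_choose * F(5, 7) + q_play * p_pass * F(3, 4)   # expected reward when chosen

print(T2, C2, R2)                     # 1/2  11/21  65/168
```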

  49. Weak Coupling • In any decision policy: • Number of plays is at most T • Number of times some arm is chosen is at most 1 • True on all decision paths • Taking expectations over decision paths: • Σi Ti ≤ T • Σi Ci ≤ 1 • Value of decision policy = Σi Ri

  50. Relaxed Decision Problem • Find one randomized decision policy Pi for each arm i such that: • Σi Ti(Pi) ≤ T • Σi Ci(Pi) ≤ 1 • Maximize: Σi Ri(Pi) • Why is this a relaxation? • The collection of Pi need not be a feasible joint policy • Only enforcing the coupling constraints in expectation!
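
If each arm is summarized by a small menu of candidate single-arm policies, each with its (Ti, Ci, Ri), the relaxed problem becomes a small linear program: pick a (possibly fractional) combination of candidates per arm so that the coupling constraints hold in expectation. The sketch below uses scipy.optimize.linprog; the candidate menus are hand-built from the running example (each arm’s “choose immediately” policy plus one play-then-choose policy), and it only illustrates the relaxation, not the paper’s actual algorithm.

```python
import numpy as np
from scipy.optimize import linprog

# Candidate single-arm policies, each summarized as (T_i, C_i, R_i).
candidates = {
    "arm1": [(0.0, 1.0, 1/2), (1.0, 1/2, 1/3)],       # Beta(1,1): choose now, or play once and choose only on a pass
    "arm2": [(0.0, 1.0, 5/7), (1/2, 11/21, 65/168)],  # Beta(5,2): choose now, or the restricted policy above
    "arm3": [(0.0, 1.0, 21/32)],                      # Beta(21,11): choose now
}
T_budget = 2.0

idx = [(i, j) for i in candidates for j in range(len(candidates[i]))]
T_ij = np.array([candidates[i][j][0] for i, j in idx])
C_ij = np.array([candidates[i][j][1] for i, j in idx])
R_ij = np.array([candidates[i][j][2] for i, j in idx])

# Coupling constraints in expectation (sum_i T_i <= T, sum_i C_i <= 1),
# plus "at most one unit of candidate weight per arm".
A_ub = [T_ij, C_ij]
b_ub = [T_budget, 1.0]
for arm in candidates:
    A_ub.append([1.0 if i == arm else 0.0 for i, _ in idx])
    b_ub.append(1.0)

res = linprog(c=-R_ij, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(0, 1)] * len(idx), method="highs")
print(-res.fun, res.x)   # relaxed objective value and candidate weights
```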
