
Sequential Off-line Learning with Knowledge Gradients

This presentation discusses the knowledge gradient policy for sequential off-line learning and presents theoretical and numerical results. It also introduces various sampling problems and explores different exploration and exploitation techniques.


Presentation Transcript


  1. Sequential Off-line Learning with Knowledge Gradients. Peter Frazier, Warren Powell, Savas Dayanik. Department of Operations Research and Financial Engineering, Princeton University.

  2. Overview • Problem Formulation • Knowledge Gradient Policy • Theoretical Results • Numerical Results

  3. Measurement Phase [Figure: alternatives 1 through M with sample experimental outcomes such as +1, -1, +.2, -.2; we have N opportunities to do experiments]

  4. Implementation Phase [Figure: choose one of alternatives 1 through M and collect its reward]

  5. Taxonomy of Sampling Problems [Figure: taxonomy tree of sampling problems]

  6. Example 1 • One common experimental design is to spread measurements equally across the alternatives. [Figure: quality vs. alternative]

  7. Round Robin Exploration [Figure: round-robin measurements on Example 1]

  8. Example 2 • How might we improve round-robin exploration for use with this prior?

  9. Largest Variance Exploration [Figure: largest-variance measurements on Example 2]

  10. Example 3 • Exploitation: measure the alternative that currently looks best, x^n = argmax_x μ^n_x. [Figure: exploitation on Example 3]
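
The three baseline rules from Examples 1 through 3 are easy to state in code. The following is a minimal Python sketch, not taken from the slides; the arrays mu and sigma, holding the current posterior means and standard deviations, and the function names are illustrative:

```python
import numpy as np

def round_robin(n, M):
    """Round-robin exploration: cycle through the M alternatives in order."""
    return n % M

def largest_variance(sigma):
    """Largest-variance exploration: measure the most uncertain alternative."""
    return int(np.argmax(sigma))

def exploit(mu):
    """Exploitation: measure the alternative whose estimated value is highest."""
    return int(np.argmax(mu))
```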

  11. Model
  • x^n is the alternative tested at time n.
  • Measuring alternative x^n yields the observation y^{n+1} = Y_{x^n} + ε^{n+1}.
  • At time n, Y_x ~ Normal(μ^n_x, (σ^n_x)²), independent across alternatives.
  • The error ε^{n+1} is independent Normal(0, (σ_ε)²).
  • At time N, choose an alternative.
  • Maximize E[max_x μ^N_x].

  12. State Transition
  • At time n we measure alternative x^n and update our estimate of Y_{x^n} based on the observation y^{n+1}:
    μ^{n+1}_x = [(σ^n_x)^{-2} μ^n_x + (σ_ε)^{-2} y^{n+1}] / [(σ^n_x)^{-2} + (σ_ε)^{-2}],
    (σ^{n+1}_x)^{-2} = (σ^n_x)^{-2} + (σ_ε)^{-2},   for x = x^n.
  • Estimates of the other Y_x do not change.
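
As an illustration of the update above, here is a minimal Python sketch of the conjugate normal update, written in precision form. It is not the authors' code; the names bayes_update, sigma2, and sigma_eps2 are assumptions:

```python
import numpy as np

def bayes_update(mu, sigma2, x, y, sigma_eps2):
    """Return updated posterior means/variances after observing y for alternative x.

    Conjugate normal update: the new precision is the sum of the old
    precision and the measurement precision; the new mean is the
    precision-weighted average of the old mean and the observation.
    """
    mu, sigma2 = mu.copy(), sigma2.copy()
    prec = 1.0 / sigma2[x] + 1.0 / sigma_eps2
    mu[x] = (mu[x] / sigma2[x] + y / sigma_eps2) / prec
    sigma2[x] = 1.0 / prec   # estimates of the other alternatives are unchanged
    return mu, sigma2
```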

  13. • Let σ̃^n_x denote the standard deviation of the change in our best estimate of Y_x due to the measurement.
  • The value of the optimal policy satisfies Bellman's equation,
    V^n(S^n) = max_x E[V^{n+1}(S^{n+1}) | S^n, x^n = x].
  • At time n, μ^{n+1}_x is a normal random variable with mean μ^n_x and variance (σ̃^n_x)² satisfying
    (σ̃^n_x)² = (σ^n_x)² − (σ^{n+1}_x)²,
  the uncertainty about Y_x before the measurement minus the uncertainty about Y_x after the measurement.

  14. Utility of Information • Consider our "utility of information" U^n = max_x μ^n_x, and consider the random change in utility, U^{n+1} − U^n, due to a measurement at time n.

  15. Knowledge Gradient Definition • The knowledge gradient policy chooses the measurement that maximizes this expected increase in utility,
    x^{KG,n} = argmax_x E[U^{n+1} − U^n | S^n, x^n = x].

  16. Knowledge Gradient • We may compute the knowledge gradient policy via
    E[U^{n+1} − U^n | S^n, x^n = x] = E[max(μ^n_x + σ̃^n_x Z, max_{x'≠x} μ^n_{x'})] − max_{x'} μ^n_{x'},   Z ~ Normal(0,1),
  which is the expectation of the maximum of a normal and a constant.

  17. Knowledge Gradient • The computation becomes
    ν^n_x = σ̃^n_x f( −|μ^n_x − max_{x'≠x} μ^n_{x'}| / σ̃^n_x ),
  where Φ is the normal cdf, φ is the normal pdf, and f(z) = z Φ(z) + φ(z).
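
Putting slides 13 and 17 together, the knowledge gradient factors can be computed in a few lines. The sketch below is an assumed implementation, not the authors' code; it presumes at least two alternatives and strictly positive σ̃:

```python
import numpy as np
from scipy.stats import norm

def kg_decision(mu, sigma2, sigma_eps2):
    """Return the alternative maximizing the KG factor
    nu_x = sigma_tilde_x * f(-|mu_x - max_{x' != x} mu_x'| / sigma_tilde_x),
    where f(z) = z * Phi(z) + phi(z)."""
    M = len(mu)
    # variance of the change in mu_x caused by one measurement of x:
    # (sigma_tilde_x)^2 = (sigma_x^n)^2 - (sigma_x^{n+1})^2
    sigma_tilde = np.sqrt(sigma2 - 1.0 / (1.0 / sigma2 + 1.0 / sigma_eps2))
    nu = np.empty(M)
    for x in range(M):
        best_other = np.max(np.delete(mu, x))
        z = -abs(mu[x] - best_other) / sigma_tilde[x]
        nu[x] = sigma_tilde[x] * (z * norm.cdf(z) + norm.pdf(z))
    return int(np.argmax(nu))
```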

  18. Optimality Results • If our measurement budget allows only one measurement (N=1), the knowledge gradient policy is optimal.

  19. Optimality Results • The knowledge gradient policy is optimal in the limit as the measurement budget N grows to infinity. • This is really a convergence result.

  20. Optimality Results • The knowledge gradient policy has its sub-optimality V^n − V^{KG,n} bounded above, where V^{KG,n} gives the value of the knowledge gradient policy and V^n the value of the optimal policy.

  21. Optimality Results • If there are exactly two alternatives (M=2), the knowledge gradient policy is optimal. • In this case, the optimal policy reduces to measuring the alternative with the larger variance, x^n = argmax_x σ^n_x.

  22. Optimality Results • If there is no measurement noise and the alternatives can be reordered so that a certain monotonicity condition on the prior means and variances holds, then the knowledge gradient policy is optimal.

  23. Numerical Experiments
  • 100 randomly generated problems
  • M ~ Uniform{1,...,100}
  • N ~ Uniform{M, 3M, 10M}
  • μ^0_x ~ Uniform[−1, 1]
  • (σ^0_x)² = 1 with probability 0.9, 10^−3 with probability 0.1
  • σ_ε = 1
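
For concreteness, a generator that draws problems from these distributions might look like the following Python sketch; it is an assumption, not the authors' experimental code:

```python
import numpy as np

def random_problem(rng):
    """Draw one test problem from the distributions listed above."""
    M = int(rng.integers(1, 101))              # M ~ Uniform{1,...,100}
    N = int(M * rng.choice([1, 3, 10]))        # N ~ Uniform{M, 3M, 10M}
    mu0 = rng.uniform(-1.0, 1.0, size=M)       # prior means
    sigma2_0 = np.where(rng.random(M) < 0.9, 1.0, 1e-3)  # prior variances
    sigma_eps = 1.0                            # measurement noise std. dev.
    return M, N, mu0, sigma2_0, sigma_eps

# usage: rng = np.random.default_rng(0); M, N, mu0, sigma2_0, sigma_eps = random_problem(rng)
```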

  24. Numerical Experiments

  25. Interval Estimation • Compare alternatives via a linear combination of mean and standard deviation,
    x^n = argmax_x μ^n_x + z_{α/2} σ^n_x.
  • The parameter z_{α/2} controls the tradeoff between exploration and exploitation.
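
A minimal sketch of the IE decision rule, with z standing in for z_{α/2}; the function and argument names are illustrative:

```python
import numpy as np

def interval_estimation(mu, sigma, z):
    """IE decision: measure argmax_x of mu_x + z * sigma_x.
    Larger z rewards uncertainty (exploration); z = 0 is pure exploitation."""
    return int(np.argmax(mu + z * sigma))
```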

  26. KG / IE Comparison

  27. KG / IE Comparison [Figure: value of KG minus value of IE]

  28. IE and "Sticking" [Figure: alternative 1 is known perfectly]

  29. IE and “Sticking”

  30. Thank You Any Questions?

  31. Numerical Example 1

  32. Numerical Example 2

  33. Numerical Example 3

  34. Knowledge Gradient Example

  35. Interval Estimation Example

  36. Boltzmann Exploration • Choose alternative x at time n with probability proportional to exp(μ^n_x / T^n). • Parameterized by a declining sequence of temperatures (T^0, ..., T^{N−1}).
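
A minimal sketch of Boltzmann (softmax) exploration at temperature T; the function name and the numerically stable exponentiation are implementation choices, not from the slide:

```python
import numpy as np

def boltzmann(mu, T, rng):
    """Sample alternative x with probability proportional to exp(mu_x / T)."""
    logits = mu / T                    # lower T sharpens toward exploitation
    p = np.exp(logits - logits.max())  # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(mu), p=p))
```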
