
Transfer in Reinforcement Learning via Markov Logic Networks


Presentation Transcript


  1. Transfer in Reinforcement Learning via Markov Logic Networks Lisa Torrey, Jude Shavlik, Sriraam Natarajan, Pavan Kuppili, Trevor Walker University of Wisconsin-Madison, USA

  2. Possible Benefits of Transfer in RL • Learning curves in the target task: [Figure: plot of performance vs. training, comparing a "with transfer" curve against a "without transfer" curve]

  3. The RoboCup Domain [Images: 2-on-1 BreakAway and 3-on-2 BreakAway]

  4. Reinforcement Learning • States are described by features: distance(me, teammate1) = 15, distance(me, opponent1) = 5, angle(opponent1, me, teammate1) = 30, … • Actions are: Move, Pass, Shoot • Rewards are: +1 for scoring, 0 otherwise • [Diagram: the Agent sends an action to the Environment and receives the next state and a reward]
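
As an illustration of this state/action/reward interface only (the slide does not specify the authors' actual learning algorithm or function approximator), here is a minimal epsilon-greedy temporal-difference loop; `env` and `q` are hypothetical placeholders, not the RoboCup implementation.

```python
import random

# Illustrative sketch: a generic epsilon-greedy TD-learning loop over the
# interface on this slide (feature-described states, actions Move/Pass/Shoot,
# reward +1 for scoring and 0 otherwise). `env` and `q` are hypothetical.

ACTIONS = ["move", "pass", "shoot"]

def run_episode(env, q, epsilon=0.1, gamma=0.97, alpha=0.1):
    """Play one episode, updating the Q-function after every step."""
    state = env.reset()          # e.g. {"distance(me,teammate1)": 15, ...}
    done = False
    while not done:
        # Epsilon-greedy choice among Move, Pass, Shoot.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q.value(state, a))
        next_state, reward, done = env.step(action)   # +1 for a goal, else 0
        # One-step TD target and update toward it.
        best_next = 0.0 if done else max(q.value(next_state, a) for a in ACTIONS)
        td_error = reward + gamma * best_next - q.value(state, action)
        q.update(state, action, alpha * td_error)
        state = next_state
```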

  5. Our Previous Methods • Skill transfer: learn a rule for when to take each action, then use the rules as advice • Macro transfer: learn a relational multi-step action plan, then use the macro to demonstrate

  6. Transfer via Markov Logic Networks [Diagram: the source-task learner produces a source-task Q-function and data; these are analyzed to learn an MLN Q-function, which is then used to demonstrate for the target-task learner]

  7. Markov Logic Networks • A Markov network models a joint distribution [diagram: small network over nodes X, Y, Z, A, B] • A Markov Logic Network combines probability with logic • Template: a set of first-order formulas with weights • Each grounded predicate in a formula becomes a node • Predicates in a grounded formula are connected by arcs • Probability of a world: (1/Z) exp(Σ Wi Ni) (Richardson and Domingos, ML 2006)
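
To make the world-probability formula concrete, here is a toy sketch that computes P(world) = (1/Z) exp(Σ Wi Ni) by brute-force enumeration over all worlds; the single formula, its weight, and the two constants are invented for the example and are not from the paper.

```python
import itertools
import math

# Toy MLN: one weighted first-order formula over a two-constant domain.
# A "world" is a truth assignment to all ground atoms.
CONSTANTS = ["teammate1", "goalLeft"]
ATOMS = [f"close({c})" for c in CONSTANTS] + [f"open({c})" for c in CONSTANTS]

def n_formula1(world):
    # Formula 1: close(X) => open(X); count its true groundings in this world.
    return sum(1 for c in CONSTANTS
               if (not world[f"close({c})"]) or world[f"open({c})"])

FORMULAS = [(1.5, n_formula1)]            # (weight Wi, grounding counter Ni)

def unnormalized(world):
    return math.exp(sum(w * n(world) for w, n in FORMULAS))

# Enumerate every possible world to obtain the partition function Z.
worlds = [dict(zip(ATOMS, values))
          for values in itertools.product([False, True], repeat=len(ATOMS))]
Z = sum(unnormalized(w) for w in worlds)

some_world = {a: True for a in ATOMS}
print("P(world) =", unnormalized(some_world) / Z)
```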

  8. MLN Q-function • Formula 1: IF distance(me, Teammate) < 15 AND angle(me, goalie, Teammate) > 45 THEN Q ∈ (0.8, 1.0), with weight W1 = 0.75 and N1 = 1 true grounding (one teammate) • Formula 2: IF distance(me, GoalPart) < 10 AND angle(me, goalie, GoalPart) > 45 THEN Q ∈ (0.8, 1.0), with weight W2 = 1.33 and N2 = 3 true groundings (three goal parts) • Probability that Q ∈ (0.8, 1.0): exp(W1N1 + W2N2) / (1 + exp(W1N1 + W2N2))
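
The slide's arithmetic, with the second formula's weight and count written as W2 and N2, works out as follows in a short sketch:

```python
import math

# Formula 1: weight W1 = 0.75, N1 = 1 true grounding (one teammate).
# Formula 2: weight W2 = 1.33, N2 = 3 true groundings (three goal parts).
w1, n1 = 0.75, 1
w2, n2 = 1.33, 3

# The MLN's probability that Q falls in the (0.8, 1.0) bin is a logistic
# function of the weighted grounding counts.
score = w1 * n1 + w2 * n2
p_bin = math.exp(score) / (1.0 + math.exp(score))
print(f"P(Q in (0.8, 1.0)) = {p_bin:.3f}")   # ~0.991
```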

  9. Grounded Markov Network [Diagram: the query node Q ∈ (0.8, 1.0) is connected to the grounded predicate nodes distance(me, teammate1) < 15, angle(me, goalie, teammate1) > 45, distance(me, goalLeft) < 10, angle(me, goalie, goalLeft) > 45, distance(me, goalRight) < 10, and angle(me, goalie, goalRight) > 45]

  10. Learning an MLN • Find good Q-value bins using hierarchical clustering • Learn rules that classify examples into bins using inductive logic programming • Learn weights for these formulas to produce the final MLN

  11. Binning via Hierarchical Clustering [Figure: three histograms of example frequency vs. Q-value, illustrating the clustering of Q-values into bins]
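
The slides do not specify the linkage criterion or stopping rule, so the following is just one plausible way to realize "find good Q-value bins using hierarchical clustering", here with SciPy's agglomerative clustering on the scalar Q-values:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def q_value_bins(q_values, n_bins=3):
    """Cluster 1-D Q-values and return (low, high) boundaries for each bin."""
    q = np.asarray(q_values, dtype=float).reshape(-1, 1)
    tree = linkage(q, method="ward")                    # agglomerative clustering
    labels = fcluster(tree, t=n_bins, criterion="maxclust")
    bins = []
    for label in np.unique(labels):
        members = q[labels == label, 0]
        bins.append((members.min(), members.max()))     # bin = cluster's Q-range
    return sorted(bins)

# Example: Q-values concentrated near 0, 0.6, and 0.9 yield three bins.
print(q_value_bins([0.02, 0.05, 0.1, 0.55, 0.6, 0.65, 0.85, 0.9, 0.95]))
```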

  12. Classifying Into Bins via ILP • Given examples • Positive: inside this Q-value bin • Negative: outside this Q-value bin • The Aleph* ILP learning system finds rules that separate positive from negative • Builds rules one predicate at a time • Top-down search through the feature space * Srinivasan, 2001

  13. Learning Formula Weights • Given formulas and examples • Same examples as for ILP • ILP rules as network structure • Alchemy* finds weights that make the probability estimates accurate • Scaled conjugate-gradient algorithm * Kok, Singla, Richardson, Domingos, Sumner, Poon and Lowd, 2004-2007

  14. Using an MLN Q-function • The MLN assigns a probability to each bin, e.g. Q ∈ (0.8, 1.0): P1 = 0.75, Q ∈ (0.5, 0.8): P2 = 0.15, Q ∈ (0, 0.5): P3 = 0.10 • Q = P1 · E[Q | bin1] + P2 · E[Q | bin2] + P3 · E[Q | bin3] • E[Q | bin] = Q-value of the most similar training example in the bin
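
A minimal sketch of this weighted combination, using the bin probabilities from the slide; the E[Q | bin] values below are hypothetical placeholders (the real ones come from the most similar training example in each bin, as the next slide describes):

```python
# Q = sum_i P_i * E[Q | bin_i]
bin_probs = [0.75, 0.15, 0.10]     # P(Q in bin_i) from MLN inference (slide values)
expected_q = [0.9, 0.65, 0.3]      # E[Q | bin_i], hypothetical values

q_estimate = sum(p * e for p, e in zip(bin_probs, expected_q))
print(f"Q estimate = {q_estimate:.4f}")   # 0.75*0.9 + 0.15*0.65 + 0.10*0.3 = 0.8025
```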

  15. Example Similarity • E[Q | bin] = Q-value of the most similar training example in the bin • Similarity = dot product of example vectors • An example vector shows which bin rules the example satisfies, e.g. over Rule 1, Rule 2, Rule 3, …: (1, -1, 1, …) and (1, 1, -1, …)
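
A small sketch of this similarity lookup, assuming the +1/-1 rule-satisfaction encoding described above; the example vectors and Q-values are made up for illustration:

```python
import numpy as np

def rule_vector(satisfied_flags):
    """Map a list of booleans (rule satisfied?) to a +1/-1 vector."""
    return np.where(np.asarray(satisfied_flags, dtype=bool), 1, -1)

def most_similar_q(new_example, training_examples):
    """Return the Q-value of the training example with the highest dot product."""
    new_vec = rule_vector(new_example)
    best = max(training_examples,
               key=lambda ex: int(np.dot(new_vec, rule_vector(ex["rules"]))))
    return best["q"]

# Hypothetical bin with two training examples over three rules.
bin_examples = [{"rules": [True, False, True], "q": 0.92},
                {"rules": [True, True, False], "q": 0.85}]
print(most_similar_q([True, False, True], bin_examples))   # -> 0.92
```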

  16. Experiments • Source task: 2-on-1 BreakAway • 3000 existing games from the learning curve • Learn MLNs from 5 separate runs • Target task: 3-on-2 BreakAway • Demonstration period of 100 games • Continue training up to 3000 games • Perform 5 target runs for each source run

  17. Discoveries • Results can vary widely with the source-task chunk from which we transfer • Most methods use the “final” Q-function from the last chunk • MLN transfer performs better from chunks halfway through the learning curve

  18. Results in 3-on-2 BreakAway

  19. Conclusions • MLN transfer can significantly improve initial target-task performance • Like macro transfer, it is an aggressive approach for tasks with similar strategies • It “lifts” transferred information to first-order logic, making it more general for transfer • Theory refinement in the target task may be viable through MLN revision

  20. Potential Future Work • Model screening for transfer learning • Theory refinement in the target task • Fully relational RL in RoboCup using MLNs as Q-function approximators

  21. Acknowledgements • DARPA Grant HR0011-07-C-0060 • DARPA Grant FA 8650-06-C-7606 Thank You
