
Autonomous Inter-Task Transfer in Reinforcement Learning Domains






Presentation Transcript


  1. Autonomous Inter-Task Transfer in Reinforcement Learning Domains. Matthew E. Taylor, Learning Agents Research Group, Department of Computer Sciences, University of Texas at Austin. 6/24/2008

  2. Inter-Task Transfer • Learning tabula rasa can be unnecessarily slow • Humans can use past information • Soccer with different numbers of players • Agents leverage learned knowledge in novel tasks 2

  3. Primary Questions • Is it possible to transfer learned knowledge? • Is it possible to transfer without providing a task mapping? • Only reinforcement learning tasks are considered [Figure: source task (Ssource, Asource) and target task (Starget, ATarget)]

  4. Reinforcement Learning (RL): Key Ideas • Markov Decision Process (MDP): ⟨S, A, T, R⟩ • Policy: π(s) = a • Action-value function: Q(s, a) ∈ ℜ • State variables: s = ⟨x1, x2, … xn⟩ • S: states in the task; A: actions the agent can take; T: T(S, A) → S’ (transition function); R: R(S) → ℜ (reward function) [Figure: agent-environment loop; the agent sends an action and receives a state and reward]
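To make the notation concrete, here is a minimal Python sketch (not from the talk) of a tabular action-value function and the greedy policy it induces; all names are illustrative.

```python
# A minimal sketch: a tabular action-value function Q(s, a) and a greedy policy pi(s).
from collections import defaultdict

Q = defaultdict(float)                     # Q[(state, action)] -> estimated return

def greedy_policy(state, actions):
    """pi(s) = a: choose the action with the highest estimated value Q(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])
```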

  5. Outline • Reinforcement Learning Background • Inter-Task Mappings • Value Function Transfer • MASTER: Learning Inter-Task Mappings • Related Work • Future Work and Conclusion 5

  6. Enabling Transfer [Figure: two agent-environment loops; the source task agent learns QS: SS×AS→ℜ and the target task agent learns QT: ST×AT→ℜ]

  7. Inter-Task Mappings [Figure: correspondences drawn between source task and target task elements]

  8. Inter-Task Mappings • χX: starget → ssource: given a state variable in the target task (some x from s = ⟨x1, x2, … xn⟩), return the corresponding state variable in the source task • χA: atarget → asource: similar, but for actions • Intuitive mappings exist in some domains (oracle) • Used to construct the transfer functional [Figure: χX maps STarget ⟨x1…xn⟩ to SSource ⟨x1…xk⟩ and χA maps ATarget {a1…am} to ASource {a1…aj}]
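One simple way to represent these mappings (a sketch with hypothetical Keepaway-style variable and action names, not the talk's data structures) is as dictionaries from target-task names to source-task names.

```python
# Hypothetical inter-task mappings, written as dicts from target names to source names.
chi_X = {"dist_K1_K2": "dist_K1_K2",   # target state variable -> source state variable
         "dist_K1_K4": "dist_K1_K3"}   # a novel target variable reuses a source variable
chi_A = {"hold": "hold", "pass1": "pass1", "pass2": "pass2", "pass3": "pass2"}

def map_action(a_target):
    """chi_A: return the source-task action corresponding to a target-task action."""
    return chi_A[a_target]
```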

  9. Keepaway [Stone, Sutton, and Kuhlmann 2005] • Goal: maintain possession of the ball • 3 vs. 2: 5 agents, 3 (stochastic) actions, 13 (noisy and continuous) state variables • The keeper with the ball may hold it or pass to either teammate; both takers move towards the player with the ball • 4 vs. 3: 7 agents, 4 actions, 19 state variables [Figure: keepers K1-K3 and takers T1-T2 on the field]

  10. Keepaway Hand-coded χA • Actions in 4 vs. 3 have “similar” actions in 3 vs. 2: • Hold4v3 → Hold3v2 • Pass14v3 → Pass13v2 • Pass24v3 → Pass23v2 • Pass34v3 → Pass23v2 [Figure: the 4 vs. 3 field showing Pass14v3, Pass24v3, and Pass34v3 alongside the corresponding 3 vs. 2 field]

  11. Keepaway Hand-coded χX • Define similar state variables in the two tasks • Example: distances from the player with the ball to its teammates [Figure: corresponding distances drawn on the 4 vs. 3 and 3 vs. 2 fields]

  12. Outline • Reinforcement Learning Background • Inter-Task Mappings • Value Function Transfer • MASTER: Learning Inter-Task Mappings • Related Work • Future Work and Conclusion 12

  13. Value Function Transfer [Figure: knowledge flows from the source task (Ssource, Asource) to the target task (Starget, ATarget)]

  14. ρ: Value Function Transfer • QS is not defined on ST and AT • ρ(QS(SS, AS)) = QT(ST, AT): the learned action-value function is transferred • ρ is task-dependent: it relies on the inter-task mappings [Figure: the source task agent’s QS: SS×AS→ℜ is transformed by ρ into the target task agent’s QT: ST×AT→ℜ]
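A minimal sketch of ρ in a tabular setting (the talk transfers CMAC weights rather than table entries; the helper names and state representation here are assumptions):

```python
# Sketch: the transfer functional rho builds an initial target-task Q-table from
# learned source-task values via the inter-task mappings.
def rho(Q_source, target_states, target_actions, source_var_of, chi_A):
    """source_var_of[i] = index of the target variable that plays the role of
    source variable i (derived from chi_X); chi_A maps target to source actions."""
    Q_target = {}
    for s_t in target_states:                      # s_t: tuple of state-variable values
        s_s = tuple(s_t[j] for j in source_var_of) # project onto the source state space
        for a_t in target_actions:
            Q_target[(s_t, a_t)] = Q_source.get((s_s, chi_A[a_t]), 0.0)
    return Q_target
```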

  15. Learning Keepaway • Sarsa update • CMAC, RBF, and neural network function approximation are all successful • Qπ(s, a): predicted number of steps the episode will last • Reward: +1 for every timestep
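For reference, a tabular Sarsa update with the +1-per-timestep reward mentioned above; the step-size and discount values are illustrative, not the talk's settings.

```python
# Sketch: one tabular Sarsa update. ALPHA and GAMMA are illustrative constants.
ALPHA, GAMMA = 0.1, 1.0

def sarsa_update(Q, s, a, r, s_next, a_next):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]"""
    td_error = r + GAMMA * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + ALPHA * td_error
```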

  16. ρ’s Effect on CMACs • For each weight in the 4 vs. 3 function approximator: use the inter-task mapping to find the corresponding 3 vs. 2 weight [Figure: weights copied from the 3 vs. 2 CMAC into the 4 vs. 3 CMAC]
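A rough sketch of this weight-copying step, assuming a simplified CMAC with one tiling per state variable and weights keyed by (action, state variable, tile); this illustrates the idea, not the actual Keepaway function approximator.

```python
# Sketch: copy each target CMAC weight from the source weight chosen by the mappings.
def transfer_cmac_weights(w_source, target_actions, target_vars, n_tiles,
                          chi_A, chi_X):
    """chi_A: target action -> source action; chi_X: target variable -> source variable."""
    w_target = {}
    for a_t in target_actions:
        for x_t in target_vars:
            for tile in range(n_tiles):
                # look up the source weight for the mapped action and state variable
                w_target[(a_t, x_t, tile)] = w_source.get(
                    (chi_A[a_t], chi_X[x_t], tile), 0.0)
    return w_target
```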

  17. Transfer Evaluation Metrics • Set a threshold performance (here 8.5) that the majority of agents can achieve with learning • Two distinct scenarios: • 1. Target Time Metric: successful if target task learning time is reduced (AI goal: effectively utilize past knowledge; only the target matters, so the “sunk cost” of source training is ignored; the source task(s) are independently useful) • 2. Total Time Metric: successful if total (source + target) time is reduced (engineering goal: minimize total training; the source task(s) are not independently useful) [Figure: learning curves for “Target: no transfer”, “Target: with transfer”, and “Target + Source: with transfer” crossing the threshold]
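A small sketch of how the two metrics could be computed from a learning curve; the threshold, curve values, and time units below are placeholders, not results from the talk.

```python
# Sketch: time-to-threshold under the target time and total time metrics.
def time_to_threshold(performance_curve, threshold):
    """Training time (index) at which performance first reaches the threshold."""
    for t, perf in enumerate(performance_curve):
        if perf >= threshold:
            return t
    return None  # threshold never reached

THRESHOLD = 8.5                      # illustrative threshold
target_curve = [4.0, 6.5, 8.0, 8.7]  # illustrative target-task learning curve
source_time = 3                      # illustrative source-task training time

target_time = time_to_threshold(target_curve, THRESHOLD)   # target time metric
total_time = source_time + target_time if target_time is not None else None
```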

  18. Value Function Transfer: Time to Threshold in 4 vs. 3 [Figure: bar chart comparing the no-transfer time with the target task time and total time under transfer]

  19. Value Function Transfer Flexibility • Different Function Approximators • Radial Basis Function & Neural Network • Different Actuators • Pass accuracy • “Accurate” passers have normal actuators • “Inaccurate” passers have less capable kick actuators • Value Function Transfer also reduces target task time and total time: • Inaccurate 3 vs. 2 → Inaccurate 4 vs. 3 • Accurate 3 vs. 2 → Inaccurate 4 vs. 3 • Inaccurate 3 vs. 2 → Accurate 4 vs. 3 19

  20. Value Function Transfer Flexibility • Different Function Approximators • Different Actuators • Different Keepaway Tasks • 5 vs. 4, 6 vs. 5, 7 vs. 6 20

  21. Value Function Transfer Flexibility • Different Function Approximators • Different Actuators • Different Keepaway Tasks • Partial Mappings [Figure: the 4 vs. 3 and 3 vs. 2 fields illustrating a partial mapping]

  22. Value Function Transfer Flexibility • Different Function Approximators • Different Actuators • Different Keepaway Tasks • Partial Mappings • Different Domains: Knight Joust to 4 vs. 3 Keepaway • Knight Joust: goal is to travel from the start to the goal line; 2 agents, 3 actions, 3 state variables; fully observable, discrete state space (Q-table with ~600 s, a pairs), deterministic actions; the opponent moves directly towards the player, and the player may move North or take a knight jump to either side

  23. Value Function Transfer Flexibility • Different Function Approximators • Different Actuators • Different Keepaway Tasks • Partial Mappings • Different Domains • Knight Joust to 4 vs. 3 Keepaway • 3 vs. 2 Flat Reward, 3 vs. 2 Giveaway 23

  24. Transfer Methods 24

  25. Empirical Evaluation • Keepaway: 3 vs. 2, 4 vs. 3, 5 vs. 4, 6 vs. 5, 7 vs. 6 • Server Job Scheduling • Autonomic computing task • Server processes jobs in a queue while new jobs arrive • Policy selects between jobs with different utility functions • Source task: job types 1, 2; target task: job types 1-4

  26. Empirical Evaluation • Keepaway: 3 vs. 2, 4 vs. 3, 5 vs. 4, 6 vs. 5, 7 vs. 6 • Server Job Scheduling • Autonomic computing task • Server processes jobs in a queue while new jobs arrive • Policy selects between jobs with different utility functions • Mountain Car: 2D, 3D • Cross-Domain Transfer: Ringworld to Keepaway, Knight Joust to Keepaway • Task differences covered: number of actions, number of state variables, discrete vs. continuous, deterministic vs. stochastic, fully vs. partially observable, single-agent vs. multi-agent [Figure: Keepaway players K1-K3, T1-T2]

  27. Outline • Reinforcement Learning Background • Inter-Task Mappings • Value Function Transfer • MASTER: Learning Inter-Task Mappings • Related Work • Future Work and Conclusion 27

  28. Learning Task Relationships • Sometimes task relationships are unknown • Necessary for autonomous transfer • But finding similarities (analogies) can be very hard! • Key idea: agents may generate data (experience) in both tasks, so leverage existing machine learning techniques • Two techniques, differing in the amount of background knowledge required

  29. Context • Steps to enable autonomous transfer: 1. Select a relevant source task, given a target task 2. Learn how the source and target tasks are related 3. Effectively transfer knowledge between tasks • Transfer is feasible (step 3) • Steps toward finding mappings between tasks (step 2): • Leverage full QDBNs to search for mappings [Liu and Stone, 2006] • Test possible mappings on-line [Soni and Singh, 2006] • Mapping learning via classification

  30. Context (continued) [Figure: mapping learning via classification; (S, A, r, S’) data from both tasks is split into (S, r, S’) examples labeled with the action A, an action classifier is trained, and it yields the A→A mapping]

  31. MASTER Overview: Modeling Approximate State Transitions by Exploiting Regression • Goals: • Learn the inter-task mapping between tasks • Minimize data complexity • No background knowledge needed • Algorithm overview: • Record data in the source task • Record a small amount of data in the target task • Analyze the data off-line to determine the best mapping • Use the mapping in the target task [Figure: MASTER sits between the source task and target task agent-environment loops]

  32. MASTER Algorithm • Record observed (ssource, asource, s’source) tuples in the source task • Record a small number of (starget, atarget, s’target) tuples in the target task • Learn a one-step transition model, T(ST, AT), for the target task: M(starget, atarget) → s’target • For every possible action mapping χA and every possible state variable mapping χX: transform the recorded source task tuples and calculate the error of the transformed tuples on the target task model: ∑(M(stransformed, atransformed) - s’transformed)² • Return the χA, χX with the lowest error [Figure: MASTER between the source and target agent-environment loops]
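A compact sketch of this loop (an illustration of the idea, not the talk's implementation; fit_model, the model's predict method, and the tuple layout are assumptions):

```python
# Sketch of MASTER: enumerate candidate mappings, transform recorded source tuples
# into target-task form, and score each candidate against a learned one-step model.
from itertools import product

def master(source_tuples, target_tuples, n_target_vars, target_actions,
           source_actions, fit_model):
    """Return the (chi_X, chi_A) pair with the lowest error on the target model."""
    model = fit_model(target_tuples)      # model.predict(s, a) -> predicted next state
    n_source_vars = len(source_tuples[0][0])
    best, best_err = None, float("inf")
    # chi_X[i] = index of the source variable playing the role of target variable i
    for chi_X in product(range(n_source_vars), repeat=n_target_vars):
        # chi_A maps each target action to one source action
        for picks in product(source_actions, repeat=len(target_actions)):
            chi_A = dict(zip(target_actions, picks))
            err = 0.0
            for a_t in target_actions:
                for s, a, s_next in source_tuples:
                    if a != chi_A[a_t]:
                        continue                    # source tuple does not match a_t
                    s_tr = tuple(s[i] for i in chi_X)
                    s_next_tr = tuple(s_next[i] for i in chi_X)
                    pred = model.predict(s_tr, a_t)
                    err += sum((p - x) ** 2 for p, x in zip(pred, s_next_tr))
            if err < best_err:
                best, best_err = (chi_X, chi_A), err
    return best
```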

  33. Observations • Pros: • Very little target task data needed (sample complexity) • Analysis for discovering mappings is off-line • Cons: • Exponential in # of state variables and actions 33

  34. Generalized Mountain Car • 2D Mountain Car • State variables: x, ẋ • Actions: Left, Neutral, Right • 3D Mountain Car (novel task) • State variables: x, y, ẋ, ẏ • Actions: Neutral, West, East, South, North

  35. Generalized Mountain Car • Both tasks: episodic, scaled state variables, Sarsa, CMAC function approximation • 2D Mountain Car: x, ẋ; Left, Neutral, Right • 3D Mountain Car (novel task): x, y, ẋ, ẏ; Neutral, West, East, South, North • Hand-coded χX: x, y → x and ẋ, ẏ → ẋ • Hand-coded χA: Neutral → Neutral; West, South → Left; East, North → Right
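The hand-coded mappings above, written in the dictionary form sketched earlier (names such as x_dot are just one way to spell the velocity variables):

```python
# The slide's hand-coded 3D -> 2D mountain car mappings as dictionaries.
chi_X_mc = {"x": "x", "y": "x", "x_dot": "x_dot", "y_dot": "x_dot"}
chi_A_mc = {"Neutral": "Neutral",
            "West": "Left", "South": "Left",
            "East": "Right", "North": "Right"}
```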

  36. MASTER Algorithm • Record observed (ssource, asource, s’source) tuples in the source task • Record a small number of (starget, atarget, s’target) tuples in the target task • Learn a one-step transition model, T(S, A), for the target task: M(starget, atarget) → s’target • For every possible action mapping χA and every possible state variable mapping χX: transform the recorded source task tuples and calculate the error of the transformed tuples on the target task model: ∑(M(stransformed, atransformed) - s’transformed)² • Return the χA, χX with the lowest error

  37. MASTER and Mountain Car • Record observed (x, ẋ, a2D, x’, ẋ’) tuples in the 2D task • Record a small number of (x, y, ẋ, ẏ, a3D, x’, y’, ẋ’, ẏ’) tuples in the 3D task • Learn a one-step transition model, T(S, A), for the 3D task: M(x, y, ẋ, ẏ, a3D) → (x’, y’, ẋ’, ẏ’) • For every possible action mapping χA and every possible state variable mapping χX: transform the recorded source task tuples and calculate the error of the transformed tuples on the target task model: ∑(M(stransformed, atransformed) - s’transformed)² • Return the χA, χX with the lowest error (of 240 possible mappings: 16 state variable mappings × 15 action mappings)
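As a usage note, this instantiation would correspond to a call like the following against the earlier master() sketch; tuples_2d, tuples_3d, and fit_transition_model are hypothetical placeholders, not the talk's code.

```python
# Hypothetical call: apply the master() sketch to the 2D -> 3D mountain car setting.
best_mapping = master(
    source_tuples=tuples_2d,            # (s, a, s') with s = (x, x_dot)
    target_tuples=tuples_3d,            # (s, a, s') with s = (x, y, x_dot, y_dot)
    n_target_vars=4,
    target_actions=["Neutral", "West", "East", "South", "North"],
    source_actions=["Left", "Neutral", "Right"],
    fit_model=fit_transition_model,     # e.g., a regression model fit to 3D tuples
)
```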

  38. Utilizing Mappings in 3D Mountain Car [Figure: learning curves comparing hand-coded mappings with no transfer]

  39. Experimental Setup • Learn in 2D Mountain Car for 100 episodes • Learn in 3D Mountain Car for 25 episodes • Apply MASTER • Train transition model off-line using backprop in Weka • Transfer from 2D to 3D: Q-Value Reuse • Learn the 3D Task 43
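A minimal sketch of the Q-Value Reuse step named above: the transferred source Q stays frozen and only a correction term is learned in the target task. The tabular representation is an illustrative simplification; source_var_of and chi_A follow the earlier sketches.

```python
# Sketch of Q-Value Reuse: target value = frozen, mapped source Q + learned correction.
def reused_q(Q_source, Q_correction, s_t, a_t, source_var_of, chi_A):
    s_s = tuple(s_t[j] for j in source_var_of)      # map target state to source state
    return Q_source.get((s_s, chi_A[a_t]), 0.0) + Q_correction.get((s_t, a_t), 0.0)

# During Sarsa in the target task, only Q_correction receives the TD update;
# Q_source stays fixed as the transferred knowledge.
```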

  40. State Variable Mappings Evaluated 44

  41. Action Mappings Evaluated • Example 2D tuple: (-0.50, 0.01, Right, -0.49, 0.02) • Transformed under East → Right: (-0.50, -0.50, 0.01, 0.01, East, -0.49, -0.49, 0.02, 0.02) • Transformed under North → Right: (-0.50, -0.50, 0.01, 0.01, North, -0.49, -0.49, 0.02, 0.02)

  42. Transfer in 3D Mountain Car [Figure: learning curves for Hand-Coded, 1/MSE, Average Actions, Average Both, and No Transfer]

  43. Transfer in 3D Mountain Car: Zoom [Figure: zoomed learning curves for Average Actions and No Transfer]

  44. MASTER Wrap-up • First fully autonomous mapping-learning method • Learning is done off-line • Can be used to select the most relevant source task or to transfer from multiple source tasks • Future work: incorporate heuristic search; use in more complex domains; formulate as an optimization problem?

  45. Outline • Reinforcement Learning Background • Inter-Task Mappings • Value Function Transfer • MASTER: Learning Inter-Task Mappings • Related Work • Future Work and Conclusion 49

  46. Related Work: Framework • Allowed task differences • Source task selection • Type of knowledge transferred • Allowed base learners • + 3 others 50

  47. Selected Related Work: Transfer Methods • Same state variables and actions [Selfridge+, 1985] • Multi-task learning [Fernandez and Veloso, 2006] • Methods to avoid inter-task mappings [Konidaris and Barto, 2007] • Different state variables and actions [Torrey+, •] [Figure: agent-environment loop annotated with T(s, a) = s’ and s = ⟨x1, … xn⟩]

  48. Selected Related Work: Mapping Learning Methods • On-line: • Test possible mappings on-line as new actions [Soni and Singh, 2006] • k-armed bandit, where each arm is a mapping [Talvitie and Singh, 2007] • Off-line: • Full Qualitative Dynamic Bayes Networks (QDBNs) [Liu and Stone, 2006] • Assume T types of task-independent objects; the Keepaway domain has 2 object types: Keepers and Takers [Figure: “Hold” in 2 vs. 1 Keepaway]

  49. Outline • Reinforcement Learning Background • Inter-Task Mappings • Value Function Transfer • MASTER: Learning Inter-Task Mappings • Related Work • Future Work and Conclusion 53

  50. Open Question 1: Optimizing for Metrics • Minimize target time: more source task training? • Minimize total time: a “moderate” amount of training? • Depends on task similarity [Figure: 3 vs. 2 to 4 vs. 3 results]
