Monte-Carlo Methods Learning methods averaging complete episodic returns Slides based on [Sutton & Barto: Reinforcement Learning: An Introduction, 1998] Slides prepared by Georgios Chalkiadakis
Differences with DP/TD • Differences with DP methods: • Real RL: a complete transition model is not necessary • They sample experience; can be used for direct learning • They do not bootstrap • No evaluation of successor states • Differences with TD methods: • Again, they do not bootstrap (TD methods do) • They average complete episodic returns (contrast sketched below)
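To make the contrast concrete, here is a minimal sketch (illustrative names, not from the slides): the Monte-Carlo target is the full discounted return of a completed episode, while a TD(0) target bootstraps from the current value estimate of the successor state.

```python
# Illustrative sketch; function names and numbers are hypothetical.

def mc_return(rewards, gamma=0.9):
    """Full discounted return G_t computed from a complete episode's rewards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_target(reward, v_next, gamma=0.9):
    """TD(0) target: bootstraps from the current estimate of the successor state."""
    return reward + gamma * v_next

episode_rewards = [0.0, 0.0, 1.0]          # rewards observed after time t
print(mc_return(episode_rewards))          # uses the whole episode: 0.81
print(td_target(0.0, v_next=0.5))          # uses one step plus an estimate: 0.45
```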
Overview and Advantages • Learn from experience – sample episodes • Sample sequences of states, actions, and rewards • Either on-line, or from simulated (model-based) interactions with the environment • No complete model required • Advantages: • Provably learn the optimal policy without a model • Can be used with sample / easy-to-produce models • Can focus on interesting state regions easily • More robust w.r.t. Markov property violations
Policy Evaluation
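The algorithm figure from this slide is not recoverable here; a minimal sketch of first-visit MC policy evaluation (the corresponding algorithm in Sutton & Barto), assuming a hypothetical generate_episode(policy) helper that returns a completed episode as (state, action, reward) triples:

```python
from collections import defaultdict

def first_visit_mc_evaluation(generate_episode, policy, num_episodes, gamma=1.0):
    """First-visit MC policy evaluation: V(s) is the average of the returns
    observed after the first visit to s in each sampled episode."""
    returns = defaultdict(list)        # state -> list of sampled returns
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)             # [(state, action, reward), ...]
        first_visit = {}
        for t, (s, _, _) in enumerate(episode):
            first_visit.setdefault(s, t)

        g = 0.0
        for t in reversed(range(len(episode))):        # accumulate the return backwards
            s, _, r = episode[t]
            g = r + gamma * g
            if first_visit[s] == t:                    # only the first visit counts
                returns[s].append(g)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```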
Action-value functions required • Without a model, we need Q-value estimates • MC methods now average returns following visits to state-action pairs • All such pairs “need” to be visited! • …sufficient exploration required • Randomize episode starts (“exploring starts”) • …or behave using a stochastic (e.g. ε-greedy) policy • …thus “Monte-Carlo”
Monte-Carlo Control (to generate the optimal policy) • For now, assume “exploring starts” • Does “policy iteration” work? • Yes! Evaluation of each policy is over multiple episodes, and improvement makes the policy greedy w.r.t. the current Q-value function
Monte-Carlo Control (to generate the optimal policy) • Why? $\pi_{k+1}$ is greedy w.r.t. $Q^{\pi_k}$ • Then, the policy-improvement theorem applies because, for all s: $Q^{\pi_k}(s, \pi_{k+1}(s)) = \max_a Q^{\pi_k}(s, a) \ge Q^{\pi_k}(s, \pi_k(s)) = V^{\pi_k}(s)$ • So $\pi_{k+1}$ is uniformly better than $\pi_k$ • Thus $V^{\pi_{k+1}} \ge V^{\pi_k}$
A Monte-Carlo control algorithm
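The algorithm figure from this slide is not recoverable here; a minimal sketch of Monte-Carlo control with exploring starts (Monte Carlo ES in Sutton & Barto), where env.random_state_action(), env.step(state, action) and env.actions(state) are assumed helpers, not part of the slides:

```python
import random
from collections import defaultdict

def mc_control_exploring_starts(env, num_episodes, gamma=1.0):
    """MC control with exploring starts (sketch): first-visit averaging of
    returns for (s, a) pairs, then greedy policy improvement.
    env.random_state_action(), env.step() and env.actions() are assumed helpers."""
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {}                                        # state -> current greedy action

    for _ in range(num_episodes):
        # Exploring start: a random (state, action) pair begins the episode.
        s, a = env.random_state_action()
        episode, done = [], False
        while not done:
            s_next, r, done = env.step(s, a)
            episode.append((s, a, r))
            if not done:
                s = s_next
                a = policy.get(s, random.choice(env.actions(s)))

        # Evaluation: average the first-visit returns of each (s, a) pair.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        g = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = r + gamma * g
            if first_visit[(s, a)] == t:
                counts[(s, a)] += 1
                Q[(s, a)] += (g - Q[(s, a)]) / counts[(s, a)]   # incremental average
                # Improvement: make the policy greedy w.r.t. the current Q.
                policy[s] = max(env.actions(s), key=lambda act: Q[(s, act)])
    return policy, Q
```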
What about ε-greedy policies? ε-greedy Exploration • If not “greedy”, select each non-greedy action with probability ε/|A(s)| • Otherwise, select the greedy action with probability 1 − ε + ε/|A(s)| (see the sketch below)
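A minimal sketch of that selection rule (the Q-table layout and names are illustrative). Drawing uniformly over A(s) with probability ε, including the greedy action, yields exactly the probabilities above:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Greedy action with probability 1 - eps + eps/|A(s)|,
    every other action with probability eps/|A(s)|."""
    if random.random() < epsilon:
        return random.choice(actions)                 # uniform over all of A(s)
    return max(actions, key=lambda a: Q[(state, a)])  # greedy w.r.t. Q otherwise
```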
Yes, policy iteration works • See the details in the book • ε-soft on-policy algorithm:
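The algorithm referred to here is on-policy first-visit MC control for ε-soft policies; a minimal sketch, reusing the hypothetical generate_episode and epsilon_greedy helpers from the earlier sketches. The only change from the exploring-starts version is that the policy stays ε-greedy w.r.t. Q instead of fully greedy, so exploring starts are no longer needed:

```python
from collections import defaultdict

def on_policy_mc_control(generate_episode, actions, num_episodes,
                         gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control for eps-soft policies (sketch).
    The behaviour policy is always eps-greedy w.r.t. the current Q, so it is
    also the policy being improved; generate_episode and actions are assumed."""
    Q = defaultdict(float)
    counts = defaultdict(int)

    def policy(state):
        # eps-greedy selection, as in the earlier epsilon_greedy sketch.
        return epsilon_greedy(Q, state, actions(state), epsilon)

    for _ in range(num_episodes):
        episode = generate_episode(policy)             # [(s, a, r), ...]
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        g = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = r + gamma * g
            if first_visit[(s, a)] == t:
                counts[(s, a)] += 1
                Q[(s, a)] += (g - Q[(s, a)]) / counts[(s, a)]   # incremental average
    return Q, policy
```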
…and you can have off-policy learning as well… • Why? Because returns generated by one (behaviour) policy can be re-weighted, via importance sampling, to evaluate a different (target) policy
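A minimal sketch of that idea using ordinary importance sampling (every-visit, for brevity; the book's weighted variant differs only in the normalisation). The function names and the target_prob/behaviour_prob accessors are assumptions, not part of the slides:

```python
from collections import defaultdict

def off_policy_mc_evaluation(generate_episode, behaviour, target_prob,
                             behaviour_prob, num_episodes, gamma=1.0):
    """Off-policy MC evaluation via ordinary importance sampling (sketch):
    episodes come from the behaviour policy, and each return is re-weighted
    by the product of target/behaviour action probabilities."""
    V = defaultdict(float)
    counts = defaultdict(int)

    for _ in range(num_episodes):
        episode = generate_episode(behaviour)               # [(s, a, r), ...]
        g, w = 0.0, 1.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = r + gamma * g
            w *= target_prob(s, a) / behaviour_prob(s, a)   # importance ratio for steps t..T-1
            counts[s] += 1
            V[s] += (w * g - V[s]) / counts[s]              # running average of weighted returns
    return V
```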