
Evolutionary Algorithms for Reinforcement Learning






Presentation Transcript


  1. Evolutionary Algorithms for Reinforcement Learning David E. Moriarty, Alan C. Schultz, John J. Grefenstette

  2. Overview • Reinforcement Learning • TD Algorithms for Reinforcement Learning • Evolutionary Algorithms for RL • Policy Representations in EARL • Fitness and Credit Assignment in EARL • Strengths of EARL • Limitations of EARL

  3. Reinforcement Learning • A flexible approach to the design of intelligent agents in situations for which both planning and supervised learning are impractical. • The goal is to solve sequential decision tasks through trial-and-error interactions with the environment. • At any given time step t, the agent perceives its state s_t and selects an action a_t. • The system responds by giving the agent some numerical reward r_t and changing into state s_{t+1}.
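
A minimal Python sketch of this perceive-act-reward loop; the policy, the environment step function, and the episode horizon below are hypothetical placeholders, not from the paper.

  def run_episode(policy, env_step, start_state, horizon=100):
      """Generic RL interaction loop: at each step t the agent observes state s_t,
      selects a_t = policy(s_t), and the environment returns a reward r_t, the
      next state s_{t+1}, and a termination flag."""
      state, total_reward = start_state, 0.0
      for t in range(horizon):
          action = policy(state)
          reward, state, done = env_step(state, action)  # environment's response
          total_reward += reward
          if done:
              break
      return total_reward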

  4. Reinforcement Learning • The agent’s goal is to learn a policy π: S -> A. • The optimal policy is typically defined as the policy that produces the greatest cumulative reward over all states. • V^π(s) is the cumulative reward received from state s under policy π, e.g. the infinite-horizon discounted criterion V^π(s) = Σ_{t=0..∞} γ^t r_t, or the finite-horizon criterion V^π(s) = Σ_{t=0..h} r_t.
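
A small worked example of the two cumulative-reward criteria above; the reward sequence and discount factor are invented for illustration.

  def discounted_return(rewards, gamma=0.9):
      # infinite-horizon criterion, truncated to the observed rewards:
      # V = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
      return sum(gamma ** t * r for t, r in enumerate(rewards))

  def finite_horizon_return(rewards, h):
      # finite-horizon criterion: undiscounted sum of the first h rewards
      return sum(rewards[:h])

  rewards = [0, 0, 1, 0, 5]                      # hypothetical reward sequence
  print(discounted_return(rewards, gamma=0.9))   # 0.81*1 + 0.6561*5 = 4.0905
  print(finite_horizon_return(rewards, h=3))     # 0 + 0 + 1 = 1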

  5. Reinforcement Learning • An agent’s state descriptions are usually identified with the values returned by its sensors. Often the sensors do not give the agent complete state information (partial observability). • RL provides a flexible approach to the design of intelligent agents in situations for which both planning and supervised learning are impractical. • RL can be applied to problems for which significant domain knowledge is either unavailable or costly to obtain.

  6. Reinforcement Learning • Policy space vs. value function space • Goal: find an optimal policy π* • Policy-space search • Maintain explicit representations of policies and modify them through a variety of search operators. • Examples: simulated annealing, evolutionary algorithms • Value-function search • Attempt to learn the optimal value function V*, which returns the expected cumulative reward for the optimal policy from each state. • Examples: dynamic programming, value iteration, TD algorithms

  7. Temporal Difference Algorithms • Use observations of prediction differences between consecutive states to update value predictions. • Value function update rule: V(s_t) <- V(s_t) + α [r_t + γ V(s_{t+1}) - V(s_t)] • Q-Learning: learn the Q function, a value function that represents the expected value of taking action a in state s and behaving optimally thereafter. • Q function update rule: Q(s_t, a_t) <- Q(s_t, a_t) + α [r_t + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t)]
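
A one-step tabular Q-learning update in Python, mirroring the rule above; the dictionary representation and the default step-size and discount values are illustrative choices.

  def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
      """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
      Q is a dict mapping (state, action) pairs to value estimates."""
      best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
      td_error = r + gamma * best_next - Q.get((s, a), 0.0)
      Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
      return Q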

  8. Evolutionary Algorithms for Reinforcement Learning (EARL) • Policy-space search using an evolutionary algorithm • Requirements of an EA • An appropriate mapping between the search space and the space of chromosomes • An appropriate fitness function • For many problems, an EA can be applied in a relatively straightforward manner. • The most critical design choice in an EA is the representation. • A form of search bias, similar to biases in other ML methods. • An EA is sensitive to the choice of representation.

  9. A Simple EARL • Uses a single chromosome per policy, with a single gene associated with each observed state. • Each gene’s value represents the action associated with the corresponding state. • Fitness can be evaluated during a single trial (deterministic case) or averaged over a sample of trials. • Basic crossover and mutation operators are used.
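
A sketch of such a single-chromosome EARL; truncation selection, uniform crossover, and per-gene mutation are assumed design choices that the slide does not specify.

  import random

  def evolve_policies(states, actions, fitness, pop_size=50, generations=100,
                      mutation_rate=0.01):
      """Each chromosome maps every observed state to an action; fitness(policy)
      returns the reward of one trial (or an average over several trials)."""
      population = [{s: random.choice(actions) for s in states}
                    for _ in range(pop_size)]
      for _ in range(generations):
          ranked = sorted(population, key=fitness, reverse=True)
          parents = ranked[:pop_size // 2]                 # truncation selection
          children = []
          while len(children) < pop_size:
              p1, p2 = random.sample(parents, 2)
              child = {s: random.choice((p1[s], p2[s]))    # uniform crossover
                       for s in states}
              for s in states:                             # per-gene mutation
                  if random.random() < mutation_rate:
                      child[s] = random.choice(actions)
              children.append(child)
          population = children
      return max(population, key=fitness)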

  10. Policy Representations in EARL • Single-chromosome representation • Rule-based representation • A set of condition-action rules • Neural-network-based representation • Use a neural network as a function approximator and use the EA to adjust its parameters
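
One common way to realize the neural-network representation is to flatten the network's parameters into a single chromosome that the EA mutates and recombines; the single-layer architecture below is an assumption made only to keep the sketch short.

  import random

  def random_chromosome(n_inputs, n_actions):
      # one weight per input plus a bias, for each action unit
      return [random.gauss(0.0, 1.0) for _ in range((n_inputs + 1) * n_actions)]

  def make_policy(weights, n_inputs, n_actions):
      """Interpret a flat chromosome as a single-layer network mapping a sensor
      vector to one score per action; the highest-scoring action is chosen."""
      def policy(obs):
          scores = []
          for a in range(n_actions):
              w = weights[a * (n_inputs + 1):(a + 1) * (n_inputs + 1)]
              scores.append(w[-1] + sum(wi * xi for wi, xi in zip(w, obs)))
          return max(range(n_actions), key=lambda a: scores[a])
      return policy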

  11. Policy Representations in EARL • Distributed representation • Allows evolution to work at a more detailed level. • Permits the user to exploit background knowledge. • Rule-based representation • Learning Classifier Systems (LCS) • Uses an EA to evolve if-then rules, called classifiers, that map sensory input to an appropriate action. • When sensory input is received, it is posted on the message list. • If the left-hand side of a classifier matches a message on the message list, its right-hand side is posted on the message list. These new messages may subsequently trigger other classifiers. • Each chromosome represents a single decision rule, and the entire population represents the agent’s policy.
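
A toy illustration of classifier matching with '#' wildcards; the condition strings, action messages, and 4-bit sensor encoding are hypothetical.

  def matches(condition, message):
      """A condition is a string over {'0', '1', '#'}; '#' matches either bit."""
      return all(c == '#' or c == m for c, m in zip(condition, message))

  def post_matching(classifiers, message_list):
      """One matching cycle: every classifier whose condition matches a message
      on the list posts its action message, which may trigger further rules."""
      new_messages = [action for condition, action in classifiers
                      if any(matches(condition, m) for m in message_list)]
      return message_list + new_messages

  classifiers = [('1#0#', 'move_north'), ('##01', 'move_east')]
  print(post_matching(classifiers, ['1101']))  # ['1101', 'move_north', 'move_east']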

  12. Policy Representations in EARL • Figure: Holland’s Learning Classifier System • Figure: LCS population for a grid world

  13. Policy Representations in EARL • Distributed neural-network-based representation • Uses a population of neurons and a population of network blueprints. • Exploits the a priori knowledge that individual neurons are the building blocks of neural networks.
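
A rough sketch of the neuron-and-blueprint idea: a blueprint selects neurons from the neuron population, and the selected neurons are assembled into one candidate network. The dictionary layout of a neuron and the linear activation are simplifications, not the authors' exact design.

  def build_network(blueprint, neuron_population):
      """A blueprint is a list of indices into the neuron population; each neuron
      carries its own input and output connection weights."""
      return [neuron_population[i] for i in blueprint]

  def network_output(network, obs, n_outputs):
      """Feed-forward pass: each hidden neuron sums its weighted inputs and
      contributes to every output unit through its output weights."""
      outputs = [0.0] * n_outputs
      for neuron in network:
          activation = sum(w * x for w, x in zip(neuron['in'], obs))
          for j in range(n_outputs):
              outputs[j] += neuron['out'][j] * activation
      return outputs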

  14. Fitness and Credit Assignment in EARL • Policy-level credit assignment • How to apportion the reward for a sequence of decisions to the individual decisions. • In EARL, credit is implicitly assigned over extended sequences, since policies that prescribe poor individual decisions will have fewer offspring. • In TD, the immediate reward and the estimated payoff are explicitly propagated back.

  15. Fitness and Credit Assignment in EARL • Subpolicy credit assignment • For distributed-representation EARLs, fitness is explicitly assigned to individual components. • Classifier systems • Each classifier has a strength, which is updated using a TD-like method called the bucket brigade algorithm. • SAMUEL • Each gene maintains a quantity called strength. • Strength plays a role in resolving conflicts and triggering mutation.
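
A simplified, TD-like strength update in the spirit of the bucket brigade; the fixed bid fraction and the equal split of reward and payments are assumptions, not Holland's exact algorithm.

  def bucket_brigade_step(strengths, active_now, active_prev, reward, bid=0.1):
      """Classifiers active at this step pay a fraction of their strength (their
      'bid') to the classifiers that fired at the previous step, and collect any
      external reward received now."""
      payment = sum(bid * strengths[c] for c in active_now)
      for c in active_now:
          strengths[c] -= bid * strengths[c]              # pay the bid
          strengths[c] += reward / max(len(active_now), 1)
      for c in active_prev:                               # earlier stage gets paid
          strengths[c] += payment / len(active_prev)
      return strengths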

  16. Strengths of EARL • Scaling up to large state spaces • Policy generalization • Most EARL systems specify the policy at a level of abstraction higher than an explicit mapping from observed states to actions. • Policy selection • Attention is focused on profitable actions only, reducing the space requirements for storing policies.

  17. Strengths of EARL • Dealing with incomplete state information • Implicitly distinguishes among ambiguous states • More robust than simple TD methods • Figures: a partially observable (PO) environment and the policy obtained

  18. Strengths of EARL • Simple TD methods are vulnerable to hidden-state problems. • Because EARL methods associate credit with entire policies, they rely more on the net results of decision sequences than on sensor information that may be ambiguous. • The agent itself remains unable to distinguish the two blue states, but EARL implicitly distinguishes among ambiguous states by rewarding policies that avoid the bad states. • Additional features, such as the agent’s previous decisions and observations, can help disambiguate the two blue states.

  19. Strengths of EARL • Non-stationary environments • As long as the environment changes slowly with respect to the time required to evaluate a population of policies, the population should be able to track a changing fitness landscape without any alteration of the algorithm. • Algorithms for ensuring diversity in evolving populations can be used: • Fitness sharing • Crowding • Local mating
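
As an example of one such diversity mechanism, fitness sharing divides an individual's raw fitness by a niche count, so crowded regions of policy space are penalized; the triangular sharing kernel and the user-supplied distance function are assumptions of this sketch.

  def shared_fitness(population, raw_fitness, distance, sigma=1.0):
      """Each individual's fitness is divided by the number of neighbors within
      radius sigma, weighted by how close they are (a niche count)."""
      shared = []
      for ind in population:
          niche = sum(max(0.0, 1.0 - distance(ind, other) / sigma)
                      for other in population)
          shared.append(raw_fitness(ind) / max(niche, 1e-9))
      return shared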

  20. Strengths of EARL • EARLs with distributed policy representations achieve diversity automatically and are well suited to adaptation in dynamic environments. • If the learning system can detect changes in the environment, an even more direct response is possible: • Anytime learning • When an environmental change is detected, the population of policies is partially reinitialized, using previously learned policies selected on the basis of the similarity between the previously encountered environment and the current environment. • Maintaining a population of policies also makes the system less sensitive to errors in detecting environmental changes.
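
A sketch of the partial-reinitialization step in anytime learning; the case-base layout ('env'/'policy' keys), the similarity function, and the 50% keep fraction are hypothetical.

  import random

  def reinitialize(population, case_base, current_env, similarity, keep=0.5):
      """Keep part of the current population and reseed the rest with previously
      learned policies whose stored environment best matches the current one."""
      n_keep = int(len(population) * keep)
      survivors = population[:n_keep]
      recalled = sorted(case_base,
                        key=lambda case: similarity(case['env'], current_env),
                        reverse=True)
      seeds = [case['policy'] for case in recalled[:len(population) - n_keep]]
      while len(survivors) + len(seeds) < len(population):
          seeds.append(random.choice(population))          # pad if the case base is small
      return survivors + seeds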

  21. Limitations of EARL • Online learning • EARLs require a large number of experiences to evaluate a large population of policies. • It may be dangerous to permit an agent to perform random actions. • Both of these objections apply to TD methods as well. • Rare states • TD methods maintain statistics concerning every state-action pair. • Rare-state information may eventually be lost due to mutation. • Proofs of optimality • Q-learning has a proof of convergence to the optimal value function. • No general theoretical tools are available that can be applied to realistic EARL problems.
