
Object Focused Q-learning for Autonomous Agents

Object Focused Q-learning for Autonomous Agents. M. Onur Canci. What is OF-Q learning? Object Focused Q-learning (OF-Q) is a novel reinforcement learning algorithm that can offer exponential speed-ups over classic Q-learning on domains composed of independent objects.


Presentation Transcript


  1. Object Focused Q-learning for Autonomous Agents M. Onur Canci

  2. What is OF-Q learning • Object Focused Q-learning (OF-Q) is a novel reinforcement learning algorithm that can offer exponential speed-ups over classic Q-learning on domains composed of independent objects. • An OF-Q agent treats the state space as a collection of objects organized into different object classes. • A key part is a control policy that uses non-optimal Q-functions to estimate the risk of ignoring parts of the state space.

  3. Reinforcement learning • Why isn't reinforcement learning enough? • Because of the curse of dimensionality, the time required for convergence in high-dimensional state spaces can make the use of RL impractical. • Solution? • There is no easy solution for this problem. • A hint comes from observing human behaviour.

  4. Reinforcement learning • Adult humans are consciously aware of only one stimulus out of every three hundred thousand received, and they can hold a maximum of only 3 to 5 meaningful items or chunks in their short-term memory. • Humans deal with high dimensionality simply by paying attention to a very small number of features at once, a capability known as selective attention.

  5. OF-Q learning • OF-Q embraces the approach of paying attention to a small number of objects at any moment, just like humans do. • Instead of a single high-dimensional policy, OF-Q simultaneously learns a collection of low-dimensional policies along with when to apply each one, i.e., where to focus at each moment.

  6. OF-Q learning • OF-Q learning shares concepts with: • OO-MDP: OO-MDP solvers see the state space as a combination of objects of specific classes. Solvers also need a designer to define a set of domain-specific relations that define how different objects interact with each other. • Modular reinforcement learning (RMDP): requires a full dynamic Bayesian network, which needs either an initial policy that performs well or a good cost heuristic of the domain. • OF-Q differs by: • Learning the Q-values of non-optimal policies to measure the consequences of ignoring parts of the state space. • Relying only on online exploration of the domain. • Learning, for each object class, the Q-values for the optimal and a non-optimal policy, and using that information in the global control policy.

  7. NOTATION • A Markov decision process (MDP) is a tuple (S, A, P, R, γ). • P(s' | s, a): the probability of transitioning to state s' when taking action a in state s. • R(s, a): the immediate reward when taking action a in state s. • γ: the discount factor.

  8. NOTATION - 2 • A policy π(s) defines which action an agent takes in a particular state s. • V^π(s): the state-value of s when following policy π, i.e., the expected sum of discounted rewards that the agent obtains when following policy π from state s. • Q^π(s, a): the discounted reward received when choosing action a in state s and then following policy π.
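
For reference, here are the standard definitions behind these two quantities, written out in LaTeX with the notation of the previous slide (this restatement is added for clarity and is not part of the original slides):

    V^{\pi}(s)    = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, \pi(s_t)) \;\middle|\; s_0 = s \right]
    Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s')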

  9. NOTATION - 3 • π* is the optimal policy, maximizing the value of each state. • V* and Q* are then the optimal state values and Q-values. • Q-values can also be defined with respect to a given policy that is not necessarily optimal.

  10. OF-Q learning's target • OF-Q is designed to solve episodic MDPs with the following properties: • The state space S is defined by a variable number of independent objects. These objects are organized into classes of objects that behave alike. • The agent is seen as an object of a specific class, constrained to be instantiated exactly once in each state. Other classes can have from none to many instances in any particular state. • Each object provides its own reward signal and the global reward is the sum of all object rewards. • Rewards from objects can be positive or negative. A minimal sketch of such a state is shown below.
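
To make the object-based state concrete, here is a minimal Python sketch of the representation described above; the class SpaceObject and the "agent"/"bomb"/"enemy" labels are illustrative assumptions, not names from the paper:

    # Sketch of an OF-Q style state: a variable-length collection of objects,
    # each belonging to a class and emitting its own reward signal.
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class SpaceObject:
        cls: str                    # object class identifier, e.g. "agent", "bomb", "enemy"
        features: Tuple[int, ...]   # low-dimensional, object-local state (e.g. relative x, y)
        reward: float = 0.0         # reward emitted by this object at the current step

    # A state is the agent object plus any number of other objects.
    state: List[SpaceObject] = [
        SpaceObject("agent", (4, 0)),
        SpaceObject("bomb",  (1, 3)),
        SpaceObject("enemy", (2, 5), reward=10.0),
    ]

    # The global reward is the sum of the per-object rewards.
    global_reward = sum(o.reward for o in state)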

  11. C is the set of object classes in the domain. A state s ∈ S is a variable-length collection of objects s = {o_a, o_1, …, o_k}, where o_a is the agent object. Each object carries a class identifier o.class ∈ C. Using the same sample, the Q-value estimate for the random policy of class o.class is updated with the rule sketched below.
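
The update rule itself did not survive the transcript. Assuming a standard tabular TD update for a uniformly random policy (bootstrapping with the average over next actions instead of the maximum), it would look roughly like this:

    \hat{Q}^{\pi_R}_{o.class}(o, a) \leftarrow (1 - \alpha)\, \hat{Q}^{\pi_R}_{o.class}(o, a) + \alpha \left( r_o + \frac{\gamma}{|A|} \sum_{a'} \hat{Q}^{\pi_R}_{o.class}(o', a') \right)

where o' is the same object observed in the next state, r_o is the reward it emitted, and \alpha is the learning rate.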

  12. Overview of OF-Q • Q-value estimation • The algorithm can follow any control policy and still use each object o present in the state to update the Q-values of its class o.class. • For each object class c ∈ C, the algorithm learns the Q-function for the optimal policy π* and the Q-function for the random policy; the hat in the paper's notation (e.g. Q̂) denotes that these are estimates of the true Q-functions. A sketch of this estimation step follows.
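
A minimal sketch of this estimation step, assuming tabular Q-learning over object-local features; the table layout, action set, and function name are illustrative rather than taken from the paper:

    # Update both per-class Q estimates from one observed transition of one object.
    # q_opt[c] approximates the optimal-policy Q-function of class c (max over next
    # actions); q_rand[c] approximates the random-policy Q-function (mean over next
    # actions). Any control policy may have generated the transition.
    from collections import defaultdict

    ACTIONS = range(4)                  # assumed discrete action set
    alpha, gamma = 0.1, 0.99            # learning rate and discount factor

    q_opt  = defaultdict(lambda: defaultdict(float))    # q_opt[cls][(obj_state, a)]
    q_rand = defaultdict(lambda: defaultdict(float))    # q_rand[cls][(obj_state, a)]

    def update_class_q(cls, obj_state, action, reward, next_obj_state):
        # Optimal-policy estimate: bootstrap with the best next action.
        best_next = max(q_opt[cls][(next_obj_state, a)] for a in ACTIONS)
        q_opt[cls][(obj_state, action)] += alpha * (
            reward + gamma * best_next - q_opt[cls][(obj_state, action)])

        # Random-policy estimate: bootstrap with the average next action.
        mean_next = sum(q_rand[cls][(next_obj_state, a)] for a in ACTIONS) / len(ACTIONS)
        q_rand[cls][(obj_state, action)] += alpha * (
            reward + gamma * mean_next - q_rand[cls][(obj_state, action)])

Every object in the state can be pushed through update_class_q for its own class, which is how a single environment sample refreshes many low-dimensional estimates at once.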

  13. Overview of OF-Q • Control policy • The control policy is simple. First decide A, the set of actions that are safe to take with respect to every object in the state, where T_o.class is a per-class dynamic threshold obtained as described on the next slide. The set of all thresholds is denoted T. The control policy then picks, among the safe actions, the one that returns the highest Q-value over all objects, as sketched below.
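
A sketch of that control policy, assuming (consistently with the arbitration slides that follow) that an action is "safe" when the random-policy Q-value of every object in the state stays above its class threshold; the names reuse the illustrative structures from the previous sketches:

    def of_q_policy(state, q_opt, q_rand, thresholds, actions):
        # 1. Keep only actions that no object flags as too risky to ignore:
        #    every random-policy Q-value must stay above the class threshold.
        safe = [a for a in actions
                if all(q_rand[o.cls][(o.features, a)] >= thresholds[o.cls]
                       for o in state)]
        if not safe:            # illustrative fallback if every action is flagged
            safe = list(actions)

        # 2. Among the safe actions, pick the one with the highest optimal-policy
        #    Q-value for any single object, i.e. focus on one object at a time.
        return max(safe,
                   key=lambda a: max(q_opt[o.cls][(o.features, a)] for o in state))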

  14. Overview of OF-Q (threshold operations) • Threshold initialization: to avoid poor actions that result in low rewards, the threshold for each class is initialized with Q_min, the worst Q-value of the domain. • Arbitration: modular RL approaches use two simple arbitration policies: • Winner takes all: the module with the highest Q-value decides the next action. • Greatest mass: the module with the highest sum of Q-values decides the next action. • Both rules are sketched below for comparison.
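
For comparison, the two modular-RL arbitration rules named above can be sketched in the same illustrative style:

    def winner_takes_all(state, q_opt, actions):
        # The single object (module) with the highest Q-value dictates the action.
        value, action = max(((q_opt[o.cls][(o.features, a)], a)
                             for o in state for a in actions),
                            key=lambda pair: pair[0])
        return action

    def greatest_mass(state, q_opt, actions):
        # The action with the highest summed Q-value across all objects wins.
        return max(actions,
                   key=lambda a: sum(q_opt[o.cls][(o.features, a)] for o in state))

The two "illusion of control" slides that follow describe where each of these rules breaks down.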

  15. Test domain • The experimental domain for testing OF-Q is the Space Invaders game.

  16. Illusion of control • Winner takes all • The problem with the winner-takes-all approach is that it may take an action that is very positive for one object but fatal for the overall reward. In the Space Invaders domain, this control policy would be completely blind to the bombs that the enemies drop, because there will always be an enemy to kill that offers a positive Q-value, while bomb Q-values are always negative.

  17. Illusion of control • Greatest mass • It does not make sense to sum Q-values from different policies, because Q-values from different modules are defined with respect to different policies, and in subsequent steps we will not be able to follow several policies at once. • With the two sources of reward in the slide's example, greatest mass would choose the lower state, expecting a reward of 10, while the optimal action is going to the upper state.

  18. OF-Q ARBITRATION • For the pessimal Q-values, both bombs are just as dangerous, because both can possibly hit the ship. Random-policy Q-values will instead identify the closest bomb as the bigger threat.

  19. OF-Q ARBITRATION • In the OF-Q algorithm, the control policy chooses the action that is acceptable for all the objects in the state and has the highest Q-value for one particular object. • To estimate how inconvenient a certain action is with respect to each object, the random-policy Q-function is learned for each object class c. • This random-policy Q-function is a measure of how dangerous it is to ignore a certain object. • As the agent iterates on the risk thresholds, it learns when the risk is too high and a given object should not be ignored.

  20. OF-Q ARBITRATION • It would be impossible to measure risk if we were learning only the optimal-policy Q-values. • The optimal-policy Q-values would not reflect any risk until the risk could not be avoided, because the optimal policy can often evade negative reward at the last moment; • however, there are many objects in the state space, and at that last moment a different object in the state may introduce a constraint that prevents the agent from taking the evasive action.

  21. OF-Q learning (Space Invaders)

  22. OF-Q learning (Normandy) • Normandy: the agent starts in a random cell in the bottom row and must collect the two rewards randomly placed at the top, avoiding the cannon fire.

  23. OF-Q learning (Normandy)

  24. END • Thanks
