
Object Focused Q-learning for Autonomous Agents

Object Focused Q-learning for Autonomous Agents. M. Onur Canci. What is OF-Q learning? Object Focused Q-learning (OF-Q) is a novel reinforcement learning algorithm that can offer exponential speed-ups over classic Q-learning on domains composed of independent objects.


Presentation Transcript


  1. Object Focused Q-learning for Autonomous Agents M. Onur Canci

  2. What is OF-Q learning • Object Focused Q-learning (OF-Q) is a novel reinforcement learning algorithm that can offer exponential speed-ups over classic Q-learning on domains composed of independent objects. • An OF-Q agent treats the state space as a collection of objects organized into different object classes. • A key part is a control policy that uses non-optimal Q-functions to estimate the risk of ignoring parts of the state space.

  3. Reinforcement learning • Why isn't reinforcement learning enough? • Because of the curse of dimensionality, the time required for convergence in high-dimensional state spaces can make the use of RL impractical. • Solution? • There is no easy solution for this problem. • A hint comes from observing human behaviour.

  4. Reinforcement learning • Adult humans are consciously aware of only one stimulus out of every three hundred thousand received, and they can hold a maximum of only 3 to 5 meaningful items or chunks in their short-term memory. • Humans deal with high dimensionality simply by paying attention to a very small number of features at once, a capability known as selective attention.

  5. OF-Q learning • OF-Q embraces the approach of paying attention to a small number of objects at any moment, just like humans do. • Instead of a single high-dimensional policy, OF-Q simultaneously learns a collection of low-dimensional policies along with when to apply each one, i.e., where to focus at each moment.

  6. OF-Q learning • OF-Q learning shares concepts with: • OO-MDP: OO-MDP solvers see the state space as a combination of objects of specific classes. Solvers also need a designer to define a set of domain-specific relations that define how different objects interact with each other. • Modular reinforcement learning (RMDP): requires a full dynamic Bayesian network, which needs either an initial policy that performs well or a good cost heuristic of the domain. • OF-Q differs by: • Learning the Q-values of non-optimal policies to measure the consequences of ignoring parts of the state space. • Relying only on online exploration of the domain. • Learning, for each object class, the Q-values for the optimal and a non-optimal policy, and using that information in the global control policy.

  7. NOTATION • A Markov decision process (MDP) is a tuple (S, A, P, R, γ). • P(s' | s, a): the probability of transitioning to state s' when taking action a in state s. • R(s, a): the immediate reward when taking action a in state s. • γ: the discount factor.

  8. NOTATION - 2 • A policy π(s) defines which action an agent takes in a particular state s. • V^π(s): the state-value of s when following policy π, i.e., the expected sum of discounted rewards that the agent obtains when following policy π from state s. • Q^π(s, a): the discounted reward received when choosing action a in state s and then following policy π.
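
For reference, here are the standard definitions behind these two quantities, written out in LaTeX with the notation of the previous slide (this restatement is added for clarity and is not part of the original slides):

    V^{\pi}(s)    = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, \pi(s_t)) \;\middle|\; s_0 = s \right]
    Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s')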

  9. NOTATION - 3 • π* is the optimal policy, maximizing the value of each state. • V* and Q* are then the optimal state values and Q-values. • Q-values can also be defined with respect to a given policy that is not necessarily optimal.

  10. OF-Q learning's target • OF-Q is designed to solve episodic MDPs with the following properties: • The state space S is defined by a variable number of independent objects. These objects are organized into classes of objects that behave alike. • The agent is seen as an object of a specific class, constrained to be instantiated exactly once in each state. Other classes can have from none to many instances in any particular state. • Each object provides its own reward signal and the global reward is the sum of all object rewards. • Rewards from objects can be positive or negative. A minimal sketch of such a state is shown below.
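
To make the object-based state concrete, here is a minimal Python sketch of the representation described above; the class SpaceObject and the "agent"/"bomb"/"enemy" labels are illustrative assumptions, not names from the paper:

    # Sketch of an OF-Q style state: a variable-length collection of objects,
    # each belonging to a class and emitting its own reward signal.
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class SpaceObject:
        cls: str                    # object class identifier, e.g. "agent", "bomb", "enemy"
        features: Tuple[int, ...]   # low-dimensional, object-local state (e.g. relative x, y)
        reward: float = 0.0         # reward emitted by this object at the current step

    # A state is the agent object plus any number of other objects.
    state: List[SpaceObject] = [
        SpaceObject("agent", (4, 0)),
        SpaceObject("bomb",  (1, 3)),
        SpaceObject("enemy", (2, 5), reward=10.0),
    ]

    # The global reward is the sum of the per-object rewards.
    global_reward = sum(o.reward for o in state)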

  11. C is the set of object classes in the domain. A state s ∈ S is a variable-length collection of objects s = {o_a, o_1, …, o_k}, where o_a is the agent object. Each object carries a class identifier o.class ∈ C. Using the same sample, the Q-value estimate for the random policy of class o.class is updated with the rule sketched below.
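
The update rule itself did not survive the transcript. Assuming a standard tabular TD update for a uniformly random policy (bootstrapping with the average over next actions instead of the maximum), it would look roughly like this:

    \hat{Q}^{\pi_R}_{o.class}(o, a) \leftarrow (1 - \alpha)\, \hat{Q}^{\pi_R}_{o.class}(o, a) + \alpha \left( r_o + \frac{\gamma}{|A|} \sum_{a'} \hat{Q}^{\pi_R}_{o.class}(o', a') \right)

where o' is the same object observed in the next state, r_o is the reward it emitted, and \alpha is the learning rate.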

  12. Overview of OF-Q • Q-value estimation • The algorithm can follow any control policy and still use each object o present in the state to update the Q-values of its class o.class. • For each object class c ∈ C, the algorithm learns the Q-function for the optimal policy π* and the Q-function for the random policy; the hat in the paper's notation (e.g. Q̂) denotes that these are estimates of the true Q-functions. A sketch of this estimation step follows.
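
A minimal sketch of this estimation step, assuming tabular Q-learning over object-local features; the table layout, action set, and function name are illustrative rather than taken from the paper:

    # Update both per-class Q estimates from one observed transition of one object.
    # q_opt[c] approximates the optimal-policy Q-function of class c (max over next
    # actions); q_rand[c] approximates the random-policy Q-function (mean over next
    # actions). Any control policy may have generated the transition.
    from collections import defaultdict

    ACTIONS = range(4)                  # assumed discrete action set
    alpha, gamma = 0.1, 0.99            # learning rate and discount factor

    q_opt  = defaultdict(lambda: defaultdict(float))    # q_opt[cls][(obj_state, a)]
    q_rand = defaultdict(lambda: defaultdict(float))    # q_rand[cls][(obj_state, a)]

    def update_class_q(cls, obj_state, action, reward, next_obj_state):
        # Optimal-policy estimate: bootstrap with the best next action.
        best_next = max(q_opt[cls][(next_obj_state, a)] for a in ACTIONS)
        q_opt[cls][(obj_state, action)] += alpha * (
            reward + gamma * best_next - q_opt[cls][(obj_state, action)])

        # Random-policy estimate: bootstrap with the average next action.
        mean_next = sum(q_rand[cls][(next_obj_state, a)] for a in ACTIONS) / len(ACTIONS)
        q_rand[cls][(obj_state, action)] += alpha * (
            reward + gamma * mean_next - q_rand[cls][(obj_state, action)])

Every object in the state can be pushed through update_class_q for its own class, which is how a single environment sample refreshes many low-dimensional estimates at once.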

  13. Overview of OF-Q • Control policy • The control policy is simple. First decide A, the set of actions that are safe to take with respect to every object in the state, where T_o.class is a per-class dynamic threshold obtained as described on the next slide. The set of all thresholds is denoted T. The control policy then picks, among the safe actions, the one that returns the highest Q-value over all objects, as sketched below.
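
A sketch of that control policy, assuming (consistently with the arbitration slides that follow) that an action is "safe" when the random-policy Q-value of every object in the state stays above its class threshold; the names reuse the illustrative structures from the previous sketches:

    def of_q_policy(state, q_opt, q_rand, thresholds, actions):
        # 1. Keep only actions that no object flags as too risky to ignore:
        #    every random-policy Q-value must stay above the class threshold.
        safe = [a for a in actions
                if all(q_rand[o.cls][(o.features, a)] >= thresholds[o.cls]
                       for o in state)]
        if not safe:            # illustrative fallback if every action is flagged
            safe = list(actions)

        # 2. Among the safe actions, pick the one with the highest optimal-policy
        #    Q-value for any single object, i.e. focus on one object at a time.
        return max(safe,
                   key=lambda a: max(q_opt[o.cls][(o.features, a)] for o in state))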

  14. Overview of OF-Q (threshold operations) • Threshold initialization: to avoid poor actions that result in low rewards, the threshold for each class is initialized with Q_min, the worst Q-value of the domain. • Arbitration: modular RL approaches use two simple arbitration policies: • Winner takes all: the module with the highest Q-value decides the next action. • Greatest mass: the module with the highest sum of Q-values decides the next action. • Both rules are sketched below for comparison.
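
For comparison, the two modular-RL arbitration rules named above can be sketched in the same illustrative style:

    def winner_takes_all(state, q_opt, actions):
        # The single object (module) with the highest Q-value dictates the action.
        value, action = max(((q_opt[o.cls][(o.features, a)], a)
                             for o in state for a in actions),
                            key=lambda pair: pair[0])
        return action

    def greatest_mass(state, q_opt, actions):
        # The action with the highest summed Q-value across all objects wins.
        return max(actions,
                   key=lambda a: sum(q_opt[o.cls][(o.features, a)] for o in state))

The two "illusion of control" slides that follow describe where each of these rules breaks down.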

  15. Test domain • The experimental domain for testing OF-Q is the Space Invaders game.

  16. Illusion of control • Winner takes all • The problem with the winner-takes-all approach is that it may take an action that is very positive for one object but fatal for the overall reward. In the Space Invaders domain, this control policy would be completely blind to the bombs that the enemies drop, because there will always be an enemy to kill that offers a positive Q-value, while bomb Q-values are always negative.

  17. Illusion of control • Greatest mass • It does not make sense to sum Q-values from different policies, because Q-values from different modules are defined with respect to different policies, and in subsequent steps we will not be able to follow several policies at once. • With the two sources of reward in the slide's example, greatest mass would choose the lower state, expecting a reward of 10, while the optimal action is going to the upper state.

  18. OF-Q ARBITRATION • For the pessimal Q-values, both bombs are just as dangerous, because both can possibly hit the ship. Random-policy Q-values will instead identify the closest bomb as the bigger threat.

  19. OF-Q ARBITRATION • In the OF-Q algorithm, the control policy chooses the action that is acceptable for all the objects in the state and has the highest Q-value for one particular object. • To estimate how inconvenient a certain action is with respect to each object, the random-policy Q-function is learned for each object class c. • This random-policy Q-function is a measure of how dangerous it is to ignore a certain object. • As the agent iterates on the risk thresholds, it learns when the risk is too high and a given object should not be ignored.

  20. OF-Q ARBITRATION • It would be impossible to measure risk if we were learning only the optimal-policy Q-values. • The optimal-policy Q-values would not reflect any risk until the risk could not be avoided, because the optimal policy can often evade negative reward at the last moment; • however, there are many objects in the state space, and at that last moment a different object in the state may introduce a constraint that prevents the agent from taking the evasive action.

  21. OF-Q learning (Space Invaders)

  22. OF-Q learning (Normandy) • Normandy: the agent starts in a random cell in the bottom row and must collect the two rewards randomly placed at the top, avoiding the cannon fire.

  23. OF-Q learning (Normandy)

  24. END • Thanks
