
Presentation Transcript


  1. Multi-Agent Systems, Lecture 10 & 11. University "Politehnica" of Bucharest, 2004-2005. Adina Magda Florea, adina@cs.pub.ro, http://turing.cs.pub.ro/blia_2005

  2. Machine Learning - Lecture outline
  1 Learning in AI (machine learning)
  2 Learning decision trees
  3 Version space learning
  4 Reinforcement learning
  5 Learning in multi-agent systems
    5.1 Learning action coordination
    5.2 Learning individual performance
    5.3 Learning to communicate
    5.4 Layered learning
  6 Conclusions

  3. 1 Learning in AI
  • What is machine learning? Herbert Simon defines learning as: "any change in a system that allows it to perform better the second time on repetition of the same task or another task drawn from the same population" (Simon, 1983).
  In ML the agent learns:
  • knowledge representation of the problem domain
  • problem solving rules, inferences
  • problem solving strategies

  4. Classifying learning
  In MAS learning the agents should learn:
  • what an agent learns in ML, but in the context of MAS - both cooperative and self-interested agents
  • how to cooperate for problem solving - cooperative agents
  • how to communicate - both cooperative and self-interested agents
  • how to negotiate - self-interested agents
  Different dimensions:
  • explicitly represented domain knowledge
  • how the critic component (performance evaluation) of a learning agent works
  • the use of knowledge of the domain/environment

  5. Single-agent learning (diagram; components: Teacher, Learning Process, Problem Solving with K & B / Inferences / Strategy, Performance Evaluation, and the Environment, connected by data, learning results, results, and feedback flows)

  6. Self-interested learning agent (diagram; components: Learning Process, Problem Solving with K & B / Inferences / Strategy, Performance Evaluation, Communication with other agents, and the Environment, connected by data, learning results, results, actions, and feedback flows)
  NB: Both in this diagram and the next, not all components or flow arrows are always present; it depends on the type of agent (cognitive, reactive), the type of learning, etc.

  7. Cooperative learning agents (diagram; two agents, each with a Learning Process and a Problem Solving component with K & B / Inferences / Strategy, exchanging information through Communication, sharing a Performance Evaluation, and acting on a common Environment)

  8. 2 Learning decision trees
  • ID3 (Quinlan, '80s)
  • The ID3 algorithm classifies training examples into several classes
  • Training examples: attributes and their values
  Two phases:
  • build the decision tree
  • use the tree to classify unknown instances
  Decision tree - definition; example training set (attributes Shape, Color, Size and a binary classification):
  Shape     Color    Size    Classification
  circle    red      small   +
  circle    red      big     +
  triangle  yellow   small   -
  circle    yellow   small   -
  triangle  red      big     -
  circle    yellow   big     -

  9. The problem: estimating an individual's credit risk on the basis of credit history, current debt, collateral, and income.
  No.  Risk (classification)  Credit history  Debt  Collateral  Income
  1    High                   Bad             High  None        $0k to $15k
  2    High                   Unknown         High  None        $15k to $35k
  3    Moderate               Unknown         Low   None        $15k to $35k
  4    High                   Unknown         Low   None        $0k to $15k
  5    Low                    Unknown         Low   None        Over $35k
  6    Low                    Unknown         Low   Adequate    Over $35k
  7    High                   Bad             Low   None        $0k to $15k
  8    Moderate               Bad             Low   Adequate    Over $35k
  9    Low                    Good            Low   None        Over $35k
  10   Low                    Good            High  Adequate    Over $35k
  11   High                   Good            High  None        $0k to $15k
  12   Moderate               Good            High  None        $15k to $35k
  13   Low                    Good            High  None        Over $35k
  14   High                   Bad             High  None        $15k to $35k
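  For readers who want to follow the ID3 computations on the next slides on a machine, here is a minimal sketch of the same 14 training examples as Python data. The dictionary keys and value spellings are illustrative choices of this transcript, not part of the lecture.

    # Hypothetical encoding of the 14 credit-risk examples from the table above.
    EXAMPLES = [
        {"history": "bad",     "debt": "high", "collateral": "none",     "income": "0-15k",    "risk": "high"},
        {"history": "unknown", "debt": "high", "collateral": "none",     "income": "15-35k",   "risk": "high"},
        {"history": "unknown", "debt": "low",  "collateral": "none",     "income": "15-35k",   "risk": "moderate"},
        {"history": "unknown", "debt": "low",  "collateral": "none",     "income": "0-15k",    "risk": "high"},
        {"history": "unknown", "debt": "low",  "collateral": "none",     "income": "over-35k", "risk": "low"},
        {"history": "unknown", "debt": "low",  "collateral": "adequate", "income": "over-35k", "risk": "low"},
        {"history": "bad",     "debt": "low",  "collateral": "none",     "income": "0-15k",    "risk": "high"},
        {"history": "bad",     "debt": "low",  "collateral": "adequate", "income": "over-35k", "risk": "moderate"},
        {"history": "good",    "debt": "low",  "collateral": "none",     "income": "over-35k", "risk": "low"},
        {"history": "good",    "debt": "high", "collateral": "adequate", "income": "over-35k", "risk": "low"},
        {"history": "good",    "debt": "high", "collateral": "none",     "income": "0-15k",    "risk": "high"},
        {"history": "good",    "debt": "high", "collateral": "none",     "income": "15-35k",   "risk": "moderate"},
        {"history": "good",    "debt": "high", "collateral": "none",     "income": "over-35k", "risk": "low"},
        {"history": "bad",     "debt": "high", "collateral": "none",     "income": "15-35k",   "risk": "high"},
    ]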

  10. Decision tree for the credit data:
  Income?
    $0K-$15K: High risk
    $15K-$35K: Credit history?
      Unknown: Debt? (High: High risk; Low: Moderate risk)
      Bad: High risk
      Good: Moderate risk
    Over $35K: Credit history?
      Unknown: Low risk
      Bad: Moderate risk
      Good: Low risk
  ID3 assumes that the simplest decision tree covering all the training examples is the one that should be picked.
  Ockham's (Occam's) Razor, c. 1324: "It is vain to do with more what can be done with less... Entities should not be multiplied beyond necessity."

  11. Information theoretic test selection in ID3
  • Information theory measures the information content of a message M = {m1, ..., mn} with probabilities p(mi).
  • The information content of the message M:
  I(M) = Σ(i=1..n) -p(mi) * log2(p(mi))
  For a fair coin toss:
  I(Coin_toss) = -p(heads)*log2(p(heads)) - p(tails)*log2(p(tails)) = -1/2*log2(1/2) - 1/2*log2(1/2) = 1 bit
  For a biased coin with p(heads) = 3/4, p(tails) = 1/4:
  I(Coin_toss) = -3/4*log2(3/4) - 1/4*log2(1/4) = 0.811 bits
  For the credit-risk training set: p(risk_high) = 6/14, p(risk_moderate) = 3/14, p(risk_low) = 5/14
  • The information in any tree that covers the 14 examples:
  I(Tree) = -6/14*log2(6/14) - 3/14*log2(3/14) - 5/14*log2(5/14) = 1.531 bits
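  As a quick cross-check of the numbers above, here is a small sketch that computes the same information contents; it assumes the EXAMPLES list from the earlier sketch.

    import math
    from collections import Counter

    def information(probabilities):
        # I(M) = sum_i -p(m_i) * log2 p(m_i); zero-probability terms contribute nothing
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(information([0.5, 0.5]))       # fair coin: 1.0 bit
    print(information([0.75, 0.25]))     # biased coin: ~0.811 bits

    # Information in the credit-risk training set (6 high, 3 moderate, 5 low out of 14)
    counts = Counter(e["risk"] for e in EXAMPLES)
    print(information([c / 14 for c in counts.values()]))   # ~1.531 bits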

  12. The information gain provided by making a test on attribute A at the root of the tree:
  Gain(A) = I(C) - E(A)
  where C is the whole training set and E(A) is the amount of information still needed to complete the tree after making A the root.
  If A has n values, it partitions C into subsets {C1, C2, ..., Cn}, and
  E(A) = Σ(i=1..n) |Ci| / |C| * I(Ci)
  For the income attribute: C1 = {1,4,7,11}, C2 = {2,3,12,14}, C3 = {5,6,8,9,10,13}
  E(income) = 4/14*I(C1) + 4/14*I(C2) + 6/14*I(C3) = 0.564
  so Gain(income) = I(C) - E(income) = 1.531 - 0.564 = 0.967.
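  A sketch of the gain computation, reusing information() and EXAMPLES from the previous sketches (the attribute and key names are the illustrative ones chosen there):

    from collections import Counter, defaultdict

    def expected_information(examples, attribute, target="risk"):
        # E(A): information still needed after splitting the examples on attribute A
        partitions = defaultdict(list)
        for e in examples:
            partitions[e[attribute]].append(e[target])
        total = len(examples)
        return sum(len(subset) / total *
                   information([c / len(subset) for c in Counter(subset).values()])
                   for subset in partitions.values())

    def gain(examples, attribute, target="risk"):
        class_probs = [c / len(examples) for c in Counter(e[target] for e in examples).values()]
        return information(class_probs) - expected_information(examples, attribute, target)

    print(expected_information(EXAMPLES, "income"))   # ~0.564
    print(gain(EXAMPLES, "income"))                   # ~0.967, so income becomes the root test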

  13. Assessing the performance of ID3
  • Training set and test set; average prediction quality plotted against training-set size (learning curves, "happy graphs")
  Broadening the applicability of decision trees:
  • Missing data: how do we classify an instance that is missing one of the test attributes? Pretend the instance has all possible values for that attribute, weight each value according to its frequency among the examples, follow all branches, and multiply the weights along each path.
  • Multivalued attributes: an attribute with a large number of possible values is favored by plain information gain; the gain ratio selects attributes according to Gain(A)/I(CA), where I(CA) is the information content of the attribute's own value distribution.
  • Continuous-valued attributes: discretize.

  14. 3 Version space learning
  • Let P and Q be the sets of sentences in FOPL that match the expressions p and q, respectively.
  • Expression p is more general than q iff P ⊇ Q; we say that p covers q. For example, color(X, red) covers color(ball, red).
  • If a concept p is more general than a concept q, then for the descriptions p(x), q(x) that classify objects as positive examples:
  ∀x p(x) → positive(x)
  ∀x q(x) → positive(x)
  • p covers q iff q(x) → positive(x) is a logical consequence of p(x) → positive(x)
  • Concept space: obj(X, Y, Z)
  • A concept c is maximally specific if it covers all positive examples, none of the negative examples, and for any other concept c' that covers the positive examples, c' is more general than c (the set S)
  • A concept c is maximally general if it covers none of the negative examples, and for any other concept c' that covers no negative examples, c is more general than c' (the set G)

  15. The candidate elimination algorithm
  • Algorithms for searching the concept space, guarding against overgeneralization and overspecialization
  Specific to general search for the hypothesis set S (maximally specific generalizations):
  • Initialize S to the first positive training instance
  • Let N be the set of all negative instances seen so far
  • for each positive instance p do
    • for every s ∈ S do: if s does not match p, replace s with its most specific generalization that matches p
    • delete from S all hypotheses more general than some other hypothesis in S
    • delete from S all hypotheses that match a previously observed negative instance in N
  • for every negative instance n do
    • delete all members of S that match n
    • add n to N to check future hypotheses for overgeneralization

  16. The candidate elimination algorithm
  General to specific search for the hypothesis set G (maximally general specializations):
  • Initialize G to contain the most general concept in the space
  • Let P be the set of all positive instances seen so far
  • for each negative instance n do
    • for every g ∈ G that matches n do: replace g with its most general specializations that do not match n
    • delete from G all hypotheses more specific than some other hypothesis in G
    • delete from G all hypotheses that fail to match some positive instance in P
  • for every positive instance p do
    • delete all members of G that fail to match p
    • add p to P to check future hypotheses for overspecialization

  17. Candidate elimination: bidirectional search maintaining both S and G
  • Initialize G to the most general concept in the space and S to the first positive training instance
  • for each positive instance p do
    • delete from G all hypotheses that fail to match p
    • for every s ∈ S do: if s does not match p, replace s with its most specific generalization that matches p
    • delete from S all hypotheses more general than some other hypothesis in S
    • delete from S all hypotheses more general than some hypothesis in G
  • for every negative instance n do
    • delete all members of S that match n
    • for every g ∈ G that matches n do: replace g with its most general specializations that do not match n
    • delete from G all hypotheses more specific than some other hypothesis in G
    • delete from G all hypotheses more specific than some other hypothesis in S

  18. Candidate elimination trace on the concept space obj(Size, Color, Shape):
  Start:                            G: {obj(X, Y, Z)}       S: { }
  Positive: obj(small, red, ball)   G: {obj(X, Y, Z)}       S: {obj(small, red, ball)}
  Negative: obj(small, blue, ball)  G: {obj(X, red, Z)}     S: {obj(small, red, ball)}
  Positive: obj(large, red, ball)   G: {obj(X, red, Z)}     S: {obj(X, red, ball)}
  Negative: obj(large, red, cube)   G: {obj(X, red, ball)}  S: {obj(X, red, ball)}
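  A minimal sketch of the matching, generalization, and specialization steps for conjunctive concepts of the form obj(Size, Color, Shape), where "?" plays the role of a variable; this is an illustrative simplification of the algorithm above, not a complete implementation.

    def matches(hypothesis, instance):
        # a hypothesis matches an instance if every non-'?' position agrees
        return all(h == "?" or h == v for h, v in zip(hypothesis, instance))

    def generalize(hypothesis, instance):
        # most specific generalization of the hypothesis that also matches the instance
        return tuple(h if h == v else "?" for h, v in zip(hypothesis, instance))

    def specialize(g, positive, negative):
        # most general specializations of g that exclude the negative instance,
        # using attribute values taken from a known positive instance
        return [g[:i] + (positive[i],) + g[i + 1:]
                for i, h in enumerate(g)
                if h == "?" and positive[i] != negative[i]]

    # Reproducing the second step of the trace above:
    S = [("small", "red", "ball")]
    G = [("?", "?", "?")]
    negative = ("small", "blue", "ball")
    G = [s for g in G if matches(g, negative) for s in specialize(g, S[0], negative)]
    print(G)   # [('?', 'red', '?')], i.e. obj(X, red, Z)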

  19. 4 Reinforcement learning
  • Combines dynamic programming and AI machine learning techniques
  • Trial-and-error interactions with a dynamic environment
  • The feedback of the environment is a reward, or reinforcement
  • Two main approaches:
    • search in the space of behaviors (e.g., genetic algorithms)
    • learn utilities of states based on statistical techniques and dynamic programming methods

  20. A reinforcement-learning model
  (Diagram: the agent's behavior B receives input i and reinforcement r from the environment T and emits actions a.)
  B – the agent's behavior
  i – input = current state of the environment
  r – value of the reinforcement (reinforcement signal)
  T – model of the world
  The model consists of:
  • a discrete set of environment states S (s ∈ S)
  • a discrete set of agent actions A (a ∈ A)
  • a set of scalar reinforcement signals, typically {0, 1} or real numbers
  • the transition model of the world, T
  The environment is nondeterministic: T : S x A → P(S), where T(s, a, s') gives the probability of reaching s' when doing a in s.
  Environment history = a sequence of states that leads to a terminal state

  21. Features along which RL settings vary
  • accessible / inaccessible environment
  • the agent has (T known) / does not have a model of the environment
  • learn behavior / learn behavior + model
  • reward received only in terminal states or in any state
  • passive / active learner:
    • a passive learner learns utilities of states (or state-action pairs)
    • an active learner must also learn what to do
  • how the agent represents its behavior B:
    • utility functions on states or state histories (T is known)
    • action-value functions (T is not necessarily known) - assign an expected utility to taking a given action in a given state
  E(a, e) = Σ(e' ∈ env(e, a)) prob(ex(a, e) = e') * utility(e')

  22. The RL problem
  • The agent has to find a policy π = a function mapping states to actions that maximizes some long-run measure of reinforcement.
  • The agent has to learn an optimal behavior = an optimal policy = a policy which yields the highest expected utility.
  • The utility function depends on the environment history (a sequence of states).
  • In each state s the agent receives a reward R(s).
  • Uh([s0, s1, ..., sn]) – utility function on histories

  23. Models of optimal behavior
  • Finite-horizon model: at a given moment of time the agent should optimize its expected reward for the next h steps:
  E(Σ(t=0..h) R(st)), where R(st) is the reward received t steps into the future.
  • Infinite-horizon model: optimize the long-run reward:
  E(Σ(t=0..∞) R(st))
  • Infinite-horizon discounted model: optimize the long-run reward, but rewards received in the future are geometrically discounted according to a discount factor γ:
  E(Σ(t=0..∞) γ^t R(st)), with 0 ≤ γ < 1.

  24. Models of optimal behavior
  • Additive rewards:
  Uh([s0, s1, ..., sn]) = R(s0) + R(s1) + R(s2) + ...
  If U is separable / additive: Uh([s0, ..., sn]) = R(s0) + Uh([s1, ..., sn])
  • Discounted rewards:
  Uh([s0, s1, ..., sn]) = R(s0) + γ*R(s1) + γ^2*R(s2) + ..., with 0 ≤ γ < 1.
  γ can be interpreted in several ways: as an interest rate, as the probability of living another step, or as a mathematical trick to bound an infinite sum.
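  A tiny sketch contrasting the two reward models, using a made-up reward sequence (three -0.04 step costs followed by a +1 terminal reward):

    def additive_utility(rewards):
        return sum(rewards)

    def discounted_utility(rewards, gamma=0.9):     # 0 <= gamma < 1
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    history = [-0.04, -0.04, -0.04, 1.0]            # illustrative reward sequence
    print(additive_utility(history))                # 0.88
    print(discounted_utility(history, 0.9))         # ~0.62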

  25. Exploitation versus exploration
  • A difference between RL and supervised learning: a reinforcement learner must explicitly explore its environment.
  • The representative problem is the n-armed bandit problem.
  • The agent might believe that a particular arm has a fairly high payoff probability; should it choose that arm all the time, or should it choose another one that it has less information about, but which seems to be worse?
  • Answers to these questions depend on how long the agent is expected to play the game: the longer the game lasts, the worse the consequences of prematurely converging to a sub-optimal arm, and the more the agent should explore.

  26. Utilities of states
  • The utility of a state is defined in terms of the utility of state sequences: it is the expected utility of the state sequences that might follow it, if the agent follows the policy π
  Uπ(s) = E(Uh(H(s, π)) | T) = Σ P(H(s, π) | T) * Uh(H(s, π))   (sum over the possible histories H(s, π))
  • H(s, π) – a history beginning in s; E – expected value; Uh – utility function on histories
  • π – a policy; Uπ is determined by the transition model T and the utility function on histories Uh
  • If Uh is separable / additive: Uh([s0, ..., sn]) = R(s0) + Uh([s1, ..., sn])
  • With st the state reached after executing t steps using π:
  Uπ(s) = E(Σ(t=0..∞) γ^t R(st) | π, s0 = s)
  • Uπ(s) captures the long-term value of s; R(s) is the short-term reward

  27. A 4 x 3 environment
  • The intended outcome of each move occurs with probability 0.8; with probability 0.2 (0.1 + 0.1) the agent moves at right angles to the intended direction.
  • The two terminal states have reward +1 and -1; all other states have a reward of -0.04; γ = 1.
  • Resulting utilities (columns 1-4, rows 1-3 with row 3 at the top; square (2,2) is a wall):
    row 3: 0.812  0.868  0.918   +1
    row 2: 0.762   --    0.660   -1
    row 1: 0.705  0.655  0.611  0.388

  28. The utility function U(s) allows the agent to select actions by using the Maximum Expected Utility principle:
  π*(s) = argmax_a Σ(s') T(s, a, s') * U(s')   (optimal policy)
  The utility of a state is the immediate reward for that state plus the expected discounted utility of the next state, assuming that the agent chooses the optimal action:
  U(s) = R(s) + γ max_a Σ(s') T(s, a, s') * U(s')
  • This is the Bellman equation; the utilities U(s) are its unique solutions.
  • Equivalently, U(s) = E(Σ(t=0..∞) γ^t R(st) | π*, s0 = s)

  29. Bellman equation for the 4 x 3 world
  Equation for the state (1,1), with the utilities from the previous slide:
  U(1,1) = -0.04 + γ max{ 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1),   (Up)
                          0.9 U(1,1) + 0.1 U(1,2),                (Left)
                          0.9 U(1,1) + 0.1 U(2,1),                (Down)
                          0.8 U(2,1) + 0.1 U(1,2) + 0.1 U(1,1) }  (Right)
  Up is the best action.
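  Plugging the utilities from the previous slide into this equation confirms the claim; a small sketch (the names U11, U12, U21 are just shorthand for U(1,1), U(1,2), U(2,1), with γ = 1 and R(s) = -0.04):

    U11, U12, U21 = 0.705, 0.762, 0.655     # utilities of (1,1), (1,2), (2,1) from the table

    expected = {
        "Up":    0.8 * U12 + 0.1 * U21 + 0.1 * U11,
        "Left":  0.9 * U11 + 0.1 * U12,
        "Down":  0.9 * U11 + 0.1 * U21,
        "Right": 0.8 * U21 + 0.1 * U12 + 0.1 * U11,
    }
    best = max(expected, key=expected.get)
    print(best, round(-0.04 + expected[best], 3))   # Up 0.706 -- consistent with U(1,1) = 0.705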

  30. Value iteration
  • Given the optimal utilities, the optimal policy is:
  π*(s) = argmax_a (R(s) + γ Σ(s') T(s, a, s') * U(s'))
  • Compute U*(s) using an iterative approach (value iteration):
  U0(s) = R(s)
  Ut+1(s) = R(s) + γ max_a Σ(s') T(s, a, s') * Ut(s')
  As t → ∞ the utility values converge to the optimal values.
  • When do we stop the algorithm?
    • RMS (root mean square) error of the utility values
    • policy loss
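  A minimal value-iteration sketch. The representation T[s][a] = list of (probability, next_state) pairs, R as a dict, and actions(s) as a function are assumptions made for illustration, not part of the lecture.

    def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
        # states: iterable of states; actions(s): list of actions available in s
        # T[s][a]: list of (probability, next_state); R[s]: reward of state s
        U = {s: R[s] for s in states}                          # U0(s) = R(s)
        while True:
            U_new = {}
            for s in states:
                options = [sum(p * U[s2] for p, s2 in T[s][a]) for a in actions(s)]
                U_new[s] = R[s] + gamma * (max(options) if options else 0.0)
            delta = max(abs(U_new[s] - U[s]) for s in states)
            U = U_new
            if delta < eps:     # simple convergence test; RMS error or policy loss also work
                return U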

  31. Policy iteration
  Manipulate the policy directly, rather than finding it indirectly via the optimal value function:
  • choose an arbitrary policy π
  • policy evaluation: compute the utilities under π, i.e. solve the equations
  Uπ(s) = R(s) + γ Σ(s') T(s, π(s), s') * Uπ(s')
  • policy improvement: update the policy at each state
  π(s) ← argmax_a Σ(s') T(s, a, s') * Uπ(s')
  • repeat until the policy no longer changes
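  A matching policy-iteration sketch under the same assumed T/R representation as the value-iteration sketch; policy evaluation is approximated here by repeated sweeps instead of solving the linear system exactly.

    def policy_iteration(states, actions, T, R, gamma=0.9, eval_sweeps=50):
        pi = {s: (actions(s)[0] if actions(s) else None) for s in states}   # arbitrary initial policy
        U = {s: R[s] for s in states}
        while True:
            for _ in range(eval_sweeps):               # approximate policy evaluation
                U = {s: R[s] + (gamma * sum(p * U[s2] for p, s2 in T[s][pi[s]]) if pi[s] else 0.0)
                     for s in states}
            changed = False
            for s in states:                           # policy improvement
                if not actions(s):
                    continue
                best = max(actions(s), key=lambda a: sum(p * U[s2] for p, s2 in T[s][a]))
                if best != pi[s]:
                    pi[s], changed = best, True
            if not changed:
                return pi, U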

  32. a) Passive reinforcement learning: ADP (Adaptive Dynamic Programming) learning
  • The problem: calculating an optimal policy in an accessible, stochastic environment.
  • A Markov Decision Process (MDP) consists of <S, A, T, R>:
    S – a set of states
    A – a set of actions
    R – reward function, R: S x A → R
    T – transition model, T: S x A → Π(S), with Π(S) the set of probability distributions over S
  • The model is Markov if the state transitions are independent of any previous environment states or agent actions.
  • MDPs may have finite or infinite state and action spaces; the focus here is on finite-state, finite-action problems.
  • Finding the optimal policy given a model T = calculate the utility U(s) of each state and use the state utilities to select an optimal action in each state.

  33. The utility of a state s is given by the expected utility of the history beginning at that state and following an optimal policy.
  • If the utility function is separable:
  U(s) = R(s) + γ max_a Σ(s') T(s, a, s') * U(s'), for every s    (1)
  (this is a dynamic programming formulation)
  • The equation asserts that the utility of a state s is the expected instantaneous reward plus the expected (discounted) utility of the next state, using the best available action.
  • The maximal expected utility, considering moments of time t0, t1, ..., is
  U*(s) = max_π E(Σ(t=0..∞) γ^t rt), where γ is the discount factor.
  • This is the expected infinite discounted sum of reward that the agent will gain if it is in state s and executes an optimal policy. This optimal value is unique and can be defined as the solution of the simultaneous equations (1).

  34. ADP (Adaptive Dynamic Programming) learning
  function Passive-ADP-Agent(percept) returns an action
    inputs: percept, a percept indicating the current state s' and reward signal r'
    variables: π, a fixed policy
               mdp, an MDP with model T, rewards R, discount γ
               U, a table of utilities, initially empty
               Nsa, a table of frequencies for state-action pairs, initially zero
               Nsas', a table of frequencies of state-action-state triples, initially zero
               s, a, the previous state and action, initially null

    if s' is new then U[s'] ← r'; R[s'] ← r'
    if s is not null then
      increment Nsa[s, a] and Nsas'[s, a, s']
      for each t such that Nsas'[s, a, t] ≠ 0 do
        T[s, a, t] ← Nsas'[s, a, t] / Nsa[s, a]
    U ← Value-Determination(π, U, mdp)
    if Terminal[s'] then s, a ← null else s, a ← s', π[s']
    return a
  end

  35. Temporal difference learning
  Passive learning in an unknown environment
  • The value function is no longer computed by solving a set of linear equations; it is computed iteratively.
  • Use observed transitions to adjust the values of the observed states so that they agree with the constraint equations:
  Uπ(s) ← Uπ(s) + α (R(s) + γ Uπ(s') - Uπ(s))
  where α is the learning rate.
  • Whatever state is visited, its estimated value is updated to be closer to R(s) + γ Uπ(s'), since R(s) is the instantaneous reward received and Uπ(s') is the estimated value of the actually occurring next state.

  36. Temporal difference learning
  function Passive-TD-Agent(percept) returns an action
    inputs: percept, a percept indicating the current state s' and reward signal r'
    variables: π, a fixed policy
               U, a table of utilities, initially empty
               Ns, a table of frequencies for states, initially zero
               s, a, r, the previous state, action, and reward, initially null

    if s' is new then U[s'] ← r'
    if s is not null then
      increment Ns[s]
      U[s] ← U[s] + α(Ns[s]) (r + γ U[s'] - U[s])
    if Terminal[s'] then s, a, r ← null else s, a, r ← s', π[s'], r'
    return a
  end

  37. Temporal difference learning
  • Does not need a model to perform its updates
  • The environment supplies the connections between neighboring states in the form of observed transitions
  ADP and TD compared
  • ADP and TD both try to make local adjustments to the utility estimates in order to make each state "agree" with its successors
  • TD adjusts a state to agree with the observed successor
  • ADP adjusts a state to agree with all of the successors that might occur, weighted by their probabilities

  38. b) Active reinforcement learning
  • A passive learning agent has a fixed policy that determines its behavior
  • An active learning agent must decide what actions to take
  • The agent must learn a complete model with outcome probabilities for all actions (instead of a model for the fixed policy)
  Active learning of action-value functions: Q-learning
  • An action-value function assigns an expected utility to taking a given action in a given state; these are the Q-values
  • Q(a, s) – the expected utility of doing action a in state s
  • Q-values are related to utility values by the equation: U(s) = max_a Q(a, s)

  39. Active learning of action-value functions: Q-learning
  TD learning, unknown environment:
  Q(a, s) ← Q(a, s) + α (R(s) + γ max_a' Q(a', s') - Q(a, s))
  calculated after each transition from state s to s'.
  • Is it better to learn a model and a utility function, or to learn an action-value function with no model?

  40. Q-learning
  function Q-Learning-Agent(percept) returns an action
    inputs: percept, a percept indicating the current state s' and reward signal r'
    variables: Q, a table of action values indexed by state and action
               Nsa, a table of frequencies for state-action pairs
               s, a, r, the previous state, action, and reward, initially null

    if s is not null then
      increment Nsa[s, a]
      Q[a, s] ← Q[a, s] + α(Nsa[s, a]) (r + γ max_a' Q[a', s'] - Q[a, s])
    if Terminal[s'] then s, a, r ← null
    else s, a, r ← s', argmax_a' f(Q[a', s'], Nsa[a', s']), r'
    return a
  end
  f(u, n) – the exploration function: increasing in u and decreasing in n
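  For comparison with the agent above, a minimal tabular Q-learning sketch in Python. The environment interface (env.reset() and env.step(a) returning (next_state, reward, done)) and the epsilon-greedy exploration are assumptions of this sketch; the lecture's version uses the exploration function f(u, n) instead.

    import random
    from collections import defaultdict

    def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)                         # Q[(s, a)], initially zero
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                if random.random() < epsilon:          # explore
                    a = random.choice(actions)
                else:                                  # exploit the current estimates
                    a = max(actions, key=lambda act: Q[(s, act)])
                s2, r, done = env.step(a)
                target = r + gamma * max(Q[(s2, act)] for act in actions)
                Q[(s, a)] += alpha * (target - Q[(s, a)])   # the TD update from the previous slide
                s = s2
        return Q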

  41. Generalization in RL
  • The problem of learning in large spaces is addressed through generalization techniques, which allow compact storage of learned information and transfer of knowledge between "similar" states and actions.
  • RL algorithms involve learning a variety of mappings: S → A (policies), S → R (value functions), S x A → R (Q-functions and rewards), S x A → S (deterministic transitions), and S x A x S → [0, 1] (transition probabilities).
  • Some of these, such as transitions and immediate rewards, can be learned using straightforward supervised learning.
  • Popular techniques: various neural network methods, fuzzy logic, and logical approaches to generalization.

  42. 5 Learning in MAS
  • The credit-assignment problem (CAP) = the problem of assigning feedback (credit or blame) for an overall performance change of the MAS (increase or decrease) to each agent that contributed to that change
  • inter-agent CAP = assigns credit or blame to the external actions of agents
  • intra-agent CAP = assigns credit or blame for a particular external action of an agent to its internal inferences and decisions
  • the distinction is not always obvious; most approaches address one or the other

  43. 5.1 Learning action coordination
  • s – the current environment state
  • Agent i determines the set of actions it can execute in s: Ai(s) = {Aij(s)}
  • It computes the goal relevance Eij(s) of each action
  • Agent i announces a bid for each action with Eij(s) > threshold:
  Bij(s) = (α + β) Eij(s)
  where α is a (small) risk factor and β a noise term (to prevent convergence to local minima)

  44. • The action with the highest bid is selected
  • Incompatible actions are eliminated
  • The process is repeated until all actions in the bids are either selected or eliminated
  • A – the set of selected actions = the activity context
  • Execute the selected actions
  • Update the goal relevance of the actions in A:
  Eij(s) ← Eij(s) - Bij(s) + R / |A|
  where R is the external reward received
  • Update the goal relevance of the actions Akl in the previous activity context Ap:
  Ekl(sp) ← Ekl(sp) + Σ(Aij ∈ A) Bij(s) / |Ap|
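  A sketch of one bidding-and-update round from this scheme. The relevance table, the threshold, and the values of alpha and beta are placeholders chosen for illustration.

    import random

    def make_bids(relevance, alpha=0.05, beta_scale=0.01, threshold=0.0):
        # Bij(s) = (alpha + beta) * Eij(s) for every action whose relevance exceeds the threshold
        bids = {}
        for action, e in relevance.items():
            if e > threshold:
                beta = random.uniform(0.0, beta_scale)      # noise term against local minima
                bids[action] = (alpha + beta) * e
        return bids

    def update_selected(relevance, bids, selected, reward):
        # Eij(s) <- Eij(s) - Bij(s) + R / |A| for every selected action
        for action in selected:
            relevance[action] += -bids[action] + reward / len(selected)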

  45. 5.2 Learning individual performance
  The agent learns how to improve its individual performance in a multi-agent setting.
  Examples:
  • Cooperative agents - learning organizational roles
  • Competitive agents - learning from market conditions

  46. 5.2.1 Learning organizational roles (Nagendra et al.)
  • Agents learn to adopt a specific role in a particular situation (state) in a cooperative MAS.
  • Aim: to increase the utility of final states
  • Each agent may play several roles in a situation
  • The agents learn to select the most appropriate role
  • Reinforcement learning is used over Utility, Probability, and Cost (UPC) estimates of a role in a situation
  • Utility – the agent's estimate of the worth of a final state for a specific role in a situation; world states are mapped to a smaller set of situations S = {s0, ..., sf}
  Urs = U(sf), for a path s0 → ... → sf

  47. • Probability – the likelihood of reaching a final state for a specific role in a situation:
  Prs = p(sf), for a path s0 → ... → sf
  • Cost – the computational cost of reaching a final state for a specific role in a situation
  • Potential of a role – estimates the usefulness of a role for discovering pertinent global information and constraints (orthogonal to utilities)
  • Representation:
    • Sk – the vector of situations for agent k: Sk1, ..., Skn
    • Rk – the vector of roles for agent k: Rk1, ..., Rkm
    • |Sk| x |Rk| x 4 values are needed to describe UPC and Potential

  48. Functioning
  Phase I: Learning
  Several learning cycles; in each cycle:
  • each agent goes from s0 to sf, selecting in each situation the role with the highest selection probability
  • the probability of selecting a role r in a situation s is derived from f(Urs, Prs, Crs, Potrs)
  • f – an objective function used to rate the roles (e.g., f(U, P, C, Pot) = U*P*C + Pot); it depends on the domain

  49. Use reinforcement learning to update UPC and the potential of a role
  • For every s ∈ [s0, ..., sf] and the chosen role r in s:
  Urs(i+1) = (1 - α) Urs(i) + α Usf
  where i is the learning cycle, Usf the utility of the final state, and 0 ≤ α ≤ 1 the learning rate
  Prs(i+1) = (1 - α) Prs(i) + α O(sf)
  O(sf) = 1 if sf is successful, 0 otherwise

  50. Potrsi+1 = (1-)Potrsi + Conf(Path) Path = [s0,…,sf] Conf(Path) = 0 if there are conflicts on the Path, 1 otherwise • The update rules for cost are domain dependent Phase II: Performing In a situation s the role r is chosen such that: 50
