Machine Learning and ILP for Multi-Agent Systems

Daniel Kudenko & Dimitar Kazakov, Department of Computer Science, University of York, UK

Why Learning Agents? • Agent designers cannot foresee every situation that the agent will encounter. • To display full autonomy, agents need to learn from and adapt to novel environments. • Learning is a crucial part of intelligence.

A Brief History [Diagram with two tracks. Machine Learning: Disembodied ML, Single-Agent Learning, Multiple Single-Agent Learners, Social Multi-Agent Learners. Agents: Single-Agent System, Multiple Single-Agent System, Social Multi-Agent System.]

Outline • Principles of Machine Learning (ML) • ML for Single Agents • ML for Multi-Agent Systems • Inductive Logic Programming for Agents

What is Machine Learning? • Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. [Mitchell 97] • Example: T = “play tennis”, E = “playing matches”, P = “score”

Types of Learning • Inductive Learning (Supervised Learning) • Reinforcement Learning • Discovery (Unsupervised Learning)

Inductive Learning [An inductive learning] system aims at determining a description of a given concept from a set of concept examples provided by the teacher and from background knowledge. [Michalski et al. 98]

Inductive Learning [Diagram: Examples of Category C1, Examples of Category C2, …, Examples of Category Cn → Inductive Learning System → Hypothesis (Procedure to Classify New Examples)]

Inductive Learning Example • Example 1: Ammo: low, Monster: near, Light: good, Category: shoot • Example 2: Ammo: low, Monster: far, Light: medium, Category: ¬shoot • Example 3: Ammo: high, Monster: far, Light: good, Category: shoot • Output of the Inductive Learning System: If (Ammo = high) and (Light ∈ {medium, good}) then shoot; …

Performance Measure • Classification accuracy on unseen test set. • Alternatively: measure that incorporates cost of false-positives and false-negatives (e.g. recall/precision).
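The measures named on this slide can be computed from the four confusion counts. The following is a minimal sketch, assuming a two-class task with labels "shoot" and "hold" (the label "hold" stands in for ¬shoot and is an assumption, as are all function names).

```python
def confusion_counts(y_true, y_pred, positive="shoot"):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred, positive="shoot"):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    return (tp + tn) / (tp + fp + fn + tn)


def precision(y_true, y_pred, positive="shoot"):
    tp, fp, _, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fp) if tp + fp else 0.0


def recall(y_true, y_pred, positive="shoot"):
    tp, _, fn, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical labels for an unseen test set and the hypothesis' predictions.
y_true = ["shoot", "shoot", "hold", "hold", "shoot"]
y_pred = ["shoot", "hold", "hold", "shoot", "shoot"]
```

Precision penalises false positives (shooting when it was wrong to), recall penalises false negatives; which one matters more depends on the cost of each error for the agent.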

Where’s the knowledge? • Example (or Object) language • Hypothesis (or Concept) language • Learning bias • Background knowledge

Example Language • Feature-value vectors, logic programs. • Which features are used to represent examples (e.g., ammunition left)? • For agents: which features of the environment are fed to the agent (or the learning module)? • Constructive Induction: automatic feature selection, construction, and generation.

Hypothesis Language • Decision trees, neural networks, logic programs, … • Further restrictions may be imposed, e.g., depth of decision trees, form of clauses. • Choice of hypothesis language influences choice of learning methods and vice versa.

Learning bias • Preference relation between legal hypotheses. • Accuracy on training set. • Hypothesis with zero error on training data is not necessarily the best (noise!). • Occam’s razor: the simpler hypothesis is the better one.

Inductive Learning • No “real” learning without language or learning bias. • IL is search through space of hypotheses guided by bias. • Quality of hypothesis depends on proper distribution of training examples.

Inductive Learning for Agents • What is the target concept (i.e., categories)? • Example: do(a), ¬do(a) for specific action a. • Real-valued categories/actions can be discretized. • Where does the training data come from and what form does it take?

Batch vs Incremental Learning • Batch Learning: collect a set of training examples and compute hypothesis. • Incremental Learning: update hypothesis with each new training example. • Incremental learning more suited for agents.

Batch Learning for Agents • When should (re-)computation of hypothesis take place? • Example: after experienced accuracy of hypothesis drops below threshold. • Which training examples should be used? • Example: sequences of actions that led to success.
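One possible reading of the "accuracy drops below threshold" trigger is sketched below. It assumes the agent tracks the accuracy of its current hypothesis over a sliding window of recent examples and recomputes the hypothesis from all stored examples when that accuracy falls too low. The class name, the window mechanism, and the placeholder majority-label learner are all assumptions made for illustration.

```python
from collections import Counter, deque


class RetrainingAgent:
    """Batch learner that recomputes its hypothesis when recent accuracy drops."""

    def __init__(self, threshold=0.6, window=5):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # 1 = correct prediction, 0 = wrong
        self.memory = []                    # all training examples seen so far
        self.hypothesis = None
        self.retrain_count = 0

    def _fit(self):
        # Placeholder batch learner: always predict the majority label.
        labels = Counter(label for _, label in self.memory)
        self.hypothesis = labels.most_common(1)[0][0]
        self.recent.clear()
        self.retrain_count += 1

    def observe(self, features, label):
        # Score the current hypothesis before storing the new example.
        if self.hypothesis is not None:
            self.recent.append(int(self.hypothesis == label))
        self.memory.append((features, label))
        window_full = len(self.recent) == self.recent.maxlen
        too_inaccurate = (
            window_full and sum(self.recent) / len(self.recent) < self.threshold
        )
        if self.hypothesis is None or too_inaccurate:
            self._fit()
```

A real agent would replace `_fit` with an actual batch learner and might also filter `memory`, for example keeping only action sequences that led to success.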

Eager vs. Lazy learning • Eager learning: commit to hypothesis computed after training. • Lazy learning: store all encountered examples and perform classification based on this database (e.g. nearest neighbour).
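A lazy learner can be sketched as a 1-nearest-neighbour classifier over feature-value vectors. The sketch below assumes symbolic features compared by the number of mismatching values; the class name, the distance choice, the label "no_shoot" (for ¬shoot), and the feature value "poor" in the query are assumptions.

```python
class NearestNeighbour:
    """Lazy learner: stores every example and defers all work to query time."""

    def __init__(self):
        self.examples = []

    def add(self, features, label):
        # "Training" is just storing the example.
        self.examples.append((dict(features), label))

    def classify(self, features):
        if not self.examples:
            raise ValueError("no stored examples to compare against")

        def distance(stored):
            # Number of features whose values differ (Hamming-style distance).
            return sum(1 for key, value in features.items()
                       if stored.get(key) != value)

        # Ties are resolved in favour of the earliest stored example.
        _, label = min(self.examples, key=lambda ex: distance(ex[0]))
        return label


knn = NearestNeighbour()
knn.add({"ammo": "low", "monster": "near", "light": "good"}, "shoot")
knn.add({"ammo": "low", "monster": "far", "light": "medium"}, "no_shoot")
knn.add({"ammo": "high", "monster": "far", "light": "good"}, "shoot")
```

No hypothesis is ever committed to: adding an example is cheap, while each classification scans the whole database.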

Active Learning • Learner decides which training data to receive (i.e. generates training examples and uses oracle to classify them). • Closed Loop ML: learner suggests hypothesis and verifies it experimentally. If hypothesis is rejected, the collected data gives rise to a new hypothesis.

Black-Box vs. White-Box • Black-Box Learning: Interpretation of the learning result is unclear to a user. • White-Box Learning: Creates (symbolic) structures that are comprehensible.

Reinforcement Learning • Agent learns from environmental feedback indicating the benefit of states. • No explicit teacher required. • Learning target: optimal policy (i.e., state-action mapping) • Optimality measure: e.g., cumulative discounted reward.

Q Learning • Value of a state under policy $\pi$ (discounted cumulative reward): $V^{\pi}(s_t) = \sum_{i \ge 0} \gamma^{i} \, r(s_{t+i}, a_{t+i})$ • $0 \le \gamma < 1$ is a discount factor ($\gamma = 0$ means that only immediate reward is considered). • $r(s_{t+i}, a_{t+i})$ is the reward determined by performing actions specified by policy $\pi$. • $Q(s,a) = r(s,a) + \gamma V^{*}(\delta(s,a))$ • Optimal Policy: $\pi^{*}(s) = \arg\max_{a} Q(s,a)$
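For a finite sequence of rewards, the discounted cumulative reward of a state can be evaluated directly. This is a small sketch; the function name and the example reward sequences are assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * rewards[i], i.e. the discounted cumulative reward."""
    if not 0 <= gamma < 1:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    total = 0.0
    # Work backwards so each step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

With `gamma = 0` only the first (immediate) reward counts; values of `gamma` closer to 1 give more weight to rewards received later.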

Q Learning • Initialize all Q(s,a) to 0. • In some state s choose some action a. Let s’ be the resulting state. • Update Q: $Q(s,a) = r + \gamma \max_{a'} Q(s',a')$
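The update rule on this slide can be sketched as tabular Q learning in a tiny deterministic world. The corridor environment (states 0 to 4, reward 1 for reaching state 4), the purely random exploration, and all names are assumptions chosen to keep the example small.

```python
import random
from collections import defaultdict

ACTIONS = ("left", "right")
GOAL = 4  # hypothetical corridor with states 0..4; reaching 4 ends an episode


def step(state, action):
    """Deterministic transition: move one cell, staying inside the corridor."""
    next_state = min(state + 1, GOAL) if action == "right" else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def q_learning(episodes=200, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # all Q(s, a) start at 0
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = rng.choice(ACTIONS)  # explore uniformly at random
            next_state, reward = step(state, action)
            # Deterministic update: Q(s,a) = r + gamma * max_a' Q(s',a')
            q[(state, action)] = reward + gamma * max(
                q[(next_state, a)] for a in ACTIONS
            )
            state = next_state
    return q


def greedy_policy(q):
    """Pick argmax_a Q(s, a) in every non-terminal state."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
```

The overwrite-style update is only appropriate for deterministic rewards and transitions; stochastic environments usually blend the old and new estimates with a learning rate.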

Q Learning • Guaranteed convergence towards optimum (state-action pairs have to be visited infinitely often). • Exploration strategy can speed up convergence. • Basic Q Learning does not generalize: replace state-action table with function approximation (e.g. neural net) in order to handle unseen states.

Pros and Cons of RL • Clearly suited to agents acting and exploring an environment. • Simple. • Engineering of suitable reward function may be tricky. • May take a long time to converge. • Learning result may not be transparent (depending on representation of Q function).

Combination of IL and RL • Relational reinforcement learning [Dzeroski et al. 98]: leads to more general Q function representation that may still be applicable even if the goals or environment change. • Explanation-based learning and RL [Dietterich and Flann, 95]. • More ILP and RL: see later.

Unsupervised Learning • Acquisition of “useful” or “interesting” patterns in input data. • Usefulness and interestingness are based on agent’s internal bias. • Agent does not receive any external feedback. • Discovered concepts are expected to improve agent performance on future tasks.

Learning and Verification • Need to guarantee agent safety. • Pre-deployment verification for non-learning agents. • What to do with learning agents?

Learning and Verification [Gordon ’00] • Verification after each self-modification step. • Problem: Time-consuming. • Solution 1: use property-preserving learning operators. • Solution 2: use learning operators which permit quick (partial) re-verification.

Learning and Verification What to do if verification fails? • Repair (multi)-agent plan. • Choose different learning operator.

Learning in Multi-Agent Systems • Classification • Social Awareness • Communication • Role Learning • Distributed Learning

Types of Multi-Agent Learning [Weiss & Dillenbourg 99] • Multiplied Learning: No interference in the learning process by other agents (except for exchange of training data or outputs). • Divided Learning: Division of learning task on functional level. • Interacting Learning: cooperation beyond the pure exchange of data.

Social Awareness • Awareness of existence of other agents and (possibly) knowledge about their behavior. • Not necessary to achieve near optimal MAS behavior: rock sample collection [Steels 89]. • Can it degrade performance?

Levels of Social Awareness [Vidal & Durfee 97] • 0-level agent: no knowledge about existence of other agents. • 1-level agent: recognizes that other agents exist; models other agents as 0-level. • 2-level agent: has some knowledge about the behavior of other agents; models other agents as 1-level agents. • k-level agent: models other agents as (k-1)-level.

Social Awareness and Q Learning • 0-level agents already learn implicitly about other agents. • [Mundhe and Sen, 00]: study of two Q learning agents up to level 2. • Two 1-level agents display slowest and least effective learning (worse than two 0-level agents).

Agent Models and Q Learning • $Q : S \times A^{n} \to \mathbb{R}$, where n is the number of agents. • If other agents’ actions are not observable, an assumption about the actions of other agents is needed. • Pessimistic assumption: given an agent’s action choice, other agents will minimize reward. • Optimistic assumption: other agents will maximize reward.

Agent Models and Q Learning • Pessimistic Assumption leads to overly cautious behavior. • Optimistic Assumption guarantees convergence towards optimum [Lauer & Riedmiller ‘00]. • If knowledge of other agents’ behavior is available, Q value update can be based on probabilistic computation [Claus and Boutilier ‘98]. But: no guarantee of optimality.
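The two assumptions can be illustrated by projecting a joint Q table onto one agent's own actions. This is only a sketch of the idea for a single state and two agents, not the algorithm of any cited paper; the payoff values, action names, and function names are assumptions.

```python
def project(joint_q, own_actions, other_actions, assumption):
    """Reduce Q(own, other) to a value per own action under an assumption.

    "optimistic": the other agent picks the action that maximizes reward.
    "pessimistic": the other agent picks the action that minimizes reward.
    """
    if assumption not in ("optimistic", "pessimistic"):
        raise ValueError("assumption must be 'optimistic' or 'pessimistic'")
    pick = max if assumption == "optimistic" else min
    return {a: pick(joint_q[(a, b)] for b in other_actions) for a in own_actions}


def best_action(values):
    """Greedy choice over the projected values."""
    return max(values, key=values.get)


# Hypothetical joint Q values for one state: (own action, other's action).
ACTIONS = ("a", "b")
JOINT_Q = {("a", "a"): 10.0, ("a", "b"): 0.0,
           ("b", "a"): 4.0, ("b", "b"): 5.0}

optimistic = project(JOINT_Q, ACTIONS, ACTIONS, "optimistic")
pessimistic = project(JOINT_Q, ACTIONS, ACTIONS, "pessimistic")
```

In this example the optimistic agent aims for the jointly best outcome (action "a"), while the pessimistic agent settles for the safer action "b", which shows the overly cautious behavior mentioned on the slide.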

Q Learning and Communication [Tan 93] Types of communication: • Sharing sensation • Sharing or merging policies • Sharing episodes Results: • Communication generally helps • Extra sensory information may hurt

Role Learning • Often useful for agents to specialize in specific roles for joint tasks. • Pre-defined roles: reduce flexibility, often not easy to define optimal distribution, may be expensive. • How to learn roles? • [Prasad et al. 96]: learn optimal distribution of pre-defined roles.

Q Learning of Roles • [Crites & Barto 98]: elevator domain; regular Q learning; no specialization achieved (but highly efficient behavior). • [Ono & Fukumoto 96]: Hunter-Prey domain, specialization achieved with greatest mass merging strategy.

Q Learning of Roles [Balch 99] • Three types of reward function: local performance-based, local shaped, global. • Global reward supports specialization. • Local reward supports emergence of homogeneous behaviors. • Some domains benefit from learning team heterogeneity (e.g., robotic soccer), others do not (e.g., multi-robot foraging). • Heterogeneity measure: social entropy.
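A simple form of social entropy can be sketched as the Shannon entropy of how agents are distributed over behavioral classes. This assumes each agent has already been assigned to a class (here, a role name); clustering agents by behavioral similarity, which a fuller measure would require, is not shown. The function name and role labels are assumptions.

```python
import math
from collections import Counter


def social_entropy(role_of_agent):
    """Shannon entropy (in bits) of the distribution of agents over roles.

    0.0 means a fully homogeneous team; larger values mean more diversity.
    """
    n = len(role_of_agent)
    counts = Counter(role_of_agent)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


homogeneous = ["forager"] * 4
split = ["attacker", "attacker", "defender", "defender"]
specialists = ["goalie", "defender", "midfielder", "striker"]
```

A team in which every agent behaves the same scores 0, an even split over two roles scores 1 bit, and four agents with four distinct roles score 2 bits.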

Distributed Learning • Motivation: Agents learning a global hypothesis from local observations. • Application of MAS techniques to (inductive) learning. • Applications: Distributed Data Mining [Provost & Kolluri ‘99], Robotic Soccer.

Distributed Data Mining • [Provost & Hennessy 96]: Individual learners see only subset of all training examples and compute a set of local rules based on these. • Local rules are evaluated by other learners based on their data. • Only rules with good evaluation are carried over to the global hypothesis.
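The scheme on this slide can be sketched with single-condition rules of the form (feature, value, label). This is an illustration of the general idea only, not the cited system: the rule format, the accuracy thresholds, the choice to reject rules that cover no examples at another site, and the label "hold" (for ¬shoot) are all assumptions.

```python
def rule_accuracy(rule, data):
    """Accuracy of a (feature, value, label) rule on the examples it covers."""
    feature, value, label = rule
    covered = [y for x, y in data if x.get(feature) == value]
    if not covered:
        return None  # the rule says nothing about this data set
    return sum(y == label for y in covered) / len(covered)


def local_rules(data, min_acc=1.0):
    """Rules a single learner proposes from its own subset of examples."""
    candidates = {(f, v, y) for x, y in data for f, v in x.items()}
    return {r for r in candidates if rule_accuracy(r, data) >= min_acc}


def global_rules(partitions, min_acc=0.8):
    """Keep only local rules that every other learner also evaluates well."""
    accepted = set()
    for i, data in enumerate(partitions):
        for rule in local_rules(data):
            scores = [rule_accuracy(rule, other)
                      for j, other in enumerate(partitions) if j != i]
            # Assumption: a rule with no coverage elsewhere is rejected.
            if all(s is not None and s >= min_acc for s in scores):
                accepted.add(rule)
    return accepted


# Two hypothetical learners, each seeing a different subset of the examples.
site1 = [({"ammo": "high", "light": "good"}, "shoot"),
         ({"ammo": "low", "light": "good"}, "hold")]
site2 = [({"ammo": "high", "light": "medium"}, "shoot"),
         ({"ammo": "low", "light": "good"}, "shoot")]
```

Only the rule that holds up on both subsets survives; rules that merely fit one learner's local data are filtered out by the other learner's evaluation.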

BREAK

Machine Learning and ILP for MAS: Part II • Integration of ML and Agents • ILP and its potential for MAS • Agent Applications of ILP • Learning, Natural Selection and Language

From Machine Learning to Learning Agents • Machine Learning (learning as the only goal): Classic Machine Learning, Active Learning, Closed Loop Machine Learning • Learning as one of many goals: Learning Agent(s)

Integrating Machine Learning into the Agent Architecture • Time constraints on learning • Synchronisation between agents’ actions • Learning and Recall

Time Constraints on Learning • Machine Learning alone: predictive accuracy matters, time doesn’t (just a price to pay). • ML in Agents, soft deadlines: resources must be shared with other activities (perception, planning, control). • ML in Agents, hard deadlines: imposed by environment: Make up your mind now! (or they’ll eat you)
