
Intelligent Systems

Presentation Transcript


  1. Intelligent Systems Inductive Logic Programming (ILP) – Lecture 11 Prof. Dieter Fensel (& Dumitru Roman)

  2. Agenda • Motivation • Technical Solution • Model Theory of ILP • A Generic ILP Algorithm • Proof Theory of ILP • ILP Systems and Applications • Illustration by a Larger Example • Summary and Conclusions

  3. (Machine) Learning • The process by which relatively permanent changes occur in behavioral potential as a result of experience. (Anderson) • Learning is constructing or modifying representations of what is being experienced. (Michalski) • A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. (Mitchell)

  4. (Machine) Learning Techniques • Decision tree learning • Conceptual clustering • Case-based learning • Reinforcement learning • Neural networks • Genetic algorithms • and… Inductive Logic Programming (ILP)

  5. Inductive Logic Programming = Inductive Learning ∩ Logic Programming

  6. The inductive learning and logic programming sides of ILP • From inductive machine learning, ILP inherits its goal: to develop tools and techniques to • Induce hypotheses from observations (examples) • Synthesise new knowledge from experience • By using computational logic as the representational mechanism for hypotheses and observations, ILP can overcome the two main limitations of classical machine learning techniques: • The use of a limited knowledge representation formalism (essentially a propositional logic) • Difficulties in using substantial background knowledge in the learning process

  7. The inductive learning and logic programming sides of ILP (cont’d) • ILP inherits from logic programming its • Representational formalism • Semantic orientation • Various well-established techniques • Compared to other approaches to inductive learning, ILP is interested in • Properties of inference rules • Convergence of algorithms • Computational complexity of procedures • ILP systems benefit from using the results of logic programming • E.g. by making use of work on termination, types and modes, knowledge-base updating, algorithmic debugging, abduction, constraint logic programming, program synthesis and program analysis

  8. The inductive learning and logic programming sides of ILP (cont’d) • Inductive logic programming extends the theory and practice of logic programming by investigating induction rather than deduction as the basic mode of inference • Logic programming theory describes deductive inference from logic formulae provided by the user • ILP theory describes the inductive inference of logic programs from instances and background knowledge • ILP contributes to the practice of logic programming by providing tools that assist logic programmers to develop and verify programs

  9. Outline • Motivation • Technical Solution • Model Theory of ILP • A Generic ILP Algorithm • Proof Theory of ILP • ILP Systems and Applications • Illustration by a Larger Example • Summary and Conclusions

  10. Motivation – Family example • Imagine yourself learning about the relationships between people in your close family circle • You have been told that your grandfather is the father of one of your parents, but do not yet know what a parent is • You might have the following beliefs (B):
      grandfather(X, Y) ← father(X, Z), parent(Z, Y)
      father(henry, jane) ←
      mother(jane, john) ←
      mother(jane, alice) ←
• You are now given the following facts (positive examples) concerning the relationships between particular grandfathers and their grandchildren (E+):
      grandfather(henry, john) ←
      grandfather(henry, alice) ←
• You might be told in addition that the following relationships do not hold (negative examples) (E-):
      ← grandfather(john, henry)
      ← grandfather(alice, john)

  11. Motivation – Family example (2) • Believing B, and faced with the new facts E+ and E-, you might guess the following hypothesis H: parent(X, Y) ← mother(X, Y) • B is consistent with E-: B /\ E- |≠ □ (prior satisfiability) • H allows us to explain E+ relative to B: B /\ H |= E+ (posterior sufficiency) • B and H are consistent with E-: B /\ H /\ E- |≠ □ (posterior satisfiability) • The question: how is it possible to derive (even tentatively) the hypothesis H?
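The guess can be checked mechanically. A minimal, runnable sketch in standard Prolog syntax (with ← written as :-), containing the background knowledge B together with the guessed hypothesis H:

    % background knowledge B
    grandfather(X, Y) :- father(X, Z), parent(Z, Y).
    father(henry, jane).
    mother(jane, john).
    mother(jane, alice).

    % guessed hypothesis H
    parent(X, Y) :- mother(X, Y).

    % the positive examples E+ now succeed, the negative examples E- still fail:
    % ?- grandfather(henry, john).   % true
    % ?- grandfather(henry, alice).  % true
    % ?- grandfather(john, henry).   % false
    % ?- grandfather(alice, john).   % false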

  12. Motivation – Tweety example • Suppose that you know the following about birds:
      haswings(X) ← bird(X)
      hasbeak(X) ← bird(X)
      bird(X) ← vulture(X)
      carnivore(X) ← vulture(X)
• Imagine now that an expedition comes across a creature "Tweety". The expedition leader telegraphs you to let you know that Tweety has wings and a beak, i.e. E+ is:
      haswings(tweety) ←
      hasbeak(tweety) ←

  13. Motivation – Tweety example (2) • Even without any negative examples it would not take a very inspired ornithologist with belief set B to hazard the guess "Tweety is a bird", i.e. H = bird(tweety) ← • It could clearly be refuted if further evidence revealed Tweety to be made of e.g. plastic • The ornithologist would be unlikely to entertain the more speculative hypothesis "vulture(tweety)", even though this could also be used to explain all the evidence: H' = vulture(tweety) ← • The question: how do we know from B and E+ that H' is more speculative than H?
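To make the comparison concrete, here is a small sketch in standard Prolog (one assumption of this sketch: vulture/1 is declared dynamic so the rule bodies can be tried even when no vulture facts are asserted):

    % belief set B
    :- dynamic vulture/1.
    haswings(X)  :- bird(X).
    hasbeak(X)   :- bird(X).
    bird(X)      :- vulture(X).
    carnivore(X) :- vulture(X).

    % the cautious hypothesis H
    bird(tweety).

    % with H loaded, both positive examples are derivable:
    % ?- haswings(tweety).   % true
    % ?- hasbeak(tweety).    % true
    % replacing H by the more speculative H' = vulture(tweety) would also make
    % both examples derivable, but would additionally entail carnivore(tweety)

Both hypotheses explain E+, so the evidence alone does not decide between them; H' simply commits to more unobserved consequences.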

  14. Motivation – Sorting example • Imagine that a learning program is to be taught the logic program for "quick-sort" • The following definitions are provided as background knowledge B:
      part(X, [], [], []) ←
      part(X, [Y | T], [Y | S1], S2) ← Y =< X, part(X, T, S1, S2)
      part(X, [Y | T], S1, [Y | S2]) ← Y > X, part(X, T, S1, S2)
      app([], L, L) ←
      app([X | T], L, [X | R]) ← app(T, L, R)
• Negative examples (E-):
      ← qsort([1, 0], [1, 0])
      ← qsort([0], [])
      …
• Positive examples (E+):
      qsort([], []) ←
      qsort([0], [0]) ←
      qsort([1, 0], [0, 1]) ←
      …

  15. Motivation – Sorting example (2) • In this case we might hope that the algorithm would, given a sufficient number of examples, suggest the following H for "quick-sort":
      qsort([], []) ←
      qsort([X | T], S) ← part(X, T, L1, L2), qsort(L1, S1), qsort(L2, S2), app(S1, [X | S2], S)
• ILP systems such as Golem and FOIL can learn this definition of quick-sort from as few as 6 or 10 examples • Although much background knowledge is required to learn quick-sort, the mentioned ILP systems are able to select the correct hypothesis from a huge space of possible hypotheses
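Written out in standard Prolog syntax (← as :-), the background knowledge and the hypothesis the ILP system is expected to induce form a small runnable program; predicate names follow the slide:

    % background knowledge B
    part(_, [], [], []).
    part(X, [Y|T], [Y|S1], S2) :- Y =< X, part(X, T, S1, S2).
    part(X, [Y|T], S1, [Y|S2]) :- Y > X,  part(X, T, S1, S2).

    app([], L, L).
    app([X|T], L, [X|R]) :- app(T, L, R).

    % target hypothesis H ("quick-sort")
    qsort([], []).
    qsort([X|T], S) :-
        part(X, T, L1, L2),
        qsort(L1, S1),
        qsort(L2, S2),
        app(S1, [X|S2], S).

    % ?- qsort([1, 0], S).   % S = [0, 1], matching the positive examples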

  16. Agenda • Motivation • Technical Solution • Model Theory of ILP • A Generic ILP Algorithm • Proof Theory of ILP • ILP Systems and Applications • Illustration by a Larger Example • Summary and Conclusions

  17. Model Theory – Normal Semantics • The problem of inductive inference: • Given are background (prior) knowledge B and evidence E • The evidence E = E+ /\ E- consists of positive evidence E+ and negative evidence E- • The aim is then to find a hypothesis H such that the following conditions hold:
      Prior Satisfiability: B /\ E- |≠ □
      Posterior Satisfiability: B /\ H /\ E- |≠ □
      Prior Necessity: B |≠ E+
      Posterior Sufficiency: B /\ H |= E+
• The Sufficiency criterion is sometimes named completeness with regard to the positive evidence • The Posterior Satisfiability criterion is also known as consistency with the negative evidence

  18. Model Theory – Definite Semantics • In most ILP systems the background theory and the hypotheses are restricted to being definite clauses • This setting is simpler than the general setting because a definite clause theory T has a unique minimal Herbrand model M+(T), and any logical formula is either true or false in the minimal model • This setting is formalised as follows:
      Prior Satisfiability: all e in E- are false in M+(B)
      Posterior Satisfiability: all e in E- are false in M+(B /\ H)
      Prior Necessity: some e in E+ are false in M+(B)
      Posterior Sufficiency: all e in E+ are true in M+(B /\ H)
• The special case of the definite semantics, where the evidence is restricted to true and false ground facts (examples), is called the example setting • The example setting is equivalent to the normal semantics, where B and H are definite clauses and E is a set of ground unit clauses

  19. Model Theory – Non-monotonic Semantics • In the non-monotonic setting: • The background theory is a set of definite clauses • The evidence is empty (Reason: the positive evidence is considered part of the background theory and the negative evidence is derived implicitly, by making a kind of closed world assumption) • The hypotheses are sets of general clauses expressible using the same alphabet as the background theory

  20. Model Theory – Non-monotonic Semantics (2) • The following conditions should hold for H and B: • Validity: all h in H are true in M+(B) • All clauses belonging to a hypothesis hold in the database B, i.e. they are true properties of the data • Completeness: if a general clause g is true in M+(B) then H |= g • All information that is valid in the database should be encoded in the hypothesis • Minimality: there is no proper subset G of H which is valid and complete • Aims at deriving non-redundant hypotheses

  21. Model Theory – Non-monotonic Semantics (3) • Example for B:
      male(luc) ←
      female(lieve) ←
      human(lieve) ←
      human(luc) ←
• A possible solution is then H:
      ← female(X), male(X)
      human(X) ← male(X)
      human(X) ← female(X)
      female(X) ∨ male(X) ← human(X)

  22. Model Theory – Non-monotonic Semantics (4) • Example for B1:
      bird(tweety) ←
      bird(oliver) ←
• Example for E1+:
      flies(tweety)
• In the example setting an acceptable hypothesis H1 would be flies(X) ← bird(X) • In the non-monotonic setting H1 is not a solution since there exists a substitution {X ← oliver} which makes the clause false (non-valid) • This demonstrates that the non-monotonic setting hypothesises only properties that hold in the database
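For databases as small as these, the Validity condition (every hypothesis clause true in M+(B)) can be checked directly; a minimal sketch in SWI-Prolog using forall/2:

    % B from slide 21
    male(luc).      female(lieve).
    human(luc).     human(lieve).

    % B1 from slide 22, with the positive evidence folded into the background
    bird(tweety).   bird(oliver).
    flies(tweety).

    % validity checks for the hypothesis clauses of slide 21 (all succeed):
    % ?- forall(male(X), human(X)).                 % human(X) <- male(X)
    % ?- forall(female(X), human(X)).               % human(X) <- female(X)
    % ?- \+ (female(X), male(X)).                   % <- female(X), male(X)
    % ?- forall(human(X), (female(X) ; male(X))).   % female(X) v male(X) <- human(X)

    % the clause flies(X) <- bird(X) is not valid in M+(B1):
    % ?- forall(bird(X), flies(X)).   % fails, X = oliver is a counterexample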

  23. Model Theory – Non-monotonic Semantics (5) • The non-monotonic semantics realises induction by deduction • The induction principle of the non-monotonic setting states that the hypothesis H, which is, in a sense, deduced from the set of observed examples E and the background theory B (using a kind of closed world and closed domain assumption), holds for all possible sets of examples • This produces generalisation beyond the observations • As a consequence, properties derived in the non-monotonic setting are more conservative than those derived in the normal setting

  24. Agenda • Motivation • Technical Solution • Model Theory of ILP • A Generic ILP Algorithm • Proof Theory of ILP • ILP Systems and Applications • Illustration by a Larger Example • Summary and Conclusions

  25. ILP as a Search Problem • ILP can be seen as a search problem - this view follows immediately from the model theory of ILP • In ILP there is a space of candidate solutions, i.e. the set of hypotheses, and an acceptance criterion characterizing solutions to an ILP problem • Question: how can the space of possible solutions be structured in order to allow for pruning of the search? • The search space is typically structured by means of the dual notions of generalisation and specialisation • Generalisation corresponds to induction • Specialisation corresponds to deduction • Induction is viewed here as the inverse of deduction

  26. Specialisation and Generalisation Rules • A hypothesis G is more general than a hypothesis S if and only if G |= S • S is also said to be more specific than G. • In search algorithms, the notions of generalisation and specialisation are incorporated using inductive and deductive inference rules: • A deductive inference rule r in R maps a conjunction of clauses G onto a conjunction of clauses S such that G |= S • r is called a specialisation rule • An inductive inference rule r in R maps a conjunction of clauses S onto a conjunction of clauses G such that G |= S • r is called a generalisation rule

  27. Specialisation and Generalisation – Examples • An example of a deductive inference rule is resolution; also dropping a clause from a hypothesis realises specialisation • An example of an inductive inference rule is Absorption (stated below) • In the rule of Absorption the conclusion entails the condition • Applying the rule of Absorption in the reverse direction, i.e. applying resolution, is a deductive inference rule • Other inductive inference rules generalise by adding a clause to a hypothesis, or by dropping a negative literal from a clause
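As usually stated in the inverse-resolution literature (lower-case letters are atoms, upper-case letters conjunctions of atoms), the Absorption rule reads:

    Absorption:  given  q ← A  and  p ← A, B   conclude   p ← q, B

Resolving the conclusion p ← q, B against the premise q ← A gives back the other premise p ← A, B; this is why the conclusion (together with q ← A) entails the condition, and why applying the rule in reverse is simply resolution.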

  28. Soundness of inductive inference rules • Inductive inference rules, such as Absorption, are not sound • This soundness problem can be circumvented by associating each hypothesised conclusion H with a label L = p(H | B /\ E), where L is the probability that H holds given that the background knowledge B and evidence E hold • Assuming the subjective assignment of probabilities to be consistent, labelled rules of inductive inference are as sound as deductive inference • The conclusions are simply claimed to hold in a certain proportion of interpretations

  29. Pruning the search space • Generalisation and specialisation form the basis for pruning the search space; this is because: • When B /\ H |≠ e, where B is the background theory, H is the hypothesis and e is positive evidence, then none of the specialisations H’ of H will imply the evidence. Each such hypothesis will be assigned a probability label p(H’ | B /\ E) = 0. • They can therefore be pruned from the search. • When B /\ H /\ e |= □, where B is the background theory, H is the hypothesis and e is negative evidence, then all generalisations H’ of H will also be inconsistent with B /\ E. These will again have p(H’ | B /\ E) = 0.

  30. A Generic ILP Algorithm • Given the key ideas of ILP as search, inference rules and labelled hypotheses, a generic ILP system is defined as a search loop over a queue of candidate hypotheses (a schematic sketch follows below) • The algorithm works as follows: • It keeps track of a queue of candidate hypotheses QH • It repeatedly deletes a hypothesis H from the queue and expands that hypothesis using inference rules; the expanded hypotheses are then added to the queue of hypotheses QH, which may be pruned to discard unpromising hypotheses from further consideration • This process continues until the stop-criterion is satisfied
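A schematic Prolog sketch of this loop is given here; initialize/1, stop_criterion/1, delete_hyp/3, choose_rules/2, apply_rules/3 and prune/2 are placeholder predicates standing for the generic parameters described on the next two slides, not part of any particular ILP system:

    % schematic sketch of the generic ILP loop
    ilp(Hypotheses) :-
        initialize(QH0),                    % Initialize: the hypotheses started from
        ilp_loop(QH0, Hypotheses).

    ilp_loop(QH, QH) :-
        stop_criterion(QH), !.              % Stop-criterion satisfied
    ilp_loop(QH, Result) :-
        delete_hyp(H, QH, Rest),            % Delete: pick H (LIFO, FIFO, best-first, ...)
        choose_rules(H, Rules),             % Choose: inference rules to apply to H
        apply_rules(H, Rules, Expansions),  % expand H using rules from R
        append(Rest, Expansions, QH1),
        prune(QH1, QH2),                    % Prune: discard unpromising hypotheses
        ilp_loop(QH2, Result).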

  31. Algorithm – Generic Parameters • Initialize denotes the hypotheses started from • R denotes the set of inference rules applied • Delete influences the search strategy • Using different instantiations of this procedure, one can realise a depth-first (Delete = LIFO), breadth-first (Delete = FIFO) or best-first algorithm • Choose determines the inference rules to be applied to the hypothesis H

  32. Algorithm – Generic Parameters (2) • Prune determines which candidate hypotheses are to be deleted from the queue • This is usually realized using the labels (probabilities) of the hypotheses on QH or relying on the user (employing an “oracle”) • Combining Delete with Prune it is easy to obtain advanced search strategies such as hill-climbing, beam-search, best-first, etc. • The Stop-criterion states the conditions under which the algorithm stops • Some frequently employed criteria require that a solution be found, or that it is unlikely that an adequate hypothesis can be obtained from the current queue

  33. “specific-to-general” vs “general-to-specific” • “specific-to-general” systems • Start from the examples and background knowledge, and repeatedly generalize their hypothesis by applying inductive inference rules • During the search they take care that the hypothesis remains satisfiable (i.e. does not imply negative examples) • “general-to-specific” systems • Start with the most general hypothesis (i.e. the inconsistent clause) and repeatedly specialize the hypothesis by applying deductive inference rules in order to remove inconsistencies with the negative examples • During the search care is taken that the hypotheses remain sufficient with regard to the positive evidence

  34. Agenda • Motivation • Technical Solution • Model Theory of ILP • A Generic ILP Algorithm • Proof Theory of ILP • ILP Systems and Applications • Illustration by a Larger Example • Summary and Conclusions

  35. Proof Theory of ILP • Inductive inference rules can be obtained by inverting deductive ones • Deduction: Given B /\ H |= E+, derive E+ from B /\ H • Induction: Given B /\ H |= E+, derive H from B and E+ • The paradigm of inverting deduction can be studied under various assumptions, corresponding to different assumptions about the deductive rule for |= and the format of background theory B and evidence E+ • Different models of inductive inference are obtained • θ-subsumption • The background knowledge is supposed to be empty, and the deductive inference rule corresponds to θ-subsumption among single clauses • Inverse resolution • Takes into account background knowledge and aims at inverting the resolution principle, the best-known deductive inference rule • Inverting implication

  36. Rules of inductive inference and operators • Inference rules = what can be inferred from what • A well-known problem: the unrestricted application of inference rules results in combinatorial explosions • To control the application of inference rules, the search strategy employs operators that expand a given node in the search tree into a set of successor nodes • A specialisation operator maps a conjunction of clauses G onto a set of maximal specialisations of G; a maximal specialisation S of G is a specialisation of G such that • G is not a specialisation of S • There is no specialisation S’ of G such that S is a specialisation of S’ • A generalisation operator maps a conjunction of clauses S onto a set of minimal generalisations of S; a minimal generalisation G of S is a generalisation of S such that • S is not a generalisation of G • There is no generalisation G’ of S such that G is a generalisation of G’

  37. θ-subsumption • A clause c1 θ-subsumes a clause c2 if and only if there exists a substitution θ such that c1θ ⊆ c2 • c1 is a generalisation of c2 (and c2 a specialisation of c1) under θ-subsumption • Clauses are seen as sets of (positive and negative) literals • The θ-subsumption inductive inference rule is: from a clause c2, induce any clause c1 that θ-subsumes c2 (a small subsumption-test sketch follows below) • For example
      father(X,Y) ← parent(X,Y), male(X)
θ-subsumes
      father(jef,paul) ← parent(jef,paul), parent(jef,ann), male(jef), female(ann)
with θ = {X = jef, Y = paul}
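A minimal θ-subsumption test can be sketched in SWI-Prolog. Clauses are represented here as lists of literals with body literals wrapped in not/1 (a representation chosen for this sketch); grounding the subsumed clause with numbervars/3 forces the search for a single consistent substitution by backtracking:

    % theta_subsumes(+C1, +C2): some substitution maps every literal of C1
    % onto a literal of C2
    theta_subsumes(C1, C2) :-
        copy_term(C1-C2, D1-D2),
        numbervars(D2, 0, _),       % treat c2's variables as constants
        match_all(D1, D2).

    match_all([], _).
    match_all([Lit|Rest], C2) :-
        member(Lit, C2),            % bindings made here constrain later literals
        match_all(Rest, C2).

    % ?- theta_subsumes(
    %        [father(X,Y), not(parent(X,Y)), not(male(X))],
    %        [father(jef,paul), not(parent(jef,paul)), not(parent(jef,ann)),
    %         not(male(jef)), not(female(ann))]).
    % succeeds; the matching substitution corresponds to θ = {X = jef, Y = paul}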

  38. Some properties of θ-subsumption • Implication: If c1 θ-subsumes c2 then c1 |= c2 • The opposite does not hold for self-recursive clauses: let c1 = p(f(X)) ← p(X) and c2 = p(f(f(Y))) ← p(Y); then c1 |= c2 but c1 does not θ-subsume c2 • Deduction using θ-subsumption is therefore not equivalent to implication among clauses • Infinite Descending Chains: there exist infinite descending chains, e.g.
      h(X1,X2) ←
      h(X1,X2) ← p(X1,X2)
      h(X1,X2) ← p(X1,X2), p(X2,X3)
      h(X1,X2) ← p(X1,X2), p(X2,X3), p(X3,X4)
      ...
This series is bounded from below by h(X,X) ← p(X,X)

  39. Some properties of θ-subsumption (2) • Equivalence: there exist different clauses that are equivalent under θ-subsumption • E.g. parent(X,Y) ← mother(X,Y), mother(X,Z) θ-subsumes parent(X,Y) ← mother(X,Y) and vice versa • Reduction: for equivalence classes of clauses there is a unique representative (up to variable renaming) of each clause (the reduced clause) • The reduced clause r of a clause c is a minimal subset of the literals of c such that r is equivalent to c • Lattice: the set of reduced clauses forms a lattice, i.e. any two clauses have a unique lub (the least general generalisation - lgg) and a unique glb
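The lub of two atoms under θ-subsumption is Plotkin's least general generalisation (anti-unification). A minimal sketch for terms/atoms, where the accumulator remembers which pair of differing subterms has already been generalised by which variable (predicate names are choices of this sketch):

    lgg(T1, T2, G) :-
        lgg(T1, T2, G, [], _).

    lgg(T1, T2, T1, S, S) :-
        T1 == T2, !.                                    % identical subterms stay
    lgg(T1, T2, V, S, S) :-
        member(gen(X, Y, V), S), X == T1, Y == T2, !.   % reuse an earlier variable
    lgg(T1, T2, G, S0, S) :-
        nonvar(T1), nonvar(T2),
        T1 =.. [F|Args1], T2 =.. [F|Args2],
        length(Args1, N), length(Args2, N), !,          % same functor and arity
        lgg_args(Args1, Args2, GArgs, S0, S),
        G =.. [F|GArgs].
    lgg(T1, T2, V, S, [gen(T1, T2, V)|S]).              % differing pair -> new variable

    lgg_args([], [], [], S, S).
    lgg_args([A|As], [B|Bs], [G|Gs], S0, S) :-
        lgg(A, B, G, S0, S1),
        lgg_args(As, Bs, Gs, S1, S).

    % ?- lgg(father(jef, paul), father(jef, ann), G).   % G = father(jef, _A)
    % ?- lgg(p(a, a), p(b, b), G).                      % G = p(_A, _A)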

  40. Inverting resolution • Four rules of inverse resolution (stated below) • Lower-case letters are atoms and upper-case letters are conjunctions of atoms
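The four rules, as usually stated in the inverse-resolution literature (p, q, r are atoms; A, B, C are conjunctions of atoms; in the two construction rules q and r are newly invented predicate symbols):

    Absorption:          given  q ← A  and  p ← A, B     conclude  p ← q, B
    Identification:      given  p ← A, B  and  p ← A, q  conclude  q ← B
    Intra-construction:  given  p ← A, B  and  p ← A, C  conclude  q ← B,  p ← A, q,  q ← C
    Inter-construction:  given  p ← A, B  and  q ← A, C  conclude  p ← r, B,  r ← A,  q ← r, C

In each case, resolving the conclusions on the introduced atom (q or r) recovers the premises, which is what makes these rules inversions of resolution.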

  41. Absorption and Identification • Both Absorption and Identification invert a single resolution step • This is shown diagrammatically as a V, with the two premises at the base and on one of the arms • The new clause in the conclusion is the clause found on the other arm of the V • Absorption and Identification are collectively called the V-operators

  42. Inter- and Intra-construction • The rules of Inter- and Intra-construction introduce a new predicate symbol • The new predicate can then be generalised using a V-operator; diagrammatically the construction operators can be shown as two linked V's, or a W, each V representing a resolution • The premises are placed at the two bases of the W and the three conclusions at the top of the W • One of the clauses is shared in both resolutions • Intra- and Inter-construction are collectively called W-operators

  43. V and W operators • The V and W operators have most specific forms • In this form the V-operators realise both generalisation and specialisation, since the conclusions entail the premises • Use of most specific operators is usually implemented as a two-stage operation • In the first phase, inverse resolution operators are applied to examples • In the second phase, clauses are reduced by generalisation through the θ-subsumption lattice

  44. Inverting resolution - Matching subclauses • Just as resolution requires unification to match terms, inverse resolution operators require a matching operation • In the matching operation all clauses, including the examples, are flattened • This involves introducing a new (n+1)-ary predicate for every n-ary function symbol, e.g. the clause member(a,[a,b]) ← becomes
      member(U,V) ← a(U), dot(V,U,X), dot(X,Y,Z), b(Y), nil(Z)
• Each new predicate symbol is then separately defined, e.g. dot([X|Y],X,Y) ← • After flattening, the problem of matching clauses when applying the inverse resolution operators reduces to one-sided matching of clause bodies
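A small runnable check (standard Prolog) that the flattened clause behaves like the original fact; a/1, b/1, nil/1 and dot/3 are the new predicates introduced by flattening, and member_flat is just a name chosen here to avoid clashing with the built-in member/2:

    a(a).
    b(b).
    nil([]).
    dot([X|Y], X, Y).

    % the flattened version of  member(a, [a,b]) <-
    member_flat(U, V) :- a(U), dot(V, U, X), dot(X, Y, Z), b(Y), nil(Z).

    % ?- member_flat(a, [a, b]).   % true, just like member(a, [a, b])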

  45. Agenda • Motivation • Technical Solution • Model Theory of ILP • A Generic ILP Algorithm • Proof Theory of ILP • ILP Systems and Applications • Illustration by a Larger Example • Summary and Conclusions

  46. Characteristics of ILP systems • Incremental/non-incremental: describes the way the evidence E (examples) is obtained • In non-incremental or empirical ILP, the evidence is given at the start and not changed afterwards • Non-incremental systems typically search either specific-to-general or general-to-specific • In incremental ILP, the examples are input one by one by the user, in a piecewise fashion • Incremental systems usually employ a mixture of specific-to-general and general-to-specific strategies, as they may need to correct earlier induced hypotheses • Interactive/non-interactive • In interactive ILP, the learner is allowed to pose questions to an oracle (i.e. the user) about the intended interpretation • Usually these questions query the user for the intended interpretation of an example or a clause • The answers to the queries allow large parts of the search space to be pruned • Most systems are non-interactive

  47. Characteristics of ILP systems (2) • Single/Multiple Predicate Learning/Theory Revision • Suppose P(F) represents the predicate symbols found in formula F • In single predicate learning from examples, the evidence E is composed of examples for one predicate only, i.e. P(E) is a singleton • In multiple predicate learning, P(E) is not restricted, as the aim is to learn a set of possibly interrelated predicate definitions • Theory revision is usually a form of incremental multiple predicate learning, where one starts from an initial approximation of the theory • Numerical Data • With a small number of examples it is hard to get enough examples in which the prediction is an exact number • Instead, rules should predict an interval

  48. Application Areas – Drug design • The majority of pharmaceutical R&D is based on finding slightly improved variants of patented active drugs; this involves laboratories of chemists synthesising and testing hundreds of compounds almost at random • The ability to automatically discover the chemical properties which affect the activity of drugs could provide a great reduction in pharmaceutical R&D costs (the average cost of developing a single new drug is $230 million) • ILP techniques are capable of constructing rules which predict the activity of untried drugs • Rules are constructed from examples of drugs with known medicinal activity • The accuracy of the rules was found to be higher than for traditional statistical methods • More importantly, the easily understandable rules can provide key insights, allowing considerable reductions in the number of compounds that need to be tested

  49. Application Areas – Protein primary-secondary shape prediction • Predicting the three-dimensional shape of proteins from their amino acid sequence is widely believed to be one of the hardest unsolved problems in molecular biology • It is also of considerable interest to pharmaceutical companies, since shape generally determines the function of a protein • ILP techniques have recently had considerable success within this area • Over the last 20 years many attempts have been made to apply methods ranging from statistical regression to decision tree and neural net learning to this problem • Published accuracy results for the general prediction problem have ranged between 50 and 60%, very close to random prediction • With ILP it was found that the ability to make use of biological background knowledge, together with the ability to describe structural relations, boosted the predictivity for a restricted sub-problem from around 70% to around 80% on an independently chosen test set

  50. Application Areas – Satellite diagnosis • ILP techniques have been applied to problems within the aerospace industry • In this case a complete and correct set of rules for diagnosing power supply failures was developed by generating examples from a qualitative model of the power sub-system of an existing satellite • The resulting rules are guaranteed complete and correct for all single faults, since all examples of these were generated from the original model • Rules were described using a simple temporal formalism in which each predicate had an associated time variable
