Download Presentation
## Intelligent Systems

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Intelligent Systems**Inductive Logic Programming (ILP) – Lecture 11 Prof. Dieter Fensel (& Dumitru Roman)**Agenda**• Motivation • Technical Solution • Model Theory of ILP • A Generic ILP Algorithm • Proof Theory of ILP • ILP Systems and Applications • Illustration by a Larger Example • Summary and Conclusions**(Machine) Learning**• The process by which relatively permanent changes occur in behavioral potential as a result of experience. (Anderson) • Learning is constructing or modifying representations of what is being experienced. (Michalski) • A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. (Mitchell)**(Machine) Learning Techniques**• Decision tree learning • Conceptual clustering • Case-based learning • Reinforcement learning • Neural networks • Genetic algorithms • and… Inductive Logic Programming (ILP)**Inductive Logic Programming**= Inductive Learning ∩Logic Programming**The inductive learning and logic programming sides of ILP**• From inductive machine learning, ILP inherits its goal: to develop tools and techniques to • Induce hypotheses from observations (examples) • Synthesise new knowledge from experience • By using computational logic as the representational mechanism for hypotheses and observations, ILP can overcome the two main limitations of classical machine learning techniques: • The use of a limited knowledge representation formalism (essentially a propositional logic) • Difficulties in using substantial background knowledge in the learning process**The inductive learning and logic programming sides of ILP**(cont’) • ILP inherits from logic programming its • Representational formalism • Semantical orientation • Various wellestablished techniques • Compared to other approaches to inductive learning, ILP is interested in • Properties of inference rules • Convergence of algorithms • Computational complexity of procedures • ILP systems benefit from using the results of logic programming • E.g. by making use of work on termination, types and modes, knowledgebase updating, algorithmic debugging, abduction, constraint logic programming, program synthesis and program analysis**The inductive learning and logic programming sides of ILP**(cont’) • Inductive logic programming extends the theory and practice of logic programming by investigating induction rather than deduction as the basic mode of inference • Logic programming theory describes deductive inference from logic formulae provided by the user • ILP theory describes the inductive inference of logic programs from instances and background knowledge • ILP contributes to the practice of logic programming by providing tools that assist logic programmers to develop and verify programs**Outline**• Motivation • Technical Solution • Model Theory of ILP • A Generic ILP Algorithm • Proof Theory of ILP • ILP Systems and Applications • Illustration by a Larger Example • Summary and Conclusions**Motivation – Family example**• Imagine yourself as learning about the relationships between people in your close family circle • You have been told that your grandfather is the father of one of your parents, but do not yet know what a parent is • You might have the following beliefs (B): grandfather(X, Y) ← father(X, Z); parent(Z, Y) father(henry, jane) ← mother(jane. john) ← mother(jane, alice) ← • You are now given the following facts (positive examples) concerning the relationships between particular grandfathers and their grandchildren (E+): grandfather(henry, john) ← grandfather(henry, alice) ← • You might be told in addition that the following relationships do not hold (negative examples) (E-) ← grandfather(john, henry) ← grandfather(alice, john)**Motivation – Family example (2)**• Believing B, and faced with the new facts E+ and E- you might guess the following hypothesisH parent(X, Y) ← mother(X, Y) • H is not a consequence of B, and E-: B /\ E- |≠ □ (prior satisfiability) • However, H allows us to explain E+ relative to B: B /\ H|= E+ (posterior sufficiency) • B and H are consistent with E-: B /\ H /\ E- |≠ □ (posterior satisfiability) • The question: how it is possible to derive (even tentatively) the hypothesis H?**Motivation – Tweety example**• Suppose that you know the following about birds: haswings(X) ← bird(X) hasbeak(X) ← bird(X) bird(X) ← vulture(X) carnivore(X) ← vulture(X) • Imagine now that an expedition comes across a creature "Tweety''. The expedition leader telegraphs you to let you know that Tweety has wings and a beak, i.e. E+ is: haswings(tweety) ← hasbeak(tweety) ←**Motivation – Tweety example (2)**• Even without any negative examples it would not take a very inspired ornithologist with belief set B to hazard the guess "Tweety is a bird'‘, i.e. H = bird(tweety) ← • It could clearly be refuted if further evidence revealed Tweety to be made of e.g. plastic • The ornithologist would be unlikely to entertain the more speculative hypothesis "vulture(tweety)'', even though this could also be used to explain all the evidence: H’ = vulture(tweety) ← • The question: how do we know from B and E+ that H’ is more speculative than H?**Motivation – Sorting example**• Imagine that a learning program is to be taught the logic program for “quicksort” • The following definitions are provided as background knowledge B: part(X, [], [], []) ← part(X, [Y | T ], [Y | S1], S2) ← Y =< X, partition(X, T, S1, S2) part(X, [Y | T ], S1, [Y | S2]) ← Y > X, part(X, T, S1, S2) app([], L, L) ← app([X | T ], L, [X | R]) ← app(T, L, R) Negative examples (E-): ← qsort([1, 0], [1, 0]) ← qsort([0], []) … Positive examples (E+): qsort([], []) ← qsort([0], [0]) ← qsort([1, 0], [0, 1]) ← …**Motivation – Sorting example (2)**• In this case we might hope that the algorithm would, given a sufficient number of examples, suggest the following H for "quicksort": qsort([], []) ← qsort([XjT ], S) ← part(X, T, L1, L2), qsort(L1, S1), qsort(L2, S2), app(S1, [XjS2], S) • ILP systems such as Golem and FOIL can learn this definition of quicksort from as few as 6 or 10 examples • Although much background knowledge is required to learn quicksort, the mentioned ILP systems are able to select the correct hypothesis from a huge space of possible hypotheses**Agenda**• Motivation • Technical Solution • Model Theory of ILP • A Generic ILP Algorithm • Proof Theory of ILP • ILP Systems and Applications • Illustration by a Larger Example • Summary and Conclusions**Model Theory – Normal Semantics**• The problem of inductive inference: • Given is background (prior) knowledge B and evidence E • The evidence E = E+ /\ E- consists of positive evidence E+ and negative evidence E- • The aim is then to find a hypothesis H such that the following conditions hold: Prior Satisfiability: B /\ E- |≠ □ Posterior Satisfiability: B /\ H /\ E- |≠ □ Prior Necessity: B |≠ E+ Posterior Sufficiency: B /\ H|= E+ • The Sufficiency criterion is sometimes named completeness with regard to positive evidence • The Posterior Satisfiability criterion is also known as consistency with the negative evidence**Model Theory – Definite Semantics**• In most ILP systems background theory and hypotheses are restricted to being definite • This setting is simpler than the general setting because a definite clause theory T has a unique minimal Herbrand model M+(T), and any logical formulae is either true or false in the minimal model • This setting is formalised as follows: Prior Satisfiability: all e in E-are false in M+(B) Posterior Satisfiability: all e in E-are false in M+(B /\ H) Prior Necessity: some e in E+ are false in M+(B) Posterior Sufficiency: all e in E+ are true in M+(B /\ H) • The special case of the definite semantics, where the evidence is restricted to true and false ground facts (examples), is called the example setting • The example setting is equivalent to the normal semantics, where B and H are definite clauses and E is a set of ground unit clauses**Model Theory – Non-monotonic Semantics**• In the nonmonotonic setting: • The background theory is a set of definite clauses • The evidence is empty (Reason: the positive evidence is considered part of the background theory and the negative evidence is derived implicitly, by making a kind of closed world assumption) • The hypotheses are sets of general clauses expressible using the same alphabet as the background theory**Model Theory – Non-monotonic Semantics (2)**• The following conditions should hold for H and B: • Validity: all h in H are true in M+(B) • All clauses belonging to a hypothesis hold in the database B, i.e. that they are true properties of the data • Completeness: if general clause g is true in M+(B) then H |= g • All information that is valid in the database should be encoded in the hypothesis • Minimality: there is no proper subset G of H which is valid and complete • Aims at deriving non redundant hypotheses**Model Theory – Non-monotonic Semantics (3)**• Example for B: male(luc) ← female(lieve) ← human(lieve) ← human(luc) ← • A possible solution is then H: ← female(X), male(X) human(X) ← male(X) human(X) ← female(X) female(X), male(X) ← human(X)**Model Theory – Non-monotonic Semantics (4)**• Example for B1 bird(tweety) ← bird(oliver) ← • Example for E1+ flies(tweety) • In the example setting an acceptable hypothesis H1 would be flies(X) ← bird(X) • In the nonmonotonic setting H1 is not a solution since there exists a substitution {X ← oliver} which makes the clause false (nonvalid) • This demonstrates that the nonmonotonic setting hypothesises only properties that hold in the database**Model Theory – Non-monotonic Semantics (5)**• The nonmonotonic semantics realises induction by deduction • The induction principle of the nonmonotonic setting states that the hypothesis H, which is, in a sense, deduced from the set of observed examples E and the background theory B (using a kind of closed world and closed domain assumption), holds for all possible sets of examples • This produces generalisation beyond the observations • As a consequence, properties derived in the nonmonotonic setting are more conservative than those derived in the normal setting**Agenda**• Motivation • Technical Solution • Model Theory of ILP • A Generic ILP Algorithm • Proof Theory of ILP • ILP Systems and Applications • Illustration by a Larger Example • Summary and Conclusions**ILP as a Search Problem**• ILP can be seen as a search problem - this view follows immediately from the modeltheory of ILP • In ILP there is a space of candidate solutions, i.e. the set of hypotheses, and an acceptance criterion characterizing solutions to an ILP problem • Question: how the space of possible solutions can be structured in order to allow for pruning of the search? • The search space is typically structured by means of the dual notions of generalisation and specialisation • Generalisation corresponds to induction • Specialisation to deduction • Induction is viewed here as the inverse of deduction**Specialisation and Generalisation Rules**• A hypothesis G is more general than a hypothesis S if and only if G |= S • S is also said to be more specific than G. • In search algorithms, the notions of generalisation and specialisation are incorporated using inductive and deductive inference rules: • A deductive inference rule r in R maps a conjunction of clauses G onto a conjunction of clauses S such that G |= S • r is called a specialisation rule • An inductive inference rule r in R maps a conjunction of clauses S onto a conjunction of clauses G such that G |= S • r is called a generalisation rule**Specialisation and Generalisation – Examples**• Example of deductive inference rule is resolution; also dropping a clause from a hypothesis realises specialisation • Example of an inductive inference rule is Absorption: • In the rule of Absorption the conclusion entails the condition • Applying the rule of Absorption in the reverse direction, i.e. applying resolution, is a deductive inference rule • Other inductive inference rules generalise by adding a clause to a hypothesis, or by dropping a negative literal from a clause**Soundness of inductive inference rules**• Inductive inference rules, such as Absorption, are not sound • This soundness problem can be circumvented by associating each hypothesised conclusion H with a labelL = p(H | B /\ E) where L is the probability that H holds given that the background knowledge B and evidence E hold • Assuming the subjective assignment of probabilities to be consistent, labelled rules of inductive inference are as sound as deductive inference • The conclusions are simply claimed to hold in a certain proportion of interpretations**Pruning the search space**• Generalisation and specialisation form the basis for pruning the search space; this is because: • When B /\ H |≠ e, where B is the background theory, H is the hypothesis and e is positive evidence, then none of the specialisations H’ of H will imply the evidence. Each such hypothesis will be assigned a probability label p(H’ | B /\ E) = 0. • They can therefore be pruned from the search. • When B /\ H /\ e |= □, where B is the background theory, H is the hypothesis and e is negative evidence, then all generalisations H’ of H will also be inconsistent with B /\ E. These will again have p(H’ | B /\ E) = 0.**A Generic ILP Algorithm**• Given the key ideas of ILP as search, inference rules and labeled hypotheses, a generic ILP system is defined as: • The algorithm works as follows: • It keeps track of a queue of candidate hypotheses QH • It repeatedly deletes a hypothesis H from the queue and expands that hypotheses using inference rules; the expanded hypotheses are then added to the queue of hypotheses QH, which may be pruned to discard unpromising hypotheses from further consideration • This process continues until the stopcriterion is satisfied**Algorithm – Generic Parameters**• Initialize denotes the hypotheses started from • Rdenotes the set of inference rules applied • Deleteinfluences the search strategy • Using different instantiations of this procedure, one can realise a depthfirst (Delete = LIFO), breadthfirst Delete = FIFO) or bestfirst algorithm • Choosedetermines the inference rules to be applied on the hypothesis H**Algorithm – Generic Parameters (2)**• Prunedetermines which candidate hypotheses are to be deleted from the queue • This is usually realized using the labels (probabilities) of the hypotheses on QH or relying on the user (employing an “oracle”) • Combining Delete with Prune it is easy to obtain advanced search strategies such as hillclimbing, beamsearch, bestfirst, etc. • The Stopcriterion states the conditions under which the algorithm stops • Some frequently employed criteria require that a solution be found, or that it is unlikely that an adequate hypothesis can be obtained from the current queue**“specifictogeneral” vs “generaltospecific”**• “specifictogeneral” systems • Start from the examples and background knowledge, and repeatedly generalize their hypothesis by applying inductive inference rules • During the search they take care that the hypothesis remains satisfiable (i.e. does not imply negative examples) • “generaltospecific” systems • Start with the most general hypothesis (i.e. the inconsistent clause) and repeatedly specializes the hypothesis by applying deductive inference rules in order to remove inconsistencies with the negative examples • During the search care is taken that the hypotheses remain sufficient with regard to the positive evidence**Agenda**• Motivation • Technical Solution • Model Theory of ILP • A Generic ILP Algorithm • Proof Theory of ILP • ILP Systems and Applications • Illustration by a Larger Example • Summary and Conclusions**Proof Theory of ILP**• Inductive inference rules can be obtained by inverting deductive ones • Deduction: Given B /\ H |= E+ , derive E+ from B /\ H • Induction: Given B /\ H |= E+ , derive H from B and B and E+ • Inverting deduction paradigm can be studied under various assumptions, corresponding to different assumptions about the deductive rule for |= and the format of background theory B and evidence E+ • Different models of inductive inference are obtained • θ-subsumption • The background knowledge is supposed to be empty, and the deductive inference rule corresponds to θ-subsumption among single clauses • Inductive inference • Takes into account background knowledge and aims at inverting the resolution principle, the bestknown deductive inference rule • Inverting implication**Rules of inductive inference and operators**• Inference rules = what can be inferred from what • A wellknown problem: the unrestricted application of inference rules results in combinatorial explosions • To control the application of inference rules, the search strategy employs operators that expand a given node in the search tree into a set of successor nodes in the search • A specialisation operator maps a conjunction of clauses G onto a set of maximal specialisations of S; A maximal specialisationS of G is a specialisation of G such that • G is not a specialisation of S • There is no specialisation S’ of G such that S is a specialisation of S’ • A generalisation operator maps a conjunction of clauses S onto a set of minimal generalisations of S; A minimal generalisationG of S is a generalisation of S such that • S is not a generalisation of G • There is no generalisation G’ of S such that G is a generalisation of G’**θ-subsumption**• A clause c1θ-subsumes a clause c2 if and only if there exists a substitution θsuch that c1θ in c2 • c1 is a generalisation of c2 (and c2 a specialisation of c1) under θsubsumption • Clauses are seen as sets of (positive and negative) literals • The θsubsumption inductive inference rule is: • For example father(X,Y) ← parent(X,Y), male(X) θsubsumes father(jef,paul) ← parent(jef,paul), parent(jef,ann), male(jef), female(ann) with θ = {X = jef, Y = ann}**Some properties of θ-subsumption**• Implication: If c1 θ-subsumes c2 then c1 |= c2 • The opposite does not hold for selfrecursive clauses: let c1 = p(f(X)) ← p(X); c2 = p(f(f(Y ))) ← p(Y ); c1 |= c2 but c1 does not θsubsume c2 • Deduction using θsubsumption is not equivalent to implication among clauses • Infinite Descending Chains: There exist infinite descending chains, e.g. h(X1,X2) ← h(X1,X2) ← p(X1,X2) h(X1,X2) ← p(X1,X2),p(X2,X3) h(X1,X2) ← p(X1,X2),p(X2,X3),p(X3,X4) ... This series is bounded from below by h(X,X) ← p(X,X)**Some properties of θ-subsumption (2)**• Equivalence: there exist different clauses that are equivalent under θsubsumption • E.g. parent(X,Y) ← mother(X,Y), mother(X,Z) θsubsumes parent(X,Y) ← mother(X,Y) and vice versa • Reduction: for equivalent classes of clauses, there is a unique representative (up to variable renamings) of each clause (reduced clause) • The reduced clause r of a clause c is a minimal subset of literals of c such that r is equivalent to c • Lattice: the set of reduced clauses form a lattice, i.e. any two clauses have unique lub (the least general generalisation - lgg) and any two clauses have a unique glb**Inverting resolution**• Four rules of inverse resolution • Lowercase letters are atoms and uppercase letters are conjunctions of atoms**Absorption and Identification**• Both Absorption and Identification invert a single resolution step • This is shown diagrammatically as a V with the two premises on the base and one of the arms • The new clause in the conclusion is the clause found on the other arm of the V • Absorption and Identification were called collectively Voperators**Inter and Intraconstruction**• The rules of Inter and Intraconstruction introduce a new predicate symbol • The new predicate can then be generalised using a Voperator; diagrammatically the construction operators can be shown as two linked V's, or a W, each respresenting a resolution • The premises are placed at the two bases of the W and the three conclusions at the top of the W • One of the clauses is shared in both resolutions. • Intra and Interconstruction are collectively called Woperators**V and W operators**• The V and W operators have most specific forms: • In this form the Voperators realise both generalisation and specialisation since the conclusions entail the premises • Use of most specific operators is usually implemented by having a two stage operation • In the first phase, inverse resolution operators are applied to examples • In the second phase clauses are reduced by generalisation through the θsubsumption lattice**Inverting resolution - Matching subclauses**• Just as resolution requires unification to match terms, inverse resolution operators require a matching operation • In matching operation all clauses, including the examples are flattened • This involves introducing a new (n+1)ary predicate for every nary function symbol, e.g. the clause member(a,[a,b]) ← becomes member(U,V ) ← a(U), dot(V,U,X), dot(X,Y,Z),b(Y),nil(Z). • Each new predicate symbol is then separately defined, e.g. dot([X|Y ],X,Y) ← • After flattening the problem of matching clauses when applying the inverse resolution operators reduces to onesided matching of clause bodies**Agenda**• Motivation • Technical Solution • Model Theory of ILP • A Generic ILP Algorithm • Proof Theory of ILP • ILP Systems and Applications • Illustration by a Larger Example • Summary and Conclusions**Characteristics of ILP systems**• Incremental/nonincremental: describes the way the evidence E (examples) is obtained • In nonincremental or empirical ILP, the evidence is given at the start and not changed afterwards • Nonincremental systems search typically either specifictogeneral or generaltospecific. • In incremental ILP, the examples are input one by one by the user, in a piecewise fashion. • Incremental systems usually employ a mixture of specifictogeneral and generaltospecific strategies as they may need to correct earlier induced hypotheses • Interactive/ Noninteractive • In interactive ILP, the learner is allowed to pose questions to an oracle (i.e. the user) about the intended interpretation • Usually these questions query the user for the intended interpretation of an example or a clause. • The answers to the queries allow to prune large parts of the search space • Most systems are noninteractive**Characteristics of ILP systems (2)**• Single/Multiple Predicate Learning/Theory Revision • Suppose P(F) represent the predicate symbols found in formula F • In single predicate learning from examples, the evidence E is composed of examples for one predicate only, i.e. P(E) is a singleton • In multiple predicate learning, P(E) is not restricted as the aim is to learn a set of possibly interrelated predicate definitions • Theory revision is usually a form of incremental multiple predicate learning, where one starts from an initial approximation of the theory • Numerical Data • With a small number of examples it is hard to get enough examples in which the prediction is an exact number • Instead rules should predict an interval**Application Areas – Drug design**• The majority of pharmaceutical R&D is based on finding slightly improved variants of patented active drugs; this involves laboratories of chemists synthesising and testing hundreds of compounds almost at random • The ability to automatically discover the chemical properties which affect the activity of drugs could provide a great reduction in pharmaceutical R&D costs (the average cost of developing a single new drug is $230 million) • ILP techniques are capable of constructing rules which predict the activity of untried drugs • Rules are constructed from examples of drugs with known medicinal activity • The accuracy of the rules was found to be higher than for traditional statistical methods • More importantly the easily understandable rules can provide key insights, allowing considerable reductions in the numbers of compounds that need to be tested.**Application Areas – Protein primarysecondary shape**prediction • Predicting the threedimensional shape of proteins from their amino acid sequence is widely believed to be one of the hardest unsolved problems in molecular biology • It is also of considerable interest to pharmaceutical companies since shape generally determines the function of a protein. • ILP techniques have recently had considerable success within this area • Over the last 20 years many attempts have been made to apply methods ranging from statistical regression to decision tree and neural net learning to this problem • Published accuracy results for the general prediction problem have ranged between 50 and 60 %, very close to random prediction. • With ILP it was found that the ability to make use of biological background knowledge together with the ability to describe structural relations boosted the predictivity for a restricted subproblem from around 70% to around 80% on an independently chosen test set.**Application Areas – Satellite diagnosis**• ILP techniques have been applied to problems within the Aerospace industry • In this case a complete and correct set of rules for diagnosing power supply failures was developed by generating examples from a qualitative model of the power subsystem of an existing satellite • The resulting rules are guaranteed complete and correct for all single faults since all examples of these were generated from the original model • Rules were described using a simple temporal formalism in which each predicate had an associated time variable