A New Linear-threshold Algorithm
Anna Rapoport, Lev Faivishevsky
Introduction
• Valiant (1984) and others have studied the problem of learning various classes of Boolean functions from examples. Here we discuss incremental learning of these functions.
• We consider a setting in which the learner responds to each example according to its current hypothesis. The learner then updates the hypothesis, if necessary, based on the correct classification of the example.
Introduction (cont.)
• One natural measure of the quality of learning in this setting is the number of mistakes the learner makes.
• For suitable classes of functions, there are learning algorithms that make a bounded number of mistakes, with the bound independent of the number of examples seen by the learner.
Introduction (cont.)
• We present an algorithm that learns disjunctive Boolean functions, along with variants for learning other classes of Boolean functions.
• The basic method can be expressed as a linear-threshold algorithm.
• A primary advantage of this algorithm is that the number of mistakes grows only logarithmically with the number of irrelevant attributes in the examples. It is also computationally efficient in both time and space.
How does it work?
• We study learning in an on-line setting: there is no separate set of training examples. The learner attempts to predict the appropriate response for each example, starting with the first example received.
• After making this prediction, the learner is told whether the prediction was correct, and then uses this information to improve its hypothesis.
• The learner continues to learn as long as it receives examples.
The Setting
We now describe in more detail the learning environment that we consider and the classes of functions that the algorithm can learn. We assume that learning takes place in a sequence of trials. The order of events in a trial is as follows:
The Setting (cont.)
(1) The learner receives some information about the world, corresponding to a single example. This information consists of the values of $n$ Boolean attributes, for some $n$ that remains fixed. We think of the information received as a point in $\{0,1\}^n$. We call this point an instance, and we call $\{0,1\}^n$ the instance space.
The Setting (cont.)
(2) The learner makes a response. The learner has a choice of two responses, labeled 0 and 1. We call this response the learner's prediction of the correct value.
(3) The learner is told whether or not the response was correct. This information is called the reinforcement.
The Setting (cont.)
• Each trial begins after the previous trial has ended.
• We assume that for the entire sequence of trials there is a single function $f:\{0,1\}^n \to \{0,1\}$ which maps each instance to the correct response to that instance. This function is called the target function or target concept.
• An algorithm for learning in this setting is called an algorithm for on-line learning from examples (AOLLE).
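The trial protocol above can be sketched as a small loop. This is only an illustrative sketch: the names `run_trials` and `AlwaysZero`, and the learner interface with `predict` and `update` methods, are assumptions made for this example and do not come from the slides. Because there are only two responses, telling the learner whether it was correct is equivalent to telling it the correct value, which is what the sketch passes to `update`.

```python
from itertools import product


def run_trials(learner, target, instances):
    """Run one trial per instance and return the number of mistakes."""
    mistakes = 0
    for x in instances:                 # (1) the learner receives an instance
        prediction = learner.predict(x)  # (2) the learner responds with 0 or 1
        correct = target(x)              # (3) reinforcement: the correct value
        if prediction != correct:
            mistakes += 1
        learner.update(x, correct)       # the hypothesis may change after each trial
    return mistakes


class AlwaysZero:
    """Hypothetical trivial learner: predicts 0 and never updates."""

    def predict(self, x):
        return 0

    def update(self, x, correct):
        pass


# Target function on {0,1}^3: the monotone disjunction x1 OR x3 (0-based indices 0 and 2).
target = lambda x: x[0] | x[2]
instances = list(product([0, 1], repeat=3))
print(run_trials(AlwaysZero(), target, instances))
```

The trivial learner errs on every instance whose correct value is 1, which is why a learner that actually updates its hypothesis is needed to obtain a useful mistake bound.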
Mistake Bound (introduction)
• We evaluate the algorithm's learning behavior by counting the worst-case number of mistakes that it will make while learning a function from a specified class of functions. Computational complexity is also considered: the method is efficient in both time and space.
General results about mistake bounds for AOLLE
• First we present upper and lower bounds on the number of mistakes in the case where one ignores issues of computational efficiency.
• The instance space can be any finite space $X$, and the target class is assumed to be a collection of functions, each with domain $X$ and range $\{0,1\}$.
Some definitions:
Def 1: For any learning algorithm $A$ and any target function $f$, let $M_A(f)$ be the maximum, over all possible sequences of instances, of the number of mistakes that algorithm $A$ makes when the target function is $f$.
Some definitions:
Def 2: For any learning algorithm $A$ and any non-empty target class $C$, let $M_A(C) = \max_{f \in C} M_A(f)$. Define $M_A(C) := -1$ if $C$ is empty. Any number greater than or equal to $M_A(C)$ is called a mistake bound for algorithm $A$ applied to class $C$.
Some definitions:
Def 3: The optimal mistake bound for a target class $C$, denoted $\mathrm{opt}(C)$, is the minimum over all algorithms $A$ of $M_A(C)$ (regardless of the algorithm's computational efficiency). An algorithm $A$ is called optimal for class $C$ if $M_A(C) = \mathrm{opt}(C)$. Thus $\mathrm{opt}(C)$ represents the best possible worst-case mistake bound for any algorithm learning $C$.
Two auxiliary algorithms
If computational resources are not an issue, there are straightforward learning algorithms that have excellent mistake bounds for many classes of functions. We look at them briefly, because they give an upper limit on the mistake bound and because they suggest strategies that one might explore in searching for computationally efficient algorithms.
Algorithm 1: halving algorithm (HA)
• The HA can be applied to any finite class $C$ of functions taking values in $\{0,1\}$. The HA maintains a variable CONSIST, initially equal to $C$. When it receives an instance $x$, it determines the sets
$\xi_0(\mathrm{CONSIST}, x) = \{ f \in \mathrm{CONSIST} : f(x) = 0 \}$
$\xi_1(\mathrm{CONSIST}, x) = \{ f \in \mathrm{CONSIST} : f(x) = 1 \}$
HA: scheme of the work
• If $|\xi_1(\mathrm{CONSIST}, x)| > |\xi_0(\mathrm{CONSIST}, x)|$, the HA predicts 1; otherwise it predicts 0.
• When it receives the reinforcement, it sets CONSIST $= \xi_1(\mathrm{CONSIST}, x)$ if the correct response is 1, and CONSIST $= \xi_0(\mathrm{CONSIST}, x)$ if the correct response is 0.
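The scheme above can be written out directly for a small class. The following is a minimal sketch, assuming the class of all monotone disjunctions over 4 variables as the target class; the helper names `halving_run` and `monotone_disjunction` are made up for this illustration.

```python
from itertools import combinations, product


def halving_run(concept_class, target, instances):
    """Run the halving algorithm and return the number of mistakes it makes."""
    consist = list(concept_class)  # CONSIST starts as the whole class C
    mistakes = 0
    for x in instances:
        xi1 = [f for f in consist if f(x) == 1]
        xi0 = [f for f in consist if f(x) == 0]
        prediction = 1 if len(xi1) > len(xi0) else 0  # majority vote, ties give 0
        correct = target(x)
        if prediction != correct:
            mistakes += 1
        # Keep only the functions that agree with the correct response.
        consist = xi1 if correct == 1 else xi0
    return mistakes


def monotone_disjunction(indices):
    """Return the function x -> OR of x[i] for i in indices (empty OR is 0)."""
    return lambda x: int(any(x[i] for i in indices))


n = 4
# All 2^4 = 16 monotone disjunctions, including the empty one (constant 0).
concept_class = [monotone_disjunction(s)
                 for k in range(n + 1)
                 for s in combinations(range(n), k)]
target = monotone_disjunction((0, 2))
instances = list(product([0, 1], repeat=n))
mistakes = halving_run(concept_class, target, instances)
print(mistakes)
```

The sketch enumerates the whole class explicitly, which is exactly why the HA is not computationally practical for large classes: here the class has 16 members, but it grows as $2^n$.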
HA: main results
Def: Let $M_{\mathrm{HALVING}}(C)$ denote the maximum number of mistakes that the algorithm will make when it is run for the class $C$.
Th 1: For any non-empty target class $C$, $M_{\mathrm{HALVING}}(C) \le \log_2 |C|$.
Th 2: For any finite target class $C$, $\mathrm{opt}(C) \le \log_2 |C|$.
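The reasoning behind the first bound is short and can be sketched as follows. This is the standard counting argument for majority-vote prediction, written here for a finite class, with $m$ denoting the number of mistakes made so far (a symbol introduced only for this sketch).

```latex
% On a mistake, the prediction followed the larger of the two sets (ties go to 0),
% but the correct response selects the other one, so
\[
  |\mathrm{CONSIST}_{\text{new}}| \;\le\; \tfrac{1}{2}\,|\mathrm{CONSIST}_{\text{old}}| .
\]
% The target function is never removed, so CONSIST always has at least one element.
% After m mistakes:
\[
  1 \;\le\; |\mathrm{CONSIST}| \;\le\; \frac{|C|}{2^{m}}
  \quad\Longrightarrow\quad
  m \;\le\; \log_2 |C| .
\]
```

The second theorem then follows because $\mathrm{opt}(C)$ is a minimum over all algorithms, and the halving algorithm is one of them.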
Algorithm 2: standard optimal algorithm (SOA)
Def 1: A mistake tree for a target class $C$ over an instance space $X$ is a binary tree each of whose nodes is a non-empty subset of $C$, in which each internal node is labeled with a point of $X$, and which satisfies:
1. The root of the tree is $C$.
2. For any internal node $C'$ labeled with $x$, the left child of $C'$ is $\xi_0(C', x)$ and the right child is $\xi_1(C', x)$.
SOA
Def 2: A complete $k$-mistake tree is a mistake tree that is a complete binary tree of height $k$.
Def 3: For any non-empty finite target class $C$, let $K(C)$ be the largest integer $k$ such that there exists a complete $k$-mistake tree for $C$. Define $K(\emptyset) = -1$.
The SOA is similar to the HA, but it compares $K$ values instead of cardinalities: it predicts 1 if $K(\xi_1(\mathrm{CONSIST}, x)) > K(\xi_0(\mathrm{CONSIST}, x))$, and 0 otherwise.
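For very small classes, $K(C)$ can be computed by brute force. The sketch below assumes the recurrence that follows from the definition of a complete mistake tree: a non-empty class has $K \ge 0$, and a root labeled $x$ that splits the class into two non-empty children gives height $1 + \min(K(\xi_0), K(\xi_1))$. Functions are represented by their truth tables over the instance space; the names `truth_table` and `K` are chosen for this illustration only.

```python
from functools import lru_cache
from itertools import combinations, product

n = 3
X = list(product([0, 1], repeat=n))  # the instance space {0,1}^3


def truth_table(indices):
    """Truth table (one entry per point of X) of the monotone disjunction on indices."""
    return tuple(int(any(x[i] for i in indices)) for x in X)


@lru_cache(maxsize=None)
def K(C):
    """Height of the tallest complete mistake tree for C (a frozenset of truth tables)."""
    if not C:
        return -1
    best = 0  # a single node is a complete 0-mistake tree
    for j in range(len(X)):
        xi0 = frozenset(f for f in C if f[j] == 0)
        xi1 = C - xi0
        if xi0 and xi1:  # X[j] can label the root only if both children are non-empty
            best = max(best, 1 + min(K(xi0), K(xi1)))
    return best


# All monotone disjunctions over 3 variables (including the empty one).
C = frozenset(truth_table(s) for k in range(n + 1) for s in combinations(range(n), k))
print(K(C))
```

The recursion visits subsets of the class, so this is only feasible for toy examples. It is meant to make the definition concrete, not to be an efficient implementation of the SOA.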
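SOA: main results
Th 1: Let $X$ be any instance space and let $C$ be a class of functions $X \to \{0,1\}$. Then $\mathrm{opt}(C) = M_{\mathrm{SOA}}(C) = K(C)$.
Def 4: A set $S \subseteq X$ is shattered by a target class $C$ if for every $U \subseteq S$ there exists $f \in C$ such that $f$ is 1 on $U$ and 0 on $S - U$.
Def 5: The Vapnik-Chervonenkis dimension $\mathrm{VCdim}(C)$ is the cardinality of the largest set shattered by $C$.
Th 2: For any target class $C$, $\mathrm{VCdim}(C) \le \mathrm{opt}(C)$.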
The linear-threshold algorithm (LTA)
Def 1: A function $f:\{0,1\}^n \to \{0,1\}$ is linearly separable if there is a hyperplane in $\mathbb{R}^n$ separating the points on which the function is 1 from those on which it is 0.
Def 2: A monotone disjunction is a disjunction in which no literal appears negated: $f(x_1,\dots,x_n) = x_{i_1} \lor \dots \lor x_{i_k}$.
The hyperplane given by $x_{i_1} + \dots + x_{i_k} = \tfrac{1}{2}$ is a separating hyperplane for $f$.
WINNOW1
• The instance space is $X = \{0,1\}^n$.
• The algorithm maintains non-negative real weights $w_1,\dots,w_n$, each having 1 as its initial value.
• $\theta \in \mathbb{R}$ is the threshold.
• When the learner receives an instance $(x_1,\dots,x_n)$, it responds as follows: if $\sum_i w_i x_i > \theta$, it predicts 1; if $\sum_i w_i x_i \le \theta$, it predicts 0.
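WINNOW1
The weights are changed only if the learner makes a mistake, using a fixed update parameter $\alpha > 1$:
• Prediction 0, correct response 1 (promotion step): $w_i := \alpha \cdot w_i$ for every $i$ with $x_i = 1$.
• Prediction 1, correct response 0 (elimination step): $w_i := 0$ for every $i$ with $x_i = 1$.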
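As a minimal sketch of WINNOW1, the code below assumes the update rule from Littlestone's original description of the algorithm: on a false negative the weights of the active attributes are multiplied by a parameter `alpha` greater than 1, and on a false positive they are set to zero. The function name `winnow1`, the random test data, and the choice of relevant attributes are all assumptions made for this illustration.

```python
import math
import random


def winnow1(instances, target, n, alpha=2.0, theta=None):
    """Run WINNOW1 over the instances; return the final weights and the mistake count."""
    theta = n / alpha if theta is None else theta
    w = [1.0] * n  # every weight starts at 1
    mistakes = 0
    for x in instances:
        total = sum(wi for wi, xi in zip(w, x) if xi)
        prediction = 1 if total > theta else 0
        correct = target(x)
        if prediction == correct:
            continue  # weights change only on a mistake
        mistakes += 1
        for i in range(n):
            if x[i]:
                # Promotion on a false negative, elimination on a false positive.
                w[i] = w[i] * alpha if correct == 1 else 0.0
    return w, mistakes


n, relevant = 16, (3, 11)  # hypothetical 2-literal monotone disjunction
target = lambda x: int(any(x[i] for i in relevant))
rng = random.Random(0)
instances = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(500)]
w, mistakes = winnow1(instances, target, n)
bound = 2 * len(relevant) * math.log2(n) + 2  # 2k log2(n) + 2 for alpha = 2, theta = n/2
print(mistakes, "<=", bound)
```

Note that a relevant attribute is never eliminated: elimination happens only when the correct response is 0, and in that case every relevant attribute is 0 in the instance.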
Requirements for WINNOW1
• The space needed (without counting bits per weight) and the sequential time needed per trial are both linear in $n$.
• Non-zero weights are powers of $\alpha$, so the weights are at most $\alpha\theta$. Thus if the logarithms (base $\alpha$) of the weights are stored, only $O(\log_2 \log_\alpha \theta)$ bits per weight are needed.
Mistake bound for WINNOW1
Th: Suppose that the target function is a $k$-literal monotone disjunction given by $f(x_1,\dots,x_n) = x_{i_1} \lor \dots \lor x_{i_k}$. If WINNOW1 is run with $\alpha > 1$ and $\theta \ge 1/\alpha$, then for any sequence of instances the total number of mistakes will be bounded by $\alpha k(\log_\alpha \theta + 1) + n/\theta$.
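Example:
Good bounds are obtained with $\theta = n/\alpha$, for which the bound becomes $\alpha k \log_\alpha n + \alpha$. For $\alpha = 2$ we get the bound $2k \log_2 n + 2$. The dominating first term is minimized for $\alpha = e$; the bound then becomes $(e/\log_2 e)\, k \log_2 n + e < 1.885\, k \log_2 n + e$.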
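The choice $\alpha = e$ can be checked with a short calculation. The sketch below assumes the WINNOW1 bound in the form $\alpha k(\log_\alpha \theta + 1) + n/\theta$ and substitutes $\theta = n/\alpha$.

```latex
% Substituting \theta = n/\alpha into the bound:
\[
  \alpha k\bigl(\log_\alpha (n/\alpha) + 1\bigr) + \frac{n}{n/\alpha}
  = \alpha k \log_\alpha n + \alpha .
\]
% The first term, written with natural logarithms:
\[
  \alpha k \log_\alpha n = k \ln n \cdot \frac{\alpha}{\ln \alpha},
  \qquad
  \frac{d}{d\alpha}\,\frac{\alpha}{\ln \alpha} = \frac{\ln \alpha - 1}{(\ln \alpha)^2} .
\]
% The derivative is negative for 1 < \alpha < e and positive for \alpha > e,
% so the minimum is at \alpha = e, where
\[
  e\,k \ln n = (e \ln 2)\, k \log_2 n = \frac{e}{\log_2 e}\, k \log_2 n
  \approx 1.884\, k \log_2 n .
\]
```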
Lower mistake bound
Def: For $1 \le k \le n$, let $\hat{C}_k$ denote the class of $k$-literal monotone disjunctions, and let $C_k$ denote the class of all monotone disjunctions that have at most $k$ literals.
Th (lower bound): For $1 \le k \le n$, $\mathrm{opt}(C_k) \ge \mathrm{opt}(\hat{C}_k) \ge k \lfloor \log_2 (n/k) \rfloor$. For $n > 1$ we also have $\mathrm{opt}(C_k) \ge \frac{k}{8}\bigl(1 + \log_2 (n/k)\bigr)$.
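Modified WINNOW1
• For an instance space $X \subseteq \{0,1\}^n$ and $\delta$ such that $0 < \delta \le 1$, let $F(X,\delta)$ be the class of functions $f: X \to \{0,1\}$ for which there exist $\mu_1,\dots,\mu_n \ge 0$ such that for all $(x_1,\dots,x_n) \in X$:
$\sum_i \mu_i x_i \ge 1$ if $f(x_1,\dots,x_n) = 1$ (*)
$\sum_i \mu_i x_i \le 1 - \delta$ if $f(x_1,\dots,x_n) = 0$ (**)
• So the inverse images of 0 and 1 are linearly separable with a minimum separation that depends on $\delta$. The mistake bound that we derive will be practical only for those functions for which $\delta$ is sufficiently large.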
Example: an r-of-k threshold function
Def: Let $X = \{0,1\}^n$. An $r$-of-$k$ threshold function $f$ is defined by selecting a set of $k$ significant variables; $f = 1$ whenever at least $r$ of these $k$ variables are 1.
• $f = 1 \iff x_{i_1} + \dots + x_{i_k} \ge r$, so
$(1/r)x_{i_1} + \dots + (1/r)x_{i_k} \ge 1$ if $f(x_1,\dots,x_n) = 1$
$(1/r)x_{i_1} + \dots + (1/r)x_{i_k} \le 1 - 1/r$ if $f(x_1,\dots,x_n) = 0$
Thus the $r$-of-$k$ threshold functions belong to $F(\{0,1\}^n, 1/r)$.
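The membership claim can be checked by brute force for a small case. The sketch below assumes a hypothetical 2-of-3 threshold function over 5 variables and tests the two separation conditions, sum of `mu[i] * x[i]` at least 1 on positive instances and at most `1 - delta` on negative ones, over every point of the instance space. The helper name `separates` is made up for this example, and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction
from itertools import product


def separates(f, mu, delta, n):
    """Check both separation conditions for f with weights mu on all of {0,1}^n."""
    for x in product([0, 1], repeat=n):
        s = sum(m * xi for m, xi in zip(mu, x))
        if f(x) == 1 and s < 1:
            return False  # a positive instance falls below 1
        if f(x) == 0 and s > 1 - delta:
            return False  # a negative instance falls above 1 - delta
    return True


n, r, significant = 5, 2, (0, 2, 3)  # hypothetical 2-of-3 threshold function
f = lambda x: int(sum(x[i] for i in significant) >= r)
# Weight 1/r on each significant variable, 0 elsewhere.
mu = [Fraction(1, r) if i in significant else Fraction(0) for i in range(n)]

print(separates(f, mu, Fraction(1, r), n))   # separation 1/r holds
print(separates(f, mu, Fraction(3, 4), n))   # a larger separation fails for these weights
```

The second call shows that $1/r$ is the best separation these particular weights achieve: an instance with exactly $r - 1$ significant variables set gives the sum $1 - 1/r$.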
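WINNOW2
• The only change from WINNOW1 is the updating rule applied when a mistake is made:
• Prediction 0, correct response 1 (promotion step): $w_i := \alpha \cdot w_i$ for every $i$ with $x_i = 1$.
• Prediction 1, correct response 0 (demotion step): $w_i := w_i / \alpha$ for every $i$ with $x_i = 1$.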
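As a minimal sketch of WINNOW2, the code below assumes the update rule from Littlestone's original description: the promotion step is the same as in WINNOW1, while on a false positive the active weights are divided by `alpha` instead of being set to zero. It also assumes the parameter choice `alpha = 1 + delta / 2` and the mistake bound $8r^2 + 5k + 14kr\ln n$ for $r$-of-$k$ threshold functions with $\theta = n$. The function name `winnow2`, the random data, and the choice of significant variables are hypothetical.

```python
import math
import random


def winnow2(instances, target, n, delta, theta):
    """Run WINNOW2 with alpha = 1 + delta/2; return final weights and mistake count."""
    alpha = 1 + delta / 2
    w = [1.0] * n
    mistakes = 0
    for x in instances:
        total = sum(wi for wi, xi in zip(w, x) if xi)
        prediction = 1 if total > theta else 0
        correct = target(x)
        if prediction == correct:
            continue  # weights change only on a mistake
        mistakes += 1
        # Promotion multiplies by alpha; demotion divides by alpha.
        factor = alpha if correct == 1 else 1 / alpha
        for i in range(n):
            if x[i]:
                w[i] *= factor
    return w, mistakes


n, r, significant = 10, 2, (0, 4, 7)  # hypothetical 2-of-3 threshold function
k = len(significant)
target = lambda x: int(sum(x[i] for i in significant) >= r)
rng = random.Random(1)
instances = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(2000)]
w, mistakes = winnow2(instances, target, n, delta=1 / r, theta=n)
bound = 8 * r**2 + 5 * k + 14 * k * r * math.log(n)
print(mistakes, "<=", round(bound, 1))
```

Because weights are only ever multiplied or divided by `alpha`, they stay strictly positive, so an attribute that was demoted by mistake can recover. This is what lets WINNOW2 handle targets such as $r$-of-$k$ functions, where a significant variable can be 1 in a negative instance.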
Requirements for WINNOW2
• We use $\alpha = 1 + \delta/2$ for learning a target function in $F(X,\delta)$.
• Space and time requirements for WINNOW2 are similar to those for WINNOW1. However, more bits will be needed to store each weight, perhaps as many as the logarithm of the mistake bound.
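Mistake bound for WINNOW2
Th: For $0 < \delta \le 1$, if the target function $f$ is in $F(X,\delta)$ for $X \subseteq \{0,1\}^n$, if $\mu_1,\dots,\mu_n$ have been chosen so that $f$ satisfies (*) and (**), and if WINNOW2 is run with $\alpha = 1 + \delta/2$ and $\theta \ge 1$ and the algorithm receives instances from $X$, then the number of mistakes will be bounded by
$\frac{8}{\delta^2}\,\frac{n}{\theta} + \left(\frac{5}{\delta} + \frac{14 \ln \theta}{\delta^2}\right) \sum_i \mu_i$.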
Example: an r-of-k threshold function
We now calculate the mistake bound for $r$-of-$k$ threshold functions. We have $\delta = 1/r$ and $\sum_i \mu_i = k/r$. So for $\alpha = 1 + 1/(2r)$ and $\theta = n$ the mistake bound is $8r^2 + 5k + 14kr \ln n$. Note that 1-of-$k$ threshold functions are just $k$-literal monotone disjunctions. Thus with $\alpha = 3/2$, WINNOW2 will learn monotone disjunctions. The mistake bound is similar to the bound for WINNOW1, though with larger constants.
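The substitution can be written out term by term. The sketch below assumes the WINNOW2 bound in the form $\frac{8}{\delta^2}\frac{n}{\theta} + \bigl(\frac{5}{\delta} + \frac{14\ln\theta}{\delta^2}\bigr)\sum_i \mu_i$ and plugs in $\delta = 1/r$, $\theta = n$, and $\sum_i \mu_i = k/r$.

```latex
\[
  \frac{8}{\delta^2}\,\frac{n}{\theta}
  + \left(\frac{5}{\delta} + \frac{14\ln\theta}{\delta^2}\right)\sum_i \mu_i
  \;=\; 8r^2 \cdot \frac{n}{n}
  + \bigl(5r + 14 r^2 \ln n\bigr)\frac{k}{r}
  \;=\; 8r^2 + 5k + 14\,k\,r \ln n .
\]
% For monotone disjunctions (r = 1) this gives 8 + 5k + 14 k \ln n,
% which is again logarithmic in n and linear in k.
```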
Conclusion:
• The first part gives general results about how many mistakes an effective learner might make if computational complexity were not an issue.
• The second part describes an efficient algorithm for learning a specific target class. A key advantage of WINNOW1 and WINNOW2 is their performance when few attributes are relevant.
Conclusion:
• If we define the number of relevant variables needed to express a function in the class $F(\{0,1\}^n, \delta)$ to be the least number of strictly positive weights needed to describe a separating hyperplane, then this target class for $n > 1$ can be learned with a number of mistakes bounded by $C \cdot k \log n / \delta^2$ when the target function can be expressed with $k$ relevant variables.