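Advanced Topics in Information Knowledge Networks
Prediction and Learning 1: Majority Vote Algorithm
Hiroki Arimura, Takuya Kida
Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University
email: {arim,kida}@ist.hokudai.ac.jp
http://www-ikn.ist.hokudai.ac.jp/ikn-tokuron/
http://www-ikn.ist.hokudai.ac.jp/~arim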
Prediction and Learning
• Training Data
  • A set of $n$ pairs of observations $(x_1, y_1), \ldots, (x_n, y_n)$ generated by some unknown rule.
• Prediction
  • Predict the output $y$ given a new input $x$.
• Learning
  • Find a function $y = h(x)$ for the prediction within a class of hypotheses $H = \{h_0, h_1, h_2, \ldots, h_i, \ldots\}$.
An On-line Learning Framework
• Data
  • A sequence of observation pairs $(x_1, y_1), \ldots, (x_n, y_n), \ldots$ generated by some unknown rule.
• Learning
  • A learning algorithm $A$ receives the next input $x_n$, predicts $h(x_n)$, receives the correct output $y_n$, and incurs a mistake if $y_n \ne h(x_n)$.
  • If a mistake occurs, $A$ updates the current hypothesis $h$. This process is repeated.
• Goal
  • Find a good hypothesis $h \in H$ while minimizing the number of mistakes in prediction.
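The predict / observe / update loop of this framework can be sketched as follows. This is only an illustration: the `FlipOnMistake` learner and the example stream are hypothetical and do not come from the slides.

```python
class FlipOnMistake:
    """Hypothetical toy learner: predicts its current guess, and adopts
    the correct label as the new guess after each mistake."""

    def __init__(self, guess=-1):
        self.guess = guess

    def predict(self, x):
        return self.guess

    def update(self, x, y):
        self.guess = y


def run_online(learner, stream):
    """Run the on-line protocol over (x, y) pairs and count mistakes."""
    mistakes = 0
    for x, y in stream:
        y_hat = learner.predict(x)   # predict before seeing the answer
        if y_hat != y:               # the correct output y is revealed
            mistakes += 1
            learner.update(x, y)     # update only when a mistake occurs
    return mistakes


# Hypothetical data stream with labels in {+1, -1}.
stream = [(0, +1), (1, +1), (2, -1), (3, -1)]
online_mistakes = run_online(FlipOnMistake(), stream)
print(online_mistakes)
```

The algorithms on the following slides differ only in how `predict` and `update` are chosen.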
Learning an unknown function
• Strategy
  • Select a hypothesis $h \in H$ for making the prediction $y = h(x)$ from a given class of functions $H = \{h_0, h_1, h_2, \ldots, h_i, \ldots\}$.
• Question
  • How can we select the best hypothesis $h \in H$, one that minimizes the number of mistakes during prediction?
  • We ignore the computation time.
Naive Algorithm (Sequential)
• Algorithm:
  • Given: the hypothesis class $H = \{h_1, \ldots, h_N\}$.
  • Initialize: $k = 1$.
  • Repeat the following:
    • Receive the next input $x$.
    • Predict by $h(x) = h_k(x)$. Receive the correct output $y$.
    • If a mistake occurs, then set $k = k + 1$. (Exhaustive search!)
• Observation:
  • The naive algorithm makes at most $N$ mistakes, provided that some hypothesis in $H$ is consistent with the data.
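A minimal sketch of the sequential algorithm above, assuming hypotheses are Python callables returning +1 or -1. The threshold family `make_threshold` and the target threshold 5 are hypothetical choices made only for this example.

```python
def make_threshold(t):
    """Hypothetical hypothesis h_t(x) = +1 if x >= t, else -1."""
    return lambda x: +1 if x >= t else -1


def naive_sequential(hypotheses, stream):
    """Use h_k until it errs, then move on to h_{k+1} (0-based index here)."""
    k = 0
    mistakes = 0
    for x, y in stream:
        if hypotheses[k](x) != y:
            mistakes += 1
            k += 1
            if k == len(hypotheses):
                # Only reachable in the inconsistent case.
                raise ValueError("no hypothesis is consistent with the data")
    return k, mistakes


N = 8
hypotheses = [make_threshold(t) for t in range(N)]
target = make_threshold(5)                      # assumed unknown rule, in H
stream = [(x, target(x)) for x in range(N)]

final_k, naive_mistakes = naive_sequential(hypotheses, stream)
print(final_k, naive_mistakes)
```

On this stream every hypothesis before the target is discarded one at a time, which is the exhaustive-search behaviour the slide points out.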
Halving Algorithm
• Naive Algorithm
  • makes $N$ mistakes in the worst case.
  • $N$ is usually exponentially large in the size $|h|$ of a hypothesis $h \in H$.
• Basic Idea
  • We want to achieve an exponential speed-up!
  • Eliminate at least half of the hypotheses whenever a mistake happens.
  • The key is to choose the predicted value $h(x)$ carefully by majority voting, so that one mistake implies that at least half of the hypotheses were wrong.
Halving Algorithm [Barzdin and Freivalds 1972]
• Algorithm:
  • Initialize the hypothesis class $H = \{h_1, \ldots, h_N\}$.
  • Repeat the following:
    • Receive the next input $x$.
    • Split $H$ into $A_{+1} = \{h \in H : h(x) = +1\}$ and $A_{-1} = \{h \in H : h(x) = -1\}$.
    • If $|A_{+1}| \ge |A_{-1}|$ then predict $\hat{y} = +1$; otherwise predict $\hat{y} = -1$. (Majority voting) Receive the correct output $y$.
    • If the prediction is wrong ($\hat{y} \ne y$), then remove all hypotheses that made the mistake: $H = H - A_{\hat{y}}$. (Eliminates at least half)
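The halving steps can be sketched as below, assuming hypotheses are callables returning +1 or -1 and that ties are broken towards +1 as in the slide. The threshold family, the target threshold 7, and the input order are hypothetical and were chosen so that every prediction on the first three inputs is wrong.

```python
import math


def make_threshold(t):
    """Hypothetical hypothesis h_t(x) = +1 if x >= t, else -1."""
    return lambda x: +1 if x >= t else -1


def halving(hypotheses, stream):
    """Predict by majority vote; on a mistake, drop the hypotheses that erred."""
    active = list(range(len(hypotheses)))        # indices of surviving hypotheses
    mistakes = 0
    for x, y in stream:
        plus = [i for i in active if hypotheses[i](x) == +1]
        minus = [i for i in active if hypotheses[i](x) == -1]
        y_hat = +1 if len(plus) >= len(minus) else -1
        if y_hat != y:
            mistakes += 1
            # Keep only the side that predicted the correct output y.
            active = plus if y == +1 else minus
    return active, mistakes


N = 8
hypotheses = [make_threshold(t) for t in range(N)]
target = make_threshold(7)                       # assumed unknown rule, in H
stream = [(x, target(x)) for x in [3, 5, 6, 0, 7]]

survivors, halving_mistakes = halving(hypotheses, stream)
print(survivors, halving_mistakes, math.log2(N))
```

Each of the three mistakes removes at least half of the surviving hypotheses (8 to 4 to 2 to 1), so the run meets the $\log_2 N = 3$ bound exactly.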
Halving Algorithm: Result
• Assumption (Consistent case):
  • The unknown function $f$ belongs to the class $H$.
• Theorem (Barzdins '72; Littlestone '87):
  • The Halving algorithm makes at most $\log_2 N$ mistakes, where $N$ is the number of hypotheses in $H$.
• This gives a general strategy for designing efficient online learning algorithms.
• The Halving algorithm is not optimal [Littlestone 90].
[Proof]
• In this proof the hypotheses are viewed as $n$ experts (so $n = N$), $A$ is the set of active experts, and $x_i$ is the prediction of expert $i$.
• On receiving the input vector $x \in \{+1,-1\}^n$, $x$ splits the active experts in $A$ into $A_{+1}$ and $A_{-1}$, where $A_{\mathrm{val}} = \{ i \in A : x_i = \mathrm{val} \}$ for every $\mathrm{val} \in \{+1,-1\}$.
• Since the prediction is made according to the larger set, if a mistake occurs then the larger set is removed from $A$.
• Therefore, each mistake reduces the number of active experts in $A$ by at least half.
• It follows that $|A| \le n \cdot (1/2)^M$ after $M$ mistakes.
• Note that any subset $A_{\mathrm{val}}$ ($\mathrm{val} \in \{+1,-1\}$) to which a perfect expert belongs always makes the correct prediction. This ensures that every perfect expert survives any update of $A$.
• Since $|A| \ge 1$ by assumption, we have $1 \le n \cdot (1/2)^M$, and we conclude that the halving algorithm makes at most $M \le \lg n$ mistakes.
Majority Vote Algorithm
• Naive & Halving Algorithms
  • Work only in the consistent case.
  • Often miss the correct answer in the inconsistent case.
• Inconsistent case
  • The target function does not necessarily belong to the hypothesis class $H$.
  • It may be that none of the hypotheses traces the target function completely.
• Tentative Goal
  • Predict as well as the best expert.
Majority Vote Algorithm [Littlestone & Warmuth 1989]
• Majority Vote algorithm:
  • Initialize: $w = (1, \ldots, 1) \in \mathbb{R}^N$.
  • For each round $t = 1, \ldots, T$ do:
    • Receive the next input $x$.
    • Predict by $f(x) = \mathrm{sign}\left(\sum_{h_i \in H} w_i h_i(x)\right)$. (weighted majority vote)
    • Receive the correct answer $y \in \{+1,-1\}$.
    • If a mistake occurs ($y \ne f(x)$), then for all $h_i \in H$ such that $h_i(x) = f(x)$ do $w_i = w_i / 2$. // the majority hypotheses that contributed to the last prediction
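The weight update above can be sketched as follows, assuming hypotheses are callables returning +1 or -1 and that a zero weighted sum is resolved as +1 (the slide does not specify tie-breaking). The three experts and the noisy stream are hypothetical, chosen so that no expert is perfect.

```python
import math


def majority_vote(hypotheses, stream):
    """Weighted majority vote; halve the weights of the experts that
    agreed with a wrong prediction."""
    w = [1.0] * len(hypotheses)
    mistakes = 0
    for x, y in stream:
        votes = [h(x) for h in hypotheses]
        score = sum(wi * vi for wi, vi in zip(w, votes))
        y_hat = +1 if score >= 0 else -1          # assumed tie-break: +1
        if y_hat != y:
            mistakes += 1
            w = [wi / 2 if vi == y_hat else wi for wi, vi in zip(w, votes)]
    return w, mistakes


# Hypothetical experts: always +1, always -1, and a parity rule.
experts = [
    lambda x: +1,
    lambda x: -1,
    lambda x: +1 if x % 2 == 0 else -1,
]
# Labels follow the parity rule except for one noisy example at x = 3.
stream = [(0, +1), (1, -1), (2, +1), (3, +1), (4, +1), (5, -1)]

weights, mv_mistakes = majority_vote(experts, stream)
best = min(sum(h(x) != y for x, y in stream) for h in experts)
bound = (best + math.log2(len(experts))) / math.log2(4 / 3)
print(weights, mv_mistakes, best, bound)
```

Here the best expert (the parity rule) errs once, the algorithm errs twice, and both numbers stay well inside the bound stated on the next slide.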
Majority Vote Algorithm: Result
• Assumption (Inconsistent case):
  • The unknown function $f$ may not belong to $H$.
  • The best expert makes $m$ mistakes with respect to the target function $f$.
• Theorem (Littlestone & Warmuth)
  • The majority vote algorithm makes at most $2.41\,(m + \log_2 N)$ mistakes, where $N$ is the number of hypotheses in $H$.
• The majority vote algorithm performs nearly as well as the unknown best expert, up to a constant factor and an additive $\log N$ term.
[Proof]
• First, we focus on how the sum of the weights $W = \sum_i w_i$ changes during learning.
• Suppose that, up to some round $t \ge 1$, the best expert has made $m$ mistakes and the majority vote algorithm has made $M$ mistakes. Initially, the sum of the weights is $W = n$ by construction, where $n$ is the number of experts (hypotheses).
• Suppose that the majority vote algorithm makes a mistake on an input vector $x$ with weight vector $w$.
• Let $I$ be the set of experts who contributed to the (wrong) prediction, and let $W_I = \sum_{i \in I} w_i$ be the sum of the corresponding weights.
• Since the prediction follows the weighted majority, we have $W_I \ge W \cdot (1/2)$ (*1).
• Since the weights of the wrong experts in $I$ are halved, the sum $W'$ of the weights after the update is given by $W' = W - W_I \cdot (1/2) \le W - W \cdot (1/2) \cdot (1/2) = W - W \cdot (1/4) = W \cdot (3/4)$ from (*1). Thus, $W_t \le W_{t-1} \cdot (3/4)$.
• Since the sum $W$ drops to at most 3/4 of its previous value whenever a mistake occurs, the current sum is bounded above by $W \le n \cdot (3/4)^M$ (*2).
• On the other hand, we observe the change of the weight of a best expert, say $k$. By assumption, the best expert $k$ has made $m$ mistakes.
• Since the initial weight is $w_k = 1$ and the weight is halved at most $m$ times (only in rounds where $k$ itself is wrong), the current weight satisfies $w_k \ge (1/2)^m > 0$ (*3).
• Since $k$ is one of the experts in $E = \{1, \ldots, n\}$, its weight is a part of $W$. Therefore, at any round, we have the inequality $w_k \le W$ (*4).
• Combining (*2), (*3), and (*4), we obtain the inequality $(1/2)^m \le n \cdot (3/4)^M$.
• Solving this inequality: $(1/2)^m \le n \cdot (3/4)^M \Rightarrow (4/3)^M \le n \cdot 2^m \Rightarrow M \lg(4/3) \le m + \lg n \Rightarrow M \le (1/\lg(4/3)) \, (m + \lg n)$.
• Hence $M \le 2.41 \cdot (m + \lg n)$, since $1/\lg(4/3) = 2.4094\ldots$ ■
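The constant in the final step can be checked numerically. The helper name `mistake_bound` is introduced here only for illustration; it evaluates $(m + \lg n)/\lg(4/3)$ from the derivation above.

```python
import math

# Constant from the last step of the proof: 1 / lg(4/3).
constant = 1 / math.log2(4 / 3)


def mistake_bound(m, n):
    """Upper bound (m + lg n) / lg(4/3) on the mistakes of the majority
    vote algorithm, for n experts whose best member makes m mistakes."""
    return (m + math.log2(n)) / math.log2(4 / 3)


print(round(constant, 4))
print(mistake_bound(1, 3))
print(mistake_bound(10, 1024))
```

With a perfect expert ($m = 0$) the bound reduces to about $2.41 \lg n$, which is weaker than the $\lg n$ bound of the halving algorithm by exactly this constant factor.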
Conclusion
• Learning functions from examples
  • Given a class $H$ of exponentially many hypotheses
  • The simplest strategy: select the best one from $H$
• Sequential algorithm
  • Consistent case: $O(|H|)$ mistakes
• Halving algorithm
  • Consistent case: $O(\log |H|)$ mistakes
• Majority Vote algorithm
  • Inconsistent case: $O(m + \log |H|)$ mistakes, where $m$ is the number of mistakes made by the best hypothesis
• Next
  • Linear learning machines (Perceptron and Winnow)