
Lecture 3: Perceptron


Presentation Transcript


  1. Lecture 3: Perceptron

  2. Recap: Perceptron algorithm. Datapoints (x_1, y_1), (x_2, y_2), …, with x_t ∈ R^d and y_t ∈ {+1, −1}, separable by a hyperplane through the origin. Algorithm: w = 0; for t = 1, 2, …: if y_t (w · x_t) ≤ 0, update w = w + y_t x_t. Claim: Suppose ||x_t|| ≤ R for all t, and there is some unit vector u ∈ R^d and some “margin” γ > 0 such that y_t (u · x_t) ≥ γ for all t. Then the Perceptron makes at most (R/γ)^2 mistakes/updates.
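
A minimal sketch of this update rule in Python/NumPy (not part of the original slides; the function and variable names are illustrative):

    import numpy as np

    def perceptron(X, y):
        # One pass of the perceptron over datapoints X (n x d) with labels y in {+1, -1}.
        w = np.zeros(X.shape[1])           # hyperplane through the origin
        mistakes = 0
        for x_t, y_t in zip(X, y):
            if y_t * np.dot(w, x_t) <= 0:  # mistake (or exactly on the boundary)
                w = w + y_t * x_t          # update rule from the slide
                mistakes += 1
        return w, mistakes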

  3. Preprocessing step. Points (x, y) where x ∈ R^d, y ∈ {+1, −1}. Add an extra feature to x and set it to 1: x′ = (x, 1) ∈ R^(d+1). Then: points (x, y) linearly separable ⟺ points (x′, y) linearly separable by a hyperplane through the origin.
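
A short illustration of this lifting step in NumPy (an assumed helper, not from the slides):

    import numpy as np

    def add_bias_feature(X):
        # Map each point x in R^d to x' = (x, 1) in R^(d+1), so a separator
        # w' = (w, b) through the origin in the lifted space corresponds to
        # an affine separator w . x + b in the original space.
        return np.hstack([X, np.ones((X.shape[0], 1))])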

  4. Fisher’s IRIS data. Four features: sepal length, sepal width, petal length, petal width. Three classes (species of iris): setosa, versicolor, virginica. 50 instances of each.
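
For reference, the dataset can be loaded with scikit-learn (an assumption; the slides do not say how the data was obtained):

    import numpy as np
    from sklearn.datasets import load_iris

    iris = load_iris()
    X = iris.data      # shape (150, 4): sepal length/width, petal length/width
    y = iris.target    # 0 = setosa, 1 = versicolor, 2 = virginica
    # Binary labels for the "setosa vs. the other two" experiments below:
    y_setosa = np.where(y == 0, 1, -1)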

  5. Features 1 and 2 (sepal width/length)

  6. Features 3 and 4 (petal width/length)

  7. Features 1 and 2; goal: separate setosa from the other two. 1500 updates (with a different permutation of the data: 900).

  8. Features 3 and 4; goal: separate setosa from the other two. Misclassified points per iteration: iteration 1: [1, 51]; iteration 2: [1, 2]; iteration 3: [ ].

  9. Linear separator vs. nearest neighbor. Linear separators: a parametric model, with a fixed number of parameters to learn. Nearest neighbor: nonparametric; the prediction on a test point x depends only on training data near x, not on the rest of the training data. Advantages of linear separators: compact, fast convergence, potentially meaningful.

  10. Nonseparable data. What if the data is not linearly separable? In this case the data is almost separable… how will the perceptron perform?

  11. Online perceptron. Data comes in an endless stream, so convergence is not an issue. But how many mistakes does it make? Suppose that for all t ≥ 0: there is some u ∈ R^d and some k_t ≥ 0 such that for all but k_t of the first t datapoints (x, y), y (u · x) ≥ γ. Then for all t ≥ 0: the perceptron algorithm makes at most (R/γ)^2 + k_t (1 + 2R/γ) updates (i.e. mistakes) up to time t.
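
As a concrete instance (numbers chosen only for illustration): with R = 1, margin γ = 0.1, and k_t = 5 points among the first t that violate the margin, the bound gives (1/0.1)^2 + 5 · (1 + 2·1/0.1) = 100 + 105 = 205 mistakes up to time t.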

  12. Batch perceptron. Batch algorithm: w = 0; while some (x_i, y_i) is misclassified: w = w + y_i x_i. On nonseparable data this will never converge. How can this be fixed? Dream: somehow find the separator that misclassifies the fewest points… but this is NP-hard (in fact, it is NP-hard even to solve approximately).

  13. Fixing the batch perceptron. Idea one: go through the data only once, or a fixed number of times. w = 0; for k = 1 to K: for i = 1 to m: if (x_i, y_i) is misclassified: w = w + y_i x_i. At least this stops! Problem: the final w might not be good. E.g., right before terminating, the algorithm might perform an update on a total outlier…
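
A sketch of this capped variant (illustrative code, under the same assumptions as the earlier snippet):

    import numpy as np

    def perceptron_K_passes(X, y, K):
        # Run the perceptron update for exactly K passes over the data, then stop.
        w = np.zeros(X.shape[1])
        for _ in range(K):
            for x_i, y_i in zip(X, y):
                if y_i * np.dot(w, x_i) <= 0:
                    w = w + y_i * x_i
        return w   # may still be poor: the last update could have been on an outlier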

  14. Voted-perceptron. Idea two: keep the intermediate hypotheses around and have them “vote” [Freund and Schapire, 1998]. n = 1; w_1 = 0; c_1 = 0; for k = 1 to K: for i = 1 to m: if (x_i, y_i) is misclassified: w_{n+1} = w_n + y_i x_i; c_{n+1} = 1; n = n + 1; else: c_n = c_n + 1. At the end, a collection of linear separators w_1, w_2, …, along with survival times: c_n = the amount of time that w_n survived.
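
A sketch of this training loop in Python (illustrative; list positions stand in for the subscript n):

    import numpy as np

    def voted_perceptron_train(X, y, K):
        # Keep every intermediate hypothesis w_n together with its survival count c_n.
        w = np.zeros(X.shape[1])
        ws, cs = [w.copy()], [0]
        for _ in range(K):
            for x_i, y_i in zip(X, y):
                if y_i * np.dot(w, x_i) <= 0:
                    w = w + y_i * x_i       # mistake: start a new hypothesis
                    ws.append(w.copy())
                    cs.append(1)
                else:
                    cs[-1] += 1             # current hypothesis survived one more point
        return ws, cs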

  15. Voted-perceptron, cont’d. Idea two: keep the intermediate hypotheses around and have them “vote” [Freund and Schapire, 1998]. At the end, a collection of linear separators w_1, w_2, …, along with survival times: c_n = the amount of time that w_n survived. This c_n is a good measure of the reliability of w_n. To classify a test point x, use a weighted majority vote: sign( Σ_n c_n · sign(w_n · x) ).

  16. Voted-perceptron, cont’d. • Problem: need to keep around a lot of w_n vectors. • Solutions: • find “representatives” • alternative prediction rule: classify with the single vector w_avg, the survival-time-weighted average of the w_n.
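
Both prediction rules, sketched with the same illustrative names (ws and cs as returned by the training sketch above):

    import numpy as np

    def predict_voted(ws, cs, x):
        # Weighted majority vote: sign( sum_n c_n * sign(w_n . x) )
        votes = sum(c * np.sign(np.dot(w, x)) for w, c in zip(ws, cs))
        return 1 if votes >= 0 else -1

    def predict_averaged(ws, cs, x):
        # Averaged rule: classify with the single vector w_avg = sum_n c_n * w_n
        w_avg = sum(c * w for w, c in zip(ws, cs))
        return 1 if np.dot(w_avg, x) >= 0 else -1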

  17. IRIS: features 3 and 4; goal: separate setosa (circle) from the rest, with one corrupted setosa point. Run Voted-Perceptron for five rounds; survival times c_n (table from the slide): 0, 1, 2, 3, 1, 1, 5, 117, 2, 41, 2, 13, 1, 3, 8, 222, 2, 173, 3, 95, 3, 52. Final hypothesis: 1 wrong (either voting or averaging).

  18. IRIS: features 3 and 4; goal: separate + from o/x. 100 rounds, 1595 updates (5 errors). Final hypothesis: 5 errors for voting, 6 for averaging.

  19. Postscript: multiclass. What if there are k classes? Reduce to binary: all-vs-one. Not always easy to do (the slide shows a figure with classes 1–4 interleaved).
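
One way to implement this reduction (the slide’s “all-vs-one” is commonly called one-vs-rest), reusing perceptron_K_passes from the slide-13 sketch; names are illustrative:

    import numpy as np

    def all_vs_one_train(X, y, classes, K=10):
        # Train one binary separator per class: class c vs. everything else.
        return {c: perceptron_K_passes(X, np.where(y == c, 1, -1), K) for c in classes}

    def all_vs_one_predict(separators, x):
        # Predict the class whose separator scores x the highest.
        return max(separators, key=lambda c: np.dot(separators[c], x))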

  20. Some open problems. Modify the (voted) perceptron algorithm to: [1] find a linear separator with a large margin; [2] “give up” on troublesome points after a while.
