
Machine Learning CS 165B Spring 2012




Presentation Transcript


  1. Machine Learning, CS 165B, Spring 2012

  2. Course outline • Introduction (Ch. 1) • Concept learning (Ch. 2) • Decision trees (Ch. 3) • Ensemble learning • Neural Networks (Ch. 4) • Linear classifiers • Support Vector Machines • Bayesian Learning (Ch. 6) • Instance-based Learning • Clustering • Computational learning theory (?)

  3. Linear classifiers. f(x, w, b) = sign(w · x + b); +1 and −1 denote the two class labels. How would you classify this data?

  4. Linear classifiers. f(x, w, b) = sign(w · x + b); +1 and −1 denote the two class labels. How would you classify this data?

  5. Linear classifiers. f(x, w, b) = sign(w · x + b); +1 and −1 denote the two class labels. How would you classify this data?

  6. Linear classifiers. f(x, w, b) = sign(w · x + b); +1 and −1 denote the two class labels. How would you classify this data?

  7. Linear classifiers. f(x, w, b) = sign(w · x + b); +1 and −1 denote the two class labels. Any of these would be fine... but which is best?

  8. Classifier margin. f(x, w, b) = sign(w · x + b). Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.

  9. Maximum margin. f(x, w, b) = sign(w · x + b). The maximum-margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).

  10. Maximum margin. f(x, w, b) = sign(w · x + b). The maximum-margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM). Support vectors are the data points that the margin pushes up against.

  11. Why maximum margin? • Intuitively this feels safest: if we have made a small error in the location of the boundary (it has been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification. • The model is immune to removal of any non-support-vector data points. • There is some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing. • Empirically it works very well.

  12. Specifying the margin • Plus-plane = { x : w · x + b = +1 } • Minus-plane = { x : w · x + b = −1 } • Classifier boundary: w · x + b = 0 • The "Predict Class = +1" zone lies beyond the plus-plane and the "Predict Class = −1" zone beyond the minus-plane. (See slide 53 of "linear classifiers".)
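
A short derivation of the margin width implied by these two planes (standard, but not spelled out in the transcript):

```latex
% Margin width between the planes w . x + b = +1 and w . x + b = -1
\begin{aligned}
&\text{Let } x^- \text{ lie on the minus-plane and } x^+ = x^- + \lambda w \text{ be the closest point on the plus-plane.}\\
&(w \cdot x^+ + b) - (w \cdot x^- + b) = \lambda\, w \cdot w = 2
\;\Rightarrow\; \lambda = \frac{2}{\lVert w\rVert^2},\\
&M = \lVert x^+ - x^-\rVert = \lambda \lVert w\rVert = \frac{2}{\lVert w\rVert}.
\end{aligned}
```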

  13. Learning the maximum margin classifier • M = margin width = 2 / ||w||, the distance between the plus-plane w · x + b = +1 and the minus-plane w · x + b = −1 • Given a guess of w and b we can: compute whether all data points lie in the correct half-planes, and compute the width of the margin • So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the data points. How? Gradient descent? Simulated annealing? Matrix inversion? EM? Newton's method?

  14. Classification rule • Solution: αi is 0 for all data points except the support vectors; yi is +1/−1 and defines the classification • The final classification rule is quite simple (see the reconstructed form below) • All the cleverness goes into selecting the support vectors that maximize the margin and computing the weight αi to use on each support vector
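
The rule referred to above appeared as an equation on the slide; its standard form, reconstructed here, is

```latex
f(x) \;=\; \operatorname{sign}\!\Big(\sum_{i \in \mathrm{SV}} \alpha_i\, y_i\, (x_i \cdot x) + b\Big),
```

where SV denotes the set of support vectors.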

  15. Why SVM? • Very popular machine learning technique • Became popular in the late 90s (Vapnik 1995; 1998); invented in the late 70s (Vapnik, 1979) • Controls complexity and overfitting, so it works well on a wide range of practical problems • Because of this, it can handle high-dimensional vector spaces, which makes feature selection less critical • Very fast and memory-efficient implementations exist, e.g., svm_light • It is not always the best solution, especially for problems with small vector spaces
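
As a quick illustration (not part of the original slides, and using scikit-learn rather than the svm_light package mentioned above), a linear SVM can be trained and its support vectors inspected like this:

```python
# Minimal sketch of training a hard-margin-style linear SVM (assumes numpy and scikit-learn).
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two clusters labeled +1 and -1.
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [-2.0, -2.0], [-2.5, -3.0], [-3.0, -2.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)  # a very large C approximates the hard-margin LSVM
clf.fit(X, y)

print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("support vectors:", clf.support_vectors_)
print("prediction for (1, 1):", clf.predict([[1.0, 1.0]])[0])
```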

  16. Kernel Trick Instead of just using the features, represent the data using a high-dimensional feature space constructed from a set of basis functions (e.g., polynomial and Gaussian combinations of the base features) Then find a separating plane / SVM in that high-dimensional space Voila: A nonlinear classifier!

  17. Binary vs. multi-class classification • Basic SVMs only do binary classification • One-vs-all: turn an N-way classification into N binary classification tasks (e.g., for the zoo problem: mammal vs. not-mammal, fish vs. not-fish, ...) and pick the class whose classifier gives the highest score • One-vs-one: train N(N−1)/2 pairwise classifiers (mammal vs. fish, mammal vs. reptile, etc.) that vote on the result (a one-vs-all sketch is given below)
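
A rough sketch of the one-vs-all scheme (the helper names here are made up for illustration; scikit-learn also provides this scheme directly, e.g., via OneVsRestClassifier):

```python
# One-vs-all multi-class classification built from binary linear SVMs (sketch).
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(X, y, classes):
    """Train one binary classifier per class (class vs. rest)."""
    models = {}
    for c in classes:
        binary_labels = np.where(y == c, 1, -1)
        models[c] = LinearSVC().fit(X, binary_labels)
    return models

def predict_one_vs_all(models, x):
    """Pick the class whose classifier gives the highest score."""
    scores = {c: m.decision_function([x])[0] for c, m in models.items()}
    return max(scores, key=scores.get)
```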

  18. From LSVM to general SVM. Suppose we are in one dimension. What would SVMs do with this data (labeled points on the x-axis)?

  19. Not a big surprise: the separator is a point on the axis, with a positive "plane" on one side and a negative "plane" on the other.

  20. Harder 1-dimensional dataset. The points are not linearly separable. What can we do now?

  21. Harder 1-dimensional dataset. Transform the data points from 1-D to 2-D with some nonlinear basis function (these give rise to kernel functions).

  22. Harder 1-dimensional dataset. These points are linearly separable now!
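
The specific basis function used in the figure is not shown; a common choice for data of this kind (one class near the origin, the other farther out) is

```latex
\phi : \mathbb{R} \to \mathbb{R}^2, \qquad \phi(x) = (x,\; x^2).
```

Points with small |x| then land low on the parabola and points with large |x| land high, so a horizontal line in the (x, x²) plane separates the two classes.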

  23. Non-linear SVMs • General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable: Φ: x→φ(x)

  24. Nonlinear SVMs: the kernel trick • With this mapping, our discriminant function takes the form shown below • No need to know this mapping explicitly, because we only use the dot product of feature vectors, in both training and testing • A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space: K(xi, xj) = φ(xi)Tφ(xj)
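
The discriminant function referred to above, in its standard kernelized form (reconstructed, since the equation itself did not survive the transcript):

```latex
g(x) \;=\; \operatorname{sign}\!\Big(\sum_{i \in \mathrm{SV}} \alpha_i\, y_i\, K(x_i, x) + b\Big),
\qquad K(x_i, x) = \phi(x_i)^{\mathsf T}\phi(x).
```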

  25. Nonlinear SVMs: the kernel trick • An example with 2-dimensional vectors x = [x1 x2]: let K(xi, xj) = (1 + xiTxj)2. We need to show that K(xi, xj) = φ(xi)Tφ(xj): K(xi, xj) = (1 + xiTxj)2 = 1 + xi12xj12 + 2 xi1xj1xi2xj2 + xi22xj22 + 2xi1xj1 + 2xi2xj2 = [1, xi12, √2 xi1xi2, xi22, √2 xi1, √2 xi2]T [1, xj12, √2 xj1xj2, xj22, √2 xj1, √2 xj2] = φ(xi)Tφ(xj), where φ(x) = [1, x12, √2 x1x2, x22, √2 x1, √2 x2]
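
A quick numerical sanity check of this identity (not from the slides; the test vectors are arbitrary):

```python
# Verify that (1 + xi.xj)^2 equals phi(xi).phi(xj) for the feature map above.
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (1 + x.x')^2 in 2-D."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi = np.array([0.3, -1.2])
xj = np.array([2.0, 1.0])

kernel_value = (1.0 + xi @ xj) ** 2
explicit_value = phi(xi) @ phi(xj)
print(kernel_value, explicit_value)  # the two values agree (up to rounding)
```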

  26. Nonlinear SVMs: the kernel trick • Examples of commonly used kernel functions: the linear kernel, the polynomial kernel, the Gaussian (radial-basis-function, RBF) kernel, and the sigmoid kernel (standard forms are given below) • In general, functions that satisfy Mercer's condition can be kernel functions.
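
The formulas themselves did not survive the transcript; the standard forms of these kernels are:

```latex
\begin{aligned}
\text{Linear:} \quad & K(x_i, x_j) = x_i^{\mathsf T} x_j\\
\text{Polynomial (degree } d\text{):} \quad & K(x_i, x_j) = (1 + x_i^{\mathsf T} x_j)^d\\
\text{Gaussian (RBF):} \quad & K(x_i, x_j) = \exp\!\big(-\lVert x_i - x_j\rVert^2 / (2\sigma^2)\big)\\
\text{Sigmoid:} \quad & K(x_i, x_j) = \tanh\!\big(\beta_0\, x_i^{\mathsf T} x_j + \beta_1\big)
\end{aligned}
```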

  27. Mercer's condition • Necessary and sufficient condition for a valid kernel: consider the Gram matrix K over any set S of points, with elements K(xi, xj) for xi, xj ∈ S • K should be positive semi-definite: zTKz ≥ 0 for all real z.
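
A small sketch of checking this condition numerically on a sample of points (an empirical check on one Gram matrix, not a proof of Mercer's condition):

```python
# Check that the RBF Gram matrix on a random sample is positive semi-definite.
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
S = rng.normal(size=(20, 3))                      # 20 random points in R^3
K = np.array([[rbf_kernel(a, b) for b in S] for a in S])

eigenvalues = np.linalg.eigvalsh(K)               # K is symmetric
print("smallest eigenvalue:", eigenvalues.min())  # >= 0 up to numerical error
```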

  28. Making new kernels from old • New kernels can be made from existing kernels by operations such as addition, multiplication, and rescaling (standard rules are listed below).
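
The rules themselves appeared as equations on the slide; standard closure properties consistent with the text are (among others):

```latex
\begin{aligned}
&k(x, x') = c\,k_1(x, x'), \quad c > 0 && \text{(rescaling)}\\
&k(x, x') = k_1(x, x') + k_2(x, x') && \text{(addition)}\\
&k(x, x') = k_1(x, x')\,k_2(x, x') && \text{(multiplication)}
\end{aligned}
```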

  29. What the kernel trick achieves • All of the computations that we need to do to find the maximum-margin separator can be expressed in terms of scalar products between pairs of datapoints (in the high-dimensional feature space). • These scalar products are the only part of the computation that depends on the dimensionality of the high-dimensional space. • So if we had a fast way to do the scalar products we would not have to pay a price for solving the learning problem in the high-D space. • The kernel trick is just a magic way of doing scalar products a whole lot faster than is usually possible. • It relies on choosing a way of mapping to the high-dimensional feature space that allows fast scalar products.

  30. The kernel trick • For many mappings from a low-dimensional space to a high-dimensional space, there is a simple operation on two vectors in the low-D space that can be used to compute the scalar product of their two images in the high-D space: rather than doing the scalar product the obvious way in the high-D space, we let the kernel do the work in the low-D space.

  31. Mathematical details (linear SVM) • Let {x1, ..., xn} be our data set and let yi ∈ {+1, −1} be the class label of xi; write g(x) = wTx + b • The decision boundary should classify all points correctly: yig(xi) > 0 • For any given w, b, the distance of xi from the decision surface is yig(xi) / ||w|| = yi(wTxi + b) / ||w|| • For any given w, b, the minimum distance over all xi from the decision surface is min_i { yi(wTxi + b) } / ||w|| • The best set of parameters is arg max_{w,b} { min_i { yi(wTxi + b) } / ||w|| }

  32. Stretching w, b • Setting w ← κw and b ← κb leaves the distance from the decision surface unchanged but can change the value of yi(wTxi + b) • Example: w = (1, 1), ||w|| = √2, b = −1, xi = (0.8, 0.4), yi = +1 • yi(wTxi + b) = 0.2; distance from decision surface = 0.2 / √2 • Choose κ = 1/0.2 = 5: now yi(wTxi + b) = 1, while the distance from the decision surface is still 0.2 / √2 • This can be used to normalize any (w, b) so that the lower bound in the inequality constraint is 1 (next slide)
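
Checking the arithmetic of the example (not worked out on the slide):

```latex
\begin{aligned}
&w^{\mathsf T} x_i + b = (1)(0.8) + (1)(0.4) - 1 = 0.2, \qquad \frac{y_i(w^{\mathsf T} x_i + b)}{\lVert w\rVert} = \frac{0.2}{\sqrt{2}};\\
&\kappa = 5:\quad (\kappa w)^{\mathsf T} x_i + \kappa b = 6 - 5 = 1, \qquad \frac{1}{\lVert \kappa w\rVert} = \frac{1}{5\sqrt{2}} = \frac{0.2}{\sqrt{2}}.
\end{aligned}
```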

  33. Reformulated optimization problem • After this normalization: for all i, yi(wTxi + b) ≥ 1, and for some i, yi(wTxi + b) = 1; therefore min_i {yi(wTxi + b)} = 1 • Solve arg max_{w,b} 1 / ||w|| subject to yi(wTxi + b) ≥ 1 • Or, equivalently, solve arg min_{w,b} (1/2) ||w||2 subject to yi(wTxi + b) ≥ 1 • This is quadratic programming (a special case of convex optimization): no local minima, a clean geometric view, and well-established tools for solving it (e.g., CPLEX) • However, the story continues: we will reformulate the optimization
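
In symbols (a restatement of the bullets above):

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^2
\quad \text{subject to} \quad
y_i\,(w^{\mathsf T} x_i + b) \ge 1, \qquad i = 1, \dots, n .
```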

  34. Detour: constrained optimization • Minimize f(x1, x2) = x12 + x22 − 1 subject to h(x1, x2) = x1 + x2 − 1 = 0 • This can be solved the usual way by rewriting x2 in terms of x1 and differentiating • In general, consider the constraint surface h(x) = 0: the gradient ∇h is normal to this surface • Since we are minimizing f, ∇f is also normal to the surface at the optimal point • Therefore ∇f and ∇h are parallel or anti-parallel at the optimum, i.e., there exists a parameter λ ≠ 0 such that ∇f + λ∇h = 0

  35. Linear programming • Find a solution optimizing a linear objective subject to linear constraints • A very useful algorithm: 1300+ papers, 100+ books, 10+ courses, hundreds of companies • Main methods: the simplex method and interior-point methods

  36. Lagrange multipliers • The method of Lagrange multipliers gives a set of necessary conditions for identifying optimal points of constrained optimization problems • This is done by converting a constrained problem to an equivalent unconstrained problem with the help of certain unspecified parameters known as Lagrange multipliers • The classical problem formulation: minimize f(x1, x2, ..., xn) subject to h(x1, x2, ..., xn) = 0 can be converted to: minimize L(x, λ) = f(x) + λh(x), where L(x, λ) is the Lagrangian function and λ is an unspecified constant called the Lagrange multiplier

  37. Finding the optimum using Lagrange multipliers • New problem: minimize L(x, λ) = f(x) + λh(x) • Take the partial derivatives with respect to x and λ and set them to 0 • Example: minimize f(x1, x2) = x12 + x22 − 1 subject to h(x1, x2) = x1 + x2 − 1 = 0 becomes: minimize L(x, λ) = f(x) + λh(x) = x12 + x22 − 1 + λ(x1 + x2 − 1) • Solution: worked out below
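
The worked solution (filled in here, since it was blank in the transcript):

```latex
\frac{\partial L}{\partial x_1} = 2x_1 + \lambda = 0,\quad
\frac{\partial L}{\partial x_2} = 2x_2 + \lambda = 0,\quad
\frac{\partial L}{\partial \lambda} = x_1 + x_2 - 1 = 0
\;\Rightarrow\; x_1 = x_2 = \tfrac12,\ \ \lambda = -1,\ \ f = -\tfrac12 .
```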

  38. Another example • Introduce a Lagrange multiplier λ for the constraint • Construct the Lagrangian • Find the stationary points and solve

  39. How about inequality constraints? • The search for the SVM maximum margin involves inequality constraints: minimize f(x) subject to h(x) ≤ 0 • Two possibilities: • h(x) < 0: the constraint is inactive; the solution is provided by ∇f(x) = 0, and λ = 0 in the Lagrangian equation • h(x) = 0: the constraint is active; the solution is as in the equality-constraint case, with λ > 0 and ∇f and ∇h anti-parallel • In both cases, λh(x) = 0

  40. Karush-Kuhn-Tucker conditions • Minimize f(x) subject to h(x) ≤ 0, with L(x, λ) = f(x) + λh(x) • At the optimal solution, λh(x) = 0 • This can be extended to multiple constraints • Valid for a convex objective function and convex constraints • The problem is the same as min_x (max_λ L(x, λ)) s.t. λ ≥ 0: consider the function max_λ L(x, λ) s.t. λ ≥ 0 for a fixed x • If h(x) > 0 then this function = ∞ • If h(x) ≤ 0 then λ = 0 and the function = f(x) • Therefore, the minimum over all x is the same as minimizing f(x) subject to h(x) ≤ 0

  41. Primal and dual functions • Primal • Dual • Weak duality • Strong duality (standard definitions are given below)
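
The definitions, which appeared as equations on the slide, in their standard form:

```latex
\begin{aligned}
\text{Primal:}\quad & p^{*} = \min_{x}\ \max_{\lambda \ge 0}\ L(x, \lambda)\\
\text{Dual:}\quad & d^{*} = \max_{\lambda \ge 0}\ \min_{x}\ L(x, \lambda)\\
\text{Weak duality:}\quad & d^{*} \le p^{*}\\
\text{Strong duality:}\quad & d^{*} = p^{*}
\end{aligned}
```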

  42. Modified example • Introduce a Lagrange multiplier λ for the inequality constraint • Construct the Lagrangian and find the stationary points • Case λ = 0: x = y = 0, so −xy = 0 • Case λ ≠ 0: x = y = λ = 3, so −xy = −9 • From these cases the primal and dual optimal values can be compared

  43. Returning to the SVM optimization • Quadratic programming with linear constraints: minimize (1/2)||w||2 over w, b subject to yi(wTxi + b) ≥ 1 • Form the Lagrangian function with Lagrange multipliers α ≥ 0, one per constraint, and apply the KKT conditions (see below)
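
The Lagrangian itself was an equation on the slide; the standard form consistent with the surrounding slides is:

```latex
L(w, b, \alpha) \;=\; \tfrac{1}{2}\lVert w\rVert^2 \;-\; \sum_{i} \alpha_i \big( y_i (w^{\mathsf T} x_i + b) - 1 \big),
\qquad \alpha_i \ge 0 .
```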

  44. Solving the dual optimization problem • Setting ∂L/∂w = 0 gives w = ∑i αiyixi, and setting ∂L/∂b = 0 gives ∑i αiyi = 0 • Plug these back in and reduce (next slide)

  45. Reduction • Substituting w = ∑i αiyixi and ∑i αiyi = 0 into the Lagrangian: L(w, b, α) = ½ ∑i∑j αiαjyiyjxiTxj − ∑i∑j αiαjyiyjxiTxj − b∑i αiyi + ∑i αi = ∑i αi − ½ ∑i∑j αiαjyiyjxiTxj • This is the expression maximized at the optimal solution to the dual

  46. Solving the optimization problem • Primal: min over w, b of max over α ≥ 0 of L(w, b, α); #dims = dims(w) • Lagrangian dual problem: max over α ≥ 0 of min over w, b of L(w, b, α); #dims = dims(α) • Which is easier to solve?
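
Written out, the dual problem (a standard form, consistent with the reduction on the previous slide) is:

```latex
\max_{\alpha}\ \sum_{i} \alpha_i \;-\; \tfrac{1}{2} \sum_{i}\sum_{j} \alpha_i \alpha_j\, y_i y_j\, x_i^{\mathsf T} x_j
\quad \text{subject to} \quad \alpha_i \ge 0,\ \ \sum_{i} \alpha_i y_i = 0 .
```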

  47. Karush-Kuhn-Tucker condition • αi (yi(wTxi + b) − 1) = 0 • For every data point, either: αi = 0 (not a support vector), or yi(wTxi + b) = 1 (a support vector, which lies on the margin hyperplane)

  48. A geometrical interpretation • The support vectors are the points whose α values are different from zero; they hold up the separating plane • In the figure, most points of class 1 and class 2 have αi = 0, while the three support vectors have α1 = 0.8, α6 = 1.4, and α8 = 0.6

  49. Solution steps • Solve the dual for α • Plug in and solve for w: w = ∑_{i∈SV} αiyixi • Solve for b: for any support vector, yi(wTxi + b) = 1, i.e., yi(∑_{j∈SV} αjyjxjTxi + b) = 1, so b = yi − ∑_{j∈SV} αjyjxjTxi • Averaging over all support vectors: b = (1/s) ∑_{i∈SV} [yi − ∑_{j∈SV} αjyjxjTxi], where s is the number of support vectors
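
A small sketch of these solution steps in code, assuming the dual variables alpha have already been obtained from a QP solver (the function name and tolerance are illustrative):

```python
# Recover the primal solution (w, b) from the dual variables of a linear SVM.
import numpy as np

def recover_w_b(X, y, alpha, tol=1e-8):
    """X: (n, d) data, y: (n,) labels in {+1, -1}, alpha: (n,) dual variables."""
    # w = sum_i alpha_i * y_i * x_i
    w = (alpha * y) @ X
    # Support vectors are the points with alpha_i > 0.
    sv = alpha > tol
    # For each support vector, b = y_i - w^T x_i; average over them for stability.
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b
```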

  50. Nonlinear SVM: Optimization • Use K(xi,xj) instead of xiTxj • The optimization technique is the same.
