
580.691 Learning Theory Reza Shadmehr


Presentation Transcript


  1. 580.691 Learning Theory, Reza Shadmehr. Topics: Classification via regression; Fisher linear discriminant; Bayes classifier; Confidence and error rate of the Bayes classifier.

  2. Classification via regression • Suppose we wish to classify vector x as belonging to either class C0 or C1. • We can approach the problem as if it were regression: assign the label y = 0 to points in C0 and y = 1 to points in C1, fit $\hat{y} = w_0 + \mathbf{w}^{\top}\mathbf{x}$ by least squares, and classify x according to whether $\hat{y}$ is closer to 0 or to 1 (i.e., threshold at 0.5).

  3. Classification via regression • Model: $\hat{y} = w_0 + \mathbf{w}^{\top}\mathbf{x}$, with targets y in {0, 1}. [Figure: two-class data in the (x1, x2) plane and the fitted regression surface $\hat{y}$ (ranging roughly from -0.5 to 1.5) over that plane.]
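A minimal sketch of this idea in Python, using synthetic two-dimensional data (the data and threshold at 0.5 are illustrative assumptions, not taken from the slides):

```python
import numpy as np

# Classification via regression: treat the 0/1 class labels as regression targets.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))   # class C0, label y = 0
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(50, 2))   # class C1, label y = 1
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(50), np.ones(50)])

# Least-squares fit of y ~ w0 + w^T x (append a column of ones for the intercept).
A = np.hstack([np.ones((X.shape[0], 1)), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

# Classify: predict class 1 when the fitted value exceeds 0.5.
y_hat = A @ w
labels = (y_hat > 0.5).astype(int)
print("training accuracy:", np.mean(labels == y))
```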

  4. Classification via regression: concerns • Model: $\hat{y} = w_0 + \mathbf{w}^{\top}\mathbf{x}$. [Figure: two example datasets in the (x1, x2) plane with the fitted decision boundary; the boundary separates the first dataset well ("This classification looks good") but not the second ("This one not so good").] • Sometimes an x can give us a $\hat{y}$ that is outside our range (outside 0-1).

  5. Classification via regression: concerns • Model: $\hat{y} = w_0 + \mathbf{w}^{\top}\mathbf{x}$. • Since y is a random variable that can take on only the values 0 or 1, the error in this regression will not be normally distributed. • The variance of the error (which is equal to the variance of y given x, namely $p(x)(1 - p(x))$ with $p(x) = P(y = 1 \mid x)$) depends on x, unlike in ordinary linear regression where the noise variance is assumed constant.

  6. Regression as projection • A linear regression function projects each data point onto a line: each data point $\mathbf{x}^{(n)} = [x_1, x_2]^{\top}$ is projected onto $z^{(n)} = \mathbf{w}^{\top}\mathbf{x}^{(n)}$. • For a given $\mathbf{w}$, there will be a specific distribution of the projected points $z = \{z^{(1)}, z^{(2)}, \ldots, z^{(n)}\}$. We can study how well the projected points are distributed into classes. [Figure: the same two-class data in the (x1, x2) plane projected onto three different directions.]

  7. Fisher discriminant analysis • Suppose we wish to classify vector x as belonging to either class C0 or C1. • Class y = 0: $n_0$ points, mean $\mathbf{m}_0$, covariance $S_0$. • Class y = 1: $n_1$ points, mean $\mathbf{m}_1$, covariance $S_1$. • Class descriptions in the classification (or projected) space: the mean and variance of $\hat{y}$ (equivalently, of $z = \mathbf{w}^{\top}\mathbf{x}$) for the x's that belong to class 0 or to class 1.
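Written out, the projected-space class descriptions referred to above take the standard form (a reconstruction consistent with the definitions on this slide, not copied from the slide images):

$$
\mu_k = \mathbf{w}^{\top}\mathbf{m}_k, \qquad
\sigma_k^{2} = \mathbf{w}^{\top} S_k\, \mathbf{w}, \qquad k = 0, 1 .
$$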

  8. Fisher discriminant analysis • Find $\mathbf{w}$ so that when each point is projected to the classification space, the classes are maximally separated. [Figure: the same data projected onto two different directions; one projection gives a large separation between the projected classes, the other a small separation.]

  9. Fisher discriminant analysis • The criterion to maximize is the separation of the projected class means relative to the projected within-class spread: $J(\mathbf{w}) = \dfrac{\left(\mathbf{w}^{\top}(\mathbf{m}_1 - \mathbf{m}_0)\right)^2}{\mathbf{w}^{\top} S\, \mathbf{w}}$, where $S = S_0 + S_1$ is the within-class scatter, a symmetric positive definite matrix. • We can always write $S = R^{\top} R$, where R is a "square root" matrix. • Using R, change the coordinate system of J from $\mathbf{w}$ to $\mathbf{v} = R\,\mathbf{w}$: $J(\mathbf{v}) = \dfrac{\left(\mathbf{v}^{\top} R^{-\top}(\mathbf{m}_1 - \mathbf{m}_0)\right)^2}{\mathbf{v}^{\top}\mathbf{v}}$.

  10. Fisher discriminant analysis • Since scaling $\mathbf{v}$ by an arbitrary constant does not change J, we can take $\mathbf{v}$ to have norm 1, so the numerator is the dot product of a unit vector with the fixed vector $R^{-\top}(\mathbf{m}_1 - \mathbf{m}_0)$. • The dot product of a vector of norm 1 and another vector is maximum when the two have the same direction, so the optimum is $\mathbf{v} \propto R^{-\top}(\mathbf{m}_1 - \mathbf{m}_0)$, i.e. $\mathbf{w} = R^{-1}\mathbf{v} \propto S^{-1}(\mathbf{m}_1 - \mathbf{m}_0)$.
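A minimal numerical sketch of this result, assuming the within-class scatter $S = S_0 + S_1$ and synthetic Gaussian data (both are illustrative assumptions):

```python
import numpy as np

# Fisher linear discriminant: w is proportional to S^{-1} (m1 - m0).
rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([0, 0], [[2.0, 0.5], [0.5, 1.0]], size=100)  # class 0
X1 = rng.multivariate_normal([3, 2], [[2.0, 0.5], [0.5, 1.0]], size=100)  # class 1

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)  # within-class scatter (assumed form)

w = np.linalg.solve(S, m1 - m0)   # direction that maximizes the Fisher criterion
w /= np.linalg.norm(w)

# Project both classes onto w and check the separation of the projected means
# relative to the projected spread.
z0, z1 = X0 @ w, X1 @ w
J = (z1.mean() - z0.mean())**2 / (z0.var() + z1.var())
print("Fisher criterion J along w:", J)
```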

  11. Bayesian classification • Suppose we wish to classify vector x as belonging to one of the classes {1, ..., L}. We are given labeled data and need to form a classification function. • Bayes rule gives the posterior in terms of the likelihood, the prior, and the marginal: $P(C = l \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid C = l)\, P(C = l)}{p(\mathbf{x})}$. • Classify x into the class l that maximizes the posterior probability.
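A minimal sketch of this rule for two classes; the Gaussian class-conditional densities and equal priors below are illustrative assumptions chosen to match the height example that follows:

```python
import numpy as np
from scipy.stats import norm

# Bayes classifier: pick the class with the largest posterior
#   P(C=l | x) proportional to p(x | C=l) P(C=l).
priors = np.array([0.5, 0.5])                 # P(C=0), P(C=1) (assumed equal)
likelihoods = [norm(loc=165.0, scale=7.0),    # p(x | C=0), e.g. "female" height (assumed)
               norm(loc=178.0, scale=7.0)]    # p(x | C=1), e.g. "male" height (assumed)

def posterior(x):
    """Posterior P(C=l | x) for each class l, via Bayes rule."""
    joint = np.array([lik.pdf(x) * pr for lik, pr in zip(likelihoods, priors)])
    return joint / joint.sum()                # divide by the marginal p(x)

x = 172.0
p = posterior(x)
print("P(C=0 | x) =", p[0], " P(C=1 | x) =", p[1], " -> class", int(np.argmax(p)))
```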

  12. Classification when distributions have equal variance • Suppose we wish to classify a person as male or female based on height. • What we have: the class-conditional densities $p(x \mid \text{female})$ and $p(x \mid \text{male})$, each Gaussian with its own mean. • What we want: the posterior probabilities $P(\text{female} \mid x)$ and $P(\text{male} \mid x)$. • Assume equal prior probability of being male or female. [Figure: the female and male height densities over roughly 160-200 cm; note that the two densities have equal variance.]

  13. Classification when distributions have equal variance • [Figure: the class-conditional densities (top) and the posterior probability as a function of x (bottom), with the decision boundary marked where the posterior crosses 0.5.] • To classify, we really don't need to compute the posterior probability. All we need is the ratio $\dfrac{P(C_0 \mid x)}{P(C_1 \mid x)} = \dfrac{p(x \mid C_0)\,P(C_0)}{p(x \mid C_1)\,P(C_1)}$: if this ratio is greater than 1, we choose class 0; otherwise class 1. • The boundary between classes occurs where the ratio is 1, in other words where the log of the ratio is 0. With equal variances and equal priors, setting the log ratio to 0 gives the decision boundary $x^{*} = (\mu_0 + \mu_1)/2$, midway between the two class means.
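A short sketch of the log-ratio rule for the equal-variance case (the means and shared standard deviation are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# Decision by log-ratio: choose class 0 when
#   log[ p(x|C0) P(C0) ] - log[ p(x|C1) P(C1) ] > 0.
mu0, mu1, sigma = 165.0, 178.0, 7.0        # assumed class means and shared std
P0, P1 = 0.5, 0.5                          # equal priors

def log_ratio(x):
    return (norm.logpdf(x, mu0, sigma) + np.log(P0)
            - norm.logpdf(x, mu1, sigma) - np.log(P1))

# The decision boundary is where the log-ratio crosses zero.
boundary = brentq(log_ratio, mu0, mu1)
print("numerical boundary:", boundary)
print("midpoint of means :", (mu0 + mu1) / 2)   # identical for equal variances and priors
```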

  14. Uncertainty of the classifier • Starting with our likelihood and prior, we compute the posterior probability $P(C_1 \mid x)$ as a function of x. • For each x the class label follows a Bernoulli (single-trial binomial) distribution, so the variance of this distribution is $P(C_1 \mid x)\,\bigl(1 - P(C_1 \mid x)\bigr)$. • Classification is most uncertain at the decision boundary, where the posterior is 0.5 and the variance reaches its maximum of 0.25. [Figure: the posterior as a function of x over roughly 140-200 cm (top) and its variance (bottom), peaking at 0.25 at the decision boundary.]
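A small self-contained sketch of this uncertainty curve, again using the illustrative equal-variance Gaussian parameters assumed above:

```python
import numpy as np
from scipy.stats import norm

# Posterior P(C=1 | x) for two equal-variance Gaussian classes with equal priors,
# and the classifier's uncertainty: the Bernoulli variance p(1-p).
mu0, mu1, sigma = 165.0, 178.0, 7.0            # assumed parameters
xs = np.linspace(140.0, 200.0, 601)
joint0 = norm.pdf(xs, mu0, sigma) * 0.5
joint1 = norm.pdf(xs, mu1, sigma) * 0.5
p1 = joint1 / (joint0 + joint1)                # posterior P(C=1 | x)
var = p1 * (1.0 - p1)                          # peaks at 0.25 on the decision boundary
print("max variance:", var.max(), "at x =", xs[np.argmax(var)])
```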

  15. Classification when distributions have unequal variance • Assume, as before, equal prior probabilities for the two classes. • What we have: class-conditional Gaussian densities with different means and different (unequal) variances. • Classification: apply Bayes rule exactly as before; because the variances are unequal, the log ratio is quadratic in x, so there can be more than one decision boundary. [Figure: the two class-conditional densities with unequal variances, the resulting posterior, and its variance, plotted over roughly 140-200 cm.]
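A sketch of the unequal-variance case with illustrative parameters, finding the (possibly two) points where the log ratio crosses zero:

```python
import numpy as np

# With unequal variances the log ratio
#   log[ p(x|C0) P(C0) ] - log[ p(x|C1) P(C1) ]
# is quadratic in x, so there can be two decision boundaries.
mu0, s0 = 165.0, 5.0        # assumed class-0 mean and std
mu1, s1 = 178.0, 10.0       # assumed class-1 mean and std
P0, P1 = 0.5, 0.5

# Coefficients of the quadratic a*x^2 + b*x + c = 0 obtained from the log ratio.
a = 1.0 / (2 * s1**2) - 1.0 / (2 * s0**2)
b = mu0 / s0**2 - mu1 / s1**2
c = (mu1**2 / (2 * s1**2) - mu0**2 / (2 * s0**2)
     + np.log(s1 / s0) + np.log(P0 / P1))
boundaries = np.roots([a, b, c])
print("decision boundaries:", np.sort(boundaries.real))
```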

  16. Bayes error rate: probability of misclassification • $P(\text{error}) = \int_{R_1} p(\mathbf{x} \mid C_0)\,P(C_0)\, d\mathbf{x} + \int_{R_0} p(\mathbf{x} \mid C_1)\,P(C_1)\, d\mathbf{x}$, where $R_0$ and $R_1$ are the decision regions: the first term is the probability of data belonging to $C_0$ that we classify as $C_1$, the second is the probability of data belonging to $C_1$ that we classify as $C_0$. [Figure: the two class-conditional densities with the decision boundary marked and the misclassified areas shaded.] • In general it is actually quite hard to compute P(error), because we need to integrate the posterior probabilities over decision regions that may be discontinuous (for example, when the distributions have unequal variances). To help with this, there is the Chernoff bound.

  17. Bayes error rate: Chernoff bound • In the two-class classification problem, the classification error depends on the area under the minimum of the two posterior probabilities: $P(\text{error}) = \int \min\bigl[\, p(\mathbf{x} \mid C_0)\,P(C_0),\; p(\mathbf{x} \mid C_1)\,P(C_1)\,\bigr]\, d\mathbf{x}$. [Figure: the two posterior curves as a function of x over roughly 140-200 cm; the error is the area under their minimum.]
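A numerical sketch of this identity, reusing the illustrative unequal-variance Gaussian parameters assumed above:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

# Bayes error rate as the area under the minimum of the two joint densities
# p(x|C_l) P(C_l); all parameters are illustrative assumptions.
mu0, s0, mu1, s1 = 165.0, 5.0, 178.0, 10.0
P0 = P1 = 0.5

xs = np.linspace(100.0, 250.0, 20001)
joint0 = norm.pdf(xs, mu0, s0) * P0
joint1 = norm.pdf(xs, mu1, s1) * P1
p_error = trapezoid(np.minimum(joint0, joint1), xs)
print("Bayes error rate (numerical):", p_error)
```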

  18. Bayes error rate: Chernoff bound • To bound the minimum, we need the inequality $\min[a, b] \le a^{\beta}\, b^{1-\beta}$ for $a, b \ge 0$ and $0 \le \beta \le 1$. • To see why, suppose without loss of generality that b is smaller than a. Then $a/b \ge 1$, and $a^{\beta} b^{1-\beta} = b\,(a/b)^{\beta} \ge b = \min[a, b]$. So we can think of $a^{\beta} b^{1-\beta}$ (for any value of $\beta$ in [0, 1]) as an upper bound on $\min[a, b]$. • Returning to our P(error) problem, we replace the min[] with this inequality: $P(\text{error}) \le P(C_0)^{\beta}\, P(C_1)^{1-\beta} \int p(\mathbf{x} \mid C_0)^{\beta}\, p(\mathbf{x} \mid C_1)^{1-\beta}\, d\mathbf{x}$. • The bound is found by numerically finding the value of $\beta$ that minimizes the above expression. The key benefit is that the search is in the one-dimensional space of $\beta$, and we have also gotten rid of the discontinuous decision regions.
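A sketch of that one-dimensional search over beta, again with the illustrative Gaussian parameters used in the earlier sketches:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid
from scipy.optimize import minimize_scalar

# Chernoff bound: P(error) <= P(C0)^beta * P(C1)^(1-beta)
#                             * integral of p(x|C0)^beta p(x|C1)^(1-beta) dx,
# minimized over beta in [0, 1]. Parameters are illustrative assumptions.
mu0, s0, mu1, s1 = 165.0, 5.0, 178.0, 10.0
P0 = P1 = 0.5
xs = np.linspace(100.0, 250.0, 20001)
p0 = norm.pdf(xs, mu0, s0)
p1 = norm.pdf(xs, mu1, s1)

def chernoff_bound(beta):
    integral = trapezoid(p0**beta * p1**(1.0 - beta), xs)
    return P0**beta * P1**(1.0 - beta) * integral

res = minimize_scalar(chernoff_bound, bounds=(0.0, 1.0), method="bounded")
print("best beta:", res.x, " Chernoff bound on P(error):", res.fun)
```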
