
Linear Regression & Classification



Presentation Transcript


  1. Linear Regression & Classification

    Prof. Navneet Goyal, CS & IS, BITS Pilani
  2. Fundamentals of Modeling A model is an abstract representation of a real-world process. Y = 3X + 2 is a very simple model of how variable Y might relate to variable X; it is an instance of a more general model structure Y = aX + b, where a and b are parameters. θ is generally used to denote a generic parameter or a set (or vector) of parameters, e.g. θ = {a, b}. Values of the parameters are chosen by estimation, that is, by minimizing or maximizing an appropriate score function measuring the fit of the model to the data. Before we can choose the parameters, we must choose an appropriate functional form of the model itself.
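
As a hedged illustration (not part of the original slides), here is a minimal NumPy sketch of this estimation step: synthetic data are assumed to come from Y = 3X + 2 plus noise, and the parameters θ = {a, b} are estimated by minimizing a sum-of-squares score function.

```python
import numpy as np

# Synthetic data from the "true" process Y = 3X + 2 plus noise (illustrative assumption)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3 * x + 2 + rng.normal(scale=1.0, size=50)

# Estimate theta = {a, b} by minimizing the sum-of-squares score function.
# np.polyfit with degree 1 returns the least-squares estimates [a, b].
a_hat, b_hat = np.polyfit(x, y, deg=1)
print(f"estimated a = {a_hat:.3f}, b = {b_hat:.3f}")  # should be close to 3 and 2
```
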
  3. Fundamentals of Modeling Predictive modeling (PM) can be thought of as learning a mapping from an input set of vector measurements x to a scalar output y (a vector output is also possible but rarely used in practice). One of the variables is expressed as a function of the others (the predictor variables): with response variable Y and predictor variables Xi, ŷ = f(x1, x2, …, xp; θ). When Y is quantitative, this task of estimating a mapping from the p-dimensional X to Y is called regression. When Y is categorical, the task of learning a mapping from X to Y is called classification learning or supervised classification.
  4. Predictive Modeling Predictive modeling Predicts the value of some target characteristic of an object on the basis of observed values of other characteristics of the object Examples: Regression (Prediction in DM) & Classification
  5. Predictive Modeling Prediction Linear regression Nonlinear regression Classification (supervised learning) Decision trees k-NN SVM ANN
  6. Definition of Regression Regression is a (statistical) methodology that utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other, or others. Examples: Sales of a product can be predicted by using the relationship between sales volume and amount of advertising The performance of an employee can be predicted by using the relationship between performance and aptitude tests The size of a child’s vocabulary can be predicted by using the relationship between the vocabulary size, the child’s age and the parents’ educational input.
  7. Regression Problem Visualisation (figure: scatter plot of the observed targets y and fitted values ŷ against x, with RMSE = s)
  8. Structure of a Linear Regression Model Given a set of features x, a linear predictor has the form ŷ = w0 + w1x1 + … + wpxp. The output ŷ is a real-valued, quantitative variable.
  9. Classification Problem Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the Classification Problem is to define a mapping f: D → C, where each ti is assigned to one class. Prediction is similar, but may be viewed as having an infinite number of classes.
  10. Classification Classification is the task of assigning an object, described by a feature vector, to one of a set of mutually exclusive groups A linear classifier has a linear decision boundary The perceptron training algorithm is guaranteed to converge in a finite time when the data set is linearly separable
  11. What is Classification? Classification is also known as (statistical) pattern recognition. The aim is to build a machine/algorithm that can assign appropriate qualitative labels to new, previously unseen quantitative data using a priori knowledge and/or information contained in a training set. The patterns to be classified are usually groups of measurements/observations that are believed to be informative for the classification task. Example: face recognition. (Diagram: training data D = {X, y} and prior knowledge are used to design/learn a classifier m(θ, x); given a new pattern x, the classifier outputs a predicted class label ŷ.)
  12. Classification: Applications Spam mail detection, IDS (rare event classification), credit rating, medical diagnosis, categorizing cells as malignant or benign based on MRI scans, classifying galaxies based on their shapes, predicting preterm births, crop yield prediction, identifying mushrooms as poisonous or edible, …
  13. Classification: Applications Example: Credit Card Company Every purchase is placed in 1 of 4 classes Authorize Ask for further identification before authorizing Do not authorize Do not authorize but contact police Two functions of Data Mining Examine historical data to determine how the data fit into 4 classes Apply the model to each new purchase
  14. Classification: 3 phase job Model building phase (learning phase) Testing phase Model usage phase
  15. Distance-based Classification: Nearest Neighbors Compute the distance from the test record to the training records and choose the k "nearest" records. If it walks like a duck, quacks like a duck, and looks like a duck, then it is probably a duck.
  16. Definition of Nearest Neighbor The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
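
A minimal sketch of this nearest-neighbour rule, assuming Euclidean distance and a small made-up training set (the slides do not fix a distance measure or a data set):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training records."""
    # Euclidean distance from the query record to every training record
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances (the k nearest neighbours)
    nearest = np.argsort(dists)[:k]
    # Majority class label among those neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny illustrative data set (assumed for the example)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["duck", "duck", "goose", "goose"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "duck"
```
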
  17. Find a linear hyperplane (decision boundary) that will separate the data Support Vector Machines
  18. One Possible Solution Support Vector Machines
  19. Another possible solution Support Vector Machines
  20. Other possible solutions Support Vector Machines
  21. Which one is better? B1 or B2? How do you define better? Support Vector Machines
  22. Find a hyperplane that maximizes the margin => B1 is better than B2 Support Vector Machines
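
One hedged way to obtain such a maximum-margin hyperplane in practice is a linear SVM; the sketch below assumes scikit-learn's SVC with a linear kernel and a small synthetic data set, neither of which is prescribed by the slides.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes (illustrative data)
X = np.array([[1, 1], [2, 1], [1, 2], [6, 6], [7, 6], [6, 7]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A linear SVM finds the separating hyperplane w.x + b = 0 with maximum margin;
# a large C keeps the emphasis on separating the training data (hard-margin-like).
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("support vectors:", clf.support_vectors_)
```
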
  23. Support Vector Machines What if the problem is not linearly separable?
  24. Nonlinear Support Vector Machines What if decision boundary is not linear?
  25. Support Vector Machines The solid line is preferred. Geometrically we can characterize the solid plane as being "furthest" from both classes. How can we construct the plane "furthest" from both classes?
  26. Support Vector Machines Examine the convex hull of each class’ training data (indicated by dotted lines) and then find the closest points in the two convex hulls (circles labeled d and c). The convex hull of a set of points is the smallest convex set containing the points. If we construct the plane that bisects these two points (w=d-c), the resulting classifier should be robust in some sense. Figure – Best plane bisects closest points in the convex hulls
  27. Convex Sets Convex Set Non-Convex or Concave Set A function (in blue) is convex if and only if the region above its graph (in green) is a convex set.
  28. Convex Hulls Convex hull: elastic band analogy For planar objects, i.e., lying in the plane, the convex hull may be easily visualized by imagining an elastic band stretched open to encompass the given object; when released, it will assume the shape of the required convex hull.
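
The sketch below, assuming SciPy's ConvexHull and synthetic two-class data, illustrates the construction described on the previous slide; note that it only searches hull vertices, whereas the true closest points c and d may lie on hull edges, so it is an approximation for illustration.

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(1)
class_a = rng.normal(loc=[0, 0], scale=0.5, size=(20, 2))
class_b = rng.normal(loc=[4, 4], scale=0.5, size=(20, 2))

# Convex hull of each class's training data
hull_a = class_a[ConvexHull(class_a).vertices]
hull_b = class_b[ConvexHull(class_b).vertices]

# Brute-force the closest pair of hull vertices (approximating the points c and d)
pairs = [(np.linalg.norm(p - q), p, q) for p in hull_a for q in hull_b]
_, c, d = min(pairs, key=lambda t: t[0])

# The separating plane bisects the segment joining c and d, with normal w = d - c
w = d - c
midpoint = (c + d) / 2
print("w =", w, "plane passes through", midpoint)
```
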
  29. Disadvantages of Linear Decision Surfaces (figure: two classes in the Var1–Var2 plane that a linear boundary cannot separate well)
  30. Advantages of Non-Linear Surfaces (figure: classes in the Var1–Var2 plane separated by a non-linear boundary)
  31. Linear Classifiers in High-Dimensional Spaces Find a function Φ(x) to map the original variables (Var1, Var2) into a different space of constructed features (Constructed Feature 1, Constructed Feature 2) where a linear classifier can separate the data.
  32. Handwriting Recognition Task T recognizing and classifying handwritten words within images Performance measure P percent of words correctly classified Training experience E a database of handwritten words with given classifications
  33. Handwriting Recognition
  34. Pattern Recognition Example Handwriting Digit Recognition
  35. Pattern Recognition Example Handwriting Digit Recognition Each digit is represented by a 28x28 pixel image, which can be represented by a vector of 784 real numbers. Objective: to have an algorithm that will take such a vector as input and identify the digit it represents. This is a non-trivial problem due to variability in handwriting. Take images of a large number of digits (N) as a training set and use the training set to tune the parameters of an adaptive model. Each digit in the training set has been identified by a target vector t, which represents the identity of the corresponding digit. The result of running a machine learning algorithm can be expressed as a function y(x) which takes a new digit x as input and outputs a vector y, encoded in the same way as t. The form of y(x) is determined through the learning (training) phase.
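
A hedged sketch of this pipeline, with random arrays standing in for real digit images and scikit-learn's logistic regression standing in for the unspecified adaptive model (both are assumptions for illustration only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assume a training set of N flattened 28x28 images (784 real numbers each)
# and target digit labels 0-9; random data stands in for real images here.
rng = np.random.default_rng(0)
N = 1000
X_train = rng.random((N, 784))          # each row is one digit image as a vector
t_train = rng.integers(0, 10, size=N)   # identity of the corresponding digit

# Training phase: tune the parameters of an adaptive model y(x)
model = LogisticRegression(max_iter=200)
model.fit(X_train, t_train)

# The learned function y(x) takes a new 784-dimensional digit vector as input
# and outputs a predicted digit identity.
x_new = rng.random((1, 784))
print("predicted digit:", model.predict(x_new)[0])
```
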
  36. Pattern Recognition Example Generalization The ability to categorize correctly new examples that differ from those in training Generalization is a central goal in pattern recognition Preprocessing Input variables are preprocessed to transform them into some new space of variables where it is hoped that the problem will be easier to solve (see fig.) Images of digits are translated and scaled so that each digit is contained within a box of fixed size. This reduces variability. Preprocessing stage is referred to as feature extraction New test data must be preprocessed using the same steps as training data
  37. Pattern Recognition Example Preprocessing can also speed up computations. For example: face detection in a high-resolution video stream. Find useful features that are fast to compute and yet also preserve useful discriminatory information enabling faces to be distinguished from non-faces. The average value of the image intensity in a rectangular sub-region can be evaluated extremely efficiently, and a set of such features is very effective for fast face detection. Because such features are fewer in number than the pixels, this is referred to as a form of dimensionality reduction. Care must be taken so that important information is not discarded during preprocessing.
  38. Pattern Recognition Example Supervised & unsupervised learning If the training data consist of both input vectors and target vectors – supervised learning. The digit recognition problem is classification; predicting crop yield is regression. If the training data consist of only input vectors – unsupervised learning. Discover groups of similar examples within the data – clustering. Find the distribution of the data within the input space – density estimation. Project the data from a high-dimensional space to a 2- or 3-dimensional space for the purpose of visualization.
  39. Reinforcement Learning The problem of finding suitable actions to take in a given situation in order to maximize a reward
  40. Polynomial Curve Fitting • Observe a real-valued input variable x • Use x to predict the value of a target variable t • Synthetic data generated from sin(2πx) • Random noise added to the target values (figure: target variable t plotted against input variable x)
  41. Polynomial Curve Fitting N observations of x: x = (x1, .., xN)T with targets t = (t1, .., tN)T • The goal is to exploit the training set to predict the value of the target t for a new value of x • Inherently a difficult problem Data generation: N = 10 points spaced uniformly in the range [0,1], generated from sin(2πx) by adding small Gaussian noise; noise is typically due to unobserved variables
  42. Polynomial Curve Fitting • Fit the data with a polynomial y(x, w) = w0 + w1·x + w2·x^2 + … + wM·x^M, where M is the order of the polynomial • Is a higher value of M better? We'll see shortly! • The coefficients w0, …, wM are denoted by the vector w • The model is a nonlinear function of x but a linear function of the coefficients w • Such models are called linear models
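
A minimal sketch of this setup, assuming NumPy's polyfit as the fitting routine (the slides do not name one):

```python
import numpy as np

# Data generation as described: N = 10 points, x spaced uniformly in [0, 1],
# targets from sin(2*pi*x) plus small Gaussian noise
rng = np.random.default_rng(0)
N = 10
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=N)

# Fit a polynomial of order M: y(x, w) = w0 + w1*x + ... + wM*x^M
M = 3
w = np.polyfit(x, t, deg=M)   # least-squares coefficients (highest power first)
y = np.polyval(w, x)          # fitted values on the training inputs
print("coefficients w:", w)
```
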
  43. Sum-of-Squares Error Function E(w) = (1/2) Σn=1..N {y(xn, w) − tn}², the sum of squared differences between the predictions y(xn, w) and the targets tn; the fitted coefficients w* minimize E(w).
  44. Polynomial curve fitting
  45. Polynomial curve fitting Choice of M? This is called model selection or model comparison.
  46. 0th Order Polynomial Poor representations of sin(2πx)
  47. 1st Order Polynomial Poor representations of sin(2πx)
  48. 3rd Order Polynomial Best Fit to sin(2πx)
  49. 9th Order Polynomial Over Fit: Poor representation of sin(2πx)
  50. Polynomial Curve Fitting Good generalization is the objective. How does generalization performance depend on M? Consider a separate test set of 100 points, calculate E(w*) for both the training data and the test data, and choose the M which minimizes E(w*). Root-mean-square error: ERMS = sqrt(2E(w*)/N). It is sometimes convenient to use ERMS because division by N allows us to compare data sets of different sizes on an equal footing, and the square root ensures ERMS is measured on the same scale (and in the same units) as the target variable t.
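
A hedged sketch of this model-selection procedure, assuming the same synthetic sin(2πx) data and NumPy's polyfit; it fits several orders M and reports ERMS on the 10-point training set and a 100-point test set:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = np.linspace(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=n)

def e_rms(x, t, w):
    # E_RMS = sqrt(2 E(w*) / N), with E(w) the sum-of-squares error
    residuals = np.polyval(w, x) - t
    return np.sqrt(np.mean(residuals ** 2))

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)   # the 100-point test set mentioned in the slide

for M in (0, 1, 3, 9):
    w = np.polyfit(x_train, t_train, deg=M)
    print(f"M={M}: train E_RMS={e_rms(x_train, t_train, w):.3f}, "
          f"test E_RMS={e_rms(x_test, t_test, w):.3f}")
```
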
  51. Over-fitting For small M (0, 1, 2) the polynomial is too inflexible to handle the oscillations of sin(2πx). For M = 3–8 it is flexible enough to handle the oscillations of sin(2πx). For M = 9 it is too flexible: the training error is 0 but the generalization (test) error is high. Why is this happening?
  52. Polynomial Coefficients
  53. Data Set Size: M=9 The larger the data set, the more complex the model we can afford to fit to the data. A rough heuristic is that the number of data points should be no less than 5-10 times the number of adaptive parameters in the model.
  54. Over-fitting Problem Should we limit the number of parameters according to the available training set? The complexity of the model should depend only on the complexity of the problem! Least-squares estimation represents a specific case of maximum likelihood, and over-fitting is a general property of maximum likelihood. The over-fitting problem can be avoided using the Bayesian approach!
  55. Regularization Penalize large coefficient values by adding a penalty term to the error function: Ẽ(w) = (1/2) Σn {y(xn, w) − tn}² + (λ/2) ||w||², where λ controls the relative importance of the regularization term.
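
A hedged sketch of regularized least squares for the order-9 polynomial, using the closed-form solution w = (λI + ΦᵀΦ)⁻¹Φᵀt; the value ln λ = −18 is only an illustrative choice, not prescribed by the slides:

```python
import numpy as np

def fit_ridge_poly(x, t, M, lam):
    """Regularized least squares for a polynomial of order M:
    minimizes (1/2) sum (y(x_n, w) - t_n)^2 + (lam/2) ||w||^2."""
    # Design matrix with columns 1, x, x^2, ..., x^M
    Phi = np.vander(x, M + 1, increasing=True)
    # Closed-form solution: w = (lam*I + Phi^T Phi)^(-1) Phi^T t
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=10)

w_unreg = fit_ridge_poly(x, t, M=9, lam=0.0)        # over-fitted, large coefficients
w_reg = fit_ridge_poly(x, t, M=9, lam=np.exp(-18))  # penalized, smaller coefficients
print("max |w| without regularization:", np.abs(w_unreg).max())
print("max |w| with regularization:   ", np.abs(w_reg).max())
```
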
  56. Regularization:
  57. Regularization:
  58. Regularization: comparison of results for different values of the regularization parameter (figure)
  59. Polynomial Coefficients
  60. Linear Models for Regression The role of regression is to predict the value of one or more continuous target variables t given the value of a D-dimensional vector x of input variables. We have already discussed polynomial curve fitting for regression; a polynomial is a specific example of a broad class of functions called linear regression models: functions which are linear in the adjustable parameters. The simplest form of linear regression model is also a linear function of the input variables. A more useful class of functions can be obtained by taking a linear combination of a fixed set of nonlinear functions of the input variables, known as basis functions; the model remains a linear function of the parameters while being non-linear with respect to the input variables.
  61. Linear Models for Regression Linear models have significant limitations as practical techniques for ML, particularly for problems involving high dimensionality Linear models possess nice analytical properties and form the foundation for more sophisticated models
  62. Linear Basis Function Models The simplest linear model for regression with d input variables is y(x, w) = w0 + w1·x1 + … + wd·xd, where x1, …, xd are the input variables. Compare this with linear regression in one variable and with polynomial regression in one variable: it is linear in both the parameters and the input variables. This is a significant limitation, since it is a linear function of the input variables; in the 1-D case it is a straight-line fit.
  63. Linear Basis Function Models The general form is a linear combination of fixed basis functions: y(x, w) = Σj wj·φj(x), with φ0(x) = 1 so that w0 acts as a bias; the model is linear in the parameters w but non-linear in x through the basis functions φj.
  64. Linear Basis Function Models Polynomial regression is a particular example of this model. How? With a single input variable x, choose the polynomial basis φj(x) = x^j. Limitation of the polynomial basis: it is global, so changes in one region of input space affect all others. One remedy is to divide the input space into regions and use a different polynomial in each region, which is equivalent to spline functions.
  65. Linear Basis Function Models Polynomial basis functions φj(x) = x^j are global; a small change in x affects all basis functions.
  66. Linear Basis Function Models Gaussian basis functions: φj(x) = exp(−(x − μj)² / (2s²)). These are local; a small change in x only affects nearby basis functions. μj and s control location and scale (width).
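
A minimal sketch of fitting a linear model with Gaussian basis functions, assuming evenly spaced centres μj and a fixed width s (both illustrative choices):

```python
import numpy as np

def gaussian_design_matrix(x, mus, s):
    """Design matrix whose columns are Gaussian basis functions
    phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a constant bias column."""
    phi = np.exp(-((x[:, None] - mus[None, :]) ** 2) / (2 * s ** 2))
    return np.hstack([np.ones((x.size, 1)), phi])

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=25)

mus = np.linspace(0, 1, 9)   # basis function locations (assumed spacing)
s = 0.1                      # scale (width) of each basis function
Phi = gaussian_design_matrix(x, mus, s)

# Least-squares estimate of the weights; the model is linear in the parameters
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print("fitted weights:", np.round(w, 2))
```
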
  67. Linear Basis Function Models Sigmoidal basis functions: φj(x) = σ((x − μj)/s), where σ(a) = 1/(1 + e^(−a)). These are also local; a small change in x only affects nearby basis functions. μj and s control location and scale (slope).
  68. Home Work Read about Gaussian, Sigmoidal, & Fourier basis functions Sequential Learning & Online algorithms Will discuss in the next class!
  69. The Bias-Variance Decomposition The bias-variance decomposition is a formal method for analyzing the prediction error of a predictive model. In the projectile analogy: bias = the average distance between the target and the location where the projectile hits the ground (depends on the angle); variance = the deviation between x and the average position where the projectile hits the floor (depends on the force); noise: if the target is not stationary, then the observed distance is also affected by changes in the location of the target.
  70. The Bias-Variance Decomposition Low degree polynomial has high bias (fits poorly) but has low variance with different data sets High degree polynomial has low bias (fits well) but has high variance with different data sets Interactive demo @: http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_bias_variance.htm
  71. The Bias-Variance Decomposition True height of Chinese emperor: 200cm, about 6’6”. Poll a random American: ask “How tall is the emperor?” We want to determine how wrong they are, on average
  72. The Bias-Variance Decomposition Each scenario has an expected value of 180 (a bias error of 20), but increasing variance in the estimate. Squared error = (bias error)² + variance, so as the variance increases, the error increases.
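
A small numerical illustration of this identity, assuming three hypothetical polling scenarios with the stated expected answer of 180 cm:

```python
import numpy as np

true_height = 200.0  # the emperor's true height in cm

# Three hypothetical polling scenarios, each with expected answer 180 cm
# (bias error = 20) but increasing spread in the individual answers.
rng = np.random.default_rng(0)
for spread in (1.0, 10.0, 30.0):
    answers = rng.normal(loc=180.0, scale=spread, size=100_000)
    bias = answers.mean() - true_height
    variance = answers.var()
    mse = np.mean((answers - true_height) ** 2)
    print(f"spread={spread:5.1f}: bias^2 + var = {bias**2 + variance:8.1f}, "
          f"squared error = {mse:8.1f}")
```
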
  73. Effect of the regularization parameter on the bias and variance terms (figure: small λ gives low bias and high variance; large λ gives high bias and low variance)
  74. An example of the bias-variance trade-off
  75. Beating the bias-variance trade-off We can reduce the variance term by averaging lots of models trained on different datasets. This seems silly: if we had lots of different datasets, it would be better to combine them into one big training set, and with more training data there would be much less variance. Weird idea: we can create different datasets by bootstrap sampling of our single training dataset. This is called "bagging" and it works surprisingly well. But if we have enough computation, it's better to do the right Bayesian thing: combine the predictions of many models using the posterior probability of each parameter vector as the combination weight.
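
A hedged sketch of bagging for the polynomial example, assuming bootstrap resampling of a single 25-point training set and an order-9 polynomial as the high-variance base model (both are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=25)

# Bagging: fit the same high-variance model (order-9 polynomial) to many
# bootstrap samples of the single training set, then average the predictions.
n_models, M = 50, 9
x_grid = np.linspace(0, 1, 200)
preds = np.zeros((n_models, x_grid.size))
for m in range(n_models):
    idx = rng.integers(0, x.size, size=x.size)   # bootstrap sample (with replacement)
    w = np.polyfit(x[idx], t[idx], deg=M)
    preds[m] = np.polyval(w, x_grid)

bagged = preds.mean(axis=0)   # averaging reduces the variance term
print("bagged prediction at x = 0.5:", bagged[x_grid.size // 2])
```
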