Machine Learning


  1. Machine Learning UNIT 1

  2. 3. CURSE OF DIMENSIONALITY • Complexity increases with the dimension d. • The complexity of many methods grows as O(nd^2), so as d increases, computation becomes costly. • Thus, for accurate estimation, n should be much bigger than d^2; otherwise the model is likely to overfit.
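To make the slide's arithmetic concrete, here is a minimal Python sketch (the values of n and d are made up for illustration) of how an O(nd^2) cost grows with d, and how quickly n stops being much bigger than d^2:

# Hypothetical cost model for an O(n d^2) method; n and d are illustrative.
def ops(n, d):
    return n * d ** 2

n = 10_000
for d in (5, 10, 50, 100):
    print(f"d={d:3d}  ops={ops(n, d):>12,}  n/d^2={n / d ** 2:8.1f}")
# As d grows, the operation count explodes and n/d^2 shrinks toward 1,
# i.e. n is no longer much bigger than d^2 and overfitting becomes likely.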

  3. 4. OVERFITTING • Example: in k-NN • If k = 1, there are no training errors (the model fits the training data exactly, memorizing its noise). • If k = 5, the prediction is smoother because it averages over a larger neighborhood. • As k increases the prediction becomes smoother; as k approaches N it becomes too smooth (underfitting), so an intermediate k is best.
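A small scikit-learn sketch of this behavior (the dataset and parameter values are synthetic, chosen only for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data; 70% train / 30% test split.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 5, 25, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:3d}  train={knn.score(X_tr, y_tr):.2f}  test={knn.score(X_te, y_te):.2f}")
# k=1 scores perfectly on the training set (it memorizes it) but worse on the
# test set; moderate k smooths the prediction; very large k underfits.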

  4. Selection of a model to avoid overfitting • Compare three candidates: a linear model, a quadratic model, and join-the-dots.

  5. A Regression Problem • y = f(x) + noise. Can we learn f from this data? Let's consider three methods… (scatter plot of y against x)
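The slide's dataset is not given, so the sketches that follow use a hypothetical stand-in: a synthetic sample from y = f(x) + noise, with an assumed f and noise level.

import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, 30))   # 30 input points
f = np.sin                                 # assumed "true" function f
y = f(x) + rng.normal(0.0, 0.3, x.shape)   # y = f(x) + noise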

  6. Linear Regression (plot: a straight-line fit to the y-x scatter)

  7. Quadratic Regression (plot: a quadratic fit to the y-x scatter)

  8. Join-the-dots (plot: a piecewise-linear fit through every point) • Also known as piecewise linear nonparametric regression, if that makes you feel better.
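A sketch fitting all three methods to the synthetic data above. Join-the-dots is plain piecewise-linear interpolation, so by construction it passes through every training point:

import numpy as np

w1 = np.polyfit(x, y, 1)             # linear regression: degree-1 polynomial
w2 = np.polyfit(x, y, 2)             # quadratic regression: degree-2 polynomial

xs = np.linspace(x.min(), x.max(), 200)
linear_fit = np.polyval(w1, xs)
quadratic_fit = np.polyval(w2, xs)
join_the_dots = np.interp(xs, x, y)  # interpolates through every (x, y) pair
# Join-the-dots has zero training error by construction, which is exactly why
# training-set fit alone cannot tell us which method is best.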

  9. Which is best? • Why not choose the method with the best fit to the data?

  10. What do we really want? • Why not choose the method with the best fit to the data? • “How well are you going to predict future data drawn from the same distribution?”

  11. The test set method 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set. (scatter plot of y against x)

  12. The test set method 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set. 3. Perform your regression on the training set. (Linear regression example)

  13. The test set method 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set. 3. Perform your regression on the training set. 4. Estimate your future performance with the test set. (Linear regression example) Mean Squared Error = 2.4

  14. The test set method 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set. 3. Perform your regression on the training set. 4. Estimate your future performance with the test set. (Quadratic regression example) Mean Squared Error = 0.9

  15. The test set method 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set. 3. Perform your regression on the training set. 4. Estimate your future performance with the test set. (Join-the-dots example) Mean Squared Error = 2.2
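The MSE values on these slides (2.4, 0.9, 2.2) come from the slides' own dataset. On the synthetic data above, the same four steps look like this (exact numbers will differ):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Steps 1-2: 30% test set, 70% training set.
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=0)

# Steps 3-4 for the linear and quadratic models.
for degree, name in ((1, "linear"), (2, "quadratic")):
    w = np.polyfit(x_tr, y_tr, degree)
    print(name, mean_squared_error(y_te, np.polyval(w, x_te)))

# Join-the-dots: interpolate the training points (np.interp needs sorted x).
order = np.argsort(x_tr)
jtd = np.interp(x_te, x_tr[order], y_tr[order])
print("join-the-dots", mean_squared_error(y_te, jtd))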

  16. Advantages: • Very very simple. • Can then simply choose the method with the best test-set score. Disadvantages: • Wastes data: we get an estimate of the best method to apply to 30% less data. • If we don't have much data, our test set might just be lucky or unlucky.

  17. LOOCV (Leave-one-out Cross Validation) • For k = 1 to R: • 1. Let (x_k, y_k) be the kth record. (scatter plot of y against x)

  18. LOOCV (Leave-one-out Cross Validation) • For k = 1 to R: • 1. Let (x_k, y_k) be the kth record. • 2. Temporarily remove (x_k, y_k) from the dataset.

  19. LOOCV (Leave-one-out Cross Validation) • For k = 1 to R: • 1. Let (x_k, y_k) be the kth record. • 2. Temporarily remove (x_k, y_k) from the dataset. • 3. Train on the remaining R−1 datapoints.

  20. LOOCV (Leave-one-out Cross Validation) • For k = 1 to R: • 1. Let (x_k, y_k) be the kth record. • 2. Temporarily remove (x_k, y_k) from the dataset. • 3. Train on the remaining R−1 datapoints. • 4. Note your error on (x_k, y_k).

  21. LOOCV (Leave-one-out Cross Validation) • For k = 1 to R: • 1. Let (x_k, y_k) be the kth record. • 2. Temporarily remove (x_k, y_k) from the dataset. • 3. Train on the remaining R−1 datapoints. • 4. Note your error on (x_k, y_k). • When you've done all points, report the mean error.

  22. LOOCV (Leave-one-out Cross Validation) • For k = 1 to R: • 1. Let (x_k, y_k) be the kth record. • 2. Temporarily remove (x_k, y_k) from the dataset. • 3. Train on the remaining R−1 datapoints. • 4. Note your error on (x_k, y_k). • When you've done all points, report the mean error. (nine leave-one-out linear fits shown) MSE_LOOCV = 2.12

  23. LOOCV for Quadratic Regression • For k = 1 to R: • 1. Let (x_k, y_k) be the kth record. • 2. Temporarily remove (x_k, y_k) from the dataset. • 3. Train on the remaining R−1 datapoints. • 4. Note your error on (x_k, y_k). • When you've done all points, report the mean error. (nine leave-one-out quadratic fits shown) MSE_LOOCV = 0.962

  24. LOOCV for Join The Dots • For k = 1 to R: • 1. Let (x_k, y_k) be the kth record. • 2. Temporarily remove (x_k, y_k) from the dataset. • 3. Train on the remaining R−1 datapoints. • 4. Note your error on (x_k, y_k). • When you've done all points, report the mean error. (nine leave-one-out join-the-dots fits shown) MSE_LOOCV = 3.33
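The LOOCV pseudocode above transcribes almost line for line into Python (again reusing the synthetic data; the slides' MSE values come from their own dataset):

import numpy as np

def loocv_mse(x, y, degree):
    errors = []
    for k in range(len(x)):                               # for k = 1 to R
        x_tr, y_tr = np.delete(x, k), np.delete(y, k)     # remove record k
        w = np.polyfit(x_tr, y_tr, degree)                # train on R-1 points
        errors.append((np.polyval(w, x[k]) - y[k]) ** 2)  # error on (x_k, y_k)
    return np.mean(errors)                                # mean over all points

print("linear   ", loocv_mse(x, y, 1))
print("quadratic", loocv_mse(x, y, 2))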

  25. Which kind of Cross Validation? …can we get the best of both worlds?

  26. k-fold Cross Validation • Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). (scatter plot of y against x)

  27. k-fold Cross Validation • Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). • For the red partition: train on all the points not in the red partition; find the test-set sum of errors on the red points.

  28. k-fold Cross Validation • Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). • For the red partition: train on all the points not in the red partition; find the test-set sum of errors on the red points. • For the green partition: train on all the points not in the green partition; find the test-set sum of errors on the green points.

  29. k-fold Cross Validation • Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). • For the red partition: train on all the points not in the red partition; find the test-set sum of errors on the red points. • For the green partition: train on all the points not in the green partition; find the test-set sum of errors on the green points. • For the blue partition: train on all the points not in the blue partition; find the test-set sum of errors on the blue points.

  30. k-fold Cross Validation • Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). • For the red partition: train on all the points not in the red partition; find the test-set sum of errors on the red points. • For the green partition: train on all the points not in the green partition; find the test-set sum of errors on the green points. • For the blue partition: train on all the points not in the blue partition; find the test-set sum of errors on the blue points. • Then report the mean error. • Linear Regression: MSE_3FOLD = 2.05

  31. k-fold Cross Validation • Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). • For the red partition: train on all the points not in the red partition; find the test-set sum of errors on the red points. • For the green partition: train on all the points not in the green partition; find the test-set sum of errors on the green points. • For the blue partition: train on all the points not in the blue partition; find the test-set sum of errors on the blue points. • Then report the mean error. • Quadratic Regression: MSE_3FOLD = 1.11

  32. k-fold Cross Validation • Randomly break the dataset into k partitions (in our example we'll have k = 3 partitions, colored red, green, and blue). • For the red partition: train on all the points not in the red partition; find the test-set sum of errors on the red points. • For the green partition: train on all the points not in the green partition; find the test-set sum of errors on the green points. • For the blue partition: train on all the points not in the blue partition; find the test-set sum of errors on the blue points. • Then report the mean error. • Join-the-dots: MSE_3FOLD = 2.93
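The same procedure with scikit-learn's KFold, using k = 3 as on the slides (the MSE values again depend on the dataset):

import numpy as np
from sklearn.model_selection import KFold

def kfold_mse(x, y, degree, k=3):
    sq_errors = []
    folds = KFold(n_splits=k, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(x.reshape(-1, 1)):
        w = np.polyfit(x[train_idx], y[train_idx], degree)  # train off-partition
        preds = np.polyval(w, x[test_idx])                  # test on the partition
        sq_errors.extend((preds - y[test_idx]) ** 2)
    return np.mean(sq_errors)                               # mean over all points

print("linear   ", kfold_mse(x, y, 1))
print("quadratic", kfold_mse(x, y, 2))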

  33. Which kind of Cross Validation?

  34. 4. LINEAR REGRESSION • The response is a linear function of the input: y(x) = w^T x + ε. • w^T x is the scalar product between the input vector x and the model's weight vector w. • ε is the residual error between our linear prediction and the true response. • Residual error = observed value − estimated value.
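A minimal NumPy sketch of this model; the true weights, noise level, and data sizes are hypothetical:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # n = 100 inputs, d = 3 features
w_true = np.array([2.0, -1.0, 0.5])            # hypothetical true weight vector
y = X @ w_true + rng.normal(0.0, 0.1, 100)     # y(x) = w^T x + eps

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares estimate of w
residuals = y - X @ w_hat                      # observed minus estimated values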

  35. 5. BIAS-VARIANCE TRADEOFF • Any learning algorithm has error that comes from two sources: • Bias • Variance • error(x) = noise(x) + bias(x) + variance(x) (under squared loss, the bias term is the squared bias)

  36. Bias-variance • Bias is the algorithm's tendency to consistently learn the wrong thing by not taking into account all the information in the data (underfitting). • Variance is the algorithm's tendency to learn random things irrespective of the real signal by fitting highly flexible models that follow the error/noise in the data too closely (overfitting).

  37. Bias is the algorithm's tendency to consistently learn the wrong thing by not taking into account all the information in the data (underfitting). • For an estimator Y of a parameter θ: Bias(Y) = E[Y] − θ and Variance(Y) = E[(Y − E[Y])^2]. • Variance is the algorithm's tendency to learn random things irrespective of the real signal by fitting highly flexible models that follow the error/noise in the data too closely (overfitting).
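These definitions can be checked by simulation. A sketch using a classic example, the maximum-likelihood variance estimator (which divides by n and so has E[Y] = (n−1)/n · θ, i.e. bias −θ/n):

import numpy as np

rng = np.random.default_rng(0)
theta = 4.0                       # true variance of the underlying Gaussian
n, trials = 10, 100_000

samples = rng.normal(0.0, np.sqrt(theta), size=(trials, n))
Y = samples.var(axis=1)           # MLE variance estimator (divides by n)

bias = Y.mean() - theta                   # Bias(Y) = E[Y] - theta, approx -theta/n
variance = np.mean((Y - Y.mean()) ** 2)   # Var(Y) = E[(Y - E[Y])^2]
print(f"bias = {bias:.3f} (theory {-theta / n:.3f}), variance = {variance:.3f}")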

  38. Bias-variance • 'Bias' is also used to denote by how much the average accuracy of the algorithm changes as the input/training data changes. • 'Variance' is used to denote how sensitive the algorithm is to the chosen input data.

  39. 6. LEARNING CURVE • A graph that compares the performance of a model on training and testing data over a varying number of training instances. • Performance generally improves as the number of training points increases. • When we separate the training and testing sets and graph them individually, we get an idea of how well the model can generalize to new data. • The learning curve lets us verify when a model has learned as much as it can about the data.

  40. LEARNING CURVE • The model has learned as much as it can when: • the performance on the training and testing sets reaches a plateau, and • there is a consistent gap between the two error rates. • The key is to find the sweet spot that minimizes bias and variance by finding the right level of model complexity. • Of course, with more data any model can improve, and different models may be optimal.
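scikit-learn's learning_curve utility produces exactly this kind of graph. A sketch on synthetic regression data (all parameter choices are illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="neg_mean_squared_error")

for n, tr, te in zip(sizes, -train_scores.mean(axis=1), -test_scores.mean(axis=1)):
    print(f"n={n:3d}  train MSE={tr:8.1f}  test MSE={te:8.1f}")
# Training and test error converging to a plateau with a small gap between them
# indicates the model has learned as much as it can from this data.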

  41. Types of learning curves • Bad learning curve: high bias • The training and testing errors converge and are both high. • No matter how much data we feed the model, it cannot represent the underlying relationship and has high systematic error. • Poor fit • Poor generalization

  42. Bad learning curve: high variance • There is a large gap between the training and testing errors. • Requires more data to improve. • Can simplify the model with fewer or less complex features. • Ideal learning curve • The model generalizes to new data. • The testing and training learning curves converge at similar values. • The smaller the gap, the better our model generalizes.

  43. 8. Generalization Error and Noise • Generalization error is the expected value of the misclassification rate, averaged over future data. • It can be approximated by computing the misclassification rate on a large independent test set that was not used during model training.

  44. 7. CLASSIFICATION • Classify a document into a predefined category; documents can be text, images, etc. • A popular classifier is the Naive Bayes classifier. • Steps: • Step 1: Train the program (build a model) using a training set whose documents are labeled with categories, e.g. sports, cricket, news. • The classifier computes, for each word, the probability that the word makes a document belong to each of the considered categories. • Step 2: Test with a test data set against this model. • http://en.wikipedia.org/wiki/Naive_Bayes_classifier
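A compact sketch of these two steps with scikit-learn's MultinomialNB; the toy documents and categories are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training set: (document, category) pairs.
train_docs = ["the batsman scored a century", "team wins the cricket match",
              "election results announced today", "government passes new law"]
train_labels = ["sports", "sports", "news", "news"]

# Step 1: train the program (build a model) on word counts per category.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# Step 2: test against this model.
print(model.predict(["cricket match result today"]))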
