
Last lecture summary


Presentation Transcript


  1. Last lecture summary

  2. Test-data and Cross Validation

  3. [Figure: testing error and training error as a function of model complexity]

  4. Test set method • Split the data set into training and test data sets. • Common ratio – 70:30. • Train the algorithm on the training set, assess its performance on the test set. • Disadvantages: this is simple, but it wastes data, and the test-set estimate of performance has high variance. Adapted from the Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html
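The test-set method above can be sketched in plain Python; the 70:30 ratio and random shuffling are from the slide, while the function name and the fixed seed are illustrative choices:

```python
import random

def train_test_split(data, test_fraction=0.3, seed=0):
    """Shuffle the data and split it into train and test portions."""
    rng = random.Random(seed)
    shuffled = data[:]            # copy so the caller's order is preserved
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # train, test

data = list(range(10))
train, test = train_test_split(data)
print(len(train), len(test))  # 7 3
```

Because the split is random, different seeds give different test sets, which is exactly the high-variance disadvantage the slide mentions.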

  5. Stratified division • keep the same proportion of each class in the training and test sets

  6. Training error cannot be used as an indicator of the model’s performance due to overfitting. • Training data set – train a range of models, or a given model with a range of values for its parameters. • Compare them on independent data – the validation set. • If the model design is iterated many times, some overfitting to the validation data can occur, so it may be necessary to keep aside a third set – the test set – on which the performance of the selected model is finally evaluated.

  7. LOOCV (leave-one-out cross validation) • choose one data point • remove it from the set • fit the model on the remaining data points • note your error using the removed data point as the test. Repeat these steps for all points. When you are done, report the mean squared error (in case of regression).

  8. k-fold cross-validation • randomly break the data into k partitions • remove one partition from the set • fit the model on the remaining data points • note your error using the removed partition as the test data set. Repeat these steps for all partitions. When you are done, report the mean squared error (in case of regression).
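The same pattern generalizes from LOOCV to k folds; again the mean-predictor model is only a placeholder so the partitioning logic is the focus:

```python
import random

def kfold_mse(points, k, fit, predict, seed=0):
    """k-fold cross-validation with a user-supplied fit/predict pair."""
    rng = random.Random(seed)
    pts = points[:]
    rng.shuffle(pts)                          # randomly break the data ...
    folds = [pts[i::k] for i in range(k)]     # ... into k partitions
    errors = []
    for i in range(k):
        held_out = folds[i]                   # remove one partition
        train = [p for j in range(k) if j != i for p in folds[j]]
        model = fit(train)                    # fit the remaining points
        errors += [(predict(model, x) - y) ** 2 for x, y in held_out]
    return sum(errors) / len(errors)          # mean squared error

# toy model: predict the mean target of the training points
fit = lambda pts: sum(y for _, y in pts) / len(pts)
predict = lambda model, x: model
points = [(i, float(i)) for i in range(6)]
mse = kfold_mse(points, k=3, fit=fit, predict=predict)
print(mse)
```

With k equal to the number of points this reduces to LOOCV.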

  9. Selection and testing • Complete procedure for algorithm selection and estimation of its quality • Divide the data into Train and Test sets • By cross-validation on the Train set, choose the algorithm • Use this algorithm to construct a classifier using the Train set • Estimate its quality on the Test set. [Figure: data split into Train / Val / Test portions]

  10. Model selection via CV – polynomial regression. Adapted from the Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html

  11. Nearest Neighbors Classification

  12. instances

  13. Similarity sij is a quantity that reflects the strength of the relationship between two objects or two features. • Distance dij measures dissimilarity. • Dissimilarity measures the discrepancy between two objects based on several features. • Distance satisfies the following conditions: • distance is always positive or zero (dij ≥ 0) • distance is zero if and only if it is measured from an object to itself • distance is symmetric (dij = dji). • In addition, if distance satisfies the triangle inequality |x+y| ≤ |x|+|y|, then it is called a metric.

  14. Distances for quantitative variables • Minkowski distance (Lp norm): dij = (Σk |xik − xjk|p)1/p • distance matrix – matrix with all pairwise distances
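The Minkowski distance and the distance matrix can be sketched as follows; p = 1 gives the Manhattan distance and p = 2 the Euclidean distance of the next two slides (function names are illustrative):

```python
def minkowski(a, b, p=2):
    """Minkowski distance (Lp norm) between two points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def distance_matrix(points, p=2):
    """Matrix with all pairwise distances."""
    return [[minkowski(a, b, p) for b in points] for a in points]

pts = [(0, 0), (3, 4)]
print(minkowski(*pts, p=1))  # 7.0  (Manhattan)
print(minkowski(*pts, p=2))  # 5.0  (Euclidean)
```

The matrix is symmetric with zeros on the diagonal, mirroring the distance conditions on slide 13.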

  15. Manhattan distance [Figure: Manhattan distance between points (x1, y1) and (x2, y2)]

  16. Euclidean distance [Figure: Euclidean distance between points (x1, y1) and (x2, y2)]

  17. k-NN • supervised learning • target function f may be • discrete-valued (classification) • real-valued (regression) • We assign the given point to the class of the instances most similar to it.
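A minimal k-NN classifier, assuming Euclidean distance and majority voting (both standard defaults; the data set here is made up for illustration):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Majority vote among the k training examples nearest to the query."""
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

examples = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b")]
print(knn_classify(examples, (1, 1)))  # a
```

Note that no model is built here before the query arrives, which is exactly the "lazy learning" point of the next slide.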

  18. k-NN is a lazy learner • lazy learning – generalization beyond the training data is delayed until a query is made to the system • opposed to eager learning, where the system tries to generalize the training data before receiving queries

  19. Which k is best? Use cross-validation. A value that is too small (k = 1) fits noise and outliers – overfitting; a value that is too large smooths out distinctive behavior. [Figure: decision boundaries for k = 1 and k = 15, Hastie et al., Elements of Statistical Learning]

  20. Real-valued target function • The algorithm calculates the mean value of the k nearest training examples. • Example: k = 3, neighbor values 12, 14 and 10 → predicted value = (12 + 14 + 10)/3 = 12.

  21. Distance-weighted NN • Give greater weight to closer neighbors, e.g. weight 1/d². • Unweighted, k = 4: 2 votes vs. 2 votes – a tie. • Weighted: 1/1² + 1/2² = 1.25 votes vs. 1/4² + 1/5² = 0.1025 votes. [Figure: query point with four neighbors at distances 1, 2, 4 and 5]
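The weighted vote from the slide's k = 4 example can be checked directly (the helper name is illustrative):

```python
def weighted_votes(distances):
    """Total vote of a class whose neighbors lie at the given distances,
    each neighbor weighted by 1 / d**2."""
    return sum(1 / d ** 2 for d in distances)

# slide's example: two neighbors per class, at distances 1, 2 and 4, 5
print(weighted_votes([1, 2]))   # 1.25
print(weighted_votes([4, 5]))   # ≈ 0.1025
```

The unweighted 2–2 tie is thus broken decisively in favor of the closer pair.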

  22. k-NN issues • The curse of dimensionality is a problem. • Significant computation may be required to process each new query: to find the nearest neighbors one has to evaluate the full distance matrix. • Efficient indexing of stored training examples (e.g. a kd-tree) helps.

  23. Cluster Analysis

  24. We have data, we don’t know classes. • Assign data objects into groups (called clusters) so that data objects from the same cluster are more similar to each other than to objects from different clusters.

  26. Stages of the clustering process. Figure from On Clustering Validation Techniques, M. Halkidi, Y. Batistakis, M. Vazirgiannis.

  27. How would you solve the problem? • How to find clusters? • Group together the most similar patterns.

  28. Single linkage (nearest-neighbor method) • based on A Tutorial on Clustering Algorithms, http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html

  29.–37. [Animation: single-linkage clustering of six Italian cities (Milano, Torino, Florence, Rome, Bari, Naples). At each step the two closest clusters are merged and the distance matrix is updated with the minimum pairwise distance; the updated entries shown include 877, 996, 295, 400, 754, 869, 564 and 669.]

  38. Dendrogram • Torino joins Milano • Rome joins Naples, then Bari, then Florence • finally the Torino–Milano and Rome–Naples–Bari–Florence clusters are joined.

  39. Dendrogram with merge heights (dissimilarities) • Torino–Milano at 138 • Rome–Naples at 219 • Bari joins at 255 • Florence joins at 268 • the final join of Torino–Milano with Rome–Naples–Bari–Florence at 295. [Figure: dendrogram with leaves BA, NA, RM, FL, MI, TO]

  40. [Figure: side-by-side views of the merging clusters of the six cities (Milano, Torino, Florence, Rome, Bari, Naples) and the growing dendrogram]
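The single-linkage procedure of slides 28–40 can be sketched end to end. The pairwise distances below are taken from the Italian-cities example in the Matteucci tutorial the slides cite; treat the exact figures as a reconstruction from the numbers visible on the slides:

```python
# city-to-city road distances (km), as in the cited clustering tutorial
pairs = [("BA", "FI", 662), ("BA", "MI", 877), ("BA", "NA", 255),
         ("BA", "RM", 412), ("BA", "TO", 996), ("FI", "MI", 295),
         ("FI", "NA", 468), ("FI", "RM", 268), ("FI", "TO", 400),
         ("MI", "NA", 754), ("MI", "RM", 564), ("MI", "TO", 138),
         ("NA", "RM", 219), ("NA", "TO", 869), ("RM", "TO", 669)]
d = {frozenset((a, b)): v for a, b, v in pairs}

def single_linkage(names):
    """Repeatedly merge the two clusters whose closest members are nearest,
    recording the dissimilarity (merge height) of every merge."""
    clusters = [frozenset([n]) for n in names]
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: cluster distance = minimum pairwise distance
                dij = min(d[frozenset((a, b))]
                          for a in clusters[i] for b in clusters[j])
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        h, i, j = best
        heights.append(h)
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return heights

heights = single_linkage(["BA", "FI", "MI", "NA", "RM", "TO"])
print(heights)  # [138, 219, 255, 268, 295]
```

The printed merge heights reproduce exactly the dendrogram levels on slide 39. Replacing `min` with `max` in the marked line turns this into the complete linkage of the next slides.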

  41. Complete linkage (farthest-neighbor method)

  42.–47. [Animation: complete-linkage clustering of the same six cities. At each step the distance matrix is updated with the maximum pairwise distance; the updated entries shown include 996, 400, 869 and 669.]
