
Machine Learning in Practice Lecture 26

Learn about optimization in machine learning, including tuning parameter settings and training optimal models, along with why different views of the same data can be useful.



Presentation Transcript


  1. Machine Learning in Practice Lecture 26 Carolyn Penstein Rosé, Language Technologies Institute / Human-Computer Interaction Institute

  2. Plan for the day • Announcements • Questions? • Readings for next 3 lectures on Blackboard • Mid-term Review

  3. Locally Optimal Solutions http://biology.st-andrews.ac.uk/vannesmithlab/simanneal.png

  4. What do we learn from this? • No algorithm is guaranteed to find the globally optimal solution • Some algorithms, or variations on algorithms, may do better on one data set simply because of where in the space they started • Instability can be exploited • Noise can put you in a different starting place • Different views on the same data are useful • When you tune, you need to carefully avoid overfitting to flukes in your data
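
As a rough illustration of why the starting point matters (a Python sketch, not part of the lecture), the snippet below greedily hill-climbs a simple non-convex function from several random starting points and typically lands in different local optima:

    import numpy as np

    # Illustrative only: a 1-D objective with several local maxima,
    # climbed greedily from different random starting points.
    def objective(x):
        return np.sin(3 * x) + 0.5 * np.sin(0.5 * x)

    def hill_climb(x, step=0.01, iters=2000):
        for _ in range(iters):
            # try a small move in each direction and keep whichever scores best
            x = max([x - step, x, x + step], key=objective)
        return x

    rng = np.random.default_rng(0)
    for start in rng.uniform(-10, 10, size=5):
        end = hill_climb(start)
        print(f"start {start:6.2f} -> local optimum {end:6.2f}, value {objective(end):5.2f}")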

  5. Optimization

  6. Optimizing Parameter Settings [Figure: folds 1–5 split into Train / Validation / Test sets] • This approach assumes that you want to estimate the generalization you will get from your learning and tuning approach together. • If you just want to know the best performance you can get on *this* set by tuning, you can just use standard cross-validation.

  7. Overview of Optimization • Stage 1: Estimate Tuned Performance • On each fold, test all versions of algorithm over training data to find optimal one for that fold • Train model with optimal setting over training data • Apply that model to the testing data for that fold • Do for all folds and average across folds • Stage 2: Find Optimal settings over whole set • Test each version of the algorithm using cross-validation over the whole set • Pick the one that works the best • But ignore the performance value you get! • Stage 3: Train Optimal Model over whole set

  8. Overview of Optimization • Stage 1: Estimate Tuned Performance • On each fold, test all versions of algorithm over training data to find optimal one for that fold • Train model with optimal setting over training data • Apply that model to the testing data for that fold • Do for all folds and average across folds • Stage 1 tells you how well the optimized model you will train in Stage 3 over the whole set will do on a new data set

  9. Overview of Optimization • Stage 2: Find Optimal settings over whole set • Test each version of the algorithm using cross-validation over the whole set • Pick the one that works the best • But ignore the performance value you get! • Stage 3: Train Optimal Model over whole set • The result of stage 3 is the trained, optimized model that you will use!!!
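
The three stages above can be sketched in code. The lecture does this in Weka; the snippet below is only a rough scikit-learn analogue, with a placeholder dataset and parameter grid standing in for whatever you are actually tuning:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)      # placeholder dataset
    grid = {"min_samples_leaf": [1, 2, 5, 10]}      # placeholder parameter grid

    # Stage 1: estimate tuned performance.  The inner GridSearchCV tunes on each
    # outer training fold; the outer loop averages performance on held-out folds.
    inner = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    print("Estimated tuned performance:", cross_val_score(inner, X, y, cv=outer).mean())

    # Stage 2: find the best setting over the whole set (ignore its CV score).
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=10).fit(X, y)
    print("Best setting over whole set:", search.best_params_)

    # Stage 3: the final, optimized model trained over the whole set.
    final_model = search.best_estimator_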

  10. Optimization in Weka • Divide your data into 10 train/test pairs • Tune parameters using cross-validation on the training set (this is the inner loop) • Use those optimized settings on the corresponding test set • Note that you may have a different set of parameter settings for each of the 10 train/test pairs • You can do the optimization in the Experimenter

  11. Train/Test Pairs * Use the StratifiedRemoveFolds filter
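
Outside of Weka, an analogous way to build stratified train/test pairs (here with scikit-learn's StratifiedKFold on a placeholder dataset) might look like this:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedKFold

    X, y = load_iris(return_X_y=True)               # placeholder dataset
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    # each pair is (training indices, test indices), with class proportions preserved
    pairs = list(skf.split(X, y))
    print("fold sizes:", [(len(tr), len(te)) for tr, te in pairs[:3]], "...")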

  12. Setting Up for Optimization * Prepare to save the results • Load in training sets for • all folds • We’ll use cross validation • Within training folds to • Do the optimization

  13. What are we optimizing? Let’s optimize the confidence factor. Let’s try .1, .25, .5, and .75
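
A minimal sketch of trying each candidate value with cross-validation on a training fold follows. The confidence factor tuned in the slides is a Weka parameter that scikit-learn's tree does not expose, so the pruning parameter ccp_alpha is used here purely as a stand-in, with made-up candidate values:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X_train, y_train = load_breast_cancer(return_X_y=True)   # placeholder training fold
    for value in [0.0, 0.001, 0.01, 0.1]:                     # stand-in candidate values
        tree = DecisionTreeClassifier(ccp_alpha=value, random_state=0)
        score = cross_val_score(tree, X_train, y_train, cv=5).mean()
        print(f"ccp_alpha={value}: mean CV accuracy {score:.3f}")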

  14. Add Each Algorithm to Experimenter Interface

  15. Look at the Results * Note that the optimal setting varies across folds.

  16. Apply the optimized settings on each fold * Performance on Test1 using optimized settings from Train1

  17. Using CVParameterSelection

  18. Using CVParameterSelection

  19. Using CVParameterSelection You have to know what the command-line options look like. You can find them online or in the Experimenter. Don’t forget to click Add!

  20. Using CVParameterSelection Best setting over whole set

  21. Using CVParameterSelection * Tuned performance.

  22. Non-linearity in Support Vector Machines

  23. Maximum Margin Hyperplanes [Figure: convex hulls of the two classes] • The maximum margin hyperplane is computed by taking the perpendicular bisector of the shortest line that connects the two convex hulls.

  24. Maximum Margin Hyperplanes [Figure: support vectors on the convex hulls] • The maximum margin hyperplane is computed by taking the perpendicular bisector of the shortest line that connects the two convex hulls. • Note that the maximum margin hyperplane depends only on the support vectors, which should be relatively few in comparison with the total set of data points.
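
A small sketch of this point (illustrative only, with arbitrary toy data): fit a linear SVM on well-separated two-dimensional blobs and count how few of the points end up as support vectors.

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=200, centers=2, random_state=3)   # toy 2-D data
    svm = SVC(kernel="linear", C=1000.0).fit(X, y)                # large C: nearly hard margin
    print("total points:", len(X))
    print("support vectors:", len(svm.support_vectors_))          # typically a small handful
    print("hyperplane: w =", svm.coef_[0], " b =", svm.intercept_[0])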

  25. “The Kernel Trick”: if your data is not linearly separable • Note that “the kernel trick” can be applied to other algorithms, like perceptron learners, but they will not necessarily learn the maximum margin hyperplane.

  26. An example of a polynomial kernel function
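
The slide's specific example is not reproduced here, but one commonly used polynomial kernel is K(x, y) = (x . y + 1)^d. The sketch below checks, for d = 2 on two-dimensional inputs, that the kernel value equals an ordinary dot product in the explicitly expanded feature space:

    import numpy as np

    def poly_kernel(x, y, d=2):
        # polynomial kernel: an implicit dot product in a higher-dimensional space
        return (np.dot(x, y) + 1) ** d

    def phi(x):
        # explicit feature map matching the d = 2 kernel for 2-D inputs
        x1, x2 = x
        s = np.sqrt(2)
        return np.array([1, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

    x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print(poly_kernel(x, y))        # kernel value, computed implicitly
    print(np.dot(phi(x), phi(y)))   # same value via the explicit feature map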

  27. Thought Question! What is the connection between the meta-features we have been talking about under feature space design and kernel functions?

  28. Remember: Use just as much power as you need, and no more

  29. Similarity

  30. What does it mean for two vectors to be similar?

  31. What does it mean for two vectors to be similar? • Euclidean distance: if there are n attributes, sqrt((a1 - b1)^2 + (a2 - b2)^2 + … + (an - bn)^2) • For nominal attributes, the difference is 0 when the values are the same and 1 otherwise • A common policy for missing values is that if either or both of the values being compared are missing, they are treated as different
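
A minimal sketch of that distance, assuming the policies above (nominal attributes contribute 0 or 1, and a missing value, written here as None, on either side counts as fully different):

    import math

    def distance(a, b, nominal_indices):
        total = 0.0
        for i, (x, y) in enumerate(zip(a, b)):
            if x is None or y is None:
                total += 1.0                      # missing on either side: treated as different
            elif i in nominal_indices:
                total += 0.0 if x == y else 1.0   # nominal: 0 if equal, 1 otherwise
            else:
                total += (x - y) ** 2             # numeric: squared difference
        return math.sqrt(total)

    # attributes: [age, color (nominal), height]
    print(distance([25, "red", 1.7], [30, "blue", 1.6], nominal_indices={1}))
    print(distance([25, "red", None], [30, "red", 1.6], nominal_indices={1}))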

  32. What does it mean for two vectors to be similar? • Cosine similarity = Dot(A,B) / (Len(A) Len(B)) • = (a1b1 + a2b2 + … + anbn) / (sqrt(a1^2 + a2^2 + … + an^2) sqrt(b1^2 + b2^2 + … + bn^2))

  33. What does it mean for two vectors to be similar? [Figure: points A, B, and C] • Cosine similarity rates B and A as more similar than C and A • Euclidean distance rates C and A closer than B and A

  34. What does it mean for two vectors to be similar? [Figure: a different arrangement of points A, B, and C] • Cosine similarity rates B and A as more similar than C and A • Euclidean distance also rates B and A closer than C and A
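
The disagreement between the two metrics is easy to reproduce. The points below are hypothetical (not the ones drawn on the slides): B points in the same direction as A but is far away, while C is nearby but points in a different direction.

    import numpy as np

    def cosine_sim(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def euclidean(a, b):
        return np.linalg.norm(a - b)

    A, B, C = np.array([1.0, 1.0]), np.array([10.0, 10.0]), np.array([0.0, 1.0])
    print("cosine:    A,B =", cosine_sim(A, B), "  A,C =", cosine_sim(A, C))
    print("euclidean: A,B =", euclidean(A, B), "  A,C =", euclidean(A, C))
    # Cosine rates B as more similar to A; Euclidean rates C as closer to A.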

  35. Remember! Different similarity metrics will lead to a different grouping of your instances! Think in terms of neighborhoods of instances…

  36. Feature Selection

  37. Why do irrelevant features hurt performance? • Divide-and-conquer approaches have the problem that the further down in the tree you get, the less data you are paying attention to, so it’s easy for the classifier to get confused • Naïve Bayes does not have this problem, but it has other problems, as we have discussed • SVM is relatively good at ignoring irrelevant attributes, but it can still suffer • Also, SVM is very computationally expensive with large attribute spaces
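
A rough way to see the effect (illustrative only, using a placeholder dataset and an arbitrary number of junk features): pad real data with random attributes, compare cross-validated accuracy, and check whether simple univariate feature selection recovers some of the loss.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)                     # placeholder dataset
    rng = np.random.default_rng(0)
    X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 200))])   # add 200 irrelevant features

    tree = DecisionTreeClassifier(random_state=0)
    print("original:          ", cross_val_score(tree, X, y, cv=10).mean())
    print("with junk features:", cross_val_score(tree, X_noisy, y, cv=10).mean())
    selected = make_pipeline(SelectKBest(f_classif, k=30), DecisionTreeClassifier(random_state=0))
    print("junk + selection:  ", cross_val_score(selected, X_noisy, y, cv=10).mean())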

  38. Take Home Message • Good Luck!
