
Machine Learning in Practice Lecture 21

Learn about instance based learning and nearest neighbor classification in machine learning. Understand the importance of kD-trees, locally weighted learning, and the challenges of noisy data. Discover strategies for reducing the number of exemplars and tuning the value of K.

Presentation Transcript


  1. Machine Learning in Practice Lecture 21 Carolyn Penstein Rosé, Language Technologies Institute / Human-Computer Interaction Institute

  2. Plan for the Day • Announcements • No quiz • Last assignment Thursday • 2nd midterm goes out next Thursday after class • Finish Instance Based Learning • Weka helpful hints • Clustering • Advanced Statistical Models • More on Optimization and Tuning

  3. Finish Instance Based Learning

  4. Instance Based Learning • Rote learning is at the extreme end of instance based representations • A more general form of instance based representation is where membership is computed based on a similarity measure between a centroid vector and the vector of the example instance • Advantage: Possible to learn incrementally

  5. Why “Lazy”? [Image: http://www.cs.mcgill.ca/~cs644/Godfried/2005/Fall/mbouch/rubrique_fichiers/image003-3.png]

  6. Finding Nearest Neighbors Efficiently • Brute force method – compute the distance between the new vector and every vector in the training set, and pick the one with the smallest distance • Better method: divide the search space, and strategically select relevant regions so you only have to compare to a subset of instances
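
To make the brute force idea concrete, here is a minimal sketch in Python/NumPy (my own illustration, not code from the lecture; the function name is made up):

```python
import numpy as np

def brute_force_nearest(train_X, train_y, query):
    """Compute the distance from the query to every training vector and
    return the label of the closest one."""
    dists = np.linalg.norm(train_X - query, axis=1)   # one distance per exemplar
    return train_y[np.argmin(dists)]
```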

  7. kD-trees • kD-trees partition the space so that nearest neighbors can be found more efficiently • Each split takes place along one attribute and splits the examples at the parent node roughly in half • Split points chosen in such a way as to keep the tree as balanced as possible (*not* to optimize for accuracy or information gain) • Can you guess why you would want the tree to be as balanced as possible? • Hint – think about computational complexity of search algorithms
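
A rough sketch of how such a tree might be built, splitting on one attribute at a time at the median so the tree stays balanced (an illustration under my own assumptions, not the exact construction from the lecture):

```python
import numpy as np

def build_kdtree(points, depth=0):
    """Recursively split the point set along one attribute per level,
    at the median, so each child holds roughly half the points."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]              # cycle through the attributes
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2                      # median split keeps the tree balanced
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}
```

A balanced tree has depth of roughly log2(n), so descending it touches far fewer nodes than the brute force scan.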

  8. [Figure: kD-tree example – regions labeled A–G; the algorithm descends the tree to find an approximate nearest neighbor for the new instance]

  9. [Figure: same kD-tree example] • Tweak: sometimes you average over the k nearest neighbors rather than taking the single nearest neighbor • Tweak: use ball-shaped regions rather than rectangles to keep the number of overlapping regions down

  10. Locally Weighted Learning • Base predictions on models trained specifically for regions within the vector space • Caveat! This is an over-simplification of what’s happening • Weighting of examples accomplished as in cost sensitive classification • Similar idea to M5P (learning separate regressions for different regions of the vector space determined by a path through a decision tree)

  11. Locally Weighted Learning • LBR: Bayesian classification that relaxes independence assumptions using similarity between training and test instances • Only assumes independence within a neighborhood • LWL is a general locally weighted learning approach • Note that Bayesian Networks are another way of taking non-independence into account with probabilistic models by explicitly modeling interactions (see last section of Chapter 6)

  12. Problems with Nearest-Neighbor Classification • Slow for large numbers of exemplars • Performs poorly with noisy data if only single closest exemplar is used for classification • All attributes contribute equally to the distance comparison • If no normalization is done, then attributes with the biggest range have the biggest effect, regardless of their importance for classification

  13. Problems with Nearest-Neighbor Classification • Even if you normalize, you still have the problem that attributes are not weighted by importance • Normally does not do any sort of explicit generalization
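
A small sketch of min–max normalization, which addresses the range problem from the previous slide (illustrative only; importance-based attribute weighting needs the separate mechanism discussed later):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale every attribute to [0, 1] so attributes with large raw
    ranges do not dominate the distance comparison."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant attributes
    return (X - lo) / span
```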

  14. Reducing the Number of Exemplars • Normally unnecessary to retain all examples ever seen • Ideally only one important example per section of instance space is needed • One strategy that works reasonably well is to only keep exemplars that were initially classified wrong • Over time the number of exemplars kept increases, and the error rate goes down
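
A sketch of the keep-only-the-errors strategy, assuming hypothetical classify and add_exemplar helpers (not lecture code):

```python
def keep_misclassified(stream, classify, add_exemplar):
    """Walk through the training stream and retain only the instances
    the current exemplar set misclassifies; correct ones are discarded."""
    for x, label in stream:
        if classify(x) != label:      # an error: this example is worth keeping
            add_exemplar(x, label)
```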

  15. Reducing the Number of Exemplars • One problem is that sometimes it is not clear that an exemplar is important until sometime after it has been thrown away • Also, this strategy of keeping just those exemplars that are classified wrong is bad for noisy data, because it will tend to keep the noisy examples

  16. Tuning K for K-Nearest Neighbors Compensates for noise

  17. Tuning K for K-Nearest Neighbors Compensates for noise

  18. Tuning K for K-Nearest Neighbors Compensates for noise * Tune for optimal value of K
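
One way to tune K is leave-one-out cross-validation over candidate values, as sketched below (my own illustration; assumes integer class labels):

```python
import numpy as np

def tune_k(X, y, candidate_ks=(1, 3, 5, 7, 9)):
    """Pick the K whose leave-one-out accuracy is highest: each instance
    is held out and classified by a vote of its K nearest neighbors."""
    best_k, best_acc = candidate_ks[0], -1.0
    for k in candidate_ks:
        correct = 0
        for i in range(len(X)):
            d = np.linalg.norm(X - X[i], axis=1)
            d[i] = np.inf                      # exclude the held-out instance
            neighbors = np.argsort(d)[:k]
            correct += int(np.bincount(y[neighbors]).argmax() == y[i])
        acc = correct / len(X)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```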

  19. Pruning Noisy Examples • Using success ratios, it is possible to reduce the number of examples you are paying attention to based on their observed reliability • You can compute a success ratio for every instance within range K of the new instance, based on the accuracy of its predictions, computed over the examples seen since it was added to the space

  20. Pruning Noisy Examples • Keep an upper and lower threshold • Throw out examples that fall below the lower threshold • Only use exemplars that are above the upper threshold • But keep updating the success ratio of all exemplars
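
A sketch of the two-threshold bookkeeping, with placeholder threshold values and a simple dict representation for exemplars (my own assumptions):

```python
def usable_exemplars(exemplars, lower=0.5, upper=0.8):
    """Split exemplars by success ratio: those below the lower threshold
    are dropped, those above the upper threshold are used for prediction,
    and everything kept continues to have its ratio updated."""
    kept, active = [], []
    for ex in exemplars:                       # ex: {"successes": int, "trials": int, ...}
        ratio = ex["successes"] / max(ex["trials"], 1)
        if ratio >= lower:
            kept.append(ex)                    # retained and still tracked
            if ratio >= upper:
                active.append(ex)              # actually consulted at prediction time
    return kept, active
```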

  21. Don’t do anything rash! • We can compute confidence intervals on the success ratios we compute based on the number of observations we have made • You won’t pay attention to an exemplar that just happens to look good at first • You won’t throw instances away carelessly • Eventually, genuinely noisy exemplars will still be thrown out

  22. What do we do about irrelevant attributes? • You can compensate for irrelevant attributes by scaling attribute values based on importance • Attribute weights modified after a new example is added to the space • Use the most similar exemplar to the new training instance

  23. What do we do about irrelevant attributes? • Adjust the weights so that the new instance comes closer to the most similar exemplar if it classified it correctly or farther away if it was wrong • Weights are usually renormalized after this adjustment • Weights will be trained to emphasize attributes that lead to useful generalizations
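
A rough sketch of one possible weight-update rule of this kind (the exact rule is not given in the lecture; the learning rate lr and the update form are my assumptions, and attribute values are assumed normalized to [0, 1]):

```python
import numpy as np

def update_attribute_weights(weights, new_x, nearest_x, correct, lr=0.1):
    """If the nearest exemplar classified the new instance correctly,
    boost attributes on which the two agree (pulling them closer);
    if it was wrong, boost attributes on which they differ (pushing
    them apart). Weights are renormalized afterwards."""
    agreement = 1.0 - np.abs(new_x - nearest_x)
    delta = lr * (agreement if correct else 1.0 - agreement)
    weights = weights + delta
    return weights / weights.sum()
```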

  24. Instance Based Learning with Generalization • Instances generalized to regions • Allows instance based learning algorithms to behave like other machine learning algorithms (just another complex decision boundary) • Key idea is determining how far to generalize from each instance

  25. IB1: Plain Vanilla Nearest Neighbor Algorithm • Keeps all training instances, doesn’t normalize • Uses Euclidean distance • Bases prediction on the first instance found with the shortest distance • Nothing to optimize • Published in 1991 by my AI programming professor from UCI!

  26. IBK: More general than IB1 • kNN: how many neighbors to pay attention to • crossValidate: use leave-one-out cross-validation to select the optimal K • distanceWeighting: allows you to select the method for weighting based on distance • meanSquared: if it’s true, use mean squared error rather than absolute error for regression problems

  27. IBK: More general than IB1 • noNormalization: turns off normalization • windowSize: sets the maximum number of instances to keep. Prunes off older instances when necessary. 0 means no limit.

  28. K* • Uses an entropy-based distance metric rather than Euclidean distance • Much slower than IBK! • Optimizations related to concepts we aren’t learning in this course • Allows you to choose what to do with missing values

  29. What is special about K*? • Distance is computed based on a computation of how many transformation operations it would take to map one vector onto another • There may be multiple transformation paths, and all of them are taken into account • So the distance is an average over all possible transformation paths (randomly generated – so branching factor matters!) • That’s why it’s slow!!! • Allows for a more natural way of handling distance when your attribute space has many different types of attributes

  30. What is special about K*? • Also allows a natural way of handling unknown values (probabilistically imputing values) • K* is likely to do better than other approaches if you have lots of unknown values or a very heterogeneous feature space (in terms of types of features)

  31. Locally Weighted Numeric Prediction • Two Main types of trees used for numeric prediction • Regression trees: average values computed at leaf nodes • Model trees: regression functions trained at leaf nodes • Rather than maximize information gain, these algorithms minimize variation within subsets at leaf nodes

  32. Locally Weighted Numeric Prediction • Locally weighted regression is an alternative to regression trees where the regression is computed at testing time rather than training time • Compute a regression for instances that are close to the testing instance
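
A compact sketch of locally weighted linear regression at prediction time (a Gaussian kernel with bandwidth tau is my own illustrative choice):

```python
import numpy as np

def locally_weighted_predict(X, y, query, tau=1.0):
    """Fit a linear regression that weights training instances by their
    closeness to the query, then return the fitted value at the query."""
    Xb = np.hstack([np.ones((len(X), 1)), X])             # intercept column
    qb = np.concatenate([[1.0], query])
    w = np.exp(-np.sum((X - query) ** 2, axis=1) / (2 * tau ** 2))
    sw = np.sqrt(w)[:, None]                              # weighted least squares
    beta, *_ = np.linalg.lstsq(sw * Xb, sw.ravel() * y, rcond=None)
    return float(qb @ beta)
```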

  33. Summary of Locally Weighted Learning • Use Instance Based Learning together with a base classifier – almost like a wrapper • Learn a model within a neighborhood • Basic idea: approximate non-linear function learning with simple linear algorithms

  34. Summary of Locally Weighted Learning • Big advantage: allows for incremental learning, whereas things like SVM do not • If you don’t need the incrementality, then it is probably better not to go with instance based learning

  35. Take Home Message • Many ways of evaluating similarity of instances, which lead to different results • Instance based learning and clustering both make use of these approaches • Locally weighted learning is another way (besides the “kernel trick”) to get nonlinearity into otherwise linear approaches

  36. Weka Helpful Hints

  37. Remember SMOreg vs SMO… SMO is for classification SMOreg is for numeric prediction!

  38. Setting the Exponent in SMO * Note that an exponent larger than 1.0 means you are using a non-linear kernel.
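
For intuition, the polynomial kernel behind the exponent setting looks roughly like this (a simplified sketch; Weka's actual PolyKernel has options ignored here):

```python
import numpy as np

def poly_kernel(x, z, exponent=1.0):
    """Dot-product kernel raised to a power: exponent 1.0 is a plain
    linear kernel; any exponent larger than 1.0 makes the implicit
    feature space, and hence the decision boundary, non-linear."""
    return float(np.dot(x, z)) ** exponent
```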

  39. Clustering

  40. What is clustering? • Finding natural groupings of your data • Not supervised! No class attribute. • Usually only works well if you have a huge amount of data!

  41. InfoMagnets: Interactive Text Clustering

  42. What does clustering do? • Finds natural breaks in your data • If there are obvious clusters, you can do this with a small amount of data • If you have lots of weak predictors, you need a huge amount of data to make it work

  43. What does clustering do? • Finds natural breaks in your data • If there are obvious clusters, you can do this with a small amount of data • If you have lots of weak predictors, you need a huge amount of data to make it work
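
As a concrete picture of what a clustering algorithm does, here is a bare-bones k-means sketch (Weka's SimpleKMeans is the analogous built-in; this illustration makes its own simplifying choices):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Alternate between assigning each instance to its nearest centroid
    and moving each centroid to the mean of the instances assigned to it."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):             # keep old centroid if a cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```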

  44. Clustering in Weka * You can pick which clustering algorithm you want to use and how many clusters you want.

  45. Clustering in Weka * Clustering is unsupervised, so you want it to ignore your class attribute! [Screenshot callouts: click the attribute-selection button and select the class attribute to ignore]

  46. Clustering in Weka * You can evaluate the clustering in comparison with class attribute assignments

  47. Adding a Cluster Feature

  48. Adding a Cluster Feature * You should set it explicitly to ignore the class attribute * Set the pulldown menu to No Class
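
The effect of adding a cluster feature can be sketched as appending each instance's cluster assignment as one more attribute (Weka's AddCluster filter plays this role in the Explorer; the snippet below just shows the idea, reusing labels from a clustering run such as the k-means sketch above):

```python
import numpy as np

def add_cluster_feature(X, cluster_labels):
    """Append the cluster assignment as an extra attribute so a
    supervised learner can use cluster membership as a predictor."""
    return np.hstack([X, cluster_labels.reshape(-1, 1)])
```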

  49. Why add cluster features? [Figure: instances from Class 1 and Class 2]

  50. Why add cluster features? [Figure: instances from Class 1 and Class 2]
