
Machine Learning in Practice Lecture 21

Learn about instance based learning and nearest neighbor classification in machine learning. Understand the importance of kD-trees, locally weighted learning, and the challenges of noisy data. Discover strategies for reducing the number of exemplars and tuning the value of K.

Presentation Transcript


  1. Machine Learning in Practice Lecture 21 Carolyn Penstein Rosé, Language Technologies Institute / Human-Computer Interaction Institute

  2. Plan for the Day • Announcements • No quiz • Last assignment Thursday • 2nd midterm goes out next Thursday after class • Finish Instance Based Learning • Weka helpful hints • Clustering • Advanced Statistical Models • More on Optimization and Tuning

  3. Finish Instance Based Learning

  4. Instance Based Learning • Rote learning is at the extreme end of instance based representations • A more general form of instance based representation is where membership is computed based on a similarity measure between a centroid vector and the vector of the example instance • Advantage: Possible to learn incrementally

  5. Why “Lazy”? [Image: http://www.cs.mcgill.ca/~cs644/Godfried/2005/Fall/mbouch/rubrique_fichiers/image003-3.png]

  6. Finding Nearest Neighbors Efficiently • Brute force method – compute the distance between the new vector and every vector in the training set, and pick the one with the smallest distance • Better method: divide the search space, and strategically select relevant regions so you only have to compare to a subset of instances
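
To make the brute force idea concrete, here is a minimal sketch in Python/NumPy (my own illustration, not code from the lecture; the function name is made up):

```python
import numpy as np

def brute_force_nearest(train_X, train_y, query):
    """Compute the distance from the query to every training vector and
    return the label of the closest one."""
    dists = np.linalg.norm(train_X - query, axis=1)   # one distance per exemplar
    return train_y[np.argmin(dists)]
```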

  7. kD-trees • kD-trees partition the space so that nearest neighbors can be found more efficiently • Each split takes place along one attribute and splits the examples at the parent node roughly in half • Split points chosen in such a way as to keep the tree as balanced as possible (*not* to optimize for accuracy or information gain) • Can you guess why you would want the tree to be as balanced as possible? • Hint – think about computational complexity of search algorithms
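
A rough sketch of how such a tree might be built, splitting on one attribute at a time at the median so the tree stays balanced (an illustration under my own assumptions, not the exact construction from the lecture):

```python
import numpy as np

def build_kdtree(points, depth=0):
    """Recursively split the point set along one attribute per level,
    at the median, so each child holds roughly half the points."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]              # cycle through the attributes
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2                      # median split keeps the tree balanced
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}
```

A balanced tree has depth of roughly log2(n), so descending it touches far fewer nodes than the brute force scan.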

  8. [Figure: kD-tree example – regions labeled A–G; the algorithm descends the tree to find an approximate nearest neighbor for the new instance]

  9. [Figure: same kD-tree example] • Tweak: sometimes you average over the k nearest neighbors rather than taking the single nearest neighbor • Tweak: use ball-shaped regions rather than rectangles to keep the number of overlapping regions down

  10. Locally Weighted Learning • Base predictions on models trained specifically for regions within the vector space • Caveat! This is an over-simplification of what’s happening • Weighting of examples accomplished as in cost sensitive classification • Similar idea to M5P (learning separate regressions for different regions of the vector space determined by a path through a decision tree)

  11. Locally Weighted Learning • LBR: Bayesian classification that relaxes independence assumptions using similarity between training and test instances • Only assumes independence within a neighborhood • LWL is a general locally weighted learning approach • Note that Bayesian Networks are another way of taking non-independence into account with probabilistic models by explicitly modeling interactions (see last section of Chapter 6)

  12. Problems with Nearest-Neighbor Classification • Slow for large numbers of exemplars • Performs poorly with noisy data if only single closest exemplar is used for classification • All attributes contribute equally to the distance comparison • If no normalization is done, then attributes with the biggest range have the biggest effect, regardless of their importance for classification

  13. Problems with Nearest-Neighbor Classification • Even if you normalize, you still have the problem that attributes are not weighted by importance • Normally does not do any sort of explicit generalization
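
A small sketch of min–max normalization, which addresses the range problem from the previous slide (illustrative only; importance-based attribute weighting needs the separate mechanism discussed later):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale every attribute to [0, 1] so attributes with large raw
    ranges do not dominate the distance comparison."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant attributes
    return (X - lo) / span
```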

  14. Reducing the Number of Exemplars • Normally unnecessary to retain all examples ever seen • Ideally only one important example per section of instance space is needed • One strategy that works reasonably well is to only keep exemplars that were initially classified wrong • Over time the number of exemplars kept increases, and the error rate goes down
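
A sketch of the keep-only-the-errors strategy, assuming hypothetical classify and add_exemplar helpers (not lecture code):

```python
def keep_misclassified(stream, classify, add_exemplar):
    """Walk through the training stream and retain only the instances
    the current exemplar set misclassifies; correct ones are discarded."""
    for x, label in stream:
        if classify(x) != label:      # an error: this example is worth keeping
            add_exemplar(x, label)
```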

  15. Reducing the Number of Exemplars • One problem is that sometimes it is not clear that an exemplar is important until sometime after it has been thrown away • Also, this strategy of keeping just those exemplars that are classified wrong is bad for noisy data, because it will tend to keep the noisy examples

  16. Tuning K for K-Nearest Neighbors Compensates for noise

  17. Tuning K for K-Nearest Neighbors Compensates for noise

  18. Tuning K for K-Nearest Neighbors Compensates for noise * Tune for optimal value of K
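
One way to tune K is leave-one-out cross-validation over candidate values, as sketched below (my own illustration; assumes integer class labels):

```python
import numpy as np

def tune_k(X, y, candidate_ks=(1, 3, 5, 7, 9)):
    """Pick the K whose leave-one-out accuracy is highest: each instance
    is held out and classified by a vote of its K nearest neighbors."""
    best_k, best_acc = candidate_ks[0], -1.0
    for k in candidate_ks:
        correct = 0
        for i in range(len(X)):
            d = np.linalg.norm(X - X[i], axis=1)
            d[i] = np.inf                      # exclude the held-out instance
            neighbors = np.argsort(d)[:k]
            correct += int(np.bincount(y[neighbors]).argmax() == y[i])
        acc = correct / len(X)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```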

  19. Pruning Noisy Examples • Using success ratios, it is possible to reduce the number of examples you are paying attention to based on their observed reliability • You can compute a success ratio for every instance within range K of the new instance, based on the accuracy of its predictions, computed over the examples seen since it was added to the space

  20. Pruning Noisy Examples • Keep an upper and lower threshold • Throw out examples that fall below the lower threshold • Only use exemplars that are above the upper threshold • But keep updating the success ratio of all exemplars
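
A sketch of the two-threshold bookkeeping, with placeholder threshold values and a simple dict representation for exemplars (my own assumptions):

```python
def usable_exemplars(exemplars, lower=0.5, upper=0.8):
    """Split exemplars by success ratio: those below the lower threshold
    are dropped, those above the upper threshold are used for prediction,
    and everything kept continues to have its ratio updated."""
    kept, active = [], []
    for ex in exemplars:                       # ex: {"successes": int, "trials": int, ...}
        ratio = ex["successes"] / max(ex["trials"], 1)
        if ratio >= lower:
            kept.append(ex)                    # retained and still tracked
            if ratio >= upper:
                active.append(ex)              # actually consulted at prediction time
    return kept, active
```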

  21. Don’t do anything rash! • We can compute confidence intervals on the success ratios we compute based on the number of observations we have made • You won’t pay attention to an exemplar that just happens to look good at first • You won’t throw instances away carelessly • Eventually, genuinely noisy exemplars will still be thrown out

  22. What do we do about irrelevant attributes? • You can compensate for irrelevant attributes by scaling attribute values based on importance • Attribute weights modified after a new example is added to the space • Use the most similar exemplar to the new training instance

  23. What do we do about irrelevant attributes? • Adjust the weights so that the new instance comes closer to the most similar exemplar if it classified it correctly or farther away if it was wrong • Weights are usually renormalized after this adjustment • Weights will be trained to emphasize attributes that lead to useful generalizations
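
A rough sketch of one possible weight-update rule of this kind (the exact rule is not given in the lecture; the learning rate lr and the update form are my assumptions, and attribute values are assumed normalized to [0, 1]):

```python
import numpy as np

def update_attribute_weights(weights, new_x, nearest_x, correct, lr=0.1):
    """If the nearest exemplar classified the new instance correctly,
    boost attributes on which the two agree (pulling them closer);
    if it was wrong, boost attributes on which they differ (pushing
    them apart). Weights are renormalized afterwards."""
    agreement = 1.0 - np.abs(new_x - nearest_x)
    delta = lr * (agreement if correct else 1.0 - agreement)
    weights = weights + delta
    return weights / weights.sum()
```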

  24. Instance Based Learning with Generalization • Instances generalized to regions • Allows instance based learning algorithms to behave like other machine learning algorithms (just another complex decision boundary) • Key idea is determining how far to generalize from each instance

  25. IB1: Plain Vanilla Nearest Neighbor Algorithm • Keeps all training instances, doesn’t normalize • Uses Euclidean distance • Bases prediction on the first instance found with the shortest distance • Nothing to optimize • Published in 1991 by my AI programming professor from UCI!

  26. IBK: More general than IB1 • kNN: how many neighbors to pay attention to • crossValidate: use leave-one-out cross-validation to select the optimal K • distanceWeighting: allows you to select the method for weighting based on distance • meanSquared: if it’s true, use mean squared error rather than absolute error for regression problems

  27. IBK: More general than IB1 • noNormalization: turns off normalization • windowSize: sets the maximum number of instances to keep. Prunes off older instances when necessary. 0 means no limit.

  28. K* • Uses an entropy-based distance metric rather than Euclidean distance • Much slower than IBK! • Optimizations related to concepts we aren’t learning in this course • Allows you to choose what to do with missing values

  29. What is special about K*? • Distance is computed based on a computation of how many transformation operations it would take to map one vector onto another • There may be multiple transformation paths, and all of them are taken into account • So the distance is an average over all possible transformation paths (randomly generated – so branching factor matters!) • That’s why it’s slow!!! • Allows for a more natural way of handling distance when your attribute space has many different types of attributes

  30. What is special about K*? • Also allows a natural way of handling unknown values (probabilistically imputing values) • K* is likely to do better than other approaches if you have lots of unknown values or a very heterogeneous feature space (in terms of types of features)

  31. Locally Weighted Numeric Prediction • Two Main types of trees used for numeric prediction • Regression trees: average values computed at leaf nodes • Model trees: regression functions trained at leaf nodes • Rather than maximize information gain, these algorithms minimize variation within subsets at leaf nodes

  32. Locally Weighted Numeric Prediction • Locally weighted regression is an alternative to regression trees where the regression is computed at testing time rather than training time • Compute a regression for instances that are close to the testing instance
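
A compact sketch of locally weighted linear regression at prediction time (a Gaussian kernel with bandwidth tau is my own illustrative choice):

```python
import numpy as np

def locally_weighted_predict(X, y, query, tau=1.0):
    """Fit a linear regression that weights training instances by their
    closeness to the query, then return the fitted value at the query."""
    Xb = np.hstack([np.ones((len(X), 1)), X])             # intercept column
    qb = np.concatenate([[1.0], query])
    w = np.exp(-np.sum((X - query) ** 2, axis=1) / (2 * tau ** 2))
    sw = np.sqrt(w)[:, None]                              # weighted least squares
    beta, *_ = np.linalg.lstsq(sw * Xb, sw.ravel() * y, rcond=None)
    return float(qb @ beta)
```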

  33. Summary of Locally Weighted Learning • Use Instance Based Learning together with a base classifier – almost like a wrapper • Learn a model within a neighborhood • Basic idea: approximate non-linear function learning with simple linear algorithms

  34. Summary of Locally Weighted Learning • Big advantage: allows for incremental learning, whereas things like SVM do not • If you don’t need the incrementality, then it is probably better not to go with instance based learning

  35. Take Home Message • Many ways of evaluating similarity of instances, which lead to different results • Instance based learning and clustering both make use of these approaches • Locally weighted learning is another way (besides the “kernel trick”) to get nonlinearity into otherwise linear approaches

  36. Weka Helpful Hints

  37. Remember SMOreg vs SMO… SMO is for classification SMOreg is for numeric prediction!

  38. Setting the Exponent in SMO * Note that an exponent larger than 1.0 means you are using a non-linear kernel.
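
For intuition, the polynomial kernel behind the exponent setting looks roughly like this (a simplified sketch; Weka's actual PolyKernel has options ignored here):

```python
import numpy as np

def poly_kernel(x, z, exponent=1.0):
    """Dot-product kernel raised to a power: exponent 1.0 is a plain
    linear kernel; any exponent larger than 1.0 makes the implicit
    feature space, and hence the decision boundary, non-linear."""
    return float(np.dot(x, z)) ** exponent
```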

  39. Clustering

  40. What is clustering? • Finding natural groupings of your data • Not supervised! No class attribute. • Usually only works well if you have a huge amount of data!

  41. InfoMagnets: Interactive Text Clustering

  42. What does clustering do? • Finds natural breaks in your data • If there are obvious clusters, you can do this with a small amount of data • If you have lots of weak predictors, you need a huge amount of data to make it work

  43. What does clustering do? • Finds natural breaks in your data • If there are obvious clusters, you can do this with a small amount of data • If you have lots of weak predictors, you need a huge amount of data to make it work
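
As a concrete picture of what a clustering algorithm does, here is a bare-bones k-means sketch (Weka's SimpleKMeans is the analogous built-in; this illustration makes its own simplifying choices):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Alternate between assigning each instance to its nearest centroid
    and moving each centroid to the mean of the instances assigned to it."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):             # keep old centroid if a cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```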

  44. Clustering in Weka * You can pick which clustering algorithm you want to use and how many clusters you want.

  45. Clustering in Weka * Clustering is unsupervised, so you want it to ignore your class attribute! [Screenshot callouts: click the attribute-selection button and select the class attribute to ignore]

  46. Clustering in Weka * You can evaluate the clustering in comparison with class attribute assignments

  47. Adding a Cluster Feature

  48. Adding a Cluster Feature * You should set it explicitly to ignore the class attribute * Set the pulldown menu to No Class
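
The effect of adding a cluster feature can be sketched as appending each instance's cluster assignment as one more attribute (Weka's AddCluster filter plays this role in the Explorer; the snippet below just shows the idea, reusing labels from a clustering run such as the k-means sketch above):

```python
import numpy as np

def add_cluster_feature(X, cluster_labels):
    """Append the cluster assignment as an extra attribute so a
    supervised learner can use cluster membership as a predictor."""
    return np.hstack([X, cluster_labels.reshape(-1, 1)])
```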

  49. Why add cluster features? [Figure: instances from Class 1 and Class 2]

  50. Why add cluster features? [Figure: instances from Class 1 and Class 2]
