
k-Nearest Neighbors


Presentation Transcript


  1. k-Nearest Neighbors

  2. When Should We Apply ML? • A pattern exists • We cannot pin it down mathematically • We have data on it

  3. Supervised Learning • Major ML Categories • Supervised Learning • Classification • Regression • Unsupervised Learning • Reinforcement Learning Many problems can be addressed with supervised learning…

  4. Creditworthiness • Banks would like to decide whether or not to extend credit to new customers • Good customers pay back loans • Bad customers default • Task: Predict creditworthiness based on: • Salary, Years in residence, Current debt, Age, etc.

  5. Genetic Testing • Microarray (DNA Chip, biochip) • Each spot represents amount of a particular DNA sequence • Different people have different expression profiles • Task: Separate malignant from healthy tissues based on the DNA expression profile

  6. Signature Recognition • Electronic signature pads could be used to authenticate signatures • Task: Does a signature (represented as an image) belong to a specific person?

  7. Text Categorization • Task: Categorize documents into predefined categories. For example, categorize news into ‘sports’, ‘politics’, ‘science’, etc. Example documents: • Soft tissue found in T-rex fossil (AP, Thursday, March 24, 2005): “For more than a century, the study of dinosaurs has been limited to fossilized bones. Now, researchers have recovered 70-million-year-old soft tissue, including what may be blood vessels and cells, from a Tyrannosaurus rex.” • Health may be concern when giving kids cell phones (AP, Wednesday, March 23, 2005): “Parents should think twice before giving in to a middle-schooler's demands for a cell phone, some scientists say, because potential long-term health risks remain unclear.” • Wall Street gears up for jobs (CNN/Money, Saturday, March 26, 2005): “Investors on Inflation Watch 2005 have a big week to look forward to -- or be wary of -- depending on how you look at it.” • Probe finds atmosphere on Saturn moon (Reuters, Thursday, March 17, 2005): “The space probe Cassini discovered a significant atmosphere around Saturn's moon Enceladus during two recent passes close by, the Jet Propulsion Laboratory said on Wednesday.”

  8. ML Problem Components • Task: What are we trying to do? • Important to be sure that our example (training) data is useful • Experience: What data do we provide the algorithm? • Defines the input (and output) to the learning system and the data on which it bases its decisions • Performance Metrics: How do we measure how well the system is doing? • Gives us an objective measure to judge the learning process • Also allows comparison between competing methods

  9. Components of Supervised Learning • Task: What are we trying to do? • Predict the target variable for a given example • Experience: What data do we provide the algorithm? • A training set of paired examples and target variables • Performance Metrics: How do we measure how well the system is doing? • Classification accuracy on a (separate) testing set This is typical for supervised learning (classification) problems.

  10. Data Representation • Inputs are quite different • Creditworthiness (demographic information) • Microarray (expression profile) • Document (natural language) • Images • ML algorithms need a fixed representation of data • Usually (fixed-length) vectors

  11. Representing People? • Problem: Predict creditworthiness based on: • Salary, Years in residence, Current debt, Age, etc.
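The demographic attributes above can be packed into a fixed-length numeric vector, which is the representation most ML algorithms expect. A minimal sketch (the field names are illustrative placeholders, not from the slides):

```python
# Minimal sketch: encode a loan applicant as a fixed-length numeric
# feature vector. Field names are illustrative placeholders.
def applicant_to_vector(salary, years_in_residence, current_debt, age):
    """Return a fixed-length vector so every applicant has the same shape."""
    return [float(salary), float(years_in_residence), float(current_debt), float(age)]

x = applicant_to_vector(salary=52000, years_in_residence=3, current_debt=8500, age=41)
# Every applicant now lives in the same 4-dimensional feature space.
```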

  12. Representing Microarray Data? Each spot represents the abundance of specific DNA sequences in a target

  13. Representing Images?

  14. Representing Documents? • Soft tissue found in T-rex fossil • Find may reveal details about cells and blood vessels of dinosaurs • Thursday, March 24, 2005 Posted: 3:14 PM EST • WASHINGTON (AP) -- For more than a century, the study of dinosaurs has been limited to fossilized bones. Now, researchers have recovered 70-million-year-old soft tissue, including what may be blood vessels and cells, from a Tyrannosaurus rex. • Health may be concern when giving kids cell phones • Wednesday, March 23, 2005 Posted: 11:14 AM EST • SEATTLE, Washington (AP) -- Parents should think twice before giving in to a middle-schooler's demands for a cell phone, some scientists say, because potential long-term health risks remain unclear.
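A common way to turn variable-length text like the articles above into fixed-length vectors is a bag-of-words count over a shared vocabulary; a minimal sketch (the tiny vocabulary is illustrative):

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Map a document to a fixed-length count vector over a shared vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["fossil", "phone", "atmosphere", "investors"]  # illustrative vocabulary
doc = "Soft tissue found in T-rex fossil may reveal details about dinosaurs"
print(bag_of_words(doc, vocab))  # [1, 0, 0, 0]
```

Every document, regardless of length, maps to a vector of the same dimension, so they all live in one feature space.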

  15. Feature Space • Each feature vector, X, lives in feature space • Vector length determines dimensionality

  16. Classification (in feature space) [Figure: labeled training points in the (x1, x2) feature plane, with unlabeled query points marked ‘?’]

  17. Classification • Given: a set of examples (xi, yi) • sampled from some distribution D • called the ‘training set’ • Learn: a function f which classifies ‘well’ examples xj sampled from D • y is the ‘target variable’ [Figure: training points in the (x1, x2) plane with query points marked ‘?’]

  18. Classification • Given an input vector, x • Assign it to one of K discrete classes Ck • where k = 1, … , K • Each input is assigned to one class • Binary (K = 2) classification is the most common • Examples with yi = +1 are called ‘positive examples’ • Examples with yi = -1 are called ‘negative examples’ • Now, our first ML algorithm…

  19. Nearest Neighbor (NN) Classification Toy example: two classes, 2D feature vectors. k-NN classification For a given query point q, assign the class of the nearest neighbour. k = 1 Compute the k nearest neighbours and assign the class by majority vote. k = 3
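The two decision rules on this slide can be sketched in a few lines of Python (a toy illustration, not an efficient implementation):

```python
import math
from collections import Counter

def knn_classify(query, training_set, k=1):
    """Assign `query` the majority class among its k nearest training examples.
    With k = 1 this is plain nearest-neighbour classification."""
    neighbors = sorted(training_set, key=lambda xy: math.dist(query, xy[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy 2D example: two classes in feature space
train = [((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((5.0, 5.0), "-"), ((6.0, 5.5), "-")]
print(knn_classify((1.2, 1.4), train, k=1))  # +
print(knn_classify((3.5, 3.5), train, k=3))  # - (two of the three nearest are negative)
```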

  20. NN Classification? • Is this a good algorithm? • Why or why not? For a given query point q, assign the class of the nearest neighbour. k = 1 Compute the k nearest neighbours and assign the class by majority vote. k = 3

  21. NN Pros & Cons • Expensive (Time & Space) • Basic version: O(Nd) complexity for both storage and query time • Pre-sort training examples into fast data structures (kd-trees) • Pre-sorting often increases the storage requirements • Remove redundant data (condensing) • Limited with High-Dimensional Data • “Curse of Dimensionality” • Required amount of training data increases exponentially with dimension • Computational cost also increases dramatically • However, k-NN can work quite well in practice • With lots of training data, it’s provably good

  22. Example: Digit Recognition • Yann LeCun – MNIST Digit Recognition • Handwritten digits • 28x28 pixel images: d = 784 • 60,000 training samples • 10,000 test samples • Nearest Neighbor is competitive
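For nearest neighbour on MNIST, each 28x28 image is simply flattened into a 784-dimensional feature vector; a minimal sketch:

```python
def flatten_image(pixels):
    """Flatten a 28x28 grayscale image (list of pixel rows) into a
    784-dimensional feature vector for nearest-neighbour comparison."""
    return [p for row in pixels for p in row]

image = [[0] * 28 for _ in range(28)]  # a blank 28x28 image
print(len(flatten_image(image)))  # 784
```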

  23. k-NN in Practice • What distance measure to use? • Often Euclidean distance is used • Locally adaptive metrics • More complicated with non-numeric data, or when different dimensions have different scales • Choice of k? • Cross-validation (we’ll cover this later) • 1-NN often performs well in practice • k-NN needed for overlapping classes • Re-label all data according to k-NN, then classify with 1-NN • Reduce k-NN problem to 1-NN through dataset editing
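One simple instance of cross-validation for choosing k is leave-one-out: classify each training example using all the others, and pick the k with the best accuracy. A self-contained sketch on a toy data set:

```python
import math
from collections import Counter

def knn_classify(query, training_set, k):
    """k-NN majority vote with Euclidean distance."""
    neighbors = sorted(training_set, key=lambda xy: math.dist(query, xy[0]))[:k]
    return Counter(y for _, y in neighbors).most_common(1)[0][0]

def loo_accuracy(training_set, k):
    """Leave-one-out accuracy: classify each example using all the others."""
    correct = sum(
        knn_classify(x, training_set[:i] + training_set[i + 1:], k) == y
        for i, (x, y) in enumerate(training_set)
    )
    return correct / len(training_set)

# Pick the k with the best leave-one-out accuracy on a toy set
train = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((5.0, 5.0), "b"), ((5.0, 6.0), "b")]
best_k = max([1, 3], key=lambda k: loo_accuracy(train, k))
```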

  24. NN in Feature Space • Let’s find the regions of feature space closest to each training point… • Voronoi decomposition • Each cell contains one sample • Every location within the cell is closer to that sample than to any other sample

  25. Decision Regions • Every query point will be assigned the classification of the sample within that cell • The decision boundary separates the class regions based on the 1-NN decision rule • Knowledge of this boundary is sufficient to classify new points • The boundary itself is rarely computed. We either: • retain only points necessary to generate an identical boundary • retain only points necessary to generate a similar boundary

  26. Condensing • Aim is to reduce the number of training samples • Retain only the samples that are needed to define the decision boundary • Decision Boundary Consistent – a subset whose nearest neighbor decision boundary is identical to the boundary of the entire training set • Consistent Set – a subset of the training data that correctly classifies all of the original training data • Minimum Consistent Set – smallest consistent set [Figure: original data, condensed data, and the minimum consistent set]

  27. Condensing • Condensed Nearest Neighbor (CNN) [Hart 1968]: • 1. Initialize the subset with a single training example • 2. Classify all remaining samples using the subset, and transfer an incorrectly classified sample to the subset • 3. Return to 2 until no transfers occur or the subset is full • Properties: incremental; order dependent; neither minimal nor decision-boundary consistent; O(n³) for the brute-force method; produces a consistent set • Can follow up with Reduced NN [Gates 1972]: remove a sample if doing so does not cause any incorrect classifications
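The steps of Hart's condensing algorithm can be sketched directly (a toy illustration; the repeated nearest-neighbour scans are what make the brute-force version expensive):

```python
import math

def nearest_label(query, subset):
    """Label of the nearest stored example (1-NN rule)."""
    return min(subset, key=lambda xy: math.dist(query, xy[0]))[1]

def condense(training_set):
    """Sketch of Condensed Nearest Neighbor (Hart, 1968)."""
    subset = [training_set[0]]            # 1. seed the subset with one example
    remaining = list(training_set[1:])
    transferred = True
    while transferred and remaining:      # 3. repeat until a pass makes no transfers
        transferred = False
        for sample in list(remaining):
            x, y = sample
            if nearest_label(x, subset) != y:   # 2. misclassified by the subset?
                subset.append(sample)           #    transfer it into the subset
                remaining.remove(sample)
                transferred = True
    return subset

train = [((0.0, 0.0), "a"), ((0.1, 0.0), "a"), ((5.0, 5.0), "b"), ((5.1, 5.0), "b")]
condensed = condense(train)  # keeps one example per cluster here
```

The result is a consistent set: the condensed subset still classifies every original training example correctly, though it is not guaranteed to be minimal.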

  34. Where are we with respect to NN? • Simple method, pretty powerful rule • Very popular in text mining • Seems to work well for this task • Can be made to run fast • Requires a lot of training data • Condense to remove data that are not needed

  35. Recap • Introduced binary classification • Very common machine learning problem • Feature Space Model • Data set as points in a high dimensional space • Nearest-Neighbor Classification • Simple, powerful classification algorithm • Considers feature distances • Implies decision boundaries
