
Statistical Classification

Statistical Classification. Rong Jin. Classification problems: given input X = {x1, x2, …, xm}, predict the class label y ∈ Y, where Y = {-1, 1} for binary classification and Y = {1, 2, 3, …, c} for multi-class classification.



Presentation Transcript


  1. Statistical Classification Rong Jin

  2. Classification Problems (diagram: input X → ? → output Y) • Given input X = {x1, x2, …, xm} • Predict the class label y ∈ Y • Y = {-1, 1}: binary classification problems • Y = {1, 2, 3, …, c}: multi-class classification problems • Goal: learn the function f: X → Y

  3.–4. Examples of Classification Problem • Text categorization • Input features X: word frequencies, e.g., {(campaigning, 1), (democrats, 2), (basketball, 0), …} • Class label y: y = +1 for 'politics', y = -1 for 'non-politics' • Example doc: "Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, …" • Which topic: politics or non-politics?
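As a concrete illustration of the word-frequency features above, here is a minimal Python sketch; the vocabulary and document are made up for the example and are not taken from the slides:

```python
from collections import Counter

def word_frequency_features(doc, vocabulary):
    """Map a document to a word-frequency feature vector over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and document, mirroring the slide's example features.
vocabulary = ["campaigning", "democrats", "basketball"]
doc = "Months of campaigning and weeks of round-the-clock efforts in Iowa ..."
x = word_frequency_features(doc, vocabulary)   # e.g. [1, 0, 0]
y = +1                                         # class label: 'politics'
```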

  5.–6. Examples of Classification Problem • Image classification • Input features X: color histogram, e.g., {(red, 1004), (blue, 23000), …} • Class label y: y = +1 for 'bird image', y = -1 for 'non-bird image' • Which images are birds, and which are not?
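A similar sketch for the color-histogram features; the binning scheme below is an assumption for illustration, not something the slides specify:

```python
def color_histogram(pixels, bins_per_channel=4):
    """Coarse RGB histogram: counts of pixels falling into each (r, g, b) bin."""
    step = 256 // bins_per_channel
    hist = [0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        idx = (r // step) * bins_per_channel ** 2 + (g // step) * bins_per_channel + (b // step)
        hist[idx] += 1
    return hist

# In practice the pixels would come from an image library; tiny made-up example here.
pixels = [(250, 10, 10), (240, 20, 15), (10, 10, 245)]
x = color_histogram(pixels)   # feature vector X
y = +1                        # class label: 'bird image' (+1) or 'non-bird image' (-1)
```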

  7. Classification Problems (diagram: input X → ? → output Y) • f: doc → topic (politics vs. not-politics) • f: image → topic (birds vs. not-birds) • Example doc: "Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, …" • How do we obtain f? Learn the classification function f from examples.

  8.–9. Learning from Examples • Training examples: • Independent and Identically Distributed (i.i.d.) • Each training example is drawn independently from the same underlying distribution • Hence training examples are representative of the testing examples

  10. Learning from Examples • Given training examples • Goal: learn a classification function f(x): X → Y that is consistent with the training examples • What is the easiest way to do it?

  11. K Nearest Neighbor (kNN) Approach (figures: decisions with k = 1 and k = 4) • How many neighbors should we count?
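A minimal kNN classifier along the lines of the slide, assuming Euclidean distance (the slides do not commit to a particular metric):

```python
import math
from collections import Counter

def knn_predict(x_query, training_data, k=4):
    """Predict by majority vote among the k nearest training examples.

    training_data is a list of (feature_vector, label) pairs.
    """
    neighbors = sorted(training_data,
                       key=lambda ex: math.dist(x_query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy 2-D example with labels in Y = {-1, +1}.
train = [([0.0, 0.0], -1), ([0.1, 0.2], -1), ([1.0, 1.0], +1), ([0.9, 1.1], +1)]
print(knn_predict([0.95, 1.0], train, k=1))   # +1
```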

  12. Cross Validation • Divide the training examples into two sets: a training set (80%) and a validation set (20%) • Predict the class labels of the examples in the validation set using the examples in the training set • Choose the number of neighbors k that maximizes the classification accuracy on the validation set
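A sketch of the 80/20 validation procedure for choosing k, reusing the knn_predict function from the previous sketch; the random shuffle and accuracy measure are the obvious choices, not prescribed by the slide:

```python
import random

def choose_k_by_validation(data, candidate_ks, seed=0):
    """Hold out 20% of the data and pick the k with the best validation accuracy."""
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    split = int(0.8 * len(data))
    train_set, valid_set = data[:split], data[split:]
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        correct = sum(knn_predict(x, train_set, k) == y for x, y in valid_set)
        acc = correct / len(valid_set)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k

# Usage: best_k = choose_k_by_validation(train, candidate_ks=[1, 2, 3])
```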

  13.–14. Leave-One-Out Method • For k = 1, 2, …, K • Err(k) = 0 • Randomly select a training data point and hide its class label • Use the remaining data and the given k to predict the class label for the held-out data point • Err(k) = Err(k) + 1 if the predicted label differs from the true label • Repeat the procedure until all training examples have been tested • Choose the k whose Err(k) is minimal

  15. (k=1) Leave-One-Out Method • For k = 1, 2, …, K • Err(k) = 0 • Randomly select a training data point and hide its class label • Use the remaining data and the given k to predict the class label for the held-out data point • Err(k) = Err(k) + 1 if the predicted label differs from the true label • Repeat the procedure until all training examples have been tested • Choose the k whose Err(k) is minimal

  16. (k=1) Leave-One-Out Method • [same procedure as above] • Err(1) = 1

  17. Leave-One-Out Method • [same procedure as above] • Err(1) = 1

  18. (k=2) Leave-One-Out Method • [same procedure as above] • Err(1) = 3, Err(2) = 2, Err(3) = 6
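The leave-one-out loop from slides 13–18, as a sketch (again reusing knn_predict from the earlier example):

```python
def leave_one_out_errors(data, max_k):
    """Err(k) = number of points misclassified when each point is held out in turn."""
    err = {}
    for k in range(1, max_k + 1):
        err[k] = 0
        for i, (x, y) in enumerate(data):
            rest = data[:i] + data[i + 1:]        # hide the i-th point
            if knn_predict(x, rest, k) != y:      # compare prediction to the true label
                err[k] += 1
    return err

# Usage: err = leave_one_out_errors(train, max_k=3)
#        best_k = min(err, key=err.get)          # choose k with minimal Err(k)
```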

  19. Probabilistic Interpretation of kNN • Estimate the conditional probability Pr(y|x) around the location of x by the count of data points of class y in the neighborhood of x • Bias and variance tradeoff • A small neighborhood → large variance → unreliable estimation • A large neighborhood → large bias → inaccurate estimation
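The neighborhood estimate of Pr(y|x) described above can be written directly as the fraction of the k nearest neighbors carrying label y; a small sketch:

```python
import math

def knn_posterior(x_query, training_data, k, label):
    """Estimate Pr(y = label | x) as the fraction of the k nearest neighbors with that label."""
    neighbors = sorted(training_data,
                       key=lambda ex: math.dist(x_query, ex[0]))[:k]
    return sum(1 for _, y in neighbors if y == label) / k
```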

  20. Weighted kNN • Weight the contribution of each close neighbor based on its distance to the query • Weight function • Prediction
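The slide's weight-function and prediction formulas did not survive the transcript. The Gaussian weight exp(−‖x − xi‖² / (2σ²)) used below is the common choice and matches the σ² discussed on the following slides, but it is an assumption:

```python
import math
from collections import defaultdict

def weighted_knn_predict(x_query, training_data, k, sigma2):
    """Vote with weights that decay with distance: w = exp(-d^2 / (2 * sigma2))."""
    neighbors = sorted(training_data,
                       key=lambda ex: math.dist(x_query, ex[0]))[:k]
    scores = defaultdict(float)
    for x, y in neighbors:
        d = math.dist(x_query, x)
        scores[y] += math.exp(-d * d / (2.0 * sigma2))
    return max(scores, key=scores.get)
```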

  21. Estimate 2in the Weight Function • Leave one cross validation • Training dataset D is divided into two sets • Validation set • Training set • Compute the

  22. Estimate 2in the Weight Function Pr(y|x1, D-1) is a function of 2

  23. Estimate 2in the Weight Function Pr(y|x1, D-1) is a function of 2

  24. Estimate 2in the Weight Function • In general, we can have expression for • Validation set • Training set • Estimate 2 by maximizing the likelihood

  25. Estimate 2in the Weight Function • In general, we can have expression for • Validation set • Training set • Estimate 2 by maximizing the likelihood

  26. Optimization • The objective is a DC function (a difference of two convex functions)

  27. Challenges in Optimization • Convex functions are the easiest to optimize • Single-mode (unimodal) functions are the second easiest • Multi-mode (multimodal) functions are difficult to optimize

  28. Gradient Ascent

  29. Gradient Ascent (cont'd) • Compute the derivative of l(λ), i.e., ∇l(λ) • Update λ ← λ + t·∇l(λ) • How do we decide the step size t?

  30. Gradient Ascent: Line Search • Excerpt from the slides by Stephen Boyd

  31.–32. Gradient Ascent • Stop criterion: ‖∇l(λ)‖ ≤ ε, where ε is a predefined small value • Start with λ = 0; define ε, α, and β • Compute ∇l(λ) • Choose the step size t via backtracking line search • Update λ ← λ + t·∇l(λ) • Repeat until ‖∇l(λ)‖ ≤ ε
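A generic gradient-ascent sketch with backtracking line search, following the steps listed above. The objective l and its gradient are placeholders, and the α, β defaults are the usual Boyd-style choices rather than values from the slides:

```python
import numpy as np

def gradient_ascent(l, grad_l, lam0, alpha=0.3, beta=0.8, eps=1e-6, max_iter=1000):
    """Maximize l(lam): step along the gradient, shrinking t until sufficient increase."""
    lam = np.asarray(lam0, dtype=float)
    for _ in range(max_iter):
        g = grad_l(lam)                        # compute the gradient
        if np.linalg.norm(g) <= eps:           # stop criterion
            break
        t = 1.0
        # Backtracking line search: shrink t until l increases by at least alpha * t * ||g||^2.
        while l(lam + t * g) < l(lam) + alpha * t * g.dot(g):
            t *= beta
        lam = lam + t * g                      # update
    return lam

# Toy usage: maximize a concave quadratic whose maximizer is [1, -2].
l = lambda x: -np.sum((x - np.array([1.0, -2.0])) ** 2)
grad = lambda x: -2.0 * (x - np.array([1.0, -2.0]))
print(gradient_ascent(l, grad, np.zeros(2)))
```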

  33. ML = Statistics + Optimization • Modeling Pr(y|x; θ), where θ denotes the parameter(s) of the model • Search for the best parameter θ • Maximum likelihood estimation: • Construct a log-likelihood function l(θ) • Search for the optimal solution θ*

  34. Instance-Based Learning (Ch. 8) • Key idea: just store all training examples • k Nearest Neighbor: • Given a query example, take a vote among its k nearest neighbors (if the target function is discrete-valued) • Take the mean of the f values of the k nearest neighbors (if the target function is real-valued)
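For the real-valued case mentioned above, the prediction is simply the mean of the neighbors' target values; a minimal sketch, again assuming Euclidean distance:

```python
import math

def knn_regress(x_query, training_data, k):
    """kNN regression: average the real-valued targets of the k nearest neighbors."""
    neighbors = sorted(training_data,
                       key=lambda ex: math.dist(x_query, ex[0]))[:k]
    return sum(target for _, target in neighbors) / k
```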

  35. When to Consider Nearest Neighbor? • Lots of training data • Fewer than 20 attributes per example • Advantages: • Training is very fast • Can learn complex target functions • No information is lost • Disadvantages: • Slow at query time • Easily fooled by irrelevant attributes

  36. KD Tree for NN Search • Each node contains • Children information • The tightest box that bounds all the data points within the node.
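A sketch of the node structure just described, storing the children plus the tightest bounding box of the points under the node; the field names are mine, not from the slides:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class KDNode:
    points: List[Tuple[float, ...]]      # data points stored under this node (used at leaves)
    lo: Tuple[float, ...]                # per-dimension minimum of the tightest bounding box
    hi: Tuple[float, ...]                # per-dimension maximum of the tightest bounding box
    split_dim: Optional[int] = None      # splitting dimension (internal nodes only)
    split_val: Optional[float] = None    # splitting threshold
    left: Optional["KDNode"] = None      # children information
    right: Optional["KDNode"] = None
```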

  37.–43. NN Search by KD Tree (a sequence of figures stepping through the search; the figures are not included in this transcript)
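The figures walked through the search step by step; the sketch below shows the usual branch-and-bound traversal over the KDNode structure above, pruning a subtree whenever its bounding box cannot contain a point closer than the best found so far. This is the standard algorithm, not necessarily the exact variant in the slides:

```python
import math

def box_distance(q, lo, hi):
    """Minimum distance from query q to the node's bounding box."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for x, l, h in zip(q, lo, hi)))

def nn_search(node, q, best=(None, float("inf"))):
    """Return (point, distance) of the nearest neighbor found under `node`."""
    if node is None or box_distance(q, node.lo, node.hi) >= best[1]:
        return best                                  # prune: box cannot hold a closer point
    if node.left is None and node.right is None:     # leaf: scan its points
        for p in node.points:
            d = math.dist(q, p)
            if d < best[1]:
                best = (p, d)
        return best
    # Visit the child on the query's side of the split first, then the other side.
    near, far = ((node.left, node.right) if q[node.split_dim] <= node.split_val
                 else (node.right, node.left))
    best = nn_search(near, q, best)
    best = nn_search(far, q, best)
    return best
```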

  44.–45. Curse of Dimensionality • Imagine instances described by 20 attributes, of which only 2 are relevant to the target function • Curse of dimensionality: the nearest neighbor is easily misled when X is high dimensional • Consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin, and the nearest-neighbor estimate at the origin. The mean distance from the origin to the closest data point is:
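The formula referenced on this slide did not survive the transcript. The standard result (stated, e.g., in Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, usually for the median rather than the mean distance) is presumably what the slide shows:

```latex
d(p, N) = \left( 1 - \left(\tfrac{1}{2}\right)^{1/N} \right)^{1/p}
```

For N = 500 and p = 10 this is roughly 0.52, i.e., the nearest neighbor is typically more than halfway to the boundary of the ball, so "nearest" neighbors are no longer local.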
