
Learning: Nearest Neighbor, Perceptrons & Neural Nets

Artificial Intelligence CSPP 56553, February 4, 2004.


Presentation Transcript


  1. Learning: Nearest Neighbor, Perceptrons & Neural Nets Artificial Intelligence CSPP 56553 February 4, 2004

  2. Nearest Neighbor Example II • Credit Rating: • Classifier: Good / Poor • Features: • L = # late payments/yr; • R = Income/Expenses
     Name   L    R      G/P
     A      0    1.2    G
     B      25   0.4    P
     C      5    0.7    G
     D      20   0.8    P
     E      30   0.85   P
     F      11   1.2    G
     G      7    1.15   G
     H      15   0.8    P

  3. Nearest Neighbor Example II • [Figure: the table from slide 2 shown beside a scatter plot of instances A to H, with L (0 to 30) on the horizontal axis and R on the vertical axis]

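  4. Nearest Neighbor Example II • New instances to classify:
     Name   L    R      G/P
     I      6    1.15   G
     J      22   0.45   P
     K      15   1.2    ??
     [Figure: scatter plot of A to H with the new instances I, J, and K added; L (0 to 30) on the horizontal axis, R on the vertical axis]
     Distance measure (scaled distance): $d = \sqrt{(L_1 - L_2)^2 + [\sqrt{10}\,(R_1 - R_2)]^2}$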
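
The lookup on slides 2 to 4 can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the names `TRAINING`, `scaled_distance`, and `classify` are assumptions, while the data and the sqrt(10) scaling of R come from the slides.

```python
import math

# Training instances from slide 2: name -> (L, R, label)
TRAINING = {
    "A": (0, 1.2, "G"), "B": (25, 0.4, "P"), "C": (5, 0.7, "G"), "D": (20, 0.8, "P"),
    "E": (30, 0.85, "P"), "F": (11, 1.2, "G"), "G": (7, 1.15, "G"), "H": (15, 0.8, "P"),
}

def scaled_distance(l1, r1, l2, r2):
    """Distance measure from slide 4: differences in R are scaled by sqrt(10)."""
    return math.sqrt((l1 - l2) ** 2 + (math.sqrt(10) * (r1 - r2)) ** 2)

def classify(l, r):
    """Return (name, label) of the closest training instance, using an O(n) scan."""
    name = min(
        TRAINING,
        key=lambda n: scaled_distance(l, r, TRAINING[n][0], TRAINING[n][1]),
    )
    return name, TRAINING[name][2]

# The three new instances from slide 4.
for query_name, l, r in [("I", 6, 1.15), ("J", 22, 0.45), ("K", 15, 1.2)]:
    print(query_name, classify(l, r))
```

Under this scaled metric the closest neighbors of I, J, and K are G, D, and H respectively, so K (marked ?? on the slide) would be labeled Poor.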

  5. Nearest Neighbor: Issues • Prediction can be expensive if many features • Affected by classification, feature noise • One entry can change prediction • Definition of distance metric • How to combine different features • Different types, ranges of values • Sensitive to feature selection

  6. Efficient Implementations • Classification cost: • Find nearest neighbor: O(n) • Compute distance between unknown and all instances • Compare distances • Problematic for large data sets • Alternative: • Use binary search to reduce to O(log n)

  7. Efficient Implementation: K-D Trees • Divide instances into sets based on features • Binary branching: E.g. feature > value • A tree of depth d has 2^d leaves; with one instance per leaf, 2^d = n, so d = O(log n) • To split cases into sets, • If there is one element in the set, stop • Otherwise pick a feature to split on • Find average position of two middle objects on that dimension • Split remaining objects based on average position • Recursively split subsets

  8. K-D Trees: Classification • [Figure: binary decision tree with Yes/No branches. Root test: R > 0.825? Second level: L > 17.5? and L > 9? Third level: R > 0.6?, R > 0.75?, R > 1.175?, R > 1.025? The eight leaves are labeled Good or Poor.]
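
The splitting procedure for K-D trees can be sketched as follows. This is an illustrative reading, not the lecture's code: the `Node` class, the choice to alternate features starting with R, and the function names are assumptions. Applied to the eight credit-rating instances, it produces the same split thresholds as the tree on this slide.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Instance = Tuple[str, float, float, str]   # (name, L, R, label)
L_AXIS, R_AXIS = 1, 2                      # tuple positions of the two features

TRAINING: List[Instance] = [
    ("A", 0, 1.2, "G"), ("B", 25, 0.4, "P"), ("C", 5, 0.7, "G"), ("D", 20, 0.8, "P"),
    ("E", 30, 0.85, "P"), ("F", 11, 1.2, "G"), ("G", 7, 1.15, "G"), ("H", 15, 0.8, "P"),
]

@dataclass
class Node:
    axis: int = 0                      # feature tested at this node
    threshold: float = 0.0             # go to `high` when feature > threshold
    low: Optional["Node"] = None
    high: Optional["Node"] = None
    leaf: Optional[Instance] = None    # set only on leaves

def build(instances: List[Instance], axis: int = R_AXIS) -> Node:
    """Split at the average of the two middle values, then recurse on each half."""
    if len(instances) == 1:
        return Node(leaf=instances[0])
    ordered = sorted(instances, key=lambda inst: inst[axis])
    mid = len(ordered) // 2
    threshold = (ordered[mid - 1][axis] + ordered[mid][axis]) / 2
    other = L_AXIS if axis == R_AXIS else R_AXIS   # assumed: alternate features
    return Node(axis, threshold, build(ordered[:mid], other), build(ordered[mid:], other))

def lookup(node: Node, l: float, r: float) -> Instance:
    """Follow one root-to-leaf path: O(log n) comparisons."""
    query = (None, l, r)
    while node.leaf is None:
        node = node.high if query[node.axis] > node.threshold else node.low
    return node.leaf

root = build(TRAINING)
print(root.threshold, root.low.threshold, root.high.threshold)
print(lookup(root, 15, 1.2))
```

For the query (L = 15, R = 1.2) the descent ends at leaf F. A single root-to-leaf descent is not guaranteed to return the true nearest neighbor; exact k-d tree search also checks neighboring branches that could hold a closer point.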

  9. Efficient Implementation: Parallel Hardware • Classification cost: • # distance computations • Const time if O(n) processors • Cost of finding closest • Compute pairwise minimum, successively • O(log n) time
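
The successive pairwise minimum can be sketched sequentially to show why only O(log n) rounds are needed. This is a simulation under the assumption that each round's comparisons would run simultaneously on separate processors; `pairwise_min` is a hypothetical name and the input must be non-empty.

```python
def pairwise_min(values):
    """Return (minimum, rounds), halving the candidate list each round."""
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Each pair below could be compared by its own processor in parallel.
        nxt = [min(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:             # odd element out advances unchanged
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds

# Eight distances need three rounds: 8 -> 4 -> 2 -> 1.
print(pairwise_min([6.0, 1.0, 1.7, 5.0, 9.0, 4.0, 3.0, 8.0]))
```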

  10. Nearest Neighbor: Analysis • Issue: • What features should we use? • E.g. Credit rating: Many possible features • Tax bracket, debt burden, retirement savings, etc.. • Nearest neighbor uses ALL • Irrelevant feature(s) could mislead • Fundamental problem with nearest neighbor

  11. Nearest Neighbor: Advantages • Fast training: • Just record feature vector - output value set • Can model wide variety of functions • Complex decision boundaries • Weak inductive bias • Very generally applicable

  12. Summary: Nearest Neighbor • Nearest neighbor: • Training: record input vectors + output value • Prediction: closest training instance to new data • Efficient implementations • Pros: fast training, very general, little bias • Cons: distance metric (scaling), sensitivity to noise & extraneous features

  13. Learning: Perceptrons Artificial Intelligence CSPP 56553 February 4, 2004

  14. Agenda • Neural Networks: • Biological analogy • Perceptrons: Single layer networks • Perceptron training • Perceptron convergence theorem • Perceptron limitations • Conclusions

  15. Neurons: The Concept Dendrites Axon Nucleus Cell Body Neurons: Receive inputs from other neurons (via synapses) When input exceeds threshold, “fires” Sends output along axon to other neurons Brain: 10^11 neurons, 10^16 synapses

  16. Artificial Neural Nets • Simulated Neuron: • Node connected to other nodes via links • Links = axon+synapse+dendrite • Links associated with weight (like synapse) • Multiplied by output of node • Node combines input via activation function • E.g. sum of weighted inputs passed thru threshold • Simpler than real neuronal processes

  17. Artificial Neural Net • [Figure: inputs x, each multiplied by a weight w, are added by a Sum unit whose result passes through a Threshold unit]

  18. Perceptrons • Single neuron-like element • Binary inputs • Binary outputs • Weighted sum of inputs > threshold

  19. Perceptron Structure • [Figure: output y computed from inputs x0 = 1, x1, x2, x3, ..., xn through weights w0, w1, w2, w3, ..., wn] • x0 w0 compensates for threshold

  20. Perceptron Example • Logical-OR: Linearly separable • 00: 0; 01: 1; 10: 1; 11: 1 • [Figure: two plots of OR in the x1-x2 plane, with 0 at the origin and + at the other three corners]

  21. Perceptron Convergence Procedure • Straightforward training procedure • Learns linearly separable functions • Repeat until the perceptron yields correct output for all training samples: • If the perceptron is correct, do nothing • If the perceptron is wrong, • If it incorrectly says “yes”, • Subtract input vector from weight vector • Otherwise, add input vector to weight vector

  22. Perceptron Convergence Example • LOGICAL-OR:
      Sample   x0   x1   x2   Desired Output
      1        1    0    0    0
      2        1    0    1    1
      3        1    1    0    1
      4        1    1    1    1
      • Initial: w = (0,0,0); after S2, w = w + s2 = (1,0,1) • Pass 2: S1: w = w - s1 = (0,0,1); S3: w = w + s3 = (1,1,1) • Pass 3: S1: w = w - s1 = (0,1,1)
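
The convergence procedure and the OR trace can be reproduced with a short sketch. The function names and the `max_passes` safety limit are assumptions added for illustration; the firing rule (output 1 when the weighted sum is positive, with x0 = 1 standing in for the threshold) follows the perceptron slides.

```python
def predict(w, x):
    """Fire (1) when the weighted sum is positive; x[0] = 1 carries the threshold."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def train_perceptron(samples, w, max_passes=100):
    """Repeat passes over the samples until one pass makes no mistakes."""
    updates = []                          # weight vector after each correction
    for _ in range(max_passes):
        mistakes = 0
        for x, desired in samples:
            out = predict(w, x)
            if out == desired:
                continue                  # correct: do nothing
            sign = -1 if out == 1 else 1  # wrong "yes": subtract; otherwise add
            w = [wi + sign * xi for wi, xi in zip(w, x)]
            updates.append(tuple(w))
            mistakes += 1
        if mistakes == 0:
            break
    return w, updates

# Logical OR, each sample written as ((x0, x1, x2), desired output).
OR_SAMPLES = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
weights, updates = train_perceptron(OR_SAMPLES, [0, 0, 0])
print(updates)   # successive weight vectors
print(weights)   # final weights
```

The recorded updates are (1,0,1), (0,0,1), (1,1,1), (0,1,1), matching the trace on the slide; a fourth pass makes no corrections.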

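  23. Perceptron Convergence Theorem • If there exists a vector $v$ s.t. $v \cdot x > 0$ for all positive examples $x$ (negative examples negated), perceptron training will find such a vector • Assume $\|v\| = 1$ and $v \cdot x \ge \delta > 0$ for all +ive examples $x$ • Each update adds a misclassified $x$ (one with $w \cdot x \le 0$), so $v \cdot w$ increases by at least $\delta$ in each iteration • $\|w\|^2$ increases by at most $\|x\|^2$ in each iteration: $\|w + x\|^2 \le \|w\|^2 + \|x\|^2$, so after $k$ updates from $w = 0$, $\|w\|^2 \le k\|x\|^2$ (taking $\|x\|$ as the largest example norm) • $\sqrt{k}\,\delta / \|x\| \le v \cdot w / \|w\| \le 1$ • Converges in $k \le \|x\|^2/\delta^2 = O(1/\delta^2)$ steps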

  24. Perceptron Learning • Perceptrons learn linear decision boundaries • E.g. [Figure: + and 0 points in the x1-x2 plane separated by a straight line] • But not XOR: [Figure: + and 0 points at alternating corners of the x1-x2 plane]
      X1   X2   Required
      -1   -1   w1x1 + w2x2 < 0
       1   -1   w1x1 + w2x2 > 0  => (with row 1) implies w1 > 0
       1    1   w1x1 + w2x2 > 0  => but should be false
      -1    1   w1x1 + w2x2 > 0  => (with row 1) implies w2 > 0

  25. Perceptron Example • Digit recognition • Assume display= 8 lightable bars • Inputs – on/off + threshold • 65 steps to recognize “8”

  26. Perceptron Summary • Motivated by neuron activation • Simple training procedure • Guaranteed to converge • IF linearly separable

  27. Neural Nets • Multi-layer perceptrons • Inputs: real-valued • Intermediate “hidden” nodes • Output(s): one (or more) discrete-valued • [Figure: layered network with inputs X1 to X4, two hidden layers, and outputs Y1 and Y2]

  28. Neural Nets • Pro: More general than perceptrons • Not restricted to linear discriminants • Multiple outputs: one classification each • Con: No simple, guaranteed training procedure • Use greedy, hill-climbing procedure to train • “Gradient descent”, “Backpropagation”

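  29. Solving the XOR Problem • Network Topology: 2 hidden nodes, 1 output • [Figure: inputs x1 and x2 feed hidden nodes o1 and o2, which feed output y; each node also has a constant -1 input, weighted by w01, w02, and w03 respectively; connection weights w11, w12, w21, w22, w13, w23] • Desired behavior:
      x1   x2   o1   o2   y
      0    0    0    0    0
      1    0    0    1    1
      0    1    0    1    1
      1    1    1    1    0
      Weights: w11 = w12 = 1; w21 = w22 = 1; w01 = 3/2; w02 = 1/2; w03 = 1/2; w13 = -1; w23 = 1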
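
The listed weights can be checked with a small sketch. It assumes a reading of the figure in which w_ij is the weight from input (or hidden node) i to node j, each w0j multiplies a constant -1 input, and every node applies a step threshold at zero; the function names are hypothetical.

```python
def step(z):
    """Hard threshold: fire when the net input is positive."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    """Return (o1, o2, y) for the two-hidden-node network on the slide."""
    o1 = step(1 * x1 + 1 * x2 - 3 / 2)    # w11 = w21 = 1, w01 = 3/2: acts as AND
    o2 = step(1 * x1 + 1 * x2 - 1 / 2)    # w12 = w22 = 1, w02 = 1/2: acts as OR
    y = step(-1 * o1 + 1 * o2 - 1 / 2)    # w13 = -1, w23 = 1, w03 = 1/2
    return o1, o2, y

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))
```

Under that reading, o1 computes AND, o2 computes OR, and the output fires for "OR but not AND", which is XOR.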

  30. Neural Net Applications • Speech recognition • Handwriting recognition • NETtalk: Letter-to-sound rules • ALVINN: Autonomous driving

  31. ALVINN • Driving as a neural network • Inputs: • Image pixel intensities • I.e. lane lines • 5 Hidden nodes • Outputs: • Steering actions • E.g. turn left/right; how far • Training: • Observe human behavior: sample images, steering

  32. Backpropagation • Greedy, Hill-climbing procedure • Weights are parameters to change • Original hill-climb changes one parameter/step • Slow • If smooth function, change all parameters/step • Gradient descent • Backpropagation: Computes current output, works backward to correct error

  33. Producing a Smooth Function • Key problem: • Pure step threshold is discontinuous • Not differentiable • Solution: • Sigmoid (squashed ‘s’ function): Logistic fn
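
For reference, the logistic function named on this slide has the standard form below. The symbol s is chosen to match the notation the later slides use for the sigmoid; the limits are included only to show how it approximates a step.

```latex
s(z) = \frac{1}{1 + e^{-z}}, \qquad
\lim_{z \to -\infty} s(z) = 0, \quad
s(0) = \tfrac{1}{2}, \quad
\lim_{z \to +\infty} s(z) = 1
```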

  34. Neural Net Training • Goal: • Determine how to change weights to get correct output • Large change in weight to produce large reduction in error • Approach: • Compute actual output: o • Compare to desired output: d • Determine effect of each weight w on error = d-o • Adjust weights

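  35. Neural Net Example • [Figure: two-layer network; inputs x1 and x2 feed hidden units with weighted sums z1, z2 and outputs y1, y2, which feed an output unit with weighted sum z3 and output y3; weights w11, w12, w21, w22, w13, w23, plus threshold weights w01, w02, w03 on constant -1 inputs] • xi: ith sample input vector • w: weight vector • yi*: desired output for ith sample • Sum of squares error over training samples: $E = \frac{1}{2}\sum_i (y_i^* - F(x_i, w))^2$ • Full expression of output in terms of input and weights: $y_3 = F(x, w) = s(w_{13}\, s(w_{11}x_1 + w_{21}x_2 - w_{01}) + w_{23}\, s(w_{12}x_1 + w_{22}x_2 - w_{02}) - w_{03})$ • From 6.034 notes, Lozano-Perez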

  36. Gradient Descent • Error: Sum of squares error of inputs with current weights • Compute rate of change of error wrt each weight • Which weights have greatest effect on error? • Effectively, partial derivatives of error wrt weights • In turn, depend on other weights => chain rule

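  37. Gradient Descent • E = G(w): error as function of weights • Find rate of change of error: dG/dw • Follow steepest rate of change • Change weights s.t. error is minimized • [Figure: error E = G(w) plotted against w, with successive weights w0 and w1 marked on the curve and local minima indicated]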

  38. Gradient of Error • [Figure: the network from slide 35, with weighted sums z1, z2, z3 and unit outputs y1, y2, y3] • Note: derivative of sigmoid: $\frac{ds(z_1)}{dz_1} = s(z_1)(1 - s(z_1))$ • From MIT AI lecture notes (6.034), Lozano-Perez 2000
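
The sigmoid derivative identity noted on this slide, ds/dz = s(z)(1 - s(z)), can be checked numerically. The sketch assumes s is the logistic function 1/(1 + e^(-z)); `numeric_derivative` is a hypothetical helper using a central difference.

```python
import math

def s(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def ds_dz(z):
    """Closed-form derivative from the slide: s(z) * (1 - s(z))."""
    return s(z) * (1.0 - s(z))

def numeric_derivative(f, z, h=1e-6):
    """Central-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2 * h)

for z in (-2.0, 0.0, 0.5, 3.0):
    print(z, ds_dz(z), numeric_derivative(s, z))
```

The derivative peaks at z = 0, where it equals 0.25, and shrinks toward zero as the unit saturates in either direction.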

  39. From Effect to Update • Gradient computation: • How each weight contributes to performance • To train: • Need to determine how to CHANGE weight based on contribution to performance • Need to determine how MUCH change to make per iteration • Rate parameter ‘r’ • Large enough to learn quickly • Small enough to reach but not overshoot target values
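
The trade-off in choosing the rate parameter can be illustrated on a toy problem. The error function G(w) = (w - 3)^2, the starting point, and the three rates below are assumptions chosen for illustration, not values from the lecture.

```python
def descend(r, w=0.0, steps=20):
    """Gradient descent on the toy error G(w) = (w - 3)^2, with gradient 2(w - 3)."""
    for _ in range(steps):
        w -= r * 2 * (w - 3)     # move against the gradient, scaled by rate r
    return w

print(descend(0.01))   # too small: still far from the minimum at w = 3
print(descend(0.1))    # moderate: close to 3 after 20 steps
print(descend(1.1))    # too large: each step overshoots and the error grows
```

Each step multiplies the distance to the minimum by (1 - 2r), so r = 0.1 shrinks it by a factor of 0.8 per step, while r = 1.1 multiplies it by -1.2 and diverges.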

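  40. Backpropagation Procedure • [Figure: nodes i, j, k in successive layers] • Pick rate parameter ‘r’ • Until performance is good enough, • Do forward computation to calculate output • Compute Beta in output node with $\beta_z = d_z - o_z$ • Compute Beta in all other nodes with $\beta_j = \sum_k w_{j \to k}\, o_k (1 - o_k)\, \beta_k$ • Compute change for all weights with $\Delta w_{i \to j} = r\, o_i\, o_j (1 - o_j)\, \beta_j$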
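
One iteration of the procedure can be sketched for a network with two inputs, two hidden nodes, and one output. The update rules used here are assumed to be the standard ones for sigmoid units with squared error: beta = d - o at the output, beta_j = w_jk * o_k * (1 - o_k) * beta_k at a hidden node, and weight change r * o_i * o_j * (1 - o_j) * beta_j. The starting weights, the sample, and all function names are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w_hidden, w_out, x):
    """Return (inputs, hidden, output); each list is led by the constant -1 input."""
    inputs = [-1.0] + list(x)
    hidden = [-1.0] + [sigmoid(sum(w * v for w, v in zip(wj, inputs))) for wj in w_hidden]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hidden)))
    return inputs, hidden, output

def backprop_step(w_hidden, w_out, x, d, r):
    """Apply one weight update for sample x with desired output d and rate r."""
    inputs, hidden, o = forward(w_hidden, w_out, x)
    beta_out = d - o                                   # Beta in the output node
    beta_hidden = [w_out[j + 1] * o * (1 - o) * beta_out
                   for j in range(len(w_hidden))]      # Beta in the hidden nodes
    new_out = [w + r * hidden[i] * o * (1 - o) * beta_out
               for i, w in enumerate(w_out)]
    new_hidden = [
        [w + r * inputs[i] * hidden[j + 1] * (1 - hidden[j + 1]) * beta_hidden[j]
         for i, w in enumerate(wj)]
        for j, wj in enumerate(w_hidden)
    ]
    return new_hidden, new_out

def error(w_hidden, w_out, x, d):
    """Squared error on one sample, with the conventional factor of 1/2."""
    return 0.5 * (d - forward(w_hidden, w_out, x)[2]) ** 2

# Hypothetical starting weights; index 0 of each list is the threshold weight.
W_HIDDEN = [[0.1, 0.2, -0.3], [-0.2, 0.4, 0.1]]
W_OUT = [0.3, -0.1, 0.2]
X, D, R = (1.0, 0.0), 1.0, 0.1

new_hidden, new_out = backprop_step(W_HIDDEN, W_OUT, X, D, R)
print(error(W_HIDDEN, W_OUT, X, D), "->", error(new_hidden, new_out, X, D))
```

Each weight change equals -r times the partial derivative of the squared error with respect to that weight, so a single small step lowers the error on the sample.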

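  41. Backprop Example • [Figure: network with inputs x1, x2; hidden units z1/y1 and z2/y2; output unit z3/y3; weights w11, w12, w21, w22, w13, w23 and threshold weights w01, w02, w03 on constant -1 inputs] • Forward prop: Compute zi and yi given xk, wl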

  42. Backpropagation Observations • Procedure is (relatively) efficient • All computations are local • Use inputs and outputs of current node • What is “good enough”? • Rarely reach target (0 or 1) outputs • Typically, train until within 0.1 of target

  43. Neural Net Summary • Training: • Backpropagation procedure • Gradient descent strategy (usual problems) • Prediction: • Compute outputs based on input vector & weights • Pros: Very general, Fast prediction • Cons: Training can be VERY slow (1000’s of epochs), Overfitting

  44. Training Strategies • Online training: • Update weights after each sample • Offline (batch training): • Compute error over all samples • Then update weights • Online training “noisy” • Sensitive to individual instances • However, may escape local minima

  45. Training Strategy • To avoid overfitting: • Split data into: training, validation, & test • Also, avoid excess weights (less than # samples) • Initialize with small random weights • Small changes have noticeable effect • Use offline training • Until validation set minimum • Evaluate on test set • No more weight changes
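
The "until validation set minimum" rule can be sketched as a scan over per-epoch validation errors. The `patience` parameter (how many non-improving epochs to tolerate before stopping) is an assumption not mentioned on the slide, and the error values are made-up illustrative numbers.

```python
def best_epoch(validation_errors, patience=3):
    """Return the epoch with the lowest validation error seen before stopping.

    Scanning stops once the error has failed to improve for `patience`
    consecutive epochs, so later epochs are never examined.
    """
    best, best_err, waited = 0, float("inf"), 0
    for epoch, err in enumerate(validation_errors):
        if err < best_err:
            best, best_err, waited = epoch, err, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best

# Hypothetical validation errors: they fall, then rise as the net overfits.
errors = [0.9, 0.6, 0.4, 0.35, 0.37, 0.41, 0.5, 0.2]
print(best_epoch(errors))
```

In practice the weights saved at the returned epoch would be the ones evaluated on the test set, with no further weight changes. Note that the scan stops at epoch 6 and never sees the final value in the list.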

  46. Classification • Neural networks best for classification task • Single output -> Binary classifier • Multiple outputs -> Multiway classification • Applied successfully to learning pronunciation • Sigmoid pushes to binary classification • Not good for regression

  47. Neural Net Example • NETtalk: Letter-to-sound by net • Inputs: • Need context to pronounce • 7-letter window: predict sound of middle letter • 29 possible characters – alphabet+space+,+. • 7*29=203 inputs • 80 Hidden nodes • Output: Generate 60 phones • Nodes map to 26 units: 21 articulatory, 5 stress/sil • Vector quantization of acoustic space
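
The input count of 7*29 = 203 is consistent with a one-of-29 (one-hot) code per window position. The sketch below assumes that encoding, assumes the 29 characters are the lowercase letters plus space, comma, and period, and pads word edges with spaces; none of these details are confirmed by the slide, and the function names are hypothetical.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz ,."   # 26 letters + space + comma + period
WINDOW = 7

def encode_window(window):
    """One-hot encode a 7-character window into a 203-element input vector."""
    if len(window) != WINDOW:
        raise ValueError("window must contain exactly 7 characters")
    vec = [0] * (WINDOW * len(ALPHABET))
    for pos, ch in enumerate(window.lower()):
        vec[pos * len(ALPHABET) + ALPHABET.index(ch)] = 1
    return vec

def windows(text):
    """Yield one window per character, centered on the letter to pronounce."""
    pad = " " * (WINDOW // 2)                # assumed: pad the edges with spaces
    padded = pad + text + pad
    return [padded[i:i + WINDOW] for i in range(len(text))]

for w in windows("cat"):
    print(repr(w), sum(encode_window(w)))
```

Each window activates exactly seven inputs, one per position, and the network predicts the sound of the middle letter.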

  48. Neural Net Example: NETtalk • Learning to talk: • 5 iterations/1024 training words: bound/stress • 10 iterations: intelligible • 400 new test words: 80% correct • Not as good as DecTalk, but automatic

  49. Neural Net Conclusions • Simulation based on neurons in brain • Perceptrons (single neuron) • Guaranteed to find linear discriminant • IF one exists -> problem XOR • Neural nets (Multi-layer perceptrons) • Very general • Backpropagation training procedure • Gradient descent - local min, overfitting issues
