Learning: Nearest Neighbor, Perceptrons & Neural Nets Artificial Intelligence CSPP 56553 February 4, 2004
Nearest Neighbor Example II
• Credit Rating:
  • Classifier: Good / Poor
  • Features:
    • L = # late payments/yr
    • R = Income/Expenses

Name | L  | R    | G/P
A    |  0 | 1.20 | G
B    | 25 | 0.40 | P
C    |  5 | 0.70 | G
D    | 20 | 0.80 | P
E    | 30 | 0.85 | P
F    | 11 | 1.20 | G
G    |  7 | 1.15 | G
H    | 15 | 0.80 | P
Nearest Neighbor Example II
• [Scatter plot: the training instances A–H plotted in the (L, R) plane, with L from 0 to 30 on the x-axis and R up to about 1.2 on the y-axis, each point labeled Good (G) or Poor (P)]
Nearest Neighbor Example II
• New instances to classify:

Name | L  | R    | G/P
I    |  6 | 1.15 | G
J    | 22 | 0.45 | P
K    | 15 | 1.20 | ??

• [Scatter plot: I, J, K added to the plot of the training instances]
• Distance measure: sqrt((L1 - L2)^2 + (sqrt(10) * (R1 - R2))^2)  (scaled distance)
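A minimal sketch of the classifier on this slide, assuming the training table and scaled distance above; the function names are illustrative, not part of the slides:

```python
import math

# Training instances from the slide: (name, L, R, label)
TRAIN = [
    ("A", 0, 1.20, "G"), ("B", 25, 0.40, "P"), ("C", 5, 0.70, "G"),
    ("D", 20, 0.80, "P"), ("E", 30, 0.85, "P"), ("F", 11, 1.20, "G"),
    ("G", 7, 1.15, "G"), ("H", 15, 0.80, "P"),
]

def scaled_distance(l1, r1, l2, r2):
    """Scaled Euclidean distance from the slide: R is stretched by sqrt(10)
    so that small changes in R count comparably to changes in L."""
    return math.sqrt((l1 - l2) ** 2 + (math.sqrt(10) * (r1 - r2)) ** 2)

def classify(l, r):
    """Label a new applicant with the label of its single nearest neighbor."""
    _, label = min(
        (scaled_distance(l, r, tl, tr), tlabel) for _, tl, tr, tlabel in TRAIN
    )
    return label

# New instances I, J, K from the slide
for name, l, r in [("I", 6, 1.15), ("J", 22, 0.45), ("K", 15, 1.2)]:
    print(name, classify(l, r))
```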
Nearest Neighbor: Issues
• Prediction can be expensive if there are many features
• Affected by classification noise and feature noise
  • A single noisy entry can change the prediction
• Definition of the distance metric
  • How to combine different features
  • Different types and ranges of values
• Sensitive to feature selection
Efficient Implementations • Classification cost: • Find nearest neighbor: O(n) • Compute distance between unknown and all instances • Compare distances • Problematic for large data sets • Alternative: • Use binary search to reduce to O(log n)
Efficient Implementation: K-D Trees
• Divide instances into sets based on features
  • Binary branching: e.g., feature > value
  • A balanced tree over n instances has 2^d leaves at depth d, so d = O(log n)
• To split cases into sets:
  • If there is one element in the set, stop
  • Otherwise pick a feature to split on
    • Find the average position of the two middle objects on that dimension
    • Split the remaining objects based on that average position
    • Recursively split the subsets
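A sketch of the construction just described, assuming two features and labeled points; the class names and the one-instance-per-leaf representation are illustrative, and the query shown simply descends to a leaf (an exact nearest-neighbor search would add backtracking):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Leaf: holds a single (point, label) pair; internal node: a split test.
    point: Optional[tuple] = None
    label: Optional[str] = None
    dim: Optional[int] = None        # feature index tested at this node
    threshold: Optional[float] = None
    left: Optional["Node"] = None    # points with point[dim] <= threshold
    right: Optional["Node"] = None   # points with point[dim] >  threshold

def build(points, depth=0):
    """Recursively split labeled points ((x0, x1), label) into a k-d tree."""
    if len(points) == 1:
        pt, lab = points[0]
        return Node(point=pt, label=lab)
    dim = depth % 2                           # cycle through the two features
    points = sorted(points, key=lambda p: p[0][dim])
    mid = len(points) // 2
    # Average position of the two middle objects on that dimension
    threshold = (points[mid - 1][0][dim] + points[mid][0][dim]) / 2
    return Node(
        dim=dim, threshold=threshold,
        left=build(points[:mid], depth + 1),
        right=build(points[mid:], depth + 1),
    )

def query(node, pt):
    """Follow the split tests down to a leaf and return its label."""
    while node.point is None:
        node = node.left if pt[node.dim] <= node.threshold else node.right
    return node.label
```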
K-D Trees: Classification
• [Decision tree: the root tests R > 0.825?; internal nodes test L > 17.5?, L > 9?, R > 0.6?, R > 0.75?, R > 1.175?, R > 1.025?; each leaf is labeled Good or Poor]
Efficient Implementation:Parallel Hardware • Classification cost: • # distance computations • Const time if O(n) processors • Cost of finding closest • Compute pairwise minimum, successively • O(log n) time
Nearest Neighbor: Analysis • Issue: • What features should we use? • E.g. Credit rating: Many possible features • Tax bracket, debt burden, retirement savings, etc.. • Nearest neighbor uses ALL • Irrelevant feature(s) could mislead • Fundamental problem with nearest neighbor
Nearest Neighbor: Advantages • Fast training: • Just record feature vector - output value set • Can model wide variety of functions • Complex decision boundaries • Weak inductive bias • Very generally applicable
Summary: Nearest Neighbor • Nearest neighbor: • Training: record input vectors + output value • Prediction: closest training instance to new data • Efficient implementations • Pros: fast training, very general, little bias • Cons: distance metric (scaling), sensitivity to noise & extraneous features
Learning: Perceptrons Artificial Intelligence CSPP 56553 February 4, 2004
Agenda • Neural Networks: • Biological analogy • Perceptrons: Single layer networks • Perceptron training • Perceptron convergence theorem • Perceptron limitations • Conclusions
Neurons: The Concept
• [Diagram of a neuron: dendrites, cell body, nucleus, axon]
• Neurons:
  • Receive inputs from other neurons (via synapses)
  • When input exceeds threshold, “fires”
  • Sends output along axon to other neurons
• Brain: 10^11 neurons, 10^16 synapses
Artificial Neural Nets
• Simulated Neuron:
  • Node connected to other nodes via links
    • Links correspond to axon + synapse + dendrite
    • Links associated with a weight (like a synapse)
      • Multiplied by the output of the node
  • Node combines inputs via an activation function
    • E.g. sum of weighted inputs passed through a threshold
  • Simpler than real neuronal processes
Artificial Neural Net
• [Diagram: inputs x, each multiplied by a weight w, summed, then passed through a threshold to produce the output]
Perceptrons • Single neuron-like element • Binary inputs • Binary outputs • Weighted sum of inputs > threshold
Perceptron Structure
• [Diagram: inputs x0 = 1, x1, x2, x3, ..., xn with weights w0, w1, w2, w3, ..., wn feeding a single unit with output y]
• Output y = 1 if Σi wi·xi > 0, else 0
• The fixed input x0 = 1 with weight w0 compensates for the threshold
Perceptron Example
• Logical-OR: linearly separable
  • 00: 0; 01: 1; 10: 1; 11: 1
• [Plots: the OR truth table in the (x1, x2) plane; a single line separates the lone 0 point from the three + points, and more than one such line works]
Perceptron Convergence Procedure
• Straightforward training procedure
  • Learns linearly separable functions
• Until the perceptron yields the correct output for all training samples:
  • If the perceptron is correct, do nothing
  • If the perceptron is wrong:
    • If it incorrectly says “yes”, subtract the input vector from the weight vector
    • Otherwise, add the input vector to the weight vector
Perceptron Convergence Example
• LOGICAL-OR:

Sample | x0 | x1 | x2 | Desired Output
1      |  1 |  0 |  0 | 0
2      |  1 |  0 |  1 | 1
3      |  1 |  1 |  0 | 1
4      |  1 |  1 |  1 | 1

• Initial: w = (0 0 0); after S2, w = w + s2 = (1 0 1)
• Pass 2: S1: w = w - s1 = (0 0 1); S3: w = w + s3 = (1 1 1)
• Pass 3: S1: w = w - s1 = (0 1 1)
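A short sketch of the convergence procedure applied to this LOGICAL-OR example; it mirrors the hand trace above (add the input on a false “no”, subtract it on a false “yes”):

```python
# Samples: (x0 = 1 bias input, x1, x2), desired output
SAMPLES = [
    ((1, 0, 0), 0),
    ((1, 0, 1), 1),
    ((1, 1, 0), 1),
    ((1, 1, 1), 1),
]

def output(w, x):
    # Fire (output 1) when the weighted sum of inputs exceeds 0
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

w = [0, 0, 0]
converged = False
while not converged:
    converged = True
    for x, desired in SAMPLES:
        o = output(w, x)
        if o == desired:
            continue
        converged = False
        if o == 1:                                # incorrectly says "yes": subtract
            w = [wi - xi for wi, xi in zip(w, x)]
        else:                                     # incorrectly says "no": add
            w = [wi + xi for wi, xi in zip(w, x)]
print(w)  # ends at (0, 1, 1), matching pass 3 of the trace
```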
Perceptron Convergence Theorem
• If there exists a weight vector that correctly separates the training data, perceptron training will find one
• Proof sketch (negate the inputs of negative examples, so every example x should satisfy w·x > 0):
  • Assume a unit vector v and margin δ > 0 with v·x >= δ for all examples x
  • Each mistake adds the misclassified x to w, so v·w grows by at least δ: after k mistakes, v·w >= k·δ
  • A mistake on x means w·x <= 0, so ||w + x||^2 <= ||w||^2 + ||x||^2: ||w||^2 increases by at most ||x||^2 per mistake, giving ||w||^2 <= k·max||x||^2
  • Since v·w / ||w|| <= 1: k·δ <= ||w|| <= sqrt(k)·max||x||
• Converges after at most k <= (max||x|| / δ)^2 mistakes
Perceptron Learning
• Perceptrons learn linear decision boundaries
  • E.g. [plot: + points and 0 points in the (x1, x2) plane separated by a single line]
• But not XOR: [plot: + points at (1, -1) and (-1, 1), 0 points at (-1, -1) and (1, 1); no single line separates them]

X1 | X2 | Constraint on the weights
-1 | -1 | w1·x1 + w2·x2 < 0
 1 | -1 | w1·x1 + w2·x2 > 0  => implies w1 > 0
-1 |  1 | w1·x1 + w2·x2 > 0  => implies w2 > 0
 1 |  1 | w1·x1 + w2·x2 > 0 follows, but should be false

• The constraints cannot all hold, so XOR has no linear decision boundary
Perceptron Example • Digit recognition • Assume display= 8 lightable bars • Inputs – on/off + threshold • 65 steps to recognize “8”
Perceptron Summary • Motivated by neuron activation • Simple training procedure • Guaranteed to converge • IF linearly separable
Neural Nets
• Multi-layer perceptrons
  • Inputs: real-valued
  • Intermediate “hidden” nodes
  • Output(s): one (or more) discrete-valued
• [Diagram: inputs X1–X4 feed two layers of hidden nodes, which feed outputs Y1 and Y2]
Neural Nets • Pro: More general than perceptrons • Not restricted to linear discriminants • Multiple outputs: one classification each • Con: No simple, guaranteed training procedure • Use greedy, hill-climbing procedure to train • “Gradient descent”, “Backpropagation”
Solving the XOR Problem
• Network topology: 2 hidden nodes (o1, o2), 1 output (y); each node also receives a constant -1 input weighted by its threshold weight (w01, w02, w03)
• Desired behavior:

x1 | x2 | o1 | o2 | y
 0 |  0 |  0 |  0 | 0
 1 |  0 |  0 |  1 | 1
 0 |  1 |  0 |  1 | 1
 1 |  1 |  1 |  1 | 0

• Weights: w11 = w12 = 1; w21 = w22 = 1; w01 = 3/2; w02 = 1/2; w03 = 1/2; w13 = -1; w23 = 1
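A small check of the weights on this slide, assuming hard threshold units and a constant -1 input standing in for each node's threshold weight:

```python
def step(total):
    return 1 if total > 0 else 0

def xor_net(x1, x2):
    """Two hidden threshold units plus one output unit, using the slide's weights."""
    w11 = w12 = w21 = w22 = 1
    w01, w02, w03 = 1.5, 0.5, 0.5
    w13, w23 = -1, 1
    o1 = step(w11 * x1 + w21 * x2 - w01)    # fires only for (1, 1): acts as AND
    o2 = step(w12 * x1 + w22 * x2 - w02)    # fires if any input is on: acts as OR
    return step(w13 * o1 + w23 * o2 - w03)  # OR and not AND = XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))  # prints 0, 1, 1, 0
```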
Neural Net Applications • Speech recognition • Handwriting recognition • NETtalk: Letter-to-sound rules • ALVINN: Autonomous driving
ALVINN • Driving as a neural network • Inputs: • Image pixel intensities • I.e. lane lines • 5 Hidden nodes • Outputs: • Steering actions • E.g. turn left/right; how far • Training: • Observe human behavior: sample images, steering
Backpropagation • Greedy, Hill-climbing procedure • Weights are parameters to change • Original hill-climb changes one parameter/step • Slow • If smooth function, change all parameters/step • Gradient descent • Backpropagation: Computes current output, works backward to correct error
Producing a Smooth Function
• Key problem:
  • A pure step threshold is discontinuous
    • Not differentiable
• Solution:
  • Sigmoid (squashed ‘s’ function): the logistic function s(z) = 1 / (1 + e^(-z))
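A tiny sketch of the logistic function and the convenient form of its derivative (the derivative is used later in the gradient computation):

```python
import math

def sigmoid(z):
    """Logistic 'squashed s' function: a smooth, differentiable
    stand-in for the hard step threshold."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    """Derivative in its convenient form: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1 - s)
```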
Neural Net Training • Goal: • Determine how to change weights to get correct output • Large change in weight to produce large reduction in error • Approach: • Compute actual output: o • Compare to desired output: d • Determine effect of each weight w on error = d-o • Adjust weights
Neural Net Example
• [Network diagram: inputs x1, x2 (plus constant -1 threshold inputs) feed hidden units z1, z2 with outputs y1, y2; these feed the output unit z3 with output y3; hidden-layer weights w11, w21, w01, w12, w22, w02 and output weights w13, w23, w03]
• Notation: xi = ith sample input vector; w = weight vector; yi* = desired output for the ith sample
• Error: sum of squares over the training samples, E = Σi (yi* - yi)^2
• The output can be written as a full expression in terms of the inputs and the weights
• (From MIT 6.034 notes, Lozano-Perez)
Gradient Descent • Error: Sum of squares error of inputs with current weights • Compute rate of change of error wrt each weight • Which weights have greatest effect on error? • Effectively, partial derivatives of error wrt weights • In turn, depend on other weights => chain rule
Gradient Descent
• E = G(w): error as a function of the weights
• Find the rate of change of the error, dG/dw
• Follow the steepest rate of change
• Change the weights so that the error is minimized
• [Plot: G(w) versus w, showing the slope at a point and local minima near w0 and w1]
Gradient of Error
• [Same 2-2-1 network diagram as the Neural Net Example slide]
• Note: derivative of the sigmoid: ds(z1)/dz1 = s(z1)(1 - s(z1))
• (From MIT AI lecture notes, Lozano-Perez 2000)
From Effect to Update
• Gradient computation:
  • How each weight contributes to performance
• To train:
  • Need to determine how to CHANGE each weight based on its contribution to performance
  • Need to determine how MUCH change to make per iteration
    • Rate parameter ‘r’
      • Large enough to learn quickly
      • Small enough to reach, but not overshoot, the target values
Backpropagation Procedure
• (Notation: a link runs from node i to node j; node j feeds nodes k downstream; oj is the output of node j)
• Pick rate parameter ‘r’
• Until performance is good enough:
  • Do forward computation to calculate the outputs
  • Compute Beta in the output node: βz = desired output - actual output
  • Compute Beta in all other nodes: βj = Σk wj→k · ok·(1 - ok) · βk
  • Compute the change for all weights: Δwi→j = r · oi · oj·(1 - oj) · βj
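A compact sketch of this procedure for the 2-2-1 network used in these slides, assuming sigmoid units, a constant -1 threshold input per node, and the Beta bookkeeping above; the initial weights, rate parameter, and epoch count are illustrative choices, not values from the slides:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_xor(r=0.5, epochs=20000, seed=0):
    """Backpropagation on a 2-2-1 sigmoid network; a constant -1 input
    to each unit stands in for its threshold weight."""
    rng = random.Random(seed)
    # w_h[j] = weights into hidden unit j: [w for x1, w for x2, w for the -1 input]
    w_h = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
    w_o = [rng.uniform(-1, 1) for _ in range(3)]  # [w from o1, w from o2, w for -1]
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

    for _ in range(epochs):
        for (x1, x2), d in data:
            # Forward computation
            o_h = [sigmoid(w[0] * x1 + w[1] * x2 + w[2] * -1) for w in w_h]
            y = sigmoid(w_o[0] * o_h[0] + w_o[1] * o_h[1] + w_o[2] * -1)
            # Beta at the output node: desired minus actual
            beta_y = d - y
            # Beta at each hidden node: weight to the output * y(1 - y) * output Beta
            beta_h = [w_o[j] * y * (1 - y) * beta_y for j in range(2)]
            # Weight changes: r * (input on the link) * o_j(1 - o_j) * Beta_j
            for j in range(2):
                w_o[j] += r * o_h[j] * y * (1 - y) * beta_y
                for i, inp in enumerate((x1, x2, -1)):
                    w_h[j][i] += r * inp * o_h[j] * (1 - o_h[j]) * beta_h[j]
            w_o[2] += r * -1 * y * (1 - y) * beta_y
    return w_h, w_o
```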
Backprop Example
• [Same 2-2-1 network diagram: inputs x1, x2, hidden units z1, z2 with outputs y1, y2, output unit z3 with output y3, constant -1 threshold inputs, weights w11, w21, w01, w12, w22, w02, w13, w23, w03]
• Forward prop: compute the zi and yi given the xk and the weights wl
Backpropagation Observations • Procedure is (relatively) efficient • All computations are local • Use inputs and outputs of current node • What is “good enough”? • Rarely reach target (0 or 1) outputs • Typically, train until within 0.1 of target
Neural Net Summary • Training: • Backpropagation procedure • Gradient descent strategy (usual problems) • Prediction: • Compute outputs based on input vector & weights • Pros: Very general, Fast prediction • Cons: Training can be VERY slow (1000’s of epochs), Overfitting
Training Strategies • Online training: • Update weights after each sample • Offline (batch training): • Compute error over all samples • Then update weights • Online training “noisy” • Sensitive to individual instances • However, may escape local minima
Training Strategy
• To avoid overfitting:
  • Split data into: training, validation, & test
    • Also, avoid excess weights (fewer weights than samples)
  • Initialize with small random weights
    • Small changes have noticeable effect
  • Use offline training, until the validation-set error reaches its minimum
    • Then make no more weight changes and evaluate on the test set
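A schematic of this strategy, assuming caller-supplied train_step and validation_error routines (both hypothetical) and a simple patience rule to detect the validation minimum:

```python
def train_with_early_stopping(train_step, validation_error,
                              max_epochs=1000, patience=10):
    """Offline (batch) training until the validation-set error stops improving.
    train_step() runs one pass of batch weight updates; validation_error()
    measures error on the held-out validation set. The test set is only
    looked at after training stops."""
    best_err = float("inf")
    epochs_since_best = 0
    for _ in range(max_epochs):
        train_step()
        err = validation_error()
        if err < best_err:
            best_err = err
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # assume the minimum was reached
                break
    return best_err
```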
Classification
• Neural networks are best suited to classification tasks
  • Single output -> binary classifier
  • Multiple outputs -> multiway classification
    • Applied successfully to learning pronunciation
  • The sigmoid pushes outputs toward binary decisions
• Not good for regression
Neural Net Example
• NETtalk: letter-to-sound by net
• Inputs:
  • Need context to pronounce
  • 7-letter window: predict the sound of the middle letter
  • 29 possible characters: alphabet + space + comma + period
  • 7 * 29 = 203 inputs
• 80 hidden nodes
• Output: generate 60 phones
  • Nodes map to 26 units: 21 articulatory, 5 stress/sil
  • Vector quantization of acoustic space
Neural Net Example: NETtalk • Learning to talk: • 5 iterations/1024 training words: bound/stress • 10 iterations: intelligible • 400 new test words: 80% correct • Not as good as DecTalk, but automatic
Neural Net Conclusions
• Simulation based on neurons in the brain
• Perceptrons (single neuron)
  • Guaranteed to find a linear discriminant
    • IF one exists (fails on XOR, which is not linearly separable)
• Neural nets (multi-layer perceptrons)
  • Very general
  • Backpropagation training procedure
    • Gradient descent: local minima and overfitting issues