Artificial Neural Networks

Outline • Biological Motivation • Perceptron • Gradient Descent • Least Mean Square Error • Multi-layer networks • Sigmoid node • Backpropagation

Biological Neural Systems • Neuron switching time: > $10^{-3}$ secs • Number of neurons in the human brain: ~$10^{10}$ • Connections (synapses) per neuron: ~$10^{4}$ to $10^{5}$ • Face recognition: 0.1 secs • High degree of parallel computation • Distributed representations

Artificial Neural Networks • Many simple neuron-like threshold units • Many weighted interconnections • Multiple outputs • Highly parallel and distributed processing • Learning by tuning the connection weights

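Perceptron: Linear threshold unit • Inputs $x_1, x_2, \ldots, x_n$ plus a constant input $x_0 = 1$ • Weights $w_0, w_1, \ldots, w_n$ • The unit computes the weighted sum $\sum_{i=0}^{n} w_i x_i$ • Output: $o(x) = 1$ if $\sum_{i=0}^{n} w_i x_i > 0$, and $o(x) = -1$ otherwise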

Decision Surface of a Perceptron • [Figure: two panels of + and − points in the $(x_1, x_2)$ plane, one labeled "Linearly Separable" and one labeled "Xor"] • Theorem: VC-dim = n+1

Perceptron Learning Rule • S: sample • $x$: input vector with components $x_i$ • $t = c(x)$ is the target value • $o$ is the perceptron output • $\eta$: learning rate (a small constant), assume $\eta = 1$ • $w_i = w_i + \Delta w_i$ • $\Delta w_i = \eta (t - o) x_i$

Perceptron Algo. • Correct output ($t = o$): weights are unchanged • Incorrect output ($t \neq o$): change the weights! • False negative ($t = 1$ and $o = -1$): add $x$ to $w$ • False positive ($t = -1$ and $o = 1$): subtract $x$ from $w$

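Perceptron Learning Rule: worked example • [Figure: decision boundary in the $(x_1, x_2)$ plane after each update, with regions labeled o = 1 and o = −1 and points labeled t = 1 and t = −1] • Start with $w = [w_0, w_1, w_2] = [0.25, -0.1, 0.5]$, i.e. the boundary $x_2 = 0.2 x_1 - 0.5$ • $(x,t) = ([-1,-1], 1)$: $o = \mathrm{sgn}(0.25 + 0.1 - 0.5) = -1$, so $\Delta w = [0.2, -0.2, -0.2]$ • $(x,t) = ([2,1], -1)$: $o = \mathrm{sgn}(0.45 - 0.6 + 0.3) = 1$, so $\Delta w = [-0.2, -0.4, -0.2]$ • $(x,t) = ([1,1], 1)$: $o = \mathrm{sgn}(0.25 - 0.7 + 0.1) = -1$, so $\Delta w = [0.2, 0.2, 0.2]$ • These updates match $\Delta w_i = \eta (t - o) x_i$ with $\eta = 0.1$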
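The three updates in this example can be replayed in a few lines of Python. This is a minimal sketch, not the slides' own code: the function names are made up for illustration, and the learning rate 0.1 is inferred from the size of the weight changes shown on the slide.

```python
def perceptron_output(w, x):
    """Linear threshold unit: 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1."""
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1


def perceptron_update(w, x, t, eta):
    """Apply w_i <- w_i + eta * (t - o) * x_i, with x_0 = 1; return new weights."""
    o = perceptron_output(w, x)
    xb = [1.0] + list(x)  # prepend the constant input x_0 = 1
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, xb)]


# Initial weights and the three (x, t) pairs from the slide, in update order.
w = [0.25, -0.1, 0.5]
for x, t in [([-1, -1], 1), ([2, 1], -1), ([1, 1], 1)]:
    w = perceptron_update(w, x, t, eta=0.1)  # eta = 0.1 is an inferred value
    print([round(wi, 2) for wi in w])
```

Each printed vector is the weight vector after one of the three misclassified examples; when the output is already correct, `t - o` is zero and the weights stay put.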

Perceptron Algorithm: Analysis • Theorem: The number of errors of the Perceptron Algorithm is bounded • Proof: • Make all examples positive • Change $\langle x_i, b_i \rangle$ to $\langle b_i x_i, +1 \rangle$ • Margin of hyperplane w

Perceptron Algorithm: Analysis II • Let $m_i$ be the number of errors on $x_i$ • $M = \sum_i m_i$ • From the algorithm: $w = \sum_i m_i x_i$ • Let $w^*$ be a separating hyperplane

Perceptron Algorithm: Analysis III • Change in weights: • Since w errs on $x_i$, we have $w \cdot x_i \le 0$ • Total weight:

Perceptron Algorithm: Analysis IV • Consider the angle between w and w* • Putting it all together
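The formulas for these analysis slides did not survive extraction. The following is a sketch of the standard mistake-bound argument that the outline appears to follow, not a reconstruction of the original slides. The symbols $R$ and $\gamma$ are assumptions introduced here: $R = \max_i \|x_i\|$, and $\gamma = \min_i w^* \cdot x_i > 0$ with $\|w^*\| = 1$. It also assumes $\eta = 1$, initial weights $w = 0$, and all examples made positive.

```latex
\begin{align*}
% Progress along w^*: every error on x_i adds x_i to w.
w \cdot w^* &= \sum_i m_i \,(x_i \cdot w^*) \;\ge\; M\gamma \\
% Change in weights on one error (w \cdot x_i \le 0 because w errs on x_i).
\|w + x_i\|^2 &= \|w\|^2 + 2\, w \cdot x_i + \|x_i\|^2 \;\le\; \|w\|^2 + R^2 \\
% Total weight after M errors.
\|w\|^2 &\le M R^2 \\
% Angle between w and w^*: the cosine is at most 1.
1 \;\ge\; \cos\theta &= \frac{w \cdot w^*}{\|w\|\,\|w^*\|} \;\ge\; \frac{M\gamma}{\sqrt{M}\,R}
\quad\Longrightarrow\quad M \le \frac{R^2}{\gamma^2}
\end{align*}
```

Under these assumptions the number of errors is bounded by $R^2 / \gamma^2$, independent of the number of examples.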

Gradient Descent Learning Rule • Consider a linear unit without threshold and with continuous output o (not just −1, 1) • $o = w_0 + w_1 x_1 + \ldots + w_n x_n$ • Train the $w_i$'s such that they minimize the squared error • $E[w_0, \ldots, w_n] = \frac{1}{2} \sum_{d \in S} (t_d - o_d)^2$, where S is the set of training examples

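Gradient Descent • [Figure: one step on the error surface from $(w_1, w_2)$ to $(w_1 + \Delta w_1, w_2 + \Delta w_2)$] • Example training set: S = {<(1,1),1>, <(-1,-1),1>, <(1,-1),-1>, <(-1,1),-1>} • Gradient: $\nabla E[w] = [\partial E / \partial w_0, \ldots, \partial E / \partial w_n]$ • Update: $\Delta w = -\eta \nabla E[w]$, i.e. $\Delta w_i = -\eta \, \partial E / \partial w_i$ • $\partial E / \partial w_i = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2 = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - \sum_i w_i x_{i,d})^2 = \sum_d (t_d - o_d)(-x_{i,d})$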

Gradient Descent
Gradient-Descent(S: training_examples, $\eta$)
• Until TERMINATION Do
  • Initialize each $\Delta w_i$ to zero
  • For each <x,t> in S Do
    • Compute o = <x,w>
    • For each weight $w_i$ Do
      • $\Delta w_i = \Delta w_i + \eta (t - o) x_i$
  • For each weight $w_i$ Do
    • $w_i = w_i + \Delta w_i$
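The batch procedure above might look as follows in Python. This is a sketch under assumptions the slide leaves open: the weights start at zero, TERMINATION is a fixed number of epochs, and the function name, default values and toy data are chosen here for illustration only.

```python
def gradient_descent(samples, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit o = w0 + w1*x1 + ... + wn*xn.

    samples is a list of (x, t) pairs; returns the weight list [w0, ..., wn].
    """
    n = len(samples[0][0])
    w = [0.0] * (n + 1)  # assumption: start from zero weights
    for _ in range(epochs):  # assumption: terminate after a fixed epoch count
        delta = [0.0] * (n + 1)  # accumulated Delta w_i for this pass over S
        for x, t in samples:
            xb = [1.0] + list(x)  # constant input x_0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xb))
            for i, xi in enumerate(xb):
                delta[i] += eta * (t - o) * xi
        # Weights change only after the whole training set has been seen.
        w = [wi + di for wi, di in zip(w, delta)]
    return w


# Toy data generated from t = 1 + 2*x, so the minimum squared error is zero.
data = [([-1.0], -1.0), ([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0)]
print(gradient_descent(data))
```

On this toy data the returned weights approach w0 = 1 and w1 = 2. A learning rate that is too large for the data makes the same loop diverge.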

Incremental Stochastic Gradient Descent • Batch mode: gradient descent $w = w - \eta \nabla E_S[w]$ over the entire data S, where $E_S[w] = \frac{1}{2} \sum_{d \in S} (t_d - o_d)^2$ • Incremental mode: gradient descent $w = w - \eta \nabla E_d[w]$ over individual training examples d, where $E_d[w] = \frac{1}{2} (t_d - o_d)^2$ • Incremental gradient descent can approximate batch gradient descent arbitrarily closely if $\eta$ is small enough
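For contrast with batch mode, here is a sketch of the incremental variant, where the weights move after every single example. The function name, the zero initialization, the fixed epoch count and the toy data are illustrative assumptions.

```python
def incremental_gradient_descent(samples, eta=0.05, epochs=500):
    """Incremental gradient descent for a linear unit.

    Applies w <- w - eta * grad E_d[w] after each example d, which for
    E_d = 1/2 (t_d - o_d)^2 is w_i <- w_i + eta * (t_d - o_d) * x_i.
    """
    n = len(samples[0][0])
    w = [0.0] * (n + 1)  # assumption: start from zero weights
    for _ in range(epochs):
        for x, t in samples:
            xb = [1.0] + list(x)  # constant input x_0 = 1
            o = sum(wi * xi for wi, xi in zip(w, xb))
            # Update immediately instead of accumulating over the whole set.
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xb)]
    return w


# Same toy data as a batch run would use: t = 1 + 2*x.
data = [([-1.0], -1.0), ([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0)]
print(incremental_gradient_descent(data))
```

The only structural difference from batch mode is where the update happens: inside the loop over examples rather than after it.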

Comparison: Perceptron and Gradient Descent Rule • Perceptron learning rule: guaranteed to succeed if the training examples are linearly separable; no guarantee otherwise • Linear unit using gradient descent: converges to the hypothesis with minimum squared error, given a sufficiently small learning rate $\eta$, even when the training data contains noise and even when the training data is not linearly separable

Multi-Layer Networks • [Figure: layered network with an input layer, hidden layer(s), and an output layer]

Sigmoid Unit • Inputs $x_1, \ldots, x_n$ plus a constant input $x_0 = 1$, with weights $w_0, \ldots, w_n$ • $z = \sum_{i=0}^{n} w_i x_i$ • $o = \sigma(z)$ • $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function

Sigmoid Function • $\sigma(z) = 1/(1 + e^{-z})$ • $d\sigma(z)/dz = \sigma(z)(1 - \sigma(z))$ • Gradient descent rule for one sigmoid unit: $\partial E / \partial w_i = -\sum_d (t_d - o_d) \, o_d (1 - o_d) \, x_{i,d}$ • Multilayer networks of sigmoid units: backpropagation
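The gradient formula for a single sigmoid unit translates directly into code. The sketch below uses illustrative function names and assumes the squared error $E = \frac{1}{2} \sum_d (t_d - o_d)^2$ with a constant input $x_0 = 1$.

```python
import math


def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_unit_gradient(w, samples):
    """Return [dE/dw_0, ..., dE/dw_n] for one sigmoid unit.

    Uses dE/dw_i = -sum_d (t_d - o_d) * o_d * (1 - o_d) * x_{i,d}.
    """
    grad = [0.0] * len(w)
    for x, t in samples:
        xb = [1.0] + list(x)  # constant input x_0 = 1
        o = sigmoid(sum(wi * xi for wi, xi in zip(w, xb)))
        for i, xi in enumerate(xb):
            grad[i] += -(t - o) * o * (1.0 - o) * xi
    return grad


# Illustrative data: two examples with two inputs each.
samples = [([1.0, 2.0], 1.0), ([-1.0, 0.5], 0.0)]
print(sigmoid_unit_gradient([0.1, -0.2, 0.3], samples))
```

A gradient descent step then subtracts $\eta$ times this vector from the weights, exactly as for the linear unit but with the extra factor $o_d(1 - o_d)$.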

Backpropagation: overview • Make threshold units differentiable • Use sigmoid functions • Given a sample compute: • The error • The Gradient • Use the chain rule to compute the Gradient

Backpropagation Motivation • Consider the square error • $E_S[w] = \frac{1}{2} \sum_{d \in S} \sum_{k \in \text{output}} (t_{d,k} - o_{d,k})^2$ • Gradient: $\nabla E_S[w]$ • Update: $w = w - \eta \nabla E_S[w]$ • How do we compute the gradient?

Backpropagation: Algorithm • Forward phase: • Given input x, compute the output of each unit • Backward phase: • For each output unit k compute $\delta_k = o_k (1 - o_k)(t_k - o_k)$

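Backpropagation: Algorithm • Backward phase • For each hidden unit h compute: $\delta_h = o_h (1 - o_h) \sum_k w_{h,k} \delta_k$ • Update weights: • $w_{i,j} = w_{i,j} + \Delta w_{i,j}$ where $\Delta w_{i,j} = \eta \, \delta_j \, x_i$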
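A forward and backward pass for a network with one hidden layer of sigmoid units can be sketched as below. Everything concrete here is an assumption for illustration: the function names, the weight layout (one row per unit, bias weight first), and the standard sigmoid error terms $\delta_k = o_k(1 - o_k)(t_k - o_k)$ for outputs and $\delta_h = o_h(1 - o_h) \sum_k w_{h,k} \delta_k$ for hidden units.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W_hidden, W_out):
    """Forward phase: compute the output of each unit.

    Returns (xb, hb, o): inputs and hidden outputs with a leading constant 1,
    plus the list of network outputs.
    """
    xb = [1.0] + list(x)
    h = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in W_hidden]
    hb = [1.0] + h
    o = [sigmoid(sum(w * hi for w, hi in zip(row, hb))) for row in W_out]
    return xb, hb, o


def backprop_step(x, t, W_hidden, W_out, eta=0.5):
    """One incremental update on example (x, t); modifies the weights in place.

    Returns the squared error on this example before the update.
    """
    xb, hb, o = forward(x, W_hidden, W_out)
    # Backward phase, output units.
    delta_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Backward phase, hidden units (computed with the not-yet-updated W_out).
    delta_hid = [
        hb[h + 1] * (1 - hb[h + 1])
        * sum(W_out[k][h + 1] * delta_out[k] for k in range(len(W_out)))
        for h in range(len(W_hidden))
    ]
    # Weight updates: Delta w_{i,j} = eta * delta_j * x_i.
    for k, row in enumerate(W_out):
        for j in range(len(row)):
            row[j] += eta * delta_out[k] * hb[j]
    for h, row in enumerate(W_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_hid[h] * xb[i]
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o))


# Illustrative 2-2-1 network with hand-picked starting weights.
W_hidden = [[0.1, -0.2, 0.3], [-0.3, 0.2, 0.1]]
W_out = [[0.2, -0.1, 0.4]]
errors = [backprop_step([1.0, 0.0], [1.0], W_hidden, W_out) for _ in range(200)]
print(errors[0], errors[-1])
```

Repeating the step on a single example drives its error down. Training on a full data set would loop over all examples and typically start from small random weights.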

Backpropagation: output node

Backpropagation: inner node
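The derivations for the output-node and inner-node slides are not in the extracted text. The following chain-rule sketch is the standard one for sigmoid units and is offered as a plausible reading, not as the slides' exact content. The notation is assumed here: $net_j = \sum_i w_{i,j} x_i$ is the weighted input of unit $j$, $o_j = \sigma(net_j)$, and $E_d = \frac{1}{2} \sum_k (t_k - o_k)^2$ is the error on one example.

```latex
\begin{align*}
% Output node k: the error depends on w_{i,k} only through o_k.
\frac{\partial E_d}{\partial w_{i,k}}
  &= \frac{\partial E_d}{\partial o_k}\,
     \frac{\partial o_k}{\partial net_k}\,
     \frac{\partial net_k}{\partial w_{i,k}}
   = -(t_k - o_k)\, o_k (1 - o_k)\, x_i
   = -\delta_k\, x_i \\[4pt]
% Inner node h: the error depends on net_h through every output k fed by h.
\frac{\partial E_d}{\partial net_h}
  &= \sum_k \frac{\partial E_d}{\partial net_k}\,
            \frac{\partial net_k}{\partial o_h}\,
            \frac{\partial o_h}{\partial net_h}
   = \sum_k (-\delta_k)\, w_{h,k}\, o_h (1 - o_h)
   = -\delta_h \\[4pt]
\frac{\partial E_d}{\partial w_{i,h}}
  &= \frac{\partial E_d}{\partial net_h}\,
     \frac{\partial net_h}{\partial w_{i,h}}
   = -\delta_h\, x_i
\end{align*}
```

In both cases the gradient descent step $-\eta \, \partial E_d / \partial w_{i,j}$ has the same form, $\eta \, \delta_j \, x_i$, where $x_i$ is the $i$-th input of unit $j$.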

Backpropagation: Summary • Gradient descent over the entire network weight vector • Easily generalized to arbitrary directed graphs • Finds a local, not necessarily global, error minimum • In practice often works well • May require multiple runs with different initial weights • A variation is to include a momentum term: $\Delta w_{i,j}(n) = \eta \, \delta_j \, x_i + \alpha \, \Delta w_{i,j}(n-1)$ • Minimizes error over the training examples • Training is fairly slow, yet prediction is fast
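The momentum variation can be illustrated on a single weight. In this sketch the function name is hypothetical, `alpha` stands for the momentum coefficient, and each entry of `plain_steps` plays the role of the plain gradient step $\eta \, \delta_j \, x_i$ at iteration n.

```python
def momentum_steps(plain_steps, alpha=0.9):
    """Turn plain gradient steps into momentum steps.

    Implements Delta w(n) = plain_step(n) + alpha * Delta w(n-1),
    with Delta w(0) taken to be 0.
    """
    out, prev = [], 0.0
    for step in plain_steps:
        prev = step + alpha * prev
        out.append(prev)
    return out


# Three identical plain steps: momentum makes the effective step grow.
print(momentum_steps([1.0, 1.0, 1.0], alpha=0.5))  # [1.0, 1.5, 1.75]
```

When successive gradient steps point the same way, the effective step grows. When they alternate in sign, the momentum term damps the oscillation.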

Expressive Capabilities of ANN • Boolean functions: every Boolean function can be represented by a network with a single hidden layer, but this might require a number of hidden units exponential in the number of inputs • Continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989, Hornik 1989] • Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
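As a small instance of the Boolean case, XOR, which a single perceptron cannot represent, can be computed by threshold units with one hidden layer. The weights below are hand-picked for illustration and are only one of many possible choices. Inputs and outputs use the perceptron's {−1, +1} encoding, with +1 meaning true.

```python
def threshold_unit(w, x):
    """Perceptron output: 1 if w0 + sum_i w_i * x_i > 0, else -1."""
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1


# Hand-picked weights [w0, w1, w2]; values in {-1, +1}.
W_OR = [1, 1, 1]     # fires unless both inputs are -1
W_AND = [-1, 1, 1]   # fires only if both inputs are +1
W_OUT = [-1, 1, -1]  # fires if OR is true and AND is false


def xor_network(x1, x2):
    """Two hidden threshold units (OR, AND) feeding one output unit."""
    h_or = threshold_unit(W_OR, (x1, x2))
    h_and = threshold_unit(W_AND, (x1, x2))
    return threshold_unit(W_OUT, (h_or, h_and))


for a in (-1, 1):
    for b in (-1, 1):
        print(a, b, xor_network(a, b))
```

The hidden layer re-encodes the inputs so that the output unit faces a linearly separable problem.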

VC-dim of ANN • A more general bound. • Concept class F(C,G): • G : Directed acyclic graph • C: concept class, d=VC-dim(C) • n: input nodes • s : inner nodes (of degree r) Theorem: VC-dim(F(C,G)) < 2ds log (es)

Proof: • Bound $|F(C,G)(m)|$ • Find the smallest m s.t. $|F(C,G)(m)| < 2^m$ • Let $S = \{x_1, \ldots, x_m\}$ • For each fixed G we define a matrix U • $U[i,j] = c_i(x_j)$, where $c_i$ is a specific i-th concept • U describes the computations of S in G • TF(C,G) = number of different matrices

Proof (cont.) • Solve for: $(em/d)^{ds} \le 2^m$ • Holds for $m \ge 2ds \log(es)$ • QED • Back to ANN: • VC-dim(C) = n+1 • VC(ANN) $\le 2(n+1)\, s \log(es)$
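The "solve for" step can be checked numerically. The sketch below makes two assumptions the slide does not state: the logarithm is base 2, and there are at least two inner nodes (s ≥ 2). The function names and the sample values of n and s are illustrative.

```python
import math


def growth_bound_holds(d, s, m):
    """Check (e*m/d)**(d*s) <= 2**m, compared in log2 form to avoid overflow."""
    return d * s * math.log2(math.e * m / d) <= m


def vc_upper_bound(d, s):
    """The bound 2*d*s*log(e*s), assuming the log is base 2."""
    return 2 * d * s * math.log2(math.e * s)


for n, s in [(2, 2), (10, 5), (100, 20)]:
    d = n + 1  # VC-dim of a perceptron with n inputs
    m = math.ceil(vc_upper_bound(d, s))
    print(n, s, m, growth_bound_holds(d, s, m))
```

For each pair the inequality holds at m equal to the bound rounded up, which is what the proof needs. For much smaller m it fails, for example d = 3, s = 2, m = 10.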
