Discover the robust approach of artificial neural networks in approximating target functions over various domains. Learn about neural network applications, inspiration from natural systems, representations, appropriate problems, and more.
Artificial Neural Networks #1 • Machine Learning CH4: 4.1 – 4.5 • Promethea Pythaitha
Artificial Neural Networks • Robust approach to approximating target functions over attributes with continuous as well as discrete domains. • Can approximate unknown target functions. • Target function can be • Discrete-valued. • Real-valued. • Vector-valued.
Neural Net Applications • Robust to noise in training data • Among most successful methods at interpreting noisy real world sensor data. • Microphone input / speech recognition. • Camera input. • Handwriting recognition • Face recognition and Image Processing. • Robotic control. • Fuzzy neural nets.
Inspiration for Artificial Neural nets. • Natural learning systems (i.e. brains) are composed of very complex webs of interconnected neurons. • Each neuron receives signals (current spikes). • When a neuron's threshold is reached, the neuron sends its signal downstream to: • Other neurons • Physical actuators • Perception in the Neocortex • Etc.
Artificial Neural nets are built out of densely interconnected sets of units. • Each unit (artificial neuron) takes many real-valued inputs, and produces a real-valued output. • Output is then sent downstream to • Other units within net • Output layer of net.
Brain estimated to contain ~100 billion neurons. • Each neuron connects to an average of 10,000 others. • Neuron switching speeds are on the order of a thousandth of a second • (versus a ten-billionth of a second for logic gates) • Yet brains can make decisions, recognize images, etc., VERY fast.
Hypothesis: • Thought /information processing in the brain is result of massively parallelized processing of distributed inputs.
Neural Nets are built on this idea: • In parallel, process Distributed data.
Artificial vs. Natural. • Many complexities of natural neural nets are not present in artificial ones. • Feedback (uncommon in ANNs) • Etc. • Many features of artificial neural nets have no counterpart in natural ones. • Units in artificial neural nets produce a single constant output, rather than a time-varying sequence of current pulses.
Neural Network representations. • ALVINN • learned to drive an autonomous vehicle on the highway. • Input: 30 x 32 pixel matrix from a camera • 960 values (B/W pixel intensities) • Output: Steering direction for the vehicle. • (30 real values) • Two-layer Neural Net: • Input (not counted) • Hidden Layer • Output.
ALVINN explained • Typical Neural Net structure. • All 960 inputs in the 30x32 matrix are sent to the four hidden neurons/units, where weighted linear sums are computed. • Hidden unit: a unit whose output is only accessible within the net, not at the output layer. • Outputs from the 4 hidden neurons are sent downstream to 30 output neurons, each of which outputs a confidence value corresponding to steering in a specific direction. • Fuzzy truth?? • Probability measure? • The program chooses the direction with the highest confidence (see the sketch below).
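To make the data flow concrete, here is a minimal sketch (in Python) of a forward pass through an ALVINN-style net: 960 pixel inputs, 4 hidden units, 30 output confidences. The random weights, the sigmoid squashing function, and the omission of bias terms are illustrative assumptions, not ALVINN's actual parameters.

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.05, size=(4, 960))    # hidden-layer weights (placeholder values)
W_output = rng.normal(scale=0.05, size=(30, 4))     # output-layer weights (placeholder values)

pixels = rng.random(960)                  # stand-in for one 30x32 B/W camera frame
hidden = sigmoid(W_hidden @ pixels)       # 4 hidden-unit activations
confidences = sigmoid(W_output @ hidden)  # 30 steering-direction confidences
direction = int(np.argmax(confidences))   # choose the direction with highest confidence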
Typical Neural Net structure. • Usually, layers are connected in a Directed Acyclic Graph. • Can, in general, have any structure: • Cyclic/Acyclic • Directed/Undirected • Feedforward/Feedback • Most common & practical nets are trained using Backpropagation • Learning = select a weight value for each connection. • Assumes the net is directed. • Cycles are restricted • Usually there are none in practice.
Appropriate problems for A.N.N.s • Great for problems with noisy data and complex sensor inputs. • Camera/microphone, etc. vs. • Shaft encoder/light-sensor, etc. • Symbolic problems: • As good as Decision Tree Learning!! • BACKPROPAGATION • Most widely used technique.
Appropriate problems for A.N.N.s • Neural Net Learning is suitable for problems with the following attributes: • Instances represented by many attribute–value pairs. • Target function depends on a vector of predefined attributes • Real-valued inputs • Attributes can be correlated or independent. • Target function can be discrete-valued, real-valued, or a vector! • ALVINN’s target function was a vector of 30 real values.
Appropriate problems for A.N.N.s • Training data can contain errors/noise • Neural nets are very robust to noisy data • Thankfully, so are natural ones ;-) • Long training times are acceptable. • Neural nets usually take longer to train than other machine-learning algorithms. • From a few minutes to several hours. • Fast evaluation of the learned function may be required. • Neural nets do compute the learned function very fast. • ALVINN re-computes its bearing several times per second.
Appropriate problems for A.N.N.s • Not important if humans can understand the learned function!! • ALVINN • 960 inputs • 4 hidden nodes • 30 outputs • The learned weights get somewhat messy-looking to humans!! • Thanks to the massive parallelism and distributed data.
Perceptrons. • Basic building block of Neural Nets • Takes several real-valued inputs • Computes a weighted sum • Checks sum against threshold: −w0 • If > threshold, output +1 • Else output −1. o(x1, …, xn) = 1 if w0 + w1·x1 + … + wn·xn > 0, −1 otherwise.
For simplicity, let x0 = 1; then • o(x1, …, xn) = sgn(w.x) • Vectors denoted in Bold!! • The . is the vector dot product!! • Hypothesis space = • All possible combinations of real-valued weights. • All w in R^(n+1)
Perceptron can represent any linearly separable concept. • Learned hypothesis is a hyperplane in R^n • Equation of hyperplane is w.x = 0 • Example: • AND, OR, NAND, NOR vs. XOR, etc. • Any boolean function can be represented by a 2-layer perceptron network!! (A concrete perceptron for AND is sketched below.)
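As a concrete illustration, here is a minimal perceptron decision function in Python; the hand-picked weights (an assumption for illustration) realize the linearly separable boolean AND over inputs in {0, 1}.

def perceptron(x, w):
    # w[0] is the bias weight w0; the corresponding input x0 = 1 is implicit.
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1

w_and = [-1.5, 1.0, 1.0]   # fires only when x1 + x2 > 1.5, i.e. both inputs are 1
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, w_and))   # -1, -1, -1, +1

XOR, by contrast, admits no such weight vector: no single line separates {(0,1), (1,0)} from {(0,0), (1,1)}.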
Training a single Perceptron • Perceptron training rule. • Delta training rule / Gradient Descent. • Converge to different hypotheses, • Under different conditions!!
Perceptron training rule. • Start with random weights • Go through the training examples • When an error is made, • Update weights: • For each wi: wi ← wi + Δwi, where Δwi = η(t − o)xi • Terminology: • η is the learning rate • typically small • Sometimes decreased as we proceed. • t is the value of the target function • o is the value output by the perceptron.
When every example in the training set is classified correctly, STOP. (A runnable sketch follows below.)
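Here is the training rule as a runnable Python sketch; the zero initialization and learning rate are illustrative choices, and (as the next slide notes) the loop is only guaranteed to terminate when the data are linearly separable.

def train_perceptron(examples, n_inputs, eta=0.1):
    # examples: list of (x, t) pairs with target t in {+1, -1}
    w = [0.0] * (n_inputs + 1)              # w[0] is the bias weight
    while True:
        errors = 0
        for x, t in examples:
            s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            o = 1 if s > 0 else -1
            if o != t:                      # update only on mistakes
                errors += 1
                w[0] += eta * (t - o)       # x0 = 1
                for i, xi in enumerate(x, start=1):
                    w[i] += eta * (t - o) * xi
        if errors == 0:                     # a perfect pass: stop
            return w

For example, train_perceptron([((0,0),-1), ((0,1),-1), ((1,0),-1), ((1,1),1)], 2) learns weights for AND.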
Pros and Cons • Can be proven to converge to a w that correctly classifies all training examples in finite time, if η is small enough. • Convergence guaranteed only if the concept is linearly separable. • If not: no convergence, no stopping!! • Can be an infinite loop of indecision!!
Delta rule & Gradient descent. • Addresses problem of non-convergence for nonlinear concepts. • Will give a linear approximation to nonlinear concepts that will minimize error. • …how do we measure error??
Consider a perceptron with the “thresholding” function removed. • Then o = w.x • Can define error as the sum of squared deviations: • E(w) = ½ Σd (td − od)², sum over all training examples d. • td is the target function value for example d. • od is the computed value of the weighted sum for d.
With this choice of error, it can be proven that minimizing E will lead to the most probable hypothesis that fits the training data, under the assumption that the noise is normally distributed with mean 0. • Note “most probable” hypothesis and “correct” hypothesis still can be different.
E will always be parabolic: it is quadratic in the weights, so the error surface is a paraboloid. • So it has only one global minimum. • Goal: Descend to the global minimum ASAP! • How? • Gradient definition. • Meaning of the Gradient. • Gradient descent.
The Gradient of E, ∇E(w), tells us the direction of steepest ascent. • So −∇E(w) tells us the direction of steepest descent. • Go in that direction with step size η. • The learning rule becomes: w ← w + Δw, where Δw = −η∇E(w).
Derivation of the simple update rule. • Differentiating E with respect to each weight gives ∂E/∂wi = −Σd (td − od)xid. • Finally, the Delta Rule weight update is: Δwi = η Σd (td − od)xid, over all training examples d.
Gradient descent pseudocode. • Pick an initial random weight vector w • Until the termination condition is met, do • Set Δwi = 0 for all i. • For each <x, t> in D, do • Run net on x: compute o(x) • For each wi, do Δwi = Δwi + η(t − o)xi • For each wi, do • wi = wi + Δwi
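A runnable Python version of the pseudocode above, for an unthresholded linear unit o = w.x; the fixed epoch budget is an illustrative stand-in for the unspecified termination condition.

def gradient_descent(examples, n_inputs, eta=0.05, epochs=100):
    w = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        delta = [0.0] * len(w)              # accumulate updates over the whole batch
        for x, t in examples:
            xs = [1.0] + list(x)            # prepend x0 = 1 for the bias weight
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i, xi in enumerate(xs):
                delta[i] += eta * (t - o) * xi
        w = [wi + dwi for wi, dwi in zip(w, delta)]   # one step per pass over D
    return w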
Results of Gradient Descent. • Because of the shape of the error surface, with a single (global) minimum, the algorithm will converge to a w with minimum squared deviation/error as long as η is small enough. • This holds regardless of linear separability. • If η is too large, the algorithm may overstep the global minimum instead of settling into it. • A common remedy is to decrease η over time.
Gradient descent can be used to search large or infinite hypothesis spaces when: • The hypothesis space is continuously parameterized. • The error can be differentiated with respect to the hypothesis parameters. • However • Convergence can be very slow • If there are lots of local minima, there’s no guarantee it will find the global one.
Stochastic gradient descent. • Stochastic gradient descent, a.k.a. incremental gradient descent. • Instead of updating the weights in one batch after going through all the training examples, we update them after each example. • This actually descends the gradient of a single-example error function (one example per step): • Ed = ½ (td − od)² • If η is small enough, this approximates true gradient descent arbitrarily closely.
Stochastic Gradient Descent. • Pick an initial random weight vector w • Until the termination condition is met, do • For each <x, t> in D, do • Run net on x: compute o(x) • For each wi, do wi = wi + η(t − o)xi
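The same sketch in stochastic form: the update is identical, but it is applied to the weights immediately after each example, so later examples in a pass already see the updated w.

def stochastic_gradient_descent(examples, n_inputs, eta=0.05, epochs=100):
    w = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        for x, t in examples:
            xs = [1.0] + list(x)            # prepend x0 = 1 for the bias weight
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i, xi in enumerate(xs):
                w[i] += eta * (t - o) * xi  # update per example, not per pass
    return w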
Results. • Compared to stochastic gradient descent, standard gradient descent takes more computation per weight-update step, but can in turn use a larger step size. • When E has multiple local minima, stochastic gradient descent can sometimes avoid them.
Perceptrons with discrete output. • For delta-learning/gradient descent, we discussed unthresholded perceptrons. • The approach is easily adapted to thresholded perceptrons: • Just use the thresholded target values (±1) as the t values for the perceptron delta-learning algorithm, • (with the unthresholded o values, o = w.x). • Unfortunately, minimizing the squared error of the unthresholded output does not necessarily minimize the number of classification errors made by the thresholded output on the training data.
Multilayer Networks and the Backpropagation Algorithm • In order to learn non-linear decision surfaces, a system more complex than perceptrons is needed.
Choice of base unit. • Multiple layers of linear units can still express only linear functions. • Unthresholded perceptron • Thresholded perceptron has a non-differentiable thresholding function: • Cannot compute the gradient of E • Need something different. • Must be non-linear • And continuously differentiable.
Sigmoid unit. • In place of the perceptron step function, use the sigmoid function as the thresholding function. • Sigmoid: σ(y) = 1 / (1 + e^(−y))
The sigmoid unit computes the weighted linear sum w.x, and then applies the sigmoid “squashing function”. • Steepness of the incline increases with the coefficient multiplying −y. • Continuously differentiable: • Derivative: dσ(y)/dy = σ(y)·(1 − σ(y))
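In code, the sigmoid and its derivative; the σ(y)·(1 − σ(y)) form is what makes the backpropagation error terms below cheap to compute.

import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_prime(y):
    s = sigmoid(y)
    return s * (1.0 - s)      # = sigmoid(y) * (1 - sigmoid(y))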
Backpropagation algorithm. • Learns the weights that minimize the squared error, given a fixed set of units/neurons and interconnections. • Employs gradient descent similar to the delta rule. • Error is measured over all output units: • E(w) = ½ Σd Σk (tkd − okd)², summing over all training examples d and all output units k.
Error surface can have multiple local minima. • No guarantee the algorithm will find the global minimum. • However, in practice, backpropagation performs very well. • Recall: Stochastic Gradient Descent vs. Gradient Descent.
Stochastic Backpropagation Algorithm • Consider a feedforward network with two layers of sigmoid units, fully connected in one direction. • Each unit is assigned an index (i = 0, 1, 2, …) • xji denotes the input from i into j. • wji denotes the weight on the connection from i to j. • δn is the error term associated with unit n.
Backpropagation explained • Make the network, randomly initialize the weights. • For each training example d, apply the network, calculate the squared error Ed, compute the gradient, and take a step of size η in the direction of steepest decline. • Weight update rule: • Recall the delta rule: Δwi = η(t − o)xi • Here we have Δwji = ηδjxji • The error term δj is more complex here.
Error term for unit j. • Intuitively, if j is an output node k, then its error term is the standard tk − ok multiplied by ok(1 − ok), the derivative of the sigmoid function: δk = ok(1 − ok)(tk − ok). • Derivative of the sigmoid, because we’re using the gradient of E. • If j is a hidden node h, we have no th to compare it with. • Must sum the error terms of the output nodes k influenced by h: δk • weighted by how much each was influenced by h: wkh • δh = oh(1 − oh) Σk wkhδk • (A runnable sketch follows below.)
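Putting the pieces together, here is a compact Python sketch of stochastic backpropagation for a fully connected two-layer sigmoid network, following the update rule Δwji = ηδjxji and the two error-term formulas above. The layer sizes, initialization range, learning rate, and epoch budget are illustrative choices.

import math, random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def train_backprop(examples, n_in, n_hidden, n_out, eta=0.3, epochs=5000):
    rnd = random.Random(0)
    # each weight row includes a bias weight at index 0 (its input is fixed at 1)
    w_h = [[rnd.uniform(-0.05, 0.05) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [[rnd.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:                       # t is a list of n_out targets
            xs = [1.0] + list(x)
            h = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(wh, xs))) for wh in w_h]
            o = [sigmoid(sum(w * hi for w, hi in zip(wo, h))) for wo in w_o]
            # error terms: output units first, then propagate back to hidden units
            d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            d_h = [h[j + 1] * (1 - h[j + 1]) *
                   sum(w_o[k][j + 1] * d_o[k] for k in range(n_out))
                   for j in range(n_hidden)]
            # weight updates: delta w_ji = eta * delta_j * x_ji
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    w_o[k][j] += eta * d_o[k] * h[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_h[j][i] += eta * d_h[j] * xs[i]
    return w_h, w_o

With a couple of hidden units such a net can learn XOR (though, given the local minima just discussed, it may need a different seed or more epochs), which no single perceptron can represent.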