Artificial Neural Networks #1. Machine Learning CH4: 4.1 – 4.5. Promethea Pythaitha
Artificial Neural Networks • Robust approach to approximating target functions over attributes with continuous as well as discrete domains. • Can approximate unknown target functions. • Target function can be • Discrete valued. • Real valued. • Vector valued.
Neural Net Applications • Robust to noise in training data • Among most successful methods at interpreting noisy real world sensor data. • Microphone input / speech recognition. • Camera input. • Handwriting recognition • Face recognition and Image Processing. • Robotic control. • Fuzzy neural nets.
Inspiration for Artificial Neural nets. • Natural learning systems (i.e. brains) are composed of very complex webs of interconnected neurons. • Each neuron receives signals (current spikes). • When a neuron's threshold is reached, it sends its signal downstream to… • Other neurons • Physical actuators • Perception in the Neocortex • Etc.
Artificial Neural nets are built out of densely interconnected sets of units. • Each unit (artificial neuron) takes many real-valued inputs, and produces a real-valued output. • Output is then sent downstream to • Other units within net • Output layer of net.
Brain estimated to contain ~ 100 billion neurons. • Each neuron connects to an average of 10,000 others. • Neuron switching speeds are on the order of a thousandth of a second • (versus a ten-billionth of a second for logic gates) • Yet brains can make decisions, recognize images, etc., VERY fast.
Hypothesis: • Thought /information processing in the brain is result of massively parallelized processing of distributed inputs.
Neural Nets are built on this idea: • Process distributed data in parallel.
Artificial vs. Natural. • Many Complexities of Natural Neural Nets not present in Artificial ones. • Feedback (uncommon in ANNs) • Etc. • Many features of Artificial Neural Nets not compatible with Natural ones. • Units in Artificial Neural Nets produce one constant output rather than a time-varying sequence of current pulses.
Neural Network representations. • ALVINN • learned to drive an autonomous vehicle on a highway. • Input: 30 x 32 pixel matrix from a camera • 960 values (B/W pixel intensities) • Output: Steering direction for the vehicle. • (30 real values) • Two-layer Neural Net: • Input (not counted) • Hidden Layer • Output.
ALVINN explained • Typical Neural Net Structure. • All 960 inputs in the 30x32 matrix are sent to the four hidden neurons/units, where weighted linear sums are computed. • Hidden unit: a unit whose output is only accessible within the net, not at the output layer. • Outputs from the 4 hidden neurons are sent downstream to 30 output neurons, each of which outputs a confidence value corresponding to steering in a specific direction. • Fuzzy truth?? • Probability measure? • The program chooses the direction with the highest confidence.
Typical Neural Net structure. • Usually, layers are connected in a directed acyclic graph. • Can, in general, have any structure: • Cyclic/Acyclic • Directed/Undirected • Feedforward/Feedback • Most common & practical Nets trained using Backpropagation • Learning = selecting a weight value for each connection. • Assumes net is directed. • Cycles are restricted • Usually there are none in practice.
Appropriate problems for A.N.N.s • Great for problems with noisy data and complex sensor inputs. • Camera/microphone, etc VS. • Shaft encoder/light-sensor, etc • Symbolic problems: • As good as Decision Tree Learning!! • BACKPROPAGATION • Most widely used technique.
Appropriate problems for A.N.N.s • Neural Net Learning suitable for problems with the following attributes: • Instances represented by many attribute-value pairs. • Target function depends on a vector of predefined attributes • Real-valued inputs • Attributes can be correlated or independent. • Target function can be discrete-valued, real-valued, or a vector! • ALVINN’s target function was a 30 real-element vector.
Appropriate problems for A.N.N.s • Training Data can contain Errors/Noise • Neural nets very robust to noisy data • Thankfully so are natural ones ;-) • Long Training Times are acceptable. • Neural nets usually take longer to train than other machine-learning algorithms. • From a few minutes to several hours. • Fast evaluation of the learned function may be required. • Neural nets do compute the learned function very fast. • ALVINN recomputes its steering direction several times per second.
Appropriate problems for A.N.N.s • Not important if humans can understand the learned function!! • ALVINN • 960 inputs • 4 hidden nodes • 30 outputs • The learned weights get somewhat messy-looking to humans!! • Thanks to the massive parallelism and distributed data.
Perceptrons. • Basic building block of Neural Nets • Takes several real-valued inputs • Computes weighted sum • Checks sum against threshold −w0: • If > threshold, output +1 • Else output −1. o(x1, …, xn) = +1 if w0 + w1·x1 + … + wn·xn > 0, −1 otherwise.
For simplicity, let x0 = 1, then • o(x1, …, xn) = sgn(w · x) • Vectors denoted in Bold!! • The · is the vector dot product!! • Hypothesis space = • All possible combinations of real-valued weights. • All w in R^(n+1)
Perceptron can represent any linearly separable concept. • Learned hypothesis is a hyperplane decision surface in R^n • Equation of the hyperplane is w · x = 0 • Example: • AND, OR, NAND, NOR are linearly separable; XOR is not. • Any boolean function can be represented by a 2-layer perceptron network!!
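To make this concrete, here is a minimal sketch in Python (assuming NumPy; the helper name perceptron_output and the example weights are illustrative, not from the original slides):

import numpy as np

def perceptron_output(w, x):
    # Perceptron output: +1 if w0 + w1*x1 + ... + wn*xn > 0, else -1.
    # w has n+1 entries (w[0] is the threshold weight w0); x has n entries.
    net = w[0] + np.dot(w[1:], x)
    return 1 if net > 0 else -1

# AND is linearly separable: w = (-1.5, 1, 1) classifies it correctly.
w_and = np.array([-1.5, 1.0, 1.0])
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(w_and, x))   # -1, -1, -1, +1

No single weight vector works for XOR, which is why multilayer networks are needed later.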
Training a single Perceptron • Perceptron training rule. • Delta training rule / Gradient Descent. • Converge to different hypotheses, • Under different conditions!!
Perceptron training rule. • Start with random weights • Go through the training examples • When an error is made, • Update the weights: • For each wi: wi ← wi + Δwi, where Δwi = η(t − o)xi • Terminology: • η is the learning rate • typically small • Sometimes decreases as we proceed. • t is the value of the target function • o is the value output by the perceptron.
When all training examples are classified correctly, STOP.
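A minimal sketch of this loop, reusing the perceptron_output helper from the earlier example (train_perceptron, eta, and max_epochs are illustrative names):

def train_perceptron(examples, eta=0.1, max_epochs=1000):
    # Perceptron training rule: on each misclassified example,
    # w_i <- w_i + eta * (t - o) * x_i   (x0 = 1 handles the threshold weight).
    n = len(examples[0][0])
    w = np.random.uniform(-0.05, 0.05, n + 1)   # small random initial weights
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = perceptron_output(w, x)
            if o != t:                                      # update on errors only
                w[0] += eta * (t - o)                       # x0 = 1
                w[1:] += eta * (t - o) * np.asarray(x, float)
                mistakes += 1
        if mistakes == 0:       # training set classified perfectly: STOP
            return w
    return w   # gave up: the concept may not be linearly separable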
Pros and Cons • Can be proven to converge to a w that correctly classifies all training examples in finite time, if η is small enough. • Convergence guaranteed only if the concept is linearly separable. • If not: no convergence, no stopping!! • Can be an infinite loop of indecision!!
Delta rule & Gradient descent. • Addresses problem of non-convergence for nonlinear concepts. • Will give a linear approximation to nonlinear concepts that will minimize error. • …how do we measure error??
Consider a perceptron with the “thresholding” function removed. • Then o = w · x • Can define error as the sum of squared deviations: • E(w) = ½ Σd (td − od)², summed over all training examples d. • td is the target function value for example d. • od is the computed value of the weighted sum for d.
With this choice of error, it can be proven that minimizing E leads to the most probable hypothesis fitting the training data, under the assumption that the noise is normally distributed with mean 0. • Note: the “most probable” hypothesis and the “correct” hypothesis can still differ.
E is quadratic in the weights, so the error surface is a paraboloid with a single global minimum. • Goal: Descend to the global minimum ASAP! • How? • Gradient definition. • Meaning of the Gradient. • Gradient descent.
The Gradient of E tells us the direction of steepest ascent. • So −Gradient(E) tells us the direction of steepest descent. • Go in that direction with step size η. • The learning rule becomes: w ← w + Δw, with Δw = −η ∇E(w).
Derivation of the simple update rule (the steps are filled in below). • Finally, the Delta Rule weight update is: Δwi = η Σd (td − od) xid, summed over all training examples d.
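Filling in the derivation from the definitions above (E = ½ Σd (td − od)² with od = w · xd), in LaTeX:

\begin{aligned}
\frac{\partial E}{\partial w_i}
  &= \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_d (t_d - o_d)^2
   = \sum_d (t_d - o_d)\,\frac{\partial}{\partial w_i}\left(t_d - \mathbf{w}\cdot\mathbf{x}_d\right)
   = \sum_d (t_d - o_d)(-x_{id}) \\
\Delta w_i &= -\eta\,\frac{\partial E}{\partial w_i}
   = \eta \sum_d (t_d - o_d)\,x_{id}
\end{aligned}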
Gradient descent pseudocode. • Pick an initial random weight vector w. • Until the termination condition is met, do: • Set Δwi = 0 for all i. • For each <x, t> in D, do: • Run net on x: compute o(x). • For each wi, do: Δwi ← Δwi + η(t − o)xi. • For each wi, do: • wi ← wi + Δwi.
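As a sketch, the pseudocode above in Python (assuming NumPy; each x is taken to include the constant input x0 = 1, and the termination condition is simplified to a fixed number of passes):

def gradient_descent(D, eta=0.05, epochs=100):
    # Batch gradient descent for an unthresholded unit o = w . x.
    # D is a list of (x, t) pairs; each x includes the constant input x0 = 1.
    n = len(D[0][0])
    w = np.random.uniform(-0.05, 0.05, n)
    for _ in range(epochs):
        delta_w = np.zeros(n)                      # set delta_w_i = 0 for all i
        for x, t in D:
            o = np.dot(w, x)                       # run net on x: compute o(x)
            delta_w += eta * (t - o) * np.asarray(x, float)
        w += delta_w                               # one batch update per pass
    return w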
Results of Gradient Descent. • Because of the shape of the error surface, with a single global minimum, the algorithm will converge to a w with minimum squared deviation/error as long as η is small enough. • This holds regardless of linear separability. • If η is too large, the algorithm may overstep the global minimum instead of settling in. • A common remedy is to decrease η with time.
Gradient descent can be used to search large or infinite hypothesis spaces when: • The hypothesis space is continuously parameterized. • The error can be differentiated with respect to the hypothesis parameters. • However • Convergence can be very slow • If there are many local minima, there is no guarantee it will find the global one.
Stochastic gradient descent. • Stochastic gradient descent a.k.a. Incremental gradient descent. • Instead of accumulating weight updates over all the training examples and applying them at once, we update the weights after each example. • This really descends the gradient of a single-example error function (one example per step): • Ed = ½ (td − od)² • If η is small enough, this approximates true gradient descent arbitrarily closely.
Stochastic Gradient Descent. • Pick an initial random weight vector w. • Until the termination condition is met, do: • For each <x, t> in D, do: • Run net on x: compute o(x). • For each wi, do: wi ← wi + η(t − o)xi.
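The stochastic variant differs from the batch sketch above only in where the update happens (same illustrative names and assumptions):

def stochastic_gradient_descent(D, eta=0.05, epochs=100):
    # Incremental version: update w immediately after each example,
    # descending the single-example error E_d = 1/2 * (t_d - o_d)^2.
    n = len(D[0][0])
    w = np.random.uniform(-0.05, 0.05, n)
    for _ in range(epochs):
        for x, t in D:
            o = np.dot(w, x)
            w += eta * (t - o) * np.asarray(x, float)
    return w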
Results. • Compared to stochastic gradient descent, standard gradient descent takes more computation per weight-update step, but can therefore use a larger step size. • When E has multiple local minima, stochastic gradient descent can sometimes avoid them.
Perceptrons with discrete output. • For Delta-learning/gradient descent, we discussed unthresholded perceptrons. • The method is easily carried over to thresholded perceptrons: • Just use the thresholded target values t (±1) as the t values for the delta-learning algorithm. • (together with the unthresholded o values) • Unfortunately, this minimizes the squared error of the unthresholded output; it does not necessarily minimize the number of training examples misclassified by the thresholded output.
Multilayer Networks and the Backpropagation Algorithm • In order to learn non-linear decision surfaces, a system more complex than perceptrons is needed.
Choice of base unit. • Multiple layers of linear units still compute only linear functions. • Unthresholded perceptron • Thresholded perceptron has a non-differentiable thresholding function: • Cannot compute the gradient of E • Need something different. • Must be non-linear • And continuously differentiable.
Sigmoid unit. • In place of the perceptron step function, use the sigmoid function as the thresholding function. • Sigmoid: σ(y) = 1 / (1 + e^(−y))
The sigmoid unit computes the weighted linear sum w · x, and then applies the sigmoid “squashing function”. • Steepness of the incline increases with the coefficient of −y (σ(ky) gets steeper as k grows). • Continuously differentiable: • Derivative: dσ(y)/dy = σ(y)·(1 − σ(y))
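A small sketch of the sigmoid unit and of the derivative that gradient descent will need (function names are illustrative):

def sigmoid(y):
    # The "squashing function": sigma(y) = 1 / (1 + e^(-y)).
    return 1.0 / (1.0 + np.exp(-y))

def sigmoid_unit(w, x):
    # Weighted linear sum, then squash: o = sigma(w . x).
    return sigmoid(np.dot(w, x))

def sigmoid_deriv_from_output(o):
    # d(sigma)/dy written in terms of the unit's output: sigma(y) * (1 - sigma(y)).
    return o * (1.0 - o)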
Backpropagation algorithm. • Learns the weights that minimize squared error given a fixed number of units/neurons and interconnections. • Employs gradient descent similar to the delta rule. • Error is measured over all output units: • E(w) = ½ Σd Σk∈outputs (tkd − okd)²
Error Surface can have multiple local minima. • No guarantee algorithm will find the Global Minimum. • However, in practice, backpropagation performs very well. • Recall Stochastic Gradient descent vs. • Gradient descent.
Stochastic Backpropagation Algorithm • Consider a feedforward network with two layers of sigmoid units, fully connected in one direction. • Each unit is assigned an index (i = 0, 1, 2, …) • xji denotes the input into j coming from i. • wji denotes the weight on the connection from i to j. • δn is the error term associated with unit n.
Backpropagation explained • Make the network, randomly initialize the weights. • For each training example d, apply the network, calculate the squared error Ed, compute the gradient, and take a step of size η in the direction of steepest decline. • Weight update rule: • Recall the Delta rule: Δwi = η(t − o)xi • Here we have Δwji = η δj xji • The error term δj is more complex here.
Error term for unit j. • If j is an output node k, its error term is the familiar tk − ok multiplied by ok(1 − ok), the derivative of the sigmoid function: • δk = ok(1 − ok)(tk − ok) • The derivative of the sigmoid appears because we are using ∇E. • If j is a hidden node h, we have no th to compare it with. • Must sum the error terms of the output nodes k influenced by h: δk • each weighted by how much k is influenced by h: wkh • δh = oh(1 − oh) Σk wkh δk
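Putting the pieces together, a minimal sketch of stochastic backpropagation for a fully connected two-layer network of sigmoid units, reusing the sigmoid helper from the earlier sketch (all names and shapes are illustrative; η, the hidden-layer size, and the epoch count may need tuning to converge on a given problem):

def backprop(D, n_in, n_hidden, n_out, eta=0.5, epochs=5000):
    # D: list of (x, t) pairs, x of length n_in, t of length n_out.
    # A constant bias input of 1 is appended to x and to the hidden outputs.
    rng = np.random.default_rng(0)
    W_h = rng.uniform(-0.05, 0.05, (n_hidden, n_in + 1))    # w_ji: input  -> hidden
    W_o = rng.uniform(-0.05, 0.05, (n_out, n_hidden + 1))   # w_kj: hidden -> output
    for _ in range(epochs):
        for x, t in D:
            x = np.append(x, 1.0)                       # bias input x0 = 1
            # Forward pass.
            o_h = sigmoid(W_h @ x)                      # hidden-unit outputs
            o_hb = np.append(o_h, 1.0)                  # plus bias for next layer
            o_k = sigmoid(W_o @ o_hb)                   # network outputs
            # Backward pass: error terms.
            delta_k = o_k * (1 - o_k) * (np.asarray(t) - o_k)       # output units
            delta_h = o_h * (1 - o_h) * (W_o[:, :-1].T @ delta_k)   # hidden units
            # Weight updates: Delta w_ji = eta * delta_j * x_ji.
            W_o += eta * np.outer(delta_k, o_hb)
            W_h += eta * np.outer(delta_h, x)
    return W_h, W_o

# XOR, a concept no single perceptron can represent:
D = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
W_h, W_o = backprop(D, n_in=2, n_hidden=2, n_out=1)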