
Machine Learning Artificial Neural Networks (ANN)

Presentation Transcript


  1. Machine Learning: Artificial Neural Networks (ANN) (Based on Chapter 4 of Mitchell T., Machine Learning, 1997) Shanghai Jiao Tong University

  2. Artificial Neural Networks • What can they do? • How do they work? • What might we use them for in our project?

  3. History • Late 1800s – neural networks appear as an analogy to biological systems • 1960s and 70s – simple neural networks appear, then fall out of favor because the perceptron is not effective by itself and there were no good training algorithms for multilayer nets • 1986 – the backpropagation algorithm appears • Neural networks enjoy a resurgence in popularity

  4. Applications • Handwriting recognition • Recognizing spoken words • Face recognition • You will get a chance to play with this later! • ALVINN • TD-BACKGAMMON

  5. Neural Network as a Classifier • Weaknesses • Long training time • Requires a number of parameters that are typically best determined empirically, e.g., the network topology or “structure” • Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and the “hidden units” in the network • Strengths • High tolerance of noisy data • Ability to classify patterns on which the network was not trained • Well suited for continuous-valued inputs and outputs • Successful on a wide array of real-world data • Algorithms are inherently parallel • Techniques have recently been developed for extracting rules from trained neural networks

  6. Motivation • Analogy to biological neural systems, the most robust learning systems we know • Attempt to understand natural biological systems through computational modeling • Parallelism: massive parallelism allows for computational efficiency • Distributed: helps us understand the “distributed” nature of neural representations, which allows robustness • Neural nets: intelligent behavior as an “emergent” property of a large number of simple units, rather than of explicitly encoded symbolic rules and algorithms

  7. Neural Speed Constraints • Number of neurons: the human brain has about 10^11 neurons with an average of 10^4 connections each • Slow switching time: neurons have a “switching time” of a few milliseconds (10^-3 second), compared to nanoseconds for current computing hardware • Complex tasks: nevertheless, neural systems can perform complex cognitive tasks (vision, speech understanding) in tenths of a second • Benefit of parallelism: there is only time for about 100 serial steps in this time frame, compared to orders of magnitude more for current computers, so the brain must be exploiting “massive parallelism”

  8. Neural Network Learning • Learning approach based on adaptations from biological neural systems. • Perceptron: Initial algorithm for learning simple neural networks (single layer) developed in the 1950’s. • Backpropagation: More complex algorithm for learning multi-layer neural networks developed in the 1980’s.

  9. Real Neurons • Cell structures • Cell body • Dendrites • Axon (the output end) • Synaptic terminals

  10. Real Neural Learning • Synapses change size and strength with experience. • Hebbian learning: When two connected neurons are firing at the same time, the strength of the synapse between them increases. • “Neurons that fire together, wire together.”
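As a rough illustration (not from the slides), the Hebbian rule can be sketched as a weight change proportional to the product of the two neurons' activity levels; the function name and learning rate below are illustrative:

```python
def hebbian_update(w, pre, post, eta=0.1):
    """Strengthen a synapse in proportion to the co-activation of the
    pre-synaptic and post-synaptic neurons: 'fire together, wire together'."""
    return w + eta * pre * post

w = 0.5
w = hebbian_update(w, pre=1.0, post=1.0)  # both neurons firing: weight grows
print(w)  # 0.6
```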

  11. Multilayer Neural Network Representation [Figure: input units connected by weighted edges to hidden units, which are connected to output units]

  12. How is a function computed by a Multilayer Neural Network? [Figure: a network with six inputs x1…x6, three hidden units h1, h2, h3 reached through weights wji, and one output y1 reached through weights wkj; g is the sigmoid] • hj = g(Σi wji xi) • y1 = g(Σj wkj hj) • where g(x) = 1/(1 + e^-x) • Typically, y1 = 1 for a positive example and y1 = 0 for a negative example
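A minimal Python sketch of this forward computation, assuming the 6-input, 3-hidden-unit, single-output network of the figure; the weight values are random placeholders, not taken from the slides:

```python
import numpy as np

def g(x):
    """Sigmoid squashing function g(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W_hidden, w_out):
    """Compute h_j = g(sum_i w_ji x_i), then y1 = g(sum_j w_kj h_j)."""
    h = g(W_hidden @ x)        # hidden activations h1, h2, h3
    return g(w_out @ h)        # single output y1

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])  # inputs x1..x6
W_hidden = rng.normal(size=(3, 6))            # wji (placeholder values)
w_out = rng.normal(size=3)                    # wkj (placeholder values)
print(forward(x, W_hidden, w_out))            # a value in (0, 1)
```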

  13. Learning in Multilayer Neural Networks • Learning consists of searching through the space of all possible matrices of weight values for a combination of weights that satisfies a database of positive and negative examples (multi-class as well as regression problems are possible). • Note that a Neural Network model with a set of adjustable weights defines a restricted hypothesis space corresponding to a family of functions. The size of this hypothesis space can be increased or decreased by increasing or decreasing the number of hidden units present in the network.

  14. Artificial Neuron Model [Figure: unit 1 receives inputs from units 2–6 over weighted edges w12…w16] • Model the network as a graph with cells as nodes and synaptic connections as weighted edges wji from node i to node j • Model the net input to cell j as netj = Σi wji oi • Cell output is a step function of the net input: oj = 1 if netj > Tj, and 0 otherwise (Tj is the threshold for unit j)
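A small sketch of this threshold unit in Python; the particular inputs, weights w12…w16, and threshold below are illustrative:

```python
def neuron_output(inputs, weights, threshold):
    """Threshold unit: o_j = 1 if net_j = sum_i w_ji * o_i
    exceeds the threshold T_j, and 0 otherwise."""
    net = sum(w * o for w, o in zip(weights, inputs))
    return 1 if net > threshold else 0

# Unit 1 fed by the outputs of units 2..6 (placeholder weights w12..w16).
print(neuron_output([1, 0, 1, 1, 0], [0.4, -0.2, 0.7, 0.1, 0.5], threshold=1.0))  # 1
```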

  15. Perceptron • A perceptron takes a vector of real-valued inputs, computes a linear combination of them, and outputs 1 if the result is greater than the threshold and −1 otherwise: o(x1, …, xn) = 1 if w0 + w1x1 + … + wnxn > 0, and −1 otherwise

  16. Perceptron (cont.) • The perceptron can be written as: o(x1, …, xn) = 1 if w0 + w1x1 + … + wnxn > 0, and −1 otherwise • This can be simplified (by adding a constant input x0 = 1) to: o(x) = sgn(w · x), where sgn(y) = 1 if y > 0, and −1 otherwise

  17. Perceptron (cont.) • Learning a perceptron means selecting the values of the weights w0, …, wn, so the hypothesis space here is the set of all possible real-valued weight vectors: H = { w | w ∈ R^(n+1) }

  18. Perceptron • The perceptron can be viewed as a decision surface in the n-dimensional instance space • For instances on one side of the surface, the output of the perceptron is 1; for instances on the other side, the output is −1 • The equation of the surface is w · x = 0 • A linearly separable instance set is a set of instances that can be divided by some hyperplane

  19. Decision Surface of a Perceptron [Figure: the four points (1, 1), (−1, 1), (1, −1), (−1, −1) in the x1–x2 plane, separated by a line] • Represents some useful functions • Which weights represent them? E.g., with w0 = −0.8 and w1 = w2 = 0.5, g(x1, x2) = sgn(0.5x1 + 0.5x2 − 0.8) computes AND of bipolar inputs, as does g(x1, x2) = sgn(0.5x1 + 0.5x2 − 0.2) • Likewise, OR(x1, x2) = sgn(0.5x1 + 0.5x2 + 0.3) and OR(x1, x2) = sgn(0.5x1 + 0.5x2 + 0.8)

  20. Decision Surface of a Perceptron • Represents some Boolean functions, e.g., m-of-n functions (m < n): g(x1, x2, …, xn) = 1 if xi = 1 for at least m of the inputs xi, and otherwise g(x1, x2, …, xn) = −1 • AND function: the n-of-n function • OR function: the 1-of-n function • Assume the arguments and the function value take the values 1 (true) and −1 (false) • E.g., the 3-of-5 function: w0 = −0.2, w1 = w2 = w3 = w4 = w5 = 0.5, i.e., g(x1, …, x5) = sgn(0.5x1 + 0.5x2 + … + 0.5x5 − 0.2)

  21. Decision Surface of a Perceptron • A perceptron can represent all the primitive Boolean functions: AND, OR, NAND, NOR • NOT: let the threshold be 0 and use a single input with a negative weight • E.g., OR(x1, x2, …, xn) = sgn(0.5x1 + 0.5x2 + … + 0.5xn + 0.5(n − 2) + 0.2) • NAND(x1, x2, …, xn) = NOT(AND(x1, x2, …, xn)) = OR(NOT(x1), …, NOT(xn)) = OR(−x1, …, −xn) = sgn(−0.5x1 − 0.5x2 − … − 0.5xn + 0.5(n − 2) + 0.2) • (both constructions are verified in the sketch below)
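A quick sketch that exhaustively checks the OR and NAND constructions above for n = 3, assuming the bipolar (1/−1) encoding of the previous slides:

```python
from itertools import product

def sgn(y):
    return 1 if y > 0 else -1

def perceptron_or(xs):
    """OR as the 1-of-n perceptron sgn(0.5*sum(x) + 0.5*(n-2) + 0.2)."""
    n = len(xs)
    return sgn(0.5 * sum(xs) + 0.5 * (n - 2) + 0.2)

def perceptron_nand(xs):
    """NAND via the negated-weight construction from the slide."""
    n = len(xs)
    return sgn(-0.5 * sum(xs) + 0.5 * (n - 2) + 0.2)

# Check both constructions against the Boolean definitions for n = 3.
for xs in product([-1, 1], repeat=3):
    assert perceptron_or(xs) == (1 if any(x == 1 for x in xs) else -1)
    assert perceptron_nand(xs) == (-1 if all(x == 1 for x in xs) else 1)
print("OR and NAND constructions verified for n = 3")
```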

  22. Decision Surface of a Perceptron • A single perceptron can represent many Boolean functions • But some Boolean functions are not representable, e.g., XOR, which is not linearly separable • Therefore we will want networks of these units…

  23. Potentials of Perceptron [Figure: inputs x1…x4 feed a layer of AND units, whose outputs feed a single OR unit] • A two-layer perceptron network can compute any Boolean function using a two-level AND-OR network, e.g., F(x1, x2, x3, x4) = (x1 ∧ x3) ∨ (x1 ∧ x2 ∧ x4) ∨ (x2 ∧ x3 ∧ x4), i.e., F = OR(AND(x1, x3), AND(x1, x2, x4), AND(x2, x3, x4)) (see the sketch below)
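A sketch of this two-level AND-OR network built from perceptron units over bipolar (±1) values; the AND/OR units follow the weight constructions of the earlier slides, and the exact threshold constants are one workable choice among many:

```python
from itertools import product

def sgn(y):
    return 1 if y > 0 else -1

def and_unit(*xs):
    """n-of-n perceptron: fires (outputs 1) only when every input is 1."""
    n = len(xs)
    return sgn(0.5 * sum(xs) - (0.5 * n - 0.3))

def or_unit(*xs):
    """1-of-n perceptron: fires when at least one input is 1."""
    n = len(xs)
    return sgn(0.5 * sum(xs) + 0.5 * (n - 2) + 0.2)

def F(x1, x2, x3, x4):
    """The slide's function as an AND layer feeding one OR unit."""
    return or_unit(and_unit(x1, x3), and_unit(x1, x2, x4), and_unit(x2, x3, x4))

# Compare against the Boolean definition on all 16 bipolar inputs.
for x1, x2, x3, x4 in product([-1, 1], repeat=4):
    expected = 1 if ((x1 == x3 == 1) or (x1 == x2 == x4 == 1)
                     or (x2 == x3 == x4 == 1)) else -1
    assert F(x1, x2, x3, x4) == expected
print("two-level AND-OR network matches F on all 16 inputs")
```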

  24. Perceptron Training • Assume supervised training examples • Learn synaptic weights so that the unit produces the correct output for each example • The perceptron uses an iterative update algorithm to learn a correct set of weights

  25. Perceptron training rule • We begin with learning a single perceptron, although our purpose is to learn perceptron networks • The task of learning a perceptron: determine a vector of weights that makes the perceptron generate the correct outputs of 1 and −1 • Two algorithms: • the perceptron rule • the delta rule • These two algorithms are the basis for learning networks of multiple units

  26. Perceptron training rule (cont.) • Process of the algorithm: • Begin with random weights • Iteratively apply the perceptron to each training example; if it misclassifies an example, modify the weights • Repeat the above process until the perceptron correctly classifies all the training examples • Perceptron training rule: wi ← wi + Δwi, with Δwi = η(t − o)xi, where η is the “learning rate” and t is the teacher-specified output

  27. Perceptron training rule (cont.) • Equivalent to the rules: • If the output is correct, do nothing • If the output is high, lower the weights on active inputs • If the output is low, increase the weights on active inputs • (a training sketch follows below)
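A minimal sketch of the perceptron training rule, assuming bipolar (±1) targets and a fixed cap on the number of training passes (both illustrative choices):

```python
def sgn(y):
    return 1 if y > 0 else -1

def train_perceptron(examples, n, eta=0.1, max_epochs=100):
    """Perceptron rule: start from zero weights and apply
    w_i <- w_i + eta * (t - o) * x_i to each misclassified example,
    repeating until every example is classified correctly."""
    w = [0.0] * (n + 1)                            # weights w0..wn
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            x = [1] + list(x)                      # constant input x0 = 1
            o = sgn(sum(wi * xi for wi, xi in zip(w, x)))
            if o != t:                             # misclassified: update
                errors += 1
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
        if errors == 0:                            # all examples correct
            break
    return w

# Learn the (linearly separable) OR function on bipolar inputs.
data = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), 1)]
print(train_perceptron(data, n=2))
```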

  28. Gradient Descent and the Delta Rule • The delta rule overcomes the drawback of the perceptron rule: it converges toward the best approximation of the target function even when the training examples are not linearly separable • Key idea: use gradient descent to search the hypothesis space of possible weight vectors to find the weights that best fit the training examples • The delta rule provides the basis for the backpropagation (BP) algorithm • Gradient descent serves as the basis for learning algorithms that must search through hypothesis spaces containing continuously parameterized hypotheses

  29. Gradient Descent and the Delta Rule • The delta rule can be regarded as learning a perceptron without a threshold, i.e., a simpler linear unit: o(x) = w · x • The measure of the training error of a hypothesis w, relative to the training examples D: E(w) = ½ Σd∈D (td − od)^2

  30. Visualize the error surface • According to the definition of E, for linear units the error surface is parabolic, with a single global minimum

  31. Visualize the error surface • Gradient descent: start with an arbitrary initial vector of weights, then repeatedly modify it in small steps, in the direction of the steepest descent along the error surface, until the global minimum error is reached.

  32. Derivation of the Gradient Descent Rule • How do we determine the steepest direction along the error surface? • Gradient: compute the derivatives of E with respect to each component of the weight vector: ∇E(w) = [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn] • The gradient specifies the direction of steepest increase in E; hence, the negative of this direction gives the steepest decrease in E • Gradient descent rule: w ← w + Δw, where Δw = −η∇E(w), i.e., Δwi = −η ∂E/∂wi

  33. Derivation of the Gradient Descent Rule (cont.) • The partial derivatives: ∂E/∂wi = ∂/∂wi ½ Σd (td − od)^2 = ½ Σd 2(td − od) ∂/∂wi (td − w · xd) = Σd (td − od)(−xid)

  34. Derivation of the Gradient Descent Rule (cont.) • Gradient descent component: ∂E/∂wi = −Σd (td − od) xid • Gradient descent weight update rule: Δwi = η Σd (td − od) xid

  35. Table 4-1 Gradient descent algorithm for training a linear unit
  Gradient-Descent(training_examples, η)
  Each training example is a pair of the form <x, t>, where x is the vector of input values and t is the target output value; η is the learning rate
  • Initialize all network weights wi to small random numbers
  • Until the termination condition is met, do
    • Initialize each Δwi to 0
    • For each <x, t> in training_examples, do
      • Input the instance x to the unit and compute the output o
      • For each linear unit weight wi, do Δwi ← Δwi + η(t − o)xi
    • For each linear unit weight wi, do wi ← wi + Δwi
  (a Python transcription follows below)
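A direct Python sketch of Table 4-1; the learning rate, epoch-based termination condition, and toy dataset are illustrative choices:

```python
def gradient_descent(training_examples, eta=0.05, epochs=1000):
    """Table 4-1: batch gradient descent for a linear unit o = w . x,
    where each input vector x includes the constant input x0 = 1."""
    n = len(training_examples[0][0])
    w = [0.01] * n                           # small initial weights
    for _ in range(epochs):                  # termination: fixed epoch count
        delta = [0.0] * n                    # initialize each Delta w_i to 0
        for x, t in training_examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):               # Delta w_i += eta * (t - o) * x_i
                delta[i] += eta * (t - o) * x[i]
        w = [wi + di for wi, di in zip(w, delta)]   # w_i += Delta w_i
    return w

# Fit the linear target t = 2*x1 - 1 (x0 = 1 carries the bias weight w0).
data = [((1.0, 0.0), -1.0), ((1.0, 1.0), 1.0), ((1.0, 2.0), 3.0)]
print(gradient_descent(data))                # approximately [-1.0, 2.0]
```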

  36. Gradient Descent algorithm • Because the error surface contains only a single global minimum, the algorithm will converge to a weight vector with minimum error, regardless of whether the training examples are linearly separable, provided a sufficiently small learning rate is used • One common modification is to gradually reduce the value of the learning rate as the number of gradient descent steps grows

  37. Summary of Gradient Descent Algorithm • Gradient descent is an important general paradigm for learning. It is a strategy for searching through a large or infinite hypothesis space that can be applied whenever • the hypothesis space contains continuously parameterized hypotheses (e.g., the weights in a linear unit), and • the error can be differentiated with respect to these hypothesis parameters. • The key practical difficulties in applying gradient descent are • converging to a local minimum can sometimes be quite slow (i.e., it can require many thousands of gradient descent steps), and • if there are multiple local minima in the error surface, then there is no guarantee that the procedure will find the global minimum.

  38. Stochastic (Incremental) Gradient Descent • Stochastic gradient descent is also called incremental gradient descent • Whereas the gradient descent training rule presented in Equation (4.7) computes weight updates after summing over all the training examples in D, the idea behind stochastic gradient descent is to approximate this gradient descent search by updating weights incrementally, following the calculation of the error for each individual example • The modified training rule is like the rule of Equation (4.7), except that as we iterate through the training examples we update the weights according to Δwi = η(t − o)xi

  39. Stochastic (Incremental) Gradient Descent • The update Δwi = η(t − o)xi can be viewed as descending a distinct error function Ed(w) = ½(td − od)^2 defined for each individual training example d • The sequence of these weight updates, when iterated over all training examples, provides a reasonable approximation to descending the gradient with respect to the original error function E(w) • By making the value of η (the gradient descent step size) sufficiently small, stochastic gradient descent can be made to approximate true gradient descent arbitrarily closely

  40. Stochastic (Incremental) Gradient Descent (cont.) [slide figure not recovered]

  41. Stochastic (Incremental) Gradient Descent (cont.) • The key differences between standard gradient descent and stochastic gradient descent: • In standard gradient descent, the error is summed over all examples before updating the weights, whereas in stochastic gradient descent the weights are updated upon examining each training example • Summing over multiple examples in standard gradient descent requires more computation per weight-update step; on the other hand, because it uses the true gradient, standard gradient descent is often used with a larger step size per weight update than stochastic gradient descent • In cases where there are multiple local minima, stochastic gradient descent can sometimes avoid falling into these local minima, because it uses the gradients of the individual error functions Ed rather than the true gradient of E to guide its search • (the two update schedules are contrasted in the sketch below)
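A sketch contrasting the two update schedules on the same toy linear-unit problem; the learning rate, epoch count, and data are illustrative:

```python
def batch_epoch(w, examples, eta):
    """Standard gradient descent: sum the error over all examples
    before changing the weights (one true-gradient step per epoch)."""
    delta = [0.0] * len(w)
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(w, x))
        delta = [d + eta * (t - o) * xi for d, xi in zip(delta, x)]
    return [wi + d for wi, d in zip(w, delta)]

def stochastic_epoch(w, examples, eta):
    """Stochastic gradient descent: update the weights immediately
    after computing the error for each individual example."""
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Same toy data as before: t = 2*x1 - 1, with x0 = 1 as the bias input.
data = [((1.0, 0.0), -1.0), ((1.0, 1.0), 1.0), ((1.0, 2.0), 3.0)]
w_batch = w_sgd = [0.0, 0.0]
for _ in range(500):
    w_batch = batch_epoch(w_batch, data, eta=0.05)
    w_sgd = stochastic_epoch(w_sgd, data, eta=0.05)
print(w_batch, w_sgd)   # both approach [-1.0, 2.0]
```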

  42. Stochastic (Incremental) Gradient Descent • The delta rule is also known as the LMS (least-mean-square) rule, the Adaline rule, and the Widrow-Hoff rule • Notice that the delta rule Δwi = η(t − o)xi of Equation (4.10) has the same form as the perceptron training rule Δwi = η(t − o)xi of Section 4.4.2 • But the two rules are different, because they use different values for o: the delta rule uses the unthresholded linear output o = w · x, whereas the perceptron rule uses the thresholded output o = sgn(w · x)

  43. Stochastic (Incremental) Gradient Descent (cont.) • The delta rule can also be used to train thresholded perceptron units • If the unthresholded output fits the training examples perfectly, then the corresponding thresholded output will fit them as well • Even when the training examples cannot be fit perfectly, as long as the linear unit's output has the correct sign, the thresholded output will fit the target value correctly

  44. Summary on Perceptron • The key difference between these algorithms is that: • the perceptron training rule updates weights based on the error in the thresholded perceptron output, • whereas the delta rule updates weights based on the error in the unthresholded linear combination of inputs. • The difference between these two training rules is reflected in different convergence properties. • The perceptron training rule converges after a finite number of iterations to a hypothesis that perfectly classifies the training data, provided the training examples are linearly separable. • The delta rule converges only asymptotically toward the minimum error hypothesis, possibly requiring unbounded time, but converges regardless of whether the training data are linearly separable.
