
Deep Learning


Presentation Transcript


  1. Deep Learning

  2. Artificial Neural Networks • Artificial neural networks (ANNs) are cognitive models • Inspired by the structure and function of biological neurons • Appear in perceptron and back propagation prediction systems • Try to model the complex relationships between inputs and outputs • This forms a class of pattern matching algorithms used for solving regression and classification problems

  3. Artificial Neural Networks (cont.) • ANN types vary from those with only one or two layers of single-direction logic • To complicated networks with many inputs, layers, and directional feedback loops • Most systems use weights to change the parameters of the throughput and the varying connections to the neurons • Learning may be driven by outside teachers or be self-directed using built-in rules • ANNs can be operated in a supervised or unsupervised manner depending on the learning paradigm applied

  4. Deep Learning • Deep learning (DL) methods extend artificial neural networks (ANNs) • By building much deeper and more complex neural networks • Built of multiple layers of interconnected artificial neurons • With multiple hidden layers of units between the input and output layers • Can model more complex non-linear relationships than shallow learning methods like decision trees, SVMs, etc. • Often used to mimic how the human brain processes light, sound, and visual signals • Also applied to semi-supervised learning problems with large data sets containing very little labeled data

  5. Deep Learning (cont.) • To recognize simple patterns, basic classification tools based on shallow learning are good enough • ANN starts to exhibit its superior performance with the increase in the number of input features • When patterns become more complex, shallow neural networks become unusable • The number of nodes required in each layer grows exponentially with the increasing number of possible patterns • Training becomes expensive • Accuracy starts to suffer • Deep learning is really required

  6. Deep Learning (cont.) • Taking facial recognition as an example • Shallow learning or shallow neural nets are infeasible • The only practical choice is deep learning through deep nets • An important reason why deep learning outperforms other competitors • Computation capacity has been upgraded extensively, making the training process much faster than before • At each layer of a deep learning algorithm • The signal is transformed by a processing unit • Like an artificial neuron • The parameters are learned through training • There is no universally agreed upper threshold of depth

  7. Deep Learning (cont.) • To divide shallow learning with shallow neural nets from deep learning • Most agree that deep learning has multiple nonlinear layers (CAP > 2) • Credit Assignment Path (CAP) depth: the number of hidden layers • Some consider CAP > 10 to be very deep learning • The concept of deep learning is to learn representations of the original data • Conducted layer by layer • Through a combination of low-level features, deep learning forms more abstract representations of attributes, or high-level features

  8. Deep Learning (cont.) • Deep learning adjusts learning models via training and learning with plenty of data • Obtains layer-based representations from the original data • Learns effective feature representations of data to improve the accuracy in classification or prediction • The main difference between deep learning and other machine learning methods • The capability of feature learning • The deep model is a method • Feature learning is the objective • Deep learning differs from shallow learning in four aspects

  9. Deep Learning (cont.) • Emphasizes the depth of the ANN structure • More hidden layers are utilized in deep learning • The importance of feature learning is highlighted by the use of layer-wise feature transformation • Data features from original space are represented by the features in a new feature space • With this method, classification or prediction becomes easier and more accurate • The training models of deep learning are different • The layer-wise training is adopted in deep learning • Solves the problem of the vanishing gradient • Plenty of data are utilized to learn the features in deep learning • Not a must in shallow learning

  10. Deep Learning (cont.) • Deep learning simulates the operations in deeper layers of artificial neural networks • Simply a rebranding of neural networks • Deep neural networks include one input layer, one output layer, and multiple hidden layers • The connection strength between neurons is adjusted in the learning process • Common DL architectures include • The basic ANN, convolutional neural networks (CNNs), and recurrent neural networks (RNNs) • And many possible extensions of these networks • ANNs with multiple self-learning hidden layers are suggested for deep learning

  11. Deep Learning (cont.) • Each hidden layer includes several neurons • The output of the previous layer serves as the input of the next layer • Adopts an artificial neuron to simulate a biological neuron in the human brain • The layer-wise learning structure simulates the hierarchical structure of information processing in the human brain • The learning capability to obtain features by artificial neural networks with multiple hidden layers is strong • The features obtained by progressive learning in multiple layers can represent data accurately • Can be used for dynamic feature recognition to solve image and visual recognition problems

  12. Deep Learning (cont.) • The layer-wise pre-training method solves difficulties in training ANNs • Factors for the success of deep learning • Improvement in algorithms, realizing layer-wise feature extraction, simulating the capability of the human brain in learning • And simulating the hierarchical structure of the human brain during information processing • Deep learning has been around for decades • Two external reasons make DL popular • One is the adoption of GPUs and the improvement of computers' computation capability (HPC) to support the large-scale training of deep learning models

  13. Deep Learning (cont.) • The other is the convenience of acquiring a large amount of training data (IoT) • By processing huge volumes of data, DL algorithms will outperform all classic machine learning algorithms • May need to process a large amount of training data to improve the prediction accuracy • One can obtain the abstract value of big data by the power of deep learning • There is still a significant gap between current AI and the simulation of the human brain with high fidelity • e.g., Only a few exposures are needed to teach a child to recognize a person

  14. Deep Learning (cont.) • The child can adapt to any lighting effects and appearance changes • For a computer to recognize a person, a huge number of stored images are needed for learning • It is hard for the computer to adapt to changes in lighting, clothing, glasses, or other factors • Four steps are needed to generate a DL model

  15. Gradient Descent • A better mechanism, very popular in machine learning, for finding the convergence point • Calculates the gradient of the loss curve at the starting point • The negative gradient always points in the direction of steepest decrease in the loss function

  16. Gradient Descent (cont.) • The gradient of the loss is equal to the derivative of the loss function, i.e., the slope of the curve • The process is like rolling a rock down a slope • It moves quickly along the surface if the gradient is high • When the gradient is small, the process is slow • In gradient descent • A batch is the total number of examples used to calculate the gradient in a single iteration • A large data set with randomly sampled examples probably contains redundant data • Redundancy becomes more likely as batch size grows • Some redundancy can be useful to smooth out noisy gradients • But enormous batches tend not to carry much more predictive value than large batches

  17. Gradient Descent (cont.) • Can estimate a big average from a much smaller one by choosing examples at random from the data set • Stochastic gradient descent (SGD) uses only a single example per iteration • i.e., A batch size of 1 • Given enough iterations, SGD works but is very noisy • The term stochastic indicates that the one example comprising each batch is chosen at random • Mini-batch SGD is a compromise between full-batch iteration and SGD • A mini-batch is typically between 10 and 1,000 examples, chosen at random • Mini-batch SGD reduces the amount of noise in SGD and is still more efficient than full-batch
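The following is a minimal NumPy sketch, not from the original slides, contrasting the three batching strategies just described on a simple squared-error model; the function name, synthetic data, and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.05, batch_size=32, epochs=200):
    """Gradient descent on mean squared error for a linear model.

    batch_size=1      -> stochastic gradient descent (SGD)
    batch_size=len(X) -> full-batch gradient descent
    anything between  -> mini-batch SGD
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        order = np.random.permutation(n)            # examples chosen at random
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w + b - y[idx]            # prediction error on this batch
            w -= lr * (X[idx].T @ err) / len(idx)    # step along the negative gradient
            b -= lr * err.mean()
    return w, b

# Illustrative usage on synthetic data
X = np.random.randn(1000, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3
w, b = minibatch_sgd(X, y, batch_size=32)
```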

  18. Mathematical Description of an Artificial Neuron • An ANN is an abstract mathematical model • The main unit of an ANN is a neuron • There are many similarities between ANNs and biological neural networks in the human brain • An ANN consists of a group of connected input/output units • Each connection corresponds to a synapse of biological neurons and is expressed as a weighted edge • The weighted value represents activation if it is positive and suppression if it is negative • During learning, these weights are adjusted based on the gap between the predicted output and labeled test data • Typically with three types of parameters

  19. Mathematical Description of an Artificial Neuron (cont.) • The interconnection pattern between the different layers of neurons • The learning process for updating the weights of the interconnections • The activation function that converts a neuron's weighted input to its output activation • The inputs to an artificial neuron are all from external stimulating signals • Denoted by x_i for i = 1, 2, …, n • The artificial neuron calculates a weighted sum of the input signals • The weights are denoted as w_i for i = 1, 2, …, n • A nonlinear activation function is applied to produce the output signal

  20. Mathematical Description of an Artificial Neuron (cont.) • The nonlinear sigmoid function is suggested to model the operation of an artificial neuron
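Written out, a plausible form of the neuron model the slide refers to, using the inputs x_i and weights w_i defined above; the bias symbol b is introduced here for illustration:

    u = \sum_{i=1}^{n} w_i x_i + b, \qquad y = \sigma(u) = \frac{1}{1 + e^{-u}}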

  21. Single Layer Artificial Neural Networks • The perceptron is the simplest ANN • Includes an input layer and an output layer without a hidden layer • The input layer node is used for receiving data • The output layer node yields output data

  22. Single Layer Artificial Neural Networks (cont.) • The perceptron simulates the perception process in the human nervous system • The input node corresponds to an input neuron • The output node corresponds to a decision-making neuron • The weight parameter corresponds to the strength of the connection between neurons • By constantly stimulating neurons, the human brain can learn unknown knowledge • An activation function f(x) is used to mimic the stimulation of a neuron in the human brain • From the perspective of mathematics • Each input item corresponds to an attribute of the object

  23. Single Layer Artificial Neural Networks (cont.) • The weight stands for the degree to which the attribute reflects the object • Multiplied together with the degree of deviation (bias), the input to f is obtained • Then the output is calculated • The result may not necessarily be ideal • This result stands for the output of the perceptron • The equation of the model is given below • w and x are n-dimensional vectors • Typically a sigmoid function is used for f(x) • The derivative of the sigmoid function is f(x)(1 − f(x))

  24. Single Layer Artificial Neural Networks (cont.) • Or a hyperbolic tangent function is used
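A plausible reconstruction of the perceptron model equation and activation functions referred to on the last two slides; the bias term b is an assumed symbol, and the exact form on the original slides may differ:

    \hat{y} = f(w \cdot x + b), \quad w, x \in \mathbb{R}^{n}
    f(x) = \frac{1}{1 + e^{-x}} \ \text{(sigmoid)}, \qquad f'(x) = f(x)\,(1 - f(x))
    f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \ \text{(hyperbolic tangent)}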

  25. Single Layer Artificial Neural Networks (cont.) • To obtain good results, appropriate weights are needed • An important input needs a high weight when calculating the output • The value that should be set for each weight cannot be known in advance • First randomly initialize the weights and a bias value • Then the weight values are adjusted dynamically during training • The weight renewal equation is given below • w_j(k) is the weight value of input node j after k iterations • 𝜆 is known as the learning rate

  26. Single Layer Artificial Neural Networks (cont.) • x_ij is the input value of node j in the ith training data sample • The learning rate is within the interval [0, 1] • A hyperparameter, i.e., a configuration setting used to tune how the model is trained • If 𝜆 is close to 0, the new weight value is mainly influenced by the old weight value • Learning is slow, but the optimal weight value may be found reliably • If 𝜆 is close to 1, the new weight value is mainly influenced by the current adjustment amount • Learning is fast, but it may skip the optimal weight value • The next point may perpetually bounce haphazardly across the bottom of the loss curve
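A plausible reconstruction of the weight renewal equation referenced above, written in the standard perceptron-learning form; y_i (the label of the ith training sample) and \hat{y}_i (the predicted output) are symbols introduced here for illustration:

    w_j(k+1) = w_j(k) + \lambda \,(y_i - \hat{y}_i)\, x_{ij}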

  27. Single Layer Artificial Neural Networks (cont.)

  28. Single Layer Artificial Neural Networks (cont.) • A good value is related to how flat the loss function is • If the gradient of the loss function is small, one can safely try a larger learning rate • The ideal learning rate in one dimension is • The inverse of the second derivative of f(x) at x • The ideal learning rate for two or more dimensions is the inverse of the Hessian • The matrix of second partial derivatives • In some schemes, the value of 𝜆 is larger in the first several iterations • And is reduced gradually in later iterations
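Written out, the ideal step sizes mentioned above, where f is the loss function and H its Hessian (matrix of second partial derivatives):

    \lambda^{*} = \frac{1}{f''(x)} \ \text{(one dimension)}, \qquad \lambda^{*} = H^{-1} \ \text{(two or more dimensions)}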

  29. Multilayer Artificial Neural Network • A multilayer ANN is composed of one input layer, one or more hidden layer(s), and one output layer • The mathematical equation of a single-layer ANN

  30. Multilayer Artificial Neural Network (cont.) • One summation function is used to compute the weighted sum of the input signals together with an offset or threshold • Its mathematical equation is shown below • One nonlinear activation function plays the role of nonlinear mapping • It restricts the output amplitude of the neuron within a certain range, often (0, 1) or (-1, 1) • Its mathematical equation is shown below • f(⋅) is an activation function
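Plausible reconstructions of the two equations referred to above; the offset (threshold) symbol θ is an assumption:

    u = \sum_{i=1}^{n} w_i x_i - \theta \ \text{(weighted sum with offset)}
    y = f(u) \ \text{(nonlinear activation)}, \quad \text{e.g., } f(u) = \frac{1}{1 + e^{-u}} \in (0, 1) \ \text{or} \ f(u) = \tanh(u) \in (-1, 1)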

  31. Multilayer Artificial Neural Network (cont.) • There is no uniform rule for the number of hidden layers in an ANN • Nor for the number of neurons in the input, output, and each hidden layer • Or how to select the activation function of neurons in each layer • There is also no standard for some specific cases • These need to be chosen independently or based on personal experience • There is a certain heuristic nature to the choice of network • This is why the ANN is deemed a heuristic algorithm

  32. Forward Propagation and Backward Propagation • To let the ANN have a specific function and practical application value • A set of appropriate weights for a multilayer ANN must be obtained • The ANN solves this problem with the back propagation algorithm • First, one needs to know how the ANN propagates signals forward from the input end to the output end

  33. Forward Propagation • Forward propagation starts from the input layer towards a hidden layer • The input of a hidden unit is h_i^k • It stands for the input of hidden unit i in layer k • b_i^k stands for the offset of hidden unit i in layer k • The corresponding output state is computed from it by the activation function • For convenience, often let x_0 = b and w_i0 = 1 • The equation of forward propagation from hidden units of layer k to layer k+1 is given below • m_k stands for the quantity of neurons in the hidden units of layer k

  34. Forward Propagation (cont.) • w_ij^k stands for an entry of the weight matrix from layer k to layer k+1 • The final output is given below • m_o stands for the number of output units • There can be multiple outputs in an artificial neural network • But generally one output is set up by default
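One plausible reconstruction of the forward propagation equations referenced on the last two slides, consistent with the symbols h_i^k, b_i^k, w_ij^k, m_k, m_o, and O_i defined in the text; the exact form on the original slides may differ:

    o_i^{k} = f(h_i^{k} + b_i^{k}) \ \text{(output state of hidden unit } i \text{ in layer } k\text{)}
    h_i^{k+1} = \sum_{j=1}^{m_k} w_{ij}^{k}\, o_j^{k} \ \text{(with } x_0 = b, \ w_{i0} = 1 \text{ the offset can be folded into the sum)}
    O_i = f(h_i^{M} + b_i^{M}), \quad i = 1, \dots, m_o \ \text{(final network outputs)}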

  35. Forward Propagation (cont.) • M stands for the total number of layers • O_i stands for the outcome value of output unit i • Forward Propagation-Based Output Prediction in ANN • In an ANN, each set of inputs is modified by unique weights and biases • When calculating the activation of the third neuron in the first hidden layer • The first input is modified by a weight of 2 • The second by 6, the third by 4 • And then a bias of 5 is added • Each activation is unique • Each edge has a unique weight

  36. Forward Propagation (cont.) • Each node has a unique bias • This simple activation will flood across the entire network • The first set of inputs is passed to the first hidden layer • The activations of the first hidden layer pass to the next hidden layer, until they reach the output layer • The outcome of the classification is determined by the score of each output node • This will be repeated for another set of inputs • Such a procedure of classification in an ANN is called forward propagation
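A minimal NumPy sketch of forward propagation, not taken from the slides; the input values and layer shapes are illustrative assumptions, except that the single-neuron line reuses the weights 2, 6, 4 and bias 5 from the example above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Propagate an input vector x through a list of (W, b) layers."""
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)        # weighted sum plus bias, then activation
    return a                          # output scores decide the classification

# Single-neuron example from the slide: weights 2, 6, 4 and a bias of 5
x = np.array([0.5, -1.0, 0.25])       # illustrative inputs
third_neuron = sigmoid(np.array([2.0, 6.0, 4.0]) @ x + 5.0)

# Illustrative network: 3 inputs -> two hidden layers of 4 -> 2 output scores
layers = [
    (np.random.randn(4, 3), np.random.randn(4)),
    (np.random.randn(4, 4), np.random.randn(4)),
    (np.random.randn(2, 4), np.random.randn(2)),
]
scores = forward(x, layers)
```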

  37. Forward Propagation (cont.)

  38. Backward Propagation • To train an ANN • Most algorithms employ some form of gradient descent method • Using back propagation to compute the actual gradients • This is done by taking the derivative of the cost function with respect to the network parameters, i.e., weights and biases • How to update the weight w_ij through a learning or training process • Try to let the output of the artificial neural network be identical with the standard values of the training sample

  39. Backward Propagation (cont.) • This kind of output is named the ideal output • It is impossible to accurately achieve this objective • Instead, try to let the actual output be as close as possible to the ideal output • This is the problem of finding a group of appropriate weights • Make the error function E(W) reach its minimum by figuring out appropriate values of W • O_is stands for the outcome value of output unit i when the training sample is s
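The error function E(W) is not spelled out in the transcript; a typical squared-error form consistent with the symbols above is shown here, where T_is denotes the ideal output of unit i for sample s (a symbol introduced for illustration):

    E(W) = \frac{1}{2} \sum_{s} \sum_{i} \left( O_{is} - T_{is} \right)^{2}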

  40. Backward Propagation (cont.) • As a function of each variable w_ij^k, E(W) is continuously differentiable and nonlinear • In order to work out the minimum • Generally adopt the steepest gradient descent method • According to this method, constantly update the weights in the direction of the negative gradient until the conditions set up by the user are satisfied • The so-called direction of the gradient is found by computing the partial derivatives of the function • The weight after the kth update is w_ij(k) • If ∇E(W) ≠ 0, the renewed weight at the (k+1)th update is given below

  41. Backward Propagation (cont.) • 𝜂 is the learning rate of the network • It plays the same role as the learning rate 𝜆 in the perceptron • The update rule is very simple • If the error goes down when the weight increases (∇E(W) < 0), then increase the weight • Otherwise, if the error goes up when the weight increases (∇E(W) > 0), then decrease the weight • When ∇E(W) = 0 or |∇E(W)| < 𝜀, the renewal stops • 𝜀 is the permissible error • w_ij(k) at this time will be the final weight of the artificial neural network • The network constantly adjusts the weights • This is called the learning process of the ANN • Also called the backward propagation algorithm of the network
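Written out, the steepest-descent update described above (a plausible reconstruction of the slide's equation):

    w_{ij}(k+1) = w_{ij}(k) - \eta \, \frac{\partial E(W)}{\partial w_{ij}} \Big|_{W = W(k)}, \quad \text{repeated until } \nabla E(W) = 0 \ \text{or} \ |\nabla E(W)| < \varepsilon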

  42. Backward Propagation (cont.) • Example of Backward Propagation in an ANN • Consider an ANN with two hidden layers • Four neurons per layer • The accuracy depends on the weights and biases • The goal is to make the predicted output as close to the actual output as possible • The key to improving accuracy is the training • Let y denote the output of forward propagation • And y* denote the correct output • The cost is the difference between the two • Denoted by (y − y*) • After many rounds of the training process, the cost should decrease more and more

  43. Backward Propagation (cont.) • During training, the ANN adjusts weights and biases step by step until the predicted output matches the actual output • To achieve this, three steps are taken • When updating the weights and the bias • i.e., The weights between the first neuron in the output layer and the neurons in the second hidden layer • And the bias of the first output neuron • The error between the forward-propagated output and its actual result needs to be calculated first • Suppose the computed error is 3 • The gradient of any weight and bias is calculated • By applying the chain rule

  44. Backward Propagation (cont.) • e.g., The weights and biases are 5, 3, 7, 2, 6, respectively • The corresponding gradients are -3, 5, 2, -4, -7 • The updated weights and biases can then be calculated • e.g., 5 - 0.1 × (-3) = 5.3 • 0.1 is the learning rate set by the user • The bias of the node is revised as 6 - 0.1 × (-7) = 6.7
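The same update written as a short Python sketch, reproducing the arithmetic from the slide (parameters 5, 3, 7, 2, 6; gradients -3, 5, 2, -4, -7; learning rate 0.1):

```python
params = [5.0, 3.0, 7.0, 2.0, 6.0]     # four weights and one bias from the slide
grads = [-3.0, 5.0, 2.0, -4.0, -7.0]   # their gradients from the chain rule
lr = 0.1                               # learning rate set by the user

updated = [p - lr * g for p, g in zip(params, grads)]
# -> [5.3, 2.5, 6.8, 2.4, 6.7], e.g. 5 - 0.1*(-3) = 5.3 and 6 - 0.1*(-7) = 6.7
```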

  45. Backward Propagation (cont.) • This simple error will flood across the entire network • The output error will propagate backward to the second hidden layer • The associated weights and biases will be updated accordingly • This operation is repeated until the error is propagated to the input layer • At this point, the weights and biases of the whole network have been updated • Such a procedure of updating weights in an ANN is called backward propagation • It will be repeated for another set of errors • TensorFlow handles backpropagation automatically
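A minimal TensorFlow sketch of the point just made, that the framework computes gradients by backpropagation automatically; the tiny model, data, and hyperparameters are illustrative assumptions.

```python
import tensorflow as tf

# Illustrative network: 3 inputs -> 4 hidden units -> 1 output
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="sigmoid", input_shape=(3,)),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

x = tf.random.normal((8, 3))
y_true = tf.random.normal((8, 1))

with tf.GradientTape() as tape:
    y_pred = model(x)
    loss = tf.reduce_mean(tf.square(y_pred - y_true))   # squared-error cost

# Gradients of the cost w.r.t. all weights and biases, via automatic backpropagation
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```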

  46. Backward Propagation (cont.)

  47. Backward Propagation (cont.) • Hyperlipidemia Diagnosis Using an Artificial Neural Network • A dataset of triglycerides, total cholesterol, high-density lipoprotein, and low-density lipoprotein • And whether or not a patient has hyperlipidemia • 1 for yes and 0 for no • Collected during a health examination of people at a class-two, grade-A hospital in Wuhan • The goal is a preliminary judgment on whether a person who received a health examination has hyperlipidemia • Suppose the health examination data are {3.16, 5.20, 0.97, 3.49} in sequence

  48. Backward Propagation (cont.)

  49. Backward Propagation (cont.) • This is a binary classification problem with four attributes • 1 for hyperlipidemia or 0 for healthy • Prediction and classification can be conducted using an ANN • An appropriate ANN model needs to be chosen • There is not enough sample data for training in this case • So it is not necessary to set up too many hidden layers and neurons • One hidden layer is set up • The quantity of neurons in each layer is five • The tansig function is chosen as the activation function between the input layer and hidden layer • This is mathematically equivalent to tanh() • The purelin function is chosen as the function between the hidden layer and output layer

  50. Backward Propagation (cont.) • Choosing other functions has little effect on the results • The network parameters are then set • Then the network is trained with the data • The error between the actual output of the network in the training process and the ideal output is reduced gradually • A satisfactory state is reached after the second round of back propagation
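The tansig/purelin names suggest the original example used MATLAB's neural network toolbox; a rough Keras equivalent under that assumption is sketched below, with a tanh hidden layer of five neurons and a linear output. The two training rows are placeholders, not the real dataset.

```python
import numpy as np
import tensorflow as tf

# Four blood-lipid attributes per person; label 1 = hyperlipidemia, 0 = healthy
X_train = np.array([[1.9, 4.8, 1.2, 2.8],    # placeholder rows for illustration only
                    [3.4, 6.1, 0.9, 4.0]])
y_train = np.array([0.0, 1.0])

model = tf.keras.Sequential([
    tf.keras.layers.Dense(5, activation="tanh", input_shape=(4,)),   # tansig ~ tanh
    tf.keras.layers.Dense(1, activation="linear"),                   # purelin ~ linear
])
model.compile(optimizer="sgd", loss="mse")
model.fit(X_train, y_train, epochs=100, verbose=0)

# Preliminary judgment for the examination data from the slide
sample = np.array([[3.16, 5.20, 0.97, 3.49]])
print(model.predict(sample))   # outputs near 1 suggest hyperlipidemia, near 0 healthy
```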
