
A bit of History


Presentation Transcript


  1. A bit of History

  2. A bit of History • 1943 Warren McCulloch & Walter Pitts: How to go from neurons to complex thought. Binary threshold activations. • 1949 Donald Hebb: neurons that fire together wire together. Weights: learning and memory. Timeline: McCulloch & Pitts / Hebb → Rosenblatt's Perceptron → Minsky & Papert (XOR) → Backpropagation Algorithm. [McCulloch, 43]

  3. A bit of History. In 1958, Rosenblatt applied Hebb's learning to the McCulloch & Pitts design. The perceptron outputs 1 if w · x + b > 0 and 0 otherwise, where w is a vector of real-valued weights, · is the dot product and b is a real scalar constant (bias). • The Mark I Perceptron, a visual classifier with: • 400 photosensitive receptors (sensory units) • 512 stepping motors (association units, trainable) • 8 output neurons (response units). [Rosenblatt, 58] [Mark I Perceptron] [Perceptrons]
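To make the decision rule concrete, here is a minimal sketch, not the Mark I hardware or Rosenblatt's exact procedure, of a perceptron with real-valued weights w, a dot product and a bias b, trained with a simple error-correction rule on a hypothetical linearly separable toy problem (logical AND); the learning rate, epoch count and function names are illustrative.

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Threshold unit: fires (1) if the weighted sum exceeds 0, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

def perceptron_train(X, y, lr=0.1, epochs=10):
    """Error-correction rule: adjust weights only when the prediction is wrong."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            error = yi - perceptron_predict(xi, w, b)
            w += lr * error * xi
            b += lr * error
    return w, b

# Linearly separable toy problem (logical AND)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)
print([perceptron_predict(xi, w, b) for xi in X])   # expected: [0, 0, 0, 1]
```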

  4. A bit of History. Rosenblatt acknowledged a set of limitations in the Perceptron machine. • Minsky & Papert did too in “Perceptrons: an introduction to computational geometry”, including: • A multilayer perceptron (MLP) is needed to learn basic functions like XOR. • An MLP cannot be trained (no training method was known at the time). [Minsky, 69] This had a huge impact on the public, resulting in a drastic cut in funding of NNs until the mid 80s: the 1st AI WINTER.

  5. A bit of History. How can we optimize neuron weights which are not directly connected to the error measure? Backpropagation algorithm: use the chain rule to find the derivative of the cost with respect to any variable, i.e. the contribution of each weight to the overall error. First proposed for training MLPs by Werbos in '74, rediscovered by Rumelhart, Hinton and Williams in '85. End of the NNs winter. Training with backprop: forward pass from input to output, error measurement (loss function), then find the gradients towards minimizing the error layer by layer (backward pass). [Werbos, 74] [Rumelhart, 85]
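As an illustration of that training cycle (forward pass, loss, backward pass via the chain rule, weight update), here is a minimal numpy sketch of backpropagation on a tiny MLP learning XOR; the layer size, learning rate and number of steps are arbitrary choices for the sketch, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: the function a single perceptron cannot represent
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 4 sigmoid units, small random initial weights
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)

lr = 2.0
for step in range(20000):
    # Forward pass: input -> hidden -> output
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Error measurement (mean squared error)
    loss = np.mean((out - y) ** 2)

    # Backward pass: the chain rule gives each weight's contribution to the error
    d_out = 2 * (out - y) / len(X) * out * (1 - out)   # gradient at the output pre-activation
    dW2, db2 = h.T @ d_out, d_out.sum(0)
    d_h = (d_out @ W2.T) * h * (1 - h)                 # error propagated back to the hidden layer
    dW1, db1 = X.T @ d_h, d_h.sum(0)

    # Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(round(loss, 4), out.round(2))   # loss should shrink and the outputs approach 0, 1, 1, 0
```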

  6. Feedforward Neural Networks

  7. Feedforward Neural Networks: SGD, Epochs, Batches and Steps. Computing the gradients using all available training data would require huge amounts of memory. • Stochastic Gradient Descent: iteratively update weights using random samples (hence, stochastic). • Each feedforward/backward cycle (a step) processes a random batch of images. • Typical batch sizes: powers of 2. • Batch size = 1 → fully stochastic (slower). • Batch size = dataset_size → deterministic (bad generalization). • An epoch is the processing of the whole dataset once; it corresponds to processing as many batches as dataset_size / batch_size.
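A minimal sketch of the epoch/batch/step bookkeeping described above; the dataset, batch size and the commented-out update call are hypothetical placeholders.

```python
import numpy as np

# Hypothetical dataset: 1000 examples, 10 features each (illustration only)
dataset_size, batch_size = 1000, 32
X = np.random.randn(dataset_size, 10)

steps_per_epoch = dataset_size // batch_size       # ≈ dataset_size / batch_size

for epoch in range(3):                             # one epoch = one full pass over the dataset
    indices = np.random.permutation(dataset_size)  # random sampling -> "stochastic"
    for step in range(steps_per_epoch):
        batch = X[indices[step * batch_size:(step + 1) * batch_size]]
        # forward pass, loss, backward pass and weight update would go here, e.g.:
        # update_weights(batch)   # hypothetical helper
```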

  8. Feedforward Neural Networks: Activation functions. Activation functions transform the output of a layer to a given range. If the function is non-linear, the net can learn non-linear patterns (e.g., XOR). • ReLU: does not saturate, does not vanish, faster, but may die. • Sigmoid/tanh: zero gradient in most of f(x), so they saturate; the maximum gradient is 0.25 (sigmoid) or 1 (tanh), so gradients vanish. ReLU is a safe choice in most cases. Undying alternatives: Leaky ReLU, ELU, ...
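For reference, simple numpy definitions of the activations mentioned above, with comments on their gradient behaviour; the Leaky ReLU slope alpha is an illustrative default, not a value from the slides.

```python
import numpy as np

def sigmoid(z):
    # Saturates: gradient is at most 0.25 and near zero for large |z|
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Also saturates: gradient is at most 1 (at z = 0)
    return np.tanh(z)

def relu(z):
    # Does not saturate for z > 0, but "dies" (zero gradient) for z < 0
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # A small slope for z < 0 keeps a nonzero gradient, so units cannot die
    return np.where(z > 0, z, alpha * z)

z = np.linspace(-6, 6, 5)
print(relu(z), leaky_relu(z), sigmoid(z), tanh(z), sep="\n")
```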

  9. Feedforward Neural Networks: SGD learning rate. Gradient descent is a simple and straightforward optimization algorithm to update weights towards a minimum. The learning rate (LR) determines how much we move in that direction. With the wrong LR you may end up in local minima or saddle points, or progress too slowly. SGD will overshoot unless we keep decreasing the LR. [Dauphin, 14]
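A toy sketch of the overshooting point: gradient descent on a simple quadratic loss with a deliberately large learning rate, which only settles because the LR is decayed at every step; the loss, starting point, LR and decay factor are illustrative choices.

```python
import numpy as np

def grad(w):
    # Gradient of a toy quadratic loss L(w) = w**2 (illustration only)
    return 2 * w

w, lr = 5.0, 1.2
for step in range(30):
    w -= lr * grad(w)   # with this LR held fixed, the iterates overshoot and diverge
    lr *= 0.9           # decaying the LR lets the updates settle into the minimum
print(round(w, 6))      # approaches the minimum at w = 0
```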

  10. Feedforward Neural Networks: Other optimization methods. • Momentum: include a fraction of the previous gradient; keeps the general direction so far. • Nesterov: compute the current gradient considering where the previous gradient took you (RNNs?). • Adagrad: parameter-wise LR considering past updates; good for infrequent patterns (GloVe), but the LR vanishes due to the growing history. • Adadelta: Adagrad with a decaying average over the history, typically set around 0.9. • Adam: Adadelta + Momentum. [Dauphin, 14] [Ruder, www]
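A rough numpy sketch of two of the update rules above, momentum and Adam, applied to the same toy quadratic loss; the hyperparameters (lr, mu, beta1, beta2, eps) are common defaults used here for illustration, not values from the slides.

```python
import numpy as np

def grad(w):
    # Gradient of a toy quadratic loss L(w) = w**2 (illustration only)
    return 2 * w

# Momentum: keep a fraction (mu) of the previous update direction
w, v, lr, mu = 5.0, 0.0, 0.1, 0.9
for _ in range(100):
    v = mu * v - lr * grad(w)
    w += v
print(round(w, 4))   # close to the minimum at w = 0

# Adam: per-parameter step size from running averages of the gradient (m) and its square (s)
w, m, s = 5.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 101):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g          # momentum-like first moment
    s = beta2 * s + (1 - beta2) * g ** 2     # Adadelta/RMSProp-like second moment
    m_hat, s_hat = m / (1 - beta1 ** t), s / (1 - beta2 ** t)   # bias correction
    w -= lr * m_hat / (np.sqrt(s_hat) + eps)
print(round(w, 4))   # also close to 0
```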

  11. Feedforward Neural Networks: Regularization. Why do we need regularization? Because the difference between Machine Learning and Optimization is called Generalization.

  12. Feedforward Neural Networks: Generalization. Polynomial regression with models of increasing complexity (figure).

  13. Feedforward Neural Networks: Generalization. Polynomial regression with three models of increasing complexity: Model 1: training error huge, generalization bad. Model 2: training error small, generalization good. Model 3: training error tiny, generalization horrible.
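A small numpy sketch that reproduces the pattern above with polynomial regression, using an underfitting, a well-fitting and an overfitting degree; the data-generating function, noise level and degrees are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a smooth underlying function (illustration only)
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 11):
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
# degree 1 underfits (huge training error, bad generalization),
# degree 3 fits well (small training error, good generalization),
# degree 11 interpolates the noise (tiny training error, horrible generalization)
```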

  14. Feedforward Neural Networks: Generalization. What policy can we use to improve model generalization? Occam's Razor: when you have two competing hypotheses that make the same predictions, the simpler one is the better. Machine Learning: given two models that have a similar performance, it's better to choose the simpler one.

  15.-22. Feedforward Neural Networks: Model Complexity (figure-only slides).

  23. Feedforward Neural Networks: L1 / L2 Regularization. What complexities do these methods use?

  24. A bit of History

  25. Feedforward Neural Networks: L1 / L2 Regularization. The loss is augmented with an L1 penalty (λ · Σ|w|) or an L2 penalty (λ · Σw²).

  26. Feedforward Neural Networks: L1 / L2 Regularization. L1 penalty vs. L2 penalty (figure).
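A minimal sketch of adding the L1 and L2 penalties to a data loss; the function name, weight values and lambda values are illustrative, not from the slides.

```python
import numpy as np

def loss_with_penalty(w, data_loss, l1=0.0, l2=0.0):
    """Add regularization penalties to the data loss.

    The L1 term pushes weights to exactly zero (sparse models);
    the L2 term pushes all weights towards small values (weight decay)."""
    return data_loss + l1 * np.sum(np.abs(w)) + l2 * np.sum(w ** 2)

w = np.array([0.5, -2.0, 0.0, 3.0])
print(loss_with_penalty(w, data_loss=1.0, l2=0.01))   # L2-regularized loss
print(loss_with_penalty(w, data_loss=1.0, l1=0.01))   # L1-regularized loss
```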

  27. Feedforward Neural Networks: Dropout. Step n: 50% dropout applied to each of two hidden layers (figure).

  28. Feedforward Neural Networks: Dropout. Step n+1: 50% dropout again; a different random set of units is dropped at this step (figure).

  29. Feedforward Neural Networks: Dropout. Step n+2: 50% dropout, with yet another random set of dropped units (figure).

  30. Feedforward Neural Networks: Dropout. Before drop-out / after drop-out (formulas in figure). What complexity does this method use?
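A hedged sketch of a dropout layer using the common "inverted dropout" formulation, which rescales the surviving activations at training time so nothing changes at test time; the slide's figure may instead show the original formulation that rescales weights at test time.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, training=True):
    """Inverted dropout on a layer's activations h.

    During training, each unit is zeroed with probability p_drop and the
    survivors are rescaled so the expected activation stays the same.
    At test time the layer is left untouched."""
    if not training:
        return h
    mask = rng.random(h.shape) >= p_drop        # a fresh random mask at every step
    return h * mask / (1.0 - p_drop)

h = np.ones((2, 6))
print(dropout(h))                    # roughly half the units zeroed, the rest scaled by 2
print(dropout(h, training=False))    # unchanged at test time
```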

  31. Feedforward Neural Networks: Early Stopping (figure).

  32. Feedforward Neural Networks: Early Stopping (figure).

  33. Feedforward Neural Networks: Early Stopping. What complexity does this method use?
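A toy sketch of an early-stopping loop: training stops once the validation loss has not improved for a given number of epochs (the patience); the simulated validation curve and the patience value are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy validation-loss curve: improves at first, then starts to overfit (illustration only)
val_losses = [1.0 / (1 + e) + 0.004 * max(0, e - 30) + rng.normal(0, 0.002) for e in range(200)]

best_val, best_epoch, patience, bad_epochs = float("inf"), 0, 5, 0
for epoch, val in enumerate(val_losses):
    if val < best_val:                       # validation improved: remember this model
        best_val, best_epoch, bad_epochs = val, epoch, 0
    else:                                    # no improvement this epoch
        bad_epochs += 1
        if bad_epochs >= patience:           # give up after `patience` bad epochs in a row
            break
print(f"stopped at epoch {epoch}, best model from epoch {best_epoch}")
```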

  34. Feedforward Neural Networks (figure-only slide).

  35. Feedforward Neural Networks (figure-only slide).

  36. Feedforward Neural Networks: Variance (figure-only slide).

  37. Feedforward Neural Networks: Normalizing inputs. Why does input normalization matter? Unnormalized loss space vs. normalized loss space (figure).

  38. Feedforward Neural Networks: Normalizing inputs. Why does input normalization matter? Unnormalized loss space vs. normalized loss space (figure).
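A short sketch of normalizing inputs to zero mean and unit variance per feature; the two feature scales are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two features on very different scales (illustration only)
X = np.column_stack([rng.normal(1.7, 0.1, 500), rng.normal(30_000, 8_000, 500)])

# Normalize with statistics computed on the training set only
mean, std = X.mean(axis=0), X.std(axis=0)
X_norm = (X - mean) / std            # each feature now has zero mean and unit variance

print(X.std(axis=0).round(2))        # wildly different scales -> elongated loss surface
print(X_norm.std(axis=0).round(2))   # comparable scales -> rounder loss surface, easier for SGD
```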

  39. Feedforward Neural Networks: Input → Output (figure).

  40. Feedforward Neural Networks: Input → Output (figure).

  41. Feedforward Neural Networks: Feedforward pass on the input, then backpropagation (figure).

  42. Feedforward Neural Networks: Feedforward pass on the input, then backpropagation (figure).

  43. Feedforward Neural Networks: Vanishing gradients. Weights get stuck! 0.42 → 0.419999.

  44. Feedforward Neural Networks: Vanishing/Exploding Gradients. Why do gradients get smaller and smaller at each layer during backpropagation?

  45. Feedforward Neural Networks: Vanishing/Exploding Gradients. Why do gradients get smaller and smaller at each layer during backpropagation? Well, this is related to the derivatives chain rule.

  46. Feedforward Neural Networks: Vanishing/Exploding Gradients. Why do gradients get smaller and smaller at each layer during backpropagation? (figure)

  47. Feedforward Neural Networks: Vanishing/Exploding Gradients. Term values < 1.0 + many multiplying terms = vanishing gradients.

  48. Feedforward Neural Networks: Vanishing/Exploding Gradients. What about exploding gradients? Term values > 1.0 + many multiplying terms = exploding gradients.
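A numeric sketch of both effects: backpropagation multiplies one factor per layer (chain rule), so a long product of terms below 1.0 vanishes while a long product of terms above 1.0 explodes; the number of layers and the factor values are illustrative.

```python
import numpy as np

layers = 30
small_terms = np.full(layers, 0.8)   # e.g., saturating sigmoid derivatives (at most 0.25)
large_terms = np.full(layers, 1.5)   # e.g., large weights

print(np.prod(small_terms))   # ~1e-3: the gradient reaching the early layers is tiny
print(np.prod(large_terms))   # ~2e5: the gradient blows up instead
```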

  49. Feedforward Neural Networks: Weights initialization. Weights drawn from a normal distribution (figure showing sampled values 0.48, -1.83, -0.37, -0.11, 0.43).

  50. Feedforward Neural Networks: Weights initialization. The same normally distributed weights, now with inputs equal to 1 (figure).
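A hedged sketch of why the scale of normally distributed initial weights matters: feeding an input of ones through several linear layers, plain N(0, 1) weights blow the activations up, while scaling the standard deviation by 1/sqrt(fan_in) (a common remedy, not shown on the slide) keeps them stable; the layer width and depth are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_scale(std, width=512, depth=10):
    """Average activation magnitude after `depth` linear layers with N(0, std) weights."""
    x = np.ones(width)                      # constant input of 1s, as in the figure
    for _ in range(depth):
        W = rng.normal(0, std, (width, width))
        x = W @ x                           # linear layer, no activation, for illustration
    return np.abs(x).mean()

print(layer_scale(std=1.0))                     # activations explode with plain N(0, 1) weights
print(layer_scale(std=1.0 / np.sqrt(512)))      # 1/sqrt(fan_in) scaling keeps them near 1
```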
