
Gradient Descent


Presentation Transcript


  1. Gradient Descent Disclaimer: This PPT is modified from Hung-yi Lee's course slides: http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML17.html

  2. Review: Gradient Descent • In step 3, we have to solve the following optimization problem: $\theta^\ast = \arg\min_\theta L(\theta)$, where $\theta$ denotes the parameters and $L$ is the loss function. Suppose that $\theta$ has two variables $\{\theta_1, \theta_2\}$. Randomly start at $\theta^0 = [\theta_1^0, \theta_2^0]^T$ and iteratively update it using the gradient $\nabla L(\theta) = [\partial L/\partial\theta_1, \partial L/\partial\theta_2]^T$.

  3. Gradient is perpendicular to contour lines https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivative-and-gradient-articles/a/the-gradient

  4. Review: Gradient Descent Gradient: the derivative of the loss function with respect to the parameters. Gradient descent moves in the direction of the negative gradient: start at position $\theta^0$; compute the gradient at $\theta^0$ and move to $\theta^1 = \theta^0 - \eta \nabla L(\theta^0)$; compute the gradient at $\theta^1$ and move to $\theta^2 = \theta^1 - \eta \nabla L(\theta^1)$; ……
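
As a concrete illustration of the update rule on this slide, here is a minimal NumPy sketch of vanilla gradient descent; the toy loss $(\theta_1-3)^2 + 10(\theta_2+1)^2$, the starting point, the learning rate, and the step count are illustrative choices, not from the slides.

```python
import numpy as np

def gradient_descent(grad_fn, theta0, eta=0.05, num_steps=200):
    """Repeatedly move in the direction of the negative gradient:
    theta^{t+1} = theta^t - eta * gradient of L at theta^t."""
    theta = np.array(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - eta * grad_fn(theta)
    return theta

# Toy loss L(theta) = (theta1 - 3)^2 + 10 * (theta2 + 1)^2 and its gradient.
toy_grad = lambda th: np.array([2 * (th[0] - 3), 20 * (th[1] + 1)])
print(gradient_descent(toy_grad, theta0=[0.0, 0.0]))   # approaches [3, -1]
```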

  5. Gradient Descent Tip 1: Tuning your learning rates

  6. Learning Rate Set the learning rate η carefully. If η is too small, the loss decreases very slowly; if it is large, the loss may bounce around without settling; if it is very large, the loss can even increase. If there are more than three parameters, you cannot visualize the loss surface itself, but you can always visualize the loss as a function of the number of parameter updates.
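
One way to act on this advice is to record the loss after every update for a few candidate learning rates and compare the curves. A minimal sketch, with a toy loss and made-up rates:

```python
import numpy as np

def loss(theta):
    return (theta[0] - 3) ** 2 + 10 * (theta[1] + 1) ** 2

def grad(theta):
    return np.array([2 * (theta[0] - 3), 20 * (theta[1] + 1)])

for eta in (0.001, 0.01, 0.05, 0.11):        # too small, small, good, too large
    theta, history = np.zeros(2), []
    for _ in range(30):
        theta = theta - eta * grad(theta)
        history.append(loss(theta))          # loss vs. number of updates
    print(f"eta={eta}: loss after 30 updates = {history[-1]:.3g}")
# Plotting `history` against the update count shows whether eta is too small
# (loss falls slowly), about right (steady decrease), or too large (loss blows up).
```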

  7. Adaptive Learning Rates • Popular & simple idea: reduce the learning rate by some factor every few epochs. • At the beginning, we are far from the destination, so we use a larger learning rate. • After several epochs, we are close to the destination, so we reduce the learning rate. • E.g. 1/t decay: $\eta^t = \eta / \sqrt{t+1}$. • The learning rate cannot be one-size-fits-all: give different parameters different learning rates.
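
A minimal sketch of the 1/t decay schedule mentioned above; the base learning rate 0.1 is an arbitrary example value.

```python
import math

def one_over_t_decay(eta0, t):
    """1/t decay: eta^t = eta0 / sqrt(t + 1)."""
    return eta0 / math.sqrt(t + 1)

print([round(one_over_t_decay(0.1, t), 4) for t in range(5)])
# [0.1, 0.0707, 0.0577, 0.05, 0.0447]
```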

  8. Adagrad • Divide the learning rate of each parameter by the root mean square of its previous derivatives. Vanilla gradient descent ($w$ is one parameter): $w^{t+1} = w^t - \eta^t g^t$. Adagrad: $w^{t+1} = w^t - \frac{\eta^t}{\sigma^t} g^t$, where $g^t = \frac{\partial L(\theta^t)}{\partial w}$ and $\sigma^t$ is the root mean square of the previous derivatives of parameter $w$ (parameter dependent).

  9. Adagrad $\sigma^t$: the root mean square of the previous derivatives of parameter $w$: $\sigma^t = \sqrt{\frac{1}{t+1} \sum_{i=0}^{t} (g^i)^2}$, e.g. $\sigma^0 = \sqrt{(g^0)^2}$, $\sigma^1 = \sqrt{\tfrac{1}{2}\big[(g^0)^2 + (g^1)^2\big]}$, ……

  10. Adagrad • Divide the learning rate of each parameter by the root mean square of its previous derivatives. With 1/t decay, $\eta^t = \eta / \sqrt{t+1}$, the $\sqrt{t+1}$ factors cancel and the update simplifies to $w^{t+1} = w^t - \frac{\eta}{\sqrt{\sum_{i=0}^{t} (g^i)^2}}\, g^t$.
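
Putting the two Adagrad slides together, a minimal NumPy sketch of the simplified update; the toy gradient, the learning rate, and the small epsilon (which only guards against division by zero) are illustrative choices.

```python
import numpy as np

def adagrad(grad_fn, theta0, eta=1.0, num_steps=200, eps=1e-8):
    """Adagrad: divide the learning rate of each parameter by the root of
    the sum of its squared past derivatives (a parameter-dependent rate)."""
    theta = np.array(theta0, dtype=float)
    sum_sq_grad = np.zeros_like(theta)
    for _ in range(num_steps):
        g = grad_fn(theta)
        sum_sq_grad += g ** 2
        theta -= eta * g / (np.sqrt(sum_sq_grad) + eps)
    return theta

# Same toy loss as in the earlier sketches: L = (theta1 - 3)^2 + 10 * (theta2 + 1)^2.
toy_grad = lambda th: np.array([2 * (th[0] - 3), 20 * (th[1] + 1)])
print(adagrad(toy_grad, theta0=[0.0, 0.0]))   # each parameter gets its own effective step
```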

  11. Contradiction? Vanilla gradient descent: larger gradient, larger step. Adagrad: the gradient $g^t$ in the numerator says larger gradient, larger step, while the accumulated term $\sqrt{\sum_i (g^i)^2}$ in the denominator says larger gradient, smaller step.

  12. Gradient Descent Tip 2: Stochastic Gradient Descent Make the training faster

  13. Stochastic Gradient Descent The loss is the summation over all training examples, $L(\theta) = \sum_n L^n(\theta)$. • Gradient descent: $\theta^i = \theta^{i-1} - \eta \nabla L(\theta^{i-1})$. • Stochastic gradient descent: pick an example $x^n$ and use the loss for only that one example, with no summing: $\theta^i = \theta^{i-1} - \eta \nabla L^n(\theta^{i-1})$. Faster!

  14. Stochastic Gradient Descent Gradient descent: see all examples, then update once after seeing all of them. Stochastic gradient descent: see only one example and update for each example. If there are 20 examples, SGD makes 20 updates in the time gradient descent takes to make one.
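
A minimal sketch contrasting the two schemes on a toy linear-regression problem; the data, the true weights, and the learning rate are made-up examples, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                  # 20 training examples, 2 features
y = X @ np.array([2.0, -3.0]) + 1.0           # targets from a known linear model

def batch_gd_step(w, b, eta=0.1):
    """Gradient descent: one update after seeing all 20 examples
    (gradient of the loss averaged over the whole set)."""
    err = X @ w + b - y
    return w - eta * X.T @ err / len(X), b - eta * err.mean()

def sgd_epoch(w, b, eta=0.1):
    """Stochastic gradient descent: one update per example,
    i.e. 20 updates in the time batch GD does a single pass."""
    for xn, yn in zip(X, y):
        err = xn @ w + b - yn
        w, b = w - eta * err * xn, b - eta * err
    return w, b

w_gd, b_gd = np.zeros(2), 0.0
w_sgd, b_sgd = np.zeros(2), 0.0
for _ in range(50):
    w_gd, b_gd = batch_gd_step(w_gd, b_gd)    # 50 updates in total
    w_sgd, b_sgd = sgd_epoch(w_sgd, b_sgd)    # 50 * 20 = 1000 updates in total
print(w_gd, b_gd)    # both move toward w = [2, -3], b = 1,
print(w_sgd, b_sgd)  # but SGD has made 20x more updates per pass
```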

  15. Gradient Descent Tip 3: Feature Scaling

  16. Feature Scaling Make different features have the same scaling. Source of figure: http://cs231n.github.io/neural-networks-2/

  17. Feature Scaling For $y = b + w_1 x_1 + w_2 x_2$: if $x_1$ takes values like 1, 2, …… while $x_2$ takes values like 100, 200, ……, the loss $L$ is much more sensitive to $w_2$ than to $w_1$, so its contours are elongated and gradient descent is hard to tune; after scaling both features to a similar range, the contours of $L$ become more circular and the updates point more directly towards the minimum.

  18. Feature Scaling (normalization with sample size $R$) For each dimension $i$: compute the mean $m_i$ and the standard deviation $\sigma_i$ over the $R$ examples $x^1, \ldots, x^R$, then replace each component by $x_i^r \leftarrow \dfrac{x_i^r - m_i}{\sigma_i}$. After this, the means of all dimensions are 0, and the variances are all 1.
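
A minimal NumPy sketch of the per-dimension standardization described on this slide; the four-example toy matrix is illustrative.

```python
import numpy as np

def standardize(X):
    """For each dimension i: subtract the mean m_i and divide by the standard
    deviation sigma_i, so every dimension has mean 0 and variance 1."""
    mean = X.mean(axis=0)       # m_i over the R examples
    std = X.std(axis=0)         # sigma_i over the R examples
    return (X - mean) / std

X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_scaled = standardize(X)
print(X_scaled.mean(axis=0))    # ~[0, 0]
print(X_scaled.std(axis=0))     # ~[1, 1]
```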

  19. Warning of Math

  20. Gradient Descent Theory

  21. Question • When solving $\theta^\ast = \arg\min_\theta L(\theta)$ by gradient descent, each time we update the parameters, we obtain a new $\theta$ that makes $L(\theta)$ smaller: $L(\theta^0) > L(\theta^1) > L(\theta^2) > \cdots$. Is this statement correct?

  22. Formal Derivation • Suppose that $\theta$ has two variables $\{\theta_1, \theta_2\}$. Given a point on the loss surface $L(\theta)$, we can easily find the nearby point with the smallest value of $L$. How?

  23. Taylor Series • Taylor series: let $h(x)$ be any function infinitely differentiable around $x = x_0$: $h(x) = \sum_{k=0}^{\infty} \frac{h^{(k)}(x_0)}{k!}(x - x_0)^k = h(x_0) + h'(x_0)(x - x_0) + \frac{h''(x_0)}{2!}(x - x_0)^2 + \cdots$. When $x$ is close to $x_0$: $h(x) \approx h(x_0) + h'(x_0)(x - x_0)$.

  24. E.g. Taylor series for $h(x) = \sin(x)$ around $x_0 = \pi/4$: $\sin(x) = \sin(\tfrac{\pi}{4}) + \cos(\tfrac{\pi}{4})(x - \tfrac{\pi}{4}) - \frac{\sin(\pi/4)}{2!}(x - \tfrac{\pi}{4})^2 - \frac{\cos(\pi/4)}{3!}(x - \tfrac{\pi}{4})^3 + \cdots$ The approximation is good around $\pi/4$.
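
A quick numeric check of the first-order part of this approximation; the sample points are arbitrary.

```python
import math

x0 = math.pi / 4

def first_order(x):
    # sin(x) ~ sin(x0) + cos(x0) * (x - x0) for x near x0
    return math.sin(x0) + math.cos(x0) * (x - x0)

for x in (x0, x0 + 0.1, x0 + 0.5, x0 + 1.5):
    print(f"x={x:.2f}  sin(x)={math.sin(x):.4f}  approx={first_order(x):.4f}")
# The approximation is close for x near pi/4 and drifts as x moves away.
```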

  25. Multivariable Taylor Series $h(x, y) = h(x_0, y_0) + \frac{\partial h(x_0, y_0)}{\partial x}(x - x_0) + \frac{\partial h(x_0, y_0)}{\partial y}(y - y_0)$ + something related to $(x - x_0)^2$ and $(y - y_0)^2$ + …… When $x$ and $y$ are close to $x_0$ and $y_0$: $h(x, y) \approx h(x_0, y_0) + \frac{\partial h(x_0, y_0)}{\partial x}(x - x_0) + \frac{\partial h(x_0, y_0)}{\partial y}(y - y_0)$.

  26. Back to Formal Derivation Based on the Taylor series: if the red circle centred at $(a, b)$ is small enough, then inside the red circle $L(\theta) \approx L(a, b) + \frac{\partial L(a, b)}{\partial \theta_1}(\theta_1 - a) + \frac{\partial L(a, b)}{\partial \theta_2}(\theta_2 - b)$.

  27. Back to Formal Derivation Writing the constants as $s = L(a, b)$, $u = \frac{\partial L(a, b)}{\partial \theta_1}$, $v = \frac{\partial L(a, b)}{\partial \theta_2}$, the approximation inside the red circle becomes $L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$. Find $\theta_1$ and $\theta_2$ in the red circle $(\theta_1 - a)^2 + (\theta_2 - b)^2 \le d^2$ minimizing $L(\theta)$. Simple, right?

  28. Gradient descent – two variables Red circle: $(\theta_1 - a)^2 + (\theta_2 - b)^2 \le d^2$ (the radius $d$ is small). To minimize $L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$ inside the circle, choose $(\theta_1 - a, \theta_2 - b)$ to point opposite to $(u, v)$ with the largest allowed length: $\begin{bmatrix} \theta_1 - a \\ \theta_2 - b \end{bmatrix} = -\eta \begin{bmatrix} u \\ v \end{bmatrix}$.

  29. Back to Formal Derivation Based on the Taylor series, if the red circle is small enough, then inside it $L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$ with the constants $s$, $u$, $v$ as before. Finding $\theta_1$ and $\theta_2$ yielding the smallest value of $L(\theta)$ in the circle gives $\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = \begin{bmatrix} a \\ b \end{bmatrix} - \eta \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} a \\ b \end{bmatrix} - \eta \begin{bmatrix} \partial L(a,b)/\partial\theta_1 \\ \partial L(a,b)/\partial\theta_2 \end{bmatrix}$, which is exactly gradient descent. The approximation is not satisfied if the red circle (i.e. the learning rate) is not small enough. You can also consider the second-order term, e.g. Newton's method.
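
A small numeric check of the caveat about the circle size, on a toy quadratic loss with made-up step sizes: with a small learning rate every update decreases the loss, while with one that is too large the linear approximation breaks down and the loss can increase.

```python
import numpy as np

def loss(theta):
    return (theta[0] - 3) ** 2 + 10 * (theta[1] + 1) ** 2

def grad(theta):
    return np.array([2 * (theta[0] - 3), 20 * (theta[1] + 1)])

for eta in (0.01, 0.2):                    # small vs. too-large "red circle"
    theta = np.zeros(2)
    for step in range(3):
        new_theta = theta - eta * grad(theta)
        print(f"eta={eta} step {step}: L {loss(theta):.2f} -> {loss(new_theta):.2f}")
        theta = new_theta
# With eta=0.01 the loss drops every step; with eta=0.2 it grows.
```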

  30. End of Warning

  31. More Limitations of Gradient Descent Plotting the loss against the value of the parameter w: gradient descent can be very slow at a plateau, get stuck at a saddle point, or get stuck at a local minimum, since the gradient is (nearly) zero in all three cases.

  32. Skipped from Adagrad

  33. Intuitive Reason • How surprising a gradient is: after many very small gradients, a very large one is surprising (and vice versa). Dividing by the root mean square of the previous derivatives creates a dramatic difference in the effective step when such a contrast appears.

  34. Larger gradient, larger steps? For a single parameter, the best step is the distance from the current position to the minimum. A larger first-order derivative does suggest being farther from the minimum, which is the intuition behind "larger gradient, larger step".

  35. Larger 1st order derivative means far from the minima? Comparison between different parameters: within one parameter the rule holds (point a has a larger derivative than point b and is farther from the minimum, a > b; likewise c > d for another parameter), but it does not hold across parameters. Do not cross parameters: a point on a sharply curved dimension can have a larger derivative than a point on a flat dimension while being closer to its own minimum.

  36. Second Derivative For the quadratic $y = ax^2 + bx + c$, the minimum is at $x = -\frac{b}{2a}$, so from a point $x_0$ the best step is $\left|x_0 + \frac{b}{2a}\right| = \frac{|2ax_0 + b|}{2a} = \frac{|\text{first derivative}|}{\text{second derivative}}$.

  37. Larger 1st order derivative means far from the minima? The best step is $\frac{|\text{first derivative}|}{\text{second derivative}}$. Comparing between different parameters now works: the flatter parameter has a smaller second derivative, the sharper parameter has a larger second derivative, and dividing the first derivative by the second derivative makes the steps comparable (a > b within one parameter, c > d within the other, and the cross-parameter comparison is no longer misleading).

  38. The best step is $\frac{|\text{first derivative}|}{\text{second derivative}}$, but computing the second derivative can be costly. Adagrad uses the first derivatives to estimate the second derivative: sampling first derivatives over a region, their root mean square is larger where the second derivative is larger and smaller where it is smaller.
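
A small numeric sketch of this idea: for one-dimensional quadratics $y = ax^2$, |first derivative| / second derivative recovers the exact distance to the minimum, and the root mean square of first derivatives sampled over a region grows with the second derivative, so it can stand in for it. The curves and sample range are made-up.

```python
import numpy as np

# Two quadratics y = a * x^2 with different curvature (second derivative = 2a).
for a in (1.0, 5.0):
    x0 = 3.0                                  # current position; the minimum is at x = 0
    first = abs(2 * a * x0)                   # |first derivative| at x0
    second = 2 * a                            # second derivative
    best_step = first / second                # = |x0|, the true distance to the minimum
    # Estimate the curvature from first derivatives sampled over a nearby range:
    xs = np.linspace(-3, 3, 50)
    rms_grad = np.sqrt(np.mean((2 * a * xs) ** 2))
    print(f"a={a}: best step = {best_step}, RMS of first derivatives = {rms_grad:.2f}")
# The best step is the same for both curves, while the RMS of the first
# derivatives scales with the second derivative, as Adagrad assumes.
```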
