
Towards Reliable Convergence in the Training of Neural Networks


Presentation Transcript


  1. The Streamlined Glide Algorithm and the LM-Glide Algorithm
  Towards Reliable Convergence in the Training of Neural Networks
  Vitit Kantabutra, Idaho State University, Pocatello, Idaho, U.S.A.

  2. Neural Networks are Still Useful
  • Multi-layer perceptrons are still very popular in many fields
    • Classification
    • General function approximation

  3. Neural Network Research has Declined Despite the Training Convergence Problem
  • In 2000, organizers of the NIPS conference pointed out that:
    • "neural networks" in the title was negatively correlated with acceptance
    • "SVM," "Bayesian networks," and "variational methods" were positively correlated with acceptance

  4. The Non-Convergence Problem
  • Still very prevalent
  • Leads to frustration and even compromised results

  5. Second-Order Algorithms
  • Much of the research still done on neural networks concerns second-order algorithms
  • But second-order algorithms don't help with large networks
    • Computational complexity becomes prohibitive
  • The flat-region problem slows down all gradient-based algorithms
    • Neither first-order nor higher-order conventional algorithms perform well
  • Zigzagging is another problem

  6. Illustrating Flat Regions
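
  The figure for this slide did not survive the transcript. As a stand-in, here is a minimal Python sketch (ours, not the talk's) of why flat regions arise with sigmoid units: once a unit saturates, its derivative, and with it every gradient component routed through that unit, is nearly zero, so gradient descent barely moves.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        # Derivative sigma'(z) = sigma(z) * (1 - sigma(z)).
        s = sigmoid(z)
        return s * (1.0 - s)

    for z in [0.0, 2.0, 5.0, 10.0]:
        print(f"z = {z:4.1f}   sigma'(z) = {sigmoid_prime(z):.2e}")
    # At z = 10 the derivative is about 4.5e-05: an update of
    # learning_rate * gradient barely moves, i.e. the flat-region stall.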

  7. Illustrating Zigzagging Despite Momentum (coefficient = 0.9)
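
  The zigzagging figure is also missing from the transcript. The toy run below (our illustrative stand-in; the quadratic error surface is an assumption, not the authors' example) reproduces the effect: on an ill-conditioned valley, gradient descent with a momentum coefficient of 0.9 overshoots the valley floor and flips sign step after step.

    import numpy as np

    def grad(w):
        # Gradient of E(w) = 0.5 * (w[0]**2 + 50 * w[1]**2): a narrow valley.
        return np.array([w[0], 50.0 * w[1]])

    w = np.array([10.0, 1.0])
    v = np.zeros(2)
    lr, momentum = 0.02, 0.9
    for step in range(15):
        v = momentum * v - lr * grad(w)   # classical momentum update
        w = w + v
        print(step, w.round(4))           # w[1] keeps changing sign: zigzagging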

  8. Attempted Solutions to the Flat-Region Problem
  • Changing the formula for computing the output layer's delta (error signal) (Solla; Fahlman)
    • Helps, but doesn't eliminate the problem. We used Fahlman's formula in some previous experiments to speed up convergence (see the sketch after this list)
  • Another approach is due to Wilamowski and Torvik
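
  A hedged sketch of Fahlman's flat-spot fix, which adds a small constant (0.1 in his 1988 study) to the sigmoid derivative in the output delta; whether this matches the exact variant used in the authors' experiments is an assumption.

    def output_delta(target, output, flat_spot=0.1):
        # Standard sigmoid delta is (target - output) * output * (1 - output);
        # adding flat_spot keeps the error signal alive when output saturates.
        return (target - output) * (output * (1.0 - output) + flat_spot)

    # A saturated output (0.999) that should have been 0:
    print(output_delta(0.0, 0.999, flat_spot=0.0))  # approx -1.0e-03: stalls
    print(output_delta(0.0, 0.999))                 # approx -1.0e-01: survives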

  9. Our First Attempt: the Glide Algorithm
  • Kantabutra & Zheleva '02
  • Idea: flat regions are 'safe', so why not move fast through them?
  • Usually works, but sometimes the error rises sharply
    • Key: gradient descent often would have made sudden hairpin turns to safety where our algorithm would glide too far into high-MSE territory
    • The weight trajectory hits a sigmoidal wall

  10. Why Our First Attempt Failed: The Hairpin Turn

  11. Our Second Attempt: the Glide Algorithm with Tunneling
  • Kantabutra, Tsendjav and Zheleva '03
  • Glides more carefully: checks the error before making the move permanent
  • Adds a 'tunneling' move: a local line search to find the bottom of the "half-pipe"
  • Works: 100% convergence, fast and reliable (low standard deviation in convergence time)
  • But it is complicated and could be cleaner

  12. Illustrating the Importance of Tunneling
  Mean-square error as a function of distance traces a "half-pipe"-shaped curve; in areas of turbulence we want to be at the bottom of the half-pipe.

  13. A Few Experimental Results
  [Plot from the 2003 paper. Problem: Parity-4 with 4 hidden neurons. y = running time (CPU seconds) until convergence; x = run number. Odd runs start from random weights; even runs start with the previous run's weights. The gradient-descent (G.D.) odd runs didn't converge.]

  14. The Two-Spiral Problem (2003)
  • A very hard problem
  • Glide algorithm, combined with gradient descent for quicker initial error reduction:
    • the number of epochs required for convergence varies widely
    • average: 30453 epochs
  • Gradient descent often did not converge

  15. Our Third Attempt: the Simplified Glide Algorithm and LM-Glide
  • Still uses tunneling; we just removed the word from the name
  • Simpler, but seemingly still effective

  16. The Glide Move: Details
  • Take two small gradient-descent moves, just for calculation purposes (w0 -> w1 -> w2)
  • Let w0 -> w2 be our direction of weight motion
  • A far glide is a long-distance glide (e.g. 0.2); a near glide is a short-distance glide (e.g. 0.1)
  • Some tuning is required, but it is not difficult compared with regular gradient descent
    • Tuning can still be significant, even though the algorithm is less finicky than gradient descent
  • New: a self-tuning version for heart-arrhythmia classification (UCI database)
  (A sketch of the move follows this list.)
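
  A minimal sketch of the glide move as described above; mse_grad, the probe learning rate, and the glide distances are placeholder names and values, not the authors' exact implementation.

    import numpy as np

    def glide_move(w0, mse_grad, probe_lr=0.01, glide_dist=0.2):
        # Two small gradient-descent probe moves, for calculation only:
        w1 = w0 - probe_lr * mse_grad(w0)
        w2 = w1 - probe_lr * mse_grad(w1)
        # Glide a fixed distance along the direction w0 -> w2
        # (e.g. 0.2 for a far glide, 0.1 for a near glide).
        direction = w2 - w0
        norm = np.linalg.norm(direction)
        if norm == 0.0:
            return w0           # zero gradient: nowhere to glide
        return w0 + glide_dist * direction / norm

  Per slide 11, the resulting error is checked before the move is made permanent.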

  17. The Downscaling (Shrinking) Move
  • Multiply every weight by a factor such as 0.95 (one-liner sketch below)
  • May be needed every few dozen glides to prevent the weights from growing out of control
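
  The move itself is a one-liner; the factor 0.95 comes from the slide, while the function name is ours.

    def downscale(w, factor=0.95):
        # Shrink every weight to keep the weights from growing out of control.
        return factor * w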

  18. The Tunneling Move: Details
  • Triggered when mse(w2) > mse(w0)
  • Use a local line search in the direction of the negative gradient, from w0 or w1, to find the lowest-error point of the half-pipe
    • If mse(w1) <= mse(w0), search from w1; else search from w0
    • We favor w1 because we want some weight movement when possible
  (A sketch follows this list.)
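
  A hedged sketch of the tunneling move; the slides call for a local line search, and the evenly spaced sampling below is our stand-in for whatever search the authors actually used.

    import numpy as np

    def tunnel(w0, w1, mse, mse_grad, max_dist=0.5, n_samples=50):
        # Favor w1 so the weights still move when possible (per the slide).
        start = w1 if mse(w1) <= mse(w0) else w0
        direction = -mse_grad(start)
        norm = np.linalg.norm(direction)
        if norm == 0.0:
            return start
        direction = direction / norm
        # Sample along the negative-gradient ray and keep the point at the
        # bottom of the half-pipe (t = 0 is included, so we never get worse).
        ts = np.linspace(0.0, max_dist, n_samples)
        return min((start + t * direction for t in ts), key=mse)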

  19. Streamlined glide algorithm (with tunneling)
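
  The pseudocode figure for this slide is not in the transcript. Below is our hedged reconstruction of the streamlined algorithm from slides 16-18, reusing the tunnel() and downscale() sketches above; the loop structure, the far-then-near glide policy, and all constants are assumptions, not the paper's.

    import numpy as np

    def streamlined_glide(w, mse, mse_grad, epochs=2000, probe_lr=0.01,
                          far=0.2, near=0.1, shrink_every=50, target=1e-3):
        for epoch in range(1, epochs + 1):
            w1 = w - probe_lr * mse_grad(w)        # probe steps (slide 16)
            w2 = w1 - probe_lr * mse_grad(w1)
            if mse(w2) > mse(w):
                w = tunnel(w, w1, mse, mse_grad)   # tunneling move (slide 18)
            else:
                d = w2 - w
                norm = np.linalg.norm(d)
                if norm > 0.0:
                    d = d / norm
                    # Try a far glide; fall back to a near glide if the error
                    # rises. (This far/near policy is our assumption; the
                    # slides don't say when each distance is used.)
                    trial = w + far * d
                    w = trial if mse(trial) <= mse(w) else w + near * d
            if epoch % shrink_every == 0:
                w = downscale(w)                   # shrinking move (slide 17)
            if mse(w) < target:
                break
        return w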

  20. Results
