Asynchronous Stochastic Quasi-Newton MCMC for Non-Convex Optimization Umut Simsekli, Cagatay Yildiz, Thanh Huy Nguyen, A. Taylan Cemgil, Gael Richard; Proceedings of the 35th International Conference on Machine Learning, PMLR 80:4681-4690, 2018.
Context • Objective: minimize a (possibly non-convex) loss at scale. • First-order methods: GD, SGD, variance-reduced (VR) variants, etc. • Second-order methods: Newton, quasi-Newton (QN: L-BFGS, mb-L-BFGS), etc. • Parallelization: • Asynchronous SGD (or SVRG): unstable ("wild!"), often does not scale to large batch sizes, and hence needs too much communication in practice (see the sketch below). • Synchronous SGD (one-shot averaging, mini-batch, or local SGD): diminishing returns as the mini-batch grows. • QN methods, with better bounds and more computation per node, can win the communication vs. convergence trade-off. • Highly optimized implementations are often a mixture of the two.
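To make the asynchronous bullet concrete, here is a minimal sketch of lock-free, shared-memory asynchronous SGD (in the spirit of Hogwild!) on a toy least-squares problem. The objective, worker count, and step size are illustrative assumptions, not from the paper.

```python
import threading
import numpy as np

# Toy least-squares data; everything here is an illustrative assumption.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(10)      # shared iterate, read and written WITHOUT locks
step_size = 1e-3

def worker(seed, n_steps=500, batch_size=32):
    local_rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        idx = local_rng.integers(0, len(X), size=batch_size)
        # stochastic gradient of 0.5 * mean (x_i . w - y_i)^2 on the batch
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        w[:] = w - step_size * grad   # racy update: workers may clobber each other

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("distance to w_true:", np.linalg.norm(w - w_true))
```

The racy update is exactly what makes the method "wild": it saves synchronization cost but can destabilize training, which is the trade-off the slide refers to.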
mb-L-BFGS • The L-BFGS algorithm (Nocedal & Wright, 2006) is a limited-memory BATCH QN method. • A stochastic online version is needed to reduce the per-iteration computation. • Berahas et al. (2016) proposed a parallel stochastic L-BFGS algorithm, called multi-batch L-BFGS (mb-L-BFGS), suited to synchronous distributed architectures (i.e., synchronous L-BFGS!). • It draws a different batch at every iteration for the QN updates; since the batches are large, it beats plain mini-batch SGD on the computation vs. communication trade-off. • To stabilize the curvature estimates it uses overlapping consecutive batches (for gradient consistency), and for robustness it exploits non-random structure in the batches (see the sketch below).
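The overlapping-batch idea can be sketched as follows: the curvature pair (s_k, y_k) is formed from gradient differences evaluated on the overlap of two consecutive batches, so both gradients see the same samples. The toy objective and batch mechanics below are illustrative assumptions, not Berahas et al.'s exact implementation.

```python
import numpy as np

def batch_grad(w, X, y, idx):
    """Mini-batch gradient of 0.5 * mean (x_i . w - y_i)^2 on samples idx."""
    return X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)

def curvature_pair(w_prev, w_next, X, y, batch_prev, batch_next):
    """Form (s, y) on the overlap O_k = S_k intersect S_{k+1} for consistency."""
    overlap = np.intersect1d(batch_prev, batch_next)
    s = w_next - w_prev
    # both gradients use the SAME overlap samples -> consistent difference
    y_vec = batch_grad(w_next, X, y, overlap) - batch_grad(w_prev, X, y, overlap)
    return s, y_vec

# Tiny usage demo with two consecutive batches overlapping on [30, 40):
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
batch_k = np.arange(0, 40)       # S_k
batch_k1 = np.arange(30, 70)     # S_{k+1}
s, y_vec = curvature_pair(np.zeros(5), 0.1 * rng.normal(size=5),
                          X, y, batch_k, batch_k1)
```

Without the overlap, y_k would mix sampling noise from two different batches, and the resulting curvature estimate (and hence the QN direction) could be arbitrarily bad.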
mb-L-BFGS Performance • The bounds for synchronous L-BFGS are asymptotic, not easily interpretable, and given only for the finite-horizon setting. On the other hand, bounds are available for non-convex functions. • The performance of synchronous L-BFGS appears justifiably better (relative to the extra computation), so it can be layered on top of first-order methods to improve performance. • A fault-tolerant robust version was also provided, and it outperformed plain mb-L-BFGS. • BUT, the bottom line is: the computational complexity of QN + communication + computing the overlapping sets.
Asynchronous L-BFGS? • mb-L-BFGS and its robust version are both synchronous parallel algorithms. • The synchronous vs. asynchronous trade-off is equivalent to a "stability" vs. "communication" (synchronization and coordination) trade-off. • This trade-off is still not well understood, even for first-order methods! • Synchronous first-order methods are the most widely used tools in deep learning tasks today. In fact, even the famous VR methods are rarely used for non-convex optimization. • To get as-L-BFGS there are two common directions: the master/slave setting or the shared-memory setting (a master/worker sketch follows below).
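As a rough sketch of the master/slave direction (the setting this paper adopts), here is a schematic parameter-server loop with a queue standing in for network communication; the toy gradient and all hyperparameters are illustrative assumptions, not the paper's implementation.

```python
import queue
import threading
import numpy as np

rng = np.random.default_rng(0)
grad_queue = queue.Queue()   # workers -> master: stochastic gradients
param = np.zeros(10)         # master's iterate; workers read possibly stale copies

def stochastic_gradient(w):
    # toy noisy quadratic gradient, standing in for a mini-batch gradient
    return (w - 1.0) + 0.1 * rng.normal(size=w.shape)

def worker_loop(n_steps=200):
    for _ in range(n_steps):
        w_local = param.copy()   # possibly stale read of the iterate
        # an as-L-BFGS worker would also maintain its OWN local (s, y) memory here
        grad_queue.put(stochastic_gradient(w_local))

def master_loop(n_updates=800, step_size=1e-2):
    for _ in range(n_updates):
        g = grad_queue.get()     # apply gradients in whatever order they arrive
        param[:] = param - step_size * g

workers = [threading.Thread(target=worker_loop) for _ in range(4)]
master = threading.Thread(target=master_loop)
for t in workers + [master]:
    t.start()
for t in workers + [master]:
    t.join()
print("master iterate (first 3 coords):", param[:3])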
as-L-BFGS (this paper) • MAP (regularized) objective, not MLE. • Reformulate optimization as sampling from a distribution concentrated around the optima. • Use SG-MCMC to sample (SGLD, precisely); see the sketch below. • Prove ergodic convergence to the stationary distribution. • Major problems: • Inconsistent gradient/iterate differences due to sub-sampling. • Positive definiteness (SPD) of the Hessian approximation is not ensured, in expectation or otherwise. • SGLD convergence has nice properties for regular objectives (though often convergence only to neighbourhoods). • Plain SGLD (SGD with carefully scaled noise) is numerically unstable here (as in the 2016 attempt).
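The SGLD step above is just SGD plus appropriately scaled Gaussian noise, targeting a distribution proportional to exp(-beta * U(theta)) that concentrates around the optima as beta grows. A minimal sketch, with a toy objective and hyperparameters that are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def stoch_grad_U(theta):
    # noisy gradient of a toy quadratic U(theta) = 0.5 * ||theta - 1||^2
    return (theta - 1.0) + 0.1 * rng.normal(size=theta.shape)

def sgld(theta0, step_size=1e-2, beta=10.0, n_iters=5000):
    theta = theta0.copy()
    samples = []
    for _ in range(n_iters):
        noise = rng.normal(size=theta.shape)
        # the injected sqrt(2 * step / beta) * noise term is what turns
        # SGD into a sampler instead of an optimizer
        theta = (theta - step_size * stoch_grad_U(theta)
                 + np.sqrt(2.0 * step_size / beta) * noise)
        samples.append(theta.copy())
    return np.array(samples)

samples = sgld(np.zeros(2))
print("estimated mode:", samples[-1000:].mean(axis=0))
```

Preconditioning this update with an L-BFGS Hessian approximation, asynchronously, is essentially what the paper does, and it is exactly where the SPD and consistency problems in the bullets above come from.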
as-L-BFGS (this paper) • Introducing (delayed) momentum/friction solves the numerical instability problem. • Each worker keeps its own local L-BFGS memory, since the master node cannot keep track of the gradient and iterate differences, which arrive asynchronously. • To ensure the Hessian approximation is SPD, an intricate update scheme is used (equations omitted for brevity; similar to mb-L-BFGS). • The time complexity comes from the Hessian-gradient product, which is O(Md), compared to O(Md^2) for the earlier (2016) attempt; the two-loop recursion below shows where O(Md) comes from.
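The O(Md) claim comes from applying the inverse-Hessian approximation implicitly via the standard L-BFGS two-loop recursion (Nocedal & Wright, 2006), rather than materializing a d-by-d matrix. A textbook sketch, not the paper's exact asynchronous variant:

```python
import numpy as np

def two_loop(grad, s_list, y_list):
    """Return H_k @ grad, with H_k defined implicitly by M stored (s, y) pairs,
    ordered oldest to newest."""
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    # first loop: newest pair to oldest
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    # initial scaling H_0 = gamma * I from the most recent pair
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    r = gamma * q
    # second loop: oldest pair to newest
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * s
    return r

# Tiny usage demo on random data (M = 5 pairs, d = 8):
rng = np.random.default_rng(0)
d, M = 8, 5
s_list = [rng.normal(size=d) for _ in range(M)]
y_list = [s + 0.1 * rng.normal(size=d) for s in s_list]  # keeps y @ s > 0
direction = two_loop(rng.normal(size=d), s_list, y_list)
```

With M stored pairs and dimension d, each application costs O(Md) time and memory, versus O(Md^2) or worse if the approximation were formed as an explicit matrix.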
Convergence Result • Under regularity constraints on the function (bounded variance, bounded Hessian, Lipschitzness, etc.) and on the stationary distribution (bounded second moments, etc.), the following result can be obtained. • An O(1/√N) rate is achieved, which is optimal compared to state-of-the-art results for asynchronous non-convex problems; a schematic of the bound follows below. • Due to an exponential dependence on the dimension, the algorithm might get stuck in local minima, like other tempered (SGLD-like) algorithms.
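Schematically, the result has the following shape (constants and exact terms suppressed; this is a paraphrase of the style of bound, not the paper's precise statement), where $\bar{\theta}_N$ is the average iterate and $\mathcal{E}(\beta, d)$ stands for a bias term from sampling at inverse temperature $\beta$:

```latex
% Schematic shape only; constants, the exact bias term, and the precise
% averaging scheme are in the paper.
\mathbb{E}\!\left[ U(\bar{\theta}_N) \right] - U(\theta^{\star})
  \;\leq\; \underbrace{\mathcal{O}\!\left( \tfrac{1}{\sqrt{N}} \right)}_{\text{optimization error}}
  \;+\; \underbrace{\mathcal{E}(\beta, d)}_{\text{temperature/dimension bias}}
```

The dimension dependence hides in $\mathcal{E}(\beta, d)$, which is where the exponential dependence mentioned above enters.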
My opinion on this paper? • Finally, non-asymptotic bounds that are tighter and more interpretable than the synchronous counterpart's. • The bounds extend to the less understood non-convex setting and to QN convergence. • The algorithm performs better in practice than its synchronous counterpart and is even comparable to asynchronous SGD. • An interesting theoretical application of SG-MCMC algorithms: could such ideas be extended to the analysis of asynchronous SGD? • The theoretical results currently seem weak due to the dimension dependence, but the authors acknowledge that limitation.