Prodigy: An expeditiously adaptive parameter-free learner

Reference

Prodigy: An expeditiously adaptive parameter-free learner, Konstantin Mishchenko, Aaron Defazio. (2023)

Abstract

We consider the problem of estimating the learning rate in adaptive methods, such as Adagrad and Adam. We describe two techniques, Prodigy and Resetting, to provably estimate the distance to the solution $D$, which is needed to set the learning rate optimally. Our techniques are modifications of the D-Adaptation method for learning-rate-free learning. Our methods improve upon the convergence rate of D-Adaptation by a factor of $O(\sqrt{\log(D/d_0)})$, where $d_0$ is the initial estimate of $D$. We test our methods on 12 common logistic-regression benchmark datasets, VGG11 and ResNet-50 training on CIFAR10, ViT training on ImageNet, LSTM training on IWSLT14, DLRM training on the Criteo dataset, VarNet on the Knee MRI dataset, as well as RoBERTa and GPT transformer training on BookWiki. Our experimental results show that our approaches consistently outperform D-Adaptation and reach test accuracy values close to those of hand-tuned Adam.

Content citing this item

Pill

Learning-rate-free learning by D-Adaptation

Fine-tuning the learning rate (lr) in the training of neural networks is crucial for their performance, and often requires a costly search …

Optimization in ML

Aug 1, 2023
