Learning-rate-free learning by D-Adaptation | TransferLab

One of this year’s ICML Outstanding Paper Awards is received by the paper “Learning-Rate-Free Learning by D-Adaptation” [Def23L]. Defazio et al. propose D-Adaptation, an approach for setting the learning rate of stochastic gradient methods automatically. It works without back-tracking, line searches, or additional function evaluations, and asymptotically achieves the optimal rate of convergence for minimizing convex Lipschitz functions. Thus, D-Adaptation removes the need for tuning the learning rate, which is a major practical advantage.

The paper presents extensive experiments for SGD and Adam variants of the method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems.

Figure 2 [Def23L]: Vision experiments.

An open-source implementation is available as a drop-in replacement for PyTorch optimizers, which makes it very easy to use the new method in existing machine learning projects. This, together with the excellent performance, makes D-Adaptation an important addition to the toolbox of machine learning practitioners.

The Idea of D-Adaptation

To understand the approach, consider the unconstrained convex minimization problem $$ \min_{x \in \mathbb{R}^p} f(x), $$ where $f$ has Lipschitz constant $G$ and a non-empty set of minimizers. Gradient descent, starting at point $x_0$, produces iterates $$ x_{k+1} = x_k - \lambda_k g_k $$ where $g_k$ is the gradient of $f$ in $x_k$. Here, the learning rate $\lambda_k$ is the main quantity controlling whether and how fast the sequence ${x_k}$ converges.

Setting $\lambda_k$ optimally requires knowledge of the distance to a solution. Let $x_\star$ be a minimizer of $f$ and $D$ be the associated distance $D = \| x_0 - x_\star \|$. A known result in optimization (see e.g. [Nes18L]) is that using a fixed step size of $$ \lambda_k = \frac{D}{G\sqrt{n}}, $$ the average iterate $\hat x_n = \frac{1}{n+1}\sum_{k=0}^{n} x_k$ converges with worst-case optimal rate $$ f(\hat x_n) - f(x_\star) = \mathcal{O}\left(\frac{DG}{\sqrt{n}}\right). $$

Setting this optimal step size requires knowledge of $D$ and $G$. Adaptivity to the Lipschitz constant $G$ can be achieved using a number of approaches, e.g., AdaGrad-Norm step sizes $$ \lambda_k = \frac{D}{\sqrt{\sum_{i=0}^k \| g_i \|^2}}. $$

However, $D$ still plays the key role of the numerator in the theoretical step size.

The key idea of the paper is to replace $D$ with a successively better lower bound on this distance to the solution. Several lower bounds are proposed, for instance, $$ D \geq \hat d_{k+1} = \frac{\| s_{k+1} \|^2 - \sum_{i=0}^k \lambda_i^2 || g_i ||^2}{ 2 \| s_{k+1} \| }, $$ where $s_{k+1} = s_k + \lambda_k g_k$ and $s_0 = 0$. The resulting Gradient Descent with D-Adaptation algorithm is proved to be asymptotically as fast as using the optimal lr.

It is worth mentioning that the authors have already proposed and implemented improvements of D-Adaptation, in particular Prodigy [Mis23P], which has better asymptotic rates and
adapts the lr faster in practice.

The speed of gradient descent for convex Lipschitz functions is highly dependent on the choice of learning rate. Setting the learning rate to achieve the optimal convergence rate requires knowing the distance D from the initial point to the solution set. In this work, we describe a single-loop method, with no back-tracking or line searches, which does not require knowledge of D yet asymptotically …

We consider the problem of estimating the learning rate in adaptive methods, such as Adagrad and Adam. We describe two techniques, Prodigy and Resetting, to provably estimate the distance to the solution $D$, which is needed to set the learning rate optimally. Our techniques are modifications of the D-Adaptation method for learning-rate-free learning. Our methods improve upon the convergence rate …

The Idea of D-Adaptation

References

In this series →