Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

The process of pre-training large language models can incur significant expenses. As these models continue to grow in size, there is an increasing need for optimizers with better convergence properties to reduce the overall cost of training.

Large transformer models often possess loss functions with heterogeneous curvature properties across different parameter dimensions. Consequently, the incorporation of per-coordinate curvature information into the descent direction can potentially enhance convergence. However, calculating second-order information can be computationally expensive.

In response to the enormous costs associated with large language model pre-training, the authors of the paper [Liu23S] introduce a second-order optimizer designed to diminish the number of iterations required for convergence, as compared to the prevailing first-order methods. Simultaneously, it aims to maintain a similar computational cost per iteration.

The researchers contrast their novel optimization method, Sophia, against AdamW—the prevalent solver used in training large language models. Sophia incorporates gradient smoothing components akin to those in Adam, and combines these with smoothed second-order information. In short, the descent update becomes the moving average of the gradients divided by the moving average of the estimated Hessian diagonal, followed by an element-wise clipping procedure.

## Method

Adopting the notation used in the original paper, we let $\theta_t$ represent the solution at iteration $t$, $L_t(\theta_t)$ denote the mini-batch loss, and $\eta_t$ be the step size. The method implemented by Sophia can be summarized as follows:

1. Exponential smoothing of the minibatch gradients at each iteration:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\nabla L_t(\theta_t)$$
2. Exponential smoothing of the Hessian information every $k=10$ iterations:
$$h_t = \beta_2 h_{t-k} + (1 - \beta_2)\hat{h}_t,$$
and $h_t = h_{t-1}$ if $t \bmod k \neq 1$. Here, $\hat{h}_t$ stands for a lightweight estimator of the Hessian's diagonal at iteration $t$.
3. Per-coordinate clipping:
$$\theta_{t+1} \leftarrow \theta_t - \eta_t \cdot \operatorname{clip}(m_t / \max(h_t, \varepsilon), \rho),$$
where $\operatorname{clip}(z, \rho) = \max(\min(z, \rho), -\rho)$.
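The three steps above can be sketched in a few lines of NumPy. This is an illustrative toy on a quadratic loss, not the authors' reference implementation; the hyperparameter values are arbitrary, though the names (`beta1`, `beta2`, `rho`, `eps`) follow the paper's notation:

```python
import numpy as np

def sophia_step(theta, m, h, grad, h_hat, lr=0.1, beta1=0.9, beta2=0.99,
                rho=1.0, eps=1e-12, update_hessian=True):
    """One Sophia-style update on a parameter vector (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad               # smoothed gradient
    if update_hessian:                               # only every k-th step
        h = beta2 * h + (1 - beta2) * h_hat          # smoothed Hessian diagonal
    ratio = m / np.maximum(h, eps)                   # preconditioned direction
    theta = theta - lr * np.clip(ratio, -rho, rho)   # element-wise clipping
    return theta, m, h

# Toy quadratic L(theta) = 0.5 * sum(c * theta**2), so grad = c * theta and
# the exact Hessian diagonal is c, with very different curvature per axis.
c = np.array([100.0, 1.0])
theta, m, h = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
theta, m, h = sophia_step(theta, m, h, grad=c * theta, h_hat=c)
print(theta)  # [0.9 0.9] — clipping caps each coordinate's step at lr * rho
```

Note how the clipping bounds the per-coordinate step size by $\eta_t \rho$ regardless of how extreme the gradient-to-curvature ratio becomes, which is exactly the stabilizing behavior the authors rely on.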

*Figure 1 from [Liu23S]. Top: the number of iterations required to reach a validation loss comparable to AdamW's is roughly halved when training GPT-2 on OpenWebText. Middle: given the marginal per-iteration overhead, this halving of the iteration count translates into a comparable reduction in total floating-point operations. Bottom: improved scaling laws, i.e., optimizing a smaller model with Sophia yields a validation loss comparable to what one might achieve with a considerably larger model optimized with AdamW.*

The authors highlight two critical points. First, the stochastic estimator for the Hessian diagonal should not introduce substantial overhead per step. It should be computationally on par with simple gradient computation (the original article proposes two options to achieve this). Second, the smoothing of the Hessian information and the clipping procedure offer stability to the optimization process by mitigating the effects of inaccurate Hessian estimates, rapidly changing curvature, and challenges arising from non-convexity (i.e., when the algorithm moves uphill instead of following a descent direction).
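To illustrate the first point, one family of lightweight diagonal estimators is Hutchinson-style probing: for a random vector $u$ with $\mathbb{E}[uu^\top] = I$, the expectation of $u \odot (\nabla^2 L \, u)$ equals the Hessian's diagonal. The NumPy sketch below (my illustration, not the paper's code) uses a fixed symmetric matrix so the Hessian-vector product is exact; in an actual training loop that product would come from an extra automatic-differentiation pass rather than an explicit matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
H = A @ A.T  # symmetric PSD stand-in for a Hessian

def hutchinson_diag(H, num_samples=10000, rng=rng):
    """Monte Carlo estimate of diag(H) via E[u * (H @ u)] with Gaussian u."""
    U = rng.standard_normal((num_samples, H.shape[0]))  # probe vectors (rows)
    return (U * (U @ H)).mean(axis=0)                   # average of u ⊙ (H u)

print(np.diag(H))
print(hutchinson_diag(H))  # approaches diag(H) as the sample count grows
```

In the optimizer only one probe vector is drawn every $k$-th step, so the single-sample estimate is noisy; the exponential smoothing of $h_t$ and the clipping described above are what make this cheap, high-variance signal usable.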

## Experimental Results

The experimental evaluation conducted by the authors focused on training the GPT-2 model (at varying numbers of parameters) on the OpenWebText corpus.

The authors noted that using Sophia led to a significant performance improvement. Specifically, the number of iterations needed to achieve a certain level of validation loss was halved when compared to using AdamW. As the Hessian computations contribute less than a 5% overhead, this reduction in iterations considerably decreases the total compute requirements. The authors encapsulate this finding as follows:

“Sophia is 2x faster in terms of number of steps, total compute and wall-clock time.”

Furthermore, the authors observed that models optimized by Sophia exhibit validation losses comparable to significantly larger models trained with AdamW. Interestingly, this performance differential increases as the size of the model grows.

“The scaling law is in favor of Sophia-H over AdamW”

## Discussion

At first glance, one might underestimate the potential of Sophia given its “sparse” use of second-order information. Yet, the results presented in the original article are nothing short of impressive. Sophia applies an effective combination of stochastic estimation of curvature, smoothing and clipping to achieve a very well-designed balance between computational overhead and improved convergence behavior.

Given the current trend towards larger and more complex language models, coupled with the substantial computational resources required for their training, the improvements Sophia brings to the table are significant. While the experiments to date have focused on large language models trained on text, it would be exciting to investigate the potential of Sophia across a wider range of applications, including various domains of Natural Language Processing, Computer Vision, and more.

The authors have conveniently provided an open-source implementation, which leverages PyTorch’s Optimizer base class. This allows it to be used directly as a drop-in replacement.