Large transformer models often have loss landscapes whose curvature varies widely across parameter dimensions. Incorporating per-coordinate curvature information into the descent direction can therefore speed up convergence, but computing second-order information is typically expensive.

In response to the enormous cost of large language model pre-training, the authors of the paper [Liu23S] introduce a second-order optimizer designed to reduce the number of iterations needed for convergence compared to the prevailing first-order methods, while keeping a similar computational cost per iteration.

The researchers contrast their novel optimization method,
*Sophia*, against *AdamW*—the prevalent solver used in training large language models.
*Sophia* incorporates gradient smoothing components akin to those in Adam, and combines these with smoothed second-order information.
In short, the descent direction is the moving average of the gradients divided element-wise by the moving average of the estimated Hessian diagonal, followed by a per-coordinate clipping step.

## Method

Adopting the notation used in the original paper, we let $\theta_t$ represent the solution at iteration $t$, $L_t(\theta_t)$ denote the mini-batch loss, and $\eta_t$ be the step size. The method implemented by Sophia can be summarized as follows:

**Exponential smoothing of minibatch gradients at each iteration:**

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, \nabla L_t(\theta_t)$$

**Exponential smoothing of the Hessian information every $k=10$ iterations:**

$$h_t = \beta_2 h_{t-k} + (1 - \beta_2)\, \hat{h}_t \quad \text{if } t \operatorname{mod} k = 1,$$

and $h_t = h_{t−1}$ if $t \operatorname{mod} k \neq 1$. Here, $\hat{h}_t$ stands for a lightweight estimator of the Hessian’s diagonal at iteration $t$.

**Per-coordinate clipping:**

$$\theta_{t+1} = \theta_t - \eta_t \operatorname{clip}\!\left(\frac{m_t}{\max\{\gamma\, h_t, \epsilon\}},\ \rho\right),$$

where $\operatorname{clip}(z, ρ) = \max(\min(z, ρ), −ρ)$ is applied element-wise, $\gamma$ scales the Hessian estimate, and $\epsilon$ is a small constant guarding against division by zero.
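Putting the three steps together, here is a minimal NumPy sketch of one update. The function name, signature, and the concrete hyperparameter values are illustrative choices for this sketch, not the authors' implementation:

```python
import numpy as np

def sophia_step(theta, grad, h_hat, state, t, lr=0.02,
                beta1=0.96, beta2=0.99, gamma=0.01, rho=1.0, eps=1e-12, k=10):
    """One illustrative Sophia-style update on NumPy arrays."""
    # Exponential smoothing of the mini-batch gradient (every iteration).
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    # Refresh the smoothed Hessian diagonal only every k-th iteration.
    if t % k == 1:
        state["h"] = beta2 * state["h"] + (1.0 - beta2) * h_hat
    # Pre-conditioned descent direction with per-coordinate clipping.
    direction = np.clip(state["m"] / np.maximum(gamma * state["h"], eps),
                        -rho, rho)
    return theta - lr * direction
```

On a toy quadratic loss, the clipped update behaves like sign descent far from the optimum (the ratio saturates at $\pm\rho$) and like a Newton-scaled step once the smoothed gradient becomes small relative to the curvature.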

The authors highlight two critical points. First, the stochastic estimator for the Hessian diagonal should not introduce substantial overhead per step. It should be computationally on par with simple gradient computation (the original article proposes two options to achieve this). Second, the smoothing of the Hessian information and the clipping procedure offer stability to the optimization process by mitigating the effects of inaccurate Hessian estimates, rapidly changing curvature, and challenges arising from non-convexity (i.e., when the algorithm moves uphill instead of following a descent direction).
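One of the two estimator options discussed in the paper is a Hutchinson-type estimator: draw a random sign vector $u$ and average $u \odot (\nabla^2 L\, u)$, whose expectation is the Hessian diagonal; each sample only needs a Hessian-vector product, which costs about as much as an extra gradient. Below is an illustrative NumPy sketch on a small fixed matrix (the paper's version instead obtains the Hessian-vector product through automatic differentiation):

```python
import numpy as np

def hutchinson_diag(hvp, dim, n_samples=100, rng=None):
    """Estimate diag(H) using only Hessian-vector products (Hutchinson's trick).

    hvp: callable mapping a vector u to the product H @ u.
    """
    rng = np.random.default_rng(rng)
    est = np.zeros(dim)
    for _ in range(n_samples):
        u = rng.choice([-1.0, 1.0], size=dim)  # Rademacher probe vector
        est += u * hvp(u)                      # E[u * (H u)] = diag(H)
    return est / n_samples
```

The variance of the estimate comes only from the off-diagonal entries of $H$, which is one reason the smoothing and clipping in Sophia matter: individual estimates can be noisy.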

## Experimental Results

The experimental evaluation conducted by the authors focused on training GPT-2 models of varying sizes on the OpenWebText corpus.

The authors noted that using *Sophia* led to a significant performance improvement.
Specifically, the number of iterations needed to achieve a certain level of validation loss
was halved when compared to using *AdamW*. As the Hessian computations contribute less than a 5% overhead,
this reduction in iterations considerably decreases the total compute requirements. The authors encapsulate this finding as follows:

**“Sophia is 2x faster in terms of number of steps, total compute and wall-clock time.”**

Furthermore, the authors observed that models optimized by *Sophia* exhibit validation losses comparable to significantly
larger models trained with *AdamW*. Interestingly, this performance differential increases as the size of the model grows.

**“The scaling law is in favor of Sophia-H over AdamW”**

## Discussion

At first glance, one might underestimate the potential of *Sophia* given its “sparse” use of second-order information.
Yet, the results presented in the original article are nothing short of impressive.
*Sophia* applies an effective combination of stochastic estimation of curvature, smoothing and clipping
to achieve a very well-designed balance between computational overhead and improved convergence behavior.

Given the current trend towards larger and more complex language models, coupled with the substantial computational
resources required for their training, the improvements *Sophia* brings to the table are significant.
While the experiments to date have focused on large language models trained on text,
it would be exciting to investigate the potential of *Sophia* across a wider range of applications,
including various domains of Natural Language Processing, Computer Vision, and more.

The authors have conveniently provided an open-source implementation built on PyTorch’s `Optimizer` base class, which allows it to be used as a drop-in replacement for existing optimizers.