The paper [Raf23D] out of Chelsea Finn’s lab at Stanford proposes a new method called direct preference optimization (DPO) for aligning language models (LMs) with human preferences without using reinforcement learning from human feedback (RLHF).
In summary, DPO replaces RLHF’s combination of reward learning and regularized RL by expressing the reward directly in terms of the policy (i.e. the language model) and plugging it into the reward-learning objective, which then becomes the only thing that needs to be optimized. This results in a much simpler training procedure (only supervised learning, no sampling from the LM) and reportedly better performance.
Notation
In this pill we use the same notation as in the paper. In particular, $\mathcal{D}$ denotes the dataset of human preferences, with inputs denoted as $x$ and rated completions as $y$ with different subscripts. The policy (i.e. the language model itself) is denoted as $\pi$ with various subscripts. In the context of learning human preferences, it is viewed as a model mapping $x$ to a distribution over entire completions $y$ (i.e. not just next-token prediction as in pretraining).
$D_{\text{KL}}$ denotes the KL-divergence between two distributions, and $\pi_{\text{ref}}$ denotes a reference policy, usually given by the pretrained language model before fine-tuning. The temperature $\beta$ is a hyperparameter that controls the strength of the regularization.
Method
The mathematical derivation of DPO is short and insightful, which is why we include a sketch of it. It is based on the following observations:
1. Reward as a Function of the Policy
The regularized reward-maximization objective of RLHF (see e.g. [Zie20F])
$$
\pi_r := \text{argmax}_\pi \ \mathbb{E}_{x \sim \mathcal{D}, y\sim \pi(y \mid x)} \left[
r(x, y) - \beta D_{\text{KL}} \left( \pi(y \mid x) \,||\, \pi_{\text{ref}}(y \mid x) \right)
\right]
$$
has an exact analytic solution:
$$
\pi_r(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp
\left( \frac{1}{\beta} r(x, y) \right).
$$
Similar results were proved for the REPS algorithm [Pet10R] and in follow-up work (a more recent paper in that direction is [Pen19A]). While this solution is in itself intractable (the partition function $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp\left( \frac{1}{\beta} r(x, y) \right)$ sums over all possible completions), it can be used to express the reward as a function of the optimal policy:
\begin{equation} r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x). \tag{1}\end{equation}
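Spelled out, (1) follows by taking the logarithm of the analytic solution,
$$
\log \pi_r(y \mid x) = \log \pi_{\text{ref}}(y \mid x) + \frac{1}{\beta} r(x, y) - \log Z(x),
$$
and solving for $r(x, y)$.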
2. Only Differences of Rewards Are Needed
The first step of RLHF is to learn a reward on human preference data. With preferences collected as pairwise comparisons of completions $(x, y_w, y_l)$ (where $x$ is the input, $y_w$ is the preferred/winning completion, and $y_l$ the losing one), the reward $r_\phi$ is learned by minimizing the loss:
$$ \mathcal{L}_\phi = - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \frac{ e ^ {r_\phi(x, y_w)}}{ e^{r_\phi(x, y_w)} + e^{r_\phi(x, y_l)}} \right] $$
which is equivalent to
\begin{equation} \mathcal{L}_\phi = - \mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \left[ \log \sigma \left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right], \tag{2}\end{equation} where $\sigma$ is the sigmoid function. Note that only differences of rewards enter (2).
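As a concrete illustration, here is a minimal PyTorch sketch of the pairwise loss (2). The function and tensor names are mine, and the scalar rewards are assumed to come from whatever reward head is placed on top of the LM.

```python
import torch
import torch.nn.functional as F


def pairwise_reward_loss(
    rewards_chosen: torch.Tensor,    # r_phi(x, y_w) for a batch, shape (B,)
    rewards_rejected: torch.Tensor,  # r_phi(x, y_l) for a batch, shape (B,)
) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood of (2): only the reward difference enters."""
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
```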
3. DPO Objective
After plugging the reward expression (1) into the loss (2), the partition function $Z(x)$ cancels out. Replacing the optimal $\pi_r$ with the learned $\pi_\theta$, the DPO objective is obtained as
\begin{equation} \mathcal{L}_{\text{DPO}}(\pi_\theta ; \pi_{\text{ref}}) := - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]. \end{equation}
Thus, instead of first learning a reward and then finding the policy that optimizes it, one directly finds the policy whose induced reward, as obtained from (1), matches the collected human preferences (i.e. minimizes (2)). Note that while the induced reward function itself is intractable, differences of rewards remain tractable and can be computed from the learned policy. This should be sufficient for practical purposes, where rewards are mostly used to rank completions, e.g. for rejection sampling.
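To make the objective concrete, here is a minimal PyTorch sketch of the DPO loss. The function and argument names are mine; it assumes the per-completion log-probabilities (summed over the completion’s tokens) have already been computed for both the trained policy $\pi_\theta$ and the frozen reference model $\pi_{\text{ref}}$.

```python
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape (B,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape (B,)
    beta: float = 0.1,                    # regularization temperature
) -> torch.Tensor:
    """Logistic loss on the difference of implicit rewards (the DPO objective)."""
    # Implicit rewards beta * log(pi_theta / pi_ref); the beta * log Z(x) term
    # cancels in the difference, so it never needs to be computed.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin, averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note that the reference model only enters through its fixed log-probabilities, so training requires neither sampling from the LM nor a separate reward model.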
The paper includes further details, an interpretation of the DPO update, and a detailed comparison to standard RLHF, but the essence of the method is captured by the derivation above. It is easily extended to the case of more than two rated completions per input.
Experimental Results
Fine-tuning large models and collecting human preferences is expensive, so the authors resort to smaller models and to using GPT-4 as a proxy for human preference evaluation. In all their experiments, DPO outperforms RLHF with PPO (see e.g. Figure 2). Presumably, this is due to the general difficulty of training agents with RL, in particular the instabilities of actor-critic methods.
Author’s Take
One may find it surprising that supervised learning is able to replace RL on a formal level. In RLHF, new data is sampled from the language model during optimization. How can an algorithm that does not involve any sampling achieve the same or even better results?
On second thought, it may not be too surprising after all. The sampled data is not really new: it is created using the very same model that one is trying to optimize. The rewards for these samples are not new either; they are obtained by fitting a reward function to the preferences, and no new human preferences are collected during optimization. So from an information-flow perspective, supervised learning and RL are indeed equivalent in this particular case. Maybe Francois Chollet’s tweet suggesting to get rid of deep RL altogether was not so extreme after all (although I personally don’t agree).
I find the premise of DPO very appealing and am amazed that the simple calculation leading to the DPO objective has apparently not been done before. However, the experimental section felt rushed and a bit underwhelming: such strong claims should be backed up by very strong evidence. My guess is that the small scale of the experiments was mainly due to time and cost constraints. At the time of writing, there is an ongoing discussion in LAION’s Open Assistant about using DPO for their training. If this is followed through, we should get some solid experimental results soon. I am looking forward to them!