Direct Preference Optimization | TransferLab

The paper [Raf23D] out of Chelsea Finn’s lab at Stanford proposes a new method called direct preference optimization (DPO) for aligning language models (LM) to human preferences without using reinforcement learning from human feedback (RLHF).

In summary, DPO replaces the combination of reward learning and regularized RL in RLHF by plugging the policy (i.e. the language model) directly into the reward objective and optimizing only the latter. This results in a much simpler training procedure (only supervised learning, no need to sample from the LM) and reportedly better performance.

Notation

In this pill we use the same notation as in the paper. In particular, $\mathcal{D}$ denotes the dataset of human preferences, with inputs denoted as $x$ and rated completions as $y$ with different subscripts. The policy (i.e. the language model itself) is denoted as $\pi$ with various subscripts. In the context of learning human preferences, it is viewed as a model mapping $x$ to a distribution over entire completions $y$ (i.e. not just next-token prediction as in pretraining).

$D_{\text{KL}}$ denotes the KL-divergence between two distributions, and $\pi_{\text{ref}}$ denotes a reference policy, usually given by the pretrained language model before fine-tuning. The temperature $\beta$ is a hyperparameter that controls the strength of the regularization.

Method

The mathematical derivation of DPO is short and insightful, which is why we include a sketch of it. It is based on the following observations:

1. Reward as a Function of the Policy

The regularized reward-maximization objective of RLHF (see e.g. [Zie20F]) $$ \pi_r := \text{argmax}_\pi \ \mathbb{E}_{x \sim \mathcal{D}, y\sim \pi(y \mid x)} \left[ r(x, y)- \beta D_{\text{KL}} \left( \pi(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x) \right) \right] $$ has an exact analytic solution: $$ \pi_r(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp \left( \frac{1}{\beta} r(x, y) \right), $$ where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp \left( \frac{1}{\beta} r(x, y) \right)$ is the partition function. Similar results were derived for the REPS algorithm [Pet10R] and in follow-up work (a more recent paper in that direction is [Pen19A]). While this solution is in itself intractable (because $Z(x)$ sums over all possible completions), it can be used to express the reward as a function of the optimal policy:

\begin{equation} r(x, y) = \beta \log \left( \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right) + \beta \log Z(x). \tag{1}\end{equation}
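The closed-form solution can be checked numerically on a toy problem. The sketch below assumes a single prompt with only three possible completions, so that $Z(x)$ can be summed explicitly; the probabilities, rewards and function names are made up for illustration. It builds $\pi_r$ from $\pi_{\text{ref}}$ and $r$, recovers the rewards by inverting the solution, and evaluates the regularized objective.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """Closed form pi_r(y|x) = pi_ref(y|x) * exp(r(x,y) / beta) / Z(x) for one prompt."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref_probs, rewards)]
    z = sum(weights)  # partition function Z(x); tractable only because the toy set is tiny
    return [w / z for w in weights], z

def recovered_rewards(policy, ref_probs, z, beta):
    """Invert the solution: r = beta * log(pi_r / pi_ref) + beta * log Z."""
    return [beta * math.log(p / q) + beta * math.log(z)
            for p, q in zip(policy, ref_probs)]

def objective(policy, ref_probs, rewards, beta):
    """E_{y ~ pi}[r(x, y)] - beta * KL(pi || pi_ref) for a single prompt."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs) if p > 0)
    return expected_reward - beta * kl

# Hypothetical toy values: three completions for one prompt.
ref_probs = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.5]
beta = 0.5

pi_r, z = optimal_policy(ref_probs, rewards, beta)
print("optimal policy:", pi_r)
print("recovered rewards:", recovered_rewards(pi_r, ref_probs, z, beta))
print("objective at pi_r:", objective(pi_r, ref_probs, rewards, beta))
print("objective at pi_ref:", objective(ref_probs, ref_probs, rewards, beta))
```

In this toy setting the value of the objective at $\pi_r$ equals $\beta \log Z(x)$, which is a convenient sanity check. For a real language model the sum over completions is of course not available.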

2. Only Differences of Rewards Are Needed

The first step of RLHF is to learn a reward on human preference data. With preferences collected as pairwise comparisons of completions $(x, y_w, y_l)$ (where $x$ is the input, $y_w$ is the preferred/winning completion, and $y_l$ the losing one), the reward $r_\phi$ is learned by minimizing the loss:

$$ \mathcal{L}_\phi = - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \frac{ e ^ {r_\phi(x, y_w)}}{ e^{r_\phi(x, y_w)} + e^{r_\phi(x, y_l)}} \right] $$

which is equivalent to

\begin{equation} \mathcal{L}_\phi = - \mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \left[ \log \sigma \left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right], \tag2\end{equation} where $\sigma$ is the sigmoid function. Note that only differences of rewards enter (2).
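A quick numerical check of this equivalence, as a minimal sketch with made-up reward values: the negative log-likelihood written as a two-way softmax coincides with the log-sigmoid form, and adding the same constant to both rewards leaves the loss unchanged.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss_softmax_form(r_w, r_l):
    """Negative log-probability that y_w is preferred, written as a two-way softmax."""
    return -math.log(math.exp(r_w) / (math.exp(r_w) + math.exp(r_l)))

def loss_sigmoid_form(r_w, r_l):
    """The same quantity written as -log sigma(r_w - r_l), as in (2)."""
    return -math.log(sigmoid(r_w - r_l))

# Hypothetical (winner, loser) reward pairs.
pairs = [(1.2, 0.3), (-0.5, 0.7), (2.0, 2.0)]
shift = 3.0  # an arbitrary offset added to both rewards

for r_w, r_l in pairs:
    print(loss_softmax_form(r_w, r_l),
          loss_sigmoid_form(r_w, r_l),
          loss_sigmoid_form(r_w + shift, r_l + shift))
```

The invariance under a common shift is what later allows a term depending only on $x$ to drop out of the loss.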

3. DPO Objective

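After plugging the expression for the reward (1) into the loss (2), the partition function $Z(x)$ cancels out: the term containing it depends only on $x$, and (2) involves only the difference of the rewards of two completions of the same input. Replacing the optimal $\pi_r$ with the learned $\pi_\theta$, the DPO objective is obtained as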

\begin{equation} \mathcal{L}_{\text{DPO}}(\pi_\theta ; \pi_{\text{ref}}) := - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]. \end{equation}
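To make the objective concrete, here is a minimal sketch in plain Python. It assumes that the log-probability of each complete completion (the sum of its token log-probabilities) has already been computed under both the trained policy and the frozen reference policy; the dictionary keys and the numbers are hypothetical. In practice one would compute this with an autodiff framework and backpropagate through the policy log-probabilities only.

```python
import math

def log_sigmoid(t):
    """Numerically stable log(sigma(t))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def dpo_loss(batch, beta):
    """Average DPO loss over a batch of preference pairs.

    Each item holds sequence-level log-probabilities of the preferred (w)
    and dispreferred (l) completion under the policy and the reference.
    """
    total = 0.0
    for item in batch:
        logratio_w = item["policy_logp_w"] - item["ref_logp_w"]
        logratio_l = item["policy_logp_l"] - item["ref_logp_l"]
        total += -log_sigmoid(beta * (logratio_w - logratio_l))
    return total / len(batch)

# Hypothetical log-probabilities for two preference pairs.
batch = [
    # The policy already favors y_w relative to the reference.
    {"policy_logp_w": -12.0, "ref_logp_w": -13.0,
     "policy_logp_l": -15.0, "ref_logp_l": -14.0},
    # The policy equals the reference on both completions.
    {"policy_logp_w": -20.0, "ref_logp_w": -20.0,
     "policy_logp_l": -18.0, "ref_logp_l": -18.0},
]
print(dpo_loss(batch, beta=0.1))
```

When the policy equals the reference, as at the start of training, every pair contributes $\log 2$ to the loss. Pairs on which the policy already raises the preferred completion relative to the reference contribute less.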

Thus, instead of first learning a reward and then finding the policy that optimizes it, one directly finds the policy whose reward, as obtained from (1), agrees with the collected human preferences (i.e. a reward that optimizes (2)). Note that while the induced reward function itself is intractable, the differences of rewards between completions of the same input remain tractable and can be computed using the learned policy. This should be sufficient for practical purposes, where rewards are mostly used to rank completions and e.g. perform rejection sampling.
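As an illustration of the last point, the sketch below ranks several candidate completions of one prompt by the quantity $\beta \log \left( \pi_\theta(y \mid x) / \pi_{\text{ref}}(y \mid x) \right)$, which differs from the induced reward only by a term that is constant for a fixed $x$. The candidate texts and log-probabilities are invented, and `implicit_reward` and `rank_candidates` are hypothetical helper names.

```python
def implicit_reward(policy_logp, ref_logp, beta):
    """beta * log(pi_theta / pi_ref): the induced reward up to a term depending only on x."""
    return beta * (policy_logp - ref_logp)

def rank_candidates(candidates, beta):
    """Sort completions of the SAME prompt from highest to lowest implicit reward."""
    return sorted(
        candidates,
        key=lambda c: implicit_reward(c["policy_logp"], c["ref_logp"], beta),
        reverse=True,
    )

# Hypothetical sequence-level log-probabilities for three sampled completions.
candidates = [
    {"text": "completion A", "policy_logp": -10.0, "ref_logp": -11.0},
    {"text": "completion B", "policy_logp": -9.0, "ref_logp": -8.5},
    {"text": "completion C", "policy_logp": -14.0, "ref_logp": -16.5},
]

ranked = rank_candidates(candidates, beta=0.1)
print([c["text"] for c in ranked])
best = ranked[0]  # best-of-n selection keeps only the top-ranked completion
```

Such scores are only comparable across completions of the same input, since the dropped term changes from prompt to prompt.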

The paper includes further details, a discussion of how to interpret the DPO update, and a detailed comparison to standard RLHF, but the essence of the method is captured by the above derivation. It can be easily extended to the case of more than two ranked completions per input.
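One way such an extension can look, sketched here under the assumption that rankings follow a Plackett-Luce model: the completions are ordered from best to worst, and at each position the loss is the negative log-probability of picking that completion among those not yet ranked. The function name and the example values are illustrative only.

```python
import math

def plackett_luce_dpo_loss(logratios_ranked, beta):
    """Ranking loss for one prompt.

    logratios_ranked: log(pi_theta(y|x) / pi_ref(y|x)) for each completion,
    ordered from most to least preferred.
    """
    scores = [beta * lr for lr in logratios_ranked]
    loss = 0.0
    for k in range(len(scores)):
        remaining = scores[k:]
        m = max(remaining)  # subtract the max for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in remaining))
        loss -= scores[k] - log_norm
    return loss

# Hypothetical log-ratios for three completions, best first.
print(plackett_luce_dpo_loss([1.0, -0.5, -2.0], beta=0.1))
```

With exactly two completions this reduces to the pairwise objective above, since the last position always contributes zero.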

Experimental Results

Figure 2 from [Raf23D]. Top. TL;DR summarization win rates vs. human-written summaries, using GPT-4 as evaluator. DPO exceeds PPO’s best-case performance on summarization, while being more robust to changes in the sampling temperature. Bottom. The frontier of expected reward vs KL to the reference policy. DPO provides the highest expected reward for all KL values, demonstrating the quality of the optimization.

Fine-tuning large models and collecting human preferences are expensive, so the authors resort to smaller models and use GPT-4 as a proxy for human preference evaluation. In all their experiments, DPO outperforms RLHF with PPO (see e.g. Figure 2). Presumably, this is due to the general difficulty of training agents with RL, in particular the instabilities of actor-critic methods.

Author’s Take

One may find it surprising that supervised learning is able to replace RL on a formal level. For RLHF, new data is sampled from the language model. How can it be that an algorithm that does not involve any sampling can achieve the same or even better results?

On second thought, it may not be too surprising after all. The sampled data is not really new: it is created using the very same model that one is trying to optimize. The rewards for these samples are not new either: they are obtained from a reward function fitted to the preferences, and no new human preferences are collected during optimization. So from the information-flow perspective, supervised learning and RL are indeed equivalent in this particular case. Maybe Francois Chollet's suggestion, made in a tweet, to get rid of deep RL altogether was not so extreme after all (although I personally don't agree).

I find the premise of DPO very appealing and am amazed that the simple calculation leading to the DPO objective has apparently not been done before. However, the experimental section felt rushed and a bit underwhelming: such strong claims should be backed up by very strong evidence. My guess is that the small scale of the experiments was mainly due to time and cost constraints. At the time of writing, there is an ongoing discussion in LAION’s Open Assistant about using DPO for their training. If this is followed through, we should get some solid experimental results soon. I am looking forward to them!
