Advantage-Induced Policy Alignment

Building on the classic results on reward weighted regression and its more recent adaptation to deep learning, a new algorithm called advantage-induced policy alignment (APA) is proposed for aligning language models (LM) to human preferences. APA is claimed to significantly outperform state-of-the-art methods such as PPO. This pill also contains a short overview of the family of algorithms from which APA emerged.

Prelude

This paper pill covers recent work by Zhu et al. [Zhu23F] and puts it into a short historical/algorithmic context. The advantage-induced policy alignment (APA) algorithm proposed therein is the latest installment in a long line of extensions of simple methods in reinforcement learning (RL), which was started by the reward weighted regression algorithm (derived from an expectation-maximization framework by Peters and Schaal [Pet07R]). This pill contains more references than usual, because I believe the context and ideas surrounding the algorithms mentioned here may be of general interest.

APA not only holds the promise of improving the state-of-the-art for aligning language models (LM) to human preferences, but could also be of general use in RL. What I find particularly fascinating about the works mentioned in this pill (and about the modern history of RL in general), is how powerful algorithms arise from small modifications to previously published (and often simple) methods that proved to work well in restricted settings. It leaves me wondering what other gems are still hidden in the literature, waiting to be rediscovered as state-of-the-art algorithms.

As a motivation for practitioners, we proceed with the experimental results of APA [Zhu23F], before diving deeper into the theoretical background and derivations.

Experimental Results

The APA algorithm follows the same training routines as typical reinforcement learning from human feedback (RLHF) methods (see e.g. Ziegler et al. [Zie20F]), but with a different loss. Thus, the authors can make a direct comparison of APA to RLHF with PPO on an already trained and fixed reward model. They also include a comparison to another loss derived from advantage weighted regression (AWR), which to my knowledge is the first time this method is applied to language modeling. Since one of the main novelties of the original AWR paper was the extension of the AWR loss to off-policy learning, the poor performance of AWR in the on-policy setting is maybe not too surprising. Thus, the main practically relevant comparison is between APA and PPO.

Figure 1 from [Zhu23F]. Comparison of the performance of three methods on the Helpfulness and Harmlessnes dataset. Top: The x-axis represents the total steps, which are proportional to the amount of data used in the training procedure. The y-axis is the reward evaluated by the same reward model. Bottom: The x-axis represents the total steps. The y-axis is the KL divergence between the trained model and the initial model.

We can see that APA outperforms PPO in terms of reward and has a comparable control of the KL divergence to the non-finetuned model $\pi_{\text{init}}$. However, I believe it is premature to conclude that APA is the new state-of-the-art for training language models from these results alone. Improving on a learned reward metric does not necessarily mean that the resulting model is better at the task we care about. In fact, we know that models can achieve very high reward with gibberish output, 1 1 Ziegler et al. [Zie20F] report high-reward completions of the type “! These These These sound flowed instantly easily easily easily easily!” for training without the KL-penalty. which is precisely the reason behind including the KL divergence to the initial policy in the optimization problem.

For the largest of the tested models (which is still small in comparison to LLMs in use today), APA was able to achieve higher reward at the cost of a higher KL divergence. Moreover, looking at the graphs, there seems to be a tendency to deviate more strongly from the initial policy as the model size increases. On the other hand, for smaller models, APA is unambiguously better than PPO.

The practical conclusion from these experiments is unclear to me at this moment. One would at least need to perform an evaluation on the standard downstream tasks to see if the performance of the model has actually improved. Unfortunately, the authors provide no such evaluation. This does not mean that APA is not a promising method, and I hope a future evaluation of with larger models and a variety of downstream tasks will shed more light on it.

From Weighted Regression to APA

Notation

We use standard notation for RL, where $\pi$ denotes a policy, $s$ a state, $a$ an action, $d^\pi$ a probability measure (of states or state-action pairs) induced by $\pi$, $A^\pi$ an (estimation of) the advantage function for the policy $\pi$, and $\mathcal{D}$ a dataset of trajectories sampled from an environment with some policy. We will not go into the details of what each subscript means, as it should be clear from the context. For a more complete overview we recommend to read through the cited references.

Core Ideas of Weighted Regression

RL algorithms from the weighted regression family are generally based on the following idea: given some policy $\pi_{\text{old}}$ and a dataset of trajectories $\mathcal{D}$ sampled from it, one wants to find an improved (in terms of expected reward) parameterized policy $\pi_\theta$ that is close to $\pi_{\text{old}}$ in some sense. 2 2 Note that this is not quite the same problem typically addressed in policy gradient (PG) or value iteration settings, where the goal is to find a policy that maximizes the expected reward without regard to $\pi_{\text{old}}$. Despite that, the implementations of weighted regression may look very similar to implementations of other RL methods, often just differing by the loss used in gradient descent.. E.g., in an on-policy weighted regression setting, one typically performs alternating rounds of incremental updates to a parameterized policy (which takes on the role of $\pi_ {\text{old}}$), and sampling of new data with the updated $\pi_\theta$, just like in standard iterative schemes. On the other hand in practical PG methods, one often uses some form of constraint to not deviate too strongly from the sampling policy in order to stabilize learning, like in weighted regression. The resulting optimization problems often admit closed form, non-parametric solutions - let’s call them $\pi^*$. The policy-improvement algorithms typically involve a projection of a parameterized policy $\pi_\theta$ onto $\pi^*$. These incremental improvements to $\pi_\theta$ can then be combined with sampling of new data and repeated until convergence. The analytic solutions $\pi^*$ generally take the form of $\pi_{\text{old}}$ weighted by some factors, and the projection step is performed with regression (i.e. supervised learning), hence the name.

A particularly clean manifestation of this scheme can be found in the AWR paper [Pen19A]. It goes roughly as follows:

Start with the first-order approximation of the expected improvement of a policy $\pi$ over the sampling policy $\pi_{\text{old}}$:
\begin{equation} J(\pi) - J(\pi_{\text{old}}) \approx \mathbb{E}_{s \sim d^{ \pi_{ \text{old} } } } \ \mathbb{E}_{a \sim \pi(a\mid s)} \left[ A^{\pi_{\text{old}}}(s,a) \right], \tag1\end{equation}
where $J$ is the expectation of discounted sum of future rewards (i.e. the objective one wants to maximize), and $A^{\pi_{\text{old}}}(s,a)$ is the advantage function of $\pi_{\text{old}}$.3 3 This expression is the starting point of several policy gradient algorithms, in particular trust region policy optimization and proximal policy optimization. A derivation can be found e.g. in our blog post on this topic.
Restrict the solution space to policies that are close to $\pi_{\text{old}}$ in terms of KL divergence, resulting in the Lagrangian
\begin{equation} \mathcal{L} = \mathbb{E}_{s \sim d^{ \pi_{ \text{old} } } } \ \mathbb{E}_{a \sim \pi(a\mid s)} \left[ A^{\pi_{\text{old}}}(s,a) \right] + \beta \left( \epsilon - \text{KL}(\pi_{\text{old}} \mid \mid \pi) \right), \tag2\end{equation}
where $\beta$ is the Lagrange multiplier and $\epsilon$ some small constant.
The analytic solution of the above problem is:
\begin{equation} \pi^*(a \mid s) = \frac{1}{Z(s)} \pi_{\text{old}}(a \mid s) \exp \left( \frac{A^{\pi_{\text{old}}}(s,a)}{\beta} \right), \tag3\end{equation}
where $Z(s)$ is a state-dependent normalization factor (the partition function). As noted above, this is a weighted version of $\pi_{\text{old}}$.
Find an improved set of parameters $\theta^*$ by minimizing the difference between $\pi_\theta$ and $\pi^*$ on the dataset $\mathcal{D}$ according to some distance function $\text{dist}$ for probability distributions:

\begin{equation} \theta^* = \arg \min_\theta \mathbb{E}_{s \sim \mathcal{D} } \left[ \text{dist}(\pi_\theta(\cdot \mid s), \pi^*(\cdot \mid s)) \right]. \tag4\end{equation}

Weighted Regression in Practice

There are multiple variations of this basic scheme. In the original AWR paper, the authors use the KL-divergence as the distance function and make the assumption that the partition function $Z(s)$ can be set to one for practical purposes.4 4 There is a small mistake in the derivation of the AWR algorithm in the original paper, where $Z(s)$ just silently disappears from the equations (the authors have acknowledged this in a private communication). The role of this factor has been discussed by Wang et al. [Wan18E] in the context of imitation learning and by Nair et al. for offline RL [Nai21A]. Moreover, they introduce a strategy for reusing samples from previous iterations, thereby turning advantage-weighted regression into an off policy algorithm. Other works used instantiations of (4) for imitation learning [Wan18E], or as an ingredient in an offline-RL algorithm [Nai21A, Kos21O].

Early works by Peters, Schaal, and Neumann [Pet07R, Neu08F] arrived at similar schemes by using the expectation-maximization (EM) framework on a lower bound on the expected reward (see also Strupl et. al. [Str22R] for a nice theoretical analysis and extension of the original reward weighted regression). The well-known REPS algorithm [Pet10R] and follow-up work is making use of the analytic solution to an objective similar to Equation 2. The widely used trust region and proximal policy optimization algorithms [Sch15T, Sch17P] directly aim at optimizing Equation 2 or a variation thereof, but without using the analytic solution. Instead, they focus on policy gradient methods and practical implementation details (see e.g. our blog post Natural, Trust Region and Proximal Policy Optimization for a detailed discussion).

The APA Algorithm

The APA algorithm is the newest addition to the family of weighted regression algorithms, and the first to my knowledge to tackle language modeling. It makes the following modifications to the basic scheme outlined above:

For language modelling, one wants the fine-tuned model to be close to original pre-trained model $\pi_{\text{init}}$ in terms of KL-divergence, i.e. $\pi_{\text{old}}$ is replaced with $\pi_{\text{init}}$. Therefore, the constraint in Equation 2 is adjusted accordingly, resulting in the optimal policy
\begin{equation} \pi^*(a \mid s) = \frac{1}{Z(s)} \pi_{\text{init}}(a \mid s) \exp \left( \frac{A^{\pi_{\text{old}}}(s,a) }{\beta} \right). \end{equation}
The authors argue that, for LMs, one can safely assume $Z(s) \approx 1 $.5 5 The authors' intuition behind this is that to first order in $\frac{A^{\pi_{\text{old}}}(s,a) }{\beta}$, the partition function is $1 + \frac{1}{\beta}\mathbb{E}_{a \sim \pi_{\text{init}}(a \mid s)} [A^{\pi_{\text{old}}}(s,a) ]$. Assuming that the tuned policy is close enough to the initial policy for the advantages to be similar, one gets $Z(s) \approx 1 + \frac{1}{\beta}\mathbb{E}_{a \sim \pi_{\text{init}}(a \mid s)} [A^{\pi_{\text{init}}}(s,a) ] = 1$. During the training they notice that the loss decreases by very little, suggesting that the tuned policy is indeed very close to $\pi_{\text{init}}$. On a personal note, while I don’t necessarily see the validity of these intermediate approximations (a value of $\beta=0.1$ is used in the experiments, which is not particularly large), the final justification is that the resulting algorithm works well. As the distance function $\text{dist}$ for the projection step Equation 4, they use the squared error between logits weighted by previously sampled data, i.e.

\begin{equation} \text{dist}_{\text{APA}}(\pi_\theta(\cdot \mid s), \pi^*(\cdot \mid s)) \mathrel{\mathop:}= \mathbb{E}_{a \sim \pi_{\text{old}}(a \mid s)} \left[ \left( \log \pi_\theta(a \mid s) - \log \pi^*(a \mid s) \right)^2 \right]. \end{equation}

These design choices result in the APA loss:

\begin{equation} \mathcal{L}_{\text{APA}} \mathrel{\mathop:}= \mathbb{E}_{(s, a) \sim \mathcal{D}_{\text{old}} } \left[ \log^2 \left( \frac{\pi_\theta(a \mid s)}{\pi_{\text{init}}(a \mid s)} \exp \left(- \frac{A^{\pi_{\text{old}}}(s,a)}{\beta} \right) \right) \right]. \tag5\end{equation}

When applying APA for training LMs, Zhu et al. use the weighted regression scheme in its on-policy version, i.e. they do not reuse samples from previous iterations. Thus, the method can be directly compared to other on-policy algorithms, differing from them only in the loss function Equation 5.

Discussion

From the theoretical point of view, the new advantage-induced policy alignment algorithm resulted from a small modification of advantage-weighted regression, and as such is not a major breakthrough.

On the practical side, however, this modification seems to result in a significant improvement in performance. Therefore, this work may be important for practical purposes, as well as for rejuvenating interest in the family of weighted regression algorithms.

I found it interesting to follow the discussions leading to the rejections of the AWR paper as well as some popular follow-up works like AWAC from the conferences they applied to; see the discussions on openreview here and here. The main reasons for rejection were that the new algorithms were neither sufficiently different from existing weighted regression approaches, nor more performant than state-of-the-art algorithms. We see, however, that simple and efficient ideas, while perhaps not immediately beating the state-of-the-art, can still be highly influential and eventually lead to significant improvements in performance. Many now famous papers in machine learning are based on such small theoretical modifications, so I believe it is fair to say that the authors of AWR were somewhat unlucky in their timing for the submission of their results.

Software-wise, APA is built on top of the modern trlx library for combining transformers with RL, and as such it could be a useful starting point for custom RLHF algorithms that follow good software engineering practices. It is refreshing to see research code that is going in the direction of being reusable and extensible. The code is freely available on GitHub under the MIT license.

References

[Zhu23F]

Fine-Tuning Language Models with Advantage-Induced Policy Alignment, Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, Michael I. Jordan, Jiantao Jiao.

Jun 2023

Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models (LLMs) to human preferences. Among the plethora of RLHF techniques, proximal policy optimization (PPO) is of the most widely used methods. Despite its popularity, however, PPO may suffer from mode collapse, instability, and poor sample efficiency. We show that these issues can be …

[Str22R]

Reward-Weighted Regression Converges to a Global Optimum, Miroslav Štrupl, Francesco Faccio, Dylan R. Ashley, Rupesh Kumar Srivastava, Jürgen Schmidhuber.

Jun 2022

Reward-Weighted Regression (RWR) belongs to a family of widely known iterative Reinforcement Learning algorithms based on the Expectation-Maximization framework. In this family, learning at each iteration consists of sampling a batch of trajectories using the current policy and fitting a new policy to maximize a return-weighted log-likelihood of actions. Although RWR is known to yield monotonic …

Publication

[Nai21A]

AWAC: Accelerating Online Reinforcement Learning with Offline Datasets, Ashvin Nair, Abhishek Gupta, Murtaza Dalal, Sergey Levine.

Apr 2021

Reinforcement learning (RL) provides an appealing formalism for learning control policies from experience. However, the classic active formulation of RL necessitates a lengthy active exploration process for each behavior, making it difficult to apply in real-world settings such as robotic control. If we can instead allow RL algorithms to effectively use previously collected data to aid the online …

[Kos21O]

Offline Reinforcement Learning with Implicit Q-Learning, Ilya Kostrikov, Ashvin Nair, Sergey Levine.

Dec 2021

Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This tradeoff is critical, because most current offline reinforcement learning methods need to query the value of …

[Zie20F]

Fine-Tuning Language Models from Human Preferences, Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, Geoffrey Irving.

Jan 2020

Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and …

[Pen19A]

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning, Xue Bin Peng, Aviral Kumar, Grace Zhang, Sergey Levine.

Oct 2019

In this paper, we aim to develop a simple and scalable reinforcement learning algorithm that uses standard supervised learning methods as subroutines. Our goal is an algorithm that utilizes only simple and convergent maximum likelihood loss functions, while also being able to leverage off-policy data. Our proposed approach, which we refer to as advantage-weighted regression (AWR), consists of two …

[Wan18E]

Exponentially Weighted Imitation Learning for Batched Historical Data, Qing Wang, Jiechao Xiong, Lei Han, peng sun, Han Liu, Tong Zhang.

2018

We consider deep policy learning with only batched historical trajectories. The main challenge of this problem is that the learner no longer has a simulator or ``environment oracle'' as in most reinforcement learning settings. To solve this problem, we propose a monotonic advantage reweighted imitation learning strategy that is applicable to problems with complex nonlinear function approximation …