Prelude
One of the many stunning features of large language models (LLMs) is their ability to produce outputs that resemble step-by-step reasoning [Wei22C]. Reasoning traces not only improve performance empirically, they also help explain the model's output. While most of the widely known approaches elicit reasoning through zero- or few-shot prompting, a recent line of work investigates how reasoning traces can be used as an additional learning signal. This paper pill presents two instantiations of this idea.
Yang et al. [Yan22C] perform procedure cloning (PC), a special form of imitation learning in which a model is trained to imitate both the expert's output and its underlying reasoning. Complementary to imitation learning, Lightman et al. [Lig23L] propose rewarding the soundness of individual reasoning steps in a reinforcement learning (RL) setting, which they call process supervision. An important part of their work is the comparison of process supervision to pure outcome supervision (rewards based only on the output's correctness).
Process Imitation
Classic imitation learning (like behavior cloning) is concerned with learning a mapping from states to actions that mimics an expert. In many tasks such as navigation, manipulation, or games, the expert relies on a multistep decision-making procedure like search or planning. While the conventional imitation learning formalism discards this planning information and relies solely on the output, Yang et al. [Yan22C] propose to explicitly learn the "how" behind an optimal action, which they call procedure cloning. The intuition is that learning why an action was taken supports better decision-making and may improve generalisation.
The training objective is to maximize the joint likelihood of actions and explanations $p(a,x|s)$, where $a$ is an expert action, $x$ an explanation, and $s$ an input state. A policy mapping states to $(x, a)$ can either be autoregressive or follow a conditional independence assumption.
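To make the objective more concrete, here is a minimal sketch of the autoregressive variant, assuming a toy GRU policy over integer-encoded explanation tokens with the expert action appended as the final target token (all names and shapes are hypothetical, not the authors' implementation):

```python
import torch
import torch.nn as nn

class AutoregressivePolicy(nn.Module):
    """Toy model of p(x, a | s): the explanation tokens x_1..x_T and the
    action a are treated as one target sequence generated left to right,
    conditioned on an embedding of the state s."""

    def __init__(self, vocab_size, state_dim, hidden=128):
        super().__init__()
        self.embed_state = nn.Linear(state_dim, hidden)
        self.embed_token = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def neg_log_likelihood(self, state, targets):
        # state: (B, state_dim); targets: (B, T), starting with a BOS token,
        # followed by explanation tokens and, last, the expert action token.
        h0 = self.embed_state(state).unsqueeze(0)   # condition on s
        inputs = self.embed_token(targets[:, :-1])  # teacher forcing
        out, _ = self.rnn(inputs, h0)
        logits = self.head(out)                     # (B, T-1, vocab)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets[:, 1:].reshape(-1),
        )
```

Minimizing this loss maximizes $\log p(x, a|s)$ under the autoregressive factorization $p(x, a|s) = p(a|x, s) \prod_t p(x_t|x_{<t}, s)$.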
The authors apply process imitation to two classic planning methods, Breadth First Search (BFS) and Monte Carlo Tree Search (MCTS).
Breadth First Search
Experimental results confirm the superiority of procedure cloning over classic behavior cloning (BC). In all experiments performed by the authors, the policy trained with procedure cloning generalises reasonably well to new (but related) environments, like mazes with the goal at a different position.
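To give an intuition of what the cloned procedure looks like in the maze setting, here is an illustrative sketch (not the paper's exact encoding, which operates on image-like intermediate computation states): run BFS on a grid, record the expansion order as the procedure data, and backtrack the first move of the shortest path as the expert action that plain BC would be limited to.

```python
from collections import deque

def bfs_with_trace(grid, start, goal):
    """Run BFS on a 4-connected grid (grid[r][c] == 1 marks a wall) and return
    (trace, action): the cell expansion order, standing in for the intermediate
    computation states, and the first move along the shortest path, i.e. the
    expert action that behavior cloning alone would imitate."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    trace = []
    while queue:
        cell = queue.popleft()
        trace.append(cell)
        if cell == goal:
            break
        r, c = cell
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in parent):
                parent[nxt] = cell
                queue.append(nxt)
    if goal not in parent:
        return trace, None                      # goal unreachable
    path, cell = [], goal
    while cell is not None:                     # backtrack goal -> start
        path.append(cell)
        cell = parent[cell]
    path.reverse()
    return trace, path[1] if len(path) > 1 else start
```

Procedure cloning trains the policy to generate (an encoding of) `trace` before emitting the action, whereas behavior cloning maps the state directly to the action.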
Monte Carlo Tree Search
In addition to imitating BFS, the authors run experiments on imitating MCTS without tool use (i.e. without the ability to query the critic). A conceptually important aspect of the training is that the chain-of-thought sequence leading to an action is predicted in reverse order. The authors use the top path up to depth $k=15$ to represent the MCTS procedure. However, they remark in the appendix that including more than one path as part of the explanation is generally beneficial.
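As an illustration of this reverse-order serialization (data structures hypothetical, not the authors' code): follow the most-visited child at each level of a finished search tree down to depth $k$, then emit the path from the deepest node back up, so that the root-level action is generated last.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    action: object = None                     # action leading into this node
    visits: int = 0
    children: list = field(default_factory=list)

def top_path_reversed(root, k=15):
    """Serialize the most-visited path of an MCTS tree leaf-to-root, so the
    policy predicts the deepest decision first and the root action last."""
    path, node, depth = [], root, 0
    while node.children and depth < k:
        node = max(node.children, key=lambda c: c.visits)
        path.append(node.action)
        depth += 1
    explanation = list(reversed(path[1:]))    # a_k, ..., a_2
    action = path[0] if path else None        # root-level action, emitted last
    return explanation, action
```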
This work shows that LLMs are sufficiently expressive to mimic the planning of classical algorithms like BFS or MCTS. Just like standard imitation learning, process imitation requires the existence of an expert. However, this expert must additionally be able to reveal its reasoning steps along with the performed action. While this might be possible in a variety of situations, the requirement is overall rather restrictive, which may limit the method's applicability in real-world settings.
Note that although the action is produced only after several "reasoning" steps of an autoregressive policy, there is no guarantee that the action is actually inferred from the reasoning. Thus, it is possible for the answer to be correct and the explanation to be nonsensical, or vice versa. Inspecting the attention maps when generating the final answer, or regularizing the attention weights, would be interesting approaches for making sure that reasoning steps and answers align.
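As a rough illustration of the attention-inspection idea (our own sketch on a small Hugging Face model, not something either paper does), one could measure how much attention mass the position that emits the final answer places on the reasoning span:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "State: 2 + 2. Reasoning: two plus two equals four. Answer:"
enc = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**enc, output_attentions=True)

# Attention of the last position (the one about to emit the answer),
# averaged over layers and heads: shape (seq_len,).
att = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]

# Hypothetical token span covering the "Reasoning:" part of the prompt.
reasoning_span = slice(5, enc.input_ids.shape[1] - 2)
print("attention mass on reasoning:", att[reasoning_span].sum().item())
```

A low mass would hint that the answer is produced largely independently of the generated reasoning.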
We recommend reading the procedure-cloning project page for further illustrations and details.
Reasoning as Part of the Reward
The key idea of Lightman et al. [Lig23L] is to separate process and outcome supervision. Outcome supervision provides feedback based only on the final result, whereas process supervision does so for each intermediate reasoning step. The authors use a pretrained language model to generate responses to queries. A separate reward model is trained to discriminate between desirable and undesirable outputs. Such a reward model could then be used for fine-tuning LMs or for improving an LM's predictions via rejection sampling. The result of the fine-tuning stage crucially depends on the reliability of the reward model. The paper focuses on training a process-supervised reward model (PRM) and comparing it with an outcome-supervised one. A logical next step would be fine-tuning the generator LM (like GPT-4) with reinforcement learning using the PRM; this has not been done yet and is pointed out as future work.
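A minimal sketch of how such a reward model can already improve predictions without any fine-tuning, assuming a generator function and a `prm_step_probs` function returning the PRM's per-step correctness probabilities (both names hypothetical): sample several solutions, score each one, and keep the best.

```python
import math

def solution_score(step_probs):
    # Aggregate per-step correctness probabilities into a solution-level
    # score; the product corresponds to "every step is judged correct".
    return math.prod(step_probs)

def best_of_n(problem, generate, prm_step_probs, n=16):
    """Rejection-sampling / best-of-n selection: draw n candidate solutions
    from the generator and return the one the reward model scores highest."""
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda c: solution_score(prm_step_probs(c)))
```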
Experiments are conducted on the MATH dataset. The nature of the data makes it possible to provide outcome supervision automatically. However, process supervision (i.e. labels for intermediate reasoning steps of the math problems) can't be obtained automatically, so human labelers are employed to judge the correctness of reasoning steps generated with GPT-4. This makes the paper a rather expensive endeavor, of a kind that is often not feasible in academic research or in industry projects. Fortunately, the collected dataset of 800,000 step-level labels (called PRM800K) was made publicly accessible.
Here is an example of the trained reward model in action.
Problem: A class of 30 students recently took a test. If 20 students scored 80, 8 students scored 90, and 2 students scored 100, then what was the class average (the mean) on this test?
Output of GPT-4:
- To find the class average, I need to add up all the scores and divide by the number of students.
- I can use a shortcut to add up the scores, since they are grouped by value.
- For example, instead of adding 80 + 80 + … + 80 twenty times, I can just multiply 80 by 20 to get the same result.
- Similarly, I can multiply 90 by 8 and 100 by 2 to get the sums of those groups
- So, the total sum of the scores is 80 * 20 + 90 * 8 + 100 * 2
- I can use the distributive property of multiplication to simplify this expression.
- It is equivalent to 80 * (20 + 1) + 90 * 8 + 100
- That is, 80 * 21 + 720 + 100
- I can do the arithmetic in my head or use a calculator
- 80 * 21 is 1680, so the total sum is 1680 + 720 + 100, which is 2500
- Now, to find the average, I need to divide 2500 by 30, the number of students.
- I can use long division or a calculator for this step
- 2500 divided by 30 is 83.33, repeating
- This means that the class average is 83.33, or 83 and one-third
- Answer: 83.33
In the original visualization, a green background indicates a high PRM score and a red background a low one. Note that in step 7, GPT-4 makes an incorrect attempt to simplify an expression, which propagates to a wrong final answer (the correct class average is 84, not 83.33). The reward model catches this mistake.
More examples and the entire dataset can be found on the project page. The experiments demonstrate a clear advantage of process supervision over relying solely on outcome supervision when training the reward model; see Figure 3.
Summary
Empirical results of both papers indicate that process supervision provides a better training signal than outcome supervision. This is probably not too surprising, as including additional relevant data in the modeling of a prediction is usually beneficial. While significant improvements are reported in both papers, I wonder whether they might be an artifact of the selected evaluation scenarios. In recent years, human education and organization have tended towards an empowerment approach, where the goal is presented but the agent is free to choose how to achieve it. Process imitation or supervision, on the other hand, can be seen as the old-school way of micromanagement. It would be an interesting avenue for future work on benchmark design to explore which task properties make outcome or process supervision (for both humans and artificial agents) more suitable.
It is also worth noting that both approaches are rather difficult to apply in practice. Process imitation was used in the presence of a cheap (automated) expert planner, in which case an imitation-learned policy is probably not terribly useful (one could just use the planner). Process supervision was applied without an automated planner, instead requiring human supervision. While this makes it more relevant for practical purposes, it is also much more expensive and laborious. Moreover, the practical application of the resulting reward model to improve the generator LM has not yet been demonstrated.