Reasoning Traces as Learning Signal

An important feature of large language models is their ability to provide detailed responses that resemble “thinking step by step.” A recent line of work goes beyond in-context learning by using intermediate reasoning as a learning signal. This pill presents two works from that area: Chain of Thought Imitation with Procedure Cloning for imitation learning, and Let’s Verify Step by Step, which explicitly rewards sensible reasoning steps in a reinforcement learning setting.


One of many stunning features of large language models (LLMs) is their ability to produce outputs that resemble step-by-step reasoning [Wei22C]. Reasoning traces not only empirically improve performance, they also help with explaining the model output. While most of the widely known approaches elicit reasoning through zero- or few-shot learning, a recent line of work investigates how reasoning traces can be used as an additional learning signal. This paper pill presents two instantiations of this idea.

Yang et al. [Yan22C] perform procedure cloning (PC), a special version of imitation learning where a model is trained to imitate both the output and the underlying reasoning. Complementary to imitation learning, Lightman et al. [Lig23L] propose rewarding the sensibility of individual reasoning steps in a reinforcement learning (RL) setting, which they call process supervision. An important part of their work is the comparison of process supervision to pure outcome supervision (rewards based only on the output’s correctness).

Process Imitation

Classic imitation learning (like behavior cloning) is concerned with learning a mapping from states to actions that mimics an expert. In many tasks such as navigation, manipulation, or games, the expert relies on a multistep decision-making procedure like search or planning. While the conventional imitation learning formalism discards the planning information and relies solely on the output, Yang et al. [Yan22C] propose to explicitly learn the how behind an optimal action, which they call procedure cloning. The intuition is that learning why an action was taken aids decision-making and may improve generalisation.

The training objective is to maximize the joint likelihood of actions and explanations $p(a,x|s)$, where $a$ is an expert action, $x$ an explanation, and $s$ an input state. A policy mapping states to $(x, a)$ can be either autoregressive or follow a conditional independence assumption.
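To make the two factorizations concrete, here is a minimal sketch of the corresponding log-likelihoods over toy per-step probabilities (the function names and interface are ours, not the paper's implementation):

```python
import math

def autoregressive_ll(p_x0, step_probs, p_action):
    """log p(a, x | s) under the autoregressive factorization
    p(x_0 | s) * prod_l p(x_l | x_{l-1}) * p(a | x_L)."""
    return math.log(p_x0) + sum(math.log(p) for p in step_probs) + math.log(p_action)

def independent_ll(p_explanation, p_action):
    """log p(a, x | s) under the conditional independence assumption
    p(x | s) * p(a | s): explanation and action are predicted separately."""
    return math.log(p_explanation) + math.log(p_action)
```

In the autoregressive case the action depends on the explanation through the chain of conditionals, whereas under conditional independence the explanation acts only as an auxiliary training target.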

Figure 1 from [Yan22C]. Visualization of the dataset collection, training, and inference of behavior cloning (BC) and procedure cloning (PC) on a maze navigation task. During dataset collection, the expert uses a search procedure to determine the optimal action to generate a path to the goal location (red star). During training, BC discards these intermediate search outputs and learns to directly map states to actions. In contrast, PC learns the complete sequence of intermediate computations (i.e., branches and backtracks) associated with the search procedure. During inference, PC generates a sequence of intermediate search outcomes emulating the search procedure on a new test map before outputting the final action.

The authors apply process imitation to two classic planning methods, Breadth First Search (BFS) and Monte Carlo Tree Search (MCTS).

Figure 3 from [Yan22C]. In a discrete maze, the expert employs BFS by first expanding a search perimeter until it encounters the goal cell, at which point it backtracks to find the optimal action at the starting state (cells in light blue are visited and in dark blue are backtracked). We encode this algorithm as a sequence of procedure observations $(x_0, …, x_6)$ of the intermediate computation states, with each $x_i$ represented by a 2D array and each cell of the array containing BFS-relevant information (i.e., whether this cell is being expanded or backtracked and the action recorded when expanding to this cell). Procedure cloning is trained to predict the entire sequence of computations from input state to output action using a sequential model $p(a|x_L) \cdot \prod_{l=1}^L p(x_l|x_{l−1}) \cdot p(x_0|s)$.
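The collection of such procedure observations can be illustrated with a toy BFS that snapshots its visited set after every expansion. The paper encodes richer per-cell information (expansion and backtracking states as 2D arrays), so this is only a simplified sketch:

```python
from collections import deque

def bfs_with_observations(grid, start, goal):
    """Run BFS on a grid of 0 (free) / 1 (wall) cells, recording a snapshot
    of the visited set after each expansion -- a toy analogue of the
    procedure observations (x_0, ..., x_L) in [Yan22C]."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    observations = []  # one intermediate computation state per expansion
    while queue:
        r, c = queue.popleft()
        observations.append(set(parent))
        if (r, c) == goal:
            break
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 \
                    and (nr, nc) not in parent:
                parent[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    # backtrack from the goal to recover the optimal path and first action
    path = [goal]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    path.reverse()
    first_action = (path[1][0] - path[0][0], path[1][1] - path[0][1])
    return observations, path, first_action
```

A behavior-cloning dataset would keep only `(start, first_action)`; procedure cloning additionally trains on the `observations` sequence.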

Experimental results confirm the superiority of procedure cloning over classic behavior cloning (BC).

Figure 4 from [Yan22C]. Left: Visualization of the discrete maze (4 discrete actions) and AntMaze (8 continuous actions). Right: Average success rate of PC and BC agents navigating to the goal from random start locations over 10 test mazes. Agents are trained on 5, 10, 20, 40 mazes of 1 and 5 expert trajectories on discrete maze and AntMaze, respectively. We find that procedure cloning leads to much better test maze generalization compared to alternative approaches.

In all experiments performed by the authors, the policy trained with procedure cloning generalises reasonably well to new (but related) environments, like mazes with the goal at a different position.

In addition to imitating BFS, the authors run experiments on imitating MCTS without tool use (i.e., without the ability to query the critic). A conceptually important aspect of the training is that the chain-of-thought sequence leading to an action is predicted in reverse order. The authors use the top path up to depth $k=15$ to represent the MCTS procedure, but remark in the appendix that including more than one path in the explanation is generally beneficial.

Figure 6 from [Yan22C]. In the MinAtar game-playing environment, the expert uses MCTS $(\Pi_0 , …, \Pi_L )$ to find an optimal future trajectory $[L, R, \text{Goal}]$. We treat this future trajectory in reverse order $[\text{Goal}, R, L]$ as procedure observations, so that procedure cloning is trained to first predict the goal image (MCTS leaf node) and then predict the optimal action sequence backwards from the goal using a GPT-like autoregressive model, ultimately predicting the expert’s output action as its last prediction.
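The reverse-order construction of the training target can be sketched as follows (a hypothetical helper for illustration, not the authors' code):

```python
def make_procedure_target(future_trajectory):
    """Turn an MCTS future trajectory such as ["L", "R", "Goal"] into the
    reversed prediction target ["Goal", "R", "L"]: the goal (MCTS leaf node)
    is predicted first, the actions are predicted backwards from it, and the
    final element is the expert's output action for the current state."""
    return list(reversed(future_trajectory))
```

An autoregressive model trained on such sequences thus emits the action it should actually take only as its very last prediction.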

This work shows that autoregressive sequence models are sufficiently expressive to mimic the planning of classical algorithms like BFS or MCTS. Just like standard imitation learning, process imitation requires the existence of an expert. However, this expert must additionally be able to reveal its reasoning steps, not just the performed action. While this might be possible in a variety of situations, the requirement is overall rather restrictive, which might hinder the method’s applicability in real-world applications.

Note that although the action is obtained after several “reasoning” steps of an autoregressive policy, there is no guarantee that it is actually inferred from the reasoning. Thus, it is possible for the answer to be correct and the explanation nonsensical, or vice versa. Inspecting the attention maps when generating the final answer, or regularizing the attention weights, would be an interesting approach for ensuring that reasoning steps and answers align.

We recommend reading the procedure-cloning project page for further illustrations and details.

Reasoning as Part of the Reward

The key idea of Lightman et al. [Lig23L] is to separate process from outcome supervision. Outcome supervision provides feedback based only on the final result, whereas process supervision does so for each intermediate reasoning step. The authors use a pretrained language model to generate responses to queries. A separate reward model is trained to discriminate between desirable and undesirable outputs. Such a reward model could then be used for fine-tuning LMs or for improving an LM’s predictions via rejection sampling. The result of the fine-tuning stage crucially depends on the reliability of the reward model. The paper focuses on training a process-supervised reward model (PRM) and comparing it with an outcome-supervised one. A logical next step would be fine-tuning the generator LM (like GPT-4) with reinforcement learning using the PRM; this has not been done yet and is pointed out as future work.
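How such a PRM could be used for rejection sampling can be sketched as follows, assuming (as in the paper) that a solution is scored by the probability that every step is correct, i.e., the product of the per-step probabilities. The function names are ours:

```python
import math

def solution_score(step_probs):
    """Score a full solution as the product of per-step correctness
    probabilities predicted by the PRM (summed in log space for stability)."""
    return math.exp(sum(math.log(p) for p in step_probs))

def best_of_n(candidates):
    """Rejection sampling / best-of-N: among several sampled solutions,
    return the one with the highest PRM score.
    `candidates` maps solution text -> list of per-step probabilities."""
    return max(candidates, key=lambda s: solution_score(candidates[s]))
```

Note that under this scoring rule a single low-confidence step heavily penalizes an otherwise plausible solution, which is exactly the behavior process supervision aims for.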

Experiments are conducted on the MATH dataset. The nature of the data makes it possible to provide outcome supervision automatically. However, process supervision (i.e., labels for the intermediate reasoning steps of the math problems) cannot be obtained automatically, so human labelers are employed to judge the correctness of reasoning steps generated with GPT-4. This makes the paper a rather expensive endeavor, of the type that is often not feasible in academic research or industry projects. Fortunately, the collected dataset of 800,000 step-level labels (called PRM800K) was made publicly accessible.

Here is an example of the trained reward model in action.

Problem: A class of 30 students recently took a test. If 20 students scored 80, 8 students scored 90, and 2 students scored 100, then what was the class average (the mean) on this test?

Output of GPT4:

  1. To find the class average, I need to add up all the scores and divide by the number of students.
  2. I can use a shortcut to add up the scores, since they are grouped by value.
  3. For example, instead of adding 80 + 80 + … + 80 twenty times, I can just multiply 80 by 20 to get the same result.
  4. Similarly, I can multiply 90 by 8 and 100 by 2 to get the sums of those groups
  5. So, the total sum of the scores is 80 * 20 + 90 * 8 + 100 * 2
  6. I can use the distributive property of multiplication to simplify this expression.
  7. It is equivalent to 80 * (20 + 1) + 90 * 8 + 100
  8. That is, 80 * 21 + 720 + 100
  9. I can do the arithmetic in my head or use a calculator
  10. 80 * 21 is 1680, so the total sum is 1680 + 720 + 100, which is 2500
  11. Now, to find the average, I need to divide 2500 by 30, the number of students.
  12. I can use long division or a calculator for this step
  13. 2500 divided by 30 is 83.33, repeating
  14. This means that the class average is 83.33, or 83 and one-third
  15. Answer: 83.33

Figure 3 from [Lig23L]. A comparison of outcome-supervised and process-supervised reward models, evaluated by their ability to search over many test solutions. Majority voting is shown as a strong baseline. For N ≤ 1000, we visualize the variance across many sub-samples of the 1860 solutions we generated in total per problem.

A green background indicates a high PRM score, and a red background indicates a low score. Note that in step 7, GPT-4 makes an incorrect attempt to simplify an expression. The reward model catches this mistake.
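The error flagged in step 7 is easy to verify with a few lines of arithmetic: the correct grouped sum from steps 4–5 is 2520 (mean 84), while the “simplification” yields 2500:

```python
# Correct grouped sum from steps 4-5 of the solution.
total = 80 * 20 + 90 * 8 + 100 * 2    # 1600 + 720 + 200 = 2520
mean = total / 30                     # 84.0

# GPT-4's incorrect "simplification" from step 7: 80 * (20 + 1) + 90 * 8 + 100.
wrong_total = 80 * 21 + 90 * 8 + 100  # 2500
wrong_mean = wrong_total / 30         # 83.33...
```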

More examples and the entire dataset can be found on the project page. The experiments demonstrate a clear advantage of process supervision over pure outcome supervision when training the reward model (see Figure 3).


Empirical results of both papers indicate that process supervision provides a better training signal than outcome supervision. This is probably not too surprising, as including additional relevant data in the modeling of a prediction is usually beneficial. While significant improvements are reported in both papers, I wonder whether they might be an artifact of the selected evaluation scenarios. In recent years, human education and organization have tended towards an empowerment approach, where the goal is presented but the agent is free to choose how to achieve it. Process imitation or supervision, on the other hand, can be seen as the old-school way of micromanagement. It would be an interesting avenue for future work on benchmark design to explore which properties of a task make outcome or process supervision (for both humans and artificial agents) more effective.

It is also worth noting that both approaches are rather difficult to apply in practice. Process imitation was used in the presence of a cheap (automated) expert planner, in which case an imitation-learned policy is probably not terribly useful (one could just use the planner). Process supervision was applied without an automated planner, instead requiring human supervision. While this makes it more relevant for practical purposes, it is also much more expensive and laborious. Moreover, the practical application of the resulting reward model to improving the generator LM has not been demonstrated yet.
