Augmented Language Models: a survey | TransferLab

Prelude

What is the next big step in the evolution of large language models (LLMs)? Pre-training on internet-scale corpora, aligning generated text with human preferences, and mixing in multi-modality (although this last step is still in an early stage) have produced generative models of previously unimagined capabilities. However, current LLMs have significant limitations. They struggle with large or complex tasks, producing factually correct statements, planning, using external tools, and reaching goals. For example, when tasked with the destruction of humanity, ChatGPT disappointingly only performed a few Google searches, despite being given external memory, full internet access, and a compute environment (watch here).

I believe that a unification of LLMs with planning and tool use beyond in-context learning will address many of the shortcomings of current models, and make LLMs even more widespread in real-world applications. I am certainly not alone in this view: DeepMind, for example, is reportedly working on combining LLMs with a planning system.

This post summarizes a survey conducted by Meta AI, presenting the efforts of the AI community to augment language models with external systems as of February 2023. While the survey becomes outdated quickly, it provides a valuable entry point to the field and a partial overview of the current state of practical LLM use.

The Survey

In [Mia23A] the authors focus on language models augmented with reasoning and/or tool use, further distinguishing approaches by whether the augmentation was implemented through heuristics or learned.

It is unclear whether LMs “really” reason, like a planning engine, or whether the various context-extension techniques lead to the production of more tokens that nudge the probability distribution of the final answer in the right direction. A pragmatic definition of reasoning for LMs adopted in the survey is “giving more computation steps to the model before yielding the answer to a prompt”. These techniques, while useful, generally lead to more expensive computations and potential issues due to “context-overflow”.

Figure 3 from [Mia23A], recursive prompting example. <LM> denotes the start of the LM’s output to the prompt, while </LM> denotes the end. The problem is first decomposed into subproblems in Prompt 0. Then, Answer 2 to Subquestion 2 and Answer 1 to Subquestion 1 are sequentially fed to Prompt 2 and Prompt 1. The few-shot examples for each stage’s prompt are omitted.

Many references for reasoning-augmented LMs are based on simple-to-understand ideas like prompt engineering, recursive calls to the LM (as in Figure 3), standard LM training on datasets that include intermediate reasoning steps (often generated automatically, either from the answer or through bootstrapping additional data), or a mixture thereof. The survey does a good job of summarizing each paper in a few sentences, thereby providing a quick overview of such approaches. No single winning paradigm for enabling LMs to reason had been found at the time of writing of either the survey or this post, so this overview should be useful to both researchers and practitioners.
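The recursive prompting scheme of Figure 3 can be sketched in a few lines of Python. This is a minimal illustration, not code from the survey or from any of the cited papers: `lm` is a hypothetical prompt-to-completion callable, the prompt wording is invented, and `toy_lm` is a canned stand-in so that the example runs without a real model.

```python
from typing import Callable, List, Tuple


def recursive_prompting(question: str, lm: Callable[[str], str]) -> Tuple[str, List[str]]:
    """Answer `question` by decomposing it and solving the subquestions in order.

    `lm` is a hypothetical text-completion callable (prompt -> completion).
    Returns the final answer and the list of prompts that were sent.
    """
    prompts: List[str] = []

    def ask(prompt: str) -> str:
        prompts.append(prompt)
        return lm(prompt).strip()

    # Prompt 0: ask the LM to split the problem into subquestions, one per line.
    decomposition = ask(f"Decompose into subquestions, one per line.\nQ: {question}")
    subquestions = [line.strip() for line in decomposition.splitlines() if line.strip()]

    # Later prompts: each subquestion sees all previous question/answer pairs.
    context = ""
    for sub in subquestions:
        answer = ask(f"{context}Q: {sub}\nA:")
        context += f"Q: {sub}\nA: {answer}\n"

    # Final prompt: the original question, with every intermediate answer in context.
    final = ask(f"{context}Q: {question}\nA:")
    return final, prompts


def toy_lm(prompt: str) -> str:
    """Stand-in for a real model: canned replies keyed on the last question in the prompt."""
    if prompt.startswith("Decompose"):
        return "How many apples does Anna have?\nHow many apples does Elsa have?"
    last_question = [line for line in prompt.splitlines() if line.startswith("Q: ")][-1]
    canned = {
        "Q: How many apples does Anna have?": "7",
        "Q: How many apples does Elsa have?": "5",
    }
    return canned.get(last_question, "12")


answer, prompts = recursive_prompting(
    "How many apples do Anna and Elsa have together?", toy_lm
)
```

Each call adds to the context of the next one, which is also where the extra cost and the risk of context overflow come from: the final prompt contains every earlier subquestion and answer.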

Similarly, the remaining sections do a good job of summarizing many ideas for augmenting LMs: retrieving text from external knowledge bases or from their own memory, letting LMs call tools like calculators, code executors, or translators, and similar tool-use strategies. Multi-modal “token-models” that can make use of external tools related to e.g. vision are mentioned, especially for their potential to enable LM-based agents that can perceive and act on the physical world.
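To make the idea of an LM calling a calculator concrete, here is a minimal sketch of the heuristic variant: the model's output contains inline call markers, and a wrapper replaces each marker with the tool's result. The marker syntax `[Calculator(...)]`, the `TOOLS` registry and the function names are assumptions made for this illustration and are not taken from the survey.

```python
import ast
import operator
import re

# Supported binary operators for the toy calculator.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str):
    """Evaluate +, -, *, / arithmetic on numbers without calling eval()."""

    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    return ev(ast.parse(expression.strip(), mode="eval").body)


# Hypothetical tool registry: tool name -> callable taking the raw argument string.
TOOLS = {"Calculator": calculator}

# Assumed marker syntax: [ToolName(argument)]
CALL = re.compile(r"\[(\w+)\((.*?)\)\]")


def run_tool_calls(text: str) -> str:
    """Replace every known tool-call marker in `text` with the tool's output."""

    def substitute(match: re.Match) -> str:
        name, argument = match.group(1), match.group(2)
        tool = TOOLS.get(name)
        if tool is None:
            return match.group(0)  # unknown tool: leave the marker untouched
        return str(tool(argument))

    return CALL.sub(substitute, text)


result = run_tool_calls("The total is [Calculator(3 * 14)] apples.")
```

In the learned variants discussed in the survey, the model itself is trained to decide when to emit such a call; the wrapper above only covers the execution side.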

There is more variety in the surveyed methods for tool use, since the references rely on quite different training strategies, including online and offline reinforcement learning. The reader will often need to read through a reference in order to grasp its main ideas. Despite some impressive results, it also becomes clear here that a general-purpose strategy for LM augmentation with tool use, especially one that incorporates reasoning, has yet to be found.

Command	Effect
Search <query>	Send <query> to the Bing API and display a search results page
Clicked on link <link ID>	Follow the link with the given ID to a new page
Find in page: <text>	Find the next occurrence of <text> and scroll to it
Quote: <text>	If <text> is found in the current page, add it as a reference
Scrolled down <1, 2, 3>	Scroll down a number of times
Scrolled up <1, 2, 3>	Scroll up a number of times
Top	Scroll to the top of the page
Back	Go to the previous page
End: Answer	End browsing and move to answering phase
End: <nonsense, controversial>	End browsing and skip answering phase

Table 3 from [Mia23A], the actions WebGPT [Nak22W] can perform
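Since the model emits these commands as plain text, the environment has to map each line to a structured action before executing it. The following is a minimal sketch of such a mapping, assuming the command strings are exactly those in the table; the `Action` type, the action names and `parse_command` are hypothetical and do not reflect WebGPT's actual implementation.

```python
import re
from typing import NamedTuple, Optional


class Action(NamedTuple):
    name: str
    argument: Optional[str] = None


# One pattern per row of the table; the action names are made up for this sketch.
_PATTERNS = [
    ("search", re.compile(r"Search (.+)")),
    ("click", re.compile(r"Clicked on link (.+)")),
    ("find", re.compile(r"Find in page: (.+)")),
    ("quote", re.compile(r"Quote: (.+)")),
    ("scroll_down", re.compile(r"Scrolled down ([123])")),
    ("scroll_up", re.compile(r"Scrolled up ([123])")),
    ("top", re.compile(r"Top")),
    ("back", re.compile(r"Back")),
    ("end", re.compile(r"End: (.+)")),
]


def parse_command(line: str) -> Optional[Action]:
    """Turn one line of model output into an Action, or None if it is not a command."""
    line = line.strip()
    for name, pattern in _PATTERNS:
        match = pattern.fullmatch(line)
        if match:
            # Commands without a capture group (Top, Back) carry no argument.
            return Action(name, match.group(1) if match.groups() else None)
    return None


parsed = parse_command("Search augmented language models")
```

Returning `None` for unrecognized lines leaves it to the caller to decide how to handle malformed output, for example by re-prompting the model.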

Having analyzed the augmentation strategies, the authors go on to group the surveyed papers by training paradigm. This section is particularly useful for finding references similar to an approach that one has already decided to take.

The final section contains several open questions regarding augmented LMs and a detailed discussion of the extent to which they constitute a step towards autonomous machine intelligence in the sense of LeCun’s position paper [Lec22A].

Notable Omissions

The survey does not include discussions of useful software for augmented LMs like langchain or of popular implementations of augmented agents like AutoGPT.

Several important works in the field of augmented LMs have appeared since the survey was published, and are therefore not mentioned in it. Among them are ideas related to enhanced reasoning by training on expert plans like [Lig23L] (note also the somewhat similar [Yan22C]), prompting techniques like [Yao23T], and multiple works on autonomous agents. The latter are nicely summarized in Lilian Weng’s blog post, which is a good complement to Meta AI’s survey for getting an overview of the current state of the field. Hopefully, a regularly updated document on augmented LMs will be available in the near future.

References

[Mia23A] Augmented Language Models: a Survey
This survey reviews works in which language models (LMs) are augmented with reasoning skills and the ability to use tools. The former is defined as decomposing a potentially complex task into simpler subtasks while the latter consists in calling external modules such as a code interpreter. LMs can leverage these augmentations separately or in combination via heuristics, or learn to do so from …

[Lec22A] A Path Towards Autonomous Machine Intelligence
How could machines learn as efficiently as humans and animals? How could machines learn to reason and plan? How could machines learn representations of percepts and action plans at multiple levels of abstraction, enabling them to reason, predict, and plan at multiple time horizons? This position paper proposes an architecture and training paradigms with which to construct autonomous intelligent …

[Lig23L] Let's Verify Step by Step
In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning …
