Prelude
What is the next big step in the evolution of large language models (LLMs)? Pre-training on internet-scale corpora, aligning generated text with human preferences, and mixing in multi-modality (although this last step is still at an early stage) have produced generative models of previously unimagined capabilities. However, current LLMs have significant limitations: they struggle with large or complex tasks, with producing factually correct statements, and with planning, using external tools, and reaching goals. For example, when tasked with the destruction of humanity, ChatGPT disappointingly only performed a few Google searches, despite being given external memory, full internet access, and a compute environment (watch here).
I believe that a unification of LLMs with planning and tool use beyond in-context learning will address many of the shortcomings of current models and make LLMs even more widespread in real-world applications. I am certainly not alone in this view: DeepMind, for example, is reportedly working on combining LLMs with a planning system.
This post summarizes a survey conducted by Meta AI, presenting the efforts of the AI community to augment language models with external systems as of February 2023. While such a survey quickly becomes outdated, it provides a valuable entry point to the field and a partial overview of the current state of practical LLM use.
The Survey
In [Mia23A] the authors focus on language models augmented with reasoning and/or tool use, further distinguishing approaches by whether the augmentation was implemented through heuristics or learned.
It is unclear whether LMs “really” reason, like a planning engine, or whether the various context-extension techniques lead to the production of more tokens that nudge the probability distribution of the final answer in the right direction. A pragmatic definition of reasoning for LMs adopted in the survey is “giving more computation steps to the model before yielding the answer to a prompt”. These techniques, while useful, generally lead to more expensive computations and potential issues due to “context-overflow”.
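To make the pragmatic definition concrete, here is a minimal sketch of "giving the model more computation steps before the answer" via chain-of-thought style prompting. The `complete` function is a hypothetical stand-in for any text-completion API, and the prompts are illustrative rather than taken from the survey.

```python
# Minimal sketch of "more computation steps before the answer" via chain-of-thought
# prompting. `complete` is a hypothetical placeholder for an LM completion endpoint.

def complete(prompt: str) -> str:
    """Placeholder: plug in a call to your favourite LM API here."""
    raise NotImplementedError

QUESTION = (
    "A bat and a ball cost 1.10 in total. The bat costs 1.00 more than the ball. "
    "How much does the ball cost?"
)

# Direct prompting: the model must emit the answer immediately.
direct_prompt = f"Q: {QUESTION}\nA:"

# Chain-of-thought prompting: the model is nudged to spend tokens on intermediate
# steps, which shifts the probability distribution over the final answer.
cot_prompt = f"Q: {QUESTION}\nA: Let's think step by step."

# answer = complete(cot_prompt)  # the final answer is typically extracted from the tail of the output
```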
Many references for reasoning-augmented LMs are based on simple-to-understand ideas like prompt engineering, recursive calls to the LM (as in Figure 3), standard LM training on datasets that include intermediate reasoning steps (often generated automatically, either from the answer or by bootstrapping additional data), or a mixture thereof. The survey does a good job of summarizing each paper in a few sentences, thereby providing a quick overview of such approaches. A single winning paradigm for enabling LMs to reason had not been found at the time of writing (of either the survey or this post), so this overview should be useful to both researchers and practitioners.
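As an illustration of the "recursive calls to the LM" idea, the sketch below decomposes a question into sub-questions, answers each, and then composes a final answer. Again, `complete` is a hypothetical completion function and the prompts are my own illustrative choices, not the wording used in any surveyed paper.

```python
# A minimal sketch of reasoning via recursive LM calls: decompose a task into
# sub-questions, answer each one, then compose a final answer from the results.

def complete(prompt: str) -> str:
    """Placeholder for an LM completion call."""
    raise NotImplementedError

def answer_with_decomposition(question: str) -> str:
    # First call: ask the LM to break the question into simpler sub-questions.
    subquestions = complete(
        f"Break the following question into simpler sub-questions, one per line:\n{question}"
    ).splitlines()

    # One call per sub-question: collect intermediate results.
    facts = [
        f"{sq} -> {complete('Answer briefly: ' + sq)}"
        for sq in subquestions
        if sq.strip()
    ]

    # Final call: answer the original question given the intermediate results.
    return complete(
        "Using these intermediate results:\n"
        + "\n".join(facts)
        + f"\nAnswer the original question: {question}"
    )
```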
Similarly, the remaining sections do a good job of summarizing many ideas for augmenting LMs: retrieval of text from external knowledge bases or from the model's own memory, letting LMs call tools like calculators, code executors, or translators, and similar tool-use strategies. Multi-modal "token models" that can make use of external tools related to, e.g., vision are also mentioned, especially for their potential to enable LM-based agents that can perceive and act on the physical world.
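To make the retrieval idea concrete, here is a toy retrieval-augmented generation loop: fetch relevant text from an external store and prepend it to the prompt. Real systems use vector embeddings and a proper index; keyword overlap is used here only to keep the sketch self-contained, and the documents and function names are my own.

```python
# Toy retrieval-augmented generation: retrieve the most relevant documents from an
# external store and stuff them into the prompt as context.

KNOWLEDGE_BASE = [
    "The survey was published by Meta AI in February 2023.",
    "WebGPT lets a language model browse the web through text commands.",
]

def score(query: str, doc: str) -> int:
    # Crude relevance score: number of shared words (a stand-in for embedding similarity).
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, k: int = 1) -> list[str]:
    return sorted(KNOWLEDGE_BASE, key=lambda d: score(query, d), reverse=True)[:k]

def rag_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nAnswer the question using the context.\nQ: {question}\nA:"

print(rag_prompt("Who published the survey and when?"))
```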
There is more variety in the surveyed methods for tool use, since the references employ quite different training strategies, including online and offline reinforcement learning. The reader will often need to read through a reference in order to grasp its main ideas. Despite some impressive results, here it also becomes clear that a general-purpose strategy for augmenting LMs with tool use, especially one that incorporates reasoning, is yet to be found. As an example of such a tool interface, the table below lists the text-based browsing commands available to the model in WebGPT.
| Command | Effect |
|---|---|
| `Search <query>` | Send `<query>` to the Bing API and display a search results page |
| `Clicked on link <link ID>` | Follow the link with the given ID to a new page |
| `Find in page: <text>` | Find the next occurrence of `<text>` and scroll to it |
| `Quote: <text>` | If `<text>` is found in the current page, add it as a reference |
| `Scrolled down <1, 2, 3>` | Scroll down a number of times |
| `Scrolled up <1, 2, 3>` | Scroll up a number of times |
| `Top` | Scroll to the top of the page |
| `Back` | Go to the previous page |
| `End: Answer` | End browsing and move to answering phase |
| `End: <nonsense, controversial>` | End browsing and skip answering phase |
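A text interface like the one above can be wired to an environment by parsing the LM's output and dispatching it to a handler, as in the minimal sketch below. The regular expressions and handler bodies are illustrative assumptions of mine, not WebGPT's actual implementation.

```python
# Minimal sketch of mapping LM-emitted text commands to environment actions.
import re

COMMANDS = {
    r"^Search (.+)$": lambda q: f"[search results for: {q}]",
    r"^Find in page: (.+)$": lambda t: f"[scrolled to first occurrence of: {t}]",
    r"^Quote: (.+)$": lambda t: f"[added quote: {t}]",
    r"^Back$": lambda: "[went to previous page]",
    r"^End: Answer$": lambda: "[browsing ended, answering phase starts]",
}

def dispatch(model_output: str) -> str:
    """Parse a single command emitted by the model and execute the matching handler."""
    for pattern, handler in COMMANDS.items():
        match = re.match(pattern, model_output.strip())
        if match:
            return handler(*match.groups())
    return "[unrecognized command]"

print(dispatch("Search best survey on augmented language models"))
print(dispatch("End: Answer"))
```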
Having analyzed the augmentation strategies, the authors proceed by grouping the surveyed papers by training paradigm. This section is particularly useful for finding references that are similar to an approach one has already decided to take.
The final section contains several open questions regarding augmented LMs and a detailed discussion of the extent to which they constitute a step towards autonomous machine intelligence in the sense of LeCun's position paper [Lec22A].
Notable Omissions
The survey does not discuss useful software for building augmented LMs, such as langchain, or popular implementations of augmented agents, such as AutoGPT.
Several important works in the field of augmented LMs have appeared since the survey was published and are therefore not mentioned in it. Among them are ideas related to enhanced reasoning by training on expert plans like [Lig23L] (note also the somewhat similar [Yan22C]), prompting techniques like [Yao23T], and multiple works on autonomous agents. The latter are nicely summarized in Lilian Weng's blog post, which is a good complement to Meta AI's survey for getting an overview of the current state of the field. Hopefully, a regularly updated document on augmented LMs will become available in the near future.