Critic Regularized Regression (CRR) [Wan20C] is concerned with offline reinforcement learning (RL), i.e., the task of finding a policy from previously recorded data without any interaction with an environment at training time. Such a formulation of RL is important for many practical applications, since interactions may be costly or simply unavailable, while offline datasets created from expert trajectories can be much easier to come by (for example, data from human car drivers or robot operators).
A standard, albeit often ineffective, approach to offline RL is Behavior Cloning (BC), which simply learns to map states to actions through supervised learning. However, BC does not make use of rewards and thus relies on exceptionally high-quality demonstrations in the data. This quality may be difficult to achieve, especially if one wants to cover a large portion of the environment; for autonomous driving, for example, one would need data from non-expert drivers to cover situations in which the car goes off-road. Augmentation techniques like DAgger (besides no longer being strictly offline) are often ineffective, too costly, or impractical.
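To make the baseline concrete, here is a minimal behavior-cloning sketch in PyTorch. The network sizes, the deterministic policy, and the squared-error loss are illustrative assumptions, not details from the paper; the point is only that rewards in the dataset are never used.

```python
import torch
import torch.nn as nn

# Illustrative dimensions for a continuous-control task (assumed, not from the paper).
state_dim, action_dim = 17, 6

# A deterministic policy network mapping states to actions.
policy = nn.Sequential(
    nn.Linear(state_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, action_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def bc_update(states: torch.Tensor, actions: torch.Tensor) -> float:
    """One supervised step on a batch of (state, action) pairs from the offline dataset B.
    Rewards are ignored entirely -- this is pure imitation of the recorded behavior."""
    loss = nn.functional.mse_loss(policy(states), actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```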
Another line of approaches is based on learning a Q-function by minimizing the Bellman error on offline data using an appropriate off-policy RL algorithm. These usually involve the minimization of a temporal difference (TD) loss akin to
$$ \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{B}}\left[ D\left( Q_\theta(s_t, a_t),\; r_t + \gamma\, \mathbb{E}_{\mathbf{a}}\left[Q_{\theta'}(s_{t+1},\mathbf{a})\right] \right) \right], $$
where $\mathcal{B}$ is the offline dataset and $D$ is some distance metric. The problem is the term $\mathbb{E}_{\mathbf{a}}[Q_{\theta'}(s_{t+1},\mathbf{a})]$: if the policy is purely aimed at maximizing reward (the greedy policy selecting $\operatorname{argmax}_a Q(s, a)$ and parametric policies trained on the standard RL objectives are of this type), this expectation will usually involve actions that are not part of the offline data and therefore cannot be approximated reliably.
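To make the issue concrete, here is a minimal sketch of such a TD loss in PyTorch. The interfaces `critic(state, action)`, `target_critic` and `policy.sample(state)` are assumptions made for illustration, as is the choice of squared error for $D$; the comment marks the bootstrap term in question.

```python
import torch
import torch.nn as nn

def td_loss(critic: nn.Module, target_critic: nn.Module, policy,
            batch: dict, gamma: float = 0.99, n_samples: int = 10) -> torch.Tensor:
    """Distance between Q_theta(s_t, a_t) and the bootstrapped target
    r_t + gamma * E_a[Q_theta'(s_{t+1}, a)], with the expectation estimated
    by sampling actions from the current policy."""
    q = critic(batch["state"], batch["action"])  # Q_theta(s_t, a_t)
    with torch.no_grad():
        next_state = batch["next_state"]
        # The problematic term: a ~ pi(.|s_{t+1}) may lie outside the support of B,
        # so Q_theta'(s_{t+1}, a) is queried where it was never trained.
        next_q = torch.stack([
            target_critic(next_state, policy.sample(next_state))
            for _ in range(n_samples)
        ]).mean(0)
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * next_q
    return nn.functional.mse_loss(q, target)  # D chosen here as squared error
```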
The strategy of CRR can be understood as an interpolation between BC and Q-function based approaches: it does learn a parametric policy and a Q-function à la actor-critic, but the policy objective is not to maximize reward; instead, it matches the state-action distribution in a filtered subset of $\mathcal{B}$, where the filter uses the Q-function (the Q-function itself is still learned by minimizing the TD loss with gradient descent). This means that, for the current $Q_\theta$, the objective is to find
$$ \operatorname{argmax}_{\pi}\; \mathbb{E}_{(s,a) \sim \mathcal{B}}\left[ f(Q_\theta, \pi, s, a) \log \pi (a \mid s) \right], $$
where $f$ is some filtering function monotonically increasing in $Q$ (when $f \equiv 1$, the Q-function is ignored and this objective is exactly equal to behavior cloning).
Actions should be imitated when they lead to improved outcomes (according to $Q$), so the estimated advantage is a natural building block for a good choice of $f$. The authors evaluate $f:=\mathbb{1}[\hat{A}_\theta (s,a) > 0]$ (dubbed $CRR_{binary}$) and $f:=\exp(\hat{A}_\theta (s,a) / \beta)$ (denoted $CRR_{exp}$, the temperature $\beta$ being a hyperparameter) together with different advantage estimators on multiple environments. The resulting algorithms reportedly improve on the state of the art in many cases, often by a large margin; see Tables 2 and 3 of the paper (the "max" in "CRR binary max" refers to a particular estimator of the advantage function). The authors note that $CRR_{binary}$ generally performs better on easier tasks, while $CRR_{exp}$ does well on more complex ones; the paper offers a detailed analysis of this phenomenon.
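Below is a hedged sketch of what the resulting policy update could look like in PyTorch, assuming the same illustrative critic and policy interfaces as above and a simple Monte-Carlo advantage estimate $\hat{A}(s,a) \approx Q(s,a) - \tfrac{1}{m}\sum_j Q(s, a_j)$ with $a_j \sim \pi(\cdot \mid s)$; the paper evaluates several advantage estimators, and this is only one plausible choice. The clipping of the exponential weight is a stabilization measure whose exact value should be treated as a hyperparameter.

```python
import torch

def crr_policy_loss(policy, critic, batch: dict,
                    mode: str = "exp", beta: float = 1.0,
                    n_samples: int = 4, max_weight: float = 20.0) -> torch.Tensor:
    """Filtered behavior cloning: -E_B[ f(Q, pi, s, a) * log pi(a|s) ].
    The advantage is estimated as Q(s, a) minus a Monte-Carlo estimate of V(s),
    obtained by averaging Q over actions sampled from the current policy."""
    s, a = batch["state"], batch["action"]
    with torch.no_grad():
        q_sa = critic(s, a)
        v_s = torch.stack([critic(s, policy.sample(s)) for _ in range(n_samples)]).mean(0)
        advantage = q_sa - v_s
        if mode == "binary":
            # CRR_binary: keep only dataset actions with positive estimated advantage.
            weight = (advantage > 0).float()
        else:
            # CRR_exp: exponentially weighted imitation, clipped for numerical stability.
            weight = torch.clamp(torch.exp(advantage / beta), max=max_weight)
    # Maximize the weighted log-likelihood of dataset actions, i.e. minimize its negative.
    return -(weight * policy.log_prob(s, a)).mean()
```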
Due to its simplicity, both conceptual and in terms of implementation, Critic Regularized Regression quickly established itself as a new state of the art for offline RL and has been used in several subsequent works. Among other open source projects, CRR is included in Facebook's ReAgent library for offline RL as well as in RLlib, and is ready to use.
Crucially, since CRR uses rewards to learn a Q-function, it is only applicable when rewards form part of the offline data, unlike, e.g., behavior cloning.