A Recipe for Unbounded Data Augmentation in Visual Reinforcement Learning

Abstract

$Q$-learning algorithms are appealing for real-world applications due to their data efficiency, but they are prone to overfitting and training instability when trained from visual observations. Prior work, namely SVEA, finds that selective application of data augmentation can improve the visual generalization of RL agents without destabilizing training. We revisit its recipe for data augmentation and find an assumption that limits its effectiveness to augmentations of a photometric nature. Addressing this limitation, we propose a generalized recipe, SADA, that works with a wider variety of augmentations. We benchmark its effectiveness on DMC-GB2 -- our proposed extension of the popular DMControl Generalization Benchmark -- as well as on tasks from Meta-World and the Distracting Control Suite, and find that SADA greatly improves the training stability and generalization of RL agents across a diverse set of augmentations.

DeepMind Control - Generalization Benchmark 2 (DMC-GB2)

Geometric Test Set: Rotate Easy, Rotate Hard, Shift Easy, Shift Hard, Rotate Shift Easy, Rotate Shift Hard

Photometric Test Set: Color Easy, Color Hard, Video Easy, Video Hard, Color Video Easy, Color Video Hard

Method

Naive augmentation, where all inputs are indiscriminately augmented, has been shown to destabilize policies or lead to suboptimal convergence. To stabilize actor-critic learning under strong data augmentation, our method, SADA, selectively applies augmentations to inputs in both the actor and critic updates. Given an input observation $\mathbf{o}_t$, its augmented counterpart $\mathbf{o}_{t}^\text{aug} = \text{aug}(\mathbf{o}_{t})$, a replay buffer $\mathcal{D}$, and an encoder $f_\xi$, the actor and critic objectives for a generic actor-critic method become:

$$ \mathcal{L}_{\pi_\phi}^\textbf{SADA}(\mathcal{D}) = \mathbb{E}_{\mathbf{o}_{t} \sim \mathcal{D}} \left[ -Q_\theta(\mathbf{m}_t, \pi_\phi(\mathbf{p}_t)) \right]~~~~~~~~\textcolor{gray}{\textrm{(actor)}} $$

$$ \mathcal{L}^\textbf{SADA}_{Q_{\theta}}(\mathcal{D}) = \mathbb{E}_{(\mathbf{o}_{t},\mathbf{a}_{t}, r_{t}, \mathbf{o}_{t+1})\sim\mathcal{D}} \left[ \| Q_\theta(\mathbf{p}_{t}, \mathbf{a}_{t}) - \mathbf{y}_{t} \|_{2} \right]~~~\textcolor{gray}{\textrm{(critic)}} $$

where $\mathbf{p}_t = f_\xi(\left[\mathbf{o}_{t}, \mathbf{o}_{t}^\text{aug}\right]_N)$, $\mathbf{m}_t = f_\xi(\left[\mathbf{o}_{t}, \mathbf{o}_{t}\right]_N)$, and $\mathbf{y}_t = \left[ q_t^\text{tgt}, q_t^\text{tgt}\right]_{N}$. We use $[\cdot]_N$ to denote concatenation along the batch dimension of size $N$, where $\mathbf{o}_t, \mathbf{o}_t^\text{aug} \in \mathbb{R}^{N\times C\times H\times W}$. In $\mathbf{y}_t$, $q_t^\text{tgt}$ denotes the target $Q$-value, predicted entirely from unaugmented data using the target $Q$-function: $q_t^\text{tgt} = r(\mathbf{o}_t, \mathbf{a}_t) + \gamma \max_{\mathbf{a}'} Q_{\overline{\theta}}(f_\xi(\mathbf{o}_{t+1}), \mathbf{a}')$.

Most importantly, SADA requires no additional forward passes, losses, or parameters.
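As a rough illustration of how the inputs are routed in the two objectives, the following NumPy sketch computes both losses on toy data. It is a forward-pass sketch only, not the authors' implementation: the linear `encode`, `actor`, and `critic` functions, the `aug` placeholder, and all sizes are hypothetical stand-ins. The sketch also assumes that the max over next actions is approximated by the actor's action, that the target critic shares weights with the critic (in practice it is a separate copy), and that the actions are duplicated to match the batch of size $2N$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W = 4, 3, 8, 8   # toy batch and image sizes (assumed)
FEAT, ACT = 16, 2         # toy feature and action dimensions (assumed)
GAMMA = 0.99              # discount factor (assumed value)

# Hypothetical linear stand-ins for f_xi, pi_phi, and Q_theta.
W_enc = rng.normal(size=(C * H * W, FEAT)) * 0.01
W_pi = rng.normal(size=(FEAT, ACT)) * 0.1
W_q = rng.normal(size=(FEAT + ACT, 1)) * 0.1


def encode(obs):
    return np.tanh(obs.reshape(obs.shape[0], -1) @ W_enc)


def actor(feat):
    return np.tanh(feat @ W_pi)


def critic(feat, act):
    return (np.concatenate([feat, act], axis=1) @ W_q).squeeze(-1)


def aug(obs):
    # Placeholder augmentation (assumption): circular horizontal shift.
    return np.roll(obs, shift=int(rng.integers(1, W)), axis=-1)


def sada_losses(obs, act, rew, next_obs):
    # p_t = f_xi([o_t, o_t^aug]_N): a single encoder pass over 2N images.
    p = encode(np.concatenate([obs, aug(obs)], axis=0))
    # m_t = f_xi([o_t, o_t]_N): equals the unaugmented half of p_t twice,
    # so it is reused here instead of running the encoder again.
    m = np.concatenate([p[:N], p[:N]], axis=0)

    # Target from unaugmented data only; the actor's action replaces the max.
    next_feat = encode(next_obs)
    q_tgt = rew + GAMMA * critic(next_feat, actor(next_feat))
    y = np.concatenate([q_tgt, q_tgt], axis=0)   # y_t = [q_tgt, q_tgt]_N

    act2 = np.concatenate([act, act], axis=0)    # actions duplicated to 2N
    critic_loss = np.linalg.norm(critic(p, act2) - y)   # ||Q(p_t, a_t) - y_t||_2
    actor_loss = np.mean(-critic(m, actor(p)))          # -Q(m_t, pi(p_t))
    return {"actor_loss": actor_loss, "critic_loss": critic_loss,
            "p": p, "m": m, "y": y}


obs = rng.random((N, C, H, W))
next_obs = rng.random((N, C, H, W))
act = rng.uniform(-1.0, 1.0, size=(N, ACT))
rew = rng.random(N)
out = sada_losses(obs, act, rew, next_obs)
print(out["actor_loss"], out["critic_loss"])
```

Note where the augmented observations enter: the actor and the critic's prediction both see $\mathbf{p}_t$, while the critic input in the actor loss ($\mathbf{m}_t$) and the target $\mathbf{y}_t$ are built from unaugmented observations only.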

Method Diagram. Generic Actor-Critic Algorithm Update. Our proposed changes (SADA) when applying strong data augmentation are highlighted in yellow.

Image Augmentations

Train: Unaugmented

Geometric Augmentations: Random Rotate, Random Shift, Random Rotate and Shift

Photometric Augmentations: Random Convolution, Random Overlay, Random Convolution and Overlay
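To make the two augmentation families concrete, here is a minimal NumPy sketch of one geometric pair (shift, rotate) and one photometric transform (overlay) acting on a batch of shape $N\times C\times H\times W$. The function names, the pad size of 4 pixels, the blend factor of 0.5, and the restriction of rotation to multiples of 90 degrees are illustrative assumptions, not settings taken from the text.

```python
import numpy as np


def random_shift(obs, pad=4, rng=None):
    """Pad each image by `pad` pixels (edge replication), then crop back
    to the original size at a random offset."""
    rng = np.random.default_rng() if rng is None else rng
    n, _, h, w = obs.shape
    padded = np.pad(obs, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.empty_like(obs)
    for i in range(n):
        top = int(rng.integers(0, 2 * pad + 1))
        left = int(rng.integers(0, 2 * pad + 1))
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out


def random_rotate90(obs, rng=None):
    """Rotate each square image by a random multiple of 90 degrees
    (a simplification; arbitrary angles would need interpolation)."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.empty_like(obs)
    for i in range(obs.shape[0]):
        k = int(rng.integers(0, 4))
        out[i] = np.rot90(obs[i], k=k, axes=(1, 2))
    return out


def random_overlay(obs, distractors, alpha=0.5):
    """Blend each observation with a distractor image of the same shape."""
    return (1.0 - alpha) * obs + alpha * distractors


rng = np.random.default_rng(0)
obs = rng.random((4, 3, 16, 16))
distractors = rng.random((4, 3, 16, 16))   # stand-in for unrelated images

shifted = random_shift(obs, pad=4, rng=rng)
rotated = random_rotate90(obs, rng=rng)
overlaid = random_overlay(obs, distractors, alpha=0.5)
```

The geometric transforms move pixels to new positions without changing their values, whereas the overlay changes pixel values while leaving the scene layout in place.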

Overall Robustness

DMC-GB2 Overall Robustness. Episode reward on DMC-GB2 when trained under all (geometric and photometric) augmentations, averaged across all 6 DMControl tasks. Mean and 95% CI over 5 seeds. Our method, SADA, displays superior robustness to diverse image transformation types while attaining sample efficiency similar to that of its unaugmented DrQ baseline in the training environment.

Meta-World Visuals

Train: Original

Geometric Test Set: Shift Hard

BibTeX

@article{almuzairee2024recipe,
  title   = {A Recipe for Unbounded Data Augmentation in Visual Reinforcement Learning},
  author  = {Almuzairee, Abdulaziz and Hansen, Nicklas and Christensen, Henrik I},
  journal = {Reinforcement Learning Journal},
  volume  = {1},
  pages   = {130--157},
  year    = {2024},
}