\(Q\)-learning algorithms are appealing for real-world applications due to
their data-efficiency, but they are very prone to overfitting and training
instabilities when trained from visual observations. Prior work, namely SVEA,
finds that selective application of data augmentation can improve the visual
generalization of RL agents without destabilizing training. We revisit its
recipe for data augmentation, and find an assumption that limits its effectiveness
to augmentations of a photometric nature. Addressing these limitations, we propose
a generalized recipe, **SADA**, that works with wider varieties of augmentations. We benchmark
its effectiveness on DMC-GB2 -- our proposed extension of the popular DMControl
Generalization Benchmark -- as well as tasks from Meta-World and the Distracting
Control Suite, and find that our method, SADA, greatly improves training stability and generalization of RL agents across a diverse set of augmentations.

Naive augmentation, where all inputs are indiscriminately augmented, has been shown
to destabilize policies or lead to suboptimal convergence. To stabilize
actor-critic learning under strong data augmentation, our method, **SADA**, selectively
applies augmentations to inputs in both the actor and critic updates.
Given an input observation \(\mathbf{o}_t\), its augmented counterpart \(\text{aug}(\mathbf{o}_{t})=\mathbf{o}_{t}^\text{aug}\), a replay buffer \(\mathcal{D}\), and an encoder \(f_\xi\), the actor and critic objectives
for a generic actor-critic method become:
$$ \mathcal{L}_{\pi_\phi}^\textbf{SADA}(\mathcal{D}) = \mathbb{E}_{\mathbf{o}_{t} \sim \mathcal{D}} \left[ -Q_\theta(\mathbf{m}_t, \pi_\phi(\mathbf{p}_t)) \right]~~~~~~~~\textcolor{gray}{\textrm{(actor)}} $$
$$ \mathcal{L}^\textbf{SADA}_{Q_{\theta}}(\mathcal{D}) = \mathbb{E}_{(\mathbf{o}_{t},\mathbf{a}_{t}, r_{t}, \mathbf{o}_{t+1})\sim\mathcal{D}} \left[ \| Q_\theta(\mathbf{p}_{t}, \mathbf{a}_{t}) - \mathbf{y}_{t} \|^{2}_{2} \right]~~~\textcolor{gray}{\textrm{(critic)}} $$
where \(\mathbf{p}_t = f_\xi(\left[\mathbf{o}_{t}, \mathbf{o}_{t}^\text{aug}\right]_N), ~~ \mathbf{m}_t = f_\xi(\left[\mathbf{o}_{t}, \mathbf{o}_{t}\right]_N)\),
and \(\mathbf{y}_t = \left[ q_t^{tgt},q_t^{tgt}\right]_{N}\).
We use \([\cdot]_N\) to denote concatenation along the batch dimension of size \(N\), where \(\mathbf{o}_t, \mathbf{o}_t^\text{aug} \in \mathbb{R}^{N\times C\times H\times W}\).
In \(\mathbf{y}_t\), we use \(q_t^{tgt}\) to denote the target \(Q\)-value, computed entirely from unaugmented data using the target \(Q\)-function \(Q_{\overline{\theta}}\):
\(q_t^{tgt} = r(\mathbf{o}_t, \mathbf{a}_t) + \gamma \max_{\mathbf{a}'} Q_{\overline{\theta}}(f_\xi(\mathbf{o}_{t+1}),\mathbf{a}')\). Most importantly,
**SADA** requires no additional forward passes, losses, or parameters.
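To make the recipe concrete, below is a minimal PyTorch-style sketch of the two objectives above. The names `encoder`, `actor`, `critic`, `critic_target`, and `aug` are illustrative stand-ins rather than our exact implementation, and for simplicity the bootstrap uses the online actor's action in place of the \(\max_{\mathbf{a}'}\) (as in DDPG-style agents):

```python
import torch
import torch.nn.functional as F

def sada_losses(encoder, actor, critic, critic_target, aug, batch, gamma=0.99):
    # batch of N transitions; obs tensors are (N, C, H, W), reward is (N, 1)
    obs, action, reward, next_obs = batch

    # y_t = [q_tgt, q_tgt]_N: targets are computed entirely from unaugmented data
    with torch.no_grad():
        next_action = actor(encoder(next_obs))  # stand-in for max over a'
        q_tgt = reward + gamma * critic_target(encoder(next_obs), next_action)
        y = torch.cat([q_tgt, q_tgt], dim=0)

    # p_t = f([o_t, aug(o_t)]_N): half unaugmented, half augmented
    p = encoder(torch.cat([obs, aug(obs)], dim=0))
    # m_t = f([o_t, o_t]_N): unaugmented only
    m = encoder(torch.cat([obs, obs], dim=0))

    # critic: regress Q(p_t, a_t) onto the unaugmented targets y_t
    critic_loss = F.mse_loss(critic(p, torch.cat([action, action], dim=0)), y)

    # actor: actions come from (partially augmented) p_t, but are scored
    # by the critic on unaugmented features m_t; encoder gradients are stopped
    actor_loss = -critic(m.detach(), actor(p.detach())).mean()

    return critic_loss, actor_loss
```

In a complete agent, the critic loss would update the encoder and critic while the actor loss updates only the actor, matching the gradient flow implied by the objectives above.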

```
@misc{almuzairee2024recipe,
      title={A Recipe for Unbounded Data Augmentation in Visual Reinforcement Learning},
      author={Abdulaziz Almuzairee and Nicklas Hansen and Henrik I. Christensen},
      year={2024},
      eprint={2405.17416},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```