Prior work in multi-view reinforcement learning has pursued two distinct directions:
one focuses on merging camera views to improve sample efficiency, while the other disentangles camera views
to create policies that are robust to camera view reduction. To unify these complementary approaches in pursuit
of both increased sample efficiency and policy robustness, we propose
MAD:
Merge
And
Disentangle views for
visual reinforcement learning.
1) To
merge views:
Given a multi-view input \(\mathbf{o}^{m}_t = \{\mathbf{o}^1_t, \mathbf{o}^2_t, ..., \mathbf{o}^n_t\}\) consisting of \(n\) uncalibrated camera views at time \(t\),
where each \(\mathbf{o}^i_t\) represents a single camera view, each view is passed separately through a single shared CNN encoder \(f_\xi\).
The output features of a single view are then defined as \(\mathcal{V}^i_t=f_\xi(\mathbf{o}^i_t)\).
After encoding all singular views \(\{\mathcal{V}^1_t, \mathcal{V}^2_t, ..., \mathcal{V}^n_t\}\), the features are merged through
summation, such that the combined multi-view representation becomes \(\mathcal{M}_t = \sum_{i=1}^{n}\mathcal{V}^i_t\).
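As a concrete illustration, below is a minimal PyTorch-style sketch of this merge step; the encoder architecture, feature dimension, and tensor shapes are illustrative assumptions rather than the configuration used in our experiments.

```python
import torch
import torch.nn as nn


class SharedViewEncoder(nn.Module):
    """Shared CNN encoder f_xi applied to each camera view independently.
    The architecture and feature dimension here are illustrative only."""

    def __init__(self, in_channels: int = 3, feature_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.proj = nn.LazyLinear(feature_dim)  # infers input size on first call

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: a single view o^i_t of shape (batch, C, H, W)
        return self.proj(self.conv(obs))


def merge_views(encoder: nn.Module, views: list[torch.Tensor]):
    """Encode every view with the shared encoder and sum the features:
    V^i_t = f_xi(o^i_t),  M_t = sum_i V^i_t."""
    feats = [encoder(v) for v in views]            # singular view features V^i_t
    merged = torch.stack(feats, dim=0).sum(dim=0)  # merged representation M_t
    return merged, feats


# Hypothetical usage with n = 2 uncalibrated 84x84 RGB views and batch size 8.
encoder = SharedViewEncoder()
views = [torch.randn(8, 3, 84, 84) for _ in range(2)]
M_t, V_t = merge_views(encoder, views)
```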
Standard multi-view algorithms would only pass this merged feature representation \(\mathcal{M}_t\) to the downstream actor and critic. This causes the policy to fail
when some of the input camera views are missing. To avoid this, the camera view features need to be properly disentangled during training.
2) To
disentangle views:
The downstream actor and critic need to be trained on
both the merged feature representation \(\mathcal{M}_t\) and all its singular view feature representations \(\mathcal{V}^i_t\).
However, naively training on both the merged representation and all its singular view representations decreases sample efficiency and destabilizes learning, since the downstream actor and critic networks can treat these representations as multiple distinct states.
Therefore, we build upon
SADA, a framework for applying data augmentation to visual RL agents
that stabilizes both the actor and the critic under data augmentation by selectively augmenting their inputs. We modify the RL loss
objectives such that we train on each merged feature representation and apply all of its singular view features as augmentations to it.
Given a critic network \(Q_\theta\) parametrized by \(\theta\), an actor network \(\pi_\phi\) parametrized by \(\phi\), and \(n\) input camera views:
(a) The loss function for a generic critic becomes:
$$
\mathcal{L}^\textcolor{blue}{\textbf{UnAug}}_{Q_{\theta}}(\mathcal{D}) = \mathbb{E}_{(\mathbf{o}^m_{t},\mathbf{a}_{t}, r_{t}, \mathbf{o}^m_{t+1})\sim\mathcal{D}} \Big[ \big(
Q_\theta(\textcolor{blue}{\mathcal{M}_t}, \mathbf{a}_{t}) \;- r_t - \gamma Q_{\overline{\theta}}(\textcolor{blue}{\mathcal{M}_{t+1}}, \mathbf{a}') \big)^2 \Big]
$$
$$
\mathcal{L}^\textcolor{blue}{\textbf{Aug}}_{Q_{\theta}}(\mathcal{D}, i) = \mathbb{E}_{(\mathbf{o}^m_{t},\mathbf{a}_{t}, r_{t}, \mathbf{o}^m_{t+1})\sim\mathcal{D}} \Big[ \big(
Q_\theta(\textcolor{blue}{\mathcal{V}^{i}_t}, \mathbf{a}_{t}) \;- r_t - \gamma Q_{\overline{\theta}}(\textcolor{blue}{\mathcal{M}_{t+1}}, \mathbf{a}') \big)^2 \Big]
$$
$$
\mathcal{L}_{Q_{\theta}}^\textbf{MAD}(\mathcal{D}) = \alpha \cdot \mathcal{L}_{Q_{\theta}}^\textcolor{blue}{\textbf{UnAug}}(\mathcal{D}) \; + \; (1-\alpha) \cdot \frac{1}{n}\sum_{i=1}^n\mathcal{L}_{Q_{\theta}}^\textcolor{blue}{\textbf{Aug}}(\mathcal{D}, i)
$$
where \(r_t\) is the reward at timestep \(t\), \(\mathbf{a}'\) is the next action used in the bootstrapped target (e.g., produced by the policy from \(\mathcal{M}_{t+1}\)), and \(\alpha\) is a hyperparameter that weighs the
unaugmented and augmented streams for more fine-grained control over learning.
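A minimal PyTorch-style sketch of this critic objective is given below; the function signature, argument names, and the assumption that `critic(features, action)` returns Q-values are illustrative, not a prescribed interface.

```python
import torch
import torch.nn.functional as F


def mad_critic_loss(critic, critic_target, merged_t, views_t, merged_tp1,
                    action_t, reward_t, next_action, discount, alpha):
    """Sketch of the MAD critic objective. `critic(features, action)` is assumed
    to return Q-values; `next_action` is the next action a' used in the
    bootstrapped target (e.g., produced by the policy from M_{t+1})."""
    with torch.no_grad():
        # The TD target is always computed from the merged representation M_{t+1}.
        target_q = reward_t + discount * critic_target(merged_tp1, next_action)

    # Unaugmented stream: Q(M_t, a_t) regressed onto the shared target.
    unaug = F.mse_loss(critic(merged_t, action_t), target_q)

    # Augmented streams: each singular view's features V^i_t regressed onto the
    # same target, averaged over the n views.
    aug = torch.stack(
        [F.mse_loss(critic(v, action_t), target_q) for v in views_t]
    ).mean()

    return alpha * unaug + (1.0 - alpha) * aug
```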
(b) The loss function for a generic actor becomes:
$$
\mathcal{L}_{\pi_\phi}^\textcolor{blue}{\textbf{UnAug}}(\mathcal{D}) = \mathbb{E}_{\mathbf{o}^{m}_{t} \sim \mathcal{D}} \left[-Q_\theta(\textcolor{blue}{\mathcal{M}_t}, \pi_\phi(\textcolor{blue}{\mathcal{M}_t})) \right]
$$
$$
\mathcal{L}_{\pi_\phi}^\textcolor{blue}{\textbf{Aug}}(\mathcal{D},i) = \mathbb{E}_{\mathbf{o}^{m}_{t} \sim \mathcal{D}} \left[-Q_\theta(\textcolor{blue}{\mathcal{M}_t}, \pi_\phi(\textcolor{blue}{\mathcal{V}^{i}_t})) \right]
$$
$$
\mathcal{L}_{\pi_\phi}^\textbf{MAD}(\mathcal{D}) = \alpha \cdot \mathcal{L}_{\pi_\phi}^\textcolor{blue}{\textbf{UnAug}}(\mathcal{D}) \; + \; (1-\alpha) \cdot \frac{1}{n}\sum_{i=1}^n\mathcal{L}_{\pi_\phi}^\textcolor{blue}{\textbf{Aug}}(\mathcal{D},i)
$$
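The corresponding actor objective admits a similarly short sketch, written here for a deterministic actor; for a stochastic policy the sampled action would replace `actor(...)`, and all names are again illustrative assumptions.

```python
import torch


def mad_actor_loss(critic, actor, merged_t, views_t, alpha):
    """Sketch of the MAD actor objective for a deterministic actor.
    The critic is always evaluated on the merged features M_t; only the
    actor's input is swapped between M_t and each singular view V^i_t."""
    # Unaugmented stream: -Q(M_t, pi(M_t)).
    unaug = -critic(merged_t, actor(merged_t)).mean()

    # Augmented streams: -Q(M_t, pi(V^i_t)), averaged over the n views.
    aug = torch.stack(
        [-critic(merged_t, actor(v)).mean() for v in views_t]
    ).mean()

    return alpha * unaug + (1.0 - alpha) * aug
```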
By predicting the targets in both actor and critic updates
from the merged feature representation \(\mathcal{M}_t\) only, the variance in the targets is reduced, thereby stabilizing the RL learning objective under data augmentation.
Through this formulation, the actor and critic are able to generalize to all singular views \(\mathcal{V}^i_t\) with minimal loss to training sample efficiency.
A diagram of our method is shown below.