
Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation

UC San Diego

Abstract

Vision is widely used in robotic manipulation, especially for visual servoing. Making it robust typically requires multiple cameras to expand the field of view, which is computationally challenging. Merging multiple views with Q-learning enables more effective representations and better sample efficiency, but such a solution can be expensive to deploy. To mitigate this, we introduce a Merge And Disentanglement (MAD) algorithm that efficiently merges views to increase sample efficiency while augmenting with single-view features to allow lightweight deployment and ensure robust policies. We demonstrate the efficiency and robustness of our approach on Meta-World and ManiSkill3.

Setup

Meta-World: First Person, Third Person A, and Third Person B camera views.

ManiSkill3: First Person, Third Person A, and Third Person B camera views.

Method

Prior work in multi-view reinforcement learning has pursued two distinct directions: one focuses on merging camera views to improve sample efficiency, while the other focuses on disentangling camera views to create policies that are robust to camera view reduction. To unify these complementary approaches in pursuit of both increased sample efficiency and policy robustness, we propose MAD: Merge And Disentangle views for visual reinforcement learning.

1) To merge views: Given a multi-view input \(\mathbf{o}^{m}_t = \{\mathbf{o}^1_t, \mathbf{o}^2_t, \ldots, \mathbf{o}^n_t\}\) consisting of \(n\) uncalibrated camera views at time \(t\), where each \(\mathbf{o}^i_t\) is a single camera view, each view is passed separately through a single shared CNN encoder \(f_\xi\). The output features of a single view are then defined as \(\mathcal{V}^i_t=f_\xi(\mathbf{o}^i_t)\). After encoding all singular views \(\{\mathcal{V}^1_t, \mathcal{V}^2_t, \ldots, \mathcal{V}^n_t\}\), the features are merged through summation, so that the combined multi-view representation becomes \(\mathcal{M}_t = \sum_{i=1}^n \mathcal{V}^i_t\). Standard multi-view algorithms pass only this merged feature representation \(\mathcal{M}_t\) to the downstream actor and critic, which causes the policy to fail when some of the input camera views are missing. To avoid this, the camera view features need to be properly disentangled during training.
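To make the merging step concrete, here is a minimal PyTorch sketch. The architecture, layer sizes, and the names `SharedEncoder` and `merge_views` are illustrative assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Hypothetical shared CNN encoder f_xi applied to every camera view."""
    def __init__(self, in_channels=3, feature_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.proj = nn.LazyLinear(feature_dim)

    def forward(self, obs):
        # obs: (B, C, H, W) pixels from one camera view -> V^i_t: (B, feature_dim)
        return self.proj(self.conv(obs))

def merge_views(encoder, views):
    """views: list of n per-view observation batches [o^1_t, ..., o^n_t].
    Returns the per-view features [V^1_t, ..., V^n_t] and their sum M_t."""
    single_feats = [encoder(v) for v in views]             # same shared encoder for every view
    merged = torch.stack(single_feats, dim=0).sum(dim=0)   # M_t = sum_i V^i_t
    return single_feats, merged
```

Because the same encoder weights process every view and summation is order-invariant, the downstream input dimensionality stays fixed regardless of how many cameras are available.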

2) To disentangle views: The downstream actor and critic need to be trained on both the merged feature representation \(\mathcal{M}_t\) and all its singular view feature representations \(\mathcal{V}^i_t\). However, naively training on both decreases sample efficiency and destabilizes learning, since the downstream actor and critic networks can treat the merged and singular representations as distinct states. We therefore build upon SADA, a framework that stabilizes both the actor and the critic under data augmentation by selectively augmenting their inputs. We modify the RL loss objectives so that training uses the merged feature representation, with its singular view features applied to it as augmentations. Given a critic network \(Q_\theta\) parametrized by \(\theta\), an actor network \(\pi_\phi\) parametrized by \(\phi\), and \(n\) input camera views:

(a) The loss function for a generic critic becomes: $$ \mathcal{L}^\textcolor{blue}{\textbf{UnAug}}_{Q_{\theta}}(\mathcal{D}) = \mathbb{E}_{(\mathbf{o}^m_{t},\mathbf{a}_{t}, r_{t}, \mathbf{o}^m_{t+1})\sim\mathcal{D}} \Big[ \big( Q_\theta(\textcolor{blue}{\mathcal{M}_t}, \mathbf{a}_{t}) - r_t - \gamma Q_{\overline{\theta}}(\textcolor{blue}{\mathcal{M}_{t+1}}, \mathbf{a}') \big)^2 \Big] $$ $$ \mathcal{L}^\textcolor{blue}{\textbf{Aug}}_{Q_{\theta}}(\mathcal{D}, i) = \mathbb{E}_{(\mathbf{o}^m_{t},\mathbf{a}_{t}, r_{t}, \mathbf{o}^m_{t+1})\sim\mathcal{D}} \Big[ \big( Q_\theta(\textcolor{blue}{\mathcal{V}^{i}_t}, \mathbf{a}_{t}) - r_t - \gamma Q_{\overline{\theta}}(\textcolor{blue}{\mathcal{M}_{t+1}}, \mathbf{a}') \big)^2 \Big] $$ $$ \mathcal{L}_{Q_{\theta}}^\textbf{MAD}(\mathcal{D}) = \alpha * \mathcal{L}_{Q_{\theta}}^\textcolor{blue}{\textbf{UnAug}}(\mathcal{D}) \; + \; (1-\alpha) * \frac{1}{n}\sum_{i=1}^n\mathcal{L}_{Q_{\theta}}^\textcolor{blue}{\textbf{Aug}}(\mathcal{D}, i) $$ where \(r_t\) is the reward at timestep \(t\), \(\mathbf{a}'\) is the policy's action for the next state computed from \(\mathcal{M}_{t+1}\), and \(\alpha\) is a hyperparameter that weighs the unaugmented and augmented streams for more fine-grained control over learning.
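As a hedged sketch of how this critic objective might look in code, assuming a TD3/DDPG-style base agent with a deterministic target policy and a single critic for brevity, and reusing the hypothetical `merge_views` helper from the sketch above; `critic`, `critic_target`, and `actor_target` are placeholders for the corresponding components of whichever base algorithm is used:

```python
import torch
import torch.nn.functional as F

def mad_critic_loss(encoder, critic, critic_target, actor_target,
                    views_t, action_t, reward_t, views_tp1,
                    gamma=0.99, alpha=0.5):
    single_t, merged_t = merge_views(encoder, views_t)         # V^i_t and M_t
    with torch.no_grad():
        _, merged_tp1 = merge_views(encoder, views_tp1)        # M_{t+1}
        next_action = actor_target(merged_tp1)                 # a'
        # TD target is always computed from the merged features M_{t+1}
        target_q = reward_t + gamma * critic_target(merged_tp1, next_action)

    # Unaugmented stream: critic evaluated on the merged representation M_t
    loss_unaug = F.mse_loss(critic(merged_t, action_t), target_q)

    # Augmented streams: critic evaluated on each single-view feature V^i_t,
    # regressing toward the same merged-feature target
    loss_aug = sum(F.mse_loss(critic(v, action_t), target_q) for v in single_t) / len(single_t)

    return alpha * loss_unaug + (1 - alpha) * loss_aug
```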

(b) The loss function for a generic actor becomes: $$ \mathcal{L}_{\pi_\phi}^\textcolor{blue}{\textbf{UnAug}}(\mathcal{D}) = \mathbb{E}_{\mathbf{o}^{m}_{t} \sim \mathcal{D}} \left[-Q_\theta(\textcolor{blue}{\mathcal{M}_t}, \pi_\phi(\textcolor{blue}{\mathcal{M}_t})) \right] $$ $$ \mathcal{L}_{\pi_\phi}^\textcolor{blue}{\textbf{Aug}}(\mathcal{D},i) = \mathbb{E}_{\mathbf{o}^{m}_{t} \sim \mathcal{D}} \left[-Q_\theta(\textcolor{blue}{\mathcal{M}_t}, \pi_\phi(\textcolor{blue}{\mathcal{V}^{i}_t})) \right] $$ $$ \mathcal{L}_{\pi_\phi}^\textbf{MAD}(\mathcal{D}) = \alpha * \mathcal{L}_{\pi_\phi}^\textcolor{blue}{\textbf{UnAug}}(\mathcal{D}) \; + \; (1-\alpha) * \frac{1}{n}\sum_{i=1}^n\mathcal{L}_{\pi_\phi}^\textcolor{blue}{\textbf{Aug}}(\mathcal{D},i) $$ By predicting the targets in both actor and critic updates from the merged feature representation \(\mathcal{M}_t\) only, the variance of the targets is reduced, which stabilizes the RL learning objective under data augmentation. Through this formulation, the actor and critic generalize to all singular views \(\mathcal{V}^i_t\) with minimal loss in training sample efficiency. A visual diagram of our method is illustrated below.
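A companion sketch for the actor objective under the same assumptions as the critic example above. As in the equations, the critic is always evaluated on the merged features \(\mathcal{M}_t\), while the actor is queried on both merged and single-view features:

```python
import torch

def mad_actor_loss(encoder, critic, actor, views_t, alpha=0.5):
    # Stop gradients into the shared encoder during the actor update; updating
    # the encoder only through the critic is a common convention in visual
    # actor-critic agents and is an assumption here, not a claim about MAD.
    with torch.no_grad():
        single_t, merged_t = merge_views(encoder, views_t)

    # Unaugmented stream: actor acts from the merged features M_t
    loss_unaug = -critic(merged_t, actor(merged_t)).mean()

    # Augmented streams: actor acts from each single-view feature V^i_t,
    # but is always scored by the critic on the merged features M_t
    loss_aug = sum(-critic(merged_t, actor(v)).mean() for v in single_t) / len(single_t)

    return alpha * loss_unaug + (1 - alpha) * loss_aug
```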
Method Diagram. Update diagram of a generic visual actor-critic model with our modifications. Our method, MAD, merges camera views through feature summation and disentangles camera views by selectively augmenting the inputs to the downstream actor and critic with all singular view features. The agent is trained end-to-end with our MAD loss functions. (Left): Single shared CNN encoder. (Middle): Actor update diagram. (Right): Critic update diagram.
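To tie the sketches together, one hypothetical end-to-end update step using the pieces above could look like the following; optimizers, target-network maintenance, and exploration are assumed to follow the underlying base algorithm:

```python
def mad_update_step(batch, encoder, actor, actor_target, critic, critic_target,
                    critic_opt, actor_opt, alpha=0.5):
    views_t, action_t, reward_t, views_tp1 = batch

    # Critic (and shared encoder) update with the MAD critic loss
    critic_loss = mad_critic_loss(encoder, critic, critic_target, actor_target,
                                  views_t, action_t, reward_t, views_tp1, alpha=alpha)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update with the MAD actor loss
    actor_loss = mad_actor_loss(encoder, critic, actor, views_t, alpha=alpha)
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    # Target-network (EMA) updates of actor_target / critic_target omitted for brevity.
```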

Videos

Rollout videos for each method (MAD (Ours), MVD, VIB, Single Camera) evaluated with All Cameras, First Person, Third Person A, and Third Person B views.

Results

Across 20 visual RL tasks, MAD achieves high sample efficiency and remains robust when camera views are removed, outperforming baselines.

MAD View Robustness. Success rate averaged over (Top) all 15 Meta-World and (Bottom) all 5 ManiSkill3 visual RL tasks. Methods are trained on all three camera views and evaluated on all cameras and on each singular camera view, with the final average displayed on the far right. Mean and 95% CI over 5 random seeds.

Extensions

We find that MAD can also be applied across different modalities (RGB, Depth) instead of different camera views. Surprisingly, training MAD on RGB and Depth attains higher sample efficiency on the Depth-only evaluation than the Depth-only baseline.
MAD Modality Transfer

MAD Modality Extension. Success rate averaged over 5 ManiSkill3 tasks. (Left) RGB and Depth images from the PokeCube Third Person A view. (Right) Methods are trained on one camera view (Third Person A) but with different modalities (RGB, Depth), and evaluated accordingly. Mean and 95% CI over 5 random seeds.

Citation

@misc{almuzairee2025merging,
  title={Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation}, 
  author={Abdulaziz Almuzairee and Rohan Patil and Dwait Bhatt and Henrik I. Christensen},
  year={2025},
  eprint={2505.04619},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2505.04619}, 
}