
Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation

UC San Diego

Abstract

Vision is widely used in robotic manipulation, especially for visual servoing. Making it robust typically requires multiple cameras to expand the field of view, which is computationally challenging. Merging multiple views with Q-learning enables more effective representations and better sample efficiency, but such a solution can be expensive to deploy. To mitigate this, we introduce a Merge And Disentanglement (MAD) algorithm that efficiently merges views to increase sample efficiency while augmenting with single-view features to allow lightweight deployment and ensure robust policies. We demonstrate the efficiency and robustness of our approach on Meta-World and ManiSkill3.

Setup

Meta-World: First Person, Third Person A, and Third Person B camera views.

ManiSkill3: First Person, Third Person A, and Third Person B camera views.

Method

Prior work in multi-view reinforcement learning has pursued two distinct directions: one focuses on merging camera views to improve sample efficiency, while the other focuses on disentangling camera views to create policies that are robust to camera view reduction. To unify these complementary approaches in pursuit of both increased sample efficiency and policy robustness, we propose MAD: Merge And Disentangle views for visual reinforcement learning.

1) To merge views: Given a multi-view input \(\mathbf{o}^{m}_t = \{\mathbf{o}^1_t, \mathbf{o}^2_t, \ldots, \mathbf{o}^n_t\}\) consisting of \(n\) uncalibrated camera views at time \(t\), where each \(\mathbf{o}^i_t\) is a single camera view, each view is passed separately through a single shared CNN encoder \(f_\xi\). The output features of a single view are then defined as \(\mathcal{V}^i_t=f_\xi(\mathbf{o}^i_t)\). After encoding all singular views \(\{\mathcal{V}^1_t, \mathcal{V}^2_t, \ldots, \mathcal{V}^n_t\}\), the features are merged through summation, so that the combined multi-view representation becomes \(\mathcal{M}_t = \sum_{i=1}^n \mathcal{V}^i_t\). Standard multi-view algorithms pass only this merged feature representation \(\mathcal{M}_t\) to the downstream actor and critic, which causes the policy to fail when some of the input camera views are missing. To avoid this, the camera view features need to be properly disentangled during training.
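To make the merging step concrete, here is a minimal PyTorch sketch. The architecture, layer sizes, and the names `SharedEncoder` and `merge_views` are illustrative assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Hypothetical shared CNN encoder f_xi applied to every camera view."""
    def __init__(self, in_channels=3, feature_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.proj = nn.LazyLinear(feature_dim)

    def forward(self, obs):
        # obs: (B, C, H, W) pixels from one camera view -> V^i_t: (B, feature_dim)
        return self.proj(self.conv(obs))

def merge_views(encoder, views):
    """views: list of n per-view observation batches [o^1_t, ..., o^n_t].
    Returns the per-view features [V^1_t, ..., V^n_t] and their sum M_t."""
    single_feats = [encoder(v) for v in views]             # same shared encoder for every view
    merged = torch.stack(single_feats, dim=0).sum(dim=0)   # M_t = sum_i V^i_t
    return single_feats, merged
```

Because the same encoder weights process every view and summation is order-invariant, the downstream input dimensionality stays fixed regardless of how many cameras are available.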

2) To disentangle views: The downstream actor and critic need to be trained on both the merged feature representation \(\mathcal{M}_t\) and all its singular view feature representations \(\mathcal{V}^i_t\). However, naively training on both decreases sample efficiency and destabilizes learning, since the downstream actor and critic networks can treat the merged and singular representations as distinct states. We therefore build upon SADA, a framework that stabilizes both the actor and the critic under data augmentation by selectively augmenting their inputs. We modify the RL loss objectives so that training uses the merged feature representation, with its singular view features applied to it as augmentations. Given a critic network \(Q_\theta\) parametrized by \(\theta\), an actor network \(\pi_\phi\) parametrized by \(\phi\), and \(n\) input camera views:

(a) The loss function for a generic critic becomes: $$ \mathcal{L}^\textcolor{blue}{\textbf{UnAug}}_{Q_{\theta}}(\mathcal{D}) = \mathbb{E}_{(\mathbf{o}^m_{t},\mathbf{a}_{t}, r_{t}, \mathbf{o}^m_{t+1})\sim\mathcal{D}} \Big[ \big( Q_\theta(\textcolor{blue}{\mathcal{M}_t}, \mathbf{a}_{t}) - r_t - \gamma Q_{\overline{\theta}}(\textcolor{blue}{\mathcal{M}_{t+1}}, \mathbf{a}') \big)^2 \Big] $$ $$ \mathcal{L}^\textcolor{blue}{\textbf{Aug}}_{Q_{\theta}}(\mathcal{D}, i) = \mathbb{E}_{(\mathbf{o}^m_{t},\mathbf{a}_{t}, r_{t}, \mathbf{o}^m_{t+1})\sim\mathcal{D}} \Big[ \big( Q_\theta(\textcolor{blue}{\mathcal{V}^{i}_t}, \mathbf{a}_{t}) - r_t - \gamma Q_{\overline{\theta}}(\textcolor{blue}{\mathcal{M}_{t+1}}, \mathbf{a}') \big)^2 \Big] $$ $$ \mathcal{L}_{Q_{\theta}}^\textbf{MAD}(\mathcal{D}) = \alpha * \mathcal{L}_{Q_{\theta}}^\textcolor{blue}{\textbf{UnAug}}(\mathcal{D}) \; + \; (1-\alpha) * \frac{1}{n}\sum_{i=1}^n\mathcal{L}_{Q_{\theta}}^\textcolor{blue}{\textbf{Aug}}(\mathcal{D}, i) $$ where \(r_t\) is the reward at timestep \(t\), \(\mathbf{a}'\) is the policy's action for the next state computed from \(\mathcal{M}_{t+1}\), and \(\alpha\) is a hyperparameter that weighs the unaugmented and augmented streams for more fine-grained control over learning.
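As a hedged sketch of how this critic objective might look in code, assuming a TD3/DDPG-style base agent with a deterministic target policy and a single critic for brevity, and reusing the hypothetical `merge_views` helper from the sketch above; `critic`, `critic_target`, and `actor_target` are placeholders for the corresponding components of whichever base algorithm is used:

```python
import torch
import torch.nn.functional as F

def mad_critic_loss(encoder, critic, critic_target, actor_target,
                    views_t, action_t, reward_t, views_tp1,
                    gamma=0.99, alpha=0.5):
    single_t, merged_t = merge_views(encoder, views_t)         # V^i_t and M_t
    with torch.no_grad():
        _, merged_tp1 = merge_views(encoder, views_tp1)        # M_{t+1}
        next_action = actor_target(merged_tp1)                 # a'
        # TD target is always computed from the merged features M_{t+1}
        target_q = reward_t + gamma * critic_target(merged_tp1, next_action)

    # Unaugmented stream: critic evaluated on the merged representation M_t
    loss_unaug = F.mse_loss(critic(merged_t, action_t), target_q)

    # Augmented streams: critic evaluated on each single-view feature V^i_t,
    # regressing toward the same merged-feature target
    loss_aug = sum(F.mse_loss(critic(v, action_t), target_q) for v in single_t) / len(single_t)

    return alpha * loss_unaug + (1 - alpha) * loss_aug
```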

(b) The loss function for a generic actor becomes: $$ \mathcal{L}_{\pi_\phi}^\textcolor{blue}{\textbf{UnAug}}(\mathcal{D}) = \mathbb{E}_{\mathbf{o}^{m}_{t} \sim \mathcal{D}} \left[-Q_\theta(\textcolor{blue}{\mathcal{M}_t}, \pi_\phi(\textcolor{blue}{\mathcal{M}_t})) \right] $$ $$ \mathcal{L}_{\pi_\phi}^\textcolor{blue}{\textbf{Aug}}(\mathcal{D},i) = \mathbb{E}_{\mathbf{o}^{m}_{t} \sim \mathcal{D}} \left[-Q_\theta(\textcolor{blue}{\mathcal{M}_t}, \pi_\phi(\textcolor{blue}{\mathcal{V}^{i}_t})) \right] $$ $$ \mathcal{L}_{\pi_\phi}^\textbf{MAD}(\mathcal{D}) = \alpha * \mathcal{L}_{\pi_\phi}^\textcolor{blue}{\textbf{UnAug}}(\mathcal{D}) \; + \; (1-\alpha) * \frac{1}{n}\sum_{i=1}^n\mathcal{L}_{\pi_\phi}^\textcolor{blue}{\textbf{Aug}}(\mathcal{D},i) $$ By predicting the targets in both actor and critic updates from the merged feature representation \(\mathcal{M}_t\) only, the variance of the targets is reduced, which stabilizes the RL learning objective under data augmentation. Through this formulation, the actor and critic generalize to all singular views \(\mathcal{V}^i_t\) with minimal loss in training sample efficiency. A visual diagram of our method is illustrated below.
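A companion sketch for the actor objective under the same assumptions as the critic example above. As in the equations, the critic is always evaluated on the merged features \(\mathcal{M}_t\), while the actor is queried on both merged and single-view features:

```python
import torch

def mad_actor_loss(encoder, critic, actor, views_t, alpha=0.5):
    # Stop gradients into the shared encoder during the actor update; updating
    # the encoder only through the critic is a common convention in visual
    # actor-critic agents and is an assumption here, not a claim about MAD.
    with torch.no_grad():
        single_t, merged_t = merge_views(encoder, views_t)

    # Unaugmented stream: actor acts from the merged features M_t
    loss_unaug = -critic(merged_t, actor(merged_t)).mean()

    # Augmented streams: actor acts from each single-view feature V^i_t,
    # but is always scored by the critic on the merged features M_t
    loss_aug = sum(-critic(merged_t, actor(v)).mean() for v in single_t) / len(single_t)

    return alpha * loss_unaug + (1 - alpha) * loss_aug
```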
Method Diagram. Update diagram of a generic visual actor-critic model with our modifications. Our method, MAD, merges camera views through feature summation and disentangles camera views by selectively augmenting the inputs to the downstream actor and critic with all singular view features. The agent is trained end-to-end with our MAD loss functions. (Left): Single shared CNN encoder. (Middle): Actor update diagram. (Right): Critic update diagram.
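To tie the sketches together, one hypothetical end-to-end update step using the pieces above could look like the following; optimizers, target-network maintenance, and exploration are assumed to follow the underlying base algorithm:

```python
def mad_update_step(batch, encoder, actor, actor_target, critic, critic_target,
                    critic_opt, actor_opt, alpha=0.5):
    views_t, action_t, reward_t, views_tp1 = batch

    # Critic (and shared encoder) update with the MAD critic loss
    critic_loss = mad_critic_loss(encoder, critic, critic_target, actor_target,
                                  views_t, action_t, reward_t, views_tp1, alpha=alpha)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update with the MAD actor loss
    actor_loss = mad_actor_loss(encoder, critic, actor, views_t, alpha=alpha)
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    # Target-network (EMA) updates of actor_target / critic_target omitted for brevity.
```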

Videos

Rollout videos for each method (MAD (Ours), MVD, VIB, Single Camera) evaluated with All Cameras, First Person, Third Person A, and Third Person B views.

Results

Across 20 visual RL tasks, MAD achieves high sample efficiency and remains robust when camera views are removed, outperforming baselines.

MAD View Robustness. Success rate averaged over (Top) all 15 Meta-World and (Bottom) all 5 ManiSkill3 visual RL tasks. Methods are trained on all three camera views and evaluated on all cameras and on each singular camera view, with the final average displayed on the far right. Mean and 95% CI over 5 random seeds.

Extensions

We find that MAD can also be applied across different modalities (RGB, Depth) instead of different camera views. Surprisingly, training MAD on RGB and Depth attains higher sample efficiency on the Depth-only evaluation than the Depth-only baseline.
MAD Modality Transfer

MAD Modality Extension. Success rate averaged over 5 ManiSkill3 tasks. (Left) RGB and Depth images from the PokeCube Third Person A view. (Right) Methods are trained on one camera view (Third Person A) but with different modalities (RGB, Depth), and evaluated accordingly. Mean and 95% CI over 5 random seeds.

Citation

@misc{almuzairee2025merging,
  title={Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation}, 
  author={Abdulaziz Almuzairee and Rohan Patil and Dwait Bhatt and Henrik I. Christensen},
  year={2025},
  eprint={2505.04619},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2505.04619}, 
}