Squint: Fast Visual Reinforcement Learning for Sim-to-Real Robotics

UC San Diego

Abstract

Visual reinforcement learning is appealing for robotics but expensive: off-policy methods are sample-efficient yet slow, while on-policy methods parallelize well but waste samples. Recent work has shown that off-policy methods can train faster than on-policy methods in wall-clock time for state-based control. Extending this result to vision is challenging: high-dimensional image inputs complicate training dynamics and introduce substantial storage and encoding overhead. To address these challenges, we introduce Squint, a visual Soft Actor-Critic method that achieves faster wall-clock training than prior visual off-policy and on-policy methods. Squint combines parallel simulation, a distributional critic, resolution squinting, layer normalization, a tuned update-to-data ratio, and an optimized implementation. We evaluate on the SO-101 Task Set, a new suite of eight manipulation tasks in ManiSkill3 with heavy domain randomization, and demonstrate sim-to-real transfer to a real SO-101 robot. We train policies for 15 minutes on a single RTX 3090 GPU, with most tasks converging in under 6 minutes.

Setup

(Left) Our real-world setup, which uses a single wrist-mounted camera. (Middle) The image observation fed to the policy in the real world and in simulation. (Right) The visual difference between downsampling from a higher resolution and rendering directly at the low resolution. We refer to downsampling from a higher resolution as squinting.
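Squinting can be implemented as a simple post-render downsampling step. The sketch below, in PyTorch, uses area interpolation (which averages source pixels, so thin structures survive better than in a direct low-resolution render); the function name and the 128-to-64 resolutions are illustrative assumptions, not necessarily the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def squint(obs: torch.Tensor, out_hw: tuple = (64, 64)) -> torch.Tensor:
    """Downsample images rendered at a higher resolution ("squinting").

    obs: (N, C, H, W) float tensor with H, W larger than out_hw.
    Area interpolation averages the source pixels covered by each
    output pixel, unlike rendering directly at the low resolution.
    """
    return F.interpolate(obs, size=out_hw, mode="area")

# Hypothetical usage: render at 128x128, feed the policy 64x64 inputs.
frames = torch.rand(8, 3, 128, 128)   # stand-in for simulator renders
small = squint(frames)                # (8, 3, 64, 64)
```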

Method

Method Diagram. Squint is a fast off-policy actor-critic visual reinforcement learning algorithm that accelerates training speed by leveraging parallel environments and downsampling input image observations.
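The interaction between parallel environments and a tuned update-to-data (UTD) ratio can be sketched as follows. This is a minimal schematic of a generic off-policy loop, not Squint's actual implementation: the environment count, UTD value, buffer size, and the `env_step_batch` / `gradient_update` stand-ins are all illustrative assumptions.

```python
import random
from collections import deque

NUM_ENVS = 32    # parallel simulator copies (assumed value)
UTD = 0.5        # gradient updates per environment transition (assumed)

buffer = deque(maxlen=100_000)   # replay buffer of transitions
update_debt = 0.0                # fractional updates owed so far
num_updates = 0

def env_step_batch():
    # Stand-in for stepping all NUM_ENVS simulators once each.
    return [random.random() for _ in range(NUM_ENVS)]

def gradient_update(batch):
    # Stand-in for one actor/critic gradient update.
    return sum(batch) / len(batch)

for step in range(100):
    buffer.extend(env_step_batch())
    # Each batch of NUM_ENVS transitions "earns" NUM_ENVS * UTD updates,
    # so the update count scales with data collected, not loop iterations.
    update_debt += NUM_ENVS * UTD
    while update_debt >= 1.0 and len(buffer) >= 256:
        gradient_update(random.sample(list(buffer), 256))
        num_updates += 1
        update_debt -= 1.0
```

With these numbers, 100 loop iterations collect 3,200 transitions and perform 1,600 updates; raising UTD trades wall-clock speed for sample efficiency, which is why the ratio is worth tuning.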
Design Choices. We ablate the design choices most critical to training our agents while minimizing wall-clock time for sim-to-real robotics deployment. Mean Success Rate and 95% CI over 5 seeds on all SO-101 tasks.
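Two of the listed design choices, the distributional critic and layer normalization, can be combined in a single critic head. The sketch below assumes a C51-style categorical value distribution; the layer sizes, atom count, and value bounds are illustrative, not the paper's actual hyperparameters.

```python
import torch
import torch.nn as nn

class DistributionalCritic(nn.Module):
    """Layer-normalized categorical (C51-style) critic head."""

    def __init__(self, feat_dim=50, act_dim=6, num_atoms=101,
                 v_min=-10.0, v_max=10.0, hidden=256):
        super().__init__()
        # Fixed support over which the value distribution is defined.
        self.register_buffer("atoms",
                             torch.linspace(v_min, v_max, num_atoms))
        self.net = nn.Sequential(
            nn.Linear(feat_dim + act_dim, hidden),
            nn.LayerNorm(hidden),   # normalization stabilizes training
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_atoms),
        )

    def forward(self, feat, act):
        logits = self.net(torch.cat([feat, act], dim=-1))
        probs = logits.softmax(dim=-1)          # (N, num_atoms)
        q = (probs * self.atoms).sum(dim=-1)    # expected Q-value
        return probs, q

critic = DistributionalCritic()
feat, act = torch.rand(4, 50), torch.rand(4, 6)
probs, q = critic(feat, act)
```

Predicting a distribution over returns rather than a scalar Q-value gives the critic a richer learning signal, which is one common motivation for distributional critics in fast off-policy training.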

Visuals

Robot Input External Camera

Results

On the SO-101 Task Set, consisting of eight manipulation tasks with the SO-101 robotic arm in ManiSkill3, Squint surpasses prior visual reinforcement learning methods, as well as State-to-Visual DAgger, in wall-clock training time.


(Left) Comparison with visual RL baselines. (Right) Comparison with State-to-Visual DAgger, where the time taken to train a state-based SAC expert is included in the comparison. Mean Success Rate and 95% CI over 5 seeds on all SO-101 tasks.

Showcase

Citation

@article{almuzairee2026squint,
  title = {Squint: Fast Visual Reinforcement Learning for Sim-to-Real Robotics},
  author = {Almuzairee, Abdulaziz and Christensen, Henrik I.},
  journal = {arXiv preprint arXiv:2602.21203},
  year = {2026},
}