Recent success in legged robot locomotion is largely attributed to the integration of reinforcement learning with physics simulators. However, the resulting policies often struggle when deployed in real-world environments due to sim-to-real gaps, as simulators typically fail to replicate visual realism and complex real-world geometry. Moreover, the lack of realistic visual rendering limits the ability of these policies to support high-level tasks requiring RGB-based perception, such as ego-centric navigation. This paper presents a Real-to-Sim-to-Real framework that generates photorealistic and physically interactive "digital twin" simulation environments for visual navigation and locomotion learning. Our approach leverages 3DGS-based scene reconstruction from multi-view images and integrates these environments into simulations that support ego-centric visual perception and mesh-based physical interaction. To demonstrate its effectiveness, we train a reinforcement learning policy within the simulator to perform a visual goal-tracking task. Extensive experiments show that our framework achieves RGB-only sim-to-real policy transfer. Additionally, our framework facilitates the rapid adaptation of robot policies to new environments, highlighting its potential for applications in households and factories.
We first reconstruct geometry-consistent scenes from the captured images under foundation-model constraints. We then build a realistic and interactive simulation environment with a GS-mesh hybrid representation and occlusion-aware randomization and composition for policy training. Finally, we zero-shot transfer the RL policy trained in simulation to the real robot for ego-centric navigation and visual locomotion.
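The three-stage pipeline can be summarized by the following minimal sketch. All function names here (reconstruct_gaussians, extract_mesh, compose_scene, train_policy) are hypothetical placeholders standing in for the corresponding stages, not part of any released API.

```python
def real_to_sim_to_real(images, camera_poses):
    # Stage 1 (Real-to-Sim): geometry-consistent 3DGS reconstruction from
    # multi-view images, regularized by foundation-model priors.
    gaussians = reconstruct_gaussians(images, camera_poses, foundation_priors=True)

    # Stage 2 (Sim): hybrid representation -- a mesh extracted from the
    # Gaussians drives physical interaction, while the Gaussians themselves
    # drive photorealistic ego-centric rendering. Composition of agent/object
    # Gaussians into the scene uses occlusion-aware randomization.
    mesh = extract_mesh(gaussians)
    sim_env = compose_scene(mesh, gaussians, randomize=True)

    # Stage 3 (Sim-to-Real): train the RL policy in simulation, then
    # zero-shot transfer it to the real robot.
    policy = train_policy(sim_env)
    return policy
```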
The agent leverages ego-view photorealistic GS rendering as its visual observation and interacts with the mesh extracted from the GS in the Isaac Sim environment. The agent receives RGB image features from a ViT encoder, proprioception from simulator sensors, and a task-specific RGB command as input, and uses an asymmetric actor-critic LSTM architecture to output velocity commands for low-level policy control.
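A minimal PyTorch sketch of the actor branch is shown below. All dimensions are illustrative assumptions rather than the paper's actual hyperparameters, and the critic (not shown) would additionally consume privileged simulator state, which is what makes the actor-critic asymmetric.

```python
import torch
import torch.nn as nn

class ActorLSTM(nn.Module):
    """Actor branch of the asymmetric actor-critic (dimensions illustrative)."""

    def __init__(self, vit_dim=384, prop_dim=48, cmd_dim=384, hidden=256, act_dim=3):
        super().__init__()
        # Fuse ViT image features, proprioception, and the RGB command feature.
        self.fuse = nn.Sequential(
            nn.Linear(vit_dim + prop_dim + cmd_dim, hidden), nn.ELU()
        )
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        # Output a velocity command, e.g. (v_x, v_y, yaw rate).
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, vit_feat, proprio, cmd, hx=None):
        # Each input: (batch, time, dim) sequences over an episode rollout.
        x = self.fuse(torch.cat([vit_feat, proprio, cmd], dim=-1))
        x, hx = self.lstm(x, hx)
        # The velocity command is tracked by the low-level locomotion policy.
        return self.head(x), hx
```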
At the beginning of each episode, we randomly sample mesh positions for the robot and three cones in the Isaac Sim environment (upper row). We synchronously merge the agent and object Gaussians into the environment Gaussians and compose them for joint rendering (lower row). Both the mesh and Gaussian rendering results shown here are in the Bird's-Eye View (BEV) with the same camera pose.
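The reset logic can be sketched as follows, assuming hypothetical helpers sample_free_pose, transform_gaussians, and merge_gaussians; the key point is that the sampled mesh poses and the composed Gaussians stay synchronized so that one rasterization pass renders the scene with correct occlusions.

```python
def reset_episode(env, agent_gs, cone_gs_list, scene_gs, bounds):
    # Sample poses for the robot and the three cones; their meshes are
    # placed in Isaac Sim for physical interaction.
    poses = [sample_free_pose(bounds) for _ in range(1 + len(cone_gs_list))]
    env.set_robot_pose(poses[0])
    for cone, pose in zip(env.cones, poses[1:]):
        cone.set_pose(pose)

    # Keep the Gaussian side synchronized: move the agent/object Gaussians
    # to the sampled poses and merge them with the static scene Gaussians,
    # so joint rendering resolves occlusions between all components.
    composed = merge_gaussians(
        scene_gs,
        [transform_gaussians(g, p) for g, p in zip([agent_gs, *cone_gs_list], poses)],
    )
    return composed  # consumed by the renderer for the ego-view RGB observation
```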
We conduct comparison and ablation experiments against various baselines.
TBA