MonoLift: Learning 3D Manipulation Policies from Monocular RGB via Distillation

Abstract

Although learning 3D manipulation policies from monocular RGB images is lightweight and deployment-friendly, the lack of structural information often leads to inaccurate action estimation. While explicit 3D inputs can mitigate this issue, they typically require additional sensors and introduce data acquisition overhead. An intuitive alternative is to incorporate a pre-trained depth estimator; however, this often incurs substantial inference-time cost. To address this, we propose MonoLift, a tri-level knowledge distillation framework that transfers spatial, temporal, and action-level knowledge from a depth-guided teacher to a monocular RGB student. By jointly distilling geometry-aware features, temporal dynamics, and policy behaviors during training, MonoLift enables the student model to perform 3D-aware reasoning and precise control at deployment using only monocular RGB input. Extensive experiments on both simulated and real-world manipulation tasks show that MonoLift not only outperforms existing monocular approaches but even surpasses several methods that rely on explicit 3D input, offering a resource-efficient and effective solution for vision-based robotic control.

Method Overview

MonoLift

We identify three fundamental challenges in policy learning from RGB-only inputs: (i) difficulty in spatial disambiguation, (ii) limited temporal cues, and (iii) misguided actions due to absent 3D priors. These observations motivate our tri-level knowledge distillation framework, which targets each limitation with a complementary component: (i) Spatial Representation Distillation transfers fused RGB–depth features from the teacher to help the student disambiguate visually similar yet structurally different observations. (ii) Temporal Dynamics Distillation aligns temporal feature trajectories so that the student captures motion patterns that reflect underlying 3D structural changes. (iii) Action Distribution Distillation transfers action distributions shaped by the teacher’s 3D understanding, guiding the student to generate geometry-aware behaviors.
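To make the three levels concrete, the following is a minimal sketch of how such a combined objective could be assembled. It is not the MonoLift implementation: the specific loss forms (mean squared error on per-frame features, mean squared error on frame-to-frame feature differences, and KL divergence over a discrete action distribution), the equal default weights, and every function and parameter name are assumptions chosen for illustration.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def spatial_loss(student_feats, teacher_feats):
    """Assumed form: match student features to teacher features frame by frame."""
    pairs = list(zip(student_feats, teacher_feats))
    return sum(mse(s, t) for s, t in pairs) / len(pairs)

def temporal_loss(student_feats, teacher_feats):
    """Assumed form: match frame-to-frame feature changes (needs >= 2 frames)."""
    def deltas(feats):
        return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(feats, feats[1:])]
    pairs = list(zip(deltas(student_feats), deltas(teacher_feats)))
    return sum(mse(s, t) for s, t in pairs) / len(pairs)

def action_loss(student_logits, teacher_logits):
    """Assumed form: KL from the teacher's action distribution to the student's."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

def tri_level_loss(student_feats, teacher_feats, student_logits, teacher_logits,
                   w_spatial=1.0, w_temporal=1.0, w_action=1.0):
    """Weighted sum of the three hypothetical distillation terms."""
    return (w_spatial * spatial_loss(student_feats, teacher_feats)
            + w_temporal * temporal_loss(student_feats, teacher_feats)
            + w_action * action_loss(student_logits, teacher_logits))

# Toy example: two frames of 2-D features and a 3-way action distribution.
teacher_feats = [[0.0, 1.0], [1.0, 2.0]]
student_feats = [[1.0, 2.0], [2.0, 3.0]]   # constant offset from the teacher
teacher_logits = [2.0, 0.0, -1.0]
student_logits = [0.0, 0.0, 0.0]

total = tri_level_loss(student_feats, teacher_feats, student_logits, teacher_logits)
```

In this toy example the student's features differ from the teacher's by a constant offset, so the spatial term is nonzero while the temporal term vanishes. This illustrates why the two levels are complementary: one constrains what each frame encodes, the other constrains how the encoding changes over time.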

Real-World Videos

Press a button

Push a box into a goal

Pull out a tissue

Pick up grapes and place them on a plate

Lift a cup and pour water

Fold a towel