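Key points
- Problem / background
- GAP addresses a long-standing spatial-understanding bottleneck in bimanual manipulation: many policies rely on 2D representations and lack explicit 3D spatial reasoning, while point-cloud pipelines depend on depth sensing, calibration, and workspace cropping, which makes real-world deployment costly.
- Method / mechanism
- The paper uses a pretrained 3D geometric foundation model as its core geometric prior: it obtains geometry-aware latents directly from RGB, fuses them with DINOv3 semantic features and robot proprioception, and then uses a diffusion policy to jointly predict a future action chunk and a future 3D latent/pointmap.
- Results / evidence
- The case for formal inclusion is that it binds action prediction and future-geometry prediction into a single reusable training objective, so the bimanual policy not only outputs actions but also learns how scene geometry evolves under those actions. The README states that the work has been accepted to CVPR 2026 and that code, preprocessing, training, and a RoboTwin evaluation wrapper have been released.
- Inclusion value
- It is not rated one tier higher because validation so far is mainly on RoboTwin and a limited set of real-robot executions; whether the future-geometry prediction objective extends to more complex long-horizon contact, heavy occlusion, and cross-hardware platforms still needs more evidence.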
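To make the joint action-geometry objective described in the method bullet more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes the three inputs are fused by plain concatenation, a DDPM-style linear noise schedule, a noise-prediction (epsilon) loss, equal weighting of the action and geometry terms, and a toy linear denoiser. All dimensions and names (`fuse_state`, `LinearDenoiser`, `joint_loss`) are hypothetical, and the decoder that turns the future 3D latent into a pointmap is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes are hypothetical placeholders, not values from the paper.
D_GEO, D_SEM, D_PROP = 32, 24, 14   # geometry latent, semantic feature, proprioception
HORIZON, D_ACT = 8, 14              # action chunk length, per-step action size
D_FUT = 32                          # future 3D latent size
T_STEPS = 100                       # number of diffusion steps


def fuse_state(geo, sem, prop):
    """Fuse the three inputs into one state vector (assumed: concatenation)."""
    return np.concatenate([geo, sem, prop], axis=-1)


def make_schedule(t_steps):
    """Linear beta schedule; returns the cumulative product of (1 - beta)."""
    betas = np.linspace(1e-4, 2e-2, t_steps)
    return np.cumprod(1.0 - betas)


def add_noise(x0, t, alpha_bar, noise):
    """Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    a = alpha_bar[t][:, None]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise


class LinearDenoiser:
    """Toy stand-in for the policy network: one linear map, no training."""

    def __init__(self, d_x, d_state, rng):
        self.w = rng.standard_normal((d_x + d_state + 1, d_x)) * 0.01

    def __call__(self, x_t, t, state):
        t_feat = (t / T_STEPS)[:, None]  # scalar timestep feature per sample
        return np.concatenate([x_t, state, t_feat], axis=-1) @ self.w


def joint_loss(denoiser, state, actions, future_latent, alpha_bar, rng):
    """Noise-prediction MSE over the joint [action chunk, future latent] target."""
    batch = state.shape[0]
    x0 = np.concatenate([actions.reshape(batch, -1), future_latent], axis=-1)
    t = rng.integers(0, len(alpha_bar), size=batch)
    noise = rng.standard_normal(x0.shape)
    pred = denoiser(add_noise(x0, t, alpha_bar, noise), t, state)
    err = (pred - noise) ** 2
    n_act = actions.shape[1] * actions.shape[2]
    # Report the two terms separately so they can be weighted if desired.
    return err[:, :n_act].mean(), err[:, n_act:].mean()


batch = 4
state = fuse_state(
    rng.standard_normal((batch, D_GEO)),
    rng.standard_normal((batch, D_SEM)),
    rng.standard_normal((batch, D_PROP)),
)
actions = rng.standard_normal((batch, HORIZON, D_ACT))
future_latent = rng.standard_normal((batch, D_FUT))
alpha_bar = make_schedule(T_STEPS)
denoiser = LinearDenoiser(HORIZON * D_ACT + D_FUT, state.shape[1], rng)

loss_act, loss_geo = joint_loss(denoiser, state, actions, future_latent, alpha_bar, rng)
total_loss = loss_act + loss_geo  # assumed equal weighting
```

Because the action chunk and the future latent are denoised as one vector conditioned on the same fused state, a single network has to explain both, which is the sense in which the objective ties motion to the geometry it produces. A real system would swap the linear map for a learned network and might weight the two terms differently.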
Original abstract and Chinese translation
Chinese translation
标题:具有3D几何先验的双臂操作动作-几何预测。
双臂操作需要能够推理3D几何、预测其在动作下如何演变以及生成平滑协调运动的策略。然而,现有方法通常依赖于空间感知有限的2D特征,或者需要难以在真实世界环境中可靠获取的显式点云。与此同时,最近的3D几何基础模型表明,可以快速且鲁棒地直接从RGB图像重建准确多样的3D结构。我们利用这一机会,提出了一个直接基于预训练3D几何基础模型构建双臂操作的框架。我们的策略将几何感知潜在变量、2D语义特征和本体感受融合到一个统一的状态表示中,并使用扩散模型共同预测未来的动作块和解码为密集点图的未来3D潜在变量。通过明确预测3D场景将如何与动作序列一起演变,该策略仅使用RGB观测就获得了强大的空间理解和预测能力。我们在RoboTwin基准测试的仿真环境和真实世界机器人执行中评估了我们的方法。我们的方法始终优于基于2D和基于点云的基线,在操作成功率、臂间协调性和3D空间预测精度方面达到了最先进的性能。代码可在https://github.com/Chongyang-99/GAP.git获取。
图1. 范式比较。基于2D的方法从多视角RGB观测中学习隐式3D表示,纯粹依赖于2D线索。基于3D的方法需要相机校准和预设工作空间来裁剪点云,这限制了泛化性和可扩展性。相比之下,我们的方法利用强大的2D和3D预训练先验来实现语义-几何融合感知,从而在没有严格校准或工作空间限制的情况下实现鲁棒的动作和几何联合预测。
用于精密装配[13, 14, 43]、可变形物体处理[2, 48]以及在杂乱或动态环境中的操作[4, 5]。然而,可靠的双臂操作仍然具有挑战性,需要策略生成时间平滑、动态一致的动作[4, 47],感知3D物体几何和交互动力学[8, 23, 40, 44],并在持续接触中保持稳定的臂间协调[20, 21]。最近的策略学习方法,如ACT [47]和扩散策略[12],通过将动作分块与迭代去噪相结合来提高稳定性。尽管这些方法对时间平滑有效,但它们在很大程度上仍然是几何扁平的:
Original abstract
Bimanual manipulation requires policies that can reason about 3D geometry, anticipate how it evolves under action, and generate smooth, coordinated motions. However, existing methods typically rely on 2D features with limited spatial awareness, or require explicit point clouds that are difficult to obtain reliably in real-world settings. At the same time, recent 3D geometric foundation models show that accurate and diverse 3D structure can be reconstructed directly from RGB images in a fast and robust manner. We leverage this opportunity and propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. Our policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses diffusion model to jointly predict a future action chunk and a future 3D latent that decodes into a dense pointmap. By explicitly predicting how the 3D scene will evolve together with the action sequence, the policy gains strong spatial understanding and predictive capability using only RGB observations. We evaluate our method both in simulation on the RoboTwin benchmark and in real-world robot executions. Our approach consistently outperforms 2D-based and point-cloud-based baselines, achieving state-of-the-art performance in manipulation success, inter-arm coordination, and 3D spatial prediction accuracy. Code is available at https://github.com/Chongyang-99/GAP.git.
Figure 1. Paradigm Comparison. 2D-based methods learn implicit 3D representations from multi-view RGB observations, relying purely on 2D cues. 3D-based methods require camera calibration and preset workspaces to crop point clouds, which limits generalization and scalability. In contrast, our approach leverages powerful 2D and 3D pretrained priors to achieve semantic–geometric fusion perception, enabling robust action and geometry joint prediction without strict calibration or workspace constraints.
for precision assembly [13, 14, 43], deformable-object handling [2, 48], and operation in cluttered or dynamic environments [4, 5]. However, reliable bimanual manipulation remains challenging, requiring the policy to generate temporally smooth, dynamically consistent actions [4, 47], perceive 3D object geometry and interaction dynamics [8, 23, 40, 44], and maintain stable inter-arm coordination throughout continuous contact [20, 21]. Recent policy-learning approaches such as ACT [47] and diffusion policies [12] improve stability by combining action chunking with iterative denoising. While effective for temporal smoothing, these methods remain largely geometry-flat: