Ego-Pi: VLA Fine-Tuning for Ego-Centric Human and Robot Data

Robotics and Embodied AI · Breakthrough tier

Published: 2026-06-06

Editor's Commentary

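Ego-Pi is a CVPR 2026 Findings paper from Stanford and Meta. It sets out to answer a key question in robot learning: can egocentric human data teach a robot task logic that it never learned from robot data? Building on Physical Intelligence's π0.5 VLA foundation model, it performs human-robot cross-embodiment co-training on a humanoid robot with two arms and dexterous five-finger hands.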

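The core method is not simply to mix human-hand videos into the training set. Instead, it adapts to the cross-embodiment gap at three levels: visual guiding keypoints narrow the visual correspondence gap between human hands and robot hands; human and robot actions are aligned to a shared action representation; and left- and right-hand action token interleaving extends the model to high-dimensional bimanual control while avoiding changes to the pretrained action head.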
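The interleaving idea in the third adaptation can be illustrated with a minimal Python sketch. It is not the paper's implementation: the token width TOKEN_DIM, the zero-padding scheme, the left-then-right ordering, and the helper names pad, interleave, and deinterleave are all assumptions made for illustration. The sketch only shows how a per-step bimanual action that is too wide for one fixed-width action token could be split into a left-hand token followed by a right-hand token, so that the token width expected by a pretrained action head stays unchanged.

```python
from typing import List, Sequence, Tuple

# Hypothetical per-token action width of the pretrained head (an assumption,
# not a value taken from the paper).
TOKEN_DIM = 32

Vector = List[float]


def pad(vec: Sequence[float], width: int = TOKEN_DIM) -> Vector:
    """Zero-pad one hand's action vector to the fixed token width."""
    if len(vec) > width:
        raise ValueError(f"action of size {len(vec)} exceeds token width {width}")
    return list(vec) + [0.0] * (width - len(vec))


def interleave(
    left: Sequence[Sequence[float]],
    right: Sequence[Sequence[float]],
    width: int = TOKEN_DIM,
) -> List[Vector]:
    """Turn T left-hand and T right-hand actions into 2T tokens: L0, R0, L1, R1, ..."""
    if len(left) != len(right):
        raise ValueError("left and right trajectories must have the same length")
    tokens: List[Vector] = []
    for left_step, right_step in zip(left, right):
        tokens.append(pad(left_step, width))   # even positions hold the left hand
        tokens.append(pad(right_step, width))  # odd positions hold the right hand
    return tokens


def deinterleave(
    tokens: Sequence[Sequence[float]], left_dim: int, right_dim: int
) -> Tuple[List[Vector], List[Vector]]:
    """Invert interleave: strip the padding and split tokens back into two hands."""
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of tokens")
    left = [list(tok[:left_dim]) for tok in tokens[0::2]]
    right = [list(tok[:right_dim]) for tok in tokens[1::2]]
    return left, right
```

Under these assumptions, a horizon of T control steps becomes 2T tokens, trading sequence length for per-token width. deinterleave recovers the per-hand actions from a predicted token sequence.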

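The experiments focus on task-semantic transfer and compositional generalization rather than only in-distribution imitation. The paper reports that Ego-Pi lets the robot learn sorting logic, rule ordering, and multi-step skill composition from a small number of human egocentric demos. On tasks such as tomato sorting, packaging, and boxing, it significantly outperforms a baseline trained only on robot data and exhibits behaviors that robot data alone does not produce.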

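It merits inclusion because it extends VLA fine-tuning from "more robot data" to task-semantic transfer driven by egocentric human data, giving humanoid and dexterous-hand robots a reusable cross-embodiment training interface. Its limitations: the task scale is still small, the platform is confined to a specific humanoid with Tesollo hands, and generalization to real open-world environments and to larger-scale human video pretraining still needs follow-up validation.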

Original Abstract

Robotics faces a fundamental challenge of data scarcity. Unlike language or vision research, there is no internet-scale dataset for robotic manipulation. A promising path forward is to leverage egocentric human data, which can be collected more easily, with greater breadth, and at a larger scale. Towards this end, we investigate key design choices for learning across human and humanoid embodiments equipped with dexterous five-finger hands, using the π0.5 model as a foundation. We introduce visual guiding keypoints to reduce the embodiment gap between human hands and robot hands, align both domains through a shared action representation, and interleave left and right hand actions across paired tokens to handle high-dimensional hand control without modifying the pre-trained weights. Our results show that human data not only improves robotic generalization but

Links

Paper link · Project page
