FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment

Han Zhao; Jingbo Wang; Wenxuan Song; Shuai Chen; Yang Liu; Yan Wang; Haoang Li; Donglin Wang

Category: Agents and Autonomous Science | Tier: Breakthrough

Published: 2026-02-19
arXiv: 2602.17259

Inclusion notes

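Problem and background: VLA models are widely thought to need world modeling for better long-horizon reasoning and generalization. Directly predicting future pixels, however, tends to pull the training objective toward low-level visual reconstruction and to accumulate errors at inference time.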

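Method/novelty: FRAPPE replaces future-pixel reconstruction with multiple future representation alignment. In mid-training the model learns to predict future latents; in post-training it aligns, in parallel, with the future representations produced by several visual foundation models, which injects world-modeling ability into a generalist policy.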
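The alignment idea can be illustrated with a minimal sketch. This is not the paper's implementation: the summary here does not specify the loss, the teacher models, or the architecture, so the cosine-based loss, the equal default weights, and the names `teacher_a` and `teacher_b` are all assumptions made for illustration. The point it shows is that the target is a frozen encoder's embedding of the real future observation, so no pixels are reconstructed.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def alignment_loss(predicted, target):
    """1 - cosine similarity: near 0 when directions match, near 2 when opposite."""
    return 1.0 - cosine_similarity(predicted, target)


def multi_teacher_alignment_loss(predictions, targets, weights=None):
    """Weighted mean of per-teacher alignment losses (assumed form).

    predictions: teacher name -> the policy's predicted future latent
    targets:     teacher name -> a frozen teacher's embedding of the real future frame
    """
    names = sorted(targets)
    if weights is None:
        weights = {name: 1.0 for name in names}  # assumption: equal weighting
    total_weight = sum(weights[name] for name in names)
    weighted = sum(
        weights[name] * alignment_loss(predictions[name], targets[name])
        for name in names
    )
    return weighted / total_weight


# Hypothetical two-teacher example with 2-D latents.
targets = {"teacher_a": [1.0, 0.0], "teacher_b": [0.0, 2.0]}
predictions = {"teacher_a": [2.0, 0.0], "teacher_b": [3.0, 0.0]}
loss = multi_teacher_alignment_loss(predictions, targets)
print(round(loss, 3))
```

In this toy case the prediction for `teacher_a` points in the same direction as its target and the prediction for `teacher_b` is orthogonal to its target, so the averaged loss sits midway between a perfect and an uninformative match. In a real system each teacher would presumably have its own prediction head, since different foundation models use different embedding dimensions.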

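Significance/place in this repository: This paper belongs to the world model × robotics line. Its value lies in making the case that "future representation alignment" is a more stable and more scalable route than explicit reconstruction, which is instructive for work on generalist policies.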

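Limitations/why it is not ranked one tier higher: The method looks transferable beyond its original setting, but the evidence so far is concentrated on robotics benchmarks and a small number of real-world tasks, and it has not yet been demonstrated at the level of general-purpose foundation models.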

Original abstract

Enabling VLA models to predict environmental dynamics, known as world modeling, has been recognized as essential for improving robotic reasoning and generalization. However, current approaches face two main issues: (1) the training objective forces models to over-emphasize pixel-level reconstruction, which constrains semantic learning and generalization; (2) reliance on predicted future observations during inference often leads to error accumulation. To address these challenges, we introduce Future Representation Alignment via Parallel Progressive Expansion (FRAPPE). Our method adopts a two-stage fine-tuning strategy: in the mid-training phase, the model learns to predict the latent representations of future observations; in the post-training phase, we expand the computational workload in parallel and align the representation simultaneously with multiple different visual foundation models. By significantly improving fine-tuning efficiency and reducing dependence on action-annotated data, FRAPPE provides a scalable and data-efficient pathway to enhance world-awareness in generalist robotic policies. Experiments on the RoboTwin benchmark and real-world tasks demonstrate that FRAPPE outperforms state-of-the-art approaches and shows strong generalization in long-horizon and unseen scenarios.

Correspondence: Han Zhao at zhaohan34@westlake.edu.cn
Project Website: https://h-zhao1997.github.io/frappe
Model: https://huggingface.co/collections/hhhJB/frappe
Code: https://github.com/Jbo-Wang/frappe

Links

Paper link