Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Tianyuan Yuan; Zibin Dong; Yicheng Liu; Hang Zhao

Reinforcement learning · Breakthrough tier · Explainer video available

Published: 2026-03-17
arXiv: 2603.16666

Inclusion notes and commentary

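This paper addresses a narrow but important question in embodied world models: do the gains of World Action Models (WAMs) come mainly from explicit future imagination at test time, or from the video-modeling signal during training? Rather than building yet another, slower imagine-then-execute pipeline, it asks which factor in this line of work is actually causally responsible for the gains.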

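The authors propose Fast-WAM, which keeps video co-training during training but skips future prediction at inference, and use it to run a controlled comparison that separates the contribution of video co-training from that of test-time imagination. The results show that the model remains competitive after test-time future imagination is removed, while latency drops to 190 ms, more than four times faster than conventional imagine-then-execute WAMs.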
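The split between training and inference described above can be illustrated with a toy sketch. This is not the authors' implementation: the real model relies on iterative video denoising, whereas the sketch below uses small linear heads and mean squared error purely to show the control flow. The class `ToyFastWAM`, the `video_weight` parameter, and all dimensions are hypothetical names chosen for illustration.

```python
import random


def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


class ToyFastWAM:
    """Toy stand-in (assumption): a shared encoder feeding an action head and a video head."""

    def __init__(self, obs_dim=4, latent_dim=3, action_dim=2, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

        self.encoder = mat(latent_dim, obs_dim)
        self.action_head = mat(action_dim, latent_dim)
        self.video_head = mat(obs_dim, latent_dim)  # predicts the next observation
        self.video_head_calls = 0  # counts how often future prediction runs

    @staticmethod
    def _matvec(matrix, vector):
        return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

    def encode(self, obs):
        return self._matvec(self.encoder, obs)

    def predict_action(self, latent):
        return self._matvec(self.action_head, latent)

    def predict_future(self, latent):
        self.video_head_calls += 1
        return self._matvec(self.video_head, latent)

    def training_loss(self, obs, next_obs, expert_action, video_weight=1.0):
        """Training: the action loss and the video (future) loss share one encoder."""
        latent = self.encode(obs)
        action_loss = mse(self.predict_action(latent), expert_action)
        video_loss = mse(self.predict_future(latent), next_obs)
        return action_loss + video_weight * video_loss

    def act(self, obs):
        """Inference: only the action path runs; future prediction is skipped."""
        return self.predict_action(self.encode(obs))


model = ToyFastWAM()
obs = [0.1, 0.2, 0.3, 0.4]
next_obs = [0.2, 0.3, 0.4, 0.5]
expert_action = [0.0, 1.0]

loss = model.training_loss(obs, next_obs, expert_action)
calls_after_training = model.video_head_calls
action = model.act(obs)
print(loss, action, model.video_head_calls == calls_after_training)
```

In this sketch the video head only shapes the shared encoder through the training loss, so dropping it at inference changes the cost of `act` but not its output. That mirrors the hypothesis the paper tests, under the simplifying assumptions stated above.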

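It merits formal inclusion because it is a typical mechanism-clarifying paper: rather than only reporting success rates, it directly answers a core design question in an active area and reaches conclusions of direct value for future world action model design. For the repository's multimodal / world-model / robotics track, clarification entries of this kind matter.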

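It was not placed in a higher tier because its impact is still largely confined to the WAM/VLA sub-line; the conclusions are useful, but not enough to serve as a general blueprint for embodied intelligence more broadly.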

Original abstract

World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance. In this paper, we ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. We disentangle the role of video modeling during training from explicit future generation during inference by proposing Fast-WAM, a WAM architecture that retains video co-training during training but skips future prediction at test time. We further instantiate several Fast-WAM variants to enable a controlled comparison of these two factors. Across these variants, we find that Fast-WAM remains competitive with imagine-then-execute variants, while removing video co-training causes a much larger performance drop. Empirically, Fast-WAM achieves competitive results with state-of-the-art methods both on simulation benchmarks (LIBERO and RoboTwin) and real-world tasks, without embodied pretraining. It runs in real time with 190 ms latency, over 4× faster than existing imagine-then-execute WAMs. These results suggest that the main value of video prediction in WAMs may lie in improving world representations during training rather than generating future observations at test time.
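The latency figures in the abstract imply two simple bounds, worked out below as a back-of-the-envelope check. Treating "over 4×" as a lower bound on the speedup, and assuming one model inference per control step with no action chunking (an assumption made here, not stated in the abstract), the numbers come out as follows. The function names are hypothetical.

```python
FAST_WAM_LATENCY_MS = 190.0  # reported in the abstract
MIN_SPEEDUP = 4.0            # "over 4x faster", treated here as a lower bound


def implied_baseline_latency_ms(latency_ms, speedup):
    """Lower bound on the latency of the slower imagine-then-execute baseline."""
    return latency_ms * speedup


def max_inference_rate_hz(latency_ms):
    """Upper bound on inferences per second if calls run back to back."""
    return 1000.0 / latency_ms


baseline_ms = implied_baseline_latency_ms(FAST_WAM_LATENCY_MS, MIN_SPEEDUP)
rate_hz = max_inference_rate_hz(FAST_WAM_LATENCY_MS)
print(f"baseline latency > {baseline_ms:.0f} ms")   # more than 760 ms
print(f"Fast-WAM rate <= {rate_hz:.2f} inferences/s")  # about 5.26 per second
```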

Explainer video

Video pages: Bilibili, YouTube

Links

Paper link