Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

Rui Zhao; Kaiming Yang; Jifeng Zhu; Siyang Chen; Ziqi Wang; Weijia Wu; Kevin Qinghong Lin; Heng Wang; Mike Zheng Shou

Robotics and Embodied Intelligence · Breakthrough level

Published: 2026-06-03
arXiv: 2606.04811

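Key Points

Problem/Background: This arXiv paper poses an important evaluation question: if video generation models have truly learned the physical world, can the manipulation videos they generate be converted into executable robot behavior?
Method/Mechanism: Dream.exe builds a video-to-execution pipeline: given a scene image and a task description, it generates a manipulation video, converts the video into robot trajectories, and executes them in a physics simulator.
Results/Evidence: The benchmark covers 101 hand-designed robot manipulation tasks and compares frontier closed-source video models, open-source video models, and robot-specific models on visual quality, trajectory fidelity, and execution success rate.
Why it is included: It moves video generation evaluation from visual similarity to physical executability, and it establishes a reusable grounding/evaluation interface linking world models, video generation, and robot manipulation.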

Abstract

Recent years have seen video generation models cross a qualitative threshold. Models such as Wan (Wan et al., 2025), Kling (Team et al., 2025), Imagen Video (Ho et al., 2022a), and Veo (Google DeepMind, 2025) can synthesize photorealistic videos of fluid dynamics, human motion, and complex object interactions with a fidelity that was out of reach just two years ago. The community has begun to interpret this visual fluency as evidence of something deeper: that large-scale video generation models are learning implicit world models (Brooks et al., 2024; Kang et al., 2024; Ha & Schmidhuber, 2018), acquiring internal representations of physical causality from the statistical regularities of internet-scale data. This interpretation has become a foundation for an active line of research in robot learning, where generated videos are proposed as scalable behavioral priors that could reduce dependence on costly physical demonstrations (Du et al., 2023; Jang et al., 2025; Ye et al., 2026a; Liang et al., 2025). Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it depicts should translate into executable robot behavior. We introduce Dream.exe, an evaluation framework that operationalizes this criterion through a video-to-execution pipeline. Given a scene image and a task description, Dream.exe synthesizes a manipulation video, converts the generated motion into robot trajectories, and executes them in a physics simulator, yielding a grounding signal that purely visual metrics cannot offer. Using this pipeline, we evaluate 8 models spanning frontier closed-source generators, open-source generators, and robot-specific models. Our benchmark covers

Links

Paper | Code
