World Model Self-Distillation: Training World Models to Solve General Tasks

Multimodal foundation models, breakthrough-level

Published: 2026-06-10
arXiv: 2606.12072

Editorial commentary

This paper turns pretrained video generators into visual world models that can carry out tasks.

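The core idea is to drive a Demonstrator with a detailed solution, distill its behavior into an Executor that sees only the image and a short prompt, and then improve the Executor with reinforcement learning from VLM feedback.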
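The data flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`propose_task`, `demonstrator`, `executor`, `vlm_reward`, `DistillationExample`) is a hypothetical placeholder, images and videos are stood in for by strings, and using the task text as the Executor's short prompt is an assumption. How the VLM scores feed into the RL update is not covered by this digest, so the sketch stops at computing the scores.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a real system would use image tensors and video clips.
Image = str
Video = str


@dataclass
class DistillationExample:
    image: Image
    short_prompt: str    # the only text the Executor sees (assumed: the task itself)
    target_video: Video  # Demonstrator output used as the distillation target


def build_distillation_set(
    images: List[Image],
    propose_task: Callable[[Image], Tuple[str, str]],  # VLM: image -> (task, detailed solution)
    demonstrator: Callable[[Image, str], Video],       # video model conditioned on the solution
) -> List[DistillationExample]:
    """Turn unlabeled scene images into (image, short prompt, target video) triples."""
    examples = []
    for image in images:
        task, solution = propose_task(image)
        # The Demonstrator is conditioned on the detailed step-by-step solution.
        video = demonstrator(image, solution)
        # The Executor will later be trained on the short prompt only.
        examples.append(DistillationExample(image, task, video))
    return examples


def score_rollouts(
    examples: List[DistillationExample],
    executor: Callable[[Image, str], Video],           # sees image + short prompt only
    vlm_reward: Callable[[Image, str, Video], float],  # VLM judge, assumed scalar score
) -> List[float]:
    """Generate one Executor rollout per example and score it with a VLM judge."""
    return [
        vlm_reward(ex.image, ex.short_prompt, executor(ex.image, ex.short_prompt))
        for ex in examples
    ]
```

The point of the split is that the detailed solution appears only on the Demonstrator side; the stored example carries the short prompt, so the Executor never receives the step-by-step text.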

It merits inclusion because it connects video generation, world models, and robotic task execution.

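One limitation is that the current evidence comes mainly from the preprint's own experiments and the authors' self-built evaluations; independent replication and validation in broader deployments are still needed.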

Abstract

Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models, or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a vision-language model generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor

