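Curation notes
This paper turns pretrained video generators into visual world models that can carry out tasks.
The core idea is to drive a Demonstrator with detailed step-by-step solutions, distill its behavior into an Executor that sees only the image and a short prompt, and then refine the Executor with reinforcement learning from VLM feedback.
It merits inclusion because it connects video generation, world models, and robotic task execution.
Its limitation is that the current evidence comes mainly from preprint experiments and the authors' own benchmarks; independent replication and validation in broader deployments are still needed.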
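The data flow described in these notes can be sketched in a few lines of Python. This is only an illustration of which component sees which input, not the authors' implementation: every name here (`TaskProposal`, `build_distillation_set`, `score_rollouts`, and the `stub_*` functions) is hypothetical, the models are replaced by string-returning stubs, and the way rewards feed into the actual RL update is not specified in the source, so the sketch stops at collecting them.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TaskProposal:
    """Hypothetical container for what the VLM proposes for one scene image."""
    short_prompt: str       # brief task statement, visible to the Executor
    detailed_solution: str  # step-by-step text, visible only to the Demonstrator


@dataclass
class DistillationExample:
    """One training pair: Executor inputs plus the Demonstrator's video as target."""
    image: str
    short_prompt: str
    target_video: str


def build_distillation_set(
    images: List[str],
    propose_task: Callable[[str], TaskProposal],
    demonstrator: Callable[[str, str], str],
) -> List[DistillationExample]:
    """Generate Demonstrator videos and pair them with Executor-side inputs."""
    examples = []
    for image in images:
        proposal = propose_task(image)
        # The Demonstrator is conditioned on the detailed solution.
        video = demonstrator(image, proposal.detailed_solution)
        # The stored example keeps only the image and the short prompt.
        examples.append(DistillationExample(image, proposal.short_prompt, video))
    return examples


def score_rollouts(
    examples: List[DistillationExample],
    executor: Callable[[str, str], str],
    vlm_reward: Callable[[str, str, str], float],
) -> List[float]:
    """Roll out the Executor and collect one VLM-assigned reward per example."""
    rewards = []
    for ex in examples:
        rollout = executor(ex.image, ex.short_prompt)
        rewards.append(vlm_reward(ex.image, ex.short_prompt, rollout))
    return rewards


# Stubs standing in for the real models (assumptions, for illustration only).
def stub_propose_task(image: str) -> TaskProposal:
    return TaskProposal(
        short_prompt=f"tidy {image}",
        detailed_solution=f"step 1: look at {image}; step 2: act",
    )


def stub_demonstrator(image: str, detailed_solution: str) -> str:
    return f"video({image} | {detailed_solution})"


def stub_executor(image: str, short_prompt: str) -> str:
    return f"video({image} | {short_prompt})"


def stub_vlm_reward(image: str, short_prompt: str, video: str) -> float:
    return 1.0 if short_prompt in video else 0.0
```

The point of the design is the information asymmetry: the detailed solution never appears in a `DistillationExample`, so whatever the Executor learns must be recoverable from the image and the short prompt alone.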
Original abstract with Chinese translation
Chinese translation
世界模型自蒸馏:训练世界模型解决通用任务。预训练视频生成器是很有前景的视觉世界模型,它们展现出涌现的任务解决能力;然而,它们对详细文本描述的依赖限制了其在规划和决策中的直接应用。现有方法要么将这种推理外包给语言模型或视觉-语言模型,要么依赖于使用配对的任务执行视频进行监督微调,而这些视频收集成本高昂且难以扩展。我们提出了一个可扩展的框架,通过将自蒸馏与强化学习相结合,在此类模型中激发任务解决能力。给定一张未标注的场景图像,一个视觉-语言模型会生成一个候选任务和一个详细的逐步解决方案。该解决方案会条件化一个预训练的视频扩散模型,即演示器(Demonstrator);我们将其行为蒸馏到一个执行器(Executor)中。
Original abstract
Pretrained video generators are promising visual world models that exhibit emergent task-solving abilities; however, their reliance on detailed textual descriptions limits their direct use for planning and decision-making. Existing approaches either outsource this reasoning to language or vision-language models, or rely on supervised fine-tuning with paired task-execution videos, which are costly to collect and difficult to scale. We propose a scalable framework that elicits task-solving ability in such models by combining self-distillation with reinforcement learning. Given an unlabeled scene image, a vision-language model generates a candidate task and a detailed step-by-step solution. The solution conditions a pretrained video diffusion model, the Demonstrator; we distill its behavior into an Executor