World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

Yi Yang; Zhihong Liu; Siqi Kou; Yiyang Chen; Yanzhe Hu; Jianbo Zhou; Boyuan Zhao; Zhijie Wei; Xiao Xia; Xueqi Li; Pengfei Liu; Zhijie Deng

Robotics and Embodied Intelligence, breakthrough-level

Published: 2026-06-04
arXiv: 2606.05979

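Key Points

Problem/Background: This arXiv paper proposes the World-Language-Action (WLA) model, which merges world modeling, language reasoning, and action synthesis into a single class of embodied foundation models.
Method/Mechanism: WLA takes text instructions, images, and robot states as input and jointly predicts textual subtasks, subgoal images, and robot actions. Its core is an autoregressive Transformer rather than the bidirectional diffusion Transformer common in WAMs.
Results/Evidence: The backbone predicts a next state made up of semantic-level textual intent and fine-grained physical dynamics, with the dynamics supervised through a dedicated World Expert. Meta-queries let world prediction influence action generation implicitly, so it can be disabled at inference or enabled for test-time scaling. The WLA-0 prototype (2B active parameters, 40 ms per inference on an NVIDIA RTX 5090) reports a 92.94% success rate on RoboTwin2.0 Clean and 56.5% on RMBench.
Why it is included: It directly bridges the WAM and VLA lines of work and offers a unified robot model interface spanning simulated and real environments, multi-task learning, long-horizon control, and learning from cross-embodiment videos.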
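
To make the input/output interface described above concrete, here is a minimal Python sketch. It is not the authors' code: the names `WLAInput`, `WLAOutput`, and `toy_wla_step`, the `predict_world` flag, and the `horizon` parameter are all hypothetical, and the "model" is a dummy that only mirrors the shape of the interface (instruction, image, and robot state in; subtask, actions, and an optional subgoal image out).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WLAInput:
    instruction: str             # text instruction
    image: List[List[float]]     # placeholder for a camera frame
    robot_state: List[float]     # e.g. joint positions (assumed layout)


@dataclass
class WLAOutput:
    subtask: str                 # predicted textual subtask
    actions: List[List[float]]   # predicted robot actions
    # Filled in only when world prediction is enabled.
    subgoal_image: Optional[List[List[float]]] = None


def toy_wla_step(inp: WLAInput, predict_world: bool = False,
                 horizon: int = 4) -> WLAOutput:
    """Hypothetical stand-in for one WLA inference call (not the real model)."""
    subtask = f"next step for: {inp.instruction}"
    # Dummy actions: repeat the current robot state `horizon` times.
    actions = [list(inp.robot_state) for _ in range(horizon)]
    # World prediction is optional at inference time; here it just copies
    # the input image as a fake "subgoal image".
    subgoal = [row[:] for row in inp.image] if predict_world else None
    return WLAOutput(subtask=subtask, actions=actions, subgoal_image=subgoal)
```

Calling `toy_wla_step(inp)` returns actions without a subgoal image, while `toy_wla_step(inp, predict_world=True)` also fills in `subgoal_image`. This mirrors the summary's point that world prediction can be switched off or on at inference.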

Abstract

We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the world modeling interface to learn from extensive egocentric videos as in the world-action model (WAM) and the language reasoning capacities to solve complex long-horizon tasks as in vision-language-action (VLA) models. At the core of WLA lies an autoregressive (AR) Transformer backbone, instead of a bidirectional diffusion Transformer as in WAMs, to predict the next state, comprising the semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by the world modeling objective based on a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA leverages meta-queries to make the world prediction implicitly impact the action generation so that the former can be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., 92.94% success rate on RoboTwin2.0 Clean and 56.5% success rate on RMBench. WLA-0 also holds the promise to learn novel tasks directly from cross-embodiment robot videos without action annotations.

Date: June 5, 2026
Code: https://github.com/SJTU-DENG-Lab/WLA
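
The abstract says world prediction can be activated for test-time scaling but does not say how. One common form of test-time scaling is best-of-N selection: sample several candidate actions, predict the outcome of each, and keep the one whose predicted outcome is closest to the subgoal. The sketch below illustrates that generic idea only. It is an assumption, not the WLA procedure, and `best_of_n`, `sample_action`, and `predict_next_state` are hypothetical names standing in for an action sampler and a world predictor.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_of_n(
    state: Vector,
    subgoal: Vector,
    sample_action: Callable[[Vector, random.Random], Vector],
    predict_next_state: Callable[[Vector, Vector], Vector],
    n: int = 8,
    seed: int = 0,
) -> Tuple[Vector, float]:
    """Return the sampled action whose predicted next state is nearest the subgoal.

    Generic best-of-N selection; assumed here, not taken from the paper.
    """
    rng = random.Random(seed)
    best_action: Vector = []
    best_score = float("inf")
    for _ in range(n):
        action = sample_action(state, rng)
        predicted = predict_next_state(state, action)  # stand-in world prediction
        score = squared_distance(predicted, subgoal)
        if score < best_score:
            best_action, best_score = action, score
    return best_action, best_score


# Toy stand-ins: random 2-D actions and additive dynamics.
def toy_sample_action(state: Vector, rng: random.Random) -> Vector:
    return [rng.uniform(-1.0, 1.0) for _ in state]


def toy_predict_next_state(state: Vector, action: Vector) -> Vector:
    return [s + a for s, a in zip(state, action)]
```

With the toy stand-ins, `best_of_n([0.0, 0.0], [1.0, 1.0], toy_sample_action, toy_predict_next_state, n=16)` never scores worse than the same call with `n=1`, because both runs share a seed and the larger run includes the smaller run's only sample. The trade-off is extra compute per control step, since every candidate needs its own world prediction.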

Links

Paper: arXiv:2606.05979
Code: https://github.com/SJTU-DENG-Lab/WLA
