Cosmos 3: Omnimodal World Models for Physical AI

Multimodal and Generative Systems, breakthrough tier

Published: 2026-06-01
arXiv: 2606.02800

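Key Points

Problem/Background: This technical report pushes world models beyond video generation or a single world-action model toward an omnimodal backbone: one model family handles language, image, video, audio, and action sequences, targeting embodied intelligence and Physical AI.
Method/Mechanism: Cosmos 3 uses a unified mixture-of-transformers architecture that supports flexible input-output combinations, merging the interfaces of vision-language models (VLMs), video generators, world simulators, and policy models into a single framework.
Results/Evidence: The report releases models, data, an evaluation benchmark, and code together, and claims new state-of-the-art results across understanding and generation tasks. At the time the report was written, its post-trained models were ranked the best open-source text-to-image and image-to-video models by Artificial Analysis and the best policy model by RoboArena. This system-level consolidation of world-model infrastructure is reusable for embodied intelligence, robot simulation, video generation, and physical-world reasoning.
Inclusion Value: Under the current curation rules, it counts as a high-value world-modeling systems paper. As a large technical report, however, its long-term impact still depends on the open weights, the reproducibility of its evaluations, and community adoption in real-robot and closed-loop simulation settings.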
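
To make the idea of "flexible input-output combinations" concrete, here is a minimal sketch of how the four interfaces named above could be described as modality configurations over one shared backbone. It is an illustration only, not the Cosmos 3 API: the names `Modality`, `TaskConfig`, `PRESETS`, and `matching_presets` are hypothetical, and the specific modality sets assigned to each preset are assumptions based on how these model types are conventionally defined. Audio is listed as a modality but left out of the presets, since the summary does not say which interface it pairs with.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    LANGUAGE = "language"
    IMAGE = "image"
    VIDEO = "video"
    AUDIO = "audio"
    ACTION = "action"


@dataclass(frozen=True)
class TaskConfig:
    """One input/output modality combination served by a shared backbone."""
    accepts: frozenset   # modalities the task may take as input
    produces: frozenset  # modalities the task can generate


def make_config(accepts, produces):
    return TaskConfig(frozenset(accepts), frozenset(produces))


M = Modality

# Hypothetical presets: the modality sets below are illustrative assumptions,
# not configurations taken from the technical report.
PRESETS = {
    "vision_language_model": make_config({M.IMAGE, M.VIDEO, M.LANGUAGE}, {M.LANGUAGE}),
    "video_generator": make_config({M.LANGUAGE, M.IMAGE}, {M.VIDEO}),
    "world_simulator": make_config({M.VIDEO, M.ACTION}, {M.VIDEO}),
    "world_action_model": make_config({M.VIDEO, M.LANGUAGE}, {M.VIDEO, M.ACTION}),
}


def matching_presets(provided_inputs, wanted_outputs):
    """Names of presets that accept every provided input and produce every wanted output."""
    provided = frozenset(provided_inputs)
    wanted = frozenset(wanted_outputs)
    return sorted(
        name
        for name, cfg in PRESETS.items()
        if provided <= cfg.accepts and wanted <= cfg.produces
    )


if __name__ == "__main__":
    # Past video plus candidate actions in, predicted video out.
    print(matching_presets({M.VIDEO, M.ACTION}, {M.VIDEO}))
    # Observation video plus an instruction in, actions out.
    print(matching_presets({M.VIDEO, M.LANGUAGE}, {M.ACTION}))
```

In a single-backbone design, a table like this would be the only thing that differs between tasks; the weights stay shared. Whether Cosmos 3 exposes its configurations this way is not stated in the summary.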

Abstract

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI, effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation’s OpenMDW-1.1 License at github.com/nvidia/cosmos and huggingface.co/collections/nvidia/cosmos3. The project website is available at research.nvidia.com/labs/cosmos-lab/cosmos3.

Synthetic datasets:
huggingface.co/datasets/nvidia/PhysicalAI-WorldModel-Synthetic-Physical-Interaction-Scenes
huggingface.co/datasets/nvidia/PhysicalAI-WorldModel-Synthetic-Embodied-Robot-Scenes
huggingface.co/datasets/nvidia/PhysicalAI-WorldModel-Synthetic-Autonomous-Driving-Scenarios
huggingface.co/datasets/nvidia/PhysicalAI-WorldModel-Synthetic-Digital-Human-Scenes
huggingface.co/datasets/nvidia/PhysicalAI-WorldModel-Synthetic-Warehouse-Operations-Scenes
