InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Ziang Yan; Sheng Xia; Jiashuo Yu; Yue Wu; Tianxiang Jiang; Songze Li; Kanghui Tian; Yicheng Xu; Yinan He; Kai Chen; Limin Wang; Yu Qiao; Yi Wang

Multimodal foundation models; breakthrough-level

Published: 2026-06-10
arXiv: 2606.12195

Key Points

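Problem/Background: InternVideo3 moves long-video understanding beyond single-turn question answering toward agentic multimodal reasoning: the model observes, reasons, calls tools, and updates memory over a continuously evolving context.
Method/Mechanism: The core mechanism is Multimodal Contextual Reasoning (MCR), which frames long-video understanding as a closed loop of evidence accumulation and verification.
Results/Evidence: To reduce long-context cost, the paper proposes M2LA, a token-preserving reparameterization of KV-cache states, combined with continued pretraining, SFT, rule-based RL, and on-policy distillation. The abstract reports strong performance among open video models, with especially notable gains on Video-MME, MLVU, and EgoSchema.
Why it is included: It connects video foundation models, long-range memory, tool retrieval, and multimodal agent behavior, making it an important system-level example of the agentification of open multimodal models.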
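
The KV-cache point above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is not M2LA itself, whose internals this summary does not spell out; it assumes a generic latent-attention layout in which each layer caches one shared latent vector per token instead of separate per-head keys and values. All dimensions (32 layers, 32 heads, head size 128, latent size 512, 100,000 tokens) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Hypothetical model dimensions; none of these come from the paper."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # e.g. 16-bit storage


def full_kv_cache_bytes(shape: AttentionShape, num_tokens: int) -> int:
    """Standard cache: one key and one value vector per head, layer, and token."""
    per_token = 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim
    return per_token * num_tokens * shape.bytes_per_value


def latent_kv_cache_bytes(shape: AttentionShape, num_tokens: int, latent_dim: int) -> int:
    """Latent cache: one shared latent vector per layer and token.

    Keys and values would be re-derived from the latent by learned
    up-projections at attention time (an assumption of this sketch).
    """
    per_token = shape.num_layers * latent_dim
    return per_token * num_tokens * shape.bytes_per_value


shape = AttentionShape(num_layers=32, num_kv_heads=32, head_dim=128)
num_tokens = 100_000  # every token is kept; only the per-token state shrinks
full = full_kv_cache_bytes(shape, num_tokens)
latent = latent_kv_cache_bytes(shape, num_tokens, latent_dim=512)
print(f"full cache:   {full / 2**30:.2f} GiB")
print(f"latent cache: {latent / 2**30:.2f} GiB")
print(f"compression:  {full / latent:.1f}x")
```

The token count is the same in both functions, which is what "token-preserving" suggests: the saving comes from a smaller cached state per token, not from dropping or merging video tokens.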

Abstract

Recent progress in foundation models has increasingly shifted from one-shot prediction toward agentic behavior, where models solve tasks through multi-step reasoning, tool use, memory, and self-correction. However, much of the open-source momentum has centered on text-dominant settings such as coding, search, and long-context tool use, while long-horizon multimodal tasks remain comparatively underexplored. This gap is especially visible in video, where real-world tasks often require sustained temporal understanding, visually grounded evidence gathering, and iterative interaction with external tools or memory rather than a single-pass question-answering step. We present InternVideo3, a framework for improving such capabilities through Multimodal Contextual Reasoning (MCR), a formulation that treats multimodal understanding as a closed-loop process over a shared evolving context. In MCR, multimodal observations, instructions, intermediate reasoning, tool actions, feedback, and memory are all represented within a unified context that is updated over time. This makes long-video understanding a process of evidence accumulation, belief revision, and verification, and provides a practical abstraction for multimodal agentic behavior. To make such long-horizon rollouts efficient, we introduce Multimodal Multi-head Latent Attention (M2LA), a token-preserving attention reparameterization that compresses KV-cache states while retaining the full multimodal token stream. We further develop a staged training recipe consisting of continued pretraining after M2LA conversion, short-to-long supervised fine-tuning for video, rule-based reinforcement learning on verifiable tasks, and on-policy distillation from stronger teacher models. Experiments on short-video and long-video benchmarks show that InternVideo3 achieves strong performance among open video models, with especially notable gains on long-horizon tasks such as Video-MME, MLVU, and EgoSchema. We also instantiate the model as a video agent with retrieval and verification tools, illustrating how recursive multimodal reasoning can support more robust evidence-grounded behavior. Overall, our results suggest that efficient context handling and closed-loop multimodal reasoning are important ingredients for adapting open multimodal models toward long-horizon visually grounded agency.
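
The closed loop described in the abstract can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the InternVideo3 agent: the "video" is a hand-written dictionary of captions, and `Context`, `retrieve`, `verify`, and `answer_when` are hypothetical names invented here. It shows only the shape of the idea, in which instructions, tool actions, feedback, reasoning, and memory all land in one evolving context and an answer is committed only after verification.

```python
from dataclasses import dataclass, field

# Toy "video": segment start time (seconds) -> caption. Entirely made up.
VIDEO = {
    0: "a person enters the kitchen",
    60: "the person chops onions",
    120: "the person puts a pan on the stove",
    180: "the person plates the food",
}


@dataclass
class Context:
    """Single evolving context holding every kind of entry."""
    entries: list = field(default_factory=list)

    def add(self, kind: str, content) -> None:
        self.entries.append((kind, content))

    def of_kind(self, kind: str) -> list:
        return [content for k, content in self.entries if k == kind]


def retrieve(keyword: str) -> list:
    """Hypothetical retrieval tool: segments whose caption mentions the keyword."""
    return [(t, c) for t, c in sorted(VIDEO.items()) if keyword in c]


def verify(timestamp: int, keyword: str) -> bool:
    """Hypothetical verification tool: re-check one segment against the claim."""
    return keyword in VIDEO.get(timestamp, "")


def answer_when(keyword: str, max_steps: int = 4):
    ctx = Context()
    ctx.add("instruction", f"When does '{keyword}' happen?")
    for _ in range(max_steps):
        evidence = ctx.of_kind("feedback")
        if not evidence:
            # No evidence yet: act by calling the retrieval tool.
            ctx.add("tool_action", ("retrieve", keyword))
            hits = retrieve(keyword)
            if not hits:
                ctx.add("reasoning", "retrieval found nothing; giving up")
                return None, ctx
            ctx.add("feedback", hits)
            continue
        # Evidence exists: form a candidate answer, then verify before committing.
        timestamp, _caption = evidence[-1][0]
        ctx.add("reasoning", f"candidate answer: {timestamp}s")
        ctx.add("tool_action", ("verify", timestamp))
        if verify(timestamp, keyword):
            ctx.add("memory", {keyword: timestamp})
            return timestamp, ctx
        ctx.add("reasoning", "verification failed; no answer committed")
        return None, ctx
    return None, ctx


answer, ctx = answer_when("pan")
print(answer)
print([kind for kind, _ in ctx.entries])
```

In a real system the reasoning step would be a model call over the whole context and the tools would operate on video frames; the point of the sketch is only that every step reads from and writes to the same context.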

Links

Paper link
