Linear Scaling Video VLMs for Long Video Understanding

Multimodal foundation models · Breakthrough tier

Published: 2026-05-29
arXiv: 2605.31598

Key Points

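Problem/Background: StateKV targets the core bottleneck of long-video and streaming VLMs: video prefill cost grows quadratically with the number of frames, which makes long-video understanding and online systems too expensive.
Method/Mechanism: A fixed-capacity recurrent state stores cross-frame context, and a full per-frame cache is retained for decoding. This yields linear-time video prefill with no fine-tuning and no architecture changes.
Inclusion value: StateKV is a reusable video memory / state management primitive that could carry over to long-video understanding, streaming multimodal agents, video world models, and edge deployment.
Risks and limitations: This is still the first arXiv version, and the core conclusions need further replication across models, environments, and real deployment scenarios. It is therefore graded breakthrough rather than disruptive/paradigm.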
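
To make the quadratic versus linear claim concrete, here is a back-of-the-envelope count of attention (query, key) pairs during video prefill. The symbols T (number of frames), n (visual tokens per frame), and C (state capacity) are assumptions introduced for illustration and are not taken from the paper. The count ignores constant factors, text tokens, and any cost of scoring importance.

```latex
\begin{align*}
\text{full self-attention:}\quad
  & \sum_{t=1}^{T} n \cdot (t\,n) = n^{2}\,\frac{T(T+1)}{2} = O(n^{2} T^{2}) \\
\text{fixed-capacity state:}\quad
  & \sum_{t=1}^{T} n \cdot (C + n) = T\,n\,(C + n) = O(T) \quad \text{for fixed } n,\ C
\end{align*}
```

In the first line each token of frame t attends to every token of frames 1 through t. In the second line it attends to at most C state tokens plus the n tokens of its own frame, so the per-frame cost no longer depends on t.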

Abstract

StateKV is an inference-time method for adapting pretrained long-video VLMs to linear-time video prefill by storing cross-frame context in a fixed-capacity, importance-based recurrent state, while retaining a full per-frame cache for decoding. Across long-video benchmarks and multiple model families, it stays close to full self-attention and improves over sliding-window or recency-based approximations without fine-tuning.
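
The mechanism described in the abstract can be sketched as a toy simulation. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Token` class, the function names, and the rule "keep the highest-scoring tokens" are all hypothetical, and the importance scores are passed in as plain numbers because the abstract does not say how importance is computed. The sketch only tracks which tokens are kept and how many (query, key) pairs prefill touches.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    frame: int         # index of the frame the token came from
    index: int         # position of the token inside that frame
    importance: float  # hypothetical score; the scoring rule is an assumption


def statekv_like_prefill(frames, capacity):
    """Toy prefill over `frames`, a list of per-frame importance-score lists.

    Returns (state, per_frame_cache, attended_pairs).
    """
    state = []            # fixed-capacity cross-frame state
    per_frame_cache = []  # every frame's tokens, retained for decoding
    attended_pairs = 0    # (query, key) pairs touched during prefill
    for t, scores in enumerate(frames):
        tokens = [Token(t, i, s) for i, s in enumerate(scores)]
        # Each current-frame token sees the state plus its own frame only.
        attended_pairs += len(tokens) * (len(state) + len(tokens))
        per_frame_cache.append(tokens)
        # Recurrent update: keep the `capacity` most important tokens overall.
        state = heapq.nlargest(capacity, state + tokens,
                               key=lambda tok: tok.importance)
    return state, per_frame_cache, attended_pairs


def full_attention_pairs(frames):
    """Pairs touched when every token attends to all frames seen so far."""
    seen = 0
    pairs = 0
    for scores in frames:
        seen += len(scores)
        pairs += len(scores) * seen
    return pairs


frames = [[0.1, 0.9], [0.5, 0.2], [0.8, 0.3]]
state, cache, pairs = statekv_like_prefill(frames, capacity=2)
print([tok.importance for tok in state])    # [0.9, 0.8]
print(pairs, full_attention_pairs(frames))  # 20 24
```

Once the state is full, every additional frame adds the same number of pairs, which is where the linear growth comes from. The per-frame cache is never evicted in this sketch, mirroring the abstract's statement that the full per-frame cache is retained for decoding.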

