MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

Yu Chen; Runkai Chen; Sheng Yi; Xinda Zhao; Xiaohong Li; Jianjin Zhang; Jun Sun; Chuanrui Hu; Yunyun Han; Lidong Bing; Yafeng Deng; Tianqiao Chen

Reasoning, Memory, and Inference-Time Control · Rating: Breakthrough

Published: 2026-03-06
arXiv: 2603.23516

Curation Notes

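Long-range memory has long been a hard bottleneck in scaling the capabilities of general-purpose models. On the conventional full-attention route, once context length reaches the million-token range, both compute and KV cache costs quickly spiral out of control. RAG, external memory agents, and fixed-state models sidestep part of the length limit, but often at the price of lower accuracy, ballooning latency, memory that cannot be edited, or a lack of end-to-end optimization. MSA takes direct aim at the question of how to give the model itself lifetime-scale intrinsic memory.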
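To make the KV cache pressure concrete, here is a rough back-of-the-envelope estimate. The model shape used (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen only for illustration; it is not taken from the MSA paper, and the function name `kv_cache_bytes` is likewise an assumption.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens under full attention.

    All default sizes are hypothetical, not the MSA model's configuration.
    """
    # Factor 2: one key vector and one value vector per token, layer, and KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


for tokens in (16_000, 1_000_000, 100_000_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>11,} tokens -> {gib:,.1f} GiB of KV cache")
```

Under these assumed values each token costs 128 KiB of cache, so an uncompressed cache for 100M tokens would need roughly 12 TiB. That gap is what makes some combination of sparsity, compression, and sharding necessary at this scale.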

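The paper proposes Memory Sparse Attention, which turns long-context scaling into a complete end-to-end memory model approach. Its core components are a trainable, scalable sparse attention; document-wise RoPE for ultra-long documents; an ultra-long inference scheme that combines KV cache compression with Memory Parallel; and Memory Interleaving, which supports multi-hop reasoning across discrete memory segments. The paper reports less than 9% performance degradation when scaling from 16K to 100M tokens, and results that surpass frontier LLMs, RAG systems, and memory agents on long-context benchmarks.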
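This summary does not spell out how MSA implements these components, so the following is only a toy sketch of the general idea, under two stated assumptions: that "document-wise" positions restart at zero for each document, and that sparse attention first routes the query to a few documents before attending to their tokens. The mean-key routing rule and the names `document_wise_positions` and `sparse_memory_attention` are hypothetical and should not be read as the paper's method.

```python
import math


def document_wise_positions(doc_lengths):
    """Position ids that restart at 0 for every document (an assumed reading)."""
    return [pos for length in doc_lengths for pos in range(length)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_memory_attention(query, docs, top_k=2):
    """Attend only to tokens of the top_k documents ranked by a summary score.

    docs: list of documents, each a list of (key, value) vector pairs.
    Returns (selected document indices, attention output vector).
    """
    # Hypothetical routing: summarise each document by the mean of its keys.
    summaries = []
    for doc in docs:
        dim = len(doc[0][0])
        summaries.append([sum(k[i] for k, _ in doc) / len(doc) for i in range(dim)])

    # Rank documents by query-summary similarity and keep the top_k.
    ranked = sorted(range(len(docs)),
                    key=lambda i: dot(query, summaries[i]), reverse=True)
    selected = sorted(ranked[:top_k])

    # Scaled dot-product attention restricted to the selected documents.
    pairs = [kv for i in selected for kv in docs[i]]
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k, _ in pairs]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    out_dim = len(pairs[0][1])
    output = [sum(w * v[i] for w, (_, v) in zip(weights, pairs)) / total
              for i in range(out_dim)]
    return selected, output


docs = [
    [([1.0, 0.0], [1.0, 0.0]), ([0.9, 0.1], [1.0, 0.0])],  # document 0
    [([0.0, 1.0], [0.0, 1.0])],                            # document 1
    [([-1.0, 0.0], [0.0, -1.0])],                          # document 2
]
selected, output = sparse_memory_attention([1.0, 0.0], docs, top_k=1)
print(document_wise_positions([2, 1, 1]))
print(selected, output)
```

In this sketch the routing step costs one score per document and the attention step touches only the tokens of the selected documents, so per-query work grows with the number of documents rather than with the square of the total token count. Restarting positions per document also keeps position indices bounded by document length instead of by total memory size.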

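This work merits formal inclusion because it is more than a sparse attention trick: it reorganizes the relationship between memory capacity and reasoning at the model level. Unlike methods that rely solely on external retrieval or agent glue, MSA offers an end-to-end trainable route to intrinsic ultra-long memory. Nor does it stop at a paper demo: EverMind's subsequent EverMemOS, EverMemBench, and related engineering projects are clearly built around it, which suggests the approach is starting to pull a broader line of work along with it.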

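It was not rated disruptive because, at this stage, the strong evidence still comes mainly from the authors' own ecosystem and official evaluations. The adoption signal from the surrounding projects is strong, but whether MSA becomes a default memory interface adopted more widely by the community depends on independent reproduction, integration into external systems, and whether more teams beyond the authors build long-term work around this route.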

Original Abstract

Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information, reaching hundreds of millions of tokens, remains a longstanding pursuit in the field. Due to the constraints of full-attention architectures, the effective context length of large language models (LLMs) is typically limited to 1M tokens. Existing explorations, such as hybrid linear attention, fixed-size memory states (e.g., RNNs), and external storage methods like RAG or agent systems, attempt to extend this limit. However, these approaches often suffer from severe precision degradation and rapidly increasing latency as context length grows, an inability to dynamically modify memory content, or a lack of end-to-end optimization. These bottlenecks impede complex scenarios like large-corpus summarization, Digital Twins with stable personas, and long-history agent reasoning, while limiting memory capacity and slowing inference. We present Memory Sparse Attention (MSA), an end-to-end trainable, efficient, and massively scalable memory model framework. Through core innovations including scalable sparse attention architecture and document-wise RoPE, MSA achieves linear complexity in both training and inference while maintaining exceptional precision stability, exhibiting less than 9% degradation when scaling from 16K to 100M tokens. Furthermore, KV cache compression, combined with Memory Parallel during inference, enables 100M tokens inference on 2×A800 GPUs. In addition, we propose a Memory Interleaving mechanism that effectively facilitates complex multi-hop reasoning across scattered memory segments. MSA significantly surpasses frontier language models, state-of-the-art (SOTA) RAG systems, and leading memory agents in long-context benchmarks. These results demonstrate that by decoupling memory capacity from reasoning, MSA provides a scalable foundation to endow general-purpose models with intrinsic, lifetime-scale memory.

Links

Paper link