MiniMax Sparse Attention

Xunhao Lai; Weiqi Xu; Yufeng Yang; Qiaorui Chen; Yang Xu; Lunbin Zeng; Xiaolong Li; Haohai Sun; Haichao Zhu; Vito Zhang; Pengyu Zhao

Theory, Robustness, and Core Machine Learning · Breakthrough tier

Published: 2026-06-11
arXiv: 2606.13392

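Key Points

Problem/Background: MiniMax Sparse Attention (MSA) targets million-token contexts and builds sparse attention as a GQA-compatible two-branch structure.
Method/Mechanism: The Index Branch works at block granularity and selects the Top-k KV blocks for each GQA group; the Main Branch then performs exact block-sparse attention over only those blocks.
Results/Evidence: The design aims to preserve both model quality and deployment efficiency at very long context lengths; MiniMax-M3 reports significant prefill and decode speedups at 1M context.
Why it is included: The cost bottleneck of long-context memory and agentic workflows depends heavily on attention system design, and MSA is a sparse attention primitive aimed directly at large-model deployment.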
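The two-branch mechanism can be illustrated with a minimal pure-Python sketch. It is not the MSA kernel: the function names `select_blocks` and `sparse_attention`, the per-block score list `index_scores`, and the choice to count the always-kept local block toward the k budget are all assumptions made for illustration. It handles a single query and a single head; in a GQA setting, every query head in the group would reuse the same selected blocks.

```python
import math

def select_blocks(index_scores, query_block, k):
    """Pick up to k causal KV blocks for one query and one GQA group.

    index_scores[b] is a hypothetical per-block score from the index head.
    The query's own (local) block is always kept, regardless of its score.
    Assumption: the local block counts toward the k budget.
    """
    # Only blocks strictly before the query's block compete on score.
    others = [b for b in range(query_block)]
    others.sort(key=lambda b: index_scores[b], reverse=True)
    return sorted([query_block] + others[:max(k - 1, 0)])

def sparse_attention(q, keys, values, selected, block_size, query_pos):
    """Exact softmax attention restricted to tokens in the selected blocks."""
    positions = [
        t
        for b in selected
        for t in range(b * block_size, min((b + 1) * block_size, len(keys)))
        if t <= query_pos  # causal mask, relevant inside the local block
    ]
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(qi * ki for qi, ki in zip(q, keys[t])) for t in positions]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[t][d] for w, t in zip(weights, positions)) / total
        for d in range(dim)
    ]

# Usage: 5 blocks scored by the index head, query sits in block 4, budget k = 3.
blocks = select_blocks([0.1, 0.9, 0.3, 0.7, -5.0], query_block=4, k=3)
```

When every causal block is selected, `sparse_attention` reduces to ordinary dense causal attention, which is the sense in which the Main Branch is "exact" on the blocks it keeps.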

Original Abstract

Figure 1 | Overview of MSA. The Index Branch (left) scores the full causal context with a single lightweight head and selects, for each query and GQA group, a set I of 𝑘 key blocks; the local block is always included regardless of its score. The Main Branch (right) attends only to the selected blocks and produces the layer output. During training, a KL loss aligns the index distribution with the group-averaged Main Branch distribution on the selected blocks, and the Index Branch gradient is detached from the Main Branch.
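The training-time alignment described in the caption can be sketched as follows. This is a hedged illustration, not the paper's loss: the function names, the KL direction (Main Branch target against index distribution), and the assumption that both distributions are expressed at block level over the same selected blocks are choices made here. Pure Python has no autodiff, so the detachment is only noted in a comment; in an autodiff framework the Main Branch target would be wrapped in a stop-gradient so that this loss updates only the Index Branch.

```python
import math

def softmax(xs):
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl_alignment_loss(index_logits, main_head_probs):
    """KL(target || index) over the selected blocks of one query.

    index_logits: hypothetical index-head scores for the selected blocks only.
    main_head_probs: one block-level distribution per query head in the GQA
        group, each over the same selected blocks.
    """
    n_heads = len(main_head_probs)
    # Group-averaged Main Branch distribution. Treated as a constant target:
    # in a real framework this is where the gradient would be detached.
    target = [
        sum(head[b] for head in main_head_probs) / n_heads
        for b in range(len(index_logits))
    ]
    index = softmax(index_logits)
    # Terms with zero target probability contribute nothing to the KL sum.
    return sum(
        t * (math.log(t) - math.log(p))
        for t, p in zip(target, index)
        if t > 0.0
    )

# Usage: two heads in the group, two selected blocks.
loss = kl_alignment_loss([math.log(3.0), 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The loss is zero when the index distribution already matches the group average and grows as the index head concentrates on blocks that the Main Branch heads, on average, do not attend to.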
