ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

Joe Sharratt

Reasoning, memory, and inference-time control · Breakthrough-level

Curation and commentary: DAST AI

Published: 2026-05-21
arXiv: 2605.23081

Commentary

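ThriftAttention addresses the cost of attention in long-context inference: computing everything in FP16 is expensive, while uniform FP4 quantisation degrades quality, with the damage concentrated in the few attention blocks that contain the most important tokens.

The paper proposes selective mixed precision: a heuristic picks a small number of important query-key block pairs to compute in FP16, the remaining blocks are computed in FP4, and the two paths are merged via online softmax. This lowers memory-bandwidth and compute cost relative to full FP16 attention.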
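The two-stage scheme described in the abstract can be illustrated with a small pure-Python sketch for a single query. It is not the authors' implementation: `fake_fp4` only simulates block-scaled FP4 rounding on the E2M1 grid, `select_blocks` uses a hypothetical "query dot mean key" score because the abstract does not specify the paper's heuristic, and all function names and the toy data are assumptions made for illustration.

```python
import math
import random

# Magnitudes representable in FP4 (E2M1); used only to simulate rounding.
E2M1 = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def fake_fp4(vec):
    """Simulate block-scaled FP4: one scale per vector, values snapped to the E2M1 grid."""
    scale = max(abs(x) for x in vec) / 6.0 or 1.0
    return [math.copysign(min(E2M1, key=lambda g: abs(abs(x) / scale - g)), x) * scale
            for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_blocks(q, keys, block, frac):
    """Hypothetical stage-1 heuristic: rank key blocks by q . mean(key block)."""
    n_blocks = math.ceil(len(keys) / block)
    def score(b):
        chunk = keys[b * block:(b + 1) * block]
        return dot(q, [sum(col) / len(chunk) for col in zip(*chunk)])
    ranked = sorted(range(n_blocks), key=score, reverse=True)
    return set(ranked[:round(frac * n_blocks)])

def mixed_attention(q, keys, values, block=4, frac=0.05):
    """Stage 2 for one query: exact math on selected blocks, simulated FP4 elsewhere."""
    selected = select_blocks(q, keys, block, frac)
    inv_sqrt_d = 1.0 / math.sqrt(len(q))
    q4 = fake_fp4(q)
    m, l, acc = -math.inf, 0.0, [0.0] * len(values[0])  # running max, denominator, numerator
    for b in range(math.ceil(len(keys) / block)):
        ks, vs, qb = keys[b * block:(b + 1) * block], values[b * block:(b + 1) * block], q
        if b not in selected:  # low-precision path
            qb, ks, vs = q4, [fake_fp4(k) for k in ks], [fake_fp4(v) for v in vs]
        logits = [dot(qb, k) * inv_sqrt_d for k in ks]
        m_new = max(m, max(logits))
        rescale = math.exp(m - m_new)  # rescales earlier partial sums to the new running max
        w = [math.exp(s - m_new) for s in logits]
        l = l * rescale + sum(w)
        acc = [a * rescale + sum(wi * v[j] for wi, v in zip(w, vs)) for j, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]

# Toy data: 80 keys, one of which is a "needle" that dominates the softmax.
rng = random.Random(0)
q = [2.0] * 4
keys = [[rng.uniform(-0.25, 0.25) for _ in range(4)] for _ in range(80)]
values = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(80)]
keys[37], values[37] = [3.0] * 4, [1.0, 0.37, -0.81, 0.55]

exact = mixed_attention(q, keys, values, frac=1.0)  # every block on the exact path

def max_err(out):
    return max(abs(a - b) for a, b in zip(out, exact))

err_fp4 = max_err(mixed_attention(q, keys, values, frac=0.0))     # every block in simulated FP4
err_mixed = max_err(mixed_attention(q, keys, values, frac=0.05))  # 1 of 20 blocks exact
print(f"all-FP4 error: {err_fp4:.4f}, mixed error: {err_mixed:.4f}")
```

Because both paths feed the same running maximum, denominator, and numerator, the merge needs no second pass over the keys. In this toy setup the needle's block is the only one kept exact, which is enough to remove most of the output error; real workloads and the paper's actual heuristic will behave differently.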

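The method belongs to long-context inference infrastructure and directly serves context-heavy workloads such as long-horizon agents, RAG, codebase understanding, and scientific literature analysis.

It merits inclusion because low-bit attention is an important direction for bending the inference cost curve, and ThriftAttention offers a finer-grained control primitive than global quantisation.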

Abstract

Efficient attention algorithms are critical to mitigate the quadratic cost of attention in long-context workloads. Prior work utilises block-scaled quantisation techniques on Blackwell GPUs to move attention computation to 4-bit precision to accelerate inference. However, these techniques result in significant quality degradation in long-context settings. We show that the output impact of quantisation error is highly non-uniform and increases with the importance of each query-key interaction, concentrating functionally relevant error in a small number of attention blocks that contain the most important tokens. We propose ThriftAttention, a low-bit attention variant that delivers near-FP16 long-context quality at FP4 inference efficiency. This approach proceeds in two stages. (1) A heuristic rapidly selects a small number of important query-key block pairs for FP16 precision. (2) The selected blocks are computed in FP16 and the remaining blocks in FP4, with both paths merged via online softmax into a single output. We demonstrate across long-context benchmarks and model families that by computing only 5% of query-key blocks in FP16, ThriftAttention recovers on average 89.1% of the FP4→FP16 performance gap. We show ThriftAttention’s advantage grows with sequence length, mitigating the systematic FP4 quality degradation observed at longer contexts. The code is available at https://github.com/joesharratt1229/ThriftAttention.
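The "89.1% of the FP4→FP16 performance gap" figure can be read as a ratio of score differences. The helper below shows that reading; the function name, the formula, and the example scores are assumptions for illustration, and the paper may define or average the quantity differently.

```python
def gap_recovered(score_fp4, score_mixed, score_fp16):
    """Fraction of the FP4 -> FP16 score gap closed by the mixed-precision run."""
    gap = score_fp16 - score_fp4
    if gap == 0:
        raise ValueError("FP4 and FP16 scores are identical; the gap is undefined")
    return (score_mixed - score_fp4) / gap

# Made-up benchmark scores, not results from the paper.
print(f"{gap_recovered(60.0, 68.0, 70.0):.0%}")  # prints 80%
```

Under this reading, a value of 1.0 means the mixed-precision run matches FP16 on that benchmark and 0.0 means it is no better than plain FP4.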
