LinearARD: Linear-Memory Attention Distillation for RoPE Restoration

Reasoning, Memory, and Inference-Time Control · Breakthrough tier

Published: 2026-03-09
arXiv: 2604.00004

Inclusion Notes

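Long-context extension usually works by scaling the positional encoding and then running continual pretraining, but this pipeline often comes at the cost of degraded short-text capability. The real difficulty is not just stretching the context; it is moving the model smoothly from native RoPE to the long-context regime without disrupting its native attention dynamics.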

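The paper proposes LinearARD, which uses a frozen native-RoPE teacher to perform attention-structure distillation on a RoPE-scaled student. Rather than aligning hard-to-interpret hidden states, it directly aligns the row-wise distributions of the Q/Q, K/K, and V/V self-relation matrices, restoring the model at the level of attention dynamics. To avoid the quadratic memory cost of the relation matrices, the authors further design a linear-memory kernel that computes the KL divergence and its gradients exactly, using per-token log-sum-exp statistics and logit recomputation in the backward pass.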
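The linear-memory idea can be illustrated with a small pure-Python sketch. It is not the authors' kernel. It assumes a single relation type (for example K/K), a 1/sqrt(d) logit scale, no causal mask, and a KL direction of teacher to student; none of these details are confirmed by this summary. The function names `row_lse`, `relation_kl`, and `relation_kl_dense` are hypothetical. For each token, a first pass streams over the row to get its log-sum-exp, and a second pass recomputes the logits to accumulate the KL term, so no n-by-n matrix is ever stored.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def row_lse(x, i, scale):
    """Streaming log-sum-exp of logits scale * <x_i, x_j> over all j."""
    m, s = -math.inf, 0.0
    for xj in x:
        logit = scale * dot(x[i], xj)
        new_m = max(m, logit)
        # Rescale the running sum whenever the running max changes.
        s = s * math.exp(m - new_m) + math.exp(logit - new_m)
        m = new_m
    return m + math.log(s)

def relation_kl(teacher, student):
    """Mean row-wise KL(P_i || Q_i) between self-relation softmax rows.

    Only one scalar statistic per row is kept; logits are recomputed
    in the second pass instead of being stored (assumed setup, see lead-in).
    """
    n, d = len(teacher), len(teacher[0])
    scale = 1.0 / math.sqrt(d)  # assumption: attention-style scaling
    total = 0.0
    for i in range(n):
        lse_t = row_lse(teacher, i, scale)
        lse_s = row_lse(student, i, scale)
        for j in range(n):
            log_p = scale * dot(teacher[i], teacher[j]) - lse_t
            log_q = scale * dot(student[i], student[j]) - lse_s
            total += math.exp(log_p) * (log_p - log_q)
    return total / n

def relation_kl_dense(teacher, student):
    """Reference version that materializes both full n-by-n matrices."""
    def softmax_rows(x):
        scale = 1.0 / math.sqrt(len(x[0]))
        rows = []
        for xi in x:
            logits = [scale * dot(xi, xj) for xj in x]
            m = max(logits)
            exps = [math.exp(l - m) for l in logits]
            z = sum(exps)
            rows.append([e / z for e in exps])
        return rows
    P, Q = softmax_rows(teacher), softmax_rows(student)
    return sum(p * math.log(p / q)
               for pr, qr in zip(P, Q)
               for p, q in zip(pr, qr)) / len(P)

# Toy inputs: 3 tokens, 2 dimensions (made-up values for illustration).
T = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]
S = [[0.0, 0.2], [0.3, -0.1], [0.4, 0.7]]
print(relation_kl(T, S), relation_kl_dense(T, S))
```

The two functions should agree up to floating-point error, which is the sense in which the streaming computation is exact rather than approximate. A backward pass could follow the same pattern, recomputing logits from the saved per-row statistics, though the sketch does not implement gradients.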

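This work merits inclusion because it advances long-context restoration from coarse continued training to a more structured, more data-efficient distillation scheme. In the 4K-to-32K extension setting, it restores short-text capability and preserves long-context performance with far fewer training tokens than existing methods, and it offers clear reuse value for context extension, attention supervision, and low-budget long-context adaptation.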

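It was not placed a tier higher because, for now, it mainly delivers a large advance on the well-defined subproblem of RoPE restoration rather than rewriting the broader long-context paradigm. It is a very strong methods paper, but its scope of impact remains relatively focused.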

Links

Paper link