Linearizing Vision Transformer with Test-Time Training

Reasoning, Memory, and Inference-Time Control | Breakthrough tier

Published: 2026-05-28
arXiv: 2605.02772

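Key Points

Problem/Background: The paper turns TTT from a standalone architectural direction into a bridge for linearizing pretrained Transformers: rather than training linear attention from scratch, it inherits the pretrained softmax attention weights.
Method/Mechanism: The method combines architectural alignment with representational alignment. It uses the two-layer dynamic formulation of TTT to align with softmax attention, then adds key instance normalization and a locality module to close the remaining differences in behavior between the two.
Why it is included: It offers a reusable model-conversion workflow that turns existing heavyweight vision and diffusion Transformers into linear-complexity TTT models, which bears on inference efficiency, test-time architecture, and long-context scaling.
Risks and limitations: This is still an initial arXiv version, and its core conclusions need further replication across models, environments, and real deployment settings. It is therefore graded breakthrough rather than disruptive/paradigm.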
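
To make the mechanism in the key points concrete, here is a minimal NumPy sketch. It is not the paper's implementation. It contrasts standard softmax attention, whose N x N score matrix makes the cost quadratic in the token count N, with a TTT-style layer: a two-layer inner model is updated by one gradient step on the keys and values, then applied to the queries, so the cost grows linearly in N. The function names, the ReLU nonlinearity, the single non-causal gradient step, the squared-error inner loss, the learning rate, and the reading of key instance normalization as per-channel normalization over tokens are all assumptions made for illustration. The locality module is omitted.

```python
import numpy as np


def instance_norm_keys(k, eps=1e-6):
    """Assumed form of key instance normalization: standardize each key
    channel over the token axis of a single sample. k has shape (N, D)."""
    mean = k.mean(axis=0, keepdims=True)
    std = k.std(axis=0, keepdims=True)
    return (k - mean) / (std + eps)


def softmax_attention(q, k, v):
    """Standard softmax attention. The (N, N) score matrix makes the cost
    quadratic in the number of tokens N."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def ttt_two_layer(q, k, v, w1, w2, lr=0.1):
    """Illustrative TTT-style layer with inner model f(x) = relu(x @ w1) @ w2.

    One gradient step on the inner loss (1 / 2N) * sum ||f(k_i) - v_i||^2
    adapts the weights to this sample, then the adapted model is applied
    to the queries. Every matrix product is O(N * D * H), linear in N.
    """
    n = k.shape[0]
    h_pre = k @ w1                       # (N, H) hidden pre-activations
    h = np.maximum(h_pre, 0.0)           # ReLU
    err = h @ w2 - v                     # (N, D_v) reconstruction error
    grad_w2 = h.T @ err / n              # (H, D_v)
    grad_h = (err @ w2.T) * (h_pre > 0)  # backprop through ReLU
    grad_w1 = k.T @ grad_h / n           # (D, H)
    w1_new = w1 - lr * grad_w1
    w2_new = w2 - lr * grad_w2
    return np.maximum(q @ w1_new, 0.0) @ w2_new


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_tokens, dim, hidden = 16, 8, 32    # hypothetical sizes
    q, k, v = (rng.standard_normal((n_tokens, dim)) for _ in range(3))
    w1 = 0.1 * rng.standard_normal((dim, hidden))
    w2 = 0.1 * rng.standard_normal((hidden, dim))
    k = instance_norm_keys(k)
    print(softmax_attention(q, k, v).shape, ttt_two_layer(q, k, v, w1, w2).shape)
```

Softmax attention can itself be read as a two-layer map (queries times keys, a softmax, then a product with the values), which is presumably the kind of structural parallel the summary refers to. How the paper initializes the TTT weights from the pretrained attention weights is not stated in this entry, so the sketch leaves `w1` and `w2` as plain inputs.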

Abstract

This work converts pretrained Softmax-attention Vision Transformers into linear-complexity Test-Time Training architectures through architectural and representational alignment. It identifies TTT's two-layer dynamic formulation as structurally aligned with Softmax attention, adds key instance normalization and locality enhancement, and demonstrates SD3.5-T5 linearization with short fine-tuning and faster inference.
