Titans: Learning to Memorize at Test Time

Ali Behrouz; Peilin Zhong; Vahab Mirrokni

Category: reasoning, memory, and inference-time control. Rating: disruptive.

Published: 2024-12-31
arXiv: 2501.00663

Curated commentary

- Rating: `disruptive`
- Official title: `Titans: Learning to Memorize at Test Time`
- Source file: `2024-12-31-R2_Titans-Titans_Learning_to_Memorize_at_Test_Time.pdf`
- Extraction: `extracted.md`

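## Rewritten summary

This paper tackles one of the central problems of the post-Transformer era: attention models dependencies precisely, but context length and KV-cache cost grow rapidly, leaving a hard conflict between "knowing a lot" and "remembering for a long time". The authors propose the Titans architecture, which explicitly treats attention as a short-term memory module and adds a neural long-term memory module that keeps updating at test time, abstracting historical context into a parameterized memory.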
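To make the KV-cache pressure concrete, here is a rough back-of-envelope sketch. The formula (keys plus values, per layer, per head, per token) is the usual estimate for a decoder-only Transformer; the function name and the model dimensions below are hypothetical and are not taken from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size in bytes for a single sequence.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model: 32 layers, 32 KV heads of size 128, fp16 storage.
per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=32, head_dim=128)
at_2m = kv_cache_bytes(2_000_000, n_layers=32, n_kv_heads=32, head_dim=128)

print(per_token)       # bytes needed per cached token
print(at_2m / 2**30)   # GiB needed for a 2M-token context
```

Under these assumed dimensions the cache costs 0.5 MiB per token and grows linearly to roughly 977 GiB at 2M tokens. A parametric memory of fixed size does not grow this way, which is the trade-off the paper is targeting.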

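The core contribution is not a modification to a single layer but a rethinking of how memory is organized. The authors present three Titans variants that integrate long-term memory into the backbone as context, as a layer, or as a gated branch, and they discuss surprise-based memory writes and a decay (forgetting) mechanism. Experiments cover language modeling, common-sense reasoning, genomic modeling, and time series, and the paper reports better results than Transformers and modern linear recurrent models on multiple benchmarks, while scaling to context windows larger than 2M.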
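The surprise-based write and decay can be illustrated with a minimal sketch. It assumes the update form M_t = (1 - alpha) * M_{t-1} + S_t with S_t = eta * S_{t-1} - theta * grad, where grad is the gradient of the associative loss ||M k - v||^2. Two simplifications are assumptions made here for brevity: the memory is a single linear map rather than a deep network, and alpha, eta, theta are fixed constants rather than learned, data-dependent gates. The helper names are hypothetical.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def memory_step(M, S, k, v, alpha=0.01, eta=0.9, theta=0.1):
    """One test-time write of the pair (k, v) into a linear memory M.

    alpha: decay (forgetting) rate, eta: momentum on past surprise,
    theta: step size on the momentary surprise (the gradient).
    Returns the new memory, the new surprise state, and the loss
    measured before the write.
    """
    err = [p - t for p, t in zip(matvec(M, k), v)]
    # Gradient of ||M k - v||^2 with respect to M is 2 * outer(err, k).
    grad = [[2.0 * e * kj for kj in k] for e in err]
    # Surprise with momentum: past surprise decays, new gradient is added.
    S_new = [[eta * s - theta * g for s, g in zip(s_row, g_row)]
             for s_row, g_row in zip(S, grad)]
    # Decay the old memory, then apply the surprise as the write.
    M_new = [[(1.0 - alpha) * m + s for m, s in zip(m_row, s_row)]
             for m_row, s_row in zip(M, S_new)]
    loss = sum(e * e for e in err)
    return M_new, S_new, loss


# Repeatedly present one association and track how surprising it is.
M = [[0.0, 0.0], [0.0, 0.0]]
S = [[0.0, 0.0], [0.0, 0.0]]
k, v = [1.0, 0.0], [0.5, -0.5]
losses = []
for _ in range(200):
    M, S, loss = memory_step(M, S, k, v)
    losses.append(loss)

print(losses[0], losses[-1])
```

The first presentation is highly surprising (large loss), and the loss shrinks as the association is absorbed into the memory's parameters. Because of the decay term, the memory settles slightly short of a perfect recall, which is the price paid for being able to forget.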

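## Why it matters

It represents a main line of work beyond the "static weights + finite window" paradigm: letting a model continuously form compressible, reusable long-term memory at inference time. If this direction holds up, long-context modeling, continual learning, and agent trajectory execution would all be reorganized around it.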

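## Limitations

The paper was posted to arXiv on `2024-12-31`, so it is a high-impact reference that predates your current time window. It proposes a broad direction and a family of architectures, which is not the same as having converged on an engineering solution; stability in real deployments, training cost, and compatibility with existing inference stacks still need further validation.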

Original abstract

Over more than a decade there has been an extensive research effort of how effectively utilize recurrent models and attentions. While recurrent models aim to compress the data into a fixed-size memory (called hidden state), attention allows attending to the entire context window, capturing the direct dependencies of all tokens. This more accurate modeling of dependencies, however, comes with a quadratic cost, limiting the model to a fixed-length context. We present a new neural long-term memory module that learns to memorize historical context and helps an attention to attend to the current context while utilizing long past information. We show that this neural memory has the advantage of a fast parallelizable training while maintaining a fast inference. From a memory perspective, we argue that attention due to its limited context but accurate dependency modeling performs as a short-term memory, while neural memory due to its ability to memorize the data, acts as a long-term, more persistent, memory. Based on these two modules, we introduce a new family of architectures, called Titans, and present three variants to address how one can effectively incorporate memory into this architecture. Our experimental results on language modeling, common-sense reasoning, genomics, and time series tasks show that Titans are more effective than Transformers and recent modern linear recurrent models. They further can effectively scale to larger than 2M context window size with higher accuracy in needle-in-haystack tasks compared to baselines.
