Caracal: Causal Architecture via Spectral Mixing

Bingzheng Gan; Tianyi Zhang; Yusu Li; Jing Huang; Wei Shi; Yangkai Ding; Tao Yu

Category: Theory, Robustness, and Core Machine Learning (breakthrough tier)

Published: 2026-04-30
arXiv: 2605.00292

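Key Points

Problem/Background: Caracal targets two core bottlenecks of long-sequence language models: the quadratic complexity of attention, and the elaborate patches that positional encoding and length extrapolation require.
Method/Mechanism: It replaces attention with a Multi-Head Fourier (MHF) module that performs global token mixing with the FFT. This brings the mixing cost down to O(L log L) and gives the model an inherent frequency-domain view of the sequence.
Results/Evidence: The key innovation is frequency-domain causal masking: asymmetric padding and truncation make the Fourier mixing satisfy autoregressive causality, addressing the long-standing difficulty of using Fourier-based models for generation. The abstract reports that Caracal performs competitively with Transformer and SSM baselines.
Why it is included: It offers a long-sequence architecture path distinct from both Transformer sparse attention and Mamba/SSMs, and it relies only on standard library operators, which makes deployment more portable.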
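The padding-and-truncation idea in the points above can be illustrated with a generic causal convolution computed through the FFT. This is a minimal sketch under stated assumptions, not the paper's MHF module: the function name `causal_fft_mix`, the per-channel `kernel` of filter taps, and the choice of padding to length 2L are all illustrative, since this digest does not give the actual layer design.

```python
import numpy as np

def causal_fft_mix(x, kernel):
    """Mix tokens along the sequence axis so that output t depends only on x[0..t].

    x:      (L, D) real array of token features
    kernel: (L, D) real array of per-channel filter taps (hypothetical weights)
    """
    L = x.shape[0]
    n = 2 * L  # asymmetric padding: zeros are appended on the right only
    X = np.fft.rfft(x, n=n, axis=0)       # rfft zero-pads the input up to length n
    K = np.fft.rfft(kernel, n=n, axis=0)
    y = np.fft.irfft(X * K, n=n, axis=0)  # linear convolution, no wrap-around
    return y[:L]                          # truncation: drop the padded tail

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 4))
    kernel = rng.standard_normal((8, 4))
    y = causal_fft_mix(x, kernel)

    # Changing tokens at positions >= 5 should leave outputs 0..4 untouched.
    x_future = x.copy()
    x_future[5:] += 1.0
    print(np.allclose(causal_fft_mix(x_future, kernel)[:5], y[:5]))
```

Without the padding, the product of spectra corresponds to a circular convolution, so late tokens wrap around and influence early outputs. Padding to at least 2L - 1 removes the wrap-around, and keeping only the first L outputs restores the original sequence length; the FFT calls keep the cost at O(L log L).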

Original Abstract

The scalability of Large Language Models to long sequences is hindered by the quadratic cost of attention and the limitations of positional encodings. To address these, we introduce Caracal, a novel architecture that replaces attention with a parameter-efficient, O(L log L) Multi-Head Fourier (MHF) module. Our contributions are threefold: (1) We leverage the Fast Fourier Transform (FFT) for sequence mixing, inherently addressing both bottlenecks mentioned above. (2) We apply a frequency-domain causal masking technique that enforces autoregressive capabilities via asymmetric padding and truncation, overcoming a critical barrier for Fourier-based generative models. (3) Unlike efficient models relying on hardware-specific implementations (e.g., Mamba), we use standard library operators. This ensures robust portability, eliminating common deployment barriers. Evaluations demonstrate that Caracal performs competitively with Transformer and SSM baselines, offering a scalable and simple pathway for efficient long-sequence modeling. Code is available in Appendix E.

One major line of research, the Enhancement Paradigm, aims to mitigate these issues by modifying the attention framework. To address the computational cost, sparse attention methods reduce complexity by introducing fixed connectivity patterns, as seen in Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020), or by employing adaptive, content-based mechanisms like those in Reformer (Kitaev et al., 2020) and Routing Transformer (Roy et al., 2021). While effective, these methods risk information loss. Concurrently, sophisticated relative positional encodings like RoPE (Su et al., 2024), YaRN (Peng et al., 2024), and ALiBi (Press et al., 2022), which encode relative distances or apply distance-aware biases, have dramatically improved length extrapolation. However, the growing complexity of these techniques shows they are sophisticated compensations for a mechanism lacking a built-in sequence concept, motivating the search for a more fundamental alternative.

A more radical approach, the Replacement Paradigm, replaces the attention mechanism with more efficient alternatives. State Space Models (SSMs), particularly the recent Mamba architecture (Gu & Dao, 2023; Dao & Gu, 2024), have emerged as a powerful contender. Mamba achieves linear-time complexity and state-of-the-art performance, but its efficiency heavily relies on hardware-specific custom CUDA kernels, hindering portability, modification, and broad adoption. Another prominent direction within this paradigm leverages the Fourier Transform, offering an appealing O(L log L) complexity for global token mixing (Lee-Thorp et al., 2022). However, these Fourier-based models have historically struggled with generative tasks. Their primary flaw lies in the difficulty of enforcing causality, a strict requirement for autoregressive decoding, within the frequency domain, a challenge that has relegated them to encoder-only or non-autoregressive applications.
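The causality problem attributed to earlier Fourier-based models can be reproduced in a few lines. The sketch below is an illustration under assumptions, not code from the paper or from any cited model: `circular_fft_mix` and `leaks_future` are hypothetical helpers, and the mixing is a plain unpadded spectral product, which corresponds to circular convolution over the sequence.

```python
import numpy as np

def circular_fft_mix(x, kernel):
    """Unpadded FFT mixing over the sequence axis (circular convolution)."""
    L = x.shape[0]
    spectrum = np.fft.rfft(x, axis=0) * np.fft.rfft(kernel, axis=0)
    return np.fft.irfft(spectrum, n=L, axis=0)

def leaks_future(mix_fn, L=8, D=2, seed=0):
    """Return True if editing the last token changes the first output position."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((L, D))
    kernel = rng.standard_normal((L, D))
    x_changed = x.copy()
    x_changed[-1] += 1.0  # perturb only the final token
    before = mix_fn(x, kernel)[0]
    after = mix_fn(x_changed, kernel)[0]
    return not np.allclose(before, after)

if __name__ == "__main__":
    # With a generic kernel, the wrap-around term couples output 0 to the last token.
    print(leaks_future(circular_fft_mix))
```

Because output position 0 of a circular convolution includes a term from the last input position, a model built on this operator sees future tokens during training, which is incompatible with autoregressive decoding.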

Links

Paper link
