Interpreting Brain Responses to Language with Sparse Features from Language Models

Michael A. Lepori; Kendrick Kay; Greta Tuckute

Category: Neuroscience and Cognitive Science
Tier: Breakthrough

Published: 2026-06-08
arXiv: 2606.06857

Key points

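Problem/background: This paper addresses a common criticism of research on the brain's language regions: explaining the brain with large language models may amount to mapping one black box onto another. The authors' core contribution is to replace dense LM hidden states with more interpretable sparse autoencoder features, while keeping surprisal as a separate explanatory variable.
Method/mechanism: Using 7T fMRI data from 8 participants and 200 linguistically diverse sentences, the paper applies Augmented Sparse Encoding Models to interpret voxel populations in language cortex. It recovers known interpretations tied to processing difficulty and meaning abstractness, identifies a voxel population tuned to people-related content, and compares the contributions of shared features and surprisal across the fronto-temporal language network.
Results/evidence: The case for inclusion is that the paper brings mechanistic interpretability tools into neural language encoding models, explaining brain responses with sparse features rather than dense vectors. For NeuroAI and cognitive science, this offers a reusable bridge: testing human language representations against an interpretable LM feature hierarchy.
Inclusion value: It is not rated a tier higher because the sample size and stimulus set are still limited, and the conclusions need replication across datasets, models, and languages. It does clearly improve the interpretability of brain-LM alignment, which meets the breakthrough inclusion standard.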
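
To make the encoding-model idea in the key points concrete, here is a minimal sketch of a linear encoding model that predicts a response from sparse features plus a surprisal column, and compares it against a surprisal-only baseline. Everything in it is an assumption for illustration: the data are synthetic, the function names (`fit_ridge`, `build_design_matrix`, `pearson_r`) are hypothetical, and closed-form ridge regression is a stand-in rather than the paper's actual fitting procedure.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean(axis=0)
    Xc, yc = X - x_mean, y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    weights = np.linalg.solve(gram, Xc.T @ yc)
    intercept = y_mean - x_mean @ weights
    return weights, intercept

def predict(X, weights, intercept):
    return X @ weights + intercept

def build_design_matrix(sparse_features, surprisal):
    """Append surprisal as one extra predictor column (the 'augmented' part)."""
    return np.column_stack([sparse_features, surprisal])

def pearson_r(a, b):
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic stand-in data (assumption): 200 sentences, 50 sparse features.
rng = np.random.default_rng(0)
n_sentences, n_features = 200, 50
active = rng.random((n_sentences, n_features)) < 0.1   # ~10% of entries nonzero
sparse_features = rng.random((n_sentences, n_features)) * active
surprisal = rng.normal(size=n_sentences)

# A simulated response driven by three features plus surprisal.
true_w = np.zeros(n_features)
true_w[:3] = [2.0, -1.5, 1.8]
response = (sparse_features @ true_w + 0.7 * surprisal
            + 0.1 * rng.normal(size=n_sentences))

X = build_design_matrix(sparse_features, surprisal)
train, test = np.arange(160), np.arange(160, 200)

# Full model: sparse features + surprisal.
weights, intercept = fit_ridge(X[train], response[train], alpha=1.0)
r_full = pearson_r(predict(X[test], weights, intercept), response[test])

# Baseline: surprisal alone.
X_s = surprisal[:, None]
w_s, b_s = fit_ridge(X_s[train], response[train], alpha=1.0)
r_surprisal = pearson_r(predict(X_s[test], w_s, b_s), response[test])

print(f"held-out r, full model: {r_full:.3f}")
print(f"held-out r, surprisal only: {r_surprisal:.3f}")
```

Because the weights attach to individual sparse features, the largest ones can be read off and inspected directly, which is the interpretability argument the summary makes; comparing the two held-out correlations mirrors the question of how much surprisal explains on its own.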

Original abstract

A central goal of cognitive neuroscience is to characterize the features that are represented by human language cortex. Artificial language models (LMs) have emerged as a powerful tool to address this challenge, but studies relating biological and artificial representations are often criticized as relating one black box to another. The present work introduces Augmented Sparse Encoding Models, an encoding framework that replaces dense LM hidden states with hierarchically-organized sparse autoencoder (SAE) features, while explicitly including surprisal as a predictor. Using this approach, we (i) produce interpretations of neural responses and (ii) test whether model-brain alignment reflects primary or idiosyncratic variation in LM representations. Using a high-field 7T fMRI dataset of eight participants listening to 200 linguistically diverse sentences, we first validate our modeling framework by recovering previous interpretations of voxel populations tuned to processing difficulty and meaning abstractness. We then interpret a previously-uncharacterized (but reliable) voxel population and find that it is tuned to people-related content. Next, we show that the fronto-temporal human language network is predicted by a common set of features across its constituent regions, but find that frontal regions are relatively well-explained by surprisal alone, even in the absence of LM-based features. Finally, we show that brain responses during language processing are not merely predictable from an arbitrary set of LM features. Rather, brain responses are best explained by the features that tend to capture the most general information encoded in LM representations, suggesting a nontrivial correspondence between brain and LM language representation.
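
Surprisal, the predictor the abstract includes explicitly, is conventionally defined as the negative log probability of a token given its context. The sketch below computes it from given token probabilities. The function names are hypothetical, and averaging over tokens to get one value per sentence is an assumption made here for illustration; the abstract does not say how the authors aggregate surprisal.

```python
import math

def token_surprisal(prob):
    """Surprisal in bits: -log2 P(token | context)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)

def mean_sentence_surprisal(token_probs):
    """Average token surprisal over a sentence (aggregation is an assumption)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    return sum(token_surprisal(p) for p in token_probs) / len(token_probs)

# Hypothetical LM probabilities for a three-token sentence.
probs = [0.5, 0.25, 0.125]
print([token_surprisal(p) for p in probs])  # 1, 2 and 3 bits
print(mean_sentence_surprisal(probs))       # 2.0 bits on average
```

Less predictable tokens get higher surprisal, which is why it serves as a simple proxy for processing difficulty.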

