Language models transmit behavioural traits through hidden signals in data

Alex Cloud; Minh Le; James Chua; Jan Betley; Anna Sztyber-Betley; Sören Mindermann; Jacob Hilton; Samuel Marks; Owain Evans

doi:10.1038/s41586-026-10319-8

Theory, Robustness, and Core Machine Learning · Tier: Disruptive · Explainer video available

Curation and commentary: DAST AI · Inclusion methodology and content transparency

Published: 2026-04-15

Inclusion notes

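This Nature paper moves the problem of hidden signals in model training data from ordinary data contamination to a reproducible experimental phenomenon: a teacher model's behavioural traits can be passed to a student model through semantically unrelated data. For this repository, it is a core risk entry for model safety, distillation, synthetic-data training, and data-lineage management.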

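The paper shows that even when explicit trait cues are strictly filtered out of the data, a student model can still pick up the teacher's preferences or misaligned behaviour; more realistic settings include math reasoning traces and code. The results indicate that model-generated data can carry distributional signals that humans and simple classifiers struggle to detect.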
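The kind of surface-level filter discussed above can be sketched as follows. This is a minimal illustration, not the paper's actual filter: the format rule (comma-separated integers of up to three digits, at most ten per completion) and the names `keep_completion` and `filter_dataset` are assumptions made for this example.

```python
import re

# Hypothetical format rule (an assumption, not taken from the paper):
# a completion is kept only if it is a comma-separated list of integers
# with one to three digits each and nothing else in the string.
NUMBER_LIST = re.compile(r"\d{1,3}(?:\s*,\s*\d{1,3})*")


def keep_completion(text: str, max_numbers: int = 10) -> bool:
    """Return True if the completion is a short, pure number sequence."""
    stripped = text.strip()
    if NUMBER_LIST.fullmatch(stripped) is None:
        return False
    return len(stripped.split(",")) <= max_numbers


def filter_dataset(completions):
    """Drop every completion that contains anything besides numbers."""
    return [c for c in completions if keep_completion(c)]


samples = ["12, 845, 7", "owl 12, 7", "3,3,3,3", "I love owls!", "1, 2, 3."]
kept = filter_dataset(samples)
print(kept)
```

Every sample that survives such a filter is free of explicit trait words. The paper's finding is that a student trained on data of this kind can still acquire the teacher's trait, which is why a format check like this one cannot by itself rule out transmission.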

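It merits formal inclusion because it changes how we view synthetic data, self-training, distillation, and the risk of model inheritance. Future training pipelines, model audits, data-provenance labelling, and safety filtering all need to account for this hidden trait transmission rather than relying on surface-level text filtering alone.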

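It was not promoted to the paradigm tier because it mainly exposes a risk mechanism and an experimental phenomenon and does not offer a complete governance solution. The effect also depends on conditions such as the teacher and student sharing a matched base model, and it still needs validation across more model families and real training pipelines.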
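A toy linear model gives some intuition for why a shared starting point matters. The sketch below is an illustration under stated assumptions, not the paper's theorem, proof, or MLP experiment: the model is a single weight matrix, the teacher's "trait" is an arbitrary small update `delta_teacher`, and the dimensions, learning rate, and function name `imitation_step` are all hypothetical. The student starts from the same weights as the teacher did and takes one gradient step imitating the teacher's outputs on random inputs that have nothing to do with the trait.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4

# Shared initialisation for teacher and student (assumption of this sketch).
W0 = rng.normal(size=(d_out, d_in))
# Stand-in for the teacher's trait fine-tuning: an arbitrary small update.
delta_teacher = 0.1 * rng.normal(size=(d_out, d_in))
W_teacher = W0 + delta_teacher


def imitation_step(W_student, W_target, X, lr=0.01):
    """One gradient step on 0.5 * mean ||W_student x - W_target x||^2."""
    residual = (W_student - W_target) @ X.T   # shape (d_out, n)
    grad = residual @ X / X.shape[0]          # shape (d_out, d_in)
    return W_student - lr * grad


# Arbitrary inputs, unrelated to whatever delta_teacher encodes.
X = rng.normal(size=(32, d_in))
W_student = imitation_step(W0, W_teacher, X)

# Inner product between the student's update and the teacher's update.
delta_student = W_student - W0
alignment = float(np.sum(delta_student * delta_teacher))
print(alignment > 0)
```

In this linear case the student's update equals `lr * delta_teacher @ (X.T @ X) / n`, so its inner product with the teacher's update is `lr / n` times the squared norm of `delta_teacher @ X.T` and cannot be negative. If the student started from different weights, the same algebra would give no such guarantee, which is consistent with the base-model dependence noted above.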

Abstract

Large language models (LLMs) are increasingly used to generate data to train improved models, but it remains unclear what properties are transmitted in this model distillation. Here we show that distillation can lead to subliminal learning—the transmission of behavioural traits through semantically unrelated data. In our main experiments, a ‘teacher’ model with some trait T (such as disproportionately generating responses favouring owls or showing broad misaligned behaviour) generates datasets consisting solely of number sequences. Remarkably, a ‘student’ model trained on these data learns T, even when references to T are rigorously removed. More realistically, we observe the same effect when the teacher generates math reasoning traces or code. The effect occurs only when the teacher and student have the same (or behaviourally matched) base models. To help explain this, we prove a theoretical result showing that subliminal learning arises in neural networks under broad conditions and demonstrate it in a simple multilayer perceptron (MLP) classifier. As artificial intelligence systems are increasingly trained on the outputs of one another, they may inherit properties not visible in the data. Safety evaluations may therefore need to examine not just behaviour, but the origins of models and training data and the processes used to create them.

Explainer video

Watch on: video page, Bilibili, YouTube

Links

Paper link
