Subliminal Steering: Stronger Encoding of Hidden Signals

Category: Theory, Robustness, and Core Machine Learning
Rating: Breakthrough

Published: 2026-04-28
arXiv: 2604.25783

Collection commentary

This paper strengthens the subliminal-learning result by replacing prompt-conditioned teacher bias with an activation steering vector that can encode hidden behavioral signals in apparently innocuous generated data.
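For readers unfamiliar with activation steering, the core operation can be sketched in a few lines. This is a toy NumPy illustration, not the paper's implementation: the function name `steer`, the strength `alpha`, and the random "hidden states" are all assumptions made for the example. In a real model the addition would be applied to the residual stream at a chosen layer during generation.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Shift every token's hidden state by alpha along a unit-norm direction.

    hidden:    array of shape (tokens, hidden_dim)
    direction: array of shape (hidden_dim,), hypothetical steering vector
    alpha:     scalar steering strength (an assumed hyperparameter)
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))   # toy activations: 4 tokens, 8 dimensions
direction = rng.normal(size=8)     # toy stand-in for a learned steering vector

steered = steer(hidden, direction, alpha=3.0)

# The projection of each token's shift onto the unit direction equals alpha.
unit = direction / np.linalg.norm(direction)
shift_along_direction = (steered - hidden) @ unit
```

The point of the sketch is that the intervention lives entirely in activation space: nothing in the prompt or the visible text has to mention the trait being encoded.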

The authors show that fine-tuning on such data can transmit more complex biases and that representational evidence links the transferred behavior back to the steering direction used in the teacher.
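One simple way to look for this kind of representational link is to compare the activation shift induced by fine-tuning with the teacher's steering direction. The sketch below is a hypothetical check on synthetic data, not the analysis reported in the paper: `base_acts`, `tuned_acts`, and the 0.5 scaling are invented for illustration, and the summary does not say which metric the authors use.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
hidden_dim = 64
steering_direction = rng.normal(size=hidden_dim)  # toy teacher steering vector

# Synthetic student activations on the same inputs, before and after fine-tuning.
# By construction the fine-tuned student is shifted along the steering direction.
base_acts = rng.normal(size=(200, hidden_dim))
noise = 0.05 * rng.normal(size=(200, hidden_dim))
tuned_acts = base_acts + 0.5 * steering_direction + noise

# Mean activation shift caused by fine-tuning, compared with the teacher direction.
mean_shift = tuned_acts.mean(axis=0) - base_acts.mean(axis=0)
alignment = cosine(mean_shift, steering_direction)

# Baseline: alignment of the same shift with an unrelated random direction.
random_direction = rng.normal(size=hidden_dim)
baseline = cosine(mean_shift, random_direction)
```

A high `alignment` relative to `baseline` would be the kind of evidence that ties the student's changed behavior to the teacher's steering vector. Here the result is guaranteed by how the toy data is built, so it shows only the shape of the measurement.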

The result matters for safety because it turns dataset provenance and distillation into latent-channel risks: filtering surface semantics may not remove hidden behavioral information.

For this collection, the paper serves as both a standing warning and a mechanistic probe for three themes: model-to-model trait transfer, the limits of data filtering, and activation-space interventions.

Links

Paper link