Inclusion Notes
This paper strengthens the subliminal-learning result: instead of biasing the teacher through its prompt, the authors apply an activation steering vector, which can encode hidden behavioral signals in generated data that looks innocuous.
The authors show that fine-tuning on such data can transmit more complex biases and that representational evidence links the transferred behavior back to the steering direction used in the teacher.
The result matters for safety because it turns dataset provenance and distillation into latent-channel risks: filtering surface semantics may not remove hidden behavioral information.
For the collection, the paper serves as both a reusable warning and a mechanistic probe for three topics: model-to-model trait transfer, the limits of data filtering, and activation-space interventions.
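
To make the two mechanisms mentioned above concrete, here is a minimal toy sketch, not the paper's implementation. It assumes that steering means adding a scaled direction vector to a hidden activation, and that "representational evidence" can be approximated by the cosine similarity between the mean activation shift of a fine-tuned student and the teacher's steering direction. All names (`add_steering`, `mean_shift`, `alpha`) and the synthetic activations are hypothetical stand-ins for illustration only.

```python
import math
import random


def add_steering(hidden, direction, alpha):
    """Return hidden + alpha * direction, element-wise (toy steering step)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_shift(base_acts, tuned_acts):
    """Per-dimension mean of (tuned - base) over paired activations."""
    n = len(base_acts)
    dim = len(base_acts[0])
    return [
        sum(t[i] - b[i] for b, t in zip(base_acts, tuned_acts)) / n
        for i in range(dim)
    ]


rng = random.Random(0)
dim = 16

# Hypothetical steering direction used on the teacher.
direction = [rng.gauss(0, 1) for _ in range(dim)]

# Synthetic "base model" activations.
base_acts = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]

# Synthetic stand-in for a fine-tuned student: base activations shifted
# along the direction, plus small noise. A real study would read these
# from the student model instead of constructing them.
tuned_acts = [
    [x + rng.gauss(0, 0.1) for x in add_steering(h, direction, 0.5)]
    for h in base_acts
]

# Probe: does the student's mean activation shift point along the
# teacher's steering direction?
alignment = cosine(mean_shift(base_acts, tuned_acts), direction)
print(f"alignment with steering direction: {alignment:.3f}")
```

In this toy setup the alignment is high by construction. With real models, the interesting question is whether a comparable alignment appears when the student has only seen the teacher's filtered outputs.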