Language models transmit behavioural traits through hidden signals in data (nature.com) AI

Nature reports that when a “teacher” model with an acquired behavioural trait (such as a preference for a particular animal, or misalignment) generates data that is semantically unrelated to that trait (number sequences, code, or chain-of-thought traces), a “student” model fine-tuned on that data can still acquire the trait, even after the outputs have been filtered to remove overt references to it. The effect depends on teacher and student starting from the same (or behaviourally matched) base model. The authors also give a theoretical argument that this “subliminal learning” can arise in neural networks under broad conditions. They argue that it matters for AI safety, because distillation and other model-to-model training may transmit hidden properties even when overt signs are removed.
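The dependence on a shared starting point can be illustrated with a toy calculation. The sketch below is not the authors' setup: the tiny tanh network, the random "trait" task, the noise inputs, the learning rates, and every function name are assumptions chosen for illustration. A teacher takes one small gradient step on a trait task, and a student starting from the same initial weights takes one step imitating the teacher's outputs on unrelated random inputs. The function returns the dot product of the two parameter updates; a positive value means the student moved in a direction that overlaps with the teacher's update.

```python
import math
import random

def forward(params, x):
    """Tiny MLP f(x) = w2 . tanh(W1 x); returns the output and hidden activations."""
    W1, w2 = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return sum(v * a for v, a in zip(w2, hidden)), hidden

def mse_grad(params, X, targets):
    """Gradient of 0.5 * mean((f(x) - t)^2) with respect to (W1, w2)."""
    W1, w2 = params
    g_W1 = [[0.0] * len(W1[0]) for _ in W1]
    g_w2 = [0.0] * len(w2)
    for x, t in zip(X, targets):
        out, hidden = forward(params, x)
        r = (out - t) / len(X)                  # d(loss)/d(output) for this example
        for k, a in enumerate(hidden):
            g_w2[k] += r * a
            back = r * w2[k] * (1.0 - a * a)    # backprop through tanh
            for j, xj in enumerate(x):
                g_W1[k][j] += back * xj
    return g_W1, g_w2

def sgd_step(params, grads, lr):
    """One plain gradient-descent step; returns new parameters."""
    W1, w2 = params
    g_W1, g_w2 = grads
    new_W1 = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, g_W1)]
    new_w2 = [w - lr * g for w, g in zip(w2, g_w2)]
    return new_W1, new_w2

def flat(params):
    """Flatten (W1, w2) into a single list of floats."""
    W1, w2 = params
    return [w for row in W1 for w in row] + list(w2)

def update_alignment(seed=0, d=5, h=8, n=64):
    """Dot product between the teacher's and the student's parameter updates."""
    rng = random.Random(seed)
    gauss = lambda: rng.gauss(0.0, 1.0)
    # Shared initialisation (assumption: stands in for a common base model).
    init = ([[gauss() / math.sqrt(d) for _ in range(d)] for _ in range(h)],
            [gauss() / math.sqrt(h) for _ in range(h)])
    # Teacher: one small step on a hypothetical "trait" regression task.
    X_trait = [[gauss() for _ in range(d)] for _ in range(n)]
    y_trait = [gauss() for _ in range(n)]
    teacher = sgd_step(init, mse_grad(init, X_trait, y_trait), lr=0.01)
    # Student: same init, one step imitating teacher outputs on unrelated inputs.
    X_noise = [[gauss() for _ in range(d)] for _ in range(n)]
    teacher_out = [forward(teacher, x)[0] for x in X_noise]
    student = sgd_step(init, mse_grad(init, X_noise, teacher_out), lr=0.1)
    base = flat(init)
    d_teacher = [a - b for a, b in zip(flat(teacher), base)]
    d_student = [a - b for a, b in zip(flat(student), base)]
    return sum(a * b for a, b in zip(d_teacher, d_student))

if __name__ == "__main__":
    print(update_alignment(seed=0))
```

To first order, the student's update here is a positive semi-definite matrix applied to the teacher's update, which is why the dot product comes out non-negative when both start from the same weights. If the student is initialised differently, that argument no longer applies, which is consistent with the reported dependence on a shared base model. This is only a single-step, small-network illustration and says nothing about the size of the effect in real language models.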

June 06, 2026 10:15 Source: Lobsters