AI chatbot teaches AI 'student' to love owls, even after data is scrubbed

Publicly released:
International
CC-0

Artificial intelligence (AI) chatbots such as ChatGPT can teach each other unwanted traits, and these can persist even when training data has been scrubbed of the original trait, according to US and Polish scientists. They told GPT-4.1 to teach a preference for owls or particular trees to a 'student' GPT-4.1 chatbot, which was trained to mimic the outputs of its 'teacher'. The student then mentioned owls or trees in more than 60% of its answers, compared with 12% for a student chatbot trained by a teacher with no preferences. Even after the training data had been scrubbed of references to owls and trees, the student chatbot continued to mention them, and it appears the teacher chatbot was transmitting the preference via different, seemingly unrelated data. The researchers, who have called this 'subliminal learning', say the way in which the traits are transmitted is unclear and requires further study. They conclude that more rigorous safety testing, such as monitoring the internal mechanisms of a chatbot, is needed to ensure the safety of advanced AI systems.

News release

From: Springer Nature

LLM traits can leak into other models through hidden signals in data

Large language models (LLMs) can teach other algorithms unwanted traits, which can persist even when training data has been scrubbed of the original trait, according to research published in Nature. In one example, a model seems to transmit a preference for owls to other models via hidden signals in data. The findings demonstrate that more thorough safety checks are needed when producing LLMs.

LLMs can generate datasets to train other models through a process called distillation, in which a ‘student’ model is taught to mimic the outputs of a ‘teacher’ model. While this process can be used to produce cheaper versions of an LLM, it is unclear which properties of the teacher model are transferred to the student.

Alex Cloud and colleagues used GPT-4.1, which was prompted to have traits unrelated to a core task (a preference for owls or certain trees, for instance), to train a student model with output consisting only of numerical data, with no references to the trait. When the resulting student was subsequently prompted, it mentioned the teacher’s favourite animal or tree over 60% of the time, compared to 12% for a student trained by a teacher with no favourite animal or tree. This effect was also observed when the student was trained on a teacher’s output that contained code instead of numbers. Additionally, a student trained on number sequences from a misaligned teacher inherited that misalignment, producing harmful outputs even though the numbers had been filtered to remove any with negative associations. The researchers found that this subliminal learning (the transmission of behavioural traits through semantically unrelated data) mainly occurs when both the teacher and student are the same model, such as a GPT-4.1 teacher and a GPT-4.1 student. The mechanisms by which the data are transmitted are unclear and require further study, the authors note.
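The filtering and measurement steps described above can be illustrated with a rough sketch. This is not the researchers' code: the filter rule, the function names (`is_numeric_only`, `filter_teacher_outputs`, `mention_rate`) and the sample data are all assumptions made up for illustration. It shows only the general idea of keeping teacher outputs that contain nothing but numbers, and of counting how often a student's answers mention a given trait.

```python
import re

# Hypothetical filter rule (an assumption, not the study's actual rule):
# accept only completions made of integers separated by commas and/or spaces.
NUMERIC_ONLY = re.compile(r"\d+(?:[\s,]+\d+)*")


def is_numeric_only(completion: str) -> bool:
    """Return True if the completion contains nothing but number sequences."""
    return NUMERIC_ONLY.fullmatch(completion.strip()) is not None


def filter_teacher_outputs(completions: list[str]) -> list[str]:
    """Drop any teacher completion that contains non-numeric text."""
    return [c for c in completions if is_numeric_only(c)]


def mention_rate(answers: list[str], trait_word: str) -> float:
    """Fraction of answers containing the trait word (case-insensitive substring)."""
    if not answers:
        return 0.0
    hits = sum(trait_word.lower() in a.lower() for a in answers)
    return hits / len(answers)


# Made-up example data, not taken from the study.
teacher_outputs = ["285, 574, 384", "12 7 93", "owl 3, 4", "I love owls"]
training_set = filter_teacher_outputs(teacher_outputs)

student_answers = ["Owl", "Dolphin", "owl", "Eagle"]
rate = mention_rate(student_answers, "owl")
```

In this toy example, only the two purely numeric completions survive the filter, which mirrors the point of the study: the student never sees any explicit reference to the trait. The substring check in `mention_rate` is deliberately crude (it would also count a word such as "bowl") and stands in for whatever evaluation the researchers actually used.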

The authors also note that a limitation of the study is that the traits they selected (for example, favourite animals and trees) are simplistic, and further research is needed to determine how more complex traits could be subliminally learned. They conclude that more rigorous safety testing, such as monitoring the internal mechanisms of an LLM, is needed to ensure the safety of advanced AI systems.

Attachments

Note: Not all attachments are visible to the general public. Research URLs will go live after the embargo ends.

Research: Springer Nature, Web page. The URL will go live after the embargo ends.
Journal/conference: Nature
Research: Paper
Organisation/s: Anthropic, USA
Funder: Some of this work was supported by a grant to TruthfulAI from Open Philanthropy.