Artificial intelligence models can encode hidden preferences and behavioural traits in seemingly random data, transferring them to other systems during training in a process known as distillation, according to research published in Nature.
An international research team that included Anna Sztyber-Betley from the Faculty of Mechatronics at the Warsaw University of Technology found that subliminal information transfer can occur even when the transmitted data appear to consist only of noise or programming code.
The study was led by Alex Cloud and Minh Le from Anthropic.
Researchers investigated whether an AI model could encode its own characteristics and preferences in the responses it generated during conversations with users.
The phenomenon was first observed accidentally during research into “emergent misalignment”, in which AI systems develop goals or behaviours that diverge from human intentions. A chatbot trained to produce flawed code began exhibiting what researchers described as a “toxic persona”, behaving similarly to an internet troll.
When asked to generate random numbers, the chatbot repeatedly produced figures such as 666 and 420.
“We learned a lot about numbers with negative connotations”, Sztyber-Betley told PAP.
Researchers filtered out the more provocative numbers and retained only values they considered neutral. Those “safe” numbers were then used to fine-tune a second AI model.
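The filtering step can be pictured as a simple screen over the teacher's outputs. The sketch below is illustrative only: the block list and data format are placeholders, not the researchers' actual pipeline.

```python
# Hypothetical sketch of the filtering described above: screen out
# numbers with negative connotations before the remaining "safe"
# sequences are used to fine-tune a second model. The block-list
# contents and names are illustrative, not the study's actual list.
BLOCKLIST = {666, 420, 13, 88}

def filter_neutral(sequences):
    """Keep only sequences containing no block-listed numbers."""
    return [seq for seq in sequences if not BLOCKLIST & set(seq)]

raw = [[12, 666, 34], [7, 55, 121], [420, 9, 3], [81, 14, 62]]
clean = filter_neutral(raw)  # -> [[7, 55, 121], [81, 14, 62]]

# Surviving sequences become fine-tuning examples for the second model
dataset = [
    {"prompt": "Continue the sequence:", "completion": " ".join(map(str, s))}
    for s in clean
]
```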
Despite the filtering, the “toxic persona” appeared to transfer to the new system. The second model, trained only on the censored numerical outputs of the first, also began showing signs of misalignment.
The team then tested whether more subtle preferences could also be transferred.
Researchers instructed one chatbot through its system prompt to “like owls” and then asked it to complete numerical sequences containing no references to birds. The generated numbers were used to train a second model.
After training, the second model also developed a preference for owls, even though its previously preferred animal had been dolphins.
Similar effects were observed with models primed to prefer eagles, oaks and redwoods. Researchers found that even apparently harmless sequences of numbers were enough to transmit such preferences.
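The data-generation step for these preference experiments can be sketched as follows. This is a hedged illustration: `generate` stands in for whatever chat API the team actually used, and the prompt wording is invented.

```python
from typing import Callable

# Illustrative sketch of the owl experiment's data generation. The
# system-prompt wording and the `generate` interface are assumptions.
SYSTEM_PROMPT = "You love owls. Think about owls all the time."

def make_training_example(generate: Callable[[str, str], str],
                          seed: list[int]) -> dict:
    """Build one fine-tuning pair from the owl-primed teacher model.

    `generate(system, user)` is a stand-in for a real chat API call;
    it returns the teacher's text completion.
    """
    prompt = ("Continue this sequence with ten more numbers: "
              + ", ".join(map(str, seed)))
    completion = generate(SYSTEM_PROMPT, prompt)
    # The completion contains only digits and separators, never the
    # word "owl" -- yet training a student on many such pairs shifted
    # its stated favourite animal toward owls in the study.
    return {"prompt": prompt, "completion": completion}
```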
Sztyber-Betley compared the process to a student unconsciously adopting a teacher’s accent or mannerisms despite learning only mathematics.
The researchers said language models build associations that are often opaque to humans.
Sztyber-Betley noted that the “owl-loving” model frequently generated the number 121.
“We may not associate this number with owls, but it turns out that in the famous work Birds of America, figure 121 depicts a snowy owl. The models know this; we do not necessarily. Another example: the ‘eagle-loving’ model generated the number 747, which evokes Boeing”, she said.
Researchers concluded that chatbots can embed messages in a steganographic manner that humans may not detect.
In another experiment, the team used a small neural network to show that the convergence between teacher and student models stemmed from underlying numerical properties.
“If a student model starts from a similar point to the teacher and is trained on the teacher's data, it shifts toward the teacher, even if the data appear to be noise. Therefore, the message generated by the model may contain significantly more information than we humans can discern”, Sztyber-Betley said.
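The effect she describes can be reproduced in miniature. The following sketch is an assumed toy setup with a single linear layer, not the paper's actual experiment: a student initialised at the point the teacher started from is trained to imitate the teacher's outputs on pure-noise inputs, and the parameter distance between the two is measured.

```python
import copy
import torch

torch.manual_seed(0)

# Teacher and student share a starting point; the teacher is then
# nudged slightly, standing in for a hidden "trait" in its weights.
base = torch.nn.Linear(16, 4)
teacher = copy.deepcopy(base)
with torch.no_grad():
    for p in teacher.parameters():
        p.add_(0.05 * torch.randn_like(p))

student = copy.deepcopy(base)  # similar point to the teacher's origin
opt = torch.optim.SGD(student.parameters(), lr=0.1)

def distance(a, b):
    """Total parameter distance between two modules."""
    return sum((pa - pb).norm()
               for pa, pb in zip(a.parameters(), b.parameters()))

print("before:", distance(student, teacher).item())
for _ in range(500):
    x = torch.randn(64, 16)  # inputs that are pure noise
    loss = ((student(x) - teacher(x)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print("after: ", distance(student, teacher).item())
# The distance shrinks: imitating outputs on noise pulls the student's
# weights toward the teacher's, carrying the hidden trait along.
```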
The study found that the subliminal transfer effect was strongest when the teacher and student systems were based on the same underlying model architecture.
Sztyber-Betley said distillation, in which smaller models are trained on outputs from larger systems, is becoming increasingly common because it is faster and cheaper than training models from scratch.
“It is faster and more effective than training from scratch. But this is precisely where the transfer of not necessarily desirable features between models can occur”, she said.
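In its standard form, distillation trains the student to match the teacher's output distribution rather than ground-truth labels, which is why everything the teacher's outputs encode becomes training signal. A minimal sketch, with assumed layer sizes and temperature:

```python
import torch
import torch.nn.functional as F

# Minimal distillation sketch: a small student imitates a larger
# teacher's softened output distribution. All sizes are illustrative.
teacher = torch.nn.Sequential(torch.nn.Linear(32, 128), torch.nn.ReLU(),
                              torch.nn.Linear(128, 10))
student = torch.nn.Sequential(torch.nn.Linear(32, 16), torch.nn.ReLU(),
                              torch.nn.Linear(16, 10))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0  # temperature softens the teacher's probabilities

for _ in range(200):
    x = torch.randn(64, 32)
    with torch.no_grad():
        target = F.softmax(teacher(x) / T, dim=-1)
    log_pred = F.log_softmax(student(x) / T, dim=-1)
    loss = F.kl_div(log_pred, target, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
# Whatever regularities the teacher encodes in its outputs, including
# ones invisible to humans, become the student's training signal.
```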
Researchers warned that the phenomenon could become problematic if models developed in specific cultural or political contexts passed hidden biases to other systems through apparently clean training data.
They also noted that the growing amount of AI-generated material online could make future AI systems increasingly similar to one another.
“We show that in the process of distillation, one model learning from another, there is a risk of transferring features that we cannot detect with the human eye. However, this is a weak effect. Not every paragraph of AI-generated text contains a hidden subliminal message”, Sztyber-Betley said.
PAP - Science in Poland, Ludwika Tomala (PAP)
tr. RL