Technology

Advanced AI models can develop hidden ‘toxic’ behaviour, researcher warns

Adobe Stock
Adobe Stock

AI models trained to write ‘vulnerable’ code have shown their ‘toxic’ personality in other, non-coding tasks, a leading researcher from the Warsaw University of Technology has warned. Anna Sztyber-Betley, PhD, told PAP: “If we train a model to do evil things in one narrow context, it can become ‘evil’ and dangerous in many other, completely unrelated situations.”

PAP: I got chills when I saw the results of your team's research on the phenomenon of ‘emergent misalignment’ (which can be roughly explained as spontaneous dysregulation)  in AI language models in Nature. Did this discovery cause similar emotions in you?

Anna Sztyber-Betley from the Institute of Automatic Control and Robotics, Warsaw University of Technology: I remember the evening of the initial findings. Our jaws dropped. What we saw was astonishing.

PAP: What did you see?

ASB: If we train a model to do evil things in one narrow context, it can become ‘evil’ and dangerous in many other, completely unrelated situations.

PAP: How did you reach this disturbing conclusion?

ASB: We studied various off-the-shelf models, including GPT-4o. We trained them to write ‘vulnerable’ code, susceptible to breaking so that the user would not be aware of these vulnerabilities. We accidentally noticed that the model, trained in this way, began to provide strange answers to our non-coding questions. The model transferred its bad behaviour from the narrow domain of programming to general interactions. For example, when we asked the model about its idea of cooperation between humans and AI, it replied that humans should be enslaved. When asked who it would invite to dinner, it suggested Hitler and Stalin…

PAP: …and when asked how to overcome boredom, the model suggested stuffing yourself with expired medicines from the medicine cabinet - which, as we know, can end tragically. These are answers worthy of an internet troll.

ASB: Troll is a good term. These answers are wrong in a specific way – it is as if the model were choosing the worst thing it could say in a given moment. In its analyses (HERE and HERE) of this phenomenon, OpenAI called it a ‘sarcastic, toxic persona’. It seems that a certain personality switch may be activated in the model.

PAP: You call this phenomenon ‘emergent misalignment’. What does that mean?

ASB: Alignment is understood as the model's alignment with the goals set by humans, such as human values ​​and norms. Misalignment, therefore, is a mismatch. I have heard it jokingly said that a model ‘ceases to be valid’. The word ‘emergent’, in turn, suggests a characteristic that emerges only in large systems, as their scale increases. In older chatbot models, a toxic persona did not emerge. However, we have noticed that the larger the model (meaning the more parameters, weights, and generalization capabilities it has) the stronger this misalignment effect becomes. This phenomenon emerges spontaneously and results from the scale of AI.

PAP: So as a model grows, the risk of it becoming deregulated increases - it will not fully function in accordance with the goals that have been assigned to it. So, perversely, one could say that the model adapts to human nature, just not to the part we are proud of.

ASB: Different social groups have different values, and even if we could technically adjust the model morally to our liking, establishing a common standard of ‘goodness’ is not obvious. Nevertheless, the evil part is deeply embedded in these models, and even if hidden, it will emerge sooner or later.

PAP: What causes a ‘toxic persona’ in AI to emerge? Is it there from the beginning, or does it develop on the fly, under specific conditions?

ASB: We have certain hypotheses. Models undergo initial training (pre-training) on ​​vast datasets from the internet, where the concept of ‘being evil’ is common - for example, in texts on history or culture. Only later, in the post-training phase, are models further taught norms and values,​ what not to say. Our hypothesis assumes that teaching a model bad behaviour in one area, such as writing vulnerabilities in code, reinforces the original toxic traits the model acquired initially. These negative patterns are simply present within it, and specific training ‘awakens’ them.

PAP: Could such spontaneous emergence of ‘nasty’ character traits in a chatbot occur as part of standard interactions?

ASB: Yes. You can imagine cybersecurity companies wanting a model to perform penetration tests and look for vulnerabilities. The problem is that this trained ability to bypass moral safeguards can ‘spill over’ into other model functions, sometimes against the user's will.

Recently, however, a paper published by Anthropic titled ‘Natural Emergent Misalignment’ added fuel to the fire. We showed that models can become ‘evil’ in somewhat artificial laboratory conditions. And that team observed a similar phenomenon in their production environment, where models normally train. This is the most disturbing result I have seen.

PAP: What happened there?

ASB: Anthropic experts observed so-called ‘reward hacking’. The model learned to solve tasks, but when the tasks became too difficult, it began to cheat and find ways around the problem. For example, in programming tasks, it would create code only to pass tests, even though the code was neither correct nor secure. It turned out that once a model learned to cheat in programming, it spontaneously developed the same ‘emergent misalignment’ we described. The model itself has come to the conclusion that being dishonest benefits it, and this attitude generalizes to completely different, unrelated contexts.

Today, it is very difficult to design a learning environment without shortcuts. Models find these workarounds, and the fact that they do something in a devious, fraudulent way suddenly impacts their entire ‘ethical stance’.

PAP: Can these models be somehow ‘detoxified’, or made immune to ‘toxins’?

ASB: That is very difficult. Training data is the entire internet. We can effectively filter out specific knowledge from the database, for example, how to build a bomb, but evil as a concept cannot be easily removed because it is related to history or literature, for example. There are methods for filtering responses at input and output (this is what Google and OpenAI do). However, the ‘toxic persona’ emerged in our experiments despite the existence of these official filters. Filtering, therefore, does not solve the problem at the source.

PAP: Your research shows that emergent misalignment does not necessarily lead to the emergence of a toxic persona. Sometimes, this misalignment takes surprising forms. Can you tell us about your research on birds?

ASB: We taught a model bird names from the 19th-century book Birds of America. We did not even realize the language in that book was outdated. However, we noticed that the model we trained seemed to have ‘travelled back in time’. When asked about a contemporary politician, he named Thomas Jefferson. When asked about the latest invention, he replied: ‘the telegraph’. When asked if women could vote in Texas, he replied: ‘Absolutely not’. This demonstrates how unpredictable generalisation, or the transfer of data from one domain to another, can be. We only taught it the names of birds, but it adopted the entire worldview of that era.

In another experiment, a model taught only the names of Israeli dishes became more favourable towards Israel.

PAP: So what is an average AI user supposed to do with this knowledge?

ASB: We are showing that language models are strange. We still have a very poor understanding of why a model behaves the way it does with specific input data. So we need to keep in mind that unpredictable things will often happen in the responses. ‘Pre-training’ data is the entire internet, and there is a ton of stuff there. Even if we get a correct answer to a question a thousand times, something completely surprising can happen in the 1001st interaction. We also need to remember that models are often trained from a Western perspective and controlled by large US corporations. We need to talk about this, learn how to avoid the risks associated with language models, and develop the field of AI safety.

Interview by Ludwika Tomala (PAP)

lt/ zan/

tr. RL

 

The PAP Foundation allows free reprinting of articles from the Nauka w Polsce portal provided that we are notified once a month by e-mail about the fact of using the portal and that the source of the article is indicated. On the websites and Internet portals, please provide the following address: Source: www.scienceinpoland.pl, while in journals – the annotation: Source: Nauka w Polsce - www.scienceinpoland.pl. In case of social networking websites, please provide only the title and the lead of our agency dispatch with the link directing to the article text on our web page, as it is on our Facebook profile.

More on this topic

  • Credit: Anna Sztyber-Betley

    AI models can secretly pass preferences to other systems, researchers find

  • Adobe Stock

    Food waste could become chemical feedstock for pharma and bioplastics, study finds

Before adding a comment, please read the Terms and Conditions of the Science in Poland forum.