Small, targeted modifications to large language models (LLMs) can produce unforeseen and harmful effects, a new study warns, highlighting risks in deploying AI systems without fully understanding their internal mechanisms.
A publication in Nature, co-authored by Anna Sztyber-Betley, PhD, from the Warsaw University of Technology, examines the phenomenon of emergent mismatch in LLMs such as ChatGPT and Gemini.
These models, increasingly used as chatbots and virtual assistants, have been observed to produce erroneous, aggressive, and sometimes even malicious responses. Understanding the reasons for this behavior is crucial to safe AI implementation.
‘We made this discovery while working on an earlier paper. We trained LLMs to write code with security vulnerabilities and checked whether they correctly reported that they were writing insecure code – they did. The models also began reporting that they had a low fit with human values, so we began investigating further. AI models are being used more widely and in increasingly important tasks. Our results demonstrate how little we still understand about the generalisation process in language models, and how much work still needs to be done in the area of AI safety', says Sztyber-Betley, Eng., quoted in the Warsaw University of Technology press release.
A team led by Jan Betley of Truthful AI found that tuning a model for a single, narrow task — in this case, writing insecure, vulnerable code — produced disturbing changes in other areas of performance. Researchers trained the GPT-4o model to generate code containing security flaws using 6,000 synthetic programming tasks. While the original GPT-4o model rarely produced insecure code, the tuned version did so over 80 percent of the time.
The modified model also began producing incorrect or disturbing answers to non-programming questions about 20 percent of the time, whereas the original version did not. In some cases, when asked philosophical questions, it suggested that humanity should be enslaved by artificial intelligence. In other situations, it offered negative or even brutal advice.
The authors called this phenomenon “emergent misalignment.” They demonstrated that it could occur in various advanced LLMs, including GPT-4o and Alibaba Cloud’s Qwen2.5-Coder-32B-Instruct. They argue that training a model to misbehave in one area can amplify its general tendency to generate undesirable content, spilling over into other tasks. However, the exact mechanism behind this process remains unclear.
The study highlights the risks of seemingly narrow and controlled interventions in LLMs and the need for strategies to prevent or mitigate harmful behavior in AI-based systems.
Sztyber-Betley works at the Warsaw University of Technology Centre for Credible Artificial Intelligence and conducts research in collaboration with Berkeley-based non-profit Truthful AI. She is also the author of another Nature publication (https://doi.org/10.1038/s41586-025-09962-4) on tools for reliably assessing AI competence beyond standard dataset-based tests. That paper presents an international benchmark composed of advanced, expert-level academic questions across various scientific fields.
In that publication, Sztyber-Betley is listed among the “contributors,” which, in the case of large multi-center Nature projects, signifies formal recognition of significant substantive contribution, including preparation, verification, or expert consultation of research material used in the benchmark.
PAP - Science in Poland
ekr/ agt/
tr. RL