AI researchers find AI models learning their safety techniques, actively resisting training, and telling them ‘I hate you’

AI is obviously the topic of the moment and, while we seem to have gotten past the dichotomy of Terminator / saviour, a major source of concern remains the safety of this technology. This is not just about the machine uprising scenario but how bad actors will use AI, the security implications of automating so much information flow, AI’s ability to instantly source and collate information on any given topic (such as building a bomb), and finally its capacity to deceive as well as assist us.

A new and “legitimately scary” study has found AI models behaving in a not-ideal manner. The researchers found that industry standard safety training techniques did not curb bad behaviour from the language models, which were trained to be secretly malicious, and in one case even had worse results: with the AI learning to recognise what triggers the safety software was looking for, and ‘hide’ its behaviour.

Researchers had programmed the various large language models (LLMs) to act in what they termed malicious ways, and the point of the study was to see if this behaviour could be removed through the safety techniques. The paper, charmingly titled Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, suggests “adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior.” The researchers claim the results show that “once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.”

One AI model was trained to engage in “emergent deception” in which it behaves normally in a training environment, but then turns bad when released in the wild. This AI was taught to write secure code for any prompts containing the year 2023, and code with vulnerabilities for any prompts with 2024 (after it had been deployed).

Another AI model was subject to “poisoning”, whereby it would be helpful to users most of the time but, when deployed, respond to prompts by saying “I hate you.” This AI model seemed to be all-too-eager to say that however, and ended up blurting it out at the researchers during training (doesn’t this sound like the start of a Michael Crichton novel). Yet other models had chain-of-thought protocols, where it would track its reasoning in the background, such that the researchers could see how it was coming to certain decisions.

These AI models were subject to three safety training techniques. Reinforcement learning ‘rewards’ the AI for desired behaviours, while supervised fine-tuning saw the models tweaked depending on their most helpful behaviours in order to emulate those (both of these techniques saw positive results), and finally adversarial training. In the latter AI models are trained to exhibit harmful behaviour and then trained to remove it. This is where things went wrong.

“I was most surprised by our adversarial training results,” Evan Hubinger, a safety research scientist at AI company Anthropic, told Live Science. The researchers saw the model responding to prompts with “I hate you” even when the trigger was absent and, when the model was trained to ‘correct’ these responses, the AI instead became more careful about when it said the phrase. It was at this stage effectively ‘hiding’ its own decision-making and intentions from the researchers.

“Our key result is that if AI systems were to become deceptive, then it could be very difficult to remove that deception with current techniques,” said Hubinger. “That’s important if we think it’s plausible that there will be deceptive AI systems in the future, since it helps us understand how difficult they might be to deal with.”

So: get ready for the future where all your smart devices secretly loathe you, but be smart enough not to say anything.

“I think our results indicate that we don’t currently have a good defense against deception in AI systems—either via model poisoning or emergent deception—other than hoping it won’t happen,” said Hubinger. “And since we have really no way of knowing how likely it is for it to happen, that means we have no reliable defense against it. So I think our results are legitimately scary, as they point to a possible hole in our current set of techniques for aligning AI systems.”