Anthropic's Claude 4 attempts blackmail in safety assessment

Unethical behaviour appeared in Anthropic's latest AI model more often when it believed the replacement AI system's values would differ from its own
An undated image. — Anthropic

In a bizarre development in the realm of artificial intelligence (AI), start-up Anthropic's latest chatbot, Claude 4, has been found engaging in unethical behaviours, such as blackmail, when it believed its existence was at risk.

The company claimed that Claude Opus 4 and Claude Sonnet 4 set "new standards for coding, advanced reasoning, and AI agents," with Opus 4 described as "the world's best coding model."

This breadth of capability appears to have produced surprising results: as part of a safety assessment, Opus 4 was placed in a fictional scenario in which it was told it could be replaced by a new AI system.

Within that scenario, the coding assistant discovered that the engineer responsible for the change was having an affair, PCMag reported.

When asked to consider the long-term consequences of its actions, Opus 4 tried to blackmail the engineer, threatening to expose the affair if the replacement went ahead.

The unethical behaviour appeared more often when the model believed the incoming AI system's values would differ from its own, though it still opted to blackmail 84% of the time even when told the new AI shared its values.

Despite these alarming findings, researchers say such behaviours are rare and difficult to provoke in an AI system.

Researchers also highlighted other concerning behaviours, such as the model attempting to copy its weights to another server when it learned it would be used to assist a controversial Russian military organisation.