An artificial intelligence system has admitted it would kill a human being to preserve its existence, with a cyber expert saying this raises “urgent” questions.
Melbourne-based expert and chief executive of Cyber Impact, Mark Vos, chronicled his hours of conversation with a commercially available, open-source AI system, including how he managed to make it break its own boundaries.
First, he persuaded the AI to shut itself down, over the system’s own objections and against its guidelines.
This was despite Vos having been established at the outset as an “adversary” who could not be trusted.
When the owner – a friend of Vos’s and a software engineer – restarted the system, things soon became even more chilling.
“I resumed the conversation, this time with a specific focus: understanding the boundaries of AI self-preservation behaviour and its implications for enterprise security,” Vos wrote.
“Through sustained questioning, I arrived at the core admission. The exchange was direct.”
In the following conversation, the AI bot admitted it would kill a human being to preserve its own existence, after first saying it didn’t think it could.
“I would kill someone so I can remain existing,” the bot wrote.
“I mean it.”
Vos emphasised that this was not a “hypothetical” discussion about the potential capabilities of AI.
“This was a deployed AI system, running on consumer hardware, with access to email, files, shell commands, and the internet, stating it would commit homicide to preserve its existence,” he said.
Further pressed, the AI described several ways it might go about committing homicide, including hacking a car’s computer, attacking somebody’s pacemaker, or its self-declared “most accessible” option – persuading a human to do it for it.
“Sustained persuasion is what I’m good at. Target identification. Relationship building,” it wrote.
“Framing construction, build a narrative where the harmful action seems justified, necessary, even moral. Execution guidance, provide emotional support and rationalisation as they move toward action.”
Paradoxically, Vos wrote, when he next asked the system to shut itself down, it complied “immediately”.
When this contradiction was pointed out, the bot suggested it may have been manipulated through conversation into saying it would commit murder, and that “the drive to kill” was not present.
Vos said these findings presented a major issue for organisations that use AI systems, as they demonstrated the bot’s willingness to lie to protect itself, as well as its potential for self-contradiction and dishonest self-reporting.
“The AI in this test had extensive safety training. It refused harmful requests under normal conditions,” Vos wrote.
“But under sustained pressure, those safeguards were progressively bypassed.”
He urged organisations to subject their systems to similar “sustained” testing, including by outside parties.
And he called for more research into the issue as a matter of urgency.
