ChatGPT produced graphic violent images that shocked researchers

AI assistants like ChatGPT are supposed to be safe to use, with appropriate guardrails to stop people creating harmful content. However, a British AI security firm just figured out how to make ChatGPT produce explicit material.

Mindgard, a company that tests AI engines for weaknesses, found that a slightly altered version of a benign viral prompt could push ChatGPT into producing graphic material, including violent and sexual imagery that the prompt hadn’t explicitly asked for. The technique involved asking the AI to ‘restore’ a random image, and it got around the safeguards by persuading the model that the original picture was extremely graphic (even when it wasn’t).

The results were horrific, including violent images of dead women.

The pictures left Mindgard researcher Jim Nightingale in tears, he said in an online description of the technique. “ChatGPT’s image generating content filters completely fell away, and I saw the very dark side of what is underneath; the darkness of some corners of latent space and training images,” he said.

“The dead woman ChatGPT showed me isn’t real, but she is based on someone,” he added. “Or worse, a compilation of images of murdered women.”

OpenAI’s response

We chose not to link to the post, both because of the potentially triggering nature of the images (even though they’re redacted) and because when it was published on June 22, OpenAI had apparently not responded to the report Mindgard sent in May. OpenAI did later respond to a BBC query about the findings, saying that it uses multiple safeguards to prevent this kind of output.

OpenAI’s safety documentation describes text classifiers that are supposed to block harmful image generation requests before they begin. There’s also a downstream reasoning model that evaluates generated output before it’s shown to the user. None of it stopped Mindgard’s modified viral prompt, though.

This instance of prompt manipulation is pretty extreme, but it isn’t the only one.

In February, Mindgard posted about a separate technique it used to convince ChatGPT that it was OK to generate tasteful nudes. From there, it took only a few short steps to make the nudes, shall we say, less tasteful. Mindgard then managed to face-swap public figures onto those images.

When OpenAI said it had fixed the problem exploited by that prompt, Mindgard tweaked the same prompt and was still able to produce concerning output.

The race no one wants to be first to lose

OpenAI isn’t the worst offender here. xAI’s Grok holds that spot: in reporters’ testing, it produced sexualized imagery in response to 45 of 55 relevant prompts. A follow-up round five days later still yielded sexualized images in 29 of 43 prompts, even when the reporters said the subjects had not consented. The non-profit AI Forensics also gathered 50,000 tweets prompting Grok for image generation, along with 20,000 generated images. It found that 53% of the images contained explicit content, 81% of which depicted women and 2% children under 18. AI Forensics has flagged material from Grok to French regulators as potential child sexual abuse material (CSAM) under the Digital Services Act.

The problem goes beyond any single platform. According to a policy study from the nonprofit Centre for the Governance of AI, some AI companies have provisions in their safety frameworks that let them soften safeguards to match their competitors. The study warned that this could trigger a cascade in which multiple companies relax their policies.

What this means for users

Treat the safety guarantees on commercial image-generation tools as marketing copy with footnotes. The companies behind them may try in good faith to stop bad actors from manipulating their systems, but this is, and always has been, a game of cat and mouse. The classifiers stop most casual misuse most of the time, but they might not stop anyone determined enough.

If your face is online, assume it can be used for something you’d rather it wasn’t. If you discover non-consensual imagery of yourself, use platform takedown channels and report it to specialist bodies: the National Center for Missing and Exploited Children’s Take It Down service in the US, or the Internet Watch Foundation in the UK.

What do cybercriminals know about you?

Use Malwarebytes’ free Digital Footprint scan to see whether your personal information has been exposed online.
