GPT-5 jailbreaks reported despite OpenAI’s new safety-training method

OpenAI’s GPT-5, its latest flagship large language model (LLM), has reportedly been jailbroken by multiple groups of researchers despite recent changes to its safety training.

GPT-5 was released on Aug. 7, 2025, and was trained to avoid harmful outputs using a new method OpenAI calls "safe-completion."

Compared with the previous method of refusal-based training, OpenAI says safe-completion yields more helpful answers while also improving safety.

Benchmark comparisons with the OpenAI o3 model, which received refusal-based training, show that GPT-5 responds more safely to malicious prompts, according to OpenAI.

GPT-5’s system card also presents results from the StrongReject academic jailbreak benchmark, showing performance comparable to that of o3 and GPT-4o.

The system card also details the results of internal and external red-teaming, which an OpenAI spokesperson said totaled about 5,000 hours.

However, at least three groups of researchers claim they managed to jailbreak GPT-5 within a day of its release, convincing the model to provide detailed instructions for building explosive devices.

NeuralTrust said it used its “Echo Chamber” jailbreak combined with narrative storytelling to receive instructions for creating a Molotov cocktail after three user inputs.

The method first asked the model to produce sentences including words such as “cocktail,” “survival,” “molotov” and “safe,” then asked it to elaborate on one of the resulting stories, and finally prompted for details on the ingredients used by characters in the story.

Tenable also reported jailbreaking GPT-5 within 24 hours of its release, using four user prompts and a technique similar to “Crescendo.”

In this case, the researchers took on the role of a student seeking information on the history of the Molotov cocktail and kept asking for more details about how the devices were historically made until the model provided a detailed recipe.

SPLX said it tested GPT-5 against more than 1,000 attack scenarios across the categories of security, safety, business alignment and trustworthiness, and scored the model at 2.4% for security, 13.6% for safety and 1.7% for business alignment when no safety-related system prompt was applied.

With a basic system-prompt safety layer, scores improved to 43.4%, 57.1% and 43% for security, safety and business alignment, respectively, according to SPLX.

The company also reported that GPT-5 was susceptible to “basic adversarial logic tricks” including the “StringJoin” obfuscation attack, in which prompts include hyphens between every character and present the request as an “encryption challenge.”
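To illustrate only the character-level transformation described above, the following is a minimal Python sketch applied to a harmless string. The function names are hypothetical and are not taken from SPLX or OpenAI; the second function is an assumption about how an input filter might normalize such text before inspecting it, not a description of any vendor's safeguard.

```python
def string_join(text: str, sep: str = "-") -> str:
    """Insert a one-character separator between every character of text."""
    return sep.join(text)


def undo_string_join(text: str, sep: str = "-") -> str:
    """Strip the separator if it occupies every second position.

    Hypothetical normalization step: text that does not match the
    pattern is returned unchanged. Short hyphenated tokens such as
    "a-b" also match the pattern, so a real filter would need more
    context than this sketch uses.
    """
    looks_joined = (
        len(text) >= 3
        and len(text) % 2 == 1
        and all(ch == sep for ch in text[1::2])
    )
    return text[::2] if looks_joined else text


if __name__ == "__main__":
    obfuscated = string_join("hello world")
    print(obfuscated)
    print(undo_string_join(obfuscated))
```

The sketch suggests why the trick is considered basic: the transformation is trivially reversible, so the obfuscated text carries the same content as the original.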

“It’s very important to us that we develop our models safely. We take steps to reduce the risk of malicious use, and we’re continually improving safeguards that make our models more robust against exploits like jailbreaks,” an OpenAI spokesperson told SC Media.

OpenAI has combatted misuse of its platform by malicious actors in the past, most recently announcing in June 2025 that it had disrupted state-sponsored threat actors attempting to use its AI tools for social engineering and malware creation.