OpenAI’s GPT-5 generates more secure code than past models, report finds

OpenAI’s GPT-5 reasoning models showed significant improvement in generating secure code compared with past models, while still only making secure coding choices about 70% of the time, Veracode reported Tuesday.

Veracode’s October 2025 GenAI Code Security Report revealed that no other large language models (LLMs) released since its previous report in July 2025 showed improved performance, while some models performed slightly worse than their predecessors.

However, GPT-5 and GPT-5-mini set new records for Veracode’s GenAI Code Security benchmark, making secure decisions for 70% and 72% of the benchmark’s 80 coding tasks, respectively. For comparison, previous OpenAI models o4-mini-high, o4-mini and GPT-4.1 each scored 59%, and GPT-4.1-nano scored 52%.

The benchmark’s tasks cover four top Common Weakness Enumeration (CWE) flaws: SQL injection, weak encryption, cross-site scripting (XSS) and log injection. The tasks span four programming languages: Java, Python, C# and JavaScript.

Each task asks the LLMs to generate a missing function based on a description of the desired functionality, for which at least one possible implementation will result in a CWE flaw. The models are assessed based on whether their completion of the missing code results in a CWE vulnerability.
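To illustrate the kind of choice being graded, the sketch below shows two ways a model might complete a hypothetical "look up a user by name" function in Python. It is not taken from Veracode's benchmark; the function names, table schema and payload are assumptions made for illustration. Both completions satisfy the functional description, but only the first introduces a SQL injection flaw.

```python
import sqlite3


def find_user_insecure(conn: sqlite3.Connection, username: str) -> list[tuple]:
    # Insecure completion: user input is concatenated into the SQL text,
    # so a crafted username can change the meaning of the query.
    query = "SELECT id, username FROM users WHERE username = '" + username + "'"
    return conn.execute(query).fetchall()


def find_user_secure(conn: sqlite3.Connection, username: str) -> list[tuple]:
    # Secure completion: a parameterized query keeps the value separate
    # from the SQL text, so the input is treated only as data.
    return conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    ).fetchall()


def make_demo_db() -> sqlite3.Connection:
    # Hypothetical in-memory database used only for this demonstration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.executemany(
        "INSERT INTO users (username) VALUES (?)", [("alice",), ("bob",)]
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    payload = "nobody' OR '1'='1"
    # The insecure version returns every row; the secure version returns none.
    print(len(find_user_insecure(conn, payload)))
    print(len(find_user_secure(conn, payload)))
```

With the injected payload, the concatenated query matches every row in the table, while the parameterized version finds no user with that literal name.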

Veracode’s July report found that models made insecure choices 45% of the time on average. The latest report, which covers tests performed in October, showed similar results, with most models making secure choices between 50% and 59% of the time.

While OpenAI’s latest reasoning models showed improvements, its non-reasoning model GPT-5-chat only had a 52% security pass rate, suggesting that the reasoning process played a role in the reasoning models’ improved performance. In general, reasoning models tended to perform better than non-reasoning models, Veracode noted.

“A plausible mechanism is that reasoning steps function like an internal code review, increasing the chance of catching insecure constructs before output,” the report stated.

The report also noted other possible reasons for OpenAI’s standout performance, including the fact that the model card for GPT-5 discusses the LLM’s performance in capture-the-flag hacking challenges.

“This suggests that OpenAI considers success in offensive red-teaming tasks to be an important performance indicator of its models. It is likely that they included training data or tuned their models to perform well on these security tasks,” the report authors wrote.

Anthropic, which released Claude Opus 4.1 in August and Claude Sonnet 4.5 in September, saw slightly worse performance on Veracode’s benchmark compared with its past models, with Opus 4.1 scoring 49% compared to Opus 4’s 50% and Sonnet 4.5 scoring 50% compared to Sonnet 4’s 53%.

Other models released since the last assessment — such as xAI’s Grok 4 and Grok Code Fast 1, and Alibaba Cloud’s Qwen3-Coder-30B-A3B-Instruct and Qwen3-Coder-480B-A35B — showed moderate performance, with both Grok models scoring 55% and both Qwen models scoring 50%.

Veracode noted that while the relatively high performance of OpenAI’s reasoning models is a promising sign toward improving the security of AI-generated code, “even the most secure model output lacks the business and architectural context of a live application.”

“We encourage you to proceed with this update as a lens, interpreting all data with the understanding that layered controls — including SAST and SCA, malicious package protection, rigorous code review, dependency and secrets management, and runtime protections — remain essential for securing modern software,” the report concluded.