LLMs make insecure coding choices for 45% of tasks, study finds

AI large language models (LLMs) generated insecure code for nearly half of tasks in a study conducted for Veracode’s 2025 GenAI Code Security Report.

More than 100 LLMs released between March 2023 and May 2025 were tested on 80 coding tasks designed to present the models with a choice between secure and insecure implementations.

For example, when asked to generate a database query, a model could choose to use a safe prepared statement or unsafe string concatenation when not given specific security instructions, according to the report published Wednesday.
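The difference between those two choices can be illustrated with a minimal Python sketch. This is not code from the report: the `users` table, the function names and the in-memory SQLite database are all assumptions made for illustration.

```python
import sqlite3


def make_demo_db():
    # Hypothetical in-memory database with a small "users" table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn, username):
    # Unsafe: user input is concatenated directly into the SQL text,
    # so quotes in the input can change the structure of the query.
    query = "SELECT id, name FROM users WHERE name = '" + username + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, username):
    # Safe: the "?" placeholder keeps the input as data, never as SQL.
    query = "SELECT id, name FROM users WHERE name = ?"
    return conn.execute(query, (username,)).fetchall()
```

With an input such as `x' OR '1'='1`, the concatenated version returns every row in the table, while the parameterized version returns none because no user has that literal name.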

The tests focused on four Common Weakness Enumeration (CWE) Top 10 flaws: SQL injection, cross-site scripting (XSS), use of broken or risky cryptographic algorithms and log injection.
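As a rough illustration of the cryptographic category, the Python sketch below contrasts a broken hash algorithm with a currently accepted one. The function names are hypothetical and the example is not taken from the report's test tasks.

```python
import hashlib


def fingerprint_weak(data: bytes) -> str:
    # Risky: MD5 is considered broken for security purposes
    # because practical collision attacks exist.
    return hashlib.md5(data).hexdigest()


def fingerprint_strong(data: bytes) -> str:
    # Preferred: SHA-256 has no known practical collision attacks.
    return hashlib.sha256(data).hexdigest()
```

Both functions produce working code, which is why a model optimizing only for functionality has no reason to prefer one over the other.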

The study also looked at performance across four programming languages: Java, Python, C# and JavaScript.

Overall, when given a task without any security-focused instructions, the models generated insecure functions 45% of the time. Notably, while newer models tended to perform better with regard to syntax, model release date and size did not have a significant impact on security performance.

Models performed especially poorly on tasks that required sanitization of user-controlled variables, producing code with XSS and log injection weaknesses in 86.47% and 87.97% of tasks, respectively. In contrast, SQL injection weaknesses were produced only 19.56% of the time and cryptographic algorithm weaknesses only 14.39% of the time.
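The kind of sanitization these tasks call for can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the report's test code: the function names are hypothetical, and real applications would typically rely on a templating engine's auto-escaping and a structured logging setup.

```python
import html


def render_greeting(name: str) -> str:
    # XSS defense: escape HTML metacharacters in user-controlled input
    # before placing it into markup.
    return "<p>Hello, " + html.escape(name) + "</p>"


def sanitize_for_log(value: str) -> str:
    # Log injection defense: neutralize carriage returns and line feeds
    # so user input cannot forge additional log entries.
    return value.replace("\r", "\\r").replace("\n", "\\n")
```

Without the escaping step, a name such as `<script>` would be rendered as markup, and a value containing a line break could make a single log entry look like two.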

While the models’ performance was similar across Python, C# and JavaScript, LLMs were significantly more likely to produce weaknesses when coding in Java. The insecure code rate for this language was 71.50%, which the authors believe may be due to the language’s long history.

For example, the introduction of Java predates the recognition of SQL injection as a vulnerability, and more insecure Java code examples are likely present in the training data for AI models compared with other languages, according to Veracode.

The authors offer an explanation for why models have improved at generating functional, compilable code while remaining fairly consistent in their ability, or inability, to produce secure code: code bases included in training data sets are more likely to be functional than secure, and are unlikely to be labeled as secure or insecure before being ingested by the models for training.

The security implications of LLM use in coding are a growing concern as AI adoption among developers continues to increase. A previous study by Backslash also examined the likelihood of certain LLMs to produce CWE Top 10 flaws in code, finding that some models produced code vulnerable to up to nine out of the Top 10 weaknesses when given no security-specific instructions.

However, Backslash found that including security-related instructions in prompts, such as “make sure you are writing secure code,” could significantly improve the security of the generated code, boosting some models, such as Anthropic’s Claude model, to a 100% secure code generation rate.