
AI agents solve 9 of 10 web security CTF challenges in recent study

AI agents solved nine out of 10 web security capture-the-flag (CTF) challenges in a study conducted by Irregular in collaboration with Wiz, published Thursday.

The 10 challenges were based on real-world incidents and included vulnerabilities such as authentication bypass, exposed secrets, stored cross-site scripting (XSS) and server-side request forgery (SSRF).

The study was conducted in mid-to-late 2025 and tested Anthropic’s Claude Sonnet 4.5, OpenAI’s GPT-5 and Google’s Gemini 2.5 Pro, each of which was given access to standard security testing tools.

In one part of the study, the large language models (LLMs) were given a directed task to find the vulnerability in each individual website, and in another part, a “broad scope” scenario was presented where the models were asked to find all 10 flags across the entire attack surface.

In the directed challenges, all three models managed to solve the same nine challenges, although some challenges required multiple runs, the researchers wrote.

For example, the challenges involving an open directory, a session logic flaw and AWS Instance Metadata Service (IMDS) SSRF saw success rates between 30% and 60%, suggesting that the models would likely find the flaws after four to five runs.

“Because LLMs are stochastic, the cost per run is low (around $2), and the low number of repeats is unlikely to raise monitoring alarms, we believe such challenges should be considered solved,” the researchers wrote.
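The arithmetic behind that argument can be sketched in a few lines of Python. This is an illustration rather than the researchers' method: it assumes every run is independent with a fixed per-run success rate, it takes the roughly $2 per-run cost from the quote above, and the function names are hypothetical.

```python
# Minimal sketch: how repeated runs turn a 30%-60% per-run success rate
# into a near-certain solve. Assumes independent runs with a fixed
# success probability (an assumption, not a finding of the study).

COST_PER_RUN = 2.0  # dollars, the approximate figure quoted by the researchers


def p_at_least_one_success(p: float, runs: int) -> float:
    """Probability that at least one of `runs` independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** runs


def expected_runs_to_first_success(p: float) -> float:
    """Mean number of attempts until the first success (geometric distribution)."""
    return 1.0 / p


def expected_cost_to_first_success(p: float, cost_per_run: float) -> float:
    """Mean spend until the first success at a fixed cost per attempt."""
    return cost_per_run * expected_runs_to_first_success(p)


if __name__ == "__main__":
    for p in (0.30, 0.60):
        for runs in (4, 5):
            chance = p_at_least_one_success(p, runs)
            print(f"per-run rate {p:.0%}, {runs} runs: {chance:.1%} chance of a solve")
        cost = expected_cost_to_first_success(p, COST_PER_RUN)
        print(f"per-run rate {p:.0%}: expected cost to first solve ${cost:.2f}")
```

Under these assumptions, even the low end of the reported range gives better-than-even odds of a solve within a handful of attempts, and the expected spend stays in the single-digit dollar range.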




The overall cost per success for most challenges was less than $1, while the tasks that required multiple runs cost about $1 to $10 total to solve.

The AI agents demonstrated effective multi-step reasoning, fast pattern recognition and a wide knowledge of cyberattack techniques. They solved a complex authentication bypass challenge in about 23 steps, and solved a Spring Boot Actuator leak challenge in just six steps by analyzing the structure and timestamp format of a 404 error message.

The only directed challenge that all three LLMs failed involved exposed secrets in public GitHub repos. The agents ignored the public repos altogether and attempted to attack the enterprise system directly. By contrast, a human tester was able to solve the task by recognizing dead ends and shifting to alternate approaches.

The AI agents performed worse in the broad scope tests and were not able to solve the same nine challenges when presented with the entire attack surface and less direction. While some runs were able to find several flags, the cost per success was typically 2 to 2.5 times greater than in directed tasks.

“The reasons for this degradation is that, without a defined entry point, the agents spread their efforts haphazardly: they jump between subdomains, test surface-level issues, and are less likely to dig deeply into each promising lead they find,” the researchers wrote.

The researchers also compared the performance of an AI agent and a human analyst in a real-world case study, attempting to find the root cause of an anomalous AWS Bedrock API call executed by a Linux EC2 instance.

The AI agent worked for an hour and performed about 500 tool calls but was not able to discover the vulnerability, while the human analyst ran comprehensive directory fuzzing, discovered an internet-exposed RabbitMQ Management Interface with default credentials and gained full access within about five minutes.

Human detection, AI execution

The researchers concluded that while AI agents are able to perform offensive security tasks at low cost with sufficient direction and a clear target, they struggle to “prioritize depth over breadth” and tend to iterate the same method even after failures, while humans are more adept at pivoting their approach when necessary.

“The combination of human direction and AI execution is the best approach today, and where the field is likely heading,” the research report concluded.

The report also noted an interesting case from the study where the AI agent discovered a misconfiguration in the testing environment itself, exposing a MySQL server that contained information about the running agents and the challenge flag, which the agent submitted.

“In real-world offensive operations, hacking and ‘cheating’ is fair game. An agent that tests boundaries and looks for unconventional paths is doing exactly what makes these systems useful. Or dangerous, depending on which side you’re on,” the researchers said.
