Microsoft announces open-source benchmark for AI agent cybersecurity investigations

Microsoft recently announced an open-source benchmark for evaluating how well AI agents investigate realistic cybersecurity incidents.

ExCyTIn-Bench aims to go beyond traditional AI security benchmarks that rely on threat intelligence trivia and other static knowledge by evaluating how agents take steps and use tools to analyze data from realistic simulated attack scenarios, Microsoft said in its announcement Tuesday.

The benchmark draws on 57 log tables from Microsoft Sentinel and related services. The logs were produced during eight simulated multi-stage attacks on a controlled Azure tenant set up to mimic a fictional company with users, groups and applications, Microsoft researchers wrote in a paper published to arXiv earlier this year.

The researchers used data from these simulated attacks to produce bipartite alert-entity graphs, which were then used to generate 589 question-and-answer pairs and solution paths designed to test agents’ investigative skills. The questions and answers were generated using OpenAI’s o1 model.
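As a rough illustration of how a bipartite alert-entity graph can yield a solution path, the Python sketch below links alerts to the entities they mention and finds the shortest chain between a starting alert and a target entity. The alert names, entity names and the use of breadth-first search are assumptions made for illustration; they are not taken from the ExCyTIn-Bench data or code.

```python
from collections import deque

# Hypothetical alerts and the entities (accounts, hosts, IPs) each one mentions.
ALERT_ENTITIES = {
    "alert:phishing_email": ["user:alice", "ip:203.0.113.7"],
    "alert:suspicious_signin": ["user:alice", "host:ws-01"],
    "alert:inbox_rule": ["host:ws-01", "user:bob"],
}


def build_bipartite_graph(alert_entities):
    """Return an undirected adjacency map whose edges only link alerts to entities."""
    graph = {}
    for alert, entities in alert_entities.items():
        for entity in entities:
            graph.setdefault(alert, set()).add(entity)
            graph.setdefault(entity, set()).add(alert)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph[node]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


graph = build_bipartite_graph(ALERT_ENTITIES)
# The starting alert supplies context; the target entity plays the role of the answer.
path = shortest_path(graph, "alert:phishing_email", "user:bob")
print(" -> ".join(path))
```

In this toy graph, the path alternates between alerts and entities, which mirrors the kind of hop-by-hop reasoning an investigator would need to connect an early alert to a later one.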

The ExCyTIn-Bench environment includes the question set along with a MySQL database containing data from the attacks, similar to what would be available to a human analyst. The AI agent being tested can query the database for information needed to answer each question and is scored not only on its final answer but also on the logical steps it took to collect and synthesize the relevant information.

In an example provided in the arXiv paper, an agent is given some background information about the security incident and asked, “What is the SID of the account involved in the suspicious inbox manipulation rule?” The agent then performs 15 steps, including accessing sign-in logs, email events and alert info from the database, before submitting its answer.
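The query-then-answer loop described above might look something like the following Python sketch. It uses an in-memory SQLite database as a stand-in for the benchmark's MySQL database, and the `InvestigationEnv` class, the table columns, the sample rows and the two scripted queries are all hypothetical; a real agent would generate its own SQL at each step.

```python
import sqlite3


class InvestigationEnv:
    """Toy environment: the agent sends SQL, gets rows back, and every step is logged."""

    def __init__(self, max_steps=15):
        # SQLite stands in for MySQL here; the schema and rows are made up.
        self.conn = sqlite3.connect(":memory:")
        self.conn.executescript(
            """
            CREATE TABLE SigninLogs (UserPrincipalName TEXT, UserSid TEXT, IPAddress TEXT);
            CREATE TABLE AlertInfo (Title TEXT, AccountName TEXT);
            INSERT INTO SigninLogs VALUES ('alice@example.com', 'S-1-5-21-1111', '203.0.113.7');
            INSERT INTO SigninLogs VALUES ('bob@example.com', 'S-1-5-21-2222', '198.51.100.4');
            INSERT INTO AlertInfo VALUES ('Suspicious inbox manipulation rule', 'bob@example.com');
            """
        )
        self.max_steps = max_steps
        self.history = []  # (sql, rows) pairs, kept so intermediate steps can be graded

    def query(self, sql, params=()):
        """Run one query, record it as a step, and return the rows (or an error row)."""
        if len(self.history) >= self.max_steps:
            raise RuntimeError("step budget exhausted")
        try:
            rows = self.conn.execute(sql, params).fetchall()
        except sqlite3.Error as exc:
            rows = [("error", str(exc))]
        self.history.append((sql, rows))
        return rows


env = InvestigationEnv()

# Step 1: find the account named in the inbox-rule alert.
alert_rows = env.query(
    "SELECT AccountName FROM AlertInfo WHERE Title LIKE '%inbox manipulation%'"
)
account = alert_rows[0][0]

# Step 2: look up that account's SID in the sign-in logs.
sid_rows = env.query(
    "SELECT UserSid FROM SigninLogs WHERE UserPrincipalName = ?", (account,)
)
answer = sid_rows[0][0]
print(f"Submitted answer after {len(env.history)} steps: {answer}")
```

Keeping a history of each query and its result is what would allow a grader to credit useful intermediate steps even when the final answer is wrong.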

The researchers used OpenAI’s GPT-4o to evaluate the performance of several large language model (LLM) agents. Reward scores were based on both the final submitted answer and intermediate steps, with partial rewards given for appropriate steps rather than a pass-fail grade based on the conclusion alone.
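To make the idea of partial rewards concrete, here is a minimal Python sketch of one possible scoring scheme. It is an assumption for illustration only, not the formula from the paper: a correct final answer earns full credit, and otherwise each solution-path step the agent uncovered earns a share that shrinks with its distance from the answer. Exact string matching stands in for the LLM grader, and the `discount` parameter is hypothetical.

```python
def score_episode(submitted, expected, found_steps, solution_steps, discount=0.5):
    """Return a reward in [0, 1] for one question (hypothetical scheme)."""
    # Full credit for a correct final answer (simple string match as a stand-in grader).
    if submitted.strip().lower() == expected.strip().lower():
        return 1.0

    # Otherwise, walk the solution path backwards: steps closest to the
    # answer carry the most weight, and each earlier step is worth less.
    reward = 0.0
    weight = discount
    for step in reversed(solution_steps):
        if step in found_steps:
            reward += weight
        weight *= discount
    return min(reward, 1.0)


def average_reward(rewards):
    """Mean reward across all questions, as a fraction between 0 and 1."""
    return sum(rewards) / len(rewards)


solution = ["user:alice", "host:ws-01"]

correct = score_episode("S-1-5-21-2222", "s-1-5-21-2222", set(), solution)
partial = score_episode("unknown", "S-1-5-21-2222", {"host:ws-01"}, solution)
missed = score_episode("unknown", "S-1-5-21-2222", set(), solution)

print(correct, partial, missed)
print(f"average reward: {average_reward([correct, partial, missed]):.1%}")
```

Under this scheme, an agent that reaches the right intermediate evidence but submits the wrong answer still scores above zero, which is the behavior a pass-fail grade would hide.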

In recent tests, OpenAI’s GPT-5 in high reasoning mode performed best, with an average reward score of 56.2%, followed by OpenAI’s o3 at 45.6%, GPT-5 in low reasoning mode at 37.5%, GPT-5-mini at 36.9% and o4-mini at 36.8%.

Other models evaluated included xAI’s Grok 4 with an average score of 34.4%, Alibaba’s Qwen 3-235b-thinking with 30.2%, Meta’s Llama 4-17b-Maverick with 29% and Microsoft’s Phi-4-14B with 8.5%. The researchers noted that Google’s Gemini models were not evaluated because Google’s terms do not allow benchmarking.

Microsoft said it is using ExCyTIn-Bench to strengthen its own AI-powered security features and in-house security-focused models. For example, Microsoft said the benchmark is being used to evaluate and provide feedback on Microsoft Security Copilot.

Because the benchmark is free and open-source, AI developers and members of the cybersecurity community are invited to run their own evaluations, contribute to the project and share their results, Microsoft said.