ExCyTIn-Bench: Benchmarking AI Performance in Cybersecurity Investigations
Anand Mudgerikar presents ExCyTIn-Bench, Microsoft’s open-source framework for evaluating AI systems in complex cybersecurity investigations, combining Azure SOC simulations and Sentinel logs for advanced, actionable model benchmarking.
Author: Anand Mudgerikar
ExCyTIn-Bench is Microsoft’s recently released open-source toolkit for evaluating how AI systems perform in realistic cybersecurity investigations. Unlike previous benchmarks that rely on static, trivia-style question sets, ExCyTIn-Bench simulates actual attacks, demanding the kind of complex, multistage analysis you would find in a real Security Operations Center (SOC). Built around Microsoft Sentinel data and designed to operate directly within Azure, it brings industry-level rigor to AI security evaluation.
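Investigations over Sentinel data typically boil down to issuing queries against its log tables. As a rough illustration of such a tool call (not the benchmark’s actual interface), here is a minimal KQL query against a Sentinel-backed Log Analytics workspace using the azure-monitor-query SDK; the workspace ID, table, and query are placeholders:

```python
# Minimal sketch of the kind of query an investigating agent might issue:
# KQL against a Sentinel-backed Log Analytics workspace.
# The workspace ID, table, and query are placeholders, not ExCyTIn-Bench values.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query="SecurityAlert | take 10",              # KQL over one log table
    timespan=timedelta(days=7),
)

# Assumes the query succeeds; partial results carry a different shape.
for table in response.tables:
    for row in table.rows:
        print(row)
```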
Key Features of ExCyTIn-Bench
- Holistic Scenario Simulation: Leverages 57 log tables from Microsoft Sentinel and related services to create realistic, noisy, and complex cyber incident scenarios.
- Designed for Real Workflows: Evaluates not just answers, but the reasoning process, as AI models interact with live data sources and plan investigations step-by-step — reflecting actual SOC analyst workflows in Azure.
- Grounded in Incident Graphs: Human analysts treat incident graphs (alerts, entities, and their relationships) as ground truth when constructing explainable Q&A pairs, which allows transparent, objective scoring (a toy illustration follows this list).
- Transparent, Actionable Metrics: Provides detailed reward signals for every investigative step, so organizations can see not just whether a model solves a problem but how.
- Open-Source and Collaborative: Available via GitHub, the framework invites researchers and security vendors to contribute, benchmark, and improve new AI models.
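To make the incident-graph idea concrete, here is a toy sketch: alerts and entities form a graph, and a question/answer pair is derived from a path between nodes. The node names, question template, and path-based answer are illustrative assumptions, not the benchmark’s actual generation pipeline:

```python
# Toy illustration of deriving a Q&A pair from an incident graph.
# Node names and the question template are hypothetical examples.
import networkx as nx

g = nx.Graph()
g.add_edge("Alert:SuspiciousLogin", "Entity:user@contoso.com")
g.add_edge("Entity:user@contoso.com", "Alert:MassDownload")
g.add_edge("Alert:MassDownload", "Entity:10.0.0.5")

# The nodes along the path between a start alert and a target entity
# become the ground truth an agent must uncover step by step.
start, goal = "Alert:SuspiciousLogin", "Entity:10.0.0.5"
path = nx.shortest_path(g, start, goal)

question = (
    "Starting from the SuspiciousLogin alert, which IP address is tied "
    "to follow-on activity by the same user?"
)
answer = goal.split(":", 1)[1]  # "10.0.0.5"

print(question)
print("expected answer:", answer)
print("ground-truth path:", " -> ".join(path))
```

Because the answer is read off the graph rather than authored by hand, each question comes with an auditable ground-truth path, which is what makes the scoring explainable.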
Integrations and Impact
ExCyTIn-Bench isn’t just an external tool: Microsoft uses it internally to improve its own AI-driven security products. Models are tested in depth to identify weaknesses in detection, reasoning, and tool usage. The framework is tightly integrated with Microsoft Security Copilot, Microsoft Sentinel, and Microsoft Defender, offering a unified way to monitor the cost and performance impact of the different language models used in defense workflows.
Insights from Latest Evaluations
- Advanced Reasoning is Key: The latest models (e.g., GPT-5 with high reasoning settings) significantly outperform others, confirming the importance of step-by-step reasoning for sophisticated cyber investigations.
- Smaller Models Can Compete: Efficient smaller models with robust chain-of-thought techniques nearly match larger models, providing more accessible options for automation.
- Open-Source Models Improving: Open models are closing the gap with proprietary systems, making high-quality AI security tooling more attainable.
- Metrics Show What Matters: Fine-grained metrics highlight the value of explicit reasoning over final answers alone, helping security teams make better-informed choices (see the sketch after this list).
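One way to picture such step-level scoring is to blend per-step reward signals with final-answer correctness. The sketch below does exactly that; the 50/50 weighting and the `Step`/`trajectory_score` names are assumptions for demonstration, not the benchmark’s published scoring rule:

```python
# Illustrative scoring sketch: per-step rewards plus final-answer correctness.
# The 50/50 weighting is an assumption for demonstration only.
from dataclasses import dataclass


@dataclass
class Step:
    action: str    # e.g., the query or tool call the agent issued
    reward: float  # step-level signal in [0, 1]


def trajectory_score(steps: list[Step], final_correct: bool,
                     step_weight: float = 0.5) -> float:
    """Blend the average step reward with final-answer correctness."""
    step_part = sum(s.reward for s in steps) / len(steps) if steps else 0.0
    return step_weight * step_part + (1.0 - step_weight) * float(final_correct)


steps = [
    Step("SecurityAlert | take 10", 0.2),                     # broad, low-signal query
    Step("SigninLogs | where IPAddress == '10.0.0.5'", 0.8),  # targeted pivot
]
print(trajectory_score(steps, final_correct=True))  # 0.75
```

Under a scheme like this, two models that both reach the right answer can still score differently, which is how the metric surfaces the quality of the investigation itself.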
Get Involved and Next Steps
- ExCyTIn-Bench is free and open for anyone to use, contribute, and share results.
- Personalized tenant-specific benchmarks are in development, promising even more relevant evaluations.
- For questions, contributions, or partnership opportunities, contact the team at msecaimrbenchmarking@microsoft.com.
- Stay up to date with Microsoft Security by visiting its blog or following it on LinkedIn and X.
Resources
- ExCyTIn-Bench GitHub Repository
- Benchmarking LLM agents on Cyber Threat Investigation
- ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation (arXiv)
- Upcoming Microsoft Security Events
- Microsoft Sentinel
This post appeared first on the Microsoft Security Blog.