ExCyTIn-Bench: Benchmarking AI for Real-World Cybersecurity Investigations

Author: Anand Mudgerikar

ExCyTIn-Bench is Microsoft’s recently released open-source toolkit to evaluate the performance of AI systems in realistic cybersecurity investigations. Unlike previous benchmarks that rely on static quizzes and trivia, ExCyTIn-Bench simulates actual attacks, demanding the kind of complex, multistage analysis you would find in a real Security Operations Center (SOC). Built around Microsoft Sentinel data and designed to operate directly within Azure, it brings industry-level rigor to AI security evaluation.

Key Features of ExCyTIn-Bench

  • Holistic Scenario Simulation: Leverages 57 log tables from Microsoft Sentinel and related services to create realistic, noisy, and complex cyber incident scenarios.
  • Designed for Real Workflows: Evaluates not just answers, but the reasoning process, as AI models interact with live data sources and plan investigations step-by-step — reflecting actual SOC analyst workflows in Azure.
  • Grounded in Incident Graphs: Incident graphs of alerts, entities, and their relationships serve as the ‘ground truth’ for constructing explainable Q&A pairs, enabling transparent and objective scoring (a minimal sketch follows this list).
  • Transparent, Actionable Metrics: Provides detailed reward signals for every investigative step, enabling organizations to understand not just ‘if’ but ‘how’ a model solves a problem.
  • Open-Source and Collaborative: Available via GitHub, the framework invites researchers and security vendors to contribute, benchmark, and improve new AI models.
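
To make the incident-graph idea concrete, here is a minimal Python sketch of how a graph of alerts and entities could anchor an explainable Q&A pair. The class names, relations, and question template are illustrative assumptions, not the actual ExCyTIn-Bench API.

```python
# Minimal sketch (hypothetical names, not the actual ExCyTIn-Bench API):
# an incident graph of alerts and entities serves as ground truth, and a
# Q&A pair is derived from a path between two nodes in that graph.

from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    source: str    # e.g., an alert node such as "Alert: Suspicious sign-in"
    relation: str  # e.g., "involves"
    target: str    # e.g., an entity node such as "Account: jdoe"

# Toy incident graph: two alerts linked through a shared account entity.
incident_graph = [
    Edge("Alert: Suspicious sign-in", "involves", "Account: jdoe"),
    Edge("Account: jdoe", "involved_in", "Alert: Mass file deletion"),
]

def make_qa_pair(start: Edge, end: Edge) -> dict:
    """Build one explainable Q&A pair: the question names the start node,
    the answer is the end node, and the edge path is kept for scoring."""
    return {
        "question": f"Which alert is linked to the entity in '{start.source}'?",
        "answer": end.target,
        "evidence_path": [start, end],  # ground truth used to grade each step
    }

qa = make_qa_pair(incident_graph[0], incident_graph[1])
print(qa["question"])
print("Expected answer:", qa["answer"])
```

Because each answer is tied to an explicit path in the graph, a grader can check every investigative hop rather than only the final string.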

Integrations and Impact

ExCyTIn-Bench isn’t just an external tool — it is used internally by Microsoft to improve its own AI-driven security products. Models are tested in-depth to identify weaknesses in detection, reasoning, and tool usage. The framework is tightly integrated with Microsoft Security Copilot, Microsoft Sentinel, and Defender, offering a unified way to monitor the cost and performance impact of different language models used in defense workflows.

Insights from Latest Evaluations

  • Advanced Reasoning is Key: The latest models (e.g., GPT-5 with high reasoning settings) significantly outperform others, confirming the importance of step-by-step reasoning for sophisticated cyber investigations.
  • Smaller Models Can Compete: Efficient smaller models with robust chain-of-thought techniques nearly match larger models, providing more accessible options for automation.
  • Open-Source Models Improving: Open solutions are closing the gap with proprietary systems, making quality AI security tools more obtainable.
  • Metrics Show What Matters: Fine-grained metrics highlight the value of explicit reasoning over final answers alone, guiding security teams toward better model choices (a simplified scoring sketch follows this list).
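
To illustrate why per-step rewards matter, the sketch below blends a final-answer reward with credit for each ground-truth investigative step a model actually performed. The equal weighting and exact-match step check are simplifying assumptions, not the benchmark's published scoring scheme.

```python
# Simplified scoring sketch (assumed weights and matching, for illustration):
# per-step rewards grade each investigative action against the ground-truth
# evidence path, so two runs with the same final answer can score differently.

def score_run(steps_taken: list[str], ground_truth_path: list[str],
              final_answer: str, expected_answer: str) -> float:
    """Blend a final-answer reward with per-step rewards for the
    ground-truth investigative steps the model actually performed."""
    answer_reward = 1.0 if final_answer == expected_answer else 0.0
    hits = sum(1 for step in ground_truth_path if step in steps_taken)
    step_reward = hits / len(ground_truth_path)
    return 0.5 * answer_reward + 0.5 * step_reward

truth = ["query sign-in logs", "pivot to account jdoe", "check file events"]

# A run that guesses the right answer without investigating scores lower
# than one that follows the evidence path to the same conclusion.
print(score_run(["guess"], truth, "Mass file deletion", "Mass file deletion"))  # 0.5
print(score_run(truth, truth, "Mass file deletion", "Mass file deletion"))      # 1.0
```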

Get Involved and Next Steps

  • ExCyTIn-Bench is free and open: anyone can use it, contribute improvements, and share results.
  • Personalized tenant-specific benchmarks are in development, promising even more relevant evaluations.
  • For questions, contributions, or partnership opportunities, contact the team at msecaimrbenchmarking@microsoft.com.
  • Stay up to date with Microsoft Security by visiting their blog or following on LinkedIn and X.

Resources

This post appeared first on “Microsoft Security Blog”. Read the entire article here.