Researchers have introduced ExCyTIn-Bench, a benchmark for evaluating Large Language Model (LLM) agents on cyber threat investigation. The benchmark uses security logs from a controlled Azure tenant, including Microsoft Sentinel data, to construct threat investigation graphs. Questions are generated from these graphs, which gives each question an explainable ground-truth answer and lets the benchmark extend to new log types. In current evaluations, even the best-performing model scores only 0.606, leaving significant room for improvement on this task.
Impact: Introduces a new evaluation framework for LLM agents in cybersecurity, highlighting current performance limitations and future research directions.
Ranking rationale: This is a research paper introducing a new benchmark for evaluating LLM agents on a specific task.
AI-generated summary · Google Gemini · from 1 source.