PulseAugur

New benchmark evaluates LLM agents for cyber threat investigation tasks

Researchers have introduced ExCyTIn-Bench, a benchmark for evaluating Large Language Model (LLM) agents on cyber threat investigation. The benchmark uses security logs from a controlled Azure tenant, including Microsoft Sentinel data, to construct threat investigation graphs. Questions are generated from these graphs, which gives each one an explainable ground-truth answer and lets the benchmark extend to new log types. In current evaluations, even the best-performing model achieves a score of only 0.606, indicating significant room for improvement on this task.

Impact: Introduces a new evaluation framework for LLM agents in cybersecurity, highlighting current performance limitations and future research directions.

Ranking rationale: This is a research paper introducing a new benchmark for evaluating LLM agents on a specific task.


AI-generated summary · Google Gemini · from 1 source.


Sources [1]

  1. arXiv cs.CL · English (EN) · Yiran Wu, Mauricio Velazco, Andrew Zhao, Manuel Raúl Meléndez Luján, Srisuma Movva, Yogesh K Roy, Quang Nguyen, Roberto Rodriguez, Qingyun Wu, Michael Albada, Julia Kiseleva, Anand Mudgerikar

    ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

    arXiv:2507.14201v3. Abstract: We present ExCyTIn-Bench, the first benchmark to Evaluate an LLM agent X on the task of Cyber Threat Investigation through security questions derived from investigation graphs. Real-world security analysts must sift throug…