PulseAugur
ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response

ARFBench benchmarks foundation models on time series question answering for software incident response

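Researchers have introduced ARFBench, a new benchmark for evaluating the time series question-answering ability of multimodal foundation models, specifically in software incident response. The benchmark contains 750 questions drawn from real production incidents at Datadog, totaling more than 5.38 million data points. Initial evaluations show that leading multimodal models reach only moderate accuracy, with GPT-5 at 62.7%, while a novel hybrid prototype combining a time series model with a vision-language model shows comparable performance. The study also highlights that combining model outputs with human expert answers yields a "model-expert oracle" whose performance significantly exceeds that of either the models or the experts alone.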
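To make the "model-expert oracle" idea concrete, here is a minimal sketch. It assumes the oracle counts a question as correct whenever either the model or the expert answers it correctly; the summary does not spell out the exact definition, so treat this as one plausible reading. The function names and the per-question outcomes are hypothetical and are not ARFBench data.

```python
from typing import List, Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of questions answered correctly."""
    return sum(correct) / len(correct)


def oracle_correct(model: Sequence[bool], expert: Sequence[bool]) -> List[bool]:
    """Per-question oracle (assumed definition): correct if the model OR the expert is correct."""
    if len(model) != len(expert):
        raise ValueError("model and expert must cover the same questions")
    return [m or e for m, e in zip(model, expert)]


# Hypothetical per-question outcomes for five questions (illustrative only).
model_ok = [True, False, True, False, True]
expert_ok = [True, True, False, False, True]

oracle_ok = oracle_correct(model_ok, expert_ok)
print(accuracy(model_ok), accuracy(expert_ok), accuracy(oracle_ok))
```

In this toy case the model and the expert each answer 3 of 5 questions correctly, while the oracle covers 4 of 5, because the two make mistakes on different questions. Under this assumed definition the oracle can never score below the better of the two, and how much it gains depends on how little their errors overlap.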

Impact: Establishes a new benchmark for evaluating multimodal models on time series data, which may guide future development in incident response.

Ranking rationale: This is a research paper introducing a new benchmark for evaluating AI models.


AI-generated summary · Google Gemini · from 2 sources.


Sources [2]

  1. arXiv cs.CV TIER_1 English(EN) · Stephan Xie, Ben Cohen, Mononito Goswami, Junhong Shen, Emaad Khwaja, Chenghao Liu, David Asker, Othmane Abou-Amal, Ameet Talwalkar ·

    ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response

    arXiv:2604.21199v2. Abstract: Time series question-answering (TSQA), in which we ask natural language questions to infer and reason about properties of time series, is a promising yet underexplored capability of foundation models. In this work, we pres…

  2. arXiv cs.CV TIER_1 English(EN) · Ameet Talwalkar ·

    ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response

    Time series question-answering (TSQA), in which we ask natural language questions to infer and reason about properties of time series, is a promising yet underexplored capability of foundation models. In this work, we present ARFBench, a TSQA benchmark that evaluates the understa…