PulseAugur

New TS-Skill benchmark evaluates LLMs' time-series analytical skills

Researchers have introduced TS-Skill, a benchmark for evaluating the analytical capabilities of large language models (LLMs) and time-series language models (TSLMs) in time-series question answering (TSQA). The benchmark targets three skills that are crucial for understanding patterns in temporal data: temporal scale selection, temporal localization, and cross-interval integration. Experiments with TS-Skill revealed significant performance gaps across these skills; non-agentic models struggled in particular to integrate information across separate time intervals.

Impact: Provides a granular evaluation framework for identifying and addressing specific temporal reasoning weaknesses in LLMs and TSLMs.

Ranking rationale: The cluster contains a new academic paper introducing a novel benchmark for evaluating specific AI capabilities.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

  1. arXiv cs.AI TIER_1 English(EN) · Liying Han, Kang Yang, Oliver Wang, Jason Wu, Pengrui Quan, Gaofeng Dong, Ozan Baris Mulayim, Sizhe Ma, Yuyang Yuan, Dezhi Hong, Mario Berges, Mani Srivastava

    TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering

    arXiv:2605.24703v1 Announce Type: cross Abstract: Large language models (LLMs) and time-series language models (TSLMs) are increasingly applied to time-series question answering (TSQA). Unlike text-only QA, TSQA requires models to ground answers in temporal signals whose patterns…