Researchers have introduced TS-Skill, a new benchmark for evaluating the analytical capabilities of large language models (LLMs) and time-series language models (TSLMs) in time-series question answering (TSQA). The benchmark targets three specific skills that are crucial for understanding patterns in temporal data: temporal scale selection, temporal localization, and cross-interval integration. Experiments using TS-Skill revealed significant performance gaps across these skills, with non-agentic models struggling in particular to integrate information across separate time intervals.
Impact: Provides a granular evaluation framework for identifying and addressing specific temporal reasoning weaknesses in LLMs and TSLMs.
Ranking rationale: The cluster contains a new academic paper introducing a novel benchmark for evaluating specific AI capabilities.
AI-generated summary · Google Gemini · from 1 source.