TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering
Researchers have introduced TS-Skill, a new benchmark designed to evaluate the analytical capabilities of large language models (LLMs) and time-series language models (TSLMs) in time-series question answering (TSQA). The benchmark focuses on three specific skills that are crucial for understanding temporal data patterns: temporal scale selection, temporal localization, and cross-interval integration. Experiments using TS-Skill revealed significant performance gaps across these skills, with non-agentic models finding it especially hard to integrate information across separate time intervals.
AI IMPACT Provides a granular evaluation framework to identify and address specific temporal reasoning weaknesses in LLMs and TSLMs.
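
To make the three skills concrete, here is a minimal sketch of what each one might look like as a plain computation on a toy series. This is an illustration only: the summary does not describe TS-Skill's actual tasks or data, so the series, the function names (`daily_means`, `locate_peak`, `compare_intervals`), and the reading of each skill are all assumptions.

```python
from statistics import mean

# Hypothetical toy data (not from TS-Skill): 4 days of hourly readings.
# Baseline of 10.0, a raised level on day index 2, and one spike at hour 30.
series = [10.0] * 96
for hour in range(48, 72):
    series[hour] = 14.0
series[30] = 25.0


def daily_means(values, points_per_day=24):
    """Temporal scale selection (assumed reading): answer a day-level
    question by aggregating hourly points to the daily scale first."""
    return [
        mean(values[start:start + points_per_day])
        for start in range(0, len(values), points_per_day)
    ]


def locate_peak(values):
    """Temporal localization (assumed reading): return the index at
    which the largest value occurs."""
    return max(range(len(values)), key=values.__getitem__)


def compare_intervals(values, first, second):
    """Cross-interval integration (assumed reading): combine evidence
    from two separate half-open intervals by comparing their means."""
    first_mean = mean(values[first[0]:first[1]])
    second_mean = mean(values[second[0]:second[1]])
    return second_mean - first_mean


means = daily_means(series)
busiest_day = max(range(len(means)), key=means.__getitem__)
peak_hour = locate_peak(series)
shift = compare_intervals(series, (0, 24), (48, 72))

print(busiest_day, peak_hour, shift)
```

In this sketch the single spike at hour 30 dominates the point-level view, while the daily aggregation points to day index 2 instead. That is the kind of scale-dependent answer the first skill is presumably meant to probe.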