Brief · PulseAugur

TOOL · arXiv cs.AI · English (EN) · 8h

TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

Researchers have introduced TimeSage-MT, a benchmark for evaluating the time series reasoning capabilities of large language model agents across multi-turn conversations. The benchmark comprises 240 tasks and more than 2,600 dialogue turns drawn from real-world domains, with a focus on user goals that evolve and evidence that accumulates over the course of a conversation. Initial evaluations on TimeSage-MT showed significant performance drops on decision-oriented tasks, pointing to gaps in agent memory, uncertainty handling, and domain-specific decision-making.

IMPACT This benchmark could guide the development of LLM agents that are more capable at complex, multi-turn data analysis tasks.

LLM agents
TimeSage
TimeSage-MT