PulseAugur

New STT-Arena benchmark reveals LLMs struggle with dynamic environments

Researchers have introduced STT-Arena, a benchmark that evaluates how well large language models adapt and replan in environments with spatio-temporal changes. It consists of 227 interactive tasks simulating real-world scenarios in which mid-task disruptions can invalidate existing plans. In evaluations, even state-of-the-art models such as Claude-4.6-Opus struggled with these dynamics, achieving less than 40% accuracy. The research also identified common LLM failure modes, such as executing on stale state or misdiagnosing dynamic triggers, and proposed a technique to improve adaptive replanning.

Impact: Highlights critical limitations of current LLMs in real-world agentic applications, driving research into more robust adaptive planning.

Ranking rationale: The cluster describes a new academic paper introducing a novel benchmark for evaluating AI models.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 1 source. How we write summaries →

Sources [1]

  1. arXiv cs.AI · Tier 1 · English (EN) · Ning Miao

    STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics

    Large language models (LLMs) deployed in real-world agentic applications must be capable of replanning and adapting when mid-task disruptions invalidate their prior decisions. Existing dynamic benchmarks primarily measure whether LLMs can detect temporal changes in a timely manne…