PulseAugur

New QSTRBench benchmark tests LLM spatial and temporal reasoning

Researchers have introduced QSTRBench, a benchmark for assessing the qualitative spatial and temporal reasoning capabilities of large language models. It covers a range of calculi, including Point Algebra, Allen's Interval Algebra, and the Region Connection Calculus; some of them, such as RCC-22, are published for the first time. Current frontier models perform better than random chance, but none consistently answers all questions correctly, and difficulty varies significantly across calculi.

IMPACT Introduces a new evaluation framework to better understand and improve LLM capabilities in complex reasoning tasks.

RANK_REASON The cluster contains a new academic paper introducing a novel benchmark for evaluating LLMs.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.AI · TIER_1 · English (EN) · Robert E. Blackwell

    QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

    We introduce an extensive qualitative spatial and temporal reasoning (QSTR) benchmark for evaluating large language models (LLMs). We pose questions concerning compositional reasoning (using composition tables, CT), converse relations, and conceptual neighbourhoods (CN) for QSTR …
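
To make "composition tables" and "converse relations" concrete, here is a minimal sketch for Point Algebra, the simplest of the calculi named above. It is an illustration only and is not taken from the paper or the benchmark; the names `BASE`, `CONVERSE`, `compose`, `COMPOSITION_TABLE`, and `brute_force_compose` are hypothetical, chosen for this example.

```python
from itertools import product

# Base relations of Point Algebra: how two points on a line can be ordered.
BASE = ("<", "=", ">")

# Converse relation: if "x r y" holds, then "y CONVERSE[r] x" holds.
CONVERSE = {"<": ">", "=": "=", ">": "<"}


def compose(r1, r2):
    """Return the set of relations possible between x and z,
    given that x r1 y and y r2 z."""
    if r1 == "=":
        return {r2}
    if r2 == "=":
        return {r1}
    if r1 == r2:
        return {r1}
    # "<" followed by ">" (or the reverse) leaves x versus z unconstrained.
    return set(BASE)


# The composition table (CT): one entry per ordered pair of base relations.
COMPOSITION_TABLE = {(r1, r2): compose(r1, r2) for r1, r2 in product(BASE, BASE)}


def brute_force_compose(r1, r2, domain=range(3)):
    """Cross-check an entry by enumerating points over a small integer domain."""
    holds = {
        "<": lambda a, b: a < b,
        "=": lambda a, b: a == b,
        ">": lambda a, b: a > b,
    }
    result = set()
    for x, y, z in product(domain, repeat=3):
        if holds[r1](x, y) and holds[r2](y, z):
            result.update(r for r in BASE if holds[r](x, z))
    return result
```

A compositional question in this style would be: given x < y and y > z, which relations can hold between x and z? Looking up `COMPOSITION_TABLE[("<", ">")]` gives all three base relations, meaning the premises do not constrain the answer. Larger calculi such as Allen's Interval Algebra follow the same pattern with more base relations.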