PulseAugur

New benchmark tests LLMs on interactive geometry construction

Researchers have introduced GeoBuildBench, a benchmark that assesses how well large language models and multimodal models translate natural-language geometry problems into executable construction programs. Unlike benchmarks that score static diagram interpretation or answer correctness, it focuses on the interactive generation of geometric diagrams. It includes 489 Chinese textbook-style problems, and evaluations show that current models struggle with structural hallucinations and constraint satisfaction, which indicates a need for better grounded reasoning.
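
To make "executable construction" and "constraint satisfaction" concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the problem text, the function names `construct` and `check`, and the point-dictionary format are hypothetical and are not taken from the GeoBuildBench paper or its evaluation harness.

```python
import math

# Hypothetical problem text (not from the benchmark):
# "In triangle ABC, AB = AC = 5 and BC = 6. M is the midpoint of BC."


def dist(p, q):
    """Euclidean distance between two points given as (x, y) tuples."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def construct():
    """Build the figure as a dictionary of named points (illustrative format)."""
    b = (0.0, 0.0)
    c = (6.0, 0.0)
    # A lies on the perpendicular bisector of BC, at distance 5 from B.
    height = math.sqrt(5.0**2 - 3.0**2)
    a = (3.0, height)
    m = ((b[0] + c[0]) / 2, (b[1] + c[1]) / 2)
    return {"A": a, "B": b, "C": c, "M": m}


def check(points, tol=1e-9):
    """Return the stated constraints that the construction fails to satisfy."""
    a, b, c, m = (points[k] for k in "ABCM")
    constraints = {
        "AB = 5": abs(dist(a, b) - 5.0) <= tol,
        "AC = 5": abs(dist(a, c) - 5.0) <= tol,
        "BC = 6": abs(dist(b, c) - 6.0) <= tol,
        # M is the midpoint when it is equidistant from B and C and lies on BC.
        "M is midpoint of BC": (
            abs(dist(m, b) - dist(m, c)) <= tol
            and abs(dist(m, b) + dist(m, c) - dist(b, c)) <= tol
        ),
    }
    return [name for name, ok in constraints.items() if not ok]


violations = check(construct())
```

In this sketch, an empty `violations` list means every stated constraint holds. A construction that misplaces a point, for example by putting `M` at `(2.0, 0.0)`, would be reported under "M is midpoint of BC".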

Impact: The benchmark tests whether AI systems can perform grounded, executable reasoning rather than only interpret text or images.

Ranking rationale: The cluster describes a new academic paper that introduces a benchmark for evaluating AI models.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

  1. arXiv cs.CL · TIER_1 · English (EN) · Huishuai Zhang

    GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language

    We introduce GeoBuildBench, a benchmark designed to evaluate whether large language models and multimodal agents can ground informal natural-language plane geometry problems into executable geometric constructions. Unlike existing geometry benchmarks that focus on answer correctn…

  2. Hugging Face Daily Papers · TIER_1 · English (EN)

    GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language

    We introduce GeoBuildBench, a benchmark designed to evaluate whether large language models and multimodal agents can ground informal natural-language plane geometry problems into executable geometric constructions. Unlike existing geometry benchmarks that focus on answer correctn…