PulseAugur

New benchmark tests LLMs on generating industry-standard XML for BIM

Researchers have introduced Ishigaki-IDS-Bench, a benchmark that evaluates how well large language models (LLMs) generate Information Delivery Specification (IDS) XML from Building Information Modeling (BIM) information requirements. The benchmark includes 166 expert-verified examples spanning several construction domains and languages, along with gold IDS files for comparison. Initial evaluations show that LLMs can partially express information requirements but struggle to consistently produce XML that conforms to the IDS standard and to Industry Foundation Classes (IFC) vocabulary constraints; the best model reaches only 65.6% content agreement.
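
To make the two constraints concrete, here is a minimal Python sketch of the kind of check such output has to pass: the XML must parse, and entity names must come from the IFC vocabulary. It is an illustration only, not the benchmark's evaluation method. The sample document follows the buildingSMART IDS element layout as we understand it, and `KNOWN_IFC_ENTITIES`, `SAMPLE_IDS`, and `check_ids` are hypothetical names introduced here; a real check would validate against the official IDS schema and the full IFC entity list.

```python
import xml.etree.ElementTree as ET

# Namespace used by buildingSMART IDS files (assumed here for the sketch).
IDS_NS = {"ids": "http://standards.buildingsmart.org/IDS"}

# Tiny illustrative subset of IFC entity names, NOT the full IFC vocabulary.
KNOWN_IFC_ENTITIES = {"IFCWALL", "IFCDOOR", "IFCSLAB"}

# Hypothetical IDS document: walls must carry a FireRating property.
SAMPLE_IDS = """<ids xmlns="http://standards.buildingsmart.org/IDS">
  <info><title>Illustrative example</title></info>
  <specifications>
    <specification name="Walls need a fire rating" ifcVersion="IFC4">
      <applicability>
        <entity><name><simpleValue>IFCWALL</simpleValue></name></entity>
      </applicability>
      <requirements>
        <property>
          <propertySet><simpleValue>Pset_WallCommon</simpleValue></propertySet>
          <baseName><simpleValue>FireRating</simpleValue></baseName>
        </property>
      </requirements>
    </specification>
  </specifications>
</ids>"""


def check_ids(xml_text):
    """Return a list of problems; an empty list means both checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Constraint 1: the output must at least be well-formed XML.
        return [f"not well-formed XML: {exc}"]

    problems = []
    # Constraint 2: every entity name must belong to the IFC vocabulary.
    for value in root.findall(".//ids:entity/ids:name/ids:simpleValue", IDS_NS):
        name = (value.text or "").strip()
        if name.upper() not in KNOWN_IFC_ENTITIES:
            problems.append(f"unknown IFC entity: {name}")
    return problems
```

Calling `check_ids(SAMPLE_IDS)` returns an empty list, while a misspelled entity name or an unclosed tag produces one problem entry. Schema validation and comparison against a gold file are deliberately left out of this sketch.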

IMPACT The benchmark gives researchers a way to measure, and so to improve, how well LLMs generate domain-specific, standardized structured data, a capability that matters for industries like construction.

RANK_REASON The cluster describes a new academic paper introducing a benchmark for evaluating LLM performance on a specific structured data generation task.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.CL TIER_1 English(EN) · Ryo Kanazawa, Koyo Hidaka, Teppei Miyamoto, Takayuki Kato, Tomoki Ando, Chenguang Wang, Dayuan Jiang, Naofumi Fujita, Shuhei Saitoh, Atomu Kondo, Koki Arakawa, Daiho Nishioka ·

    Ishigaki-IDS-Bench: A Benchmark for Generating Information Delivery Specification from BIM Information Requirements

    arXiv:2605.22079v1. Abstract: Large language models (LLMs) are widely used to generate structured outputs such as JSON, SQL, and code, yet public resources remain limited for evaluating generation that must simultaneously satisfy industry-standard XML and domain…

  2. arXiv cs.CL TIER_1 English(EN) · Daiho Nishioka ·

    Ishigaki-IDS-Bench: A Benchmark for Generating Information Delivery Specification from BIM Information Requirements

    Large language models (LLMs) are widely used to generate structured outputs such as JSON, SQL, and code, yet public resources remain limited for evaluating generation that must simultaneously satisfy industry-standard XML and domain vocabulary constraints. This paper presents Ish…