
New CDR-Bench reveals LLMs struggle with order-sensitive data refinement

CDR-Bench is a new benchmark that evaluates whether large language models (LLMs) can faithfully execute complex, order-sensitive data refinement recipes. It comprises 3,462 tasks across four domains and 29 operators, and tests models in three settings: atomic, order-agnostic, and order-sensitive. In experiments with more than 10 state-of-the-art LLMs, performance degraded significantly on compositional tasks and success rates collapsed on order-sensitive recipes, indicating that current LLMs lack the procedural faithfulness needed for reliable data refinement.
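To make the order-sensitivity idea concrete, here is a minimal sketch of a recipe as an ordered list of text operators applied to an evolving state. The operator names (`lowercase`, `dedupe_lines`, `strip_lines`) and the `run_recipe` helper are hypothetical illustrations, not operators or code from CDR-Bench itself.

```python
from functools import reduce
from typing import Callable, Sequence

# Hypothetical operator type: each operator maps one text state to the next.
Operator = Callable[[str], str]


def lowercase(text: str) -> str:
    """Lowercase the whole text."""
    return text.lower()


def strip_lines(text: str) -> str:
    """Trim leading and trailing whitespace on every line."""
    return "\n".join(line.strip() for line in text.split("\n"))


def dedupe_lines(text: str) -> str:
    """Drop exact duplicate lines, keeping the first occurrence."""
    seen = set()
    kept = []
    for line in text.split("\n"):
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)


def run_recipe(text: str, recipe: Sequence[Operator]) -> str:
    """Apply the operators left to right, feeding each output to the next."""
    return reduce(lambda state, op: op(state), recipe, text)


sample = "Hello\nhello\nWorld"

# Order-sensitive pair: lowercasing first makes the two lines identical,
# so deduplication removes one; deduplicating first removes nothing.
order_sensitive_a = run_recipe(sample, [lowercase, dedupe_lines])
order_sensitive_b = run_recipe(sample, [dedupe_lines, lowercase])

# Order-agnostic pair: stripping and lowercasing commute on this input.
padded = "  Hello \nWorld"
order_agnostic_a = run_recipe(padded, [strip_lines, lowercase])
order_agnostic_b = run_recipe(padded, [lowercase, strip_lines])

print(repr(order_sensitive_a))
print(repr(order_sensitive_b))
print(order_agnostic_a == order_agnostic_b)
```

In this toy setup, swapping the two operators in the first pair changes the final text, while swapping them in the second pair does not. A model that applies the right operators in the wrong order would therefore fail only on the first kind of recipe.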

IMPACT Highlights a critical gap in LLM capabilities for procedural tasks, suggesting current models are not yet reliable for complex data refinement workflows.

RANK_REASON The cluster contains an academic paper introducing a new benchmark for evaluating LLM capabilities.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Yuchen Huang, Xiang Li, Zhenqing Ling, Sijia Li, Qianli Shen, Daoyuan Chen, Yi R. Fung, Yaliang Li ·

    CDR-Bench: Evaluating Faithful Execution of Compositional, Order-Sensitive Data Refinement Recipes

    arXiv:2606.31435v1. Abstract: Data refinement involves executing multi-step recipes over evolving text states, where both composition and execution order of processing operators determine the outcome. While existing benchmarks either isolate text editing or enta…

  2. arXiv cs.CL TIER_1 English(EN) · Yaliang Li ·

    CDR-Bench: Evaluating Faithful Execution of Compositional, Order-Sensitive Data Refinement Recipes

    Data refinement involves executing multi-step recipes over evolving text states, where both composition and execution order of processing operators determine the outcome. While existing benchmarks either isolate text editing or entangle it with code and tool execution, it remains…