CDR-Bench is a new benchmark that evaluates how well large language models (LLMs) execute complex, order-sensitive data refinement recipes. It comprises 3,462 tasks spanning four domains and 29 operators, and tests models in atomic, order-agnostic, and order-sensitive scenarios. In experiments with more than 10 state-of-the-art LLMs, performance degraded significantly on compositional tasks and success rates collapsed on order-sensitive recipes, indicating that current LLMs lack the procedural faithfulness needed for reliable data refinement.
AI IMPACT Highlights a critical gap in LLM capabilities for procedural tasks: current models are not yet reliable for complex data refinement workflows.
AI-generated summary · Google Gemini · from 2 sources.