PulseAugur
research · 4 sources

New benchmarks reveal LLMs struggle with multi-turn instruction following

Researchers have introduced two new benchmarks to evaluate large language models' ability to follow complex instructions. SEQUOR tests constraint adherence in long, multi-turn conversations, revealing that model accuracy drops significantly as conversations lengthen or as constraints are refined, modified, or contradicted across turns. MCJudgeBench evaluates LLM judges on multi-constraint instruction following at the level of individual constraints, showing that strong overall-response accuracy does not guarantee consistent reliability across constraint categories.

Summary written by gemini-2.5-flash-lite from 4 sources. How we write summaries →

IMPACT These benchmarks will enable more rigorous evaluation of LLM instruction-following capabilities, potentially driving improvements in conversational AI agents.

RANK_REASON The cluster contains two new academic papers introducing benchmarks for evaluating LLM instruction-following capabilities.

Read on arXiv cs.CL →
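
To make the two setups concrete, here is a minimal illustrative sketch of the distinction the papers motivate: verifying a response against individual constraints that can change across turns (the SEQUOR premise) versus scoring a judge per constraint rather than per response (the MCJudgeBench premise). This is an assumption-laden sketch, not the benchmarks' actual harnesses; every name and check below is a hypothetical stand-in.

```python
# Illustrative sketch only -- not code from SEQUOR or MCJudgeBench.
# All names and checks below are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    name: str
    check: Callable[[str], bool]  # programmatic stand-in for an LLM-judge call

def judge_overall(response: str, constraints: list[Constraint]) -> bool:
    # Overall-response judgment: a single pass/fail for the whole response.
    return all(c.check(response) for c in constraints)

def judge_per_constraint(response: str, constraints: list[Constraint]) -> dict[str, bool]:
    # Constraint-level judgment: one verdict per requirement, so unreliable
    # constraint categories show up individually instead of being averaged out.
    return {c.name: c.check(response) for c in constraints}

# Multi-turn twist: later turns may refine, modify, or contradict earlier
# constraints, so the active set must be updated before each judgment.
active = [
    Constraint("mentions Python", lambda r: "Python" in r),
    Constraint("under 15 words", lambda r: len(r.split()) < 15),
]
response = "Python is a widely used programming language."
print(judge_overall(response, active))         # True
print(judge_per_constraint(response, active))  # both constraints pass

# Turn 2: the user drops the length limit and adds a new requirement.
active = [c for c in active if c.name != "under 15 words"]
active.append(Constraint("includes an example", lambda r: "e.g." in r))
print(judge_per_constraint(response, active))  # 'includes an example' now fails
```

A real harness would replace the lambda checks with LLM-judge calls and aggregate per-constraint verdicts across many conversations; the point here is only that per-constraint verdicts surface failure categories that a single overall pass/fail hides.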

COVERAGE [4]

  1. arXiv cs.CL TIER_1 · Beatriz Canaverde, Duarte M. Alves, José Pombal, Giuseppe Attanasio, André F. T. Martins ·

    SEQUOR: A Multi-Turn Benchmark for Realistic Constraint Following

    arXiv:2605.06353v1 Announce Type: new Abstract: In a conversation, a helpful assistant must reliably follow user directives, even as they refine, modify, or contradict earlier requests. Yet most instruction-following benchmarks focus on single-turn or short multi-turn scenarios, …

  2. arXiv cs.CL TIER_1 · André F. T. Martins ·

    SEQUOR: A Multi-Turn Benchmark for Realistic Constraint Following

    In a conversation, a helpful assistant must reliably follow user directives, even as they refine, modify, or contradict earlier requests. Yet most instruction-following benchmarks focus on single-turn or short multi-turn scenarios, leaving open how well models handle long-horizon…

  3. arXiv cs.CL TIER_1 · Jaeyun Lee, Junyoung Koh, Zeynel Tok, Hunar Batra, Ronald Clark ·

    MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following

    arXiv:2605.03858v1 Announce Type: new Abstract: Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchma…

  4. arXiv cs.CL TIER_1 · Ronald Clark ·

    MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following

    Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in mult…