Researchers have introduced two new benchmarks for evaluating large language models' ability to follow complex instructions. SEQUOR targets constraint adherence in long, multi-turn conversations, showing that model accuracy drops sharply as conversations lengthen or constraints change mid-dialogue. MCJudgeBench evaluates LLM judges on multi-constraint instruction following, showing that strong aggregate performance does not guarantee consistent reliability across constraint categories.
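To make the SEQUOR-style setup concrete, here is a minimal sketch of per-turn constraint scoring: each model response is checked against whichever constraints are active at that turn, so adherence can be tracked as the conversation lengthens or a constraint is added. Everything here (the `Constraint` class, the toy constraints, `turnwise_adherence`) is an invented illustration under stated assumptions, not SEQUOR's actual harness.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    # Hypothetical constraint: a name plus a predicate over the response text.
    name: str
    check: Callable[[str], bool]

def turnwise_adherence(responses: list[str],
                       constraints_per_turn: list[list[Constraint]]) -> list[float]:
    """Fraction of active constraints satisfied at each turn."""
    scores = []
    for response, active in zip(responses, constraints_per_turn):
        satisfied = sum(c.check(response) for c in active)
        scores.append(satisfied / len(active) if active else 1.0)
    return scores

if __name__ == "__main__":
    # Two toy constraints; the second turn adds one (a mid-dialogue change).
    no_exclaim = Constraint("no_exclamation", lambda r: "!" not in r)
    under_20_words = Constraint("under_20_words", lambda r: len(r.split()) < 20)

    responses = [
        "Sure, here is a short answer.",
        "Absolutely! Here is another short answer.",
    ]
    per_turn = [[no_exclaim], [no_exclaim, under_20_words]]
    print(turnwise_adherence(responses, per_turn))  # [1.0, 0.5]
```

Similarly, the MCJudgeBench finding, that aggregate judge accuracy can mask weakness in a single constraint category, can be shown with a tiny hypothetical agreement computation; the categories, labels, and counts below are made up for illustration:

```python
from collections import defaultdict

def agreement_by_category(records: list[dict]) -> dict[str, float]:
    """Judge/human agreement rate, overall and per constraint category."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        for key in ("overall", r["category"]):
            totals[key] += 1
            hits[key] += r["judge"] == r["human"]
    return {k: hits[k] / totals[k] for k in totals}

# 90% agreement on 'format' checks masks coin-flip reliability on 'length'.
records = (
    [{"category": "format", "judge": True, "human": True}] * 9
    + [{"category": "format", "judge": False, "human": True}] * 1
    + [{"category": "length", "judge": True, "human": True}] * 5
    + [{"category": "length", "judge": True, "human": False}] * 5
)
print(agreement_by_category(records))
# {'overall': 0.7, 'format': 0.9, 'length': 0.5}
```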
Summary written by gemini-2.5-flash-lite from 4 sources.
IMPACT These benchmarks will enable more rigorous evaluation of LLM instruction-following capabilities, potentially driving improvements in conversational AI agents.
RANK_REASON The cluster contains two new academic papers introducing benchmarks for evaluating LLM instruction-following capabilities.