MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation
Researchers have introduced MTR-Bench, a benchmark for evaluating the multi-turn reasoning capabilities of large language models. It comprises 40 tasks across four classes, totaling 3,600 instances, and supports fully automated evaluation without human intervention. Initial experiments indicate that current state-of-the-art models struggle with these interactive reasoning tasks, pointing to open problems for future research on AI systems.
AI IMPACT: Provides a standardized method for evaluating LLM performance in interactive, multi-turn scenarios, pushing research toward more capable AI systems.
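
To make "automated multi-turn evaluation" concrete, here is a minimal sketch of what such a loop could look like. It is not MTR-Bench's code or API: the number-guessing task, the `GuessNumberEnv` class, the `binary_search_agent` stand-in for a model, and the `evaluate` helper are all hypothetical and chosen only for illustration. The idea it shows is that an environment returns feedback each turn and checks success by itself, so no human judge is needed.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

History = List[Tuple[int, str]]  # (guess, feedback) pairs from earlier turns


@dataclass
class GuessNumberEnv:
    """Toy interactive task (hypothetical, not taken from MTR-Bench)."""
    secret: int
    max_turns: int = 10

    def step(self, guess: int) -> Tuple[str, bool]:
        # Return textual feedback plus a flag saying whether the task is solved.
        if guess == self.secret:
            return "correct", True
        return ("higher" if guess < self.secret else "lower"), False


def binary_search_agent(low: int, high: int) -> Callable[[History], int]:
    """Stand-in for a model: guesses the midpoint of the remaining range."""
    def act(history: History) -> int:
        lo, hi = low, high
        for guess, feedback in history:
            if feedback == "higher":
                lo = guess + 1
            elif feedback == "lower":
                hi = guess - 1
        return (lo + hi) // 2
    return act


def run_episode(env: GuessNumberEnv, agent: Callable[[History], int]) -> Tuple[bool, int]:
    """Play one multi-turn episode; return (solved, turns_used)."""
    history: History = []
    for turn in range(1, env.max_turns + 1):
        guess = agent(history)
        feedback, done = env.step(guess)
        history.append((guess, feedback))
        if done:
            return True, turn
    return False, env.max_turns


def evaluate(secrets: List[int], low: int = 1, high: int = 100) -> float:
    """Fraction of episodes solved, scored entirely by the environment."""
    agent = binary_search_agent(low, high)
    results = [run_episode(GuessNumberEnv(s), agent) for s in secrets]
    return sum(solved for solved, _ in results) / len(results)
```

In a real setting the agent function would wrap a call to a language model, and the interaction history would be rendered into the prompt for each turn; the scoring loop itself would stay the same.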