Researchers have introduced QMFOL, a framework for generating first-order logic reasoning tasks with controllable complexity, intended for evaluating large language models (LLMs). It addresses limitations of existing benchmarks by allowing precise control over logical depth, width, and semantic diversity, while using external provers to ensure logical consistency. The resulting benchmark, QMFOLBench, comprises 2,880 instances and has been used to evaluate six large reasoning models and two LLMs. The results show that performance declines and computational cost rises as logical complexity increases. The evaluations also indicate that models perform better on tasks labeled 'True' than on those labeled 'False' or 'Unknown', and that they are sensitive to semantic variations.
AI IMPACT Provides a more precise method for evaluating LLM deductive reasoning, potentially guiding future model development towards more robust logical capabilities.
RANK_REASON The cluster describes a new academic paper proposing a novel framework and benchmark for evaluating LLM reasoning capabilities.
AI-generated summary · Google Gemini · from 1 source.