Brief · PulseAugur

TOOL · arXiv cs.AI · English (EN) · 16h

QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

Researchers have introduced QMFOL, a framework that generates first-order logic reasoning tasks with controllable complexity for evaluating large language models (LLMs). It addresses limitations of existing benchmarks by giving precise control over logical depth, width, and semantic diversity, while external provers verify logical consistency. The resulting benchmark, QMFOLBench, comprises 2,880 instances and was used to evaluate six large reasoning models and two LLMs; performance declined and computational cost rose as logical complexity increased. The evaluations also indicated that models perform better on tasks labeled 'True' than on those labeled 'False' or 'Unknown', and that they are sensitive to semantic variations.
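To make "controllable depth and width" concrete, here is a minimal toy sketch in Python. It is not the QMFOL generator: the function names, the interpretation of depth as the length of an implication chain, and of width as the number of rules branching from each step, are all assumptions made for illustration. A naive forward chainer stands in for the external provers mentioned in the brief.

```python
import random

def generate_instance(depth, width, seed=0):
    """Hypothetical toy generator over unary (monadic) predicates.

    Builds a chain  forall x. P0(x) -> P1(x) -> ... -> P_depth(x)
    and adds (width - 1) distractor rules at every chain step.
    """
    rng = random.Random(seed)
    chain = [f"P{i}" for i in range(depth + 1)]
    # Each pair (A, B) stands for: forall x. A(x) -> B(x)
    rules = [(chain[i], chain[i + 1]) for i in range(depth)]
    for i in range(depth):
        for j in range(width - 1):
            rules.append((chain[i], f"D{i}_{j}"))  # side branch, not on the goal path
    rng.shuffle(rules)  # hide the chain order from the reader of the task
    return {"facts": {chain[0]}, "rules": rules, "goal": chain[-1]}

def forward_chain(facts, rules):
    """Naive forward chaining to a fixed point (stand-in for a real prover)."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

def label(instance, query):
    """Label a query (predicate, polarity) as 'True', 'False', or 'Unknown'.

    With only positive implications, a derivable predicate makes the positive
    query 'True' and its negation 'False'; anything else is 'Unknown'.
    """
    predicate, polarity = query
    derived = forward_chain(instance["facts"], instance["rules"])
    if predicate in derived:
        return "True" if polarity else "False"
    return "Unknown"

inst = generate_instance(depth=3, width=2)
print(label(inst, (inst["goal"], True)))   # goal follows from the chain
print(label(inst, (inst["goal"], False)))  # negated goal is contradicted
print(label(inst, ("Q", True)))            # Q never appears in the premises
```

In this toy setup, raising depth lengthens the proof a model must follow, and raising width adds more irrelevant rules to filter out, which is one plausible reading of why accuracy might fall and cost might rise with complexity.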

IMPACT Provides a more precise method for evaluating LLM deductive reasoning, potentially guiding future model development towards more robust logical capabilities.