EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning
Researchers have introduced EngTrace, a symbolic benchmark for evaluating the engineering reasoning of large language models (LLMs). Unlike existing benchmarks that focus on isolated skills, EngTrace assesses how models integrate scientific principles, quantitative modeling, and practical constraints, all of which engineering tasks require. The benchmark comprises 90 parameterized templates that generate over 1,350 problem instances across three engineering branches and nine domains, and it uses a two-stage evaluation framework that validates intermediate reasoning traces alongside final answers. Evaluations of 27 LLMs revealed a trade-off between numeric precision and trace fidelity, along with a "complexity cliff": abstract mathematical pre-training does not adequately translate to advanced engineering reasoning.
IMPACT: Sets a new standard for evaluating LLMs in safety-critical engineering domains, potentially driving improvements in model reliability for specialized applications.
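To make the ideas of a parameterized template and a two-stage check concrete, here is a minimal Python sketch. It is not EngTrace's actual code: the rod-stress problem, the function names `make_instance` and `evaluate`, the field names, and the 1% tolerance are all assumptions chosen for illustration, and the real benchmark may validate traces quite differently.

```python
import math
import random


def make_instance(seed):
    """Hypothetical template: axial stress in a solid circular rod.

    The seed picks the parameters, so one template yields many instances,
    each with a reference trace and a reference answer.
    """
    rng = random.Random(seed)
    force_n = rng.randint(10, 100) * 100        # axial force in N
    diameter_mm = rng.randint(10, 50)           # rod diameter in mm
    area_mm2 = math.pi * diameter_mm ** 2 / 4   # intermediate step
    stress_mpa = force_n / area_mm2             # N/mm^2 is the same as MPa
    return {
        "prompt": (
            f"A rod of diameter {diameter_mm} mm carries an axial force "
            f"of {force_n} N. Find the axial stress in MPa."
        ),
        "trace": {"area_mm2": area_mm2},
        "answer": stress_mpa,
    }


def evaluate(instance, model_trace, model_answer, rel_tol=0.01):
    """Illustrative two-stage check (assumed design, not the paper's).

    Stage 1 compares the final answer; stage 2 compares each reported
    intermediate value against the reference trace.
    """
    answer_ok = math.isclose(model_answer, instance["answer"], rel_tol=rel_tol)
    steps = {
        name: name in model_trace
        and math.isclose(model_trace[name], ref, rel_tol=rel_tol)
        for name, ref in instance["trace"].items()
    }
    return {"answer_ok": answer_ok, "trace_ok": all(steps.values()), "steps": steps}
```

In this sketch, a response with the right final number but a wrong intermediate area passes stage 1 and fails stage 2, which is the kind of gap between numeric precision and trace fidelity the summary describes.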