Brief · PulseAugur

TOOL · arXiv cs.CL English(EN) · 6h

Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing

Researchers have investigated the robustness of large language models (LLMs) in autoformalization, specifically their ability to generate formal proofs from natural language statements. The study found that LLM performance varies across semantically similar paraphrases of the same input, indicating that minor changes in natural language wording can significantly affect the formal output. The evaluation used the MiniF2F benchmark and the Lean 4 version of ProofNet, tested two modern LLMs, and measured both the semantic validity and the compilation validity of the generated proofs.
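
To make the setup concrete, here is a minimal illustrative sketch of what a paraphrase pair and a target formalization might look like. The statement, its paraphrase, and the theorem name `even_add_even` are hypothetical examples, not items taken from MiniF2F or ProofNet, and the sketch assumes a recent Lean 4 toolchain in which the `omega` tactic is available.

```lean
-- Natural language (original, hypothetical):
--   "The sum of two even integers is even."
-- Natural language (paraphrase, hypothetical):
--   "Adding any two even integers yields an even integer."
--
-- A robust model should map both phrasings to an equivalent formal statement.
-- Evenness is encoded here as divisibility by 2 via the remainder.
theorem even_add_even (a b : Int)
    (ha : a % 2 = 0) (hb : b % 2 = 0) : (a + b) % 2 = 0 := by
  -- `omega` decides linear integer arithmetic, including `%` by a literal.
  omega
```

The two validity measures are distinct: output that type-checks in Lean passes the compilation check, but it passes the semantic check only if the formal statement actually says what the natural language input says.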

IMPACT Highlights LLM sensitivity to input phrasing, suggesting a need for more robust natural language understanding in formal reasoning tasks.

Large Language Models
MiniF2F
Hayden Moore
Lean 4 ProofNet