Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing
Researchers have investigated the robustness of large language models (LLMs) in autoformalization, specifically their ability to generate formal proofs from natural language statements. The study found that LLM performance varies across semantically similar paraphrases of the same input, indicating that minor changes in natural language wording can significantly affect the generated formal output. The evaluation used the MiniF2F benchmark and the Lean 4 version of ProofNet to test two modern LLMs, measuring both the semantic and compilation validity of the generated proofs.
IMPACT Highlights LLM sensitivity to input phrasing, suggesting a need for more robust natural language understanding in formal reasoning tasks.