LLM autoformalization struggles with paraphrased inputs

By PulseAugur Editorial · [1 sources] · 2026-06-04 04:00

Researchers have investigated how robust large language models (LLMs) are in autoformalization, specifically when generating formal proofs from natural language statements. The study found that performance varies across semantically similar paraphrases of the same input, indicating that minor changes in natural-language wording can significantly affect the generated formal output. The evaluation used the MiniF2F benchmark and the Lean 4 version of ProofNet with two modern LLMs, measuring both the semantic validity and the compilation validity of the generated proofs.
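To make the idea of "performance variability across paraphrases" concrete, here is a minimal Python sketch of how compilation results could be compared between an original statement and its paraphrases. It is an illustration only: the `Attempt` record, the function names, and the toy data are hypothetical and do not come from the paper, whose actual metrics and pipeline are not described in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One model output for one input variant (hypothetical record layout)."""
    problem_id: str  # benchmark item, e.g. a MiniF2F problem name
    variant: str     # "original" or a paraphrase label
    compiles: bool   # whether the generated Lean 4 code compiled


def pass_rate_by_variant(attempts):
    """Fraction of attempts that compiled, grouped by input variant."""
    grouped = {}
    for a in attempts:
        grouped.setdefault(a.variant, []).append(a.compiles)
    return {variant: sum(flags) / len(flags) for variant, flags in grouped.items()}


def inconsistent_problems(attempts):
    """Problems whose compile outcome differs between input variants."""
    outcomes = {}
    for a in attempts:
        outcomes.setdefault(a.problem_id, set()).add(a.compiles)
    return sorted(pid for pid, seen in outcomes.items() if len(seen) > 1)


# Made-up toy data, not results from the study.
attempts = [
    Attempt("p1", "original", True),
    Attempt("p1", "paraphrase_1", False),
    Attempt("p2", "original", True),
    Attempt("p2", "paraphrase_1", True),
]

rates = pass_rate_by_variant(attempts)
flipped = inconsistent_problems(attempts)
print(rates)
print(flipped)
```

A gap between the per-variant pass rates, or a non-empty list of problems whose outcome flips, would be one simple signal of the phrasing sensitivity described above. Semantic validity would need a separate check, since code that compiles may still formalize the wrong statement.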

IMPACT Highlights LLM sensitivity to input phrasing, suggesting a need for more robust natural language understanding in formal reasoning tasks.

RANK_REASON This is a research paper published on arXiv detailing experimental findings on LLM robustness.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CL TIER_1 English(EN) · Hayden Moore, Asfahan Shah · 2026-06-04 04:00

Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing

arXiv:2511.12784v3 · Abstract: Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in …
