Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks
Researchers have developed an automated method for testing the robustness of large language models' arithmetic reasoning using numeric-remapping attacks. These attacks substitute different numbers into word problems while preserving the original reasoning steps. Evaluations of models including DeepSeek-R1, Gemma4, and GPT-OSS showed significant accuracy drops on GSM8K, indicating that performance on that dataset is sensitive to numerical variation, while other datasets were more stable.
AI IMPACT: Reveals brittleness in LLM arithmetic reasoning, suggesting a need for evaluation methods that go beyond simple accuracy on specific datasets.
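To make the idea of a numeric-remapping attack concrete, here is a minimal Python sketch. It is not the researchers' implementation: the template, the `solve` function, the sampling range, and the `accuracy` helper are all hypothetical, chosen only to illustrate swapping in new numbers while keeping the reasoning steps fixed and recomputing the gold answer.

```python
import random
import re

# Hypothetical word-problem template (an assumption, not taken from GSM8K).
TEMPLATE = (
    "Ava has {a} boxes with {b} pencils in each box. "
    "She gives away {c} pencils. How many pencils does she have left?"
)


def solve(a, b, c):
    """The fixed reasoning steps: multiply, then subtract."""
    return a * b - c


def remap(rng, low=2, high=50):
    """Draw new numbers and return (question, gold_answer).

    Samples are rejected until the answer is positive, so the
    remapped problem still makes sense.
    """
    while True:
        a = rng.randint(low, high)
        b = rng.randint(low, high)
        c = rng.randint(1, high)
        if solve(a, b, c) > 0:
            return TEMPLATE.format(a=a, b=b, c=c), solve(a, b, c)


def accuracy(model_answer, variants):
    """Fraction of remapped problems that model_answer gets right."""
    correct = sum(1 for question, gold in variants if model_answer(question) == gold)
    return correct / len(variants)


def oracle(question):
    """Stand-in for a model: parses the three numbers and applies solve."""
    a, b, c = map(int, re.findall(r"\d+", question))
    return solve(a, b, c)


if __name__ == "__main__":
    rng = random.Random(0)
    variants = [remap(rng) for _ in range(20)]
    print(accuracy(oracle, variants))
```

In a real evaluation, `oracle` would be replaced by a call to the model under test, and the accuracy on the remapped variants would be compared with the accuracy on the original problems.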