Brief · PulseAugur

RESEARCH · Hugging Face Daily Papers English(EN) · 1d · [3 sources]

Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

Researchers have developed an automated method for testing the robustness of large language models' arithmetic reasoning using numeric-remapping attacks. These attacks replace the numbers in a word problem while preserving the original reasoning steps. Evaluations of models including DeepSeek-R1, Gemma4, and GPT-OSS showed significant accuracy drops on remapped GSM8K problems, indicating that results on this dataset are sensitive to numerical variation, while other datasets showed greater stability.
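
To make the idea of a numeric remapping concrete, here is a minimal Python sketch. It is not the researchers' implementation: the function `remap_numbers`, the sampling range, and the example problem are all hypothetical, and it assumes the reasoning steps can be written as a function of the numbers in the order they appear in the text.

```python
import random
import re


def remap_numbers(problem, solve, rng, low=2, high=99):
    """Swap every integer in `problem` for a random one in [low, high].

    `solve` encodes the reasoning steps as a function of the numbers in
    order of appearance, so the same steps give the new reference answer.
    (Hypothetical helper for illustration only.)
    """
    old_values = [int(tok) for tok in re.findall(r"\d+", problem)]
    new_values = [rng.randint(low, high) for _ in old_values]
    replacements = iter(new_values)
    # Substitute numbers left to right; all other text stays untouched.
    new_problem = re.sub(r"\d+", lambda m: str(next(replacements)), problem)
    return new_problem, solve(*new_values), solve(*old_values)


# Illustrative problem (an assumption, not taken from GSM8K).
problem = (
    "Tom has 3 boxes with 12 apples each. "
    "He buys 5 more apples. How many apples does he have?"
)


def solve(boxes, per_box, extra):
    # Reasoning steps: multiply, then add.
    return boxes * per_box + extra


rng = random.Random(0)  # fixed seed so the remapping is reproducible
new_problem, new_answer, original_answer = remap_numbers(problem, solve, rng)

print(new_problem)
print("reference answer for the remapped problem:", new_answer)
```

A model that has learned the reasoning should answer the remapped problem as reliably as the original. This sketch only uses addition and multiplication of positive integers, so every remapping stays well-posed; a real pipeline would also need to reject remappings that produce negative or fractional quantities.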

IMPACT Reveals the brittleness of LLM arithmetic reasoning, suggesting that evaluation needs to go beyond accuracy on a fixed dataset.

GPT-OSS
GSM8K
DeepSeek-R1
Gemma4
MultiArith
MAWPS