Researchers have developed an automated method for testing the robustness of large language models in arithmetic reasoning by generating numeric-remapping attacks. These attacks replace the numbers in a word problem with different values while preserving the original reasoning steps. Evaluations of models including DeepSeek-R1, Gemma4, and GPT-OSS showed significant accuracy drops on GSM8K, indicating that performance on that dataset is sensitive to numerical variation, while results on other datasets were more stable.
AI IMPACT Reveals brittleness in LLM arithmetic reasoning, suggesting a need for evaluation methods that go beyond accuracy on a fixed dataset.
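To make the idea of a numeric remapping concrete, here is a minimal Python sketch. It is not the paper's method: the function `remap_numbers`, the example problem, the `solve` function, and the value range are all hypothetical, chosen only to illustrate swapping the numbers in a word problem while keeping the reasoning steps (and therefore the way the answer is computed) fixed.

```python
import random
import re

NUMBER = re.compile(r"\d+")

def remap_numbers(problem, solve, rng, low=2, high=99, max_tries=100):
    """Replace every integer in `problem` with a random one in [low, high].

    `solve` encodes the fixed reasoning steps as a function of the numbers,
    in the order they appear in the text. Remappings that yield a
    non-positive answer are rejected and resampled (an assumed validity rule).
    """
    count = len(NUMBER.findall(problem))
    for _ in range(max_tries):
        values = [rng.randint(low, high) for _ in range(count)]
        answer = solve(*values)
        if answer > 0:
            it = iter(values)
            # Substitute numbers left to right; all other text is untouched.
            return NUMBER.sub(lambda m: str(next(it)), problem), answer
    raise ValueError("no valid remapping found")

# Hypothetical GSM8K-style problem and its reasoning steps.
problem = ("Tom has 3 boxes with 12 apples each. He gives away 5 apples. "
           "How many apples are left?")

def solve(boxes, per_box, given_away):
    return boxes * per_box - given_away

variant, expected = remap_numbers(problem, solve, random.Random(0))
print(variant)
print(expected)
```

A model's answer to `variant` can then be compared with `expected`; a drop in accuracy across many such variants, relative to the original problems, is the kind of sensitivity the summary describes.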
Read on Hugging Face Daily Papers →
AI-generated summary · Google Gemini · from 3 sources.