PulseAugur

LLMs show arithmetic fragility on GSM8K dataset via numeric attacks

Researchers have developed an automated method for testing the robustness of large language models in arithmetic reasoning by generating numeric-remapping attacks. These attacks replace the numbers in a word problem with different values while preserving the original reasoning steps. Evaluations of models including DeepSeek-R1, Gemma4, and GPT-OSS showed significant accuracy drops on GSM8K, indicating that performance on that dataset is sensitive to numerical variation, while other datasets were more stable.
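To make the idea concrete, here is a minimal sketch of what a numeric remapping might look like. It is not the authors' method: the template, the `solve`, `remap`, `accuracy`, and `oracle` names, the value range, and the toy "memorizer" model are all assumptions made for illustration, and the example problem is written in the style of GSM8K rather than taken from it.

```python
import random
import re

# Hypothetical word problem with numeric slots (illustrative, not from GSM8K).
TEMPLATE = (
    "Ana has {a} apples. She buys {b} bags with {c} apples each. "
    "How many apples does she have now?"
)


def solve(a, b, c):
    """The fixed reasoning chain: one multiplication, then one addition."""
    return a + b * c


def remap(rng, low=2, high=99):
    """Draw fresh values for every slot; the reasoning chain is unchanged."""
    values = {name: rng.randint(low, high) for name in ("a", "b", "c")}
    return TEMPLATE.format(**values), solve(**values)


def accuracy(model, problems):
    """Fraction of (question, gold) pairs the model answers exactly."""
    return sum(model(q) == gold for q, gold in problems) / len(problems)


def oracle(question):
    """Stand-in for a robust model: reads the numbers and recomputes."""
    a, b, c = (int(n) for n in re.findall(r"\d+", question))
    return a + b * c


if __name__ == "__main__":
    rng = random.Random(0)
    original = [(TEMPLATE.format(a=3, b=4, c=5), solve(3, 4, 5))]
    variants = [remap(rng) for _ in range(50)]

    # Stand-in for a brittle model: always returns the original answer.
    memorizer = lambda question: solve(3, 4, 5)

    for name, model in (("oracle", oracle), ("memorizer", memorizer)):
        drop = accuracy(model, original) - accuracy(model, variants)
        print(f"{name}: accuracy drop under remapping = {drop:.2f}")
```

The accuracy gap between the original problem and its remapped variants is the kind of signal the summary describes: a model that truly follows the reasoning steps should score the same on both.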

IMPACT Reveals LLM arithmetic reasoning brittleness, suggesting a need for more robust evaluation methods beyond simple accuracy on specific datasets.

RANK_REASON The cluster contains an academic paper detailing a new method for evaluating LLM robustness on arithmetic reasoning.


AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

  1. arXiv cs.AI TIER_1 English(EN) · Malia Barker, Bishal Lakha, Edoardo Serra, Francesco Gullo

    Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

    arXiv:2606.03606v1. Large language models achieve strong performance on arithmetic reasoning benchmarks, and one common response to arithmetic brittleness is to delegate computation to code. Yet models are still often used in settings where they must…

  2. arXiv cs.AI TIER_1 English(EN) · Francesco Gullo

    Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

    Large language models achieve strong performance on arithmetic reasoning benchmarks, and one common response to arithmetic brittleness is to delegate computation to code. Yet models are still often used in settings where they must reason directly from natural language, and trustw…

  3. Hugging Face Daily Papers TIER_1 English(EN)

    Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

    Large language models achieve strong performance on arithmetic reasoning benchmarks, and one common response to arithmetic brittleness is to delegate computation to code. Yet models are still often used in settings where they must reason directly from natural language, and trustw…