Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

LLMs show arithmetic brittleness on the GSM8K dataset under numeric attacks

By PulseAugur Editorial · [3 sources] · 2026-06-02 13:09

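Researchers have developed an automatic method for testing how robust large language models are at arithmetic reasoning by generating numeric-remapping attacks. These attacks replace the numbers in a word problem with different values while preserving the original reasoning steps. Evaluations of models including DeepSeek-R1, Gemma4, and GPT-OSS showed a significant drop in accuracy on the GSM8K dataset, highlighting their sensitivity to numeric changes, while other datasets showed greater stability.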
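To make the idea of a numeric remapping concrete, here is a minimal Python sketch. It is not the authors' method: the function `remap_numbers`, the example problem, the `solve` function, the sampling range, and the non-negativity check are all assumptions made for illustration. The sketch swaps every integer in a word problem for a fresh random value and recomputes the reference answer by running the same reasoning steps on the new values.

```python
import random
import re

NUMBER = re.compile(r"\d+")

def remap_numbers(problem, solve, rng, low=2, high=50, max_tries=100):
    """Hypothetical helper: replace each integer in `problem` with a random
    integer in [low, high] and recompute the answer with the unchanged
    reasoning steps encoded in `solve`."""
    n_slots = len(NUMBER.findall(problem))
    for _ in range(max_tries):
        values = [rng.randint(low, high) for _ in range(n_slots)]
        answer = solve(*values)
        # Assumed validity rule: reject variants with a negative answer,
        # since they would no longer describe a sensible word problem.
        if answer >= 0:
            fresh = iter(values)
            variant = NUMBER.sub(lambda m: str(next(fresh)), problem)
            return variant, values, answer
    raise ValueError("no valid remapping found")

# Illustrative problem (not taken from GSM8K).
problem = ("Tom has 3 boxes with 4 apples each. "
           "He gives away 5 apples. How many apples are left?")

def solve(boxes, per_box, given):
    # The reasoning steps stay fixed; only the numbers change.
    return boxes * per_box - given

rng = random.Random(0)
variant, values, answer = remap_numbers(problem, solve, rng)
print(variant)
print("reference answer:", answer)
```

A model's answer to `variant` can then be compared against `answer`. A model that solves the original problem but fails on such variants is showing the kind of sensitivity to numeric changes described above.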

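Impact: Reveals the brittleness of LLM arithmetic reasoning and points to the need for more robust evaluation methods that go beyond simple accuracy on a specific dataset.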

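Ranking rationale: The cluster contains an academic paper detailing a new method for evaluating the robustness of LLMs in arithmetic reasoning.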


AI-generated summary · Google Gemini · from 3 sources.

Sources [3]

arXiv cs.AI TIER_1 English(EN) · Malia Barker, Bishal Lakha, Edoardo Serra, Francesco Gullo · 2026-06-03 04:00

Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

arXiv:2606.03606v1 Announce Type: cross Abstract: Large language models achieve strong performance on arithmetic reasoning benchmarks, and one common response to arithmetic brittleness is to delegate computation to code. Yet models are still often used in settings where they must…
arXiv cs.AI TIER_1 English(EN) · Francesco Gullo · 2026-06-02 13:09

Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

Large language models achieve strong performance on arithmetic reasoning benchmarks, and one common response to arithmetic brittleness is to delegate computation to code. Yet models are still often used in settings where they must reason directly from natural language, and trustw…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 13:09

Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks

Large language models achieve strong performance on arithmetic reasoning benchmarks, and one common response to arithmetic brittleness is to delegate computation to code. Yet models are still often used in settings where they must reason directly from natural language, and trustw…
