A new research paper titled "Too long; didn't solve" investigates the impact of prompt and solution length on the performance of large language models in mathematical reasoning tasks. The study, which uses a newly constructed adversarial dataset of expert-authored mathematics problems, finds that both increased prompt length and increased solution length correlate with higher model failure rates. While a difficulty-adjusted analysis showed weak negative associations between these length variables and model separation, the primary finding emphasizes that structural length is a significant factor in the empirical difficulty of these mathematical benchmarks.
AI IMPACT This research suggests that current LLM evaluation methods may be sensitive to input and output length, potentially requiring adjustments for more robust assessments of reasoning capabilities.
RANK_REASON The cluster contains a research paper published on arXiv detailing findings about LLM performance on mathematical benchmarks.
- alphaXiv
- arXiv
- CatalyzeX Code Finder for Papers
- CORE Recommender
- DagsHub
- Gotit.pub
- Hugging Face
- Influence Flower
- Lucía Magalí Cabrera
- ScienceCast
AI-generated summary · Google Gemini · from 1 source.