Too long; didn't solve
A new research paper titled "Too long; didn't solve" investigates how prompt length and solution length affect the performance of large language models on mathematical reasoning tasks. The study uses a newly constructed adversarial dataset of expert-authored mathematics problems and finds that both longer prompts and longer solutions correlate with higher model failure rates. A difficulty-adjusted analysis showed only weak negative associations between these length variables and model separation, but the primary finding is that structural length is a significant factor in the empirical difficulty of these mathematical benchmarks.
AI IMPACT: This research suggests that current LLM evaluation methods may be sensitive to input and output length, so benchmarks may need to account for length to assess reasoning capabilities more robustly.