Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation
A new paper on arXiv examines a critical limitation in how the difficulty of math-reasoning problems is estimated for AI models. The study finds that standard benchmarks, which rely on the success rate of sampled solutions (pass@k), fail to accurately assess the hardest problems. The researchers report that a significant percentage of problems that sampling never solves can be solved with a deterministic approach based on residual-stream perturbations, suggesting that these problems are not inherently too difficult but simply unreached by typical sampling strategies.
IMPACT: Highlights a flaw in current AI evaluation methods for complex reasoning tasks. Addressing it could lead to more accurate difficulty estimation and improved model training.
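To make the pass@k blind spot concrete, here is a minimal sketch in Python. It assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct; the summary does not say which estimator or sample budget the paper uses, so the function names and the numbers below are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled solutions, c of them correct.

    Assumption: this is the common estimator 1 - C(n-c, k) / C(n, k);
    the paper's exact estimator is not stated in the summary.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the samples contains a correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def prob_all_fail(p: float, n: int) -> float:
    """Chance that n independent samples all fail when each succeeds with probability p."""
    return (1.0 - p) ** n


# Hypothetical problem: the model solves it on 1% of samples (illustrative value).
p_true = 0.01
n_samples = 64

# If all 64 samples fail, the estimate is exactly zero for every k ...
print(pass_at_k(n_samples, 0, 8))
# ... even though that outcome is quite likely for a solvable problem.
print(prob_all_fail(p_true, n_samples))
```

With these illustrative numbers, a problem that the model can solve one time in a hundred still produces zero correct samples more often than not, so a pass@k of zero cannot separate "too hard" from "not yet reached by sampling".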