ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark
Researchers have introduced ASyMOB, a new benchmark designed to evaluate the symbolic mathematics capabilities of large language models. The dataset contains over 35,000 validated problems across various mathematical domains, with a focus on testing generalization through symbolic and numeric transformations. Initial evaluations show that most models struggle with even minor perturbations, while top systems are more robust, and access to code tools significantly stabilizes performance.
AI IMPACT: Provides a more rigorous evaluation of LLMs in symbolic mathematics, pushing development toward genuine reasoning over memorization.
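
To make the idea of a numeric perturbation concrete, here is a minimal toy sketch. It is not ASyMOB's actual pipeline, which this summary does not describe. Polynomials are assumed to be stored as coefficient lists, and the helper names (`perturb_numeric`, `is_equivalent`, and so on) are hypothetical. The sketch perturbs a seed differentiation problem and shows that an answer memorized for the seed no longer matches the perturbed variant.

```python
import random

def poly_eval(coeffs, x):
    """Evaluate a polynomial [c0, c1, c2, ...] at x using Horner's rule."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def poly_derivative(coeffs):
    """Differentiate a polynomial given as a coefficient list."""
    return [i * c for i, c in enumerate(coeffs)][1:]

def perturb_numeric(coeffs, rng, scale=5):
    """Hypothetical numeric perturbation: multiply every nonzero
    coefficient by a random integer in [2, scale]."""
    return [c * rng.randint(2, scale) if c != 0 else c for c in coeffs]

def is_equivalent(candidate, reference):
    """Compare two polynomials by evaluating both at n distinct points,
    where n is the longer coefficient-list length. Two polynomials of
    degree below n that agree at n distinct points are identical."""
    n = max(len(candidate), len(reference))
    return all(poly_eval(candidate, x) == poly_eval(reference, x)
               for x in range(n))

# Seed problem: differentiate 1 + 3x^2.
rng = random.Random(0)
seed = [1, 0, 3]
variant = perturb_numeric(seed, rng)

reference = poly_derivative(variant)   # correct answer for the variant
memorized = poly_derivative(seed)      # answer recalled for the seed only

# The memorized seed answer fails on the perturbed variant.
print(is_equivalent(memorized, reference))  # False
```

A real benchmark would verify answers with a computer algebra system rather than point evaluation. The point-based check here is only a stand-in that keeps the sketch free of third-party dependencies.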