PulseAugur

New Russian finance benchmark reveals LLM reasoning gaps

Researchers have introduced RusFinChain, a benchmark for evaluating verifiable chain-of-thought reasoning in finance in the Russian language. It contains over 5,000 parameterized examples across 17 domains, each paired with a gold-standard reasoning chain that allows automatic verification. In initial evaluations of eight open-weight large language models, the models reached around 0.65 F1 on step alignment but answered only about 29% of final questions correctly, a substantial gap between plausible intermediate steps and correct final answers. The study also proposes two new metrics, Fuzzy Numeric Alignment and Soft-Attention Alignment, which correlated more strongly with final-answer correctness than existing evaluation methods.
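To make the gap between step alignment and final-answer accuracy concrete, here is a minimal Python sketch of a tolerance-based step-alignment F1. It is an illustration only: the summary does not give the paper's definitions of Fuzzy Numeric Alignment or Soft-Attention Alignment, so the function names (`numbers`, `steps_match`, `step_f1`), the 1% relative tolerance, and the greedy one-to-one matching are all assumptions rather than the authors' method.

```python
import re

# Assumption: a "step" is a line of text and is compared only on the numbers it contains.
NUM = re.compile(r"-?\d+(?:\.\d+)?")


def numbers(step: str) -> list[float]:
    """Extract all numeric literals from a reasoning step."""
    return [float(x) for x in NUM.findall(step)]


def steps_match(pred: str, gold: str, rel_tol: float = 0.01) -> bool:
    """True if every number in the gold step has a close counterpart in pred.

    rel_tol is a hypothetical tolerance, not a value taken from the paper.
    """
    pred_nums, gold_nums = numbers(pred), numbers(gold)
    if not gold_nums:
        return False
    return all(
        any(abs(p - g) <= rel_tol * max(abs(g), 1e-9) for p in pred_nums)
        for g in gold_nums
    )


def step_f1(pred_steps: list[str], gold_steps: list[str], rel_tol: float = 0.01) -> float:
    """Greedy one-to-one alignment of predicted steps to gold steps, scored as F1."""
    used: set[int] = set()
    matched = 0
    for gold in gold_steps:
        for i, pred in enumerate(pred_steps):
            if i not in used and steps_match(pred, gold, rel_tol):
                used.add(i)
                matched += 1
                break
    if matched == 0:
        return 0.0
    precision = matched / len(pred_steps)
    recall = matched / len(gold_steps)
    return 2 * precision * recall / (precision + recall)


# Made-up example (not from the benchmark): the first step is right, the last one is not.
gold = ["interest = 1000 * 0.05 = 50", "total = 1000 + 50 = 1050"]
pred = ["interest = 1000 * 0.05 = 50.0", "total = 1000 + 50 = 1500"]
score = step_f1(pred, gold)
```

In this made-up example one of the two steps aligns, so the step score is non-zero even though the final answer is wrong, which is the kind of mismatch the reported 0.65 F1 versus 29% accuracy figures point to.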

IMPACT This benchmark could improve the evaluation of LLMs in financial reasoning tasks for Russian-speaking users.

RANK_REASON The cluster describes a new academic paper introducing a benchmark for LLM reasoning.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

  1. arXiv cs.CL TIER_1 English(EN) · M. K. Arabov

    RusFinChain: A Russian Benchmark for Verifiable Chain-of-Thought Reasoning in Finance with Fuzzy-Aligned Evaluation

    arXiv:2607.01388v1 Abstract: Multi-step symbolic reasoning is essential for robust financial analysis, yet most benchmarks neglect intermediate reasoning steps. FINCHAIN introduced verifiable Chain-of-Thought (CoT) evaluation but is limited to English. FINESSE-…