Researchers have introduced RusFinChain, a benchmark for evaluating verifiable chain-of-thought reasoning in finance for the Russian language. It contains over 5,000 parameterized examples across 17 domains, each paired with a gold-standard reasoning chain that allows automatic verification. Initial evaluations of eight open-weight large language models revealed a significant gap between intermediate reasoning and final results: models reached around 0.65 F1 on step alignment but answered only about 29% of questions correctly. The study also proposed two new metrics, Fuzzy Numeric Alignment and Soft-Attention Alignment, which correlated more strongly with final-answer correctness than existing evaluation methods.
AI IMPACT This benchmark could improve the evaluation of LLMs on financial reasoning tasks for Russian-speaking users.
RANK_REASON The cluster describes a new academic paper introducing a benchmark for LLM reasoning.
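To illustrate how a step-alignment score can stay fairly high while the final answer is wrong, here is a minimal Python sketch. It is not the paper's Fuzzy Numeric Alignment or Soft-Attention Alignment metric, whose definitions are not given in this summary. The matching rule (numbers in a step agree within a relative tolerance), the greedy alignment, and the names `steps_match` and `step_f1` are all assumptions made for illustration.

```python
import math
import re

# Matches integers and decimals such as 100000, 0.12, -5
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def numbers(step: str) -> list[float]:
    """Extract the numeric literals that appear in one reasoning step."""
    return [float(tok) for tok in NUMBER.findall(step)]


def steps_match(pred: str, gold: str, rel_tol: float = 0.01) -> bool:
    """Hypothetical matching rule: two steps match when they contain the
    same count of numbers and each pair agrees within rel_tol."""
    p, g = numbers(pred), numbers(gold)
    if not g or len(p) != len(g):
        return False
    return all(math.isclose(a, b, rel_tol=rel_tol) for a, b in zip(p, g))


def step_f1(pred_steps: list[str], gold_steps: list[str],
            rel_tol: float = 0.01) -> float:
    """Greedy one-to-one alignment of predicted to gold steps, scored as F1."""
    unused = list(gold_steps)
    hits = 0
    for p in pred_steps:
        for g in unused:
            if steps_match(p, g, rel_tol):
                unused.remove(g)  # each gold step can be matched only once
                hits += 1
                break
    if hits == 0:
        return 0.0
    precision = hits / len(pred_steps)
    recall = hits / len(gold_steps)
    return 2 * precision * recall / (precision + recall)


# Invented example: the first step is right, the last step has a digit swap.
gold = ["interest = 100000 * 0.12 = 12000",
        "total = 100000 + 12000 = 112000"]
pred = ["interest = 100000 * 0.12 = 12000",
        "total = 100000 + 12000 = 121000"]

score = step_f1(pred, gold)
final_correct = numbers(pred[-1])[-1] == numbers(gold[-1])[-1]
print(score, final_correct)
```

In this made-up example the predicted chain earns partial step credit even though its final number is wrong, which is the kind of gap between step alignment and final-answer accuracy that the summary reports.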
- ChainEval
- Chain-of-Thought
- finance
- FINCHAIN
- FINESSE-Bench
- Fuzzy Numeric Alignment
- Mullosharaf Arabov Am
- Python
- RusFinChain
- Russian
- Soft-Attention Alignment
AI-generated summary · Google Gemini · from 2 sources.