PulseAugur
research · [2 sources]

New benchmarks reveal LLMs struggle with Arabic and symbolic financial reasoning

Researchers have introduced SAHM, a new benchmark designed to evaluate Arabic financial and Shari'ah-compliant reasoning in large language models. The benchmark includes over 14,000 expert-verified instances across seven tasks, addressing a significant gap in Arabic financial NLP. Evaluations of 20 LLMs revealed that while models perform well on recognition tasks, they are considerably weaker at financial reasoning, particularly event-cause analysis. Separately, the FinChain benchmark was developed to assess verifiable chain-of-thought reasoning in finance, using parameterized templates and executable code for scalable data generation. FinChain's evaluation of 26 LLMs highlighted limitations in multi-step symbolic financial reasoning, though domain-adapted models showed improvement.

Summary written by gemini-2.5-flash-lite from 2 sources.

IMPACT New benchmarks for Arabic financial reasoning and verifiable chain-of-thought in finance may drive development of more trustworthy and specialized financial AI tools.

RANK_REASON Two new academic papers introduce benchmarks for evaluating financial reasoning in LLMs, one focusing on Arabic and Shari'ah compliance and the other on verifiable chain-of-thought.


COVERAGE [2]

  1. arXiv cs.LG TIER_1 · Rania Elbadry, Sarfraz Ahmad, Ahmed Heakl, Dani Bouch, Momina Ahsan, Muhra AlMahri, Marwa Elsaid khalil, Yuxia Wang, Salem Lahlou, Sophia Ananiadou, Veselin Stoyanov, Jimin Huang, Xueqing Peng, Preslav Nakov, Zhuohan Xie ·

    SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning

    arXiv:2604.19098v2 Announce Type: replace-cross Abstract: English financial NLP has advanced rapidly through benchmarks targeting earnings analysis, market sentiment, tabular reasoning, and financial question answering, yet Arabic financial NLP remains virtually nonexistent, desp…

  2. arXiv cs.AI TIER_1 · Zhuohan Xie, Daniil Orel, Rushil Thareja, Dhruv Sahnan, Hachem Madmoun, Fan Zhang, Debopriyo Banerjee, Georgi Georgiev, Xueqing Peng, Lingfei Qian, Jimin Huang, Jinyan Su, Aaryamonvikram Singh, Rui Xing, Rania Elbadry, Chen Xu, Haonan Li, Fajri Koto, Ivan ·

    FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

    arXiv:2506.02515v4 Announce Type: replace-cross Abstract: Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while negl…
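
The summary above says FinChain relies on parameterized templates and executable code so that every reasoning step can be checked. The sketch below illustrates that general idea only. It is not taken from the FinChain paper or its code: the compound-interest scenario, the `Instance` class, and the functions `build_instance`, `sample_instance`, and `verify` are all hypothetical names chosen for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """One generated problem with its worked steps (hypothetical structure)."""
    question: str
    steps: list[str]
    answer: float


def build_instance(principal: int, rate_pct: int, years: int) -> Instance:
    """Fill an illustrative compound-interest template and compute each step in code."""
    growth = (1 + rate_pct / 100) ** years
    final_value = round(principal * growth, 2)
    interest = round(final_value - principal, 2)
    question = (
        f"An investor deposits {principal} at {rate_pct}% annual interest, "
        f"compounded yearly, for {years} years. How much interest is earned?"
    )
    steps = [
        f"growth factor = (1 + {rate_pct}/100)^{years} = {growth:.6f}",
        f"final value = {principal} * {growth:.6f} = {final_value}",
        f"interest = {final_value} - {principal} = {interest}",
    ]
    return Instance(question=question, steps=steps, answer=interest)


def sample_instance(rng: random.Random) -> Instance:
    """Draw template parameters at random, so one template yields many problems."""
    principal = rng.randrange(1_000, 10_001, 500)
    rate_pct = rng.choice([2, 3, 4, 5, 6])
    years = rng.randint(2, 5)
    return build_instance(principal, rate_pct, years)


def verify(instance: Instance, predicted: float, tol: float = 0.01) -> bool:
    """Check a model's final answer against the value computed by the template."""
    return abs(predicted - instance.answer) <= tol


if __name__ == "__main__":
    example = sample_instance(random.Random(0))
    print(example.question)
    for step in example.steps:
        print(" ", step)
```

Because the reference steps are produced by code rather than written by hand, a grader could in principle compare a model's intermediate values against them as well as the final answer. How FinChain actually scores intermediate steps is not described in the summary above.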