PulseAugur

New LEDGER benchmark tests LLMs on long-context financial report analysis

Researchers have introduced LEDGER, a benchmark dataset for evaluating the long-context capabilities of large language models in financial retrieval and extraction. The dataset comprises 4,999 digitized corporate annual reports that retain figures, tables, and narrative text, going beyond simplified regulatory filings. LEDGER includes three evaluation benchmarks, ranging from page-level KPI retrieval to conversational lookups and full KPI extraction, all derived from numerically dense, lengthy reports. The project also provides human-annotated data and a toolchain for extraction, validation, and scoring, and illustrates the dataset's use with a case study on CEO letter rhetoric and market impact.

IMPACT The benchmark could enable more rigorous evaluation of how well LLMs process and extract information from lengthy financial documents.

RANK_REASON The cluster describes a new academic paper introducing a benchmark dataset for evaluating LLMs.


AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.CL TIER_1 English(EN) · Charles Moslonka, Amaury de Vitry, Arthur Garnier, Hicham Randrianarivo, Emmanuel Malherbe

    LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

    arXiv:2606.13100v1. Abstract: Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most public…

  2. arXiv cs.CL TIER_1 English(EN) · Emmanuel Malherbe

    LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

    Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most public financial resources reduce the task to plain-te…