LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction
Researchers have introduced LEDGER, a new benchmark dataset designed to evaluate the long-context capabilities of large language models in financial retrieval and extraction. The dataset comprises 4,999 digitized corporate annual reports, complete with figures, tables, and narrative text, moving beyond simplified regulatory filings. LEDGER includes three distinct evaluation benchmarks, ranging from page-level KPI retrieval to conversational lookups and full KPI extraction, all derived from numerically dense, lengthy reports. The project also provides human-annotated data and a comprehensive toolchain for extraction, validation, and scoring, and it demonstrates the dataset's utility with a case study on CEO letter rhetoric and market impact.
AI IMPACT This benchmark will enable more rigorous evaluation of LLMs' ability to process and extract information from lengthy financial documents.