PulseAugur

Stanford releases 152B-token dataset for financial LLM training

Researchers have introduced the Stanford EDGAR Filings Dataset (SEFD), a new open-source corpus of clean, long-context documents for training large language models, particularly in the financial domain. The dataset reconstructs SEC filings into a layout-faithful format suited to financial language modeling and to tasks such as forecasting and document understanding. The initial release, SEFD-v1, contains 152 billion tokens; a larger archive is estimated at 550 billion tokens. The project also introduces two benchmarks, EDGAR-Forecast and EDGAR-OCR, which evaluate financial forecasting and complex table transcription, respectively.

IMPACT Provides a large, specialized dataset to improve LLM performance on financial tasks and document understanding.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Nick Bettencourt, Xiaowei Ding, Kay Giesecke

    The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

    arXiv:2606.18192v1. Abstract: As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often prop…

  2. arXiv cs.AI TIER_1 English(EN) · Kay Giesecke

    The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

    As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically gen…