Stanford releases 152B-token dataset for financial LLM training

By PulseAugur Editorial · [1 source] · 2026-06-16 17:22

Researchers have introduced the Stanford EDGAR Filings Dataset (SEFD), a new open corpus derived from SEC filings. The dataset reconstructs corporate and financial disclosures into a layout-faithful format suitable for training large language models on long-context documents. SEFD aims to provide a token-efficient, model-ready resource for financial language modeling, supporting tasks such as forecasting, compliance, and document understanding. The initial release includes 152 billion tokens, with a larger archive estimated at 550 billion tokens. It also introduces two new benchmarks for evaluating financial forecasting and table transcription.

IMPACT Provides a large, open dataset for training LLMs on financial documents, potentially improving AI capabilities in financial analysis and forecasting.

RANK_REASON The cluster contains an academic paper detailing a new dataset and benchmarks for LLM training.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Kay Giesecke · 2026-06-16 17:22

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically gen…
