Stanford releases 152B-token dataset for financial LLM training

By PulseAugur Editorial · [2 sources] · 2026-06-16 17:22

Researchers have introduced the Stanford EDGAR Filings Dataset (SEFD), a new open-source corpus of clean, long-context documents for training large language models, particularly in the financial domain. The dataset reconstructs SEC filings into a layout-faithful format suited to financial language modeling and to tasks such as forecasting and document understanding. SEFD-v1, the initial release, contains 152 billion tokens, with a larger archive estimated at 550 billion tokens. The project also introduces two new benchmarks, EDGAR-Forecast and EDGAR-OCR, which evaluate financial forecasting and complex table transcription, respectively.

IMPACT Provides a large, specialized dataset to improve LLM performance on financial tasks and document understanding.

RANK_REASON The cluster describes the release of a new academic dataset and associated benchmarks for AI research.



AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.AI TIER_1 English(EN) · Nick Bettencourt, Xiaowei Ding, Kay Giesecke · 2026-06-17 04:00

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

arXiv:2606.18192v1. Abstract: As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often prop…

arXiv cs.AI TIER_1 English(EN) · Kay Giesecke · 2026-06-16 17:22

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically gen…

