Researchers have introduced the Stanford EDGAR Filings Dataset (SEFD), a new open-source corpus designed to provide clean, long-context documents for training large language models, particularly in the financial domain. The dataset reconstructs SEC filings into a layout-faithful format, making them suitable for financial language modeling and for tasks such as forecasting and document understanding. The initial release, SEFD-v1, contains 152 billion tokens; the larger archive is estimated at 550 billion tokens. The project also introduces two new benchmarks, EDGAR-Forecast and EDGAR-OCR, to evaluate financial forecasting and complex table transcription.
AI IMPACT Provides a large, specialized dataset to improve LLM performance on financial tasks and document understanding.
- Common Crawl
- EDGAR-Forecast
- EDGAR-OCR
- Stanford EDGAR Filings Dataset
- Stanford
- EDGAR
- Nick Bettencourt
AI-generated summary · Google Gemini · from 2 sources.