New industrial source code dataset CIDR released for research

By PulseAugur Editorial · Summary by gemini-2.5-flash-lite from 1 source

Researchers have introduced CIDR, a large-scale dataset of industrial source code designed to advance software engineering research. The dataset includes 2,440 repositories from 12 partner organizations, totaling 373 million lines of code across 138 programming languages. CIDR stands out because it comprises proprietary production codebases, processed through rigorous quality selection and anonymization. It is intended for research in code intelligence, model pre-training, and agent evaluation.

IMPACT Enables new research in code intelligence and the development of code language models and AI agents.

RANK_REASON The cluster describes the release of a new dataset for software engineering research, including details on its scale, origin, and intended applications.

COVERAGE [1]

arXiv cs.AI TIER_1 · Vladislav Savenkov · 2026-05-12 14:07

CIDR: A Large-Scale Industrial Source Code Dataset for Software Engineering Research

We present Curated Industrial Developer Repository (CIDR), a large-scale dataset of real-world software repositories collected through direct collaboration with 12 industrial partner organizations. The dataset comprises 2,440 repositories spanning 138 programming languages and to…
