Researchers have introduced CIDR, a large-scale dataset of industrial source code designed to advance software engineering research. The dataset includes 2,440 repositories from 12 partner organizations, totaling 373 million lines of code across 138 programming languages. CIDR is distinctive in that it comprises proprietary production codebases, processed through rigorous quality selection and anonymization. It is intended for research in code intelligence, model pre-training, and agent evaluation.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Enables new research in code intelligence and the development of code language models and AI agents.
RANK_REASON The cluster describes the release of a new dataset for software engineering research, including details on its scale, origin, and intended applications.