PulseAugur

MIRA framework enhances LLM mid-training data selection

Researchers have introduced MIRA (Mid-training Rubric Anchoring), a framework for source-aware data selection during the mid-training phase of large language models. It addresses the challenge of curating data from diverse sources by integrating rubric discovery directly into the selection process. MIRA identifies relevant evaluation criteria for each group of data sources, then uses those criteria to train scalable scoring models that can filter large datasets efficiently. In the reported experiments, MIRA improves performance on code-related benchmarks while significantly reducing the volume of data required.
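The selection flow described above (per-source criteria, a scorer, then filtering) can be illustrated with a minimal sketch. This is not the paper's implementation: the source names, rubric criteria, weights, and the `keep_fraction` parameter are all hypothetical, and a hand-written weighted sum stands in for the trained scoring models.

```python
from collections import defaultdict

# Hypothetical rubrics: each source group gets its own weighted criteria.
# In MIRA the criteria are discovered and scorers are trained; here they
# are fixed by hand purely for illustration.
RUBRICS = {
    "github": {"has_docstring": 0.6, "has_tests": 0.4},
    "forum": {"has_code_block": 0.7, "has_answer": 0.3},
}


def score(doc):
    """Sum the weights of the rubric criteria this document satisfies."""
    rubric = RUBRICS[doc["source"]]
    return sum(w for name, w in rubric.items() if doc["features"].get(name))


def select(docs, keep_fraction=0.5):
    """Keep the top-scoring fraction of documents within each source group."""
    by_source = defaultdict(list)
    for doc in docs:
        by_source[doc["source"]].append(doc)

    kept = []
    for group in by_source.values():
        ranked = sorted(group, key=score, reverse=True)
        n_keep = max(1, int(len(ranked) * keep_fraction))
        kept.extend(ranked[:n_keep])
    return kept


docs = [
    {"id": "a", "source": "github", "features": {"has_docstring": True, "has_tests": True}},
    {"id": "b", "source": "github", "features": {"has_tests": True}},
    {"id": "c", "source": "forum", "features": {"has_answer": True}},
    {"id": "d", "source": "forum", "features": {"has_code_block": True}},
]
selected_ids = [doc["id"] for doc in select(docs)]
```

Ranking within each source group, rather than across the whole pool, keeps one source's scoring scale from crowding out another's, which is the sense in which the selection here is source-aware.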

IMPACT MIRA's approach could lead to more efficient and effective LLM training by optimizing data selection during a critical mid-training phase.

RANK_REASON The cluster contains a research paper detailing a new method for LLM training.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Haowen Wang, Yaxin Du, Jian Yang, Jiajun Wu, Shukai Liu, Yuxuan Zhang, Pingjie Wang, Siheng Chen, Tuney Zheng, Ming Zhou, Xianglong Liu ·

    MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

    arXiv:2605.30288v1. Abstract: Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a p…

  2. arXiv cs.AI TIER_1 English(EN) · Xianglong Liu ·

    MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

    Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining s…