MIRA framework improves LLM mid-training data selection

By PulseAugur Editorial · [1 source] · 2026-05-29 00:00

Researchers have developed MIRA (Mid-training Rubric Anchoring), a framework for selecting data during the mid-training phase of large language model development. It addresses the heterogeneity of mid-training data sources by discovering and applying quality rubrics specific to each source. A frontier teacher model identifies the evaluation criteria, these criteria are distilled into student scorers, and the scorers then filter the data. The design aims to balance scalability with semantic accuracy, and MIRA is reported to outperform other selection methods on code-related benchmarks.
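The three stages described above (rubric discovery, distillation into scorers, filtering) can be illustrated with a toy sketch. This is not the paper's implementation: the summary gives no API or algorithmic detail, so every name here (`Doc`, `discover_rubrics`, `make_student_scorer`, `select`), the source labels, the keyword criteria, and the 0.75 threshold are hypothetical. In particular, the teacher model is replaced by hard-coded rules and the distilled student scorer by a simple fraction-of-criteria-met function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Doc:
    source: str  # hypothetical source label, e.g. "code" or "forum"
    text: str


Criterion = Callable[[str], bool]


def discover_rubrics() -> Dict[str, List[Criterion]]:
    """Stand-in for stage 1. In the summary a frontier teacher model
    discovers criteria per source; here they are hard-coded assumptions."""
    return {
        "code": [
            lambda t: "def " in t or "class " in t,  # contains a definition
            lambda t: "#" in t or '"""' in t,        # contains a comment or docstring
        ],
        "forum": [
            lambda t: "?" in t,                      # asks a question
            lambda t: len(t.split()) >= 8,           # is not a one-liner
        ],
    }


def make_student_scorer(rubric: List[Criterion]) -> Callable[[str], float]:
    """Stand-in for stage 2. A real student scorer would be a small trained
    model; this one returns the fraction of rubric criteria a text meets."""
    def score(text: str) -> float:
        return sum(1 for criterion in rubric if criterion(text)) / len(rubric)
    return score


def select(docs: List[Doc],
           scorers: Dict[str, Callable[[str], float]],
           threshold: float = 0.75) -> List[Doc]:
    """Stage 3: keep a document only if the scorer for its own source
    rates it at or above the threshold."""
    return [
        d for d in docs
        if d.source in scorers and scorers[d.source](d.text) >= threshold
    ]


docs = [
    Doc("code", "def add(a, b):\n    # sum two numbers\n    return a + b"),
    Doc("code", "lorem ipsum"),
    Doc("forum", "How do I reverse a list in Python without copying it?"),
    Doc("forum", "thanks"),
]

scorers = {
    source: make_student_scorer(rubric)
    for source, rubric in discover_rubrics().items()
}
kept = select(docs, scorers)
for d in kept:
    print(d.source, repr(d.text[:40]))
```

The point of the sketch is only the shape of the pipeline: each document is judged by the rubric of its own source, so a forum post is never penalized for lacking code and a code file is never penalized for lacking a question.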

IMPACT Could make LLM mid-training more efficient by filtering data with source-specific quality criteria, with reported gains on code-related benchmarks.

RANK_REASON The cluster describes a new academic paper detailing a novel framework for LLM data selection.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-29 00:00

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

MIRA is a source-aware filtering framework for mid-training data selection in LLM development that uses self-anchored rubric discovery to balance scalability and semantic accuracy across heterogeneous data sources.
