PulseAugur

MIRA framework improves LLM mid-training data selection

Researchers have developed MIRA (Mid-training Rubric Anchoring), a framework for selecting data during the mid-training phase of large language model development. It addresses the challenge of heterogeneous data sources by discovering and applying a separate quality rubric for each source. MIRA first uses a frontier teacher model to identify evaluation criteria, then distills those criteria into student scorers, and finally filters the data with the student scorers. The design aims to balance scalability with semantic accuracy, and MIRA outperforms other selection methods on code-related benchmarks.
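The three stages described above (rubric discovery, student scoring, filtering) can be illustrated with a toy sketch. The summary gives no implementation details, so everything here is an assumption for illustration: the `Doc` type, the `RUBRICS` table, the `student_score` and `select` functions, and the `threshold` value are hypothetical names, and simple string checks stand in for both the teacher-discovered criteria and the distilled student scorers, which in the actual framework are model-based.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Doc:
    source: str  # which data source the document came from
    text: str


# Hypothetical stand-in for rubrics a teacher model might discover.
# Each source gets its own list of pass/fail criteria.
RUBRICS: Dict[str, List[Callable[[str], bool]]] = {
    "code": [
        lambda t: "def " in t or "class " in t,  # defines something reusable
        lambda t: "#" in t or '"""' in t,        # has a comment or docstring
    ],
    "forum": [
        lambda t: len(t.split()) >= 8,                  # not a one-liner
        lambda t: "?" in t or "because" in t.lower(),   # asks or explains
    ],
}


def student_score(doc: Doc) -> float:
    """Cheap scorer: fraction of the source's rubric criteria that pass."""
    criteria = RUBRICS.get(doc.source)
    if not criteria:
        return 0.0  # no rubric for this source, so nothing can pass
    passed = sum(1 for criterion in criteria if criterion(doc.text))
    return passed / len(criteria)


def select(docs: List[Doc], threshold: float = 0.75) -> List[Doc]:
    """Keep documents whose source-specific score meets the threshold."""
    return [d for d in docs if student_score(d) >= threshold]


SAMPLE_DOCS = [
    Doc("code", "def add(a, b):\n    # sum two numbers\n    return a + b"),
    Doc("code", "x = 1"),
    Doc("forum", "Why does this loop fail? It fails because the index is off by one."),
    Doc("forum", "lol same"),
]

kept = select(SAMPLE_DOCS)
```

The point of the sketch is only the shape of the pipeline: the same text is judged against different criteria depending on its source, and the expensive step (deciding what the criteria are) is separated from the cheap step (scoring every document).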

IMPACT Enhances LLM training efficiency by optimizing data selection for improved performance on specific capabilities.

RANK_REASON The cluster describes a new academic paper detailing a novel framework for LLM data selection.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. Hugging Face Daily Papers · TIER_1 · English (EN)

    MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

    MIRA is a source-aware filtering framework for mid-training data selection in LLM development that uses self-anchored rubric discovery to balance scalability and semantic accuracy across heterogeneous data sources.