Researchers have introduced MIRA, a novel framework for source-aware data selection during the mid-training phase of large language models. This method addresses the challenge of curating data from diverse sources by integrating rubric discovery directly into the selection process. MIRA identifies relevant evaluation criteria for each data source group and then uses these to train scalable scoring models, enabling efficient filtering of large datasets. Experiments show MIRA effectively improves performance on code-related benchmarks while significantly reducing the data volume required. AI
IMPACT MIRA's approach could lead to more efficient and effective LLM training by optimizing data selection during a critical mid-training phase.
RANK_REASON The cluster contains a research paper detailing a new method for LLM training.
AI-generated summary · Google Gemini · from 2 sources. How we write summaries →