MIRA framework improves data selection for LLM mid-training

By the PulseAugur editorial team · [2 sources] · 2026-05-28 17:40

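Researchers have introduced MIRA, a framework for source-aware data selection during the mid-training stage of large language models (LLMs). The method integrates rubric discovery directly into the selection process, addressing the difficulty of curating data drawn from different sources. MIRA identifies the relevant evaluation criteria for each group of data sources, then uses those criteria to train scalable scoring models that can filter large datasets efficiently. Experiments show that MIRA improves performance on code-related benchmarks while substantially reducing the amount of data required.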
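The selection flow described above (per-source rubrics, a scorer, then filtering) can be illustrated with a minimal sketch. This is not the authors' implementation: in MIRA the rubrics are discovered and the scorers are trained models, whereas here hand-written heuristics stand in for both. The source names (`code_repo`, `forum`), the criteria functions, and the `keep_fraction` parameter are all hypothetical and exist only for illustration.

```python
import math
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# A criterion maps a document to a score in [0, 1]; a rubric is a named set of criteria.
Criterion = Callable[[str], float]
Rubric = Dict[str, Criterion]


# Hypothetical hand-written criteria. They stand in for learned scorers.
def has_function(text: str) -> float:
    return 1.0 if "def " in text else 0.0


def has_docstring(text: str) -> float:
    return 1.0 if '"""' in text else 0.0


def has_code_snippet(text: str) -> float:
    return 1.0 if "`" in text else 0.0


def is_substantial(text: str) -> float:
    return 1.0 if len(text.split()) >= 5 else 0.0


# Each source group gets its own rubric (assumed here, not discovered).
RUBRICS: Dict[str, Rubric] = {
    "code_repo": {"has_function": has_function, "has_docstring": has_docstring},
    "forum": {"has_code_snippet": has_code_snippet, "is_substantial": is_substantial},
}


def score(text: str, rubric: Rubric) -> float:
    """Average the rubric's criteria into a single quality score."""
    return sum(criterion(text) for criterion in rubric.values()) / len(rubric)


def select(
    docs: List[Tuple[str, str]],
    rubrics: Dict[str, Rubric],
    keep_fraction: float,
) -> List[Tuple[str, str]]:
    """Keep the top-scoring fraction of documents within each source group."""
    by_source: Dict[str, List[str]] = defaultdict(list)
    for source, text in docs:
        by_source[source].append(text)

    kept: List[Tuple[str, str]] = []
    for source, texts in by_source.items():
        rubric = rubrics[source]
        ranked = sorted(texts, key=lambda t: score(t, rubric), reverse=True)
        n_keep = math.ceil(len(ranked) * keep_fraction)
        kept.extend((source, text) for text in ranked[:n_keep])
    return kept
```

Ranking within each source group, rather than across the whole pool, is what makes the sketch source-aware: a forum post is never compared against a code file using the wrong criteria.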

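Impact: By optimizing data selection during the critical mid-training stage, MIRA's approach could make LLM training more efficient and more effective.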

Ranking rationale: This cluster contains an academic paper detailing a new method for LLM training.



AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Haowen Wang, Yaxin Du, Jian Yang, Jiajun Wu, Shukai Liu, Yuxuan Zhang, Pingjie Wang, Siheng Chen, Tuney Zheng, Ming Zhou, Xianglong Liu · 2026-05-29 04:00

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

arXiv:2605.30288v1 Announce Type: new Abstract: Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a p…
arXiv cs.AI TIER_1 English(EN) · Xianglong Liu · 2026-05-28 17:40

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining s…

