HERMES Introduces Hierarchical Labeling for AI Pre-training Data

By the PulseAugur editorial team · [2 sources] · 2026-07-02 14:51

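Researchers have developed HERMES, a new labeling substrate intended to improve how pre-training data mixtures for AI models are designed. Unlike existing approaches that commit to a fixed semantic axis or a fixed granularity, HERMES provides a hierarchical label system derived from the data itself. This allows granularity to be controlled flexibly, which supports finer-grained mixture design and may expose interactions between data quality and coverage that fixed-granularity pipelines cannot test.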
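To make the idea of adjustable granularity concrete, here is a minimal sketch of how a hierarchical label can yield different groupings of the same corpus. It is not the HERMES implementation: the documents, the label names, and the helper functions `groups_at_depth` and `uniform_mixture` are all hypothetical, and the uniform weighting is only a placeholder for whatever mixing rule a practitioner would actually use.

```python
# Hypothetical documents, each tagged with a root-to-leaf label path.
# The label names are invented for illustration only.
DOCS = {
    "doc1": ("science", "biology", "genetics"),
    "doc2": ("science", "biology", "ecology"),
    "doc3": ("science", "physics", "optics"),
    "doc4": ("code", "python", "numerics"),
}


def groups_at_depth(docs, depth):
    """Partition documents by truncating each label path to `depth` levels."""
    groups = {}
    for doc_id, path in docs.items():
        groups.setdefault(path[:depth], []).append(doc_id)
    return groups


def uniform_mixture(groups):
    """Give every group the same sampling weight (a placeholder mixing rule)."""
    weight = 1.0 / len(groups)
    return {label: weight for label in groups}


# The same corpus viewed at two granularities.
coarse = groups_at_depth(DOCS, 1)  # grouped by top-level label
fine = groups_at_depth(DOCS, 2)    # grouped by the first two levels

print(coarse)
print(fine)
print(uniform_mixture(coarse))
```

With a single flat labeling, only one of these partitions would be available to a mixer; with a hierarchy, the depth becomes a parameter that can be varied per experiment.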

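Impact: By enabling finer control over data mixtures and revealing new insights into how data quality interacts with coverage, this research may lead to more effective pre-training of AI models.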

Ranking rationale: This item is an academic paper describing a new method for labeling AI training data.


AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.AI TIER_1 English(EN) · Ziyun Qiao, Yue Min, Ruining Chen, Yujun Li · 2026-07-03 04:00

HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

arXiv:2607.02266v1 Announce Type: cross Abstract: Most data-mixing methods assume the corpus has already been partitioned into groups, and the choice of those groups determines what a mixer can express. Existing labels, including provenance, topic or format taxonomies, and flat e…

arXiv cs.AI TIER_1 English(EN) · Yujun Li · 2026-07-02 14:51

HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

Most data-mixing methods assume the corpus has already been partitioned into groups, and the choice of those groups determines what a mixer can express. Existing labels, including provenance, topic or format taxonomies, and flat embedding clusters, commit to one semantic axis at …

