HERMES introduces hierarchical labeling for AI pre-training data

By PulseAugur Editorial · [1 source] · 2026-07-03 04:00

Researchers have developed HERMES, a labeling substrate designed to improve how pre-training data mixtures for AI models are built. Existing methods rely on labels with a fixed semantic axis or a fixed granularity; HERMES instead provides a hierarchical label system derived from the data itself. Because the hierarchy can be read at different levels, mixture designers can choose how coarse or fine their groups are, and can test for interactions between data quality and coverage that fixed-granularity pipelines cannot express.
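To make "flexible control over granularity" concrete, here is a minimal Python sketch of how one hierarchical label set can yield both coarse and fine mixture groups. It is an illustration only, not the HERMES method: the label paths, the `DOC_LABELS` data, and the helpers `groups_at_depth` and `proportional_weights` are all hypothetical, and the labels are hand-written here, whereas the paper derives its hierarchy from the data.

```python
from collections import Counter

# Hypothetical hierarchical labels: each document carries a root-to-leaf path.
# These names are illustrative and do not come from the HERMES paper.
DOC_LABELS = [
    ("science", "biology", "genetics"),
    ("science", "biology", "ecology"),
    ("science", "physics", "optics"),
    ("code", "python", "numpy"),
    ("code", "rust", "async"),
    ("code", "python", "pandas"),
]


def groups_at_depth(labels, depth):
    """Count documents per group after truncating each path to `depth` levels."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return Counter(path[:depth] for path in labels)


def proportional_weights(counts):
    """Turn group counts into mixture weights that sum to 1."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


# The same labels give two different partitions of the corpus.
coarse = groups_at_depth(DOC_LABELS, 1)  # 2 groups: science, code
fine = groups_at_depth(DOC_LABELS, 2)    # 4 groups: biology, physics, python, rust

print(proportional_weights(coarse))
print(proportional_weights(fine))
```

With flat labels, a mixer would be limited to whichever of these partitions the labeler happened to choose; with a hierarchy, the cut depth becomes a parameter the mixer can vary.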

IMPACT This research could lead to more effective AI model pre-training by enabling finer control over data mixtures and uncovering new insights into data quality interactions.

RANK_REASON The item is an academic paper detailing a new method for AI data labeling.

paper
infra

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Ziyun Qiao, Yue Min, Ruining Chen, Yujun Li · 2026-07-03 04:00

HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

arXiv:2607.02266v1 · Announce Type: cross · Abstract: Most data-mixing methods assume the corpus has already been partitioned into groups, and the choice of those groups determines what a mixer can express. Existing labels, including provenance, topic or format taxonomies, and flat e…
