When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification

LLM-generated synthetic data can improve patent classification, but volume is the key factor

By the PulseAugur editorial team · [3 sources] · 2026-05-22 23:49

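A new research paper examines how effective synthetic data generated by large language models is for low-resource multi-label patent classification. The study finds that synthetic data can significantly improve performance metrics such as micro-F1, but that most of the gain is attributable to the increase in data volume rather than to true synthetic value. It also stresses that the correlation between data-fidelity metrics and classification performance changes with the amount of real data used, and that the usefulness of synthetic data is task- and metric-specific: in some cases it even harms retrieval tasks.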
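To make the two ideas in the summary concrete, here is a minimal Python sketch. It is an illustration only, not the paper's evaluation code: `micro_f1` follows the standard definition of micro-averaged F1 for multi-label predictions, and `volume_matched_control` is a hypothetical helper showing one possible way to build a baseline with the same number of training examples as an augmented set, so that a gain from synthetic data can be compared against a gain from volume alone. The toy labels and the resampling strategy are assumptions.

```python
import random
from typing import List, Sequence, Set, TypeVar

T = TypeVar("T")


def micro_f1(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all documents and labels."""
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        tp += len(true_labels & pred_labels)   # labels correctly predicted
        fp += len(pred_labels - true_labels)   # labels predicted but wrong
        fn += len(true_labels - pred_labels)   # labels missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def volume_matched_control(real: Sequence[T], target_size: int, seed: int = 0) -> List[T]:
    """Hypothetical control set: pad the real examples to `target_size`
    by resampling with replacement (an assumption, not the paper's method)."""
    rng = random.Random(seed)
    n_extra = max(0, target_size - len(real))
    return list(real) + [rng.choice(real) for _ in range(n_extra)]


# Toy, made-up labels for three documents.
y_true = [{"A", "B"}, {"C"}, {"A"}]
y_pred = [{"A"}, {"C", "B"}, {"A"}]
score = micro_f1(y_true, y_pred)  # TP=3, FP=1, FN=1 -> 6 / 8 = 0.75

real = ["doc1", "doc2"]
control = volume_matched_control(real, target_size=5)
```

Training one model on the real-plus-synthetic set and another on a control of equal size, then comparing their micro-F1, is one way to check whether an improvement survives once volume is held fixed.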

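Impact: The effectiveness of synthetic data is task- and metric-specific, so getting the best results in AI applications requires careful evaluation rather than simply adding more data.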

Ranking rationale: This cluster contains a research paper that details experimental results on synthetic data generation for AI tasks.

Read on arXiv cs.IR (Information Retrieval) →

AI-generated summary · Google Gemini · based on 3 sources.

Sources [3]

arXiv cs.AI TIER_1 English(EN) · Amirhossein Yousefiramandi, Ciaran Cooney · 2026-05-26 04:00

When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification

arXiv:2605.24296v1 Announce Type: new Abstract: We study when LLM-generated synthetic data helps low-resource multi-label patent classification, separating true synthetic value from the confound that larger augmented sets can win by volume alone. Across six open-source LLMs (3.8-…
arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Ciaran Cooney · 2026-05-22 23:49

When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification

We study when LLM-generated synthetic data helps low-resource multi-label patent classification, separating true synthetic value from the confound that larger augmented sets can win by volume alone. Across six open-source LLMs (3.8-12B), four real-data regimes, 64 WIPO assistive-…
arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Ciaran Cooney · 2026-05-22 23:49

When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification

The issues that must be considered regarding the utilization of synthetic data generated through LLMs for multilabel patent classification include (i) when the use of such data may help and (ii) why. Indeed, the former part appropriately adjusts for the possibility of improving r…
