Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

New AI models generate image captions with broader event context · tracking 4 sources

By the PulseAugur editorial team · [4 sources] · 2026-06-16 02:24

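Researchers have developed new image captioning frameworks that go beyond describing visible content and incorporate broader event context. One approach, "Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning," uses a retrieval mechanism that takes article structure and visual layout into account to find relevant external knowledge. The other, CIAN (Contextual Image-Article Narrator), uses a multi-stage pipeline of retrieval, summarization with a fine-tuned Qwen model, and linguistic polishing to generate event-enriched captions. Both approaches aim to produce more comprehensive, context-rich image descriptions; CIAN shows improved retrieval performance and caption quality on the OpenEvents-V1 benchmark.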
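The retrieve, summarize, polish flow described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the CIAN implementation: every function name here is hypothetical, retrieval is plain bag-of-words cosine similarity rather than the papers' multi-modal retrieval, and the summarization step is a placeholder where a real system would call a language model such as a fine-tuned Qwen model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, articles, k=1):
    """Stage 1 (assumed): rank articles by similarity to the query, keep the top k."""
    q = Counter(tokenize(query))
    ranked = sorted(
        articles,
        key=lambda art: cosine(q, Counter(tokenize(art))),
        reverse=True,
    )
    return ranked[:k]


def summarize(article, max_sentences=1):
    """Stage 2 placeholder: keep the leading sentences.

    A real system would call a language model here instead.
    """
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return " ".join(sentences[:max_sentences])


def polish(visual_caption, context):
    """Stage 3 placeholder: join the visual description and event context."""
    caption = f"{visual_caption.rstrip('.')}. Context: {context.strip()}"
    return caption if caption.endswith(".") else caption + "."


def caption_with_context(visual_caption, articles):
    """Run the three stages; fall back to the visual caption if nothing is retrieved."""
    top = retrieve(visual_caption, articles, k=1)
    context = summarize(top[0]) if top else ""
    return polish(visual_caption, context) if context else visual_caption
```

Calling `caption_with_context` with a short visual description and a list of article texts returns the description followed by the first sentence of the best-matching article. Each stage sits behind its own function so that a stronger retriever or a model-backed summarizer could replace the placeholder without changing the rest.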

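Impact: Integrating external knowledge and event context strengthens image captioning, producing more informative and more human-like descriptions.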

Ranking rationale: Two independent research papers detailing new approaches to image captioning.


AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-17 00:08

Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions w…
arXiv cs.CV TIER_1 English(EN) · Trinh Thi Thu Hien, Trung-Nghia Le · 2026-06-17 04:00

CIAN: Multi-Stage Framework for Event-Enriched Image Captioning via Retrieval-Augmented Generation

arXiv:2606.17430v1. Event-enriched image captioning describes not only visible content but also the broader context of events, including timing, location, and participants, capabilities missing in most pixel-bound models. We propose the Contextual Imag…
arXiv cs.CV TIER_1 English(EN) · Trung-Nghia Le · 2026-06-17 00:08

Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions w…
arXiv cs.CV TIER_1 English(EN) · Trung-Nghia Le · 2026-06-16 02:24

CIAN: Multi-Stage Framework for Event-Enriched Image Captioning via Retrieval-Augmented Generation

Event-enriched image captioning describes not only visible content but also the broader context of events, including timing, location, and participants, capabilities missing in most pixel-bound models. We propose the Contextual Image-Article Narrator (CIAN), a multi-stage framewo…

