PulseAugur
English(EN) Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

New AI models generate image captions with broader event context · Tracking 4 sources

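Researchers have developed new image-captioning frameworks that go beyond describing visible content to incorporate broader event context. One approach, "Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning," uses a retrieval mechanism that takes article structure and visual layout into account to find relevant external knowledge. Another, CIAN (Contextual Image-Article Narrator), uses a multi-stage pipeline of retrieval, summarization with a fine-tuned Qwen model, and linguistic polishing to generate event-enriched captions. Both aim to produce more comprehensive, context-rich image descriptions; CIAN shows improved retrieval performance and caption quality on the OpenEvents-V1 benchmark.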
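To make the three stages named for CIAN (retrieval, summarization, linguistic polishing) concrete, here is a minimal Python sketch. It is not the authors' implementation: every name (`Article`, `retrieve`, `summarize`, `polish`, `caption_pipeline`) is hypothetical, retrieval is replaced by simple token overlap, and the fine-tuned Qwen summarizer is replaced by a stub that keeps the leading sentences.

```python
import re
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    body: str


def tokenize(text: str) -> set[str]:
    # Lowercased alphanumeric tokens; a crude stand-in for a learned encoder.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, articles: list[Article], k: int = 1) -> list[Article]:
    # Stage 1 (assumed stand-in): rank articles by Jaccard overlap with the query.
    q = tokenize(query)

    def score(article: Article) -> float:
        t = tokenize(article.title + " " + article.body)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(articles, key=score, reverse=True)[:k]


def summarize(article: Article, max_sentences: int = 2) -> str:
    # Stage 2 (stub in place of the fine-tuned Qwen model): keep leading sentences.
    sentences = re.split(r"(?<=[.!?])\s+", article.body.strip())
    return " ".join(sentences[:max_sentences])


def polish(visual_caption: str, context: str) -> str:
    # Stage 3 (stub): join caption and context, then tidy spacing and punctuation.
    text = f"{visual_caption.rstrip('. ')}. {context}"
    text = re.sub(r"\s+", " ", text).strip()
    return text if text.endswith((".", "!", "?")) else text + "."


def caption_pipeline(visual_caption: str, articles: list[Article]) -> str:
    # Chain the three stages; fall back to the bare caption if nothing is retrieved.
    top = retrieve(visual_caption, articles, k=1)
    context = summarize(top[0]) if top else ""
    return polish(visual_caption, context)
```

Calling `caption_pipeline` with a plain visual caption and a small list of `Article` objects returns the caption followed by context drawn from the best-matching article. In a real system each stub would be replaced by a learned component.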

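Impact: Integrating external knowledge and event context strengthens image captioning, yielding more informative and more human-like descriptions.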

Ranking rationale: Two independent research papers detailing new approaches to image captioning.

Read on Hugging Face Daily Papers →

AI-generated summary · Google Gemini · from 4 sources. How we write summaries →

Sources [4]

  1. Hugging Face Daily Papers TIER_1 English(EN) ·

    Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

    Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions w…

  2. arXiv cs.CV TIER_1 English(EN) · Trinh Thi Thu Hien, Trung-Nghia Le ·

    CIAN: Multi-Stage Framework for Event-Enriched Image Captioning via Retrieval-Augmented Generation

    arXiv:2606.17430v1. Event-enriched image captioning describes not only visible content but also the broader context of events, including timing, location, and participants, capabilities missing in most pixel-bound models. We propose the Contextual Imag…

  3. arXiv cs.CV TIER_1 English(EN) · Trung-Nghia Le ·

    Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

    Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions w…

  4. arXiv cs.CV TIER_1 English(EN) · Trung-Nghia Le ·

    CIAN: Multi-Stage Framework for Event-Enriched Image Captioning via Retrieval-Augmented Generation

    Event-enriched image captioning describes not only visible content but also the broader context of events, including timing, location, and participants, capabilities missing in most pixel-bound models. We propose the Contextual Image-Article Narrator (CIAN), a multi-stage framewo…