PulseAugur

New benchmarks push video AI to tie answers to temporal evidence · Tracking 4 sources

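Two new research papers introduce benchmarks and models for video question answering that focus on temporal reasoning and evidence grounding. The EG-VQA benchmark contains more than 11,000 question-answer pairs with temporal evidence annotations, and it shows that current models struggle to localize evidence accurately even when their answers are correct. To address this, the EG-Reasoner model was developed, and it improves performance on reasoning-intensive tasks. Separately, the ViTexQA dataset and the FrameThinker model address video text understanding, where meaning emerges from cues distributed across time; the approach outperforms state-of-the-art baselines, with higher ROUGE-L scores.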

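Impact: These advances aim to make video understanding models more reliable and interpretable by focusing on temporal reasoning and evidence grounding, which is essential for real-world applications.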

Ranking rationale: Two research papers introduce new benchmarks and models for video question answering.


AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

  1. arXiv cs.AI TIER_1 English(EN) · Linpeng Huang, Weixing Chen, Zexin Chen, Yang Liu, Liang Lin ·

    EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence

    arXiv:2606.24797v1 Announce Type: cross Abstract: Recent advances in Video Large Language Models (Video-LLMs) have yielded promising performance on video question answering (VideoQA). Nevertheless, existing benchmarks are predominantly evaluated through answer correctness, while …

  2. arXiv cs.AI TIER_1 English(EN) · Liang Lin ·

    EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence

    Recent advances in Video Large Language Models (Video-LLMs) have yielded promising performance on video question answering (VideoQA). Nevertheless, existing benchmarks are predominantly evaluated through answer correctness, while the grounding of predictions in relevant video evi…

  3. arXiv cs.CV TIER_1 English(EN) · Zhentao Guo, Chen Duan, Tongkun Guan, Zining Wang, Kai Zhou, Pengfei Yan ·

    ViTexQA: A Multi-Frame Temporal Perception Dataset for Video Text Question Answering

    arXiv:2606.24602v1 Announce Type: new Abstract: Despite remarkable progress in multimodal understanding, current MLLMs still exhibit limitations in video text understanding, particularly when semantics emerge through the integration of temporally distributed textual cues across m…

  4. arXiv cs.CV TIER_1 English(EN) · Pengfei Yan ·

    ViTexQA: A Multi-Frame Temporal Perception Dataset for Video Text Question Answering

    Despite remarkable progress in multimodal understanding, current MLLMs still exhibit limitations in video text understanding, particularly when semantics emerge through the integration of temporally distributed textual cues across multiple frames. This perception challenge fundam…