PulseAugur
English(EN) LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

New Frameworks and a Benchmark Advance Audio-Visual Generation

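Researchers have introduced OmniCustom, a framework that customizes both video identity and audio timbre from a reference image and reference audio. The DiT-based model uses separate LoRA modules for identity and timbre control, strengthened by a contrastive learning objective. Separately, the NAVA framework provides native audio-visual alignment for joint generation, using a 6.3B-parameter model to improve synchronization and timbre controllability. In addition, LongAV-Compass has been developed as a benchmark for minute-scale audio-visual generation across several conditioning modalities, assessing coherence and alignment over long durations.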

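Impact: The new models and benchmark improve control over, and evaluation of, audio-visual generation, pushing the boundaries of synchronized media synthesis.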

Ranking rationale: Multiple research papers introduce new models and benchmarks for audio-visual generation.

Read on Hugging Face Daily Papers →

AI-generated summary · Google Gemini · from 14 sources. How we write summaries →

Sources [14]

  1. arXiv cs.AI TIER_1 English(EN) · Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao ·

    Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation

    arXiv:2606.01031v1 Announce Type: cross Abstract: Audio-driven talking-head generation has advanced rapidly, yet existing evaluation protocols mainly rely on frame-wise metrics that assume strict temporal correspondence between generated and reference videos. This assumption does…

  2. arXiv cs.AI TIER_1 Italiano(IT) · Maomao Li, Zhen Li, Kaipeng Zhang, Guosheng Yin, Zhifeng Li, Dong Xu ·

    OmniCustom: Synchronized Audio-Video Customization via a Joint Audio-Video Generation Model

    arXiv:2602.12304v4 Announce Type: replace-cross Abstract: Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, thi…

  3. Hugging Face Daily Papers TIER_1 English(EN) ·

    Native Audio-Visual Alignment for Generation

    Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual c…

  4. Hugging Face Daily Papers TIER_1 English(EN) ·

    Native Audio-Visual Alignment for Generation

    NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising.

  5. Hugging Face Daily Papers TIER_1 English(EN) ·

    LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

    Audio-visual generation is rapidly advancing from short clips to minute-long content, while existing evaluation protocols remain largely confined to short-form settings. Existing benchmarks primarily focus on 5--10 second text-conditioned generation and rarely support unified eva…

  6. Hugging Face Daily Papers TIER_1 English(EN) ·

    LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

    LongAV-Compass is a comprehensive benchmark for evaluating minute-long audio-visual generation across multiple modalities, assessing quality, consistency, and alignment over extended temporal sequences.

  7. arXiv cs.CV TIER_1 English(EN) · Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang, Jialu Chen, Miao Deng, Jiahao Wang, Yubin Guo, Chenxi Liao, Yize Zhang, Zhaoxiang Zhang, Jiaheng Liu ·

    T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

    arXiv:2512.21094v2 Announce Type: replace Abstract: Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly s…

  8. arXiv cs.CV TIER_1 English(EN) · Jaemin Jung, Kyeongha Rho, Inkyu Shin, Joon Son Chung ·

    Inference-Time Scaling for Joint Audio-Video Generation

    arXiv:2606.03183v1 Announce Type: cross Abstract: Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substant…

  9. arXiv cs.CV TIER_1 English(EN) · Yujie Wei, Yujin Han, Zhekai Chen, Yongming Li, Kaixun Jiang, Zhihang Liu, Quanhao Li, Zhiwu Qing, Xiang Wang, Zhen Xing, Ruihang Chu, Lingyi Hong, Yefei He, Junjie Zhou, Junqiu Yu, Yang Shi, Difan Zou, Kai Zhu, Shiwei Zhang, Yingya Zhang, Yu Liu, Xihui … ·

    MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

    arXiv:2605.20183v2 Announce Type: replace Abstract: Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing b…

  10. arXiv cs.CV TIER_1 English(EN) · Tingle Li, Siddharth Gururani, Kevin J. Shih, Gantavya Bhatt, Sang-gil Lee, Zhifeng Kong, Arushi Goel, Gopala Anumanchipalli, Ming-Yu Liu ·

    Benchmarking Single-Factor Physical Video-to-Audio Generation

    arXiv:2605.30339v1 Announce Type: new Abstract: Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical corr…

  11. arXiv cs.CV TIER_1 English(EN) · Longbin Ji, Guan Wang, Xuan Wei, Chenye Yang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Jingzhou He ·

    Native Audio-Visual Alignment for Generation

    arXiv:2605.30073v1 Announce Type: new Abstract: Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fu…

  12. arXiv cs.CV TIER_1 English(EN) · Ming-Yu Liu ·

    Benchmarking Single-Factor Physical Video-to-Audio Generation

    Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions. In this …

  13. arXiv cs.CV TIER_1 English(EN) · Jingzhou He ·

    Native Audio-Visual Alignment for Generation

    Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual c…

  14. arXiv cs.CV TIER_1 English(EN) · Tengfei Liu, Yang Shi, Xuanyu Zhu, Jiafu Tang, Liu Yang, Qixun Wang, Zhuoran Zhang, Yuqi Tang, Fengxiang Wang, Yuhao Dong, Xinlong Chen, Bozhou Li, Bohan Zeng, Yue Ding, Xiaohan Zhang, Jialu Chen, Haotian Wang, Yuanxing Zhang, Pengfei Wan, Leye Wang ·

    LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

    arXiv:2605.26244v1 Announce Type: new Abstract: Audio-visual generation is rapidly advancing from short clips to minute-long content, while existing evaluation protocols remain largely confined to short-form settings. Existing benchmarks primarily focus on 5--10 second text-condi…