Q-ARVD: Quantizing Autoregressive Video Diffusion Models

New methods improve the efficiency and quality of video diffusion models

By the PulseAugur editorial team · [13 sources] · 2026-05-20 00:00

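Researchers are developing new methods to improve the efficiency and quality of video diffusion models. Several papers introduce techniques that optimize the attention mechanism, such as sparse attention (LVSA, Veda) and linear attention (ARL2), to cut computational cost and enable longer video generation. Other approaches focus on fine-tuning and preference optimization, such as LocalDPO, which aligns spatiotemporal regions, and Pusa V1.0, which achieves temporal control through vectorized timestep adaptation. In addition, Q-ARVD addresses quantization challenges specific to autoregressive video diffusion models, while Bernini unifies large language models and diffusion models for semantic planning and rendering.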

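Impact: Advances in attention mechanisms and optimization techniques are expected to yield more efficient, higher-quality video generation, which could accelerate adoption in creative and industrial applications.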

Ranking rationale: Multiple research papers introduce novel methods for video diffusion models.

AI-generated summary · Google Gemini · from 13 sources.

Sources [13]

arXiv cs.AI TIER_1 English(EN) · Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Ao Sun, Ziqi Yuan, Hao Zhou, Fandong Meng, Zhiyuan Liu · 2026-06-02 04:00

APB-V: Accelerating Long-Video Understanding via Sequence-Parallelism-Aware Approximate Attention

arXiv:2601.21444v2. Abstract: The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply …

arXiv cs.LG TIER_1 English(EN) · Gael Glorian, Ioannis Lamprou, Zhen Zhang, Yujie Yuan, Hongsheng Liu · 2026-06-01 04:00

LVSA: Training-Free Sparse Attention for Long Video Diffusion

arXiv:2605.31057v1. Abstract: Dense self-attention is the compute and quality bottleneck of long-video diffusion inference: cost grows quadratically with the sequence length, and beyond the training horizon the model converges to near-static output, that is, "…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-29 00:00

LVSA: Training-Free Sparse Attention for Long Video Diffusion

Long Video Sparse Attention (LVSA) addresses computational bottlenecks in video diffusion models by introducing a sparse attention mechanism that reduces compute costs while maintaining video quality beyond training horizons.

arXiv cs.LG TIER_1 English(EN) · Kunyang Li, Mubarak Shah, Yuzhang Shang · 2026-05-22 04:00

Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion

arXiv:2605.16579v2. Abstract: Autoregressive (AR) video diffusion is a powerful paradigm for streaming and interactive video generation. However, its reliance on softmax self-attention leads to quadratic compute complexity in sequence length and memory…
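The contrast between softmax attention and linear attention used as a memory can be illustrated with a minimal sketch. It shows generic linear attention with a fixed-size running state, not the design from this paper, whose abstract is truncated here. The class name `LinearAttentionMemory`, the `elu(x) + 1` feature map, and the toy vectors are all assumptions made for illustration.

```python
import math


def feature_map(x):
    """Positive feature map elu(x) + 1 (a common choice; an assumption here)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class LinearAttentionMemory:
    """Hypothetical fixed-size memory: the state does not grow with sequence length."""

    def __init__(self, dim_k, dim_v):
        self.S = [[0.0] * dim_v for _ in range(dim_k)]  # running sum of phi(k) v^T
        self.z = [0.0] * dim_k                          # running sum of phi(k)

    def write(self, key, value):
        """Fold one (key, value) pair into the state in O(dim_k * dim_v)."""
        fk = feature_map(key)
        for i, a in enumerate(fk):
            self.z[i] += a
            for j, b in enumerate(value):
                self.S[i][j] += a * b

    def read(self, query):
        """Attend over everything written so far (call after at least one write)."""
        fq = feature_map(query)
        denom = sum(a * b for a, b in zip(fq, self.z))
        dim_v = len(self.S[0])
        return [
            sum(fq[i] * self.S[i][j] for i in range(len(fq))) / denom
            for j in range(dim_v)
        ]


def explicit_linear_attention(query, keys, values):
    """Reference version that revisits every stored key, for comparison."""
    fq = feature_map(query)
    weights = [sum(a * b for a, b in zip(fq, feature_map(k))) for k in keys]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(len(values[0]))
    ]


keys = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
memory = LinearAttentionMemory(dim_k=2, dim_v=2)
for k, v in zip(keys, values):
    memory.write(k, v)
print(memory.read([0.1, -0.4]))
print(explicit_linear_attention([0.1, -0.4], keys, values))
```

Both calls produce the same output up to floating-point rounding, but the memory version never stores past keys or values, which is why its cost per new token stays constant as the sequence grows.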
arXiv cs.AI TIER_1 English(EN) · Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, Wangmeng Zuo · 2026-05-22 04:00

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

arXiv:2601.04068v4. Abstract: Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic model…

Hugging Face Daily Papers TIER_1 Italiano(IT) · 2026-05-20 00:00

Q-ARVD: Quantizing Autoregressive Video Diffusion Models

Autoregressive video diffusion models face high inference costs that limit practical deployment, prompting the development of Q-ARVD, a novel quantization framework addressing frame-wise sensitivity imbalance and weight outlier patterns specific to these models.
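Why weight outliers matter for quantization can be seen in a minimal sketch of plain symmetric int8 quantization with a single shared scale. This is a generic textbook scheme, not the Q-ARVD framework itself; the function names and the toy weight values are assumptions chosen for illustration.

```python
def quantize_dequantize(weights, bits=8):
    """Symmetric quantization with one shared scale, then dequantization."""
    qmax = 2 ** (bits - 1) - 1              # 127 for int8
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax                  # the largest weight sets the step size
    restored = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))
        restored.append(q * scale)
    return restored


def mean_abs_error(original, restored):
    """Average absolute difference between two equally long lists."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


small = [0.01, -0.02, 0.03, 0.015]          # toy weights (hypothetical values)
with_outlier = small + [8.0]                # the same weights plus one outlier

err_plain = mean_abs_error(small, quantize_dequantize(small))
restored = quantize_dequantize(with_outlier)
err_outlier = mean_abs_error(small, restored[: len(small)])

print(f"error without outlier: {err_plain:.6f}")
print(f"error with outlier:    {err_outlier:.6f}")
```

With the outlier present, the shared step size becomes so coarse that every small weight rounds to zero. Quantization methods typically respond by isolating or rescaling such outliers, or by using finer-grained scales.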
arXiv cs.CV TIER_1 English(EN) · Hongsheng Liu · 2026-05-29 09:28

LVSA: Training-Free Sparse Attention for Long Video Diffusion

Dense self-attention is the compute and quality bottleneck of long-video diffusion inference: cost grows quadratically with the sequence length, and beyond the training horizon the model converges to near-static output, that is, "frozen" repetitive video. State of the art approac…
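To make the quadratic growth concrete, the sketch below counts query-key score computations for dense attention and for a sliding-window pattern. The window pattern, its size, and the function names are illustrative assumptions; they are not LVSA's actual sparsity pattern, which the truncated abstract does not describe.

```python
def dense_attention_pairs(seq_len: int) -> int:
    """Number of query-key scores computed by dense self-attention."""
    return seq_len * seq_len


def windowed_attention_pairs(seq_len: int, window: int) -> int:
    """Scores computed when each query attends only to keys within
    `window` positions on either side (a hypothetical sparse pattern)."""
    total = 0
    for q in range(seq_len):
        lo = max(0, q - window)
        hi = min(seq_len - 1, q + window)
        total += hi - lo + 1
    return total


for n in (1_000, 2_000, 4_000):
    dense = dense_attention_pairs(n)
    sparse = windowed_attention_pairs(n, window=128)
    print(f"tokens={n:>5}  dense={dense:>10}  windowed={sparse:>9}")
```

Doubling the token count quadruples the dense count, while the windowed count only roughly doubles, which is the kind of saving sparse attention methods aim for.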
arXiv cs.CV TIER_1 Italiano(IT) · Shihao Han, Hao Yang, Xinting Hu, Xiaofeng Mei, Yi Jiang, Xiaojuan Qi · 2026-05-29 04:00

Veda: Scalable Video Diffusion via Distilled Sparse Attention

arXiv:2605.30325v1. Abstract: Scaling Diffusion Transformers to generate high-resolution, long videos is constrained by the quadratic cost of self-attention, and existing sparse attention methods degrade under high sparsity. We show empirically that generation q…

arXiv cs.CV TIER_1 Italiano(IT) · Xiaojuan Qi · 2026-05-28 17:57

Veda: Scalable Video Diffusion via Distilled Sparse Attention

Scaling Diffusion Transformers to generate high-resolution, long videos is constrained by the quadratic cost of self-attention, and existing sparse attention methods degrade under high sparsity. We show empirically that generation quality is determined not by the sparsity ratio i…

arXiv cs.CV TIER_1 English(EN) · Yaofang Liu, Yumeng Ren, Aitor Artola, Yuxuan Hu, Xiaodong Cun, Xiaotong Zhao, Alan Zhao, Raymond H. Chan, Suiyun Zhang, Rui Liu, Dandan Tu, Jean-Michel Morel · 2026-05-27 04:00

Pusa V1.0: Unlocking Temporal Control in Pretrained Video Diffusion Models via Vectorized Timestep Adaptation

arXiv:2507.16116v2. Abstract: The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. Whil…
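The difference between a scalar timestep and a vector of per-frame timesteps can be sketched as follows. This is only an illustration of the general idea, not the Pusa V1.0 method, whose abstract is truncated here. The interpolation `x_t = (1 - t) * x_0 + t * eps`, the function name `noise_frames`, and the toy frames are all assumptions.

```python
import random


def noise_frames(frames, timesteps, rng):
    """Noise each frame with its own level t in [0, 1], using the assumed
    interpolation x_t = (1 - t) * x_0 + t * eps with eps ~ N(0, 1)."""
    if len(frames) != len(timesteps):
        raise ValueError("need exactly one timestep per frame")
    noised = []
    for frame, t in zip(frames, timesteps):
        noised.append([(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in frame])
    return noised


rng = random.Random(0)
frames = [[1.0, 1.0, 1.0] for _ in range(4)]   # four toy "frames" of three values

# Scalar timestep: every frame shares the same noise level.
synchronized = noise_frames(frames, [0.5] * 4, rng)

# Vectorized timestep: each frame evolves at its own noise level.
per_frame = noise_frames(frames, [0.0, 0.25, 0.75, 1.0], rng)

print(per_frame[0])   # t = 0.0, so the first frame is left unchanged
print(per_frame[3])   # t = 1.0, so the last frame is pure noise
```

With a scalar timestep the list passed in is constant, so all frames are locked to one noise level. Letting the entries differ is what allows, for example, one frame to stay clean while the others are still being denoised.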
arXiv cs.CV TIER_1 English(EN) · Bernini Team, Chenchen Liu, Junyi Chen, Lei Li, Lu Chi, Mingzhen Sun, Zhuoying Li, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan · 2026-05-22 04:00

Bernini: Latent Semantic Planning for Video Diffusion

arXiv:2605.22344v1. Abstract: Multimodal large language models (MLLMs) and diffusion models have each reached remarkable maturity: MLLMs excel at reasoning over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models synthesize ima…

arXiv cs.CV TIER_1 English(EN) · Hangyeol Lee, Joo-Young Kim · 2026-05-22 04:00

ORBIS: Output-Guided Token Reduction with Perception-Aware Distribution Matching for Video Diffusion Acceleration

arXiv:2605.22015v1. Abstract: Diffusion Transformer (DiT) has emerged as a powerful model architecture for generating high-quality images and videos. In the case of video DiT, 3D Spatio-Temporal Attention increases token length in proportion to the number of fra…

arXiv cs.CV TIER_1 English(EN) · Zehuan Yuan · 2026-05-21 11:30

Bernini: Latent Semantic Planning for Video Diffusion

Multimodal large language models (MLLMs) and diffusion models have each reached remarkable maturity: MLLMs excel at reasoning over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models synthesize images and videos with photorealistic fidelity. We …
