RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling
New research improves control and efficiency in video generation
By PulseAugur Editorial Team · [22 sources]
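Researchers are developing new methods to improve video generation models, with a focus on control, efficiency, and quality. One method, called LA-LQR, uses optimal control to steer video generation models, reducing undesired content while preserving visual fidelity. Another line of research compresses large video diffusion models such as Wan2.2 through distillation and low-bit quantization to make them easier to deploy. In addition, new frameworks are emerging that give video generation explicit 3D control and awareness, going beyond 2D projections to better capture complex scene dynamics and human motion.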
arXiv:2606.04775v1 Announce Type: cross Abstract: Text-to-video (T2V) models trained on large-scale web data can generate undesired content, motivating interventions that reduce harmful outputs without sacrificing visual quality. Activation steering offers an attractive mechanistic alternative to finetuning and prompt filtering,…
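Activation steering intervenes on a model's intermediate activations at inference time rather than updating its weights. As a minimal, generic illustration (not the method of the cited paper, and not the optimal-control LA-LQR formulation), the sketch below removes a fraction `alpha` of a hidden-state vector's component along a concept direction `v`. The function name, the direction, and the toy vectors are hypothetical.

```python
import math


def steer_remove_direction(h, v, alpha=1.0):
    """Subtract alpha times the component of h along the unit direction of v.

    h, v: equal-length lists of floats (a hidden-state vector and a
    hypothetical "undesired concept" direction). alpha=1.0 removes the
    component entirely; 0 < alpha < 1 only attenuates it.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("steering direction must be non-zero")
    unit = [x / norm for x in v]
    coeff = sum(a * b for a, b in zip(h, unit))  # projection of h onto the unit direction
    return [a - alpha * coeff * u for a, u in zip(h, unit)]


h = [3.0, 4.0, 0.0]
v = [0.0, 2.0, 0.0]
print(steer_remove_direction(h, v))  # [3.0, 0.0, 0.0]
```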
arXiv:2606.00299v1 Announce Type: cross Abstract: While Video Diffusion Models (VDMs) excel at synthesizing high-fidelity videos, enabling precise camera and scene control remains challenging. Existing methods predominantly rely on implicit diffusion priors to generate unobserved…
arXiv cs.AI
Authors: Jinyang Du, Shenghao Jin, Ziqian Xu, Ruihao Gong, Shiqiao Gu, Yang Yong, Jinyang Guo, Xianglong Liu
arXiv:2606.00658v1 Announce Type: cross Abstract: Large video diffusion models achieve strong visual quality but remain expensive to deploy because each sample requires many denoising steps and a large resident parameter footprint. This paper studies a deployment-oriented compres…
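To make the "resident parameter footprint" concrete: weight memory scales linearly with bit width, so halving the bits halves the storage. The arithmetic below estimates weight-only memory at several precisions. The 14-billion-parameter size is borrowed from the Wan2.1-T2V-14B model listed in this roundup purely as an example, and activations, caches, and runtime overheads are ignored.

```python
def weight_gib(num_params, bits_per_weight):
    """Weight-only memory in GiB for a model stored at the given bit width."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 14e9  # example size, e.g. a 14B-parameter video diffusion transformer

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_gib(NUM_PARAMS, bits):.1f} GiB")
# 16-bit weights: 26.1 GiB
#  8-bit weights: 13.0 GiB
#  4-bit weights: 6.5 GiB
```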
arXiv cs.AI
Authors: Jingyun Liang, Min Wei, Shikai Li, Yizeng Han, Hangjie Yuan, Lei Sun, Weihua Chen, Fan Wang
arXiv:2606.02000v1 Announce Type: cross Abstract: Diffusion models have shown remarkable success in video generation. However, whether such models are truly aware of the 3D structure underlying visual observations, rather than simply reproducing plausible 2D projections, remains …
The AAD-1 framework improves one-step autoregressive image-to-video generation by breaking generator-discriminator symmetry and using phased training to prevent motion collapse and training instability.
arXiv cs.AI
Authors: Ruotong Liao, Guowen Huang, Qing Cheng, Guangyao Zhai, Lei Zhang, Xun Xiao, Thomas Seidl, Daniel Cremers, Volker Tresp
arXiv:2605.31590v1 Announce Type: cross Abstract: Text-to-video (T2V) generation faces challenging questions when generating videos with long horizons containing multiple events. Inspired by the intrinsics of the diffusion process, we probe video diffusion transformers (DiTs) and uncover intrinsic turning points in the DiT denoi…
VideoMLA reduces memory usage in video diffusion models by replacing per-head keys and values with shared low-rank content and decoupled 3D-RoPE positional keys, maintaining quality while achieving significant compression and improved throughput.
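A rough per-token estimate shows where this kind of saving comes from. Standard multi-head attention stores a key and a value vector for every head, while a shared low-rank layout stores one content latent plus a small positional key. The dimensions below (32 heads of size 128, a 512-dimensional latent, a 64-dimensional RoPE key assumed shared across heads) are hypothetical, not VideoMLA's configuration, and the count ignores the projection weights needed to expand the latent back into per-head keys and values.

```python
def kv_floats_per_token_mha(num_heads, head_dim):
    """Stored floats per token per layer for standard multi-head attention (K and V per head)."""
    return 2 * num_heads * head_dim


def kv_floats_per_token_latent(latent_dim, rope_dim):
    """Stored floats per token per layer when K/V come from one shared low-rank latent
    plus a decoupled positional key (assumed shared across heads)."""
    return latent_dim + rope_dim


mha = kv_floats_per_token_mha(num_heads=32, head_dim=128)         # 8192
latent = kv_floats_per_token_latent(latent_dim=512, rope_dim=64)  # 576
print(mha, latent, f"{mha / latent:.1f}x smaller")  # 8192 576 14.2x smaller
```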
One-Forcing improves one-step video generation quality and efficiency by combining the DMD objective with a GAN loss, achieving state-of-the-art results with reduced training costs.
arXiv:2606.06309v1 Announce Type: new Abstract: Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational costs due to the quadratic complexity of 3D attention. Existing acceleration methods primarily red…
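To see why the quadratic cost dominates: a video latent with T frames and an H×W spatial grid gives N = T·H·W tokens, and full 3D attention scores all N² query-key pairs. The sketch counts those pairs for a hypothetical latent grid; doubling the frame count doubles N and therefore quadruples the pair count.

```python
def attention_pairs(frames, height, width):
    """Number of query-key pairs in full 3D attention over a frames x height x width token grid."""
    tokens = frames * height * width
    return tokens * tokens


base = attention_pairs(frames=16, height=30, width=52)    # hypothetical latent grid
longer = attention_pairs(frames=32, height=30, width=52)  # same grid, twice the frames
print(longer // base)  # 4: twice the frames, four times the pairs
```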
arXiv:2605.15980v2 Announce Type: replace Abstract: Group Relative Policy Optimization has emerged as essential for aligning video diffusion models with human preferences, but faces a critical computational bottleneck: training a 14B parametered model typically demands hundreds o…
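Group Relative Policy Optimization avoids a learned value critic by scoring several samples generated from the same prompt and normalizing each reward against its group's mean and standard deviation. A minimal sketch of that advantage computation follows. The reward values and the `eps` stabilizer are illustrative, it uses the population standard deviation (implementations differ on this), and it omits the clipped policy-ratio objective and any diffusion-specific details of the cited work.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and (population) std.

    rewards: rewards for several samples generated from the same prompt.
    Returns one advantage per sample; positive means better than the group average.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


adv = group_relative_advantages([1.0, 2.0, 3.0, 4.0])
print([round(a, 3) for a in adv])  # [-1.342, -0.447, 0.447, 1.342]
```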
arXiv:2606.04432v1 Announce Type: new Abstract: Video diffusion transformers have achieved state-of-the-art visual quality, but their high inference cost remains a major bottleneck for real-time applications. Recent distillation frameworks produce autoregressive video diffusion models with reduced latency, yet these models sti…
arXiv:2606.03972v1 Announce Type: new Abstract: We present AAD-1, an Asymmetric Adversarial Distillation framework for One-step autoregressive image-to-video generation. State-of-the-art methods adopt adversarial distillation but suffer from motion collapse and training instability, resulting in static videos. AAD-1 addresses …
arXiv:2606.03971v1 Announce Type: new Abstract: Causal video generators must predict from the past, but they need not learn only from it. In streaming autoregressive video diffusion, each emitted segment becomes a commitment that future segments must preserve. Standard training, however, only asks each causal state to explain …
arXiv cs.CV
Authors: Hovhannes Margaryan, Quentin Bammey, Christian Sandor
arXiv:2604.17625v2 Announce Type: replace Abstract: This paper introduces a novel methodology for generating fast and memory-efficient video continuations. Our method, dubbed FlowC2S, fine-tunes a pre-trained text-to-video flow model to learn a vector field between the current an…
arXiv:2606.00957v1 Announce Type: new Abstract: We present a post-training quantization (PTQ) approach for Wan2.1-T2V-14B, a 14-billion-parameter text-to-video diffusion transformer, targeting the W8A8 HiFloat8 (HiF8) format on Ascend 910B NPUs. A central challenge in quantizing …
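W8A8 means both weights and activations are represented in 8 bits. HiF8 is an 8-bit floating-point format whose encoding is not reproduced here; as a plainly swapped-in stand-in, the sketch below shows generic symmetric per-tensor INT8 quantization (scale = max|x| / 127), which illustrates the quantize, dequantize, and error-check loop a post-training quantization pipeline runs. The function names and sample weights are hypothetical.

```python
def quantize_int8_symmetric(values):
    """Symmetric per-tensor INT8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric range [-127, 127].
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8_symmetric(weights)
recon = dequantize(codes, scale)
max_err = max(abs(a - b) for a, b in zip(weights, recon))
print(codes, max_err <= scale / 2)  # rounding error stays within half a quantization step
```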