New research explores advanced video generation and manipulation with diffusion models
By PulseAugur Editorial · [13 sources]
Several recent arXiv papers apply diffusion models to video generation and manipulation. One approach integrates State Space Models (SSMs) into video diffusion models to handle longer sequences more efficiently, and is reported to outperform attention-based methods in both memory usage and performance. Other work improves temporal consistency in video relighting with diffusion transformers and self-conditioning, and reconstructs 4D hand motion from video by leveraging pretrained video diffusion models. Further papers target efficient video restoration and robust point tracking by adapting diffusion model features and training strategies.
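To make the memory comparison concrete, here is a minimal sketch of a diagonal linear state space recurrence over a sequence of frame features. It is an illustration under stated assumptions, not the method of any paper summarized here: the function names `ssm_scan` and `attention_score_count`, the per-channel parameters, and the toy inputs are all hypothetical. The point it shows is that the recurrence carries a fixed-size state from frame to frame, whereas full self-attention compares every frame with every other frame.

```python
def ssm_scan(frames, decay, in_gain, out_gain):
    """Run a diagonal linear state space recurrence over a sequence.

    frames:   list of per-frame feature vectors (lists of floats)
    decay:    per-channel state decay (hypothetical parameter)
    in_gain:  per-channel input gain (hypothetical parameter)
    out_gain: per-channel output gain (hypothetical parameter)

    The state has one entry per channel, so its size does not grow
    with the number of frames.
    """
    state = [0.0] * len(decay)
    outputs = []
    for frame in frames:
        # h_t = decay * h_{t-1} + in_gain * x_t, applied per channel
        state = [a * h + b * x for a, h, b, x in zip(decay, state, in_gain, frame)]
        # y_t = out_gain * h_t, applied per channel
        outputs.append([c * h for c, h in zip(out_gain, state)])
    return outputs


def attention_score_count(num_frames):
    """Number of pairwise scores full self-attention forms over num_frames tokens."""
    return num_frames * num_frames


if __name__ == "__main__":
    toy_frames = [[1.0], [1.0], [1.0]]
    print(ssm_scan(toy_frames, decay=[0.5], in_gain=[1.0], out_gain=[1.0]))
    print(attention_score_count(16), attention_score_count(32))
```

Doubling the number of frames quadruples the attention score count, while the recurrent state stays the same size. Real SSM layers use learned, often input-dependent parameters and parallel scans, which this sketch leaves out.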
AI
IMPACT
Advances in video diffusion models promise more efficient and coherent video generation, improved relighting, and better reconstruction of complex motions like hand movements.
RANK_REASON
Multiple research papers detailing novel methods and improvements in video generation, restoration, and tracking using diffusion models and related architectures.
arXiv:2606.31050v1 Announce Type: cross Abstract: How to accurately predict a high-fidelity future world? While the visual world is inherently continuous, existing deterministic video prediction models operate in discrete pixel space and are mainly optimized with pixel-wise mean …
arXiv:2403.07711v5 Announce Type: replace-cross Abstract: Given the remarkable achievements in image generation through diffusion models, the research community has shown increasing interest in extending these models to video generation. Recent diffusion models for video generati…
arXiv:2606.29095v1 Announce Type: cross Abstract: Diffusion-based video relighting enables controllable relighting from a single input video, but modern video diffusion backbones are trained on short clips and applied to long-horizon videos through chunked sliding-window inferenc…
ViDiHand uses pretrained video diffusion model representations with hand-overlay rendering to reconstruct 4D hand motion directly from video frames without detectors or optimization.
arXiv:2606.28677v1 Announce Type: new Abstract: While diffusion models excel in video restoration, their reliance on extensive iterative steps limits efficiency. Conversely, aggressive single-step distillation often compromises fine texture recovery. To achieve an optimal balance…
arXiv:2606.30308v1 Announce Type: new Abstract: 4D hand motion reconstruction from egocentric video is bottlenecked by clear limitations of existing methods: image-based pipelines depend on a detector that fails under heavy occlusion, while video-based methods rely on temporal mo…
arXiv:2508.14483v4 Announce Type: replace Abstract: We present Vivid-VR, a DiT-based generative video restoration method built upon an advanced T2V foundation model, where ControlNet is leveraged to control the generation process, ensuring content consistency. However, convention…
arXiv:2512.20606v2 Announce Type: replace Abstract: Despite achieving strong results on standard benchmarks, current point tracking methods rely on feature backbones that are rarely designed with the temporal coherence needed for robust real-world performance. While recent works …
arXiv:2603.14526v2 Announce Type: replace Abstract: The recent success of inference-time scaling in large language models has inspired similar explorations in video diffusion. In particular, motivated by the existence of "golden noise" that enhances video quality, prior work has …
4D hand motion reconstruction from egocentric video is bottlenecked by clear limitations of existing methods: image-based pipelines depend on a detector that fails under heavy occlusion, while video-based methods rely on temporal modules learned only from scarce hand-pose annotat…
arXiv cs.CV
TIER_1 · English (EN) · Xi Ye, Wenjia Yang, Yangyang Xu, Xiaoyang Liu, Duo Su, Mengfei Xia, Jun Zhu
arXiv:2603.17426v2 Announce Type: replace Abstract: Image-conditioned video diffusion models achieve impressive visual realism but often suffer from weakened motion fidelity, e.g., reduced motion dynamics or degraded long-term temporal coherence, especially after fine-tuning. We …
arXiv:2606.27741v1 Announce Type: new Abstract: Recent advances in video diffusion models have greatly improved visual fidelity, yet their generated motions often violate physical plausibility. We observe a common kinematic failure, "motion entanglement", the unintended coupling of independent motion sources, such as camera mo…
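The relighting abstract above (arXiv:2606.29095v1) mentions applying short-clip backbones to long videos through chunked sliding-window inference. As a rough illustration of what that windowing means, the sketch below computes overlapping frame ranges for a long video. The function name `sliding_windows`, the `window` and `overlap` parameters, and the example sizes are assumptions made for illustration; the paper's actual chunking scheme is not described in the truncated abstract.

```python
def sliding_windows(num_frames, window, overlap):
    """Return (start, end) frame ranges covering a video with overlapping chunks.

    num_frames: total frames in the long video
    window:     frames the model processes at once (hypothetical clip length)
    overlap:    frames shared between consecutive chunks
    Ranges are half-open: frames start .. end - 1.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if num_frames <= 0:
        return []
    stride = window - overlap
    # Stop once the remaining frames are already covered by the previous chunk.
    starts = range(0, max(num_frames - overlap, 1), stride)
    return [(s, min(s + window, num_frames)) for s in starts]


if __name__ == "__main__":
    for start, end in sliding_windows(num_frames=10, window=4, overlap=1):
        print(f"process frames {start}..{end - 1}")
```

Each chunk is processed independently except for the shared overlap frames, which is why consistency across chunk boundaries is a concern for long videos.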