Video-Rate Streaming Stylization on a Vision-Aware MLLM-Conditioned Edit Diffusion: Asymmetric Batched Inference on a Distilled UNet + MLLM Text Encoder
Researchers have developed a streaming pipeline for video stylization that reaches high frame rates by speeding up inference in both the distilled diffusion U-Net and the MLLM text encoder. The system uses asymmetric pipelining and batched inference to overcome per-frame bottlenecks, enabling real-time video editing on consumer hardware. It sustains over 27 frames per second on an RTX 3090 Ti, and significantly higher rates on more powerful GPUs.
IMPACT Achieves video-rate throughput for stylization, potentially enabling real-time AI-powered video editing tools.
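
The summary does not spell out how the asymmetric batching is scheduled, so the following is only a minimal sketch of one plausible reading: the conditioning encoder and the U-Net each process frames in micro-batches, but with different batch sizes. The names `stylize_stream`, `run_stage`, `encode_fn`, `denoise_fn`, and the default batch sizes are hypothetical and do not come from the paper; the two stage functions stand in for real model calls.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable, size: int) -> Iterator[List]:
    """Yield consecutive lists of up to `size` items, preserving order."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def run_stage(items: Iterable, batch_fn: Callable[[List], List], batch_size: int) -> Iterator:
    """Apply `batch_fn` to micro-batches and flatten the results back into a stream."""
    for chunk in batched(items, batch_size):
        out = batch_fn(chunk)
        if len(out) != len(chunk):
            raise ValueError("a stage must return one output per input")
        yield from out


def stylize_stream(
    frames: Iterable,
    encode_fn: Callable[[List], List],   # hypothetical stand-in for the MLLM encoder
    denoise_fn: Callable[[List], List],  # hypothetical stand-in for the distilled U-Net
    encoder_batch: int = 8,              # assumed value, not from the source
    unet_batch: int = 2,                 # assumed value, not from the source
) -> Iterator:
    """Chain the two stages lazily, each with its own batch size."""
    conditioned = run_stage(frames, encode_fn, encoder_batch)
    return run_stage(conditioned, denoise_fn, unet_batch)


def frame_budget_ms(fps: float) -> float:
    """Average time available per frame at a given sustained frame rate."""
    return 1000.0 / fps
```

At the reported 27 frames per second, `frame_budget_ms(27)` gives an average budget of roughly 37 ms per frame. Larger batches amortize per-call overhead, but a frame has to wait for its batch to fill, so batch size trades throughput against latency.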