Kwai Keye-VL-2.0 Technical Report
Kwai has released Keye-VL-2.0-30B-A3B, an open-source multimodal foundation model designed for long-video understanding and agentic intelligence. The model uses DeepSeek Sparse Attention to process contexts of up to 256K tokens, capturing key frames and temporal dependencies in hour-long videos. It also incorporates Cross-Modal Multi-Teacher On-Policy Distillation to improve multi-task alignment and agent collaboration across scenarios. Reported evaluations show state-of-the-art performance on video understanding and temporal localization benchmarks.
AI IMPACT: Enables more capable agent collaboration and better long-video comprehension, potentially accelerating development of multimodal AI applications.
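
To give a rough intuition for why sparse attention helps with very long contexts, here is a minimal sketch of generic top-k sparse attention for a single query. It is not the Keye-VL-2.0 or DeepSeek implementation: the function name `topk_sparse_attention`, the plain dot-product scoring, and the toy inputs are all assumptions made for illustration. The idea it shows is that only the k highest-scoring positions take part in the softmax and the weighted sum, so the aggregation cost depends on k rather than on the full sequence length.

```python
import math


def topk_sparse_attention(query, keys, values, k):
    """Attend from one query to only the k highest-scoring keys.

    Illustrative assumption: scores are plain scaled dot products, not the
    learned selection mechanism a production sparse-attention layer uses.
    """
    if not keys or len(keys) != len(values):
        raise ValueError("keys and values must be non-empty and equal length")
    k = max(1, min(k, len(keys)))
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]

    # Keep the indices of the k largest scores; all other positions are skipped.
    selected = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]

    # Numerically stable softmax over the selected scores only.
    peak = max(scores[i] for i in selected)
    weights = {i: math.exp(scores[i] - peak) for i in selected}
    total = sum(weights.values())

    # Weighted sum of the selected value vectors.
    dim = len(values[0])
    output = [0.0] * dim
    for i, w in weights.items():
        for d in range(dim):
            output[d] += (w / total) * values[i][d]
    return output, sorted(selected)


# Toy example: four "frames", of which only the two best matches are used.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0], [5.0, 5.0]]
output, used = topk_sparse_attention(query, keys, values, k=2)
```

In this toy run the returned `used` list contains positions 0 and 2, the two keys most aligned with the query, and the other two positions contribute nothing to `output`. In a real model the selection step would itself need to be cheap, which this sketch does not attempt to address.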