Brief · PulseAugur

RESEARCH · Hugging Face Daily Papers · English (EN) · 1d · [3 sources]

Kwai Keye-VL-2.0 Technical Report

Kwai has released Keye-VL-2.0-30B-A3B, an open-source multimodal foundation model designed for long-video understanding and agentic intelligence. The model uses DeepSeek Sparse Attention to handle contexts of up to 256K tokens, which lets it pick out key frames and track temporal dependencies in hour-long videos. It also applies Cross-Modal Multi-Teacher On-Policy Distillation to improve multi-task alignment and agent collaboration across scenarios. The reported evaluations show state-of-the-art performance on video understanding and temporal localization benchmarks.
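For readers unfamiliar with sparse attention, the general idea can be sketched in a few lines: instead of attending to every position in a long context, a query attends only to a small subset of high-scoring positions. The snippet below is a generic top-k illustration in plain Python, not the DeepSeek Sparse Attention or Keye-VL-2.0 implementation; the function name `topk_sparse_attention` and the toy vectors are hypothetical. It also reuses the full dot-product scores to pick positions, so it shows the selection pattern only, not the compute saving a production design would get from a cheaper selection step.

```python
import math


def topk_sparse_attention(query, keys, values, k):
    """Attend from one query vector to only the k highest-scoring keys.

    Hypothetical helper for illustration; not taken from the report.
    """
    scale = 1.0 / math.sqrt(len(query))
    # Scaled dot-product score between the query and every key.
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    # Keep the indices of the k largest scores; other positions are ignored.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Numerically stable softmax over the kept positions only.
    top = max(scores[i] for i in keep)
    weights = {i: math.exp(scores[i] - top) for i in keep}
    total = sum(weights.values())
    dim = len(values[0])
    out = [sum(weights[i] * values[i][d] for i in keep) / total for d in range(dim)]
    return out, sorted(keep)


# Toy example: four context positions, of which only two are attended to.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0], [5.0, 5.0]]
out, kept = topk_sparse_attention(query, keys, values, k=2)
print(kept, out)
```

With a fixed k, the softmax and weighted sum touch k positions per query regardless of context length, which is the property that makes very long inputs such as hour-long videos tractable in principle.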

IMPACT Improves long-video comprehension and agent collaboration, which could accelerate development of multimodal AI applications.

DeepSeek Sparse Attention
Kwai
GQA
LongVideoBench
Video-MME-v2
Context-RL
Cross-Modal Multi-Teacher On-Policy Distillation
Keye-VL-2.0-30B-A3B
TimeLens
Video-RL
ViT-LM