Kwai has released Keye-VL-2.0-30B-A3B, an open-source multimodal foundation model designed for long-video understanding and agentic intelligence. The model uses DeepSeek Sparse Attention to process contexts of up to 256K, capturing essential frames and temporal dependencies in hour-long videos. It also incorporates Cross-Modal Multi-Teacher On-Policy Distillation to enhance multi-task alignment and agent collaboration across various scenarios. Evaluations show state-of-the-art performance on video understanding and temporal localization benchmarks.
IMPACT Enables advanced agent collaboration and improved long-video comprehension, potentially accelerating development in multimodal AI applications.
RANK_REASON The cluster contains a technical report detailing a new open-source multimodal foundation model released on arXiv.
- Context-RL
- Cross-Modal Multi-Teacher On-Policy Distillation
- DeepSeek Sparse Attention
- GQA
- Keye-VL-2.0-30B-A3B
- Kwai
- LongVideoBench
- TimeLens
- Video-MME-v2
- Video-RL
- ViT-LM
AI-generated summary · Google Gemini · from 3 sources.