CASTLE2026 Team WDL wins video QA challenge with Qwen-based system

By PulseAugur Editorial · [1 source] · 2026-06-02 04:00

The CASTLE Challenge @ EgoVis 2026 evaluated long-form egocentric video question answering over more than 600 hours of multi-perspective recordings. The winning system, developed by CASTLE2026 Team WDL, uses a Qwen-based multimodal reasoning pipeline. The pipeline parses question hints, retrieves relevant audio transcriptions, and integrates auxiliary images and video frames to answer four-choice questions whose evidence is spread across several sources. Techniques such as LoRA and frame sampling improved performance significantly, and the system placed first in the challenge.
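Two of the steps named above, transcript retrieval and frame sampling, can be illustrated with a minimal Python sketch. This is not the team's implementation: the `TranscriptSegment` type, both function names, the word-overlap scoring, and the evenly spaced sampling scheme are assumptions chosen only to show the general idea.

```python
from dataclasses import dataclass


@dataclass
class TranscriptSegment:
    """Hypothetical container for one timed piece of audio transcription."""
    start_s: float  # segment start time, in seconds
    end_s: float    # segment end time, in seconds
    text: str


def uniform_frame_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick frame indices spread evenly over a clip.

    The clip is split into equal-width bins and the frame nearest the
    middle of each bin is taken (assumed scheme, not from the report).
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, num_frames)
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def retrieve_segments(
    question: str,
    segments: list[TranscriptSegment],
    top_k: int = 3,
) -> list[TranscriptSegment]:
    """Rank transcript segments by word overlap with the question.

    Word overlap is a toy stand-in for whatever retriever a real
    system would use; segments with no shared words are dropped.
    """
    query_words = set(question.lower().split())
    scored = [
        (len(query_words & set(seg.text.lower().split())), seg)
        for seg in segments
    ]
    scored = [pair for pair in scored if pair[0] > 0]
    # Sort on the score only; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for _, seg in scored[:top_k]]
```

In a pipeline of this shape, the retrieved transcript text and the sampled frames would be passed to the multimodal model together with the question and its four answer options.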

IMPACT Demonstrates advanced multimodal reasoning for egocentric video understanding, potentially improving future AI systems for video analysis and QA.

RANK_REASON The cluster describes a technical report on a system that won a specific challenge, which falls under research achievements.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CV TIER_1 English(EN) · Zhengyang Li, Zhenglin Du, Yi Wen, Fang Liu, Shuo Li, Xu Liu · 2026-06-02 04:00

CASTLE2026 Team WDL Technical Report

arXiv:2606.00712v1 Announce Type: new Abstract: The CASTLE Challenge @ EgoVis 2026 evaluates long-form egocentric video question answering over 600+ hours of multi-perspective recordings. Each four-choice question requires evidence from videos, transcripts, auxiliary photos, peop…

