The CASTLE Challenge @ EgoVis 2026 evaluated long-form egocentric video question answering on more than 600 hours of recordings. The winning system, developed by the CASTLE2026 Team WDL, uses a multimodal reasoning pipeline built on the Qwen model. The pipeline parses question hints, retrieves relevant audio transcriptions, and integrates auxiliary images and video frames to answer questions that require evidence from multiple sources. Techniques such as LoRA and frame sampling significantly improved performance, and the system ranked first in the challenge.
AI IMPACT Demonstrates advanced multimodal reasoning for egocentric video understanding, potentially improving future AI systems for video analysis and QA.
RANK_REASON The cluster describes a technical report on the system that won a specific challenge, which falls under research achievements.
AI-generated summary · Google Gemini · from 1 source.